OK, you're not actually going to learn how to do this simply by reading the article, but you will learn how it's done, and more importantly, that it can be done. The task breaks down into three parts: classifying images (do you see a cat, a rabbit?), describing images (providing a natural language summary of the content), and annotating images (generating text descriptions for specific parts of the image). So basically we're associating object recognition with language strings (in English, in French, whatever). Going further, the neural networks can act as feature extractors, which map images to "an internal representation of the image, not something directly intelligible." Language generation algorithms, coder-decoder algorithms, and an attention mechanism mechanism round out the picture. It's pretty interesting.