Google has just released the latest version of its image captioning system as an open source model in TensorFlow. The new iteration is built on an image classification model that is 93.9 percent accurate on the ImageNet classification task.
According to Google, the new release brings forth significant improvements. The system is much quicker to train and can produce more accurate and detailed image descriptions when compared to the original.
"Today's code release initializes the image encoder using the Inception V3 model, which achieves 93.9 percent accuracy on the ImageNet classification task," says Chris Shallue, a Google Brain team software engineer. "Initializing the image encoder with a better vision model gives the image captioning system a better ability to recognize different objects in the images, allowing it to generate more detailed and accurate descriptions."
Google first taught its machine learning system to provide images with accurate captions back in 2014. The system became an entry in Microsoft COCO 2015, an image captioning competition where it bested other algorithms in terms of producing accurate captions for images.
The original 2014 system used the Inception V1 image classification model, which was 89.6 percent accurate on the ImageNet classification task. That model was replaced with Inception V2 in 2015. The improved vision component raised classification accuracy by more than 2 points, to 91.8 percent.
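The step-by-step gains between the image classification models can be checked with simple arithmetic. The following is a minimal sketch in Python that uses only the ImageNet figures quoted in this article; the names `accuracy` and `point_gains` are illustrative and not part of Google's release.

```python
# ImageNet classification accuracy (percent) quoted in this article
# for each generation of the image classification model.
accuracy = {
    "Inception V1": 89.6,
    "Inception V2": 91.8,
    "Inception V3": 93.9,
}


def point_gains(scores):
    """Return the percentage-point gain between consecutive entries."""
    names = list(scores)
    return {
        f"{prev} -> {curr}": round(scores[curr] - scores[prev], 1)
        for prev, curr in zip(names, names[1:])
    }


gains = point_gains(accuracy)
for step, gain in gains.items():
    print(f"{step}: +{gain} points")
```

Under these figures, each model swap adds a little over 2 percentage points of classification accuracy. These are gains for the vision component alone, not a direct measure of caption quality.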
Google also added a fine-tuning phase to the image captioning system. It allows for better descriptions of identified objects within the image. For instance, the image classification model will identify a train and train tracks. The fine-tuning phase will not only provide the train's color but will also explain how the train relates to the train tracks.
Hence, instead of just describing the image as "a train on the tracks," the image captioning system, with its fine-tuning phase, would come up with "a blue and yellow train travelling down the train tracks" as the image's caption.
The whole system is trained on hundreds of thousands of captioned images. The captions in the training images are written by humans. The system learns from these human captions and is likely to reuse them when presented with similar images.
Google says that the image captioning system doesn't just reuse captions; it actually understands the objects within an image and how each of those objects relates to the others. More precisely, the system can produce new and accurate captions when presented with new scenes. It also needed no language training beyond the captions themselves to produce natural-sounding phrases.