Deep CNN-LSTM for Generating Image Descriptions :smiling_imp:
Keywords: image captioning, image description generator, explain image, merge model, deep learning, long short-term memory, recurrent neural network, convolutional neural network, word by word, word embedding, BLEU score.
Related works: Deep model for computer vision and natural language, Image-sentence retrieval, Generating novel sentence descriptions for images.
Image captioning is an interesting problem in machine learning. With the development of deep neural networks, deep learning approaches have become the state of the art for this problem. The goal of image captioning is to automatically generate a description of an image, which requires understanding the image's content. Several end-to-end models have been introduced in the past, such as GoogleNIC (Show and Tell), MontrealNIC (Show, Attend and Tell), LRCN, and mRNN. These are called inject models: the idea is to feed the image features through the RNN. In 2017, Marc Tanti et al. introduced the merge model in their paper What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator? The main idea of this model is to keep the CNN and the RNN separate, merging their outputs only at the end, where a softmax layer makes the prediction. Based on this idea, we develop our model to generate image captions.
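To make the merge idea concrete, here is a minimal sketch in plain Python. It is not the code of this repository: the vector sizes, the random weights, and the names `merge_step`, `dense`, and `softmax` are assumptions chosen for illustration, and concatenation is only one possible way to merge the two encodings. The image vector stands in for the CNN output and the text vector stands in for the RNN state of the caption written so far.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(vector, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def merge_step(image_vec, text_vec, weights, bias):
    """Merge the two encodings late, then predict the next word."""
    merged = image_vec + text_vec  # list concatenation, not element-wise addition
    return softmax(dense(merged, weights, bias))


# Toy sizes and random values, chosen only for illustration (assumptions).
rng = random.Random(0)
image_vec = [rng.uniform(-1, 1) for _ in range(4)]  # stand-in for CNN features
text_vec = [rng.uniform(-1, 1) for _ in range(3)]   # stand-in for the RNN state
vocab_size = 5
weights = [[rng.uniform(-1, 1) for _ in range(7)] for _ in range(vocab_size)]
bias = [0.0] * vocab_size

probs = merge_step(image_vec, text_vec, weights, bias)
next_word_id = probs.index(max(probs))  # greedy choice of the next word
```

In an inject model the image vector would instead be passed into the RNN itself. Here the RNN only ever sees words, and the image enters at the merge.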
Flickr 8k, train/val/test 6:1:1.
The definitive description of the dataset is in the paper “Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics” from 2013.
The authors describe the dataset as follows:
"We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events … The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations."
— Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, 2013.
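A 6:1:1 split of 8,000 images gives 6,000 training, 1,000 validation, and 1,000 test images. The sketch below shows one way to produce such a split. The function name `split_ids`, the seed, and the file names are hypothetical, and shuffling at random is an assumption: this repository may rely on a predefined split instead.

```python
import random


def split_ids(image_ids, ratios=(6, 1, 1), seed=42):
    """Shuffle image ids reproducibly and cut them into train/val/test."""
    ids = sorted(image_ids)            # fixed starting order
    random.Random(seed).shuffle(ids)   # seeded shuffle, same result every run
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]       # the remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for the 8,000 images.
image_ids = [f"img_{i:04d}.jpg" for i in range(8000)]
train, val, test = split_ids(image_ids)
```

All five captions of an image stay in the same subset because the split is done on image ids, not on individual captions.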
Encoder ConvNet:
Optimizer
We use the BLEU score as the evaluation metric:
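As a rough illustration of what BLEU measures, the sketch below computes a simplified sentence-level BLEU in plain Python: clipped n-gram precision for n = 1 to 4, a geometric mean, and a brevity penalty. It applies no smoothing, it is not necessarily the implementation used to produce the scores in this repository, and the example sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Share of candidate n-grams found in the references, with clipping."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Simplified BLEU with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Reference length closest to the candidate length (shorter one on ties).
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    if len(candidate) > ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - ref_len / len(candidate))
    return brevity_penalty * math.exp(log_avg)


# Made-up example sentences, for illustration only.
reference = "a dog runs on the grass".split()
candidate = "a dog runs on the green grass".split()
score = bleu(candidate, [reference])
```

With Flickr 8k, each generated caption would be compared against the five reference captions of its image by passing all five as `references`.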
Caption of new images:
Report coming soon!
Happy training :tada: and please vote :star: if it helps!