A neural network to generate captions for an image using CNN and RNN with BEAM Search.
Examples
Image Credits: Towards Data Science
Recommended System Requirements to train the model.
Required Python libraries, along with the version numbers used while building and testing this project.
Flickr8k Dataset: Dataset Request Form
UPDATE (April/2019): The official site seems to have been taken down (although the form still works). Here are some direct download links:
Important: After downloading the dataset, put the required files in the train_val_data folder.
- `batch_size=64` took ~14 GB of GPU memory in the case of InceptionV3 + AlternativeRNN and VGG16 + AlternativeRNN
- `batch_size=64` took ~8 GB of GPU memory in the case of InceptionV3 + RNN and VGG16 + RNN
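GPU memory scales with `batch_size` because each batch of image features and padded caption sequences is materialized at once. The project's data generator also supports batching with shuffling; a minimal stdlib-only sketch of that pattern (hypothetical `features`/`captions` lists, not the project's actual generator code):

```python
import random

def batch_generator(features, captions, batch_size=64, seed=1):
    """Yield shuffled (features, captions) batches, looping epoch after epoch."""
    indices = list(range(len(features)))
    rng = random.Random(seed)
    while True:
        rng.shuffle(indices)                         # reshuffle every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            yield ([features[i] for i in batch],
                   [captions[i] for i in batch])

# Toy usage: 10 "images" with 8-d CNN features and length-5 token-id captions
feats = [[0.0] * 8 for _ in range(10)]
caps = [[0] * 5 for _ in range(10)]
x, y = next(batch_generator(feats, caps, batch_size=4))
print(len(x), len(y))  # → 4 4
```

A smaller `batch_size` lowers peak memory at the cost of more (and noisier) gradient updates per epoch.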
`loss` and `val_loss` are the same as in the argmax case since the model is the same.

| Model & Config | Argmax | BEAM Search |
| --- | --- | --- |
| InceptionV3 + AlternativeRNN | Cross-entropy loss (lower the better); BLEU scores on validation data (higher the better) | BLEU scores on validation data (higher the better) |
| InceptionV3 + RNN | Cross-entropy loss (lower the better); BLEU scores on validation data (higher the better) | BLEU scores on validation data (higher the better) |
| VGG16 + AlternativeRNN | Cross-entropy loss (lower the better); BLEU scores on validation data (higher the better) | BLEU scores on validation data (higher the better) |
| VGG16 + RNN | Cross-entropy loss (lower the better); BLEU scores on validation data (higher the better) | BLEU scores on validation data (higher the better) |
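The BLEU scores above measure how closely a generated caption matches its reference captions. BLEU-1, for example, is built on clipped unigram precision: each candidate word is credited at most as many times as it appears in any single reference. A toy illustration of that clipping step, using only the standard library (the project presumably computes full BLEU with a library such as NLTK; this is just a sketch of the idea):

```python
from collections import Counter

def bleu1_precision(candidate, references):
    """Clipped unigram precision: each candidate word counts at most
    as many times as it appears in the best-matching reference."""
    cand_counts = Counter(candidate)
    # For every word, the maximum count it reaches in any one reference
    max_ref = Counter()
    for ref in references:
        ref_counts = Counter(ref)
        for word in ref_counts:
            max_ref[word] = max(max_ref[word], ref_counts[word])
    clipped = sum(min(count, max_ref[word]) for word, count in cand_counts.items())
    return clipped / max(len(candidate), 1)

candidate = "a dog runs fast on grass".split()
references = ["a dog is running on the grass".split(),
              "the dog runs across the grass".split()]
print(round(bleu1_precision(candidate, references), 3))  # → 0.833 (5 of 6 words matched)
```

Full BLEU additionally combines n-gram precisions up to BLEU-4 and applies a brevity penalty for captions shorter than the references.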
Model used - InceptionV3 + AlternativeRNN
Example images with their generated captions.
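The example captions are decoded with BEAM search: at every step the `k` highest-scoring partial captions are kept, instead of only the single argmax word. A minimal sketch of that loop, assuming a hypothetical `next_word_probs(sequence)` callable that returns the model's word distribution for the next position (the project's real decoder runs the trained RNN here):

```python
import math

def beam_search(next_word_probs, k=3, max_length=5,
                start="startseq", end="endseq"):
    """Keep the k highest-scoring partial captions at every step.
    A caption's score is the sum of log-probabilities of its words."""
    beams = [([start], 0.0)]                     # (sequence, log-prob)
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:                   # finished caption: carry over
                candidates.append((seq, score))
                continue
            for word, p in next_word_probs(seq).items():
                candidates.append((seq + [word], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]

# Hypothetical toy "model": a fixed bigram word distribution
toy = {
    "startseq": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.7, "cat": 0.3},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"endseq": 1.0},
    "cat": {"endseq": 1.0},
}
print(beam_search(lambda seq: toy[seq[-1]], k=2))  # → ['startseq', 'a', 'dog', 'endseq']
```

With `k=1` this reduces to plain argmax (greedy) decoding, which is why the two strategies are compared side by side in the results table.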
Procedure to Train Model

1. Clone the repository: `git clone https://github.com/dabasajay/Image-Caption-Generator.git`
2. Put the required dataset files in the `train_val_data` folder (files mentioned in the readme there).
3. Review `config.py` for paths and other configurations (explained below).
4. Run `train_val.py`.

Procedure to Test on New Images

1. Clone the repository: `git clone https://github.com/dabasajay/Image-Caption-Generator.git`
2. Train the model to generate the required files in the `model_data` folder (steps given above).
3. Put the test images in the `test_data` folder.
4. Review `config.py` for paths and other configurations (explained below).
5. Run `test.py`.
config

- `images_path` :- Folder path containing the Flickr dataset images
- `train_data_path` :- .txt file path containing image ids for training
- `val_data_path` :- .txt file path containing image ids for validation
- `captions_path` :- .txt file path containing captions
- `tokenizer_path` :- Path for saving the tokenizer
- `model_data_path` :- Path for saving files related to the model
- `model_load_path` :- Path for loading a trained model
- `num_of_epochs` :- Number of epochs
- `max_length` :- Maximum length of captions. This is set manually after training the model and is required for `test.py`
- `batch_size` :- Batch size for training (larger values consume more GPU & CPU memory)
- `beam_search_k` :- BEAM search parameter which tells the algorithm how many words to consider at a time
- `test_data_path` :- Folder path containing images for testing/inference
- `model_type` :- CNN model type to use -> inceptionv3 or vgg16
- `random_seed` :- Random seed for reproducibility of results

rnnConfig

- `embedding_size` :- Embedding size used in the Decoder (RNN) model
- `LSTM_units` :- Number of LSTM units in the Decoder (RNN) model
- `dense_units` :- Number of Dense units in the Decoder (RNN) model
- `dropout` :- Dropout probability used in the Dropout layer of the Decoder (RNN) model

If you run out of memory, try reducing `batch_size`. You can change `k` in BEAM search using the `beam_search_k` parameter in config; note that a higher `k` will improve results, but it will also increase inference time significantly.

- Support for VGG16 model. Uses InceptionV3 model by default.
- Implement 2 architectures of RNN model.
- Support for batch processing in data generator with shuffling.
- Implement BEAM Search.
- Calculate BLEU scores using BEAM Search.
- Implement Attention and change model architecture.
- Support for pre-trained word vectors like word2vec, GloVe, etc.
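Putting the options above together, `config.py` can be sketched as two plain dictionaries. This is only a sketch with placeholder paths and values (the file names follow the standard Flickr8k distribution; the actual defaults live in the repository):

```python
# Sketch of config.py — placeholder values, adjust paths to your setup
config = {
    "images_path": "train_val_data/Flicker8k_Dataset/",
    "train_data_path": "train_val_data/Flickr_8k.trainImages.txt",
    "val_data_path": "train_val_data/Flickr_8k.devImages.txt",
    "captions_path": "train_val_data/Flickr8k.token.txt",
    "tokenizer_path": "model_data/tokenizer.pkl",
    "model_data_path": "model_data/",
    "model_load_path": "model_data/model.h5",
    "num_of_epochs": 20,
    "max_length": 40,               # set manually after training; needed by test.py
    "batch_size": 64,               # larger values consume more GPU & CPU memory
    "beam_search_k": 3,             # higher k: better captions, slower inference
    "test_data_path": "test_data/",
    "model_type": "inceptionv3",    # or "vgg16"
    "random_seed": 1,
}

rnnConfig = {
    "embedding_size": 300,          # Decoder (RNN) embedding size
    "LSTM_units": 256,              # LSTM units in the decoder
    "dense_units": 256,             # Dense units in the decoder
    "dropout": 0.3,                 # Dropout probability in the decoder
}
```

Both `train_val.py` and `test.py` read their paths and hyperparameters from these dictionaries, so a change here applies to training and inference alike.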