Image Caption Generator Report
By
Md Reza (17BCA22)
Major Project-1
(CA-1382A)
AT
LINGAYA'S VIDYAPEETH
BONAFIDE CERTIFICATE
Certified that this project report "IMAGE CAPTION GENERATOR" is the bonafide work
of KRISHAN VERMA (17BCA04), SUSHIL SHAW (17BCA15), MD REZA (17BCA22)
and SHRIYA HANGLOO (17BCA11), who carried out the project in collaboration with the
Deptt. of Computer Science & Application, Lingaya's Vidyapeeth. It embodies the work done
by them under the guidance of PRITIKA MAM, Assistant Professor (DCSA), towards partial
fulfilment of the requirements of the Degree of Bachelor of Technology in Computer Science
and Application from Lingaya's Vidyapeeth, Haryana. They have fulfilled all the requirements
needed, as per the rules of the Vidyapeeth, for the completion of the project. This work is original
and has not been submitted in part or in full to any other Vidyapeeth or institution.
PRITIKA MAM
(Assistant Professor)
Supervisor
Deptt. of Computer Science &
Application,
Lingaya’s Vidyapeeth, Faridabad
Haryana
LINGAYA’S VIDYAPEETH
CERTIFICATE OF AUTHENTICATION
We solemnly declare that this project “IMAGE CAPTION GENERATOR” is the bonafide
work done purely by us, carried out under the supervision of PRITIKA MAM, Assistant
Professor (DCSA) towards partial fulfilment of the requirements of the Degree of Bachelor of
Technology in Computer Science and Application from Lingaya’s Vidyapeeth, Faridabad,
during the year 2019-2020.
It is further certified that this work has not been submitted, either in part or in full, to any other
department of Lingaya's Vidyapeeth, to any other Vidyapeeth, institute or elsewhere, or for
publication in any form.
ACKNOWLEDGEMENT
First of all, we would like to express our deep sense of gratitude to our supervisor,
Pritika Mam, Assistant Professor (DCSA), who gave us her invaluable guidance,
her words of encouragement and inspiration, and her criticism and
discussion throughout the design of the problem.
We are very much grateful to Mr. Kiran Kumar, Assistant Professor & H.O.D.
(DCSA), for his valuable support and cooperation in conceptualizing the
project/research work, and to all those outstanding individuals with whom we
have worked, who helped us in understanding the concepts.
We are highly thankful to our family members for their constant support in
motivating us and bringing a spark in us to pursue this work.
ABSTRACT
The work presented in this report involved developing a deep learning based
image caption generation method. With an image as the input, the method can
output an English sentence describing the content of the image. We analyse the
recurrent neural network (RNN) approach to sentence generation, and then
implement the pipeline with an LSTM (Long Short-Term Memory) network. The
image features are extracted with Xception, a CNN model pre-trained on the
ImageNet dataset, and are then fed into the LSTM model, which is responsible for
generating the image captions. The model is trained and evaluated on the
Flickr_8k dataset.
Chapter-1
Introduction
Humans can look at an image and easily describe what it shows, but can a
computer do the same? Researchers have worked on this problem a lot, and it was
considered nearly impossible until now. With the growth in computing power, we
can build models that can generate captions for an image. This is what we are
going to implement in this Python based project, where we will use the deep
learning techniques of Convolutional Neural Networks together with a type of
recurrent neural network, the LSTM.
Image caption generation is a task that involves computer vision and natural
language processing concepts: recognizing the context of an image and describing
it in a natural language such as English.
For the image caption generator, we will be using the Flickr_8K dataset. There
are also other big datasets like Flickr_30K and MSCOCO, but it can take
weeks just to train a network on them, so we will be using the smaller Flickr_8K
dataset. The Flickr_8k_text folder contains the file Flickr8k.token, which is the
main file of our dataset; it contains the image names and their respective
captions, one entry per line.
1.2 Objective:
The objective of our project is to learn the concepts of a CNN and LSTM model
and to build a working image caption generator by combining a CNN with an LSTM.
In this Python project, we will be implementing the caption generator using a CNN
(Convolutional Neural Network) and an LSTM (Long Short-Term Memory). The
image features will be extracted from Xception, which is a CNN model trained
on the ImageNet dataset, and then we feed the features into the LSTM model, which
will be responsible for generating the image captions.
This is an image caption generation software which uses freely available
datasets, so there is no hosting cost. If it is developed as an online application,
it will only consume internet data to load the ads there. It is completely free to
use on a local system; the only case in which it may consume internet data is
when we want to feed more datasets into it.
2. Technical feasibility
The tools and technologies used in this project are completely free for
students. Additionally, we need to install the following libraries to build
this project:
Tensorflow
Keras
Pillow
Numpy
Tqdm
Jupyterlab
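All of them can be installed in one step with pip; a minimal sketch of the command, assuming the standard PyPI package names:

pip install tensorflow keras pillow numpy tqdm jupyterlab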
3. Economic feasibility
The tools and libraries used to make this project are completely free, released
under either CC0 or CC 3.0 licenses, so we can use them as long as we properly
credit them, which we have already done by uploading the project to GitHub. As
mentioned in the previous section, this software uses freely available software
and tools which are intended for the use of software developers everywhere.
There cannot be any conflict regarding plagiarism of any other software, because
we have followed the rules of the licenses while making this project.
Chapter-2
Project
Requirements
2.1 Requirement analysis:
Language:
Python.
Hardware:
Software:
Windows 10.
PyCharm.
Python Interpreter.
Jupyter Notebook.
MS Office.
Chapter-3
Concepts
&
Modules
3.1 Concepts used:
LSTM stands for Long Short-Term Memory. LSTMs are a type of RNN
(recurrent neural network) well suited for sequence prediction
problems: based on the previous text, we can predict what the next word will
be. The LSTM has proven more effective than the traditional RNN by overcoming
the RNN's limitation of short-term memory. An LSTM can carry relevant
information throughout the processing of the input sequence, and with a forget
gate it discards non-relevant information.
This is what an LSTM cell looks like –
A CNN is used for extracting features from the image. We will use the pre-trained
Xception model.
The LSTM will use the information from the CNN to help generate a description of
the image.
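To make the sequence-prediction idea concrete, here is a minimal illustrative Keras sketch (not part of the project code) of an LSTM learning to predict the next value of a toy numeric sequence:

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# training pairs: each window of three values maps to the next value
X = np.array([[[0.1], [0.2], [0.3]],
              [[0.2], [0.3], [0.4]],
              [[0.3], [0.4], [0.5]]])
y = np.array([0.4, 0.5, 0.6])

model = Sequential()
model.add(LSTM(32, input_shape=(3, 1)))  # 3 timesteps, 1 feature
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
model.fit(X, y, epochs=300, verbose=0)

# the trained cell should predict roughly 0.7 for the unseen window
print(model.predict(np.array([[[0.4], [0.5], [0.6]]])))

After enough epochs the printed prediction approaches 0.7, the next value of the sequence; the caption generator applies the same idea to sequences of words instead of numbers.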
3.2 Python Programming
Advantages:
1. Numbers:
Number data types store numeric values. Number objects are created when you
assign a value to them.
Python supports four different numerical types: int (signed integers), long
(long integers), float (floating point real values) and complex (complex numbers).
3. Lists:
Lists are the most versatile of Python's compound data types. A list contains
items separated by commas and enclosed within square brackets ([]). To some
extent, lists are similar to arrays in C. One difference between them is that the
items belonging to a list can be of different data types.
The values stored in a list can be accessed using the slice operator ([ ] and [:]),
with indexes starting at 0 at the beginning of the list and working their way to
end -1. The plus (+) sign is the list concatenation operator, and the asterisk (*) is
the repetition operator.
4. Tuples:
A tuple is another sequence data type that is similar to the list. A tuple consists
of a number of values separated by commas. Unlike lists, however, tuples are
enclosed within parentheses.
The main differences between lists and tuples are: lists are enclosed in brackets
([ ]) and their elements and size can be changed, while tuples are enclosed in
parentheses (( )) and cannot be updated. Tuples can be thought of as read-only
lists.
5. Dictionary:
Python's dictionaries are a kind of hash-table type. They work like associative
arrays or hashes found in Perl and consist of key-value pairs. A dictionary key
can be almost any Python type, but keys are usually numbers or strings. Values,
on the other hand, can be any arbitrary Python object.
Dictionaries are enclosed by curly braces ({ }) and values can be assigned and
accessed using square braces ([]).
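A short illustrative snippet tying these compound types together:

# list: mutable, items may have different types
nums = [1, 2.5, 'three']
print(nums[0], nums[1:])        # slicing -> 1 [2.5, 'three']
print(nums + nums)              # concatenation
print(nums * 2)                 # repetition

# tuple: like a read-only list
point = (3, 4)
# point[0] = 5                  # would raise TypeError

# dictionary: key-value pairs, accessed with square braces
ages = {'alice': 30, 'bob': 25}
ages['carol'] = 41              # assign a new key-value pair
print(ages['alice'])            # -> 30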
3.2.2 Operators in Python
Arithmetic Operators
Comparison (Relational) Operators
Assignment Operators
Logical Operators
Bitwise Operators
Membership Operators
Identity Operators
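A few illustrative one-liners covering the operator categories above:

a, b = 6, 4
print(a + b, a ** b)        # arithmetic -> 10 1296
print(a > b, a == b)        # comparison -> True False
a += 1                      # assignment (a is now 7)
print(a > 0 and b > 0)      # logical -> True
print(a & b, a | b)         # bitwise -> 4 7
print(4 in [1, 4, 9])       # membership -> True
print(a is b)               # identity -> False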
Chapter-4
Coding
&
Development
Building the Python based Project
Initialize the Jupyter notebook server by typing jupyter lab in the console
of your project folder. It will open up the interactive Python notebook
where you can run your code. Create a Python 3 notebook and name
it training_caption_generator.ipynb.
Coding Phase:
1. First, we import all the necessary packages
import string
import os
from pickle import dump, load

import numpy as np
from PIL import Image

from keras.applications.xception import Xception, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers.merge import add
from keras.models import Model, load_model
from keras.layers import Input, Dense, LSTM, Embedding, Dropout

# small library for seeing the progress of loops
from tqdm import tqdm_notebook as tqdm
tqdm().pandas()
2. Getting and performing data cleaning
The main text file which contains all the image captions is Flickr8k.token in
our Flickr_8k_text folder.
Have a look at the file –
The format of our file is an image name and caption separated by a tab ("\t"),
with one caption per line. Each image has 5 captions, and we can see that a
number from 0 to 4 is assigned to each caption.
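For illustration, entries in Flickr8k.token look like this (the caption index follows the "#"; these lines are representative, not quoted from the file):

1000268201_693b08cb0e.jpg#0	A child in a pink dress is climbing up a set of stairs .
1000268201_693b08cb0e.jpg#1	A girl going into a wooden building .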
load_doc( filename ) – For loading the document file and reading the
contents inside the file into a string.
Code:

# Loading a text file into memory
def load_doc(filename):
    # Opening the file as read only
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

# get all images with their captions
def all_img_captions(filename):
    file = load_doc(filename)
    captions = file.split('\n')
    descriptions = {}
    for caption in captions[:-1]:
        img, caption = caption.split('\t')
        if img[:-2] not in descriptions:
            descriptions[img[:-2]] = [caption]
        else:
            descriptions[img[:-2]].append(caption)
    return descriptions

# Data cleaning: lower casing, removing punctuation and words containing numbers
def cleaning_text(captions):
    table = str.maketrans('', '', string.punctuation)
    for img, caps in captions.items():
        for i, img_caption in enumerate(caps):
            img_caption = img_caption.replace("-", " ")
            desc = img_caption.split()
            # converts to lowercase
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [word.translate(table) for word in desc]
            # remove hanging 's and a
            desc = [word for word in desc if(len(word) > 1)]
            # remove tokens with numbers in them
            desc = [word for word in desc if(word.isalpha())]
            # convert back to string
            img_caption = ' '.join(desc)
            captions[img][i] = img_caption
    return captions

def text_vocabulary(descriptions):
    # build vocabulary of all unique words
    vocab = set()
    for key in descriptions.keys():
        [vocab.update(d.split()) for d in descriptions[key]]
    return vocab

# All descriptions in one file
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + '\t' + desc)
    data = "\n".join(lines)
    file = open(filename, "w")
    file.write(data)
    file.close()

# Set these paths according to the project folder on your system
dataset_text = "D:\\dataflair projects\\Project - Image Caption Generator\\Flickr_8k_text"
dataset_images = "D:\\dataflair projects\\Project - Image Caption Generator\\Flicker8k_Dataset"

# we prepare our text data
filename = dataset_text + "/" + "Flickr8k.token.txt"
# loading the file that contains all data
# mapping them into a descriptions dictionary: img to 5 captions
descriptions = all_img_captions(filename)
print("Length of descriptions =", len(descriptions))

# cleaning the descriptions
clean_descriptions = cleaning_text(descriptions)

# building vocabulary
vocabulary = text_vocabulary(clean_descriptions)
print("Length of vocabulary =", len(vocabulary))

# saving each description to file
save_descriptions(clean_descriptions, "descriptions.txt")
3. Extracting the feature vector from all images
This technique is called transfer learning: we don't have to do everything
on our own; we use a pre-trained model that has already been trained on a
large dataset, extract the features from it, and use them for our task. We are
using the Xception model, which has been trained on the ImageNet dataset of
1000 different classes. We can directly import this model from
keras.applications. Make sure you are connected to the internet, as the
weights get downloaded automatically. Since the Xception model was
originally built for ImageNet, we make small changes to integrate it with
our model. One thing to notice is that the Xception model takes a
299*299*3 image size as input. We remove the last classification
layer and get the 2048-dimensional feature vector.
The function extract_features() will extract features for all images, and
we will map image names to their respective feature arrays. Then we
will dump the features dictionary into a "features.p" pickle file.
Code:
def extract_features(directory):
    model = Xception(include_top=False, pooling='avg')
    features = {}
    for img in tqdm(os.listdir(directory)):
        filename = directory + "/" + img
        image = Image.open(filename)
        image = image.resize((299, 299))
        image = np.expand_dims(image, axis=0)
        # rescale pixel values to [-1, 1], as preprocess_input would
        image = image / 127.5
        image = image - 1.0
        feature = model.predict(image)
        features[img] = feature
    return features

# 2048 feature vector
features = extract_features(dataset_images)
dump(features, open("features.p", "wb"))
4. Loading the dataset for training the model
In our Flickr_8k_text folder, we have the Flickr_8k.trainImages.txt file, which
contains a list of 6000 image names that we will use for training.
For loading the training dataset, we need a few more functions:
load_photos( filename ) – This will load the text file into a string and
will return the list of image names.
load_clean_descriptions( filename, photos ) – This function will
create a dictionary that contains the captions for each photo from the list
of photos. We also append the <start> and <end> identifiers to each
caption. We need this so that our LSTM model can identify the
start and end of each caption.
load_features( photos ) – This function will give us the dictionary of
image names and their feature vectors, which we previously
extracted using the Xception model.
Code:

# load the data
def load_photos(filename):
    file = load_doc(filename)
    photos = file.split("\n")[:-1]
    return photos

def load_clean_descriptions(filename, photos):
    # loading clean_descriptions
    file = load_doc(filename)
    descriptions = {}
    for line in file.split("\n"):
        words = line.split()
        if len(words) < 1:
            continue
        image, image_caption = words[0], words[1:]
        if image in photos:
            if image not in descriptions:
                descriptions[image] = []
            desc = '<start> ' + " ".join(image_caption) + ' <end>'
            descriptions[image].append(desc)
    return descriptions

def load_features(photos):
    # loading all features
    all_features = load(open("features.p", "rb"))
    # selecting only the needed features
    features = {k: all_features[k] for k in photos}
    return features

filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"

# train = loading_data(filename)
train_imgs = load_photos(filename)
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
train_features = load_features(train_imgs)
5. Tokenizing the vocabulary
Computers don't understand English words, so we represent them with numbers
by mapping each word of the vocabulary to a unique index value. Keras provides
the Tokenizer class, which does exactly that.
Code:
# converting dictionary to a clean list of descriptions
def dict_to_list(descriptions):
    all_desc = []
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

# creating tokenizer class
# this will vectorise the text corpus
# each integer will represent a token in the dictionary
from keras.preprocessing.text import Tokenizer

def create_tokenizer(descriptions):
    desc_list = dict_to_list(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(desc_list)
    return tokenizer

# give each word an index, and store that in a tokenizer.p pickle file
tokenizer = create_tokenizer(train_descriptions)
dump(tokenizer, open('tokenizer.p', 'wb'))
vocab_size = len(tokenizer.word_index) + 1
vocab_size
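One more quantity is needed before defining the model: the length of the longest caption, used below as max_length (the testing script later fixes it at 32). That computation is missing from this copy of the report; a minimal sketch, reusing the dict_to_list helper above:

# calculate the maximum caption length, in words
def max_length(descriptions):
    desc_list = dict_to_list(descriptions)
    return max(len(d.split()) for d in desc_list)

max_length = max_length(descriptions)
max_length   # 32 for the Flickr_8k captions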
6. Creating the data generator
To train the model as a supervised learning task, we have to provide it with input
and output sequences. The input to our model is [x1, x2] and the output is y, where
x1 is the 2048-dimensional feature vector of the image, x2 is the input text sequence
and y is the next word of the text sequence that the model has to predict.
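For instance, for a hypothetical cleaned caption "<start> two dogs drinking water <end>", the pairs generated from it would be:

x2 (input text sequence)                 y (word to predict)
<start>                                  two
<start> two                              dogs
<start> two dogs                         drinking
<start> two dogs drinking                water
<start> two dogs drinking water          <end>

with the same image feature vector x1 attached to every pair.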
7. Defining the CNN-RNN model
To define the structure of the model, we use the Keras Model from the functional
API. It consists of three major parts:
Feature Extractor – The feature extracted from the image has a size
of 2048; with a dense layer, we reduce the dimensions to 256 nodes.
Sequence Processor – An embedding layer handles the textual
input, followed by the LSTM layer.
Decoder – We merge the outputs of the above two layers and
process them with a dense layer to make the final prediction. The final
layer has a number of nodes equal to our vocabulary size.
Visual representation of the final model is given
below –
Code to Generate Model:
from keras.utils import plot_model

# define the captioning model
def define_model(vocab_size, max_length):

    # features from the CNN model, squeezed from 2048 to 256 nodes
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)

    # LSTM sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)

    # merging both models
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)

    # tie it together: [image, seq] -> [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    # summarize the model
    print(model.summary())
    plot_model(model, to_file='model.png', show_shapes=True)

    return model
8. Training the model
To train the model, we will use the 6000 training images, generating the input and
output sequences in batches and fitting them to the model using the
model.fit_generator() method. We also save the model to our models folder after
every epoch. This will take some time, depending on your system's capability.
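The generator and training loop themselves are not reproduced in this copy of the report; the following is a minimal sketch consistent with the description above (the helper names create_sequences and data_generator are illustrative, and the code assumes the variables defined earlier in the notebook):

# sketch: create input-output sequence pairs from one image's captions
def create_sequences(tokenizer, max_length, desc_list, feature):
    X1, X2, y = list(), list(), list()
    for desc in desc_list:
        # encode the caption as a sequence of word indexes
        seq = tokenizer.texts_to_sequences([desc])[0]
        # split one sequence into multiple (input, output) pairs
        for i in range(1, len(seq)):
            in_seq, out_seq = seq[:i], seq[i]
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            X1.append(feature)
            X2.append(in_seq)
            y.append(out_seq)
    return np.array(X1), np.array(X2), np.array(y)

# sketch: yield batches of ([image feature, partial caption], next word)
def data_generator(descriptions, features, tokenizer, max_length):
    while 1:
        for key, description_list in descriptions.items():
            feature = features[key][0]
            input_image, input_sequence, output_word = create_sequences(
                tokenizer, max_length, description_list, feature)
            yield [[input_image, input_sequence], output_word]

# train the model, saving it after every epoch
model = define_model(vocab_size, max_length)
epochs = 10
steps = len(train_descriptions)
if not os.path.isdir("models"):
    os.mkdir("models")
for i in range(epochs):
    generator = data_generator(train_descriptions, train_features,
                               tokenizer, max_length)
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save("models/model_" + str(i) + ".h5")

With 10 epochs, this loop produces models/model_0.h5 through models/model_9.h5, the last of which is loaded by the testing script below.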
9. Testing the model
The model has now been trained, so we make a separate file that loads the trained
model and generates a caption for any image supplied on the command line via
the -i/--image argument.
Code:
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import argparse
from pickle import load
from keras.models import load_model
from keras.applications.xception import Xception
from keras.preprocessing.sequence import pad_sequences

ap = argparse.ArgumentParser()
ap.add_argument('-i', '--image', required=True, help="Image Path")
args = vars(ap.parse_args())
img_path = args['image']

def extract_features(filename, model):
    try:
        image = Image.open(filename)
    except:
        print("ERROR: Couldn't open image! Make sure the image path and extension is correct")
        return None
    image = image.resize((299, 299))
    image = np.array(image)
    # for images that have 4 channels, we convert them into 3 channels
    if image.shape[2] == 4:
        image = image[..., :3]
    image = np.expand_dims(image, axis=0)
    image = image / 127.5
    image = image - 1.0
    feature = model.predict(image)
    return feature

def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def generate_desc(model, tokenizer, photo, max_length):
    in_text = 'start'
    for i in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        pred = model.predict([photo, sequence], verbose=0)
        pred = np.argmax(pred)
        word = word_for_id(pred, tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'end':
            break
    return in_text

#path = 'Flicker8k_Dataset/111537222_07e56d5a30.jpg'
max_length = 32
tokenizer = load(open("tokenizer.p", "rb"))
model = load_model('models/model_9.h5')
xception_model = Xception(include_top=False, pooling="avg")

photo = extract_features(img_path, xception_model)
img = Image.open(img_path)

description = generate_desc(model, tokenizer, photo, max_length)
print("\n\n")
print(description)
plt.imshow(img)
plt.show()
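Assuming the script above is saved as testing_caption_generator.py (the filename is illustrative), it can be run from the console as, for example:

python testing_caption_generator.py --image Flicker8k_Dataset/111537222_07e56d5a30.jpg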
Final Results:
Chapter-5
Conclusion
&
References
5.1 Conclusion
In this project, we implemented a CNN-RNN model to build an image caption
generator. The Xception model extracts a 2048-dimensional feature vector from
the input image, and the LSTM based decoder generates a caption for it word by
word. It is important to note that the model depends entirely on its training data:
since we trained it on the comparatively small Flickr_8k dataset, it cannot predict
words outside its vocabulary, and the generated captions will not always be
accurate.
5.2 References
https://datacarpentry.org/image-processing/aio/index.html
https://pillow.readthedocs.io/en/3.1.x/reference/Image.html
https://analyticsindiamag.com/top-8-image-processing-libraries-in-python/
https://stackoverflow.com/questions/94875/image-processing-in-python
http://omz-software.com/pythonista/docs/ios/PIL.html
https://www.tutorialspoint.com/python/index.htm
https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip
https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip
https://www.tensorflow.org/guide/keras
[Papineni et al. 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
[Russakovsky et al. 2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252.
[Sermanet et al. 2013] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. 2013. OverFeat: Integrated recognition, localization and detection using convolutional networks.