
IMAGE CAPTION GENERATOR

By

Name of Student: Roll No:

Krishan Verma 17BCA04

Shriya Hangloo 17BCA11

Sushil Prasad Shaw 17BCA15

Md Reza 17BCA22

Major Project-1
(CA-1382A)

AT

Deptt. Of Computer Science & Application


LINGAYA’S VIDYAPEETH, FARIDABAD
SESSION 2019-2020
A PROJECT REPORT
ON
IMAGE CAPTION GENERATOR

By

Name of Student: Roll No: Discipline:


Krishan Verma 17BCA04 CSE
Sushil Prasad Shaw 17BCA15 CSE
Md Reza 17BCA22 CSE
Shriya Hangloo 17BCA11 CSE

PROJECT REPORT SUBMITTED IN PARTIAL


FULFILLMENT OF THE REQUIREMENTS OF THE
COURSE
Major Project-1
(CA-1382A)

Guide

Faculty/Associate Faculty Pritika Mam

DEPARTMENT OF COMPUTER SCIENCE & APPLICATION


LINGAYA’S VIDYAPEETH, FARIDABAD
SESSION 2019-2020

LINGAYA’S VIDYAPEETH

BONAFIDE CERTIFICATE

Certified that this project report “IMAGE CAPTION GENERATOR” is the bonafide work
of KRISHAN VERMA (17BCA04), SUSHIL SHAW (17BCA15), MD REZA (17BCA22)
and SHRIYA HANGLOO (17BCA11), who carried out the project in collaboration with the
Deptt. of Computer Science & Application, Lingaya’s Vidyapeeth. It embodies the work done
by them under the guidance of PRITIKA MAM, Assistant Professor (DCSA), towards partial
fulfilment of the requirements of the Degree of Bachelor of Technology in Computer Science
and Application from Lingaya’s Vidyapeeth, Haryana. They have fulfilled all the requirements
needed, as per the rules of the Vidyapeeth, for the completion of the project. This work is original
and has not been submitted in part or in full to any other Vidyapeeth or institution.

Signature of the Supervisor

PRITIKA MAM
(Assistant Professor)
Supervisor
Deptt. of Computer Science &
Application,
Lingaya’s Vidyapeeth, Faridabad
Haryana
LINGAYA’S VIDYAPEETH

CERTIFICATE OF AUTHENTICATION

We solemnly declare that this project “IMAGE CAPTION GENERATOR” is the bonafide
work done purely by us, carried out under the supervision of PRITIKA MAM, Assistant
Professor (DCSA) towards partial fulfilment of the requirements of the Degree of Bachelor of
Technology in Computer Science and Application from Lingaya’s Vidyapeeth, Faridabad,
during the year 2019-2020.

It is further certified that this work has not been submitted, either in part or in full, to any other
department of Lingaya’s Vidyapeeth, or to any other Vidyapeeth, institute or elsewhere, nor has it
been submitted for publication in any form.

Date: 25 May 2020 Krishan Verma (17BCA04)

Date: 25 May 2020 Sushil Shaw (17BCA15)

Date: 25 May 2020 Md Reza (17BCA22)

Date: 25 May 2020 Shriya Hangloo(17BCA11)


ACKNOWLEDGEMENT

In completing this project, we are very thankful to many individuals, and we must
place on record our sincere thanks to all of them.

First of all, we would like to express our deep sense of gratitude to our supervisor
Pritika Mam, Assistant Professor (DCSA), who gave us her invaluable guidance,
words of encouragement and inspiration, and constructive criticism and
discussion throughout the design of the project.

We are very grateful to Mr. Kiran Kumar (Assistant Professor & H.O.D.,
DCSA) for his valuable support and cooperation in conceptualizing the
project/research work, and to all those outstanding individuals with whom we
have worked, who helped us in understanding the concepts.

We are highly thankful to our family members for their all-time support in
initiating us and bringing a spark in us to pursue the work.
ABSTRACT

The work presented in this report involved developing an Image Caption Generator
application. In this project, we systematically analyse a deep neural network-based
image caption generation method. With an image as the input, the method can output
an English sentence describing the content of the image. We analyse three components
of the method: the convolutional neural network (CNN), the recurrent neural network
(RNN) and sentence generation. We then implement the caption generator using a CNN
(Convolutional Neural Network) with the help of the TensorFlow module and an LSTM
(Long Short Term Memory). The image features are extracted from Xception, a CNN
model pre-trained on the ImageNet dataset, and then fed into the LSTM model, which
is responsible for generating the image captions. The model itself is trained on the
Flickr_8k dataset.


INDEX

S No.  Chapter

1  CHAPTER 1: Introduction
   1.1 Introduction
   1.2 Objective
   1.3 Feasibility
       1.3.1 Financial Feasibility
       1.3.2 Technical Feasibility
       1.3.3 Economic Feasibility
       1.3.4 Legal Feasibility

2  CHAPTER 2: Requirement Analysis
   2.1 Language
   2.2 Hardware
   2.3 Software

3  CHAPTER 3: Concepts & Modules
   3.1 Concepts Used
       3.1.1 CNN (Convolutional Neural Network)
       3.1.2 LSTM (Long Short Term Memory)
       3.1.3 Image Caption Generator Model
   3.2 Python Programming
       3.2.1 Python Data Types
       3.2.2 Operators in Python
       3.2.3 Decision Making in Python
       3.2.4 Loops in Python
   3.3 Project File Structure

4  CHAPTER 4: Coding & Development
   4.1 Importing the Necessary Packages
   4.2 Getting and Performing Data Cleaning
   4.3 Extracting the Feature Vector from All Images
   4.4 Loading the Dataset for Training the Model
   4.5 Tokenizing the Vocabulary
   4.6 Creating the Data Generator
   4.7 Defining the CNN-RNN Model
   4.8 Training the Model
   4.9 Testing the Model
   4.10 Final Results

5  CHAPTER 5: Conclusion & References
   5.1 Conclusion
   5.2 References
CHAPTER-1
Introduction

Example: an image captioned “A group of people standing on the top of a snow-covered slope.”


1.1 Introduction:
You see an image and your brain can easily tell what the image is about, but can
a computer tell what the image is representing? Computer vision researchers worked
on this for a long time, and it was considered close to impossible until now. With the
advancement of deep learning techniques, the availability of huge datasets and
computational power, we can build models that can generate captions for an image.
This is what we are going to implement in this Python-based project, where we will
use the deep learning techniques of Convolutional Neural Networks and a type of
Recurrent Neural Network (the LSTM) together.

Image caption generation is a task that involves computer vision and natural
language processing concepts to recognize the context of an image and describe
it in a natural language such as English.

For the image caption generator, we will be using the Flickr_8K dataset. There
are also other, bigger datasets such as Flickr_30K and MSCOCO, but it can take
weeks just to train the network on them, so we will be using the smaller Flickr_8K
dataset. The Flickr_8k_text folder contains the file Flickr8k.token, which is the
main file of our dataset; it contains the image names and their respective captions,
separated by newlines.
1.2 Objective:

The objective of our project is to learn the concepts of the CNN and LSTM models
and to build a working Image Caption Generator by combining a CNN with an LSTM.

In this Python project, we will implement the caption generator using a CNN
(Convolutional Neural Network) and an LSTM (Long Short Term Memory). The image
features will be extracted from Xception, a CNN model pre-trained on the ImageNet
dataset, and then we feed the features into the LSTM model, which will be responsible
for generating the image captions.


1.3 Feasibility:
1. Financial feasibility

This is image caption generation software that uses freely available datasets, so
there is no hosting cost. If it were developed as an online application, it would only
consume internet data to load advertisements there. It is completely free to use on a
local system; the only situation in which it may consume internet data is when we want
to feed more datasets to it.

2. Technical feasibility

The tools and technologies used in this project are completely free for students.
Additionally, we need to install the following libraries to build this project (a quick
import check is shown after the list):

 Tensorflow
 Keras
 Pillow
 Numpy
 Tqdm
 Jupyterlab
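
These libraries can be installed with pip. As a quick sanity check that the environment is set up (a minimal sketch, assuming the package names listed above), the following imports should run without errors:

# quick environment check: these imports should all succeed
import tensorflow
import keras
import PIL        # installed as the "pillow" package
import numpy
import tqdm

print("TensorFlow version:", tensorflow.__version__)
print("Keras version:", keras.__version__)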

3. Economic Feasibility

The resources that are required for this project are:

 Development machine: any regular PC/laptop with a minimum of 2 GB of RAM
and a decent GPU can be used for the development of this project.
 Technical tools and software: as mentioned, the tools needed to develop
this software are available to developers at no charge.
 Python modules and an integrated development environment are all freely
available.
4. Legal Feasibility

The tools and libraries used to make this project are completely free, released
under either CC0 or CC 3.0 licences, so we can use them as long as we credit them
properly, which we have already done by uploading the project to GitHub. As mentioned
in the previous section, this software uses freely available software and tools that are
intended for the use of software developers everywhere. There can be no conflict
regarding plagiarism of any other software, because we have followed the rules of the
licences while making this project.
Chapter-2

Project

Requirement
2.1 Requirement analysis:

 Language:

Python is an interpreted, high-level, general-purpose programming language. Created


by Guido van Rossum and first released in 1991, Python's design philosophy
emphasizes code readability with its notable use of significant whitespace. Its language
constructs and object-oriented approach aim to help programmers write clear, logical
code for small and large-scale projects.

 Hardware:

 Minimum 1 GB of free space on the hard disk drive.
 Intel Core 2 Duo processor.
 2 GB DDR SDRAM.
 Monitor supporting a minimum of 256 colours.
 Hard disk drive, 7200 RPM.

 Software:

 Windows 10.
 PyCharm.
 Python interpreter.
 Jupyter Notebook.
 MS Office.
Chapter-3
Concepts

&

Modules
3.1 Concepts used:

3.1.1 CNN (Convolutional Neural Network)

Convolutional Neural Networks are specialized deep neural networks that can
process data with an input shape like a 2D matrix. Images are easily represented
as a 2D matrix, and CNNs are very useful when working with images.

A CNN is basically used for image classification, for example identifying whether
an image is a bird, a plane or Superman, etc.
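
As an illustration only (this sketch is not part of the project code; the layer sizes and the ten-class output are hypothetical), a small image classifier in Keras could look like this:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# a tiny CNN: convolution + pooling layers learn image features,
# the dense layer at the end performs the classification
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(10, activation='softmax'),   # hypothetical: 10 output classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()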

3.1.2 LSTM (Long Short Term Memory)

LSTM stands for Long Short Term Memory; LSTMs are a type of RNN
(recurrent neural network) that is well suited to sequence prediction problems.
Based on the previous text, we can predict what the next word will be. The LSTM
has proven itself more effective than the traditional RNN by overcoming the RNN's
limitation of having only short-term memory. An LSTM can carry relevant
information throughout the processing of the input and, with a forget gate, it
discards non-relevant information.

(A diagram of an LSTM cell appears here in the original report.)
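
As a complementary illustration (again not part of the project code; the vocabulary size, sequence length and training data below are made up), a minimal next-word LSTM in Keras might look like this:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size, seq_len = 50, 5   # hypothetical toy sizes

model = Sequential([
    Embedding(vocab_size, 16, input_length=seq_len),  # word indices -> dense vectors
    LSTM(32),                                         # carries context across the sequence
    Dense(vocab_size, activation='softmax'),          # probability of the next word
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

x = np.random.randint(1, vocab_size, size=(8, seq_len))  # 8 dummy input sequences
y = np.random.randint(1, vocab_size, size=(8,))          # dummy next-word targets
model.fit(x, y, epochs=1, verbose=0)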

3.1.3 Image caption generator model

To make the image caption generator model, we will be merging the CNN
(convolutional neural network) and LSTM (long short term memory)
architectures. This is also called a CNN-RNN model.

 CNN is used for extracting features from the image. We will use the pre-
trained Xception model.
 LSTM will use the information from the CNN to help generate a description of
the image.
3.2 Python Programming

Python is a high-level, interpreted, interactive and object-oriented scripting
language. Python is designed to be highly readable. It uses English keywords
frequently, whereas other languages use punctuation, and it has fewer
syntactical constructions than other languages.

Advantages:

 Python is Interpreted − Python is processed at runtime by the interpreter.


You do not need to compile your program before executing it. This is
similar to PERL and PHP.
 Python is Interactive − You can actually sit at a Python prompt and
interact with the interpreter directly to write your programs.
 Python is Object-Oriented − Python supports Object-Oriented style or
technique of programming that encapsulates code within objects.
 Python is a Beginner's Language − Python is a great language for the
beginner-level programmers and supports the development of a wide
range of applications from simple text processing to WWW browsers to
games.

3.2.1 Python data type

1.Numbers:
Number data types store numeric values. Number objects are created when you
assign a value to them.
Python 3 supports three different numerical types −

 int (signed integers of unlimited size; Python 2 also had a separate long
type for large integers)
 float (floating point real values)
 complex (complex numbers)
2.Strings:
Strings in Python are identified as a contiguous set of characters represented in
the quotation marks. Python allows for either pairs of single or double quotes.
Subsets of strings can be taken using the slice operator ([ ] and [:] ) with indexes
starting at 0 in the beginning of the string and working their way from -1 at the
end.
The plus (+) sign is the string concatenation operator and the asterisk (*) is the
repetition operator.

3.Lists:

Lists are the most versatile of Python's compound data types. A list contains
items separated by commas and enclosed within square brackets ([]). To some
extent, lists are similar to arrays in C. One difference between them is that all the
items belonging to a list can be of different data type.
The values stored in a list can be accessed using the slice operator ([ ] and [:])
with indexes starting at 0 in the beginning of the list and working their way to
end -1. The plus (+) sign is the list concatenation operator, and the asterisk (*) is
the repetition operator.
4.Tuples:
A tuple is another sequence data type that is similar to the list. A tuple consists
of a number of values separated by commas. Unlike lists, however, tuples are
enclosed within parentheses.
The main differences between lists and tuples are: Lists are enclosed in brackets
( [ ] ) and their elements and size can be changed, while tuples are enclosed in
parentheses ( ( ) ) and cannot be updated. Tuples can be thought of as read-
only lists.

5.Dictionary:

Python's dictionaries are a kind of hash table. They work like the associative
arrays or hashes found in Perl and consist of key-value pairs. A dictionary key
can be almost any Python type, but keys are usually numbers or strings. Values,
on the other hand, can be any arbitrary Python object.
Dictionaries are enclosed by curly braces ({ }) and values can be assigned and
accessed using square brackets ([ ]).
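
The short snippet below illustrates these built-in types (the values are purely illustrative):

n = 42                         # int
pi = 3.14                      # float
z = 2 + 3j                     # complex
s = "image caption"            # string
print(s[0:5], s + "!", s * 2)  # slicing, concatenation, repetition

colors = ["red", "green", 3]   # list: mutable, items may have different types
colors[2] = "blue"
point = (10, 20)               # tuple: immutable sequence
scores = {"cnn": 0.9, "lstm": 0.8}   # dictionary: key-value pairs
print(colors[0:2], point[0], scores["cnn"])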
3.2.2 Operators in Python

Operators are symbols that are used to perform operations on operands.


Operands may be variables and/or constants.

For example, in 2+3, + is an operator that is used to carry out the addition
operation, while 2 and 3 are the operands.
Python language supports the following types of operators.

 Arithmetic Operators
 Comparison (Relational) Operators
 Assignment Operators
 Logical Operators
 Bitwise Operators
 Membership Operators
 Identity Operators
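
A few of these operators in action (illustrative values only):

a, b = 7, 3
print(a + b, a % b, a ** b)      # arithmetic: 10 1 343
print(a > b, a == b)             # comparison: True False
print(a > 5 and b < 5)           # logical: True
print(a & b, a | b)              # bitwise: 3 7
print(3 in [1, 2, 3], a is b)    # membership and identity: True False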

3.2.3 Decision making in Python

Decision making is the anticipation of conditions occurring during the execution of
a program and the specification of actions taken according to those conditions.
Decision structures evaluate multiple expressions which produce TRUE or FALSE
as the outcome. You need to determine which action to take and which statements
to execute when the outcome is TRUE, and which otherwise.
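
For example, an if/elif/else block chooses an action based on a condition (the threshold values are illustrative):

score = 0.75                     # hypothetical model confidence
if score > 0.9:
    print("very confident caption")
elif score > 0.5:
    print("plausible caption")
else:
    print("low-confidence caption")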
3.2.4 Loops in Python:
Normally, the first statement in a function is executed first, followed by the second,
and so on. However, there may be a situation when you need to execute a block
of code several times.
Programming languages provide various control structures that allow for
more complicated execution paths.
A loop statement allows us to execute a statement or a group of statements
multiple times.

1. While loop: repeats a statement or group of statements while a given
condition is TRUE. It tests the condition before executing the loop body.
2. For loop: executes a sequence of statements multiple times and
abbreviates the code that manages the loop variable.
3. Nested loop: one or more loops used inside another for or while loop.
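
For example (the captions below are made up for illustration):

captions = ["a dog runs", "two dogs play", "a man rides a horse"]

# for loop: iterate over each caption
for caption in captions:
    print(caption)

# while loop: runs as long as the condition holds
i = 0
while i < len(captions):
    print(len(captions[i].split()), "words")
    i += 1

# nested loop: iterate over the words inside each caption
for caption in captions:
    for word in caption.split():
        print(word, end=" ")
    print()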
Project File Structure
 Flicker8k_Dataset – Dataset folder which contains 8091 images.
 Flickr_8k_text – Dataset folder which contains text files and captions of
images.

The below files will be created by us while making the project.

 Models – It will contain our trained models.


 Descriptions.txt – This text file contains all image names and their
captions after preprocessing.
 Features.p – Pickle object that contains an image and their feature
vector extracted from the Xception pre-trained CNN model.
 Tokenizer.p – Contains tokens mapped with an index value.
 Model.png – Visual representation of the dimensions of our model.
 Testing_caption_generator.py – Python file for generating a caption of
any image.
 Training_caption_generator.ipynb – Jupyter notebook in which we
train and build our image caption generator.
Chapter-4

Coding

&

Development
Building the Python based Project
Initialize the Jupyter notebook server by typing jupyter lab in the console
of your project folder. It will open up the interactive Python notebook
where you can run your code. Create a Python 3 notebook and name
it training_caption_generator.ipynb.
Coding Phase:
1. First, we import all the necessary packages
1. import string
2. import numpy as np
3. from PIL import Image
4. import os
5. from pickle import dump, load
6. import numpy as np
7.
8. from keras.applications.xception import Xception, preprocess_input
9. from keras.preprocessing.image import load_img, img_to_array
10. from keras.preprocessing.text import Tokenizer
11. from keras.preprocessing.sequence import pad_sequences
12. from keras.utils import to_categorical
13. from keras.layers.merge import add
14. from keras.models import Model, load_model
15. from keras.layers import Input, Dense, LSTM, Embedding, Dropout
16.
17. # small library for seeing the progress of loops.
18. from tqdm import tqdm_notebook as tqdm
19. tqdm().pandas()
2. Getting and performing data cleaning

The main text file which contains all image captions is Flickr8k.token in
our Flickr_8k_text folder.
Have a look at the file –

The format of the file is an image name and its caption separated by a tab,
with one such pair per line (lines separated by a newline “\n”). Each image has
5 captions, and we can see that a number (0 to 4) is assigned to each caption.
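
A few illustrative lines in the same layout (the file name and captions are made up, not actual dataset contents, and the tab is shown here as spaces):

example_image_1.jpg#0    A child is playing in the park .
example_image_1.jpg#1    A young girl runs across the grass .
example_image_1.jpg#2    A kid plays outside on a sunny day .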

We will define 5 functions:

 load_doc( filename ) – For loading the document file and reading the
contents inside the file into a string.

 all_img_captions( filename ) – This function will create


a descriptions dictionary that maps images with a list of 5 captions.
The descriptions dictionary will look something like this:

 cleaning_text( descriptions) – This function takes all descriptions


and performs data cleaning. This is an important step when we work
with textual data, according to our goal, we decide what type of
cleaning we want to perform on the text. In our case, we will be
removing punctuations, converting all text to lowercase and
removing words that contain numbers.
So, a caption like “A man riding on a three-wheeled wheelchair” will
be transformed into “man riding on three wheeled wheelchair”
 text_vocabulary( descriptions ) – This is a simple function that will
separate all the unique words and create the vocabulary from all the
descriptions.

 save_descriptions( descriptions, filename ) – This function will


create a list of all the descriptions that have been preprocessed and
store them into a file. We will create a descriptions.txt file to store all
the captions.

It will look something like this:
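
Illustrative lines from descriptions.txt after cleaning (made-up contents; each line is an image name, a tab, then the cleaned caption, with the tab shown here as spaces):

example_image_1.jpg    child is playing in the park
example_image_1.jpg    young girl runs across the grass
example_image_2.jpg    two dogs drink water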

Code :
1. # Loading a text file into memory
2. def load_doc(filename):
3. # Opening the file as read only
4. file = open(filename, 'r')
5. text = file.read()
6. file.close()
7. return text
8.
9. # get all imgs with their captions
10. def all_img_captions(filename):
11. file = load_doc(filename)
12. captions = file.split('\n')
13. descriptions ={}
14. for caption in captions[:-1]:
15. img, caption = caption.split('\t')
16. if img[:-2] not in descriptions:
17. descriptions[img[:-2]] = [caption]
18. else:
19. descriptions[img[:-2]].append(caption)
20. return descriptions
21.
22. #Data cleaning- lower casing, removing puntuations and words
containing numbers
23. def cleaning_text(captions):
24. table = str.maketrans('','',string.punctuation)
25. for img,caps in captions.items():
26. for i,img_caption in enumerate(caps):
27.
28. img_caption = img_caption.replace("-"," ")
29. desc = img_caption.split()
30.
31. #converts to lowercase
32. desc = [word.lower() for word in desc]
33. #remove punctuation from each token
34. desc = [word.translate(table) for word in desc]
35. #remove hanging 's and a
36. desc = [word for word in desc if(len(word)>1)]
37. #remove tokens with numbers in them
38. desc = [word for word in desc if(word.isalpha())]
39. #convert back to string
40.
41. img_caption = ' '.join(desc)
42. captions[img][i]= img_caption
43. return captions
44.
45. def text_vocabulary(descriptions):
46. # build vocabulary of all unique words
47. vocab = set()
48.
49. for key in descriptions.keys():
50. [vocab.update(d.split()) for d in descriptions[key]]
51.
52. return vocab
53.
54. #All descriptions in one file
55. def save_descriptions(descriptions, filename):
56. lines = list()
57. for key, desc_list in descriptions.items():
58. for desc in desc_list:
59. lines.append(key + '\t' + desc )
60. data = "\n".join(lines)
61. file = open(filename,"w")
62. file.write(data)
63. file.close()
64.
65.
66. # Set these path according to project folder in you system
67. dataset_text = "D:\dataflair projects\Project - Image Caption
Generator\Flickr_8k_text"
68. dataset_images = "D:\dataflair projects\Project - Image Caption
Generator\Flicker8k_Dataset"
69.
70. #we prepare our text data
71. filename = dataset_text + "/" + "Flickr8k.token.txt"
72. #loading the file that contains all data
73. #mapping them into descriptions dictionary img to 5 captions
74. descriptions = all_img_captions(filename)
75. print("Length of descriptions =" ,len(descriptions))
76.
77. #cleaning the descriptions
78. clean_descriptions = cleaning_text(descriptions)
79.
80. #building vocabulary
81. vocabulary = text_vocabulary(clean_descriptions)
82. print("Length of vocabulary = ", len(vocabulary))
83.
84. #saving each description to file
85. save_descriptions(clean_descriptions, "descriptions.txt")
3. Extracting the feature vector from all images
This technique is also called transfer learning: we don’t have to do everything
on our own; we use a pre-trained model that has already been trained on a large
dataset, extract the features from it, and use them for our task. We are using
the Xception model, which has been trained on the ImageNet dataset, which has
1000 different classes to classify. We can directly import this model from
keras.applications. Make sure you are connected to the internet, as the weights
get downloaded automatically. Since the Xception model was originally built for
ImageNet, we will make small changes to integrate it with our model. One thing
to notice is that the Xception model takes a 299*299*3 image size as input. We
will remove the last classification layer and get the 2048-dimensional feature
vector.

model = Xception( include_top=False, pooling='avg' )

The function extract_features() will extract features for all images and
we will map image names with their respective feature array. Then we
will dump the features dictionary into a “features.p” pickle file.

Code:
1. def extract_features(directory):
2. model = Xception( include_top=False, pooling='avg' )
3. features = {}
4. for img in tqdm(os.listdir(directory)):
5. filename = directory + "/" + img
6. image = Image.open(filename)
7. image = image.resize((299,299))
8. image = np.expand_dims(image, axis=0)
9. #image = preprocess_input(image)
10. image = image/127.5
11. image = image - 1.0
12.
13. feature = model.predict(image)
14. features[img] = feature
15. return features
16.
17. #2048 feature vector
18. features = extract_features(dataset_images)
19. dump(features, open("features.p","wb"))
4. Loading dataset for Training the model
In our Flickr_8k_text folder, we have the Flickr_8k.trainImages.txt file, which
contains a list of 6000 image names that we will use for training.
For loading the training dataset, we need more functions:

 load_photos( filename ) – This will load the text file in a string and
will return the list of image names.
 load_clean_descriptions( filename, photos ) – This function will
create a dictionary that contains captions for each photo from the list
of photos. We also append the <start> and <end> identifier for each
caption. We need this so that our LSTM model can identify the
starting and ending of the caption.
 load_features(photos) – This function will give us the dictionary for
image names and their feature vector which we have previously
extracted from the Xception model.

Code :
1. #load the data
2. def load_photos(filename):
3. file = load_doc(filename)
4. photos = file.split("\n")[:-1]
5. return photos
6.
7.
8. def load_clean_descriptions(filename, photos):
9. #loading clean_descriptions
10. file = load_doc(filename)
11. descriptions = {}
12. for line in file.split("\n"):
13.
14. words = line.split()
15. if len(words)<1 :
16. continue
17.
18. image, image_caption = words[0], words[1:]
19.
20. if image in photos:
21. if image not in descriptions:
22. descriptions[image] = []
23. desc = '<start> ' + " ".join(image_caption) + ' <end>'
24. descriptions[image].append(desc)
25.
26. return descriptions
27.
28.
29. def load_features(photos):
30. #loading all features
31. all_features = load(open("features.p","rb"))
32. #selecting only needed features
33. features = {k:all_features[k] for k in photos}
34. return features
35.
36.
37. filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"
38.
39. #train = loading_data(filename)
40. train_imgs = load_photos(filename)
41. train_descriptions = load_clean_descriptions("descriptions.txt",
train_imgs)
42. train_features = load_features(train_imgs)

5. Tokenizing the vocabulary


Computers don’t understand English words; for computers, we have to represent
them with numbers. So, we will map each word of the vocabulary to a unique
index value. The Keras library provides us with the Tokenizer class that we will
use to create tokens from our vocabulary and save them to a “tokenizer.p”
pickle file.

Code:
1. #converting dictionary to clean list of descriptions
2. def dict_to_list(descriptions):
3. all_desc = []
4. for key in descriptions.keys():
5. [all_desc.append(d) for d in descriptions[key]]
6. return all_desc
7.
8. #creating tokenizer class
9. #this will vectorise text corpus
10. #each integer will represent token in dictionary
11.
12. from keras.preprocessing.text import Tokenizer
13.
14. def create_tokenizer(descriptions):
15. desc_list = dict_to_list(descriptions)
16. tokenizer = Tokenizer()
17. tokenizer.fit_on_texts(desc_list)
18. return tokenizer
19.
20. # give each word an index, and store that into tokenizer.p pickle file
21. tokenizer = create_tokenizer(train_descriptions)
22. dump(tokenizer, open('tokenizer.p', 'wb'))
23. vocab_size = len(tokenizer.word_index) + 1
24. vocab_size

We calculate the maximum length of the descriptions. This is important


for deciding the model structure parameters. Max_length of description
is 32.

1. #calculate maximum length of descriptions


2. def max_length(descriptions):
3. desc_list = dict_to_list(descriptions)
4. return max(len(d.split()) for d in desc_list)
5.
6. max_length = max_length(descriptions)
7. max_length

6. Create Data generator


Let us first see how the input and output of our model will look. To make this
a supervised learning task, we have to provide input and output to the model
for training. We have to train our model on 6000 images; each image is
represented by a 2048-length feature vector, and the caption is also represented
as numbers. It is not possible to hold this amount of data for 6000 images in
memory, so we will be using a generator method that yields batches.
The generator will yield the input and output sequence.

For example:
The input to our model is [x1, x2] and the output will be y, where x1 is
the 2048 feature vector of that image, x2 is the input text sequence and y
is the output text sequence that the model has to predict.

x1 (feature vector)   x2 (text sequence)                 y (word to predict)

feature               start                              two
feature               start, two                         dogs
feature               start, two, dogs                   drink
feature               start, two, dogs, drink            water
feature               start, two, dogs, drink, water     end


1. #create input-output sequence pairs from the image description.
2.
3. #data generator, used by model.fit_generator()
4. def data_generator(descriptions, features, tokenizer, max_length):
5. while 1:
6. for key, description_list in descriptions.items():
7. #retrieve photo features
8. feature = features[key][0]
9. input_image, input_sequence, output_word =
create_sequences(tokenizer, max_length, description_list, feature)
10. yield [[input_image, input_sequence], output_word]
11.
12. def create_sequences(tokenizer, max_length, desc_list, feature):
13. X1, X2, y = list(), list(), list()
14. # walk through each description for the image
15. for desc in desc_list:
16. # encode the sequence
17. seq = tokenizer.texts_to_sequences([desc])[0]
18. # split one sequence into multiple X,y pairs
19. for i in range(1, len(seq)):
20. # split into input and output pair
21. in_seq, out_seq = seq[:i], seq[i]
22. # pad input sequence
23. in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
24. # encode output sequence
25. out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
26. # store
27. X1.append(feature)
28. X2.append(in_seq)
29. y.append(out_seq)
30. return np.array(X1), np.array(X2), np.array(y)
31.
32. #You can check the shape of the input and output for your model
33. [a,b],c = next(data_generator(train_descriptions, features,
tokenizer, max_length))
34. a.shape, b.shape, c.shape
35. #((47, 2048), (47, 32), (47, 7577))

7. Defining the CNN-RNN model


To define the structure of the model, we will be using the Keras Model
from Functional API. It will consist of three major parts:

 Feature Extractor – The feature extracted from the image has a size
of 2048, with a dense layer, we will reduce the dimensions to 256
nodes.
 Sequence Processor – An embedding layer will handle the textual
input, followed by the LSTM layer.
 Decoder – By merging the output from the above two layers, we process it
with a dense layer to make the final prediction. The final layer will contain
a number of nodes equal to our vocabulary size.
Visual representation of the final model is given
below –
Code to Generate Model:
1. from keras.utils import plot_model
2.
3. # define the captioning model
4. def define_model(vocab_size, max_length):
5.
6. # features from the CNN model squeezed from 2048 to 256 nodes
7. inputs1 = Input(shape=(2048,))
8. fe1 = Dropout(0.5)(inputs1)
9. fe2 = Dense(256, activation='relu')(fe1)
10.
11. # LSTM sequence model
12. inputs2 = Input(shape=(max_length,))
13. se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
14. se2 = Dropout(0.5)(se1)
15. se3 = LSTM(256)(se2)
16.
17. # Merging both models
18. decoder1 = add([fe2, se3])
19. decoder2 = Dense(256, activation='relu')(decoder1)
20. outputs = Dense(vocab_size, activation='softmax')(decoder2)
21.
22. # tie it together [image, seq] [word]
23. model = Model(inputs=[inputs1, inputs2], outputs=outputs)
24. model.compile(loss='categorical_crossentropy', optimizer='adam')
25.
26. # summarize model
27. print(model.summary())
28. plot_model(model, to_file='model.png', show_shapes=True)
29.
30. return model
8. Training the model
To train the model, we will be using the 6000 training images by
generating the input and output sequences in batches and fitting them to
the model using model.fit_generator() method. We also save the model
to our models folder. This will take some time depending on your system
capability.

Code:

1. # train our model


2. print('Dataset: ', len(train_imgs))
3. print('Descriptions: train=', len(train_descriptions))
4. print('Photos: train=', len(train_features))
5. print('Vocabulary Size:', vocab_size)
6. print('Description Length: ', max_length)
7.
8. model = define_model(vocab_size, max_length)
9. epochs = 10
10. steps = len(train_descriptions)
11. # making a directory models to save our models
12. os.mkdir("models")
13. for i in range(epochs):
14. generator = data_generator(train_descriptions, train_features,
tokenizer, max_length)
15. model.fit_generator(generator, epochs=1, steps_per_epoch= steps,
verbose=1)
16. model.save("models/model_" + str(i) + ".h5")
9. Testing the model
The model has been trained; now we will make a separate file,
testing_caption_generator.py, which will load the model and generate
predictions. A prediction is a sequence of index values (up to the maximum
length), so we will use the same tokenizer.p pickle file to get the words from
their index values.

Code:
1. import numpy as np
2. from PIL import Image
3. import matplotlib.pyplot as plt
4. import argparse
5.
6.
7. ap = argparse.ArgumentParser()
8. ap.add_argument('-i', '--image', required=True, help="Image Path")
9. args = vars(ap.parse_args())
10. img_path = args['image']
11.
12. def extract_features(filename, model):
13. try:
14. image = Image.open(filename)
15.
16. except:
17. print("ERROR: Couldn't open image! Make sure the image path and
extension is correct")
18. image = image.resize((299,299))
19. image = np.array(image)
20. # for images that has 4 channels, we convert them into 3 channels
21. if image.shape[2] == 4:
22. image = image[..., :3]
23. image = np.expand_dims(image, axis=0)
24. image = image/127.5
25. image = image - 1.0
26. feature = model.predict(image)
27. return feature
28.
29. def word_for_id(integer, tokenizer):
30. for word, index in tokenizer.word_index.items():
31. if index == integer:
32. return word
33. return None
34.
35.
36. def generate_desc(model, tokenizer, photo, max_length):
37. in_text = 'start'
38. for i in range(max_length):
39. sequence = tokenizer.texts_to_sequences([in_text])[0]
40. sequence = pad_sequences([sequence], maxlen=max_length)
41. pred = model.predict([photo,sequence], verbose=0)
42. pred = np.argmax(pred)
43. word = word_for_id(pred, tokenizer)
44. if word is None:
45. break
46. in_text += ' ' + word
47. if word == 'end':
48. break
49. return in_text
50.
51.
52. #path = 'Flicker8k_Dataset/111537222_07e56d5a30.jpg'
53. max_length = 32
54. tokenizer = load(open("tokenizer.p","rb"))
55. model = load_model('models/model_9.h5')
56. xception_model = Xception(include_top=False, pooling="avg")
57.
58. photo = extract_features(img_path, xception_model)
59. img = Image.open(img_path)
60.
61. description = generate_desc(model, tokenizer, photo, max_length)
62. print("\n\n")
63. print(description)
64. plt.imshow(img)
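
Since the script defines a required -i/--image argument above, it can be run from the command line as, for example, python testing_caption_generator.py --image example.jpg (where example.jpg stands for the path of any test image).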
Final Results:
Chapter- 5
Conclusion

&

References
Conclusion

In this advanced Python project, we have implemented a CNN-RNN model by
building an image caption generator. Some key points to note are that our model
depends on the data, so it cannot predict words that are outside its vocabulary.
We used a small dataset consisting of 8000 images. For production-level models,
we would need to train on datasets of more than 100,000 images, which can
produce more accurate models.

Future work: in the future, we would like to explore methods to generate multiple
sentences with different content. One possible way is to combine interesting-region
detection and image captioning. The VSA method (Karpathy and Fei-Fei) gives a
direction for our future work. As an example, we hope the output could be a short
paragraph: “Jack has a wonderful breakfast on a Sunday morning. He is sitting at a
table with a bouquet of red flowers. With his new iPad Air on the left, he enjoys a
plate of fruits and a cup of coffee.” Such a short paragraph naturally describes the
image content in a story-telling fashion, which is more attractive to human beings.
References

 https://datacarpentry.org/image-processing/aio/index.html
 https://pillow.readthedocs.io/en/3.1.x/reference/Image.html
 https://analyticsindiamag.com/top-8-image-processing-libraries-in-python/
 https://stackoverflow.com/questions/94875/image-processing-in-python
 http://omz-software.com/pythonista/docs/ios/PIL.html
 https://www.tutorialspoint.com/python/index.htm
 https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip
 https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip
 https://www.tensorflow.org/guide/keras
 [Papineni et al. 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
 [Russakovsky et al. 2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252.
 [Sermanet et al. 2013] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. 2013. OverFeat: Integrated recognition, localization and detection using convolutional networks.
