Natural Language Processing
ABSTRACT:
The Image-to-Text Generator is a novel system designed to bridge the semantic gap between visual
and textual information by converting images into coherent and contextually relevant textual
descriptions. This innovative technology leverages state-of-the-art deep learning techniques,
particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to extract
meaningful features from images and generate descriptive text.
The system begins by employing a pre-trained CNN to extract high-level features from input images.
These features are then fed into an RNN-based language model, which dynamically constructs
grammatically sound and semantically rich sentences. The model is trained on large-scale datasets
containing paired images and corresponding textual descriptions, allowing it to learn the inherent
relationships between visual content and linguistic expressions.
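To make this architecture concrete, the following is a minimal Keras sketch of one such encoder-decoder captioning model, merging a 4096-dimensional CNN image feature with an LSTM over the partial caption; vocab_size and max_length are hypothetical placeholders that would come from the fitted tokenizer and the longest training caption:
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_length = 8485, 35  # hypothetical: derived from the tokenizer and dataset

# image-feature branch: 4096-d CNN vector compressed to 256-d
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.4)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# text branch: partial caption -> word embeddings -> LSTM state
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.4)(se1)
se3 = LSTM(256)(se2)

# merge both branches and predict the next word of the caption
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

caption_model = Model(inputs=[inputs1, inputs2], outputs=outputs)
caption_model.compile(loss='categorical_crossentropy', optimizer='adam')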
The Image-to-Text Generator not only excels in accurately captioning images but also showcases
adaptability to diverse domains, making it suitable for applications such as image indexing, content
retrieval, and accessibility services for the visually impaired. Furthermore, the system demonstrates
the ability to generate captions that capture nuanced details and relationships within complex visual
scenes.
In addition to its practical applications, this research contributes to the broader field of computer
vision and natural language processing, fostering a deeper understanding of cross-modal
representation learning. The proposed system is evaluated through comprehensive
benchmarking against existing methods, demonstrating stronger performance on standard
quantitative metrics such as BLEU as well as in qualitative assessments of the generated captions.
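For the quantitative side, captioning systems are commonly scored with BLEU; a minimal sketch using NLTK's corpus_bleu, with toy reference and predicted captions standing in for the real evaluation data:
from nltk.translate.bleu_score import corpus_bleu

# toy data: one image with two reference captions and one predicted caption
actual = [[['a', 'dog', 'runs', 'on', 'the', 'grass'],
           ['a', 'dog', 'running', 'through', 'a', 'field']]]
predicted = [['a', 'dog', 'runs', 'through', 'the', 'grass']]

# BLEU-1 scores unigram overlap only; BLEU-2 averages unigram and bigram overlap
print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))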
The Image-to-Text Generator represents a significant step towards enhancing the synergy between
visual and textual modalities, opening up new avenues for intelligent systems that can comprehend
and communicate information seamlessly across different modalities. This research lays the
foundation for future advancements in multimodal AI systems, pushing the boundaries of what is
achievable in the realm of human-machine interaction.
IMPORTING MODULES:
import os
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# load the pre-trained VGG16 model and summarize its architecture
model = VGG16()
model.summary()
IMAGE FEATURE EXTRACTION:
# restructure VGG16 to output the 4096-d fc2 features instead of class scores
model = Model(inputs=model.inputs, outputs=model.layers[-2].output)

features = {}
images_dir = 'Images'  # hypothetical path to the dataset image folder
for img_name in os.listdir(images_dir):
    # load and preprocess the image into VGG16's expected input format
    image = load_img(os.path.join(images_dir, img_name), target_size=(224, 224))
    image = img_to_array(image)
    image = image.reshape((1, 224, 224, 3))
    image = preprocess_input(image)
    # extract features
    feature = model.predict(image, verbose=0)
    # get image ID and store the feature
    image_id = img_name.split('.')[0]
    features[image_id] = feature
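The visualisation step below assumes a helper that decodes a caption from the stored image features. A minimal greedy-search sketch, assuming the trained caption_model from earlier, a tokenizer fitted on the training captions (wrapped in startseq/endseq markers), and the max_length used during training:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def idx_to_word(integer, tokenizer):
    # map a predicted index back to its word in the tokenizer vocabulary
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def predict_caption(caption_model, image_feature, tokenizer, max_length):
    # greedy search: repeatedly append the most probable next word
    in_text = 'startseq'
    for _ in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        yhat = caption_model.predict([image_feature, sequence], verbose=0)
        word = idx_to_word(np.argmax(yhat), tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'endseq':
            break
    return in_text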
RESULT VISUALISATION:
def generate_caption(image_name):
    # image_name = "1001773457_577c3a7d70.jpg"
    image_id = image_name.split('.')[0]
    img_path = os.path.join(images_dir, image_name)
    image = Image.open(img_path)
    # mapping is assumed to hold image_id -> list of ground-truth captions
    captions = mapping[image_id]
    print('---------------------Actual---------------------')
    for caption in captions:
        print(caption)
    # decode a caption from the stored features with the greedy-search helper above
    y_pred = predict_caption(caption_model, features[image_id], tokenizer, max_length)
    print('--------------------Predicted--------------------')
    print(y_pred)
    plt.imshow(image)
    plt.show()
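A sample invocation, using the Flickr-style file name already noted in the function body:
generate_caption('1001773457_577c3a7d70.jpg')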
OUTPUT:
The script displays the selected test image together with its actual (ground-truth) captions and the predicted caption.