
NATURAL LANGUAGE PROCESSING

IMAGE TO TEXT GENERATOR

ABSTRACT:

The Image-to-Text Generator is a novel system designed to bridge the semantic gap between visual
and textual information by converting images into coherent and contextually relevant textual
descriptions. This innovative technology leverages state-of-the-art deep learning techniques,
particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to extract
meaningful features from images and generate descriptive text.

The system begins by employing a pre-trained CNN to extract high-level features from input images.
These features are then fed into an RNN-based language model, which dynamically constructs
grammatically sound and semantically rich sentences. The model is trained on large-scale datasets
containing paired images and corresponding textual descriptions, allowing it to learn the inherent
relationships between visual content and linguistic expressions.
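A minimal sketch of the encoder-decoder wiring described above is shown below, assuming 4096-dimensional VGG16 features and placeholder values for the vocabulary size and maximum caption length (the 256-unit layer sizes are likewise assumptions, not values reported here):

from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from tensorflow.keras.models import Model

vocab_size = 8485   # assumed: unique words in the caption corpus + 1
max_length = 35     # assumed: length of the longest training caption in tokens

# image feature encoder: 4096-dim VGG16 features -> 256-dim representation
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.4)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# caption sequence encoder: token ids -> embeddings -> LSTM state
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256)(inputs2)
se2 = Dropout(0.4)(se1)
se3 = LSTM(256)(se2)

# decoder: merge both representations and predict the next word
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

caption_model = Model(inputs=[inputs1, inputs2], outputs=outputs)
caption_model.compile(loss='categorical_crossentropy', optimizer='adam')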

The Image-to-Text Generator not only excels in accurately captioning images but also showcases
adaptability to diverse domains, making it suitable for applications such as image indexing, content
retrieval, and accessibility services for the visually impaired. Furthermore, the system demonstrates
the ability to generate captions that capture nuanced details and relationships within complex visual
scenes.

In addition to its practical applications, this research contributes to the broader field of computer
vision and natural language processing, fostering a deeper understanding of cross-modal
representation learning. The evaluation of the proposed system involves comprehensive
benchmarking against existing methods, showcasing its superior performance in terms of both
quantitative metrics and qualitative assessments.

The Image-to-Text Generator represents a significant step towards enhancing the synergy between
visual and textual modalities, opening up new avenues for intelligent systems that can comprehend
and communicate information seamlessly across different modalities. This research lays the
foundation for future advancements in multimodal AI systems, pushing the boundaries of what is
achievable in the realm of human-machine interaction.

IMPORTING MODULES:

import os  # general Python library
import pickle  # storing the extracted numpy features
import numpy as np  # for numerical data
from tqdm.notebook import tqdm  # progress bars when processing very large data
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array  # loading the image in a suitable form
from tensorflow.keras.preprocessing.text import Tokenizer  # tokenizing the captions
from tensorflow.keras.preprocessing.sequence import pad_sequences  # padding the output caption to a fixed length
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add

LOADING THE DATASET:

# load the VGG16 model
model = VGG16()

# restructure the model: keep everything up to the second-to-last layer (the 4096-dim fc2 features)
model = Model(inputs=model.inputs, outputs=model.layers[-2].output)

# summarize
model.summary()
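The restructured VGG16 model outputs the 4096-dimensional fc2 features used as the image representation. The captions themselves are assumed to come from a Flickr8k-style captions.txt file in which each line pairs an image file name with one description; a minimal sketch of building the image-to-captions mapping used later (the file name, header row, and comma delimiter are assumptions):

# assumed layout per line: "image_name.jpg,A caption describing the image."
with open(os.path.join(BASE_DIR, 'captions.txt'), 'r') as f:
    next(f)  # skip the header row (assumed to be present)
    captions_doc = f.read()

# map each image id to the list of its captions
mapping = {}
for line in tqdm(captions_doc.split('\n')):
    tokens = line.split(',')
    if len(tokens) < 2:
        continue
    img_id, caption = tokens[0], ','.join(tokens[1:])
    img_id = img_id.split('.')[0]  # drop the .jpg extension
    mapping.setdefault(img_id, []).append(caption.strip())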

IMAGE EXTRACTION:

# extract features from each image
features = {}
directory = os.path.join(BASE_DIR, 'Images')

for img_name in tqdm(os.listdir(directory)):
    # load the image from file
    img_path = os.path.join(directory, img_name)
    image = load_img(img_path, target_size=(224, 224))
    # convert image pixels to a numpy array
    image = img_to_array(image)
    # reshape data for the model (add the batch dimension)
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # preprocess the image for VGG16
    image = preprocess_input(image)
    # extract features
    feature = model.predict(image, verbose=0)
    # get the image ID (file name without the extension)
    image_id = img_name.split('.')[0]
    # store the feature vector
    features[image_id] = feature
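Because running VGG16 over every image is slow, the pickle module imported earlier can cache the extracted features to disk; a small sketch (the features.pkl file name is an assumption):

# cache the extracted features so they need not be recomputed
with open(os.path.join(BASE_DIR, 'features.pkl'), 'wb') as f:
    pickle.dump(features, f)

# later runs can simply reload them
with open(os.path.join(BASE_DIR, 'features.pkl'), 'rb') as f:
    features = pickle.load(f)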

RESULT VISUALISATION:

from PIL import Image
import matplotlib.pyplot as plt

def generate_caption(image_name):
    # load the image
    # image_name = "1001773457_577c3a7d70.jpg"
    image_id = image_name.split('.')[0]
    img_path = os.path.join(BASE_DIR, "Images", image_name)
    image = Image.open(img_path)
    captions = mapping[image_id]
    # show the ground-truth captions for this image
    print('---------------------Actual---------------------')
    for caption in captions:
        print(caption)
    # predict the caption
    y_pred = predict_caption(model, features[image_id], tokenizer, max_length)
    print('--------------------Predicted--------------------')
    print(y_pred)
    plt.imshow(image)
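generate_caption relies on a predict_caption helper, a fitted tokenizer, a max_length value, and the trained caption model, none of which appear in this excerpt. A minimal greedy-decoding sketch of such a helper, assuming the training captions were wrapped with 'startseq' and 'endseq' markers:

def idx_to_word(integer, tokenizer):
    # map a predicted token id back to its word, if one exists
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def predict_caption(model, image, tokenizer, max_length):
    # start from the begin-of-caption marker and grow the sentence one word at a time
    in_text = 'startseq'
    for _ in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        yhat = model.predict([image, sequence], verbose=0)  # model: the trained caption model
        yhat = np.argmax(yhat)
        word = idx_to_word(yhat, tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'endseq':
            break
    return in_text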

OUTPUT:
