
INTERNSHIP REPORT

ON
Harnessing AI for Real World : An Internship Report
From Innomatics Research Labs
SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENT FOR THE
AWARD OF THE DEGREE OF

Master of Science
in
Data Science
Submitted by

ABDUL RAUF (22DSMSA–116)

Under the Supervision of

Mr. Kanav Bansal (External)


Dr. Shakeel Javaid (Internal)
Dr. Mohd Faizan (Internal)

DEPARTMENT OF STATISTICS & OPERATIONS RESEARCH


ALIGARH MUSLIM UNIVERSITY
ALIGARH(INDIA)-202002
2023–24
DEPARTMENT OF STATISTICS AND OPERATIONS
RESEARCH ALIGARH MUSLIM UNIVERSITY,
ALIGARH-202002 (U.P.), INDIA

Phone: (0571) 2701251

CERTIFICATE
This is to certify that the project entitled "Harnessing AI for Real World:
An Internship Report From Innomatics Research Labs", which has been
submitted by ABDUL RAUF (Enrolment No. GL6092 and Roll No.
22DSMSA116), Final Year, 4th Semester, 2023-2024, Aligarh Muslim
University, Aligarh, has been carried out under our supervision.

During his internship program, he demonstrated exceptional skills and a
self-motivated attitude, learning new things and implementing his work
end-to-end in line with the prescribed industrial standards. His performance
was excellent, and he completed the internship successfully and on time under
the guidance of Mr. Kanav Bansal (External).

The content of this project has not been submitted to any university or
institute for the award of any degree or diploma.

Dr. Shakeel Javaid (Internal) Dr. Mohd Faizan (Internal)

(Supervisors)
Department of Statistics & O.R.
Aligarh Muslim University, Aligarh
Uttar Pradesh, India (202002)
DECLARATION

I hereby declare that the internship report entitled “Harnessing AI for Real
World: An Internship Report From Innomatics Research Labs”,
submitted by me for the award of the degree of Master of Science (Data
Science), is a record of bonafide work carried out by me under the
supervision of Mr. Kanav Bansal (Data Scientist) at Innomatics Research
Labs, Hyderabad, Telangana, India.

I further declare that the work reported in this internship report has not
been submitted and will not be submitted, either in part or in full, for the
award of any other degree or diploma in this university or any other
institution.

ABDUL RAUF
Place: Aligarh
GL6092
Date:
22DSMSA116
DEDICATED TO

LATE SIR SYED AHMAD KHAN

FOUNDER

OF

ALIGARH MUSLIM UNIVERSITY


ACKNOWLEDGEMENT

All praises are due to ALLAH Almighty, the most beneficent, merciful, and the
supreme authority of the universe, and His Prophet Mohammad Mustafa
Sallallahu Alaihi Wasallam (May Allah send blessings and peace upon him). I
owe it all to ALLAH for granting me the wisdom, health, and strength to
undertake this internship and enabling me to complete it.
It is my privilege to express my sincere thanks and a sense of deep gratitude to
Dr. Shakeel Javaid and Dr. Mohd Faizan, Department of Statistics & Operations
Research, A.M.U., Aligarh, for all their help, painstaking efforts, and deep
insight into the problem, improving the quality of the work at all stages and
providing me with the best knowledge to carry out my work.
I owe my deep gratitude to Prof. Athar Ali Khan, Chairman, Department of
Statistics & Operations Research, A.M.U., Aligarh, for his unquestionable
cooperation and inspiration. I thank him for providing the facilities and
making the environment conducive for carrying out the internship.
I would also like to express my gratitude to Mr. Kanav Bansal, Chief Data
Scientist at Innomatics Research Labs, Hyderabad, Telangana, India, for his
time, help, valuable advice, supervision, and immense patience, due to which
this work was able to take the shape in which it has been presented.
I would also like to express my gratitude to my teachers for their time, help,
valuable advice, supervision, and immense patience.
It is indeed a pleasure to thank my friends who persuaded and encouraged me to
take up and complete this task. Last but not least, I express my gratitude and
appreciation to all those who have helped me, directly or indirectly, towards the
successful completion of this project.

Place: Aligarh
ABDUL RAUF

Date:
TABLE OF CONTENTS

CHAPTERS Page No.


Certificate ------------------------------------------------------------------------------ 3

Abstract -------------------------------------------------------------------------------- 5

Overview of the Company ---------------------------------------------------------- 6

Structure of the Internship ---------------------------------------------------------- 7

Chapter 1: Sentiment Analysis of Real-time Flipkart Product Reviews ----- (9-33)

1.1: Objective

1.2: Data Preprocessing

1.3: Modelling Approach

1.4: Evaluation of Model

1.5: Creating webapp using Flask

1.6: Model deployment on AWS

Chapter 2: Regex Matching Web App Development ------------------------- (34-39)

2.1: Objective

2.2: Overview of Regex

2.3: Flask application for Regex

2.4: E-mail validation feature

2.5: Deployment on AWS

1
Chapter 3: Movies Subtitle Search Engine -------------------------------------- (40-51)

3.1: Objective

3.2: About the data

3.3: Reading the data from database – decompressing and decoding

3.4: Data Cleaning

3.5: Data Chunking

3.6: Generating text embedding

3.7: Storing data in vector database (Chromadb)

3.8: Front end application to access the search engine (Streamlit)

Chapter 4: GenAI App – AI Code Reviewer ---------------------------------- (52-54)

4.1: Objective

4.2: Feature

4.3: Workflow

4.4: Frontend webapp development using streamlit

4.5: Conclusion

Chapter 5: RAG System on “Leave No Context Behind” Paper ------------ (55-61)

5.1: Objective

5.2: Use and need

5.3: Features

5.4: Implementation

5.5: Future enhancement

References ----------------------------------------------------------------------------- 62

Appendix ------------------------------------------------------------------------------ 63

2
3
4
ABSTRACT

Innomatics Research Labs, Hyderabad, offered an exhilarating and enriching internship


experience where I worked as a Data Science Intern. During my tenure, I was deeply engaged in
a comprehensive blend of theoretical learning and practical application. I undertook various
challenging projects, such as performing sentiment analysis on product reviews from Flipkart,
developing advanced recommendation systems, and creating innovative Retrieval-Augmented
Generation (RAG) applications using state-of-the-art technologies like OpenAI and LangChain.
Alongside these projects, I received extensive training in data analysis, generative AI, and
advanced data science techniques, which greatly enhanced my understanding and skills in these
areas. Each task involved meticulous steps of data scraping, preprocessing, machine learning
model development, and cloud deployment, providing a holistic view of the data science
pipeline. This internship not only bolstered my technical proficiency but also fueled my
enthusiasm for applying AI-driven solutions to real-world problems. The experiences and
knowledge gained during this period have been pivotal in shaping my career aspirations in the
dynamic fields of data science and artificial intelligence.

5
OVERVIEW OF THE COMPANY
Innomatics Research Labs, located in Hyderabad, is a leading educational and research institute
specializing in data science, artificial intelligence (AI), and advanced technology training. The
mission of Innomatics is to empower individuals and organizations with cutting-edge knowledge
and skills in data science and AI, bridging the gap between academia and industry.

Core Services:

 Educational Programs: Intensive bootcamps, certification courses, and specialized


workshops in data science and AI.
 Research and Development: Focused on the latest advancements in machine learning
and AI to drive innovation.
 Corporate Training: Customized programs to upskill employees and enhance
organizational capabilities.
 Consulting Services: Expertise in implementing data-driven and AI solutions for
businesses.
 Facilities: State-of-the-art infrastructure with modern classrooms, fully-equipped labs,
and collaborative spaces, creating an optimal learning environment.
 Industry Partnerships: Strong collaborations with leading tech companies and industry
organizations, providing students with real-world project opportunities and internships.
 Community and Culture: An inclusive and innovative culture fostering a vibrant
community of learners and professionals passionate about technology.
 Achievements: Recognized for excellence in data science and AI education, with a track
record of training thousands of professionals who excel in their careers.
 Location: Based in Hyderabad, a hub for the tech industry, offering numerous
opportunities for networking, internships, and employment.

6
Structure of the Internship
During my tenure as a Data Science Intern at Innomatics Research Labs, the internship was
structured to provide a comprehensive learning experience, combining hands-on project work
with structured educational sessions. The program was designed to ensure that interns gained
both practical and theoretical knowledge in data science, generative AI, and other advanced
technologies such as OpenAI and LangChain. The structure of the internship was as follows:

1. Onboarding and Orientation:

The internship began with an onboarding session where interns were introduced to the company's
culture, values, and expectations. This session also included an overview of the tools and
technologies we would be using throughout the internship.

2. Educational Sessions:

Regular training sessions were conducted to enhance our understanding of key concepts in data
science and artificial intelligence. These sessions covered a range of topics including:

 Data Science Fundamentals: Introduction to data manipulation, statistical analysis, and


machine learning algorithms.
 Generative AI: Deep dives into generative models, their applications, and how they can
be used to create new content.
 OpenAI and LangChain: Detailed discussions on how to leverage these powerful tools
for advanced language processing and AI-driven applications.

3. Project Assignments:

Interns were assigned real-world projects to apply the knowledge gained from educational
sessions. These projects were carefully selected to cover various aspects of data science and AI,
ensuring a well-rounded experience. Key projects included:

 Data Collection and Preprocessing: Scraping data from the Flipkart website and
performing necessary preprocessing steps.
 Sentiment Analysis: Building models to classify sentiments of product reviews.

7
 Recommendation Systems: Developing a system to recommend movie subtitles based
on user queries using embeddings.
 RAG Applications: Creating Retrieval-Augmented Generation applications using
OpenAI and Google Gemini models.

4. Regular Reviews and Feedback:

Progress reviews were conducted at regular intervals to track the advancement of projects and
provide constructive feedback. This helped in refining our approaches and improving the quality
of our work.

5. Deadline Management:

Each task and project had specific deadlines, instilling a sense of time management and
prioritization. Interns were encouraged to manage their time effectively to meet these deadlines
while maintaining the quality of their outputs.

6. Mentorship and Support:

Throughout the internship, we had access to mentors and senior data scientists who provided
guidance, answered queries, and shared their insights and experiences. This mentorship was
invaluable in navigating complex problems and enhancing our learning.

7. Final Presentation and Evaluation:

The internship concluded with a final presentation where interns showcased their projects and
the results they achieved. This presentation was followed by an evaluation based on the quality
of work, the approach taken, and the overall learning demonstrated.

Conclusion:

The structured approach of combining theoretical learning with practical project work, along
with regular feedback and mentorship, provided a holistic learning experience. This internship
not only enhanced my technical skills in data science and AI but also taught me valuable lessons
in project management, teamwork, and professional communication.

8
PROJECT -1

Sentiment Analysis of Real-time Flipkart Product


Reviews

Objective
The objective of this project is to classify customer reviews as positive or negative and
understand the pain points of customers who write negative reviews. By analyzing the sentiment
of reviews, we aim to gain insights into product features that contribute to customer satisfaction
or dissatisfaction.

Dataset
The dataset was scraped from the Flipkart website. It consists of 8,518 reviews for
the "YONEX MAVIS 350 Nylon Shuttle" product from Flipkart. Each review includes features
such as Reviewer Name, Rating, Review Title, Review Text, Place of Review, Date of Review,
Up Votes, and Down Votes.

Data Preprocessing
 Text Cleaning: Remove special characters, punctuation, and stopwords from the review
text.
 Text Normalization: Perform lemmatization or stemming to reduce words to their base
forms.
 Numerical Feature Extraction: Apply techniques like Bag-of-Words (BoW), Term
Frequency-Inverse Document Frequency (TF-IDF), Word2Vec (W2V), and BERT
models for feature extraction.

Modeling Approach
 Model Selection: Train and evaluate various machine learning and deep learning models
using the embedded text data.
 Evaluation Metric: Use the F1-Score as the evaluation metric to assess the performance
of the models in classifying sentiment.

9
Model Deployment
The model is then deployed on the AWS platform using Flask as the web framework. This
deployment allows for scalability and accessibility, enabling users to interact with the model
through a user-friendly web interface. By leveraging AWS services, we ensure that the model
benefits from robust infrastructure, enhanced security, and reliable performance. The integration
of Flask facilitates seamless communication between the frontend and the backend, providing a
responsive and efficient user experience.

DATA COLLECTION:

The scraped data, stored in a CSV file, was loaded into the analysis environment using the
Pandas library. The CSV file contained 8,518 reviews for the "YONEX MAVIS 350 Nylon
Shuttle" product from Flipkart, with attributes such as Reviewer Name, Rating, Review Title,
Review Text, Place of Review, Date of Review, Up Votes, and Down Votes. By leveraging the
Pandas library, the data was efficiently read into a DataFrame for further analysis and
processing.

LOADING DATASET:

The scraped dataset, containing 8,518 reviews for the "YONEX MAVIS 350 Nylon Shuttle"
product from Flipkart and stored in CSV format, was imported into Google Colab for analysis.
Leveraging the Pandas library within the Colab environment, the CSV file was efficiently loaded
into a DataFrame, allowing for seamless data manipulation and analysis.

Figure 1: Loading data
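
To make the loading step concrete, the snippet below is a minimal sketch of what is shown in Figure 1; the file name flipkart_reviews.csv is an assumed placeholder for the actual scraped CSV.

    import pandas as pd

    # Load the scraped reviews into a DataFrame (file name is illustrative).
    df = pd.read_csv("flipkart_reviews.csv")

    print(df.shape)   # expected shape: (8518, 8) for the 8,518 reviews and 8 attributes
    print(df.head())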

10
DATA PREPROCESSING:

In this step, the dataset underwent preprocessing to handle null values and ensure data quality.
Initially, null values were examined across all columns. The column "Reviewer Name" contained
only 10 null values, which were deemed insignificant for the analysis. Therefore, these null
values were dropped from the dataset. Subsequently, missing values in other columns were filled
using the forward fill (ffill) method. This approach ensured that missing values were replaced
with the preceding non-null values within each respective column, thereby maintaining the
continuity of the data.

Figure 2: Null values in the column

Dropping null values can lead to loss of valuable data, especially if the null values represent a
small portion of the dataset, so we are filling with the forward fill.

11
Figure 3: Removing the null values
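
A short sketch of the null-handling steps described above; column names follow the attributes listed earlier and may differ slightly in the actual file.

    # Inspect null counts per column.
    print(df.isnull().sum())

    # Drop the ~10 rows missing "Reviewer Name", then forward-fill remaining gaps.
    df = df.dropna(subset=["Reviewer Name"])
    df = df.ffill()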

A custom function was developed to generate a sentiment score based on the values of upvotes,
downvotes, and rating. This function computed the sentiment score using a predetermined
algorithm that considers the weighted influence of these factors on the overall sentiment of a
review.

Subsequently, a new column labeled "Sentiment" was created in the dataset. Reviews were
categorized into two sentiment classes: 0 for negative sentiment and 1 for positive sentiment.
This categorization was determined based on the calculated sentiment score, where reviews with
higher scores were assigned a positive sentiment label, while those with lower scores were
labeled as negative sentiment.

12
Figure 4: Creating sentiment column
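
The exact weighting used in the custom scoring function is not reproduced here; the sketch below is only an illustrative example of how a rating/vote-based score could be turned into a binary Sentiment column, with a hypothetical weight and threshold.

    # Hypothetical scoring: rating adjusted by the up/down-vote balance.
    def sentiment_score(rating, up_votes, down_votes):
        return rating + 0.1 * (up_votes - down_votes)

    df["score"] = df.apply(
        lambda r: sentiment_score(r["Rating"], r["Up Votes"], r["Down Votes"]), axis=1
    )

    # 1 = positive, 0 = negative; the threshold of 3 is an assumption.
    df["Sentiment"] = (df["score"] > 3).astype(int)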

The independent variables, denoted as X, were utilized to store the features or predictors of the
dataset. These features are typically used to predict or explain variations in the dependent
variable.

On the other hand, the dependent variable, labeled as y, was employed to store the sentiments
associated with each observation in the dataset. In this context, sentiments represent the target
variable that the model aims to predict or classify based on the values of the independent
variables.

By assigning the features to X and the sentiments to y, the dataset was appropriately structured
for modeling purposes, facilitating the development and evaluation of predictive models.

13
Figure 5: Separating the dependent and independent variables
A pie chart was constructed to visualize the distribution of sentiments within the dataset. The
chart revealed that the majority of reviews, accounting for 87.76% of the total, exhibited a
positive sentiment. In contrast, a smaller proportion, comprising 12.24% of the reviews,
conveyed a negative sentiment.

Figure 6: Pie chart of negative and positive sentiment

14
A word cloud visualization was generated to illustrate the frequency of terms within the dataset.
In this visualization, certain terms appeared more prominently, depicted in larger font sizes,
indicating their higher frequency of occurrence.

Notably, the terms "good," "read," and "shuttle" emerged as the most prominent within the word
cloud, depicted in larger font sizes compared to other terms. This prominence suggests that these
terms are among the most frequently mentioned within the dataset, capturing key themes or
topics associated with the reviews. The word cloud provides a visually engaging representation of
the prominent terms, offering insights into the prevalent topics or sentiments expressed within
the dataset.

Figure 7: Word cloud of the sentiment column

A preprocessing function was developed to enhance the quality of the text data by performing
essential text processing tasks. This function employed lemmatization to normalize words to
their base or dictionary form, thereby reducing variant forms of words to their common root.

Additionally, the function utilized stopword removal to eliminate common words that do not
contribute significant meaning to the text. These stopwords, such as "the," "is," and "and," were

15
filtered out to focus on relevant content. By incorporating lemmatization and stopword removal,
the preprocessing function effectively transformed raw text data into a cleaner and more
standardized format, conducive to subsequent analysis and modeling tasks.

Figure 8: Performing lemmatization and applying stopword removal
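
A minimal sketch of such a preprocessing function, assuming NLTK is used for the lemmatizer and stopword list (the report does not name the library):

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords")
    nltk.download("wordnet")

    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        # Lower-case and strip special characters/punctuation.
        text = re.sub(r"[^a-z\s]", " ", text.lower())
        # Remove stopwords and lemmatize the remaining tokens.
        tokens = [t for t in text.split() if t not in stop_words]
        return " ".join(lemmatizer.lemmatize(t) for t in tokens)

    df["clean_text"] = df["Review Text"].apply(preprocess)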

MODELLING APPROACH:

A pipeline was constructed to streamline the application of multiple machine learning
algorithms, including Naive Bayes, Logistic Regression, Decision Tree, Random Forest, and
XGBoost.

16
Figure 9: Pipeline of different machine learning algorithms
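
As a sketch of the idea, one branch of such a pipeline can be written as below, pairing a TF-IDF vectorizer with a classifier; the project builds analogous pipelines for the other algorithms as well.

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Text vectorization followed by a classifier in a single pipeline.
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])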

A parameter grid was defined to systematically explore various combinations of hyperparameters


for each machine learning algorithm included in the pipeline, such as Naive Bayes, Logistic
Regression, Decision Tree, Random Forest, and XGBoost.

17
Figure 10: Applying different parameters to the algorithms

18
GridSearchCV was employed for each machine learning algorithm within the pipeline to
systematically search through the parameter grid and identify the optimal combination of
hyperparameters that maximizes the performance metrics, such as accuracy or F1-score.

Figure 11: Using GridSearchCV to find the best model
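
A compact sketch of this step, assuming the pipeline above and a small illustrative parameter grid (the actual grids in Figure 10 are larger):

    from sklearn.model_selection import GridSearchCV, train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        df["clean_text"], df["Sentiment"], test_size=0.2, random_state=42
    )

    param_grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__C": [0.1, 1.0, 10.0],
    }

    # Search the grid using F1-score, the project's evaluation metric.
    grid = GridSearchCV(pipe, param_grid, scoring="f1", cv=5, n_jobs=-1)
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)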

After performing GridSearchCV for each algorithm, the models were trained and the best
parameters were obtained. These best parameters were then used to instantiate the respective
models, which were further evaluated, and their performance metrics, such as accuracy or
F1-score, were computed. Finally, the models, along with their best parameters, were presented
for further analysis and interpretation.

19
Figure 12: Models and their respective best parameters

EVALUATION OF THE MODEL:

To determine the best model based on output, model size, wall time, and memory usage, it's
essential to consider various factors:

 Output: Evaluate the performance metrics (e.g., accuracy, F1-score) of each model to
assess its predictive capability.
 Model Size: Consider the memory footprint of each model, as smaller models are
generally more efficient and easier to deploy in resource-constrained environments.
 Wall Time: Assess the time taken by each model to train and make predictions, as
shorter wall times indicate faster processing speed.
 Memory Usage: Examine the memory consumption of each model during training and
inference, as lower memory usage is preferable, especially in memory-limited scenarios.

Based on these considerations, the best model would be the one that achieves competitive
performance metrics while maintaining a reasonable model size, wall time, and memory usage.

20
It's essential to strike a balance between predictive accuracy and computational efficiency to
choose the most suitable model for the given task and resource constraints.

Figure 13: Test scores with model size

Creating a Flask Web Application:


For the project, a Flask web application was developed to provide users with an intuitive
interface for interacting with the machine learning models and accessing the analysis results. The
development process encompassed several essential steps to ensure the usability and
functionality of the web application.

Project Setup:

The Flask environment was set up by installing the Flask framework and organizing the project
structure. This involved creating directories for templates and static files, as well as defining the
main Flask application file to serve as the entry point for the web application.

21
User Interface Design:

HTML templates were designed to create the user interface of the web application. These
templates were carefully crafted to incorporate user-friendly elements, such as forms for
inputting data and sections for displaying analysis results. The design aimed to provide users
with an intuitive and seamless experience while interacting with the application.

Route Definition:

Routes were defined within the Flask application to handle different URL paths and HTTP
methods. These routes included endpoints for rendering HTML templates and processing user
input. Each route was associated with a specific view function responsible for executing the
corresponding logic and generating the appropriate response.

View Function Implementation:

View functions were implemented in Python to define the logic behind each route. These
functions processed user input, executed the necessary computations using the machine learning
models, and generated the output to be displayed in the HTML templates. Jinja templating was
utilized to dynamically populate the HTML templates with data generated by the view functions.

22
Using XGBoost as the best-performing model, we saved the trained model to a pickle (pkl) file
for easy access and utilization within the Flask web application. This streamlined the deployment
process and allowed for seamless integration of the model into the application.

Within the Flask application, the saved XGBoost model can be loaded from the pkl file and used
to make predictions on new data submitted by users through the web interface. This ensures that
the predictive capabilities of the model are readily available to users, enabling them to obtain
accurate predictions in real-time.

Additionally, by leveraging XGBoost's efficient performance and low memory consumption, the
Flask web application maintains optimal responsiveness and scalability, even when handling
large volumes of user requests. Overall, incorporating the XGBoost model into the Flask web

23
application enhances its functionality and empowers users to make informed decisions based on
the model's predictions.
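
A minimal sketch of such a Flask app is shown below; the file names model.pkl and index.html, and the assumption that the pickle stores the full preprocessing-plus-XGBoost pipeline, are illustrative.

    import pickle
    from flask import Flask, render_template, request

    app = Flask(__name__)
    model = pickle.load(open("model.pkl", "rb"))   # trained sentiment pipeline

    @app.route("/", methods=["GET", "POST"])
    def predict():
        sentiment = None
        if request.method == "POST":
            review = request.form["review"]
            label = model.predict([review])[0]     # pipeline handles vectorization
            sentiment = "Positive" if label == 1 else "Negative"
        return render_template("index.html", sentiment=sentiment)

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)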

Figure 14: Positive Sentiment

Figure 15: Negative Sentiment

24
To access the web application, run python app.py from the command line, then open a web
browser and enter the provided server URL (http://localhost) followed by the port number
(5000), i.e. http://localhost:5000. Once the application is loaded, you can input text data into the
provided form and submit it to obtain sentiment analysis results.

DEPLOYMENT ON AMAZON WEB SERVICES PLATFORM


Instance Creation and Private Key Generation:

To facilitate secure access to cloud instances, a process was implemented to create instances
programmatically and generate private keys for SSH access. Upon instance creation, a private
key is automatically generated and made available for download by the user.

Private Key Generation:

Upon instance creation, a private key is generated for SSH access. This key serves as a secure
authentication mechanism for accessing the instance remotely.

Downloadable Private Key:

Users are provided with a download link to securely obtain the private key. This ensures that
users have access to the key for SSH authentication.

25
Storage Instructions:

Users are instructed to store the private key securely in the folder where their SSH client is
located. This ensures that the key is readily accessible for SSH authentication purposes.

Secure Access:

By generating private keys for SSH access and providing secure download links, the process
ensures secure access to cloud instances while maintaining user convenience.

Security Group Creation and Attachment:

To ensure secure communication and access control within the network environment, a security
group was created and attached to the network interface. This process involved defining the
necessary rules and configurations to restrict incoming and outgoing traffic based on specific
criteria.

Security Group Creation:

A security group was created to define the inbound and outbound traffic rules for the network
interface. This included specifying allowed protocols, ports, and IP address ranges to regulate
communication flow.

Rule Definition:

Within the security group, rules were defined to control traffic based on various criteria such as
source IP addresses, destination ports, and protocol types. These rules were configured to enforce
security policies and restrict unauthorized access to the network interface.

26
The security group has been created, as shown at the top (highlighted in the green box), and its
details are given below.

Attachment to Network Interface:

After defining the security group rules, it was attached to the network interface associated with
the cloud instance. This ensured that the security policies defined within the security group were
applied to the network traffic flowing through the interface.

Secure Communication:

By attaching the security group to the network interface, secure communication channels were
established, allowing authorized traffic to pass through while blocking unauthorized access
attempts. This enhanced the overall security posture of the network environment and mitigated
potential security risks.

27
Once the security group has been attached to the network interface, it is shown here; the group
created is named "Abdul".

File Transfer to Virtual Server:

28
To facilitate the deployment and setup of the project on the virtual server, files were securely
copied to the Ubuntu server using the scp (secure copy) command. This command enables secure
transfer of files between the local machine and the remote server over SSH.

Using scp for File Transfer:

The scp command was utilized to recursively copy directories and files from the local machine to
the Ubuntu server. The -r flag ensures that entire directories are copied, and the -i option
specifies the private key file used for authentication.

Connecting to the Ubuntu Server:

After securely copying the required files to the Ubuntu server, the next step involves connecting
to the server via SSH (Secure Shell). This allows for direct management and configuration of the
server.

Using ssh for Server Connection:

The ssh command is used to establish a secure connection to the remote server. The command
requires the private key for authentication, along with the username and server address.

29
Installing Python 3 on the Ubuntu Server:

To run Python applications and scripts on the Ubuntu server, Python 3 needs to be installed. The
following steps outline the installation process for Python 3 on an Ubuntu server.

`sudo apt install python3-pip`, where `sudo` is short for ‘superuser do’, `apt`
stands for ‘Advanced Package Tool’, and `pip` is commonly expanded as the recursive acronym
‘Pip Installs Packages’ (also read as ‘preferred installer program’).

After connecting to the Ubuntu server, it is essential to ensure that all software packages are
up-to-date by running `sudo apt update` followed by `sudo apt upgrade`. Additionally, verifying
that the files copied earlier are correctly placed on the server is crucial for the subsequent setup
and deployment process.

30
Verifying Copied Files:

To verify that the files copied to the server are correctly placed, the ls command was used. This
command lists the contents of a directory, allowing for a quick verification of the presence and
structure of the copied files.

Installing Requirements and Running the Flask App on the Server:

After copying the project files to the Ubuntu server and ensuring that Python 3 is installed, the
next step is to install the required dependencies and run the Flask application. This process
involves installing packages listed in a requirements.txt file and starting the Flask app to make it
accessible via the server's public IP address.

Installing Required Packages:

The project dependencies are specified in a requirements.txt file. To install these packages, the
following command is used:

31

Accessing the Flask Web App on Other Devices:

After successfully running the Flask web application on the server, it is essential to ensure that
the application is accessible from other devices such as laptops and mobiles using the server's
public IPv4 address.

Steps to Access the Flask Web App:

Identify Public IPv4 Address:

32
The public IPv4 address of the server can be found in the instance details provided by the cloud
service provider. This address will be used to access the web app from other devices.

Running the Flask App:

Ensure the Flask application is running and configured to bind to all available IP addresses
(0.0.0.0) and a specific port (e.g., 5000):

33
PROJECT - 2

Regex Matching Web App Development

Objective :
Create a web app using Flask that can match patterns in text using regular expressions (regex).

Overview of Regex :
Definition:

 Regular Expressions (Regex): A sequence of characters that forms a search pattern,


typically used for string-searching algorithms for "find" or "find and replace" operations
on strings.

Core Components:

 Literals: Basic characters that match themselves (e.g., 'a' matches the character 'a').
 Meta-characters: Characters with special meanings (e.g., '.' matches any character, '*'
matches zero or more occurrences of the preceding element).
 Character Classes: Denoted by square brackets, they match any one of the characters
within the brackets (e.g., [abc] matches 'a', 'b', or 'c').
 Quantifiers: Indicate numbers of characters or expressions to match (e.g., {2,4} means
between 2 and 4 occurrences).

Uses:

 String Validation: Ensuring that strings meet certain criteria (e.g., email addresses,
phone numbers).
 String Searching: Finding specific patterns within larger bodies of text.
 String Manipulation: Extracting, replacing, or modifying substrings within a text.
 Data Extraction: Pulling out useful information from text files, logs, or datasets.

34
Flask Application for Regex Substitution
In this project, a Flask web application was developed to demonstrate the use of regular
expressions (regex) for text processing. The application takes a paragraph about Aligarh Muslim
University (AMU) and replaces all instances of "AMU" with "amu". This process illustrates the
practical application of regex in a web-based environment.

Application Overview

The web application allows users to input a text paragraph and perform regex operations on it.
Specifically, it replaces occurrences of the string "AMU" with "amu" within the provided text.
This functionality is useful for standardizing text data, which is a common requirement in data
preprocessing tasks.

35
Input Paragraph

The application includes a predefined paragraph about Aligarh Muslim University (AMU). This
paragraph serves as the sample text for the regex substitution operation. The paragraph details
the history and academic offerings of AMU, providing a realistic and meaningful context for the
text processing task.

Regex Operation

The core functionality of the application revolves around the use of Python’s re module for regex
operations. The specific regex task involves finding all instances of the substring "AMU" in the
text and replacing them with "amu". This substitution is case-sensitive, ensuring that only the
exact match "AMU" is replaced.
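
A minimal sketch of this substitution using Python's re module (the sample paragraph is abbreviated):

    import re

    paragraph = "AMU was founded in 1920. Today, AMU offers a wide range of programmes."

    # re.sub is case-sensitive by default, so only the exact match "AMU" is replaced.
    result = re.sub(r"AMU", "amu", paragraph)
    print(result)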

Running the Flask Application

To run the application on the server, Python and Flask are used. The app is set to bind to all
available IP addresses and listens on a specified port (e.g., 5000). This setup makes the
application accessible from any device connected to the internet via the server's public IP
address.

36
Email Validation Feature

The email validation feature uses regex to check whether the input email addresses conform to
standard email format. This involves ensuring that the email contains the necessary components,
such as the local part, the "@" symbol, and the domain part.
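
A sketch of such a check is given below; the pattern is a commonly used simplified one, not necessarily the exact expression used in the project.

    import re

    EMAIL_PATTERN = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

    def is_valid_email(email):
        # True only if the whole string matches local-part@domain.tld
        return EMAIL_PATTERN.fullmatch(email) is not None

    print(is_valid_email("student@amu.ac.in"))   # True
    print(is_valid_email("not-an-email"))        # False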

Deployment on AWS
To make the application accessible over the internet, it was deployed on the AWS (Amazon Web
Services) platform. The deployment involved several steps:

Setting Up the EC2 Instance:

 An EC2 (Elastic Compute Cloud) instance was created to serve as the server for the Flask
application.
 A suitable AMI (Amazon Machine Image) was selected, and the instance type was
chosen based on the application's requirements.
 A key pair was generated and downloaded for SSH access to the instance.

Configuring Security Groups:

 A security group was created and attached to the EC2 instance to allow inbound traffic on
the required port (e.g., 5000 for Flask).
 The security group was configured to allow HTTP traffic from all IP addresses
(0.0.0.0/0).

Connecting to the EC2 Instance:

 The key pair file (.pem) was used to connect to the EC2 instance via SSH.
 Once connected, the necessary software packages were installed, including Python 3 and
pip.

37
Deploying the Flask Application:

 The project files were copied to the EC2 instance using SCP (Secure Copy Protocol).
 Python packages listed in the requirements.txt file were installed using pip.
 The Flask application was started, binding it to all available IP addresses (0.0.0.0) and a
specific port (5000).

Accessing the Application:

 The Flask application was made accessible via the server's public IPv4 address.
 Users could access the application through a web browser by navigating to the public
IPv4 address followed by the port number (e.g., http://203.0.113.0:5000).

Conclusion
Regular expressions, commonly known as regex, provide a powerful tool for pattern matching
and manipulation within textual data. This project delves into the practical applications of regex
within a Flask web application, showcasing its versatility in text processing and email validation.
Furthermore, the deployment of the application on the Amazon Web Services (AWS) platform
highlights the scalability and accessibility of cloud-based solutions.

Text processing is a fundamental task in many applications, ranging from data cleaning to
information extraction. Regular expressions offer a flexible and efficient way to search,
manipulate, and extract patterns within textual data. In this project, we demonstrate the use of
regex to replace occurrences of "AMU" with "amu" within a text corpus. This simple yet
illustrative example showcases how regex can be utilized to perform bulk text substitutions,
improving data consistency and readability.

Deploying web applications on cloud platforms like AWS offers numerous advantages, including
scalability, reliability, and accessibility. In this project, we leverage AWS to host and deploy the
Flask application, ensuring that it is accessible over the internet. By utilizing AWS services such
as Elastic Compute Cloud (EC2) and Simple Storage Service (S3), we create a robust and
scalable infrastructure for hosting the application. This cloud-based deployment model
eliminates the need for on-premises hardware and allows for seamless scaling to accommodate
varying levels of user traffic.

38
Furthermore, deploying the application on AWS enhances its accessibility, enabling users to
access it from anywhere with an internet connection. Whether it's for educational purposes,
professional development, or practical use cases, users can conveniently access the Flask
application and explore its regex functionalities. This accessibility fosters wider adoption and
usage of the application, extending its reach to a broader audience of learners and practitioners.

In conclusion, this project offers a comprehensive exploration of regex in text processing and
email validation within a Flask web application. By showcasing practical examples and
deploying the application on the AWS platform, we highlight the versatility and power of regex
in handling textual data and validating user input. Moreover, the interactive user experience and
cloud-based deployment model contribute to a more engaging and accessible learning
environment. As regex continues to play a crucial role in data processing and validation, projects
like this serve as valuable resources for learners and professionals alike, empowering them to
harness the full potential of regex in their applications.

39
PROJECT - 3

Movies Subtitle Search Engine

In the fast-evolving landscape of digital content, effective search engines play a pivotal role in
connecting users with relevant information. For Google, providing a seamless and accurate
search experience is paramount. This project focuses on improving the search relevance for
video subtitles, enhancing the accessibility of video content.

It is difficult to find relevant video content through traditional keyword-based search
methods when searching for subtitles. Currently, most search engines rely on keywords within
video titles, descriptions, or closed captions. However, this approach is not ideal for finding
specific content within a video based on dialogue or spoken information.

Objective:
Develop an advanced search engine algorithm that efficiently retrieves subtitles based on user
queries, with a specific emphasis on subtitle content. The primary goal is to leverage natural
language processing and machine learning techniques to enhance the relevance and accuracy of
search results.

About the data


The database provided contained a sample of 82,498 subtitle files from opensubtitles.org. Most
of the subtitles are of movies and TV series released after 1990 and before 2024.
Database file name: eng_subtitles_database.db. The database contains a table called 'zipfiles' with
three columns:

 num: Unique Subtitle ID reference for www.opensubtitles.org


 name: Subtitle File Name
 content: Subtitle files were compressed and stored as a binary using 'latin-1' encoding.

Project steps :
Below is an outline of the steps taken to meet the project's objective:
1. Reading the Data from the Database – decompressing and decoding
2. Data cleaning
3. Data chunking
4. Generating text embedding

40
5. Storing data in a vector database (vector stores)
6. Frontend application to access the Search Engine

Tech stack
In the course of this project the following tools and languages were utilized:

● Python

● ChromaDB

● Natural Language Processing

● Streamlit

● Google Colab

Step 1: Reading the Data from the Database


Part 1: Connect to database and retrieve the data

The data, as already mentioned above, was provided in a database file (.db), compressed and
encoded in latin-1. To begin, we connected to the SQLite database and extracted the data from
the ‘zipfiles’ table. The entire table was retrieved and stored in a pandas DataFrame called df,
after which the connection to the database was closed. The code snippet is as seen below.
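
As an illustration, a minimal sketch of such a snippet, using the database and table names given above:

    import sqlite3
    import pandas as pd

    # Connect to the SQLite database and pull the whole 'zipfiles' table.
    conn = sqlite3.connect("eng_subtitles_database.db")
    df = pd.read_sql_query("SELECT * FROM zipfiles", conn)
    conn.close()

    print(df.shape)   # expected: (82498, 3) -- num, name, content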

41
The snapshot below gives us an overview of our data after retrieval. The ‘content’ column is still
compressed and encoded.

Part 2: Creating a function to decompress and decode the text data

Here, we imported the zipfile and io libraries to allow us to work with ZIP files in Python. A
function decomp_decode is created that takes compressed data as input and returns the
decompressed and decoded text. Here is a breakdown of what the code does:

42
1. Decompressing:

 First, it uses the zipfile library to handle the compressed data.


 It creates a virtual in-memory file object using io.BytesIO(data), essentially treating the
compressed data as a file.
 Then, it opens this virtual file as a ZIP archive using zipfile.ZipFile.
 It assumes the archive contains only one file and extracts it using zip_file.read(first_file).

2. Decoding:

Finally, it decodes the extracted (but still encoded) data using .decode('latin-1'). This translates
the bytes from the compressed file into human-readable text, since the encoding used is 'latin-1’.

The decomp_decode function is then applied to decompress and decode the data in the content
column, replacing the original compressed values.
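
A sketch of such a decomp_decode function, following the breakdown above:

    import io
    import zipfile

    def decomp_decode(data):
        # Treat the raw bytes as an in-memory ZIP archive.
        with zipfile.ZipFile(io.BytesIO(data)) as zip_file:
            first_file = zip_file.namelist()[0]          # archive holds one subtitle file
            return zip_file.read(first_file).decode("latin-1")

    # Replace the compressed binary content with decompressed, decoded text.
    df["content"] = df["content"].apply(decomp_decode)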

43
Step 2: Data cleaning
The project is focused on building a semantic search engine and as such, the BERT model will
be used to generate text embeddings for the subtitle files. There are two types of cleaning that
can be performed – light cleaning and heavy cleaning. Light cleaning involves surface level text
formatting and removal of irrelevant features in our data while heavy cleaning involves
lemmatization, stop word removal, removal of special characters and punctuation. In order to
decide which cleaning type to apply, we considered the effects each type will have on the data.
With heavy cleaning, we would strip the subtitle data down to basic words devoid of context and
sentence demarcations, which would in turn affect the efficiency of our semantic search engine in
returning relevant results. On the other hand, light cleaning helps us remove just those features
that are not useful for the search results. To this effect, we performed light cleaning on our
data to remove features like timestamps, line breaks, links, and italics tags, and converted the
text to lower case.

44
This code defines a function called clean_data that takes a text entry (a subtitle file) and returns a
cleaner version of the text. Here's a breakdown of what each step does:

1. Cleaning Formatting:
 It uses regular expressions (re) to perform various cleaning tasks:
 Removes timestamps using a pattern that matches the format HH:MM:SS,SSS -->
HH:MM:SS,SSS and replaces them with spaces.
 Removes dialogue line numbers by searching for a number followed by newline
characters (\n or \r).
 Removes unnecessary newline characters (\n or \r).
 Deletes italic formatting tags (<i> and </i>).
 Additionally, it replaces website links related to OpenSubtitles.org with spaces,
potentially removing irrelevant information.

2. Normalization:
 Finally, it converts the entire text to lowercase using data.lower(). This standardizes
the text and makes it easier for further processing.
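
Putting these steps together, a sketch of such a clean_data function might look as follows; the exact patterns used in the project may differ.

    import re

    def clean_data(data):
        # Remove timestamps of the form HH:MM:SS,SSS --> HH:MM:SS,SSS.
        data = re.sub(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}", " ", data)
        # Remove dialogue line numbers and stray newline characters.
        data = re.sub(r"\d+\s*[\r\n]+", " ", data)
        data = re.sub(r"[\r\n]+", " ", data)
        # Delete italic tags and OpenSubtitles-related links.
        data = re.sub(r"</?i>", " ", data)
        data = re.sub(r"\S*opensubtitles\S*", " ", data, flags=re.IGNORECASE)
        # Normalize to lower case.
        return data.lower()

    df["content"] = df["content"].apply(clean_data)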

45
Step 3: Data chunking
Part 1: Creating the chunking function
We observed that the subtitle files are long and embedding such files in one single vector will
lead to information loss. It was therefore not practical to embed an entire document as a single
vector. Before generating embeddings, we carried out a unique semantic chunking process.

The code snippet above defines a function called ‘chunk’ that takes a document (text) as input
and outputs a list of semantically similar sentence groups, called chunks. Below is a breakdown
of our code:

Importing Libraries:
● We imported the SentenceTransformer class from the sentence_transformers library in order to
create sentence embeddings (numerical representations of sentences) that capture their meaning.

● And the numpy library, to carry out tasks like calculating cosine similarity.

Model Selection and Embedding:


● We used the 'all-MiniLM-L6-v2' model for faster computation.

● We created a function called ‘encode_text’ that takes the chunks as input and creates the
embeddings for each chunk.
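
A sketch of the chunk and encode_text functions described above, using the 'all-MiniLM-L6-v2' model; the sentence split on '.' and the 0.3 similarity threshold are illustrative choices, not the exact ones used in the project.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def chunk(document, threshold=0.3):
        sentences = [s.strip() for s in document.split(".") if s.strip()]
        if not sentences:
            return []
        embeddings = model.encode(sentences)
        chunks, current = [], [sentences[0]]
        for i in range(1, len(sentences)):
            a, b = embeddings[i - 1], embeddings[i]
            cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
            if cos_sim >= threshold:        # semantically close: keep in the same chunk
                current.append(sentences[i])
            else:                           # topic shift: start a new chunk
                chunks.append(". ".join(current))
                current = [sentences[i]]
        chunks.append(". ".join(current))
        return chunks

    def encode_text(chunks):
        # One embedding vector per chunk.
        return model.encode(chunks)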

46
Part 2: Running the function in batches
The function has been created and is ready to be used. However, some challenges were
encountered while trying to perform the chunking process on the entire DataFrame in one go. We
had limited compute resources on Google Colab and a large file to work with. To resolve this
challenge, we created several small temporary DataFrames, applied the function to each of them,
and then combined the results.

We then dropped the unnecessary columns of the DataFrame that are not important to us, as we
only need to create the encodings and then store them in the database.

47
Concatenating the two DataFrames into one and then converting the result into a CSV file.

48
Step 5: Storing data in a vector database
Unlike traditional databases, vector databases or stores are specialized types of databases
designed to store and manage information represented as vectors. These vectors are in turn used
to perform similarity searches. In this project we used ChromaDB. Other alternatives include
Pinecone, and Apache OpenSearch.

● Imported the ‘chromadb’ library,

● Created a persistent client to ensure long-lasting connection with the database during the
program’s execution. This is created to prevent the RAM on Google Colab from crashing. It
helps save the collection data in the local machine rather than the data being stored on RAM and
being volatile.

● Created a collection called ‘my_collection’ where we stored the text embeddings (of the
subtitle files) and unique identifiers or indexes. The metadata argument (optional) provides
additional information about our collection. ChromaDB uses an approximate nearest neighbor
(ANN) search algorithm called Hierarchical Navigable Small World (HNSW) to find the most
similar items to a given query. In this case we indicate that this collection uses the HNSW
algorithm with cosine similarity for nearest neighbor searches.

Note that cosine similarity is used here in order to check for the similarity between vectors.
Cosine similarity is particularly useful for tasks like ours involving semantic understanding and
information retrieval in high-dimensional spaces.

49
Now that our collection is fully set up, we extract the name column from the DataFrame, along
with its unique identifier num, and add these documents into the my_collection collection
created earlier.
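
A sketch of this setup with the chromadb client; the persistence directory "chroma_store" and the variable chunk_embeddings (the embeddings computed in the previous step) are assumptions.

    import chromadb

    # Persistent client so the collection is saved to disk rather than held in RAM.
    client = chromadb.PersistentClient(path="chroma_store")

    collection = client.get_or_create_collection(
        name="my_collection",
        metadata={"hnsw:space": "cosine"},   # HNSW index with cosine similarity
    )

    # Store subtitle names and embeddings keyed by their unique 'num' identifiers.
    collection.add(
        ids=df["num"].astype(str).tolist(),
        documents=df["name"].tolist(),
        embeddings=chunk_embeddings,
    )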

Step 6: Web application creation

The Streamlit application for movie subtitle recommendation is a project developed to provide
users with personalized movie recommendations based on their input queries. Leveraging natural
language processing techniques and pre-trained embeddings, the application computes the
similarity between user queries and movie subtitles stored in a database, then recommends the
top 10 most relevant subtitles.

In the Streamlit application for movie subtitle recommendation, the user query preprocessing and
embedding creation play a crucial role in determining the relevance of recommended movie
subtitles. Upon receiving a user query, several steps are undertaken to transform the input text
into an embedding representation and match it with the embeddings stored in the database.

Firstly, the user query undergoes preprocessing to standardize and enhance its representation for
embedding calculation. This preprocessing may involve tokenization, stop word removal, and
punctuation removal to clean the text and prepare it for further analysis. Additionally, techniques
such as lemmatization or stemming may be applied to normalize the text and reduce feature
dimensionality.

Following preprocessing, the pre-trained embedding model or custom embedding technique is


used to convert the processed text into a dense vector representation. This embedding captures
the semantic meaning and context of the user query, allowing for more meaningful comparisons
with the embeddings of movie subtitles in the database. Techniques such as word embeddings
(e.g., Word2Vec, GloVe) or contextual embeddings (e.g., BERT, ELMO) are commonly used to
generate these dense representations.

Once the embedding representation of the user query is calculated, it is compared with the
embeddings of movie subtitles stored in the database using similarity metrics such as cosine
similarity. This comparison measures the degree of similarity between the user query and each
movie subtitle, enabling the identification of the top 10 most relevant subtitles.

50
The top 10 recommended movie subtitles, along with their corresponding IDs, are then retrieved
from the database and displayed to the user. These recommendations are based on the similarity
scores obtained from the embedding comparison, ensuring that the recommended subtitles
closely match the user's input query in terms of semantic relevance and context.
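
A minimal sketch of the Streamlit search flow described above, reusing the sentence-transformer model and ChromaDB collection from the earlier steps; the widget labels are illustrative.

    import streamlit as st

    st.title("Movie Subtitle Search Engine")
    query = st.text_input("Describe the dialogue or scene you remember:")

    if query:
        # Embed the query with the same model used for the subtitle chunks.
        query_embedding = model.encode([query])
        results = collection.query(
            query_embeddings=query_embedding.tolist(),
            n_results=10,
        )
        st.subheader("Top 10 matching subtitles")
        for sub_id, name in zip(results["ids"][0], results["documents"][0]):
            st.write(f"{sub_id} - {name}")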

Overall, the user query preprocessing, embedding creation, and matching with the database are
essential components of the recommendation process in the Streamlit application. By effectively
transforming user queries into embedding representations and comparing them with stored
embeddings, the application provides users with accurate and contextually relevant movie
subtitle recommendations, enhancing their overall viewing experience.

51
PROJECT - 4

GenAI App – AI Code Reviewer

The Python Code Reviewer Web Application is a project developed to provide automated code
review and error detection capabilities to developers. The application leverages OpenAI's natural
language processing capabilities to analyze Python code snippets and identify potential errors.
Additionally, it offers suggestions for correcting identified issues, enhancing code quality and
reducing development time.

Objective:
The primary objective of the Python Code Reviewer Web Application is to streamline the code
review process and assist developers in identifying and correcting errors in their Python code. By
automating this task, the application aims to improve code quality, reduce debugging efforts, and
enhance overall development productivity.

Features:
 Error Detection: The application analyzes Python code snippets to identify common
errors, such as syntax errors, indentation issues, missing imports, etc.
 Code Suggestions: Upon detecting errors, the application provides suggestions for
correcting the identified issues. This helps developers quickly address errors and
improve the quality of their code.
 User-Friendly Interface: The web application is designed with a user-friendly
interface using Streamlit, making it easy for developers to upload code snippets and
review the analysis results.
 Implementation: The Python Code Reviewer Web Application is implemented using
the following technologies:
OpenAI: OpenAI's natural language processing capabilities are utilized to analyze
Python code and provide error detection and correction suggestions.

52
Streamlit: The web application framework Streamlit is used to build the user
interface. Streamlit simplifies the process of creating interactive web applications
with Python, allowing developers to focus on functionality rather than frontend
design.
Python: The backend logic of the application is written in Python, including code
parsing, error detection, and suggestion generation functionalities.

Workflow:
 Input Code: Developers upload Python code snippets to the web application using the
provided interface.
 Error Detection: The application analyzes the uploaded code using OpenAI's natural
language processing models to detect errors and potential issues.
 Suggestions Generation: Upon detecting errors, the application generates suggestions
for correcting the identified issues based on best practices and common coding
conventions.
 Output Display: The analysis results, including detected errors and correction
suggestions, are displayed to the user via the web interface.
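
A minimal sketch of this workflow using Streamlit and the OpenAI chat API; the model name and prompt wording are assumptions, not the exact ones used in the project.

    import streamlit as st
    from openai import OpenAI

    client = OpenAI()   # expects OPENAI_API_KEY in the environment

    st.title("AI Code Reviewer")
    code = st.text_area("Paste your Python code here:")

    if st.button("Review") and code.strip():
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",   # assumed model choice
            messages=[
                {"role": "system",
                 "content": "You are a Python code reviewer. Identify bugs and suggest fixes."},
                {"role": "user", "content": code},
            ],
        )
        st.subheader("Review")
        st.write(response.choices[0].message.content)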

Disclaimer:
While the Python Code Reviewer Web Application aims to assist developers in improving code
quality, it may not catch all possible errors or provide perfect suggestions. Developers are
encouraged to use their judgment and review the suggestions provided by the application
critically.

53
Conclusion:
The Python Code Reviewer Web Application provides a valuable tool for developers to improve
code quality, reduce debugging efforts, and enhance overall development productivity. By
leveraging OpenAI's natural language processing capabilities and the user-friendly interface of
Streamlit, the application simplifies the code review process and empowers developers to write
cleaner, more efficient Python code.

Possible future enhancements include the following:

Support for Other Programming Languages: Extend the application to support code review
for other programming languages, expanding its utility to a wider range of developers.

Integration with Version Control Systems: Integrate the application with popular version
control systems like Git to provide seamless code review functionality within development
workflows.

Advanced Error Detection: Implement more advanced error detection algorithms and strategies
to identify complex issues and improve the accuracy of error detection.

54
Project 5
RAG System on “Leave No Context Behind” Paper

Generative AI applications can produce fluent text, but they have no knowledge of information
they were not trained on, such as content introduced after their training. To overcome this, we
can ground the generative model in documents we provide and retrieve information from those
documents at generation time.

LangChain
Specialty: LangChain specializes in providing a comprehensive suite of language processing
functionalities through its API, catering to diverse text analysis and manipulation needs.

Best Use Case: LangChain is best suited for developers looking for a ready-to-use platform to
incorporate language processing capabilities into their applications quickly and efficiently.

LLMs (Large Language Models):

 Specialty: LLMs, such as GPT and BERT, excel in understanding and generating
human-like text through the use of deep learning architectures trained on vast amounts of
text data.
 Best Use Case: LLMs are best suited for tasks requiring advanced natural language
understanding and generation capabilities, such as chatbots, language translation,
sentiment analysis, and content generation.

Transformers:

 Specialty: Transformers are a specific type of deep learning architecture, known for their
ability to efficiently process sequences of variable lengths and capture long-range
dependencies through self-attention mechanisms.
 Best Use Case: Transformers are best suited for tasks where capturing contextual
information and understanding relationships between tokens within a sequence is crucial,
such as machine translation, text summarization, and sentiment analysis.

55
Objective:
Using the LangChain framework, build a RAG system that can utilize the power of an LLM like
Gemini 1.5 Pro to answer questions on the “Leave No Context Behind” paper published by
Google on 10th April 2024. In this process, external data (i.e. the Leave No Context Behind
paper) should be retrieved and then passed to the LLM during the generation step.
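
A minimal sketch of such a pipeline with the LangChain Google GenAI integration; the package imports, model identifiers, and the PDF file name are assumptions and may differ from the actual implementation.

    from langchain_community.document_loaders import PyPDFLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
    from langchain_community.vectorstores import Chroma

    # 1. Load and chunk the external data (the paper).
    docs = PyPDFLoader("leave_no_context_behind.pdf").load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(docs)

    # 2. Embed the chunks and store them in a vector store.
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    retriever = Chroma.from_documents(chunks, embeddings).as_retriever(search_kwargs={"k": 4})

    # 3. Retrieve relevant chunks and pass them to the LLM at generation time.
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro")
    question = "What limitation of standard LLM context handling does the paper address?"
    context = "\n\n".join(d.page_content for d in retriever.invoke(question))
    answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
    print(answer.content)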

Overview:
Retrieval-Augmented Generation (RAG) applications combine the strengths of information
retrieval and natural language generation techniques to produce high-quality and contextually
relevant outputs. These applications leverage large-scale pre-trained language models and
curated knowledge sources to generate text that is not only fluent but also grounded in factual
information.

Use and Need:


Enhanced Information Retrieval:

RAG applications improve information retrieval by generating contextually relevant responses to
user queries. By integrating pre-existing knowledge sources, such as databases or documents,
with natural language generation models, these applications can provide more informative and
accurate responses compared to traditional keyword-based search engines.

Content Creation and Summarization:

RAG applications are used for content creation and summarization tasks, where generating
coherent and informative text is essential. These applications can generate summaries of articles,
reports, or research papers, as well as create original content based on specific topics or themes.

Question Answering Systems:

RAG models are utilized in question answering systems to provide detailed and contextually
relevant answers to user queries. By retrieving relevant information from knowledge sources and
synthesizing it into natural language responses, these systems can address a wide range of user
questions with high accuracy.

Conversational AI:

RAG models are integrated into conversational AI systems to enhance the naturalness and
informativeness of dialogues. By retrieving relevant information from knowledge sources during
conversations, these systems can provide more engaging and contextually appropriate responses
to user inputs, leading to a more seamless user experience.

Domain-Specific Applications:

RAG applications find use in various domain-specific tasks, such as medical diagnosis, legal
analysis, and customer support. By leveraging domain-specific knowledge bases and integrating
them with natural language generation models, these applications can provide tailored and
accurate responses to specific user queries or requirements.

Summary of "Leave No Context Behind: Most Contextualized Retrieval-Augmented Generation Models"
The paper "Leave No Context Behind: Most Contextualized Retrieval-Augmented Generation
Models" delves into the advancements in Retrieval-Augmented Generation (RAG) models,
which aim to integrate information retrieval with generative models to produce more accurate
and contextually relevant outputs. Traditional generative models often struggle with generating
accurate information due to their reliance on pre-trained knowledge that lacks real-time context.
The authors highlight this limitation and propose an enhanced approach where contextual
information is crucial for generating precise responses.

The proposed RAG model architecture involves two main components: a retriever and a
generator. The retriever searches for relevant documents based on the input query, while the
generator uses both the input query and the retrieved documents to generate a response. This dual
approach ensures that the generated content is grounded in up-to-date and relevant information,
addressing the issue of hallucinated or contextually inappropriate responses common in
traditional models.

The paper's key contribution is demonstrating how RAG models, by leveraging external sources,
significantly improve the accuracy and relevance of generated responses. Through extensive
experiments, the authors show that RAG models outperform traditional generative models across
various benchmarks, including BLEU, ROUGE, and human evaluation scores. These results
highlight the superiority of RAG models in producing higher-quality, contextually enriched
outputs.

Furthermore, the paper discusses potential applications of RAG models in areas such as customer
support, information retrieval, and conversational AI. These applications benefit greatly from the
ability to understand and utilize context to generate meaningful and useful responses. The
research presented in "Leave No Context Behind" underscores the critical role of context in AI-
driven text generation and marks a significant step forward in developing intelligent systems
capable of producing contextually appropriate and highly relevant text. This work paves the way
for more advanced and reliable AI applications, demonstrating the immense potential of
integrating retrieval mechanisms with generative models.

Features:
 Retrieval of Relevant Context: The applications retrieve relevant context from a
knowledge base or document corpus using a retriever component.
 Generation of Text Responses: The retrieved context is used to augment the generation
of text responses by the language model, ensuring that the generated text is grounded in
factual information and contextually relevant.
 Interactive Web Interface: The applications are deployed as interactive web apps using
Streamlit, allowing users to input queries and receive text responses in real-time.

Implementation:
The RAG applications are implemented using the following components:

Retriever Component:

The retriever component retrieves relevant context from a knowledge base or document corpus
based on the user query.

Language Model:

OpenAI's language model or Google's Gemini model is used as the core language generation
component, responsible for generating text responses based on the retrieved context.

Streamlit Web App:

Streamlit is used to create the user interface for the web applications, providing a simple and
intuitive platform for users to interact with the models.
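
A minimal sketch of such an interface is shown below. It assumes streamlit is installed and that a
rag_chain object like the one sketched earlier is importable from a local module; the module name
rag_pipeline and the page texts are hypothetical.

# app.py -- minimal Streamlit front end for the RAG chain (run with: streamlit run app.py).
import streamlit as st
from rag_pipeline import rag_chain  # hypothetical local module exposing the chain

st.title("RAG Q&A on 'Leave No Context Behind'")

query = st.text_input("Ask a question about the paper:")
if st.button("Get answer") and query:
    with st.spinner("Retrieving context and generating an answer..."):
        answer = rag_chain.invoke(query)
    st.write(answer)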

Workflow:
 User Input: Users input their queries or prompts into the web application, specifying the
information they are seeking or the task they want to perform.
 Context Retrieval: The retriever component retrieves relevant context from a knowledge
base or document corpus based on the user query, providing additional information to aid
in text generation.
 Text Generation: The language model generates text responses based on the retrieved
context and the user query, incorporating relevant information to produce coherent and
contextually relevant responses.

Output Display:

The generated text responses are displayed to the user in the web application interface, allowing
them to review the information or continue the interaction.

Using Gemini and OpenAI's GPT for Retrieval-Augmented Generation

In the first implementation, I utilized Google's Gemini API to harness the capabilities of the
Gemini model for retrieval-augmented generation. By integrating the Gemini model through its
API key, the system can enhance input queries with relevant context from a vast knowledge
base, thereby generating accurate and contextually appropriate results. This process involves
creating embeddings using the Gemini model, which transforms the input data into a numerical
format that captures the semantic meaning of the content. These embeddings are crucial for
ensuring that the generated responses are not only relevant but also precise and informative.

In the second implementation, a similar approach was adopted using OpenAI's GPT model. By
accessing OpenAI's GPT through its API key, the application can augment user queries with
pertinent information retrieved from the indexed document corpus, ensuring that the generated
responses are contextually rich and accurate. As with the Gemini model, embeddings are created
for the input data and are then used to match and retrieve the most relevant information. The
embedding process is essential for both models, as it allows for the effective comparison and
retrieval of data based on the semantic similarity between the queries and the stored information.

In both cases, the embedding models play a pivotal role in transforming and representing the
input queries. These embeddings facilitate the retrieval of contextually relevant data, which is
then used by the respective models to generate high-quality responses. This dual approach,

leveraging both Gemini and OpenAI's GPT, demonstrates the versatility and effectiveness of
retrieval-augmented generation techniques in producing accurate and contextually enriched
outputs. By employing advanced embedding techniques, both implementations ensure that the
generated responses are not only relevant but also align closely with the user's intent, thereby
enhancing the overall user experience.
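
To make the role of embeddings concrete, the sketch below embeds a query and a candidate passage
with each provider and compares them by cosine similarity. It assumes the langchain-google-genai
and langchain-openai packages are installed and GOOGLE_API_KEY and OPENAI_API_KEY are set; the
embedding model names are common public defaults and are assumptions rather than the project's
exact choices.

# Comparing a query and a passage by cosine similarity of their embeddings.
import numpy as np
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_openai import OpenAIEmbeddings

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = "How does the retriever select documents?"
passage = "The retriever searches the indexed corpus and returns the chunks most similar to the query."

for embedder in (
    GoogleGenerativeAIEmbeddings(model="models/embedding-001"),
    OpenAIEmbeddings(model="text-embedding-3-small"),
):
    q_vec = embedder.embed_query(query)
    p_vec = embedder.embed_documents([passage])[0]
    print(type(embedder).__name__, round(cosine_similarity(q_vec, p_vec), 3))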

Future Enhancements:
Integration with External Knowledge Bases: Extend the applications to integrate with external
knowledge bases or document repositories to provide a wider range of context for text
generation.

Fine-Tuning and Optimization: Fine-tune the language models and optimize the retrieval and
generation processes to improve the accuracy and efficiency of text generation.

Enhanced User Interaction: Implement additional features such as multi-turn dialogue support
and user feedback mechanisms to further enhance the user experience and improve the quality of
text responses.

References
1. Books and Articles:
 Chollet, F. (2018). Deep Learning with Python. Manning Publications.
 Jurafsky, D., & Martin, J. H. (2021). Speech and Language Processing. Pearson.
 Russell, S., & Norvig, P. (2016). Artificial Intelligence: A Modern Approach. Pearson.

2. Websites and Documentation:


 BeautifulSoup Documentation. Retrieved from
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
 requests Documentation. Retrieved from https://requests.readthedocs.io/en/master/
 pandas Documentation. Retrieved from https://pandas.pydata.org/pandas-docs/stable/
 scikit-learn Documentation. Retrieved from https://scikit-learn.org/stable/
 XGBoost Documentation. Retrieved from https://xgboost.readthedocs.io/en/latest/
 NLTK Documentation. Retrieved from https://www.nltk.org/
 spaCy Documentation. Retrieved from https://spacy.io/usage
 Flask Documentation. Retrieved from https://flask.palletsprojects.com/en/2.0.x/
 Streamlit Documentation. Retrieved from https://docs.streamlit.io/

3. Online Courses and Tutorials:


 Coursera. "Machine Learning" by Andrew Ng. Retrieved from
https://www.coursera.org/learn/machine-learning
 Udacity. "Intro to Data Science" by Udacity. Retrieved from
https://www.udacity.com/course/intro-to-data-science--ud359
 Fast.ai. "Practical Deep Learning for Coders" by Jeremy Howard. Retrieved from
https://course.fast.ai/

4. Research Papers:
 "Leave No Context Behind: Most Contextualized Retrieval-Augmented Generation
Models" by OpenAI. Retrieved from https://www.openai.com/research/papers/
 Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS. Retrieved from
https://arxiv.org/abs/1706.03762

Appendix

1. Tools and Technologies Used:

 Data Scraping: BeautifulSoup, requests.


BeautifulSoup: Python library for parsing HTML and XML documents, facilitating data
extraction from web pages.
requests: Python HTTP library for making HTTP requests to fetch data from web pages.
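
A small illustrative example of how these two libraries are typically combined is given below; the
URL and tag names are placeholders, not the pages scraped during the internship.

# Fetch a page with requests and extract text from it with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder URL
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early if the request failed

soup = BeautifulSoup(response.text, "html.parser")
titles = [h.get_text(strip=True) for h in soup.find_all("h2")]
print(titles)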

 Data Analysis and Preprocessing: pandas, numpy.


pandas: Powerful Python library for data manipulation and analysis, providing
DataFrame for structured data handling.
numpy: Fundamental package for scientific computing in Python, supporting arrays,
matrices, and mathematical functions.

 Machine Learning: scikit-learn, XGBoost.


scikit-learn: Robust machine learning library in Python, offering tools for data mining
and analysis.
XGBoost: Optimized gradient boosting library designed for efficient and flexible model
training.

 Natural Language Processing: NLTK, spaCy.


NLTK: Leading platform for building Python programs to work with human language
data.
spaCy: Open-source software library for advanced NLP in Python, designed for fast and
efficient text processing.

 Model Deployment: Flask, AWS EC2.


Flask: Lightweight web application framework in Python, useful for quick and easy web
development.

AWS EC2: Provides scalable compute capacity in the cloud, facilitating web-scale cloud
computing.

 Visualization: Matplotlib, Seaborn, WordCloud.


Matplotlib: Python plotting library offering various functions for data visualization.
Seaborn: Python visualization library based on Matplotlib, providing a high-level
interface for drawing statistical graphics.
WordCloud: Generates word clouds from text data, visually representing word
frequency or importance.

 Web Application Development: Flask, Streamlit.


Flask: Used for creating the backend of web applications, providing a simple framework
for web development.
Streamlit: Open-source Python library simplifying the creation and sharing of custom
web apps for machine learning and data science.

2. Development Environments: VS Code, Google Colab.


 VS Code (Visual Studio Code): Powerful code editor developed by Microsoft, offering
extensions for various programming tasks.
 Google Colab: Free, cloud-based Jupyter notebook environment provided by Google,
ideal for data analysis and machine learning projects.

3. Sample Code Snippets:


 Includes examples of code snippets for data scraping, data preprocessing, model training,
model deployment, and web application development.
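
As a representative example of the kind of snippet referred to here, the sketch below trains and
evaluates a simple scikit-learn classifier on synthetic data; the data, model choice, and
parameters are illustrative and do not reproduce the project's actual pipeline.

# Train and evaluate a simple classifier with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))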

4. Project Workflow:
 Details the workflow involving data collection, preprocessing, model training,
deployment, and interface development for the project.
