Happiest Minds, 3rd Floor, SJR Equinox, Sy. No. 47/8, Doddathogur Village, Begur Hobil, Electronic City, Bengaluru, Karnataka 560100
Study on Happiest Minds
Sy. No. 47/8, Doddathogur Village, Begur Hobil
Submitted by
KEERTHI VARMAN .G
Registration No.: 20030146DS002
2021
Master of Technology
Declaration
This is to declare that the report titled “Twitter Sentiment Analysis” has been prepared in partial fulfilment of the course Industry Internship Programme (IIP) in Semester II by me at Happiest Minds (organization) under the guidance of Dr./Prof. Rathnakar Achary.
I confirm that this report truly represents my work undertaken as a part of my Industry Internship Programme (IIP). This work is not a replication of work done previously by any other person. I also confirm that the contents of the report and the views contained therein have been discussed and deliberated with the academic supervisor.
Registration No.: 20030146DS002
Master of Technology
Certificate
This is to certify that Mr./Ms. KEERTHI VARMAN G, Reg. No. 20031460DS002, has completed the report titled TWITTER SENTIMENT ANALYSIS under my guidance in partial fulfilment of the course Industry Internship Programme (IIP) in Semester II of the Master of Technology in DATA SCIENCE.
Signature of Supervisor:
TABLE OF CONTENTS
1 ABSTRACT 5
2 INTRODUCTION 6
3 INDUSTRY OVERVIEW 9-11
• GLOBAL SCENARIO
• INDIAN SCENARIO
4 COMPANY OVERVIEW 12 – 18
• COMPANY LOGO
• ESTABLISHMENT
• FOUNDER
• MISSION
• VISION
• VALUES
• BUSINESS EXCELLENCE
• SERVICES
• SWOT ANALYSIS
5 TEAM STRUCTURE 19
6 PROJECT PROFILE 20-21
• PROBLEM STATEMENT
• OBJECTIVE OF THE PROJECT
ABSTRACT
Any machine learning model needs a significant amount of sample data to give reasonably accurate predictions: model accuracy generally improves with the size of the dataset it is given to observe and learn from. To fetch real-time data and process it into a suitable format at scale, the algorithms handling these steps must be efficient. The AI@Scale initiative aims to find a solution for fetching and processing massive datasets at production-level performance. The goal is a paradigm that scales appropriately as the dataset used to train an ML model grows. Data fetched from real-time APIs provided by social networking platforms can stretch into terabytes (TB). At such input sizes, even minor improvements in pre-processing algorithms have a massive impact on application speed, and they also free the server's CPU for other simultaneous tasks (such as responding to user queries). One of the most critical parameters for companies to gauge is public sentiment towards their brand. Our objective for the duration of this internship is therefore to develop a production-ready brand sentiment analyser that pulls its dataset from the Twitter API using the Tweepy library. A necessary requirement for this application is that it can also handle datasets for popular companies with a significant media presence (essential scalability).
INTRODUCTION
Twitter is a free social media platform that allows users to communicate with one
another via brief messages known as tweets. People use Twitter to connect with others
and discover new things every day, whether it's to share breaking news, post company
updates, or follow their favourite celebrities.
Twitter, which was founded in 2006, is most popular among millennials and young
professionals. Since its introduction, however, it has seen a considerable increase in
users of all ages.
Twitter enables users to follow and interact with a variety of people, brands, and news
organisations, resulting in a real-time stream of messages personalised to their
preferences. People use Twitter to provide real-time updates, photographs, videos, and
links. This allows for intelligent, real-time search results.
Twitter allows businesses to engage with consumers, sellers, partners, and workers in a
two-way fashion. Customers can stay up to date on a company's products and services
by following them on social media. Businesses may monitor their customers on Twitter
for insights into their ideas, behaviours, and attitudes about various products and
services because Twitter is an open network.
Sentiment analysis is a widely used text mining technique. Sentiment analysis on Twitter refers to the use of advanced text mining algorithms to assess the sentiment of a text (in this case, a tweet) as positive, negative, or neutral. Also known as opinion mining, it is used to analyse conversations, opinions, and shared ideas (all in the form of tweets) in order to inform company strategy, political analysis, and the evaluation of public acts.
Some well-known tools for analysing Twitter sentiment include Enginuity, Revealed Context, Steamcrab, MeaningCloud, and Social Mention. Twitter sentiment analysis is commonly carried out in R and Python, and it is now much more than a student project or a certification exercise: a plethora of tutorials are available to teach students how to build a Twitter sentiment analysis project and apply it with R and Python, and those who want to pursue a career in this area can also enrol in a Python course.
This report covers Twitter sentiment analysis in R and Python, the strategies involved, and how to construct a Twitter sentiment analysis project report.
In natural language processing, algorithms such as SVM and Naive Bayes are used to predict the polarity of a sentence. Sentiment analysis of Twitter data may also be affected by sentence- and document-level considerations.
Simply looking for positive and negative terms in a sentence is often ineffective, because the flavour of a text block depends heavily on context. Looking at POS (Part of Speech) tagging can help with this.
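To illustrate why bare word lookup falls short, here is a minimal pure-Python sketch; the word lists and function name are illustrative, not from the project. A naive scorer rates "not good at all" as positive, because it ignores the negation surrounding "good".

```python
# Naive lexicon-based scorer: counts positive/negative words,
# ignoring context such as negation. Word lists are illustrative.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}

def naive_polarity(text: str) -> int:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return (score > 0) - (score < 0)  # +1 positive, -1 negative, 0 neutral

print(naive_polarity("great product, love it"))  # 1, as expected
print(naive_polarity("not good at all"))         # 1, wrong: negation is ignored
```

This failure mode is exactly what POS tagging and context-aware models address.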
A Twitter sentiment analysis dataset can be used for a variety of purposes:
Business:
Companies use Twitter sentiment analysis to shape their business plans, assess customers' attitudes toward products or brands, see how people react to campaigns or new releases, and work out why certain items are not selling.
Politics:
In politics, Twitter sentiment analysis is used to track political viewpoints and to detect consistency and inconsistency between a government's words and actions. It has also been used to analyse election results.
Public Activism:
Twitter Sentiment Analysis is also used to track and analyse social events, foresee potentially
dangerous situations, and gauge the mood of the blogosphere.
INDUSTRY OVERVIEW
The software industry consists of the development, distribution, and maintenance of software.
It can be broadly divided into application software, system infrastructure software, software-
as-a-service (SaaS), operating systems, database and analytics software. Statista provides
information on software market spending in general and on various software products
specifically, on both global and regional scales. Additionally, financial metrics and market
share of major players are covered as well, together with data on software development and
usage.
Fig.1
Global scenario
In terms of industry specifics, IDC projects that the technology industry is on pace to exceed
$5.3 trillion in 2022. After the speed bump of 2020, the industry is returning to its previous
growth pattern of 5%-6% growth year over year. The United States is the largest tech market
in the world, representing 33% of the total, or approximately $1.8 trillion for 2022.
Fig.2
There are a number of taxonomies for depicting the information technology space. Using the
conventional approach, the market can be categorized into five top-level buckets. The
traditional categories of hardware, software and services account for 56% of the global total.
The other core category, telecom services, accounts for 25%. The remaining 19% covers
various emerging technologies that either don’t fit into one of the traditional buckets or span
multiple categories, which is the case for many emerging as-a-service solutions that include
elements of hardware, software and service, such as IoT, drones and many automating
technologies.
Fig.3
Indian scenario
COMPANY OVERVIEW
COMPANY LOGO:
Establishment
Founder
Happiest Minds Technologies Limited was founded by Ashok Soota, an Indian IT entrepreneur born on 12 November 1942 in Delhi, India. In 1960 he completed his Inter Science schooling at La Martiniere, Lucknow. He holds a Bachelor of Engineering in Electronics from the University of Roorkee and an MBA from the Asian Institute of Management, Manila, Philippines. He has won many awards and was a very intelligent student from a young age.
MISSION:
VISION:
• Achieve a very successful IPO by or before FY2023 and, in the interim, provide a monetization event for investors and the team by FY2020.
VALUES:
BUSINESS EXCELLENCE
Our strategy for the Continual Quality Improvement journey is derived from business needs, technology changes, customer feedback, suggestions, and process performance.
We have an Integrated Management System (ISO 9001, ISO 27001) that benefits our organization through increased efficiency, effectiveness, and cost reduction while meeting the compliance requirements of several external audits. It shows our commitment to increased performance, employee and customer satisfaction, and continuous improvement. Our QMS has world-class methods for software engineering and delivery management with the right degree of automation. Various processes, guidelines, checklists, and templates exist in our QMS for both engineering and project management areas.
Delivery Methodologies:
Our suite of delivery methodologies in the areas mentioned below demonstrates our thought leadership and execution capabilities.
• Agile methodologies
As we strive to provide best-in-class solutions and services to our customers, we monitor and benchmark our practices and standards against globally accepted norms, and continually update our QMS to keep pace with the industry.
SERVICES:
1) Agile infrastructure
Volatility is likely to remain constant in a new phase of globalization. To keep up with the rapid pace of the market and ensure business growth, enterprises need the support of next-generation technologies. Whether moving to cloud-based applications, mobile-enabling the workforce, optimizing the IT infrastructure, or securing the workplace, enterprises need an expert technology partner to help them navigate the various complexities involved. While the company provides agile infrastructure and optimizes IT infrastructure and operations, it is important to identify and implement the right Infrastructure Management Services that utilize existing investments in tools and resources.
2) Cloud computing
3) Data science
Happiest Minds helps clients to solve the toughest data challenges, predict demand for
products and services to improve customer satisfaction and guide business strategies based on
knowledge and foresight.
4) DevOps
DevOps solutions plug the gaps that exist between software development, quality assurance, and IT operations, enabling you to produce software products and services quickly while significantly improving operational performance.
5) Internet of Things
Happiest Minds' Internet of Things services enable organizations to transform business needs into competitive differentiators by delivering innovative IoT-powered solutions. From integrating the right sensors and deriving inspired insights to choosing the best-fit platform, Happiest Minds provides comprehensive IoT services to its clients.
6) SDN-NFV
Software Defined Networking and Network Function Virtualization address a rapidly shifting business and technology landscape while also managing increasingly high customer expectations. The proliferation of smartphones and internet-based applications, the 5G explosion, evolving WAN requirements, and the emergence of new traffic patterns due to IoT and M2M connectivity have been some of the most obvious disruptors.
7) Mobility solutions
SWOT ANALYSIS
STRENGTHS:
WEAKNESS:
OPPORTUNITIES:
THREATS:
TEAM STRUCTURE
Interns
All interns work on the task assigned and report to the mentors in the next scheduled team
meeting. Team meetings are scheduled every Monday, Wednesday and Friday, 11:00 - 11:30
A.M.
PROJECT PROFILE
PROBLEM STATEMENT
Build a Twitter monitoring system to watch customer behaviour toward brands by detecting sentiment shifts, evaluating hot topics and geographic segmentation, and finding anomalies such as scandals, all with the goal of increasing customer engagement and retention.
Our objective for the duration of this internship was to experiment with and formulate a solution for fetching and processing massive datasets at production-level performance, and to design a workflow that scales for efficient ML model training.
One of the many practical applications of this optimised application would be handling
datasets for popular companies with significant media presence (Essential scalability).
Fig.6 DOCKER
Language chosen:
Python
Due to the abundance of supporting machine learning libraries available, along with Tweepy, the mentors chose Python as the primary language for writing the ML pipeline.
Tweepy
MySQL
MySQL is a DBMS that most of our team was familiar and comfortable with. The system has been updated and optimised over the years to become one of the fastest query-processing database systems.
Framework for data pre-processing, inference generation, and analysis of the data fetched from Twitter and stored in the database:
NLTK (Natural Language Toolkit), a leading platform for building Python programs that work with human language data, and TextBlob, a natural language processing library for processing textual data that provides a simple API for common NLP tasks.
HTML, CSS
For the pipeline to run identically to how it runs in our local environment, we will be hosting a Docker image. Docker allows us to create separate environments on a single operating system, making Docker images faster and lighter to run.
Hosting service:
The GitHub repository of the trained model is hosted along with the corresponding Docker image on a Kubernetes cluster. GCP was chosen because Kubernetes provides consistent scaling of an image based on traffic received, along with auto-update, load balancing, and maintenance.
The objective of this part of the application is to filter tweets being posted in real-time
based on the presence of specific keywords that are supplied as variables in a file
‘settings.py’. These tweets are then processed and saved in a MySQL database and
analyzed to draw conclusions that are displayed on the dashboard.
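As a minimal sketch of the keyword-filtering step (the variable name TRACK_WORDS and the helper function are illustrative; the project keeps its keywords as variables in settings.py):

```python
# settings.py-style constant (illustrative): keywords used to filter the stream.
TRACK_WORDS = ["happiestminds", "happiest minds"]

def matches_keywords(tweet_text: str, keywords=TRACK_WORDS) -> bool:
    """Return True if the tweet mentions any tracked keyword (case-insensitive)."""
    text = tweet_text.lower()
    return any(k.lower() in text for k in keywords)

print(matches_keywords("Loving the new Happiest Minds dashboard!"))  # True
print(matches_keywords("Nothing relevant here"))                     # False
```

In the deployed pipeline this filtering is done server-side by the Twitter streaming API's track parameter rather than in application code.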
Fig.7
Fig.8
A Python library called Tweepy forms a significant part of this task. It is used to listen to the streaming data, and to process and save the tweets fetched via the Twitter API. The StreamListener class is overridden by modifying the ‘on_status’ and ‘on_error’ methods. It mainly accomplishes the feature extraction and data pre-processing steps, followed by connecting to and loading the data into a database instance (MySQL in our case).
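The listener pattern can be sketched without the live API as follows. This is a hedged stand-in, not Tweepy's actual class: real code would subclass tweepy.StreamListener and write each row to MySQL instead of an in-memory list.

```python
# Stand-in for the Tweepy listener pattern: on_status handles each incoming
# tweet, on_error decides whether to keep the stream alive.
class SentimentListener:
    def __init__(self):
        self.rows = []  # stands in for the MySQL table

    def on_status(self, status):
        # Feature extraction / light pre-processing before storage.
        text = status["text"].strip()
        self.rows.append({"id": status["id"], "text": text})

    def on_error(self, status_code):
        # Returning False disconnects the stream; 420 means rate-limited.
        return status_code != 420

listener = SentimentListener()
listener.on_status({"id": 1, "text": "  Great service from the brand!  "})
print(listener.rows[0]["text"])  # "Great service from the brand!"
print(listener.on_error(420))    # False: stop streaming on rate limit
```

Treating a rate-limit error (HTTP 420) as a disconnect signal, as sketched here, follows Twitter's documented guidance for streaming clients.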
Fig.9
Fig.10
This step extracts the data stored in the MySQL database and analyses it using natural language processing operations. It also analyses the geographic segmentation of the data for a particular keyword, using location-related features.
The core of the sentiment analysis is to use TextBlob to extract the polarity and subjectivity of the tweet text; this is done during the data pre-processing described in the previous chapter, before storage. Negative tweets are represented as -1, positive tweets as +1, and neutral tweets as 0.
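A minimal sketch of this labelling step, assuming TextBlob-style polarity scores in [-1.0, 1.0]; TextBlob itself is swapped out here for a plain float input, and the function name is illustrative:

```python
def label_sentiment(polarity: float) -> int:
    """Map a TextBlob-style polarity score to -1 (negative), 0 (neutral), +1 (positive)."""
    if polarity > 0:
        return 1
    if polarity < 0:
        return -1
    return 0

# In the real pipeline the score would come from TextBlob(text).sentiment.polarity.
print(label_sentiment(0.8))   # 1
print(label_sentiment(-0.3))  # -1
print(label_sentiment(0.0))   # 0
```

Thresholding at exactly 0 for neutral matches the -1/0/+1 scheme described above; a deployment could widen the neutral band if scores near zero are noisy.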
Fig.11
The entire text from all tweets is tokenized, stop words imported from NLTK are used as a reference to remove commonly used words, and the ten most common words in the frequency distribution of the remaining words are extracted. Plotly Express is again used to visualize the bar chart.
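The counting step can be sketched in pure Python as follows. The stop-word set here is a tiny illustrative subset; the project uses NLTK's stopwords corpus and its FreqDist class instead of collections.Counter.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in"}  # tiny illustrative subset

def top_words(texts, n=10):
    """Tokenize by whitespace, drop stop words, return the n most common words."""
    words = [w for t in texts for w in t.lower().split() if w not in STOP_WORDS]
    return Counter(words).most_common(n)

tweets = ["the service is great", "great support and great pricing"]
print(top_words(tweets, 3))  # [('great', 3), ('service', 1), ('support', 1)]
```

The resulting (word, count) pairs feed directly into a bar chart, which is where Plotly Express comes in.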
Fig.12
Fig.13
Using Dash and Plotly, the sentiment analyser is wrapped in a web application and graphs
are used to summarise the sentiment analysis.
Fig.14 Most frequently encountered words in the tweets containing chosen keyword
Fig.15 Showing number of +ve, -ve and neutral tweets at a particular time
This section is the most critical part of the application and is responsible for differentiating it from the referenced base code. It is also where a significant number of our original contributions lie. The documentation of the various libraries and frameworks was studied extensively to identify possible processing bottlenecks in the functions used.
While the base code was built on PostgreSQL, our application uses MySQL. MySQL was chosen as it is relatively light on features and thus can focus more on speed and reliability.
Fig.17
When you wanted to run a web application, you used to have to buy a server, install Linux, set up a LAMP stack, and then run the programme. If your programme became popular, you added a second server with load balancing to ensure that it didn't crash under excessive traffic.
However, instead of focusing on single servers, the Internet now relies on a system known as "the cloud", a collection of interconnected and redundant computers. Thanks to inventions like Linux kernel namespaces and cgroups, the concept of a server could be liberated from the limits of hardware and instead become simply a piece of software. Containers are software-based servers that are a hybrid of the Linux OS they're running on and a hyper-localized runtime environment (the contents of the container).
Understanding Containers
Containers frequently supply both an application and its configuration, reducing the time a sysadmin must spend getting a containerized application to execute compared with installing an application from a traditional source. Docker Hub and Quay.io are both image sources for container engines.
The most appealing feature of containers, however, is their ability to gently "die" and revive when load balancing requires it. Containers are "cheap" to start and are designed to appear and disappear easily, whether a container's demise is caused by a crash or because it is simply no longer needed when server traffic is low. Because containers are ephemeral and new instances spawn as needed, monitoring and management are expected to be automated rather than done by a human in real time.
Fig.18
Kubernetes, a container-centric management platform, has become the de facto
standard for deploying and operating containerized applications as a result of the
broad adoption of containers among enterprises. Kubernetes, which was first created
at Google and published as open source in 2014, was born on Google Cloud.
Kubernetes is based on Google's 15-year experience operating containerized
workloads and the open source community's essential contributions. Kubernetes,
which was inspired by Google's internal cluster management system, Borg, simplifies
the process of deploying and administering your application. Kubernetes increases
your reliability by automating container orchestration and reducing the time and
resources spent on everyday operations.
Kubernetes is a container orchestration system that allows you to create cloud-native microservices-
based apps. It also allows you to containerize existing programmes, making it the cornerstone for
application modernization and allowing you to develop apps more quickly.
Kubernetes can alter the size of a cluster required to run a service automatically. This allows
you to scale your apps up and down dynamically based on demand and run them efficiently.
KUBERNETES VS DOCKER:
Kubernetes and Docker are independent but complementary technologies for running containerized applications, and they're often mistaken for one another.
Docker allows you to package everything you need to run your application into a container
that can be kept and opened as needed. Once you've started boxing up your apps, you'll need
a mechanism to manage them, which Kubernetes provides.
Fig.19
Google Cloud Storage is a cloud-based storage platform for huge, unstructured data
volumes. Cloud Datastore for NoSQL nonrelational storage, Cloud SQL for MySQL
fully relational storage, and Google's native Cloud Bigtable database are all available
from Google.
Fig.20
2. Activate billing in the respective GCP account and set up a service account
3. Enable Compute engine API, SQL API, Service API, Container Registry API and
Google Kubernetes API for the application to work. This is because these APIs
are used in the deployment of our application.
4. Create a new project in the GCP console; all resources for this application will be hosted and deployed in it.
5. Create an SQL instance. The base code used in the repository uses MySQL version 8.0, so use that version. Set up the open connection and copy the IP of the SQL instance once it has been created. Modify the env.hostname variable in the fetcher as well as the display script to enable connection to this SQL instance.
6. Create a database named ‘Twitter’ in this SQL instance using the Cloud SQL Shell.
7. Now, once the database has been created, we will fetch and deploy the fetcher as
well as the display script by creating a Kubernetes cluster and then deploying and
exposing our application.
8. First for the fetcher script: Run all these commands in the Cloud Shell:
▪ Clone the code from GitHub using the following command:
git clone --branch v2 https://github.com/madhavgupta211/hm-datafetch.git data-fetcher
▪ Go to the created directory using:
cd data-fetcher
▪ Set the project variable using the following commands:
export PROJECT_ID="PROJECT_NAME"
gcloud config set project "PROJECT_NAME"
▪ Build the image specified in the Dockerfile using the following command:
9. Next, for the display script: Run all these commands in the Cloud Shell:
▪ Clone the code from the GitHub repository using the following command:
git clone --branch Final-Deployment https://github.com/madhavgupta211/hm-data-disp.git dashboard-app
Finally, the scripts have been deployed and are running. The dashboard can be viewed
at the external IP.
Fig.21
After that, a Kubernetes cluster is created, the required images are pulled from GCR, and they are then deployed in GKE.
Finally, the dashboard app deployment in the cluster is exposed via an external IP address for clients to access.
CONCLUSION
This application has been developed as two separate versions. The first consists of the base code with minor modifications and serves as a reference for testing performance after improvements. The second, critical version houses the optimizations introduced by the team and is hosted on a Google Kubernetes cluster on the Google Cloud Platform.
❖ With the use case “Real-Time Brand Sentiment Analyzer for Brand Improvement and Topic Tracking”, we achieved a more scalable dashboard and increased efficiency after introducing multi-threading and parallel processing.
❖ The roadblocks faced were handled adequately and gave us a productive experience of software development. They also pushed us to come up with innovative solutions that were time- and cost-efficient.
❖ Although we could not integrate near-real-time streaming using Kafka and Spark due to time constraints, we gained valuable knowledge about these technologies.
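The multi-threading mentioned above can be sketched as follows; this is a hedged illustration with a stand-in clean-up function, not the project's actual worker code:

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(tweet: str) -> str:
    """Stand-in clean-up step: trim whitespace and lowercase the text."""
    return tweet.strip().lower()

tweets = ["  Great support!  ", "Terrible delays today", "  OK experience "]

# Map the pre-processing step over tweets in parallel; threads help most when
# the real step is I/O-bound (e.g., writing each cleaned tweet to MySQL).
with ThreadPoolExecutor(max_workers=4) as pool:
    cleaned = list(pool.map(preprocess, tweets))

print(cleaned[0])  # "great support!"
```

For CPU-bound pre-processing, a process pool would be the usual choice instead, since Python threads share the interpreter lock.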
REFERENCES