

Study on

“TWITTER SENTIMENT ANALYSIS”

Happiest Minds Technologies Ltd.

3rd Floor, SJR Equinox, Sy. No. 47/8,

Doddathogur Village, Begur Hobli,

Electronic City, Bengaluru,

Karnataka 560100

Submitted by

KEERTHI VARMAN .G

Registration No:

20030146DS002

Under the Guidance of Dr./Prof. Rathnakar Achary

In partial fulfilment of the Course: Industry Internship Programme (IIP)

in Semester II of the Master of Technology

2021

Master of Technology

Industry Internship Programme (IIP)

Declaration

This is to declare that the Report titled “Twitter Sentiment Analysis” has been prepared for the
partial fulfilment of the Course: Industry Internship Programme (IIP) in Semester II by me at
Happiest Minds (organization) under the guidance of Dr./Prof. Rathnakar Achary.

I confirm that this Report truly represents my work undertaken as a part of my Industry
Internship Programme (IIP). This work is not a replication of work done previously by any
other person. I also confirm that the contents of the report and the views contained therein have
been discussed and deliberated with the academic supervisor.

Signature of the Student :

Name of the Student (in Capital Letters) : KEERTHI VARMAN G

Registration No.: 20030146DS002

Master of Technology

Certificate

This is to certify that Mr. KEERTHI VARMAN G, Registration No. 20030146DS002, has
completed the report titled TWITTER SENTIMENT ANALYSIS under my guidance for the
partial fulfilment of the Course: Industry Internship Programme (IIP) in Semester II of the
Master of Technology in DATA SCIENCE.

Signature of Supervisor:

Name of the Supervisor: Dr./Prof. Rathnakar Achary



TABLE OF CONTENTS

1 ABSTRACT
2 INTRODUCTION
3 INDUSTRY OVERVIEW
• GLOBAL SCENARIO
• INDIAN SCENARIO
4 COMPANY OVERVIEW
• COMPANY LOGO
• ESTABLISHMENT
• FOUNDER
• MISSION
• VISION
• VALUES
• BUSINESS EXCELLENCE
• SERVICES
• SWOT ANALYSIS
5 TEAM STRUCTURE
6 PROJECT PROFILE
• PROBLEM STATEMENT
• OBJECTIVE OF THE PROJECT
7 FRAMEWORKS AND LIBRARIES
8 COLLECTING TWITTER DATA USING STREAMING API
9 TWITTER SENTIMENT ANALYSIS AND INTERACTIVE DATA VISUALIZATION
10 MAKING FRONTEND OF THE APPLICATION
11 OPTIMIZATION AND SCALING OF THE APPLICATION
12 DEPLOYMENT OF THE APPLICATION
13 CONCLUSION
14 REFERENCES

ABSTRACT

Any machine learning model needs a significant amount of sample data to give reasonably
accurate predictions. It is generally accepted that model accuracy has a positive relationship
with the size of the dataset it is given to observe and learn from. To fetch real-time data and
process it into a suitable format at scale, the algorithms responsible for these steps must be
efficient. The AI@Scale initiative aims to discover a solution for fetching and processing
massive datasets at production-level performance. The goal is to find a paradigm that scales
appropriately as the dataset obtained for training an ML model grows. Data fetched from
real-time APIs provided by social networking platforms can stretch into terabytes (TB). At
such input sizes, even minor improvements in pre-processing algorithms have a massive
impact on the speed of the application, and can also save the server’s CPU power for other
simultaneous tasks (such as responding to user queries). One of the most critical parameters
for companies to gauge is public sentiment towards their brand. Our objective for the duration
of this internship is therefore to develop a production-ready brand sentiment analyser, which
pulls its dataset from the Twitter API using the Tweepy library. A key requirement is that the
application also handle datasets for popular companies with a significant media presence
(essential scalability).

INTRODUCTION

Twitter is a free social media platform that allows users to communicate with one
another via brief messages known as tweets. People use Twitter to connect with others
and discover new things every day, whether it's to share breaking news, post company
updates, or follow their favourite celebrities.

Twitter, which was founded in 2006, is most popular among millennials and young
professionals. Since its introduction, however, it has seen a considerable increase in
users of all ages.

Twitter enables users to follow and interact with a variety of people, brands, and news
organisations, resulting in a real-time stream of messages personalised to their
preferences. People use Twitter to provide real-time updates, photographs, videos, and
links. This allows for intelligent, real-time search results.

Twitter allows businesses to engage with consumers, sellers, partners, and workers in a
two-way fashion. Customers can stay up to date on a company's products and services
by following them on social media. Businesses may monitor their customers on Twitter
for insights into their ideas, behaviours, and attitudes about various products and
services because Twitter is an open network.

Twitter provides a great way for different departments in an organization to
communicate externally as well as internally. For sales and marketing, Twitter provides
the opportunity to engage with current and potential customers. You can inform them
about your latest news, products and services while directing them to sales offers and
new content. Employees are also able to re-share any important content, allowing your
messages to spread exponentially across their networks. Twitter also offers unique
features such as Twitter Lists to track what different groups of people are saying about
your business, industry and even your competitors.

Sentiment analysis is a commonly used text mining approach. Twitter sentiment analysis,
specifically, refers to the use of advanced text mining algorithms to classify the sentiment of a
text (in this case, a tweet) as positive, negative, or neutral. It is also known as opinion mining,
and it is used to analyse conversations, opinions, and the sharing of ideas (all in the form of
tweets) in order to inform company strategy, political analysis, and the evaluation of public
acts.

Some well-known tools for analysing Twitter sentiment include Enginuity, Revealed Context,
Steamcrab, MeaningCloud, and Social Mention. Twitter sentiment analysis datasets are
commonly worked with in R and Python. Twitter sentiment analysis is now much more than a
student project or a certification programme: a plethora of tutorials are available to teach
students how to build a Twitter sentiment analysis project and report with R and Python, and
one may also enrol in a Python course to pursue a career in sentiment analysis.

Our discussion will cover Twitter sentiment analysis in R and Python, the strategies involved,
how to construct a Twitter sentiment analysis project report, and the benefits of enrolling in
the tutorial.

Positive, negative, or neutral sentiment or opinion can be conveyed on Twitter. However, no
system can guarantee 100% accuracy when it comes to sentiment analysis.

Algorithms such as SVM and Naive Bayes are used in Natural Language Processing to predict
the polarity of a sentence. Sentiment analysis of Twitter data may also be affected by
sentence- and document-level considerations.

However, methods such as looking for positive and negative terms in a sentence are ineffective
because the flavour of the text block is highly dependent on the context. Looking at the POS
(Part of Speech) Tagging can help with this.
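The weakness of bare term counting is easy to demonstrate. The following is a small hypothetical sketch (the word lists are invented for illustration) showing how negation defeats a naive counter:

```python
# Hypothetical illustration: naive term counting ignores context such as negation.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "poor", "hate"}

def naive_score(text):
    """Score a text by counting positive and negative words."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(naive_score("this phone is good"))      # 1 (correct)
print(naive_score("this phone is not good"))  # 1 (wrong: the negation is missed)
```

Both sentences score identically even though they express opposite sentiments, which is why context-aware techniques such as POS tagging are needed.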

A Twitter sentiment analysis dataset can be used for a variety of purposes:

Business:

Companies use Twitter sentiment analysis to shape their business plans, to assess customers’
attitudes toward products or brands, to see how people react to their campaigns or new releases,
and to figure out why certain products aren’t selling.

Politics:

In politics, Twitter sentiment analysis is used to keep track of political viewpoints and to find
consistency and inconsistency between government words and actions. It has also been used
to analyse election results.

Public Activism:

Twitter Sentiment Analysis is also used to track and analyse social events, foresee potentially
dangerous situations, and gauge the mood of the blogosphere.

INDUSTRY OVERVIEW

The software industry consists of the development, distribution, and maintenance of software.
It can be broadly divided into application software, system infrastructure software, software-
as-a-service (SaaS), operating systems, database and analytics software. Statista provides
information on software market spending in general and on various software products
specifically, on both global and regional scales. Additionally, financial metrics and market
share of major players are covered as well, together with data on software development and
usage.

Fig.1

Global scenario

In terms of industry specifics, IDC projects that the technology industry is on pace to exceed
$5.3 trillion in 2022. After the speed bump of 2020, the industry is returning to its previous
growth pattern of 5%-6% growth year over year. The United States is the largest tech market
in the world, representing 33% of the total, or approximately $1.8 trillion for 2022.

Fig.2

There are a number of taxonomies for depicting the information technology space. Using the
conventional approach, the market can be categorized into five top-level buckets. The
traditional categories of hardware, software and services account for 56% of the global total.
The other core category, telecom services, accounts for 25%. The remaining 19% covers
various emerging technologies that either don’t fit into one of the traditional buckets or span
multiple categories, as is the case for many emerging as-a-service solutions that include
elements of hardware, software and services, such as IoT, drones and many automation
technologies.

Fig.3

Indian scenario

Information Technology in India is an industry consisting of two major components: IT
services and business process outsourcing (BPO). The IT industry accounted for 8% of India’s
GDP in 2020. The IT and BPM industry’s revenue is estimated at US$194 billion in FY 2021,
an increase of 2.3% YoY.[2] The domestic revenue of the IT industry is estimated at US$45
billion and export revenue at US$150 billion in FY 2021. The IT–BPM sector overall employs
4.5 million people as of March 2021.

COMPANY OVERVIEW


COMPANY LOGO:

Establishment

Happiest Minds Technologies Limited was established in 2011. It is a next-generation digital
transformation, infrastructure, security, and product engineering services company with 170+
customers and 2,400+ people across 16 locations. The company provides its products and
services in various locations such as the US, UK, Singapore and Australia, and delivers
services across industry sectors such as retail, edutech, industrial, BFSI, hi-tech, engineering
R&D, manufacturing, travel, media and entertainment, among others.

Founder

Happiest Minds Technologies Limited was founded by Ashok Soota, an Indian IT
entrepreneur born on 12 November 1942 in Delhi, India. He completed his Inter Science
schooling at La Martiniere, Lucknow, in 1960, holds a Bachelor of Engineering in Electronics
from the University of Roorkee, and earned an MBA from the Asian Institute of Management
in Manila, Philippines. He was a very bright student from a young age and has won many
awards.

MISSION:

Happiest Minds Technologies Limited’s mission is as follows:

Happiest People. Happiest Customers.

VISION:

Happiest Minds Technologies Limited’s vision is as follows:

• Be the happiness evangelists for each other, customers and society.

• Be known as the company with the highest standards of corporate governance.

• Achieve a very successful IPO by or before FY2023 and, in the interim, provide a
monetization event for investors and the team by FY2020.

• Be recognized for thought leadership in focused areas of technology and solutions.


• Be a leader in social responsibility initiatives.

VALUES:

Happiest Minds Technologies Limited’s values are:

• Sharing knowledge and wealth-


Culture of teamwork and sharing knowledge and wealth.
• Mindful responsibilities-
Attentive, caring, heedful. Mindful of our responsibilities.
• Integrity-
Respect our commitments internally and externally, not just in letter, but also in spirit.
Creating an organization that stands for fiscal, social and professional integrity.
• Learning-
A culture that rewards self-development and innovation.
• Excellence-
High aspirations for global excellence backed by a strong action orientation.
• Social responsibility-
Good corporate citizen with a special emphasis on environmental responsibility and
driving inclusivity in the workplace.

BUSINESS EXCELLENCE

Happiest Minds Quality Policy:

“Happiest Minds consistently strives for customers’ happiness. We are committed to


delivering excellence in our services by continually improving processes and systems, aiding
in creating value for all our stakeholders”.

Happiest Minds Quality Management System (QMS) Framework:



Our strategy for the Continual Quality Improvement journey is derived from business needs,
technology changes, customer feedback, suggestions and process performance.

Our Quality Management System is compliant with the ISO 9001:2015 standard and brings
together industry best practices to promote a systematic approach to planning and executing
projects. Besides, Happiest Minds is certified to information security standards such as
ISO 27001, which guides our policies and procedures for protecting our software enablers as
well as our clients’ software enablers. Our Quality Management System was certified for
ISO 9001:2015 in November 2017 by TUV Rheinland.

We have an Integrated Management System (ISO 9001, ISO 27001) that benefits
our organization through increased efficiency, effectiveness and cost reductions while
meeting the requirements of several external audits. It shows our commitment to increased
performance, employee and customer satisfaction, and continuous improvement. Our QMS
has world-class methods for software engineering and delivery management with the right
degree of automation. Various processes, guidelines, checklists and templates exist in our
QMS for both engineering and project management areas.

Delivery Methodologies:
Our suite of delivery methodologies in the below-mentioned areas demonstrates our thought
leadership and execution capabilities:

• Waterfall model for Software development

• Agile methodologies

• Embedded system software

• Service delivery lifecycle

As we strive to provide best-in-class solutions and services to our customers, we monitor and
benchmark our practices, standards against globally accepted norms, and continually update
our QMS to keep pace with the industry.

SERVICES:

1)Agile infrastructure

Volatility is likely to remain constant in a new phase of globalization. To keep up with the
rapid pace of the market and ensure business growth, enterprises need the support of
next-generation technologies. Be it moving to cloud-based applications, mobile-enabling the
workforce, optimizing the IT infrastructure or securing the workplace, enterprises need an
expert technology partner to help them navigate the various complexities involved. While the
company provides agile infrastructure and optimizes IT infrastructure and operations, it is
important to identify and implement the right Infrastructure Management Services that utilize
existing investments in tools and resources.

2)Cloud computing

Cloud computing is seamlessly driving a large transformation globally for user, developer
and enterprise communities in their whole experience of information access; application
design, development and implementation; and the administration and management of costs
and infrastructure. The benefits of cloud computing are widely accepted, and enterprises are
moving fast to experience the transformation.

3)Data science

Happiest Minds helps clients solve their toughest data challenges, predict demand for
products and services to improve customer satisfaction, and guide business strategies based
on knowledge and foresight.

4)DevOps

DevOps solutions plug the gaps that exist between software development, quality assurance
and IT operations, thereby enabling you to quickly produce software products and services
while improving operational performance significantly.

5)Internet of things

The Happiest Minds Internet of Things service enables organizations to transform business
needs into competitive differentiators by delivering innovative IoT-powered solutions. From
integrating the right sensors and deriving inspired insights to choosing the best-fit platform,
we provide comprehensive IoT services to our clients.

6)SDN-NFV

Software-Defined Networking and Network Function Virtualization have emerged as the
industry deals with a rapidly shifting business and technology landscape and increasingly
high customer expectations. The increased proliferation of smartphones and internet-based
applications, the 5G explosion, evolving WAN requirements and the emergence of new traffic
patterns due to IoT and M2M connectivity have been some of the most obvious disruptors.

7)Mobility solution

Happiest Minds Enterprise Mobility services enable an enterprise-wide, mobile-first
metamorphosis that enhances customer experiences across all touchpoints. We help our
clients fuel innovation in digital transformation imperatives like IoT, big data analytics and
augmented intelligence, using enterprise mobility as a macro force.

SWOT ANALYSIS

STRENGTHS:

• Company with high TTM EPS Growth.

• Good quarterly growth in the recent results.
• Efficient in managing Assets to generate Profits - ROA improving over the last 2 years.
• Growth in Net Profit with increasing Profit Margin (QoQ).
• Company with Low Debt.
• Increasing Revenue every Quarter for the past 4 Quarters.
• Strong cash-generating ability from core business - improving Cash Flow from
operations for the last 2 years.
• Company able to generate Net Cash - improving Net Cash Flow for the last 2 years.

• Annual Net Profits improving for the last 2 years.

• Company with Zero Promoter Pledge.
• FII / FPI or Institutions increasing their shareholding.

WEAKNESSES:

• Growing costs YoY for long-term projects.

• MFs decreased their shareholding last quarter.

OPPORTUNITIES:

• High Momentum Scores (Technical Scores greater than 50).


• Highest Recovery from 52 Week Low.

THREATS:

• Stocks with Expensive Valuations according to the Trendlyne Valuation Score.


• Stocks with high PE (PE > 40).

TEAM STRUCTURE

Reporting manager: Andrew Anand

Mentors: Ashish Soni, Madhav Gupta, Umesh Menon, Meenal Gupta

Interns: Keerthi Varman G

All interns work on the task assigned and report to the mentors in the next scheduled team
meeting. Team meetings are scheduled every Monday, Wednesday and Friday, 11:00 - 11:30
A.M.

PROJECT PROFILE

PROBLEM STATEMENT

A Twitter monitoring system to watch customer behaviour toward brands by detecting
sentiment shifts, evaluating hot topics and geographic segmentation, and finding anomalies
such as scandals, all with the goal of increasing customer engagement and retention.

OBJECTIVE OF THE PROJECT

Our objective for the duration of this internship was to experiment with and formulate a
solution for fetching and processing massive datasets at production-level performance, and to
design a workflow that scales for efficient ML model training.

We attempt to do so by identifying and optimising bottlenecks and by analysing algorithms,
taking a ‘Real-Time Twitter Brand Sentiment Analyser’ as a use case.

This optimisation is followed by deploying the dockerised machine learning pipeline on the
Google Kubernetes Engine using the Google Cloud Platform, which ensures that the
application is scalable not only on the data-fetching end but also in handling client-side
requests.

One of the many practical applications of this optimised application would be handling
datasets for popular companies with a significant media presence (essential scalability).

Fig.4 PYTHON MULTIPROCESSING

Fig.5 GOOGLE KUBERNETES

Fig.6 DOCKER

FRAMEWORKS AND LIBRARIES

Language chosen:

Python

Due to the abundance of machine learning libraries available, along with Tweepy, Python
was chosen by the mentors as the primary language for writing the ML pipeline.

Framework for Data fetching from Twitter:

Tweepy

Database for dataset storage:

MySQL

MySQL is a DBMS that most of our team was familiar and comfortable with. The framework
has been updated and optimised over the years to be one of the fastest query-processing
database systems.

Framework for data pre-processing, inference generation and analysis of the data fetched
from Twitter and stored in the database:

NLTK (Natural Language Toolkit), a leading platform for building Python programs that work
with human language data, and TextBlob, a natural language processing library for processing
textual data that provides a simple API for standard NLP tasks.

Interactive Analytical Dashboard:

Plotly, an interactive, open-source, browser-based graphing library for Python.



User interface and Frontend:

HTML, CSS

For the pipeline to run identically to the way it does in our local environment, we will be
hosting a Docker image. Docker allows us to create separate environments on a single
operating system, making Docker images faster and lighter to run.

Hosting service:

Google Kubernetes Engine

The GitHub repository of the trained model is hosted along with the corresponding Docker
image on a Kubernetes cluster. GCP was chosen because Kubernetes provides consistent
scaling of an image based on the traffic received, along with auto-updates, load balancing
and maintenance.

COLLECTING TWITTER DATA USING STREAMING API

The objective of this part of the application is to filter tweets being posted in real time
based on the presence of specific keywords that are supplied as variables in a file
‘settings.py’. These tweets are then processed, saved in a MySQL database, and analysed
to draw conclusions that are displayed on the dashboard.

Pushing processed tweets into the DB

Fig.7

Settings.py file to change keywords

Fig.8

A Python library called Tweepy forms a significant part of this task. It is used to listen to the
streaming data and to process and save the tweets fetched using the Twitter API. The
StreamListener class is overridden by modifying the ‘on_status’ and ‘on_error’ methods. It
mainly accomplishes feature extraction and data preprocessing, followed by connecting to
and loading the data into a database instance (MySQL in our case).
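As a sketch of this pattern (using the Tweepy v3 StreamListener API that was current at the time; `save_tweet` is a hypothetical DB-writer and the exact cleaning rules are assumptions, not the project's actual code):

```python
import re

def extract_features(text):
    """Minimal pre-processing before storage: drop URLs and mentions (sketch)."""
    text = re.sub(r"https?://\S+|@\w+", "", text)
    return re.sub(r"\s+", " ", text).strip()

try:
    import tweepy  # Tweepy v3-style API; StreamListener was removed in v4

    class BrandStreamListener(tweepy.StreamListener):
        def on_status(self, status):
            # Clean the tweet text and hand it to a (hypothetical) DB writer.
            cleaned = extract_features(status.text)
            save_tweet(cleaned)  # hypothetical helper that INSERTs into MySQL

        def on_error(self, status_code):
            # Returning False on HTTP 420 disconnects to respect rate limits.
            return status_code != 420
except (ImportError, AttributeError):
    pass  # tweepy not installed (or v4+); the listener above is illustrative only
```

The two overridden methods mirror the ones named in the text: `on_status` handles each incoming tweet, and `on_error` decides whether to stay connected.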

Connecting to the DB and creating a table if it does not exist

Fig.9
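A minimal sketch of this step, assuming the `mysql-connector-python` package and a hypothetical `tweets` schema (the column names here are illustrative, not the report's actual schema):

```python
# Hypothetical schema mirroring the fields discussed in this report
# (tweet text, polarity, timestamp, user location); names are assumptions.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS tweets (
    id_str VARCHAR(255),
    created_at DATETIME,
    text VARCHAR(280),
    polarity FLOAT,
    user_location VARCHAR(255)
)
"""

def ensure_table(host, user, password, database):
    """Connect to MySQL and create the tweets table if it does not exist."""
    import mysql.connector  # imported lazily; requires mysql-connector-python
    conn = mysql.connector.connect(host=host, user=user,
                                   password=password, database=database)
    cur = conn.cursor()
    cur.execute(CREATE_TABLE_SQL)
    conn.commit()
    cur.close()
    conn.close()
```

The `IF NOT EXISTS` clause makes the call idempotent, so it can run safely on every application start.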

Pre-processing fetched tweets before pushing them into the DB

Fig.10
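One common pre-processing step in pipelines like this is stripping emojis and other non-ASCII characters so the text fits a plain VARCHAR column; a hedged sketch (function names and the 280-character cap are assumptions):

```python
def deemojify(text):
    """Strip emojis/non-ASCII so the text fits a plain MySQL VARCHAR column."""
    return text.encode("ascii", "ignore").decode("ascii")

def preprocess(status_text):
    """Remove emojis, then truncate to Twitter's 280-character limit (sketch)."""
    return deemojify(status_text)[:280]
```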

TWITTER SENTIMENT ANALYSIS AND INTERACTIVE DATA


VISUALIZATION

This step extracts the data stored in the MySQL database and analyses it by performing
Natural Language Processing operations. It also analyses the geographic segmentation of
the data for a particular keyword, using location-related features.

The analysis consists of the following sub-tasks:

• Sentiment analysis with TextBlob

The core of the sentiment analysis is using TextBlob to extract the polarity and
subjectivity from tweet texts, which is actually done during the data preprocessing (for
better data storage) in the previous chapter. Negative tweets are represented as -1,
positive tweets as +1, and neutral tweets as 0.

Fig.11
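The mapping from TextBlob's continuous polarity score to the three classes can be sketched as follows (thresholding at exactly zero is an assumption; TextBlob itself only supplies the polarity/subjectivity pair):

```python
def label_polarity(polarity):
    """Map TextBlob's continuous polarity in [-1, 1] to the -1/0/+1 classes."""
    if polarity > 0:
        return 1
    if polarity < 0:
        return -1
    return 0

try:
    from textblob import TextBlob

    def tweet_sentiment(text):
        s = TextBlob(text).sentiment  # namedtuple: (polarity, subjectivity)
        return label_polarity(s.polarity), s.subjectivity
except ImportError:
    pass  # TextBlob not installed; label_polarity still illustrates the mapping
```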

• Topic Tracking with Natural Language Processing using RE & NLTK

The entire text from all tweets is tokenized, stop words imported from NLTK are used as a
reference to remove commonly used words, and the ten most common words in the
frequency distribution of all words are extracted. Plotly Express is again used to visualize
the bar chart.

Fig.12
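The tokenize-filter-count step can be sketched with the standard library; `collections.Counter.most_common` mirrors `nltk.FreqDist.most_common`, and the small stop-word set here is a stand-in for `nltk.corpus.stopwords`:

```python
import re
from collections import Counter

# A tiny stop-word list stands in for nltk.corpus.stopwords (assumption).
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "it"}

def top_terms(texts, n=10):
    """Tokenize, drop stop words, and return the n most common words,
    mirroring nltk.FreqDist(...).most_common(n)."""
    tokens = []
    for text in texts:
        tokens += [w for w in re.findall(r"[a-z']+", text.lower())
                   if w not in STOP_WORDS]
    return Counter(tokens).most_common(n)
```

The resulting list of `(word, count)` pairs is exactly what the bar chart in Fig. 12 plots.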

• Geographic Segmentation Recognition with Text Processing

To explore users’ geographic distributions, locations need to be identified through users’
profiles rather than from the locations attached to tweets. All US state names and their
abbreviations are set up as constants for the abbreviation-to-name transformation. Plotly
(not Plotly Express) is used to visualize the USA map.

Fig.13
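A sketch of the abbreviation constants and the profile-location lookup (only three states are shown, and the matching rules are assumptions about how such free-text profile fields are typically resolved):

```python
# A few entries of the constant table described above (the full table has 50 states).
STATE_ABBREV = {"California": "CA", "Texas": "TX", "New York": "NY"}
ABBREV_STATE = {v: k for k, v in STATE_ABBREV.items()}

def state_code(user_location):
    """Resolve a free-text profile location to a two-letter state code, if any."""
    # First look for a bare abbreviation token ("Austin, TX" -> "TX").
    for token in user_location.replace(",", " ").split():
        if token.upper() in ABBREV_STATE:
            return token.upper()
    # Otherwise look for a full state name anywhere in the string.
    for name, code in STATE_ABBREV.items():
        if name.lower() in user_location.lower():
            return code
    return None
```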

MAKING FRONTEND OF THE APPLICATION

Using Dash and Plotly, the sentiment analyser is wrapped in a web application, and graphs
are used to summarise the sentiment analysis.

Fig.14 Most frequently encountered words in the tweets containing chosen keyword

Fig.15 Showing number of +ve, -ve and neutral tweets at a particular time

Fig.16 Geographic segmentation by number of tweets with the chosen keyword
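A minimal sketch of such a Dash wrapper (the component ids and the 10-second polling interval are assumptions; the pure helper shows the aggregation behind the positive/negative/neutral counts of Fig. 15):

```python
from collections import Counter

def sentiment_counts(labels):
    """Aggregate -1/0/+1 labels into the counts behind the sentiment chart."""
    c = Counter(labels)
    return {"positive": c[1], "neutral": c[0], "negative": c[-1]}

try:
    import dash
    from dash import dcc, html  # Dash >= 2.0 layout components

    app = dash.Dash(__name__)
    app.layout = html.Div([
        html.H2("Real-Time Twitter Brand Sentiment"),
        dcc.Graph(id="sentiment-trend"),              # line chart of +/0/- over time
        dcc.Interval(id="refresh", interval=10_000),  # poll the DB every 10 s
    ])
except ImportError:
    pass  # Dash not installed; the layout above is a sketch only
```

A callback bound to the `Interval` component would re-query MySQL and redraw the graphs, which is what makes the dashboard "live".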



OPTIMIZATION AND SCALING OF THE APPLICATION

This section is the most critical part of the application and is responsible for differentiating
it from the referenced base code. It is also the part where a significant number of our
original contributions lie. Documentations of the various libraries and frameworks were
extensively studied to identify possible processing bottlenecks in the functions used.
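One optimisation of this kind, suggested by Fig. 4, is parallelising the per-tweet pre-processing across CPU cores; a sketch using the standard library's `multiprocessing.Pool` (the cleaning function is a stand-in for the real pipeline step):

```python
from multiprocessing import Pool

def preprocess(text):
    """CPU-bound per-tweet cleaning step (stand-in for the real pipeline)."""
    return text.strip().lower()

def preprocess_batch(texts, workers=4):
    """Fan a batch of fetched tweets out across CPU cores with Pool.map."""
    with Pool(workers) as pool:
        return pool.map(preprocess, texts)

if __name__ == "__main__":
    print(preprocess_batch(["  Hello ", "WORLD"]))  # ['hello', 'world']
```

Because `Pool.map` preserves input order, the cleaned tweets can be bulk-inserted into MySQL in the same order they were fetched.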

While the base code was built on PostgreSQL, our application uses MySQL. MySQL was
chosen as it is relatively light on features and can thus focus more on speed and reliability.

Deployment of the application

• Maintaining Environment Isolation and Consistency

Fig.17

Docker is a software framework for building, running, and managing containers on servers
and in the cloud. It is a subset of the Moby project. “Docker” refers both to the tools (the
commands and a daemon) and to the Dockerfile format.

When you wanted to run a web application, you used to have to buy a server, install Linux,
set up a LAMP stack, and then run the programme. If your programme became popular, you
practised good load balancing by adding a second server to ensure that it didn’t crash under
excessive traffic.

However, instead of focusing on single servers, the Internet now relies on a system
known as “the cloud”, which consists of a collection of interconnected and redundant
computers. Thanks to inventions like Linux kernel namespaces and cgroups, the concept of a
server could be liberated from the limits of hardware and instead become simply a piece of
software. These software-based servers are called containers, and they are a hybrid of the
Linux OS they run on and a hyper-localized runtime environment (the contents of the
container).

Understanding Containers

Container technology can be thought of as three categories:

1. Builder: a tool or a set of tools used to create a container, such as distrobuilder for
LXC or a Dockerfile for Docker.
2. Engine: a programme that runs a container. In Docker, this refers to the docker
command and the dockerd daemon. Others may use this term to refer to the
containerd daemon and its commands (such as podman).
3. Orchestration: a container management system, such as Kubernetes or OKD.

Containers frequently supply both an application and its configuration, reducing the amount
of time a sysadmin must spend getting an application in a container to run compared to
installing the application from a traditional source. Docker Hub and Quay.io are both image
sources for container engines.
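A hypothetical Dockerfile for a pipeline like this one might look as follows (the file names, base image, and port are assumptions, not taken from the project):

```dockerfile
# Hypothetical Dockerfile for the sentiment pipeline (paths/files assumed)
FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Dash serves on port 8050 by default
EXPOSE 8050
CMD ["python", "app.py"]
```

Building this once (`docker build -t sentiment-analyser .`) yields an image that runs identically on a laptop or a cloud node, which is the environment-isolation property described above.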

The most appealing feature of containers, however, is their ability to gracefully “die” and
revive when load balancing requires it. Whether a container’s demise is caused by a crash or
because it is simply no longer needed when server traffic is low, containers are “cheap” to
start and are designed to appear and disappear seamlessly. Because containers are meant to
be ephemeral and to spawn new instances as needed, it is expected that monitoring and
management will be automated rather than done by a human in real time.

• Scaling the application and automating docker management

Fig.18
Kubernetes, a container-centric management platform, has become the de facto
standard for deploying and operating containerized applications as a result of the
broad adoption of containers among enterprises. First created at Google and released
as open source in 2014, Kubernetes was born on Google Cloud. It is based on Google’s
15 years of experience operating containerized workloads and on the open source
community’s essential contributions. Inspired by Google’s internal cluster management
system, Borg, Kubernetes simplifies the process of deploying and administering your
application. It increases your reliability by automating container orchestration and
reducing the time and resources spent on everyday operations.

Kubernetes simplifies container administration by automating operational activities
and including built-in commands for deploying applications, rolling out updates,
scaling applications up and down to meet changing needs, monitoring applications,
and more.
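For illustration, a minimal (hypothetical) Kubernetes Deployment manifest for such a container might look like this; the image path, replica count, and port are assumptions:

```yaml
# Hypothetical Deployment for the dockerised pipeline (image name assumed)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sentiment-analyser
spec:
  replicas: 2                # scaled up/down by an autoscaler as load changes
  selector:
    matchLabels:
      app: sentiment-analyser
  template:
    metadata:
      labels:
        app: sentiment-analyser
    spec:
      containers:
      - name: app
        image: gcr.io/PROJECT_ID/sentiment-analyser:latest
        ports:
        - containerPort: 8050
```

Applying this with `kubectl apply -f deployment.yaml` hands Kubernetes the desired state; it then handles the rollout, restarts, and scaling automatically.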

What Kubernetes is used for:

Kubernetes is a container-based platform for building applications that are simple to
maintain and deploy anywhere. When offered as a managed service, Kubernetes provides a
variety of options to satisfy your demands. Here are a few examples of common
applications.

Increasing development velocity:

Kubernetes is a container orchestration system that allows you to create cloud-native,
microservices-based apps. It also allows you to containerize existing programmes, making it
the cornerstone of application modernization and allowing you to develop apps more quickly.

Deploying applications anywhere:

Kubernetes is designed to be used anywhere, allowing you to operate your applications
on-premises, in the cloud, and in hybrid deployments. As a result, you can run your
applications wherever you choose.

Running efficient services:

Kubernetes can automatically adjust the size of the cluster required to run a service.
This allows you to scale your apps up and down dynamically based on demand and run
them efficiently.
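One common way to get this dynamic resizing is a HorizontalPodAutoscaler; the cluster-creation command later in this report enables the HorizontalPodAutoscaling add-on. A hedged sketch (the Deployment name and thresholds below are placeholders):

```yaml
# Hypothetical HorizontalPodAutoscaler: scales a Deployment between 1 and 5 replicas
# based on average CPU utilisation (all names here are placeholders)
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80
```

With this in place, Kubernetes adds pods when average CPU load exceeds the target and removes them when demand drops.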

KUBERNETES VS DOCKER:

Kubernetes and Docker are independent but complementary technologies for running
containerized applications, and they are often mistaken for one another.

Docker allows you to package everything you need to run your application into a container
that can be stored and opened as needed. Once you've started boxing up your apps, you'll
need a mechanism to manage them, which Kubernetes provides.

Kubernetes is a Greek word that roughly translates to "helmsman" or "pilot" in English.
Just as a helmsman is responsible for a ship's safe passage across the seas, Kubernetes
is responsible for carrying and delivering those boxes safely to locations where they
can be used.

1. Kubernetes is a container orchestration system that can be used with or
without Docker.
2. Because Docker isn't an alternative to Kubernetes, it's less of a "Kubernetes
vs. Docker" debate and more about containerizing your applications with
Docker and running them at scale with Kubernetes.
3. The distinction between Docker and Kubernetes lies in the roles they play in
containerizing and running applications.
4. Docker is an open industry standard for containerizing and delivering
programmes.
5. Kubernetes deploys, manages, and scales containerized applications built with
Docker.
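To make this division of labour concrete: Docker produces the image, and Kubernetes references that image when it schedules containers. A minimal sketch of a Pod spec (the names and image path are placeholders, not taken from this project):

```yaml
# Hypothetical Pod spec: Kubernetes pulls and runs a Docker image from a registry
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
  - name: demo-container
    image: gcr.io/example-project/demo-app:latest   # image built and pushed with Docker
```

Docker's job ends once the image is in the registry; from there Kubernetes pulls it, runs it, and keeps it running.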

• Cloud service used to host the application

Fig.19

Google Cloud Platform (GCP), offered by Google, is a suite of cloud computing
services that runs on the same infrastructure that Google uses for its end-user
products.
Google Kubernetes Engine (GKE) is an implementation of Google's open-source
Kubernetes on the Google Cloud Platform.

Google Cloud Platform is a collection of Google's public cloud computing services.
The platform comprises a variety of Google-hosted services for computation, storage,
and application development. Software developers, cloud administrators, and other
enterprise IT experts can use Google Cloud Platform services over the public internet
or a dedicated network connection.
Google Cloud Platform includes cloud management, security, and developer tools, as
well as computation, storage, networking, big data, machine learning, and the Internet
of Things (IoT). Google Cloud Platform's primary cloud computing services include:

Google Compute Engine is an infrastructure-as-a-service (IaaS) platform that allows
users to host workloads on virtual machine instances.

Google App Engine is a platform-as-a-service (PaaS) product that provides software
developers with scalable hosting from Google. Developers can also use a software
development kit (SDK) to create App Engine-compatible software.

Google Cloud Storage is a cloud-based storage platform for large volumes of
unstructured data. Cloud Datastore for NoSQL nonrelational storage, Cloud SQL for
fully relational MySQL storage, and Google's native Cloud Bigtable database are also
available from Google.

Google Container Engine is a Docker container management and orchestration
solution that operates on Google's public cloud. The Google Container Engine
orchestration engine is based on Kubernetes.

Google Cloud Platform also provides services for application development
and integration. Google Cloud Pub/Sub, for example, is a managed, real-time
messaging service that allows applications to exchange messages. Furthermore,
Google Cloud Endpoints enables developers to build RESTful API-based services
and make them available to Apple iOS, Android, and JavaScript clients. Anycast DNS
servers, direct network interconnects, load balancing, monitoring, and logging
services are among the other services available.

DEPLOYMENT OF THE APPLICATION

Fig.20

STEPS FOR DEPLOYMENT:

1. Go to your GCP console

2. Activate billing in the respective GCP account and set up a service account

3. Enable the Compute Engine API, SQL API, Service API, Container Registry API and
Google Kubernetes Engine API, since these APIs are used in the deployment of our
application.

4. Create a new project in the GCP console. All the resources relating to this
deployment will be hosted in this project.
5. Create a Cloud SQL instance. The base code used in the repository uses MySQL
version 8.0, so use that version. Set up the open connection and copy the IP of the
SQL instance once it has been created. Modify the env.hostname variable in both the
fetcher and the display script to enable connection to this SQL instance.

6. Create a database named ‘Twitter’ in this SQL instance using the Cloud SQL Shell.

7. Now, once the database has been created, we will deploy the fetcher and the
display script by creating a Kubernetes cluster and then deploying and exposing
our application.

8. First, for the fetcher script, run all these commands in the Cloud Shell:
▪ Clone the code from GitHub using the following command:
git clone --branch v2 https://github.com/madhavgupta211/hm-data-fetch.git data-fetcher
▪ Go to the created directory using:
cd data-fetcher
▪ Set the project variable using the following commands:
export PROJECT_ID="PROJECT_NAME"
gcloud config set project "PROJECT_NAME"
▪ Build the image specified in the Dockerfile using the following command:
docker build --no-cache -t gcr.io/${PROJECT_ID}/fetcher-app:latest .
▪ Push the image to the Google Container Registry using the following command:
docker push gcr.io/${PROJECT_ID}/fetcher-app:latest
▪ The image has now been built and pushed to the Container Registry; it will be
deployed once the cluster has been created.

9. Next, for the display script, run all these commands in the Cloud Shell:
▪ Clone the code from the GitHub repository using the following command:
git clone --branch Final-Deployment https://github.com/madhavgupta211/hm-data-disp.git dashboard-app

▪ Go to the created directory dashboard-app using:
cd dashboard-app
▪ Set the project variable using the following commands:
export PROJECT_ID="PROJECT_NAME"
gcloud config set project "PROJECT_NAME"
▪ Build the image specified in the Dockerfile using the following command:
docker build --no-cache -t gcr.io/${PROJECT_ID}/dashboard-app:latest .
▪ Push the image to the Google Container Registry using the following command:
docker push gcr.io/${PROJECT_ID}/dashboard-app:latest
▪ Create the cluster, consisting of 2 pods on which the fetcher and display
containers will be deployed and run, using the following command:
gcloud beta container --project "PROJECT_NAME" clusters create "dashboard-web" \
  --zone "us-central1-a" --no-enable-basic-auth \
  --cluster-version "1.19.9-gke.1900" --machine-type "g1-small" \
  --image-type "COS" --disk-type "pd-standard" --disk-size "30" \
  --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
  --num-nodes "1" --no-enable-cloud-logging --no-enable-cloud-monitoring \
  --enable-ip-alias \
  --network "projects/PROJECT_NAME/global/networks/default" \
  --subnetwork "projects/PROJECT_NAME/regions/us-central1/subnetworks/default" \
  --default-max-pods-per-node "110" \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing \
  --enable-autoupgrade --enable-autorepair
▪ Get the credentials of the cluster using the following command:
gcloud container clusters get-credentials dashboard-web --zone us-central1-a --project PROJECT_NAME
▪ Create the pods by applying the DashboardPod.yaml manifest:
kubectl create -f DashboardPod.yaml
▪ Get the status of the pods using the following command:
kubectl get pods -o wide
Wait until the status of both pods is 1/1 (this might take a couple of minutes).
▪ Expose the deployment using the following command:
kubectl expose deployment dashboard-app --type=LoadBalancer --port 80 --target-port 808
▪ Get the external IP where the dashboard can be viewed using:
kubectl get service
The external IP shown is the IP address of the dashboard.
▪ The display script has now been deployed.
▪ Finally, deploy the fetcher script using the following command:
kubectl create deployment fetcher-app --image=gcr.io/${PROJECT_ID}/fetcher-app:latest

Finally, the scripts have been deployed and are running. The dashboard can be viewed
at the external IP.
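The DashboardPod.yaml manifest referenced in the steps above is not reproduced in this report. A hedged reconstruction of what such a manifest might contain (the labels, replica count, and container port are assumptions, not taken from the repository):

```yaml
# Hypothetical reconstruction of DashboardPod.yaml (not the actual file from the repository)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dashboard-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dashboard-app
  template:
    metadata:
      labels:
        app: dashboard-app
    spec:
      containers:
      - name: dashboard-app
        image: gcr.io/PROJECT_NAME/dashboard-app:latest
        ports:
        - containerPort: 808   # must match the --target-port used in kubectl expose
```

Whatever the actual file contains, the deployment name it creates must match the name used in the `kubectl expose deployment dashboard-app` command.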

Fig.21

The code for the fetcher and the dashboard is stored in the GitHub repositories
hm-data-fetch and hm-data-disp respectively, and is pulled from GitHub into a new
project created on GCP. The Docker images are then built inside our project using
the config files present in the repositories, and the images for both scripts are
pushed to the Google Container Registry.

After that, a Kubernetes cluster is created, and the required images are pulled from
GCR and deployed on GKE.

Finally, the dashboard app deployment in the cluster is exposed via an external IP
address for clients to access.
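The image-naming convention used throughout this deployment can be sketched in the shell (the project ID below is a placeholder; substitute your real GCP project ID):

```shell
# Hypothetical sketch: how the Container Registry image tags above are composed
PROJECT_ID="example-project"   # placeholder; gcloud config set project uses the same value
FETCHER_IMAGE="gcr.io/${PROJECT_ID}/fetcher-app:latest"
DASHBOARD_IMAGE="gcr.io/${PROJECT_ID}/dashboard-app:latest"
echo "${FETCHER_IMAGE}"
echo "${DASHBOARD_IMAGE}"
```

Both `docker push` and the `kubectl create deployment --image=` commands above consume tags built in exactly this `gcr.io/<project>/<name>:<tag>` form.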

CONCLUSION

This application has been developed as two separate versions. The first version
consists of the base code with minor modifications and serves as a reference for
testing performance after improvements. The second and critical version is version 2,
which houses the optimizations introduced by the team and is hosted on a Google
Kubernetes Engine cluster on the Google Cloud Platform.

❖ With the use case "Real-Time Brand Sentiment Analyzer for Brand Improvement
and Topic Tracking", we achieved a more scalable dashboard and increased
efficiency after introducing multi-threading and parallel processing.

❖ The roadblocks we faced were handled adequately and gave us productive
experience of software development, pushing us to come up with innovative
solutions that were time- and cost-efficient.

❖ The deployment of this application also gave us a flavor of DevOps.

❖ Although we could not integrate near-real-time streaming using Kafka and Spark
due to time constraints, we gained valuable knowledge about these technologies.

REFERENCES

• Docker: Containerized applications. http://www.docker.com/
• Kubernetes. https://www.youtube.com/watch?v=ior4nwIv3JQ
• Tweepy: Python library to connect to Twitter APIs. https://www.tweepy.org/
• Python environment in Docker and integration with VSCode. https://www.youtube.com/watch?v=k8H0KCtsTR8
