
Twitter Sentiment Analysis Using

Machine Learning

A project report
Submitted in partial fulfillment of the requirements for the award of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

Submitted by

J. BABY 20EM1A0529
L.SUDHAKAR 20EM1A0557
K.DIVAKAR 20EM1A0548
K.JAHNAVI 20EM1A0538

Under the guidance of


Mr. K. KARUNAKAR
M. Tech., (Ph.D)
Associate Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


SWARNANDHRA INSTITUTE OF ENGINEERING AND TECHNOLOGY
(Approved by AICTE, Accredited by NAAC & Affiliated to JNTU Kakinada)
Seetharampuram, Narsapur-534275, West Godavari (Dist.), AP.
2020 - 2024
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SWARNANDHRA INSTITUTE OF ENGINEERING AND TECHNOLOGY
(Approved by AICTE, Accredited by NAAC & Affiliated to JNTU Kakinada)
Seetharampuram, Narsapur-534275, West Godavari (Dist.), AP.

CERTIFICATE

This is to certify that the project report entitled "Twitter Sentiment Analysis Using
Machine Learning" is a bonafide work done by J. BABY (20EM1A0529),
L. SUDHAKAR (20EM1A0557), K. JAHNAVI (20EM1A0538), and K. DIVAKAR
(20EM1A0548), submitted in partial fulfillment of the requirements for the award of the
degree of Bachelor of Technology in the Department of COMPUTER SCIENCE AND
ENGINEERING, during the academic year 2020-2024.

Project Guide                          Head of the Department


Mr. K. Karunakar,                      Ms. P. Haritha,
M.Tech., (Ph.D)                        M.Tech., (Ph.D)
Associate Professor                    Associate Professor

External Examiner
DECLARATION

I hereby declare that the entire project work embodied in this dissertation entitled "Twitter
Sentiment Analysis Using Machine Learning" has been independently carried out by us.
To the best of our knowledge, no part of this work has previously been submitted for any
degree at any institution, university, or organization.

TEAM MEMBERS

J. BABY 20EM1A0529
L.SUDHAKAR 20EM1A0557
K.DIVAKAR 20EM1A0548
K.JAHNAVI 20EM1A0538
ACKNOWLEDGEMENT
I extend my heartfelt gratitude and sincere thanks to the Management Members of our
college for making the necessary arrangements for doing the project.

I would like to express my gratitude to our Principal, Dr. P. Pandarinath, M.Tech.,
Ph.D., Professor & Principal, for his timely suggestions.

I would like to express my grateful thanks to Ms. P. Haritha, M.Tech., (Ph.D),
Assoc. Professor & HOD, Computer Science and Engineering, for her valuable
suggestions and guidance regarding the software analysis and design, and also for her
continuous efforts toward the successful completion of the project.

My deep gratitude to my internal guide Mr. K. Karunakar, M.Tech., (Ph.D),
Associate Professor, CSE; I thank him for his dedication, guidance, counsel, and keen
interest at every stage of the project.

I would like to express my deep indebtedness and wholehearted thanks to all the
faculty of the CSE Department for their full-fledged support and encouragement.

Last but not least, my special thanks to my parents for their support. I am indebted
to all the people who have contributed in some way or another to the completion of this
project work.

My gratitude to SWARNANDHRA INSTITUTE OF ENGINEERING
AND TECHNOLOGY for providing the opportunity to do our project.

TEAM MEMBERS

J. BABY 20EM1A0529
L.SUDHAKAR 20EM1A0557
K.DIVAKAR 20EM1A0548
K.JAHNAVI 20EM1A0538
INDEX

S. No.  Contents

1  INTRODUCTION
   1.1  Introduction
   1.2  Problem Statement
   1.3  Existing Solutions
   1.4  Proposed Solution
   1.5  Methodology
   1.6  Organization

2  LITERATURE REVIEW
   2.1  Sign Language Recognition System to aid Deaf-dumb People Using PCA
   2.2  Hand Gesture Recognition for Deaf People Interfacing
   2.3  An American Sign Language Detection System using HSV Color Model and Edge Detection
   2.4  Real Time Hand Gesture Recognition Using Different Algorithms Based on American Sign Language

3  SYSTEM DEVELOPMENT
   3.1  Dataset
   3.2  Algorithms
   3.3  Application Screenshots
   3.4  ML Pipeline

4  PERFORMANCE ANALYSIS
   4.1  Performance Analysis
   4.2  Constraints

5  CONCLUSIONS
   5.1  Future Scope
   5.2  Applications

6  REFERENCES
LIST OF TABLES

Table No.  Table Name

3.1  Key points for gestures
3.2  Sample values from the dataset
4.1  K-NN evaluation report
4.2  SVM evaluation report
4.3  Decision tree evaluation report
LIST OF FIGURES

Figure No.  Figure Name

1.1   Some Sample Applications
1.2   Gesture gloves
1.3   Our proposed application screenshot
1.4   Machine Learning pipeline
1.5   K-NN example (green dot is the sample that is to be classified)
1.6   SVM example (H1 doesn't separate the classes, H2 does but the distances are not even, whereas H3 separates the classes with maximum distances on either side)
1.7   Decision Tree model representation
3.1   American Sign Language Gestures
3.2   Hand Landmarks
3.3   Key points over hand for the gesture 'a'
3.4   k-Nearest Neighbors example
3.5   Support Vector Machine example
3.6   Application working on gesture 'a'
3.7   Application working on gesture 's'
3.8   Application working on gesture 'v'
3.9   Application working on gesture 'm'
3.10  Model Pipeline
4.1   k-NN evaluation report
4.2   SVM evaluation report
4.3   Decision tree evaluation report
Twitter Sentiment Analysis
Using Machine Learning
ABSTRACT

With 500 million tweets sent each day, that is, about 6,000 tweets generated
every second, Twitter is the most popular micro-blogging site, allowing users
to express their views and opinions in 280 characters. As companies and
political leaders take to the online social media platform to establish and
develop their brand, one cannot ignore the amount of data being generated
on Twitter.

The proposed system aims to extract and analyze tweets, classify
them as positive or negative with the help of machine learning techniques
and algorithms, and finally subject them to performance evaluation.
On November 8, 2016, in a television broadcast, Prime Minister Narendra
Modi declared all 500- and 1000-rupee notes invalid in an effort to curb
black money and fake notes.

The demonetization dataset, extracted from Twitter using the
Twitter API, is pre-processed using Scikit-learn and then subjected to
algorithms such as neural networks, recurrent neural networks, and
Support Vector Machines.

These executions are compared to determine which algorithm works
best for the given dataset in terms of recall, accuracy, and F1 score.
Chapter I
INTRODUCTION

1.1 CONCEPTUAL STUDY OF THE PROJECT

Data science is a multi-disciplinary field that uses scientific


methods, processes, algorithms and systems to extract knowledge and
insights from data in various forms, both structured and unstructured,
similar to data mining.

Data science is a "concept to unify statistics, data analysis,


machine learning and their related methods" in order to "understand and
analyze actual phenomena" with data. It employs techniques and theories
drawn from many fields within the context of mathematics, statistics,
information science, and computer science.

Turing award winner Jim Gray imagined data science as a "fourth
paradigm" of science (empirical, theoretical, computational, and now
data-driven) and asserted that "everything about science is changing
because of the impact of information technology" and the data deluge.

In 2012, when Harvard Business Review called it "The Sexiest
Job of the 21st Century", the term "data science" became a buzzword. It
is now often used interchangeably with earlier concepts like business
analytics, business intelligence, predictive modelling, and statistics. Even
the suggestion that data science is sexy was paraphrasing Hans Rosling,
featured in a 2011 BBC documentary with the quote, "Statistics is now the
sexiest subject around." Nate Silver referred to data science as a sexed-up
term for statistics. In many cases, earlier approaches and solutions are now
simply rebranded as "data science" to be more attractive, which can cause
the term to become "diluted beyond usefulness." While many university
programs now offer a data science degree, there exists no consensus on a
definition or suitable curriculum contents. To its discredit, however, many
data-science and big-data projects fail to deliver useful results, often as a
result of poor management and utilization of resources.
The process of learning begins with observations or data, such as examples,
direct experience, or instruction, in order to look for patterns in data and
make better decisions in the future based on the examples that we provide.
The primary aim is to allow computers to learn automatically, without
human intervention or assistance, and adjust their actions accordingly.
1.2 OBJECTIVES OF THE PROJECT

Twitter is a popular micro-blogging service in which users post
status messages, called "tweets", originally limited to 140 characters
(now 280). Millions of statuses appear on the social network every day,
and in most cases users enter messages with far fewer characters than the
established limit. Twitter represents one of the largest and most dynamic
datasets of user-generated content: approximately 200 million users
post 400 million tweets per day.
Tweets can express opinions on different topics, which can help
to direct marketing campaigns, to share consumers' opinions concerning
brands and products, to track outbreaks of bullying and events that
generate insecurity, to predict polarity in political and sports discussions,
and to gauge acceptance or rejection of politicians, all in an electronic
word-of-mouth way. In such application domains, one deals with large
text corpora and, most often, informal language. At least two specific
issues should be addressed in any type of computer-based tweet analysis:
firstly, the frequency of misspellings and slang in tweets is much higher
than in other domains.
Secondly, Twitter users post messages on a variety of topics,
unlike blogs, news, and other sites, which are tailored to specific topics.
Big challenges are faced in tweet sentiment analysis: neutral tweets are
far more common than positive and negative ones. This differs from
other sentiment analysis domains, which tend to be predominantly
positive or negative; moreover, tweets are very short and often show
limited sentiment cues.
In this paper, we used a data set of more than 7,000 tweets for
training classifiers. We built a model that classifies tweets collected
from the Twitter APIs into the positive class or the negative class. The
model runs in three steps: a classifier categorizes tweets into objective
or subjective tweets, another classifier organizes subjective tweets into
positive or negative, and finally the system summarizes the tweets in a
visual graph. Our experiments proved to be highly accurate. Related
work on tweet sentiment analysis is rather limited, but the initial
results are promising.
Our main contribution is a sentiment analysis model based on
supervised learning methods such as Random Forest, logistic regression,
and Support Vector Machines for enhancing effective classification.
1.3 SCOPE OF THE PROJECT

The project will be helpful to companies and political parties as well
as to the general public. It will help a political party review a program
it is going to conduct or one it has already conducted.

Similarly, companies can gather reviews of their newly released
hardware or software products.
Chapter II
LITERATURE REVIEW

Sentiment analysis deals with identifying and classifying opinions
or sentiments present in source text. Social media generates a huge
amount of sentiment-rich data in the form of tweets, status updates,
reviews, blog posts, etc. Sentiment analysis of this user-generated data
is very useful in knowing the opinion of the crowd. Twitter sentiment
analysis is arduous compared to basic sentiment analysis due to the
presence of slang words and misspellings. Tweets were originally limited
to 140 characters (now 280). A machine learning approach can be used
for analyzing sentiments from the text.

Some sentiment analyses have been performed on Twitter posts
about electronic products like cell phones and computers using a
machine learning approach. By performing sentiment analysis in a
specific domain, it is possible to identify the effect of domain information
on sentiment classification. The authors presented a new feature vector
for classifying tweets as positive, negative, or neutral and for extracting
people's opinions about products.

Another study pre-processed the dataset, extracted from it the
adjectives that carry substantial meaning (called the feature vector),
selected the feature-vector list, and then applied machine learning
algorithms such as RNNs, neural networks, and SVM, along with
Semantic Orientation based WordNet, which extracts synonyms and
relations for the content features. Finally, they measured the performance
of the classifiers in terms of recall, precision, and accuracy.
Some researchers took an approach in which posted tweets from the
Twitter micro-blogging site are subjected to preprocessing and classified,
based on their emotional content, as positive, negative, neutral, or
irrelevant; they compared the performance of various classification
algorithms based on their precision and recall. Further, their paper also
discusses the applications of this research and its limitations.

Some work in this field included experiments with mood
classification on blog posts. One of the studies also deals with
aspect-based opinion polling from unlabeled, free-form textual
customer reviews without requiring customers to answer any questions.

The tweet retrieval process needs access tokens from the Twitter
developer site and a piece of code that performs the operation of
retrieving those tweets; Python is used as the base language.
Chapter III
PROBLEM DEFINITION

The problem in sentiment analysis is classifying the polarity of a
given text at the document, sentence, or feature level: whether the
expressed opinion in a document, a sentence, or an entity feature is
positive, negative, or neutral.

EXISTING SYSTEM

The existing system works only on datasets constrained to a
particular topic. Existing systems also do not determine the measure of
impact the results can have on the particular field under consideration,
and they do not allow retrieval of data based on a query entered by the
user; their scope is constrained. In simple words, they work on static
data rather than dynamic data. Unsupervised algorithms like Vector
Quantization are used for data compression, pattern recognition, facial
and speech recognition, etc., and therefore cannot be used for
determining sentiment in Twitter data.
INTRODUCTION TO PYTHON

Page 1: Introduction to Python Programming Language

Python is a high-level, interpreted programming language known for its


simplicity, readability, and versatility. Created by Guido van Rossum and first
released in 1991, Python has since become one of the most popular
programming languages worldwide, embraced by developers, data scientists,
educators, and professionals across various domains. This introduction provides
an overview of Python, its features, history, and applications, highlighting its
importance in modern software development and beyond.

Page 2: Features of Python

Python is renowned for its rich set of features that make it well-suited for a wide
range of programming tasks. Some key features of Python include:

Simple and Readable Syntax: Python's syntax emphasizes readability and


simplicity, making it easy to learn and understand. Its use of indentation for block
structure promotes clean and organized code.

High-level Language: Python abstracts low-level details and provides high-level


constructs, enabling developers to focus on solving problems rather than dealing
with memory management or hardware-specific operations.

Interpreted and Interactive: Python is an interpreted language, meaning that


code is executed line by line by an interpreter. This allows for rapid prototyping,
interactive development, and quick feedback loops.
Dynamic Typing: Python is dynamically typed, meaning that variable types are
determined at runtime rather than statically declared. This flexibility simplifies
coding and promotes rapid development but requires careful attention to variable
types.

Multi-paradigm: Python supports multiple programming paradigms, including


procedural, object-oriented, and functional programming styles. Developers can
choose the most appropriate paradigm for their problem domain or combine
different paradigms as needed.

Extensive Standard Library: Python comes with a comprehensive standard library


that provides a wide range of modules and functions for common programming
tasks. This rich ecosystem reduces the need for external dependencies and
facilitates rapid development.
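
As a small illustration of the features above, the following minimal Python sketch shows dynamic typing, a functional-style list comprehension, and an object-oriented class in a few lines:

# Dynamic typing: the same name may hold different types at runtime.
x = 42            # an int
x = "forty-two"   # now a str

# Functional style using a list comprehension and a built-in function.
squares = [n ** 2 for n in range(5)]
total = sum(squares)   # 0 + 1 + 4 + 9 + 16 = 30

# Object-oriented style.
class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return f"Hello, {self.name}!"

print(Greeter("Twitter").greet())   # prints: Hello, Twitter!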

Page 3: History of Python

Python's origins trace back to the late 1980s when Guido van Rossum, a Dutch
programmer, began working on a new programming language as a successor to
the ABC language. The first version of Python, Python 0.9.0, was released in
1991, featuring core concepts such as exception handling, functions, and
modules. Over the years, Python underwent significant evolution and
refinement, with major releases introducing new features, enhancements, and
optimizations. Key milestones in Python's history include:

Python 1.0 (1994): The first official release of Python 1.0 included features such
as lambda, map, filter, and reduce functions, as well as support for functional
programming constructs.
Python 2.0 (2000): Python 2 introduced list comprehensions, garbage
collection, and a unified object model. It became the predominant version of
Python used in production environments for many years.

Python 3.0 (2008): Python 3 represented a significant overhaul of the language,


introducing backward-incompatible changes to address long-standing design
flaws and inconsistencies. Despite initial resistance, Python 3 gradually gained
adoption and became the recommended version for new projects.

Page 4: Applications of Python

Python's versatility and ease of use have led to its widespread adoption across
various domains and industries. Some common applications of Python include:

Web Development: Python is widely used for web development, with


frameworks such as Django, Flask, and Pyramid facilitating the creation of
dynamic, scalable web applications.

Data Science and Machine Learning: Python has emerged as a dominant language
in the field of data science and machine learning, thanks to libraries like NumPy,
pandas, scikit-learn, and TensorFlow. These libraries provide powerful tools for
data analysis, visualization, and predictive modeling.
Scientific Computing: Python is popular among scientists and researchers for its
extensive ecosystem of libraries and tools tailored for scientific computing,
including SciPy, Matplotlib, and Jupyter.

Automation and Scripting: Python's simplicity and versatility make it well-suited
for automating repetitive tasks, system administration, and scripting. It is
commonly used for writing scripts to automate workflows, manage
infrastructure, and perform batch processing.

Game Development: Python is increasingly used in game development, both for


scripting game logic and for developing game engines and tools. Libraries like
Pygame and Panda3D provide frameworks for building games in Python.

Page 5: Conclusion

Python's popularity and influence continue to grow, driven by its simplicity,


versatility, and vibrant ecosystem. Whether you're a beginner learning to code, a
seasoned developer building complex systems, or a data scientist analyzing vast
datasets, Python offers a wealth of tools and resources to meet your needs. As
Python continues to evolve and adapt to new challenges and opportunities, it
remains an essential tool in the toolkit of modern software development,
powering innovation and creativity across industries and disciplines.
INTRODUCTION TO PANDAS:

Pandas is a powerful and widely-used open-source library for data manipulation


and analysis in Python. It provides high-performance, easy-to-use data structures
and tools for working with structured data, such as tabular data, time series, and
relational databases.

Developed by Wes McKinney and first released in 2008, Pandas has become an
essential tool for data scientists, analysts, and developers working with
data-intensive applications.

Key Features of Pandas:

Data Structures: Pandas introduces two main data structures: Series and
DataFrame. A Series is a one-dimensional array-like object that can hold various
data types, while a DataFrame is a two-dimensional labeled data structure
resembling a table or spreadsheet.

Data Manipulation: Pandas provides a rich set of functions and methods for
manipulating and transforming data. Users can perform tasks such as filtering,
sorting, grouping, aggregating, merging, and reshaping data with ease.

Missing Data Handling: Pandas offers robust support for handling missing or
incomplete data, providing functions for detecting, removing, and imputing
missing values in datasets.

Time Series Analysis: Pandas includes functionality for working with time series
data, including date/time indexing, time zone handling, resampling, and frequency
conversion.
Data Input/Output: Pandas supports a variety of file formats for input and output
operations, including CSV, Excel, SQL databases, JSON, HDF5, and more. It
simplifies the process of reading and writing data from/to external sources.
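
A minimal sketch of these structures and operations; the column names and values here are illustrative, not taken from our dataset:

import pandas as pd

# A Series: one-dimensional labeled array.
sentiments = pd.Series(["positive", "negative", "positive"])

# A DataFrame: two-dimensional labeled table.
df = pd.DataFrame({
    "tweet": ["Great product!", "Terrible service", None],
    "label": [1, 0, 1],
})

df = df.dropna()                             # missing-data handling
counts = df.groupby("label").size()          # grouping and aggregation
df.to_csv("tweets_clean.csv", index=False)   # writing to a CSV file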
Page 2: Applications of Pandas

Pandas is widely used across various domains and industries for data analysis,
manipulation, and visualization. Some common applications of Pandas include:

Data Exploration and Cleaning: Pandas is commonly used for exploring


and cleaning datasets before analysis. Its functions for handling missing
data, filtering outliers, and transforming variables streamline the data
preprocessing pipeline.

Data Analysis and Statistics: Pandas provides powerful tools for performing
descriptive and inferential statistics on datasets, including summary statistics,
correlation analysis, hypothesis testing, and regression analysis.

Time Series Analysis: Pandas' support for time series data makes it well-suited
for analyzing and visualizing temporal data, such as financial market data, sensor
data, and weather data. Users can easily perform tasks such as resampling,
rolling window calculations, and time series decomposition.

Machine Learning and Data Modeling: Pandas integrates seamlessly with other
libraries in the Python ecosystem, such as scikit-learn, TensorFlow, and
PyTorch, for building machine learning models and conducting predictive
analytics tasks. It serves as a crucial tool for preparing data for model training
and evaluation.
Data Visualization: While Pandas itself does not provide visualization
capabilities, it integrates well with libraries like Matplotlib, Seaborn, and Plotly
for creating insightful visualizations of data. Users can quickly generate plots,
charts, and graphs to explore and communicate their findings effectively.

In summary, Pandas is a versatile and powerful library that simplifies data


manipulation, analysis, and visualization tasks in Python. Its intuitive interface,
rich functionality, and broad applicability make it an indispensable tool for
anyone working with data in Python. Whether you're a data scientist, analyst,
researcher, or developer, Pandas provides the tools you need to extract valuable
insights from your data and drive informed decision- making.

Introduction to Scikit-learn

Scikit-learn, often abbreviated as sklearn, is an open-source machine learning
library for Python. It provides simple and efficient tools for data mining and data
analysis, built on top of other scientific computing libraries such as NumPy,
SciPy, and matplotlib. Scikit-learn is designed to be user-friendly, accessible to
non-experts, and versatile enough to handle various machine learning tasks,
including classification, regression, clustering, dimensionality reduction, and
more.

Key Features of Scikit-learn:


Unified Interface: Scikit-learn provides a consistent API and interface for working
with different machine learning algorithms. This uniformity simplifies the process
of switching between algorithms and experimenting with various techniques.

Wide Range of Algorithms: Scikit-learn includes a comprehensive collection of


machine learning algorithms, including both supervised and unsupervised
methods. These algorithms cover a broad spectrum of tasks and domains,
allowing users to choose the most suitable approach for their specific problem.

Model Evaluation and Validation: Scikit-learn offers tools for evaluating and
validating machine learning models through techniques such as cross-validation,
hyperparameter tuning, model selection, and performance metrics calculation.

Integration with NumPy and SciPy: Scikit-learn seamlessly integrates with other
Python libraries such as NumPy and SciPy, leveraging their capabilities for
numerical computing, linear algebra, optimization, and statistical analysis.

Support for Data Preprocessing: Scikit-learn provides functions and utilities for
preprocessing and transforming raw data before feeding it into machine learning
algorithms. This includes tasks such as feature scaling, normalization, encoding
categorical variables, and handling missing values.
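
The uniform fit/predict interface described above looks like the following in practice; this is a toy sketch with made-up data rather than a real workload:

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.9]]   # toy feature vectors
y = [0, 1, 1, 0]                                        # toy class labels

scaler = StandardScaler().fit(X)                 # preprocessing utility
model = LogisticRegression().fit(scaler.transform(X), y)
print(model.predict(scaler.transform([[0.8, 0.2]])))   # expected: [1]

Swapping LogisticRegression for another estimator, such as an SVM, leaves the rest of the code unchanged; this is the practical benefit of the unified interface.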
Page 2: Introduction to Matplotlib

Matplotlib is a comprehensive library for creating static, interactive, and
publication-quality visualizations in Python. It provides a flexible and powerful
interface for generating a wide range of plots, charts, and graphs, suitable for
various data visualization tasks. Originally developed by John D. Hunter and
released in 2003, Matplotlib has since become one of the most widely-used
plotting libraries in the Python ecosystem.

Key Features of Matplotlib:

Wide Range of Plot Types: Matplotlib supports a diverse range of plot types,
including line plots, scatter plots, bar plots, histograms, heatmaps, contour
plots, and more. Users can create highly customizable visualizations to
represent their data effectively.

Publication-Quality Output: Matplotlib produces high-quality, publication-ready


graphics suitable for inclusion in scientific papers, reports, presentations, and
publications. Users have fine-grained control over the appearance and styling of
their plots, including colors, fonts, labels, and annotations.

Interactive Plotting: Matplotlib provides support for interactive plotting and


exploration through interactive backends such as Qt, GTK, and Tkinter. Users
can zoom, pan, rotate, and interactively manipulate plots to explore their data
and gain deeper insights.

Integration with Jupyter Notebooks: Matplotlib integrates seamlessly with Jupyter


Notebooks, allowing users to create inline plots directly within the notebook
environment. This facilitates reproducible research and interactive data analysis
workflows.
Customizability and Extensibility: Matplotlib offers extensive customization
options and flexibility to adapt plots to specific requirements. Users can
customize every aspect of their plots, from the axis scales and ticks to the legend
placement and grid lines.
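
For example, a basic customized bar chart; this is a minimal sketch, and the counts shown are illustrative rather than results from our dataset:

import matplotlib.pyplot as plt

labels = ["positive", "negative", "neutral"]
counts = [4200, 1800, 1000]   # illustrative numbers only

plt.bar(labels, counts, color=["green", "red", "gray"])
plt.title("Tweet Sentiment Distribution")
plt.xlabel("Sentiment class")
plt.ylabel("Number of tweets")
plt.savefig("sentiment_distribution.png")   # high-quality file output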

In summary, Scikit-learn and Matplotlib are two essential libraries in the Python
ecosystem for machine learning and data visualization, respectively. Together,
they provide powerful tools for exploring, analyzing, and visualizing data,
enabling users to build sophisticated machine learning models and gain valuable
insights from their data. Whether you're a data scientist, researcher, student, or
developer, Scikit-learn and Matplotlib offer the tools you need to tackle a wide
range of machine learning and data visualization tasks effectively.

PROPOSED SYSTEM

SOFTWARE REQUIREMENTS

• Operating system: Windows 7 or above
• Tool: Anaconda Navigator (64-bit)
• Scripting tool: Jupyter Notebook
• Language: Python 3

1. Modules

Social media platforms like Twitter have become invaluable sources of real-time
data for understanding public opinion, market trends, and societal issues.
Sentiment analysis on
Twitter data involves the automatic classification of tweets into positive,
negative, or neutral categories based on the sentiment expressed by users. This
paper focuses on the application of machine learning techniques for sentiment
analysis on Twitter, highlighting the modules and methodologies involved in
building an effective sentiment analysis system.

2. Data Collection:

The first step in sentiment analysis is data collection. Twitter provides APIs
(Application Programming Interfaces) that allow developers to access and
retrieve tweets based on various criteria such as keywords, hashtags, user
mentions, and geolocation. Data collection may also involve scraping publicly
available tweets using web scraping techniques. The collected data should be
representative, diverse, and labeled with ground truth sentiment labels for
training and evaluation purposes.
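
As a hedged sketch, recent tweets can be retrieved with the Tweepy client library; the bearer token and query string below are placeholders, and the available endpoints depend on the Twitter developer access tier:

import tweepy

# Placeholder credential; obtain a real token from the Twitter developer portal.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Search recent tweets matching a keyword, excluding retweets.
response = client.search_recent_tweets(
    query="demonetization -is:retweet", max_results=100)

for tweet in response.data or []:   # response.data is None when nothing matches
    print(tweet.text)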

3. Data Preprocessing:

Twitter data often contains noise, including hashtags, mentions, URLs, emojis,
and non-standard spelling and grammar. Data preprocessing techniques such as
tokenization, lowercase conversion, punctuation removal, stop word removal,
stemming or lemmatization, and handling of special characters and emoticons are
applied to clean and standardize the text data before further analysis.
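
A minimal preprocessing sketch using NLTK; the regular expressions and the English stop-word list are illustrative choices, not the only possible ones:

import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_tweet(text):
    text = text.lower()                          # lowercase conversion
    text = re.sub(r"http\S+|www\S+", "", text)   # remove URLs
    text = re.sub(r"[@#]\w+", "", text)          # remove mentions and hashtags
    text = re.sub(r"[^a-z\s]", "", text)         # drop punctuation and emojis
    tokens = [stemmer.stem(t) for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_tweet("Loving the new #Demonetization policy!! http://t.co/xyz @user"))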

4. Feature Extraction:

Feature extraction involves converting the raw text data into numerical
representations suitable for machine learning algorithms. Commonly used feature
extraction techniques
include Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency
(TF-IDF), Word Embeddings (e.g., Word2Vec, GloVe), and n-grams. These
techniques capture the semantic and syntactic information present in the text data
and transform it into feature vectors that can be fed into machine learning models.
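
For instance, TF-IDF features can be built with scikit-learn; the unigram-plus-bigram setting and the vocabulary cap below are illustrative choices:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["love new policy", "hate fake notes", "love curb black money"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X = vectorizer.fit_transform(corpus)   # sparse document-term matrix

print(X.shape)                                 # (3, number of features)
print(vectorizer.get_feature_names_out()[:5])  # first few feature names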

5. Model Selection:

Several machine learning algorithms can be applied to perform sentiment analysis
on Twitter data, including Naive Bayes, Support Vector Machines, logistic
regression, Random Forest, and neural networks such as RNNs.

Each algorithm has its advantages and disadvantages in terms of accuracy,


scalability, interpretability, and computational complexity. The choice of the
model depends on factors such as the size of the dataset, the complexity of the
text data, and the desired performance metrics.
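
A hedged sketch of training two of the candidate models named above on TF-IDF features; the four example texts and their 0/1 labels are toy data for illustration only:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

texts = ["good policy", "bad move", "great decision", "terrible idea"]
labels = [1, 0, 1, 0]   # toy sentiment labels: 1 = positive, 0 = negative

X = TfidfVectorizer().fit_transform(texts)

svm = LinearSVC().fit(X, labels)               # linear SVM
logreg = LogisticRegression().fit(X, labels)   # logistic regression

print(svm.predict(X[:1]), logreg.predict(X[:1]))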

6. Evaluation:

Evaluation metrics such as accuracy, precision, recall, F1-score, and ROC-AUC


are used to assess the performance of sentiment analysis models. The dataset is
typically split into training, validation, and test sets, and the model is trained on
the training set and evaluated on the validation and test sets. Cross-validation
techniques such as k-fold cross-validation are also used to ensure robustness and
generalizability of the model.
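
A sketch of the split-and-evaluate workflow described above, using synthetic stand-in data so the example is self-contained:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=200, random_state=42)   # stand-in data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, pred))
print("F1 score:", f1_score(y_test, pred))
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())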

Twitter sentiment analysis using machine learning is a powerful tool for


understanding public sentiment and opinion on various topics. By leveraging
data collection, preprocessing, feature extraction, model selection, and
evaluation modules, researchers and practitioners can build accurate and
effective sentiment analysis systems. However,
challenges such as data noise, domain specificity, and model interpretability
remain areas for future research and improvement. Overall, sentiment analysis
on Twitter data has wide-ranging applications in marketing, politics, finance,
and public health, making it an essential area of study in the field of natural
language processing and machine learning.

Non-Functional Requirements:

Performance:

The system should be capable of processing large volumes of tweets


efficiently, ensuring real-time or near-real-time analysis.
It should have low latency and high throughput to handle concurrent user requests
and streaming data.

Scalability:

The system should be scalable to accommodate an increasing number of users and


data sources.
It should support horizontal scalability by deploying across multiple servers or
cloud instances.

Accuracy:

The sentiment analysis models should achieve high accuracy and reliability in
predicting tweet sentiments.
The system should allow users to fine-tune models and adjust parameters to
improve accuracy.
Robustness:

The system should handle errors and exceptions gracefully, ensuring uninterrupted
operation.
It should implement error handling mechanisms and failover procedures to recover
from failures quickly.

Security:

The system should ensure the confidentiality and integrity of user data and
authentication credentials.
It should employ encryption techniques to protect sensitive information during
data transmission and storage.

Usability:

The system should have a user-friendly interface that is intuitive and easy to
navigate. It should provide documentation and tutorials to guide users
through the process of data collection, model training, and evaluation.

Maintainability:

The system should be modular and well-documented to facilitate code


maintenance and updates.
It should support version control and automated testing to ensure code quality and
stability over time.
A feasibility study of Twitter sentiment analysis involves assessing the
practicality, viability, and potential benefits of implementing such a system. The
following dimensions are assessed:

1. Technical Feasibility:

Data Access: Evaluate the availability and accessibility of Twitter data through
APIs. Ensure that the required data can be collected efficiently and in compliance
with Twitter's terms of service.
Data Processing: Assess the feasibility of preprocessing large volumes of Twitter
data to remove noise, standardize text, and extract features. Consider the
computational resources and time required for these tasks.
Machine Learning Models: Evaluate the performance and scalability of machine
learning algorithms for sentiment analysis on Twitter data. Consider factors such
as accuracy, speed, and memory usage.

2. Economic Feasibility:

Cost of Data Acquisition: Estimate the cost of accessing Twitter data through
APIs, considering any subscription fees or usage-based pricing models.
Infrastructure Costs: Assess the cost of infrastructure required for data storage,
processing, and analysis, including servers, storage, and computational resources.
Development and Maintenance Costs: Estimate the cost of developing and
maintaining the sentiment analysis system, including software development,
testing, documentation, and ongoing support.

3. Operational Feasibility:
User Requirements: Identify the needs and preferences of potential users of the
sentiment analysis system, such as marketing professionals, brand managers, or
researchers.
Integration with Existing Systems: Assess the compatibility and integration
requirements of the sentiment analysis system with existing tools, platforms, or
workflows used by the target users.
Training and Support: Evaluate the feasibility of providing training and
support to users for using the sentiment analysis system effectively.
4. Legal and Ethical Feasibility:

Data Privacy and Security: Ensure compliance with data privacy regulations
and guidelines when collecting, storing, and processing Twitter data. Implement
measures to protect user privacy and confidentiality.
Intellectual Property: Consider any legal issues related to the use of Twitter data,
including copyright, trademark, and intellectual property rights.
Ethical Considerations: Assess the ethical implications of sentiment analysis,
including potential biases, discrimination, and misuse of analyzed data.

5. Schedule Feasibility:

Project Timeline: Develop a realistic timeline for the implementation of the


sentiment analysis system, considering factors such as data collection,
preprocessing, model training, testing, and deployment.
Resource Availability: Identify the availability of human resources, expertise, and
skills required to develop and deploy the sentiment analysis system within the
specified timeframe.
Risk Management: Anticipate potential risks and challenges that may arise during
the project and develop strategies to mitigate them effectively.
Conclusion:

A feasibility study of Twitter sentiment analysis involves evaluating technical,


economic, operational, legal, ethical, and schedule-related factors to determine the
viability of implementing such a system. By carefully assessing these factors,
stakeholders can make informed decisions about whether to proceed with the
project and how to optimize its chances of success.

The following types of testing are commonly used in software
development, along with a description of each:

1. Unit Testing:

Description: Unit testing involves testing individual components or units of


code in isolation from the rest of the application. It verifies that each unit
behaves as expected and meets its specified requirements.
Purpose: To validate the correctness and functionality of small, independent units
of code, such as functions, methods, or classes.
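
For example, a unit test for a hypothetical tweet-cleaning helper; clean_tweet here is an assumed function written for illustration, not taken from the report's source code:

import re
import unittest

def clean_tweet(text):
    # Hypothetical unit under test: lowercase the text and strip URLs.
    return re.sub(r"http\S+", "", text.lower()).strip()

class TestCleanTweet(unittest.TestCase):
    def test_lowercases_and_removes_urls(self):
        self.assertEqual(clean_tweet("Hello http://t.co/x"), "hello")

if __name__ == "__main__":
    unittest.main()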

2. Integration Testing:

Description: Integration testing focuses on testing the interaction and integration


between different units or modules of the software. It verifies that these
components work together as intended and that data flows smoothly between
them.
Purpose: To identify and address defects or inconsistencies in the interactions
between integrated components and ensure the overall integrity of the
software.

3. System Testing:
Description: System testing evaluates the behavior of the entire software system
as a whole. It verifies that the system meets its specified requirements and
performs its intended functions correctly in a real-world environment.
Purpose: To validate the overall functionality, reliability, performance, and
usability of the software from an end-to-end perspective.

4. Acceptance Testing:

Description: Acceptance testing involves verifying that the software meets the
acceptance criteria defined by stakeholders or end-users. It is typically performed
by users or client representatives to determine whether the software is ready for
production deployment.
Purpose: To ensure that the software meets the business requirements, user needs,
and expectations of stakeholders before it is released to the production
environment.

5. Regression Testing:

Description: Regression testing involves re-running previously executed test


cases to ensure that recent changes or updates to the software have not
introduced new defects or regressions in existing functionality.
Purpose: To maintain the stability and integrity of the software over time by
verifying that changes do not adversely affect existing features or behaviors.

6. Performance Testing:

Description: Performance testing evaluates the responsiveness, scalability, and


stability of the software under different workload conditions. It measures factors
such as response time, throughput, and resource utilization to identify
performance bottlenecks and areas for optimization.
Purpose: To assess the software's performance characteristics and ensure that it
can handle expected loads and scale effectively as user demand grows.
7. Usability Testing:

Description: Usability testing assesses the ease of use, intuitiveness, and user-
friendliness of the software from the perspective of end-users. It involves
observing users as they interact with the software and collecting feedback on
their experiences.
Purpose: To identify usability issues, navigation problems, and user interface (UI)
design flaws that may impact the user experience and hinder adoption of the
software.
8. Security Testing:

Description: Security testing evaluates the resilience of the software against


potential security threats, vulnerabilities, and attacks. It assesses aspects such as
data confidentiality, integrity, authentication, authorization, and protection
against common security risks.
Purpose: To identify and mitigate security vulnerabilities and ensure that the
software adheres to industry best practices and regulatory requirements for
information security.
9. Exploratory Testing:

Description: Exploratory testing involves exploring the software dynamically


and informally to uncover defects, issues, or unexpected behaviors that may not
be addressed by existing test cases. Testers rely on their intuition, creativity, and
domain knowledge to guide their testing efforts.
Purpose: To complement scripted testing approaches by uncovering hidden
defects, edge cases, or usability issues that may not be covered by formal test
cases.
10. Compatibility Testing:
Description: Compatibility testing verifies that the software functions correctly
across different environments, platforms, devices, browsers, and operating
systems. It ensures that the software is compatible with a wide range of
configurations.
Purpose: To ensure that the software delivers a consistent user experience and
performance across diverse environments and meets the needs of all potential
users.
Each type of testing serves a specific purpose and addresses different aspects of
software quality and reliability. By applying a combination of these testing
approaches throughout the software development lifecycle, organizations can
identify and address defects early, mitigate risks, and deliver high-quality
software products that meet the needs of users and stakeholders.

1. Functional Testing:

Test Case 1: Verify that the system correctly handles various types of tweets,
including text-based tweets, retweets, replies, and multimedia tweets (e.g.,
images, videos).
Test Case 2: Ensure that the system accurately identifies and filters out irrelevant
tweets, such as spam, advertisements, and duplicate content.
Test Case 3: Validate that the sentiment analysis models produce consistent and
reliable results across different types of tweets and user demographics.
Test Case 4: Confirm that the system supports multi-language sentiment analysis
and handles tweets in languages other than English.
2. Performance Testing:

Test Case 1: Measure the system's throughput by simulating a high volume of


incoming tweets and assessing its ability to process them within a specified time
frame.
Test Case 2: Evaluate the system's scalability by gradually increasing the number
of concurrent users or data sources and monitoring its response time and resource
utilization.
Test Case 3: Assess the system's stability and reliability under sustained load
conditions, including peak usage periods and sudden spikes in activity.

3. Usability Testing:

Test Case 1: Conduct usability testing with representative users to evaluate the
system's interface for clarity, intuitiveness, and ease of use.
Test Case 2: Solicit feedback from users on the effectiveness of visualizations and
reports generated by the system in conveying sentiment analysis results.
Test Case 3: Identify and address any usability issues or pain points reported by
users through surveys, interviews, or user feedback sessions.
4. Security Testing:

Test Case 1: Perform vulnerability scanning and penetration testing to identify


potential security vulnerabilities in the system's architecture, codebase, and
configuration.
Test Case 2: Verify that the system implements robust authentication and
authorization mechanisms to prevent unauthorized access to sensitive data and
functionalities.
Test Case 3: Test the system's resilience to common security threats such as
SQL injection, cross-site scripting (XSS), and cross-site request forgery
(CSRF).
5. Regression Testing:

Test Case 1: Re-run previously executed test cases to verify that software
updates or changes do not introduce new defects or regressions in existing
functionality.
Test Case 2: Validate that fixes applied to known defects or issues remain
effective and do not cause unintended side effects or regressions.
Test Case 3: Conduct automated regression testing using test scripts and tools to
ensure comprehensive coverage and consistency across test runs.
6. Integration Testing:
Test Case 1: Verify that the sentiment analysis system integrates smoothly with
external APIs and services for data collection, preprocessing, and feature
extraction.
Test Case 2: Ensure that data flows correctly between different components of the
system, including the frontend interface, backend servers, and machine learning
models.
Test Case 3: Validate that changes or updates to integrated modules do not impact
the overall functionality or performance of the system.

7. Acceptance Testing:

Test Case 1: Demonstrate the system's compliance with user requirements and
specifications outlined in the project scope and documentation.
Test Case 2: Conduct user acceptance testing (UAT) with end-users to ensure
that the system meets their needs, expectations, and use cases effectively.
Test Case 3: Obtain approval and sign-off from stakeholders to officially release
the system for production use, following successful completion of acceptance
testing.
System Architecture

Class Diagram

Sequence Diagram

Activity Diagram

Use-Case Diagram
System Dataset

The model pipeline proceeds through the following stages (a code sketch follows this list):

• Reading the dataset
• Preprocessing
• Feature selection
• Splitting the dataset
• Training the model
• Testing the model
• Metrics
• Future prediction
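
A minimal end-to-end sketch of this pipeline; the file name twitter.csv is taken from the results chapter, while the column names "text" and "label" (with 0/1 labels) are assumptions for illustration:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

df = pd.read_csv("twitter.csv")                  # reading the dataset
df = df.dropna(subset=["text", "label"])         # preprocessing (assumed columns)

X = TfidfVectorizer().fit_transform(df["text"])  # feature selection

X_tr, X_te, y_tr, y_te = train_test_split(       # splitting the dataset
    X, df["label"], test_size=0.2, random_state=42)

model = LinearSVC().fit(X_tr, y_tr)              # training the model
pred = model.predict(X_te)                       # testing the model

print("accuracy:", accuracy_score(y_te, pred))   # metrics
print("F1 score:", f1_score(y_te, pred))         # assumes binary 0/1 labels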
Chapter V
RESULTS

The datasets used are comment.csv and twitter.csv, which were taken
from the Twitter API. The dataset consists of more than 7,000 comments
given by many users.

Each user commented on a minimum of nearly 20 posts. The image
below shows the sentiment analysis presented to the users.

After applying these three algorithms, the system analyzes the tweets
based on the comments given by the users and generates the accuracy
and F1 score.

After obtaining the results, we manually check which system is more
efficient for the analysis by comparing the results.

Training Accuracy   : 0.9946602144257645
Validation Accuracy : 0.9501939682142411
F1 Score            : 0.6004016064257027

Confusion matrix:
[[7294  138]
 [ 260  299]]
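
The reported F1 score can be checked directly from the confusion matrix above: with 299 true positives, 138 false positives, and 260 false negatives, precision is 299/437 ≈ 0.684, recall is 299/559 ≈ 0.535, and F1 = 2TP / (2TP + FP + FN) = 598/996 ≈ 0.6004, which matches the reported value:

# Verifying the reported F1 score from the confusion matrix.
tn, fp, fn, tp = 7294, 138, 260, 299

precision = tp / (tp + fp)         # ≈ 0.684
recall = tp / (tp + fn)            # ≈ 0.535
f1 = 2 * tp / (2 * tp + fp + fn)   # 598/996

print(precision, recall, f1)       # f1 ≈ 0.60040160..., matching the report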
Chapter VI
CONCLUSION

Sentiment analysis has become an important factor in the decision-making
process in a particular field. In this paper we discussed techniques
for preprocessing and information retrieval of tweets through Twitter.

We also studied the supervised learning technique Support
Vector Machine for text categorization, which can be used to find the
polarity of a textual tweet.

From this study we can conclude that SVM accommodates some
properties of text, such as a high-dimensional feature space, few
irrelevant features, and sparse instance vectors.

The performance of SVM can be evaluated using precision and
recall. Different results show that SVM gives good performance on text
categorization compared with Random Forest. With its ability to
generalize over a high-dimensional feature space, SVM eliminates the
need for feature selection.

FUTURE WORK:

Implementation of some other algorithms:

K-Means++ algorithm
Recurrent neural network
Gradient boosting algorithm
AdaBoost algorithm
Handle grapheme stretching
Handle authenticity of data and users
Handle sarcasm and humour
REFERENCES:

[1] Falguni Gupta, Swati Singhal, Amity University, "Sentiment Analysis of the
Demonetization of Economy 2016 India, Region-wise", 2017 7th International
Conference on Cloud Computing, Data Science & Engineering, January 2017,
pp. 693-696.

[2] Geetika Gautam, Divakar Yadav, "Sentiment Analysis of Twitter Data Using
Machine Learning Approaches and Semantic Analysis", 2014 Seventh
International Conference on Contemporary Computing (IC3), August 2014,
pp. 437-442.

[3] Nicolas Tsapatsoulis, Constantinos Djouvas, "Feature Extraction for Tweet
Classification: Do the Humans Perform Better?", 2017 12th International
Workshop on Semantic and Social Media Adaptation and Personalization
(SMAP), July 2017, pp. 53-58.

[4] Huma Parveen, Prof. Shikha Pandey, "Sentiment Analysis on Twitter Data-set
using Naïve Bayes Algorithm", 2016 2nd International Conference on Applied
and Theoretical Computing and Communication Technology (iCATccT),
July 2016, pp. 416-419.

[5] Omar Abdelwahab, Mohamed Bahgat, Christopher J. Lowrance, Adel
Elmaghraby, "Effect of Training Set Size on SVM and Naïve Bayes for Twitter
Sentiment Analysis", 2015 IEEE International Symposium on Signal Processing
and Information Technology (ISSPIT), December 2015, pp. 46-51.

[6] Neethu M S, Rajashri R, "Sentiment Analysis in Twitter using Machine
Learning Techniques", 2013 Fourth International Conference on Computing,
Communications and Networking Technologies (ICCCNT), July 2013, pp. 1-5.

[7] Tapan Sahni, Chinmay Chandak, Naveen Reddy Chedeti, Manish Singh,
"Efficient Twitter Sentiment Classification using Subjective Distant Supervision",
2017 9th International Conference on Communication Systems and Networks
(COMSNETS), January 2017, pp. 548-553.

[8] M. Trupthi, Suresh Pabboju, G. Narasimha, "Sentiment Analysis On Twitter
Using Streaming API", 2017 IEEE 7th International Advance Computing
Conference, January 2017, pp. 915-919.
APPENDIX

SOURCE CODE:
Output:
