Twitter Sentiment Analysis Using Machine Learning
A project report
Submitted in partial fulfillment of the requirements for the award of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted by
J. BABY 20EM1A0529
L.SUDHAKAR 20EM1A0557
K.DIVAKAR 20EM1A0548
K.JAHNAVI 20EM1A0538
CERTIFICATE
This is to certify that the project report entitled “Twitter Sentiment Analysis Using
Machine Learning” is a bonafide work done by J. BABY – 20EM1A0529,
L. SUDHAKAR – 20EM1A0557, K. JAHNAVI – 20EM1A0538, K. DIVAKAR –
20EM1A0548, submitted in partial fulfillment of the requirements for the award of the
degree of Bachelor of Technology in the Department of COMPUTER SCIENCE AND
ENGINEERING, during the academic year 2020-2024.
External Examiner
DECLARATION
We hereby declare that the entire project work embodied in this dissertation entitled “Twitter
Sentiment Analysis Using Machine Learning” has been independently carried out by us.
To the best of our knowledge, no part of this work has previously been submitted for any
degree at any institution, university, or organization.
TEAM MEMBERS
J. BABY 20EM1A0529
L.SUDHAKAR 20EM1A0557
K.DIVAKAR 20EM1A0548
K.JAHNAVI 20EM1A0538
ACKNOWLEDGEMENT
We extend our heartfelt gratitude and sincere thanks to the Management Members of our
college for making the necessary arrangements for this project.
We would like to express our deep indebtedness and wholehearted thanks to all the
faculty of the CSE Department for their full-fledged support and encouragement.
Last but not least, our special thanks to our parents for their support. We are
indebted to all the people who have contributed in one way or another to the completion
of this project work.
TEAM MEMBERS
J. BABY 20EM1A0529
L.SUDHAKAR 20EM1A0557
K.DIVAKAR 20EM1A0548
K.JAHNAVI 20EM1A0538
Chapter I
INTRODUCTION
With 500 million tweets sent each day, that is, 6,000 tweets generated every
second, Twitter is the most popular micro-blogging site that allows users to
express their views and opinions in 280 characters. As companies and political
leaders take to the online social media platform to establish and develop their
brand, one cannot ignore the amount of data being generated on Twitter.
Similarly, companies can also get reviews about their new products, such as
newly released hardware or software.
Chapter II
LITERATURE REVIEW
The tweet retrieval process needs access tokens from the Twitter developer
site and a piece of code that performs the operation of retrieving those
tweets. The base language used will be Python.
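A minimal sketch of that retrieval step is shown below, using only the Python standard library against the Twitter API v2 recent-search endpoint. The bearer token is a hypothetical placeholder; a real token must be generated on the Twitter developer site, and the helper names (`build_search_url`, `fetch_tweets`) are illustrative, not part of any official client.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholder: a real bearer token must be generated
# on the Twitter developer site for the registered app.
BEARER_TOKEN = "YOUR_BEARER_TOKEN"

SEARCH_ENDPOINT = "https://api.twitter.com/2/tweets/search/recent"

def build_search_url(query, max_results=10):
    """Build the recent-search URL for the given query string."""
    params = urllib.parse.urlencode({"query": query, "max_results": max_results})
    return f"{SEARCH_ENDPOINT}?{params}"

def fetch_tweets(query, max_results=10):
    """Retrieve recent tweets matching `query` (needs a valid token and network access)."""
    request = urllib.request.Request(
        build_search_url(query, max_results),
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("data", [])
```

In practice a wrapper library such as Tweepy is often used instead; the raw-endpoint version above only shows where the access token enters the request.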
Chapter III
PROBLEM DEFINITION
EXISTING SYSTEM
Python is renowned for its rich set of features that make it well-suited for a wide
range of programming tasks. Some key features of Python include:
Python's origins trace back to the late 1980s when Guido van Rossum, a Dutch
programmer, began working on a new programming language as a successor to
the ABC language. The first version of Python, Python 0.9.0, was released in
1991, featuring core concepts such as exception handling, functions, and
modules. Over the years, Python underwent significant evolution and
refinement, with major releases introducing new features, enhancements, and
optimizations. Key milestones in Python's history include:
Python 1.0 (1994): The first official release of Python 1.0 included features such
as lambda, map, filter, and reduce functions, as well as support for functional
programming constructs.
Python 2.0 (2000): Python 2 introduced list comprehensions, garbage
collection, and a unified object model. It became the predominant version of
Python used in production environments for many years.
Python's versatility and ease of use have led to its widespread adoption across
various domains and industries. Some common applications of Python include:
Data Science and Machine Learning: Python has emerged as a dominant language
in the field of data science and machine learning, thanks to libraries like NumPy,
pandas, scikit-learn, and TensorFlow. These libraries provide powerful tools for
data analysis, visualization, and predictive modeling.
Scientific Computing: Python is popular among scientists and researchers for its
extensive ecosystem of libraries and tools tailored for scientific computing,
including SciPy, Matplotlib, and Jupyter.
Introduction to Pandas
Developed by Wes McKinney and first released in 2008, Pandas has become an
essential tool for data scientists, analysts, and developers working with data-
intensive applications.
Data Structures: Pandas introduces two main data structures: Series and
DataFrame. A Series is a one-dimensional array-like object that can hold various
data types, while a DataFrame is a two-dimensional labeled data structure
resembling a table or spreadsheet.
Data Manipulation: Pandas provides a rich set of functions and methods for
manipulating and transforming data. Users can perform tasks such as filtering,
sorting, grouping, aggregating, merging, and reshaping data with ease.
Missing Data Handling: Pandas offers robust support for handling missing or
incomplete data, providing functions for detecting, removing, and imputing
missing values in datasets.
Time Series Analysis: Pandas includes functionality for working with time series
data, including date/time indexing, time zone handling, resampling, and frequency
conversion.
Data Input/Output: Pandas supports a variety of file formats for input and output
operations, including CSV, Excel, SQL databases, JSON, HDF5, and more. It
simplifies the process of reading and writing data from/to external sources.
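The Pandas features described above (Series, DataFrames, filtering, grouping, and missing-data handling) can be sketched in a few lines. The tweet/sentiment columns here are a small hypothetical dataset for illustration only.

```python
import pandas as pd

# A small hypothetical dataset of tweets and sentiment labels.
df = pd.DataFrame({
    "tweet": ["great phone", "battery is bad", None, "love it"],
    "sentiment": ["positive", "negative", None, "positive"],
})

# Missing-data handling: detect rows with missing values, then drop them.
missing_rows = df.isna().any(axis=1).sum()
clean = df.dropna()

# Data manipulation: filter rows and group/aggregate.
positives = clean[clean["sentiment"] == "positive"]
counts = clean.groupby("sentiment").size()
```

The same DataFrame could be loaded from or written to CSV, Excel, or SQL with `pd.read_csv` / `df.to_csv` and their counterparts.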
Applications of Pandas
Pandas is widely used across various domains and industries for data analysis,
manipulation, and visualization. Some common applications of Pandas include:
Data Analysis and Statistics: Pandas provides powerful tools for performing
descriptive and inferential statistics on datasets, including summary statistics,
correlation analysis, hypothesis testing, and regression analysis.
Time Series Analysis: Pandas' support for time series data makes it well-suited
for analyzing and visualizing temporal data, such as financial market data, sensor
data, and weather data. Users can easily perform tasks such as resampling,
rolling window calculations, and time series decomposition.
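The time-series operations mentioned above (resampling and rolling-window calculations) might look like this on a synthetic hourly series; the dates and values are invented for illustration.

```python
import pandas as pd

# Hypothetical hourly tweet counts over two days.
index = pd.date_range("2024-01-01", periods=48, freq="h")
hourly = pd.Series(range(48), index=index)

# Resample to daily totals and compute a 3-hour rolling mean.
daily = hourly.resample("D").sum()
rolling = hourly.rolling(window=3).mean()
```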
Machine Learning and Data Modeling: Pandas integrates seamlessly with other
libraries in the Python ecosystem, such as scikit-learn, TensorFlow, and
PyTorch, for building machine learning models and conducting predictive
analytics tasks. It serves as a crucial tool for preparing data for model training
and evaluation.
Data Visualization: While Pandas itself does not provide visualization
capabilities, it integrates well with libraries like Matplotlib, Seaborn, and Plotly
for creating insightful visualizations of data. Users can quickly generate plots,
charts, and graphs to explore and communicate their findings effectively.
Introduction to Scikit-learn
Model Evaluation and Validation: Scikit-learn offers tools for evaluating and
validating machine learning models through techniques such as cross-validation,
hyperparameter tuning, model selection, and performance metrics calculation.
Integration with NumPy and SciPy: Scikit-learn seamlessly integrates with other
Python libraries such as NumPy and SciPy, leveraging their capabilities for
numerical computing, linear algebra, optimization, and statistical analysis.
Support for Data Preprocessing: Scikit-learn provides functions and utilities for
preprocessing and transforming raw data before feeding it into machine learning
algorithms. This includes tasks such as feature scaling, normalization, encoding
categorical variables, and handling missing values.
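The preprocessing and model-evaluation features described above can be combined through scikit-learn's pipeline utilities. The sketch below uses synthetic data from `make_classification` as a stand-in for real tweet feature vectors, chaining feature scaling with a classifier and scoring it by 5-fold cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; a real system would use tweet feature vectors.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Preprocessing (feature scaling) chained with a classifier,
# then evaluated with 5-fold cross-validation.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
```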
Introduction to Matplotlib
Wide Range of Plot Types: Matplotlib supports a diverse range of plot types,
including line plots, scatter plots, bar plots, histograms, heatmaps, contour
plots, and more. Users can create highly customizable visualizations to
represent their data effectively.
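As one example of the plot types listed above, a bar chart of sentiment counts is a natural fit for this project. The counts below are hypothetical; the non-interactive `Agg` backend is selected so the figure can be saved without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt

# Hypothetical sentiment distribution for a batch of tweets.
labels = ["positive", "negative", "neutral"]
counts = [120, 45, 80]

fig, ax = plt.subplots()
ax.bar(labels, counts)
ax.set_xlabel("Sentiment")
ax.set_ylabel("Number of tweets")
ax.set_title("Tweet sentiment distribution")
fig.savefig("sentiment_distribution.png")
plt.close(fig)
```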
In summary, Scikit-learn and Matplotlib are two essential libraries in the Python
ecosystem for machine learning and data visualization, respectively. Together,
they provide powerful tools for exploring, analyzing, and visualizing data,
enabling users to build sophisticated machine learning models and gain valuable
insights from their data. Whether you're a data scientist, researcher, student, or
developer, Scikit-learn and Matplotlib offer the tools you need to tackle a wide
range of machine learning and data visualization tasks effectively.
PROPOSED SYSTEM
SOFTWARE REQUIREMENTS
1. Modules
Social media platforms like Twitter have become invaluable sources of real-time
data for understanding public opinion, market trends, and societal issues.
Sentiment analysis on
Twitter data involves the automatic classification of tweets into positive,
negative, or neutral categories based on the sentiment expressed by users. This
paper focuses on the application of machine learning techniques for sentiment
analysis on Twitter, highlighting the modules and methodologies involved in
building an effective sentiment analysis system.
2. Data Collection:
The first step in sentiment analysis is data collection. Twitter provides APIs
(Application Programming Interfaces) that allow developers to access and
retrieve tweets based on various criteria such as keywords, hashtags, user
mentions, and geolocation. Data collection may also involve scraping publicly
available tweets using web scraping techniques. The collected data should be
representative, diverse, and labeled with ground truth sentiment labels for
training and evaluation purposes.
3. Data Preprocessing:
Twitter data often contains noise, including hashtags, mentions, URLs, emojis,
and non-standard spelling and grammar. Data preprocessing techniques such as
tokenization, lowercase conversion, punctuation removal, stop word removal,
stemming or lemmatization, and handling of special characters and emoticons are
applied to clean and standardize the text data before further analysis.
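The cleaning steps listed above can be sketched with plain regular expressions. The stopword list here is a tiny illustrative subset (real systems typically use a fuller list, e.g. from NLTK), and the `clean_tweet` helper name is an assumption of this sketch.

```python
import re

# A minimal stopword list for illustration; real systems use a fuller
# list (e.g. NLTK's English stopwords).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "rt"}

def clean_tweet(text):
    """Lowercase, strip URLs/mentions/hashtag signs/punctuation, drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove mentions
    text = text.replace("#", " ")               # keep the hashtag word, drop the sign
    text = re.sub(r"[^a-z\s]", " ", text)       # remove punctuation, digits, emojis
    return [t for t in text.split() if t not in STOPWORDS]
```

Stemming or lemmatization would follow as a further pass over the returned tokens.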
4. Feature Extraction:
Feature extraction involves converting the raw text data into numerical
representations suitable for machine learning algorithms. Commonly used feature
extraction techniques
include Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency
(TF-IDF), Word Embeddings (e.g., Word2Vec, GloVe), and n-grams. These
techniques capture the semantic and syntactic information present in the text data
and transform it into feature vectors that can be fed into machine learning models.
5. Model Selection:
6. Evaluation:
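The model-selection and evaluation steps named above could be sketched as follows: train two candidate classifiers on a held-out split and compare them by accuracy and F1 score, the same metrics reported later in this project. The synthetic data stands in for real TF-IDF feature vectors, and the choice of Naive Bayes versus Logistic Regression here is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for TF-IDF feature vectors and sentiment labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit each candidate model and record accuracy and F1 on the test split.
results = {}
for name, model in [("naive_bayes", GaussianNB()),
                    ("logistic_regression", LogisticRegression(max_iter=1000))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = (accuracy_score(y_test, pred), f1_score(y_test, pred))
```

The model with the better held-out scores would be selected for deployment.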
Non-Functional Requirements:
Performance:
Scalability:
Accuracy:
The sentiment analysis models should achieve high accuracy and reliability in
predicting tweet sentiments.
The system should allow users to fine-tune models and adjust parameters to
improve accuracy.
Robustness:
The system should handle errors and exceptions gracefully, ensuring uninterrupted
operation.
It should implement error handling mechanisms and failover procedures to recover
from failures quickly.
Security:
The system should ensure the confidentiality and integrity of user data and
authentication credentials.
It should employ encryption techniques to protect sensitive information during
data transmission and storage.
Usability:
The system should have a user-friendly interface that is intuitive and easy to
navigate. It should provide documentation and tutorials to guide users
through the process of data collection, model training, and evaluation.
Maintainability:
1. Technical Feasibility:
Data Access: Evaluate the availability and accessibility of Twitter data through
APIs. Ensure that the required data can be collected efficiently and in compliance
with Twitter's terms of service.
Data Processing: Assess the feasibility of preprocessing large volumes of Twitter
data to remove noise, standardize text, and extract features. Consider the
computational resources and time required for these tasks.
Machine Learning Models: Evaluate the performance and scalability of machine
learning algorithms for sentiment analysis on Twitter data. Consider factors such
as accuracy, speed, and memory usage.
2. Economic Feasibility:
Cost of Data Acquisition: Estimate the cost of accessing Twitter data through
APIs, considering any subscription fees or usage-based pricing models.
Infrastructure Costs: Assess the cost of infrastructure required for data storage,
processing, and analysis, including servers, storage, and computational resources.
Development and Maintenance Costs: Estimate the cost of developing and
maintaining the sentiment analysis system, including software development,
testing, documentation, and ongoing support.
3. Operational Feasibility:
User Requirements: Identify the needs and preferences of potential users of the
sentiment analysis system, such as marketing professionals, brand managers, or
researchers.
Integration with Existing Systems: Assess the compatibility and integration
requirements of the sentiment analysis system with existing tools, platforms, or
workflows used by the target users.
Training and Support: Evaluate the feasibility of providing training and
support to users for using the sentiment analysis system effectively.
4. Legal and Ethical Feasibility:
Data Privacy and Security: Ensure compliance with data privacy regulations
and guidelines when collecting, storing, and processing Twitter data. Implement
measures to protect user privacy and confidentiality.
Intellectual Property: Consider any legal issues related to the use of Twitter data,
including copyright, trademark, and intellectual property rights.
Ethical Considerations: Assess the ethical implications of sentiment analysis,
including potential biases, discrimination, and misuse of analyzed data.
5. Schedule Feasibility:
1. Unit Testing:
2. Integration Testing:
3. System Testing:
Description: System testing evaluates the behavior of the entire software system
as a whole. It verifies that the system meets its specified requirements and
performs its intended functions correctly in a real-world environment.
Purpose: To validate the overall functionality, reliability, performance, and
usability of the software from an end-to-end perspective.
4. Acceptance Testing:
Description: Acceptance testing involves verifying that the software meets the
acceptance criteria defined by stakeholders or end-users. It is typically performed
by users or client representatives to determine whether the software is ready for
production deployment.
Purpose: To ensure that the software meets the business requirements, user needs,
and expectations of stakeholders before it is released to the production
environment.
5. Regression Testing:
6. Performance Testing:
7. Usability Testing:
Description: Usability testing assesses the ease of use, intuitiveness, and user-
friendliness of the software from the perspective of end-users. It involves
observing users as they interact with the software and collecting feedback on
their experiences.
Purpose: To identify usability issues, navigation problems, and user interface (UI)
design flaws that may impact the user experience and hinder adoption of the
software.
8. Security Testing:
1. Functional Testing:
Test Case 1: Verify that the system correctly handles various types of tweets,
including text-based tweets, retweets, replies, and multimedia tweets (e.g.,
images, videos).
Test Case 2: Ensure that the system accurately identifies and filters out irrelevant
tweets, such as spam, advertisements, and duplicate content.
Test Case 3: Validate that the sentiment analysis models produce consistent and
reliable results across different types of tweets and user demographics.
Test Case 4: Confirm that the system supports multi-language sentiment analysis
and handles tweets in languages other than English.
2. Performance Testing:
3. Usability Testing:
Test Case 1: Conduct usability testing with representative users to evaluate the
system's interface for clarity, intuitiveness, and ease of use.
Test Case 2: Solicit feedback from users on the effectiveness of visualizations and
reports generated by the system in conveying sentiment analysis results.
Test Case 3: Identify and address any usability issues or pain points reported by
users through surveys, interviews, or user feedback sessions.
4. Security Testing:
5. Regression Testing:
Test Case 1: Re-run previously executed test cases to verify that software
updates or changes do not introduce new defects or regressions in existing
functionality.
Test Case 2: Validate that fixes applied to known defects or issues remain
effective and do not cause unintended side effects or regressions.
Test Case 3: Conduct automated regression testing using test scripts and tools to
ensure comprehensive coverage and consistency across test runs.
6. Integration Testing:
Test Case 1: Verify that the sentiment analysis system integrates smoothly with
external APIs and services for data collection, preprocessing, and feature
extraction.
Test Case 2: Ensure that data flows correctly between different components of the
system, including the frontend interface, backend servers, and machine learning
models.
Test Case 3: Validate that changes or updates to integrated modules do not impact
the overall functionality or performance of the system.
7. Acceptance Testing:
Test Case 1: Demonstrate the system's compliance with user requirements and
specifications outlined in the project scope and documentation.
Test Case 2: Conduct user acceptance testing (UAT) with end-users to ensure
that the system meets their needs, expectations, and use cases effectively.
Test Case 3: Obtain approval and sign-off from stakeholders to officially release
the system for production use, following successful completion of acceptance
testing.
System Architecture
Class Diagram
Sequence Diagram:
Activity Diagram:
Use-Case Diagram
System Dataset
Reading Dataset
Preprocessing
Feature Selection
Metrics
Future Prediction
Chapter V
RESULTS
After applying these three algorithms, the system analyzes the tweets based
on the comments given by the users and generates the accuracy and F1 score.
After obtaining the results, we manually check which algorithm is more
efficient for the analysis by comparing the results.
Training Accuracy: 0.9946602144257645
Validation Accuracy: 0.9501939682142411
F1 Score: 0.6004016064257027

Confusion Matrix:
[[7294  138]
 [ 260  299]]
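The validation accuracy and F1 score follow directly from the confusion matrix reported above. The sketch below recomputes them, assuming rows are actual classes, columns are predicted classes, and the second class is the positive one.

```python
# Confusion matrix entries from the results above:
# rows = actual class, columns = predicted class, second class = positive.
tn, fp = 7294, 138
fn, tp = 260, 299

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 10))  # 0.9501939682, matching the validation accuracy
print(round(f1, 10))        # 0.6004016064, matching the reported F1 score
```

The high accuracy alongside the much lower F1 score reflects the class imbalance visible in the matrix: the positive class has far fewer examples (559) than the negative class (7432).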
Chapter VI
CONCLUSION
FUTURE WORK:
K Means++ algorithm
Recurrent neural network
Gradient boosting algorithm
Ada Boosting algorithm
Handle Grapheme Stretching
Handle authenticity of Data and Users
Handle Sarcasm and Humor
REFERENCES:
SOURCE CODE:
Output :