Tribhuvan University: Institute of Engineering
A
FINAL YEAR PROJECT REPORT ON
TWEEZER
(CT-755)
Anil Shrestha (070/BCT/01)
Bijay Sahani (070/BCT/05)
Bimal Shrestha (070/BCT/10)
Deshbhakta Khanal (070/BCT/13)
August, 2017
TWEEZER
A
FINAL YEAR PROJECT REPORT
(CT-755)
PROJECT SUPERVISOR
Er. Nhurendra Shakya
Tribhuvan University
Lalitpur, Nepal
August, 2017
ACKNOWLEDGEMENT
It is a matter of great pleasure to present this report on the development of
"Tweezer". We are grateful to the Institute of Engineering, Pulchowk, and the
Himalaya College of Engineering management for providing us with this great
opportunity to develop a web application for the major project.
We are also grateful to our project supervisor Er. Nhurendra Shakya for his
valuable advice and suggestions.
We would also like to thank Er. Ashok GM, Head of Department, and Er. Alok
Samrat Kafle, Major Project Coordinator, for providing us all the necessary
resources that met our project requirements.
We would like to convey our thanks to the teaching and non-teaching staff of the
Department of Electronics and Computer Engineering, HCOE, for their invaluable
help and support throughout the project. Finally, we express our gratitude to all
our friends and everyone who has been a part of this project by providing their
comments and suggestions.
Group Members
Anil Shrestha
Bijay Sahani
Bimal Shrestha
Deshbhakta Khanal
ABSTRACT
Analysis of public information from social media can yield interesting insights
into public opinion about almost any product, service, or personality. Social
network data is one of the most effective and accurate indicators of public
sentiment. The explosion of Web 2.0 has led to increased activity in podcasting,
blogging, tagging, contributing to RSS, social bookmarking, and social
networking. As a result, there has been a surge of interest in mining these vast
resources of data for opinions. Sentiment analysis, or opinion mining, is the
computational treatment of the opinions, sentiments, and subjectivity of text. In
this report we discuss a methodology that allows the utilization and
interpretation of Twitter data to determine public opinion.
Developing a program for sentiment analysis is an approach to computationally
measure customers' perceptions. This report describes the design of a sentiment
analysis system that extracts and trains on a vast number of tweets. The results
classify customers' perspectives via tweets into positive and negative, which are
represented in a pie chart, bar diagram, and scatter plot using PHP, CSS, and
HTML pages.
Keywords: Data mining, Natural language processing, SentiWordNet, Naïve Bayes
TABLE OF CONTENTS
5.3.5 JavaScript
5.3.6 HTML
5.3.7 Highcharts
6. TESTING
6.1. Unit Testing
6.2. System Testing
6.3. Performance Testing
6.4. Verification and Validation
7. ANALYSIS AND RESULTS
7.1 Analysis
7.2 Result
8. LIMITATION AND FUTURE ENHANCEMENT
8.1 Limitation
8.2 Future Enhancement
CONCLUSION
REFERENCES
APPENDIX
LIST OF FIGURES
ABBREVIATIONS
1. INTRODUCTION
However, those types of online data have several flaws that potentially hinder
the process of sentiment analysis. The first flaw is that, since people can
freely post their own content, the quality of their opinions cannot be
guaranteed. For example, instead of sharing topic-related opinions, online
spammers post spam on forums. Some spam is entirely meaningless, while other
spam carries irrelevant opinions, also known as fake opinions [7-9]. The second
flaw is that the ground truth of such online data is not always available. A
ground truth is essentially a tag on a certain opinion, indicating whether the
opinion is positive, negative, or neutral. The Stanford Sentiment 140 Tweet
Corpus [10] is one of the datasets that has ground truth and is also publicly
available. The corpus contains 1.6 million machine-tagged Twitter messages.
complain, and express positive sentiment for products they use in daily life. In
fact, companies manufacturing such products have started to poll these
microblogs to get a sense of the general sentiment about their products. Many
times these companies study user reactions and reply to users on microblogs. One
challenge is to build technology to detect and summarize an overall sentiment.
Our project Tweezer analyzes tweets posted by people about the products of
certain companies or brands, or about the actions of political leaders. In order
to do this we analyzed tweets from Twitter. Tweets are a reliable source of
information mainly because people tweet about anything and everything they do,
including buying new products and reviewing them. Besides, many tweets contain
hashtags, which make identifying relevant tweets a simple task. A number of
research works have already been done on Twitter data, most of which mainly
demonstrate how useful this information is for predicting various outcomes. Our
current research deals with outcome prediction and explores localized outcomes.
We collected data using the Twitter public API, which allows developers to
extract tweets from Twitter programmatically.
Because of the random and casual nature of tweeting, the collected data need to
be filtered to remove unnecessary information. Filtering out problematic tweets,
such as redundant ones and ones with no proper sentences, was done next.
Once the preprocessing phase was carried out to a certain extent, it was
possible to guarantee that analyzing the filtered tweets would give reliable
results. Twitter does not provide gender as a query parameter, so it is not
possible to obtain the gender of a user from his or her tweets. It turns out
that Twitter does not ask for the user's gender when opening an account, so that
information is simply unavailable.
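As a sketch, the filtering rules described above (dropping retweets, stripping links, mentions, and hash signs, and removing duplicates) could look like the following; the exact rules, function names, and patterns are illustrative assumptions, not the project's actual code.

```python
import re

def clean_tweet(text):
    """Strip URLs, @mentions, and the '#' of hashtags, then collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text)   # remove links
    text = re.sub(r"@\w+", "", text)           # remove @mentions
    text = re.sub(r"#(\w+)", r"\1", text)      # keep the hashtag word, drop '#'
    return re.sub(r"\s+", " ", text).strip()

def filter_tweets(tweets):
    """Drop retweets and exact (case-insensitive) duplicates, clean the rest."""
    seen, kept = set(), []
    for t in tweets:
        if t.startswith("RT "):                # skip redundant retweets
            continue
        cleaned = clean_tweet(t)
        if cleaned and cleaned.lower() not in seen:
            seen.add(cleaned.lower())
            kept.append(cleaned)
    return kept
```

For example, `filter_tweets(["RT @a: nice", "Love the new #phone http://t.co/x", "love the new phone"])` keeps only one cleaned tweet.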
Given a message containing a marked instance of a word or a phrase,
determine whether that instance is positive, negative or neutral in that
context.
Sentence Level Sentiment Analysis in Twitter:
Given a message, decide whether the message is of positive, negative, or
neutral sentiment. For messages conveying both a positive and negative
sentiment, whichever is the stronger sentiment should be chosen.
1.2 Objectives
This project will be helpful to companies and political parties as well as to
the common people. It will help a political party review the programs it plans
to carry out or has already performed. Similarly, companies can get reviews of a
newly released product, hardware, or software, and movie makers can gather
reviews of a currently running movie. By analyzing the tweets, the analyst can
measure how positive, negative, or neutral people are about the subject.
organization or office to review their work, by political leaders, or by any
other company to review their products or brands.
The main feature of our web application is that it helps determine people's
opinions about products, government work, politics, or any other topic by
analyzing tweets. Our system is capable of training on new tweets, taking
previously trained tweets as reference.
The computed and analyzed data are represented in various diagrams such as pie
charts, bar graphs, and scatter plots.
2. LITERATURE REVIEW
Sentiment analysis has been handled as a Natural Language Processing task at
many levels of granularity. Starting from being a document-level classification
task (Turney, 2002; Pang and Lee, 2004) [4,5], it has been handled at the
sentence level (Hu and Liu, 2004 [2]; Kim and Hovy, 2004 [1]) and more recently
at the phrase level (Wilson et al., 2005; Agarwal et al., 2009). Microblog data
like Twitter, on which users post real-time reactions to and opinions about
"everything", poses newer and different challenges. Some of the early and recent
results on sentiment analysis of Twitter data are by Go et al. (2009),
Bermingham and Smeaton (2010), and Pak and Paroubek (2010) [3]. Go et al. (2009)
use distant learning to acquire sentiment data. They treat tweets ending in
positive emoticons like ":)" ":-)" as positive and tweets ending in negative
emoticons like ":(" ":-(" as negative. They build models using Naive Bayes,
MaxEnt, and Support Vector Machines (SVM), and they report that SVM outperforms
the other classifiers. In terms of feature space, they try unigram and bigram
models in conjunction with part-of-speech (POS) features. They note that the
unigram model outperforms all other models; specifically, bigrams and POS
features do not help. Pak and Paroubek (2010) [3] collect data following a
similar distant learning paradigm. They perform a different classification task
though: subjective versus objective.
For subjective data they collect the tweets ending with emoticons in the same
manner as Go et al. (2009). For objective data they crawl the Twitter accounts
of popular newspapers like the "New York Times", "Washington Post", etc. They
report that POS and bigrams both help (contrary to the results presented by Go
et al. (2009)). Both these approaches, however, are primarily based on n-gram
models. Moreover, the data they use for training and testing is collected by
search queries and is therefore biased. In contrast, we present features that
achieve a significant gain over a unigram baseline. In addition, we explore a
different method of data representation and report significant improvement over
the unigram models. Another contribution of this paper is that we report results
on manually annotated data that does not suffer from any known biases. Our data
will be a random sample of streaming tweets, unlike data collected by using
specific queries. The size of our hand-labeled data
will allow us to perform cross-validation experiments and check the variance in
performance of the classifier across folds. Another significant effort for
sentiment classification on Twitter data is by Barbosa and Feng (2010).
They use polarity predictions from three websites as noisy labels to train a
model and use 1000 manually labeled tweets for tuning and another 1000 manually
labeled tweets for testing. They do not, however, mention how they collect their
test data. They propose the use of syntax features of tweets like retweets,
hashtags, links, punctuation, and exclamation marks in conjunction with features
like the prior polarity of words and the POS of words. We extend their approach
by using real-valued prior polarity, and by combining prior polarity with POS.
Our results show that the features that enhance the performance of our
classifiers the most are features that combine the prior polarity of words with
their parts of speech. The tweet syntax features help, but only marginally.
Gamon (2004) performs sentiment analysis on feedback data from a Global Support
Services survey. One aim of their paper is to analyze the role of linguistic
features like POS tags. They perform extensive feature analysis and feature
selection and demonstrate that abstract linguistic analysis features contribute
to the classifier accuracy. In this paper we perform extensive feature analysis
and show that the use of only 100 abstract linguistic features performs as well
as a hard unigram baseline.
6,799 tokens based on Twitter data, where each token is assigned a sentiment
score, namely TSI (Total Sentiment Index), marking it as a positive token or a
negative token. Specifically, the TSI for a certain token is computed as:

TSI = (p − (tp/tn) · n) / (p + (tp/tn) · n)

where p is the number of times the token appears in positive tweets, n is the
number of times it appears in negative tweets, and tp/tn is the ratio of the
total number of positive tweets to the total number of negative tweets in the
corpus.
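As a minimal sketch, the TSI above can be computed directly from the four counts; the function name and signature are ours, not from the cited work.

```python
def tsi(p, n, total_pos, total_neg):
    """Total Sentiment Index of a token.

    p, n                 : occurrences of the token in positive / negative tweets
    total_pos, total_neg : total positive / negative tweets in the corpus; their
                           ratio tp/tn corrects for class imbalance
    """
    ratio = total_pos / total_neg
    return (p - ratio * n) / (p + ratio * n)
```

On a balanced corpus, a token seen 90 times in positive tweets and 10 times in negative tweets gets TSI = 0.8; a token seen only in negative tweets gets −1.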
Moreover, [8] showed that the well-known "geo-tagged" feature in Twitter can be
used to identify the polarity of political candidates in the US, by employing
sentiment analysis algorithms to predict future events such as presidential
election results. Compared to previous approaches to sentiment topics,
additional findings by [10] showed that adding a semantic feature produces
better recall (the share of relevant documents retrieved) in negative sentiment
classification.
3. FEASIBILITY ANALYSIS
The feasibility of the proposed system was assessed along four dimensions:
Technical feasibility
Operational feasibility
Economic feasibility
Schedule feasibility
3.1.1 Technical Feasibility
Evaluating the technical feasibility is the trickiest part of a feasibility
study. This is because, at this point in time, there is no detailed design of
the system, making it difficult to assess issues like performance and costs (on
account of the kind of technology to be deployed). A number of issues have to be
considered while doing a technical analysis: understand the different
technologies involved in the proposed system. Before commencing the project, we
have to be very clear about which technologies are required for the development
of the new system, and whether the required technology is available. Our system
"Tweezer" is technically feasible since all the required tools are easily
available. Python and PHP with JavaScript can be easily handled. Although all
tools seem to be easily available, there are challenges too.
3.1.2 Operational Feasibility
A proposed project is beneficial only if it can be turned into an information
system that will meet the operating requirements. Simply stated, this test of
feasibility asks if the system will work when it is developed and installed, and
whether there are major barriers to implementation.
The proposal was to make a simplified web application. It is simple to operate
and can be used in any webpage. It is free and not costly to operate.
3.1.3 Economic Feasibility
A project will fail if it takes too long to be completed before it is useful.
Typically, this means estimating how long the system will take to develop and
whether it can be completed in a given period of time, using methods like the
payback period.
3.1.4 Schedule Feasibility
Schedule feasibility is a measure of how reasonable the project timetable is.
Given our technical expertise, are the project deadlines reasonable? Some
projects are initiated with specific deadlines. It is necessary to determine
whether the deadlines are mandatory or desirable.
After an extensive analysis of the problems in the system, we familiarized
ourselves with the requirements the current system needs. The requirements are
categorized into functional and non-functional requirements, listed below:
Functional requirements are the functions or features that must be included in
any system to satisfy the business needs and be acceptable to the users. Based
on this, the functional requirements that the system must meet are as follows:
User friendly
System should provide better accuracy
To perform with efficient throughput and response time
4. SYSTEM DESIGN AND ARCHITECTURE
4.2 System Flow Diagram
4.3 Activity Diagram
4.4 Sequence Diagram
4.5 Data Flow Diagram
4.6 Flow Chart
5. METHODOLOGY
We use machine learning and natural language processing techniques for the
sentiment analysis of tweets.
Machine-learning-based text classifiers are a kind of supervised machine
learning paradigm, where the classifier needs to be trained on some labeled
training data before it can be applied to the actual classification task. The
training data is usually an extracted portion of the original data, hand-labeled
manually. After suitable training, the classifier can be used on the actual test
data. Naive Bayes is a statistical classifier, whereas the Support Vector
Machine is a kind of vector space classifier. The statistical text
classification scheme of Naive Bayes (NB) can be adapted to the sentiment
classification problem, which can be visualized as a two-class text
classification problem: positive and negative classes. The Support Vector
Machine (SVM) is a vector-space-model-based classifier which requires that text
documents be transformed into feature vectors before they are used for
classification. Usually the text documents are transformed into multidimensional
vectors. The entire problem of classification is then classifying every text
document, represented as a vector, into a particular class. SVM is a type of
large-margin classifier: the goal is to find a decision boundary between the two
classes that is maximally far from any document in the training data.
There are various training sets available on the Internet, such as the Movie
Reviews data set, Twitter datasets, etc. The classes are positive and negative,
and we need training data sets for both classes.
The Naïve Bayes classifier is the simplest and most commonly used classifier.
The Naïve Bayes classification model computes the posterior probability of a
class based on the distribution of the words in the document. The model works
with bag-of-words (BOW) feature extraction, which ignores the position of a word
in the document. It uses Bayes' theorem to predict the probability that a given
feature set belongs to a particular label.
P(label|features) = P(label) · P(features|label) / P(features)
P(label) is the prior probability of a label, or the likelihood that a random
feature set has that label. P(features|label) is the probability that a given
feature set is observed for that label. P(features) is the probability that a
given feature set occurs at all. Given the naïve assumption, which states that
all features are independent, the equation can be rewritten as follows:

P(label|f1, …, fn) = P(label) · P(f1|label) · … · P(fn|label) / P(f1, …, fn)
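Under the independence assumption, classification reduces to comparing a per-label score built from per-word probabilities; since P(features) is the same for every label, it can be dropped, and log-probabilities avoid numerical underflow. The sketch below illustrates this with made-up probability tables, not values from our trained model.

```python
import math

# Hypothetical trained values for a two-class problem; a real system
# would estimate these from the labeled training tweets.
PRIOR = {"pos": 0.6, "neg": 0.4}
COND = {"pos": {"great": 0.125, "poor": 0.0417},
        "neg": {"great": 0.0625, "poor": 0.125}}

def log_score(features, label):
    """log P(label) + sum_i log P(f_i|label); the common P(features)
    denominator is dropped because it does not affect the comparison."""
    return math.log(PRIOR[label]) + sum(
        math.log(COND[label][f]) for f in features)

def classify(features):
    """Return the label with the higher (unnormalized) log posterior."""
    return max(PRIOR, key=lambda y: log_score(features, y))
```

With these toy tables, a tweet containing "great" scores higher under the positive label, and one containing "poor" under the negative label.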
Algorithm:
i. Dictionary generation
Count the occurrences of all words in the whole data set and make a dictionary
of the most frequent words.
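Step (i) can be sketched with Python's `collections.Counter`; the function name and the dictionary-size parameter are illustrative.

```python
from collections import Counter

def build_dictionary(documents, size):
    """Count every word across the data set and keep the `size` most frequent."""
    counts = Counter(w for doc in documents for w in doc.lower().split())
    return [w for w, _ in counts.most_common(size)]
```

For example, from the documents "good movie good" and "bad movie", a dictionary of size 2 keeps "good" and "movie".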
ii. Feature set generation
Each document is represented as a feature vector over the space of dictionary
words. For each document, keep track of the dictionary words along with their
number of occurrences in that document.
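Step (ii) amounts to counting dictionary words per document; a minimal sketch, with names of our choosing:

```python
def to_feature_vector(document, dictionary):
    """Occurrence count of each dictionary word in the document."""
    words = document.lower().split()
    return [words.count(w) for w in dictionary]
```

For example, "Good good movie" over the dictionary ["good", "movie", "bad"] becomes the vector [2, 1, 0].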
With add-one (Laplace) smoothing, the estimate of φ_{k|label=y} = P(x_j = k | label = y) is:

φ_{k|label=y} = ( Σ_{i=1}^{m} Σ_{j=1}^{n_i} 1{ x_j^{(i)} = k and label^{(i)} = y } + 1 ) / ( Σ_{i=1}^{m} 1{ label^{(i)} = y } · n_i + |V| )

where m is the number of training documents, n_i is the number of words in
document i, x_j^{(i)} is the j-th word of document i, and |V| is the size of the
dictionary.
Training
In this phase we generate the training data: the words together with their
probability of occurrence in the positive and negative training data files.
Calculate the prior φ_{label=y} for each label y.
Calculate φ_{k|label=y} for each dictionary word k and store the result (here
the labels are negative and positive).
Now we have each word and its corresponding probability for each of the defined labels.
Testing
Goal: find the sentiment of a given test data file.
Similarly, compute the feature vector of the test document and, using the stored
probabilities, calculate the posterior for each label; the label with the higher
posterior is the predicted sentiment.
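The two training calculations can be sketched together with add-one smoothing on a tiny illustrative corpus; the function and variable names, and the 'pos'/'neg' label strings, are our assumptions.

```python
def train(docs, labels, vocab):
    """Estimate priors and add-one (Laplace) smoothed word probabilities.

    docs   : list of token lists
    labels : parallel list of class names, e.g. 'pos' / 'neg'
    vocab  : the dictionary of words generated earlier
    """
    prior, cond = {}, {}
    for y in set(labels):
        # All tokens from the documents carrying label y.
        tokens = [w for d, l in zip(docs, labels) if l == y for w in d]
        prior[y] = labels.count(y) / len(labels)
        cond[y] = {w: (tokens.count(w) + 1) / (len(tokens) + len(vocab))
                   for w in vocab}
    return prior, cond
```

On five toy documents (three positive, two negative) over a ten-word dictionary, this reproduces estimates such as P(+) = 0.6 and smoothed per-word probabilities of the form (nk + 1) / (n + |Vocabulary|).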
The following diagrams and calculations show details of tweet data processing,
feature extraction, analysis, and tweet polarity classification based on the
Naïve Bayes algorithm and classifier.
Convert the document into feature sets, where the attributes are possible words
and the values are the number of times a word occurs in the given document.
DOC | I | loved | the | movies | hated | a | great | poor | acting | good | CLASS
1   | 1 | 1     | 1   | 1      |       |   |       |      |        |      | +
2   | 1 |       | 1   | 1      | 1     |   |       |      |        |      | -
3   |   |       |     | 2      |       | 1 | 1     |      |        | 1    | +
4   |   |       |     |        |       |   |       | 1    | 1      |      | -
5   |   |       |     | 1      |       | 1 | 1     |      | 1      | 1    | +

P(+) = 3/5 = 0.6

Compute: P(i|+); P(loved|+); P(the|+); P(movies|+); P(a|+); P(great|+);
P(acting|+); P(good|+).

Let n be the number of words in the (+) documents: 14. Let nk be the number of
times word k occurs in these (+) documents, and let

P(wk|+) = (nk + 1) / (n + |Vocabulary|)

The positive documents are:

DOC | I | loved | the | movies | hated | a | great | poor | acting | good | CLASS
1   | 1 | 1     | 1   | 1      |       |   |       |      |        |      | +
3   |   |       |     | 2      |       | 1 | 1     |      |        | 1    | +
5   |   |       |     | 1      |       | 1 | 1     |      | 1      | 1    | +

With P(+) = 3/5 = 0.6 and |Vocabulary| = 10:

P(i|+) = (1+1)/(14+10) = 0.0833;      P(loved|+) = (1+1)/(14+10) = 0.0833;
P(the|+) = (1+1)/(14+10) = 0.0833;    P(movies|+) = (4+1)/(14+10) = 0.2083;
P(a|+) = (2+1)/(14+10) = 0.125;       P(great|+) = (2+1)/(14+10) = 0.125;
P(acting|+) = (1+1)/(14+10) = 0.0833; P(good|+) = (2+1)/(14+10) = 0.125;
P(hated|+) = (0+1)/(14+10) = 0.0417;  P(poor|+) = (0+1)/(14+10) = 0.0417
Now, let's look at the negative examples:

DOC | I | loved | the | movies | hated | a | great | poor | acting | good | CLASS
2   | 1 |       | 1   | 1      | 1     |   |       |      |        |      | -
4   |   |       |     |        |       |   |       | 1    | 1      |      | -

P(-) = 2/5 = 0.4

Here n = 6, so:

P(i|-) = (1+1)/(6+10) = 0.125;        P(loved|-) = (0+1)/(6+10) = 0.0625;
P(the|-) = (1+1)/(6+10) = 0.125;      P(movies|-) = (1+1)/(6+10) = 0.125;
P(a|-) = (0+1)/(6+10) = 0.0625;       P(great|-) = (0+1)/(6+10) = 0.0625;
P(acting|-) = (1+1)/(6+10) = 0.125;   P(good|-) = (0+1)/(6+10) = 0.0625;
P(hated|-) = (1+1)/(6+10) = 0.125;    P(poor|-) = (1+1)/(6+10) = 0.125
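The positive-class arithmetic above can be checked mechanically; the counts below are read off the worked example (n = 14 words in the positive documents, a ten-word vocabulary).

```python
# Word counts over the three positive documents.
pos_counts = {"i": 1, "loved": 1, "the": 1, "movies": 4, "hated": 0,
              "a": 2, "great": 2, "poor": 0, "acting": 1, "good": 2}
n = sum(pos_counts.values())   # 14 words in the (+) documents
vocab_size = len(pos_counts)   # 10 dictionary words

def p_word_given_pos(word):
    """Add-one smoothed estimate (nk + 1) / (n + |Vocabulary|)."""
    return (pos_counts[word] + 1) / (n + vocab_size)
```

Running it reproduces the hand-computed values, e.g. P(movies|+) ≈ 0.2083 and P(hated|+) ≈ 0.0417.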
scores are computed by combining the results produced by eight ternary
classifiers.
WordNet is a large lexical database of English. Nouns, verbs, adjectives, and
adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a
distinct concept.
WordNet is also freely and publicly available for download. WordNet's structure
makes it a useful tool for computational linguistics and natural language
processing. It groups words together based on their meanings. A synset is simply
a set of one or more synonyms. This approach uses semantics to understand the
language. Several major NLP tasks help in extracting sentiment from a sentence.
Basically, positive and negative scores are obtained from SentiWordNet according
to a word's part-of-speech tag; by summing the total positive and negative
scores we determine the sentiment polarity based on which class (i.e., either
positive or negative) has received the highest score.
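The counting scheme just described can be sketched as follows; the tiny score table is a stand-in for real SentiWordNet lookups (which are keyed by word and part-of-speech tag), and its numbers are invented for illustration.

```python
# Hypothetical (positive_score, negative_score) entries; real values would
# come from SentiWordNet, looked up by word and POS tag.
LEXICON = {("good", "a"): (0.75, 0.0),
           ("poor", "a"): (0.0, 0.625),
           ("movie", "n"): (0.0, 0.0)}

def polarity(tagged_words):
    """Sum positive and negative scores over (word, POS) pairs and return
    the class with the higher total, or 'neutral' on a tie."""
    pos = sum(LEXICON.get(w, (0.0, 0.0))[0] for w in tagged_words)
    neg = sum(LEXICON.get(w, (0.0, 0.0))[1] for w in tagged_words)
    if pos > neg:
        return "positive"
    return "negative" if neg > pos else "neutral"
```

For example, a sentence tagged [("good", "a"), ("movie", "n")] scores positive, while [("poor", "a")] scores negative.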
5.3.1 MongoDB
5.3.2 Python
NLTK is a leading platform for building Python programs to work with human
language data. It provides easy-to-use interfaces to over 50 corpora and lexical
resources such as WordNet, along with a suite of text processing libraries for
classification, tokenization, stemming, tagging, parsing, and semantic reasoning,
wrappers for industrial-strength NLP libraries, and an active discussion forum.
NLTK has been called “a wonderful tool for teaching, and working in,
computational linguistics using Python,” and “an amazing library to play with
natural language.” NLTK is suitable for linguists, engineers, students, educators,
researchers, and industry users alike. Natural Language Processing with
Python provides a practical introduction to programming for language processing.
Written by the creators of NLTK, it guides the reader through the fundamentals of
writing Python programs, working with corpora, categorizing text, analyzing
linguistic structure, and more.
5.3.4 PHP
5.3.5 JavaScript
We make use of JavaScript to incorporate different plugins in our web page.
JavaScript is the programming language of HTML and the Web. JavaScript resides
inside HTML documents, and can provide levels of interactivity to web pages that
are not achievable with simple HTML.
5.3.6 HTML
We use HTML for rendering the analyzed data in the web page. HTML is the standard
markup language for creating Web pages. HTML describes the structure of Web
pages using markup. HTML elements are the building blocks of HTML pages.
5.3.7 Highcharts
6. TESTING
6.1. Unit Testing
Unit testing is performed to test modules against the detailed design. Inputs to
the process are usually compiled modules from the coding process. Each module is
assembled into a larger unit during the unit testing process.
Testing has been performed at each phase of project design and coding. We carry
out testing of the module interface to ensure the proper flow of information
into and out of the program unit. We make sure that temporarily stored data
maintains its integrity throughout the algorithm's execution by examining the
local data structures. Finally, all error-handling paths are also tested.
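As an illustration of module-level testing with Python's built-in `unittest`, here is a self-contained example; the `normalize` function is hypothetical, standing in for one of the project's preprocessing modules.

```python
import unittest

def normalize(text):
    """Hypothetical unit under test: lower-case and collapse whitespace."""
    return " ".join(text.lower().split())

class NormalizeTest(unittest.TestCase):
    def test_collapses_whitespace_and_case(self):
        self.assertEqual(normalize("  Great   MOVIE "), "great movie")

    def test_empty_input(self):
        self.assertEqual(normalize(""), "")
```

Running `python -m unittest` discovers and executes these cases, checking the module's behavior in isolation.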
6.2. System Testing
We perform system testing to find errors resulting from unanticipated
interaction between sub-systems and system components. Once the source code is
generated, the software must be tested to detect and rectify all possible errors
before delivering it to the customers. To find errors, a series of test cases
must be developed which ultimately uncover all the possibly existing errors.
Different software techniques can be used for this process; these techniques
provide systematic guidance for designing tests:
White box testing: internal program logic is exercised using this test-case
design technique.
Black box testing: software requirements are exercised using this test-case
design technique.
Both techniques help in finding the maximum number of errors with minimal effort
and time.
6.3. Performance Testing
Performance testing evaluates the run-time performance of the software within
the context of the integrated system. These tests are carried out throughout the
testing process. For example, the performance of individual modules is assessed
during white box testing under unit testing.
6.4. Verification and Validation
Verification is more like 'are we building the product right?' and validation is
more like 'are we building the right product?'.
7. ANALYSIS AND RESULTS
7.1 Analysis
We collected a dataset containing positive and negative data. That dataset was
used as training data and was classified using the Naïve Bayes classifier.
Before training the classifier, unnecessary words, punctuation, and meaningless
words were cleaned out to get pure data. To determine the positivity and
negativity of tweets we collected data using the Twitter API. Those data were
stored in the database and then retrieved to remove unnecessary words and
punctuation. To check the polarity of a test tweet we apply the classifier built
from the training data. The results were stored in the database and then
retrieved and rendered using PHP, HTML, JavaScript, and CSS.
7.2 Result
Fig 7.2.2: "Bar Graph Diagram"
8. LIMITATION AND FUTURE ENHANCEMENT
8.1 Limitation
The system we designed determines the opinion of people based on Twitter data.
We completed our project and were able to determine only the positivity and
negativity of tweets; for neutral data we were unable to merge a dataset. Also,
we currently analyze only 25 live tweets. This may not give a proper value
CONCLUSION
We completed our project using Python as the main language, with PHP, HTML, and
JavaScript for output presentation. Although there was a problem in the
integration of Python and PHP, through a number of tutorials we were able to
integrate them.
We were able to determine the positivity and negativity of each tweet. Based on
those tweets we represented the results in diagrams such as a bar graph, pie
chart, and scatter plot. All the diagrams related to the outcome are shown in
fig (7.2.1), fig (7.2.2), and fig (7.2.3). A small conclusion is also shown
during output presentation based on the product or brand entered. Our designed
system is user friendly.
REFERENCES
1. Kim S-M, Hovy E (2004) Determining the sentiment of opinions. In: Proceedings
of the 20th International Conference on Computational Linguistics, page 1367.
Association for Computational Linguistics, Stroudsburg, PA, USA.
3. Pak A, Paroubek P (2010) Twitter as a corpus for sentiment analysis and
opinion mining. In: Proceedings of the Seventh Conference on International
Language Resources and Evaluation. European Language Resources Association,
Valletta, Malta.
5. Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf
Retr 2(1-2): 1–135.
9. Mukherjee A, Liu B, Glance N (2012) Spotting fake reviewer groups in consumer
reviews. In: Proceedings of the 21st International Conference on World Wide Web,
WWW '12, 191–200. ACM, New York, NY, USA.
10. Saif H, He Y, Alani H (2012) Semantic sentiment analysis of Twitter. In: The
Semantic Web (pp. 508–524). ISWC.
11. Tan LK-W, Na J-C, Theng Y-L, Chang K (2011) Sentence-level sentiment
polarity classification using a linguistic approach. In: Digital Libraries: For
Cultural Heritage, Knowledge Dissemination, and Future Creation, 77–87.
Springer, Heidelberg, Germany.
12. Liu B (2012) Sentiment Analysis and Opinion Mining. Synthesis Lectures on
Human Language Technologies. Morgan & Claypool Publishers.
13. Gann W-JK, Day J, Zhou S (2014) Twitter analytics for insider trading fraud
detection system. In: Proceedings of the Second ASE International Conference on
Big Data. ASE.
14. Joachims T (1997) Probabilistic analysis of the Rocchio algorithm with
TFIDF for text categorization. In: Presented at the ICML conference.
APPENDIX