TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING

HIMALAYA COLLEGE OF ENGINEERING

A
FINAL YEAR PROJECT REPORT ON

TWEEZER
(CT-755)

Anil Shrestha(070/BCT/01)
Bijay Sahani(070/BCT/05)
Bimal Shrestha(070/BCT/10)
Deshbhakta Khanal(070/BCT/13)

A PROJECT REPORT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS
AND COMPUTER ENGINEERING IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE BACHELOR'S DEGREE IN COMPUTER
ENGINEERING

August, 2017

TWEEZER

A
FINAL YEAR PROJECT REPORT
(CT-755)

Anil Shrestha (070/BCT/01)


Bijay Sahani (070/BCT/05)
Bimal Shrestha (070/BCT/10)
Deshbhakta Khanal (070/BCT/13)

PROJECT SUPERVISOR
Er. Nhurendra Shakya

“A major project report submitted in partial fulfillment of the requirements
for the degree of Bachelor's in Computer Engineering.”

DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

HIMALAYA COLLEGE OF ENGINEERING

Tribhuvan University

Lalitpur, Nepal

August, 2017

ACKNOWLEDGEMENT
It is a matter of great pleasure to present this report on the development of
“Tweezer”. We are grateful to the Institute of Engineering, Pulchowk, and Himalaya
College of Engineering management for providing us with this great opportunity to
develop a web application for the major project.

We are also grateful to our project supervisor Er. Nhurendra Shakya for his valuable
advice and suggestions.

Also, we would like to thank Er. Ashok GM, Head of Department, and Er. Alok
Samrat Kafle, Major Project Coordinator, for providing us all the necessary
resources that meet our project requirements.

We would like to convey our thanks to the teaching and non-teaching staff of the
Department of Electronics and Computer Engineering, HCOE, for their invaluable
help and support throughout the period of the project. We would also like to
express our gratitude to all our friends and everyone who has been a part of this
project by providing their comments and suggestions.
Group Members
Anil Shrestha
Bijay Sahani
Bimal Shrestha
Deshbhakta Khanal

ABSTRACT
Analysis of public information from social media can yield interesting results and
insights into public opinion on almost any product, service or
personality. Social network data is one of the most effective and accurate indicators
of public sentiment. The explosion of Web 2.0 has led to increased activity in
podcasting, blogging, tagging, contributing to RSS, social bookmarking, and
social networking. As a result, there has been an eruption of interest in
mining these vast resources of data for opinions. Sentiment analysis, or opinion
mining, is the computational treatment of opinions, sentiments and subjectivity in
text. In this report we discuss a methodology which allows utilization and
interpretation of Twitter data to determine public opinion.
Developing a program for sentiment analysis is an approach to
computationally measure customers' perceptions. This report describes the design
of a sentiment analysis system, extracting and training on a vast amount of tweets.
The results classify customers' perspectives, via tweets, into positive and negative, and
represent them in a pie chart, bar diagram and scatter plot using PHP, CSS and HTML pages.
Keywords: Data mining, Natural language processing, SentiWordNet, Naïve
Bayes

TABLE OF CONTENTS

LIST OF FIGURES ................................................................................................... ix


ABBREVIATIONS .................................................................................................... x
1. INTRODUCTION............................................................................................... 1
1.1 Statement of the problem ................................................................................... 2
1.2 Objectives......................................................................................................... 3
1.3 Scope of project ................................................................................................ 3
1.4 System Overview .............................................................................................. 3
1.5 System Features ................................................................................................ 4
2. LITERATURE REVIEW..................................................................................... 5
3. FEASIBILITY ANALYSIS ................................................................................. 8
3.1.1 Technical Feasibility ....................................................................................... 8
3.1.2 Operational Feasibility .................................................................................... 9
3.1.3 Economic Feasibility ...................................................................................... 9
3.1.4 Schedule Feasibility ........................................................................................ 9
3.2 Requirement Definition ................................................................................... 10
3.2.1 Functional Requirements ........................................................................... 10
3.2.2 Non-Functional Requirements.................................................................... 10
4. SYSTEM DESIGN AND ARCHITECTURE ......................................................... 12
4.1 Use Case Diagram ........................................................................................... 12
4.2 System Flow Diagram ..................................................................................... 13
4.3 Activity Diagram............................................................................. 15
4.4 Sequence Diagram .......................................................................... 16
4.5 Data Flow Diagram ......................................................................... 17
4.6 Flow Chart...................................................................................................... 18
5. METHODOLOGY ................................................................................ 19
5.1 Machine Learning ........................................................................................... 19
5.1.1 Naïve Bayes Classifier (NB) ...................................................................... 20
5.2 Natural Language Processing ........................................................................... 25
5.3 Programming tools .......................................................................................... 26
5.3.1 MongoDB ................................................................................................ 26
5.3.2 Python...................................................................................................... 27
5.3.3 NLTK ...................................................................................................... 27
5.3.4 PHP ......................................................................................................... 27

5.3.5 JavaScript ................................................................................................. 28
5.3.6 HTML...................................................................................................... 28
5.3.7 Highcharts ................................................................................................ 28
6. TESTING............................................................................................................. 29
6.1. Unit Testing ................................................................................................... 29
6.2. System Testing............................................................................................... 29
6.3. Performance Testing ....................................................................................... 30
6.4. Verification and Validation ............................................................................. 30
7. ANALYSIS AND RESULTS ................................................................................ 31
7.1 Analysis.......................................................................................................... 31
7.2 Result ............................................................................................................. 31
8. LIMITATION AND FUTURE ENHANCEMENT ................................................. 33
8.1 Limitation ....................................................................................................... 33
8.2 Future Enhancement ........................................................................................ 33
CONCLUSION........................................................................................................ 34
REFERENCES......................................................................................................... 35
APPENDIX ............................................................................................................. 37

LIST OF FIGURES

Figure 4.1 Use Case Diagram………………………………………………...….13


Figure 4.2 System Flow Diagram………………………………………………..14
Figure 4.3 Activity Diagram……………………………………………………..15
Figure 4.4 Sequence Diagram……………………………………………………16
Figure 4.5 Data Flow Diagram…………………………………………………...17
Figure 4.6 Flow Chart Diagram………………………………………………….18
Figure 7.2.1 Pie-Chart Representation…………………………………………...30
Figure 7.2.2 Bar Graph Diagram…………………………………………………31
Figure 7.2.3 Scatter Plot………………………………………………………….31

ABBREVIATIONS

API : Application Programming Interface


MongoDB : Mongo Database
NLP : Natural Language Processing
NLTK : Natural Language Toolkit
PHP : Hypertext Preprocessor
POS : Parts of Speech
REST : Representational State Transfer

1. INTRODUCTION

Sentiment is an attitude, thought, or judgment prompted by feeling. Sentiment
analysis [1-3], which is also known as opinion mining, studies people's sentiments
towards certain entities. The Internet is a resourceful place with respect to sentiment
information. From a user's perspective, people are able to post their own content
through various social media, such as forums, micro-blogs, or online social
networking sites. From a researcher's perspective, many social media sites release
their application programming interfaces (APIs), prompting data collection and
analysis by researchers and developers. For instance, Twitter currently has three
different versions of APIs available [6], namely the REST API, the Search API,
and the Streaming API. With the REST API, developers are able to gather status
data and user information; the Search API allows developers to query specific
Twitter content, whereas the Streaming API is able to collect Twitter content in
real time. Moreover, developers can mix those APIs to create their own
applications. Hence, sentiment analysis appears to have a strong foundation with
the support of massive online data.
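As a sketch of how the Search API mentioned above can be queried: the endpoint and parameter names below follow the v1.1 Search API that was current in 2017, but the function names and bearer-token handling are our own illustration, not code from this project.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def build_search_params(keyword, count=100, lang="en"):
    # Parameter names follow the v1.1 Search API documentation.
    return {"q": keyword, "count": count, "lang": lang, "result_type": "recent"}

def search_tweets(keyword, bearer_token, count=100):
    """Fetch recent tweets matching a keyword via application-only auth."""
    query = urllib.parse.urlencode(build_search_params(keyword, count))
    req = urllib.request.Request(
        SEARCH_URL + "?" + query,
        headers={"Authorization": "Bearer " + bearer_token},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Matching tweets are returned under the "statuses" key.
    return [status["text"] for status in body["statuses"]]
```

A valid bearer token obtained from a registered Twitter application is assumed; the Streaming API would instead hold a long-lived connection and deliver tweets as they are posted.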

However, those types of online data have several flaws that potentially hinder the
process of sentiment analysis. The first flaw is that since people can freely post
their own content, the quality of their opinions cannot be guaranteed. For example,
instead of sharing topic-related opinions, online spammers post spam on forums.
Some spam is entirely meaningless, while other spam carries irrelevant opinions,
also known as fake opinions [7-9]. The second flaw is that the ground truth of such
online data is not always available. A ground truth is essentially a tag on a certain
opinion, indicating whether the opinion is positive, negative, or neutral. The Stanford
Sentiment 140 Tweet Corpus [10] is one of the datasets that has ground truth and
is also publicly available. The corpus contains 1.6 million machine-tagged Twitter
messages.

Microblogging websites have evolved to become a source of varied kinds of
information. This is due to the nature of microblogs, on which people post real-time
messages about their opinions on a variety of topics, discuss current issues,
complain, and express positive sentiment for products they use in daily life. In fact,
companies manufacturing such products have started to poll these microblogs to get
a sense of general sentiment for their products. Many times these companies study
user reactions and reply to users on microblogs. One challenge is to build
technology to detect and summarize an overall sentiment.
Our project Tweezer analyzes tweets posted by people about the products of
companies or brands, or about the performance of political leaders. To do
this we analyzed tweets from Twitter. Tweets are a reliable source of information,
mainly because people tweet about anything and everything they do, including
buying new products and reviewing them. Besides, many tweets contain hashtags,
which make identifying relevant tweets a simple task. A number of research works
have already been done on Twitter data, most of which mainly demonstrate how
useful this information is for predicting various outcomes. Our current work deals
with outcome prediction and explores localized outcomes.

We collected data using the Twitter public API which allows developers to extract
tweets from twitter programmatically.
The collected data, because of the random and casual nature of tweeting, needs to
be filtered to remove unnecessary information. Filtering out problematic tweets,
such as redundant ones and ones with no proper sentences, was done next.
As the preprocessing phase was done to a certain extent, it was possible to ensure
that analyzing these filtered tweets would give reliable results. Twitter does not
provide gender as a query parameter, so it is not possible to obtain the gender of
a user from his or her tweets. It turned out that Twitter does not ask for user gender
while opening an account, so that information is simply unavailable.
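The filtering step just described can be sketched as follows; the helper names and the exact cleaning rules are our own illustration of the idea, not the project's code.

```python
import re

def clean_tweet(text):
    """Strip URLs, @-mentions and the '#' of hashtags; normalize whitespace."""
    text = re.sub(r"https?://\S+", "", text)   # remove links
    text = re.sub(r"@\w+", "", text)           # remove mentions
    text = text.replace("#", "")               # keep the hashtag word, drop '#'
    return re.sub(r"\s+", " ", text).strip().lower()

def filter_tweets(tweets, min_words=3):
    """Drop duplicates and tweets too short to form a proper sentence."""
    seen, kept = set(), []
    for raw in tweets:
        cleaned = clean_tweet(raw)
        if len(cleaned.split()) >= min_words and cleaned not in seen:
            seen.add(cleaned)
            kept.append(cleaned)
    return kept
```

Deduplicating on the cleaned text (rather than the raw text) also collapses retweets that differ only in the prepended mention.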

1.1 Statement of the problem

The problem at hand consists of two subtasks:


 Phrase Level Sentiment Analysis in Twitter:
Given a message containing a marked instance of a word or a phrase,
determine whether that instance is positive, negative or neutral in that
context.
 Sentence Level Sentiment Analysis in Twitter:
Given a message, decide whether the message is of positive, negative, or
neutral sentiment. For messages conveying both a positive and negative
sentiment, whichever is the stronger sentiment should be chosen.

1.2 Objectives

The objectives of this project are:


 To implement an algorithm for the automatic classification of text into
positive and negative
 To determine whether the attitude of the mass towards the subject of
interest is positive, negative or neutral
 To represent the sentiment graphically in the form of a pie chart, bar
diagram and scatter plot

1.3 Scope of project

This project will be helpful to companies and political parties as well as to
common people. It will be helpful to a political party for reviewing the programs
it plans to run or the programs it has already performed. Similarly,
companies can get reviews of their new products, such as newly released hardware
or software. A movie maker can also gather reviews of a currently running movie.
By analyzing the tweets, the analyst can measure how positive, negative or
neutral people are about the subject.

1.4 System Overview

This proposal, entitled “TWEEZER”, is a web application which is used to analyze
tweets. We perform sentiment analysis on tweets and determine whether
they are positive, negative or neutral. This web application can be used by any
organization or office to review its work, by political leaders, or by any
company to gather reviews of its products or brands.

1.5 System Features

The main feature of our web application is that it helps determine people's
opinions on products, government work, politics or any other topic by analyzing
tweets. Our system is capable of training on new tweets, taking reference from
previously trained tweets.

The computed or analyzed data is represented in various diagrams, such as a
pie chart, bar graph and scatter plot.

2. LITERATURE REVIEW

Sentiment analysis has been handled as a Natural Language Processing task at many
levels of granularity. Starting from being a document-level classification task
(Turney, 2002; Pang and Lee, 2004) [4,5], it has been handled at the sentence level
(Hu and Liu [2], 2004; Kim and Hovy, 2004 [1]) and more recently at the phrase
level (Wilson et al., 2005; Agarwal et al., 2009). Microblog data like Twitter, on
which users post real-time reactions to and opinions about “everything”, poses
newer and different challenges. Some of the early and recent results on sentiment
analysis of Twitter data are by Go et al. (2009), Bermingham and Smeaton (2010),
and Pak and Paroubek (2010) [3]. Go et al. (2009) use distant learning to acquire
sentiment data. They treat tweets ending in positive emoticons like “:)” and “:-)” as
positive and tweets ending in negative emoticons like “:(” and “:-(” as negative.
They build models using Naive Bayes, MaxEnt and Support Vector Machines
(SVM), and they report that SVM outperforms the other classifiers. In terms of
feature space, they try unigram and bigram models in conjunction with
parts-of-speech (POS) features. They note that the unigram model outperforms all
other models; specifically, bigrams and POS features do not help. Pak and Paroubek
(2010) [3] collect data following a similar distant learning paradigm. They perform
a different classification task though: subjective versus objective.
For subjective data they collect the tweets ending with emoticons in the same
manner as Go et al. (2009). For objective data they crawl Twitter accounts of
popular newspapers like the “New York Times”, “Washington Post”, etc. They
report that POS and bigrams both help (contrary to results presented by Go et al.
(2009)). Both of these approaches, however, are primarily based on n-gram models.
Moreover, the data they use for training and testing is collected by search queries
and is therefore biased. In contrast, we present features that achieve a significant
gain over a unigram baseline. In addition, we explore a different method of data
representation and report significant improvement over the unigram models.
Another contribution of this work is that we report results on manually annotated
data that does not suffer from any known biases. Our data will be a random sample
of streaming tweets, unlike data collected by using specific queries. The size of our
hand-labeled data will allow us to perform cross-validation experiments and check
for the variance in performance of the classifier across folds. Another significant
effort for sentiment classification on Twitter data is by Barbosa and Feng (2010).

They use polarity predictions from three websites as noisy labels to train a model
and use 1000 manually labeled tweets for tuning and another 1000 manually labeled
tweets for testing. They, however, do not mention how they collect their test data.
They propose the use of syntax features of tweets like retweets, hashtags, links,
punctuation and exclamation marks in conjunction with features like the prior
polarity of words and the POS of words. We extend their approach by using
real-valued prior polarity, and by combining prior polarity with POS. Our results
show that the features that enhance the performance of our classifiers the most are
features that combine the prior polarity of words with their parts of speech. The
tweet syntax features help, but only marginally. Gamon (2004) performs sentiment
analysis on feedback data from the Global Support Services survey. One aim of that
paper is to analyze the role of linguistic features like POS tags. They perform
extensive feature analysis and feature selection and demonstrate that abstract
linguistic analysis features contribute to the classifier accuracy. In this work we
perform extensive feature analysis and show that the use of only 100 abstract
linguistic features performs as well as a hard unigram baseline.

One fundamental problem in sentiment analysis is the categorization of sentiment
polarity [5,11]. Given a piece of written text, the problem is to categorize the text
into one specific sentiment polarity: positive or negative (or neutral). Based on the
scope of the text, there are three levels of sentiment polarity categorization, namely
the document level, the sentence level, and the entity and aspect level [12]. The
document level concerns whether a document, as a whole, expresses negative or
positive sentiment, while the sentence level deals with each sentence's sentiment
categorization. The entity and aspect level then targets what exactly people like
or dislike from their opinions.
For feature selection, Pang and Lee [4] suggested removing objective sentences by
extracting subjective ones. They proposed a text-categorization technique that is
able to identify subjective content using minimum cut. Gann et al. [13] selected
6,799 tokens based on Twitter data, where each token is assigned a sentiment score,
namely the TSI (Total Sentiment Index), marking it as a positive token or a
negative token. Specifically, the TSI for a certain token is computed as:

TSI = ( p − (tp/tn) ∗ n ) / ( p + (tp/tn) ∗ n )

where p is the number of times the token appears in positive tweets, n is the
number of times the token appears in negative tweets, and tp/tn is the ratio of the
total number of positive tweets to the total number of negative tweets.
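A direct computation of this score can be sketched as follows; the function name is ours, while p, n and the tweet totals are as defined above.

```python
def total_sentiment_index(p, n, total_pos, total_neg):
    """TSI for one token.

    p, n: occurrences of the token in positive / negative tweets.
    total_pos, total_neg: total counts of positive / negative tweets; their
    ratio corrects for class imbalance in the corpus.
    """
    ratio = total_pos / total_neg
    return (p - ratio * n) / (p + ratio * n)
```

On a balanced corpus the ratio is 1, so a token seen only in positive tweets scores +1, one seen only in negative tweets scores -1, and a token split evenly scores 0.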

Moreover, [8] showed that by using the well-known “geo-tagged” feature in Twitter
to identify the polarity of political candidates in the US, sentiment analysis
algorithms could be employed to predict future events such as presidential
election results. Compared to previous approaches to sentiment topics, additional
findings by [10] showed that adding a semantic feature produces better recall on
retrieved documents in negative sentiment classification.

3. FEASIBILITY ANALYSIS

A feasibility study is a preliminary study which investigates the needs of
prospective users and determines the resource requirements, costs, benefits and
feasibility of the proposed system. A feasibility study takes into account various
constraints within which the system should be implemented and operated. In this
stage, the resources needed for the implementation, such as computing equipment,
manpower and costs, are estimated. The estimates are compared with the available
resources and a cost-benefit analysis of the system is made. The feasibility analysis
activity involves the analysis of the problem and the collection of all relevant
information relating to the project. The main objective of the feasibility study is to
determine whether the project would be feasible in terms of economic, technical,
operational and schedule feasibility. It is also to make sure that the input data
required for the project are available.
Thus we evaluated the feasibility of the system in terms of the following categories:

 Technical feasibility
 Operational feasibility
 Economic feasibility
 Schedule feasibility

3.1.1 Technical Feasibility

Evaluating technical feasibility is the trickiest part of a feasibility study. This is
because, at this point in time, there is no detailed design of the system, making
it difficult to assess issues like performance, costs (on account of the kind of
technology to be deployed), etc. A number of issues have to be considered while
doing a technical analysis, chiefly understanding the different technologies involved
in the proposed system. Before commencing the project, we have to be very clear
about which technologies are required for the development of the new system, and
whether the required technology is available. Our system “Tweezer” is technically
feasible since all the required tools are easily available. Python and PHP with
JavaScript can be easily handled. Although all the tools seem to be easily available,
there are challenges too.
3.1.2 Operational Feasibility

The proposed project is beneficial only if it can be turned into an information
system that meets the operating requirements. Simply stated, this test of feasibility
asks if the system will work when it is developed and installed, and whether there
are major barriers to implementation.
The proposal was to make a simplified web application. It is simple to operate and
can be embedded in any webpage. It is free and not costly to operate.
3.1.3 Economic Feasibility

Economic feasibility attempts to weigh the costs of developing and implementing a
new system against the benefits that would accrue from having the new system in
place. This feasibility study gives top management the economic justification
for the new system. A simple economic analysis which gives the actual comparison
of costs and benefits is much more meaningful in this case. In addition, it proves
to be a useful point of reference to compare actual costs as the project progresses.
There could be various types of intangible benefits on account of automation. These
include improvement in product quality, better decision making, timeliness of
information, expedited activities, improved accuracy of operations, better
documentation and record keeping, and faster retrieval of information.

This is a web-based application. Creation of the application is not costly.


3.1.4 Schedule Feasibility

A project will fail if it takes too long to be completed before it is useful. Typically,
this means estimating how long the system will take to develop, and whether it can
be completed in a given period of time, using methods like payback-period analysis.
Schedule feasibility is a measure of how reasonable the project timetable is. Given
our technical expertise, are the project deadlines reasonable? Some projects are
initiated with specific deadlines, and it is necessary to determine whether the
deadlines are mandatory or desirable.

A minor deviation from the original schedule decided at the beginning of the
project can be encountered. The application development is feasible in terms of
schedule.

3.2 Requirement Definition

After an extensive analysis of the problems in the system, we became familiar with
the requirements of the current system. The requirements that the system needs are
categorized into functional and non-functional requirements, as listed below:

3.2.1 Functional Requirements

Functional requirements are the functions or features that must be included in any
system to satisfy the business needs and be acceptable to the users. Based on this,
the functional requirements that the system must meet are as follows:

 The system should be able to process new tweets stored in the database
after retrieval
 The system should be able to analyze the data and classify each tweet's
polarity

3.2.2 Non-Functional Requirements

Non-functional requirements are a description of the features, characteristics and
attributes of the system, as well as any constraints that may limit the boundaries of
the proposed system.

The non-functional requirements are essentially based on performance,
information, economy, control, security, efficiency and services.

Based on these, the non-functional requirements are as follows:

 The system should be user friendly
 The system should provide good accuracy
 The system should perform with efficient throughput and response time

4. SYSTEM DESIGN AND ARCHITECTURE

4.1 Use Case Diagram

Fig 4.1: Use Case Diagram of Tweezer

4.2 System Flow Diagram

Fig 4.2.1: System Flow Diagram
Fig 4.2.2: System Flow Diagram (stages: input keyword; tweets retrieval; process
tweet; extract feature data; get feature vector; Naïve Bayes classifier; aggregating
scores)

4.3 Activity Diagram

Fig 4.3: Activity Diagram

4.4 Sequence Diagram

Fig 4.4: Sequence Diagram

4.5 Data Flow Diagram

Fig 4.5: Data Flow Diagram

4.6 Flow Chart

Fig 4.6: Flow Chart Diagram
5. METHODOLOGY

There are primarily two types of approaches for sentiment classification of
opinionated texts:

 Using a machine-learning-based text classifier such as Naive Bayes

 Using Natural Language Processing

We use both machine learning and natural language processing for sentiment
analysis of tweets.

5.1 Machine Learning

Machine-learning-based text classifiers are a kind of supervised machine learning
paradigm, where the classifier needs to be trained on some labeled training data
before it can be applied to the actual classification task. The training data is usually
an extracted portion of the original data, hand-labeled manually. After suitable
training, the classifier can be used on the actual test data. Naive Bayes is a statistical
classifier, whereas a Support Vector Machine is a kind of vector space classifier. The
statistical text classification scheme of Naive Bayes (NB) can be adapted for the
sentiment classification problem, as it can be visualized as a 2-class text
classification problem with positive and negative classes. A Support Vector
Machine (SVM) is a kind of vector-space-model-based classifier which requires
that the text documents be transformed into feature vectors before they are used
for classification. Usually the text documents are transformed into
multidimensional vectors. The entire problem of classification is then classifying
every text document, represented as a vector, into a particular class. It is a type of
large-margin classifier: the goal is to find a decision boundary between the two
classes that is maximally far from any document in the training data.

This approach needs:

 A good classifier such as Naive Bayes
 A training set for each class

There are various training sets available on the Internet, such as the Movie Reviews
dataset, the Twitter dataset, etc. The classes here are positive and negative, and for
both classes we need training datasets.

5.1.1 Naïve Bayes Classifier (NB)

The Naïve Bayes classifier is the simplest and most commonly used classifier. The
Naïve Bayes classification model computes the posterior probability of a class
based on the distribution of the words in the document. The model works with
bag-of-words (BOW) feature extraction, which ignores the position of a word in the
document. It uses Bayes' theorem to predict the probability that a given feature set
belongs to a particular label:

P(label|features) = P(label) ∗ P(features|label) / P(features)

P(label) is the prior probability of a label, or the likelihood that a random feature
set has the label. P(features|label) is the likelihood of observing a given feature set
for a label. P(features) is the prior probability that a given feature set occurs. Given
the naïve assumption, which states that all features are independent, the equation
can be rewritten as follows:

P(label|features) = P(label) ∗ P(f1|label) ∗ … ∗ P(fn|label) / P(features)
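A toy numeric illustration of the rewritten equation follows; the probabilities are invented for illustration, and the P(features) denominator is omitted since it cancels when comparing labels.

```python
def unnormalized_posterior(prior, feature_likelihoods):
    """P(label) * P(f1|label) * ... * P(fn|label); P(features) omitted."""
    score = prior
    for p in feature_likelihoods:
        score *= p
    return score

# A two-feature "tweet" scored under both labels (invented numbers):
pos_score = unnormalized_posterior(0.6, [0.2, 0.1])    # 0.6 * 0.2 * 0.1 = 0.012
neg_score = unnormalized_posterior(0.4, [0.05, 0.05])  # 0.4 * 0.05 * 0.05 = 0.001
# The label with the larger score wins, so this tweet is labeled positive.
```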

5.1.1.1 Multinomial Naïve Bayes Classifier

Accuracy – around 75%

Algorithm:

i. Dictionary generation

Count the occurrences of all words in our whole data set and make a dictionary of
the most frequent words.

ii. Feature set generation

Every document is represented as a feature vector over the space of dictionary
words.

For each document, keep track of the dictionary words along with their number of
occurrences in that document.

Formula used for the algorithm:

φ_{k|label=y} = P(x_j = k | label = y)

φ_{k|label=y} = ( Σ_{i=1..m} Σ_{j=1..n_i} 1{ x_j^(i) = k and label^(i) = y } + 1 )
                / ( Σ_{i=1..m} 1{ label^(i) = y } · n_i + |V| )

where

φ_{k|label=y} = probability that a particular word in a document of label
(neg/pos) y will be the kth word in the dictionary,

n_i = number of words in the ith document,

m = total number of documents,

|V| = number of words in the dictionary (the +1 in the numerator and the |V| in
the denominator implement Laplace smoothing).

Training
In this phase we have to generate the training data (words with their probabilities
of occurrence in the positive/negative training data files).

Calculate P(label = y) for each label.

Calculate φ_{k|label=y} for each dictionary word and store the result (here the
labels will be negative and positive).

Now we have, for each of the defined labels, every word and its corresponding
probability.

Testing
Goal: find the sentiment of a given test data file.

- Generate the feature set x for the test data file.

- For each document in the test set, compute

Decision1 = log P(x|label=pos) + log P(label=pos)

and similarly

Decision2 = log P(x|label=neg) + log P(label=neg)

- Compare Decision1 and Decision2 to decide whether the document carries negative
or positive sentiment.
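The training and testing phases above can be sketched end to end on a toy corpus. This is a simplified version under our own naming; the project's real code works on the tweet data set.

```python
import math
from collections import Counter

train = [("i loved the movies", "pos"), ("i hated the movies", "neg")]
vocab = sorted({w for doc, _ in train for w in doc.split()})

# Training: per-label word counts for the smoothed phi_{k|label=y}.
word_counts = {"pos": Counter(), "neg": Counter()}
doc_counts = Counter()
for doc, y in train:
    word_counts[y].update(doc.split())
    doc_counts[y] += 1

def phi(word, y):
    """Laplace-smoothed P(word | label = y)."""
    total = sum(word_counts[y].values())
    return (word_counts[y][word] + 1) / (total + len(vocab))

def log_prior(y):
    return math.log(doc_counts[y] / len(train))

# Testing: compare Decision1 and Decision2 in log space.
def classify(doc):
    scores = {y: log_prior(y) + sum(math.log(phi(w, y)) for w in doc.split())
              for y in ("pos", "neg")}
    return max(scores, key=scores.get)

print(classify("i loved it"))
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause.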

The following diagrams and calculations show the details of tweet data processing,
feature extraction, analysis and tweet polarity classification based on the Naïve
Bayes algorithm and classifier.

You have documents and their classifications:

DOC  TEXT                         CLASS
1    I loved the movies           +
2    I hated the movies           -
3    a great movies. good movies  +
4    poor acting                  -
5    great acting. a good movies  +

Ten Unique words:

<I, loved, the, movies, hated, a, great, poor, acting, good>

Convert the document into feature sets, where the attributes are possible words,
and the values are the number of times a word occurs in the given document.

DOC  I  loved  the  movies  hated  a  great  poor  acting  good  CLASS
1    1  1      1    1       0      0  0      0     0       0     +
2    1  0      1    1       1      0  0      0     0       0     -
3    0  0      0    2       0      1  1      0     0       1     +
4    0  0      0    0       0      0  0      1     1       0     -
5    0  0      0    1       0      1  1      0     1       1     +

Documents with positive outcomes:

DOC  I  loved  the  movies  hated  a  great  poor  acting  good  CLASS
1    1  1      1    1       0      0  0      0     0       0     +
3    0  0      0    2       0      1  1      0     0       1     +
5    0  0      0    1       0      1  1      0     1       1     +

P(+) = 3/5 = 0.6

Compute p(i|+), p(loved|+), p(the|+), p(movies|+), p(hated|+), p(a|+), p(great|+),
p(poor|+), p(acting|+) and p(good|+).

Let n be the total number of words in the (+) documents: 14. Let nk be the number
of times word k occurs in these (+) documents. With Laplace smoothing,

p(wk|+) = (nk + 1) / (n + |Vocabulary|)


P(i|+) = (1+1)/(14+10) = 0.0833;      P(loved|+) = (1+1)/(14+10) = 0.0833;
P(the|+) = (1+1)/(14+10) = 0.0833;    P(movies|+) = (4+1)/(14+10) = 0.2083;
P(a|+) = (2+1)/(14+10) = 0.125;       P(great|+) = (2+1)/(14+10) = 0.125;
P(acting|+) = (1+1)/(14+10) = 0.0833; P(good|+) = (2+1)/(14+10) = 0.125;
P(hated|+) = (0+1)/(14+10) = 0.0417;  P(poor|+) = (0+1)/(14+10) = 0.0417

Now, let's look at the negative examples:

DOC  I  loved  the  movies  hated  a  great  poor  acting  good  CLASS
2    1  0      1    1       1      0  0      0     0       0     -
4    0  0      0    0       0      0  0      1     1       0     -

P(-) = 2/5 = 0.4 (with n = 6 words in the (-) documents)

P(i|-) = (1+1)/(6+10) = 0.125;      P(loved|-) = (0+1)/(6+10) = 0.0625;
P(the|-) = (1+1)/(6+10) = 0.125;    P(movies|-) = (1+1)/(6+10) = 0.125;
P(a|-) = (0+1)/(6+10) = 0.0625;     P(great|-) = (0+1)/(6+10) = 0.0625;
P(acting|-) = (1+1)/(6+10) = 0.125; P(good|-) = (0+1)/(6+10) = 0.0625;
P(hated|-) = (1+1)/(6+10) = 0.125;  P(poor|-) = (1+1)/(6+10) = 0.125

Now that we've trained our classifier, let's classify a new sentence according to:

v_NB = argmax_{vj in V} P(vj) * prod_{w in words} P(w|vj)

where v stands for "value" or "class".

"I hated the poor acting"

If vj = +: p(+) p(i|+) p(hated|+) p(the|+) p(poor|+) p(acting|+) = 6.03 * 10^-7
If vj = -: p(-) p(i|-) p(hated|-) p(the|-) p(poor|-) p(acting|-) = 1.22 * 10^-5

Since 1.22 * 10^-5 > 6.03 * 10^-7, the sentence is classified as negative.
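The arithmetic of the worked example can be checked directly, using the same per-class word counts as in the tables above:

```python
from math import prod

# Totals from the worked example: 14 words in (+) docs, 6 in (-) docs,
# and a vocabulary of 10 unique words.
n_pos, n_neg, vocab = 14, 6, 10
pos_counts = {"i": 1, "hated": 0, "the": 1, "poor": 0, "acting": 1}
neg_counts = {"i": 1, "hated": 1, "the": 1, "poor": 1, "acting": 1}

sentence = ["i", "hated", "the", "poor", "acting"]
p_pos = 0.6 * prod((pos_counts[w] + 1) / (n_pos + vocab) for w in sentence)
p_neg = 0.4 * prod((neg_counts[w] + 1) / (n_neg + vocab) for w in sentence)

print(f"{p_pos:.2e}", f"{p_neg:.2e}")  # the negative score is larger
```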

5.2 Natural Language Processing

Natural language processing (NLP) is a field of computer science, artificial
intelligence, and linguistics concerned with the interactions between computers and
human (natural) languages. This approach utilizes the publicly available
SentiWordNet resource, which provides sentiment polarity values for every term
occurring in the document. In this lexical resource, each term t occurring in
WordNet is associated with three numerical scores obj(t), pos(t) and neg(t),
describing the objective, positive and negative polarities of the term,
respectively. These three scores are computed by combining the results produced by
eight ternary classifiers.

WordNet is a large lexical database of English. Nouns, verbs, adjectives and
adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a
distinct concept.

WordNet is also freely and publicly available for download. WordNet's structure
makes it a useful tool for computational linguistics and natural language
processing. It groups words together based on their meanings. A synset is nothing
but a set of one or more synonyms. This approach uses semantics to understand the
language.
The major tasks in NLP that help in extracting sentiment from a sentence are:

 Extracting the part of the sentence that reflects the sentiment

 Understanding the structure of the sentence

 Applying the different tools that help process the textual data

Positive and negative scores are obtained from SentiWordNet according to each
word's part-of-speech tag; then, by summing the total positive and negative scores,
we determine the sentiment polarity based on which class (i.e. either positive or
negative) has received the highest score.
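The counting scheme can be sketched with a tiny hand-made lexicon standing in for SentiWordNet. The scores below are invented for illustration; in the real system they come from the SentiWordNet resource, looked up per part-of-speech tag.

```python
# Toy lexicon: (positive score, negative score) per term.
# These values are invented for illustration only.
lexicon = {
    "good":   (0.75, 0.0),
    "great":  (0.88, 0.0),
    "poor":   (0.0, 0.63),
    "boring": (0.0, 0.50),
}

def polarity(words):
    """Sum positive and negative scores; the larger total wins."""
    pos = sum(lexicon.get(w, (0, 0))[0] for w in words)
    neg = sum(lexicon.get(w, (0, 0))[1] for w in words)
    return "positive" if pos > neg else "negative"

print(polarity("a great movie with good acting".split()))
```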

5.3 Programming tools

5.3.1 MongoDB

MongoDB is an open source database that uses a document-oriented data model.


MongoDB is one of several database types to arise in the mid-2000s under the
NoSQL banner. Instead of using tables and rows as in relational databases,
MongoDB is built on an architecture of collections and documents. Documents
comprise sets of key-value pairs and are the basic unit of data in MongoDB.
Collections contain sets of documents and function as the equivalent of relationa l
database tables.

5.3.2 Python

Python is a widely used high-level, general-purpose, interpreted, dynamic


programming language. Its design philosophy emphasizes code readability, and its
syntax allows programmers to express concepts in fewer lines of code than possible
in languages such as C or Java. The language provides constructs intended to enable
writing clear programs on both a small and large scale.
5.3.3 NLTK

NLTK is a leading platform for building Python programs to work with human
language data. It provides easy-to-use interfaces to over 50 corpora and lexical
resources such as WordNet, along with a suite of text processing libraries for
classification, tokenization, stemming, tagging, parsing, and semantic reasoning,
wrappers for industrial-strength NLP libraries, and an active discussion forum.

NLTK has been called “a wonderful tool for teaching, and working in,
computational linguistics using Python,” and “an amazing library to play with
natural language.” NLTK is suitable for linguists, engineers, students, educators,
researchers, and industry users alike. Natural Language Processing with
Python provides a practical introduction to programming for language processing.
Written by the creators of NLTK, it guides the reader through the fundamentals of
writing Python programs, working with corpora, categorizing text, analyzing
linguistic structure, and more.
5.3.4 PHP

PHP is an HTML-embedded, server-side scripting language designed for web
development. It is also used as a general-purpose programming language. PHP code is
simply mixed with HTML and can be used in combination with various web frameworks.
Its scripts are executed on the server, where they are processed by a PHP
interpreter. The main goal of PHP is to allow web developers to create dynamically
generated pages quickly. A PHP file consists of text, HTML tags and scripts.

5.3.5 JavaScript

We make the use of JavaScript to incorporate different plugins in our web page.
JavaScript is the programming language of HTML and the Web. JavaScript resides
inside HTML documents, and can provide levels of interactivity to web pages that
are not achievable with simple HTML.
5.3.6 HTML

We use HTML for rendering the analyzed results in the web page. HTML is the
standard markup language for creating web pages: it describes the structure of a
page using markup, and HTML elements are the building blocks of HTML pages.
5.3.7 Highcharts

Highcharts is a charting library written in pure JavaScript, offering an easy way of


adding interactive charts to your web site or web application.

6. TESTING

6.1. Unit Testing

Unit testing is performed to test modules against the detailed design. Inputs to
the process are usually compiled modules from the coding process. Each module is
assembled into a larger unit during the unit testing process.

Testing has been performed at each phase of project design and coding. We carried
out testing of the module interfaces to ensure the proper flow of information into
and out of each program unit. We made sure that temporarily stored data maintains
its integrity throughout the algorithm's execution by examining the local data
structures. Finally, all error-handling paths were also tested.

6.2. System Testing

We perform system testing to find errors resulting from unanticipated interactions
between sub-systems and system components. Once the source code is generated, the
software must be tested to detect and rectify all possible errors before delivering
it to the customers. To find errors, a series of test cases must be developed which
ultimately uncover all the errors that may exist. Different software techniques can
be used for this process. These techniques provide systematic guidance for
designing tests that:

 Exercise the internal logic of the software components,

 Exercise the input and output domains of a program to uncover errors in
program function, behavior and performance.

We test the software using two methods:

White box testing: internal program logic is exercised using this test case design
technique.

Black box testing: software requirements are exercised using this test case design
technique.

Both techniques help in finding maximum number of errors with minimal effort and
time.
6.3. Performance Testing

Performance testing is done to test the run-time performance of the software within
the context of the integrated system. These tests are carried out throughout the
testing process. For example, the performance of individual modules is assessed
during white box testing under unit testing.
6.4. Verification and Validation

The testing process is part of the broader subject of verification and validation.
We have to acknowledge the system specifications and try to meet the customer's
requirements, and for this purpose we have to verify and validate the product to
make sure everything is in place. Verification and validation are two different
things: one is performed to ensure that the software correctly implements a
specific functionality, and the other is done to ensure that the customer's
requirements are properly met by the end product.

Verification asks 'are we building the product right?' while validation asks 'are
we building the right product?'.

7. ANALYSIS AND RESULTS

7.1 Analysis

We collected a dataset containing positive and negative examples, which served as
the training data for the Naïve Bayes classifier. Before training the classifier,
unnecessary words, punctuation and meaningless tokens were removed to obtain clean
data. To determine the positivity or negativity of tweets, we collected data using
the Twitter API. Those tweets were stored in the database and then retrieved and
cleaned in the same way. The trained classifier was then used to check the polarity
of each test tweet, and the results were stored in the database and retrieved for
display using PHP, HTML, JavaScript and CSS.
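The cleaning step described above can be sketched with the `re` module. This is a simplified version; the exact rules in the project code may differ.

```python
import re

def clean_tweet(text):
    """Strip URLs, @mentions, the '#' of hashtags, and punctuation."""
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)          # remove @mentions
    text = text.replace("#", "")              # keep hashtag words themselves
    text = re.sub(r"[^\w\s]", "", text)       # remove remaining punctuation
    return " ".join(text.lower().split())     # normalize case and whitespace

print(clean_tweet("Loved the new #phone from @brand!! http://t.co/xyz"))
```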
7.2 Result

After encountering and eliminating a number of errors, we completed our project
with continuous effort. At the end of the project the results can be summarized as:

 A user friendly web based application.


 No expertise is required for using the application.
 Organizations can use the application to visualize product or brand review
graphically.

Fig 7.2.1: Pie-Chart Representation

Fig 7.2.2: Bar Graph Diagram

Fig 7.2.3: Scatter Plot
8. LIMITATION AND FUTURE ENHANCEMENT

8.1 Limitation

The system we designed determines the opinion of people based on Twitter data. We
completed our project but were able to determine only the positivity and negativity
of tweets; for neutral tweets we were unable to merge a suitable dataset.

Also, we currently analyze only 25 live tweets, which may not give representative
values and results, so the results are not very accurate.

8.2 Future Enhancement

 Analyzing sentiments in emoticons/smileys.


 Determining neutrality.
 Potential improvement can be made to our data collection and analys is
method.
 Future research can be done with possible improvement such as more
refined data and more accurate algorithm.

CONCLUSION

We have completed our project using Python as the main language, with PHP, HTML and
JavaScript for output presentation. Although there was a problem integrating Python
and PHP, we were able to integrate them with the help of a number of tutorials.

We were able to determine the positivity and negativity of each tweet, and we
represented the results in diagrams such as a bar graph, pie chart and scatter
plot. All the diagrams related to the outcome are shown in fig 7.2.1, fig 7.2.2 and
fig 7.2.3. A short conclusion is also shown with the output, based on the product
or brand entered. Our designed system is user friendly.

All results are displayed in the web page.

REFERENCES

1. Kim S-M, Hovy E (2004) Determining the sentiment of opinions In: Proceedings
of the 20th international conference on Computational Linguistics, page 1367.
Association for Computational Linguistics, Stroudsburg, PA, USA.

2. Liu B (2010) Sentiment analysis and subjectivity In: Handbook of Natural


Language Processing, Second Edition. Taylor and Francis Group, Boca. Liu B,
Hu M, Cheng J (2005) Opinion observer: Analyzing and comparing opinions on
the web In: Proceedings of the 14th International Conference on World Wide
Web, WWW ’05, 342–351. ACM, New York, NY, USA.

3. Pak A, Paroubek P (2010) Twitter as a corpus for sentiment analysis and opinio n
mining In: Proceedings of the Seventh conference on International Langua ge
Resources and Evaluation. European Languages Resources Association, Valletta,
Malta.

4. Pang B, Lee L (2004) A sentimental education: Sentiment analysis using


subjectivity summarization based on minimum cuts In: Proceedings of the 42nd
Annual Meeting of the Association for Computational Linguistics, ACL '04.
Association for Computational Linguistics, Stroudsburg, PA, USA.

5. Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf
Retr2(1-2): 1–135.

6. Twitter (2014) Twitter apis. https://dev.twitter.com/start.

7. Liu B (2014) The science of detecting fake reviews.
http://content26.com/blog/bing-liu-the-science-of-detecting-fake-reviews/

8. Jahanbakhsh K, Moon Y (2014) The predictive power of social media: On the
predictability of U.S. presidential elections using Twitter.

9. Mukherjee A, Liu B, Glance N (2012) Spotting fake reviewer groups in consumer
reviews In: Proceedings of the 21st, International Conference on World Wide
Web, WWW ’12, 191–200.. ACM, New York, NY, USA.

10. Saif, H., He, Y., & Alani, H. (2012). Semantic sentiment analysis of twitter. The
Semantic Web (pp. 508– 524). ISWC

11. Tan LK-W, Na J-C, Theng Y-L, Chang K (2011) Sentence-level sentiment polarity
classification using a linguistic approach In: Digital Libraries: For Cultural
Heritage, Knowledge Dissemination, and Future Creation, 77–87.. Springer,
Heidelberg, Germany.

12. Liu B (2012) Sentiment Analysis and Opinion Mining. Synthesis Lectures on
Human Language Technologies. Morgan & Claypool Publishers.

13. Gann W-JK, Day J, Zhou S (2014) Twitter analytics for insider trading fraud
detection system In: Proceedings of the second ASE international conference on
Big Data. ASE.

14. Joachims T. Probabilistic analysis of the rocchio algorithm with TFIDF for text
categorization. In: Presented at the ICML conference; 1997.

15. Li Y-M, Li T-Y. Deriving market intelligence from microblogs.

APPENDIX
