The Thesis Report
Directed Research
Supervisor
ECE Department
Spring, 2019
DECLARATION
This is to certify that this Thesis is my original work. No part of this work has been submitted
elsewhere partially or fully for the award of any other degree or diploma. Any material
The directed research entitled “Opinion Mining Based on Crowdsourced Data” by Md.
Azwad Hasan Chowdhury (ID # 1430125042) is approved in partial fulfillment of the
requirement of the Degree of Bachelor of Science in Computer Science and Engineering on May
and has been accepted as satisfactory.
Supervisor’s Signature
Dr. K. M. A. Salam
Professor and Chairman
Department of Electrical and Computer Engineering
North South University
Dhaka, Bangladesh.
ACKNOWLEDGMENT
First of all, we wish to express our gratitude to the Almighty for giving us the strength to
perform our responsibilities and complete the report.
The capstone project program helps bridge the gap between theoretical knowledge and
real-life experience as part of the Bachelor of Science (BSc) program. This report has been
designed to provide practical experience grounded in theoretical understanding.
We also acknowledge our profound gratitude to all the teachers who were instrumental in
providing us with the technical knowledge and moral support needed to complete the project
with full understanding.
We must express our appreciation to our honorable faculty member Md. Shahriar Karim
for his undivided attention and help in achieving this milestone. We are also grateful to the
ECE department of North South University for giving us the opportunity to conduct research
under CSE 498R.
We thank our friends and family for their constant moral support throughout this research.
ABSTRACT
This report presents directed research on opinion mining, also known as sentiment analysis, on
several types of datasets. The analysis techniques include machine learning, deep neural networks,
a lexicon-based approach, and naïve approaches that do not rely on machine learning. Instead of
committing to one particular approach, this research aimed to explore the range of techniques
available for analyzing the sentiments that users post on online platforms such as
social media, e-commerce review sites and movie review sites.
Table of Contents
CHAPTER 1
OVERVIEW
1.1 Introduction
1.4 Summary
CHAPTER 2
MOTIVATION
2.1 Introduction
2.4 Summary
CHAPTER 3
RELATED WORK
3.1 Introduction
3.6 Thai Sentiment Analysis for Consumer’s Review in Multiple Dimensions Using Sentiment Compensation Technique
3.7 Sentiment analysis using Latent Dirichlet Allocation and topic polarity word cloud visualization
3.8 Deep learning for sentiment analysis of movie reviews
CHAPTER 4
4.1 Introduction
CHAPTER 5
NAÏVE APPROACH
5.1 Introduction
5.3 Tools
5.4 Modules
CHAPTER 6
6.1 Introduction
6.2 Dataset
6.3 System Architecture
6.5 Summary
CHAPTER 7
RESULT ANALYSIS
7.1 Introduction
CHAPTER 8
CONCLUSION
BIBLIOGRAPHY
APPENDIX
CODE SAMPLES
List of Figures
Figure No. Figure caption
1 LSTM architecture
2 Recurrent Neural Network
3 Implemented RNN model
4 Twitter API access approval email
5 Token information snapshot
6 Lexicon based system design
7 Output result of Naive approach
8 Code sample of RNN implementation
9 Code sample of Naïve approach
List of Tables
Table No. Table caption
1 Result of ML based approach
2 Result of Lexicon based approach
CHAPTER 1
OVERVIEW
1.1 Introduction
Research and industry are increasingly interested in automatically finding the
polarized opinion of the general public regarding a specific subject. The advent of social
networks has opened up access to a massive volume of blogs, recommendations, and
reviews. The challenge is to extract the polarity from these data, which is the task of opinion
mining or sentiment analysis [8]. It is now an important field of research with wide implications
for automation and Artificial Intelligence based applications. Every minute, people post
different types of opinions on various online platforms. Analyzing those opinions manually is a
very difficult task, so researchers need some form of automated system to analyze them.
Different approaches are currently used for analyzing these types of data. Among them,
machine learning based approaches are the most popular, but other approaches work as well.
For example, lexicon based approaches are also used in this area. This research also presents
naive approaches to the problem. The research is conducted on social
media data, e-commerce site review data, and movie review platforms.
The machine learning part of this research is conducted on IMDB’s movie review dataset. To
extract information, a deep learning algorithm called the recurrent neural network (RNN), a
variant of the neural network, is used. The architecture implemented here is Long Short Term
Memory (LSTM).
The naïve approach is applied to real-time tweets from the social media platform Twitter. The
project needs to access users' tweets in real time to analyze their polarity. To access
Twitter data in real time, the project uses Tweepy, an open-source Python library for the
Twitter API.
3. Lexicon Based Approach
This technique uses dictionaries of words. Each word is annotated with its emotional polarity
and opinion strength. The dictionary is then matched against the document to determine the
document's overall polarity score. These techniques usually give high precision but low recall.
Lexical algorithms can achieve strong results, but they require a lexicon, which is not
available in every language.
i. Analyzing user data in real-time: The research has implemented a model that scores
tweets posted by different users on Twitter in real time. It classifies the posted
content into positive, negative and neutral categories.
ii. Analyzing product reviews: The reviews posted by e-commerce site users for a
particular product need some kind of automation. If the business has a very large user base
and receives more reviews than can be read manually, an automated system is needed
to determine how many reviews are positive and how many are
negative for a particular product, so that the business can take appropriate steps regarding
that product.
iii. Social biases: By analyzing user data, it is possible to identify community biases or
hostile activity directed against another community or class.
1.4 Summary
This chapter gives an overview of the modules used in this research. It
provides a clear picture of how the proposed methods and technologies are going to work for
sentiment analysis with the help of machine learning, lexical and naïve approaches.
CHAPTER 2
MOTIVATION
2.1 Introduction
In this chapter, the paper discusses the motivation that led us to implement the
proposed models and technologies. We also discuss why we chose opinion mining as our
research topic.
2.4 Summary
This chapter described the motivation behind our thesis topic, which aims to
systematically automate the analysis of natural-language opinions through different types of methods.
CHAPTER 3
RELATED WORK
3.1 Introduction
In this chapter, the paper discusses the different sentiment analysis (SA) methods that
currently exist, along with earlier work by other researchers that has had a significant
impact on this field.
sentiment prediction. They built a sentiment lexicon and a text sentiment prediction framework
for their task. The approach was not much different from conventional ones. They
measured the polarity of different sentences and then built a classical naïve classifier to
differentiate the words through clustering.
Paitoon P. and Chayapol M. [5] used a sentiment compensation technique to
automatically assign a sentiment to a dimension when a consumer’s review mentions the
sentiment without a dimension. The results show that their proposed method outperforms the
sentiment to dimension (S2D) and dimension to sentiment (D2S) methods, with an overall
accuracy of 93.60%.
Mohammad F. A. B. and Retno K. [6] performed sentiment analysis using Latent Dirichlet Allocation
(LDA), which extracts the topics of documents, where each topic is represented by
words appearing with different topic probabilities. Because the LDA output is easier to interpret
when represented visually, they used topic polarity word cloud visualization.
3.8 Deep learning for sentiment analysis of movie reviews
Hadi P. and Saman G. [7] explored natural language methods to perform sentiment analysis.
They worked on both binary and multi-class classification. For the binary classification they
applied bag-of-words and skip-gram word2vec models followed by various classifiers,
including random forest, SVM, and logistic regression. For the multi-class case,
they implemented recursive neural tensor networks (RNTN).
CHAPTER 4
MACHINE LEARNING BASED APPROACHES
4.1 Introduction
In this chapter, the paper discusses the machine learning based approaches the research has
considered. This chapter explains the dataset, the preprocessing and the algorithms that have been
considered for sentiment analysis.
Recurrent Neural Networks (RNNs) are widely used for extracting features from text. An RNN is a
deep learning algorithm, and its Long Short Term Memory (LSTM) architecture is widely used for
remembering information across a sequence. An RNN is a type of neural network where the output
from the previous step is fed as input to the current step. In traditional neural networks, all the
inputs and outputs are independent of each other, but in tasks such as predicting
the next word of a sentence, the previous words are required, so the network needs to
remember them. RNNs address this issue with the
help of a hidden layer [9].
$h_t = f(C_{t-1}, X_t)$
where $h_t$ is the current state, $C_{t-1}$ is the previous state, and $X_t$ is the input state.
where $W_{hh}$ is the weight at the recurrent neuron and $W_{xh}$ is the weight at the input neuron.
$O_t = W_{hy} h_t$
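The recurrence and output equations above can be illustrated with a small, dependency-free sketch. It assumes a tanh activation for the state update, which is the common textbook choice but is not stated in the surviving text. The function names and the use of nested lists for the weight matrices are hypothetical and chosen only for illustration; this is not the Keras model used in the project.

```python
import math

def rnn_step(h_prev, x_t, W_hh, W_xh):
    """One recurrent step: h_t = tanh(W_hh * h_prev + W_xh * x_t).

    The tanh activation is an assumption made for this sketch.
    """
    h_t = []
    for i in range(len(W_hh)):
        recurrent = sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        from_input = sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        h_t.append(math.tanh(recurrent + from_input))
    return h_t

def rnn_output(h_t, W_hy):
    """Output O_t = W_hy * h_t."""
    return [sum(row[j] * h_t[j] for j in range(len(h_t))) for row in W_hy]

def run_rnn(inputs, W_hh, W_xh, W_hy):
    """Feed a sequence through the cell, starting from an all-zero state."""
    h = [0.0] * len(W_hh)
    outputs = []
    for x_t in inputs:
        h = rnn_step(h, x_t, W_hh, W_xh)
        outputs.append(rnn_output(h, W_hy))
    return h, outputs
```

Because each step receives the state produced by the previous step, information from earlier words in the sequence can influence later outputs.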
Python was used as the programming language. Keras provides a built-in dataset of IMDB movie
reviews. Other libraries used in the project are:
NumPy
pandas (DataFrame)
Matplotlib
In this model, binary cross entropy is used as the loss function and Adam as the
optimizer.
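To make the loss function concrete, the following is a minimal sketch of how binary cross entropy can be computed for a batch of predictions. It is a plain-Python illustration, not the Keras implementation; the function name and the clipping constant `eps` are assumptions introduced here.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy over a batch.

    y_true holds 0/1 labels; y_pred holds predicted probabilities.
    Predictions are clipped to [eps, 1 - eps] to avoid log(0);
    the eps value is an assumption of this sketch.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

The loss is small when the predicted probability is close to the true label and grows quickly as a confident prediction turns out to be wrong, which suits a two-class task such as positive versus negative reviews.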
CHAPTER 5
NAÏVE APPROACH
5.1 Introduction
In this chapter, the paper discusses naïve approaches to sentiment analysis. Unlike
machine learning based approaches, naïve approaches involve no learned intelligence; explicit
logic needs to be written to accomplish the task.
This research used tweets posted by different Twitter users in real time. The
proposed application gives a real-time polarity score for recently posted tweets. To
access tweets in real time, the project source code needs to access the Twitter API, which
requires a developer account. During the application process, the Twitter
developer team asks 4 questions about the intended use of the API. After the form is submitted, it
takes around 5-6 hours for the application to be confirmed. If the application is approved, Twitter's
developer team sends an approval email (Figure 4).
To access the data from source code, the API provides 4 secret keys:
ACCESS_TOKEN, ACCESS_TOKEN_SECRET,
CONSUMER_KEY and CONSUMER_SECRET. These key values are essential and must be
referenced in the project code to access Twitter’s data. The app can be created at the following
URL:
https://developer.twitter.com/en/apps
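One possible way to make the four keys available to the project code without hard-coding them is sketched below. Reading the values from environment variables is an assumption of this sketch (the project itself keeps them in a separate Python file), and the helper name `load_twitter_credentials` is hypothetical.

```python
import os

# The four key names listed above.
REQUIRED_KEYS = (
    "ACCESS_TOKEN",
    "ACCESS_TOKEN_SECRET",
    "CONSUMER_KEY",
    "CONSUMER_SECRET",
)

def load_twitter_credentials(environ=None):
    """Return the four credentials as a dict, failing early if any is missing.

    `environ` defaults to os.environ; passing a plain dict is convenient for testing.
    """
    env = os.environ if environ is None else environ
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing Twitter API credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

Failing early with the names of the missing keys makes a misconfigured setup easier to diagnose than an authentication error raised later by the API.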
5.3 Tools
Python is used as the primary programming language. Using the Twitter API from Python requires
a module called Tweepy. First, an app must be created on the Twitter developer page, where the
access tokens can be viewed. A snapshot of the token information from the app we created is shown in Figure 5.
5.4 Modules
1. StreamListener: A class from the Tweepy module that listens for tweets
matching certain keywords or hashtags.
2. OAuthHandler: This class is used for authenticating users based on the credentials
stored in the API keys.
3. Stream: Delivers real-time data from Twitter.
5.5 Getting Tweets in JSON format
This section uses two Python files: one for the API credential values and one containing
the logic to get real-time Twitter data for a few defined tags. The logic file contains
two classes, TwitterStreamer() and StdOutListener(). The first class
provides Twitter data in real time and filters it by the defined keywords. The
second class only prints the values in JSON format.
Id
Length
Date
Source
Likes
Retweets
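As an illustration of how the fields listed above might be pulled out of one tweet delivered as JSON, consider the sketch below. The mapping from the listed fields to the JSON attribute names (`id`, `text`, `created_at`, `source`, `favorite_count`, `retweet_count`) is an assumption based on the tweet object of the Twitter API v1.1, and the function name `summarize_tweet` is hypothetical.

```python
import json

def summarize_tweet(raw_json):
    """Extract a few fields from a single tweet given as a JSON string.

    The attribute names follow the Twitter API v1.1 tweet object; how they
    map to the fields listed in the text is an assumption of this sketch.
    """
    tweet = json.loads(raw_json)
    return {
        "Id": tweet["id"],
        "Length": len(tweet["text"]),
        "Date": tweet["created_at"],
        "Source": tweet["source"],
        "Likes": tweet["favorite_count"],
        "Retweets": tweet["retweet_count"],
    }
```

A list of such dictionaries can be passed directly to a pandas DataFrame to get one row per tweet.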
This section also contains a function that cleans Twitter data using Python regular
expressions (RegEx).
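A minimal sketch of such a cleaning step is given below. The exact rules are assumptions, since the text does not spell them out: here URLs, user mentions and any remaining non-alphanumeric characters are replaced by spaces, and repeated whitespace is collapsed.

```python
import re

# Patterns are assumptions for this sketch, not the project's exact rules.
URL = re.compile(r"\w+://\S+")            # e.g. https://t.co/abc
MENTION = re.compile(r"@\w+")             # e.g. @username
NON_ALNUM = re.compile(r"[^0-9A-Za-z \t]")  # punctuation, '#', emoji, ...

def clean_tweet(text):
    """Remove URLs, mentions and symbols, then collapse whitespace."""
    text = URL.sub(" ", text)        # URLs first, before ':' and '/' are stripped
    text = MENTION.sub(" ", text)
    text = NON_ALNUM.sub(" ", text)
    return " ".join(text.split())
```

URLs are removed before the symbol pattern runs; otherwise the punctuation inside a link would be stripped first and fragments of the address would remain in the text.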
extraction, classification, translation etc. [15]. It operates on Python strings. To get a polarity
score, the first step is tokenization, which is one of the basic tasks of NLP. After that, the
project performs noun phrase extraction. The sentiment analysis returns two properties, polarity and
subjectivity. Polarity is a float in the range [-1, 1], where 1 means a positive statement
and -1 means a negative statement. Subjective sentences generally refer to personal opinion,
emotion or judgment, whereas objective sentences refer to factual information. Subjectivity is also a
float, in the range [0, 1], where 0 is very objective and 1 is very subjective [16].
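Once polarity values are available, they can be turned into the positive, negative and neutral categories with a simple threshold at zero. The sketch below assumes the polarity floats have already been computed (for example from the `sentiment.polarity` property of a TextBlob object); the function names are hypothetical.

```python
from collections import Counter

def label_polarity(polarity):
    """Map a polarity in [-1, 1] to a category, using zero as the threshold."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

def tally_labels(polarities):
    """Count how many scores fall into each category."""
    return Counter(label_polarity(p) for p in polarities)
```

A threshold at exactly zero treats even weakly positive scores as positive; a small dead zone around zero is a possible alternative if many tweets score close to neutral.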
CHAPTER 6
LEXICON BASED APPROACH
6.1 Introduction
In this chapter, the paper explores the steps associated with the lexicon based approach. In this
approach, words are divided into two categories: positive words and negative words.
6.2 Dataset
In this experiment, the research is conducted on a state-of-the-art lexical resource called
SentiWordNet [10], which is designed to support sentiment analysis applications. It provides three
annotations for each labelled entry (positive, negative and neutral). SentiWordNet builds on
WordNet, a lexical database of English.
A lexicon-based approach to opinion mining relies on a resource in which each entry carries a
polarity. However, because natural language is complex, a
simple model may fail to extract the information from a sentence properly. For this reason,
this paper uses a fine-grained model that splits the text $T$ into smaller units
called micro-phrases. Let $m_i$ denote the $i$-th micro-phrase, and let $t_j$ denote the $j$-th
term within it, with sentiment score $score(t_j)$. The paper considers two formulations of this
representation, called Basic and Normalized [11].
In the basic formulation, the sentiment of the content is obtained by first summing the polarity of
each micro-phrase. Then, the score is normalized by the length of the whole content, $|T|$. In this case,
the micro-phrases are only used to invert the polarity when a negation is found in the text.
$$S_{basic}(T) = \sum_{i=1}^{n} \frac{POL_{basic}(m_i)}{|T|}$$
$$POL_{basic}(m_i) = \sum_{j=1}^{k} score(t_j)$$
In the normalized formulation, the micro-phrase-level scores are normalized by the length
of the single micro-phrase, in order to weight the micro-phrases differently according to their
length.
$$S_{norm}(T) = \sum_{i=1}^{n} POL_{norm}(m_i)$$
$$POL_{norm}(m_i) = \sum_{j=1}^{k} \frac{score(t_j)}{|m_i|}$$
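The two formulations can be compared with a short sketch. It makes several assumptions that go beyond the text: each micro-phrase is a list of terms, the lexicon is a plain dictionary from term to score (a stand-in for SentiWordNet), terms missing from the lexicon score 0, and $|T|$ is taken as the total number of terms in the content. Negation handling is left out for brevity, and all function names are hypothetical.

```python
def pol_basic(micro_phrase, lexicon):
    """POL_basic(m_i): sum of the scores of the terms in one micro-phrase."""
    return sum(lexicon.get(term, 0.0) for term in micro_phrase)

def s_basic(micro_phrases, lexicon):
    """S_basic(T): micro-phrase polarities summed and divided by |T|.

    |T| is assumed here to be the total number of terms in the content.
    """
    total_terms = sum(len(m) for m in micro_phrases)
    if total_terms == 0:
        return 0.0
    return sum(pol_basic(m, lexicon) for m in micro_phrases) / total_terms

def pol_norm(micro_phrase, lexicon):
    """POL_norm(m_i): term scores divided by the micro-phrase length |m_i|."""
    if not micro_phrase:
        return 0.0
    return sum(lexicon.get(term, 0.0) for term in micro_phrase) / len(micro_phrase)

def s_norm(micro_phrases, lexicon):
    """S_norm(T): sum of the length-normalized micro-phrase polarities."""
    return sum(pol_norm(m, lexicon) for m in micro_phrases)
```

For example, with a toy lexicon `{"good": 0.5, "great": 1.0}` and the micro-phrases `["good", "movie"]` and `["great"]`, the basic score is (0.5 + 1.0) / 3 = 0.5, while the normalized score is 0.5 / 2 + 1.0 / 1 = 1.25, so the short, strongly polarized micro-phrase carries more weight in the normalized formulation.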
6.5 Summary
In this chapter, the paper has discussed the lexicon-based approach for calculating sentiment
scores of content posted on online platforms. The chapter presented two mathematical
formulations for calculating the sentiment score.
CHAPTER 7
RESULT ANALYSIS
7.1 Introduction
In this chapter, the paper discusses the results obtained from the three methods of opinion
mining or sentiment analysis described above. The first is a machine learning based
approach, where a Recurrent Neural Network was used with its popular Long
Short Term Memory architecture. The second is the naïve approach, where the Twitter API was used to
calculate sentiment scores, and the last is the lexicon based approach.
For this research, a polarity score was generated from U.S. President Donald Trump’s Twitter
timeline. For simplicity, the project limits the number of posts. A sample of the output is
shown in Figure 7 (Output result of Naive approach).
CHAPTER 8
CONCLUSION
To sum up, opinion mining or sentiment analysis is now a major field of research. Many
researchers are trying to improve sentiment analysis accuracy. As the content on internet-based
platforms grows rapidly, automated analysis of these data is urgently needed. It is practically
impossible for humans to read all of this content and take appropriate steps for each item
separately. This is where automatic opinion mining or sentiment analysis is needed.
BIBLIOGRAPHY
[1] Apoorv A., Boyi X., Ilia V., and Owen R., "Sentiment Analysis of Twitter Data", no date
found.
[2] Woldemariam, Y. (2016). Sentiment analysis in a cross-media analysis framework. 2016
IEEE International Conference On Big Data Analysis (ICBDA). doi:
10.1109/icbda.2016.7509790.
[3] Fan, X., Li, X., Du, F., Li, X., & Wei, M. (2016). Apply word vectors for sentiment
analysis of APP reviews. 2016 3rd International Conference On Systems And
Informatics (ICSAI). doi: 10.1109/icsai.2016.7811108
[4] Vanaja, S., & Belwal, M. (2018). Aspect-Level Sentiment Analysis on E-Commerce
Data. 2018 International Conference On Inventive Research In Computing Applications
(ICIRCA). doi: 10.1109/icirca.2018.8597286.
[5] Porntrakoon, P., & Moemeng, C. (2018). Thai Sentiment Analysis for Consumer’s
Review in Multiple Dimensions Using Sentiment Compensation Technique
(SenseComp). 2018 15th International Conference On Electrical
Engineering/Electronics, Computer, Telecommunications And Information Technology
(ECTI-CON). doi: 10.1109/ecticon.2018.8619892
[6] Bashri, M., & Kusumaningrum, R. (2017). Sentiment analysis using Latent Dirichlet
Allocation and topic polarity wordcloud visualization. 2017 5th International
Conference On Information And Communication Technology (ICoICT). doi:
10.1109/icoict.2017.8074651
[7] Pouransari, H., & Ghili, S. Deep learning for sentiment analysis of movie reviews. Retrieved (2019) from https://cs224d.stanford.edu/reports/PouransariHadi.pdf.
[8] Rojas-Barahona, L. (2016). Deep learning for sentiment analysis. Language And
Linguistics Compass, 10(12), 701-719. doi: 10.1111/lnc3.12228
[9] Introduction to Recurrent Neural Network - GeeksforGeeks. (2019). Retrieved from
https://www.geeksforgeeks.org/introduction-to-recurrent-neural-network/
[10] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. SentiWordNet 3.0:
An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of
LREC, volume 10, pages 2200–2204, 2010.
[11] Cataldo G., Giovanni S., Marco P., "A comparison of lexicon-based approaches for
sentiment analysis of microblog", Proceedings of the 8th International Workshop on
Information Filtering and Retrieval.
[12] Preslav Nakov, Zornitsa Kozareva, Alan Ritter, Sara Rosenthal, Veselin Stoyanov, and
Theresa Wilson. Semeval-2013 task 2: Sentiment analysis in twitter. 2013.
[13] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant
supervision. CS224N Project Report, Stanford, pages 1–12, 2009.
[14] TextBlob: Simplified Text Processing — TextBlob 0.15.2 documentation. (2019).
Retrieved from https://textblob.readthedocs.io/en/dev/
[15] TextBlob, N. (2019). Natural Language Processing for Beginners: Using TextBlob.
Retrieved from https://www.analyticsvidhya.com/blog/2018/02/natural-language-
processing-for-beginners-using-textblob/
[16] Ding, J., Le, Z., Zhou, P., Wang, G., & Shu, W. (2009). An Opinion-Tree Based Flexible
Opinion Mining Model. 2009 International Conference On Web Information Systems
And Mining. doi: 10.1109/wism.2009.38
APPENDIX
CODE SAMPLES
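from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocabulary_size = 5000  # assumed value; not stated in the original listing
embedding_size = 32
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_size))  # assumed layer; embedding_size was otherwise unused
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # loss and optimizer from Chapter 4
model.summary()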
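import re
from textblob import TextBlob

class TweetAnalyzer():
    def clean_tweet(self, tweet):
        # Assumed cleaning rule (body missing in the original): drop mentions, symbols and URLs
        return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

    def analyze_sentiment(self, tweet):  # method name assumed; the original header was lost
        analysis = TextBlob(self.clean_tweet(tweet))
        if analysis.sentiment.polarity > 0:
            return 1
        elif analysis.sentiment.polarity == 0:
            return 0
        return -1  # negative polarity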