Practical 9 - Text Mining
This practical is about text mining using the movie review dataset available from Cornell
University. In this practical, we will further explore machine learning techniques by
performing a simple sentiment classification task on a text dataset using WEKA. The aim of
this task is to classify a review as either positive or negative (binary classification). We will
first briefly introduce the dataset, then import the dataset into WEKA, and follow with
standard text preprocessing tasks such as tokenization, stop-word removal, stemming, and
attribute selection.
References
Potts, Christopher. 2011. On the negativity of negation. In Nan Li and David Lutz,
eds., Proceedings of Semantics and Linguistic Theory 20, 636-659.
Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment
classification using machine learning techniques. In Proceedings of EMNLP 2002.
Pang, Bo, and Lillian Lee. 2004. A sentimental education: Sentiment analysis using
subjectivity summarization based on minimum cuts. In Proceedings of ACL 2004.
Pang, Bo, and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment
categorization with respect to rating scales. In Proceedings of ACL 2005.
Major Steps
1. First of all, download the textming.zip file from LearnJCU and unzip it on your local
computer. The archive contains two files and one directory.
File: poldata.README.2.0
o A readme file that describes the dataset
File: stop_word_list.txt
o A list of 257 stop words
Directory: txt_sentoken
o A directory containing two subdirectories, neg and pos, each holding 1000 instances
Please read the readme file, and open stop_word_list.txt to browse the list of stop
words. Finally, open some txt files from the neg and pos directories to see what the
data look like.
Fig. 3 Activating file loader tool and loading dataset using TextDirectoryLoader
Fig. 4 Activating directory loader tool
By using this loader, WEKA automatically creates a relation with 2 attributes: the first one
contains the text data and the second contains the document class, as determined by the sub-
directory containing the file. As expected, you will get a relation containing 2000 instances
and two attributes (text and class). The histogram in Figure 5 shows the uniform distribution
of the review classes (blue for negative and red for positive).
Fig. 5 Imported dataset and class distribution
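For those who prefer to script this step, the sketch below shows how the same import could
be done with WEKA's Java API instead of the Explorer GUI. This is a minimal sketch, not
part of the practical itself; the path to txt_sentoken is an assumption about where you
unzipped the archive.

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.TextDirectoryLoader;

    public class LoadReviews {
        public static void main(String[] args) throws Exception {
            TextDirectoryLoader loader = new TextDirectoryLoader();
            // Each subdirectory (neg, pos) becomes a class label.
            loader.setDirectory(new File("txt_sentoken"));  // assumed local path
            Instances data = loader.getDataSet();
            // Expect 2000 instances and 2 attributes (text and class).
            System.out.println(data.numInstances() + " instances, "
                    + data.numAttributes() + " attributes");
        }
    }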
Once we have loaded the dataset, we move on to text preprocessing and feature extraction
in WEKA. The main purpose of preprocessing is to convert documents into vectors that
machine learning algorithms can compute with.
To conduct text preprocessing in WEKA, we use the Filter tool. Simply select the
StringToWordVector filter from the package weka.filters.unsupervised.attribute. This filter
allows you to configure the different stages of the transformation process (Fig. 6). In this
practical, we will focus on only three configurations: stemmer, stopwordHandler and
tokenizer (Fig. 7). P.S. If you finish the entire practical task early, you are welcome to try
altering the other options, since preprocessing will significantly affect the final classification
performance.
Fig. 6 Selecting StringToWordVector filter from the package weka.filters.unsupervised.attribute
Now we configure the stemmer, stopwordHandler and tokenizer. For stemming, choose the
SnowballStemmer package (Fig. 8). Note: stemming is an approximation of lemmatization
that removes the suffix of a word based on standard grammatical rules. For instance, the
word ‘walking’ is stemmed to ‘walk’.
For stopwordHandler, first choose the WordsFromFile package and then load the file
stop_word_list.txt provided along with the dataset (Fig. 9). This stop word list contains 257
common English words. Note: the purpose of stop word removal is to reduce dimensionality
by eliminating words that appear frequently but have little influence on the sentiment being
analysed.
For the tokenizer, choose the NGramTokenizer package (Fig. 10). Note: tokenization aims at
separating documents into word or phrase tokens that can easily be transformed into
numerical vectors. WEKA allows you to alter the properties of NGramTokenizer by
specifying MinSize and MaxSize. Just keep the default settings (1 for MinSize and 3 for
MaxSize) at this stage.
Fig. 8 Choosing SnowballStemmer package
Fig. 9 Using stop_word_list.txt file for stopwordHandler
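If you would rather configure the filter in code, here is a minimal sketch of the same three
settings using WEKA's Java API. It assumes a recent WEKA 3.8 release (where
StringToWordVector exposes setStopwordsHandler) and that the Snowball stemmer classes
are on the classpath.

    import java.io.File;
    import weka.core.Instances;
    import weka.core.stemmers.SnowballStemmer;
    import weka.core.stopwords.WordsFromFile;
    import weka.core.tokenizers.NGramTokenizer;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class Preprocess {
        public static Instances toVectors(Instances data) throws Exception {
            StringToWordVector filter = new StringToWordVector();

            // Stemmer: Snowball, as chosen in Fig. 8.
            filter.setStemmer(new SnowballStemmer());

            // Stop words: the 257-word list shipped with the dataset (Fig. 9).
            WordsFromFile stopwords = new WordsFromFile();
            stopwords.setStopwords(new File("stop_word_list.txt"));
            filter.setStopwordsHandler(stopwords);

            // Tokenizer: unigrams up to trigrams, the defaults noted above.
            NGramTokenizer tokenizer = new NGramTokenizer();
            tokenizer.setNGramMinSize(1);
            tokenizer.setNGramMaxSize(3);
            filter.setTokenizer(tokenizer);

            filter.setInputFormat(data);
            return Filter.useFilter(data, filter);
        }
    }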
The last preprocessing step we need is attribute selection, because eliminating poorly
characterizing attributes can yield better classification accuracy. For this task,
WEKA provides the AttributeSelection filter from the weka.filters.supervised.attribute
package. The filter allows you to choose an attribute evaluation method and a search strategy
(Fig. 12). The default evaluation method is CfsSubsetEval (Correlation-based feature subset
selection). This method works by choosing attributes that are highly correlated with the class
attribute while having a low correlation with each other. After applying the
AttributeSelection filter, we obtain the result shown in Figure 13. You should see a
great decrease in the number of remaining attributes.
Fig. 12 Choosing AttributeSelection filter
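Programmatically, the same step might look like the sketch below; it is an illustration rather
than part of the practical. It pairs CfsSubsetEval with the BestFirst search strategy that
WEKA selects by default, and assumes the class index has already been set on the data.

    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.AttributeSelection;

    public class SelectAttributes {
        public static Instances reduce(Instances data) throws Exception {
            AttributeSelection filter = new AttributeSelection();
            filter.setEvaluator(new CfsSubsetEval());  // default evaluation method
            filter.setSearch(new BestFirst());         // default search strategy
            filter.setInputFormat(data);               // class index must be set first
            return Filter.useFilter(data, filter);
        }
    }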
1. If you click Edit to see how the review data have been preprocessed, it will look as below.
You can see that the text data are now in vector format.
2. To begin, select the Classify tab (Fig. 1). By default, WEKA uses 10-fold cross-validation
as the validation option, which is appropriate for this task given the size of the dataset.
Fig. 1 Beginning classification process
Some of the classification algorithms best suited to sentiment analysis include Naïve
Bayes (NB), Support Vector Machine (SVM), k-Nearest Neighbour (kNN), and Decision Tree
(DT). In WEKA, the Naive Bayes classifier is implemented in the NaiveBayes
component from the weka.classifiers.bayes package (Fig. 2). Note: just keep all NB
properties at their default settings. Figure 3 shows the result.
Fig. 2 Choosing Naïve Bayes classifier
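For reference, the same evaluation can be scripted with WEKA's Evaluation class. The
sketch below assumes data is the preprocessed dataset with its class index already set; it
mirrors the 10-fold cross-validation run in the Classify tab.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;

    public class RunNB {
        public static void evaluate(Instances data) throws Exception {
            NaiveBayes nb = new NaiveBayes();  // all properties left at defaults
            Evaluation eval = new Evaluation(data);
            // 10-fold cross-validation, matching the Explorer setting.
            eval.crossValidateModel(nb, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.printf("Accuracy: %.2f%%%n", eval.pctCorrect());
        }
    }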
5. The last one is a Decision Tree classifier. We will use J48 in WEKA, which is an
implementation of the C4.5 algorithm for building decision trees. J48 has the full name
weka.classifiers.trees.J48 (Fig. 10). The classifier is shown in the text box next to the
Choose button: it reads J48 -C 0.25 -M 2. This text gives the default parameter settings
for this classifier. C4.5 has several parameters, but the default display (when you
invoke the classifier) only shows -C, the confidence value (default 0.25), where lower
values incur heavier pruning, and -M, the minimum number of instances in the two most
popular branches (default 2). P.S. For your reference, the full set of J48 parameter settings
is explained here: http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/J48.html.
We won't change any default settings at this time. Fig. 11 shows the J48 result.
Fig. 10 Choosing J48 from weka.classifiers.trees package
Fig. 11 Classification result of J48 classifier
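In code, the two displayed parameters correspond to typed setters on the J48 class. The
fragment below is a sketch showing both ways of reproducing the default J48 -C 0.25 -M 2
configuration; the option-string form and the typed setters are equivalent, so in practice you
would use one or the other.

    import weka.classifiers.trees.J48;

    public class ConfigureJ48 {
        public static J48 defaultTree() throws Exception {
            J48 tree = new J48();
            // Option-string form, mirroring the text next to the Choose button:
            tree.setOptions(new String[] {"-C", "0.25", "-M", "2"});
            // Equivalent typed setters:
            tree.setConfidenceFactor(0.25f);  // -C: lower values incur heavier pruning
            tree.setMinNumObj(2);             // -M: minimum number of instances
            return tree;
        }
    }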
For this task, LibSVM achieves the highest accuracy rate of 79.75%, which is consistent with
most recent research on sentiment analysis. Note that this comparison table was
produced using all three text preprocessing steps: stemming, stop-word removal and attribute
selection. However, as we discussed previously, preprocessing greatly influences the
performance of classification. So what if we used only some of these steps, or none of them?
Would the result be better or worse? Give it a try yourself and compare the results to find
the best performance.
#  Classifier    Accuracy
1  Naïve Bayes   79.35 %
2  LibSVM        79.75 %
3  IBk           70.85 %
4  J48           72.25 %
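When experimenting with different preprocessing combinations, it may save time to script
the comparison rather than re-run each classifier by hand. The sketch below loops over the
WEKA classifiers used above (LibSVM is omitted here because it needs the external LibSVM
package) and prints each cross-validated accuracy; data is assumed to be the preprocessed
dataset with its class index set.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class CompareClassifiers {
        public static void compare(Instances data) throws Exception {
            Classifier[] models = { new NaiveBayes(), new IBk(), new J48() };
            for (Classifier model : models) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(model, data, 10, new Random(1));
                System.out.printf("%-12s %.2f%%%n",
                        model.getClass().getSimpleName(), eval.pctCorrect());
            }
        }
    }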