Practical 9 - Text Mining



This practical is about text mining using the movie review dataset available from Cornell
University. In this practical, we will further explore machine learning techniques by
performing a simple sentiment classification task on a text dataset using WEKA. The aim of
this task is to classify a review as either positive or negative (binary classification). We will
first briefly introduce the dataset, then import it into WEKA, and follow with standard text
preprocessing tasks such as tokenization, stop-word removal, stemming, and attribute
selection.

The movie review dataset


This dataset folder contains two sub-directories. The first contains 2000 user-created movie
reviews known as the “Sentiment Polarity Dataset version 2.0”
(http://www.cs.cornell.edu/People/pabo/movie-review-data). The reviews are equally
partitioned into a positive set and a negative set (1000 + 1000).
Each review consists of a plain text file (.txt) and a class label representing the overall user
opinion. The class attribute has only two values: pos (positive) or neg (negative). In the
labelled sets, a negative review has a score <= 4 out of 10 and a positive review has a score
>= 7 out of 10, so reviews with more neutral ratings are not included in the train/test sets.
The second sub-directory contains an additional 500 unlabelled reviews prepared for
unsupervised learning experiments; reviews of any rating are included, with an even split
between reviews rated > 5 and those rated <= 5.

References

• Potts, Christopher. 2011. On the negativity of negation. In Nan Li and David Lutz,
eds., Proceedings of Semantics and Linguistic Theory 20, 636-659.
• Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment
Classification using Machine Learning Techniques. Proceedings of EMNLP 2002.
• Pang, Bo, and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis Using
Subjectivity Summarization Based on Minimum Cuts. Proceedings of ACL 2004.
• Pang, Bo, and Lillian Lee. 2005. Seeing Stars: Exploiting Class Relationships for
Sentiment Categorization with Respect to Rating Scales. Proceedings of ACL 2005.

Major Steps
1. First of all, download the textming.zip file from LearnJCU and unzip it on your local
computer. You will find two files and one directory:
• File: poldata.README.2.0
o A readme file that describes the dataset
• File: stop_word_list.txt
o A list of 257 stop words
• Directory: txt_sentoken
o A directory with two subdirectories, neg and pos, each containing
1000 instances
Please read the readme file, and open stop_word_list.txt to browse the list of stop
words. Finally, open a few .txt files from the neg and pos directories to see what the
data look like.

2. Start WEKA Explorer

Fig. 2 Error window displayed when opening the file directly


Load the data into WEKA. First, click Open file and select the dataset directory. Note that an
error window will pop up (Fig. 2). Don’t worry: simply click OK, and a file loader tool for
special formats is activated. Choose TextDirectoryLoader, a component provided by WEKA
for importing textual datasets.

Fig. 3 Activating file loader tool and loading dataset using TextDirectoryLoader
Fig. 4 Activating directory loader tool

Specify the directory where the dataset is stored as shown in Fig. 4.

By using this loader, WEKA automatically creates a relation with two attributes: the first
contains the text data and the second contains the document class, as determined by the sub-
directory containing the file. As expected, you will get a relation containing 2000 instances
and two attributes (text and class). The histogram in Figure 5 shows the uniform distribution
of the review classes (blue for negative and red for positive). A scripted version of this
import follows Fig. 5.
Fig. 5 Imported dataset and class distribution
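
The same import can be scripted with WEKA's Java API; a minimal sketch, assuming the
WEKA jar is on the classpath and the unzipped txt_sentoken directory (an illustrative path)
is in the working directory:

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;

public class LoadReviews {
    public static void main(String[] args) throws Exception {
        // The loader derives the class label from each file's sub-directory
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("txt_sentoken")); // contains neg/ and pos/

        Instances data = loader.getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // the class attribute is last

        // Expect 2000 instances and two attributes (text and class)
        System.out.println(data.numInstances() + " instances, "
                + data.numAttributes() + " attributes");
    }
}
```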

Once we have loaded the dataset, we move on to text preprocessing and feature extraction in
WEKA. The main purpose of preprocessing is to convert documents into vectors that
machine learning algorithms can compute with.

To conduct text preprocessing in WEKA, we use the Filter tool. Simply select the
StringToWordVector filter from the package weka.filters.unsupervised.attribute. This filter
allows you to configure the different stages of the transformation process (Fig. 6). In this
practical, we will focus only on three configurations: stemmer, stopwordHandler and
tokenizer (Fig. 7). P.S. If you finish the entire practical task early, you are welcome to try
altering the other options, since preprocessing significantly affects the final classification
performance.
Fig. 6 Selecting StringToWordVector filter from the package weka.filters.unsupervised.attribute

Fig. 7 StringToWordVector filter configuration


During the transformation, we apply a weighting scheme that evaluates features according to
their occurrence in a document. For instance, a boolean scheme models features based on
their presence. By default, the text retrieval model used by the StringToWordVector filter is
boolean (0 or 1), meaning each review is represented by an n-dimensional boolean vector,
where n is the size of the vocabulary and each component models the presence or absence of
a feature term in the document. Note: you can change the weighting scheme to the term
frequency and inverse document frequency (TF-IDF) model by setting IDFTransform and
TFTransform to True. Give it a try once you have finished and compare the results with the
boolean model; a scripted version of this switch is sketched below.
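
A sketch of that switch in the Java API, where data is the relation loaded above; enabling
outputWordCounts (an extra setting beyond the GUI steps) makes the filter emit counts
rather than 0/1 presence so the TF transform has something to work on:

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TfIdfVectors {
    public static Instances toTfIdf(Instances data) throws Exception {
        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(true); // counts instead of boolean presence
        filter.setTFTransform(true);      // log(1 + term frequency)
        filter.setIDFTransform(true);     // scale by inverse document frequency
        filter.setInputFormat(data);
        return Filter.useFilter(data, filter);
    }
}
```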


Now we configure the stemmer, stopwordHandler and tokenizer. For stemming, choose the
SnowballStemmer package (Fig. 8). Note: stemming is an approximation of lemmatization
that removes the suffix of a word based on standard grammatical rules. For instance, the
word ‘walking’ is stemmed to ‘walk’.

For stopwordHandler, first choose the WordsFromFile package and then load the file
stop_word_list.txt provided along with the dataset (Fig. 9). This stop word list contains 257
common English words. Note: the purpose of stop word removal is to reduce dimensionality
by eliminating words that appear frequently but have limited influence on sentiment
analysis.

For the tokenizer, choose the NGramTokenizer package (Fig. 10). Note: tokenization aims at
separating documents into word or phrase tokens that can then be transformed into
numerical vectors. WEKA allows you to alter the properties of NGramTokenizer by
specifying a minimum and maximum n-gram size. Keep the default settings (1 for the
minimum and 3 for the maximum) at this stage. The three settings are scripted together in
the sketch after Fig. 10.
Fig. 8 Choosing SnowballStemmer package
Fig. 9 Using stop_word_list.txt file for stopwordHandler

Fig. 10 Choosing NGramTokenizer package
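
A sketch of the same configuration in code, assuming the weka.core.stemmers.SnowballStemmer
wrapper is usable (some WEKA versions need the Snowball classes installed separately) and
that stop_word_list.txt sits in the working directory:

```java
import java.io.File;
import weka.core.stemmers.SnowballStemmer;
import weka.core.stopwords.WordsFromFile;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class PreprocessingFilter {
    public static StringToWordVector build() {
        StringToWordVector filter = new StringToWordVector();

        // Stemming: Snowball (English stemmer by default)
        filter.setStemmer(new SnowballStemmer());

        // Stop-word removal: the 257-word list shipped with the practical
        WordsFromFile stopwords = new WordsFromFile();
        stopwords.setStopwords(new File("stop_word_list.txt"));
        filter.setStopwordsHandler(stopwords);

        // Tokenization: word n-grams of size 1 up to 3
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(3);
        filter.setTokenizer(tokenizer);

        return filter;
    }
}
```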


After applying the StringToWordVector filter, we get the result shown in Figure 11. Since we
have chosen a maximum n-gram size of 3, we get a relation containing 1184 binary attributes
composed of single terms and phrases of up to three words. Based on the histogram, it seems
that the word ‘boring’ appears more frequently in negative reviews than in positive ones.

Fig. 11 Attributes after applying filter

The last preprocessing step we need is attribute selection, because eliminating poorly
characterizing attributes can improve classification accuracy. For this task, WEKA provides
the AttributeSelection filter from the weka.filters.supervised.attribute package. The filter
allows you to choose an attribute evaluation method and a search strategy (Fig. 12). The
default evaluation method is CfsSubsetEval (Correlation-based feature subset selection).
This method works by choosing attributes that are highly correlated with the class attribute
while having a low correlation with each other. After applying the AttributeSelection filter,
we obtain the result shown in Figure 13. You should see a large decrease in the number of
remaining attributes. A scripted version follows Fig. 13.
Fig. 12 Choosing AttributeSelection filter

Fig. 13 Reduced attributes after applying AttributeSelection filter
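
This step can also be scripted; a sketch pairing CfsSubsetEval with the default BestFirst
search, where vectors is the word-vector relation with its class attribute set:

```java
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class ReduceAttributes {
    // 'vectors' must already have its class attribute set
    public static Instances reduce(Instances vectors) throws Exception {
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval()); // correlation-based subset evaluation
        selector.setSearch(new BestFirst());        // default search strategy
        selector.setInputFormat(vectors);
        return Filter.useFilter(vectors, selector);
    }
}
```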


Above, we have performed basic text preprocessing and feature extraction on the raw review
dataset. Now we can apply classification algorithms to the vector-represented text dataset.
Classification is a supervised learning task that learns from the attributes of a labelled
training set and then assigns class labels to an unlabelled test set. We will introduce, in turn,
a few classification algorithms already implemented in WEKA and evaluate their
performance.

1. If you click Edit to see how the review data have been preprocessed, they will look as below.

You can see that the text data are now in vector format.

2. To begin with, select the Classify tab (Fig. 1). By default, WEKA uses 10-fold cross-validation
as the validation option. It is an appropriate option for this task given the size of the dataset.
Fig. 1 Beginning classification process

Some of the most widely adopted classification algorithms for sentiment analysis include
Naïve Bayes (NB), Support Vector Machine (SVM), k-Nearest Neighbour (kNN), and Decision
Tree (DT) algorithms. In WEKA, the Naïve Bayes classifier is implemented in the NaiveBayes
component from the weka.classifiers.bayes package (Fig. 2). Note: just keep all properties
at their default settings for NB. Figure 3 shows the result; a scripted version of this
evaluation is sketched after Fig. 3.
Fig. 2 Choosing Naïve Bayes classifier

Fig. 3 Classification result of Naïve Bayes
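
The GUI run can be reproduced in code; a minimal sketch, assuming vectors is the fully
preprocessed relation with its class attribute set (the seed value 1 is arbitrary):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

public class EvaluateNaiveBayes {
    // 'vectors' is the preprocessed dataset with the class attribute set
    public static void run(Instances vectors) throws Exception {
        NaiveBayes nb = new NaiveBayes(); // default settings, as in the GUI
        Evaluation eval = new Evaluation(vectors);
        eval.crossValidateModel(nb, vectors, 10, new Random(1)); // 10-fold CV
        System.out.println(eval.toSummaryString());
        System.out.printf("Accuracy: %.2f%%%n", eval.pctCorrect());
    }
}
```

The same Evaluation scaffolding works for every classifier in the following steps; only the
classifier object changes.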


3. WEKA implements a wrapper class for the LIBSVM library, named LibSVM, under the
   weka.classifiers.functions package (Fig. 4). After selecting the LibSVM classifier, we need
   to modify its property settings by clicking on it. For the kernelType property, we choose
   linear kernel instead of the radial basis function (the default): since our high-dimensional
   text data are close to linearly separable, a linear hyperplane is enough to handle the
   classification (Fig. 5). Note: the cost and weight properties are two other frequently
   adjusted parameters for C-SVC, representing the penalty parameter C and the per-class
   weighting respectively; the effective penalty for a class is weight * C. P.S. If you would
   like to know more details about LIBSVM, please refer to
   https://www.csie.ntu.edu.tw/~cjlin/libsvm/. Figure 6 shows the LibSVM results; the
   kernel change is scripted after Fig. 6.

Fig. 4 Choosing LibSVM classifier


Fig. 5 Adjusting kernelType to linear kernel
Fig. 6 Classification result of LibSVM
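
Scripted, the kernel change looks like the sketch below; this assumes the LibSVM wrapper
package is installed via the WEKA package manager (it is not part of core WEKA), and the
constants come from that wrapper:

```java
import weka.classifiers.functions.LibSVM;
import weka.core.SelectedTag;

public class LinearSvm {
    public static LibSVM build() {
        LibSVM svm = new LibSVM();
        // Replace the default RBF kernel with a linear kernel
        svm.setKernelType(new SelectedTag(LibSVM.KERNELTYPE_LINEAR,
                LibSVM.TAGS_KERNELTYPE));
        svm.setCost(1.0); // C-SVC penalty parameter C; pair with class weights if needed
        return svm;
    }
}
```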

4. In WEKA, the k-NN classifier is implemented in the weka.classifiers.lazy.IBk component
   (Fig. 7). The k-nearest neighbours classifier assigns to an instance the class that is
   prevalent among its k nearest instances. For this to be possible, a distance function
   between the instances must be defined. We will use the default LinearNNSearch with
   Euclidean distance as the distance measure. The parameter k determines the number of
   neighbours considered when classifying an instance. By default, it is set to 1; we will
   change it to 3 to reduce the influence of noise (Fig. 8). Note: you could also try other
   values of k to see the difference. Figure 9 shows the 3-NN result; the one-line change is
   scripted after Fig. 9.
Fig. 7 Choosing IBk classifier from weka.classifiers.lazy package
Fig. 8 Altering number of k to 3
Fig. 9 Classification result of IBk
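
In the Java API the same change is a single setter; a sketch:

```java
import weka.classifiers.lazy.IBk;

public class ThreeNN {
    public static IBk build() {
        IBk knn = new IBk();
        knn.setKNN(3); // consider the 3 nearest neighbours instead of the default 1
        // The default search is LinearNNSearch with Euclidean distance
        return knn;
    }
}
```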

5. The last one is a Decision Tree classifier. We will use J48, WEKA's implementation of the
   C4.5 algorithm for building decision trees, whose full name is
   weka.classifiers.trees.J48 (Fig. 10). The classifier is shown in the text box next to the
   Choose button: it reads J48 -C 0.25 -M 2. This text gives the default parameter settings
   for this classifier. C4.5 has several parameters, but by default the text box only shows
   -C, the confidence value (default 25%; lower values incur heavier pruning), and -M, the
   minimum number of instances in the two most popular branches (default 2). P.S. For
   your reference, the full set of J48 parameter settings is explained here:
   http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/J48.html.
   We won't change any default settings this time. Fig. 11 shows the J48 result; the option
   string is scripted after Fig. 11.
Fig. 10 Choosing J48 from weka.classifiers.trees package
Fig. 11 Classification result of J48 classifier
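
The default option string shown next to the Choose button can be set explicitly in code; a
sketch:

```java
import weka.classifiers.trees.J48;

public class DefaultJ48 {
    public static J48 build() throws Exception {
        J48 tree = new J48();
        // Equivalent to the GUI's default "J48 -C 0.25 -M 2"
        tree.setOptions(new String[] {"-C", "0.25", "-M", "2"});
        return tree;
    }
}
```

Equivalently, tree.setConfidenceFactor(0.25f) and tree.setMinNumObj(2) set the same two
parameters through typed setters.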

6. Classification Results and Conclusion:


#   Classifier     Accuracy
1   Naïve Bayes    79.6%
2   LibSVM         80.6%
3   IBk            71.9%
4   J48            72.75%

For this task, LibSVM achieves the highest accuracy rate of 80.6%, which is consistent with
most recent research on sentiment analysis. Note that this comparison was conducted using
all three text preprocessing steps: stemming, stop-word removal and attribute selection.
However, as discussed previously, preprocessing greatly influences classification
performance. So what if we used only some of these steps, or none of them? Would the
results be better or worse? Try it yourself and compare the results to search for the best
performance.

#   Classifier     Accuracy
1   Naïve Bayes    79.35%
2   LibSVM         79.75%
3   IBk            70.85%
4   J48            72.25%

Again, LibSVM has the highest classification accuracy.
