
# Sentiment Analysis on Tweets

## Dataset Information

We use and compare various methods for sentiment analysis on tweets (a binary classification problem). The training dataset is expected to be a csv file of type `tweet_id,sentiment,tweet`, where `tweet_id` is a unique integer identifying the tweet, `sentiment` is either `1` (positive) or `0` (negative), and `tweet` is the tweet enclosed in `""`. Similarly, the test dataset is a csv file of type `tweet_id,tweet`. Please note that csv headers are not expected and should be removed from the training and test datasets. Example rows are shown below.
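
For illustration, a training file might begin like this (the ids and tweets here are made up):

```
1001,1,"i love this weather"
1002,0,"worst monday ever"
```

and a test file like this:

```
2001,"just landed in new york"
2002,"the new update broke my phone"
```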

## Requirements

There are some general library requirements for the project and some that are specific to individual methods. The general requirements are as follows:
* `numpy`
* `scikit-learn`
* `scipy`
* `nltk`

The library requirements specific to some methods are:
* `keras` with `TensorFlow` backend for Logistic Regression, MLP, RNN (LSTM), and CNN.
* `xgboost` for XGBoost.

**Note**: It is recommended to use the Anaconda distribution of Python. The [report](https://github.com/abdulfatir/twitter-sentiment-analysis/tree/master/docs/report.pdf) for this project can be found in `docs/`.

## Usage

### Preprocessing

1. Run `preprocess.py <raw-csv-path>` on both train and test data. This will generate a preprocessed version of the dataset.
2. Run `stats.py <preprocessed-csv-path>` where `<preprocessed-csv-path>` is the path of the csv generated by `preprocess.py`. This gives general statistical information about the dataset and generates two pickle files containing the frequency distributions of unigrams and bigrams in the training dataset.

After the above steps, you should have four files in total: `<preprocessed-train-csv>`, `<preprocessed-test-csv>`, `<freqdist>`, and `<freqdist-bi>`, which are the preprocessed train dataset, the preprocessed test dataset, the frequency distribution of unigrams, and the frequency distribution of bigrams, respectively. An example run is shown below.
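
A hypothetical end-to-end run, assuming raw files named `train.csv` and `test.csv`; the names of the generated files depend on the scripts, so adjust the `stats.py` argument to whatever file `preprocess.py` actually writes:

```
python preprocess.py train.csv
python preprocess.py test.csv
python stats.py train-processed.csv
```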

For all the methods that follow, change the values of `TRAIN_PROCESSED_FILE`, `TEST_PROCESSED_FILE`, `FREQ_DIST_FILE`, and `BI_FREQ_DIST_FILE` to your own paths in the respective files. Wherever applicable, the values of `USE_BIGRAMS` and `FEAT_TYPE` can be changed to obtain results using different types of features, as described in the report.
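
For example, the configuration block at the top of a script might be edited like this (the paths below are placeholders, and the `FEAT_TYPE` value is hypothetical; consult the report for the supported feature types):

```python
# Paths to the files generated during preprocessing (placeholders).
TRAIN_PROCESSED_FILE = 'train-processed.csv'
TEST_PROCESSED_FILE = 'test-processed.csv'
FREQ_DIST_FILE = 'train-processed-freqdist.pkl'
BI_FREQ_DIST_FILE = 'train-processed-freqdist-bi.pkl'

# Feature options (where applicable).
USE_BIGRAMS = True
FEAT_TYPE = 'frequency'  # hypothetical value; see the report for valid options
```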

### Baseline
3. Run `baseline.py`. With `TRAIN = True` it will show the accuracy results on the training dataset.

### Naive Bayes
4. Run `naivebayes.py`. With `TRAIN = True` it will show the accuracy results on a 10% validation dataset.

### Maximum Entropy
5. Run `logistic.py` to run the logistic regression model, OR run `maxent-nltk.py <>` to run NLTK's MaxEnt model. With `TRAIN = True` it will show the accuracy results on a 10% validation dataset.

### Decision Tree
6. Run `decisiontree.py`. With `TRAIN = True` it will show the accuracy results on a 10% validation dataset.

### Random Forest
7. Run `randomforest.py`. With `TRAIN = True` it will show the accuracy results on a 10% validation dataset.

### XGBoost
8. Run `xgboost.py`. With `TRAIN = True` it will show the accuracy results on a 10% validation dataset.
49
50 ### SVM
51 9. Run `svm.py`. With `TRAIN = True` it will show the accuracy results on 10%
validation dataset.

### Multi-Layer Perceptron
10. Run `neuralnet.py`. This will validate using 10% of the data and save the best model to `best_mlp_model.h5`.

### Recurrent Neural Networks
11. Run `lstm.py`. This will validate using 10% of the data and save a model for each epoch in `./models/`. (Please make sure this directory exists before running `lstm.py`; a one-line way to create it is shown below.)
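
A quick, hypothetical way to make sure the directory exists (equivalently, just run `mkdir models`):

```python
import os

# Create the directory for saved models if it does not exist yet.
os.makedirs('./models', exist_ok=True)
```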

### Convolutional Neural Networks
12. Run `cnn.py`. This will run the 4-Conv-NN (4 conv layers neural network) model as described in the report. To run other versions of the CNN, simply comment out or remove the lines where Conv layers are added. This will validate using 10% of the data and save a model for each epoch in `./models/`. (Please make sure this directory exists before running `cnn.py`.) A rough sketch of this kind of architecture is shown below.
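
For orientation, here is a minimal Keras sketch of what a 4-conv-layer architecture of this kind might look like. The vocabulary size, sequence length, and layer sizes below are illustrative assumptions, not the repository's actual configuration (see the report for that):

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, Flatten, Dense, Dropout

# All sizes below are illustrative assumptions.
VOCAB_SIZE = 90000   # number of words kept in the vocabulary
MAX_LEN = 40         # padded tweet length
EMBED_DIM = 200      # embedding dimension (GloVe-seeded in this project)

model = Sequential()
model.add(Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN))
model.add(Dropout(0.4))
# Four stacked convolution layers; commenting out layers here is how the
# smaller CNN variants mentioned above would be obtained.
model.add(Conv1D(600, 3, padding='same', activation='relu'))
model.add(Conv1D(300, 3, padding='same', activation='relu'))
model.add(Conv1D(150, 3, padding='same', activation='relu'))
model.add(Conv1D(75, 3, padding='same', activation='relu'))
model.add(Flatten())
model.add(Dense(600, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))  # binary sentiment output
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```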

### Majority Vote Ensemble
13. To extract penultimate-layer features for the training dataset, run `extract-cnn-feats.py <saved-model>`. This will generate 3 files: `train-feats.npy`, `train-labels.txt`, and `test-feats.npy`.
14. Run `cnn-feats-svm.py`, which uses the files from the previous step to perform SVM classification on features extracted from the CNN model.
15. Place all prediction CSV files for which you want to take the majority vote in `./results/` and run `majority-voting.py`. This will generate `majority-voting.csv`. A minimal sketch of the voting step is shown after this list.
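
A minimal sketch of majority voting over prediction files, assuming each CSV holds numeric `id,prediction` rows with 0/1 labels (the actual column layout handled by `majority-voting.py` may differ):

```python
import glob
import numpy as np

# Load every prediction file: one column of ids, one column of 0/1 labels.
files = sorted(glob.glob('./results/*.csv'))
preds = [np.loadtxt(f, delimiter=',', dtype=int) for f in files]

ids = preds[0][:, 0]
votes = np.stack([p[:, 1] for p in preds], axis=1)  # (n_tweets, n_models)

# A tweet is labeled positive when a strict majority of the models say so.
majority = (votes.sum(axis=1) * 2 > votes.shape[1]).astype(int)

np.savetxt('majority-voting.csv', np.column_stack([ids, majority]),
           fmt='%d', delimiter=',')
```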

## Information about other files

* `dataset/positive-words.txt`: List of positive words.
* `dataset/negative-words.txt`: List of negative words.
* `dataset/glove-seeds.txt`: GloVe word vectors from StanfordNLP which match our dataset, used for seeding word embeddings (see the sketch after this list).
* `Plots.ipynb`: IPython notebook used to generate the plots present in the report.
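
A minimal sketch of how such a file can be used to seed an embedding matrix, assuming the usual plain-text GloVe layout (`word value value ...` per line) and a hypothetical `vocab` dict mapping words to row indices:

```python
import numpy as np

EMBED_DIM = 200  # assumed embedding dimension

def build_embedding_matrix(glove_path, vocab, dim=EMBED_DIM):
    # Rows default to small random values for words missing from GloVe.
    matrix = np.random.uniform(-0.05, 0.05, (len(vocab) + 1, dim))
    with open(glove_path) as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word, vec = parts[0], parts[1:]
            if word in vocab and len(vec) == dim:
                matrix[vocab[word]] = np.asarray(vec, dtype='float32')
    return matrix

# Hypothetical usage: pass the matrix as the initial weights of a Keras
# Embedding layer, e.g.
# Embedding(len(vocab) + 1, EMBED_DIM,
#           weights=[build_embedding_matrix('dataset/glove-seeds.txt', vocab)],
#           input_length=MAX_LEN)
```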