Sentiment Analysis of Twitter and Stock Market News: ENGR 400: Applied Machine Learning Fall 2020

Sentiment Analysis of Twitter and Stock Market News

ENGR 400: Applied Machine Learning


Fall 2020

Wyatt C. Steen
Department of Mechanical Engineering
University of Louisiana at Lafayette
Lafayette, LA 70504
C00304901@louisiana.edu

1 Project Description

Algorithmic stock trading typically relies on technical and quantitative analysis, as these methods are simpler
and more reliable to program. However, these methods are best suited to day-trading and swing-trading
strategies, which are most profitable in the context of high-frequency algorithmic trading. The capital
requirement to bypass day-trading limits is $25,000. Independent investors often do not have this much
capital, which makes high-frequency trading tactics unrealistic for them. In addition, technical and
quantitative analysis methods omit a vital factor in stock price: public opinion. Fundamental analysis
relies heavily on public opinion and offers the possibility of long-term trading strategies. However, to
incorporate fundamental analysis techniques into an independent investor's portfolio, it is essential to
evaluate public opinion of the stock market or of the stock in question. Sentiment analysis, in this context,
is the process of evaluating public opinion by studying news and social media discussion of a stock. It is
time-consuming for an investor to read all current literature regarding a single stock, and impossible to read
all current literature on all stocks. The goal of this project is to train a classification model to accurately
label tweets and news articles that mention a keyword as positive, negative, or neutral.

1.1 Datasets

There are many open-source data sets available that classify sentiment words in a number of different ways.
In this project, the data used will label words categorically as positive, negative, or neutral. Two
data sets will be used. The first is the Lexicoder Sentiment Dictionary [1], a set of English words labeled
by sentiment. Because the tweets and news articles retrieved through the respective APIs may be written in
other languages, dictionaries for other languages must also be incorporated. The Sentiment Lexicons for 81
Languages data set [2] provides similarly labeled data in 81 different languages. Additionally, a data set of
tweets [3] mentioning Apple will be used for supplementary testing.
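
As a rough illustration of how the English and multilingual lexicons might be combined into a single lookup table, the Python sketch below merges word-to-label dictionaries. It is not the MATLAB workflow planned for this project. The miniature lexicons, the function name merge_lexicons, and the rule that conflicting labels become neutral are all assumptions made for the example; the real word lists come from [1] and [2].

```python
# Hypothetical miniature lexicons; the real word lists come from [1] and [2].
ENGLISH = {"gain": "positive", "loss": "negative"}
SPANISH = {"ganancia": "positive", "pérdida": "negative"}


def merge_lexicons(*lexicons):
    """Merge word -> label dicts into one lowercase lookup table.

    Assumption for this sketch: a word that appears with conflicting
    labels in different lexicons is treated as neutral.
    """
    merged = {}
    for lexicon in lexicons:
        for word, label in lexicon.items():
            key = word.lower()
            if key in merged and merged[key] != label:
                merged[key] = "neutral"
            else:
                merged[key] = label
    return merged


combined = merge_lexicons(ENGLISH, SPANISH)
```

Lowercasing the keys keeps the lookup consistent with tokens that are lowercased during pre-processing.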

1.2 Feature Selection

The first step in the process is to retrieve a set number of tweets and news headlines from Twitter and
various stock market news websites. We will then pre-process the tweets and headlines into “tokenized” data
by segmenting each tweet or headline into individual strings of words and special characters. We will
delete punctuation characters and remove stop words. The next step is to extract a bag-of-words (BOW) array
from the pre-processed data.
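
The pre-processing steps above can be sketched in a few lines. The example is written in Python with only the standard library, purely for illustration; the project itself plans to use MATLAB. The stop-word list is a small illustrative subset, and the function names tokenize, build_vocabulary, and bag_of_words are hypothetical.

```python
import re
from collections import Counter

# Illustrative stop-word subset (an assumption); a real run would use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on"}


def tokenize(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def build_vocabulary(documents):
    """Map every distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(documents, vocabulary):
    """Return one row of token counts per document, one column per vocabulary word."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


docs = ["Apple stock is up, up!", "The stock fell."]
vocab = build_vocabulary(docs)
bow = bag_of_words(docs, vocab)
```

Each row of bow corresponds to one tweet or headline and each column to one vocabulary word, which is the array form a classifier can consume.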

1.3 Models Investigated

A number of models will be considered using MATLAB’s Classification Learner App, which suggests a number
of models that train quickly on the given data. The model expected to demonstrate the best performance is
an SVM classifier. The model will be trained on the combined language lexicons of positive and negative
words.
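
To make the idea of training an SVM on lexicon words concrete, the sketch below fits a linear SVM by subgradient descent on the L2-regularized hinge loss, in plain Python. It is a toy stand-in, not the Classification Learner workflow: the four-word lexicon, the hyperparameters, and the neutral_band threshold used to produce a neutral label are all assumptions for the example.

```python
# Hypothetical four-word lexicon: +1 = positive, -1 = negative.
LEXICON = {"gain": 1, "rally": 1, "loss": -1, "slump": -1}
VOCAB = sorted(LEXICON)


def encode(tokens):
    """Count how often each vocabulary word occurs in a token list."""
    return [float(tokens.count(word)) for word in VOCAB]


def train_linear_svm(X, y, epochs=200, lr=0.1, reg=0.01):
    """Subgradient descent on the L2-regularized hinge loss; labels are +1 / -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Shrink the weights (regularization term of the loss).
            w = [wj * (1.0 - lr * reg) for wj in w]
            # Hinge term: only examples inside the margin push the boundary.
            if margin < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def classify(w, b, tokens, neutral_band=0.5):
    """Label a token list; scores close to zero are reported as neutral (assumption)."""
    score = sum(wj * xj for wj, xj in zip(w, encode(tokens))) + b
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


# Each lexicon word is one training example (a one-hot row).
X = [encode([word]) for word in VOCAB]
y = [LEXICON[word] for word in VOCAB]
w, b = train_linear_svm(X, y)
```

Because each training example is a single lexicon word, the learned weight for a word acts as its sentiment score, and a tweet is scored by summing the weights of the lexicon words it contains.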

1.4 Model Evaluation

A confusion matrix will be used to evaluate the model’s performance in predicting the sentiment for each
tweet or headline. As both false negatives and false positives yield equally detrimental consequences, it is
important to evaluate both.
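
A minimal sketch of this evaluation, again in plain Python rather than MATLAB, is shown below. It builds a three-class confusion matrix (rows are actual labels, columns are predicted labels) and counts false positives and false negatives for each class. The label order, the function names, and the sample labels are made up for illustration and are not project results.

```python
LABELS = ["negative", "neutral", "positive"]


def confusion_matrix(actual, predicted, labels=LABELS):
    """Rows index the actual label, columns index the predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def per_class_errors(matrix, labels=LABELS):
    """Count false positives and false negatives for each class."""
    errors = {}
    for i, label in enumerate(labels):
        correct = matrix[i][i]
        false_neg = sum(matrix[i]) - correct                  # rest of the row
        false_pos = sum(row[i] for row in matrix) - correct   # rest of the column
        errors[label] = {"fp": false_pos, "fn": false_neg}
    return errors


# Made-up labels for illustration only.
actual = ["positive", "positive", "negative", "neutral", "negative"]
predicted = ["positive", "negative", "negative", "neutral", "positive"]
matrix = confusion_matrix(actual, predicted)
errors = per_class_errors(matrix)
```

Reporting both counts per class keeps false positives and false negatives visible separately instead of folding them into a single accuracy figure.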

1.5 Programming Language

As the student suite and an assortment of relevant toolboxes are available to me, MATLAB will be used as
the primary programming language. In the case that API calls must be made in C/C++, access to MATLAB
Coder is available. Various functions from the Text Analytics, Statistics and Machine Learning, Parallel
Computing, and potentially Trading Toolboxes will be used.

References

[1] Data: Lexicoder. (2020, August 29). Retrieved October 12, 2020, from http://www.snsoroka.com/data-lexicoder/
[2] Tatman, R. (2017, September 13). Sentiment Lexicons for 81 Languages. Retrieved October 12, 2020, from https://www.kaggle.com/rtatman/sentiment-lexicons-for-81-languages
[3] Apple Twitter Sentiment - dataset by crowdflower. (2016, November 21). Retrieved October 12, 2020, from https://data.world/crowdflower/apple-twitter-sentiment
