X X Faculty X Department: University
X Faculty
X Department
PROJECT PROPOSAL:
COURSE NAME
PROFESSOR
PROFESSOR NAME
CITY
YEAR
Table of Contents
Abstract
Introduction
Literature Review
Methodology
Anticipated Results
Conclusion
References
Abstract
This paper examines the predictability of stock movements using sentiment derived from
news coverage of the COVID-19 pandemic. Much research has been done on the empirical
relationship between news and stock movements using text mining. However, not enough
attention has been paid to immediate market sentiment during crises. In this paper, several
state-of-the-art machine learning techniques are implemented to determine whether sentiment
measures of news coverage can improve the predictive power of investors' models. The
expected result of the research is that signal classification algorithms based on sentiment
analysis yield higher accuracy than the 50% guessing probability for a binary outcome. Such a
result would identify areas where the efficient market hypothesis does not hold for the
American stock market.
Introduction
Background.
News related to the financial world plays a key role for investors making decisions
in financial markets. By deriving information from news articles on particular companies
or markets, investors can make predictions about future movements of stock valuations, leading
however, there is no single superior method for interpreting information coming from the
news. News interpretation methods are most commonly divided into fundamental analysis
methods and technical analysis methods. Fundamental methods require extensive analysis of
the qualitative information of companies and markets, while technical
There is, however, a subset of methods that lies in-between these two major
rather than quantitative). This is done through sentiment analysis: natural language
processing methods that classify texts as positive or negative on a given scale. Feature
engineering via sentiment analysis can offer insight into the state of the stock market at a
given moment, hinting at investor expectations and decisions. Previous research in
behavioral finance has suggested that investors tend to make emotion-driven decisions,
depending on the extent of their optimism towards certain markets and companies (Bollen et
al., 2011) [1].
Problem Statement.
In this paper, a specific subset of news is studied: the news coverage of the COVID-19
pandemic in 2020. The goal of the paper is to determine whether there was any departure from
market efficiency in how the news coverage of COVID-19 affected the stock valuations of
American corporations. More specifically, this paper aims to construct a sentiment-based
algorithm that would determine whether there was a consistent lag between the release of
news regarding certain companies and markets and the time at which the market fully
reflected the information from that news.
Research Question.
To what extent has the Efficient Market Hypothesis (EMH) held on the American stock
market throughout the COVID-19 pandemic, i.e., is it possible to obtain excess profits
on American stocks using sentiment-based analysis of the news coverage of the COVID-19
pandemic?
To identify major text analysis methods and techniques in the context of financial
markets research.
articles, which detail the spread of the virus over the world and can be tracked to specific
To apply relevant sentiment metrics on the collected news articles, creating a timeline of
the overall news coverage sentiment towards the spread of the pandemic.
analysis, and compare the performance of that portfolio to the performance of a baseline
Professional Significance.
of news sentiments may prove to be more relevant than an analysis based solely on technical
indicators, which naturally often overlook the risks of crisis shocks. Thus, this research has
the potential to determine whether news coverage plays a significant role in the
decision-making of investors. So far, there has been little discussion of the impact of news
on financial markets during a crisis. Hence, the research gap this paper attempts to cover is
specifically the effectiveness of sentiment analysis on stock markets during a crisis shock.
As is the case for most machine learning problems, the best performing algorithm
greatly depends on the specifications of the problem, and with any of the specifications
changed, the performance of certain models can vastly differ. In this paper, only American
stocks included in the S&P 500 index will be considered, meaning that the conclusions and
implications might not fully apply to other stock markets, for instance, the Russian stock
market. This choice of stocks is explained by the overall stability of the American stock
market – compared to other regional stock exchanges, there is less systemic risk, which can
overshadow the actual impact of the news coverage. At the moment, the collected news
database consists of Bloomberg and Reuters articles from March 21st, 2020 to December 31st,
2020. Since the pandemic began several months before that, the database would ideally
be supplemented with articles from the beginning of 2020 for completeness.
Literature Review
Several key models of how information affects financial markets have been introduced,
the most substantial of which is the Efficient Market Hypothesis (EMH), first developed in
(Fama, 1965) [4]. According to EMH, all asset prices on stock markets reflect all available
information, meaning that stock-associated news and events have an impact on the dynamics
of stock prices. The hypothesis was further empirically confirmed to different extents in
(Malkiel, Fama, 1970) [9], which discusses the three forms of EMH: weak, semi-strong and
strong, each of them implying that a different information set is reflected in stock prices.
Depending on the form of EMH, a conclusion is made on whether or not an investor can
achieve excess returns using specific information to make decisions. (Fama, 1965) [4] also
introduces the Random Walk Hypothesis (RWH), a model which suggests that all stock
movements are independent of each other and have the same distribution, meaning that it is
not possible to predict future movements of a stock using past trends. The EMH and RWH
concepts are interconnected, as RWH requires immediate informational efficiency.
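
The independence claim of the RWH can be illustrated with a small simulation. The following is only a sketch under stated assumptions and is not taken from the cited work: price changes are drawn as independent Gaussian steps, the seed and sample size are arbitrary, and the helper name `lag1_autocorrelation` is hypothetical. Under these assumptions, consecutive price changes should show a lag-1 autocorrelation close to zero, while the price levels themselves remain highly persistent.

```python
import random


def lag1_autocorrelation(xs):
    """Sample lag-1 autocorrelation of a sequence of numbers."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


rng = random.Random(0)  # fixed seed, for reproducibility of the illustration only

# Independent, identically distributed price changes (Gaussian is an assumption).
steps = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# The simulated price path is the running sum of the changes.
prices = []
level = 100.0
for step in steps:
    level += step
    prices.append(level)

r_steps = lag1_autocorrelation(steps)    # expected to be close to 0
r_prices = lag1_autocorrelation(prices)  # expected to be close to 1

print(f"lag-1 autocorrelation of changes: {r_steps:.3f}")
print(f"lag-1 autocorrelation of levels:  {r_prices:.3f}")
```

A clearly non-zero autocorrelation of changes in real data would be one simple indication that past movements carry some predictive information.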
More recently, however, literature has emerged that offers a contradictory theory on
market efficiency. (Lo, 2005) [8] introduces the concept of Adaptive Market Hypothesis
(AMH) – a substantially different interpretation of the relationship between news and stock
movements, which analyzes the market efficiency from the standpoint of behavioral finance.
environmental conditions and the number and nature of "species" in the economy” (Lo, 2005,
p. 19), suggesting that market inefficiency is often present, for instance, in cases of bubbles
and crises. Indeed, empirical research suggests that the adaptive market theory describes
stock behavior more accurately than the efficient market theory (Urquhart, Hudson, 2013) [13].
In that paper, the US, UK, and Japanese stock markets are analyzed, and the overall
conclusion is that none of the markets has been consistently efficient, which potentially
suggests that timely analysis of market information can serve as an instrument to capture
excess returns on stock markets.
feature processing based on textual information in the context of financial markets analysis.
(Hagenau et al., 2013) [5] presents a summary of 11 papers, which build machine learning
algorithms on features generated from text mining and sentiment analysis. The described 11
papers differ from each other in used stock datasets, text mining feature types (bag-of-words,
n-grams, word combinations), as well as the inclusion of market feedback and specific
machine learning models. The best-performing paper reaches an accuracy of 65.1% of correctly
identified signals, using a support vector machine and word combinations.
However, such results are not consistent across other papers, many of which reach
inconclusive or incompatible results. Overall, there is still little consensus among
researchers on whether sentiment analysis can consistently be used to predict stock market
2014) [10], and examines 24 papers, updating the set of papers reviewed in (Hagenau et al.,
2013) [5]. This review attempts to introduce a well-rounded theoretical framework for feature
engineering based on text mining. Reviewed papers differ in types of mined texts (financial
news, corporate filings, financial disclosures, tweets), text sources (various platforms, e.g.
Bloomberg, Yahoo! Finance, Twitter), and other text-related specifications. (Kalyani et al.,
2016) [6] is a more recent example of research, which performs sentiment analysis to derive
signals based on bag-of-words analysis of the news related to stocks. This paper presents a
tidy step-by-step approach, which will potentially be replicated in this research. It should be
noted, though, that the majority of papers in both meta-reviews use daily historical stock data.
In this research, intraday stock data will be examined to capture the immediate investor
reaction to the hard-to-predict events that occurred throughout the pandemic, such as
border closures and lockdown restrictions. This premise is justified by some of the previous
research; for instance, (Ding et al., 2014) [2] suggests that the impact of news coverage is mostly
A more traditional approach to stock movement prediction through the use of news is
event studies, which require time series analysis of the historical stock data, as well as
structural breaks implementation at the time points of the most impactful events in the
reviewed period. (Pesakovic, Ndekugri, 2017) [11] provides an example of an event study
involving three multinational companies, in which the impactful events included American
presidential elections. The most obvious limitation of the event study method is that
the noteworthy events have to be handpicked rather than collected automatically without
subjective preference. Thus, in this research, event study methodology will potentially be
A lot of market-specific research has been done on the time series analysis of stock
data, using structural breaks due to major economic events. (Ewing, Malik, 2016) [3] provides
an example of an event study done to determine the indirect relationship between oil prices
and the US stock market. Researchers conduct a structural breaks analysis on the stock
market, in which breaks are detected through analysis of oil price volatility. That approach
makes it possible to capture volatility that is not directly explained by the quantitative data alone.
Previously mentioned summary reviews of existing papers also discuss a variety of
different machine learning models used for market signal classification. (Nassirtoussi et al.,
2014) [10] categorized the papers by the following models present in previous research:
Combinatory algorithms.
Multi-algorithm experiments.
A more recent work (Kraus, Feuerriegel, 2017) [7], aside from the aforementioned
methods, considers several machine learning techniques developed after the release of
(Nassirtoussi et al., 2014) [10]. This paper mainly concentrates on decision tree-based
classification algorithms, such as Random Forest and Gradient Boosting, as well as the deep
learning architecture RNN (Recurrent Neural Network) and its extension, LSTM (Long
Short-Term Memory network). In their work, the deep learning architectures yield higher
accuracy than the more traditional approaches, with the highest-performing models reaching
an accuracy of 60.1% of correctly classified abnormal returns, although the work was done using
Methodology
As was discussed in the literature review, in the context of market research that is not
specifically tied to some period or major event, different sources of information could be used
for different analysis purposes (news, financial disclosures, tweets, etc.). Since this research is
aimed at capturing investor decision-making during a rapidly evolving crisis, daily news
articles serve the purpose better than the financial disclosures of companies.
Previous research involving sentiment analysis of financial news agrees on Bloomberg [14]
and Reuters [15] being the key news platforms that cover the vast majority of relevant news.
The initial news database, which currently consists of roughly 187 thousand news articles, is
As can be seen in the initial database, the articles are not originally structured by
affected companies or markets, and some articles might not even discuss pandemic-related
The database has to be filtered by a corpus of keywords that would indicate that the
“lockdown”, etc.). If a given article does not include at least one of the words from the
The database has to be restructured so that each entry refers to a specific company whose
stock is included in the S&P 500 index. If a given article mentions several companies,
the article needs to be included several times in the database, once per company.
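
The two preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual pipeline of this research: apart from "lockdown", the keywords are placeholders, the company-to-ticker map is a hypothetical three-entry stand-in for the full S&P 500 list, the field names `published` and `text` are invented, and matching is done by naive case-insensitive substring search.

```python
# Placeholder keyword corpus; only "lockdown" comes from the text, the rest are assumptions.
PANDEMIC_KEYWORDS = {"lockdown", "covid-19", "pandemic", "quarantine"}

# Hypothetical name-to-ticker map; a real one would cover all S&P 500 constituents.
COMPANY_TICKERS = {"Apple": "AAPL", "Delta Air Lines": "DAL", "Pfizer": "PFE"}


def is_pandemic_related(text):
    """True if the text contains at least one keyword (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in PANDEMIC_KEYWORDS)


def explode_by_company(articles):
    """Keep pandemic-related articles and emit one row per mentioned company."""
    rows = []
    for article in articles:
        if not is_pandemic_related(article["text"]):
            continue  # dropped: no keyword from the corpus
        lowered = article["text"].lower()
        for name, ticker in COMPANY_TICKERS.items():
            if name.lower() in lowered:
                rows.append({
                    "ticker": ticker,
                    "published": article["published"],
                    "text": article["text"],
                })
    return rows


# Made-up example articles.
articles = [
    {"published": "2020-03-23T14:00",
     "text": "Apple and Pfizer respond as lockdown measures widen."},
    {"published": "2020-03-24T09:30",
     "text": "Delta Air Lines reports quarterly earnings."},
    {"published": "2020-03-25T11:00",
     "text": "New pandemic restrictions announced in Europe."},
]

rows = explode_by_company(articles)
for row in rows:
    print(row["ticker"], row["published"])
```

Substring matching is deliberately simple here; in practice, company mentions would need more careful entity matching to avoid false hits on partial names.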
As for the intraday historical stock data, Marketstack [16] was used to download the
hourly stock data for the stocks in the S&P 500 index. The stock database is structured in the
following way:
At the moment, both the news database and the stock database include data from
News transformation. For each article, a measure of sentiment has to be calculated. This
will be done using the dictionary-based approach, as well as the previously mentioned
LSTM architecture.
Feature Engineering. Stock data will be used to produce technical indicator data
(numerical), which would expand the feature corpus used by classification algorithms.
Signal Classification. Predictive models, which were discussed in the literature review,
will be applied to the generated features. At each time point, a market signal about a
specific stock will be classified as either positive or negative using machine learning
algorithms.
Portfolio Creation and Comparison. Based on the classified signals, a stock portfolio
will be built for each machine learning algorithm. The performance of these portfolios
will then be compared, using the Sharpe ratio, to the baseline portfolio built on structural
breaks analysis. Several sentiment thresholds will be tested while creating
portfolios, meaning that in some portfolios only articles with highly positive or negative
sentiment would lead to actions, while in other portfolios even mild sentiment would be
considered significant.
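
Three of the steps above (dictionary-based sentiment, sentiment thresholds, and the Sharpe ratio comparison) can be sketched in a few lines. This is an illustrative sketch only: the word lists are toy placeholders rather than an established sentiment dictionary, the function names are hypothetical, the threshold is assumed to be positive, the risk-free rate is assumed to be zero, and the Sharpe ratio is computed per period without annualization.

```python
import statistics

# Toy lexicon; a real dictionary-based approach would use a much larger word list.
POSITIVE = {"recovery", "gain", "growth", "reopen", "vaccine"}
NEGATIVE = {"lockdown", "loss", "decline", "closure", "outbreak"}


def sentiment_score(text):
    """(positive - negative) / (positive + negative), in [-1, 1]; 0.0 if no hits."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def signal(score, threshold):
    """+1 or -1 only when the sentiment magnitude reaches the threshold, else 0."""
    if score >= threshold:
        return 1
    if score <= -threshold:
        return -1
    return 0


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)


headline = "Vaccine progress fuels recovery despite new lockdown."
score = sentiment_score(headline)        # two positive hits, one negative hit
strict = signal(score, threshold=0.5)    # only strong sentiment triggers an action
lenient = signal(score, threshold=0.2)   # mild sentiment is already significant

hourly_returns = [0.01, -0.02, 0.03, 0.00]  # made-up portfolio returns
print(score, strict, lenient, sharpe_ratio(hourly_returns))
```

Running the same signals through several thresholds and comparing the resulting Sharpe ratios is one way the threshold comparison described above could be organized.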
Anticipated Results
The main anticipated result of this research project is either the confirmation or the
rejection of market efficiency in the American stock market during the COVID-19
pandemic. This result would mainly depend on the performance of the machine
learning-based stock portfolios: if these portfolios yield statistically significantly higher
returns than the baseline portfolios, or the S&P 500 market index itself, then the conclusion
would be that the US stock market is not informationally efficient. This result would support
the Adaptive Market Hypothesis, suggesting that news coverage is not immediately reflected in
stock prices. Moreover, the optimal sentiment threshold for stock inclusion or omission in
Another anticipated result of the project is that the accuracy of the signal classification
models turns out to be significantly higher than 50% (guessing probability for a binary
predicted parameter). Even if the portfolio returns are not higher than the S&P 500 returns, an
this research.
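
Whether a classifier's accuracy is significantly higher than the 50% guessing probability can be checked with an exact one-sided binomial test. The sketch below is one possible way to do this, not a method prescribed by this proposal: the function name `binomial_tail` is hypothetical, the counts are invented for illustration, and the test assumes that the classified signals are independent of each other, which may not hold for overlapping intraday observations.

```python
from math import exp, lgamma, log


def binomial_tail(correct, total, p0=0.5):
    """P(X >= correct) for X ~ Binomial(total, p0), summed in log space."""
    tail = 0.0
    for k in range(correct, total + 1):
        log_pmf = (
            lgamma(total + 1) - lgamma(k + 1) - lgamma(total - k + 1)
            + k * log(p0) + (total - k) * log(1 - p0)
        )
        tail += exp(log_pmf)
    return tail


# Hypothetical outcome: 560 correctly classified signals out of 1000.
p_value = binomial_tail(560, 1000)
print(f"one-sided p-value against 50% guessing: {p_value:.2e}")
```

A small p-value would indicate that the observed accuracy is unlikely under pure guessing; the log-space summation is used only to avoid overflow for large sample sizes.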
Conclusion
In this research, the market efficiency of the US stock market is tested in the context
of the COVID-19 pandemic in 2020. A lot of research has been previously done on the
empirical relationship between stock movements using text mining, however, not enough
After the review of the previous work, sentiment analysis of news articles covering the
sentiment. The sentiments, along with other features derived from technical analysis, are then
used in machine-learning algorithms to classify market signals, on which stock portfolios are
based. The overall expectation is that the portfolios based on sentiment analysis features
will outperform the baseline portfolios based on structural breaks analysis, although it is not
clear which machine learning models will prove to be the most successful. This result
would make it possible to identify a more accurate model of the relationship between news and the stock market
This research could potentially be expanded in various ways, for instance, news
articles can be further classified by categories (industry, country, market) to generate extra
features. Furthermore, stock markets of other countries can be analyzed similarly, as results
References
1. Bollen, J., Mao, H. and Zeng, X. (2011). Twitter mood predicts the stock market.
2. Ding, X., Zhang, Y., Liu, T., & Duan, J. (2014, October). Using structured events to
(pp. 1415-1425).
3. Ewing, B. T., & Malik, F. (2016). Volatility spillovers between oil prices and the
stock market under structural breaks. Global Finance Journal, 29, 12-23.
38(1), 34-105.
5. Hagenau, M., Liebmann, M., & Neumann, D. (2013). Automated news reading:
6. Kalyani, J., Bharathi, P., & Jyothi, P. (2016). Stock trend prediction using news
sentiment analysis.
7. Kraus, M., & Feuerriegel, S. (2017). Decision support from financial disclosures
with deep neural networks and transfer learning. Decision Support Systems, 104, 38-
48.
10. Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2014). Text
11. Pesakovic, G., & Ndekugri, A. (2017). Using Event Studies to Evaluate Stock
Research.
12. Schoen, H., Gayo-Avello, D., Metaxas, P. T., Mustafaraj, E., Strohmaier, M., &
Gloor, P. (2013). The power of prediction with social media. Internet Research,
23(5), 528-543.
13. Urquhart, A., & Hudson, R. (2013). Efficient or adaptive markets? Evidence from
major stock markets using very long run historic data. International Review of