
Computer Science Predicts the Stock Market

Avery Leider
Pace University
Department of Computer Science
Pleasantville, New York
al43110n@pace.edu

Abstract—Why is it that the greatest utilization of recent advances in faster computing power and smarter algorithms is in the stock market? How are these advances changing the stock market, and what is the future? The place that gives the greatest financial rewards for new computing phenomena is the stock market, and so it is in the innovations of the stock market that we can see the greatest advances of computer science in use. Today Big Data is bringing about the rise of robot brokers that will be able to manage stock portfolios by software models. The advances in hardware and software are co-designed to benefit each other. The trend is one of increasing complexity and of the inclusion of unstructured text data. This is enabled by advances in computing that allow greater amounts of data to be processed by supercomputers and by parallel processing on distributed computers.

Keywords—stock market; big data; text mining; robot brokers
I. INTRODUCTION

The big data revolution can be seen best by observing its impact on the stock markets. It is enabled by greater amounts of data, processed by far smarter algorithms that could not be used earlier because, in the past, processing them was not fast enough to deliver results in time for decision making, but is fast enough now.

President Obama's White House defines Big Data as “data that is so large in volume, so diverse in variety or moving with such velocity, that traditional modes of data capture and analysis are insufficient—characteristics colloquially referred to as the ‘3 Vs.’” [1] The “3 Vs” are Volume, Variety and Velocity; in big data, this means more volume, more variety, and more velocity than previously.
More than the vast amounts of data, it is the smarter algorithms of today that make processing that big data at near real-time speeds possible. This makes them useful in the microseconds of decision making.
“For example, stock market data are a typical domain that constantly generates a large quantity of information, such as bids, buys, and puts, in every single second. The market continuously evolves and is impacted by different factors, such as domestic and international news, government reports, and natural disasters, and so on. An appealing Big Data mining task is to design a Big Data mining system to predict the movement of the market in the next one or two minutes. Such systems, even if the prediction accuracy is just slightly better than random guess, will bring significant business values to the developers.” [2]
Several of the algorithms that are in use today in the stock market were originally used in other fields of scientific research, such as genetics. In return, refinements of these algorithms as they are used in the stock market may lead to new discoveries of their usefulness in other fields. Use of models and algorithms in new areas different from those they were designed for carries risk as well as reward.

Additionally, data that before now was not collected, and was not possible to process, such as unstructured text data, can now be included in what we account for when trying to predict the stock market.

This phenomenon is spreading quickly worldwide. Already, text mining the news is spreading from English-speaking stock markets to Turkish-speaking stock markets.
the decisions with “Automated trading system based on
performance weighted ensembles of random forests …trading
In the future of the smarter algorithms that are being
seasonality events”. [7] Seasonality, or seasonal trends, means
developed for the stock market, similar laws and regulations accounting for the turn-of-the-month and exchange holiday bias
will be applied, as government catches up to the catastrophic (upward trend in prices) and weekend bias (downward trend in
risk that could come with mistakes made in the super-fast prices).
financial big data mining models and robot-managed
portfolios. The performance weight of the random forests is done by
Booth with time, with the most recent historical price data
having greater weight than more distant transactions. An
II. ALGORITHMS TODAY – RELATED WORKS

A. Self Organizing Maps and Support Vector Regression

Making predictions based only on price changes of stocks is not enough to account for risk and volatility. Adding in the standard deviation or variance is needed. But further than that, the investor needs “a robust tool that can accurately gauge the mood of the market” and match it against the “risk averseness or risk empathy” of the investor. The model presented by Choudhury in 2014 [4] first clusters all of the stocks based on their risk and return profiles using Self Organizing Maps (SOM) with k-means clustering. [5] Then Support Vector Regression (SVR) [6] is used to predict, for short trading cycles (in his paper, a short trading cycle is two trading days), the future price and volatility of the stocks. The output of the model gives a range of values that the investor can select from according to his tolerance of risk.
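To make this two-stage structure concrete, the sketch below (an illustration only, not Choudhury's implementation) clusters synthetic stocks by their mean return and volatility, using scikit-learn's k-means as a stand-in for the SOM step, and then fits one SVR per cluster to predict the price two trading days ahead. The data, the cluster count, and the SVR parameters are all assumptions.

# Illustrative sketch only (not Choudhury's code): cluster stocks by their
# risk/return profile, then fit a Support Vector Regression model per cluster
# to predict the price two trading days ahead. All data here is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_stocks, n_days = 40, 120
prices = np.cumsum(rng.normal(0, 1, (n_stocks, n_days)), axis=1) + 100.0
returns = np.diff(prices, axis=1) / prices[:, :-1]

# Risk/return profile per stock: mean daily return and its standard deviation.
profiles = np.column_stack([returns.mean(axis=1), returns.std(axis=1)])

# Stand-in for the SOM + k-means step: group stocks into risk/return clusters.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(profiles)

horizon, window = 2, 10   # predict 2 trading days ahead from a 10-day window
for c in range(4):
    members = np.where(labels == c)[0]
    X, y = [], []
    for s in members:
        for t in range(window, n_days - horizon):
            X.append(prices[s, t - window:t])
            y.append(prices[s, t + horizon])
    if not X:
        continue  # skip the rare case of an empty cluster
    model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
    print(f"cluster {c}: {len(members)} stocks, R^2 = {model.score(X, y):.3f}")

In Choudhury's model the clustering additionally lets the investor choose a cluster whose risk and return profile matches his own tolerance before the regression forecast is consulted.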
The model shows the direction of the price movements, up or down, and predicts their volatility. What is unique in this paper is that it combines k-means clustering with the predictive technique of Support Vector Regression to deliver a strategy for which stocks to buy and sell. Choudhury makes the case that the Support Vector Machine (SVM) is more popular today than Artificial Neural Networks (ANN), saying that SVM outperforms many other predictive algorithms. The output from SVM that he demonstrates shows both potential prices and their volatility values.
However, Choudhury reveals frustration where he says “Markets are generally difficult to study owing to strong coupling between stocks, nonlinear investor response to news and imperfect dissemination of news among the investors rendering it highly inefficient, a marked deviation from traditional assumptions of market efficiency.” [4] This frustration reveals the advantage that a perfect response to news, in faster time than the competition, would deliver better results, as well as the advantage of software robots that would respond instantly and linearly to the linear change in the market that the news brought, unlike humans, who respond nonlinearly to the same news and process it in different amounts of time.
B. Weighted Random Forests and Seasonality

Random forests are used in Booth's 2014 paper [7] to predict not only the signals to buy and sell stocks, but also the size that the transaction should be. His paper aims to help software make these decisions with an “Automated trading system based on performance weighted ensembles of random forests … trading seasonality events”. [7] Seasonality, or seasonal trends, means accounting for the turn-of-the-month and exchange-holiday bias (an upward trend in prices) and the weekend bias (a downward trend in prices).

The performance weighting of the random forests is done by Booth with time, with the most recent historical price data having greater weight than more distant transactions. An ensemble of two algorithms, reacting to the seasonality and the performance weights, is trained with random forests to predict prices, so that software can autonomously trade in the financial markets, using those predictions to signal when to buy and sell and, sometimes, to signal the size of the transaction.
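The following sketch shows one plausible reading of this idea, not Booth's implementation: several random forests are trained on different slices of history, each forest's vote is weighted by its accuracy on a recent evaluation window, and newer outcomes in that window count more through an assumed exponential decay. The data and all parameters are synthetic placeholders.

# Minimal sketch of a performance-weighted ensemble of random forests, in the
# spirit of Booth [7] (not his implementation). Each forest's vote is weighted
# by its recent accuracy, with newer outcomes weighted more heavily.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 8))                  # assumed feature matrix
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # assumed buy(1)/sell(0) labels

n_members, decay = 3, 0.9
members = [RandomForestClassifier(n_estimators=100, random_state=k).fit(
               X[k * 150:(k + 1) * 150], y[k * 150:(k + 1) * 150])
           for k in range(n_members)]

# Recency-weighted accuracy of each member on a rolling evaluation window.
X_eval, y_eval = X[450:550], y[450:550]
time_w = decay ** np.arange(len(y_eval))[::-1]   # newest observation weight 1
perf = np.array([np.average(m.predict(X_eval) == y_eval, weights=time_w)
                 for m in members])
weights = perf / perf.sum()

# Weighted vote of the ensemble on new observations.
X_new = X[550:]
proba = sum(w * m.predict_proba(X_new)[:, 1] for w, m in zip(weights, members))
signal = (proba > 0.5).astype(int)               # 1 = buy, 0 = sell
print("ensemble weights:", np.round(weights, 3))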
Using data from the European stock market, Ballings [8] evaluated multiple benchmark ensemble methods (methods combining two or more algorithms in a model), such as Random Forest, against single-algorithm models such as Neural Networks, Regression, Support Vector Machines and K-Nearest Neighbor. Ballings found that, in 2015, Random Forest is the top algorithm, followed by Support Vector Machines, then Neural Networks, K-Nearest Neighbor, and Regression.

Ballings contributed this comprehensive benchmark comparison using, as the criterion for rating each method, its accuracy in predicting the direction of the stock prices of 5,767 publicly listed European companies one year ahead. He observed that “Stock price direction prediction is an important issue in the financial world. Even small improvements in predictive performance can be very profitable.” [8]
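A benchmark of this kind can be sketched as follows (synthetic data, not Ballings' study); here ordinary logistic regression stands in for the paper's “Regression” model, and cross-validated accuracy on a binary one-year-ahead direction label is the rating criterion.

# Sketch of a Ballings-style benchmark [8] (not his study): compare several
# classifiers on predicting stock price direction one year ahead. The data is
# synthetic; in the paper the features come from 5,767 European firms.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 20))                # assumed firm-level features
y = (X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=1000) > 0).astype(int)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                                    random_state=0),
    "K-Nearest Neighbor": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name:20s} mean accuracy: {acc:.3f}")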
C. Particle Swarm Optimization and Recursive Least-Squares

Using data from the Taiwan Stock Index (TAIEX), Feng developed a prediction system that delivered output that predicted the trend (up or down) of the TAIEX stock price better than the available stock trend index. Feng's system uses four learning algorithms: regression analysis, dynamic learning, as well as hybrid particle swarm optimization and recursive least-squares. [9]
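Feng's full system is an evolutionary RBFN, which is not reproduced here; the recursive least-squares component can, however, be illustrated generically. The sketch below applies a textbook RLS update with a forgetting factor to a linear model over lagged prices; the lag count, the forgetting factor, and the synthetic price series are assumptions.

# Generic recursive least-squares (RLS) update, shown only to illustrate the
# kind of online estimator Feng combines with particle swarm optimization [9];
# this is not his RBFN system. Model: next price ~ w . [lagged prices, 1].
import numpy as np

def rls_step(w, P, x, y, lam=0.99):
    """One RLS update with forgetting factor lam."""
    Px = P @ x
    k = Px / (lam + x @ Px)          # gain vector
    e = y - w @ x                    # prediction error before the update
    w = w + k * e
    P = (P - np.outer(k, Px)) / lam
    return w, P, e

rng = np.random.default_rng(3)
prices = 100 + np.cumsum(rng.normal(0, 1, 300))

lags = 3
w = np.zeros(lags + 1)
P = np.eye(lags + 1) * 1000.0        # large initial covariance
for t in range(lags, len(prices) - 1):
    x = np.append(prices[t - lags:t], 1.0)   # lagged prices plus a bias term
    w, P, err = rls_step(w, P, x, prices[t + 1])
print("final weights:", np.round(w, 3), "last error:", round(float(err), 3))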
D. Software Robots

The field of intelligent virtual agents, or software robots, is almost ready to join human brokers in the stock market. In Hungary, there is an e-learning website that is teaching the people of Hungary how to work the Hungarian stock market by interacting with an “intelligent chatter robot connected to a specialized knowledge base and with responses based on emotional modeling of the user”, as well as by interacting with human agents, in a virtual artificial stock market that applies real trading rules and realistic market information. [10]

Humans and software robot-agents do not trade with the same behavior in all environments, according to Feldman. [11] He studied human traders and robot traders in a “financial market simulation prone to bubbles and crashes.” [11] Feldman's results show that human traders earn lower profits overall, but that human traders earn higher profits in crash-intensive periods. Both humans and robots make similar buying and selling choices when the market is not in a bubble or a crash. Feldman divided his humans into inexperienced traders and experienced traders. Experienced human traders did not make an impact on the market, but inexperienced human traders “tend to destabilize the smaller (10 trader) markets”. [11] Another difference between humans and robots: after losses, humans are faster to sell than robots are.
E. Preprocessing Data to Reduce Storage

When there is a vast amount of data that needs to be processed at fast speed because of the perishability of the information to be found in it, algorithms that preprocess the data to reduce data storage and speed up calculations are useful. The best example of this is Al-Jaroodi's paper [12], in which he describes how the wealth of real-time financial market data available over the Internet can be retrieved, preprocessed with pre-defined conditions of price and time, and only the relevant data stored, using the storage algorithm he describes.
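The general pattern, retrieve a stream, test each record against predefined price and time conditions, and store only what matches, can be sketched as follows; the record layout, watchlist, price band, and trading window are assumptions and this is not Al-Jaroodi's algorithm.

# Sketch of the general pattern in Al-Jaroodi's approach [12] (not his
# algorithm): filter a real-time quote stream against predefined price and
# time conditions, and keep only the relevant records instead of storing all.
from dataclasses import dataclass
from datetime import datetime, time

@dataclass
class Quote:
    symbol: str
    price: float
    ts: datetime

# Assumed, user-defined conditions: symbols of interest, a price band,
# and a time window during the trading day.
WATCHLIST = {"ABC", "XYZ"}
PRICE_BAND = (95.0, 105.0)
TIME_WINDOW = (time(9, 30), time(16, 0))

def is_relevant(q: Quote) -> bool:
    lo, hi = PRICE_BAND
    start, end = TIME_WINDOW
    return (q.symbol in WATCHLIST
            and lo <= q.price <= hi
            and start <= q.ts.time() <= end)

def preprocess(stream):
    """Yield only the quotes that satisfy the predefined conditions."""
    for q in stream:
        if is_relevant(q):
            yield q

stream = [Quote("ABC", 101.2, datetime(2015, 11, 5, 10, 15)),
          Quote("DEF", 99.0, datetime(2015, 11, 5, 10, 16)),
          Quote("XYZ", 120.0, datetime(2015, 11, 5, 10, 17))]
stored = list(preprocess(stream))   # only the first quote survives the filter
print(stored)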
F. Parallel Processing of Artificial Neural Networks

Casas [13] demonstrates by experiment that making the backpropagation algorithm run in parallel on four processors simultaneously reduced the training time of the algorithm by 61% over running the same algorithm on one processor, when using four inputs to an Artificial Neural Network (ANN): (1) stock market data of the Standard & Poor's 500 Index, (2) the US Prime Rate, (3) the Consumer Price Index and (4) the Oil Price. The purpose of Casas's experiment was to measure the improvement in the time required to train an artificial neural network using multiple cores. He conducted validation tests under three parallel processing scenarios, confirming the increasing speed and effectiveness of parallel processing in learning to predict the output with one, two, and four cores. Training in all three scenarios required 1,644 epochs, and training stopped after the error reached less than 2%. His experiment shows that the parallelization of training algorithms results in improved training time. Training an ANN for stock markets involves processing a volume of historical data, which takes time; saving time increases the usefulness of the information, which is perishable.
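Casas parallelizes backpropagation inside a single network, which is not reproduced here; the sketch below only illustrates the wall-clock savings available from multiple cores, by training independent ANN replicas sequentially and then in parallel with joblib, on four assumed inputs loosely following his feature set.

# This does NOT reproduce Casas's intra-network parallel backpropagation [13];
# it only illustrates the time savings from using multiple cores, by training
# several independent ANN replicas sequentially vs. in parallel with joblib.
import time
import numpy as np
from joblib import Parallel, delayed
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
# Assumed inputs, loosely following Casas: index level, prime rate, CPI, oil.
X = rng.normal(size=(5000, 4))
y = X @ np.array([0.5, -0.2, 0.1, 0.3]) + rng.normal(scale=0.1, size=5000)

def train(seed):
    net = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=300,
                       random_state=seed)
    return net.fit(X, y)

t0 = time.perf_counter()
_ = [train(s) for s in range(4)]                              # one core
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
_ = Parallel(n_jobs=4)(delayed(train)(s) for s in range(4))   # four cores
t_parallel = time.perf_counter() - t0

print(f"serial: {t_serial:.1f}s, parallel: {t_parallel:.1f}s, "
      f"saving: {100 * (1 - t_parallel / t_serial):.0f}%")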
G. Data Mining with Big Data

According to IBM, “Every day, we create 2.5 quintillion bytes of data – so much that 90 percent of the data in the world today has been created in the last two years alone.” [14] “Big Data starts with large-volume, heterogeneous, autonomous sources with distributed and decentralized control, and seeks to explore complex and evolving relationship among data.” [15] The most important topic with software being used today for big data is the speed with which useful information can be obtained from it. Pre-processing and then processing the large volumes of data quickly avoids having to store all of it, and near real-time results are more useful with big data. One of the greatest challenges of big data is aggregation: data from different sources may be about the same subject, but be difficult to use because each source has its own format or protocols. [15]

One of the most important changes that big data brings to computer science is the ability to add a feature to our computations that was not possible before: accounting for unstructured text data that relates to news or sentiment. For example, in the retail markets today, big data means that, in addition to keeping track of customers with data fields where they have common values (age, gender, income, location, education, market basket), there is something new: retail businesses can add the many additional values brought by including the data of relationships, the social functions of family, friends, co-workers, interest groups and user communities, to study the behavior patterns of the customers of a business in exciting new ways. [15]
III. COMBINING UNSTRUCTURED TEXT DATA FOR NEW OUTPUTS

“…recent research has shown that using social networks, such as Twitter, it is possible to predict the stock market upward/downward trends with good accuracies.” [16] How is this accomplished using unstructured text data?
Data mining of unstructured text data, combined with the numeric data of the stock market, yields exciting returns. Jaybhay [17] makes the case for this in his model, which uses data for the Bombay Stock Exchange. Jaybhay says “Technical analysts believe that the market is only 10 percent logical and 90 percent psychological.” To account for the psychological factors, he adds three unstructured text sources of news to a total of four financial information sources that feed into his model, all available on the Web: (1) www.yahoo.finance.com, (2) www.stockwatch.in, (3) the Financial Times (www.ft.com) and (4) Reuters (www.investools.com).
The information that Jaybhay collected from Yahoo Finance is available all the time in comma separated value (.CSV) files, and includes the daily prices of stocks. He also collected the daily published news (unstructured, unlabeled text data) of the Bombay Stock Exchange from three web sites: www.finance.yahoo.com, reuters.com and www.stockwatch.in.
The big data portion of this is revealed in how Jaybhay handles the text data. The first step for Jaybhay was preprocessing; the text data is preprocessed in three steps: (1) removing stop words (such as a, an, the, of, etc.), (2) stemming, which is reducing a word to its stem, such as trimming the word “selling” to “sell”, and (3) harvesting keyword phrases against a global list of top keyword phrases. The top 12 keyword phrases were: hike, jump, flat, buy, loses, down, lower, steady, recession, scam, swoon, slump. The data was then classified as 0 or 1, to indicate the upward trend or downward trend information shown by these keywords and stem words.
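A minimal stand-in for these three steps might look as follows; the tiny stop word list, the crude suffix-trimming stemmer, and the polarity grouping of the 12 key phrases are simplifications and assumptions, not Jaybhay's code.

# Minimal stand-in for the three preprocessing steps Jaybhay describes [17]
# (stop word removal, stemming, keyword matching); the stop word list and the
# suffix-trimming "stemmer" below are simplifications, not his implementation.
import re

STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "and", "is", "are"}
KEY_PHRASES = {"hike", "jump", "flat", "buy", "loses", "down",
               "lower", "steady", "recession", "scam", "swoon", "slump"}
UP_WORDS = {"hike", "jump", "buy", "steady"}        # assumed polarity grouping

def crude_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]      # (1) stop words
    stems = [crude_stem(t) for t in tokens]                  # (2) stemming
    return {t for t in tokens + stems if t in KEY_PHRASES}   # (3) key phrases

def label(text):
    hits = preprocess(text)
    ups = len(hits & UP_WORDS)
    # 1 = uptrend if the matched phrases are mostly "up" words, else 0.
    return 1 if ups >= len(hits) - ups else 0

headline = "Banking stocks jump after rate hike, analysts say buy"
print(preprocess(headline), label(headline))  # finds jump/hike/buy, labels 1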
Jaybhay then used a neural network that took as inputs the values of the stock prices and the presence or absence of the key phrases extracted from the news text of that day, with the output value being the closing price of the stock for that day.
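A network of this shape can be sketched with scikit-learn as follows; the window length, the hidden layer size, and the synthetic prices and news flags are assumptions, and the model stands in for, rather than reproduces, Jaybhay's network.

# Sketch of the kind of network Jaybhay describes [17] (not his code): inputs
# are recent prices plus 0/1 indicators for the 12 key phrases present in the
# day's news; the target is the day's closing price. The data is synthetic.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)
n_days, n_phrases, window = 400, 12, 5
prices = 100 + np.cumsum(rng.normal(0, 1, n_days))
news_flags = rng.integers(0, 2, size=(n_days, n_phrases))  # phrase present?

X, y = [], []
for t in range(window, n_days):
    features = np.concatenate([prices[t - window:t], news_flags[t]])
    X.append(features)
    y.append(prices[t])                   # closing price of day t
X, y = np.array(X), np.array(y)

split = int(0.8 * len(X))                 # keep the split chronological
net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
net.fit(X[:split], y[:split])

pred = net.predict(X[split:])
direction_true = np.sign(np.diff(y[split:]))
direction_pred = np.sign(np.diff(pred))
print("direction accuracy:", float((direction_true == direction_pred).mean()))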
In a test of 543 stocks using the Jaybhay method, he predicted whether the next day's closing stock price would be an uptrend or a downtrend correctly 481 times, which is a success rate of 88%.

Jaybhay's advice for future work is to develop better preprocessing methods for news text: “more work on refining key phrases extraction will definitely produce better results.”
In a very similar way to the one Jaybhay uses for the Bombay stock market, Gunduz studies the stock market of Turkey.

Gunduz [18] applies the mining of unstructured text data, combined with the numerical information of the prices, to the stock market in Istanbul, Turkey. He successfully predicts, 68% of the time, the direction that the Borsa Istanbul 100 Index (BIST100) prices move on opening. He [19] is using two inputs to his model: (1) text data from news articles released and (2) the price the day before. What he did was new, because Turkish news articles – Turkish language text mining – had not been used before, and yet they influence the Turkish stock market. “Factors like the political situation, economic conditions, companies future targets, investors' expectations, global stock exchanges and psychology of investors influence the stock market behavior and make it hard to predict the future stock price or directions.” [18]
Gunduz pointed out that until recently, analysis methods to predict stock prices were of one of two types: (1) technical analysis, which uses historic stock prices, and (2) fundamentals analysis, which uses economic data such as demand for a company's products, inflation, interest rates, the unemployment rate, and trading volume. [19] He developed a third way: to include news articles in the data that is used to influence the outputs.
The news articles of text came from three sources: (1) the “official news” source from the Turkish Public Disclosure Platform (http://www.kap.gov.tr), which has digitally signed documents such as financial statements and revenue reports; the other two are internet news from finance web portals, (2) Mynet Finans (http://finans.mynet.com) and (3) Bigpara (http://www.bigpara.com).
The text data from these three sites was preprocessed by following these steps: (1) stop word removal (taking out of the text words like a, of, as, the); (2) stemming (shortening words to their stem; for example, the Turkish root word for “go” is “git”, so inflected forms such as gitti, gittim, gidecek, and giden were all shortened to git), done using the Zemberek Turkish Language Processing Framework, a program written in Java in 2007; (3) filtering out words that occurred fewer than 1000 times in the document collection. (4) This resulted in a Bag of Words (BOW) of 2,888 unique words, each of which was then rated as having a positive (value of 1) or negative (value of 0) impact on the stock price.
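The bag-of-words step can be sketched as follows; the toy Turkish headlines, the placeholder where Zemberek stemming would run, and the lowered frequency threshold are assumptions made only so the fragment is self-contained.

# Sketch of the bag-of-words step Gunduz describes [18] (not his pipeline):
# count stemmed tokens over the whole collection, drop rare terms, and build
# per-document presence vectors. The threshold is lowered for this toy corpus;
# the paper uses 1000 occurrences and Zemberek for Turkish stemming.
from collections import Counter
import numpy as np

def stem(token):
    return token          # placeholder where Zemberek stemming would go

docs = ["borsa endeks yukseldi", "endeks dustu haber borsa",
        "haber borsa endeks yatay", "borsa haber"]          # assumed samples
tokenized = [[stem(t) for t in d.split()] for d in docs]

MIN_COUNT = 3             # stands in for the paper's 1000-occurrence cutoff
counts = Counter(t for doc in tokenized for t in doc)
vocab = sorted(t for t, c in counts.items() if c >= MIN_COUNT)
index = {t: i for i, t in enumerate(vocab)}

# One 0/1 bag-of-words row per news day.
bow = np.zeros((len(docs), len(vocab)), dtype=int)
for row, doc in enumerate(tokenized):
    for t in doc:
        if t in index:
            bow[row, index[t]] = 1

print(vocab)   # e.g. ['borsa', 'endeks', 'haber']
print(bow)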
Then all of the text information was grouped by news day and associated with the movement of the stock price from the close of one day to the opening of the next day.

Gunduz divided all of his data into two parts: 18 months of data for the training set, and 6 months of data for the test set.
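The important detail is that the split is chronological rather than random, so the model never trains on information from after the period it is tested on; a sketch of such a split, with assumed column names and synthetic values, is:

# Sketch of a chronological split like the one Gunduz uses [18]: the first 18
# months train, the last 6 months test. Column names here are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
days = pd.date_range("2013-01-01", periods=24 * 21, freq="B")  # ~24 months
data = pd.DataFrame({
    "date": days,
    "news_score": rng.normal(size=len(days)),
    "direction": rng.integers(0, 2, size=len(days)),  # close-to-open move
}).sort_values("date")

cutoff = data["date"].min() + pd.DateOffset(months=18)
train = data[data["date"] < cutoff]
test = data[data["date"] >= cutoff]
print(len(train), "training days,", len(test), "test days")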
Gunduz found that the Official News information did not work well, because Turkish traders buy or sell on the same day that the Official News is published, so it did not help his two-trading-day model. However, the internet news was of significant value in predicting the direction of the stock price the following day.

For future work, Gunduz thinks that using the prices of other world stock markets, such as the US, German, and Japanese markets, could improve performance, as could text mining that used English words as well as Turkish words.
FUTURE WORK AND CONCLUSIONS

Two conclusions come from this survey, along with one clear direction for future work. First, text mining the news and combining that factor with the numerical data from the stock market to predict the market will spread to other countries and other markets beyond the English and Turkish languages, and linguistics will become a hot research area as more sophisticated methods of text mining are discovered. This will result in robot-assisted management of portfolios in all world stock markets.

The second conclusion is that the advances in stock market algorithms and information processing speeds will lead to advances in other scientific research areas, as some of those advances find useful purposes in new applications far different from the original.

The new direction for future work will be to combine the unstructured text data in Bloomberg News reporting with the stock prices of the New York Stock Exchange to develop a combination predictive model, furthering the work done on the Bombay and Turkish stock markets.
IV. REFERENCES
[1] White House, "Big Data: Seizing Opportunities, Preserving Values," 1 May 2014. [Online]. Available: https://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf. [Accessed 20 October 2015].

[2] J. Bughin, M. Chui, and J. Manyika, "Clouds, Big Data, and Smart Assets: Ten Tech-Enabled Business Trends to Watch," McKinsey Quarterly, August 2010. [Online]. Available: http://www.mckinsey.com/insights/high_tech_telecoms_internet/clouds_big_data_and_smart_assets_ten_tech-enabled_business_trends_to_watch. [Accessed 20 October 2015].

[3] Federal Reserve, "Guidance on Model Risk Management," Supervision and Regulation (SR) 11-7 of the Federal Reserve System, 4 April 2011. [Online]. Available: http://www.federalreserve.gov/bankinforeg/srletters/sr1107.htm. [Accessed 1 November 2015].

[4] Subhabrata Choudhury, Subhajyoti Ghosh, Arnab Bhattacharya, Kiran Jude Fernandes, and Manoj Kumar Tiwari, "A real time clustering and SVM based price-volatility prediction for optimal trading strategy," Journal of Neurocomputing, vol. May, pp. 419-426, 2014.

[5] J. Vesanto and E. Alhoniemi, "Clustering of the self-organizing map," IEEE Transactions of Neural Networks, vol. 11, no. 3, pp. 586-600, 2000.

[6] V. N. Vapnik, The Nature of Statistical Learning Theory, 2nd Ed., New York: Springer, 1995.

[7] Ash Booth, Enrico Gerding, and Frank McGroaty, "Automated trading with performance weighted random forests and seasonality," Expert Systems With Applications, An International Journal, vol. June, pp. 3651-3661, 2014.

[8] Michel Ballings, Dirk Van den Poel, Nathalie Hespeels, and Ruben Gryp, "Evaluating multiple classifiers for stock price direction prediction," Journal of Expert Systems with Applications, vol. November, pp. 7046-7056, 2015.

[9] Hsuan-Ming Feng and Hsiang-Chai Chou, "Evolutional RBFNs prediction systems generation in the application of financial time series data," Journal of Expert Systems with Applications, vol. 38, no. 7, pp. 8285-8292, 2011.

[10] Gavor Tatai, Laszlo Gulyas, Laszlo Laufer, and Marton Ivanyi, "vBroker: agents teaching stock," Proceedings of 5th International Working Conference of Intelligent Virtual Agents (IVA), vol. September, p. 503, 2005.

[11] Todd Feldman and Daniel Friedman, "Human and Artificial Agents in a Crash Prone Financial Market," Journal of Computational Economics, vol. 36, no. 3, pp. 201-229, 2010.

[12] Jameela Al-Jaroodi, Nader Mohamed, and K. Al-Nuaimi, "An efficient algorithm for temporal financial information monitoring," Middleware Technologies Lab, Bahrain, UAE, 2013.

[13] C. Augusto Casas, "Parallelization of artificial neural network training algorithms: a financial forecasting application," IEEE Conference on Computational Intelligence for Financial Engineering and Economics, pp. 1-6, 2012.

[14] IBM, "What is Big Data," IBM, 2015. [Online]. Available: http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html. [Accessed 5 November 2015].

[15] Xindong Wu, Xingquan Zhu, Gong-Qing Wu, and Wei Ding, "Data mining with big data," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97-107, 2014.

[16] J. Bollen, H. Mao, and X. Zeng, "Twitter Mood Predicts the Stock Market," Journal of Computational Science, vol. 2, pp. 1-8, 2011.

[17] Kranti M. Jaybhay, Rajesh V. Argiddi, and S. S. Apte, "Stock market prediction model by combining numeric and news textual mining," International Journal of Computer Applications, vol. 57, no. 19, 2012.

[18] Hakan Gunduz and Zehra Cataltepe, "Borsa Istanbul (BIST) daily prediction using financial news and balanced feature selection," International Journal of Expert Systems with Applications, vol. 42, no. 22, pp. 9001-9011, 2015.

[19] G. Gidofalvi and E. Elkan, "Using news articles to predict stock price movements," Department of Computer Science and Engineering, University of California, San Diego, 2001.
