Stock Prediction With Machine Learning - v3.5
Abstract

During the last few months, there has been increased interest in the stock market due to the
Covid pandemic. The new-found leisure time has driven many people to buy and sell stocks
without any knowledge of the matter at hand. The number of sign-ups on investing and trading apps
has risen drastically since last year. It is natural to think that the field of stock market
prediction has grown accordingly. However, only two main approaches have been taken: one
focusing on day trading and using technical analysis of the markets to predict the immediate value,
and the other treating stocks as long-term investments and using fundamental analysis to
predict the future value of the stock in the long run.
Following the outbreak of the Coronavirus, we have seen an increasing gap between the economy
and the stock market that stems from the instability of the current times. This may worsen the
predictions produced by fundamental analysis until normality is restored, because the macroeconomic
and microeconomic factors that are usually key in long-term predictions are not affecting the
stock market in the same way. In contrast, technical analysis can predict short-term stock
values, although it is difficult to stretch the length of time over which it can predict. How
long into the future can technical analysis predict? Will the prediction be accurate? Which
algorithm will give us the best prediction with technical analysis? These were the main
questions asked at the beginning of the project.
The usual prediction horizon with technical analysis ranges from hours to a month or two. However,
the goal of this project is to compare different algorithms to obtain a predictor that is able to
tell whether a stock will go up or down in value in 3 to 5 months using technical analysis.
The project started with an introduction to Deep Learning and Machine Learning. Afterwards,
the process of obtaining an adequate amount of data to create a proper dataset began.
With enough data and the variables defined, the dataset was used to experiment with different
algorithms and configurations to obtain the various predictors. Once the predictors were
designed, their results were compared, and more data was added to the dataset to try to improve
the scores. Then a second round of prediction started, and the scores of the different algorithms
were compared again. After adding more stock values to the dataset, a mistake was found in some
of the rows: the decimal separator was misplaced because of a difference between the number
format of the API the data was obtained from and that of Excel. A third and final round of
prediction was done with the problem solved. Among the five algorithms tested, Random Forest
offered the best results, with an accuracy of 71% on the last dataset.
Resum

During the last few months there has been an increase in interest in the stock market because of
the Covid pandemic. The new-found free time has led many people to buy and sell stocks without
sufficient knowledge. The number of users of investing and trading apps has risen drastically
since last year. It is natural to think that the field of stock market prediction has grown in
the same way. Even so, only two kinds of approaches have been taken. The first focuses on daily
trading, using technical analysis of the market to predict the immediate value of stocks, and
the other focuses on stocks as long-term investments, using fundamental analysis to predict
their future value in the long run.

After the outbreak of the Coronavirus, the gap between the stock market and the economy has kept
growing, and it comes from the instability of the current times. This fact may worsen the
predictions made with fundamental analysis until normality is regained, since the macroeconomic
and microeconomic factors that are usually key in long-term predictions do not affect the stock
market in the same way. In contrast, technical analysis can predict stock values in the short
term, although it is hard to extend the length of time over which we can predict. How far into
the future can we predict with technical analysis? Will the prediction be accurate? Which
algorithm will help us obtain the best prediction with technical analysis? These were the
questions posed at the start of the project.

The usual prediction time with technical analysis can range from hours to a month or two. The
aim of this project is to compare different algorithms to obtain a predictor that can tell
whether the value of a stock will go up or down within a range of about 3 to 5 months using
technical analysis.
Resumen

During the last few months there has been an increase in interest in the stock market due to the
Covid pandemic. The newly found free time has led many people to buy and sell stocks without
sufficient knowledge of the subject. The number of users of investing and trading apps has grown
drastically since last year. It is natural to think that the field of stock market prediction
has grown accordingly. Despite this, only two kinds of approaches have been taken. The first
focuses on daily trading, using technical analysis of the market to predict the immediate value
of stocks; the other focuses on stocks as a long-term investment, using fundamental analysis to
predict the future values of the stocks.

After the Coronavirus outbreak, a gap has been seen between the stock market and the economy; it
has kept growing and stems from the instability of the times we are currently living in. This
fact may worsen the predictions made through fundamental analysis until we return to normality,
since the macroeconomic and microeconomic factors that are usually key in long-term predictions
do not affect the stock market in the same way. In contrast, technical analysis can predict the
value of stocks in the short term, even though extending the amount of time that can be
predicted is complex. How far into the future can we predict with technical analysis? Will the
prediction be accurate? Which algorithm will allow us to obtain the best prediction with
technical analysis? These were the questions posed at the start of the project.

The usual prediction time with technical analysis can vary from hours to a month or two. The aim
of the project is to compare different algorithms to obtain a predictor that can tell whether
the value of a stock will rise or fall within a range of about 3 to 5 months using technical
analysis.
Acknowledgements

I would first like to thank my thesis supervisor at Ernst & Young, Ana Jimenez Castellanos,
who guided me and gave me recommendations throughout the whole project while giving me the space
to grow as an engineer, and who allowed me to work on this incredible project that pushed me to
improve my skills in Machine Learning applied to economics.

I would also like to thank Prof. Climent Nadeu from the department of Signal Theory and
Communications at UPC, Barcelona. He gave me key pointers for acquiring knowledge prior to the
beginning of this project that were helpful during the whole execution of the thesis.

I cannot forget my friends and college classmates, who have taught me many life lessons and have
made me who I am today. In particular I would like to thank Luis Ramón Rodríguez Javier, who
helped keep me motivated during the project and taught me the power of perseverance.

Finally, I have to express my eternal gratitude to my parents for teaching me so many valuable
lessons, for listening to the progress of this project even when they did not understand a word
I said, and for supporting me in every step I take.
Contents
Abstract 1
Resum 2
Resumen 3
Acknowledgements 4
List of Figures 7
List of Tables 8
1 Introduction 9
2 Objectives 10
2.1 Achieve a 3 to 5 months prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Figure out the best algorithm to make the prediction . . . . . . . . . . . . . . . . . . 10
4 Methodology 21
4.1 Machine Learning and Finance Research . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Creation of the dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Testing the algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3.1 Testing with Cross Validation (CV) . . . . . . . . . . . . . . . . . . . . . . . 24
4.3.2 Testing with Cross Validation and Shuffle Split . . . . . . . . . . . . . . . . . 24
4.3.3 Testing with Principal Component Analysis . . . . . . . . . . . . . . . . . . . 24
5 Results 24
5.1 First dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 Second dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3 Third dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.4 Analysis of the results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6 Economic and Environmental Impact 31
6.1 Economical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.2 Environmental Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
References 34
List of Figures
1 Candle graph of the Apple stock (AAPL) with some technical indicators. Image
from: Plus 500 trading platform[11] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Candle graphic of the Tesla stock (TSLA) showcasing a bullish market with the three
EMAs: blue(6), red(70), white(200). Image from: Plus 500 trading platform[11] . . . 12
3 Candle graphic of the Gilead stock (GILD) showcasing a bear market with the three
EMAs: blue(6), red(70), white(200). Image from: Plus 500 trading platform[11] . . . 12
4 Candle graphic of the Ford stock (F) showcasing the selling point (red dotted line)
and the buying point (white dotted line) using the MACD indicator. Image from:
Plus 500 trading platform[11] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5 Perceptron schema. Image from: DeepAI webpage[5] . . . . . . . . . . . . . . . . . . 16
6 K-NN example with k=3 and k=7. Image from: Data Camp webpage[4] . . . . . . . 17
7 3-category SVM examples using Linear kernel, RBF kernel and Polynomial kernel
using the Iris dataset. Image from: Scikit-learn webpage [14] . . . . . . . . . 19
8 Cross Validation example with k = 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
9 Shuffle Split example with 4 iterations. Image from: Scikit-learn webpage[15] . . . . 20
10 Architecture implemented in the project . . . . . . . . . . . . . . . . . . . . . . . . . 22
11 Dataset Creation Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
12 SVM’s time comparison between models and datasets . . . . . . . . . . . . . . . . . 28
13 Random Forest’s time comparison between models and datasets . . . . . . . . . . . . 28
14 Random Forest’s time comparison between models and datasets . . . . . . . . . . . . 29
15 One configuration of each algorithm with the lowest time and best accuracy. The
MLP and KNN values are difficult to see in the graph as they are too similar to the
ones obtained with Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
16 Accuracy results for the KNN algorithm when modifying the number of neighbors . 30
17 Accuracy results for the Random Forest algorithm when modifying the number of
trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
18 Cost calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
19 Gantt diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
List of Tables
1 Number of rows and stocks for each dataset . . . . . . . . . . . . . . . . . . . . . . . 23
2 Results for the first dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Results for the second dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Results for the third dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5 Best results for each algorithm and the iteration of the dataset . . . . . . . . . . . . 29
6 MLP with CV results when increasing the number of layers . . . . . . . . . . . . . . 31
1 Introduction
In this fast-paced world we live in, it is impossible to be an expert in everything; that is why
we use technology to help us with many day-to-day activities. Furthermore, we want technology to
improve our quality of life in many ways. Technology has given regular people access to the stock
market, bringing new opportunities to many. However, improving people's quality of life this way
is seemingly impossible without the proper knowledge. Artificial Intelligence has entered the
game to level the playing field. Through the use of technology and AI in the stock market,
everyone may be able to buy and sell stocks with a certain probability of success.
Two main approaches have been taken in the field of stock prediction. The first approach is
mostly focused on stocks as long-term investments through fundamental analysis1 . With this kind
of study, the stock value will ideally trend toward the predicted intrinsic value. This approach
does not work appropriately for short- or mid-term predictions because it does not indicate a
stock's movement. On the other hand, an increasing trend in the trading community has been
predicting with technical analysis2 . Focusing our attention on the stock market's direction may
help us predict in more volatile scenarios, although the window of time we will be able to
predict will be shorter.
This project will be using technical analysis and has two main goals. The first one is to find
the best algorithm to predict whether a stock will go up or down in value. The second one is
to try to stretch the prediction window to a range of 3 to 5 months.
The main milestones of this project are obtaining the data, storing and processing it to create
a dataset, using different algorithms to predict whether a stock will go up or down in value,
and finally checking the results to see which algorithm best predicted the trends of the many
stocks.
This thesis is structured as follows: it states the context of the project, the objectives, and
the methodology; then it presents the results, the environmental and economic impact, and the
project's conclusions.
During the thesis's execution, some deviations appeared, mainly because of the difference in the
formatting of the data between the API where it was collected and Excel. A significant
constraint found during the first part of the project, data collection, was the API's limit on
the number of queries per day and per second.
The whole process followed during the thesis is represented in the Gantt diagram in Next Steps,
in figure 19.
1 Fundamental analysis is a method of assessing the intrinsic value of a security by analyzing various macroeconomic
and microeconomic factors. The ultimate goal of fundamental analysis is to quantify the intrinsic value of a security.[8]
2 Technical analysis is a method used to predict the probable future price movement of a security – such as a stock
Figure 1: Candle graph of the Apple stock (AAPL) with some technical indicators. Image from:
Plus 500 trading platform[11]
2 Objectives
In this undergraduate thesis, there are two main objectives. One of them is to create a predictor
that is able to know whether a stock will go up or down in value in a medium time range (3 to
5 months) using technical analysis. The second one is to figure out which algorithm will help us
better predict the stock’s movement.
However, not many people have the knowledge or the time to do either of those things.
This project aims to stretch this constraint to a more extended time period, reducing the number
of movements3 the investor will have to make and removing the need for constant monitoring of
the stock market.
Logistic Regression
Random Forest
To figure out which of the previous algorithms is best suited, we will compare their accuracy,
the deviation of the accuracy, and the time it takes to complete the training with each
algorithm.
3.1 Indicators
The indicators that were calculated and added to the dataset were the following:
The exponential moving average tracks the stock value and is a type of weighted moving average
that gives more importance to the most recent data. The algorithm used to calculate the
different EMAs in the project is the following:
import numpy as np

def calculate_exponential_moving_average(self, values):
    # Exponentially increasing weights, normalised to sum to 1.
    weights = np.exp(np.linspace(-1., 0., self.mean_values))
    weights /= weights.sum()
    ema = np.convolve(values, weights)[:len(values)]
    # The first mean_values entries have no full window; pad them with the
    # first fully computed value.
    ema[:self.mean_values] = ema[self.mean_values]
    return ema
Here self.mean_values is the number of days used in the moving average and values is the array
from which the EMA is calculated.
In the dataset, three EMAs were used with mean values of 6, 70, and 200. This method is
explained in the book Ganar en la bolsa es posible by Josef Ajram[1]. The idea behind it is to
have a clear trigger to buy or sell.
If the stock value is above the 6-day EMA, the 6-day EMA is above the 70-day EMA, and that one
is above the 200-day EMA, then the market has a higher probability of continuing upwards.
Figure 2: Candle graphic of the Tesla stock (TSLA) showcasing a bullish market with the three
EMAs: blue(6), red(70), white(200). Image from: Plus 500 trading platform[11]
If the stock value is below the 6-day EMA, the 6-day EMA is below the 70-day EMA, and that one
is below the 200-day EMA, then we are in a bear market situation and it might be a good time to
sell, but never to buy.
Figure 3: Candle graphic of the Gilead stock (GILD) showcasing a bear market with the three
EMAs: blue(6), red(70), white(200). Image from: Plus 500 trading platform[11]
The 200-day EMA represents the mean if we look back one year (long term), the 70-day EMA
represents the mean over a medium amount of time, and the 6-day EMA represents the mean over a
short amount of time. With these three indicators, we can see the trends in the long, medium,
and short term. Furthermore, with the logic explained previously, we can use these indicators as
a trigger to know whether we have to buy, sell, or hold the position.
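The triple-EMA trigger described above can be sketched as follows. This is a minimal
illustration rather than the thesis code: the ema helper reproduces the weighting of the EMA
snippet shown earlier, and the function name ema_trigger is hypothetical.

```python
import numpy as np

def ema(values, span):
    # Same weighting scheme as the thesis EMA snippet shown earlier.
    weights = np.exp(np.linspace(-1.0, 0.0, span))
    weights /= weights.sum()
    out = np.convolve(values, weights)[:len(values)]
    out[:span] = out[span]
    return out

def ema_trigger(prices):
    # Return "buy", "sell" or "hold" from the 6/70/200-day EMA ordering
    # described above. `prices` must be longer than 200 entries.
    price = prices[-1]
    e6, e70, e200 = (ema(prices, s)[-1] for s in (6, 70, 200))
    if price > e6 > e70 > e200:      # bullish alignment
        return "buy"
    if price < e6 < e70 < e200:      # bearish alignment
        return "sell"
    return "hold"
```

A steadily rising price series aligns the three EMAs below the price and triggers a buy; a
steadily falling one triggers a sell.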
MACD
Signal
Histogram
The MACD function is the difference between a 12-period EMA of the closing values and a
26-period EMA of the closing values. These two indicators are called the fast EMA and the slow
EMA respectively.
The Signal function is a nine-day EMA of the MACD, and it is used as a trigger signal. When the
Signal function crosses the MACD in an upwards direction, it indicates a change to a bear market
trend: it is a selling call. When the Signal function crosses the MACD in a downwards direction,
it suggests a shift to a bullish tendency: it is a buying call.
To help visualize the previously mentioned outcomes, the Histogram function is usually
represented. It is the difference between the MACD and Signal functions. If the histogram is
zero, it indicates a change in the market's trend, so we have to check the value of the
histogram before it reaches 0. If the histogram value is positive before reaching zero, it
indicates a change to a bear market trend; if it is negative before reaching zero, it signals a
bullish market.
Figure 4: Candle graphic of the Ford stock (F) showcasing the selling point (red dotted line) and
the buying point (white dotted line) using the MACD indicator. Image from: Plus 500 trading
platform[11]
To calculate the previous functions the following algorithm was used:
def calculate_macd(self, data, name):
    info = []
    for d in data:
        info.append(d[name])
    slow_m_pred = mean_predictor(26)
    fast_m_pred = mean_predictor(12)
    nine_m_pred = mean_predictor(9)
    slow = slow_m_pred.calculate_exponential_moving_average(info)
    fast = fast_m_pred.calculate_exponential_moving_average(info)

    macd = fast - slow
    signal = nine_m_pred.calculate_exponential_moving_average(macd)
    hist = macd - signal
    time = []
    for da in data:
        time.append(da["datetime"])
    return time, macd, hist, signal
The data parameter is an array of candle-like objects5 . The name parameter is the value used to
calculate the MACD (open, close, high, low). In this project the MACD function was calculated
with the closing values of each stock.
Logistic Regression
Multi Layer Perceptron (MLP)
K-Nearest Neighbors (KNN)
Random Forest
Support Vector Machine (SVM)
3.2.1 Logistic Regression

When we think of a classification between two outcomes, the first algorithm that comes to mind
is Binary Logistic Regression. Naturally, this is the first algorithm tried in this project,
because its output can be interpreted as the probability of a stock going up.
Generally, to train a Binary Logistic Regression predictor we will have to follow the next
steps[10]:
1. Create a weight matrix (W) and multiply it by the input variables (X), X being a matrix with
m rows and n features:

a = w0 + w1 ∗ x1 + w2 ∗ x2 + ... + wn ∗ xn

5 A candle-like object is a dictionary with the open, high, low, close, and volume values.
2. Use the sigmoid6 function to transform all real numbers into the (0, 1) interval:

ŷ = σ(a) = 1/(1 + e^(−a))

3. Update each weight with gradient descent, where α is the learning rate and dwj the gradient
of the loss with respect to wj:

wj = wj − (α ∗ dwj)
Solver: SAGA
Penalty: L2
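As a sketch of how such a classifier might be set up with the parameters above (solver SAGA,
penalty L2) in scikit-learn, using synthetic data in place of the thesis dataset; the feature
construction and max_iter value are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the thesis dataset: 8 indicator-like features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # 1 = "goes up", 0 = "goes down"

X_scaled = StandardScaler().fit_transform(X)

# Solver and penalty as listed above; max_iter raised so SAGA converges.
clf = LogisticRegression(solver="saga", penalty="l2", max_iter=1000)
clf.fit(X_scaled, y)

# The model outputs the probability of the stock going up.
proba_up = clf.predict_proba(X_scaled[:1])[0, 1]
```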
3.2.2 Multi Layer Perceptron (MLP)

To explain the process followed by a multi layer perceptron, we first need to understand what a
single perceptron does. The perceptron computes a weighted sum of all of the inputs given to
create an output. At the end of the perceptron, an activation function is applied.
6 Sigmoid function: y = 1/(1 + e^(−x))
Figure 5: Perceptron schema. Image from: DeepAI webpage[5]
Linear
Sigmoid
Tanh
ReLU
Leaky ReLU
Softmax
The MLP has been implemented using the scikit-learn module
sklearn.neural_network.MLPClassifier[18].
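As a sketch of how such a classifier is configured in scikit-learn, on synthetic data; the
hidden-layer size, activation, and iteration count below are illustrative choices, not the
thesis's parameters:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the thesis dataset.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# hidden_layer_sizes and activation are illustrative, not the thesis values.
mlp = MLPClassifier(hidden_layer_sizes=(50,), activation="relu",
                    max_iter=500, random_state=0)
mlp.fit(X, y)
```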
3.2.3 K-Nearest Neighbors (KNN)
The K-NN algorithm is a classification algorithm that uses the distance between the data point
we want to predict and the data points we already have in order to predict the category of the
new point.
The prediction can be done with the Euclidean distance or any other metric we choose. Another
important variable is k, the number of data points used to predict the class of the new entry.
The k cannot be too high when we have a small amount of training data, but it cannot be too low
either: while a low k may have a lower bias, it may introduce a higher variance.
Figure 6: K-NN example with k=3 and k=7. Image from: Data Camp webpage[4]
In this project the K-NN has been implemented with the scikit-learn module:
sklearn.neighbors.KNeighborsClassifier [16] and the following parameters:
Number of neighbors: 200
Weights: distance
Number of jobs: 6
Leaf size: 30
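A minimal sketch of that configuration in scikit-learn, with synthetic data standing in for the
thesis dataset:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the thesis dataset.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Parameters as listed above.
knn = KNeighborsClassifier(n_neighbors=200, weights="distance",
                           n_jobs=6, leaf_size=30)
knn.fit(X, y)
```

With weights="distance", each training point is its own zero-distance neighbor, so the training
score is essentially perfect; held-out scores are what the thesis compares.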
3.2.4 Random Forest
The Random Forest algorithm consists of a large ensemble of decision trees. It works on the
premise that a large number of relatively uncorrelated models (trees) operating as a committee
will outperform any of the individual constituent models. Thus, low correlation between the
decision trees is crucial.
A decision tree can categorize a data entry into one of a set of given classes, in our case 1 or
0. An ensemble of decision trees collects the result of each tree and determines the category by
comparing the number of predicted 1s with the number of predicted 0s.
To ensure low correlation between models, a random forest uses two methods:
Bagging: each tree uses the same amount of data (N) as the original dataset, randomly sampled
from it with replacement.
Feature randomness: each tree in a random forest selects a random subset of the original
features.
In this project the Random Forest algorithm has been implemented using the scikit-learn module
sklearn.ensemble.RandomForestClassifier[19] and the following parameters:
Criterion: entropy
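A minimal sketch of that configuration in scikit-learn, on synthetic data; criterion="entropy"
is the stated parameter, while n_estimators is an illustrative assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the thesis dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# criterion="entropy" as stated above; bagging (bootstrap=True) and feature
# randomness (max_features) are scikit-learn defaults. n_estimators is illustrative.
rf = RandomForestClassifier(criterion="entropy", n_estimators=100, random_state=0)
rf.fit(X, y)
```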
3.2.5 Support Vector Machine (SVM)

The Support Vector Machine algorithm tries to find a hyperplane in an N-dimensional space that
separates the data points into different categories. In this project, the hyperplane divides the
N-dimensional space into the two classes, 0 and 1. Depending on the problem we face, a Linear
SVM, an RBF (Gaussian) SVM, or a Polynomial SVM may be used. These variants differ in the kernel
they use:
Figure 7: 3-category SVM examples using Linear kernel, RBF kernel and Polynomial kernel on the
Iris dataset. Image from: Scikit-learn webpage [14]
The kernel we use in the algorithm determines the shape of the function that classifies the data
points.
In this project the Support Vector Machine algorithm has been implemented using the scikit-learn
module sklearn.svm.SVC[20], with the RBF (Gaussian) kernel and the default parameters.
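A minimal sketch of that setup, with synthetic data and Standard Scaling standing in for the
thesis preprocessing:

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the thesis dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X = StandardScaler().fit_transform(X)

# RBF (Gaussian) kernel with otherwise default parameters, as stated above.
svm = SVC(kernel="rbf")
svm.fit(X, y)
```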
Figure 8: Cross Validation example with k = 4
Figure 9: Shuffle Split example with 4 iterations. Image from: Scikit-learn webpage[15]
4 Methodology
The process followed in this project has been: researching the important topics (Machine
Learning algorithms and finance), implementing the data retrieval process to create the dataset,
and writing the code to test each algorithm. For the last two parts of the project there have
been three trials to test the algorithms with different datasets. Finally, all results were
stored in an Excel spreadsheet to facilitate the comparison amongst all algorithms.
Figure 10: Architecture implemented in the project
Another script was written to download all the information from the database for each stock's
value, from the IPO10 to the present date. Each stock value was compared with the values from 3
to 5 months later. If the maximum value of that future window was bigger than the current value,
the row was labeled as a buying opportunity (1), and if the minimum value of the window was
smaller than the current value, the row was labeled as a selling opportunity (0). Finally, one
more module was programmed to calculate the different indicators used in the predictions, which
were added to each row. All of the rows were written to .txt, .xlsx, and .csv files to be used
as a dataset.
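The row-labeling rule just described can be sketched as follows; label_row is a hypothetical
helper (the real code lives in the cited repository [3]), and the precedence between the two
conditions is an assumption, since the text does not say which check wins when both hold:

```python
def label_row(current_value, future_values):
    # future_values: stock values observed 3 to 5 months after the row's date.
    # Assumption: the buy check is applied first when both conditions hold.
    if max(future_values) > current_value:
        return 1          # buying opportunity
    if min(future_values) < current_value:
        return 0          # selling opportunity
    return None           # neither condition holds
```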
For all the versions of the dataset, it was composed11 of the following columns:
Stock name
Initial Date
Initial Value
6 day EMA
70 day EMA
Histogram function
MACD function
Signal function
Result

10 Initial Public Offering
11 Columns with the final value, EMAs and MACD were added to the dataset in case they were
needed at some stages of the project, although they were never used
Three versions of the dataset were created, each with an increasing number of rows. Thanks to
the automation of the data collection, the diversity of the data increased with each version:
All the code used to collect the data and create the dataset is in a GitHub repository that can
be accessed using the link in the reference section [3].
As we can see, the number of different stocks in the last dataset increases drastically, thus
raising the diversity.
The schema of the creation of the dataset as described in this section is as follows:
4.3 Testing the algorithms
Once the first version of the dataset was obtained, the creation of the algorithms began. Using
Jupyter Notebook, three different templates were created. Each template has different parameters
and is used with every algorithm. The process of calculating the scores for each algorithm was
done three times, once for each dataset.
4.3.1 Testing with Cross Validation (CV)

The template used in the testing with cross validation, shown in Annex 1 (7.2), lets us observe
the results with Standard Scaling for each algorithm using Cross Validation. It is important to
note that a timer is set at the beginning and end of the cross validation to measure the time it
takes to run the whole algorithm. For all algorithms the dataset is split into 10 subsets to do
the cross validation.
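A hedged sketch of such a template (synthetic data stands in for the thesis dataset and the
classifier choice is illustrative; the scaling, 10 folds, and timer match the description above):

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the thesis dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Standard Scaling inside the pipeline, so each fold is scaled on its own training split.
model = make_pipeline(StandardScaler(), LogisticRegression())

start = time.time()                            # timer around the cross validation
scores = cross_val_score(model, X, y, cv=10)   # 10 subsets, as in the thesis
elapsed = time.time() - start

mean_accuracy = scores.mean()
accuracy_deviation = scores.std()
```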
4.3.2 Testing with Cross Validation and Shuffle Split

In this template the Shuffle Split function is added before the Cross Validation to generate a
random split of the dataset; in this case the dataset will not be evenly separated. The same
timer is set around the Cross Validation so we know how long it takes to obtain the results. The
Shuffle Split is done with 10 splits and a test size of 0.2. The code used in Jupyter Notebook
can be seen in Annex 1 (7.2).
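A sketch of that variant (again with synthetic data and an illustrative classifier; the 10
splits and 0.2 test size are the stated values):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the thesis dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())

# 10 random splits, each holding out 20% of the rows, as stated in the text.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
```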
4.3.3 Testing with Principal Component Analysis

The last template uses Principal Component Analysis combined with Standard Scaling before using
Cross Validation with Shuffle Split to obtain the accuracy scores. A timer was added to measure
the time needed to obtain said scores. The template can be seen in Annex 1 (7.2).
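A sketch of that last template (the number of retained components and the classifier are
illustrative assumptions; scaling before PCA, and CV with Shuffle Split, are as described):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the thesis dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Standard Scaling, then PCA, then the classifier, evaluated with CV + Shuffle Split.
model = make_pipeline(StandardScaler(), PCA(n_components=5), LogisticRegression())
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
```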
5 Results
The goal of this section is to show the results of each iteration done with the different
datasets, as well as presenting the results in a user-friendly manner.
It is important to define two terms that will be repeated throughout the section. The mean
accuracy is the mean of all the accuracies obtained from the Cross Validation for each
algorithm; each accuracy is the proportion of correctly predicted cases over the total number of
cases. The deviation of the accuracy is the standard deviation of the accuracies obtained during
the Cross Validation.
5.1 First dataset
The first dataset was composed of 92120 rows and included 170 well-known different stocks.
Algorithm                                          Training time   Mean accuracy   Accuracy deviation
Logistic Regression with CV                        0:00:03         65%             0%
Logistic Regression with CV & ShuffleSplit         0:00:03         65%             0%
Logistic Regression with CV, ShuffleSplit & PCA    0:00:05         52%             2%
Random Forest with CV                              0:01:39         66%             1%
Random Forest with CV & ShuffleSplit               0:09:18         68%             1%
Random Forest with CV, ShuffleSplit & PCA          0:12:50         61%             0%
MLP with CV                                        0:00:34         65%             0%
MLP with CV & ShuffleSplit                         0:00:35         65%             0%
MLP with CV, ShuffleSplit & PCA                    0:01:04         64%             0%
K-NN with CV                                       0:00:26         65%             0%
K-NN with CV & ShuffleSplit                        0:00:23         65%             0%
K-NN with CV, ShuffleSplit & PCA                   0:00:22         65%             0%
Gaussian SVM with CV                               1:59:43         65%             0%
Gaussian SVM with CV & ShuffleSplit                3:55:05         65%             0%
Gaussian SVM with CV, ShuffleSplit & PCA           0:40:06         65%             0%
5.2 Second dataset
The second dataset was composed of 101509 rows and included 190 well-known different stocks.
Algorithm                                          Training time   Mean accuracy   Accuracy deviation
Logistic Regression with CV                        0:00:05         64%             1%
Logistic Regression with CV & ShuffleSplit         0:00:04         64%             1%
Logistic Regression with CV, ShuffleSplit & PCA    0:00:07         53%             1%
Random Forest with CV                              0:06:07         66%             1%
Random Forest with CV & ShuffleSplit               0:10:09         68%             0%
Random Forest with CV, ShuffleSplit & PCA          0:14:01         61%             1%
MLP with CV                                        0:01:41         64%             0%
MLP with CV & ShuffleSplit                         0:01:20         64%             0%
MLP with CV, ShuffleSplit & PCA                    0:01:19         65%             0%
K-NN with CV                                       0:00:16         64%             0%
K-NN with CV & ShuffleSplit                        0:00:25         64%             1%
K-NN with CV, ShuffleSplit & PCA                   0:00:29         64%             0%
Gaussian SVM with CV                               2:20:49         65%             0%
Gaussian SVM with CV & ShuffleSplit                4:10:38         65%             0%
Gaussian SVM with CV, ShuffleSplit & PCA           0:48:56         64%             1%
5.3 Third dataset
The third dataset was composed of 720,668 rows and included 1,867 different stocks. The results
for the SVM algorithm could not be obtained due to the large number of samples and the processing
power required.
Algorithm name                                  | Time spent training | Mean accuracy | Deviation of the accuracy
------------------------------------------------|---------------------|---------------|--------------------------
Logistic Regression with CV                     | 0:00:57             | 63%           | 0%
Logistic Regression with CV & ShuffleSplit      | 0:00:50             | 63%           | 0%
Logistic Regression with CV, ShuffleSplit & PCA | 0:00:38             | 50%           | 0%
Random Forest with CV                           | 1:11:40             | 70%           | 2%
Random Forest with CV & ShuffleSplit            | 2:52:58             | 71%           | 0%
Random Forest with CV, ShuffleSplit & PCA       | 1:49:26             | 66%           | 0%
MLP with CV                                     | 0:03:30             | 63%           | 0%
MLP with CV & ShuffleSplit                      | 0:09:13             | 64%           | 3%
MLP with CV, ShuffleSplit & PCA                 | 0:08:17             | 63%           | 0%
K-NN with CV                                    | 0:05:55             | 63%           | 0%
K-NN with CV & ShuffleSplit                     | 0:05:16             | 64%           | 1%
K-NN with CV, ShuffleSplit & PCA                | 0:07:31             | 63%           | 0%
Gaussian SVM with CV                            | N/A                 | N/A           | N/A
Gaussian SVM with CV & ShuffleSplit             | N/A                 | N/A           | N/A
Gaussian SVM with CV, ShuffleSplit & PCA        | N/A                 | N/A           | N/A
5.4 Analysis of the results
As we can see in the tables above, in most cases the use of Principal Component Analysis has
reduced the training time, especially when applied before the Support Vector Machine algorithm,
as seen in figure 12.
PCA works best on datasets with a large number of features and samples: when we increase the
number of samples, the time spent doing Cross Validation decreases, as seen in figure 13.
Figure 13: Random Forest’s time comparison between models and datasets
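The timing effect can be illustrated with a small sketch that cross-validates an RBF SVM with and without a PCA step. The data, sizes, and component count below are illustrative stand-ins, not the project's actual configuration:

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: many features, moderate sample count
X, y = make_classification(n_samples=2000, n_features=40, random_state=0)

for name, model in [
    ("Gaussian SVM", make_pipeline(StandardScaler(), SVC(kernel="rbf"))),
    ("PCA + Gaussian SVM", make_pipeline(StandardScaler(),
                                         PCA(n_components=10),
                                         SVC(kernel="rbf"))),
]:
    start = perf_counter()
    scores = cross_val_score(model, X, y, cv=5)  # cross-validate and time it
    elapsed = perf_counter() - start
    print("%s: %.2fs, mean accuracy %.2f" % (name, elapsed, scores.mean()))
```

Projecting onto 10 components shrinks the kernel computations, which is why the gap widens as the dataset grows.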
When we take the accuracy into account in these graphs, we can see that, in the case of the
Random Forest algorithm, the slope is steeper when using Principal Component Analysis than
when using only ShuffleSplit. This means that the more we increase our dataset, the more we will
need PCA to avoid high training times. However, for the amount of data contained in the first
three versions of the dataset, the algorithm obtains a much more accurate prediction without PCA.
Figure 14: Random Forest’s time comparison between models and datasets
Table 5 shows the best configuration for each algorithm. As we can see, most of the best
results come from the first version of the dataset.
Algorithm name                           | Version of the dataset | Time spent training | Mean accuracy | Deviation of the accuracy
-----------------------------------------|------------------------|---------------------|---------------|--------------------------
Random Forest with CV & ShuffleSplit     | 3                      | 2:52:58             | 71%           | 0%
MLP with CV                              | 1                      | 0:00:34             | 65%           | 0%
K-NN with CV, ShuffleSplit & PCA         | 1                      | 0:00:22             | 65%           | 0%
Gaussian SVM with CV, ShuffleSplit & PCA | 1                      | 0:40:06             | 65%           | 0%
Logistic Regression with CV              | 1                      | 0:00:03             | 65%           | 0%
Table 5: Best results for each algorithm and the iteration of the dataset
Observing figure 15, we can see the evolution of the previous best configurations across the
different datasets. For all algorithms except Random Forest, the mean accuracy decreases when
the amount of data increases. This may be due to overfitting on the first dataset: the small amount
of data and the low diversification of stocks may have caused the models to overfit, resulting in a
better accuracy (65%) on the first dataset. As the diversification of the dataset (the number of
different stocks) grew, the mean accuracy decreased.
Figure 15: One configuration of each algorithm with the lowest time and best accuracy. The MLP
and K-NN values are difficult to see in the graph as they are too similar to the ones obtained with
Logistic Regression
Another possible explanation for the evolution in figure 15 is that the drop in accuracy may be
caused by keeping the algorithms' parameters constant. While the amount and diversity of the
data increased, the parameters of all the algorithms remained the same, so each algorithm's ability
to predict decreased.
To determine which of the two causes applies to this project, a small experiment was done: some
of the algorithms' parameters were modified to try to better fit the last dataset.
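The neighbor-count experiment could be sketched like this; synthetic data stands in for the real features, and the k values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real stock features
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Re-run the same cross-validation while only varying n_neighbors
results = {}
for k in [5, 25, 50, 100]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    results[k] = scores.mean()
    print("k=%d: mean accuracy %.2f" % (k, results[k]))
```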
Figure 16: Accuracy results for the KNN algorithm when modifying the number of neighbors
As we can see in the previous figure (16), increasing the number of neighbors does not increase
the accuracy.
The same happens when we increase the number of layers in the MLP and change the number
of perceptrons in the layers, as we see in the table below:
When increasing the number of trees in the Random Forest algorithm, the maximum accuracy
still remains the one obtained with the first parameters:
Figure 17: Accuracy results for the Random Forest algorithm when modifying the number of trees
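The tree-count experiment can be sketched in the same spirit. Again, the data and the n_estimators values below are illustrative stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the tree counts are illustrative values
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Vary only the number of trees while keeping everything else fixed
tree_results = {}
for n in [100, 300, 500]:
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)
    tree_results[n] = scores.mean()
    print("%d trees: mean accuracy %.2f" % (n, tree_results[n]))
```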
Adding up the usual materials such as paper, pens, a chair and a desk, together with the average
salary of a junior engineer from UPC and the utilities, the result is a total cost of 9,936.00 euros.
7.1 Conclusions
After analyzing the results in section 5, we can observe that Random Forest has outperformed the
other algorithms, especially when the diversity of data in the dataset increases. The more data,
and the more diversified the data, the better the Random Forest algorithm performed12, while the
remaining algorithms' accuracy diminished as the dataset grew.
12 The accuracy increased and the deviation decreased
During the analysis of the results (section 5.4) we observed that, for most algorithms, the testing
accuracy did not improve when using more training data. Two explanations were discussed: either
the decrease in accuracy was caused by the algorithms' parameters not being scaled up when more
training data was used, or it was caused by overfitting on the first dataset due to its poor diversity
of stocks.
Regarding the first explanation, we have seen that the accuracy did not improve when increasing
the number of parameters. This means that the models did not lose performance because of the
choice of the number of parameters.
In figure 15 we can see that the only algorithm whose accuracy has not decreased is Random
Forest. This is due to the nature of the algorithm: as explained in section 3.2.4, Random Forest
is an ensemble, meaning it is composed of many decision trees; in this project the number of trees
is 300. An ensemble works under the assumption that many uncorrelated errors average out to
zero. Since each tree learns from a different subset of our data, the trees are fairly uncorrelated
with one another, making the Random Forest algorithm more robust to overfitting than the other
algorithms. All of this would explain why every algorithm's accuracy decreased except for Random
Forest's.
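This averaging argument can be illustrated by comparing a single decision tree against a 300-tree forest on synthetic stand-in data (the dataset below is illustrative, not the project's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the stock features
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

# A single deep tree tends to overfit its training sample...
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

# ...while 300 trees, each fit on a bootstrap sample, make fairly
# uncorrelated errors that tend to average out in the majority vote
forest = RandomForestClassifier(n_estimators=300, bootstrap=True, random_state=0)
forest_scores = cross_val_score(forest, X, y, cv=5)

print("Single tree: %.2f, 300-tree forest: %.2f"
      % (tree_scores.mean(), forest_scores.mean()))
```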
To put the results in context, the article "Predicting the daily return direction of the stock
market using hybrid machine learning algorithms" [6] discusses the results of machine learning
projects that aim to predict the movement of a stock for the next day. In the article the authors
mention that, for direction forecasting13, they obtain a lower accuracy (around 60%). Furthermore,
the aim of this project was to help regular people enter the world of stock markets, and regular
users, whose knowledge may be null, would otherwise have only a 50/50 chance of being right.
In conclusion, Random Forest would be the chosen algorithm as the most suited to predicting
whether to buy or sell a stock over a medium time frame14: even though its training time is far
larger than the others', it achieves a higher accuracy (71%) and is more robust to overfitting.
Using other algorithms to predict the stock's future value over the same time range.
References
[1] Josef Ajram. Ganar en la bolsa es posible. Plataforma Editorial, 2011.
[2] Alpha Vantage API. https://www.alphavantage.co/.
[6] Xiao Zhong & David Enke. Predicting the daily return direction of the stock market using
hybrid machine learning algorithms. 2019.
[7] Kirill Eremenko. Machine Learning A-Z: Hands-On Python and R In Data Science. https:
//www.udemy.com/course/machinelearning/learn/lecture/19678456#overview.
[18] Sklearn Multi Layer Perceptron module. https://scikit-learn.org/stable/modules/
generated/sklearn.neural_network.MLPClassifier.html.
[19] Sklearn Random Forest module. https://scikit-learn.org/stable/modules/generated/
sklearn.ensemble.RandomForestClassifier.html.
Annex 1: Jupyter templates
[ ]: #Imports
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
[ ]: #Import dataset
dataset = pd.read_csv('path/to/dataset')
X = dataset.iloc[:, 2:-9].values
y = dataset.iloc[:, -1].values
[ ]: #Scaling Data
sc = StandardScaler()
X = sc.fit_transform(X)
[ ]: #KFold split:
start = datetime.now()
scores = cross_val_score(classifier, X, y, cv=10)
finish = datetime.now()
t_diff = relativedelta(finish, start)
print('{h}h {m}m {s}s'.format(h=t_diff.hours, m=t_diff.minutes, s=t_diff.seconds))
[ ]: import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
[ ]: #Import dataset
dataset = pd.read_csv('path/to/dataset')
X = dataset.iloc[:, 2:-9].values
y = dataset.iloc[:, -1].values
[ ]: #Scaling Data
sc = StandardScaler()
X = sc.fit_transform(X)
[ ]: #Creating Model
[ ]: #KFold split:
start = datetime.now()
scores = cross_val_score(classifier, X, y, cv=10)
finish = datetime.now()
t_diff = relativedelta(finish, start)
print('{h}h {m}m {s}s'.format(h=t_diff.hours, m=t_diff.minutes, s=t_diff.seconds))
[ ]: #Print Accuracy
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
[ ]: import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.preprocessing import StandardScaler
[ ]: #Import dataset
dataset = pd.read_csv('path/to/dataset')
X = dataset.iloc[:, 2:-9].values
y = dataset.iloc[:, -1].values
[ ]: #Scaling data
sc = StandardScaler()
X = sc.fit_transform(X)
[ ]: #Create Model
[ ]: cv = ShuffleSplit(n_splits=10, test_size=0.2)
[ ]: #Cross Validation:
start = datetime.now()
scores = cross_val_score(classifier, X, y, cv=cv)
finish = datetime.now()
t_diff = relativedelta(finish, start)
print('{h}h {m}m {s}s'.format(h=t_diff.hours, m=t_diff.minutes, s=t_diff.seconds))
[ ]: #Print Accuracy
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Annex 2: Gantt Diagram
To help with the planning of the whole project, a Gantt diagram was made during the first two
weeks to divide the thesis into smaller work packages, each containing several tasks.
During the last work packages, some difficulties arose as the last dataset was formed. A change
in the configuration of the computer modified the way decimal numbers were interpreted, from a
dot to a comma. The database information remained the same, so some error would have been
introduced into the models when the third training began. After analyzing the dataset, the error
was spotted in some of the rows. The number of rows affected by this problem was small compared
to the size of the dataset, so the decision was taken to remove these rows.