
Report Group 05

Project Report

Bitcoin Price Prediction


Applied Statistics and Experimental Design

Author: Group 05
Tran Ngoc Khanh - 20200326
Nguyen The Minh Duc - 20204904
Nguyen Ngoc Toan - 20200544
Pham Thanh Nam - 20204921
Nguyen Hoang Tien - 20204927
Advisor: Assoc. Prof. Nguyen Linh Giang
Academic year: 2022

Abstract
The year 2008 marked the birth of a completely new concept in the financial world: cryptocurrency. A cryptocurrency is a new type of digital currency in which all transactions are verified and maintained by a decentralized system rather than by a centralized authority, as in the current financial system. This development goes hand in hand with the blockchain concept and has huge potential for growth in the future financial system. For this reason, the cryptocurrency market was born, where blockchain projects can find investment, users can put their trust in a decentralized digital currency to perform transactions, and investors can put in their money to make a profit. Price prediction systems, which have been studied for a long time, can be applied by investors and investment funds in this new market. Although the task becomes harder in a market this volatile and erratic, cryptocurrency price prediction remains a very active and interesting topic for researchers. Our target is to apply several statistical, machine learning and deep learning models to the price movement prediction problem. Our study compares all these models and selects the best one for the final training strategy and evaluation. The best model is LightGBM: with this model, we obtained an average accuracy on the test set of 55.262% and an average AUC score on the test set of 0.56747. The corresponding values on the validation set are 54.658% and 0.55127.

Acknowledgment
We would like to express our sincere gratitude to Assoc. Prof. Nguyen Linh Giang for giving us the opportunity to work on this wonderful project on Bitcoin price prediction. Through our brief meetings with him, we received valuable words of advice that motivated us, and for that we thank him. We are also extremely grateful to Assoc. Prof. Than Quang Khoat, who taught us Machine Learning in semester 20212; the knowledge he shared is indispensable for completing this project. Preparing this project in collaboration with our teachers was a refreshing experience.

Contents

1 Introduction 4

2 Methodology 5

3 Data Preparation 5
3.1 Data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.2 Exploratory data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.2.1 Label analysis (Close price) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.2.2 Feature analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 Feature engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.4 Feature scaling and normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.4.1 Window normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.4.2 Logarithmic scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

4 Models 9
4.1 Theoretical background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.1.1 Statistical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.1.2 Machine Learning models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.3 Ensemble methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1.4 Deep learning models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2 Practical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Evaluation and Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4 Final Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

5 Conclusion 23

6 Scope for further research 23

A RAW DATA 26

B FEATURES AND INDICATORS 27


List of Figures
1 Global crypto market cap [7] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Top 10 Crypto Fund Managers by Managed AUM (Swfinstitute) [8] . . . . . 4
3 Project pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4 Bitcoin Close Price . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
5 Bitcoin close price 1st difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
6 Close price trend seasonal component . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
7 Close price residual component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
8 First 100 close price seasonal component datapoints . . . . . . . . . . . . . . . . . . . . 7
9 Residual histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
10 Residual divided by trend histogram - 1 hour . . . . . . . . . . . . . . . . . . . . . . . 7
11 Residual divided by trend histogram - 1 day . . . . . . . . . . . . . . . . . . . . . . . . 8
12 Features correlation matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
13 Close price before difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
14 The close price after first difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
15 ACF and PACF plot of each difference of Close Price . . . . . . . . . . . . . . . . . . . 10
16 Returned series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
17 Summary of ARIMA model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
18 300 random data points of the training set results . . . . . . . . . . . . . . . . 10
19 GARCH model summary of volatility . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
20 Confidence interval on the whole test set . . . . . . . . . . . . . . . . . . . . . . 11
21 300 random continuous data points with confidence interval . . . . . . . . . . . . 11
22 Odds and logit function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
23 The distance between the expectations and the sum of the variances affect the discrim-
inant degree of the data [13] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
24 Error with training rounds [17] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
25 Amount of say . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
26 Error with training rounds [19] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
27 The repeating module in a standard RNN contains a single layer [24] . . . . . . . . . . 17
28 The repeating module in an LSTM contains four interacting layers [24] . . . . . . . . . 17
29 LSTM notations [24] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
30 LSTM cell state line [24] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
31 LSTM gate [24] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
32 LSTM forget gate [24] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
33 LSTM input gate and new cell state [24] . . . . . . . . . . . . . . . . . . . . . . . . . . 18
34 Update cell state [24] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
35 LSTM output gate [24] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
36 GAN architecture [26] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
37 Generator architecture [26] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
38 Features importance barplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
39 AUC-Iterations plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
40 Train set score distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
41 Validation set score distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
42 Test set score distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
43 Validation set calibration curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
44 Test set calibration curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
45 Test score histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
46 Test set calibration curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
47 Validation score histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
48 Validation set calibration curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
49 Raw data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
50 Features and indicators - 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
51 Features and indicators - 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

1. Introduction

The concept of cryptocurrency began in 2008, when Satoshi Nakamoto created the first cryptocurrency, named Bitcoin, based on a peer-to-peer version of electronic cash which allows online payments to be sent directly from one party to another without going through a financial institution. He provided a solution to the double-spending problem of this technology [1] and ushered in a new era of money technology [2].

Bitcoin would have been worth nothing if no one believed in it, until someone realized how valuable this technology could be when one person paid 10,000 Bitcoins for two pizzas delivered by Papa John's [3].

At this time, there were only two ways to obtain bitcoin: by mining it yourself or arranging a peer-to-peer (P2P) trade via a forum like Bitcointalk, which Nakamoto founded to host Bitcoin-related discussions [4]. When Bitcoin proved its usefulness, the need to exchange USD for Bitcoin increased, and the first Bitcoin exchange was born: Mt. Gox. This was a bitcoin exchange based in Shibuya, Tokyo, Japan, and it was handling over 70% of all Bitcoin transactions worldwide by early 2014 [5]. Unfortunately, this exchange filed for bankruptcy after a series of hacks in 2014 that saw 850,000 Bitcoins disappear, nearly 2% of the total amount of bitcoin that will ever exist [6].

Figure 1: Global crypto market cap [7]

Despite the lack of security while trading and using cryptocurrencies, this technology is still evolving. Its total market capitalization even peaked at around $2.6 trillion in October 2021, and along with it, a series of investment funds jumped into this cryptocurrency market. The top 10 largest crypto funds have total assets under management (AUM) worth around $14.6 billion.

Figure 2: Top 10 Crypto Fund Managers by Managed AUM (Swfinstitute) [8]

The price movement prediction and price forecasting problems have been studied for a long time, since the beginning of quantitative analysis in financial markets, which reaches back to the early 20th century with the publication of the groundbreaking paper "The Theory of Speculation" by Louis Bachelier in 1900 [9]. Quantitative analysts are the people who use data to make investment decisions, and they have used Machine Learning techniques in their decision-making processes. This quantitative investment strategy has been developed primarily for stock markets.

The birth of the cryptocurrency market created something new and interesting: the high volatility of cryptocurrency prices and their potentially big profits have led many researchers to use machine learning and deep learning algorithms to predict their movement. They have used multiple models and strategies to complete this task.

In this project, we try to predict the Bitcoin price movement in the next hour. This report contains 6 main sections. Section 1 is the Introduction. Section 2 is Methodology, where we explain the flow of this report and further describe the methods used to complete this project. Section 3 is Data Preparation, the


most important part of this report, where we describe the data collection task, the exploratory data analysis (EDA) task, the feature engineering task, and finally the feature scaling and normalization task. Section 4 is Models, where we explain the theoretical background of the different models and the practical results. We conclude the report in Section 5 and list some further research directions in Section 6.

2. Methodology

For this project, we have followed this pipeline:

Figure 3: Project pipeline

Before building any statistical or machine learning models, the first task is data preparation. This task contains 4 subtasks: data collection, exploratory data analysis, feature engineering, and feature scaling and normalization.

Next, after preparing the data, we choose the models used to approach and solve the problem. We first analyze the theoretical background of each model, choose the baseline model, and then fit the models to the data and collect the results for a more in-depth evaluation. The data collected in the previous phase is split without shuffling into 3 parts: train set (70%), validation set (15%), and test set (15%). The in-depth evaluation mentioned above is based on the test set. The best model is chosen to be retrained and re-evaluated using time-based cross-validation.

Finally, we compare our results with previous solutions from other researchers and draw the conclusion.

3. Data Preparation

3.1. Data collection

We have collected the Bitcoin data from 2 sources: the API provided by Binance [10] and data crawled from CryptoQuant [11]. Our data is based on the cryptocurrency pair BTC-USDT, where USDT is a stable coin whose value is worth $1. The time period of the data that we studied is from 15:00:00 (December 19, 2017) to 10:00:00 (June 22, 2022).

The initial data can be divided into 2 groups: classic price-volume data and on-chain data. Classic price-volume data is the data about the price and the volume on the Binance exchange, while on-chain data is third-party data which is inferred by using blockchain data analytics and centralized exchange data analytics. Blockchain data analytics involves the block discovery of BTC to derive some metrics and indicators, while centralized exchange data analytics involves the use of data provided by centralized exchanges, such as the total amount of BTC transferred to the centralized exchange, the total amount of BTC transferred from the centralized exchange, and so on. Note that we use the concept of a centralized exchange to distinguish it from a decentralized exchange, which is a part of decentralized finance, a derived finance technology based on blockchain. Blockchain data analytics and centralized exchange data analytics can be used at the same time; for example, a data provider can use the blockchain wallet of a centralized exchange to build a new indicator about the amount of BTC flowing into that exchange.

For more details, check Appendix A.

3.2. Exploratory data analysis

3.2.1. Label analysis (Close price)

3.2.1.1. Overall analysis (Close price)

The Bitcoin price is highly volatile. After the last 3 months of 2017, when the price increased more than 6 times from $3,000 to a peak of $20,000, BTC witnessed a decline from the peak to $6,000 in the first 2 months of 2018. During the period from May to November of 2018, BTC always held the support of $6,000, but on November 14, BTC broke this hard support,


causing its price to halve and fluctuate between $3,000 and $4,000 until April 2019. From April 4 to the end of June of that year, the price of BTC increased sharply by about 3 times and then continued to decline to the old price 9 months later. BTC then moved sideways through October 2020 in the $8,000 to $12,000 price range. The first 10 months of 2020 were also the time when big investors and whales gathered to prepare for a big BTC wave in 2021. One year later, BTC peaked at nearly $70,000, and it has since corrected to a third of its peak price, with a current value of just $21,000.

Figure 4: Bitcoin Close Price

Figure 5 shows the difference between the current day's closing price and the previous day's price. It can be seen that this spread has become more volatile in the recent period, from 2021 onwards.

Figure 5: Bitcoin close price 1st difference

3.2.1.2. Decomposition of the close price time series data

*Theoretical background

First of all, we need to understand the 3 patterns that make up time series data, which are "trend", "seasonal" and "residual".

Trend (T_t)
A trend pattern shows the tendency of the data to increase or decrease during a period of time. A time series can go up or down instantly, but the trend must be upward, downward or stable over a period of time.

Seasonal (S_t)
A seasonal pattern represents the seasonal factors that affect the time series data. The seasonality is fixed and has a known frequency.

Residual (R_t)
A residual pattern is what remains when we extract the trend and seasonal components from the original time series data. This residual may be called the remainder by other researchers, but in this report we use the name residual, following the library convention.

Since time series data contains a variety of patterns, it is often helpful to split a time series into 3 components: a trend component, a seasonal component, and a residual component. In our project, we have decided to decompose the original time series data into 2 components: a trend seasonal component, which combines the trend component with the seasonal component, and a residual component.

There are 2 types of decomposition:

Additive decomposition: y_t = T_t + S_t + R_t

Multiplicative decomposition: y_t = T_t ∗ S_t ∗ R_t

*Analysis
We decompose the close price data into 3 components by using the statsmodels library [12]:


Figure 6: Close price trend seasonal component

Figure 7: Close price residual component

For this time series data, the seasonal component is too small, as depicted in Figure 8, so we include it in the trend component to form the trend seasonal component.

Figure 8: First 100 close price seasonal component datapoints

We decided to do more analysis on the residual component to create features (we do not do it on the trend component because this component is highly correlated with the MA features). The following is the residual distribution:

Figure 9: Residual histogram

We can see that the distribution is skewed towards the mean (near 0) because, during the 2 years before 2021, the close price was small, so the magnitude of the residual component was also small. Therefore, in this period of time, statistical models and machine learning models can work well. But after that, there exists the problem of drift, which will greatly affect the models: the price fluctuates sharply and so does the residual. For this reason, we take the ratio of the residual to the trend to remove the price unit and make it fair for every timestamp.

Figure 10: Residual divided by trend histogram - 1 hour

We have removed the effect of price, but the distribution is still relatively skewed towards the mean. This shows that at a high frequency (1h) the rate of variation will be small.

To reinforce the argument, if we continue to do the same at a lower frequency (1 day), the proportion


of small fluctuations is significantly reduced compared to the high frequency (1h).

Figure 11: Residual divided by trend histogram - 1 day

We will use these analyses to do the same on the trading volume. Then we will create more features based on the insights found above.

3.2.2. Feature analysis

Figure 12: Features correlation matrix

As we can see in the correlation matrix, the different price features are almost completely correlated with each other, which is understandable since we are observing over a relatively high-frequency period.

We can also see the correlation of miner reserve USD and exchange reserve USD with the BTC price, which is understandable since each of them is calculated as a BTC amount multiplied by the BTC price, so this correlation with the price is natural.

3.3. Feature engineering

In fact, investors often use technical indicators to make decisions about when and where to buy and sell. That is the reason we generate new features based on the definitions of these technical indicators. We also include the trend seasonal component, the residual component, and the indicators derived from them in our features.

For more details about the features, check Appendix B.

3.4. Feature scaling and normalization

All the features can be divided into 2 groups: continuous features (which are not scaled) and scaled features (which are already scaled by their definition; for example, the value of the RSI indicator is between 0 and 100).

As described in section 3.2.2, some features can drift, which is the reason we should scale the features carefully. For the others, the scaling and normalization methods may differ depending on the algorithm we use.

In this section, we describe only the scaling methods; for details about which scaling method is applied to which feature, check Appendix B.

3.4.1. Window normalization

This is a method that we use to normalize the data and remove drift:

data_norm[i] = (data[i] − data[i−p : i].mean()) / data[i−p : i].std()

where data_norm is the data after being normalized, data[i] is the original data at row i, and data[i−p : i] contains all the rows from the last p periods up to the row before row i.

3.4.2. Logarithmic scale

The logarithmic scale method is applied to all the features having positive values: they are represented by their logarithms instead of their actual values. This scaling method is used to deal with the problem of skewness towards large values, i.e., cases in which one or a few points are much larger than the rest of the data.
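To make the two scaling methods concrete, here is a small pandas sketch of rolling-window normalization and logarithmic scaling; the window length p and the column names are illustrative assumptions rather than the exact values used in our experiments:

import numpy as np
import pandas as pd

def window_normalize(series: pd.Series, p: int = 30) -> pd.Series:
    # Standardize each row with the mean/std of the previous p rows only,
    # so no future information leaks into the normalized value
    rolling = series.shift(1).rolling(window=p)
    return (series - rolling.mean()) / rolling.std()

def log_scale(series: pd.Series) -> pd.Series:
    # Only meaningful for strictly positive features; reduces right skew
    return np.log(series)

# Usage on assumed feature columns
# features["close_norm"] = window_normalize(features["close"], p=30)
# features["volume_log"] = log_scale(features["volume"])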


4. Models

In our work, we have applied several models based on statistics and machine learning to predict Bitcoin price movements and forecast Bitcoin prices. After applying all the models, we determine which one is the most accurate for our goals and choose appropriate parameters to get better performance from that model.

4.1. Theoretical background

4.1.1. Statistical models

Statistical modeling is the use of mathematical models and statistical assumptions to generate sample data and make predictions about the real world.

4.1.1.1. ARIMA model

ARIMA, short for 'Auto Regressive Integrated Moving Average', is a class of models that 'explains' a given time series based on its previous values, that is, its own lags and the lagged forecast errors, so that the resulting equation can be used to forecast future values.

There are three terms characterizing an ARIMA model:

• p: the order of the AR term.
• d: the number of differences required to make the time series stationary.
• q: the order of the MA term.

4.1.1.2. What do 'p', 'd' and 'q' mean?

First, we need to make the time series stationary in order to build an ARIMA model. This is because AR (auto-regressive) in ARIMA implies a linear regression model that uses its own lags as predictors, and we already know that linear regression models work well for independent and non-correlated predictors. Therefore, the value of d is the minimum number of differencing steps; if the time series is already stationary, then d = 0.

• p is the order of the AR (Auto-Regressive) term, which means that Y_t depends on its own lags:

  Y_t = α + β_1 Y_{t−1} + β_2 Y_{t−2} + β_3 Y_{t−3} + ... + β_p Y_{t−p} + ϵ_1

• q is the order of the MA (Moving-Average) term, which means that Y_t depends only on the lagged forecast errors:

  Y_t = c + θ_1 ϵ_{t−1} + θ_2 ϵ_{t−2} + ... + θ_q ϵ_{t−q}

• d is the minimal order of differencing required to get a near-stationary series which roams around a defined mean and whose ACF plot reaches zero fairly quickly. Differencing is expressed with the backshift operator B, defined by B Y_t = Y_{t−1}. Thus, a first-order difference is written as

  Y'_t = Y_t − Y_{t−1} = (1 − B) Y_t

  and, in general, the d-th order difference can be written as

  Y_t^(d) = (1 − B)^d Y_t

An ARIMA model is one where the time series was differenced at least once to make it stationary and where the AR and MA terms are combined. So the equation becomes:

Y'_t = c + ϕ_1 Y'_{t−1} + ϕ_2 Y'_{t−2} + ... + ϕ_p Y'_{t−p} + θ_1 ϵ_{t−1} + θ_2 ϵ_{t−2} + ... + θ_q ϵ_{t−q} + ϵ_t

4.1.1.3. Find the order of differencing (d) in the ARIMA model

Figure 13: Close price before difference

The purpose of differencing is to make the time series stationary. The right order of differencing is the minimum differencing required to get a near-stationary series which roams around a defined mean and whose ACF plot reaches zero fairly quickly.

Figure 14: The close price after first difference
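These differenced series can be obtained directly with pandas; a small sketch, assuming the close-price series is available as a column of a DataFrame:

import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

close = df["close"]                      # assumed close-price series

diff1 = close.diff().dropna()            # (1 - B) Y_t, the first difference
diff2 = close.diff().diff().dropna()     # (1 - B)^2 Y_t, the second difference

# The ACF/PACF of each differenced series (as in Figure 15) guide the choice of d, p and q
plot_acf(diff1, lags=40)
plot_pacf(diff1, lags=40)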


Figure 15: ACF and PACF plot of each difference of Close Price

We can see that the differenced close price is similar to white noise. We can check whether the series is stationary with the Augmented Dickey-Fuller Test (ADF Test).

Augmented Dickey-Fuller Test Results:
ADF Test Statistic      -1.507686e+01
P-Value                  8.565591e-28
Lags Used                5.000000e+00
Observations Used        1.639000e+03
Critical Value (1%)     -3.434346e+00
Critical Value (5%)     -2.863305e+00
Critical Value (10%)    -2.567710e+00

As we can see from the results of the ADF test on the first difference of the close price, the p-value is close to 0, so we can reject the null hypothesis and assume that the series is stationary. We will use the return measurement as the input for the model.

Return: the percent change in a stock price over a given amount of time.

Returns(t) = (Price(t) − Price(t−1)) / Price(t−1) × 100

Figure 16: Returned series

4.1.1.4. Find the order of the AR terms (p) in the ARIMA model

After the time series has been stationarized by differencing, the next step in fitting an ARIMA model is to determine whether AR or MA terms are needed to correct any autocorrelation that remains in the differenced series. Of course, you could just try some different combinations of terms and see what works best, but there is a more systematic way to do this. By looking at the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of the differenced series, you can tentatively identify the number of AR and MA terms that are needed.
Note: see Figure 15.

4.1.1.5. Find the order of the MA terms (q) in the ARIMA model

In order to find q, we can count the number of significant lags (outside the blue band) on the ACF graph.
Note: see Figure 15.

4.1.1.6. Visualization of ARIMA model results

After tuning and training, we have the summary of the ARIMA model.

Figure 17: Summary of ARIMA model

Figure 18: 300 random data points of the training set results

We can see that the predicted value is equal to the first lag of the true value.
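For reference, a fitting procedure along these lines can be sketched with statsmodels as follows; the order (1, 0, 1) is only a placeholder, since the actual order was chosen from the ACF/PACF plots and tuning:

from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

returns = close.pct_change().mul(100).dropna()   # percent returns, as defined above

adf_stat, p_value, *_ = adfuller(returns)
print(f"ADF statistic = {adf_stat:.4f}, p-value = {p_value:.2e}")

# (1, 0, 1) is a placeholder order; the real order comes from the ACF/PACF plots and tuning
model = ARIMA(returns, order=(1, 0, 1)).fit()
print(model.summary())

next_return = model.forecast(steps=1)            # one-step-ahead forecast of the next return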


4.1.1.7. GARCH Model

The GARCH (Generalized Autoregressive Conditional Heteroskedasticity) method provides a way to model a change in variance in a time series that is time dependent, such as increasing or decreasing volatility. If the change in variance can be correlated over time, then it can be modeled using an autoregressive process, such as ARCH.

ARCH models are a popular class of volatility models that use observed values of returns or residuals as volatility shocks. A basic GARCH model is specified as

r_t = μ + ϵ_t
ϵ_t = σ_t × e_t
σ_t² = α_0 + Σ_{i=1}^{q} α_i ϵ_{t−i}² + Σ_{j=1}^{p} β_j σ_{t−j}²

where e_t ∼ N(0, 1).

After tuning and training on the residual output of the ARIMA model, with the same process as for ARIMA, we get the summary of the GARCH model.

Figure 19: GARCH model summary of volatility

4.1.1.8. ARIMA_ARCH model

ARIMA and GARCH models are very often combined: an ARIMA model estimates the conditional mean, and subsequently a GARCH model estimates the conditional variance present in the residuals of the ARIMA estimation.

y_t = μ + Σ_{i=1}^{p} a_i y_{t−i} + Σ_{i=1}^{q} b_i ϵ_{t−i} + ϵ_t

where

ϵ_t = σ_t × z_t
σ_t² = ω + Σ_{i=1}^{p} α_i ϵ_{t−i}² + Σ_{i=1}^{q} β_i σ_{t−i}²

The ARIMA model is not capable of forecasting very far ahead: when the forecast horizon becomes very large, the forecast results will be almost constant. Moreover, as we have seen, the ARIMA model almost lags the actual price by one step, so we want to improve it.

We wanted a margin measure to make the ARIMA forecast more reliable by combining it with the ARCH model to get a more confident prediction interval. By using ARCH's forecast as the upper and lower bounds for ARIMA's forecast results, we get a confidence interval.

Figure 20: Confidence interval on the whole test set

For ease of observation, we randomly choose 300 continuous data points to visualize.

Figure 21: 300 random continuous data points with confidence interval

We compute the probability that the confidence interval bounds the actual value (the return value of the close price), and the result is 0.65329.

4.1.2. Machine Learning models

The first concept that should be made clear is the odds; it is defined as:

Odds(p) = p / (1 − p)


Odds is a function that varies between 0 and ∞. We know that the good situation occurs when odds > 1; that means bad odds correspond to the odds function varying between 0 and 1, while good odds correspond to values between 1 and ∞, and this is not symmetric. This asymmetry makes it difficult to compare the two situations, which is why we need a logarithmic scale.

Figure 22: Odds and logit function

The logarithmic scale of the original odds function is the logit function. This function makes the odds assessment symmetrical and fair. So we have:

logit(p) = logit(P(y = k|x)) = log(odds(p)) = log( P(y = k|x) / (1 − P(y = k|x)) )

where p = P(y = k|x) is the beneficial probability (when y = k) given x.

4.1.2.1. Logistic regression

Suppose that this logit function can be formed by a linear relation with our data x. So:

logit(p) = log( P(y = k|x) / (1 − P(y = k|x)) ) = A_0 + A^T x

The decision boundary is a hyperplane, which is the set of points x for which the log-odds are zero, defined by A_0 + A^T x = 0. There are 2 popular methods that result in linear logits: linear logistic regression, which is discussed in this section, and linear discriminant analysis, which is discussed in the next section.

Now we have:

P(y = k|x) = e^(A_0 + A^T x) / (1 + e^(A_0 + A^T x))

which can be seen as a sigmoid function of x. In order to find A_0 and A, we minimize a loss function, and the popular loss function for the classification problem is binary cross-entropy.

4.1.2.2. Linear Discriminant Analysis (LDA)

The approach of LDA is different from logistic regression: instead of assuming linear logits (which means that logit(P(y = k|x)) can be formed by a linear relation A_0 + A^T x), we assume that the class-conditional distribution of x given y = k is normal.

Figure 23: The distance between the expectations and the sum of the variances affect the discriminant degree of the data [13]

The problem of LDA is to find a projection onto an arbitrary axis through which we can determine the separation of the two classes. This method can be considered a dimension reduction method, where we reduce the original dimension to a lower dimension in which we can easily determine the separation mentioned above.

Assume that there are N data points x_1, x_2, ..., x_N which are divided into 2 classes: C_1 and C_2. The data projection onto a straight line can be described by a coefficient vector w, and the corresponding value of each new data point is given by: y_i = w^T x_i, 1 ≤ i ≤ N.

The expected vector of each class is:

m_k = (1 / N_k) Σ_{n ∈ C_k} x_n ,  k = 1, 2

According to Figure 23, the best solution is when m_1 and m_2 are as far apart as possible and s_1, s_2 are as small as possible. That is the reason LDA tries to solve:

max J(w) = (m_1 − m_2)² / (s_1² + s_2²)

where


m_1 − m_2 = (1/N_1) Σ_{i ∈ C_1} y_i − (1/N_2) Σ_{j ∈ C_2} y_j = w^T (m_1 − m_2)

and

s_k² = Σ_{n ∈ C_k} (y_n − m_k)² ,  k = 1, 2

For more details, check [13].

4.1.3. Ensemble methods

Ensemble methods are techniques that create multiple models and then combine them to produce improved results [14]. In this project, we have used Random Forest, AdaBoost and LightGBM.

4.1.3.1. Random Forest [15]

Decision trees are a popular method for various machine learning tasks. Tree learning "come[s] closest to meeting the requirements for serving as an off-the-shelf procedure for data mining", say Hastie et al., "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate".

In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance of the final model.

Forests are like a pulling together of decision tree efforts: they take the teamwork of many trees and thus improve on the performance of a single random tree. Though not quite the same, forests give an effect similar to k-fold cross-validation.

Bagging
The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x_1, x_2, ..., x_n with responses Y = y_1, y_2, ..., y_n, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:

Algorithm 1 Random Forest training algorithm
for b = 1, 2, ..., B do
    Sample, with replacement, n training examples from X, Y; call these X_b, Y_b;
    Train a classification or regression tree f_b on X_b, Y_b;
end

After training, predictions for unseen samples x' can be made by averaging the predictions from all the individual regression trees on x':

f̂ = (1/B) Σ_{b=1}^{B} f_b(x')

or by taking the majority vote in the case of classification trees.

This bootstrapping procedure leads to better model performance because it decreases the variance of the model without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.

Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on x':

σ = sqrt( Σ_{b=1}^{B} (f_b(x') − f̂)² / (B − 1) )

The number of samples/trees, B, is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. An optimal number of trees B can be found using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample x_i, using only the trees that did not have x_i in their


bootstrap sample. The training and test error tend to level off after some number of trees have been fit.

From bagging to random forests
The above procedure describes the original bagging algorithm for trees. Random forests also include another type of bagging scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho.

Typically, for a classification problem with p features, √p (rounded down) features are used in each split. For regression problems the inventors recommend p/3 (rounded down), with a minimum node size of 5, as the default. In practice the best values for these parameters will depend on the problem, and they should be treated as tuning parameters.

Why is Random Forest good? [16]
In the Decision Tree algorithm, if the depth of the tree is arbitrary, the tree will correctly classify all the data in the training set, leading to the overfitting problem. The Random Forest algorithm consists of many decision trees, and each decision tree has random elements:

• Randomize the data used to build the decision tree.
• Randomize the attributes used to build the decision tree.

Since each decision tree in the Random Forest algorithm uses neither all the training data nor all the attributes of the data to build the tree, each tree may make a bad prediction. Each decision tree is then not overfitting but can be underfitting; in other words, the model has high bias. However, the final result of the Random Forest algorithm is aggregated from many decision trees, so the information from the trees will complement each other, leading to a model with low bias and low variance, in other words a model with good predictive results.

The idea of aggregating decision trees in the Random Forest algorithm is similar to the idea of The Wisdom of Crowds proposed by James Surowiecki in 2004. The Wisdom of Crowds says that aggregating information from a group is usually better than information from an individual. The Random Forest algorithm likewise synthesizes information from a group of decision trees, and the results are better than those of the Decision Tree algorithm alone.

Figure 24: Error with training rounds [17]

4.1.3.2. AdaBoost

AdaBoost uses an iterative approach to learn from the mistakes of weak classifiers and turn them into strong ones. The algorithm builds a model and first gives equal weights to all the data points. The initial formula to calculate the sample weight is:

w(x_i, y_i) = 1/N ,  i = 1, 2, ..., N

Next, AdaBoost will create a decision stump having the lowest Gini index [18].

After that, AdaBoost will calculate the "Amount of Say" for this classifier in classifying the data points, using this formula:

Amount of Say = (1/2) log( (1 − TotalError) / TotalError )

where the total error is the summation of all the sample weights of the misclassified data points.
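A small numeric sketch of these quantities and of the weight update described below (natural logarithm assumed; this illustrates the formulas rather than the library implementation we actually used):

import numpy as np

def amount_of_say(total_error: float, eps: float = 1e-10) -> float:
    # alpha = 0.5 * log((1 - E) / E); eps guards against division by zero
    total_error = np.clip(total_error, eps, 1 - eps)
    return 0.5 * np.log((1 - total_error) / total_error)

# Example: a stump that misclassifies points carrying 30% of the total weight
alpha = amount_of_say(0.3)          # ~0.4236, positive because the error is below 0.5

# One round of the weight update (labels y and predictions pred in {-1, +1}):
# misclassified points get weight * e^{+alpha}, correct ones get weight * e^{-alpha}
def update_weights(w, y, pred, alpha):
    w_new = w * np.exp(-alpha * y * pred)
    return w_new / w_new.sum()      # re-normalize so the weights sum to 1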


Figure 25: Amount of say

Note that the total error will always be between 0 and 1, where 0 indicates a perfect stump and 1 indicates a horrible stump. From Figure 25 it can be seen that if the error rate is 0.5 (the classifier predicts half right and half wrong), the "amount of say" will be 0. If the error rate is small, alpha will be positive. If the error rate is large, alpha will be negative.

New sample weight = Old weight × e^(±α)   (e^(+α) for misclassified points, e^(−α) for correctly classified points)

The new sample weights are then normalized, and a new training round is performed.

In conclusion, the individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner.

AdaBoost often does not overfit in practice
When using AdaBoost, we observe it behaving like this:

Figure 26: Error with training rounds [19]

It can be seen that the test error keeps decreasing even after the training error has reached zero. Why do more weak learners and a more complex hypothesis not lead to overfitting?

This can be explained by Margin Theory, which states that AdaBoost keeps increasing the margins even after it has managed to classify all training examples correctly [20] [21].

4.1.3.3. LightGBM

First, we need to understand how the Gradient Boosting Decision Tree works.

For a given data set with n examples and m features

D = {(x_i, y_i)}   (|D| = n, x_i ∈ R^m, y_i ∈ R)

Starting with the predicted value obtained by the tree ensemble using K additive functions, we have:

ŷ_i = ϕ(x_i) = Σ_{k=1}^{K} f_k(x_i) ,   f_k ∈ F

where F is the space of regression trees, each f_k corresponds to an independent tree structure q and leaf weights w, and ŷ_i is the predicted value.

To learn the set of functions used in the model, we minimize the following regularized objective:

L(ϕ) = Σ_i l(ŷ_i, y_i) + Σ_k Ω(f_k)

where Ω(f) = γT + (1/2) λ ||w||²

The first term is the training loss and the second term, the regularization term, represents the complexity of all the trees. Here, T is the number of leaves in the tree and w is the leaf weight vector of each tree. γ and λ are the 2 pre-defined regularization parameters. Our objective is to minimize the loss function above.

Since the model is trained in an additive manner, we need to define the regularized objective at the t-th iteration, and we will minimize this objective. Formally, let ŷ_i^(t) be the prediction of the i-th instance at the t-th iteration; we will have:


L^(t) = Σ_{i=1}^{n} l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t)

A second-order approximation can be used to quickly optimize the objective:

L^(t) ≈ Σ_{i=1}^{n} [ l(y_i, ŷ_i^(t−1)) + g_i f_t(x_i) + (1/2) h_i f_t²(x_i) ] + Ω(f_t)

where g_i = ∂_{ŷ^(t−1)} l(y_i, ŷ^(t−1)) and h_i = ∂²_{ŷ^(t−1)} l(y_i, ŷ^(t−1)) are the first and second order gradient statistics of the loss function.

Here, l(y_i, ŷ_i^(t−1)) is the L^(t−1) part, and g_i f_t(x_i) + (1/2) h_i f_t²(x_i) + Ω(f_t) is L̃^(t), the additional loss from iteration t. Since L^(t−1) is a constant term, we remove it to obtain the simplified objective at step t.

Define I_j = {i | q(x_i) = j} as the instance set of leaf j. We have:

L̃^(t) = Σ_{i=1}^{n} [ g_i f_t(x_i) + (1/2) h_i f_t²(x_i) ] + γT + (1/2) λ Σ_{j=1}^{T} w_j²
      = Σ_{j=1}^{T} [ (Σ_{i∈I_j} g_i) w_j + (1/2) (Σ_{i∈I_j} h_i + λ) w_j² ] + γT

Now we can compute the optimal weight w_j* of leaf j by:

w_j* = − (Σ_{i∈I_j} g_i) / (Σ_{i∈I_j} h_i + λ)

And its corresponding optimal value is:

L̃^(t)(q) = − (1/2) Σ_{j=1}^{T} (Σ_{i∈I_j} g_i)² / (Σ_{i∈I_j} h_i + λ) + γT

The formula above is used to measure the quality of the tree structure q. And for evaluating the split candidates, we have the following formula:

L_split = (1/2) [ (Σ_{i∈I_L} g_i)² / (Σ_{i∈I_L} h_i + λ) + (Σ_{i∈I_R} g_i)² / (Σ_{i∈I_R} h_i + λ) − (Σ_{i∈I} g_i)² / (Σ_{i∈I} h_i + λ) ] − γ

where I_L, I_R, I are the instance sets of the left leaf, the right leaf, and the leaf before splitting, respectively.

That is the core idea of the XGBoost algorithm.

Algorithm 2 Exact Greedy Algorithm for Split Finding [22]
Input: I, instance set of current node
Input: d, feature dimension
gain ← 0
G ← Σ_{i∈I} g_i , H ← Σ_{i∈I} h_i
for k = 1 to m do
    G_L ← 0, H_L ← 0
    for j in sorted(I, by x_jk) do
        G_L ← G_L + g_j , H_L ← H_L + h_j
        G_R ← G − G_L , H_R ← H − H_L
        score ← max(score, G_L²/(H_L + λ) + G_R²/(H_R + λ) − G²/(H + λ))
    end
end
Output: Split with max score

Algorithm 3 Approximate Algorithm for Split Finding [22]
Input: I, instance set of current node
Input: d, feature dimension
for k = 1 to m do
    Propose S_k = {s_k1, s_k2, ..., s_kl} by percentiles on feature k.
    The proposal can be done per tree (global) or per split (local).
end
for k = 1 to m do
    G_kv ← Σ_{j ∈ {j | s_k,v ≥ x_jk > s_k,v−1}} g_j
    H_kv ← Σ_{j ∈ {j | s_k,v ≥ x_jk > s_k,v−1}} h_j
end
Follow the same step as in the previous algorithm to find the max score, only among the proposed splits.

That is the basics of the Gradient Boosting Decision Tree. Now we will take a deeper look into LightGBM. Indeed, LightGBM is a gradient boosting framework based on decision trees, designed to increase the efficiency of the model and reduce memory usage.

It uses two novel techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which address the limitations of the histogram-based algorithm that is primarily used in all Gradient Boosting Decision


Tree frameworks, as described by the authors in [23].

Gradient-based One-Side Sampling Technique
The authors notice that the gradient of each data instance in GBDT provides useful information for data sampling. That is, if an instance is associated with a small gradient, the training error for this instance is small and it is already well trained. A straightforward idea is to discard those data instances with small gradients. However, the data distribution would be changed by doing so, which would hurt the accuracy of the learned model. To avoid this problem, the authors propose a new method called Gradient-based One-Side Sampling (GOSS).

GOSS keeps all the instances with large gradients and performs random sampling on the instances with small gradients. In order to compensate for the influence on the data distribution, when computing the information gain, GOSS introduces a constant multiplier for the data instances with small gradients. Specifically, GOSS first sorts the data instances according to the absolute value of their gradients and selects the top a × 100% instances. Then it randomly samples b × 100% instances from the rest of the data. After that, GOSS amplifies the sampled data with small gradients by a constant (1 − a)/b when calculating the information gain. By doing so, we put more focus on the under-trained instances without changing the original data distribution by much.

Exclusive Feature Bundling Technique
High-dimensional data are usually very sparse, which gives us the possibility of designing a nearly lossless approach to reduce the number of features. Specifically, in a sparse feature space, many features are mutually exclusive, i.e., they never take nonzero values simultaneously. The exclusive features can be safely bundled into a single feature (called an Exclusive Feature Bundle). Hence, the complexity of histogram building changes from O(#data × #feature) to O(#data × #bundle), with #bundle << #feature, so the training speed of the framework is improved without hurting accuracy.

4.1.4. Deep learning models

4.1.4.1. LSTM [24]

Long Short Term Memory networks are a kind of Recurrent Neural Network, capable of learning long-term dependencies, and were introduced by Hochreiter and Schmidhuber (1997) [25]. This model is widely used in time series problems because of its strength.

Firstly, note that all recurrent neural networks have the form of a chain of repeating modules of neural network. In a standard RNN model, each repeating module has a very simple structure with a single tanh layer.

Figure 27: The repeating module in a standard RNN contains a single layer [24]

LSTM inherits this structure, but in a different way. Instead of having a single neural network layer, there are four layers that interact in a special way.

Figure 28: The repeating module in an LSTM contains four interacting layers [24]

The notations can be understood as illustrated below:

Figure 29: LSTM notations [24]

The key component of the LSTM is the cell state, denoted by C_t (the cell output value at the t-th unit of the LSTM). It is where the LSTM


remembers the information; it can remove or add information to this cell state at each unit, carefully regulated by structures called gates.

Figure 30: LSTM cell state line [24]

Gates are a way to decide whether information can pass through. They are composed of a sigmoid layer and a pointwise multiplication operation.

This sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through (from 0, which means "let nothing through", to 1, which means "let everything through").

An LSTM has three of these gates, to protect and control the cell state.

Figure 31: LSTM gate [24]

Steps of LSTM
The first step in an LSTM is to decide what information it is going to throw away from the cell state. This decision is made by a sigmoid layer called the "forget gate layer". It uses h_{t−1} and x_t (where h_{t−1} is the output of the memory cell at time t − 1 and x_t is the input vector of the memory cell at time t) to output a number between 0 and 1 for each number in the cell state C_{t−1}.

Figure 32: LSTM forget gate [24]

The next step is to decide what new information we are going to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values we will update. Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. The LSTM will then combine these two to create an update to the state.

Figure 33: LSTM input gate and new cell state [24]

Next, the LSTM will update the old cell state C_{t−1} into the new cell state C_t. It multiplies the old state by f_t, forgetting the things we decided to forget earlier, and then adds C̃_t (the new candidate value) × i_t (how much we want to update each candidate state value).

Figure 34: Update cell state [24]

Finally, the LSTM will decide the final output. First, a sigmoid layer will decide what parts of the cell state we are going to output. Second, a tanh layer is applied to the cell state C_t (to push the values to be between −1 and 1) and multiplied by the output of the sigmoid gate to get the final output value.

Figure 35: LSTM output gate [24]
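For our price-movement problem, an LSTM classifier over sliding windows of features can be sketched as follows in PyTorch; the layer sizes and the single-layer architecture are illustrative assumptions, not our exact configuration:

import torch
import torch.nn as nn

class PriceMovementLSTM(nn.Module):
    # Binary classifier over a sliding window of features (up = 1, down = 0)
    def __init__(self, n_features: int, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                            num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, window, n_features)
        out, _ = self.lstm(x)
        last = out[:, -1, :]              # hidden state at the last time step
        return torch.sigmoid(self.head(last)).squeeze(-1)

model = PriceMovementLSTM(n_features=32)          # 32 is a placeholder feature count
criterion = nn.BCELoss()                           # binary cross-entropy on up/down labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)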


4.1.4.2. GAN [26] [27]

The generative network generates candidates while the discriminative network evaluates them. The contest operates in terms of data distributions. Typically, the generative network learns to map from a latent space to a data distribution of interest, while the discriminative network distinguishes candidates produced by the generator from the true data distribution. The generative network's training objective is to increase the error rate of the discriminative network (i.e., to "fool" the discriminator network by producing novel candidates that the discriminator thinks are not synthesized, that is, are part of the true data distribution).

GANs often suffer from "mode collapse", where they fail to generalize properly, missing entire modes from the input data. For example, a GAN trained on the MNIST dataset, which contains many samples of each digit, might nevertheless timidly omit a subset of the digits from its output. Some researchers perceive the root problem to be a weak discriminative network that fails to notice the pattern of omission, while others assign blame to a bad choice of objective function. Many solutions have been proposed. Convergence of GANs is an open problem.

GANs are implicit generative models, which means that they do not explicitly model the likelihood function nor provide a means for finding the latent variable corresponding to a given sample, unlike alternatives such as flow-based generative models. Regarding the algorithm, we will find the solution of the following minimax problem:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

We follow the same idea as in [26]. We use the Generator as an LSTM network to predict the price at time t of the time series and combine it with the preceding values of the series, so that the Discriminator can act as a deep auto-regression network that classifies real and fake data, with the hope that the Generator will generate data as close to reality as possible. Through this, we hope to reduce the overfitting that occurs in the LSTM training phase as described above.

This is our GAN architecture:

Figure 36: GAN architecture [26]

We use the LSTM to predict the next price, then use this price to generate dummy data for the Discriminator to learn from. And this is our generator architecture:

Figure 37: Generator architecture [26]
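A minimal PyTorch sketch of this adversarial setup is shown below; the network sizes, window length and optimizer settings are assumptions for illustration, not our exact configuration. The generator predicts the next value from a window of past values, the fake sequence is the window with the predicted value appended, and the discriminator classifies full sequences as real or fake.

import torch
import torch.nn as nn

class Generator(nn.Module):
    # LSTM that predicts the next value of the series from a window of past values
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, window):                       # (batch, L, 1)
        h, _ = self.lstm(window)
        return self.out(h[:, -1, :])                 # predicted next value, (batch, 1)

class Discriminator(nn.Module):
    # Classifies a sequence of length L+1 as real (1) or generated (0)
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, seq):                          # (batch, L+1, 1)
        h, _ = self.lstm(seq)
        return torch.sigmoid(self.out(h[:, -1, :]))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCELoss()

def train_step(window, next_value):                  # real batch: past window + true next value
    real_seq = torch.cat([window, next_value.unsqueeze(1)], dim=1)
    fake_seq = torch.cat([window, G(window).unsqueeze(1)], dim=1)

    # Discriminator step: real sequences -> 1, generated sequences -> 0
    opt_d.zero_grad()
    loss_d = bce(D(real_seq), torch.ones(len(window), 1)) + \
             bce(D(fake_seq.detach()), torch.zeros(len(window), 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make the discriminator label fake sequences as real
    opt_g.zero_grad()
    loss_g = bce(D(fake_seq), torch.ones(len(window), 1))
    loss_g.backward()
    opt_g.step()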


4.2. Practical Results

MODEL                                        Train set          Validation set     Test set
                                             AUC      ACC       AUC      ACC       AUC      ACC
ARIMA                                        -        -         -        -         0.4553   0.4982
Logistic Regression                          0.52388  0.52551   0.51602  0.50449   0.52032  0.50129
SVM (using calibration to get probability)   0.52974  0.51398   0.50404  0.50638   0.50838  0.49353
Linear Discriminant Analysis                 0.63507  0.59709   0.54451  0.52985   0.54438  0.53269
Random Forest                                1        1         0.53859  0.52329   0.54506  0.52769
LightGBM                                     0.64555  0.5986    0.55127  0.54658   0.56747  0.55262
AdaBoost                                     0.58722  0.56401   0.53913  0.53838   0.56194  0.55068
LSTM                                         0.59734  0.5699    0.54085  0.53639   0.55233  0.54708
GAN                                          0.56301  0.55922   0.50911  0.50729   0.51642  0.51280
Rolling training (with LightGBM)             -        -         0.58469  0.56977   0.57309  0.55148
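The AUC and accuracy values above were computed on the chronological 70/15/15 split described in the Methodology section; a minimal sketch of this evaluation with scikit-learn (X, y and model are placeholders for the feature array, the up/down labels and any fitted classifier with predict_proba):

from sklearn.metrics import roc_auc_score, accuracy_score

# X, y are numpy arrays ordered by time; the split is done without shuffling
n = len(X)
i_val, i_test = int(0.70 * n), int(0.85 * n)
splits = {"train": slice(0, i_val), "valid": slice(i_val, i_test), "test": slice(i_test, n)}

for name, s in splits.items():
    prob_up = model.predict_proba(X[s])[:, 1]       # predicted probability of an "up" move
    print(name,
          "AUC =", round(roc_auc_score(y[s], prob_up), 5),
          "ACC =", round(accuracy_score(y[s], prob_up > 0.5), 5))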

As we can see, the performance of the statistical models is the lowest. This is relatively easy to understand because in this method we only use the close price for forecasting; the model uses much less information than the others, so low performance is inevitable.

Next, among the linear methods, the one with the best performance is LDA. We did not have time to analyze the exact reason, but our assumption is that our window normalization pushes the feature distributions towards normal distributions; most of the dimensions do not carry much classification capability on their own, so the applicability conditions of LDA are satisfied better than those of Logistic Regression or Support Vector Machine.

Regarding ensemble learning methods, we can see the superiority of boosting: the performance of this family of methods is stronger than that of the other models.

As for the Deep Learning models, the performance is at a good threshold. We think the performance could be even better for GAN and LSTM, but we did not have the resources to tune the parameters or to make the neural networks more complex to see whether their performance could exceed boosting.

In summary, we can see the superiority of boosting over linear models (with very fast training times and very good performance). Deep Learning has good performance, but the training time is really long and the tuning of the parameters is also extremely resource-intensive and time-consuming. Therefore, we choose LightGBM as the representative model for analysis in the following sections.

4.3. Evaluation and Error Analysis

As seen in section 4.2, the LightGBM model gives the best results. Therefore, in this section, all the performance evaluation and error analysis will be based on the output of the LightGBM model. We will first look at the important features of this model:

Figure 38: Features importance barplot

BOP (balance of power) is the most important feature; it gives us the feeling that the current candle greatly influences whether the next candle will be green or red. The price-related features, after being normalized, also appear at the top of the most powerful features. In addition to the basic features and technical indicators, among the most powerful features above there are also some features that we calculated based on the EDA and analysis from Section 3, such


as log_volume_resid_divide_prev. Finally, it can be seen that all the features can be grouped into the following groups: Price, Volume, Exchange, Coinbase Exchange.

Now, we will analyze the AUC-iterations graph during the training phase.

Figure 39: AUC-Iterations plot

The AUC on the validation set barely increases after a few rounds, and early stopping happens at iteration 1518 (even though the base trees have a maximum depth of only 2).

Figure 40: Train set score distribution

Figure 41: Validation set score distribution

Figure 42: Test set score distribution

The score distribution is almost centered around 0.5, which shows that the model did not learn much through many boosting rounds, while the validation loss already shows signs of reversal (we tried increasing the number of boosting rounds and not using early stopping; the results on the training set then become more polarized, and the results on the validation set are extremely bad).

Even though we used window normalization to reduce the effect of drift and the difference in significance of the same values at different times, the performance did not improve very much (the accuracy increased from around 0.54 to around 0.55).

We can see in the calibration curves that the same behavior at a different time may go against the statistics we observed in the training set (even when we choose an early stopping


Figure 43: Validation set calibration curve

Figure 44: Test set calibration curve

We intended to use a logistic regression to calibrate the output of the model and improve performance, but the calibration curves of the datasets are too poor to be improved this way, so we only choose the best decision threshold on the validation set and reuse it for the test set.
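A minimal sketch of this threshold selection follows; the arrays valid_scores, y_valid, and test_scores are assumed names for the predicted scores and labels, not part of the original code.

    import numpy as np

    def best_threshold(valid_scores, y_valid):
        # Pick the probability threshold that maximizes accuracy on the validation set.
        candidates = np.linspace(0.3, 0.7, 401)  # grid around 0.5
        accs = [((valid_scores >= t).astype(int) == y_valid).mean() for t in candidates]
        return float(candidates[int(np.argmax(accs))])

    threshold = best_threshold(valid_scores, y_valid)
    test_pred = (test_scores >= threshold).astype(int)  # reuse the same threshold on the test set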
From all the examples mentioned above, it appears that past behavior is not enough to be used as a model for the future, both in terms of width (the attributes) and length (the number of samples). This is because the market depends on many other factors such as inflation, interest rates, Covid, stocks, war, ... and one factor that can be considered the most important: a large quantity of Bitcoin is in the hands of a few mysterious people, and they can manipulate the market without any rules at all.

4.4. Final Pipeline
After doing feature engineering and testing various models, we choose LightGBM, the model with the highest performance, as our best model. After that, we perform the feature selection process. From the trained LightGBM model, we can see the features together with their importance gain. For faster training with the cumulative assessment, we only keep the topmost important features that together contribute 99.7% of the total importance gain of all features. The number 99.7% comes from the probability that a normal random variable falls within 3 standard deviations of the mean. With this way of selecting features, we keep only 260 features, which means we can drop about 660 features.
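A minimal sketch of this selection rule, assuming the trained booster is stored in a variable named model (the cutoff of 99.7% is the one described above):

    import numpy as np
    import pandas as pd

    # Rank features by importance gain and keep the smallest prefix whose
    # cumulative share of the total gain reaches 99.7%.
    gain = pd.Series(model.feature_importance(importance_type="gain"),
                     index=model.feature_name()).sort_values(ascending=False)
    cumulative_share = gain.cumsum() / gain.sum()
    n_keep = int(np.searchsorted(cumulative_share.values, 0.997)) + 1
    selected_features = gain.index[:n_keep].tolist()
    print(f"kept {len(selected_features)} of {len(gain)} features")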

After that, we evaluate the LightGBM model by training cumulatively as follows. At the first iteration, we consider the first 10,000 data points: counting from the end of this window, the last 500 data points form the test set, the next 2,000 data points form the validation set, and the rest form the training set. After each iteration, we extend the dataset by the next 500 data points and choose the validation set and test set in the same way as before. At each iteration, if the AUC score on the validation set is lower than 0.56, we retrain the model of that iteration, this time using only the 10,000 latest data points; in that case the test set is still the last 500 data points, but the validation set is the next 1,000 data points.

With this method of training the LightGBM model, we obtained an average accuracy on the test set of 55.148% and an average AUC score on the test set of 0.57309.
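A simplified sketch of this cumulative (walk-forward) evaluation loop is shown below; the helper train_lgbm and the arrays X, y are assumptions standing in for our actual training code and dataset, and the default 0.5 threshold is used here instead of the tuned one for brevity.

    import numpy as np
    from sklearn.metrics import roc_auc_score, accuracy_score

    accs, aucs = [], []
    end = 10_000                                   # first iteration: the first 10,000 points
    while end <= len(y):
        X_w, y_w = X[:end], y[:end]
        # last 500 -> test, previous 2,000 -> validation, the rest -> train
        X_tr, y_tr = X_w[:-2500], y_w[:-2500]
        X_va, y_va = X_w[-2500:-500], y_w[-2500:-500]
        X_te, y_te = X_w[-500:], y_w[-500:]

        model = train_lgbm(X_tr, y_tr, X_va, y_va)          # hypothetical helper
        if roc_auc_score(y_va, model.predict(X_va)) < 0.56:
            # Retrain on the 10,000 most recent points with a 1,000-point validation set.
            X_w, y_w = X[end - 10_000:end], y[end - 10_000:end]
            X_tr, y_tr = X_w[:-1500], y_w[:-1500]
            X_va, y_va = X_w[-1500:-500], y_w[-1500:-500]
            model = train_lgbm(X_tr, y_tr, X_va, y_va)

        scores = model.predict(X_te)
        aucs.append(roc_auc_score(y_te, scores))
        accs.append(accuracy_score(y_te, (scores >= 0.5).astype(int)))
        end += 500                                  # grow the dataset by 500 points

    print(f"mean accuracy = {np.mean(accs):.3%}, mean AUC = {np.mean(aucs):.5f}")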

Figure 45: Test score histogram

Figure 46: Test set calibration curve

Figure 47: Validation score histogram

Figure 48: Validation set calibration curve

When using the above pipeline to evaluate the test set and the validation set, the calibration curves also become a bit better. When looking at the accuracies across the whole sequence of test sets, we notice that the period from Jan 2021 to May 2021 has poor performance.

After using the above evaluation method, both the score distributions and the calibration curves of the test set and the validation set have become quite similar, which makes our prediction results more reliable.
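For completeness, a minimal sketch of how such reliability (calibration) curves can be computed with scikit-learn from predicted scores; the array names y_test and test_scores are assumptions.

    import matplotlib.pyplot as plt
    from sklearn.calibration import calibration_curve

    # y_test: true up/down labels, test_scores: predicted probabilities.
    frac_pos, mean_pred = calibration_curve(y_test, test_scores, n_bins=10)
    plt.plot(mean_pred, frac_pos, marker="o", label="model")
    plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
    plt.xlabel("Mean predicted score")
    plt.ylabel("Fraction of positives")
    plt.legend()
    plt.show()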

5. Conclusion
Bitcoin prediction is an extremely difficult topic for any method. As mentioned in the Feature Engineering section, the problem depends on too many factors, and relying only on common analytical data types makes it hard to obtain a highly accurate result. In the report above, we have applied the methods we learned in the two subjects Applied Statistics and Machine Learning to analyze the data and build a Bitcoin price forecasting model. Although we used many different analytical methods as well as many different models in search of a better result, we only saw performance inching up little by little. Given the difficulty of the problem, we did not expect a high result, so we instead tried to build the most complete process possible. We see this as an open topic with potential for further exploration in the future.

6. Scope for further research
This topic still has a lot of interesting research directions that we can pursue.

In terms of data, we can add many other on-chain indicators as well as compute additional technical indicators to improve the model (a small example is sketched below). In addition, we can add data related to news and investor sentiment; these indicators can also be accurate predictors of the next direction of the price.

In terms of models, we can consider adding CNNs, Transformers and Attention, Deep Forest, ... These robust models can be combined to improve the prediction performance.
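As an example of the kind of extra technical indicators meant above, a small pandas sketch computing a simple 14-period RSI variant and a rolling volatility feature; the column names are assumptions about the OHLCV dataframe layout.

    import pandas as pd

    # df is assumed to have one row per candle with a "close" column.
    def add_extra_indicators(df, period=14):
        out = df.copy()
        delta = out["close"].diff()
        gain = delta.clip(lower=0).rolling(period).mean()
        loss = (-delta.clip(upper=0)).rolling(period).mean()
        out["rsi"] = 100 - 100 / (1 + gain / loss)        # simple moving-average RSI variant
        out["volatility"] = out["close"].pct_change().rolling(period).std()
        return out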


A. RAW DATA

Figure 49: Raw data


B. FEATURES AND INDICATORS

Figure 50: Features and indicators - 1


Figure 51: Features and indicators - 2
