Measure of Dependence for Financial Time-Series


Martin Winistörfer
Equality Fund
martin.winistoerfer@equalityfund.ch

Ivan Zhdankin
Machine Learning Institute
ivan.zhdankin@thalesians.com

arXiv:2311.12129v1 [q-fin.ST] 15 Nov 2023

Abstract—Assessing the predictive power of both data and models holds paramount significance in time-series machine learning applications. Yet, preparing time series data accurately and employing an appropriate measure for predictive power seems to be a non-trivial task. This work involves reviewing and establishing the groundwork for a comprehensive analysis of shaping time-series data and evaluating various measures of dependence. Lastly, we present a method, framework, and a concrete example for selecting and evaluating a suitable measure of dependence.

Index Terms—Time-series, finance, predictive power, predictability, data shaping, feature importance, data preprocessing

I. INTRODUCTION

To develop a successful machine learning application, it is crucial to identify the data with the highest predictive power. The machine learning model can then extract information from the data and make accurate predictions. Identifying such data is part of the data preparation step for a typical machine learning pipeline.

In financial applications, a model could predict asset movement for equities, balance portfolio allocations, predict bond spreads, manage risk, or execute limit book orders. In all cases, financial time series data is used as input data, typically market data such as price data. Depending on the problem statement, alternative datasets can be of high value, including data from social media, company news headlines (sentiment), economic activities, or fundamental data.

Additionally, deep neural networks are deployed for financial applications, with complex architectures using many layers of convolutional, recurrent, or fully connected layers, or ensembles of models. These models have numerous hyperparameters, and their training is computationally expensive and time-consuming.

Given the multitude of data sources and the complexity of models, developing an optimal system is a challenging task. It is desirable to separate this high-dimensional search space for an optimal solution. Specifically, it would be highly beneficial to separate the identification of data with high predictability from the hyperparameter search and training process. By doing so, it would be possible to preselect potential data with high predictive power and later extract the detailed granularity of predictive power using the production model and training process.

For this purpose, we need measures of dependence between the input data and the dependent data. In a supervised machine learning application, the dependent data is the target, i.e., the ground-truth label.

The optimal measure of dependence is a metric that gives an indication of the degree of information contained in the input data to predict the target. A value of 0 could indicate that there is no relevant information in the input data, and the dependence between input and target is purely random. A value of 1 could indicate high dependence of the input data on the target. With such data, a machine learning production model could extract the relevant information and make accurate predictions.

The Pearson correlation is a concrete example of such a measure of dependence. It compares the input data x with output data y¹ and outputs a correlation coefficient. Pearson correlation is based on a normalized covariance calculation and is a well-known statistical method to find relationships between two variables.

When looking for an optimal measure of dependence, it is important to differentiate between purely model-free measures of dependence and model-based measures. Model-based measures utilize a model explicitly to calculate a loss, such as a decision tree, a naive prediction model, or even a neural network. In any case, such measures always include the properties of a model in the output. For instance, the dependency between input data and the target could be purely random; however, depending on the data, the model might overfit, and the loss is almost zero. In such a case, the measure would indicate a high value of dependence, even though there is, in fact, no dependence. The Pearson correlation, by contrast, is a model-free measure that uses a model only implicitly through the calculation of the coefficient, which is, in essence, a mapping of x to y data.

Apart from finding the optimal measure of dependence, the challenge is to match such a measure with time series data. In most financial applications, the Markov property does not hold true. Consequently, a machine learning model must be fed with historic data. At the same time, financial data can contain many features. Thus, we have to deal with two-dimensional data where the main dimension is time, and the other dimension is the features.

The time dimension is strictly sequential, with the data order fixed in time. Additionally, depending on the problem statement, the data can contain additional dimensions, such as different assets.

In supervised machine learning, time series data is typically shaped into a tabulated format using a sliding window.

¹Symbols with lowercase letters denote vector data, and symbols with uppercase letters denote matrices or multi-dimensional tensors.

Each row contains all information about one time step (sample). The columns contain the features for each time lag and targets for each time horizon. Nevertheless, it is more convenient to think in tensors instead. Figure 1 depicts two tensors X and Y for a multivariate and multi-step problem statement. Y includes a time horizon.

[Figure 1: two tensors, X of shape (samples, time steps, features) and Y of shape (samples, time horizon).]

Fig. 1. Time series data transformed into two tensors X and Y. In this example X is of rank 3. The rank could be higher. For instance, instead of having features for one stock only, each stock could have its own features. In such a scenario X would be of rank 4.

In a multi-step problem statement, the goal is not only to make a single prediction into the future but to predict multiple future values at once. This constitutes a typical time series forecasting task. For this purpose, the machine learning model is fed with multiple future target values. Such a model has multiple outputs to predict the time horizon and could, for instance, consist of an LSTM network architecture.

In financial trading applications, technical analysis is a common tool for trading assets such as stocks. A technical analyst would use price trends, patterns, or indicators such as momentum or volume indicators to predict future prices. Essentially, a technical analyst or even a long-term investor tries to identify regimes in the data that might persist over time well into the future. Like an analyst or investor, the general goal is to detect regimes.

It is important to understand that a regime cannot be detected with a single-step prediction, i.e., one future time step. While it is possible to identify a pattern or regime in historical data, it is highly unlikely that there is a dependence between such a regime and a single data point in the future.

Therefore, in this paper, we always consider a time horizon, with the ultimate goal of detecting a regime that persists over time from historical data through future data.

Figure 2 illustrates a time series with four features. The input data is in blue, the future data is marked red. The time horizon consists of the following four time data points up to t4.

At t3, there is a piece of relevant future information as part of a regime in the data. If the model could recognize this piece of information, it would be able to predict the future asset price and make profitable trades. This piece of relevant future information has a strong dependency on another piece of information in the past at t−2. A measure of dependence would need to consider both pieces of information to measure the dependence accurately. If the measure of dependence only knew the piece of information at t−2 and a hypothetical piece of information at t1, it would detect no dependence. The detection of a regime must be symmetrical between historical and future data.

Moreover, the dashed line shows the probability of relevant information in the data over time. Very old data or data far in the future most likely have a low probability of containing relevant information: the regimes have faded or have not yet revealed themselves. Very short-term data points, in turn, might not be able to pick up regimes and are of a random, noisy nature. Consequently, the probability that such data has relevant information drops. The maximum must be in between, as indicated by the bumps in probability in Figure 2.

[Figure 2: four feature series f1 through f4 plotted along the time axis (t−2 through t3), split into a historic part and a time horizon, with a dashed probability curve peaking inside each part.]

Fig. 2. Exemplary time series data with relevant information embedded in the historic and future data points. The dashed line indicates the probability of occurrence of relevant features along the time axis.

It is the goal of this paper to present an overview of the measures of dependence that are applicable to time series data, particularly financial data, but that could be deployed for other time series data such as sensory data.

In the coming sections, we describe the different ways in which to best prepare the time series data for the different measures of dependence. Afterwards, we provide an overview of the applicable measures and explain each measure in more detail. At the end of this paper, we conduct an experiment on real stock market data and compare the results of the different measures. It is not the goal of this paper to identify the best measure of dependence, but to give the theoretical basis to evaluate and select the best measure for a particular application.

II. TIME SERIES SHAPING

To use financial time series data in a machine learning model, it is necessary to transform the data. For a supervised machine learning problem, data is very often in a tabular form, making it easy to process. Many machine learning libraries and tools require data to be in a matrix format. However, this is not a strict requirement, as neural network training with supervised learning can accept tensors as inputs, as outlined in the introductory section. In fact, with reinforcement learning setups, data is very often not in tabular form but in multidimensional tensors.

In all cases, a type of sliding window is used to transform the continuous stream of time series data into processable units. This transformation is accomplished either by preprocessing the data or by generating the data units ad hoc during the training process. In this paper, we focus on a generalized form of data that can be best used as input for a particular measure of dependence.

Table I lists the two data formats. As shown in the table, the target variable is always a single vector.

TABLE I
TIME SERIES TRANSFORMATION INTO TWO DATA FORMATS.

Data Format   Description
X : Y         Measures between a vector of input data and a vector of output data (univariate, multi-step).
S : Y         Measures between a set of input vectors S = {X1, ..., XN} and a vector of output data (multivariate, multi-step).

[Figure 3: a target series y1 and three feature series f1 through f3, sliced by a sliding window of width W = 4 moving with step size N = 2 along the time axis; the univariate transformation yields the X : Y format, the multivariate transformation the S : Y format.]

Fig. 3. Overview of the X : Y and S : Y time series data transformation using a sliding window with a window width of W = 4 and a step size of N = 2. Data is processed from left to right. The time series sequence potentially represents the full data set or part of a data sequence, as when using cross-validation.

Both transformations are visualized in Figure 3. The example in Figure 3 has three features and one target. As depicted, the target is a different time series than the features. This definition differs from typical time series forecasting setups, where the goal of the forecast is to predict the same time series into the future. Also, one could imagine having multiple time series targets instead of just one.
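To make the shaping concrete, the following minimal sketch (our own illustration, assuming NumPy; the helper name slide is hypothetical) builds the tensors of Figure 1 from a feature matrix and a target series, using the window width W and step size N of Figure 3:

```python
import numpy as np

def slide(features, target, W, H, N):
    """Shape a (time, n_features) series into X of shape (samples, W, n_features)
    and Y of shape (samples, H), using a sliding window of width W,
    a time horizon of H steps, and a step size of N."""
    X, Y = [], []
    # Each sample pairs W historic steps with the following H future steps,
    # so no future information leaks into the input window.
    for t in range(W, len(features) - H + 1, N):
        X.append(features[t - W:t])
        Y.append(target[t:t + H])
    return np.asarray(X), np.asarray(Y)

# Toy data: 3 features and one separate target series, W=4 and N=2 as in Fig. 3.
rng = np.random.default_rng(0)
f = rng.normal(size=(500, 3))
y = rng.normal(size=500)
X, Y = slide(f, y, W=4, H=2, N=2)
print(X.shape, Y.shape)  # (248, 4, 3) (248, 2)
```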
III. OVERVIEW OF MEASURES

A. Statistical Methods

Statistical methods are the most common way to compare data and find relationships or associations between two variables. During the data exploration and understanding phase, covariance matrices or correlation coefficients are used by data scientists to understand the data, spot data-specific patterns, and make selective choices about which data to use. Most of these measures work on pairwise data only, and a major drawback is that they often can only detect linear correlations. In financial data, however, it is assumed that relevant data with high predictive power is only weakly embedded in the input data and of a non-linear nature.

B. Information Theory Measures

Since its introduction by Shannon to quantify communication, information theory [1] has proven to be a valuable tool in many disciplines. Information theory's broad applicability is partly due to its reliance solely on the probability distribution associated with one or more variables. In general, information theory utilizes the probability distributions linked to the values of variables to determine whether or not these values are related and, depending on the context, the nature of their relationship. The most fundamental information theory measures we examine here are entropy and mutual information. Similar to statistical methods, information theory employs an implicit model to map x and y data. However, because we do not explicitly specify a model, these measures are considered model-free. In principle, data alone cannot predict other data without some form of a model.

C. Model-Based Methods

Model-based methods use a model explicitly, such as a decision tree. Even a machine learning production model using a neural network could be considered a measure of dependence that uses a model explicitly. In any case, it is important to understand the problem statement at hand. The main questions to be answered by a true measure of dependence are:

1) How much information is in the input data that can explain the target?
2) Is there a regime and a relationship between past and future values?

Most model-based methods ask a different question: "How well can a model extract information and make an accurate prediction?" Such a question focuses on the construction of the model, not the data. Finding relevant information or regimes in the data is an implicit requirement of the model. It assumes that the model can do it by default, which depends on the construction of the model. It is preferred to remove this additional uncertainty of model construction from a measure of dependence.

Model-based methods are compelling because they are closer to the final implementation of the machine learning algorithm. If the model of the measure of dependence were the same as the production model, it would be best not to preselect features with high predictive power but to evaluate everything in a single step. If a production model is complex, it would be advantageous to use a simpler model for the measure of dependence. In that case, however, the result could be that such a measure does not detect the relevant information in the data, or detects the wrong data, potentially invalidating the training process.

D. Synergy

Synergy is an important concept when looking at measures of dependence [2]. To begin to understand synergy, we can use a simple system. Suppose two variables (call them X1 and X2) provide some information about a third variable (call it Y). In other words, if you know the state of X1 and X2, then you know something about the state of Y. Loosely said, the portion of information that is not provided by knowing both X1 alone and X2 alone is said to be provided synergistically by X1 and X2. The synergy is the bonus information received by knowing X1 and X2 together, instead of separately.

Therefore, the ideal measure of dependence can concurrently look at the input data X and compare it to the target Y. As we will see later, most of the presented measures can actually only compare one input variable with one output variable at a time and ignore the potential of synergy in the input data.

This is particularly critical for financial time series, in which a combination of input data could yield a new signal for a machine learning trading algorithm. Consequently, synergy is an important concept for financial time series predictions.

E. Type of Interactions

When looking at measures of dependence, one must pay specific attention to two types of interactions: 1) those that exist within a group of variables, and 2) those that exist between a group of variables and another target variable. The latter is called "Group to Target". In such applications we always have labeled data. For unsupervised machine learning or forecasting tasks, the first type of measure could be used. For machine learning prediction applications, only the last type of interaction is of importance.

Figure 4 presents an overview of the measures of dependence that have been considered in this work.

IV. MEASURE OF DEPENDENCE

This section gives a short introduction to each measure of dependence.

A. Pearson Correlation (PC)

In statistics, the Pearson correlation coefficient is a measure of linear correlation between two sets of data and is defined as:

\rho = \frac{\mathrm{cov}_{xy}}{\sigma_x \sigma_y}    (1)

such that

\mathrm{cov}_{xy} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})    (2)
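A small illustration of Eqs. (1) and (2), assuming SciPy is available; scipy.stats.pearsonr returns the coefficient together with a p-value:

```python
import numpy as np
from scipy import stats

# Linearly related data with noise.
rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 0.8 * x + 0.6 * rng.normal(size=1000)
rho, p = stats.pearsonr(x, y)

# The same coefficient computed directly from Eqs. (1)-(2):
# normalized sample covariance.
rho_manual = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
print(round(rho, 3), round(rho_manual, 3))  # identical up to rounding
```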
B. Spearman Correlation (SC)

The Spearman rank-order correlation coefficient is a non-parametric measure of the strength and direction of association between two variables. Unlike Pearson's correlation, which assesses linear relationships, Spearman's correlation evaluates monotonic relationships. Monotonic relationships are those where the variables tend to move in the same relative direction, but not necessarily at a constant rate. Spearman's correlation is particularly useful when the relationship between variables is more accurately described by their ranks rather than their exact values.

r_s = \rho_{R(X),R(Y)} = \frac{\mathrm{cov}(R(X), R(Y))}{\sigma_{R(X)} \sigma_{R(Y)}}    (3)

where:

\rho : denotes the usual Pearson correlation coefficient, but applied to the rank variables,
\mathrm{cov}(R(X), R(Y)) : is the covariance of the rank variables,
\sigma_{R(X)}, \sigma_{R(Y)} : are the standard deviations of the rank variables.

The Spearman correlation is less sensitive than the Pearson correlation to strong outliers located in the tails of both samples. This is because Spearman's coefficient limits the impact of outliers to the value of their rank. When the data are approximately elliptically distributed and devoid of prominent outliers, the Spearman correlation and the Pearson correlation yield similar values.
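A brief sketch contrasting the two coefficients on a monotonic but non-linear relationship, again assuming SciPy:

```python
import numpy as np
from scipy import stats

# A monotonic, non-linear relationship: Pearson understates it,
# while Spearman, computed on ranks, captures it almost perfectly.
rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=1000)
y = np.exp(x) + 0.1 * rng.normal(size=1000)
print(stats.pearsonr(x, y)[0])   # noticeably below 1
print(stats.spearmanr(x, y)[0])  # close to 1
```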
C. Distance Correlation (DC)

Differing from classical measures such as the Pearson or Spearman correlation, distance correlation is a correlation measure that captures both linear and non-linear associations between two random variables, X and Y [3]. Distance correlation equals zero if, and only if, the two vectors are independent. This characteristic makes distance correlation a statistical measure of independence between random variables [4].

Distance correlation, also referred to as Brownian covariance or correlation, is defined by the following coefficient:

\mathrm{dCor}(X, Y) = \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}}    (4)

The distance correlation is derived from a number of other quantities that are used in this definition, specifically: distance variance, distance standard deviation, and distance covariance².

²In Python, the dcor library can be used for easy computation of the sample distance correlation [5].
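As a usage sketch with the dcor library from the footnote [5], consider a symmetric, non-linear dependence that the Pearson correlation misses entirely:

```python
import numpy as np
import dcor  # library from [5]

# y depends on x through a symmetric parabola: Pearson is near zero,
# distance correlation is clearly positive.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=2000)
y = x ** 2 + 0.05 * rng.normal(size=2000)
print(np.corrcoef(x, y)[0, 1])          # near 0
print(dcor.distance_correlation(x, y))  # well above 0
```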

[Figure 4: taxonomy of the measures of dependence considered in this work.
Univariate, linear: Pearson Correlation (stat.); univariate, monotonic: Spearman Correlation (stat.).
Multivariate, non-linear, no model, group to target: Distance Correlation* (stat.), Mutual Information* (info.), Maximal Information Coefficient* (info.); model based: Predictive Power Score* (loss).
Group to target: Redundancy-Synergy Index (info.), Partial Information Decomposition (info.), Cumulative Mutual Information (info.).
Within group: Total Correlation TC (stat.), Maximal Correlation MAC (stat.), Interaction Information II (info.), High Contrast Subspaces HiCS (stat.), Universal Dependency Analysis UDS, Monte Carlo Dependency Estimation MCDE (stat.).]

Fig. 4. An overview of measures of dependence grouped by relevant categories. The metrics themselves are grouped into three types: statistical metrics (stat), information theory based metrics (info), and model based metrics (loss).

D. Maximal Information Coefficient (MIC)

The maximal information coefficient is a measure based on information theory, particularly on mutual information [6]. The approach of the maximal information coefficient is to try many different bin sizes and locations and to compare the maximum mutual information received. It is defined as

\mathrm{MIC} \triangleq \max_{x, y \,:\, xy < B} m(x, y)    (5)

such that

m(x, y) = \frac{\max_{G \in G(x, y)} I(X(G), Y(G))}{\log_2 \min(x, y)}    (6)

where:

B : is some sample-size dependent bound on the number of bins that can be used to reliably estimate the distribution,
G(x, y) : is the set of two-dimensional grids of size x-by-y, and X(G), Y(G) represent the discretization of the variables onto grid G.

The MIC lies in the range [0, 1], where 0 represents no relationship between the variables and 1 represents a noise-free relationship of any form, not just linear³. MIC will not give any indication of the type of the relationship, though. It is possible with the MIC to find interesting relationships between variables in a way that simpler measures, such as the correlation coefficient, cannot. With MIC the goal is equitability: similar scores will be seen in relationships with similar noise levels, regardless of the type of relationship.

Because of this, it may be particularly useful in high-dimensional settings to find a smaller set of the strongest correlations. Where distance correlation might be better at detecting the presence of (possibly weak) dependencies, the MIC is more geared toward the assessment of strength and toward detecting patterns that we would pick up via visual inspection. Hence, it cannot handle a system of multivariate inputs and outputs and correlate it as an abstract structure. More recent MIC variants include BackMIC and ChiMIC [7].

³In Python, the minepy library can be used to calculate the MIC score.
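A short usage sketch with the minepy library from footnote 3 (alpha=0.6 and c=15 are the library's default MINE parameters):

```python
import numpy as np
from minepy import MINE  # library from footnote 3

# MIC on a noisy sinusoidal relationship, where linear measures fail.
rng = np.random.default_rng(4)
x = rng.uniform(0, 4 * np.pi, size=1000)
y = np.sin(x) + 0.1 * rng.normal(size=1000)

mine = MINE(alpha=0.6, c=15)
mine.compute_score(x, y)
print(mine.mic())  # close to 1, while np.corrcoef(x, y)[0, 1] is near 0
```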

E. Mutual Information (MI)

The information theoretic quantities involving one and two variables are well-defined, and their results are well understood. Regarding the probability distribution of one variable, call it p(x), the canonical measure is the entropy H(X) [8]. The entropy is given by:

H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)    (7)

The entropy quantifies the amount of uncertainty that is present in the probability distribution. If the probability distribution is concentrated near one value, the entropy will be low. If the probability distribution is uniform, the entropy will be maximized. When examining the relationship between two variables, the mutual information (8) quantifies the amount of information provided about one of the variables by knowing the value of the other. The mutual information is given by:

I(X; Y) = H(X) - H(X|Y)
        = H(Y) - H(Y|X)    (8)
        = H(X) + H(Y) - H(X, Y)

where the conditional entropy is given by:

H(X|Y) = \sum_{y \in Y} p(y) H(X|y)
       = \sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log \frac{1}{p(x|y)}    (9)

where:

X, Y : data variables,
x, y : individual values of those variables.

The mutual information can be used as a multivariate input measure of the interactions among two or more variables by grouping the variables into sets and treating each set as a single vector-valued variable. In this way, the mutual information can be used to measure the interactions between a group of variables and a target variable. For instance, the mutual information can be calculated between Y and the set S = {X1, X2}.
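The following sketch (our own histogram-based illustration) estimates the quantities of Eqs. (7) and (8) by discretizing both variables into bins, in the spirit of the binning used later in the experiment section:

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram estimate of I(X;Y) = H(X) + H(Y) - H(X,Y) in bits,
    discretizing both variables into `bins` bins."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))  # Eq. (7)

    return entropy(px) + entropy(py) - entropy(pxy.ravel())

rng = np.random.default_rng(5)
x = rng.normal(size=5000)
print(mutual_information(x, rng.normal(size=5000)))        # near 0 (independent)
print(mutual_information(x, x + 0.2 * rng.normal(size=5000)))  # clearly positive
```

Note that histogram estimators carry a small positive bias for independent data; the choice of bin count trades bias against variance.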
F. Redundancy-Synergy Index (RSI)

This is another multivariate information measure [9], created as an extension of the interaction information [10]. It is given by:

\mathrm{RSI}(S; Y) \equiv I(S; Y) - \sum_{X_i \in S} I(X_i; Y)    (10)

The redundancy-synergy index is designed to be maximal and positive when the variables in S are purported to provide synergistic information about Y. It should be negative when the variables in S provide redundant information about Y. The redundancy-synergy index measures the interactions between a group of variables and another variable, except when S contains two variables, in which case the redundancy-synergy index is equal to the interaction information. The redundancy-synergy index has been referred to as the SynSum [11] and the WholeMinusSum synergy [?], and the negative of the redundancy-synergy index has also been referred to as the redundancy [12].

The redundancy-synergy index is one of the measures of dependence that naturally accepts multivariate inputs.
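A toy sketch of Eq. (10) for discrete variables (our own illustration; the XOR system is the textbook example of purely synergistic information):

```python
import numpy as np
from collections import Counter

def H(*cols):
    """Joint entropy (bits) of one or more discrete columns."""
    joint = Counter(zip(*cols))
    p = np.array(list(joint.values())) / len(cols[0])
    return -np.sum(p * np.log2(p))

def I(xs, y):
    """I(S;Y) with the list of inputs xs treated as one vector-valued variable."""
    return H(*xs) + H(y) - H(*xs, y)

# XOR target: each input alone tells us nothing about y, but together they
# determine it completely, so the RSI of Eq. (10) is positive.
rng = np.random.default_rng(6)
x1 = rng.integers(0, 2, 10000)
x2 = rng.integers(0, 2, 10000)
y = x1 ^ x2
rsi = I([x1, x2], y) - (I([x1], y) + I([x2], y))
print(round(rsi, 3))  # close to 1 bit
```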
G. Predictive Power Score (PPS)

This measure is a model-based score that operates pairwise on X and Y data. The intuition behind this score is as follows: a decision tree is fitted on the X and Y data, and the model's loss (mean absolute error, MAE) is calculated. The loss is then normalized with the loss of a naive model. For a regression task, a naive approach would be to consistently use the median as the target. This way, approximately half of the values are expected to be above the naive prediction and half below. Subsequently, the model error is normalized with the error from the naive prediction.

\mathrm{PPS} = 1 - \frac{\mathrm{MAE}(y, \hat{y}_{\mathrm{model}})}{\mathrm{MAE}(y, \hat{y}_{\mathrm{median}})}    (11)

where:

\hat{y}_{\mathrm{model}} = f_{\mathrm{model}}(x), \quad \hat{y}_{\mathrm{median}} = \mathrm{median}(y)    (12)

In essence, the score measures the percentage of model improvement potential that the feature adds when the current naive base model is compared to a perfect deterministic model [13].

The interesting thing, as with any model-based score, is that it can easily be extended to a multivariate problem setup having more than one X data dimension [14]. This enables the score to detect combinations of feature variables.

Like any model-based measure, the model can overfit, and the result of the score is then ambiguous. This is particularly true if the data is not cross-validated [15]. It is better to split the data into a training and a test data set, fit the PPS model on the training set, and then measure the dependence on the test set. However, in most financial applications there is not enough data to execute such a method, limiting its application.
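A minimal sketch of Eqs. (11) and (12) with scikit-learn (our own illustration, not the reference ppscore implementation; for time series one would split chronologically rather than randomly, and clamping negative scores to zero is a design choice here):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

def pps(x, y):
    """Eq. (11): decision-tree MAE normalized by the MAE of the naive
    median predictor, evaluated out-of-sample."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        x.reshape(-1, 1), y, test_size=0.3, random_state=0)
    model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(x_tr, y_tr)
    mae_model = mean_absolute_error(y_te, model.predict(x_te))
    mae_naive = mean_absolute_error(y_te, np.full_like(y_te, np.median(y_tr)))
    return max(0.0, 1.0 - mae_model / mae_naive)

rng = np.random.default_rng(7)
x = rng.uniform(-2, 2, 3000)
print(pps(x, x ** 2 + 0.1 * rng.normal(size=3000)))  # well above 0
print(pps(x, rng.normal(size=3000)))                 # about 0
```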

H. Information Gain (IG)

Information gain is a synonym for the Kullback–Leibler divergence and measures the relative entropy between two distributions [16]. It gives a measure of the amount of information gained about a variable by observing another variable. The most prominent application of information gain is in decision trees. Information gain helps to determine the order of attributes in the nodes of a decision tree. The information gain is the entropy of the parent node minus the weighted average entropy of all child nodes:

\mathrm{IG}(S, A) = H(S)_{\mathrm{parent}} - H(S, A)_{\mathrm{children}} = H(S) - \sum_{t \in T} p(t) H(t)    (13)

where:

IG : information gain; the higher the information gain, the better the split, i.e., the more entropy was removed from parent to children,
H(S) : entropy of set S (parent),
T : the subsets created from splitting set S on attribute A (child nodes), such that S = \bigcup_{t \in T} t,
p(t) : the proportion of the number of elements in t to the number of elements in set S (weighting of the node entropies),
H(t) : entropy of subset t (entropy of each child node).

Information gain and mutual information are the same thing, although the context or usage of the measure often gives rise to the different names. For example:

• Effect of transforms to a data set (decision trees): information gain.
• Dependence between variables (feature selection): mutual information.

Notice the similarity in the way that the information gain (13) is calculated and the way that the mutual information (8) is calculated; when setting the parent data in a decision tree as X and the child data as Y, they are equivalent. Technically, they calculate the same quantity if applied to the same data.
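A compact sketch of Eq. (13) for discrete labels (our own illustration):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, split_attr):
    """Eq. (13): parent entropy minus the weighted entropy of the
    child nodes induced by attribute A."""
    children = 0.0
    for v in np.unique(split_attr):
        mask = split_attr == v
        children += mask.mean() * entropy(parent[mask])
    return entropy(parent) - children

# A perfectly informative binary attribute removes all entropy (IG = 1 bit);
# an uninformative one removes none (IG = 0).
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])
b = np.array([0, 1, 0, 1, 0, 1, 0, 1])
print(information_gain(y, a), information_gain(y, b))  # 1.0 0.0
```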
V. METHOD

There are many measures of dependence that could potentially be used to identify data with high predictive power. The question is, which one is the right one, and how can this be verified? Again, we must remember that the identification of data with high predictive power is a preprocessing step before production model training.

However, since the measure of dependence and the production model can deviate from each other, as outlined in Section III-C, it becomes necessary to use the training results from the production model to judge the success of the preprocessing.

It is assumed that this is an iterative process, and that after a certain number of iterations, good measures and features can be distinguished from the less effective ones. The iterative verification process is outlined in Figure 5.

[Figure 5: flowchart of the verification process. Define the measure and feature space to be evaluated; search for M_best,worst and F_best,worst (algorithm to find the best measure and features); define the production model and a subset of hyperparameters; train the production model with M_best,worst and F_best,worst; correlate the production model scores S_highest,lowest with M_best,worst and F_best,worst; if the correlation is positive, use M_best and F_best for the full-scale hyperparameter search, otherwise iterate.]

Fig. 5. Verification process to ensure that the measure of dependence and the data have the required predictive power for a specific application.

A positive correlation exists if the best measure M_best and best feature F_best values yield the highest production model score S_highest, and the worst measure M_worst and worst feature F_worst values yield the lowest production model score S_lowest. This correlation can be linear or of any monotonic nature. For cross-verification, an intermediate value can be used, or a range of values and scores can be compared.

This verification method is best applied with real data in an actual production environment. It is believed that this approach will yield optimal results. Alternatively, synthetic data can be generated. In such a scenario, signals, patterns, or even regimes can be embedded into the data. Purely random data can be used as a control dataset. In this setup, the data with high and low predictive power is already known in advance. The value of the measure of dependence would be expected to be higher with embeddings than with random data. However, especially in financial applications where patterns and regimes are not obvious, it would be very difficult to generate synthetic data. One could use generative models or, for specific applications like derivative investment instruments, a Black–Scholes model. Nevertheless, in this paper, the focus is on equity market predictions.
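A hedged sketch of the loop in Figure 5; measure_score and train_production_model are hypothetical callables standing in for a measure of dependence and the production training run, and none of these names come from the paper's code:

```python
# Sketch of the Figure 5 loop: rank (measure, feature) pairs by the measure
# value, train the production model on the best and worst pair, and check
# that the model scores are ordered the same way (positive correlation).
def verify(measures, features, measure_score, train_production_model):
    scored = [(measure_score(m, f), m, f) for m in measures for f in features]
    _, m_best, f_best = max(scored, key=lambda t: t[0])
    _, m_worst, f_worst = min(scored, key=lambda t: t[0])

    s_best = train_production_model(f_best)    # e.g. validation score
    s_worst = train_production_model(f_worst)

    if s_best > s_worst:          # positive correlation: accept the pair
        return m_best, f_best     # use for the full-scale hyperparameter search
    return None                   # otherwise iterate with new candidates
```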

VI. EXPERIMENT

We test three measures of dependence, MI, DC, and MIC, and conduct an experiment using stock market data. Specifically, the goal is to find features, such as price returns, trading volume, and technical indicators, that have the highest predictive power⁴. Table II lists the targets and features that are used in this experiment. Price data is averaged using the Hull moving average (HMA). This average minimizes lag in the time series⁵. Feature data are prefixed with f_ and targets with t_.

⁴In Python, a great library for technical indicators is TA-Lib.
⁵See HMA for more details.

The targets are typically price returns. These can be daily returns (t_ret_c_1) or returns of any duration, such as weekly or monthly. The target t_ret_c_hma_20, for example, is a monthly average. The targets are computed before the search process at every time step.

The experiment is conducted with daily data from a single stock, YUM. The data ranges from 1999-11-01 to 2021-06-25.

It is important to note at this point that the duration of the return and the sliding window size itself are key parameters in the search process. As outlined in Figure 2, regimes might exist over different time horizons. For each data set, the search process must find the optimal window. Also, the target is not a single value but a vector including data over a specific time horizon defined by the sliding window size.

Furthermore, the target values are discretized into bins. In this example, the bin size B = 10 corresponds to 10 basis points (0.1%). The bin size B shall be chosen at a resolution that makes sense for stock market financial investments.

TABLE II
LIST OF FEATURES AND TARGETS USED IN THE EXPERIMENT.

Name              Description
f_ret_c_1         Daily closing price return
f_vol_pct         Daily stock trading volume change
f_ret_c_hma_5     Avg. closing price return using 5-day HMA
f_ret_c_hma_20    Avg. closing price return using 20-day HMA
f_c_macd_signal   MACD signal
f_c_macd          MACD indicator
f_obv             On-balance volume oscillator
f_atr             Average true range indicator
f_rsi             Relative strength index
t_ret_c_1         Daily closing price return
t_ret_c_hma_5     Avg. closing price return using 5-day HMA
t_ret_c_hma_20    Avg. closing price return using 20-day HMA

A. Algorithm

The primary goal of the feature search algorithm is to find the right combination of parameters that maximizes the value of the measure of dependence. This is achieved through a grid search using Optuna⁶. The search algorithm optimizes the following parameters:

1) Sliding window size
2) Feature type
3) Target type and duration
4) Measure of dependence

⁶Optuna is a hyperparameter optimization framework that offers high flexibility and parallel computing.

Referring to Figure 3, the sliding window step size N is fixed at 100 days, while W represents the sliding window size in the range of 50 to 200 days. For each point in time t_i, the measure of dependence is calculated. This enables the identification of regime changes in the data. For instance, it could happen that during a downtrend, different features prevail than during an uptrend. Therefore, a trending indicator is calculated for each time step to show whether the stock price will go DOWN, UP, or be NEUTRAL.

Optuna offers various useful analysis tools, such as feature importance plots or parallel coordinate plots. With these tools, it is possible to find features with high predictive power, analyze the change of importance over time⁷, or verify which features prevail during an up or down trend. Figure 6 shows an example of a parallel coordinate plot. Feature selection is done with the feature importance plots, as it cannot be directly integrated into parallel coordinate plots.

⁷For this purpose, the Optuna source code has been extended, integrating date functionality.

The objective value is in this case the value of the measure of dependence. Based on the objective value, the best sliding window size, features, and targets can be selected. These parameters can then be used in the production model to verify whether the measure of dependence holds true; see Section V.

The algorithm has been implemented in Python using scikit-learn [17] and Optuna. The Jupyter notebook can be found on GitHub as a reference.
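To make the search concrete, here is a hedged sketch of an Optuna objective over the hyperparameters listed above; compute_dependence is a hypothetical placeholder standing in for MI, DC, or MIC computed over the sliding windows, and the feature and target names follow Table II:

```python
import optuna

FEATURES = ["f_ret_c_1", "f_vol_pct", "f_ret_c_hma_5", "f_ret_c_hma_20",
            "f_c_macd", "f_c_macd_signal", "f_obv", "f_atr", "f_rsi"]
TARGETS = ["t_ret_c_1", "t_ret_c_hma_5", "t_ret_c_hma_20"]

def compute_dependence(measure, feature, target, window):
    # Hypothetical helper: would compute MI, DC, or MIC between the
    # windowed feature and target series of the stock under study.
    raise NotImplementedError

def objective(trial):
    window = trial.suggest_int("window_size", 50, 200, step=50)
    feature = trial.suggest_categorical("feature", FEATURES)
    target = trial.suggest_categorical("target", TARGETS)
    measure = trial.suggest_categorical("measure", ["MI", "DC", "MIC"])
    return compute_dependence(measure, feature, target, window)

study = optuna.create_study(direction="maximize")  # maximize the measure value
study.optimize(objective, n_trials=100)
print(study.best_params)
```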
B. Results

In this section we give a summary of the results of the search process in Table III. The results have not been verified using the production model, but shall serve as a first indication of which hyperparameters could be of importance.

TABLE III
HYPERPARAMETER SEARCH RESULTS FOR YUM STOCK.

Hyperparameter         MI                             DC                             MIC
Window Size   Best     50                             50, 150, 200                   50
              Worst    50, 150, 200                   150, 200                       150, 200
Features      Best     f_ret_c_hma_5, f_ret_c_hma_20  f_obv                          f_c_macd, f_c_macd_signal
              Worst    f_ret_c_1, f_vol_pct           f_ret_c_1                      f_ret_c_1, f_vol_pct
Targets       Best     t_ret_c_hma_5, t_ret_c_hma_20  t_ret_c_hma_5, t_ret_c_hma_20  t_ret_c_hma_5, t_ret_c_hma_20
              Worst    t_ret_c_1                      t_ret_c_1                      t_ret_c_1

Interestingly, all three measures of dependence chose a window size of 50 as the best. This implies that, according to the production model, a larger historic window of more than 50 days might not be necessary. All three measures indicate that the daily price return (f_ret_c_1) is actually not a good feature, which is surprising. On the other hand, the best features strongly depend on the measure itself. MIC appears to strongly emphasize the MACD, whereas DC found the on-balance volume oscillator to be the best. The targets show a consistent pattern among all three measures. Using daily price returns (t_ret_c_1) as a target does not allow finding high predictive power. Instead, using Hull moving averages of 5 and 20 days yields the best results, which is an interesting finding.

In a follow-up experiment, one could test shorter window sizes to check whether even shorter windows can be used. However, it must be understood that reducing the window size decreases the number of samples in a single calculation. Based on experience, the sample size should not be smaller than 100.

Additionally, the experiment can be conducted with other stocks. It would be interesting to see if the results correlate between different stocks. Based on the results of this experiment, it would be interesting to conduct a sensitivity analysis with daily price returns and compare them to Hull moving averages. Clearly, the production model would achieve higher accuracy using averages than daily returns. By following the verification process according to Figure 5, it is possible to iteratively preselect important hyperparameters with fairly simple measures of dependence such as MI, DC, or MIC.
Fig. 6. Parallel coordinate plot with selective sliders on the vertical axis to filter hyperparameters.

VII. CONCLUSIONS

One of the key findings of this work is the crucial need for careful preparation of time-series data. It is essential that time-series data align with the chosen measure of dependence and the specific problem statement. For example, correlating the entire dataset of a feature with a target value identifies dependencies, but these dependencies may span any period of time. This approach does not yield information about predictive power, rendering it ineffective for predictive applications. Conversely, correlating historical data with future data, without information leakage, enables a measure of dependence to genuinely uncover predictive power. Hence, in this study, we propose utilizing an extensive time horizon of data for comparison with historical data to identify patterns or regimes in the dataset. The assumption is that these patterns and regimes form the core of identifying predictive power.

Another crucial finding is that, irrespective of the chosen measure of dependence, the model based on the measure of dependence will always differ from the production model. Typically, a measure of dependence employs a simple model, such as a statistical one. In contrast, a production model utilizes a more complex model, such as a deep learning network. When the measure of dependence employs a model as complex as a deep learning network, the feature engineering step for preselecting features becomes redundant. Instead, it is more advisable to directly evaluate hyperparameters with the production model, even though this incurs a high computational cost. Depending on the available infrastructure and cost budget, it is preferable to use a simple measure of dependence. Additional efforts to synchronize the measure and the production model may be required. This work introduces a verification method to facilitate such synchronization. The method can be executed manually or automated for time-saving.

Finally, the experiments in this work have revealed some intriguing findings. Although they still require verification, daily price returns seem to be an unfavorable choice for the application of identifying predictive power in stock price information. Instead, using averaged data, especially employing Hull moving averages, which utilize a longer time periodicity, reveals higher predictive power.

REFERENCES

[1] D. Commenges, "Information Theory and Statistics: an overview," 2013.
[2] N. Timme, W. Alford, B. Flecker, and J. Beggs, "Synergy, redundancy, and multivariate information measures: an experimentalist's perspective," 2013.
[3] G. J. Székely and M. L. Rizzo, "Rejoinder: Brownian distance covariance," 2009.
[4] Wikipedia, "Distance correlation," 2022.
[5] C. R. Carreño, "Distance covariance and distance correlation," 2022.
[6] B. Robidoux, "Maximal Information Coefficient: An Introduction to Information Theory," 2017.
[7] D. Cao, Y. Chen, J. Chen, H. Zhang, and Z. Yuan, "An improved algorithm for the maximal information coefficient and its application (BackMIC)," 2021.
[8] T. Cover and J. Thomas, Elements of Information Theory. New York: Wiley-Interscience, 2nd ed., 2006.
[9] G. Chechik, A. Globerson, M. Anderson, E. Young, N. Tishby, and I. Nelken, in Advances in Neural Information Processing Systems 14, 2001.
[10] W. McGill, "Multivariate information transmission," Psychometrika, 1954.
[11] A. Globerson, E. Stark, E. Vaadia, and N. Tishby, PNAS, vol. 106, p. 3490, 2009.
[12] E. Schneidman, S. Still, M. Berry, and W. Bialek, Physical Review Letters, vol. 91, 2003.
[13] F. Wetschoreck, "Introducing the Predictive Power Score," 2020.
[14] J. C. Vermunt, "Extension of Predictive Power Score," 2022.
[15] C. Molnar, "Interpretable Machine Learning," 2022.
[16] Wikipedia, "Information Gain."
[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
