Measure of Dependence For Financial Time-Series
arXiv:2311.12129v1 [q-fin.ST] 15 Nov 2023

Abstract—Assessing the predictive power of both data and models holds paramount significance in time-series machine learning applications. Yet, preparing time series data accurately and employing an appropriate measure of predictive power is a non-trivial task. This work reviews and establishes the groundwork for a comprehensive analysis of shaping time-series data and evaluating various measures of dependence. Lastly, we present a method, framework, and concrete example for selecting and evaluating a suitable measure of dependence.

Index Terms—Time-series, finance, predictive power, predictability, data shaping, feature importance, data preprocessing

I. INTRODUCTION

To develop a successful machine learning application, it is crucial to identify the data with the highest predictive power. The machine learning model can then extract information from the data and make accurate predictions. Identifying such data is part of the data preparation step of a typical machine learning pipeline.

In financial applications, a model could predict asset movement for equities, balance portfolio allocations, predict bond spreads, manage risk, or execute limit book orders. In all cases, financial time series data is used as input data, typically market data such as price data. Depending on the problem statement, alternative datasets can be of high value, including data from social media, company news headlines (sentiment), economic activities, or fundamental data.

Additionally, deep neural networks are deployed for financial applications, with complex architectures using many layers of convolutional, recurrent, or fully connected layers, or ensembles of models. These models have numerous hyperparameters, and their training is computationally expensive and time-consuming.

Given the multitude of data sources and the complexity of models, developing an optimal system is a challenging task. It is desirable to separate this high-dimensional search space for an optimal solution. Specifically, it would be highly beneficial to separate the identification of data with high predictability from the hyperparameter search and training process. By doing so, it would be possible to preselect potential data with high predictive power and later extract the detailed granularity of predictive power using the production model and training process.

For this purpose, we need measures of dependence between the input data and the dependent data. In a supervised machine learning application, the dependent data is the target, i.e., the ground-truth label.

The optimal measure of dependence is a metric that indicates the degree of information contained in the input data to predict the target. A value of 0 could indicate that there is no relevant information in the input data and that the dependence between input and target is purely random. A value of 1 could indicate high dependence of the input data on the target. With such data, a machine learning production model could extract the relevant information and make accurate predictions.

The Pearson correlation is a concrete example of such a measure of dependence. It compares the input data x with the output data y¹ and outputs a correlation coefficient. Pearson correlation is based on a normalized covariance calculation and is a well-known statistical method to find relationships between two variables.

¹ Symbols with lowercase letters denote vector data, and symbols with uppercase letters denote matrices or multi-dimensional tensors.

When looking for an optimal measure of dependence, it is important to differentiate between purely model-free measures of dependence and model-based measures. Model-based measures explicitly utilize a model to calculate a loss, such as a decision tree, a naive prediction model, or even a neural network. In any case, such measures always include the properties of a model in the output. For instance, the dependency between input data and the target could be purely random; however, depending on the data, the model might overfit, and the loss is almost zero. In such a case, the measure would indicate a high value of dependence, even though there is, in fact, no dependence. The Pearson correlation, by contrast, is a model-free measure that uses a model only implicitly in the calculation of the coefficient, which is, in essence, a mapping of x to y data.

Apart from finding the optimal measure of dependence, the challenge is to match such a measure with time series data. In most financial applications, the Markov property does not hold. Consequently, a machine learning model must be fed with historic data. At the same time, financial data can contain many features. Thus, we have to deal with two-dimensional data where the main dimension is time, and the other dimension is the features.

The time dimension is strictly sequential, with the data order fixed in time. Additionally, depending on the problem statement, the data can contain further dimensions, such as different assets.

In supervised machine learning, time series data is typically shaped into a tabulated format using a sliding window. Each
row contains all information about one time step (sample). The columns contain the features for each time lag and the targets for each time horizon. Nevertheless, it is more convenient to think in tensors instead. Figure 1 depicts two tensors X and Y for a multivariate and multi-step problem statement. Y includes a time horizon.

Fig. 1. Time series data transformed into two tensors X (samples, time steps, features) and Y (samples, time horizon). In this example X is of rank 3. The rank could be higher. For instance, instead of having features for one stock only, each stock could have its own features. In such a scenario X would be of rank 4.

In a multi-step problem statement, the goal is not only to make a single prediction into the future but to predict multiple future values at once. This constitutes a typical time series forecasting task. For this purpose, the machine learning model is fed with multiple future target values. Such a model has multiple outputs to predict the time horizon and could, for instance, consist of an LSTM network architecture.

In financial trading applications, technical analysis is a common tool for trading assets such as stocks. A technical analyst would use price trends, patterns, or indicators such as momentum or volume indicators to predict future prices. Essentially, a technical analyst or even a long-term investor tries to identify regimes in the data that might persist over time well into the future. Like an analyst or investor, the general goal is to detect regimes.

It is important to understand that a regime cannot be detected with a single-step prediction, i.e., one future time step. While it is possible to identify a pattern or regime in historical data, it is highly unlikely that there is a dependence between such a regime and a single data point in the future.

Therefore, in this paper, we always consider a time horizon, with the ultimate goal of detecting a regime that persists over time from historical data through future data.

Figure 2 illustrates a time series with four features. The input data is in blue, the future data is marked red. The time horizon consists of the following four time data points up to t4.

At t3, there is a piece of relevant future information as part of a regime in the data. If the model could recognize this piece of information, it would be able to predict the future asset price and make profitable trades. This piece of relevant future information has a strong dependency on another piece of information in the past at t−2. A measure of dependence would need to consider both pieces of information to measure the dependence accurately. If the measure of dependence only knew the piece of information at t−2 and a hypothetical piece of information at t1, it would detect no dependence. The detection of a regime must be symmetrical between historical and future data.

Moreover, the dashed line shows the probability of relevant information in the data over time. Very old data or data far in the future most likely have a low probability of containing relevant information: the regimes have faded or have not yet revealed themselves. Very recent data points, in contrast, might not be able to pick up regimes and are of a random, noisy nature. Consequently, the probability that this data contains relevant information drops. The maximum must lie in between, as indicated by the bumps in probability in Figure 2.

Fig. 2. Exemplary time series data (features f1–f4 along the time axis, with points t−2, t0, t1, and t3 marked) with relevant information embedded in the historic and future data points. The dashed line indicates the probability of occurrence of relevant features along the time axis.

It is the goal of this paper to present an overview of the measures of dependence that are applicable to time series data, particularly financial data, but that could also be deployed for other time series data such as sensory data.

In the coming sections, we describe the different ways in which to best prepare the time series data for the different measures of dependence. Afterwards, we provide an overview of the applicable measures and explain each measure in more detail. At the end of this paper, we conduct an experiment on real stock market data and compare the results of the different measures. It is not the goal of this paper to identify the best measure of dependence, but to give the theoretical basis to evaluate and select the best measure for a particular application.

II. TIME SERIES SHAPING

To use financial time series data in a machine learning model, it is necessary to transform the data. For a supervised machine learning problem, data is very often in a tabular form, making it easy to process. Many machine learning libraries and tools require data to be in a matrix format. However, this is not a strict requirement, as neural network training with supervised learning can accept tensors as inputs, as outlined in the introductory section. In fact, with reinforcement learning setups, data is very often not in tabular form but in multidimensional tensors.

In all cases, a type of sliding window is used to transform the continuous stream of time series data into processable units. This transformation is accomplished either by preprocessing the data or by generating the data units ad hoc during the training process. In this paper, we focus on a generalized form of data that can be best used as inputs for a particular
Fig. 3. Overview of the X : Y (univariate) and S : Y (multivariate) time series data transformation using a sliding window with a window width of W = 4 and a step size of N = 2, applied to features f1–f3 and target y1. Data is processed from left to right. The time series sequence potentially represents the full data set or part of a data sequence, such as when using cross-validation.
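The sliding-window shaping of Figure 3 can be sketched with plain NumPy. This is a minimal illustration under assumed toy shapes, not the implementation used in the experiment; the function name and parameter choices are hypothetical.

```python
import numpy as np

def slide_window(series, targets, W=4, horizon=2, step=2):
    """Shape a (time, features) array into X (samples, W, features)
    and Y (samples, horizon) using a sliding window of width W
    that advances `step` time steps per sample."""
    X, Y = [], []
    t = W
    while t + horizon <= len(series):
        X.append(series[t - W:t])        # W historic time steps
        Y.append(targets[t:t + horizon]) # future time horizon
        t += step
    return np.array(X), np.array(Y)

# toy example: 3 features over 12 time steps, univariate target series
rng = np.random.default_rng(0)
data = rng.random((12, 3))
tgt = rng.random(12)
X, Y = slide_window(data, tgt, W=4, horizon=2, step=2)
print(X.shape, Y.shape)  # (4, 4, 3) (4, 2)
```

Each row of X holds the full historic window for one sample, and the matching row of Y holds the future values over the time horizon, mirroring the tensors of Figure 1.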
in the data is an implicit requirement of the model. It assumes that the model can do it by default, which depends on the construction of the model. It is preferable to remove this additional uncertainty of model construction from a measure of dependence.

Model-based methods are compelling because they are closer to the final implementation of the machine learning algorithm. If the model of the measure of dependence were the same as the production model, it would be best not to preselect features with high predictive power but to evaluate everything in a single step. If a production model is complex, it would be advantageous to use a simpler model for the measure of dependence. In that case, the result could be that such a measure does not detect the relevant information in the data, or detects the wrong data, potentially invalidating the training process.

D. Synergy

Synergy is an important concept when looking at measures of dependence [2]. To begin to understand synergy, we can use a simple system. Suppose two variables (call them X1 and X2) provide some information about a third variable (call it Y). In other words, if you know the state of X1 and X2, then you know something about the state of Y. Loosely speaking, the portion of information that is not provided by knowing both X1 alone and X2 alone is said to be provided synergistically by X1 and X2. The synergy is the bonus information received by knowing X1 and X2 together, instead of separately.

Therefore, the ideal measure of dependence can concurrently look at the input data X and compare it to the target Y. As we will see later, most of the presented measures can actually only compare one input variable with one output variable at a time and thus ignore the potential of synergy in the input data.

This is particularly critical for financial time series, in which a combination of input data could yield a new signal for a machine learning trading algorithm. Consequently, synergy is an important concept for financial time series predictions.

E. Type of Interactions

When looking at measures of dependence, one must pay specific attention to two types of interactions: 1) those that exist within a group of variables and 2) those that exist between a group of variables and another target variable. The latter is called "Group to Target". In such applications, we always have labeled data. For unsupervised machine learning or forecasting tasks, the first type of measure could be used. For machine learning prediction applications, only the last type of interaction is of importance.

Figure 4 presents an overview of the measures of dependence that have been considered in this work.

IV. MEASURE OF DEPENDENCE

A. Pearson Correlation PC

In statistics, the Pearson correlation coefficient is a measure of linear correlation between two sets of data and is defined as:

\rho = \frac{\mathrm{cov}_{xy}}{\sigma_x \sigma_y} \quad (1)

such that

\mathrm{cov}_{xy} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) \quad (2)

B. Spearman Correlation (SC)

The Spearman rank-order correlation coefficient is a non-parametric measure of the strength and direction of association between two variables. Unlike Pearson's correlation, which assesses linear relationships, Spearman's correlation evaluates monotonic relationships. Monotonic relationships are those where the variables tend to move in the same relative direction, but not necessarily at a constant rate. Spearman's correlation is particularly useful when the relationship between variables is more accurately described by their ranks rather than their exact values.

r_s = \rho_{R(X),R(Y)} = \frac{\mathrm{cov}(R(X), R(Y))}{\sigma_{R(X)} \, \sigma_{R(Y)}} \quad (3)

where:
\rho : the usual Pearson correlation coefficient, but applied to the rank variables,
\mathrm{cov}(R(X), R(Y)) : the covariance of the rank variables,
\sigma_{R(X)}, \sigma_{R(Y)} : the standard deviations of the rank variables.

The Spearman correlation is less sensitive than the Pearson correlation to strong outliers located in the tails of both samples. This is because Spearman's coefficient limits the impact of outliers to the value of their rank. When the data are approximately elliptically distributed and devoid of prominent outliers, the Spearman correlation and Pearson correlation yield similar values.

C. Distance Correlation DC

Differing from classical measures such as Pearson or Spearman correlation, distance correlation is a correlation measure that captures both linear and non-linear associations between two random variables, X and Y [3]. Distance correlation equals zero if, and only if, the two vectors are independent. This characteristic makes distance correlation a statistical measure of independence between random variables [4].

Distance correlation, also referred to as Brownian covariance or correlation, is defined by the following coefficient:
Fig. 4. An overview of measures of dependence grouped by relevant categories (linear, monotonic, non-linear). The metrics themselves are grouped into three types: statistical metrics (stat), information-theory-based metrics (info), and model-based metrics (loss).

The distance correlation is derived from a number of other quantities that are used in this definition, specifically: distance variance, distance standard deviation, and distance covariance².

D. Maximum Information Coefficient MIC

The maximal information coefficient is a measure that is based on information theory, particularly mutual information [6]. The approach of the maximal information coefficient is to try many different bin sizes and locations and to compare the maximum mutual information received. It is defined as

\mathrm{MIC} \triangleq \max_{x,y \,:\, xy < B} m(x, y) \quad (5)

such that

It is possible with the MIC to find interesting relationships between variables in a way that simpler measures, such as the correlation coefficient, cannot. With MIC the goal is equitability: similar scores will be seen in relationships with similar noise levels, regardless of the type of relationship.

Because of this, it may be particularly useful in high-dimensional settings to find a smaller set of the strongest correlations. Where distance correlation might be better at detecting the presence of (possibly weak) dependencies, the MIC is more geared toward assessing strength and detecting patterns that we would pick up via visual inspection. Hence, it cannot handle a system of multivariate inputs and outputs and correlate it as an abstract structure. Other, more recent MIC variants are BackMIC and ChiMIC [7].
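The contrast between the linear (Pearson), monotonic (Spearman), and non-linear (distance correlation) measures described above can be illustrated with a short sketch. The implementations below follow the textbook definitions and are applied to a purely quadratic, noise-free relationship; they are illustrative, not production code.

```python
import numpy as np

def pearson(x, y):
    # normalized covariance, Eq. (1)/(2)
    return np.corrcoef(x, y)[0, 1]

def spearman(x, y):
    # Pearson correlation on the rank-transformed data, Eq. (3)
    rx = x.argsort().argsort()
    ry = y.argsort().argsort()
    return pearson(rx, ry)

def distance_correlation(x, y):
    # double-centred pairwise distance matrices (Szekely et al.)
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

x = np.linspace(-1.0, 1.0, 201)
y = x ** 2  # non-linear, non-monotonic dependence
print(round(pearson(x, y), 3))               # approximately 0
print(round(spearman(x, y), 3))              # approximately 0
print(round(distance_correlation(x, y), 3))  # clearly positive
```

On this deterministic quadratic relationship, both Pearson and Spearman report roughly zero, while distance correlation is clearly positive, consistent with its role as a measure of independence.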
E. Mutual Information MI

I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y) \quad (8)

where the conditional entropy is given by:

H(X|Y) = \sum_{y \in Y} p(y) H(X|y) = \sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log \frac{1}{p(x|y)} \quad (9)

where:
X, Y : data variables,
x, y : individual values of those variables.

The mutual information can be used as a multivariate measure of the interactions among two or more input variables by grouping the variables into sets and treating each set as a single vector-valued variable. In this way, the mutual information can be used to measure the interactions between a group of variables and a target variable. For instance, the mutual information can be calculated between Y and the set S = {X1, X2}.

F. Redundancy-Synergy Index RSI

This is another multivariate information measure [9] and was created as an extension of the interaction information [10]. It is given by:

\mathrm{RSI}(S; Y) \equiv I(S; Y) - \sum_{X_i \in S} I(X_i; Y) \quad (10)

The redundancy-synergy index is designed to be maximal and positive when the variables in S are purported to provide synergistic information about Y. It should be negative when the variables in S provide redundant information about Y. The redundancy-synergy index measures the interactions between a group of variables and another variable, except when S contains two variables, in which case the redundancy-synergy index is equal to the interaction information. The redundancy-synergy index has been referred to as the SynSum [11] and the WholeMinusSum synergy [?], and the negative of the redundancy-synergy index has also been referred to as the redundancy [12].

The Redundancy-Synergy Index RSI is one of the measures of dependence that naturally accepts multivariate inputs.

G. Predictive Power Score PPS

This measure is a model-based score that operates pairwise on X and Y data. The intuition behind this score is as follows: A decision tree is fitted on the X and Y data, and the model's loss (Mean Absolute Error, MAE) is calculated. The loss is then normalized with the loss of a naive model. For a regression task, a naive approach would be to consistently use the median as the target. This way, approximately half of the values are expected to be above the naive prediction and half below. Subsequently, the model error is normalized with the error from the naive prediction.

\mathrm{PPS} = 1 - \frac{\mathrm{MAE}(y, \hat{y}_{\mathrm{model}})}{\mathrm{MAE}(y, \hat{y}_{\mathrm{median}})} \quad (11)

where:

\hat{y}_{\mathrm{model}} = f_{\mathrm{model}}(x); \quad \hat{y}_{\mathrm{median}} = \mathrm{median}(y) \quad (12)

In essence, the score measures the percentage of model improvement potential that the feature adds when the current naive base model is compared to a perfect deterministic model [13].

Interestingly, like any model-based score, it can easily be extended to a multivariate problem setup with more than one X data dimension [14]. This enables the score to detect combinations of feature variables.

Like any model-based measure, the model can overfit, and the result of the score is then ambiguous. This is particularly true if the data is not cross-validated [15]. It is better to split the data into training and test sets, fit the PPS model on the training set, and then measure the dependence on the test set. However, in most financial applications there is not enough data to execute such a method, limiting its application.

H. Information Gain IG

Information gain is a synonym for Kullback–Leibler divergence and measures the relative entropy between two distributions [16]. It gives a measure of the amount of information gained about a variable by observing another variable. The most prominent application of information gain is in decision trees, where it helps to determine the order of attributes in the nodes of the tree. The information gain is the entropy of the parent node minus the weighted average entropy of all child nodes:

\mathrm{IG}(S, A) = H(S)_{\mathrm{parent}} - H(S, A)_{\mathrm{children}} = H(S) - \sum_{t \in T} p(t) H(t) \quad (13)

where:
IG : information gain; the higher the information gain, the better the split, and the more entropy was removed from parent to children,
H(S) : entropy of set S (parent),
T : the subsets created from splitting set S on attribute A (child nodes) such that S = \bigcup_{t \in T} t,
p(t) : the proportion of the number of elements in t to the number of elements in set S (weighting of the node entropies),
H(t) : entropy of subset t (entropy of each child node).

Information gain and mutual information are the same thing, although the context or usage of the measure often gives rise to the different names. For example:
• Effect of transforms to a data set (decision trees): Information Gain.
• Dependence between variables (feature selection): Mutual Information.

Notice the similarity between the way the information gain (13) is calculated and the way the mutual information (8) is calculated; when setting the parent data in a decision tree as X and the child data as Y, they are equivalent. Technically, they calculate the same quantity if applied to the same data.

V. METHOD

There are many measures of dependence that could potentially be used to identify data with high predictive power. The question is, which one is the right one, and how can this be verified? Again, we must remember that the identification of data with high predictive power is a preprocessing step before production model training.

It is assumed that this is an iterative process, and after a certain number of iterations, good measures and features can be distinguished from the less effective ones. The iterative verification process is outlined in Figure 5.

Fig. 5. Iterative verification process: define the measure and feature space to be evaluated, run the algorithm to find the best measure and features, and search for M_best,worst and F_best,worst.

A positive correlation exists if the best measure M_best and best feature F_best values yield the highest production model score S_highest, and the worst measure M_worst and worst feature F_worst values yield the lowest production model score S_lowest. This correlation can be linear or of any monotonic nature. For cross-verification, an intermediate value can be used, or a range of values and scores can be compared.

This verification method is best applied with real data in an actual production environment. It is believed that this approach will yield optimal results. Alternatively, synthetic data can be generated. In such a scenario, signals, patterns, or even regimes can be embedded into the data. Purely random data can be used as a control dataset. In this setup, the data with high and low predictive power is already known in advance. The value of the measure of dependence would be expected to be higher with embeddings than with random data. However, especially in financial applications where patterns and regimes are not obvious, it would be very difficult to generate synthetic data. One could use generative models or, for specific applications like derivative investment instruments, a Black–Scholes model. Nevertheless, in this paper, the focus is on equity market predictions.
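On synthetic control data, the verification idea can be sketched as follows. The feature names, the |Pearson| coefficient standing in for the measure of dependence, and the least-squares fit standing in for the production model are all illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic control setup: one informative feature, one pure-noise feature
n = 500
informative = rng.normal(size=n)
noise = rng.normal(size=n)
target = informative + 0.1 * rng.normal(size=n)
features = {"f_signal": informative, "f_noise": noise}

# step 1: rank the features with a cheap measure of dependence (|Pearson|)
scores = {name: abs(np.corrcoef(f, target)[0, 1]) for name, f in features.items()}
f_best = max(scores, key=scores.get)
f_worst = min(scores, key=scores.get)

# step 2: a stand-in "production model": univariate least squares
def model_mse(f, y):
    A = np.c_[f, np.ones_like(f)]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((A @ coef - y) ** 2))

# step 3: verify the expected positive correlation between the measure
# and the production model score (low error = high score)
print(f_best, model_mse(features[f_best], target))    # low error expected
print(f_worst, model_mse(features[f_worst], target))  # high error expected
```

Because the predictive power of each feature is known in advance here, the check that the best-ranked feature also yields the best model score is exactly the positive correlation the verification procedure looks for.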
TABLE II
LIST OF FEATURES AND TARGETS USED IN THE EXPERIMENT.

Name             Description
f_ret_c_1        Daily closing price return
f_vol_pct        Daily stock trading volume change
f_ret_c_hma_5    Avg. closing price return using 5-day HMA
f_ret_c_hma_20   Avg. closing price return using 20-day HMA
f_c_macd_signal  MACD signal
f_c_macd         MACD indicator
f_obv            On-balance volume oscillator
f_atr            Average true range indicator
f_rsi            Relative strength index
t_ret_c_1        Daily closing price return
t_ret_c_hma_5    Avg. closing price return using 5-day HMA
t_ret_c_hma_20   Avg. closing price return using 20-day HMA

points (0.1%). The bin size B shall be chosen in a resolution that makes sense for stock market financial investments.

A. Algorithm

The primary goal of the feature search algorithm is to find the right combination of parameters that maximizes the value of the measure of dependence. This is achieved through a grid search using Optuna⁶. The search algorithm optimizes the following parameters:
1) Sliding window size
2) Feature type
3) Target type and duration
4) Measure of dependence

Referring to Figure 3, the sliding window step size N is fixed at 100 days, while W represents the sliding window size in the range of 50 to 200 days. For each point in time ti, the measure of dependence is calculated. This enables the identification of regime changes in the data. For instance, it could happen that during a downtrend different features prevail than during an uptrend. Therefore, a trending indicator is calculated for each time step to show if the stock price will go DOWN, UP, or be NEUTRAL.

Optuna offers various useful analysis tools, such as feature importance plots or parallel coordinate plots. With these tools, it is possible to find features with high predictive power, analyze the change of importance over time⁷, or verify which features prevail during an up or down trend. Figure 6 shows an example of a parallel coordinate plot. Feature selection is done with the feature importance plots, as it cannot be directly integrated into parallel coordinate plots.

The objective value is in this case the value of the measure of dependence. Based on the objective value, the best sliding window size, features, and targets can be selected. These parameters can then be used in the production model to verify whether the measure of dependence holds true; see Section V.

The algorithm has been implemented in Python using Sklearn [17] and Optuna. The Jupyter notebook can be found on GitHub as a reference.

⁶ Optuna is a hyperparameter optimization framework that offers high flexibility and parallel computing.
⁷ For this purpose, the Optuna source code has been extended, integrating date functionality.

B. Results

In this chapter, we give a summary of the results of the search process in Table III. The results have not been verified using the production model but shall serve as a first indication of which hyperparameters could be of importance.

TABLE III
HYPERPARAMETER SEARCH RESULTS FOR YUM STOCK.

Hyperparameter          MI                             DC                             MIC
Window Size    Best     50                             50, 150, 200                   50
               Worst    50, 150, 200                   150, 200                       150, 200
Features       Best     f_ret_c_hma_5, f_ret_c_hma_20  f_obv                          f_macd, f_macd_signal
               Worst    f_ret_c_1, f_vol_pct           f_ret_c_1                      f_ret_c_1, f_vol_pct
Targets        Best     t_ret_c_hma_5, t_ret_c_hma_20  t_ret_c_hma_5, t_ret_c_hma_20  t_ret_c_hma_5, t_ret_c_hma_20
               Worst    t_ret_c_1                      t_ret_c_1                      t_ret_c_1

Interestingly, all three measures of dependence chose a window size of 50 as the best. This implies that, for the production model, a historic window of more than 50 days might not be necessary. All three measures indicate that the daily price return (f_ret_c_1) is actually not a good feature, which is surprising. On the other hand, the best features strongly depend on the measure itself. MIC appears to strongly emphasize MACD, whereas DC found the on-balance volume oscillator to be the best. The targets show a consistent pattern among all three measures. Using daily price returns (t_ret_c_1) as a target does not allow finding high predictive power. Instead, using Hull Moving Averages of 5 and 20 days yields the best results, which is an interesting finding.

In a follow-up experiment, one could test whether even shorter window sizes can be used. However, it must be understood that reducing the window size decreases the number of samples in a single calculation. Based on experience, the sample size should not be smaller than 100. Additionally, the experiment can be conducted with other stocks. It would be interesting to see if the results correlate between different stocks. Based on the results of this experiment, it would also be interesting to conduct a sensitivity analysis with daily price returns and compare them to Hull Moving Averages. Clearly, the production model would achieve higher accuracy using averages than daily returns. By following the verification process according to Figure 5, it is possible to iteratively preselect important hyperparameters with fairly simple measures of dependence such as MI, DC, or MIC.
Fig. 6. Parallel coordinate plot with selective sliders on the vertical axis to filter hyperparameters.
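A drastically simplified stand-in for the search loop described in the Algorithm subsection can be sketched as follows. It uses plain grid iteration instead of Optuna, |Pearson| instead of MI/DC/MIC, a simple moving average as a crude stand-in for a Hull Moving Average, and synthetic returns instead of YUM market data; all names and numbers are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# toy daily return series and a crude 5-day moving average, standing in
# for the hypothetical f_ret_c_1 / f_ret_c_hma_5 features of Table II
ret = rng.normal(0.0, 0.01, size=1000)
hma5 = np.convolve(ret, np.ones(5) / 5, mode="valid")
ret = ret[: len(hma5)]

features = {"f_ret_c_1": ret[:-1], "f_ret_c_hma_5": hma5[:-1]}
targets = {"t_ret_c_1": ret[1:], "t_ret_c_hma_5": hma5[1:]}  # one step ahead

def dependence(x, y, w):
    """Placeholder measure of dependence over the last w observations."""
    return abs(np.corrcoef(x[-w:], y[-w:])[0, 1])

# grid over window size, feature, and target; keep the best trial
trials = [
    (w, f, t, dependence(features[f], targets[t], w))
    for w, f, t in itertools.product([50, 150, 200], features, targets)
]
best = max(trials, key=lambda r: r[-1])
print(best[1], best[2])  # the smoothed feature/target pair should rank best
```

Even in this toy version, the smoothed (averaged) feature and target pair dominates the raw one-day returns, echoing the Table III finding that moving-average targets carry more detectable dependence than daily returns.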