
Czekanowski Index-Based Similarity as Alternative Correlation Measure in N-Asset Portfolio Analysis

October 27, 2021 by Pawel

In quantitative finance we are used to measuring direct linear correlations or non-linear cross-bicorrelations among various time-series. For the former, by default, one adopts the Pearson product-moment correlation coefficient to quantify a linear relationship between two vectors. This is valid if the data follow a Gaussian distribution. Otherwise, rank correlation methods need to be applied (e.g. Spearman's or Kendall's). Good diversification of the assets kept in an investment portfolio often relies on correlation measures: we want to limit the risk of losing too much due to highly correlated (co-moving in the same direction) assets. While correlation measures of any kind are powerful tools in finance, can there be something better than that?

Remarkably, it is quantitative biology that brings a new weapon to the table. Biologists love in-depth data analyses and have devoted a lot of time to the study of biological samples. The samples can differ widely in their origin or composition; however, when it comes to counting, comparing and classifying, the underlying mathematical language luckily shares a common denominator with the analysis of financial data samples.

In biology, people are more inclined to talk about similarities, and thus about similarity metrics. Similarity metrics (also referred to as correlation metrics) are applied to two or more objects (e.g. DNA sequences). The similarity metric quantifies the association the objects have with each other. This quantification can take a variety of forms, such as how often the objects are involved in a similar process, or how likely the objects are to appear in the same location. The value representing the quantified association is often referred to as the similarity coefficient, or the correlation coefficient. This similarity coefficient is a real-valued number that describes to what extent the objects are related.

In this article we will have a closer look at a similarity measure commonly used in biology for vector classification: the Czekanowski Index. We start from the theory of Pearson's and Spearman's correlation coefficients, their assumptions and limitations. Next, we introduce the similarity measure as an alternative tool for measuring correlation between two vectors (data samples) and show how to adopt it for financial N-Asset Portfolio analysis. We develop Python code for a clear and concise application and illustrate the functionality on a 39-Crypto-Asset Portfolio during the micro flash-crash in May 2021. We conclude that the Average Rolling Similarity measure we propose may be regarded as a strong alternative to classical correlation measures, beating their performance in certain market conditions.

1. Galton-Pearson Product-Moment Correlation

The history of what we know today under the common term of correlation can be dated back to 1885. In that year, Sir Francis Galton, the British biometrician, was the first to refer to regression. He published a bivariate scatterplot with normal isodensity lines, the first graph of correlation.

In 1888, Galton noted that r measures the closeness of the "co-relation" and suggested that r could not be greater than 1 (although he had not yet recognised the idea of negative correlation). Seven years later, Pearson (1895) developed the mathematical formula that is still most commonly used to measure correlation, the Pearson product-moment correlation coefficient. From a historical point of view, it seems more appropriate that r should be named the Galton-Pearson r:

$$
r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
$$

where X̄ and Ȳ are the sample means of X = {X₁, …, Xₙ} and Y = {Y₁, …, Yₙ}, respectively. As one can notice, the formula describes r as the centred and standardised sum of cross-products of two variables. In Lord and Novick (1968) we can find one of the first hints that the Cauchy-Schwarz inequality may be used to show that the absolute value of the numerator is less than or equal to the denominator, in other words that −1 ≤ r ≤ 1. Given that, the Galton-Pearson r can be interpreted as the strength of the linear relationship between two variables.
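As a quick sanity check, the formula can be coded directly in a few lines of NumPy (a minimal sketch of our own; the helper name pearson_r is arbitrary) and compared against np.corrcoef:

import numpy as np

def pearson_r(x, y):
    # centred cross-products divided by the product of the sample norms
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

x = np.random.randn(250)
y = 0.5 * x + np.random.randn(250)

print(pearson_r(x, y), np.corrcoef(x, y)[0, 1])  # both numbers agree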

A perfect correlation (anti-correlation) is obtained for r equal to +1 (−1), respectively, in which case the data points lie on a straight line. Both extreme values are approached as the relationship between the two variables tightens. Interestingly, r can be calculated as a measure of a linear relationship without any assumptions. In practice, however, everything depends on the data sample we analyse, and some minimum level of assumptions is required:

1. If X does not represent the whole population (e.g. due to non-random or non-uniform selection), then the derived r might be biased or incorrect.

2. Both variables X and Y are continuous, jointly normally distributed, random variables, i.e. they follow a bivariate normal distribution in the population from which they were sampled.

That is why it is so important to check the normality of both samples before calculating the r coefficient. Why? Because if there is a relationship between jointly normally distributed data, this relationship is always linear. Therefore, if what you observe in a scatter plot seems to lie close to some curve (i.e. not exactly a straight line), the assumption of a bivariate normal distribution is violated.

In the literature, various stratifications for r have been published on many occasions. It is safe to say that if 0.00 < r < 0.10 the linear correlation is negligible, if 0.10 ≤ r < 0.40 it is weak, if 0.40 ≤ r < 0.70 it is moderate, if 0.70 ≤ r < 0.90 it is strong, and if 0.90 ≤ r < 1.00 it is very strong (in terms of absolute magnitude, of course). However, because samples are inevitably affected by chance, the observed correlation may not necessarily be a good estimate of the population correlation coefficient.
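If needed, this rule of thumb can be wrapped in a small helper (our own illustrative function; the thresholds are exactly those quoted above):

def correlation_strength(r):
    # qualitative label for the magnitude of a linear correlation coefficient
    a = abs(r)
    if a < 0.10:
        return 'negligible'
    elif a < 0.40:
        return 'weak'
    elif a < 0.70:
        return 'moderate'
    elif a < 0.90:
        return 'strong'
    else:
        return 'very strong'

print(correlation_strength(-0.55))   # 'moderate'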

Therefore, it is additionally advised that the derived correlation coefficient r should always be accompanied by a confidence interval, which provides the range of plausible values of the coefficient in the population from which the data were sampled. By specifying the confidence level at 1 − α, e.g. 95%, where α denotes the significance level, in Python we can derive both Pearson's r and the corresponding confidence brackets as follows:

from scipy.stats import pearsonr, norm
import numpy as np

np.random.seed(1)

for n in [10, 100, 1000, 10000]:
    x = np.random.randn(n)
    y = np.random.randn(n)

    # Pearson's r computed in two ways; keep it only if both agree
    r1, r2 = np.corrcoef(x, y)[0, 1], pearsonr(x, y)[0]
    if np.isclose(r1, r2):
        r = r1
    else:
        r = None

    # 95% confidence interval via the Fisher transformation
    r_z = np.arctanh(r)
    se = 1 / np.sqrt(n - 3)
    cl = 0.95  # confidence level
    alpha = 1 - cl
    z = norm.ppf(1 - alpha / 2)
    lo_z, hi_z = r_z - z * se, r_z + z * se
    lo, hi = np.tanh((lo_z, hi_z))

    print('n = %5g: r = %.10f [%13.10f,%13.10f]' % (n, r, lo, hi))

where the X and Y samples, in this example, are drawn randomly from the standard Normal distribution and the results are examined as a function of growing sample size:

n =    10: r = 0.6556177144 [ 0.0442628668, 0.9097178165]
n =   100: r = 0.1659786467 [-0.0314652772, 0.3509551991]
n =  1000: r = 0.0216530956 [-0.0403942097, 0.0835340468]
n = 10000: r = 0.0072949196 [-0.0123069100, 0.0268911447]

This brings us to a key observation we need to remember here. For samples of size n = 10 and n = 10000, respectively, the 95% confidence intervals carry different degrees of uncertainty:

$$
r(n = 10) = 0.6556^{+0.2541}_{-0.6114}, \qquad r(n = 10000) = 0.0073^{+0.0196}_{-0.0196} \; .
$$

Therefore, the estimation of r can be very misleading if given without the corresponding confidence
brackets.

In the code, we used the fact that a 95% confidence interval is given by

$$
\tanh\!\left[ \operatorname{arctanh}(r) \pm \frac{1.96}{\sqrt{n-3}} \right]
$$

where arctanh is the Fisher transformation. Once transformed, the sampling distribution of the estimate is approximately normal, so a 95% CI is found by taking the transformed estimate and adding and subtracting 1.96 times its standard error. The standard error is approximately equal to $(n-3)^{-1/2}$.

2. Spearman’s Rank-Order Correlation

In 1904 Charles Spearman developed a nonparametric version of the Pearson correlation, often
referred to as Spearman’s ρ, which can be expressed as:

$$
\rho = \frac{\sum_{i=1}^{n} (R_i - \bar{R})(Q_i - \bar{Q})}{\sqrt{\sum_{i=1}^{n} (R_i - \bar{R})^2 \, \sum_{i=1}^{n} (Q_i - \bar{Q})^2}}
$$

where the samples X and Y have been replaced with the corresponding rank vectors R and Q of length n. The trick here is that, given the n elements of the vector X, we rank them, say, from 1 to n, where 1 is the highest rank (score) and n is the lowest. In the case where we have two (or more) identical values, we take the average of the ranks that they would otherwise have occupied.
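In Python, such average-rank assignment is readily available, e.g. via scipy.stats.rankdata (a short illustration of our own; note that SciPy assigns rank 1 to the smallest value, which does not affect ρ as long as both samples are ranked the same way):

import numpy as np
from scipy.stats import rankdata, spearmanr

x = np.array([3.1, 7.4, 7.4, 1.0, 5.2])
y = np.array([10.0, 2.5, 8.8, 0.3, 9.9])

# the two tied 7.4's receive the average of the ranks they would occupy: 2, 4.5, 4.5, 1, 3
print(rankdata(x))

# Spearman's rho is simply Pearson's r computed on the rank vectors
rho_manual = np.corrcoef(rankdata(x), rankdata(y))[0, 1]
print(rho_manual, spearmanr(x, y)[0])   # both numbers agree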

Considering the correlation ρ based on rank vectors frees us from examining the actual values in the X and Y samples. In contrast to r, the coefficient ρ quantifies strictly monotonic relationships between two variables. Ranking converts a nonlinear, strictly monotonic relationship into a linear one which, in addition, helps to eliminate the bad influence of nasty outliers present in X or Y.
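A toy example of our own making illustrates both points: for a strictly monotonic but nonlinear relationship, ρ equals 1 while r stays below 1, and a single extreme outlier ruins r while barely moving ρ:

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(0, 10, 50)

# strictly monotonic but nonlinear relationship: rho = 1, r < 1
y = np.exp(0.5 * x)
print('no outlier : r = %.3f, rho = %.3f' % (pearsonr(x, y)[0], spearmanr(x, y)[0]))

# one extreme outlier drags r towards zero, while rho remains close to 1
y[0] = 1e4
print('one outlier: r = %.3f, rho = %.3f' % (pearsonr(x, y)[0], spearmanr(x, y)[0]))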

According to the review of Bonnett and Wright (2000), the ranks used in Spearman's method clearly do not follow a Normal distribution. The consequence is that the variance of the Fisher transformation (ζ) is not well approximated by $(n-3)^{-1}$. The following estimator of the variance has been proposed:

$$
\sigma^2_{\zeta} = \frac{1 + \rho^2/2}{n - 3}
$$
from which a (1 − α) confidence interval for Spearman's ρ can be estimated as:

$$
\tanh\!\left[ \operatorname{arctanh}(\rho) \pm z_{\alpha/2} \sqrt{\frac{1 + \rho^2/2}{n - 3}} \right]
$$

where $z_{\alpha/2}$ is the upper α/2 critical value of the Standard Normal distribution (e.g. 1.96 for α = 0.05).
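Mirroring the Pearson example above, a minimal sketch of the Spearman interval based on the Bonnett-Wright variance could look as follows (synthetic data, illustrative only):

import numpy as np
from scipy.stats import spearmanr, norm

np.random.seed(2)
n = 250
x = np.random.randn(n)
y = 0.4 * x + np.random.randn(n)

rho = spearmanr(x, y)[0]

# Bonnett-Wright standard error of the Fisher-transformed rho
se = np.sqrt((1 + rho**2 / 2) / (n - 3))
z = norm.ppf(1 - 0.05 / 2)              # 95% confidence level
lo, hi = np.tanh(np.arctanh(rho) + np.array([-z, z]) * se)

print('rho = %.4f  95%% CI = [%.4f, %.4f]' % (rho, lo, hi))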

Needless to say, Spearman's correlation coefficient satisfies −1 ≤ ρ ≤ 1, where −1, 0, and 1 represent a strong negative, a very weak, and a very strong positive monotonic correlation between the two vectors X and Y, respectively.

3. Czekanowski Index-based Similarity Measure

In order to quantify the likeness between two biological samples, Jan Czekanowski (1909, 1913) developed a metric used to quantify the amount of overlap two (or more) vectors have with each other. This metric is known today as the Czekanowski Index, but also as the proportional similarity index, and it is a quantitative version of a presence-absence similarity index (the Sørensen index).

Given two vectors X and Y, the Czekanowski Index is defined as:
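In the (eco)biological literature, the index for two non-negative vectors is most commonly quoted as

$$
CZ(X, Y) = \frac{2 \sum_{i=1}^{n} \min(X_i, Y_i)}{\sum_{i=1}^{n} (X_i + Y_i)}
$$

i.e. the quantitative counterpart of the Sørensen index mentioned above, ranging from 0 (no overlap) to 1 (identical vectors). A minimal Python sketch of this standard form (our own illustration; the function name is arbitrary and non-negative inputs are assumed):

import numpy as np

def czekanowski_index(x, y):
    # proportional similarity of two non-negative vectors: 1 = identical, 0 = disjoint
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return 2.0 * np.sum(np.minimum(x, y)) / np.sum(x + y)

a = np.array([1.0, 2.0, 3.0])
print(czekanowski_index(a, a))              # 1.0 for identical vectors
print(czekanowski_index(a, np.zeros(3)))    # 0.0 for fully disjoint vectors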


