
Czekanowski Index-Based Similarity as Alternative Correlation Measure in N-Asset Portfolio Analysis

October 27, 2021 by Pawel

In quantitative finance we are used to measuring direct linear correlations or non-linear cross-bicorrelations among various time-series. For the former, by default, one adopts the Pearson product-moment correlation coefficient to quantify a linear relationship between two vectors. This is valid if the data follow a Gaussian distribution. Otherwise, rank correlation methods need to be applied (e.g. Spearman's or Kendall's). Good diversification of the assets kept in an investment portfolio often relies on correlation measures: we want to limit the risk of losing too much due to highly correlated (co-moving in the same direction) assets. While correlation measures of any kind are powerful tools in finance, can there be something better than that?

Remarkably, it is quantitative biology that brings a new weapon to the table. Biologists love in-depth data analyses and have devoted a lot of time to the study of biological samples. The samples can differ widely in their origin or composition; however, when it comes to counting, comparing and classifying, the underlying mathematical language luckily shares a common denominator with the analysis of financial data samples.

In biology, people are more inclined to talk about similarities, and thus about similarity metrics. Similarity metrics (also referred to as correlation metrics) are applied to two or more objects (e.g. DNA sequences). The similarity metric quantifies the association the objects have with each other. This quantification can take a variety of forms, such as how often the objects are involved in a similar process, or how likely the objects are to appear in the same location. The value representing the quantified association is often referred to as the similarity coefficient, or the correlation coefficient. This similarity coefficient is a real-valued number that describes to what extent the objects are related.

In this article we will have a closer look at a similarity measure commonly used in biology for vector classification: the Czekanowski Index. We start from the theory of Pearson's and Spearman's correlation coefficients, their assumptions and limitations. Next, we introduce the similarity measure as an alternative tool for measuring correlation between two vectors (data samples) and show how to adopt it for financial N-Asset Portfolio analysis. We develop Python code for a clear and concise application and illustrate the functionality on a 39-Crypto-Asset Portfolio during the micro flash-crash in May 2021. We conclude that the Average Rolling Similarity measure we propose may be regarded as a strong alternative to classical correlation measures, beating their performance in certain market conditions.

1. Galton-Pearson Product-Moment Correlation

The history of what we know today under the common term of correlation can be dated back to 1885. In that year, Sir Francis Galton, the British biometrician, was the first to refer to regression. He published a bivariate scatterplot with normal isodensity lines, the first graph of correlation.

In 1888, Galton noted that r measures the closeness of the "co-relation" and suggested that r could not be greater than 1 (although he had not yet recognised the idea of negative correlation). Seven years later, Pearson (1895) developed the mathematical formula that is still most commonly used to measure correlation, the Pearson product-moment correlation coefficient. From a historical point of view, it seems more appropriate that r should be named the Galton-Pearson r:

$$
r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
$$

where X̄ and Ȳ are the sample means of X = {X₁, …, Xₙ} and Y = {Y₁, …, Yₙ}, respectively. As one can notice, the formula describes r as the centred and standardised sum of cross-products of two variables. In Lord and Novick (1968) we can find one of the first hints that the Cauchy-Schwarz inequality may be used to show that the absolute value of the numerator is less than or equal to the denominator, in other words that −1 ≤ r ≤ 1. Given that, the Galton-Pearson r can be interpreted as the strength of the linear relationship between two variables.
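As a quick sanity check, the formula can be coded directly in a few lines of NumPy (a minimal sketch of our own; the helper name pearson_r is arbitrary) and compared against np.corrcoef:

import numpy as np

def pearson_r(x, y):
    # centred cross-products divided by the product of the sample norms
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

x = np.random.randn(250)
y = 0.5 * x + np.random.randn(250)

print(pearson_r(x, y), np.corrcoef(x, y)[0, 1])  # both numbers agree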

A perfect correlation (anti-correlation) is obtained for r equal to +1 (−1), respectively, in which case the data points lie on a straight line. Both extreme values are approached as the relationship between the two variables tightens. Interestingly, r can be calculated as a measure of a linear relationship without any assumptions. In practice, however, everything depends on the data sample we analyse, and some minimum level of assumptions is required:

1. If X does not represent the whole population (e.g. due to non-random or non-uniform selection), then the derived r might be biased or incorrect.

2. Both variables X and Y are continuous, jointly normally distributed, random variables, i.e. they follow a bivariate normal distribution in the population from which they were sampled.

That is why it is so important to check the normality of both samples before calculating the r coefficient. Why? Because if there is a relationship between jointly normally distributed data, this relationship is always linear. Therefore, if what you observe in a scatter plot seems to lie close to some curve (i.e. not exactly a straight line), the assumption of a bivariate normal distribution is violated.

In the literature, various stratifications for r have been published on many occasions. It is safe to say that if 0.00 < r < 0.10 the linear correlation is negligible, if 0.10 ≤ r < 0.40 it is weak, if 0.40 ≤ r < 0.70 it is moderate, if 0.70 ≤ r < 0.90 it is strong, and if 0.90 ≤ r < 1.00 it is very strong (in terms of absolute magnitude, of course). However, because samples are inevitably affected by chance, the observed correlation may not necessarily be a good estimate of the population correlation coefficient.
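If needed, this rule of thumb can be wrapped in a small helper (our own illustrative function; the thresholds are exactly those quoted above):

def correlation_strength(r):
    # qualitative label for the magnitude of a linear correlation coefficient
    a = abs(r)
    if a < 0.10:
        return 'negligible'
    elif a < 0.40:
        return 'weak'
    elif a < 0.70:
        return 'moderate'
    elif a < 0.90:
        return 'strong'
    else:
        return 'very strong'

print(correlation_strength(-0.55))   # 'moderate'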

Therefore, it is additionally advised that the derived correlation coefficient r should always be accompanied by a confidence interval, which provides the range of plausible values of the coefficient in the population from which the data were sampled. By specifying the confidence level at 1 − α, e.g. 95%, where α denotes the significance level, in Python we can derive both Pearson's r and the corresponding confidence brackets as follows:

from scipy.stats import pearsonr, norm
import numpy as np

np.random.seed(1)

for n in [10, 100, 1000, 10000]:
    x = np.random.randn(n)
    y = np.random.randn(n)

    # Pearson's r computed in two ways; keep it only if both agree
    r1, r2 = np.corrcoef(x, y)[0, 1], pearsonr(x, y)[0]
    if np.isclose(r1, r2):
        r = r1
    else:
        r = None

    # 95% confidence interval via the Fisher transformation
    r_z = np.arctanh(r)
    se = 1 / np.sqrt(n - 3)
    cl = 0.95  # confidence level
    alpha = 1 - cl
    z = norm.ppf(1 - alpha / 2)
    lo_z, hi_z = r_z - z * se, r_z + z * se
    lo, hi = np.tanh((lo_z, hi_z))

    print('n = %5g: r = %.10f [%13.10f,%13.10f]' % (n, r, lo, hi))

where the X and Y samples, in this example, are drawn randomly from the standard Normal distribution and the results are examined as a function of growing sample size:

n =    10: r = 0.6556177144 [ 0.0442628668, 0.9097178165]
n =   100: r = 0.1659786467 [-0.0314652772, 0.3509551991]
n =  1000: r = 0.0216530956 [-0.0403942097, 0.0835340468]
n = 10000: r = 0.0072949196 [-0.0123069100, 0.0268911447]

This brings us to a key observation we need to remember here. For samples of size n = 10 and n = 10000, respectively, the 95% confidence intervals carry different degrees of uncertainty:

$$
r(n = 10) = 0.6556^{+0.2541}_{-0.6114}, \qquad r(n = 10000) = 0.0073^{+0.0196}_{-0.0196} \; .
$$

Therefore, the estimation of r can be very misleading if given without the corresponding confidence
brackets.

In the code, we used the fact that a 95% confidence interval is given by

$$
\tanh\!\left[ \operatorname{arctanh}(r) \pm \frac{1.96}{\sqrt{n-3}} \right]
$$

where arctanh is the Fisher transformation. Once transformed, the sampling distribution of the estimate is approximately normal, so a 95% CI is found by taking the transformed estimate and adding and subtracting 1.96 times its standard error. The standard error is approximately equal to $(n-3)^{-1/2}$.

2. Spearman’s Rank-Order Correlation

In 1904 Charles Spearman developed a nonparametric version of the Pearson correlation, often
referred to as Spearman’s ρ, which can be expressed as:

$$
\rho = \frac{\sum_{i=1}^{n} (R_i - \bar{R})(Q_i - \bar{Q})}{\sqrt{\sum_{i=1}^{n} (R_i - \bar{R})^2 \, \sum_{i=1}^{n} (Q_i - \bar{Q})^2}}
$$

where the samples X and Y have been replaced with the corresponding rank vectors R and Q of length n. The trick here is that, given the n elements of the vector X, we rank them, say, from 1 to n, where 1 is the highest rank (score) and n is the lowest. In the case where we have two (or more) identical values, we take the average of the ranks that they would otherwise have occupied.
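In Python, such average-rank assignment is readily available, e.g. via scipy.stats.rankdata (a short illustration of our own; note that SciPy assigns rank 1 to the smallest value, which does not affect ρ as long as both samples are ranked the same way):

import numpy as np
from scipy.stats import rankdata, spearmanr

x = np.array([3.1, 7.4, 7.4, 1.0, 5.2])
y = np.array([10.0, 2.5, 8.8, 0.3, 9.9])

# the two tied 7.4's receive the average of the ranks they would occupy: 2, 4.5, 4.5, 1, 3
print(rankdata(x))

# Spearman's rho is simply Pearson's r computed on the rank vectors
rho_manual = np.corrcoef(rankdata(x), rankdata(y))[0, 1]
print(rho_manual, spearmanr(x, y)[0])   # both numbers agree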

Considering the correlation ρ based on rank vectors frees us from examining the actual values in the X and Y samples. In contrast to r, the coefficient ρ quantifies strictly monotonic relationships between two variables. Ranking converts a nonlinear, strictly monotonic relationship into a linear one which, in addition, helps to eliminate the bad influence of nasty outliers present in X or Y.
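A toy example of our own making illustrates both points: for a strictly monotonic but nonlinear relationship, ρ equals 1 while r stays below 1, and a single extreme outlier ruins r while barely moving ρ:

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(0, 10, 50)

# strictly monotonic but nonlinear relationship: rho = 1, r < 1
y = np.exp(0.5 * x)
print('no outlier : r = %.3f, rho = %.3f' % (pearsonr(x, y)[0], spearmanr(x, y)[0]))

# one extreme outlier drags r towards zero, while rho remains close to 1
y[0] = 1e4
print('one outlier: r = %.3f, rho = %.3f' % (pearsonr(x, y)[0], spearmanr(x, y)[0]))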

According to the review of Bonnett and Wright (2000), the ranks used in Spearman's method clearly do not follow a Normal distribution. The consequence is that the variance of the Fisher transformation (ζ) is not well approximated by $(n-3)^{-1}$. The following estimator of the variance has been proposed:

$$
\sigma^2_{\zeta} = \frac{1 + \rho^2/2}{n - 3}
$$
from which a (1 − α) confidence interval for Spearman's ρ can be estimated as:

$$
\tanh\!\left[ \operatorname{arctanh}(\rho) \pm z_{\alpha/2} \sqrt{\frac{1 + \rho^2/2}{n - 3}} \right]
$$

where $z_{\alpha/2}$ is the upper α/2 critical value of the Standard Normal distribution (e.g. 1.96 for α = 0.05).
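Mirroring the Pearson example above, a minimal sketch of the Spearman interval based on the Bonnett-Wright variance could look as follows (synthetic data, illustrative only):

import numpy as np
from scipy.stats import spearmanr, norm

np.random.seed(2)
n = 250
x = np.random.randn(n)
y = 0.4 * x + np.random.randn(n)

rho = spearmanr(x, y)[0]

# Bonnett-Wright standard error of the Fisher-transformed rho
se = np.sqrt((1 + rho**2 / 2) / (n - 3))
z = norm.ppf(1 - 0.05 / 2)              # 95% confidence level
lo, hi = np.tanh(np.arctanh(rho) + np.array([-z, z]) * se)

print('rho = %.4f  95%% CI = [%.4f, %.4f]' % (rho, lo, hi))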

Needless to say, Spearman's correlation coefficient satisfies −1 ≤ ρ ≤ 1, where −1, 0, and 1 represent a strong negative, a very weak, and a very strong positive monotonic correlation between the two vectors X and Y, respectively.

3. Czekanowski Index-based Similarity Measure

In order to quantify the likeness between two biological samples, Jan Czekanowski (1909, 1913) developed a metric used to quantify the amount of overlap two (or more) vectors have with each other. This metric is known today as the Czekanowski Index, but also as the proportional similarity index, and it is a quantitative version of a presence-absence similarity index (the Sørensen index).

Given two vectors X and Y, the Czekanowski Index is defined as:
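In the (eco)biological literature, the index for two non-negative vectors is most commonly quoted as

$$
CZ(X, Y) = \frac{2 \sum_{i=1}^{n} \min(X_i, Y_i)}{\sum_{i=1}^{n} (X_i + Y_i)}
$$

i.e. the quantitative counterpart of the Sørensen index mentioned above, ranging from 0 (no overlap) to 1 (identical vectors). A minimal Python sketch of this standard form (our own illustration; the function name is arbitrary and non-negative inputs are assumed):

import numpy as np

def czekanowski_index(x, y):
    # proportional similarity of two non-negative vectors: 1 = identical, 0 = disjoint
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return 2.0 * np.sum(np.minimum(x, y)) / np.sum(x + y)

a = np.array([1.0, 2.0, 3.0])
print(czekanowski_index(a, a))              # 1.0 for identical vectors
print(czekanowski_index(a, np.zeros(3)))    # 0.0 for fully disjoint vectors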


