Quantitative Finance with Python, Applied Risk Management, and Crypto Finance

Czekanowski Index-Based Similarity as Alternative Correlation Measure in N-Asset Portfolio Analysis
October 27, 2021

In quantitative finance we are used to measuring direct linear correlations or non-linear
cross-bicorrelations among various time-series. For the former, by default, one adopts the
Pearson product-moment correlation coefficient to quantify a linear relationship between two
vectors. This is appropriate if the data follow a Gaussian distribution. Otherwise, rank correlation
methods need to be applied (e.g. Spearman's or Kendall's). A good diversification of assets kept in
the investment portfolio often benefits from correlation measures: we want to limit the risk of
losing too much due to highly correlated assets (those co-moving in the same direction). While
correlation measures of any kind are powerful tools in finance, could there be something better?

Remarkably, it is quantitative biology that brings a new weapon to the table. Biologists love
in-depth data analyses and have devoted a lot of time to the study of biological samples. The samples
can differ greatly in origin or composition; however, when it comes to counting, comparing and
classifying, the underlying mathematics shares, luckily, a common denominator with the analysis of
financial data samples.

In biology people are more inclined to talk about similarities, hence similarity metrics.
Similarity metrics, also referred to as correlation metrics, are applied to two or more objects
(e.g. DNA sequences). A similarity metric quantifies the association the objects have with
each other. This quantification can take a variety of forms, such as how often the objects
are involved in a similar process, how likely the objects are to appear in the same location, etc. The
value representing the quantified correlation is often referred to as the similarity coefficient, or the
correlation coefficient. The similarity coefficient is a real-valued number that describes to what
extent the objects are related.

In this article we take a closer look at a similarity measure commonly used in biology for vector
classification: the Czekanowski Index. We start from the theory of Pearson's and Spearman's
correlation coefficients, their assumptions and limitations. Next, we introduce the similarity measure
as an alternative tool for measuring correlation between two vectors (data samples) and show how to
adapt it to financial N-Asset Portfolio analysis. We develop Python code for a clear and concise
application and illustrate its functionality using a 39-Crypto-Asset Portfolio during the micro
flash-crash in May 2021. We conclude that the Average Rolling Similarity measure we propose may be
regarded as a strong alternative to classical correlation measures, beating their performance
in certain market conditions.

1. Galton-Pearson Product-Moment Correlation

The history of what we know today under the common term of correlation dates back to 1885. In
that year Sir Francis Galton, the British biometrician, was the first to refer to regression. He
published a bivariate scatterplot with normal isodensity lines, the first graph of correlation.

In 1888, Galton noted that r measures the closeness of the "co-relation" and suggested that r could
not be greater than 1 (although he had not yet recognised the idea of negative correlation). Seven
years later, Pearson (1895) developed the mathematical formula that is still most commonly used to
measure correlation, the Pearson product-moment correlation coefficient. From a historical point
of view, it seems more appropriate to call r the Galton-Pearson r:

$$
r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
$$

where X̄ and Ȳ are the sample means of X = {X1, … , Xn} and Y = {Y1, … , Yn},
respectively. As one can notice, the formula describes r as the centred and standardised sum
of cross-products of two variables. In Lord and Novick (1968) we find one of the first hints
that the Cauchy-Schwarz inequality can be used to show that the absolute
value of the numerator is less than or equal to the denominator. In other words, −1 ≤ r ≤ 1. Given
that, the Galton-Pearson r can be interpreted as the strength of the linear relationship between two
variables.
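
As a quick sanity check of the formula, the sketch below computes r directly as the centred, standardised sum of cross-products and compares it with NumPy's built-in estimate. The helper name `galton_pearson_r`, the seed and the synthetic data are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def galton_pearson_r(x, y):
    """Pearson's r as the centred, standardised sum of cross-products."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()                      # centre X on its sample mean
    dy = y - y.mean()                      # centre Y on its sample mean
    num = np.sum(dx * dy)                  # sum of cross-products
    den = np.sqrt(np.sum(dx**2) * np.sum(dy**2))
    return num / den

# Synthetic, linearly related data (arbitrary seed, illustration only)
rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y = 0.5 * x + rng.standard_normal(500)

r_manual = galton_pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

Both numbers should agree to floating-point precision, and by the Cauchy-Schwarz argument the result always lies in [−1, 1].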

A perfect correlation (anti-correlation) is obtained for r equal to +1 (−1), in which case the
data points lie on a straight line. Both extreme values are approached as the relationship
between the two variables tightens. Interestingly, r can be calculated as a measure of a linear
relationship without any assumptions. In practice, however, its interpretation depends on the data
sample we analyse and thus requires some minimum set of assumptions:

1. If X does not represent the whole population (due to random or non-uniform selection),
then the derived r might be biased or incorrect.

2. Both variables X and Y are continuous, jointly normally distributed, random variables.
They follow a bivariate normal distribution in the population from which they were sampled.

That is why it is so important to check the normality of both samples before calculating the r
coefficient. Why? Because if there is a relationship between jointly normally distributed data, this
relationship is always linear. Therefore, if what you observe in a scatter plot seems to lie close to
some curve (i.e. not exactly a straight line), the assumption of a bivariate normal distribution is
violated.
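
One possible way to run such a check is sketched below with the Shapiro-Wilk test from SciPy. This is an assumption on our side: the test examines each sample's marginal normality only, which is necessary but not sufficient for bivariate normality, and the helper `looks_normal` and the synthetic samples are hypothetical.

```python
import numpy as np
from scipy.stats import shapiro

def looks_normal(sample, alpha=0.05):
    """True if the Shapiro-Wilk test does not reject normality at level alpha."""
    stat, p_value = shapiro(sample)
    return bool(p_value > alpha)

# Synthetic samples (arbitrary seed): one Gaussian, one clearly skewed
rng = np.random.default_rng(42)
x = rng.standard_normal(500)
y = rng.exponential(size=500)

for name, sample in [("x", x), ("y", y)]:
    print(name, "passes marginal normality check:", looks_normal(sample))
```

If either sample fails the check, a rank-based coefficient is usually the safer choice.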

In the literature, various stratifications of r have been published. In terms of absolute
magnitude, it is safe to say that if 0.00 ≤ |r| < 0.10 the linear correlation is negligible, if
0.10 ≤ |r| < 0.40 it is weak, if 0.40 ≤ |r| < 0.70 it is moderate, if 0.70 ≤ |r| < 0.90 it is
strong, and if 0.90 ≤ |r| ≤ 1.00 it is very strong. However, because samples are
inevitably affected by chance, the observed correlation is not necessarily a good estimate
of the population correlation coefficient.
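
The thresholds quoted above can be wrapped in a small helper. This is a minimal sketch: the function name `correlation_strength` is hypothetical, and treating |r| = 1 as "very strong" is our assumption.

```python
def correlation_strength(r):
    """Map a correlation coefficient to the verbal label quoted in the text."""
    a = abs(r)                      # labels refer to absolute magnitude
    if a > 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if a < 0.10:
        return "negligible"
    if a < 0.40:
        return "weak"
    if a < 0.70:
        return "moderate"
    if a < 0.90:
        return "strong"
    return "very strong"            # assumption: includes |r| = 1

for r in (0.0073, -0.25, 0.6556, -0.95):
    print(r, correlation_strength(r))
```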

Therefore, it is additionally advised that the derived correlation coefficient r should always be
accompanied by a confidence interval, which provides the range of plausible values of the
coefficient in the population from which the data were sampled. By specifying the confidence level
at 1 − α, e.g. 95%, where α denotes the significance level, in Python we can derive both Pearson's r
and the corresponding confidence bounds as follows:

```python
from scipy.stats import pearsonr, norm
import numpy as np

np.random.seed(1)

for n in [10, 100, 1000, 10000]:
    x = np.random.randn(n)
    y = np.random.randn(n)

    # Pearson's r computed two equivalent ways (NumPy and SciPy)
    r1, r2 = np.corrcoef(x, y)[0, 1], pearsonr(x, y)[0]
    # r1 == r1 is False only when r1 is NaN (e.g. a constant sample)
    if r1 == r1:
        r = r1
    else:
        r = np.nan

    r_z = np.arctanh(r)             # Fisher transformation
    se = 1 / np.sqrt(n - 3)         # standard error in the transformed space
    cl = 0.95                       # confidence level
    alpha = 1 - cl
    z = norm.ppf(1 - alpha / 2)     # two-sided critical value (about 1.96)
    lo_z, hi_z = r_z - z * se, r_z + z * se
    lo, hi = np.tanh((lo_z, hi_z))  # back-transform to the r scale

    print('n = %5g: r = %.10f [%13.10f,%13.10f]' % (n, r, lo, hi))
```

where the X and Y samples, in this example, are drawn randomly from the standard normal
distribution and the results are examined as a function of growing sample size:

```
n =    10: r = 0.6556177144 [ 0.0442628668, 0.9097178165]
n =   100: r = 0.1659786467 [-0.0314652772, 0.3509551991]
n =  1000: r = 0.0216530956 [-0.0403942097, 0.0835340468]
n = 10000: r = 0.0072949196 [-0.0123069100, 0.0268911447]
```

This brings us to the key observation we need to remember here. For samples of size n = 10 and
n = 10000, respectively, the 95% confidence intervals carry very different degrees of uncertainty:

$$
r(n = 10) = 0.6556^{+0.2541}_{-0.6114},
\qquad
r(n = 10000) = 0.0073^{+0.0196}_{-0.0196}
$$

Therefore, the estimate of r can be very misleading if given without the corresponding confidence
interval.
In the code, we used the fact that a 95% confidence interval is given by

$$
\tanh\left[\operatorname{arctanh}(r) \pm \frac{1.96}{\sqrt{n - 3}}\right]
$$

where arctanh is the Fisher transformation. Once transformed, the sampling distribution of the
estimate is approximately normal, so a 95% CI is found by taking the transformed estimate and
adding and subtracting 1.96 times its standard error. The standard error is approximately equal to
$(n - 3)^{-1/2}$.

2. Spearman’s Rank-Order Correlation

In 1904 Charles Spearman developed a nonparametric version of the Pearson correlation, often
referred to as Spearman’s ρ, which can be expressed as:

$$
\rho = \frac{\sum_{i=1}^{n} (R_i - \bar{R})(Q_i - \bar{Q})}
            {\sqrt{\sum_{i=1}^{n} (R_i - \bar{R})^2 \, \sum_{i=1}^{n} (Q_i - \bar{Q})^2}}
$$

where the samples X and Y have been replaced with the corresponding rank vectors R and Q
of length n. The trick here is that, given the n elements of the vector X, we rank them, say, from 1
to n, where 1 is the highest rank (score) and n is the lowest one. When two (or more) values are
identical, we assign each of them the average of the ranks they would otherwise have occupied.
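
The ranking step, including the averaging of tied ranks, can be sketched with `scipy.stats.rankdata`. The toy vectors below are made up for illustration. Since `rankdata` assigns rank 1 to the smallest value, we rank the negated data to follow the convention used here (1 = highest); ρ does not depend on this choice as long as both vectors are ranked in the same direction.

```python
import numpy as np
from scipy.stats import rankdata, pearsonr, spearmanr

# Toy data (illustrative only); X contains a tie at 3.1
x = np.array([3.1, 1.2, 3.1, 0.7, 5.4])
y = np.array([2.0, 0.9, 2.5, 1.1, 4.8])

# Rank 1 = highest value; tied values share the average of their ranks
R = rankdata(-x, method="average")
Q = rankdata(-y, method="average")
print("R =", R)
print("Q =", Q)

# Spearman's rho is Pearson's r applied to the rank vectors
rho_manual, _ = pearsonr(R, Q)
rho_scipy, _ = spearmanr(x, y)
print(rho_manual, rho_scipy)
```

The two tied values of 3.1 would have occupied ranks 2 and 3, so each receives 2.5.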

Basing the correlation ρ on rank vectors frees us from examining the actual
values in the X and Y samples. In contrast to r, the coefficient ρ quantifies strictly monotonic
relationships between two variables. Ranking converts a nonlinear, strictly monotonic relationship
into a linear one, which in addition helps to eliminate the bad influence of nasty outliers present
in X or Y.
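
Both effects can be seen in a small experiment. The sketch below uses made-up data: an exponential (nonlinear but strictly increasing) relationship, and a perfectly linear series contaminated with a single extreme outlier.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Case 1: nonlinear but strictly increasing relationship
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)
r_mono, _ = pearsonr(x, y)
rho_mono, _ = spearmanr(x, y)
print(f"monotonic: r = {r_mono:.4f}, rho = {rho_mono:.4f}")

# Case 2: perfectly linear data with one extreme outlier in Y
x2 = np.arange(20, dtype=float)
y2 = x2.copy()
y2[-1] = 1000.0   # the outlier keeps the ordering but distorts the scale
r_out, _ = pearsonr(x2, y2)
rho_out, _ = spearmanr(x2, y2)
print(f"outlier:   r = {r_out:.4f}, rho = {rho_out:.4f}")
```

In both cases the ordering of the points is preserved, so ρ stays at 1 while r falls noticeably below it.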

According to Bonett and Wright (2000), the ranks used in Spearman's
method clearly do not follow a normal distribution. As a consequence, the variance of
the Fisher-transformed estimate (ζ) is not well approximated by $(n - 3)^{-1}$. The following
estimator of the variance has been proposed:

$$
\sigma^2_{\zeta} = \frac{1 + \rho^2/2}{n - 3}
$$

so that the (1 − α) confidence interval for Spearman's ρ can be estimated as:

$$
\tanh\left[\operatorname{arctanh}(\rho) \pm z_{\alpha/2} \sqrt{\frac{1 + \rho^2/2}{n - 3}}\right]
$$

where $z_{\alpha/2}$ is the upper α/2-quantile of the standard normal distribution.
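
A minimal sketch of this interval in Python, mirroring the Pearson example earlier in the text, could look as follows. The function name `spearman_ci`, the seed and the synthetic data are our assumptions; the sketch also assumes |ρ| < 1, since arctanh diverges at ±1.

```python
import numpy as np
from scipy.stats import norm, spearmanr

def spearman_ci(x, y, cl=0.95):
    """Spearman's rho with a confidence interval based on the
    variance estimator (1 + rho^2/2) / (n - 3) quoted in the text."""
    rho, _ = spearmanr(x, y)
    n = len(x)
    se = np.sqrt((1.0 + rho**2 / 2.0) / (n - 3))   # std. error in Fisher space
    z = norm.ppf(1.0 - (1.0 - cl) / 2.0)           # upper alpha/2 quantile
    zeta = np.arctanh(rho)                         # Fisher transformation
    lo, hi = np.tanh((zeta - z * se, zeta + z * se))
    return rho, lo, hi

# Synthetic, positively related samples (arbitrary seed, illustration only)
rng = np.random.default_rng(1)
x = rng.standard_normal(200)
y = 0.6 * x + 0.8 * rng.standard_normal(200)

rho, lo, hi = spearman_ci(x, y)
print("rho = %.4f [%.4f, %.4f]" % (rho, lo, hi))
```

Compared with the Pearson case, the only change is the inflated standard error, which widens the interval as |ρ| grows.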

Needless to say, Spearman's correlation coefficient satisfies −1 ≤ ρ ≤ 1, where −1, 0 and 1
represent a perfect negative, an absent, and a perfect positive monotonic association between the
two vectors X and Y, respectively.

3. Czekanowski Index-based Similarity Measure

In order to quantify the likeness between two biological samples, Jan Czekanowski (1909, 1913)
developed a metric that quantifies the amount of set intersection two (or more)
vectors have with each other. This metric is known today as the Czekanowski Index, but also as the
proportional similarity index, and is a quantitative version of the presence-absence similarity
index (or Sørensen index).

Given two vectors X and Y, the Czekanowski Index is defined as:

