
Applied Machine Learning (AML): Correlation Analysis
M. El Hamly, Ing., Ph.D.
Dec 2021
Overview
➤ Part A: Linear Correlation
o Linear Correlation
o Anscombe’s Quartet
➤ Part B: Other packages
o Package: ‘corrr’
o Package: ‘correlation’


Part A: Linear Correlation


Pearson correlation coefficient
➤ Formula: for a sample of n paired observations (x_i, y_i),

  r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

where \bar{x} and \bar{y} are the sample means.
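As a quick sanity check, the definition can be computed by hand in R and compared with the built-in cor() (a minimal sketch on simulated data):

  # Simulated sample
  set.seed(1)
  x <- rnorm(50)
  y <- 2 * x + rnorm(50)

  # Pearson's r computed from the definition...
  r.manual <- sum((x - mean(x)) * (y - mean(y))) /
    sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

  # ...agrees with the built-in function
  all.equal(r.manual, cor(x, y))   # TRUE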


r ➤ Mathematical properties
• The sample and population Pearson correlation coefficients both lie on or between −1 and 1: |r| ≤ 1.
• Correlations equal to +1 or −1 correspond to data points lying exactly on a line (in the case of the sample correlation), or to a bivariate distribution entirely supported on a line (in the case of the population correlation).
• The Pearson correlation coefficient is symmetric: corr(X, Y) = corr(Y, X).

• A key mathematical property of the Pearson correlation coefficient is that it is invariant under separate changes in location and scale in the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b, d > 0, without changing the correlation coefficient. (This holds for both the population and sample Pearson correlation coefficients.) A quick demonstration follows.
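A minimal R sketch of this invariance (the constants a, b, c, d are illustrative):

  set.seed(2)
  x <- rnorm(100)
  y <- x + rnorm(100)

  # Affine transforms with positive slopes leave r unchanged
  a <- 3; b <- 2; c <- -1; d <- 0.5
  cor(x, y)
  cor(a + b * x, c + d * y)   # identical to cor(x, y)

  # A negative slope flips the sign instead
  cor(a - b * x, y)           # equals -cor(x, y)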



r ➤ Interpretation
• The correlation coefficient ranges from −1 to 1.
• An absolute value of exactly 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line. The sign of the correlation matches the sign of the regression slope: a value of +1 implies that all data points lie on a line for which Y increases as X increases, while a value of −1 implies a line for which Y decreases as X increases. A value of 0 implies that there is no linear dependency between the variables.

• More generally, note that (Xi − X̄)(Yi − Ȳ) is positive if and only if Xi and Yi lie on the same side of their respective means. Thus the correlation coefficient is positive if Xi and Yi tend to be simultaneously greater than, or simultaneously less than, their respective means. The correlation coefficient is negative (anti-correlation) if Xi and Yi tend to lie on opposite sides of their respective means. Moreover, the stronger either tendency is, the larger the absolute value of the correlation coefficient.
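This sign argument is easy to check numerically (a small sketch):

  set.seed(3)
  x <- rnorm(200)
  y <- x + rnorm(200)

  # The sign of the mean cross-product matches the sign of r
  cross <- (x - mean(x)) * (y - mean(y))
  sign(mean(cross)) == sign(cor(x, y))   # TRUE

  # r itself is the average cross-product of the standardized variables
  n <- length(x)
  sum(scale(x) * scale(y)) / (n - 1)     # equals cor(x, y)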
Linear Correlation
➤ Script: cor1.R
• Compute the correlations:
• cor(x, y)
• cor(x, a*x + b) with a < 0, a > 0.
• Interpretation of r = +1 or r = −1.
• Compute the anomalies:
• ano <- function(x) x - mean(x)
• nor <- function(x) (x - mean(x))/sd(x)
• Note: use the apply() function for matrices.
• cor(x.ano, y) ; cor(x.ano, y.ano)
• cor(x.nor, y) ; cor(x.nor, y.nor)
• cor(x.ano, y.nor)
• x.nor: the normalized or standardized anomaly (i.e., the centered & scaled variable). A runnable sketch follows.
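The script cor1.R itself is not shown; a minimal sketch along the lines of the slide (variable names follow the slide's notation):

  # Anomaly (centered) and standardized (centered & scaled) versions
  ano <- function(x) x - mean(x)
  nor <- function(x) (x - mean(x)) / sd(x)

  set.seed(4)
  x <- rnorm(100)
  y <- 5 - 2 * x + rnorm(100)

  cor(x, 3 * x + 1)    # +1: exact linear relationship with a > 0
  cor(x, -3 * x + 1)   # -1: exact linear relationship with a < 0

  # Centering and/or scaling does not change the correlation
  x.ano <- ano(x); y.ano <- ano(y)
  x.nor <- nor(x); y.nor <- nor(y)
  cor(x.ano, y); cor(x.ano, y.ano)
  cor(x.nor, y); cor(x.nor, y.nor)
  cor(x.ano, y.nor)    # all equal to cor(x, y)

  # For a matrix, apply the functions column by column
  m.ano <- apply(cbind(x, y), 2, ano)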



Linear Correlation
• Compute r = cor(x, y), where y = x² or y = exp(x).
• Note: if r = 0.6, then r² = 0.36, so x explains only 36% of the total variance of y.
• Likewise, y explains only 36% of the total variance of x (since cor(x, y) = cor(y, x)).
➤ Very important note:
• The (Pearson) correlation coefficient r can only measure the linear relationship between x and y. It cannot capture any nonlinearity that may exist between x and y. Hence the name: linear correlation coefficient. A quick demonstration follows.
• ➤ Computing the statistical significance level (next slide).
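A small sketch of how r under-reports a perfect but nonlinear dependence:

  set.seed(5)
  x <- rnorm(1000)

  # y is a deterministic function of x, yet r stays far from 1
  cor(x, x^2)      # near 0: a symmetric, nonlinear relationship
  cor(x, exp(x))   # well below 1 despite perfect monotone dependence

  # A rank-based coefficient recovers the monotone case
  cor(x, exp(x), method = "spearman")   # exactly 1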



Significance Level
➤ Script: sig.lev.f.R
• To compute the significance level using the Student’s t-distribution.
• x <- rnorm(1e3)
• y <- exp(x)
• r <- cor(x, y) # correlation coefficient between x and y
• n <- length(x) # n is the sample size
• s <- 100*sig.lev.f(n, r)
• Q1: What happens to s when r increases (with n held constant)?
• Q2: What happens to s when n increases (with r held constant)?
• Conclusions? (A sketch of sig.lev.f() follows.)
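The script sig.lev.f.R is not provided. Below is a hypothetical reconstruction, assuming the standard t-test for Pearson's r, where t = r·√((n − 2)/(1 − r²)) follows a Student's t-distribution with n − 2 degrees of freedom under H0: ρ = 0; the function name matches the slide, but the two-sided convention is an assumption:

  # Hypothetical sketch of sig.lev.f(): two-sided significance level for r
  sig.lev.f <- function(n, r) {
    t.stat <- r * sqrt((n - 2) / (1 - r^2))  # t statistic for Pearson's r
    2 * pt(-abs(t.stat), df = n - 2)         # two-sided p-value
  }

  x <- rnorm(1e3)
  y <- exp(x)
  r <- cor(x, y)
  n <- length(x)
  s <- 100 * sig.lev.f(n, r)   # significance level, in percent

  # Q1/Q2: s shrinks as |r| grows (n fixed) and as n grows (r fixed)
  100 * sig.lev.f(30, 0.3); 100 * sig.lev.f(30, 0.6)    # Q1
  100 * sig.lev.f(30, 0.3); 100 * sig.lev.f(300, 0.3)   # Q2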



Be careful!
• Hence the common saying: “correlation does not imply causation.”
• A strong correlation might indicate causality, but there could easily be other explanations: it may be the result of random chance, where the variables appear to be related but there is no true underlying relationship.
➤ Why does a strong correlation not imply causation?
• “Correlation is not causation” means that just because two things correlate does not necessarily mean that one causes the other.
• A correlation between two things can be caused by a third factor that affects both of them.



Causal Analysis
• Causal analysis is the field of experimental design and statistics pertaining to establishing cause and effect.
• For any two correlated events, A and B, the possible relationships include:
1. A causes B (direct causation);
2. B causes A (reverse causation);
3. A and B are both caused by C (common causation);
4. A causes B and B causes A (bidirectional or cyclic causation);
5. There is no connection between A and B; the correlation is a coincidence.
• Thus no conclusion can be drawn about the existence or the direction of a cause-and-effect relationship from the mere fact that A and B are correlated. Determining whether there is an actual cause-and-effect relationship requires further investigation, even when the relationship between A and B is statistically significant, a large effect size is observed, or a large part of the variance is explained.



Causal Analysis
➤ A third factor C (the common-causal variable) causes both A and B
• Example:
• As ice cream sales increase, the rate of drowning deaths increases sharply.
• Therefore, ice cream consumption causes drowning.
• This example fails to recognize the importance of time of year and temperature to ice cream sales. Ice cream is sold at a much greater rate during the hot summer months than during colder times, and it is during these hot summer months that people are more likely to engage in activities involving water, such as swimming. The increased drowning deaths are simply caused by greater exposure to water-based activities, not by ice cream. The stated conclusion is false.
Spurious Correlations
• Tyler Vigen has an interesting page on his website that visualizes spurious correlations. One example there shows a strong positive linear correlation between U.S. spending on science, space and technology and suicides by hanging, strangulation and suffocation.
• Source: http://www.tylervigen.com/spurious-correlations



Anscombe’s Quartet


Anscombe’s Quartet
➤ Script: anscombe.R
• Anscombe’s quartet comprises four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed.
• Each dataset consists of eleven (x, y) points.
• They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it, and the effect of outliers and other influential observations on statistical properties.
• He described the article as being intended to counter the impression among statisticians that “numerical calculations are exact, but graphs are rough.”
• N.B. All four sets are identical when examined using simple summary statistics, but vary considerably when graphed, as the sketch below reproduces.
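The script anscombe.R is not shown; a minimal sketch using the anscombe data frame that ships with base R (datasets package):

  data(anscombe)

  # Identical summary statistics across the four datasets...
  stats <- sapply(1:4, function(i) {
    x <- anscombe[[paste0("x", i)]]
    y <- anscombe[[paste0("y", i)]]
    c(mean.x = mean(x), sd.x = sd(x),
      mean.y = mean(y), sd.y = sd(y),
      r = cor(x, y))
  })
  round(stats, 2)   # every column agrees; r is about 0.816 in all four

  # ...but four radically different pictures
  op <- par(mfrow = c(2, 2))
  for (i in 1:4) {
    x <- anscombe[[paste0("x", i)]]
    y <- anscombe[[paste0("y", i)]]
    plot(x, y, main = paste("Anscombe", i))
    abline(lm(y ~ x), col = "red")   # essentially the same line each time
  }
  par(op)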
Anscombe’s Quartet
• Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
[Figure: scatter plots of the four Anscombe datasets, each with its fitted regression line]
Anscombe’s Quartet
➤ The Anscombe dataset
• In 1973, statistician Francis Anscombe created a synthetic dataset that has been used ever since to illustrate concepts related to correlation and regression.
➤ Anscombe
• Here we see 4 datasets, each having a very different graphical presentation.
• However, these datasets have the same number of points, the same mean and standard deviation in both x and y, the same correlation, and the same regression line.
• How is it possible that 4 datasets with such obvious visual differences could have so many identical important properties?



Anscombe’s Quartet
➤ Anscombe 1
• The first dataset looks the most like “real data”.
• Here we see a positive, linear, moderately strong relationship.
• That is, this scatter plot appears to show a simple linear relationship between two correlated variables, where y could be modeled as Gaussian with mean linearly dependent on x.
• As you gain experience working with real data, you will see many scatter plots that look similar to this one. Here, the correlation coefficient of 0.82 gives us an accurate measurement of the strength of the linear relationship between x and y.



Anscombe’s Quartet
➤ Anscombe 2
• However, in the second dataset we see something quite different.
• The relationship here is clearly nonlinear.
• That is, the data are not normally distributed; while a relationship between the two variables is obvious, it is not linear, and the Pearson correlation coefficient r (r = 0.82) is not relevant. A more general regression and the corresponding coefficient of determination would be more appropriate.



Anscombe’s Quartet
➤ Anscombe 3
• In the third dataset, we have an outlier.
• Here again, our eyes tell us that, with the exception of the outlier, the correlation between x and y is perfect (r = 1), and in this case it is linear.
• That is, in this graph the distribution is linear, but it should have a different regression line (a robust regression would have been called for; see the sketch below).
• However, the presence of the outlier has lowered the value of the correlation coefficient and, as you will learn later, the slope of the regression line.
• That is, the calculated regression is offset by the one outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816.
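A minimal sketch of that robust alternative on the third dataset, using rlm() from the MASS package (the slides do not name a package, so this choice is an assumption):

  library(MASS)   # provides rlm(), a robust linear model

  x <- anscombe$x3
  y <- anscombe$y3

  coef(lm(y ~ x))    # ordinary least squares: pulled toward the outlier
  coef(rlm(y ~ x))   # robust fit: close to the line through the other points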



Anscombe’s Quartet
➤ Anscombe 4 (influential obs.)
• Finally, in the fourth dataset we see a pathological example with almost the opposite problem.
• In the previous plot, we saw that the presence of a single outlier had lowered the correlation from 1 to 0.82.
• In this plot, the presence of a single outlier raises the correlation from undefined to 0.82. That is, this plot shows an example where one high-leverage point is enough to produce a high correlation coefficient.
• Note that here, with the exception of the outlier, knowing x doesn’t tell you anything about y. It is only the outlier that gives you the appearance of a correlation, but that correlation is specious.



Anscombe’s Quartet ➤ Conclusion
• The Anscombe datasets are fabricated, but they help to illustrate some properties of correlation.
• Above all else, these examples underscore the importance of visually inspecting your data!
• You never know what you are going to see.
• The quartet is still often used to illustrate the importance of looking at a set of data graphically before analyzing it according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic datasets.



Package: ‘corrr’



Package: ‘corrr’
• corrr: Correlations in R
• A tool for exploring correlations.
• It makes it possible to easily perform routine tasks when exploring correlation matrices, such as ignoring the diagonal, focusing on the correlations of certain variables against others, or rearranging and visualizing the matrix in terms of the strength of the correlations.
• Published: 2022-08-16
• Authors: Max Kuhn, Simon Jackson, Jorge Cimentada
• https://cran.r-project.org/package=corrr
➤ See script: corrr_AQ.R



Package: ‘corrr’
• corrr is a package for exploring correlations in R.
• It focuses on creating and working with data frames of correlations (instead of matrices) that can be easily explored via corrr functions or by leveraging tools like those in the tidyverse.
• This, along with the primary corrr functions, is represented below:
[Figure: diagram of the corrr workflow and its primary functions]



Package: ‘corrr’
➤ Using corrr
• Using corrr typically starts with correlate(), which acts like the base correlation function cor().
• It differs by defaulting to pairwise deletion and by returning a correlation data frame (cor_df) with the following structure (see the sketch after this list):
➤ a tbl with an additional class, cor_df;
➤ an extra “rowname” column;
➤ standardized variances (the matrix diagonal) set to missing values (NA) so they can be ignored.
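A minimal sketch of correlate() on a built-in dataset (mtcars is used here purely for illustration):

  library(corrr)

  d <- correlate(mtcars)   # pairwise deletion by default; returns a cor_df
  class(d)                 # includes "cor_df" on top of the usual tbl classes
  d                        # a variable-name column plus an NA diagonal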



Package: ‘corrr’ ➤ API
• The corrr API is designed with data pipelines in mind (e.g., to use %>% from the magrittr package).
• After correlate(), the primary corrr functions take a cor_df as their first argument and return a cor_df or tbl (or output like a plot).
• These functions serve one of 3 purposes (a pipeline sketch follows the list):
1. Internal changes (cor_df out):
➤ shave() the upper or lower triangle (set to NA);
➤ rearrange() the columns and rows based on correlation strengths.
2. Reshape structure (tbl or cor_df out):
➤ focus() on select columns and rows;
➤ stretch() into a long format.
3. Output/visualizations (console/plot out):
➤ fashion() the correlations for pretty printing;
➤ rplot() the correlations with shapes in place of the values;
➤ network_plot() the correlations in a network.
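A minimal pipeline sketch combining these functions (again on mtcars, an illustrative choice):

  library(corrr)
  library(magrittr)   # for %>%

  mtcars %>%
    correlate() %>%    # cor_df
    rearrange() %>%    # order variables by correlation strength
    shave() %>%        # set the redundant upper triangle to NA
    fashion()          # pretty console printout

  mtcars %>%
    correlate() %>%
    focus(mpg)         # correlations of every other variable with mpg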
Package: ‘correlation’



Package: ‘correlation’
• correlation: Methods for Correlation Analysis
• A lightweight package for computing different kinds of correlations, such as partial correlations, Bayesian correlations, multilevel correlations, polychoric correlations, biweight correlations, distance correlations, and more.
• It relies on the easystats ecosystem (Lüdecke, Waggoner & Makowski, 2019).
• Published: 2022-10-09
• Authors: Dominique Makowski et al.
• https://cran.r-project.org/package=correlation
➤ See script: correlation_AQ.R



Package: ‘correlation’ ➤ Features
• The correlation package can compute many different types of correlation, including:
• Pearson’s correlation, Spearman’s rank correlation,
• Kendall’s rank correlation, biweight midcorrelation,
• distance correlation, percentage bend correlation,
• Shepherd’s Pi correlation, Blomqvist’s coefficient,
• Hoeffding’s D, gamma correlation, Gaussian rank correlation,
• point-biserial and biserial correlation, Winsorized correlation,
• polychoric correlation, tetrachoric correlation,
• multilevel correlation.
• An overview and description of these correlation types is available here: https://easystats.github.io/correlation/articles/types.html
• Moreover, many of these correlation types are available as partial correlations or within a Bayesian framework, as the sketch below shows.
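A minimal sketch of the package's main entry point, correlation() (mtcars is an illustrative dataset; bayesian = TRUE additionally requires the BayesFactor package):

  library(correlation)

  correlation(mtcars)                        # Pearson by default
  correlation(mtcars, method = "spearman")   # rank correlation
  correlation(mtcars, partial = TRUE)        # partial correlations
  correlation(mtcars, bayesian = TRUE)       # Bayesian framework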