CORRELATION AND COVARIANCE in R

Covariance
 The covariance of two variables x and y in a data set measures how the two are
linearly related. A positive covariance would indicate a positive linear relationship
between the variables, and a negative covariance would indicate the opposite.

 The sample covariance is defined in terms of the sample means $\bar{x}$, $\bar{y}$ as:

$$ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) $$

 Similarly, the population covariance is defined in terms of the population means $\mu_x$, $\mu_y$ as:

$$ \sigma_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y) $$
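 As a quick check of the definition, the sample covariance can be computed by hand and compared with R's built-in cov(); the two vectors below are made up for illustration:

 x <- c(2, 4, 6, 8, 10)
 y <- c(1, 11, 3, 33, 5)
 n <- length(x)
 # sum of cross-deviations over (n - 1), per the sample covariance formula
 sum((x - mean(x)) * (y - mean(y))) / (n - 1)   # 15
 cov(x, y)                                      # also 15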
Covariance in R programming

 In statistics, covariance is a measure of the relationship between two variables of a dataset. That is, it depicts the way two variables vary together.

 For instance, when two variables are highly positively correlated, the variables move in the same direction.

 Covariance is useful in data pre-processing prior to modelling in the domain of data science
and machine learning.

 In R programming, we use the cov() function to calculate the covariance between two vectors or data frames.
Example:
 We provide the following three parameters to the cov() function:

 x — the first vector
 y — the second vector
 method — the method used to compute the covariance: "pearson", "kendall", or "spearman". The default is "pearson".
 a <- c(2,4,6,8,10)

 b <- c(1,11,3,33,5)

 print(cov(a, b, method = "spearman"))
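 Under the hood, method = "spearman" computes the ordinary (Pearson) covariance of the ranks of the data. A quick sketch to confirm, reusing a and b from above:

 print(cov(rank(a), rank(b)))   # matches cov(a, b, method = "spearman"), here 1.25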
Problem

 Find the covariance of eruption duration and waiting time in the data set faithful.
Observe if there is any linear relationship between the two variables.
 Solution
 We apply the cov function to compute the covariance of eruptions and waiting.
 > duration = faithful$eruptions   # eruption durations 
> waiting = faithful$waiting      # the waiting period 
> cov(duration, waiting)          # apply the cov function 
[1] 13.978
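 The positive covariance indicates a positive linear relationship between eruption duration and waiting time. Its magnitude depends on the units of the variables, though; dividing by the two standard deviations gives the unit-free correlation, which cor() returns directly (values shown rounded):

> cov(duration, waiting) / (sd(duration) * sd(waiting)) 
[1] 0.90081 
> cor(duration, waiting) 
[1] 0.90081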
Correlation in R programming

 Statistically, correlation is a method of measuring the relationship between variables in terms of how they move together. That is, it helps us analyze the effect that changes in one variable have on another variable of the dataset.

 When two variables are highly (positively) correlated, we say that the variables carry largely the same information and have a similar effect on the other variables of the dataset.
Example

 The cor() function in R enables us to calculate the correlation between the variables of a data set or between two vectors.

 a <- c(2,4,6,8,10)

 b <- c(1,11,3,33,5)

 corr = cor(a,b)
 print(corr)

 print(cor(a, b, method = "spearman"))
Covariance to Correlation in R

 R provides the cov2cor() function to convert a covariance matrix into the corresponding correlation matrix.

 Note: the value passed to cov2cor() must be a square covariance matrix. This means cov() has to be called on a matrix or data frame; called on two separate vectors it returns a single number, which cov2cor() cannot handle.
Example:
 Here, we bind the two vectors a and b into a matrix so that cov() returns a square (2 x 2) covariance matrix. Then, using the cov2cor() function, we obtain the corresponding correlation matrix for every pair of variables.

 a <- c(2,4,6,8)

 b <- c(1,11,3,33)

 covar = cov(cbind(a, b))   # covariance matrix of the two columns
 print(covar)

 res = cov2cor(covar)
 print(res)
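 Since cov2cor() simply rescales a covariance matrix by its diagonal, the result should agree with calling cor() on the same data. A quick sanity check, reusing a and b from above:

 m <- cbind(a, b)
 all.equal(cov2cor(cov(m)), cor(m))   # TRUE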
Compute correlation matrix in R

 The R function cor() can be used to compute a correlation matrix. A simplified format of the function is:

 cor(x, method = c("pearson", "kendall", "spearman"))

 x: a numeric matrix or data frame.

 method: indicates which correlation coefficient is to be computed. The default is the pearson correlation coefficient, which measures the linear dependence between two variables. The kendall and spearman methods are non-parametric rank-based correlation tests.
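 One detail worth knowing: if the data contain missing values, cor() propagates NAs by default. Its use argument controls this; for example, use = "complete.obs" restricts the computation to rows with no missing values:

 cor(x, method = "pearson", use = "complete.obs")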
 Consider the following data frame of continuous variables:
 > set.seed(9)
 > x1<-rnorm(20)
 > x2<-rnorm(20,0.2)
 > x3<-rnorm(20,0.5)
 > x4<-rnorm(20,0.8)
 > x5<-rnorm(20,1)
 > df<-data.frame(x1,x2,x3,x4,x5)
 > df
 Finding the correlation matrix for all variables in df −
 > cor(df)
Example

 Import your data into R

 Prepare your data as specified here: Best practices for preparing your data set for R

 Save your data in an external .txt (tab-delimited) or .csv file

 Import your data into R as follows:

 # If .txt tab file, use this
 my_data <- read.delim(file.choose())

 # Or, if .csv file, use this
 my_data <- read.csv(file.choose())
 Here, we’ll use data derived from the built-in R data set mtcars as an example:

 # Load data
 data("mtcars")
 my_data <- mtcars[, c(1,3,4,5,6,7)]
 # print the first 6 rows
 head(my_data, 6)
Compute correlation matrix

 res <- cor(my_data)
 round(res, 2)
Correlation matrix with significance levels (p-value)

 The function rcorr() [in Hmisc package] can be used to compute the significance
levels for pearson and spearman correlations. It returns both the correlation
coefficients and the p-value of the correlation for all possible pairs of columns in
the data table.

 Simplified format:
 rcorr(x, type = c("pearson","spearman"))
 Install Hmisc package:
 install.packages("Hmisc")
 Use rcorr() function
 library("Hmisc")
 res2 <- rcorr(as.matrix(my_data))
 res2
 The output of the function rcorr() is a list containing the following elements:
 - r : the correlation matrix
 - n : the matrix of the number of observations used in analyzing each pair of variables
 - P : the p-values corresponding to the significance levels of the correlations.

 If you want to extract the p-values or the correlation coefficients from the output, use this:

 # Extract the correlation coefficients
 res2$r

 # Extract p-values
 res2$P
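 To browse these results more comfortably, the two matrices can be flattened into a table with one row per pair of variables. The helper below is an illustrative sketch (flattenCorrMatrix is our own name, not a function shipped with Hmisc):

 # Illustrative helper: one row per variable pair, with its correlation and p-value
 flattenCorrMatrix <- function(cormat, pmat) {
   ut <- upper.tri(cormat)
   data.frame(
     row    = rownames(cormat)[row(cormat)[ut]],
     column = rownames(cormat)[col(cormat)[ut]],
     cor    = cormat[ut],
     p      = pmat[ut]
   )
 }
 flattenCorrMatrix(res2$r, res2$P)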
Plot the correlation matrix of the iris data for an
overview of the relationships among the variables

 Step 1 - Load the relevant libraries

 library(ggplot2)
 library(tidyr)
 library(datasets)
 data("iris")
 summary(iris)
 Step 2 - Create a correlation matrix of the iris dataset using the DataExplorer correlation function. Include only continuous variables in your correlation plot to avoid confusion, as factor variables don't make sense in a correlation plot.
 library(DataExplorer)
 library(corrplot)
 plot_correlation(iris)
 Step 3 - Create three separate correlation matrices, one for each species of iris flower:
 str(iris)
 m<-levels(iris$Species)
 title0<-"Setosa"
 setosaCorr=cor(iris[iris$Species==m[1],1:4])
 corrplot(setosaCorr,method="number",title=title0,mar=c(0,0,1,0))
 versC=cor(iris[iris$Species==m[2],1:4])
 title1<-"versicolor"
 corrplot(versC,method="number",title=title1,mar=c(0,0,1,0))
 veriC<-cor(iris[iris$Species==m[3],1:4])
 title2<-"virginica"
 corrplot(veriC,method="number",title=title2,mar=c(0,0,1,0))
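 The three blocks above repeat the same pattern, so an equivalent, more compact version loops over the species levels (a sketch under the same corrplot settings):

 # Equivalent loop: one number-style corrplot per species
 for (sp in levels(iris$Species)) {
   speciesCorr <- cor(iris[iris$Species == sp, 1:4])
   corrplot(speciesCorr, method = "number", title = sp, mar = c(0, 0, 1, 0))
 }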
Ancova
 A simple regression analysis gives multiple results for each value of a categorical variable. In such a scenario, we can study the effect of the categorical variable by using it along with the predictor variable and comparing the regression lines for each level of the categorical variable. Such an analysis is termed Analysis of Covariance, or ANCOVA.

 ANCOVA is a type of general linear model (GLM) that includes at least one continuous and one categorical independent variable (treatment). ANCOVA is useful when the effect of treatments is important while there is an additional continuous variable in the study. ANCOVA was proposed by the British statistician Ronald A. Fisher in the 1930s.
 The additional continuous independent variable in ANCOVA is called a covariate (also known as a control, concomitant, or confounding variable).
 We use regression analysis to create models that describe the effect of variation in predictor variables on the response variable. Sometimes the data also contain a categorical variable with values like Yes/No or Male/Female; ANCOVA lets us study its effect alongside the continuous predictor.
Analysis of covariance (ANCOVA) on the iris data,
when the data contain categorical variables
 data(iris)
 View(iris) # take a look at the data
 library(lattice)
 In this case, we will examine sepal width as our response, using species as a categorical predictor (as in ANOVA) and sepal length as our covariate (as in linear regression).

 It is always helpful to look at a plot first. Note that type=c("p","r") puts both points (p) and
regression lines (r) on the plot

 xyplot(Sepal.Width~Sepal.Length, data=iris, groups=Species, type=c("p","r"),
auto.key=TRUE, xlab="Sepal length (cm)", ylab="Sepal width (cm)")
 In evaluating an ANCOVA model, we want to sequentially ask first whether there is
a difference in slope, then if there is not, look for differences in intercept. Let’s take
it a step at a time. We start with a peek at an anova table to evaluate the different
terms in the model.

 Sepals.lm = lm(Sepal.Width~Sepal.Length*Species, data=iris)

 anova(Sepals.lm) # this command gives us an anova table to evaluate the different terms
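 One way to make the slope question explicit is to compare the interaction model against an additive (common-slope) model with a nested-model F test; a minimal sketch:

 # Additive model: common slope, separate intercepts per species
 sepals.add = lm(Sepal.Width ~ Sepal.Length + Species, data = iris)
 anova(sepals.add, Sepals.lm)   # a significant result means the slopes differ among species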


 Suppose we now set setosa aside and compare only versicolor and virginica. Note that subset() keeps unused factor levels, so droplevels() removes the empty "setosa" level:

 iris2 = droplevels(subset(iris, Species != "setosa"))
 Now we can make a new model and analyze it:
 sepal2.lm = lm(Sepal.Width~Sepal.Length*Species, data=iris2)
 anova(sepal2.lm)
 xyplot(Sepal.Width~Sepal.Length, data=iris2, col=iris2$Species, type=c("p","r"),
xlab="Sepal length (cm)", ylab="Sepal width (cm)", col.line="black")
