
Module 2a: Bivariate Association of Data - Part I
2.1 Learning Outcomes
2.2 Introduction
2.3 Correlation
2.4 A Very Brief Introduction to Simple Linear Regression
2.5 Reviewing Module 2a

2.1 Learning Outcomes

By the end of this module, you will be able to:

1. Graph, analyze, and interpret the bivariate association between two numeric variables, X and Y.

2. Measure correlation between two numeric variables.

3. Use R for viewing the relationship between two numeric variables, particularly through scatterplots, as
well as for calculating different types of correlation.

4. Understand the very basics of simple linear regression, a topic to be studied in detail starting next week
in Module 2b.

2.2 Introduction

Definition
A bivariate association is defined as the association between two variables.

For the purposes of this module, we will focus exclusively on the association between two numeric variables,
but we will extend this idea in future modules (e.g., the relationship between a binary variable and a numeric
variable, or the relationship between one numeric variable and three other variables of mixed type).

Examples of bivariate associations could include:

the association of a nation’s GDP per capita and average life expectancy


the association between average daily calorie intake and systolic blood pressure
the association between regional rainfall and property crime rates
the association between population density and COVID-19 infection rates on June 1, 2020, across
municipalities in Ontario.

Example
We will next look at the gapminder dataset, using the ggplot2 package, with a focus on plotting lifeExp (Y),
which is a country’s average life expectancy, in years, vs. gdpPercap (X), which is a country’s gross
domestic product (GDP) per capita, in inflation-adjusted US$. There are 142 countries in the dataset, with
statistics reported every five years, from 1952 to 2007.

Load the ggplot2 package, and make the gapminder data frame available from the gapminder
package. Note that if you have not previously installed the gapminder package, you will need to do so in
your R session, by, for example, entering install.packages('gapminder') at the command line.
Then, after loading the ggplot2 package and making the gapminder data available, use the str function to
output some details about the gapminder data frame.

library(ggplot2)
data(gapminder, package='gapminder')

str(gapminder)

## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

Next, using ggplot, we will look at a scatterplot from the gapminder data, from 1957 only, with gdpPercap
on the x-axis and lifeExp on the y-axis:

ggplot(subset(gapminder, year %in% 1957), aes(gdpPercap, lifeExp)) +
  geom_point()


FIGURE 2a.1: Scatterplot of 1957 life expectancy vs. GDP per capita

Now, same as above, except for the 2007 data:

ggplot(subset(gapminder, year %in% 2007), aes(gdpPercap, lifeExp)) +
  geom_point()


FIGURE 2a.2: Scatterplot of 2007 life expectancy vs. GDP per capita

Observations for the life expectancy vs. GDP per capita plots (noting there are many interesting
items to observe):

1. In 1957, there is a very noticeable outlier on GDP per capita (Kuwait). Since, as far as we know, this
outlier is not an erroneous data point, we want to keep it in the plot. However, we may want to look
more closely at all the other countries. One approach is to take the log of gdpPercap; we will see this
shortly.

2. There are significant increases in average life expectancies overall from 1957 to 2007. No
country even made it to 75 in 1957; however, in 2007, a noticeable percentage of countries are
above 75, and more than 10 are above 80. An interesting plot to consider is the per-country
increase in life expectancy between 1957 and 2007 against the per-country increase in GDP per
capita over the same period (one possible sketch appears after this list). We could also consider
colour coding the plotted points in all these plots by continent.

3. GDP per capita obviously saw big gains over the 50 years, which is not surprising, even with the
figures being inflation-adjusted. Almost no countries were above $15K in 1957, but a moderate
percentage are above this level in 2007, many well above.

4. There appears to be an increase in life expectancy as a function of GDP per capita, but this increase
does not look nicely linear (in the form of a straight line) in either graph. We will discuss the point of
linearity later in this week's lecture.
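
Here is one possible sketch of the per-country change plot suggested in observation 2 (the merged data frame changes and its derived columns are our own construction, not part of the module):

d1957 <- subset(gapminder, year %in% 1957)
d2007 <- subset(gapminder, year %in% 2007)
changes <- merge(d1957, d2007, by = 'country', suffixes = c('.1957', '.2007'))
changes$lifeExpGain <- changes$lifeExp.2007 - changes$lifeExp.1957
changes$gdpGain     <- changes$gdpPercap.2007 - changes$gdpPercap.1957
ggplot(changes, aes(gdpGain, lifeExpGain, color = continent.1957)) +
  geom_point()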


We will draw one additional plot now, related to the earlier discussion. You will have class activities
connected to this module to create some other related plots. Here, we will take the log (always base e,
i.e., the natural log, in this course) of the x-axis variable, gdpPercap, of the 1957 plot to take care of
much of the skewness caused by the outlier. Also, in the geom_point part of the ggplot call, you will see
that we automatically add different shapes and colours for the (five) different continents coded in the
dataset. Then, in order to make the points easier to see, we manually set double-size points for all the
continents using the subsequent call to scale_size_manual; note that if we had not added the
scale_size_manual call, the size of the points would have been set automatically from the internal value
that R gave to each different continent (1 for the first continent in the legend list, 2 for the second, and
so on through 5). See the following code for specifics, including the use of scale_size_manual:

ggplot(subset(gapminder, year %in% 1957),
       aes(log(gdpPercap), lifeExp, group = continent)) +
  geom_point(aes(shape = continent, color = continent, size = continent)) +
  scale_size_manual(values = rep(2, 5))

FIGURE 2a.3: Enhanced scatterplot of 1957 life expectancy vs. GDP per capita

In FIGURE 2a.3, you can now see a much more linear trend in the relationship between the two variables,
and you can see some trends in terms of countries within continents being clustered together.


In-module question 1
In the example above, in FIGURE 2a.3, can you identify a variable that is related to both GDP per capita
and average life expectancy?

2.3 Correlation

2.3.1 Introduction to correlation


We will now define a statistical quantity for measuring the linear association between numeric variables.
Examples of when this quantity could be applied include the four examples listed just after we defined bivariate
association earlier in this lecture.

Definition
Correlation measures the linear association between two numeric variables. The range of correlation
goes from $-1$ to $1$.

Correlation of 1 is a perfect positive linear association between two numeric variables. See left panel in
FIGURE 2a.4.

Correlation of -1 is a perfect negative linear association between two numeric variables. See middle panel
in FIGURE 2a.4.

Correlation of 0 means no linear association between two numeric variables. See right panel in FIGURE
2a.4.

FIGURE 2a.4: Correlation scatterplots
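
If you would like to reproduce the flavour of these three panels, here is a minimal sketch with simulated data (our own construction, not the data behind FIGURE 2a.4):

x <- seq(-3, 3, length.out = 100)
cor(x,  2 * x + 1)  # exactly  1: perfect positive linear association
cor(x, -2 * x + 1)  # exactly -1: perfect negative linear association
set.seed(42)
cor(x, rnorm(100))  # near 0: no linear association with random noise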


2.3.2 Quantitative definition of correlation

Definition
Correlation between two numeric variables $X$ and $Y$, as defined in words above, is typically written as an
equation, referred to as Pearson's correlation coefficient, $\rho$:

$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

In the numerator of $\rho$ is the quantity $\mathrm{Cov}(X, Y)$, which is the covariance between variables $X$ and $Y$.
Covariance represents the directionality between two numeric variables, just as correlation does, but it is
not, in general, bounded between -1 and 1. Hence:

when $\mathrm{Cov}(X, Y) > 0$, $X$ and $Y$ tend to move together;

when $\mathrm{Cov}(X, Y) < 0$, $X$ and $Y$ tend to move in opposite directions;
and when $\mathrm{Cov}(X, Y) = 0$, $X$ and $Y$ have no linear association; information about one variable provides
no information about the other.

Note that the denominator of $\rho$, which is comprised of the square root of the product of the variances of $X$ and
$Y$, respectively, serves to standardize the covariance such that $-1 \le \rho \le 1$. This is a desirable
characteristic, as it turns the correlation into a unitless quantity, unaffected by the scale (e.g., centimetres or
inches) of either $X$ or $Y$ while retaining an interpretable quantity of the linear association between the two.
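
As a quick numerical check of this standardization, here is a minimal sketch with simulated data (the vectors x and y are our own, not from the module); the covariance divided by the square root of the product of the variances matches R's cor function exactly:

set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
cov(x, y) / sqrt(var(x) * var(y))  # covariance standardized by the variances
cor(x, y)                          # returns the same value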

2.3.3 Estimating correlation with data


While we have introduced the definition of correlation and Pearson's correlation coefficient, $\rho$, we now want
to see how we can estimate $\rho$ when we have collected data for two variables, $X$ and $Y$, from $n$ individuals
or units.

As we have seen in earlier weeks, the index for an individual will be represented by $i$, where $i = 1, \ldots, n$.
And, here, each individual will have two measurements that will go into the estimated correlation coefficient
calculation, i.e., $(x_i, y_i)$. For example, in FIGURE 2a.2 above, each point in the scatterplot represents an
individual country's pair of 2007 life expectancy ($y_i$) and GDP per capita ($x_i$), and all of the points in the
scatterplot will provide the data used to estimate Pearson's correlation coefficient for measuring the linear
association between a country's life expectancy and GDP per capita in 2007.

The sample correlation coefficient is typically written as $r$, and we will do so here:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Though we will use R to calculate $r$, note that, in the above equation, the summations run over all $n$ sampled
pairs, and recall that $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, where $\bar{x}$ and $\bar{y}$ are the sample means of the
$X$ and $Y$ variables, respectively.

All the same properties that applied to $\rho$ also apply to $r$, including its range between -1 and 1, how to
interpret it when it is either positive or negative, and that there is no linear association between $X$ and $Y$ (here,
in the sample data) when $r = 0$.
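
To make the formula concrete, here is a minimal sketch (r_by_hand is our own hypothetical helper, not part of any package) that computes $r$ directly from the definition; it should agree with R's built-in cor function for any pair of equal-length numeric vectors:

r_by_hand <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) /
    sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
}
# e.g., with(subset(gapminder, year %in% 2007), r_by_hand(gdpPercap, lifeExp))
# should reproduce the cor() result shown in the example that follows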

Example
We can use R quite easily to help us calculate the sample correlation coefficient, . For example, let’s go
back to the 2007 gapminder data.

with(subset(gapminder, year %in% 2007), cor(gdpPercap, lifeExp))

## [1] 0.6786624

This result suggests that for countries in 2007, life expectancy is fairly strongly positively correlated with
GDP per capita, and it should confirm our view of the scatterplot in FIGURE 2a.2. However, we would also
like to quantify the uncertainty in this estimate of the correlation by providing a confidence interval. Due
to the nature of the limits of $r$ between -1 and 1, and due to the definition of $r$ as a ratio,
generating by hand the estimated confidence interval (CI) for what $r$ is estimating,
i.e., $\rho$, is more complicated than the CIs from review module 1b, and, hence, we will not present that
calculation here. Fortunately, R calculates this very easily, as we show here, providing a 95% CI:

with(subset(gapminder, year %in% 2007),
     cor.test(gdpPercap, lifeExp, conf.level=0.95)$conf.int)

## [1] 0.5786217 0.7585843
## attr(,"conf.level")
## [1] 0.95

The 95% CI output, rounded to 3 digits as (0.579, 0.759), confirms the fairly strong positive association
between the two variables seen in FIGURE 2a.2, as the CI for $\rho$ is bounded well away from 0. And
the conf.level argument above can easily be changed to present CIs for values of $\alpha$ other than .05, e.g.,
calculating a 99% CI, as shown next.
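
Here, only the conf.level argument changes (the resulting interval will be wider than the 95% CI above):

with(subset(gapminder, year %in% 2007),
     cor.test(gdpPercap, lifeExp, conf.level=0.99)$conf.int)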


In-module question 2
i. How would you interpret the confidence interval for the correlation coefficient between two
numeric variables that is (-0.530, -0.411)?

ii. How would you interpret the confidence interval for the correlation coefficient between two
different numeric variables that is (-0.112, 0.051)?

2.3.4 A few limitations of correlation


In spite of its usefulness as a measure of association between two numeric variables, and our ability to capture
the uncertainty of estimating $\rho$ through a confidence interval, there are some limitations that should be
identified here:

1. Correlation is a measure of “linear” association.

So $r$ is not appropriate for curvilinear (such as quadratic) data, for example, or for more complicated
associations between $X$ and $Y$.

2. It is a unitless quantity (both a strength and a limitation).

A strength because a change of scale of either or both variables (from miles to kilometres, or from
kilograms to pounds, etc.) will not affect the estimation of $\rho$ from the data; but a limitation, insofar
as it is hard to tell how much an increase in one variable affects the other, aside from knowing that an
increase in one will tend to accompany an increase in the other, for example, if the correlation is positive.
Linear regression will allow us to make more concrete interpretations in this regard.

3. $r$ is a biased estimate of $\rho$.
This is more of a concern in small samples, where the bias can be a nuisance. The amount of bias
becomes more negligible as the sample size increases.

4. We cannot make conclusions about correlations outside the range of observed data.

If, say, in our sample, the range of our X variable is [110, 220] and the range of the Y variable is
[76, 140], and say we calculate a strongly positive $r$, we cannot assume this strong positive correlation
will persist if we find cases from the population from which the sample was drawn that have an X value
less than 110 and/or a Y value less than 76.

5. We cannot judge the strength of the association between X and Y based on $r$ alone.
We also need to look at the scatterplots, for example, as outliers can sometimes influence the
estimation of $r$ from the data.


Textbook: Vittinghoff et al., Chapter 3, Section 3.2, introduction only, ignoring both 3.2.1 and 3.2.2;
pp. 33-34.

Textbook: Sullivan, Section 9.3, up to end of first column of p. 207; pp. 203-207.

2.3.5 Rank correlation


Especially due to Limitation #1 mentioned above, there was a need for a correlation measure that
can be used when a linear association between $X$ and $Y$ does not exist. This was the reason Spearman's
rank correlation was created, and we will discuss this approach now.

To obtain Spearman’s rank correlation (informally, we often write “the Spearman rank correlation” or just “the
Spearman correlation”), follow these steps:

1. Sort the original values of $X$ in ascending order, retaining the person/unit $i$ from which each original
$x_i$ came.

2. Do the same for the original values of $Y$.

3. In case of ties, split the difference. For example, say $X$ has five numbers ($n = 5$): 7, -2, -1, 0, -1. For
ranking, -2 is the smallest observed value, so person $i = 2$ gets a rank of 1. The next highest number
is -1, but there are two of these values (for $i = 3$ and $i = 5$). Instead of randomly assigning one a rank
of 2 and the other a rank of 3, we split the difference and give each a rank of 2.5. Continuing with this
example, 0 (i.e., person $i = 4$) gets a rank of 4, and the highest value, 7 (person $i = 1$), gets a rank
of $n$, i.e., 5. Hence, from the original numbers (7, -2, -1, 0, -1), we obtain ranks for $X$ of (5, 1, 2.5, 4, 2.5).

4. Next, the approach for calculating the Spearman rank correlation (which we will label as $r_s$) will depend
on whether there are ties in at least one of your variables.

a. In case of any ties: In step 3 above, there was a tie for two values (i.e., the 2nd and 3rd
ordered values, leading to two ranks of 2.5 each). In this case, or in any case where ties exist for
either or both variables, we will actually use the Pearson correlation equation seen earlier! This
may seem surprising, but we will not use the original data, as we did when presenting how to
calculate a Pearson correlation in Section 2.3.3. Instead, we will use the ranks for $X$ and $Y$,
respectively, as produced in step 3, as the data fed into the equation to obtain $r_s$.
b. When there are no ties: In this case, we first calculate the difference, $d_i$, between the ranks of
$x_i$ and $y_i$ for each person $i$, that is, $d_i = \mathrm{rank}(x_i) - \mathrm{rank}(y_i)$ for each $i = 1, \ldots, n$. Then, the
Spearman rank correlation, $r_s$, is based on a calculation of a function of these $d_i$ values and $n$,
specifically:

$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

To summarize, when there are any ties for either of your variables, use the ranks as the data that you feed into
the Pearson equation; the result will be the Spearman rank correlation, $r_s$. When there are no ties, you then
have to take one additional step to find the differences of the ranks for each person, then feed those differences
and the sample size $n$ into the displayed equation just above in order to find $r_s$. An example, where we will
consider one setting with ties and the other without ties, is shown below.

If the ranks for $X$ and $Y$ for each person are close to one another, this will lead to a positive (possibly high)
Spearman correlation ($r_s$), approaching 1, noting that the function of squared differences is subtracted from 1 in
the Spearman equation above. If the ranks for $X$ are low (closer to 1) when the ranks for $Y$ are high (closer to
$n$) for each person, then this will lead to a negative (possibly large in magnitude) Spearman correlation,
approaching -1; the same will happen in the case when the ranks for $X$ are high while the ranks for $Y$ are low.
If there is little or no pattern connecting the ranks for $X$ and $Y$ for each person, then this will lead to a
Spearman correlation closer to 0.
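
As a quick check of steps 1 to 3, note that R's rank function averages tied ranks by default, reproducing the "split the difference" rule:

rank(c(7, -2, -1, 0, -1))

## [1] 5.0 1.0 2.5 4.0 2.5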

Example
A. Ties exist.

We will use the $X$ we defined above, where $X$ is (7, -2, -1, 0, -1) and rank($X$) is (5, 1, 2.5, 4, 2.5). Now,
let's say $Y$ is (-4, 7, 3, 2, 10). Then, try on your own to verify that rank($Y$) is (1, 4, 3, 2, 5).

As there are ties in at least one of the variables, in this case for the 2nd and 3rd ordered values of $X$, we
use the Pearson calculation, with the data being the ranks, in order to find the Spearman correlation $r_s$.
Specifically, our 5 pairs of (rank($x_i$), rank($y_i$)) data for this example will be: (5, 1), (1, 4), (2.5, 3), (4, 2),
and (2.5, 5), which are the ranks, not the original data. In this case, you can use the R code provided shortly
below (or, if you wish, enter the ranks into the Pearson calculation by hand) to verify that $r_s \approx -0.821$ for
this example.

B. No ties exist.

We will use the same $Y$ from above, but now $X$ will change slightly to make it (7, -2,
-1.25, 0, -1). As there are no ties within either $X$ or $Y$, we need to calculate $d_i = \mathrm{rank}(x_i) - \mathrm{rank}(y_i)$ for each
$i = 1, \ldots, 5$. For example, you will be able to find that the rank pair from person $i = 2$ is (1, 4), leading to a
value of $d_2 = 1 - 4 = -3$. Then, calculating Spearman's rank correlation using the calculation for the no-
ties scenario leads to $r_s = -0.7$. Try this on your own, with R code (see below) on the original data
and then, if you wish, verifying what you find with your own calculations based on the ranks (and then,
subsequently, the differences of those ranks) for each person $i$.
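
A minimal sketch for verifying Examples A and B in R (the vector names x_ties and x_noties are our own, chosen for clarity):

x_ties <- c(7, -2, -1, 0, -1)       # Example A: ties present in X
y      <- c(-4, 7, 3, 2, 10)
cor(x_ties, y, method='spearman')   # approximately -0.821
cor(rank(x_ties), rank(y))          # identical: Pearson applied to the ranks

x_noties <- c(7, -2, -1.25, 0, -1)  # Example B: no ties
cor(x_noties, y, method='spearman') # exactly -0.7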

Key concept
Note that in the “ties” example above, if we used the original data, then the estimated Pearson correlation
coefficient would be $r \approx -0.848$, close to the Spearman value, even for a small sample size of $n = 5$.
This will not always be the case, especially in scenarios where there are issues with linearity (i.e., when
we should use Spearman) or when there are outliers. Spearman is a more robust approach to calculating
correlation when major outliers exist. Even the one big outlier in the 1957 gapminder data can throw things
off. Use the example code below and above to help compare Spearman with Pearson correlation for the
2007 data (when outliers are not a big issue) and for the 1957 data (when there is an extreme outlier). You
will typically see a much bigger discrepancy between the two methods when outliers exist in the scatterplot.
All this said, since the 2007 data show a curvilinear trend between the two variables (see FIGURE 2a.2), we
should be using Spearman to report the correlation between a country's life expectancy and GDP per
capita in 2007.
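
One way to make this comparison is to loop over the two years (a sketch; the 2007 Pearson and 1957 Spearman values should match the outputs shown elsewhere in this module):

for (yr in c(1957, 2007)) {
  d <- subset(gapminder, year %in% yr)
  cat(yr, 'Pearson:', with(d, cor(gdpPercap, lifeExp)),
      'Spearman:', with(d, cor(gdpPercap, lifeExp, method='spearman')), '\n')
}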

In R, estimating rank correlation is quite easy to implement using the cor function. See the method argument
used in the cor function call below for the gapminder data from 1957:

with(subset(gapminder, year %in% 1957), cor(gdpPercap, lifeExp, method='spearman'))

## [1] 0.7821208

Note that the cor.test function does not produce a confidence interval for Spearman’s correlation coefficient;
it only generates one for Pearson’s correlation coefficient, as was shown in Section 2.3.3.

Also, note that by specifying the argument method='spearman', the cor function in R will automatically choose
the right method to calculate $r_s$, depending on whether or not any ties exist in either $X$ or $Y$. That is, when
using the cor function and the 'spearman' method, you do not need to calculate ranks by
hand, as the function will do this for you and then subsequently use the appropriate calculation.

In-module question 3
When the relationship between two numeric variables is curvilinear, as we see in FIGURE 2a.2, why is the
Pearson correlation coefficient not appropriate to calculate?

2.4 A Very Brief Introduction to Simple Linear Regression

Regression is the most popular technique for analyzing data, and simple linear regression (SLR) is the most
straightforward of all the regression methods.

Gaining an understanding of SLR should provide an easier entry into more complicated regression methods,
some of which we will cover in future modules this term, including multiple linear regression for numeric
response data, logistic regression for binary response data, and possibly a brief introduction to proportional
hazards regression for time-to-event (survival) response data.

Key concept
In SLR, like with correlation, we have two numeric variables and we would like to determine their
association. However, there are some important differences between SLR and correlation to highlight:

We identify a response variable ($Y$) and a predictor variable ($X$). Sometimes, the response variable
is called the dependent variable. And sometimes the predictor variable is called the independent
variable or explanatory variable.
In regression, we specify a statistical model (called a regression model). We will see its specification
for SLR next week (in Module 2b).
From the estimation of the SLR regression model, we will be able to directly quantify the effect that
a one-unit increase in $X$ has on $Y$. We will actually be able to determine the effect that any $k$-unit
increase in $X$ has on $Y$ for values of $k$ other than 1, such as 0.5 or 2 or 10, etc.
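
As a brief preview (a minimal sketch only; the model, its specification, and its assumptions are the subject of Module 2b), an SLR can be fit in R with the lm function, and the estimated slope quantifies the change in lifeExp associated with a one-unit (here, one-dollar) increase in gdpPercap:

# Preview only; the model and its assumptions are discussed in Module 2b
fit <- lm(lifeExp ~ gdpPercap, data = subset(gapminder, year %in% 2007))
coef(fit)  # intercept and slope estimates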

A few examples where we might use SLR:

1. Investigate the relationship between systolic and diastolic blood pressure.

2. A community-based study of smoking and drinking behavior, looking to see the degree to which smoking
and drinking behavior are associated.

3. A regional study of average calorie intake, and its association with weight, within a certain age range of
interest.

4. Study the association between external temperature and weekly gas consumption within a home.

5. Determine the relationship between average daily calories consumed and body mass index.

We will continue with SLR next week, where you will learn:

how to specify the statistical model for SLR.
how to estimate quantities of interest from the model after data are collected.
the assumptions of the model, how to verify they are met, and some techniques to consider when there are
violations of the assumptions.
how to work through an example with real data, using R.

2.5 Reviewing Module 2a

We learned different ways to view and analyze the bivariate association between two numeric variables, X
and Y.

With scatterplot investigation and with correlation techniques, it does not matter which variable is the
response (or dependent variable) and which is the predictor (or independent variable); this is not
the case in simple linear regression, where we need to be careful in differentiating the predictor from the
response. This last point will be emphasized in next week's lecture.

You saw approaches in R for graphing scatterplots using the ggplot2 package and implementing different
ways to estimate correlations. These were demonstrated on the gapminder dataset.


Next week, in Module 2b, we will focus on the specification and details of the simple linear regression
model, as well as work through an example, using R, with real data.
