8 - Merely Two Variables
How wonderful that we have met with a paradox. Now we have some hope of making
progress.
Niels Bohr
8.1 Introduction
Correlations are basic for statistical understanding. In this chapter we dis-
cuss several important examples of correlations, and what they can and
cannot do. In Chapter 9 we take up more complex issues, those involving
correlations and associations among several variables. We will find that
statistical learning machines are an excellent environment in which to
learn about connections between variables, and about the strengths and
limitations of correlations.
Correlations can be used to estimate the strength of a linear relationship
between two continuous variables. There are three ideas here that are
central: correlations quantify linear relationships, they do so with only
two variables, and the two variables are at least roughly continuous (like
rainfall or temperature). If the relationship between the variables is almost
linear, and we don’t mind restricting ourselves to those two variables, then
the prediction problem has at least one familiar solution: linear regression
(see Note 1). Usually the relationship between two variables is not linear.
Despite this, the relationship could still be strong, and a correlation
analysis could be misleading. A statistical learning machine approach
may be more likely to uncover the relationship. As a prelude to later
results we show this by applying a binary decision tree to one of the
examples in this chapter.
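Throughout the chapter, correlations are reported together with 95% confidence intervals. As a minimal sketch of how such numbers arise (in Python, on simulated height/weight-style data, not the book's dataset), the Pearson correlation and an approximate interval from the Fisher z-transform can be computed as follows:

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation via the Fisher z-transform."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Simulated, roughly linear height/weight data (illustrative only).
random.seed(0)
height = [random.gauss(170, 12) for _ in range(647)]
weight = [0.9 * h - 80 + random.gauss(0, 9) for h in height]

r = pearson_r(height, weight)
lo, hi = fisher_ci(r, len(height))
print(f"rho = {r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The Fisher transform is one standard way to attach an interval to a correlation; the interval narrows as the sample size grows.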
Downloaded from Cambridge Books Online by IP 198.211.119.232 on Sun Apr 17 21:17:14 BST 2016.
http://dx.doi.org/10.1017/CBO9780511975820.009
Cambridge Books Online © Cambridge University Press, 2016
[Figure 8.1 scatter plot: Weight (kg) vs. Height (cm)]
Figure 8.1 Height and weight for 647 patients. Strong correlation is present, ρ = 0.77 with 95%
confidence interval [0.73, 0.80].
Hazards of correlations
[Figure 8.2 scatter plot: CA level (mg/dL) vs. Height (cm)]
Figure 8.2 Calcium blood levels in 616 patients, plotted against height. Little correlation is
present, ρ = −0.05, with 95% confidence interval [−0.12, 0.03].
[Figure 8.3 scatter plot: CA level (mg/dL) vs. Height (cm), with one extreme point at the upper right]
Figure 8.3 The main body of this calcium/height plot shows little correlation, ρ = 0.04 with
95% confidence interval [−0.30, 0.37]. Adding a single outlier at the upper right
changes everything, or so it seems, with ρ = 0.62 and a 95% confidence interval
[0.36, 0.79].
to 0.62 (the p-value is 0.01). The single data point in the upper right corner
seems to be driving the correlation higher, since leaving the point out of the
analysis gives a correlation of 0.04. Before casting out this data point we are
obligated to ask ourselves these questions:
(1) Is this extreme data point representative of some biological process of
merit? Is there a mix of different processes that might be responsible for
this outlier?
(2) Are occasional extreme values typical of this process?
If the answer to either of these questions is yes, or if we are not sure,
then the data point cannot be omitted. Only knowledge of the subject matter,
and sometimes more targeted data-gathering, can resolve these questions.
This decision should not be made for purely statistical reasons alone.◄
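The sensitivity just described can be checked directly by computing the correlation with and without the suspect point, as part of a robustness analysis. A small sketch, on simulated data with one hypothetical extreme point (the numbers are illustrative, not the book's dataset):

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)
# Main body: no real relationship between height and CA level.
height = [random.gauss(165, 15) for _ in range(30)]
ca = [random.gauss(15, 4) for _ in range(30)]
r_without = pearson_r(height, ca)

# One hypothetical extreme point at the upper right.
r_with = pearson_r(height + [250.0], ca + [45.0])
print(f"without outlier: {r_without:+.2f}, with outlier: {r_with:+.2f}")
```

A single far-out point can move the estimated correlation from near zero to a seemingly strong value; reporting both numbers, as in the figures of this chapter, makes the dependence visible.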
[Figure 8.4 scatter plot: Weight (kg) vs. Height (cm), with one extreme point at the lower right]
Figure 8.4 Height and weight were once strongly related, but now? The main body of the data
shows a strong correlation, ρ = 0.75, with 95% confidence interval [0.51, 0.88]. But
adding a single outlier at the lower right returns a correlation of basically zero,
ρ = 0.15, with 95% confidence interval [−0.24, 0.50].
[Figure 8.5 panels (a) and (b): y vs. x, x roughly −20 to 20]
Figure 8.5 The effect of adding three points to one side of a plot. The three additional points are
from the quadratic curve generating the data. (a) Applying a straight-line regression
to the data originally returns a correlation of ρ = −0.01 (light gray line); adding three
points in the upper right changes the correlation to ρ = 0.23. (b) Applying a single
classification tree (in regression mode) changes little over that part of the data not
containing the added points, and then nicely takes the three points into account when
added.
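The contrast in Figure 8.5 can be sketched with a toy regression tree written from scratch; this is an illustrative implementation under simple assumptions (greedy splits, squared-error criterion), not the tree procedure used for the book's figures:

```python
# Toy regression tree vs. least-squares line on quadratic data.

def fit_line(x, y):
    """Least-squares fit y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def fit_tree(pairs, min_leaf=5, depth=0, max_depth=4):
    """Greedy regression tree: split on x to minimise squared error."""
    ys = [y for _, y in pairs]
    if depth >= max_depth or len(pairs) < 2 * min_leaf:
        return ("leaf", sum(ys) / len(ys))   # leaves predict the mean
    pairs = sorted(pairs)
    best_sse, best_i = None, None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best_sse is None or sse < best_sse:
            best_sse, best_i = sse, i
    threshold = (pairs[best_i - 1][0] + pairs[best_i][0]) / 2.0
    return ("node", threshold,
            fit_tree(pairs[:best_i], min_leaf, depth + 1, max_depth),
            fit_tree(pairs[best_i:], min_leaf, depth + 1, max_depth))

def predict(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, threshold, left, right = tree
    return predict(left if x < threshold else right, x)

# Quadratic data on [-20, 20], then three added points on the far right.
all_x = list(range(-20, 21, 2)) + [24, 26, 28]
all_y = [(x / 10.0) ** 2 for x in all_x]

a, b = fit_line(all_x, all_y)             # global straight-line fit
tree = fit_tree(list(zip(all_x, all_y)))  # local piecewise-constant fit
print("line at x=26:", round(a + b * 26, 2),
      "tree at x=26:", round(predict(tree, 26), 2))
```

Because the line is a single global summary, the three added points tilt the whole fit; the tree, predicting leaf means, changes little away from the added region and adapts locally where the new points sit.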
Correlations big and small
The problem just discussed in Example 8.5 is important and has its own
statistical literature: see Note 3.
[Figure 8.6 scatter plot: x2 vs. x1]
Figure 8.6 The data arises in three parts, within each of which the correlation is strongly positive.
However, considered as a group the data show a strong negative correlation. Strata,
within-clinic slices, or other subsets need not show any obvious functional relation to the
pooled data: correlations don't add.
Example 8.7a Subsets of data can behave differently from the whole
We pursue this subject of local versus global. Biological datasets typically are
not homogeneous, in the sense that subsets, slices, or parts of the whole
often do not look like the pooled data. In Figure 8.6 we have three clouds
of data.
Taken separately they yield strong positive correlations. Taken as a whole,
the general trend becomes negative, from the upper left of the plot down to
the lower right. The correlation for the entire data is different from correla-
tions for the three segments of the data, and is in the reverse direction. To
make the example more compelling, suppose the data represent success rates
for individual surgeons over a period of several years, plotted as a function of
the total number of procedures each surgeon performed. Perhaps the three
groups were collected under somewhat different research protocols, or at
clinics with not-so-subtle differences in patient populations, or perhaps
there are hidden, unmeasured biological processes that are causing these
groups to separate. Patients at Clinic A could generally be sicker than those at
Clinic B, or the surgeons at Clinic B performed the transplants under a more
refined patient selection scheme. It is important in the analysis to take these
additional features into account. But the context is important: within each
clinic the success rate is positively related to the total number of procedures
each surgeon performs, and this is important information. Pooling the data
is an averaging that obscures this conclusion, even if we don’t understand
what separates the clinic results. One data analysis moral here is that
correlations don’t add, and unmeasured features of the data can and perhaps
should sharply perturb our conclusions. But, in a large and complex dataset
we may have no clue as to the number and nature of hidden variables. See
Note 4.◄
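The within-group versus pooled reversal is easy to reproduce by simulation. In the sketch below, the cloud centres, slopes, and sample sizes are illustrative choices, not the data behind Figure 8.6:

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(2)
clouds = []
# Cloud centres fall from upper left to lower right, but within each
# cloud y rises with x.
for cx, cy in [(0, 60), (30, 30), (60, 0)]:
    xs = [cx + random.gauss(0, 4) for _ in range(50)]
    ys = [cy + (x - cx) + random.gauss(0, 2) for x in xs]
    clouds.append((xs, ys))

within = [pearson_r(xs, ys) for xs, ys in clouds]
pooled_x = [x for xs, _ in clouds for x in xs]
pooled_y = [y for _, ys in clouds for y in ys]
pooled = pearson_r(pooled_x, pooled_y)
print("within-cloud:", [round(r, 2) for r in within],
      "pooled:", round(pooled, 2))
```

Each cloud yields a strongly positive correlation while the pooled correlation is strongly negative: the between-cloud trend overwhelms the within-cloud trend, which is the pattern behind Simpson's paradox (Note 4).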
[Figure 8.7 scatter plot vs. LDL level (mg/dL); ρ = −0.05 [−0.14, 0.04] overall, ρ = −0.80 [−0.83, −0.77] in the subset]
Figure 8.7 The data is uncorrelated considered as a whole, but pieces of the data need not be.
There is no correlation of the data considered as a single group, but looking at just
those patients with high total cholesterol shows a strong negative relationship. In this
important subset of at-risk patients, does this have clinical meaning?
The correlation for all patients is basically zero, equal to −0.05. The researcher
is interested in whether a combined measurement of the two cholesterol values
could be predictive of disease. She subsets the data to those patients having
HDL + LDL above a clinically determined threshold value equal to 240, and
then considers these patients to have elevated risk. Further study of these
at-risk patients proceeds, and now the researcher observes that in this population HDL and LDL are strongly negatively correlated, with the estimated
correlation equal to −0.80.
Being a careful analyst she immediately realizes that (probably) no new
science has been uncovered in this subset analysis. She is aware that this
correlation was induced, since the subset does not include patients with low
HDL and LDL. The moral is that it is hazardous to infer correlation from
pairs of variables that were themselves used in defining the subset. Of course,
this subset is still medically important since the usual definition of total
cholesterol is HDL plus LDL (plus a triglyceride term; see Note 1 in
Chapter 4), so that adding HDL and LDL together generates a valid tool
for locating patients at risk.
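The induced correlation is straightforward to reproduce: simulate independent HDL and LDL values, subset on their sum, and the subset correlation turns strongly negative. The means, standard deviations, and threshold below are illustrative choices, not clinical values from the book:

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(3)
# Independent HDL and LDL values (illustrative means/SDs, mg/dL).
hdl = [random.gauss(55, 15) for _ in range(2000)]
ldl = [random.gauss(130, 30) for _ in range(2000)]
r_all = pearson_r(hdl, ldl)

# Subset on the SUM of the two variables: HDL + LDL > 240.
pairs = [(h, l) for h, l in zip(hdl, ldl) if h + l > 240]
r_sub = pearson_r([h for h, _ in pairs], [l for _, l in pairs])
print(f"all: {r_all:+.2f}; HDL+LDL > 240 subset: {r_sub:+.2f} (n={len(pairs)})")
```

The subset excludes exactly those patients where both values are low, so within the retained wedge a high HDL forces a relatively low LDL, and vice versa; the negative correlation is an artifact of the selection, not new science.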
She next proceeds to look at those patients having another risk factor
related to cholesterol, those for whom HDL is less than 40. For this group she
finds essentially zero correlation between HDL and LDL. Again, the analysis is statistically valid, but (probably)
doesn’t yield new science.
Leaving aside the problem of how subgroups may or may not reflect group
structure, she now has defined two, distinct at-risk populations, those
patients with high total cholesterol, HDL + LDL more than 240, and those
with low HDL. Each group represents patients with well-studied clinical
problems. Is it possible, she asks, to more reliably classify patients using both
of these risk factors? This is different from the regression-style prediction
problem in Example 8.6 above. It also presents a challenge in classifying
at-risk patients, since the at-risk region of the data is wedge-shaped and not
separated by a single, easy-to-describe, straight-line boundary.◄
[Figure 8.8: CA level (mg/dL) vs. Height (cm) with single-tree fit; ρ = 0.04 [−0.30, 0.37] without the outlier, ρ = 0.62 [0.36, 0.79] with it]
Figure 8.8 A single tree is applied to data, the main body of which does not contain a strong
signal, with a correlation close to zero. But the tree does get, once more, the single
outlier. In this case meandering through the central portion of the data is not
necessarily a mistake, but a forest approach would likely smooth the central
prediction. However, passing through the single extreme point is probably
misleading, unless the underlying science says otherwise.
[Figure 8.9: Weight (kg) vs. Height (cm) with single-tree fit; ρ = 0.75 [0.51, 0.88] without the outlier, ρ = 0.15 [−0.24, 0.50] with it]
Figure 8.9 A single decision tree nails the “outlier” on the lower right, but meanders through
the main body of the data. The central portion of the data does contain a signal, as
the correlation is strong and the data there looks linear, but the tree seems uncertain.
Reducing the terminal node size of the single tree, or using a forest of trees might
help; see Chapter 7.
move smoothly upward from left to right as we might expect given the strong
correlation of this central part of the data.
That is, single trees may not automatically work well with relatively small
amounts of data. Another way to state this result is by reference to Section 2.11:
the rate of convergence to optimality may be rather slow for single trees.
Notes
(1) To understand what the correlation represents we need to start with its
definition. Assume that Y and X are related by the linear equation

Y = a + bX + error

for some constants a and b. Then the correlation is given by

ρ = b (σX / σY),

where
b = slope of the linear regression line Y = a + bX
a = intercept of the line with the Y-axis
σX, σY = standard deviations of X and Y
When Y and X are rescaled so that both have standard deviation equal to
one, we find that ρ = b, which emphasizes the connection between
correlation and that term in the equation above for Y. In the regression
analysis framework, given the data we replace the mathematical con-
stants above with estimated values.
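For the sample quantities the identity ρ = b σX/σY holds exactly, which a quick numerical check confirms (the slope, noise level, and sample size below are arbitrary illustrative choices):

```python
import math
import random

random.seed(4)
# Y = a + bX + error, with arbitrary constants a = 2.0, b = 1.5.
x = [random.gauss(0, 3) for _ in range(10000)]
y = [2.0 + 1.5 * xi + random.gauss(0, 4) for xi in x]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

b = sxy / sxx                       # estimated regression slope
sx, sy = math.sqrt(sxx / n), math.sqrt(syy / n)
rho = sxy / math.sqrt(sxx * syy)    # Pearson correlation
print(f"rho = {rho:.3f}, b * sx/sy = {b * sx / sy:.3f}")
```

The two printed numbers agree to machine precision, since b·(sx/sy) = (Sxy/Sxx)·√(Sxx/Syy) = Sxy/√(Sxx·Syy) = ρ; in particular, rescaling X and Y to unit standard deviation gives ρ = b, as stated above.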
(2) Addressing the problem of unusual observations has two parts. First,
we need to devise a method for uncovering observations that markedly
stand out from the others, and then second, settle upon a method for
dealing with the unusual observations. As we saw in Examples 8.3 and
8.4, finding an observation that sticks out requires looking at each
variable separately and the variables together. However we choose to
resolve this first problem – of declaring something unusual – we still
need to deal with the second problem: what to do when an outlier
is detected. Deleting a data point is not something we recommend
generally, but down-weighting or reducing the impact of the data point
on the subsequent analysis is one alternative. Or we can consider an
analysis in which the data point is included, then one in which it isn’t
included, and compare the results. This is an example of studying
the robustness of the inference, which we strongly recommend; see,
for example, Spiegelhalter and Marshall (2006). We could also use
alternative definitions for correlations. That is, methods have been
proposed that are less sensitive to outliers, but they are also less
familiar so that it can be difficult to grasp what they mean precisely
in a practical sense. See for example:
http://www.stat.ubc.ca/~ruben/website/chilsonthesis.pdf.
Also Alqallaf et al. (2002), Hardin et al. (2007). A central point to keep in
mind, no matter how you choose to thread a path through these data
analysis choices, is this: fully declare the method by which any data point
is considered to be unusual (an outlier) and disclose the method used to
treat that data point.
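One familiar, less outlier-sensitive alternative is the Spearman rank correlation, which applies the Pearson formula to the ranks of the values rather than the values themselves. A sketch on data of the Figure 8.3 type, with one extreme point (the numbers are illustrative):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank(values):
    """1-based average ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(rank(x), rank(y))

# Flat main body (CA hovers around 15) plus one extreme point.
height = [150 + i for i in range(30)] + [240]
ca = [15 + 2 * (-1) ** i for i in range(30)] + [40]
print(f"Pearson:  {pearson_r(height, ca):+.2f}")
print(f"Spearman: {spearman_r(height, ca):+.2f}")
```

The single extreme point inflates the Pearson correlation to a seemingly strong value, while the rank-based version stays near zero: the outlier can contribute at most one rank step, however far out it lies.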
(3) It is easy to construct examples of a distribution for a random variable X
and then define a pair (X, Y) with Y = X² such that cor(X^m, Y^n) = 0 for
every positive odd integer m and any positive integer n. Indeed, simply
let X have any distribution symmetric around X = 0; see Mukhopadhyay
(2009), as well as Hamedani and Volkmer (2009).
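A direct numerical check of the symmetric case: take a symmetric grid of X values around zero and Y = X². Y is then a deterministic function of X, yet its correlation with X, or with any odd power of X, vanishes up to rounding error:

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Symmetric grid of X values around zero; Y = X^2 exactly.
x = [i / 10.0 for i in range(-50, 51)]
y = [xi ** 2 for xi in x]

# Zero correlation despite an exact functional relationship:
print("cor(X,   Y) =", round(pearson_r(x, y), 6))
print("cor(X^3, Y) =", round(pearson_r([xi ** 3 for xi in x], y), 6))
```

This is the chapter's opening warning in its sharpest form: a correlation of zero rules out a linear trend, not a relationship.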
(4) This slippery problem is so recurrent that it has its own name: Simpson’s
paradox. As with many paradoxes, it arises when one assumption or
another has been overlooked or in which the outcome measure is itself
ambiguous. In this case the problem is that the subgroups don’t reflect
the behavior of the pooled data. Here the misleading assumption is that a
single number, the pooled correlation, can speak for the entire dataset,
when it is better represented in terms of two or more correlations, a
pooled value and separate values for the subgroups. If the subsections of
the data all show similar behavior then it is possible to borrow strength
from the subsections, to create a more efficient single summary. This
forms the statistical subject of analysis of covariance and multilevel
models: see, for example, Twisk (2006).