
CHAMP I

NOTES ON INTRODUCTORY STATISTICS

Table of Contents

Introduction
Descriptive statistics for categorical variables
    Frequency table
    Bar graph
    Pie chart
Descriptive statistics for numerical variables
    Mean
    Standard deviation
    Range
    Median
    Quartiles
    Interquartile range
Graphs for numerical variables
    Stem-and-leaf plot
    Histogram
    Dotplot
    Shape of the distribution
    Boxplot
Empirical rule
Studying relationship between two variables
    Categorical versus categorical variables
        Contingency table
    Numerical versus numerical variables
        Scatterplot
        Correlation coefficient
        Coefficient of determination
        Regression equation
        Correlation and causation

Introduction

A population is the entire set of objects or subjects of interest in a study; we will
refer to its members as experimental units. Populations are not always accessible, and even
if they are, it is very expensive and time-consuming to study every single member of the
population.
A sample is a subset of the population. The main idea of statistics is that samples reflect
populations. Statistics is the science of designing studies (how do you select a sample?)
and of analyzing the information that comes from the sample (the data).
Descriptive statistics produces a summary of sample data using graphs and numerical
summaries.
Inferential statistics draws inferences (decisions or predictions) about populations based
on samples.

A variable is a characteristic of an experimental unit (an object or subject studied). For
example, in a study of reading ability, the subjects are students, and variables may include a
student's age, gender, grade, ethnicity, and weekly time spent watching TV.
Variables are classified by type into
- categorical (take values that are categories; for example, the variable gender takes
values "male" or "female")
- numerical or quantitative (take values that are numbers; for example, age, SAT
score, number of brothers or sisters)
Numerical variables are further classified as
- discrete (values are separate numbers, with gaps between possible values; for
example, the number of brothers or sisters can be 0, 1, 2, …, but cannot be 0.5 or
1.3)
- continuous (values form an interval on the real line)
Examples: age is numerical continuous (if measured precisely and not just in whole
years), gender is categorical, grade is numerical discrete, ethnicity is categorical, weekly
time spent watching TV is numerical continuous.
Values of a categorical variable may be coded as, for example, 1=male, 2=female, but the
numbers are merely labels. It would not make sense to average them.
Different types of graphs and numerical summaries are used to describe categorical and
numerical variables.

Descriptive Statistics for Categorical Variables

Numerical summaries for categorical variables include frequency tables; graphs include
bar graphs, and pie-charts.
A frequency table lists the values (categories) of the variable together with counts for each
category (frequencies) and, if needed, relative frequencies or proportions. Proportions
are needed when comparing two groups with unequal counts.

We illustrate the descriptive statistics for categorical variables using the following
example: for 392 students, eye color was recorded with the following results: 124
students had blue eyes, 150 had brown, 15 had green, and 103 had hazel.

Frequency Table

Below is the frequency table:

Category   Frequency (Count)   Relative frequency (Proportion = Count/Total)
Blue                     124                                  124/392 = .316
Brown                    150                                  150/392 = .383
Green                     15                                   15/392 = .038
Hazel                    103                                  103/392 = .263
Total                    392                                               1
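The table above can be reproduced with a short Python sketch (the name `counts` is our own); each proportion is simply the count divided by the total:

```python
# Eye-color counts from the example above.
counts = {"Blue": 124, "Brown": 150, "Green": 15, "Hazel": 103}
total = sum(counts.values())  # 392

# Relative frequency (proportion) = count / total, as in the table.
for color, count in counts.items():
    print(f"{color:6s} {count:4d} {count / total:.3f}")
```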

Bar graph
The y-axis may display counts (as on the graphs below) or proportions.

[Bar graph of eye color counts: Blue 124, Brown 150, Green 15, Hazel 103.]

Pareto chart (named after Italian economist Vilfredo Pareto) displays the categories in
order of frequencies (most to least frequently observed), or from the tallest to the shortest
bar.
[Pareto chart of the same counts, with bars ordered Brown, Blue, Hazel, Green.]

Pie Chart

[Pie chart of eye colors: Brown 150 (38.3%), Blue 124 (31.6%), Hazel 103 (26.3%), Green 15 (3.8%).]

Each slice represents a category. The size of each slice is proportional to the frequency of
the category.

Descriptive Statistics for Numerical Variables

Numerical summaries: mean, standard deviation, median, quartiles, interquartile range,
and range.

Mean

Example. Consider the sample data: 1, 2, 2, 7, 8.
The sample mean x̄ = (1+2+2+7+8)/5 = 4. The mean (or average) is a summary measure of the
center of the data.

Standard Deviation

The standard deviation is a measure of spread. It is denoted by s and calculated as

s = √( ∑(x − x̄)² / (n − 1) ),

where n is the number of observations in the sample (the sample size), x̄ is the sample
mean, and ∑ stands for summation. The denominator is n − 1 rather than n because
x̄ estimates the mean of the population. If the mean of the population (from which the
sample was taken) were known, it would be subtracted from the x's in the numerator,
and the denominator would be n. The squared standard deviation s² is called the
variance.

In the example, s = √( [(1−4)² + (2−4)² + (2−4)² + (7−4)² + (8−4)²] / (5−1) ).
To do the calculations step by step, the table below may be helpful:

x        x − x̄    (x − x̄)²
1        -3        9
2        -2        4
2        -2        4
7         3        9
8         4       16
Total     0       42

Values in the second column, x − x̄, are called deviations of the data points from the
mean, and the deviations always add up to 0. The total of the third column is the numerator
under the square root in the formula for s: s = √(42/(5−1)) = 3.24.

Here is how the standard deviation reflects spread: if we increase the spread in our data
by replacing 8 with 13, the standard deviation increases.
For the data set 1, 2, 2, 7, 13:

x̄ = (1+2+2+7+13)/5 = 5

x        x − x̄    (x − x̄)²
1        -4       16
2        -3        9
2        -3        9
7         2        4
13        8       64
Total     0      102

The standard deviation s = √(102/(5−1)) = 5.05.
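The step-by-step calculation above can be sketched in Python; `sample_sd` is a helper name we introduce, and the formula is exactly the n − 1 version defined earlier:

```python
from math import sqrt

def sample_sd(data):
    """Sample standard deviation: sqrt of the sum of squared deviations over n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

print(round(sample_sd([1, 2, 2, 7, 8]), 2))   # 3.24
print(round(sample_sd([1, 2, 2, 7, 13]), 2))  # 5.05
```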

Range

The range is the difference between the largest (maximum) and the smallest (minimum)
observations.
For the data set: 1, 2, 2, 7, 8, range=8-1=7.
For the data set: 1, 2, 2, 7, 13, range=13-1=12.

Median

The median (denoted by M) is another summary measure of center. In the data ordered
from smallest to largest, it equals the middle observation if the number of observations
(n) is odd, and the average of two observations adjacent to the middle if n is even.

The median splits the data into two halves, with approximately 50% of all data being
below the median, and approximately 50% of the data above the median.

For the data set: 1, 2, 2, 7, 8, M=2.


For the data set: 1, 2, 2, 7, 13, M=2.
The median is not affected by extremely large or small observations in the data set. The
mean is!

For the data set 1, 2, 2, 7, 8, 8, the median M=(2+7)/2=4.5.

Quartiles

Quartiles, Q1 (first quartile), and Q3 (third quartile) cut the data into quarters so that
approximately 25% of data fall below Q1, and approximately 75% of data fall below Q3.
We use the word “approximately” because if the number of observations is not a multiple
of 4, exact slicing into quarters is not possible. Different software programs and different
calculators may employ slightly different rules for approximations in cutting.
We adopt the following approach: Q1 is the median of the lower half of data (below the
median), and Q3 is the median of upper half of data (above the median).
For example, for the data set 1, 2, 2, 7, 8, 8, the median M=(2+7)/2=4.5, the lower half
of the data is 1, 2, 2, and Q1=2. Upper half is 7, 8, 8, Q3=8.

When the number of observations is odd, the median is the middle data point, and we
include it in both lower and upper halves. For example, for data 1, 2, 2, 7, 8, the median
is 2. Lower half: 1, 2, 2, and Q1=2. Upper half is 2, 7, 8, Q3=7.
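The convention above (for odd n, include the median in both halves) can be sketched in Python; `median` and `quartiles` are names we introduce:

```python
def median(data):
    """Middle value (odd n) or average of the two middle values (even n)."""
    s = sorted(data)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

def quartiles(data):
    """Q1 and Q3 under the notes' convention: for odd n, the median
    is included in both the lower and the upper half."""
    s = sorted(data)
    n = len(s)
    half = n // 2
    lower = s[:half + 1] if n % 2 == 1 else s[:half]
    upper = s[half:]
    return median(lower), median(upper)

print(quartiles([1, 2, 2, 7, 8]))     # (2, 7)
print(quartiles([1, 2, 2, 7, 8, 8]))  # (2, 8)
```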

Interquartile Range

The interquartile range IQR = Q3 − Q1 is a summary measure of spread that accompanies the
median as a measure of center.

Graphs for Numerical Variables

The most frequently constructed graphs are:
- stem-and-leaf plot (or stemplot)
- histogram
- dotplot
- boxplot (or box-and-whisker plot)

We will show how to construct these graphs using two hypothetical data sets of scores of
two groups of students on a test.
Group 1 (n=10): 136, 107, 123, 107, 102, 112, 129, 110, 112, 111
Group 2 (n=11): 98, 138, 119, 109, 102, 117, 123, 114, 110, 112, 112

Stem-and-leaf Plot

To construct a stem-and-leaf plot, look at the data and find the common leading digits (the
stem) and the differing final digits (the leaves). In this case, we can use the first two digits
(the number of tens) as a stem, and the third digit as a leaf.

Start by drawing a stem:

09|
10|
11|
12|
13|

Then go through the data points and add leaves:

Group 1:

09|
10|772
11|2210
12|39
13|6
Finally, order the leaves from smallest to largest:

09|
10|277
11|0122
12|39
13|6

For group 2:

09|8
10|29
11|022479
12|3
13|8

The plots can be put side-by-side for comparison:

Group 2 Group 1
8|09|
92|10|277
974220|11|0122
3|12|39
8|13|6

Histogram

To construct a histogram, follow these steps:

1) Divide the range of the data into intervals of equal length. These intervals are called
class intervals, or classes.
2) Count the number of observations that fall into each interval to determine the
interval's frequency; or determine relative frequencies when comparing two
groups of unequal size.
3) Display the frequencies or relative frequencies with bars.

A histogram that displays counts is called a frequency histogram; a histogram that
displays relative frequencies is called a relative frequency histogram. Sometimes relative
frequencies are converted to percents: for example, a relative frequency of 0.40
corresponds to 40%.

For our test scores data:

Class interval     Group 1 Frequency   Group 1 Rel. freq.   Group 2 Frequency   Group 2 Rel. freq.
97.5 to <102.5                     1             1/10=.1                    2            2/11=.18
102.5 to <107.5                    2             2/10=.2                    0                   0
107.5 to <112.5                    4             4/10=.4                    4            4/11=.36
112.5 to <117.5                    0                   0                    2            2/11=.18
117.5 to <122.5                    0                   0                    1            1/11=.09
122.5 to <127.5                    1             1/10=.1                    1            1/11=.09
127.5 to <132.5                    1             1/10=.1                    0                   0
132.5 to <137.5                    1             1/10=.1                    0                   0
137.5 to <142.5                    0                   0                    1            1/11=.09
Total                             10                   1                   11                   1
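The frequency column for Group 1 can be checked with a small sketch (variable names are illustrative); each observation is counted into the half-open class interval [lo, hi) containing it:

```python
# Group 1 scores and the class intervals used in the table above.
group1 = [136, 107, 123, 107, 102, 112, 129, 110, 112, 111]
edges = [97.5 + 5 * i for i in range(10)]  # 97.5, 102.5, ..., 142.5

# Count observations falling in each half-open interval [lo, hi).
counts = [sum(lo <= x < hi for x in group1)
          for lo, hi in zip(edges, edges[1:])]
print(counts)  # [1, 2, 4, 0, 0, 1, 1, 1, 0]
```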

The histograms that display frequencies are below.

[Frequency histograms of Group 1 and Group 2 test scores.]

Frequency histograms are not suitable for comparing the groups, as the groups have unequal
sizes. Histograms that display percents (relative frequencies × 100%) are shown below:

[Histograms of Group 1 and Group 2 scores with percent on the y-axis, drawn on a common scale from 100 to 140.]

Dotplot

Histograms display frequencies or percents, but not the actual data points. Dotplots show
the data points, represented by dots. If several observations are identical in value, the
corresponding dots are stacked on top of one another. Here are the dotplots for the
scores of groups 1 and 2:

[Dotplots of Group 1 and Group 2 scores on a common axis from 102 to 138.]

Shape of the Distribution

The term distribution refers to the pattern of variation of a variable, i.e. a description of
how it varies from subject to subject. The outline of a histogram, dotplot, or stem-and-leaf
plot provides the shape of the distribution (e.g. bell-shaped, u-shaped, triangular, flat or
uniform, etc.). Shapes are classified as symmetrical or skewed. If a distribution is not
symmetrical, then one of its tails is longer than the other, and the distribution is said to
be skewed in the direction of the longer tail. For example, if the right tail is longer than
the left, the distribution is skewed to the right.

Boxplot

Like histograms, boxplots do not show the actual data points. They display the quartiles, the
median, and extremely large or small observations that fall more than 1.5·IQR from the
quartiles (potential outliers).

Before we construct a boxplot, we need to calculate the median and quartiles for our test
score data for groups 1 and 2. Our first step is to order the data from smallest to largest:

Group 1: 102, 107, 107, 110, 111, 112, 112, 123, 129, 136
M = (111+112)/2 = 111.5; the lower half of the data is 102, 107, 107, 110, 111, so Q1 = 107;
the upper half is 112, 112, 123, 129, 136, so Q3 = 123; IQR = Q3 − Q1 = 16.

Group 2: 98, 102, 109, 110, 112, 112, 114, 117, 119, 123, 138

M = 112; the lower half of the data is 98, 102, 109, 110, 112, 112 (when n is odd, both the
lower and upper halves include the median), so Q1 = (109+110)/2 = 109.5; the upper half is
112, 114, 117, 119, 123, 138, so Q3 = (117+119)/2 = 118; IQR = 118 − 109.5 = 8.5.

Steps for constructing a boxplot:
1) Draw a box extending from Q1 to Q3.
2) Draw a line inside the box to display the median.
3) Calculate the locations of invisible fences (think of the box as a house, and then
put a fence around it): upper fence = Q3 + 1.5·IQR, lower fence = Q1 − 1.5·IQR.
4) Start at the low end of the data: if the smallest observation is inside the fence,
extend the whisker to it. If not, mark the data point outside the fence with an
asterisk, and check whether the second smallest data point is inside the fence.
In other words, the left whisker extends to the smallest observation inside the fence.
5) Similarly to the left whisker, draw the right whisker extending to the largest
observation inside the fence.
6) Mark observations outside the fences with asterisks. They are called outliers, as
they fall far away from the rest of the data points (more than 1.5·IQR from the
quartiles).

Below are the boxplots of our test score data:

[Side-by-side boxplots of Group 1 and Group 2 scores; an asterisk marks the outlier in Group 2.]

There are no outliers in group 1. In group 2, the lower fence is Q1 − 1.5·IQR = 109.5 −
1.5·8.5 = 96.75; the smallest observation is 98, which is inside the fence, so the left whisker
extends to it. The upper fence is 118 + 1.5·8.5 = 130.75. The largest observation, 138, is
outside the fence (an outlier); the second largest observation is 123, and the right whisker
extends to it.

The box encloses the middle 50% of the data (between Q1 and Q3), so the length of the box
equals the IQR. Group 1 has a larger spread among the central 50% of scores than
group 2. Group 2, however, has an outlier. The spread of the entire data set for the two
groups (as measured by the range or the standard deviation) is about the same.
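The fence calculation for Group 2 can be sketched as follows (names are illustrative; the quartiles are the values computed earlier):

```python
# Quartiles for Group 2, computed earlier in the text.
q1, q3 = 109.5, 118
iqr = q3 - q1  # 8.5

# Invisible fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr  # 96.75
upper_fence = q3 + 1.5 * iqr  # 130.75

group2 = [98, 102, 109, 110, 112, 112, 114, 117, 119, 123, 138]
outliers = [x for x in group2 if x < lower_fence or x > upper_fence]
print(outliers)  # [138]
```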

Empirical Rule

When a dotplot or a histogram has a bell shape, that is, when its outline resembles a
bell, the standard deviation can be used to describe the spread of the values the variable
takes, using the Empirical Rule:
- approximately 68% of all values fall within one standard deviation of the mean;
- approximately 95% of all values fall within two standard deviations of the mean;
- essentially all (approximately 99.7%) of all values fall within three standard
deviations of the mean.

For example, if scores have mean 50 and standard deviation 10, and the histogram of
scores has a shape that is close to a bell, approximately 68% of the students scored
between 40 and 60; approximately 95% of the students scored between 30 and 70.
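The three intervals of the Empirical Rule for this example can be computed directly (a minimal sketch using the stated mean of 50 and standard deviation of 10):

```python
# Empirical Rule intervals for bell-shaped data,
# using the mean and standard deviation from the example above.
mean, sd = 50, 10
for k, pct in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    low, high = mean - k * sd, mean + k * sd
    print(f"about {pct} of values fall between {low} and {high}")
```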

[Histogram of 200 scores (mean 50, standard deviation 10) with an overlaid bell-shaped normal curve.]

Studying Relationship Between Two Variables

So far we have learned how to describe one variable at a time.
Now we will look at two variables: one, denoted by X, designated as the predictor or
explanatory variable, and the other, denoted by Y, designated as the response.
Different numerical summaries and graphs are used when X and Y are categorical and
when X and Y are numerical.
The case when Y is numerical (e.g. test score) and X is categorical (e.g. group 1 versus
group 2) is illustrated above by the side-by-side stem-and-leaf plots, histograms,
boxplots, and dotplots.
Graphs and numerical summaries described below will help us answer the question:

Is there an association between X and Y? Is Y contingent upon X? Can we predict Y
based on X?

Categorical versus Categorical Variables

For categorical variables X and Y, a useful summary is a contingency table: a two-way
table with rows representing the categories of X, and columns representing the categories of Y.
We consider the following example: 13-year-old boys and girls were asked how they
perceive their mathematics ability ("Are you good at math?"). The categories of response
were: hopeless, below average, average, above average, superior. The rating of math
ability is our variable Y, and we would like to see whether it is contingent upon the variable
X = gender (2 categories, boys versus girls).

Contingency Table

Below is a contingency table with row and column totals added for convenience:

        Hopeless   Below average   Average   Above average   Superior   Total
Girls         56              61        54              21          8     200
Boys          33              41        59              40         17     190
Total         89             102       113              61         25     390

Among girls, 8 out of 200, or 4% (proportion .04), rated themselves as "superior" at math.
Among boys, 17 out of 190, or 8.9% (proportion .089), rated themselves as "superior" at
math. If there were no association between perceived math ability and gender, we would
expect the same proportions of boys and girls to rate themselves as "superior" (and the same
would apply to the other categories). However, it appears that Y (rating of ability) is
contingent upon X (gender). Note that Y is not actual ability but perceived ability, i.e.
how boys and girls think of themselves rather than how they actually do in math.
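The row proportions above can be checked with a short sketch (the dictionaries and `row_proportions` are names we introduce):

```python
# Row counts from the contingency table above.
girls = {"hopeless": 56, "below average": 61, "average": 54,
         "above average": 21, "superior": 8}
boys = {"hopeless": 33, "below average": 41, "average": 59,
        "above average": 40, "superior": 17}

def row_proportions(row):
    """Proportion of the row total falling in each category."""
    total = sum(row.values())
    return {category: round(count / total, 3) for category, count in row.items()}

print(row_proportions(girls)["superior"])  # 0.04
print(row_proportions(boys)["superior"])   # 0.089
```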

Numerical Versus Numerical Variables

Scatterplot

When two numerical variables, X and Y, are measured for each subject or object in a
study, a graph that displays the pairs of values (x, y) is called a scatterplot. It helps
visualize the relationship between X and Y.

Example. For 4 subjects, the measurements were: (1, 2) (1, 4) (2, 6) (4, 8).

The scatterplot is

[Scatterplot of y versus x for the four data points.]

On the scatterplot, we can see whether there appears to be an association between X and Y.
Associations are classified as follows:
- positive association: as X increases, Y tends to increase (“upward” trend in the
data);
- negative association: as X increases, Y tends to decrease (“downward” trend in
the data);
- no association.
If there is an association, it may be linear (data points lie close to a line that can be drawn
through the scatterplot), or it may be non-linear. We will consider linear associations.

Correlation Coefficient

The correlation coefficient, denoted by r:


- measures the strength of linear relationship between X and Y;
- if there is a positive association between X and Y, then r>0;
- if there is a negative association between X and Y, then r<0;
- r=1 if there is a perfect linear relationship between X and Y (all data points on the
scatterplot fall exactly on one line with positive slope);
- r=-1 if there is a perfect linear relationship between X and Y (all data points on
the scatterplot fall exactly on one line with negative slope);
- if r=0, then there is no linear relationship between X and Y. Note that X and Y
may be associated in a non-linear way. Uncorrelated does not mean independent.
(see figure below).

Calculating r:

r = ∑(x − x̄)(y − ȳ) / ((n − 1)·sx·sy),

where sx and sy denote the standard deviations of the x and y variables respectively,
x̄ and ȳ are the sample means of the x and y variables, and, as before, ∑ stands for
summation.

We illustrate the calculation below by putting data into a table:

x      y      x − x̄   y − ȳ   (x − x̄)²   (y − ȳ)²   (x − x̄)(y − ȳ)
1      2      -1      -3      1          9          3
1      4      -1      -1      1          1          1
2      6       0       1      0          1          0
4      8       2       3      4          9          6
Total  8      20       0       0          6         20         10

The mean of the x variable: x̄ = 8/4 = 2.
The mean of the y variable: ȳ = 20/4 = 5.
The standard deviation of the x variable: sx = √(6/(4−1)) = 1.414.
The standard deviation of the y variable: sy = √(20/(4−1)) = 2.58.
The numerator in the formula for the correlation coefficient is ∑(x − x̄)(y − ȳ) = 10.
Therefore r = 10/((4−1)·1.414·2.58) = .91.

Since r is close to 1, the data points lie close to a line on the scatterplot.
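The whole calculation can be sketched in Python (the function name `correlation` is our own; the formula is exactly the one given above):

```python
from math import sqrt

def correlation(xs, ys):
    """r = sum((x - xbar)(y - ybar)) / ((n - 1) * sx * sy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / ((n - 1) * sx * sy)

r = correlation([1, 1, 2, 4], [2, 4, 6, 8])
print(round(r, 2))  # 0.91
```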

Coefficient of Determination

The coefficient of determination R² = r² × 100% is commonly reported by statistical
software (also shown as R-sq). It is useful since both r = 1 and r = −1 indicate a perfect
linear relationship between X and Y. The closer R² is to 100%, the closer the data points
are to lying on one line.

Regression Equation

To find the equation of the line (also called the regression line)

ŷ = a + bx,

calculate the slope as b = r·(sy/sx), and then find the y-intercept as a = ȳ − b·x̄.
These formulas were derived so that the sum of squares of the vertical deviations of
the data points from the line is as small as possible. The formula for the y-intercept
reflects the fact that the point (x̄, ȳ) belongs to the line, i.e. its coordinates satisfy the
regression equation. In our example, b = (.91)·2.58/1.414 = 1.67, and
a = 5 − 2·1.67 = 1.66.
The regression equation is yˆ = 1.66 + 1.67 x .
We use y with a hat (ŷ) to distinguish the value predicted by the regression equation
from the observed value y.

The line is plotted on the figure below:

[Scatterplot of y versus x with the regression line ŷ = 1.66 + 1.67x.]

For data point (1, 2), the value we get by substituting x = 1 into the regression equation is
ŷ = 3.33. The observed value is y = 2. The error of prediction, or residual, is y − ŷ =
2 − 3.33 = −1.33. The residual is negative: the data point is below the line.

For data point (1, 4), ŷ =3.33, residual=4-3.33=.67.


For data point (2, 6) ŷ =5, residual=6-5=1. Data point lies 1 unit above the line.
For data point (4, 8), ŷ =8.34, residual=8-8.34=-.34.

Note that since the regression line is constructed to go through the middle of the
scatterplot, the sum of the residuals equals zero: -1.33+.67+1-.34=0.

Using the formulas for the slope and y-intercept, we make the sum of squares of the
residuals as small as it can be based on given data. The regression line is called the
least squares regression line.
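A least-squares fit can be sketched directly (the name `least_squares` is our own). Note that the exact least-squares coefficients here are a = b = 5/3 ≈ 1.67; the text's a = 1.66 comes from rounding r to .91 in the intermediate step:

```python
# Least-squares slope and intercept for the example data (1,2), (1,4), (2,6), (4,8).
def least_squares(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: algebraically equivalent to b = r * sy / sx used in the text.
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx  # the line passes through (x-bar, y-bar)
    return a, b

xs, ys = [1, 1, 2, 4], [2, 4, 6, 8]
a, b = least_squares(xs, ys)
print(round(a, 2), round(b, 2))  # 1.67 1.67

# The residuals sum to zero, as noted above.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(abs(sum(residuals)) < 1e-9)  # True
```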
Note:
- r does not depend on which of the two variables is designated as X (predictor)
and which as Y (response), because the x's and y's appear in the formula
symmetrically;
- the regression equation does depend on which of the two variables is designated as X
(predictor) and which as Y (response);
- the slope of the regression line and the correlation coefficient always have the
same sign.

Correlation and Causation

When there is a non-zero correlation between two variables, there is a linear association.
However, a cause-and-effect relationship, or causal relationship, between two variables
cannot be established based on statistics alone. One needs to think about the nature of the
two variables and their relationship to say that a change in X would cause a change in Y.
For example, if X is the height and Y is the weight of a person, the relationship between the
two variables is causal: if a person grows taller, they are likely to weigh more.
If X is a student's shoe size and Y is a score on a test, there may be an association but no
causal relationship. Stretching somebody's feet is not going to improve their test scores.
However, there may be an indirect explanation: students with larger shoe sizes tend to be
taller, and taller students may be more confident and do better on a test.

Example. For a sample of 30 books selected from the MSU Library, the following
variables were recorded: call number, year published, number of pages, thickness, width
and height (in mm), and the number of authors. The data are presented in the table below.

call number      year published   number of pages   thickness   width   height   authors
HN488 n3.134 79 151 6 60 90 11
HQ10.H4 60 125 7 48 70 2
HM251.2474 48 229 9 53 80 2
HN88.20357 68 107 5 82 79 8
H5192.0183C.2 72 256 7 60 85 7
HQ1206.A73C.2 79 233 9 40 67 11
HN90.V5P34C.2 72 223 8 58 80 7
HQ1241.M268 86 216 6 44 77 7
HN90.S6R68 76 489 13 63 92 1
HJ9150.P4 83 256 9 58 88 9
HN49.C61048 82 142 6 59 89 9
HM251.C653 78 531 17 65 89 3
HM291.5392 80 351 10 55 110 4
HQ1410.A76 85 291 60 59 89 2
HN270.B854 * 183 50 53 74 1
HN57.T613 88 298 80 59 88 2
HQ518.T6813 88 283 80 74 100 2
HQ773.6.863 92 338 80 56 87 4
HQ18.C6C54 89 234 40 50 80 6
HN51.J6 85 187 23 57 88 5
HM15.C36 47 173 24 57 88 4
HN58.B7 66 382 15 54 83 1
HM251.A54 33 526 21 58 91 2
HQ1413.D24167 86 224 9 59 90 7
HJ7451.N44 88 270 10 62 93 6
HM131.546 66 192 6 55 81 3
HJ4121.M49574 91 59 6 83 120 5
HJ4770.63 77 224 5 52 75 4
HQ72.67M6 66 160 5 41 70 5
B855H51 63 50 4 47 69 6

Is there a relationship between height and width of a book? To answer this question, a
scatterplot is constructed:

[Scatterplot of height versus width for the 30 books, showing an upward trend.]

What value of the correlation coefficient matches the above plot best?
(a) .68 (b) -.68 (c) .12 (d) -.12 (e) .97

Since there is an upward trend in the data (as x increases, y tends to increase), there is a
positive association between x and y, so choices (b) and (d) can be eliminated.
Choice (c), .12, corresponds to a weak association (on the scale from 0 to 1), and choice
(e) represents a very strong association; neither is the case. Therefore the value
r = .68 matches the plot best.

For the scatterplot that relates the thickness of a book to the number of authors, the
association is negative, and so is the correlation coefficient (r=-.47).

[Scatterplot of thickness versus number of authors, showing a downward trend.]

One would expect to find a relationship between the number of pages in a book and the
book's thickness. The correlation coefficient is .255, reflecting a moderate strength of
association. Below is the scatterplot with the regression line:

[Scatterplot of thickness versus number of pages with the fitted regression line.]

Regression Analysis: thickness versus number of pages

The regression equation is
thickness = 8.4 + 0.0514 number of pages

Using the regression equation, we can predict the thickness of a book given the number
of pages: a book with x=100 pages is predicted to be 8.4+0.0514*100=13.54 mm thick.
In other words, for x=100, ŷ =13.54.

If we set x to be 0, the predicted value of y is 8.4. This corresponds to a book with no
pages, so y-intercept of the equation in this example gives the thickness of the cover.
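The two predictions above can be checked with a minimal sketch (the helper name `predict_thickness` is our own; the coefficients are those reported in the regression output):

```python
# Coefficients from the regression output above.
a, b = 8.4, 0.0514

def predict_thickness(pages):
    """Predicted thickness (mm) for a book with the given number of pages."""
    return a + b * pages

print(round(predict_thickness(100), 2))  # 13.54
print(round(predict_thickness(0), 2))    # 8.4, the y-intercept
```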

The summary of the variables x=number of pages, and y=book’s thickness is provided
below:
Descriptive Statistics: number of pages, thickness

Variable           N    Mean   StDev   Minimum   Median   Maximum
number of pages   30   246.1   119.9      50.0    226.5     531.0
thickness         30   21.00   24.11      4.00     9.00     80.00

In our notation, x̄ = 246.1, ȳ = 21, sx = 119.9, sy = 24.11. Using these values and r = .255,
we can check that the slope of the regression line is b = 0.255·24.11/119.9 = .051, and the
y-intercept is a = 21 − 0.051·246.1 = 8.4 (if x is about average, y is about average). The
equation is ŷ = 8.4 + 0.051x.
In this example, we did not find a strong association between the number of pages and a
book's thickness. One explanation may be that different types of books were included in
the sample (e.g. paperback and hardcover books), and it is possible that the relationship
between the variables differs by cover type.

Example. In a study of the relationship of brain weight to body weight, brain weight (in g)
and body weight (in kg) were recorded for n = 28 animals. We will see that transformations
are sometimes needed to obtain a linear relationship, and that "outliers" can affect
the regression equation.

Species          body weight   brain weight   logbody   logbrain
Mt Beaver 1.4 8.10 0.13033 0.90849
Cow 465.0 423.00 2.66745 2.62634
Gray wolf 36.3 119.50 1.56027 2.07737
Goat 27.7 115.00 1.44185 2.06070
Guinea Pig 1.0 5.50 0.01703 0.74036
Diplodocus 11700.0 50.00 4.06819 1.69897
Asian Elephant 2547.0 4603.00 3.40603 3.66304
Donkey 187.1 419.00 2.27207 2.62221
Horse 521.0 655.00 2.71684 2.81624
Potar monkey 10.0 115.00 1.00000 2.06070
Cat 3.3 25.60 0.51851 1.40824
Giraffe 529.0 680.00 2.72346 2.83251
Gorilla 207.0 406.00 2.31597 2.60853
Human 62.0 1320.00 1.79239 3.12057
African elephant 6654.0 5712.00 3.82308 3.75679
Triceratops 9400.0 70.00 3.97313 1.84510
Rhesus monkey 6.8 179.00 0.83251 2.25285

Kangaroo 35.0 56.00 1.54407 1.74819
Hamster 0.1 1.00 -0.92082 0.00000
Mouse 0.0 0.40 -1.63827 -0.39794
Rabbit 2.5 12.10 0.39794 1.08279
Sheep 55.5 175.00 1.74429 2.24304
Jaguar 100.0 157.00 2.00000 2.19590
Chimpanzee 52.2 440.00 1.71734 2.64345
Brachiosaurus 87000.0 154.50 4.93952 2.18893
Rat 0.3 1.90 -0.55284 0.27875

Consider what would happen if you tried to plot these weights on a number line with
units in feet. A guinea pig’s weight would plot at about 1 foot, a cat’s at about 3 feet, and
a horse’s at over 500 feet. If the body weight of an African elephant is plotted on the
number line with units in feet, it would be 1.26 miles from the origin. (Recall that there
are 5,280 feet in one mile.).
Because of the different scales of magnitude, the data have been transformed by taking
logarithms (base 10). Notice that the logs of these numbers give their order of
magnitude, and they don’t vary too much. From now on we concentrate only on the
transformed data, given in the columns labeled logbody, logbrain.
It is easy to measure the body weight of an animal, and not easy to measure brain weight. If
we establish an association between X and Y, the regression equation computed from this
sample would help us predict brain weight from body weight for animals other than
the ones included in the table.

Below is the scatterplot of log-transformed brain weight (Y) versus log-transformed body weight (X).

[Scatterplot of logbrain versus logbody.]

Which value of the correlation coefficient matches the above graph best?
(a) -0.78 (b) 0.2 (c) 0.96 (d) -0.1 (e) 0.78

First, we can eliminate the negative values, as the scatterplot shows a positive
association between X and Y. The value 0.2 corresponds to a weak association (on the scale
from 0 to 1) and can be eliminated. The value 0.96 would be reasonable except for
three points located far to the right in the X direction but below the main data cloud in the
Y direction. Therefore, for this graph, the value 0.78 is the best match among the choices
given.

Look at the table and find the dinosaurs in it. Locate the corresponding points on
the scatterplot. Given the choices above, what value of the correlation would be the best
match for the graph that does not include the dinosaurs? (Answer: 0.96.)

Given r = 0.77, x̄ = 1.638, ȳ = 1.922, sx = 1.638, sy = 1.042, calculate the slope of the
regression line: b = 0.77·1.042/1.638 = .49, and the y-intercept:
a = 1.922 − 0.49·1.638 = 1.12 (if x is about average, y is about average). The equation is
ŷ = 1.12 + 0.49x.
To use this equation to predict the brain weight of an animal with a body weight of 35 kg,
we first find x = log(body weight) = 1.54. Then we substitute this value into the regression
equation to find the predicted value of the log of brain weight: ŷ = 1.12 + 0.49·1.54 = 1.87.
The predicted brain weight is 10 to the power of 1.87, or 74.13 g.
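The back-transformed prediction can be sketched as follows (intermediate values are rounded to two decimals, as in the text):

```python
from math import log10

# Coefficients from the log-log regression above.
a, b = 1.12, 0.49

x = round(log10(35), 2)       # log of the body weight: 1.54
y_hat = round(a + b * x, 2)   # predicted log of brain weight: 1.87
print(round(10 ** y_hat, 2))  # back-transform to grams: 74.13
```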
The line corresponding to the regression equation is plotted on the scatterplot below:

[Scatterplot of logbrain versus logbody with the fitted regression line.]

Note that the dinosaur points do not fit the linear pattern. Below is how the line would
look if the dinosaur points were deleted from the data (a new data set of n = 25 animals).
Once the dinosaurs are removed, the correlation coefficient increases, the relationship
between the two variables becomes more linear, and the regression equation changes:
it becomes ŷ = 0.93 + .75x. Note that compared to the case when the dinosaurs are included,
the y-intercept did not change much, but the slope did, because the dinosaur points were
"pulling the line down," making it less steep.

[Scatterplot of logbrain versus logbody with the dinosaurs removed, with the fitted regression line.]

