C1S1 Statistics Packet
Table of Contents
Introduction
Descriptive statistics for categorical variables
    Frequency table
    Bar graph
    Pie chart
Descriptive statistics for numerical variables
    Mean
    Standard deviation
    Range
    Median
    Quartiles
    Interquartile range
Graphs for numerical variables
    Stem-and-leaf plot
    Histogram
    Dotplot
    Shape of the distribution
    Boxplot
Empirical rule
Studying relationship between two variables
    Categorical versus categorical variables
        Contingency table
    Numerical versus numerical variables
        Scatterplot
        Correlation coefficient
        Coefficient of determination
        Regression equation
        Correlation and Causation
Introduction
Numerical summaries for categorical variables include frequency tables; graphical summaries include bar graphs and pie charts.
A frequency table lists the values (categories) of the variable together with the count for each category (its frequency) and, if needed, relative frequencies or proportions. Proportions are needed when comparing two groups with unequal counts.
We illustrate the descriptive statistics for categorical variables using the following
example: for 392 students, eye color was recorded with the following results: 124
students had blue eyes, 150 had brown, 15 had green, and 103 had hazel.
Frequency Table

Eye color   Frequency   Proportion
Blue           124        .316
Brown          150        .383
Green           15        .038
Hazel          103        .263
Total          392       1.000
Bar graph
The y-axis may display counts (as on the graphs below) or proportions.

[Bar graph of eye-color counts: y-axis from 0 to 160, bars for Blue (124), Brown (150), Green (15), Hazel (103)]
A Pareto chart (named after the Italian economist Vilfredo Pareto) displays the categories in order of frequency (most to least frequently observed), or from the tallest bar to the shortest.

[Pareto chart of eye-color counts: bars ordered Brown, Blue, Hazel, Green]
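The ordering a Pareto chart uses can be sketched in a few lines of Python (an illustrative snippet, not part of the original packet):

```python
# Sort eye-color categories from most to least frequent,
# the order in which a Pareto chart draws its bars.
counts = {"Blue": 124, "Brown": 150, "Green": 15, "Hazel": 103}

pareto_order = sorted(counts, key=counts.get, reverse=True)
print(pareto_order)  # ['Brown', 'Blue', 'Hazel', 'Green']
```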
Pie Chart
[Pie chart "Eye Colors": Blue 124 (31.6%), Brown 150 (38.3%), Green 15 (3.8%), Hazel 103 (26.3%)]
Each slice represents a category. The size of each slice is proportional to the frequency of
the category.
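The proportions behind the slice sizes can be computed directly from the counts; here is a short Python sketch (illustrative, not from the original packet):

```python
# Proportions (relative frequencies) for the eye-color data;
# each pie-chart slice is proportional to its category's frequency.
counts = {"Blue": 124, "Brown": 150, "Green": 15, "Hazel": 103}
total = sum(counts.values())  # 392 students

proportions = {c: round(n / total, 3) for c, n in counts.items()}
print(proportions)
# {'Blue': 0.316, 'Brown': 0.383, 'Green': 0.038, 'Hazel': 0.263}
```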
Mean

The mean (denoted x̄) is the sum of all observations divided by the number of observations: x̄ = Σx/n.
For the data set 1, 2, 2, 7, 8, the mean is x̄ = (1+2+2+7+8)/5 = 4.
Standard Deviation

s = √( Σ(x − x̄)² / (n − 1) ),

where n is the number of observations in a sample (the sample size), x̄ is the sample mean, and Σ stands for summation. The denominator is n − 1 rather than n because x̄ estimates the mean of the population. If the mean of the population (from which the sample was taken) were known, then it would be subtracted from the x's in the numerator, and the denominator would be n. The squared standard deviation s² is called the variance.
In the example, s = √( ((1−4)² + (2−4)² + (2−4)² + (7−4)² + (8−4)²) / (5−1) ).
To do the calculations step by step, the table below may be helpful:

x       x − x̄   (x − x̄)²
1       −3        9
2       −2        4
2       −2        4
7        3        9
8        4       16
Total    0       42

Values in the second column, x − x̄, are called deviations of each data point from the mean, and deviations always add up to 0. The total of the third column is the numerator under the square root in the formula for s: s = √(42/(5 − 1)) = 3.24.
Here is how the standard deviation reflects spread: if we increase the spread in our data by replacing 8 with 13, the standard deviation increases.
For the data set 1, 2, 2, 7, 13:
x̄ = (1+2+2+7+13)/5 = 5

x       x − x̄   (x − x̄)²
1       −4       16
2       −3        9
2       −3        9
7        2        4
13       8       64
Total    0      102

Now s = √(102/(5 − 1)) = 5.05, larger than 3.24.
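The step-by-step calculation above translates directly into code; this Python sketch (illustrative, not from the packet) applies the same formula to both data sets:

```python
import math

# Sample standard deviation, computed exactly as in the formula
# s = sqrt( sum((x - mean)^2) / (n - 1) ).
def sample_sd(data):
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

print(round(sample_sd([1, 2, 2, 7, 8]), 2))   # 3.24
print(round(sample_sd([1, 2, 2, 7, 13]), 2))  # 5.05
```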
Range
Range is the difference between the largest (maximum) and the smallest (minimum)
observations.
For the data set: 1, 2, 2, 7, 8, range=8-1=7.
For the data set: 1, 2, 2, 7, 13, range=13-1=12.
Median
The median (denoted by M) is another summary measure of center. In the data ordered from smallest to largest, it equals the middle observation if the number of observations (n) is odd, and the average of the two middle observations if n is even.
The median splits the data into two halves, with approximately 50% of all data being
below the median, and approximately 50% of the data above the median.
Quartiles
Quartiles, Q1 (first quartile), and Q3 (third quartile) cut the data into quarters so that
approximately 25% of data fall below Q1, and approximately 75% of data fall below Q3.
We use the word “approximately” because if the number of observations is not a multiple
of 4, exact slicing into quarters is not possible. Different software programs and different
calculators may employ slightly different rules for approximations in cutting.
We adopt the following approach: Q1 is the median of the lower half of the data (below the median), and Q3 is the median of the upper half of the data (above the median).
For example, for the data set 1, 2, 2, 7, 8, 8, the median is M=(2+7)/2=4.5; the lower half of the data is 1, 2, 2, so Q1=2, and the upper half is 7, 8, 8, so Q3=8.
When the number of observations is odd, the median is the middle data point, and we include it in both the lower and the upper halves. For example, for the data 1, 2, 2, 7, 8, the median is 2. The lower half is 1, 2, 2, so Q1=2; the upper half is 2, 7, 8, so Q3=7.
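The convention just described (for odd n, the median joins both halves) can be sketched in Python; this is an illustrative snippet, not part of the packet:

```python
# Median and quartiles following the convention used here:
# when n is odd, the median is included in both halves.
def median(data):
    d = sorted(data)
    n = len(d)
    if n % 2 == 1:
        return d[n // 2]
    return (d[n // 2 - 1] + d[n // 2]) / 2

def quartiles(data):
    d = sorted(data)
    n = len(d)
    half = n // 2 + 1 if n % 2 == 1 else n // 2  # odd n: keep median in both halves
    return median(d[:half]), median(d[-half:])

print(median([1, 2, 2, 7, 8, 8]), quartiles([1, 2, 2, 7, 8, 8]))  # 4.5 (2, 8)
print(median([1, 2, 2, 7, 8]), quartiles([1, 2, 2, 7, 8]))        # 2 (2, 7)
```

Note that software using a different quartile convention may give slightly different Q1 and Q3 for the same data.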
Interquartile Range

The interquartile range (IQR) is the difference between the quartiles: IQR = Q3 − Q1. It measures the spread of the middle 50% of the data.

Graphs for Numerical Variables

We will show how to construct these graphs using two hypothetical data sets of scores of two groups of students on a test.
Group 1 (n=10): 136, 107, 123, 107, 102, 112, 129, 110, 112, 111
Group 2 (n=11): 98, 138, 119, 109, 102, 117, 123, 114, 110, 112, 112
Stem-and-leaf Plot
To construct a stem-and-leaf plot, look at the data and find the common leading digits (the stems) and the differing digits (the leaves). In this case, we can use the first two digits (the number of tens) as the stem and the last digit as the leaf. First, list the stems:
09|
10|
11|
12|
13|
Then record each Group 1 observation as a leaf next to its stem, in the order the data appear:
09|
10|772
11|2210
12|39
13|6
Finally, order the leaves from smallest to largest:
09|
10|277
11|0122
12|39
13|6
For group 2:
09|8
10|29
11|022479
12|3
13|8
Group 2 Group 1
8|09|
92|10|277
974220|11|0122
3|12|39
8|13|6
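The stems-and-leaves bookkeeping can be automated; here is an illustrative Python sketch (not part of the packet) that reproduces the ordered Group 1 plot:

```python
from collections import defaultdict

# Build the ordered stems and leaves for a stem-and-leaf plot:
# stem = the tens digits, leaf = the last digit.
def stem_and_leaf(data):
    stems = defaultdict(list)
    for value in sorted(data):
        stems[value // 10].append(value % 10)
    return {stem: "".join(map(str, leaves)) for stem, leaves in stems.items()}

group1 = [136, 107, 123, 107, 102, 112, 129, 110, 112, 111]
print(stem_and_leaf(group1))
# {10: '277', 11: '0122', 12: '39', 13: '6'}
```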
Histogram
[Frequency histogram of Group 1 scores: x-axis from 100 to 135, y-axis frequency from 0 to 3]

[Frequency histogram of Group 2 scores: x-axis from 100 to 140, y-axis frequency from 0 to 3]
Frequency histograms are not suitable for group comparisons when the groups have unequal sizes. Histograms that display percents (relative frequencies × 100%) are shown below:
[Percent histogram of Group 1 scores: x-axis from 100 to 140, y-axis percent from 0 to 40]

[Percent histogram of Group 2 scores: x-axis from 100 to 140, y-axis percent from 0 to 40]
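Converting counts to percents per bin is what puts the two groups on the same scale; this Python sketch illustrates it (the bin edges of width 10 are an assumption for illustration, not taken from the packet):

```python
# Percent (relative-frequency) histogram bars: count per bin,
# divided by the group size, times 100.
def percent_histogram(data, edges):
    n = len(data)
    counts = [sum(lo <= x < hi for x in data) for lo, hi in zip(edges, edges[1:])]
    return [round(100 * c / n, 1) for c in counts]

edges = [90, 100, 110, 120, 130, 140]  # hypothetical bins of width 10
group1 = [136, 107, 123, 107, 102, 112, 129, 110, 112, 111]
group2 = [98, 138, 119, 109, 102, 117, 123, 114, 110, 112, 112]
print(percent_histogram(group1, edges))  # [0.0, 30.0, 40.0, 20.0, 10.0]
print(percent_histogram(group2, edges))
```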
Dotplot
Histograms display frequencies or percents, but not the actual data points. Dotplots show the data points represented by dots. If several observations are identical in value, the dots corresponding to these observations are stacked on top of one another. Here are the dotplots for the scores of Groups 1 and 2:

[Dotplots of Group 1 and Group 2 scores on a common axis from 102 to 138]
Shape of the Distribution

The term distribution refers to the pattern of variation of a variable, i.e., a description of how it varies from subject to subject. The outline of a histogram, dotplot, or stem-and-leaf plot shows the shape of the distribution (e.g., bell-shaped, u-shaped, triangular, flat or uniform, etc.). Shapes are classified as symmetrical or skewed. If a distribution is not symmetrical, then one of its tails is longer than the other, and the distribution is said to be skewed in the direction of the longer tail. For example, if the right tail is longer than the left, the distribution is skewed to the right.
Boxplot
Like histograms, boxplots do not show the actual data points. They display quartiles,
median, and extremely large or small observations that fall more than 1.5*IQR from the
quartiles (potential outliers).
Before we construct a boxplot, we need to calculate the median and quartiles for our test score data for Groups 1 and 2. Our first step is to order the data from smallest to largest:
Group 1: 102, 107, 107, 110, 111, 112, 112, 123, 129, 136
M=(111+112)/2=111.5; the lower half of the data is 102, 107, 107, 110, 111, so Q1=107; the upper half is 112, 112, 123, 129, 136, so Q3=123; IQR=Q3−Q1=16.
Group 2: 98, 102, 109, 110, 112, 112, 114, 117, 119, 123, 138
M=112; the lower half of the data is 98, 102, 109, 110, 112, 112 (when n is odd, both the lower and the upper halves include the median), so Q1=(109+110)/2=109.5; the upper half is 112, 114, 117, 119, 123, 138, so Q3=(117+119)/2=118; IQR=118−109.5=8.5.
[Side-by-side boxplots of Group 1 and Group 2 scores]
The box encloses the middle 50% of the data (between Q1 and Q3). The length of the box equals the IQR. Group 1 has a larger spread among the central 50% of scores than Group 2. Group 2, however, has an outlier. The spread of the entire data for the two groups (as measured by the range or standard deviation) is about the same.
Empirical Rule
When a dotplot or a histogram has a bell shape, that is, the outline of the shape resembles a bell, the standard deviation can be used to describe the spread of the values the variable takes using the Empirical Rule:
- approximately 68% of all values fall within one standard deviation of the mean;
- approximately 95% of all values fall within two standard deviations of the mean;
- essentially all (approximately 99.7%) of all values fall within three standard
deviations of the mean.
For example, if scores have mean 50 and standard deviation 10, and the histogram of
scores has a shape that is close to a bell, approximately 68% of the students scored
between 40 and 60; approximately 95% of the students scored between 30 and 70.
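The rule can be checked by simulation; this Python sketch (illustrative, with a hypothetical seeded sample) draws bell-shaped data with mean 50 and standard deviation 10, matching the score example:

```python
import random

# Empirical-rule check on simulated bell-shaped data
# (mean 50, standard deviation 10, as in the score example).
random.seed(1)
scores = [random.gauss(50, 10) for _ in range(10_000)]

within1 = sum(40 <= s <= 60 for s in scores) / len(scores)
within2 = sum(30 <= s <= 70 for s in scores) / len(scores)
print(round(within1, 2), round(within2, 2))  # close to 0.68 and 0.95
```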
[Histogram of Scores with a normal curve overlaid: Mean 50, StDev 10.00, N 200; x-axis from 20 to 80, y-axis percent]
Studying Relationship Between Two Variables

Categorical versus Categorical Variables

Is there an association between X and Y? Is Y contingent upon X? Can we predict Y based on X?
Contingency Table
Below is a contingency table with row and column totals added for convenience:

[Contingency table of self-rated math ability by gender: among 200 girls, 8 rated themselves "superior"; among 190 boys, 17 did]
Among girls, 8 out of 200, or 4% (proportion .04) rated themselves as “superior” at math.
Among boys, 17 out of 190, or 8.9% (proportion .089) rated themselves as “superior” at
math. If there were no association between perceived math ability and gender, we would expect the same proportions of boys and girls to rate themselves as "superior" (the same would apply to the other categories). However, it appears that Y (rating of ability) is
contingent upon X (gender). Note that Y is not the actual ability, but a perceived one, i.e.
how boys and girls think of themselves rather than how they actually do in math.
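The comparison of conditional proportions quoted above is simple arithmetic; this Python sketch (illustrative, not from the packet) reproduces it:

```python
# Conditional proportions from the contingency-table counts quoted above.
girls_superior, girls_total = 8, 200
boys_superior, boys_total = 17, 190

p_girls = girls_superior / girls_total
p_boys = boys_superior / boys_total
print(round(p_girls, 3), round(p_boys, 3))  # 0.04 0.089
```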
Scatterplot
When two numerical variables, X and Y, are measured for each subject or object in a study, a graph that displays the pairs of values (x, y) is called a scatterplot. It helps visualize the relationship between X and Y.
Example. For 4 subjects, the measurements were: (1, 2) (1, 4) (2, 6) (4, 8).
The scatterplot is:

[Scatterplot of y versus x for the four points (1, 2), (1, 4), (2, 6), (4, 8)]
Correlation Coefficient

The correlation coefficient r measures the strength and direction of the linear association between X and Y; it is always between −1 and 1.
Calculating r:

r = Σ(x − x̄)(y − ȳ) / ((n − 1) sx sy),

where sx and sy are the standard deviations of the x- and y-variables.
x      y      x − x̄   y − ȳ   (x − x̄)²   (y − ȳ)²   (x − x̄)(y − ȳ)
1      2      −1      −3       1          9          3
1      4      −1      −1       1          1          1
2      6       0       1       0          1          0
4      8       2       3       4          9          6
Total  8     20        0       0          6         20         10
The standard deviation of the x-variable is sx = √(6/(4 − 1)) = 1.414, and the standard deviation of the y-variable is sy = √(20/(4 − 1)) = 2.58.
The numerator in the formula for the correlation coefficient is Σ(x − x̄)(y − ȳ) = 10.
Therefore r = 10/((4 − 1) × 1.414 × 2.58) = .91.
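The whole table-and-formula calculation condenses into a few lines of Python (an illustrative sketch, not part of the packet):

```python
import math

# Correlation coefficient computed directly from the formula
# r = sum((x - xbar)(y - ybar)) / ((n - 1) * sx * sy).
def correlation(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return num / ((n - 1) * sx * sy)

print(round(correlation([1, 1, 2, 4], [2, 4, 6, 8]), 2))  # 0.91
```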
Coefficient of Determination

The coefficient of determination, r², gives the proportion of the variation in Y explained by the regression on X. In the example, r² = .91² ≈ .83.
Regression Equation

The regression line ŷ = a + bx has slope b = r·sy/sx and y-intercept a = ȳ − b·x̄. In the example, b = .91 × 2.58/1.414 = 1.67 and a = 5 − 1.67 × 2 = 1.67, so the equation is ŷ = 1.67 + 1.67x.
The line is plotted on the figure below:
[Scatterplot of y versus x with the fitted regression line]
For the data point (1, 2), the value we get if we substitute x=1 into the regression equation is ŷ = 3.33. The observed value is y=2. The error of prediction, or residual, is y − ŷ = 2 − 3.33 = −1.33. The residual is negative: the data point is below the line.
Note that since the regression line is constructed to go through the middle of the scatterplot, the sum of the residuals equals zero: −1.33 + .67 + 1 − .34 = 0.
Using the formulas for the slope and y-intercept, we make the sum of squares of the
residuals as small as it can be based on given data. The regression line is called the
least squares regression line.
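The least-squares slope and intercept for the four example points, and the zero-sum property of the residuals, can be verified with this illustrative Python sketch (not part of the packet):

```python
# Least-squares slope and intercept for the four example points;
# the residuals of the fitted line sum to zero.
xs, ys = [1, 1, 2, 4], [2, 4, 6, 8]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)          # slope = Sxy / Sxx
a = ybar - b * xbar                           # intercept
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(round(b, 2), round(a, 2))  # 1.67 1.67; residuals sum to ~0
```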
Note:
- r does not depend on which of the two variables is designated as X (predictor),
and which one is Y (response), because x’s and y’s appear in the formula in a
symmetrical way;
- the regression equation depends on which of the two variables is designated as X (predictor), and which one is Y (response);
- the slope of the regression line and the correlation coefficient always have the
same sign.
Correlation and Causation

An association between two variables does not mean that one causes the other: causation can not be established based on statistics. One needs to think about the nature of the two variables and their relationship to say that a change in X would cause a change in Y.
For example, if X is the height and Y is the weight of a person, the relationship between the two variables is causal: if a person grows taller, they are likely to weigh more.
If X is a student's shoe size and Y is a score on a test, there may be an association but no causal relationship. Stretching somebody's feet is not going to improve their test scores. However, there may be a psychological explanation: students with larger shoe sizes tend to be taller, and taller students may be more confident and do better on a test.
Example. For a sample of 30 books selected from the MSU Library, the following
variables were recorded: call number, year published, number of pages, thickness, width
and height (in mm), and the number of authors. Data are presented in table below.
Is there a relationship between height and width of a book? To answer this question, a
scatterplot is constructed:
[Scatterplot of book height (60-120 mm) versus width (40-80 mm), showing an upward trend]
What value of the correlation coefficient matches the above plot best?
(a) .68 (b) -.68 (c) .12 (d) -.12 (e) .97
Since there is an upward trend in the data (as x increases, y tends to increase), there is a positive association between x and y, and therefore choices (b) and (d) can be eliminated. Choice (c), .12, corresponds to a weak (on the scale from 0 to 1) association, and choice (e) represents a very strong association; neither is the case. Therefore the value r=.68 matches the plot best.
For the scatterplot that relates the thickness of a book to the number of authors, the
association is negative, and so is the correlation coefficient (r=-.47).
[Scatterplot of thickness (0-80 mm) versus number of authors (0-12), showing a downward trend]
One would expect to find a relationship between the number of pages in a book and the book's thickness. The correlation coefficient is .255, reflecting a moderate strength of association. Below is the scatterplot with the regression line:
[Scatterplot of thickness (0-80 mm) versus number of pages (0-600) with the regression line]
Using the regression equation, we can predict the thickness of a book given the number
of pages: a book with x=100 pages is predicted to be 8.4+0.0514*100=13.54 mm thick.
In other words, for x=100, ŷ =13.54.
If we set x to 0, the predicted value of y is 8.4. This corresponds to a book with no pages, so the y-intercept of the equation in this example gives the thickness of the cover.
The summary of the variables x=number of pages, and y=book’s thickness is provided
below:
Descriptive Statistics: number of pages, thickness
In our notation, x̄ = 246.1, ȳ = 21, sx = 119.9, sy = 24.11. Using these values and r=.255, we can check that the slope of the regression line is b = 0.255 × 24.11/119.9 = .051, and the y-intercept is a = 21 − 0.051 × 246.1 = 8.4 (if x is about average, y is about average). The equation is ŷ = 8.4 + 0.051x.
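The slope, intercept, and the 100-page prediction can all be recomputed from the summary statistics above; this is an illustrative Python sketch, not part of the packet:

```python
# Slope and intercept from the summary statistics quoted above,
# then a prediction for a 100-page book.
r, s_x, s_y = 0.255, 119.9, 24.11
xbar, ybar = 246.1, 21

b = r * s_y / s_x        # slope
a = ybar - b * xbar      # y-intercept
predicted = a + b * 100  # thickness (mm) for x = 100 pages
print(round(b, 3), round(a, 1), round(predicted, 1))  # 0.051 8.4 13.5
```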
In this example, we did not find a strong association between the number of pages and a book's thickness. One explanation may be that different types of books were included in the sample (e.g., paperback and hardcover books), and it is possible that the relationship between the variables differs by cover type.
Animal          body (kg)   brain (g)   logbody    logbrain
Kangaroo           35.0       56.00      1.54407    1.74819
Hamster             0.1        1.00     -0.92082    0.00000
Mouse               0.0        0.40     -1.63827   -0.39794
Rabbit              2.5       12.10      0.39794    1.08279
Sheep              55.5      175.00      1.74429    2.24304
Jaguar            100.0      157.00      2.00000    2.19590
Chimpanzee         52.2      440.00      1.71734    2.64345
Brachiosaurus   87000.0      154.50      4.93952    2.18893
Rat                 0.3        1.90     -0.55284    0.27875
Consider what would happen if you tried to plot these weights on a number line with
units in feet. A guinea pig’s weight would plot at about 1 foot, a cat’s at about 3 feet, and
a horse’s at over 500 feet. If the body weight of an African elephant is plotted on the number line with units in feet, it would be 1.26 miles from the origin. (Recall that there are 5,280 feet in one mile.)
Because of the different scales of magnitude, the data have been transformed by taking
logarithms (base 10). Notice that the logs of these numbers give their order of
magnitude, and they don’t vary too much. From now on we concentrate only on the
transformed data, given in the columns labeled logbody, logbrain.
It is easy to measure the body weight of an animal, and not easy to measure brain weight. If we establish an association between X and Y, the regression equation computed from this sample would help us predict brain weight from body weight for animals other than the ones included in the table.
Below is the scatterplot of log-transformed brain weight (Y) versus body weight (X).

[Scatterplot of logbrain versus logbody; logbody runs from −2 to 5]
Which value of the correlation coefficient matches the above graph best?
(a) -0.78 (b) 0.2 (c) 0.96 (d) -0.1 (e) 0.78
22
First, we can eliminate the negative values, as the scatterplot shows a positive association between X and Y. The value 0.2 corresponds to a weak (on the scale from 0 to 1) association and can be eliminated. The value 0.96 would be reasonable except for three points located far away in the X direction but below the main data cloud in the Y direction. Therefore, for this graph, the value 0.78 is the best match out of the choices given.
Look at the table and find the dinosaurs in it. Locate the corresponding points on the scatterplot. Given the choices above, what value of the correlation would be the best match for the graph that does not include the dinosaurs? (Answer: 0.96.)
Given r=0.77, x̄ = 1.638, ȳ = 1.922, sx = 1.638, sy = 1.042, calculate the slope of the regression line: b = 0.77 × 1.042/1.638 = .49, and the y-intercept: a = 1.922 − 0.49 × 1.638 = 1.12 (if x is about average, y is about average). The equation is ŷ = 1.12 + 0.49x.
To use this equation to predict the brain weight of an animal with body weight 35 kg, we first find x = log(body weight) = 1.54. Then substitute this value into the regression equation to find the predicted value of the log of brain weight: ŷ = 1.12 + 0.49 × 1.54 = 1.87. The predicted brain weight is 10 to the power of 1.87, or 74.13 g.
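The same prediction can be done in code; this illustrative Python sketch (not part of the packet) keeps full precision in the logarithm, which gives roughly 75 g, slightly above the 74.13 g obtained with the rounded value 1.87:

```python
import math

# Predict brain weight on the log scale with the fitted line
# yhat = 1.12 + 0.49 * log10(body weight), then undo the log.
a, b = 1.12, 0.49
body_kg = 35.0

log_body = math.log10(body_kg)  # about 1.544
log_brain = a + b * log_body    # about 1.877
brain_g = 10 ** log_brain       # about 75 g
print(round(log_brain, 2), round(brain_g, 1))
```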
The line corresponding to the regression equation is plotted on the scatterplot below:
[Scatterplot of logbrain versus logbody with the regression line ŷ = 1.12 + 0.49x]
Note that the dinosaur points do not fit the linear pattern. Below is how the line would look if the dinosaur points were deleted from the data (a new data set of n=25 animals). Once the dinosaurs are removed, the correlation coefficient increases, the relationship between the two variables becomes more linear, and the regression equation changes: it is ŷ = 0.93 + .75x. Note that, compared to the case when the dinosaurs are included, the y-intercept did not change much, but the slope did, because the dinosaur points were “pulling the line down,” making it less steep.
[Scatterplot of logbrain versus logbody with the dinosaurs removed (n=25) and the new regression line]