CE 459 Statistics: Assistant Prof. Muhammet Vefa AKPINAR
[Histogram of VAR1: upper class boundaries 50 to 100 on the X-axis, number of observations on the Y-axis, with the expected normal curve overlaid.]
What is Statistics
Frequency Distribution
Descriptive Statistics
Normal Probability Distribution
Sampling Distribution of the Mean
Simple Linear Regression & Correlation
Multiple Regression & Correlation
08.10.2011 2
INTRODUCTION
Criticism
There is a general perception that statistical knowledge is all too frequently misused intentionally, by finding ways to interpret the data that are favorable to the presenter.
(A famous quote, variously attributed but often credited to Benjamin Disraeli, is: "There are three kinds of lies: lies, damned lies, and statistics.") Indeed, the well-known book How to Lie with Statistics by Darrell Huff discusses many cases of deceptive uses of statistics, focusing on misleading graphs. By choosing (or rejecting, or modifying) a certain sample, results can be manipulated; throwing out outliers is one means of doing so. This may be the result of outright fraud or of subtle and unintentional bias on the part of the researcher.
WHAT IS STATISTICS?
Definition
Statistics is a group of methods used to
collect, analyze, present, and interpret data
and to make decisions.
What is Statistics?
Descriptive statistics and Inferential statistics.
Population
Sample
Median:
A number such that at most half of the
measurements are below it and at most half of the
measurements are above it.
Mode:
The most frequent measurement in the data.
Mean
The Sample Mean (ȳ) is the arithmetic average of a data set.
It is used to estimate the population mean, μ.
It is calculated by taking the sum of the observed values (yi) divided by the number of observations (n).
Historical Transmogrifier
Average Unit Production Costs

System   $K (yi)
  1       22.2
  2       17.3
  3       11.8
  4        9.6
  5        8.8
  6        7.6
  7        6.8
  8        3.2
  9        1.7
 10        1.6

ȳ = (Σ yi)/n = (y1 + y2 + … + yn)/n = (22.2 + 17.3 + … + 1.6)/10 = $9.06K
The Mode
The mode, symbolized by Mo, is the most frequently occurring score value.
If the scores for a given sample distribution are:
32 32 35 36 37 38 38 39 39 39 40 40 42 45
then the mode is 39, since 39 occurs more often (three times) than any other score.
A distribution may have more than one mode if the two most frequently
occurring scores occur the same number of times. For example, if the earlier
score distribution were modified as follows:
32 32 32 36 37 38 38 39 39 39 40 40 42 45
then there would be two modes, 32 and 39. Such distributions are called
bimodal. The frequency polygon of a bimodal distribution is presented below.
Example of Mode
Measurements (x): 3, 5, 1, 1, 4, 7, 3, 8, 3
Mode: 3
Notice that it is possible for a data set not to have any mode.
Mode
The Mode is the value of the data set that occurs most frequently.
Example: 1, 2, 4, 5, 5, 6, 8
Here the Mode is 5, since 5 occurred twice and no other value occurred more than once.
Measurements (x): 3, 5, 5, 1, 7, 2, 6, 7, 0, 4
In this case the data have two modes: 5 and 7. Both measurements are repeated twice.
Median
Computation of Median
When there is an odd number of numbers,
the median is simply the middle number. For
example, the median of 2, 4, and 7 is 4.
When there is an even number of numbers,
the median is the mean of the two middle
numbers. Thus, the median of the numbers
2, 4, 7, 12 is (4+7)/2 = 5.5.
Example of Median
Measurements (x): 3, 5, 5, 1, 7, 2, 6, 7, 0, 4
Measurements ranked: 0, 1, 2, 3, 4, 5, 5, 6, 7, 7
Median: (4 + 5)/2 = 4.5
Notice that only the two central values are used in the computation.
The median is not sensitive to extreme values: replacing the largest measurement with, say, 40 would leave the median unchanged.
rim diameter (cm)
unit 1   unit 2
 9.7      9.0
11.5     11.2
11.6     11.3
12.1     11.7
12.4     12.2
12.6     12.5
12.9     13.2   <-- medians
13.1     13.8
13.5     14.0
13.6     15.5
14.8     15.6
16.3     16.2
26.9     16.4
Median
The Median is the middle observation of an ordered (from low to high) data set.
Examples:
1, 2, 4, 5, 5, 6, 8 --> Median = 5
1, 3, 4, 4, 5, 7, 8, 8 --> Median = (4 + 5)/2 = 4.5
[Figure: in a symmetric distribution, Mode = Median = Mean; in a skewed distribution, the Mode, Median, and Mean separate.]
Dispersion Statistics
The Mean, Median and Mode by themselves are not sufficient
descriptors of a data set
Example:
Data Set 1: 48, 49, 50, 51, 52
Data Set 2: 5, 15, 50, 80, 100
Note that the Mean and Median for both data sets are identical, but the
data sets are glaringly different!
The difference is in the dispersion of the data points
Dispersion Statistics we will discuss are:
Range
Variance
Standard Deviation
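As a quick check of why the mean and median alone are not sufficient, the two data sets above can be compared with a short Python sketch using the standard `statistics` module:

```python
import statistics

set1 = [48, 49, 50, 51, 52]
set2 = [5, 15, 50, 80, 100]

for name, data in [("Set 1", set1), ("Set 2", set2)]:
    print(name,
          "mean:", statistics.mean(data),
          "median:", statistics.median(data),
          "range:", max(data) - min(data),
          "sample std dev:", round(statistics.stdev(data), 2))
```

Both sets have mean 50 and median 50, but the ranges (4 versus 95) and standard deviations differ dramatically, which is exactly the point of the dispersion statistics discussed next.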
Range
The Range is simply the difference between the smallest and
largest observation in a data set
Example
Data Set 1: 48, 49, 50, 51, 52 --> Range = 52 - 48 = 4
Deviation score: x = X - X̄
where:
X = raw score
X̄ = the mean
Note that if you add all the deviation scores for a dataset together, you always get zero; this is why squared deviations are used to measure dispersion.
Variance
The Variance, s2, represents the amount of variability of the data
relative to their mean
As shown below, the variance is the "average" of the squared deviations of the observations about their mean:

    s² = Σ(yi - ȳ)² / (n - 1)        (sample variance)

    σ² = Σ(yi - μ)² / N              (population variance)

The Variance, s², is the sample variance, and is used to estimate the actual population variance, σ².
Standard Deviation
The Variance is not a “common sense” statistic because it describes the
data in terms of squared units
The Standard Deviation, s, is simply the square root of the variance:

    s = √[ Σ(yi - ȳ)² / (n - 1) ]        (sample)

    σ = √[ Σ(yi - μ)² / N ]              (population)

The sample standard deviation, s, is measured in the same units as the data from which the standard deviation is being calculated.
Standard Deviation

System   FY97$K (yi)   yi - ȳ   (yi - ȳ)²
  1        22.2         13.1     172.7
  2        17.3          8.2      67.9
  3        11.8          2.7       7.5
  4         9.6          0.5       0.3
  5         8.8         -0.3       0.1
  6         7.6         -1.5       2.1
  7         6.8         -2.3       5.1
  8         3.2         -5.9      34.3
  9         1.7         -7.4      54.2
 10         1.6         -7.5      55.7
Average ȳ = 9.06                Σ = 399.8

s² = Σ(yi - ȳ)² / (n - 1) = 399.8 / 9 = 44.4 ($K²)
s = √s² = √44.4 = 6.67 ($K)

This number, $6.67K, represents the average estimating error for predicting subsequent observations.
In other words: on average, when estimating the cost of transmogrifiers that belong to the same population as the ten systems above, we would expect to be off by $6.67K.
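The whole computation can be verified with a few lines of Python (a minimal sketch using only the standard library):

```python
import math

costs = [22.2, 17.3, 11.8, 9.6, 8.8, 7.6, 6.8, 3.2, 1.7, 1.6]  # $K per system

n = len(costs)
mean = sum(costs) / n                      # sample mean, 9.06
ss = sum((y - mean) ** 2 for y in costs)   # sum of squared deviations, about 399.8
variance = ss / (n - 1)                    # sample variance, in $K^2
std_dev = math.sqrt(variance)              # sample standard deviation, in $K

print(f"mean = {mean:.2f}, s^2 = {variance:.1f}, s = {std_dev:.2f}")
```

Note that the intermediate column values in the table are rounded to one decimal; computing from the raw data gives the same s to two decimals.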
Variance and the closely-related standard deviation
The variance and the closely-related standard deviation are measures of how
spread out a distribution is. In other words, they are measures of variability.
In order to quantify the deviation of a dataset from its mean, calculate the mean of all the squared deviation scores; this quantity is the variance.
The variance is computed as the average squared deviation of each number from its mean.
For example, for the numbers 1, 2, and 3, the mean is 2 and the (population) variance is:

    σ² = [(1 - 2)² + (2 - 2)² + (3 - 2)²] / 3 = 2/3 ≈ 0.667
The variance in a population is:

    σ² = Σ(X - μ)² / N

where μ is the mean and N is the number of scores.
The standard deviation is the square root of the variance.
Variance and Standard Deviation
Example of Mean
velocity (km/h)
67 73 81 72 76 75 85 77 68 84
76 93 73 79 88 73 60 93 71 59
74 62 95 78 63 72 66 78 82 75
96 70 89 61 75 95 66 79 83 71
76 65 71 75 65 80 73 57 88 78
Mean, Median, Standard Deviation
Valid N   Range   Mean   Median   Minimum   Maximum   Variance   Std. Dev.
Frequency Table
Class no.   Class   Frequency   Relative freq.   Cumulative freq.   Cumulative relative freq.
Frequency Table
Selecting the Interval Size
Histogram
[Histogram of the velocity data: upper class boundaries (x <= boundary, 50 to 100) on the X-axis, number of observations on the Y-axis, with the expected normal curve overlaid.]
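A frequency table of the velocity data can be built directly in Python; the sketch below assumes 5 km/h classes (the class width used in these notes is not fully recoverable from the extracted table, so treat the boundaries as illustrative):

```python
# Build a frequency table for the 50 velocity observations using 5 km/h classes.
velocities = [
    67, 73, 81, 72, 76, 75, 85, 77, 68, 84,
    76, 93, 73, 79, 88, 73, 60, 93, 71, 59,
    74, 62, 95, 78, 63, 72, 66, 78, 82, 75,
    96, 70, 89, 61, 75, 95, 66, 79, 83, 71,
    76, 65, 71, 75, 65, 80, 73, 57, 88, 78,
]

width = 5
lo = min(velocities) // width * width          # lower boundary of first class
hi = max(velocities) // width * width + width  # upper boundary of last class
n = len(velocities)

cumulative = 0
for lower in range(lo, hi, width):
    upper = lower + width
    freq = sum(lower <= v < upper for v in velocities)
    cumulative += freq
    print(f"{lower}-{upper}: freq={freq:2d}  rel={freq / n:.2f}"
          f"  cum={cumulative:2d}  cum rel={cumulative / n:.2f}")
```

Each printed row corresponds to one row of the frequency table: class boundaries, frequency, relative frequency, cumulative frequency, and cumulative relative frequency.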
There are many different-shaped frequency distributions:
A frequency polygon is a graphical display of a frequency table.
The intervals are shown on the X-axis and the number of scores
in each interval is represented by the height of a point located
above the middle of the interval. The points are connected so
that together with the X-axis they form a polygon.
Spread, Dispersion, Variability
A variable's spread is the degree to which scores on the variable differ from each
other. If every score on the variable were about equal, the variable would have
very little spread. There are many measures of spread. The distributions shown
below have the same mean but differ in spread: The distribution on the bottom is
more spread out. Variability and dispersion are synonyms for spread.
Skew
Further Notes
The distribution shown below has a positive skew. The mean is larger than the median.
If a test was very difficult and almost everyone in the class did very poorly on it, the resulting distribution would most likely be positively skewed.
The distribution shown below has a negative skew. The mean is
smaller than the median.
Probability
Likelihood or chance of occurrence. The
probability of an event is the theoretical
relative frequency of the event in a
model of the population.
Normal Distribution or Normal Curve
In a normal distribution:
The normal distribution function
The normal distribution function is determined by the following formula:

    f(x) = (1 / (σ√(2π))) · e^(-(x - μ)² / (2σ²))

where:
μ: mean
σ: standard deviation
e: Euler's constant (2.71...)
π: constant Pi (3.14...)
Characteristics of the Normal Distribution:
Total area under the curve sums to 1; the area of the distribution on each side of the mean is 0.5.
The Area Under the Curve Between any Two Scores is a PROBABILITY
The probability that a random variable will have a value between any
two points is equal to the area under the curve between those points.
Positive and negative deviations from this central value are equally
likely
Examples of normal distributions
Notice that they differ in how spread out they are. The area under each curve is the
same. The height of a normal distribution can be specified mathematically in terms
of two parameters: the mean (μ) and the standard deviation (σ). The two parameters, μ and σ, each change the shape of the distribution in a different manner.
Changes in μ without changes in σ
Changes in the value of σ
Changes in the value of σ change the shape of the distribution without affecting the midpoint, because σ affects the spread or the dispersion of scores. The larger the value of σ, the more dispersed the scores; the smaller the value, the less dispersed. The distribution below demonstrates the effect of increasing the value of σ:
THE STANDARD NORMAL CURVE
Note that integral calculus is used to find the area under the normal distribution curve. However, this can be avoided by transforming any normal distribution to fit the standard normal distribution. This conversion is done by rescaling the normal distribution axis from its true units (time, weight, dollars, ...) to a standard measure called the Z score or Z value.
Standard Scores (z Scores)
A Z score is the number of standard deviations that a value, X,
is away from the mean.
Standard scores are therefore useful for comparing datapoints
in different distributions.
If the value of X is greater than the mean, the Z score is positive; if the value of X is less than the mean, the Z score is negative. The Z score equation is as follows:

    Z = (X - μ) / σ
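The table lookups in the rest of these notes can be reproduced numerically with the error function from Python's standard library (a small sketch; `phi` is our own helper name, and the example values are taken from the GMAT exercise later in these notes):

```python
from math import erf, sqrt

def phi(z):
    """Cumulative area under the standard normal curve from -infinity to z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 476, 107           # mean and std dev from the GMAT example below
x = 650
z = (x - mu) / sigma           # about 1.63 before rounding
area_mean_to_z = phi(z) - 0.5  # area between the mean and z, as in the Z table
print(f"z = {z:.2f}, area between mean and z = {area_mean_to_z:.4f}")
```

The result differs slightly from the tabulated 0.4474 because the table rounds z to 1.62 before the lookup, while the code keeps full precision.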
Table of the Standard Normal (z) Distribution
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2969 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
Three areas on a standard normal curve

Z     Area -∞ to Z   Area Z to +∞   Area -Z to +Z   Two-tail area   Two-tail PPM    Area -∞ to Z-1.5
0.0   0.50000000     0.50000000     0.00000000      1.00000000      1,000,000.0     0.06680720
0.1   0.53982784     0.46017216     0.07965567      0.92034433      920,344.3       0.08075666
0.2   0.57925971     0.42074029     0.15851942      0.84148058      841,480.6       0.09680048
0.3   0.61791142     0.38208858     0.23582284      0.76417716      764,177.2       0.11506967
0.4   0.65542174     0.34457826     0.31084348      0.68915652      689,156.5       0.13566606
0.5   0.69146246     0.30853754     0.38292492      0.61707508      617,075.1       0.15865525
0.6   0.72574688     0.27425312     0.45149376      0.54850624      548,506.2       0.18406013
0.7   0.75803635     0.24196365     0.51607270      0.48392730      483,927.3       0.21185540
0.8   0.78814460     0.21185540     0.57628920      0.42371080      423,710.8       0.24196365
0.9   0.81593987     0.18406013     0.63187975      0.36812025      368,120.3       0.27425312
1.0   0.84134475     0.15865525     0.68268949      0.31731051      317,310.5       0.30853754
1.1   0.86433394     0.13566606     0.72866788      0.27133212      271,332.1       0.34457826
1.2   0.88493033     0.11506967     0.76986066      0.23013934      230,139.3       0.38208858
1.3   0.90319952     0.09680048     0.80639903      0.19360097      193,601.0       0.42074029
1.4   0.91924334     0.08075666     0.83848668      0.16151332      161,513.3       0.46017216

The two-tail PPM column converts the two-tail area into parts per million; the last column gives the area from -∞ to (Z - 1.5), i.e. the curve shifted by 1.5σ.
The area between Z-scores of -1.00 and +1.00 is .68, or 68%.
The area between Z-scores of -2.00 and +2.00 is .95, or 95%.
Exercise 1
An industrial sewing machine uses ball bearings that are targeted to
have a diameter of 0.75 inch. The specification limits under which the
ball bearing can operate are 0.74 inch (lower) and 0.76 inch
(upper). Past experience has indicated that the actual diameter of the
ball bearings is approximately normally distributed with a mean of
0.753 inch and a standard deviation of 0.004 inch.
For this problem, note that "Target" = .75, and "Actual mean" = .753.
What is the probability that a ball bearing will be between the target and
the actual mean?
What is the probability that a ball bearing will be between the
lower specification limit and the target?
What is the probability that a ball bearing will be above the
upper specification limit?
What is the probability that a ball bearing will be below the lower
specification limit?
Above which value in diameter will 93% of the ball bearings be?
The value asked for here will be the 7th percentile, since 93% of the ball bearings will have diameters above that. So we will look up .4300 in the Z-table in a "backwards" manner. The closest area to this is .4306, which corresponds to a Z-value of 1.48. The answer is therefore 0.753 - (1.48)(0.004) ≈ 0.7471 inch.
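The four probability questions of this exercise can be checked numerically (a sketch; `phi` is our own helper for the standard normal CDF, and the answers differ from table lookups only by rounding):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 0.753, 0.004       # actual mean and std dev of diameters (inches)
target, lsl, usl = 0.75, 0.74, 0.76

p_target_to_mean = phi(0) - phi((target - mu) / sigma)
p_lsl_to_target = phi((target - mu) / sigma) - phi((lsl - mu) / sigma)
p_above_usl = 1 - phi((usl - mu) / sigma)
p_below_lsl = phi((lsl - mu) / sigma)

print(f"P(target < X < mean) = {p_target_to_mean:.4f}")
print(f"P(LSL < X < target)  = {p_lsl_to_target:.4f}")
print(f"P(X > USL)           = {p_above_usl:.4f}")
print(f"P(X < LSL)           = {p_below_lsl:.4f}")
```

The first probability, for example, corresponds to Z = (0.75 - 0.753)/0.004 = -0.75 and a table area of 0.2734.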
Exercise 2
Graduate Management Aptitude Test (GMAT) scores are widely used
by graduate schools of business as an entrance requirement.
Suppose that in one particular year, the mean score for the GMAT
was 476, with a standard deviation of 107. Assuming that the GMAT
scores are normally distributed, answer the following questions:
Question 1
What is the probability that a randomly selected score from this GMAT falls between 476 and 650, i.e. P(476 <= X <= 650)? The following figure shows a graphic representation of this problem.
Answer:
Z = (650 - 476)/107 = 1.62.
The Z value of 1.62 indicates that the GMAT score of 650 is 1.62 standard deviations above the mean. The standard normal table gives the probability of a value falling between 650 and the mean. The whole number and tenths place of the Z score appear in the first column of the table; across the top of the table are the values of the hundredths place of the Z score. Thus the answer is that 0.4474, or 44.74%, of the scores on the GMAT fall between a score of 650 and 476.
Question 2.
What is the probability of receiving a score greater than 750 on a GMAT test that
has a mean of 476 and a standard deviation of 107 i.e., P(X >= 750) = ?.
Answer
This problem asks for the area of the upper tail of the distribution.
The Z score is: Z = (750 - 476)/107 = 2.56. From the table, the area between the mean and Z = 2.56 is 0.4948.
This is the probability of a GMAT score between 476 and 750, so:
P(X >= 750) = 0.5 - 0.4948 = 0.0052, or 0.52%.
Note that P(X >= 750) is the same as P(X >750), because, in continuous
distribution, the area under an exact number such as X=750 is zero.
Question 3
What is the probability of receiving a score of 540 or less on a GMAT test that has a mean of 476 and a standard deviation of 107, i.e. P(X <= 540)?
We are asked to determine the area under the curve for all values less than or equal to 540.
Z = (540 - 476)/107 = 0.60. From the table, the area between the mean and Z = 0.60 is 0.2257, which is the probability of getting a score between the mean 476 and 540.
The answer to this problem is: 0.5 + 0.2257 = 0.7257 ≈ 0.73, or 73%.
The figure shows a graphic representation of this problem.
Question 4
What is the probability of receiving a score between 440 and 330 on a GMAT test that has a mean of 476 and a standard deviation of 107, i.e. P(330 <= X <= 440)?
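All four GMAT questions can be checked with the same standard normal CDF helper (a sketch; exact CDF values differ from the table-based answers only by the rounding of Z to two decimals):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 476, 107

def p_between(a, b):
    """P(a <= X <= b) for X ~ N(mu, sigma^2)."""
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)

print(f"P(476 <= X <= 650) = {p_between(476, 650):.4f}")          # Question 1
print(f"P(X >= 750)        = {1 - phi((750 - mu) / sigma):.4f}")  # Question 2
print(f"P(X <= 540)        = {phi((540 - mu) / sigma):.4f}")      # Question 3
print(f"P(330 <= X <= 440) = {p_between(330, 440):.4f}")          # Question 4
```

For Question 4, Z1 = (330 - 476)/107 ≈ -1.36 and Z2 = (440 - 476)/107 ≈ -0.34, giving an area of about 0.28.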
Standard Error (SE)
Any statistic can have a standard error. Each sampling distribution has a
standard error.
Standard errors are important because they reflect how much sampling
fluctuation a statistic will show, i.e. how good an estimate of the population the
sample statistic is
How good an estimate of the population mean is the sample mean? One way to determine this is to repeat the experiment many times and to determine the mean of the means. However, this is tedious and frequently impossible.
Standard Error of the Mean, SEM, σM
The standard deviation of the sampling distribution of the mean is called the standard error of the mean:

    σM = σ / √N

Note: σ is the population standard deviation and N is the sample size.
The standard error of any statistic depends on the sample size - in general,
the larger the sample size the smaller the standard error.
Note that the spread of the sampling distribution of the mean decreases as
the sample size increases.
Notice that the mean of the distribution is not affected by sample size.
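The σ/√N relationship can be demonstrated by simulation (a sketch with made-up population parameters; the σ = 36 value echoes the weight example used later in these notes):

```python
import random
import statistics

random.seed(42)

# Simulate the sampling distribution of the mean: draw many samples from a
# normal population with sigma = 36 and compare the spread of the sample
# means with the theoretical standard error sigma / sqrt(n).
sigma = 36.0
for n in (4, 16, 64):
    means = [statistics.mean(random.gauss(90, sigma) for _ in range(n))
             for _ in range(20_000)]
    observed_se = statistics.stdev(means)
    print(f"n={n:3d}  observed SE={observed_se:5.2f}"
          f"  theory={sigma / n ** 0.5:5.2f}")
```

Each line shows the observed spread of the simulated sample means shrinking in agreement with σ/√n (18, 9, and 4.5), while the center of the sampling distribution stays at 90.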
Comparing the Averages of Two Independent Samples
Is there "grade inflation" at KTU? How does the average GPA of KTU students today compare with, say, 10 years ago?
Suppose a random sample of 100 student records from 10 years ago
yields a sample average GPA of 2.90 with a standard deviation of
.40.
A random sample of 100 current students today yields a sample
average of 2.98 with a standard deviation of .45.
The difference between the two sample means is 2.98-2.90 = .08. Is
this proof that GPA's are higher today than 10 years ago?
First we need to account for the fact that 2.98 and 2.90 are not
the true averages, but are computed from random samples.
Therefore, .08 is not the true difference, but simply an estimate
of the true difference.
Can this estimate miss by much? Fortunately, statistics has a way of measuring the expected size of the "miss" (or error of estimation). For our example, it is .06 (we show how to calculate this later). Therefore, we can state the bottom line of the study as follows: "The average GPA of KTU students today is .08 higher than 10 years ago, give or take .06 or so."
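The ".06 or so" comes from the standard error of the difference between two independent sample means, which can be computed directly:

```python
from math import sqrt

# Standard error of the difference between two independent sample means:
# SE = sqrt(s1^2/n1 + s2^2/n2)
n1, s1 = 100, 0.40   # sample of student records from 10 years ago
n2, s2 = 100, 0.45   # sample of current students

se_diff = sqrt(s1**2 / n1 + s2**2 / n2)
print(f"SE of the difference = {se_diff:.2f}")
```

With these sample sizes and standard deviations the standard error works out to about 0.06, matching the figure quoted above.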
Overview of Confidence Intervals
We are interested in the mean weight of 10-year-old kids living in Turkey. Since it would be impractical to weigh all the 10-year-old kids in Turkey, we take a sample of 16 and find that the mean weight is 90 pounds. This sample mean of 90 is a point estimate of the population mean.
Confidence intervals provide more information than point estimates.
An example of a 95% confidence interval is shown below:
72.85 < μ < 107.15
There is good reason to believe that the population mean lies between
these two bounds of 72.85 and 107.15 since 95% of the time confidence
intervals contain the true mean.
It is natural to interpret a 95% confidence interval as an interval with a 0.95 probability of containing the population mean.
The wider the interval, the more confident you are that it contains the parameter; a 99% confidence interval is therefore wider than the corresponding 95% confidence interval (in the example shown it extends from 4.19 to 7.61).
Example
Assume that the weights of 10-year-old children are normally distributed with a mean of 90 and a standard deviation of 36. What is the sampling distribution of the mean for a sample size of 9?
The sampling distribution of the mean has a mean of 90 and a standard deviation of 36/√9 = 36/3 = 12. Note that the standard deviation of a sampling distribution is its standard error.
The 95% limits are:
90 - (1.96)(12) = 66.48
90 + (1.96)(12) = 113.52
The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean; 12 is the standard error of the mean.
Figure shows that 95% of the means are no more than 23.52 units
(1.96x12) from the mean of 90.
Now consider the probability that a sample mean computed in a
random sample is within 23.52 units of the population mean of 90.
Since 95% of the distribution is within 23.52 of 90, the probability that
the mean from any given sample will be within 23.52 of 90 is 0.95.
This means that if we repeatedly compute the mean (M) from a
sample, and create an interval ranging from M - 23.52 to M +
23.52, this interval will contain the population mean 95% of the
time.
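This coverage claim can be verified by simulation (a sketch; the 20,000-trial count is arbitrary):

```python
import random
import statistics

random.seed(7)

# Check the claim: the interval M - 23.52 .. M + 23.52 (i.e. M +/- 1.96 * SE)
# contains the population mean 95% of the time.
mu, sigma, n = 90, 36, 9
half_width = 1.96 * sigma / n ** 0.5          # 1.96 * 12 = 23.52

trials = 20_000
covered = sum(
    abs(statistics.mean(random.gauss(mu, sigma) for _ in range(n)) - mu)
    <= half_width
    for _ in range(trials)
)
print(f"coverage = {covered / trials:.3f}")
```

The printed coverage comes out very close to 0.95, as the argument above predicts.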
Notice that you need to know the population standard deviation (σ) in order to compute this confidence interval for the mean. This may sound unrealistic, and it is. However, computing a confidence interval when σ is known is easier than when σ has to be estimated, and it serves a pedagogical purpose.
Suppose the following five values were sampled from a normal distribution with a standard deviation of 2.5: 2, 3, 5, 6, and 9. To compute the 95% confidence interval, start by computing the mean and standard error:
M = (2 + 3 + 5 + 6 + 9)/5 = 5
σM = σ/√N = 2.5/√5 = 1.118
For a 95% confidence interval, Z.95 = 1.96, so the interval is 5 ± (1.96)(1.118), i.e. from 2.81 to 7.19.
If you had wanted to compute the 99% confidence interval, you
would have set the shaded area to 0.99 and the result would
have been 2.58.
Estimating the Population Mean Using Intervals
Example:
Given the following GPA for 6 students: 2.80, 3.20, 3.75,
3.10, 2.95, 3.40
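The 95% confidence interval for these six GPAs can be computed with the t value for df = 5 from the t table below (2.571); a short sketch:

```python
import statistics
from math import sqrt

# 95% confidence interval for the mean GPA of the 6 students,
# using t(df=5) = 2.571 from the t table.
gpas = [2.80, 3.20, 3.75, 3.10, 2.95, 3.40]

n = len(gpas)
m = statistics.mean(gpas)                 # 3.20
se = statistics.stdev(gpas) / sqrt(n)     # estimated standard error of the mean
t = 2.571

print(f"M = {m:.2f}, sM = {se:.3f}")
print(f"95% CI: {m - t * se:.2f} to {m + t * se:.2f}")
```

The interval works out to roughly 2.84 to 3.56 GPA points.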
Determining Sample Size for Estimating the Mean
Suppose the sample mean GPA is 3.05 and the interval is 3.05 ± .12 with 95% confidence. The plus-or-minus quantity .12 is called the margin of error of the sample mean associated with a 95% confidence level. It is also correct to say "we are 95% confident that μ is within .12 of the sample mean 3.05".
Confidence Interval for μ, Standard Deviation Estimated
M - t sM ≤ μ ≤ M + t sM
More generally, the formula for the 95% confidence interval on the
mean is:
Lower limit = M - (t)(sm)
Upper limit = M + (t)(sm)
where;
M is the sample mean, t is the t for the confidence level desired (0.95
in the above example), and sm is the estimated standard error of the
mean.
A comparison of the t and normal distribution
A comparison of the t distribution with 4 df
(in blue) and the standard normal
distribution (in red).
Finding t-values
Example:
50% 60% 70% 80% 90% 95% 96% 98% 99% 99.5% 99.8% 99.9%
0.25 0.2 0.15 0.1 0.05 0.025 0.02 0.01 0.005 0.0025 0.001 0.0005
df
1 1.000 1.376 1.963 3.078 6.314 12.706 15.895 31.821 63.657 127.32 318.30 636.61
2 0.817 1.061 1.386 1.886 2.920 4.303 4.849 6.965 9.925 14.089 22.327 31.599
3 0.765 0.979 1.250 1.638 2.353 3.182 3.482 4.541 5.841 7.453 10.215 12.924
4 0.741 0.941 1.190 1.533 2.132 2.776 2.999 3.747 4.604 5.598 7.173 8.610
5 0.727 0.920 1.156 1.476 2.015 2.571 2.757 3.365 4.032 4.773 5.893 6.869
6 0.718 0.906 1.134 1.440 1.943 2.447 2.612 3.143 3.707 4.317 5.208 5.959
7 0.711 0.896 1.119 1.415 1.895 2.365 2.517 2.998 3.499 4.029 4.785 5.408
8 0.706 0.889 1.108 1.397 1.860 2.306 2.449 2.896 3.355 3.833 4.501 5.041
9 0.703 0.883 1.100 1.383 1.833 2.262 2.398 2.821 3.250 3.690 4.297 4.781
10 0.700 0.879 1.093 1.372 1.812 2.228 2.359 2.764 3.169 3.581 4.144 4.587
11 0.697 0.876 1.088 1.363 1.796 2.201 2.328 2.718 3.106 3.497 4.025 4.437
12 0.696 0.873 1.083 1.356 1.782 2.179 2.303 2.681 3.055 3.428 3.930 4.318
13 0.694 0.870 1.079 1.350 1.771 2.160 2.282 2.650 3.012 3.372 3.852 4.221
14 0.692 0.868 1.076 1.345 1.761 2.145 2.264 2.624 2.977 3.326 3.787 4.140
Abbreviated t table
df 0.95 0.99
2 4.303 9.925
3 3.182 5.841
4 2.776 4.604
5 2.571 4.032
8 2.306 3.355
10 2.228 3.169
20 2.086 2.845
50 2.009 2.678
100 1.984 2.626
Example
Assume that the following five numbers are sampled from a normal distribution: 2, 3, 5, 6, and 9, and that the standard deviation is not known. The first steps are to compute the sample mean and variance:
M = 5
s² = 7.5
Standard error: sM = s/√N = √(7.5/5) = 1.225
df = N - 1 = 4
From the t table, the value for the 95% interval with df = 4 is 2.776.
Lower limit = 5 - (2.776)(1.225) = 1.60
Upper limit = 5 + (2.776)(1.225) = 8.40
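The same interval can be recomputed in a few lines of Python:

```python
import statistics
from math import sqrt

# t-based 95% confidence interval for the sample 2, 3, 5, 6, 9.
data = [2, 3, 5, 6, 9]

n = len(data)
m = statistics.mean(data)                  # 5
var = statistics.variance(data)            # sample variance, 7.5
se = sqrt(var / n)                         # 1.225
t = 2.776                                  # t(df=4) for 95% confidence

print(f"M = {m}, s^2 = {var}, sM = {se:.3f}")
print(f"95% CI: {m - t * se:.2f} to {m + t * se:.2f}")
```

Note that this interval (1.60 to 8.40) is wider than the known-σ interval computed earlier (2.81 to 7.19), which is the price of having to estimate σ from only five observations.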
Example
Suppose a researcher were interested in estimating the mean reading speed (number of words per minute) of high-school graduates and computing the 95% confidence interval. A sample of 6 graduates was taken and the reading speeds were: 200, 240, 300, 410, 450, and 600. For these data:
M = 366.6667
sM = 60.9736
df = 6 - 1 = 5
t = 2.571
95% CI: 366.67 ± (2.571)(60.9736), i.e. from 209.90 to 523.43.

Another example: the mean time difference for all 47 subjects is 16.362 seconds and the standard deviation is 7.470 seconds. The standard error of the mean is 7.470/√47 = 1.090.
NOTE (transmogrifier example): each observation is in tens of thousands, so 9.06 represents 9.06 × 10⁴.
Prediction with Regression Analysis
The relationship(s) between values of the response variable and
corresponding values of the predictor variable(s) is (are) not
deterministic.
Thus the value of y is estimated given the value of x. The estimated value of the dependent variable is denoted ŷ, and the population slope and intercept are usually denoted β1 and β0.
Linear Regression
The idea is to fit a straight line through data points
Linear Regression: the relationship between the dependent variable and the independent variable(s) is assumed to be linear.
Can extend to multiple dimensions
correlation analysis is applied to independent factors:
if X increases, what will Y do (increase, decrease, or
perhaps not change at all)?
In regression analysis a unilateral response is
assumed: changes in X result in changes in Y, but
changes in Y do not result in changes in X.
Regression Plot
[Scatter plot of m1 against vwmkt with the fitted line m1 = 0.0095937 + 0.880436 vwmkt.]
[Scatter plot of Salary ($K, 0 to 120) against Years (0 to 25).]

The least-squares slope and intercept are:

    b = Σi (xi - x̄)(yi - ȳ) / Σi (xi - x̄)²

    a = ȳ - b·x̄
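The slope and intercept formulas can be implemented "by hand" in a few lines (a sketch; the data points here are made up for illustration):

```python
# Minimal least-squares fit using the slope/intercept formulas above.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)     # least-squares slope
a = y_bar - b * x_bar                     # least-squares intercept

print(f"fitted line: y = {a:.3f} + {b:.3f} x")
```

For these data the fitted slope is about 1.99 and the intercept about 0.05, close to the y ≈ 2x relationship the points were chosen to follow.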
Regression Error
We can also write a regression equation slightly differently:

    y = a + bx + e
Also called the residual, this is the difference between our estimate of the value of
the dependent variable y and the actual value of the dependent variable y.
Unless we have perfect prediction, many of the y values will fall off of the line.
The added e in the equation refers to this fact. It would be incorrect to write
the equation without the e, because it would suggest that the y scores are
completely accounted for by just knowing the slope, x values, and the
intercept. Almost always, that is not true. There is some error in prediction, so
we need to add an e for error variation into the equation.
The actual values of y can be accounted for by the regression line equation
(y=a+bx) plus some degree of error in our prediction (the e's).
r correlation coefficient
The correlation between X and Y is expressed by the correlation coefficient r:

    r = Σ(xi - x̄)(yi - ȳ) / √[ Σ(xi - x̄)² · Σ(yi - ȳ)² ]

xi = data X, x̄ = mean of data X
yi = data Y, ȳ = mean of data Y
-1 ≤ r ≤ 1
r = 1 perfect positive linear correlation
between two variables
r = 0 no linear correlation (maybe other
correlation)
r = -1 perfect negative linear correlation
Notice that for the perfect correlation,
there is a perfect line of points. They do
not deviate from that line.
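The correlation coefficient is easy to compute directly from its definition (a sketch; `pearson_r` is our own helper name):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))    # perfect positive correlation
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))    # perfect negative correlation
```

The first pair lies exactly on an increasing line (r = 1), the second exactly on a decreasing line (r = -1), matching the two extreme cases described above.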
least squares
The principle is to establish a statistical linear relationship
between two sets of corresponding data by fitting the data to a
straight line by means of the "least squares" technique.
The resulting line takes the general form:
y = bx + a
With multiple predictors X1, X2, for example:
X1 = 'years of experience'
X2 = 'age'
Y = 'salary'

The response surface is:

    E(yi) = β0 + β1·xi1 + β2·xi2

[Figure: the response surface E(y) plotted over the (x1, x2) plane, with an observation yi above the point (xi1, xi2).]
x2
The parameters β0, β1, β2,… , βk are called partial
regression coefficients.
β1 represents the change in y corresponding to a
unit increase in x1, holding all the other predictors
constant.
A similar interpretation can be made for β2, β3,
……, βk
Regression Statistics
Multiple R              0.995
R Square                0.990
Adjusted R Square       0.989
Standard Error          0.008
Observations            30

ANOVA
              df      SS       MS        F       Significance F
Regression     4    0.164    0.041    628.372        0.000
Residual      25    0.002    0.000
Total         29    0.165

                                           Coefficients  Standard Error   t Stat    P-value
Intercept                                      0.500         0.008        60.294     0.000
Percent of Gross Hhd Income Spent on rent     -0.399         0.016       -24.610     0.000
percent 2-parent families                     -0.288         0.015       -19.422     0.000
Police Anti-Drug Program?                     -0.004         0.004        -1.238     0.227
Active Tenants Group? (1 = yes; 0 = no)       -0.102         0.004       -28.827     0.000
Controlling also for this new variable, the police anti-drug program is no longer statistically significant, and instead the presence of the active tenants group makes the dramatic difference (and look at that great R square!). However, we are not quite done…
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.928
R Square 0.861
Adjusted R Square 0.850
Standard Error 0.030
Observations 30
ANOVA
df SS MS F Significance F
Regression 2 0.149 0.074 83.484 0.000
Residual 27 0.024 0.001
Total 29 0.173
                                          Coefficients  Standard Error   t Stat    P-value    BETA
Intercept                                    0.36582        0.017        20.908     0.000
percent 2-parent families                   -0.2565         0.051        -5.017     0.000    -0.362
Active Tenants Group? (1 = yes; 0 = no)     -0.1246         0.011       -11.347     0.000    -0.821
Since the police variable now has a statistically insignificant t-score, we remove it
from the model. (We also remove the income variable, since it also becomes
insignificant after we remove the police variable.) We are left with two independent
variables: percent of 2-parent families and active tenants group.
Stepwise Regression Algorithms
• Backward Elimination
• Forward Selection
• Stepwise Selection
Backward Elimination
1. Fit the model containing all (remaining) predictors.
2. Test each predictor variable, one at a time, for a significant relationship with y.
3. Identify the variable with the largest p-value. If p > α, remove this variable from the model, and return to (1.).
4. Otherwise, stop and use the existing model.
Forward Selection
1. Fit all models with one (more) predictor.
2. Test each of these predictor variables for a significant relationship with y.
3. Identify the variable with the smallest p-value. If p < α, add this variable to the model, and return to (1.).
4. Otherwise, stop and use the existing model.
Stepwise Selection
• The Stepwise Selection method is
basically Forward Selection with Backward
Elimination added in at every step.
Stepwise Selection
1. Fit all models with one (more) predictor.
2. Test each of these predictor variables, for a
significant relationship with y.
3. Identify the variable with the smallest p-value.
If p < α, add this variable to the model, and
return to (1.).
4. Now, for the model being considered, test
each predictor variable, one at a time, for a
significant relationship with y.
5. Identify the variable with the largest p-value. If
p > α, remove this variable from the model,
and return to (1.).
6. Otherwise, stop and use the existing model.
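The backward-elimination steps above can be sketched in code. This is an illustrative implementation on synthetic data, not the procedure used in any of the regression outputs shown in these notes; the function name `backward_eliminate` and the data-generating model are our own:

```python
import numpy as np
from scipy import stats

def backward_eliminate(X, y, names, alpha=0.05):
    """Backward elimination: repeatedly drop the predictor with the largest
    p-value until every remaining predictor has p <= alpha.
    X: (n, k) predictor matrix (no intercept column); y: (n,) response."""
    keep = list(range(X.shape[1]))
    while keep:
        Xc = np.column_stack([np.ones(len(y)), X[:, keep]])  # add intercept
        beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
        resid = y - Xc @ beta
        df = len(y) - Xc.shape[1]
        s2 = resid @ resid / df
        se = np.sqrt(s2 * np.diag(np.linalg.inv(Xc.T @ Xc)))
        pvals = 2 * stats.t.sf(np.abs(beta / se), df)        # two-sided t tests
        worst = int(np.argmax(pvals[1:]))                    # skip the intercept
        if pvals[1:][worst] <= alpha:
            break
        print("dropping", names[keep[worst]])
        keep.pop(worst)
    return [names[i] for i in keep]

# Made-up data: y depends on x1 and x2 but not on the noise predictor x3.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

print("kept:", backward_eliminate(X, y, ["x1", "x2", "x3"]))
```

With this strong signal the two true predictors survive elimination; the pure-noise predictor has about a 5% chance of being kept at α = .05, which is exactly the error rate the threshold controls.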
Linear regression
Review
Multiple Regression Models
Chapter Topics
Multiple regression models:
- Linear: linear, dummy variable, interaction
- Non-linear: polynomial, square root, log, reciprocal, exponential
Linear Multiple Regression
Model
Additional Assumption
for Multiple Regression
No exact linear relation exists between any
subset of explanatory variables (perfect
"multicollinearity")
The Multiple Regression Model

    Yi = β0 + β1X1i + β2X2i + … + βpXpi + εi        (population model)

    Yi = b0 + b1X1i + b2X2i + … + bpXpi + ei        (sample model)

[Figure: the response plane μY|X = β0 + β1X1i + β2X2i over the (X1, X2) plane.]
Sample Multiple Regression Model
Bivariate model:

    Yi = b0 + b1X1i + b2X2i + ei        (observed Y)

    Ŷi = b0 + b1X1i + b2X2i             (predicted Y)

[Figure: the fitted response plane over (X1, X2), with residual ei between the observed Yi and the plane at (X1i, X2i).]
Parameter Estimation
Slope (bP)
Estimated Y changes by bP for each 1 unit increase in XP, holding all other variables constant (ceteris paribus).
Example: If b1 = -2, then fuel oil usage (Y) is
expected to decrease by 2 gallons for each 1
degree increase in temperature (X1) given the
inches of insulation (X2)
Y-Intercept (b0)
Average value of Y when all XP = 0
Sample Regression Model:
Example
                Coefficients
Intercept       562.1510092
X Variable 1     -5.436580588
X Variable 2    -20.01232067
Multiple Regression:

    ABSENCES = β0 + β1·AUTONOMY + β2·SKILLVARIETY
Overlap in Explanation
ANOVA (one predictor)
              df       SS          MS          F        Significance F
Regression     1     4867.198    4867.198   31.43612    2.62392E-08
Residual    1067   165201.7      154.8282
Total       1068   170068.9

ANOVA (two predictors)
              df       SS          MS          F
Regression     2     9098.483    4549.242   30.1266
Residual    1066   160970.4      151.0041
Total       1068   170068.9
Hypotheses:
H0: Variables Xi… do not significantly improve the model, given all others included
H1: Variables Xi… significantly improve the model, given all others included
Test statistic, with df = k and (n - p - 1):

    F = [ SSR(Xi… | all others) / k ] / MSE

where k = number of variables tested.
Testing Portions of Model: Example
Test at the α = .05 level to determine if the variable of average temperature significantly improves the model, given that insulation is included.
Testing Portions of Model: Example
H0: X1 does not improve the model (X2 included)
H1: X1 does improve the model
α = .05, df = 1 and 12, Critical Value = 4.75
[ANOVA tables for the model with X1 and X2, and for the model with X2 alone.]

Criteria for choosing among candidate models include the smallest value of s (standard deviation) and the C-p statistic.

The 95% confidence interval for the slope is -6.169 ≤ β1 ≤ -4.704: the average consumption of oil is reduced by between 4.704 and 6.169 gallons for each 1°F increase in temperature, in houses with the same insulation.