
CE 459 Statistics

Assistant Prof. Muhammet Vefa AKPINAR

[Figure: histogram of VAR1 with the expected normal curve; y-axis: No of obs, x-axis: Upper Boundaries (x <= boundary), 50 to 100]

Lecture Notes

 What is Statistics
 Frequency Distribution
 Descriptive Statistics
 Normal Probability Distribution
 Sampling Distribution of the Mean
 Simple Linear Regression & Correlation
 Multiple Regression & Correlation

INTRODUCTION

 Criticism
 There is a general perception that statistical knowledge is all-
too-frequently intentionally misused, by finding ways to
interpret the data that are favorable to the presenter.
 (A famous quote, variously attributed, but thought to be from
Benjamin Disraeli is: "There are three types of lies - lies,
damned lies, and statistics.") Indeed, the well-known book How to
Lie with Statistics by Darrell Huff discusses many cases of
deceptive uses of statistics, focusing on misleading graphs. By
choosing (or rejecting, or modifying) a certain sample, results
can be manipulated; throwing out outliers is one means of
doing so. This may be the result of outright fraud or of subtle
and unintentional bias on the part of the researcher.

WHAT IS STATISTICS?
Definition
Statistics is a group of methods used to
collect, analyze, present, and interpret data
and to make decisions.
What is Statistics ?

 American Heritage Dictionary defines statistics as:


"The mathematics of the collection,
organization, and interpretation of numerical
data, especially the analysis of population
characteristics by inference from sampling."

 The Merriam-Webster's Collegiate Dictionary


definition is: "A branch of mathematics dealing with
the collection, analysis, interpretation, and
presentation of masses of numerical data."

 The word statistics is also the plural of statistic


(singular), which refers to the result of applying a
statistical algorithm to a set of data, as in
employment statistics, accident statistics, etc.
 In applying statistics to a scientific, industrial, or societal
problem, one begins with a process or population to be
studied. This might be a population of people in a country, of
crystal grains in a rock, or of goods manufactured by a
particular factory during a given period.

 For practical reasons, rather than compiling data about an


entire population, one usually instead studies a chosen subset
of the population, called a sample.

 Data are collected about the sample in an observational or


experimental setting. The data are then subjected to
statistical analysis, which serves two related purposes:
description and inference.

Descriptive statistics and Inferential statistics.

 Statistical data analysis can be subdivided into Descriptive


statistics and Inferential statistics.

 Descriptive statistics is concerned with exploring, visualising,


and summarizing data but without fitting the data to any
models. This kind of analysis is used to explore the data in the
initial stages of data analysis. Since no models are involved, it
can not be used to test hypotheses or to make testable
predictions. Nevertheless, it is a very important part of analysis
that can reveal many interesting features in the data.

 Descriptive statistics can be used to summarize the data, either


numerically or graphically, to describe the sample. Basic
examples of numerical descriptors include the mean and
standard deviation. Graphical summarizations include various
kinds of charts and graphs.
 Inferential statistics is the next stage in data analysis and
involves the identification of a suitable model. The data is then
fit to the model to obtain an optimal estimation of the model's
parameters. The model then undergoes validation by testing
either predictions or hypotheses of the model. Models based on
a unique sample of data can be used to infer generalities about
features of the whole population.

 Inferential statistics is used to model patterns in the data,


accounting for randomness and drawing inferences about the
larger population. These inferences may take the form of
answers to yes/no questions (hypothesis testing), estimates
of numerical characteristics (estimation), forecasting of
future observations, descriptions of association (correlation),
or modeling of relationships (regression).
 Other modeling techniques include ANOVA, time series, and
data mining.
Population and sample

A portion of the population selected for study is referred to as a sample.

A population consists of all elements – individuals, items, or objects – whose characteristics are being studied. The population that is being studied is also called the target population.

[Figure: a sample drawn as a subset of the population]
Measures of Central Tendency
The central tendency of a dataset, i.e. the centre of a
frequency distribution, is most commonly measured by the 3
Ms:

 Mean = arithmetic mean =average


Sum of all measurements divided by the number
of measurements.

 Median:
A number such that at most half of the
measurements are below it and at most half of the
measurements are above it.

 Mode:
The most frequent measurement in the data.
Mean
 The Sample Mean (ȳ) is the arithmetic average of a data set.
 It is used to estimate the population mean, μ.
 Calculated by taking the sum of the observed values (yi) divided by the number
of observations (n):

ȳ = ( Σ i=1..n yi ) / n = (y1 + y2 + … + yn) / n

Example: Historical Transmogrifier Average Unit Production Costs

System | Cost ($K)
1  | 22.2
2  | 17.3
3  | 11.8
4  | 9.6
5  | 8.8
6  | 7.6
7  | 6.8
8  | 3.2
9  | 1.7
10 | 1.6

ȳ = (22.2 + 17.3 + … + 1.6) / 10 = $9.06K
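As a quick check, the same calculation in Python (a minimal sketch; the list simply repeats the cost column above):

    costs = [22.2, 17.3, 11.8, 9.6, 8.8, 7.6, 6.8, 3.2, 1.7, 1.6]  # $K, from the table
    mean = sum(costs) / len(costs)   # sum of observed values divided by n
    print(mean)                      # 9.06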
The Mode
 The mode, symbolized by Mo, is the most frequently occurring score value.
If the scores for a given sample distribution are:

 32 32 35 36 37 38 38 39 39 39 40 40 42 45

 then the mode would be 39 because a score of 39 occurs 3 times, more


than any other score.

 A distribution may have more than one mode if the two most frequently
occurring scores occur the same number of times. For example, if the earlier
score distribution were modified as follows:

 32 32 32 36 37 38 38 39 39 39 40 40 42 45

 then there would be two modes, 32 and 39. Such distributions are called
bimodal. The frequency polygon of a bimodal distribution is presented below.

Example of Mode

Measurements x: 3, 5, 1, 1, 4, 7, 3, 8, 3

 Mode: 3
 Notice that it is possible for a data set not to have any mode.
Mode
 The Mode is the value of the data set that occurs most frequently
 Example:
 1, 2, 4, 5, 5, 6, 8
 Here the Mode is 5, since 5 occurred twice and no other value

occurred more than once


 Data sets can have more than one mode, while the mean and median
have one unique value
 Data sets can also have NO mode, for example:
 1, 3, 5, 6, 7, 8, 9
 Here, no value occurs more frequently than any other,

therefore no mode exists


 You could also argue that this data set contains 7 modes since

each value occurs as frequently as every other


Example of Mode

Measurements x: 3, 5, 5, 1, 7, 2, 6, 7, 0, 4

 In this case the data have two modes: 5 and 7
 Both measurements are repeated twice
Median
 Computation of Median
When there is an odd number of numbers,
the median is simply the middle number. For
example, the median of 2, 4, and 7 is 4.
When there is an even number of numbers,
the median is the mean of the two middle
numbers. Thus, the median of the numbers
2, 4, 7, 12 is (4+7)/2 = 5.5.

Example of Median

Measurements x: 3, 5, 5, 1, 7, 2, 6, 7, 0, 4
Ranked: 0, 1, 2, 3, 4, 5, 5, 6, 7, 7

 Median: (4 + 5)/2 = 4.5
 Notice that only the two central values are used in the computation.
 The median is not sensitive to extreme values.
Example: medians of two ordered samples of rim diameter (cm)

unit 1: 9.7, 11.5, 11.6, 12.1, 12.4, 12.6, 12.9*, 13.1, 13.5, 13.6, 14.8, 16.3, 26.9
unit 2: 9.0, 11.2, 11.3, 11.7, 12.2, 12.5, 13.2*, 13.8, 14.0, 15.5, 15.6, 16.2, 16.4

(* marks the median, the 7th of the 13 ordered values in each unit)
Median
 The Median is the middle observation of an ordered (from low
to high) data set
 Examples:
 1, 2, 4, 5, 5, 6, 8

 Here, the middle observation is 5, so the median is 5

 1, 3, 4, 4, 5, 7, 8, 8

 Here, there is no “middle” observation so we take the

average of the two observations at the center

Median = (4 + 5) / 2 = 4.5
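For reference, Python's standard statistics module reproduces the median and mode rules above (a minimal sketch; the data list is the ten measurements from the earlier examples):

    import statistics

    data = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]
    print(sorted(data))                 # [0, 1, 2, 3, 4, 5, 5, 6, 7, 7]
    print(statistics.median(data))      # 4.5, the mean of the two middle values
    print(statistics.multimode(data))   # [5, 7], the bimodal case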
[Figure: for a symmetric distribution, Mode = Median = Mean; for a skewed distribution the Mode, Median, and Mean separate]
Dispersion Statistics
 The Mean, Median and Mode by themselves are not sufficient
descriptors of a data set
 Example:
 Data Set 1: 48, 49, 50, 51, 52
 Data Set 2: 5, 15, 50, 80, 100
 Note that the Mean and Median for both data sets are identical, but the
data sets are glaringly different!
 The difference is in the dispersion of the data points
 Dispersion Statistics we will discuss are:
 Range
 Variance
 Standard Deviation
Range
 The Range is simply the difference between the smallest and
largest observation in a data set
 Example
 Data Set 1: 48, 49, 50, 51, 52

 Data Set 2: 5, 15, 50, 80, 100

 The Range of data set 1 is 52 - 48 = 4


 The Range of data set 2 is 100 - 5 = 95
 So, while both data sets have the same mean and median, the
dispersion of the data, as depicted by the range, is much
smaller in data set 1
deviation score

 A deviation score is a measure of how much each point in a
frequency distribution lies above or below the mean for the
entire dataset:

deviation = X − X̄

 where:
X = raw score
X̄ = the mean
 Note that if you add all the deviation scores for a dataset
together, the sum is always zero.
Variance
 The Variance, s², represents the amount of variability of the data
relative to their mean
 As shown below, the variance is the "average" of the squared
deviations of the observations about their mean:

s² = Σ(yi − ȳ)² / (n − 1)

 The Variance, s², is the sample variance, and is used to estimate the actual
population variance:

σ² = Σ(yi − μ)² / N
Standard Deviation
 The Variance is not a "common sense" statistic because it describes the
data in terms of squared units
 The Standard Deviation, s, is simply the square root of the variance:

s = √( Σ(yi − ȳ)² / (n − 1) )

 The Standard Deviation, s, is the sample standard deviation, and is
used to estimate the actual population standard deviation:

σ = √( Σ(yi − μ)² / N )

 The sample standard deviation, s, is measured in the same units as the
data from which the standard deviation is being calculated

Standard Deviation

System | FY97$K (yi) | yi − ȳ | (yi − ȳ)²
1  | 22.2 | 13.1 | 172.7
2  | 17.3 | 8.2  | 67.9
3  | 11.8 | 2.7  | 7.5
4  | 9.6  | 0.5  | 0.3
5  | 8.8  | -0.3 | 0.1
6  | 7.6  | -1.5 | 2.1
7  | 6.8  | -2.3 | 5.1
8  | 3.2  | -5.9 | 34.3
9  | 1.7  | -7.4 | 54.2
10 | 1.6  | -7.5 | 55.7
Average ȳ = 9.06

s² = Σ(yi − ȳ)² / (n − 1) = (172.7 + 67.9 + … + 55.7) / (10 − 1) = 399.8 / 9 = 44.4 ($K²)

s = √s² = √44.4 = 6.67 ($K)
 This number, $6.67K, represents the average estimating error for predicting
subsequent observations
 In other words: On average, when estimating the cost of transmogrifiers that
belong to the same population as the ten systems above, we would expect to
be off by $6.67K
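The same numbers can be reproduced with Python's statistics module, which uses the n − 1 divisor for the sample variance (a minimal sketch):

    import statistics

    costs = [22.2, 17.3, 11.8, 9.6, 8.8, 7.6, 6.8, 3.2, 1.7, 1.6]
    print(statistics.variance(costs))   # sample variance, ~44.4 ($K^2)
    print(statistics.stdev(costs))      # sample standard deviation, ~6.67 ($K)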
Variance and the closely-related standard deviation

 The variance and the closely-related standard deviation are measures of how
spread out a distribution is. In other words, they are measures of variability.

 In order to quantify the deviation of a dataset from the mean, calculate the mean
of all the squared deviation scores, i.e. the variance.

 The variance is computed as the average squared deviation of each number from
its mean.

 For example, for the numbers 1, 2, and 3, the mean is 2 and the (population) variance is:

σ² = ((1−2)² + (2−2)² + (3−2)²) / 3 = 2/3 ≈ 0.667

 The variance in a population is:

σ² = Σ(X − μ)² / N

 The variance in a sample is:

s² = Σ(X − X̄)² / (n − 1)

 where μ is the population mean, X̄ is the sample mean, and N (or n) is the number of scores.

 The standard deviation is the square root of the variance.

Variance and Standard Deviation

Example of Mean

x | x − mean
3 | -1
5 | 1
5 | 1
1 | -3
7 | 3
2 | -2
6 | 2
7 | 3
0 | -4
4 | 0
Σ 40 | 0

 MEAN = 40/10 = 4
 Notice that the sum of the "deviations" is 0.
 Notice that every single observation intervenes in the computation of the mean.
Example of Variance

x | x − mean | (x − mean)²
3 | -1 | 1
5 | 1  | 1
5 | 1  | 1
1 | -3 | 9
7 | 3  | 9
2 | -2 | 4
6 | 2  | 4
7 | 3  | 9
0 | -4 | 16
4 | 0  | 0
Σ 40 | 0 | 54

 Variance = 54/9 = 6
 It is a measure of "spread".
 Notice that the larger the deviations (positive or negative), the larger the variance.
The standard deviation

 It is defined as the square root of the variance
 In the previous example
 Variance = 6
 Standard deviation = Square root of the
variance = Square root of 6 = 2.45
Observed Vehicle velocity

velocity (km/h)

67 73 81 72 76 75 85 77 68 84

76 93 73 79 88 73 60 93 71 59

74 62 95 78 63 72 66 78 82 75

96 70 89 61 75 95 66 79 83 71

76 65 71 75 65 80 73 57 88 78
Mean, Median, Standard Deviation

Valid Numbers | Range | Mean | Median | Minimum | Maximum | Variance | Std. Dev.
50 | 39 | 75.62 | 75 | 57 | 96 | 96.362 | 9.816

Frequency Table

Class | Class interval | Midpoint | Frequency | Relative freq. % | Cumulative freq. | Cumulative rel. freq. %
1  | 50 < x <= 55  | 52.5 | 0  | 0  | 0  | 0
2  | 55 < x <= 60  | 57.5 | 3  | 6  | 3  | 6
3  | 60 < x <= 65  | 62.5 | 5  | 10 | 8  | 16
4  | 65 < x <= 70  | 67.5 | 5  | 10 | 13 | 26
5  | 70 < x <= 75  | 72.5 | 14 | 28 | 27 | 54
6  | 75 < x <= 80  | 77.5 | 10 | 20 | 37 | 74
7  | 80 < x <= 85  | 82.5 | 5  | 10 | 42 | 84
8  | 85 < x <= 90  | 87.5 | 3  | 6  | 45 | 90
9  | 90 < x <= 95  | 92.5 | 4  | 8  | 49 | 98
10 | 95 < x <= 100 | 97.5 | 1  | 2  | 50 | 100
Frequency Table

 A cumulative frequency distribution is a plot of the


number of observations falling in or below an
interval. The graph shown here is a cumulative
frequency distribution of the scores on a statistics
test.
 A frequency table is constructed by dividing the
scores into intervals and counting the number of
scores in each interval. The actual number of scores
as well as the percentage of scores in each interval
are displayed.
Cumulative frequencies are also usually displayed.
 The X-axis shows various intervals of vehicle speed.

Selecting the Interval Size

 In order to find a starting interval size the first step is to


find the range of the data by subtracting the smallest
score from the largest. In the case of the example data,
the range was 96 - 57 = 39. The range is then divided by
the number of desired intervals, with a suggested starting
number of intervals being ten (10). In the example, the
result would be 39/10 = 3.9. The nearest odd integer value
(here, 5) is used as the starting point for the selection of the
interval size.

Histogram

 A histogram is constructed from a frequency table. The intervals are shown on


the X-axis and the number of scores in each interval is represented by the
height of a rectangle located above the interval. A histogram of the vehicle
speed from the dataset is shown below. The shapes of histograms will vary
depending on the choice of the size of the intervals.
[Figure: histogram of the vehicle speeds (VAR1) with the expected normal curve; y-axis: No of obs, x-axis: Upper Boundaries (x <= boundary), 50 to 100]
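A sketch of how the frequency table behind this histogram can be tallied in Python, using the same (lower, upper] convention as the "x <= boundary" column above:

    speeds = [67, 73, 81, 72, 76, 75, 85, 77, 68, 84,
              76, 93, 73, 79, 88, 73, 60, 93, 71, 59,
              74, 62, 95, 78, 63, 72, 66, 78, 82, 75,
              96, 70, 89, 61, 75, 95, 66, 79, 83, 71,
              76, 65, 71, 75, 65, 80, 73, 57, 88, 78]

    n, cum = len(speeds), 0
    for lower in range(50, 100, 5):
        upper = lower + 5
        freq = sum(1 for v in speeds if lower < v <= upper)   # count in (lower, upper]
        cum += freq
        print(f"{lower} < x <= {upper}: freq={freq:2d}  rel={100*freq/n:3.0f}%  "
              f"cum={cum:2d}  cum%={100*cum/n:3.0f}%")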
There are many different-shaped frequency distributions:

 A frequency polygon is a graphical display of a frequency table.
The intervals are shown on the X-axis and the number of scores
in each interval is represented by the height of a point located
above the middle of the interval. The points are connected so
that together with the X-axis they form a polygon.

 Spread, Dispersion, Variability
 A variable's spread is the degree to which scores on the variable differ from each
other. If every score on the variable were about equal, the variable would have
very little spread. There are many measures of spread. The distributions shown
below have the same mean but differ in spread: The distribution on the bottom is
more spread out. Variability and dispersion are synonyms for spread.

Skew

Further Notes

 When the Mean is greater than the Median the


data distribution is skewed to the Right.

 When the Median is greater than the Mean the


data distribution is skewed to the Left.

 When Mean and Median are very close to each


other the data distribution is approximately
symmetric.
 The Effect of Skew on the Mean and Median

 The distribution shown below has a positive skew. The mean is larger than the
median. For example, if a test was very difficult and almost everyone in the class
did very poorly on it, the resulting distribution would most likely be positively skewed.
 The distribution shown below has a negative skew. The mean is
smaller than the median.

Probability
 Likelihood or chance of occurrence. The
probability of an event is the theoretical
relative frequency of the event in a
model of the population.

Normal Distribution or Normal Curve

 The normal distribution is probably the most important and most
widely used continuous distribution. A random variable with this
distribution is known as a normal random variable.
 The normal distribution is a theoretical function commonly used
in inferential statistics as an approximation to sampling
distributions. In general, the normal distribution provides a
good model for a random variable.

In a normal distribution:

 68% of observations fall within ±1 SD of the mean
 95% of observations fall within ±2 SD (more precisely, ±1.96 SD)
 99.7% of observations fall within ±3 SD

The normal distribution function
 The normal distribution function is determined by the following
formula:

f(x) = (1 / (σ√(2π))) · e^( −(x − μ)² / (2σ²) )

 Where:
 μ: mean
 σ: standard deviation
 e: Euler's number (2.71...)
 π: the constant Pi (3.14...)

Characteristics of the Normal Distribution:

 It is bell shaped and is symmetrical about its mean. It is asymptotic to the


axis, i.e., it extends indefinitely in either direction from the mean.
 They are symmetric with scores more concentrated in the middle than in the
tails.
 It is a family of curves, i.e., every unique pair of mean and standard
deviation defines a different normal distribution. Thus, the normal distribution
is completely described by two parameters: mean and standard
deviation.
 There is a strong tendency for the variable to take a central value. It is
unimodal, i.e., values mound up only in the center of the curve.
 The frequency of deviations falls off rapidly as the deviations become larger.

 Total area under the curve sums to 1, the area of the distribution on
each side of the mean is 0.5.
 The Area Under the Curve Between any Two Scores is a PROBABILITY
 The probability that a random variable will have a value between any
two points is equal to the area under the curve between those points.
Positive and negative deviations from this central value are equally
likely

Examples of normal distributions
 Notice that they differ in how spread out they are. The area under each curve is the
same. The height of a normal distribution can be specified mathematically in terms
of two parameters: the mean (μ) and the standard deviation (σ). The two
parameters, μ and σ, each change the shape of the distribution in a different
manner.

Changes in μ without changes in σ

 Changes in μ, without changes in σ, result in moving the distribution to the
right or left, depending upon whether the new value of μ is larger or smaller
than the previous value, but do not change the shape of the distribution.

Changes in the value of σ
 Changes in the value of σ change the shape of the distribution
without affecting the midpoint, because σ affects the spread or the
dispersion of scores. The larger the value of σ, the more dispersed
the scores; the smaller the value, the less dispersed. The
distribution below demonstrates the effect of increasing the value
of σ:

THE STANDARD NORMAL CURVE

 The standard normal curve is the member of the family of normal curves
with μ = 0.0 and σ = 1.0.

 Note that integral calculus is needed to find areas under the normal
distribution curve. However, this can be avoided by transforming any
normal distribution to fit the standard normal distribution. This
conversion is done by rescaling the normal distribution axis from its true
units (time, weight, dollars, ...) to a standard measure called a Z score
or Z value.

Standard Scores (z Scores)
 A Z score is the number of standard deviations that a value, X,
is away from the mean.
 Standard scores are therefore useful for comparing datapoints
in different distributions.
 If the value of X is greater than the mean, the Z score is
positive; if the value of X is less than the mean, the Z score is
negative. The Z score equation is as follows:

z = (X − μ) / σ

 where z is the z-score for the value of X, μ is the mean, and σ is the
standard deviation.

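A small sketch of the conversion using Python's standard library (statistics.NormalDist, Python 3.8+), applied here to the vehicle-speed summary from earlier in these notes:

    from statistics import NormalDist

    mu, sigma = 75.62, 9.82        # speed mean and SD from the earlier table
    x = 90.0
    z = (x - mu) / sigma           # ~1.46 SDs above the mean
    print(z)
    print(NormalDist().cdf(z))     # P(Z <= z) ~ 0.93, area from -infinity to z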
Table of the Standard Normal (z) Distribution

z   0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
Three areas on a standard normal curve

[Figure: sketches of the areas tabulated below: the total area from -infinity to +infinity, the central area from -Z to +Z, the two-tail area (-infinity to -Z plus +Z to +infinity), and the shifted one-tail area from -infinity to Z - 1.5]

Z | Area -inf to Z | Area Z to +inf | Area -Z to +Z | Two-tail area | Two-tail in PPM | Area -inf to (Z - 1.5)
0.000 | 0.500000 | 0.500000 | 0.000000 | 1.000000 | 1,000,000 | 0.066807
0.100 | 0.539828 | 0.460172 | 0.079656 | 0.920344 | 920,344 | 0.080757
0.200 | 0.579260 | 0.420740 | 0.158519 | 0.841481 | 841,481 | 0.096800
0.300 | 0.617911 | 0.382089 | 0.235823 | 0.764177 | 764,177 | 0.115070
0.400 | 0.655422 | 0.344578 | 0.310843 | 0.689157 | 689,157 | 0.135666
0.500 | 0.691462 | 0.308538 | 0.382925 | 0.617075 | 617,075 | 0.158655
0.600 | 0.725747 | 0.274253 | 0.451494 | 0.548506 | 548,506 | 0.184060
0.700 | 0.758036 | 0.241964 | 0.516073 | 0.483927 | 483,927 | 0.211855
0.800 | 0.788145 | 0.211855 | 0.576289 | 0.423711 | 423,711 | 0.241964
0.900 | 0.815940 | 0.184060 | 0.631880 | 0.368120 | 368,120 | 0.274253
1.000 | 0.841345 | 0.158655 | 0.682689 | 0.317311 | 317,311 | 0.308538
1.100 | 0.864334 | 0.135666 | 0.728668 | 0.271332 | 271,332 | 0.344578
1.200 | 0.884930 | 0.115070 | 0.769861 | 0.230139 | 230,139 | 0.382089
1.300 | 0.903200 | 0.096800 | 0.806399 | 0.193601 | 193,601 | 0.420740
1.400 | 0.919243 | 0.080757 | 0.838487 | 0.161513 | 161,513 | 0.460172
 The area between Z-scores of -1.00 and +1.00. It is .68 or 68%.
 The area between Z-scores of -2.00 and +2.00 and is .95 or 95%.

Exercise 1
 An industrial sewing machine uses ball bearings that are targeted to
have a diameter of 0.75 inch. The specification limits under which the
ball bearing can operate are 0.74 inch (lower) and 0.76 inch
(upper). Past experience has indicated that the actual diameter of the
ball bearings is approximately normally distributed with a mean of
0.753 inch and a standard deviation of 0.004 inch.

 For this problem, note that "Target" = .75, and "Actual mean" = .753.

What is the probability that a ball bearing will be between the target and
the actual mean?

 P(-0.75 < Z < 0) = .2734

 What is the probability that a ball bearing will be between the
lower specification limit and the target?

P(-3.25 < Z < -0.75) = .49942 - .2734 = .22602

 What is the probability that a ball bearing will be above the
upper specification limit?

P(Z > 1.75) = .5 - .4599 = .0401

What is the probability that a ball bearing will be below the lower
specification limit?

P (Z < -3.25) = .5 - .49942 = .00058

Above which value in diameter will 93% of the ball bearings be?
The value asked for here will be the 7th percentile, since 93% of the ball
bearings will have diameters above it. So we look up .4300 in the
Z-table in a "backwards" manner. The closest area to this is .4306, which
corresponds to a Z-value of 1.48, i.e. Z = -1.48.

X - 0.753 = (-1.48)(0.004) = -0.00592, so X = 0.74708

So 0.74708 in. is the value that 93% of the diameters are above.
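All five answers for this exercise can be checked with statistics.NormalDist (a sketch; small differences from the table values come from rounding Z to two decimals):

    from statistics import NormalDist

    d = NormalDist(mu=0.753, sigma=0.004)   # ball-bearing diameters
    print(d.cdf(0.753) - d.cdf(0.750))      # target to mean:  ~0.2734
    print(d.cdf(0.750) - d.cdf(0.740))      # LSL to target:   ~0.2260
    print(1 - d.cdf(0.760))                 # above USL:       ~0.0401
    print(d.cdf(0.740))                     # below LSL:       ~0.00058
    print(d.inv_cdf(0.07))                  # 7th percentile:  ~0.7471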

Exercise 2
 Graduate Management Aptitude Test (GMAT) scores are widely used
by graduate schools of business as an entrance requirement.
Suppose that in one particular year, the mean score for the GMAT
was 476, with a standard deviation of 107. Assuming that the GMAT
scores are normally distributed, answer the following questions:

Question 1

What is the probability that a randomly selected score from this GMAT falls
between 476 and 650, i.e. P(476 <= x <= 650)? The following figure shows a graphic
representation of this problem.

 Answer:
 Z = (650 - 476)/107 = 1.62.
 The Z value of 1.62 indicates that the GMAT score of 650 is 1.62 standard
deviations above the mean. The standard normal table gives the probability of
a value falling between 650 and the mean. The whole number and tenths place
portion of the Z score appear in the first column of the table. Across the top of
the table are the values of the hundredths place portion of the Z score. Thus the
answer is that 0.4474, or 44.74%, of the scores on the GMAT fall between a
score of 650 and 476.
Question 2.
 What is the probability of receiving a score greater than 750 on a GMAT test that
has a mean of 476 and a standard deviation of 107 i.e., P(X >= 750) = ?.
 Answer
 This problem asks for the area of the upper tail of the distribution.
 The Z score is: Z = (750 - 476)/107 = 2.56. From the table, P(0 <= Z <= 2.56) = 0.4948.
 This is the probability of a GMAT score between 476 and 750.
 P(X >= 750) = 0.5 - 0.4948 = 0.0052 or 0.52%.
 Note that P(X >= 750) is the same as P(X > 750) because, in a continuous
distribution, the area under an exact number such as X = 750 is zero.

Question 3
 What is the probability of receiving a score of 540 or less on a GMAT test that
has a mean of 476 and a standard deviation of 107, i.e., P(X <= 540) = ?
 We are asked to determine the area under the curve for all values less than or
equal to 540.
 z = (540 - 476)/107 = 0.60. From the table, P(0 <= Z <= 0.60) = 0.2257, which is
the probability of getting a score between the mean 476 and 540.
 The answer to this problem is: 0.5 + 0.2257 = 0.7257, or about 73%.
 Graphic representation of this problem.

Question 4
 What is the probability of receiving a score between 330 and 440 on a GMAT test that
has a mean of 476 and a standard deviation of 107, i.e., P(330 <= X <= 440) = ?

 The two values fall on the same side of the mean.

 The Z scores are: Z1 = (330 - 476)/107 = -1.36, and Z2 = (440 - 476)/107 = -0.34.
The probability associated with Z = -1.36 is 0.4131;
 the probability associated with Z = -0.34 is 0.1331.
 The answer to this problem is: 0.4131 - 0.1331 = 0.28 or 28%.
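The four GMAT answers can likewise be verified without table lookups (a sketch; the exact values differ slightly because the table rounds Z to two decimals):

    from statistics import NormalDist

    g = NormalDist(mu=476, sigma=107)
    print(g.cdf(650) - g.cdf(476))   # Q1: ~0.448
    print(1 - g.cdf(750))            # Q2: ~0.0052
    print(g.cdf(540))                # Q3: ~0.725
    print(g.cdf(440) - g.cdf(330))   # Q4: ~0.282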

Standard Error (SE)

 Any statistic can have a standard error. Each sampling distribution has a
standard error.

 Standard errors are important because they reflect how much sampling
fluctuation a statistic will show, i.e. how good an estimate of the population the
sample statistic is

 How good an estimate is the mean of a population? One way to determine this
is to repeat the experiment many times and to determine the mean of the
means. However, this is tedious and frequently impossible.

 SE refers to the variability of the sample statistic, a measure of spread


for random variables

 The inferential statistics involved in the construction of confidence intervals


(CI) and significance testing are based on standard errors.

Standard Error of the Mean, SEM, σM
 The standard deviation of the sampling distribution of the mean is called the
standard error of the mean.

 The size of the standard error of the mean is inversely proportional
to the square root of the sample size:

σM = σ / √n

 The standard error of any statistic depends on the sample size - in general,
the larger the sample size the smaller the standard error.
 Note that the spread of the sampling distribution of the mean decreases as
the sample size increases.

 Notice that the mean of the distribution is not affected by sample size.

Comparing the Averages of Two Independent Samples

 Is there "grade inflation" in KTU? How does the average GPA of KTU
students today compare with, say, 10 years ago?
 Suppose a random sample of 100 student records from 10 years ago
yields a sample average GPA of 2.90 with a standard deviation of
.40.
 A random sample of 100 current students today yields a sample
average of 2.98 with a standard deviation of .45.
 The difference between the two sample means is 2.98-2.90 = .08. Is
this proof that GPA's are higher today than 10 years ago?

 First we need to account for the fact that 2.98 and 2.90 are not
the true averages, but are computed from random samples.
Therefore, .08 is not the true difference, but simply an estimate
of the true difference.
 Can this estimate miss by much? Fortunately, statistics has a
way of measuring the expected size of the "miss" (or error of
estimation). For our example, it is .06 (we show how to
calculate this later). Therefore, we can state the bottom line of
the study as follows: "The average GPA of KTU students today
is .08 higher than 10 years ago, give or take .06 or so."

Overview of Confidence Intervals

 Once the population is specified, the next step is to take a random


sample from it. In this example, let's say that a sample of 10
students was drawn and each student's memory tested. The way
to estimate the mean of all high school students would be to
compute the mean of the 10 students in the sample. Indeed, the
sample mean is an unbiased estimate of μ, the population mean.

Clearly, if you already knew the population mean, there would be


no need for a confidence interval.

 We are interested in the mean weight of 10-year old kids living in
Turkey. Since it would have been impractical to weigh all the 10-year
old kids in Turkey, you took a sample of 16 and found that the mean
weight was 90 pounds. This sample mean of 90 is a point estimate of
the population mean.

 A point estimate by itself is of limited usefulness because it does not


reveal the uncertainty associated with the estimate; you do not have a
good sense of how far this sample mean may be from the population
mean. For example, can you be confident that the population mean is
within 5 pounds of 90? You simply do not know.

 Confidence intervals provide more information than point estimates.
 An example of a 95% confidence interval is shown below:
 72.85 < μ < 107.15
 There is good reason to believe that the population mean lies between
these two bounds of 72.85 and 107.15 since 95% of the time confidence
intervals contain the true mean.

 If repeated samples were taken and the 95% confidence interval


computed for each sample, 95% of the intervals would contain the
population mean. Naturally, 5% of the intervals would not contain the
population mean.

 It is natural to interpret a 95% confidence interval as an interval with a
0.95 probability of containing the population mean

 The wider the interval, the more confident you are that it contains the
parameter. The 99% confidence interval is therefore wider than the
95% confidence interval and extends from 4.19 to 7.61.

Example

 Assume that the weights of 10-year old children are normally distributed with a
mean of 90 and a standard deviation of 36. What is the sampling distribution of the
mean for a sample size of 9?
 The sampling distribution of the mean has mean 90 and standard deviation
36/√9 = 12. Note that the standard deviation of a sampling distribution is its standard error.

90 - (1.96)(12) = 66.48
90 + (1.96)(12) = 113.52

 The value of 1.96 is based on the fact that 95% of the area of a normal distribution
is within 1.96 standard deviations of the mean; 12 is the standard error of the mean.

 Figure shows that 95% of the means are no more than 23.52 units
(1.96x12) from the mean of 90.
 Now consider the probability that a sample mean computed in a
random sample is within 23.52 units of the population mean of 90.
Since 95% of the distribution is within 23.52 of 90, the probability that
the mean from any given sample will be within 23.52 of 90 is 0.95.
 This means that if we repeatedly compute the mean (M) from a
sample, and create an interval ranging from M - 23.52 to M +
23.52, this interval will contain the population mean 95% of the
time.

 notice that you need to know the standard deviation (σ) in order
to estimate the mean. This may sound unrealistic, and it is.
However, computing a confidence interval when σ is known is
easier than when σ has to be estimated, and serves a
pedagogical purpose.
 Suppose the following five were sampled from a normal
distribution with a standard deviation of 2.5: 2, 3, 5, 6, and 9. To
compute the 95% confidence interval, start by computing the
mean and standard error:

M = (2 + 3 + 5 + 6 + 9)/5 = 5.
σM = σ/√N = 2.5/√5 = 1.118.

 Z.95: the value is 1.96.

 If you had wanted to compute the 99% confidence interval, you
would have set the shaded area to 0.99 and the result would
have been 2.58.

 The confidence interval can then be computed as follows:


 Lower limit = 5 - (1.96)(1.118)= 2.81
Upper limit = 5 + (1.96)(1.118)= 7.19
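The same interval in Python (a minimal sketch; σ = 2.5 is taken as known, as in the example):

    from statistics import NormalDist

    data = [2, 3, 5, 6, 9]
    sigma = 2.5
    m = sum(data) / len(data)         # 5.0
    se = sigma / len(data) ** 0.5     # 1.118
    z = NormalDist().inv_cdf(0.975)   # 1.96 for a 95% interval
    print(m - z * se, m + z * se)     # ~2.81, ~7.19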

Estimating the Population Mean Using Intervals

 Estimate the average GPA of the population of approximately 23000 KTU


undergraduates.n=25 randomly selected students, sample average= 3.05.
Consider estimating the population average
 Now chances are the true average is not equal to 3.05.
 True KTU average GPA is between 1.00 and 4.00, and with high confidence
between (2.50, 3.50); but what level of confidence do we have that it is
between say, (2.75, 3.25) or (2.95, 3.15)?
 Even better, can we find an interval (a, b) which will contain μ with 95%
certainty?

Example:
 Given the following GPA for 6 students: 2.80, 3.20, 3.75,
3.10, 2.95, 3.40

 Calculate a 95% confidence interval for the population mean


GPA.

Determining Sample Size for Estimating the Mean

 We want to estimate the average GPA of KTU undergraduates this school
year. Historically, the SD of student GPA is known.
 If a random sample of size n=25 yields a sample mean of 3.05, then
the population mean is estimated as lying within the interval 3.05 ± .12

 with 95% confidence. The plus-or-minus quantity .12 is called the margin
of error of the sample mean associated with a 95% confidence level. It is
also correct to say "we are 95% confident that μ is within .12 of the
sample mean 3.05".
Confidence Interval for μ, Standard Deviation Estimated

 It is very rare for a researcher wishing to estimate the mean of a population to


already know its standard deviation. Therefore, the construction of a confidence
interval almost always involves the estimation of both μ and σ.

When σ is known -> M - zσM ≤ μ ≤ M + zσM


is used for a confidence interval.

 When σ is not known:

Whenever the standard deviation is estimated (NOT KNOWN), the t rather than
the normal (z) distribution should be used. The confidence interval for μ when σ
is estimated is:

M - t sM ≤ μ ≤ M + t sM

where M is the sample mean, sM is an estimate of σM (standard error), and t


depends on the degrees of freedom and the level of confidence.
confidence interval on the mean:

 More generally, the formula for the 95% confidence interval on the
mean is:
 Lower limit = M - (t)(sm)
Upper limit = M + (t)(sm)

 where;
 M is the sample mean, t is the t for the confidence level desired (0.95
in the above example), and sm is the estimated standard error of the
mean.
A comparison of the t and normal distribution
A comparison of the t distribution with 4 df
(in blue) and the standard normal
distribution (in red).
Finding t-values

Find the t-value such that the area under


the t distribution to the right of the t-value
is 0.2 assuming 10 degrees of freedom.
That is, find t0.20 with 10 degrees of
freedom.
Upper tail probability p (area under the right side)

Example:

P[t(2) > 2.92] = 0.05

P[-2.92 < t(2) < 2.92] = 0.9

50% 60% 70% 80% 90% 95% 96% 98% 99% 99.5% 99.8% 99.9%

0.25 0.2 0.15 0.1 0.05 0.025 0.02 0.01 0.005 0.0025 0.001 0.0005

df

1 1.000 1.376 1.963 3.078 6.314 12.706 15.895 31.821 63.657 127.32 318.30 636.61

2 0.817 1.061 1.386 1.886 2.920 4.303 4.849 6.965 9.925 14.089 22.327 31.599

3 0.765 0.979 1.250 1.638 2.353 3.182 3.482 4.541 5.841 7.453 10.215 12.924

4 0.741 0.941 1.190 1.533 2.132 2.776 2.999 3.747 4.604 5.598 7.173 8.610

5 0.727 0.920 1.156 1.476 2.015 2.571 2.757 3.365 4.032 4.773 5.893 6.869

6 0.718 0.906 1.134 1.440 1.943 2.447 2.612 3.143 3.707 4.317 5.208 5.959

7 0.711 0.896 1.119 1.415 1.895 2.365 2.517 2.998 3.499 4.029 4.785 5.408

8 0.706 0.889 1.108 1.397 1.860 2.306 2.449 2.896 3.355 3.833 4.501 5.041

9 0.703 0.883 1.100 1.383 1.833 2.262 2.398 2.821 3.250 3.690 4.297 4.781

10 0.700 0.879 1.093 1.372 1.812 2.228 2.359 2.764 3.169 3.581 4.144 4.587

11 0.697 0.876 1.088 1.363 1.796 2.201 2.328 2.718 3.106 3.497 4.025 4.437

12 0.696 0.873 1.083 1.356 1.782 2.179 2.303 2.681 3.055 3.428 3.930 4.318

13 0.694 0.870 1.079 1.350 1.771 2.160 2.282 2.650 3.012 3.372 3.852 4.221

14 0.692 0.868 1.076 1.345 1.761 2.145 2.264 2.624 2.977 3.326 3.787 4.140
Abbreviated t table

df 0.95 0.99
2 4.303 9.925
3 3.182 5.841
4 2.776 4.604
5 2.571 4.032
8 2.306 3.355
10 2.228 3.169
20 2.086 2.845
50 2.009 2.678
100 1.984 2.626
Example
 Assume that the following five numbers are sampled from a
normal distribution: 2, 3, 5, 6, and 9 and that the standard
deviation is not known. The first steps are to compute the
sample mean and variance:

M = 5
s² = 7.5 (sample variance); s = 2.739
Standard error: sM = s/√N = 2.739/√5 = 1.225
 df = N - 1 = 4
 From the t table, the value for the 95% interval with 4 df is 2.776.
 Lower limit = 5 - (2.776)(1.225) = 1.60
Upper limit = 5 + (2.776)(1.225) = 8.40
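A small helper applying M ± t·sM, with t taken from the abbreviated table above (a minimal sketch; the second call answers the earlier GPA exercise, giving roughly (2.84, 3.56)):

    import statistics

    def t_interval(data, t):
        # CI for the mean when sigma is estimated: M +/- t * s / sqrt(n)
        m = statistics.mean(data)
        se = statistics.stdev(data) / len(data) ** 0.5
        return m - t * se, m + t * se

    print(t_interval([2, 3, 5, 6, 9], t=2.776))              # df = 4 -> (~1.60, ~8.40)
    print(t_interval([2.80, 3.20, 3.75, 3.10, 2.95, 3.40],
                     t=2.571))                               # df = 5, GPA exercise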
Example
Suppose a researcher were interested in estimating the mean reading
speed (number of words per minute) of high-school graduates and
computing the 95% confidence interval. A sample of 6 graduates was
taken and the reading speeds were: 200, 240, 300, 410, 450, and
600. For these data, M = 366.6667
sM= 60.9736
df = 6-1 = 5
t = 2.571

lower limit is: M - (t) (sM) = 209.904


upper limit is: M + (t) (sM) = 523.430,
95% confidence interval is: 209.904 ≤ μ ≤ 523.430
 Thus, the researcher can conclude based on the rounded off 95%
confidence interval that the mean reading speed of high-school
graduates is between 210 and 523.
Homework 1

 The mean time difference for all 47 subjects is 16.362 seconds and
the standard deviation is 7.470 seconds. The standard error of the
mean is 1.090.

 A t table shows the critical value of t for 47 - 1 = 46 degrees of


freedom is 2.013 (for a 95% confidence interval). The confidence
interval is computed as follows:

 Lower limit = 16.362 - (2.013)(1.090)= 14.17


Upper limit = 16.362 + (2.013)(1.090)= 18.56

 Therefore, the interference effect (difference) for the whole


population is likely to be between 14.17 and 18.56 seconds.
Homework 2

 The pasteurization process reduces the


amount of bacteria found in dairy products,
such as milk. The following data represent
the counts of bacteria in pasteurized milk
(in CFU/mL) for a random sample of 12
pasteurized glasses of milk.
 Construct a 95% confidence interval for the
bacteria count.

NOTE: Each observation is in tens of thousands. So, 9.06 represents 9.06 × 10^4.
Prediction with Regression Analysis
The relationship(s) between values of the response variable and
corresponding values of the predictor variable(s) is (are) not
deterministic.
Thus the value of y is estimated given the value of x. The estimated
value of the dependent variable is denoted ŷ, and the population
slope and intercept are usually denoted β1 and β0.
Linear Regression
 The idea is to fit a straight line through data points
 Linear Regression indicates that the relationship between the
dependent variable and the independent variable(s) is linear.
 Can extend to multiple dimensions
 correlation analysis is applied to independent factors:
if X increases, what will Y do (increase, decrease, or
perhaps not change at all)?
 In regression analysis a unilateral response is
assumed: changes in X result in changes in Y, but
changes in Y do not result in changes in X.
Regression Plot

m1 = 0.0095937 + 0.880436 vwmkt
S = 0.0590370   R-Sq = 31.3%   R-Sq(adj) = 30.8%

[Figure: scatter plot of m1 (y-axis, -0.3 to 0.4) against vwmkt (x-axis, -0.2 to 0.1) with the fitted regression line]
Linear regression means a regression that is linear in
the parameters
A linear regression can be non-linear in the variables
Example: Y = β0 + β1X2

Some non-linear regression models can be transformed


into a linear regression model
(e.g., Y=aXbZc can be transformed into
lnY = ln a + b*ln X + c*ln Z)
Example
 Given one variable X (years of experience), predict Y (salary, $1,000)
 Questions:
 When X=10, what is Y?
 When X=25, what is Y?
 This is known as regression

X (years) | Y (salary, $1,000)
3  | 30
8  | 57
9  | 64
13 | 72
3  | 36
6  | 43
11 | 59
21 | 90
1  | 20
For the example data, the least-squares estimates are b0 = 23.2 and b1 = 3.5, so the fitted line is

ŷ = 23.2 + 3.5x

For x = 10 years, the prediction of y (salary) is 23.2 + 3.5(10) = 23.2 + 35 = 58.2 K dollars/year.
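The fitted coefficients can be reproduced from the formulas b1 = Σ(xi − x̄)(yi − ȳ)/Σ(xi − x̄)² and b0 = ȳ − b1·x̄ (a sketch; the slides work with the rounded pair 23.2 and 3.5):

    x = [3, 8, 9, 13, 3, 6, 11, 21, 1]         # years of experience
    y = [30, 57, 64, 72, 36, 43, 59, 90, 20]   # salary, $1,000

    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    b1 = sxy / sxx             # ~3.46, rounded to 3.5 in the slides
    b0 = ybar - b1 * xbar      # ~23.5, rounded to 23.2 in the slides
    print(b0 + b1 * 10)        # predicted salary at x = 10 years, ~58 ($1,000)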
Linear Regression Example

[Figure: scatter plot of Salary (y-axis, $1,000, 0 to 120) against Years (x-axis, 0 to 25) with the fitted line Y = 3.5X + 23.2]
The least-squares estimates of slope and intercept are:

b1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²

b0 = ȳ − b1 x̄
Regression Error
 We can also write a regression equation slightly differently:

y = a + bx + e

Also called the residual, e is the difference between our estimate of the value of
the dependent variable (ŷ) and the actual value of the dependent variable y.

 Unless we have perfect prediction, many of the y values will fall off of the line.
The added e in the equation refers to this fact. It would be incorrect to write
the equation without the e, because it would suggest that the y scores are
completely accounted for by just knowing the slope, x values, and the
intercept. Almost always, that is not true. There is some error in prediction, so
we need to add an e for error variation into the equation.
 The actual values of y can be accounted for by the regression line equation
(y=a+bx) plus some degree of error in our prediction (the e's).
r correlation coefficient
 The correlation between X and Y is expressed by the
correlation coefficient r:

r = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² · Σ(yi − ȳ)² )

 xi = data X, x̄ = mean of data X
yi = data Y, ȳ = mean of data Y

 −1 ≤ r ≤ 1
 r = 1: perfect positive linear correlation
between two variables
 r = 0: no linear correlation (maybe other
correlation)
r = −1: perfect negative linear correlation
 Notice that for the perfect correlation,
there is a perfect line of points. They do
not deviate from that line.
least squares
 The principle is to establish a statistical linear relationship
between two sets of corresponding data by fitting the data to a
straight line by means of the "least squares" technique.
 The resulting line takes the general form:
 y = bx + a

 a = intercept of the line with the y-axis


b = slope (tangent)

 a = 0, b = 1  perfect positive correlation without bias

 a ≠ 0  systematic discrepancy (bias, error) between X and Y;
 b ≠ 1  proportional response or difference between X and Y.
Example
 Each point represents one student with a certain grade on the
exam, x, and time on the exam, y. The scatter plot reveals
that, in general, longer times on the exam tend to be
associated with higher grades.
r ≈ 0.64

ID | Grade on Exam (x) | Time on Exam (y) | X−X̄ | Y−Ȳ | (X−X̄)(Y−Ȳ) | (X−X̄)²
1  | 88 | 60 | 8.6   | 18.55  | 159.53 | 73.96
2  | 96 | 53 | 16.6  | 11.55  | 191.73 | 275.56
3  | 72 | 22 | -7.4  | -19.45 | 143.93 | 54.76
4  | 78 | 44 | -1.4  | 2.55   | -3.57  | 1.96
5  | 65 | 34 | -14.4 | -7.45  | 107.28 | 207.36
6  | 80 | 47 | 0.6   | 5.55   | 3.33   | 0.36
7  | 77 | 38 | -2.4  | -3.45  | 8.28   | 5.76
8  | 83 | 50 | 3.6   | 8.55   | 30.78  | 12.96
9  | 79 | 51 | -0.4  | 9.55   | -3.82  | 0.16
10 | 68 | 35 | -11.4 | -6.45  | 73.53  | 129.96
11 | 84 | 46 | 4.6   | 4.55   | 20.93  | 21.16
12 | 76 | 36 | -3.4  | -5.45  | 18.53  | 11.56
13 | 92 | 48 | 12.6  | 6.55   | 82.53  | 158.76
r correlation
 The Pearson r can be positive or negative, ranging from -1.0 to
1.0.
 If the correlation is 1.0, the longer the amount of time spent on
the exam, the higher the grade will be--without any exceptions.
 An r value of -1.0 indicates a perfect negative correlation--
without an exception, the longer one spends on the exam, the
poorer the grade.
 If r=0, there is absolutely no relationship between the two
variables. When r=0, on average, longer time spent on the
exam does not result in any higher or lower grade. Most often r
is somewhere in between -1.0 and +1.0.
ID Grade on Exam (x) x2 Time on Exam (y) y2 xy

1 88 7744 60 3600 5280


2 96 9216 53 2809 5088
3 72 5184 22 484 1584
4 78 6084 44 1936 3432
5 65 4225 34 1156 2210
6 80 6400 47 2209 3760
7 77 5929 38 1444 2926
8 83 6889 50 2500 4150
9 79 6241 51 2601 4029
10 68 4624 35 1225 2380
11 84 7056 46 2116 3864
12 76 5776 36 1296 2736
13 92 8464 48 2304 4416
14 80 6400 43 1849 3440
15 67 4489 40 1600 2680
16 78 6084 32 1024 2496
17 74 5476 27 729 1998
18 73 5329 41 1681 2993
19 88 7744 39 1521 3432
20 90 8100 43 1849 3870

Σ | 1588 | 127454 | 829 | 35933 | 66764

ID | Grade on Exam (x) | Time on Exam (y) | X−X̄ | Y−Ȳ | (X−X̄)(Y−Ȳ) | (X−X̄)² | (Y−Ȳ)²
1  | 88 | 60 | 8.6   | 18.55  | 159.53 | 73.96  | 344.1025
2  | 96 | 53 | 16.6  | 11.55  | 191.73 | 275.56 | 133.4025
3  | 72 | 22 | -7.4  | -19.45 | 143.93 | 54.76  | 378.3025
4  | 78 | 44 | -1.4  | 2.55   | -3.57  | 1.96   | 6.5025
5  | 65 | 34 | -14.4 | -7.45  | 107.28 | 207.36 | 55.5025
6  | 80 | 47 | 0.6   | 5.55   | 3.33   | 0.36   | 30.8025
7  | 77 | 38 | -2.4  | -3.45  | 8.28   | 5.76   | 11.9025
8  | 83 | 50 | 3.6   | 8.55   | 30.78  | 12.96  | 73.1025
9  | 79 | 51 | -0.4  | 9.55   | -3.82  | 0.16   | 91.2025
10 | 68 | 35 | -11.4 | -6.45  | 73.53  | 129.96 | 41.6025
11 | 84 | 46 | 4.6   | 4.55   | 20.93  | 21.16  | 20.7025
12 | 76 | 36 | -3.4  | -5.45  | 18.53  | 11.56  | 29.7025
13 | 92 | 48 | 12.6  | 6.55   | 82.53  | 158.76 | 42.9025
14 | 80 | 43 | 0.6   | 1.55   | 0.93   | 0.36   | 2.4025
15 | 67 | 40 | -12.4 | -1.45  | 17.98  | 153.76 | 2.1025
16 | 78 | 32 | -1.4  | -9.45  | 13.23  | 1.96   | 89.3025
17 | 74 | 27 | -5.4  | -14.45 | 78.03  | 29.16  | 208.8025
18 | 73 | 41 | -6.4  | -0.45  | 2.88   | 40.96  | 0.2025
19 | 88 | 39 | 8.6   | -2.45  | -21.07 | 73.96  | 6.0025
20 | 90 | 43 | 10.6  | 1.55   | 16.43  | 112.36 | 2.4025
Total | 1588 | 829 | | | 941.4 | 1366.8 | 1570.95
Average | 79.4 | 41.45

r = 941.4 / √(1366.8 × 1570.95) = 0.6424
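The r = 0.6424 result can be reproduced directly from the two data columns (a minimal sketch of the formula above):

    def pearson_r(x, y):
        n = len(x)
        xbar, ybar = sum(x) / n, sum(y) / n
        sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
        sxx = sum((a - xbar) ** 2 for a in x)
        syy = sum((b - ybar) ** 2 for b in y)
        return sxy / (sxx * syy) ** 0.5

    grade = [88, 96, 72, 78, 65, 80, 77, 83, 79, 68,
             84, 76, 92, 80, 67, 78, 74, 73, 88, 90]
    time  = [60, 53, 22, 44, 34, 47, 38, 50, 51, 35,
             46, 36, 48, 43, 40, 32, 27, 41, 39, 43]
    print(pearson_r(grade, time))   # ~0.6424, matching the table totals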
r² square of the correlation coefficient

 r² is the proportion of the sum of squares
explained in one-variable regression;
 R² is the proportion of the sum of squares
explained in multiple regression.
Is an R-Square < 1.00 Good or bad?

 This is both a statistical and a philosophical question;


It is quite rare, especially in the social sciences, to get an R-
square that is really high (e.g., 98%).
 The goal is NOT to get the highest R-square per se. Instead,
the goal is to develop a model that is both statistically and
theoretically sound, creating the best fit with existing data.
 Do you want just the best fit, or a model that
theoretically/conceptually makes sense?
Yes, you might get a good fit with nonsensical explanatory
variables. But, this opens you to spurious/intervening
relationships. THEREFORE: hard to use model for explanation.
Why might an R-Square be less than
1.00?

 underdetermined model (need more variables)


 nonlinear relationships
 measurement error
 sampling error
 not fully predictable/explainable even with all data
available; there is a certain amount of unexplainable
chaos/static/randomness in the universe (which may
be reassuring)
 the unit of analysis is too aggregated (e.g., you are
predicting mean housing values for a city -- you
might get better results with predicting individual
housing prices, or neighborhood housing prices).
Adjusted R2 (R-square)

 What is an "Adjusted" R-Square?


The Adjusted R-Square takes into account not only how much of
the variation is explained, but also the impact of the degrees of
freedom. It "adjusts" for the number of variables use. That is,
look at the adjusted R- Square to see how adding another
variable to the model both increases the explained variance but
also lowers the degrees of freedom.
Adjusted R2 = 1- (1 - R2 )((n - 1)/(n - k - 1)). As the number
of variables in the model increases, the gap between the R-
square and the adjusted R-square will increase. This serves as
a disincentive to simply throwing in a huge number of variables
into the model to increase the R-square.
 This adjusted value for R-square will be equal or smaller than
the regular R-square. The adjusted R-square adjusts for a bias
in R-square. R-square tends to over estimate the variance
accounted for compared to an estimate that would be obtained
from the population. There are two reasons for the
overestimate, a large number of predictors and a small sample
size.
 So, with a small sample and with few predictors, adjusted R-
square should be very similar to the R-square value.
Researchers and statisticians differ on whether to use the
adjusted R-square. It is probably a good idea to look at it to see
how much your R-square might be inflated, especially with a
small sample and many predictors.
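The adjustment formula is easy to explore directly; the sketch below shows how the gap grows as predictors are added for a fixed sample size (the numbers are illustrative):

    def adjusted_r2(r2, n, k):
        # Adjusted R-square = 1 - (1 - R^2)(n - 1)/(n - k - 1)
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print(adjusted_r2(0.861, n=30, k=2))    # ~0.851, little shrinkage
    print(adjusted_r2(0.861, n=30, k=10))   # ~0.788, heavier penalty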
Example
Suppose we have collected the following
sample of 6 observations on age and
income:

Find the estimated regression line for


the sample of six observations we
have collected on age and income:
Which is the independent variable and
which is the dependent variable for
this problem?
Cautions About Simple Linear Regression

 Correlation and regression describe only linear relations


 Correlation and least-squares regression line are not resistant to
outliers
 Predictions outside the range of observed data are often
inaccurate
 Correlation and regression are powerful tools for describing
relationship between two variables, but be aware of their
limitations
Multiple Prediction

 Regression analysis allows us to use more than one independent


variable to predict values of y. Take the fat intake and blood
cholesterol level study as an example. If we want to predict
cholesterol as accurately as possible, we need to know more
about diet than just how much fat intake there is.
 On the island of Crete, they consume a lot of olive oil, so their
fat intake is high. This, however, seems to have no dramatic
effect on cholesterol (at least the bad cholesterol, the LDLs).
They also consume very little cholesterol in their diet, which
consists more of fish than high cholesterol foods like cheese and
beef (hopefully this won't be considered libelous in Texas). So,
to improve our prediction of blood cholesterol levels, it would be
helpful to add another predictor, dietary cholesterol.
From Bivariate to Multiple
regression: what changes?
 potentially more explanatory power with more
variables.
 the ability to control for other variables: and one sees
the interaction of the various explanatory variables.
partial correlations and multicollinearity.
 harder to visualize drawing a line through three+ n-
dimensional space.
 the R² is no longer simply the square of the bivariate
correlation statistic r.

 From Two to Three Dimensions
With simple regression (one predictor) we had only the x-axis
and the y-axis. Now we need an axis for x1, x2, and y.
 The prediction equation becomes Y' = A + b1X1 + b2X2 + …, where Y' is the
predicted score, X1 is the score on the first predictor variable, X2 is the score
on the second, etc. The Y intercept is A. The regression coefficients (b1, b2,
etc.) are analogous to the slope in simple regression.
 If we want to predict these points, we now need a regression
plane rather than just a regression line. That looks something
like this:
More than one prediction attribute

 X1, X2
 For example,
 X1 = 'years of experience'
 X2 = 'age'
 Y = 'salary'

Y = β0 + β1x1 + β2x2 + ε

[Figure: response surface E(yi) plotted over the (xi1, xi2) plane, with yi the observed value above the point (xi1, xi2)]
 The parameters β0, β1, β2,… , βk are called partial
regression coefficients.
 β1 represents the change in y corresponding to a
unit increase in x1, holding all the other predictors
constant.
 A similar interpretation can be made for β2, β3,
……, βk
Regression Statistics
Multiple R          0.995
R Square            0.990
Adjusted R Square   0.989
Standard Error      0.008
Observations        30

ANOVA
            df | SS    | MS    | F       | Significance F
Regression   4 | 0.164 | 0.041 | 628.372 | 0.000
Residual    25 | 0.002 | 0.000
Total       29 | 0.165

                                           | Coefficients | Standard Error | t Stat  | P-value
Intercept                                  | 0.500        | 0.008          | 60.294  | 0.000
Percent of Gross Hhd Income Spent on rent  | -0.399       | 0.016          | -24.610 | 0.000
percent 2-parent families                  | -0.288       | 0.015          | -19.422 | 0.000
Police Anti-Drug Program?                  | -0.004       | 0.004          | -1.238  | 0.227
Active Tenants Group? (1 = yes; 0 = no)    | -0.102       | 0.004          | -28.827 | 0.000

Controlling also for this new variable, the police anti-drug program is no
longer statistically significant, and instead the presence of the active
tenants group makes the dramatic difference (and look at that great R
square!). However, we are not quite done…
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.928
R Square            0.861
Adjusted R Square   0.850
Standard Error      0.030
Observations        30

ANOVA
            df | SS    | MS    | F      | Significance F
Regression   2 | 0.149 | 0.074 | 83.484 | 0.000
Residual    27 | 0.024 | 0.001
Total       29 | 0.173

                                         | Coefficients | Standard Error | t Stat  | P-value | BETA
Intercept                                | 0.36582      | 0.017          | 20.908  | 0.000   |
percent 2-parent families                | -0.2565      | 0.051          | -5.017  | 0.000   | -0.362
Active Tenants Group? (1 = yes; 0 = no)  | -0.1246      | 0.011          | -11.347 | 0.000   | -0.821

Since the police variable now has a statistically insignificant t-score, we remove it
from the model. (We also remove the income variable, since it also becomes
insignificant after we remove the police variable.) We are left with two independent
variables: percent of 2-parent families and active tenants group.
Stepwise Regression Algorithms

• Backward Elimination
• Forward Selection
• Stepwise Selection
Backward Elimination
1. Fit the model containing all (remaining)
predictors.
2. Test each predictor variable, one at a
time, for a significant relationship with y.
3. Identify the variable with the largest p-value.
If p > α, remove this variable from
the model, and return to (1.).
4. Otherwise, stop and use the existing
model.
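A minimal sketch of this loop using the statsmodels package (assumed available; X is a pandas DataFrame of predictors and y the response):

    import statsmodels.api as sm

    def backward_eliminate(X, y, alpha=0.05):
        X = X.copy()
        while True:
            model = sm.OLS(y, sm.add_constant(X)).fit()   # 1. fit remaining predictors
            pvals = model.pvalues.drop("const")           # 2. p-value per predictor
            worst = pvals.idxmax()                        # 3. largest p-value
            if pvals[worst] <= alpha:                     # 4. all significant: stop
                return model
            X = X.drop(columns=worst)                     #    otherwise drop and refit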
Forward Selection
1. Fit all models with one (more) predictor.
2. Test each of these predictor variables,
for a significant relationship with y.
3. Identify the variable with the smallest p-value.
If p < α, add this variable to the
model, and return to (1.).
4. Otherwise, stop and use the existing
model.
Stepwise Selection
• The Stepwise Selection method is
basically Forward Selection with Backward
Elimination added in at every step.
Stepwise Selection
1. Fit all models with one (more) predictor.
2. Test each of these predictor variables, for a
significant relationship with y.
3. Identify the variable with the smallest p-value.
If p < α, add this variable to the model, and
return to (1.).
4. Now, for the model being considered, test
each predictor variable, one at a time, for a
significant relationship with y.
5. Identify the variable with the largest p-value. If
p > α, remove this variable from the model,
and return to (1.).
6. Otherwise, stop and use the existing model.
Linear regression
Review
Multiple Regression Models
Chapter Topics
 The Multiple Regression Model
 Contribution of Individual Independent Variables
 Coefficient of Determination
 Categorical Explanatory Variables
 Transformation of Variables
 Violations of Assumptions
 Qualitative Dependent Variables
Multiple Regression Models
[Diagram: taxonomy of multiple regression models]
 Linear: linear, dummy variable, interaction
 Non-linear: polynomial, square root, log, reciprocal, exponential
Linear Multiple Regression Model
Additional Assumption for Multiple Regression
 No exact linear relation exists between any subset of the explanatory variables (i.e., no perfect "multicollinearity")
The Multiple Regression Model
 The relationship between 1 dependent and 2 or more independent variables is a linear function.

Population model (β0 = population Y-intercept; β1, …, βp = population slopes; εi = random error):
Yi = β0 + β1X1i + β2X2i + … + βpXpi + εi

Sample model (dependent/response variable and independent/explanatory variables for the sample):
Yi = b0 + b1X1i + b2X2i + … + bpXpi + ei
Population Multiple Regression Model
Bivariate model:
Yi = β0 + β1X1i + β2X2i + εi
[Figure: response plane μY|X = β0 + β1X1i + β2X2i over the (X1, X2) plane; the observed Yi at (X1i, X2i) deviates from the plane by the error εi.]
Sample Multiple Regression Model
Bivariate model:
Yi = b0 + b1X1i + b2X2i + ei
[Figure: fitted response plane Ŷi = b0 + b1X1i + b2X2i; the observed Yi at (X1i, X2i) deviates from the plane by the residual ei.]
Parameter Estimation
Linear Multiple Regression Model
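The slide does not show the estimation formula; for reference, the least-squares estimates solve the normal equations b = (X′X)⁻¹X′y, which a short numpy sketch can make concrete:

import numpy as np

def ols_coefficients(X, y):
    # Prepend an intercept column, then solve the normal equations (X'X)b = X'y
    X = np.column_stack([np.ones(len(X)), X])
    return np.linalg.solve(X.T @ X, X.T @ y)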
Multiple Regression Model: Example
Develop a model for estimating heating oil used for a single-family home in the month of January, based on average temperature and amount of insulation in inches.

Oil (Gal)   Temp (°F)   Insulation (in)
275.30      40           3
363.80      27           3
164.30      40          10
 40.80      73           6
 94.30      64           6
230.90      34           6
366.70       9           6
300.60       8          10
237.80      23          10
121.40      63           3
 31.40      65          10
203.50      41           6
441.10      21           3
323.00      38           3
 52.50      58          10
Interpretation of Estimated Coefficients
 Slope (bp)
  Estimated Y changes by bp for each 1-unit increase in Xp, holding all other variables constant (ceteris paribus)
  Example: If b1 = -2, then fuel oil usage (Y) is expected to decrease by 2 gallons for each 1-degree increase in temperature (X1), given the inches of insulation (X2)
 Y-Intercept (b0)
  Average value of Y when all Xp = 0
Sample Regression Model: Example

              Coefficients
Intercept      562.1510092
X Variable 1    -5.436580588
X Variable 2   -20.01232067

Ŷi = 562.151 - 5.437 X1i - 20.012 X2i

For each degree increase in temperature, the average amount of heating oil used is decreased by 5.437 gallons, holding insulation constant. For each increase of one inch of insulation, the use of heating oil is decreased by 20.012 gallons, holding temperature constant.
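A minimal numpy sketch that reproduces these estimates from the data table above:

import numpy as np

oil   = np.array([275.3, 363.8, 164.3, 40.8, 94.3, 230.9, 366.7,
                  300.6, 237.8, 121.4, 31.4, 203.5, 441.1, 323.0, 52.5])
temp  = np.array([40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58])
insul = np.array([3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10])

X = np.column_stack([np.ones(len(oil)), temp, insul])
b, *_ = np.linalg.lstsq(X, oil, rcond=None)
print(b)  # approx. [562.151, -5.437, -20.012]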
Evaluating the Model
Evaluating Multiple Regression Model: Steps
 Examine variation measures
 Test parameter significance
  Overall model
  Portions of model
  Individual coefficients
Variation Measures
Coefficient of Multiple Determination
 r²(Y.12…p) = Explained variation / Total variation = SSR / SST
 r² = 0 means the variables taken together do not explain any variation in Y
Adjusted Coefficient of Multiple Determination
 NOT the proportion of variation in Y 'explained' by all X variables taken together
 Reflects
  Sample size
  Number of independent variables
 Smaller than r²(Y.12…p)
 Sometimes used to compare models
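The standard adjustment (not spelled out on the slide) is r²adj = 1 - (1 - r²)(n - 1)/(n - p - 1), which a one-line Python helper can check against the first regression output above:

def adjusted_r2(r2, n, p):
    # n = sample size, p = number of independent variables
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

adjusted_r2(0.990, 30, 4)  # ~0.989, matching the earlier output (up to rounding)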
Simple and Multiple Regression Compared: Example
 Two simple regressions:
  ABSENCES = β0 + β1 AUTONOMY
  ABSENCES = β0 + β2 SKILLVARIETY
 Multiple regression:
  ABSENCES = β0 + β1 AUTONOMY + β2 SKILLVARIETY
Overlap in Explanation

SIMPLE REGRESSION: AUTONOMY
Multiple R          0.169171
R Square            0.028619
Adjusted R Square   0.027709
Standard Error      12.443
Observations        1069

ANOVA
             df     SS         MS         F         Significance F
Regression    1     4867.198   4867.198   31.43612  2.62392E-08
Residual   1067   165201.7      154.8282
Total      1068   170068.9

SIMPLE REGRESSION: SKILL VARIETY
Multiple R          0.193838
R Square            0.037573
Adjusted R Square   0.036671
Standard Error      12.38552
Observations        1069

ANOVA
             df     SS         MS         F        Significance F
Regression    1     6390.011   6390.011   41.6556  1.64882E-10
Residual   1067   163678.9      153.401
Total      1068   170068.9

MULTIPLE REGRESSION
Multiple R          0.231298
R Square            0.053499
Adjusted R Square   0.051723
Standard Error      12.28837
Observations        1069

ANOVA
             df     SS         MS         F
Regression    2     9098.483   4549.242   30.1266
Residual   1066   160970.4      151.0041
Total      1068   170068.9

0.06619206   SUM OF SIMPLE R²
0.05349881   MULTIPLE R²
0.01269325   OVERLAP ATTRIBUTED TO BOTH

11257.2098   SUM OF REGRESSION SUMS OF SQUARES (two simple models)
 9098.4831   REGRESSION SUM OF SQUARES (multiple model)
 2158.7267   OVERLAP

Because the two predictors are correlated, their simple R² values sum to more than the multiple R²; the difference (0.0127) is variation whose explanation is shared by both predictors.
Testing Parameters
Test for Overall Significance: Example Solution
 H0: β1 = β2 = … = βp = 0
 H1: At least one βi ≠ 0
 α = .05
 df = 2 and 12
 Critical value: F = 3.89

Test statistic: F = 168.47
Decision: Reject H0 at α = 0.05
Conclusion: There is evidence that at least one independent variable affects Y.
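A sketch of the same decision in scipy (the MSR and MSE values are taken from the heating oil ANOVA shown later in these notes):

from scipy import stats

msr, mse = 114007.313, 676.717              # from the heating oil ANOVA table
F = msr / mse                               # about 168.47
F_crit = stats.f.ppf(0.95, dfn=2, dfd=12)   # about 3.89
p_value = stats.f.sf(F, 2, 12)              # far below 0.05, so reject H0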
Test for Significance: Individual Variables
• Shows if there is a linear relationship between the variable Xi and Y
• Use the t test statistic
• Hypotheses:
  H0: βi = 0 (No linear relationship)
  H1: βi ≠ 0 (Linear relationship between Xi and Y)
t Test Statistic
Excel Output: Example

              Coefficients   Standard Error   t Stat
Intercept      562.1510092   21.0931043       26.65094
X Variable 1    -5.4365806    0.336216167    -16.1699   <- t test statistic for X1 (temperature)
X Variable 2   -20.012321     2.342505227     -8.54313  <- t test statistic for X2 (insulation)

t = b / S(b)
t Test: Example Solution
Does temperature have a significant effect on monthly consumption of heating oil? Test at α = 0.05.
 H0: β1 = 0
 H1: β1 ≠ 0
 df = 12; critical values ±2.1788 (rejection region of .025 in each tail)

Test statistic: t = -16.1699
Decision: Reject H0 at α = 0.05
Conclusion: There is evidence of a significant effect of temperature on oil consumption.
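The same test in a short scipy sketch, using the coefficient and standard error from the Excel output above:

from scipy import stats

b1, se_b1 = -5.4365806, 0.336216167
t = b1 / se_b1                           # about -16.17
t_crit = stats.t.ppf(0.975, df=12)       # about 2.1788
p_value = 2 * stats.t.sf(abs(t), df=12)  # essentially 0, so reject H0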
Example: Analysis of Job Earnings
What is the impact of employer tenure (ERTEN), unemployment (UNEM), and education (EDU) on job earnings (JEARN)?
[Slides: the correlation matrix, ANOVA table, and regression results for this example are not reproduced here.]
Testing Model Portions
 Examines the contribution of a set of X variables to the relationship with Y
 Null hypothesis: the variables in the set do not significantly improve the model when all other variables are included
 Alternative hypothesis: at least one variable is significant

Testing Model Portions
 Only a one-tail test
 Requires comparison of two regressions:
  One regression includes everything
  One regression includes everything except the portion to be tested
Testing Model Portions: Test Statistic
Test H0: β1 = β2 = 0 in a 3-variable model:

F = [ (SSR(X1, X2, X3) - SSR(X3)) / k ] / MSE(X1, X2, X3)

The terms come from the ANOVA sections of the regressions for the full model Ŷi = b0 + b1X1i + b2X2i + b3X3i and the reduced model Ŷi = b0 + b3X3i.

Testing Portions of Model: SSR
Contribution of X1 and X2 given X3 has been included:

SSR(X1 and X2 | X3) = SSR(X1, X2 and X3) - SSR(X3)

where the two SSR values come from the ANOVA sections of the regressions for Ŷi = b0 + b1X1i + b2X2i + b3X3i and Ŷi = b0 + b3X3i, respectively.
Partial F Test for the Contribution of a Set of X Variables
 Hypotheses:
  H0: the variables Xi… do not significantly improve the model given all others included
  H1: the variables Xi… significantly improve the model given all others included
 Test statistic, with df = k and (n - p - 1):
  F = [ SSR(Xi… | all others) / k ] / MSE, where k = number of variables tested
Testing Portions of Model: Example
Test at the α = .05 level to determine if the variable of average temperature significantly improves the model, given that insulation is included.

H0: X1 does not improve the model (X2 included)
H1: X1 does improve the model
α = .05, df = 1 and 12; critical value = 4.75

ANOVA (for X1 and X2)                       ANOVA (for X2 only)
             SS            MS                             SS
Regression   228014.6263   114007.313      Regression     51076.47
Residual       8120.603016    676.716918   Residual      185058.8
Total        236135.2293                   Total         236135.2

F = SSR(X1 | X2) / MSE = (228,015 - 51,076) / 676.717 = 261.47

Conclusion: Reject H0. X1 does improve the model.
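A scipy sketch of this partial F test, plugging in the values from the two ANOVA tables above:

from scipy import stats

ssr_full, ssr_reduced = 228014.6263, 51076.47   # SSR with and without X1
mse_full, k = 676.716918, 1
F = (ssr_full - ssr_reduced) / k / mse_full     # about 261.47
F_crit = stats.f.ppf(0.95, dfn=1, dfd=12)       # about 4.75, so reject H0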
Do I need to do this for one variable?
• The F test for the inclusion of a single variable after all other variables are included in the model is IDENTICAL to the t test of the slope for that variable: F = t². (Here, 261.47 = (-16.1699)².)
• The only reason to do an F test is to test several variables together.
Example: Collinear Variables
20,000 execs in 439 corps; dependent variable = base pay + bonus

                        Individual Simple    Multiple Regression
                        Regression R²        Contribution to R²
Company Dummies          .33                  .08
Occupational Dummies     .52                  .022
Position in hierarchy    .69                  .104
Human Capital Vars       .28                  .032
Shared                                        .632
TOTAL                                         .87
Backup Slides
Multiple Regression
 The value of the outcome variable depends on several explanatory variables.
 F-test: to judge whether the explanatory variables in the model adequately describe the outcome variable.
 t-test: applies to each individual explanatory variable. A significant t indicates whether the explanatory variable has an effect on the outcome variable while controlling for other X's.
 t-ratio: to judge the relative importance of the explanatory variable.
Problem of Multicollinearity
 When explanatory variables are correlated, there is difficulty in interpreting the effect of the explanatory variables on the outcome.
Check by:
 Correlation coefficient matrix (see next slide).
 F-test significant with insignificant t.
 Large changes in the regression coefficients when variables are added or deleted.
 Variance inflation: a VIF > 4 or 5 means there is multicollinearity.
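A sketch of the VIF check using statsmodels (the helper name and DataFrame layout are my own):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    # X: DataFrame of explanatory variables; values above ~4-5 flag multicollinearity
    Xc = sm.add_constant(X)
    return pd.Series([variance_inflation_factor(Xc.values, i)
                      for i in range(1, Xc.shape[1])], index=X.columns)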
Example of a Matrix Plot
 This matrix plot comprises several scatter plots to provide visual information as to whether variables are correlated.
 The arrow points at a scatter plot where two explanatory variables are correlated.
[Figure: matrix plot not reproduced here.]
Selecting the Most Economic Model
The purpose is to find the smallest number of explanatory variables which make the maximum contribution to the outcome.
 After excluding variables that may be causing multicollinearity, examine the table of t-ratios in the full model. Those variables with a significant t are included in the sub-set.
 In the Analysis of Variance table, examine the column headed SEQ SS. Check that the candidate variables are indeed making a sizable contribution to the Regression Sum of Squares.
Stepwise Regression Analysis
 Stepwise finds the explanatory variable with the highest R² to start with. It then checks each of the remaining variables until the two variables with the highest R² are found. It then repeats the process until the three variables with the highest R² are found, and so on.
 The overall R² gets larger as more variables are added.
 Stepwise may be useful in the early exploratory stage of data analysis, but it is not to be relied upon for the confirmatory stage.
Is the Model Adequate?
Judged by the following:
 R² value. An increase in R² on adding another variable gives a useful hint.
 Adjusted R² is a more sensitive measure.
 Smallest value of s (standard deviation).
 C-p statistic. The model with the smallest Cp is used, such that the Cp value is closest to the number of parameters (Mallows' Cp).
Confidence Interval Estimate for the Slope
Provide the 95% confidence interval for the population slope β1 (the effect of temperature on oil consumption).

b1 ± t(n-p-1) · S(b1)

              Coefficients   Lower 95%      Upper 95%
Intercept      562.151009    516.1930837    608.108935
X Variable 1    -5.4365806    -6.169132673   -4.7040285
X Variable 2   -20.012321    -25.11620102   -14.90844

-6.169 ≤ β1 ≤ -4.704
The average consumption of oil is reduced by between 4.70 and 6.17 gallons for each 1°F increase in temperature, in houses with the same insulation.
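A short scipy sketch reproducing this interval from the earlier coefficient table:

from scipy import stats

b1, se_b1, n, p = -5.4365806, 0.336216167, 15, 2   # heating oil model
t_crit = stats.t.ppf(0.975, df=n - p - 1)          # df = 12, about 2.1788
lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1  # about (-6.169, -4.704)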