L01
ANALYSIS
Chapter 2
• Begin with Review
– Mean, Variance, etc. (one variable)
– Covariance, Correlation (two variables)
• Simple Linear Regression
– Notation
– Conceptualization with Example
– Population vs. Sample
– Definitions
– Least Squares Estimate
– Assumptions
– Types of Errors
– Partitioning the Sum of Squares
– Hypothesis Testing
Goal of Empirical Research
• Draw inferences about some population(s)
of interest based on observations of just a
subset or sample from the whole population.
• Then, generalize from sample to population.
[Figure: a Random Sample is drawn from the population; Sample Statistics Describe the sample and are used to Make Inferences about the population]
Defining the Problem
Cereal Example
[Figure: box of Rise n Shine cereal labeled 15 ounces]
Defining the Problem
The purpose of the study is to determine
whether the cereal boxes contain 15 ounces of
cereal.
[Figure: a Sample of Rise n Shine cereal boxes]
Assumption for this Course
– The sample drawn is representative of the
population.
• In other words, the sample characteristics should
reflect the characteristics of the population as a
whole.
Parameters and Statistics
Statistics are used to approximate population
parameters.
                      Population      Sample
                      Parameters      Statistics
Mean                  $\mu$           $\bar{X}$
Variance              $\sigma^2$      $s^2$
Standard Deviation    $\sigma$        $s$
Distributions
When you examine the distribution of values
for the variable, you can find out
– the range of possible data values
– the frequency of data values
– whether the data values accumulate in the middle
of the distribution or at one end.
“Typical Values” in a Distribution
– Mean: the sum of all the values in the data set
divided by the number of values
$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$$
Sample Variance
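$$s_x^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i-\bar{X})^2 = \frac{\sum (X-\bar{X})^2}{N-1} \tag{2.1}$$
$$SXX = \sum_{i=1}^{N}(X_i-\bar{X})^2 = \sum_{i=1}^{N}X_i^2 - N\bar{X}^2$$
$$(N-1)\,s_x^2 = \sum_{i=1}^{N}X_i^2 - \frac{\left(\sum_{i=1}^{N}X_i\right)^2}{N} \tag{2.2}$$
$$s_x^2 = \frac{1}{N-1}\,SXX$$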
Standard Deviation (SD)
$$s_x = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(X_i-\bar{X})^2} = \sqrt{\frac{1}{N-1}\,SXX}$$
$$SD_X = s_x$$
$$SXX = \sum_{i=1}^{N}(X_i-\bar{X})^2$$
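The mean, SXX, sample variance, and standard deviation formulas can be checked numerically. The sketch below is in Python rather than SAS, purely to verify the arithmetic. The data values are hypothetical and chosen only to keep the sums easy to follow.

```python
import math
import statistics

# Hypothetical toy data (not from the lecture), chosen for easy arithmetic.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)

mean_x = sum(x) / n                                     # sample mean, X-bar

# SXX two ways: definitional form and the computational form in (2.2).
sxx_def = sum((xi - mean_x) ** 2 for xi in x)
sxx_comp = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n

var_x = sxx_def / (n - 1)                               # sample variance (2.1)
sd_x = math.sqrt(var_x)                                 # sample standard deviation

# Cross-check against the standard library's N-1 versions.
print(mean_x, sxx_def, sxx_comp)
print(var_x, statistics.variance(x))
print(sd_x, statistics.stdev(x))
```

Both forms of SXX agree, which is the point of (2.2): the computational form needs only the sum and the sum of squares.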
Point Estimates
– $\bar{X}$ estimates $\mu$
– $s$ estimates $\sigma$
Percentiles
Quartiles break your data up into quarters.
Sorted data values: 98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42
– 75th Percentile (third quartile) = 91
– 50th Percentile = 80
– 25th Percentile (first quartile) = 59
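The three percentiles on this slide can be reproduced with a short Python sketch. It assumes an averaging rule: when N times p/100 is a whole number j, average the j-th and (j+1)-th sorted values; otherwise take the next value up. That rule is an assumption chosen because it reproduces the slide's numbers. It appears to correspond to the default percentile definition in SAS (PCTLDEF=5), but check the SAS documentation before relying on that. The function name `percentile` is hypothetical.

```python
def percentile(values, p):
    """Percentile p (an integer from 0 to 100) using an averaging rule."""
    data = sorted(values)
    n = len(data)
    j, remainder = divmod(n * p, 100)   # n*p/100 = j + fractional part
    if remainder == 0:
        if j == 0:
            return float(data[0])
        if j == n:
            return float(data[-1])
        # Rank falls exactly between two observations: average them.
        return (data[j - 1] + data[j]) / 2
    # Otherwise take the next observation up (the (j+1)-th, 1-based).
    return float(data[j])

scores = [98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42]
q1, q2, q3 = (percentile(scores, p) for p in (25, 50, 75))
print(q1, q2, q3)
```

Other percentile definitions interpolate differently and can give slightly different quartiles for the same data.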
The MEANS Procedure
General form of the MEANS procedure:
PROC MEANS DATA=SAS-data-set <options>;
   VAR variables;
RUN;
Picturing Distributions:
Histogram
Each bar in the histogram represents a group of values (a bin).
[Figure: histogram with PERCENT on the vertical axis and Bins along the horizontal axis]
The Normal Distribution
Characteristics of the Bell Curve
– Peak
– Flanks
– Tails
[Figure: bell curve with the horizontal axis running from -4 to 4]
General form of the UNIVARIATE procedure:
PROC UNIVARIATE DATA=SAS-data-set <options>;
   VAR variables;
   ID variable;
   HISTOGRAM variables </ options>;
   PROBPLOT variables </ options>;
RUN;
Graphical Displays of
Distributions
• You can produce three kinds of plots for
examining the distribution of your data
values:
– histograms
– box plots
– normal probability plots.
Box-and-Whisker Plots
The upper whisker extends to the largest point within 1.5 I.Q. (interquartile ranges) of the box.
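As a rough illustration of the 1.5 I.Q. rule, the Python sketch below computes the whisker ends for the twelve values from the Percentiles slide, taking the quartiles quoted there (59 and 91) as given. Applying the same rule symmetrically to the lower whisker is an assumption; the slide text only mentions the largest point.

```python
scores = [98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42]
q1, q3 = 59, 91                      # quartiles quoted on the Percentiles slide

iqr = q3 - q1                        # interquartile range (I.Q.)
upper_fence = q3 + 1.5 * iqr         # farthest an upper whisker may reach
lower_fence = q1 - 1.5 * iqr         # assumed symmetric rule for the lower side

# Whiskers stop at the most extreme data points inside the fences.
upper_whisker = max(v for v in scores if v <= upper_fence)
lower_whisker = min(v for v in scores if v >= lower_fence)

# Anything beyond the fences would be drawn as an individual point.
outside = [v for v in scores if v < lower_fence or v > upper_fence]
print(iqr, lower_whisker, upper_whisker, outside)
```

For these values no point lies beyond the fences, so the whiskers simply reach the minimum and maximum.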
The BOXPLOT Procedure
General form of the BOXPLOT procedure:
PROC BOXPLOT DATA=SAS-data-set;
   PLOT analysis-variable*group-variable </options>;
RUN;
Objectives
– Examine the relationship between two
continuous variables using a scatter plot.
– Quantify the degree of linearity between
two continuous variables using correlation
statistics.
– Understand potential misuses of the correlation
coefficient.
– Obtain Pearson correlation coefficients using
the CORR procedure.
Scatter Plots
[Figure: scatter plot with X on the horizontal axis]
Overview
[Figure: Correlation relates one Continuous Variable to another Continuous Variable]
[Figure: is Height (in.) related to Weight (lb.)?]
Relationships between Continuous Variables
[Figure: four scatter plot panels, numbered 1 to 4]
Correlation
[Figure: three scatter plots shown above a Correlation Coefficient scale running from -1 through 0 to 1]
– Plot of Weight by Height
– Plot of Errors by Study Time
– Plot of SAT-V by Toe Size
Misuses of the Correlation
Coefficient
[Figure: inches (in.) and pounds (lb.) joined by the label "causes"]
SAT Example
Average SAT Score versus Percent Taking Test
[Figure: scatter plot of SCORE (axis ticks 800 to 1100) against PCTAKING (axis ticks 0 to 80)]
Missing Another Type of
Relationship
Curvilinear Relationship
General form of the CORR procedure:
PROC CORR DATA=SAS-data-set <options>;
   VAR variables;
   WITH variables;
RUN;
Sample Covariance
$$s_{xy} = \frac{\sum (X_i-\bar{X})(Y_i-\bar{Y})}{N-1} \tag{2.3}$$
• Linear relationship or association
– Degree that X and Y co-vary from their respective means
• Direction of linear association
– Sign (+ or -) for direction
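$$SXY = \sum (X_i-\bar{X})(Y_i-\bar{Y}) = (N-1)\,s_{xy}$$
$$SXY = \sum_{i=1}^{N}X_iY_i - \frac{\left(\sum_{i=1}^{N}X_i\right)\left(\sum_{i=1}^{N}Y_i\right)}{N} \tag{2.4}$$
$$s_{xy} = \frac{1}{N-1}\,SXY$$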
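The two ways of computing SXY, from deviations and from raw sums as in (2.4), can be compared with a small Python sketch. The paired values below are hypothetical toy data, not taken from any data set in this course.

```python
# Hypothetical paired observations (toy data for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Definitional form: sum of cross-products of deviations from the means.
sxy_def = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Computational form (2.4): uses only raw sums.
sxy_comp = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

s_xy = sxy_def / (n - 1)             # sample covariance (2.3)
print(sxy_def, sxy_comp, s_xy)
```

The positive sign of the covariance indicates that X and Y tend to lie on the same side of their means together.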
Pearson product-moment
correlation coefficient
$$r_{xy} = \frac{s_{xy}}{s_x\,s_y} = \frac{s_{xy}}{SD_X\,SD_Y} = \frac{SXY}{\sqrt{(SXX)(SYY)}}$$
$$-1 \le r_{xy} \le 1$$
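The equivalent expressions for the Pearson coefficient can be checked against each other in Python. This is a minimal sketch on hypothetical toy data; the variable names are illustrative only.

```python
import math

# Hypothetical paired observations (toy data for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Route 1: covariance divided by the product of the standard deviations.
s_xy = sxy / (n - 1)
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
r_from_sds = s_xy / (s_x * s_y)

# Route 2: sums of squares and cross-products; the (N-1) factors cancel.
r_from_sums = sxy / math.sqrt(sxx * syy)

print(r_from_sds, r_from_sums)
```

Because the (N-1) terms cancel, r does not depend on the units of X or Y, only on how closely the points follow a line.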
Pearson’s Correlation and Sample Size
Here are the limits within which 80% of sample r’s will fall
when the true correlation (i.e., in the population) is zero:
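One way to get a feel for such limits is a small simulation, sketched here in Python. It assumes X and Y are independent standard normal variables (so the population correlation is zero) and takes the 10th and 90th percentiles of the simulated r values as the 80% limits. The function names, the number of replications, and the seed are all arbitrary choices for illustration, and the simulated limits are approximations rather than the values from the original slide.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation computed as SXY / sqrt(SXX * SYY)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def central_80_limits(n, replications=1000, seed=0):
    """Approximate limits holding the middle 80% of sample r's when rho = 0."""
    rng = random.Random(seed)
    rs = sorted(
        pearson_r([rng.gauss(0, 1) for _ in range(n)],
                  [rng.gauss(0, 1) for _ in range(n)])
        for _ in range(replications)
    )
    lower = rs[replications // 10]            # about the 10th percentile
    upper = rs[replications * 9 // 10 - 1]    # about the 90th percentile
    return lower, upper

for n in (10, 30, 100):
    low, high = central_80_limits(n)
    print(n, round(low, 2), round(high, 2))
```

The limits narrow as the sample size grows, so a sample r of a given size is much weaker evidence of a real relationship when N is small.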
Simple Linear Regression Analysis
• The objectives of simple linear regression
are to
– assess the significance of the predictor variable
in explaining the variability or behavior of the
response variable
– predict the values of the response variable
given the values of the predictor variable.
Fitness Example
• In exercise physiology, an objective measure of
aerobic fitness is how fast the body can
absorb and use oxygen (oxygen
consumption).
• Subjects participated in a predetermined
exercise run of 1.5 miles.
• Measurements of oxygen consumption as
well as several other continuous
measurements such as age, pulse, and weight
were recorded.
• The researchers are interested in determining
whether any of these other variables can help
predict oxygen consumption.
Variables in sasuser.b_fitness
• Name                  name of the member
• Gender                gender of the member
• Runtime               time to run 1.5 miles (in minutes)
• Age                   age of the member (in years)
• Weight                weight of the member (in kilograms)
• Oxygen_Consumption    a measure of the ability to use oxygen in the blood stream
• Run_Pulse             pulse rate at the end of the run
• Rest_Pulse            resting pulse rate
• Maximum_Pulse         maximum pulse rate during the run
• Performance           a measure of overall fitness
Fitness Example
PREDICTOR: Performance
RESPONSE: Oxygen_Consumption
$$Y = \beta_0 + \beta_1 X$$
– $Y$: outcome
– $X$: predictor
– $\beta_0$, $\beta_1$: “population parameters” or “regression coefficients” to be estimated
Simple Linear Regression Model
[Figure: Response (Y) plotted against Predictor (X), with the slope marked as a change in Y "units" per "1 unit" change in X]
Simple Linear Regression
• Used to test association between two variables
• Accounts for (predicts) the variance in an
interval dependent variable based on an interval,
dichotomous, or dummy independent variable.
• By estimating a straight line through the
corresponding X-Y data points we can estimate
the magnitude of the relationship between X and
Y.
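Estimating that straight line can be sketched in Python. The formulas used, slope b1 = SXY / SXX and intercept b0 = mean of Y minus b1 times mean of X, are the standard least-squares estimates; they are not stated on this slide and are brought in here as an assumption about what "estimating a straight line" refers to. The data are hypothetical toy values.

```python
# Hypothetical paired observations (toy data for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b1 = sxy / sxx                  # estimated slope: change in Y per 1 unit of X
b0 = mean_y - b1 * mean_x       # estimated intercept

# Fitted values and residuals for the estimated line.
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
print(b0, b1, sum(residuals))
```

The slope is the estimated magnitude of the relationship: the expected change in Y for a one-unit change in X. The residuals of a least-squares line with an intercept sum to zero, up to rounding.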
Next Lecture
• Next Lecture: SAS. Bring your laptop if
possible.
• Reading assignments:
– the MEANS, UNIVARIATE, etc. procedures (Base
SAS Guide)
– Chapters 1 and 2