Chapter 2 PDF Lecture Notes

Chapter 2

Exploring Bivariate Distributions and Estimating Relations
Scatter Plots: Graphical Analysis of Association between Measurements
Correlation: Estimating the Strength of a Linear Relation
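• The correlation coefficient $r$ is a measure of the strength of the linear association between two variables.
• The correlation between $n$ pairs of observations $(x_i, y_i)$ is given by
• $r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}$, where
• $SS_{xy} = \sum_{i=1}^{n} x_i y_i - \dfrac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$
• $SS_{xx} = \sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$, $SS_{yy} = \sum_{i=1}^{n} y_i^2 - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$
• $n$ = number of pairs of observations (sample size)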
• A value of r near or equal to zero implies little or no linear relationship between y and x.
• The closer r is to 1 or -1, the stronger the linear relationship between y and x.
• If r = 1 or r = -1, all the points fall exactly on a straight line.
• Positive values of r imply that y increases as x increases; negative values of r imply that y decreases as x increases.
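These properties can be checked numerically. The sketch below is only an illustration: the helper name `pearson_r` and the three small data sets are assumptions made up for this example, not part of the notes. It implements the computational formula r = SSxy / sqrt(SSxx · SSyy) and applies it to an exact increasing line, an exact decreasing line, and a symmetric curve.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient from the computational formulas for SSxy, SSxx, SSyy."""
    n = len(x)
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    ss_yy = sum(b * b for b in y) - sum(y) ** 2 / n
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Hypothetical x values, chosen symmetric about zero for easy arithmetic.
x = [-2, -1, 0, 1, 2]

r_up = pearson_r(x, [2 * v + 1 for v in x])    # points exactly on an increasing line
r_down = pearson_r(x, [5 - 3 * v for v in x])  # points exactly on a decreasing line
r_curve = pearson_r(x, [v * v for v in x])     # y = x^2: related to x, but not linearly

print(r_up, r_down, r_curve)
```

The third case is worth noting: y is completely determined by x, yet r is zero, because r measures only the linear part of an association.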
Example: Power Load and Temperature
Day   Maximum Temperature x (°F)   Peak Power Load y (megawatts)
1     95                           214
2     82                           152
3     90                           156
4     81                           129
5     99                           254
6     100                          266
7     93                           210
8     95                           204
9     93                           213
10    87                           150
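Example: Power Load and Temperature (continued)
• $\sum_{i=1}^{n} x_i = 915$, $\sum_{i=1}^{n} y_i = 1948$, $\sum_{i=1}^{n} x_i y_i = 180798$
• $\sum_{i=1}^{n} x_i^2 = 84103$, $\sum_{i=1}^{n} y_i^2 = 398734$
• $SS_{xy} = \sum_{i=1}^{n} x_i y_i - \dfrac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 180798 - \dfrac{(915)(1948)}{10} = 2556$
• $SS_{xx} = \sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 84103 - \dfrac{915^2}{10} = 380.5$
• $SS_{yy} = \sum_{i=1}^{n} y_i^2 - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 398734 - \dfrac{1948^2}{10} = 19263.6$
• $r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \dfrac{2556}{\sqrt{(380.5)(19263.6)}} = 0.944$, hence a strong positive linear relationship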
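The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the variable names (`temps`, `loads`, `ss_xy`, and so on) are illustrative choices, not notation from the notes.

```python
import math

# Maximum temperature (°F) and peak power load (megawatts) for the 10 days.
temps = [95, 82, 90, 81, 99, 100, 93, 95, 93, 87]
loads = [214, 152, 156, 129, 254, 266, 210, 204, 213, 150]
n = len(temps)

# Raw sums that appear in the computational formulas.
sum_x, sum_y = sum(temps), sum(loads)
sum_xy = sum(x * y for x, y in zip(temps, loads))
sum_x2 = sum(x * x for x in temps)
sum_y2 = sum(y * y for y in loads)

# Sums of squares and cross products, then the correlation coefficient.
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n
r = ss_xy / math.sqrt(ss_xx * ss_yy)

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(ss_xy, 1), round(ss_xx, 1), round(ss_yy, 1), round(r, 3))
```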
Regression: Modeling Linear Relationships
• In general, a linear relation between two variables $x$ and $y$ is given by $y = \beta_0 + \beta_1 x$, where
• $\beta_0$ is the y-intercept
• $\beta_1$ is the slope. It gives the amount of change in $y$ for a unit change in the value of $x$

• Fitting such a line to a set of data involves estimating the slope and intercept to produce a line that is denoted by $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
Fitting the Model: The Least-Squares Approach
• Sum of squared errors: $SSE = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$
• Select the line that minimizes the SSE.
• Such a line is known as the least-squares regression line.
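To make the selection criterion concrete, the sketch below scores a few candidate lines by their SSE and picks the smallest. Everything in it is an assumption for illustration: the four data points, the helper `sse`, and the three candidate lines are made up and do not come from the notes.

```python
# Toy data, made up for illustration (not from the notes).
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

def sse(b0, b1):
    """Sum of squared errors of the candidate line y = b0 + b1 * x on the toy data."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical candidate lines, stored as label -> (intercept, slope).
candidates = {
    "y = 2x": (0.0, 2.0),
    "y = 1 + 1.5x": (1.0, 1.5),
    "y = 1.9x": (0.0, 1.9),
}

scores = {label: sse(b0, b1) for label, (b0, b1) in candidates.items()}
best = min(scores, key=scores.get)

for label, value in scores.items():
    print(f"{label}: SSE = {value:.2f}")
print("smallest SSE:", best)
```

Among these three candidates, y = 1.9x has the smallest SSE (0.70, against 1.00 and 1.50). For these four points it is in fact the least-squares line, so no other straight line would do better.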


Graphical Illustration
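• The least-squares regression line is given by:
• $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, where
• $\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_{xx}}$
• with $SS_{xy} = \sum_{i=1}^{n} x_i y_i - \dfrac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$ and $SS_{xx} = \sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$
• $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$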
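The notes state these formulas without derivation. For reference, the following is a brief sketch of the standard calculus argument (an addition, not part of the notes): write SSE as a function of a candidate intercept and slope, set both partial derivatives to zero, and solve.

```latex
\begin{align*}
SSE(\beta_0, \beta_1) &= \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2 \\[4pt]
\frac{\partial\, SSE}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right) = 0
  &\;\Longrightarrow\; \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\[4pt]
\frac{\partial\, SSE}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i \left(y_i - \beta_0 - \beta_1 x_i\right) = 0
  &\;\Longrightarrow\; \sum_{i=1}^{n} x_i y_i - \hat{\beta}_0\, n\bar{x} - \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = 0 \\[4pt]
\text{Substituting } \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}:\quad
\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} &= \hat{\beta}_1 \left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) \\[4pt]
SS_{xy} = \hat{\beta}_1\, SS_{xx}
  &\;\Longrightarrow\; \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}
\end{align*}
```

The last step uses $n\bar{x}\bar{y} = \left(\sum x_i\right)\left(\sum y_i\right)/n$ and $n\bar{x}^2 = \left(\sum x_i\right)^2/n$, which is how the computational forms of $SS_{xy}$ and $SS_{xx}$ arise.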
Example: Power Load and Temperature
• $\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_{xx}} = \dfrac{2556}{380.5} = 6.7175$
• $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 194.8 - 6.7175(91.5) = -419.85$
• The least-squares regression line is given by:
• $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = -419.85 + 6.7175x$
• where $\hat{y}$ is the predicted peak power load (in megawatts) and $x$ is the maximum daily temperature (in °F).
• For every one-degree (°F) increase in the maximum temperature, the peak power load increases on average by 6.7175 megawatts.
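The fitted coefficients can be checked with a short script. This is a sketch under the notes' own formulas; the function name `fit_line` is a hypothetical helper introduced here, not something defined in the notes.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line via SSxy / SSxx."""
    n = len(xs)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    ss_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    slope = ss_xy / ss_xx
    intercept = sum(ys) / n - slope * sum(xs) / n  # y-bar minus slope times x-bar
    return intercept, slope

# Maximum temperature (°F) and peak power load (megawatts) for the 10 days.
temps = [95, 82, 90, 81, 99, 100, 93, 95, 93, 87]
loads = [214, 152, 156, 129, 254, 266, 210, 204, 213, 150]

b0, b1 = fit_line(temps, loads)
print(f"y_hat = {b0:.2f} + {b1:.4f} x")  # y_hat = -419.85 + 6.7175 x
```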
Using the Model for Prediction
• Consider the peak power load example. Predict the required peak power load if tomorrow's maximum temperature is expected to be 98 °F.
• The least-squares regression line is given by:
• $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = -419.85 + 6.7175x$
• At $x = 98$: $\hat{y} = -419.85 + 6.7175(98) = 238.465$ megawatts
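As a small usage sketch, the prediction step amounts to evaluating the fitted line at a new temperature. The function name `predict_load` is hypothetical, and the coefficients are the rounded values reported in the notes, so the third decimal place can differ slightly from a calculation that carries full precision.

```python
# Rounded coefficients as reported in the notes.
B0, B1 = -419.85, 6.7175

def predict_load(temp_f):
    """Predicted peak power load (megawatts) for a maximum temperature in °F."""
    return B0 + B1 * temp_f

y_hat = predict_load(98)
print(round(y_hat, 3))
```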
Coefficient of Determination

• The square of the coefficient of correlation $r$ is called the coefficient of determination, $r^2$.
• Note that $r^2$ is always between 0 and 1.
• It represents the proportion of the sum of squares of deviations of the $y$ values about their mean that can be attributed to a linear relation between $y$ and $x$.
• Calculate and interpret the coefficient of determination for the peak power load example.
• $r^2 = (0.944)^2 = 0.89$
• The sample variability of the peak loads about their mean is reduced by 89% when the mean peak load is modeled as a linear function of daily high temperature.
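The "reduced by 89%" interpretation can be verified directly: the SSE of the fitted line, divided by SSyy, is the fraction of variability left unexplained, and one minus that fraction equals r². The sketch below checks this on the power load data; the variable names are illustrative assumptions.

```python
# Maximum temperature (°F) and peak power load (megawatts) for the 10 days.
temps = [95, 82, 90, 81, 99, 100, 93, 95, 93, 87]
loads = [214, 152, 156, 129, 254, 266, 210, 204, 213, 150]
n = len(temps)

ss_xy = sum(x * y for x, y in zip(temps, loads)) - sum(temps) * sum(loads) / n
ss_xx = sum(x * x for x in temps) - sum(temps) ** 2 / n
ss_yy = sum(y * y for y in loads) - sum(loads) ** 2 / n

# Least-squares line, then the squared errors that remain around it.
b1 = ss_xy / ss_xx
b0 = sum(loads) / n - b1 * sum(temps) / n
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(temps, loads))

r_squared = ss_xy ** 2 / (ss_xx * ss_yy)  # square of the correlation coefficient
explained = 1 - sse / ss_yy               # proportional reduction in variability

print(round(r_squared, 2), round(explained, 2))
```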
• In a study of pollution in a water stream, the concentration of pollution is measured at 5 different locations. The locations are at different distances from the pollution source. The table below gives these distances and the average pollution concentration:

Distance from the pollution source (in km)   2      4      6      8      10
Average concentration                        11.5   10.2   10.3   9.68   9.32
• A) Compute the least-squares regression line.
• B) Compute the correlation coefficient and the coefficient of determination.
• C) Predict the pollution concentration 7 km from the pollution source.

• A) The least-squares regression line is given by:
• $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 11.664 - 0.244x$

• B) $r = -0.931$
• $r^2 = 0.868$ (an estimate of the proportion of the variation in concentration that can be explained by distance)

• C) $\hat{y} = 11.664 - 0.244x = 11.664 - 0.244(7) = 9.96$
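The answers to this exercise can be reproduced as follows. This is a sketch that applies the chapter's formulas to the table; the function name `simple_regression` and the variable names are assumptions made for illustration.

```python
import math

def simple_regression(xs, ys):
    """Return (intercept, slope, r) using the SSxy, SSxx, SSyy formulas."""
    n = len(xs)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    ss_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    slope = ss_xy / ss_xx
    intercept = sum(ys) / n - slope * sum(xs) / n
    r = ss_xy / math.sqrt(ss_xx * ss_yy)
    return intercept, slope, r

distance_km = [2, 4, 6, 8, 10]
concentration = [11.5, 10.2, 10.3, 9.68, 9.32]

b0, b1, r = simple_regression(distance_km, concentration)
pred_7km = b0 + b1 * 7  # part C: predicted concentration 7 km from the source

print(round(b0, 3), round(b1, 3))     # part A
print(round(r, 3), round(r ** 2, 3))  # part B
print(round(pred_7km, 2))             # part C
```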
