
Statistical Analysis (SM 901B)

Unit 2 - Regression

Goonjan Jain
Department of Applied Mathematics
DTU
Introduction
• Many engineering and scientific problems are concerned with determining a relationship between a set of variables.
• For instance, in a chemical process, we might be interested in the relationship between the output of the
process, the temperature at which it occurs, and the amount of catalyst employed.
• Knowledge of such a relationship would enable us to predict the output for various values of temperature
and amount of catalyst.
• Houses in the same part of the country that have the same square footage of living space will not all be sold
for the same price.
Variables
• There is a single response variable Y, also called the dependent variable, which depends on the values of a
set of input variables x1, . . . , xr, also called independent variables. The simplest type of relationship between the
dependent variable Y and the input variables x1, . . . , xr is a linear relationship. That is, for some constants β0,
β1, . . . , βr the equation
$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r$$
• would hold. If this were the relationship between Y and the xi, i = 1, . . . , r, then it would be possible (once the
βi were learned) to predict the response exactly for any set of input values.
• In practice, however, such precision is almost never attainable, and the most that one can expect is that the
linear equation would be valid subject to random error. By this we mean that the explicit relationship is
$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r + e$$
• where e, representing the random error, is assumed to be a random variable having mean 0.
Regression Equation
• Another way of expressing this relationship is as follows:
$$E[Y \mid x] = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r$$
• where x = (x1, . . . , xr ) is the set of independent variables, and E[Y |x] is the expected response given the
inputs x.
• This equation is called a linear regression equation.
• It describes the regression of Y on the set of independent variables x1, . . . , xr.
• β0, β1, . . . , βr are the regression coefficients, which are estimated from a set of data.
• A regression equation containing a single independent variable (r = 1) is called a simple regression equation.
• A regression equation containing many independent variables is called a multiple regression equation.
Simple Linear Regression Model
• A simple linear regression model supposes a linear relationship between the mean response and the value of
a single independent variable. It can be expressed as
$$Y = \alpha + \beta x + e$$
• x is the value of the independent variable, also called the input level,
• Y is the response, and
• e, representing the random error, is a random variable having mean 0.
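To make the role of the error term concrete, the following is a minimal sketch that simulates responses from a simple linear regression model. The parameter values (alpha = 1, beta = 2, sigma = 0.5) and the choice of normally distributed errors are illustrative assumptions; the model itself only requires that the error have mean 0.

```python
import random

def simulate_response(x, alpha, beta, sigma, rng):
    """Draw one response Y = alpha + beta*x + e.

    The error e is drawn from Normal(0, sigma^2). Normality is an
    assumption made for this sketch; the model only needs E[e] = 0.
    """
    e = rng.gauss(0.0, sigma)  # random error with mean 0
    return alpha + beta * x + e

rng = random.Random(0)               # fixed seed for reproducibility
alpha, beta, sigma = 1.0, 2.0, 0.5   # hypothetical parameter values
x = 3.0                              # a fixed input level

draws = [simulate_response(x, alpha, beta, sigma, rng) for _ in range(20000)]
mean_response = sum(draws) / len(draws)

# Individual responses scatter around the line, but their average
# is close to the mean response alpha + beta*x = 7.
print(mean_response)
```

Each individual draw differs from 7, yet the average of many draws at the same input level settles near alpha + beta*x, which is what "a linear relationship between the mean response and x" means.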
Example 1
• Consider the following 10 data pairs (xi , yi), i = 1, . . . , 10, relating y, the percent yield of a laboratory
experiment, to x, the temperature at which the experiment was run.

• A plot of yi versus xi is called a scatter diagram.


• Because the scatter diagram appears to reflect, subject to random error, a linear relation between y and x, a
simple linear regression model seems appropriate.
Example 2
• According to the International Data Base of the U.S. Census Bureau, the world population keeps growing. How can
we use these data to predict the world population in the years 2015 and 2020?
• The figure shows that the population (response) is tightly related to the year (predictor),
population ≈ G(year).
• It increases every year, and its growth is almost linear. If we estimate the regression function G relating our
response and our predictor and extend its graph to the year 2020, the forecast is ready. We can simply
compute G(2015) and G(2020).
Solution 2
• A straight line that fits the observed data for the years 1950–2010 predicts a population of 7.15 billion in 2015
and 7.52 billion in 2020.
• It also shows that between 2010 and 2015, around the year 2012, the world population reaches the
historical mark of 7 billion.

• How accurate is the forecast obtained in this example? The observed population during 1950–2010 appears
rather close to the estimated regression line. It is reasonable to hope that it will continue to do so through
2020.
Example 3
• Seventy house sale prices in a certain county are depicted along with the house area.
• First, we see a clear relation between these two variables, and in general, bigger houses are more expensive.
However, the trend no longer seems linear.
• Second, there is a large amount of variability around this trend. Indeed, area is not the only factor
determining the house price. Houses with the same area may still be priced differently.
• How, then, can we estimate the price of a 3200-square-foot house? We can estimate the general trend (the
dotted line) and plug 3200 into the resulting formula, but because of the obviously high variability, our estimate
will not be as accurate as in Example 2.
Least Squares Estimators of the Regression Equation
• Suppose that the responses Yi corresponding to the input values xi, i = 1, . . . , n, are to be observed and used to
estimate α and β in a simple linear regression model.
• To determine estimators of α and β, we reason as follows.
• If A is the estimator of α and B of β, then the estimator of the response corresponding to the input variable xi
would be A + Bxi.
• Since the actual response is Yi, the squared difference is (Yi − A − Bxi)^2, and so if A and B are the estimators
of α and β, then the sum of the squared differences between the estimated responses and the actual
response values (call it SS) is given by
$$SS = \sum_{i=1}^{n} (Y_i - A - B x_i)^2$$
• The method of least squares chooses as estimators of α and β the values of A and B that minimize SS.
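The quantity SS can be computed directly for any candidate pair (A, B). The sketch below uses a small made-up data set (the xs and ys values are hypothetical, chosen only for illustration) and compares SS at the least squares solution for that data, A = 0.5 and B = 1.4, with SS at a few other candidate lines.

```python
def sum_of_squares(a, b, xs, ys):
    """SS = sum over i of (y_i - a - b*x_i)^2 for the candidate line a + b*x."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data, not taken from the examples in these notes.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

ss_best = sum_of_squares(0.5, 1.4, xs, ys)   # approximately 0.2
ss_other = sum_of_squares(0.0, 1.5, xs, ys)  # 0.5, a worse fit

print(ss_best, ss_other)

# Nudging A or B away from (0.5, 1.4) in any direction increases SS.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert sum_of_squares(0.5 + da, 1.4 + db, xs, ys) > ss_best
```

The line with the smaller SS passes closer to the data in the squared-error sense, which is the criterion the method of least squares minimizes.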
Least Square Method
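• To determine these estimators, we differentiate SS first with respect to A and then with respect to B as follows:
$$\frac{\partial SS}{\partial A} = -2 \sum_{i=1}^{n} (Y_i - A - B x_i)$$
$$\frac{\partial SS}{\partial B} = -2 \sum_{i=1}^{n} x_i (Y_i - A - B x_i)$$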
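• Setting these partial derivatives equal to zero yields the following equations for the minimizing values A and
B:
$$\sum_{i=1}^{n} Y_i = nA + B \sum_{i=1}^{n} x_i$$
$$\sum_{i=1}^{n} x_i Y_i = A \sum_{i=1}^{n} x_i + B \sum_{i=1}^{n} x_i^2$$
• These equations are known as the normal equations.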


Least Square Method
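• If we let
$$\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
• then we can write the first normal equation as
$$A = \bar{Y} - B\bar{x}$$
• Substituting this value of A into the second normal equation yields
$$\sum_{i=1}^{n} x_i Y_i = (\bar{Y} - B\bar{x})\, n\bar{x} + B \sum_{i=1}^{n} x_i^2, \quad \text{or} \quad B = \frac{\sum_{i=1}^{n} x_i Y_i - n\bar{x}\bar{Y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$$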

Least Square Method - Proposition
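• The least squares estimators of β and α corresponding to the data set xi, Yi, i = 1, . . . , n are, respectively,
$$B = \frac{\sum_{i=1}^{n} x_i Y_i - n\bar{x}\bar{Y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}, \qquad A = \bar{Y} - B\bar{x}$$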
• The straight line A + Bx is called the estimated regression line.
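The estimators can be computed in a few lines. The following is a minimal sketch, assuming the usual closed form B = (Σ xi Yi − n x̄ Ȳ) / (Σ xi² − n x̄²) and A = Ȳ − B x̄; the function name and the toy data are hypothetical and chosen so that the points lie exactly on the line y = 1 + 2x.

```python
def least_squares_line(xs, ys):
    """Return (A, B), the least squares intercept and slope for the data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope estimator B.
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    if s_xx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
a, b = least_squares_line(xs, ys)
print(a, b)  # 1.0 2.0

# The estimated regression line evaluated at a new input level.
print(a + b * 10)  # 21.0
```

Because the toy points are exactly linear, the fitted line recovers the intercept and slope without error; with noisy data the same function returns the line that minimizes SS.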


Example 4
• The raw material used in the production of a certain synthetic fiber is stored in a location without a humidity
control.
• Measurements of the relative humidity in the storage location and the moisture content of a sample of the
raw material were taken over 15 days with the following data (in percentages) resulting.
Solution 4
Example 5
• In Example 2, xi is the year, and yi is the world population during that year.
• To estimate the regression line, we compute
x̄ = 1980; ȳ = 4558.1;
B = 74.1
A = ȳ − B x̄ = −142201

• The estimated regression line is Y = A + Bx = -142201 + 74.1x


• We conclude that the world population grows at an average rate of 74.1 million people per year.
• We can use the obtained equation to predict the future growth of the world population.
• Regression predictions for years 2015 and 2020 are
Y = A + 2015 * B = 7152 million people
Y = A + 2020 * B = 7523 million people
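A note on the arithmetic: plugging the rounded slope 74.1 into A + Bx with A = −142201 does not reproduce the predictions above, because the rounding error in B is multiplied by a large x. The reported intercept appears to come from an unrounded slope of about 74.12 (this is inferred from (ȳ − A)/x̄ and is not stated in the example). The sketch below, which uses only the summary values quoted in this example, compares the intercept form with the algebraically equivalent centered form ȳ + B(x − x̄), which is far less sensitive to rounding.

```python
# Summary values quoted in Example 5 (population in millions).
x_bar, y_bar = 1980.0, 4558.1
b_rounded = 74.1         # slope as reported, rounded to one decimal
a_reported = -142201.0   # intercept as reported

def predict_centered(year):
    """Prediction in centered form: y_bar + B * (year - x_bar)."""
    return y_bar + b_rounded * (year - x_bar)

def predict_intercept_form(year):
    """Prediction as A + B * year, mixing the reported A with the rounded B."""
    return a_reported + b_rounded * year

for year in (2015, 2020):
    print(year, predict_centered(year), predict_intercept_form(year))
# Centered form: about 7151.6 and 7522.1, matching the reported
# 7152 and 7523 up to rounding.
# Intercept form: about 7110.5 and 7481.0, roughly 41 million too low.

# Slope implied by the reported intercept (an inference, see above).
b_implied = (y_bar - a_reported) / x_bar
print(b_implied)  # about 74.12
```

The centered form is a convenient way to check regression predictions by hand when only rounded coefficients are available.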
Correlation Coefficient
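• The correlation coefficient measures the direction and strength of a linear relationship between variables X and Y:
$$r = \frac{S_{xy}}{S_x S_y}$$
• Sxy is the sample covariance
$$S_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$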

• Sx is the standard deviation of x
• Sy is the standard deviation of y
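As a minimal sketch, the sample correlation coefficient can be computed from these three quantities, assuming the usual definition r = Sxy / (Sx Sy) with divisor n − 1 in both the covariance and the standard deviations. The function name and the toy data are hypothetical.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r = S_xy / (S_x * S_y)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance and sample standard deviations (divisor n - 1).
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return s_xy / (s_x * s_y)

# Hypothetical toy data.
print(correlation([0, 1, 2, 3], [1, 3, 5, 7]))  # exactly linear, increasing: r close to 1
print(correlation([0, 1, 2, 3], [7, 5, 3, 1]))  # exactly linear, decreasing: r close to -1
print(correlation([1, 2, 3, 4], [2, 3, 5, 6]))  # nearly linear: r about 0.99
```

The sign of r gives the direction of the linear relationship and its magnitude, which never exceeds 1, gives the strength.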
