Statistical Analysis (SM 901B) Unit 2 - Regression: Goonjan Jain Department of Applied Mathematics DTU
Unit 2 - Regression
Goonjan Jain
Department of Applied Mathematics
DTU
Introduction
• Regression analysis is concerned with determining a relationship among a set of variables.
• For instance, in a chemical process, we might be interested in the relationship between the output of the
process, the temperature at which it occurs, and the amount of catalyst employed.
• Knowledge of such a relationship would enable us to predict the output for various values of temperature
and amount of catalyst.
• Houses in the same part of the country that have the same square footage of living space will not all be sold
for the same price.
Variables
• There is a single response variable Y, also called the dependent variable, which depends on the values of a set of input variables, also called independent variables, x1, ..., xr. The simplest type of relationship between the dependent variable Y and the input variables x1, ..., xr is a linear relationship. That is, for some constants β0, β1, ..., βr the equation
$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r$$
• would hold. If this were the relationship between Y and the xi, i = 1, ..., r, then it would be possible (once the βi were learned) to exactly predict the response for any set of input values.
• However, in practice, such precision is almost never attainable, and the most that one can expect is that this equation would be valid subject to random error. By this we mean that the explicit relationship is
$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r + e$$
• where e, representing the random error, is assumed to be a random variable having mean 0.
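The role of the mean-zero error term can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the coefficient values, the input values, and the choice of a normal distribution for e are all hypothetical (the slides only require that e has mean 0). It draws many responses at one fixed set of inputs and compares their average with the linear part of the model.

```python
import random

def simulate_response(betas, xs, rng, error_sd=1.0):
    """Draw one response Y = beta_0 + beta_1*x_1 + ... + beta_r*x_r + e."""
    linear_part = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    # Assumption: e is normal with mean 0; any mean-zero distribution would do.
    e = rng.gauss(0.0, error_sd)
    return linear_part + e

rng = random.Random(0)        # fixed seed so the run is repeatable
betas = [2.0, 0.5, -1.0]      # hypothetical beta_0, beta_1, beta_2
xs = [10.0, 3.0]              # hypothetical inputs x_1, x_2

# Linear part without error: 2.0 + 0.5*10.0 - 1.0*3.0 = 4.0
linear_value = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))

samples = [simulate_response(betas, xs, rng) for _ in range(20000)]
mean_response = sum(samples) / len(samples)

print("linear part:", linear_value)
print("average of simulated responses:", mean_response)
```

Individual responses differ from run to run, but because e has mean 0 their average settles near the linear part.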
Regression Equation
• Another way of expressing the linear relationship with random error is as follows:
$$E[Y \mid x] = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r$$
• where x = (x1, ..., xr) is the set of independent variables, and E[Y|x] is the expected response given the inputs x.
• This equation is called a linear regression equation.
• It describes the regression of Y on the set of independent variables x1, ..., xr.
• The constants β0, β1, ..., βr are called the regression coefficients and are estimated from a set of data.
• A regression equation containing a single independent variable (r = 1) is called a simple regression equation.
• A regression equation containing many independent variables is called a multiple regression equation.
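As a rough illustration of how a linear regression equation is evaluated once its coefficients are known, the sketch below computes the expected response for given inputs. The function names and the coefficient values are hypothetical, chosen only for this example.

```python
def expected_response(betas, xs):
    """Evaluate E[Y | x] = beta_0 + beta_1*x_1 + ... + beta_r*x_r."""
    if len(betas) != len(xs) + 1:
        raise ValueError("need one coefficient per input plus the constant beta_0")
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))

def regression_kind(xs):
    """Label the equation by its number of independent variables r."""
    return "simple" if len(xs) == 1 else "multiple"

# Hypothetical simple regression (r = 1): E[Y | x] = 1 + 2*x
simple_value = expected_response([1.0, 2.0], [3.0])

# Hypothetical multiple regression (r = 2): E[Y | x] = 1 + 2*x_1 + 3*x_2
multiple_value = expected_response([1.0, 2.0, 3.0], [4.0, 5.0])

print(regression_kind([3.0]), simple_value)
print(regression_kind([4.0, 5.0]), multiple_value)
```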
Simple Linear Regression Model
• A simple linear regression model supposes a linear relationship between the mean response and the value of a single independent variable. It can be expressed as
$$Y = \alpha + \beta x + e$$
• where x is the value of the independent variable, also called the input level,
• Y is the response, and
• e, representing the random error, is a random variable having mean 0.
Example 1
• Consider the following 10 data pairs (xi , yi), i = 1, . . . , 10, relating y, the percent yield of a laboratory
experiment, to x, the temperature at which the experiment was run.
Example 2
• How accurate is the forecast obtained in this example? The observed population during 1950–2010 appears rather close to the estimated regression line. It is reasonable to hope that it will continue to do so through 2020.
Example 3
• Seventy house sale prices in a certain county are depicted along with the house area.
• First, we see a clear relation between these two variables, and in general, bigger houses are more expensive.
However, the trend no longer seems linear.
• Second, there is a large amount of variability around this trend. Indeed, area is not the only factor
determining the house price. Houses with the same area may still be priced differently.
• How, then, can we estimate the price of a 3200-square-foot house? We can estimate the general trend (the dotted line) and plug 3200 into the resulting formula, but because of the obviously high variability, our estimate will not be as accurate as in the previous example.
Least Squares Estimators of the Regression Equation
• Suppose that the responses Yi corresponding to the input values xi, i = 1, ..., n, are to be observed and used to estimate α and β in a simple linear regression model.
• To determine estimators of α and β:
• If A is the estimator of α and B of β, then the estimator of the response corresponding to the input variable xi would be A + Bxi.
• Since the actual response is Yi, the squared difference is (Yi − A − Bxi)^2, and so if A and B are the estimators of α and β, then the sum of the squared differences between the estimated responses and the actual response values, call it SS, is given by
$$SS = \sum_{i=1}^{n} (Y_i - A - B x_i)^2$$
• The method of least squares chooses as estimators of α and β the values of A and B that minimize SS.
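To make the quantity being minimized concrete, the sketch below computes SS, the sum over i of (Yi − A − B·xi)^2, for two candidate lines. The four data pairs and both candidate (A, B) values are hypothetical and serve only to show that a line closer to the data gives a smaller SS.

```python
def sum_of_squares(A, B, xs, ys):
    """SS = sum over i of (Y_i - A - B*x_i)^2 for the candidate line A + B*x."""
    return sum((y - A - B * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical data lying roughly along y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

ss_good = sum_of_squares(1.0, 2.0, xs, ys)  # candidate close to the data
ss_poor = sum_of_squares(0.0, 3.0, xs, ys)  # candidate far from the data

print("SS for A=1, B=2:", ss_good)
print("SS for A=0, B=3:", ss_poor)
```

The least squares estimators are the particular A and B for which no other candidate line has a smaller SS.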
Least Squares Method
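• To determine these estimators, we differentiate SS first with respect to A and then with respect to B as follows:
$$\frac{\partial SS}{\partial A} = -2 \sum_{i=1}^{n} (Y_i - A - B x_i)$$
$$\frac{\partial SS}{\partial B} = -2 \sum_{i=1}^{n} x_i (Y_i - A - B x_i)$$
• Setting these partial derivatives equal to zero yields the following equations for the minimizing values A and B:
$$\sum_{i=1}^{n} Y_i = nA + B \sum_{i=1}^{n} x_i$$
$$\sum_{i=1}^{n} x_i Y_i = A \sum_{i=1}^{n} x_i + B \sum_{i=1}^{n} x_i^2$$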
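Setting the two partial derivatives of SS to zero gives two linear equations in the two unknowns A and B, which can be solved directly. The following is a minimal sketch of that solution, assuming hypothetical data; the function name and the data values are illustrative, and the closed form used is simply what results from eliminating A between the two equations.

```python
def least_squares_line(xs, ys):
    """Solve the two equations obtained by setting dSS/dA = dSS/dB = 0:
         sum(Y_i)     = n*A      + B*sum(x_i)
         sum(x_i*Y_i) = A*sum(x_i) + B*sum(x_i^2)
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is not determined")
    B = (n * sum_xy - sum_x * sum_y) / denom   # eliminate A, solve for B
    A = (sum_y - B * sum_x) / n                # back-substitute into first equation
    return A, B

# Hypothetical data lying roughly along y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
A, B = least_squares_line(xs, ys)

# Check: both partial derivatives of SS should vanish at (A, B)
dSS_dA = -2 * sum(y - A - B * x for x, y in zip(xs, ys))
dSS_dB = -2 * sum(x * (y - A - B * x) for x, y in zip(xs, ys))

print("A =", A, "B =", B)
print("dSS/dA =", dSS_dA, "dSS/dB =", dSS_dB)
```

For these four points the sums are Σx = 10, ΣY = 24.0, ΣxY = 69.7 and Σx² = 30, so B = 38.8/20 = 1.94 and A = (24.0 − 19.4)/4 = 1.15, up to floating-point rounding.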