
Intended Learning Outcomes

At the end of this module, it is expected that the students will be able to:

1. Construct empirical models using simple linear regression.
2. Estimate the parameters in a linear regression model using the least-squares approach.
3. Test hypotheses on simple linear regression.
4. Predict future observations using the regression model.
5. Determine the adequacy of the regression model using residual analysis and the coefficient of determination.
6. Apply the correlation model.
Empirical Models

▪ Analysis of the relationship between variables.

▪ Examples of variables that are related to each other:
  • pressure and temperature of a gas in a container
  • velocity and the area of the channel
  • displacement and velocity

▪ Examples of variables that are related, but whose relationships are not deterministic:
  • fuel usage of a car (y) and its weight (x)
  • electrical energy consumption of a house (y) and the size of the house in sq ft (x)

Regression Analysis
▪ Collection of statistical tools that are used to model and explore relationships between variables that
have a nondeterministic relationship.

▪ Deals with finding the best relationship between y and x, quantifying the strength of that relationship,
and using methods that allow prediction of the response y for given values of the regressor x.

Here y is the dependent (response) variable and x is the independent (regressor) variable.
Simple Linear Regression

▪ Has only one dependent or response variable (y) and one independent,
regressor or predictor variable (x):

$$y = a + bx + \varepsilon$$

Where:
y – dependent variable
x – independent variable
a – regression coefficient: y intercept (constant)
b – regression coefficient: slope of the regression line
$\varepsilon$ – error, which we are trying to minimize

▪ Estimates of a and b should result in a line that is the "best fit" to the given
data.
Method of Least Squares

$$SS_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}
\qquad
SS_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$

$$\hat{b} = \frac{SS_{xy}}{SS_{xx}} \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x}$$

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

$$\hat{y} = \hat{a} + \hat{b}x \qquad y_i = \hat{a} + \hat{b}x_i + e_i \qquad e_i = y_i - \hat{y}_i$$
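Below is a minimal Python sketch of these least-squares formulas; the function name and the use of NumPy are illustrative choices, not part of the module.

```python
import numpy as np

def least_squares_fit(x, y):
    """Fit y = a + b*x by the method of least squares using the SSxx/SSxy formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)

    # Sums of squares, exactly as in the formulas above
    ss_xx = np.sum(x**2) - np.sum(x)**2 / n
    ss_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n

    b_hat = ss_xy / ss_xx                 # slope
    a_hat = y.mean() - b_hat * x.mean()   # intercept

    residuals = y - (a_hat + b_hat * x)   # e_i = y_i - y_hat_i
    return a_hat, b_hat, residuals
```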
Example

$$SS_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}
= 29.2892 - \frac{23.92^2}{20} = 0.68088$$

$$SS_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}
= 2214.6566 - \frac{23.92 \times 1843.21}{20} = 10.17744$$
Example

$$SS_{xx} = 0.68088 \qquad SS_{xy} = 10.17744$$

$$\hat{b} = \frac{SS_{xy}}{SS_{xx}} = \frac{10.17744}{0.68088} = 14.94748$$

$$\hat{a} = \bar{y} - \hat{b}\,\bar{x} = 92.1605 - 14.94748 \times 1.1960 = 74.28331$$

Fitted regression line:

$$\hat{y} = \hat{a} + \hat{b}x = 74.28 + 14.95x$$
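As a quick check, the same estimates can be reproduced from the summary statistics quoted above (n = 20, Σx = 23.92, Σy = 1843.21, Σx² = 29.2892, Σxy = 2214.6566); this is a sketch using only those sums, not the original data set.

```python
n = 20
sum_x, sum_y = 23.92, 1843.21
sum_x2, sum_xy = 29.2892, 2214.6566

ss_xx = sum_x2 - sum_x**2 / n            # 0.68088
ss_xy = sum_xy - sum_x * sum_y / n       # 10.17744

b_hat = ss_xy / ss_xx                    # ~14.94748
a_hat = sum_y / n - b_hat * (sum_x / n)  # ~74.28331

print(f"y_hat = {a_hat:.2f} + {b_hat:.2f} x")   # y_hat = 74.28 + 14.95 x
```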
Example

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 = 21.2498$$

$$\hat{\sigma}^2 = \frac{SSE}{n-2} = \frac{21.2498}{18} = 1.1805$$
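A quick numeric check of the variance estimate, assuming the SSE and n values stated above:

```python
sse, n = 21.2498, 20
sigma2_hat = sse / (n - 2)
print(round(sigma2_hat, 4))   # 1.1805
```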
Correlation: Estimating the Strength of Linear Relation

Correlation

▪ Statistical method used to determine whether there is a relationship between
variables and the strength of that relationship.

▪ Correlation coefficient – measures how closely the points in a scatter
diagram are spread around a line.

▪ r – symbol for the sample correlation coefficient

▪ ρ – symbol for the population correlation coefficient
Correlation

▪ High positive correlation: the correlation coefficient will be +1 when all the
points lie on a line and the line has a positive slope. As x increases, y
increases – a direct relationship between x and y.

▪ High negative correlation: the points lie on a line, but the line is going down,
so the correlation coefficient will be equal to -1. As x increases, y decreases –
an inverse relationship.

▪ Low positive correlation: the correlation coefficient approaches zero.

▪ No correlation: r can be very close to zero.
Correlation

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$$

$$SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}
\qquad
SS_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}
\qquad
SS_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}$$
Example

$$SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 0.68088$$

$$SS_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 173.3769$$

$$SS_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} = 10.17744$$

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{10.17744}{\sqrt{0.68088 \times 173.3769}} = 0.9367$$
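The correlation coefficient follows directly from these sums of squares; a minimal check in Python:

```python
import math

ss_xx, ss_yy, ss_xy = 0.68088, 173.3769, 10.17744

r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(round(r, 4))   # 0.9367 -> a strong positive linear relationship
```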
Hypothesis Tests in Simple Linear Regression

Use of t-tests

• Hypotheses

  Slope: H₀: b = 0, H₁: b ≠ 0        Intercept: H₀: a = 0, H₁: a ≠ 0

• Testing approaches: critical value and p-value, with df = n − 2

$$t_0 = \frac{\hat{b} - b}{\sqrt{\hat{\sigma}^2 / SS_{xx}}}
\qquad
t_0 = \frac{\hat{a} - a}{\sqrt{\hat{\sigma}^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{SS_{xx}}\right)}}$$

Where:
$\hat{b}$ – sample regression slope coefficient
b – hypothesized slope (usually b = 0)
$\hat{a}$ – sample regression y-intercept coefficient
a – hypothesized y intercept
$\hat{\sigma}^2$ – estimated variance
$SS_{xx}$ – sum of squares of x
Use of t-tests
Using the sample data, given a level of significance of 0.05:

H₀: a = 0   No linear relationship; not significant
H₁: a ≠ 0   There is a linear relationship; significant

Degrees of freedom = n − 2 = 18
Critical t = ±2.101

$$t_0 = \frac{\hat{a} - a_0}{\sqrt{\hat{\sigma}^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{SS_{xx}}\right)}}
= \frac{74.28 - 0}{\sqrt{1.18\left(\dfrac{1}{20} + \dfrac{1.196^2}{0.68}\right)}} = 46.62$$

Conclusion: Since 46.62 is greater than the critical value 2.101, reject H₀ and
conclude that the population y-intercept coefficient is significant; there is a
linear relationship.
Use of t-tests
Using the sample data, given a level of significance of 0.05:

H₀: b = 0   No linear relationship; not significant
H₁: b ≠ 0   There is a linear relationship; significant

Degrees of freedom = n − 2 = 18
Critical t = ±2.101

$$t_0 = \frac{\hat{b} - b}{\sqrt{\hat{\sigma}^2 / SS_{xx}}}
= \frac{14.94748 - 0}{\sqrt{1.18/0.68}} = 11.35$$

Conclusion: Since 11.35 is greater than the critical value 2.101, reject H₀ and
conclude that the population slope coefficient is significant; there is a linear
relationship.
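Both t statistics and the critical value can be reproduced as follows; the use of scipy.stats for the critical value is an illustrative choice, and the inputs are the rounded values shown on the slides.

```python
import math
from scipy import stats

n = 20
a_hat, b_hat = 74.28, 14.94748
sigma2_hat = 1.18            # estimated variance, rounded as on the slide
ss_xx, x_bar = 0.68, 1.196

t_slope = (b_hat - 0) / math.sqrt(sigma2_hat / ss_xx)                         # ~11.35
t_intercept = (a_hat - 0) / math.sqrt(sigma2_hat * (1/n + x_bar**2 / ss_xx))  # ~46.6

t_crit = stats.t.ppf(1 - 0.05/2, df=n - 2)     # ~2.101 for a two-sided test at alpha = 0.05
print(t_slope > t_crit, t_intercept > t_crit)  # True True -> reject H0 in both tests
```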
Significance of Regression

▪ Failure to reject the null hypothesis is equivalent to concluding that there is
no linear relationship between the dependent and the independent variables,
or that the true relationship between the two variables is not linear.

▪ If the null hypothesis is rejected, it could mean that the straight-line model is
adequate, or that there is a linear effect of the independent variable.
Sums of Squares: Measures of Variation

The total variation in y (SST) is partitioned into the variation explained by the
regression (SSR) and the unexplained error variation (SSE): SST = SSR + SSE.
Coefficient of Determination

▪ R²

▪ Often used to judge the adequacy of a regression model.

▪ Square of the correlation coefficient between jointly distributed random
variables X and Y, and has a value of 0 ≤ R² ≤ 1.

▪ Often referred to as the amount of variability in the data explained or
accounted for by the regression model.

▪ It does not measure the magnitude of the slope of the regression line;
a large value of R² does not imply a steep slope.

▪ Does not measure the appropriateness of the model, because it can be
artificially inflated by adding higher-order polynomial terms to the model.

▪ A large R² does not necessarily imply that the regression model will provide
accurate predictions of future observations.
Coefficient of Determination

$$R^2 = \frac{SSR}{SST} = \frac{152.127}{173.377} = 0.877$$

Approximately 87.7% of the variation in purity of oxygen can be explained by
the percentage of hydrocarbon.
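A small sketch of the same calculation, using SSR = SST − SSE with the sums of squares computed earlier:

```python
sst = 173.3769          # total sum of squares (SS_yy)
sse = 21.2498           # error sum of squares
ssr = sst - sse         # regression sum of squares, ~152.127

r_squared = ssr / sst
print(round(r_squared, 3))   # 0.877 -> ~87.7% of the variation explained
```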
Correlation

▪ Correlation analysis attempts to measure the strength of such relationships
between two variables by means of a single number called a correlation
coefficient.

▪ The correlation coefficient measures how closely the points in a scatter diagram
are spread around a line.

▪ +1 implies a perfect linear relationship with a positive slope.
▪ -1 implies a perfect linear relationship with a negative slope.
▪ 0 indicates no correlation.
Analysis of Variance Approach to Test Significance of Regression

▪ The analysis-of-variance (ANOVA) approach is used in analyzing the quality of the
estimated regression line.

▪ A procedure where the total variation in the dependent variable is subdivided
into meaningful components that are then observed and treated
systematically.

$$f = \frac{SSR/1}{SSE/(n-2)} = \frac{SSR}{s^2}$$

Reject H₀ when $f > f_{\alpha}(1, n-2)$.
Analysis of Variance Approach to Test Significance of Regression

Steps
1. Parameter of interest is b, the slope.
2. H₀: b = 0 (no linear relationship)
3. H₁: b ≠ 0
4. α = 0.05
5. Test statistic: F = SSR/s²
6. Rejection region: F > F₀.₀₅,₁,₁₈
7. Computation: F = 152.1271 / 1.1805 = 128.8617
8. Conclusion
Sums of Squares: Measures of Variation

$$F = \frac{152.1271}{1.1805} = 128.8617$$

Since F = 128.86 > F₀.₀₅,₁,₁₈ = 4.414, reject H₀.

Since the p-value = 0.000 < α = 0.05, reject the null hypothesis and conclude
that the slope coefficient is different from zero; the regression is significant.
Sums of Squares: Measures of Variation

Source of Variation   Sum of Squares   df   Mean Square       F
Regression                  152.1271    1      152.1271    128.8617
Error                        21.2498   18        1.1805
Total                       173.3769   19

$$F = \frac{152.1271}{1.1805} = 128.8617$$
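The F test can be reproduced as follows; scipy.stats is used here only to look up the critical value and p-value, and is not part of the module.

```python
from scipy import stats

ssr, sse = 152.1271, 21.2498
df_reg, df_err = 1, 18

f_stat = (ssr / df_reg) / (sse / df_err)      # ~128.86
f_crit = stats.f.ppf(0.95, df_reg, df_err)    # ~4.414 at alpha = 0.05
p_value = stats.f.sf(f_stat, df_reg, df_err)  # ~0.000

print(f_stat > f_crit, p_value < 0.05)        # True True -> reject H0
```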
Adequacy of the Regression Model

Assumptions required to fit a regression model


• Errors are uncorrelated random variables with mean zero and constant
variance.
• Errors should be normally distributed.
• Order of the model is correct.
Residual Analysis

▪ Frequently helpful in checking the assumption that the errors are approximately
normally distributed with constant variance, and in determining whether
additional terms in the model would be useful.

▪ A frequency histogram of the residuals or a normal probability plot of the
residuals can be constructed and used to approximately check normality.
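A sketch of such a residual check in Python; matplotlib and scipy are illustrative choices, and the residuals listed are placeholders rather than the data set used in the slides.

```python
import matplotlib.pyplot as plt
from scipy import stats

# Residuals e_i = y_i - y_hat_i from a fitted model; placeholder values only
residuals = [0.4, -1.1, 0.7, -0.3, 1.2, -0.8, 0.1, -0.5, 0.9, -0.6]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

ax1.hist(residuals, bins=5)           # frequency histogram of the residuals
ax1.set_title("Histogram of residuals")

stats.probplot(residuals, plot=ax2)   # normal probability plot
ax2.set_title("Normal probability plot")

plt.tight_layout()
plt.show()
```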