
Statistical Inference and Predictive Analytics (1)

Course: Business Analytics – Session 2
Odd Semester – 2023/2024
September 2023

Summarized from:
Evans, James R. Business Analytics, 3rd Edition. Pearson Education, 2020.

Nurturing Inclusive, Relevant & Reputable Leaders


Agenda

• Statistical Inference
• Trendlines and Regression Analysis
Statistical Inference
Statistical Inference

Estimation of population parameters and hypothesis testing, a technique that allows you to draw valid statistical conclusions, using samples, about the value of population parameters or differences among them.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.276 4
Hypothesis Testing

Drawing inferences about two contrasting propositions (hypotheses) about the value of one or more population parameters, such as the mean, proportion, standard deviation, or variance.

Null Hypothesis (H0): describes the existing theory or belief that is accepted unless there is strong statistical evidence to the contrary.
Alternative Hypothesis (H1): the complement of the null hypothesis; it must be true if the null hypothesis is false.

Reject the null hypothesis (H0): the sample data provide sufficient statistical evidence to support the alternative hypothesis.
Fail to reject the null hypothesis (H0): the sample data do not support the alternative hypothesis. We can only accept as valid the existing theory or belief, but we can never prove it.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.276 5
Steps of Hypothesis-Testing

1. Identify the population parameter of interest and formulate the hypotheses.
2. Select a level of significance.
3. Determine a decision rule on which to base a conclusion.
4. Collect data and calculate a test statistic.
5. Apply the decision rule to the test statistic and draw a conclusion.

One-Sample Hypothesis Tests: a single population.
Multiple-Sample Hypothesis Tests: more than one population.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.276 - 277 6
One-Sample Hypothesis Tests -
Types
1. H0: population parameter ≥ constant vs H1: population parameter < constant
2. H0: population parameter ≤ constant vs H1: population parameter > constant
3. H0: population parameter = constant vs H1: population parameter ≠ constant

It is not correct to formulate a null hypothesis using >, <, or ≠.

REMEMBER:
We cannot “prove” that H0 is true, we can only fail to reject it.
If we cannot reject the null hypothesis, we have shown only that there is insufficient evidence to
conclude that the alternative hypothesis is true.
Rejecting the null hypothesis provides strong evidence (in a statistical sense) that the null hypothesis
is not true and that the alternative hypothesis is true.
Therefore, what we wish to provide evidence for statistically should be identified as the alternative
hypothesis.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.277 7
One-Sample Hypothesis Tests -
Example 1
CadSoft, a producer of computer-aided design software for the aerospace
industry, receives numerous calls for technical support. In the past, the average
response time has been at least 25 minutes. The company has upgraded its
information systems and believes that this will help reduce response time. As a
result, it believes that the average response time can be reduced to less than 25
minutes. The company collected a sample of 44 response times.

Population parameter : mean (µ) of response time


Hypothesis statements :
H0: population mean response time ≥ 25 minutes
H1: population mean response time < 25 minutes
Hypothesis formula :
H0: µ ≥ 25 minutes
H1: µ < 25 minutes

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p. 278 8
One-Sample Hypothesis Tests -
Results
1. H0 is actually true and the test correctly fails to reject it.
2. H0 is actually false and the test correctly reaches this conclusion.
3. H0 is actually true BUT the test incorrectly rejects it. (TYPE I ERROR)
4. H0 is actually false BUT the test incorrectly fails to reject it. (TYPE II ERROR)

The probability of making a Type I error
• Called the level of significance (α): the likelihood that you will make the incorrect conclusion that the alternative hypothesis is true when, in fact, the null hypothesis is true.
• The value of α can be controlled by the decision maker and is selected before the test is conducted.
• Commonly used levels for α are 0.10, 0.05, and 0.01.

The probability of correctly failing to reject H0
• Called the confidence coefficient (1 − α).
• Example: a confidence coefficient of 0.95 means that we expect 95 out of 100 samples to support the null hypothesis rather than the alternative hypothesis when H0 is actually true.
Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.278 9
Test Statistic

When σ (the population standard deviation) is known, the test statistic is
z = (x̄ − µ0) / (σ / √n)
When σ is unknown, use the sample standard deviation s:
t = (x̄ − µ0) / (s / √n), with n − 1 degrees of freedom

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.279 10
One-Sample Hypothesis Tests -
Example 2

In the CadSoft example, sample data for 44 customers revealed a mean response
time of 21.91 minutes and a sample standard deviation of 19.49 minutes.

t = (x̄ − µ0) / (s / √n) = (21.91 − 25) / (19.49 / √44) = −3.09 / 2.938 = −1.05

t = −1.05 indicates that the sample mean of 21.91 is 1.05 standard errors below the
hypothesized mean of 25 minutes

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p. 280 11
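The same arithmetic can be checked outside Excel. A minimal Python sketch using only the sample summary quoted above (x̄ = 21.91, s = 19.49, n = 44); the variable names are illustrative.

```python
import math

# Sample summary from the CadSoft example
x_bar, mu0, s, n = 21.91, 25, 19.49, 44

# t = (x_bar - mu0) / (s / sqrt(n))
std_error = s / math.sqrt(n)          # ~= 2.938
t_stat = (x_bar - mu0) / std_error    # ~= -1.05
print(round(std_error, 3), round(t_stat, 2))
```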
Critical Value and Drawing a
Conclusion

• The conclusion to reject or fail to reject H0 is based on comparing the value of the test statistic to a "critical value" from the sampling distribution of the test statistic when the null hypothesis is true, at the chosen level of significance, α.
• The sampling distribution of the test statistic is usually the normal distribution or the t-distribution.

• The critical value divides the sampling distribution into two:


• a rejection region, and
• a non-rejection region.
• If the test statistic falls into the rejection region, we reject the null hypothesis,
otherwise, we fail to reject it.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.280 12
Rejection Regions
One-Tailed Tests :
Specify a direction of relationship (where H0 is either ≥ or ≤)

• If H1 is stated as <, the rejection region is in the lower tail.
• If H1 is stated as >, the rejection region is in the upper tail.
Two-Tailed Tests :
Have both upper and lower critical values

If the test statistic is either greater than the upper critical value or less than the lower critical value, the decision is to reject the null hypothesis.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.281 13
One-Tailed Test - Example 3

For the CadSoft example, if the level of significance is 0.05, the critical value t0.05,43 is found in Excel using =T.INV(0.95, 43) = 1.68. Because the t-distribution is symmetric with a mean of 0 and this is a lower-tailed test, we use the negative of this number (−1.68) as the critical value. Here n = 44, so df = n − 1 = 43.

Hypotheses:
H0: µ ≥ 25 minutes
H1: µ < 25 minutes

Result: t = −1.05 does not fall in the rejection region.
Conclusion:
• Fail to reject H0.
• We cannot conclude that the mean response time has improved to less than 25 minutes.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p. 282 14
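A sketch of the same decision rule in Python, using scipy.stats.t.ppf in place of Excel's T.INV; the numbers are those from the CadSoft example.

```python
from scipy import stats

alpha, df = 0.05, 43
t_stat = -1.05                         # test statistic from the CadSoft example

# Lower-tail critical value; equivalent to -T.INV(0.95, 43) in Excel
t_crit = stats.t.ppf(alpha, df)        # ~= -1.68

if t_stat < t_crit:
    print("Reject H0: mean response time < 25 minutes")
else:
    print("Fail to reject H0")         # this branch is taken here
```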
Two-Tailed Test

For a two-tailed test, the critical values are ±t(α/2, n−1), i.e., the upper α/2 critical value and its negative.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.282 15
Two-Tailed Test - Example 4

Data collected in a survey of 34 respondents by a travel agency. Suppose that the travel
agency wanted to target individuals who were approximately 35 years old. Thus, we wish
to test whether the average age of respondents is equal to 35. The sample mean is
38.676, and the sample standard deviation is 7.857

Hypotheses:
H0: mean age = 35
H1: mean age ≠ 35

Critical value: t0.025,33, obtained with the Excel function =T.INV.2T(0.05, 33), is 2.0345; thus the critical values are ±2.0345.

Test statistic:
t = (x̄ − µ0) / (s / √n) = (38.676 − 35) / (7.857 / √34) = 2.73

Result: t = 2.73 falls in the rejection region.

Conclusion: Reject H0; the average age is not 35.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p. 282 16
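The same two-tailed test, sketched in Python from the summary statistics above; scipy.stats.t.ppf plays the role of T.INV.2T.

```python
import math
from scipy import stats

x_bar, mu0, s, n = 38.676, 35, 7.857, 34
alpha = 0.05

t_stat = (x_bar - mu0) / (s / math.sqrt(n))     # ~= 2.73
t_crit = stats.t.ppf(1 - alpha / 2, n - 1)      # ~= 2.0345, like =T.INV.2T(0.05, 33)

if abs(t_stat) > t_crit:
    print("Reject H0: the mean age differs from 35")   # taken here
else:
    print("Fail to reject H0")
```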
p-Values

• The probability of obtaining a test statistic value equal to or more extreme than that obtained from the sample data when the null hypothesis is true; also called the observed significance level.
• An alternative approach to Step 3 of a hypothesis test uses the
p-value rather than the critical value.
• Compare the p-value to the chosen level of significance (α) :
Reject H0, if the p-value < α

Read the textbook about finding p-values when σ is known or unknown.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.284 17
p-Values - Example 5

CadSoft example: the p-value is obtained using the Excel function =T.DIST(-1.05, 43, TRUE) = 0.15.
Result: p = 0.15 is not less than α = 0.05.
Conclusion:
• Fail to reject H0.
• There is about a 15% chance that the test statistic would be −1.05 or smaller if H0 were true.

Vacation survey example: the p-value is obtained using the Excel function =T.DIST.2T(2.73, 33) = 0.010.
Result: p = 0.010 is less than α = 0.05.
Conclusion: Reject H0.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p. 282 18
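Both p-values can be reproduced with SciPy's t-distribution functions; a small sketch using the test statistics and degrees of freedom quoted above.

```python
from scipy import stats

# Lower-tailed p-value for the CadSoft test (t = -1.05, df = 43),
# equivalent to =T.DIST(-1.05, 43, TRUE)
p_one_tail = stats.t.cdf(-1.05, 43)        # ~= 0.15

# Two-tailed p-value for the age test (t = 2.73, df = 33),
# equivalent to =T.DIST.2T(2.73, 33)
p_two_tail = 2 * stats.t.sf(2.73, 33)      # ~= 0.010

print(round(p_one_tail, 3), round(p_two_tail, 3))
```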
One Sample Hypothesis Tests -
Excel Template

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p. 287 19
Two-Sample Hypothesis Tests

Hypotheses:
H0: µ1 − µ2 {≥, ≤, or =} 0
H1: µ1 − µ2 {<, >, or ≠} 0    (7.4)

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.288 20
Two-Sample Hypothesis Tests –
Example 6 (Independent Samples)

Determine if the mean lead time for Alum Sheeting


(µ1) is greater than the mean lead time for Durrable
Products (µ2).

t-Test: Two-Sample Assuming


Unequal Variances
• Variable 1 Range:
Alum Sheeting data
• Variable 2 Range:
Durrable Products data

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.290 - 291 21
Two-Sample Hypothesis Tests –
Example 6 (Independent Samples)
Hypothesis :
H0: µ1 − µ2 ≤ 0
H1: µ1 − µ2 > 0

One-Tailed (upper-tail test)

Result :
tstat = 3.827 > tcritical = 1.812, OR
p = 0.00166 < α = 0.05.

Conclusion :
Reject H0.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.291 22
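For reference, an unequal-variance (Welch) test like this one can be run with scipy.stats.ttest_ind. The lead-time lists below are placeholders, not the actual Purchase Orders data, and the alternative argument assumes SciPy 1.6 or later.

```python
from scipy import stats

# Placeholder lead times (days); substitute the Purchase Orders data
alum_sheeting     = [32, 45, 38, 48, 40, 42, 36, 50]
durrable_products = [29, 31, 30, 32, 28, 33]

# Welch's t-test (unequal variances), upper-tailed:
# H0: mu1 - mu2 <= 0  vs  H1: mu1 - mu2 > 0
t_stat, p_value = stats.ttest_ind(alum_sheeting, durrable_products,
                                  equal_var=False, alternative="greater")
print(t_stat, p_value)   # reject H0 when p_value < 0.05
```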
Two-Sample Hypothesis Tests –
Paired Samples
Hypotheses:

H0: µD {≥, ≤, or =} 0
H1: µD {<, >, or ≠} 0

µD is the mean difference between the paired samples

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.290 - 292 23
Two-Sample Hypothesis Tests –
Example 7 (Paired Samples)

Test for a difference in


the means of the
estimated and actual
pile lengths
(two-tailed test).

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.290 - 293 24
Two-Sample Hypothesis Tests –
Example 7 (Paired Samples)
Hypothesis :
H 0 : µD = 0
H 1 : µD ≠ 0

Two-Tailed

Result :
t is < the lower critical value, OR
p-value ≈ 0

Conclusion :
Reject H0.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education p.290 - 293 25
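A paired test of this kind maps to scipy.stats.ttest_rel. The estimated and actual pile lengths below are made-up placeholders, since the slide does not reproduce the data.

```python
from scipy import stats

# Placeholder estimated vs. actual pile lengths for the same items
estimated = [10.2, 9.8, 11.5, 10.0, 9.7, 10.9]
actual    = [10.6, 9.9, 11.9, 10.4, 9.9, 11.2]

# Two-tailed paired test: H0: mu_D = 0  vs  H1: mu_D != 0
t_stat, p_value = stats.ttest_rel(estimated, actual)
print(t_stat, p_value)   # reject H0 when p_value < 0.05
```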
F-test

• Test for equality of variances between two samples.


H0: σ1² − σ2² = 0
H1: σ1² − σ2² ≠ 0    (7.5)
• F-test statistic:
F = s1² / s2²    (7.6)

• Assume that both samples are drawn from normal populations.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.292 - 293 26
Conducting F-test

• Although the hypothesis test is really a two-tailed test, we will


simplify it as an upper-tailed, one-tailed test to make it easy to
use tables of the F-distribution and interpret the results of the
Excel tool.
• Find the critical value of the F-distribution: F(α/2, df1, df2).
• Reject H0 if the F-test statistic > the critical value.
• Note that we use α/2 to find the critical value, not α.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.292 - 293 27
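SciPy has no ready-made two-sample variance F-test, but the procedure above is easy to sketch by hand; the two samples below are placeholders.

```python
import numpy as np
from scipy import stats

# Placeholder samples; put the larger-variance sample first, as the text suggests
sample1 = np.array([32.0, 45, 38, 48, 40, 42, 36, 50])
sample2 = np.array([29.0, 31, 30, 32, 28, 33])

F = sample1.var(ddof=1) / sample2.var(ddof=1)      # F = s1^2 / s2^2
df1, df2 = len(sample1) - 1, len(sample2) - 1

alpha = 0.05
F_crit = stats.f.ppf(1 - alpha / 2, df1, df2)      # critical value at alpha/2
p_upper = stats.f.sf(F, df1, df2)                  # upper-tail p-value

print(F, F_crit, p_upper)   # reject H0 when F > F_crit (i.e., p_upper < alpha/2)
```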
F-test –
Example 8
• Determine whether the variance of lead times is the same for
Alum Sheeting and Durrable Products in the Purchase Orders
data.
• The variance of the lead times for Alum Sheeting is larger than
the variance for Durable Products, so this is assigned to Variable
1.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.292 - 293 28
F-test –
Example 8

Result :
F < Fcrit, OR
p-value > α/2 = 0.025

Conclusion :
Fail to reject H0.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.292 - 293 29
Analysis of Variance (ANOVA)

• Used to compare the means of two or more population groups.


H0: µ1 = µ2 = ⋯ = µm
H1: at least one mean is different from the others
• ANOVA measures variation between groups relative to variation
within groups.
• Each of the population groups is assumed to come from a
normally distributed population.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.294 - 296 30
Analysis of Variance (ANOVA) –
Assumptions

• The m groups or factor levels being studied represent


populations whose outcome measures
1. are randomly and independently obtained,
2. are normally distributed, and
3. have equal variances.
• If these assumptions are violated, then the level of significance
and the power of the test can be affected.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.294 - 296 31
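A one-way ANOVA of this kind corresponds to scipy.stats.f_oneway; the satisfaction scores below are illustrative placeholders, not the survey data used in Example 9.

```python
from scipy import stats

# Placeholder satisfaction scores for three education levels
some_college     = [3, 4, 4, 5, 3, 4]
college_graduate = [4, 5, 5, 4, 5, 5]
graduate_degree  = [5, 4, 5, 5, 4, 5]

# H0: all group means are equal; H1: at least one mean differs
f_stat, p_value = stats.f_oneway(some_college, college_graduate, graduate_degree)
print(f_stat, p_value)   # reject H0 when p_value < 0.05
```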
Analysis of Variance (ANOVA) –
Example 9

• Determine whether any


significant differences exist in
satisfaction among individuals
with different levels of
education.

• In this example, the factor is educational level, and we have m = 3 categorical levels of this factor: college graduate, graduate degree, and some college.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.294 - 295 32
Analysis of Variance (ANOVA) –
Example 9

Result :
F > Fcrit, OR
p-value < α = 0.05

Conclusion :
Reject H0.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education p.294 - 295 33
Trendlines and Regression Analysis
Using Charts in Modeling
Relationships and Trends in Data

To better understand relationships and trends in a data set:
• For cross-sectional data, use a scatter chart.
• For time-series data, use a line chart.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.313 35
Mathematical Functions Used in
Predictive Analytical Models

The base of natural logarithms, e = 2.71828, is often used for the constant b.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.313 36
Excel Trendline Tool

• Right-click on the data series and choose Add Trendline from the pop-up menu.
• Check the boxes Display Equation on chart and Display R-squared value on chart.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.313 37
Mathematical Functions –
Example
Market research study has collected data on sales volumes for different levels of pricing of
a particular product.

The model is :
Sales = 20,512 – (9.5116 x Price)
If the price is $125, we can
estimate the level of sales as :
Sales = 20,512 – (9.5116 x 125)
Sales = 19,323

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.314 - 315 38
R2 (R-squared)

• Measure of the “fit” of the line to the data.


• The value of R2 will be between 0 and 1.
• The larger the value of R2, the better the fit.
• A value of 1.0 indicates a perfect fit.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.314 39
Regression Analysis

A tool for building mathematical and statistical models that characterize relationships between a dependent (ratio) variable and one or more independent, or explanatory, variables (ratio or categorical), all of which are numerical.

• Simple linear regression: a single independent variable.
• Multiple regression: two or more independent variables.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.314 40
Data Classification by the Type of
Measurement Scale

Categorical (Nominal)
• Sorted into categories according to specified characteristics.
• The categories bear no quantitative relationship to one another, but we usually assign an arbitrary number to each category (e.g., 0 or 1) to ease the process of managing the data and computing statistics.
• Usually counted or expressed as proportions or percentages.
• Example: employees might be classified as managers, supervisors, and associates.

Ordinal
• Can be ordered or ranked according to some relationship to one another.
• More meaningful than categorical data because the data can be compared.
• However, ordinal data have no fixed units of measurement, so we cannot make meaningful numerical statements about differences between categories.
• Example: college basketball rankings are ordinal; a higher ranking signifies a stronger team but does not specify any numerical measure of strength.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.146 - 147 41
Data Classification by the Type of
Measurement Scale

Interval
• Ordinal, but with constant differences between observations and an arbitrary zero point.
• Examples: time and temperature.
• Celsius temperatures represent a specified measure of distance (degrees) but have an arbitrary zero point, so we cannot compute meaningful ratios; for example, we cannot say that 50 degrees is twice as hot as 25 degrees. However, we can compare differences.

Ratio
• Continuous, with a natural zero point.
• Example: the Seattle region sold $12 million in March whereas the Tampa region sold $6 million, which means that Seattle sold twice as much as Tampa.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.146 - 147 42
Simple Linear Regression

• Finds a linear relationship between one independent variable (X) and


one dependent variable (Y).
• Prepare a scatter chart to verify the data has a linear trend.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.317 - 318 43
Simple Linear Regression –
Example

Size of a house is typically


related to its market value.
X = square footage
Y = market value ($)

The scatter chart of the full


data set (42 homes)
indicates a linear trend.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.317 - 318 44
Simple Linear Regression –
Example
Equation :
Market Value = a + b x Square Feet
Two possible lines are shown below :

Line A is clearly a better fit to the data.


Which is the best regression line?

The best-fitting equation (using trendline tool) :


Market Value = $32,673 + $35.036 x Square Feet
The estimated market value of a home with 2,200 square feet would be
Market Value = $32,673 + $35.036 x 2,200 = $109,752

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.317 - 318 45
Least-Squares Regression - 1

• The mathematical basis for the best-fitting regression line.


• Simple linear regression model:
Y = β0 + β1X + ε    (8.1)
• We estimate the parameters from the sample data:
Ŷ = b0 + b1X    (8.2)
• The estimated value of Y for Xi:
Ŷi = b0 + b1Xi
Note: Xi is the value of the independent variable for the ith observation.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.320 - 321 46
Residuals

The observed errors associated with estimating the value of the dependent
variable using the regression line.

ei = Yi − Ŷi    (8.3)

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.320 - 321 47
Least-Squares Regression - 2

• The best-fitting line minimizes the sum of squares of the residuals.

• Excel functions:

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.320 - 321 48
Simple Linear Regression with Excel
– Alt 1

Slope = b1 = 35.036
=SLOPE(C4:C45, B4:B45)

Intercept = b0 = 32,673
=INTERCEPT(C4:C45, B4:B45)

Estimate Y when X = 1,750 square feet:
Ŷ = 32,673 + 35.036(1,750) = $93,986
=TREND(C4:C45, B4:B45, 1750)

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.322 - 324 49
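For readers working outside Excel, scipy.stats.linregress provides the same slope, intercept, and prediction as SLOPE, INTERCEPT, and TREND. The (square feet, market value) pairs below are placeholders standing in for the 42-home data set.

```python
from scipy import stats

# Placeholder (square feet, market value) pairs; substitute the
# Home Market Value data used in the slides
square_feet  = [1450, 1590, 1700, 1750, 1900, 2100, 2200, 2400]
market_value = [84000, 88000, 92000, 94000, 99000, 106000, 110000, 117000]

# Python analogue of Excel's SLOPE, INTERCEPT, and TREND
result = stats.linregress(square_feet, market_value)
b1, b0 = result.slope, result.intercept

y_hat_1750 = b0 + b1 * 1750          # like =TREND(..., 1750)
print(b0, b1, y_hat_1750)
```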
Simple Linear Regression with Excel
– Alt 2

• Click Data → Data Analysis → Regression.
• Input Y Range (with header).
• Input X Range (with header).
• Check the Labels box.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.322 - 324 50
Simple Linear Regression with Excel
– Alt 2

Ŷ = b0 + b1X → Ŷ = 32,673 + 35.036X

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.322 - 324 51
Regression Statistics

Multiple R (|r|)
• r is the sample correlation coefficient.
• The value of r varies from −1 to +1.
• r is negative if the slope is negative.

R Square (R2)
• See previous slides.

Adjusted R Square
• Modifies the value of R2 by incorporating the sample size and the number of explanatory variables
in the model.
• Useful when comparing this model with other models that include additional explanatory
variables.
Standard Error
• The variability of the observed Y-values from the predicted values (Ŷ).
• If the data are clustered close to the regression line, then the standard error will be small.
• The more scattered the data, the larger the standard error.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.322 - 324 52
Regression as Analysis of Variance

• Conducts an F-test to determine whether variation in Y is due to


varying levels of X.
• Test for significance of regression:
H0: population slope coefficient = 0
H1: population slope coefficient ≠ 0

• Reports the p-value (see Significance F).


• Rejecting H0 : X explains variation in Y.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.324 - 326 53
Regression as Analysis of Variance –
Example

H0: β1 = 0 (home size is not a significant variable)
H1: β1 ≠ 0 (home size is a significant variable)

p-value (Significance F) = 3.798 × 10⁻⁸

Result: p-value < α = 0.05
Conclusion: Reject H0 → home size is a significant variable in explaining variation in market value.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.324 - 326 54
Regression as Analysis of Variance –
Using t-Test - Example

t = (b1 − 0) / (standard error)    (8.8)

Result: p-value < α = 0.05
Conclusion: Reject H0 → home size is a significant variable in explaining variation in market value.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.324 - 326 55
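A sketch of the same significance tests with statsmodels: the fitted model's summary reports R², adjusted R², the standard error, the ANOVA F (Significance F), and a t-test with a p-value for each coefficient. The arrays are placeholders for the Home Market Value data.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder home-size (X) and market-value (Y) arrays
X = np.array([1450, 1590, 1700, 1750, 1900, 2100, 2200, 2400], dtype=float)
Y = np.array([84000, 88000, 92000, 94000, 99000, 106000, 110000, 117000], dtype=float)

model = sm.OLS(Y, sm.add_constant(X)).fit()

# summary() reports R-squared, adjusted R-squared, the standard error,
# the ANOVA F-statistic, and a t-test / p-value per coefficient
print(model.summary())
print(model.pvalues)    # reject H0 (slope = 0) when the slope's p-value < 0.05
```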
Checking Assumptions

Linearity
• Examine the scatter diagram → it should appear linear.
• Examine the residual plot → it should appear random.

Normality of Errors
• View a histogram of the standard residuals.
• Regression is robust to departures from normality.

Homoscedasticity
• Variation about the regression line is constant.
• Examine the residual plot.

Independence of Errors
• Successive observations should not be related.
• Correlation among successive observations over time is called autocorrelation.
• Important when the independent variable is time.
Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.327 - 329 56
Checking Assumptions –
Example Home Market Value

Linearity

Linear trend in the scatter chart; no pattern in the residual plot.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.327 - 329 57
Checking Assumptions –
Example Home Market Value

Normality of Errors

residual histogram appears slightly skewed but is not a serious departure

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.327 - 329 58
Checking Assumptions –
Example Home Market Value

Homoscedasticity

residual plot shows no serious difference in the spread of the data for different
X values

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.327 - 329 59
Checking Assumptions –
Example Home Market Value

Independence of
Errors
Because the data is cross-sectional, we can assume this assumption holds.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.327 - 329 60
Multiple Linear Regression

A linear regression model with more than one independent variable (X):
Y = β0 + β1X1 + β2X2 + ⋯ + βkXk + ε    (8.10)
where:
Y is the dependent variable,
X1, …, Xk are the independent (explanatory) variables,
β0 is the intercept term,
β1, …, βk are the regression coefficients for the independent variables,
ε is the error term.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.329 61
ANOVA for Multiple Regression

• ANOVA tests for significance of the ENTIRE model. That is, it computes an F-statistic testing the hypotheses:
H0: β1 = β2 = ⋯ = βk = 0
H1: at least one βj is not 0
• The output also provides information to test hypotheses about each of the individual regression coefficients.
• If we reject H0 that the slope associated with independent variable i is 0, then independent variable i (Xi) is significant and improves the ability of the model to predict the dependent variable.
• If we fail to reject H0, the independent variable is not significant and probably should not be included in the model.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.330 - 331 62
Multiple Linear Regression –
Example

Predict student graduation rates using several indicators:

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.331-333 63
Multiple Linear Regression –
Example
Result and Regression Model

• The value of R2 indicates that 53% of the variation in the dependent variable is
explained by these independent variables.
• All coefficients of independent variables are statistically significant (p-value < 0.05).
Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.331-333 64
Systematic Model Building Approach

• Construct a model with all available independent variables. Check the significance of the independent variables by examining their p-values.
• Identify the independent variable having the largest p-value that exceeds the chosen level of significance (i.e., an insignificant variable).
• Remove the variable identified in step 2 from the model and evaluate the adjusted R².
• Don't remove all variables with p-values that exceed α at the same time; remove only one at a time.
• Work step by step, and continue until all remaining variables are significant (see the sketch below).

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.334 - 335 65
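A minimal sketch of this backward-elimination procedure, assuming the data sit in a pandas DataFrame with one column per candidate variable; the helper function and its names are illustrative, not part of the textbook.

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df: pd.DataFrame, target: str, alpha: float = 0.05):
    """Drop the least significant predictor one at a time until every
    remaining p-value is below alpha (a sketch of the steps above)."""
    predictors = [c for c in df.columns if c != target]
    while predictors:
        X = sm.add_constant(df[predictors])
        model = sm.OLS(df[target], X).fit()
        p_values = model.pvalues.drop("const")   # p-values of the predictors only
        worst = p_values.idxmax()                # largest p-value
        if p_values[worst] <= alpha:             # every predictor is significant
            return model, predictors
        predictors.remove(worst)                 # remove only one variable, then refit
        # model.rsquared_adj can be compared before and after each removal
    return None, []
```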
Systematic Model Building Approach
- Example
Banking Data

Home value has the largest p-value; drop it and re-run the regression.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.334 - 335 66
Systematic Model Building Approach
- Example
Banking Data

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.334 - 335 67
Multicollinearity

• Occurs when there are strong correlations among the


independent variables, and they can predict each other
better than the dependent variable.
• Correlations exceeding ± 0.7 (see correlation matrix) may
indicate multicollinearity.
• The variance inflation factor is a better indicator, but not
computed in Excel.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.336 68
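A small sketch of both checks, assuming the independent variables are columns of a pandas DataFrame: the correlation matrix flags pairs beyond ±0.7, and statsmodels' variance_inflation_factor supplies the VIF that Excel does not compute.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def multicollinearity_report(X: pd.DataFrame) -> pd.Series:
    """Print the correlation matrix and return a VIF per independent variable."""
    print(X.corr())                        # look for |r| > 0.7 between predictors
    Xc = sm.add_constant(X)
    vif = pd.Series(
        [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
        index=X.columns, name="VIF")       # index 0 is the constant, so skip it
    return vif                             # large VIFs (rules of thumb: > 5 or 10) signal multicollinearity
```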
Multicollinearity ?

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.336 69
Regression with Categorical
Independent Variables

• Regression analysis requires numerical data.
• Categorical data can be included as independent variables, but must be coded numerically using dummy variables.
• For variables with 2 categories, code them as 0 and 1.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.336 70
Regression with Categorical
Independent Variables - Example

• Employee Salaries provides data for 35 employees.
• Predict Salary using Age and MBA (coded as yes = 1, no = 0).

Y = β0 + β1X1 + β2X2 + ε
where
Y = salary
X1 = age
X2 = MBA indicator (0 or 1)

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.338 - 340 71
Regression with Categorical
Independent Variables - Example

Salary = 893.59 + 1,044.15 × Age + 14,767.23 × MBA

– If MBA = 0: Salary = 893.59 + 1,044.15 × Age
– If MBA = 1: Salary = 15,660.82 + 1,044.15 × Age

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.338 - 340 72
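A sketch of the same dummy-variable regression in Python; the rows below are made-up placeholders rather than the Employee Salaries data.

```python
import pandas as pd
import statsmodels.api as sm

# Placeholder rows in the style of the Employee Salaries example
df = pd.DataFrame({
    "Age":    [26, 30, 34, 41, 45, 52, 29, 38],
    "MBA":    ["no", "yes", "no", "yes", "no", "yes", "no", "yes"],
    "Salary": [28000, 49000, 37000, 62000, 48000, 75000, 31000, 56000],
})

# Code the two-category variable as 0/1 (yes = 1, no = 0)
df["MBA"] = (df["MBA"] == "yes").astype(int)

X = sm.add_constant(df[["Age", "MBA"]])
model = sm.OLS(df["Salary"], X).fit()
print(model.params)   # intercept, Age coefficient, and the MBA shift
```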
THANK YOU
