
Statistical Inference and Predictive Analytics (1)

Course: Business Analytics – Session 2
Odd Semester – 2023/2024
September 2023

Summarized from:
Evans, James R. Business Analytics, 3rd Edition. Pearson Education, 2020.

Nurturing Inclusive, Relevant & Reputable Leaders


Agenda

• Statistical Inference
• Trendlines and Regression Analysis
Statistical Inference
Statistical Inference

Estimation of population parameters and hypothesis testing, a technique that allows you to draw valid statistical conclusions, using samples, about the value of population parameters or differences among them.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.276 4
Hypothesis Testing

Drawing inferences about two contrasting propositions (hypotheses) about the value of one or more population parameters, such as the mean, proportion, standard deviation, or variance.

Null Hypothesis (H0): describes the existing theory or belief that is accepted unless there is strong statistical evidence to the contrary.
Alternative Hypothesis (H1): the complement of the null hypothesis; it must be true if the null hypothesis is false.

Reject the null hypothesis (H0): the sample data provide sufficient statistical evidence to support the alternative hypothesis.
Fail to reject the null hypothesis (H0): the sample data do not support the alternative hypothesis. We can only accept as valid the existing theory or belief, but we can never prove it.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.276 5
Steps of Hypothesis-Testing

1. Identify the population parameter of interest and formulate the hypotheses.
2. Select a level of significance.
3. Determine a decision rule on which to base a conclusion.
4. Collect data and calculate a test statistic.
5. Apply the decision rule to the test statistic and draw a conclusion.

One-Sample Hypothesis Tests: a single population.
Multiple-Sample Hypothesis Tests: more than one population.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.276 - 277 6
One-Sample Hypothesis Tests -
Types
1. H0: population parameter ≥ constant vs H1: population parameter < constant
2. H0: population parameter ≤ constant vs H1: population parameter > constant
3. H0: population parameter = constant vs H1: population parameter ≠ constant

It is not correct to formulate a null hypothesis using >, <, or ≠.

REMEMBER:
We cannot “prove” that H0 is true, we can only fail to reject it.
If we cannot reject the null hypothesis, we have shown only that there is insufficient evidence to
conclude that the alternative hypothesis is true.
Rejecting the null hypothesis provides strong evidence (in a statistical sense) that the null hypothesis
is not true and that the alternative hypothesis is true.
Therefore, what we wish to provide evidence for statistically should be identified as the alternative
hypothesis.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.277 7
One-Sample Hypothesis Tests -
Example 1
CadSoft, a producer of computer-aided design software for the aerospace
industry, receives numerous calls for technical support. In the past, the average
response time has been at least 25 minutes. The company has upgraded its
information systems and believes that this will help reduce response time. As a
result, it believes that the average response time can be reduced to less than 25
minutes. The company collected a sample of 44 response times.

Population parameter : mean (µ) of response time


Hypothesis statements :
H0: population mean response time ≥ 25 minutes
H1: population mean response time < 25 minutes
Hypothesis formula :
H0: µ ≥ 25 minutes
H1: µ < 25 minutes

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p. 278 8
One-Sample Hypothesis Tests -
Results
1. H0 is actually true and the test correctly fails to reject it.
2. H0 is actually false and the test correctly reaches this conclusion.
3. H0 is actually true BUT the test incorrectly rejects it. (TYPE I ERROR)
4. H0 is actually false BUT the test incorrectly fails to reject it. (TYPE II ERROR)

The probability of making a Type I error
• Called the level of significance (α): the likelihood that you will make the incorrect conclusion that the alternative hypothesis is true when, in fact, the null hypothesis is true.
• The value of α can be controlled by the decision maker and is selected before the test is conducted.
• Commonly used levels for α are 0.10, 0.05, and 0.01.

The probability of correctly failing to reject H0
• Called the confidence coefficient (1 − α).
• Example: a confidence coefficient of 0.95 means that we expect 95 out of 100 samples to support the null hypothesis rather than the alternative hypothesis when H0 is actually true.
Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.278 9
Test Statistic

When σ (the population standard deviation) is known, the test statistic is
z = (x̄ − µ0) / (σ / √n)
When σ is unknown, use the sample standard deviation s:
t = (x̄ − µ0) / (s / √n), with n − 1 degrees of freedom

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.279 10
One-Sample Hypothesis Tests -
Example 2

In the CadSoft example, sample data for 44 customers revealed a mean response
time of 21.91 minutes and a sample standard deviation of 19.49 minutes.

t = (x̄ − µ0) / (s / √n) = (21.91 − 25) / (19.49 / √44) = −3.09 / 2.938 = −1.05

t = −1.05 indicates that the sample mean of 21.91 is 1.05 standard errors below the
hypothesized mean of 25 minutes

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p. 280 11
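The same arithmetic can be checked outside Excel. A minimal Python sketch using only the sample summary quoted above (x̄ = 21.91, s = 19.49, n = 44); the variable names are illustrative.

```python
import math

# Sample summary from the CadSoft example
x_bar, mu0, s, n = 21.91, 25, 19.49, 44

# t = (x_bar - mu0) / (s / sqrt(n))
std_error = s / math.sqrt(n)          # ~= 2.938
t_stat = (x_bar - mu0) / std_error    # ~= -1.05
print(round(std_error, 3), round(t_stat, 2))
```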
Critical Value and Drawing a
Conclusion

• The conclusion to reject or fail to reject H0 is based on comparing the value of the test statistic to a "critical value" from the sampling distribution of the test statistic when the null hypothesis is true, at the chosen level of significance, α.
• The sampling distribution of the test statistic is usually the normal distribution or the t-distribution.

• The critical value divides the sampling distribution into two:


• a rejection region, and
• a non-rejection region.
• If the test statistic falls into the rejection region, we reject the null hypothesis,
otherwise, we fail to reject it.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.280 12
Rejection Regions
One-Tailed Tests :
Specify a direction of relationship (where H0 is either ≥ or ≤)

• If H1 is stated as <, the rejection region is in the lower tail.
• If H1 is stated as >, the rejection region is in the upper tail.
Two-Tailed Tests :
Have both upper and lower critical values

If the test statistic is either greater than the upper critical value or less than the lower critical value, the decision is to reject the null hypothesis.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.281 13
One-Tailed Test - Example 3

For the CadSoft example, if the level of significance is 0.05, the critical value t0.05,43 is found in Excel using =T.INV(0.95, 43) = 1.68. Because the t-distribution is symmetric with a mean of 0 and this is a lower-tailed test, we use the negative of this number (−1.68) as the critical value. Here n = 44, so df = n − 1 = 43.

Hypotheses:
H0: µ ≥ 25 minutes
H1: µ < 25 minutes

Result: t = −1.05 does not fall in the rejection region.
Conclusion:
• Fail to reject H0.
• We cannot conclude that the mean response time has improved to less than 25 minutes.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p. 282 14
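A sketch of the same decision rule in Python, using scipy.stats.t.ppf in place of Excel's T.INV; the numbers are those from the CadSoft example.

```python
from scipy import stats

alpha, df = 0.05, 43
t_stat = -1.05                         # test statistic from the CadSoft example

# Lower-tail critical value; equivalent to -T.INV(0.95, 43) in Excel
t_crit = stats.t.ppf(alpha, df)        # ~= -1.68

if t_stat < t_crit:
    print("Reject H0: mean response time < 25 minutes")
else:
    print("Fail to reject H0")         # this branch is taken here
```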
Two-Tailed Test

For a two-tailed test, the critical values are ±t(α/2, n−1), i.e., the upper α/2 critical value and its negative.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.282 15
Two-Tailed Test - Example 4

Data collected in a survey of 34 respondents by a travel agency. Suppose that the travel
agency wanted to target individuals who were approximately 35 years old. Thus, we wish
to test whether the average age of respondents is equal to 35. The sample mean is
38.676, and the sample standard deviation is 7.857

Hypotheses:
H0: mean age = 35
H1: mean age ≠ 35

Critical value: t0.025,33, obtained with the Excel function =T.INV.2T(0.05, 33), is 2.0345; thus the critical values are ±2.0345.

Test statistic:
t = (x̄ − µ0) / (s / √n) = (38.676 − 35) / (7.857 / √34) = 2.73

Result: t = 2.73 falls in the rejection region.

Conclusion: Reject H0; the average age is not 35.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p. 282 16
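The same two-tailed test, sketched in Python from the summary statistics above; scipy.stats.t.ppf plays the role of T.INV.2T.

```python
import math
from scipy import stats

x_bar, mu0, s, n = 38.676, 35, 7.857, 34
alpha = 0.05

t_stat = (x_bar - mu0) / (s / math.sqrt(n))     # ~= 2.73
t_crit = stats.t.ppf(1 - alpha / 2, n - 1)      # ~= 2.0345, like =T.INV.2T(0.05, 33)

if abs(t_stat) > t_crit:
    print("Reject H0: the mean age differs from 35")   # taken here
else:
    print("Fail to reject H0")
```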
p-Values

• The probability of obtaining a test statistic value equal to or more extreme than that obtained from the sample data when the null hypothesis is true; also called the observed significance level.
• An alternative approach to Step 3 of a hypothesis test uses the
p-value rather than the critical value.
• Compare the p-value to the chosen level of significance (α) :
Reject H0, if the p-value < α

Read the textbook about finding p-values when σ is known or unknown.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.284 17
p-Values - Example 5

CadSoft example: the p-value is obtained using the Excel function =T.DIST(-1.05, 43, TRUE) = 0.15.
Result: p = 0.15 is not less than α = 0.05.
Conclusion:
• Fail to reject H0.
• There is about a 15% chance that the test statistic would be −1.05 or smaller if H0 were true.

Vacation survey example: the p-value is obtained using the Excel function =T.DIST.2T(2.73, 33) = 0.010.
Result: p = 0.010 is less than α = 0.05.
Conclusion: Reject H0.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p. 282 18
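Both p-values can be reproduced with SciPy's t-distribution functions; a small sketch using the test statistics and degrees of freedom quoted above.

```python
from scipy import stats

# Lower-tailed p-value for the CadSoft test (t = -1.05, df = 43),
# equivalent to =T.DIST(-1.05, 43, TRUE)
p_one_tail = stats.t.cdf(-1.05, 43)        # ~= 0.15

# Two-tailed p-value for the age test (t = 2.73, df = 33),
# equivalent to =T.DIST.2T(2.73, 33)
p_two_tail = 2 * stats.t.sf(2.73, 33)      # ~= 0.010

print(round(p_one_tail, 3), round(p_two_tail, 3))
```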
One Sample Hypothesis Tests -
Excel Template

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p. 287 19
Two-Sample Hypothesis Tests

Hypotheses:
H0: µ1 − µ2 {≥, ≤, or =} 0
H1: µ1 − µ2 {<, >, or ≠} 0    (7.4)

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.288 20
Two-Sample Hypothesis Tests –
Example 6 (Independent Samples)

Determine if the mean lead time for Alum Sheeting


(µ1) is greater than the mean lead time for Durrable
Products (µ2).

t-Test: Two-Sample Assuming


Unequal Variances
• Variable 1 Range:
Alum Sheeting data
• Variable 2 Range:
Durrable Products data

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.290 - 291 21
Two-Sample Hypothesis Tests –
Example 6 (Independent Samples)
Hypothesis :
H0: µ1 − µ2 ≤ 0
H1: µ1 − µ2 > 0

One-Tailed (upper-tail test)

Result :
tstat = 3.827 > tcritical = 1.812, OR
p = 0.00166 < α = 0.05.

Conclusion :
Reject H0.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.291 22
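For reference, an unequal-variance (Welch) test like this one can be run with scipy.stats.ttest_ind. The lead-time lists below are placeholders, not the actual Purchase Orders data, and the alternative argument assumes SciPy 1.6 or later.

```python
from scipy import stats

# Placeholder lead times (days); substitute the Purchase Orders data
alum_sheeting     = [32, 45, 38, 48, 40, 42, 36, 50]
durrable_products = [29, 31, 30, 32, 28, 33]

# Welch's t-test (unequal variances), upper-tailed:
# H0: mu1 - mu2 <= 0  vs  H1: mu1 - mu2 > 0
t_stat, p_value = stats.ttest_ind(alum_sheeting, durrable_products,
                                  equal_var=False, alternative="greater")
print(t_stat, p_value)   # reject H0 when p_value < 0.05
```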
Two-Sample Hypothesis Tests –
Paired Samples
Hypotheses:

H0: µD {≥, ≤, or =} 0
H1: µD {<, >, or ≠} 0

µD is the mean difference between the paired samples

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.290 - 292 23
Two-Sample Hypothesis Tests –
Example 7 (Paired Samples)

Test for a difference in


the means of the
estimated and actual
pile lengths
(two-tailed test).

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.290 - 293 24
Two-Sample Hypothesis Tests –
Example 7 (Paired Samples)
Hypothesis :
H 0 : µD = 0
H 1 : µD ≠ 0

Two-Tailed

Result :
t is < the lower critical value, OR
p-value ≈ 0

Conclusion :
Reject H0.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education p.290 - 293 25
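A paired test of this kind maps to scipy.stats.ttest_rel. The estimated and actual pile lengths below are made-up placeholders, since the slide does not reproduce the data.

```python
from scipy import stats

# Placeholder estimated vs. actual pile lengths for the same items
estimated = [10.2, 9.8, 11.5, 10.0, 9.7, 10.9]
actual    = [10.6, 9.9, 11.9, 10.4, 9.9, 11.2]

# Two-tailed paired test: H0: mu_D = 0  vs  H1: mu_D != 0
t_stat, p_value = stats.ttest_rel(estimated, actual)
print(t_stat, p_value)   # reject H0 when p_value < 0.05
```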
F-test

• Test for equality of variances between two samples.


H0: σ1² − σ2² = 0
H1: σ1² − σ2² ≠ 0    (7.5)
• F-test statistic:
F = s1² / s2²    (7.6)

• Assume that both samples are drawn from normal populations.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.292 - 293 26
Conducting F-test

• Although the hypothesis test is really a two-tailed test, we will


simplify it as an upper-tailed, one-tailed test to make it easy to
use tables of the F-distribution and interpret the results of the
Excel tool.
• Find the critical value of the F-distribution: F(α/2, df1, df2).
• Reject H0 if the F-test statistic > the critical value.
• Note that we use α/2 to find the critical value, not α.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.292 - 293 27
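SciPy has no ready-made two-sample variance F-test, but the procedure above is easy to sketch by hand; the two samples below are placeholders.

```python
import numpy as np
from scipy import stats

# Placeholder samples; put the larger-variance sample first, as the text suggests
sample1 = np.array([32.0, 45, 38, 48, 40, 42, 36, 50])
sample2 = np.array([29.0, 31, 30, 32, 28, 33])

F = sample1.var(ddof=1) / sample2.var(ddof=1)      # F = s1^2 / s2^2
df1, df2 = len(sample1) - 1, len(sample2) - 1

alpha = 0.05
F_crit = stats.f.ppf(1 - alpha / 2, df1, df2)      # critical value at alpha/2
p_upper = stats.f.sf(F, df1, df2)                  # upper-tail p-value

print(F, F_crit, p_upper)   # reject H0 when F > F_crit (i.e., p_upper < alpha/2)
```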
F-test –
Example 8
• Determine whether the variance of lead times is the same for
Alum Sheeting and Durrable Products in the Purchase Orders
data.
• The variance of the lead times for Alum Sheeting is larger than
the variance for Durable Products, so this is assigned to Variable
1.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.292 - 293 28
F-test –
Example 8

Result :
F < Fcrit, OR
p-value > α/2 = 0.025

Conclusion :
Fail to reject H0.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.292 - 293 29
Analysis of Variance (ANOVA)

• Used to compare the means of two or more population groups.


H0: µ1 = µ2 = ⋯ = µm
H1: at least one mean is different from the others
• ANOVA measures variation between groups relative to variation
within groups.
• Each of the population groups is assumed to come from a
normally distributed population.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.294 - 296 30
Analysis of Variance (ANOVA) –
Assumptions

• The m groups or factor levels being studied represent


populations whose outcome measures
1. are randomly and independently obtained,
2. are normally distributed, and
3. have equal variances.
• If these assumptions are violated, then the level of significance
and the power of the test can be affected.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.294 - 296 31
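A one-way ANOVA of this kind corresponds to scipy.stats.f_oneway; the satisfaction scores below are illustrative placeholders, not the survey data used in Example 9.

```python
from scipy import stats

# Placeholder satisfaction scores for three education levels
some_college     = [3, 4, 4, 5, 3, 4]
college_graduate = [4, 5, 5, 4, 5, 5]
graduate_degree  = [5, 4, 5, 5, 4, 5]

# H0: all group means are equal; H1: at least one mean differs
f_stat, p_value = stats.f_oneway(some_college, college_graduate, graduate_degree)
print(f_stat, p_value)   # reject H0 when p_value < 0.05
```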
Analysis of Variance (ANOVA) –
Example 9

• Determine whether any


significant differences exist in
satisfaction among individuals
with different levels of
education.

• In this example, the factor is educational level, and we have m = 3 categorical levels of this factor: college graduate, graduate degree, and some college.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.294 - 295 32
Analysis of Variance (ANOVA) –
Example 9

Result :
F > Fcrit, OR
p-value < α = 0.05

Conclusion :
Reject H0.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education p.294 - 295 33
Trendlines and Regression Analysis
Using Charts in Modeling
Relationships and Trends in Data

To better understand relationships and trends in a data set:
• For cross-sectional data, use a scatter chart.
• For time-series data, use a line chart.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.313 35
Mathematical Functions Used in
Predictive Analytical Models

The base of natural logarithms, e = 2.71828, is often used for the constant b.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.313 36
Excel Trendline Tool

• Right-click on the data series and choose Add Trendline from the pop-up menu.
• Check the boxes Display Equation on chart and Display R-squared value on chart.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.313 37
Mathematical Functions –
Example
Market research study has collected data on sales volumes for different levels of pricing of
a particular product.

The model is :
Sales = 20,512 – (9.5116 x Price)
If the price is $125, we can
estimate the level of sales as :
Sales = 20,512 – (9.5116 x 125)
Sales = 19,323

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.314 - 315 38
R2 (R-squared)

• Measure of the “fit” of the line to the data.


• The value of R2 will be between 0 and 1.
• The larger the value of R2, the better the fit.
• A value of 1.0 indicates a perfect fit.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.314 39
Regression Analysis

A tool for building mathematical and statistical models that characterize relationships between a dependent (ratio) variable and one or more independent, or explanatory, variables (ratio or categorical), all of which are numerical.

• Simple linear regression: a single independent variable.
• Multiple regression: two or more independent variables.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.314 40
Data Classification by the Type of
Measurement Scale

Categorical (Nominal)
• Sorted into categories according to specified characteristics.
• The categories bear no quantitative relationship to one another, but we usually assign an arbitrary number to each category (e.g., 0 or 1) to ease the process of managing the data and computing statistics.
• Usually counted or expressed as proportions or percentages.
• Example: employees might be classified as managers, supervisors, and associates.

Ordinal
• Can be ordered or ranked according to some relationship to one another.
• More meaningful than categorical data because the data can be compared.
• However, ordinal data have no fixed units of measurement, so we cannot make meaningful numerical statements about differences between categories.
• Example: college basketball rankings are ordinal; a higher ranking signifies a stronger team but does not specify any numerical measure of strength.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.146 - 147 41
Data Classification by the Type of
Measurement Scale

Interval
• Ordinal, but with constant differences between observations and an arbitrary zero point.
• Examples: time and temperature.
• Celsius temperatures represent a specified measure of distance (degrees) but have an arbitrary zero point, so we cannot compute meaningful ratios; for example, we cannot say that 50 degrees is twice as hot as 25 degrees. However, we can compare differences.

Ratio
• Continuous, with a natural zero point.
• Example: the Seattle region sold $12 million in March whereas the Tampa region sold $6 million, which means that Seattle sold twice as much as Tampa.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.146 - 147 42
Simple Linear Regression

• Finds a linear relationship between one independent variable (X) and


one dependent variable (Y).
• Prepare a scatter chart to verify the data has a linear trend.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.317 - 318 43
Simple Linear Regression –
Example

Size of a house is typically


related to its market value.
X = square footage
Y = market value ($)

The scatter chart of the full


data set (42 homes)
indicates a linear trend.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.317 - 318 44
Simple Linear Regression –
Example
Equation :
Market Value = a + b x Square Feet
Two possible lines are shown below :

Line A is clearly a better fit to the data.


Which is the best regression line?

The best-fitting equation (using trendline tool) :


Market Value = $32,673 + $35.036 x Square Feet
The estimated market value of a home with 2,200 square feet would be
Market Value = $32,673 + $35.036 x 2,200 = $109,752

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.317 - 318 45
Least-Squares Regression - 1

• The mathematical basis for the best-fitting regression line.


• Simple linear regression model:
Y = β0 + β1X + ε    (8.1)
• We estimate the parameters from the sample data:
Ŷ = b0 + b1X    (8.2)
• The estimated value of Y for Xi:
Ŷi = b0 + b1Xi
Note: Xi is the value of the independent variable for the ith observation.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.320 - 321 46
Residuals

The observed errors associated with estimating the value of the dependent
variable using the regression line.

ei = Yi − Ŷi    (8.3)

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.320 - 321 47
Least-Squares Regression - 2

• The best-fitting line minimizes the sum of squares of the residuals.

• Excel functions:

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.320 - 321 48
Simple Linear Regression with Excel
– Alt 1

Slope = b1 = 35.036
=SLOPE(C4:C45, B4:B45)

Intercept = b0 = 32,673
=INTERCEPT(C4:C45, B4:B45)

Estimate Y when X = 1,750 square feet:
Ŷ = 32,673 + 35.036(1,750) = $93,986
=TREND(C4:C45, B4:B45, 1750)

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.322 - 324 49
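For readers working outside Excel, scipy.stats.linregress provides the same slope, intercept, and prediction as SLOPE, INTERCEPT, and TREND. The (square feet, market value) pairs below are placeholders standing in for the 42-home data set.

```python
from scipy import stats

# Placeholder (square feet, market value) pairs; substitute the
# Home Market Value data used in the slides
square_feet  = [1450, 1590, 1700, 1750, 1900, 2100, 2200, 2400]
market_value = [84000, 88000, 92000, 94000, 99000, 106000, 110000, 117000]

# Python analogue of Excel's SLOPE, INTERCEPT, and TREND
result = stats.linregress(square_feet, market_value)
b1, b0 = result.slope, result.intercept

y_hat_1750 = b0 + b1 * 1750          # like =TREND(..., 1750)
print(b0, b1, y_hat_1750)
```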
Simple Linear Regression with Excel
– Alt 2

• Click Data → Data Analysis → Regression.
• Input Y Range (with header).
• Input X Range (with header).
• Check the Labels box.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.322 - 324 50
Simple Linear Regression with Excel
– Alt 2

Ŷ = b0 + b1X → Ŷ = 32,673 + 35.036X

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.322 - 324 51
Regression Statistics

Multiple R (|r|)
• r is the sample correlation coefficient.
• The value of r varies from −1 to +1.
• r is negative if the slope is negative.

R Square (R2)
• See previous slides.

Adjusted R Square
• Modifies the value of R2 by incorporating the sample size and the number of explanatory variables
in the model.
• Useful when comparing this model with other models that include additional explanatory
variables.
Standard Error
• The variability of the observed Y-values from the predicted values (Ŷ).
• If the data are clustered close to the regression line, then the standard error will be small.
• The more scattered the data, the larger the standard error.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.322 - 324 52
Regression as Analysis of Variance

• Conducts an F-test to determine whether variation in Y is due to


varying levels of X.
• Test for significance of regression:
H0: population slope coefficient = 0
H1: population slope coefficient ≠ 0

• Reports the p-value (see Significance F).


• Rejecting H0 : X explains variation in Y.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.324 - 326 53
Regression as Analysis of Variance –
Example

H0: β1 = 0 (home size is not a significant variable)
H1: β1 ≠ 0 (home size is a significant variable)

p-value (Significance F) = 3.798 × 10⁻⁸

Result: p-value < α = 0.05
Conclusion: Reject H0 → home size is a significant variable in explaining variation in market value.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.324 - 326 54
Regression as Analysis of Variance –
Using t-Test - Example

t = (b1 − 0) / (standard error)    (8.8)

Result: p-value < α = 0.05
Conclusion: Reject H0 → home size is a significant variable in explaining variation in market value.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.324 - 326 55
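A sketch of the same significance tests with statsmodels: the fitted model's summary reports R², adjusted R², the standard error, the ANOVA F (Significance F), and a t-test with a p-value for each coefficient. The arrays are placeholders for the Home Market Value data.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder home-size (X) and market-value (Y) arrays
X = np.array([1450, 1590, 1700, 1750, 1900, 2100, 2200, 2400], dtype=float)
Y = np.array([84000, 88000, 92000, 94000, 99000, 106000, 110000, 117000], dtype=float)

model = sm.OLS(Y, sm.add_constant(X)).fit()

# summary() reports R-squared, adjusted R-squared, the standard error,
# the ANOVA F-statistic, and a t-test / p-value per coefficient
print(model.summary())
print(model.pvalues)    # reject H0 (slope = 0) when the slope's p-value < 0.05
```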
Checking Assumptions

Linearity
• Examine the scatter diagram → it should appear linear.
• Examine the residual plot → it should appear random.

Normality of Errors
• View a histogram of the standard residuals.
• Regression is robust to departures from normality.

Homoscedasticity
• Variation about the regression line is constant.
• Examine the residual plot.

Independence of Errors
• Successive observations should not be related.
• Correlation among successive observations over time is called autocorrelation.
• Important when the independent variable is time.
Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.327 - 329 56
Checking Assumptions –
Example Home Market Value

Linearity

Linear trend in the scatter chart; no pattern in the residual plot.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.327 - 329 57
Checking Assumptions –
Example Home Market Value

Normality of Errors

residual histogram appears slightly skewed but is not a serious departure

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.327 - 329 58
Checking Assumptions –
Example Home Market Value

Homoscedasticity

residual plot shows no serious difference in the spread of the data for different
X values

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.327 - 329 59
Checking Assumptions –
Example Home Market Value

Independence of
Errors
Because the data is cross-sectional, we can assume this assumption holds.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.327 - 329 60
Multiple Linear Regression

A linear regression model with more than one independent variable (X):
Y = β0 + β1X1 + β2X2 + ⋯ + βkXk + ε    (8.10)
where:
Y is the dependent variable,
X1, …, Xk are the independent (explanatory) variables,
β0 is the intercept term,
β1, …, βk are the regression coefficients for the independent variables,
ε is the error term.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.329 61
ANOVA for Multiple Regression

• ANOVA tests for significance of the ENTIRE model. That is, it computes an F-statistic testing the hypotheses:
H0: β1 = β2 = ⋯ = βk = 0
H1: at least one βj is not 0
• The output also provides information to test hypotheses about each of the individual regression coefficients.
• If we reject H0 that the slope associated with independent variable i is 0, then independent variable i (Xi) is significant and improves the ability of the model to predict the dependent variable.
• If we fail to reject H0, the independent variable is not significant and probably should not be included in the model.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.330 - 331 62
Multiple Linear Regression –
Example

Predict student graduation rates using several indicators:

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.331-333 63
Multiple Linear Regression –
Example
Result and Regression Model

• The value of R2 indicates that 53% of the variation in the dependent variable is
explained by these independent variables.
• All coefficients of independent variables are statistically significant (p-value < 0.05).
Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.331-333 64
Systematic Model Building Approach

• Construct a model with all available independent variables. Check the significance of the independent variables by examining their p-values.
• Identify the independent variable having the largest p-value that exceeds the chosen level of significance (i.e., an insignificant variable).
• Remove the variable identified in step 2 from the model and evaluate the adjusted R².
• Don't remove all variables with p-values that exceed α at the same time; remove only one at a time.
• Work step by step, and continue until all remaining variables are significant (see the sketch below).

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.334 - 335 65
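A minimal sketch of this backward-elimination procedure, assuming the data sit in a pandas DataFrame with one column per candidate variable; the helper function and its names are illustrative, not part of the textbook.

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df: pd.DataFrame, target: str, alpha: float = 0.05):
    """Drop the least significant predictor one at a time until every
    remaining p-value is below alpha (a sketch of the steps above)."""
    predictors = [c for c in df.columns if c != target]
    while predictors:
        X = sm.add_constant(df[predictors])
        model = sm.OLS(df[target], X).fit()
        p_values = model.pvalues.drop("const")   # p-values of the predictors only
        worst = p_values.idxmax()                # largest p-value
        if p_values[worst] <= alpha:             # every predictor is significant
            return model, predictors
        predictors.remove(worst)                 # remove only one variable, then refit
        # model.rsquared_adj can be compared before and after each removal
    return None, []
```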
Systematic Model Building Approach
- Example
Banking Data

Home value has the largest p-value; drop it and re-run the regression.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.334 - 335 66
Systematic Model Building Approach
- Example
Banking Data

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.334 - 335 67
Multicollinearity

• Occurs when there are strong correlations among the


independent variables, and they can predict each other
better than the dependent variable.
• Correlations exceeding ± 0.7 (see correlation matrix) may
indicate multicollinearity.
• The variance inflation factor is a better indicator, but not
computed in Excel.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.336 68
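A small sketch of both checks, assuming the independent variables are columns of a pandas DataFrame: the correlation matrix flags pairs beyond ±0.7, and statsmodels' variance_inflation_factor supplies the VIF that Excel does not compute.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def multicollinearity_report(X: pd.DataFrame) -> pd.Series:
    """Print the correlation matrix and return a VIF per independent variable."""
    print(X.corr())                        # look for |r| > 0.7 between predictors
    Xc = sm.add_constant(X)
    vif = pd.Series(
        [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
        index=X.columns, name="VIF")       # index 0 is the constant, so skip it
    return vif                             # large VIFs (rules of thumb: > 5 or 10) signal multicollinearity
```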
Multicollinearity ?

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.336 69
Regression with Categorical
Independent Variables

• Regression analysis requires numerical data.
• Categorical data can be included as independent variables, but must be coded numerically using dummy variables.
• For variables with 2 categories, code them as 0 and 1.

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.336 70
Regression with Categorical
Independent Variables - Example

• Employee Salaries provides data for 35 employees.
• Predict Salary using Age and MBA (coded as yes = 1, no = 0).

Y = β0 + β1X1 + β2X2 + ε
where
Y = salary
X1 = age
X2 = MBA indicator (0 or 1)

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.338 - 340 71
Regression with Categorical
Independent Variables - Example

Salary = 893.59 + 1,044.15 × Age + 14,767.23 × MBA

– If MBA = 0: Salary = 893.59 + 1,044.15 × Age
– If MBA = 1: Salary = 15,660.82 + 1,044.15 × Age

Source: Evans, James R. 2020. Business Analytics, 3rd Edition. Pearson Education. p.338 - 340 72
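A sketch of the same dummy-variable regression in Python; the rows below are made-up placeholders rather than the Employee Salaries data.

```python
import pandas as pd
import statsmodels.api as sm

# Placeholder rows in the style of the Employee Salaries example
df = pd.DataFrame({
    "Age":    [26, 30, 34, 41, 45, 52, 29, 38],
    "MBA":    ["no", "yes", "no", "yes", "no", "yes", "no", "yes"],
    "Salary": [28000, 49000, 37000, 62000, 48000, 75000, 31000, 56000],
})

# Code the two-category variable as 0/1 (yes = 1, no = 0)
df["MBA"] = (df["MBA"] == "yes").astype(int)

X = sm.add_constant(df[["Age", "MBA"]])
model = sm.OLS(df["Salary"], X).fit()
print(model.params)   # intercept, Age coefficient, and the MBA shift
```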
THANK YOU
