Module 4 Slides
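The value of $\sigma_p$ is computed as

$$\sigma_p = \sqrt{\frac{p_0(1-p_0)}{n}},$$

where $n$ is the sample size.

$$\sigma_p = \sqrt{\frac{0.1 \times 0.9}{600}} = 0.01224745$$

The sample proportion is $\hat{p} = \frac{45}{600} = 0.075$.

The test statistic is

$$Z = \frac{\hat{p} - p_0}{\sigma_p} = \frac{0.075 - 0.1}{0.01224745} = -2.04124$$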
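$$\sigma_p = \sqrt{\frac{p_0(1-p_0)}{n}} = \sqrt{\frac{0.23 \times 0.77}{1200}} = 0.012148388$$

The sample proportion is $\hat{p} = \frac{311}{1200} = 0.2591667$.

The test statistic is

$$Z = \frac{\hat{p} - p_0}{\sigma_p} = \frac{0.2591667 - 0.23}{0.012148388} = 2.400867$$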
At the 1% level of significance, the data provide sufficient evidence to conclude that
more than 23% of all adults now recognize the company’s logo.
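The two one-sample calculations above can be reproduced with a short script. This is a minimal sketch using only the standard library; the function name `one_proportion_z` is an illustrative choice, not something from the slides.

```python
from math import sqrt


def one_proportion_z(successes, n, p0):
    """Z statistic for H0: p = p0, with the standard error computed under H0."""
    p_hat = successes / n                 # sample proportion
    sigma_p = sqrt(p0 * (1 - p0) / n)     # standard error of p_hat when H0 is true
    return (p_hat - p0) / sigma_p


# First example: 45 out of 600, hypothesised proportion 0.10
z_first = one_proportion_z(45, 600, 0.10)
# Logo example: 311 out of 1200, hypothesised proportion 0.23
z_logo = one_proportion_z(311, 1200, 0.23)

print(round(z_first, 5), round(z_logo, 5))
```

The sign of Z shows on which side of $p_0$ the sample proportion falls, which matters for a one-sided alternative.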
Two samples or two variables
If 54 out of a random sample of 150 boys smoke, while 31 out of a random sample
of 100 girls smoke, can we conclude at the 0.05 level of significance that the
proportion of male smokers is higher than that of female smokers?
Suppose a researcher wishes to test the hypothesis that wholesalers in the northern
and southern India differ in the proportion of sales they make to discount retailers.
Such verifications can be carried out using tests for comparison of two
proportions.
Z-Test for Comparing Two Proportions
Testing whether the population proportion for group 1 (p1) equals the population
proportion for group 2 (p2) is conceptually the same as the t-test of two means.
This section illustrates a Z-test for differences of proportions, which requires a
sample size greater than 30.
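When the null hypothesis is H0: p1 - p2 = 0, the test statistic is

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{S_{\hat{p}_1 - \hat{p}_2}},$$

where $S_{\hat{p}_1 - \hat{p}_2}$ is the standard error of the difference in sample proportions.
The test procedure (that is, when to reject H0) is similar to that discussed earlier.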
In 1980, of 750 men 20-34 years old, 130 were found to be overweight. Whereas,
in 1990, of 700 men, 20-34 years old, 160 were found to be overweight. At the
5% significance level, do the data provide sufficient evidence to conclude that for
men 20-34 years old, a higher percentage were overweight in 1990 than 10 years
earlier?
Let p1 be the proportion of men who were overweight in 1990, and p2 be the
proportion of men who were overweight in 1980.

The estimate of p1 is $\hat{p}_1 = \frac{160}{700} = 0.22857$,
and the estimate of p2 is $\hat{p}_2 = \frac{130}{750} = 0.1733$.
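The pooled estimate of the proportion is

$$\hat{p} = \frac{160 + 130}{750 + 700} = 0.2$$

The standard error of the difference in proportions is

$$\sqrt{0.2 \times 0.8 \times \left(\frac{1}{700} + \frac{1}{750}\right)} = \sqrt{0.000441905} = 0.021021531$$

The test statistic is

$$Z = \frac{0.22857 - 0.173333}{0.02102153} = 2.62769$$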
The data provide sufficient evidence to conclude that for men 20-34 years
old, a higher percentage were overweight in 1990 than 10 years earlier at
5% level of significance.
Alternatively, the solution can be given with the 1980 proportion taken first:

The pooled estimate of the proportion is

$$\hat{p} = \frac{130 + 160}{750 + 700} = 0.2$$

The standard error of the difference in proportions is

$$\sqrt{0.2 \times 0.8 \times \left(\frac{1}{700} + \frac{1}{750}\right)} = \sqrt{0.000441905} = 0.021021531$$

The test statistic is

$$Z = \frac{0.173333 - 0.22857}{0.02102153} = -2.62769$$
The data provide sufficient evidence to conclude that for men 20-34 years
old, a higher percentage were overweight in 1990 than 10 years earlier at
5% level of significance.
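The pooled two-proportion calculation for the overweight example can be checked as follows. This is a sketch under the same pooled-variance formula used above; the function name `two_proportion_z` is hypothetical.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for H0: p1 - p2 = 0 using the pooled proportion."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                          # pooled estimate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))    # standard error of p1_hat - p2_hat
    return (p1_hat - p2_hat) / se


# 1990 sample first: 160 of 700; 1980 sample second: 130 of 750
z_1990_first = two_proportion_z(160, 700, 130, 750)
# Same data with the groups swapped, as in the alternative solution
z_1980_first = two_proportion_z(130, 750, 160, 700)

print(round(z_1990_first, 5), round(z_1980_first, 5))
```

Swapping the groups only flips the sign of Z, so the rejection region must be placed on the matching tail.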
The company states that the drug is more effective for women than for men.
To test this claim, they choose a simple random sample of 50 women and
100 men from a population of 100,000 volunteers. At the end of the study,
38% of the women caught a cold; and 51% of the men caught a cold. Based
on these findings, can we conclude that the drug is more effective for
women than for men? Use a 0.01 level of significance.
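Let p1 be the population proportion of men who catch a cold even after taking the
drug, and p2 be the corresponding proportion of women.

The pooled estimate of the proportion is $\hat{p} = \frac{51 + 19}{100 + 50} = 0.4667$, so

$$s = \sqrt{0.4667 \times (1 - 0.4667) \times \left(\frac{1}{100} + \frac{1}{50}\right)} = \sqrt{0.0074667} = 0.08641$$

$$Z = \frac{0.51 - 0.38}{0.08641} = 1.50446$$

Since the calculated value 1.50446 is smaller than the table value 2.33, we do not
reject H0. Hence the data do not support the claim that the drug is more
effective for women than for men.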
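The decision step for the drug example can be sketched by computing the one-sided critical value directly instead of reading it from a table. The snippet assumes the standard library's `statistics.NormalDist` (Python 3.8 or later); variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n_men, n_women = 100, 50
p_men, p_women = 0.51, 0.38          # sample proportions who caught a cold

# Pooled proportion: (51 + 19) / 150
pooled = (p_men * n_men + p_women * n_women) / (n_men + n_women)
se = sqrt(pooled * (1 - pooled) * (1 / n_men + 1 / n_women))
z = (p_men - p_women) / se

alpha = 0.01
critical = NormalDist().inv_cdf(1 - alpha)   # upper-tail critical value, about 2.33
reject_h0 = z > critical

print(round(pooled, 4), round(se, 5), round(z, 5), round(critical, 3), reject_h0)
```

Here Z falls below the critical value, so H0 is not rejected at the 0.01 level.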
Simple Linear Regression
Regression analysis helps businesses understand their data points and, specifically,
the relationships between them, in order to make better decisions: anything from
predicting sales to understanding inventory levels and supply and demand. Of all
the business analysis techniques, regression analysis is often referred to as one
of the most significant.
Regression analysis is widely used for prediction and forecasting, where its use
has substantial overlap with the field of machine learning. Regression analysis is
also used to understand which among the independent variables are related to
the dependent variable, and to explore the forms of these relationships.
Production forecast
The other use of regression analysis is for the prediction of the values of
dependent variable based on the values of independent variable(s).
$$\hat{\beta} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)} = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sum X^2 - (\sum X)^2/n}$$

The cap (^) is used to represent estimates.
Carry out a regression analysis and hence estimate (forecast) the sales when
advertising expenditure is 5.5 lakhs a month.
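Regression equation is $Y = \alpha + \beta X + u$.

S No   X      Y      XY      X²
1      2.8    70     196     7.84
2      3.2    83     265.6   10.24
3      3.5    90     315     12.25
4      4      100    400     16

With $n = 6$, $\sum X = 22.6$, $\sum Y = 556$, $\sum XY = 2146.2$ and $\sum X^2 = 87.78$:

$$\hat{\beta} = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sum X^2 - (\sum X)^2/n} = \frac{2146.2 - 22.6 \times 556/6}{87.78 - 22.6^2/6} = 19.57286$$

$$\hat{Y} = 18.94223 + 19.57286\,X$$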
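The fitted line can be recovered from the column totals alone, which is convenient here because the calculation quotes the sums for all six observations. The sketch below uses a hypothetical helper `fit_from_sums` and then evaluates the line at an expenditure of 5.5, the forecast the exercise asks for.

```python
def fit_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least-squares intercept and slope from the column totals."""
    beta = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    alpha = sum_y / n - beta * sum_x / n    # alpha = mean(Y) - beta * mean(X)
    return alpha, beta


# Totals quoted in the worked calculation (n = 6)
alpha_hat, beta_hat = fit_from_sums(6, 22.6, 556, 2146.2, 87.78)

# Forecast of sales at an advertising expenditure of 5.5
forecast = alpha_hat + beta_hat * 5.5

print(round(alpha_hat, 5), round(beta_hat, 5), round(forecast, 3))
```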
[Figures: two plots of Sales against Expenditure]
A study was taken to estimate linear demand function. The data on the quantity
demanded and the price of a commodity was collected for 8 periods. The data is
given below
S. No.                1    2    3    4    5    6    7    8
Demand (Y) (in Kg)    16   20   18   21   13   15   17   22
Price (X) (per Kg)    10   8    12   6    13   9    11   7
Estimate the linear demand function Y = a + bX + u. Also interpret the estimated
regression.
S No X Y XY X²
1 10 16 160 100
2 8 20 160 64
3 12 18 216 144
4 6 21 126 36
5 13 13 169 169
6 9 15 135 81
7 11 17 187 121
8 7 22 154 49
Total 76 142 1307 764
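$$\hat{b} = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sum X^2 - (\sum X)^2/n} = \frac{1307 - 76 \times 142/8}{764 - (76)^2/8} = \frac{-42}{42} = -1$$

$$\hat{a} = \bar{Y} - \hat{b}\,\bar{X} = 17.75 - (-1) \times 9.5 = 27.25$$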
Hence the demand (Y) is expressed in terms of price (X) by the fitted line
$\hat{Y} = 27.25 - X$. This indicates that as price increases, demand
decreases: the slope is -1, so each unit increase in price lowers the estimated
demand by 1 Kg.
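The demand function can also be estimated directly from the eight observations. This is a minimal sketch of the sum-based formula used above; `fit_line` is an illustrative name.

```python
prices = [10, 8, 12, 6, 13, 9, 11, 7]        # X
demand = [16, 20, 18, 21, 13, 15, 17, 22]    # Y, in Kg


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    slope = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


a_hat, b_hat = fit_line(prices, demand)
print(a_hat, b_hat)
```

The negative slope is what gives the downward-sloping demand interpretation.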
The local ice cream shop keeps track of how much ice cream they sell versus the
noon temperature on that day. Here are their figures for the last 12 days.
Obtain a sales trend as a linear function of noon temperature. What would be the
sales when noon temperature is 35°C?

Ice Cream Sales vs Temperature

Temperature (°C)   Ice Cream Sales (₹ '000s)
24.2               17.5
26.4               19.4
21.9               12.8
25.2               15.2
28.5               27.4
32.1               28.3
29.4               18.6
35.1               34.4
33.4               24.7
28.1               18.2
32.6               27.2
27.2               17.9
        X       Y       XY        X²
        24.2    17.5    423.5     585.64
        26.4    19.4    512.16    696.96
        21.9    12.8    280.32    479.61
        25.2    15.2    383.04    635.04
        28.5    27.4    780.9     812.25
        32.1    28.3    908.43    1030.41
        29.4    18.6    546.84    864.36
        35.1    34.4    1207.44   1232.01
        33.4    24.7    824.98    1115.56
        28.1    18.2    511.42    789.61
        32.6    27.2    886.72    1062.76
        27.2    17.9    486.88    739.84
Total   344.1   261.6   7752.63   10044.05

$$b = \frac{7752.63 - 344.1 \times 261.6/12}{10044.05 - 344.1 \times 344.1/12} = 1.41963$$

$$a = 261.6/12 - 1.419632 \times 344.1/12 = -18.9079$$

Sales when noon temperature is 35°C is
$-18.9079 + 1.41963 \times 35 = 30.77917$ thousand rupees.
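The ice cream fit can be checked with the equivalent deviation form of the slope, $\sum (X-\bar{X})(Y-\bar{Y}) / \sum (X-\bar{X})^2$, which is the Cov(X, Y)/Var(X) ratio written without the common factor. The sketch below uses illustrative variable names.

```python
temps = [24.2, 26.4, 21.9, 25.2, 28.5, 32.1, 29.4, 35.1, 33.4, 28.1, 32.6, 27.2]  # °C
sales = [17.5, 19.4, 12.8, 15.2, 27.4, 28.3, 18.6, 34.4, 24.7, 18.2, 27.2, 17.9]  # ₹ '000s

n = len(temps)
mean_x = sum(temps) / n
mean_y = sum(sales) / n

# Sums of cross-products and squares about the means
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temps, sales))
s_xx = sum((x - mean_x) ** 2 for x in temps)

slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Predicted sales (in thousand rupees) at a noon temperature of 35 °C
sales_at_35 = intercept + slope * 35

print(round(slope, 5), round(intercept, 4), round(sales_at_35, 3))
```

The deviation form gives the same estimates as the sum-based formula and is less prone to rounding trouble when the raw sums are large.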
We use a measure R², called the coefficient of determination, which is the
proportion of variation attributable to the approximate linear relationship. It is
the proportion of the total variation in Y that is accounted for by the predictor
variable X. The value of R² is a measure of the extent to which X and Y are
linearly related.
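$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = 1 - \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2}$$

$$R^2 = \frac{[\mathrm{Cov}(X, Y)]^2}{\mathrm{Var}(X)\,\mathrm{Var}(Y)} = \frac{\left(\sum XY - \sum X \sum Y / n\right)^2}{\left(\sum X^2 - (\sum X)^2/n\right)\left(\sum Y^2 - (\sum Y)^2/n\right)}$$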
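As an illustration of the two expressions for R², the sketch below evaluates both on the demand and price data from the earlier exercise and confirms that they agree. Applying R² to that data set is an illustrative choice; the slides do not report its value.

```python
prices = [10, 8, 12, 6, 13, 9, 11, 7]        # X
demand = [16, 20, 18, 21, 13, 15, 17, 22]    # Y

n = len(prices)
mean_x = sum(prices) / n
mean_y = sum(demand) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(prices, demand))
s_xx = sum((x - mean_x) ** 2 for x in prices)
s_yy = sum((y - mean_y) ** 2 for y in demand)   # total variation in Y

slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * x for x in prices]

# Form 1: one minus unexplained variation over total variation
sse = sum((y - f) ** 2 for y, f in zip(demand, fitted))
r2_from_residuals = 1 - sse / s_yy

# Form 2: squared covariance over the product of variances
r2_from_cov = s_xy ** 2 / (s_xx * s_yy)

print(round(r2_from_residuals, 4), round(r2_from_cov, 4))
```

For simple linear regression the two forms are algebraically identical, so any difference in the printed values would point to an arithmetic slip.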
A sample of 10 observations based upon the data for the period 1991 to 2000
corresponding to the regression model Y = a + bX + u, where Y is quantity supplied
(million tons) and X is export price (₹ per ton), gave the following results:
You may wish to predict the likely success/failure rate of a new product or the
likelihood of customer retention/loss.
In these cases the response variable or dependent variable is not a continuous
variable. It may be a nominal scale, like success and failure.
In these cases we use logistic regression. Here the connection between the categorical
dependent variable and the continuous independent variables is measured by
converting the dependent variable into probability scores. Unlike linear regression
models, which are used to predict a continuous outcome variable, logistic regression
models are mostly used to predict a dichotomous categorical outcome.
Logistic Regression models are frequently used in business analysis
applications.
Whether a voter will vote for a specific political party can be determined
through the interpretation of logistic modeling, which is based on demographic
parameters such as gender, age, family income, caste and the state of residence
of the voter.
Logistic regression has varied applications in marketing, healthcare and social
sciences.
These models also show the extent to which changes in the values of the
attributes may increase or decrease the predicted probability of event outcome.
Suppose we need to study bad loans as explained by age. The probability $p$ of a
bad loan is modelled as

$$p = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
The logistic model assumes a linear relationship between the predictors (X)
and ln(odds). This model is also known as logit form of the model.
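The relationship between the probability form and the logit form can be sketched in a few lines. The coefficient values used in the demonstration are hypothetical and chosen only to show that the logit of the modelled probability is linear in X.

```python
from math import exp, log


def logistic(x, b0, b1):
    """Probability implied by the logistic model for predictor value x."""
    z = b0 + b1 * x
    return exp(z) / (1 + exp(z))


def logit(p):
    """Log-odds ln(p / (1 - p)) of a probability p strictly between 0 and 1."""
    return log(p / (1 - p))


# Hypothetical coefficients, for illustration only
b0, b1 = -1.0, 0.5
for x in (0.0, 2.0, 4.0):
    p = logistic(x, b0, b1)
    # The logit of p recovers the linear predictor b0 + b1 * x
    print(x, round(p, 4), round(logit(p), 4))
```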
Logit Function
[Figure: function plot of y against x, with x from -10 to 12 and y from 0 to 1]

$$y = \frac{\exp(b_0 + b_1 x)}{1 + \exp(b_0 + b_1 x)}$$
We use simple linear regression methods for the transformed data and
estimate the regression coefficients. Hence probability of “Yes” (p)
can be determined.
Age 18 19 20 21 22 23 24 25
Purchased 4 6 12 15 8 5 6 4
Not Purchased 12 10 20 15 16 20 18 24
Age 18 19 20 21 22 23 24 25
Purchased 4 6 12 15 8 5 6 4
Not Purchased 12 10 20 15 16 20 18 24
Total 16 16 32 30 24 25 24 28
p 0.2500 0.3750 0.3750 0.5000 0.3333 0.2000 0.2500 0.1429
p/(1-p) 0.3333 0.6 0.6 1.0 0.5 0.25 0.3333 0.1667
ln(p/[1-p]) -1.0986 -0.5108 -0.5108 0.0 -0.6931 -1.3863 -1.0986 -1.7918
X 18 19 20 21 22 23 24 25
Y -1.0986 -0.5108 -0.5108 0.0 -0.6931 -1.3863 -1.0986 -1.7918
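We estimate the regression coefficients $\hat{b}_0$ and $\hat{b}_1$ using the method of
linear regression on the transformed data (X, Y).
Hence we can estimate values of Y as $\hat{Y} = \hat{b}_0 + \hat{b}_1 X$. Based on these
values we get an estimate of p using the formula

$$\hat{p} = \frac{e^{\hat{Y}}}{1 + e^{\hat{Y}}}$$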
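The whole procedure for the purchase data (transform to log-odds, fit a straight line, transform back) can be sketched as below. This follows the linear-regression-on-logits approach described above rather than maximum-likelihood fitting; variable names are illustrative, and the printed coefficients are whatever the data give, since the slides' own values are not recoverable here.

```python
from math import exp, log

ages = [18, 19, 20, 21, 22, 23, 24, 25]
purchased = [4, 6, 12, 15, 8, 5, 6, 4]
not_purchased = [12, 10, 20, 15, 16, 20, 18, 24]

# ln(p / (1 - p)) equals ln(purchased / not purchased) for each age group
logits = [log(yes / no) for yes, no in zip(purchased, not_purchased)]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(logits) / n

# Ordinary least squares of the logits on age
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, logits))
         / sum((x - mean_x) ** 2 for x in ages))
intercept = mean_y - slope * mean_x

# Back-transform the fitted logits to estimated purchase probabilities
fitted_logits = [intercept + slope * x for x in ages]
p_hat = [exp(y) / (1 + exp(y)) for y in fitted_logits]

print(round(intercept, 4), round(slope, 4))
for age, p in zip(ages, p_hat):
    print(age, round(p, 4))
```

Because each age group is weighted equally regardless of its total, this is only an approximation to a logistic regression fitted on the individual responses.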