
Module 4

Tests for Proportion


An insurance company states that it settles 85% of all life insurance claims
within 30 days. A consumer group asks the Government insurance
commission to investigate. In a sample of 250 life insurance claims, 203
were settled within 30 days. Is the insurance company’s claim true?

Such verification can be carried out using tests for proportions.


Single Population Proportion

Suppose H0: p = p0 against H1: p ≠ p0 ( or p < p0 or p > p0 )

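For a large sample, the appropriate test is the Z test, with test statistic

$$Z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}},$$

where $\hat{p}$ is the sample proportion.

The value of $\sigma_{\hat{p}}$ is computed as

$$\sigma_{\hat{p}} = \sqrt{\frac{p_0 (1 - p_0)}{n}},$$

where $n$ is the sample size.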

The value of Z is compared with the corresponding critical value of Z test.


If the hypothesis is H0: p = p0 against H1: p ≠ p0, then we reject the null hypothesis if |Z| > 1.96 at the 5% level and if |Z| > 2.58 at the 1% level.

If the hypothesis is H0: p = p0 against H1: p < p0, then we reject the null hypothesis if Z < -1.645 at the 5% level and if Z < -2.33 at the 1% level.

If the hypothesis is H0: p = p0 against H1: p > p0, then we reject the null hypothesis if Z > 1.645 at the 5% level and if Z > 2.33 at the 1% level.

An auditor claims that 10 percent of customer’s ledger accounts are carrying
mistakes of posting and balancing. A random sample of 600 was taken to test the
accuracy of posting and balancing and 45 mistakes were found. Are these sample
results consistent with the claim of the auditor? Use 5% level of significance.
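H0: p = 0.1 against H1: p ≠ 0.1

$$\sigma_{\hat{p}} = \sqrt{\frac{p_0 (1 - p_0)}{n}} = \sqrt{\frac{0.1 \times 0.9}{600}} = 0.01224745$$

The sample proportion is $\hat{p} = 45/600 = 0.075$.

The test statistic is

$$Z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}} = \frac{0.075 - 0.1}{0.01224745} = -2.04124$$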

The critical value at the 5% level is 1.96.

Since |Z| = 2.04124 > 1.96, H0 is rejected.

That is, the auditor’s claim is rejected at the 5% level of significance.
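
The single-proportion test can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the function name `one_proportion_z_test` and its parameters are assumptions chosen for this example, and the critical value comes from the standard library's `statistics.NormalDist` rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0, alternative="two-sided", alpha=0.05):
    """Large-sample Z test of H0: p = p0 (hypothetical helper for illustration)."""
    p_hat = successes / n                # sample proportion
    se = sqrt(p0 * (1 - p0) / n)         # standard error computed under H0
    z = (p_hat - p0) / se                # test statistic
    std_normal = NormalDist()            # standard normal distribution
    if alternative == "two-sided":
        critical = std_normal.inv_cdf(1 - alpha / 2)
        reject = abs(z) > critical
    elif alternative == "greater":
        critical = std_normal.inv_cdf(1 - alpha)
        reject = z > critical
    elif alternative == "less":
        critical = std_normal.inv_cdf(alpha)
        reject = z < critical
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, critical, reject


# Auditor example: 45 mistakes in 600 accounts, H0: p = 0.1, two-sided, 5% level
z, critical, reject = one_proportion_z_test(45, 600, 0.10)
print(round(z, 5), round(critical, 2), reject)
```

For a one-sided problem, pass `alternative="greater"` or `alternative="less"` and the desired `alpha`, so that the rejection rule matches the direction of H1.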


Prior to a special advertising campaign, 23% of all adults recognized a particular
company’s logo. At the close of the campaign the marketing department
commissioned a survey in which 311 of 1,200 randomly selected adults recognized
the logo. Determine, at the 1% level of significance, whether the data provide
sufficient evidence to conclude that more than 23% of all adults now recognize the
company’s logo.
H0: p = 0.23 against H1: p > 0.23

$$\sigma_{\hat{p}} = \sqrt{\frac{p_0 (1 - p_0)}{n}} = \sqrt{\frac{0.23 \times 0.77}{1200}} = 0.012148388$$

The sample proportion is $\hat{p} = 311/1200 = 0.2591667$.

The test statistic is

$$Z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}} = \frac{0.2591667 - 0.23}{0.012148388} = 2.400867$$

The critical value at 1% level is 2.33.


Since 2.400867 > 2.33, H0 is rejected.

The data provide sufficient evidence to conclude that more than 23% of all adults now
recognize the company’s logo at 1% level of significance.
Two samples or two variables
If 54 out of a random sample of 150 boys smoke, while 31 out of random sample
of 100 girls smoke, can we conclude at the 0.05 level of significance that the
proportion of male smokers is higher than that of female smokers?

Suppose a researcher wishes to test the hypothesis that wholesalers in the northern
and southern India differ in the proportion of sales they make to discount retailers.

Such verifications can be carried out using tests for comparison of two
proportions.
Z-Test for Comparing Two Proportions

Testing whether the population proportion for group 1 (p1) equals the population
proportion for group 2 (p2) is conceptually the same as the t-test of two means.
This section illustrates a Z-test for differences of proportions, which requires a
sample size greater than 30.

Suppose H0: p1 - p2 = 0 against H1: p1 - p2 ≠ 0 (or H1: p1 - p2 > 0, etc)


Comparison of the observed sample proportions p1 and p2 allows the
researcher to ask whether the difference between two large (greater than 30)
random samples occurred due to chance alone. The Z-test statistic can be
computed using the following formula:
$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{S_{\hat{p}_1 - \hat{p}_2}},$$

where $\hat{p}_1$ is the sample proportion of the first group, $\hat{p}_2$ is the sample proportion of the second group, and the denominator is the pooled estimate of the standard error of the difference in these proportions. That is, the pooled estimate of the proportion is

$$\bar{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}$$

To calculate the standard error of the difference in proportions, use the formula

$$S_{\hat{p}_1 - \hat{p}_2} = \sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

When the null hypothesis is H0: p1 - p2 = 0, then

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{S_{\hat{p}_1 - \hat{p}_2}}$$

The test procedure (that is, when to reject H0) is similar to that discussed earlier.

In 1980, of 750 men 20-34 years old, 130 were found to be overweight. Whereas,
in 1990, of 700 men, 20-34 years old, 160 were found to be overweight. At the
5% significance level, do the data provide sufficient evidence to conclude that for
men 20-34 years old, a higher percentage were overweight in 1990 than 10 years
earlier?
Let p1 be the proportion of men who were overweight in 1990, and p2 be the
proportion of men who were overweight in 1980.

H0: p1 – p2 = 0 against H1: p1 – p2 > 0

The estimate of p1 is $\hat{p}_1 = 160/700 = 0.22857$ and the estimate of p2 is $\hat{p}_2 = 130/750 = 0.1733$.

The pooled estimate of the proportion is

$$\bar{p} = \frac{160 + 130}{750 + 700} = 0.2$$

The standard error of the difference in proportions is

$$\sqrt{0.2 \times 0.8 \times \left(\frac{1}{700} + \frac{1}{750}\right)} = \sqrt{0.000441905} = 0.021021531$$

The test statistic is

$$Z = \frac{0.22857 - 0.173333}{0.02102153} = 2.62769$$

The critical value at 5% level is 1.645.


Since 2.62769 > 1.645, H0 is rejected.

The data provide sufficient evidence to conclude that for men 20-34 years
old, a higher percentage were overweight in 1990 than 10 years earlier at
5% level of significance.
Alternatively, the solution can be given as follows:

Let p1 be the proportion of men who were overweight in 1980, and p2 be the
proportion of men who were overweight in 1990.
H0: p1 – p2 = 0 against H1: p1 – p2 < 0
The estimate of p1 is $\hat{p}_1 = 130/750 = 0.1733$ and the estimate of p2 is $\hat{p}_2 = 160/700 = 0.22857$.

The pooled estimate of the proportion is

$$\bar{p} = \frac{130 + 160}{750 + 700} = 0.2$$

The standard error of the difference in proportions is

$$\sqrt{0.2 \times 0.8 \times \left(\frac{1}{750} + \frac{1}{700}\right)} = \sqrt{0.000441905} = 0.021021531$$

The test statistic is

$$Z = \frac{0.173333 - 0.22857}{0.02102153} = -2.62769$$

The critical value at the 5% level for this left-tailed test is -1.645.

Since -2.62769 < -1.645, H0 is rejected.

The data provide sufficient evidence to conclude that for men 20-34 years
old, a higher percentage were overweight in 1990 than 10 years earlier at
5% level of significance.
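
The pooled two-proportion calculation can be sketched in Python as below. This is an illustrative sketch only: the function name `two_proportion_z_test` is an assumption made for this example, and the 5% one-sided critical value is taken from `statistics.NormalDist` instead of a table.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled Z statistic for H0: p1 - p2 = 0 (hypothetical helper)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2                      # sample proportions
    pooled = (x1 + x2) / (n1 + n2)                         # pooled proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))   # pooled standard error
    return (p1_hat - p2_hat) / se


# Overweight example: group 1 = 1990 (160 of 700), group 2 = 1980 (130 of 750)
z = two_proportion_z_test(160, 700, 130, 750)
critical = NormalDist().inv_cdf(0.95)   # right-tailed test at the 5% level
print(round(z, 4), z > critical)
```

Swapping the two groups only flips the sign of Z, which is why both labellings in the worked solutions lead to the same decision.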
The company states that the drug is more effective for women than for men.
To test this claim, they choose a simple random sample of 50 women and
100 men from a population of 100,000 volunteers. At the end of the study,
38% of the women caught a cold; and 51% of the men caught a cold. Based
on these findings, can we conclude that the drug is more effective for
women than for men? Use a 0.01 level of significance.
Let p1 be the population proportion of men who catch a cold even after taking the
drug, and p2 be the corresponding proportion of women.

H0: p1 - p2 = 0 against H1: p1 - p2 > 0

$$\hat{p}_1 = 0.51, \quad \hat{p}_2 = 0.38, \quad \text{and} \quad \bar{p} = \frac{100 \times 0.51 + 50 \times 0.38}{100 + 50} = 0.46667$$

$$s = \sqrt{0.46667 \times (1 - 0.46667) \times \left(\frac{1}{100} + \frac{1}{50}\right)} = \sqrt{0.0074667} = 0.08641$$

$$Z = \frac{0.51 - 0.38}{0.08641} = 1.50446$$

Since the calculated value 1.50446 is smaller than the table value 2.33, we do not
reject H0. Hence there is not sufficient evidence to conclude that the drug is more
effective for women than for men.
Simple Linear Regression
Regression analysis helps businesses understand their data, specifically the
relationships between data points, and use them to make better decisions,
including anything from predicting sales to understanding inventory levels and
supply and demand. Of all the business analysis techniques, regression analysis is
often referred to as one of the most significant.

Regression analysis is widely used for prediction and forecasting, where its use
has substantial overlap with the field of machine learning. Regression analysis is
also used to understand which among the independent variables are related to
the dependent variable, and to explore the forms of these relationships.
Examples:

1. The amount spent on food depends on income. Expenditure on food can be
represented in terms of income, if not exactly, then as variation around a trend
value based on income.
2. The advertising budget (in millions of dollars) of 21 firms and the millions of
impressions retained per week by the viewers of the products of these firms.
The data are based on a survey of 4000 adults in which users of the products
were asked to cite a commercial they had seen for the product category in
the past week.
3. Monthly advertising budgets are based on the sales forecast.
4. Demand can be linked with price.
5. Production forecasts.

In simple linear regression:

1. There is only one independent variable.
2. The relationship between X and Y is described by a linear function.
3. Changes in Y are assumed to be caused by changes in X and an error term.
The correlation coefficient measures the linear relationship between two
variables. Regression analysis can explore beyond a linear relationship.
Correlation analysis between variables X and Y does not indicate which variable
is influencing which one.

Regression Analysis: In regression analysis we identify one of the variables as
the dependent variable and the other variable(s) as independent variable(s).
Regression explains the dependent variable in terms of the independent
variables. It can also be used to measure the relative importance of the various
independent variables in explaining the dependent variable.
The population regression model (simple linear regression)
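The line of regression of Y on X is given by Y = α + β X. Including the random
error term, the model is

Y = α + β X + u

where
  Y is the dependent variable,
  α is the population Y intercept,
  β is the population slope coefficient,
  X is the independent variable,
  u is the random error term, and
  α + β X is the linear component.

Sometimes it is written as Y = β0 + β1 X + ε.
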

Regression Analysis

Another use of regression analysis is the prediction of values of the
dependent variable based on the values of the independent variable(s).

The choice of dependent variable: whichever variable is to be predicted (based
on the independent or explanatory variable(s)) is taken as the dependent variable.
Regression describes a functional relationship in which the independent
variables explain the dependent variable.

For example, food expenditure by households could be predicted using
family income and family size, which are the independent variables.

As another example, the amount spent by a consumer at a retail store in the
last three months can be explained, using regression analysis, by the store’s
location, prices, credit policy, merchandise quality and speed of service.
OLS estimates
The regression equation is Y = α + β X + u.

The coefficient β is estimated using the formula

$$\hat{\beta} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)} = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sum X^2 - (\sum X)^2/n}$$

The cap (^) is used to represent estimates.

The coefficient α is then estimated using the formula

$$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}$$

These estimates are obtained by the OLS (ordinary least squares) method.

For a given value of X, the expected value of Y is

$$\hat{Y} = \hat{\alpha} + \hat{\beta} X$$

This is also known as the predicted value of Y for any given X.

A company collects data on its advertising expenditure and the corresponding
sales figures over a period of consecutive months, as shown in the table
below.

Month   Expenditure ₹ (in lakhs)   Sales ₹ (in crores)
1       2.8                        70
2       3.2                        83
3       3.5                        90
4       4.0                        100
5       4.4                        105
6       4.7                        108

Carry out a regression analysis and hence estimate (forecast) the sales when
advertising expenditure is 5.5 lakhs a month.

The regression equation is Y = α + β X + u.

Month   X      Y     XY       X²
1       2.8    70    196      7.84
2       3.2    83    265.6    10.24
3       3.5    90    315      12.25
4       4.0    100   400      16
5       4.4    105   462      19.36
6       4.7    108   507.6    22.09
Total   22.6   556   2146.2   87.78

$$\hat{\beta} = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sum X^2 - (\sum X)^2/n} = \frac{2146.2 - 22.6 \times 556 / 6}{87.78 - 22.6^2 / 6} = 19.57286$$

$$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X} = 556/6 - 19.57286 \times 22.6/6 = 18.94223$$

The linear regression line (trend line) is

$$\hat{Y} = 18.94223 + 19.57286 X$$

Hence, when advertising expenditure is 5.5 lakhs, sales are expected to be
18.94223 + 19.57286 × 5.5 = 126.593 crore rupees.

Note that the units of measurement of X and Y are different; however, this
does not affect the calculations.
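
The hand calculation above can be checked with a short Python sketch. The function name `ols_simple` is an assumption made for this illustration. It applies the same sum-based OLS formulas and uses no external library.

```python
def ols_simple(x, y):
    """OLS estimates (alpha_hat, beta_hat) for Y = alpha + beta * X + u."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    # Slope: Cov(X, Y) / Var(X), written with raw sums
    beta_hat = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    # Intercept: mean of Y minus slope times mean of X
    alpha_hat = sum_y / n - beta_hat * sum_x / n
    return alpha_hat, beta_hat


expenditure = [2.8, 3.2, 3.5, 4.0, 4.4, 4.7]   # X, in lakhs
sales = [70, 83, 90, 100, 105, 108]            # Y, in crores
alpha_hat, beta_hat = ols_simple(expenditure, sales)
forecast = alpha_hat + beta_hat * 5.5          # predicted sales at X = 5.5
print(round(alpha_hat, 4), round(beta_hat, 4), round(forecast, 3))
```

Small differences in the last decimal place relative to the hand calculation can arise from rounding of intermediate values.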
[Figure: plot of Sales (vertical axis, 40 to 160) against Expenditure (horizontal axis, 2 to 8)]

A study was undertaken to estimate a linear demand function. Data on the quantity
demanded and the price of a commodity were collected for 8 periods. The data are
given below.

S. No.               1    2    3    4    5    6    7    8
Demand (Y) (in Kg)   16   20   18   21   13   15   17   22
Price (X) (in Kg)    10   8    12   6    13   9    11   7

Estimate the linear demand function Y = a + bX + u. Also interpret the estimated
regression.

S No    X    Y     XY     X²
1       10   16    160    100
2       8    20    160    64
3       12   18    216    144
4       6    21    126    36
5       13   13    169    169
6       9    15    135    81
7       11   17    187    121
8       7    22    154    49
Total   76   142   1307   764

$$\hat{b} = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sum X^2 - (\sum X)^2/n} = \frac{1307 - 76 \times 142 / 8}{764 - (76)^2 / 8} = \frac{-42}{42} = -1$$

$$\hat{a} = \bar{Y} - \hat{b} \bar{X} = 17.75 - (-1) \times 9.5 = 27.25$$

Hence the demand (Y) is expressed in terms of price (X) by the estimated trend line
$\hat{Y} = 27.25 - X$. This indicates that as price increases, demand decreases.
The slope coefficient is -1, so each unit increase in price reduces the estimated
demand by 1 Kg.

Ice Cream Sales vs Temperature

The local ice cream shop keeps track of how much ice cream they sell versus the
noon temperature on that day. Here are their figures for the last 12 days.
Obtain a sales trend as a linear function of noon temperature. What would be the
sales when noon temperature is 35°C?

Temperature °C   Ice Cream Sales ₹ '000s
24.2             17.5
26.4             19.4
21.9             12.8
25.2             15.2
28.5             27.4
32.1             28.3
29.4             18.6
35.1             34.4
33.4             24.7
28.1             18.2
32.6             27.2
27.2             17.9

X       Y       XY        X²
24.2    17.5    423.5     585.64
26.4    19.4    512.16    696.96
21.9    12.8    280.32    479.61
25.2    15.2    383.04    635.04
28.5    27.4    780.9     812.25
32.1    28.3    908.43    1030.41
29.4    18.6    546.84    864.36
35.1    34.4    1207.44   1232.01
33.4    24.7    824.98    1115.56
28.1    18.2    511.42    789.61
32.6    27.2    886.72    1062.76
27.2    17.9    486.88    739.84
344.1   261.6   7752.63   10044.05   (totals)

$$b = \frac{7752.63 - 344.1 \times 261.6 / 12}{10044.05 - 344.1 \times 344.1 / 12} = 1.41963$$

$$a = 261.6/12 - 1.419632 \times 344.1/12 = -18.9079$$

Sales when noon temperature is 35°C is -18.9079 + 1.41963 × 35 = 30.77917
thousand rupees.

We use a measure R², the proportion of variation attributable to the
approximate linear relationship, called the coefficient of determination. It is the
proportion of the total variation in Y that is accounted for by the predictor variable X.
The value of R² is a measure of the extent to which X and Y are linearly related.

$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}}, \qquad R^2 = 1 - \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2}$$

In the case of simple linear regression, R² is the square of the correlation between X and Y:

$$R^2 = \frac{[\mathrm{Cov}(X, Y)]^2}{\mathrm{Var}(X)\,\mathrm{Var}(Y)} = \frac{\left[\sum XY - (\sum X)(\sum Y)/n\right]^2}{\left[\sum X^2 - (\sum X)^2/n\right]\left[\sum Y^2 - (\sum Y)^2/n\right]}$$
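
As a rough check that the two expressions for R² agree in simple linear regression, the sketch below computes both for the demand and price data from the earlier example. The function name `r_squared_both_ways` is an assumption made for this illustration.

```python
def r_squared_both_ways(x, y):
    """Return R^2 as (1 - SSE/SST, squared correlation) for a simple linear fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)        # total variation in Y
    beta = s_xy / s_xx                                # OLS slope
    alpha = mean_y - beta * mean_x                    # OLS intercept
    sse = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - sse / s_yy, s_xy ** 2 / (s_xx * s_yy)


price = [10, 8, 12, 6, 13, 9, 11, 7]           # X from the demand example
demand = [16, 20, 18, 21, 13, 15, 17, 22]      # Y from the demand example
r2_residual, r2_correlation = r_squared_both_ways(price, demand)
print(round(r2_residual, 4), round(r2_correlation, 4))
```

Both values come out the same (about 0.62), which suggests that roughly 62% of the variation in demand is accounted for by the linear relationship with price in that small sample.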
A sample of 10 observations based upon the data for the period 1991 to 2000
corresponding to the regression model Y = a + bX + u, where Y is quantity supplied
(million tons) and X is export price (₹ per ton), gave the following results:

∑X = 5100, ∑X² = 3090000, ∑XY = 35500, ∑Y = 59 and ∑Y² = 419

Estimate the parameters a and b of the model.

Find the value of R2.
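
One way to approach this exercise is to apply the sum-based formulas directly to the reported totals. The sketch below is illustrative only, and the function name `fit_from_sums` and its argument names are assumptions made for this example.

```python
def fit_from_sums(n, sum_x, sum_y, sum_xy, sum_xx, sum_yy):
    """Estimate a, b and R^2 for Y = a + bX + u from summary totals."""
    s_xy = sum_xy - sum_x * sum_y / n     # corrected sum of cross products
    s_xx = sum_xx - sum_x ** 2 / n        # corrected sum of squares of X
    s_yy = sum_yy - sum_y ** 2 / n        # corrected sum of squares of Y
    b = s_xy / s_xx                       # slope estimate
    a = sum_y / n - b * sum_x / n         # intercept estimate
    r2 = s_xy ** 2 / (s_xx * s_yy)        # coefficient of determination
    return a, b, r2


# Totals given in the exercise (n = 10 observations)
a, b, r2 = fit_from_sums(10, 5100, 59, 35500, 3090000, 419)
print(round(a, 4), round(b, 6), round(r2, 4))
```

Working from totals avoids needing the raw observations, which the exercise does not provide.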


Multiple regression

Y=β0+ β1X1+ β2X2+ β3X3+ β4X4+u


Logistic Regression

You may wish to predict the likely success/failure rate of a new product or the
likelihood of customer retention/loss.
In these cases the response (dependent) variable is not a continuous
variable. It may be on a nominal scale, such as success and failure.
In these cases we use logistic regression. Here the connection between the categorical
dependent variable and the continuous independent variables is measured by
converting the dependent variable into probability scores. Unlike linear regression
models, which are used to predict a continuous outcome variable, logistic regression
models are mostly used to predict a dichotomous categorical outcome.
Logistic regression models are frequently used in business analysis
applications.
Whether a voter will vote for a specific political party can be determined
through the interpretation of logistic modeling, which is based on demographic
parameters such as gender, age, family income, caste and the state of residence
of the voter.
Logistic regression has varied applications in marketing, healthcare and social
sciences.
These models also show the extent to which changes in the values of the
attributes may increase or decrease the predicted probability of event outcome.
Suppose we need to study bad loans as explained by age.

In the credit card industry, a financial company may be interested
in minimizing portfolio risk and want to understand the top five
factors that cause a customer to default.

A logistic regression model can be used to predict the probability of a
customer defaulting based on the average balance carried by the customer.

Logistic regression was developed by statistician David Cox in 1958.

In these cases the dependent variable takes only two values, which we call success
and failure and which can be mapped to 1 and 0. That is, the dependent variable has
a binary response.
Since predicting the outcome with complete certainty is not possible, predicting
the probability of success is a better idea.
Let p = Prob(Yes) or Prob(Success).

The value of p can vary from 0 to 1.

$\frac{p}{1 - p}$ is the ratio of the probability of success to the probability of failure (the odds).

Its value varies from 0 to infinity.

Whereas $\ln\left(\frac{p}{1 - p}\right)$ varies from -infinity to infinity (any negative or positive value).
Logistic Regression Model
The popular form of the logistic regression model:

$$\ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X + \varepsilon$$

Here ln is log to the base e.

Using the trend line we get the value of p as

$$p = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

Here β0 and β1 are estimated values.

The logistic model assumes a linear relationship between the predictors (X)
and ln(odds). This model is also known as logit form of the model.
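
The relationship between p, the odds and ln(odds) can be sketched in Python as below. The function names `logit` and `inverse_logit` and the coefficient values are hypothetical, chosen only to illustrate the two transformations.

```python
from math import exp, log


def logit(p):
    """Log-odds ln(p / (1 - p)) for a probability strictly between 0 and 1."""
    return log(p / (1 - p))


def inverse_logit(b0, b1, x):
    """Probability p implied by the linear predictor b0 + b1 * x."""
    eta = b0 + b1 * x
    return exp(eta) / (1 + exp(eta))


# Hypothetical coefficients, for illustration only
b0, b1 = -1.0, 0.5
p = inverse_logit(b0, b1, 3)
print(round(p, 4), round(logit(p), 4))   # logit(p) recovers b0 + b1 * 3
```

Applying `logit` to the output of `inverse_logit` returns the linear predictor, which is the sense in which the model is linear in ln(odds).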
Logit Function
[Figure: function plot of y = exp(b0 + b1·x) / (1 + exp(b0 + b1·x)) against x; y runs from 0 to 1 and x from -10 to 12]

We use simple linear regression methods for the transformed data and
estimate the regression coefficients. Hence probability of “Yes” (p)
can be determined.

From the following relation we get an estimate of the odds:

$$\frac{p}{1 - p} = e^{\beta_0 + \beta_1 X}$$

Example: Suppose a mobile app is released. We asked a sample of college
students whether they had purchased it and noted their ages.

Age             18        19        20        21       22        23        24        25
Purchased       4         6         12        15       8         5         6         4
Not Purchased   12        10        20        15       16        20        18        24
Total           16        16        32        30       24        25        24        28
p               0.2500    0.3750    0.3750    0.5000   0.3333    0.2000    0.2500    0.1429
p/(1-p)         0.3333    0.6       0.6       1.0      0.5       0.25      0.3333    0.1667
ln(p/[1-p])     -1.0986   -0.5108   -0.5108   0.0      -0.6931   -1.3863   -1.0986   -1.7918

Now solve this problem as a simple regression with age as X and ln(p/[1-p]) as Y.

X   18        19        20        21    22        23        24        25
Y   -1.0986   -0.5108   -0.5108   0.0   -0.6931   -1.3863   -1.0986   -1.7918
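We estimate the regression coefficients using the method of linear
regression and hence obtain the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$.

Hence we can estimate the values of Y as $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$. Based on these values we get an
estimate of p using the formula

$$\hat{p} = \frac{e^{\hat{Y}}}{1 + e^{\hat{Y}}}$$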

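
The grouped-data procedure in the mobile app example can be sketched in Python as follows. This follows the approach described in the slides (ordinary least squares on the transformed values ln(p/[1-p])), not maximum likelihood estimation, which is what statistical packages typically use for logistic regression. The variable and function names are assumptions made for this illustration.

```python
from math import exp, log

ages = [18, 19, 20, 21, 22, 23, 24, 25]
purchased = [4, 6, 12, 15, 8, 5, 6, 4]
not_purchased = [12, 10, 20, 15, 16, 20, 18, 24]

# Odds p / (1 - p) equal purchased / not purchased, so Y is the log of that ratio
logits = [log(yes / no) for yes, no in zip(purchased, not_purchased)]

# Simple linear regression of Y (log-odds) on X (age)
n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(logits) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, logits))
s_xx = sum((x - mean_x) ** 2 for x in ages)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x


def predicted_probability(age):
    """Estimated probability of purchase at a given age."""
    eta = b0 + b1 * age
    return exp(eta) / (1 + exp(eta))


print(round(b0, 3), round(b1, 3), round(predicted_probability(21), 3))
```

The negative slope suggests that, in this sample, the estimated odds of purchase fall as age increases. Note that the fit treats every age group equally regardless of how many students it contains.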