

QUESTION - A

1. Study the Lungcap data set and answer the following questions.

i) Construct a two-way table for gender and smoking habit.


ii) Find the marginal probabilities.
iii) Given that one randomly selected person is a smoker, what is the probability that the
person is a female?
iv) Are gender and smoking habit independent?

2. Suppose it is given that 20% of the male smokers and 15% of the female smokers were born
caesarean. With the help of the data, verify the above statements. Give enough reasons for your
answers.
3. Plot the histogram of the distribution of Lungcap amongst smokers.
4. Plot the histogram of the distribution of Height amongst smokers.
5. Are height and Lungcap independent?
6. Are the variation of Lungcap of male smokers and female smokers equal?
7. Are the average of Lungcap of smokers and non-smokers equal?
8. Plot the histogram of the age amongst smokers.
9. What percentage of people below 16 years smoke?
10. What percentage of people above 17 years smoke?
11. Test if smoking habit and age are dependent.
12. Test if smoking habit and Lungcap are dependent.
13. Fit a suitable distribution to height and also to Lungcap. Test the goodness of fit.

QUESTION – B

Study the car data set and answer the following questions.

1. Find the average and variance of price and mileage separately. Comment on the results. How will
you interpret the result statistically?
2. Test if the mean mileage of different car manufacturers within some price range are equal.
Clearly specify all the assumptions and the null and alternative hypotheses.
3. Find a 90% confidence price range for the Chevrolet cars.
4. Find a 90% confidence interval for the variance of prices for Pontiac cars.
5. Calculate the correlation coefficient between mileage and Liter for each company.
6. Comment on the results.
7. Suppose a car has a Liter of 3.8. How sure will you be that its mileage is more than 20,000?
8. Is there any correlation between prices and mileage?

QUESTION – C

1. Let Ȳ be the mean of a random sample of size n1 from N(μ, σ² = 10). Find n1 such that the
probability that the random interval (Ȳ − 1/2, Ȳ + 1/2) includes μ is approximately 0.954.
2. Let Z̄ be the mean of a random sample of size n2 from N(μ, σ² = 9). Find n2 such that the
probability that the random interval (Z̄ − 1, Z̄ + 1) includes μ is approximately 0.90.
3. Draw 200 random samples each of size n1 (found above) from a normal distribution with mean 5
and variance 3.
4. Write down the distribution of the sample mean. Test using the data obtained in Q3 above, if the
sample means follow that distribution.
5. Draw 200 random samples each of size n2 (found above) from a normal distribution with mean 7
and variance 3.
6. Compute 95% confidence interval for the difference of means from each of the 200 samples.
Draw a graph to show all 200 confidence intervals and comment.

QUESTION – D

1. Collect stock prices for 5 companies from 1st Jan 2016 to 30th June 2016.
2. Plot the histogram of the returns for each company. Describe the histograms.
3. Test whether the average returns for 5 companies are equal. State clearly the assumptions
required, null and alternative hypotheses.
4. Test whether the average returns for each pair of companies are equal.
5. Comment on the results.

QUESTION – E

1. The income distribution of a very large population is exponential with average income ₹ 40, 000
per annum. Draw 500 samples (from the income distribution) of size 100 each. Sketch the
distribution of sample average income. Comment.
2. The age distribution of a very large population is given below:

Age Group 15-18 18-21 21-23 23-25 25-27 27-29 29-31 31-33 33-35
(years)
Proportion 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.1

Draw 100 samples (from the age distribution) of size 50 each. Sketch the distribution of sample
average age. Comment.

Section-A
Q.1.i.
Two-Way Table (Gender by Smoking Habit)

Gender         Smoker   Non-Smoker   Grand Total
Male               33          334           367
Female             44          314           358
Grand Total        77          648           725
Q.1.ii. Marginal Probabilities

Gender                 Smoker   Non-Smoker   Marginal Probability
Male                    0.046        0.461                  0.506
Female                  0.061        0.433                  0.494
Marginal Probability    0.106        0.894                  1.000

Q.1.iii. Given that one randomly selected person is a smoker, the probability that the person is
female:
P(Female | Smoker) = (# of female smokers) / (# of smokers) = 44/77 = 0.571

Q.1.iv. H0: Gender and Smoking Habit are independent.
HA: Gender and Smoking Habit are dependent.
Reject H0 if the p-value of χ²cal is less than 5%, i.e. if χ²cal > χ²(1, 0.05).

Cell                    Observed (Fij)   Expected (Eij)   Fij − Eij   (Fij − Eij)²/Eij
11 (Male, Smoker)                   33            38.98       -5.98              0.917
12 (Male, Non-Smoker)              334           328.02        5.98              0.109
21 (Female, Smoker)                 44            38.02        5.98              0.940
22 (Female, Non-Smoker)            314           319.98       -5.98              0.112

Degrees of freedom = (2−1)(2−1) = 1
χ²cal = 2.077
χ²(1, 0.05) = 3.841

Since χ²cal < χ²(1, 0.05), the p-value for χ²cal (>10%) is more than 5%. Hence there is not sufficient
evidence to reject H0, and the data are consistent with gender and smoking habit being independent.
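
The chi-square calculation above can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the function name `chi_square_2x2` is hypothetical, and the p-value shortcut via `math.erfc` holds only for 1 degree of freedom (a 2x2 table).

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / n
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, chi2 is the square of a standard normal,
    # so P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: Male, Female; columns: Smoker, Non-Smoker (counts from Q.1.i)
chi2, p_value = chi_square_2x2([[33, 334], [44, 314]])
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.3f}")
```

The same function can be applied to the other 2x2 tables in this section (Q5, Q.11 and Q.12) by passing their observed counts.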

Q.2 Given: 20% (= πm) of male smokers and 15% (= πf) of female smokers were born caesarean.

a) As per the sample,
# of male smokers = 33, # of male smokers born caesarean = 10
Proportion of male smokers born caesarean, Pm = 10/33 = 30.3%
Sample size, Nm = 33
Since sample size > 30, as per CLT, Pm is approximately N(πm, SDm²)
Standard deviation, SDm = sqrt(Pm × (1 − Pm)/Nm) = 0.08
Zcal = (Pm − πm)/SDm = (30.3% − 20%)/0.08 = 1.29
Z+cri = Z0.975 = 1.96; Z-cri = Z0.025 = −1.96
Hypothesis Statement:
H0: Proportion of male smokers born caesarean, πm = 20%
HA: Proportion of male smokers born caesarean, πm ≠ 20%
Rejection Rule
Reject H0 if Zcal > Z+cri or Zcal < Z-cri
Since Z-cri < Zcal (= 1.29) < Z+cri, there is not enough
evidence to reject H0.
Hence the sample is consistent with the statement that 20% of the male smokers were born caesarean.
b) As per the sample,
# of female smokers = 44, # of female smokers born caesarean = 11
Proportion of female smokers born caesarean, Pf = 11/44 = 25%
Sample size, Nf = 44
Since sample size > 30, as per CLT, Pf is approximately N(πf, SDf²)
Standard deviation, SDf = sqrt(Pf × (1 − Pf)/Nf) = 0.065
Zcal = (Pf − πf)/SDf = (25% − 15%)/0.065 = 1.53
Z+cri = Z0.975 = 1.96; Z-cri = Z0.025 = −1.96
Hypothesis Statement:
H0: Proportion of female smokers born caesarean, πf = 15%
HA: Proportion of female smokers born caesarean, πf ≠ 15%
Rejection Rule
Reject H0 if Zcal > Z+cri or Zcal < Z-cri
Since Z-cri < Zcal (= 1.53) < Z+cri, there is not enough
evidence to reject H0.
Hence the sample is consistent with the statement that 15% of the female smokers were born caesarean.
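
The two proportion tests in Q.2 follow the same recipe, which can be sketched as below. The function name `one_proportion_z` is hypothetical. As in the working above, the standard deviation is computed from the sample proportion; a common alternative (not used here) is to compute it from the hypothesised proportion instead.

```python
import math

def one_proportion_z(successes, n, pi0):
    """Two-sided z-test of H0: proportion = pi0, using the sample-based SD."""
    p_hat = successes / n
    sd = math.sqrt(p_hat * (1 - p_hat) / n)
    z = (p_hat - pi0) / sd
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z_male, p_male = one_proportion_z(10, 33, 0.20)      # male smokers
z_female, p_female = one_proportion_z(11, 44, 0.15)  # female smokers
print(f"male:   z = {z_male:.2f}, p-value = {p_male:.3f}")
print(f"female: z = {z_female:.2f}, p-value = {p_female:.3f}")
```

Both z values fall inside (−1.96, 1.96), which matches the conclusions reached above.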
Q3.&4.

Q5. Hypothesis Statement:
H0: Height and lungcap are independent for the following ranges.
HA: Height and lungcap are dependent.
Rejection Rule: Reject H0 if the p-value of χ²cal is less than 5%.

Lungcap   Height <63   Height >63   Total
<7               229           34     263
>7                54          408     462
Total            283          442     725

Cell   Observed (Fij)   Expected (Eij)   Fij − Eij   (Fij − Eij)²/Eij
11                229           102.66      126.34            155.479
12                 34           160.34     -126.34             99.549
21                 54           180.34     -126.34             88.509
22                408           281.66      126.34             56.670

Degrees of freedom = (2−1)(2−1) = 1
χ²cal = 400.207
χ²(1, 0.05) = 3.841
Since χ²cal > χ²(1, 0.05), the p-value for χ²cal (~0%) is less than 5%. Hence we reject H0 and state that
height and lungcap are dependent.

Q6. Hypothesis Statement:
H0: Variances of lungcap of male smokers and female smokers are equal, or σ1²/σ2² = 1
HA: Variances of lungcap of male smokers and female smokers are not equal, or σ1²/σ2² ≠ 1
Rejection Rule:
Reject H0 if Fcal (= s1²/s2²) > Fcrit,df for significance level 0.05.
We conducted the F-test with α/2 = 0.05/2 = 0.025 (for a two-tail test) and got Fcal = 1.19 and Fcrit = 1.96.

Since Fcal (= 1.19) < Fcrit (1.96), there is not enough evidence to reject H0. Hence the data are
consistent with the lungcap variances of male smokers and female smokers being equal.

Q7. Let μ1 and μ2 be the average lungcap of smokers and non-smokers, and let σ1² and σ2² be
the variances of the respective populations.
x̄1 = Random variable of average lungcap of sample smokers ~ N(μ1, σ1²/n1)
x̄2 = Random variable of average lungcap of sample non-smokers ~ N(μ2, σ2²/n2)

As per data,
No. of smokers, n1 = 77
No. of non-smokers, n2 = 648
Average lungcap of smokers, x̄1 = 8.645
Average lungcap of non-smokers, x̄2 = 7.77
Sample lungcap variance of smokers, s1² = 3.545
Sample lungcap variance of non-smokers, s2² = 7.432

Since the population lungcap variances of smokers (σ1²) and non-smokers (σ2²) are unknown, we assume
σ1² = σ2² and estimate the common variance by the pooled variance sp²,
where sp² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2) = 7.023
Now, (x̄1 − x̄2)/(sp × sqrt(1/n1 + 1/n2)) ~ t with n1 + n2 − 2 degrees of freedom
tcal = (x̄1 − x̄2)/(sp × sqrt(1/n1 + 1/n2)) = 2.74
t+cri = t0.975 = 1.96
t-cri = t0.025 = −1.96
Hypothesis Statement
H0: μ1 = μ2 or μ1 − μ2 = 0
HA: μ1 ≠ μ2 or μ1 − μ2 ≠ 0
Rejection Rule
Reject H0 if tcal > t+cri or tcal < t-cri
Since in our case tcal (= 2.74) > t+cri (= 1.96), we reject H0.
Hence the average lungcap of smokers and non-smokers are not equal (μ1 ≠ μ2).
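
The pooled-variance t statistic in Q7 can be recomputed directly from the summary figures quoted above. This is a minimal sketch; the function name `pooled_t` is hypothetical, and the critical value 1.96 is taken from the text rather than computed.

```python
import math

def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Two-sample t statistic assuming equal population variances."""
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, sp2, df

# Smokers: mean 8.645, variance 3.545, n = 77
# Non-smokers: mean 7.77, variance 7.432, n = 648
t_cal, sp2, df = pooled_t(8.645, 3.545, 77, 7.77, 7.432, 648)
t_crit = 1.96  # two-sided 5% critical value used in the text
print(f"sp^2 = {sp2:.3f}, t = {t_cal:.2f}, df = {df}, reject H0: {abs(t_cal) > t_crit}")
```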
Q.8.

Q.9. # people below 16 years = 548
# people below 16 years who smoke = 42
Percentage of people below 16 who smoke = 42/548 = 7.66%
Q.10. # people above 17 years = 80
# people above 17 years who smoke = 15
Percentage of people above 17 who smoke = 15/80 = 18.75%

Q.11. H0: Age and smoking habit are independent for the age ranges below.
HA: Age and smoking habit are dependent for the age ranges below.
Reject H0 if the p-value of χ²cal is less than 5%.

Smoking Habit   Age <15   Age >15   Total
Yes                  42        35      77
No                  506       142     648
Total               548       177     725

Cell   Observed (Fij)   Expected (Eij)   Fij − Eij   (Fij − Eij)²/Eij
11                 42            58.20      -16.20              4.510
12                 35            18.80       16.20             13.963
21                506           489.80       16.20              0.536
22                142           158.20      -16.20              1.659

Degrees of freedom = (2−1)(2−1) = 1
χ²cal = 20.668
χ²(1, 0.05) = 3.841
Since χ²cal > χ²(1, 0.05), the p-value for χ²cal (~0%) is less than 5%. Hence we reject H0 and state that age
and smoking habit are dependent.
Q.12. Hypothesis Statement:
H0: Lungcap and smoking habit are independent for the lungcap ranges below.
HA: Lungcap and smoking habit are dependent for the lungcap ranges below.
Reject H0 if the p-value of χ²cal is less than 5%.

Smoking Habit   Lungcap <9   Lungcap >9   Total
Yes                     43           34      77
No                     432          216     648
Total                  475          250     725

Cell   Observed (Fij)   Expected (Eij)   Fij − Eij   (Fij − Eij)²/Eij
11                 43            50.45       -7.45              1.100
12                 34            26.55        7.45              2.089
21                432           424.55        7.45              0.131
22                216           223.45       -7.45              0.248

Degrees of freedom = (2−1)(2−1) = 1
χ²cal = 3.568
χ²(1, 0.05) = 3.841
Since χ²cal < χ²(1, 0.05), the p-value for χ²cal is more than 5%. Hence there is not enough
evidence to reject H0, and the data are consistent with lungcap and smoking habit being independent
for these ranges.

Q13. As per the data, we have the following descriptive statistics for lungcap and height:

                      LungCap     Height
Mean                  7.863148    64.83628
Standard Deviation    2.662008    7.202144
Count                 725         725

Distribution for Lungcap
H0: The lungcap distribution of the population follows a Normal Distribution
~ N(7.863, SD = 2.66)
HA: The lungcap distribution doesn't follow ~ N(7.863, SD = 2.66)
We construct the following frequency distribution, choosing the bins such that each bin has an
expected frequency percentage of 10%.

Cumulative %   Z-value   Bin (upper limit)   Frequency (fi)   Expected (ei)   fi−ei   (fi−ei)²   (fi−ei)²/ei
10%              -1.28               4.456               83            72.5    10.5    110.250         1.521
20%              -0.84               5.627               62            72.5   -10.5    110.250         1.521
30%              -0.52               6.479               67            72.5    -5.5     30.250         0.417
40%              -0.25               7.198               61            72.5   -11.5    132.250         1.824
50%               0                  7.863               72            72.5    -0.5      0.250         0.003
60%               0.25               8.529               74            72.5     1.5      2.250         0.031
70%               0.52               9.247               79            72.5     6.5     42.250         0.583
80%               0.84              10.099               71            72.5    -1.5      2.250         0.031
90%               1.28              11.271               88            72.5    15.5    240.250         3.314
More                                                     68            72.5    -4.5     20.250         0.279

χ²cal = 9.524
For significance level 5% and degrees of freedom 7 (= 10−2−1), we have χ²(7, 0.05) = 14.067.
Since χ²cal < χ²(7, 0.05), the p-value will be more than 5%. Hence we fail to reject H0, and the data are
consistent with lungcap following ~ N(7.863, SD = 2.66).

Distribution for Height


H0: The height distribution of the population follows a Normal Distribution
~ N(64.836, SD = 7.202)
HA: The height distribution doesn't follow ~ N(64.836, SD = 7.202)
We construct the following frequency distribution, choosing the bins such that each bin has an
expected frequency percentage of 10%.

Cumulative %   Z-value   Bin (upper limit)   Frequency (fi)   Expected (ei)   fi−ei   (fi−ei)²   (fi−ei)²/ei
10%              -1.28              55.618               86            72.5    13.5    182.250         2.514
20%              -0.84              58.786               64            72.5    -8.5     72.250         0.997
30%              -0.52              61.091               64            72.5    -8.5     72.250         0.997
40%              -0.25              63.036               69            72.5    -3.5     12.250         0.169
50%               0                 64.836               61            72.5   -11.5    132.250         1.824
60%               0.25              66.637               79            72.5     6.5     42.250         0.583
70%               0.52              68.581               63            72.5    -9.5     90.250         1.245
80%               0.84              70.886               70            72.5    -2.5      6.250         0.086
90%               1.28              74.055               98            72.5    25.5    650.250         8.969
More                                                     71            72.5    -1.5      2.250         0.031

χ²cal = 17.414
For significance level 5% and degrees of freedom 7 (= 10−2−1), we have χ²(7, 0.05) = 14.067.
Since χ²cal > χ²(7, 0.05), the p-value will be less than 5%. Hence we reject H0: the height distribution
doesn't follow ~ N(64.836, SD = 7.202).
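
The goodness-of-fit statistics for lungcap and height can be recomputed from the observed bin frequencies. This is a minimal sketch with a hypothetical function name; it uses `statistics.NormalDist` from the standard library for the bin limits, so the limits differ slightly from the table, which uses Z-values rounded to two decimals. The critical value is taken from the text rather than computed.

```python
from statistics import NormalDist

def equal_probability_gof(observed, mean, sd):
    """Chi-square goodness of fit to N(mean, sd) with equal-probability bins."""
    k = len(observed)
    n = sum(observed)
    expected = n / k  # every bin has probability 1/k under H0
    dist = NormalDist(mean, sd)
    # Upper limits of the first k-1 bins (the last bin is open-ended)
    upper_limits = [dist.inv_cdf(i / k) for i in range(1, k)]
    chi2 = sum((f - expected) ** 2 / expected for f in observed)
    df = k - 2 - 1  # two parameters (mean, SD) estimated from the data
    return chi2, df, upper_limits

lungcap_obs = [83, 62, 67, 61, 72, 74, 79, 71, 88, 68]
height_obs = [86, 64, 64, 69, 61, 79, 63, 70, 98, 71]

chi2_lung, df_lung, limits_lung = equal_probability_gof(lungcap_obs, 7.863148, 2.662008)
chi2_height, df_height, _ = equal_probability_gof(height_obs, 64.83628, 7.202144)

critical = 14.067  # chi-square upper 5% point for 7 degrees of freedom
print(f"lungcap: chi2 = {chi2_lung:.3f}, reject H0: {chi2_lung > critical}")
print(f"height:  chi2 = {chi2_height:.3f}, reject H0: {chi2_height > critical}")
```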

Section-B

Q1.
                   Price       Mileage
Mean               21343.14    19831.93
Sample Variance    97710315    67179657

We observe that the sample variance of price is larger than that of mileage. That means the spread of
price around its average is wider than the spread of mileage around its average. So we can say that
cars across a wide range of prices have mileage close to the average of 19831.93.

Q2.
Price Range   Average mileage (x̄i)   Variance (si²)   Sample Size (ni)
<20000                    20241.52         64394503                467
20k-40k                   19759.26         65564947                297
>40k                      15589.65         95651556                 40

Let μ1, μ2 and μ3 be the average mileage of cars in the price ranges given in the table, and let
σ1², σ2² and σ3² be the variances of mileage in the respective price ranges.
Hypothesis Statement
H0: μ1 = μ2 = μ3
HA: At least one of μ1, μ2, μ3 differs from the others.

We conducted an ANOVA test. Since the p-value is less than 0.05, we reject H0 and state that the
average mileage of the cars in the above price ranges is not equal.
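
The one-way ANOVA F statistic can be rebuilt from the summary table alone (group means, variances and sizes). This is a sketch with hypothetical function names, not a reproduction of the original ANOVA output. The p-value formula used here is a closed form that is valid only when the numerator has exactly 2 degrees of freedom (three groups).

```python
def anova_from_summary(groups):
    """One-way ANOVA F statistic from (mean, variance, n) per group."""
    total_n = sum(n for _, _, n in groups)
    k = len(groups)
    grand_mean = sum(mean * n for mean, _, n in groups) / total_n
    ss_between = sum(n * (mean - grand_mean) ** 2 for mean, _, n in groups)
    ss_within = sum((n - 1) * var for _, var, n in groups)
    df1, df2 = k - 1, total_n - k
    f_stat = (ss_between / df1) / (ss_within / df2)
    return f_stat, df1, df2

def f_upper_tail_df1_equals_2(f_stat, df2):
    """P(F > f_stat) for an F(2, df2) distribution (special case df1 = 2)."""
    return (1 + 2 * f_stat / df2) ** (-df2 / 2)

# (average mileage, variance, sample size) for <20000, 20k-40k, >40k
groups = [(20241.52, 64394503, 467), (19759.26, 65564947, 297), (15589.65, 95651556, 40)]
f_stat, df1, df2 = anova_from_summary(groups)
p_value = f_upper_tail_df1_equals_2(f_stat, df2)
print(f"F({df1}, {df2}) = {f_stat:.2f}, p-value = {p_value:.4f}")
```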

Q3.
Price - Chevrolet
Mean                        16427.6
Standard Deviation          6901.439
Sample Variance             47629867
Count                       320
Confidence Level (90.0%)    636.4364

Using the t-distribution, the standard error is s/sqrt(n) = 6901.439/sqrt(320) = 385.80, and the
margin of error reported above is t(0.05, 319) × 385.80 = 636.4364, so t(0.05, 319) ≈ 1.65.
90% confidence interval for the mean price = 16427.6 ± 636.4364 = (15791.16, 17064.04)

Q.4.
N                       150
Sample Variance (s²)    16708238
χ²(149, 0.1)            171.507
CI Variance (90%), one-sided lower bound = (n−1)s²/χ²(149, 0.1) = 14515607

Q.5.
Manufacturer   Cov.(ML)    SD(M)      SD(L)   Corr.(ML)
Buick           162.323    6932.136   0.230    0.102
Cadillac        594.100    8964.292   0.803    0.083
Chevrolet      -285.829    8203.571   1.151   -0.030
Pontiac         959.280    8110.435   1.098    0.108
SAAB             -9.525    8404.288   0.162   -0.007
Saturn         -501.661    8479.994   0.301   -0.197
Q.6. Analysing the correlation coefficients above, it can
be stated that there is only a weak linear relation between mileage and liter, as the correlation
coefficients are close to zero.
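
Each correlation coefficient in the Q.5 table is the covariance divided by the product of the two standard deviations. The sketch below recomputes them from the tabulated values; because those inputs are rounded, the results can differ from the table in the last decimal place.

```python
def correlation(cov, sd_x, sd_y):
    """Pearson correlation from a covariance and two standard deviations."""
    return cov / (sd_x * sd_y)

# Manufacturer: (Cov(M, L), SD(M), SD(L)) as tabulated in Q.5
table = {
    "Buick": (162.323, 6932.136, 0.230),
    "Cadillac": (594.100, 8964.292, 0.803),
    "Chevrolet": (-285.829, 8203.571, 1.151),
    "Pontiac": (959.280, 8110.435, 1.098),
    "SAAB": (-9.525, 8404.288, 0.162),
    "Saturn": (-501.661, 8479.994, 0.301),
}

corr = {name: correlation(*values) for name, values in table.items()}
for name, r in corr.items():
    print(f"{name:10s} {r:7.3f}")
```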

Q7. # total cars = 804
# cars with liter equal to 3.8 = 160
# cars with liter equal to 3.8 and mileage more than 20,000 = 95
Given that a car has liter 3.8, the probability of mileage greater than 20,000 is
P(Mileage > 20000 | liter = 3.8)
= P(liter = 3.8 and mileage > 20000) / P(liter = 3.8)
= (95/804) / (160/804)
= 95/160 = 0.594
Thus, given that a car has liter 3.8, the estimated probability that its mileage is greater
than 20,000 is about 59.4%.

Q8.
Cov.(PM)         SD(P)      SD(M)      Corr.(PM)
-11589868.158    9884.853   8196.320   -0.143

As the correlation between price and mileage is close to zero, there is only a weak linear
relation between them.
Section-C:

Q1. Given Ȳ ~ N(μ, σ²/n1) with σ² = 10, error = 0.5 for sample size n1.
Confidence Level = 1 − α = 0.954, or α = 0.046
Corresponding Z value for α/2: |Z0.023| = 1.99
Now, from the standard normal distribution, |Zα/2| = |Ȳ − μ| / (σ/sqrt(n1)).
Or, error = |Zα/2| × σ/sqrt(n1)
Or, 0.5 = 1.99 × sqrt(10)/sqrt(n1)
Or, n1 = 1.99² × 10 / 0.5² = 158.4 ~ 159.
Q2. Given Z̄ ~ N(μ, σ²/n2) with σ² = 9, error = 1 for sample size n2.
Confidence Level = 1 − α = 0.9, or α = 0.1
Corresponding Z value for α/2: |Z0.05| = 1.65
Now, from the standard normal distribution, |Zα/2| = |Z̄ − μ| / (σ/sqrt(n2)).
Or, error = |Zα/2| × σ/sqrt(n2)
Or, 1 = 1.65 × sqrt(9)/sqrt(n2)
Or, n2 = 1.65² × 9 / 1² = 24.5 ~ 25.
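
The sample-size rule n = z² × σ² / error² used in Q1 and Q2 can be written as a small helper. The function name `sample_size` is hypothetical, and the z values are the rounded ones quoted in the text; an unrounded quantile may round up to a slightly larger n.

```python
import math

def sample_size(z, variance, error):
    """Smallest n with z * sqrt(variance / n) <= error."""
    return math.ceil(z ** 2 * variance / error ** 2)

n1 = sample_size(1.99, 10, 0.5)  # Q1: sigma^2 = 10, error = 0.5
n2 = sample_size(1.65, 9, 1.0)   # Q2: sigma^2 = 9, error = 1
print(n1, n2)
```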
Q.3. 200 random samples, each of size 159, were generated from N(5, 3).

Q.4. Since we have taken the samples from a normal distribution N(5, 3), each of size 159 (>30),
the averages of the 200 samples will follow a normal distribution N(5, 3/159), which can be
verified in the descriptive statistics of the sample means and the histogram below.
Sample Mean = 5.003317 ~ 5
Sample variance = 3/159 = 0.0189 ~ 0.020
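
The sampling experiment in Q.3 and Q.4 can be sketched as follows. This is an illustration with a hypothetical function name and an arbitrary seed, not the procedure used to produce the figures above, so the simulated values will differ slightly from those quoted.

```python
import random
import statistics

def simulate_sample_means(num_samples, sample_size, mean, variance, seed=0):
    """Draw num_samples normal samples and return the mean of each one."""
    rng = random.Random(seed)
    sd = variance ** 0.5
    means = []
    for _ in range(num_samples):
        sample = [rng.gauss(mean, sd) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

means = simulate_sample_means(200, 159, mean=5, variance=3)
observed_mean = statistics.fmean(means)
observed_var = statistics.variance(means)
print(f"mean of sample means     = {observed_mean:.4f} (theory: 5)")
print(f"variance of sample means = {observed_var:.4f} (theory: {3 / 159:.4f})")
```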

Q.5. 200 random samples, each of size 25, were generated from N(7, 3).


Q.6. For a 95% confidence interval (CI), the corresponding Z value is 1.96. Hence the width of the
confidence interval will be 2 × Z0.975 × σ/sqrt(n2). We have σ² = 3, n2 = 25.
Hence CI width = 2 × 1.96 × sqrt(3/25) = 1.358, i.e. ±0.679. The corresponding CI and difference of
mean are plotted below.

[Figure: Difference of Mean for each of the 200 samples, with Upper Limit of CI (+0.679) and Lower Limit of CI (−0.679)]
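
The interval check in Q.6 can be simulated as below. This sketch assumes that "difference of mean" means the sample mean minus the population mean 7, which is an interpretation of the plot legend rather than something stated in the text; the function name and seed are also hypothetical.

```python
import math
import random
import statistics

def ci_coverage(num_samples, sample_size, mean, variance, z=1.96, seed=1):
    """Share of samples whose mean lies within z * sigma / sqrt(n) of the true mean."""
    rng = random.Random(seed)
    sd = math.sqrt(variance)
    half_width = z * math.sqrt(variance / sample_size)
    inside = 0
    for _ in range(num_samples):
        sample = [rng.gauss(mean, sd) for _ in range(sample_size)]
        difference = statistics.fmean(sample) - mean
        if abs(difference) <= half_width:
            inside += 1
    return half_width, inside / num_samples

half_width, coverage = ci_coverage(200, 25, mean=7, variance=3)
print(f"half-width = {half_width:.3f}, share of samples inside the limits = {coverage:.1%}")
```

Roughly 95% of the 200 differences are expected to fall between the two limits.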

Section D:
Q.1. I collected stock prices of Monnet Ispat & Energy Ltd, GAIL (India) Ltd, Alstom India Ltd, ABB
India Ltd and Siemens Ltd from 01.01.2016 to 30.06.2016.

Q2. Histogram of stock returns

Q.3. H0: Average stock returns of the five companies taken are equal (R1 = R2 = R3 = R4 = R5)
HA: At least one of the average stock returns is not equal to the other stock returns.
Reject H0 if p-value < 0.05

Since the p-value is greater than 0.05, we fail to reject H0; the data are consistent with the average
returns of the said companies being equal.

Q4.&5. Hypothesis Statement
H0: Average stock returns of each pair of companies are equal.
HA: Average stock returns of each pair of companies are not equal.
Rejection Rule:
Reject H0 if tcal lies in the rejection region, i.e. tcal > t0.975,df or tcal < t0.025,df
Observation:
In all the cases t0.025,df < tcal < t0.975,df, i.e. tcal lies in the acceptance region. Hence we don't have
sufficient evidence to reject H0 for any pair, and the data are consistent with equal average stock
returns for all pairs of companies.

Section E:
Q.1

Since the sample size is more than 30, i.e. 100, the average income of each sample will approximately
follow a normal distribution, N(40000, SD = 40000/sqrt(100))
Sample mean calculated = 40260.59 ~ 40000
Standard deviation = 3983.923 ~ 4000 (= 40000/10)
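
The income experiment can be sketched as follows: 500 samples of size 100 are drawn from an exponential distribution with mean 40,000, and the mean and standard deviation of the sample averages are compared with the CLT values. The function name and seed are hypothetical, so the output will not match the figures above exactly.

```python
import random
import statistics

def exponential_sample_means(num_samples, sample_size, mean_income, seed=2):
    """Sample averages of exponential incomes with the given mean."""
    rng = random.Random(seed)
    rate = 1 / mean_income  # expovariate takes the rate, i.e. 1 / mean
    averages = []
    for _ in range(num_samples):
        sample = [rng.expovariate(rate) for _ in range(sample_size)]
        averages.append(statistics.fmean(sample))
    return averages

averages = exponential_sample_means(500, 100, 40000)
avg_of_averages = statistics.fmean(averages)
sd_of_averages = statistics.stdev(averages)
print(f"mean of sample averages = {avg_of_averages:.0f} (theory: 40000)")
print(f"SD of sample averages   = {sd_of_averages:.0f} (theory: {40000 / 100 ** 0.5:.0f})")
```

A histogram of `averages` should look close to a bell curve even though individual incomes are strongly right-skewed.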

Q.2 Since the sample size is more than 30, i.e. 50, the average age of the samples should approximately
follow a normal distribution as per CLT.
Population average age = 25.8 and standard deviation = 5.216, so the sample average age must have mean
25.8 ~ 25.09 and standard deviation = 5.216/sqrt(50) = 0.738 ~ 0.64
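
The population mean and standard deviation quoted above can be reproduced from the age table if every age group is represented by its midpoint. That midpoint convention is an assumption made for this sketch; it happens to give 25.8 and 5.216. The sampling step also draws midpoints only, and the seed is arbitrary.

```python
import math
import random
import statistics

# (lower limit, upper limit, proportion) for each age group in the question
age_groups = [(15, 18, 0.1), (18, 21, 0.1), (21, 23, 0.1), (23, 25, 0.1), (25, 27, 0.2),
              (27, 29, 0.1), (29, 31, 0.1), (31, 33, 0.1), (33, 35, 0.1)]

midpoints = [(low + high) / 2 for low, high, _ in age_groups]
weights = [p for _, _, p in age_groups]

pop_mean = sum(m * p for m, p in zip(midpoints, weights))
pop_var = sum(p * (m - pop_mean) ** 2 for m, p in zip(midpoints, weights))
pop_sd = math.sqrt(pop_var)
se_mean = pop_sd / math.sqrt(50)  # SD of the average of a sample of size 50

# Draw 100 samples of size 50 and keep each sample's average age
rng = random.Random(3)
sample_averages = [statistics.fmean(rng.choices(midpoints, weights=weights, k=50))
                   for _ in range(100)]

print(f"population mean = {pop_mean:.1f}, SD = {pop_sd:.3f}, SD of sample average = {se_mean:.3f}")
print(f"simulated mean of sample averages = {statistics.fmean(sample_averages):.2f}")
```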
