
Data Analysis

Single variate Analysis: Average ( A.M, G.M, H.M, Me, Mo), Variation (s, M.D, R, C.V ),
Skewness, Kurtosis, Test of hypothesis

Bivariate Analysis: Simple correlation, Simple Regression, Trend Analysis, Ratio Analysis,
Test of hypothesis

Multivariate Analysis: Multiple Correlation, Multiple Regression ,Test of hypothesis

χ²-test (Chi-square test)


Karl Pearson first used it in 1900. The χ² test is one of the simplest and most
widely used non-parametric tests in statistical work. It makes no assumption
about the distribution of the population being sampled.

The quantity 2 describes the magnitude of discrepancy between theory and


observation. If 2 is zero, it means that the observed and expected frequencies
completely coincide. The greater the value of 2, the greater would be the
discrepancy between observed and expected frequencies. The formula for
computing 2 is

2= 
(O − E )2 =
O2
−n Where, O=observed frequency, E=expected
E E
theoretical frequency.
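The equality of the two forms of the formula can be checked by expanding the square. The short derivation below is a sketch that assumes the observed and expected frequencies share the same total n, as the conditions for the test require:

```latex
\begin{aligned}
\chi^2 &= \sum \frac{(O - E)^2}{E}
        = \sum \frac{O^2 - 2OE + E^2}{E} \\
       &= \sum \frac{O^2}{E} - 2\sum O + \sum E \\
       &= \sum \frac{O^2}{E} - 2n + n
        = \sum \frac{O^2}{E} - n
\end{aligned}
```

The second form is usually quicker for hand calculation because it avoids computing each difference O − E.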

Degrees of freedom:

The number of degrees of freedom is the number of observations that
are free to vary after certain restrictions have been imposed on the data.

Uses of 2-test:

2-test is used for testing single population variance, proportion,


correlation co-efficient, goodness of fit and independence of attributes.

Conditions for the application of χ² test:

The following six basic conditions must be met in order for chi-square
analysis to be applied.

i) The experimental data (sample observations) must be independent.
ii) The sample data must be drawn at random from the target
population.
iii) The data should be expressed in original units for convenience of
comparison, not in percentage or ratio form.
iv) The sample should contain at least 50 observations.
v) There should be no fewer than five observations in any one cell (each
data entry is known as a cell).
vi) The constraint on the cell frequencies is ΣO = ΣE.
Test of independence:

One of the most frequent uses of χ² is for testing the null hypothesis that
two criteria of classification are independent. They are independent if the
distribution of a criterion in no way depends on the distribution of the other
criterion. If they are not independent, there is an association between the
criteria.

Let us designate the two attributes as A and B, where attribute A is
assumed to have r categories and attribute B is assumed to have c categories.
Furthermore, assume the total number of observations in the problem is N.

A \ B    B1     B2     ...    Bj     ...    Bc     Total
A1       O11    O12    ...    O1j    ...    O1c    R1
A2       O21    O22    ...    O2j    ...    O2c    R2
...      ...    ...    ...    ...    ...    ...    ...
Ai       Oi1    Oi2    ...    Oij    ...    Oic    Ri
...      ...    ...    ...    ...    ...    ...    ...
Ar       Or1    Or2    ...    Orj    ...    Orc    Rr
Total    C1     C2     ...    Cj     ...    Cc     ΣRi = ΣCj = N

We want to test the null hypothesis, H0 : A and B are independent.

alternative hypothesis, H1: A and B are not independent.

To test the above null hypothesis, the required test statistic is given by

$$\chi^2 = \sum_i \sum_j \frac{O_{ij}^2}{E_{ij}} - N$$

where O_ij = observed frequency in the ith row and jth column, and

$$E_{ij} = \frac{R_i \times C_j}{N}$$

is the expected frequency, R_i being the row total and C_j the column total.
The statistic follows the χ² distribution with m = (r−1)(c−1) d.f.

If the calculated of 2 is greater than critical value, then null hypothesis will be
rejected otherwise accepted.

Example :
A sample of 200 people with a particular disease was selected. Out of these,
100 were given a drug and the others were not given any drug. The results are as
follows:

             Drug    No drug    Total
Cured         65       55        120
Not cured     35       45         80
Total        100      100        200

Test whether the drug is effective or not.

Solution:

Here, null hypothesis, H0: Drug treatment and cure of the disease are independent, i.e. the drug is
not effective.

Alternative hypothesis, H1: Drug treatment and cure of the disease are not
independent, i.e. the drug is effective.

To test the above null hypothesis the required test statistic is given by

$$\chi^2 = \sum_i \sum_j \frac{O_{ij}^2}{E_{ij}} - N$$

O_ij    E_ij = (R_i × C_j)/N         O_ij²/E_ij
65      (120 × 100)/200 = 60         65²/60 = 70.417
35      (80 × 100)/200 = 40          35²/40 = 30.625
55      (120 × 100)/200 = 60         55²/60 = 50.417
45      (80 × 100)/200 = 40          45²/40 = 50.625
Total                                202.083

χ² = 202.083 − 200

= 2.083

At =5%=0.05 level of significance with (r-1)(c-1)=(2-1)(2-1)=1 d.f. the critical


value of 2 is 3.84.

Since calculated value of 2 is less than critical value of 2 hence the null
hypothesis may be accepted, i.e. drug is not effective in curing the disease.
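The steps of the test of independence can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `chi_square_independence` is hypothetical, and the critical value 3.84 is taken from the text rather than computed. It uses the (O − E)²/E form of the statistic, which is algebraically the same as ΣO²/E − N.

```python
def chi_square_independence(table):
    """Return (chi2, df, expected) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Expected frequency of each cell: E_ij = R_i * C_j / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # chi2 = sum over all cells of (O - E)^2 / E
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Drug example: rows = cured / not cured, columns = drug / no drug
observed = [[65, 55], [35, 45]]
chi2, df, expected = chi_square_independence(observed)

CRITICAL_5PCT_1DF = 3.84  # tabulated value quoted in the text
print(round(chi2, 3), df, chi2 > CRITICAL_5PCT_1DF)  # 2.083 1 False
```

With these data the statistic comes out near 2.08, below 3.84, so the null hypothesis of independence is not rejected.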

Exercise:

To study the relationship between soft drink preference and gender among the students of an
institution, a survey is conducted on 135 students. The findings are as follows:

Gender / Soft drink    Coke    Pepsi    Seven-up
Male                    35      15        10
Female                  20      50         5

Test whether soft drink preference and gender are independent.

Z-test / Normal test

Let U be a statistic, and let E(U) and σ(U) be the expected value and standard deviation
of U respectively. If the population standard deviation is known, or is estimated from a
large sample (n ≥ 30), then the normal test or Z-test is defined as

$$Z = \frac{U - E(U)}{\sigma(U)} \quad \text{or} \quad Z = \frac{U - E(U)}{\text{estimated } \sigma(U)}$$

Z is distributed as normal with mean 0 and variance 1. If the calculated value of Z is
greater than or equal to the critical value of Z at the α level of significance, the null
hypothesis is rejected; otherwise the null hypothesis is accepted.

Uses of normal test:

Single population mean, two population means, proportion and correlation may
be tested by normal test.

Hypothesis Testing for Difference between Proportions


If two samples are drawn from two different populations, one may be interested in knowing
whether the difference between the proportion of successes is significant or not. In such a case,
we start with the hypothesis that the difference between the proportion of success in sample one
𝑝1 and the proportion of success in sample two 𝑝2 is due to fluctuations of random sampling. In
other words, we take the null hypothesis as H0: π1 = π2 and for testing the significance of
difference, we work out the test statistic as under

$$z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}}}$$

Where, p1 = proportion of success in sample one
p2 = proportion of success in sample two
n1 = size of sample one
n2 = size of sample two
Then, we construct the rejection region(s) depending upon the Ha for a given level of
significance and on its basis we judge the significance of the sample result for accepting or
rejecting H0 .

Example: A drug research experimental unit is testing two drugs newly developed to reduce
blood pressure levels. The drugs are administered to two different sets of animals. In group one,
350 of 600 animals tested respond to drug one and in group two, 260 of 500 animals tested
respond to drug two. The research unit wants to test whether there is a difference between the
efficacies of the said two drugs at 5 per cent level of significance. How will you deal with this
problem?

Solution:

We take the null hypothesis that there is no difference between the two drugs i.e., H0: π1 = π2

The alternative hypothesis can be taken as that there is a difference between the drugs, i.e.,
Ha: π1 ≠ π2

For testing the significance of difference, the required test statistic is as follows

$$z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}}}$$

Given information can be stated as: p1 = 350/ 600= 0.583, n1 = 600, p2 = 260/500 =0.520 , n2 =
500

Now,

$$z = \frac{(0.583 - 0.520) - 0}{\sqrt{\dfrac{0.583(1 - 0.583)}{600} + \dfrac{0.520(1 - 0.520)}{500}}} = \frac{0.063}{0.0301} = 2.093$$

At 𝛼 =5% level of significance, the critical value of z =1.96. Since calculated value of z=2.093
is greater than the critical value of z =1.96, hence null hypothesis may be rejected. Thus, we
conclude that the difference between the efficacies of the two drugs is significant.
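The calculation can be reproduced with a short Python sketch. The function name `two_proportion_z` is hypothetical, and the critical value 1.96 is the two-tailed 5% value quoted in the text. The sketch uses the unpooled standard error shown in the formula above and takes π1 − π2 = 0 under H0.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: pi1 = pi2, using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / standard_error


# 350 of 600 animals respond to drug one, 260 of 500 to drug two
z = two_proportion_z(350, 600, 260, 500)

Z_CRITICAL = 1.96  # two-tailed, alpha = 0.05, from the text
print(round(z, 2), abs(z) > Z_CRITICAL)
```

Carrying full precision gives a value of z close to 2.11, slightly above the hand calculation, which rounds the proportions and the standard error. The decision to reject H0 is the same either way.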

Exercise:

A company considers two different television advertisements A and B for promotion of a
product. Advertisement A is used in one area where, out of a random sample of 60 consumers,
18 tried the product. Advertisement B is used in another area and, out of a random sample of 100
consumers, 22 tried the product. Does this indicate that advertisement A is more effective than
advertisement B if a 5% level of significance is used?

F-test

The F-test is based on the F distribution, which was named in honor of R. A.
Fisher, who first introduced it in 1924. The F-distribution is usually defined in terms of
the ratio of the variances of two normally distributed populations. Therefore the
F-test is defined as

$$F = \frac{s_1^2}{s_2^2}$$

which follows the F-distribution with n1−1 and n2−1 degrees of freedom.

Uses of F–test:

The F-test is mainly used to test the null hypothesis regarding the equality of
two population variances, homogeneity of independent estimates of population
means, significance of the sample correlation ratio and also for testing the linearity of
regression.

Testing the hypothesis for equality of two variances:

We want to test the null hypothesis, H0: σ1² = σ2²

alternative hypothesis, H1: i) σ1² < σ2²,

ii) σ1² > σ2²,

iii) σ1² ≠ σ2²

To test the above null hypothesis the required test statistic is given by

$$F = \frac{s_1^2}{s_2^2} \quad (s_1^2 > s_2^2)$$

which follows the F-distribution with n1−1 and n2−1 degrees of freedom, or

$$F = \frac{s_2^2}{s_1^2} \quad (s_2^2 > s_1^2)$$

which follows the F-distribution with n2−1 and n1−1 degrees of freedom.

Example:

Two sources of raw materials are under consideration by a company. Both
sources seem to have similar characteristics but the company is not sure about their
respective uniformity. A sample of 10 lots from source A yields a variance of 225
and a sample of 11 lots from source B yields a variance of 200. Is it likely that the
variance of source A is significantly greater than the variance of source B?

Solution:

Here, null hypothesis, H0: σ1² = σ2², i.e. the variance of source A and that of
source B are the same.

Alternative hypothesis, H1: σ1² > σ2², i.e. the variance of source A is greater than that of
source B.

To test the above null hypothesis the required test statistic is given by

$$F = \frac{s_1^2}{s_2^2}$$

Here, s1² = 225 and s2² = 200

$$F = \frac{225}{200} = 1.125$$

The tabulated value of F with n1−1 = 10−1 = 9 and n2−1 = 11−1 = 10 d.f. at α = 0.05 level
of significance is 3.02. Since the calculated value of F is less than the tabulated value,
the null hypothesis may be accepted, i.e. the variance of source A and that of source B
may be regarded as the same.
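A minimal Python sketch of the variance-ratio calculation follows. The helper name `variance_ratio_f` is hypothetical; it places the larger sample variance in the numerator, as the two forms of the test statistic above require, and returns the matching degrees of freedom. The critical value is the tabulated figure quoted in the text.

```python
def variance_ratio_f(var_a, n_a, var_b, n_b):
    """Return (F, df_numerator, df_denominator), larger variance on top."""
    if var_a >= var_b:
        return var_a / var_b, n_a - 1, n_b - 1
    return var_b / var_a, n_b - 1, n_a - 1


# Source A: 10 lots, variance 225; source B: 11 lots, variance 200
f_stat, df_num, df_den = variance_ratio_f(225, 10, 200, 11)

F_CRITICAL = 3.02  # F(9, 10) at alpha = 0.05, from the text
print(f_stat, df_num, df_den, f_stat > F_CRITICAL)  # 1.125 9 10 False
```

Because 1.125 is below 3.02, the sketch reaches the same decision as the worked solution.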

Exercise:
A sample of the monthly earnings records of 15 employees of company A has a variance
of Tk. 15.90 while a similar sample of 27 employees for company B has a variance of Tk. 17.50.
Is it safe to assume that there is less variance in company A than in company B?

Exercise:
CGPA of two sections of students each section containing 10 students of BBA 7th
semester of IIUC is given below:

CGPA of section A: 3.90, 4.00, 3.78, 3.50, 2.90, 3.45, 3.80, 3.95, 3.98, 3.70

CGPA of section B: 4.00, 2.95, 3.50, 3.25, 3.59, 3.80, 3.60, 3.90, 3.60, 3.55

Test equality of variation of CGPA between two sections at α =0.05.

Example: The following data gives the yield of wheat, amount of fertilizer and level of irrigation
of seven fields:

Yield of wheat (in 100 kg)    Amount of fertilizer (kg/acre)    Level of irrigation
40                            10                                100
50                            20                                200
50                            30                                300
70                            20                                400
65                            25                                450
68                            32                                470
80                            35                                500
i) Find regression equation of yield of wheat on amount of fertilizer and level of
irrigation;
ii) Estimate probable yield of wheat if amount of fertilizer and level of irrigation are 40
and 520 respectively;
iii) Construct analysis of variance (ANOVA) table;
iv) Find co-efficient of multiple determination and comment;
v) Test whether the regression as a whole is significant;

Solution:

i)

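Here y = yield of wheat (in 100 kg), x1 = amount of fertilizer (kg/acre) and x2 = level of irrigation.

y      x1     x2      y²       x1²     x2²       yx1      yx2       x1x2
40     10     100     1600     100     10000     400      4000      1000
50     20     200     2500     400     40000     1000     10000     4000
50     30     300     2500     900     90000     1500     15000     9000
70     20     400     4900     400     160000    1400     28000     8000
65     25     450     4225     625     202500    1625     29250     11250
68     32     470     4624     1024    220900    2176     31960     15040
80     35     500     6400     1225    250000    2800     40000     17500
Σ 423  172    2420    26749    4674    973400    10901    158210    65790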

We have, regression equation of yield of wheat on amount of fertilizer and level of irrigation

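y = a + b1x1 + b2x2

where

$$a = \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2$$

$$\bar{y} = \frac{\sum y}{n} = \frac{423}{7} = 60.43, \quad \bar{x}_1 = \frac{\sum x_1}{n} = \frac{172}{7} = 24.57, \quad \bar{x}_2 = \frac{\sum x_2}{n} = \frac{2420}{7} = 345.71$$

$$b_1 = \frac{SS(x_2)\,SP(x_1 y) - SP(x_1 x_2)\,SP(x_2 y)}{SS(x_1)\,SS(x_2) - \{SP(x_1 x_2)\}^2}$$

$$b_2 = \frac{SS(x_1)\,SP(x_2 y) - SP(x_1 x_2)\,SP(x_1 y)}{SS(x_1)\,SS(x_2) - \{SP(x_1 x_2)\}^2}$$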

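$$SS(x_1) = \sum x_1^2 - \frac{(\sum x_1)^2}{n} = 4674 - \frac{(172)^2}{7} = 447.71$$

$$SS(x_2) = \sum x_2^2 - \frac{(\sum x_2)^2}{n} = 973400 - \frac{(2420)^2}{7} = 136771.43$$

$$SS(y) = \sum y^2 - \frac{(\sum y)^2}{n} = 26749 - \frac{(423)^2}{7} = 1187.71$$

$$SP(x_1 y) = \sum x_1 y - \frac{\sum x_1 \sum y}{n} = 10901 - \frac{172 \times 423}{7} = 507.29$$

$$SP(x_2 y) = \sum x_2 y - \frac{\sum x_2 \sum y}{n} = 158210 - \frac{2420 \times 423}{7} = 11972.86$$

$$SP(x_1 x_2) = \sum x_1 x_2 - \frac{\sum x_1 \sum x_2}{n} = 65790 - \frac{172 \times 2420}{7} = 6327.14$$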

$$b_1 = \frac{SS(x_2)\,SP(x_1 y) - SP(x_1 x_2)\,SP(x_2 y)}{SS(x_1)\,SS(x_2) - \{SP(x_1 x_2)\}^2} = \frac{136771.43 \times 507.29 - 6327.14 \times 11972.86}{447.71 \times 136771.43 - \{6327.14\}^2} = -0.301$$

$$b_2 = \frac{SS(x_1)\,SP(x_2 y) - SP(x_1 x_2)\,SP(x_1 y)}{SS(x_1)\,SS(x_2) - \{SP(x_1 x_2)\}^2} = \frac{447.71 \times 11972.86 - 6327.14 \times 507.29}{447.71 \times 136771.43 - \{6327.14\}^2} = 0.101$$

a = 60.43 − (−0.301) × 24.57 − 0.101 × 345.71 = 32.91

Therefore, the required regression equation is

y = a + b1x1 + b2x2 = 32.91 + (−0.301) x1 + 0.101 x2 = 32.91 − 0.301 x1 + 0.101 x2
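The coefficient formulas can be checked against the raw data with a short Python sketch. The helper names `ss` and `sp` are hypothetical stand-ins for the corrected sum of squares SS(·) and sum of products SP(·, ·) used in the solution. No intermediate rounding is applied here, so the intercept comes out slightly different from the hand calculation, which works with b1 and b2 rounded to three decimals.

```python
y = [40, 50, 50, 70, 65, 68, 80]             # yield of wheat (in 100 kg)
x1 = [10, 20, 30, 20, 25, 32, 35]            # fertilizer (kg/acre)
x2 = [100, 200, 300, 400, 450, 470, 500]     # level of irrigation
n = len(y)


def ss(u):
    """Corrected sum of squares: sum(u^2) - (sum u)^2 / n."""
    return sum(v * v for v in u) - sum(u) ** 2 / n


def sp(u, v):
    """Corrected sum of products: sum(uv) - (sum u)(sum v) / n."""
    return sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n


denominator = ss(x1) * ss(x2) - sp(x1, x2) ** 2
b1 = (ss(x2) * sp(x1, y) - sp(x1, x2) * sp(x2, y)) / denominator
b2 = (ss(x1) * sp(x2, y) - sp(x1, x2) * sp(x1, y)) / denominator
a = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n

print(round(b1, 3), round(b2, 3), round(a, 2))
```

The slopes agree with the hand values to three decimals (about −0.301 and 0.101), while the unrounded intercept is close to 32.74 rather than 32.91.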

ii) Given x1 = 40 and x2 = 520

Thus, the probable yield will be

y = 32.91 − 0.301×40 + 0.101×520 = 73.39 (in 100 kg), i.e. about 7339 kg

iii) The analysis of variance table is shown below:

Sources of variation    Degrees of freedom (df)    Sum of Squares (SS)    Mean Sum of Squares (MSS)    Calculated F    Critical F
Regression              3−1 = 2                    1056.56                528.28                       16.11           6.94
Error                   6−2 = 4                    131.15                 32.79                        ------          ---
Total                   7−1 = 6                    1187.71                -----                        ------          ----

SS(Regression) = b1 SP(x1y) + b2 SP(x2y) = −0.301×507.29 + 0.101×11972.86 = 1056.56
SS(Error) = SS(Total) − SS(Regression) = 1187.71 − 1056.56 = 131.15

iv) We have, co-efficient of multiple determination

$$R^2 = \frac{SS(\text{Regression})}{SS(\text{Total})} = \frac{1056.56}{1187.71} = 0.89 = 89\%$$

R² indicates that 89% of the variation in yield of wheat is explained by the amount of
fertilizer and the level of irrigation.

v) Here, Null hypothesis, H0: the regression as a whole is insignificant

Alternative hypothesis, Ha: the regression as a whole is significant

To test the above null hypothesis, the required test statistic is given as

$$F = \frac{\text{Regression MSS}}{\text{Error MSS}} = \frac{528.28}{32.79} = 16.11$$

The tabulated value of F with 2 and 4 d.f. at α = 0.05 level of significance is 6.94. Since the
calculated value of F is greater than the tabulated value, the null hypothesis may be rejected and the
alternative hypothesis may be accepted, i.e. the regression as a whole is significant.
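The ANOVA table, R² and the overall F statistic all follow from SS(Regression), SS(Total), the number of observations n and the number of predictors k. The Python sketch below shows that bookkeeping. The function name `regression_anova` and the dictionary keys are hypothetical, and the critical value is the tabulated figure quoted in the text.

```python
def regression_anova(ss_regression, ss_total, n, k):
    """ANOVA quantities for a regression with n observations, k predictors."""
    ss_error = ss_total - ss_regression
    df_regression, df_error = k, n - k - 1
    ms_regression = ss_regression / df_regression
    ms_error = ss_error / df_error
    return {
        "df": (df_regression, df_error, n - 1),
        "ss_error": ss_error,
        "ms_regression": ms_regression,
        "ms_error": ms_error,
        "f": ms_regression / ms_error,
        "r_squared": ss_regression / ss_total,
    }


# Wheat example: 7 fields, 2 predictors (fertilizer and irrigation)
wheat = regression_anova(1056.56, 1187.71, n=7, k=2)

F_CRITICAL = 6.94  # F(2, 4) at alpha = 0.05, from the text
print(wheat["df"], round(wheat["f"], 2), round(wheat["r_squared"], 2))
print("regression significant:", wheat["f"] > F_CRITICAL)
```

The error degrees of freedom n − k − 1 give 4 here, matching the 6 − 2 = 4 in the table.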

Exercise: A soft drink bottler is analyzing the vending machine serving routes in his
distribution system. He is interested in predicting the time required by the distribution driver to
service the vending machines in an outlet. This service activity includes stocking the machines
with new beverage products and performing minor maintenance or housekeeping. It has been
suggested that the two most important variables influencing delivery time (y in min) are the
number of cases of product stocked (x1) and the distance walked by the driver (x2 in feet). 25
observations on delivery times, cases stocked and walking distances have been recorded.

The estimated model: ŷ = 2.341 + 1.616x1 + 0.0144x2.

SS (total) = 133.7 and SS (regression) = 120.10

i) Explain the model.
ii) Construct an analysis of variance table.
iii) Is regression significant?
iv) Calculate percentage amount of variation explained by the model and comment.

Solution:

ii)

Sources of variation    df    SS        MSS    Cal F    Critical F
Regression              2     120.10
Error
Total                         133.7

Exercise: The district manager of Jasons, a large discount retail chain, is investigating why
certain stores in her region are performing better than others. She believes that three factors are
related to total sales (Y): the number of competitors in the region (X1), the population in the
surrounding area (X2), and the amount spent on advertising (X3). From her district, consisting of
several hundred stores, she selects a random sample of 30 stores. The sample data were run on
the SPSS software package and the results, with some missing figures, are given below:

Analysis of variance table

Sources of variation    df    SS      MSS    Cal F    Critical F
Regression              --    3050
Error                   26    --
Total                   29    5250

Predictor    Coef    StDev    t-ratio
Constant     14      7
X1           -1      0.70
X2           30      5.20
X3           0.20    0.08

i) Complete the ANOVA table.
ii) At the significance level of 0.05, test whether the regression as a whole is significant.
iii) Conduct a test of hypothesis on each of the regression coefficients. Could you delete any
of the variables?
iv) Calculate the value of R2 and interpret the result.

Solution:

i)

Analysis of variance table

Sources of variation    df    SS      MSS                 Cal F    Critical F
Regression              3     3050    3050/3 = 1016.67    12.01
Error                   26    2200    2200/26 = 84.62     ----
Total                   29    5250    -----               -----

Predictor    Coef    StDev    t-ratio
Constant     14      7        ----
X1           -1      0.70     -1/0.70 = -1.43
X2           30      5.20     30/5.20 = 5.77
X3           0.20    0.08     0.20/0.08 = 2.5
ii) Here, Null hypothesis, H0: the regression as a whole is insignificant

Alternative hypothesis, Ha: the regression as a whole is significant

To test the above null hypothesis, the required test statistic is given as

$$F = \frac{\text{Regression MSS}}{\text{Error MSS}} = \frac{1016.67}{84.62} = 12.01$$

The tabulated value of F with 3 and 26 d.f. at α = 0.05 level of significance is 2.98. Since the
calculated value of F is greater than the tabulated value, the null hypothesis may be rejected and the
alternative hypothesis may be accepted, i.e. the regression as a whole is significant.

iii) Here, Null hypothesis, H0: Slope of the number of competitors in the region is insignificant

Alternative hypothesis, Ha: Slope of the number of competitors in the region is significant

To test the above null hypothesis, the required test statistic is given as

$$t = \frac{\text{Co-efficient of } X_1}{\text{StDev}} = \frac{-1}{0.70} = -1.43$$

Absolute t = 1.43

The tabulated value of t with 26 d.f. at α = 0.05 level of significance is 2.056. Since the
calculated value of t is smaller than the tabulated value, the null hypothesis may be accepted and the
alternative hypothesis may be rejected, i.e. the slope of the number of competitors in the region is
insignificant. Thus, X1 could be deleted from the model. By the same test, the t-ratios of X2 (5.77)
and X3 (2.5) exceed 2.056, so their slopes are significant and these variables should be kept.
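The same t-test can be applied to every coefficient at once. The Python sketch below is illustrative only: the names `coefficients` and `t_ratios` are hypothetical, the inputs are the Coef and StDev columns of the predictor table, and the critical value 2.056 is the tabulated two-tailed figure quoted in the text.

```python
# (coefficient, standard deviation) for each predictor, from the table
coefficients = {
    "X1": (-1.0, 0.70),
    "X2": (30.0, 5.20),
    "X3": (0.20, 0.08),
}
T_CRITICAL = 2.056  # t with 26 d.f., alpha = 0.05 (two-tailed), from the text


def t_ratios(coefs):
    """t-ratio of each slope: coefficient divided by its standard deviation."""
    return {name: coef / sd for name, (coef, sd) in coefs.items()}


ratios = t_ratios(coefficients)
significant = {name: abs(t) > T_CRITICAL for name, t in ratios.items()}

for name, t in ratios.items():
    verdict = "significant" if significant[name] else "not significant"
    print(name, round(t, 2), verdict)
```

Only X1 falls below the critical value in absolute terms, which is why it is the candidate for deletion.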
