Educational Statistics
********************
Q.No.01. How is mode calculated? Also discuss its merits and demerits.
The mode or modal value of a distribution is that value of the variable for which the frequency is maximum,
i.e. the value which is repeated the maximum number of times in the data.
• Merits of Mode:
▪ It is readily comprehensible and easy to compute. In some cases it can be computed merely by inspection.
▪ It is not affected by extreme values. It can be obtained even if the extreme values are not known.
▪ Mode can be determined in distributions with open classes.
▪ Mode can be located on the graph also.
▪ It can be calculated even when all the data are not available.
▪ It is easy to understand.
▪ It is the most frequently observed data point.
▪ Sometimes we get more than one mode and sometimes there is no mode; this itself tells us something about
the characteristics of the data.
• Demerits of Mode:
▪ It is ill-defined. It is not always possible to find a clearly defined mode. In some cases, we may come across
distributions with two modes. Such distributions are called bimodal. If a distribution has more than two
modes, it is said to be multimodal.
▪ It is not based upon all the observations.
▪ Mode can be calculated by various formulae, and the value may differ from one formula to another. Therefore, it
is not rigidly defined.
▪ It is affected to a greater extent by fluctuations of sampling.
• Uses of Mode:
Mode is used by manufacturers of ready-made garments, shoes and other accessories in common use. A
garment manufacturer produces more of those sizes which are worn by most people than of the other sizes.
Similarly, a shoe maker will produce the most common size in the largest quantity and the other sizes in
smaller quantities. Mode is also used when the data are categorical.
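As a sketch in Python, the mode can be computed directly; the shoe-size data below are invented for illustration.

```python
# A minimal sketch of computing the mode; the shoe-size data are
# invented for illustration.
from statistics import multimode

shoe_sizes = [7, 8, 8, 9, 8, 10, 7, 8, 9, 8]

# multimode returns every value with the highest frequency, so it also
# reports bimodal/multimodal data instead of failing.
modes = multimode(shoe_sizes)
print(modes)  # [8] -- size 8 occurs most often (5 times)
```

multimode (Python 3.8+) returns a list, which matches the observation above that a distribution may have one mode or several.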
*********************
• If there is one group being compared against a standard value (e.g. comparing the acidity of a liquid to a
neutral pH of 7), perform a one-sample t-test.
Performing a t-test
The t-test estimates the true difference between two group means using the ratio of the difference in group
means over the pooled standard error of both groups. You can calculate it manually using a formula or use
statistical analysis software.
T-test formula
The formula for the two-sample t-test (a.k.a. the Student's t-test) is:
t = (x̄1 − x̄2) / √( s² (1/n1 + 1/n2) )
In this formula, t is the t-value, x̄1 and x̄2 are the means of the two groups being compared, s² is the pooled
variance of the two groups (which, combined with the sample sizes, gives the pooled standard error), and n1
and n2 are the numbers of observations in each of the groups.
A larger t-value shows that the difference between group means is greater than the pooled standard error,
indicating a more significant difference between the groups.
You can compare your calculated t-value against the values in a critical value chart to determine whether your t-
value is greater than what would be expected by chance. If so, you can reject the null hypothesis and conclude
that the two groups are in fact different.
In R, for example, the test can be run with the t.test function:
t.test(Petal.Length ~ Species, data = flower.data)
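For readers without R, here is a plain-Python sketch of the same pooled-variance t statistic; the two groups of measurements are invented for illustration.

```python
from statistics import mean, variance
import math

def two_sample_t(group1, group2):
    """Pooled-variance (Student's) two-sample t statistic."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    # Pooled variance: df-weighted average of the two sample variances
    s2 = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    # t = difference in means over the pooled standard error
    return (m1 - m2) / math.sqrt(s2 * (1 / n1 + 1 / n2))

a = [5.1, 4.9, 5.4, 5.0, 5.2]  # invented measurements, group 1
b = [5.9, 6.1, 5.8, 6.0, 6.2]  # invented measurements, group 2
t = two_sample_t(a, b)
print(round(t, 2))  # roughly -7.9: the means differ by far more than the pooled standard error
```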
********************
Regression analysis includes several variations, such as linear, multiple linear, and nonlinear. The most
common models are simple linear and multiple linear. Nonlinear regression analysis is commonly used for more
complicated data sets in which the dependent and independent variables show a nonlinear relationship.
Regression analysis offers numerous applications in various disciplines, including finance.
Regression Analysis – Linear Model Assumptions
Linear regression analysis is based on six fundamental assumptions:
1. The dependent and independent variables show a linear relationship; the model is linear in its parameters (the slope and the intercept).
2. The independent variable is not random.
3. The mean of the residual (error) is zero.
4. The variance of the residual (error) is constant across all observations (homoscedasticity).
5. The value of the residual (error) is not correlated across all observations.
6. The residual (error) values follow the normal distribution.
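A minimal sketch of fitting the simple linear model these assumptions refer to, using ordinary least squares; the data points are invented for illustration.

```python
from statistics import mean

x = [1, 2, 3, 4, 5]                # invented independent variable
y = [2.1, 3.9, 6.2, 7.8, 10.1]     # invented dependent variable

xbar, ybar = mean(x), mean(y)
# Slope: sum of cross-deviations over sum of squared x-deviations
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
    (xi - xbar) ** 2 for xi in x
)
b0 = ybar - b1 * xbar  # intercept

# Assumption 3 in practice: least-squares residuals sum to zero
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(round(b1, 2), round(b0, 2), round(sum(residuals), 9))
```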
***********************
Q.No.04. Explain assumptions of applying One-way ANOVA and its procedure.
In some decision-making situations, the sample data may be divided into various groups, i.e. the sample may be
supposed to consist of k sub-samples. The interest lies in examining whether the total sample can
be considered homogeneous or whether there is some indication that the sub-samples have been drawn from different
populations. In these situations, we must compare the mean values of the various groups with respect to one or
more criteria.
The total variation present in a set of data may be partitioned into several non-overlapping components as per
the nature of the classification. The systematic procedure to achieve this is called Analysis of Variance
(ANOVA). With the help of such a partitioning, some testing of hypothesis may be performed.
Initially, Analysis of Variance (ANOVA) was employed only for experimental data from
Randomized Designs, but it has later been used for analyzing survey and secondary data from
Descriptive Research.
Analysis of Variance may also be visualized as a technique to examine a dependence relationship where the
response (dependent) variable is metric (measured on an interval or ratio scale) and the factors (independent
variables) are categorical in nature, with more than two categories.
Example of ANOVA
Ventura is an FMCG company selling a range of products. Its outlets are spread over the entire state. For
administrative and planning purposes, Ventura has sub-divided the state into four geographical regions
(Northern, Eastern, Western and Southern). Random sample data on sales were collected from outlets spread
over the four geographical regions.
Variation, being a fundamental characteristic of data, would always be present. Here, the total variation in the
sales may be measured by the squared sum of deviation from the mean sales. If we analyze the sources of
variation in the sales, in this case, we may identify two sources:
• Sales within a region would differ and this would be true for all four regions (within-group
variations)
• There might be impact of the regions and mean-sales of the four regions would not be all the same
i.e. there might be variation among regions (between-group variations).
So, the total variation present in the sample data may be partitioned into two components, between-regions and
within-regions, and their magnitudes may be compared to decide whether there is a substantial difference in
sales with respect to regions. If the two variations are in close agreement, then there is no reason to believe that
sales are not the same in all four regions; if not, it may be concluded that there exists a substantial
difference between some or all of the regions.
Here, it should be kept in mind that ANOVA partitions the variation into assignable causes and a
random component, and through this partitioning the ANOVA technique may be used as a method for testing the
significance of differences among means (more than two).
Conceptual Background:
The fundamental concept behind the Analysis of Variance is the “Linear Model”.
Let X1, X2, ………, Xn be observable quantities. Each value can be expressed as:
Xi = µi + ei
where µi is the true value, which is due to some assignable causes, and ei is the error term, which is due to
random causes. Here, it has been assumed that all error terms ei are independently distributed normal variates
with mean zero and common variance (σe²).
Further, the true value µi may be assumed to consist of a linear function of t1, t2, ……, tk, known as “effects”.
If, in a linear model, all the effects tj are unknown constants (parameters), then the model is known as a
“fixed-effect model”. Otherwise, if the effects tj are random variables, the model is known as a “random-
effect model”.
One-Way Analysis of Variance
We have n observations Xij, divided into k groups A1, A2, ……, Ak, with the ith group having ni observations.
Here, the proposed fixed-effect linear model is:
Xij = µi + eij
Where µi is the mean of the ith group.
General effect (grand mean): µ = Σ (ni. µi)/n
and additional effect of the ith group over the general effect: αi = µi – µ.
So, the linear model becomes:
Xij = µ + αi + eij
with Σi (ni αi) = 0
The least-square estimates of µ and αi may be determined by minimizing the error sum of squares Σi Σj eij² =
Σi Σj (Xij − µ − αi)², giving the estimates of µ and αi as X.. and (Xi. − X..) respectively,
where X.. is the combined (grand) mean of the sample and Xi. is the mean of the ith group in the sample.
So, the estimated linear model becomes:
Xij = X.. + (Xi. − X..) + (Xij − Xi.)
This can be further resolved into:
Σi Σj (Xij − X..)² = Σi ni (Xi. − X..)² + Σi Σj (Xij − Xi.)²
Total Sum of Squares = Sum of squares due to group effect + Sum of squares due to error
or
Total Sum of Squares = Between-Group Sum of Squares + Within-Group Sum of Squares
TSS = SSB + SSE
Further, Mean Sum of Square may be given as:
MSB = SSB/(k-1) and MSE = SSE/(n-k),
where (k−1) is the degrees of freedom (df) for SSB and (n−k) is the df for SSE.
Here, it should be noted that SSB and SSE add up to TSS, and the corresponding df's (k−1) and (n−k) add up
to the total df (n−1), but MSB and MSE do not add up to a Total MS.
Thus, by partitioning TSS and the total df into two components, we may be able to test the hypothesis:
H0: µ1 = µ2=……….= µk
H1: Not all µ's are the same, i.e. at least one µ is different from the others.
or alternatively:
H0: α1 = α2=……….= αk =0
H1: Not all α’s are zero i.e. at least one α is different from zero.
MSE is always an unbiased estimate of σe², and if H0 is true then MSB is also an unbiased estimate
of σe².
Further, under H0, SSB/σe² follows a Chi-square (χ²) distribution with (k−1) df, and SSE/σe² follows a Chi-square (χ²)
distribution with (n−k) df. These two χ² variates are independent, so the ratio of the two, each divided by its df,
F = MSB/MSE, follows the variance-ratio distribution (F distribution) with (k−1), (n−k) df.
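The partition TSS = SSB + SSE and the F statistic above can be sketched numerically; the three groups of observations below are invented for illustration.

```python
from statistics import mean

groups = [
    [12, 15, 14, 13],   # group A1 (invented data)
    [18, 20, 19, 21],   # group A2
    [11, 12, 13, 12],   # group A3
]

n = sum(len(g) for g in groups)   # total number of observations
k = len(groups)                   # number of groups
grand = mean([x for g in groups for x in g])

# Between-group sum of squares: sum_i n_i (Xbar_i - Xbar..)^2
ssb = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
# Within-group (error) sum of squares: sum_i sum_j (X_ij - Xbar_i)^2
sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
# Total sum of squares equals their sum: TSS = SSB + SSE
tss = sum((x - grand) ** 2 for g in groups for x in g)

F = (ssb / (k - 1)) / (sse / (n - k))   # MSB / MSE
print(ssb, sse, tss, round(F, 2))       # 126.0 12.0 138.0 47.25
```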
Here, the test based on the statistic F is a right-tailed (one-tailed) test. Accordingly, the p-value may be estimated to decide
whether or not to reject the null hypothesis H0.
If H0 is rejected, i.e. all µ's are not the same, rejecting the null hypothesis does not tell us which group-means
are different from the others. So, Post-Hoc Analysis is to be performed to identify which group-means are
significantly different from the others. A Post-Hoc Test takes the form of multiple comparisons, testing the equality of
two group-means (two at a time), i.e. H0: µp = µq, by using a two-group independent-samples test, or by comparing
the difference between sample means (two at a time) with the least significant difference (LSD)/critical
difference (CD):
LSD = t(α/2, error df) × √( MSE (1/np + 1/nq) )
If the observed difference between two means is greater than the LSD/CD, then the corresponding null hypothesis
is rejected at the α level of significance.
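The LSD comparison can be sketched with invented numbers: MSE = 12/9 with error df = 9 from a hypothetical three-group ANOVA, and the critical value t(0.025, 9 df) ≈ 2.262 read from a standard t table.

```python
import math

mse = 12 / 9          # within-group mean square (invented ANOVA result)
df_error = 9          # error degrees of freedom, n - k
t_crit = 2.262        # t(alpha/2 = 0.025, 9 df), from a standard t table
n_p, n_q = 4, 4       # sizes of the two groups being compared

lsd = t_crit * math.sqrt(mse * (1 / n_p + 1 / n_q))
mean_p, mean_q = 13.5, 19.5   # invented group means

# Reject H0: mu_p = mu_q when the observed difference exceeds the LSD
print(round(lsd, 3), abs(mean_p - mean_q) > lsd)   # 1.847 True
```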
It should be noted that the Linear Model used in ANOVA is not much affected by minor deviations from the assumptions,
especially if the sample is large.
The Normality Assumption may be checked using Tests of Normality: Shapiro-Wilk Test and Kolmogorov-
Smirnov Test with Lilliefors Significance Correction. Here, Normal Probability Plots (P-P Plots and Q-Q Plots)
may also be used for checking normality assumption. The assumption of Equality of Variances
(homoscedasticity) may be checked using different Tests of Homogeneity of Variances (Levene Test, Bartlett’s
test, Brown–Forsythe test etc.).
ANOVA vs T-test:
We employ the two-independent-samples T-test to examine whether there exists a significant difference in the means
of two categories, i.e. whether the two samples have come from the same or different populations. As an extension, one
might perform multiple T-tests (taking two groups at a time) to examine the significance of the
differences in the means of k samples in place of ANOVA. If this is attempted, then the errors involved in the
testing of hypotheses (type I and type II errors) can't be controlled correctly, and the value of the type I error will be
much more than α (the significance level). So, in this situation, ANOVA is always preferred over multiple
independent-samples T-tests.
For example, we have four categories of regions: Northern (N), Eastern (E), Western (W) and
Southern (S). If we want to compare the population means using the two-independent-samples T-test, i.e. by
taking two categories (groups) at a time, we have to make 4C2 = 6 comparisons, i.e. six independent-
samples tests have to be performed (comparing N with S, N with E, N with W, E with W, E with S and W
with S). Suppose we are using a 5% level of significance for the null hypothesis based on the six individual T-tests;
then the overall type I error will be 1 − (0.95)⁶ = 1 − 0.735 = 0.265, i.e. 26.5%.
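The familywise error arithmetic above can be checked directly:

```python
# Probability of at least one false rejection across six independent
# tests, each run at alpha = 0.05
alpha = 0.05
m = 6                                  # number of pairwise comparisons, 4C2
familywise = 1 - (1 - alpha) ** m
print(round(familywise, 3))            # 0.265
```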
*********************
χ² = Σ (O − E)² / E
where χ² = the Chi-Square goodness of fit statistic, O = observed value, and E = expected value.
3. Degrees of freedom: In the Chi-Square goodness of fit test, the degrees of freedom depend on the
distribution being fitted to the sample.
• Hypothesis testing:
Hypothesis testing with the Chi-Square goodness of fit test is the same as in other tests, like the t-test, ANOVA, etc. The
calculated value of the Chi-Square goodness of fit statistic is compared with the table value. If the calculated value
is greater than the table value, we reject the null hypothesis and conclude
that there is a significant difference between the observed and the expected frequencies. If the calculated value
is less than the table value, we fail to reject the null hypothesis and conclude that
there is no significant difference between the observed and expected values.
The Chi-Square goodness of fit test is a non-parametric test that is used to find out whether the observed values of a
given phenomenon differ significantly from the expected values. In the Chi-Square goodness of fit test, the
term "goodness of fit" refers to comparing the observed sample distribution with the expected probability
distribution. The test determines how well a theoretical distribution (such as the normal,
binomial, or Poisson) fits the empirical distribution. In this test, the sample data are divided
into intervals. Then the number of points that fall into each interval is compared with the expected number of
points in that interval.
We use the chi-square test to test the validity of a distribution assumed for a random phenomenon. The test
evaluates the null hypothesis H0 (that the data are governed by the assumed distribution) against the alternative
(that the data are not drawn from the assumed distribution).
Let p1, p2, ..., pk denote the probabilities hypothesized for k possible outcomes. In n independent trials, we
let Y1, Y2, ..., Yk denote the observed counts of each outcome which are to be compared to the expected
counts np1, np2, ..., npk. The chi-square test statistic is
qk−1 = (Y1 − np1)²/np1 + (Y2 − np2)²/np2 + ... + (Yk − npk)²/npk
Reject H0 if this value exceeds the upper α critical value of the χ² distribution with (k−1) degrees of freedom,
where α is the desired level of significance.
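As a sketch, the statistic can be computed for a die hypothesized to be fair; the observed counts are invented, and the critical value 11.07 (χ² with 5 df at α = 0.05) is read from a standard table.

```python
observed = [22, 17, 19, 18, 25, 19]   # invented counts of faces 1..6
n = sum(observed)                      # number of trials (120)
p = [1 / 6] * 6                        # hypothesized fair-die probabilities
expected = [n * pi for pi in p]        # np_i = 20 for every face

# q_{k-1} = sum_i (Y_i - np_i)^2 / (np_i)
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Compare against the upper critical value of chi-square with
# k - 1 = 5 degrees of freedom at alpha = 0.05 (11.07, from a table)
print(round(chi_sq, 2), chi_sq > 11.07)   # 2.2 False -> cannot reject H0
```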
THE END
********************