Educational Statistics


ASSIGNMENT# 02

Name: Zia Ur Rehman

Registration No.: 21pkb00396

Roll No.: CE605574

Program: B.Ed. (1.5 Years)

Course Title: Educational Statistics

Course Code: 8614

Semester: Spring 2022

********************

Q.No.01. How is mode calculated? Also discuss its merits and demerits.

The mode or modal value of a distribution is that value of the variable for which the frequency is maximum; in raw data, the value that is repeated the greatest number of times is the mode. For grouped data, the mode lies in the class with the highest frequency (the modal class) and is commonly estimated as Mode = l + ((f1 − f0) / (2f1 − f0 − f2)) × h, where l is the lower boundary of the modal class, f1 its frequency, f0 and f2 the frequencies of the preceding and following classes, and h the class width.
• Merits of Mode:
▪ It is readily comprehensible and easy to compute. In some cases it can be computed merely by inspection.
▪ It is not affected by extreme values. It can be obtained even if the extreme values are not known.
▪ Mode can be determined in distributions with open classes.
▪ Mode can also be located on a graph.
▪ Mode can be calculated even if not all the data are given.
▪ It is easy to understand.
▪ It is the most frequently observed data point.
▪ Sometimes we get more than one mode and sometimes there is no mode; this in itself tells us something about the character of the data.
• Demerits of Mode:
▪ It is ill-defined. It is not always possible to find a clearly defined mode. In some cases, we may come across distributions with two modes; such distributions are called bimodal. If a distribution has more than two modes, it is said to be multimodal.
▪ It is not based upon all the observations.
▪ Mode can be calculated by various formulae, and the resulting value may differ from one formula to another. Therefore, it is not rigidly defined.
▪ It is affected to a greater extent by fluctuations of sampling.

• Uses of Mode:
Mode is used by manufacturers of ready-made garments, shoes and other accessories in common use. Garment manufacturers produce more of the sizes worn by most people than of the other sizes. Similarly, shoe makers produce the size used by the majority in the largest quantity and the other sizes in smaller quantities. Mode is also used when the data are categorical.
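A minimal R sketch of this computation (the scores are hypothetical; note that R's built-in mode() reports an object's storage type, not the statistical mode, so the frequencies are tabulated directly):

scores <- c(4, 7, 7, 2, 9, 7, 4, 4, 7)       # hypothetical raw data
freq   <- table(scores)                       # frequency of each distinct value
modes  <- as.numeric(names(freq)[freq == max(freq)])
modes                                         # 7 occurs most often, so the mode is 7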

*********************

Q.No.02. Discuss t-test and its application in educational research.


A t-test can only be used when comparing the means of two groups (a.k.a. pairwise comparison). If you want to
compare more than two groups, or if you want to do multiple pairwise comparisons, use an ANOVA test or a
post-hoc test.
The t-test is a parametric test of difference, meaning that it makes the same assumptions about your data as
other parametric tests. The t-test assumes your data:
1. are independent,
2. are (approximately) normally distributed, and
3. have a similar amount of variance within each group being compared (a.k.a. homogeneity of variance).
If your data do not fit these assumptions, you can try a nonparametric alternative to the t-test, such as the Wilcoxon signed-rank test for paired, non-normally distributed data.
When choosing a t-test, you will need to consider two things: whether the groups being compared come from a
single population or two different populations, and whether you want to test the difference in a specific
direction.

One-sample, two-sample, or paired t-test?


• If the groups come from a single population (e.g. measuring before and after an experimental treatment),
perform a paired t-test.
• If the groups come from two different populations (e.g. two different species, or people from two
separate cities), perform a two-sample t-test (a.k.a. independent t-test).

• If there is one group being compared against a standard value (e.g. comparing the acidity of a liquid to a
neutral pH of 7), perform a one-sample t-test.

One-tailed or two-tailed t-test?


• If you only care whether the two populations are different from one another, perform a two-tailed t-test.
• If you want to know whether one population mean is greater than or less than the other, perform a one-tailed t-test.
In the running example of testing whether petal length differs by species: the observations come from two separate populations (separate species), so you perform a two-sample t-test; and you care only about whether there is a difference, not its direction, so you choose a two-tailed t-test.

Performing a t-test
The t-test estimates the true difference between two group means using the ratio of the difference in group
means over the pooled standard error of both groups. You can calculate it manually using a formula or use
statistical analysis software.

T-test formula
The formula for the two-sample t-test (a.k.a. the Student's t-test) is:

t = (x̄1 − x̄2) / √( s² × (1/n1 + 1/n2) )

In this formula, t is the t-value, x̄1 and x̄2 are the means of the two groups being compared, s² is the pooled standard error of the two groups, and n1 and n2 are the number of observations in each of the groups.
A larger t-value shows that the difference between group means is greater than the pooled standard error,
indicating a more significant difference between the groups.
You can compare your calculated t-value against the values in a critical value chart to determine whether your t-
value is greater than what would be expected by chance. If so, you can reject the null hypothesis and conclude
that the two groups are in fact different.
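As an illustrative sketch (the numbers here are hypothetical), the critical value can be looked up in R with qt():

t_value <- 2.45                    # t-value computed from the formula above
df      <- 28                      # degrees of freedom, n1 + n2 - 2
t_crit  <- qt(1 - 0.05/2, df)      # two-tailed critical value at alpha = 0.05 (about 2.048)
t_value > t_crit                   # TRUE, so the null hypothesis is rejected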

T-test function in statistical software


Most statistical software (R, SPSS, etc.) includes a t-test function. This built-in function will take your raw data
and calculate the t-value. It will then compare it to the critical value and calculate a p-value. This way you can
quickly see whether your groups are statistically different.
In your comparison of flower petal lengths, you decide to perform your t-test using R. The code looks like this:

t.test(Petal.Length ~ Species, data = flower.data)
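A self-contained version of this sketch, assuming the flower data resemble R's built-in iris set (restricted to two species so the grouping factor has exactly two levels), might look as follows; t.test() applies the Welch correction by default, and var.equal = TRUE gives the pooled (Student's) version matching the formula above:

flower.data <- subset(iris, Species %in% c("setosa", "versicolor"))   # keep exactly two groups
flower.data$Species <- droplevels(flower.data$Species)                # drop the unused factor level
t.test(Petal.Length ~ Species, data = flower.data, var.equal = TRUE)  # pooled two-sample t-test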

********************

Q.No.03. What is 'regression analysis'? Where is it applied?


Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent
variable and one or more independent variables. It can be utilized to assess the strength of the relationship
between variables and for modeling the future relationship between them.

Regression analysis includes several variations, such as linear, multiple linear, and nonlinear. The most
common models are simple linear and multiple linear. Nonlinear regression analysis is commonly used for more
complicated data sets in which the dependent and independent variables show a nonlinear relationship.
Regression analysis offers numerous applications in various disciplines, including finance.
Regression Analysis – Linear Model Assumptions
Linear regression analysis is based on six fundamental assumptions:

1. The dependent and independent variables show a linear relationship.
2. The independent variable is not random.
3. The mean value of the residual (error) is zero.
4. The variance of the residual (error) is constant across all observations (homoscedasticity).
5. The residual (error) values are not correlated across observations.
6. The residual (error) values follow the normal distribution.

Regression Analysis – Simple Linear Regression


Simple linear regression is a model that assesses the relationship between a dependent variable and an
independent variable. The simple linear model is expressed using the following equation:
Y = a + bX + ϵ
Where:
• Y – Dependent variable
• X – Independent (explanatory) variable
• a – Intercept
• b – Slope
• ϵ – Residual (error)
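A minimal R sketch of fitting this model with lm(); the data are simulated here, so the true intercept and slope are known and all names are illustrative:

set.seed(1)
X <- runif(50, 0, 10)              # independent variable
Y <- 2 + 0.5 * X + rnorm(50)       # true a = 2, b = 0.5, plus random error
fit <- lm(Y ~ X)                   # fit the simple linear model Y = a + bX + e
coef(fit)                          # estimated intercept (a) and slope (b)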

Regression Analysis – Multiple Linear Regression


Multiple linear regression analysis is essentially like the simple linear model, with the exception that multiple
independent variables are used in the model. The mathematical representation of multiple linear regression is:
Y = a + bX1 + cX2 + dX3 + ϵ
Where:
• Y – Dependent variable
• X1, X2, X3 – Independent (explanatory) variables
• a – Intercept
• b, c, d – Slopes
• ϵ – Residual (error)
Multiple linear regression follows the same conditions as the simple linear model. However, since there are
several independent variables in multiple linear analysis, there is another mandatory condition for the model:
Non-collinearity: Independent variables should show minimal correlation with each other. If the independent variables are highly correlated with each other, it will be difficult to assess the true relationships between the dependent and independent variables.
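A short R sketch of the multiple linear model with three simulated predictors, including a quick non-collinearity check via the predictor correlation matrix (names and data are illustrative):

set.seed(2)
X1 <- rnorm(100); X2 <- rnorm(100); X3 <- rnorm(100)
Y  <- 1 + 2 * X1 - X2 + 0.5 * X3 + rnorm(100)   # true a = 1, b = 2, c = -1, d = 0.5
cor(cbind(X1, X2, X3))             # off-diagonal entries should be small (non-collinearity)
fit <- lm(Y ~ X1 + X2 + X3)        # fit the multiple linear model
summary(fit)                       # intercept a, slopes b, c, d, and their significance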

***********************

Q.No.04. Explain assumptions of applying One-way ANOVA and its procedure.
In some decision-making situations, the sample data may be divided into various groups, i.e., the sample may be supposed to consist of k sub-samples. Interest lies in examining whether the total sample can be considered homogeneous or whether there is some indication that the sub-samples have been drawn from different populations. In these situations, we must compare the mean values of the various groups with respect to one or more criteria.
The total variation present in a set of data may be partitioned into several non-overlapping components as per
the nature of the classification. The systematic procedure to achieve this is called Analysis of Variance
(ANOVA). With the help of such a partitioning, some testing of hypothesis may be performed.
Initially, Analysis of Variance (ANOVA) was employed only for experimental data from randomized designs, but it has since also been used for analyzing survey and secondary data from descriptive research.
Analysis of Variance may also be visualized as a technique to examine a dependence relationship where the response (dependent) variable is metric (measured on an interval or ratio scale) and the factors (independent variables) are categorical in nature with more than two categories.

Example of ANOVA
Ventura is an FMCG company selling a range of products. Its outlets are spread over the entire state. For administrative and planning purposes, Ventura has sub-divided the state into four geographical regions (Northern, Eastern, Western and Southern). Random sample data on sales have been collected from outlets spread over the four geographical regions.
Variation, being a fundamental characteristic of data, would always be present. Here, the total variation in the sales may be measured by the sum of squared deviations from the mean sales. If we analyze the sources of variation in the sales in this case, we may identify two sources:
• Sales within a region would differ and this would be true for all four regions (within-group
variations)
• There might be an impact of the regions, and the mean sales of the four regions would not all be the same, i.e. there might be variation among regions (between-group variations).
So, the total variation present in the sample data may be partitioned into two components, between-regions and within-regions, and their magnitudes may be compared to decide whether there is a substantial difference in the sales with respect to regions. If the two variations are in close agreement, there is no reason to believe that sales are not the same in all four regions; if not, it may be concluded that there exists a substantial difference between some or all of the regions.

Here, it should be kept in mind that ANOVA is the partitioning of variation into assignable causes and a random component; through this partitioning, the ANOVA technique may be used as a method for testing the significance of differences among means (more than two).

Types of Analysis of Variance (ANOVA)


If the values of the response variable have been affected by only one factor (different categories of a single factor), then there is only one assignable cause by which the data are sub-divided, and the corresponding analysis is known as One-Way Analysis of Variance. The example (Ventura Sales) comes in this category. Other examples may be: examining the difference in analytical aptitude among students of various subject-streams (like engineering graduates, management graduates, statistics graduates); the impact of different modes of advertisement on brand-acceptance of consumer durables, etc.
On the other hand, if we consider the effect of more than one assignable cause (different categories of multiple factors) on the response variable, then the corresponding analysis is known as N-Way ANOVA (N ≥ 2). In particular, if the impact of two factors (each having multiple categories) is considered on the dependent (response) variable, the analysis is known as Two-Way ANOVA. For example: in the Ventura Sales case, if along with the geographical regions (Northern, Eastern, Western and Southern) one more factor, 'type of outlet' (Rural and Urban), is considered, then the corresponding analysis will be a Two-Way ANOVA. More examples: examining the difference in analytical aptitude among students of various subject-streams and geographical locations; the impact of different modes of advertisement and occupations on brand-acceptance of consumer durables, etc.
Two-Way ANOVA may be further classified into two categories:
• Two-Way ANOVA with one observation per cell: there is only one observation in each cell (combination). Suppose we have two factors, A (having m categories) and B (having n categories); then there are N = m × n total observations, with one observation (data point) in each (Ai, Bj) cell (combination), i = 1, 2, ..., m and j = 1, 2, ..., n. Here, the effects of the two factors may be examined.
• Two-Way ANOVA with multiple observations per cell: there are multiple observations in each cell (combination). Here, along with the effects of the two factors, their interaction effect may also be examined. An interaction effect occurs when the impact of one factor (assignable cause) depends on the category of the other factor. Examining an interaction effect requires that each cell (combination) have more than one observation, so this is not possible in the earlier Two-Way ANOVA with one observation per cell.

Conceptual Background:
The fundamental concept behind the Analysis of Variance is the "Linear Model".
X1, X2, ..., Xn are observable quantities. Each value can be expressed as:
Xi = µi + ei
where µi is the true value, attributable to assignable causes, and ei is the error term, attributable to random causes. It is assumed that all error terms ei are independently distributed normal variates with mean zero and common variance (σe²).
Further, the true value µi can be assumed to consist of a linear function of t1, t2, ..., tk, known as "effects".
If, in a linear model, all effects tj are unknown constants (parameters), then that linear model is known as a "fixed-effect model". Otherwise, if the effects tj are random variables, the model is known as a "random-effect model".
One-Way Analysis of Variance
We have n observations (Xij), divided into k groups A1, A2, ..., Ak, with the ith group having ni observations.
Here, the proposed fixed-effect linear model is:
Xij = µi + eij
where µi is the mean of the ith group.
General effect (grand mean): µ = Σi (ni µi)/n
and the additional effect of the ith group over the general effect: αi = µi − µ.
So, the linear model becomes:
Xij = µ + αi + eij
with Σi (ni αi) = 0
The least-squares estimates of µ and αi may be determined by minimizing the error sum of squares Σi Σj eij² = Σi Σj (Xij − µ − αi)², giving µ̂ = X.. (the combined mean of the sample) and α̂i = Xi. − X.., where Xi. is the mean of the ith group in the sample.
So, the estimated linear model becomes:
Xij = X.. + (Xi. − X..) + (Xij − Xi.)
This can be further resolved as:
Σi Σj (Xij − X..)² = Σi ni (Xi. − X..)² + Σi Σj (Xij − Xi.)²
Total Sum of Squares = Sum of squares due to group effect + Sum of squares due to error
or
Total Sum of Squares = Between-Group Sum of Squares + Within-Group Sum of Squares
TSS = SSB + SSE
Further, the Mean Sum of Squares may be given as:
MSB = SSB/(k−1) and MSE = SSE/(n−k),

where (k−1) is the degrees of freedom (df) for SSB and (n−k) is the df for SSE.
Here, it should be noted that SSB and SSE add up to TSS and the corresponding df's, (k−1) and (n−k), add up to the total df (n−1), but MSB and MSE do not add up to a total MS.
Thus, by partitioning TSS and the total df into two components, we may test the hypothesis:
H0: µ1 = µ2 = ... = µk
H1: Not all µ's are the same, i.e. at least one µ is different from the others.
or alternatively:
H0: α1 = α2 = ... = αk = 0
H1: Not all α's are zero, i.e. at least one α is different from zero.
MSE is always an unbiased estimate of σe², and if H0 is true then MSB is also an unbiased estimate of σe².
Further, SSB/σe² follows a chi-square (χ²) distribution with (k−1) df and SSE/σe² follows a chi-square (χ²) distribution with (n−k) df. These two χ² variates are independent, so the ratio F = MSB/MSE follows the variance-ratio (F) distribution with (k−1, n−k) df.
The test based on the statistic F is right-tailed (one-tailed). Accordingly, the p-value may be estimated to decide whether or not to reject the null hypothesis H0.
If H0 is rejected, i.e. all µ's are not the same, rejecting the null hypothesis still does not tell us which group means are different from the others. So, Post-Hoc Analysis is performed to identify which group means are significantly different. The post-hoc test takes the form of multiple comparisons, testing the equality of two group means at a time (H0: µp = µq) using a two-group independent-samples test, or comparing the difference between sample means (two at a time) with the least significant difference (LSD)/critical difference (CD):
LSD = t(α/2, error df) × √( MSE × (1/np + 1/nq) )
If the observed difference between two means is greater than the LSD/CD, the corresponding null hypothesis is rejected at the α level of significance.
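A hedged R sketch of a Ventura-style one-way ANOVA (the sales figures are simulated purely for illustration):

set.seed(3)
sales  <- c(rnorm(20, 100, 10), rnorm(20, 105, 10),
            rnorm(20, 110, 10), rnorm(20, 100, 10))   # simulated sales for k = 4 regions
region <- factor(rep(c("North", "East", "West", "South"), each = 20))
fit <- aov(sales ~ region)         # one-way ANOVA
summary(fit)                       # F = MSB/MSE with (k-1, n-k) df and its p-value
TukeyHSD(fit)                      # post-hoc pairwise comparisons if H0 is rejected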

Assumptions for ANOVA:


Though these points have been discussed in the conceptual part, to reiterate, it should be ensured that the following assumptions are fulfilled:
1. The populations from where samples have been drawn should follow a normal distribution.
2. The samples have been selected randomly and independently.
3. Each group should have a common variance, i.e. should be homoscedastic: the variability of the dependent-variable values within the different groups is equal.

It should be noted that the Linear Model used in ANOVA is not affected by minor deviations from these assumptions, especially if the sample is large.
The Normality Assumption may be checked using Tests of Normality: Shapiro-Wilk Test and Kolmogorov-
Smirnov Test with Lilliefors Significance Correction. Here, Normal Probability Plots (P-P Plots and Q-Q Plots)
may also be used for checking normality assumption. The assumption of Equality of Variances
(homoscedasticity) may be checked using different Tests of Homogeneity of Variances (Levene Test, Bartlett’s
test, Brown–Forsythe test etc.).
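Reusing fit, sales and region from the sketch above, the base-R versions of these checks can be run as follows (Levene's and Brown-Forsythe tests require an add-on package, so only the base-R tests are shown):

shapiro.test(residuals(fit))       # Shapiro-Wilk test of normality on the residuals
bartlett.test(sales ~ region)      # Bartlett's test of homogeneity of variances
plot(fit, which = 2)               # normal Q-Q plot of the residuals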

ANOVA vs T-test:
We employ the two-independent-samples t-test to examine whether there exists a significant difference in the means of two categories, i.e. whether the two samples have come from the same or different populations. As an extension, multiple t-tests (taking two groups at a time) might be performed to examine the significance of the differences in the means of k samples in place of ANOVA. If this is attempted, the errors involved in the testing of hypotheses (type I and type II errors) cannot be controlled correctly, and the type I error will be much larger than alpha (the significance level). So, in this situation, ANOVA is always preferred over multiple independent-samples t-tests.
For example, we have four categories of regions: Northern (N), Eastern (E), Western (W) and Southern (S). If we want to compare the population means using the two-independent-samples t-test, taking two categories (groups) at a time, we have to make 4C2 = 6 comparisons, i.e. six independent-samples tests have to be performed (comparing N with S, N with E, N with W, E with W, E with S and W with S). Suppose we use a 5% level of significance for each of the six individual t-tests; then the overall type I error will be 1 − (0.95)⁶ = 1 − 0.735 = 0.265, i.e. 26.5%.
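This calculation can be reproduced in one line of R:

1 - (1 - 0.05)^choose(4, 2)        # family-wise type I error across 4C2 = 6 tests, about 0.265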

*********************

Q.No.05. Explain rationale and procedure of Chi-Square Goodness-of-Fit Test.


Set up the hypotheses for the Chi-Square goodness-of-fit test:
1. Null hypothesis: In the Chi-Square goodness-of-fit test, the null hypothesis assumes that there is no significant difference between the observed and the expected values.
2. Alternative hypothesis: In the Chi-Square goodness-of-fit test, the alternative hypothesis assumes that there is a significant difference between the observed and the expected values.
Compute the value of the Chi-Square goodness-of-fit statistic using the following formula:

χ² = Σ (O − E)² / E

where χ² = the Chi-Square goodness-of-fit statistic, O = observed value, and E = expected value.
3. Degrees of freedom: In the Chi-Square goodness-of-fit test, the degrees of freedom depend on the distribution of the sample. The following table shows each distribution and its associated degrees of freedom:

Type of distribution    No. of constraints    Degrees of freedom
Binomial distribution   1                     n − 1
Poisson distribution    2                     n − 2
Normal distribution     3                     n − 3

• Hypothesis testing:
Hypothesis testing in the Chi-Square goodness-of-fit test is the same as in other tests, like the t-test, ANOVA, etc. The calculated value of the Chi-Square statistic is compared with the table value. If the calculated value is greater than the table value, we reject the null hypothesis and conclude that there is a significant difference between the observed and the expected frequencies. If the calculated value is less than the table value, we fail to reject the null hypothesis and conclude that there is no significant difference between the observed and expected values.
The Chi-Square goodness-of-fit test is a non-parametric test used to find out whether the observed value of a given phenomenon differs significantly from the expected value. In the Chi-Square goodness-of-fit test, the term "goodness of fit" refers to comparing the observed sample distribution with the expected probability distribution. The test determines how well a theoretical distribution (such as the normal, binomial, or Poisson) fits the empirical distribution. In the Chi-Square goodness-of-fit test, the sample data are divided into intervals. Then the number of points that fall into each interval is compared with the expected number of points in that interval.
We use the chi-square test to test the validity of a distribution assumed for a random phenomenon. The test evaluates the null hypothesis H0 (that the data are governed by the assumed distribution) against the alternative (that the data are not drawn from the assumed distribution).
Let p1, p2, ..., pk denote the probabilities hypothesized for k possible outcomes. In n independent trials, we
let Y1, Y2, ..., Yk denote the observed counts of each outcome which are to be compared to the expected
counts np1, np2, ..., npk. The chi-square test statistic is:

q(k−1) = (Y1 − np1)²/np1 + (Y2 − np2)²/np2 + ... + (Yk − npk)²/npk

Reject H0 if this value exceeds the upper α critical value of the χ²(k−1) distribution, where α is the desired level of significance.
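A minimal R sketch of the whole procedure, with hypothetical observed counts for k = 4 equally likely outcomes:

observed <- c(30, 20, 25, 25)      # hypothetical counts Y1, ..., Yk (n = 100 trials)
p        <- rep(0.25, 4)           # hypothesized probabilities p1, ..., pk
chisq.test(x = observed, p = p)    # chi-square statistic, df = k - 1, and p-value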

THE END
********************
