
ADVANCED EDUCATIONAL STATISTICS

EDU 901C

Strategies for Hypothesis Testing

Hillary Rutto
Solomon Kiplimo

Statistical hypothesis

A hypothesis is a claim about the value of a parameter or population characteristic. Testing a statistical hypothesis follows the steps below.

1. Formulating the Hypothesis to be Tested

In any hypothesis-testing problem, there are always two competing hypotheses under consideration:
a) The null hypothesis (H0), which represents the status quo.
b) The research (alternative) hypothesis (HA or H1). The objective of hypothesis testing is to decide, based on sample information, whether the alternative hypothesis is actually supported by the data.

We usually do new research to challenge existing (accepted) beliefs. The burden of proof is placed on those who believe in the alternative claim. The initially favored claim (H0) will not be rejected in favor of the alternative claim (HA or H1) unless the sample evidence provides significant support for the alternative assertion. If the sample does not strongly contradict H0, we continue to believe in the plausibility of the null hypothesis. This is based on the Popperian principle of falsification, put forward by Karl Popper, who argued that we can never conclusively confirm a hypothesis, but we can conclusively negate one.

The two possible conclusions of hypothesis testing are:


a) Reject H0.

b) Fail to reject H0.

Example: Suppose a school is considering replacing its academic revision programme with a new one. The school would be reluctant to change over to the new programme unless evidence strongly suggests that the new programme is superior to the current one. An appropriate problem formulation would involve testing:
H0: There is no difference between the current programme and the new programme, against
HA: The new programme is superior to the current programme.

The conclusion that a change is justified is identified with HA, and it would take conclusive
evidence to justify rejecting H0 and switching to the new programme.
The alternative to the null hypothesis H0: μ1 = μ2 will look like one of the following three
assertions:
a) HA: μ1 ≠ μ2 (two-tailed test)
b) HA: μ1 > μ2 (in which case the null hypothesis is μ1 ≤ μ2)
c) HA: μ1 < μ2 (in which case the null hypothesis is μ1 ≥ μ2)

A researcher might believe that the parameter has increased, decreased or changed.

a) Where a researcher hypothesizes an increase in a parameter, the test is called an upper-tailed test.

b) Where a researcher hypothesizes a decrease in a parameter, the test is called a lower-tailed test.

Upper-tailed and lower-tailed tests are one-tailed tests and correspond to a directional research hypothesis, which reflects an expected difference between groups and specifies the direction of this difference, e.g. HA: The mean height of males is greater than that of females.

c) Where a difference is hypothesized without a direction, the test is called a two-tailed test. A two-tailed test reflects an expected difference between groups but does not specify the direction of this difference, e.g. HA: The mean height of males is different from that of females.

The exact form of the research hypothesis depends on the investigator's belief about the
parameter of interest and whether it has possibly increased, decreased or is different from the
null value. The research hypothesis is set up by the investigator before any data are collected.

2. Setting up a significance level (α)

To decide whether we have sufficient evidence against the null hypothesis to reject it in favour of the alternative hypothesis, one must first decide upon a significance level. The significance level is the probability of rejecting the null hypothesis when the null hypothesis is true.

The significance level (or α level) is a threshold that determines whether a study result can be considered statistically significant after performing the planned statistical tests. It is most often set to 5% (or 0.05), although other levels may be used depending on the study. It is the probability of rejecting the null hypothesis when it is in fact true, and represents the probability of committing a type I error. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.

A confidence level is a way to express how sure we are about the results of a study or experiment. It is often represented as a percentage, such as 95% or 99%. The confidence level tells us the likelihood that the true value or effect we are estimating falls within a given range. A 95% confidence level, for example, means we are 95% confident in our results: if we were to conduct the same study many times, we would expect the true value to be within our estimated range about 95 out of 100 times. The remaining 5 times, our estimate might not capture the true value, but this is considered acceptable statistical variability; this remaining proportion is the alpha value.

Confidence level + α = 1
e.g. 95% + 5% = 100%, i.e. 0.95 + 0.05 = 1.

In hypothesis testing, either the H0 is rejected or it is not. Because this is based on a sample and
not the entire population, we could be wrong about the true treatment effect. Just by chance, it is
possible that this sample reflects a relationship which is not present in the population – this is
when type I and type II errors can happen.
Type I error
A type I error is the incorrect rejection of a true null hypothesis. Usually a type I error leads one
to conclude that a supposed effect or relationship exists when in fact it doesn't. Examples of type
I errors include a test that shows a patient to have a disease when in fact the patient does not have
the disease, or an experiment indicating that a medical treatment should cure a disease when in
fact it does not. Type I errors cannot be completely avoided, but investigators should decide on
an acceptable level of risk of making type I errors when designing the study.

Type II error.
A type II error is the failure to reject a false null hypothesis. This leads to the conclusion that an
effect or relationship doesn't exist when it really does. Examples of type II errors would be a
blood test failing to detect the disease it was designed to detect, in a patient who really has the
disease.
A contingency table for type I and type II errors:

Decision            H0 is TRUE (in reality)          H0 is FALSE (in reality)
Do not reject H0    Correct decision                 Type II error (false negative)
Reject H0           Type I error (false positive)    Correct decision
Choosing a higher significance level, such as 0.10, increases the chances of making a Type I
error but reduces the chances of making a Type II error (failing to reject a false null hypothesis).
On the other hand, choosing a lower significance level, like 0.01, decreases the chances of a
Type I error but increases the chances of a Type II error. Researchers need to strike a balance
based on the context and the consequences of making each type of error.
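The meaning of the significance level can be illustrated with a short simulation (a sketch using simulated, hypothetical data and only the Python standard library): each trial draws a sample from N(0, 1), so the null hypothesis (mean = 0) is true by construction, yet a two-tailed z-test at α = 0.05 still rejects in roughly 5% of trials — exactly the Type I error rate.

```python
# Simulate the Type I error rate of a two-tailed z-test when H0 is true.
import math
import random

random.seed(42)

ALPHA = 0.05
Z_CRIT = 1.96            # two-tailed critical value for alpha = 0.05
N, TRIALS = 30, 5000

false_rejections = 0
for _ in range(TRIALS):
    # H0 is true by construction: data come from N(0, 1)
    sample = [random.gauss(0, 1) for _ in range(N)]
    mean = sum(sample) / N
    z = mean * math.sqrt(N)          # z statistic with known sigma = 1
    if abs(z) > Z_CRIT:              # incorrectly reject a true H0
        false_rejections += 1

type_i_rate = false_rejections / TRIALS
print(f"Observed Type I error rate: {type_i_rate:.3f} (nominal alpha = {ALPHA})")
```

Raising Z_CRIT (i.e. lowering α to 0.01) shrinks this rate, at the cost of more Type II errors.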
3. Data Collection

The type of data collected plays a crucial role in determining the appropriate statistical test for hypothesis testing. There are two main types of data:
1. Categorical Data: This type of data represents categories or groups and is often nominal
or ordinal. Examples include gender, colors, or education levels. For hypothesis testing
with categorical data, chi-square tests or Fisher's exact tests may be used.
2. Numerical Data (Quantitative Data): This type of data consists of numerical values and
can be further categorized as either continuous or discrete. Continuous numerical data
includes measurements like height or weight, while discrete numerical data includes
counts, such as the number of people in a household. Depending on the characteristics of
the data and the research question, t-tests, ANOVA, regression analysis, or other
appropriate statistical tests may be applied.
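The pairing of data type and test can be sketched as follows, assuming SciPy is available; the counts and heights are hypothetical illustration data, not from any real study.

```python
# Matching the statistical test to the data type (hypothetical data).
from scipy import stats

# Categorical data: e.g. group membership vs. pass/fail -> chi-square test
observed = [[30, 10],   # group 1: pass, fail
            [25, 15]]   # group 2: pass, fail
chi2, p_cat, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square: chi2={chi2:.3f}, df={dof}, p={p_cat:.3f}")

# Continuous numerical data: e.g. heights (cm) of two groups -> t-test
group_a = [170.1, 168.4, 172.3, 169.8, 171.2]
group_b = [165.2, 167.0, 164.8, 166.5, 168.1]
t_stat, p_num = stats.ttest_ind(group_a, group_b)
print(f"t-test: t={t_stat:.3f}, p={p_num:.3f}")
```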

4. Determine the appropriate test statistic and calculate it using the sample data.

The test statistic is a function of the sample data that will be used to make a decision about whether the null hypothesis should be rejected or not, and represents the likelihood of obtaining the sample outcomes if the null hypothesis were true. The test statistic summarizes the observed data into a single number using the central tendency, variation, sample size, and number of predictor variables in the statistical model.
The type of test statistic to be used in a hypothesis test depends on several factors
including:
a) The type of statistic you are using in the test.
b) The size of your sample. For a statistical test to be valid, your sample size needs to be
large enough to approximate the true distribution of the population being studied.
c) Assumptions you can make about the distribution of your data.
d) Assumptions you can make about the distribution of the statistic used in the test.

Statistical tests make some common assumptions about the data they are testing. They
include:
a) Independence of observations. The observations/variables you include in your test are not
related.
b) Homogeneity of variance. The variance within all comparison groups is the same. If one group has much more variation than the others, it will limit the test’s effectiveness.
c) Normality of data. The data follows a normal distribution. This assumption applies only
to quantitative data.
If the data does not meet the assumptions of normality or homogeneity of variance, one can perform nonparametric statistical tests, which allow comparisons to be made without assumptions about the data distribution.
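These two assumption checks can be sketched in code (simulated, hypothetical data; assumes SciPy is installed): the Shapiro-Wilk test checks normality and Levene's test checks homogeneity of variance, and in both cases a large p-value gives no evidence against the assumption.

```python
# Checking parametric-test assumptions before choosing a test.
import random
from scipy import stats

random.seed(0)
group_a = [random.gauss(50, 5) for _ in range(40)]
group_b = [random.gauss(52, 5) for _ in range(40)]

# Normality: Shapiro-Wilk test (H0: sample comes from a normal distribution)
w_stat, p_normality = stats.shapiro(group_a)

# Homogeneity of variance: Levene's test (H0: group variances are equal)
levene_stat, p_variance = stats.levene(group_a, group_b)

parametric_ok = p_normality > 0.05 and p_variance > 0.05
print(f"Shapiro p={p_normality:.3f}, Levene p={p_variance:.3f}, "
      f"parametric test reasonable: {parametric_ok}")
```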
The distribution of data is how often each observation occurs, and can be described by its central
tendency and variation around that central tendency. Different statistical tests predict different
types of distributions, so it’s important to choose the right statistical test for the stated
hypothesis.

Choosing a parametric test
Parametric tests usually have stricter requirements than nonparametric tests, and are able to
make stronger inferences from the data. They can only be conducted with data that adheres to
the common assumptions of statistical tests. The most common types of parametric test include
regression tests, comparison tests, and correlation tests.
Regression tests
Regression tests look for cause-and-effect relationships. They can be used to estimate the effect
of one or more continuous variables on another variable.

Simple linear regression
  Predictor variable: continuous (1 predictor)
  Outcome variable: continuous (1 outcome)
  Research question example: What is the effect of income on longevity?

Multiple linear regression
  Predictor variable: continuous (2 or more predictors)
  Outcome variable: continuous (1 outcome)
  Research question example: What is the effect of income and minutes of exercise per day on longevity?
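The income-and-longevity question above can be sketched as a simple linear regression (assumes SciPy is available; the numbers are made up purely for illustration):

```python
# Simple linear regression on hypothetical income/longevity data.
from scipy import stats

income_k = [20, 35, 50, 65, 80, 95]      # annual income, thousands
longevity = [70, 72, 74, 75, 77, 79]     # lifespan, years

result = stats.linregress(income_k, longevity)
print(f"slope={result.slope:.3f} years per unit income, "
      f"r={result.rvalue:.3f}, p={result.pvalue:.5f}")
```

The p-value here tests H0: slope = 0, i.e. that income has no linear effect on longevity.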

Correlation tests
Correlation tests check whether variables are related without hypothesizing a cause-and-effect
relationship.
Pearson’s r
  Variables: 2 continuous variables
  Research question example: How are latitude and temperature related?

Comparison tests
Comparison tests look for differences among group means. They can be used to test the effect of
a categorical variable on the mean value of some other characteristic.
T-tests are used when comparing the means of precisely two groups (e.g., the average heights of
men and women). ANOVA and MANOVA tests are used when comparing the means of more
than two groups (e.g., the average heights of children, teenagers, and adults).
Paired t-test
  Predictor variable: categorical (1 predictor; groups come from the same population)
  Outcome variable: quantitative
  Research question example: What is the effect of two different test prep programs on the average exam scores for students from the same class?

Independent t-test
  Predictor variable: categorical (1 predictor; groups come from different populations)
  Outcome variable: quantitative
  Research question example: What is the difference in average exam scores for students from two different schools?

ANOVA
  Predictor variable: categorical (1 or more predictors)
  Outcome variable: quantitative (1 outcome)
  Research question example: What is the difference in average pain levels among post-surgical patients given three different painkillers?

MANOVA
  Predictor variable: categorical (1 or more predictors)
  Outcome variable: quantitative (2 or more outcomes)
  Research question example: What is the effect of flower species on petal length, petal width, and stem length?
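Two of the research questions above can be sketched directly (assumes SciPy is available; all scores are hypothetical illustration data):

```python
# Comparison tests on hypothetical data.
from scipy import stats

# Independent t-test: exam scores from two different schools
school_1 = [62, 70, 68, 75, 66]
school_2 = [80, 78, 85, 74, 82]
t_stat, p_t = stats.ttest_ind(school_1, school_2)

# One-way ANOVA: pain levels under three different painkillers
drug_a = [4, 5, 4, 6, 5]
drug_b = [7, 8, 6, 7, 8]
drug_c = [3, 2, 4, 3, 2]
f_stat, p_anova = stats.f_oneway(drug_a, drug_b, drug_c)

print(f"t-test: t={t_stat:.2f}, p={p_t:.4f}")
print(f"ANOVA:  F={f_stat:.2f}, p={p_anova:.5f}")
```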

Choosing a nonparametric test


Non-parametric tests don’t make as many assumptions about the data, and are useful when one
or more of the common statistical assumptions are violated. However, the inferences they make
aren’t as strong as with parametric tests.
Spearman’s r
  Predictor variable: quantitative
  Outcome variable: quantitative
  Used in place of: Pearson’s r

Chi square test of independence
  Predictor variable: categorical
  Outcome variable: categorical
  Used in place of: Pearson’s r

Sign test
  Predictor variable: categorical
  Outcome variable: quantitative
  Used in place of: One-sample t-test

Kruskal–Wallis H test
  Predictor variable: categorical (3 or more groups)
  Outcome variable: quantitative
  Used in place of: ANOVA

ANOSIM
  Predictor variable: categorical (3 or more groups)
  Outcome variable: quantitative (2 or more outcome variables)
  Used in place of: MANOVA

Wilcoxon Rank-Sum test
  Predictor variable: categorical (2 groups; groups come from different populations)
  Outcome variable: quantitative
  Used in place of: Independent t-test

Wilcoxon Signed-rank test
  Predictor variable: categorical (2 groups; groups come from the same population)
  Outcome variable: quantitative
  Used in place of: Paired t-test
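As a sketch of one substitution from the table (assumes SciPy is available): with skewed, outlier-heavy data, where the normality assumption is doubtful, the Wilcoxon rank-sum (Mann-Whitney U) test replaces the independent t-test. The reaction times below are hypothetical.

```python
# Nonparametric alternative to the independent t-test (hypothetical data).
from scipy import stats

# Reaction times (ms) with large outliers: normality is doubtful
group_a = [210, 225, 198, 540, 215, 230, 205]
group_b = [260, 280, 270, 610, 295, 275, 300]

u_stat, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"Mann-Whitney U={u_stat}, p={p:.4f}")
```

The test compares ranks rather than raw values, so the two extreme observations do not dominate the result.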

A number of test statistics are available for testing hypotheses, but they vary in suitability and in how they are calculated. Common ones are:

t value (used by: t-tests, regression tests)
  A t-test is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine whether a process or treatment actually has an effect on the population of interest, or whether two groups are different from one another.

Z value (used by: z-tests)
  A z-score is a statistical measurement that describes a value’s relationship to the mean of a group of values. The z-score is measured in terms of standard deviations from the mean. If a z-score is 0, it indicates that the data point’s score is identical to the mean score.

F value (used by: ANOVA, ANCOVA, MANOVA)
  An F-test is a statistical test that is used in hypothesis testing to check whether the variances of two or more samples are equal or not. In an F-test, the data follow an F distribution. This test uses the F statistic to compare two variances by dividing them.

χ2 value (used by: chi-squared tests, non-parametric correlation tests)
  A chi-square (χ2) statistic measures how a model compares to actual observed data. The data used in calculating a chi-square statistic must be random, raw, mutually exclusive, drawn from independent variables, and drawn from a large enough sample.

Test statistics are typically calculated using a statistical program, e.g. SPSS, which will also calculate the p-value of the test statistic. Tables for estimating the p-value of the test statistic are also available; they show, based on the test statistic and the degrees of freedom of the test (number of observations minus number of independent variables), how frequently you would expect to see that test statistic under the null hypothesis.

The p-value (probability value) is the probability of obtaining a result as extreme as, or more extreme than, the result actually obtained when the null hypothesis is true. The p-value ranges from 0 to 1.
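Converting a test statistic into a p-value can be sketched as follows (assumes SciPy is available; the observed statistic and degrees of freedom are hypothetical):

```python
# Two-tailed p-value from a t statistic (hypothetical values).
from scipy import stats

t_obs, df = 2.2, 30   # hypothetical observed t statistic and its df

# Probability, under H0, of a statistic at least as extreme as the one
# observed, in either tail
p_two_tailed = 2 * stats.t.sf(abs(t_obs), df)
print(f"p = {p_two_tailed:.4f}")
```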

Example

Suppose one wants to run an independent-samples t-test to determine whether or not the average scores of male and female students in a chemistry test are equal. Data are collected and analysed, and a p-value is arrived at.

A high p-value, for example 0.90 leaves little reason to doubt the null hypothesis. On the other
hand, if the p-value is small (for example 0.01), there would only be a small chance that the data
would be obtained if the null hypothesis was true.

5. Make a decision (To reject or to fail to reject the null hypothesis)

In order to decide whether to reject the null hypothesis, a test statistic is calculated. The decision is made based on the numerical value of that test statistic. There are two approaches to arriving at that decision: the critical value approach and the p-value approach.

a) The critical value approach.

The observed test statistic calculated from the sample data is compared to the critical value (a cutoff value). The critical value divides the area under the probability distribution curve into rejection region(s) and a non-rejection region.

The null hypothesis is rejected if the test statistic is more extreme than the critical value. The
null hypothesis is not rejected if the test statistic is not as extreme as the critical value. The
critical value is computed based on the given significance level α and the type of probability
distribution employed.
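The critical value approach can be sketched for a two-tailed t-test at α = 0.05 (assumes SciPy is available; the observed statistic is hypothetical):

```python
# Critical value approach for a two-tailed t-test (hypothetical values).
from scipy import stats

alpha, df = 0.05, 24
t_obs = 2.5                                # hypothetical observed t statistic

t_crit = stats.t.ppf(1 - alpha / 2, df)    # cuts off alpha/2 in each tail

reject = abs(t_obs) > t_crit               # reject H0 if more extreme
print(f"critical value = {t_crit:.3f}, reject H0: {reject}")
```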

Rejection regions in a two-tailed test

In a two-tailed test, the null hypothesis is rejected if the test statistic is too small or too large.
The rejection region for such a test consists of two parts: one on the left and one on the right.

Rejection region in one tailed test (Left/Lower- tailed)

The null hypothesis is rejected for a left-tailed test if the test statistic is too small. Thus, the
rejection region for such a test consists of one part left from the centre.

Rejection region in a one tailed test (Right/Upper tailed)

The null hypothesis is rejected for a right-tailed test if the test statistic is too large. Thus, the
rejection region for such a test consists of one part right from the centre.

b) The p-value approach

In this approach, the numerical value of the test statistic is compared to the specified
significance level of the hypothesis test.

The p-value corresponds to the probability of observing sample data at least as extreme as the
actually obtained test statistic. Small p-values provide evidence against the null hypothesis.
The smaller (closer to 0) the p-value, the stronger is the evidence against the null hypothesis.
The null hypothesis is rejected if the p-value is less than or equal to the specified significance
level (α). Otherwise, the null hypothesis is not rejected.
If p≤α, reject H0; otherwise, if p>α, do not reject H0.
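The decision rule above amounts to a one-line comparison (the p-value here is hypothetical):

```python
# The p-value decision rule (hypothetical p-value).
alpha = 0.05
p_value = 0.012

decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(decision)
```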

If the null hypothesis is rejected, results are interpreted in the context of the study and the
alternative hypothesis. If on the contrary the null hypothesis is not rejected, limitations and the
lack of evidence against it are acknowledged.

In addition to hypothesis testing, the effect size is considered as this quantifies the magnitude of
the observed effect. A small p-value does not necessarily imply a large practical significance.

A sensitivity analysis also needs to explore how the results change under different assumptions,
significance levels, or statistical methods.

