IB Biology
Guide for Statistical Analysis for IA - Technical Side

Before you start reading


This guide has two purposes:

1. Showing how to use Excel to carry out statistical analysis
2. Giving a brief introduction to statistics

What are our objectives?


We use statistics to test either:
● Relationships: e.g. does the time invested in studying affect grades?
● Differences: e.g. do grades differ significantly between studying for 30 minutes and
studying for 60 minutes?

Key terms you should know


Data can be separated into the following categories:

Mean, standard deviation, and normal distribution



In statistics we often expect measurements to follow a normal distribution, which means values
near the mean are the most probable and extreme values are the least probable:

μ = mean; 𝜎 = standard deviation

● We expect 68.27% of values to be found within μ ± 1𝜎
● We expect 99.73% of values to be found within μ ± 3𝜎
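
As a quick check of these percentages, here is a minimal Python sketch (Python is not part of this guide's Excel workflow; it is used only for illustration). It relies on `NormalDist` from Python's standard `statistics` module, and the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Fraction of a normal distribution lying within mu ± k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

# The result does not depend on the particular mean or standard deviation
for k in (1, 2, 3):
    print(f"within ±{k}σ: {share_within(k) * 100:.2f}%")
```

The loop prints 68.27%, 95.45% and 99.73% for 1, 2 and 3 standard deviations.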

Standard deviation can be interpreted as the spread (distribution) of data away from the mean:

*Important: All the statistical tests in this guide assume that the data are normally distributed
and that the groups being compared have equal variance. Whether these assumptions hold for
your data is something you can discuss in the evaluation part of your IA.

Hypothesis testing and p-value


In inferential statistics, it is common to test whether two values are different from one another
(statistically, not just numerically).

E.g. the grades before revision are no different from the grades after revision
(that is, any grade difference is not due to revision, but due to chance and randomness)

Usually we set up two hypotheses for testing:

● Null hypothesis (H0): two values have no difference from each other
● Alternative hypothesis: two values are different from each other

The p-value is the probability of obtaining a result at least as extreme as the one observed,
assuming the null hypothesis is true. It is not the probability that the null hypothesis is true.

Normally, we use 0.05 (a probability of 5%) as the significance level: if p < 0.05, a result this
extreme would occur less than 5% of the time if the null hypothesis were true. We can then
reject the null hypothesis and accept the alternative hypothesis.
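
One way to see what a p-value measures is an exact permutation test. Note that this is a different technique from the t-tests and ANOVA used in this guide, and the grades below are made-up numbers. The sketch relabels the pooled data in every possible way and counts how often chance alone gives a difference in means at least as large as the observed one. The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

def mean_diff(a, b):
    """Exact difference between the means of two groups."""
    return Fraction(sum(a)) / len(a) - Fraction(sum(b)) / len(b)

def permutation_p_value(group_a, group_b):
    """Share of all relabellings whose mean difference is at least as extreme as observed."""
    pooled = list(group_a) + list(group_b)
    observed = abs(mean_diff(group_a, group_b))
    indices = range(len(pooled))
    extreme = total = 0
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in indices if i in chosen_set]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if abs(mean_diff(a, b)) >= observed:
            extreme += 1
    return extreme / total

# Hypothetical grades before and after revision
before = [52, 55, 58, 60]
after = [63, 66, 70, 72]
print(f"p = {permutation_p_value(before, after):.3f}")
```

With these hypothetical grades, only 2 of the 70 possible relabellings are as extreme as the observed split, so p = 2/70 ≈ 0.029 < 0.05 and the null hypothesis would be rejected.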

Statistical models
There are different statistical models that help us to find relationships or differences, including:

Finding relationships                             Comparing means

Correlation: finding if two things are related    t-test (one sample t-test, two sample unpaired
                                                  t-test, two sample paired t-test)

Regression: finding cause-effect relationships    ANOVA → Tukey test

So which models should I use?


Here is a list of questions that you should ask yourself first about your data:

1. Are you looking for a relationship or a difference?

a. Relationship
b. Difference

Testing for relationships: Causation or correlation?


It is important to know the difference between causation and correlation.

Prices increased from 2006 to 2018 due to inflation, while the number of McDonald's stores also
increased. So does McDonald's cause inflation?

Causation (cause-effect): the number of McDonald's stores causes inflation? Of course NOT! This is
not a causal relationship.
Correlation (association): prices and the number of McDonald's stores may be related.

Causation
For causation, we are already certain that a cause leads to an effect.

E.g. temperature leads to changes in enzyme activity

E.g. osmotic pressure leads to changes in cell size

Plotting Regression

To quantify a cause-effect relationship, regression analysis is required:

https://www.youtube.com/watch?v=Cltt47Ah3Q4&ab_channel=JalayerAcademy
This video will teach you how to plot a regression and do the data analysis in Excel

R^2 Value
R^2 (coefficient of determination): the proportion of the variation in the dependent variable that
the regression line explains. It ranges from 0 to 1, with 1 meaning the line fits the data perfectly.
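
To make the R^2 value concrete, the following Python sketch fits a least-squares line by hand and computes R^2 as 1 minus (residual variation / total variation). It is only an illustration of the arithmetic behind the Excel trendline; the function name `linear_fit` and the temperature and enzyme activity numbers are hypothetical.

```python
from statistics import mean

def linear_fit(x, y):
    """Least-squares line y = slope * x + intercept, plus its R^2."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # Variation left unexplained by the line vs. total variation in y
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical data: temperature (°C) vs. enzyme activity (arbitrary units)
temperature = [10, 15, 20, 25, 30, 35]
activity = [2.1, 3.0, 4.2, 4.9, 6.1, 6.8]
slope, intercept, r_squared = linear_fit(temperature, activity)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {r_squared:.3f}")
```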

Finding significance of R^2 value by ANOVA box

After fitting the regression and obtaining the R^2 value, we also need to see if the relationship
is statistically significant, using the ANOVA box in the regression output.

Here we consider the following hypotheses:

H0: the two variables have no relationship

H1: the two variables have a relationship

Correlation
For correlations, we are not sure whether two factors are associated.

E.g. Is attendance related to the number of goals scored in a football match?
E.g. Is there a relationship between the money an individual owns and their happiness index?

Finding the correlation coefficient, r-value

This video will show you how to plot correlation graphs and find the r-value
https://www.youtube.com/watch?v=1_jeoqjHtjA&ab_channel=DavidLanger
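
For reference, the r-value is the Pearson correlation coefficient, which Excel returns with `CORREL`. The Python sketch below shows the same calculation written out; the function name `pearson_r` and the attendance and goals figures are hypothetical.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: attendance (thousands) and goals scored per match
attendance = [12, 18, 25, 30, 41, 47]
goals = [1, 3, 2, 2, 4, 3]
print(f"r = {pearson_r(attendance, goals):.3f}")
```

The result always lies between -1 (perfect negative correlation) and +1 (perfect positive correlation).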

Here’s how we interpret the r-value:



P-value analysis for Correlation


After testing for correlations, we also need to see if the correlation is statistically significant
(using the p-value). This video shows you how:
https://www.youtube.com/watch?v=vFcxExzLfZI&ab_channel=QuantitativeSpecialists

Here we consider the following hypotheses:

H0: the two variables have no correlation

H1: the two variables are correlated
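
A common way to test these hypotheses is to convert the r-value into a t statistic with n - 2 degrees of freedom, where n is the number of data pairs. The short Python sketch below shows that conversion; the function name `r_to_t` and the example values are hypothetical. The two-tailed p-value can then be looked up in Excel with `T.DIST.2T(ABS(t), df)`.

```python
import math

def r_to_t(r, n):
    """t statistic and degrees of freedom for testing H0: no correlation."""
    df = n - 2
    t = r * math.sqrt(df / (1 - r * r))
    return t, df

# Hypothetical example: r = 0.62 from 20 data pairs
t, df = r_to_t(0.62, 20)
print(f"t = {t:.3f}, df = {df}")
```

The same r-value gives a larger t (and so a smaller p-value) when it comes from more data pairs.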

Testing for differences


Here’s an overview of the framework for selecting a model when testing for differences, and the
statistical tests that one has to do:

1. Are you comparing two means only?


a. Yes
b. No

t-test
Since there are several t-tests, ask yourself this question:

1. Am I…
a. Comparing 1 sample mean to the population mean → one sample t-test

E.g. comparing the average IB results of delia GP to IB worldwide results

E.g. comparing the weight of class 6A to the average weight of Hong Kong S6 students

b. Comparing 2 sample means that are not related → Student’s t-test / independent
t-test

E.g. comparing the IB results of class 5B and class 5C

E.g. comparing the weight of patients who take drug A and those who take drug B

c. Comparing 2 sample means that are related (e.g. the same subjects at different periods of time) →
two sample paired t-test

E.g. comparing the results of students before and after attending Mr. Samson’s class
E.g. comparing the weight of patients before and after taking drug A

One sample t-test: Comparing sample mean to a known population mean

Null hypothesis (H0): the sample mean = known population mean
Alternative hypothesis: the sample mean is different from the known population mean

● If p<0.05, reject the null hypothesis
● If p≥0.05, do not reject the null hypothesis

Example:
The labels on a brand of protein bars claim that each bar contains 20 grams of protein.
A random sample of 31 energy bars was taken from a number of different stores, and their protein
contents were measured.

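Null hypothesis: the mean protein content of the bars = 20 grams

Alternative hypothesis: the mean protein content of the bars is different from 20 grams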

Results of the t-test show a p-value of 0.0046 < 0.05, therefore the null hypothesis is rejected.
Conclusion: The mean protein content differs significantly from the 20 grams claimed on the labels.
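
The calculation behind a one sample t-test can be sketched in Python as follows. This is an illustration under stated assumptions, not the method from the video: the protein values are made up (they are not the 31 bars from the example), the function names are hypothetical, and the p-value is found by numerically integrating the t distribution so that only the standard library is needed.

```python
import math
from statistics import mean, stdev

def t_two_tailed_p(t, df, steps=20000):
    """Two-tailed p-value of Student's t, by midpoint-rule integration of its density."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    width = abs(t) / steps
    # Area under the density between 0 and |t|
    area = width * sum(
        coef * (1 + ((i + 0.5) * width) ** 2 / df) ** (-(df + 1) / 2)
        for i in range(steps)
    )
    return max(0.0, 1 - 2 * area)

def one_sample_t_test(sample, population_mean):
    """Return (t, degrees of freedom, two-tailed p-value)."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    t = (mean(sample) - population_mean) / standard_error
    df = n - 1
    return t, df, t_two_tailed_p(t, df)

# Hypothetical protein contents (grams) of 8 bars, tested against the claimed 20 g
protein = [20.7, 21.2, 19.8, 21.5, 20.9, 21.1, 20.4, 21.8]
t, df, p = one_sample_t_test(protein, 20.0)
print(f"t = {t:.3f}, df = {df}, p = {p:.4f}")
```

If the printed p is below 0.05, the null hypothesis is rejected, exactly as in the rule above.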

Performing in Excel

https://www.youtube.com/watch?v=OCSmMABkVqQ&list=PLEDQSOItvrBat1jK4QtWaWptHWu-wVzW_&index=1&t=4s&ab_channel=TopTipBio

Effect Size

To investigate the magnitude of the difference, please conduct Effect Size Analysis.

Student’s t-test / independent t-test / Two sample unpaired t-test: 2 sample means are not related

Null hypothesis (H0): the 2 sets of data have the same mean
Alternative hypothesis: the 2 sets of data have different means

● If p<0.05, reject the null hypothesis
● If p≥0.05, do not reject the null hypothesis

Example:
The weight of 100 individuals is measured: 50 women (group A) and 50 men (group B). We want to
know if the mean weight of women (mA) is significantly different from that of men (mB).

H0: mean weight of men = mean weight of women

Ha: mean weight of men =/= mean weight of women

Results:

Since p-value = 0.01327 < 0.05, H0 is rejected and there is a significant difference between the
weight of men and women.
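
For an unpaired comparison like this one, the t statistic is based on the pooled variance of the two groups, which matches this guide's equal-variance assumption. The Python sketch below shows the arithmetic; the function name `students_t` and the weights are hypothetical (they are not the 100 individuals from the example). In Excel, `T.TEST(range_A, range_B, 2, 2)` gives the corresponding two-tailed p-value.

```python
import math
from statistics import mean, variance

def students_t(group_a, group_b):
    """Equal-variance (Student's) two-sample t statistic and degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df

# Hypothetical weights (kg)
women = [55.2, 61.0, 58.4, 63.1, 57.7, 60.3]
men = [68.9, 72.4, 65.0, 74.8, 70.1, 66.5]
t, df = students_t(women, men)
print(f"t = {t:.3f}, df = {df}")
```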

Performing in Excel:

https://www.youtube.com/watch?v=kmww0EewIp0&list=PLEDQSOItvrBat1jK4QtWaWptHWu-wVzW_&index=2&t=8s&ab_channel=DavidDunaetz

Effect Size
To investigate the magnitude of the difference, please conduct Effect Size Analysis.

Two sample paired t-test: 2 sample means are related


Null hypothesis (H0): the 2 sets of data have the same mean
Alternative hypothesis: the 2 sets of data have different means

● If p<0.05, reject the null hypothesis
● If p≥0.05, do not reject the null hypothesis

Example 1: the max vertical jump of college basketball players is measured before and after
participating in a training program.

● If p-value <0.05, then we can conclude that there is a significant difference in max
vertical jump before and after the training program.
● If p-value ≥0.05, then we cannot conclude that the training program makes a significant
difference to the max vertical jump; any improvement or difference may be a result of
randomness/chance.

Example 2: the response time of each patient is measured on two different drugs.
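
A paired t-test works on the difference within each pair, so it is effectively a one sample t-test asking whether the mean difference is zero. The Python sketch below shows this; the function name `paired_t` and the jump heights are hypothetical. In Excel, `T.TEST(range_before, range_after, 2, 1)` gives the corresponding two-tailed p-value.

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t statistic and degrees of freedom from matched measurements."""
    if len(before) != len(after):
        raise ValueError("paired data need the same number of measurements")
    # One difference per subject: after minus before
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical max vertical jump (cm) of 6 players before and after training
before = [61, 58, 65, 70, 63, 59]
after = [64, 60, 66, 74, 65, 63]
t, df = paired_t(before, after)
print(f"t = {t:.3f}, df = {df}")
```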

Performing in Excel

https://www.youtube.com/watch?v=N2Rusw-xBIw&ab_channel=SocratGhadban

Effect Size
To investigate the magnitude of the difference, please conduct Effect Size Analysis.

More than 2 Means: One-way ANOVA


When comparing more than two means, one-way ANOVA can be used.

Null hypothesis (H0): all groups have the same mean

Alternative hypothesis: at least one group has a different mean

● If p<0.05, reject the null hypothesis → find out which groups differ
● If p≥0.05, do not reject the null hypothesis → END
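
One-way ANOVA compares the variation between group means with the variation within groups, and the ratio of the two is the F statistic. The Python sketch below computes it by hand; the function name `one_way_anova_f` and the tree heights are hypothetical. In Excel, `F.DIST.RT(F, df_between, df_within)` converts the F statistic into the p-value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(value for g in groups for value in g)
    group_means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of individual values around their own group mean
    ss_within = sum((value - m) ** 2 for g, m in zip(groups, group_means) for value in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical tree heights (m) for three species
species = [[4.1, 4.5, 3.9, 4.3], [5.2, 5.6, 5.0, 5.4], [4.4, 4.0, 4.6, 4.2]]
f, df_b, df_w = one_way_anova_f(species)
print(f"F = {f:.3f}, df = ({df_b}, {df_w})")
```

A large F means the group means differ by more than the scatter within groups would explain.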

Example 1:

Example 2:
Test if there’s a significant difference in SO2 concentrations at different times in a construction
site.
ANOVA results:

● Site A: p>0.05, the null hypothesis is not rejected and there is no significant difference
● Site B: p<0.05, the null hypothesis is rejected and there is a significant difference

Example 3:
Test if there’s a significant difference in tree heights in 3 species

As the p-value is reported as 0.000 (i.e. p < 0.001), which is below 0.05, there is a significant
difference in tree height between at least two groups. A post-hoc test needs to be conducted to
find out which pairs of tree species differ significantly.

Conducting in Excel:

https://www.youtube.com/watch?v=ZvfO7-J5u34&list=PLEDQSOItvrBat1jK4QtWaWptHWu-wVzW_&index=5&t=6s&ab_channel=TopTipBio

Post hoc (meaning: after) test: Tukey HSD


Interpretation of results
In this example, the weight loss resulting from different exercise times is compared:

From the above result table:

1. The difference in weight loss between 30 minutes per day of exercise and no exercise
(p: 0.852) is not significant
2. The difference in weight loss between 60 minutes per day of exercise and no exercise
(p: 0.000) is significant
3. The difference in weight loss between 60 minutes and 30 minutes per day of exercise
(p: 0.000) is significant

Performing in Excel

https://www.youtube.com/watch?v=YbX-JUqD1so&list=PLEDQSOItvrBat1jK4QtWaWptHWu-wVzW_&index=4&t=6s&ab_channel=VincentStevenson

Effect size
To investigate the magnitude of the difference, please conduct Effect Size Analysis.

Effect Size
Further reading (why the p-value is not enough): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3444174/

Once a significant difference between sample means has been found, the effect size can be
calculated to show the magnitude of the experimental effect.

There are several scenarios in which effect size can be used:

● Calculating before and after effects
● Calculating experiment vs control group effects

The effect size (d) is calculated by dividing the difference between the two group means by the
standard deviation.

An effect size of 1 indicates the two groups differ by 1 standard deviation, a d of 2 indicates they
differ by 2 standard deviations, and so on.
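
As a sketch of this calculation in Python, the function below computes the effect size d as the difference in means divided by a standard deviation. It assumes the pooled standard deviation of the two groups is used as the divisor (the usual choice for Cohen's d); the function name `cohens_d` and the scores are hypothetical.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Difference in means expressed in pooled standard deviations (assumed divisor)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    )
    return (mean(group_a) - mean(group_b)) / pooled_sd

# Hypothetical scores for an experimental group and a control group
experiment = [72, 75, 78, 74, 80, 77]
control = [70, 73, 71, 75, 72, 74]
print(f"d = {cohens_d(experiment, control):.2f}")
```

A positive d means the first group has the higher mean; a negative d means the second group does.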

Example 1: paired t-test

A paired t-test has shown that there is a significant difference between grades before
and after revision. Effect size can tell us how large the impact of revision on student grades is.
Let’s assume that the effect size in the above example is 0.64; this means the grades after
revision are 0.64 standard deviations higher than the grades before revision.

Example 2: ANOVA
The mean scores of students who study at different times are collected. The ANOVA results
show that there are significant differences in the scores of students who study at different
times.

Effect size can be calculated by:

([Mean score studying at time A] - [Mean score of control group]) / (standard deviation of whole
population)

The following results table is obtained:

Variable pairs      Effect size
Time A - control    0.8
Time B - control    0.6
Time C - control    -0.2

Based on the effect sizes, we can conclude that the effects on grades rank as:

Time C < Time B < Time A

That is, time A has had the most positive impact on mean scores, while time C has had the
most negative impact.

Conducting in Excel

https://www.youtube.com/watch?v=zUmQ2PZZRJ4&ab_channel=TopTipBio
