
IB Internal Assessment

Math AA SL

Creating an equation to predict the relationship between the death rate and PM2.5 values of a country

Introduction and Rationale

The applications of mathematics in real life have always fascinated me. I have always been
intrigued by how the mathematics learnt in the classroom can be applied to real-life contexts and
used to resolve real-life problems. I became highly enthusiastic when, during the pandemic, I
stumbled upon a topic that could illuminate the elusive nature of mathematics and that was deeply
personal to each one of us.

According to recent scientific studies, it is hypothesized that the Covid-19 death rate of a
country and the PM2.5 levels of that country have a strong positive correlation. PM2.5 refers to
atmospheric particulate matter with a diameter of less than 2.5 micrometers; particles in this
category are so small that they can only be seen with a microscope. Owing to their minute size,
these particles are able to bypass the nose and throat and penetrate deep into the lungs, and some
may even enter the circulatory system. As I continued my research, I found that since Covid-19 is
also transmitted through the air, particulate matter (PM2.5) could act as a carrier and spread the
infection at a higher rate. Particulate matter could also damage lung cells and increase
inflammation, which could further affect the severity of Covid-19 in an individual. Upon reading
this, I was captivated and wanted to learn more about this relationship.

The state of the world during the pandemic is critical, and any information that can potentially
save lives is valuable. I believe that modelling an equation that relates pollution levels to death
rates in the real world could have many benefits. Building this model with context and real-life
considerations could help us decide on the best way to predict the death rate in a country.

Aim and Method

As established in the introduction, the aim of this exploration is to determine the relationship
between pollution levels and death rates around the world and to create an equation that can make
predictions in the real world. The model will relate the PM2.5 level of a country to the death
rate of that country for the year 2020.

Because I wanted this model to relate the PM2.5 levels of a country to the death rate of that
country, there was no way to collect primary data and I had to use secondary data. To make sure my
data was authentic, I only used data from authentic government websites. I decided to collect the
PM2.5 levels and the death counts of 30 different countries for the year 2020.

I decided to approach the problem by investigating whether the relationship between the PM2.5
levels and the death rate of a country for the year 2020 can be described by an equation. The data
used in this analysis consists of deaths due to COVID-19 for 30 countries and their respective
provinces. It covers one year, from January 2020 to December 2020, and was obtained from the WHO
website. The 2020 data was analysed because it was the most accessible at the time of writing. The
PM2.5 level of each country was extracted from IQAir. A large number of deaths were not reported
during the pandemic, which could affect the reliability of the data.

To achieve my aim, it is extremely important to collect credible data. As mentioned before, the
data is from the year 2020 and was collected from the WHO website and IQAir. I then decided to
graph the data on a scatter plot to help me understand whether the association between the
variables was mostly linear or non-linear. I would then calculate Pearson's Product Moment
Correlation Coefficient ($r$) between the independent variable and the dependent variable. The $r$
value helps determine the strength of the linear correlation between the two variables. Next, I
would calculate Spearman's Rank Correlation Coefficient ($r_s$), which measures the statistical
dependence between the rankings of the two variables. It assesses how well the relationship
between the two variables can be described using a monotonic function. If the variables have a
monotonic but non-linear relationship, Spearman's Rank Correlation Coefficient will typically be
higher than Pearson's Product Moment Correlation Coefficient.

Since my aim is to create an equation that describes the relationship between the total deaths due
to COVID-19 and the PM2.5 values of a country, it is not enough to calculate the correlation
coefficients alone. I will therefore deduce the best-fit model for my data, from which predictions
about the relationship between the total deaths due to COVID-19 and the PM2.5 values can be made.
This can help us understand mathematically how the total deaths due to COVID-19 and the PM2.5
levels are related. I will use the Coefficient of Determination ($R^2$) to determine the accuracy
and predictive power of the model. Since $R^2$ is the ratio of the variation predicted by the model
to the total variation, we can interpret it as a percentage, which will help us judge the accuracy
of the model.

Data Collection

Country                    Total deaths from COVID-19     PM2.5 value (µg/m³)
                           (as of the end of 2020)
India                      148738                         58.80
Nepal                      2690                           39.20
Pakistan                   10047                          59.00
Indonesia                  21944                          40.70
Myanmar                    2664                           29.40
Afghanistan                2189                           46.50
Oman                       1497                           44.40
United Arab Emirates       665                            29.20
Bangladesh                 7531                           77.10
Bosnia and Herzegovina     4050                           40.60
China                      4634                           34.70
North Macedonia            2488                           30.60
Thailand                   61                             21.40
Sri Lanka                  199                            22.40
Madagascar                 261                            20.00
South Korea                879                            19.50
Malaysia                   471                            15.60
Taiwan                     8                              15.00
Mali                       269                            37.90
Andorra                    85                             7.40
Australia                  908                            7.60
New Zealand                30                             7.00
Norway                     433                            5.70
Estonia                    266                            5.90
Myanmar                    2637                           29.40
Armenia                    2807                           24.90
Serbia                     3163                           24.30
Kazakhstan                 1783                           21.90
Georgia                    2313                           20.40
Croatia                    3920                           21.20

Table 1: Data collection

Based on the data collected, I decided to draw a scatter plot to help observe the relationship
between the variables.

y = 0.0003x + 26.607
R² = 0.167

Graph 1: The pollution rate is plotted on the y-axis (in µg/m³) and the total deaths on the x-axis.
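The trendline of Graph 1 can be reproduced from Table 1 with the standard least-squares formulas. The following is a minimal Python sketch, assuming the 30 data pairs are entered in table order; the names `deaths`, `pm25` and `fit_line` are illustrative and are not part of the original working.

```python
# Data from Table 1 (x = total deaths, y = PM2.5 in µg/m³), in table order.
deaths = [148738, 2690, 10047, 21944, 2664, 2189, 1497, 665, 7531, 4050,
          4634, 2488, 61, 199, 261, 879, 471, 8, 269, 85,
          908, 30, 433, 266, 2637, 2807, 3163, 1783, 2313, 3920]
pm25 = [58.80, 39.20, 59.00, 40.70, 29.40, 46.50, 44.40, 29.20, 77.10, 40.60,
        34.70, 30.60, 21.40, 22.40, 20.00, 19.50, 15.60, 15.00, 37.90, 7.40,
        7.60, 7.00, 5.70, 5.90, 29.40, 24.90, 24.30, 21.90, 20.40, 21.20]


def fit_line(x, y):
    """Return (slope, intercept, R^2) of the least-squares line y = slope*x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)   # spread of x about its mean
    ss_y = sum((yi - mean_y) ** 2 for yi in y)   # spread of y about its mean
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / ss_x
    intercept = mean_y - slope * mean_x
    r_squared = s_xy ** 2 / (ss_x * ss_y)        # for a straight line, R^2 = r^2
    return slope, intercept, r_squared


slope, intercept, r_squared = fit_line(deaths, pm25)
print(f"y = {slope:.4f}x + {intercept:.3f}, R^2 = {r_squared:.3f}")
```

Run on the table data, this should agree with the trendline and R² value shown for Graph 1 to the number of decimal places quoted there.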

On plotting the data points, we can observe that the correlation between the variables is moderate
at best. The data points are not evenly distributed about the line of best fit, especially in the
left-hand portion of the graph. To measure the strength of the association between the two
variables, I will calculate Pearson's Product Moment Correlation Coefficient ($r$).

The value of $r$ for this graph is calculated as follows:

Table 1: Calculation of r value

$$M_x = 7654.333, \qquad \sum x = 229630$$

$$\sum (x - M_x)^2 = SS_x = 21121402166.66$$

$$M_y = 28.59, \qquad \sum y = 857.7$$

$$\sum (y - M_y)^2 = SS_y = 8487.387$$

$$r = \frac{\sum (x - M_x)(y - M_y)}{\sqrt{SS_x \, SS_y}} \qquad (1)$$

$$r = \frac{5471195.5}{\sqrt{(21121402166.66)(8487.387)}} = 0.409$$

where $M_x$ is the mean of the x values and $M_y$ is the mean of the y values.
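The quotient in equation (1) can be checked numerically. The sketch below takes the three sums from the working above as given (the variable and function names are illustrative). Evaluating it gives about 0.409, and the square of this value agrees with the R² = 0.167 shown on Graph 1.

```python
import math

# Sums taken from the working above (assumed to be correct as reported).
s_xy = 5471195.5        # sum of (x - M_x)(y - M_y)
ss_x = 21121402166.66   # sum of (x - M_x)^2
ss_y = 8487.387         # sum of (y - M_y)^2


def pearson_from_sums(s_xy, ss_x, ss_y):
    """Pearson's r from the cross-product sum and the two sums of squares."""
    return s_xy / math.sqrt(ss_x * ss_y)


r = pearson_from_sums(s_xy, ss_x, ss_y)
print(round(r, 3), round(r ** 2, 3))
```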

The value of $r$ always lies in the range −1 ≤ $r$ ≤ +1. If $r$ ≥ 0.50, the relationship is
considered strong and positive; in this case $r$ < 0.50, so the linear relationship between the two
variables is weak. Since only a weak linear correlation is observed, we can check whether the
relationship is non-linear. To do so, I will draw a curve of best fit (with non-zero curvature) and
see whether it follows the data points more closely.

Graph 2: The pollution rate is plotted on the y-axis (in µg/m³) and the total deaths on the x-axis.

This curve accounts for more of the data points and suggests that the variables are related
monotonically, which means that as one variable increases, the other variable increases as well.
Spearman's Rank Correlation Coefficient lies in the range −1 ≤ $r_s$ ≤ +1, and its sign indicates
the direction of the association. A positive coefficient shows that as one variable increases, so
does the other, while a negative coefficient shows that as one variable increases, the other
decreases. We can calculate Spearman's Rank Correlation Coefficient to determine the strength of
the monotonic association between the two variables I have selected.

Table 2: Spearman’s Rank Coefficient Calculation

To calculate the correlation:

$$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} \qquad (2)$$

(where n = sample size and d = difference in ranks)

$$r_s = 1 - \frac{6 \times 1259}{30(30^2 - 1)}$$

$$r_s = 0.72$$
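Equation (2) can be sketched in Python as follows. This is only an illustration: the helper names are my own, and tied values are given the average of their ranks, which is one common convention. When ties are present, equation (2) is an approximation.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_from_d2(sum_d2, n):
    """Equation (2): r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def spearman_rs(x, y):
    """Rank both variables, then apply equation (2) to the rank differences."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return spearman_from_d2(sum_d2, len(x))


# With the sum of squared rank differences and the sample size quoted above:
print(round(spearman_from_d2(1259, 30), 3))
```

Calling `spearman_rs` on the two columns of Table 1 would repeat the whole ranking step, whereas `spearman_from_d2` only reproduces the final substitution.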

I used technology to verify this result: the value I obtained in RStudio is 0.720, which agrees
with my manual calculation. Reflecting on Graph 1 and Graph 2, we can also see that the value of
$r$ obtained was much lower than the value of $r_s$, which suggests a monotonic but non-linear
relationship between the variables. Any coefficient with $r_s$ ≥ +0.70 indicates a strong positive
correlation. Since my $r_s$ = 0.72, we can conclude that there is a strong positive correlation
between my variables; however, because the relationship is non-linear, the two variables do not
increase in the same proportion.

Reflecting on my main aim, I wanted to formulate an equation relating the total deaths and the
PM2.5 levels of countries. To achieve this, I decided to linearize my data. Most non-linear
relationships can be re-expressed so that their graph is a straight line. Linearization does not
change the fundamental relationship or what it represents, but it does change the way the graph
looks. Linearizing the data makes the analysis easier and allows me to compute an equation. One
way to linearize data is to use logarithmic models, in which the data points are re-expressed by
taking the logarithm of their values. Logarithms can be used to linearize data in three forms:

• log(x) is plotted against y
• log(y) is plotted against x
• log(x) is plotted against log(y)

To formulate an equation that relates the two variables, I will conduct a regression analysis. The
first step is curve fitting, the process of specifying the model that best fits the shape of the
dataset. Plotting these three graphs (log(x) vs y, log(y) vs x, and log(x) vs log(y)) and
determining a model for each of them can help me understand which model makes the most accurate
predictions. The $R^2$ value of each model can also help me determine which will yield the best
result.
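The comparison of the three re-expressions can be sketched in Python as below. The function names are illustrative, and the demonstration data is hypothetical: it follows an exact power law, so the log-log form should come out as the straight line. It is not the data of this exploration.

```python
import math


def r_squared_linear(u, v):
    """R^2 of the least-squares straight line through the points (u, v)."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return s_uv ** 2 / (ss_u * ss_v)


def compare_linearizations(x, y):
    """R^2 of the straight-line fit for each of the three log re-expressions.

    All x and y values must be positive so that the logarithms exist.
    """
    log_x = [math.log10(value) for value in x]
    log_y = [math.log10(value) for value in y]
    return {
        "logarithmic (log(x) vs y)": r_squared_linear(log_x, y),
        "exponential (x vs log(y))": r_squared_linear(x, log_y),
        "power (log(x) vs log(y))": r_squared_linear(log_x, log_y),
    }


# Hypothetical data that follows y = 5 * x^0.3 exactly.
x_demo = [1, 10, 100, 1000, 10000]
y_demo = [5 * value ** 0.3 for value in x_demo]
results = compare_linearizations(x_demo, y_demo)
for name, value in results.items():
    print(f"{name}: R^2 = {value:.3f}")
```

For this made-up data the power form gives an R² of 1 (up to rounding) and the other two forms give lower values, which is the kind of comparison described above.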
Power model (log(x) vs log(y))

Upon re-expressing the data in terms of log(x) vs log(y), the data appears to be more linear. I
plotted log(total deaths) on the x-axis as the independent variable and log(pollution rate) on the
y-axis as the dependent variable.

y = 0.2278x + 0.6729
R² = 0.434

Graph 3: log(total deaths) is plotted on the x-axis and log(pollution rate) on the y-axis.

Upon visualising the data, it is clear that the log-log model does not do the best job of
linearizing the data. Even without looking at the $R^2$ value, it is clear that the data points do
not follow the line of best fit closely. We must then use the properties of logarithms and
exponents to find the corresponding non-linear model.

Linearization method     Linear model                       Non-linear model
log(y) vs log(x)         log(y) = m log(x) + c              y = 10^c × x^m
                         (m is the slope and c is
                         the y-intercept)

The equation of the line obtained is:

y = 0.2278x + 0.6729

(where x and y stand for the re-expressed values log(total deaths) and log(PM2.5 level))

However, since the equation needs to fit the re-expressed data, it can be rewritten as

$$\log(y) = 0.2278 \times \log(x) + 0.6729$$

where x now represents the total deaths and y the PM2.5 level.

The next step is to find the non-linear model that corresponds to this line. The equation obtained
previously would be ideal if we wanted to predict log(y) and not y, which is why it is important
to transform this equation into a power model. To predict y, we need to remove the logarithm by
raising 10 to the power of both sides:

$$10^{\log(y)} = 10^{0.2278 \times \log(x) + 0.6729}$$

Since $10^{\log(y)} = y$, we can rewrite the equation as

$$y = 10^{0.2278 \times \log(x) + 0.6729}$$

Using the properties $a^{p+q} = a^p \times a^q$ and $a^{pq} = (a^p)^q$,

$$y = (10^{\log(x)})^{0.2278} \times 10^{0.6729}$$

Since $10^{\log(x)} = x$, we can rewrite the equation as

$$y = x^{0.2278} \times 10^{0.6729}$$
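As a quick numerical check of this rearrangement, the sketch below evaluates both the logarithmic form and the power form for a few values of x and confirms that they agree. The constants are the slope and intercept quoted above; the function names are illustrative.

```python
import math

M = 0.2278   # slope of the log-log line
C = 0.6729   # y-intercept of the log-log line


def predict_from_line(x):
    """Evaluate log(y) = M*log(x) + C, then undo the base-10 logarithm."""
    return 10 ** (M * math.log10(x) + C)


def predict_power(x):
    """The rearranged power model y = 10^C * x^M."""
    return 10 ** C * x ** M


for total_deaths in (61, 1783, 148738):
    from_line = predict_from_line(total_deaths)
    from_power = predict_power(total_deaths)
    # The two forms are algebraically identical, so they agree up to rounding.
    assert math.isclose(from_line, from_power, rel_tol=1e-9)
    print(total_deaths, round(from_power, 2))
```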

The power model involves taking the logarithm of both the dependent and the independent variable.
It has an $R^2$ value of 0.434, which is low and tells us that the power model is a weak model.
The model can be interpreted by saying that a 1% increase in the value of x results in an increase
of approximately 0.2278% in the value of y, which is a weak effect. I decided to continue with the
other methods of linearization to find the best-fit model for the equation.

Logarithmic model (log(x) vs y)

I continued by creating a log-lin graph. This model is of the form y = m log(x) + c (where m is
the slope and c is the y-intercept).

Upon re-expressing the data in terms of log(x) vs y, the data seems to be more linear. I plotted
log(total deaths) on the x-axis as the independent variable and the pollution rate on the y-axis as
the dependent variable.

y = 13.30x − 12.08
R² = 0.457

Graph 4: log(total deaths) is plotted on the x-axis and the PM2.5 level on the y-axis.

Even without any mathematical analysis, it can be seen by eye that the data points are not evenly
distributed about the line of best fit. They deviate greatly from it, particularly in the central
portion, where the data points lie well above and well below the line. The $R^2$ value of 0.457 is
not high, although it is slightly higher than that of the previous model. Since the data points
deviate from the line of best fit, it is reasonable to expect that a better-fitting model exists.
This model can be interpreted by saying that when x increases by a factor of 10, y increases by
13.30 units.

Linearization method     Linear model     Non-linear model
y vs log(x)              y = mx + c       y = m log(x) + c

The equation of the line obtained is

y = 13.30x − 12.08

It is then important to rewrite this equation in terms of the original data. Since the x in this
equation stands for log(total deaths), the equation can be re-expressed as

y = 13.30 × log(x) − 12.08

This fits the logarithmic model y = m log(x) + c, where m is the slope and c is the y-intercept.

Exponential model (x vs log(y))

Lastly, I created the lin-log graph to see whether it would give better results. Upon
re-expressing the data in terms of log(y) vs x, the data seems to be linearized. I plotted the
total deaths on the x-axis as the independent variable and log(pollution rate) on the y-axis as the
dependent variable.

y = 3.60 × 10⁻⁶x + 1.34
R² = 0.96

Graph 5: The total deaths are plotted on the x-axis and log(pollution rate) on the y-axis.

From observing the plotted graph, a few observations can be made. This model appears to do the
best job of linearizing the data. The line of best fit does not pass through all the data points,
but the points seem to be distributed much more evenly about it. However, it is not enough to judge
the overall fit of the model by eye; finding the $R^2$ value will help us determine whether the
exponential model is the best model. $R^2$ is the coefficient of determination. Since it is the
ratio of the variation predicted by the model to the total variation, we can interpret it as a
percentage: if the $R^2$ value obtained is 0.90, we can say that 90% of the variation is predicted
by the regression line. A high $R^2$ value means that the predictive power of the model is high
and that it is effective in predicting the effect of x on y. I decided to calculate the $R^2$ value
manually for this particular model, as it seems to be the best model. The calculation is shown in
the table below:

Table 3: Calculation of the $R^2$ value

$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} \qquad (3)$$

$$R^2 = 1 - \frac{2.351}{58.88} = 0.960$$
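Equation (3) can be written as a short Python function. The sketch below is illustrative: the function name and the four-point example are hypothetical, and the final line simply repeats the substitution above, taking the two sums as reported. Note that the denominator of equation (3) is the sum of squared deviations of the observed values from their mean, not the sum of the squared values themselves.

```python
def r_squared(observed, predicted):
    """Equation (3): 1 - (residual sum of squares) / (total sum of squares)."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)  # deviations from the mean
    return 1 - ss_residual / ss_total


# Hypothetical example with four points (not the data of this exploration).
observed_demo = [1.0, 2.0, 3.0, 4.0]
predicted_demo = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(observed_demo, predicted_demo), 3))

# The substitution shown above, using the two sums as reported.
print(round(1 - 2.351 / 58.88, 3))
```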

In this case the $R^2$ value is 0.96, so we can say that 96% of the variation is predicted by the
regression line. Such a high $R^2$ value indicates a strong relationship and suggests that the
model is able to predict with accuracy. The slope can be interpreted as follows: when x increases
by 1 unit, y is multiplied by a factor of $10^{3.60 \times 10^{-6}} \approx 1.000008$. The effect
of a single additional death is therefore very small, but it compounds over the thousands of deaths
recorded in each country.

The main aim of my exploration was to determine an equation that describes the relationship
between the total deaths and the PM2.5 values. To do this, we have to convert the linear model
into a non-linear model.

Linearization method     Linear model          Non-linear model
x vs log(y)              log(y) = mx + c       y = 10^c × (10^m)^x

The equation of the line obtained is

y = 3.60 × 10⁻⁶x + 1.34

Since the y in this equation stands for log(PM2.5 level), the equation needs to be written in the
form

$$\log(y) = 3.60 \times 10^{-6} x + 1.34$$

$$10^{\log(y)} = 10^{3.60 \times 10^{-6} x + 1.34}$$

Since $10^{\log(y)} = y$, we can rewrite it as

$$y = (10^{3.60 \times 10^{-6}})^x \times 10^{1.34}$$

This fits the exponential model $y = (10^m)^x \times 10^c$.

So far we have shown that there is a strong non-linear association between the total deaths and
the PM2.5 values, with a Spearman's rank correlation coefficient of 0.72. After linearizing the
data and creating a model, we obtained a strong model with an $R^2$ value of 0.96. The Coefficient
of Determination, or $R^2$ value, assesses the overall effectiveness of the model. This value
means that the model accounts for 96% of the variation in the data, which suggests that the model
has high predictive power.

Comparisons

Model                                     Line of best fit                          R² value (rounded to 3 d.p.)
Total deaths vs log(PM2.5 values)         The line of best fit mostly passes        0.960
                                          through the points
log(Total deaths) vs log(PM2.5 values)    It mostly does not fit the points         0.434
log(Total deaths) vs PM2.5 values         It mostly does not fit the points         0.457

Table 4: Comparisons of models

From the above table, it is clear that the plot of total deaths vs log(PM2.5 values) does the best
job of linearizing the data compared to the other graphs. It has the highest $R^2$ value, which
suggests that this model is the most effective.

Verification of equation

To further evaluate the predictive strength of the equation, I decided to verify it and conduct an
error analysis. I stated in my aim that I wanted my equation to make real-world predictions, and
checking the equation against the collected data will help to achieve this.

Observed data from the data collected:

Country        x (observed)     y (observed)
Kazakhstan     1783             21.90
Thailand       61               21.40

Table 5: Observed data

Calculated data using the equation:

$$y = (10^{3.60 \times 10^{-6}})^x \times 10^{1.34}$$

Country        Calculation                                             y (calculated)
Kazakhstan     $(10^{3.60 \times 10^{-6}})^{1783} \times 10^{1.34}$    22.204
Thailand       $(10^{3.60 \times 10^{-6}})^{61} \times 10^{1.34}$      21.888

Table 6: Calculated data

Error analysis

$$\text{Percentage error} = \frac{|\text{measured value} - \text{real value}|}{\text{real value}} \times 100$$

Country        Calculation                                     % error
Kazakhstan     $\frac{|22.204 - 21.90|}{21.90} \times 100$     1.388%
Thailand       $\frac{|21.888 - 21.40|}{21.40} \times 100$     2.280%

Table 7: Error analysis
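The verification and the error analysis can be repeated with a few lines of Python. This is a minimal sketch using the slope and intercept of the exponential model and the two observed pairs quoted above; the function and variable names are illustrative. The percentage errors come out at roughly 1.4% for Kazakhstan and 2.3% for Thailand.

```python
SLOPE = 3.60e-6     # slope of the line log(y) = SLOPE * x + INTERCEPT
INTERCEPT = 1.34    # y-intercept of that line


def predict_pm25(total_deaths):
    """Exponential model y = (10^SLOPE)^x * 10^INTERCEPT."""
    return 10 ** (SLOPE * total_deaths + INTERCEPT)


def percentage_error(calculated, observed):
    """|calculated - observed| / observed * 100."""
    return abs(calculated - observed) / observed * 100


# Observed (total deaths, PM2.5) pairs from Table 5.
observed_pairs = {"Kazakhstan": (1783, 21.90), "Thailand": (61, 21.40)}

for country, (total_deaths, pm25) in observed_pairs.items():
    calculated = predict_pm25(total_deaths)
    error = percentage_error(calculated, pm25)
    print(f"{country}: calculated = {calculated:.3f}, error = {error:.2f}%")
```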

Both percentage errors are relatively small, and the calculated values are very close to the
observed values. This supports the conclusion that the exponential model is the best of the three
models for creating the equation. Reflecting on the aim, I have determined the best model to
describe the relationship between the total deaths and the PM2.5 values and to make predictions.
Henceforth, this model can be used to extrapolate the variables and make predictions about the
total deaths in the future. Since the $R^2$ value obtained was very high and the model did a good
job of predicting the data, the null hypothesis can be rejected, and it can be concluded that as
the PM2.5 values of a country increase, the total deaths due to COVID-19 in that country also
increase.

Evaluation

I tried to check the accuracy of the data I collected by comparing many different websites and did
not find any discrepancies. However, since the number of reported deaths was lower than the actual
number of deaths in all countries, we cannot be sure that my model will predict accurately. Hence,
it might be challenging to determine the validity of the model I produced.

I mainly focused on two variables: the total deaths and the PM2.5 values. A potential extension of
this investigation could be to analyse more variables, such as recovered cases and population, and
to formulate an equation that takes many factors into consideration. I could also look at other
air pollutants such as CO₂, SO₂, NO₂ and O₃ and formulate an equation that takes these into
consideration.

In general, despite the limitations, I have achieved the main aim of the exploration and used
accurate data from government websites throughout. I have also carefully carried out the
linearization of the data and found the best model to predict the relationship between the
variables. All my manual calculations have been verified using technology and were mostly precise.

Bibliography

Ali, Nurshad, and Farjana Islam. “Infection and Mortality—A Review on Recent Evidence.” Front.
Public Health, 2020, https://doi.org/10.3389/fpubh.2020.580057. Accessed October 2020.

Bashir, Muhammad Farhan, et al. “Correlation between environmental pollution indicators and
COVID-19 pandemic: A brief study in Californian context.” Environmental Research, vol. 187, 2020,
109652. doi:10.1016/j.envres.2020.109652. Accessed January 2021.

Travaglio, Marco, Yizhou Yu, et al. “Links between air pollution and COVID-19 in England.” Volume
268, Part A, ScienceDirect, 2021, https://doi.org/10.1016/j.envpol.2020.115859. Accessed October
2021.

Muhammad, Sulaman, et al. “COVID-19 pandemic and environmental pollution: A blessing in disguise?”
The Science of the Total Environment, vol. 728, 2020, 138820. doi:10.1016/j.scitotenv.2020.138820.
Accessed March 2021.

WHO Coronavirus (COVID-19) Dashboard, https://covid19.who.int/. Accessed April 2021.

IQAir Dashboard, https://www.iqair.com/in-en/world-most-polluted-countries. Accessed April 2021.
