Math IA
Math AA SL
Creating an equation to predict the relationship between the death rate and PM2.5 values of a
country
Introduction and Rationale
The applications of mathematics in real life have always fascinated me. I have always
been intrigued by how the mathematics learnt in the classroom can be applied to real-life contexts
and used to resolve real-life problems. I became highly enthusiastic when I stumbled upon
something that could elucidate the elusive nature of mathematics and something that was highly
According to recent scientific research, it is hypothesized that the Covid-19 death rate of a
country and its PM2.5 levels have a strong positive correlation. PM2.5 refers to
atmospheric particulate matter with a diameter of less than 2.5 micrometers; particles in this
category are so small that they can only be seen with a microscope. Owing to their minute size,
these particles are able to bypass the nose and throat and penetrate deep
into the lungs, and some may even enter the circulatory system. As I continued to research, I found that
since Covid-19 is also transmitted through the air, particulate matter (PM2.5) could act as a carrier
and spread the infection at a higher rate. Particulate matter could also damage
lung cells and increase inflammation, which could further affect the severity of Covid-19
in an individual. Upon reading this, I was captivated and wanted to learn more about this
relationship.
The state of the world during the pandemic is critical, and any information that can
potentially save lives is pivotal. I believe that modelling an equation that relates pollution rates
and death rates in the real world could have many benefits. Creating this
model with a context and real-life considerations could help us conclude upon the best way to
Aim and Method
between world pollution rates and the death rates and create an equation to make predictions
in the real world. This model will relate the PM2.5 levels of a country and the death rate of a
Because I wanted this model to determine a relationship between the PM2.5 levels of a country and
its death rate, there was no way to collect primary data and I had to use secondary
data. To make sure my data was reliable, I used data only from official government
websites. I decided to collect data from these sites for 30 different countries and the death rate of
I decided to approach the problem by seeing whether the relationship between the PM2.5 levels and the
death rate of a country for the year 2020 could be described as an equation. The data used in this analysis
consist of daily deaths due to COVID-19 for 30 countries and their respective provinces. The
data cover 30 countries for one year, from January 2020 to December 2020, and were obtained from
the WHO website. The data for 2020 were analyzed because they were the most accessible
at the time of writing. The PM2.5 level of each country was extracted from IQAir. A
large number of deaths were not reported during the pandemic; this could affect the reliability of
mentioned before the data collected would be from the year 2020 from the WHO website and
IQAir. After that I decided to graph my data on a scatterplot to help me understand if the
association between the variables was mostly linear or non-linear. I would then calculate the
value of the Pearson’s Product Moment Correlation Coefficient (𝑟) for both the independent
variable and dependent variable. The r value helps determine the strength of the
linear correlation between the two variables. I would then calculate the Spearman’s
Rank Correlation Coefficient ($r_s$), which measures the statistical dependence between the
rankings of the two variables. It assesses how well the relationship between the two variables can
be described using a monotonic function. If the variables have a monotonic but non-linear relationship, the
Spearman’s Rank Correlation Coefficient would indicate a higher value than the Pearson’s
According to my aim, I want to create an equation to predict the relationship between the total
deaths due to COVID-19 and the PM2.5 values of a country so it is not enough only to calculate
the correlation coefficients. To achieve my aim, I will deduce the best fit model for my data from
which predictions about the relationship between the total deaths due to COVID-19 and the
PM2.5 values can be calculated. This can help us mathematically understand the implications of
the PM2.5 levels and how the total deaths due to COVID-19 and the PM2.5 levels are related.
I will use the Coefficient of Determination ($R^2$) to determine the accuracy and predictive power of the
model. Since $R^2$ is the ratio of the variation explained by the model to the total
variation, we can interpret it as a percentage, which will help us determine the accuracy of the
model.
Data Collection
Country     Total deaths    PM2.5 (µg/m3)
Thailand    61              21.40
Taiwan      8               15.00
Andorra     85              7.40
Based on the data collected, I decided to plot a scatter plot which will help observe the relationship
Graph 1: The pollution rate (in µg/m3) is plotted on the y-axis and the total deaths on the x-axis. Line of best fit: y = 0.0003x + 26.607, R² = 0.167.
On plotting the data points, we can observe that the correlation between the variables is moderate.
The data points are not evenly distributed about the line of best fit,
especially in the left-hand portion of the graph. In order to see the strength of association between
the two variables, I will calculate the Pearson’s Product Moment Correlation Coefficient ($r$).
Table 1: Calculation of r value
$M_x = 7654.333$
$\sum x = 229630$
$SS_x = \sum (x - M_x)^2 = 21121402166.66$
$M_y = 28.59$
$\sum y = 857.7$
$SS_y = \sum (y - M_y)^2 = 8487.378$

$$r = \frac{\sum (x - M_x)(y - M_y)}{\sqrt{SS_x \, SS_y}} \qquad (1)$$

$$r = \frac{5471195.5}{\sqrt{(21121402166.66)(8487.378)}} = 0.409$$
relationship; in this case r < 0.50, so the relationship between the two variables is weak. Since a
weak linear correlation is observed, we can check whether the relationship is non-linear. To check
this, I will graph a curve of best fit (one with non-zero curvature) and see if it will
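The substitution into equation (1) can be checked with a few lines of Python. This is only a sketch: the three inputs are the summary values reported in Table 1, and the function name `pearson_from_sums` is a hypothetical helper introduced for illustration.

```python
import math

# Summary values as reported in Table 1.
sum_cross_dev = 5471195.5   # sum of (x - M_x)(y - M_y)
ss_x = 21121402166.66       # sum of (x - M_x)^2
ss_y = 8487.378             # sum of (y - M_y)^2


def pearson_from_sums(sp, ss_x, ss_y):
    """Pearson's r from the sum of cross-deviations and the two sums of squares."""
    return sp / math.sqrt(ss_x * ss_y)


r = pearson_from_sums(sum_cross_dev, ss_x, ss_y)
print(round(r, 3))       # 0.409
print(round(r ** 2, 3))  # 0.167
```

For a straight-line fit, the square of r equals the coefficient of determination, so the second printed value can be compared with the R² shown for the line of best fit on Graph 1.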
Graph 2: The pollution rate (in µg/m3) is plotted on the y-axis and the total deaths on the x-axis.
This curve is able to account for more data points, and the variables appear to be related monotonically, which here means that as
one variable increases the other variable increases as well. The Spearman’s Rank Coefficient lies
in the range $-1 \le r_s \le +1$, and its sign indicates the direction of association. A positive
coefficient shows that as one variable increases, so does the other, while a negative coefficient
shows that as one variable increases the other decreases. We can calculate the Spearman’s Rank
Coefficient to determine the strength of the association between the two variables I have selected.
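To illustrate why a rank-based coefficient suits a monotonic but curved relationship, here is a minimal sketch on made-up numbers (not the data collected for this exploration). The helper names `ranks`, `pearson` and `spearman` are hypothetical, and the ranking assumes there are no tied values.

```python
import math


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson(xs, ys):
    """Pearson's r computed from deviations about the means."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)


def spearman(xs, ys):
    """Spearman's r_s from the sum of squared rank differences (no ties)."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Made-up data: y always rises with x, but along a curve rather than a line.
x = [1, 10, 100, 1000, 10000]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

print(spearman(x, y))     # 1.0
print(pearson(x, y) < 1)  # True
```

Because the ranks of x and y agree exactly, every rank difference is zero and the rank coefficient is 1, while Pearson's r stays below 1 because the points do not lie on a straight line.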
Table 2: Spearman’s Rank Coefficient Calculation
$$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} \qquad (2)$$

$$r_s = 1 - \frac{6 \times 1259}{30(30^2 - 1)}$$

$$r_s = 0.72$$
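The arithmetic in equation (2) can be reproduced as follows. The sketch takes the sum of squared rank differences (1259) and the sample size (30) from the calculation above; `spearman_from_d2` is a hypothetical function name.

```python
def spearman_from_d2(sum_d_squared, n):
    """Spearman's rank coefficient from the sum of squared rank differences."""
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))


r_s = spearman_from_d2(1259, 30)
print(round(r_s, 2))  # 0.72
```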
I used technology to verify this; the value I obtained in RStudio is 0.720, which agrees with
the manual calculation. Reflecting on Graph 1 and Graph 2, we can also see that
the value of $r$ obtained was much lower than the value of $r_s$, which suggests a monotonic but non-linear relationship
between the variables. Any coefficient $r_s \ge +0.70$ indicates a strong positive correlation.
Since my $r_s = 0.72$, we can conclude that there is a strong positive correlation between
my variables; however, the variables do not increase in the same proportion, as the relationship is non-linear.
Reflecting back on my main aim, I wanted to formulate an equation to relate the total deaths and
PM2.5 levels of countries. To achieve this, I decided I needed to linearize my data. Most
relationships that are not linear can be re-expressed so that the graph is a straight line. Linearization does
not change the fundamental relationship or what it represents, but it does change the way the graph
looks. I decided to linearize my data to make the analysis easier and to compute an
equation. One method of linearizing data uses logarithmic models: we can re-express
the data points by taking the logarithm of their values. Logarithms can be used to
According to my aim, I wanted to formulate an equation that would relate the two variables. In
order to achieve this, I will conduct a regression analysis. The first step is curve
fitting, a process of specifying the model that best fits the specific dataset. Plotting these
three graphs (log x vs y, log y vs x and log x vs log y) and determining a model for each can
help me understand which of these models predicts most accurately. The
$R^2$ value can also help me determine which will yield the best result.
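The comparison of the three linearizations can be sketched in Python as below. The sample data are made up to follow a power law exactly and are not the 30-country dataset; `fit_line` and `compare_linearizations` are hypothetical helper names, and base-10 logarithms are assumed. Note that each $R^2$ here is measured on the re-expressed (transformed) scale.

```python
import math


def fit_line(us, vs):
    """Least-squares slope, intercept and R^2 for the line v = m*u + c."""
    n = len(us)
    mu = sum(us) / n
    mv = sum(vs) / n
    ss_u = sum((u - mu) ** 2 for u in us)
    sp = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    m = sp / ss_u
    c = mv - m * mu
    ss_res = sum((v - (m * u + c)) ** 2 for u, v in zip(us, vs))
    ss_tot = sum((v - mv) ** 2 for v in vs)
    return m, c, 1 - ss_res / ss_tot


def compare_linearizations(xs, ys):
    """Return the R^2 of the straight-line fit for each re-expression."""
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    return {
        "power (log x vs log y)": fit_line(log_x, log_y)[2],
        "logarithmic (log x vs y)": fit_line(log_x, ys)[2],
        "exponential (x vs log y)": fit_line(xs, log_y)[2],
    }


# Made-up sample following y = 3 * x^0.5 exactly, for illustration only.
xs = [1, 4, 9, 16, 25, 36]
ys = [3 * math.sqrt(x) for x in xs]

scores = compare_linearizations(xs, ys)
best = max(scores, key=scores.get)
print(best)  # power (log x vs log y)
```

With real data none of the three fits will be perfect, so the same comparison simply picks the re-expression whose straight line explains the largest share of the variation.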
Power model (log(x) vs log(y))
Upon re-expressing the data in terms of log(x) vs log(y), the data seems to be linearized once again. I
plotted the log(total deaths) on the x axis as the independent variable while I plotted the
Graph 3: log(total deaths) is plotted on the x-axis and log(pollution rate) on the y-axis. Line of best fit: y = 0.2278(x) + 0.6729, R² = 0.434.
Upon visualising the data, it is clear that the log-log model does not do the best job of linearizing
the data. Even without looking at the $R^2$ value, it is clear that the data points do not follow the trend
line closely. We must then use the properties of logarithms and exponents to find the non-linear
model.
Linearization method    Linear model                                   Non-linear model
log(y) vs log(x)        log(y) = m log(x) + c                          $y = 10^c x^m$
                        (m being the slope and c the y-intercept)
The equation of the line obtained is:
y = 0.2278(x) + 0.6729
However, since the equation needs to fit the re-expressed data, it can be re-written as
log(y) = 0.2278 log(x) + 0.6729
The next step is to find the corresponding non-linear model. The equation
obtained previously would be ideal if we wanted to predict log(y) and not y, which is why it
is important to transform this equation and obtain a power model. To predict y we need to
$$10^{\log(y)} = 10^{0.2278 \log(x) + 0.6729}$$
$$y = 10^{0.2278 \log(x) + 0.6729}$$
$$y = x^{0.2278} \times 10^{0.6729}$$
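As a quick check of the back-transformation, the sketch below evaluates the model in both forms and confirms they agree. The constants come from the fitted line above, base-10 logarithms are assumed, and the function names are hypothetical.

```python
import math

M = 0.2278  # slope of the log-log line
C = 0.6729  # intercept of the log-log line


def predict_power(x):
    """Power form of the model: y = 10^C * x^M."""
    return (10 ** C) * (x ** M)


def predict_via_logs(x):
    """Same model evaluated through the linearized form."""
    return 10 ** (M * math.log10(x) + C)


print(round(10 ** C, 2))  # 4.71
print(math.isclose(predict_power(61), predict_via_logs(61)))  # True
```

The constant $10^{0.6729}$ is about 4.71, so the power model can equivalently be read as y ≈ 4.71 × x^0.2278.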
The power model involves taking the logarithm of both the dependent and the independent
variable. The $R^2$ value of the power model is 0.434, which is low. This
tells us that the power model is a weak model. The model can be interpreted by saying
that a 1% increase in the value of x results in an increase in the y value of approximately 0.2278%, which is
a weak effect. I decided to continue with the other methods of linearization to find the best fit model
Logarithmic model (log(x) vs y)
I began by creating a log-lin graph; this model would be in the form y = m log(x) + c (where m is the
Upon re-expressing the data in terms of log(x) vs y, the data seems to be linearized. I plotted the
log(total deaths) on the x axis as the independent variable while I plotted the pollution rate on the y
Graph 4: log(total deaths) is plotted on the x-axis and PM2.5 levels on the y-axis. Line of best fit: y = 13.30x - 12.08, R² = 0.457.
Without doing any mathematical analysis, it can be seen by eye that the data points are not evenly
distributed about the best fit line and deviate greatly from it, particularly in the center portion,
where the data points show values much higher and lower than the line of best fit. The $R^2$ value
recorded is not high; at 0.457, it is only slightly higher than that of the previous
model. Since the data points deviate from the best fit line, we can reasonably assume there
is a better fit model for the data. This model can be interpreted by saying that when x increases
y = 13.30x - 12.08
It is then important to rewrite this equation to fit the re-expressed data. The equation of this line is
y = 13.30 log(x) - 12.08
This equation fits the logarithmic model equation [y = m log(x) + c], where m is the slope and c
is the y-intercept.
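One way to read the slope of a logarithmic model is shown in the sketch below: multiplying x by 10 adds m to the predicted y. The constants are taken from the fitted line above, base-10 logarithms are assumed, and `predict_log` is a hypothetical function name.

```python
import math

M = 13.30   # slope of the line fitted to log(x) vs y
C = -12.08  # y-intercept of that line


def predict_log(x):
    """Logarithmic model: y = M * log10(x) + C."""
    return M * math.log10(x) + C


# A tenfold increase in x raises the prediction by M, whatever the starting x.
diff = predict_log(1000) - predict_log(100)
print(round(diff, 2))  # 13.3
```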
Exponential model (x vs log(y))
Lastly, I created the lin-log graph to see if this would give me better results. Upon re-
expressing the data in terms of log(y) vs x, the data seems to be linearized. I plotted the total deaths
on the x axis as the independent variable while I plotted the log(pollution rate) on the y axis as the
dependent variable.
Graph 5: The graph represents the values of the death rate on the x-axis and the log(pollution rate) on the y-axis. R² = 0.96.
From observing the plotted graph, a few observations can be made. Upon looking at the data, it
appears that this model does the best job of linearizing the data. The line of best fit does not pass
through all the data points, but the data points are distributed much more evenly about
the best fit line. However, it is not enough to judge the overall fit of the model by eye;
finding the $R^2$ value will help us determine if the exponential model is the best model. $R^2$ is the
coefficient of determination; since it is the ratio of the variation explained by the model to the total variation, we can
interpret it as a percentage. If the obtained $R^2$ value is 0.90, we can say that 90% of the variation is
predicted by the regression line. I decided to calculate the $R^2$ value for this particular model as it
seems to be the best model. Having a high $R^2$ value means that the predictive power of the model is
To determine the $R^2$ value for this graph, I have manually calculated it in the table
below:
$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} \qquad (3)$$

$$R^2 = 1 - \frac{2.351}{58.88} = 0.960$$
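Equation (3) can be expressed as a small function. The sketch below first evaluates it from raw observed and predicted values, then substitutes the two sums quoted above (2.351 and 58.88); the function names are hypothetical.

```python
def r_squared_from_sums(ss_res, ss_tot):
    """Coefficient of determination from the residual and total sums of squares."""
    return 1 - ss_res / ss_tot


def r_squared(observed, predicted):
    """Coefficient of determination from observed and predicted values."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return r_squared_from_sums(ss_res, ss_tot)


# Substituting the sums reported in the manual calculation.
print(round(r_squared_from_sums(2.351, 58.88), 3))  # 0.96
```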
In this case the $R^2$ value is 0.96, so we can say that 96% of the variation is predicted by the regression line.
This high value indicates a strong relationship and suggests that the predictive power of the model
is high. It can be interpreted that as x increases by 1
unit, y is multiplied by a factor of $10^{3.60 \times 10^{-6}}$, which is approximately 1.000008; each
additional death therefore changes the predicted value only very slightly.
The main aim of my exploration was to determine an equation that could describe the relationship
between the total deaths and the PM2.5 values. In order to do this, we have to model the equation
$$10^{\log(y)} = 10^{3.60 \times 10^{-6} x + 1.34}$$
$$y = \left(10^{3.60 \times 10^{-6} x}\right) \times 10^{1.34}$$
Until now we have demonstrated that there is a strong non-linear association between the variables,
with a rank correlation coefficient of 0.72 between the total deaths and the PM2.5 values. After
linearizing the data and creating a model, we obtained a strong model that has an $R^2$ value of 0.96.
The Coefficient of Determination, or $R^2$ value, assesses the overall effectiveness of the model.
This means that the model accounts for 96% of the variation; this high value suggests the high predictive
Comparisons
Graph                                   Observation                          R² value
ln(Total deaths) VS ln(PM2.5 values)    It mostly does not fit the points    0.434
ln(Total deaths) VS PM2.5 values        It mostly does not fit the points    0.457
Table 4: Comparisons of models
From the above table, it is clear that the plot of Total deaths VS ln(PM2.5 values) does the best job
of linearizing the data compared to the other graphs. It has the highest $R^2$ value, which suggests that
Verification of equation
To further evaluate and test the predictive strength of the equation formulated, I decided to verify
the equation and conduct error analysis. I had stated in my aim that I wanted my equation to make
real-world predictions; verifying the equation using the collected data will help achieve this.
Thailand 61 21.40
$$y = \left(10^{3.60 \times 10^{-6} x}\right) \times 10^{1.34}$$
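The verification step can be sketched as below for the Thailand row (61 total deaths, PM2.5 of 21.40). The constants come from the exponential model above, the percentage error is taken relative to the observed value, and the function names are hypothetical; the printed figures are this sketch's own arithmetic, not values quoted from the error analysis table.

```python
M = 3.60e-6  # slope of the line fitted to x vs log(y)
C = 1.34     # intercept of that line


def predict_exponential(x):
    """Exponential model: y = 10^(M * x) * 10^C."""
    return (10 ** (M * x)) * (10 ** C)


def percentage_error(predicted, observed):
    """Absolute percentage error relative to the observed value."""
    return abs(predicted - observed) / observed * 100


predicted = predict_exponential(61)
error = percentage_error(predicted, 21.40)
print(round(predicted, 2))  # about 21.89
print(round(error, 1))      # about 2.3
```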
Error analysis
Both of the percentage errors obtained are relatively small, and the results obtained here are very
close to the values we were hoping to obtain. This helps us conclude that the exponential model is
the best model and does the best job of describing the data. Reflecting back on the aim, I have
successfully determined the best model to establish the relationship between the total deaths and
PM2.5 values and to predict the value. Henceforth, this model can be used and the variables can be
extrapolated in order to predict the total deaths in the future. Since the $R^2$ value obtained was
extremely high and the model did a good job of predicting the data, the null hypothesis can be rejected and it
can be concluded that as the PM2.5 values of a country increase the total deaths due to COVID-19
Evaluation
I tried to check the accuracy of the data I collected by comparing many different websites and
did not see any discrepancies. However, since the reported deaths were lower than the actual
deaths in all countries, we cannot determine whether my model will be able to predict accurately. Hence, it
I mainly focused on 2 variables: total deaths and PM2.5 values. However, a potential
extension of this investigation could be analyzing more variables, such as recovered cases and
population, and formulating an equation that takes many factors into consideration. I could also take
a look at other air pollutants such as $CO_2$, $SO_2$, $NO_2$ and $O_3$ and formulate an equation taking
In general, despite the limitations, I have achieved the main aim of the exploration and used
accurate data from government websites throughout. I have also rigorously conducted the
linearization of data and found the best model to predict the relationship between the
variables. All my manual calculations have been verified using technology and were mostly precise.
Bibliography
Ali, Nurshad. Islam, Farjana. “Infection and Mortality—A Review on Recent Evidence”. Front,
Bashir, Muhammad Farhan et al. “Correlation between environmental pollution indicators and
COVID-19 pandemic: A brief study in Californian context.” Environmental research vol. 187
Marco Travaglio, Yizhou Yu et al. “Links between air pollution and COVID-19 in England. Volue
2021.
disguise?.” The Science of the total environment vol. 728 (2020): 138820.
IQAir Dashboard, https://www.iqair.com/in-en/world-most-polluted-countries. Accessed April 2021.