Math IA
Internal Assessment
Darren Boesono
12 IBCP-2
Subject:
Topic:
The bivariate analysis of the correlation between the GDP per capita and the death rate
attributed to smoking of different countries using the Pearson correlation coefficient.
1. INTRODUCTION
GDP per capita is a common indicator of a country’s development. It measures a country’s
economic output divided by its population, which makes it a simple measure of the country’s
standard of living and prosperity. I believe that the economic development of a country
affects tobacco use in that country, which in turn contributes to the country’s death rate.
In this investigation, I will mainly focus on the portion of the death rate within a country
that is caused by smoking.
The aim of this investigation is to examine the correlation between the GDP per capita of
different countries and their death rate attributed to smoking. By the end of this
investigation, I will draw a conclusion about the correlation between GDP per capita and the
death rate caused by cigarette consumption within a country, by obtaining data from published
research and applying the mathematics I have learned in my school years so far.
2. DATA COLLECTION
➢ The GDP per capita of different countries from 1990 to 2017 was collected from a
worldwide database website called “Our World In Data”.
➢ The death rate attributed to smoking was obtained from the same source as the GDP
per capita (“Our World In Data”).
The data are recorded annually for almost every country, from 1990 to 2017. The values of
both variables differ from country to country and from year to year. In this investigation,
both variables are taken from the year 2017, so the year is held constant.
3. RAW DATA
The table below displays the GDP per capita of 30 different countries in the year 2017 and
the annual number of deaths attributed to smoking per 100,000 people.
➢ Data in Table 1: GDP per capita is expressed in US dollars ($) and the number of deaths
attributed to smoking is expressed per 100,000 people.
➢ Variable x: The GDP per capita in 30 countries ($).
➢ Variable y: The annual number of deaths attributed to smoking per 100,000 people.
The table above shows that the country with the highest GDP per capita is Singapore
(85535.4 US dollars) and the country with the lowest is Afghanistan (1803.9 US dollars).
The country with the highest annual number of deaths attributed to smoking is Laos (141.19
per 100,000 people), while the lowest is Nigeria (17.72 per 100,000 people). No trend is
visible in the table itself because the countries are listed in no particular order. Both
variables were recorded in the year 2017.
4. PROCESSED DATA
Linear regression is the process of finding the line that best fits the recorded data points
when they are plotted. It is used to predict output values for inputs that are not present in
the recorded data set, on the assumption that those output values will fall on the best fit
line. Simple linear regression uses only one explanatory variable, while multiple linear
regression uses more than one explanatory variable. Simple linear regression is suitable for
this investigation since the investigation deals with one independent variable (x) and one
dependent variable (y).
The equation of the simple linear regression line has the form y = mx + b, where y is the
dependent variable (plotted on the y-axis) and x is the independent variable (plotted on the
x-axis). The slope of the line is m and the y-intercept is b.
The formula for finding the gradient, or slope (m), of a regression line is:
$$ m = \frac{S_{xy}}{(S_x)^2} $$
$$ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} $$
$$ (S_x)^2 = \sum x^2 - \frac{(\sum x)^2}{n} $$
The formula for finding the y-intercept (b) of a regression line is:
$$ y - \bar{y} = m(x - \bar{x}) $$
Where:
➢ m : The slope of the regression line
➢ Sxy : The sum of the products of the deviations of x and y from their means
➢ (Sx)² : The sum of the squared deviations of x from its mean
➢ Σx : The sum of all of the x values
➢ Σy : The sum of all of the y values
➢ Σxy : The sum of all of the x values multiplied by their corresponding y values
➢ ȳ : The mean of all of the y values
➢ x̄ : The mean of all of the x values
➢ n : The total number of data points in the investigation, or 30.
Row / Column   1          2         3               4             5
               x          y         x²              y²            xy
1              1803.9     89.89     3254055.21      8080.2121     162152.571
2              67335.3    76.08     4534042626      5788.1664     5122869.624
3              27216.4    108.88    740732429       11854.8544    2963321.632
4              26808.1    99.55     718674225.6     9910.2025     2668746.355
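The derived columns of the table (x², y² and xy) can be reproduced with a short script. The sketch below is illustrative only: it uses just the four rows visible above, and `build_columns` is a hypothetical helper name that is not part of the original investigation.

```python
# Four (x, y) pairs copied from rows 1-4 of the table above.
# x = GDP per capita ($), y = deaths attributed to smoking per 100,000 people.
rows = [
    (1803.9, 89.89),
    (67335.3, 76.08),
    (27216.4, 108.88),
    (26808.1, 99.55),
]


def build_columns(pairs):
    """Return one (x, y, x^2, y^2, xy) tuple per data point (hypothetical helper)."""
    return [(x, y, x * x, y * y, x * y) for x, y in pairs]


table = build_columns(rows)
for number, (x, y, x2, y2, xy) in enumerate(table, start=1):
    print(number, x, y, round(x2, 2), round(y2, 4), round(xy, 3))

# Column totals for these four rows only; the investigation itself sums all 30 rows.
totals = [sum(column) for column in zip(*table)]
print("partial totals:", [round(total, 3) for total in totals])
```

The same helper applied to all 30 countries would give the totals used for Σx, Σy, Σx², Σy² and Σxy in the calculation.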
5. DATA CALCULATION
$$ m = \frac{S_{xy}}{(S_x)^2} $$
$$ S_{xy} = 55536758.68 - \frac{(803087.41)(2394.78)}{30} $$
$$ S_{xy} = 55536758.68 - 64107255.59 $$
$$ S_{xy} = -8570496.91 $$
$$ (S_x)^2 = 34081066451 - \frac{(803087.41)^2}{30} $$
$$ (S_x)^2 = 34081066451 - 21498312900 $$
$$ (S_x)^2 \approx 12582800000 $$
$$ m = \frac{-8570496.91}{12582800000} $$
$$ m \approx -0.000681 $$
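The slope calculation can be checked numerically. The sketch below is a minimal illustration that assumes the column totals quoted in the calculation are correct. The function name `slope_from_sums` is hypothetical.

```python
# Column totals as quoted in the calculation (row 31 of the table).
n = 30
sum_x = 803087.41
sum_y = 2394.78
sum_xy = 55536758.68
sum_x2 = 34081066451


def slope_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (Sxy, (Sx)^2, m) using the formulas stated in section 4."""
    s_xy = sum_xy - (sum_x * sum_y) / n
    s_xx = sum_x2 - (sum_x ** 2) / n
    return s_xy, s_xx, s_xy / s_xx


s_xy, s_xx, m = slope_from_sums(n, sum_x, sum_y, sum_xy, sum_x2)
print(f"Sxy    = {s_xy:.2f}")
print(f"(Sx)^2 = {s_xx:.0f}")
print(f"m      = {m:.6f}")
```

Working from the totals rather than the raw rows mirrors the hand calculation, so each printed value can be compared line by line with the steps above.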
$$ y - \bar{y} = m(x - \bar{x}) $$
$$ b = \bar{y} - m\bar{x} = 79.826 - (-0.000681)(26769.58033) $$
$$ b \approx 98.06 $$
$$ y = mx + b $$
$$ m = -0.000681 $$
$$ b \approx 98.06 $$
$$ y = -0.000681x + 98.06 $$
Σxy is recorded in (column 5, row 31), Σx in (column 1, row 31), Σy in (column 2, row 31),
Σx² in (column 3, row 31) and Σy² in (column 4, row 31). Applying these values to the formula
stated gives m = -0.000681. The slope m is then used to find the y-intercept of the
regression line through y - ȳ = m(x - x̄), where x̄ and ȳ are the mean values of the two
variables. This gives the mean point (x̄, ȳ) = (26769.58033, 79.826). Setting x = 0 gives the
y-intercept b = 79.826 + (0.000681)(26769.58033) ≈ 98.06. The final line equation is therefore
y = -0.000681x + 98.06. The linear regression above serves to predict the annual number of
deaths attributed to smoking per 100,000 people in a country from the independent variable
(GDP per capita). The y-intercept is the number of deaths that the model predicts even
without the influence of GDP per capita: when x is 0, the value of y is 98.06. As the value
of x increases, the value of y decreases. Thus, the equation tells us that the relationship
between the dependent variable (y) and the independent variable (x) is negative. The
regression line can be drawn on the scatter diagram of the data. The best fit line is drawn
on the scatter diagram in Graph.2.
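$$ r = \frac{S_{xy}}{S_x S_y} $$
$$ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} $$
$$ S_x = \sqrt{\sum x^2 - \frac{(\sum x)^2}{n}} $$
$$ S_y = \sqrt{\sum y^2 - \frac{(\sum y)^2}{n}} $$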
r-values   Correlation
$$ r = \frac{S_{xy}}{S_x S_y} $$
$$ S_{xy} = 55536758.68 - \frac{(803087.41)(2394.78)}{30} $$
$$ S_{xy} = 55536758.68 - 64107255.59 $$
$$ S_{xy} = -8570496.91 $$
$$ S_x = \sqrt{34081066451 - \frac{(803087.41)^2}{30}} $$
$$ S_x = \sqrt{34081066451 - 21498312900} $$
$$ S_x = \sqrt{12582800000} $$
$$ S_x = 112173 $$
$$ S_y = \sqrt{\sum y^2 - \frac{(\sum y)^2}{n}} $$
$$ S_y = \sqrt{226335.4992 - \frac{(2394.78)^2}{30}} $$
$$ S_y = \sqrt{226335.4992 - 191165.7083} $$
$$ S_y = \sqrt{35169.7909} $$
$$ S_y = 187.5361056 $$
$$ r = \frac{-8570496.91}{(112173)(187.5361056)} $$
$$ r = \frac{-8570496.91}{21036500} $$
$$ r = -0.407411 $$
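The correlation coefficient can be verified in the same way. The sketch below assumes the quoted column totals are correct. `pearson_r_from_sums` is a hypothetical helper name, and the final cross-check relies on the standard identity that the regression slope equals r·Sy/Sx.

```python
import math

# Column totals as quoted in the calculation (row 31 of the table).
n = 30
sum_x, sum_y = 803087.41, 2394.78
sum_xy = 55536758.68
sum_x2, sum_y2 = 34081066451, 226335.4992


def pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Return (r, Sx, Sy) computed from the column totals."""
    s_xy = sum_xy - (sum_x * sum_y) / n
    s_x = math.sqrt(sum_x2 - (sum_x ** 2) / n)
    s_y = math.sqrt(sum_y2 - (sum_y ** 2) / n)
    return s_xy / (s_x * s_y), s_x, s_y


r, s_x, s_y = pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(f"Sx = {s_x:.0f}")
print(f"Sy = {s_y:.4f}")
print(f"r  = {r:.4f}")

# Cross-check: the regression slope should equal r * Sy / Sx.
slope_check = r * s_y / s_x
print(f"r * Sy / Sx = {slope_check:.6f}")
```

Because r always lies between -1 and +1, a result outside that range would signal an arithmetic slip in one of the totals.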
➢ Interpretation of r-value
From the calculations above, the formula for Pearson’s correlation coefficient,
r = Sxy / (Sx Sy), is used. The Pearson product-moment correlation coefficient, denoted by
r, is a measure of the correlation between two variables x and y, giving a value between +1
and -1 inclusive. Table.3 shows the interpretation of the r-value. The Pearson correlation
coefficient between variables x and y is r = -0.407411. The correlation between the two
variables is negative, meaning that the variables move in opposite directions: when one
variable increases, the other tends to decrease. As can be seen from Graph.2, as GDP per
capita increases, the number of deaths attributed to smoking decreases, showing a negative
correlation between the x and y variables. From Table.3, the correlation between variables x
and y is weak, since |r| = 0.407411 falls in the range 0.25 < |r| ≤ 0.5, which is
categorized as a weak correlation between two variables.
In Graph.2, the best fit line is plotted using the regression line equation obtained from
the calculation above. The best fit line, or linear regression line, is based on the actual
statistical reports of GDP per capita and the number of deaths attributed to smoking in 30
different countries. As seen in the graph, the line slopes downward, so the two variables
move in opposite directions: as the x variable (GDP per capita) increases, the y variable
(the number of deaths attributed to smoking) decreases. This suggests that the formula,
though it does not fit each actual data point exactly, can be used to estimate the number of
deaths attributed to smoking from the GDP per capita of other countries. It also indicates
that the mathematical theory and calculation were applied accurately enough to produce a
linear equation from which the estimated number of deaths caused by smoking can be
calculated from the GDP per capita of a country.
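Using the line for estimation can be sketched as follows. This is an illustrative example that takes only the slope and mean point quoted in section 5 and recomputes the intercept from them, since a regression line always passes through the mean point. The function name `predict_deaths` is hypothetical, and the four observations are rows 1-4 of the data table.

```python
# Slope and mean point as quoted in section 5.
m = -0.000681
mean_x, mean_y = 26769.58033, 79.826

# The regression line passes through the mean point, so b = mean_y - m * mean_x.
b = mean_y - m * mean_x


def predict_deaths(gdp_per_capita):
    """Estimated deaths attributed to smoking per 100,000 people (hypothetical helper)."""
    return m * gdp_per_capita + b


# Rows 1-4 of the data table: (GDP per capita, observed deaths per 100,000 people).
observed = [(1803.9, 89.89), (67335.3, 76.08), (27216.4, 108.88), (26808.1, 99.55)]

for x, y in observed:
    estimate = predict_deaths(x)
    print(f"x = {x:8.1f}  observed = {y:6.2f}  "
          f"estimated = {estimate:6.2f}  residual = {y - estimate:+6.2f}")
```

The residual column shows how far each observation lies from the line, which makes the imperfect fit mentioned above visible for individual countries.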
While the linear regression formula is used to produce an equation that models the expected
number of deaths caused by smoking in different countries, the Pearson product-moment
correlation coefficient is further used to verify whether the two variables (GDP per capita
and number of deaths caused by smoking) are actually related to each other. As calculated,
the Pearson correlation coefficient, denoted by r, is categorized as a weak correlation
according to Table.3. The r-value tells us that the GDP per capita and the number of deaths
caused by smoking in different countries are only weakly related to each other. In the graph
of the recorded data, there is no clear trend that links the two variables.
This tells us that a higher GDP per capita will not necessarily lower the number of deaths
attributed to smoking. There are several arguments that support this statement. The first
is that cigarettes are addictive: smokers who develop an addiction to cigarettes will ignore
negative side effects, such as lung disease and damage to their cardiovascular system. This
holds in the short run at least, because an addiction to cigarettes is very hard to overcome
in a short period of time. Even though governments attempt to lower cigarette consumption by
imposing a variety of regulations, cigarette consumption mainly depends on the smoker. It is
the smoker’s choice to find alternatives that end their addiction to cigarettes. The second
is that there are countries with low GDP per capita that have high death rates attributed to
factors other than smoking. For instance, there are countries in extreme poverty, which
results in a low GDP per capita, that have low numbers of deaths attributed to smoking but
high numbers of deaths attributed to other factors. The hypothetical reason for this is that
deaths resulting from smoking are a small proportion compared with deaths resulting from
hunger, sickness and other causes.
The arguments above suggest that the GDP per capita and the number of deaths attributed to
smoking do not correlate with each other. However, there are reasonable arguments that
support a correlation between the two variables. Death from smoking is a result of the
consumption of cigarettes. Cigarette consumption can be reduced when the government imposes
taxes on cigarettes. High taxes can mostly be imposed by developed countries (with high GDP
per capita) because they do not depend on the revenue that comes from the sale of
cigarettes. However, most undeveloped countries do not impose taxes on cigarettes because
they need the extra revenue from the sale of cigarettes to fund merit goods or public
expenditure that develops their economy.
7. CONCLUSION
The aim of this investigation, as mentioned previously, was to determine the correlation
between the two variables. First, the simple linear regression equation was used to produce
the line of best fit for the two variables. Then the correlation between the two variables
was found using the Pearson correlation coefficient, denoted by r. The result is that the two
variables have a weak negative correlation, because other factors also affect the number of
deaths attributed to smoking. The equation therefore cannot be used to reliably predict the
number of deaths attributed to smoking from the GDP per capita of a country, due to the weak
correlation indicated by the r-value.
8. FURTHER IMPROVEMENTS
The data in this investigation are limited to the year 2017 for both variables. No more
recent data that reflect today’s world situation were available. One way to improve the
accuracy of the results would be to use cigarette consumption data for each country.
However, such data could not be found on statistical websites. Therefore, the annual number
of deaths attributed to smoking was used instead of the cigarette consumption of each
country.
9. BIBLIOGRAPHY
➢ Buchanan, L., Fensom, J., Kemp, E., Rondie, P. L., & Stevens, J. (2012).
Mathematics: standard level. Oxford: Oxford University Press.
➢ Ritchie, H., & Roser, M. (2013, May 23). Smoking. Retrieved April 13, 2020, from
https://ourworldindata.org/smoking
➢ GDP per capita. (n.d.). Retrieved from
https://ourworldindata.org/grapher/gdp-per-capita-worldbank