MATH3806 Group Report

MATH3806 Group Report - Air Pollution

A. Background of the time series data:

In this report, we use air pollution data provided courtesy of Professor G. C. Tiao. The
dataset contains 42 measurements of air pollution variables, all recorded at 12:00 noon in
the Los Angeles area on different days. There are seven variables: Wind (x1), Solar
radiation (x2), Carbon Monoxide (CO) (x3), Nitric Oxide (NO) (x4), Nitrogen Dioxide (NO2)
(x5), Ozone (O3) (x6), and Hydrocarbon (HC) (x7). We treat x3, x4, x5, x6, and x7 as the
target air pollution variables, and give brief background on each of them here.

First, CO (x3) is found in the fumes produced whenever fuel is burned in cars or trucks,
small engines, or furnaces. Second, NO (x4) forms in combustion systems and can also be
generated by lightning in thunderstorms. Third, NO2 (x5) is produced by human activities
such as the combustion of fossil fuels (coal, gas, and oil), especially fuel used in cars;
it is also produced in making nitric acid, refining petrol and metals, and in commercial
and food manufacturing. Fourth, O3 (x6) is formed when heat and sunlight cause chemical
reactions between oxides of nitrogen (NOx) and volatile organic compounds (VOC), which
include hydrocarbons; this reaction can occur both near the ground and high in the
atmosphere. Fifth, HC (x7) is released into the air when fossil fuels are burned, including
gasoline in automobile engines.

B. Motivation:
Global warming and climate change have become serious issues all over the world, and air
pollution is one of the problems that causes global warming to keep worsening. To gain a
deeper understanding of air pollution, we chose Los Angeles as our main research area. Los
Angeles has an overall air quality index (AQI) rating of “moderate”, with monthly averages
in 2019 ranging from AQI 32 (regarded as “good”) to AQI 64 (regarded as “moderate”).
Although these ratings look optimistic, air pollution in Los Angeles is the worst in the
United States. According to the County of Los Angeles Public Health Department, because Los
Angeles does not meet the U.S. EPA’s national air quality standards for air pollutants, 1
in 10 children in Los Angeles may suffer from asthma. According to the South Coast Air
Quality Management District, the overall cancer risk in Los Angeles has increased by 900
per million. Air pollution data for Los Angeles are therefore an important basis for taking
action to alleviate these health problems. In this research, we use the air pollution data
from Professor G. C. Tiao to analyze the air pollution variables recorded in the Los
Angeles area, with the aim of identifying the main air pollution problems there, helping
to improve air quality, and ultimately mitigating the health issues caused by air
pollution. The models and methods selected for this research are as follows.

C. Selected Models and Methods + Discussion of Analysis Results:


(1) Scatter plot with marginal dot diagram

First, we present scatter plots of the air pollution variables to visualize the
relationships among the seven selected variables. Each scatter plot shows the pattern or
correlation between a pair of variables. We also attach a marginal dot diagram for each
variable, in order to reveal right-skewed or left-skewed data, outliers, and multi-modal
data. The scatter plots with marginal dot diagrams cover the pairs ‘x1-x2’, ‘x1-x3’,
‘x1-x4’, ‘x1-x5’, ‘x1-x6’, and ‘x1-x7’.

Regarding skewness, x1, x3, x4, and x7 look approximately normal, which means the values
of these variables are fairly balanced around the center, while x2, x5, and x6 are skewed
(to the left or to the right), which means their values are concentrated on one side.
Regarding outliers, the marginal dot diagrams show no serious problem, since the dots are
dispersed in a regular and balanced way. Finally, the scatter plots suggest a visible
relationship in each x-y pair; whether these relationships are statistically significant
is examined by the correlation test in part (6).

(2) Mean, covariance and correlation arrays


Means:

      X1         X2        X3        X4        X5         X6         X7
7.500000   73.85714   4.54762   2.19048   10.04762   9.404762   3.095238

After calculation, we have the means of all seven selected variables. Assuming all
variables have the same units, “solar radiation” has the highest mean among all variables,
73.85714, while “NO” has the lowest, just 2.19048. This suggests that “solar radiation” is
likely to have the largest influence on any analysis based on the raw, unstandardized
values, although a large mean reflects the measurement scale rather than the strength of
its relationship with the other variables.

Covariances (S):

             X1            X2           X3           X4           X5           X6          X7
X1    2.5000000    -2.7804878   -0.3780488   -0.4634146   -0.5853659   -2.2317073   0.1707317
X2   -2.7804878   300.5156794    3.9094077   -1.3867596    6.7630662   30.7909408   0.6236934
X3   -0.3780488     3.9094077    1.5220674    0.6736353    2.3147503    2.8217189   0.1416957
X4   -0.4634146    -1.3867596    0.6736353    1.1823461    1.0882695   -0.8106852   0.1765389
X5   -0.5853659     6.7630662    2.3147503    1.0882695   11.3635308    3.1265970   1.0441347
X6   -2.2317073    30.7909408    2.8217189   -0.8106852    3.1265970   30.9785134   0.5946574
X7    0.1707317     0.6236934    0.1416957    0.1765389    1.0441347    0.5946574   0.4785134

In statistics, covariance is a measure of the joint variability of two random variables.
When larger values of one variable tend to occur with larger values of the other (and
likewise for smaller values), the covariance is positive, for instance “X1-X7”, “X2-X3”,
“X2-X5”, “X2-X6”, “X2-X7”, “X3-X4”, “X3-X5”, “X3-X6”, “X3-X7”, and so on; these pairs of
variables have a positive linear relationship. On the other hand, when larger values of
one variable tend to occur with smaller values of the other, the covariance is negative,
which indicates a negative linear relationship. Because the size of a covariance depends
on the units of the variables, the strength of each relationship is better judged from
the correlations.

Correlations (R):

             X1            X2           X3            X4           X5           X6           X7
X1    1.0000000   -0.10144191   -0.1938032   -0.26954261   -0.1098249   -0.2535928   0.15609793
X2   -0.1014419    1.00000000    0.1827934   -0.07356907    0.1157320    0.3191237   0.05201044
X3   -0.1938032    0.18279338    1.0000000    0.50215246    0.5565838    0.4109288   0.16603235
X4   -0.2695426   -0.07356907    0.5021525    1.00000000    0.2968981   -0.1339521   0.23470432
X5   -0.1098249    0.11573199    0.5565838    0.29689814    1.0000000    0.1666422   0.44776780
X6   -0.2535928    0.31912373    0.4109288   -0.13395214    0.1666422    1.0000000   0.15445056
X7    0.1560979    0.05201044    0.1660323    0.23470432    0.4477678    0.1544506   1.00000000

In statistics, the correlation (r) measures the strength of the linear association between
two components. When r is negative, it implies a tendency for one value in the pair to be
larger than its average when the other is smaller than its average, for instance “X1-X2”,
“X1-X3”, “X1-X4”, “X1-X5”, “X1-X6”, and so on. On the other hand, when r is positive, it
implies a tendency for one value of the pair to be large when the other value is large,
and for both values to be small together, for instance “X1-X7”, “X2-X3”, “X2-X5”,
“X2-X6”, “X2-X7”, and so on.
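The link between the two arrays in this part can be sketched in code: each correlation is the covariance divided by the product of the two standard deviations, r_ij = s_ij / sqrt(s_ii * s_jj). The snippet below is only an illustration in plain Python; the names `S`, `R`, and `cov_to_corr` are our own and are not part of the original analysis.

```python
import math

# Sample covariance matrix S from part (2), rows/columns ordered X1..X7.
S = [
    [ 2.5000000,  -2.7804878, -0.3780488, -0.4634146, -0.5853659, -2.2317073, 0.1707317],
    [-2.7804878, 300.5156794,  3.9094077, -1.3867596,  6.7630662, 30.7909408, 0.6236934],
    [-0.3780488,   3.9094077,  1.5220674,  0.6736353,  2.3147503,  2.8217189, 0.1416957],
    [-0.4634146,  -1.3867596,  0.6736353,  1.1823461,  1.0882695, -0.8106852, 0.1765389],
    [-0.5853659,   6.7630662,  2.3147503,  1.0882695, 11.3635308,  3.1265970, 1.0441347],
    [-2.2317073,  30.7909408,  2.8217189, -0.8106852,  3.1265970, 30.9785134, 0.5946574],
    [ 0.1707317,   0.6236934,  0.1416957,  0.1765389,  1.0441347,  0.5946574, 0.4785134],
]

def cov_to_corr(cov):
    """Return the matrix with entries r_ij = s_ij / sqrt(s_ii * s_jj)."""
    p = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(p)]  # standard deviations
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(p)] for i in range(p)]

R = cov_to_corr(S)

# Print the wind row (X1) to compare with the first row of the R table.
print([round(r, 4) for r in R[0]])
```

Dividing by the standard deviations removes the units, which is why the correlations, unlike the covariances, can be compared across pairs.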

(3) Q-Q plots


The Q-Q plot is a useful graphical tool for assessing whether a set of data came from some
theoretical distribution, especially the normal distribution. From the plots, all seven
selected variables are approximately normally distributed: each variable may have a few
outliers, but most of its points lie close to the Q-Q reference line.

(4) Maximum likelihood estimates


The above plots are the histograms of different variables with their empirical line

(black line) and the fitted normal line (red line). Below are the tables of their

maximum likelihood estimates (MLE) and standard errors:

Variable   Parameter   Estimate     Std. Error

Wind       mean         7.500000    0.2410531
           sd           1.562202    0.1704499

Solar      mean        73.85714     2.642872
           sd          17.12777     1.868793

CO         mean         4.547619    0.1880873
           sd           1.218945    0.1329974

NO         mean         2.190476    0.1657734
           sd           1.074335    0.1172191

NO2        mean        10.047619    0.5139245
           sd           3.330611    0.3633993

O3         mean         9.404762    0.8485412
           sd           5.499175    0.6000091

HC         mean         3.0952381   0.10546046
           sd           0.6834619   0.07457109

From the plots, some variables are close to normal, but others show skewness. A likely
reason is that all observed values are integers, so for variables whose values are small,
such as HC and NO, the discreteness makes a normal distribution fit poorly. Still, for all
7 variables the standard errors are small compared to the corresponding estimates, so the
MLEs should estimate the true parameters reasonably precisely.
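The tabulated standard errors are consistent with the usual large-sample formulas for a fitted normal distribution, se(mean) = sd / sqrt(n) and se(sd) = sd / sqrt(2n), with n = 42. The sketch below checks this in plain Python; the function name `normal_mle_se` and the dictionary layout are our own, and we are assuming (not asserting) that the original software used these formulas.

```python
import math

n = 42  # number of observations stated in Section A

# MLE standard deviations reported in the tables of part (4).
sd_hat = {
    "Wind": 1.562202, "Solar": 17.12777, "CO": 1.218945, "NO": 1.074335,
    "NO2": 3.330611, "O3": 5.499175, "HC": 0.6834619,
}

def normal_mle_se(sd, n):
    """Asymptotic standard errors of the normal MLEs: (se of mean, se of sd)."""
    return sd / math.sqrt(n), sd / math.sqrt(2 * n)

se = {name: normal_mle_se(sd, n) for name, sd in sd_hat.items()}

for name, (se_mean, se_sd) in se.items():
    print(f"{name:6s} se(mean)={se_mean:.7f}  se(sd)={se_sd:.7f}")
```

Both standard errors shrink at the rate 1/sqrt(n), so with only 42 observations the variables with the largest spread (Solar and O3) also have the least precise estimates.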

(5) Principal component analysis

Fifth, we perform principal component analysis in order to explain the variance-covariance
structure of the selected air pollution variables.

For the first part, we use the covariance matrix (S). The corresponding eigenvalues are
λ1=304.26, λ2=28.28, λ3=11.46, λ4=2.52, λ5=1.28, λ6=0.53, λ7=0.21. The first sample
principal component is

y1 = -0.010x1 + 0.993x2 + 0.014x3 - 0.005x4 + 0.024x5 + 0.112x6 + 0.002x7,

which accounts for 87% of the total sample variance (304.26 out of 348.54). The first
component is essentially “solar radiation”.

For the second part, we use the correlation matrix (R). The corresponding eigenvalues are
λ1=2.34, λ2=1.39, λ3=1.20, λ4=0.73, λ5=0.65, λ6=0.54, λ7=0.16. The first three sample
principal components are

(i) y1 = 0.237z1 - 0.205z2 - 0.551z3 - 0.378z4 - 0.498z5 - 0.324z6 - 0.319z7
(ii) y2 = -0.278z1 + 0.527z2 + 0.007z3 - 0.435z4 - 0.199z5 + 0.567z6 - 0.308z7
(iii) y3 = 0.644z1 + 0.225z2 - 0.113z3 - 0.407z4 + 0.197z5 + 0.159z6 + 0.541z7

which together account for 70% of the total sample variance. The first component contrasts
“wind” with the remaining variables, and can be read as a general measure of the pollution
level. The second component is largely composed of “solar radiation” and the air
pollutants “NO” and “O3”; it represents the effects of solar radiation, since solar
radiation is involved in the production of NO and O3 from the other pollutants. The third
component is composed mainly of “wind” and several air pollutants, e.g. “NO” and “HC”, and
represents a wind transport effect. Thus, the air pollution data can be effectively
summarized in three or fewer dimensions. The results differ between S and R: with S, the
first component is dominated by solar radiation because of its large variance, whereas
with R every variable is standardized and contributes on an equal footing.
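As a rough check on the covariance-based results, the leading eigenvalue and eigenvector of S can be recovered with power iteration. This is a swapped-in technique chosen because it needs no libraries; it is not necessarily how the original eigenvalues were computed, and the names `power_iteration`, `lam1`, `e1`, and `share` are our own.

```python
import math

# Sample covariance matrix S from part (2), rows/columns ordered X1..X7.
S = [
    [ 2.5000000,  -2.7804878, -0.3780488, -0.4634146, -0.5853659, -2.2317073, 0.1707317],
    [-2.7804878, 300.5156794,  3.9094077, -1.3867596,  6.7630662, 30.7909408, 0.6236934],
    [-0.3780488,   3.9094077,  1.5220674,  0.6736353,  2.3147503,  2.8217189, 0.1416957],
    [-0.4634146,  -1.3867596,  0.6736353,  1.1823461,  1.0882695, -0.8106852, 0.1765389],
    [-0.5853659,   6.7630662,  2.3147503,  1.0882695, 11.3635308,  3.1265970, 1.0441347],
    [-2.2317073,  30.7909408,  2.8217189, -0.8106852,  3.1265970, 30.9785134, 0.5946574],
    [ 0.1707317,   0.6236934,  0.1416957,  0.1765389,  1.0441347,  0.5946574, 0.4785134],
]

def power_iteration(mat, iters=200):
    """Approximate the largest eigenvalue and its unit eigenvector."""
    p = len(mat)
    v = [1.0] * p  # arbitrary starting vector
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' M v gives the eigenvalue for the unit vector v.
    lam = sum(v[i] * sum(mat[i][j] * v[j] for j in range(p)) for i in range(p))
    return lam, v

lam1, e1 = power_iteration(S)
total_variance = sum(S[i][i] for i in range(len(S)))  # trace of S
share = lam1 / total_variance

print(f"lambda1 = {lam1:.2f}, share of total variance = {share:.1%}")
print("loadings:", [round(c, 3) for c in e1])
```

The share is the leading eigenvalue divided by the trace of S, which is the same quantity as the 87% quoted for the first component. The loading on x2 is close to 1 because the variance of solar radiation (about 300.5) is far larger than that of any other variable.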

(6) Correlation test

From part (2), we obtained a table of correlations. To find out whether the correlations
are significant, we conducted a correlation test using the statistic

$t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}$

where r is the sample correlation and n = 42 is the number of observations. The p-values
are shown in the table below:

     X1        X2       X3          X4        X5         X6       X7
X1   0
X2   0.5227    0
X3   0.2188    0.2466   0
X4   0.08431   0.6433   0.0007029   0
X5   0.4887    0.4655   0.0001293   0.05622   0
X6   0.1051    0.0394   0.006865    0.3977    0.2915     0
X7   0.3236    0.7436   0.2933      0.1346    0.002945   0.3288   0


At the 0.05 significance level, the correlations of X2X6, X3X4, X3X5, X3X6 and X5X7 are
statistically significant, and all of them are positive. Therefore, at this level, each
pair among these 7 factors is either not significantly correlated or positively
correlated. This means that higher solar radiation or higher levels of one pollutant tend
to go with higher levels of other factors, which might intensify the problem of air
pollution. At the 0.1 significance level, X1X4 and X4X5 also become significant. The
correlation of X4X5 is positive, but the correlation of X1X4 is negative: at this level,
stronger wind is significantly associated with a lower amount of the air pollutant ‘NO’.
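The test can be sketched in plain Python as follows. We assume the p-values are two-sided and refer to a Student t distribution with n - 2 = 40 degrees of freedom; the tail area is approximated with Simpson's rule, since the standard library has no t distribution. The names `t_pdf` and `corr_test` are our own, and the function assumes |r| < 1.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def corr_test(r, n, steps=2000):
    """Return (t, two-sided p-value) for H0: true correlation is zero."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r * r)
    # Simpson's rule (steps must be even) for the integral of the density
    # from 0 to |t|; doubling it gives the central area P(|T| <= |t|).
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = 2 * total * h / 3
    return t, 1 - central

# Wind vs NO (X1X4) and solar radiation vs O3 (X2X6), correlations from part (2).
t14, p14 = corr_test(-0.26954261, 42)
t26, p26 = corr_test(0.31912373, 42)
print(f"X1X4: t = {t14:.3f}, p = {p14:.4f}")
print(f"X2X6: t = {t26:.3f}, p = {p26:.4f}")
```

With n fixed at 42, the p-value depends only on |r|, so the ordering of the p-values in the table mirrors the ordering of the absolute correlations.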

(7) Conclusion

The air pollution data recorded in Los Angeles made this study possible. The principal
component analysis (PCA) summarizes the relationships among the different factors. In the
first principal component of the covariance matrix, the coefficient of 0.993 on solar
radiation shows that this component, which accounts for 87% of the total sample variance,
is essentially solar radiation, so we consider solar radiation to be the dominant source
of variation in the data. However, solar radiation is hard to control in practice, so we
turned to the PCA of the correlation matrix and carried out the correlation test. They
show 2 important relationships. Firstly, wind is negatively correlated with most of the
pollutants (CO, NO, NO2 and O3). Secondly, NO has a significant positive correlation with
other pollutants, namely CO at the 0.05 level and NO2 at the 0.1 level. Like solar
radiation, wind is uncontrollable, but it is possible to control the release of NO. A
reduction of NO might therefore help to decrease the other pollutants.


Hence, our analysis not only describes the effect of the different factors, but also gives
insight into the relationships among the air pollution factors, which could be helpful for
designing strategies to address the problem of global warming.
