MATH3806 Group Report

MATH3806 Group Report - Air Pollution
A. Background of the time series data:
In this report, we use the air pollution data provided courtesy of Professor G. C.
Tiao. The dataset contains 42 measurements of air pollution variables, all recorded
at 12:00 noon in the Los Angeles area on different days. There are 7 variables: Wind
(x1), Solar radiation (x2), Carbon Monoxide (CO) (x3), Nitric Oxide (NO) (x4), Nitrogen
Dioxide (NO2) (x5), Ozone (O3) (x6), and Hydrocarbon (HC) (x7). We treat x3, x4, x5,
x6 and x7 as the target air pollution variables and briefly describe each of them
below. First, CO (x3) is found in the fumes produced whenever fuel is burned in cars
or trucks, small engines, or furnaces. Second, NO (x4) forms in combustion systems
and can also be generated by lightning in thunderstorms. Third, NO2 (x5) is produced
by human activities such as the combustion of fossil fuels (coal, gas and oil),
especially fuel used in cars. It is also produced in making nitric acid, refining
petrol and metals, and commercial and food manufacturing. Fourth, O3 (x6) is formed
when heat and sunlight cause chemical reactions between oxides of nitrogen (NOx) and
Volatile Organic Compounds (VOC), which include hydrocarbons. This reaction can
occur both near the ground and high in the atmosphere. Fifth, HC (x7) is released
into the air when fossil fuels are burned, including gasoline in automobile engines.
B. Motivation:
Global warming and climate change have become serious issues all over the world, and
air pollution is one of the problems that contribute to them. To gain a deeper
understanding of air pollution, we have chosen Los Angeles as our research area.
Los Angeles has an air quality index (AQI) rating of “moderate”, with monthly
averages in 2019 varying from AQI 32 (regarded as “good”) to AQI 64 (regarded as
“moderate”). Although these ratings look fairly favorable, the air pollution in Los
Angeles is the worst in the United States. According to the County of Los Angeles
Public Health Department, because Los Angeles does not meet the U.S. EPA’s national
air quality standards for air pollutants, 1 in 10 children in Los Angeles may suffer
from asthma. According to the South Coast Air Quality Management District, the
overall cancer risk in Los Angeles has increased by 900 for every million. Air
pollution data for Los Angeles are therefore important for motivating action on
these health problems. In this research, we use the air pollution data from
Professor G. C. Tiao to analyze the different air pollution variables recorded in
the Los Angeles area, in order to identify the main air pollution problems there,
inform efforts to improve air quality and, ultimately, help mitigate the health
issues caused by air pollution. The models and methods selected for this research
are as follows.
C. Selected Models and Methods + Discussion of Analysis Results:

(1) Scatter plot with marginal dot diagram
First, we present scatter plots of the air pollution variables to visualize the
relationships among the seven variables. Each scatter plot shows the pattern or
correlation between two variables. We also present a marginal dot diagram for
each variable in order to identify right-skewed or left-skewed data, outliers, and
multi-modal data. The scatter plots with marginal dot diagrams shown here are
‘x1-x2’, ‘x1-x3’, ‘x1-x4’, ‘x1-x5’, ‘x1-x6’ and ‘x1-x7’:
Regarding skewness, x1, x3, x4 and x7 look roughly normal, which means the data for
these variables are fairly symmetric, while x2, x5 and x6 are skewed, either to the
left or to the right, which means most of their values lie on one side. Regarding
outliers, the marginal dot diagrams show no serious problem, since the dots are
dispersed fairly evenly. Finally, the scatter plots suggest some relationship
between x1 and each of the other variables; the strength of these relationships is
quantified by the correlations in part (2).
(2) Mean, covariance and correlation arrays

Means:
X1        X2        X3       X4       X5        X6        X7
7.500000  73.85714  4.54762  2.19048  10.04762  9.404762  3.095238
The table above gives the means of all seven variables. Treating all variables as
if they shared the same units, “solar radiation” has the highest mean (73.85714)
and “NO” has the lowest (2.19048). A large mean does not by itself imply a large
effect, but because solar radiation is measured on a much larger scale than the
other variables, we expect it to dominate any analysis based on the raw
covariances.
Covariances (S):
             X1           X2           X3           X4           X5           X6           X7
X1    2.5000000   -2.7804878   -0.3780488   -0.4634146   -0.5853659   -2.2317073    0.1707317
X2   -2.7804878  300.5156794    3.9094077   -1.3867596    6.7630662   30.7909408    0.6236934
X3   -0.3780488    3.9094077    1.5220674    0.6736353    2.3147503    2.8217189    0.1416957
X4   -0.4634146   -1.3867596    0.6736353    1.1823461    1.0882695   -0.8106852    0.1765389
X5   -0.5853659    6.7630662    2.3147503    1.0882695   11.3635308    3.1265970    1.0441347
X6   -2.2317073   30.7909408    2.8217189   -0.8106852    3.1265970   30.9785134    0.5946574
X7    0.1707317    0.6236934    0.1416957    0.1765389    1.0441347    0.5946574    0.4785134
In statistics, covariance is a measure of the joint variability of two random
variables. When greater values of one variable correspond with greater values of
another variable (and likewise for lesser values), the covariance is positive, for
instance “X1-X7”, “X2-X3”, “X2-X5”, “X2-X6”, “X2-X7”, “X3-X4”, “X3-X5”, “X3-X6”,
“X3-X7”, and so on, so these pairs of variables tend to move in the same direction.
On the other hand, when greater values of one variable correspond with lesser
values of another variable, the covariance is negative, which means the two
variables tend to move in opposite directions. The size of a covariance depends on
the units of the variables, so the strength of a linear relationship is better
judged from the correlations below.
Correlations (R):
             X1           X2           X3           X4           X5           X6           X7
X1    1.0000000  -0.10144191   -0.1938032  -0.26954261   -0.1098249   -0.2535928   0.15609793
X2   -0.1014419   1.00000000    0.1827934  -0.07356907    0.1157320    0.3191237   0.05201044
X3   -0.1938032   0.18279338    1.0000000   0.50215246    0.5565838    0.4109288   0.16603235
X4   -0.2695426  -0.07356907    0.5021525   1.00000000    0.2968981   -0.1339521   0.23470432
X5   -0.1098249   0.11573199    0.5565838   0.29689814    1.0000000    0.1666422   0.44776780
X6   -0.2535928   0.31912373    0.4109288  -0.13395214    0.1666422    1.0000000   0.15445056
X7    0.1560979   0.05201044    0.1660323   0.23470432    0.4477678    0.1544506   1.00000000
In statistics, the correlation (r) measures the strength of the linear association
between two variables. When r is negative, it implies a tendency for one value in
the pair to be larger than its average when the other is smaller than its average,
for instance “X1-X2”, “X1-X3”, “X1-X4”, “X1-X5”, “X1-X6”, and so on. On the other
hand, when r is positive, it implies a tendency for one value of the pair to be
large when the other value is large and also for both values to be small together,
for instance “X1-X7”, “X2-X3”, “X2-X5”, “X2-X6”, “X2-X7”, and so on.
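The link between the two tables can be checked directly, since each correlation is the covariance divided by the product of the two standard deviations, r_ij = s_ij / sqrt(s_ii * s_jj). The sketch below is only an illustration in Python (the report does not state which software produced its tables); the names `S`, `LABELS` and `cov_to_corr` are our own.

```python
import math

# Sample covariance matrix S, copied from the covariance table in part (2).
LABELS = ["X1", "X2", "X3", "X4", "X5", "X6", "X7"]
S = [
    [2.5000000, -2.7804878, -0.3780488, -0.4634146, -0.5853659, -2.2317073, 0.1707317],
    [-2.7804878, 300.5156794, 3.9094077, -1.3867596, 6.7630662, 30.7909408, 0.6236934],
    [-0.3780488, 3.9094077, 1.5220674, 0.6736353, 2.3147503, 2.8217189, 0.1416957],
    [-0.4634146, -1.3867596, 0.6736353, 1.1823461, 1.0882695, -0.8106852, 0.1765389],
    [-0.5853659, 6.7630662, 2.3147503, 1.0882695, 11.3635308, 3.1265970, 1.0441347],
    [-2.2317073, 30.7909408, 2.8217189, -0.8106852, 3.1265970, 30.9785134, 0.5946574],
    [0.1707317, 0.6236934, 0.1416957, 0.1765389, 1.0441347, 0.5946574, 0.4785134],
]


def cov_to_corr(cov):
    """Return the correlation matrix with r_ij = s_ij / sqrt(s_ii * s_jj)."""
    size = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(size)]  # standard deviations
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(size)] for i in range(size)]


R = cov_to_corr(S)
for label, row in zip(LABELS, R):
    print(label, [round(value, 4) for value in row])
```

The printed rows should agree with the correlation table above up to rounding, which confirms that the two tables describe the same data.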
(3) Q-Q plots

The Q-Q plot is a useful graphical tool for assessing whether a set of data came
from some theoretical distribution, especially the normal distribution. In the Q-Q
plots, all seven variables appear approximately normal: there may be some outliers
for each variable, but most of the data for each variable lie close to the Q-Q
normal line.
(4) Maximum likelihood estimates

The above plots are the histograms of the different variables, each with its
empirical density (black line) and the fitted normal density (red line). The table
below gives the maximum likelihood estimates (MLE) and their standard errors:
Variable  Parameter  Estimate    Std. Error
Wind      mean       7.500000    0.2410531
Wind      sd         1.562202    0.1704499
Solar     mean       73.85714    2.642872
Solar     sd         17.12777    1.868793
CO        mean       4.547619    0.1880873
CO        sd         1.218945    0.1329974
NO        mean       2.190476    0.1657734
NO        sd         1.074335    0.1172191
NO2       mean       10.047619   0.5139245
NO2       sd         3.330611    0.3633993
O3        mean       9.404762    0.8485412
O3        sd         5.499175    0.6000091
HC        mean       3.0952381   0.10546046
HC        sd         0.6834619   0.07457109
From the plots, some variables are close to normal, but some show skewness. One
likely reason is that all observed values are integers, so for variables whose
values are small, such as HC and NO, the discreteness makes a normal distribution
fit poorly. Still, the standard errors are small relative to the corresponding
estimates for all 7 variables, so the MLEs should estimate the true parameters
reasonably precisely.
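The tabulated values are consistent with the usual results for a normal model: the MLE of the standard deviation divides by n rather than n - 1, the standard error of the mean is sd / sqrt(n), and the asymptotic standard error of the sd is sd / sqrt(2n). The following Python sketch reproduces the Wind row from the sample mean and variance in part (2). It is an illustration under those assumptions, not necessarily how the report's software computed the table, and the function name `normal_mle` is our own.

```python
import math


def normal_mle(sample_mean, sample_var, n):
    """Normal-model MLEs and their asymptotic standard errors.

    sample_var is the usual sample variance (divisor n - 1), as in matrix S.
    """
    sd_mle = math.sqrt(sample_var * (n - 1) / n)  # MLE uses divisor n
    return {
        "mean": sample_mean,
        "sd": sd_mle,
        "se_mean": sd_mle / math.sqrt(n),
        "se_sd": sd_mle / math.sqrt(2 * n),
    }


# Wind (x1): mean 7.5 and variance 2.5 from part (2), n = 42 observations.
wind = normal_mle(7.5, 2.5, 42)
print({key: round(value, 6) for key, value in wind.items()})
```

Repeating the call with the mean and variance of any other variable should reproduce the corresponding rows of the table.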
(5) Principal component analysis
Fifth, we perform a principal component analysis to explain the
variance-covariance structure of the air pollution variables. For the first part,
we use the covariance matrix (S). The corresponding eigenvalues are: λ1=304.26,
λ2=28.28, λ3=11.46, λ4=2.52, λ5=1.28, λ6=0.53, λ7=0.21. The first sample principal
component is: y1 = -0.010x1 + 0.993x2 + 0.014x3 - 0.005x4 + 0.024x5 + 0.112x6 +
0.002x7, which accounts for 87% (304.26/348.54) of the total sample variance. This
component is essentially “solar radiation”, because the variance of x2 (300.52) is
far larger than that of any other variable. For the second part, we use the
correlation matrix (R). The corresponding eigenvalues are: λ1=2.34, λ2=1.39,
λ3=1.20, λ4=0.73, λ5=0.65, λ6=0.54, λ7=0.16. The first three sample principal
components are:
(i) y1 = 0.237z1 - 0.205z2 - 0.551z3 - 0.378z4 - 0.498z5 - 0.324z6 - 0.319z7
(ii) y2 = -0.278z1 + 0.527z2 + 0.007z3 - 0.435z4 - 0.199z5 + 0.567z6 - 0.308z7
(iii) y3 = 0.644z1 + 0.225z2 - 0.113z3 - 0.407z4 + 0.197z5 + 0.159z6 + 0.541z7
Together they account for 70% of the total sample variance. The first component
contrasts “wind” with the remaining variables and can be read as a general measure
of the pollution level. The second component is largely composed of “solar
radiation” and the air pollutants “NO” and “O3”, and may represent the effect of
solar radiation, since solar radiation is involved in the production of NO and O3
from the other pollutants. The third component is composed mainly of “wind” and
several air pollutants, e.g. “NO” and “HC”, and may represent a wind transport
effect. Thus, the air pollution data can be effectively summarized in three or
fewer dimensions. The two analyses differ: with S, the first component is
dominated by solar radiation because of its large variance, whereas with R all
variables are standardized and three components are needed to reach 70% of the
total variance.
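The leading eigenvalue and eigenvector of S can be checked with a few lines of code. The sketch below uses power iteration in plain Python, chosen here only because it needs no external library; it is not the method used in the report, whose software is not stated. The names `power_iteration`, `lambda1` and `e1` are our own.

```python
import math

# Sample covariance matrix S, copied from the covariance table in part (2).
S = [
    [2.5000000, -2.7804878, -0.3780488, -0.4634146, -0.5853659, -2.2317073, 0.1707317],
    [-2.7804878, 300.5156794, 3.9094077, -1.3867596, 6.7630662, 30.7909408, 0.6236934],
    [-0.3780488, 3.9094077, 1.5220674, 0.6736353, 2.3147503, 2.8217189, 0.1416957],
    [-0.4634146, -1.3867596, 0.6736353, 1.1823461, 1.0882695, -0.8106852, 0.1765389],
    [-0.5853659, 6.7630662, 2.3147503, 1.0882695, 11.3635308, 3.1265970, 1.0441347],
    [-2.2317073, 30.7909408, 2.8217189, -0.8106852, 3.1265970, 30.9785134, 0.5946574],
    [0.1707317, 0.6236934, 0.1416957, 0.1765389, 1.0441347, 0.5946574, 0.4785134],
]


def power_iteration(matrix, iterations=100):
    """Approximate the largest eigenvalue and its unit-length eigenvector."""
    size = len(matrix)
    v = [1.0 / math.sqrt(size)] * size  # arbitrary unit starting vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # For a unit vector v, the Rayleigh quotient v'Sv estimates the eigenvalue.
    sv = [sum(matrix[i][j] * v[j] for j in range(size)) for i in range(size)]
    eigenvalue = sum(v[i] * sv[i] for i in range(size))
    return eigenvalue, v


lambda1, e1 = power_iteration(S)
total_variance = sum(S[i][i] for i in range(len(S)))  # trace of S
proportion = lambda1 / total_variance

print("lambda1:", round(lambda1, 2))
print("coefficients:", [round(c, 3) for c in e1])
print("proportion of total variance:", round(proportion, 3))
```

Because the first eigenvalue is roughly ten times the second, the iteration converges within a handful of steps. The proportion is the first eigenvalue divided by the trace of S, which is the calculation behind the 87% figure.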
(6) Correlation test
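From part (2), we obtained a table of correlations. To find out whether these
correlations are significant, we conducted a correlation test for each pair using
the statistic
$t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$,
where r is the sample correlation and n = 42 is the number of observations. Under
the null hypothesis of zero correlation, t follows a t distribution with n - 2
degrees of freedom. The p-values are shown in the table below: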
    X1         X2         X3         X4         X5         X6         X7
X1  0
X2  0.5227     0
X3  0.2188     0.2466     0
X4  0.08431    0.6433     0.0007029  0
X5  0.4887     0.4655     0.0001293  0.05622    0
X6  0.1051     0.0394     0.006865   0.3977     0.2915     0
X7  0.3236     0.7436     0.2933     0.1346     0.002945   0.3288     0

At the 0.05 significance level, the correlations of X2X6, X3X4, X3X5, X3X6 and
X5X7 are statistically significant, and all five are positive. Therefore, at this
level, each pair of the 7 variables shows either no significant correlation or a
positive correlation. That means increases in solar radiation or in one pollutant
tend to accompany increases in other variables, and this phenomenon might
intensify the problem of air pollution. At the 0.1 significance level, X1X4 and
X4X5 also become significant. The correlation of X4X5 is positive, but the
correlation of X1X4 is negative: at this level, stronger wind is significantly
associated with lower levels of the air pollutant ‘NO’.
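The test can be reproduced from the correlation table alone. The Python sketch below computes t for every pair and compares |t| with two-sided critical values for 40 degrees of freedom. The critical values 2.021 (5% level) and 1.684 (10% level) are taken from standard t tables and are an assumption of this sketch, since the report itself works with p-values; the names `CORRELATIONS`, `t_statistic` and `significant_pairs` are our own.

```python
import math

N = 42            # number of observations
T_CRIT_05 = 2.021  # two-sided 5% critical value, 40 df (standard t table)
T_CRIT_10 = 1.684  # two-sided 10% critical value, 40 df (standard t table)

# Correlations from part (2), rounded to seven decimals.
CORRELATIONS = {
    ("X1", "X2"): -0.1014419, ("X1", "X3"): -0.1938032, ("X1", "X4"): -0.2695426,
    ("X1", "X5"): -0.1098249, ("X1", "X6"): -0.2535928, ("X1", "X7"): 0.1560979,
    ("X2", "X3"): 0.1827934, ("X2", "X4"): -0.0735691, ("X2", "X5"): 0.1157320,
    ("X2", "X6"): 0.3191237, ("X2", "X7"): 0.0520104, ("X3", "X4"): 0.5021525,
    ("X3", "X5"): 0.5565838, ("X3", "X6"): 0.4109288, ("X3", "X7"): 0.1660323,
    ("X4", "X5"): 0.2968981, ("X4", "X6"): -0.1339521, ("X4", "X7"): 0.2347043,
    ("X5", "X6"): 0.1666422, ("X5", "X7"): 0.4477678, ("X6", "X7"): 0.1544506,
}


def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2) for a sample correlation r."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def significant_pairs(critical_value):
    """Pairs whose |t| exceeds the given two-sided critical value."""
    return sorted(
        pair for pair, r in CORRELATIONS.items()
        if abs(t_statistic(r, N)) > critical_value
    )


sig_05 = significant_pairs(T_CRIT_05)
sig_10 = significant_pairs(T_CRIT_10)
print("significant at 0.05:", sig_05)
print("additional at 0.10:", [pair for pair in sig_10 if pair not in sig_05])
```

Comparing |t| with a critical value gives the same accept or reject decisions as comparing the p-value with the significance level, so the lists printed here should match the pairs named in the discussion above.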
(7) Conclusion
This study was based on air pollution data recorded in Los Angeles. The principal
component analysis (PCA) summarizes the covariance and correlation structure among
the variables. In the first principal component of the covariance matrix, the
coefficient of 0.993 on solar radiation shows that this component is essentially
solar radiation, and the component accounts for 87% of the total sample variance,
so solar radiation dominates the variation in the raw data. However, solar
radiation cannot be controlled in practice, so we turned to the PCA of the
correlation matrix and the correlation test. They point to 2 patterns: firstly,
wind is negatively correlated with most of the other variables (all except HC),
although only its correlation with NO is significant, and only at the 0.1 level.
Secondly, NO has a significant positive correlation with CO and, at the 0.1 level,
with NO2. Wind is also uncontrollable, but it is possible to control the release
of NO. Reducing NO emissions might therefore help to decrease the other
pollutants, although correlation alone does not establish this.

Hence, our analysis not only describes the effect of the different factors but
also gives insight into the relationships among the air pollution variables, which
could help in forming strategies to tackle air pollution and, in turn, global
warming.
