MATH3806 Group Report
In this report, we use the air pollution data provided courtesy of Professor G. C. Tiao
for our own research. The dataset contains 42 measurements in total on air pollution
variables, all recorded at 12:00 noon in the Los Angeles area on different days.
Specifically, there are 7 variables in this dataset: Wind (x1), Solar radiation (x2),
Carbon Monoxide (CO) (x3), Nitric Oxide (NO) (x4), Nitrogen Dioxide (NO2) (x5),
Ozone (O3) (x6), and Hydrocarbon (HC) (x7). We set x3, x4, x5, x6 and x7 as the
targeted air pollution variables. To be clearer about these selected air pollutants,
some important information on each air pollution variable is given here.
First, CO (x3) can be found in fumes produced any time you burn fuel
produced by human activities, like the combustion of fossil fuels (coal, gas and oil),
especially fuel used in cars. It is also produced from making nitric acid, refining of petrol
and metals, commercial and food manufacturing. Fourth, O3 (x6) is formed when heat
and sunlight cause chemical reactions between oxides of nitrogen (NOx) and Volatile
Organic Compounds (VOC), which are also known as hydrocarbons. This reaction can
occur both near the ground and high in the atmosphere. Fifth, HC (x7): burning fossil
fuels, including gasoline in automobile engines, releases some hydrocarbons into the air.
B. Motivation:
Global warming and climate change have become a very serious issue all over the world,
and air pollution is one of the problems that causes global warming to continue to
worsen. In order to gain a deeper understanding of air pollution today, we have chosen
Los Angeles as our main research area, so as to find out the latest situation regarding
air pollution there. Los Angeles has an air quality index rating of “moderate”, with monthly
“moderate”). Although the air quality ratings in Los Angeles look quite optimistic, the
air pollution in Los Angeles is actually the worst in the United States. According to the
County of Los Angeles Public Health Department, because Los Angeles does not meet
the U.S. EPA’s national air quality standards for air pollutants, 1 in 10 children in Los
Angeles may suffer from asthma. Also, according to the South Coast Air Quality
Management District, the overall risk of cancer in Los Angeles has increased by 900 for
every million. The air pollution data for Los Angeles is therefore key to helping people
take action to alleviate these kinds of health problems. In this research, we make full
use of the accessible air pollution data from Professor G. C. Tiao to analyze the
different air pollution variables recorded in the Los Angeles area, in order to find the
main problems regarding air pollution in Los Angeles and help improve its air quality.
Most importantly, our final target is to mitigate the health issues caused by air
pollution. Thus, several models
First, we present the scatter plots for each air pollution variable, in order to
visualize the relationships between the seven selected variables. From each
scatter plot below, we can identify the patterns or correlations between two
variables. We also present the marginal dot diagram for each air pollution
variable, in order to identify any right-skewed or left-skewed data, any outliers,
and any multi-modal data. In order to find clear x-y relationships
‘x1-x6’, ‘x1-x7’:
Regarding the skewness of each variable in the air pollution data, we find that
x1, x3, x4 and x7 are approximately normally distributed, which means the data
for these variables are quite balanced, while x2, x5 and x6 are left-skewed and
right-skewed respectively, which means the data for these variables are
concentrated on one side. Regarding outliers, the marginal dot diagrams show no
serious problem, since the dispersion pattern of the dots is quite normal and
balanced. Finally, regarding the
significant.
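The visual judgement of skewness above could be supplemented with a numeric check. The following is only a minimal sketch, assuming the moment coefficient of skewness g1 = m3 / m2^(3/2) is an acceptable measure. The function name `sample_skewness` and the toy readings are hypothetical and are not taken from the air pollution data.

```python
def sample_skewness(values):
    """Moment coefficient of skewness g1 = m3 / m2**1.5 (hypothetical helper)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("all values are identical")
    return m3 / m2 ** 1.5


# Toy readings for illustration only (not the report's data).
symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 1, 1, 2, 10]
long_left_tail = [10, 10, 10, 9, 1]

print(sample_skewness(symmetric))        # zero for symmetric data
print(sample_skewness(long_right_tail))  # positive: right-skewed
print(sample_skewness(long_left_tail))   # negative: left-skewed
```

A value near zero suggests a balanced variable, a positive value a long right tail, and a negative value a long left tail.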
X1 X2 X3 X4 X5 X6 X7
After calculation, we have found the means for all seven selected variables.
Assuming all variables have the same units, we find that “solar radiation” has
the highest mean among all variables, 73.85714, while “NO” has the lowest
mean, just 2.19048. From this, we can estimate that “solar radiation”
probably has the largest effect on the model, and has a great impact on
Covariances (S):
X1 X2 X3 X4 X5 X6 X7
variables. When the greater values of one variable correspond with the greater
values of another variable (and the same holds for the lesser values), the covariance will
“X3-X4”, “X3-X5”, “X3-X6”, “X3-X7”, and so on, such that these combinations
of variables may have a positive linear relationship. On the other hand, when the
greater values of one variable correspond with the lesser values of another
variable, the covariance will be negative, which means the two variables tend to
move in opposite directions.
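The sign behaviour of the covariance can be illustrated on toy data. This is a minimal sketch, assuming the sample covariance uses the divisor n - 1. The function names and the three short series are hypothetical and are not the air pollution measurements. Dividing the covariance by the two standard deviations gives the unit-free correlation.

```python
import math


def sample_covariance(x, y):
    """Sample covariance with divisor n - 1 (assumed convention)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long series with n >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Covariance rescaled by both standard deviations, so it lies in [-1, 1]."""
    return sample_covariance(x, y) / math.sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )


# Hypothetical series, for illustration only.
rising = [1, 2, 3, 4]
also_rising = [2, 4, 6, 8]
falling = [8, 6, 4, 2]

print(sample_covariance(rising, also_rising))  # positive: large values go together
print(sample_covariance(rising, falling))      # negative: large goes with small
print(sample_correlation(rising, falling))     # close to -1: strong but inverse
```

In this sketch a negative covariance does not indicate a weak relationship: the last pair is perfectly linear, only in the opposite direction.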
Correlations (R):
X1 X2 X3 X4 X5 X6 X7
In statistics, correlation (r) measures the strength of the linear association
between two variables. When r is negative, it implies a tendency for one value
in the pair to be larger than its average when the other is smaller than its average,
for instance “X1-X2”, “X1-X3”, “X1-X4”, “X1-X5”, “X1-X6”, and so on. On the
other hand, when r is positive, it implies a tendency for one value of the pair to be
large when the other value is large and also for both values to be small together,
calculation, all of the seven selected variables follow a normal distribution
reasonably well. There may be some outliers for each variable, but most of the data for each
(black line) and the fitted normal line (red line). Below are the tables of their
[Tables of MLE estimates and standard errors for Wind, Solar, CO, NO, NO2, O3 and HC. Only one row survives:]
sd 0.6834619 0.07457109
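For reference, the maximum likelihood fit of a normal distribution has a closed form, and so do its large-sample standard errors. The sketch below assumes the usual asymptotic formulas, sd/√n for the mean and sd/√(2n) for the standard deviation. The helper name `normal_mle` and the toy sample are hypothetical. The final lines check the surviving row, assuming its two numbers are an estimate and its standard error and that n = 42.

```python
import math


def normal_mle(sample):
    """Return {parameter: (estimate, asymptotic standard error)} for a normal fit."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(sample) / n
    # The MLE of the variance divides by n, not by n - 1.
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    return {
        "mean": (mean, sd / math.sqrt(n)),
        "sd": (sd, sd / math.sqrt(2 * n)),
    }


# Hypothetical integer readings, for illustration only (not the report's data).
fit = normal_mle([2, 3, 3, 4, 3, 2, 4, 3])
print(fit["mean"], fit["sd"])

# Consistency check on the surviving row, assuming n = 42 observations.
se_sd = 0.6834619 / math.sqrt(2 * 42)
print(round(se_sd, 4))  # 0.0746, close to the reported 0.07457109
```

Under these assumptions the reported standard error agrees with sd/√(2n) to about four significant digits.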
From the plots, some factors are clearly close to normal, but some have
skewness problems. This is probably because all observed values are integers,
so variables whose values are small, such as HC and NO, are fitted poorly by a
normal distribution model. Still, the standard errors are not large compared
with their corresponding estimates for these 7
MLEs, so at some significance level, the MLE should have a good estimation to
For the first part, we use the covariance matrix (S) for our analysis.
λ2=28.28, λ3=11.46, λ4=2.52, λ5=1.28, λ6=0.53, λ7=0.21. We also find out
0.005x4 + 0.024x5 + 0.112x6 + 0.002x7, which accounts for 87% of the total
sample variance, and the first component is essentially “solar radiation”. For
the second part, we use the correlation matrix (R) for our analysis.
λ3=1.20, λ4=0.73, λ5=0.65, λ6=0.54, λ7=0.16. We also find out the first three
0.197z5 + 0.159z6 + 0.541z7, which account for 70% of the total sample
variance. The first component contrasts “wind” with the remaining variables,
which gives a general measure of the pollution level. The second component is
largely composed of “solar radiation” and the air pollutants “NO” and “O3”,
which represents the effects of solar radiation, since solar radiation is involved in
the production of NO and O3 from the other pollutants. The third component is
composed mainly of “wind” and several air pollutants, e.g. “NO” and “HC”,
which represents a wind transport effect. Thus, the air pollution data can be
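The 70% figure for the correlation-matrix analysis can be checked from the eigenvalues listed in the text alone. The eigenvalues of a correlation matrix sum to the number of variables, here 7, so the missing λ1 + λ2 follows from the listed λ3 to λ7. This is a minimal sketch of that arithmetic. The function name is hypothetical and no value of λ1 or λ2 is assumed individually.

```python
def share_of_first_three(listed, n_variables):
    """Share of total variance explained by the first three components of R.

    `listed` maps the component index (3..7) to its eigenvalue. The sum of
    all eigenvalues of a correlation matrix equals its trace, n_variables.
    """
    first_two = n_variables - sum(listed.values())  # lambda_1 + lambda_2
    first_three = first_two + listed[3]
    return first_three / n_variables


# Eigenvalues lambda_3 .. lambda_7 of R, as listed in the text.
listed_eigenvalues = {3: 1.20, 4: 0.73, 5: 0.65, 6: 0.54, 7: 0.16}

share = share_of_first_three(listed_eigenvalues, n_variables=7)
print(f"{share:.1%}")  # 70.3%
```

The result, about 70.3%, is consistent with the statement that the first three components account for 70% of the total sample variance.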
test by applying the equation: $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}$. The p-values are shown in the table
below:
      X1      X2      X3      X4      X5      X6      X7
X1    0
X2    0.5227  0
X3    0.2188  0.2466  0
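The correlation test statistic t = r√(n−2)/√(1−r²) and its two-sided p-value can be sketched as follows. This is an illustration only: n = 42 comes from the dataset description, while the value r = -0.10 is hypothetical. To stay within the standard library, the p-value is obtained by integrating the Student t density with Simpson's rule. This is an assumption of the sketch, not necessarily how the table above was produced.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def correlation_t_test(r, n, steps=2000):
    """Return (t, two-sided p-value) for H0: no correlation; steps must be even."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r * r)
    # Simpson's rule for the integral of the density over [0, |t|].
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    p_value = max(0.0, 1 - 2 * central_area)
    return t, p_value


# Hypothetical correlation; n = 42 as in the dataset description.
t_stat, p_value = correlation_t_test(r=-0.10, n=42)
print(round(t_stat, 3), round(p_value, 3))
```

A p-value above the chosen significance level, as in this hypothetical case, would mean the pair shows no significant linear correlation.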
could find that they are all positively correlated. Therefore, the statistical results
show that these 7 factors either have no correlation or positively
positively affect other factors, and this phenomenon might intensify the problem
of air pollution. If we use a 0.1 significance level, then X1X4 and X4X5 will also
(7) Conclusion
The air pollution data recorded in Los Angeles contributed to us conducting this
different factors. Based on the first PCA of the covariance matrix, the coefficient of
0.993 indicates that the first principal component, which accounts for 87% of the
total sample variance, consists almost entirely of solar radiation, so we
considered that solar radiation has the greatest effect on air pollution.
the PCA of the correlation matrix and carried out the correlation test. They are
release of NO. It is believed that the reduction of NO might have a positive effect
but also gives a deep insight into the relationship between each air pollution
factor, which could be helpful for making strategies to solve the problem of global
warming.