Air Quality Prediction Using Machine Learning

Abdul Majeed K K
Vellore Institute of Technology University
Mahammad Abubakar Shaik Janubhai (addusj21@gmail.com)
Vellore Institute of Technology University
Mohammed Khalid Totlapalli Shaik
Vellore Institute of Technology University

Research Article

Keywords: Air Quality Index, Machine Learning, Linear regression, Logistic regression, Artificial Neural
Networks

Posted Date: December 22nd, 2023

DOI: https://doi.org/10.21203/rs.3.rs-3676592/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Additional Declarations: No competing interests reported.

Abstract
Air pollution is one of the major problems facing the world today, and the quality of the air we breathe is deteriorating year by year. Air pollution has been rising rapidly since 2010, and several reports state that in every year since 2015 it has exceeded the total recorded over the entire preceding decade. To live sustainably, the air we breathe must be of good quality and free of pollutants. To predict and monitor air quality, data on the various pollutants that degrade it were collected and used as features for a machine learning model that predicts the air quality index (AQI) of a particular place from the pollutant values. Models based on linear regression, logistic regression, and artificial neural networks (ANNs) were built and compared in terms of accuracy. Simple models such as linear and logistic regression were trained first and achieved good accuracies; the more complex artificial neural network subsequently achieved the highest accuracy of all on the test data set.

Introduction
Air pollution has increased rapidly since 2010, and according to most reports, in every year since 2015 it has exceeded the total recorded over the entire preceding decade. For us to live sustainably, the air we breathe needs to be healthy and free of toxins. Air pollution has grown into a very serious issue, and particulate matter has a far larger influence on human health than other pollutants. Fine particulate matter (PM2.5) has a diameter small enough to enter the bronchioles and travel deep into the alveoli, impairing gas exchange in the lungs. Prolonged exposure to particulate matter has been established to increase the risk of lung cancer as well as respiratory and cardiovascular diseases, and there are eleven other air pollutants that are also dangerous to human health. As a result, predicting air quality has become crucial so that people can plan their activities accordingly. The complex relationship between the pollutants and the AQI can be analysed with machine learning models: choosing an appropriate model and training it properly allows the target quantity to be predicted accurately.

Materials and methods


The required data, consisting of the values of all 12 pollutants (PM2.5, PM10, NO, NO2, NOx, NH3, CO, SO2, O3, benzene, toluene, xylene) as input features and the corresponding AQI values as targets, was collected from the Kaggle website, and Microsoft Excel was used for preprocessing. The machine learning models were then built in GNU Octave. Simple models (simple linear regression, multiple linear regression, simple logistic regression, and multiple logistic regression) were implemented first, followed by a more complex model based on artificial neural networks.
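
As a rough illustration of the preprocessing step, the minimal Octave sketch below loads a pollutant/AQI table, standardizes the 12 pollutant columns, and converts the AQI column into the two-class target used later (0 for AQI 0-150, 1 for AQI above 150). The file name aqi_data.csv and the column layout (a pre-cleaned, purely numeric CSV) are assumptions made for illustration, not the exact layout of the Kaggle file, and the preprocessing in the study itself was done in Excel.

    % Minimal preprocessing sketch (GNU Octave); file name and column order are assumed.
    data = csvread('aqi_data.csv', 1, 0);   % skip header row; columns 1-12 = pollutants, 13 = AQI
    X    = data(:, 1:12);                   % 12 pollutant concentrations as input features
    aqi  = data(:, 13);                     % computed AQI values (target)

    % Shuffle rows to avoid any ordering bias before splitting and training
    idx  = randperm(rows(data));
    X    = X(idx, :);
    aqi  = aqi(idx);

    % Feature scaling / standardization (zero mean, unit variance per feature)
    mu    = mean(X);
    sigma = std(X);
    X_std = (X - mu) ./ sigma;

    % Two-class target: 0 = sustainable (AQI 0-150), 1 = not sustainable (AQI > 150)
    y = double(aqi > 150);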

There are many machine learning methods and techniques that can be used to build the desired model. Based on the literature, most existing models use regression (Zheyuan Zhang, 2023; Maheshwari and Lamba, 2019; Y. Zhang et al., 2019), classification (Rosero-Montalvo et al., 2018; Nandini and Fathima, 2019; Ketu and Mishra, 2021), or artificial neural networks (X. Jin et al., 2023; Fang et al., 2023; Kow et al., 2022; Yan et al., 2020; N. Jin et al., 2021; Zhendong Zhang et al., 2021; Kök et al., 2017). In this study we use all three approaches, regression, classification and artificial neural networks, to build an optimal machine learning model through a comparative study similar to a few earlier works (Sinha and Singh, 2021; Madan et al., 2020).

Linear Regression Model

First, a simple linear regression model is implemented with fine particulate matter (PM2.5) as the single input feature. The model is initially trained without regularization; after observing and analysing the learning curves, it is retrained with the optimal regularization value, and the training, cross-validation, and test set accuracies are calculated. In the same way, the multiple linear regression model uses all 12 pollutants as input features, follows the same steps, and its overall performance is evaluated on the test data set. The steps for building the model are as follows (a code sketch is given after the list):

Initially the regularization parameter (lambda) and the weights of each feature are set to zero.
The training data are shuffled to avoid any ordering bias.
The training data are then pre-processed using feature scaling and standardization, which speeds up training.
A bias term is added to the input data.
The model is trained using gradient descent as the optimization algorithm and mean squared error as the cost function.
The built-in Octave function 'fminunc' (find minimum of unconstrained multivariable function) carries out the training with the maximum number of epochs set to 200; this function chooses the learning rate by itself.
After training, the optimized weights are used with the linear regression hypothesis to predict output values for the training data set.
The output is classified into two classes based on the calculated AQI values: 0-150 is the sustainable category, labelled '0', and greater than 150 is the not-sustainable category, labelled '1'.
The predicted and actual output values of the training data set are converted using this criterion and compared to obtain the training set accuracy.

The same is done with the cross-validation data set to calculate the validation set accuracy.
The learning curves are then used to analyse whether the trained model suffers from high bias or high variance.
Since either problem can be addressed with regularization, the optimal value of the regularization parameter (lambda) is calculated.
The model is retrained with this optimal lambda, and the training and cross-validation accuracies are recalculated with the updated weights.
Finally, the updated weights are used to predict the outputs of the test data set; after converting the predicted and actual values as above, the two are compared to obtain the test set accuracy, which reflects the overall performance of the trained model.
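
A condensed Octave sketch of this training loop is given below. It assumes the standardized feature matrix X_std, the raw AQI targets aqi, and the binary labels y from the preprocessing sketch above; the cost function is the regularized mean squared error, fminunc is used as in the paper, and the variable and function names are illustrative rather than taken from the authors' code. (The helper function can live in the same script in a recent Octave or be saved as a separate linRegCost.m file.)

    % Regularized linear regression cost and gradient (MSE); theta(1) is the bias weight.
    function [J, grad] = linRegCost(theta, X, y, lambda)
      m    = rows(X);
      h    = X * theta;                                  % hypothesis: linear in the features
      reg  = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);
      J    = (1 / (2 * m)) * sum((h - y) .^ 2) + reg;
      grad = (1 / m) * (X' * (h - y));
      grad(2:end) += (lambda / m) * theta(2:end);
    end

    Xb     = [ones(rows(X_std), 1), X_std];              % add bias column
    theta0 = zeros(columns(Xb), 1);                      % weights initialized to zero
    lambda = 0;                                          % no regularization on the first pass

    opts  = optimset('GradObj', 'on', 'MaxIter', 200);   % fminunc chooses its own step size
    theta = fminunc(@(t) linRegCost(t, Xb, aqi, lambda), theta0, opts);

    % Predicted AQI values are mapped to the two classes (>150 -> 1) before scoring accuracy.
    pred_class = double(Xb * theta > 150);
    accuracy   = mean(pred_class == y) * 100;

The simple linear regression variant differs only in that Xb contains a single feature column (PM2.5) plus the bias column.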

Logistic Regression Model

Next, simple logistic regression is implemented, again with fine particulate matter (PM2.5) as the single input feature. The model is first trained without regularization; after observing and analysing the learning curves, it is retrained with the best regularization value, and the training, cross-validation, and test set accuracies are calculated. The 12 pollutants are used as input features for the multiple logistic regression model in the same manner as for the simple model, the same procedure is followed, and the test data set is used to assess overall performance. The steps for building the model are as follows (a code sketch is given after the list):

Initially the regularization parameter (lambda) and the weights of each feature are set to zero.
The training data are shuffled to avoid any ordering bias.
The output is classified into two classes based on the calculated AQI values: 0-150 is the sustainable category, labelled '0', and greater than 150 is the not-sustainable category, labelled '1'.
The actual output values of the training, cross-validation and testing data sets are converted beforehand according to this criterion.
The training data are then pre-processed using feature scaling and standardization, which speeds up training.
A bias term is added to the input data.
The model is trained using gradient descent as the optimization algorithm and log loss as the cost function.
The built-in Octave function 'fminunc' carries out the training with the maximum number of epochs set to 200; this function chooses the learning rate by itself.

After training, the optimized weights are used with the logistic regression hypothesis (the sigmoid function) to predict output values for the training data set.
The predicted values are converted with a threshold function: a predicted value greater than or equal to 0.5 is classified as '1', and anything below 0.5 as '0'. These are then compared with the actual output values to obtain the training set accuracy.
The same is done with the cross-validation data set to calculate the validation set accuracy.
The learning curves are then used to analyse whether the trained model suffers from high bias or high variance.
Since either problem can be addressed with regularization, the optimal value of the regularization parameter (lambda) is calculated.
The model is retrained with this optimal lambda, and the training and validation accuracies are recalculated with the updated weights.
Finally, the updated weights are used to predict the outputs of the test data set; after converting the predicted and actual values as above, the two are compared to obtain the test set accuracy, which reflects the overall performance of the trained model.
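
As with the linear model, a compact Octave sketch of the logistic regression training loop is shown below. It reuses the standardized features X_std and binary labels y defined earlier, uses the regularized log loss as the cost function, and thresholds the sigmoid output at 0.5; the names are again illustrative, not the authors' exact code.

    % Regularized logistic regression: sigmoid hypothesis and log-loss cost with gradient.
    function [J, grad] = logRegCost(theta, X, y, lambda)
      m    = rows(X);
      h    = 1 ./ (1 + exp(-X * theta));                 % sigmoid hypothesis
      reg  = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);
      J    = -(1 / m) * sum(y .* log(h) + (1 - y) .* log(1 - h)) + reg;
      grad = (1 / m) * (X' * (h - y));
      grad(2:end) += (lambda / m) * theta(2:end);
    end

    Xb     = [ones(rows(X_std), 1), X_std];              % add bias column
    theta0 = zeros(columns(Xb), 1);
    lambda = 0;                                          % first pass without regularization

    opts  = optimset('GradObj', 'on', 'MaxIter', 200);
    theta = fminunc(@(t) logRegCost(t, Xb, y, lambda), theta0, opts);

    % Threshold the sigmoid output at 0.5 to obtain class labels, then score accuracy.
    pred_class = double(1 ./ (1 + exp(-Xb * theta)) >= 0.5);
    accuracy   = mean(pred_class == y) * 100;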

Artificial Neural Networks Model

Finally, artificial neural networks (ANNs) are used to train a model with a single input feature as well as with multiple input features. The building and training procedure is the same in both cases: the model is initially trained without regularization, then, after observing and analysing the learning curves, it is retrained with the best regularization value. The updated weights are then used to predict on unseen (test) data to evaluate the overall performance of the model. The steps for building the model are as follows (a code sketch is given after the list):

The neural network has four layers in total, including two hidden layers. The input layer has one neuron (single-feature model) or 12 neurons (multi-feature model) excluding the bias unit, each hidden layer has 30 neurons excluding the bias unit, and the output layer has a single neuron.
The hidden layers and the output layer use the sigmoid activation function.
The other hyperparameters are set initially as follows: the regularization term (lambda) is 0 and the number of epochs is 50.
The weights are initialized to random values in the range -0.12 to 0.12.
The training data are shuffled to avoid any ordering bias.

The output is classified into two classes based on the calculated AQI values: 0-150 is the sustainable category, labelled '0', and greater than 150 is the not-sustainable category, labelled '1'.
The actual output values of the training, cross-validation and testing data sets are converted beforehand according to this criterion.
The training data are then pre-processed using feature scaling and standardization for faster training.
A bias unit is added to every layer except the output layer.
The network is trained using gradient descent with backpropagation as the optimization algorithm and log loss as the cost function.
The built-in Octave function 'fminunc' carries out the training with the maximum number of epochs set to 50, as mentioned earlier; this function chooses the learning rate by itself.
After training, the optimized weights are used to predict output values for the training data set by feedforward propagation of the input data; a predicted value greater than or equal to 0.5 is classified as '1', and anything below 0.5 as '0'. These are then compared with the actual output values to obtain the training set accuracy.
The same is done with the cross-validation data set to calculate the validation set accuracy.
The learning curves are then used to analyse whether the trained model suffers from high bias or high variance.
Since either problem can be addressed with regularization, the optimal value of the regularization parameter (lambda) is calculated.
The model is retrained with this optimal lambda, and the training and validation accuracies are recalculated with the updated weights.
Finally, the updated weights are used to predict the outputs of the test data set; after converting the predicted and actual values as above, the two are compared to obtain the test set accuracy, which reflects the overall performance of the trained model.
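
The Octave sketch below outlines how the described 12-30-30-1 network (the multi-feature case) can be trained: weights are drawn uniformly from [-0.12, 0.12], the cost is the log loss, gradients come from backpropagation, and fminunc runs for 50 iterations. The helper names, the unrolling of the weight matrices into a single vector, and the reuse of X_std and y from the preprocessing sketch are implementation choices made for this sketch, not taken from the paper; for the single-feature model only the first weight matrix changes shape.

    function g = sigmoid(z)
      g = 1 ./ (1 + exp(-z));
    end

    % Log-loss cost and backpropagation gradient for a 12-30-30-1 sigmoid network.
    % Theta1 (30 x 13), Theta2 (30 x 31), Theta3 (1 x 31) are unrolled into nn_params.
    function [J, grad] = nnCost(nn_params, X, y, lambda)
      n      = columns(X);
      Theta1 = reshape(nn_params(1 : 30*(n+1)), 30, n+1);
      Theta2 = reshape(nn_params(30*(n+1)+1 : 30*(n+1)+30*31), 30, 31);
      Theta3 = reshape(nn_params(end-30 : end), 1, 31);
      m      = rows(X);

      % Feedforward propagation with a bias unit added to every layer except the output.
      a1 = [ones(m, 1), X];
      a2 = [ones(m, 1), sigmoid(a1 * Theta1')];
      a3 = [ones(m, 1), sigmoid(a2 * Theta2')];
      h  = sigmoid(a3 * Theta3');

      reg = (lambda / (2*m)) * (sum(Theta1(:,2:end)(:).^2) + ...
                                sum(Theta2(:,2:end)(:).^2) + sum(Theta3(:,2:end)(:).^2));
      J = -(1/m) * sum(y .* log(h) + (1 - y) .* log(1 - h)) + reg;

      % Backpropagation of the output error through the two hidden layers.
      d4 = h - y;
      d3 = (d4 * Theta3)(:, 2:end) .* a3(:, 2:end) .* (1 - a3(:, 2:end));
      d2 = (d3 * Theta2)(:, 2:end) .* a2(:, 2:end) .* (1 - a2(:, 2:end));

      G1 = (1/m) * (d2' * a1);  G1(:, 2:end) += (lambda/m) * Theta1(:, 2:end);
      G2 = (1/m) * (d3' * a2);  G2(:, 2:end) += (lambda/m) * Theta2(:, 2:end);
      G3 = (1/m) * (d4' * a3);  G3(:, 2:end) += (lambda/m) * Theta3(:, 2:end);
      grad = [G1(:); G2(:); G3(:)];
    end

    % Random initialization in [-0.12, 0.12] and training with fminunc for 50 epochs.
    init   = @(r, c) rand(r, c) * 2 * 0.12 - 0.12;
    params = [init(30, 13)(:); init(30, 31)(:); init(1, 31)(:)];
    opts   = optimset('GradObj', 'on', 'MaxIter', 50);
    params = fminunc(@(p) nnCost(p, X_std, y, 0), params, opts);
    % Predictions are obtained by the same feedforward pass with the trained weights,
    % thresholded at 0.5 as described in the text.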

Results and Discussion


Simple Linear Regression

Training the model without regularization resulted in training and validation accuracies of 90.833% and 97.62%, respectively, as shown in Fig. 1(a). The learning curve in Fig. 1(b) indicates that the model has high bias. From Fig. 1(c), the optimal value of the regularization parameter (lambda) is found to be zero, the value set initially; since lambda is unchanged, regularization has little impact, and the test set accuracy obtained with the non-regularized weights is 85.71%, as shown in Fig. 1(d).
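
For reference, a rough Octave sketch of how such a learning curve can be produced is given below: the model is trained on growing subsets of the training data and the training error is compared with the cross-validation error. It reuses linRegCost, Xb, aqi, theta0 and opts from the linear regression sketch above and assumes a cross-validation split Xcv/aqi_cv (with bias column) has been prepared; it illustrates the general bias/variance diagnostic, not the exact script behind Fig. 1(b). Curves that converge at a high error suggest high bias, while a persistent gap suggests high variance.

    % Learning-curve sketch: training vs. cross-validation error over training set size.
    m     = rows(Xb);
    sizes = round(linspace(5, m, 20));
    for k = 1:numel(sizes)
      s  = sizes(k);
      th = fminunc(@(t) linRegCost(t, Xb(1:s, :), aqi(1:s), 0), theta0, opts);
      err_train(k) = linRegCost(th, Xb(1:s, :), aqi(1:s), 0);   % unregularized training error
      err_cv(k)    = linRegCost(th, Xcv, aqi_cv, 0);            % error on the CV split (assumed)
    end
    plot(sizes, err_train, sizes, err_cv);
    legend('training error', 'cross-validation error');
    xlabel('training set size'); ylabel('mean squared error');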

Multiple Linear Regression

Training the model without regularization resulted in training and validation accuracies of 95% and 100%, respectively, as shown in Fig. 2(a). The learning curve in Fig. 2(b) indicates that the model has high variance. From Fig. 2(c), the optimal value of the regularization parameter (lambda) is found to be 15. Finally, the test set accuracy obtained with the regularized weights is 90.47%, as shown in Fig. 2(d).
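
The value lambda = 15 reported above comes from the validation curve in Fig. 2(c). A minimal sketch of such a sweep, under the same assumptions and variable names as the earlier sketches (Xcv/ycv denote the cross-validation features with bias column and binary labels), might look as follows; the grid of candidate lambda values is illustrative.

    % Sweep a grid of lambda values, retrain each time, and keep the lambda with the best
    % cross-validation accuracy after thresholding predicted AQI at 150.
    lambdas  = [0 0.01 0.03 0.1 0.3 1 3 10 15 30];
    best_acc = -Inf;
    for lam = lambdas
      th  = fminunc(@(t) linRegCost(t, Xb, aqi, lam), theta0, opts);
      acc = mean(double(Xcv * th > 150) == ycv);
      if acc > best_acc
        best_acc    = acc;
        best_lambda = lam;
      end
    end
    printf('optimal lambda = %g (cross-validation accuracy %.2f%%)\n', best_lambda, 100*best_acc);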

Simple Logistic Regression

Training the model without regularization resulted in training and validation accuracies of 95.83% and 97.62%, respectively, as shown in Fig. 3(a). The learning curve in Fig. 3(b) indicates that the model has high variance. From Fig. 3(c), the optimal value of the regularization parameter (lambda) is found to be zero, the value set initially; since lambda is unchanged, regularization has little impact, and the test set accuracy obtained with the non-regularized weights is 83.33%, as shown in Fig. 3(d).

Multiple Logistic Regression

Training the model without regularization resulted in training and validation accuracies of 95.83% and 90.47%, respectively, as shown in Fig. 4(a). The learning curve in Fig. 4(b) indicates that the model has high variance. From Fig. 4(c), the optimal value of the regularization parameter (lambda) is found to be zero, the value set initially; since lambda is unchanged, regularization has little impact, and the test set accuracy obtained with the non-regularized weights is 88.09%, as shown in Fig. 4(d).

Artificial Neural Networks with single input feature

Training the model without regularization resulted in training and validation accuracies of 95% and 97.62%, respectively, as shown in Fig. 5(a). The learning curve in Fig. 5(b) indicates that the model has high bias. From Fig. 5(c), the optimal value of the regularization parameter (lambda) is found to be zero, the value set initially; since lambda is unchanged, regularization has little impact, and the test set accuracy obtained with the non-regularized weights is 83.33%, as shown in Fig. 5(d).

Artificial Neural Networks with multiple input features

Training the model without regularization resulted in training and validation accuracies of 100% and 83.33%, respectively, as shown in Fig. 6(a). The learning curve in Fig. 6(b) indicates that the model has high variance. From Fig. 6(c), the optimal value of the regularization parameter (lambda) is found to be 0.3. Finally, the test set accuracy obtained with the regularized weights is 95.24%, as shown in Fig. 6(d).

Conclusion
Air pollution is becoming more severe every day and contributes significantly to ozone layer depletion, global warming, and the emergence of numerous dangerous diseases, potentially even pandemics. To avoid these issues and live sustainably, we should monitor air pollution and air quality and lower the emission of air contaminants. One way to support this is an accurate machine learning model that predicts the AQI and then classifies whether a place is suitable for living. We have therefore built several machine learning models based on linear regression, logistic regression, and artificial neural networks and compared their accuracies. Of all the models built, the artificial neural network with multiple input features achieved the highest test set accuracy, more than 95%, which reflects the performance of the overall model. Many other machine learning models have been developed for air quality prediction, but most rely on advanced deep-learning methods and have very high computational complexity. Our model is comparatively simple, yet it achieves an accuracy higher than many of those models.

Declarations
Funding

No funding was received to assist with the preparation of this manuscript.

Competing interests

The authors have no competing interests to declare that are relevant to the content of this article.

Data Availability

The original dataset used for this research is available at https://www.kaggle.com/datasets/rohanrao/air-quality-data-in-india. The data that support the findings of this research are publicly available in a repository at https://github.com/Abubakar021/AQI-ML-Project/tree/main and can also be accessed via the identifier https://doi.org/10.5281/zenodo.10322581.

References

1. Fang, W., Zhu, R., Chun-Wei Lin, J., An air quality prediction model based on improved Vanilla LSTM
with multichannel input and multiroute output, Expert Systems with Applications, Volume 211, 2023,
118422, ISSN 0957-4174, https://doi.org/10.1016/j.eswa.2022.118422.
2. Jin, N., Zeng, Y., Yan, K. and Ji, Z., Multivariate Air Quality Forecasting With Nested Long Short Term
Memory Neural Network, IEEE Transactions on Industrial Informatics, vol. 17, no. 12, pp. 8514-8522,
Dec. 2021, doi: https://doi.org/10.1109/TII.2021.3065425.
3. Jin, X., Wang, Z., Kong, J., Bai, Y., Su, T., Ma, H. and Chakrabarti, P., Deep Spatio-Temporal Graph
Network with Self-Optimization for Air Quality Prediction, 2023, Entropy 25, no. 2: 247.
https://doi.org/10.3390/e25020247
4. Ketu, S., Mishra, P.K., Scalable kernel-based SVM classification algorithm on imbalance air quality
data for proficient healthcare. Complex Intell. Syst. 7, 2597– 2615 (2021).
https://doi.org/10.1007/s40747-021-00435-5
5. Kök, İ., Şimşek, M. U. and Özdemir, S., "A deep learning model for air quality prediction in smart cities,"
2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 2017, pp. 1983-1990,
doi: https://doi.org/10.1109/BigData.2017.8258144.
6. Kow, P., Hsia, I., Chang, L., Chang, F., Real-time image-based air quality estimation by deep learning
neural networks, Journal of Environmental Management, Volume 307, 2022, 114560, ISSN 0301-
4797, https://doi.org/10.1016/j.jenvman.2022.114560.
7. Madan, T., Sagar, S. and Virmani, D., "Air Quality Prediction using Machine Learning Algorithms – A Review," 2020 2nd International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), Greater Noida, India, 2020, pp. 140-145, doi: https://doi.org/10.1109/ICACCCN51052.2020.9362912.
8. Maheshwari, K. and Lamba, S., "Air Quality Prediction using Supervised Regression Model," 2019
International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT),
Ghaziabad, India, 2019, pp. 1-7, doi: https://doi.org/10.1109/ICICT46931.2019.8977694.
9. Nandini, K. and Fathima, G., "Urban Air Quality Analysis and Prediction Using Machine Learning,"
2019 1st International Conference on Advanced Technologies in Intelligent Control, Environment,
Computing & Communication Engineering (ICATIECE), Bangalore, India, 2019, pp. 98-102, doi:
https://doi.org/10.1109/ICATIECE45860.2019.9063845.
10. Rosero-Montalvo, P. D. et al., "Air Quality Monitoring Intelligent System Using Machine Learning
Techniques," 2018 International Conference on Information Systems and Computer Science
(INCISCOS), Quito, Ecuador, 2018, pp. 75-80, doi: https://doi.org/10.1109/INCISCOS.2018.00019.
11. Sinha, A. and Singh, S., "Dynamic Forecasting of Air Pollution in Delhi Zone Using Machine Learning Algorithm," Quantum Journal of Engineering, Science and Technology, vol. 2, no. 3, 2021, pp. 40-53.
12. Yan, R., Liao, J., Yang, J., Sun, W., Nong, M. and Li, F., Multi-hour and multi-site air quality index forecasting in Beijing using CNN, LSTM, CNN-LSTM, and spatiotemporal clustering, Expert Systems with Applications, 169, 114513 (2020), https://doi.org/10.1016/j.eswa.2020.114513.

13. Zhang, Y. et al., "A Predictive Data Feature Exploration-Based Air Quality Prediction Approach," in IEEE
Access, vol. 7, pp. 30732-30743, 2019, doi: https://doi.org/10.1109/ACCESS.2019.2897754.
14. Zhang, Z., Zeng, Y. & Yan, K., A hybrid deep learning technology for PM2.5 air quality forecasting.
Environ Sci Pollut Res 28, 39409–39422 (2021). https://doi.org/10.1007/s11356-021-12657-8
15. Zhang, Z., Wang, J., Xiong, N. et al., Air Pollution Exposure Based on Nighttime Light Remote Sensing and Multi-source Geographic Data in Beijing. Chin. Geogr. Sci. 33, 320–332 (2023). https://doi.org/10.1007/s11769-023-1339-z

Figures

Figure 1: Simple linear regression. (a) Training without regularization; (b) learning curve; (c) learning curve for the optimal lambda value; (d) accuracy of the overall model on the test data set.

Figure 2: Multiple linear regression. (a) Training without regularization; (b) learning curve; (c) learning curve for the optimal lambda value; (d) accuracy of the overall model on the test data set.

Figure 3: Simple logistic regression. (a) Training without regularization; (b) learning curve; (c) learning curve for the optimal lambda value; (d) accuracy of the overall model on the test data set.

Figure 4: Multiple logistic regression. (a) Training without regularization; (b) learning curve; (c) learning curve for the optimal lambda value; (d) accuracy of the overall model on the test data set.

Figure 5: Artificial neural network with a single input feature. (a) Training without regularization; (b) learning curve; (c) learning curve for the optimal lambda value; (d) accuracy of the overall model on the test data set.

Figure 6: Artificial neural network with multiple input features. (a) Training without regularization; (b) learning curve; (c) learning curve for the optimal lambda value; (d) accuracy of the overall model on the test data set.
