Integrated Multiple Directed Attention-Based Deep Learning For Improved Air Pollution Forecasting

IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL.
70, 2021 3520815
Integrated Multiple Directed Attention-Based Deep

Learning for Improved Air Pollution Forecasting
Abdelkader Dairi , Fouzi Harrou , Member, IEEE, Sofiane Khadraoui , and Ying Sun
Abstract— In recent years, human health across the world pollution is becoming a critical problem in urban areas and
is becoming concerned by a constant threat of air pollution, industrialized countries and one of the principal factors for
which causes many chronic diseases and premature mortalities. global warming. Mitigating air pollution is a paramount issue
Poor air quality does not have only serious adverse effects on
human health and vegetation but also some major negative in developing countries, notably in larger urban areas with
political, societal, and economic impacts. Hence, it is essential a high concentration of emission sources, including vehicles
to invest more effort on accurate forecasting of ambient air and industrial activities. Many epidemiological studies showed
pollution to provide practical and relevant solutions, achieve the effect of certain chemical compounds such as sulfur
acceptable air quality, and plan for prevention. In this work, dioxide (SO2 ), nitrogen dioxide (NO2 ), ozone (O3 ), or dust
we propose a flexible and efficient deep learning-driven model
to forecast concentrations of ambient pollutants. This article particles in the air on the health of the general population and
introduces first the traditional variational autoencoder (VAE) particularly noticeable on sensitive people such as asthmatics,
and the attention mechanism to develop the forecasting modeling children, and elderly [1]. Today, air quality is a multidis-
strategy based on the innovative integrated multiple directed ciplinary problem that mobilizes epidemiological specialists,
attention (IMDA) deep learning architecture. To assess the per- specialists in transport modeling, emissions and transformation
formance of the proposed forecasting methodology, experimental
validation is then performed using air pollution data from four of pollutants, geographical systems, forecasting, and local
US states. Six statistical indicators have been used to evaluate authorities and industrialists. Thus, monitoring the ambient
the forecasting accuracy. A discussion of the results obtained air quality is essential to achieve acceptable air quality [2].
finally demonstrates the satisfying performance of integrated Over the past few decades, much effort has been made to
multiple directed attention variational autoencoder (IMDA-VAE) enhance air quality [3], [4]. For instance, air quality networks
methods to forecast different pollutants in different locations.
Furthermore, results indicate that the proposed IMDA-VAE (composed of numerous measurement stations) were installed
model can effectively improve air pollution forecasting per- across almost all countries to monitor numerous air pollutants’
formance and outperforms the deep learning models, namely concentration levels. This study attempts to design an efficient
VAE, long short-term memory (LSTM), gated recurrent units deep learning data-driven model for forecasting concentrations
(GRU), bidirectional LSTM, bidirectional GRU, and ConvLSTM. of carbon monoxide (CO), nitrogen monoxide (NO), nitrogen
We also showed that the forecasting results of the proposed model
surpass the performance of LSTM and GRU with the attention dioxide (NO2 ), sulfur dioxide (SO2 , and ozone (O3 ).
mechanism. Reliable forecasting of air pollutants gives valuable infor-
mation to help people take the necessary precautions to
Index Terms— Air pollution concentrations forecasting, deep
learning, integrated multiple directed attention variational avoid undesirable consequences. Also, it permits taking more
autoencoder (IMDA-VAE), multiple directed attention, time adequate countermeasures for preventing air pollution crisis
series. and protecting public health. The need for flexible forecasting
techniques that can accurately forecast air pollutants has
I. I NTRODUCTION considerably drawn researchers’ and engineers’ attention [5].
Various model- and data-based methods have been designed
C ONCERNS for the environment, health, and safety have
been attracting considerable attention worldwide due to
the new environmental challenges that threaten the planet. Air
to improve air pollution modeling and forecasting over the
past four decades [1], [6], [7]. Conventional time-series-based
models are among the widely utilized methods for air pollution
Manuscript received February 28, 2021; revised May 6, 2021; accepted forecasting in the literature [8]–[10]. These models comprise
May 25, 2021. Date of publication June 28, 2021; date of current version
July 9, 2021. This work was supported by the King Abdullah University autoregressive integrated moving average (ARIMA) and its
of Science and Technology (KAUST), Office of Sponsored Research (OSR) alternatives, such as seasonal-ARIMA [11] and Holt–Winters
under Award OSR-2019-CRG7-3800. The Associate Editor coordinating the models [12], [13]. Nevertheless, the forecast error in these
review process was Wilson Wang. (Corresponding author: Fouzi Harrou.)
Abdelkader Dairi is with the Computer Science Department Signal, Image methods is apparent when the concentration levels of pollu-
and Speech Laboratory (SIMPA) Laboratory, University of Science and tants exhibit irregular variations [14]. Also, the linear nature
Technology of Oran-Mohamed Boudiaf (USTO-MB), Bir El Djir 31000, of statistical models (e.g., AR and ARIMA) does not enable
Algeria.
Fouzi Harrou and Ying Sun are with the Computer, Electrical and Math- forecasting the nonlinear and nonstationary of ambient air
ematical Sciences and Engineering (CEMSE) Division, KAUST, Thuwal pollution accurately [15].
23955-6900, Saudi Arabia (e-mail: fouzi.harrou@kaust.edu.sa). To mitigate the weakness mentioned above, machine learn-
Sofiane Khadraoui is with the Department of Electrical Engineering,
University of Sharjah, Sharjah, United Arab Emirates. ing models, which are more flexible, such as neural network
Digital Object Identifier 10.1109/TIM.2021.3091511 forecasting and support vector machine, have been widely
1557-9662 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on August 15,2021 at 14:54:06 UTC from IEEE Xplore. Restrictions apply.
3520815 IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 70, 2021
employed in improving air pollution forecasting [16]–[18]. prevention. Deep learning has recently emerged as a promising
Machine learning models showed a suitable ability to model research line in modeling and forecasting time series data, both
the complicated relationship between process variables without in academia and industry [22]–[26]. Various deep techniques
the need for an explicit model formulation to be specified. have been applied in the literature to improve ambient air
Over the last two decades, several machine learning-based pollution forecasting. Chang et al. [27] considered an aggre-
methods have been applied for air pollution forecasting. For gated long short-term memory model (ALSTM) deep learning
instance, in [16], a wavelet-based neural network approach approach to improve air pollution forecasting. Essentially,
has been proposed for forecasting one step-ahead hourly, this forecasting approach consists of aggregating three LSTM
daily mean, and daily maximum concentrations of ozone (O3 ), models into a predictive model using information obtained
sulfur dioxide (SO2 ), carbon monoxide (CO), nitrogen oxides from nearby industrial air quality stations and external sources
(NO), nitrogen dioxide (NO2 ), and dust particles (PM2.5). of pollution. Results proved the outperformance of this aggre-
In particular, by using maximum overlap wavelet transform gated deep learning method compared to LSTM, support
(MODWT), this approach decomposes each time series of vector machine (SVR)-based regression, and gradient boosted
each air pollutant into different time-scale components. Then, tree regression (GBTR) in predicting PM2.5 concentrations.
the Elman network is applied to these time-scale components. In [28], the convolutional-based bidirectional gated recurrent
In this study, the wavelet network approach is used for one- unit (CBGRU) method is applied to forecast PM2.5 concen-
step-ahead forecasting. In [19], a hybrid method merging the tration levels. This approach combines both 1-D convolutional
advantages of the empirical wavelet transform (EWT), multi- neural networks’ desirable features and bidirectional gated
agent evolutionary genetic algorithm (MEGA), and nonlinear recurrent unit neural networks. This approach showed good
autoregressive network with exogenous inputs (NARX) is forecasting performance of PM2.5 concentration compared
introduced for enhancing multistep air pollutant concentrations to SVR, gradient boosting regressor (GBR), LSTM, GRU,
forecasting. Specifically, the air pollutant series are decom- and bidirectional GRU. In [29], LSTM optimized using a
posed via the EWT, and then, the optimized NARX neural particle swarm optimization algorithm is applied for predicting
network by the MAEGA model is applied to forecasting air ambient air pollutants concentrations (PM2.5, PM10,NO2 ,
pollutant concentrations. Results showed the outperformance CO, O3 , and SO2 ). In [10], an approach combining RNN
of this model compared to conventional models when applied models with LSTM (RNN + LSTM) is proposed to predict
for forecasting PM2.5, SO2 , NO2 , and CO concentrations PM10 particles in different places in the city Skopje. Results
levels in Beijing, China. However, these methods are not suited show that using both meteorological and air pollution mea-
for real-time forecasting because wavelet transform needs surements enhances LSTM and RNN + LSTM models’ fore-
batch data. In [20], an ANN model has been combined with casting accuracy. Also, it has been shown that the combined
numerical models to improve the prediction of daily concentra- RNN + LSTM models consistently outperform the ARIMA
tions of air pollutants, including SO2, NO2, and PM10, using approach.
meteorological variables. Similarly, Arhami et al. [17] intro- Bui et al. [30] and Kim et al. [31] proposed a method for
duced a hybrid forecasting model to forecast hourly pollutant air pollution prediction from historical time series pollutant
levels (i.e., NO2 , NO, O3 , CO, and PM10 ) based on artificial data and meteorological data using RNN and LSTM models.
neural networks (ANN) combined with uncertainty analysis Another related work [32] utilizes an LSTM model to forecast
by Monte Carlo simulations (MCS). This hybrid model uses the 8-h moving average concentrations of ozone, where results
several selected input meteorological variables for improved obtained showed forecasts with low error. Deep learning
forecasting. It has been shown that the combination of ANN models with RNN, LSTM, and GRU architectures are applied
with MCS offers a promising tool for air pollution predictions. for forecasting air quality based on the AirNet dataset that
In [18], at first, a genetic algorithm and a linear method of includes both meteorological time series and air quality data
stepwise fit are applied to select the relevant features from [33]. The analysis examined in [33] showed that the GRU
the prediction point of view. Then, the random forest (RF), model slightly outperforms the RNN and LSTM architectures
the multilayer perception (MLP), the radial basis function for PM10 concentration prediction. A deep learning approach
(RBF), and the support vector machine (SVM) are applied is proposed in [34] and [35] for the air pollution prediction
to the selected features for daily forecasting of atmospheric in South Korea, where a stacked autoencoder (SAE) model is
pollutants NO2 , O3 , SO2 , and PM10 . It has been shown that utilized for learning and training data. An LSTM model-based
the features selection step with an ensemble of predictors approach is proposed in [36] to predict air pollutant con-
enables improved forecasting quality of atmospheric pollution. centrations. This forecasting approach presented in [36] is
In [21], an approach to predict air concentration levels of capable to automatically extract useful features, such as the
air pollutants (i.e., NO2 , NOX, SO2 , O3 , and PM2.5) is pro- spatiotemporal correlations within air pollutant concentrations.
posed using based on sparse response backpropagation training The spatiotemporal deep learning (STDL) architectures have
feedforward neural networks (called FFANN-SRBP). This also been used for spatiotemporal prediction of air quality.
approach outperformed the multiple linear regression (MLR) A combination of a convolutional neural network and LSTM
and FFANN based on backpropagation (FFANN-BP) in terms model is proposed in [37] to forecast air quality up to
of prediction precision. 48 h and extracts the spatial–temporal relations. Chiou-Jye
Precise air pollution forecasting provides relevant informa- and Ping-Huan [38] presented a comparative study of the
tion about the future pollution evolution, which is crucial for performance of CNN-LSTM model with traditional machine
effective air pollution monitoring and assists in planning for learning models in terms of their ability to forecast PM2.5
DAIRI et al.: INTEGRATED MULTIPLE DIRECTED ATTENTION-BASED DEEP LEARNING 3520815
Fig. 1. Schematic diagram of the overall forecasting strategy.
concentration, where the obtained results showed that the ConvLSTM, as well as LSTM and GRU with the
CNN-LSTM model provides the lowest root mean square attention mechanism (termed LSTM-A and GRU-A).
error (RMSE) and mean absolute error (MAE). A CNN-GRU The following section presents the preliminary material
model proposed in [39] is applied to forecast three air pollu- needed in this study and briefly introduces the IMDA-VAE
tants (PM2.5 , PM10 , and O3 ) of monitoring stations over 48 h. forecasting methodology. In Section II, the performances of
In this article, we propose an innovative deep learning model the considered methods are illustrated using air pollution data
to improve the forecasting quality of concentrations levels of from four US states. Finally, in Section IV, we conclude this
ambient air pollutants (NO2 , O3 , SO2 , and CO) based on a study and sheds light on potential future research lines.
wind attention mechanism and variational AE (VAE) deep
model. The contribution of this article is threefold. II. M ETHODOLOGY
1) The first contribution is mainly related to the devel- In this section, we briefly describe the basic concept of
opment of an innovative integrated multiple directed vanilla AEs and VAEs. Afterward, we briefly describe how
attention deep learning architecture (called IMDA-VAE) self-attention mechanisms work. We then introduce the pro-
based on the traditional VAE and the attention mech- posed IMDA-VAE deep learning-driven method. The overall
anism. To the best of the authors’ knowledge, this schematic of the proposed forecasting strategy is presented
is the first study introducing the traditional VAE and in Fig. 1.
the attention mechanism to accurately forecast various
air pollutants concentrations and leverage air pollution A. Variational Autoencoder
complexity. Essentially, the proposed method extends
the capability of the traditional VAE to model temporal To better understand the VAE, we first present a short
dependencies by introducing the self-attention mech- description of the AE. The AE comprises three principal
anism at a multilevel of the VAE model. The pro- elements, namely the encoder, the latent space, and the decoder
posed flexible modeling framework permits exploiting (see Fig. 2). The encoder receives input data, pollutants’
the suitable performance of VAE in time-series modeling concentration time series, and projects it into the latent space
and flexible nonlinear approximation and the focus on employing neural networks. Crucially, the pertinent informa-
the relevant features via the attention-based mechanism. tion is stored in the latent space, and the aim is to optimize
Exploiting all these sophisticated statistical tools is how data can be scattered in this latent space. On the other
advantageous in the sense that it has the potential to hand, the decoder represents a mirror of the encoder because
improve short-term forecasting of ambient air pollution it attempts to reconstruct the compressed latent space’s input
time-series data. data. In short, we can summarize the AE procedure using this
2) The second contribution lies in the validation of the chain of events
proposed method through three different forecasting encoder(x;θe ) decoder(z;θd )
x ⇒ z ⇒ x̂
experimentations: univariate, multivariate with a single
output, and multivariate with multiple outputs. where θe and θd , respectively, denote parameterizations of the
3) Finally, the third contribution is the comparative study encoder and decoder networks.
of the proposed forecasting model and some powerful The AE aims to give a dimensionality reduction procedure,
recurrent neural network models performed in this work which learns to encode the data such that a distance metric
to show and evaluate their performance and capabilities loss is optimized. Essentially, the more effective the encoder is
in forecasting air ambient pollutants. Air pollution data learning data compression, the more accurate the reconstructed
from four US states were used in the experiments data will be. A traditional and straightforward loss function
and assessment of the outputs’ deep learning-driven utilized for the AE is the mean squared error, which consists
forecasting methods. The results demonstrate that the of the squared deviation between the input, X, and output
designed IMDA-VAE method offers satisfying fore- data, X̂
casting performance of different air ambient pollu-
1
n
tants and outperformed other deep learning models, ( X̂ − X)2 . (1)
including GRUs, LSTM, BiGRU, Bidirectional LSTM, n i=1
Fig. 2. Basic illustration of the AE.
VAE is a generative model having similar architecture to

the vanilla AE. Note that an AE cannot generate new data
because it is principally based on encoding input data into
discrete values in the latent space. Accordingly, this procedure Fig. 3. VAE architecture.
limits the AE only on memorizing the input data without the
possibility for data generation. VAE is a generative model
having similar architecture to the vanilla AE. Note that an measurements, which allows us to reduce the dimensionality
AE cannot generate new data because it is principally based of original data and obtain low-dimensional data. The encoder
on encoding input data into discrete values in the latent space. is generally derived by a posterior approximation of qθ (z|x),
Accordingly, this procedure limits the AE only on memorizing whereas the decoder is obtained with a likelihood pφ (x|z),
the input data without the possibility for data generation. where θ and φ are the parameters of the encoder and the
To alleviate this limitation, the VAE enables encoding the input decoder, respectively.
data into a sampling layer, making the latent space continuous. It should be mentioned that the loss function has a sig-
In this way, the decoder of the VAE can generate new and nificant effect on the feature extraction for training VAE.
realistic data based on the learned features from the latent Suppose that Xt = [x 11 , x 2t , . . . , x N t ] is the vector of the input
space. data points of the VAE at time instant t and Xt is the data
VAE integrates the desirable features of the variational infer- reconstructed by the VAE model. For maximum-likelihood
ence and AE, which enables efficient extraction of releveant learning of parameters, we may write [41]
low-dimensional and hidden features in raw data. VAEs are log pφ (x ) = DKL [qθ (z|x) pφ (x)] + L(θ, φ; x) (2)
one of the most powerful and effective class of deep generative
models [40]. Basically, this is mainly due to their important where DKL [.] is the Kulback–Leibler divergence and L denotes
advantages and ability to extract low-dimensional and relevant the likelihood of the encoder and decoder parameters θ and
features from raw data in an unsupervised way [41], that φ. Hence, the loss function is expressed as
is, VAE models are trained to learn representations from
complex without the need for data labeling. It is very worth L(θ, φ) = Ez∼qθ (z|x) (logp(x |z)) − DKL (qθ (z|x)||pφ (z)) . (3)

mentioning that VAEs offer fundamental properties of dimen- Reconstruction term Regularization term
sionality reduction, which makes them useful for transforming
high-dimensional input data into compressed while preserving The main objective is the design of a suitable VAE
the essential properties of the original representation. VAEs model for which the reconstruction loss function converges
can be utilized for complex probability distribution approxima- to zero. The strength of the decoder’s ability to learn data
tion using the main results of stochastic gradient descent [40]. reconstruction is achieved by the reconstruction term given
Unlike the traditional AEs, VAEs overcome the commonly in (3). The second term of regularization in (3), which
known overfitting problem by introducing a regulation mecha- defines the Kulback–Leibler divergence, aims at separating the
nism in the training process. The regularization term improves encoder distribution function (qθ (z|x)) and the latent variable
the capability of the generative models to sample data points (z, |pφ (z)). To minimize the loss function with respect to the
using learned data distribution represented in the latent space. encoder and decoder parameters, one can use the gradient
Fig. 3 shows the VAE architecture, which shows that the VAE descent procedure in the training stage. The main reason
is built using a structure consisting of an encoder and a decoder behind minimizing the loss function is to obtain a regular latent
(i.e., two neural networks). space, z, as well as a satisfactory sampling of the new data
The VAE encoder aims at learning the latent variable points using z ∼ pφ (z) [41].
z through measured data acquired by sensors, whereas the Now, consider pφ (z) = N (z; 0, I ), and thus, qθ (z|x) can
decoder utilized the latent variable obtained z to ultimately be expressed as
reconstruct the input measurements. The residuals between logqθ (z|x) = logN (z; μ, σ 2 I ) (4)
the original input measurements and the reconstructed data
obtained from the decoder should always be as much as where μ and σ denote the mean and standard deviation of the
possible close to zero. The latent variable z provided by the approximate posterior, respectively. The latent space z is built
encoder is utilized as a feature extractor applied to the input based on a deterministic function g parameterized by φ and
Algorithm 1 VAE Training Algorithm vector V is expressed at time t as follows:

Input: : Training dataset X = {x , . . . , x }
1 k
Vt = αt ht (7)
Output: : {θ, φ} t
1 θ : Encoder parameters; where h t represents the hidden states provided by the model
2 φ: Decoder parameters; that feeds the attention model, and a recurrent network is
3 M: number of mini-batch (drawn from full dataset); usually utilized. The term αt denotes the normalized attention
4 {θ, φ} ←− Initialize model parameters randomly; model weights computed as follows:
5 Ol : is the output of the l variable layer;
6 U pdateModel Pr ameter s: Admam optimizer has been αt = softmax(et ). (8)
used [42]; et refers to the attention model weights (also called align-
7 repeat ment score), which is computed via a feedforward neural
8 X m ←− Random Mi ni batch(X, M); network [52], conditioned on the previous hidden state h t−1
9 Ointermediate ←− Layerintermediate (X m );
10 Oσ = Layerσ (X m ); et = σ (Wa ht−1 + ba ) (9)
11 Oμ = Layerμ (X m ); where (Wa , ba ) are weights matrix and bias vector of the atten-
12 Draw samples from ∼ N (0, 1); tion model computed during the training. Indeed, the attention
13 Oz = f φ (X m , ε) = Oμ + Oσ ε (Equation (5)); model context vector V is a dynamic representation of the
14 G = L(θ, φ, X m ) (Equation (6)); relevant part of the time-series input at time t. It is viewed
15 {θ, φ} ←− Update parameters using gradients via a as a weighted sum that highlights the importance of some
gradient-descent algorithm; data position in the sequence using normalized attention model
16 until convergence of parameters {θ, φ}; weights that can be interpreted as a probability. Attention
enables the model to focus on essential pieces of the feature
space; in other words, it allows the learning to emphasize
a particular area of the data sequence through reviewing its
an auxiliary noise variable ε ∼ p(ε), i.e., ε ∼ N (0, I )
memory during the prediction time.
z = gφ (x, ε) = μ + σ ε. (5) It is worth pointing out that there are two types or modes
of attention: additive [52] [see (9)] and multiplicative [54].
The reconstruction error can be written out as The main difference between them is how to compute the
1 alignment score
L(θ, φ, x) = (1 + log((σi )2 ) − (μi )2 − (σi )2 )
2 i et = Wa · ht . (10)
1
L
Additive attention utilizes a linear combination of hidden states
+ log( pφ (x|z(l) )) (6)
L through one neural network layer, specifically a feedforward
l=1
with by default hyperbolic tangent (tanh) nonlinearity as an
where denotes the element-wise product. activation function. On the other hand, multiplicative attention
The VAE parameters are optimized iteratively based on a computes the attention scores by reducing the hidden states
gradient descent algorithm with the Adam optimizer [42]. For using matrix multiplications; other activation functions with
more details about VAE, see [43]. The VAE is trained via the learnable parameters can be applied to the dot product.
procedure given in Algorithm 1.
It is worth noting that VAE has been used recently C. Self-Attention Mechanism
to forecast time-series data in many applications, includ- The self-attention mechanism improves the attention mech-
ing solar power production forecasting [44], traffic flow
anism by reducing external information dependence (source to
forecasting [45], stock market prediction [46], and fore-
target) and capturing the internal correlation of input data [55].
casting of COVID-19 spread [47], [48]. It has also An essential characteristic of self-attention is its flexibility to
demonstrated good performance in anomaly detection and
be applied to any layer representing a data sequence such as
classification [49]–[51]. a time series, which improves the internal input structure’s
learning by focusing on the relationship between elements of
B. Attention Mechanism the same input (sequence). Furthermore, self-attention is based
The attention mechanism’s key idea is to try mimicking on the same principle of the attention mechanism. Specifically,
human behavior by focusing on some particular areas. For the model generates a new representation of the features space
example, when looking at an image, the brain pays more atten- through a weighted sum of extracted features using only one
tion to a specific region of interest. The attention mechanism data sequence as input compared to the attention mechanism.
was inspired from the human visual process and adapted to Indeed, the self-attention data processing starts by computing
neural machine translation first [52] and images processing the weights (called the score) between data point in position
[53]. Specifically, during the training phase, the primary pur- i and j for a given input data sequence X as follows [55]
pose is to concentrate on specific features through a weighted (Wa Xi )T (Wa Xj )
Ei j = √ (11)
sum approach represented by a context vector. The context d
where Wa is the weight matrix of the self-attention

model computed during the training and d is the dimen-
sion of (Wa Xi ), and the division by d makes the
convergence faster. Normalization of the weights Ei j
is performed to represent them as a probability (sum
of all weights values equals to 1) using a softmax
transformation
exp(Ei j )
Ai j = softmax(Eij ) = . (12)
j exp(Ei j )
The final self-attention model output is expressed as

n
Oi = Aij (Wa Xi ). (13)
j =1 Fig. 4. Proposed approach: VAE with attention.
This output enhances the quality of the extracted features

effectively and describes, more specifically, the internal corre- pollutants dynamics by adding information and improving the
lation of the input elements. learning quality by reducing error and penalizing the loss
function.
Note that the L2 regularization is used for weights nor-
D. Integrated Multiple Directed Attention-Based Variational malization, while bias-regularizer: L1 bias extenuation [56].
Autoencoder Moreover, the regularization process is used to guarantee the
This article introduces an integrated multiple directed weighted sum diversification. The continuous representation
attention-based deep learning (called IMDA-VAE) based on and the context vector (weighted sum) feed the covariance
the traditional VAE and the attention mechanism for improved matrix σ and the mean μ of the regularized data distribu-
ambient air pollution forecasting. The proposed IMDA-VAE tions. Furthermore, the regularization is applied to encour-
approach extends the VAE model’s capability and improves aging the distributions to be closest to a standard normal
the forecasting accuracy compared to the unidirectional and distribution and enforce the covariance matrix to be close
bidirectional recurrent neural network models. Specifically, to the identity. Finally, the latent space is constructed after
we introduced the self-attention mechanism at multilevels to duel self-attention transformation of the regularized mean and
the encoder part of the traditional VAE (see Fig. 4). Indeed, variance, and a set of data points is sampled from the latent
attention as nonlinear transformation improves modeling and space [see 3] to be reconstructed by the decoder model. In the
forecasting quality using weighted features vector. Moreover, proposed approach, the decoder is a deep fully connected
the attention’s technique was usually incorporated in the neural network; it represents the reverse path used to train
decoder side [52], [53] to map the sequence of images to their the encoder. Here, the loss is measured via Kullback–Leibler
text description sequence (as caption). Indeed, the forecasting (KL), representing the divergence between the training data
problem can be viewed as a sequence to sequence mapping; and learned probability distribution. Furthermore, KL is a
in the univariate case, the models learning aims to map a key to monitor the model parameters convergence, where
sequence of pollutant measurements to the next concentration usually once its value decreases (close to zero) and stabilizes,
value of a given pollutant, while multivariate cases, especially the training can be stopped. During training, the reconstruction
with multiple outputs, map long sequences, including all error is backpropagated, and the model parameters are updated
pollutants measurements, to forecast short sequences. This accordingly. Once the training is done, the latent space is used
principle of recurrent neural networks is based on one-step to forecast the next values (measurement) of the pollutants in
supervised learning. However, VAE is a composite model question depending on the configuration used (univariate or
composed of an encoder and decoder, where the main is to multivariate)
learn the approximation of the training dataset probability In contrast to the recurrent models, the proposed approach
distribution in an unsupervised approach. It is expected that training is performed in an unsupervised manner. Specif-
the incorporation of the robust variation inference approach, ically, the model learns first the data distribution of the
an efficient regularization with an enforced attention mecha- considered pollutant and approximates it using the model
nism, will improve the univariate and multivariate forecasting parameters to sample new data points from a features space
performances. (also called the latent space); the data points shared the same
The historical pollutants measurements are processed first features of real data points. Moreover, the proposed approach
through a nonlinear transformation based on a dense layer model parameters are adjusted and optimized through a
(called intermediate) into a continuous representation. Further- fine-tuning process based on supervised learning, aiming to
more, this new representation is passed through a self-attention learn the mapping of a given data sequence of pollutant
layer that emphasizes the internal correlation within the pollu- measurement to its next value. Notably, the forecasting is
tants’ time series via computing a context vector, representing done at the latent space level (encoding space). In other
a weighted sum of features. Moreover, regularization aims words, the last layer of the IMDA-VAE acts as a forecasting
to prevent the overfitting problem during the learning of layer.
TABLE I
S TATISTICS S UMMARY OF THE A RIZONA A MBIENT
A IR P OLLUTION D ATASETS
III. E XPERIMENTS AND R ESULTS

This section describes the data used in this article, with full
details about experimental settings and execution procedure.
Moreover, we provide an analysis of obtained results and
discussion.
A. Data Description
The dataset used in this study was collected by the United
States Environmental Protection Agency, in which the con-
centration level of several ambient pollutants is recorded
daily in different states. The datasets are publicly accessi-
ble on the website “https://www.epa.gov/outdoor-air-quality-
data/download-daily-data.” The pollutants measurements were
collected during a period of 16 years exactly (2000–2016).
This study focuses on four significant pollutants, namely nitro-
gen dioxide (NO2 ), sulfur dioxide (SO2 ), carbon monoxide
(CO), and ozone (O3 ). In our study, we picked measurements
from four locations, namely California, Arizona, Texas, and
Pennsylvania. It should be mentioned that each state contains
numerous air quality monitoring stations; however, we selected
one station per state for measurement collection.
The descriptive statistics of the Arizona ambient air pol-
lution datasets are listed in Table I. The skewness of the
normal distribution (or any perfectly symmetric distribution)
is zero, and the kurtosis of the normal distribution is 3. We can
Fig. 5. Monthly distribution of concentration levels of (a) NO2 , (b) CO,
conclude from Table I that the Arizona ambient air pollution (c) O3 , and (d) SO2 in California.
time-series datasets are non-Gaussian distributed with positive
support and exhibit different intervals of variability. California, Pennsylvania, and Texas. Fig. 6 shows the presence
Monthly distributions of concentration levels of the four of a weak negative correlation between O3 and its precursors
ambient pollutants from California are shown in Fig. 5(a)–(d). (NO2 and CO) and a moderate positive correlation between
Fig. 5(c) shows that the highest O3 concentration levels can precursors, NO2 and CO. Besides, distinct patterns regarding
be seen in the summer season due to the local photochem- SO2 and the other pollutants are obtained for each station.
ical production. Also, Fig. 5(a) and (c) shows that NO2 is Interpretation of the relation between pollutants is relatively
negatively correlated with the variation of O3 concentration difficult because of their dependence on several factors, includ-
levels. This fact is because O3 is formed through the photo- ing meteorological variables and the transportation phenom-
chemical destruction of nitrogen dioxide (NO2 ) under sunlight. ena. Specifically, sometimes pollution concentration can be
Similarly, we can see the presence of a negative correlation high due to transported pollution produced elsewhere in the
between O3 and CO because of photochemical production of region. Various studies have been reported in the literature
O3 principally from the oxidation of natural and anthropogenic to investigate the correlation between different ambient air
hydrocarbons, carbon monoxide (CO), and methane (CH4) by pollutants [57], [58].
hydroxyl (OH) radical in the availability of enough quantity of
NOx. Of course, O3 is adversely correlated with O3 precursors
(i.e., NO2 and CO) [57]. B. Experiments Settings
Fig. 6 shows the correlation coefficients between the To evaluate the performance of the proposed approach in
NO2 , CO, O3 , and SO2 recorded, respectively, in Arizona, forecasting air ambient pollutants, several experiments are
Fig. 6. Pairwise correlation matrix between ambient air pollutants (NO2 , CO, O3 , and SO2 ) obtained in (a) Arizona, (b) California, (c) Pennsylvania, and
(d) Texas.
conducted to show its superiority and efficiency through a learn how to predict the next value of a given pollutant
comparative study including powerful recurrent neural net- based on a sequence of measurements of the four other
work models, namely GRU, LSTM, BiGRU, BiLSTM, and pollutants. In the end, a forecasting model dedicated to
ConvLSTM. We also compared the performance of the a pollutant with multiple inputs is obtained.
proposed IMDA-VAE approach to that GRU and LSTM with 3) Multivariates Forecasting With Multioutput: This exper-
the attention mechanism (LSTM-A and GRU-A). Importantly, imentation can be seen as a one-shot task, where all
these models with attention are designed by stacking three pollutants are forecast based on all historical measure-
components. For instance, in LSTM-A, LSTM is used to ments of all pollutants. In the end, only one forecasting
model time dependencies and extract temporal features, fol- model for all pollutants with multiple inputs to forecast
lowed by the attention module that aims to improve and all pollutants next values is derived.
highlight relevant extracted features, and finally, we added a
forecaster layer represented by a fully connected layer. In a C. Measurements of Effectiveness
similar way to LSTM-A, we designed GRU-A. Here, both
LSTM-A and GRU-A are trained through supervised learning. To assess the forecast precision and compare the models,
We used the same settings (hyperparameters) used for all some validation metrics, such as coefficient of determination
models: 300 epochs, a learning rate of 0.0001, batch size (R 2 ), RMSE, MAE, explained variance (EV), mean absolute
of 250, 32 hidden units, Rmsprop optimizer, and binary cross percentage error (MAPE), mean bias error (MBE), and
entropy as a lost function. To demonstrate the advantage of relative MBE (rMBE), are used (Table II), where yt is the
the proposed IMDA-VAE compared to the traditional deep concentration level of a pollutant, ŷt is its corresponding
learning models, we consider the following experimentations forecast values, and n is the number of data points [59], [60].
in this study. The more precise forecasting is, the lower RMSE, MAE, MBE
and rMBE values and high R 2 , EV, and MAPE values are.
1) Univariate Forecasting: In this experiment, each pollu-
tant is modeled and forecast individually, which means
that five models are trained to forecast the next values D. Results Analysis
based on only the measurement of the pollutant in The first set of experiments aims to analyze the performance
question. of the investigated models in forecasting the concentration
2) Multivariate Forecasting With One Output (Prediction): levels of the four pollutants separately. Toward this end, in this
The forecasting of each pollutant is done using all univariate forecasting, each model is trained in a supervised
other pollutants, which means that the training aims to way to learn temporal dependencies included in time-series
TABLE II TABLE III

D EFINITION OF M EASUREMENTS OF E FFECTIVENESS P ERFORMANCE C OMPARISON OF THE P ROPOSED U SING P OLLUTION
T ESTING D ATASETS F ROM C ALIFORNIA
data measurement of each pollutant. Indeed, the objective is to

learn the prediction of the next value of a given measurement
sequence. This study has been implemented using a fast
algorithm-based CPU Intel i7 with 12Go RAM. We used
Python 2.7.0 with Keras 2.3And TensorFlow 2.0 under Ubuntu
18 LTS. The training set consists of daily concentration levels in Fig. 7 for NO2 , CO, O3 , and SO2 from Arizona. Results
of five ambient pollutants (NO2 , CO, O3 , and SO2 ) from from the other stations are omitted because they give rela-
January 2000 to December 2014. At first, we normalize tively similar results. Fig. 7 shows relatively narrower bands
the training data, e, by min-max normalization within the around the actual observations, which indicates a good forecast
interval [0, 1] and then used it for models construction. The quality. One exception is from SO2 time-series data, which
normalization is performed as follows: indicates the deviation of deep learning model forecasts. From
Fig. 7, the proposed IMDA-VAE approach shows a good
(y − ymin )

y= (14) ability in forecasting the future trends of SO2 concentration
(ymax − ymin ) dynamics.
where ymin and ymax denote the minimum and maximum of Tables III–VI list the obtained validation metrics of test-
the original data, respectively. This procedure is reversed after ing data from California, Arizona, Texas, and Pennsylvania,
the forecasting process. respectively. In terms of all metrics calculated, IMDA-VAE
Here, the k-fold cross-validation technique with K = 10, is the best approach for this univariate time-series forecast-
as recommended in [61] and [62], is adopted in this study ing problem with high efficiency and satisfying accuracy.
to build the forecasting models. Using the training datasets The proposed IMDA-VAE approach outperforms the recurrent
in these experimentations, the set of hyperparameters is fixed models (i.e., GRU, LSTM, BiGRU, ConvLSTM, LSTM-A,
for all models used, namely optimizer = “rmsprop,” loss and GRU-A) on forecasting all investigated pollutants (i.e.,
function = “Cross Entropy,” batch size = 250, epochs = NO2 , O3 , SO2 , and CO) measured from four locations, namely
200, and learning rate = 0.001. For RNNs hidden units, California, Arizona, Texas, and Pennsylvania. As expected,
we used hidden units = 32, while in the IMDA-VAE model the proposed IMDA-VAE approach achieved the lowest fore-
for all layers (intermediate, mean, variance), 12 is used as casting errors (RMSE, MAE, MBE, and RMBE) and the
the number of hidden units. The proposed model contains highest score of R2 and EV. It could be attributed to its
nine layers, its configuration in terms of hidden units per capacity to model time dependencies and select relevant fea-
layer
3, 12, 12, 15, 15, 15, 15, 3, 1. This configuration was tures using the attention mechanism. We should highlight that
determined using the grid search method; we also used for the performance of Bidirectional recurrent networks, namely
the training a batch size = 250, learning rate = 10−3 , number the BiLSTM and the BiGRU, is superior to the unidirectional
of epochs = 200, optimizer: Rmsprop, and loss function: models represented by LSTM and GRU. This is mainly due
binary cross entropy. For the fine-tuning step, the standard to the ability of bidirectional models in processing data in two
backpropagation algorithm has been used to adjust model directions: forward and backward.
parameters [63]–[65]. Table III shows the performance comparison of the con-
The testing period is from January 1 to November 30, sidered models based on California pollutants measurements.
2015. For a visual illustration, the observed test set together The proposed approach has accounted for more than 95% of
with model forecasts using the five studied models is charted the variability of NO2, which is the best R2 and EV, while the
TABLE IV
P ERFORMANCE C OMPARISON OF THE P ROPOSED U SING P OLLUTION
T ESTING D ATASETS F ROM T EXAS
TABLE V
P ERFORMANCE C OMPARISON OF THE P ROPOSED U SING P OLLUTION
T ESTING D ATASETS F ROM P ENNSYLVANIA
Fig. 7. Records and forecasts using the five models of (a) NO2 , (b) CO,
(c) O3 , and (d) SO2 from Arizona.
lowest RMSE = 9.484, MEA = 7.153, MBE = 0.122, and

RMBE = −2.763 were recorded. Also, it can be seen that
the bidirectional data processing performed by BiLSTM and whereas the other models recorded MPAE = 8%. Note that
BiGRU model is better time dependencies when forecasting a similar performance was recorded by the considered mod-
NO2 (i.e., R2 = 93% and EV = 94%) for R2 and EV els when forecasting CO concentration levels. On the other
compared to unidirectional recurrent models such as LSTM hand, the validation metrics of SO2 forecast values from the
and GRU (R2 = 89% and EV = 91%). proposed indicate a good forecast quality by achieving an
Similar conclusions hold true also for forecasting con- R2 of 0.96 and low forecasting error (i.e., MAPE = 6%,
centration levels of O3 , and around 95% of the variability RMSE = 0.59, and MAE = 0.47). Bidirectional (LSTM and
was accounted for by using the proposed approach and fol- GRU) models perform better than unidirectional models by
lowed by BiLSTM with 94%, while the remaining models reaching an MPAE of 9% for BiLSTM and 11% for BiGRU.
scored 93%. Here, also, MPAE = 7% was recorded by Similar conclusions are obtained when using data from the
the proposed approach demonstrating its high performance, other states.
TABLE VI TABLE VIII

P ERFORMANCE C OMPARISON OF THE P ROPOSED U SING P OLLUTION P ERFORMANCE C OMPARISON OF THE P ROPOSED A PPROACH
T ESTING D ATASETS F ROM A RIZONA U SING NO 2 P OLLUTANT
TABLE VII
P ERFORMANCE C OMPARISON OF THE P ROPOSED A PPROACH
U SING CO P OLLUTANT TABLE IX
P ERFORMANCE C OMPARISON OF THE P ROPOSED A PPROACH
U SING SO 2 P OLLUTANT
The following experiments are devoted to a comparative

assessment for predicting concentration levels of a single
pollutant by using the remaining pollutants as multiple inputs error on all experimentations. It is also interesting to see
as presented in Section III-B. Results in terms of R2, EV, better outcomes from the bidirectional models (BiLSTM and
and MAPE metrics are presented in Figs. 8–11. Results BiGRU) than unidirectional models (LSTM and GRU). Results
confirm the superior performance of the proposed IMDA-VAE confirm that the IMDA-VAE approach clearly outperforms
approach again compared to the traditional VAE without the LSTM-A and GRU-A models. Overall, the results tes-
attention and the other investigated deep learning models tify the superior performance of the IMDA-VAE approach
by achieving the highest (R2, EV) and the lowest mean (see Tables VII–X).
TABLE X TABLE XI
P ERFORMANCE C OMPARISON OF THE P ROPOSED A PPROACH M ULTIVARIATE W ITH M ULTIOUTPUT P ERFORMANCE R ESULTS U SING
U SING O 3 P OLLUTANT P OLLUTION D ATASETS F ROM C ALIFORNIA
Fig. 8. Validation metrics for multivariate forecasting of CO concentrations.
model, forecasting several pollutants can be obtained simul-

taneously compared to univariate forecasting that requires a
model for each time-series data. However, multivariate fore-
casting is relatively challenging because the cross correlation
among variables and time dependencies of multiple variables
Fig. 9. Validation metrics for multivariate forecasting of SO2 concentrations.
need to be modeled. Here, an important variable that can
impact the forecasting accuracy is the timestep, which consists
of the amount of data (in days) used to look back from the past
to predict the following values. In this experiment, we evaluate
the forecasting performance using different timestep values: 3,
6, 9, 12, and 15. Table XI shows the comparative forecasting
results of the considered models. We can see that our model
recorded the best score with the highest (R2, EV) for all
Fig. 10. Validation metrics for multivariate forecasting of NO2 concentra-
tions.
experiments, and especially for timestep = 15, it records 0.9,
which is a good performance for multivariate forecasting of
The final experiment aims to assess the potentials of the all pollutants.
IMDA-VAE technique in forecasting several pollutants simul- Furthermore, in this study, we compared the time cost
taneously (i.e., multivariate forecasting) by using historical between the proposed method with attention and without
data from all pollutants. Here, we used only ambient pollution attention. As expected, the traditional approach without atten-
data from California, and datasets from other stations are tion is less time-consuming than the approach with attention
omitted because they give relatively similar results. The major due to computation cost related to the attention mechanism.
advantage of multivariate forecasting is that by using only one More specifically, when conducting the experiments using an
the high ability of deep hybrid model with attention to

model temporal dependencies in unsupervised learning with-
out complex recurrent networks gating and memory mecha-
nism; variational inference approximation exhibits promising
performance in time-dependent modeling. Besides, univariate
forecasts showed better accuracy than multivariate forecasting
in this setting. This is mainly due to the absence of high cross
Fig. 11. Validation metrics for multivariate forecasting of O3 concentrations. correlation between the four studied pollutants.
ordinary laptop (CPU Intel i3), the average execution time Despite the adequate forecasting results of ambient air
in second for the VAE and IMDA-VAE is 0.0076 and 0.0227, pollution obtained using the proposed IMDA-VAE model,
respectively. Thus, the average time allocated to the attention the work presented in this study guides future works. Since
mechanism is approximately around 15 ms in this case. On the pollution measurements may contain noisy features with time
other hand, when conducting the experiment using PC with and frequency contributions, we plan to enhance the pro-
an Intel i7 CPU and equipped with a GPU, both approaches posed IMDA-VAE-based forecasting model by developing
have an average execution time of less than 10−4 s. In short, a multiscale IMDA-VAE model that combines IMDA-VAE
when using time series, the attention mechanism is not techniques with wavelet-based multiresolution representation.
time-consuming. Another direction for future improvement is to incorporate
In summary, the overall forecasting results demonstrate the explanatory variables, such as meteorological measurements,
high ability of the VAE based on robust variational inferences in constructing the deep learning models. Furthermore, it will
to approximate data probability distribution of a given pol- be interesting to design an early detection system of abnormal
lution time-series, with a self-attention mechanism incorpo- pollution to foster reactive control, enabling avoiding exposi-
rated at multilevel to highlight and emphasize the correlation tion to abnormal pollution with high concentrations.
between data points of a given sequence. It has been shown
that the variational inference with attention units improves R EFERENCES
time dependencies modeling and univariate and multivariate [1] F. Harrou, F. Kadri, S. Khadraoui, and Y. Sun, “Ozone measurements
monitoring using data-based approach,” Process Saf. Environ. Protec-
forecasting without recurrent connections or memory cells. tion, vol. 100, pp. 220–231, Mar. 2016.
It should be highlighted that the best forecasting performance [2] S. Ali, T. Glass, B. Parr, J. Potgieter, and F. Alam, “Low cost sensor with
of the proposed IMDA-VAE approach has been obtained for IoT LoRaWAN connectivity and machine learning-based calibration for
univariate forecasting. Also, it has been shown that univariate air pollution monitoring,” IEEE Trans. Instrum. Meas., vol. 70, pp. 1–11,
2021, doi: 10.1109/TIM.2020.3034109.
forecasting outperformed multivariate forecasting in this set- [3] W. Hernandez, A. Mendez, R. Zalakeviciute, and A. M. Diaz-Marquez,
ting. This is mainly due to the absence of high cross correlation “Analysis of the information obtained from PM2.5 concentration mea-
between the considered pollutants. This may be improved surements in an urban park,” IEEE Trans. Instrum. Meas., vol. 69, no. 9,
pp. 6296–6311, 2020.
by considering meteorological variables, such as temperature [4] K. Gu, Z. Xia, and J. Qiao, “Stacked selective ensemble for PM2.5
and pressure, which facilitate the description of pollution forecast,” IEEE Trans. Instrum. Meas., vol. 69, no. 3, pp. 660–671,
dynamics, particularly for SO2 . The proposed IMDA-VAE Mar. 2020.
[5] D. Antanasijević, V. Pocajt, A. Perić-Grujić, and M. Ristić, “Multiple-
approach can be used for online forecasting due to its simple input–multiple-output general regression neural networks model for
architecture; only the encoder part is used for the forecasting, the simultaneous estimation of traffic-related air pollutant emissions,”
and data are processed in only one direction. Atmos. Pollut. Res., vol. 9, no. 2, pp. 388–397, Mar. 2018.
[6] F. Harrou, L. Fillatre, M. Bobbia, and I. Nikiforov, “Statistical detection
of abnormal ozone measurements based on constrained generalized like-
IV. C ONCLUSION lihood ratio test,” in Proc. 52nd IEEE Conf. Decis. Control, Dec. 2013,
Air pollution is a global issue, with most regions of the pp. 4997–5002.
[7] F. Harrou, A. Dairi, Y. Sun, and F. Kadri, “Detecting abnormal ozone
globe affected by concentrations known to have adverse health measurements with a deep learning-based strategy,” IEEE Sensors J.,
outcomes. The fast evolution of industrial technology has vol. 18, no. 17, pp. 7222–7232, Sep. 2018.
induced various adverse environmental impacts. Monitoring [8] M. Abhilash, A. Thakur, D. Gupta, and B. Sreevidya, “Time series
analysis of air pollution in Bengaluru using ARIMA model,” in Ambient
the ambient air quality is essential to achieve acceptable Communications and Computer Systems. Singapore: Springer, 2018,
air quality. This work presented a novel deep hybrid model pp. 413–426.
by introducing an attention mechanism to the VAE (called [9] U. Kumar and V. K. Jain, “ARIMA forecasting of ambient air pollutants
(O3 , NO, NO2 and CO),” Stochastic Environ. Res. Risk Assessment,
IMDA-VAE) to improve air pollution forecasting. In this vol. 24, no. 5, pp. 751–760, Jul. 2010.
study, we proved the efficiency of the proposed approach [10] M. Arsov et al., “Short-term air pollution forecasting based on envi-
for univariate and multivariate forecasting of air pollution ronmental factors and deep learning models,” in Proc. Federated Conf.
Comput. Sci. Inf. Syst., Sep. 2020, pp. 15–22.
time-series data. Results showed that the proposed IMDA-VAE [11] M. H. Lee, N. H. Rahman, M. T. Latif, M. E. Nor, and N. A. Kamisan,
model provides more accurate forecasting of concentrations of “Seasonal ARIMA for forecasting air pollution index: A case study,”
four principal pollutants (i.e., NO2 , O3 , SO2 , and CO) than Amer. J. Appl. Sci., vol. 9, no. 4, p. 570, 2012.
unidirectional and bidirectional recurrent networks, namely [12] L. M. B. Ventura, F. de Oliveira Pinto, L. M. Soares, A. S. Luna, and A.
Gioda, “Forecast of daily PM2.5 concentrations applying artificial neural
VAE, Gated GRUs, LSTM, BiGRU, BiLSTM, ConvLSTM, networks and Holt–Winters models,” Air Qual., Atmos. Health, vol. 12,
LSTM-A, and GRU-A. The forecasting accuracy has been no. 3, pp. 317–325, Mar. 2019.
evaluated by six statistical indicators, including R2, RMSE, [13] L. Wu, X. Gao, Y. Xiao, S. Liu, and Y. Yang, “Using grey Holt–Winters
model to predict the air quality index for cities in China,” Natural
MAE, MAPE, EV, MBE, and RMBE. Metrics demonstrated Hazards, vol. 88, no. 2, pp. 1003–1012, Sep. 2017.
[14] Z. Zhao, W. Chen, X. Wu, P. C. Y. Chen, and J. Liu, “LSTM network: [38] C.-J. Huang and P.-H. Kuo, “A deep CNN-LSTM model for particulate
A deep learning approach for short-term traffic forecast,” IET Intell. matter (PM2.5 ) forecasting in smart cities,” Sensors, vol. 18, no. 7,
Transp. Syst., vol. 11, no. 2, pp. 68–75, Mar. 2017. p. 2220, Jul. 2018, doi: 10.3390/s18072220.
[15] P. Jiang, C. Li, R. Li, and H. Yang, “An innovative hybrid air pollution [39] H. Wang, B. Zhuang, Y. Chen, N. Li, and D. Wei, “Deep infer-
early-warning system based on pollutants forecasting and extenics ential spatial-temporal network for forecasting air pollution con-
evaluation,” Knowl.-Based Syst., vol. 164, pp. 174–192, Jan. 2019. centrations,” Mach. Learn., to be published. [Online]. Available:
[16] A. Prakash, U. Kumar, K. Kumar, and V. K. Jain, “A wavelet-based https://arxiv.org/abs/1809.03964
neural network model to predict ambient air pollutants’ concentration,” [40] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” Stat,
Environ. Model. Assessment, vol. 16, no. 5, pp. 503–517, Oct. 2011. vol. 1050, pp. 1–15, Dec. 2014.
[17] M. Arhami, N. Kamali, and M. M. Rajabi, “Predicting hourly air [41] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference:
pollutant levels using artificial neural networks coupled with uncertainty A review for statisticians,” J. Amer. Stat. Assoc., vol. 112, no. 518,
analysis by Monte Carlo simulations,” Environ. Sci. Pollut. Res., vol. 20, pp. 859–877, Apr. 2017.
no. 7, pp. 4777–4789, Jul. 2013. [42] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza-
[18] K. Siwek and S. Osowski, “Data mining methods for prediction of air tion,” 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/
pollution,” Int. J. Appl. Math. Comput. Sci., vol. 26, no. 2, pp. 467–478, 1412.6980
Jun. 2016. [43] J. An and S. Cho, “Variational autoencoder based anomaly detection
[19] H. Liu et al., “An intelligent hybrid model for air pollutant concentra- using reconstruction probability,” Special Lecture IE, vol. 2, pp. 1–18,
tions forecasting: Case of Beijing in China,” Sustain. Cities Soc., vol. 47, Dec. 2015.
May 2019, Art. no. 101471. [44] A. Dairi, F. Harrou, Y. Sun, and S. Khadraoui, “Short-term forecasting
[20] J. He et al., “Numerical model-based artificial neural network model of photovoltaic solar power production using variational auto-encoder
and its application for quantifying impact factors of urban air quality,” driven deep learning approach,” Appl. Sci., vol. 10, no. 23, p. 8400,
Water, Air, Soil Pollut., vol. 227, no. 7, p. 235, Jul. 2016. Nov. 2020.
[21] W. Ding, J. Zhang, and Y. Leung, “Prediction of air pollutant con- [45] G. Boquet, A. Morell, J. Serrano, and J. L. Vicario, “A variational
centration based on sparse response back-propagation training feed- autoencoder solution for road traffic forecasting systems: Missing
forward neural networks,” Environ. Sci. Pollut. Res., vol. 23, no. 19, data imputation, dimension reduction, model selection and anomaly
pp. 19481–19494, Oct. 2016. detection,” Transp. Res. C, Emerg. Technol., vol. 115, Jun. 2020,
[22] F. Harrou et al., Statistical Process Monitoring Using Advanced Data- Art. no. 102622.
Driven and Deep Learning Approaches: Theory and Practical Applica- [46] H. Gunduz, “An efficient stock market prediction model using hybrid
tions. Amsterdam, The Netherlands: Elsevier, 2020. feature reduction method based on variational autoencoders and recur-
[23] W. Mao, J. He, and M. J. Zuo, “Predicting remaining useful life of rolling sive feature elimination,” Financial Innov., vol. 7, no. 1, pp. 1–24,
bearings based on deep feature representation and transfer learning,” Dec. 2021.
IEEE Trans. Instrum. Meas., vol. 69, no. 4, pp. 1594–1608, Apr. 2020. [47] A. Zeroual, F. Harrou, A. Dairi, and Y. Sun, “Deep learning methods for
[24] X. Yuan, S. Qi, and Y. Wang, “Stacked enhanced auto-encoder for data- forecasting COVID-19 time-series data: A comparative study,” Chaos,
driven soft sensing of quality variable,” IEEE Trans. Instrum. Meas., Solitons Fractals, vol. 140, Nov. 2020, Art. no. 110121.
vol. 69, no. 10, pp. 7953–7961, Oct. 2020. [48] M. R. Ibrahim, J. Haworth, A. Lipani, N. Aslam, T. Cheng, and
[25] F. Harrou, T. Cheng, Y. Sun, T. Leiknes, and N. Ghaffour, “A data-driven N. Christie, “Variational-LSTM autoencoder to forecast the spread of
soft sensor to forecast energy consumption in wastewater treatment coronavirus across the globe,” PLoS ONE, vol. 16, no. 1, Jan. 2021,
plants: A case study,” IEEE Sensors J., vol. 21, no. 4, pp. 4908–4917, Art. no. e0246120.
Feb. 2021. [49] Y. Zerrouki, F. Harrou, N. Zerrouki, A. Dairi, and Y. Sun, “Desertifica-
[26] F. Harrou, F. Kadri, and Y. Sun, “Forecasting of photovoltaic solar power tion detection using an improved variational autoencoder-based approach
production using LSTM approach,” in Advanced Statistical Modeling, through ETM-landsat satellite data,” IEEE J. Sel. Topics Appl. Earth
Forecasting, and Fault Detection in Renewable Energy Systems. London, Observ. Remote Sens., vol. 14, pp. 202–213, 2021.
U.K.: IntechOpen, 2020, p. 3. [50] L. Li, J. Yan, H. Wang, and Y. Jin, “Anomaly detection of time
[27] Y.-S. Chang, H.-T. Chiao, S. Abimannan, Y.-P. Huang, Y.-T. Tsai, and series with smoothness-inducing sequential variational auto-encoder,”
K.-M. Lin, “An LSTM-based aggregated model for air pollution fore- IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 3, pp. 1177–1191,
casting,” Atmos. Pollut. Res., vol. 11, no. 8, pp. 1451–1463, Aug. 2020. Mar. 2021.
[28] Q. Tao, F. Liu, Y. Li, and D. Sidorov, “Air pollution forecasting using [51] C. Zhang, S. Li, H. Zhang, and Y. Chen, “VELC: A new variational
a deep learning model based on 1D convnets and bidirectional GRU,” AutoEncoder based model for time series anomaly detection,” 2019,
IEEE Access, vol. 7, pp. 76690–76698, 2019. arXiv:1907.01702. [Online]. Available: http://arxiv.org/abs/1907.01702
[29] S. Al-Janabi, M. Mohammad, and A. Al-Sultan, “A new method [52] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
for prediction of air pollution based on intelligent computation,” Soft jointly learning to align and translate,” 2014, arXiv:1409.0473. [Online].
Comput., vol. 24, no. 1, pp. 661–680, Jan. 2020. Available: http://arxiv.org/abs/1409.0473
[30] T.-C. Bui, V.-D. Le, and S.-K. Cha, “A deep learning approach for [53] K. Xu et al., “Show, attend and tell: Neural image caption genera-
forecasting air pollution in South Korea using LSTM,” Mach. Learn., tion with visual attention,” in Proc. Int. Conf. Mach. Learn., 2015,
to be published. [Online]. Available: https://arxiv.org/abs/1804.07891v3 pp. 2048–2057.
[31] S. Kim, J. M. Lee, J. Lee, and J. Seo, “Deep-dust: Predicting concen- [54] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to
trations of fine dust in Seoul using LSTM,” Climate Informat., to be attention-based neural machine translation,” 2015, arXiv:1508.04025.
published. [Online]. Available: http://arxiv.org/abs/1901.10106 [Online]. Available: http://arxiv.org/abs/1508.04025
[32] B. S. Freeman, G. Taylor, B. Gharabaghi, and J. Thé, “Forecasting air [55] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf.
quality time series using deep learning,” J. Air Waste Manage. Assoc., Process. Syst., 2017, pp. 5998–6008.
vol. 68, no. 8, pp. 866–886, Aug. 2018. [56] C. Cortes, M. Mohri, and A. Rostamizadeh, “L2 regularization for learn-
[33] A. V, G. P, V. R, and S. K P, “DeepAirNet: Applying recurrent ing kernels,” in Proc. 25th Conf. Uncertainty Artif. Intell., Arlington, VA,
networks for air quality prediction,” Procedia Comput. Sci., vol. 132, USA: AUAI Press, 2009, pp. 109–116.
pp. 1394–1403, Jan. 2018, doi: 10.1016/j.procs.2018.05.068. [57] J. Ngarambe, S. J. Joen, C.-H. Han, and G. Y. Yun, “Exploring the
[34] T. Xayasouk and H. Lee, “Air pollution prediction system using deep relationship between particulate matter, CO, SO2 , NO2 , O3 and urban
learning,” WIT Trans. Ecology Environ., vol. 230, pp. 71–79, Oct. 2018. heat island in Seoul, Korea,” J. Hazardous Mater., vol. 403, Feb. 2021,
[35] X. Li, L. Peng, Y. Hu, J. Shao, and T. Chi, “Deep learning architecture Art. no. 123615.
for air quality predictions,” Environ. Sci. Pollut. Res., vol. 23, no. 22, [58] Ş. Ç. Doǧruparmak and B. Özbay, “Investigating correlations and
pp. 22408–22417, Nov. 2016, doi: 10.1007/s11356-016-7812-9. variations of air pollutant concentrations under conditions of rapid
[36] X. Li et al., “Long short-term memory neural network for air pollu- industrialization—Kocaeli (1987–2009): Investigating correlations and
tant concentration predictions: Method development and evaluation,” variations,” CLEAN Soil, Air, Water, vol. 39, no. 7, pp. 597–604,
Environ. Pollut., vol. 231, pp. 997–1004, Dec. 2017, doi: 10.1016/ Jul. 2011.
j.envpol.2017.08.114. [59] R. A. Rajagukguk, R. A. A. Ramadhan, and H.-J. Lee, “A review
[37] P.-W. Soh, J.-W. Chang, and J.-W. Huang, “Adaptive deep learning-based on deep learning models for forecasting time series data of solar
air quality prediction model using the most relevant spatial-temporal irradiance and photovoltaic power,” Energies, vol. 13, no. 24, p. 6623,
relations,” IEEE Access, vol. 6, pp. 38186–38199, 2018. Dec. 2020.
[60] J. Zhang et al., “A suite of metrics for assessing the performance of Sofiane Khadraoui received the B.E. and magister’s degrees in automatic
solar power forecasting,” Sol. Energy, vol. 111, pp. 157–175, Jan. 2015. control from Abou Bekr Belkaïd University, Tlemcen, Algeria, in 2004 and
[61] M. Kuhn et al., Applied Predictive Modeling, vol. 26. New York, NY, 2007, respectively, and the Ph.D. degree in automatic control from the
USA: Springer, 2013. University of Franche-Comté, Besançon, France, in 2012.
[62] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to From 2012 to 2015, he was a Post-Doctoral Research Associate with Texas
Statistical Learning, vol. 112. New York, NY, USA: Springer, 2013. A&M University, Doha, Qatar. In 2015, he joined the Electrical Engineering
[63] G. E. Hinton, “A practical guide to training restricted Boltzmann Department, University of Sharjah, Sharjah, UAE, where he is currently an
machines,” in Neural Networks: Tricks of the Trade. Berlin, Germany: Assistant Professor. His main research interests mainly include robust control
Springer, 2012, pp. 599–619. design for uncertain systems, data-based control design for unknown systems,
[64] R. Sarikaya, G. E. Hinton, and A. Deoras, “Application of deep belief robust stability analysis, process monitoring and fault detection, compensation
networks for natural language understanding,” IEEE/ACM Trans. Audio, of nonlinear phenomena, modeling of uncertainties, optimization, and interval
Speech, Language Process., vol. 22, no. 4, pp. 778–784, Apr. 2014. computation.
[65] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm
for deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554,
Jul. 2006.
Abdelkader Dairi received the B.E. degree in computer science from the
University of Oran 1 Ahmed Ben Bella, Oran, Algeria, in 2003, the magister’s
degree in computer science from the National Polytechnic School of Oran,
Oran, in 2006, and the Ph.D. degree in computer science from Ben Bella
Oran1 University, Oran, in 2018.
From 2007 to 2013, he was a Senior Oracle Database Administrator (DBA)
and Enterprise Resource Planning (ERP) Manager. He has over 20 years
of programming experience in different languages and environments. He is
currently an Assistant professor with the Department of Computer Science,
University of Sciences and Technology of Oran Mohamed Boudiaf, Bir El
Djir, Algeria. His research interests include programming languages, artificial
intelligence, computer vision, machine learning, and deep learning.
Fouzi Harrou (Member, IEEE) received the M.Sc. degree in telecommunica-

tions and networking from the University of Paris VI, Paris, France, in 2006, Ying Sun received the Ph.D. degree in statistics from Texas A&M University,
and the Ph.D. degree in systems optimization and security from the University Doha, Qatar, in 2011.
of Technology of Troyes (UTT), Troyes, France, in 2010. She held a two-year postdoctoral research position with the Statistical and
He was an Assistant Professor with UTT for one year and with the Institute Applied Mathematical Sciences Institute and the University of Chicago. She
of Automotive and Transport Engineering, Nevers, France, for one year. established and leads the Environmental Statistics Research Group, which
He was also a Post-Doctoral Research Associate with the Systems Modeling works on developing statistical models and methods for complex data to
and Dependability Laboratory, UTT, for one year. He was a Research address important environmental problems with The Ohio State University,
Scientist with the Chemical Engineering Department, Texas A&M University Columbus, OH, USA. She was an Assistant Professor with The Ohio State
at Qatar, Doha, Qatar, for three years. He is actually a Research Scientist University for a year in 2014. She has made original contributions to
with the Division of Computer, Electrical and Mathematical Sciences and environmental statistics, in particular in the areas of spatiotemporal statistics,
Engineering, King Abdullah University of Science and Technology, Thuwal, functional data analysis, visualization, and computational statistics, with an
Saudi Arabia. He is the coauthor of the book Statistical Process Monitoring exceptionally broad array of applications.
Using Advanced Data-Driven and Deep Learning Approaches: Theory and Dr. Sun received two prestigious awards: The Early Investigator Award
Practical Applications (Elsevier, 2020). His current research interests include in Environmental Statistics presented by the American Statistical Association
statistical decision theory and its applications, fault detection and diagnosis, and the Abdel El-Shaarawi Young Research Award from The International
and deep learning. Environmetrics Society.

Integrated Multiple Directed Attention-Based Deep Learning For Improved Air Pollution Forecasting

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Integrated Multiple Directed Attention-Based Deep Learning For Improved Air Pollution Forecasting

Uploaded by

Copyright:

Available Formats

IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL.

70, 2021 3520815

Integrated Multiple Directed Attention-Based Deep

Fig. 1. Schematic diagram of the overall forecasting strategy.

Fig. 2. Basic illustration of the AE.

VAE is a generative model having similar architecture to

Algorithm 1 VAE Training Algorithm vector V is expressed at time t as follows:

where Wa is the weight matrix of the self-attention

The final self-attention model output is expressed as

This output enhances the quality of the extracted features

III. E XPERIMENTS AND R ESULTS

TABLE II TABLE III

data measurement of each pollutant. Indeed, the objective is to

lowest RMSE = 9.484, MEA = 7.153, MBE = 0.122, and

TABLE VI TABLE VIII

The following experiments are devoted to a comparative

Fig. 8. Validation metrics for multivariate forecasting of CO concentrations.

model, forecasting several pollutants can be obtained simul-

the high ability of deep hybrid model with attention to

Fouzi Harrou (Member, IEEE) received the M.Sc. degree in telecommunica-

You might also like