
Environmental Pollution 289 (2021) 117859

Contents lists available at ScienceDirect

Environmental Pollution
journal homepage: www.elsevier.com/locate/envpol

Spatio-temporal modeling of PM2.5 risk mapping using three machine learning algorithms☆
Seyedeh Zeinab Shogrkhodaei a, Seyed Vahid Razavi-Termeh b, *, Amanollah Fathnia a
a Department of Geography, Faculty of Literature and Humanities, Razi University, Kermanshah, Iran
b Geoinformation Tech. Center of Excellence, Faculty of Geodesy and Geomatics Engineering, K.N. Toosi University of Technology, Tehran, 19697, Iran

ARTICLE INFO

Keywords:
PM2.5
Spatio-temporal modeling
GIS
Remote sensing
Machine learning algorithms

ABSTRACT

Urban air pollution is one of the most critical issues affecting the environment, community health, economy,
and management of urban areas. From a public health perspective, PM2.5 is one of the primary air pollutants,
especially in the metropolis of Tehran. Because PM2.5 follows different patterns in different seasons,
spatio-temporal modeling and identification of high-risk areas are needed to reduce its effects. The purpose of
this study was spatio-temporal modeling and PM2.5 risk mapping using three machine learning algorithms
(random forest (RF), AdaBoost, and stochastic gradient descent (SGD)) in the metropolis of Tehran, Iran.
In the first step, the seasonal PM2.5 average for spring, summer, autumn, and winter was used to prepare the
dependent variable. Then, using remote sensing (RS) and a geographic information system (GIS), independent
data such as temperature, maximum temperature, minimum temperature, wind speed, rainfall, humidity,
normalized difference vegetation index (NDVI), population density, street density, and distance to industrial
centers were prepared as seasonal averages. For spatio-temporal modeling with the machine learning algorithms,
70% of the data were used for training and 30% for validation. The frequency ratio (FR) model was used as input
to the machine learning algorithms to calculate the spatial relationship between PM2.5 and the effective
parameters. Finally, spatio-temporal modeling and PM2.5 risk mapping were performed using the three machine
learning algorithms. The receiver operating characteristic (ROC) area under the curve (AUC) results showed
that the RF algorithm had the greatest modeling accuracy, with values of 0.926, 0.94, 0.949, and 0.949 for
spring, summer, autumn, and winter, respectively. According to the RF model, the most important variable in
spring and autumn was NDVI. Temperature and distance to industrial centers were the most important variables
in summer and winter, respectively. The results showed that the risk of PM2.5 was highest in autumn, followed
by winter, summer, and spring.

1. Introduction

In recent years, air quality worldwide has declined due to the persistence of pollutants in the air
(Alvarez-Mendoza et al., 2019). Air pollution is one of the most important environmental consequences of
population growth in metropolitan areas, driven by migration for education and work (Safarianzengir et al.,
2020). Factors such as increased use of heating equipment, industrial centers, commercial activities, use of
fossil fuels (coal, oil, and natural gas) in transportation, and traffic have exacerbated air pollution in
metropolitan areas (Safarianzengir et al., 2020; Zeinalnezhad et al., 2020). A report of the World Health
Organization (WHO) in 2016 showed that 1 in 9 deaths was due to air pollution (WHO, 2016). Air pollution
contributes to diseases such as lung cancer (Lamichhane et al., 2017), cardiovascular disease (Cesaroni et al.,
2013), and respiratory disease (Razavi-Termeh et al., 2021a). Air pollution harms not only human health but
also the economy and the environment (Shahbazi et al., 2018). Its effects on the economy include health costs
and reduced productivity (Cabaneros et al., 2020). Primary and secondary pollutants are the two types of
pollution found in the air. Primary pollutants are emitted directly into the atmosphere, while secondary
pollutants are produced by chemical reactions between primary pollutants (sulfur oxides (SOx), nitrogen oxides (NOx),

☆ This paper has been recommended for acceptance by Prof. Pavlos Kassomenos.

* Corresponding author.
E-mail addresses: zeinab.shogrkhodaei@yahoo.com (S.Z. Shogrkhodaei), vrazavi70@gmail.com, vrazavi@mail.kntu.ac.ir (S.V. Razavi-Termeh), a_fathnia@razi.ac.ir (A. Fathnia).

https://doi.org/10.1016/j.envpol.2021.117859
Received 18 April 2021; Received in revised form 29 June 2021; Accepted 26 July 2021
Available online 28 July 2021
0269-7491/© 2021 Elsevier Ltd. All rights reserved.
S.Z. Shogrkhodaei et al. Environmental Pollution 289 (2021) 117859

Fig. 1. Research methodology.

carbon oxides (COx), volatile organic compounds (VOCs), and particulate matter (PM)) (Santana et al., 2020).
Among the primary pollutants, PM2.5 showed a higher correlation with mortality (Amini et al., 2019). PM2.5 is
known as one of the pollutants that have the greatest impact on human health (Wang and Ogawa, 2015). PM2.5 was
identified as the fifth leading cause of death in 2015 owing to its small diameter and deep penetration into
the lungs (Feng et al., 2015; Yu et al., 2020). Air pollution is a major environmental issue and a major
public health concern in Tehran, Iran. The concentration of pollutants in Tehran has risen rapidly owing to
its special natural conditions, proximity to the Alborz Mountains, dust storms, and the use of fossil fuels
(Yousefian et al., 2018). Air pollution is affected by topographic, meteorological, and environmental factors,
making air pollution forecasting a complex nonlinear problem (Ghaemi et al., 2018). Air pollution modeling is a


Fig. 2. Study area and air pollution monitoring stations.

practical tool for forecasting and estimating air pollutants and assessing their impact on human health, the
environment, and the economy (Rybarczyk and Zalakeviciute, 2018). Also, citizens' awareness of air pollution
and its forecast according to the current trend can attract their participation in reducing air pollution.
Today, ground stations located in different parts of the city measure pollutant concentrations with acceptable
accuracy (Ghaemi et al., 2018). One of the main problems of ground stations is their limited number and
heterogeneous distribution, which does not allow continuous estimation of pollutants over a large area or the
preparation of a continuous risk map. Using a geographic information system (GIS) and remote sensing (RS), it
is possible to analyze spatial and temporal changes and patterns of air pollution dispersion at low cost and
in little time (Sudhira et al., 2003). GIS can provide interpolation methods to study the spatial distribution
of pollutants, manage spatial data, and consider the impact of effective modeling parameters (Mozumder et al.,
2013). RS is also a useful tool for monitoring and temporally managing variable phenomena on earth (Sohrabinia
and Khorshiddoust, 2007). To date, three approaches have been used to predict air pollution: deterministic,
statistical, and machine learning models. Deterministic models are based on physical and chemical reactions
and require many parameters with high accuracy and a long time (Wang et al., 2015). Deterministic methods apply
meteorological and statistical principles to model dispersion, emission, diffusion, transformation, and
removal processes of pollutants based on aerodynamic theories and physicochemical processes (Wen et al., 2019).
Deterministic methods need to solve a set of differential equations (Masmoudi et al., 2020). Due to the high
volume of data and access to the information source data, implementing these methods takes a long time (Wang
et al., 2015). In addition, the accuracy of the method depends on the scale used and the quality of the data,
and due to the need for various parameters and their experimental estimation, the accuracy of the method is
limited (Wen et al., 2019). Statistical and machine learning models, by contrast, are the two most common
methods for predicting air pollution (Bai et al., 2018). Complex atmospheric physical and chemical reactions
are not taken into account in statistical models, and instead, the internal laws of data are examined (Zhang
et al., 2020). While nonlinear relationships exist in the real world, statistical approaches assume linear
relationships. As a result, machine learning algorithms that target nonlinear and complex relationships between
observations and prediction variables have emerged (Ma et al., 2019). The advantages of machine learning
algorithms include high speed, higher efficiency than traditional models, and no need for basic statistical
assumptions (Ranjgar et al., 2021). So far, various machine learning algorithms have been used to predict
PM2.5, including random forest (RF) (Brokamp et al., 2018), generalized additive model (GAM) (Xiao et al.,
2018), extreme gradient boosting (XGBoost) (Chen et al., 2019), multivariate adaptive regression spline (MARS)
(Xu et al., 2018), boosted regression tree (BRT) (Li et al., 2020), artificial neural network (ANN) (Feng
et al., 2015), support vector machine (SVM) (Zhou et al., 2019), decision tree (DT) (Mehdipour et al., 2018),
AdaBoost (Liu et al., 2019), and stochastic gradient descent (SGD) (Ganesh et al., 2018). These algorithms have
been successful in predicting PM2.5. Although various machine learning algorithms have been used to predict
PM2.5 so far, these algorithms have not been used in combination with GIS for spatio-temporal modeling and
PM2.5 risk mapping. Since there is a temporal correlation between air pollution data at a specific location,
spatio-temporal air pollution modeling is critical in developing a reliable forecasting model for reducing and
monitoring air pollution (Fan et al., 2017). As a result, this study aimed to use three machine learning
algorithms (RF, AdaBoost, and SGD) for spatio-temporal modeling and to prepare a PM2.5 risk map using
meteorological, environmental, and RS variables. These algorithms have shown acceptable accuracy in various
environmental modeling applications such as groundwater potential (Kalantar et al., 2019), flood
susceptibility (Li et al., 2019a), and landslide susceptibility (Tien Bui et al., 2019). The use of all
spatial environmental variables impacting PM2.5, the preparation of PM2.5 risk maps, and a combination of GIS
capabilities, RS, and machine learning algorithms in the spatio-temporal modeling of PM2.5 are all innovative
facets of the current research.

2. Materials and methods

This study was done in four steps (Fig. 1). Using GIS and RS, a spatio-temporal database was built in the
first step, consisting of dependent (seasonal average PM2.5) and independent data (humidity, temperature,
maximum temperature, minimum temperature, rainfall, wind speed, normalized difference vegetation index (NDVI),
population density,


distance to industrial centers, and street density). The frequency ratio (FR) model was used to investigate
the spatial relationship between the dependent and independent variables in the second step. In the third
step, spatio-temporal modeling and PM2.5 risk mapping were performed using three machine learning algorithms
(RF, AdaBoost, and SGD). In the final step, the receiver operating characteristic (ROC), root mean square
error (RMSE), and mean absolute error (MAE) were used to evaluate the results of spatio-temporal modeling.

2.1. Materials

2.1.1. Study area

Tehran lies between 51°17′ and 51°33′ east longitude and 35°36′ and 35°44′ north latitude in Iran. Tehran is
located south of the Caspian Sea on the slopes of the Alborz; its average altitude is 1200 m above sea level,
and it has an area of more than 700 km2. The average rainfall in this city is 230 mm, and the average
temperature is 17 °C. According to the Iranian Statistics Center's 2017 census, Tehran's population is
estimated to be 9 million people. Temperature inversion conditions arising from atmospheric anticyclones and
being surrounded by the Alborz mountains cause pollutants to be trapped in Tehran. Factors such as traffic,
industrial centers, private cars, and population growth have caused air pollution in Tehran (Shahbazi et al.,
2018). According to the reports of Tehran Air Quality Control Company (2020), the city of Tehran has witnessed
a 7% increase in the number of adverse days compared to the previous year, and three periods of long-term
stability of air pollution have occurred during the cold months. December (18 days) and January (15 days) saw
the most polluted days. In Tehran, PM2.5 has been identified as an indicator pollutant. Because of morning
traffic and the lowering of the mixing layer, the highest concentrations occurred in the early morning hours
and the middle of the night. Fig. 2 shows the study area with 23 air pollution control stations.

Table 1
Result of the accuracy of PM2.5 interpolation.

Season    RMSE     %RMSE     Functioning
Spring    6.771    19.34%    Acceptable
Summer    6.65     19.72%    Acceptable
Autumn    4.236    15.95%    Acceptable
Winter    6.16     16.91%    Acceptable

2.1.2. Dependent data

Organic chemicals, acids, soil particles or dust, and metals combine to form particulate matter (PM). The most
important sources of PM2.5 pollutants are fires, volcanoes, dust, tobacco smoke, vehicle smoke, and combustion
in mechanical and industrial processes (Anderson et al., 2012). To measure the PM2.5 concentration, the
average seasonal concentration of these pollutants from January 1, 2010, to January 1, 2020,

Fig. 3. PM2.5 concentration with training and validation points in a) Spring, b) Summer, c) Autumn, and d) Winter.


Table 2
Classification of independent parameters.

Layer                Spring         Summer         Autumn         Winter
Temperature (°C)
  Category 1         20.96–22.34    26.41–27.38    9.63–10.37     5.32–6.17
  Category 2         22.34–23.09    27.38–27.93    10.37–10.83    6.17–6.7
  Category 3         23.09–23.68    27.93–28.49    10.83–11.3     6.7–7.2
  Category 4         23.68–24.17    28.49–29.01    11.3–11.76     7.2–7.61
  Category 5         >24.17         >29.01         >11.76         >7.61
Tmax (°C)
  Category 1         26.27–27.33    32.16–33.22    14.51–15.44    10.26–11.14
  Category 2         27.33–27.88    33.22–33.73    15.44–15.91    11.14–11.63
  Category 3         27.88–28.37    33.73–34.15    15.91–16.33    11.63–12.1
  Category 4         28.37–28.91    34.15–34.6     16.33–16.78    12.1–12.7
  Category 5         >28.91         >34.6          >16.78         >12.7
Tmin (°C)
  Category 1         15.05–16.22    20.25–21.3     5.37–6.22      1.32–2.08
  Category 2         16.22–16.92    21.3–21.9      6.22–6.74      2.08–2.55
  Category 3         16.92–17.54    21.9–22.47     6.74–7.21      2.55–2.98
  Category 4         17.54–18.15    22.47–22.98    7.21–7.61      2.98–3.38
  Category 5         >18.15         >22.98         >7.61          >3.38
Wind speed (m/s)
  Category 1         1.26–1.87      1.09–1.67      0.805–1.22     0.903–1.49
  Category 2         1.87–2.4       1.67–2.071     1.22–1.6       1.49–1.97
  Category 3         2.4–2.77       2.071–2.32     1.6–1.79       1.97–2.28
  Category 4         2.77–3.09      2.32–2.58      1.79–1.99      2.28–2.56
  Category 5         >3.09          >2.58          >1.99          >2.56
Rainfall (mm)
  Category 1         12.28–14.58    1.99–3.69      24.43–29.92    20.74–25.36
  Category 2         14.58–16.49    3.69–4.48      29.92–35.11    25.36–30.92
  Category 3         16.49–18.57    4.48–5.34      35.11–38.7     30.92–36.93
  Category 4         18.57–20.7     5.34–6.62      38.7–43.5      36.93–43.18
  Category 5         >20.7          >6.62          >43.5          >43.18
Humidity (%)
  Category 1         23.8–26.09     20.66–22.11    45.28–46.66    44.79–46.52
  Category 2         26.09–27.04    22.11–22.82    46.66–48.04    46.52–47.91
  Category 3         27.04–28.15    22.82–23.62    48.04–49.65    47.91–49.33
  Category 4         28.15–29.5     23.62–24.6     49.65–51.55    49.33–50.63
  Category 5         >29.5          >24.6          >51.55         >50.63
NDVI
  Category 1         −0.06–0.099    −0.031–0.09    −0.015–0.072   −0.013–0.059
  Category 2         0.099–0.129    0.09–0.113     0.072–0.096    0.059–0.08
  Category 3         0.129–0.177    0.113–0.153    0.096–0.135    0.08–0.115
  Category 4         0.177–0.272    0.153–0.238    0.135–0.22     0.115–0.188
  Category 5         >0.272         >0.238         >0.22          >0.188

Layer (all seasons)  Population density (person/km2)    Street density (km/km2)    Distance to industrial (m)
  Category 1         0–122                              0–1.97                     0–100
  Category 2         122–366                            1.97–3.82                  100–200
  Category 3         366–1343                           3.82–6.14                  200–300
  Category 4         1343–8915                          6.14–9.96                  300–400
  Category 5         >8915                              >9.96                      >400
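
The classes in Table 2 are the units over which the frequency ratio (FR) of Eq. (6) is evaluated. The sketch below is one common reading of that equation, not the authors' implementation: each sample is assigned to a category using the spring temperature breaks from Table 2, and the FR of a class is the share of occurrence samples falling in it divided by the share of all samples falling in it. The function names, the sample values, and the rule that a value equal to a break falls in the lower class are illustrative assumptions.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of Categories 1-4 for spring temperature (deg C), copied from Table 2.
SPRING_TEMPERATURE_BREAKS = [22.34, 23.09, 23.68, 24.17]


def categorize(value, breaks):
    """Return the 1-based category of `value` for ascending class upper bounds.

    Assumption: a value exactly equal to a break belongs to the lower category.
    """
    return bisect_left(breaks, value) + 1


def frequency_ratio(values, occurrence, breaks):
    """Frequency ratio per category (one reading of Eq. (6)).

    FR(class) = (occurrence samples in class / all occurrence samples)
                / (samples in class / all samples)
    `occurrence` holds 1 for occurrence and 0 for non-occurrence samples.
    """
    total_occurrence = sum(occurrence)
    if total_occurrence == 0:
        raise ValueError("at least one occurrence sample is required")
    classes = [categorize(v, breaks) for v in values]
    per_class = Counter(classes)
    occurrence_per_class = Counter(c for c, o in zip(classes, occurrence) if o == 1)
    total = len(classes)
    return {
        c: (occurrence_per_class[c] / total_occurrence) / (per_class[c] / total)
        for c in sorted(per_class)
    }


if __name__ == "__main__":
    # Hypothetical samples: temperature values and occurrence flags.
    temps = [21.5, 22.0, 22.5, 23.2, 23.9, 24.5, 24.8, 25.0]
    flags = [0, 0, 0, 1, 0, 1, 1, 1]
    for category, fr in frequency_ratio(temps, flags, SPRING_TEMPERATURE_BREAKS).items():
        print(f"Category {category}: FR = {fr:.2f}")
```

An FR above 1 means occurrence samples are over-represented in that class relative to its share of the study area; an FR below 1 means they are under-represented.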

was collected from 23 air pollution monitoring stations in Tehran. The PM2.5 concentration map was created in
ArcGIS 10.3 software using the kriging interpolation method with a 30 × 30 m pixel size. Eqs. (1) and (2) were
used to validate the interpolation. In the optimum case, when estimated and measured values are equal, the
value of RMSE is zero. Because RMSE is sensitive to outliers, %RMSE can be used instead. The higher the
precision of interpolation, the lower the value of this index. The acceptable limit for %RMSE is <40%, while
values above 70% indicate imprecise point estimates (Hengl et al., 2004).

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (E_i - A_i)^2} \tag{1}$$

$$\%\mathrm{RMSE} = \frac{\mathrm{RMSE}}{\mu} \tag{2}$$

where A_i is the average of the variable measured at station i, E_i is the value predicted by kriging, and n
is the number of stations. μ represents the average of each measurement component, and %RMSE is reported as a
percentage. The results of PM2.5 interpolation accuracy in different seasons are presented in Table 1. The
%RMSE index showed that the PM2.5 interpolation values in all four seasons are less than 40% and had
acceptable accuracy.

Modeling with machine learning algorithms requires occurrence data (value 1) and non-occurrence data (value 0)
(Rahmati et al., 2020). Therefore, to prepare training and validation data, the seasonal maps of PM2.5
concentration were normalized between 0 and 1. PM2.5 values ranging from 0.5 to 1 were then considered
occurrence data, while PM2.5 values ranging from 0 to 0.5 were considered non-occurrence data. Then, 300
locations were selected as occurrence points (value 1) and 300 locations as non-occurrence points (value 0).
Using the holdout technique, 70% (210 points) of these locations were randomly assigned to training and 30%
(90 points) to validation (Fig. 3).

2.1.3. Independent data

Independent variables used in this study include meteorological data (average temperature, rainfall, wind
speed, humidity, minimum temperature, and maximum temperature), NDVI, distance to industrial centers, street
density, and population density. The parameters were prepared using the natural breaks classification in
ArcGIS 10.3 software with a 30 × 30 m pixel size and are shown in Table 2 for the four seasons.
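
The labeling and holdout steps described for the dependent data can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' workflow: the function names are hypothetical, a normalized value of exactly 0.5 is treated as occurrence (the text assigns 0.5 to both ranges), and a fixed random seed is used only to make the split repeatable.

```python
import random


def min_max_normalize(values):
    """Rescale values linearly to the range 0-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def label_occurrence(normalized, threshold=0.5):
    """Return 1 (occurrence) or 0 (non-occurrence) per normalized value.

    Assumption: a value exactly at the threshold counts as occurrence.
    """
    return [1 if v >= threshold else 0 for v in normalized]


def holdout_split(items, train_fraction=0.7, seed=0):
    """Shuffle the items and split them into training and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Hypothetical seasonal PM2.5 values at five pixels.
    pm25 = [18.0, 22.0, 30.0, 26.0, 34.0]
    labels = label_occurrence(min_max_normalize(pm25))
    print("labels:", labels)

    # 300 point identifiers split 70/30, as for each class in the text.
    train, valid = holdout_split(range(300))
    print(len(train), "training points,", len(valid), "validation points")
```

With 300 points per class, the 70/30 split yields the 210 training and 90 validation points quoted in the text.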


Fig. 4. Independent parameter maps: a) Temperature, b) Tmax, c) Tmin, d) Wind speed, e) Rainfall, f) Humidity, g) NDVI, h) Population density, i) Road density, and
j) Distance to industrial.


Fig. 4. (continued).
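
Panel (g) of Fig. 4 maps NDVI, which Eq. (3) defines as the normalized difference of near-infrared and red reflectance. A minimal per-pixel sketch of that ratio follows; the function name is hypothetical, inputs are assumed to be plain reflectance values, and returning None for a zero denominator is an assumption, since the paper does not say how such pixels were handled.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel.

    Returns None when nir + red is zero (assumed no-data handling).
    """
    denominator = nir + red
    if denominator == 0:
        return None
    return (nir - red) / denominator


if __name__ == "__main__":
    # Hypothetical reflectance pairs (near-infrared, red).
    for nir, red in [(0.5, 0.1), (0.3, 0.3), (0.0, 0.0)]:
        print(nir, red, "->", ndvi(nir, red))
```

Values close to 1 indicate dense vegetation, and values near or below 0 indicate bare or built-up surfaces, which matches the low category bounds listed for NDVI in Table 2.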


Fig. 5. The wind direction in Tehran city.

Fig. 6. Meteorology stations in Tehran province.

• Meteorology data

Air pollution occurs when pollutants' emission exceeds the atmosphere's capacity (He et al., 2013). Two
important factors in reducing air quality are meteorological conditions and the release of contaminants from
anthropogenic sources (Xu et al., 2018). Meteorological conditions are directly or indirectly related to the
emission, formation, and movement of air pollutants (Zhang et al., 2015). Meteorological variables have
different effects on PM2.5 concentration in different seasons and locations (Yang et al., 2017). The
meteorological variables used in this study are described below.

Temperature: Temperature is an influential factor in the formation of particles. Rising temperature enhances
the photochemical reaction between precursors (Wang and Ogawa, 2015). Also, increasing temperature leads to
greater dispersion of pollutants and decreased concentration (Yang et al., 2017). In this study, the
temperature is considered in three different ways: average temperature, minimum temperature, and maximum
temperature (Fig. 4a–c).

Wind speed: Wind directly affects pollutants' concentration and their horizontal and vertical displacement in
the atmosphere (He et al., 2013). The concentration of PM2.5 decreases with increasing wind speed, while
decreasing wind speed leads to PM2.5 accumulation in the atmosphere (Li et al., 2020). Five synoptic stations
(Chitgar, Geophysic, Imam Khomeini, Mehrabad, and Shemiran) were utilized to examine the effect of wind
direction on PM2.5 concentrations in Tehran (Fig. 5). The results showed that the dominant wind directions lie
between north and west at all stations except the Geophysic station (south direction) (Fig. 4d).

Rainfall: Total rainfall, rainfall duration, rainfall intensity, particle size, chemical composition, and
spectral distribution all affect the removal of particles by rain (Hart et al., 2020). Dispersion of gases,
reduction of pollutants, and air quality improvement occur with rainfall (Santana et al., 2020) (Fig. 4e).

Humidity: Humidity is one of the most significant factors affecting particle structure and properties, and it
has a significant effect on secondary reactions (Lou et al., 2017). The concentration of particulate matter
also decreases when the humidity is high and the air becomes saturated. Temperature inversion and atmospheric
stability also occur at high humidity (Lou et al., 2017) (Fig. 4f).

Meteorological data were compiled from the Iranian Meteorological Organization from 12 synoptic stations in
Tehran province between January 1, 2010, and January 1, 2020 (Fig. 6). Meteorological data were


Table 3
Result of the accuracy of meteorology data interpolation.

               Spring                         Summer                         Autumn                         Winter
Factors        RMSE    %RMSE   Functioning    RMSE    %RMSE   Functioning    RMSE    %RMSE   Functioning    RMSE    %RMSE   Functioning
Humidity       10.38   34.38   Acceptable     5.41    20.33   Acceptable     3.5     6.84    Acceptable     3.91    7.62    Acceptable
Temperature    2.84    13.16   Acceptable     3.05    11.28   Acceptable     4.01    10.9    Acceptable     2.14    38      Acceptable
Tmax           2.79    10.37   Acceptable     2.83    8.59    Acceptable     2.13    14.34   Acceptable     2.42    22.8    Acceptable
Tmin           2.28    15.26   Acceptable     2.46    12.36   Acceptable     1.8     35.5    Acceptable     1.92    34      Acceptable
Rainfall       7.48    40.17   Acceptable     1.94    30.6    Acceptable     10.85   28.98   Acceptable     11.07   30.96   Acceptable
Wind speed     9.1     31.66   Acceptable     5.6     37.7    Acceptable     2.905   32.17   Acceptable     1.124   39.35   Acceptable
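
The RMSE and %RMSE indices reported in Table 3 follow Eqs. (1) and (2). The sketch below shows how they can be computed for a set of stations; it assumes that μ is the mean of the measured station values and that %RMSE is multiplied by 100 to give a percentage, both of which are interpretations rather than details stated in the paper. The function names and sample values are hypothetical.

```python
import math


def rmse(estimated, measured):
    """Root mean square error between estimated and measured values (Eq. (1))."""
    if len(estimated) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((e - a) ** 2 for e, a in zip(estimated, measured))
    return math.sqrt(squared / len(measured))


def percent_rmse(estimated, measured):
    """RMSE relative to the mean measured value, as a percentage (Eq. (2)).

    Assumption: mu is the mean of the measured values.
    """
    mean_measured = sum(measured) / len(measured)
    return 100.0 * rmse(estimated, measured) / mean_measured


def is_acceptable(percent, limit=40.0):
    """Apply the acceptance limit quoted in the text (%RMSE < 40)."""
    return percent < limit


if __name__ == "__main__":
    # Hypothetical station measurements and kriging estimates.
    measured = [30.0, 40.0, 50.0]
    estimated = [33.0, 36.0, 50.0]
    value = percent_rmse(estimated, measured)
    print(f"RMSE = {rmse(estimated, measured):.3f}, %RMSE = {value:.2f}")
    print("acceptable:", is_acceptable(value))
```

Dividing by the mean makes the index comparable across variables with different units, which is why rainfall in millimetres and wind speed in metres per second can share one acceptance limit.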

prepared as seasonal averages (spring, summer, autumn, and winter) from 2010 to 2020. In this research,
meteorological parameters were prepared in ArcGIS 10.3 software using the kriging interpolation method. The
accuracy of the interpolation of the meteorological factors, measured by the RMSE and %RMSE indices, is
presented in Table 3. The results showed that all factors in different seasons have acceptable accuracy
(%RMSE <40).

• NDVI

Green spaces have a significant ability to eliminate pollutants in urban environments (Mozumder et al., 2013).
NDVI is an indicator obtained from the reflectance of the earth's surface in the visible (red) and
near-infrared bands (Dadvand et al., 2015). The NDVI layer was created using Landsat 7 satellite images in
Google Earth Engine from January 1, 2010, to January 1, 2020. The NDVI index is calculated using Eq. (3).

$$\mathrm{NDVI} = \frac{\mathrm{NIR}_{um} - \mathrm{Red}_{um}}{\mathrm{NIR}_{um} + \mathrm{Red}_{um}} \tag{3}$$

NIR and Red correspond to the near-infrared and red wavelengths above the atmosphere, respectively (Fig. 4g).

• Population density

Declining urban air quality is affected by urban population growth because population growth is associated
with an increase in the number of vehicles (traffic congestion) and in industrial and commercial activities
(Zhang et al., 2017). The population density map was created using 2017 population data from Iran's Statistics
Center. The population density was calculated using Eq. (4).

$$\text{Population density} = \frac{\text{Number of people}}{\text{Area}} \tag{4}$$

The area of each block is expressed in km2 (Fig. 4h).

• Street density

At the city level, an increase in street density increases the number of intersections. More intersections
reduce vehicle speed, increase the number of stopping points, wear out tires and brakes, and eventually
release significant quantities of PM2.5 (Han and Sun, 1968). The street map was gathered from OpenStreetMap
(OSM) at the scale of 1:100,000 in 2020. A street density raster map was created using the Line Density tool
in ArcGIS 10.3 software (Fig. 4i).

• Distance to industrial centers

Significant sources of PM2.5 production include fuel combustion in industry and power plants (Kumar and
Joseph, 2006). Unfavorable location and non-observance of hygienic and environmental principles by industrial
centers in Tehran have intensified air pollution. The industrial layer was extracted from the land-use layer
of Tehran. The raster map of distance to industrial centers was created using ArcGIS 10.3 software and the
Euclidean distance tool. Using Euclidean distance, the shortest distance between two points is calculated
according to Eq. (5) (Fig. 4j).

$$D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \tag{5}$$

D is the Euclidean distance between two points with coordinates (x1, y1) and (x2, y2).

2.2. Methods

2.2.1. Statistical methods

• Multicollinearity analysis

Multicollinearity is a term used to describe a situation in multivariate regression in which there is a high
correlation between two or more predictor variables. In this case, the regression model may not be highly
valid. Tolerance and variance inflation factor (VIF) indexes are used to measure multicollinearity in
regression models. The value of the tolerance coefficient varies between 0 and 1, and the larger the value,
the smaller the collinearity. VIF >5 or tolerance <0.1 indicates the possibility of multicollinearity in a
dataset (Razavi-Termeh et al., 2020b).

• The FR model

The FR model is a widely used bivariate statistical method (Chen et al., 2020) that examines the spatial
relationship between the dependent and independent variables (Naghibi et al., 2015). According to this model,
future hazards have similar characteristics and conditions to past risks (Razavi-Termeh et al., 2020a). A high
value of FR in a class indicates the importance of that class in the occurrence of PM2.5. The ratio of the
occurrence of PM2.5 areas to the whole study area is calculated according to Eq. (6):

$$\mathrm{FR} = \frac{N_{pix}(SX_i) \,/\, \sum_{i=1}^{m} N_{pix}(SX_i)}{N_{pix}(X_j) \,/\, \sum_{j=1}^{n} N_{pix}(X_j)} \tag{6}$$

Npix(SXi) is the number of pixels containing PM2.5 in the ith class of the X variable, and Npix(Xj) is the
total number of pixels of the variable Xj. m and n are the number of classes in each criterion and the total
number of criteria, respectively.

2.2.2. Machine learning algorithms

In this study, three machine learning algorithms (RF, AdaBoost, and SGD) were used to prepare the PM2.5 risk
map. After creating a PM2.5 concentration map for each season, 300 sample-based points were created for
dependent data. The holdout technique was used to guard against overfitting and to split the data randomly
into modeling (70%) and validation (30%) sets. The GridSearchCV method was used to tune the hyperparameters of
the three algorithms.

• The RF algorithm

RF is a technique that combines different decision tree (DT)


Table 4
Result of multicollinearity in four seasons.

                          Spring              Summer              Autumn              Winter
Factors                   VIF     Tolerance   VIF     Tolerance   VIF     Tolerance   VIF     Tolerance
Distance to street        1.164   0.859       1.187   0.843       1.126   0.888       1.229   0.813
Distance to industrial    1.082   0.924       2.873   0.348       1.074   0.931       1.143   0.875
Population density        1.104   0.905       1.862   0.537       1.097   0.912       1.201   0.833
NDVI                      1.077   0.928       1.959   0.510       1.030   0.971       1.174   0.852
Humidity                  2.319   1.43        2.415   0.414       1.496   0.668       2.091   0.478
Rainfall                  1.305   0.766       1.548   0.646       1.527   0.655       1.538   0.650
Wind speed                1.330   0.752       1.781   0.562       1.287   0.777       1.659   0.603
Tmin                      3.708   0.269       1.323   0.756       1.512   0.661       3.872   0.258
Tmax                      2.124   0.471       1.122   0.891       1.907   0.524       2.638   0.379
Temperature               3.524   0.284       1.256   0.796       1.968   0.508       3.149   0.318
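
The two indices in Table 4 are linked by the standard definitions tolerance = 1 − R² and VIF = 1 / tolerance, where R² comes from regressing one predictor on all the others. The sketch below applies those definitions and the thresholds quoted in the text (VIF > 5 or tolerance < 0.1); the function names are hypothetical, and the R² value in the usage example is illustrative rather than taken from the study.

```python
def tolerance_from_r2(r_squared):
    """Tolerance of a predictor given the R^2 of regressing it on the others."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 - r_squared


def vif_from_r2(r_squared):
    """Variance inflation factor, the reciprocal of tolerance."""
    return 1.0 / tolerance_from_r2(r_squared)


def indicates_multicollinearity(vif, tolerance, vif_limit=5.0, tolerance_limit=0.1):
    """Apply the thresholds quoted in the text: VIF > 5 or tolerance < 0.1."""
    return vif > vif_limit or tolerance < tolerance_limit


if __name__ == "__main__":
    # Hypothetical R^2 for one predictor regressed on the remaining ones.
    r2 = 0.5
    print("tolerance:", tolerance_from_r2(r2), "VIF:", vif_from_r2(r2))

    # Reciprocal check on one row of Table 4 (spring, distance to street).
    print("1 / 1.164 =", round(1 / 1.164, 3))
    print("flagged:", indicates_multicollinearity(1.164, 0.859))
```

Because the two indices are reciprocals, either column of Table 4 can be used to cross-check the other.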

algorithms for classification and regression (Rodriguez-Galiano et al., 2014). Breiman (2001) was the first to
develop this algorithm. The number of trees (ntree) and the number of randomly selected predictor variables
(mtry) are the two main parameters in the implementation of the RF algorithm (Farhangi et al., 2020). The RF
algorithm works by first using the bootstrap approach to construct training sets and then creating a decision
tree for each training set, so that each tree is trained on its own training subset. Finally, the individual
trees are combined, by majority vote for classification or by averaging their outputs for regression, to
predict the value of new input data (Chen et al., 2020).

• The AdaBoost algorithm

AdaBoost is one of the hybrid machine learning algorithms first proposed by Freund and Schapire (1999). Using
a linear combination, AdaBoost combines basic or weak classifiers to produce a more robust classifier (Naghibi
et al., 2017). In the first step, AdaBoost assigns equal weight to all training samples. Then, at each stage
of training, the samples with the highest prediction error gain more weight, while samples with minor
prediction errors receive the least weight. As repetitions continue, the global prediction error decreases
because more attention is paid to samples that are difficult to predict (Zhai et al., 2018).

• The SGD algorithm

SGD is one of the most popular machine learning algorithms for working with big data. In methods based on this
algorithm, the gradient is approximated by a single sample at each iteration (Cohen et al., 2017). Unlike
standard gradient descent, which uses all training data to optimize the objective function, SGD uses a
randomly selected subset of the training data. According to stochastic approximation analyses, when the
objective function is convex, SGD converges to a global minimum under an appropriate learning rate and some
common conditions (Bottou, 2012).

2.2.3. Assessment criteria

• Models performance assessment

Machine learning algorithms were evaluated using the root mean square error (RMSE) and the mean absolute error
(MAE) indices. The MAE indicator was calculated using Eq. (7):

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |E_i - A_i| \tag{7}$$

where A_i is the average PM2.5 measured at stations, E_i is the predicted PM2.5, and n is the number of
samples.

• ROC curve

ROC is one of the most widely used evaluation methods in spatial modeling and a standard tool for evaluating
the accuracy of modeling output maps (Khosravi et al., 2019). ROC is a two-dimensional graph that plots
sensitivity (y-axis) against 1 − specificity (x-axis) (Razavi-Termeh et al., 2021b). Sensitivity and
specificity were calculated according to Eqs. (8) and (9):

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{8}$$

$$\text{Specificity} = \frac{TN}{FP + TN} \tag{9}$$

where TP (true positive) and TN (true negative) are the numbers of pixels correctly classified, and FP (false
positive) and FN (false negative) are the numbers of pixels incorrectly classified.

The area under the curve (AUC) is a metric for validating output with actual data (Razavi-Termeh et al.,
2020b). The AUC value is between 0.5 and 1. An AUC equal to 1 indicates excellent performance of the model.
The AUC was calculated according to Eq. (10):

$$\mathrm{AUC} = \frac{\sum TP + \sum TN}{P + N} \tag{10}$$

The total occurrence and non-occurrence of PM2.5 are represented by P and N, respectively (Wang et al., 2019).

3. Results

3.1. Results of multicollinearity analysis

The result of the multicollinearity analysis is presented in Table 4. The tolerance and VIF indices show that
the factors have tolerance values greater than 0.1 and VIF values less than 5 in all seasons. Therefore, all
factors can be used in spatio-temporal modeling.

3.2. Results of FR model

The spatial relationship between PM2.5 and the independent variables using the FR model is presented in
Table 5. The result for temperature indicates that as the temperature increases in the spring, so does PM2.5.
In the autumn, a decrease in temperature raises PM2.5. In summer and winter, no specific pattern was observed
between temperature and PM2.5 concentration. The results related to the maximum temperature show that the
highest weight of the FR model for spring, summer, and winter belongs to class >28.91 °C, >34.6 °C, and
12.1–12.7 °C, and in autumn belongs to class 14.51–15.44 °C. The results related to the minimum temperature
indicate that the highest FR weights for spring, summer, autumn, and winter are for class >18.15 °C,
21.3–21.9 °C, 21.9–22.47 °C, >7.61 °C, and 2.55–2.98 °C. According to the wind speed, in spring and autumn,
with increasing wind speed,


PM2.5 increased. In winter, as the wind speed decreases, the concentration of PM2.5 increases. In summer, no significant relationship was observed between wind speed and PM2.5 concentration. The rainfall results show that there is no direct relationship between rainfall amount and PM2.5 concentration. The concentration of PM2.5 rose in spring (class 16.49–18.57 mm), summer (class 5.34–6.62 mm), autumn (class 38.7–43.5 mm), and winter (class >43.18 mm). With increasing humidity in spring, the FR shows a decreasing trend. The highest FR values for summer, autumn, and winter are related to classes of 23.62%–24.6%, 49.65%–51.55%, and 47.91%–49.33%, respectively. The highest FR value for the NDVI factor in four seasons is related to the lower values. According to the FR results, the concentration of PM2.5 increased with a decrease in NDVI. The highest FR values of the population density factor for spring, summer, autumn, and winter are related to classes of >8915, 1343–8915, 122–366, and 1343–8915, respectively. The street density results show that the FR has an increasing trend with increasing street density in all four seasons. For the distance to industrial centers factor, the highest FR values in all four seasons are in the 0–200 m class.

Table 5
Result of the FR model.

Layer                         FR (Spring)  FR (Summer)  FR (Autumn)  FR (Winter)
Temperature (°C)
  Category 1                  0.37         0.55         1.16         0
  Category 2                  1.24         0.95         0.91         0.85
  Category 3                  0.94         1.77         1.01         1.77
  Category 4                  1.15         1.19         0.92         0.91
  Category 5                  1.28         0.64         1.02         0.9
Tmax (°C)
  Category 1                  0            0.096        1.28         0.23
  Category 2                  0.67         1.11         0.75         1.08
  Category 3                  1.42         1.08         0.907        0.76
  Category 4                  0.68         0.9          1.08         1.28
  Category 5                  1.53         1.26         1.033        0.87
Tmin (°C)
  Category 1                  0.4          0.64         1.13         0
  Category 2                  0.96         1.44         0.82         1.03
  Category 3                  0.95         1.44         1.2          1.46
  Category 4                  1.27         0.64         0.76         0.82
  Category 5                  1.33         1.26         1.36         1.27
Wind speed (m/s)
  Category 1                  1.08         1.005        0.9          1.66
  Category 2                  1.05         1.25         0.99         1.45
  Category 3                  1.12         1.29         1.009        0.88
  Category 4                  0.44         1.12         0.72         0.71
  Category 5                  1.68         0.33         1.6          1.24
Rainfall (mm)
  Category 1                  0.88         1.05         0.89         0.062
  Category 2                  1.26         0.8          1.06         1.2
  Category 3                  1.42         1.07         0.9          1.08
  Category 4                  0.69         1.22         1.16         0.88
  Category 5                  0.61         0.86         0.85         1.55
Humidity (%)
  Category 1                  1.42         1.32         0.87         0.109
  Category 2                  1.16         0.61         1.04         0.75
  Category 3                  1.04         1.1          0.87         1.29
  Category 4                  0.78         1.51         1.17         0.86
  Category 5                  0.46         0.58         0.85         0.93
NDVI
  Category 1                  1.046        1.06         1.061        1.59
  Category 2                  1.4          1.26         1.061        1.18
  Category 3                  0.91         1.02         0.864        .1
  Category 4                  0.62         0.85         1.054        0.77
  Category 5                  0.96         0.74         0.96         0.44
Population density (km/km2)
  0–122                       0.87         0.95         1.008        0.87
  122–366                     1.02         0.77         1.06         0.77
  366–1343                    1.17         1.17         0.9          1.27
  1343–8915                   1.17         1.5          0.83         1.67
  >8915                       1.19         1.49         1.04         1.64
Street density (km/km2)
  0–1.97                      0.69         0.63         0.933        0.45
  1.97–3.82                   0.9          0.8          0.93         0.99
  3.82–6.14                   1.21         1.33         1.21         1.38
  6.14–9.96                   1.45         1.64         0.67         1.25
  >9.96                       1.51         1.51         1.006        1.51
Distance to industrial (m)
  0–100                       1.37         1.28         1.28         1.24
  100–200                     0.94         1.42         1.05         1.15
  200–300                     1.25         1.08         1.19         1.19
  300–400                     1.28         0.98         1.06         1.13
  >400                        0.74         0.74         0.8          0.77

3.3. The importance of effective criteria

The importance of factors in all seasons according to the RF algorithm is shown in Fig. 7. For spring, NDVI (0.42), distance to industrial centers (0.41), street density (0.38), and population density (0.35) had the most impact on PM2.5 concentration, and maximum temperature (0.21) had the least. In summer, the most important factors affecting PM2.5 concentration were temperature (0.48), maximum temperature (0.4), street density (0.4), and minimum temperature (0.39), in that order, while the wind speed criterion (0.25) was the least important. In autumn, NDVI (0.45), street density (0.38), population density (0.38), and distance to industrial centers (0.37) had the most significant effect on PM2.5 concentration, and minimum temperature (0.26) had the least. In winter, distance to industrial centers (0.41), population density (0.38), and NDVI (0.38) had the most significant effect, and rainfall (0.25) was the least significant.

3.4. Modeling with machine learning algorithms

A spatial database with occurrence and non-occurrence points and the effective parameters was first developed to implement the machine learning algorithms (RF, AdaBoost, and SGD). The weights of the effective parameters obtained from the FR model were used as the input of the machine learning algorithms. 70% of the data were used as training data, and the remaining 30% as validation data. The machine learning algorithms were implemented in the Python programming language. The results of hyperparameter tuning using the GridSearchCV method for the three algorithms are shown in Table 6. The values of the RMSE and MAE indices for the machine learning algorithms are shown in Tables 7 and 8, respectively. For the RF algorithm, the RMSE values for training and validation data were (0.169, 0.432) in spring, (0.156, 0.383) in summer, (0.141, 0.338) in autumn, and (0.151, 0.375) in winter. The corresponding MAE values were (0.124, 0.318) in spring, (0.11, 0.287) in summer, (0.104, 0.234) in autumn, and (0.108, 0.269) in winter. For the AdaBoost algorithm, the RMSE values for training and validation data were (0.233, 0.462) in spring, (0.216, 0.39) in summer, (0.202, 0.342) in autumn, and (0.21, 0.377) in winter, and the MAE values were (0.184, 0.34) in spring, (0.164, 0.301) in summer, (0.155, 0.245) in autumn, and (0.1603, 0.284) in winter. For the SGD algorithm, the RMSE values for training and validation data were (0.326, 0.462) in spring, (0.314, 0.442) in summer, (0.309, 0.408) in autumn, and (0.324, 0.453) in winter, and the MAE values were (0.213, 0.353) in spring, (0.197, 0.315) in summer, (0.191, 0.254) in autumn, and (0.21, 0.311) in winter. The results showed that the RF model had higher accuracy than the AdaBoost and SGD models on both training and validation data in all seasons. Modeling accuracy was highest in autumn, followed by winter, summer, and spring. Fig. 8 shows the difference between the training and validation outputs and the actual data (observed PM2.5) for the RF algorithm (maximum accuracy).

3.5. Preparing PM2.5 risk map with machine learning algorithms

After spatio-temporal modeling, the fitted model was generalized to the whole study area. The outputs of machine learning algorithms were

Fig. 7. The importance of the factors using the RF algorithm.

transferred to ArcGIS 10.3 software to create PM2.5 risk maps. The PM2.5 risk maps produced with the machine learning algorithms are shown in Figs. 9 and 10. Results showed that the southern, central, and southwestern regions are associated with a higher risk, which can be attributed to industries in the south and southwest, and to population density and traffic in the center. High-risk areas were distributed in a spiral from north to south. Also, in autumn, the intensity and extent of the risky areas increased compared to other seasons. For the AdaBoost algorithm, the southern, southwestern, southeastern, central, and northern regions had high-risk intensities. Also, in autumn and winter, the intensity and extent of the risk increased. For the SGD algorithm, the range of hazards was high but less severe.

Table 6
Result of hyperparameter tuning using GridSearchCV.

Algorithm  Hyperparameter        Spring       Summer    Autumn   Winter
SGD        Alpha                 0.001        0.1       1        0.0001
           Eta0                  10           100       100      1
           Learning rate         adaptive     adaptive  optimal  invscaling
           Penalty               elastic net  l2        l2       l1
AdaBoost   Learning rate         0.2          0.1       0.367    0.8
           Loss                  Linear       Square    Square   Exponential
           Number of estimators  60           100       50       60
RF         Max depth             9            3         9        7
           Max features          log2         auto      sqrt     log2
           Min samples leaf      6            1         1        3
           Min samples split     3            2         2        7

3.6. Validation of the PM2.5 risk maps

In this study, the ROC curve was used to evaluate the PM2.5 risk mapping. For this purpose, 30% of the occurrence (70 points) and non-occurrence (70 points) data were used. Fig. 11 shows the ROC curve for the three machine learning algorithms. The AUC values for the RF, AdaBoost, and SGD algorithms were, respectively, 0.926, 0.895, and 0.81 in spring; 0.94, 0.93, and 0.832 in summer; 0.949, 0.938, and 0.841 in autumn; and 0.949, 0.938, and 0.841 in winter. In spatio-temporal modeling, the RF algorithm therefore had the highest accuracy, followed by AdaBoost and SGD. Modeling accuracy and PM2.5 risk were highest in autumn, followed by winter, summer, and spring.

4. Discussion

4.1. Criteria analysis using FR model

According to the analyses performed with the FR model, the factors used can have different effects on the concentration of PM2.5 in different seasons. In spring and summer, PM2.5 concentration increased with the temperature, the maximum temperature, and the minimum temperature. Increasing the temperature has enhanced

Table 7
RMSE index results.

Model     Spring             Summer             Autumn             Winter
          Train  Validation  Train  Validation  Train  Validation  Train  Validation
RF        0.169  0.432       0.156  0.383       0.141  0.338       0.151  0.375
AdaBoost  0.233  0.462       0.216  0.39        0.202  0.342       0.21   0.377
SGD       0.326  0.462       0.314  0.442       0.309  0.408       0.324  0.453
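
The hyperparameters in Table 6 were selected with GridSearchCV, and Table 7 reports errors on a 70/30 training and validation split. The idea behind such a search can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the authors' code and not scikit-learn's implementation: the one-parameter ridge model, the synthetic data, and the names `fit_ridge_1d`, `rmse`, and `grid_search_cv` are all hypothetical; only the candidate alpha values are borrowed from the SGD row of Table 6.

```python
import math
import random


def fit_ridge_1d(xs, ys, alpha):
    """Closed-form slope of y ~ w * x with an L2 penalty of strength alpha."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


def grid_search_cv(xs, ys, alphas, k=5):
    """Return (best_alpha, mean_cv_rmse) over a grid of candidate alphas."""
    indices = list(range(len(xs)))
    folds = [indices[i::k] for i in range(k)]  # k disjoint validation folds
    best = None
    for alpha in alphas:
        scores = []
        for fold in folds:
            held_out = set(fold)
            train = [i for i in indices if i not in held_out]
            w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], alpha)
            preds = [w * xs[i] for i in fold]
            scores.append(rmse([ys[i] for i in fold], preds))
        mean_score = sum(scores) / k
        if best is None or mean_score < best[1]:
            best = (alpha, mean_score)
    return best


# Synthetic data (an assumption for illustration, not the study's data).
rng = random.Random(0)
xs = [rng.uniform(0.0, 2.0) for _ in range(100)]
ys = [1.5 * x + rng.gauss(0.0, 0.1) for x in xs]

# Hold out 30% for validation, mirroring the 70/30 split described in the text.
split = int(0.7 * len(xs))
x_train, y_train = xs[:split], ys[:split]
x_valid, y_valid = xs[split:], ys[split:]

alpha_grid = [0.0001, 0.001, 0.1, 1.0]  # candidate values, as in Table 6 (SGD)
best_alpha, cv_rmse = grid_search_cv(x_train, y_train, alpha_grid)
w = fit_ridge_1d(x_train, y_train, best_alpha)
valid_rmse = rmse(y_valid, [w * x for x in x_valid])
print(best_alpha, round(cv_rmse, 3), round(valid_rmse, 3))
```

The search only ever sees the training portion; the held-out 30% is used once, at the end, which is what makes the validation columns of Table 7 a fair estimate of generalization.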


Table 8
MAE index results.

Model     Spring             Summer             Autumn             Winter
          Train  Validation  Train  Validation  Train  Validation  Train  Validation
RF        0.124  0.318       0.11   0.287       0.104  0.234       0.108  0.269
AdaBoost  0.184  0.34        0.164  0.301       0.155  0.245       0.1603 0.284
SGD       0.213  0.353       0.197  0.315       0.191  0.254       0.210  0.311
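
The two error indices reported in Tables 7 and 8 can be computed as in the following sketch. The `mae` function follows Eq. (7); `rmse` uses the standard root-mean-square definition. The function names and the sample values are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def mae(actual, predicted):
    """Mean absolute error, as in Eq. (7): (1/n) * sum(|E_i - A_i|)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - a) for a, e in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error: sqrt((1/n) * sum((E_i - A_i)^2))."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((e - a) ** 2 for a, e in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical observed (A_i) and predicted (E_i) values, for illustration only.
observed = [1.0, 0.0, 1.0, 1.0]
predicted = [0.5, 0.0, 1.0, 0.0]

mae_value = mae(observed, predicted)
rmse_value = rmse(observed, predicted)
print(mae_value)   # 0.375
print(rmse_value)  # about 0.559
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data and penalizes a few large errors more heavily, which is consistent with every RMSE entry in Table 7 exceeding the corresponding MAE entry in Table 8.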

Fig. 8. Diagram of difference between modeling output with the real data using RF algorithm: a) Spring, b) Summer, c) Autumn, and d) Winter.

the photochemical reaction between the precursors and particle formation (Wang & Ogawa, 2015). In autumn and winter, high concentrations of PM2.5 occurred at low temperatures. Because convection is reduced at low temperatures, contaminants are trapped beneath the inversion layer (Li et al., 2019b). Also, when the air temperature is low, the emission rate of PM2.5 increases owing to the increased use of heating devices (Yang et al., 2017). For the wind speed criterion, the PM2.5 concentration also increased in spring when the wind speed reached more than 3 m/s. Although increasing wind speed generally reduces the concentration of pollutants, its effect on PM2.5 was different. Increasing the wind speed causes the suspended particles to rotate and create dust (Aldrin and Haff, 2005). Since the PM2.5 concentration was lower in spring than in other seasons, the wind speed had no significant effect on increasing the PM2.5 concentration. In summer, autumn, and winter, the increase in PM2.5 concentration was associated with low wind speeds. Reduced wind speed causes PM2.5 to accumulate (Lou et al., 2017). The presence of a calm atmosphere without disturbance and the formation of a temperature inversion layer prevent the dispersion of pollutants (Ye et al., 2018). More than 50% of Tehran's industries are located in the south and southwest of the city. The prevailing westerly and southwesterly winds direct most of the waste from these factories into the city and increase PM2.5. The rainfall results showed a complex relationship with PM2.5 in the study area. For a more detailed analysis of the effect of rainfall on PM2.5 concentration, we divided the rainfall into two classes, 0–20 mm (low) and >20 mm (medium and high), according to the climate of the study area. In spring and summer, the highest concentration of PM2.5 was observed when the amount of rainfall was between 16.49–18.57 mm and 5.34–6.62 mm, respectively. At low rainfall, the amount of particles colliding with raindrops is reduced (Zhang et al., 2020). At low rainfall, the humidity of the atmosphere also decreases, which increases the concentration of PM2.5 (Wang & Ogawa, 2015). Besides, in summer, the presence of sand dust aggravates the situation (Huang et al., 2021). In autumn and winter, heavy rainfall events (above 38 mm) increased PM2.5, because after rain the humidity in the air increases and the wind speed is low, which, together with the stability of the atmosphere, increases the PM2.5 concentration (Huang et al., 2021). For the humidity factor, in spring and summer, when the humidity was between 23.8–26.09% and 23.62–24.6%, the PM2.5 concentration increased. In autumn and winter, there was an increase in PM2.5 when the humidity was between 49.65–51.55% and 47.91–49.33%, respectively. According to a study by Lou et al. (2017), humidity has an inverted "U"-shaped relationship with PM2.5: when the humidity is between 0 and 70%, the concentration of PM2.5 rises, and once the humidity reaches its highest levels (more than 70%), PM2.5 begins to decline (Lou et al., 2017). The humidity findings were consistent with those of Huang et al. (2021). NDVI results for four seasons showed a positive relationship between a decrease in NDVI and an increase in PM2.5 concentration (Wu et al., 2017). Green spaces usually reduce PM2.5 concentrations by reducing the effects of thermal inversion and thermal islands (through evapotranspiration) (Hart et al., 2020). Therefore, NDVI is an influential factor in reducing PM2.5 in the metropolitan area of Tehran. Increasing population density increased PM2.5 in spring, summer, and winter, but in autumn, high concentrations of PM2.5 occurred at low population density. The use of heating and cooling equipment, waste generation, vehicle use, and dust production all occur in areas of high population density (Wang et al., 2019). According to a study conducted by Ghaedrahmati and Alian (2019) in Tehran, the city center has a greater concentration of pollutants than other areas owing to its high population density. Findings on street density showed that PM2.5 increased at a street density of more than 3.82 km/km2. An increase in street density leads to increased traffic and further production of PM2.5 (Miri et al., 2019). Reducing the distance to industrial centers (between 0 and 200 m) raised PM2.5 concentrations in all seasons.

Fig. 9. PM2.5 risk maps using AdaBoost and SGD algorithms: a) Spring, b) Summer, c) Autumn, and d) Winter.

Fig. 10. PM2.5 risk map using RF algorithm: a) Spring, b) Summer, c) Autumn, and d) Winter.

4.2. Comparing the importance of factors in different seasons

According to the RF algorithm, NDVI was the most important factor in spring. In spring, owing to the growth and greening of trees and plants, we see the highest amount of NDVI. NDVI has a high dry deposition rate in the adsorption of airborne particles and their removal (Zhang et al., 2019). The northern and northeastern parts of the study area have seen better air quality owing to more green spaces. In summer, the temperature was more important than the other variables associated with PM2.5. The production of secondary particles from precursor gases (NO2, SO2, and VOCs) is accelerated at high temperatures, resulting in a rise in PM2.5 (Zhang et al., 2019). In autumn, the most important factor affecting the concentration of PM2.5 was NDVI. The leaf area of plants changes rapidly in autumn and spring (Chen et al., 2016). Plants and pollutant sediment must have two critical characteristics to trap atmospheric pollutant particles: porosity and density (Janhäll, 2015). In autumn, the loss of leaves and leaf surface reduces the density of urban green spaces and causes an increase in PM2.5 concentration. In winter, the distance to industrial centers had the most significant effect on the concentration of PM2.5. In this season, lower temperatures lead to increased use of fossil fuels for domestic and industrial heating. Hence,


Fig. 11. Assessment of model performance using the ROC curve.
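
The validation behind the ROC curves can be illustrated with a short standard-library sketch. It computes the area under the ROC curve as the fraction of (occurrence, non-occurrence) pairs that the model ranks correctly, and, separately, the ratio (TP + TN) / (P + N) in the form written in Eq. (10), evaluated at a single threshold. The function names, the 0.5 threshold, and the six sample points are assumptions made for illustration; they are not the study's data or code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one occurrence and one non-occurrence point")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def correct_ratio(labels, scores, threshold=0.5):
    """(TP + TN) / (P + N) at one fixed threshold (an assumed cut-off)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return (tp + tn) / len(labels)


# Hypothetical validation points: 1 = occurrence, 0 = non-occurrence.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]

auc_value = roc_auc(labels, scores)
ratio_value = correct_ratio(labels, scores)
print(round(auc_value, 3))    # 0.889
print(round(ratio_value, 3))  # 0.667
```

The two numbers differ because the pairwise AUC summarizes ranking quality over all thresholds, whereas the ratio depends on the single cut-off chosen.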

at close distances to industries, fog and stagnation of air increase the concentration of PM2.5, and increasing distance from them reduces PM2.5.

4.3. Comparison between algorithms

The ROC, RMSE, and MAE indicators showed that the RF algorithm was more accurate than the AdaBoost and SGD algorithms. According to the results, modeling accuracy was highest in autumn, followed by winter, summer, and spring. In the cold seasons of the year (autumn and winter), the concentration of PM2.5 increases due to reduced vegetation, low sunlight, temperature inversion, and increased use of fossil fuels (Arhami et al., 2018; Razavi-Termeh et al., 2021c). The RF algorithm has several advantages over the other two algorithms, including determining the most influential variables, high processing speed, ease of use, and avoidance of overfitting (Razavi-Termeh et al., 2019). After the RF algorithm, the AdaBoost algorithm performed best. Although the AdaBoost algorithm is a strong approximator with fast performance and good generalizability, it is sensitive to noise and tends to overfit noisy data (Nazarenko et al., 2019). Another problem with the AdaBoost algorithm is that it requires a lot of training data. If the hypothesis is weak or there is not enough data, AdaBoost will not provide acceptable results (Kumar et al., 2014). The SGD algorithm was less accurate in modeling PM2.5 risk than the other two algorithms. The need for hyperparameters and overgeneralization are two problems with the SGD algorithm.

5. Conclusion

This study utilized three machine learning algorithms to model the spatio-temporal risk of PM2.5 in the Tehran metropolis. Based on the validation indicators, the RF algorithm had higher accuracy than the other two algorithms in spatio-temporal modeling of PM2.5. The results of the ROC curve showed that the highest risk of PM2.5 was in autumn. The autumn and winter seasons are more prone to PM2.5 risk owing to temperature inversion conditions and increased use of fossil fuels. Central, southern, southeastern, and southwestern regions are continuously exposed to high concentrations of PM2.5 owing to high population density and traffic in the center and the establishment of industries in the south and southwest of Tehran.

One of the study's drawbacks is the high cost of accessing all synoptic station meteorological data. The unavailability of daily traffic information was another limitation of this study. Obtaining this information and using it in risk mapping of air pollutants with deep learning models is a direction for future work. The results of this study can be used as a practical approach to identify high-risk and low-risk areas. The findings can be used by planners and city managers to identify high-risk areas and develop strategies for controlling and preventing airborne particles.


Credit author statement

Conceptualization, Seyedeh Zeinab Shogrkhodaei and Seyed Vahid Razavi-Termeh; Data curation, Seyedeh Zeinab Shogrkhodaei; Formal analysis, Seyed Vahid Razavi-Termeh and Seyedeh Zeinab Shogrkhodaei; Funding acquisition, Amanollah Fathnia; Investigation, Seyed Vahid Razavi-Termeh; Methodology, Seyedeh Zeinab Shogrkhodaei; Project administration, Seyed Vahid Razavi-Termeh and Amanollah Fathnia; Resources, Seyedeh Zeinab Shogrkhodaei and Seyed Vahid Razavi-Termeh; Software, Seyed Vahid Razavi-Termeh; Supervision, Amanollah Fathnia; Validation, Seyed Vahid Razavi-Termeh and Seyedeh Zeinab Shogrkhodaei; Visualization, Seyed Vahid Razavi-Termeh and Seyedeh Zeinab Shogrkhodaei; Writing - original draft, Seyedeh Zeinab Shogrkhodaei; Writing - review & editing, Seyed Vahid Razavi-Termeh and Amanollah Fathnia.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Aldrin, M., Haff, I.H., 2005. Generalised additive modelling of air pollution, traffic volume and meteorology. Atmos. Environ. 39, 2145–2155.
Alvarez-Mendoza, C.I., Teodoro, A.C., Torres, N., Vivanco, V., 2019. Assessment of remote sensing data to model PM10 Estimation in cities with a low number of air quality stations: a case of Study in Quito, Ecuador. Environments 6, 85.
Amini, H., Nhung, N.T.T., Schindler, C., Yunesian, M., Hosseini, V., Shamsipour, M., et al., 2019. Short-term associations between daily mortality and ambient particulate matter, nitrogen dioxide, and the air quality index in a Middle Eastern megacity. Environ. Pollut. 254, 113121.
Anderson, J.O., Thundiyil, J.G., Stolbach, A., 2012. Clearing the air: a review of the effects of particulate matter air pollution on human health. J. Med. Toxicol. 8, 166–175.
Arhami, M., Shahne, M.Z., Hosseini, V., Haghighat, N.R., Lai, A.M., Schauer, J.J., 2018. Seasonal trends in the composition and sources of PM2.5 and carbonaceous aerosol in Tehran, Iran. Environ. Pollut. 239, 69–81.
Bai, L., Wang, J., Ma, X., Lu, H., 2018. Air pollution forecasts: an overview. Int. J. Environ. Res. Publ. Health 15, 780.
Bottou, L., 2012. Stochastic Gradient Descent Tricks. Neural Networks: Tricks of the Trade. Springer, pp. 421–436.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Brokamp, C., Jandarov, R., Hossain, M., Ryan, P., 2018. Predicting daily urban fine particulate matter concentrations using a random forest model. Environ. Sci. Technol. 52, 4173–4179.
Cabaneros, S.M., Calautit, J.K., Hughes, B., 2020. Spatial estimation of outdoor NO2 levels in Central London using deep neural networks and a wavelet decomposition technique. Ecol. Model. 424, 109017.
Cesaroni, G., Badaloni, C., Gariazzo, C., Stafoggia, M., Sozzi, R., Davoli, M., et al., 2013. Long-term exposure to urban air pollution and mortality in a cohort of more than a million adults in Rome. Environ. Health Perspect. 121, 324–331.
Chen, T., He, J., Lu, X., She, J., Guan, Z., 2016. Spatial and temporal variations of PM2.5 and its relation to meteorological factors in the urban area of Nanjing, China. Int. J. Environ. Res. Publ. Health 13, 921.
Chen, Z.-Y., Zhang, T.-H., Zhang, R., Zhu, Z.-M., Yang, J., Chen, P.-Y., et al., 2019. Extreme gradient boosting model to estimate PM2.5 concentrations with missing-filled satellite data in China. Atmos. Environ. 202, 180–189.
Chen, W., Li, Y., Tsangaratos, P., Shahabi, H., Ilia, I., Xue, W., et al., 2020. Groundwater spring potential mapping using artificial intelligence approach based on kernel logistic regression, random forest, and alternating decision tree models. Appl. Sci. 10, 425.
Cohen, K., Nedić, A., Srikant, R., 2017. On projected stochastic gradient descent algorithm with weighted averaging for least squares regression. IEEE Trans. Automat. Contr. 62, 5974–5981.
Dadvand, P., Rivas, I., Basagaña, X., Alvarez-Pedrerol, M., Su, J., Pascual, M.D.C., et al., 2015. The association between greenness and traffic-related air pollution at schools. Sci. Total Environ. 523, 59–63.
Fan, J., Li, Q., Hou, J., Feng, X., Karimian, H., Lin, S., 2017. A spatiotemporal prediction framework for air pollution based on deep RNN. ISPRS Ann. Photogram. Rem. Sens. Spatial Inf. Sci. 4, 15.
Farhangi, F., Sadeghi-Niaraki, A., Nahvi, A., Razavi-Termeh, S.V., 2020. Spatial modeling of accidents risk caused by driver drowsiness with data mining algorithms. Geocarto Int. 1–15.
Feng, X., Li, Q., Zhu, Y., Hou, J., Jin, L., Wang, J., 2015. Artificial neural networks forecasting of PM2.5 pollution using air mass trajectory based geographic model and wavelet transformation. Atmos. Environ. 107, 118–128.
Freund, Y., Schapire, R., Abe, N., 1999. A short introduction to boosting. J. Jpn. Soc. Artif. Intell. 14, 1612.
Ganesh, S.S., Arulmozhivarman, P., Tatavarti, V.R., 2018. Prediction of PM 2.5 using an ensemble of artificial neural networks and regression models. J. Ambient Intell. Humanized Comput. 1–11.
Ghaedrahmati, S., Alian, M., 2019. Health risk assessment of relationship between air pollutants' density and population density in Tehran, Iran. Hum. Ecol. Risk Assess. 25, 1853–1869.
Ghaemi, Z., Alimohammadi, A., Farnaghi, M., 2018. LaSVM-based big data learning system for dynamic prediction of air pollution in Tehran. Environ. Monit. Assess. 190, 1–17.
Han, S., Sun, B., 2019. Impact of population density on PM2.5 concentrations: a case study in Shanghai, China. Sustainability 11, 1968.
Hart, R., Liang, L., Dong, P., 2020. Monitoring, mapping, and modeling spatial–temporal patterns of PM2.5 for improved understanding of air pollution dynamics using portable sensing technologies. Int. J. Environ. Res. Publ. Health 17, 4914.
He, J., Yu, Y., Liu, N., Zhao, S., 2013. Numerical model-based relationship between meteorological conditions and air quality and its implication for urban air quality management. Int. J. Environ. Pollut. 53, 265–286.
Hengl, T., Heuvelink, G.B., Stein, A., 2004. A generic framework for spatial prediction of soil variables based on regression-kriging. Geoderma 120 (1–2), 75–93.
Huang, C., Liu, K., Zhou, L., 2021. Spatio-temporal trends and influencing factors of PM 2.5 concentrations in urban agglomerations in China between 2000 and 2016. Environ. Sci. Pollut. Res. 28 (9), 10988–1000.
Janhäll, S., 2015. Review on urban vegetation and particle air pollution–Deposition and dispersion. Atmos. Environ. 105, 130–137.
Kalantar, B., Al-Najjar, H.A., Pradhan, B., Saeidi, V., Halin, A.A., Ueda, N., et al., 2019. Optimized conditioning factors using machine learning techniques for groundwater potential mapping. Water 11, 1909.
Khosravi, K., Shahabi, H., Pham, B.T., Adamowski, J., Shirzadi, A., Pradhan, B., et al., 2019. A comparative assessment of flood susceptibility modeling using multi-criteria decision-making analysis and machine learning methods. J. Hydrol. 573, 311–323.
Kumar, R., Joseph, A.E., 2006. Air pollution concentrations of PM 2.5, PM 10 and NO 2 at ambient and kerbsite and their correlation in Metro City–Mumbai. Environ. Monit. Assess. 119, 191–199.
Kumar, N.S., Rao, K.N., Govardhan, A., Reddy, K.S., Mahmood, A.M., 2014. Undersampled K-means approach for handling imbalanced distributed data. Progr. Artif. Intell. 3, 29–38.
Lamichhane, D.K., Kim, H.-C., Choi, C.-M., Shin, M.-H., Shim, Y.M., Leem, J.-H., et al., 2017. Lung cancer risk and residential exposure to air pollution: a Korean population-based case-control study. Yonsei Med. J. 58, 1111.
Li, X., Luo, A., Li, J., Li, Y., 2019b. Air pollutant concentration forecast based on support vector regression and quantum-behaved particle swarm optimization. Environ. Model. Assess. 24 (2), 205–222.
Li, X., Yan, D., Wang, K., Weng, B., Qin, T., Liu, S., 2019a. Flood risk assessment of global watersheds based on multiple machine learning models. Water 11, 1654.
Li, Z., Yim, S.H.-L., Ho, K.-F., 2020. High temporal resolution prediction of street-level PM2.5 and NOx concentrations using machine learning approach. J. Clean. Prod. 268, 121975.
Liu, H., Jin, K., Duan, Z., 2019. Air PM2.5 concentration multi-step forecasting using a new hybrid modeling method: comparing cases for four cities in China. Atmos. Pollut. Res. 10, 1588–1600.
Lou, C., Liu, H., Li, Y., Peng, Y., Wang, J., Dai, L., 2017. Relationships of relative humidity with PM 2.5 and PM 10 in the Yangtze river Delta, China. Environ. Monit. Assess. 189, 1–16.
Ma, J., Cheng, J.C., Lin, C., Tan, Y., Zhang, J., 2019. Improving air quality prediction accuracy at larger temporal resolutions using deep learning and transfer learning techniques. Atmos. Environ. 214, 116885.
Masmoudi, S., Elghazel, H., Taieb, D., Yazar, O., Kallel, A., 2020. A machine-learning framework for predicting multiple air pollutants' concentrations via multi-target regression and feature selection. Sci. Total Environ. 715, 136991.
Mehdipour, V., Stevenson, D.S., Memarianfard, M., Sihag, P., 2018. Comparing different methods for statistical modeling of particulate matter in Tehran, Iran. Air Qual. Atmos. Health 11, 1155–1165.
Miri, M., Ghassoun, Y., Dovlatabadi, A., Ebrahimnejad, A., Löwner, M.-O., 2019. Estimate annual and seasonal PM1, PM2.5 and PM10 concentrations using land use regression model. Ecotoxicol. Environ. Saf. 174, 137–145.
Mozumder, C., Reddy, K.V., Pratap, D., 2013. Air pollution modeling from remotely sensed data using regression techniques. J. Indian Soc. Remote Sens. 41, 269–277.
Naghibi, S.A., Pourghasemi, H.R., Pourtaghi, Z.S., Rezaei, A., 2015. Groundwater qanat potential mapping using frequency ratio and Shannon's entropy models in the Moghan watershed, Iran. Earth Sci. Inf. 8, 171–186.
Naghibi, S.A., Moghaddam, D.D., Kalantar, B., Pradhan, B., Kisi, O., 2017. A comparative assessment of GIS-based data mining models and a novel ensemble model in groundwater well potential mapping. J. Hydrol. 548, 471–483.
Nazarenko, E., Varkentin, V., Polyakova, T., 2019. Features of application of machine learning methods for classification of network traffic (features, advantages, disadvantages). In: 2019 International Multi-Conference on Industrial Engineering and Modern Technologies (FarEastCon). IEEE, pp. 1–5.
Organization WH, 2016. Ambient Air Pollution: A Global Assessment of Exposure and Burden of Disease.
Rahmati, O., Panahi, M., Kalantari, Z., Soltani, E., Falah, F., Dayal, K.S., et al., 2020. Capability and robustness of novel hybridized models used for drought hazard modeling in southeast Queensland, Australia. Sci. Total Environ. 718, 134656.
Ranjgar, B., Razavi-Termeh, S.V., Foroughnia, F., Sadeghi-Niaraki, A., Perissin, D., 2021. Land subsidence susceptibility mapping using persistent scatterer SAR interferometry technique and optimized hybrid machine learning algorithms. Rem. Sens. 13, 1326.


Razavi-Termeh, S.V., Sadeghi-Niaraki, A., Choi, S.-M., 2019. Groundwater potential mapping using an integrated ensemble of three bivariate statistical models with random forest and logistic model tree models. Water 11, 1596.
Razavi-Termeh, S.V., Sadeghi-Niaraki, A., Choi, S.-M., 2020a. Ubiquitous GIS-based forest fire susceptibility mapping using artificial intelligence methods. Rem. Sens. 12, 1689.
Razavi-Termeh, S.V., Khosravi, K., Sadeghi-Niaraki, A., Choi, S.-M., Singh, V.P., 2020b. Improving groundwater potential mapping using metaheuristic approaches. Hydrol. Sci. J. 65, 2729–2749.
Razavi-Termeh, S.V., Sadeghi-Niaraki, A., Choi, S.-M., 2021a. Asthma-prone areas modeling using a machine learning model. Sci. Rep. 11, 1–16.
Razavi-Termeh, S.V., Shirani, K., Pasandi, M., 2021b. Mapping of landslide susceptibility using the combination of neuro-fuzzy inference system (ANFIS), ant colony (ANFIS-ACOR), and differential evolution (ANFIS-DE) models. Bull. Eng. Geol. Environ. 80, 2045–2067.
Razavi-Termeh, S.V., Sadeghi-Niaraki, A., Choi, S.-M., 2021. Effects of air pollution in Spatio-temporal modeling of asthma-prone areas using a machine learning model. Environ. Res. 200, 111344.
Rodriguez-Galiano, V., Mendes, M.P., Garcia-Soldado, M.J., Chica-Olmo, M., Ribeiro, L., 2014. Predictive modeling of groundwater nitrate pollution using Random Forest and multisource variables related to intrinsic and specific vulnerability: a case study in an agricultural setting (Southern Spain). Sci. Total Environ. 476, 189–206.
Rybarczyk, Y., Zalakeviciute, R., 2018. Machine learning approaches for outdoor air quality modelling: a systematic review. Appl. Sci. 8, 2570.
Safarianzengir, V., Sobhani, B., Yazdani, M.H., Kianian, M., 2020. Monitoring, analysis and spatial and temporal zoning of air pollution (carbon monoxide) using Sentinel-5 satellite data for health management in Iran, located in the Middle East. Air Quality. Atmos. Health 13, 709–719.
Santana, J.C.C., Miranda, A.C., Yamamura, C.L.K., Filho SCd, Silva, Tambourgi, E.B., Lee Ho, L., et al., 2020. Effects of air pollution on human health and costs: current situation in São Paulo, Brazil. Sustainability 12, 4875.
Shahbazi, H., Karimi, S., Hosseini, V., Yazgi, D., Torbatian, S., 2018. A novel regression imputation framework for Tehran air pollution monitoring network using outputs from WRF and CAMx models. Atmos. Environ. 187, 24–33.
Sohrabinia, M., Khorshiddoust, A.M., 2007. Application of satellite data and GIS in studying air pollutants in Tehran. Habitat Int. 31, 268–275.
Sudhira, H., Ramachandra, T., Jagadish, K., 2003. Urban sprawl pattern recognition and modeling using GIS. Map India 28–31.
Tien Bui, D., Shahabi, H., Omidvar, E., Shirzadi, A., Geertsema, M., Clague, J.J., et al., 2019. Shallow landslide prediction using a novel hybrid functional machine learning algorithm. Rem. Sens. 11, 931.
Wang, J., Ogawa, S., 2015. Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan. Int. J. Environ. Res. Publ. Health 12, 9089–9101.
Wang, Y., Hong, H., Chen, W., Li, S., Panahi, M., Khosravi, K., et al., 2019. Flood susceptibility mapping in Dingnan County (China) using adaptive neuro-fuzzy inference system with biogeography based optimization and imperialistic competitive algorithm. J. Environ. Manag. 247, 712–729.
Wen, C., Liu, S., Yao, X., Peng, L., Li, X., Hu, Y., et al., 2019. A novel spatiotemporal convolutional long short-term neural network for air pollution prediction. Sci. Total Environ. 654, 1091–1099.
Wu, C.-D., Chen, Y.-C., Pan, W.-C., Zeng, Y.-T., Chen, M.-J., Guo, Y.L., et al., 2017. Land-use regression with long-term satellite-based greenness index and culture-specific sources to model PM2.5 spatial-temporal variability. Environ. Pollut. 224, 148–157.
Xiao, Q., Chang, H.H., Geng, G., Liu, Y., 2018. An ensemble machine-learning model to predict historical PM2.5 concentrations in China from satellite data. Environ. Sci. Technol. 52, 13260–13269.
Xu, Y., Ho, H.C., Wong, M.S., Deng, C., Shi, Y., Chan, T.-C., et al., 2018. Evaluation of machine learning techniques with multiple remote sensing datasets in estimating monthly concentrations of ground-level PM2.5. Environ. Pollut. 242, 1417–1426.
Yang, Q., Yuan, Q., Li, T., Shen, H., Zhang, L., 2017. The relationships between PM2.5 and meteorological factors in China: seasonal and regional variations. Int. J. Environ. Res. Publ. Health 14, 1510.
Ye, W.-F., Ma, Z.-Y., Ha, X.-Z., 2018. Spatial-temporal patterns of PM2.5 concentrations for 338 Chinese cities. Sci. Total Environ. 631, 524–533.
Yousefian, F., Mahvi, A.H., Yunesian, M., Hassanvand, M.S., Kashani, H., Amini, H., 2018. Long-term exposure to ambient air pollution and autism spectrum disorder in children: a case-control study in Tehran, Iran. Sci. Total Environ. 643, 1216–1222.
Yu, W., Guo, Y., Shi, L., Li, S., 2020. The association between long-term exposure to low-level PM2.5 and mortality in the state of Queensland, Australia: a modelling study with the difference-in-differences approach. PLoS Med. 17, e1003141.
Zeinalnezhad, M., Chofreh, A.G., Goni, F.A., Klemeš, J.J., 2020. Air pollution prediction using semi-experimental regression model and Adaptive Neuro-Fuzzy Inference System. J. Clean. Prod. 261, 121218.
Zhai, B., Chen, J., 2018. Development of a stacked ensemble model for forecasting and analyzing daily average PM2.5 concentrations in Beijing, China. Sci. Total Environ. 635, 644–658.
Zhang, H., Wang, Y., Hu, J., Ying, Q., Hu, X.-M., 2015. Relationships between meteorological parameters and criteria air pollutants in three megacities in China. Environ. Res. 140, 242–254.
Zhang, D., Huang, Q., He, C., Wu, J., 2017. Impacts of urban expansion on ecosystem services in the Beijing-Tianjin-Hebei urban agglomeration, China: a scenario analysis based on the Shared Socioeconomic Pathways. Resour. Conserv. Recycl. 125, 115–130.
Zhang, K., Thé, J., Xie, G., Yu, H., 2020. Multi-step ahead forecasting of regional air quality using spatial-temporal deep neural networks: a case study of Huaihai Economic Zone. J. Clean. Prod. 277, 123231.
Zhou, Y., Chang, F.-J., Chang, L.-C., Kao, I.-F., Wang, Y.-S., Kang, C.-C., 2019. Multi-output support vector machine for regional multi-step-ahead PM2.5 forecasting. Sci. Total Environ. 651, 230–240.
