Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

pubs.acs.

org/est Perspective

Data-Driven Machine Learning in Environmental Pollution: Gains


and Problems
Xian Liu, Dawei Lu, Aiqian Zhang,* Qian Liu,* and Guibin Jiang
Cite This: Environ. Sci. Technol. 2022, 56, 2124−2133 Read Online

ACCESS Metrics & More Article Recommendations *


sı Supporting Information
See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles.

ABSTRACT: The complexity and dynamics of the environment make it


Downloaded via SOUTH CHINA UNIV OF TECHNOLOGY on April 4, 2024 at 06:27:46 (UTC).

extremely difficult to directly predict and trace the temporal and spatial changes
in pollution. In the past decade, the unprecedented accumulation of data, the
development of high-performance computing power, and the rise of diverse
machine learning (ML) methods provide new opportunities for environmental
pollution research. The ML methodology has been used in satellite data
processing to obtain ground-level concentrations of atmospheric pollutants,
pollution source apportionment, and spatial distribution modeling of water
pollutants. However, unlike the active practices of ML in chemical toxicity
prediction, advanced algorithms such as deep neural networks in environmental
process studies of pollutants are still deficient. In addition, over 40% of the
environmental applications of ML go to air pollution, and its application range and acceptance in other aspects of environmental
science remain to be increased. The use of ML methods to revolutionize environmental science and its problem-solving scenarios
has its own challenges. Several issues should be taken into consideration, such as the tradeoff between model performance and
interpretability, prerequisites of the machine learning model, model selection, and data sharing.
KEYWORDS: Machine learning, Environmental processes, Pollution, Big data, Artificial intelligence

■ INTRODUCTION
Learning is a critical intelligent behavior of human beings, and
concepts of different machine learning algorithms. Herein we
focus more on environmental processes of pollutants and
the current development of artificial intelligence (AI) has include only a few toxicity-related practices. Through a
already enabled computers to have a similar ability. Machine literature survey, current statistics of the publications are
learning (ML), a classical form of AI that teaches a program to
recognize data patterns, can provide approaches for decipher- determined on the application of ML in environmental
ing the complexity of data, realizing predictability of the trends, pollution research. At the same time, we must be aware that
and discovering new knowledge hidden behind the big data. the ML methodology is still developing with many problems
With the development of ML technology, time and cost spent remaining. The selection of suitable ML models for specific
in experiments could be saved.1,2 In the past decades, ML has
penetrated into various fields of technologies and human life, issues is still a dilemma. For example, complicated models
especially in protein structure3 and toxicity prediction,2 cancer generally have better performance but are inherently obscure
detection,4 synthesis of complex organic substances,5 etc. The for an understanding and explanation, which largely hinders
development of mature and easy-to-use programming
the interpretation of environmental problems. The prereq-
languages and efficient algorithms make ML easy to get
started with. Currently, ML methods have become useful tools uisites of the machine learning method should also be
to offering predictive insights into the environmental behavior addressed. Finally, we discuss the current challenges and
of pollutants. How environmental researchers can constantly opportunities for the interdisciplinary fields of environmental
adapt and innovate ML use to ensure up-to-date problem-
science and ML.
solving capacities is becoming a hot issue.6−8 This offers the
promise of assisting decision making in a collection of contexts,
including forecasting pollution, 9−11 evaluating disease Received: September 14, 2021
risks,12,13 predicting aqueous14 and soil adsorption of organic Published: January 27, 2022
compounds,15,16 intelligent control of wastewater,17 etc.
To provide an up-to-date overview of ML/AI technologies
and fostering their application in environmental pollution
research, this Perspective summarizes and categorizes the

© 2022 American Chemical Society https://doi.org/10.1021/acs.est.1c06157


2124 Environ. Sci. Technol. 2022, 56, 2124−2133
Environmental Science & Technology pubs.acs.org/est Perspective

Figure 1. General schema of machine learning category. Linear and nonlinear methods are labeled for the basic forms of the model.

■ ENVIRONMENTAL POLLUTION MEETS MACHINE


LEARNING TOOLS
the source data set into subsets on the basis of an attribute
value test. Either classification or regression can be
Machine Learning: An Overview. Machine learning accomplished by judging the subset that the features of the
algorithms are types of analytics to obtain laws and patterns unknown sample belong to.
from data, and they can also be applied to the unknown for Unsupervised learning, in comparison with supervised
prediction purposes. As a data mining methodology, ML tools learning, operates on data without artificial labels, and
have been improved and are continuously being refined. With therefore it is not tied to specific prediction tasks and focuses
the emerging new detection and characterization technologies, on the discovery of patterns in the data. Commonly used
the amount of data in various fields has exploded and favored unsupervised learning methods include dimension reduction,
the widespread application of ML. ML becomes a powerful clustering, and generative adversarial networks (GAN).
data mining tool in knowledge discovery, especially from Clustering divides data into groups such that the data points
messy, patternless, and high-dimensional data. are more similar to each other in a group than to those in other
In general, machine learning models fall into three main groups. Clustering methods can identify the data association
categories: supervised, unsupervised, and reinforcement pattern: e.g., clustering of a region into several subregions with
learning (Figure 1). The supervised learning algorithm learns similar pollution peculiarities according to the characteristics of
a function from a given training data set, and when new data air pollution.19 The most popular methods for clustering
comes, it can predict the result on the basis of the learned include K-means, hierarchical clustering, affinity propagation,
function. The training set requirements for supervised learning etc. Principal component analysis (PCA) is a widely used
include features (input) and labels (output), which are also technique for dimension reduction. PCA mathematically
known as independent variables/predictors and dependent
reduces a high-dimension matrix to a lower-dimension matrix
variables in statistics, respectively. The labels can be either (i)
by using simple matrix operations from linear algebra to
discrete (classification problem) such as determining whether a
pollutant has a certain type of toxic effect18 or (ii) continuous calculate the projection of the original data into fewer
for quantitative prediction (regression problem). The simplest dimensions. The reinforcement learning method is a dynamic
method to solve a regression problem is linear/polynomial programming that trains algorithms to take action using a
regression that attempts to model the relationship between the system of reward and punishment, which is especially designed
input and output by fitting a linear equation to the data. Some for solving optimal control problems such as robotics for
machine learning methods can be used for both classification automation.20 With regard to environmental pollution, there
and regression tasks. For example, a decision tree is a treelike have been just a few preprints and conference articles that used
flowchart type diagram that shows the possible consequences reinforcement learning for air pollution prediction21 and
from a series of decisions. It learns the hidden rules by splitting monitoring.22
2125 https://doi.org/10.1021/acs.est.1c06157
Environ. Sci. Technol. 2022, 56, 2124−2133
Environmental Science & Technology pubs.acs.org/est Perspective

Figure 2. (a) Peer-reviewed research articles of ML applications in different environmental media from 2011 to 2020. (b) Relationship between the
number of articles from different countries and that in specific environmental media. The data were obtained from the Web of Science by searching
the topic with individual media (air/atmospheric pollution, water pollution, soil pollution, and solid waste pollution) combined with “machine
learning” or a specific machine learning method (e.g., decision tree, random forest, logistic regression, artificial neural network, and deep learning).

Deep learning (DL) or deep neural network (DNN), a Articles published in China and the US accounted for 50.8% of
subfield of machine learning, is one of the latest trends in ML the total number of articles published in the world. From all
and AI community. DL is based on an artificial neural network countries except Iran, air-pollution-related articles account for
(ANN) that involves multiple layers of neurons, imitating the the largest proportion, while Iran has the largest proportion of
process of learning in the human brain and capturing the deep water-pollution-related articles. Furthermore, Chinese re-
connections between input data. A DNN consists of stacks of searchers appeared to have more interest in comparison to
nonlinear neural network layers to produce a nonlinear those from other countries in the integration of machine
decision boundary and works better in an understanding of learning and solid waste pollution, with the largest number
complex systems. DL technology has achieved considerable (45) of solid-waste-related articles being published. The
success in computer vision and speech/image recognition. For statistics may reflect global developing trends in environmental
example, the development of convolutional neural networks research. Developed countries such as the US and the UK have
(CNNs) is regarded as a breakthrough in image processing. comparatively balanced ML practices in air, water, and soil.


Since the chemical structure of a pollutant can also be shown
in a 2D or 3D image, CNNs have also been successfully SELELCTED APPLICATONS OF MACHINE
applied to reactivity and toxicity prediction for environmental LEARNING IN ENVIRONMENTAL POLLUTION
chemicals. One typical work is to identify the potential RESEARCH
persistent organic pollutants (POPs) and persistent, bio-
accumulative, and toxic (PBT) chemicals by a CNN model Assessment of Air Pollution. Figure 2 shows that over
using two-dimensional molecular descriptor matrixes as inputs, 40% of ML applications to environmental pollution have gone
just like a 2D image.1 A recurrent neural network (RNN) is to air pollution. In fact, the air quality has been monitored
another class of artificial neural architecture with which nationwide in the US by the Environmental Protection Agency
connections between nodes constitute a directed graph along (EPA) since 197224 and the health risks of particulate matter
a temporal sequence. Such architecture characteristics allow (PM) have attracted great attention.25 Due to the sparse
RNN to remember each piece of information through time, distribution of PM2.5 monitoring sites, the monitoring data
making them good at handling dynamic sequence data.23 may not be able to effectively reflect the influence of
State of Machine Learning Applications in Environ- topography, local weather, and the spatial distribution of
mental Pollution Research. The statistics of publications emission sources on the PM2.5 concentration. Different
about ML applications in environmental pollution research in statistical methods or ML models have been adopted in
the past decade are shown in Figure 2a. The search criteria are atmospheric pollution studies and trend prediction since the
described in Text S1 in the Supporting Information. The early 2000s along with the improved data availability.
publication number dramatically increased from 821 in 2011 to In the past 10 years, land use regression (LUR) models in
3140 in 2020, growing at a nearly exponential rate. In terms of the form of multiple regression equations have been widely
environmental media, the applications in air pollution used to estimate the long-term concentrations of NO2, NOx,
accounted for the largest proportion, followed by the and PM2.5 on a fine spatial scale.26 LUR often relies on
applications in water and soil pollution; the number of studies geographic information systems (GIS) to collect site-specific
on solid waste was relatively small. From the perspective of parameters. Traffic conditions, land use, population density,
national distribution, researchers from the US contributed the and climate are significant parameters used in the LUR models.
most articles in the interdisciplinary field of ML and The improved models included but were not limited to
environmental pollution (4126 articles), followed by those generalized additive models (GAM), mixed effects models,10,27
from China (3822 articles) and other countries (Figure 2b). and geographically weighted regression (GWR) models.28,29
2126 https://doi.org/10.1021/acs.est.1c06157
Environ. Sci. Technol. 2022, 56, 2124−2133
Environmental Science & Technology pubs.acs.org/est Perspective

However, the above models still utilized linear algorithms. A the AQI of designated locations in an Indian metropolitan
linear algorithm follows a form that all terms in the model are city.37 The summary of machine learning studies in the
constant parameters multiplied by the input features; thus, it assessment of air pollution is presented in Table S1 in the
works efficiently when the relationship between features and Supporting Information.
labels is known to be linear. However, air pollution is affected Pollution Source Apportionment. Source apportion-
by complex environmental processes, and its source is ment is a typical representative of the application of
essentially a nonstationary one. Especially, significant devia- mathematical calculation methods in environmental research.
tions have been observed in linear model practices in air Sample data are usually abundant, are high-dimensional, and
pollution, and the goodness of fit (R2) for many models was contain a series of pollutants that usually need to be reduced
around 0.6.10,27,29 Clearly, linear models cannot handle and classified according to the similarity of chemical
dynamic, heterogeneous, and spatially expansive data since composition. The conventional unsupervised learning techni-
the inherent variables are interactive and exhibit nonlinear ques, such as PCA, positive matrix factorization (PMF), and
relationships in complex environmental settings. other clustering-based solutions with correlation analysis, have
Nonlinear models are gradually being considered in the been used to determine the contributions of vehicular exhausts,
evaluation of atmospheric pollution. A notable example comes industrial emission, and coal combustion to air pollution. An
from ensemble strategies widely used in pollution prediction, accurate source analysis also benefits from the application of
which use multiple learning algorithms to improve the classifiers. For example, to achieve a quantitative identification
performance of a single classifier on PM2.5 estimation.30 Such of the source, a classification model was built to differentiate
a modeling scenario not only exhibited an improved predictive the sources of SiO2 nanoparticles through Si and O stable
ability in estimating ground PM2.58,9,19,31 but also provided isotope fingerprint information. The model can give the
satisfactory daily NO2 exposure estimates.32 For example, a probability values of different sources for a sample. The overall
land use random forest (LURF) model combines a land use accuracy rate reached 93.3%, enabling accurate quantitative
model and a random forest. The principle of LURF is to use tracing of SiO2 nanoparticles.38 Additionally, ML is also
the LUR of stepwise regression to obtain the contribution of playing a role in tracing pollution sources of the soil. In
features to the predictive performance of the model and to particular, in order to reveal sources and potentially toxic
further use the feature whose contribution is greater than the elements (PTE) in the soil of a mining city, a hierarchical
preset threshold as the input of the random forest model. In cluster analysis was performed to identify the PTE geochemical
comparison with the LUR model, the average external R2 associations in each group of subsamples determined by k-
values of the LURF model on the daily average PM2.5, NO2, means clustering.39 In 2019, a machine learning framework was
and CO concentration predictions were increased by 0.10− proposed to identify potential sources of soil heavy metal
0.19.33 Among the ensemble models, a random forest is an pollution. In the case study, potentially polluting enterprises
ensemble bagging method that aims to control the model were identified and classified by Naive ̈ Bayes classifiers to
overfitting by randomly selecting the individual decision trees investigate the spatial relationship between heavy metals in the
from the data set and thereby reducing the variances. One of soil and the different industry types.40
the most recent demonstrations of random forest application Estimate of Health Outcome of Pollutant Exposure.
in air pollution was done by British researchers, who elucidated There exists an increasing demand for understanding the
the lockdown impact on the air quality of Wuhan City during potential connections between environmental pollution and
the COVID-19 pandemic in 2020 by removing the health outcomes. This issue is normally answered by
confounding effects of weather conditions.34 epidemiological studies. Here the most commonly adopted
For DNNs, such powerful ML methods also can tackle learning method is generalized linear regression due to its
nonlinear issues, with their applications in air pollution better model interpretability. For example, the question of
beginning in the last 3 years. For example, RNNs could whether long-term exposure to air pollution increases the
capture the nonlinearity of environmental chemical and severity of COVID-19 health outcomes was addressed through
physical parameters well and significantly enhance the an ecological regression analysis, a type of linear regression
predictive accuracy for time series data. The long short-term method.41 It was found that an increase of 1 μg/m3 in
memory (LSTM) model has an RNN architecture constructed historical PM2.5 exposures was significantly associated with a
with LSTM units. In comparison with ordinary RNNs, the 11% increase in county-level COVID-19 mortality rates after
repetitive modules in LSTM have a different structure and accounting for 20 area-level confounders such as age, race,
exhibit better learning capacity for long sequences. Due to such smoking status, population density, etc. The US researchers
a unique trait, LSTM is suitable for processing and predicting also used daily PM2.5 concentration and mortality data of the
important events with very long intervals and delays in the elderly to perform a linear regression analysis and found that
time series. LSTM has been successfully applied to the both short-term and long-term exposure to PM2.5 were
prediction of air quality.35−37 Forest fire data, air emissions, correlated to all-cause mortality even if the PM2.5 exposure
meteorological data, and sea ice coverage were used as input level did not exceed the newly revised US EPA standards.42
features of the LSMT model, while the historical PAH The epidemiologists still benefit from ensemble methods
pollution data in the past 7 years (1996−2021) was learned rather than conventional methods when professional knowl-
to predict the PAH concentration in the near future (2012− edge is added. Ensemble methods in particular tree-based
2014) in the high Arctic. In comparison with the previously models have been suggested as alternatives to regression
constructed atmospheric transport model, the obtained LSTM models for estimates of propensity scores in epidemiology.43
model has significantly improved the phenanthrene (PHE) and Forecasting Wastewater Treatment Plant Perform-
benzo[a]pyrene (BaP) predictions by 62.5% and 91.1%, ance. Machine learning techniques have also been used to
respectively.36 In addition, in comparison with other ML analyze wastewater treatment plant (WWTP) data. Multi-
technologies, LSTM provided more accurate predictions for variate statistical tools, such as PCA and hierarchical k-means
2127 https://doi.org/10.1021/acs.est.1c06157
Environ. Sci. Technol. 2022, 56, 2124−2133
Environmental Science & Technology pubs.acs.org/est Perspective

clustering, have been adopted to investigate the relationship PointNet architecture for the identification of potential
between N2O emissions at a specific time with a variety of environmental estrogens. The difference in isomer activity
influencing factors such as dissolved oxygen level, nitrogen for estrogenic activity was successfully recognized, and the
concentration, temperature, and water flow rate.44 Multivariate model exhibited significantly improved interpretability.51 In
regression models were established to assess the temporal addition, an exhaustive ML method comparison was
performance of full-scale wastewater treatment plants by performed covering classic ML and DNN techniques for
predicting selected water quality features such as the multitask models of the 18 bioassays related to ER activity.52
concentrations of total phosphorus, total nitrogen, and total The consensus model based on the average of the individual
suspended solids (TSS) concentrations.45 However, WWTP model predictions offered the largest area under the receiver
data were collected at various time intervals, and their features operating characteristic curve. Although DL has brought new
were cocorrelated. The fluctuation in time profile of the insights into environmental computation toxicity, the vast
effluent quality was not stable. Related features exhibit majority of the application practices are still subject to the
nonlinear and nonstationary characters instead. Similar to the traditional SAR way of thinking. How to realize effective
air pollution forecasting issues, as the forecast period is characterization of a biological regulation system by mining
extended, the limitation of the traditional ML techniques in a multiomics and in vitro HTS data and extrapolate from
quantitative analysis of the dynamic features in the non- bioassay data in vitro to the body damage in vivo by ML is still
stationarity of long-term WWTP data appears.46 A deep a huge challenge we must face.


learning method such as RNN may play a role in this respect.
The latest study used LSTM to process modeling and CONCLUDING REMARKS
forecasting of N2O emission from WWTPs,47 and the obtained
Machine Learning Is Not a One-Size-Fits-All Solution:
model showed a better performance in comparison to the
Model Selection and “Black Box” Problem. One key issue
previously reported DNN-based model (R2 > 0.94). The
of machine learning applications is the model selection. Each
forecast of municipal solid waste (MSW) generation is also a
model has its own prior statistical assumptions about the data
time series analysis problem with high temporal variation. In a
distribution, and there is no “master algorithm” that works well
recent study, historical household waste data, weather, and on all problems. In other words, even if a ML algorithm
environmental features were used to construct a multisite dramatically outperforms other methods in solving a certain
LSTM neural network to predict the MSW generation rate class of problems, it is very likely that with some other
from households in Denmark.48 It was found that the multisite problems it may perform poorly. It is therefore important to
LSTM model outperforms traditional models such as the select and tailor the model for a specific problem.
autoregressive integrated moving average model by 85% on Prediction accuracy is not the only dimension that matters.
average. The model interpretability is equally important in selecting
Screening Endocrine-Disrupting Chemicals. The ap- models. There is a tradeoff between model accuracy and its
plication of ML to the environmental behavior of pollutants is interpretability. Better model performance is expected when
relatively lagging in comparison to its use in toxicity more complicated ML methods are exploited. However, the
evaluations. Due to the massive amounts of toxicity data more complex the ML algorithm becomes, the more adjustable
generated by high-throughput screening (HTS) bioassays and parameters the model needs. As the goodness of the model fit
next-generation sequencing technologies, ML application to a increases, the interpretability of the model may get worse. In
prediction toxicity model has evolved from the simplest linear fact, the main difficulty in model selection is to balance
regression methods to data-driven DNN modeling49 on the between the model performance and interpretability. As shown
basis of structure−activity relationships (SARs). ML has in Figure 3, it is easy to explain the inherent rules in linear
become a new driving force as an alternative method for regression models, but they cannot handle more complex
toxicity testing in environmental toxicology. Here the relationships.
identification of endocrine-disrupting chemicals is employed
as a typical demonstration for comparison.
In 2015, the US EPA Endocrine Disruptor Screening
Program (EDSP) constructed a battery of 18 bioassays related
to estrogen-receptor (ER)-mediated signaling pathways for
testing estrogen-like chemicals with a well-defined data set.
Numerous efforts have been made to identify potential
environmental estrogens using either individual bioassay data
or unified efficiency data. In a recent study, an artificial neural
network (ANN) was imagined as a virtual biological network
of the adverse outcome pathway (AOP) since ANN is
mechanistically related to biological processes.50 The molec-
ular fingerprint was used as the input of a designed multilevel
neural network to predict the in vivo estrogenic activity. The
important biological events in the AOP were determined
through the activation of the corresponding neurons in the
network. With the image recognition technique PointNet as
inspiration, known as end-to-end learning that directly
consumes unordered pixel point data of an image, a novel
3D molecular surface point cloud was proposed to describe Figure 3. Tradeoff between performance and interpretability for ML
chemical structures and used as the input of a modified models.

2128 https://doi.org/10.1021/acs.est.1c06157
Environ. Sci. Technol. 2022, 56, 2124−2133
Environmental Science & Technology pubs.acs.org/est Perspective

Predicting environmental pollution is a complex task, major limitation of PCA is that the underlying structure of the
because of the dynamic nature and volatility of pollutants data must be linear since all obtained principal components
and their high temporal and spatial variability. DNN is very should be uncorrelated.61 Support vector machines (SVMs)
suitable for solving such complex problems, because it and k-nearest neighbors (k-NNs) make no explicit model
performs nonlinear transformations of the input feature specifications about the data normality. Instead, the difference
through multiple layers and, in most practices, a higher comes from the a priori assumptions about the shape of the
precision is also guaranteed. However, such models work as a class boundary.62 k-NN is a nonparametric method that can
“black box”, and it is difficult or even impossible to determine both handle linear and nonlinear boundaries. For example, k-
how the environmental predictors produce the outcome. The NN could help to determine nitrate pollutant sources and
lack of transparency may make the methods unacceptable produce a nitrate pollution vulnerability map with nonlinear
when the established models are used to make high-stakes boundaries.63 In addition, the use of cross-validation and test
decisions, especially when they make unexpected mistakes. As set validation is not always necessary for all ML models to
a result, Rudin et al. advised to cease the use of a “black box” obtain unbiased error estimates.64 For random forest, a tree-
ML model in high-risk decisions and to use interpretable based ensemble approach, the out-of-bag error estimate has
models instead.53 It is worth noting that, although interpret- been proven to be unbiased in many studies.31,65 A DL model
able models tend to be simpler, the benefit in interpretability assumes that the relationship between the input and output
does not always incur a cost in performance. For example, can be represented by a particular neural network architecture.
regression models could provide similar or even slightly better Thus, the data volume and complexity of the network have to
predictability for outdoor air quality prediction on some types work in harmony to produce a generalizable DL model, which
of training data.54 Another real-world issue in ML is the is not always possible.
absence of high-quality data. The quality of the data affects In comparison with the misuse of ML methods, the neglect
how well models will operate. Data defects such as improper or misuse of data scaling may occur more frequently. Real-
labeling and imbalanced data could undermine the predictive world data sets usually contain features that are highly varying
power of ML. Unfortunately, the data of environmental issues in ranges, magnitudes, and units. Some machine learning
are frequently biased.55 ML models trained on the imbalanced algorithms used the Euclidian distance between data points for
data set tend to assign most cases into the majority class and the measurement of dissimilarity and made decisions entirely
result in poor resolution for the minority class and thus a low on the basis of the magnitude of features, neglecting the unit
model sensitivity.2 Various sampling strategies, such as diversity of the modeling data set. Not surprisingly among
oversampling and undersampling, are recommended for different scales or ranges, the final result was irrational.
imbalanced data sets. In addition to the general data quality, Whether variables need to be rescaled before modeling occurs
insufficient environmental data are available in comparison depends on the principle of the algorithm: that is, if it is based
with image recognition, natural language processing, and even on distance, normality assumptions, etc. For example, PCA is
medical research. For an environmental issue with a small data quite sensitive to the scale and range of the features, since it
set, the ML method selection becomes particularly important. transforms the raw data into a new coordinate system that may
There is sometimes no significant difference in performance maximize the data variance, and a variable with a higher
between more complex methods such as DNNs and simpler variance will have a higher weight for the calculation of a new
methods (such as random forest and support vector machine) axis in comparison to a variable with a lower variance. It has
in the case of insufficient data.56 Also, a comprehensive model been verified that data normalization has a significant influence
evaluation was carried out by a pharmaceutical company using on enhancing the interpretation of PCA when estuarine
an in-house ADMET (absorption, distribution, metabolism, sediment data with heavily skewed variables and different units
elimination, toxicity) database. The complex “black box” DNN are used.66 Data standardization is recommended when the
techniques did not systematically outperform simpler tree- raw variables have been measured on a significantly different
based models such as random forest.57 Nevertheless, well- scale.67 Other distance-based algorithms such as k-NN, K-
optimized DNN models could achieve better performance in means, and SVM are also dramatically affected by the scale of
comparison to that of classic algorithms on large data sets the features, while tree-based algorithms and Naive ̈ Bayes
because there are a large number of hyperparameters that can classifiers are less affected. Unfortunately, data normalization is
be tuned and the leverage of DNNs begins to be manifested.58 often ignored or not clarified in the environmental application
The above discussion is mainly based on the characteristics of practice of ML. For example, both distance-based algorithms
the specific algorithms, while the performance of an algorithm and random forest were used to develop a prediction model for
is also somehow data-dependent.49 In practice, the method PM10 concentration, while the feature type, data range, and
selection often depends on the expert experience. For more corresponding data normalization were not clearly stated. The
information on algorithm features, please refer to Table S2 in argument over the reliability and rationality of the model
the Supporting Information. conclusions could therefore be questioned.68
Prerequisite of Machine Learning: Validation of Demand for Collaboration and Innovation. Since
Model Assumptions. At the level of method application, 2008, the concept of big data has sparked attention from all
ignoring the assumptions and requirements for machine circles of the scientific community.69 The Big Data to
learning tools inevitably raises more questions and con- Knowledge (BD2K) program initiated by the National
troversies. With the simple linear regression as an example, a Institutes of Health (NIH) in 2013 supported the research
normal distribution of its noise is required because the model in order to maximize and accelerate the utility of big data in
residual needs to follow a normal distribution with a mean biomedical and health research. The transformation of big data
value of 0, also known as homoskedasticity. However, only into knowledge or productivity cannot be segmented from ML.
limited studies have realized the possible violation of the Although ML methods have gained certain degrees of success
assumption and the resulting bias in the conclusion.59,60 One in pollution prediction, the application range and acceptance of
2129 https://doi.org/10.1021/acs.est.1c06157
Environ. Sci. Technol. 2022, 56, 2124−2133
Environmental Science & Technology pubs.acs.org/est Perspective

ML methods in environmental science remain relatively low, in University of Chinese Academy of Sciences, Beijing 100190,
comparison with those in chemistry,70 medicine,34 and People’s Republic of China; Institute of Environment and
biology.71 One possible reason is the lack of widespread Health, Jianghan University, Wuhan 430056, People’s
“open access”. Long-term sharing and cooperation mechanisms Republic of China; orcid.org/0000-0001-5680-2529;
have not been formed, and true data sharing is highly desirable. Email: aqzhang@rcees.ac.cn
There have been thousands of publications in ML applications Qian Liu − State Key Laboratory of Environmental Chemistry
in environmental pollution, as mentioned above, but the and Ecotoxicology, Research Center for Eco-Environmental
original data are usually not accessible. The full potential of Sciences, Chinese Academy of Sciences, Beijing 100085,
ML is restricted by limited data sharing. People’s Republic of China; College of Resources and
Second, the availability of environmental pollution data is Environment, University of Chinese Academy of Sciences,
imbalanced. Atmospheric data may be easily collected by Beijing 100190, People’s Republic of China; Institute of
routine environmental monitoring, and the advanced high- Environment and Health, Jianghan University, Wuhan
throughput analysis and capacity building of in vitro toxicity 430056, People’s Republic of China; orcid.org/0000-
data make a concept such as as predictive toxicity or even 0001-8525-7961; Email: qianliu@rcees.ac.cn
DeepTox2 more feasible and practical. In contrast, the
accessible soil pollution data are rather limited, because soil Authors
conditions change slowly while pollution has accumulative and Xian Liu − State Key Laboratory of Environmental Chemistry
hysteretic effects. and Ecotoxicology, Research Center for Eco-Environmental
Finally, big data is not simply equal to massive data, nor to a Sciences, Chinese Academy of Sciences, Beijing 100085,
great variety of data. Currently, there are tremendous amounts People’s Republic of China; orcid.org/0000-0002-5398-
of environmental pollution data, including real-time air 4707
monitoring data, population activity data, pollutant exposure Dawei Lu − State Key Laboratory of Environmental Chemistry
data, meteorological data, etc. In addition, there are lots of data and Ecotoxicology, Research Center for Eco-Environmental
types, which come from different laboratories and depart- Sciences, Chinese Academy of Sciences, Beijing 100085,
ments. The different data acquisition methods, preprocessing People’s Republic of China; orcid.org/0000-0002-8128-
methods, and reference systems would cause not random error 6367
but a systematic error, which poses challenges to the data Guibin Jiang − State Key Laboratory of Environmental
preprocess before learning. To solve the “data island” problem Chemistry and Ecotoxicology, Research Center for Eco-
and achieve high quality data integration, further efforts are still Environmental Sciences, Chinese Academy of Sciences, Beijing
needed from governments and relevant departments. Besides 100085, People’s Republic of China; School of Environment,
the effort to generate high quality data for existing ML Hangzhou Institute for Advanced Study, University of
methods, attention should also be paid to the innovation of Chinese Academy of Sciences, Hangzhou 310012, People’s
ML methods that are adaptive to the data as well. As a typical Republic of China; orcid.org/0000-0002-6335-3917
data-dependent technology, all of the ML algorithms are Complete contact information is available at:
constantly evolving with the change in data characteristics. For https://pubs.acs.org/10.1021/acs.est.1c06157
instance, the emergence of DL was partly attributed to the
rising demand in understanding biomedical digital images.72,73 Notes
Similarly, there is an urgent need to develop ML models that The authors declare no competing financial interest.
are able to cope with and to reason about the complex network Biographies
of environmental data. To this end, experimental research and
machine learning should complement each other to solve
environmental problems. The former produces the “data”,
while the latter seeks knowledge.

■ ASSOCIATED CONTENT
* Supporting Information

The Supporting Information is available free of charge at
https://pubs.acs.org/doi/10.1021/acs.est.1c06157.
Literature search criteria, summary of application
practices of machine learning methods in the assessment
of air pollution, and strengths and weaknesses of diverse
machine learning methods (PDF)

■ AUTHOR INFORMATION
Corresponding Authors
Prof. Aiqian Zhang joined the Research Center for Eco-Environ-
mental Sciences, Chinese Academy of Sciences, in 2009 after 20 years
Aiqian Zhang − State Key Laboratory of Environmental at Nanjing University and is currently Vice Director of College of
Chemistry and Ecotoxicology, Research Center for Eco- Resources and Environment, University of Chinese Academy of
Environmental Sciences, Chinese Academy of Sciences, Beijing Sciences. She obtained her B.Sc. in 1993 and Ph.D. in 1998 from
100085, People’s Republic of China; School of Environment, Nanjing University. Her research interests are environmental
Hangzhou Institute for Advanced Study, University of computational chemistry and predictive toxicology. Prof. Zhang has
Chinese Academy of Sciences, Hangzhou 310012, People’s been a member of the council of the Chinese Society of Toxicology
Republic of China; College of Resources and Environment, (CST) since 2013 and Vice-Chairman of the Committee of

2130 https://doi.org/10.1021/acs.est.1c06157
Environ. Sci. Technol. 2022, 56, 2124−2133
Environmental Science & Technology pubs.acs.org/est Perspective

Computational Toxicology, CST, since 2016. She has utilized various (6) Zimmerman, J. B.; Westerhoff, P.; Field, J.; Lowry, G. Evolving
data exploration methods, such as quantum chemistry, molecular today to best serve tomorrow. Environ. Sci. Technol. 2020, 54 (10),
dynamics, machine learning, bioinformatics, and structure−activity 5923−5924.
(7) Zhong, S.; Zhang, K.; Bagheri, M.; Burken, J. G.; Gu, A.; Li, B.;
relationships, to reveal the underlying mechanism for the environ-
Ma, X.; Marrone, B. L.; Ren, Z. J.; Schrier, J.; et al. Machine Learning:
mental behavior and toxic effect of pollutants. New Ideas and Tools in Environmental Science and Engineering.
Environ. Sci. Technol. 2021, 55 (19), 12741−12754.
(8) Gupta, S.; Aga, D.; Pruden, A.; Zhang, L.; Vikesland, P. Data
Analytics for Environmental Science and Engineering Research.
Environ. Sci. Technol. 2021, 55 (16), 10895−10907.
(9) Hu, X.; Belle, J. H.; Meng, X.; Wildani, A.; Waller, L. A.;
Strickland, M. J.; Liu, Y. Estimating PM2.5 Concentrations in the
Conterminous United States Using the Random Forest Approach.
Environ. Sci. Technol. 2017, 51 (12), 6936−6944.
(10) Lee, H. J.; Chatfield, R. B.; Strawa, A. W. Enhancing the
applicability of satellite remote sensing for PM2.5 estimation using
MODIS deep blue AOD and land use regression in California, United
States. Environ. Sci. Technol. 2016, 50 (12), 6546−6555.
(11) Stafoggia, M.; Bellander, T.; Bucci, S.; Davoli, M.; de Hoogh,
K.; de’Donato, F.; Gariazzo, C.; Lyapustin, A.; Michelozzi, P.; Renzi,
M.; Scortichini, M.; Shtein, A.; Viegi, G.; Kloog, I.; Schwartz, J.
Estimation of daily PM10 and PM2.5 concentrations in Italy, 2013−
Qian Liu is a full Professor at the Research Center for Eco- 2015, using a spatiotemporal land-use random-forest model. Environ.
Environmental Sciences, Chinese Academy of Sciences (RCEES, Int. 2019, 124, 170−179.
CAS). He obtained his B.Sc. in 2004 and Ph.D. in 2009 from Hunan (12) Chen, S. J.; Wu, S. Z. Deep learning for identifying
University. Thereafter, he obtained postdoc training at RCEES, CAS, environmental risk factors of acute respiratory diseases in Beijing,
in 2010−2012 and Trent University in 2013−2014. He became an China: implications for population with different age and gender. Int.
Assistant Professor at RCEES, CAS, in 2012 and became a full J. Environ. Health Res. 2020, 30 (4), 435−446.
Professor in 2017. Dr. Liu is a recipient of the National Science Fund (13) Ngiam, K. Y.; Khor, W. Big data and machine learning
algorithms for health-care delivery. Lancet Oncol. 2019, 20 (5), e262−
for Distinguished Young Scholars and the NSFC Science Fund for
e273.
Excellent Young Scholars. He has won the XPLORER Prize and MIT (14) Zhang, K.; Zhong, S. F.; Zhang, H. C. Predicting aqueous
Technology Review Innovators Under 35 China. His research adsorption of organic compounds onto biochars, carbon nanotubes,
interests include environmental analytical chemistry, air pollution, granular activated carbons, and resins with machine learning. Environ.
environmental applications of machine learning, and health effects of Sci. Technol. 2020, 54 (11), 7008−7018.
environmental pollution. (15) Li, J.; Wilkinson, J. L.; Boxall, A. B. A. Use of a large dataset to


develop new models for estimating the sorption of active
pharmaceutical ingredients in soils and sediments. J. Hazard. Mater.
ACKNOWLEDGMENTS 2021, 415, 125688.
This research was jointly supported by the project of the (16) Umeh, A. C.; Naidu, R.; Shilpi, S.; Boateng, E. B.; Rahman, A.;
National Natural Science Foundation of China (91743204, Cousins, I. T.; Chadalavada, S.; Lamb, D.; Bowman, M. Sorption of
PFOS in 114 well-characterized tropical and temperate soils:
21825403, 21976194, 22193053, 21976206 and 92043302 ), application of multivariate and artificial neural network analyses.
the National Key Research and Development Program of Environ. Sci. Technol. 2021, 55 (3), 1779−1789.
China (2020YFA0907500 and 2018YFA0901101), and the (17) Qin, X. S.; Gao, F. R.; Chen, G. H. Wastewater quality
Strategic Priority Research Program of the CAS (XDPB2005). monitoring system using sensor fusion and machine learning


techniques. Water Res. 2012, 46 (4), 1133−1144.
REFERENCES (18) Zang, Q.; Rotroff, D. M.; Judson, R. S. Binary classification of a
large collection of environmental chemicals from estrogen receptor
(1) Sun, X.; Zhang, X.; Muir, D. C. G.; Zeng, E. Y. Identification of assays by quantitative structure-activity relationship and machine
potential PBT/POP-like chemicals by a deep learning approach based learning methods. J. Chem. Inf. Model. 2013, 53 (12), 3244−3261.
on 2D structural features. Environ. Sci. Technol. 2020, 54 (13), 8221− (19) Xiao, Q.; Chang, H. H.; Geng, G.; Liu, Y. An ensemble
8231. machine-learning model to predict historical PM2.5 concentrations in
(2) Ciallella, H. L.; Zhu, H. Advancing computational toxicology in China from satellite data. Environ. Sci. Technol. 2018, 52 (22),
the big data era by artificial intelligence: data-driven and mechanism- 13260−13269.
driven modeling for chemical toxicity. Chem. Res. Toxicol. 2019, 32 (20) Gu, S.; Holly, E.; Lillicrap, T.; Levine, S. In Deep reinforcement
(4), 536−547. learning for robotic manipulation with asynchronous off-policy updates;
(3) Senior, A. W.; Evans, R.; Jumper, J.; Kirkpatrick, J.; Sifre, L.; 2017 IEEE international conference on robotics and automation
Green, T.; Qin, C.; Ž ídek, A.; Nelson, A. W.; Bridgland, A.; et al. (ICRA); IEEE: 2017; pp 3389−3396.
Improved protein structure prediction using potentials from deep (21) Zhong, H.; Yin, C.; Wu, X.; Luo, J.; He, J., AirRL: a
learning. Nature 2020, 577 (7792), 706−710. reinforcement learning approach to urban air quality inference. arXiv
(4) Svoboda, E. Artificial intelligence is improving the detection of preprint arXiv:2003.12205, 2020.
lung cancer. Nature 2020, 587 (7834), S20−S22. (22) Papaleonidas, A.; Iliadis, L. In Hybrid and reinforcement multi
(5) Mikulak-Klucznik, B.; Gołębiowska, P.; Bayly, A. A.; Popik, O.; agent technology for real time air pollution monitoring, IFIP Interna-
Klucznik, T.; Szymkuć, S.; Gajewska, E. P.; Dittwald, P.; Staszewska- tional Conference on Artificial Intelligence Applications and
Krajewska, O.; Beker, W.; et al. Computational planning of the Innovations, 2012; Springer: 2012; pp 274−284.
synthesis of complex natural products. Nature 2020, 588 (7836), 83− (23) Mikolov, T.; Kombrink, S.; Burget, L.; Č ernocký, J.;
88. Khudanpur, S. In Extensions of recurrent neural network language

2131 https://doi.org/10.1021/acs.est.1c06157
Environ. Sci. Technol. 2022, 56, 2124−2133
Environmental Science & Technology pubs.acs.org/est Perspective

model; 2011 IEEE international conference on acoustics, speech and (41) Wu, X.; Nethery, R. C.; Sabath, M. B.; Braun, D.; Dominici, F.
signal processing (ICASSP); IEEE: 2011; pp 5528−5531. Air pollution and COVID-19 mortality in the United States: Strengths
(24) Jia, S.; Xu, T.; Huan, T.; Chong, M.; Liu, M.; Fang, W.; Fang, and limitations of an ecological regression analysis. Sci. Adv. 2020, 6
M. Chemical isotope labeling exposome (CIL-EXPOSOME): one (45), eabd4049.
high-throughput platform for human urinary global exposome (42) Shi, L.; Zanobetti, A.; Kloog, I.; Coull, B. A.; Koutrakis, P.;
characterization. Environ. Sci. Technol. 2019, 53 (9), 5445−5453. Melly, S. J.; Schwartz, J. D. Low-concentration PM2.5 and mortality:
(25) Korn, E. L.; Whittemore, A. S. Methods for analyzing panel estimating acute and chronic effects in a population-based study.
studies of acute health effects of air pollution. Biometrics 1979, 35 (4), Environ. Health Perspect 2016, 124 (1), 46−52.
795−802. (43) Westreich, D.; Lessler, J.; Funk, M. J. Propensity score
(26) Hoek, G.; Beelen, R.; de Hoogh, K.; Vienneau, D.; Gulliver, J.; estimation: neural networks, support vector machines, decision trees
Fischer, P.; Briggs, D. A review of land-use regression models to assess (CART), and meta-classifiers as alternatives to logistic regression. J.
spatial variation of outdoor air pollution. Atmos. Environ. 2008, 42 Clin. Epidemiol. 2010, 63 (8), 826−833.
(33), 7561−7578. (44) Vasilaki, V.; Volcke, E. I. P.; Nandi, A. K.; van Loosdrecht, M.
(27) Zhang, X.; Chu, Y.; Wang, Y.; Zhang, K. Predicting daily PM2.5 C. M.; Katsou, E. Relating N2O emissions during biological nitrogen
concentrations in Texas using high-resolution satellite aerosol optical removal with operating conditions using multivariate statistical
depth. Sci. Total Environ. 2018, 631, 904−911. techniques. Water Res. 2018, 140, 387−402.
(28) Van Donkelaar, A.; Martin, R. V.; Spurr, R. J.; Burnett, R. T. (45) Ebrahimi, M.; Gerber, E. L.; Rockaway, T. D. Temporal
High-resolution satellite-derived PM2.5 from optimal estimation and performance assessment of wastewater treatment plants by using
geographically weighted regression over North America. Environ. Sci. multivariate statistical analysis. J. Environ. Manage. 2017, 193, 234−
Technol. 2015, 49 (17), 10482−10491. 246.
(29) Ma, Z.; Hu, X.; Huang, L.; Bi, J.; Liu, Y. Estimating ground- (46) Newhart, K. B.; Holloway, R. W.; Hering, A. S.; Cath, T. Y.
level PM2.5 in China using satellite remote sensing. Environ. Sci. Data-driven performance analyses of wastewater treatment plants: A
Technol. 2014, 48 (13), 7436−7444. review. Water Res. 2019, 157, 498−513.
(30) Lyu, B.; Hu, Y.; Zhang, W.; Du, Y.; Luo, B.; Sun, X.; Sun, Z.; (47) Hwangbo, S.; Al, R.; Chen, X. M.; Sin, G. Integrated model for
Deng, Z.; Wang, X.; Liu, J.; et al. Fusion Method Combining Ground- understanding N2O emissions from wastewater treatment plants: a
Level Observations with Chemical Transport Model Predictions deep learning approach. Environ. Sci. Technol. 2021, 55 (3), 2143−
Using an Ensemble Deep Learning Framework: Application in China 2151.
to Estimate Spatiotemporally-Resolved PM2.5 Exposure Fields in (48) Cubillos, M. Multi-site household waste generation forecasting
2014−2017. Environ. Sci. Technol. 2019, 53 (13), 7306−7315. using a deep learning approach. Waste Manage 2020, 115, 8−14.
(31) Brokamp, C.; Jandarov, R.; Hossain, M.; Ryan, P. Predicting (49) Wu, L.; Huang, R.; Tetko, I. V.; Xia, Z.; Xu, J.; Tong, W. Trade-
daily urban fine particulate matter concentrations using a random off Predictivity and Explainability for Machine-Learning Powered
Predictive Toxicology: An in-Depth Investigation with Tox21 Data
forest model. Environ. Sci. Technol. 2018, 52 (7), 4173−4179.
(32) Zhan, Y.; Luo, Y.; Deng, X.; Zhang, K.; Zhang, M.; Grieneisen, Sets. Chem. Res. Toxicol. 2021, 34 (2), 541−549.
(50) Zorn, K. M.; Foil, D. H.; Lane, T. R.; Russo, D. P.; Hillwalker,
M. L.; Di, B. Satellite-based estimates of daily NO2 exposure in China
W.; Feifarek, D. J.; Jones, F.; Klaren, W. D.; Brinkman, A. M.; Ekins,
using hybrid random forest and spatiotemporal kriging model.
S. Machine Learning Models for Estrogen Receptor Bioactivity and
Environ. Sci. Technol. 2018, 52 (7), 4180−4189.
Endocrine Disruption Prediction. Environ. Sci. Technol. 2020, 54 (19),
(33) Jain, S.; Presto, A. A.; Zimmerman, N. Spatial modeling of daily
12202−12213.
PM2.5, NO2, and CO concentrations measured by a low-cost sensor
(51) Wang, L.; Zhao, L.; Liu, X.; Fu, J.; Zhang, A. SepPCNET:
network: comparison of linear, machine learning, and hybrid land use
Deeping Learning on a 3D Surface Electrostatic Potential Point Cloud
models. Environ. Sci. Technol. 2021, 55 (13), 8631−8641. for Enhanced Toxicity Classification and Its Application to Suspected
(34) Cole, M. A.; Elliott, R. J.; Liu, B. The impact of the Wuhan Environmental Estrogens. Environ. Sci. Technol. 2021, 55 (14), 9958−
Covid-19 lockdown on air pollution and health: a machine learning 9967.
and augmented synthetic control approach. Environ. Resour. Econ (52) Ciallella, H. L.; Russo, D. P.; Aleksunes, L. M.; Grimm, F. A.;
2020, 76 (4), 553−580. Zhu, H. Predictive modeling of estrogen receptor agonism,
(35) Seng, D. W.; Zhang, Q. Y.; Zhang, X. F.; Chen, G. S.; Chen, X. antagonism, and binding activities using machine- and deep-learning
Y. Spatiotemporal prediction of air quality based on LSTM neural approaches. Lab. Invest. 2021, 101 (4), 490−502.
network. Alex. Eng. J. 2021, 60 (2), 2021−2032. (53) Rudin, C. Stop explaining black box machine learning models
(36) Zhao, Y.; Wang, L.; Luo, J.; Huang, T.; Tao, S.; Liu, J.; Yu, Y.; for high stakes decisions and use interpretable models instead. Nat.
Huang, Y.; Liu, X.; Ma, J. Deep learning prediction of polycyclic Mach. Intell. 2019, 1 (5), 206−215.
aromatic hydrocarbons in the high arctic. Environ. Sci. Technol. 2019, (54) Kerckhoffs, J.; Hoek, G.; Portengen, L.; Brunekreef, B.;
53 (22), 13238−13245. Vermeulen, R. C. H. Performance of prediction algorithms for
(37) Janarthanan, R.; Partheeban, P.; Somasundaram, K.; modeling outdoor air pollution spatial surfaces. Environ. Sci. Technol.
Elamparithi, P. N. A deep learning approach for prediction of air 2019, 53 (3), 1413−1421.
quality index in a metropolitan city. Sustain. Cities Soc. 2021, 67, (55) Suter, G. W.; Cormier, S. M. The Problem of Biased Data and
102720. Potential Solutions for Health and Environmental Assessments. Hum.
(38) Yang, X. Z.; Liu, X.; Zhang, A. Q.; Lu, D. W.; Li, G.; Zhang, Q. Ecol. Risk Assess. 2015, 21 (7), 1736−1752.
H.; Liu, Q.; Jiang, G. B. Distinguishing the sources of silica (56) Miljkovic, F.; Rodriguez-Perez, R.; Bajorath, J. Machine
nanoparticles by dual isotopic fingerprinting and machine learning. learning models for accurate prediction of kinase Inhibitors with
Nat. Commun. 2019, 10 (1), 1−9. different binding modes. J. Med. Chem. 2020, 63 (16), 8738−8748.
(39) Tepanosyan, G.; Sahakyan, L.; Maghakyan, N.; Saghatelyan, A. (57) Aleksic, S.; Seeliger, D.; Brown, J. B. ADMET Predictability at
Combination of compositional data analysis and machine learning Boehringer Ingelheim: State-of-the-Art, and Do Bigger Datasets or
approaches to identify sources and geochemical associations of Algorithms Make a Difference? Mol. Inform. 2021, 2100113.
potentially toxic elements in soil and assess the associated human (58) Koutsoukas, A.; Monaghan, K. J.; Li, X. L.; Huan, J. Deep-
health risk in a mining city. Environ. Pollut. 2020, 261, 114210. learning: investigating deep neural networks hyper-parameters and
(40) Jia, X.; Hu, B.; Marchant, B. P.; Zhou, L.; Shi, Z.; Zhu, Y. A comparison of performance to shallow methods for modeling
methodological framework for identifying potential sources of soil bioactivity data. J. Cheminformatics 2017, 9 (1), 1−13.
heavy metal pollution based on machine learning: A case study in the (59) Wang, Y.; Wei, H.; Wang, Y.; Peng, C.; Dai, J. Chinese
Yangtze Delta, China. Environ. Pollut. 2019, 250, 601−609. industrial water pollution and the prevention trends: an assessment

2132 https://doi.org/10.1021/acs.est.1c06157
Environ. Sci. Technol. 2022, 56, 2124−2133
Environmental Science & Technology pubs.acs.org/est Perspective

based on environmental complaint reporting system (ECRS). Alex.


Eng. J. 2021, 60 (6), 5803−5812.
(60) Slini, T.; Karatzas, K.; Papadopoulos, A. Regression analysis
and urban air quality forecasting: an application for the city of Athens.
Glob. Nest J. 2002, 4 (2−3), 153−162.
(61) Lever, J.; Krzywinski, M.; Altman, N. Points of significance:
Principal component analysis. Nat. Methods 2017, 14 (7), 641−643.
(62) Bzdok, D.; Krzywinski, M.; Altman, N. Machine learning:
supervised methods. Nat. Methods 2018, 15 (1), 5.
(63) Motevalli, A.; Naghibi, S. A.; Hashemi, H.; Berndtsson, R.;
Pradhan, B.; Gholami, V. Inverse method using boosted regression
tree and k-nearest neighbor to quantify effects of point and non-point
source nitrate pollution in groundwater. J. Clean. Prod. 2019, 228,
1248−1263.
(64) Neto, U. M. B.; Dougherty, E. R. Error estimation for pattern
recognition; Wiley: 2015; pp 97−99.
(65) Stafoggia, M.; Johansson, C.; Glantz, P.; Renzi, M.; Shtein, A.;
de Hoogh, K.; Kloog, I.; Davoli, M.; Michelozzi, P.; Bellander, T. A
random forest approach to estimate daily particulate matter, nitrogen
dioxide, and ozone at fine spatial resolution in Sweden. Atmosphere
2020, 11 (3), 239.
(66) Reid, M. K.; Spencer, K. L. Use of principal components
analysis (PCA) on estuarine sediment datasets: The effect of data pre-
treatment. Environ. Pollut. 2009, 157 (8−9), 2275−2281.
(67) Gewers, F. L.; Ferreira, G. R.; de Arruda, H. F.; Silva, F. N.;
Comin, C. H.; Amancio, D. R.; Costa, L. d. F., Principal component
analysis: A natural approach to data exploration. arXiv preprint
arXiv:1804.02502, 2018.
(68) Bozdag, A.; Dokuz, Y.; Gokcek, O. B. Spatial prediction of
PM10 concentration using machine learning algorithms in Ankara,
Turkey. Environ. Pollut. 2020, 263, 114635.
(69) Lynch, C. How do your data grow? Nature 2008, 455 (7209),
28−29.
(70) Graziano, G. Deep learning chemistry ab initio. Nat. Rev. Chem.
2020, 4 (11), 564−564.
(71) Alipanahi, B.; Delong, A.; Weirauch, M. T.; Frey, B. J.
Predicting the sequence specificities of DNA-and RNA-binding Recommended by ACS
proteins by deep learning. Nat. Biotechnol. 2015, 33 (8), 831−838.
(72) Hosny, A.; Parmar, C.; Quackenbush, J.; Schwartz, L. H.; Aerts,
H. J. Artificial intelligence in radiology. Nat. Rev. Cancer 2018, 18 (8), Machine Learning-Enabled Framework for High-
500−510. Throughput Screening of MOFs: Application in
(73) Moen, E.; Bannon, D.; Kudo, T.; Graf, W.; Covert, M.; Van Radon/Indoor Air Separation
Valen, D. Deep learning for cellular image analysis. Nat. Methods Junyu Ren, Xu Ji, et al.
2019, 16 (12), 1233−1246. DECEMBER 27, 2022
ACS APPLIED MATERIALS & INTERFACES READ

Machine Learning-Based Models with High Accuracy and


Broad Applicability Domains for Screening PMT/vPvM
Substances
Qiming Zhao, Guibin Jiang, et al.
DECEMBER 06, 2022
ENVIRONMENTAL SCIENCE & TECHNOLOGY READ

Toward High-Performance Map-Recovery of Air Pollution


Using Machine Learning
Jun Song, Yike Guo, et al.
SEPTEMBER 28, 2022
ACS ES&T ENGINEERING READ

Application of the Deep Learning Algorithm to Identify the


Spatial Distribution of Heavy Metals at Contaminated Sites
Jun Man, Yijun Yao, et al.
DECEMBER 23, 2021
ACS ES&T ENGINEERING READ

Get More Suggestions >

2133 https://doi.org/10.1021/acs.est.1c06157
Environ. Sci. Technol. 2022, 56, 2124−2133

You might also like