Journal of Hydroinformatics

Neural Network and Random Forest approach to water quality index prediction of Nokoué lake in Benin, West Africa.
--Manuscript Draft--

Manuscript Number: HYDRO-D-23-00089

Full Title: Neural Network and Random Forest approach to water quality index prediction of Nokoué lake in Benin, West Africa.

Article Type: Special Issue Paper OA

Section/Category: Special Issue: WDSA-CCWI 22

Manuscript Region of Origin: BURKINA FASO

Abstract: Nokoué lake, the main water body of Benin, is unequivocally experiencing pollution. This study discusses the development and validation of a random forest (RF) model and an artificial neural network (ANN) model for estimating the water quality index (WQI) of Nokoué lake, Benin. The two models were developed and tested using data from 20 monitoring stations over a period of 12 months. The modeling data were divided into two sets. For the first set, RF and ANNs were trained, tested and validated using 12 physico-chemical parameters as inputs. A detailed comparison of overall performance showed that the RF model predicted better than the artificial neural network, with a coefficient of determination (R²) of 0.98, a root mean squared error (RMSE) of 0.12, an explained variance score (EVS) of 0.98 and a mean absolute error (MAE) of 0.14 in the training phase. In the validation phase the corresponding values were R² = 0.80, RMSE = 0.19, MAE = 0.23 and EVS = 0.74, which demonstrates that RF can estimate the WQI with acceptable accuracy. This method simplifies the calculation of the WQI and substantially reduces effort and time. It will help in taking appropriate preventive measures to control the water quality of Nokoué lake through associated chemical treatments.


Application of random forest model for predicting water quality index of Nokoué lake.

N. Dabire*^, E. C. Ezin**, A. M. Firmin***

* Institut National de l’Eau (INE), Centre d’Excellence d’Afrique pour l’Eau et l’Assainissement (C2EA), Université d’Abomey Calavi (UAV)
* Ecole Doctorale des Sciences de l’Ingénieur (ED-SDI), Université d’Abomey Calavi
** Institut de Formation et de Recherche en Informatique (IFRI), Université d’Abomey Calavi (E-mail: eugene.ezin@uac.bj)
*** Institut National de l’Eau, Laboratoire d’Hydrologie Appliquée (LHA), Université d’Abomey Calavi (E-mail: firminelite@gmail.com)
^ Corresponding author (E-mail: namwinwelbere@gmail.com)

ABSTRACT

Poor water quality is a serious worldwide problem that threatens human health, ecosystems, and plant and animal life. Predicting surface water quality is a main concern in water resource and environmental systems. Nokoué lake, the main water body of Benin, is unequivocally experiencing pollution. The water quality index is usually evaluated manually with complex mathematical formulas that carry a major risk of error. This study discusses the development and validation of a random forest (RF) model and an artificial neural network (ANN) model for estimating the water quality index (WQI) of Nokoué lake, Benin. The two models were developed and tested using data from 20 monitoring stations over a period of 12 months. The modeling data were divided into two sets. For the first set, RF and ANNs were trained, tested and validated using 12 physico-chemical parameters as inputs. A detailed comparison of overall performance showed that the RF model predicted better than the artificial neural network, with a coefficient of determination (R²) of 0.98, a root mean squared error (RMSE) of 0.12, an explained variance score (EVS) of 0.98 and a mean absolute error (MAE) of 0.14 in the training phase. In the validation phase the corresponding values were R² = 0.80, RMSE = 0.19, MAE = 0.23 and EVS = 0.74, which demonstrates that RF can estimate the WQI with acceptable accuracy. This method simplifies the calculation of the WQI and substantially reduces effort and time by optimizing the computations. It will help in taking appropriate preventive measures to control the water quality of Nokoué lake through associated chemical treatments.

Key words: Benin, Nokoué lake, neural networks, random forest, surface water quality index.

HIGHLIGHTS

• Two prediction models, random forest (RF) and artificial neural networks (ANN), are built to predict the water quality index (WQI).
• The WQI is applied to estimate the influence of natural and anthropogenic factors based on twelve key physico-chemical water parameters.
• This research has the potential to improve the early warning system for the water quality of Nokoué lake.

1. INTRODUCTION

Water is an essential natural resource whose physical and chemical quality is the foundation of the ecosystem (Dovonou et al. 2011). According to Lalèyè et al. (2004), a good chemical and ecological status of surface water bodies is a major concern for a society that has to meet increasingly important water needs. Water resources are thus a major concern in West African countries because they are essential for the development of human, economic and social activities. Nokoué lake in Benin plays a very important role in socio-economic development at the local, regional and national levels (Avahouin et al. 2018). However, it is subject to numerous environmental pressures related to hydrocarbon traffic, artisanal fishing and household wastewater discharges from the communes bordering the lake, in addition to salt water inputs from the Atlantic Ocean and inputs of pesticide and fertilizer residues leached from soils subjected to intense and diversified agriculture (Zandagba et al. 2016). Nokoué lake hosts the largest lake villages, whose populations are growing rapidly. This leads to strong anthropization and consequently to chemical pollution of the lake water and to increasing eutrophication, which compromises biodiversity and promotes the proliferation of invasive plants such as the water hyacinth (Mama 2010; Dovonou et al. 2011; Zandagba et al. 2016). To ensure good management of the lake within the framework of sustainable development, it is therefore judicious to monitor continuously the medium- and long-term evolution of the physico-chemical state of the water (Sèdami & Bokossa 2016). Ensuring freshwater quality appropriate to human and ecological needs is an important aspect of integrated environmental management and sustainable development.

The number of water quality parameters relevant to environmental and ecological problems is quite extensive. Hence, a robust mathematical technique is required to combine the physico-chemical characterization of water into a single variable that describes water quality. A water quality index (WQI) is a single number that uses a set of physico-chemical water parameters to express the quality of water at a certain place and time (Vasistha & Ganguly 2020). This method was initially proposed by Horton (1965) and Brown et al. (1972). Horton (1965) proposed the first formula, which takes into account all parameters needed to determine surface water quality and reflects the composite influence of the different parameters important for water quality assessment and management (Liou et al. 2004; Tyagi et al. 2013). The index was first used to highlight the physico-chemical changes that may occur during the year (House 1990; House & Ellis 1987). Based on this index, water quality was classified into five classes according to the water's suitability for various uses such as water supply, irrigation and fish culture. The conventional method suggested by Horton requires lengthy transformations to estimate subindices. In addition, the subindices require different equations, so calculating the final WQI takes considerable effort and time. Estimating such a WQI is therefore cumbersome and can lead to occasional mistakes.

Random forest (RF) and artificial neural networks (ANNs) can be suggested as alternatives for estimating the WQI, as both use the raw data instead of subindices. The performance of machine learning models in enhancing water quality and reducing a wide range of wastewater has been reported in several studies (Iorliam et al. 202; Kalhori & Zeng 2013; Wang 2017; Zheng 2018). Because the number of variables that affect water quality is high, machine learning techniques such as RF, genetic programming (GP) and ANNs have recently been employed successfully to solve engineering problems in hydrology (Sampurno et al. 2022). RF is a leading technique that can be used for both regression and classification (Maier & Dandy 2000; Zhu et al. 2022). It generalizes well and is less prone to overfitting. Furthermore, it simultaneously minimizes the estimation error and the model dimensions. Huang et al. (2022) applied RF and ANNs to predict water levels in rivers. Amir et al. (2018) recommended RF as an appropriate tool to forecast lake water levels and obtained quite acceptable results. ANNs have been recommended as an effective tool for predicting water pollution and water quality (Abobakr et al. 2019; Liou et al. 2004), and have been used to speed up the calculation of the water quality index in rivers (Dahal et al. 2021; Tyagi et al. 2013). In this research, both RF and ANNs were used as robust techniques for rapid and direct prediction of the WQI of Nokoué lake, as an alternative to time-consuming conventional methods. Twenty points in the wetland were monitored twice a month over a period of 12 months, and an extensive dataset was collected for 12 physico-chemical parameters. Finally, the RF results were compared with those of a neural network model, namely the multilayer perceptron (MLP).

2. MATERIALS AND METHODS

2.1. Study area and sampling collection

Located in the south-east of the Republic of Benin (6º 25’N, 2º 26’E; Figure 1), Nokoué lake has a surface area varying from 150 to 170 km² between the low-water and high-water periods (Gnohossou 2006; Mama 2010; Dehotin et al. 2007). The lake measures 20 km in the east-west direction and 11 km in the north-south direction. In the east, Nokoué lake is linked to Porto-Novo lagoon and forms a freshwater lake with a surface area of about 180 km². Nokoué lake is connected to the Atlantic Ocean by a channel named Cotonou lagoon, which has a total length of 4.5 km. The hydrological regime of Nokoué lake is characterized by a low flood from May to June, the main rainy season in southern Benin, and a major flood from September to November caused by water supply from the Ouémé river. The depth of the lake is between 0.3 and 3.4 m (Adandedji et al. 2022), and its average depth is 1.3 m according to a bathymetric study reported by Mama et al. (2011).

Sampling locations were selected according to three criteria: (i) good geographical distribution of stations across the lagoon complex; (ii) shrimp and oyster fishing locations; (iii) locations of lake villages and residential houses in the lagoon complex. A total of twenty stations (S1 to S20) around the lagoon complex were chosen. Data collection was carried out twice a month over a period of 12 months. In total, 12 physico-chemical parameters, also called sub-indicators, were collected in Nokoué lake. They describe the oxygenation state (the quantity of oxygen present in the water in dissolved form and available for aquatic life and for the oxidation of organic matter, O2_dissolved), temperature (T), electrical conductivity (EC), hydrogen potential (pH), organic pollution, turbidity, nutrient load (pollutants responsible for eutrophication), chemical oxygen demand (COD), suspended solids (SS), total nitrogen, salinity, nitrite and nitrate. Temperature, hydrogen potential, electrical conductivity, salinity, total dissolved solids and dissolved oxygen were measured with a multi-parameter instrument (AQUAREAD AP-700), and suspended solids with a DR-890 colorimeter. Laboratory analyses were performed with a DR-2800 spectrometer to determine nitrite, nitrate and ortho-phosphate.

Figure 1 | Map of water sample collection points.

2.2. Data series description

The descriptive statistics of the physico-chemical variables used in this study are the minimum and maximum values, the mean and the standard deviation (Table 1). The results show that salinity remains a major problem for the ecosystem, with a maximum of 19700 mg/l. The pH varies between acidic and basic, with a minimum value of 5.45 and a maximum value of 8.78. Overall, the waters of Nokoué lake are slightly alkaline, with an average pH of 7.31. The temperature varies between 23 and 34.1 °C. The mineralization of Nokoué lake water is relatively low, with an electrical conductivity that varies between 0.07 and 64.2 µS/cm. The concentration of dissolved oxygen decreases to as low as 0.10 mg/l. Nitrite and nitrate concentrations range from 0 to 0.04 mg/l and from 0.01 to 5.9 mg/l respectively. Some parameters vary within a narrow range (nitrites, pH and temperature), while the remaining physico-chemical parameters vary over a very wide range.

Table 1 | Descriptive statistics of physico-chemical parameters

Variable                        Unit     Observed   Min     Max        Mean      Std
Temperature (T)                 °C       448        23.00   34.10      29.13     2.34
Turbidity                       NTU      448        1.00    144.00     21.08     24.75
Conductivity (EC)               µS/cm    448        0.07    64.20      12.54     12.48
Suspended solids (SS)           mg/l     448        1.00    77.00      13.15     14.74
Total dissolved solids (TDS)    mg/l     448        0.94    1805.00    247.33    433.23
Salinity                        mg/l     448        0.00    19700.00   4330.00   5292.72
Dissolved oxygen (O-diss)       mg/l     448        0.10    12.20      4.77      3.68
Hydrogen potential (pH)         ---      448        5.45    8.78       7.31      0.73
Orthophosphate (PO4^3-)         mg/l     448        0.04    1.52       0.17      0.23
Nitrites (N-NO2)                mg/l     448        0.00    0.04       0.01      0.01
Nitrates (N-NO3)                mg/l     448        0.01    5.90       2.42      1.81
Chemical oxygen demand (COD)    mg/l     448        3.67    281.97     86.16     57.98

2.3. The local water quality index assessment

The formulas proposed by Horton (1965), Brown et al. (1970), Brown et al. (1972), Chatterji and Raziuddin (2002) and Yidana and Yidana (2010) are usually used to evaluate the surface water quality index, although they are tedious to apply. These formulas take into account all the parameters needed to determine surface water quality and reflect the composite influence of the different parameters important for the assessment and management of water quality. In this study, the index is described by the key physico-chemical parameters (called sub-indicators): hydrogen potential (pH), dissolved oxygen (O-diss), salinity, electrical conductivity (EC), temperature (T°C), suspended solids (SS), turbidity, total dissolved solids (TDS), chemical oxygen demand (COD), orthophosphates (PO4^3-), nitrites (NO2-) and nitrates (NO3-) (pollutants responsible for eutrophication). In this approach, a numerical value called the relative weight (Wi), specific to each physico-chemical parameter, is calculated according to the following formula:

$$W_i = \frac{K}{S_i} \tag{1}$$

where K is a constant of proportionality and $S_i$ is the maximum value of the surface water standard for each parameter, in milligrams per liter (mg/l) except for pH, temperature (°C) and electrical conductivity. Turbidity is measured in nephelometric turbidity units (NTU).

K is calculated using the following equation:

$$K = \frac{1}{\sum_{i=1}^{n} \left( \frac{1}{S_i} \right)} \tag{2}$$

where n is the number of parameters.

Then a quality rating scale ($Q_i$) is calculated for each parameter by dividing the concentration of the parameter by the standard maximum value for that parameter and multiplying the result by 100:

$$Q_i = \left( \frac{C_i}{S_i} \right) \times 100 \tag{3}$$

where $Q_i$ is the quality rating scale for each parameter and $C_i$ is the concentration of each parameter in mg/l.

Finally, the overall water quality index is calculated by the following equation:

$$WQI = \frac{\sum_{i=1}^{n} W_i \times Q_i}{\sum_{i=1}^{n} W_i} \tag{4}$$

where $Q_i$ is the quality rating scale for each parameter, $W_i$ is the relative weight and $W_i \times Q_i$ is the sub-index value specific to each physico-chemical parameter retained. Table 2 shows the sub-index calculation, and five quality classes can be identified according to the value of the water quality index (Table 3). The water quality index of Nokoué lake varies markedly over the 12 months (Figure 4).

Table 2 | Sub-index calculation for each WQI value

Parameter               Si      1/Si      Ci      K        Wi       Qi        Wi×Qi     WQI
T (°C)                  25      0.04      29.8    -        0.0324   119.2     3.85917   -
Turbidity (NTU)         5       0.2       5       -        0.1619   100       16.1878   -
Conductivity (µS/cm)    1000    0.001     7.5     -        0.0008   0.75      0.00061   -
SS (mg/l)               50      0.02      1       -        0.0162   2         0.03238   -
TDS (mg/l)              500     0.002     96      -        0.0016   19.2      0.03108   -
Salinity (mg/l)         150     0.00667   0       -        0.0054   0         0         -
O2 dissolved (mg/l)     100     0.01      0.62    -        0.0081   0.62      0.00502   -
pH                      9       0.11111   6.94    -        0.0899   77.1111   6.93476   -
Ortho-P (mg/l)          2       0.5       0.218   -        0.4047   10.9      4.41117   -
Nitrite (mg/l)          3.2     0.3125    0.013   -        0.2529   0.40625   0.10275   -
Nitrate (mg/l)          45      0.02222   2.9     -        0.018    6.44444   0.11591   -
COD (mg/l)              100     0.01      57.12   -        0.0081   57.12     0.46232   -
Total                   -       1.2355    -       0.8094   1        -         32.1429   32.143

Table 3 | Standard WQI classification and status (Brown et al. 1972; Aher et al. 2016).

Classes      Water status        Possible use
0 - 25       Excellent quality   Potable water, irrigation and industry
26 - 50      Good quality        Drinking water, irrigation and industry
51 - 75      Poor quality        Irrigation and industry
76 - 100     Very poor quality   Irrigation
Above 100    Unsuitable          Appropriate treatment required before use
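
The calculation described by equations (1) to (4) can be illustrated with a short Python sketch that uses the Si and Ci values listed in Table 2. The function and variable names are illustrative and are not taken from the study's own code, and treating the upper bound of each class in Table 3 as inclusive is an assumption made here for non-integer index values.

```python
# Worked WQI example using the Si and Ci values of Table 2.
# Names are illustrative; they are not the authors' implementation.

STANDARDS = {  # Si: standard maximum value of each parameter (Table 2)
    "T": 25, "Turbidity": 5, "Conductivity": 1000, "SS": 50, "TDS": 500,
    "Salinity": 150, "O2_dissolved": 100, "pH": 9, "Ortho-P": 2,
    "Nitrite": 3.2, "Nitrate": 45, "COD": 100,
}
SAMPLE = {  # Ci: measured values for the sample shown in Table 2
    "T": 29.8, "Turbidity": 5, "Conductivity": 7.5, "SS": 1, "TDS": 96,
    "Salinity": 0, "O2_dissolved": 0.62, "pH": 6.94, "Ortho-P": 0.218,
    "Nitrite": 0.013, "Nitrate": 2.9, "COD": 57.12,
}


def compute_wqi(concentrations, standards):
    """Return the WQI of one sample following equations (1) to (4)."""
    k = 1.0 / sum(1.0 / s for s in standards.values())            # Eq. (2)
    weights = {p: k / s for p, s in standards.items()}            # Eq. (1)
    ratings = {p: 100.0 * concentrations[p] / standards[p]        # Eq. (3)
               for p in standards}
    numerator = sum(weights[p] * ratings[p] for p in standards)
    return numerator / sum(weights.values())                      # Eq. (4)


def classify_wqi(wqi):
    """Map a WQI value to a Table 3 class (inclusive upper bounds assumed)."""
    for upper, status in [(25, "Excellent quality"), (50, "Good quality"),
                          (75, "Poor quality"), (100, "Very poor quality")]:
        if wqi <= upper:
            return status
    return "Unsuitable"


wqi = compute_wqi(SAMPLE, STANDARDS)
status = classify_wqi(wqi)
print(round(wqi, 2), status)
```

For this sample the sketch gives a WQI of about 32.14, in agreement with the total of Table 2, which falls in the "Good quality" class of Table 3.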

Figure 4 | Monthly boxplot of the water quality index of Nokoué lake

2.4. Data preprocessing

Two techniques are used to preprocess the data before training the machine learning models: management of outliers and scaling of variables. Outlying values ($V_{ab}$) are identified with the following rule:

$$V_{ab} \; \begin{cases} < Q_1 - 1.5 \times IQ \\ > Q_3 + 1.5 \times IQ \end{cases} \tag{5}$$

where $Q_1$ is the first quartile, $Q_3$ is the third quartile and $IQ$ is the inter-quartile range. A value in the series that is below $Q_1 - 1.5 \times IQ$ or above $Q_3 + 1.5 \times IQ$ is considered an outlier.
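
A minimal sketch of the outlier rule of equation (5) is given below, using only the Python standard library. The quartile method (the default of `statistics.quantiles`) and the sample turbidity values are assumptions made for illustration; the manuscript does not state how the quartiles were computed.

```python
import statistics


def iqr_bounds(values, factor=1.5):
    """Return the lower and upper fences of equation (5)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile method is an assumption
    iq = q3 - q1
    return q1 - factor * iq, q3 + factor * iq


def remove_outliers(values, factor=1.5):
    """Keep only the values that lie inside the two fences."""
    low, high = iqr_bounds(values, factor)
    return [v for v in values if low <= v <= high]


# Made-up turbidity readings (NTU) for illustration only.
turbidity = [1.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 144.0]
cleaned = remove_outliers(turbidity)
print(cleaned)
```

In this toy series only the reading of 144.0 lies above the upper fence and is dropped.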

Feature scaling is a very important step to take before training a machine learning model (Dahal et al. 2021). The data used for machine learning in this study were preprocessed to reduce the impact, on the accuracy of the water quality index prediction, of the large difference in order of magnitude between the water quality index data and the physico-chemical parameters (influencing factors). There are many methods of scaling features; the technique used here is standardization, one of the most common and popular in the machine learning community and also referred to here as normalization. Standardization is carried out in three steps: (1) calculate the mean (Xmean) and the standard deviation (std); (2) subtract the mean from each explanatory variable to be standardized (Xi); (3) divide the result by the standard deviation. Its mathematical expression is the following:

$$X_{normalized} = \frac{X_i - X_{mean}}{std} \tag{6}$$

Analyses and modeling were done in the Python language.
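
The three standardization steps of equation (6) can be sketched as follows. The use of the population standard deviation is an assumption (the manuscript does not say which estimator was used), and the pH values are illustrative.

```python
import statistics


def standardize(values):
    """Apply equation (6): subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(values)      # step 1: mean
    std = statistics.pstdev(values)      # step 1: population std (assumption)
    return [(v - mean) / std for v in values]  # steps 2 and 3


# Illustrative pH values; after scaling they have mean 0 and unit variance.
ph = [5.45, 6.94, 7.31, 8.78]
ph_scaled = standardize(ph)
print([round(z, 3) for z in ph_scaled])
```

In practice the mean and standard deviation would be estimated on the training data only and then reused to scale the test data, so that no information from the test set leaks into training.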

After the removal of outliers from the series, the database used has a dimension of 460 rows and 12 columns. All variables (physico-chemical parameters) were considered to build the two models. Seven variables show a positive correlation with the water quality index of Nokoué lake, as shown in the correlation matrix (Figure 5), which indicates the relative influence of the input variables. Among them, turbidity, suspended solids and orthophosphate have the greatest influence on the variation of the water quality index, with correlation coefficients of 0.97, 0.91 and 0.37 respectively. The rest of the variables are negatively correlated with the water quality index of Nokoué lake. All variables were considered for training and validation. The database is partitioned into two sets: a training dataset (x_train; y_train) and a test dataset (x_test; y_test). The training dataset contains 80 percent of the data and the testing dataset 20 percent.
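
The 80/20 partition described above can be sketched in plain Python as follows. Whether the study shuffled the rows before splitting is not stated, so the shuffling and the fixed seed are assumptions, and the data are placeholders with the same number of rows (460) as the cleaned database.

```python
import random


def train_test_split(x, y, test_fraction=0.2, seed=42):
    """Shuffle row indices and hold out a fraction of the rows for testing."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)          # shuffling is an assumption
    n_test = round(len(x) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return pick(x, train_idx), pick(x, test_idx), pick(y, train_idx), pick(y, test_idx)


# Placeholder data: 460 rows, two dummy features and a dummy target.
x = [[float(i), 2.0 * i] for i in range(460)]
y = [float(i) for i in range(460)]
x_train, x_test, y_train, y_test = train_test_split(x, y)
print(len(x_train), len(x_test))
```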

Figure 5 | Correlation matrix of variables

2.5. Structure of random forest and artificial neural network

Machine learning is a set of statistical methods for analyzing trends, finding relationships and developing models to make predictions about a dataset. In this study, we use the random forest and neural network techniques to predict the water quality index of Nokoué lake automatically.

2.5.1. Structure of random forest

Decision tree forests, also known as random forests, are an ensemble learning technique that uses decision trees to build decision support models (Dalai et al. 2021). They split the historical data by randomly selecting a subset of variables at each step of the decision tree. The model then takes the mode of the predictions of all the decision trees. This method reduces the risk of error of an individual tree by relying on a majority vote. For example, in a random forest with four decision trees in which the third tree predicts zero (0) and the other three predict one (1), the mode of the four trees, and therefore the predicted value, is one (1), as shown in Figure 6.

Figure 6 | Visual representation of a random forest structure
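
The majority vote of the four-tree example can be written in a few lines of Python. The votes and the numeric estimates below are made up for illustration. The second part reflects the usual behaviour of random forests in regression, where the predictions of the trees are averaged instead of voted on; it is shown here because the WQI is a continuous target, not because the manuscript describes it.

```python
from statistics import mean, mode

# Classification: the third tree predicts 0, the other three predict 1.
tree_votes = [1, 1, 0, 1]
forest_class = mode(tree_votes)          # majority vote of the four trees

# Regression (assumption for a continuous WQI): average the tree estimates.
tree_estimates = [31.0, 33.5, 29.5, 34.0]  # made-up WQI estimates
forest_value = mean(tree_estimates)

print(forest_class, forest_value)
```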

2.5.2. Structure of artificial neural network

Artificial neural networks were first described in 1943 by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts, and they often perform much better than other machine learning methods on large and complex problems. Chu et al. (2020) and Taher et al. (2022) explain the composition of a neural network model and the steps involved in creating one. The structure consists of neural layers that work together in parallel (Zahiri et al. 2015). In this study, one model, the multilayer perceptron (MLP), is used (Figure 7), and a brief description of it is given here.

The multilayer perceptron is a type of artificial neural network organized in several layers. Information flows from the input layer to the output layer only; it is therefore a feedforward network. Each layer consists of a variable number of neurons, the neurons in the last layer (called the "output" layer) being the outputs of the overall system. The perceptron was invented in 1957 by Frank Rosenblatt at the Cornell Aeronautical Laboratory. In this study, the input layer is composed of twelve neurons that represent the input variables, followed by two intermediate layers, the first composed of 64 neurons and the second of 32 neurons. The output layer is composed of a single node that represents the target (WQI). The input variables are multiplied by a series of weights, summed, and added to a constant value (the bias). Then a transfer function, also known as the activation function, acts on the result of this computation. In this study, we applied the rectified linear unit (ReLU) activation function to each intermediate neuron. With default values, this returns the standard ReLU activation max(x, 0), the element-wise maximum of zero and the input tensor, where x is the input tensor or variable. The mathematical expression of the neural computation is given in equation (7):

$$I_j = f\left( \sum_i w_{ij} \times \alpha_i + \theta_j \right) \tag{7}$$

where $\alpha_i$ are the inputs (explanatory variables), $w_{ij}$ are the weights, $\theta_j$ are the biases and $f$ is the activation function that governs each cell and acts on its inputs (Bhutani 2014). During the learning phase, after the errors of the neural network have been calculated, they must be corrected in order to improve its performance. To minimize these errors, and thus the objective function, the stochastic gradient descent (SGD) algorithm is used.
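
The forward computation of equation (7) through the 12-64-32-1 architecture can be sketched in plain Python as follows. The random initial weights, the zero biases and the linear (identity) output node are assumptions made for illustration; the trained weights of the study are not available, and no training step is shown.

```python
import random


def relu(x):
    """Rectified linear unit: max(x, 0)."""
    return max(x, 0.0)


def dense(inputs, weights, biases, activation=None):
    """Equation (7) for one layer: weighted sum plus bias, then activation."""
    outputs = []
    for w_row, bias in zip(weights, biases):
        total = sum(w * a for w, a in zip(w_row, inputs)) + bias
        outputs.append(activation(total) if activation else total)
    return outputs


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
layer_sizes = [12, 64, 32, 1]  # inputs, two intermediate layers, WQI output
layers = [init_layer(a, b, rng) for a, b in zip(layer_sizes, layer_sizes[1:])]


def predict(features):
    """Forward pass: ReLU on both intermediate layers, linear output (assumption)."""
    hidden1 = dense(features, *layers[0], activation=relu)
    hidden2 = dense(hidden1, *layers[1], activation=relu)
    return dense(hidden2, *layers[2])[0]


print(predict([1.0] * 12))
```

With untrained weights the output is meaningless as a WQI estimate; the sketch only shows how the twelve standardized inputs are propagated to the single output node.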

Figure 7 | Architecture of the artificial neural network for the prediction of the water quality index of Nokoué lake

2.6. Performance evaluation metrics used

We selected four metrics to quantify the performance of the two models built to predict the water quality index of Nokoué lake. These metrics are given by equations (8) to (11): the coefficient of determination (R²), the root mean square error (RMSE), the mean absolute error (MAE) and the explained variance score (EVS) (Dahal et al. 2021).

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( Q_{obs}^{i} - Q_{sim}^{i} \right)^2}{\sum_{i=1}^{n} \left( Q_{obs}^{i} - \bar{Q}_{obs} \right)^2} \tag{8}$$

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} \left( Q_{obs}^{i} - Q_{sim}^{i} \right)^2}{n}} \tag{9}$$

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| Q_{obs}^{i} - Q_{sim}^{i} \right| \tag{10}$$

$$EVS = 1 - \frac{Var\{Q_{obs} - Q_{sim}\}}{Var\{Q_{obs}\}} \tag{11}$$

where n is the number of observations or simulations; $Q_{obs}^{i}$ and $Q_{sim}^{i}$ are the observed and simulated values of the quality index at time step i; and $\bar{Q}_{obs}$ is the average of the observations.
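
Equations (8) to (11) translate directly into Python. The sketch below uses only the standard library; the function names and the four observed and simulated values are illustrative and are not data from the study.

```python
import math
import statistics


def r2_score(obs, sim):
    """Coefficient of determination, equation (8)."""
    mean_obs = statistics.fmean(obs)
    ss_res = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def rmse(obs, sim):
    """Root mean square error, equation (9)."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def mae(obs, sim):
    """Mean absolute error, equation (10)."""
    return sum(abs(o - s) for o, s in zip(obs, sim)) / len(obs)


def explained_variance(obs, sim):
    """Explained variance score, equation (11)."""
    residuals = [o - s for o, s in zip(obs, sim)]
    return 1.0 - statistics.pvariance(residuals) / statistics.pvariance(obs)


# Made-up observed and simulated WQI values for illustration only.
observed = [30.0, 32.0, 34.0, 36.0]
simulated = [31.0, 31.0, 35.0, 35.0]
print(r2_score(observed, simulated), rmse(observed, simulated),
      mae(observed, simulated), explained_variance(observed, simulated))
```

R² and EVS coincide in this example because the residuals have zero mean; they differ when the model is biased.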

3. RESULTS AND DISCUSSION

The results obtained in this study are analyzed, interpreted and presented in tables and graphs in order to give a general idea of the statistical analyses and of the performance of the models in predicting the water quality index of Nokoué lake.

3.1. Results of the random forest model

The observed and predicted values from the training and validation phases of the RF model are shown graphically in Figure 8. The RF results also indicate that the selected combination of three hyperparameters performs much better than the other combinations tested. The best combination identified by the grid search method for the RF model is: 100 for the number of trees used, 4 for the maximum depth of a tree (the maximum number of entities) and 1 for the minimum number of samples required at a leaf node. The metric values are summarized in Table 4. For RF they are R² = 0.98, RMSE = 0.12, MAE = 0.14 and EVS = 0.98, which means that the predictors together account for 98% of the variation in WQI in the training phase. Figures 9 and 10 show a good correlation between the observed values and the values predicted by the random forest model during training and validation.

Figure 8 | Predicted and observed water quality index of the random forest model

Figure 9 | Observed vs predicted values of the random forest model

Figure 10 | Random forest training and testing plot

3.2. Results of the neural network model

The comparison between the predicted water quality index and the observed data shows that the predictions of the neural network model are only moderately correlated with the observations. This means that the artificial neural network is less effective at predicting the water quality index. The values predicted by the artificial neural network model in Figure 11 show overfitting and underfitting in some places. This is because artificial neural network models generally require a large volume of training data (Weng 2022). Figures 12 and 14 show the change in training and validation error with the number of iterations of the artificial neural network model. An early stopping process was used to optimize the model: it monitors the performance of the model on a set of test data and stops the learning procedure once the training and validation errors reach a constant value beyond a certain number of iterations. We recorded the best predictive results with 100 iterations, with an estimated loss function value of 4.29 in training and 5.59 in validation. We used stochastic gradient descent (SGD) and adaptive moment estimation (Adam) as the optimization algorithms to update the network weights. The training and validation metrics in Table 4 demonstrate that the RF model performs better than the ANN, which reached R² = 0.95, RMSE = 0.72, MAE = 0.21 and EVS = 0.70 in the training phase. A difference in performance between RF and the ANN was thus already recorded during training (see Table 4). The determination coefficients of the neural network and the random forest are similar on the training data (0.95 and 0.98). On the other hand, a large difference is observed on the testing dataset (0.57 and 0.80), which allows the random forest model to be selected over the neural network model for the prediction of the water quality index of Nokoué lake.
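
The early stopping idea described above can be sketched as a patience rule on the validation loss. The patience and tolerance parameters, the function name and the loss history are hypothetical; the manuscript does not report the exact stopping criterion or settings used.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the epoch at which training stops.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs (illustrative rule).
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss      # new best validation loss: reset the counter
            waited = 0
        else:
            waited += 1      # no improvement at this epoch
            if waited >= patience:
                return epoch
    return len(val_losses)   # never triggered: train for all epochs


# Made-up validation losses that flatten out after the fifth epoch.
val_loss_history = [9.0, 7.5, 6.4, 5.9, 5.6, 5.6, 5.7, 5.6, 5.65, 5.6]
stop_epoch = early_stopping_epoch(val_loss_history, patience=3)
print(stop_epoch)
```

With a patience of three epochs, training on this toy history stops at epoch 8, three epochs after the last improvement.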

Figure 11 | Predicted and observed water quality index of the artificial neural network

Figure 12 | Comparison of the training loss (loss) and validation loss (val_loss)

Figure 13 | Observed vs predicted values of the artificial neural network

Figure 14 | Artificial neural network training and testing plot

Table 4 | Performance measures obtained using training and testing datasets

                 Training dataset               Testing dataset
Models           R2     RMSE   MAE    EVS       R2     RMSE   MAE    EVS
Random Forest    0.98   0.12   0.14   0.98      0.80   0.19   0.23   0.74
ANN              0.95   0.72   0.21   0.70      0.57   0.75   0.25   0.69

4. CONCLUSION

This work has shown that statistical analyses of physico-chemical water parameters can be used to determine the water quality index. The study shows that machine learning models can predict the water quality index of Nokoué lake with high accuracy from a large amount of information on physico-chemical parameters. Such predictions can provide early warnings when the water quality changes and can help reduce the adverse consequences of poor water quality. Herein, the RF and ANN approaches were introduced to estimate the water quality index of Nokoué lake using 12 parameters. The presented RF model estimated the water quality index with relatively minor prediction errors, showing efficient and robust performance: its R2 was 0.98 at the training phase and 0.80 at the testing phase. Even though the outcomes seem reasonable, applications based on water quality parameters are quite sensitive to the error level. For better decision making based on the model results, we suggest collecting more samples in order to improve the quality of the model results. We also recommend experimenting with other models in order to compare their results with those of our models.
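One way to compare further models on the same data is a small model-agnostic evaluation harness. The sketch below is a minimal illustration under stated assumptions: it uses k-fold cross-validation, which is not the validation scheme described in this study, and the two models (`fit_mean`, `fit_nearest_neighbour`) and the data are synthetic toy stand-ins, not the RF or ANN models and not Nokoué lake measurements. All names are hypothetical.

```python
import math


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k interleaved folds (sample i goes to fold i % k)."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    return [list(range(fold, n_samples, k)) for fold in range(k)]


def cross_validated_rmse(fit, X, y, k=5):
    """Average held-out RMSE of a model over k folds.

    `fit(X_train, y_train)` must return a function that maps one sample
    (a list of features) to a predicted value.
    """
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        squared_errors = [(y[i] - predict(X[i])) ** 2 for i in test_idx]
        scores.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return sum(scores) / len(scores)


def fit_mean(X_train, y_train):
    """Toy baseline: always predict the mean of the training targets."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean


def fit_nearest_neighbour(X_train, y_train):
    """Toy model: predict the target of the closest training sample."""
    def predict(x):
        distances = [
            sum((a - b) ** 2 for a, b in zip(x, row)) for row in X_train
        ]
        return y_train[distances.index(min(distances))]
    return predict


# Synthetic data for illustration only (two features, one target).
X = [[float(i), float(i % 3)] for i in range(20)]
y = [2.0 * row[0] + row[1] for row in X]

for name, fit in [("mean baseline", fit_mean),
                  ("nearest neighbour", fit_nearest_neighbour)]:
    print(name, round(cross_validated_rmse(fit, X, y, k=5), 3))
```

Because every candidate is passed in as a `fit` function, any additional model can be scored on exactly the same folds, which keeps the comparison fair as more samples are collected.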
ACKNOWLEDGEMENTS

This work is supported by the African Center of Excellence for Water and Sanitation (C2EA) project, funded in part by the World Bank and the French Development Agency.

We would particularly like to thank the Beninese Government for its impetus in building the capacities of young people in the water sector through the National Water Institute (INE).
CONFLICT OF INTEREST

The authors declare that they have no conflict of interest in the publication of this article.
DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.
Supplementary Material

Cover letter

Mindful of the responsibilities and demands placed on the Journal of Hydroinformatics, I am willing to spend some time refining the focus of the article on specific knowledge gaps in the international literature and extending it to highlight any new methodological, conceptual or practical implications.

I hold a Design Engineer Diploma in Hydrology from the AGRHYMET Regional Center in Niamey and a Master's degree in Computer Science (Software Engineering and Computerized Information Systems option) from Joseph Ki-Zerbo University in Burkina Faso, and I am currently a doctoral student at the University of Abomey-Calavi with funding from the World Bank. My interest has focused on the Journal of Hydroinformatics because its scope covers our field of study, the monitoring of water quality. As I am always looking for a journal where the results of our research will be beneficial to different readers, it would be a great pleasure for me to see our article accepted in your journal. I chose this journal first of all because, after completing the work on the second objective of our thesis, I wanted to publish this article in a journal indexed by Scopus.

Then, in the long term, and with a few years of experience, my ambition, admittedly modest, is to become a researcher with articles published in very good journals such as the Journal of Hydroinformatics.

My choice is particularly focused on your prestigious journal because of the quality of its publications and the diversity of the fields it covers. My article highlights the usefulness of artificial intelligence models in the prediction of pollution risks through the prediction of the water quality index. This research belongs to the engineering of hydro-system modeling.

Finally, my personal and professional qualities (rigor, critical thinking, dynamism, availability, strong adaptability, and a taste for initiative and teamwork) are assets that strengthen my resolve to submit my article to your journal, which will enable me to excel further in the professional environment.

Aware of the responsibilities and requirements placed on your new contributors, and hoping for a favorable response to my submission, I am available to make any necessary corrections to my article once it is accepted for publication. I thank you for the attention you give to my submission and, in the hope of having a good experience with you, I ask you, Ladies and Gentlemen, to accept my most distinguished greetings.

DABIRE Namwinwelbere
