
Journal of Physics: Conference Series

PAPER • OPEN ACCESS

To cite this article: Mustakim et al 2021 J. Phys.: Conf. Ser. 1783 012020



Annual Conference on Science and Technology Research (ACOSTER) 2020 IOP Publishing
Journal of Physics: Conference Series 1783 (2021) 012020 doi:10.1088/1742-6596/1783/1/012020

The Classification Status of River Water Quality in Riau Province
Using Modified K-Nearest Neighbor Algorithm with STORET
Modeling and Water Pollution Index

Mustakim1*, Rosdina2, Dian Ramadhani3, M. Afdal4, Medyantiwi Rahmawita5


1,2,3,4,5 Department of Information System, Faculty of Science and Technology, Universitas
Islam Negeri Sultan Syarif Kasim Riau, Pekanbaru, Indonesia
1,2,3,4 Puzzle Research Data Technology, Faculty of Science and Technology, Universitas
Islam Negeri Sultan Syarif Kasim Riau, Pekanbaru, Indonesia

*mustakim@uin-suska.ac.id

Abstract. The Pollution and Environmental Damage Control Division of the Department of
Environment and Forestry plays an active role in monitoring water quality in Riau Province. The
rivers currently monitored and managed are the Kampar, Siak and Indragiri Rivers. The division
calculates river quality status manually in Microsoft Excel, which is slow for information that
should be processed quickly. It must also choose an appropriate calculation for determining water
quality status; of the many formulas set by the government, the most commonly used are the STORET
method and the Pollution Index. To address this classification problem, this study proposes a
learning method that can predict or determine the status of water quality with a data mining
classification technique, namely Modified K-Nearest Neighbor (MKNN), a modification of K-NN. The
MKNN algorithm produced its highest accuracy of 85.10% at K = 5 when STORET results were used as
training data. With Pollution Index results as training data, the highest accuracy was 76.92% at
K = 1. Attribute analysis shows that the attributes that influence the determination of river
water quality are BOD, COD, NH3, Fecal Coli and Total Coli. The Division of Environmental
Pollution can take this result into consideration when addressing and reducing pollutant loads
that exceed quality standards.

1. Introduction
River water plays a very important role in the lives of humans and other living things. It has long
been used to meet daily needs such as bathing and washing, for transportation between areas, for
cultivation, fishing, industrial production, agricultural irrigation and clean water supply, and as a
source of freshwater fisheries [1]. Maintaining river water quality requires monitoring [2]. Under
Government Regulation of the Republic of Indonesia Number 82 Year 2001 Concerning Water Quality
Management and Water Pollution Control, the Government is responsible for monitoring river water
quality. The government agency that plays an active role in monitoring water quality in Riau Province
is the Division of Pollution and Environmental Damage Control of the Department of Environment and
Forestry [3].

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd

One important role of this agency is to prevent pollution and to manage and monitor river water quality
regularly. The rivers currently monitored and managed by the Department of Environment and Forestry
of Riau Province are the Kampar, Siak and Indragiri Rivers. The river water quality status cannot be
read directly from the collected data; further processing is required. After monitoring, the
Environmental Pollution Department must compile the data again and calculate the status of river water
quality manually in Microsoft Excel, which is slow for information that should be processed quickly.
The division must also choose an appropriate calculation for determining the status of water quality.
Of the many formulas that the government has defined for calculating water quality status, the most
commonly used are the STORET method and the Pollution Index [4].
Data mining is a field of computer science that combines computational processes and statistical
techniques such as clustering, classification and pattern finding to extract information from large
datasets and transform it into an understandable format [5]. Several studies have addressed the
classification of water quality status. Hamidi et al. (2017) applied the Learning Vector Quantization
(LVQ) algorithm to the classification of river water quality and obtained an average accuracy of
81.13% [6]. Alamelu et al. (2013) evaluated the correctness of webpage classification results by
accuracy and found that MKNN performed better than KNN. MKNN had a lowest accuracy of 92.05% and a
highest of 97.60% for k from 9 to 13, whereas KNN had a lowest accuracy of 82.14% over the same range
of k. MKNN was therefore recommended for improving webpage classification [7].
One of the standards for determining the status of water quality is Storage and Retrieval (STORET).
STORET is one of the methods used to measure river water quality, but it requires considerable time
and cost. Its weakness is that it needs sufficient data: if a single value is missing, the maximum,
minimum and average values used to determine the status of water quality cannot be calculated [8].
The Pollution Index (PI) method is also used to calculate river quality status. As an index-based
approach, it is built on two quality indices. The first is the average index (IR), which shows the
average pollution level of all parameters in one observation; the second is the maximum index [9].
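As a rough illustration of how the two indices can be combined, the sketch below assumes the formulation commonly associated with the Pollution Index, PI_j = sqrt(((C_i/L_ij)_M^2 + (C_i/L_ij)_R^2) / 2), where C_i is the measured concentration of parameter i, L_ij is its quality standard for water use j, M denotes the maximum ratio and R the average ratio. The formula, the function name `pollution_index` and the sample values are assumptions made for illustration and are not taken from this paper; any adjustment that the official procedure applies to individual ratios is omitted.

```python
import math


def pollution_index(concentrations, standards):
    """Return (PI, average index, maximum index) for one observation.

    Assumed formula: PI = sqrt((max_ratio**2 + mean_ratio**2) / 2).
    """
    if len(concentrations) != len(standards) or not concentrations:
        raise ValueError("need one quality standard per measured parameter")
    # Ratio of each measured concentration C_i to its quality standard L_ij.
    ratios = [c / l for c, l in zip(concentrations, standards)]
    average_index = sum(ratios) / len(ratios)  # (C_i / L_ij)_R
    maximum_index = max(ratios)                # (C_i / L_ij)_M
    pi = math.sqrt((maximum_index ** 2 + average_index ** 2) / 2)
    return pi, average_index, maximum_index


# Hypothetical concentrations and standards for three parameters.
pi, avg_idx, max_idx = pollution_index([4.0, 30.0, 0.6], [2.0, 10.0, 0.5])
print(round(pi, 3), round(avg_idx, 3), round(max_idx, 3))
```

Because the maximum ratio enters the formula directly, a single parameter far above its standard raises the index even when the average ratio is low.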
The water pollution index is useful for providing a quick and simple initial assessment of the status
of water quality. This assessment must be followed by regular water quality monitoring of the water
resources, which is needed to assess water quality for ecosystem health. The water pollution index is
an approach that greatly reduces data volume and simplifies the expression of water quality status
[10]. The calculation of the water quality index is based on a number of physico-chemical and
bacteriological parameters [11].
To address this classification problem, this study proposes a learning method that can help predict
or determine the status of water quality using classification techniques from data mining. One such
algorithm is Modified K-Nearest Neighbor (MKNN), a classification algorithm that improves on the
performance of its predecessor, K-Nearest Neighbor (KNN), which uses the nearest neighbors in the
training data. MKNN determines its result with a different procedure from K-NN, and its advantage is
a marked improvement in accuracy over the K-NN method [12]. MKNN improves the performance of K-Nearest
Neighbor by using strong neighbors in the training data, which are detected through a validation
process [13]. Its main idea is to classify test samples according to the labels of their neighbors.
The method is a kind of weighted KNN in which the weights are determined by a different procedure.
Experiments show a very good increase in accuracy compared to the KNN method [5].
Applying the MKNN algorithm to river water data, with training data taken from the results of the
STORET method and the Water Pollution Index (WPI), yields the water quality class of each record. The
community needs information on river water quality in order to know the status of water pollution
around where they live. The same information supports the relevant government agencies and decision
makers in preventing river water pollution, and raises the awareness of industrial centers about
protecting the surrounding environment. Index-based river water quality management gives decision
makers an alternative way to assess the quality of a water body for its designated use and to take
action to improve quality when water quality declines.

2. Materials and Method


The research methodology consists of data collection, data preprocessing, training and testing data,
accuracy of MKNN with STORET, accuracy of MKNN with the Water Pollution Index, and analysis, as shown
in Figure 1.

Figure 1. Research Methodology

2.1. K-Nearest Neighbor (KNN)


Among instance-based learning algorithms, the K-NN method is used for classification and grouping
problems. With the K-NN algorithm, the class of an unknown sample can be determined from what the
system has previously learned [6]. The K-Nearest Neighbor classifier is transparent, consistent,
straightforward, easy to understand, tends to give results of the desired quality, and is easier to
implement than most other machine learning techniques, especially when there is little or no prior
knowledge about the data distribution [12].
In the K-Nearest Neighbor classification technique, unknown data are classified by analyzing the
classes of their nearest neighbors. A fixed number k of nearest neighbors, where k is a positive
integer, is allowed to vote on the class of the unknown data. When k = 1, the unknown data are
assigned the class of the single closest training record. KNN is an intuitive yet effective machine
learning method for solving conventional classification problems [14][15]. Proximity is measured as
the distance between a new unclassified instance and the classified instances in the training
dataset. One of the most widely used metrics is the Euclidean distance [3].
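The voting procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function names `euclidean` and `knn_predict`, the two-attribute records and the class labels attached to them are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two records with the same attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training records."""
    # Indices of the training records, ordered from nearest to farthest.
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # On a tie, the class that appeared first among the nearest neighbors wins.
    return votes.most_common(1)[0][0]


# Hypothetical toy data with two attributes per record.
train_x = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.5), (6.0, 5.0)]
train_y = ["lightly polluted", "lightly polluted",
           "heavily polluted", "heavily polluted", "heavily polluted"]

print(knn_predict(train_x, train_y, (1.1, 1.0), k=3))
```

With k = 1 the function reduces to returning the class of the single closest training record, as described in the text.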

2.2. Modified K-Nearest Neighbor (MKNN)


The Modified K-Nearest Neighbor (MKNN) algorithm is a development of the K-Nearest Neighbor (KNN)
method [16] that achieves better accuracy than KNN. It introduces a new value, validity, which is
calculated for each training set document from its top nearest neighbors and stored with the training
set [17].
In the MKNN algorithm, each sample in the training data set must be validated in the first step. The
validity of each point is calculated from its neighbors, and classification then uses weight voting
together with the validity of the data points. This step is carried out for all samples. This
development of the K-NN method is very efficient because it reduces the number of data points and
overcomes low accuracy [18]. The stages of MKNN are described as follows.
The distance is calculated with Equation 1 [19][7].

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1)$$
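The validation and weight-voting stages are not spelled out in this section, so the sketch below follows the formulation usually attributed to the original MKNN proposal [11]: the validity of a training record is the fraction of its h nearest training neighbors that share its label, and each of the k nearest neighbors of a query votes with weight validity / (distance + 0.5). The smoothing constant 0.5, the neighborhood size `h`, the function names and the toy data are assumptions for illustration, not details confirmed by this paper.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def validity(train_x, train_y, h):
    """Fraction of each training record's h nearest neighbors sharing its label."""
    scores = []
    for i, point in enumerate(train_x):
        others = [j for j in range(len(train_x)) if j != i]
        others.sort(key=lambda j: euclidean(point, train_x[j]))
        same = sum(1 for j in others[:h] if train_y[j] == train_y[i])
        scores.append(same / h)
    return scores


def mknn_predict(train_x, train_y, query, k, h, alpha=0.5):
    """Weighted vote among the k nearest records; weight = validity / (d + alpha)."""
    # Validity is recomputed on every call to keep the sketch short.
    val = validity(train_x, train_y, h)
    dist = [euclidean(x, query) for x in train_x]
    nearest = sorted(range(len(train_x)), key=lambda i: dist[i])[:k]
    totals = defaultdict(float)
    for i in nearest:
        totals[train_y[i]] += val[i] / (dist[i] + alpha)
    return max(totals, key=totals.get)


# Hypothetical toy data; the record at index 3 carries a deliberately noisy label.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.05),
           (5.0, 5.0), (5.5, 4.5), (6.0, 5.0)]
train_y = ["lightly polluted", "lightly polluted", "lightly polluted", "heavily polluted",
           "heavily polluted", "heavily polluted", "heavily polluted"]

print(mknn_predict(train_x, train_y, (1.1, 1.0), k=3, h=3))
```

In this toy set the noisy record is the closest one to the query, but none of its own neighbors share its label, so its validity is zero and it contributes nothing to the vote. That is the sense in which the method relies on strong neighbors.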

3. Results and Analysis


In this study, the analysis is limited to three rivers in Riau Province: the Kampar, Siak and
Indragiri Rivers. The status of river water quality is determined in order to see the condition of
water quality every year. Knowing this condition allows the government to prevent, control and reduce
water pollution caused by parameters with very high pollutant content, and provides a yearly
benchmark for water quality. Previously, river water quality was measured and determined using manual
methods such as the Storage and Retrieval (STORET) method and the Water Pollution Index (WPI).
The calculated water quality data must be collected annually and sent to the Central Government for
the Environment, which uses them in making decisions and policies to control water pollution in
every province in Indonesia and to manage river water quality according to its designation so that it
stays in its natural condition. The classification technique proposed here is Modified K-Nearest
Neighbor (MKNN). MKNN is run separately on training data from the STORET calculation and on training
data from the Water Pollution Index (WPI), both derived from monitoring data of the Kampar, Siak and
Indragiri Rivers.

3.1. The Analysis of River Water Parameters


Research data on the river water parameters of the Kampar, Siak and Indragiri Rivers were collected
at the Riau Environmental and Forestry Agency, in the Division of Environmental Pollution, and
comprise 624 data records from 2014-2016. Based on an interview at this agency, the attributes were
determined from the class-based water quality criteria of Government Regulation No. 82 of 2001, dated
December 14, 2001, on Management of Water Quality and Water Pollution Control. The data attributes
used were pH, Total Dissolved Solid (TDS), Total Suspended Solid (TSS), Dissolved Oxygen (DO),
Biological Oxygen Demand (BOD), Chemical Oxygen Demand (COD), Nitrite (NO2), Nitrate (NO3), Ammonia
(NH3), Free Chlorine, Total Phosphate (TP), Phenol, Oils and Fats (M&L), Detergents (MBAS), Fecal
Coli, Total Coli, Hydrogen Sulfide (H2S), Iron (Fe), Cadmium (Cd), Zinc (Zn), Copper (Cu) and Lead
(Pb).

3.2. The Analysis of Training Data and Data Testing


In the simulation, the dataset was divided into 416 training records and 208 testing records. The
classes used in the study data were (1) 0 ≤ PIj ≤ 1.0 (good condition); (2) 1.0 < PIj ≤ 5.0 (lightly
polluted); (3) 5.0 < PIj ≤ 10.0 (medium polluted); and (4) PIj > 10.0 (heavily polluted).
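The four classes can be expressed as a simple threshold lookup. In this sketch a value that falls exactly on a boundary is assigned to the cleaner class, which is an assumption, and the function name `water_quality_class` is hypothetical.

```python
def water_quality_class(pi):
    """Map a Pollution Index value PIj to one of the four status classes."""
    if pi < 0:
        raise ValueError("the Pollution Index cannot be negative")
    if pi <= 1.0:
        return "good condition"
    if pi <= 5.0:
        return "lightly polluted"
    if pi <= 10.0:
        return "medium polluted"
    return "heavily polluted"


# Hypothetical index values, one per class.
for value in (0.8, 3.2, 7.5, 12.0):
    print(value, water_quality_class(value))
```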
The K simulation on MKNN calculated the accuracy from a confusion matrix over 10 trials (K = 1 to
K = 10). The final results of the K simulation on MKNN can be seen in Figure 2. The K value with the
highest accuracy is K = 5, with an accuracy of 93.61%, and the K value with the lowest accuracy is
K = 1, with an accuracy of 76.59%. The average accuracy of the K simulation on MKNN from K = 1 to
K = 10 is 89.35%. Cross validation was used to determine the best parameter for the model by testing
accuracy on the testing data. The model and "K" parameter chosen were those with the highest accuracy
and the lowest error. Because the total number of data records is 624, the data were divided equally
into 3 cross validation models. Cross validation took 10 tests to determine the value of the
parameter "K" and a good model for MKNN. The results of cross validation testing can be seen in
Table 1.
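The accuracy computation and the choice of K can be sketched as follows. The helper names and the four-record example are hypothetical; the list `reported` holds the per-K accuracies given for Figure 2, and the last lines recover the best K and the average from them (the plain mean is 89.355, which the text reports as 89.35%).

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Percentage of records on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Hypothetical predictions for four records and two classes.
cm = confusion_matrix(["light", "light", "heavy", "heavy"],
                      ["light", "heavy", "heavy", "heavy"],
                      ["light", "heavy"])
print(cm, accuracy(cm))

# Accuracy (%) for K = 1..10 as reported for Figure 2.
reported = [76.59, 91.48, 91.48, 91.48, 93.61, 91.48, 89.36, 89.36, 87.23, 91.48]
best_k = max(range(1, 11), key=lambda k: reported[k - 1])
mean_accuracy = sum(reported) / len(reported)
print(best_k, mean_accuracy)
```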

[Figure 2: bar chart of accuracy (%) for K = 1 to K = 10. Values shown: K=1 76.59, K=2 91.48,
K=3 91.48, K=4 91.48, K=5 93.61, K=6 91.48, K=7 89.36, K=8 89.36, K=9 87.23, K=10 91.48.]

Figure 2. Simulation of the Accuracy of K Value in the MKNN Algorithm

Table 1. Results of Cross Validation Accuracy Measurement


K           Cross 1 (%)   Cross 2 (%)   Cross 3 (%)   Average (%)
1           85.57         80.76         82.69         83.00
2           82.21         80.28         80.28         80.92
3           83.65         81.25         81.25         82.53
4           81.25         79.80         79.80         80.28
5           83.17         80.76         82.21         82.04
6           82.21         80.28         79.80         80.76
7           81.73         81.25         80.76         81.24
8           81.73         79.32         80.28         80.44
9           81.73         80.76         81.73         81.40
10          81.73         80.28         80.28         80.76
Cross Avg   82.49         80.47         80.90         81.28

Based on the cross validation results, the best "K" parameter is K = 1, with an average accuracy of
83.00%, and the best "Cross" model is Cross 1, with an average accuracy of 82.49%. The training data
of Cross 1 were therefore used as training data for the MKNN algorithm with K = 1. The weight voting
(WV) value was obtained with Equation 2 from the distance between each training record and the
testing data (TT), together with the validity of each training record. Once the weight voting values
were obtained, the K highest values were selected, here with the predetermined K = 1.
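The three-model cross validation can be sketched as below, assuming the 624 records are cut into three contiguous, equally sized folds, each used once as testing data. How the records were actually assigned to the three models is not stated, so the contiguous split and the function names `three_fold_splits` and `select_k` are assumptions.

```python
def three_fold_splits(n_records, n_folds=3):
    """Return (train_indices, test_indices) pairs, one per fold.

    Assumes contiguous folds; leftover records (when n_records is not
    divisible by n_folds) are never used for testing.
    """
    fold_size = n_records // n_folds
    splits = []
    for f in range(n_folds):
        start, stop = f * fold_size, (f + 1) * fold_size
        test = list(range(start, stop))
        train = [i for i in range(n_records) if i < start or i >= stop]
        splits.append((train, test))
    return splits


def select_k(accuracy_by_k):
    """Return the K whose mean accuracy over the folds is highest."""
    return max(accuracy_by_k,
               key=lambda k: sum(accuracy_by_k[k]) / len(accuracy_by_k[k]))


splits = three_fold_splits(624)
print([(len(train), len(test)) for train, test in splits])

# Per-fold accuracies (%) for two of the K values listed in Table 1.
print(select_k({1: [85.57, 80.76, 82.69], 5: [83.17, 80.76, 82.21]}))
```

With 624 records each fold holds 208 testing records and leaves 416 for training, which matches the training and testing sizes given earlier.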

4. Conclusion
Classification with the MKNN algorithm using STORET results as training data assigned all 47
predicted records to the heavily polluted class. With the Water Pollution Index, the 208 predicted
records were classified as 108 medium polluted, 99 lightly polluted and one heavily polluted. The
MKNN algorithm applied to the classification of river water quality status produced its highest
accuracy of 85.10% at K = 5 with STORET results as training data. With Pollution Index results as
training data, the highest accuracy was 76.92% at K = 1, with accuracy calculated from a confusion
matrix. The classification used water quality monitoring data of the Siak, Kampar and Indragiri
Rivers from 2014 to 2016, and the MKNN algorithm was able to classify and predict water quality
classes in accordance with the manual algorithm calculations. The results are useful for decision
making in the Division of Pollution and Environmental Damage Control. Attribute analysis shows that
the attributes that influence the determination of river water quality were three chemical
parameters, namely BOD, COD and NH3, and the microbiological parameters Fecal Coli and Total Coli.
The Division of Environmental Pollution of the Department of Environment and Forestry in Pekanbaru
city can take this into consideration when addressing and reducing pollutant loads that exceed
quality standards.

References
[1] R. Karolina and Y. G. C. Sianipar, “The utilization of stone ash on cellular lightweight concrete,”
in IOP Conference Series: Materials Science and Engineering, 2018, vol. 309, no. 1.
[2] R. Karolina and A. L. A. Putra, “The effect of steel slag as a coarse aggregate and Sinabung volcanic
ash a filler on high strength concrete,” in IOP Conference Series: Materials Science and
Engineering, 2018, vol. 309, no. 1.
[3] G. Regulation, “Peraturan Pemerintah Republik Indonesia Nomor 82 Tahun 2001,” Jakarta
Peratur. Pemerintah, pp. 1–32, 2001.
[4] Keputusan Menteri Negara Lingkungan Hidup, “Keputusan Menteri Negara Lingkungan Hidup
Nomor 115 Tentang Pedoman Penentuan Status Mutu Air,” Jakarta Menteri Negara Lingkung.
Hidup, pp. 1–15, 2003.
[5] V. VijayanV and A. Ravikumar, “Study of Data Mining Algorithms for Prediction and Diagnosis
of Diabetes Mellitus,” Int. J. Comput. Appl., vol. 95, no. 17, pp. 12–16, 2014.
[6] R. Agrawal, “A modified K-nearest neighbor algorithm using feature optimization,” Int. J. Eng.
Technol., vol. 8, no. 1, pp. 28–37, 2016.
[7] M. M. Siti Mutrofin, Abidatul Izzah, Arrie Kurniawardhani, “Optimasi Teknik Klasifikasi Modified
K Nearest Neighbor Menggunakan Algoritma Genetika,” J. GAMMA, vol. s3-VII, no. 182, p. 504,
2015.
[8] D. Purwitasari, O. P. Putri, and W. N. Khotimah, “Aturan Asosiasi Dengan Standar Storet Pada
Model Prediksi Parameter Pendukung Uji Kualitas Air Baku,” J. Inf. Syst. Eng. Bus. Intell., vol. 1,
no. 1, pp. 1–8, 2015.
[9] I. dan A. Mutiara, “Penerapan K-Optimal Pada Algoritma Knn Untuk Prediksi Kelulusan Tepat
Waktu Mahasiswa Program Studi Ilmu Komputer Fmipa Unlam Berdasarkan Ip Sampai Dengan
Semester 4,” Klik - Kumpul. J. Ilmu Komput., vol. 2, no. 2, pp. 159–173, 2015.
[10] S. V. Mohan, P. Nithila, and S. J. Reddy, “Estimation of heavy metals in drinking water and
development of heavy metal pollution index,” J. Environ. Sci. Heal. Part A, vol. 31, no. 2, pp. 283–
289, 1996.
[11] H. Parvin, H. Alizadeh, and B. Minaei-bidgoli, “MKNN : Modified K-Nearest Neighbor,” Proc.
World Congr. Eng. Comput. Sci. WCECS, pp. 22–25, 2008.
[12] D. A. Adeniyi, Z. Wei, and Y. Yongquan, “Automated web usage data mining and recommendation
system using K-Nearest Neighbor (KNN) classification method,” Appl. Comput. Informatics, vol.
12, no. 1, pp. 90–108, 2016.
[13] W. Wu, W. Guo, and K.-L. Tan, “Distributed processing of moving k-nearest-neighbor query on
moving objects,” in 2007 IEEE 23rd International Conference on Data Engineering, 2007, pp.
1116–1125.
[14] Rezaei, Alizadeh, and H. Parvin, “An extended MKNN modified K-nearest neighbor,” J. Netw.
Technol., vol. 4, no. 2, pp. 162–168, 2011.
[15] Okfalisa, I. Gazalba, Mustakim, and N. G. I. Reza, “Comparative analysis of k-nearest neighbor and
modified k-nearest neighbor algorithm for data classification,” in Proceedings - 2017 2nd
International Conferences on Information Technology, Information Systems and Electrical
Engineering, ICITISEE 2017, 2018, vol. 2018-Janua.
[16] C. Shi, J. Ma, J. Wu, K. Chen, and B. Wu, “(Bi0. 5Na0. 5) ZrO3 modified KNN-based ceramics:
Enhanced electrical properties and temperature insensitivity,” Ceram. Int., vol. 46, no. 3, pp. 2798–
2804, 2020.
[17] P. Bolaj and S. Govilkar, “Text Classification for Marathi Documents using Supervised Learning
Methods,” Int. J. Comput. Appl., vol. 155, no. 8, pp. 6–10, 2016.
[18] T. Dharani and I. L. Aroquiaraj, “Content Based Image Retrieval System using Feature
Classification with Modified KNN Algorithm,” 2013.


[19] J. Tarapitakwong, B. Chartrungruang, and N. Tantranont, “A Classification Model for Predicting
Standard Levels of OTOP ’ s Wood Handicraft Products by Using the K-Nearest Neighbor,” Int. J.
Comput. Internet Manag., vol. 25, no. 2, pp. 135–141, 2017.
