Water Potability Detection
Project Report
Course ID: CSE 4214
Course Name: Pattern Recognition Lab
Semester: Fall 2020
Dhaka, Bangladesh
September 2021
Water Potability Prediction
Submitted by
Submitted To
Faisal Muhammad Shah, Associate Professor
Farzad Ahmed, Lecturer
Md. Tanvir Rouf Shawon, Lecturer
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
September 2021
ABSTRACT
Access to safe drinking water is essential to health. It is a basic human right and
a component of effective policy for health protection. In some regions, however, safe
drinking water is very difficult to obtain. Recent advances in Machine Learning (ML) make
it possible to predict whether water is safe to drink. For this purpose, a number of
features of a water sample are selected, on the assumption that these characteristics
influence the safety of the water. The features are fed to several ML models that predict
whether the water is safe to drink, and the performance of the models is compared. For this
experiment, the Water Quality Dataset [1] from Kaggle was used to train each model.
Contents
ABSTRACT
List of Tables
1 Introduction
4 Methodology
4.1 Data Acquisition
4.2 Data Exploration
4.3 Data Pre-processing
4.4 Model Creation
4.5 Model Training
4.6 Model Optimization
4.7 Prediction
References
List of Figures
List of Tables
Chapter 1
Introduction
All plants and animals need water to survive; there can be no life on earth without it.
Fresh water is fundamental to human health, prosperity, and security. About 60 percent of
our body weight is water. Our bodies use water in all cells, organs, and tissues to help
regulate body temperature and maintain other bodily functions. Because our bodies lose
water through breathing, sweating, and digestion, it is crucial to rehydrate by drinking
fluids and eating foods that contain water.
Potable water, also known as drinking water, comes from surface and ground sources and
is treated to levels that meet state and federal standards for consumption. The availability
of potable water is an important health and development issue at the national, regional, and
local levels. In some regions, it has been shown that investments in water supply and
sanitation can yield a net economic benefit, since the reductions in adverse health effects
and health care costs outweigh the costs of undertaking the interventions.
Developments in Machine Learning (ML) have made it feasible to infer rules and predict
changes in water potability from a large number of features, automatically revealing hidden
correlations between them. The data used to train the water potability prediction model
was collected from Kaggle [1].
Chapter 2
Predicting the potability of water is important because water is a basic necessity of
life. Over the last decade, researchers have used machine learning algorithms and data
mining techniques to predict water quality. Zhao Fu et al. [2] proposed ways to improve
the performance of water quality prediction models based on the adaptive neuro-fuzzy
inference system (ANFIS). They present an in-depth comparison of the prediction performance
of MLR, ANN, and ANFIS models after applying stratified sampling and wavelet de-noising.
Haghiabi et al. [3] investigated the performance of artificial intelligence techniques,
including artificial neural networks (ANN), the group method of data handling (GMDH), and
support vector machines (SVM), for predicting water quality. To develop the ANN and SVM
models, different transfer and kernel functions were tested, respectively. The results
indicated that both the ANN and SVM models have suitable performance for predicting water
quality components.
Chapter 3
Datasets are central to any machine learning prediction. Samples from different types of
water bodies are used to train the model. They provide measurements of a variety of
properties, such as pH value, hardness, total dissolved solids (TDS), sulfate, and organic
carbon. A variety of sources are available for gathering such information. This chapter
discusses the source and the parameters that have been collected. For this experiment, the
data was gathered from Kaggle [1].
The CSV file of the dataset contains the features and their details. All of them are required
for our water potability prediction, so we had to process the data for our needs through
feature extraction. We have considered the features below for our prediction:
• Chloramines: Chloramines are most commonly formed when ammonia is added to chlorine
to treat drinking water.
• Trihalomethanes: Trihalomethanes (THMs) are chemicals that may be found in water treated
with chlorine.
Because the model cannot learn from the raw, mixed data in the dataset, the data must first
be cleaned and processed into the form the model needs. This is not easy, since many pieces
of information are mixed together in the dataset and only the required ones should be
kept.
Potability Count:
In our data-set there are two classes: (i) potable and (ii) non-potable. The figure below
shows the count of potable and non-potable water samples. From the categorical plot in
Figure 3.4, we can see that potable water accounts for almost 40% of the samples.
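The class balance behind such a count plot can be computed directly. The sketch below is only illustrative: the label values are made up, and the column name Potability (with 1 for potable and 0 for non-potable) is an assumption about how the class is stored.

```python
import pandas as pd

# Hypothetical labels standing in for the class column of the data-set
# (assumed encoding: 1 = potable, 0 = non-potable). Values are made up.
labels = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1], name="Potability")

counts = labels.value_counts().sort_index()                # samples per class
shares = labels.value_counts(normalize=True).sort_index()  # fraction per class
potable_share = shares.loc[1]

print(counts.to_dict())
print(f"potable share: {potable_share:.0%}")
```

On the real data, the same two calls would give the counts that the figure visualizes.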
Density Plotting:
In Figures 3.5 - 3.9, we plot the 9 features of our dataset. These figures show how the
features are related to potability.
(a) Turbidity
Chapter 4
Methodology
Patterns in data can be recognized using a variety of approaches and Machine Learning
algorithms. Developing a machine learning prediction model involves several stages.
Our approach is broken into the following phases: data collection, data pre-processing,
feature extraction, model training, and prediction.
4.1 Data Acquisition
Data acquisition, often abbreviated DAQ, is the process of digitizing data from the
world around us so that it can be displayed, analyzed, and stored in a computer. More
broadly, it is the process of gathering, measuring, and evaluating accurate information for
research using established, approved procedures. Data collection is the first and most
important step in any research project. Depending on the information needed, different
disciplines use different approaches to data gathering. The quantity and quality of the
collected data determine the quality of the output: in general, the more data there is,
the more accurate the prediction.
• Collect data.
A water-quality data-set was obtained from Kaggle [1], which is open to the public for
research. The data-set contains over 3000 entries with 10 columns that describe whether
a sample of water is potable, based on the characteristics of the different substances
dissolved in it.
After collecting the data, we need to prepare it for further steps. Data preparation is the
step in which we put our data into a suitable place and prepare it for use in our machine
learning training.
• Data Exploration
• Data Visualization.
Exploration: Data exploration, also known as exploratory data analysis (EDA), is a process
in which users examine and understand their data using statistical and visualization methods.
This step helps identify patterns and problems in the data-set, as well as decide which
model or algorithm to use in subsequent steps.
Our data-set has 3276 entries and 10 columns: the nine features ’ph’, ’Hardness’, ’Solids’,
’Chloramines’, ’Sulfate’, ’Conductivity’, ’Organic Carbon’, ’Trihalomethanes’, and
’Turbidity’, plus the potability class label.
• Finding Outliers.
Finding missing values: The missing values are shown in Table 4.1.
4.3 Data Pre-processing
In Machine Learning, data pre-processing is a critical step that helps improve data quality
and facilitates the extraction of relevant insights from the data. Data pre-processing refers
to the process of cleaning and organizing raw data to make it suitable for building and
training Machine Learning models.
Our data-set has some missing values, shown in Table 4.1. Specifically, some entries in pH,
Sulfate, and Trihalomethanes are missing.
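A per-column count of missing entries, like the one in the table, can be obtained with pandas. This is a minimal sketch on a made-up frame: the column names follow the report, but the values are illustrative and not taken from the data-set.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; column names follow the report, values are made up.
df = pd.DataFrame({
    "ph": [7.0, np.nan, 8.1, 6.5],
    "Sulfate": [330.0, np.nan, np.nan, 310.0],
    "Trihalomethanes": [60.0, 70.0, np.nan, 55.0],
    "Turbidity": [3.1, 4.0, 2.9, 3.5],
})

missing = df.isnull().sum()   # number of null entries in each column
print(missing[missing > 0])   # show only the columns that have gaps
```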
Drop the null entries: If we drop the rows with null values, 2011 entries remain. This is
not sufficient to work with, so we prefer the second approach, filling the null entries.
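The effect of dropping rows can be checked by comparing row counts before and after. The frame below is a made-up stand-in, so the numbers it prints are not those of the real data-set.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the data-set; only the column names follow the report.
df = pd.DataFrame({
    "ph": [7.0, np.nan, 8.1, 6.5, 7.4],
    "Sulfate": [330.0, 350.0, np.nan, 310.0, 340.0],
    "Potability": [0, 1, 0, 1, 0],
})

dropped = df.dropna()   # keep only the rows with no null in any column
print(len(df), "rows before,", len(dropped), "rows after dropping nulls")
```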
Filling the null entries: To fill the null entries, we first need to look at the
distributions of the features.
If we look at Figure 4.1 and Figure 4.3, we can see that the entries are approximately
normally distributed, so filling the null entries with the mean value should not noticeably
affect the final outcome. In Figure 4.2, however, the entries are not normally distributed,
so filling the null entries of ’Sulfate’ with the mean would affect the final outcome. For
this reason, we first need to remove the outliers of ’Sulfate’.
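Mean imputation itself is a one-line operation in pandas. The sketch below uses made-up values for two of the columns and is only meant to show the mechanics, not the exact code used in this project.

```python
import numpy as np
import pandas as pd

# Illustrative values only; the column names follow the report.
df = pd.DataFrame({
    "ph": [6.0, np.nan, 8.0, 7.0],
    "Trihalomethanes": [60.0, 70.0, np.nan, 80.0],
})

# df.mean() skips nulls, so each gap is filled with the mean of the
# observed values in its own column.
filled = df.fillna(df.mean())
print(filled)
```

Because the mean is sensitive to extreme values, this works best for columns without heavy outliers, which is why ’Sulfate’ is treated separately.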
To remove the outliers, we sort the column and then remove the lowest and highest 1 percent
of the data. Before removing the outliers, the distribution looks like Figure 4.4; after
removing them, it looks like Figure 4.5.
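One way to express this trimming is with the 1st and 99th percentiles as cut-offs, which avoids sorting by hand. The sketch below makes two assumptions: the sulfate values are randomly generated rather than real, and the column has no nulls (in real data, null rows would need to be handled separately because they fail both comparisons).

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in for the 'Sulfate' column (not real data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"Sulfate": rng.normal(333.0, 40.0, size=1000)})

low = df["Sulfate"].quantile(0.01)    # 1st percentile
high = df["Sulfate"].quantile(0.99)   # 99th percentile

# Keep only the rows whose value lies between the two cut-offs.
trimmed = df[(df["Sulfate"] >= low) & (df["Sulfate"] <= high)]
print(len(df), "rows before,", len(trimmed), "rows after trimming")
```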
After processing our data-set, we get around 2500 entries to work with.
• K-Nearest Neighbour.
• Logistic regression.
• Decision Tree.
We implemented these models and observed their training and test accuracy. Then we
optimized the hyperparameters of the algorithms, both manually and using scikit-learn’s
GridSearchCV.
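The training and comparison loop can be sketched with scikit-learn as follows. The data here is synthetic (generated with make_classification as a stand-in for the nine water-quality features), and the split ratio, random seeds, and model settings are assumptions made for illustration, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the nine features and the binary potability label.
X, y = make_classification(n_samples=500, n_features=9, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

models = {
    "K-Nearest Neighbour": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the given data.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f}, test={scores[name][1]:.3f}")
```

A large gap between training and test accuracy, as an unpruned decision tree typically shows, is a sign of overfitting and a reason to tune the hyperparameters.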
4.7 Prediction
Prediction refers to the output of an algorithm after it has been trained on a training
data-set and applied to new test data. For each record in the new data, the algorithm
generates probable values for an unknown variable, allowing the model builder to determine
what that value will most likely be. After training the various classification algorithms
on the training data, the models were applied to the test data to see how well each
algorithm had learned the data’s underlying pattern.
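In scikit-learn terms, a fitted classifier exposes both the most likely class and the class probabilities for each new record. The following sketch uses synthetic data and Logistic Regression as an example model; both are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; 9 features to mirror the water-quality features.
X, y = make_classification(n_samples=300, n_features=9, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

labels = clf.predict(X_test)        # most likely class for each test record
probs = clf.predict_proba(X_test)   # probability of each class, per record

print(labels[:5])
print(probs[:5].round(3))
```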
Chapter 5
We tried different models, namely Logistic Regression, Decision Tree, K-nearest neighbour
classifier, and Support Vector Machine, and calculated the Accuracy, Precision, Recall,
F1-score, and Confusion Matrix for each model.
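All of these metrics are available in sklearn.metrics. The example below uses ten hand-written labels rather than real model output, so that each number can be checked by hand: there are 3 true positives, 4 true negatives, 1 false positive, and 2 false negatives.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hand-written example labels (1 = potable, 0 = non-potable), not real results.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)     # (TP + TN) / all    = 7/10
precision = precision_score(y_true, y_pred)   # TP / (TP + FP)     = 3/4
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)     = 3/5
f1 = f1_score(y_true, y_pred)                 # harmonic mean of the two
cm = confusion_matrix(y_true, y_pred)         # rows: true, columns: predicted

print(accuracy, precision, recall, round(f1, 4))
print(cm)
```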
First, we implemented the Logistic Regression and K-nearest neighbour models on both the
data-set whose null values were dropped and the data-set whose null values were filled.
Confusion Matrix: The confusion matrix generated with the null values dropped is shown
in Figure 5.2, and the confusion matrix generated with the null values filled is
shown in Figure 5.3.
Choosing the right hyperparameter value (N) for the KNN model manually: The testing
accuracy of the KNN model is:
Testing Accuracy (Drop Na) = 63.28437917222964
Testing Accuracy (Fill Na) = 64.03269754768392
Here the value of the hyperparameter N was 5, but it can be any positive integer. The
accuracy varies with the value of N, so which value is best? We tried to find out manually,
using a for loop over N from 1 to 75 and plotting the accuracy for each value.
In Figure 5.4 we can see that the best value of N for our model is somewhere around 35.
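The manual search can be sketched as a loop over N that records the test accuracy for each value. The data below is synthetic, so the best N it reports is unrelated to the value found in our experiment; N corresponds to the n_neighbors parameter of scikit-learn’s KNeighborsClassifier.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the water data-set).
X, y = make_classification(n_samples=400, n_features=9, n_informative=5,
                           random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2)

accuracy_by_n = {}
for n in range(1, 76):   # N = 1, 2, ..., 75
    knn = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    accuracy_by_n[n] = knn.score(X_test, y_test)

# Smallest N that reaches the highest test accuracy.
best_n = max(accuracy_by_n, key=accuracy_by_n.get)
print(best_n, round(accuracy_by_n[best_n], 3))
```

Plotting accuracy_by_n against N gives a curve of the kind shown in the figure. Since N is chosen on the test set here, the reported accuracy is somewhat optimistic; picking N with cross-validation on the training set is a stricter alternative.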
Then we implemented the SVM and Decision Tree models and generated confusion matrices to
validate them.
Next, we used GridSearchCV to tune the hyperparameters (for SVM only) to see whether we
could obtain better accuracy. GridSearchCV is a class in scikit-learn’s model_selection
module. It loops through predefined hyperparameter values and fits the estimator (model)
on the training set for each combination, using cross-validation. In the end, the best
parameters can be selected from the listed hyperparameters.
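A minimal sketch of such a search for SVM is given below. The data is synthetic, and the parameter grid (values of C and gamma with an RBF kernel) and the 5-fold setting are assumptions chosen for illustration; the grid actually used in this project is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (not the water data-set).
X, y = make_classification(n_samples=300, n_features=9, n_informative=5,
                           random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3)

# Hypothetical grid: every combination of these values is tried.
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.1],
    "kernel": ["rbf"],
}

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)              # cross-validates each combination

test_accuracy = search.score(X_test, y_test)   # uses the refitted best model
print(search.best_params_, round(search.best_score_, 3), round(test_accuracy, 3))
```

By default the best combination is refitted on the whole training set, so the search object can be used directly for prediction afterwards.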
After applying GridSearchCV, we obtained an increased accuracy for SVM, as shown in Table
5.3.
Chapter 6
The results of the experiments demonstrate that machine learning models are a good tool
for predicting safe water. Data collection and feature selection are also important factors in
predicting safe drinking-water.
This study could be expanded in the future. More experiments with larger water potability
data sets are planned. This initial study emphasizes the ability of Machine Learning models
to assist humans in making sound decisions when selecting drinking water.
References
[2] Z. Fu, “Water quality prediction based on machine learning techniques,” IEEE, 2020.
[3] A. H. Haghiabi, A. H. Nasrolahi, and A. Parsaie, “Water quality prediction using
machine learning methods,” Water Quality Research Journal, 2018.
Generated using Undergraduate Thesis LaTeX Template, Version 1.4. Department of Computer
Science and Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh.
This project report was generated on Wednesday 29th September, 2021 at 9:38pm.