
Water Potability Prediction

Tasnid Mahin 170104019


Md. Mainul Ahsan 170104020
Kandakar Rezwan Ahmed 170104044

Project Report
Course ID: CSE 4214
Course Name: Pattern Recognition Lab
Semester: Fall 2020

Department of Computer Science and Engineering


Ahsanullah University of Science and Technology

Dhaka, Bangladesh

September 2021
Water Potability Prediction

Submitted by

Tasnid Mahin 170104019


Md. Mainul Ahsan 170104020
Kandakar Rezwan Ahmed 170104044

Submitted To
Faisal Muhammad Shah, Associate Professor
Farzad Ahmed, Lecturer
Md. Tanvir Rouf Shawon, Lecturer
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology

Department of Computer Science and Engineering


Ahsanullah University of Science and Technology

Dhaka, Bangladesh

September 2021
ABSTRACT
Access to safe drinking water is essential to health; it is a basic human right and
a component of effective health-protection policy. In some regions, however, safe
drinking water is difficult to obtain. Recent advances in Machine Learning (ML) make
it possible to predict whether water is safe to drink. For this purpose, a number of
features of a typical water sample were selected, under the assumption that these
characteristics influence the safety of the water. The features were applied to several
ML models that predict water potability, and the models' performance was compared.
The Water Quality dataset [1] from Kaggle was used to train each model.

Contents

ABSTRACT i

List of Figures iii

List of Tables iv

1 Introduction 1

2 Literature Reviews / Background 2

3 Exploratory Data Analysis (EDA) 3


3.1 Data Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

4 Methodology 10
4.1 Data Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.2 Data Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.3 Data Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.4 Model Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.5 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.6 Model Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.7 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

5 Experiments and Results 17

6 Future Work and Conclusion 21

References 22

List of Figures

3.1 Raw Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4


3.2 Value count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.3 Value count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.4 Potable vs. non-potable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.5 pH and Hardness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.6 Solids & Chloramines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.7 Sulfate & Conductivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.8 Organic carbon & Trihalomethanes . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.9 Turbidity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.10 Correlation heatmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

4.1 Density based distribution of pH. . . . . . . . . . . . . . . . . . . . . . . . . . . . 13


4.2 Density based distribution of Sulfate. . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.3 Density based distribution of Trihalomethanes. . . . . . . . . . . . . . . . . . . . 14
4.4 Box plot Before removing Outliers. . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.5 Box plot After removing Outliers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

5.1 Accuracy, Precision, Recall, F1-score. . . . . . . . . . . . . . . . . . . . . . . . . . 18


5.2 Confusion Matrix(Drop Na). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.3 Confusion Matrix(Fill Na). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.4 Accuracy plotting based on N. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

List of Tables

4.1 Missing values exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

5.1 Training Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20


5.2 Testing Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.3 Optimum Testing Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20


Chapter 1

Introduction

All plants and animals need water to survive; there can be no life on Earth without it.
Fresh water is a primary source of human health, prosperity, and security. About 60 percent
of our body weight is made up of water. Our bodies use water in every cell, organ, and
tissue to help regulate body temperature and maintain other bodily functions. Because
our bodies lose water through breathing, sweating, and digestion, it is crucial to rehydrate
by drinking fluids and eating foods that contain water.

Potable water, also known as drinking water, comes from surface and ground sources and
is treated to levels that meet state and federal standards for consumption. The availability
of potable water is an important health and development issue at the national, regional, and
local levels. It has been shown that investments in water supply and sanitation can yield
a net economic benefit, since the reductions in adverse health effects and health-care costs
outweigh the costs of undertaking the interventions.

Machine Learning (ML) developments have made it feasible to infer rules and predict changes
in water potability based on a large number of features, automatically revealing hidden
correlations between the features. The data used to train the water potability prediction
model was collected from Kaggle [1].

Chapter 2

Literature Reviews / Background

Predicting the potability of water is important, since water is a basic need of our
life. Over the last decade, researchers have applied machine learning algorithms and
data-mining techniques to predict potable water. Fu [2] proposed improving the
performance of ANFIS-based water quality prediction models; a detailed comparison of the
prediction performance of MLR, ANN, and ANFIS models is presented after stratified
sampling and wavelet de-noising techniques are applied. Haghiabi et al. [3] investigated
the performance of artificial intelligence techniques, including the artificial neural
network (ANN), the group method of data handling (GMDH), and the support vector machine
(SVM), for predicting water quality. To develop the ANN and SVM, different types of
transfer and kernel functions, respectively, were tested. The results indicated that
both models have suitable performance for predicting water quality components.

Chapter 3

Exploratory Data Analysis (EDA)

Datasets are essential for any machine learning prediction task. Data from different water
bodies is used to train the model; it provides information on features such as pH value,
hardness, TDS, sulfate, and organic carbon. A variety of sources are available for data
collection. This section discusses the sources and the parameters that were collected.
The data is gathered from Kaggle [1].
The CSV file of the dataset contains the features and their details, all of which are
required for our water potability prediction. We therefore processed the data for our
needs, i.e., feature extraction. We considered the following features for our prediction:

• pH value: Indicator of acidic or alkaline condition of water status

• Hardness: Capacity of water to precipitate soap caused by Calcium and Magnesium

• Solids: Total dissolved solids - TDS

• Chloramines: Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water

• Sulfate: Sulfates are naturally occurring substances

• Conductivity: Electrical conductivity

• Organic_carbon: Total amount of carbon in organic compounds in pure water (TOC)

• Trihalomethanes: THMs are chemicals that may be found in water treated with chlorine

• Turbidity: Measure of the light-scattering (cloudiness) properties of water

• Potability: 1 means Potable and 0 means Not potable



Figure 3.1: Raw Dataset

As the model cannot learn from this raw, mixed data, cleaning and processing the data the
model needs is important. Cleaning and processing are not easy, because many pieces of
information are mixed together in the dataset, and only the required ones must be
extracted for our needs.

Class counts:

We counted the number of samples in each of the two classes.

Figure 3.2: Value count
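The count shown in Figure 3.2 comes from a simple value count; a minimal sketch, using a toy frame in place of the Kaggle CSV (whose filename here is an assumption):

```python
import pandas as pd

# Toy stand-in for the Kaggle data; in the project the frame would come
# from pd.read_csv("water_potability.csv") (filename assumed).
df = pd.DataFrame({"Potability": [0, 0, 0, 1, 1, 0]})

# Count samples per class: label 0 = not potable, 1 = potable.
counts = df["Potability"].value_counts()
print(counts[0], counts[1])  # 4 non-potable, 2 potable
```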

Handling missing values:

We worked in both ways: by dropping the missing values and by filling them in.

Figure 3.3: Value count

3.1 Data Visualization

Potability count:
Our dataset contains two classes: (i) potable and (ii) non-potable. Figure 3.4 shows the
count of potable and non-potable water samples; the categorical plot indicates that about
40% of the samples are potable.

Density plotting:
In Figures 3.5 - 3.9 we plot the 9 features of our dataset. These figures show how each
feature is related to potability.

Correlation heatmap: A correlation heatmap is a graphical representation of the
correlation matrix, showing the correlation between the different variables/features.
We plotted the correlations for our dataset; the resulting heatmap is shown in Figure 3.10.
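A heatmap like Figure 3.10 is typically built from pandas' correlation matrix; a hedged sketch with synthetic columns (the plotting call is commented out so the computation stands on its own):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for three of the water-quality features.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["ph", "Hardness", "Solids"])

corr = df.corr()  # pairwise Pearson correlations between features
# To render the heatmap itself (as in Figure 3.10):
#   import seaborn as sns
#   sns.heatmap(corr, annot=True, cmap="coolwarm")
```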

Figure 3.4: Potable vs. non-potable



(a) pH Density (b) Hardness density

Figure 3.5: pH and Hardness

(a) Solids (b) Chloramines

Figure 3.6: Solids & Chloramines

(a) Sulfate (b) Conductivity

Figure 3.7: Sulfate & Conductivity



(a) Organic carbon (b) Trihalomethanes

Figure 3.8: Organic carbon & Trihalomethanes

(a) Turbidity

Figure 3.9: Turbidity



Figure 3.10: Correlation heatmap



Chapter 4

Methodology

Recognizing patterns in data can be done using a variety of approaches and Machine
Learning algorithms. Developing a machine learning prediction model involves several
stages. Our model is broken into the following phases: data collection, data
pre-processing, feature extraction, model training, and prediction.

In our project, we follow this pipeline.



4.1 Data Acquisition

Data acquisition, often referred to as DAQ, is the process of digitizing data from the
world around us so it can be displayed, analyzed, and stored in a computer. The process
of gathering, measuring, and evaluating accurate insights for research, using established
and approved procedures, is known as data acquisition. Data collection is the initial and
most important step in any research project. Depending on the information needed,
different approaches to data gathering are used in different disciplines of study. The
quantity and quality of the collected data determine the efficiency of the output: the
more data there is, the more accurate the prediction.

This step includes the below tasks:

• Identify various data sources.

• Collect data.

• Integrate the data obtained from different sources.

A water-quality dataset was obtained from Kaggle [1], which is open to the public for
research. This dataset contains over 3000 entries with 10 columns, describing whether
a sample of water is potable or not based on the characteristics of the different
substances dissolved in it.
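Loading the Kaggle CSV is a single pandas call; the sketch below substitutes a two-row in-memory CSV with the same 10 columns, since the real file path is an assumption:

```python
import io
import pandas as pd

# In the project: df = pd.read_csv("water_potability.csv")  (path assumed).
# A two-row stand-in with the same column names:
csv = io.StringIO(
    "ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,"
    "Organic_carbon,Trihalomethanes,Turbidity,Potability\n"
    "7.0,200.0,20000.0,7.1,330.0,420.0,14.0,66.0,4.0,1\n"
    ",180.0,18000.0,6.5,,400.0,15.0,,3.5,0\n"  # blank fields become NaN
)
df = pd.read_csv(csv)
print(df.shape)  # (2, 10)
```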

4.2 Data Exploration

After collecting the data, we need to prepare it for the subsequent steps. Data
preparation is the step where we put our data into a suitable form and get it ready for
use in machine learning training.

To process the raw data, the following steps are taken:

• Data Exploration

• Data Visualization.

Exploration: Data exploration, also known as exploratory data analysis (EDA), is a process
where users examine and understand their data with statistical and visualization methods.
This step helps identify patterns and problems in the dataset, as well as decide which
model or algorithm to use in subsequent steps.

Our dataset has 3276 entries and 10 distinct columns: 'ph', 'Hardness', 'Solids',
'Chloramines', 'Sulfate', 'Conductivity', 'Organic Carbon', 'Trihalomethanes',
'Turbidity', and the target 'Potability'.

For this project, our exploration includes:

• Finding missing values.

• Finding Outliers.

Finding missing values: The missing values are shown in Table 4.1.

Table 4.1: Missing values exploration


Feature Total missing entries Percentage of missing
pH 491 14.99
Hardness 0 0
Solids 0 0
Chloramines 0 0
Sulfate 781 23.84
Conductivity 0 0
Organic Carbon 0 0
Trihalomethanes 162 4.95
Turbidity 0 0
Potability 0 0
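Counts like those in Table 4.1 can be produced with `isnull()`; a sketch on a small synthetic frame:

```python
import numpy as np
import pandas as pd

# Small synthetic frame; in the project this would be the full dataset.
df = pd.DataFrame({
    "ph":        [7.0, np.nan, 6.5, np.nan],
    "Sulfate":   [330.0, np.nan, np.nan, np.nan],
    "Turbidity": [4.0, 3.5, 4.2, 3.9],
})

missing = df.isnull().sum()        # missing entries per column
percent = 100 * missing / len(df)  # as a percentage of all rows
print(percent.round(2))
```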

Finding outliers: To find the outliers, we need to visualize the dataset.

4.3 Data Pre-processing

In Machine Learning, data pre-processing is a critical step that helps improve data
quality and facilitates the extraction of relevant insights from the data. Data
pre-processing refers to cleaning and organizing raw data in order to make it appropriate
for building and training Machine Learning models.

Our dataset has some missing values, shown in Table 4.1: some entries in pH, Sulfate, and
Trihalomethanes are missing.

We can work with this dataset in two ways:

• Drop the null entries.

• Fill the null entries.



Figure 4.1: Density based distribution of pH.

Dropping the null entries: If we drop the null values, we are left with 2011 entries,
which is not sufficient to work with. So we prefer the second approach.

Filling the null entries: To fill the null entries, we need to look at the distributions
of the features.

Figures 4.1 and 4.3 show that the pH and Trihalomethanes entries are normally distributed,
so filling their null entries with the mean values will not affect the final outcome. In
Figure 4.2, however, the Sulfate entries are not normally distributed, so filling the null
entries of 'Sulfate' with the mean would affect the final outcome. For this reason, we
first need to remove the outliers of 'Sulfate'.
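Mean imputation for the normally distributed columns can be sketched as follows (toy values; only the column names are taken from the dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ph": [6.0, 8.0, np.nan],
                   "Trihalomethanes": [60.0, np.nan, 80.0]})

# Fill each roughly-normal column with its own mean.
for col in ["ph", "Trihalomethanes"]:
    df[col] = df[col].fillna(df[col].mean())
```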

To remove the outliers, we sort the column and then remove 1 percentile of the data from
each side. The distribution before removing the outliers is shown in Figure 4.4, and
after removing them in Figure 4.5.
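The percentile trimming described above can be done with quantiles instead of an explicit sort; a sketch on synthetic values:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100.0))      # stand-in for the 'Sulfate' column
lo, hi = s.quantile(0.01), s.quantile(0.99)
trimmed = s[(s >= lo) & (s <= hi)]   # drop the extreme 1% on each side
print(len(trimmed))  # 98
```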

After processing, our dataset has around 2500 entries to work with.

Figure 4.2: Density based distribution of Sulfate.

Figure 4.3: Density based distribution of Trihalomethanes.



Figure 4.4: Box plot Before removing Outliers.

Figure 4.5: Box plot After removing Outliers.



4.4 Model Creation

In this project, the following models were implemented:

• Support Vector Machine.

• K-Nearest Neighbour.

• Logistic regression.

• Decision Tree.
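Instantiating these four models in scikit-learn is direct; a sketch with default settings (N = 5 for KNN, matching the report):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# The four classifiers used in this project, with illustrative settings.
models = {
    "Support Vector Machine": SVC(),
    "K-Nearest Neighbour": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
```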

4.5 Model Training

Training a Machine Learning model involves providing an ML algorithm with training data
to learn from. The training data must contain the correct answer, known as the target or
target attribute. The learning algorithm finds patterns in the training data that map the
input data attributes to the target, and it outputs an ML model that captures these
patterns. The dataset was split into train and test sets with an 80/20 ratio before
training the model. Because the target variable is binary (potable or not), classification
algorithms were chosen to create the prediction model.
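The 80/20 split and a single training run might look as follows; synthetic features stand in for the 9 water-quality columns:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))            # stand-in for the 9 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for Potability

# 80/20 train/test split, as in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
test_acc = clf.score(X_test, y_test)
```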

4.6 Model Optimization

We implemented these models and observed the training and testing accuracy. We then
optimized the hyperparameters of the algorithms, both manually and by using
scikit-learn's GridSearchCV.

4.7 Prediction

Prediction refers to the output of an algorithm after it has been trained on a training
dataset and applied to new test data. For each record in the new data, the algorithm
generates probable values for an unknown variable, allowing the model builder to determine
what that value will most likely be. After training the various classification algorithms
on the training data, the models were applied to the test data to see how well each
algorithm had learned the data's underlying pattern.

Chapter 5

Experiments and Results

We tried different models (Logistic Regression, Decision Tree, K-Nearest Neighbour, and
Support Vector Machine) and calculated the Accuracy, Precision, Recall, F1-score, and
confusion matrix for each model.

First, we implemented the Logistic Regression and K-Nearest Neighbour models, both for
the dataset with the null values dropped and for the dataset with the null values filled,
and obtained the Accuracy, Precision, Recall, F1-score, and confusion matrix for each
model. Figure 5.1 shows the Accuracy, Precision, Recall, and F1-score for both versions
of the dataset.

Confusion matrix: The confusion matrix generated with the null values dropped is shown in
Figure 5.2, and the one generated with the null values filled is shown in Figure 5.3.
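A confusion matrix like the ones in Figures 5.2 and 5.3 is one scikit-learn call; a sketch with toy labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]  # actual potability labels
y_pred = [0, 1, 1, 0, 1, 0]  # model predictions

cm = confusion_matrix(y_true, y_pred)  # rows: actual, columns: predicted
print(cm)
```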

Choosing the right hyperparameter value (N) for the KNN model manually: The testing
accuracy of the KNN model was 63.28% with the null values dropped and 64.03% with the
null values filled. Here the hyperparameter N was 5, but it can be any positive integer,
and the accuracy varies with the value of N. So what is the best value? We tried to find
it manually, using a for loop from 1 to 75 and plotting the accuracy for each value of N.
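The manual sweep over N can be sketched as a loop over n_neighbors (synthetic data in place of the water dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 9))
y = (X[:, 0] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

accs = {}
for n in range(1, 76):  # same 1..75 sweep as in the report
    knn = KNeighborsClassifier(n_neighbors=n).fit(X_tr, y_tr)
    accs[n] = knn.score(X_te, y_te)

best_n = max(accs, key=accs.get)  # N with the highest test accuracy
```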

Figure 5.4 shows that the best value of N for our model is somewhere around 35.

We then implemented the SVM and Decision Tree models and generated confusion matrices to
validate them.

The training accuracies of the models are shown in Table 5.1, and the testing accuracies
in Table 5.2. Note that the Decision Tree's perfect training accuracy combined with the
lowest testing accuracy suggests overfitting.



Figure 5.1: Accuracy, Precision, Recall, F1-score.

Figure 5.2: Confusion Matrix(Drop Na).



Figure 5.3: Confusion Matrix(Fill Na).

Figure 5.4: Accuracy plotting based on N.



Table 5.1: Training Accuracy


Model Accuracy
Support Vector Machine 74.58
K-Nearest Neighbour 75.28
Logistic Regression 60.37
Decision Tree 100.00

Table 5.2: Testing Accuracy


Model Accuracy
Support Vector Machine 66.59
K-Nearest Neighbour 64.03
Logistic Regression 62.67
Decision Tree 60.49

Next, we used GridSearchCV to tune the hyperparameters (for SVM only) to see whether we
could reach an optimum accuracy. GridSearchCV is a utility from scikit-learn's
model_selection module. It loops through predefined hyperparameters and fits the
estimator (model) on the training set, so that in the end the best parameters can be
selected from the listed hyperparameters.
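A minimal GridSearchCV run over SVM hyperparameters might look as follows; the grid values here are illustrative, not necessarily the ones used in the project:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 9))
y = (X[:, 0] > 0).astype(int)

# Illustrative grid; 3-fold cross-validation over every combination.
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

best = search.best_params_  # best combination found on the grid
```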

After applying GridSearchCV, we obtained an increased accuracy for SVM, shown in Table
5.3.

Table 5.3: Optimum Testing Accuracy


Model Accuracy
Support Vector Machine 69.75
K-Nearest Neighbour 64.03
Logistic Regression 62.67
Decision Tree 60.49

Chapter 6

Future Work and Conclusion

The results of the experiments demonstrate that machine learning models are a good tool
for predicting safe water. Data collection and feature selection are also important
factors in predicting safe drinking water.
This study could be expanded in the future: more experiments with larger water potability
datasets are planned. This initial study emphasizes the ability of Machine Learning
models to assist humans in making sound decisions.

References

[1] "Water quality dataset." https://www.kaggle.com/adityakadiwal/water-potability.

[2] Z. Fu, "Water quality prediction based on machine learning techniques," IEEE, 2020.

[3] A. H. Haghiabi, A. H. Nasrolahi, and A. Parsaie, "Water quality prediction using
machine learning methods," 2018.
Generated using the Undergraduate Thesis LaTeX Template, Version 1.4. Department of Computer
Science and Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh.

This project report was generated on Wednesday 29th September, 2021 at 9:38pm.

