Mini Project
A Mini Project
On
MINE EXPLOSION
PREDICTION
Submitted in partial fulfilment
of the Requirements for the
award of the degree of
Bachelor of Technology
In
Computer Science and
Engineering
A Mini Project
On
WATER QUALITY ANALYSIS
for the award of the degree of Bachelor of Technology
In
Computer Science and Engineering
By
Rohith Kalwa
(17H61A0520)
PothakaniSamyuktha
(17H61A0538)
By
Prachi Singh
2104920109002
Himani Sharma
2004920100025
Shubhi Tomar
2004921530005
Saif Ur Rehman
2104920109003
Department of Computer Science and Engineering
ANURAG GROUP OF INSTITUTIONS
Department of Computer Science and Engineering
ACKNOWLEDGEMENT
It is our privilege and pleasure to express a profound sense of respect, gratitude and
indebtedness to Prof. Dr. Sanjay Kumar, Assistant Professor, Dept. of Computer
Science and Engineering, KCC Group of Institutions, for indefatigable
inspiration, guidance, cogent discussion, constructive criticism, and encouragement
throughout this dissertation work.
We express our sincere gratitude to Dr Sanjay, Associate Professor & Head, Department of
Computer Science and Engineering, KCC Group of Institutions, for his suggestions,
motivation, and co-operation towards the successful completion of the work.
We extend our sincere thanks to Mr. Deepak Gupta, Chairman, KCC Group of
Institutions, for his encouragement.
DECLARATION
We declare that this project work was carried out under the guidance of
Prof. Dr. Sanjay Kumar, Assistant Professor, and that it has not been submitted to
any other university for the award of any other degree or diploma.
ABSTRACT
The major goal of this project is to use machine learning techniques to measure water quality.
Potability is a numerical label used to assess the quality of a body of water. The following
water quality parameters were used in this study to assess overall water quality in terms of
potability: pH, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic Carbon,
Trihalomethanes, and Turbidity. These parameters form the feature vector that describes the
water quality. To estimate the water quality class, the project used two classification
algorithms: Decision Tree (DT) and K-Nearest Neighbor (KNN). Experiments were carried out on a
real dataset containing information from various locations around Andhra Pradesh, as well as a
synthetic dataset generated at random from the parameters. Comparing the results of the two
classifiers showed that the KNN classifier outperforms the Decision Tree classifier. According
to the findings, machine learning approaches are capable of accurately predicting potability.
Index terms: Potability, Water Quality Parameters, Data Mining, Classification.
CONTENT
1. Introduction
1.1. Motivation
1.2. Problem Definition
1.3. Objective of the Project
2. Literature Survey
3. Analysis
3.1. Existing System
3.2. Proposed System
3.3. Software Requirement Specification
3.3.1. Purpose
3.3.2. Scope
3.3.3. Overall Description
4. Design
4.1. UML diagrams
5. Implementation
5.1. Modules
1. Introduction
Water quality analysis is a complex topic because many different factors influence it. The
concept is inextricably linked to the purposes for which water is used, and different uses
call for different standards. A great deal of research is being done on water quality prediction.
Water quality is normally determined by a set of physical and chemical parameters that are
closely related to the water's intended usage. The acceptable and unacceptable values for each
variable must then be established. Water that meets the predetermined limits for a
specific application is considered appropriate for that application. If the water does not fulfil
these requirements, it must be treated before it can be used. Because water quality depends on
many physical and chemical properties, studying the behaviour of each variable independently is
not a practical way to describe water quality accurately on a spatial or temporal basis. The
more challenging alternative is to combine the values of a group of physical and chemical
variables into a single index value. For each variable, the index includes a quality value
function (usually linear) that represents the equivalence between the variable and its quality
level. These functions are created from direct measurements of a substance's concentration or
from the value of a physical variable obtained in water sample studies. The major goal of this
research is to examine how machine learning algorithms may be used to predict water quality.
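As an illustration of the index idea described above, the following minimal Python sketch maps each measurement to a 0-100 score with a linear quality function and combines the scores as a weighted average. The variable ranges and weights in `SPECS` are hypothetical values chosen only for the example; they are not regulatory limits and are not taken from this project.

```python
def linear_quality(value, worst, best):
    """Map a measurement to a 0-100 quality score with a linear function.

    `worst` maps to 0 and `best` maps to 100; scores outside 0-100 are clipped.
    """
    if worst == best:
        raise ValueError("worst and best must differ")
    score = 100.0 * (value - worst) / (best - worst)
    return max(0.0, min(100.0, score))


def quality_index(sample, specs):
    """Combine the per-variable scores into one weighted-average index."""
    total_weight = sum(weight for _, _, weight in specs.values())
    weighted = sum(
        weight * linear_quality(sample[name], worst, best)
        for name, (worst, best, weight) in specs.items()
    )
    return weighted / total_weight


# Hypothetical specs: variable -> (worst value, best value, weight).
SPECS = {
    "Turbidity": (10.0, 0.0, 2.0),
    "Organic_carbon": (30.0, 0.0, 1.0),
}

sample = {"Turbidity": 2.5, "Organic_carbon": 15.0}
print(quality_index(sample, SPECS))
```

A single index is easy to compare across locations, but it hides which variable caused a low score, which is one reason the classifiers in this project work on the full feature vector instead.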
1.1 Motivation
Machine learning algorithms have proven themselves as a general-purpose tool for many
types of tasks, offering advanced possibilities for working with data, including data
imputation, unsupervised clustering, classification and regression. They are commonly
used in many research areas; however, they are still less common among environmental
engineers, although such tools may provide an extremely efficient alternative to
traditional analytical approaches (Wilcox, Woon and Aung 2013).
2. Literature Survey
This paper reviews the role of uncertainty in the identification of mathematical models of
water quality and in the application of these models to problems of prediction. More
specifically, four problem areas are examined in detail: uncertainty about model structure,
uncertainty in the estimated model parameter values, the propagation of prediction errors, and
the design of experiments to reduce the critical uncertainties associated with a model. The
review is rather lengthy, and it has therefore been prepared in effect as two papers. There is a
shorter, largely nontechnical version, which gives a quick impression of the current and
future issues in the analysis of uncertainty in water quality modeling. Enclosed by this shorter
discussion is the main body of the review dealing in turn with (1) identifiability and
experimental design, (2) the generation of preliminary model hypotheses under conditions of
sparse, grossly uncertain field data, (3) the selection and evaluation of model structure, (4)
parameter estimation (model calibration), (5) checks and balances on the identified model,
i.e., model “verification” and model discrimination, and (6) prediction error propagation.
Much time is spent in discussing the algorithms of system identification, in particular the methods of
recursive estimation, and in relating these algorithms and the subject of identification to the
problems of prediction uncertainty and first-order error analysis. There are two obvious
omissions from the review. It is not concerned primarily with either the development and
solution of stochastic differential equations or the issue of decision making under uncertainty,
although clearly some reference must be made to these topics. In brief, the review concludes
(not surprisingly) that much work has been done on the analysis of uncertainty in the
development of mathematical models of water quality, and much remains to be done. A lack
of model identifiability has been an outstanding difficulty in the interpretation and
explanation of past observed system behavior, and there is ample evidence to show that the
“larger,” more “comprehensive” models are easily capable of generating highly uncertain
predictions of future behavior. For the future of the subject, it is speculated that there is the
possibility of progress in the development of novel algorithms for model structure
identification, a need for new questions to be posed in the problem of prediction, and a
distinct challenge to the conventional views of this review in the new forms of knowledge
representation and manipulation now emerging from the field of artificial intelligence.
3. Analysis
The proposed approach is not constrained by the number or the selection of parameters. A k-fold
cross-validation technique is employed to set up the learning and testing framework in this
study. Using this technique, the dataset is separated into k disjoint sets of equal size, each
with roughly the same class distribution. Each subset is used as the test set in turn, with the
remaining subsets serving as the training set. Two classification methods are compared:
Decision Tree (DT) and K-Nearest Neighbour (KNN). Each of these methods takes a different
approach to the underlying relational structure between the indicator parameters and the class
label. As a result, their performance on the same data set is likely to differ.
Data mining provides several metrics for validating the performance of different classifiers on
an unknown data set. A repeated cross-validation procedure in the R caret package was used to
create the learning and testing environment. The following procedure was used to apply the
classification algorithm:
1. The data set was split into two parts: training (80%) and testing (20%).
2. The training set was subjected to repeated cross-validation with a fixed number of
iterations, and the classifiers were trained in this manner.
3. The model's optimal parameter configuration, the one giving the maximum accuracy, was
selected.
4. The model was evaluated.
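The splitting scheme described in this section can be sketched in plain Python as follows. This is a minimal illustration using only the standard library, not the code used in the project: the function names, the toy labels, the seed and the choice of k = 4 are all assumptions made for the example.

```python
import random
from collections import defaultdict


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def stratified_folds(labels, k, seed=0):
    """Assign sample positions to k disjoint folds with a similar class mix."""
    by_class = defaultdict(list)
    for position, label in enumerate(labels):
        by_class[label].append(position)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for positions in by_class.values():
        rng.shuffle(positions)
        # Deal the samples of each class round-robin across the folds.
        for i, position in enumerate(positions):
            folds[i % k].append(position)
    return folds


def cross_validation_splits(labels, k, seed=0):
    """Yield (train_positions, test_positions); each fold is the test set once."""
    folds = stratified_folds(labels, k, seed)
    for i, test_fold in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test_fold


# Toy labels: 12 non-potable (0) and 8 potable (1) samples.
labels = [0] * 12 + [1] * 8

# Step 1: hold out 20% of the data for the final test.
train_idx, test_idx = train_test_split_indices(len(labels), test_fraction=0.2)
train_labels = [labels[i] for i in train_idx]

# Step 2: cross-validate on the training part only.
for train, test in cross_validation_splits(train_labels, k=4):
    print(len(train), len(test), sorted(train_labels[p] for p in test))
```

Keeping the held-out 20% outside the cross-validation loop means the parameter configuration is chosen without ever looking at the final test samples.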
4. Design
5. Implementation
5.1 Modules
To estimate the river water quality class, two data mining methods were used: Decision Tree
(DT) and K-Nearest Neighbour (KNN). Classifiers may be parametric or nonparametric; in either
case their goal is to learn, from a training dataset, a function that maps input variables to
output variables. Because the function's form is unknown, different algorithms make different
assumptions about that form and about how the training data is used to produce the output.
Parametric learning classifiers make stronger assumptions about the data. If the assumptions
hold for a data set, these classifiers make correct judgments; if the assumptions are incorrect,
the same classifier performs poorly. To learn classification tasks, these classifiers do not rely
on the size of the sample data set; their working principles are their assumptions. Because of
this parametric character, such a classifier is susceptible to prediction errors such as bias, and
a model that makes many assumptions can yield substantial bias.
Nonparametric classifiers, unlike parametric learning classifiers, do not make any
assumptions about the form of the mapping function, which lets them fit any function from the
training data set and can give higher accuracy. The DT and KNN classifiers belong to this
category. DT uses learning techniques, whereas KNN uses the similarity principle. Small data
sets with complete domain expertise, on the other hand, are equally advantageous for these
classifiers. Instead of learning a model from data, the KNN classifier finds the group of k items
in the training set that are most similar to the test object. Unlike other classifiers, KNN does
not rely on domain expertise: to make classification decisions, it simply calculates the distance
between the feature vectors of two samples. Because each algorithm operates differently, a
comparison is necessary to determine which one better approximates the underlying function for
the same training and testing water quality datasets.
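To make the similarity principle concrete, here is a minimal KNN sketch in plain Python. It assumes Euclidean distance and a majority vote among the k nearest training samples; the two-feature toy data is hypothetical and already scaled to a common range, and the function names are illustrative rather than taken from the project code.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest samples."""
    neighbours = sorted(
        zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical, pre-scaled feature vectors with two features each.
train_X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
           [0.9, 0.8], [0.8, 0.9], [0.85, 0.75]]
train_y = [1, 1, 1, 0, 0, 0]  # 1 = potable, 0 = not potable

print(knn_predict(train_X, train_y, [0.2, 0.2], k=3))
print(knn_predict(train_X, train_y, [0.8, 0.8], k=3))
```

Because the vote depends only on distances, features with large numeric ranges (for example Solids compared with pH) would dominate unless the features are scaled first, so scaling is usually applied before KNN.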
import os

# List the input files available in the Kaggle environment.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import numpy as np
import pandas as pd
from warnings import filterwarnings
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.express as px

# Load the potability dataset and inspect its columns and size.
water_df = pd.read_csv('/content/water_potability.csv')
water_df.info()
len(water_df.axes[0])  # number of rows

# Count samples per class; sort_index() puts 0 (not potable) before 1 (potable),
# so the counts line up with the names passed to the pie chart.
pot = water_df['Potability'].value_counts().sort_index()
fig = px.pie(values=pot.values, names=['Not Potable', 'Potable'], opacity=0.6,
             color_discrete_sequence=px.colors.sequential.RdBu)
fig.update_layout(
    font_family='monospace',
    title=dict(text='Samples Of Potable & Non-Potable Water', x=0.47, y=0.98,
               font=dict(color='royalblue', size=20)),
    legend=dict(x=0.37, y=-0.05, orientation='h', traceorder='reversed'),
    hoverlabel=dict(bgcolor='black'))
fig.show()
# Histogram of pH by potability class; histfunc='count' needs no y argument.
fig = px.histogram(water_df, x='ph', color='Potability', template='plotly_white',
                   marginal='box', opacity=0.7, nbins=100,
                   barmode='group', histfunc='count',
                   width=1000, height=700)
fig.update_layout(
    font_family='Gravitas One',
    title=dict(text='pH Level Distribution Plot', x=0.5, y=0.95,
               font=dict(color='darkblue', size=20)),
    xaxis_title_text='pH Level',
    yaxis_title_text='Count',
    legend=dict(x=1, y=0.98, borderwidth=0, tracegroupgap=5),
    bargap=0.4,
)
fig.show()
# Histogram of sulfate concentration by potability class.
fig = px.histogram(water_df, x='Sulfate', color='Potability', template='plotly_white',
                   marginal='box', opacity=0.7, nbins=100,
                   color_discrete_sequence=['#51C4D3', '#4F7942'],
                   barmode='group', histfunc='count',
                   width=1000, height=700)
fig.update_layout(
    font_family='monospace',
    title=dict(text='Distribution Of Sulphates Plot', x=0.53, y=0.95,
               font=dict(color='#17869E', size=20)),
    xaxis_title_text='Sulfate (mg/L)',
    yaxis_title_text='Count',
    legend=dict(x=1, y=0.96, borderwidth=0, tracegroupgap=5),
    bargap=0.3,
)
fig.show()
# Histogram of hardness by potability class, annotated with hardness bands.
fig = px.histogram(water_df, x='Hardness', color='Potability', template='plotly_white',
                   marginal='box', opacity=0.7, nbins=100,
                   color_discrete_sequence=['#17869E', '#74C365'],
                   barmode='group', histfunc='count')
fig.add_annotation(text='<76 mg/L is<br> considered soft',
                   x=40, y=130, showarrow=False, font_size=12)
fig.add_annotation(text='Between 76 and 150<br> (mg/L) is<br>moderately hard',
                   x=113, y=130, showarrow=False, font_size=12)
fig.add_annotation(text='Between 151 and 300 (mg/L)<br> is considered hard',
                   x=250, y=130, showarrow=False, font_size=12)
fig.add_annotation(text='>300 mg/L is<br> considered very hard',
                   x=340, y=130, showarrow=False, font_size=12)
fig.update_layout(
    font_family='monospace',
    title=dict(text='Distribution of Hardness Plot', x=0.55, y=0.98,
               font=dict(color='#636363', size=24)),
    xaxis_title_text='Hardness (mg/L)',
    yaxis_title_text='Count',
    legend=dict(x=1, y=0.96, bordercolor='royalblue', borderwidth=0, tracegroupgap=5),
    bargap=0.3
)
fig.show()
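# Pairwise correlation between the feature columns (target column excluded).
cor = water_df.drop('Potability', axis=1).corr()
cor

# `model` (a fitted DecisionTreeClassifier) and `train_inputs` (the training
# feature DataFrame) are assumed to be defined earlier; they are not shown here.
from sklearn.tree import export_text
tree_text = export_text(model, feature_names=list(train_inputs.columns), max_depth=5)
print(tree_text[:5000])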
6. Results
Performance Measures
The results are summarised with a confusion matrix, which shows how a classification algorithm
performs through four counts. True Positives (TP) are cases where the model correctly predicts
the positive class. True Negatives (TN) are cases where the model correctly predicts the
negative class. False Positives (FP) are cases where the model predicts the positive class
incorrectly. False Negatives (FN) are cases where the model predicts the negative class
incorrectly. Accuracy is the most basic and intuitive performance metric: the ratio of
correctly predicted observations to total observations.
Accuracy = (TP + TN) / (TP + FP + FN + TN).
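The four confusion-matrix counts and the accuracy ratio can be computed directly from the true and predicted labels. The sketch below is a minimal illustration with made-up labels (1 = potable is treated as the positive class); the function names are assumptions for the example, not part of the project code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification result."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + FP + FN + TN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no observations")
    return (tp + tn) / total


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)            # 3 3 1 1
print(accuracy(tp, tn, fp, fn))  # 0.75
```

On an imbalanced data set accuracy alone can look good even when one class is rarely predicted correctly, so the four counts are worth reporting alongside it.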
7. Screenshots
8. Conclusion
Potability determines the quality of water, which is one of the most important resources for
existence. Traditionally, testing water quality required an expensive and time-consuming lab
analysis. This study investigated an alternative, machine learning based method for predicting
water quality using only a few simple water quality criteria. To estimate potability, a set of
representative supervised machine learning algorithms was used. Such a method could detect
water of bad quality before it is released for consumption and notify the appropriate
authorities. It will hopefully reduce the number of individuals who drink low-quality water,
lowering the risk of diseases like typhoid and diarrhoea. In this case, a prescriptive analysis
based on projected values would provide future capabilities to assist decision and policy makers.
Overall, the goals defined for this research were reached, and examples of the application
of machine learning models are presented, covering most aspects of typical research work
in the field of artificial intelligence for environmental sciences. This work also reveals the
importance of consulting data scientists before monitoring starts, since data sets that are
unsuitable for the requested tasks are a common problem.
9. Future Enhancement
Future prospects for the development of this research may be seen in several ways. Firstly,
the consistent misclassification of season values between winter and spring may be studied
further using this data set by extracting and analysing the samples that tend to be
misclassified. Secondly, models generated during this research may be used by IT
students for producing software meant to help environmental specialists in analysing
collected water quality data.
All in all, following technological progress and taking the best of what it provides
from day to day ensures continuous development of the research field. The same goes for
environmental sciences: machine learning algorithms are one of the tools that can
contribute a lot to this field and may be used to keep the progress going.
10. Bibliography