
WHAT IS DATA MINING
Data mining is the process of sorting through large data sets to identify patterns and
relationships that can help solve business problems through data analysis. Data mining
techniques and tools enable enterprises to predict future trends and make more-
informed business decisions.
The information it generates can be used in business intelligence (BI) and advanced
analytics applications that involve analysis of historical data, as well as real-time
analytics applications that examine streaming data as it's created or collected.

Data mining is used across many business functions. That includes
customer-facing functions such as marketing, advertising, sales and customer support,
plus manufacturing, supply chain management, finance and HR. Data mining also supports
fraud detection, risk management, cybersecurity planning and many other critical
business use cases.
It also plays an important role in healthcare, government, scientific research,
mathematics, sports and more.
DATA MINING TECHNIQUES
1. Classification:
This analysis is used to retrieve important and relevant information about data and
metadata. This data mining method helps to classify data into different classes.
2. Clustering:
Clustering analysis is a data mining technique to identify data that are similar to each
other. This process helps to understand the differences and similarities between the data.
3. Regression:
Regression analysis is the data mining
method of identifying and analyzing the
relationship between variables. It is used to
identify the likelihood of a specific variable,
given the presence of other variables.
4. Association Rules:
This data mining technique helps to find associations between two or more items. It
discovers hidden patterns in the data set.

5. Outlier detection:
This data mining technique refers to the observation of data items in the dataset that do
not match an expected pattern or expected behavior. It can be used in a variety of
domains, such as intrusion detection, fraud detection and fault detection. Outlier
detection is also called outlier analysis or outlier mining.

6. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in
transaction data over a certain period.

7. Prediction:
Prediction uses a combination of the other data mining techniques, such as trends,
sequential patterns, clustering and classification. It analyses past events or instances
in the right sequence to predict a future event.
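
To make technique 4 concrete, here is a minimal sketch that computes the support and confidence of one association rule. The shopping baskets and the function names are made up for illustration; they are not part of the slides or the dataset.

```python
# Hypothetical toy transactions, invented for this example.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# Rule: bread -> milk
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

A rule is usually kept only if both values exceed thresholds chosen by the analyst.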
DATA MINING ADVANTAGES AND DISADVANTAGES

Advantages                                               Disadvantages
Helps gather reliable information                        Data mining tools are complex and require training to use
Helps businesses make operational adjustments            Data mining techniques are not infallible
Helps to make informed decisions                         Rising privacy concerns
Helps detect risks and fraud                             Data mining requires large databases
Helps to understand behaviours, trends and               Expensive
  discover hidden patterns
Helps to analyse very large quantities of data quickly


DATA MINING APPLICATIONS

Financial Data Analysis
Retail Industry
Telecommunication Industry
Biological Data Analysis
Other Scientific Applications
DATASET
ABOUT THE DATA
Dataset and Attributes: There are 400 patient records in this dataset. Each record
has 25 attributes, but we use only 14 for our model: Age, Blood Pressure, Albumin,
Red Blood Cells, Pus Cells, Pus Cell Clumps, Serum Creatinine, Haemoglobin,
White Blood Cell Count, Red Blood Cell Count, Anaemia, Classification, Appetite
and Packed Cell Volume.
VISUALS OF DATASET
CLUSTERING K-MEANS
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data
without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of
groups represented by the variable K.
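
The idea can be sketched in a few lines of plain Python. This is a minimal illustration of the standard K-means loop (assign each point to the nearest centroid, then move each centroid to the mean of its points), not the tool used for the slides. The six two-dimensional points are invented, and the names `nearest` and `kmeans` are hypothetical.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    for _ in range(iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # converged: nothing moved
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Invented data: two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

K has to be chosen by the analyst; the algorithm only finds the groups once K is fixed.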
CLASSIFICATION
PROBABILISTIC CLASSIFIER: NAIVE BAYES
The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving
classification problems. It is suitable for binary and multiclass classification.
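
A minimal sketch of the idea for nominal attributes: choose the class that maximises P(class) multiplied by the product of P(value | class) over the attributes. The six records below are invented (they only borrow the attribute names Appetite and Anaemia), the function names are hypothetical, and add-one smoothing is an assumption of this sketch.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class value frequencies for each feature."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    feature_values = defaultdict(set)    # feature index -> values seen in training
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values

def predict_naive_bayes(model, row):
    """Return the class with the highest P(class) * prod_i P(value_i | class)."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Add-one (Laplace) smoothing avoids zero probabilities.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score *= numerator / denominator
        scores[label] = score
    return max(scores, key=scores.get)

# Invented records: (appetite, anaemia) -> class
rows = [("poor", "yes"), ("poor", "no"), ("good", "yes"),
        ("good", "no"), ("good", "no"), ("poor", "yes")]
labels = ["ckd", "ckd", "ckd", "notckd", "notckd", "ckd"]
model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("poor", "yes")))
print(predict_naive_bayes(model, ("good", "no")))
```

The "naïve" part is the assumption that the attributes are independent given the class, which is what allows the probabilities to be multiplied.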
LOGISTIC REGRESSION
Logistic regression is the appropriate regression analysis when the dependent variable is dichotomous
(binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to
describe data and to explain the relationship between one binary dependent variable and one or more nominal, ordinal,
interval or ratio-level independent variables.
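
As a minimal sketch, the model estimates P(y = 1 | x) = sigmoid(w * x + b). The code below fits w and b for a single numeric feature by plain gradient descent on the log loss. The data, the learning rate and the number of epochs are made-up values for illustration, not settings taken from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Invented data: one feature, binary outcome, classes overlap slightly.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(sigmoid(w * 0.5 + b), sigmoid(w * 4.0 + b))
```

A record is assigned to class 1 when the predicted probability is above a cut-off, commonly 0.5.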
DECISION TREE-J48
J48 is Weka's implementation of the C4.5 decision tree algorithm. It builds the tree top-down with a recursive divide-and-conquer
strategy. You select which attribute to split on at the root node, create a branch for each possible attribute value, and split the
instances into subsets, one for each branch that extends from the root node. The same steps are then repeated on each subset.
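
The choice of attribute at a node can be sketched as follows. C4.5 ranks candidate splits by gain ratio (information gain divided by the split information); the code is a simplified illustration for nominal attributes only, with no pruning or numeric thresholds, and is not Weka's actual code. The four records and all function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def split_by_attribute(rows, labels, index):
    """Group the class labels by the value each row takes for one attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[index], []).append(label)
    return subsets

def information_gain(rows, labels, index):
    subsets = split_by_attribute(rows, labels, index)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def gain_ratio(rows, labels, index):
    split_info = entropy([row[index] for row in rows])
    gain = information_gain(rows, labels, index)
    return gain / split_info if split_info > 0 else 0.0

def best_attribute(rows, labels):
    """Index of the attribute with the highest gain ratio."""
    return max(range(len(rows[0])), key=lambda i: gain_ratio(rows, labels, i))

# Invented records: (appetite, anaemia) -> class
rows = [("poor", "yes"), ("poor", "no"), ("good", "no"), ("good", "yes")]
labels = ["ckd", "ckd", "notckd", "notckd"]
print(best_attribute(rows, labels))
```

In this toy data the first attribute separates the classes perfectly, so it is chosen for the root split.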
RANDOM TREE
Random Tree is a supervised classifier; it is an ensemble learning algorithm that generates many individual learners. It
employs a bagging idea to produce a random set of data for constructing a decision tree. In a standard tree, each node is
split using the best split among all variables; in a random tree, each node is split using the best among a random subset of the variables.
RANDOM FOREST
Random Forest is an extension of bagging for decision trees that can be used for classification or regression. A downside
of bagged decision trees is that they are constructed using a greedy algorithm that selects the best split
point at each step in the tree building process, so the trees tend to look alike. Random Forest counters this by allowing each
split to choose only from a random subset of the attributes.
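
The two ingredients, bootstrap sampling and a random choice of attributes, can be sketched with very small trees. To keep the example short, each "tree" here is a one-split stump, which is a deliberate simplification of a real Random Forest. The six records, the parameter values and all function names are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature_indices):
    """Greedy best split (fewest training errors) over the allowed features only."""
    best = None
    for i in feature_indices:
        for threshold in sorted({row[i] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[i] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[i] > threshold]
            if not left or not right:
                continue
            left_label, right_label = majority(left), majority(right)
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, i, threshold, left_label, right_label)
    if best is None:  # no valid split: always predict the majority class
        label = majority(labels)
        return (0, float("inf"), label, label)
    return best[1:]

def predict_stump(stump, row):
    i, threshold, left_label, right_label = stump
    return left_label if row[i] <= threshold else right_label

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bagging: draw a bootstrap sample (with replacement) of the rows.
        sample = [rng.randrange(len(rows)) for _ in rows]
        # Only a random subset of the features may be used for the split.
        features = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[j] for j in sample],
                                [labels[j] for j in sample], features))
    return forest

def predict_forest(forest, row):
    """Majority vote over the individual learners."""
    return majority([predict_stump(s, row) for s in forest])

# Invented records with two numeric features.
rows = [(1.0, 10.0), (1.2, 12.0), (0.9, 11.0), (3.5, 30.0), (4.0, 28.0), (3.8, 33.0)]
labels = ["notckd", "notckd", "notckd", "ckd", "ckd", "ckd"]
forest = fit_forest(rows, labels)
print(predict_forest(forest, (0.5, 5.0)), predict_forest(forest, (5.0, 40.0)))
```

Because every learner sees different rows and different attributes, their errors are less correlated, and the vote is more stable than any single tree.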
ALGORITHM SELECTION

Algorithm            MCC    Accuracy  RMSE    TP Rate  PRC Area  ROC Area
J48                  0.268  77%       0.4455  0.770    0.750     0.665
Naïve Bayes          0.359  70.75%    0.4992  0.708    0.845     0.804
Random Tree          0.272  76.5%     0.4848  0.765    0.730     0.635
Random Forest        0.295  81%       0.3635  0.810    0.847     0.803
Logistic Regression  0.303  78.75%    0.3718  0.788    0.852     0.811


MCC: Matthews correlation coefficient. RMSE: root mean squared error. TP rate: true positive
rate (sensitivity, recall). PRC: precision-recall curve. ROC: receiver operating
characteristic curve. Area: area under the curve (AUC).
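
For reference, here is a minimal sketch of how accuracy, TP rate, MCC and RMSE are computed for a binary classifier. The confusion-matrix counts are invented and do not correspond to any row of the results. Note that the reported TP rate equals the accuracy for every algorithm, which is what a class-weighted average of the per-class TP rates would give; that the results are weighted averages is an assumption.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Accuracy, TP rate and MCC from the four cells of a binary confusion matrix."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    tp_rate = tp / (tp + fn)  # sensitivity / recall of the positive class
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, tp_rate, mcc

def rmse(y_true, y_prob):
    """Root mean squared error between 0/1 labels and predicted probabilities."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_prob)]
    return math.sqrt(sum(squared) / len(squared))

# Invented counts, for illustration only.
accuracy, tp_rate, mcc = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print(accuracy, tp_rate, mcc)
print(rmse([1, 0], [0.5, 0.5]))
```

MCC ranges from -1 to +1 and uses all four cells of the confusion matrix, so it is less flattering than accuracy when the classes are imbalanced.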
INTERPRETATION

After analysing the outputs of the classification algorithms, we
conclude that Random Forest is the best classification technique in our
case. It gives the highest accuracy (81%) and TP rate (0.810) and the
lowest RMSE (0.3635) of the five algorithms, although Naïve Bayes has
the highest MCC and Logistic Regression has slightly higher PRC and ROC
areas. Moreover, our approach showed that data mining can be used
effectively for binary classification of health records of patients with
chronic kidney disease (CKD).
THANK YOU
