Unit I DM

DATA MINING
UNIT - I
Introduction: Data Mining - KDD vs Data Mining - DBMS vs Data Mining -
Other areas - Data mining techniques - Issues and challenges -
Application areas.
INTRODUCTION
Data mining is the process of discovering patterns in large data sets
involving methods at the intersection of machine learning, statistics, and
database systems.
Data mining is an interdisciplinary subfield of computer science and
statistics with an overall goal to extract information (with intelligent
methods) from a data set and transform the information into a
comprehensible structure for further use.
Data mining is the analysis step of the "knowledge discovery in
databases" process or KDD.
DATA MINING
Data Mining is the non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable patterns in data.
Data Mining is a process of finding potentially useful patterns from huge
data sets.
Data mining is also known as data discovery and knowledge discovery.
Data mining refers to extracting or mining knowledge from large
amounts of data.
DATA MINING DEFINITION
Data mining is the process of discovering meaningful new
correlations, patterns and trends by sifting through large
amounts of data stored in repositories, using pattern
recognition techniques as well as statistical and mathematical
techniques.
KDD VS DATA MINING
KDD refers to the overall process of discovering useful knowledge
from data. It involves the evaluation and possibly interpretation of the
patterns to make the decision of what qualifies as knowledge. It also
includes the choice of encoding schemes, preprocessing, sampling, and
projections of the data prior to the data mining step.
Data mining refers to the application of algorithms for extracting
patterns from data without the additional steps of the KDD process.
Knowledge discovery in databases (KDD) is a multistep process of
finding useful information and patterns in data, while data mining is
the one step of KDD that uses algorithms to extract patterns.
Steps Of KDD
1. Selection
Data extraction - obtaining data from heterogeneous data sources such as
databases, data warehouses, the World Wide Web or other information
repositories.
2. Preprocessing
Data cleaning - incomplete, noisy or inconsistent data is cleaned.
Missing data may be ignored or predicted; erroneous data may be deleted or corrected.
3. Transformation
Data integration - combines data from multiple sources into a coherent
store. Data can be encoded in common formats, normalized or reduced.
4. Data mining
Apply algorithms to the transformed data and extract patterns.
5. Pattern interpretation/evaluation
Pattern evaluation - evaluate the interestingness of the resulting patterns, or apply
interestingness measures to filter the discovered patterns.
Knowledge presentation - present the mined knowledge; visualization
techniques can be used.
6. Data visualization
Graphical - bar charts, pie charts, histograms
Geometric - box plots, scatter plots
Icon-based - using colors and figures as icons
Pixel-based - data shown as colored pixels
Hierarchical - hierarchically dividing the display area
Hybrid - a combination of the above approaches
Knowledge discovery process
KDD is the nontrivial extraction of implicit, previously unknown
and potentially useful knowledge from data.
[Figure: stages of the knowledge discovery process - Selection, Data Cleaning,
Data Preprocessing / Data Integration, Data Transformation, Data Mining,
Pattern Evaluation]
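The KDD steps can be sketched as a chain of small functions. This is only a minimal illustration under stated assumptions: the records, the function names (select, preprocess, transform, mine, evaluate) and the toy "pattern" (records with above-average income) are all hypothetical and are not part of any standard KDD tool.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"id": 1, "age": 25, "income": 30000},
            {"id": 2, "age": None, "income": 50000}]
source_b = [{"id": 3, "age": 40, "income": 70000},
            {"id": 4, "age": 35, "income": -1}]  # -1 is an erroneous entry


def select(*sources):
    """Selection: gather records from several sources into one list."""
    return [dict(rec) for src in sources for rec in src]


def preprocess(records):
    """Preprocessing: drop erroneous rows, fill missing ages with the mean."""
    valid = [r for r in records if r["income"] >= 0]
    known_ages = [r["age"] for r in valid if r["age"] is not None]
    fill = mean(known_ages)
    for r in valid:
        if r["age"] is None:
            r["age"] = fill
    return valid


def transform(records):
    """Transformation: min-max normalize income to the range [0, 1]."""
    incomes = [r["income"] for r in records]
    lo, hi = min(incomes), max(incomes)
    for r in records:
        r["income_norm"] = (r["income"] - lo) / (hi - lo)
    return records


def mine(records):
    """Data mining (toy): find records with above-average normalized income."""
    avg = mean(r["income_norm"] for r in records)
    return [r["id"] for r in records if r["income_norm"] > avg]


def evaluate(pattern, records):
    """Evaluation: keep the pattern only if it covers a minority of records."""
    return pattern if len(pattern) < len(records) / 2 else []


records = transform(preprocess(select(source_a, source_b)))
high_income_ids = evaluate(mine(records), records)
print(high_income_ids)  # [3]
```

Each function corresponds to one step, so a step can be swapped out (for example a different cleaning rule) without touching the others.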
DBMS VS DATA MINING
A DBMS is a full-fledged system for housing and managing a set of digital
databases.
Data mining, by contrast, is a technique or concept in computer science
that deals with extracting useful and previously unknown information
from raw data.
Most of the time, this raw data is stored in very large databases.
Data miners therefore use the existing functionality of a DBMS to handle,
manage and even preprocess raw data before and during the data mining
process.
A DBMS alone cannot be used to analyze data in this way, although
some DBMSs now have built-in data analysis tools or capabilities.
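The division of labour can be illustrated with Python's built-in sqlite3 module. The sketch below is an assumption-laden toy: the sales table and its rows are hypothetical. The first query is an ordinary DBMS request (the user already knows what to ask); the second uses the DBMS to prepare data for a mining-style question (which items tend to be bought by the same customer).

```python
import sqlite3

# In-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("ann", "bread", 2.0), ("ann", "milk", 1.5),
     ("bob", "bread", 2.0), ("bob", "milk", 1.5),
     ("cat", "jam", 3.0)],
)

# DBMS task: answer a question the user already knows how to ask.
total_by_customer = dict(conn.execute(
    "SELECT customer, SUM(amount) FROM sales "
    "GROUP BY customer ORDER BY customer"
).fetchall())

# Mining-style task: let the DBMS prepare the data, then look for a pattern
# (here, pairs of items bought by the same customer, with their counts).
pairs = conn.execute(
    """SELECT a.item, b.item, COUNT(*) FROM sales a
       JOIN sales b ON a.customer = b.customer AND a.item < b.item
       GROUP BY a.item, b.item"""
).fetchall()
conn.close()

print(total_by_customer)  # {'ann': 3.5, 'bob': 3.5, 'cat': 3.0}
print(pairs)              # [('bread', 'milk', 2)]
```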
OTHER AREAS
Statistics - developing and studying methods for collecting, analyzing, interpreting
and presenting empirical data (descriptive and inferential).
Machine learning - the field of study that gives computers the capability to learn
without being explicitly programmed.
Supervised learning - machines are trained using well-labelled training data and,
on the basis of that data, predict the output.
Unsupervised learning - finds hidden patterns and insights in the given data.
Mathematical programming - refers to mathematical models used to solve
problems such as decision problems.
DATA MINING TECHNIQUES
Two fundamental goals of data mining: prediction and description
Prediction makes use of existing variables in the database to predict
unknown or future values of interest.
Description focuses on finding patterns describing the data and
subsequent presentation for user interpretation.
DM techniques are classified as:
User-guided or verification-driven data mining
Discovery-driven or automatic discovery of rules
Verification Model
In this process of data mining, the user makes a hypothesis and tests
the hypothesis on the data to verify its validity.
Discovery Model
The discovery model is the system that automatically discovers
important information hidden in the data.
The typical discovery-driven tasks are:
Discovery of association rules
Discovery of classification rules
Clustering
Discovery of frequent episodes
Deviation detection
Classification:
This analysis is used to retrieve important and relevant information
about data and metadata. This data mining method assigns data items to
predefined classes, using a model built from examples whose classes are
already known.
E.g.: loan applicants
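As a rough illustration of classification, the sketch below labels loan applicants with a k-nearest-neighbour vote. This is one of many possible classifiers, chosen here only because it is short; the training rows, the two attributes and the labels "approve" and "reject" are hypothetical.

```python
from collections import Counter
from math import dist

# Hypothetical training data:
# (income in thousands, existing debt in thousands) -> known label
training = [
    ((60, 5), "approve"), ((80, 10), "approve"), ((75, 2), "approve"),
    ((20, 15), "reject"), ((25, 30), "reject"), ((30, 25), "reject"),
]


def classify(applicant, k=3):
    """Label an applicant by majority vote among the k nearest training rows."""
    nearest = sorted(training, key=lambda row: dist(row[0], applicant))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(classify((70, 8)))   # approve
print(classify((22, 20)))  # reject
```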
Clustering:
Clustering analysis is a data mining technique for identifying groups of
data items that are similar to each other.
This process helps in understanding the differences and similarities
between the data.
E.g.: information on houses
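One common clustering method is k-means, sketched below on a handful of houses. The data, the two attributes and the hand-picked starting centroids are assumptions made for illustration; in practice the starting centroids are usually chosen at random.

```python
from math import dist
from statistics import mean

# Hypothetical houses: (floor area in square metres, number of rooms)
houses = [(50, 2), (55, 2), (60, 3), (150, 5), (160, 6), (155, 5)]


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(mean(coord) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Starting centroids are picked by hand so that the run is repeatable.
centroids, clusters = k_means(houses, centroids=[houses[0], houses[3]])
print(clusters)
```

With this data the small houses and the large houses end up in separate clusters.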
Regression:
Regression analysis is the data mining method of identifying and
analyzing the relationship between variables. It is used to estimate the
value or likelihood of a specific variable, given the values of other variables.
E.g.: financial predictors
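The simplest case, fitting a straight line by ordinary least squares, can be sketched as follows. The data is hypothetical and was constructed to lie exactly on y = 2x + 1 so that the result is easy to check by hand.

```python
from statistics import mean

# Hypothetical data: advertising spend (x) and resulting sales (y).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]  # exactly y = 2x + 1


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(xs, ys)
print(slope, intercept)         # 2.0 1.0
print(slope * 6.0 + intercept)  # predicted y for x = 6: 13.0
```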
Association Rules:
This data mining technique helps to find associations between two or
more items. It discovers hidden patterns in the data set. E.g.:
bread, milk, jam
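Association rules are usually scored by support and confidence. The sketch below computes both for the bread, milk, jam example; the four baskets are hypothetical.

```python
# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "jam"},
    {"bread", "jam"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) for antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"bread", "milk"}))       # 0.5
print(confidence({"bread"}, {"milk"}))  # 2 of the 3 bread baskets, about 0.667
```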
Sequential Patterns:
This data mining technique helps to discover or identify similar patterns
or trends in transaction data over a certain period. E.g.: customer shopping
sequences
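A very small version of sequential pattern mining counts how many customers bought item a at some point before item b. The purchase histories and the function name frequent_ordered_pairs are hypothetical; full algorithms also handle longer patterns and time windows.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, each listed in time order.
sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["laptop", "mouse"],
    ["phone", "case"],
]


def frequent_ordered_pairs(sequences, min_support):
    """Count pairs (a, b) where a is bought before b, once per customer,
    and keep the pairs seen in at least min_support sequences."""
    counts = Counter()
    for seq in sequences:
        # combinations() preserves the order of items within the sequence.
        counts.update(set(combinations(seq, 2)))
    return {pair: n for pair, n in counts.items() if n >= min_support}


patterns = frequent_ordered_pairs(sequences, min_support=2)
print(patterns)
```

Here "phone then case" and "phone then charger" each occur for two customers, so both are reported.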
Outlier detection:
This type of data mining technique refers to the observation of data items in the
dataset that do not match an expected pattern or expected behavior. This
technique can be used in a variety of domains, such as intrusion detection,
fraud detection or fault detection.
Outlier detection is also called outlier analysis or outlier mining. E.g.: unusual
response to medical treatment
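One simple way to flag outliers is the z-score: how many standard deviations a value lies from the mean. The sketch below applies it to hypothetical recovery times; the threshold of 2.0 is an assumption, and other methods (for example those based on the interquartile range) are equally common.

```python
from statistics import mean, pstdev

# Hypothetical recovery times in days after the same treatment.
recovery_days = [10, 12, 11, 13, 12, 11, 50]


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than threshold standard deviations
    away from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]


print(z_score_outliers(recovery_days))  # [50]
```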
Prediction:
Prediction uses a combination of other data mining techniques, such as
trend analysis, sequential patterns, clustering and classification. It analyzes
past events or instances in the right sequence to predict a future event.
Neural networks
A neural network is a series of algorithms that endeavours to recognize underlying
relationships in a set of data through a process that mimics the way the human
brain operates.
Neural networks refer to systems of neurons, either organic or artificial in nature.
E.g.: facial recognition
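The building block of an artificial neural network is a single neuron. The sketch below trains one neuron (a perceptron) to reproduce logical AND. It is far simpler than the networks used for tasks such as facial recognition, and the choice of AND as the task is an assumption made for illustration.

```python
# Truth table for logical AND: inputs -> target output.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = [0.0, 0.0], 0.0


def predict(x):
    """Fire (output 1) when the weighted sum of the inputs exceeds zero."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


# Perceptron learning rule: after each error, nudge the weights and
# the bias in the direction that reduces that error.
for _ in range(20):
    for x, target in samples:
        error = target - predict(x)
        for i, xi in enumerate(x):
            weights[i] += error * xi
        bias += error

print([predict(x) for x, _ in samples])  # [0, 0, 0, 1]
```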
Genetic Algorithm
The genetic algorithm is a method for solving both constrained and unconstrained
optimization problems that is based on natural selection, the process that drives
biological evolution.
The genetic algorithm repeatedly modifies a population of individual solutions.
At each step, the genetic algorithm selects individuals at random from the current
population to be parents and uses them to produce the children for the next
generation. E.g.: learning robots
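The loop described above can be sketched on a toy problem: evolve a bit string with as many 1s as possible. Everything specific here is an assumption made for illustration: the objective, the population size, the mutation rate, and the use of tournament selection (parents are drawn at random, but the fitter of two candidates is kept) together with elitism (the best individual is copied unchanged).

```python
import random

rng = random.Random(0)  # fixed seed so that runs are repeatable
GENES, POP, GENERATIONS = 12, 20, 30


def fitness(individual):
    """Toy objective: number of 1-bits (the maximum is GENES)."""
    return sum(individual)


def crossover(a, b):
    """Child takes the head of one parent and the tail of the other."""
    cut = rng.randrange(1, GENES)
    return a[:cut] + b[cut:]


def mutate(individual, rate=0.05):
    """Flip each bit independently with a small probability."""
    return [1 - g if rng.random() < rate else g for g in individual]


population = [[rng.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
initial_best = max(map(fitness, population))

for _ in range(GENERATIONS):
    elite = max(population, key=fitness)  # keep the best one unchanged
    children = [elite]
    while len(children) < POP:
        # Tournament selection: the fitter of two random individuals
        # becomes a parent.
        p1 = max(rng.sample(population, 2), key=fitness)
        p2 = max(rng.sample(population, 2), key=fitness)
        children.append(mutate(crossover(p1, p2)))
    population = children

best = max(population, key=fitness)
print(fitness(best), best)
```

Because the best individual is always carried over, the best fitness can never decrease from one generation to the next.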
Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning algorithm
capable of performing classification, regression and even outlier detection.
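The linear SVM classifier works by drawing a straight line between two classes
(a hyperplane when the data has more than two dimensions), chosen so that the
margin between the line and the nearest points of each class is as large as possible.
All the data points that fall on one side of the line will be labelled as one class
and all the points that fall on the other side will be labelled as the second. E.g.:
text recognition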
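The "one side of the line" rule can be shown with a fixed line. Note the assumption: the sketch below does not train an SVM. The weights w and the offset b are hand-picked, whereas an SVM solver would learn them from labelled data; only the decision rule is illustrated, and the class names are hypothetical.

```python
from math import hypot

# Hand-picked separating line x + y - 3 = 0 (an assumption for illustration;
# an SVM solver would learn w and b from the training data).
w, b = (1.0, 1.0), -3.0


def decision(point):
    """Signed score: positive on one side of the line, negative on the other."""
    return w[0] * point[0] + w[1] * point[1] + b


def predict(point):
    """Label a point according to the side of the line it falls on."""
    return "class A" if decision(point) >= 0 else "class B"


def distance_to_line(point):
    """Geometric distance from the point to the separating line."""
    return abs(decision(point)) / hypot(*w)


print(predict((3, 2)), predict((0, 1)))  # class A class B
print(distance_to_line((3, 3)))          # 3 / sqrt(2), about 2.12
```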
Rough set techniques
Rough set theory is a methodology for database mining or knowledge
discovery in relational databases.
It is a relatively new area of uncertainty mathematics, closely related to fuzzy theory.
The rough set approach is used to discover structural relationships within imprecise
and noisy data.
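The core idea of rough sets is to describe an imprecise set by two crisp ones: a lower approximation (objects that certainly belong) and an upper approximation (objects that possibly belong). The sketch below computes both for a hypothetical patient table in which two patients have identical symptoms but different outcomes.

```python
from collections import defaultdict

# Hypothetical patient table: (headache, temperature) -> has flu?
patients = {
    "p1": (("yes", "high"), True),
    "p2": (("yes", "high"), False),  # same symptoms as p1, other outcome
    "p3": (("no", "high"), True),
    "p4": (("no", "normal"), False),
}


def approximations(table):
    """Return (lower, upper) approximations of the set of positive cases."""
    # Objects with identical attribute values are indiscernible.
    classes = defaultdict(set)
    for name, (attrs, _) in table.items():
        classes[attrs].add(name)
    target = {name for name, (_, flag) in table.items() if flag}
    lower, upper = set(), set()
    for members in classes.values():
        if members <= target:
            lower |= members  # certainly in the target set
        if members & target:
            upper |= members  # possibly in the target set
    return lower, upper


lower, upper = approximations(patients)
print(sorted(lower), sorted(upper))  # ['p3'] ['p1', 'p2', 'p3']
```

The difference between the two sets (here p1 and p2) is the boundary region, where the available attributes cannot decide membership.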
OTHER MINING PROBLEMS
Sequence mining - discovering interesting patterns in sequence data
E.g.: DNA, protein, customer purchase history
Web mining - finding patterns in Web data
E.g.: Google, Yahoo
Text mining - the process of extracting essential information from natural language text
E.g.: customer care, social media analysis
Spatial mining - mining data related to spatial descriptions
E.g.: earthquake points
ISSUES AND CHALLENGES
Limited information - some attributes that are essential for
knowledge discovery are not present.
Noise and missing data - handled by omitting the corresponding record or
by inferring missing values from known values.
User interaction and prior knowledge - requires a KDD tool that
is interactive and iterative.
Uncertainty - refers to the severity of error and the degree of noise in the data;
data precision is important.
Size, updates and irrelevant fields - databases are large and dynamic.
APPLICATION AREAS
Business and E-Commerce Data
Business transactions
Electronic commerce
Scientific, Engineering and Health Care Data
Genomic data - the genome and DNA data of an organism.
Sensor data - readings from devices that detect and respond to some type of
input from the physical environment.
Simulation data - datasets generated from simulated random outcomes.
Healthcare data - patients' diagnoses, medications, allergies, treatment plans,
radiology images and test results.
Web data - text, audio and video material on the web, and streaming data
Multimedia documents - digital documents
Data web - HTML and XML data
