UNIT-3 DATA MINING - Part1
UNIT-3_PART1
BTECH- IT421
KAVITA SETHIA
Sources of Data
Businesses worldwide generate gigantic data sets, including sales transactions, stock trading records, product descriptions, sales promotions, company profiles and performance, and customer feedback.
Scientific and engineering practices generate high
orders of petabytes of data in a continuous manner,
from remote sensing, process measuring, scientific
experiments, system performance, engineering
observations, and environment surveillance.
2/1/2019 2
Billions of Web searches supported by search engines
process tens of petabytes of data daily. Communities
and social media have become increasingly important
data sources, producing digital pictures and videos,
blogs, Web communities, and various kinds of social
networks.
Knowledge Discovery Process (previous year paper)
Knowledge discovery concerns the entire knowledge extraction process, from selecting and preparing the data through data mining to evaluating and presenting the discovered knowledge.
Steps in Knowledge Discovery Process
KDD is an iterative sequence of the following steps:
1. Data cleaning - to remove noise and inconsistent data.
2. Data integration - where multiple data sources may be combined.
3. Data selection - where data relevant to the analysis task are retrieved from the database.
4. Data transformation - where data are transformed and consolidated into forms appropriate for mining.
5. Data mining - an essential process where intelligent methods are applied to extract data patterns.
6. Pattern evaluation - to identify the truly
interesting patterns representing knowledge based on
interestingness measures.
7. Knowledge presentation - where visualization
and knowledge representation techniques are used to
present mined knowledge to users.
Contents:
Introduction to Data Mining
Data Mining
On What Kind of Data
What kind of patterns to be mined
Classification of data mining systems
Data Mining Task primitives
Major issues in Data Mining
What is Data Mining?
Data mining is the process of discovering
knowledge in form of interesting patterns and
relationships from large amounts of data.
“Data mining,” writes Joseph P. Bigus in his book Data Mining with Neural Networks, “is the efficient discovery of valuable, non-obvious information from a large collection of data.”
Data mining centers around the automated
discovery of new facts and relationships in data.
It is a multi-disciplinary skill that uses machine
learning, statistics, AI and database technology.
Difference between KDD and Data
Mining
Although the two terms KDD and Data Mining are often used interchangeably, they refer to two related
yet slightly different concepts. KDD is the overall
process of extracting knowledge from data while
Data Mining is a step inside the KDD process,
which deals with identifying patterns in data. In
other words, Data Mining is only the application of a
specific algorithm based on the overall goal of the
KDD process.
Classification of Data Mining Systems (previous year paper)
A data mining system can be classified based on the kind of: (a) data mined, (b) techniques utilized, and (c) applications adapted.
(c) Applications adapted
Finance
Retail and Telecommunication
Recommender System
What kinds of data can be mined?
As a general technology, data mining can be applied
to any kind of data as long as the data are meaningful
for a target application. The most basic forms of data
for mining applications are :
database data
data warehouse data
transactional data
Database Data
Relational data can be accessed by database queries written
in a relational query language (e.g., SQL) or with the assistance
of graphical user interfaces.
“Show me a list of all items that were sold in the last quarter.”
“Show me the total sales of the last month, grouped by branch.”
“How many sales transactions occurred in the month of December?”
“Which salesperson had the highest sales?”
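The ad hoc queries above can be sketched against a hypothetical sales table using SQLite; the schema, branch names, and amounts here are illustrative assumptions, not data from the text:

```python
import sqlite3

# Build a tiny in-memory sales table (hypothetical schema and values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (branch TEXT, salesperson TEXT, amount REAL, month TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("North", "Asha", 120.0, "December"),
    ("North", "Ravi", 80.0, "December"),
    ("South", "Asha", 200.0, "November"),
])

# "Show me the total sales, grouped by branch."
totals = conn.execute(
    "SELECT branch, SUM(amount) FROM sales GROUP BY branch ORDER BY branch").fetchall()
print(totals)  # [('North', 200.0), ('South', 200.0)]
```

Each of the natural-language queries above maps to a similar SELECT with a WHERE, GROUP BY, or ORDER BY clause.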
When mining relational databases, we can go
further by searching for trends or data patterns. For
example, data mining systems can analyse customer
data to predict the credit risk of new customers based
on their income, age, and previous credit information.
Data Warehouse
A Data Warehouse is a repository of information collected from
multiple sources, stored under a unified schema, and usually
residing at a single site.
A data warehouse is usually modelled by a
multidimensional data structure, called a data cube,
in which each dimension corresponds to an attribute
or a set of attributes in the schema, and each cell
stores the value of some aggregate measure such as
count or sum(sales_amount).
It allows the exploration of multiple combinations of dimensions at
varying levels of granularity in data mining, and thus has
greater potential for discovering interesting patterns representing
knowledge.
Transactional Data
Each record in a transactional database captures a transaction,
such as a customer’s purchase, a flight booking, or a user’s clicks
on a web page.
Suppose you want to know, “Which items sold well
together?”
This kind of market basket data analysis would enable
you to bundle groups of items together as a strategy
for boosting sales.
For example, given the knowledge that printers are
commonly purchased together with computers, you
could offer certain printers at a discounted price.
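A minimal sketch of this market basket idea: count how often pairs of items co-occur across transactions and keep the frequent ones. This is a hypothetical helper for illustration, not a full Apriori implementation:

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return item pairs that appear together in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        # Sort so each pair has one canonical form regardless of item order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Made-up baskets illustrating the printer/computer example.
baskets = [
    {"computer", "printer", "mouse"},
    {"computer", "printer"},
    {"computer", "mouse"},
    {"printer", "paper"},
]
print(frequent_pairs(baskets, min_support=2))
```

Pairs that clear the support threshold (here, computer with printer and computer with mouse) are candidates for bundled promotions.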
Other Kinds of Data
By mining user comments on products (which are
often submitted as short text messages),we can assess
customer sentiments and understand how well a
product is accepted by a market.
By mining video data of a hockey game, we can detect
video sequences corresponding to goals.
Stock exchange data can be mined to uncover trends
that could help you plan investment strategies.
With spatial data, we may look for patterns that
describe changes in metropolitan poverty rates based
on city distances from major highways.
What kind of patterns can be mined?
There are a number of data mining
functionalities. These include:
the mining of frequent patterns, associations, and
correlations
classification and regression
clustering analysis
outlier analysis
characterization and discrimination
There are many kinds of frequent patterns, including frequent itemsets and
frequent subsequences.
The derived model may be represented in various forms, such as classification rules (i.e., IF-THEN rules) or decision trees.
A decision tree is a flowchart-like tree structure,
where each node denotes a test on an attribute value,
each branch represents an outcome of the test, and
tree leaves represent classes or class distributions
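The flowchart structure described above can be sketched as nested attribute tests. The attributes and thresholds below are purely illustrative, loosely following the credit-risk example mentioned earlier:

```python
def classify_credit_risk(income, age):
    """A tiny hand-built decision tree: each if-statement is a node testing an
    attribute value, each branch is an outcome of the test, and each return
    is a leaf holding a class label. Thresholds are hypothetical."""
    if income < 30000:          # root node: test on income
        return "high risk"      # leaf
    if age < 25:                # internal node: test on age
        return "medium risk"    # leaf
    return "low risk"           # leaf

print(classify_credit_risk(income=50000, age=40))  # low risk
```

In practice such trees are induced automatically from training data rather than written by hand.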
Whereas classification predicts categorical (discrete, unordered) labels, regression models continuous-valued functions. That is, regression is used to predict missing or unavailable numerical data values rather than (discrete) class labels.
CLASSIFICATION: nominal or categorical labels. REGRESSION: numerical values.
Regression Analysis
Regression analysis is a statistical methodology that is
most often used for numeric prediction.
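As a sketch of numeric prediction, here is a minimal ordinary least-squares fit of a line y = a·x + b in pure Python. This is an illustrative helper, not a production regression routine:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

slope, intercept = linear_fit([1, 2, 3], [2, 4, 6])
print(slope, intercept)  # 2.0 0.0
```

The fitted line can then predict a numeric value for an unseen x, which is exactly the missing-value prediction described above.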
Cluster Analysis
[Figure: formation of clusters with similar properties]
Outlier Analysis
A data set may contain objects that do not comply
with the general behavior or model of the data.
These data objects are outliers. Many data mining
methods discard outliers as noise or exceptions.
For example: Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of unusually large amounts for a given account number in comparison to regular charges incurred by the same account. Outlier values may also be detected with respect to the locations and types of purchase, or the purchase frequency.
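One common way to flag such unusually large charges is a z-score test. The threshold of 2 standard deviations and the sample charges below are illustrative assumptions:

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Flag values whose distance from the mean exceeds `threshold`
    population standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Regular charges around 40-60, plus one suspiciously large purchase.
charges = [40, 55, 45, 50, 60, 48, 52, 1000]
print(zscore_outliers(charges))  # [1000]
```

Real fraud detection would also consider purchase location, type, and frequency, as noted above.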
Data Characterization
Data characterization is a summarization of the
general characteristics or features of a target class
of data. The data corresponding to the user-specified
class are typically collected by a query.
Data Discrimination
Data discrimination is a comparison of the general features of
the target class data objects against the general features of
objects from one or multiple contrasting classes.
The target and contrasting classes can be specified by a user, and the
corresponding data objects can be retrieved through database queries.
The methods used for data discrimination are similar to those used for
data characterization.
Data Mining Life
Cycle/Implementation Process
The life cycle of a data mining project consists of six
phases. The sequence of the phases is not rigid.
Moving back and forth between different phases is
always required depending upon the outcome of each
phase. The main phases are:
Business Understanding
In this phase, business and data-mining goals are established.
First, you need to understand business and client objectives.
You need to define what your client wants (which many times
even they do not know themselves)
Take stock of the current data mining scenario. Factor in
resources, assumption, constraints, and other significant
factors into your assessment.
Using business objectives and current scenario, define your
data mining goals.
A good data mining plan is very detailed and should be
developed to accomplish both business and data mining goals.
Data Understanding
Performed to check whether Data is appropriate for the data mining
goals.
First, data is collected from multiple data sources available in the
organization.
These data sources may include multiple databases, flat files or data
cubes. There are issues like object matching and schema
integration which can arise during the Data Integration process. It is a quite complex and tricky process, as data from various sources are unlikely to match easily.
The next step is to explore the properties of the acquired data. A good
way to explore the data is to answer the data mining questions
(decided in business phase) using the query, reporting, and
visualization tools.
Based on the results of query, the data quality should be ascertained.
Missing data if any should be acquired.
Data Preparation/Preprocessing
In this phase, data is made production ready.
The data preparation process consumes about 90% of the time of the
project.
The data from different sources should be selected,
cleaned, transformed, formatted, anonymized, and constructed (if
required).
Data cleaning is a process to "clean" the data by smoothing noisy
data and filling in missing values.
For example, for a customer demographics profile, age data is
missing. The data is incomplete and should be filled. In some cases,
there could be data outliers. For instance, age has a value 100. Data
could be inconsistent. For instance, name of the customer is different
in different tables.
Data transformation operations change the data to make it useful in
data mining. The following transformations can be applied:
Data Transformation
Data transformation operations would contribute toward the success of
the mining process.
Smoothing: It helps to remove noise from the data. For eg : using
Binning
Aggregation: Summary or aggregation operations are applied to the
data. i.e., the weekly sales data is aggregated to calculate the monthly
and yearly total.
Generalization: In this step, Low-level data is replaced by higher-
level concepts with the help of concept hierarchies. For example, the
city is replaced by the county.
Normalization: Normalization is performed when the attribute data are scaled up or scaled down. Example: data should fall in the range -2.0 to 2.0 post-normalization.
The result of this process is a final data set that can be used in
modeling.
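The aggregation operation described above can be sketched in a few lines; the month labels and sales amounts are made-up values:

```python
from collections import defaultdict

def aggregate_monthly(weekly_sales):
    """Roll weekly (month, amount) records up to monthly totals."""
    totals = defaultdict(float)
    for month, amount in weekly_sales:
        totals[month] += amount
    return dict(totals)

weeks = [("Jan", 100.0), ("Jan", 150.0), ("Feb", 200.0)]
print(aggregate_monthly(weeks))  # {'Jan': 250.0, 'Feb': 200.0}
```

Yearly totals follow the same pattern, keyed by year instead of month.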
Modelling
In this phase, mathematical models are used to determine
data patterns.
Based on the business objectives, suitable
modeling techniques should be selected for the
prepared dataset.
Create a scenario to test and check the quality and validity of
the model.
Run the model on the prepared dataset.
Results should be assessed by all stakeholders to
make sure that model can meet data mining objectives.
Evaluation
In this phase, the patterns identified are evaluated against the business objectives.
Benefits of Data Mining
Data mining techniques help companies obtain knowledge-based information.
Data mining helps organizations make profitable adjustments in operation and production.
Data mining is a cost-effective and efficient solution compared to other statistical data applications.
Data mining helps with the decision-making process.
It facilitates automated prediction of trends and behaviors as well as automated discovery of hidden patterns.
It can be implemented in new systems as well as existing platforms.
It is a speedy process that makes it easy for users to analyze huge amounts of data in less time.
Limitations of Data Mining
There is a chance that companies may sell useful information about their customers to other companies for money. For example, American Express has sold credit card purchases of their customers to other companies.
Much data mining analytics software is difficult to operate and requires advanced training to work with.
Different data mining tools work in different manners due to the different algorithms employed in their design. Therefore, the selection of the correct data mining tool is a very difficult task.
If the data mining techniques are not accurate, it can cause
serious consequences in certain conditions.
Applications of Data
Mining/Where it is used?
Data Mining Task Primitives
Each user will have a data mining task in mind, that is,
some form of data analysis that he or she would like to
have performed.
The set of task-relevant data to be mined: This
specifies the portions of the database or the set of data
in which the user is interested. This includes the
database attributes or data warehouse dimensions of
interest.
The background knowledge to be used in the discovery process: The background knowledge allows data to be mined at multiple levels of abstraction. For example, concept hierarchies are one kind of background knowledge that allows data to be mined at multiple levels of abstraction.
Major Issues in Data
Mining(previous year paper)
Data mining is not an easy task, as the algorithms
used can get very complex and data is not always
available at one place. It needs to be integrated from
various heterogeneous data sources. These factors
also create some issues. The major issues regarding data mining are discussed below.
[Figure: scatter plots when there is no correlation between attributes]
Data Preprocessing (previous year paper)
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Data Discretization
Concept Hierarchy Generation
Data Preprocessing
Today’s real-world databases are highly susceptible to
noisy, missing, and inconsistent data due to their
typically huge size (often several gigabytes or more) and
their likely origin from multiple, heterogeneous sources.
Data Cleaning
Data to be analyzed by data mining can be incomplete (missing attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors, or outlier values that deviate from the expected), and inconsistent (containing discrepancies in the department codes used to categorize items).
Handling Missing Data
1. Deleting rows (listwise deletion)
Pros: this is a better approach when the data size is small; if the missingness is MCAR (missing completely at random), listwise deletion may be a reasonable strategy.
Cons: loss of information and data; works poorly if the percentage of missing values is high (say 30%) compared to the whole dataset.
2. Replacing with mean/median/mode
Pros: this is a better approach when the data size is small; it can prevent the data loss that results from removal of rows and columns.
Cons: adds variance and bias; considered only as a proxy for the true values.
3. Predicting the missing values
Pros: imputation is good as long as the bias from it is smaller than the omitted-variable bias; yields unbiased estimates of the model parameters.
Cons: bias also arises when an incomplete conditioning set is used for a categorical variable.
4. Using algorithms which support missing values
Pros: does not require creation of a predictive model for each attribute with missing data in the dataset.
Cons: correlation of the data is neglected; it is a very time-consuming process, which can be critical in data mining where large databases are being extracted.
Handling Missing Values
Suppose there are many tuples that have no recorded values for several attributes.
Various methods to address missing values for the attribute are:
1. Ignore the tuple: This is usually done when the class label is
missing (assuming the mining task involves classification). This
method is not very effective, unless the tuple contains several
attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably. By
ignoring the tuple, we do not make use of the remaining
attributes’ values in the tuple. Such data could have been useful
to the task at hand.
Noisy data can be smoothed by techniques such as:
Binning
Regression
Outlier analysis
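Smoothing by binning can be sketched as follows: sort the values, partition them into equal-frequency bins, and replace each value by its bin mean. The bin size of 3 and the sample prices are illustrative:

```python
def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: sort, split into consecutive bins,
    and replace every value in a bin by that bin's mean."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_vals = ordered[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, bin_size=3))  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Variants replace each value by the bin median or by the nearest bin boundary instead of the mean.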
Data Cleaning as a process
Missing values, noise, and inconsistencies contribute
to inaccurate data. So far, we have looked at techniques
for handling missing data and for smoothing data.
“But data cleaning is a big job. What about data
cleaning as a process? How exactly does one proceed in
tackling this task? Are there any tools out there to
help?”
Data Integration
Data mining often requires data integration—
the merging of data from multiple data stores.
Redundancy and Correlation Analysis
Redundancy is another important issue in data integration.
An attribute (such as annual revenue, for instance) may be
redundant if it can be “derived” from another attribute or
set of attributes.
Data Transformation
In this preprocessing step, the data are transformed or
consolidated so that the resulting mining process may
be more efficient, and the patterns found may be
easier to understand.
Example of data transformation: the original values -2, 36, 100, 58, 95 are transformed (normalized) to -0.02, 0.36, 1.00, 0.58, 0.95.
Data Transformation Strategies
In data transformation, the data are transformed or consolidated
into forms appropriate for mining. Strategies for data
transformation include the following:
Smoothing: which works to remove noise from the data.
Techniques include binning, regression, and clustering.
Aggregation: where summary or aggregation operations are
applied to the data. For example, the daily sales data may be
aggregated so as to compute monthly and annual total amounts.
Normalization: where the attribute data are scaled so as to fall
within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0.
Discretization: where the raw values of a numeric
attribute (e.g., age) are replaced by interval labels (e.g., 0–
10, 11–20, etc.) or conceptual labels (e.g., youth, adult,
senior). The labels, in turn, can be recursively organized
into higher-level concepts, resulting in a concept hierarchy
for the numeric attribute.
Data Transformation by
Normalization
The measurement unit used can affect the data analysis.
Methods for Data Normalization
There are many methods for data normalization like:
min-max normalization,
z-score normalization, and
normalization by decimal scaling.
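The first two methods can be sketched in pure Python; these are illustrative helpers (the z-score version uses the population standard deviation):

```python
import statistics

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall in [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_normalize(values):
    """Rescale values to have mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
print(z_score_normalize([10, 20, 30]))  # roughly [-1.22, 0.0, 1.22]
```

Min-max normalization is sensitive to outliers (a single extreme value compresses all other values), which is one motivation for preferring z-score normalization on noisy data.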
Z-score normalization
Normalization by decimal scaling
Decimal Scaling is a data normalization technique.
In this technique, we move the decimal point of values of
the attribute.
This movement of decimal points totally depends on the
maximum value among all values in the attribute.
Formula for Decimal Scaling:
A value v of attribute A can be normalized by v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
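A small sketch of decimal scaling following this formula; the sample values are illustrative:

```python
def decimal_scale(values):
    """Normalize by moving the decimal point: v' = v / 10**j, where j is
    the smallest integer such that max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]

print(decimal_scale([986, -217]))  # j = 3, giving [0.986, -0.217]
```

Here the maximum absolute value is 986, so the decimal point moves three places (j = 3).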
Data Transformation by
Discretization
Data discretization converts a large number of data values into a smaller number so that data evaluation and data management become easier.
Data Discretization Techniques:
Histogram Analysis
Binning
Correlation Analysis
Clustering Analysis
Decision Tree Analysis
Data Transformation by Concept Hierarchy Generation
Manual definition of concept hierarchies can be a tedious and
time-consuming task for a user or a domain expert.
Fortunately, many hierarchies are implicit within the database
schema and can be automatically defined at the schema
definition level.
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level concepts. Concept hierarchies can be used to transform the data into multiple levels of granularity.
For example, data mining patterns regarding sales may be
found relating to specific regions or countries, in addition to
individual branch locations.
Four methods for the generation of concept hierarchies for
nominal data:
1. Specification of a partial ordering of attributes explicitly at
the schema level by users or experts: Concept hierarchies
for nominal attributes or dimensions typically involve a
group of attributes.
A user or expert can easily define a concept hierarchy by
specifying a partial or total ordering of the attributes at the
schema level.
For example, suppose that a relational database contains the
following group of attributes: street, city, province or state,
and country. Similarly, a data warehouse location dimension
may contain the same attributes.
A hierarchy can be defined by specifying the total
ordering among these attributes at the schema level such
as street < city < province or state < country.
2. Specification of a portion of a hierarchy by explicit
data grouping: This is essentially the manual definition
of a portion of a concept hierarchy. In a large database, it
is unrealistic to define an entire concept hierarchy by
explicit value enumeration.
However, we can easily specify explicit groupings
for a small portion of intermediate-level data.
For example, after specifying that province and country
form a hierarchy at the schema level, a user could define
some intermediate levels manually such as:
“{Janakpuri, Dwarka} ⊂ West Delhi”
3. Specification of a set of attributes, but not of their
partial ordering: A user may specify a set of attributes
forming a concept hierarchy, but omit to explicitly state
their partial ordering.
The system can then try to automatically generate the
attribute ordering so as to construct a meaningful concept
hierarchy based on the number of distinct values.
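This heuristic can be sketched as sorting attributes by their distinct-value counts, with fewer distinct values placed higher in the hierarchy. The rows and attribute names below are hypothetical:

```python
def auto_hierarchy(rows, attributes):
    """Order attributes by ascending distinct-value count: an attribute with
    fewer distinct values is placed at a higher level of the hierarchy."""
    counts = {a: len({row[a] for row in rows}) for a in attributes}
    return sorted(attributes, key=lambda a: counts[a])

# Made-up location records.
rows = [
    {"country": "India", "state": "Delhi",  "city": "New Delhi", "street": "S1"},
    {"country": "India", "state": "Delhi",  "city": "Dwarka",    "street": "S2"},
    {"country": "India", "state": "Punjab", "city": "Amritsar",  "street": "S3"},
    {"country": "India", "state": "Punjab", "city": "Amritsar",  "street": "S4"},
]
print(auto_hierarchy(rows, ["street", "city", "state", "country"]))
# country (1 value) < state (2) < city (3) < street (4), top level first
```

As the year/month/day-of-week example below shows, this distinct-count heuristic can order attributes incorrectly, so the result should be reviewed by the user.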
Note that this heuristic rule is not foolproof. For
example, a time dimension in a database may contain
20 distinct years, 12 distinct months, and 7 distinct
days of the week. However, this does not suggest that
the time hierarchy should be “year < month < days of the week,” with days of the week at the top of the hierarchy.
Concept Hierarchy Generation by system on the basis of
distinct values.
4. Specification of only a partial set of attributes:
Sometimes a user can be careless when defining a
hierarchy, or have only a vague idea about what should be
included in a hierarchy.
Consequently, the user may have included only a small
subset of the relevant attributes in the hierarchy
specification.
For example, instead of including all of the
hierarchically relevant attributes for location, the user
may have specified only street and city. The remaining attributes, state and country, are then automatically pinned together according to the data semantics defined in the metadata.