Unit-III Data Mining Material
CONCEPT DESCRIPTION
Concept Description:
Concept description is the most basic form of descriptive data mining. It describes a given set of
task-relevant data in a concise and summarized manner, presenting interesting general properties
of the data. Concept description generates descriptions for data characterization and comparison.
It is sometimes called class description when the concept to be described refers to a class of
objects.
Concept description, which characterizes a collection of data and compares it with others in a
concise and succinct manner, is an essential task in data mining. Concept description can be
presented in many forms, including generalized relations, cross-tabulations (or briefly, crosstabs),
charts, graphs, etc.
Characterization provides a concise and succinct summarization of the given data collection, while
concept or class comparison (also known as discrimination) provides descriptions comparing two
or more data collections.
Concept comparison can be performed using the attribute-oriented induction or data cube
approaches in a manner similar to concept characterization. Generalized tuples from the target
and contrasting classes can be quantitatively compared and contrasted.
Data generalization and summarization-based characterization
Data mining can be classified into two categories:
1. Descriptive data mining describes the data set in a concise and summarized manner and
presents interesting general properties of the data.
2. Predictive data mining analyzes the data in order to construct one or a set of models, and
attempts to predict the behavior of new data sets.
Concept Description
The simplest kind of descriptive data mining.
A concept usually refers to a collection of data such as frequent_buyers, graduate_students,
and so on.
Generates descriptions for characterization and comparison of the data.
It is sometimes called class description.
Data Generalization by Attribute-Oriented Induction
The data cube can be viewed as a kind of multidimensional data generalization. Data
generalization summarizes data by replacing relatively low-level values (e.g., numeric values for
an attribute age) with higher-level concepts (e.g., young, middle-aged, and senior), or by reducing
the number of dimensions to summarize data in a concept space involving fewer dimensions (e.g.,
removing birth_date and telephone_number when summarizing the behavior of a group of
students).
Given the large amount of data stored in databases, it is useful to be able to describe
concepts in concise and succinct terms at generalized (rather than low) levels of abstraction.
Allowing data sets to be generalized at multiple levels of abstraction facilitates users in examining
the general behavior of the data.
In the AllElectronics database, for example, instead of examining individual customer
transactions, sales managers may prefer to view the data generalized to higher levels, such as
summarized by customer groups according to geographic region, frequency of purchases per
group, and customer income.
This leads us to the notion of concept description, which is a form of data generalization.
A concept typically refers to a data collection such as frequent_buyers, graduate_students, and so
on. As a data mining task, concept description is not a simple enumeration of the data.
Concept description generates descriptions for data characterization and comparison. It is
sometimes called class description when the concept to be described refers to a class of objects.
Characterization provides a concise and succinct summarization of the given data collection,
while concept or class comparison (also known as discrimination) provides descriptions
comparing two or more data collections.
Up to this point, we have studied data cube (or OLAP) approaches to concept description
using multidimensional, multilevel data generalization in data warehouses. "Is data cube
technology sufficient to accomplish all kinds of concept description tasks for large data sets?"
Consider the following cases.
Complex data types and aggregation: Data warehouses and OLAP tools are based on a
multidimensional data model that views data in the form of a data cube, consisting of dimensions
(or attributes) and measures (aggregate functions). However, many current OLAP systems confine
dimensions to non-numeric data and measures to numeric data. In reality, the database can
include attributes of various data types, including numeric, non-numeric, spatial, text, or image,
which ideally should be included in the concept description.
User control versus automation: Online analytical processing in data warehouses is a user-
controlled process. The selection of dimensions and the application of OLAP operations (e.g.,
drill-down, roll-up, slicing, and dicing) are primarily directed and controlled by users. Although
the control in most OLAP systems is quite user-friendly, users do require a good understanding of
the role of each dimension.
It is often desirable to have a more automated process that helps users determine which
dimensions (or attributes) should be included in the analysis, and the degree to which the given
data set should be generalized in order to produce an interesting summarization of the data.
This section presents an alternative method for concept description, called attribute-oriented
induction, which works for complex data types and relies on a data-driven generalization process.
Attribute-Oriented Induction for Data Characterization
The data cube approach can be considered as a data warehouse-based, precomputation-oriented,
materialized approach. It performs off-line aggregation before an OLAP or data mining
query is submitted for processing.
On the other hand, the attribute-oriented induction approach, at least in its initial proposal, is a
relational database query-oriented, generalization-based, on-line data analysis technique.
However, there is no inherent barrier distinguishing the two approaches based on online
aggregation versus offline precomputation.
Some aggregations in the data cube can be computed on-line, while off-line precomputation of
multidimensional space can speed up attribute-oriented induction as well.
1. The general idea of attribute-oriented induction is to first collect the task-relevant data (initial
relation) using a relational database query.
2. Then perform generalization by attribute removal or attribute generalization.
3. Apply aggregation by merging identical generalized tuples and accumulating their respective
counts.
4. Reduce the size of the generalized data set.
5. Interactive presentation with users.
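The five steps above can be sketched in a few lines of code. The student records, concept hierarchies, and attribute choices below are all hypothetical; the point is the removal/generalization/merge pattern, not the specific data:

```python
from collections import Counter

# Hypothetical concept hierarchies: map a low-level value to a higher-level concept.
def generalize_age(age):
    if age < 30:
        return "young"
    elif age < 60:
        return "middle_aged"
    return "senior"

def generalize_city(city):
    # Toy city -> country mapping.
    return {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"}.get(city, "other")

# Step 1: initial relation collected by a (hypothetical) task-relevant query.
initial_relation = [
    {"name": "Alice", "age": 21, "city": "Vancouver", "phone": "555-0101"},
    {"name": "Bob",   "age": 24, "city": "Toronto",   "phone": "555-0102"},
    {"name": "Carol", "age": 47, "city": "Seattle",   "phone": "555-0103"},
    {"name": "Dave",  "age": 23, "city": "Vancouver", "phone": "555-0104"},
]

# Step 2: attribute removal (name, phone have many distinct values and no
# generalization operator) and attribute generalization (age, city).
generalized = [(generalize_age(t["age"]), generalize_city(t["city"]))
               for t in initial_relation]

# Step 3: merge identical generalized tuples and accumulate their counts.
prime_relation = Counter(generalized)
for (age_group, country), count in prime_relation.items():
    print(age_group, country, count)
```

Running this merges the three young Canadian students into a single generalized tuple with count 3, which is exactly the size reduction described in step 4.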
Data focusing:
Analyzing the task-relevant data, including dimensions; the result is the initial relation.
Attribute-removal:
To remove attribute A if there is a large set of distinct values for A but (1) there is no
generalization operator on A, or (2) A’s higher-level concepts are expressed in terms of other
attributes.
Attribute-generalization:
If there is a large set of distinct values for A, and there exists a set of generalization operators on A,
then select an operator and generalize A.
Attribute-threshold control:
If the number of distinct values in a generalized attribute still exceeds the attribute generalization
threshold, further generalization of that attribute is performed (typical thresholds are small, e.g.,
2 to 8, and may be set by the user or by default).
InitialRel:
It is nothing but query processing of task-relevant data and deriving the initial relation.
PreGen:
It is based on the analysis of the number of distinct values in each attribute, and determines the
generalization plan for each attribute: should it be removed, and if not, how high should it be
generalized?
PrimeGen:
It is based on the PreGen plan and performing the generalization to the right level to derive a
“prime generalized relation” and also accumulating the counts.
Presentation:
User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs,
visualization presentations.
Example
Let's say there is a University database that is to be characterized; its corresponding DMQL query
would be
use University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone_no, GPA
from student
InitialRel:
PreGen
Now, we have generalized these results by removing a few attributes and retaining the important
ones.
We have also generalized a few attributes, renaming "Birth_Place" to "Country", "Birth_date" to
"Age_range", "Residence" to "City", and so on, as per the table given below.
PrimeGen
Based on the PreGen plan we've performed generalization to the right level to derive a “prime
generalized relation” and also we've accumulated the counts.
Final Results
Now we have analyzed and concluded our final generalized results as shown below.
Presentation Of Results
Generalized relation:
Relations where some or all attributes are generalized, with counts or other aggregation values
accumulated.
Cross-tabulation:
Visualization techniques:
Pie charts, bar charts, curves, cubes, and other visual forms.
Mapping generalized results into characteristic rules with quantitative information associated with them.
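As a small sketch of the cross-tabulation presentation, the code below lays out a hypothetical prime generalized relation of (gender, major, count) tuples as a crosstab with row totals; the data values are invented for illustration:

```python
# Generalized tuples (gender, major) with accumulated counts -- hypothetical data.
rows = [("M", "Science", 16), ("F", "Science", 22), ("M", "Arts", 27), ("F", "Arts", 18)]

# Build the cross-tab: gender down the side, major across the top, counts in cells.
genders = sorted({g for g, _, _ in rows})
majors = sorted({m for _, m, _ in rows})
table = {(g, m): c for g, m, c in rows}

print("gender " + " ".join(f"{m:>8}" for m in majors) + f" {'total':>8}")
for g in genders:
    counts = [table.get((g, m), 0) for m in majors]
    print(f"{g:<6} " + " ".join(f"{c:>8}" for c in counts) + f" {sum(counts):>8}")
```

The same dictionary of counts could equally be rendered as a bar chart or mapped into quantitative characteristic rules.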
--------------------------------------------------------------------------------------------------
Data Generalization is the process of summarizing data by replacing relatively low level values
with higher level concepts. It is a form of descriptive data mining.
There are two basic approaches to data generalization:
1. Data cube approach:
It is also known as the OLAP approach.
It is an efficient approach; for example, it is helpful for charting past sales.
In this approach, computation and results are stored in the Data cube.
It uses Roll-up and Drill-down operations on a data cube.
These operations typically involve aggregate functions, such as count(), sum(), average(), and
max().
These materialized views can then be used for decision support, knowledge discovery, and
many other applications.
2. Attribute-oriented induction:
It is an online, query-oriented, generalization-based data analysis approach.
In this approach, we perform generalization on the basis of the distinct values of each attribute
within the relevant data set. After that, identical tuples are merged and their respective counts
are accumulated in order to perform aggregation.
Unlike the data cube approach, it does not depend on off-line aggregation performed before a
query is submitted; in its initial proposal it was an on-line, relational database query-oriented,
generalization-based data analysis technique.
It is limited neither to particular measures nor to categorical data.
The attribute-oriented induction approach uses two methods:
(i) Attribute removal.
(ii) Attribute generalization.
--------------------------------------------------------------------------------------------
Association rule mining is a popular and well-researched method for discovering interesting
relations between variables in large databases. It is intended to identify strong rules discovered in
databases using different measures of interestingness. Based on the concept of strong rules,
Rakesh Agrawal introduced association rules.
Problem Definition:
Example database with 4 items and 5 transactions
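The example table itself did not survive extraction, so the sketch below substitutes a hypothetical database with 4 items and 5 transactions (the item names milk, bread, butter, and beer are invented) and computes the support and confidence of a sample rule:

```python
# Hypothetical stand-in for the example table: 4 items, 5 transactions.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "beer"},
    {"bread", "butter", "beer"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """conf(X => Y) = supp(X u Y) / supp(X)."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {bread} => {butter}
print(support({"bread", "butter"}))       # 3/5 = 0.6
print(confidence({"bread"}, {"butter"}))  # 0.6 / 0.8 = 0.75
```

A rule is called strong when both values clear user-specified minimum support and minimum confidence thresholds.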
The lift of a rule is defined as lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y)),
or the ratio of the observed support to that expected if X and Y were independent.
The conviction of a rule is defined as conv(X => Y) = (1 − supp(Y)) / (1 − conf(X => Y)),
and can be interpreted as the ratio of the expected frequency that X occurs without
Y (that is to say, the frequency that the rule makes an incorrect prediction) if X and Y
were independent, divided by the observed frequency of incorrect predictions.
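Under these definitions, lift and conviction can be computed directly. The sketch below reuses the same hypothetical 4-item, 5-transaction database as above (item names are invented):

```python
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "beer"},
    {"bread", "butter", "beer"},
]

def supp(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(x, y):
    # Observed support over the support expected if X and Y were independent.
    return supp(x | y) / (supp(x) * supp(y))

def conviction(x, y):
    # Expected frequency of X without Y under independence,
    # over the observed frequency of incorrect predictions.
    conf = supp(x | y) / supp(x)
    return float("inf") if conf == 1 else (1 - supp(y)) / (1 - conf)

print(lift({"bread"}, {"butter"}))        # 0.6 / (0.8 * 0.6) = 1.25
print(conviction({"bread"}, {"butter"}))  # (1 - 0.6) / (1 - 0.75) = 1.6
```

A lift above 1 indicates the items occur together more often than independence would predict.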
Frequent pattern analysis can generate various kinds of rules and other interesting
relationships. Association rule mining can generate a large number of rules, many of
which are redundant or do not indicate a correlation relationship among itemsets.
Sequential pattern mining searches for frequent subsequences in a sequence data set,
where a sequence records an ordering of events.
For example, with sequential pattern mining, we can study the order in which items are
frequently purchased. For instance, customers may tend to first buy a PC, followed by a
digital camera, and then a memory card.
In structured pattern mining, each element of a pattern may itself be a subsequence, a
subtree, a subgraph, and so on. Structured pattern mining can be considered the most
general form of frequent pattern mining.
Efficient Frequent Itemset Mining Methods:
Finding Frequent Itemsets Using Candidate Generation:The Apriori
Algorithm
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules. The name of the algorithm is based on the
fact that the algorithm uses prior knowledge of frequent itemset properties.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate
the count for each item, and collecting those items that satisfy minimum support. The
resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets,
which is used to find L3, and so on, until no more frequent k-itemsets can be found.
Example:
TID List of item IDs
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3
1. In the first iteration of the algorithm, each item is a member of the set of candidate
1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of
occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set of
frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets
satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum
support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1
to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the
prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2
is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-
itemsets in C2 having minimum support.
6. The set of candidate 3-itemsets, C3, is generated from the join step: C3 = L2 ⋈ L2 =
{{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based
on the Apriori property that all subsets of a frequent itemset must also be frequent, we can
determine that the four latter candidates cannot possibly be frequent.
7. The transactions in D are scanned in order to determine L3, consisting of those
candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses the join L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4.
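The eight steps above can be sketched as a small, self-contained Apriori implementation. This is a minimal illustration rather than the original pseudocode; it reruns the worked example (transactions T100 to T900, min_sup = 2):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent-itemset mining; returns {itemset: support_count}."""
    # L1: scan the database once to count each individual item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: L(k-1) joined with L(k-1), keeping only size-k unions.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Scan the database to count the surviving candidates.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        result.update(frequent)
        k += 1
    return result

db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
      {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
      {"I1", "I2", "I3"}]
freq = apriori(db, min_sup=2)
print(freq[frozenset({"I1", "I2", "I3"})])  # 2
print(freq[frozenset({"I1", "I2", "I5"})])  # 2
```

As in the text, L3 contains only {I1, I2, I3} and {I1, I2, I5}; their size-4 join is pruned because {I1, I3, I5} is not frequent, so the algorithm terminates.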
Generating Association Rules from Frequent Itemsets:
Once the frequent itemsets from transactions in a database D have been found, it
is straightforward to generate strong association rules from them.
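As a sketch of this step, the function below enumerates, for each frequent itemset l, every nonempty proper subset s and outputs the rule s => (l − s) whenever its confidence meets a threshold. The support counts are taken from the worked Apriori example above; min_conf = 0.7 is an assumed threshold:

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """For each frequent itemset l, emit s => (l - s) when confidence >= min_conf."""
    rules = []
    for itemset, count in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = count / freq[antecedent]  # supp(l) / supp(s)
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules

# Frequent itemsets (with support counts) from the worked example.
freq = {frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
        frozenset({"I1", "I5"}): 2, frozenset({"I2", "I5"}): 2,
        frozenset({"I1", "I2"}): 4, frozenset({"I1", "I2", "I5"}): 2}

for lhs, rhs, conf in generate_rules(freq, min_conf=0.7):
    print(set(lhs), "=>", set(rhs), f"conf={conf:.2f}")
```

With min_conf = 0.7, five strong rules survive, including I5 => {I1, I2} with 100% confidence.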
Example:
Therefore, data mining systems should provide capabilities for mining association
rules at multiple levels of abstraction, with sufficient flexibility for easy traversal
among different abstraction spaces.
Association rules generated from mining data at multiple levels of abstraction are called
multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a
support-confidence framework.
In general, a top-down strategy is employed, where counts are accumulated for the
calculation of frequent itemsets at each concept level, starting at concept level 1
and working downward in the hierarchy toward the more specific concept levels, until
no more frequent itemsets can be found.
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-
level, more general concepts. Data can be generalized by replacing low-level concepts within
the data by their higher-level concepts, or ancestors, from a concept hierarchy.
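As a small sketch of generalization under a concept hierarchy, the code below replaces each item in a transaction by its ancestor; the hierarchy entries are hypothetical, loosely modeled on the electronics example discussed in this section:

```python
# Hypothetical concept hierarchy: item -> its parent at the next-higher level.
hierarchy = {
    "IBM desktop computer": "desktop computer",
    "Dell desktop computer": "desktop computer",
    "desktop computer": "computer",
    "laptop computer": "computer",
    "Microsoft office software": "office software",
    "office software": "software",
    "antivirus software": "software",
}

def generalize(transaction, levels=1):
    """Replace each item by its ancestor, climbing `levels` steps up the hierarchy."""
    out = set()
    for item in transaction:
        for _ in range(levels):
            item = hierarchy.get(item, item)  # items without a parent stay put
        out.add(item)
    return out

t = {"IBM desktop computer", "Microsoft office software"}
print(generalize(t, 1))  # {'desktop computer', 'office software'}
print(generalize(t, 2))  # {'computer', 'software'}
```

Mining the transactions after applying `generalize` at each level yields the frequent itemsets for that level of the hierarchy.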
The concept hierarchy has five levels, referred to as levels 0 to 4, starting with
level 0 at the root node for all.
Level 2 includes laptop computer, desktop computer, office software, antivirus software.
Level 3 includes IBM desktop computer, . . . , Microsoft office software, and so on.
Level 4 is the most specific abstraction level of this hierarchy.
The deeper the level of abstraction, the smaller the corresponding threshold.
For example, the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively.
In this way, "computer," "laptop computer," and "desktop computer" are all considered
frequent.
age(X, "20…29") ^ occupation(X, "student") => buys(X, "laptop")
The above rule contains three predicates (age, occupation, and buys), each of which
occurs only once in the rule. Hence, we say that it has no repeated predicates.
From Association Mining to Correlation Analysis:
A correlation measure can be used to augment the support-confidence framework for
association rules. This leads to correlation rules of the form
A => B [support, confidence, correlation]
That is, a correlation rule is measured not only by its support and confidence but also by
the correlation between itemsets A and B. There are many different correlation measures
from which to choose. In this section, we study various correlation measures to determine
which would be good for mining large data sets.
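A minimal numeric sketch of such a correlation rule, using lift as the correlation measure: the counts below are hypothetical (a 20-transaction database; the game/video item names are purely illustrative), chosen to show a rule that passes support/confidence checks yet is negatively correlated:

```python
# Correlation rule: game => video [support, confidence, lift].
# Hypothetical counts over a small database of 20 transactions.
n = 20
n_game, n_video, n_both = 12, 15, 8

support = n_both / n                              # 0.40
confidence = n_both / n_game                      # ~0.67 -- looks like a strong rule
lift = support / ((n_game / n) * (n_video / n))   # ~0.89

# lift < 1: buying games and buying videos are negatively correlated,
# even though the rule clears typical support/confidence thresholds.
print(f"game => video [support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}]")
```

This illustrates why support and confidence alone can be misleading and why a correlation measure is a useful third component of the rule.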