Mit401 Unit 11-Slm


Data Warehousing and Data Mining Unit 11

Unit 11 Data Mining Techniques: An Overview


Structure:
11.1 Introduction
Objectives
11.2 Data mining: Various Definitions
11.3 Data Mining Versus Database Management System (DBMS)
11.4 Data Mining Techniques
Association rules
Classification
Regression
Clustering
Neural networks
11.5 Summary
11.6 Terminal Questions
11.7 Answers

11.1 Introduction
In the previous unit you learnt about methods for data preprocessing, such as
data cleaning, data integration and transformation, and data reduction. In
this unit you will be introduced to Data Mining techniques. Techniques such
as Association rules, Classification, Regression, Clustering and Neural
networks are discussed in brief.
Objectives:
After studying this unit, you should be able to:
define data mining
differentiate between Data Mining and DBMS
describe the association rules.
explain the usage of Neural Networks in Data Mining

11.2 Data mining: Various Definitions


Data Mining refers to the finding of relevant and useful information from
databases. Data Mining and knowledge discovery in the databases is a new
interdisciplinary field, merging ideas from statistics, machine learning,
databases and parallel computing.

Sikkim Manipal University B1633 Page No.: 156



Definitions:
Definition - 1:
Data Mining or knowledge discovery in databases, as it is also known,
is the non-trivial extraction of implicit, previously unknown and
potentially useful information from the data.

Definition - 2:
Data mining is the search for the relationships and global patterns that
exist in large databases but are hidden among vast amounts of data,
such as relationship between patient data and their medical diagnosis.

Definition - 3:
Data Mining is the process of discovering meaningful, new correlation
patterns and trends by sifting through large amounts of data stored in
repositories, using pattern recognition techniques.

11.3 Data Mining Vs Database Management System (DBMS)


DBMS stands for "Database Management System". This is the software that
manages data on physical storage devices. The software provides the ability
to store, access and modify the data. The software also provides a suite of
utilities to manage and monitor the performance of those actions against
the data. Examples of a DBMS would be Oracle, SQL Server, DB2 and
Informix in the relational (RDBMS) world.
A Database Management System (DBMS) supports query languages, which are
useful for query triggered data exploration, whereas data mining supports
automatic data exploration. If we know exactly what information we are
seeking, a DBMS query would suffice. If, on the other hand, we only vaguely
know the possible correlations or patterns, then Data Mining Techniques are
useful.
A majority of Data Mining systems do not use any DBMS and have their
own memory and storage management. They treat the database simply as a
data repository from which data is expected to be downloaded into their own
memory structures, before the data mining algorithm starts. The advantage
of such an approach is that one can optimize the memory management
specific to the Data Mining Algorithm.
Table 11.1: Differences between Database Management Systems (DBMS) and Data Mining

Area              DBMS                                             Data Mining
Task              Extraction of detailed and summary data          Knowledge discovery of hidden patterns and insights
Type of Result    Information                                      Insight and prediction
Method            Deduction (ask the question, verify with data)   Induction (build the model, apply it to new data, get the result)
Example Question  Who purchased mutual funds in the last 3 years?  Who will buy a mutual fund in the next 6 months and why?

Data Mining: A hot buzzword for a class of database applications that look
for hidden patterns in a group of data. For example, Data Mining software
can help retail companies find customers with common interests. The term
is commonly misused to describe software that presents data in new ways.
Self Assessment Questions
1. Oracle, SQL Server and DB2 are examples of _____________.
2. Data Base Management System (DBMS) supports query languages.
(True/False)

11.4 Data Mining Techniques


Many DM (Data Mining) techniques and systems have been developed and
designed. These techniques can be classified based on the database, the
knowledge to be discovered, and the techniques to be utilized.
1) Based on the Database
There are many database systems that are used in organizations, such
as relational database, transaction database, object oriented database,
spatial database, multimedia database, legacy database, and Web
database. A DM system can be classified based on the type of database
it is designed to mine.


2) Based on the Knowledge


DM systems can discover various types of knowledge, including
association rules, characteristic rules, classification rules, clustering,
evolution, and deviation analysis.
3) Based on the Techniques
DM systems can also be categorized by DM techniques. For example, a
DM system can be categorized according to what drives the mining, such as
autonomous knowledge mining, data-driven mining, query-driven mining,
and interactive DM techniques.
Data Mining Techniques
Data mining uses a number of techniques to discover patterns and uncover
trends in data warehouse data/data base. Researchers have identified two
fundamental goals of Data Mining:
1) Prediction: Prediction makes use of existing variables in the database
in order to predict unknown or future values of interest.
2) Description: Focuses on finding patterns describing the data and the
subsequent presentation for user interpretation.
The relative emphasis on prediction and description differs with respect
to the underlying application and the technique.
There are several Data Mining techniques fulfilling these objectives. Some
of these are:
Association
Classification
Regression
Clustering
Neural Networks
The basic premise of an association is to find all associations, such that the
presence of one set of items in a transaction implies the other items.
Classification develops profiles of different groups. Sequential patterns
identify sequential patterns subject to a user-specified minimum constraint.
Clustering segments a database into subsets.


11.4.1 Association Rules


The goal of association rules is to detect relationships or associations
between specific values of categorical variables in large data sets. These
powerful rules or techniques have a wide range of applications in many
areas of business practice and research - from the analysis of consumer
preferences or human resource management, to the history of language.
These techniques enable analysts and researchers to uncover hidden
patterns in large data sets, such as "customers who order product A often
also order product B or C" or "employees who said positive things about
initiative X also frequently complain about issue Y but are happy with issue
Z."
Association rules mining has many applications other than market basket
analysis, including applications in marketing, customer segmentation,
medicine, electronic commerce, bioinformatics and finance.
Association Rule Mining: A Road Map
Market basket analysis is just one form of association rule mining. In fact,
there are many kinds of association rules. Association rules can be
classified in various ways, based on the following criteria:
Based on the types of values handled in the rule: If a rule concerns
associations between the presence or absence of items, it is a Boolean
association rule.
If a rule describes associations between quantitative items or attributes,
then it is a quantitative association rule. In these rules, quantitative values
for items or attributes are partitioned into intervals. The following rule is an
example of a quantitative association rule, where X is a variable
representing a customer:
$age(X, \text{"30..39"}) \wedge income(X, \text{"42K..48K"}) \Rightarrow buys(X, \text{"high resolution TV"})$   (11.1)
Note that the quantitative attributes, age and income, have been discretized.
Based on the dimensions of data involved in the rule: If the items or
attributes in an association rule reference only one dimension, then it is a
single dimensional association rule. For example, consider the rule
$buys(X, \text{"computer"}) \Rightarrow buys(X, \text{"financial\_management\_software"})$   (11.2)
Rule (11.2) is a single dimensional association rule since it refers to only
one dimension, buys. If a rule references two or more dimensions, such as
the dimensions buys, time_of_transaction, and customer_category, then
it is a multidimensional association rule. Rule (11.1) is considered a
multidimensional association rule since it involves three dimensions: age,
income, and buys.
Based on the levels of abstractions involved in the rule set: Some
methods for association rule mining can find rules at differing levels of
abstraction. For example, suppose that a set of association rules mined
includes the following rules:
$age(X, \text{"30..39"}) \Rightarrow buys(X, \text{"laptop computer"})$   (11.3)
$age(X, \text{"30..39"}) \Rightarrow buys(X, \text{"computer"})$   (11.4)


In Rules (11.3) and (11.4), the items bought are referenced at different
levels of abstraction (e.g., computer is a higher level abstraction of
laptop computer). We refer to the rule set mined as consisting of multilevel
association rules. If, instead, the rules within a given set do not reference
items or attributes at different levels of abstraction, then the set contains
single level association rules.
Based on various extensions to association mining: Association mining
can be extended to correlation analysis, where the absence or presence of
correlated items can be identified. It can also be extended to mining
maxpatterns (i.e., maximal frequent patterns) and frequent closed itemsets.
A maxpattern is a frequent pattern, p, such that any proper super pattern of
p is not frequent. A frequent closed itemset is a frequent itemset that is
closed. An itemset c is closed if there exists no proper superset of c, c',
such that every transaction containing c also contains c'. Maxpatterns and
frequent closed itemsets can be used to substantially reduce the number of
frequent itemsets generated in mining.
How do Association Rules Work? The usefulness of this technique to
address unique data mining problems is best illustrated in a simple example.
Suppose you are collecting data at the checkout cash registers at a large
bookstore. Each customer transaction is logged in a database, and consists
of the titles of the books purchased by the respective customer, perhaps
additional magazine titles and other gift items that were purchased, and so
on. Hence, each record in the database will represent one customer
(transaction), and may consist of a single book purchased by that customer,
or it may consist of many (perhaps hundreds of) different items that were
purchased, arranged in an arbitrary order depending on the order in which
the different items (books, magazines, and so on) came down the conveyor
belt at the cash register. The purpose of the analysis is to find associations
between the items that were purchased, i.e., to derive association rules that
identify the items and co-occurrences of different items that appear with the
greatest (co-) frequencies. For example, you want to learn which books are
likely to be purchased by a customer who you know already purchased
(or is about to purchase) a particular book. This type of information could
then quickly be used to suggest to the customer those additional titles. You
may already be "familiar" with the results of these types of analyses, if you
are a customer of various on-line (Web-based) retail businesses. Many
times when making a purchase on-line, the vendor will suggest similar items
(to the ones purchased by you) at the time of "check-out". This is based on
some rules such as "customers who buy book title A are also likely to
purchase book title B," and so on.
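The bookstore scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the baskets and item names are invented, and the helper names `pair_counts` and `suggest` are hypothetical rather than part of any particular tool. The sketch counts how often two items appear in the same basket and suggests the items most often bought together with a given title.

```python
from collections import Counter
from itertools import combinations

# Hypothetical bookstore transactions; each set is one customer's basket.
transactions = [
    {"book_A", "book_B", "magazine_M"},
    {"book_A", "book_B"},
    {"book_A", "book_C"},
    {"book_B", "book_C"},
    {"book_A", "book_B", "book_C"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical (a, b) form with a < b.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def suggest(item, baskets, top_n=2):
    """Return the items most often bought together with `item`."""
    related = Counter()
    for (a, b), n in pair_counts(baskets).items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return [other for other, _ in related.most_common(top_n)]

print(suggest("book_A", transactions))
```

With these invented baskets, book_B shares a basket with book_A three times and book_C twice, so those two titles are suggested first.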
Many interesting algorithms have been proposed recently to discover
association rules. One of the key features of all these algorithms is that
each of them assumes that the underlying database is enormous and requires
multiple passes over the database.
Thus, a desirable feature of any efficient algorithm is that it reduces the
number of I/O operations while remaining efficient in computing.
Methods to Discover Association Rules
The discovery of association rules is the most well studied problem in data
mining. Many interesting algorithms have been proposed recently, and we
shall discuss some of the important ones. One of the key features of all
these algorithms is that each of them assumes that the underlying
database is enormous and requires multiple passes over the
database. For disk resident databases, this requires reading the database
completely for each pass, resulting in a large number of disk reads. In
these algorithms, the effort spent in performing just the I/O may be
considerable for large databases. For example, a 1 GB database will require
125,000 block reads for a single pass (for a block size of 8 KB). If the
algorithm requires 10 passes, this results in 1,250,000 block reads.
Assuming an average read time of 12 ms per block, the time spent in just
performing the I/O is 1,250,000 x 12 ms = 15,000 s, or a little over 4 hours.
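The arithmetic in this example can be checked with a short Python sketch. The function name `io_cost` is hypothetical, and the sketch assumes decimal units (1 GB = 10^9 bytes, 8 KB = 8,000 bytes), since that is what yields the figure of 125,000 blocks quoted in the text.

```python
def io_cost(db_bytes, block_bytes, passes, read_ms):
    """Return (blocks per pass, total block reads, total I/O time in hours)."""
    blocks_per_pass = db_bytes // block_bytes
    total_reads = blocks_per_pass * passes
    seconds = total_reads * read_ms / 1000
    return blocks_per_pass, total_reads, seconds / 3600

# Assumption: decimal units, 1 GB = 10**9 bytes and 8 KB = 8,000 bytes.
blocks, reads, hours = io_cost(10**9, 8_000, passes=10, read_ms=12)
print(blocks, reads, round(hours, 2))
```

The same function makes it easy to see how strongly the cost depends on the number of passes: halving the passes halves the I/O time.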
Apart from poor response times, this problem places a huge burden on the
I/O system. Usually, the data is collected by an online transaction
processing system running hundreds or thousands of transactions per
second. Running this algorithm under such workloads will adversely affect
the transaction response time, and may even disrupt the daily operation of
the database server. Over a network such as a LAN, it will create network
congestion problems and lead to poor resource utilization.
Problem Decomposition
The problem of mining association rules can be decomposed into two sub
problems:
Find all sets of items (item sets) whose support is greater than the user
specified minimum support, $\sigma$. Such item sets are called frequent item
sets.
Use the frequent item sets to generate the desired rules. The general
idea is that if, say, ABCD and AB are frequent item sets, then we
can determine whether the rule AB $\Rightarrow$ CD holds by checking the
following inequality:
$\frac{s(\{A, B, C, D\})}{s(\{A, B\})} \geq \tau$
where s(X) is the support of X in T and $\tau$ is the user specified minimum
confidence.
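The inequality above can be sketched in Python as follows. The transactions are invented for illustration, and the names `support_count`, `confidence` and `rule_holds` are hypothetical. Because both supports are taken over the same database T, the ratio of supports equals the ratio of support counts, so the sketch works with counts directly.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent, transactions):
    """s(antecedent U consequent) / s(antecedent), computed from counts."""
    both = support_count(set(antecedent) | set(consequent), transactions)
    ante = support_count(antecedent, transactions)
    return both / ante if ante else 0.0

def rule_holds(antecedent, consequent, transactions, min_conf):
    """True if the rule antecedent => consequent meets the minimum confidence."""
    return confidence(antecedent, consequent, transactions) >= min_conf

# Hypothetical transaction database over the items A, B, C and D.
T = [
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"C", "D"},
]

print(confidence({"A", "B"}, {"C", "D"}, T))
print(rule_holds({"A", "B"}, {"C", "D"}, T, min_conf=0.7))
```

Here AB occurs in four transactions and ABCD in three, so the rule AB => CD has confidence 3/4 and holds for any minimum confidence up to 0.75.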
Much research has been focused on the first sub problem, as the database
is accessed in this part of the computation, and several algorithms have
been proposed. We shall describe 5 important algorithms.
Definition: Frequent Set
Let T be the transaction database and $\sigma$ be the user specified minimum
support. An item set $X \subseteq A$ is said to be a frequent item set in T
with respect to $\sigma$ if $s(X)_T \geq \sigma$.
Downward Closure Property: Any subset of a frequent set is a
frequent set.
Upward Closure Property: Any superset of an infrequent set is an
infrequent set.
Discovering all frequent item sets and their supports is a non-trivial
problem if the cardinality of A, the set of items, and the database T are
large. For example, if |A| = m, the number of possible distinct item sets is
$2^m$. The problem is to identify which of these are frequent in the given
set of transactions. One way to achieve this is to set up $2^m$ counters, one
for each distinct item set, and count the support for every item set in a
single pass over the database. For realistic values of m (for example,
1,000), this approach is clearly impractical. It should be noted that only a
small fraction of these item sets would have minimum support. Hence, it is
not necessary to test the support for every item set. Even if it were
practically feasible, testing support for every possible item set would
result in much wasted effort. At the other extreme, we could keep just one
counter and make a separate database pass to count the support for each item
set. This is not practical either, as we would have to carry out a very large
number of I/O operations on the database.
To reduce the combinatorial search space, all algorithms exploit the two
properties outlined above. All existing algorithms for discovering frequent
sets are variants of this approach.
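One way to picture how the closure properties prune the search space is a level-wise search, in which candidates of size k are built only from frequent sets of size k - 1. The sketch below is an Apriori-style illustration and not a specific algorithm taken from this unit; the function name `frequent_itemsets` and the sample data are assumptions made for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Level-wise search that uses downward closure to prune candidates."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted(set().union(*transactions))
    # Level 1: frequent single items.
    level = {frozenset([i]) for i in items if count(frozenset([i])) >= min_count}
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = count(s)
        k += 1
        # Join step: unions of two frequent (k-1)-sets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: by downward closure, every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        # Only the surviving candidates are counted against the database.
        level = {c for c in candidates if count(c) >= min_count}
    return frequent

# Hypothetical transactions over the items a, b and c.
T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = frequent_itemsets(T, min_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

In this small example every single item and every pair is frequent, while the triple {a, b, c} is counted once and rejected because only two transactions contain it.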
Definition: Maximal Frequent Set
A frequent set is a maximal frequent set if it is a frequent set and no
superset of this is a frequent set.
Definition: Border Set
An item set is a border set if it is not a frequent set, but all its proper subsets
are frequent sets.
One can see that if X is an infrequent item set, then it must have a subset
(not necessarily a proper subset) that is a border set. It is easy to derive a
proof for this. Since X is not frequent, it is possible that it is a border set. In
that case, the proof is done. Let us assume that X is not a border set.
Hence, there exists at least one proper subset of cardinality |X| - 1 that is not
frequent, say X'. If X' is a border set, then the proof is complete. Let us,
hence, assume that X' is not a border set either. We recursively construct
X', X'', X''', and so on, having the common property that none of these is a
frequent set or a border set, and this construction process terminates when
we get a set that is a border set. This construction process must terminate in
a finite number of steps, as we are decreasing the size of the sets by 1 in
every step. In the extreme case, we may end up with a singleton item set
(the empty item set is always considered to be a frequent set).
Note that if we know the set of all maximal frequent sets of a given T with
respect to $\sigma$, then we can find the set of all frequent sets without any
extra scan of the database. Thus, the set of all maximal frequent sets can act
as a compact representation of the set of all frequent sets. However, if we
require the frequent sets together with their respective support values in T,
then we have to make one more database pass to derive the support values
once the set of all maximal frequent sets is known.

Fig. 11.1: Lattice of subsets

We shall often refer to the lattice of subsets of A throughout this chapter. For
example, if A = {a, b, c, d, e}, then the lattice is as shown in Figure 11.1.
In this lattice, the set of maximal frequent sets acts as a
boundary between the set of all frequent sets and the set of all infrequent
sets. It is thus easy to characterize the class of frequent sets and the class
of infrequent sets in terms of the boundary sets between these two classes.
Note that some maximal frequent sets are proper subsets of some border
sets. But there can also be a maximal frequent set which is not a proper
subset of any border set. Similarly, a proper subset of a border set, of
cardinality one less than the border set, is not necessarily maximal.
Thus, we cannot establish a definite relationship between the set of maximal
frequent sets and the set of border sets. However, the set of all border sets
and the set of those maximal frequent sets which are not proper subsets of
any of the border sets jointly provide a better representation of the set of
frequent sets.
Example
Study the following transaction database. We shall use this database for
illustration throughout this chapter.

A = {A1, A2, A3, A4, A5, A6, A7, A8, A9}. Assume $\sigma$ = 20%. Since T
contains 15 records, an itemset that is supported by at least
three transactions is a frequent set.
Table 11.1: Sample Database

A1 A2 A3 A4 A5 A6 A7 A8 A9
1 0 0 0 1 1 0 1 0
0 1 0 1 0 0 0 1 0
0 0 0 1 1 0 1 0 0
0 1 1 0 0 0 0 0 0
0 0 0 0 1 1 1 0 0
0 1 1 1 0 0 0 0 0
0 1 0 0 0 1 1 0 1
0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 1 0 1 0 1 0 0
0 0 1 0 1 0 1 0 0
0 0 0 0 1 1 0 1 0
0 1 0 1 0 1 1 0 0
1 0 1 0 1 0 1 0 0
0 1 1 0 0 0 0 0 1


Table 11.2: Frequent Count for Some Itemsets

X SUPPORT COUNT
{1} 2
{2} 6
{3} 6
{4} 4
{5} 8
{6} 5
{7} 7
{8} 4
{9} 2
{5,6} 3
{5, 7} 5
{6,7} 3
{5, 6, 7} 1

The number of transactions supporting some of the itemsets is given in
Table 11.2. Please note that we are using the index i for the item Ai. We shall
use this notational convention henceforth. So {1} is not a frequent set with
respect to $\sigma$, but {3} is a frequent set.
It is easy to check that {5, 6, 7} is a border set; {5, 6} is a maximal frequent
set; {2, 4} is also a maximal frequent set. But there is no border set having
{2, 4} as a proper subset. Thus, {2, 4} and {5, 6, 7} jointly represent the
frequent sets of T with respect to $\sigma$ that are formed from the items 2, 4,
5, 6 and 7. This is so, because we can generate all of these frequent sets
from these two itemsets. If we know the set of
all maximal frequent sets, we can generate all the frequent sets.
Alternatively, if we know the set of border sets and the set of those maximal
frequent sets, which are not subsets of any border set, even then we can
generate all the frequent sets.
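The claims in this example can be checked by brute force, since nine items give only 512 possible itemsets. The following Python sketch encodes the 15 records of the sample database as bit strings (one character per item A1 to A9) and applies the definitions of frequent set, maximal frequent set and border set directly. The variable and function names are hypothetical, and exhaustive enumeration is used only because the example is tiny.

```python
from itertools import combinations

# The 15 records of the sample database; position i holds the value of item A(i+1).
ROWS = [
    "100011010", "010100010", "000110100", "011000000", "000011100",
    "011100000", "010001101", "000010000", "000000010", "001010100",
    "001010100", "000011010", "010101100", "101010100", "011000001",
]
T = [frozenset(i + 1 for i, bit in enumerate(row) if bit == "1") for row in ROWS]
MIN_COUNT = 3  # 20% of 15 records

def count(itemset):
    """Number of records that contain every item of `itemset`."""
    return sum(1 for t in T if itemset <= t)

# Every subset of {1, ..., 9}, including the empty set.
all_sets = [frozenset(c) for k in range(10) for c in combinations(range(1, 10), k)]
frequent = {s for s in all_sets if count(s) >= MIN_COUNT}

# Maximal frequent set: frequent, and no proper superset is frequent.
maximal = {s for s in frequent if not any(s < other for other in frequent)}

# Border set: not frequent, but every subset with one item removed is frequent
# (by downward closure this covers all proper subsets).
border = {
    s for s in all_sets
    if s not in frequent
    and all(frozenset(sub) in frequent for sub in combinations(s, len(s) - 1))
}

print(count(frozenset({5, 6})), count(frozenset({5, 7})), count(frozenset({5, 6, 7})))
print(frozenset({5, 6, 7}) in border)
print(frozenset({5, 6}) in maximal, frozenset({2, 4}) in maximal)
print(any(frozenset({2, 4}) < b for b in border))
```

The first line of output reproduces the support counts of {5, 6}, {5, 7} and {5, 6, 7} from the table of frequent counts, and the remaining lines test the border set and maximal frequent set statements made above.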
Self Assessment Questions
3. Item sets whose support is greater than the user specified minimum
support, $\sigma$, are called _____ item sets.
4. A frequent set is a _______ if it is a frequent set and no superset of this
is a frequent set.


11.4.2 Classification
Unlike clustering, which groups like records together to give the end user
a high level view of what is going on in the database, classification assigns
records to classes that are known in advance.
Definition: Classification is a Data Mining (machine learning) technique
used to predict group membership for data instances. For example, you may
wish to use classification to predict if the weather on a particular day will be
sunny, rainy or cloudy. Popular classification techniques include
decision trees and neural networks.

Regression and Classification are two of the more popular prediction
techniques.

Classification involves finding rules that partition the data into disjoint
groups. The input for the classification is the training data set, whose class
labels are already known. Classification analyzes the training data set and
constructs a model based on the class label, and aims to assign a class
label to the future unlabelled records. Since the class field is known, this
type of classification is known as supervised learning. A set of classification
rules are generated by such a classification process, which can be used to
classify future data and develop a better understanding of each class in the
database.
The applications include the credit card analysis, banking, medical
applications and the like.
Classification Example
Problem:
Given a new automobile insurance applicant, should he or she be classified
as low risk, medium risk or high risk? Classification rules for the above
problem could use a variety of data, such as the customer's educational
level, salary, age, etc.
Classification Rules
Rule 1:
$\forall$ person P, P.degree = masters and P.income > 75,000 $\Rightarrow$ P.credit = Excellent
Rule 2:
$\forall$ person P, P.degree = bachelors and (P.income $\geq$ 25,000 and P.income $\leq$ 75,000) $\Rightarrow$ P.credit = Good
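The two rules above translate almost directly into code. The following Python sketch is illustrative only: the function name `classify_credit` is hypothetical, and returning None when no rule fires is an assumption, since the rules above do not say what happens to other applicants.

```python
def classify_credit(degree, income):
    """Apply Rule 1 and Rule 2 to one person and return the credit class."""
    # Rule 1: masters degree and income above 75,000.
    if degree == "masters" and income > 75_000:
        return "Excellent"
    # Rule 2: bachelors degree and income between 25,000 and 75,000 inclusive.
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "Good"
    # Assumption: neither rule applies, so no class is assigned.
    return None

print(classify_credit("masters", 90_000))
print(classify_credit("bachelors", 40_000))
print(classify_credit("bachelors", 10_000))
```

In practice such rules are not written by hand but generated from the training data set by the classification process.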
Decision Trees for Classification
A Decision Tree is a predictive model that, as its name implies, can be
viewed as a tree. Specifically each branch of the tree is a classification
question and the leaves of the tree are partitions of the dataset (data base
table/file) with their classification.

Fig. 11.1: Decision Tree Credit Risk Assessment

In the above classification, customers are classified into four groups, i.e.,
Bad, Good, Average and Excellent. At any moment of time a customer would
fall into exactly one of these groups.
11.4.3 Regression
Regression is the oldest and most well known statistical technique that the
Data Mining community utilizes. Basically, regression takes a numerical
dataset and develops a mathematical formula (e.g., y = a + bx, where y is the
dependent variable and x is the independent variable) that fits the data.
When you're ready to use the results to predict future behavior, you simply
take your new data, plug it into the developed formula and you've got a
prediction. The major limitation of this technique is that it works well
only with continuous quantitative data (like weight, speed or age).
If the data is categorical, where order is not significant (like color, name or
gender), then you are better off choosing another technique.
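As an illustration of how a formula of the form y = a + bx can be developed from data, the sketch below fits a straight line by ordinary least squares. Least squares is an assumption made for this example (the text above does not say how the formula is obtained), and the function names and data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates of a and b in y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # intercept
    return a, b

def predict(a, b, x):
    """Plug a new x value into the fitted formula."""
    return a + b * x

# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b, predict(a, b, 5))
```

Once a and b are known, prediction is just a matter of substituting a new value of x, which is the "plug it into the developed formula" step described above.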
11.4.4 Clustering
Clustering is a method of grouping data into different groups, so that the
data in each group share similar trends and patterns. Clustering constitutes
a major class of data mining algorithms. The algorithm attempts to
automatically partition the data space into a set of regions or clusters, to
which the examples in the table are assigned, either deterministically or
probabilistically. The goal of the process is to identify all sets of similar
examples in the data, in some optimal fashion.
The objectives of clustering are:
to uncover natural groupings
to initiate hypotheses about the data
to find a consistent and valid organization of the data.
A retailer may want to know where similarities exist in his customer base, so
that he can create and understand different groups. He can use the existing
database of different customers or, more specifically, different transactions
collected over a period of time. The clustering methods will help him in
identifying different categories of customers. During the discovery process,
the differences between the data sets can be discovered in order to
separate them into different groups, and similarity between data sets can be
used to group similar data together.
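The retailer example can be illustrated with k-means, one well known clustering method. Note that k-means is chosen here only as an example; this unit does not prescribe a particular clustering algorithm. The function name `k_means_1d`, the spending figures and the starting centres are all hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Deterministic k-means on one-dimensional data with given start centres."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest centre.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each centre moves to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical monthly spend per customer, with two visible groups.
spend = [10, 12, 11, 50, 52, 51]
centers, clusters = k_means_1d(spend, centers=[0.0, 60.0])
print(centers)
print(clusters)
```

The method needs no class labels: the two customer groups emerge from the similarity of the spending values alone.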
11.4.5 Neural networks
An Artificial Neural Network (ANN) is an information-processing paradigm
that is inspired by the way biological nervous systems, such as the brain,
process information.
The key element of this paradigm is the novel structure of the information
processing system. It is composed of a large number of highly
interconnected processing elements (neurons) working in unison to solve
specific problems. ANNs, like people, learn by example. An ANN is
configured for a specific application, such as pattern recognition or data
classification, through a learning process. Learning in biological systems
involves adjustments to the synaptic connections that exist between the
neurons. This is true of ANNs as well.
Neural Networks are made up of many artificial neurons. An artificial neuron
is simply an electronically modelled biological neuron. How many neurons
are used depends on the task at hand. It could be as few as three or as
many as several thousand. One optimistic researcher even hard wired
2 million neurons together in the hope that he could come up with something
as intelligent as a cat, although most people in the AI community doubted he
would be successful (update: he wasn't). There are many different ways of
connecting artificial neurons together to create a neural network.
There are different types of Neural Networks, each of which has different
strengths particular to their applications. The abilities of different networks
can be related to their structure, dynamics and learning methods.
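To make the idea of "learning by adjusting connections" concrete, the sketch below trains a single artificial neuron with the classic perceptron learning rule to reproduce the logical AND function. The perceptron rule, the function names and the training data are assumptions chosen for illustration; the text above does not single out any particular learning method.

```python
def step(x):
    """Threshold activation: the neuron fires (1) when its input is >= 0."""
    return 1 if x >= 0 else 0

def train_perceptron(samples, epochs=20, lr=1):
    """Learn two connection weights and a bias from labelled examples."""
    w = [0, 0]
    b = 0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = step(w[0] * x1 + w[1] * x2 + b)
            err = target - out
            # Adjust each connection in proportion to the error it contributed to.
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# Training examples for logical AND: ((input 1, input 2), expected output).
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)
predictions = [step(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in AND]
print(w, b, predictions)
```

A single neuron of this kind can only learn patterns that a straight line can separate; networks of many interconnected neurons are used for more complex tasks.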
Self Assessment Questions
5. The goal of ___________ rules is to detect relationships or associations
between specific values of categorical variables in large data sets.
6. A Decision Tree is a _____________ model.
7. Using a decision tree, only categorical variables can be modeled.
(True/False)
8. Clustering is an unsupervised learning method. (True/False)
9. Neural networks are made up of many _______________.

11.5 Summary
Data Mining or knowledge discovery in databases, as it is also known, is
the non-trivial extraction of implicit, previously unknown and potentially
useful information from the data.
Data Base Management System (DBMS) supports query languages,
which are useful for query triggered data exploration, whereas data
mining supports automatic data exploration.
Data Mining uses a number of techniques to discover patterns and
uncover trends in data warehouse data/data base.
Associations, Classifications, Clustering and Neural Networks are some
of the techniques for Data Mining.
Various tools exist for data mining operations. Examples include SAS
Enterprise Miner, Darwin (Oracle), Clementine (SPSS) and S-Plus.


11.6 Terminal Questions


1. What is Data Mining? How does it work?
2. Differentiate between database management systems (DBMS) and data
mining.
3. What is Association Technique in Data Mining? How does it work?
4. What is Clustering? Explain in brief.
5. What is Neural Network? Explain in brief.

11.7 Answers
Self Assessment Questions
1. DBMS
2. True
3. Frequent set
4. Maximal frequent set
5. Association
6. Predictive
7. False
8. True
9. Artificial neurons
Terminal Questions
1. Data Mining or knowledge discovery in databases, as it is also known, is
the non-trivial extraction of implicit, previously unknown and potentially
useful information from the data. Refer section 11.2.
2. Data Base Management System (DBMS) supports query languages
which are useful for query triggered data exploration, whereas data
mining supports automatic data exploration. Refer section 11.3.
3. The techniques describe the methods to detect relationships or
associations between specific values of categorical variables in large
data sets. Refer section 11.4.1.
4. Clustering is a method of grouping data into different groups, so that the
data in each group share similar trends and patterns. Refer section
11.4.4.
5. An Artificial Neural Network (ANN) is an information-processing
paradigm that is inspired by the way biological nervous systems, such
as the brain, process information. Refer section 11.4.5.
