Mit401 Unit 11-Slm
11.1 Introduction
In the previous unit you learnt about methods for data preprocessing, such as
data cleaning, data integration and transformation, and data reduction. In
this unit you will be introduced to Data Mining techniques; different
techniques such as Association rules, Classification, Regression, Clustering
and Neural Networks are discussed in brief.
Objectives:
After studying this unit, you should be able to:
define data mining
differentiate between Data Mining and DBMS
describe the association rules.
explain the usage of Neural Networks in Data Mining
Definitions:
Definition - 1:
Data Mining or knowledge discovery in databases, as it is also known,
is the non-trivial extraction of implicit, previously unknown and
potentially useful information from the data.
Definition - 2:
Data mining is the search for relationships and global patterns that
exist in large databases but are hidden among vast amounts of data,
such as the relationship between patient data and their medical diagnoses.
Definition - 3:
Data Mining is the process of discovering meaningful new correlations,
patterns and trends by sifting through large amounts of data stored in
repositories, using pattern recognition techniques.
Data Mining: A hot buzzword for a class of database applications that look
for hidden patterns in a group of data. For example, Data Mining software
can help retail companies find customers with common interests. The term
is commonly misused to describe software that presents data in new ways.
Self Assessment Questions
1. Oracle, SQL/Server, DB2 are examples for _____________.
2. Data Base Management System (DBMS) supports query languages.
(True/False)
these algorithms, the effort spent in performing just the I/O may be
considerable for large databases. For example, a 1 GB database will require
125,000 block reads for a single pass (for a block size of 8 KB). If the
algorithm requires 10 passes, this results in 1,250,000 block reads.
Assuming an average read time of 12 ms per block, the time spent just
performing the I/O is 1,250,000 x 12 ms, or roughly 4 hours.
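The figures above can be checked with a quick back-of-envelope calculation; this is a sketch in Python using the values quoted in the text (1 GB database, 8 KB blocks, 10 passes, 12 ms per read), not measured numbers.

```python
def io_time_hours(db_bytes, block_bytes, passes, read_ms):
    """Total sequential-read time, in hours, for `passes` full scans."""
    blocks_per_pass = db_bytes // block_bytes
    total_reads = blocks_per_pass * passes
    return total_reads * read_ms / 1000 / 3600

# 1 GB database, 8 KB blocks: 125,000 block reads per pass
blocks = 1_000_000_000 // 8_000
hours = io_time_hours(1_000_000_000, 8_000, 10, 12)
print(blocks, round(hours, 2))   # 125000 blocks, about 4.17 hours
```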
Apart from poor response times, this problem places a huge burden on the
I/O system. Usually, the data is collected by an online transaction
processing system running hundreds or thousands of transactions per
second. Running this algorithm under such workloads will adversely affect
the transaction response time, and may even disrupt the daily operation of
the database server. Over a network such as a LAN, it will create network
congestion problems and lead to poor resource utilization.
Problem Decomposition
The problem of mining association rules can be decomposed into two
sub-problems:
1. Find all sets of items (itemsets) whose support is greater than the
user-specified minimum support, σ. Such itemsets are called frequent
itemsets.
2. Use the frequent itemsets to generate the desired rules. The general
idea is that if, say, ABCD and AB are frequent itemsets, then we can
determine whether the rule AB → CD holds by checking the following
inequality:
s({A, B, C, D}) / s({A, B}) ≥ c,
where s(X) is the support of X in T and c is the minimum confidence.
Much research has been focused on the first sub problem, as the database
is accessed in this part of the computation, and several algorithms have
been proposed. We shall describe 5 important algorithms.
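The rule-generation step can be sketched as follows; the support counts and the minimum-confidence threshold are illustrative values, not taken from the text.

```python
def rule_holds(support, antecedent, itemset, minconf):
    """Check whether antecedent -> (itemset - antecedent) holds,
    i.e. s(itemset) / s(antecedent) >= minconf."""
    return support[itemset] / support[antecedent] >= minconf

# Hypothetical support counts for the frequent itemsets AB and ABCD.
support = {frozenset("AB"): 10, frozenset("ABCD"): 6}

# The rule AB -> CD has confidence 6/10 = 0.6
print(rule_holds(support, frozenset("AB"), frozenset("ABCD"), 0.5))  # True
```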
Definition: Frequent Set
Let T be the transaction database and σ the user-specified minimum
support. An itemset X ⊆ A is said to be a frequent itemset in T with
respect to σ if s(X) ≥ σ.
Downward Closure Property: Any subset of a frequent set is a
frequent set.
Sikkim Manipal University B1633 Page No.: 163
Data Warehousing and Data Mining Unit 11
We shall often refer to the lattice of subsets of A throughout this chapter. For
example, if A = {a, b, c, d, e}, then the lattice is given the following figure
(Figure 11.1). In this lattice, the set of maximal frequent sets acts as a
boundary between the set of all frequent sets and the set of all infrequent
sets. It is thus easy to characterize the class of frequent sets and the class
of infrequent sets in terms of the boundary sets between these two classes.
Note that some maximal frequent sets are proper subsets of some border
sets. But there can also be a maximal frequent set which is not a proper
subset of any border set. Similarly, it is possible that a proper subset of a
border set, of cardinality one less than the border set, is not necessarily
always maximal.
Let A = {A1, A2, A3, A4, A5, A6, A7, A8, A9} and assume σ = 20%. Since T
contains 15 records, this means that an itemset supported by at least
three transactions is a frequent set.
Table 11.1: Sample Database
A1 A2 A3 A4 A5 A6 A7 A8 A9
1 0 0 0 1 1 0 1 0
0 1 0 1 0 0 0 1 0
0 0 0 1 1 0 1 0 0
0 1 1 0 0 0 0 0 0
0 0 0 0 1 1 1 0 0
0 1 1 1 0 0 0 0 0
0 1 0 0 0 1 1 0 1
0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 1 0 1 0 1 0 0
0 0 1 0 1 0 1 0 0
0 0 0 0 1 1 0 1 0
0 1 0 1 0 1 1 0 0
1 0 1 0 1 0 1 0 0
0 1 1 0 0 0 0 0 1
Itemset X    Support count
{1} 2
{2} 6
{3} 6
{4} 4
{5} 8
{6} 5
{7} 7
{8} 4
{9} 2
{5,6} 3
{5, 7} 5
{6,7} 3
{5, 6, 7} 1
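The support counts above can be reproduced directly from Table 11.1; this is a brute-force sketch (no Apriori-style pruning), with items referred to by their index 1-9.

```python
# Rows of Table 11.1: each row is one transaction over items A1..A9.
rows = [
    (1,0,0,0,1,1,0,1,0), (0,1,0,1,0,0,0,1,0), (0,0,0,1,1,0,1,0,0),
    (0,1,1,0,0,0,0,0,0), (0,0,0,0,1,1,1,0,0), (0,1,1,1,0,0,0,0,0),
    (0,1,0,0,0,1,1,0,1), (0,0,0,0,1,0,0,0,0), (0,0,0,0,0,0,0,1,0),
    (0,0,1,0,1,0,1,0,0), (0,0,1,0,1,0,1,0,0), (0,0,0,0,1,1,0,1,0),
    (0,1,0,1,0,1,1,0,0), (1,0,1,0,1,0,1,0,0), (0,1,1,0,0,0,0,0,1),
]
# Convert each row to the set of item indices it contains.
transactions = [{i + 1 for i, v in enumerate(r) if v} for r in rows]

def support_count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

print(support_count({5}))        # 8
print(support_count({5, 7}))     # 5
print(support_count({5, 6, 7}))  # 1
# With sigma = 20% of 15 transactions = 3, {5,7} is frequent; {5,6,7} is not.
```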
11.4.2 Classification
Classification is the method by which records are assigned to predefined
groups or classes. Usually this is done to give the end user a high-level
view of what is going on in the database.
Definition: Classification is a Data Mining (machine learning) technique
used to predict group membership for data instances. For example, you may
wish to use classification to predict if the weather on a particular day will be
sunny, rainy or cloudy. Popular classification techniques include
decision trees and neural networks.
Classification involves finding rules that partition the data into disjoint
groups. The input for the classification is the training data set, whose class
labels are already known. Classification analyzes the training data set and
constructs a model based on the class label, and aims to assign a class
label to future unlabelled records. Since the class field is known, this
type of classification is known as supervised learning. Such a
classification process generates a set of classification rules, which can
be used to classify future data and to develop a better understanding of
each class in the database.
Applications include credit card analysis, banking, medical diagnosis and
the like.
Classification Example
Problem:
Given a new automobile insurance applicant, should he or she be classified
as low, medium or high risk? Classification rules for the above problem
could use a variety of data, such as the customer's educational level,
salary, age, etc.
Classification Rules
Rule 1:
∀ person P, P.degree = masters and P.income > 75,000
⇒ P.credit = Excellent
Rule 2:
∀ person P, P.degree = bachelors and
(P.income ≥ 25,000 and P.income ≤ 75,000)
⇒ P.credit = Good
Decision Trees for Classification
A Decision Tree is a predictive model that, as its name implies, can be
viewed as a tree. Specifically, each branch of the tree is a classification
question, and the leaves of the tree are partitions of the dataset (database
table/file) with their classification.
In the above classification, four groups are identified, i.e. Bad, Good,
Average and Excellent. At any moment of time a customer falls into exactly
one of these groups.
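Rules 1 and 2 above can be sketched directly as code; the fall-through label for applicants matched by neither rule is an illustrative assumption, not part of the rules in the text.

```python
def classify_credit(degree, income):
    """Apply Rule 1 and Rule 2 from the text to one applicant."""
    if degree == "masters" and income > 75_000:
        return "Excellent"        # Rule 1
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "Good"             # Rule 2
    return "unclassified"         # neither rule applies (assumed label)

print(classify_credit("masters", 90_000))    # Excellent
print(classify_credit("bachelors", 50_000))  # Good
```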
11.4.3 Regression
Regression is the oldest and most widely known statistical technique used
by the Data Mining community. Basically, regression takes a numerical
dataset and develops a mathematical formula (e.g. y = a + bx, where y is
the dependent variable and x is the independent variable) that fits the
data. When you are ready to use the results to predict future behavior,
you simply take your new data, plug it into the developed formula, and you
have a prediction. The major limitation of this technique is that it works
well only with continuous quantitative data (like weight, speed or age).
If the data is categorical, where order is not significant (like color,
name or gender), then you are better off choosing another technique.
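A minimal sketch of fitting y = a + bx by ordinary least squares, using only plain Python; the toy data are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Toy data lying exactly on y = 2 + 3x
xs = [1, 2, 3, 4]
ys = [5, 8, 11, 14]
a, b = fit_line(xs, ys)
print(round(a, 6), round(b, 6))  # 2.0 3.0
print(a + b * 5)                 # prediction for x = 5 -> 17.0
```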
11.4.4 Clustering
Clustering is a method of grouping data into different groups, so that the
data in each group share similar trends and patterns. Clustering constitutes
a major class of data mining algorithms. The algorithm attempts to
automatically partition the data space into a set of regions or clusters, to
which the examples in the table are assigned, either deterministically or
probability-wise. The goal of the process is to identify all sets of similar
examples in the data, in some optimal fashion.
The objectives of clustering are:
to uncover natural groupings
to initiate hypothesis about the data
to find consistent and valid organization of the data.
A retailer may want to know where similarities exist in his customer base, so
that he can create and understand different groups. He can use the existing
database or different customers or, more specifically, different transactions
collected over a period of time. The clustering methods will help him in
identifying different categories of customers. During the discovery process,
the differences between the data sets can be discovered in order to
separate them into different groups, and similarity between data sets can be
used to group similar data together.
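The section does not name a particular clustering algorithm; as one common example, here is a minimal k-means sketch that partitions points into k clusters by repeatedly assigning each point to its nearest centre and recomputing each centre as the mean of its points. The two-dimensional "customer" points are illustrative.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: alternate nearest-centre assignment and
    mean-recomputation for a fixed number of iterations."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the nearest centre (squared Euclidean distance)
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centres[j])))
            clusters[i].append(p)
        # move each centre to the mean of its cluster (keep it if empty)
        centres = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl
                   else centres[i] for i, cl in enumerate(clusters)]
    return centres, clusters

# Two obvious groups of 2-D "customers" (spend, visits)
pts = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centres, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```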
11.4.5 Neural networks
An Artificial Neural Network (ANN) is an information-processing paradigm
that is inspired by the way biological nervous systems, such as the brain,
process information.
The key element of this paradigm is the novel structure of the information
processing system. It is composed of a large number of highly
interconnected processing elements (neurons) working in unison to solve
specific problems. ANNs, like people, learn by example.
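As a minimal illustration of "learning by example", the following sketch trains a single artificial neuron (a perceptron) on the logical AND function; it is an illustrative toy, not the network architecture the unit describes.

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """Train one neuron (two weights + bias, step activation) by
    nudging the weights toward each training example's target."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# Learn logical AND from its truth table
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
predict = lambda x1, x2: 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
print([predict(*x) for x, _ in data])  # [0, 0, 0, 1]
```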
11.5 Summary
Data Mining or knowledge discovery in databases, as it is also known, is
the non-trivial extraction of implicit, previously unknown and potentially
useful information from the data.
Data Base Management System (DBMS) supports query languages,
which are useful for query triggered data exploration, whereas data
mining supports automatic data exploration.
Data Mining uses a number of techniques to discover patterns and
uncover trends in data warehouse data/data base.
Associations, Classifications, Clustering and Neural Networks are some
of the techniques for Data Mining.
Various tools exist for data mining operations, for example SAS
Enterprise Miner, Darwin (Oracle), Clementine (SPSS), S-Plus, etc.
11.7 Answers
Self Assessment Questions
1. DBMS
2. True
3. Frequent set
4. Maximal frequent set
5. Association
6. Predictive model
7. False
8. False
9. Artificial neurons
Terminal Questions
1. Data Mining or knowledge discovery in databases, as it is also known, is
the non-trivial extraction of implicit, previously unknown and potentially
useful information from the data. Refer section 11.1.
2. Data Base Management System (DBMS) supports query languages
which are useful for query triggered data exploration, whereas data
mining supports automatic data exploration. Refer section 11.2.
3. The techniques describe the methods to detect relationships or
associations between specific values of categorical variables in large
data sets. Refer section 11.3.1.
4. Clustering is a method of grouping data into different groups, so that the
data in each group share similar trends and patterns. Refer section
11.3.4.
5. An Artificial Neural Network (ANN) is an information-processing
paradigm that is inspired by the way biological nervous systems, such
as the brain, process information. Refer section 11.3.5.