
IT6702 - DATA WAREHOUSING AND

DATA MINING

By: Dr. N.Yuvaraj


IT6702 DATA WAREHOUSING AND DATA MINING
SYLLABUS

UNIT I DATA WAREHOUSING 9


Data warehousing Components – Building a Data warehouse – Mapping the Data Warehouse to a
Multiprocessor Architecture – DBMS Schemas for Decision Support – Data Extraction, Cleanup, and
Transformation Tools – Metadata.

UNIT II BUSINESS ANALYSIS 9


Reporting and Query tools and Applications – Tool Categories – The Need for Applications – Cognos
Impromptu – Online Analytical Processing (OLAP) – Need – Multidimensional Data Model – OLAP
Guidelines – Multidimensional versus Multirelational OLAP – Categories of Tools – OLAP Tools and
the Internet.

UNIT III DATA MINING 9


Introduction – Data – Types of Data – Data Mining Functionalities – Interestingness of Patterns –
Classification of Data Mining Systems – Data Mining Task Primitives – Integration of a Data Mining
System with a Data Warehouse – Issues –Data Preprocessing.

UNIT IV ASSOCIATION RULE MINING AND CLASSIFICATION 9


Mining Frequent Patterns, Associations and Correlations – Mining Methods – Mining various Kinds of
Association Rules – Correlation Analysis – Constraint Based Association Mining – Classification and
Prediction - Basic concepts - Decision Tree Induction - Bayesian Classification – Rule Based
Classification – Classification by Back propagation – Support Vector Machines – Associative
Classification – Lazy Learners – Other Classification Methods – prediction.

UNIT V CLUSTERING AND TRENDS IN DATA MINING 9


Cluster Analysis - Types of Data – Categorization of Major Clustering Methods – K-means– Partitioning
Methods – Hierarchical Methods - Density-Based Methods –Grid Based Methods – Model-Based
Clustering Methods – Clustering High Dimensional Data - Constraint – Based Cluster Analysis – Outlier
Analysis – Data Mining Applications. TOTAL: 45 PERIODS

TEXT BOOKS:
1. Alex Berson and Stephen J. Smith, "Data Warehousing, Data Mining and OLAP", Tata McGraw-Hill
Edition, Thirteenth Reprint 2008.
2. Jiawei Han and Micheline Kamber, "Data Mining Concepts and Techniques", Third Edition, Elsevier,
2012.

REFERENCES:
1. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, "Introduction to Data Mining", Pearson
Education, 2007.
2. K.P. Soman, Shyam Diwakar and V. Ajay, "Insight into Data Mining Theory and Practice", Eastern
Economy Edition, Prentice Hall of India, 2006.
3. G. K. Gupta, "Introduction to Data Mining with Case Studies", Eastern Economy Edition, Prentice Hall
of India, 2006.
4. Daniel T. Larose, "Data Mining Methods and Models", Wiley-Interscience, 2006.
UNIT I DATA WAREHOUSING

Data warehousing Components – Building a Data warehouse – Mapping the Data Warehouse to a
Multiprocessor Architecture – DBMS Schemas for Decision Support – Data Extraction, Cleanup,
and Transformation Tools – Metadata.

1.1 INTRODUCTION
1.1.1 Data Warehouse
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection
of data. It is a central repository of integrated data from one or more sources.
A data warehouse is used to:
• Store current and historical data.
• Create analytical reports for knowledge workers.
• Support informed decision making in an organization.

Operational Database Vs Data Warehouse


An operational database undergoes frequent changes on a daily basis on account of the
transactions that take place. A data warehouse provides generalized and consolidated data in a
multidimensional view. Along with this generalized and consolidated view of data, a data
warehouse also provides Online Analytical Processing (OLAP) tools.

Need to separate Data Warehouse from Operational Databases


An operational database is constructed for well-known tasks and workloads such as
searching particular records, indexing, etc. In contrast, data warehouse queries are often
complex and they present a general form of data. Operational databases support concurrent
processing of multiple transactions and therefore require concurrency control and recovery
mechanisms to ensure robustness and consistency of the database. An operational database allows
read and modify operations, while an OLAP query needs only read-only access to stored
data. An operational database maintains current data, whereas a data warehouse maintains
historical data.

Need for a Data Warehouse


• No need for frequent updating.
• It holds consolidated historical data.
• A data warehouse system helps in consolidated historical data analysis.
• It keeps historical data separate from the organization's operational database.
• A data warehouse helps in the integration of a diversity of application systems.
• It helps executives to organize, understand, and use their data to take strategic
decisions.
Data Warehouse Applications
Data warehouses have applications in various sectors. A few important applications are as follows:
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing

Data Warehouse Features


The key features of a data warehouse are as follows:
• Subject Oriented
• Integrated
• Time Variant
• Non-volatile

Subject Oriented
A data warehouse is subject oriented because it provides information around a subject
rather than the organization's ongoing operations. These subjects can be product, customers,
suppliers, sales, revenue, etc. A data warehouse does not focus on the ongoing operations, rather
it focuses on modeling and analysis of data for decision making.

Integrated
A data warehouse is constructed by integrating data from heterogeneous sources such as
relational databases, flat files, etc., to enhance the effective analysis of data.

Time Variant
The data collected in a data warehouse is identified with a particular time period. The data in
a data warehouse provides information from a historical point of view.

Non-volatile
The historical data in a data warehouse is kept separate from the operational database and
therefore frequent changes in operational database do not affect the data in data warehouse.
1.1.2 Online Transaction Processing (OLTP) vs Online Analytical Processing (OLAP)
OLAP
Online Analytical Processing (OLAP) deals with historical or archival data. OLAP
is a powerful technology for analyzing data, with capabilities for data discovery,
data reporting and complex analytical calculations. The main component of an OLAP
system is the data cube. A data cube is constructed by combining data warehouse structures such as
facts and dimensions. Merging all the cubes creates a multidimensional data warehouse.
Online Transaction Processing (OLTP) is a tool capable of supporting transaction-
oriented data over the internet. OLTP monitors the day-to-day transactions of an organization and
supports transaction-oriented applications in a 3-tier architecture. Data from OLTP systems are collected
over a period of time and stored in a very large database called a data warehouse. Data
warehouses are highly optimized for read (SELECT) operations. Transactional data are extracted
from multiple OLTP sources and pre-processed to make them compatible with
the data warehouse data format.
OLAP Example 1: If we collect the last 10 years of data about flight reservations, the data can
give us much meaningful information, such as trends in reservations. This may reveal useful
information like the peak time of travel and what kinds of people are traveling in various classes
(Economy/Business).

OLAP Example 2: A hospital has 20 years of very complete patient information
stored. Someone in the administration wants a detailed report of the most common
diseases, success rate of treatment, internship days and a lot of other relevant data. For this,
we apply OLAP operations to our data warehouse with historical information, and
through complex queries we get these results. These can then be reported to the
administration for further analysis.

OLTP vs OLAP – Example Queries


Examples of OLTP Queries:
• What is the salary of Mr. John?
• Withdraw money from a bank account: an update operation is performed when money is
withdrawn from the account.
• What are the address and email id of the person who is the head of the maths department?

Examples of OLAP Queries:

• How is the profit changing over the years across different regions?
• Is it financially viable to continue the production unit at location X?
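To make the contrast concrete, here is a small illustrative sketch in Python using SQLite: one OLTP-style point query and one OLAP-style aggregate query against a toy sales table. The table name, columns and rows are assumptions made up for this example, not part of any particular system.

# Illustrative sketch: OLTP-style vs OLAP-style queries on a toy sales table.
# The table name, columns and rows are assumptions made up for this example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    sale_id INTEGER PRIMARY KEY,
    region TEXT, year INTEGER, amount REAL)""")
conn.executemany(
    "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)",
    [("South", 2022, 120.0), ("South", 2023, 150.0),
     ("North", 2022, 90.0), ("North", 2023, 200.0)])

# OLTP-style query: fetch (or update) a single record.
print(conn.execute("SELECT amount FROM sales WHERE sale_id = 1").fetchone())

# OLAP-style query: aggregate history across regions and years.
for row in conn.execute(
        "SELECT region, year, SUM(amount) FROM sales "
        "GROUP BY region, year ORDER BY region, year"):
    print(row)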
Table 1.1 shown below gives a comparison of OLTP vs. OLAP.

CATEGORY: OLTP | OLAP

Focus: Updating data. | Reporting and retrieval of information.
Queries: Simple, returning the results expected for the system activity. | Complex queries of data in order to aggregate information for reporting.
Backup: Regular backups with full, incremental and archive copies; this data is critical and cannot be lost. | Simple backup or reloading of the data; mechanisms that support data insertion.
Data task: Operational tasks. | Business tasks; reporting and data analysis.
Data source: Operational information of the application; generally this process is the source of the data. | Historical and archive data.
Space: Operational data stored, typically small. | Large data sets of historical information; large storage needed.
Schema: Normalized schemas; many tables and relationships. | Star, snowflake and constellation schemas; fewer tables, not normalized.
Applications: Management, operational, web services, client-server. | Management systems; reporting and decision support.
Data refresh: Insert, update and delete operations; performed fast, with immediate results. | Refreshing of data with huge data sets; takes time and is sporadic.
Speed: Fast; requires some indexes on large tables. | Slow, depending on the amount of data; requires more indexes.
Data model: Entity-relationship on databases. | One- or multi-dimensional data.
Horizon: Day-to-day, weeks, months. | Long-term data.
Users: Common person, staff. | Managers, executives, data scientists, marketers.

Table 1.1 OLTP vs OLAP


1.2 Data Warehousing Components
The tools and components that ensure the effective functioning of a data warehouse system
are listed below.
1. Sourcing, Acquisition, clean up and Transformation Tools
2. Repository.
3. Metadata (Data warehouse DBMS)
4. Data marts
5. Applications and Tools
6. Management Platform – Administration & Management
7. Information Delivery systems

1.2.1 Sourcing, Acquisition, clean up and Transformation Tools


Data sourcing, cleanup, transformation and migration tools extract the data from
operational systems and put it in a format suitable for applications. They convert the data into
information to be used by decision support tools. These tools generate programs that
move the data from multiple operational systems into the data warehouse and maintain the metadata.
They include various functionalities such as the following (a small sketch of these steps is given after the list):

• Removing unwanted data.
• Converting different data labels into common data names.
• Calculating derived fields and summaries.
• Data filling (filling in missing values).
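A minimal sketch of these cleanup and transformation steps, using pandas, is given below. The column names, the label mapping, the fill value and the derived field are assumptions chosen only to illustrate the kind of work such tools perform.

# Illustrative cleanup/transformation sketch (assumed column names and values).
import pandas as pd

raw = pd.DataFrame({
    "cust_nm": ["Alice", "Bob", "Carol"],      # differing label for "customer_name"
    "sales_amt": [100.0, None, 250.0],         # missing value to be filled
    "internal_flag": ["x", "y", "z"],          # unwanted column
})

clean = (raw
         .drop(columns=["internal_flag"])                  # removing unwanted data
         .rename(columns={"cust_nm": "customer_name",      # common data names
                          "sales_amt": "sales_amount"})
         .fillna({"sales_amount": 0.0}))                    # data filling
clean["sales_with_tax"] = clean["sales_amount"] * 1.1       # derived field (assumed 10% tax)
print(clean)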
Sourcing, acquisition, cleanup and transformation tools deal with significant issues such as:

Database heterogeneity: data access languages, data navigation, operations, concurrency,
integrity, recovery, etc.

Data heterogeneity: the same name being used for different attributes. (This occurs when data from
disparate data sources are presented to the user through a unified interface.)

1.2.2 Repository (Data warehouse Database)


The repository is the data warehouse database itself: a centralized database used to store and maintain
the data. The data warehouse is implemented on an
RDBMS or on one of the proprietary multidimensional databases. The data warehouse database
provides various drivers for flexible user view creation, query processing, etc.

1.2.3 Metadata
Metadata is used for building, maintaining, managing and using the data warehouse. Metadata
provides the user easy access to the data, so a metadata interface needs to be created. Metadata
management is provided via a metadata repository and accompanying software which runs on a
workstation.
There are two types of metadata:
• Technical metadata.
• Business metadata.
Technical metadata contains data to be used by warehouse designers and administrators.
Business metadata contains information that gives users an easy-to-understand perspective of the
information stored in the data warehouse.
An important functional component of the metadata repository is the information directory.
The content of the information directory is the metadata.
The information directory and the metadata repository should:
• Be a gateway to the data warehouse environment.
• Support easy distribution and replication of their contents.
• Be searchable by business-oriented keywords.
• Support sharing of information objects.
• Support a variety of scheduling options.
• Support the distribution of query results.
• Support and provide interfaces to other applications.
• Support end-user monitoring of the status of the data warehouse environment.

1.2.4. Data mart


The data mart as shown in Figure 1.1 is a subset of the data warehouse that is usually oriented
to a specific business line or team.

Figure 1.1 Data mart

A data warehouse, unlike a data mart, deals with multiple subject areas and is typically
implemented and controlled by a central organizational unit such as the corporate IT group as
shown in Figure 1.2. Data marts are small slices of the data warehouse. Whereas data
warehouses have an enterprise-wide depth, the information in data marts pertains to a single
department.
Figure 1.2 Data Warehouse to Data Mart

Types of Data Mart:

1. Independent data mart.


2. Dependent data mart

Dependent data marts are simple because clean data has already been loaded into the
central data warehouse. Here the ETL process is mostly a process of identifying the right subset of
data relevant to the chosen data mart subject and moving a copy of it, perhaps in a
summarized form. Independent data marts deal with all aspects of the ETL process,
similar to a central data warehouse. The number of sources is likely to be fewer and the
amount of data associated with the data mart is less than in the warehouse, given the focus
on a single subject.

Difference between Data Mart and Data Warehouse


The difference between a data mart and a data warehouse is given in the following Table 1.2.
Table 1.2 Data mart vs Data Warehouse

1.2.5. Application and Access Tools


The main purpose of developing a data warehouse is strategic decision making. Users
interact with front-end tools. Ad hoc requests, reports and custom applications are the primary
deliverables of the analysis done. Example: share market.

Access Tools are grouped as follows:


1. Data query and reporting tools.
2. Application development tools.
3. EIS tools.
4. OLAP tools.
5. Data mining tools.
6. Data Visualization.

Application development tools can be used when the analytical needs of the data warehouse
increase. In this case organizations need to depend on developing applications based on
proven approaches. Some of the application development platforms are PowerBuilder
from Powersoft, Visual Basic from Microsoft, Forte from Forte Software, Business Objects, etc.

OLAP Tools
OLAP tools are based on multidimensional databases and allow a sophisticated user to
analyze the data using elaborate, multidimensional, complex views. Main applications of these
tools include product performance, profitability, effectiveness of sales programs, etc. These
tools assume that data is organized in a multidimensional database.
Data mining Tools
A success factor for any business is to use information effectively. Knowledge discovery from the
available data is important to formulate business strategies. Data mining is the process of
extracting patterns to build predictive rather than retrospective models. Data mining tools are used
to find hidden relationships. Data mining tools are used to perform the following tasks:
• Segmentation.
• Classification.
• Association.
• Preferencing.
• Visualizing data.
• Correcting data.

Access Tools – Data Visualization (EIS)


Visualization techniques create views of data. Visualization is not a separate tool; it is a
method of presenting the output of all the other tools. Visualization techniques are becoming more
sophisticated rather than simpler.

1.2.6 Data warehouse Administration and management.


A data warehouse is typically at least four times larger than the operational databases that feed it. A data warehouse is not
synchronized in real time with its sources. A data warehouse includes gateways to access enterprise data sources and
requires no special internetworking technologies.

Managing data warehouse includes:


• Security and priority management.
• Monitoring updates from multiple sources.
• Data quality checks.
• Managing and updating metadata.
• Auditing and reporting DW usage and status.
• Purging data.
• Backup and recovery.

1.2.7 Information Delivery systems


The information delivery component of a DW enables the process of subscribing to data
warehouse information and having it delivered to one or more destinations of choice according to
user-specified algorithms. Delivery of information is based on the time of day or on the
completion of an external event.
1.3 Building a Data Warehouse

To design a data warehouse, Ralph Kimball proposed a nine-step method, listed as follows:

1. Choosing the subject matter of a particular data mart.


2. Deciding what a fact table record represents.
3. Identifying and conforming the dimensions.
4. Choosing the facts.
5. Storing the pre-calculations in the fact table.
6. Rounding out the dimension tables.
7. Choosing the duration of the database.
8. The need to track slowly changing dimensions.
9. Deciding the query priorities and the query modes.

1.3.1 Schema Design


Data warehouses use various schemas to represent the multidimensional model, built from:

• Dimension tables
• Fact tables

Building a Data Warehouse - Fact Table


Fact tables record measurements for a specific event. Fact tables generally consist of
numeric values and foreign keys to dimensional data where descriptive information is kept. Fact
tables are designed to record events at a very atomic level, so a large number of records accumulates in a fact
table over time. Fact tables are defined as one of three types:

• Transaction fact tables record facts about a specific event (e.g., sales events)
• Snapshot fact tables record facts at a given point in time (e.g., account details at month
end)
• Accumulating snapshot tables record aggregate facts at a given point in time (e.g., total
month-to-date sales for a product)
Building a Data Warehouse - Dimension Table:
Dimension tables, as shown in Figure 1.3, have a relatively small number of records
compared to fact tables, but each record may have a very large number of attributes to describe
the fact data. Dimensions can define a wide variety of characteristics, but whatever attributes a
dimension table defines, those attributes should be:
• Verbose (labels consisting of full words)
• Descriptive
• Complete (having no missing values)
• Discretely valued (having only one value per dimension table row)
• Quality assured (having no misspellings or impossible values)

Figure 1.3 Dimension table
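As an illustration of the two table types, the sketch below creates one dimension table and one fact table in SQLite and joins them for a simple aggregate. The table and column names are assumptions made for this example only; they are not taken from the figures above.

# Illustrative star-schema fragment: a dimension table and a fact table (assumed names).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (              -- dimension: descriptive, verbose attributes
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    category TEXT,
    brand TEXT
);
CREATE TABLE fact_sales (               -- fact: numeric measures plus foreign keys
    sale_id INTEGER PRIMARY KEY,
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER,
    quantity INTEGER,
    amount REAL
);
""")

conn.execute("INSERT INTO dim_product VALUES (1, 'Espresso Beans 1kg', 'Coffee', 'Acme')")
conn.execute("INSERT INTO fact_sales VALUES (100, 1, 20240101, 2, 23.50)")

# A typical analysis joins the small dimension table to the large fact table.
print(conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d USING (product_key)
    GROUP BY d.category
""").fetchall())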

1.3.2 Nine Step approach for building a data warehouse


The nine-step approach proposed by Ralph Kimball for building a data warehouse is explained
below.

1.3.2.1 Choosing the subject matter.


The process (function) refers to the subject matter of a particular data mart. The first data
mart to be built should be the one that is most likely to be delivered on time, within budget,
and to answer the most commercially important business questions. The best choice for the
first data mart tends to be the one that is related to sales. This data source is likely to be
accessible and of high quality.
1.3.2.2 Deciding what a fact table record represents
This means deciding exactly what a fact table record represents. For example, if product sales form
the fact table, then the grain of the fact table is that each record represents an individual sale of a product.
Only after identifying the grain of the fact table can we decide the number of dimensions
of the fact table. For example, the grain of the customer dimension is the detail about the customer
who purchases the product.

1.3.2.3Identifying and conforming the dimensions.


A well-built set of dimensions makes the data mart understandable and easy to use. We
identify dimensions in sufficient detail to describe things such as clients and properties at the
correct grain. For example, each client of the Client Buyer dimension table is described by
the clientID, clientNo, clientName, clientType, city, region, and country attributes. A poorly
presented or incomplete set of dimensions will reduce the usefulness of a data mart to an
enterprise.

If any dimension occurs in two data marts, they must be exactly the same dimension, or one
must be a mathematical subset of the other. Only in this way can two data marts share one or
more dimensions in the same application. When a dimension is used in more than one data
mart, the dimension is referred to as being conformed.

1.3.2.4 Choosing the facts


The grain of the fact table determines which facts can be used in the data mart. All the
facts must be expressed at the level implied by the grain. In other words, if the grain of the
fact table is an individual property sale, then all the numerical facts must refer to this
particular sale. Also, the facts should be numeric and additive. Figure 1.4 shown below
represents the facts.
Figure 1.4 choosing the facts

1.3.2.5 Storing pre-calculations in the fact table


Once the facts have been selected, each should be re-examined to determine whether there
are opportunities to use pre-calculations. A common example of the need to store
pre-calculations occurs when the facts comprise a profit and loss statement. This situation
arises when the fact table is based on invoices or sales. The figure above shows the fact table with the
rentDuration, totalRent, clientAllowance, staffCommission, and totalRevenue attributes.
These types of facts are useful because they are additive quantities, from which we can
derive valuable information such as the average clientAllowance by aggregating some
number of fact table records.
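A short pandas sketch of storing such a pre-calculation is given below. Only the attribute names are taken from the text above; the values and the formula used for totalRevenue are assumptions made for illustration.

# Illustrative pre-calculation of an additive fact (assumed values and formula).
import pandas as pd

fact = pd.DataFrame({
    "rentDuration":    [3, 5],
    "totalRent":       [300.0, 500.0],
    "clientAllowance": [30.0, 50.0],
    "staffCommission": [15.0, 25.0],
})
# Store the derived fact once, at load time, so every query can simply aggregate it.
fact["totalRevenue"] = fact["totalRent"] - fact["clientAllowance"] - fact["staffCommission"]
print(fact["totalRevenue"].sum())          # additive: aggregates cleanly
print(fact["clientAllowance"].mean())      # e.g. average clientAllowance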

1.3.2.6 Rounding out the dimension tables


In this step, we return to the dimension tables and add as many text descriptions to the
dimensions as possible. The text descriptions should be as intuitive and understandable to the
users as possible. The usefulness of a data mart is determined by the scope and nature of the
attributes of the dimension tables. Based on the usefulness the dimension tables can be
reduced and rounded off.

1.3.2.7 Choosing the duration of the database


The duration measures how far back in time the fact table goes. In many enterprises,
there is a requirement to look at the same time period a year or two earlier. For other
enterprises, such as insurance companies, there may be a legal requirement to retain data
extending back five or more years. Very large fact tables raise at least two very significant
data warehouse design issues.

First, it is often increasingly difficult to source increasingly old data. The older the data,
the more likely there will be problems in reading and interpreting the old files or the old
tapes. Second, it is mandatory that the old versions of the important dimensions be used, not
the most current versions. This is known as the 'slowly changing dimension' problem, which
is described in more detail in the following step.

1.3.2.8 Tracking slowly changing dimensions


The slowly changing dimension problem means, for example, that the proper description
of the old client and the old branch must be used with the old transaction history. Often, the
data warehouse must assign a generalized key to these important dimensions in order to
distinguish multiple snapshots of clients and branches over a period of time. There are three
basic types of slowly changing dimensions:

Type 1, where a changed dimension attribute is overwritten;


Type 2, where a changed dimension attribute causes a new dimension record to be created;
Type 3, where a changed dimension attribute causes an alternate attribute to be created so
that both the old and new values of the attribute are simultaneously accessible in the same
dimension record.
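The sketch below contrasts Type 1 and Type 2 handling for a client whose city changes, using pandas. The dimension layout, surrogate keys and dates are assumptions made for illustration only.

# Illustrative slowly changing dimension handling (assumed columns and values).
import pandas as pd

dim_client = pd.DataFrame({
    "client_key": [1],            # generalized (surrogate) key
    "clientNo":   ["C100"],
    "city":       ["Chennai"],
    "valid_from": ["2020-01-01"],
    "valid_to":   [None],         # None marks the current record
})

# Type 1: overwrite the changed attribute, losing history.
type1 = dim_client.copy()
type1.loc[type1["clientNo"] == "C100", "city"] = "Coimbatore"

# Type 2: close the old record and insert a new one with a new surrogate key.
type2 = dim_client.copy()
type2.loc[type2["clientNo"] == "C100", "valid_to"] = "2024-06-30"
type2 = pd.concat([type2, pd.DataFrame([{
    "client_key": 2, "clientNo": "C100", "city": "Coimbatore",
    "valid_from": "2024-07-01", "valid_to": None}])], ignore_index=True)

print(type1)
print(type2)   # old facts keep pointing at client_key 1, new facts use client_key 2
# (Type 3 would instead add an alternate column, e.g. previous_city, to the same record.)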

1.3.2.9 Deciding the query priorities and the query modes


In this step, the physical design issues involved in data warehouse construction are considered. The
most critical physical design issues affecting the end-user's perception of the data mart are
the physical sort order of the fact table on disk and the presence of pre-stored summaries or
aggregations. Beyond these issues there are a host of additional physical design issues
affecting administration, backup, indexing performance, and security.
1.4 Mapping the Data Warehouse to a Multiprocessor Architecture
The primary objective in shifting from simple data warehouse processing to parallel data
warehouse processing is to gain a performance improvement.

Two main measures used to assess data warehouse processing performance
improvement are:

Throughput: the number of tasks that can be completed within a given time interval, as shown in
Figure 1.5(b).

Response time: the amount of time it takes to complete a single task from the time it is
submitted, as shown below in Figure 1.5(a).

Figure 1.5(a) Response time Figure 1.5(b) Throughput time

In parallel processing terminology, two metrics are taken into account:

Speed-Up
• Performance improvement gained because extra processing elements are added.
• Running a given task in less time by increasing the degree of parallelism.
Scale-Up
• Handling of larger tasks by increasing the degree of parallelism.
• The ability to process larger tasks in the same amount of time by providing more
resources.

1.4.1 Objectives in Moving to Multiprocessor Architecture from Uniprocessor


The two main objectives in moving to a multiprocessor architecture are
1. Speed Up
2. Scale Up
Speed up
Speed up is defined as the elapsed time on a uniprocessor divided by the elapsed time on a
multiprocessor.

Types of Speed up
There are three different types of speedup as shown in Figure 1.6 that may occur. They are
Linear speed up: Performance improvement growing linearly with additional resources
Superlinear speed up: Performance improvement growing super linearly with additional
resources
Sublinear speed up: Performance improvement growing sub linearly with additional resources

Figure 1.6 Different types of speed up

Scale up
Scale up is defined as the uniprocessor elapsed time on a small system divided by the multiprocessor
elapsed time on a larger system.
Types of Scale up
There are two different types of scale up that may occur, as shown in Figure 1.7. They are:
Linear scale up: the ability to maintain the same level of performance when both the workload
and the resources are proportionally added.
Sublinear Scale up: The performance of the system decreases when both the workload and the
resources are proportionally added.

Figure 1.7 Different types of Scale up
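A small worked example, with made-up timings, shows how both ratios are computed:

# Worked example with assumed timings.
uni_time, multi_time = 100.0, 30.0            # seconds: 1 CPU vs 4 CPUs on the same task
speed_up = uni_time / multi_time              # 3.33: sublinear (linear would be 4)
print(f"speed-up = {speed_up:.2f}")

small_time, large_time = 60.0, 75.0           # 1 CPU on a small job vs 4 CPUs on a 4x job
scale_up = small_time / large_time            # 0.8: sublinear (linear would be 1.0)
print(f"scale-up = {scale_up:.2f}")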

1.4.2 Types of Parallelism


The goals of linear performance and scalability can be satisfied by parallel hardware
architectures, parallel operating systems, and parallel DBMSs. Parallel hardware architectures are
based on multiprocessor systems designed as a shared-memory model, shared-disk model or
distributed-memory model.
Parallelism can be achieved in three different ways:
Horizontal parallelism - the database is partitioned across different disks.

Vertical parallelism - occurs among different tasks - all component query operations, i.e. scans,
joins, sorts.

Data partitioning - supports both horizontal and vertical parallelism.

Figure 1.8 shows the Types of DBMS parallelism.


Figure 1.8 Types of DBMS parallelism.

1.4.2.1 Horizontal Parallelism


Horizontal parallelism is the ability to run multiple instances of an operator on a specified
portion of the data. The way you partition your data greatly affects the efficiency of horizontal
parallelism.

1.4.2.2 Vertical Parallelism


Vertical parallelism is the ability to run multiple operators simultaneously by employing
different system resources such as CPUs, disks, and so on. Horizontal parallelism is the ability to
run multiple instances of an operator on the specified portion of the data.

1.4.2.3Data Partitioning
Data parallelism is parallelization across multiple processors in parallel computing
environments. It focuses on distributing the data across different nodes, which operate on
the data in parallel. It can be applied on regular data structures like arrays and matrices by
working on each element in parallel.
Data parallelism spreads the data across multiple disks randomly or intelligently. Random
methods include random selection and round robin. More structured partitioning schemes include the
following (a short sketch of several of these schemes follows the list):

• Hash partitioning
• Key range partitioning
• Schema partitioning
• User defined partitioning
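A minimal Python sketch of round-robin, hash and key-range partitioning over an assumed set of keyed records is given below; the keys, the number of partitions and the range boundaries are illustrative assumptions.

# Illustrative data partitioning strategies (assumed records and partition count).
rows = [(key, f"value-{key}") for key in [3, 17, 42, 8, 25, 61, 12, 49]]
n_parts = 3
partitions = {"round_robin": [[] for _ in range(n_parts)],
              "hash":        [[] for _ in range(n_parts)],
              "key_range":   [[] for _ in range(n_parts)]}
ranges = [(0, 20), (21, 40), (41, 10**9)]      # assumed key-range boundaries

for i, (key, value) in enumerate(rows):
    partitions["round_robin"][i % n_parts].append((key, value))     # round robin
    partitions["hash"][hash(key) % n_parts].append((key, value))    # hash partitioning
    for p, (lo, hi) in enumerate(ranges):                           # key range partitioning
        if lo <= key <= hi:
            partitions["key_range"][p].append((key, value))
            break

for scheme, parts in partitions.items():
    print(scheme, [len(p) for p in parts])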
1.4.3 Parallel Database Architectures
Parallel computers are no longer a monopoly of supercomputers. Parallel database systems seek
to improve performance through parallelization of various operations, such as loading data,
building indexes and evaluating queries. Although data may be stored in a distributed fashion,
the distribution is governed solely by performance considerations. Parallel databases improve
processing and input/output speeds by using multiple CPUs and disks in parallel.

1.4.3.1 Objectives - Parallel Database Architectures


• Start-up and consolidation costs
• Interference and communication
• Skew

Start-up and Consolidation


Start-up is the cost of initiating multiple processes. Consolidation is the cost of collecting the results
obtained from each processor at a host processor. Figure 1.9 shown below represents the serial part
vs the parallel part.

Figure 1.9 Serial part vs Parallel part

Interference and Communication


Interference refers to processes competing to access shared resources. Communication represents one process
communicating with other processes, as represented in Figure 1.10; often one process has to wait
for others to be ready for communication (i.e. waiting time).
Figure 1.10 Interference and communication

Skew
Skew, as shown below in Figure 1.11, is unevenness of the workload; load balancing
is one of the critical factors in achieving linear speed up.

Figure 1.11 Balanced workload vs Unbalanced workload(Skewed)

1.4.4 Different forms of parallel computers


Parallel computers are available in four different forms:

• Shared-memory architecture
• Shared-disk architecture
• Shared-nothing architecture
• Shared-something architecture

1.4.4.1 Shared-memory Architecture


In shared-memory architecture, multiple processors share the main memory space as well as
mass storage (e.g. hard disk drives).

1.4.4.2 Shared Disk Architecture


In shared-disk architecture, each node has its own main memory, but all nodes share mass
storage, usually a storage area network.

1.4.4.3 Shared-nothing Architecture


In shared nothing architecture each node has its own mass storage as well as main memory.

Shared-Memory and Shared-Disk Architectures


In shared-memory architecture, as shown below in Figure 1.12, all processors share a common main memory
and secondary memory. Load balancing is relatively easy to achieve, but such systems suffer from memory
and bus contention. In shared-disk architecture all processors, each of which has its own local main memory,
share the disks.

Figure 1.12 An SMP architecture

Shared-Nothing Architecture
In shared nothing architecture each processor has its own local main memory and disks. Load
balancing becomes difficult. Figure 1.13 represents A Shared Nothing Architecture.
Figure 1.13 A Shared Nothing Architecture

Shared-Something Architecture
It uses a mixture of shared-memory and shared-nothing architectures. Each node is a shared-
memory architecture connected to an interconnection network in a shared-nothing architecture.
Figure 1.14 represents Cluster of SMP architectures.

Figure 1.14 Cluster of SMP architectures.

Interconnection Networks
Common interconnection networks include the bus, mesh and hypercube topologies, as shown below in
Figure 1.15.
Figure 1.15(a) Bus interconnection Figure 1.15(b) Mesh interconnection network

Figure 1.15(c) Hypercube interconnection network.

Grid Database Architecture


Grid database architecture, as shown below in Figure 1.16, spans a wide geographical area in an
autonomous and heterogeneous environment. It provides grid services (meta-repository services, look-up
services, replica management services) and also contains grid middleware.
Figure 1.16 Data-intensive applications working in Grid Database architecture

Parallel RDBMS Features


Data warehouse development requires a good understanding of all architectural components,
including the data warehouse DBMS platform. Understanding the basic architecture of the
warehouse database is the first step in evaluating and selecting a product. The developers and
users of the warehouse should demand the following features from the DBMS vendor:

• Scope and techniques of parallel DBMS
• Application transparency
• The parallel environment
• DBMS management tools
• Price/performance

1.4.5 Forms of Parallelism


Forms of parallelism for database processing:

• Interquery parallelism
• Intraquery parallelism
• Interoperation parallelism
• Intraoperation parallelism
• Pipeline parallelism
• Independent parallelism
• Mixed parallelism

1.4.5.1 Interquery Parallelism


Interquery parallelism is 'parallelism among queries': different queries or transactions
are executed in parallel with one another. Its main aim is scaling up transaction processing systems.
Figure 1.17 represents interquery parallelism.

Figure 1.17 Interquery parallelism

1.4.5.2 Intraquery Parallelism


Intraquery parallelism is 'parallelism within a query': the execution of a single query in
parallel on multiple processors and disks. Its main aim is speeding up long-running queries. Figure
1.18 represents intraquery parallelism.

Figure 1.18 Intraquery Parallelism


The execution of a single query can be parallelized in two ways:

Intraoperation parallelism: speeding up the processing of a query by parallelizing the
execution of each individual operation (e.g. parallel sort, parallel search, etc.).

Interoperation parallelism: speeding up the processing of a query by executing in parallel
the different operations in a query expression (e.g. sorting and searching simultaneously).

1.4.5.3 Intraoperation Parallelism


Intraoperation parallelism is 'partitioned parallelism': parallelism due to the data being
partitioned. Since the number of records in a table can be large, the degree of parallelism is
potentially enormous. Figure 1.19 shown below represents intraoperation parallelism.

Figure 1.19 Intraoperation Parallelism
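The sketch below illustrates partitioned (intraoperation) parallelism with Python's multiprocessing pool: the same aggregation operator runs on each partition of an assumed data set and the partial results are then consolidated. It is only an illustration of the idea, not a DBMS implementation.

# Illustrative partitioned (intraoperation) parallelism: parallel partial sums.
from multiprocessing import Pool

def partial_sum(partition):
    # The same operator (SUM) runs on every partition of the data.
    return sum(partition)

if __name__ == "__main__":
    data = list(range(1_000_000))                       # assumed data set
    n_parts = 4
    parts = [data[i::n_parts] for i in range(n_parts)]  # partition the data
    with Pool(n_parts) as pool:
        partials = pool.map(partial_sum, parts)         # run operator instances in parallel
    print(sum(partials))                                # consolidate the partial results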

1.4.5.4 Interoperation parallelism


Interoperation parallelism is parallelism created by concurrently executing different
operations within the same query or transaction. It takes two forms:

• Pipeline parallelism
• Independent parallelism

1.4.5.5 Pipeline Parallelism


In pipeline parallelism the output records of one operation A are consumed by a second operation B,
even before the first operation has produced the entire set of records in its output. Multiple
operations form a kind of assembly line to manufacture the query results. Pipelining is useful with a small
number of processors, but does not scale up well. Figure 1.20 shown below represents pipeline
parallelism.

Figure 1.20 Pipeline parallelism
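A generator-based sketch of the idea is given below: operation B starts consuming records from operation A before A has produced its whole output. The record source and operators are assumptions made for illustration.

# Illustrative pipeline parallelism: downstream operators consume records
# as soon as upstream operators produce them (assumed record stream).
def scan():                           # operation A: produce records one at a time
    for i in range(10):
        yield {"id": i, "amount": i * 10.0}

def filter_large(records):            # operation B: consumes A's output incrementally
    for rec in records:
        if rec["amount"] >= 50.0:
            yield rec

def project_amount(records):          # operation C: next stage of the assembly line
    for rec in records:
        yield rec["amount"]

# The three operators form a pipeline; no stage waits for the full input of the previous one.
print(list(project_amount(filter_large(scan()))))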

1.4.5.6 Independent Parallelism


In independent parallelism, operations in a query that do not depend on one another are executed
in parallel. It does not provide a high degree of parallelism. Figure 1.21 shown below represents
independent parallelism.
Figure 1.21 Independent parallelism

1.4.5.7 Mixed Parallelism


In mixed parallelism as shown in Figure 1.22, a mixture of all available parallelism forms is
used.

Figure 1.22 Mixed parallelism

Parallel DBMS Vendors


Some well-known vendors of parallel DBMSs are as follows:
• Oracle - Oracle Parallel Server Option (OPS) and Parallel Query Option (PQO)

• Informix - developed its Dynamic Scalable Architecture (DSA) to support shared-memory,
shared-disk, and shared-nothing models

• Sybase - implemented in a product called SYBASE MPP

• IBM - used in DB2 Parallel Edition (DB2 PE), a database based on the DB2/6000 server
architecture

1.5 Data Extraction, Cleanup and Transformation tools

ETL tools and the data warehouse


Many ETL tools were originally developed to make the task of the data warehouse
developer easier and more fun. Developers found the task of hand-writing SQL code tough;
ETL tools replace it with easy drag-and-drop operations for developing a data warehouse. Today, the top ETL
tools in the market have vastly expanded their functionality beyond data warehousing and ETL.

They now contain extended functionalities for data profiling, data cleansing, Enterprise
Application Integration (EAI), Big Data processing, data governance and master data
management.

Tool Requirements

Tools are required for the following purposes:

• Data transformation from one format to another.
• Data consolidation and integration.
• Metadata synchronization and management.

ETL Tools – Importance


ETL tools are required to handle databases and data warehouses. When using any cloud storage
service, these ETL tools are of great importance. The main function of these tools is to
migrate data from the source database to the target database through cloud computing or
data warehouses. If a business process includes any kind of data migration, data transformation
or data mobility, then you might have to employ these ETL tools to enable your business
process.

List of Top ETL Tools


• Prism Solutions
• SAS Institute
• Carleton Corporation's PASSPORT and MetaCenter
• Validity Corporation
• Evolutionary Technologies
• Informatica – PowerCenter
• IBM – InfoSphere Information Server
• Oracle – Data Integrator
• Microsoft – SQL Server Integration Services
• Talend – Talend Open Studio for Data Integration
• SAS – Data Integration Studio
• SAP – BusinessObjects Data Integrator
• CloverETL
• Pentaho – Pentaho Data Integration
• Ab Initio

Prism Solutions
The Prism Warehouse Manager provides a solution for data warehousing by mapping source data to the
target database management system. The Prism Warehouse Manager generates code to extract and
integrate data, create and manage metadata and create a subject-oriented historical database. It
extracts data from multiple sources - DB2, IMS, VSAM, RMS and sequential files.

SAS Institute
SAS data access engines serve as extraction tools to combine common variables and transform
data representation forms for consistency. SAS also supports decision reporting and graphing, so it acts
as the front end.

Carleton Corporation's PASSPORT and MetaCenter


These products fulfill the data extraction and transformation needs of data warehousing. PASSPORT can
produce multiple output files from a single execution of an extract program. It is used for data
mapping and data migration. It runs as a client on various PC platforms in a three-tiered
environment.

It consists of two components:


Mainframe based: collects the file, record, or table layouts for the required inputs and outputs
and converts them into the Passport Data Language (PDL).

Workstation based: the user must transfer the PDL file from the mainframe to a location
accessible by PASSPORT.

PASSPORT offers
a metadata directory at the core of the process, plus robust data conversion, migration, analysis and
auditing facilities. The PASSPORT Workbench is a GUI workbench that enables project development
on a workstation and also supports the various personnel who design, implement or use the warehouse.

The MetaCenter
The MetaCenter is developed by Carleton Corporation and designed to put users in control of the
data warehouse. The heart of the MetaCenter is the metadata dictionary. The MetaCenter, in
conjunction with PASSPORT, provides a number of capabilities:
• Data extraction and transformation
• Event management and notification
• Data mart subscription
• Control center mover

Validity Corporation
Validity Corporation's Integrity data reengineering tool is used to investigate, standardize,
transform and integrate data from multiple operational systems and external sources. Its main
focus is on data quality, in particular on avoiding the GIGO (garbage in, garbage out)
principle.
Benefits of the Integrity tool: it builds accurate consolidated views of customers, suppliers, products
and other corporate entities, and maintains the highest quality of data.

Evolutionary technologies ETI-EXTRACT:


Another data extraction and transformation tool is the ETI-EXTRACT tool, which automates the
migration of data between dissimilar storage environments. It saves up to 95% of the time and
cost of manual data conversion. It enables users to populate and maintain a data
warehouse, and to migrate data to new databases, platforms, and applications. It automatically generates
and executes programs in the appropriate language for source and target platforms, provides a
powerful metadata facility, and provides a sophisticated graphical interface that allows users to
indicate how to move data through simple point-and-click operations.

Informatica – Power Center


PowerCenter is the ETL tool introduced by Informatica Corporation, which has a strong customer
base of over 4500 companies. The main components of PowerCenter are its client tools,
repository tools and servers. PowerCenter starts the execution process according to the workflow
created with the client tools.

Pros and Cons:


The ready availability of the tool and the easy training modules have made it a major hit among
customers. Another advantage of this ETL tool is that it can be integrated with the Lean
process, which is widely used in manufacturing companies.

IBM – Infosphere Information Server:


IBM, a market leader in computer technology, introduced the InfoSphere Information Server
for information integration and management in February 2008. This is a data integration
platform which can help you cleanse, transform and transport the required data into the
data warehouse and also interpret the data into the required business analytics. Over the years IBM
has introduced many upgraded versions of this server.

Pros and Cons:


InfoSphere server versions 8.7 and 9.1 are capable of integration with Netezza, which
helps with fast loading and optimum clarity in the transformation of data. These are mainly
designed for big data companies and may not be the right choice for mid-sized B2B companies.

Oracle – Data Integrator


Oracle Corporation, the experts in database management systems, has created its own
ETL tool in the name of Oracle Data Integrator. Due to its growing customer base,
Oracle has updated its ETL tools in various versions. In the latest version, Oracle has integrated its
ETL tools with Oracle GoldenGate 12c, which creates a very fast software portfolio for
data migration and data analysis.

Pros and Cons:


This Data Integrator is compatible with most platforms and is one of the fastest processors.
As described earlier, it is an integrated portfolio that is more suitable for large organizations
with recurring needs, and not for one-time migrations.

SQL Server Integration Services


SSIS is the data migration ETL tool created and introduced by Microsoft. With SSIS you are
enabled to use a scalable enterprise data integration platform. Microsoft Integration Services is
a platform for building enterprise-level data integration and data transformation solutions. You
use Integration Services to solve complex business problems by copying or downloading files,
sending e-mail messages in response to events, updating data warehouses, cleaning and mining
data, and managing SQL Server objects and data.

Pros and Cons: In SSIS, the transformation is processed in memory and so the integration
process is much faster in SQL Server. However, SSIS is compatible only with SQL Server.

Talend – Talend Open Studio for Data Integration


Talend Open Studio is one of the most powerful data integration ETL tools in the market. It
provides improved data integration with strong connectivity, easy adaptability and a good flow
of extraction and transformation processes.

Pros and Cons: It fits every kind of data integration process, from small file
transformations to big data migration and analysis. Moreover, its highly scalable architecture
has created a huge customer base.

SAS – Data Integration studio


SAS Data Integration Studio is a core component of the SAS system. Over the years,
SAS has established itself as a provider of some of the best data integration tools and systems. To
satisfy the growing needs of their customers, they have upgraded to more advanced tools which
help in organising and analysing the data transferred.

Pros and Cons: This tool has a clear and easy integration with the production process and other
business process components. You will also find exceptionally good auditing and data
capturing processes with these ETL tools.

SAP – BusinessObjects Data Integrator


BusinessObjects Data Integrator, the ETL tool introduced by SAP, was initially known as
ActaWorks. Primarily, this tool was created to build data marts and data warehouses. Later,
they created data integration models which allow customers to customize their package.

Pros and Cons: Data profiling and data validation are very attractive features in this ETL tool.
The advanced version of the data validation also serves as a firewall for your data network.
The only downside is that it is more suitable for small and mid-sized enterprises.

Clover ETL – CloverETL


CloverETL is the data integration portfolio introduced by Javlin Inc. in 2002. It is based on the Java
platform and mainly designed to transform and cleanse data from a source database into a
required target database.

Pros and Cons: This is a cross-platform tool, hence the user base is not restricted to
users of a particular OS. The non-availability of a debugging facility is one of the reasons why
big enterprises do not opt for CloverETL.

Pentaho – Pentaho Data Integration


Pentaho Data Integration is an ETL tool built on the Kettle runtime. Here, the procedures are saved in
XML files and interpreted by Java code while transforming the data.

Pros and Cons: Compared to the other ETL tools, this tool has a slower performance rate. The
other major drawback is the absence of a debugging facility.

Ab Initio
Ab Initio is an enterprise software company whose products are very user friendly for data
processing. Customers can use these tools for data integration and data warehousing, and also in
support of retail and banking.

Pros and Cons: This is considered one of the most efficient and fastest data
integration tools.
1.6 Meta Data
Metadata is simply defined as a set of data that describes and gives information about
other data, i.e. data about data. The data that is used to describe other data is known as metadata.
The term metadata is often used in the context of Web pages, where it describes page content for
a search engine. For example, the index of a book serves as metadata for the contents of the
book.

In terms of data warehouse, we can define metadata as follows.

• Metadata is the road-map to a data warehouse.

• Metadata in a data warehouse defines the warehouse objects.

• Metadata acts as a directory. This directory helps the decision support system to locate the
contents of a data warehouse.

1.6.1 Categories of Metadata


• Technical Metadata
• Business Metadata
• Operational Metadata

Business Metadata
Business metadata is data that adds business context to other data. It provides information
authored by business people and/or used by business people. It is in contrast to technical
metadata, which is data used for the storage and structure of the data in a database or system. A
simple example of business metadata is a glossary entry. Hover functionality in an application or
web form can show a glossary definition when the cursor is over a field or term.

Other examples of business metadata are:


• Business rules
• Data quality rules
• Valid values for reference data
• Wikis
• Collaboration software

Technical Metadata
Technical metadata describes the information required to access the data, such as where the
data resides or the structure of the data in its native environment. Technical metadata represents
information that describes how to access the data in its original native data storage. It includes
database system names, table and column names and sizes, data types and allowed values.
Technical metadata also includes structural information such as primary and foreign key
attributes and indices.
Using the example of an address book database, the following represents the technical
metadata we know about the ZIP code column:
• Named ZIPCode
• Nine characters long
• A string
• Located in the StreetAddress table
• Accessed using the SQL query language
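Such technical metadata is often recorded as simple structured entries. The sketch below stores the ZIP code description above as a Python dictionary in an assumed, simplified repository format.

# Illustrative technical-metadata entry for the ZIP code column (assumed repository format).
zip_code_metadata = {
    "column_name": "ZIPCode",
    "data_type": "string",
    "length": 9,
    "table": "StreetAddress",
    "query_language": "SQL",
}

# A toy repository keyed by table.column, standing in for a metadata repository.
repository = {"StreetAddress.ZIPCode": zip_code_metadata}
print(repository["StreetAddress.ZIPCode"]["data_type"])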

Operational Metadata -
Operational metadata is metadata about operational data. It includes the currency of data and data
lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data
means the history of the data as it is migrated, and the transformations applied to it.

1.6.2 Role of Metadata


The various roles of metadata are explained below.

• Metadata acts as a directory.

• It helps the decision support system to locate the contents of the DW.
• Operational information about the DW is stored as metadata.
• Metadata helps the decision support system in the mapping of data when data is transformed from
the operational environment to the data warehouse environment.
• Metadata helps in summarization between current detailed data and highly summarized data.
• Metadata is used by query and reporting tools.
• Metadata is used in ETL tools.
• Metadata plays an important role in loading functions.

The data warehouse environment includes one or more special-purpose metadata repositories that hold:

(a) information on the contents of the data warehouse, their location and their structure;
(b) information on the processes that take place in the data warehouse back-stage, concerning the
refreshment of the warehouse with clean, up-to-date, semantically and structurally reconciled
data;

(c) information on the implicit semantics of data, along with any other kind of data that aids the
end-user in exploiting the information of the warehouse;

(d) information on the infrastructure and physical characteristics of components and the sources
of the data warehouse;

(e) information including security, authentication, and usage statistics that aids the administrator
in tuning the operation of the data warehouse as appropriate. Figure 1.23 shown below represents a
metadata element for a customer entity and Figure 1.24 shows who needs metadata.

Figure 1.23 Metadata element for customer entity


Figure 1.24 Who needs metadata?

Figure 1.25 shown below represents how metadata acts as a nerve centre.

Figure 1.25 Metadata acts as a nerve centre

Figure 1.26 represents why metadata is vital for end users.


Figure 1.26 Metadata vital for end users

Figure 1.27 Metadata essential for IT.


Figure 1.28 Metadata drives data warehouse processes.

Example: uses of metadata

• Which territories does region "SOUTH" include?

• Does the data item 04-01-2000 denote April 1, 2000 or January 4, 2000? What is the convention
used for dates in your data warehouse?
• Are the numbers shown as sale units given in physical units of the products, or in some measure
such as pounds or kilograms?
• What about the amounts shown in the result set? Are these amounts in dollars or in some other
currency?
• Metadata gives your user the meaning of each data element.

Metadata Interchange initiative


The Metadata Interchange Initiative was formed to develop standard specifications for a metadata
interchange format. It allows vendors to exchange common metadata, avoiding the difficulties of
exchanging, sharing and managing metadata.

The initial goals include


Creating a vendor-independent, industry-defined and maintained standard access mechanism and
standard API; enabling individual tools to satisfy their specific metadata access requirements
freely and easily within the context of an interchange model; defining a clean, simple
interchange implementation infrastructure; and creating a process and procedures for extending and
updating the standard.
The Metadata Interchange Initiative has defined two distinct metamodels:
• The application metamodel - holds the metadata for a particular application.
• The metadata metamodel - the set of objects that the metadata interchange standard can be used
to describe.
Metadata Interchange standard framework
Metadata itself may be stored in different formats such as relational tables, ASCII files, etc. Three
different approaches are:
1. Procedural approach.
2. ASCII-based approach.
3. Hybrid approach.

Metadata Interchange standard framework Components


1. Standard meta data model.
2. Standard access framework.
3. Tool profile.
4. User Configuration.

Metadata Repository
A metadata repository is a database created to store metadata. The metadata itself is housed in
and managed by the metadata repository. Metadata repository management software is used to map
the source data to the target database, generate code for data transformation, integrate and
transform the data, and control the movement of data to the data warehouse. Metadata provides
decision-support-oriented pointers to warehouse data and provides the link between the data warehouse
and decision support systems. The data warehouse architecture should ensure that there is a mechanism
to populate the metadata repository, and all access paths to the data warehouse should have metadata as
an entry point.

Metadata Repository- Benefits


• It reduces and eliminates information redundancy, inconsistency and underutilization.
• It simplifies management and improves organization and control.
• It increases the identification, understanding, coordination and utilization of information assets.
• It provides effective data administration.
• It increases flexibility and control.
• It avoids further investment in legacy systems.
• It provides a universal relational model for heterogeneous RDBMSs to interact and share
information. Figure 1.29 shown below represents a three-tier data warehousing architecture.
Metadata Repository

Figure 1.29 A three-tier data warehousing architecture.


UNIT-I QUESTION BANK

PART A

1. Define the term 'Data Warehouse'.


2. Write down the applications of data warehousing.
3. When is data mart appropriate?
4. List out the functionality of metadata.
5. What are the nine decisions in the design of a data warehouse?
6. List out the two different types of reporting tools.
7. Why is data mining used in organizations?
8. What are the technical issues to be considered when designing and implementing a data
warehouse environment?
9. List out some examples of access tools.
10. What are the advantages of data warehousing?
11. Give the difference between the Horizontal and Vertical Parallelism.
12. Draw a neat diagram for the Distributed memory shared disk architecture.
13. Define star schema.
14. What are the reasons to achieve very good performance by SYBASE IQ technology?
15. What are the steps to be followed to store the external source into the data warehouse?
16. Define Legacy data.
17. Draw the standard framework for metadata interchange.
18. List out the five main groups of access tools.
19. Define Data Visualization.
20. What are the various forms of data pre-processing?

PART-B

1. Enumerate the building blocks of data warehouse. Explain the importance of metadata in a data
warehouse environment.
2. Explain various methods of data cleaning in detail.
3. Diagrammatically illustrate and discuss the data warehousing architecture with briefly explain
components of data warehouse.
4. (i) Distinguish between data warehousing and data mining.
(ii) Describe in detail about data extraction and clean-up.
5. Write short notes on (i) Transformation (ii) Metadata.
6. List and discuss the steps involved in mapping the data warehouse to a multiprocessor
architecture.
7. Discuss in detail about bitmapped indexing.
8. Explain in detail about different vendor solutions.
9. Explain the various groups of access tools.
10. Explain indexing.
UNIT II BUSINESS ANALYSIS

Reporting and Query tools and Applications – Tool Categories – The Need for Applications – Cognos
Impromptu – Online Analytical Processing (OLAP) – Need – Multidimensional Data Model – OLAP
Guidelines – Multidimensional versus Multirelational OLAP – Categories of Tools – OLAP Tools and the Internet.

Reporting and Query tools and Applications

The data warehouse is accessed using an end-user query and reporting tool from Business
Objects. The principal purpose of data warehousing is to provide information to business users
for strategic decision making. These users interact with the data warehouse using front-end
tools, or by getting the required information through the information delivery system.

2.1. Tool categories

There are five categories of decision support tools

 Reporting tools
 Managed Query tools
 Executive information systems
 On-line analytical processing
 Data mining

2.1.1 Reporting tools

Reporting tools can be divided into two types.


1. Production reporting tools
2. Desktop report writers
Production Reporting Tools:
These tools let companies generate regular operational reports or support high-volume batch jobs,
such as calculating and printing paychecks. Production reporting tools include third-generation
languages such as COBOL, specialized fourth-generation languages such as Information Builders,
Inc.'s Focus, and high-end client/server tools such as MITI's SQR.

Report writers:
Report writers are inexpensive desktop tools designed for end users. Generally they have graphical
interfaces and built-in charting functions. They can pull a group of data from a variety of data sources
and integrate it in a single report. Leading report writers include Crystal Reports, Actuate, and
Platinum Technology, Inc.'s InfoReports. Vendors are trying to increase the scalability of report
writers by supporting three-tiered architectures with Windows NT and UNIX servers, and they are
beginning to offer object-oriented interfaces for designing and manipulating reports as well as modules
for performing ad hoc queries and OLAP analysis.

2.1.2. Managed Query Tools


Managed query tools protect end users from the complexities of SQL and database
structures by inserting a metalayer between users and the database. Metalayer is the software that
provides subject-oriented views of a database and supports point-and-click creation of SQL.
Some vendors, such as Business objects, Inc., call this layer a "universe". Managed query tools
have been extremely popular because they make it possible for knowledge workers to access
corporate data without IS intervention. Most managed query tools have embraced three-tiered
architectures to improve scalability. Managed query tool vendors are racing to embed support for
OLAP and data mining features. Other tools include IQ Software's IQ Objects, Andyne Computing
Ltd.'s GQL, IBM's Decision Server, Speedware Corp.'s Esperant (formerly sold by Software AG),
and Oracle Corp.'s Discoverer/2000.
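To make the metalayer idea concrete, the toy sketch below (in Python) maps business-friendly names
to physical columns and assembles the SQL behind the scenes; the "universe" contents, table names
and join are invented for illustration and do not reflect any particular vendor's product.

    # A toy "metalayer": business terms mapped to physical table.column names.
    # All names are hypothetical.
    universe = {
        "Customer Name": "customers.cust_name",
        "Region":        "customers.region",
        "Revenue":       "orders.amount",
    }

    def build_query(selected_items, condition=None):
        """Assemble a SQL statement from point-and-click style selections."""
        cols = ", ".join(universe[item] for item in selected_items)
        sql = (f"SELECT {cols} FROM customers "
               f"JOIN orders ON customers.id = orders.cust_id")
        if condition:
            sql += f" WHERE {condition}"
        return sql

    print(build_query(["Region", "Revenue"], "customers.region = 'South'"))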

2.1.3. Executive Information System Tools


Executive Information System (EIS) tools predate report writers and managed query tools; they were
first installed on mainframes. EIS tools allow developers to build customized, graphical decision
support applications or "briefing books". EIS applications highlight exceptions to normal business
activity or rules by using color-coded graphics. EIS tools include Pilot Software, Inc.'s Lightship,
Platinum Technology's Forest and Trees, Comshare, Inc.'s Commander Decision, Oracle's Express
Analyzer and SAS Institute, Inc.'s SAS/EIS.
EIS vendors are moving in two directions. Many are adding managed query functions to
compete head-on with other decision support tools. Others are building packaged applications
that address horizontal functions, such as sales, budgeting, and marketing, or vertical industries
such as financial services. For example, Platinum Technology offers Risk Advisor.
Advantages of EIS
 Easy for upper-level executives to use; extensive computer experience is not
required to operate them
 Provides timely delivery of company summary information
 Information that is provided is better understood
Disadvantages of EIS
 System dependent
 Limited functionality, by design
 Information overload for some managers

2.1.4 OLAP Tools


OLAP tools provide an intuitive way to view corporate data. These tools aggregate data along
common business subjects or dimensions and then let users navigate through the hierarchies and
dimensions with the click of a mouse button. Some tools, such as Arbor Software Corp.'s Essbase
and Oracle's Express, pre-aggregate data in special multidimensional databases. Other tools work
directly against relational data and aggregate data on the fly, such as MicroStrategy, Inc.'s DSS
Agent or Information/Advantage, Inc.'s Decision Suite. Some tools process OLAP data on the
desktop instead of a server. Desktop OLAP tools include Cognos' PowerPlay, Brio Technology,
Inc.'s BrioQuery, Planning Sciences, Inc.'s Gentium, and Andyne's Pablo. Vendors are
rearchitecting their products to give users greater control over the tradeoff between flexibility
and performance that is inherent in OLAP tools. Many vendors are rewriting pieces of their
products in Java.

2.1.5 Data Mining Tools


Data mining tools provide insights into corporate data that are not easily discerned with managed
query or OLAP tools. Data mining tools use a variety of statistical and artificial intelligence (AI)
algorithms to analyze the correlation of variables in the data and to search out interesting patterns
and relationships to investigate. Data mining tools, such as IBM's Intelligent Miner, are expensive
and require statisticians to implement and manage. There are newer tools, such as DataMind
Corp.'s DataMind, Pilot's Discovery Server, and tools from Business Objects and SAS Institute,
which remove much of the complexity of the older tools. These tools offer simple user interfaces
that plug in directly to existing OLAP tools or databases and can be run directly against data
warehouses. For example, all end-user tools use metadata definitions to obtain access to data stored
in the warehouse, and some of these tools (e.g., OLAP tools) may employ additional or intermediary
data stores (e.g., data marts, multidimensional databases).

2.2. The Need for Applications


These tools are easy-to-use, point-and-click tools that either accept SQL or
generate SQL statements to query relational data stored in the warehouse. Some of these tools
and applications can format the retrieved data into easy-to-read reports, while others concentrate
on the on-screen presentation. These tools are the preferred choice of the users of business
applications such as segment identification, demographic analysis, territory management and
customer mailing lists. As the complexity of the questions grows, these tools may become
inefficient. The various types of access to the data stored in a data warehouse are:
 Simple tabular form reporting
 Ad hoc user-specified queries
 Predefined repeatable queries
 Complex queries with multitable joins, multilevel subqueries, and
sophisticated search criteria.
 Ranking
 Multivariable analysis
 Time series analysis
 Data visualization, graphing, charting, and pivoting
 Complex textual search
 Statistical analysis
 AI techniques for testing of hypothesis, trends discovery, definition, and
validation of data clusters and segments.
 Information mapping
 Interactive drill-down reporting and analysis

The first four types of access are covered by the combined category of tools called query and
reporting tools. Three distinct types of reporting are identified.
1. Creation and viewing of standard reports – routine delivery of reports based on
predetermined measures.
2. Definition and creation of ad hoc reports – allows managers and business users to
quickly create their own reports and get quick answers to business questions.
3. Data exploration – users can easily "surf" through data without a preset path to
quickly uncover business trends or problems.
When requirements go beyond these, applications are needed; such applications often take the form
of custom-developed screens and reports that retrieve frequently used data and format it in a
predefined, standardized way.

2.3. Cognos Impromptu


2.3.1 Overview
Impromptu is an interactive database reporting tool. It allows users to query data without
programming knowledge. When using the Impromptu tool, no data is written or changed in the
database; the tool is only capable of reading the data.
Impromptu, from Cognos Corporation, is a tool for interactive database reporting that delivers 1- to
1000+ seat scalability. Impromptu's object-oriented architecture ensures control and administrative
consistency across all users and reports. Users access Impromptu through its easy-to-use graphical
user interface. Impromptu offers a fast and robust implementation at the enterprise level, and features
full administrative control, ease of deployment, and low cost of ownership. It can support both
enterprise database reporting and single-user reporting on personal data.

2.3.2 The Impromptu Information Catalog


Impromptu stores metadata in subject-related folders. This metadata is used to develop queries for
reports. The metadata set is stored in a file called a catalog. The catalog does not contain any data;
it just contains information about connecting to the database and the fields that will be accessible
for reports.

A catalog contains:

 Folders - Meaningful groups of information representing columns from one or more tables
 Columns - Individual data elements that can appear in one or more folders
 Calculations - Expressions used to compute required values from existing data
 Conditions - Used to filter information so that only a certain type of information is
displayed
 Prompts - Pre-defined selection criteria prompts that users can include in reports they
create
 Other components, such as metadata, a logical database name, join information and user
classes
Impromptu reporting begins with the information catalog, a LAN based repository
(Storage area) of business knowledge and data access rules. The catalog insulates users from
such technical aspects of the database as SQL syntax, table joins and hidden table and field
names.

Creating a catalog is a relatively simple task, so an Impromptu administrator can be anyone who is
familiar with basic database query functions. The catalog presents the database in a way that reflects
how the business is organized, and uses the terminology of the business. Impromptu administrators
are free to organize database items such as tables and fields into Impromptu subject-oriented folders,
subfolders and columns. This enables business-relevant reporting through business rules, which can
consist of shared calculations, filters and ranges for critical success factors.

Use of catalogs
 view, run, and print reports
 export reports to other applications
 disconnect from and connect to the database
 create reports
 change the contents of the catalog
 add user classes

2.3.3 Object-oriented architecture


Impromptu's object-oriented architecture drives inheritance-based administration and
distributed catalogs. Impromptu implements management functionality through the use of
governors. The governors allow administrators to control the enterprise's reporting environment.
Some of the activities and processes that governors can control are
 Query activity
 Processing location
 Database connections
 Reporting permissions
 User profiles
 Client/server balancing
 Database transactions
 Security by value
 Field and table security
2.3.4. Reporting
Impromptu is designed to make it easy for users to build and run their own reports. With
ReportWise templates and HeadStarts, users simply apply data to Impromptu to produce reports
rapidly. Impromptu's predefined ReportWise templates include templates for mailing labels,
invoices, sales reports, and directories. These templates are complete with formatting, logic,
calculations, and custom automation. The templates are database-independent; therefore, users
simply map their data onto the existing placeholders to quickly create reports. Impromptu also
provides users with a variety of page and screen formats, known as HeadStarts.
Impromptu offers special reporting options that increase the value of distributed standard
reports.
Picklists and prompts: Organizations can create standard Impromptu reports for which users can
select from lists of values called picklists. Picklists and prompts make a single report flexible
enough to serve many users.
Custom templates: Standard report templates with global calculations and business rules can be
created once and then distributed to users of different databases. A template's standard logic,
calculations and layout complete the report automatically in the user's choice of format.
Exception reporting: Exception reporting is the ability to have reports highlight values that lie
outside accepted ranges. Impromptu offers three types of exception reporting (a small illustrative
sketch follows this list).
• Conditional filters — retrieve only those values that are outside defined thresholds, or
define ranges to organize data for quick evaluation.
• Conditional highlighting — create rules for formatting data on the basis of data values.
• Conditional display — display report objects only under certain conditions.
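The conditional-filter and conditional-highlight ideas can be sketched in a few lines of Python with
pandas (not Impromptu itself); the column names and the 5000 threshold are assumed business rules
used only for illustration.

    import pandas as pd

    # Hypothetical sales figures; the 5000 threshold is an assumed business rule.
    report = pd.DataFrame({"region": ["North", "South", "East", "West"],
                           "sales":  [7200, 3100, 5600, 1900]})

    # Conditional filter: keep only rows that fall outside the accepted range.
    exceptions = report[report["sales"] < 5000]

    # Conditional highlighting: add a flag column in place of colour formatting.
    report["flag"] = report["sales"].apply(lambda s: "LOW" if s < 5000 else "")

    print(exceptions)
    print(report)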

Interactive reporting: Impromptu unifies querying and reporting in a single interface. Users can
perform both these tasks by interacting with live data in one integrated module.
Frames: Impromptu offers an interesting frame-based reporting style. Frames are building blocks
that may be used to produce reports formatted with fonts, borders, colors, shading, etc. Frames, or
combinations of frames, simplify building even complex reports. The data formats itself according
to the type of frame selected by the user.
 List frames are used to display detailed information.
 Form frames offer layout and design flexibility.
 Cross-tab frames are used to show the totals of summarized data at selected intersections.
 Chart frames make it easy for users to see their business data in 2-D and 3-D displays
using line, bar, ribbon, area and pie charts.
 Text frames allow users to add descriptive text to reports and display binary large
objects (BLOBs) such as product descriptions or contracts.
 Picture frames incorporate bitmaps into reports or specific records, perfect for
visually enhancing reports.
 OLE frames make it possible for users to insert any OLE object into a report.
2.3.5. Impromptu Request Server
Impromptu introduced a new request server, which allows clients to off-load query processing to
the server. A PC user can now schedule a request to run on the server, and an Impromptu request
server will execute the request, generating the result on the server. When done, the scheduler
notifies the user, who can then access, view or print at will from the PC. The Impromptu request
server runs on HP-UX 9.x, IBM AIX 4.x and Sun Solaris 2.4. It supports data maintained in
ORACLE 7.x and SYBASE System 10/11.
2.3.6. Supported databases
Impromptu provides native database support for ORACLE, Microsoft SQL Server, SYBASE SQL
Server, Omni SQL Gateway, SYBASE Net Gateway, MDI DB2 Gateway, Informix, CA-Ingres,
Gupta SQLBase, Borland InterBase, Btrieve, dBASE, Paradox, and ODBC access to any database
with an ODBC driver.

2.3.7. Impromptu features


The various features of Impromptu are
 Unified query and reporting interface
 Object-oriented architecture
 Complete integration with power play
 Scalability
 Security and control
 Data presented in business content
 Over 70 predefined report templates
 Frame-based reporting
 Business-relevant reporting
 Database-independent catalogs
2.3.8. Applications
Organizations use a familiar application development approach to build a query
and reporting environment for the data warehouse. There are several reasons for doing
this:
 A legacy DSS is still being used and its reporting facilities appear adequate.
 An organization has made a large investment in a particular application
development environment and has a sufficient number of well-trained
developers to provide the required query and reporting applications.
 A new tool may require an additional investment in developer skill sets,
software, and infrastructure.
 A particular reporting requirement may be too complicated for an
available reporting tool to handle.
The entire development paradigm is shifting from procedural to object-based.
The market for effective, portable, easy-to-learn, full-featured, graphical development
tools is very competitive.

2.3.9. PowerBuilder
PowerBuilder supports object-oriented application development, including encapsulation,
polymorphism, inheritance and GUI objects. Once an object is created and tested, it can be reused
by other applications. The strength of PowerBuilder lies in developing Windows applications for
client/server architectures. PowerBuilder offers a fourth-generation language, an object-oriented
graphical development environment, and the ability to interface with a wide variety of database
management systems.

2.3.10. Object orientation


Object orientation supports many object-oriented features such as inheritance, data abstraction,
encapsulation and polymorphism. Inheritance allows developers to change attributes of child
classes by modifying these attributes in the parent class of objects. Data abstraction is the
encapsulation of properties and behavior within the object. Polymorphism allows one message to
invoke an appropriate but different behavior when sent to different objects. PowerBuilder also
supports execution of SQL commands at run time.
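These object-oriented ideas are language-independent; the short Python sketch below (not
PowerScript) illustrates inheritance and polymorphism in the sense just described, using invented
class names.

    class Window:                        # parent class
        title = "Untitled"
        def on_click(self):
            return "default click behaviour"

    class ReportWindow(Window):          # child inherits attributes from the parent
        def on_click(self):              # polymorphism: same message, different behaviour
            return "refresh the report"

    for w in (Window(), ReportWindow()):
        print(w.title, "->", w.on_click())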
2.3.11. Windows facilities
PowerBuilder provides a powerful Windows-based environment and supports key Windows
facilities. These include dynamic data exchange (DDE), dynamic link libraries (DLL), object
linking and embedding (OLE) and the multiple document interface (MDI).
2.3.12. Features
The PowerBuilder windows and controls can contain program scripts that execute in response to
different events that can be detected by PowerBuilder. The scripting language, PowerScript, is a
high-level, object-oriented, event-driven programming language similar to Visual Basic.
PowerBuilder controls include buttons, radio buttons, push buttons, list boxes, check boxes, combo
boxes, text fields, menus, edit fields, and pictures, and events such as clicked and double-clicked
are supported. Client/server applications can be constructed using the PowerBuilder painters
described below.
Application Painter
This utility first identifies the basic details and components of new or existing applications.
Maintaining an existing application is simple: double-clicking on the application icon displays a
hierarchical view of the application structure, and all levels can be expanded or contracted with a
click of the right mouse button. The Application Painter allows creation and naming of new
applications, selection of an application icon, setting of the library search path, and definition of
default text characteristics. It is also used to run or debug the application.
Window Painter
The Window Painter is used to create and maintain the majority of PowerBuilder window objects.
A window has several attributes such as title, position, size, color and font. Its operations are
performed by drag-and-drop and mouse clicks in a graphical fashion.

DataWindows Painter
These are dynamic objects that provide access to databases and other data sources such as ASCII
files. PowerBuilder applications use DataWindows to connect to multiple databases and files, as
well as to import and export data in a variety of formats such as dBASE, Excel, Lotus and
tab-delimited text. DataWindows support execution of stored procedures and allow developers to
select from a number of presentation styles (tabular, grid, label, and free form). They also allow a
user-specified number of rows to be displayed in a display line.

QueryPainter
This is used to generate SQL statements that can be stored in PowerBuilder libraries. Thus, using
the Application Painter, Window Painter, and DataWindows Painter facilities, a simple
client/server application can be constructed literally in minutes. A rich set of SQL functions is
supported, including CONNECT/DISCONNECT, DECLARE, OPEN and CLOSE cursor, FETCH,
and COMMIT/ROLLBACK. PowerBuilder supplies several other painters.
Database Painter
This painter allows developers to pick tables from a list box and examine and edit join conditions
and predicates, key fields, extended attributes, display formats and other database attributes.
Structure Painter- This painter allows creation and modification of data structures and groups
of related data elements.
Preference Painter – This is a configuration tool that is used to examine and modify
configuration parameters for the PowerBuilder development environment.
Menu Painter – This painter creates menus for the individual windows and the entire
application.
Function Painter – This is a development tool that assists developers in creating function calls
and parameters using combo boxes.
Library Painter – This painter manages the library in which the application components reside.
It also handles check-in and check-out of library objects for developers.
User object Painter – This painter allows developers to create custom controls. These custom
controls can be treated just like standard PowerBuilder controls.
Help Painter – This is a built-in help system, similar to the MS Windows Help facility.
2.3.13. Forté
In a three-tiered client/server computing architecture, an application's functionality is partitioned
into three distinct pieces: presentation logic with its GUI, application business logic, and data
access functions. The presentation logic is placed on a client, while the application logic resides on
an application server, and the data access logic and the database reside on a database or a data
warehouse server.
Application partitioning:
Forté allows developers to build a logical application that is independent of the underlying
environment. Developers build an application as if it were to run entirely on a single machine;
Forté automatically splits the application apart to run across the clients and servers that constitute
the deployment environment, and supports tunable application partitioning.
Shared-application services:
With Forté, developers build a high-end application as a collection of application components.
The components can include client functionality such as data presentation and other desktop
processing. Shared-application services form the basis for a three-tiered application architecture
in which clients request actions from application services that, in turn, access one or more of the
underlying data sources. Each tier can be developed and maintained independently of the others.
Business events:
Business events automate the notification of significant business occurrences so that appropriate
actions can be taken immediately by users. Forté detects events whether they originate on a user's
desktop or in an application service, and sends notification to all the application components that
have expressed interest in that event. Forté consists of three functional components.
Application Development Facility (ADF) - a distributed object computing framework used to
define user interfaces and application logic. It includes a GUI designer for building user screens
and a proprietary 4GL called the Transactional Object-Oriented Language (TOOL).
System Generation Facility (SGF) - assists developers in partitioning the application and
generating executables for distribution. Forté's most powerful feature is its ability to automate
partitioning of the application into client and server components; the SGF automatically puts
processes on the appropriate device on the basis of the application's logic and platform inventory.

Distributed Execution Facility (DEF) - This provides tools for managing applications at
runtime, including system administration support, a distributed object manager to handle
communications between application partitions, and a performance monitor.

Web and Java integration - Release 3.0 provides integration with Java, desktop, and mainframe
platforms. It includes Java and ActiveX integration with ActiveX server support; Forté servers can
be called from OLE, Forté application servers can be called from C++ modules, and there is an
option to generate and compile C++ code for client modules. A 4GL Profiler provides detailed
data on an application's performance.

Portability and supported platforms - Forté provides transparent portability across the most
common client/server platforms for both development and deployment. Forté masks the
differences while preserving the native look and feel of each environment, and any set of supported
platforms can be used for deployment. Server/host platforms include Data General AViiON,
Digital Alpha, OpenVMS, UNIX, HP 9000, IBM RS/6000, Sun SPARC, and Windows NT.
Desktop GUI support includes Macintosh, Motif, and Windows.

2.3.14 Information Builders - The products from Information Builders are Cactus and FOCUS
Fusion.

 Cactus
Cactus is a new second-generation, enterprise-class client/server development environment. Cactus
lets developers create, test and deploy business applications spanning the Internet. It is a
three-tiered development environment that enables creation of applications of any size and scope.
It builds highly reusable components for distributed enterprise-class applications through a visual,
object-based development environment, and provides access to a wealth of ActiveX, VBX, and
OLE controls.

Web-enabled access: Cactus offers full application development for the Web with no prior
knowledge of HTML, Java or complex 3GLs. Developers can build traditional PC-based front
ends or industry-standard Web applications, all from one toolbox, and can focus on the business
problem rather than the underlying technology.

Components and features


Cactus Workbench – the front-end interface that provides access to the tool suite via iconic
toolbars, push buttons, and menus.

Application Manager – an integrated application repository that manages the data access,
business logic and presentation components created during development.

Partition Manager– a component that allows developers to drag locally developed procedures
and drop them on different Cactus servers anywhere in the enterprise.

Object browser– offers developers direct access to any portion of a multi-tiered application.

Maintain – the proprietary language of cactus.

File painter – used to build the database access objects.

Application packager – used at deployment

EDA/Client – the "message layer" for tier-to-tier communications.

Cactus Servers– the targets of the partitioned applications.

Cactus OCX– an OLE Custom Control that allows any cactus procedure to be called by a third
party application.

 FOCUS Fusion
FOCUS Fusion, a tool from Information Builders, is a multidimensional database technology for
OLAP and data warehousing.

Focus Fusion provides the following features:

Fast query and reporting – its advanced indexing, parallel query and roll-up facilities provide high
performance for reports, queries and analyses.

Comprehensive, graphics-based administration facilities, which make Fusion database applications
easy to build and quick to deploy.

Integrated copy management facilities, which schedule automatic data refresh from any source
into Fusion.

Open access via industry-standard protocols, such as ANSI SQL, ODBC, and HTTP via EDA/SQL,
so that Fusion works with hundreds of desktop tools including World Wide Web browsers.

2.3. OLAP
OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension tables)
to enable multidimensional viewing, analysis and querying of large amounts of data. For example,
OLAP technology can provide management with fast answers to complex queries on their
operational data, or enable them to analyze their company's historical data for trends and patterns.
Online Analytical Processing (OLAP) applications and tools are those that are designed to ask
complex queries of large multidimensional collections of data; this is why OLAP is usually
accompanied by data warehousing.

2.3.1 Need of OLAP


Solving modern business problems such as market analysis and financial forecasting requires
query-centric database schemas that are array-oriented and multidimensional in nature. These
business problems are characterized by the need to retrieve large numbers of records from very
large data sets and summarize them on the fly. The multidimensional nature of the problems it is
designed to address is the key driver for OLAP. These problems are characterized by retrieving a
very large number of records, which can reach gigabytes and terabytes, and summarizing this data
into a form of information that can be used by business analysts.
One limitation of SQL is that it cannot easily represent these complex problems. A single
analytical query may be translated into several SQL statements. These SQL statements will involve
multiple joins, intermediate tables, sorting, aggregations and a huge amount of temporary memory
to store these tables. Such procedures require a great deal of computation and therefore a long time
to complete. The second limitation of SQL is its inability to use mathematical models in these SQL
statements. Even if an analyst could express these complex problems using SQL statements, a large
number of computations and a huge amount of memory would still be needed. Therefore the use of
OLAP is preferable for solving this kind of problem. OLAP is a continuous, iterative, and
preferably interactive process.
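As a rough illustration of why one analytical question turns into several SQL statements, the sketch
below answers "total units by region, by quarter, and overall" against a flat sales table using
Python's sqlite3; the table, columns and data are invented.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (region TEXT, quarter TEXT, product TEXT, units INTEGER)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                    [("North", "Q1", "TV", 10), ("North", "Q2", "TV", 15),
                     ("South", "Q1", "PC", 7),  ("South", "Q2", "PC", 12)])

    # One business question needs several separate aggregation statements in plain SQL.
    by_region  = con.execute("SELECT region, SUM(units) FROM sales GROUP BY region").fetchall()
    by_quarter = con.execute("SELECT quarter, SUM(units) FROM sales GROUP BY quarter").fetchall()
    overall    = con.execute("SELECT SUM(units) FROM sales").fetchall()

    print(by_region, by_quarter, overall)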

2.3.2 The Multidimensional data model


The multidimensional data model is an integral part of On-Line Analytical Processing, or
OLAP. Because OLAP is on-line, it must provide answers quickly; analysts pose iterative
queries during interactive sessions, not in batch jobs that run overnight. And because OLAP is
also analytic, the queries are complex. The multidimensional data model is designed to solve
complex queries in real time.

The multidimensional data model can be viewed as a cube, as shown in Figure 2.2. The table at
the left contains detailed sales data by product, market and time. The cube on the right associates
the sales numbers (units sold) with the dimensions product type, market and time, with the unit
variables organized as cells in an array.

This cube can be expanded to include another array, price, which can be associated with all or only
some dimensions. The cube supports matrix arithmetic that allows the cube to present the dollar
sales array simply by performing a single matrix operation on all cells of the array (dollar sales =
units * price). The response time of a multidimensional query depends on how many cells have to
be added on the fly. The caveat here is that, as the number of dimensions increases, the number of
cube cells increases exponentially. On the other hand, the majority of multidimensional queries deal
with summarized, high-level data. Therefore, the solution to building an efficient multidimensional
database is to pre-aggregate all logical subtotals and totals along all dimensions. This aggregation is
especially valuable since typical dimensions are hierarchical in nature; e.g., the TIME dimension
may contain hierarchies for years, quarters, months, weeks and days, and GEOGRAPHY may
contain country, state, city, etc.
Another way to reduce the size of the cube is to properly handle sparse data. Often, not every cell
has a meaning across all dimensions (many marketing databases may have more than 95 percent of
all cells empty or containing 0). Another kind of sparse data is created when many cells contain
duplicate data (i.e., if the cube contains a PRICE dimension, the same price may apply to all
markets and all quarters for the year). The ability of a multidimensional database to skip empty or
repetitive cells can greatly reduce the size of the cube and the amount of processing.
Dimensional hierarchy, sparse data management, and pre-aggregation are the keys, since they can
significantly reduce the size of the database and the need to calculate values. Such a design obviates
the need for multitable joins and provides quick and direct access to the arrays of answers, thus
significantly speeding up execution of multidimensional queries.
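The "dollar sales = units * price" computation mentioned above can be sketched as a single
element-wise operation over the whole cube; the dimension sizes and prices below are arbitrary, and
NumPy stands in for the multidimensional engine.

    import numpy as np

    # A tiny cube: 2 products x 3 markets x 4 time periods (sizes are arbitrary).
    units = np.random.randint(0, 50, size=(2, 3, 4))
    price = np.array([10.0, 25.0]).reshape(2, 1, 1)   # one assumed price per product

    dollar_sales = units * price   # one array operation applied to every cell of the cube
    print(dollar_sales.shape)      # (2, 3, 4)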

Figure 2.2 Relational tables and multidimensional tables.

Figure 2.3 Drill down and drill up

In this cube we can observe that each side of the cube represents one of the elements of the
question: the x-axis represents time, the y-axis represents the products and the z-axis represents
the different centers. The cells of the cube represent the number of products sold, or they can
represent the price of the items. Figure 2.3 also gives a different understanding of the drill-down
and drill-up operations. The dimensions represented need not be directly related to one another.
As the size of the dimensions increases, the size of the cube increases exponentially, and the
response time of a query against the cube depends on the size of the cube.

Operations in the Multidimensional Data Model (illustrated in the sketch below):

 Aggregation (roll-up)
 dimension reduction: e.g., total sales by city
 summarization over an aggregate hierarchy: e.g., total sales by city and year -> total
sales by region and by year
 Selection (slice) defines a subcube
 e.g., sales where city = Palo Alto and date = 1/15/96
 Navigation to detailed data (drill-down)
 e.g., (sales - expense) by city, top 3% of cities by average income
 Visualization operations (e.g., pivot or dice)
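A compact sketch of these operations using pandas; the data, column names and values are
invented for illustration.

    import pandas as pd

    sales = pd.DataFrame({
        "city":    ["Palo Alto", "Palo Alto", "Chennai", "Chennai"],
        "region":  ["West", "West", "South", "South"],
        "year":    [1996, 1997, 1996, 1997],
        "product": ["TV", "PC", "TV", "PC"],
        "units":   [5, 8, 3, 9],
    })

    rollup = sales.groupby(["region", "year"])["units"].sum()                  # roll-up
    slice_ = sales[(sales["city"] == "Palo Alto") & (sales["year"] == 1996)]   # slice
    pivot  = sales.pivot_table(values="units", index="product",
                               columns="year", aggfunc="sum")                  # pivot

    print(rollup, slice_, pivot, sep="\n\n")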

2.4. OLAP Guidelines

Dr. E. F. Codd, the "father" of the relational model, created a list of rules for OLAP systems.
Users should prioritize these rules according to their needs to match their business requirements.
These rules are:
1) Multidimensional conceptual view: The OLAP system should provide an appropriate
multidimensional business model that suits the business problems and requirements.
2) Transparency: The OLAP system's technology, the underlying database and computing
architecture, and the heterogeneity of input data sources should be transparent to users,
to preserve their productivity and proficiency with familiar front-end environments and
tools.
3) Accessibility: The OLAP tool should access only the data actually required for the
analysis. Additionally, the system should be able to access data from all heterogeneous
enterprise data sources required for the analysis.
4) Consistent reporting performance: As the number of dimensions and the size of the
database increase, users should not perceive any significant degradation in performance.
5) Client/server architecture: The OLAP tool should use a client/server architecture to
ensure better performance, adaptability, interoperability, and flexibility.
6) Generic dimensionality: Every data dimension should be equivalent in both its structure
and its operational capabilities.
7) Dynamic sparse matrix handling: The OLAP tool should be able to manage the sparse
matrix and so maintain the level of performance.
8) Multi-user support: The OLAP tool should allow several users to work concurrently on
the same model.
9) Unrestricted cross-dimensional operations: The OLAP tool should be able to recognize
dimensional hierarchies and automatically perform operations across the dimensions of
the cube.
10) Intuitive data manipulation: Consolidation path re-orientation, drilling down across
columns or rows, zooming out, and other manipulations inherent in the consolidation
path outline should be accomplished via direct action upon the cells of the analytical
model, and should require neither the use of a menu nor multiple trips across the user
interface.
11) Flexible reporting: The ability of the tool to present rows and columns in a manner
suitable for analysis.
12) Unlimited dimensions and aggregation levels: This depends on the kind of business,
where multiple dimensions and hierarchies can be defined. The OLAP system should
not impose any artificial restrictions on the number of dimensions or aggregation levels.

In addition to these guidelines an OLAP system should also support:

 Comprehensive database management tools: these give administrators the database
management capability to control distributed business data.
 The ability to drill down to the detail (source record) level: this requires that the
OLAP tool allow drilling through to the detail record level of the source relational
databases.
 Incremental database refresh: the OLAP tool should support incremental (partial)
refresh, since a full refresh presents an operations and usability problem as the size of
the database increases.
 Structured Query Language (SQL) interface: the OLAP system should integrate
seamlessly into the surrounding enterprise environment.

2.5. Multidimensional versus Multirelational OLAP


These relational implementations of multidimensional database systems are sometimes
referred to as multirelational database systems. To achieve the required speed, these products use
the star or snowflake schema – specially optimized and denormalized data models that involve
data restructuring and aggregation. (The snowflake schema is an extension of the star schema in
which one or more of the dimension tables are normalized into additional related tables.)
One benefit of the star schema approach is reduced complexity in the data model, which
increases data "legibility", making it easier for users to pose business questions of an OLAP nature.
Data warehouse queries can be answered up to 10 times faster because of improved navigation.
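A minimal star schema sketch, with one fact table joined to denormalized dimension tables, is
shown below using Python's sqlite3; the table and column names are hypothetical and kept
deliberately small.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_time    (time_key    INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
    CREATE TABLE dim_market  (market_key  INTEGER PRIMARY KEY, city TEXT, region TEXT);

    CREATE TABLE sales_fact (
        product_key INTEGER REFERENCES dim_product(product_key),
        time_key    INTEGER REFERENCES dim_time(time_key),
        market_key  INTEGER REFERENCES dim_market(market_key),
        units       INTEGER,
        dollars     REAL
    );
    """)
    print("star schema created")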

2.6. Categories of OLAP Tools


OLAP tools are based on the concepts of multidimensional databases and allow a
sophisticated user to analyze the data using elaborate, multidimensional, complex views. Typical
business applications for these tools include product performance and profitability, effectiveness
of a sales program or a marketing campaign, sales forecasting, and capacity planning. These
tools assume that the data is organized in a multidimensional model which is supported by a
special multidimensional database or by a relational database designed to enable
multidimensional properties. The capabilities of these two classes of OLAP tools are compared in
the sections that follow.

2.6.1. MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional
cube. The storage is not in the relational database, but in proprietary formats. That is, the data
structures use array-based technology and, in most cases, provide improved storage techniques to
minimize the disk space requirements through sparse data management. This architecture enables
excellent performance when the data is utilized as designed, and predictable application response
time for applications addressing a narrow breadth of data for a specific DSS requirement. In
addition, some products treat time as a special dimension (e.g., Pilot Software's Analysis Server),
enhancing their ability to perform time series analysis, and other products provide strong analytical
capabilities (e.g., Oracle's Express Server) built into the database.
Applications requiring iterative and comprehensive time series analysis of trends are well suited
for MOLAP technology (e.g., financial analysis and budgeting). Examples include Arbor
Software's Essbase, Oracle's Express Server, Pilot Software's Lightship Server and Kenan
Technology's Multiway.
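A very simplified view of sparse data management is shown below: only non-empty cells are stored,
keyed by their coordinates. Real MOLAP engines use far more sophisticated structures; the product,
market and quarter labels are invented.

    # Store only the cells that actually hold a value; empty cells are simply absent.
    cube = {
        ("TV", "North", "Q1"): 120,
        ("PC", "South", "Q2"): 45,
    }   # every other (product, market, quarter) combination is implicitly empty

    def cell(product, market, quarter):
        return cube.get((product, market, quarter), 0)   # empty cells read as 0

    print(cell("TV", "North", "Q1"), cell("TV", "South", "Q4"))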
Advantages:
 Excellent performance: MOLAP cubes are built for fast data retrieval, and are
optimal for slicing and dicing operations.
 Can perform complex calculations: All calculations have been pre-generated when
the cube is created. Hence, complex calculations are not only doable, but they
return quickly.
Disadvantages:
 Limited in the amount of data it can handle: Because all calculations are performed
when the cube is built, it is not possible to include a large amount of data in the
cube itself. This is not to say that the data in the cube cannot be derived from a
large amount of data. Indeed, this is possible. But in this case, only summary-level
information will be included in the cube itself.

 Requires additional investment: Cube technologies are often proprietary and may not
already exist in the organization. Therefore, to adopt MOLAP technology, additional
investments in human and capital resources are often needed.

Figure 2.3 MOLAP architecture

To address this issue, some vendors significantly enhanced their reach-through capabilities. These
hybrid solutions have as their primary characteristic the integration of specialized multidimensional
data storage with RDBMS technology, providing users with a facility that tightly "couples" the
multidimensional data structures (MDDSs) with data maintained in an RDBMS. This allows the
MDDSs to dynamically obtain detail data maintained in an RDBMS (as shown in Figure 2.3) when
the application reaches the bottom of the multidimensional cells during drill-down analysis.
both worlds, MOLAP and ROLAP. This approach can be very useful for organizations with
performance-sensitive multidimensional analysis requirements and that have built, or are in the
process of building, a data warehouse architecture that contains multiple subject areas. An
example would be the creation of sales data measured by several dimensions to be stored and
maintained in a persistent structure. This structure would be provided to reduce the application
overhead of performing calculations and building aggregations during application initialization.
These structures can be automatically refreshed at predetermined intervals established by an
administrator.
2.6.2 ROLAP
This segment constitutes the fastest-growing style of OLAP technology, with new vendors entering
the market at an accelerating pace. Products in this group have been engineered from the beginning
to support RDBMS products directly through a dictionary layer of metadata, bypassing any
requirement for creating a static multidimensional data structure (shown in Figure 2.5). This enables
multiple multidimensional views of the two-dimensional relational tables to be created without the
need to restructure the data around the desired view. This methodology relies on manipulating the
data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing
functionality; in essence, each action of slicing and dicing is equivalent to adding a "WHERE"
clause to the SQL statement, and the data remains stored in relational tables. While flexibility is an
attractive feature of ROLAP products, there are products in this segment that recommend, or
require, the use of highly denormalized database designs, which raises the design and performance
issues associated with the star schema.
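The point that each slice or dice simply adds predicates to a WHERE clause can be made concrete
with a toy SQL generator; the fact and dimension table names reuse the hypothetical star schema
sketched earlier and are assumptions, not any vendor's schema.

    def slice_and_dice(measures, filters):
        """Translate slice/dice selections into one SQL statement (illustrative only)."""
        where = " AND ".join(f"{dim} = '{member}'" for dim, member in filters.items())
        return (f"SELECT {', '.join(measures)} "
                f"FROM sales_fact "
                f"JOIN dim_time   USING (time_key) "
                f"JOIN dim_market USING (market_key) "
                f"WHERE {where}")

    # Slicing on year and dicing on region just add predicates to the WHERE clause.
    print(slice_and_dice(["SUM(units)"], {"year": "1996", "region": "West"}))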
The ROLAP tools are undergoing some technology realignment. This shift in technology emphasis
is coming in two forms. First is the movement toward pure middleware technology that provides
facilities to simplify development of multidimensional applications. Second, there continues to be
further blurring of the lines that delineate ROLAP and hybrid-OLAP products. Vendors of ROLAP
tools and RDBMS products look to provide an option to create multidimensional, persistent
structures, with facilities to assist in the administration of these structures. Examples include
MicroStrategy Intelligence Server and MetaCube (Informix/IBM).
Advantages:
 Can handle large amounts of data: The data size limitation of ROLAP technology is the
limitation on data size of the underlying relational database. In other words, ROLAP
itself places no limitation on data amount.
 Can leverage functionalities inherent in the relational database: Often, relational
database already comes with a host of functionalities. ROLAP technologies, since they
sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:
 Performance can be slow: Because each ROLAP report is essentially a SQL query
(or multiple SQL queries) against the relational database, the query time can be
long if the underlying data size is large.
 Limited by SQL functionalities: Because ROLAP technology mainly relies on
generating SQL statements to query the relational database, and SQL statements
do not fit all needs (for example, it is difficult to perform complex calculations
using SQL), ROLAP technologies are traditionally limited by what SQL can do.
ROLAP vendors have mitigated this risk by building out-of-the-box complex
functions into the tool, as well as the ability to allow users to define their own
functions.

Figure 2.5: ROLAP architecture


2.6.3 Hybrid OLAP / Managed query environment (MQE)

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For
summary-type information, HOLAP leverages cube technology for faster performance. It stores
only the indexes and aggregations in the multidimensional form, while the rest of the data is
stored in the relational database.
Examples: PowerPlay (Cognos), Brio, Microsoft Analysis Services, Oracle Advanced Analytic
Services.

This style of OLAP, which is beginning to see increased activity, provides users with the ability to
perform limited analysis, either directly against RDBMS products or by leveraging an intermediate
MOLAP server (shown in Figure 2.6). Some products have developed features to provide
"datacube" and "slice and dice" analysis capabilities. This is achieved by first developing a query
to select data from the DBMS, which then delivers the requested data to the desktop, where it is
placed into a datacube. The datacube can be stored and maintained locally on the desktop to reduce
the overhead required to create the structure each time the query is executed. Once the data is in
the datacube, users can perform multidimensional analysis. The tools can also work with MOLAP
servers, where the data from the relational DBMS is delivered to the MOLAP server and from
there to the desktop.
The simplicity of the installation and administration of such products makes them particularly
attractive to organizations looking to provide seasoned users with more sophisticated analysis
capabilities, without the significant cost and maintenance of more complex products. Even with all
the ease of installation and administration that accompanies the desktop OLAP products, most of
these tools require the datacube to be built and maintained on the desktop, with metadata
definitions that assist users in retrieving the correct set of data that makes up the datacube. The
need for each user to build a custom datacube, the lack of data consistency among users, and the
relatively small amount of data that can be efficiently maintained are significant drawbacks.
Examples: Cognos Software's PowerPlay, Andyne Software's Pablo, Dimensional Insight's
CrossTarget, and Speedware's Media.

Figure 2.6 Hybrid/MQE architecture

2.7 Categories of OLAP Tools

OLAP tools provide way to view the corporate data. The tools aggregate data along
common business subjects or dimensions and then let the users navigate through the hierarchies
and dimensions. Some tools such as Arbor Software Corp‘s Essbase pre-aggregate data in
special multidimensional database. Some other tools work directly against relational data and
aggregate data on the fly. Some tools process OLAP data on the desktop instead of server.
Desktop OLAP tools include Cognos‘ PowerPlay, Brio Technology and Andyne‘s Pablo. Many
of the differences between OLAP tools are fading. Vendors are rearchitecting their products to
give users greater control over the tradeoff between flexibility and performance that is inherent in
OLAP tools. Many vendors are rewriting pieces of their products in Java.

Database vendors eventually might be the largest OLAP providers; leading database vendors are
incorporating OLAP functionality in their database kernels. Examples of OLAP tools discussed
below are Cognos PowerPlay, IBI FOCUS Fusion, and Pilot Software.

2.9.4.1 Cognos Power Play

PowerPlay from Cognos is a mature and popular software tool for multidimensional analysis of
corporate data. It can be characterized as an MQE tool that can leverage corporate investment in
relational database technology to provide multidimensional access to enterprise data, at the same
time providing robustness, scalability, and administrative control. It is an open OLAP solution that
can interoperate with a wide variety of third-party software tools, databases and applications.

The Cognos PowerPlay client offers:

 Support for enterprise data sets of 20 million records, 100,000 categories, and 100
measures.
 A drill through capability for queries from Cognos Impromptu
 Powerful 3-D charting capabilities with background and rotation control for advanced
users
 Faster and easier ranking of data
 Unlimited undo levels and customizable toolbars.
 A "home" button that automatically resets the dimension line to the top level.
 Full support for OLE2 Automation, as both a client and a server.
 Linked displays that give users multiple views of the same data in a report.
 Complete integration with relational database security and data management features.

2.9.4.2 IBI FOCUS Fusion


FOCUS Fusion is a multidimensional database technology for OLAP and data
warehousing. It is designed to address business applications that require multidimensional
analysis of detail product data. It combines a parallel-enabled, high-performance,
multidimensional database engine with the administrative, copy management and access tools
necessary for a data warehouse solution. Fusion provides
 Fast query and reporting
 Comprehensive, graphics-based administration facilities that make Fusion
database applications easy to build and deploy.
 Three-tiered reporting architecture for high performance
 Scalability of OLAP applications from the department to the enterprise.
 Interoperability with the leading EIS, DSS, and OLAP tools.
 Support for parallel computing environments

FOCUS Fusion is a modular tool that supports flexible configurations for diverse needs, and
includes the following components:
 Fusion/Dbserver
 Fusion/Administrator
 Fusion/PDQ
 EDA/Link
 EDA/WebLink
 EDA Gateways
 Enterprise Copy Manager for Fusion

2.9.4.3 Pilot Software


Pilot Software offers the Pilot Decision Support Suite of tools, ranging from a high-speed
multidimensional database (MOLAP) and data warehouse integration (ROLAP) to data mining and
a diverse set of customizable business applications. The following products are at the core of Pilot
Software's offering:
Software‘s offering:
 Pilot Analysis Server: A full-function multidimensional database with high-speed
consolidation, Graphical user interface and expert-level interface.
 Pilot Link: A database connectivity tool that includes ODBC connectivity and high speed
connectivity via specialized drivers to the most popular relational database platforms.
 Pilot Designer: An application design environment specifically created to enable rapid
development of OLAP applications.
 Pilot Desktop: A collection of applications that allows the end user easy navigation and
visualization of the multidimensional database
 Pilot sales and Marketing Analysis Library
 Pilot Discovery Server
 Pilot Marketing Intelligence library
 Pilot Internet Publisher
Within their OLAP offering, these are some of the key features:
 Time intelligence
 Embedded data mining
 Multidimensional database compression
 Relational Integration

2.9 OLAP Tools and the Internet

The Internet/WWW and the data warehouse are tightly bound together. The reason for this trend is
simple: the compelling advantages of using the Web for access are magnified even further in a data
warehouse. Indeed:

 The Internet is a virtually free resource which provides a universal connectivity within
and between companies.

 The Web eases complex administrative tasks of managing distributed environments.

 The Web allows companies to store and manage both data and applications on servers that
can be centrally managed, maintained and updated.

For these and other reasons, the Web is a perfect medium for decision support. Let us look at the
general features of Web-enabled data access.

First-generation Web sites – web sites used a static distribution model, in which the
client can access the decision support report through static HTML pages via web browsers. In
this model, the decision support reports were stored as HTML documents and delivered to users
on request. Clearly, this model has some serious deficiencies, including inability to provide web
clients with interactive analytical capabilities such as drill-down.

Second-generation Web sites – Web sites support interactive database queries by utilizing a
multitiered architecture in which a Web client submits a query in the form of an HTML-encoded
request to a Web server, which in turn transforms the request for structured data into a CGI
(HTML gateway) script. The gateway submits SQL queries to the database, receives the results,
translates them into HTML, and sends the pages to the requester, as shown in Figure 2.7.
Requests for unstructured data can be sent directly to the unstructured data store.
Figure 2.7: Web processing Model
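The HTML-gateway step can be sketched as a single function that turns a query-string request into
SQL and renders the result as an HTML fragment; this is a pattern illustration in Python, not any
vendor's gateway, and the table and parameter names are invented.

    import sqlite3
    from urllib.parse import parse_qs

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (region TEXT, units INTEGER)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", [("North", 10), ("South", 7)])

    def gateway(query_string):
        """Translate an HTML-encoded request into SQL and return an HTML table."""
        region = parse_qs(query_string).get("region", ["North"])[0]
        rows = con.execute("SELECT region, SUM(units) FROM sales "
                           "WHERE region = ? GROUP BY region", (region,)).fetchall()
        cells = "".join(f"<tr><td>{r}</td><td>{u}</td></tr>" for r, u in rows)
        return f"<table>{cells}</table>"

    print(gateway("region=South"))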

Third-generation Web sites – Web sites replace HTML gateways with Web-based application
servers. These servers can download Java applets or ActiveX applications that execute on clients
or on Web-based application servers.

Vendors approaches for deploying tools on the web include

 HTML publishing
 Helper applications
 Plug-ins
 Server-centric components
 Java and ActiveX applications

2.10.1. Tools from Internet/Web implementations

Arbor Essbase Web

Essbase is one of the most ambitious of the early Web products. It includes not only OLAP
manipulations such as drill up, down, and across; pivot; slice and dice; and fixed and dynamic
reporting, but also data entry, including full multiuser concurrent write capabilities – a feature that
differentiates it from the others. Arbor does not have a client package whose sales might suffer
from sales of its Web gateway product, which makes sense from a business perspective. The Web
product does not replace the administrative and development modules, only user access for query
and update.

Information Advantage Web


Information Advantage uses a server-centric OLAP model: a powerful analytical engine generates
SQL to pull data from relational databases, manipulates the results, and transfers the result to a
client. Since all the intelligence of the product is in the server, implementing Web OLAP to
provide a Web-based client is straightforward. A client package is also provided, and the data
store and the analytical engine are separate.

Micro Strategy DSS Web

MicroStrategy's flagship product, DSS Agent, was originally a Windows-only tool, but
MicroStrategy has smoothly made the transition, first with an NT-based server product, and now
as one of the first OLAP tools to have a Web-access product. DSS Agent works in concert with the
complement of MicroStrategy's product suite – the DSS Server relational OLAP server, the DSS
Architect data modeling tool, and the DSS Executive design tool for building executive
information systems.

Brio Technology

Brio shipped a suite of new products called brio.web.warehouse. This suite implements several of
the approaches listed above for deploying decision support OLAP applications on the Web. The
key to Brio's strategy is a new server component called brio.query.server. The server works in
conjunction with Brio Enterprise and Brio's Web clients – brio.quickview and brio.insight – and
can off-load processing from the clients, thus enabling users to access Brio reports via Web
browsers. On the client side, Brio uses plug-ins to give users viewing and report manipulation
capabilities.
UNIT-II QUESTION BANK
PART A

1. Difference between OLAP and OLTP.


2. Classify OLAP tools.
3. What is meant by OLAP?
4. Difference between OLAP & OLTP
5. Define Concept Hierarchy.
6. List out the five categories of decision support tools.
7. List out any 5 OLAP guidelines.
8. Distinguish between multidimensional and multi-relational OLAP.
9. Define ROLAP.
10. Draw a neat diagram for the web processing model.
11. Define MQE.
12. Draw a neat sketch for the three-tiered client/server architecture.
13. List out the applications that the organizations uses to build a query and reporting
environment for the data warehouse.
14. Distinguish between window painter and data windows painter.
15. Define ADF, SGF and DEF.
16. What is the function of power play administrator?
17. What are the products of pilot software?
18. What are the FOCUS fusion components?

PART-B

1. Discuss the typical OLAP operations with an example.


2. List and discuss the basic features that are provided by reporting and query tools used for
business analysis.
3. Describe in detail about Cognos Impromptu
4. Explain about OLAP in detail.
5. With relevant examples discuss multidimensional online analytical processing and multi-
relational online analytical processing.
6. Discuss about the OLAP tools and the Internet
7. Explain Multidimensional Data model.
8. Discuss how computations can be performed efficiently on data cubes.
9. Write notes on MOLAP and ROLAP.
10. Describe in detail IBI FOCUS Fusion and Pilot Software.
UNIT III DATA MINING
Introduction – Data – Types of Data – Data Mining Functionalities – Interestingness of Patterns
–Classification of Data Mining Systems – Data Mining Task Primitives – Integration of a Data
Mining System with a Data Warehouse – Issues –Data Preprocessing.

3.1 Data Mining Introduction


Data mining is defined as extracting information from huge sets of data; in other words, it is
the procedure of mining knowledge from data. In addition to the extraction of information,
data mining involves other processes such as data cleaning, data integration, data
transformation, pattern evaluation, and data presentation.
Data mining, also called knowledge discovery in databases (KDD), is the process of
discovering interesting and useful patterns and relationships in large volumes of data. The field
combines tools from statistics and artificial intelligence (such as neural networks and machine
learning) with database management to analyze large digital collections, known as data sets.

3.1.1 Data Mining Applications


Knowledge extracted from the data mining can be used for following applications:
 Market Analysis.
 Fraud Detection.
 Customer Retention.
 Production Control.
 Science Exploration.

3.1.2 Sources of Data


There are different sources of data from which data can be extracted to perform data mining
operations.
Different sources of data are listed as follows:
 Relational Databases
 Data Warehouses
 Transactional Databases
 Advanced Data and Information Systems and
 Advanced Applications

3.2 Types of Data


There are different types of data from which information can be mined, and they are listed in
the following Figure 3.1.
Different types of data are listed as follows:
 Categorical Data
    Nominal Data
    Ordinal Data
 Numerical Data
    Discrete Data
    Continuous Data
    Interval Scaled Variable
    Ratio Scaled Variable

Figure 3.1 Types of Data

3.2.1 Categorical Data

Categorical data represents characteristics. Therefore it can represent things like a person‘s
gender, language etc. Categorical data can also take on numerical values (Example: 1 for female
and 0 for male). Note that those numbers don‘t have mathematical meaning.

3.2.1.1 Nominal Data

Nominal values represent discrete units and are used to label variables. A nominal value has
no quantitative meaning; it acts only as a label. Nominal data has no inherent order, so if the order
of its values is changed, the meaning does not change. Two examples of nominal features
are shown in Figure 3.2.
Figure 3.2 Nominal Features
The left feature, which describes a person's gender, would be called dichotomous, which is a
type of nominal scale that contains only two categories.

3.2.1.2 Ordinal Data


Ordinal values represent discrete and ordered units. Ordinal data is therefore nearly the same as
nominal data, except that its ordering matters. An example is shown in Figure 3.3.

Figure 3.3 Ordinal Data


Note that the difference between Elementary and High School need not be the same as the
difference between High School and College. This is the main limitation of ordinal data: the
differences between the values are not really known. Because of that, ordinal scales are usually
used to measure non-numeric features like happiness, customer satisfaction, and so on.

3.2.2 Numerical Data

Numerical data is data that is measurable, such as time, height, weight, amount, and so on.
Numerical data can be recognized by the fact that it can be averaged and ordered in either
ascending or descending order.

3.2.2.1 Discrete Data

Discrete data is data in which the values are distinct and separate; in other words, discrete
data can only take on certain values. This type of data cannot be measured but it can be
counted, and it basically represents information that can be categorized into a classification.
An example is the number of heads in 100 coin flips. To check whether you are dealing with
discrete data, ask two questions: can you count it, and can it be divided into smaller and
smaller parts? If it can be counted but not meaningfully subdivided, it is discrete.

3.2.2.2 Continuous Data

Continuous Data represents measurements and therefore their values can‘t be counted but
they can be measured. An example would be the height of a person, which you can describe
by using intervals on the real number line.

3.2.2.3 Interval Data

Interval values represent ordered units that have the same difference. Therefore we speak of
interval data when we have a variable that contains numeric values that are ordered and
where we know the exact differences between the values. An example would be a feature that
contains temperature of a given place like you can see below in Figure 3.4.

Figure 3.4 Interval Data


The problem with interval data is that it does not have a true zero. In terms of our example,
that means there is no such thing as "no temperature." With interval data we can add and
subtract, but we cannot multiply, divide, or calculate ratios. Because there is no true zero,
many descriptive and inferential statistics cannot be applied.

3.2.2.4 Ratio Data

Ratio values are also ordered units that have the same difference. Ratio values are the same as
interval values, with the difference that they do have an absolute zero. Good examples are
height, weight, length etc as shown in Figure 3.5.
Figure 3.5 Ratio Data
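As a quick illustration of the attribute types above, the following minimal Python sketch (using made-up attribute names and values) shows how nominal, ordinal, and ratio-scaled data are typically represented and which operations are meaningful for each.

```python
# Minimal sketch: representing the data types described above in Python.
# The attribute names and values are illustrative, not from any real data set.

# Nominal: labels with no order (any numeric coding is arbitrary)
gender_codes = {"female": 1, "male": 0}

# Ordinal: discrete values whose order matters, but not the distances between them
education_rank = {"Elementary": 1, "High School": 2, "College": 3, "Graduate": 4}

# Numerical (ratio-scaled): arithmetic such as averages and ratios is meaningful
heights_cm = [162.0, 175.5, 181.2, 168.4]
average_height = sum(heights_cm) / len(heights_cm)

print(gender_codes["female"])                                      # 1 -- the number carries no magnitude
print(education_rank["College"] > education_rank["High School"])   # True -- order is meaningful
print(round(average_height, 1))                                    # 171.8 -- means and ratios are meaningful
```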
3.3 Data Mining Functionalities

Data Mining is the process of extracting information from huge sets of data. Data mining
functionalities are used to specify the kind of patterns to be found in data mining tasks. In
general, data mining tasks can be classified into two categories:

 Descriptive Data Mining


 Predictive Data Mining

Descriptive Data Mining

Descriptive mining tasks characterize the general properties of the data in the database.
Descriptive analytics looks at data and analyzes past events for insight into how to approach
the future; it examines past performance and understands that performance by mining
historical data to look for the reasons behind past success or failure. Almost all management
reporting, such as sales, marketing, operations, and finance, uses this type of post-mortem
analysis. For example, descriptive analytics examines historical electricity usage data to help
plan power needs and allow electric companies to set optimal prices.

Predictive Data Mining

Predictive mining tasks perform inference on the current data in order to make
predictions. Prescriptive analytics automatically synthesizes big data, mathematical
sciences, business rules, and machine learning to make predictions and then suggests decision
options to take advantage of the predictions. Prescriptive analytics goes beyond predicting future
outcomes by also suggesting actions to benefit from the predictions and showing the decision
maker the implications of each decision option. Prescriptive analytics not only anticipates what
will happen and when it will happen, but also why it will happen.

Types of Data Mining Functionalities


Different types of Data mining functionalities used for extracting various patterns are listed as
follows.

 Data characterization and Data Discrimination


 Mining Frequent Patterns, Associations, and Correlations
 Classification
 Prediction
 Cluster Analysis
 Outlier Analysis
 Evolution Analysis

3.3.1 Data characterization and Data Discrimination

Data can be associated with classes or concepts. For example, in an electronic shop
named All Electronics store, classes of items for sale include computers and printers, and
concepts of customers include big Spenders and budget Spenders. Such descriptions of a class or
a concept are called class/concept descriptions. These descriptions can be derived via

 Data characterization
 Data discrimination
 Both data characterization and discrimination
Data characterization
It is a summarization of the general characteristics or features of a target class of data. For
example, to study the characteristics of software products whose sales increased by 10% in the
last year, the data related to such products can be collected by executing an SQL query. The
summarized data can then be presented in various output forms, such as charts, curves, and
characteristic rules.

Example: A data mining system should be able to produce a description summarizing the
characteristics of customers who spend more than $1,000 a year at All Electronics store.

Data discrimination

It is a comparison of the general features of target class data objects with the general features of
objects from one or a set of contrasting classes. The target and contrasting classes can be
specified by the user, and the corresponding data objects retrieved through database queries.
Example: The user may like to compare the general features of software products whose sales
increased by 10% in the last year with those whose sales decreased by at least 30% during the
same period.

A data mining system should be able to compare two groups of AllElectronics customers,
such as those who shop for computer products regularly (more than two times a month) versus
those who rarely shop for such products (i.e., less than three times a year). The resulting
description provides a general comparative profile of the customers, such as 80% of the
customers who frequently purchase computer products are between 20 and 40 years old and have
a university education, whereas 60% of the customers who infrequently buy such products are
either seniors or youths, and have no university degree. Drilling down on a dimension, such as
occupation, or adding new dimensions, such as income level, may help in finding even more
discriminative features between the two classes.

3.3.2 Mining Frequent Patterns, Associations, and Correlations

Frequent patterns, as the name suggests, are patterns that occur frequently in data. A
frequent item set typically refers to a set of items that frequently appear together in a
transactional data set, such as milk and bread. Frequent patterns are itemsets, subsequences, or
substructures that appear in a data set with frequency no less than a user-specified threshold. For
example, a set of items, such as milk and bread, that appear frequently together in a transaction
data set, is a frequent itemset.

Association rule mining is a procedure which is meant to find frequent patterns,


correlations, associations, or causal structures from data sets found in various kinds of databases
such as relational databases, transactional databases, and other forms of data repositories.
Given a set of transactions, association rule mining aims to find the rules which enable us to
predict the occurrence of a specific item based on the occurrences of the other items in the
transaction.

The term correlation refers to a mutual relationship or association between


quantities. In almost any business, it is useful to express one quantity in terms of its
relationship with others. For example, sales might increase when the marketing department
spends more on TV advertisements, or a customer's average purchase amount on an e-
commerce website might depend on a number of factors related to that customer.
Consider, for example, a set of transactions T and the rule Wine => Cheese. Support is the
percentage of transactions in T that contain both Wine and Cheese together (e.g., 9% of all
baskets contain these two items together).

support(A => B) = P(A U B)

Confidence is the percentage of transactions in T containing Wine that also contain Cheese; in
other words, it is the probability of having Cheese given that Wine is already in the basket
(e.g., 65% of all those who bought Wine also bought Cheese).

confidence(A => B) = P(B | A)

Association Rule examples:

buys(X, "computer") => buys(X, "software")   [support = 1%, confidence = 50%]

age(X, "20...29") ^ income(X, "20K...29K") => buys(X, "CD player")   [support = 2%, confidence = 60%]
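As a concrete illustration of support and confidence, the following minimal Python sketch computes both measures for a Wine => Cheese rule over a made-up toy transaction list (the numbers are illustrative and unrelated to the percentages quoted above).

```python
# Minimal sketch: computing support and confidence of the rule Wine => Cheese
# over a toy transaction database. The transactions are illustrative only.

transactions = [
    {"wine", "cheese", "bread"},
    {"wine", "cheese"},
    {"milk", "bread"},
    {"wine", "milk"},
    {"wine", "cheese", "milk"},
]

def support(itemset, data):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in data if itemset <= t) / len(data)

def confidence(antecedent, consequent, data):
    """P(consequent | antecedent) = support(A u B) / support(A)."""
    return support(antecedent | consequent, data) / support(antecedent, data)

print(support({"wine", "cheese"}, transactions))        # 0.6  -> support(Wine => Cheese)
print(confidence({"wine"}, {"cheese"}, transactions))   # 0.75 -> confidence(Wine => Cheese)
```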

3.3.3 Classification

Classification models predict categorical class labels. Classification is the process of finding
a best model that describes and distinguishes data classes or concepts, for the purpose of being
able to use the model to predict the class of objects whose class label is unknown. The class label
is usually the target variable in classification, which makes it special from other categorical
attributes. Derived model is based on the analysis of a set of training data. Some of the
commonly used classification algorithms are listed as follows.

 Decision Tree
 Neural Network.
 Bayesian classification.
 If then rules

Classification Model Design Process

Data classification is a two-step process, as shown for the loan application data of Figure 3.6 (a).
In the first step, a classifier is built describing a predetermined set of data classes or concepts.
This is the learning step (or training phase), where a classification algorithm builds the classifier
by analyzing or ―learning from‖ a training set made up of database tuples and their associated
class labels. A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2,. . . , xn),
depicting n measurements made on the tuple from n database attributes, respectively, A 1, A2,. . ,
An. Each tuple, X, is assumed to belong to a predefined class as determined by another database
attribute called the class label attribute. The class label attribute is discrete-valued and unordered.
It is categorical in that each value serves as a category or class.

In the second step, shown in Figure 3.6(b), the model is used for classification. First, the predictive
accuracy of the classifier is estimated. If we were to use the training set to measure the accuracy
of the classifier, this estimate would likely be optimistic, because the classifier tends to overfit
the data (i.e., during learning it may incorporate some particular anomalies of the training data
that are not present in the general data set overall). Therefore, a test set is used, made up of test
tuples and their associated class labels. These tuples are randomly selected from the general data
set. They are independent of the training tuples, meaning that they are not used to construct the
classifier.

Figure 3.6 (a) Classification Algorithm


Figure 3.6 (b) Classification Rules
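The two-step process above can be sketched in a few lines of code. The following example assumes the scikit-learn library is available and uses made-up loan-application tuples (features: age and income); it is only an illustration of the learning and classification steps, not a reference implementation.

```python
# Minimal sketch of the two-step classification process (assumes scikit-learn is installed).
# Features: [age, income]; class label: loan decision. The data values are illustrative only.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[25, 30000], [45, 80000], [35, 60000], [50, 20000],
     [23, 25000], [40, 90000], [60, 40000], [30, 70000]]
y = ["risky", "safe", "safe", "risky", "risky", "safe", "risky", "safe"]

# Step 1 (learning): build the classifier from a training set of labeled tuples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2 (classification): estimate accuracy on an independent test set, then use
# the model to predict the class of tuples whose label is unknown.
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("new applicant:", clf.predict([[28, 65000]]))
```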

3.3.4 Prediction

Prediction models continuous-valued functions. It is used to predict missing or unavailable
numerical data values rather than class labels. Prediction is similar to classification: first, a
model is constructed; second, the model is used to predict unknown values. The major method
for prediction is regression, which can be grouped into two main categories as mentioned
below.

 Linear and multiple regression


Linear regression involves finding the ―best‖ line to fit two attributes (or variables), so that
one attribute can be used to predict the other. Multiple linear regression is an extension of
linear regression, where more than two attributes are involved and the data are fit to a
multidimensional surface.
 Non-linear regression
Nonlinear regression is a form of regression analysis in which observational data are modeled
by a function which is a nonlinear combination of the model parameters and depends on one
or more independent variables. The data are fitted by a method of successive approximations.

Prediction differs from classification in that classification predicts categorical class labels,
whereas prediction models continuous-valued functions. A minimal regression sketch is given below.
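The sketch fits a simple least-squares line to two attributes and uses it to predict a continuous value; the data values are illustrative only.

```python
# Minimal sketch: simple linear regression (least squares) for numeric prediction.
# x might be years of experience and y a salary in thousands; values are illustrative.

x = [1, 2, 3, 4, 5, 6]
y = [30, 35, 42, 48, 55, 60]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope w1 and intercept w0 of the "best" fitting line y = w1 * x + w0
w1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
     sum((xi - mean_x) ** 2 for xi in x)
w0 = mean_y - w1 * mean_x

def predict(xi):
    return w1 * xi + w0

print(round(predict(7), 1))   # about 66.6 -- a predicted continuous value for an unseen x
```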

Data Mining Functionalities Predictive Modeling in Databases


Predictive modeling: Predict data values or construct generalized linear models based on
the database data. One can only predict value ranges or category distributions. Method outline:

 Minimal generalization
 Attribute relevance analysis
 Generalized linear model construction
 Prediction
The major factors that influence the prediction are determined through data relevance analysis
(uncertainty measurement, entropy analysis, expert judgment, etc.), and multi-level prediction
is supported through drill-down and roll-up analysis.

3.3.5 Cluster Analysis

Unlike classification and prediction, which analyze class-labeled data objects, clustering
analyzes data objects without consulting a known class label. The goal is to find groups of
objects such that the objects in a group are similar to one another and different from the
objects in other groups. Cluster analysis is used in many applications, such as market
research, pattern recognition, data analysis, and image processing.
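The following minimal Python sketch illustrates the basic idea with a tiny k-means loop on made-up 2-D points; real systems would use an optimized library implementation instead.

```python
# Minimal sketch: a tiny k-means clustering loop on 2-D points (illustrative data,
# chosen so that no cluster ever becomes empty).
import math

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids = [(1, 1), (5, 7)]   # initial guesses for k = 2 clusters

def closest(p, cents):
    return min(range(len(cents)), key=lambda i: math.dist(p, cents[i]))

for _ in range(10):                       # a few assignment / update iterations
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:                      # assignment step: nearest centroid wins
        clusters[closest(p, centroids)].append(p)
    centroids = [                         # update step: centroid = mean of its cluster
        (sum(px for px, _ in c) / len(c), sum(py for _, py in c) / len(c))
        for c in clusters.values()
    ]

print(centroids)   # two cluster centers, each summarizing a group of nearby points
```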

3.3.6 Outlier Analysis

A database may contain data objects that do not comply with the general behavior or
model of the data. These data objects are outliers. Most data mining methods discard outliers as
noise or exceptions. However, in some applications such as fraud detection, the rare events can
be more interesting than the more regularly occurring ones. The analysis of outlier data is
referred to as outlier mining. Outliers may be detected using statistical tests that assume a
distribution or probability model for the data, or using distance measures.

An outlier is a data point that is significantly different (abnormal or irregular) or deviates from
the remaining data as shown below in Figure 3.7.
Figure 3.7 Outlier

Each purple dot represents a data point in a data set. From the graph, the two data points are
considered outliers since they are very far away from the rest of the data points.

Example:

Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
extremely large amounts for a given account number in comparison to regular charges incurred
by the same account. Outlier values may also be detected with respect to the location and type of
purchase, or the purchase frequency.
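A simple way to flag such unusual values is the 1.5 x IQR rule (discussed again under data preprocessing). The following minimal Python sketch applies it to made-up credit-card charge amounts.

```python
# Minimal sketch: flagging outliers with the 1.5 x IQR rule on illustrative
# credit-card charge amounts (in dollars).
import statistics

charges = [42, 55, 38, 61, 47, 52, 49, 3900, 44, 58]

q1, _, q3 = statistics.quantiles(charges, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [c for c in charges if c < lower or c > upper]
print(outliers)   # [3900] -- an unusually large charge for this account
```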

3.3.7 Evolution Analysis

Data evolution analysis describes and models regularities or trends for objects whose
behavior changes over time. Although this may include characterization, discrimination,
association and correlation analysis, classification, prediction, or clustering of time related data,
distinct features of such an analysis include time-series data analysis, sequence or periodicity
pattern matching, and similarity-based data analysis.

Example:

Evolution analysis can help predict the value of a user-specified goal attribute based on the
values of other attributes. For instance, a banking institution might want to predict whether a
customer's credit would be "good" or "bad" based on their age, income, and current savings.

3.4 Interestingness of pattern


Interestingness discovery is a process employed in data mining and knowledge discovery to
classify the usefulness of patterns. Many different patterns such as customer spending and social
trends are often discovered in data mining, but the relevance, utility, or usefulness of said
patterns depends upon their interestingness. An interestingness measure is a technique used to
narrow down the number of patterns to consider, since many of the discovered patterns are
already known, too obvious, or irrelevant.

 ―What makes a pattern interesting?

 Can a data mining system generate all of the interesting patterns?

 Can a data mining system generate only interesting patterns?‖

3.4.1 What makes a pattern interesting?

A pattern is interesting if it is easily understood by humans, valid on new or test data with
some degree of certainty, potentially useful and novel.

Several objective measures of pattern interestingness exist. An objective measure for


association rules of the form X=>Y is rule support, representing the percentage of transactions
from a transaction database that the given rule satisfies. This is taken to be the probability
P(X U Y), where X U Y indicates that a transaction contains both X and Y, that is, the union of
item sets X and Y. Another objective measure for association rules is confidence, which assesses
the degree of certainty of the detected association. This is taken to be the conditional probability
P(Y |X), that is, the probability that a transaction containing X also contains Y. More formally,
support and confidence are defined as

support(X => Y) = P(X U Y)

confidence(X => Y) = P(Y | X)

―Can a data mining system generate all of the interesting patterns?‖

―Can a data mining system generate all of the interesting patterns?‖ This refers to the
completeness of a data mining algorithm. It is often unrealistic and inefficient for data mining
systems to generate all of the possible patterns. Instead, user-provided constraints and
interestingness measures should be used to focus the search. For some mining tasks, such as
association, this is often sufficient to ensure the completeness of the algorithm.

―Can a data mining system generate only interesting patterns?‖

―Can a data mining system generate only interesting patterns?‖ This is an optimization
problem in data mining. It is highly desirable for data mining systems to generate only
interesting patterns. This would be much more efficient for users and data mining systems,
because neither would have to search through the patterns generated in order to identify the truly
interesting ones. Progress has been made in this direction; however, such optimization remains a
challenging issue in data mining.

3.5 Data Mining System

Data mining is defined as a process used to extract usable data from a larger set of any raw data.
It implies analyzing data patterns in large batches of data using one or more software tools. Data
mining has applications in multiple fields, like science and research. Data mining is the process
of discovering patterns in large data sets involving methods at the intersection of machine
learning, statistics, and database systems. Data mining is the analysis step of the "knowledge
discovery in databases" process, or KDD.

3.5.1 Data Mining Steps:

1. Data cleaning - To remove noise and inconsistent data


2. Data integration - Multiple data sources may be combined

3. Data selection - Data relevant to analysis task are retrieved from DB

4. Data transformation - Data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation

5. Data mining - An essential process where intelligent methods are applied in order to
extract data patterns.
6. Pattern evaluation - To identify the truly interesting patterns representing knowledge based
on some interestingness measures.

7. Knowledge presentation - Visualization and knowledge representation techniques are used


to present the mined knowledge to the user.
Figure 3.8 Data Mining Steps

 Data cleaning

The data we have collected are not clean and may contain errors, missing values, noisy or
inconsistent data. So we need to apply different techniques to get rid of such anomalies.

 Data integration

First of all the data are collected and integrated from all the different sources.

 Data selection

We may not need all the data we have collected in the earlier steps, so in this step we select only
those data which are useful for data mining.

 Data transformation

The data even after cleaning are not ready for mining as we need to transform them into
forms appropriate for mining. The techniques used to accomplish this are smoothing,
aggregation, normalization etc.
 Data mining
Now we are ready to apply data mining techniques on the data to discover the interesting
patterns. Techniques like clustering and association analysis are among the many different
techniques used for data mining.
 Pattern evaluation

This step involves visualization, transformation, removing redundant patterns etc from the
patterns we generated.

 Knowledge presentation

This step helps user to make use of the knowledge acquired to take better decisions.

3.5.2 Architecture of a typical data mining system

The architecture of a typical data mining system, as shown below in Figure 3.9, is built around a
database, data warehouse, World Wide Web, or other information repository, together with the
major components described below.

Components of a data mining system

The various components used in a data mining system are given below, and its diagrammatic
representation is shown below in Figure 3.9.

 Data sources
 Database or data warehouse server
 Knowledge base
 Data mining engine
 Pattern evaluation module
 User interface
Figure 3.9 Data Mining System

 Data sources
The actual sources of data may be a database, a data warehouse, or the World Wide Web
(WWW). Sometimes data may reside even in plain text files or spreadsheets. The World Wide
Web, or the Internet, is another big source of data.

 Database or data warehouse server


The database or data warehouse server is responsible for fetching the relevant data, based on
the user‘s data mining request.
 Knowledge base
This is the domain knowledge that is used to guide the search or evaluate the interestingness
of resulting patterns. Such knowledge can include concept hierarchies, used to organize
attributes or attribute values into different levels of abstraction. Knowledge such as user
beliefs, which can be used to assess a pattern‘s interestingness based on its unexpectedness,
may also be included.
 Data mining engine
This is essential to the data mining system and ideally consists of a set of functional modules
for tasks such as characterization, association and correlation analysis, classification,
prediction, cluster analysis, outlier analysis, and evolution analysis.
 Pattern evaluation module
This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns. It may use
interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used. For efficient data mining, it is highly
recommended to push the evaluation of pattern interestingness as deep as possible into the
mining process so as to confine the search to only the interesting patterns.
 User interface
This module communicates between users and the data mining system, allowing the user to
interact with the system by specifying a data mining query or task, providing information to
help focus the search, and performing exploratory data mining based on the intermediate data
mining results. In addition, this component allows the user to browse database and data
warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in
different forms.

3.6 Data Mining - On What Kind of Data?

There are different sources of data that are used in the data mining process. The data from multiple
sources are integrated into a common store known as a data warehouse.

 Relational Databases
 Data Warehouses
 Transactional Databases
 Advanced Data and Information Systems and
 Advanced Applications

3.6.1 Relational Databases

A relational database for AllElectronics is shown in Figure 3.10. The AllElectronics company
is described by the following relation tables: customer, item, employee, and branch. A relational
database is defined as a collection of data organized in tables with rows and columns. The physical
schema of a relational database defines the structure of the tables, while the logical schema
defines the relationships among the tables. The standard API of a relational database is SQL.
Applications: data mining, the ROLAP model, etc.
Figure 3.10 Relational Database
3.6.2 Data Warehouse

A data warehouse is defined as a collection of data integrated from multiple sources that
supports queries and decision making. There are three types of data warehouse: the enterprise
data warehouse, the data mart, and the virtual warehouse. Two approaches can be used to update
data in a data warehouse: the query-driven approach and the update-driven approach.
Applications: business decision making, data mining, etc., as shown in Figure 3.11.

Figure 3.11 DataWarehouse For All Electronics Example

A data cube can be described as the multidimensional extensions of two-dimensional tables. It


can be viewed as a collection of identical 2-D tables stacked upon one another. Data cubes are
used to represent data that is too complex to be described by a table of columns and rows. As
such, data cubes can go far beyond 3-D to include many more dimensions.
A data cube for summarized sales data of AllElectronics is presented in Figure 3.12.
The cube has three dimensions: address (with city values Chicago, New York, Toronto,
Vancouver), time (with quarter values Q1, Q2, Q3, Q4), and item (with item type values
home entertainment, computer, phone, security).
Figure 3.12 DataCube

The aggregate value stored in each cell of the cube is sales amount (in thousands).

For example, the total sales for the first quarter, Q1, for items relating to security systems in
Vancouver is $400,000, as stored in the cell (Vancouver, Q1, security). Additional cubes may be used
to store aggregate sums over each dimension, corresponding to the aggregate values obtained
using different SQL group-bys (e.g., the total sales amount per city and quarter, or per city and
item, or per quarter and item, or per each individual dimension).

3.6.3 Transactional Databases

A transactional database consists of a file where each record represents a transaction. A


transaction typically includes a unique transaction identity number (trans ID) and a list of the
items making up the transaction (such as items purchased in a store). The transactional
database may have additional tables associated with it, which contain other information
regarding the sale, such as the date of the transaction, the customer ID number, the ID number
of the sales person and of the branch at which the sale occurred, and so on.
A transactional database as shown in Figure 3.13 for All Electronics, transactions can be stored
in a table, with one record per transaction. From the relational database point of view, the sales
table in Figure is a nested relation because the attribute list of item IDs contains a set of items.

Figure 3.13 Transactional Databases


Since most relational database systems do not support nested relational structures, the
transactional database is usually either stored in a flat file in a format similar to that of the
table in Figure 3.13, or unfolded into a standard relation in a format similar to that of the
items_sold table of a relational database.

3.6.4 Advanced Data and Information Systems and Advanced Applications

Various other databases that store specific kinds of information include:

 Object-Relational Databases
 Sequence database
 Temporal database
 Time-series database
 Spatial databases
 Text databases
 Multimedia databases
 A heterogeneous database
 Legacy database
 Data Streams
 The World Wide Web
Advanced Data and Information Systems and Advanced Applications

The object-relational data model inherits the essential concepts of object-oriented databases,
where, in general terms, each entity is considered as an object. Following the All Electronics
example, objects can be individual employees, customers, or items. Data and code relating to
an object are encapsulated into a single unit.

Each object has associated with it the following:


 A set of variables that describe the objects. These correspond to attributes in the entity-
relationship and relational models.
 A set of messages that the object can use to communicate with other objects, or with the
rest of the database system.
 A set of methods, where each method holds the code to implement a message.

Advanced Data and Information Systems and Advanced Applications


A temporal database typically stores relational data that include time-related attributes. These
attributes may involve several timestamps, each having different semantics. A sequence
database stores sequences of ordered events, with or without a concrete notion of time.
Examples include customer shopping sequences, Web click streams, and biological sequences.

A time-series database stores sequences of values or events obtained over repeated


measurements of time (e.g., hourly, daily, weekly). Examples include data collected from the
stock exchange, inventory control, and the observation of natural phenomena (such as temperature
and wind).

Advanced Data and Information Systems and Advanced Applications


Spatial databases contain spatial-related information. Examples include geographic
(map) databases, very large-scale integration (VLSI) or computer-aided design databases, and
medical and satellite image databases. A spatial database that stores spatial objects that change
with time is called a spatiotemporal database, from which interesting information can be
mined.

Advanced Data and Information Systems and Advanced Applications


Text databases will contain word descriptions for objects. These word descriptions are
usually not simple keywords but rather long sentences or paragraphs, such as product
specifications, error or bug reports, warning messages, summary reports, notes, or other
documents. Text databases may be highly unstructured (such as some Web pages on the World
Wide Web). Some text databases may be somewhat structured, that is, semi structured whereas
others are relatively well structured. Text databases with highly regular structures typically can
be implemented using relational database systems.
 Multimedia databases store image, audio, and video data. They are used in applications
such as picture content-based retrieval, voice-mail systems, video-on-demand systems, the
World Wide Web, and speech-based user interfaces that recognize spoken commands.
 A heterogeneous database consists of a set of interconnected, autonomous component
databases. The components communicate in order to exchange information and answer
queries.
 A legacy database is a group of heterogeneous databases that combines different kinds of
data systems, such as relational or object-oriented databases, hierarchical databases, network
databases, spreadsheets, multimedia databases, or file systems.

The heterogeneous databases in a legacy database may be connected by intra or inter-


computer networks. Information exchange across such databases is difficult because it would
require precise transformation rules from one representation to another, considering diverse
semantics.

Data Streams many applications involve the generation and analysis of a new kind of data,
called stream data, where data flow in and out of an observation platform (or window)
dynamically. Such data streams have the following unique features: huge or possibly infinite
volume, dynamically changing, flowing in and out in a fixed order, allowing only one or a
small number of scans, and demanding fast (often real-time) response time. Mining data
streams involves the efficient discovery of general patterns and dynamic changes within stream
data.

World Wide Web and its associated distributed information services, such as Yahoo!, Google,
America Online, and AltaVista, provide rich, worldwide, on-line information services, where
data objects are linked together to facilitate interactive access. Capturing user access patterns in
such distributed information environments is called Web usage mining. Automated Web page
clustering and classification help arrange Web pages in a multidimensional manner based on
their contents. Web community analysis helps identify hidden Web social networks and
communities and observe their evolution.
3.7 Data Mining Task Primitives

A data mining task can be specified in the form of a data mining query, which is input to
the data mining system. A data mining query is defined in terms of data mining task primitives
as shown in Figure 3.14. These primitives allow the user to interactively communicate with the
data mining system during discovery in order to direct the mining process, or examine the
findings from different angles or depths.

1. The set of task-relevant data to be mined:

2. The kind of knowledge to be mined:

3. The background knowledge to be used in the discovery process:

4. The interestingness measures and thresholds for pattern evaluation:

5. The expected representation for visualizing the discovered patterns:


Figure 3.14 Data Mining Task Primitives

 The set of task-relevant data to be mined:


This specifies the portions of the database or the set of data in which the user is interested.
This includes the database attributes or data warehouse dimensions of interest (referred to
as the relevant attributes or dimensions).
 The kind of knowledge to be mined:
This specifies the data mining functions to be performed, such as characterization,
discrimination, association or correlation analysis, classification, prediction, clustering,
outlier analysis, or evolution analysis.
 The background knowledge to be used in the discovery process:
This knowledge about the domain to be mined is useful for guiding the knowledge
discovery process and for evaluating the patterns found. Concept hierarchies are a popular
form of background knowledge, which allow data to be mined at multiple levels of
abstraction; a concept hierarchy for the attribute (or dimension) age is one such example.
User beliefs regarding relationships in the data are another form of background knowledge.
 The interestingness measures and thresholds for pattern evaluation:
They may be used to guide the mining process or, after discovery, to evaluate the
discovered patterns. Different kinds of knowledge may have different interestingness
measures. For example, interestingness measures for association rules include support and
confidence. Rules whose support and confidence values are below user-specified thresholds
are considered uninteresting.
 The expected representation for visualizing the discovered patterns:
This refers to the form in which discovered patterns are to be displayed, which may include
rules, tables, charts, graphs, decision trees, and cubes.

3.8 Integration of a Data Mining System with a Database or Data Warehouse System

A good system architecture enables the data mining system to make the best use of its
software environment. A data mining system is said to be effective if it is able to do the
following:

 Accomplish data mining tasks in an efficient and timely manner.


 Interoperate and exchange information with other information systems.
 Be adaptable to users‘ diverse requirements.
 Evolve with time.

A critical question in the design of a data mining system is ―How to integrate or couple the
DM(Data Mining) system with a database system and/or a data warehouse(DW) system?‖
There are four different ways that include no coupling, loose coupling, semitight coupling,
and tight coupling. No coupling means that a DM system will not utilize any function of a DB
or DW system. Loose coupling means that a DM system will use some facilities of a DB or
DW. Semitight coupling means that, besides linking a DM system to a DB/DW system, efficient
implementations of a few essential data mining primitives can be provided in the DB/DW
system. Tight coupling means that a DM system is smoothly integrated into the
DB/DW system.

The Figure 3.15 represents an Integration of Data mining system with Data Warehouse in
which Data sources are loaded in to the data Warehouse and then Data Mining is performed.
Figure 3.15 Integration of a Data Mining System with a Database or Data Warehouse System
3.8.1 No coupling

No coupling means that a Data mining system will not utilize any function of a data
base or Data WareHouse system. It may fetch data from a particular source (such as a file
system), process data using some data mining algorithms, and then store the mining results in
another file.

In the architecture, data mining system does not utilize any functionality of a database or data
warehouse system. A no-coupling data mining system retrieves data from a particular data
source such as file system, processes data using major data mining algorithms and stores results
into the file system. The no-coupling data mining architecture does not take any advantages of
database or data warehouse that is already very efficient in organizing, storing, accessing and
retrieving data. The no-coupling architecture is considered a poor architecture for data mining
system, however, it is used for simple data mining processes.

3.8.2 Loose coupling:

Loose coupling means that a Data Mining system will use some facilities of a Data
Base or Data Warehouse system, fetching data from a data repository managed by these
systems, performing data mining, and then storing the mining results either in a file or in a
designated place in a database or data warehouse.
In the architecture, data mining system uses the database or data warehouse for data
retrieval. In loose coupling data mining architecture, data mining system retrieves data from the
database or data warehouse, processes data using data mining algorithms and stores the result
in those systems. This architecture is mainly for memory-based data mining system that does
not require high scalability and high performance.

3.8.3 Semitight Coupling:

Semitight coupling means that besides linking a Data mining to a Data Base/Data
WareHouse system, efficient implementations of a few essential data mining primitives can be
provided in the DB/DW system. These primitives can include sorting, indexing, aggregation,
histogram analysis, multiway join, and precomputation of some essential statistical measures,
such as sum, count, max, min, standard deviation, and so on.

In semi-tight coupling data mining architecture, besides linking to database or data warehouse
system, data mining system uses several features of database or data warehouse systems to
perform some data mining tasks including sorting, indexing, aggregation…etc. In this
architecture, some intermediate result can be stored in database or data warehouse system for
better performance.

3.8.4 Tight coupling

Tight coupling means that a Data Mining system is smoothly integrated into the Data
Base/Data Warehouse system. The data mining subsystem is treated as one functional
component of an information system. Data mining queries and functions are optimized based
on mining query analysis, data structures, indexing schemes, and query processing methods of
a Data Base or Data Warehouse system. In tight coupling data mining architecture, database or
data warehouse is treated as an information retrieval component of data mining system using
integration. All the features of database or data warehouse are used to perform data mining
tasks. This architecture provides system scalability, high performance, and integrated
information.

3.9 Major Issues in Data Mining

Possible issues in a data mining system can be classified into three categories.

1. User Interaction Issues


2. Performance Issues.
3. Issues relating to the diversity of database types

3.9.1 User interaction issues:

Mining different kinds of knowledge in databases

Different users may be interested in different kinds of knowledge. Therefore it is


necessary for data mining to cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction


The data mining process needs to be interactive because it allows users to focus the search
for patterns, providing and refining data mining requests based on the returned results.

Incorporation of background knowledge


To guide discovery process and to express the discovered patterns, the background
knowledge can be used. Background knowledge may be used to express the discovered
patterns not only in concise terms but at multiple levels of abstraction.

Data mining query languages and ad hoc data mining


Data Mining Query language that allows the user to describe ad hoc mining tasks, should
be integrated with a data warehouse query language and optimized for efficient and flexible
data mining.

Presentation and visualization of data mining results


Once the patterns are discovered, they need to be expressed in high-level languages and
visual representations. These representations should be easily understandable.

Handling noisy or incomplete data


The data cleaning methods are required to handle the noise and incomplete objects while
mining the data regularities. If the data cleaning methods are not there then the accuracy of the
discovered patterns will be poor.

Pattern evaluation
The patterns discovered may be uninteresting to a given user because they represent common
knowledge or lack novelty; developing measures that capture true interestingness is therefore a challenge.

3.9.2 Performance issues


There can be performance-related issues such as follows

Efficiency and scalability of data mining algorithms:


To effectively extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable.

Parallel, distributed, and incremental mining algorithms:


The huge size of many databases, the wide distribution of data, and complexity of some
data mining methods are factors motivating the development of parallel and distributed data
mining algorithms. Such algorithms divide the data into partitions, which are processed in
parallel.

3.9.3 Issues relating to the diversity of database types

Handling of relational and complex types of data:

There are many kinds of data stored in databases and data warehouses. It is not possible
for one system to mine all these kinds of data, so different data mining systems should be
constructed for different kinds of data.

Mining information from heterogeneous databases and global information systems


Data may be fetched from different data sources on a Local Area Network (LAN) or a
Wide Area Network (WAN). The discovery of knowledge from such diverse sources of structured
and unstructured data is a great challenge to data mining.

3.10 Data Preprocessing

Data preprocessing is a data mining technique that involves transforming raw data into
an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven
method of resolving such issues.

3.10.1 Data Preprocessing


Data in the real world is dirty. It may be:
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data (e.g., occupation = " ")
 noisy: containing errors or outliers (e.g., Salary = "-10")
 inconsistent: containing discrepancies in codes or names (e.g., Age = "42" but
Birthday = "03/07/1997"; a rating that was "1, 2, 3" is now "A, B, C"; discrepancies between
duplicate records)
Dirty Data
Incomplete data may come from "not applicable" data values, from different considerations
between the time the data was collected and the time it is analyzed, or from human, hardware,
or software problems. Noisy data (incorrect values) may come from faulty data collection
instruments, human or computer error at data entry, or errors in data transmission. Inconsistent
data may come from different data sources or from functional dependency violations (e.g.,
modification of some linked data). Duplicate records also need data cleaning.

Importance of Data Preprocessing


Data preprocessing is important because without quality data there can be no quality mining
results. Quality decisions must be based on quality data; for example, duplicate or missing data
may cause incorrect or even misleading statistics. A data warehouse needs consistent integration
of quality data, and data extraction, cleaning, and transformation comprise the majority of the
work of building a data warehouse.

Multi-Dimensional Measure of Data Quality


A well-accepted multidimensional view:
 Accuracy
 Completeness
 Consistency
 Timeliness
 Believability
 Value added
 Interpretability
 Accessibility

3.10.2 Data Mining Descriptive Characteristics


Typical descriptive characteristics of data include the median, maximum, minimum, quantiles,
outliers, and variance. Numerical dimensions correspond to sorted intervals that can be analyzed
at multiple granularities of precision, for example with boxplot or quantile analysis on the sorted
intervals. Dispersion analysis can be applied to computed measures by folding the measures into
numerical dimensions and performing boxplot or quantile analysis on the transformed cube.

Measuring the Central Tendency


The common measures of central tendency are the mean (including the weighted arithmetic
mean and the trimmed mean), the median, and the mode.

Mean (algebraic measure), sample vs. population:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i

Weighted arithmetic mean:

\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}

Trimmed mean: the mean obtained after chopping extreme values from both ends of the sorted data.

Median: a holistic measure


The median is the middle value if there is an odd number of values, or the average of the middle
two values otherwise. For grouped data it can be estimated by interpolation:

median = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{median}}\right) c

where L_1 is the lower boundary of the median interval, n is the number of values, (\sum f)_l is
the sum of the frequencies of all intervals below the median interval, f_{median} is the frequency
of the median interval, and c is the width of the median interval.

Mode

The mode is the value that occurs most frequently in the data. A data set may be unimodal,
bimodal, or trimodal. For moderately skewed data, the empirical formula relating the three
measures is:

mean - mode \approx 3 \times (mean - median)


Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively and negatively skewed data as shown in
Figure 3.16
Figure 3.16: Symmetric vs Skewed Data

Measuring the Dispersion of Data


Dispersion is commonly described with quartiles, outliers, and boxplots. Quartiles: Q1 (25th
percentile) and Q3 (75th percentile). Inter-quartile range: IQR = Q3 - Q1. Five-number summary:
min, Q1, M (median), Q3, max. Boxplot: the ends of the box are the quartiles, the median is
marked, whiskers extend outside the box, and outliers are plotted individually. Outlier: usually, a
value more than 1.5 x IQR above Q3 or below Q1.

Variance and standard deviation (sample: s, population: σ). The standard deviation s (or σ) is the
square root of the variance s^2 (or σ^2):

\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]
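As a quick illustration, the following minimal Python sketch computes these central tendency and dispersion measures with the standard-library statistics module on an illustrative list of values.

```python
# Minimal sketch: central tendency and dispersion measures on illustrative data.
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean      :", statistics.mean(data))        # 58
print("median    :", statistics.median(data))      # 54.0 (average of the two middle values)
print("mode(s)   :", statistics.multimode(data))   # [52, 70] -- this data is bimodal
print("pop. var  :", statistics.pvariance(data))   # sigma^2
print("pop. std  :", statistics.pstdev(data))      # sigma
print("sample var:", statistics.variance(data))    # s^2 (divides by n - 1)
```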

Properties of Normal Distribution Curve


The normal (distribution) curve as shown in Figure 3.17

 From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard
deviation)
 From μ–2σ to μ+2σ: contains about 95% of it
 From μ–3σ to μ+3σ: contains about 99.7% of it

Figure 3.17 : Normal Distribution Curve

Boxplot Analysis
Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum. In a boxplot, the data
is represented with a box whose ends are at the first and third quartiles, i.e., the height of
the box is the IQR. The median is marked by a line within the box, and whiskers (two lines
outside the box) extend to the minimum and maximum, as shown in Figure 3.18. A visualization
of data dispersion using boxplots is shown in Figure 3.19.

Figure 3.18 :Boxplot Analysis

Figure3.19 : Visualization of Data Dispersion

3.10.3 Major Tasks in Data Preprocessing


There are five major tasks in data preprocessing as shown in figure 3.20
 Data cleaning-Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies.
 Data integration-Integration of multiple databases, data cubes, or files
 Data transformation- Normalization and aggregation
 Data reduction- Obtains reduced representation in volume but produces the same or
similar analytical results
 Data discretization - Part of data reduction but with particular importance,
especially for numerical data
Figure3.20 : Data Preprocessing Tasks

3.10.3.1 Data Cleaning


―Data cleaning is one of the three biggest problems in data warehousing‖—Ralph Kimball.
―Data cleaning is the number one problem in data warehousing‖—DCI survey.
Different Data cleaning tasks are listed as follows:

 Fill in missing values

 Identify outliers and smooth out noisy data

 Correct inconsistent data

 Resolve redundancy caused by data integration

Missing Data
Data is not always available; for example, many tuples have no recorded value for several
attributes, such as customer income in sales data. Missing data may be due to equipment
malfunction, data that was inconsistent with other recorded data and thus deleted, data not
entered due to misunderstanding, certain data not being considered important at the time of
entry, or a failure to register the history or changes of the data. Missing data may need to be inferred.

Handling Missing Data


 Ignore the tuple: usually done when the class label is missing (assuming a classification
task); not effective when the percentage of missing values per attribute varies considerably.
 Fill in the missing value manually: tedious and often infeasible.
 Fill it in automatically with:
   - a global constant, e.g., "unknown" (which may effectively create a new class),
   - the attribute mean,
   - the attribute mean for all samples belonging to the same class (smarter), or
   - the most probable value, using inference-based methods such as a Bayesian formula or a
     decision tree (see the sketch below).
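The sketch below assumes the pandas library is available; the column names and values are made up purely for illustration.

```python
# Minimal sketch (assumes pandas is installed): three common ways to handle
# missing values. Column names and values are illustrative only.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, 41, 29, 55],
    "income": [30000, None, 52000, None, 61000],
})

dropped      = df.dropna()                                   # ignore tuples with missing values
global_const = df["income"].fillna(-1)                       # fill with a global constant
mean_filled  = df["income"].fillna(df["income"].mean())      # fill with the attribute mean

print(mean_filled.tolist())   # missing incomes replaced by the mean of the known incomes
```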

Noisy Data
Noise is random error or variance in a measured variable. Incorrect attribute values may be due
to faulty data collection instruments, data entry problems, data transmission problems,
technology limitations, or inconsistency in naming conventions. Other data problems that require
data cleaning include duplicate records, incomplete data, and inconsistent data.

Handling Noisy Data


Binning-first sort data and partition into (equal-frequency) bins then one can smooth by bin
means, smooth by bin median, smooth by bin boundaries, etc.

 Regression- smooth by fitting the data into regression functions


 Clustering- detect and remove outliers
 Combined computer and human inspection detect suspicious values and check by
human (e.g., deal with possible outliers).


Simple Discretization Methods: Binning


Equal-width (distance) partitioning
 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the width of intervals will be:
W = (B –A)/N.
 The most straightforward, but outliers may dominate presentation

 Skewed data is not handled well

Equal-depth (frequency) partitioning

 Divides the range into N intervals, each containing approximately same number of
samples
 Good data scaling
 Managing categorical attributes can be tricky

Binning Methods for Data Smoothing


Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into equal-frequency (equi-depth) bins:


- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
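The following minimal Python sketch reproduces the equal-frequency binning and smoothing results shown above.

```python
# Minimal sketch: equal-frequency binning with smoothing by bin means and by bin
# boundaries, reproducing the worked example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closer bin boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```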
Regression
In regression analysis, a regression curve is fitted to the set of paired data; for linear regression
the curve is a straight line, as shown in Figure 3.21.
Figure 3.21: Regression curve

Cluster Analysis
Data points are grouped into clusters based on similar characteristics, as shown in Figure 3.22.

Figure 3.22 : Cluster Analysis

3.10.3.2 Data Integration


Data integration means combining data from multiple sources into a coherent store. To match
the schemas and objects from different sources, the following issues need to be considered.

 Entity identification problem


 Redundancy and correlation analysis

 Tuple duplication

 Data value conflict detection and resolution.

In Entity Identification Problem there are two main issues to be considered during data
integration are
 Schema Integration

 Object matching
The entity identification problem is the problem of matching real-world entities from different
data sources. Metadata plays a major role in overcoming issues during data integration, and
special attention to the structure of the data is needed to avoid problems due to functional
dependencies and referential constraints.

Redundancy and correlation analysis


Redundancy is an important issue in data integration. Inconsistencies in attribute or dimension
naming can also cause redundancies in the data set. Some redundancies can be detected by
correlation analysis: for nominal data the chi-square test is used, and for numeric attributes the
correlation coefficient and covariance are used, as sketched below.
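The sketch computes the Pearson correlation coefficient for two illustrative numeric attributes; a coefficient close to +1 or -1 suggests that one attribute may be redundant.

```python
# Minimal sketch: Pearson correlation coefficient between two numeric attributes.
# The attribute values are illustrative only.
import math

a = [2, 4, 6, 8, 10]          # e.g., years with the company
b = [25, 39, 58, 72, 91]      # e.g., salary in thousands

n = len(a)
mean_a, mean_b = sum(a) / n, sum(b) / n
cov   = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)

r = cov / (std_a * std_b)
print(round(r, 3))   # close to 1.0 -> strongly correlated; one attribute may be redundant
```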

Tuple Duplication
In addition to detecting the redundancies between the attributes, duplication should also be
detected at tuple level.

Data value conflict detection and Resolution


Data Integration leads to detection and Resolution of data value conflicts. The same real world
entity can have different representations due to differences in representations, scaling or
encoding. Attributes may also refer to different level of abstraction.

3.10.3.3 Data Transformation


Smoothing: remove noise from data

Aggregation: summarization, data cube construction

Generalization: concept hierarchy climbing

Discretization: replacing raw values with interval labels or conceptual labels.

Normalization: scaled to fall within a small, specified range

 min-max normalization to [new_minA, new_maxA]:

v' = ((v - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA

Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is
mapped to (73,000 - 12,000) / (98,000 - 12,000) = 0.709.

 z-score normalization (μ: mean, σ: standard deviation):

v' = (v - μ) / σ

Ex. Let μ = 54,000, σ = 16,000. Then $73,000 is mapped to (73,000 - 54,000) / 16,000 ≈ 1.19.

 normalization by decimal scaling:

v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.

 Attribute/feature construction
New attributes can be constructed from the given ones
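As an illustration of the normalization methods above, here is a minimal Python sketch (the helper
names are ours, and the sample values used for decimal scaling are hypothetical):

import math

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Min-max normalization of value v to the range [new_min, new_max].
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    # Z-score normalization using the attribute mean and standard deviation.
    return (v - mean) / std

def decimal_scaling(values):
    # Normalization by decimal scaling: divide by 10^j so that max |v'| < 1.
    j = math.floor(math.log10(max(abs(v) for v in values))) + 1
    return [v / (10 ** j) for v in values]

print(round(min_max(73000, 12000, 98000), 3))   # 0.709, as in the income example above
print(round(z_score(73000, 54000, 16000), 2))   # 1.19
print(decimal_scaling([-986, 917]))             # [-0.986, 0.917] (hypothetical values)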
3.10.3.4 Data Reduction
A database/data warehouse may store terabytes of data. Complex data mining may take a
very long time to run on the complete data set and to overcome this complexity data reduction
may be done.

Data reduction
Obtain a reduced representation of the data set that is much smaller in volume but yet
produce the same (or almost the same) analytical results.

Data Reduction Strategies


 Dimensionality reduction
   Wavelet transforms
   Principal component analysis
   Attribute subset selection
 Numerosity reduction
   Regression and log-linear models
   Histograms
   Clustering
   Sampling
   Data cube aggregation
 Data compression

Data Cube Aggregation


The lowest level of a data cube (the base cuboid) holds aggregated data for an individual entity of
interest. Multiple levels of aggregation are applied in data cubes to further reduce the size of the
data to deal with. It is essential to use the smallest representation that is sufficient to solve the
task. Queries on aggregated information should be answered using the data cube whenever possible.

Attribute Subset Selection


Attribute subset selection is also called feature selection. The goals of attribute selection are:
 Select a minimum set of features such that the probability distribution of different classes
given the values for those features is as close as possible to the original distribution given the
values of all features.

 Reduce the number of attributes appearing in the discovered patterns, making the patterns easier
to understand.

The four heuristic methods (due to exponential # of choices):

 Step-wise forward selection

 Step-wise backward elimination

 Combining forward selection and backward elimination

 Decision-tree induction (shown in figure 3.23)

Figure 3.23: Example for Decision Tree Induction

Heuristic Feature Selection Methods


There are 2^d possible sub-features (attribute subsets) of d features. Several heuristic feature selection methods are:

Best single features under the feature independence assumption: choose by significance tests.
 Best step-wise feature selection:
 The best single-feature is picked first
 Then next best feature condition to the first, ...
 Step-wise feature elimination:
Repeatedly eliminate the worst feature

 Best combined feature selection and elimination


 Optimal branch and bound:

Use feature elimination and backtracking

Dimensionality Reduction: Principal Component Analysis (PCA)


Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal
components) that can be best used to represent data .
Steps
 Normalize input data: Each attribute falls within the same range

 Compute k orthonormal (unit) vectors, i.e., principal components

 Each input data (vector) is a linear combination of the k principal component vectors

 The principal components are sorted in order of decreasing ―significance‖ or strength

 Since the components are sorted, the size of the data can be reduced by eliminating the
weak components, i.e., those with low variance. Using the strongest principal components, it is
possible to reconstruct a good approximation of the original data, as shown in figure 3.24.
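A minimal numeric sketch of these PCA steps, assuming NumPy is available (the tiny 2-D data set
below is hypothetical):

import numpy as np

def pca(X, k):
    # Project the rows of X (N samples x n attributes) onto the k strongest
    # principal components and return the reduced data plus the components.
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)              # step 1: center (normalize) the input data
    cov = np.cov(X_centered, rowvar=False)       # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)       # orthonormal (unit) vectors = components
    order = np.argsort(eigvals)[::-1]            # sort by decreasing "significance"
    components = eigvecs[:, order[:k]]           # keep only the k strongest components
    return X_centered @ components, components   # reduced data, principal components

data = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]]
reduced, comps = pca(data, k=1)
print(reduced.shape)  # (5, 1) -- each 2-D point is now represented by one component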

Limitations
 Works for numeric data only
 Used when the number of dimensions is large

Figure 3.24: Principal Component Analysis


UNIT-III - QUESTION BANK
PART A

1. Define data.
2. State why the data preprocessing an important issue for data warehousing and data mining.
3. What is the need for discretization in data mining?.
4. What are the various forms of data preprocessing?
5. What is concept Hierarchy? Give an example.
6. What are the various forms of data preprocessing?
7. Mention the various tasks to be accomplished as part of data pre-processing.
8. Define Data Mining.
9. List out any four data mining tools.
10. What do data mining functionalities include?
11. Define patterns.
12. Define cluster Analysis
13. What is Outlier Analysis?
14. What makes the pattern interesting?
15. Difference between OLAP and Data mining
16. What do you mean by high performance data mining?
17. What are the Various data mining techniques?
18. What do you mean by predictive data mining?
19. What do you mean by descriptive data mining?
20. What are the steps involved in the data mining process?
21. List the methods available to fill the missing values.
22. Locate the outliers in a sample box plot and explain.
23. List out the major research issues in data mining.
24. Noisy data of the price attribute in a data set is as follows:
4, 8, 15, 21, 21, 24, 25, 28, 34.
How can the noise be removed from the above data? Give the data values after data
smoothing is done.
PART-B
1. Explain the various primitives for specifying Data mining Task.
2. Describe the various descriptive statistical measures for data mining.
3. Discuss about different types of data and functionalities.
4. Describe in detail about Interestingness of patterns.
5. Explain in detail about data mining task primitives.
6. Discuss about different Issues of data mining.
7. Explain in detail about data pre-processing.
8. How data mining system are classified? Discuss each classification with an example.
9. How data mining system can be integrated with a data warehouse? Discuss with an example.
10. Explain data mining applications for Biomedical and DNA data analysis
11. Explain data mining applications for financial data analysis.
12. Explain data mining applications for retail industry.
13. Explain data mining applications for Telecommunication industry.
14. Discuss about data integration and data transformation steps in data pre-processing.
15. List and discuss the steps for integrating a data mining system with a data warehouse.
16. Explain the process of measuring the dispersion of data.
UNIT 4
ASSOCIATION RULE MINING AND CLASSIFICATION

Mining Frequent Patterns, Associations and Correlations – Mining Methods – Mining various
Kinds of Association Rules – Correlation Analysis – Constraint Based Association Mining –
Classification and Prediction - Basic concepts - Decision Tree Induction - Bayesian
Classification – Rule Based Classification – Classification by Back propagation – Support
Vector Machines – Associative Classification – Lazy Learners – Other Classification Methods
– prediction.

4.1Mining Frequent Pattern


Pattern: A pattern is said to be the set of items, subsequences or substructures that exists in the
given dataset.

Frequent pattern: A pattern (a set of items, subsequences, substructures, etc.) that occurs
frequently in a data set is called as Frequent Pattern. For example, a set of items, such as milk
and bread that appear frequently together in a transaction data set is a frequent item set.

Frequent pattern mining is an interesting branch of data mining. Beyond itemsets, it also looks at
sequences of actions or events, for example the order in which we get dressed: shirt first or pants
first? Socks as the second item, or a second shirt in wintertime?

Motivation of Frequent Pattern Mining


The main motivation of frequent pattern mining is to find the inherent regularities in data.
Some of the sample queries that can be answered after frequent pattern mining are listed as
follows.
 What products were often purchased together?
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
An itemset X is frequent if X‘s support is no less than a minsup threshold.

Frequent Pattern Mining


Frequent pattern mining can be classified in various ways, based on the following criteria:
 Based on the completeness of patterns to be mined
 Based on the levels of abstraction involved in the rule set
 Based on the types of values handled in the rule.
 Based on the kinds of rules to be mined.
 Based on the kinds of patterns to be mined.
Applications of Frequent Pattern Mining
Some of the main applications of frequent pattern mining are.,
 Basket data analysis
 Cross-marketing
 Catalog design
 Sale campaign analysis
 Web log (click stream) analysis
 DNA sequence analysis

4.1.1 Market Basket Analysis


Market basket analysis is a modeling technique based upon the theory that if a person
buys a certain group of items, then they are more (or less) likely to buy another group of items. For
example, if you are in an English restaurant and you buy a pint of coke but don't buy a meal,
you are more likely to also buy crisps than somebody who didn't buy the coke. The
set of items a customer buys is referred to as an itemset, and market basket analysis seeks to
find relationships between purchases, as shown below in Figure 4.1. Typically the relationship
will be in the form of a rule: IF {coke, no meal} THEN {crisps}.
Figure 4.1 Market Basket Analysis

Basic Concepts

Market basket analysis asks, ―Which groups or sets of items are customers likely to
purchase on a given trip to the store?‖ For instance, market basket analysis may help us to
design different store layouts. In one strategy, items that are frequently purchased together can
be placed in proximity to further encourage the combined sale of such items. In an alternative
strategy, placing hardware and software at opposite ends of the store may entice customers
who purchase such items to pick up other items along the way. If we think of the universe as
the set of items available at the store, then each item has a Boolean variable representing the
presence or absence of that item. Each basket can then be represented by a Boolean vector of
values assigned to these variables. The Boolean vectors can be analyzed for buying patterns
that reflect items that are frequently associated or purchased together. These patterns can be
represented in the form of association rules as given below.

computer => antivirus_software [support = 2%, confidence = 60%]

4.1.2 Support and Confidence


Two measures that define rule interestingness are
 Support
 Confidence
Support:
Support indicates the percentage of all the transactions under analysis in which items A and B
are purchased together. A support of 2% for the rule above means that 2% of all the transactions under
analysis show that computer and antivirus software are purchased together.
Confidence:
Confidence of an association rule indicates how often customers who purchased item A
also bought item B. A confidence of 60% means that 60% of the customers who purchased
a computer also bought the software. Typically, association rules are considered interesting if
they satisfy both a minimum support threshold and a minimum confidence threshold.

support(A=>B) = P(A U B)

confidence(A=>B) = P(B|A)
confidence(A=>B) = P(B|A) = support(A ∪ B) / support(A) = support_count(A ∪ B) / support_count(A)

Support
―The support is the percentage of transactions that demonstrate the rule.‖
Example: Database with transactions ( customer_# : item_a1, item_a2, … )
1: 1, 3, 5.
2: 1, 8, 14, 17, 12.
3: 4, 6, 8, 12, 9, 104.
4: 2, 1, 8.
support {8,12} = 2 (or 50%, 2 of 4 customers)
support {1,5} = 1 (or 25%, 1 of 4 customers)
support {1} = 3 (or 75%, 3 of 4 customers)
Confidence
The confidence is the conditional probability that, given X present in a transaction, Y will also
be present.
Confidence measure, by definition:
Confidence(X=>Y) equals support(X,Y) / support(X)
Example
1. Example: Database with transactions ( customer_# : item_a1, item_a2, … )
1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.
6: 1, 5, 6.
7: 4, 5, 6, 8.
8: 2, 3, 4.
9: 1, 5, 7, 8.
10: 3, 8, 9, 10.
Conf ( {5} => {8} ) ?
supp({5}) = 5 , supp({8}) = 7 , supp({5,8}) = 4,
then conf( {5} => {8} ) = 4/5 = 0.8 or 80%

2.Example: Database with transactions ( customer_# : item_a1, item_a2, … )

1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.
6: 1, 5, 6.
7: 4, 5, 6, 8.
8: 2, 3, 4.
9: 1, 5, 7, 8.
10: 3, 8, 9, 10.
Conf ( {5} => {8} ) = 80%, as computed above. Now, Conf ( {8} => {5} ) ?
supp({5}) = 5 , supp({8}) = 7 , supp({5,8}) = 4,
then conf( {8} => {5} ) = 4/7 = 0.57 or 57%

3.Example: Database with transactions ( customer_# : item_a1, item_a2, … )


1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.
6: 1, 5, 6.
7: 4, 5, 6, 8.
8: 2, 3, 4.
9: 1, 5, 7, 8.
10: 3, 8, 9, 10.
Conf ( {9} => {3} ) ?
supp({9}) = 1 , supp({3,9}) = 1,
then conf( {9} => {3} ) = 1/1 = 1.0 or 100%. Note, however, that this rule is supported by only a
single transaction, so despite its 100% confidence its support is very low.
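The support and confidence values in the examples above can be checked with a small Python sketch
(the function names are ours):

transactions = [
    {3, 5, 8}, {2, 6, 8}, {1, 4, 7, 10}, {3, 8, 10}, {2, 5, 8},
    {1, 5, 6}, {4, 5, 6, 8}, {2, 3, 4}, {1, 5, 7, 8}, {3, 8, 9, 10},
]

def support_count(itemset, transactions):
    # Number of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if set(itemset) <= t)

def confidence(lhs, rhs, transactions):
    # conf(lhs => rhs) = support_count(lhs U rhs) / support_count(lhs).
    return support_count(set(lhs) | set(rhs), transactions) / support_count(lhs, transactions)

print(support_count({5}, transactions))     # 5
print(support_count({8}, transactions))     # 7
print(confidence({5}, {8}, transactions))   # 0.8  (80%)
print(confidence({8}, {5}, transactions))   # 0.571... (about 57%)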
Example 4: Following database has various Itemset I = {Beer, Bread, Jelly, Milk, Peanut
Butter}. Database with transactions are shown in Figure 4.2. Support and confidence of
different item sets is shown in Figure 4.3.

Figure 4.2 Example Transaction

Figure 4.3 Calculated Support

4.1.3Association Rules
Every association rule has a support and a confidence. The task is to find all the rules X => Y with
minimum support and confidence, where
 support, s, is the probability that a transaction contains X ∪ Y
 confidence, c, is the conditional probability that a transaction containing X also contains
Y

support(A=>B) = P(A U B)

confidence(A=>B) = P(B|A)

The above equation shows that the confidence of rule A => B can easily be derived from the
support counts of A and A ∪ B. Once the support counts of A, B, and A ∪ B are found, it is
straightforward to derive the corresponding association rules.
Association rule mining can be viewed as a two-step process:
 Find all frequent itemsets. (By Min Support count value)
 Generate strong association rules from the frequent item sets.
(By Min Support count and confidence value)
Additional interesting measures can be given by correlation analysis.

4.2 Mining Methods


There are two frequent Pattern Mining Algorithms.
 Apriori
 FP-growth
4.2.1 Apriori
The downward closure property of Apriori algorithm is defined as "Any subset of a
frequent item set must be frequent". Apriori uses prior knowledge of frequent item set
properties. Employs an iterative approach known as a level-wise search, where k-itemsets
are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found by scanning
the database to accumulate the count for each item, and collecting those item that satisfy
minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of
frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-item sets
can be found.
Apriori property:
 All nonempty subsets of a frequent item set must also be frequent.
 By definition, if an item set I does not satisfy the minimum support threshold, min sup,
then I is not frequent; that is, P(I) < min sup.
 If an item A is added to the item set I, then the resulting item set (i.e., IUA) cannot
occur more frequently than I. Therefore, I UA is not frequent either; that is, P(I UA) <
min sup.
Apriori Algorithm steps:
A two-step process is followed, consisting of Join actions and Prune actions. The join
step is to find Lk, a set of candidate k-itemsets is generated by joining L K-1with itself. This
set of candidates is denoted Ck.

The join step:


To find Lk, a set of candidate k- itemsets is generated by joining Lk-1
with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1.
The notation li [j] refers to the jth item in li(e.g., l1[k-2] refers to the second to the
last item in l1). By convention, Apriori assumes that items within a transaction or
itemset are sorted in lexicographic order. For the (k -1)-itemset, li, this means that
the items are sorted such that li[1] < li[2] <....< li[k - 1]. The join, is performed, where
members of Lk-1 are joinable if their first (k - 2) items are in common.
The prune step:
Ck is a super set of Lk, that is, its members may or may not be frequent, but all of the
frequent k-itemsets are included in Ck. A scan of the database to determine the count of
each candidate in Ck would result in the determination of Lk. Ck, however, can be huge,
and so this involves heavy computation. To reduce the size of Ck, the Apriori property is
used as follows. Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-
itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate
cannot be frequent either and so can be removed from Ck. An working example of Apriori
is shown in Figure 4.4.
Working Example

Figure 4.4 Example of Apriori

|D| = 9 min_sup = 2
In the first iteration of the algorithm, each item is a member of the set of candidate 1-
itemsets, C1 as shown below in Figure 4.5 to 4.8.
Figure 4.5 Candidate set 1

Figure 4.6 Generating Candidate set 2

Figure 4.7 Comparing Candidate set 2


Figure 4.8 Generating Candidate set 3

Apriori Algorithm for forming C3

C3 = L2 ⋈ L2 = {{I1,I2},{I1,I3},{I1,I5},{I2,I3},{I2,I4},{I2,I5}} ⋈
{{I1,I2},{I1,I3},{I1,I5},{I2,I3},{I2,I4},{I2,I5}}
= {{I1,I2,I3},{I1,I2,I5},{I1,I3,I5},{I2,I3,I4},{I2,I3,I5},{I2,I4,I5}}
The 2- item subsets of {I1,I2,I3} are {I1,I2} ,{I1,I3} and {I2,I3}. All 2- item subsets of
{I1,I2,I3} are members of L2. Therefore, keep {I1,I2,I3} in C3.
The 2- item subsets of {I1,I2,I5} are {I1,I2} ,{I1,I5} and {I2,I5}. All 2- item subsets of
{I1,I2,I5} are members of L2. Therefore, keep {I1,I2,I5} in C3.
The 2-item subsets of {I1,I3,I5} are {I1,I3}, {I1,I5} and {I3,I5}. {I3,I5} is not a
member of L2, and so it is not frequent. Therefore, remove {I1,I3,I5} from C3.
The 2-item subsets of {I2,I3,I4} are {I2,I3}, {I2,I4} and {I3,I4}. {I3,I4} is not a
member of L2, and so it is not frequent. Therefore, remove {I2,I3,I4} from C3.
The 2-item subsets of {I2,I3,I5} are {I2,I3}, {I2,I5} and {I3,I5}. {I3,I5} is not a
member of L2, and so it is not frequent. Therefore, remove {I2,I3,I5} from C3.
The 2-item subsets of {I2,I4,I5} are {I2,I4}, {I2,I5} and {I4,I5}. {I4,I5} is not a
member of L2, and so it is not frequent. Therefore, remove {I2,I4,I5} from C3.
Therefore, C3 = {{I1,I2,I3}, {I1,I2,I5}} after pruning.

Figure 4.9 Comparing Candidate set 3

PSEUDO-CODE
The Apriori Algorithm (Pseudo-Code)
Ck : candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

Candidate Generation: An SQL Implementation


SQL Implementation of candidate generation
Suppose the items in Lk-1 are listed in an order.
Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
Use object-relational extensions like UDFs, BLOBs, and Table functions for efficient
implementation

Further Improvement of the Apriori Method


The major computational challenges of Apriori are the multiple scans of the transaction database, the
huge number of candidates, and the tedious workload of support counting for the candidates.

Apriori can therefore be improved along three general lines: reduce the number of passes over the
transaction database, shrink the number of candidates, and facilitate the support counting of candidates.

Counting Supports of Candidates Using Hash Tree


The total number of candidates can be very large, and one transaction may contain many
candidates, so counting the supports of the candidates is not an easy task. A hash-tree method
can be used to count the supports of candidates.
Here candidate itemsets are stored in a hash tree, as shown in figure 4.10. A leaf node of the
hash tree contains a list of itemsets and counts, and an interior node contains a hash table. A subset
function finds all the candidates contained in a transaction.

Figure 4.10 Hash Tree for Support counting

Hash-based technique
A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck,
for k> 1.Hash the items into the different buckets of a hash table structure as shown in Figure
4.11, and increase the corresponding bucket counts.

Figure 4.11 Comparing Candidate set 3

Transaction reduction

A transaction that does not contain any frequent k-itemsets cannot contain any frequent
(k+1)-itemsets. Therefore, such a transaction can be marked or removed from further
consideration.
Partitioning

A partitioning technique can be used that requires just two database scans to mine the
frequent itemsets (Figure 4.12). It consists of two phases. In Phase I, the algorithm subdivides
the transactions of D into n nonoverlapping partitions. If the minimum support threshold for
transactions in D is min sup, then the minimum support count for a partition is min sup × the
number of transactions in that partition. For each partition all frequent itemsets within the
partition are found. These are referred to as local frequent itemsets. The procedure employs a
special data structure that, for each itemset, records the TIDs of the transactions containing the
items in the itemset. This allows it to find all of the local frequent k-itemsets, for k =1,2,..., in
just one scan of the database.

Figure 4.12 Partitioning

4.2.2 Mining Frequent Item sets without Candidate Generation FP Growth Algorithm

The FP-growth algorithm mines frequent patterns without candidate generation. It compresses a
large database into a compact frequent-pattern tree (FP-tree) structure, which is highly condensed
but complete for frequent pattern mining, thereby avoiding costly repeated database scans.

Drawback and Solution


Drawbacks of the Apriori algorithm: it needs to generate a huge number of candidate sets and to
repeatedly scan the database.
Solution: design a method that mines the complete set of frequent itemsets without candidate
generation. The FP-growth method can be applied.
FP tree construction with example

Figure 4.13 lists the items purchased in each transaction. The FP-tree is constructed from this
example as follows.
Figure 4.13 Transaction Database

Construct FP-tree from a Transaction Database

 Scan DB once, find frequent 1-itemset (single item pattern)


 Sort frequent items in frequency descending order, f-list
 Scan DB again, construct FP-tree as shown in figure 4.14
L = {{I2:7}, {I1:6}, {I3:6}, {I4:2}, {I5:2}}

Figure 4.14 Constructing FP Tree

Mining FP tree

The FP-tree is mined as follows. Start from each frequent length-1 pattern (as an initial
suffix pattern), construct its conditional pattern base (a ―sub database,‖ which consists of
the set of prefix paths in the FP-tree co-occurring with the suffix pattern), then construct
its (conditional) FP-tree, and perform mining recursively on such a tree. The pattern
growth is achieved by the concatenation of the suffix pattern with the frequent patterns
generated from a conditional FP-tree.

Figure 4.15 Generating Frequent Patterns

Figure 4.16 Mining FP Tree

Mining of the FP-tree is summarized Figure 4.15 and detailed as follows. We first
consider I5, which is the last item in L, rather than the first. The reason for starting at the
end of the list will become apparent as we explain the FP-tree mining process. I5 occurs
in two branches of the FP-tree of Figure 4.14. (The occurrences of I5 can easily be found
by following its chain of node-links.) The paths formed by these branches are <I2, I1,
I5: 1> and <I2, I1, I3, I5: 1>. Therefore, considering I5 as a suffix, its corresponding two
prefix paths are <I2, I1: 1> and <I2, I1, I3: 1>, which form its conditional pattern base. Its
conditional FP-tree contains only a single path, <I2: 2, I1: 2>; I3 is not included because
its support count of 1 is less than the minimum support count. The single path generates
all the combinations of frequent patterns: {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}.

For I4, its two prefix paths form the conditional pattern base, {{I2, I1: 1}, {I2: 1}},
which generates a single-node conditional FP-tree, <I2: 2>, and derives one frequent pattern,
{I2, I4: 2}. Notice that although I5 follows I4 in the first branch, there is no need to include I5
in the analysis here because any frequent pattern involving I5 is analyzed in the examination of
I5. Similar to the above analysis, I3‘s conditional pattern base is {{I2, I1: 2}, {I2: 2}, {I1: 2}}.
Its conditional FP-tree has two branches, {I2: 4, I1: 2} and {I1: 2}, as shown in Figure 4.16,
which generates the set of patterns, {{I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}}. Finally, I1‘s
conditional pattern base is {{I2: 4}}, whose FP-tree contains only one node, {I2: 4}, which
generates one frequent pattern, {I2, I1: 4}.

The FP-growth method transforms the problem of finding long frequent patterns to
searching for shorter ones recursively and then concatenating the suffix. It uses the least
frequent items as a suffix, offering good selectivity. The method substantially reduces the
search costs. When the database is large, it is sometimes unrealistic to construct a main
memory based FP-tree. An interesting alternative is to first partition the database into a set of
projected databases, and then construct an FP-tree and mine it in each projected database.
Such a process can be recursively applied to any projected database if its FP-tree still
cannot fit in main memory.
A study on the performance of the FP-growth method shows that it is efficient and
scalable for mining both long and short frequent patterns, and is about an order of magnitude
faster than the Apriori algorithm. It is also faster than a Tree-Projection algorithm,
which recursively projects a database into a tree of projected databases.

Advantages of the Pattern Growth Approach

 Divide-and-conquer: decompose both the mining task and the database according to the frequent
patterns obtained so far, which leads to a focused search of smaller databases.
 No candidate generation, no candidate test
 Compressed database: FP-tree structure
 No repeated scan of entire database
 Basic ops: counting local freq items and building sub FP-tree, no pattern search and
matching

4.3 Mining Multilevel Association Rules


Multilevel Association Rules can be mined by
 Using uniform minimum support for all levels
 Using reduced minimum support at lower levels
 Using item or group-based minimum support
Figure 4.17 depicts the items purchased, and figure 4.18 shows the multilevel association rules
derived for those items.

Figure 4.17 Items purchased

Figure 4.18 Multilevel Association Rules

4.4 From Association Mining to Correlation Analysis

Many association rules so generated are still not interesting to the users. This is especially
true when mining at low support thresholds or mining for long patterns. This has been one of
the major bottlenecks for successful application of association rule mining. Whether or not a
rule is interesting can be assessed either subjectively or objectively.

Strong Rules Are Not Necessarily Interesting: An Example


A misleading "strong" association rule. Suppose we are interested in analyzing
transactions at AllElectronics with respect to the purchase of computer games and videos. Let
game refer to the transactions containing computer games, and video refer to those containing
videos. Of the 10000 transactions analyzed, the data show that 6000 of the customer
transactions included computer games, while 7500 included videos, and 4000 included both
computer games and videos. Suppose that a data mining program for discovering association
rules is run on the data, using a minimum support of, say, 30% and a minimum confidence of
60%. The following association rule is discovered:

buys(X, "computer games") => buys (X, "videos") [support = 40%, confidence =60%]
From Association Analysis to Correlation Analysis
A => B [support, confidence, correlation]
Measures for correlation analysis are
 Calculating the lift value
 Calculating the χ2 value
Calculating the lift value
 Lift is a simple correlation measure.
 The occurrence of itemset A is independent of the occurrence of itemset B if
P(A ∪ B) = P(A)P(B); otherwise the two itemsets are dependent and correlated.

lift(A,B) = P(A ∪ B) / (P(A) P(B))
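As a quick check on the AllElectronics example above: P(game) = 6000/10000 = 0.60,
P(video) = 7500/10000 = 0.75, and P({game, video}) = 4000/10000 = 0.40, so
lift(game, video) = 0.40 / (0.60 × 0.75) ≈ 0.89. Because the lift is less than 1, the purchase of
computer games and videos is negatively correlated, which is why the ―strong‖ rule above is
actually misleading.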

4.5Data Mining: Classification & Prediction


Supervised vs. Unsupervised Learning
Supervised learning (classification) : The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations. New data is classified based on
the training set.
Unsupervised learning (clustering) : The class labels of training data are unknown. Given a set
of measurements, observations, etc. with the aim of establishing the existence of classes or
clusters in the data
Classification vs. Prediction
Classification: Classification predicts categorical class labels. It classifies data (constructs a
model) based on the training set and the values (class labels) in a classifying attribute and uses
it in classifying new data

Prediction: Prediction models continuous-valued functions, i.e., it predicts unknown or missing
values.

Typical Applications

 credit approval
 target marketing
 medical diagnosis
 treatment effectiveness analysis

Classification—A Two-Step Process


Step 1 : Learning Phase
In the first step, a classifier is built describing a predetermined set of data classes or
concepts. This is the learning step where a classification algorithm builds the classifier by
analyzing a training set made up of database tuples and their associated class labels. A tuple, X,
is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n
measurements made on the tuple from n database attributes, respectively, A1, A2, ..., An. This
first step of the classification process can also be viewed as the learning of a mapping or
function, y = f(X), as mentioned in figure 4.19.

Step 2 : Classification Phase


The predictive accuracy of the classifier is estimated using a test set of labeled tuples that is
independent of the training data. The accuracy of the classifier on the given test set is then calculated.

Classification Process (1): Model Construction


Figure 4.19 Classification Model Construction

Classification Process (2): Use the Model in Prediction

Figure 4.20 Predicting the Classified Model

Numeric prediction differs from classification


Data prediction is also a two-step process, similar to that of classification. The attribute whose
values are being predicted is continuous-valued and can be referred to simply as the predicted
attribute. Prediction and classification differ in the models used for construction, and the accuracy
measures used for the two methods are also different.

Issues regarding classification and prediction


Data cleaning: preprocess the data in order to reduce noise and handle missing values.

Relevance analysis (feature selection): remove irrelevant or redundant attributes.

Data transformation: generalize and/or normalize the data.


4.6Classification by decision tree induction
4.6.1 Decision tree
It is a flow-chart-like tree structure. Internal node denotes a test on an attribute. Branch
represents an outcome of the test. Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases:
 Tree construction
 Tree pruning
Tree construction
 At start, all the training examples are at the root
 Partition examples recursively based on selected attributes
Tree pruning
 Identify and remove branches that reflect noise or outliers
 Use of decision tree: Classifying an unknown sample
 Test the attribute values of the sample against the decision tree
Training Dataset
Figure 4.21 represents the training dataset for which the decision tree has to be built.

age income student credit_rating buys_computer


<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Figure 4.21 Training Dataset

This follows an example from Quinlan‘s ID3

Output: A Decision Tree for ―buys_computer‖ as shown in figure 4.22


Figure 4.22Decision Tree

4.6.2Decision Tree Induction


Decision trees are constructed in top down approach. The Popular algorithms are ID3, C4.5,
CART.

Decision Tree Algorithm – Strategy

The algorithm is called with three parameters: D, attribute list, and Attribute selection
method. (Information gain, gini index, gain ratio). Tree starts as a single node, N, representing
the training tuples in D. If the tuples in D are all of the same class, then node N becomes a leaf
and is labeled with that class. Otherwise, the algorithm calls Attribute selection method to
determine the splitting criterion.

The splitting criterion tells us which attribute to test at node N. Let A be the splitting attribute.
A has v distinct values, {a1, a2, ..., av}, based on the training data.
 A is discrete-valued: the outcomes of the test at node N correspond directly to the known values of A.
 A is continuous-valued: the test at node N has two possible outcomes, corresponding to the
conditions A <= split_point and A > split_point, respectively.
 A is discrete-valued and a binary tree must be produced: the test at node N is of the form
―A ∈ SA?‖, as shown in figure 4.23.
Figure 4.23Decision Tree Strategy

The recursive partitioning stops only when any one of the following terminating conditions is
true:

 All of the tuples in partition D (represented at node N) belong to the same class

 There are no remaining attributes on which the tuples may be further partitioned.

 There are no tuples for a given branch, that is, a partition Dj is empty

Decision Tree Algorithm – Complexity


The computational complexity of the algorithm given training set D is O(n×|D|×
log(|D|)), where n is the number of attributes describing the tuples in D and |D| is the
number of training tuples in D. This means that the computational cost of growing a
tree grows at most n×|D|×log(|D|) with |D| tuples. The proof is left as an exercise for
the reader.

Attribute Selection Measures


The splitting criterion that ―best‖ separates a given data partition, D, of class-labeled training
tuples into individual classes. If we split D into smaller partitions according to the outcomes of
the splitting criterion, ideally each partition would be pure. To determine how the tuples at a
given node are to be split, the attribute selection measure provides a ranking for each attribute
describing the given training tuples. Three popular attribute selection measures—information
gain, gain ratio, and gini index. Notations used here are:
 D – the data partition (set of class-labeled training tuples)

 m – the number of distinct values of the class label attribute, i.e., the number of distinct classes

 Ci – the i-th class, for i = 1, ..., m

 Ci,D – the set of tuples of class Ci in D

4.6.3 Information gain


Attribute with the highest information gain is chosen as the splitting attribute for node
N. This attribute minimizes the information needed to classify the tuples and reflects least
―impurity‖ in these partitions. The expected information needed to classify a tuple in D is
given by

Info(D) = - Σ (i=1..m) pi log2(pi)

where pi is the probability that an arbitrary tuple in D belongs to class Ci, estimated as |Ci,D| / |D|.

Now, suppose we were to partition the tuples in D on some attribute A having v distinct
values, {a1, a2, ..., av}, as observed from the training data. If A is discrete-valued, these values
correspond directly to the v outcomes of a test on A. Attribute A can be used to split D into v
partitions or subsets, {D1, D2, ..., Dv}, where Dj contains those tuples in D that have outcome aj of
A. These partitions would correspond to the branches grown from node N. Ideally, we would
like this partitioning to produce an exact classification.

InfoA(D) = Σ (j=1..v) (|Dj| / |D|) × Info(Dj)

Gain(A) = Info(D) - InfoA(D)

Attribute selection by Information gain – Example


Information gain Calculation Example

Figure 4.24: Example for Information gain Calculation

In this example shown in figure 4.24, each attribute is discrete-valued. The class label
attribute, buys computer, has two distinct values (namely, yes, no); Therefore, there are two
distinct classes (that is, m = 2). Let class C1 correspond to yes and class C2 correspond to
no.There are nine tuples of class yes and five tuples of class no. A (root) node N is created for
the tuples in D. To find the splitting criterion for these tuples, we must compute the
information gain of each attribute

Information needed to classify the tuple is calculated.

Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits.

Next, we need to compute the expected information requirement for each attribute.Let‘s
start with the attribute age. We need to look at the distribution of yes and no tuples for each
category of age.
 For the age category youth, there are two yes tuples and three no tuples.

 For the category middle aged, there are four yes tuples and zero no tuples.
 For the category senior, there are three yes tuples and two no tuples

Infoage(D) = (5/14) × (-(2/5) log2(2/5) - (3/5) log2(3/5))
           + (4/14) × (-(4/4) log2(4/4) - (0/4) log2(0/4))
           + (5/14) × (-(3/5) log2(3/5) - (2/5) log2(2/5))
           = 0.694 bits

Gain(age) = Info(D) - Infoage(D) = 0.940 - 0.694 = 0.246 bits

 Gain(income) = 0.029 bits,


 Gain(student) = 0.151 bits, and
 Gain(credit rating) = 0.048 bits
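These figures can be verified with a few lines of Python; the class counts below are the ones
described above (9 yes / 5 no overall, and the per-age-category yes/no counts):

from math import log2

def info(counts):
    # Expected information (entropy) for a list of class counts.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_attribute(partitions, total):
    # Weighted entropy after splitting: partitions is a list of class-count lists.
    return sum(sum(p) / total * info(p) for p in partitions)

D = [9, 5]                                  # 9 "yes" and 5 "no" tuples
age_partitions = [[2, 3], [4, 0], [3, 2]]   # youth, middle_aged, senior

info_D = info(D)
info_age = info_attribute(age_partitions, sum(D))
print(f"Info(D)      = {info_D:.3f} bits")     # 0.940
print(f"Info_age(D)  = {info_age:.3f} bits")   # 0.694
print(f"Gain(age)    = {info_D - info_age:.3f} bits")
# ~0.247; the text reports 0.246 bits because it subtracts the rounded intermediate values.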

Figure 4.25: Information gain allocation

Gain ratio
Consider an attribute that acts as a unique identifier, such as product ID. A split on
product ID would result in a large number of partitions, each one containing just one tuple.
Because each partition is pure, the information required to classify data set D based on
this partitioning would be Infoproduct ID(D) = 0.
SplitInfoA(D) = - Σ (j=1..v) (|Dj| / |D|) × log2(|Dj| / |D|)
This value represents the potential information generated by splitting the training data
set, D, into v partitions, corresponding to the v outcomes of a test on attribute A.

GainRatio(A) = Gain(A) / SplitInfoA(D)

4.6.4 TREE PRUNING


A constructed decision tree may reflect anomalies in the training data due to noise or
outliers. Tree pruning methods address this problem of overfitting the data by removing the least
reliable branches.

Two types of pruning are there:


 Pre pruning
 Post Pruning
In the prepruning approach, a tree is ―pruned‖ by halting its construction early. Upon
halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset
tuples or the probability distribution of those tuples. If partitioning the tuples at a node would
result in a split that falls below a pre specified threshold, then further partitioning of the given
subset is halted. There are difficulties, however, in choosing an appropriate threshold. High
thresholds could result in oversimplified trees, whereas low thresholds could result in very
little simplification.
In postpruning, it removes subtrees from a ―fully grown‖ tree. A subtree at a given node
is pruned by removing its branches and replacing it with a leaf. The leaf is labeled with the
most frequent class among the subtree being replaced.

Figure4.26 : Tree pruning


Decision tree – Drawbacks
Decision trees can suffer from repetition and replication, as shown in figure 4.27.

Figure 4.27 (a): Repetition and replication

Figure 4.27 (b): Repetition and replication

4.7Bayesian Classification
Bayesian classification is based on Bayes theorem. Bayes‘ theorem describes the
probability of an event, based on prior knowledge of conditions that might be related to the
event.Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. This assumption is called class conditional
independence.

Bayes’ Theorem
Let X be a data tuple, considered as the ―evidence‖. Let H be a hypothesis, such as that tuple X
belongs to class C. For classification problems we want to find P(H|X). P(H|X) and P(X|H) are
posterior probabilities; P(H) and P(X) are prior probabilities.

P(H|X) = P(X|H) P(H) / P(X)

Working of Bayesian Classification


1.Let D be a training set of tuples and their associated class labels. Each tuple is represented
by an n-dimensional attribute vector, X = (x1, x2, ….,xn), depicting n measurements
made on the tuple from n attributes, respectively, A1, A2, … , An.

2. Suppose that there are m classes, C1, C2,….Cm. Given a tuple, X, the classifier will predict
that X belongs to the class having the highest posterior probability, conditioned on X.

That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and
only if

P(Ci | X) > P(Cj | X) for

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the
maximum posteriori hypothesis.

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

3. P(X) is constant for all classes, so only P(X|Ci)P(Ci) need be maximized. If the class prior
probabilities are not known, then it is commonly assumed that the classes are equally likely,
that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci).
Otherwise, we maximize P(X|Ci)P(Ci).

4. In order to reduce computation in evaluating P(X|Ci), the naive assumption of class


conditional independence is made.

P(X|Ci) = Π (k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)

(a) If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk
for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
(b) If Ak is continuous-valued, it is typically assumed to have a Gaussian distribution with mean μ
and standard deviation σ, so that P(xk|Ci) = g(xk, μCi, σCi), where g is the Gaussian (normal)
density function.

5. In order to predict the class label of X, P(X|Ci) P(Ci) is evaluated for each class Ci. The
classifier predicts that the class label of tuple X is the class Ci if and only if

P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i

Bayesian Classification – Working Example


Predicting a class label using naive Bayesian classification
To predict the class label of a tuple using naïve Bayesian classification, given the same training
data in Table 4.28.

Figure 4.28: Example for Bayesian classification

The data tuples are described by the attributes age, income, student, and credit rating. The class
label attribute, buys computer, has two distinct values (namely, {yes, no}). Let C1 correspond to
the class buys computer = yes and C2 correspond to buys computer = no. The tuple we wish to
classify is

X = (age = youth, income = medium, student = yes, credit rating = fair)


We need to maximize P(X|Ci)P(Ci), for i = 1, 2. P(Ci), the prior probability of each class, can
be computed based on the training tuples:

P(buys computer = yes)=9/14=0.643

P(buys computer = no) =5/14=0.357

To compute PX|Ci), for i = 1, 2, we compute the following conditional probabilities:

P(age = youth|buys computer = yes) =2/9=0.222

P(age = youth|buys computer = no) =3/5=0.600

P(income = medium|buys computer = yes) =4/9=0.444

P(income = medium|buys computer = no) =2/5=0.400

P(student = yes|buys computer = yes) =6/9=0.667

P(student = yes|buys computer = no) =1/5=0.200

P(credit rating = fair|buys computer = yes)=6/9=0.667

P(credit rating = fair|buys computer = no) =2/5=0.400

Using the above probabilities, we obtain

P(X|buys computer = yes) = P(age = youth|buys computer = yes) × P(income =


medium|buyscomputer = yes) × P(student = yes|buys computer = yes) × P(credit rating =
fair|buys computer = yes)

=0.222×0.444×0.667×0.667=0.044.

Similarly,

P(X|buys computer = no)=0.600×0.400×0.200×0.400=0.019.

To find the class,Ci, that maximizes P(X|Ci)P(Ci), we compute

P(X|buys computer = yes)P(buys computer = yes)=0.044×0.643=0.028

P(X|buys computer = no)P(buys computer = no)=0.019×0.357=0.007

Therefore, the naïve Bayesian classifier predicts buys computer = yes for tuple X.
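A small Python sketch (names are ours) that reproduces the computation above from the counted
frequencies in the AllElectronics training data:

# Prior probabilities from the class sizes (9 "yes", 5 "no" out of 14 tuples).
prior = {"yes": 9 / 14, "no": 5 / 14}
class_size = {"yes": 9, "no": 5}

# Conditional counts for the attribute values of the tuple being classified:
# X = (age = youth, income = medium, student = yes, credit_rating = fair)
cond_counts = {
    "yes": {"age=youth": 2, "income=medium": 4, "student=yes": 6, "credit=fair": 6},
    "no":  {"age=youth": 3, "income=medium": 2, "student=yes": 1, "credit=fair": 2},
}

scores = {}
for c in ("yes", "no"):
    likelihood = 1.0
    for attr_count in cond_counts[c].values():
        likelihood *= attr_count / class_size[c]   # P(x_k | C_i), independence assumed
    scores[c] = likelihood * prior[c]               # P(X | C_i) * P(C_i)

print(scores)                       # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))  # 'yes' -> predicts buys_computer = yes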
4.8Rule Based Classification
Using IF-THEN Rules for Classification. Learned model is represented as a set of IF-
THEN rules. IF condition THEN conclusion. An example is rule R1 as given below.
R1: IF age = youth AND student = yes THEN buys computer = yes.

The ―IF‖ part (or left-hand side) of a rule is known as the rule antecedent or
precondition. The ―THEN‖ part (or right-hand side) is the rule consequent.

R1: (age = youth) ∧ (student = yes) => (buys computer = yes)

A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a
class-labeled data set, D,

coverage(R) = ncovers / |D|

accuracy(R) = ncorrect / ncovers

where ncovers is the number of tuples covered by R, ncorrect is the number of tuples correctly
classified by R, and |D| is the number of tuples in D.

If a rule is satisfied by X, the rule is said to be triggered. For example, suppose we have

X= (age = youth, income = medium, student = yes, credit rating = fair).

X is classified as buys_computer = yes, since X satisfies R1, which triggers the rule. If R1 is the only
rule satisfied, then the rule fires by returning the class prediction for X. If more than one rule is
triggered, we have a problem: what if the triggered rules each specify a different class? And what if
no rule is satisfied by X? When more than one rule is triggered, we need a conflict resolution
strategy to figure out which rule gets to fire and assign its class prediction to X.

Conflict resolution strategy can be formed by

 Size ordering
 Rule ordering
 Class based ordering
 Rule-based ordering

Rule Extraction from a Decision Tree


R1: IF age = youth AND student = no THEN buys computer = no
R2: IF age = youth AND student = yes THEN buys computer = yes
R3: IF age = middle aged THEN buys computer = yes
R4: IF age = senior AND credit rating = excellent THEN buys computer = yes
R5: IF age = senior AND credit rating = fair THEN buys computer = no

To extract rules from a decision tree, one rule is created for each path from the root to a
leaf node. Each splitting criterion along a given path is logically ANDed to form the rule
antecedent (―IF‖ part). The leaf node holds the class prediction, forming the rule consequent
(―THEN‖ part) shown in figure 4.29

Figure 4.29: Sequential Covering Algorithm

Rules derived in this way are mutually exclusive and exhaustive.

Rules derived from a decision tree may, however, suffer from repetition and replication. ―How can
we prune the rule set?‖ For a given rule antecedent, any condition that does not improve the
estimated accuracy of the rule can be pruned (i.e., removed), thereby generalizing the rule.
Problems arise during rule pruning, however, as the rules will then no longer be mutually exclusive
and exhaustive.

Rule Extraction by Using a Sequential Covering Algorithm


Output: A set of IF-THEN rules.
Method:
(1) Rule_set = { }; // initial set of rules learned is empty
(2) for each class c do
(3) repeat
(4) Rule = Learn_One_Rule(D, Att_vals, c);
(5) remove tuples covered by Rule from D;
(6) until terminating condition;
(7) Rule_set = Rule_set + Rule; // add new rule to rule set
(8) endfor
(9) return Rule_set;

Rule Quality Measures


Rule R1 correctly classifies 38 of the 40 tuples it covers. Rule R2 covers only two
tuples, which it correctly classifies. Their respective accuracies are 95% and 100%. Thus, R2
has greater accuracy than R1

Figure 4.30: Rule quality measures

Entropy, also known as the expected information needed to classify a tuple in data
set D, can be used to measure rule quality. Here, D is the set of tuples covered by the candidate
condition and pi is the probability of class Ci in D. The lower the entropy, the better the condition is.
Entropy prefers conditions that cover a large number of tuples of a single class and few tuples of
other classes.
4.9Classification by Back propagation
Backpropagation is a neural network learning algorithm. A neural network is a set of
connected input/output units in which each connection has a weight associated with it. During
the learning phase, the network learns by adjusting the weights so as to be able to predict the
correct class label of the input tuples. Neural networks have poor interpretability.

4.9.1 A Multilayer Feed-Forward Neural Network

Figure 4.31 : Multilayer feed forward neural network.

Network Topology
Before training can begin, the user must decide on the network topology by specifying
the number of units in the input layer, the number of hidden layers (if more than one), the
number of units in each hidden layer, and the number of units in the output layer.

4.9.2 Back propagation


Backpropagation learns by iteratively processing a data set of training tuples,
comparing the network‘s prediction for each tuple with the actual known target value. The
target value may be the known class label of the training tuple (for classification problems) or a
continuous value (for numeric prediction). For each training tuple, the weights are modified so as
to minimize the mean squared error between the network‘s prediction and the actual target value.
These modifications are made in the ―backwards‖ direction, that is, from the output layer, through
each hidden layer, down to the first hidden layer (hence the name backpropagation).
Although it is not guaranteed, in general the weights will eventually converge, and
the learning process stops.
 Initialize the weights
 Propagate the inputs forward
 Backpropagate the error
 Repeat until a terminating condition is met

Figure 4.32: Back propagation

Given the net input Ij to unit j, the output Oj of unit j is computed as Oj = 1 / (1 + e^(-Ij)). This
function is referred to as a squashing function, because it maps a large input domain onto the
smaller range of 0 to 1.

For a unit j in the output layer, the error is computed as Errj = Oj (1 - Oj)(Tj - Oj), where Tj is the
known target value of the given training tuple.

To compute the error of a hidden layer unit j, the weighted sum of the errors of the units
connected to unit j in the next layer is considered. The error of a hidden layer unit j is

Errj = Oj (1 - Oj) Σk Errk wjk

where wjk is the weight of the connection from unit j to a unit k in the next higher layer,
and Errk is the error of unit k. The weights and biases are updated to reflect the propagated
errors. Weights are updated by the following equations, where Δwij is the change in weight wij
and l is the learning rate:

Δwij = (l) Errj Oi ;  wij = wij + Δwij

Biases are updated similarly: Δθj = (l) Errj ;  θj = θj + Δθj.
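A minimal NumPy sketch of a single forward pass and one backpropagation update for a tiny,
hypothetical 2-3-1 network, using the squashing function and the error/weight-update formulas
above (the initial weights are random; this is only an illustration, not a full training loop):

import numpy as np

def sigmoid(x):
    # The "squashing" function: maps any input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0])                 # one training tuple with two input attributes
target = 1.0                             # its known target value
l = 0.5                                  # learning rate

W1 = rng.normal(scale=0.5, size=(2, 3))  # input  -> hidden weights
b1 = np.zeros(3)                         # hidden biases (theta_j)
W2 = rng.normal(scale=0.5, size=(3, 1))  # hidden -> output weights
b2 = np.zeros(1)                         # output bias

# Propagate the inputs forward.
hidden = sigmoid(x @ W1 + b1)
output = sigmoid(hidden @ W2 + b2)

# Backpropagate the error.
err_out = output * (1 - output) * (target - output)           # Err_j for the output unit
err_hidden = hidden * (1 - hidden) * (W2[:, 0] * err_out[0])   # Err_j for the hidden units

# Update weights and biases: delta_w_ij = l * Err_j * O_i, delta_theta_j = l * Err_j.
W2 += l * np.outer(hidden, err_out)
b2 += l * err_out
W1 += l * np.outer(x, err_hidden)
b1 += l * err_hidden

print(float(output[0]))   # the network's prediction for x before the weight update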
4.10 Support Vector Machines
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a
separating hyper plane. In other words, given labeled training data (supervised learning), the
algorithm outputs an optimal hyperplane which categorizes new examples. Support vector
machines are supervised learning models with associated learning algorithms that analyze data
used for classification and regression analysis. Given a set of training examples, each marked
for belonging to one of two categories, an SVM training algorithm builds a model that assigns
new examples into one category or the other. An SVM model is a representation of the
examples as points in space, mapped so that the examples of the separate categories are
divided by a clear gap that is as wide as possible represented in figure 4.33.

Figure4.33 : Support vector machine

A line is bad if it passes too close to the points because it will be noise sensitive and it
will not generalize correctly. Therefore, our goal should be to find the line passing as far as
possible from all points. Then, the operation of the SVM algorithm is based on finding the
hyperplane that gives the largest minimum distance to the training examples. Twice, this
distance receives the important name of margin within SVM‘s theory. Therefore, the optimal
separating hyperplane maximizes the margin of the training data as mentioned in figure 4.34
Figure 4.34: finding maximum margin

Computing optimal hyper plane


SVM searches for the hyperplane with the largest margin, that is, the maximum
marginal hyperplane (MMH). The associated margin gives the largest separation between
classes. Getting to an informal definition of margin, we can say that the shortest distance from
a hyperplane to one side of its margin is equal to the shortest distance from the hyper plane to
the other side of its margin, where the ―sides‖ of the margin are parallel to the hyperplane.
Separating hyper plane can be given as

W · X + b = 0

where W = {w1, w2, ..., wn} is the weight vector and b is a scalar (bias). For a two-attribute
example, X = (x1, x2), where x1 and x2 are the values of attributes A1 and A2. Writing b as an
additional weight w0, points lying on the separating hyperplane satisfy

w0 + w1x1 + w2x2 = 0

while points falling above and below the hyperplane satisfy, respectively,

w0 + w1x1 + w2x2 > 0

w0 + w1x1 + w2x2 < 0

Adjusting the weights so that the two sides of the margin are defined gives

H1 : w0 + w1x1 + w2x2 ≥ 1 for yi = +1, and

H2 : w0 + w1x1 + w2x2 ≤ -1 for yi = -1
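For illustration, a linear SVM can be fitted with scikit-learn (assumed available; the toy data
below is hypothetical), and the learned weight vector W, bias b and support vectors can be read
off directly:

from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2],      # class -1
     [4, 4], [5, 4], [4, 5]]      # class +1
y = [-1, -1, -1, 1, 1, 1]

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

w = clf.coef_[0]          # weight vector (w1, w2) of the separating hyperplane
b = clf.intercept_[0]     # bias term w0:  w0 + w1*x1 + w2*x2 = 0
print(w, b)
print(clf.support_vectors_)           # the tuples that define the maximum margin
print(clf.predict([[5, 5], [0, 1]]))  # classify two new tuples -> [ 1, -1]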
Figure4.35: Finding Small margin and large margin

4.11 Associative Classification: Classification by Association Rule Analysis


Frequent patterns and their corresponding association or correlation rules is used for
classification. Association rules are commonly used to analyze the purchasing patterns of
customers and used in decision-making processes. The discovery of association rules is based
on frequent itemset mining by exploring highly confident associations among multiple
attributes. There are 3 methods: CBA, CMAR, CPAR
Association rules are mined in a two-step process consisting of frequent itemset mining,
followed by rule generation.

age = youth ^ credit = OK=>buys_computer = yes [support = 20%, confidence = 93%]

Let D be data tuples having n attributes A1,A2….An.

Let P is an attribute value pair of the form (Ai, v) , where Ai is an attribute taking the value v.

A data tuple X = (x1, x2, ..., xn) satisfies an item, p = (Ai, v), if and only if xi = v, where xi is
the value of the ith attribute of X. In general, association rules can have any number of items in the
antecedent and in the consequent; association rules used for classification, however, should be of
the form

(p1 ∧ p2 ∧ ... ∧ pl) => Aclass = C

For a given rule, R, the percentage of tuples in D satisfying the rule antecedent that also have
the class label C is called the confidence of R. Methods of associative classification differ
primarily in the approach used for frequent itemset mining and in how the derived rules are
analyzed and used for classification.
Three methods that are in associative classification are:
 CBA - Classification-Based Association.
 CMAR – Classification based on multiple association rule
 CPAR

CBA
CBA uses an iterative approach similar to Apriori. Multiple passes are made over the data,
and the number of passes made is equal to the length of the longest rule. The complete set of rules
satisfying the minimum confidence and minimum support thresholds is found and then analyzed for
inclusion in the classifier. CBA constructs the classifier as a decision list, where the rules are
organized according to decreasing precedence based on their confidence and support.

CMAR

CMAR employs another tree structure to store and retrieve rules efficiently and to
prune rules based on confidence, correlation and database coverage. Rule pruning strategies
are triggered whenever a rule is inserted into the tree. For example, given two rules, R1 and
R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is
pruned. The rationale is that highly specialized rules with low confidence can be pruned if a more
generalized version with higher confidence exists. CMAR also prunes rules for which
the rule antecedent and class are not positively correlated, based on a χ2 test of statistical
significance.

CPAR

CPAR employs a different multiple rule strategy than CMAR. If more than one rule
satisfies a new tuple, X, the rules are divided into groups according to class, similar to CMAR.
However, CPAR uses the best k rules of each group to predict the class label of X, based on
expected accuracy. By considering the best k rules rather than all of the rules of a group, it
avoids the influence of lower ranked rules. The accuracy of CPAR on numerous data sets was
shown to be close to that of CMAR. However, since CPAR generates far fewer rules than
CMAR, it shows much better efficiency with large sets of training data.
4.12 Lazy Learners
Eager learners construct a classification model before receiving new tuples to
classify. Lazy learners simply store the training tuples and wait until a test tuple is given.
Unlike eager learning methods, lazy learners do less work when a training tuple is presented
and more work when making a classification or prediction, so they can be computationally
expensive at that stage. Examples of lazy learners are k-nearest-neighbor classifiers and
case-based reasoning classifiers.

Lazy Learners - k-nearest neighbor


Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a
given test tuple with training tuples that are similar to it. Each tuple represents a point in an n-
dimensional space and all of the training tuples are stored in an n-dimensional pattern space.
When given an unknown tuple, a k-nearest-neighbor searches the pattern space for the k
training tuples that are closest to the unknown tuple. These k training tuples are the k ―nearest
neighbors‖ of the unknown tuple. ―Closeness‖ is defined in terms of a distance metric, such as
Euclidean distance.

dist(X1, X2) = sqrt( Σ (i=1..n) (x1i - x2i)^2 )

The Euclidean distance between two points or tuples, say,

X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is given by the formula above.

Min-max normalization, for example, can be used to transform a value v of a numeric attribute
A to v` in the range [0, 1] by computing

𝑣−𝑚𝑖𝑛 𝐴
V‘ = 𝑚𝑎𝑥
𝐴 − 𝑚𝑖𝑛 𝐴
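For illustration (not from the text), a small Python sketch of min-max normalization followed by
k-nearest-neighbor classification with Euclidean distance; the data values and class labels below
are hypothetical:

```python
import numpy as np
from collections import Counter

def min_max_normalize(X):
    # Scale each attribute (column) of X into the range [0, 1].
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query tuple to every training tuple.
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]            # indices of the k closest tuples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]          # majority class among the k neighbors

# Hypothetical training tuples: two numeric attributes, two class labels.
X = np.array([[1.0, 20.0], [2.0, 22.0], [8.0, 90.0], [9.0, 95.0]])
y = np.array(["low", "low", "high", "high"])

Xn = min_max_normalize(X)
# The query tuple must be normalized with the same min/max values as the training data.
query = (np.array([7.5, 88.0]) - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(knn_predict(Xn, y, query, k=3))          # expected: "high"
```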

4.13 Prediction
Numeric prediction is the task of predicting continuous (or ordered) values for a given
input. In such cases it is preferable to predict a continuous value rather than a categorical label.
Regression analysis can be used to model the relationship between one or more independent
(predictor) variables and a dependent (response) variable.

4.13.1 Linear Regression
Straight-line regression analysis involves a response variable, y, and a single predictor
variable, x. It is the simplest form of regression, and models y as a linear function of x:
y = b + wx, where b and w are regression coefficients.
The above expression can be rewritten using weights as
y = w0 + w1x

The training set contains |D| data points of the form (x1, y1), (x2, y2), ..., (x|D|, y|D|).

Example

Straight-line regression using the method of least squares: Figure 4.36 shows a set of paired
data where x is the number of years of work experience of a college graduate and y is the
corresponding salary.
Figure 4.36 Training Dataset

The training set contains |D| data points of the form (x1, y1), (x2, y2), ..., (x|D|, y|D|). The
regression coefficients are estimated by the method of least squares as

w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}

where \bar{x} and \bar{y} are the mean values of x and y. With \bar{x} = 9.1 and \bar{y} = 55.4 for the data of Figure 4.36,

w_1 = \frac{(3 - 9.1)(30 - 55.4) + (8 - 9.1)(57 - 55.4) + \dots + (16 - 9.1)(83 - 55.4)}{(3 - 9.1)^2 + (8 - 9.1)^2 + \dots + (16 - 9.1)^2} = 3.5

w_0 = 55.4 - (3.5)(9.1) = 23.6

Thus, the equation of the least squares line is y = 23.6 + 3.5x.
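A short Python sketch of the same least-squares computation; the (experience, salary) pairs below
are hypothetical stand-ins, since the full table of Figure 4.36 is not reproduced here:

```python
import numpy as np

def least_squares_line(x, y):
    # w1 = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2);  w0 = y_mean - w1 * x_mean
    x_mean, y_mean = x.mean(), y.mean()
    w1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    w0 = y_mean - w1 * x_mean
    return w0, w1

# Hypothetical (years of experience, salary in $1000s) pairs.
x = np.array([3.0, 8.0, 9.0, 13.0, 16.0])
y = np.array([30.0, 57.0, 64.0, 72.0, 83.0])

w0, w1 = least_squares_line(x, y)
print(f"y = {w0:.1f} + {w1:.1f} x")   # predicted salary as a linear function of experience
```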

Multiple Linear Regression

Multiple linear regression is an extension of straight-line regression that involves more
than one predictor variable. It allows the response variable y to be modeled as a linear function
of, say, n predictor variables or attributes, A1, A2, ..., An, describing a tuple, X. The training set
contains data points of the form (X1, y1), (X2, y2), ..., (X|D|, y|D|). An example of a multiple
linear regression model based on two predictor attributes or variables, A1 and A2, is

y = w0 + w1x1 + w2x2

where x1 and x2 are the values of attributes A1 and A2, respectively, in X.

4.13.2 Nonlinear Regression

We can model data that does not show a linear dependence by nonlinear regression. For
example, what if a given response variable and predictor variable have a relationship that may
be modeled by a polynomial function? Polynomial regression is often of interest when there is
just one predictor variable. It can be modeled by adding polynomial terms to the basic linear
model.

By applying transformations to the variables, we can convert the nonlinear model into a
linear one that can then be solved by the method of least squares. For example, the cubic
polynomial model

y = w_0 + w_1 x + w_2 x^2 + w_3 x^3

can be converted to linear form by defining the new variables

x_1 = x, \quad x_2 = x^2, \quad x_3 = x^3

so that the model becomes

y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3

which can then be solved as a multiple linear regression problem.
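A minimal sketch of this transformation approach using NumPy's least-squares solver; the sample
data are hypothetical:

```python
import numpy as np

# Hypothetical data following a roughly cubic trend.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.1, 9.2, 28.5, 66.0, 127.3])

# Transform x into the new variables x1 = x, x2 = x^2, x3 = x^3,
# plus a column of ones for the intercept w0.
X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])

# Solve the (now linear) model y = w0 + w1*x1 + w2*x2 + w3*x3 by least squares.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("coefficients w0..w3:", np.round(w, 2))
```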

UNIT-IV QUESTION BANK


Part A

1. What is meant by market Basket analysis?


2. What is the use of multilevel association rules?
3. What is meant by pruning in a decision tree induction?
4. Write the two measures of Association Rule.
5. With an example explain correlation analysis.
6. Define conditional pattern base.
7. List out the major strength of decision tree method.
8. In classification trees, what are the surrogate splits, and how are they used?
9. The Naïve Bayes classifier makes what assumptions that motivate its name?
10. What is the frequent item set property?
11. List out the major strength of the decision tree Induction.
12. Write the two measures of association rule.
13. How are association rules mined from large databases?
14. What is tree pruning in decision tree induction?
15. What is the use of multi-level association rules?
16. What are the Apriori properties used in the Apriori algorithms?
17. How is predication different from classification?
18. What is a support vector machine?
19. What are the means to improve the performance of association rule mining algorithm?
20. State the advantages of the decision tree approach over other approaches for performing
classification.
21. What is rule based classification? Give an example.
22. State the Apriori property
23. Find the support and confidence for the rule X: Bread ⇒ Jam, for the following transactions.

Tid Itemset
1 Bread, Milk, Jam
2 Bread, Jam
3 Bread
4 Bread, Jam
5 Bread, Milk
6 Bread, Milk, Jam
7 Bread, Jam

24. Define Bayes theorem.


25. How does backpropagation work?
26. When do you apply the Laplacian correction?
27. Mention the variations that improves the efficiency of Apriori algorithm.
28. What is market basket analysis?
29. Define support and confidence value in association rule mining.
30. Compare classification and prediction.
31. “Bayesian classification is called Naïve Bayesian classification” – Why?
32. Mention the ways by which efficiency of Apriori algorithm be improved.
33. How the rule coverage and rule accuracy value of a rule can be calculated?
34. Differentiate the types of tree pruning methods.
35. What is decision tree induction?
36. How to find the best hyper plane in support vector machine?

PART-B
1. Decision tree induction is a popular classification method. Taking one typical decision tree
induction algorithm , briefly outline the method of decision tree classification.
2. Consider the following training dataset and the original decision tree induction algorithm (ID3).
Risk is the class label attribute. The Height values have already been discretized into disjoint
ranges. Calculate the information gain if Gender is chosen as the test attribute. Calculate the
information gain if Height is chosen as the test attribute. Draw the final decision tree (without
any pruning) for the training dataset. Generate all the IF-THEN rules from the decision tree.
Gender Height Risk
F (1.5, 1.6) Low
M (1.9, 2.0) High
F (1.8, 1.9) Medium
F (1.8, 1.9) Medium
F (1.6, 1.7) Low
M (1.8, 1.9) Medium
F (1.5, 1.6) Low
M (1.6, 1.7) Low
M (2.0, 8) High
M (2.0, 8) High
F (1.7, 1.8) Medium
M (1.9, 2.0) Medium
F (1.8, 1.9) Medium
F (1.7, 1.8) Medium
F (1.7, 1.8) Medium

3. Given the following transactional database


TID Items
1 C, B, H
2 B, F, S
3 A, F, G
4 C, B, H
5 B, F, G
6 B, E, O
(i) We want to mine all the frequent itemsets in the data using the Apriori algorithm.
Assume the minimum support level is 30%. (You need to give the set of frequent item sets
in L1, L2,… candidate item sets in C1, C2,…)

(ii) Find all the association rules that involve only B, C, H (in either the left or right hand side
of the rule). The minimum confidence is 70%.
3. Describe the multi-dimensional association rule, giving a suitable example.
4. Explain the algorithm for constructing a decision tree from training samples
5. Explain Bayes theorem.
6. Develop an algorithm for classification using Bayesian classification. Illustrate
the algorithm with a relevant example.
7. Discuss the approaches for mining multi-level association rules from the
transactional databases. Give relevant example.
8. Write and explain the algorithm for mining frequent item sets without candidate
generation. Give relevant example.
9. How attribute is oriented induction implemented? Explain in detail.
10. Discuss in detail about Bayesian classification.
11. Write and explain the algorithm for mining frequent item sets without
candidate generation with an example.
12. A database given below has nine transactions. Let min_sup = 30 %
TID List of Items IDs
1 a, b, e
2 b,d
3 b, c
4 a, b, d
5 a, c
6 b, c
7 a, c
8 a, b, c ,e
9 a, b, c
Apply apriori algorithm to find all frequent item sets.
13. With an example explain various attribute selection measures in classification.
14. Discuss the steps involved in the working of following classifiers.
(i) Bayesian classifier.
(ii) Back Propagation algorithm.
15. Apply the Apriori algorithm for discovering frequent item sets to the following dataset.
Use 0.3 for the minimum support value. Illustrate each step of Apriori algorithm.

16. Construct a decision tree classifier by applying ID3 algorithm to the following dataset.

Day Outlook Temperature Humidity Wind Play ball

D1 Sunny Hot High Weak No

D2 Sunny Hot High Strong No

D3 Overcast Hot High Weak Yes

D4 Rain Mild High Weak Yes

D5 Rain Cool Normal Weak Yes

D6 Rain Cool Normal Strong No


D7 Overcast Cool Normal Strong Yes

D8 Sunny Mild High Weak No

D9 Sunny Cool Normal Weak Yes

D10 Rain Mild Normal Weak Yes

D11 Sunny Mild Normal Strong Yes

D12 Overcast Mild High Strong Yes

D13 Overcast Hot Normal Weak Yes

D14 Rain Mild High Strong No

17. Predict a class label for X using naïve Bayesian classification algorithm.
X = { Color = Red, Type = SUV, Origin = Domestic }
Use the following training data set.
UNIT V

5. CLUSTER ANALYSIS

5.1 Introduction
A cluster is a collection of data objects. Data objects in a cluster are similar to
one another within the same cluster and dissimilar to the objects in other clusters. Cluster
analysis is the process of finding similarities between data according to the characteristics found
in the data and grouping similar data objects into clusters.
A cluster of data objects can be treated collectively as one group and so may be considered
as a form of data compression. Although classification is an effective means for distinguishing
groups or classes of objects, it requires the often costly collection and labeling of a large set of
training tuples or patterns, which the classifier uses to model each group. It is often more
desirable to proceed in the reverse direction: First partition the set of data into groups based on
data similarity (e.g., using clustering), and then assign labels to the relatively small number of
groups. Additional advantages of such a clustering-based process are that it is adaptable to
changes and helps single out useful features that distinguish different groups. By automated
clustering, we can identify dense and sparse regions in object space and, therefore, discover
overall distribution patterns and interesting correlations among data attributes. Clustering can
also be used for outlier detection, where outliers (values that are ―far away‖ from any cluster)
may be more interesting than common cases.
Applications of outlier detection include the detection of credit card fraud and the
monitoring of criminal activities in electronic commerce. For example, exceptional cases in
credit card transactions, such as very expensive and frequent purchases, may be of interest as
possible fraudulent activity.
5.1.1 Applications of Clustering
The process of clustering has various applications as listed below:
 Pattern Recognition
 Spatial Data Analysis: creating thematic maps in GIS by clustering feature spaces, and
detecting spatial clusters for other spatial mining tasks
 Image Processing
 Economic Science (especially market research)
 WWW: document classification, and clustering Weblog data to discover groups of
similar access patterns
5.1.2 Examples of Clustering Applications
Few examples of clustering are listed below:
 Marketing: Help marketers discover distinct groups in their customer bases, and then
use this knowledge to develop targeted marketing programs
 Land use: Identification of areas of similar land use in an earth observation database
 Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost
 City-planning: Identifying groups of houses according to their house type, value, and
geographical location
 Earth-quake studies: Observed earth quake epicenters should be clustered along
continent faults.

5.1.3 Good Clustering


Clustering is a process of data segmentation or outlier detection. A good clustering method will
produce high quality clusters of two different types.
High intra-class similarity: Data points in one cluster are more similar to one another.
Low inter-class similarity: Data points in separate clusters are less similar to one another.
The quality of a clustering result depends on both the similarity measure used by the method
and its implementation. The quality of a clustering method is also measured by its ability to
discover some or all of the hidden patterns.

5.1.4 Requirements of clustering in data mining


The following are typical requirements of clustering in data mining
Scalability: Many clustering algorithms work well on small data sets containing fewer than
several hundred data objects; however, a large database may contain millions of objects.
Clustering on a sample of a given large data set may lead to biased results. Highly scalable
clustering algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are designed to cluster
interval-based (numerical) data. However, applications may require clustering other types of
data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters
based on Euclidean or Manhattan distance measures. Algorithms based on such distance
measures tend to find spherical clusters with similar size and density. However, a cluster could
be of any shape. It is important to develop algorithms that can detect clusters of arbitrary
shape.
Minimal requirements for domain knowledge to determine input parameters: Many
clustering algorithms require users to input certain parameters in cluster analysis (such as the
number of desired clusters). The clustering results can be quite sensitive to input parameters.
Parameters are often difficult to determine, especially for data sets containing high-
dimensional objects. This not only burdens users, but it also makes the quality of clustering
difficult to control.
Ability to deal with noisy data: Most real-world databases contain outliers or missing,
unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may
lead to clusters of poor quality.
Incremental clustering and insensitivity to the order of input records: Some clustering
algorithms cannot incorporate newly inserted data (i.e., database updates) into existing
clustering structures and, instead, must determine a new clustering from scratch. Some
clustering algorithms are sensitive to the order of input data. That is, given a set of data objects,
such an algorithm may return dramatically different clusterings depending on the order of
presentation of the input objects. It is important to develop incremental clustering algorithms
and algorithms that are insensitive to the order of input.
High dimensionality: A database or a data warehouse can contain several dimensions or
attributes. Many clustering algorithms are good at handling low-dimensional data, involving
only two to three dimensions. Human eyes are good at judging the quality of clustering for up
to three dimensions. Finding clusters of data objects in high dimensional space is challenging,
especially considering that such data can be sparse and highly skewed.
Constraint-based clustering: Real-world applications may need to perform clustering under
various kinds of constraints. Suppose that your job is to choose the locations for a given
number of new automatic banking machines (ATMs) in a city. To decide upon this, you may
cluster households while considering constraints such as the city‘s rivers and highway
networks, and the type and number of customers per cluster. A challenging task is to find
groups of data with good clustering behavior that satisfy specified constraints.
Interpretability and usability: Users expect clustering results to be interpretable,
comprehensible, and usable. That is, clustering may need to be tied to specific semantic
interpretations and applications. It is important to study how an application goal may influence
the selection of clustering features and methods.

5.1.5 Measure the Quality of Clustering


Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function,
typically a metric d(i, j). The definitions of distance functions are usually very different for
interval-scaled, boolean, categorical, ordinal, ratio, and vector variables. Weights should be
associated with different variables based on applications and data semantics. It is hard to define
―similar enough‖ or ―good enough‖; the answer is typically highly subjective.

5.2 TYPES OF DATA IN CLUSTERS - DATA STRUCTURES

There are different types of data that often occur in cluster analysis; these data need to be
preprocessed before cluster analysis.

5.2.1 Types of Data Structures

Clustering algorithms typically operate on either of the following two data structures.

Data matrix
This represents n objects, such as persons, with p variables (also called measurements or
attributes), such as age, height, weight, gender, and so on. The structure is in the form of a
relational table, or n-by-p matrix (n objects × p variables).
Dissimilarity matrix
This stores a collection of proximities that are available for all pairs of n objects. It is often
represented by an n-by-n table:
\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}
where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i,
j) is a nonnegative number that is close to 0 when objects i and j are highly similar or ―near‖
each other, and becomes larger the more they differ.

5.2.2 Type of data in clustering analysis


Different types of data used for cluster analysis are as follows:
 Interval-scaled variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types

5.2.2.1 Interval-valued variables


Interval variable is a measurement where the difference between two values is
meaningful.
Eg:Temperature, Height & weight, Latitude & Longitude. It calculates the distance measures
and it is used for computing the dissimilarity of objects described by such variables. These
measures include the Euclidean, Manhattan, and Minkowski distances.
The measurement unit used can affect the clustering analysis. For example, changing
measurement units from meters to inches for height, or from kilograms to pounds for weight,
may lead to a very different clustering structure. In general, expressing a variable in smaller
units will lead to a larger range for that variable, and thus a larger effect on the resulting
clustering structure. To help avoid dependence on the choice of measurement units, the data
should be standardized. Standardizing measurements attempts to give all variables an equal
weight.
How can the data for a variable be standardized?” To standardize measurements, one choice
is to convert the original measurements to unitless variables. Given measurements for a
variable f , this can be performed as follows.
Calculate the mean absolute deviation:

s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right), \quad \text{where } m_f = \frac{1}{n}(x_{1f} + x_{2f} + \dots + x_{nf})

Calculate the standardized measurement, or z-score:

z_{if} = \frac{x_{if} - m_f}{s_f}

z-score is the number of standard deviations from the mean a data point is. It‘s a
measure of how many standard deviations below or above the population mean a raw score is.

The mean absolute deviation, Sf , is more robust to outliers than the standard deviation,
σf. When computing the mean absolute deviation, the deviations from the mean
are not squared; hence, the effect of outliers is somewhat reduced. There are more robust
measures of dispersion, such as the median absolute deviation.
After standardization, or without standardization in certain applications, the
dissimilarity (or similarity) between the objects described by interval-scaled variables is
typically computed based on the distance between each pair of objects. The most popular
distance measure is Euclidean distance, which is defined as

d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}

Another well-known metric is Manhattan (or city block) distance, defined as

d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|
Both the Euclidean distance and Manhattan distance satisfy the following mathematic
requirements of a distance function:
1. d(i, j) ≥ 0: Distance is a nonnegative number.
2. d(i, i) = 0: The distance of an object to itself is 0.
3. d(i, j) = d( j, i): Distance is a symmetric function.
4. d(i, j) ≤ d(i, h)+d(h, j): Going directly from object i to object j in space is no more than
making a detour over any other object h (triangular inequality).
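For illustration, a small Python sketch of standardization with the mean absolute deviation followed
by Euclidean and Manhattan distances; the object values below are hypothetical:

```python
import numpy as np

def z_score_standardize(X):
    # Standardize each variable using the mean absolute deviation s_f
    # rather than the standard deviation, as described above.
    m = X.mean(axis=0)
    s = np.abs(X - m).mean(axis=0)
    return (X - m) / s

def euclidean(i, j):
    return np.sqrt(((i - j) ** 2).sum())

def manhattan(i, j):
    return np.abs(i - j).sum()

# Three hypothetical objects described by three interval-scaled variables.
X = np.array([[1.0, 170.0, 60.0],
              [3.0, 180.0, 80.0],
              [2.0, 165.0, 55.0]])

Z = z_score_standardize(X)
print("Euclidean d(1,2):", round(euclidean(Z[0], Z[1]), 3))
print("Manhattan d(1,2):", round(manhattan(Z[0], Z[1]), 3))
```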

5.2.2.2 Binary variables


A binary variable has only two states: 0 or 1, where 0 means that the variable is absent,
and 1 means that it is present. Given the variable smoker describing a patient, for instance, 1
indicates that the patient smokes, while 0 indicates that the patient does not.
Computing dissimilarity between two binary variables: One approach involves
computing a dissimilarity matrix from the given binary data. If all binary variables are thought
of as having the same weight, we have the 2-by-2 contingency table of Table 5.1, where q is
the number of variables that equal 1 for both objects i and j, r is the number of variables that
equal 1 for object i but that are 0 for object j, s is the number of variables that equal 0 for object
i but equal 1 for object j, and t is the number of variables that equal 0 for both objects i and j.
The total number of variables is p, where p = q+r +s +t.
Table 5.1 Contingency table for binary variables

               object j
                 1       0      sum
object i   1     q       r      q + r
           0     s       t      s + t
         sum   q + s   r + t      p

Symmetric and Asymmetric Binary variable:


A binary variable is symmetric if both of its states are equally valuable and carry the same
weight; that is, there is no preference on which outcome should be coded as 0 or 1. One such
example could be the attribute gender having the states male and female. Dissimilarity that is
based on symmetric binary variables is called symmetric binary dissimilarity. Its dissimilarity
(or distance) measure, defined in the following Equation can be used to assess the dissimilarity
between objects i and j.

A binary variable is asymmetric if the outcomes of the states are not equally important, such as
the positive and negative outcomes of a disease test. Given two asymmetric binary variables,
the agreement of two 1s (a positive match) is then considered more significant than that of two
0s (a negative match). Therefore, such binary variables are often considered ―monary‖ (as if
having one state). The dissimilarity based on such variables is called asymmetric binary
dissimilarity, where the number of negative matches, t, is considered unimportant and thus is
ignored in the computation:

d(i, j) = \frac{r + s}{q + r + s}
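A brief Python sketch that computes q, r, s, t and both dissimilarity measures for two hypothetical
binary vectors (the patient records and attribute order are invented for illustration):

```python
import numpy as np

def binary_dissimilarity(i, j, symmetric=True):
    # q: both 1, r: i=1 and j=0, s: i=0 and j=1, t: both 0
    q = int(((i == 1) & (j == 1)).sum())
    r = int(((i == 1) & (j == 0)).sum())
    s = int(((i == 0) & (j == 1)).sum())
    t = int(((i == 0) & (j == 0)).sum())
    if symmetric:
        return (r + s) / (q + r + s + t)
    return (r + s) / (q + r + s)       # negative matches t are ignored

# Hypothetical records: 1 = attribute present / test positive, 0 = absent.
patient_a = np.array([1, 0, 1, 0, 0, 0])
patient_b = np.array([1, 0, 1, 0, 1, 0])

print("symmetric :", round(binary_dissimilarity(patient_a, patient_b, True), 3))
print("asymmetric:", round(binary_dissimilarity(patient_a, patient_b, False), 3))
```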

5.2.2.3 Categorical Variables


A categorical variable is a generalization of the binary variable in that it can take on
more than two states. For example, map color is a categorical variable that may have, say, five
states: red, yellow, green, pink, and blue. Let the number of states of a categorical variable be
M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, . , M. Notice
that such integers are used just for data handling and do not represent any specific ordering.
The dissimilarity between two objects described by categorical variables i and j can be
computed based on the ratio of mismatches:
d(i, j) = \frac{p - m}{p}
where m is the number of matches (i.e., the number of variables for which i and j are in the
same state), and p is the total number of variables. Weights can be assigned to increase the
effect of m or to assign greater weight to the matches in variables having a larger number of
states.

5.2.2.4 Ordinal Variables


A discrete ordinal variable resembles a categorical variable, except that the M states
of the ordinal value are ordered in a meaningful sequence. Ordinal variables are very useful for
registering subjective assessments of qualities that cannot be measured objectively. For
example, professional ranks are often enumerated in a sequential order, such as assistant,
associate, and full for professors. A continuous ordinal variable looks like a set of continuous
data of an unknown scale; that is, the relative ordering of the values is essential but their actual
magnitude is not. For example, the relative ranking in a particular sport (e.g., gold, silver,
bronze) is often more essential than the actual values of a particular measure.
The values of an ordinal variable can be mapped to ranks. For example, suppose that an ordinal
variable f has Mf states. These ordered states define the ranking 1, . . . , Mf .

5.2.2.5 Ratio-Scaled Variables


A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as
an exponential scale, approximately following the formula

Ae^{Bt} \quad \text{or} \quad Ae^{-Bt}
where A and B are positive constants, and t typically represents time. Common examples
include the growth of a bacteria population or the decay of a radioactive element.
There are three methods to handle ratio scaled variables:
 Treat ratio-scaled variables like interval-scaled variables.
 Apply logarithmic transformation to a ratio-scaled variable f having value xif for object
i by using the formula yif = log(xif ). yif can be treated as interval valued variable.
 Treat xif as continuous ordinal data and treat their ranks as interval-valued.

5.2.2.6 Variables of Mixed Types


In many real databases, objects are described by mixture of different types of
variable. One approach to compute the dissimilarity of mixed type variables is to group each
kind of variable together, performing a separate cluster analysis for each variable type. This is
feasible if these analyses derive compatible results. However, in real applications, it is unlikely
that a separate cluster analysis per variable type will generate compatible results.
A more preferable approach is to process all variable types together, performing a
single cluster analysis. One such technique combines the different variables into a single
dissimilarity matrix, bringing all of the meaningful variables onto a common scale of the
interval [0.0,1.0].

5.2.2.7 Vector Objects


In some applications, such as information retrieval, text document clustering, and
biological taxonomy, we need to compare and cluster complex objects (such as documents)
containing a large number of symbolic entities (such as keywords and phrases). To measure the
distance between complex objects, it is often desirable to abandon traditional metric distance
computation and introduce a nonmetric similarity function. There are several ways to define
such a similarity function, s(x, y), to compare two vectors x and y. One popular way is to
define the similarity function as a cosine measure as follows:
s(x, y) = \frac{x^{t} \cdot y}{\|x\|\,\|y\|}
where xt is a transposition of vector x, ||x|| is the Euclidean norm of vector x,1 ||y|| is the
Euclidean norm of vector y, and s is essentially the cosine of the angle between vectors x and
y. This value is invariant to rotation and dilation, but it is not invariant to translation and
general linear transformation.
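A short Python sketch of the cosine measure for two hypothetical term-frequency vectors:

```python
import numpy as np

def cosine_similarity(x, y):
    # s(x, y) = (x . y) / (||x|| * ||y||), the cosine of the angle between x and y.
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical keyword-count vectors for two documents.
doc1 = np.array([5, 0, 3, 0, 2])
doc2 = np.array([3, 0, 2, 0, 1])
print(round(cosine_similarity(doc1, doc2), 3))   # close to 1.0 means very similar documents
```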

5.3 MAJOR CLUSTERING APPROACHES


Many clustering algorithms exist in the literature. It is difficult to provide a crisp
categorization of clustering methods because these categories may overlap, so that a method
may have features from several categories. In general, the major clustering methods can be
classified into the following categories.
 Partitioning methods
 Hierarchical methods
 Density-based methods
 Grid-based methods
 Model-based methods
 Clustering high-dimensional data
 Constraint-based clustering

5.3.1 Partitioning methods


Given a database of n objects or data tuples, a partitioning method constructs k
partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies
the data into k groups, which together satisfy the following requirements: (1) each group must
contain at least one object, and (2) each object must belong to exactly one group. Notice that
the second requirement can be relaxed in some fuzzy partitioning techniques.
Given k, the number of partitions to construct, a partitioning method creates an
initial partitioning. It then uses an iterative relocation technique that attempts to improve the
partitioning by moving objects from one group to another. To achieve global optimality in
partitioning-based clustering would require the exhaustive enumeration of all of the possible
partitions. Instead, most applications adopt one of a few popular heuristic methods, such as
(1) the k-means algorithm, where each cluster is represented by the mean value of the objects
in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the
objects located near the center of the cluster.

5.3.2 Hierarchical methods


A hierarchical method creates a hierarchical decomposition of the given set of data
objects. A hierarchical method can be classified as being either agglomerative or divisive,
based on how the hierarchical decomposition is formed. The agglomerative approach, also
called the bottom-up approach, starts with each object forming a separate group. It successively
merges the objects or groups that are close to one another, until all of the groups are merged
into one (the topmost level of the hierarchy), or until a termination condition holds. The
divisive approach, also called the top-down approach, starts with all of the objects in the same
cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually
each object is in one cluster, or until a termination condition holds.

5.3.3 Density-based methods


Most partitioning methods cluster objects based on the distance between objects.
Such methods can find only spherical-shaped clusters and encounter difficulty at discovering
clusters of arbitrary shapes. Other clustering methods have been developed based on the notion
of density. Their general idea is to continue growing the given cluster as long as the density
(number of objects or data points) in the ―neighborhood‖ exceeds some threshold; that is, for
each data point within a given cluster, the neighborhood of a given radius has to contain at
least a minimum number of points. Such a method can be used to filter out noise (outliers) and
discover clusters of arbitrary shape.
5.3.4 Grid-based methods
Grid-based methods quantize the object space into a finite number of cells that form
a grid structure. All of the clustering operations are performed on the grid structure (i.e., on the
quantized space). The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only on the number of cells
in each dimension in the quantized space.

5.3.5 Model-based methods


Model-based methods hypothesize a model for each of the clusters and find the best
fit of the data to the given model. A model-based algorithm may locate clusters by constructing
a density function that reflects the spatial distribution of the data points. It also leads to a way
of automatically determining the number of clusters based on standard statistics, taking ―noise‖
or outliers into account and thus yielding robust clustering methods.

5.3.6 Clustering high-dimensional data


It is a particularly important task in cluster analysis because many applications
require the analysis of objects containing a large number of features or dimensions. For
example, text documents may contain thousands of terms or keywords as features, and DNA
microarray data may provide information on the expression levels of thousands of genes under
hundreds of conditions. Clustering high-dimensional data is challenging due to the curse of
dimensionality. Many dimensions may not be relevant. As the number of dimensions increases,
the data become increasingly sparse so that the distance measurement between pairs of points
become meaningless and the average density of points anywhere in the data is likely to be low.
Therefore, a different clustering methodology needs to be developed for high-dimensional data.

5.3.7 Constraint-based clustering


It is a clustering approach that performs clustering by incorporation of user-
specified or application-oriented constraints. A constraint expresses a user‘s expectation or
describes ―properties‖ of the desired clustering results, and provides an effective means for
communicating with the clustering process. Various kinds of constraints can be specified,
either by a user or as per application requirements.

5.4 Partitioning Methods


Given D, a data set of n objects, and k, the number of clusters to form, a partitioning
algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a
cluster. The clusters are formed to optimize an objective partitioning criterion, such as a
dissimilarity function based on distance, so that the objects within a cluster are ―similar,‖
whereas the objects of different clusters are ―dissimilar‖ in terms of the data set attributes.
Two commonly used partitioning methods are
1. K – Means
2. K – Medoids

5.4.1 Centroid-Based Technique: The k-Means Method


The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k
clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can
be viewed as the cluster‘s centroid or center of gravity.

Working of K-Means Algorithm:


The working of k-means algorithm proceeds as follows.
First, it randomly selects k of the objects, each of which initially represents a cluster mean or
center. For each of the remaining objects, an object is assigned to the cluster to which it is the
most similar, based on the distance between the object and the cluster mean. It then computes
the new mean for each cluster. This process iterates until the criterion function converges.
Typically, the square-error criterion is used, defined as
E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2
where E is the sum of the square error for all objects in the data set; p is the point in space
representing a given object; and mi is the mean of cluster Ci (both p and mi are
multidimensional). In other words, for each object in each cluster, the distance from the object
to its cluster center is squared, and the distances are summed. This criterion tries to make the
resulting k clusters as compact and as separate as possible. The k-means procedure is
summarized as follows.

Algorithm:
k-means: The k-means algorithm for partitioning, where each cluster‘s center is represented
by the mean value of the objects in the cluster.
Input:
 k: the number of clusters,
 D: a data set containing n objects.
Output: A set of k clusters.
Method:
1) arbitrarily choose k objects from D as the initial cluster centers;
2) repeat
3) (re)assign each object to the cluster to which the object is the most similar, based on the
mean value of the objects in the cluster;
4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;
5) until no change;
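As an illustration (not part of the text), a minimal NumPy sketch of this procedure; the sample
points are hypothetical and the random seed is arbitrary:

```python
import numpy as np

def k_means(D, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily choose k objects from D as the initial cluster centers.
    centers = D[rng.choice(len(D), size=k, replace=False)]
    while True:
        # Step 3: (re)assign each object to the cluster with the nearest mean.
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: update the cluster means (keep the old center if a cluster is empty).
        new_centers = np.array([D[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        # Step 5: stop when the means no longer change.
        if np.allclose(new_centers, centers):
            return labels, centers
        centers = new_centers

# Hypothetical 2-D objects forming three loose groups.
D = np.array([[1, 1], [1.5, 2], [1, 1.5],
              [8, 8], [8.5, 8], [9, 9],
              [4, 15], [5, 14], [4.5, 15.5]], dtype=float)
labels, centers = k_means(D, k=3)
print(labels)
print(np.round(centers, 2))
```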

Example for K-Means Algorithm


Clustering by k-means partitioning. Suppose that there is a set of objects located in
space as depicted in the rectangle shown in Figure 5.1 (a). Let k = 3; that is, the user would like
the objects to be partitioned into three clusters. According to the working of K means
algorithm, we arbitrarily choose three objects as the three initial cluster centers, where cluster
centers are marked by a ―+‖. Each object is distributed to a cluster based on the cluster center
to which it is the nearest. Such a distribution forms silhouettes encircled by dotted curves, as
shown in Figure 5.1 (a). Next, the cluster centers are updated. That is, the mean value of each
cluster is recalculated based on the current objects in the cluster. Using the new cluster
centers, the objects are redistributed to the clusters based on which cluster center is the nearest.
Such redistribution forms new silhouettes encircled by dashed curves, as shown in Figure 5.1
(b). This process iterates, leading to Figure 5.1 (c). The process of iteratively reassigning
objects to clusters to improve the partitioning is referred to as iterative relocation. Eventually,
no redistribution of the objects in any cluster occurs, and so the process terminates. The
resulting clusters are returned by the clustering process.
Figure 5.1 Clustering of a set of objects based on the k-means method. (The mean of each cluster
is marked by a ―+‖.)

The algorithm attempts to determine k partitions that minimize the square-error


function. It works well when the clusters are compact clouds that are rather well separated
from one another. The method is relatively scalable and efficient in processing large data sets
because the computational complexity of the algorithm is O(nkt), where n is the total number
of objects, k is the number of clusters, and t is the number of iterations. Normally, k<<n and t
<<n. The method often terminates at a local optimum. The k-means method, however, can be
applied only when the mean of a cluster is defined. This may not be the case in some
applications, such as when data with categorical attributes are involved. The necessity for users
to specify k, the number of clusters, in advance can be seen as a disadvantage. The k-means
method is not suitable for discovering clusters with nonconvex shapes or clusters of very
different size. Moreover, it is sensitive to noise and outlier data points because a small number
of such data can substantially influence the mean value.
Another variant to k-means is the k-modes method, which extends the k-means
paradigm to cluster categorical data by replacing the means of clusters with modes, using new
dissimilarity measures to deal with categorical objects and a frequency-based method to update
modes of clusters. The k-means and the k-modes methods can be integrated to cluster data with
mixed numeric and categorical values.

―How can we make the k-means algorithm more scalable?‖


A recent approach to scaling the k-means algorithm is based on the idea of
identifying three kinds of regions in data: regions that are compressible, regions that must be
maintained in main memory, and regions that are discardable. An object is discardable if its
membership in a cluster is ascertained. An object is compressible if it is not discardable but
belongs to a tight subcluster. A data structure known as a clustering feature is used to
summarize objects that have been discarded or compressed. If an object is neither discardable
nor compressible, then it should be retained in main memory. To achieve scalability, the
iterative clustering algorithm only includes the clustering features of the compressible objects
and the objects that must be retained in main memory, thereby turning a secondary-memory
based algorithm into a main-memory-based algorithm.

5.4.2 k-Medoids Method


The k-means algorithm is sensitive to outliers because an object with an extremely
large value may substantially distort the distribution of data. To overcome the mentioned issue
K-Medoids algorithm is applied.
In K-Medoids the algorithm iterates until, eventually, each representative object is
actually the medoid, or most centrally located object, of its cluster. This is the basis of the k-
medoids method for grouping n objects into k clusters.

Working of K-Medoids Algorithm


Instead of taking the mean value of the objects in a cluster as a reference point, we
can pick actual objects to represent the clusters, using one representative object per cluster.
Each remaining object is clustered with the representative object to which it is the most similar.
The partitioning method is then performed based on the principle of minimizing the sum of the
dissimilarities between each object and its corresponding reference point. That is, an absolute-
error criterion is used, defined as
E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|
where E is the sum of the absolute error for all objects in the data set; p is the point in space
representing a given object in cluster Cj; and oj is the representative object of Cj.
The initial representative objects (or seeds) are chosen arbitrarily. The iterative
process of replacing representative objects by non representative objects continues as long as
the quality of the resulting clustering is improved. This quality is estimated using a cost
function that measures the average dissimilarity between an object and the representative
object of its cluster. To determine whether a non representative object, Orandom, is a good
replacement for a current representative object, o j, the following four cases are examined for
each of the non representative objects, p, as illustrated in Figure 5.2.

Figure 5.2 Four cases of the cost function for k-medoids clustering.

Each time a reassignment occurs, a difference in absolute error, E, is contributed to


the cost function. Therefore, the cost function calculates the difference in absolute-error value
if a current representative object is replaced by a non representative object. The total cost of
swapping is the sum of costs incurred by all non representative objects. If the total cost is
negative, then oj is replaced or swapped with orandom since the actual absolute error E would be
reduced. If the total cost is positive, the current representative object, oj, is considered
acceptable, and nothing is changed in the iteration.

5.4.3 Partitioning Around Medoids (PAM)


PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms
introduced. It attempts to determine k partitions for n objects. After an initial random selection
of k representative objects, the algorithm repeatedly tries to make a better choice of cluster
representatives. All of the possible pairs of objects are analyzed, where one object in each pair
is considered a representative object and the other is not. The quality of the resulting clustering
is calculated for each such combination. An object, oj, is replaced with the object causing the
greatest reduction in error. The set of best objects for each cluster in one iteration forms the
representative objects for the next iteration. The final set of representative objects are the
respective medoids of the clusters. The complexity of each iteration is O(k(n − k)²). For large
values of n and k, such computation becomes very costly.

Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or


central objects.
Input:
 k: the number of clusters,
 D: a data set containing n objects.
Output: A set of k clusters.
Method:
1) arbitrarily choose k objects in D as the initial representative objects or seeds;
2) repeat
3) assign each remaining object to the cluster with the nearest representative object;
4) randomly select a non representative object, orandom;
5) compute the total cost, S, of swapping representative object, o j, with orandom;
6) if S < 0 then swap oj with orandom to form the new set of k representative objects;
7) until no change;
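As an illustration of the swap test described above, a simplified Python sketch (hypothetical data
and helper names; it evaluates a single candidate swap rather than running the full PAM algorithm):

```python
import numpy as np

def total_absolute_error(D, medoid_idx):
    # Sum of distances from every object to its nearest representative object.
    medoids = D[medoid_idx]
    dists = np.linalg.norm(D[:, None, :] - medoids[None, :, :], axis=2)
    return dists.min(axis=1).sum()

def swap_cost(D, medoid_idx, out_idx, in_idx):
    # Cost S of replacing representative object out_idx with non-representative in_idx.
    # S < 0 means the swap reduces the absolute error E and should be accepted.
    new_idx = [in_idx if m == out_idx else m for m in medoid_idx]
    return total_absolute_error(D, new_idx) - total_absolute_error(D, medoid_idx)

# Hypothetical 2-D data; objects 0 and 3 are an arbitrary initial choice of medoids.
D = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],
              [8.0, 8.0], [8.2, 7.9], [25.0, 25.0]])   # the last object is an outlier
medoids = [0, 3]
print(round(swap_cost(D, medoids, out_idx=3, in_idx=4), 3))
```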

―Which method is more robust—k-means or k-medoids?‖: The k-medoids method is more


robust than k-means in the presence of noise and outliers, because a medoid is less influenced
by outliers or other extreme values than a mean. However, its processing is more costly than
the k-means method. Both methods require the user to specify k, the number of clusters.

5.5 Hierarchical Methods


A hierarchical clustering method works by grouping data objects into a tree of
clusters. Hierarchical clustering methods can be further classified as either agglomerative or
divisive, depending on whether the hierarchical decomposition is formed in a bottom-up
(merging) or top-down (splitting) fashion. The quality of a pure hierarchical clustering method
suffers from its inability to perform adjustment once a merge or split decision has been
executed. That is, if a particular merge or split decision later turns out to have been a poor
choice, the method cannot backtrack and correct it.

5.5.1 Agglomerative and Divisive Hierarchical Clustering


In general, there are two types of hierarchical clustering methods:

 Agglomerative hierarchical clustering


 Divisive hierarchical clustering

Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each object
in its own cluster and then merges these atomic clusters into larger and larger clusters, until all
of the objects are in a single cluster or until certain termination conditions are satisfied. Most
hierarchical clustering methods belong to this category. They differ only in their definition of
intercluster similarity.

Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative
hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into
smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies
certain termination conditions, such as a desired number of clusters is obtained or the diameter
of each cluster is within a certain threshold.

5.5.2 Working of Agglomerative and Divisive Approaches


Figure 5.3 shows the application of AGNES (AGglomerative NESting), an
agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive
hierarchical clustering method, to a data set of five objects, {a, b, c, d, e}. Initially, AGNES
places each object into a cluster of its own. The clusters are then merged step-by-step
according to some criterion. For example, clusters C1 and C2 may be merged if an object in C1
and an object in C2 form the minimum Euclidean distance between any two objects from
different clusters. This is a single-linkage approach in that each cluster is represented by all of
the objects in the cluster, and the similarity between two clusters is measured by the similarity
of the closest pair of data points belonging to different clusters. The cluster merging process
repeats until all of the objects are eventually merged to form one cluster.

Figure 5.3 Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}
In DIANA, all of the objects are used to form one initial cluster. The cluster is split
according to some principle, such as the maximum Euclidean distance between the closest
neighboring objects in the cluster. The cluster splitting process repeats until, eventually, each
new cluster contains only a single object. In either agglomerative or divisive hierarchical
clustering, the user can specify the desired number of clusters as a termination condition.
A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering. It shows how objects are grouped together step by step. Figure 5.4
shows a dendrogram for the five objects presented in Figure 5.3, where l = 0 shows the five
objects as singleton clusters at level 0. At l = 1, objects a and b are grouped together to form
the first cluster, and they stay together at all subsequent levels. We can also use a vertical axis
to show the similarity scale between clusters. For example, when the similarity of two groups
of objects, {a, b} and {c, d, e} is roughly 0.16, they are merged together to form a single
cluster.
Figure 5.4 Dendrogram
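For illustration, a brief sketch using SciPy's hierarchical clustering routines (the 2-D coordinates
assigned to objects a–e are hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical coordinates for objects a, b, c, d, e.
points = np.array([[1.0, 1.0],   # a
                   [1.2, 1.1],   # b
                   [5.0, 5.0],   # c
                   [5.2, 5.1],   # d
                   [5.1, 6.0]])  # e

# Single-linkage: merge the two clusters whose closest members are nearest.
Z = linkage(points, method="single")
print(Z)   # each row: the two merged clusters, their distance, and the new cluster size

# With matplotlib installed, the merge tree can be drawn as a dendrogram:
# import matplotlib.pyplot as plt; dendrogram(Z, labels=list("abcde")); plt.show()
```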
5.5.3 Distance Measures between clusters
Four widely used measures for distance between clusters are as follows, where |p-p'|
is the distance between two objects or points, p and p'; mi is the mean for cluster, Ci; and ni is
the number of objects in Ci.
Minimum distance:  d_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|
Maximum distance:  d_{max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|
Mean distance:     d_{mean}(C_i, C_j) = |m_i - m_j|
Average distance:  d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|
When an algorithm uses the minimum distance, d min(Ci, Cj), to measure the distance
between clusters, it is sometimes called a nearest-neighbor clustering algorithm. Moreover, if
the clustering process is terminated when the distance between nearest clusters exceeds an
arbitrary threshold, it is called a single-linkage algorithm.
When an algorithm uses the maximum distance, dmax(Ci, Cj), to measure the distance
between clusters, it is sometimes called a farthest-neighbor clustering algorithm. If the
clustering process is terminated when the maximum distance between nearest clusters exceeds
an arbitrary threshold, it is called a complete-linkage algorithm.
The use of mean or average distance is a compromise between the minimum and
maximum distances and overcomes the outlier sensitivity problem. Whereas the mean distance
is the simplest to compute, the average distance is advantageous in that it can handle categoric
as well as numeric data. The computation of the mean vector for categoric data can be difficult
or impossible to define.
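A minimal NumPy sketch of these four inter-cluster distance measures for two hypothetical clusters:

```python
import numpy as np

def intercluster_distances(Ci, Cj):
    # Pairwise distances |p - p'| between every object of Ci and every object of Cj.
    pair = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)
    return {
        "min":  pair.min(),     # nearest-neighbor (single-linkage) distance
        "max":  pair.max(),     # farthest-neighbor (complete-linkage) distance
        "mean": np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0)),
        "avg":  pair.mean(),    # average over all ni * nj pairs
    }

Ci = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 1.4]])
Cj = np.array([[5.0, 5.0], [5.5, 4.8]])
print({k: round(v, 3) for k, v in intercluster_distances(Ci, Cj).items()})
```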

5.5.4 Difficulties with hierarchical clustering


The hierarchical clustering method, often encounters following difficulties:
 Difficulties regarding the selection of merge or split points.
 Split decision is critical because once a group of objects is merged or split, the process
at the next step will operate on the newly generated clusters.
 It will neither undo what was done previously nor perform object swapping between
clusters.
 Low-quality clusters will be formed at wrong split or merge decisions.
 The method does not scale well, because each decision to merge or split requires the
examination and evaluation of a large number of objects or clusters.

5.6 Density-based Clustering


The Density-based Clustering works by detecting areas where the data points are
concentrated and where they are separated by areas that are empty or sparse. Points that are not
part of a cluster are labeled as noise. To discover clusters with arbitrary shape, density-based
clustering methods have been developed. Three most commonly used density based clustering
algorithms are listed as follows:
 DBSCAN - Grows clusters according to a density-based connectivity analysis.
 OPTICS - Produce a cluster ordering obtained from a wide range of parameter settings.
 DENCLUE - Clusters objects based on a set of density distribution functions.

5.6.1 DBSCAN
DBSCAN is a density based clustering algorithm. The algorithm grows regions with
sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial
databases with noise. It defines a cluster as a maximal set of density-connected points. The
basic ideas in working of density-based clustering are as follows.

 The neighborhood within a radius 𝜀 of a given object is called the 𝜀-neighborhood of


the object.
 If the 𝜀 -neighborhood of an object contains at least a minimum number, MinPts, of
objects, then the object is called a core object.
 Given a set of objects, D, we say that an object p is directly density-reachable from
object q if p is within the 𝜀 -neighborhood of q, and q is a core object.
 An object p is density-reachable from object q with respect to 𝜀 and MinPts in a set of
objects, D, if there is a chain of objects p1, . . . , pn, where p1 = q and pn = p such that
pi+1 is directly density-reachable from pi with respect to 𝜀 and MinPts, for 1<= i<= n, pi
∈ D.
 An object p is density-connected to object q with respect to 𝜀 and MinPts in a set of
objects, D, if there is an object o ∈ D such that both p and q are density-reachable from
o with respect to 𝜀 and MinPts.
 Density reachability is the transitive closure of direct density reachability, and this
relationship is asymmetric. Only core objects are mutually density reachable. Density
connectivity, however, is a symmetric relation.

DBSCAN : Density-reachability and density connectivity


Consider Figure 5.5 for a given 𝜀 represented by the radius of the circles, and, say,
let MinPts = 3.

Figure 5.5 Density reachability and density connectivity in density-based clustering

Based on the working principles of DB scan algorithm:


 Of the labeled points, m, p, o, and r are core objects because each is in an
𝜀-neighborhood containing at least three points.
 q is directly density-reachable from m. m is directly density-reachable from p and vice
versa.
 q is (indirectly) density-reachable from p because q is directly density-reachable from
m and m is directly density-reachable from p. However, p is not density-reachable
from q because q is not a core object. Similarly, r and s are density-reachable from o,
and o is density-reachable from r.
 o, r, and s are all density-connected.
A density-based cluster is a set of density-connected objects that is maximal with respect to
density-reachability. Every object not contained in any cluster is considered to be noise.

DBSCAN : Finding Clusters in DB Scan


DBSCAN searches for clusters by checking the 𝜀 -neighborhood of each point in the
database. If the 𝜀-neighborhood of a point p contains more than MinPts, a new cluster with p as
a core object is created. DBSCAN then iteratively collects directly density-reachable objects
from these core objects, which may involve the merge of a few density-reachable clusters. The
process terminates when no new point can be added to any cluster.
If a spatial index is used, the computational complexity of DBSCAN is O(n log n),
where n is the number of database objects. Otherwise, it is O(n²). With appropriate settings of
the user-defined parameters 𝜀 and MinPts, the algorithm is effective at finding arbitrary-shaped
clusters.
Advantages
1) Does not require a-priori specification of number of clusters.
2) Able to identify noise data while clustering.
3) DBSCAN algorithm is able to find arbitrarily size and arbitrarily shaped clusters.
Disadvantages
1) DBSCAN algorithm fails in case of varying density clusters.
2) Fails in case of neck type of dataset.
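For illustration, a short sketch using scikit-learn's DBSCAN implementation (the data points and
parameter values are hypothetical):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of hypothetical 2-D points plus one isolated noise point.
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],
              [20.0, 20.0]])

# eps plays the role of the radius 𝜀, and min_samples the role of MinPts.
model = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(model.labels_)   # cluster id per point; -1 marks noise (the isolated point)
```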

5.6.2 OPTICS: Ordering Points to Identify the Clustering Structure


Ordering points to identify the clustering structure (OPTICS) is an algorithm for
finding density-based clusters in spatial data. Its basic idea is similar to DBSCAN, but it
addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters
in data of varying density. To do so, the points of the database are (linearly) ordered such that
spatially closest points become neighbors in the ordering. Additionally, a special distance is
stored for each point that represents the density that must be accepted for a cluster so that both
points belong to the same cluster.
Working of OPTICS Algorithm:
Rather than produce a data set clustering explicitly, OPTICS computes an
augmented cluster ordering for automatic and interactive cluster analysis. This ordering
represents the density-based clustering structure of the data. It contains information that is
equivalent to density-based clustering obtained from a wide range of parameter settings. The
cluster ordering can be used to extract basic clustering information (such as cluster centers or
arbitrary-shaped clusters) as well as provide the intrinsic clustering structure.
By examining DBSCAN, we can easily see that for a constant MinPts value, density
based clusters with respect to a higher density (i.e., a lower value for 𝜀 ) are completely
contained in density-connected sets obtained with respect to a lower density. Recall that the
parameter 𝜀 is a distance—it is the neighborhood radius. Therefore, in order to produce a set or
ordering of density-based clusters, we can extend the DBSCAN algorithm to process a set of
distance parameter values at the same time. To construct the different clusterings
simultaneously, the objects should be processed in a specific order. This order selects an object
that is density-reachable with respect to the lowest 𝜀 value so that clusters with higher density
(lower 𝜀) will be finished first. Based on this idea, two values need to be stored for each
object—core-distance and reachability-distance:
 The core-distance of an object p is the smallest 𝜀′ value that makes {p} a core object.
If p is not a core object, the core-distance of p is undefined.
 The reachability-distance of an object q with respect to another object p is the greater
value of the core-distance of p and the Euclidean distance between p and q. If p is not a
core object, the reachability-distance between p and q is undefined.

Core-distance and reachability-distance


The following figure 5.6 illustrates the concepts of core distance and reachability-distance.
Figure 5.6 Concepts of core distance and reachability-distance

Suppose that 𝜀 =6 mm and MinPts=5. The core-distance of p is the distance, 𝜀′,


between p and the fourth closest data object. The reachability-distance of q1 with respect to p is
the core-distance of p (i.e., 𝜀′ =3 mm) because this is greater than the Euclidean distance from
p to q1. The reachability distance of q2 with respect to p is the Euclidean distance from p to q 2
because this is greater than the core-distance of p.
Because of the structural equivalence of the OPTICS algorithm to DBSCAN, the
OPTICS algorithm has the same runtime complexity as that of DBSCAN, that is, O(nlogn) if a
spatial index is used, where n is the number of objects.

5.6.3 DENCLUE: Clustering Based on Density Distribution Functions


DENCLUE (DENsity-based CLUstEring) is a clustering method based on a set of
density distribution functions. The method is built on the following ideas:
(1) the influence of each data point can be formally modeled using a mathematical function,
called an influence function, which describes the impact of a data point within its
neighborhood; (2) the overall density of the data space can be modeled analytically as the sum
of the influence function applied to all data points; and
(3) clusters can then be determined mathematically by identifying density attractors, where
density attractors are local maxima of the overall density function.

Let x and y be objects or points in Fd, a d-dimensional input space. The influence
function of data object y on x is a function, fBy : Fd → R0+, which is defined in terms of a
basic influence function fB:
fBy(x) = fB(x, y).
This reflects the impact of y on x. In principle, the influence function can be an
arbitrary function that can be determined by the distance between two objects in a
neighborhood.
The distance function, d(x, y), should be reflexive and symmetric, such as the Euclidean
distance function. It can be used to compute a square wave influence function,
fSquare(x, y) = 0 if d(x, y) > 𝜎, and 1 otherwise,
or a Gaussian influence function,
fGauss(x, y) = e^( −d(x, y)^2 / (2𝜎^2) ).
Major advantages of DENCLUE in comparison with other clustering algorithms:


(1) It has a solid mathematical foundation and generalizes various clustering methods,
including partitioning, hierarchical, and density-based methods;
(2) It has good clustering properties for data sets with large amounts of noise;
(3) It allows a compact mathematical description of arbitrarily shaped clusters in high
dimensional data sets;
(4) It uses grid cells, yet only keeps information about grid cells that actually contain data
points. It manages these cells in a tree-based access structure, and thus is significantly faster
than some influential algorithms, such as DBSCAN.
However, the method requires careful selection of the density parameter 𝜎 and noise threshold,
as the selection of such parameters may significantly influence the quality of the clustering
results.

5.7 Grid-Based Methods


The grid-based clustering approach uses a multiresolution grid data structure. It
quantizes the object space into a finite number of cells that form a grid structure on which all
of the operations for clustering are performed. The main advantage of the approach is its fast
processing time, which is typically independent of the number of data objects, yet dependent
only on the number of cells in each dimension of the quantized space. Some typical
examples of the grid-based approach include:
 STING, which explores statistical information stored in the grid cells.
 WaveCluster, which clusters objects using a wavelet transform method.
 CLIQUE, which represents a grid- and density-based approach for clustering in high-
dimensional data space.

5.7.1 STING: STatistical INformation Grid


STING is a grid-based multiresolution clustering technique in which the spatial area
is divided into rectangular cells. There are usually several levels of such rectangular cells
corresponding to different levels of resolution, and these cells form a hierarchical structure:
each cell at a high level is partitioned to form a number of cells at the next lower level.
Statistical information regarding the attributes in each grid cell (such as the mean, maximum,
and minimum values) is precomputed and stored. These statistical parameters are useful for
query processing, as described below.

Figure 5.7 A hierarchical structure for STING clustering.

Figure 5.7 shows a hierarchical structure for STING clustering. Statistical


parameters of higher-level cells can easily be computed from the parameters of the lower-level
cells. These parameters include the following: the attribute-independent parameter, count; the
attribute-dependent parameters, mean, stdev (standard deviation), min (minimum), max
(maximum); and the type of distribution that the attribute value in the cell follows, such as
normal, uniform, exponential, or none (if the distribution is unknown).When the data are
loaded into the database, the parameters count, mean, stdev, min, and max of the bottom-level
cells are calculated directly from the data. The value of distribution may either be assigned by
the user if the distribution type is known beforehand or obtained by hypothesis tests, such as the
χ2 (chi-square) test. The type of distribution of a higher-level cell can be computed based on the majority of
distribution types of its corresponding lower-level cells in conjunction with a threshold
filtering process. If the distributions of the lower level cells disagree with each other and fail
the threshold test, the distribution type of the high-level cell is set to none.
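As an informal illustration of how these parameters roll up the hierarchy, the sketch below computes count, mean, stdev, min, and max for bottom-level cells and then derives a parent cell's parameters from its children's precomputed parameters; the cell contents are invented for the example.

# A minimal sketch of STING-style statistics: compute per-cell parameters once,
# then aggregate them upward without touching the raw data again.
import numpy as np

def cell_stats(values):
    v = np.asarray(values, dtype=float)
    return {"count": v.size, "mean": v.mean(), "stdev": v.std(),
            "min": v.min(), "max": v.max()}

def merge_cells(child_stats):
    """Compute a parent cell's parameters from its children's precomputed parameters."""
    n = sum(c["count"] for c in child_stats)
    mean = sum(c["count"] * c["mean"] for c in child_stats) / n
    # E[X^2] per child is stdev^2 + mean^2; combine, then subtract parent mean^2.
    second_moment = sum(c["count"] * (c["stdev"] ** 2 + c["mean"] ** 2)
                        for c in child_stats) / n
    return {"count": n, "mean": mean,
            "stdev": np.sqrt(second_moment - mean ** 2),
            "min": min(c["min"] for c in child_stats),
            "max": max(c["max"] for c in child_stats)}

children = [cell_stats([10, 12, 11]), cell_stats([40, 42]), cell_stats([9, 13, 10, 11])]
print(merge_cells(children))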

Statistical Information for query answering


The statistical parameters can be used in a top-down, grid-based method as follows.
First, a layer within the hierarchical structure is determined from which the query-answering
process is to start. This layer typically contains a small number of cells. For each cell in the
current layer, we compute the confidence interval (or estimated range of probability) reflecting
the cell‘s relevancy to the given query. The irrelevant cells are removed from further
consideration. Processing of the next lower level examines only the remaining relevant cells.
This process is repeated until the bottom layer is reached. At this time, if the query
specification is met, the regions of relevant cells that satisfy the query are returned. Otherwise,
the data that fall into the relevant cells are retrieved and further processed until they meet the
requirements of the query.
Advantages of STING over other clustering methods
STING offers several advantages:
(1) the grid-based computation is query-independent, because the statistical information stored
in each cell represents the summary information of the data in the grid cell, independent of the
query.
(2) the grid structure facilitates parallel processing and incremental updating.
(3) the method‘s efficiency is a major advantage: STING goes through the database once to
compute the statistical parameters of the cells, and hence the time complexity of generating
clusters is O(n), where n is the total number of objects. After generating the hierarchical
structure, the query processing time is O(g), where g is the total number of grid cells at the
lowest level, which is usually much smaller than n.

Quality of STING clustering


Because STING uses a multiresolution approach to cluster analysis, the quality of
STING clustering depends on the granularity of the lowest level of the grid structure. If the
granularity is very fine, the cost of processing will increase substantially; however, if the
bottom level of the grid structure is too coarse, it may reduce the quality of cluster analysis.
Moreover, STING does not consider the spatial relationship between the children and their
neighboring cells for construction of a parent cell. As a result, the shapes of the resulting
clusters are isothetic; that is, all of the cluster boundaries are either horizontal or vertical, and
no diagonal boundary is detected. This may lower the quality and accuracy of the clusters
despite the fast processing time of the technique.

5.7.2 WaveCluster: Clustering UsingWavelet Transformation


WaveCluster is a multiresolution clustering algorithm that first summarizes the data
by imposing a multidimensional grid structure onto the data space. It then uses a wavelet
transformation to transform the original feature space, finding dense regions in the transformed
space. In this approach, each grid cell summarizes the information of a group of points that
map into the cell. This summary information typically fits into main memory for use by the
multiresolution wavelet transform and the subsequent cluster analysis.

WaveCluster Working:
A wavelet transform is a signal processing technique that decomposes a signal into
different frequency subbands. The wavelet model can be applied to d-dimensional signals by
applying a one-dimensional wavelet transform d times. In applying a wavelet transform, data
are transformed so as to preserve the relative distance between objects at different levels of
resolution. This allows the natural clusters in the data to become more distinguishable. Clusters
can then be identified by searching for dense regions in the new domain.

Why wavelet transformation is useful for clustering


It provides unsupervised clustering: it uses hat-shaped filters that emphasize regions where the
points cluster, while suppressing weaker information outside of the cluster boundaries. Thus,
dense regions in the original feature space act as attractors for nearby points and as inhibitors
for points that are further away. This means that the clusters in the data automatically stand out
and ―clear‖ the regions around them. Thus, another advantage is that wavelet transformation
can automatically result in the removal of outliers.
The multiresolution property of wavelet transformations can help detect clusters at varying
levels of accuracy. For example, Figure 5.8 shows a sample of two dimensional feature space,
where each point in the image represents the attribute or feature values of one object in the
spatial data set. Figure 5.9 shows the resulting wavelet decomposition of the data. The subband
shown in the upper-left quadrant emphasizes the average neighborhood around each data point.
The subband in the upper-right quadrant emphasizes the horizontal edges of the data. The
subband in the lower-left quadrant emphasizes the vertical edges, while the subband in the
lower-right quadrant emphasizes the corners.

Figure 5.8 A sample of two-dimensional feature space.

Figure 5.9 Multiresolution of the feature space in Figure 5.8 at (a) scale 1 (high resolution);
(b) scale 2 (medium resolution); and (c) scale 3 (low resolution).

Wavelet-based clustering is very fast, with a computational complexity of O(n), where n is the
number of objects in the database. The algorithm implementation can be made parallel.
WaveCluster is a grid-based and density-based algorithm. It conforms with many of the
requirements of a good clustering algorithm: It handles large data sets efficiently, discovers
clusters with arbitrary shape, successfully handles outliers, is insensitive to the order of input,
and does not require the specification of input parameters such as the number of clusters or a
neighborhood radius.
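A rough sketch of this pipeline is given below: the points are quantized onto a 2-D grid, a Haar wavelet transform (via the PyWavelets package, assumed to be available) produces the average and edge subbands, and connected dense cells in the average subband are labeled as clusters with SciPy. The grid size, wavelet, and density threshold are illustrative choices, not values prescribed by WaveCluster.

# A minimal sketch of the WaveCluster idea: grid quantization, 2-D wavelet
# transform, then connected dense regions in the low-frequency subband.
import numpy as np
import pywt
from scipy import ndimage

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal((2, 2), 0.4, (300, 2)),
                 rng.normal((7, 7), 0.6, (300, 2))])

# Step 1: impose a multidimensional grid (here a 64x64 2-D histogram of counts).
grid, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=64)

# Step 2: wavelet transform; cA is the average subband, (cH, cV, cD) emphasize edges/corners.
cA, (cH, cV, cD) = pywt.dwt2(grid, "haar")

# Step 3: find dense regions in the transformed (lower-resolution) space.
dense = cA > cA.mean() + cA.std()
labels, n_clusters = ndimage.label(dense)   # connected dense cells form clusters
print("clusters found:", n_clusters)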

5.8 Model Based Clustering


Model-based clustering methods attempt to optimize the fit between the given data
and some mathematical model. Such methods are often based on the assumption that the data
are generated by a mixture of underlying probability distributions. Model based clustering can
be implemented by different methods. Three examples of implementing this model based
clustering are listed as follows.
 Expectation-Maximization.
 Conceptual clustering.
 Neural network approach to clustering.
5.8.1 Expectation-Maximization
Each cluster can be represented mathematically by a parametric probability
distribution. The entire data is a mixture of these distributions, where each individual
distribution is typically referred to as a component distribution. We can therefore cluster the
data using a finite mixture density model of k probability distributions, where each distribution
represents a cluster. The problem is to estimate the parameters of the probability distributions
so as to best fit the data. Figure 5.10 is an example of a simple finite mixture density model.
There are two clusters. Each follows a normal or Gaussian distribution with its own mean and
standard deviation.

Fig 5.10 Each cluster can be represented by a probability distribution, centered at a mean, and
with a standard deviation.

The EM (Expectation-Maximization) algorithm is a popular iterative refinement


algorithm that can be used for finding the parameter estimates. It can be viewed as an extension
of the k-means paradigm, which assigns an object to the cluster with which it is most similar,
based on the cluster mean. Instead of assigning each object to a dedicated cluster, EM assigns
each object to a cluster according to a weight representing the probability of membership. In
other words, there are no strict boundaries between clusters. Therefore, new means are
computed based on weighted measures.
EM starts with an initial estimate or ―guess‖ of the parameters of the mixture model
(collectively referred to as the parameter vector). It iteratively rescores the objects against the
mixture density produced by the parameter vector. The rescored objects are then used to update
the parameter estimates. Each object is assigned a probability that it would possess a certain set
of attribute values given that it was a member of a given cluster. The algorithm is described as
follows:

1. Make an initial guess of the parameter vector: This involves randomly selecting k
objects to represent the cluster means or centers (as in k-means partitioning), as well as
making guesses for the additional parameters.

2. Iteratively refine the parameters (or clusters) based on the following two steps:
(a) Expectation Step: Assign each object xi to cluster Ck with the probability
P(xi ∈ Ck) = p(Ck | xi) = p(Ck) p(xi | Ck) / p(xi),
where p(xi | Ck) = N(mk, Ek(xi)) follows the normal (i.e., Gaussian) distribution
around mean, mk, with expectation, Ek. In other words, this step calculates the
probability of cluster membership of object xi, for each of the clusters. These
probabilities are the ―expected‖ cluster memberships for object x i.
(b) Maximization Step: Use the probability estimates from above to re-estimate (or
refine) the model parameters. For example,
mk = (1/n) Σ i=1..n [ xi P(xi ∈ Ck) / Σj P(xi ∈ Cj) ]
This step is the ―maximization‖ of the likelihood of the distributions given the data.

The EM algorithm is simple and easy to implement. In practice, it converges fast but may not
reach the global optima. Convergence is guaranteed for certain forms of optimization
functions. The computational complexity is linear in d (the number of input features), n (the
number of objects), and t (the number of iterations).
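A minimal illustration of EM-based mixture clustering is given below, using scikit-learn's GaussianMixture (which runs the Expectation and Maximization steps internally) on a synthetic two-component data set; the data, component count, and random seed are illustrative assumptions. Note the soft membership probabilities rather than hard assignments.

# A minimal sketch of EM clustering with a Gaussian mixture model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),     # cluster 1
               rng.normal(6.0, 1.5, (200, 2))])    # cluster 2

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# E-step output: soft membership weights P(xi in Ck) rather than hard assignments.
membership = gmm.predict_proba(X[:3])
print("soft memberships:\n", membership)
# M-step output: re-estimated parameters of each component distribution.
print("means:\n", gmm.means_)
print("mixing weights:", gmm.weights_)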
5.8.2 Conceptual Clustering
Conceptual clustering is a form of clustering in machine learning that, given a set of
unlabeled objects, produces a classification scheme over the objects. Unlike conventional
clustering, which primarily identifies groups of like objects, conceptual clustering goes one
step further by also finding characteristic descriptions for each group, where each group
represents a concept or class. Hence, conceptual clustering is a two-step process: clustering is
performed first, followed by characterization. Here, clustering quality is not solely a function
of the individual objects. Rather, it incorporates factors such as the generality and simplicity of
the derived concept descriptions. Most methods of conceptual clustering adopt a statistical
approach that uses probability measurements in determining the concepts or clusters.
Probabilistic descriptions
are typically used to represent each derived concept. COBWEB is a popular and simple method
of incremental conceptual clustering. Its input objects are described by categorical attribute-
value pairs. COBWEB creates a hierarchical clustering in the form of a classification tree.
Classification Tree and Decision Tree:
Following figure 5.11 shows a classification tree for a set of animal data.

Figure 5.11 Classification tree for a set of animal data.

A classification tree differs from a decision tree. Each node in a classification tree
refers to a concept and contains a probabilistic description of that concept, which summarizes
the objects classified under the node. The probabilistic description includes the probability of
the concept and conditional probabilities of the form P(Ai = vij | Ck), where Ai = vij is an
attribute-value pair (that is, the ith attribute takes its jth possible value) and Ck is the concept
class. (Counts are accumulated and stored at each node for computation of the probabilities.)
This is unlike decision trees, which label branches rather than nodes and use logical rather than
probabilistic descriptors. The sibling nodes at a given level of a classification tree are said to
form a partition. To classify an object using a classification tree, a partial matching function is
employed to descend the tree along a path of ―best‖ matching nodes.

COBWEB uses a heuristic evaluation measure called category utility to guide construction of
the tree. Category utility (CU) is defined as
CU = ( Σ k=1..n P(Ck) [ Σi Σj P(Ai = vij | Ck)^2 − Σi Σj P(Ai = vij)^2 ] ) / n
where n is the number of nodes, concepts, or ―categories‖ forming a partition, {C1, C2,. . .,
Cn}, at the given level of the tree. Category utility rewards intraclass similarity and interclass
dissimilarity, where:
Intraclass similarity is the probability P(Ai = vij | Ck). The larger this value is, the greater the
proportion of class members that share this attribute-value pair and the more predictive the
pair is of class members.
Interclass dissimilarity is the probability P(Ck | Ai = vij). The larger this value is, the fewer the
objects in contrasting classes that share this attribute-value pair and the more predictive the
pair is of the class.
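The hypothetical helper below (not COBWEB itself) computes category utility for a given partition of objects described by categorical attribute-value pairs, directly following the formula above; the toy animal objects and the two candidate partitions are illustrative.

# A minimal sketch: category utility of a partition of categorical objects.
from collections import Counter

def category_utility(partition):
    """partition: list of clusters; each cluster is a list of dicts {attribute: value}."""
    all_objects = [obj for cluster in partition for obj in cluster]
    total = len(all_objects)
    attrs = sorted({a for obj in all_objects for a in obj})

    def sq_sum(objects):
        # sum over attributes/values of P(Ai = vij | these objects)^2
        s = 0.0
        for a in attrs:
            counts = Counter(obj[a] for obj in objects)
            s += sum((c / len(objects)) ** 2 for c in counts.values())
        return s

    baseline = sq_sum(all_objects)           # Sum_i Sum_j P(Ai = vij)^2
    cu = 0.0
    for cluster in partition:
        p_ck = len(cluster) / total           # P(Ck)
        cu += p_ck * (sq_sum(cluster) - baseline)
    return cu / len(partition)                # divide by n, the number of categories

animals = [{"cover": "hair", "legs": "4"}, {"cover": "hair", "legs": "4"},
           {"cover": "feathers", "legs": "2"}, {"cover": "feathers", "legs": "2"}]
print(category_utility([animals[:2], animals[2:]]))   # well-separated partition: higher CU
print(category_utility([[animals[0], animals[2]], [animals[1], animals[3]]]))  # mixed: CU = 0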

COBWEB Working:
COBWEB incrementally incorporates objects into a classification tree.
―Given a new object, how does COBWEB decide where to incorporate it into the classification
tree?‖ COBWEB descends the tree along an appropriate path, updating counts along the way,
in search of the ―best host‖ or node at which to classify the object. This decision is based on
temporarily placing the object in each node and computing the category utility of the resulting
partition. The placement that results in the highest category utility should be a good host for the
object.
COBWEB computes the category utility of the partition that would result if a new
node were to be created for the object. This is compared to the above computation based on the
existing nodes. The object is then placed in an existing class, or a new class is created for it,
based on the partition with the highest category utility value. Notice that COBWEB has the
ability to automatically adjust the number of classes in a partition. It does not need to rely on
the user to provide such an input parameter.
The two operators mentioned above (placing an object in an existing class and creating a
new class for it) are highly sensitive to the input order of the objects. COBWEB has two additional operators that help make it less sensitive to input order.
These are merging and splitting. When an object is incorporated, the two best hosts are
considered for merging into a single class. Furthermore, COBWEB considers splitting the
children of the best host among the existing categories. These decisions are based on category
utility. The merging and splitting operators allow COBWEB to perform a bidirectional
search—for example, a merge can undo a previous split.

Limitations of COBWEB
First, it is based on the assumption that probability distributions on separate
attributes are statistically independent of one another. This assumption is, however, not always
true because correlation between attributes often exists. Moreover, the probability distribution
representation of clusters makes it quite expensive to update and store the clusters. This is
especially so when the attributes have a large number of values because the time and space
complexities depend not only on the number of attributes, but also on the number of values for
each attribute. Furthermore, the classification tree is not height-balanced for skewed input data,
which may cause the time and space complexity to degrade dramatically.

5.8.3 Neural Network Approach


The neural network approach is motivated by biological neural networks. Roughly
speaking, a neural network is a set of connected input/output units, where each connection has
a weight associated with it. Neural networks have several properties that make them popular
for clustering. First, neural networks are inherently parallel and distributed processing
architectures. Second, neural networks learn by adjusting their interconnection weights so as to
best fit the data. This allows them to ―normalize‖ or ―prototype‖ the patterns and act as feature
(or attribute) extractors for the various clusters. Third, neural networks process numerical
vectors and require object patterns to be represented by quantitative features only. Many
clustering tasks handle only numerical data or can transform their data into quantitative
features if needed.
The neural network approach to clustering tends to represent each cluster as an
exemplar. An exemplar acts as a ―prototype‖ of the cluster and does not necessarily have to
correspond to a particular data example or object. New objects can be distributed to the cluster
whose exemplar is the most similar, based on some distance measure. The attributes of an
object assigned to a cluster can be predicted from the attributes of the cluster‘s exemplar.
Self-organizing feature maps (SOMs) are one of the most popular neural network
methods for cluster analysis. They are sometimes referred to as Kohonen self-organizing
feature maps, after their creator, Teuvo Kohonen, or as topologically ordered maps. SOMs‘
goal is to represent all points in a high-dimensional source space by points in a low-
dimensional (usually 2-D or 3-D) target space, such that the distance and proximity
relationships (hence the topology) are preserved as much as possible. The method is
particularly useful when a nonlinear mapping is inherent in the problem itself.
SOMs can also be viewed as a constrained version of k-means clustering, in which
the cluster centers tend to lie in a low-dimensional manifold in the feature or attribute space.
With SOMs, clustering is performed by having several units competing for the current object.
The unit whose weight vector is closest to the current object becomes the winning or active
unit. So as to move even closer to the input object, the weights of the winning unit are adjusted,
as well as those of its nearest neighbors. SOMs assume that there is some topology or ordering
among the input objects and that the units will eventually take on this structure in space. The
organization of units is said to form a feature map. SOMs are believed to resemble processing
that can occur in the brain and are useful for visualizing high-dimensional data in 2-D or 3-D
space.
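A minimal numpy sketch of this competitive-learning scheme is shown below: for each input, the best-matching unit is found and it, together with its map neighbors, is pulled toward the input. The map size, learning rate, neighborhood width, and decay schedule are illustrative assumptions, not prescribed values.

# A minimal SOM sketch: competitive learning with a Gaussian neighborhood.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                    # input objects (3-D features)

rows, cols, dim = 10, 10, X.shape[1]
weights = rng.normal(size=(rows, cols, dim))     # one weight vector per map unit
grid_r, grid_c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")

lr, sigma, epochs = 0.5, 2.0, 20
for epoch in range(epochs):
    for x in X:
        # winning (best-matching) unit: closest weight vector to the input
        dists = np.linalg.norm(weights - x, axis=2)
        wr, wc = np.unravel_index(np.argmin(dists), dists.shape)
        # Gaussian neighborhood on the 2-D map around the winner
        grid_dist2 = (grid_r - wr) ** 2 + (grid_c - wc) ** 2
        h = np.exp(-grid_dist2 / (2 * sigma ** 2))[..., None]
        weights += lr * h * (x - weights)        # move winner and neighbors toward x
    lr *= 0.95                                   # decay learning rate
    sigma *= 0.95                                # shrink the neighborhood

print("trained map shape:", weights.shape)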

5.9 Clustering High-Dimensional Data


Definition:
Clustering high-dimensional data is the cluster analysis of data objects that have anywhere from a few dozen to many thousands of dimensions (attributes), such as text documents represented as term vectors or microarray (gene expression) data.

Challenges in High-Dimensional Data Clustering
As dimensionality grows, many of the dimensions may be irrelevant to any particular cluster and can mask the real clusters, and the data become increasingly sparse, so that distances between points tend to become nearly uniform and therefore less meaningful.

To overcome these challenges in high-dimensional data clustering, the following methods may be applied:
 Feature transformation methods
 Attribute subset selection
 Subspace clustering
Feature transformation methods, such as principal component analysis and singular
value decomposition, transform the data onto a smaller space while generally preserving the
original relative distance between objects. They summarize data by creating linear
combinations of the attributes, and may discover hidden structures in the data. However, such
techniques do not actually remove any of the original attributes from analysis. This is
problematic when there are a large number of irrelevant attributes. The irrelevant information
may mask the real clusters, even after transformation. Moreover, the transformed features
(attributes) are often difficult to interpret, making the clustering results less useful. Thus,
feature transformation is only suited to data sets where most of the dimensions are relevant to
the clustering task. Unfortunately, real-world data sets tend to have many highly correlated, or
redundant, dimensions.
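As a small illustration of feature transformation before clustering, the sketch below projects 50-dimensional data onto five principal components with scikit-learn's PCA and then clusters the reduced data with k-means; the dimensionalities, component count, and cluster count are arbitrary choices for the example.

# A minimal sketch: feature transformation (PCA) followed by clustering.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))                      # 50-dimensional objects

X_reduced = PCA(n_components=5).fit_transform(X)    # linear combinations of attributes
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels[:10])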

Attribute subset selection is commonly used for data reduction by removing


irrelevant or redundant dimensions (or attributes). Given a set of attributes, attribute subset
selection finds the subset of attributes that are most relevant to the data mining task. Attribute
subset selection involves searching through various attribute subsets and evaluating these
subsets using certain criteria. It is most commonly performed by supervised learning—the most
relevant set of attributes are found with respect to the given class labels. It can also be
performed by an unsupervised process, such as entropy analysis, which is based on the
property that entropy tends to be low for data that contain tight clusters. Other evaluation
functions, such as category utility, may also be used.

Subspace clustering is an extension to attribute subset selection that has shown its
strength at high-dimensional clustering. It is based on the observation that different subspaces
may contain different, meaningful clusters. Subspace clustering searches for groups of clusters
within different subspaces of the same data set. The problem becomes how to find such
subspace clusters effectively and efficiently.

Clustering Approaches for effective clustering of high-dimensional data


Three approaches for effective clustering of high-dimensional data:
 Dimension-growth subspace clustering, represented by CLIQUE
 Dimension-reduction projected clustering, represented by PROCLUS
 Frequent pattern based clustering, represented by pCluster
5.9.1 CLIQUE: A Dimension-Growth Subspace Clustering Method
CLIQUE (CLustering In QUEst) was the first algorithm proposed for dimension-
growth subspace clustering in high-dimensional space. In dimension-growth subspace
clustering, the clustering process starts at single-dimensional subspaces and grows upward to
higher-dimensional ones. Because CLIQUE partitions each dimension like a grid structure and
determines whether a cell is dense based on the number of points it contains, it can also be
viewed as an integration of density-based and grid-based clustering methods.

CLIQUE Working steps:


The ideas of the CLIQUE clustering algorithm are outlined as follows.
Given a large set of multidimensional data points, the data space is usually not
uniformly occupied by the data points. CLIQUE‘s clustering identifies the sparse and the
―crowded‖ areas in space (or units), thereby discovering the overall distribution patterns of the
data set. A unit is dense if the fraction of total data points contained in it exceeds an input
model parameter. In CLIQUE, a cluster is defined as a maximal set of connected dense units.

In the first step, CLIQUE partitions the d-dimensional data space into non
overlapping rectangular units, identifying the dense units among these. This is done (in 1-D)
for each dimension. For example, Figure 5.12 shows dense rectangular units found with respect
to age for the dimensions salary and (number of weeks of) vacation. The subspaces
representing these dense units are intersected to form a candidate search space in which dense
units of higher dimensionality may exist.
Figure 5.12 Dense units found with respect to age for the dimensions salary and vacation are
intersected in order to provide a candidate search space for dense units of higher
dimensionality.

CLIQUE confines its search for dense units of higher dimensionality to the
intersection of the dense units in the subspaces because the identification of the candidate
search space is based on the Apriori property used in association rule mining. In general, the
property employs prior knowledge of items in the search space so that portions of the space can
be pruned. The property, adapted for CLIQUE, states the following: If a k-dimensional unit is
dense, then so are its projections in (k−1)-dimensional space. That is, given a k-dimensional
candidate dense unit, if we check its (k−1)-dimensional projection units and find any that are not dense,
then we know that the k-dimensional unit cannot be dense either. Therefore, we can generate
potential or candidate dense units in k-dimensional space from the dense units found in (k −1)-
dimensional space. In general, the resulting space searched is much smaller than the original
space. The dense units are then
examined in order to determine the clusters.
In the second step, CLIQUE generates a minimal description for each cluster as
follows. For each cluster, it determines the maximal region that covers the cluster of connected
dense units. It then determines a minimal cover (logic description) for each cluster. CLIQUE
automatically finds subspaces of the highest dimensionality such that high-density clusters
exist in those subspaces. It is insensitive to the order of input objects and does not presume any
canonical data distribution. It scales linearly with the size of input and has good scalability as
the number of dimensions in the data is increased. However, obtaining meaningful clustering
results is dependent on proper tuning of the grid size (which is a stable structure here) and the
density threshold. This is particularly difficult because the grid size and density threshold are
used across all combinations of dimensions in the data set. Thus, the accuracy of the clustering
results may be degraded at the expense of the simplicity of the method.
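The sketch below captures only the core CLIQUE idea (it is not the full algorithm): each dimension is partitioned into intervals, dense 1-D units are found, and candidate 2-D units are generated only from intersections of dense 1-D units, in line with the Apriori property. The number of intervals and the density threshold are the two tuning parameters discussed above; their values here are illustrative.

# A minimal sketch of dense unit generation in the CLIQUE style.
import numpy as np
from itertools import combinations

def dense_units(X, n_intervals=10, threshold=0.05):
    n, d = X.shape
    # interval index of every point in every dimension
    idx = np.empty((n, d), dtype=int)
    for j in range(d):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), n_intervals + 1)
        idx[:, j] = np.clip(np.digitize(X[:, j], edges) - 1, 0, n_intervals - 1)

    # dense 1-D units: (dimension, interval) cells holding > threshold of the points
    dense1 = {(j, i) for j in range(d) for i in range(n_intervals)
              if np.mean(idx[:, j] == i) > threshold}

    # candidate 2-D units come only from pairs of dense 1-D units in different dimensions
    dense2 = set()
    for (j1, i1), (j2, i2) in combinations(sorted(dense1), 2):
        if j1 == j2:
            continue
        if np.mean((idx[:, j1] == i1) & (idx[:, j2] == i2)) > threshold:
            dense2.add(((j1, i1), (j2, i2)))
    return dense1, dense2

X = np.random.default_rng(0).normal(size=(500, 3))
d1, d2 = dense_units(X)
print(len(d1), "dense 1-D units;", len(d2), "dense 2-D units")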
5.9.2 PROCLUS: A Dimension-Reduction Subspace Clustering Method
PROCLUS (PROjected CLUStering) is a typical dimension-reduction subspace
clustering method. That is, instead of starting from single-dimensional spaces, it starts by
finding an initial approximation of the clusters in the high-dimensional attribute space. Each
dimension is then assigned a weight for each cluster, and the updated weights are used in the
next iteration to regenerate the clusters. This leads to the exploration of dense regions in all
subspaces of some desired dimensionality and avoids the generation of a large number of
overlapped clusters in projected dimensions of lower dimensionality.
PROCLUS finds the best set of medoids by a hill-climbing process similar to that
used in CLARANS, but generalized to deal with projected clustering. It adopts a distance
measure called Manhattan segmental distance, which is the Manhattan distance on a set of
relevant dimensions. The PROCLUS algorithm consists of three phases: initialization,
iteration, and cluster refinement. In the initialization phase, it uses a greedy algorithm to select
a set of initial medoids that are far apart from each other so as to ensure that each cluster is
represented by at least one object in the selected set. More concretely, it first chooses a random
sample of data points proportional to the number of clusters we wish to generate, and then
applies the greedy algorithm to obtain an even smaller final subset for the next phase. The
iteration phase selects a random set of k medoids from this reduced set (of medoids), and
replaces ―bad‖ medoids with randomly chosen new medoids if the clustering is improved. For
each medoid, a set of dimensions is chosen whose average distances are small compared to
statistical expectation. The total number of dimensions associated with the medoids must be k×l,
where l is an input parameter that selects the average dimensionality of cluster subspaces. The
refinement phase computes new dimensions for each medoid based on the clusters found,
reassigns points to medoids, and removes outliers.
Experiments on PROCLUS show that the method is efficient and scalable at finding
high-dimensional clusters. Unlike CLIQUE, which outputs many overlapped clusters,
PROCLUS finds nonoverlapped partitions of points. The discovered clusters may help better
understand the high-dimensional data and facilitate other subsequent analyses.
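For reference, a minimal sketch of the Manhattan segmental distance used by PROCLUS is given below: the Manhattan distance restricted to a cluster's set of relevant dimensions, averaged over the number of those dimensions. The points and dimension set are illustrative.

# A minimal sketch of the Manhattan segmental distance over relevant dimensions.
import numpy as np

def manhattan_segmental_distance(p, q, dims):
    """Average |p[d] - q[d]| over the set of relevant dimensions `dims`."""
    d = sorted(dims)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.abs(p[d] - q[d]).sum() / len(d)

p, q = [1.0, 8.0, 3.0, 4.0], [2.0, 0.0, 3.5, 6.0]
print(manhattan_segmental_distance(p, q, dims={0, 2, 3}))   # ignores dimension 1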

5.9.3 Frequent Pattern–Based Clustering Methods


Frequent pattern mining can be applied to clustering, resulting in frequent pattern–
based cluster analysis. Frequent pattern mining can lead to the discovery of interesting
associations and correlations among data objects. The idea behind frequent pattern–based
cluster analysis is that the frequent patterns discovered may also indicate clusters. Frequent
pattern–based cluster analysis is well suited to high-dimensional data. It can be viewed as an
extension of the dimension-growth subspace clustering approach.
Typical examples of frequent pattern–based cluster analysis include the clustering of
text documents that contain thousands of distinct keywords, and the analysis of microarray data
that contain tens of thousands of measured values or ―features.‖
In frequent term–based text clustering, text documents are clustered based on the
frequent terms they contain. Using the vocabulary of text document analysis, a term is any
sequence of characters separated from other terms by a delimiter. A term can be made up of a
single word or several words. In general, we first remove nontext information (such as HTML
tags and punctuation) and stop words. Terms are then extracted.
A stemming algorithm is then applied to reduce each term to its basic stem. In this
way, each document can be represented as a set of terms. Each set is typically large.
Collectively, a large set of documents will contain a very large set of distinct terms. If we treat
each term as a dimension, the dimension space will be of very high dimensionality! This poses
great challenges for document cluster analysis. The dimension space can be referred to as term
vector space, where each document is represented by a term vector. This difficulty can be
overcome by frequent term–based analysis. That is, by using an efficient frequent itemset
mining algorithm to mine a set of frequent terms from the set of text documents. Then, instead
of clustering on high-dimensional term vector space, we need only consider the low-
dimensional frequent term sets as ―cluster candidates.‖ Notice that a frequent term set is not a
cluster but rather the description of a cluster. The corresponding cluster consists of the set of
documents containing all of the terms of the frequent term set. A well-selected subset of the set
of all frequent term sets can be considered as a clustering.
Selecting a good subset of the set of all frequent term sets: This step is critical
because such a selection will determine the quality of the resulting clustering. Let Fi be a set of
frequent term sets and cov(Fi) be the set of documents covered by Fi. That is, cov(Fi) refers to
the documents that contain all of the terms in Fi. The general principle for finding a well-
selected subset, F1, ..., Fk, of the set of all frequent term sets is to ensure that (1) cov(F1) ∪ ... ∪ cov(Fk)
= D (i.e., the selected subset should cover all of the documents to be clustered); and (2) the
overlap between any two partitions, Fi and Fj (for i ≠ j), should be minimized. An overlap
measure based on entropy is used to assess cluster overlap by measuring the distribution of the
documents supporting some cluster over the remaining cluster candidates.
An advantage of frequent term–based text clustering is that it automatically
generates a description for the generated clusters in terms of their frequent term sets.
Traditional clustering methods produce only clusters—a description for the generated clusters
requires an additional processing step.
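A toy sketch of this idea follows: frequent term sets (restricted here to single terms and term pairs for brevity) are mined from a handful of documents, and each frequent term set is reported together with cov(Fi), the documents it covers. The documents and the minimum support value are invented for the example.

# A minimal sketch of frequent term-based cluster candidates for text documents.
from itertools import combinations

docs = [{"data", "mining", "cluster"}, {"data", "mining", "rule"},
        {"data", "cluster", "outlier"}, {"web", "page", "rank"},
        {"web", "page", "link"}]
min_support = 2

def cov(term_set):
    """Documents that contain every term of the term set."""
    return {i for i, d in enumerate(docs) if term_set <= d}

terms = sorted({t for d in docs for t in d})
frequent = [frozenset({t}) for t in terms if len(cov({t})) >= min_support]
frequent += [frozenset(p) for p in combinations(terms, 2)
             if len(cov(set(p))) >= min_support]

# Each frequent term set is a cluster candidate; its cluster is its covering document set.
for f in frequent:
    print(sorted(f), "-> documents", sorted(cov(set(f))))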

5.10 Constraint-Based Cluster Analysis

Constraint-based clustering finds clusters that satisfy user-specified preferences or


constraints. Depending on the nature of the constraints, constraint-based clustering may adopt
rather different approaches. Here are a few categories of constraints.
1. Constraints on individual objects: We can specify constraints on the objects to be
clustered. In a real estate application, for example, one may like to spatially cluster only those
luxury mansions worth over a million dollars. This constraint confines the set of objects to be
clustered. It can easily be handled by preprocessing (e.g., performing selection using an SQL
query), after which the problem reduces to an instance of unconstrained clustering.
2. Constraints on the selection of clustering parameters: A user may like to set a desired
range for each clustering parameter. Clustering parameters are usually quite specific to the
given clustering algorithm. Examples of parameters include k, the desired number of clusters in
a k-means algorithm; or ε (the radius) and MinPts (the minimum number of points) in the
DBSCAN algorithm. Although such user-specified parameters may strongly influence the
clustering results, they are usually confined to the algorithm itself. Thus, their fine tuning and
processing are usually not considered a form of constraint-based clustering.
3. Constraints on distance or similarity functions: We can specify different distance or
similarity functions for specific attributes of the objects to be clustered, or different distance
measures for specific pairs of objects. When clustering sportsmen, for example, we may use
different weighting schemes for height, body weight, age, and skill level. Although this will
likely change the mining results, it may not alter the clustering process per se. However, in
some cases, such changes may make the evaluation of the distance function nontrivial,
especially when it is tightly intertwined with the clustering process.
4. User-specified constraints on the properties of individual clusters: A user may like to
specify desired characteristics of the resulting clusters, which may strongly influence the
clustering process.
5. Semi-supervised clustering based on ―partial‖ supervision: The quality of unsupervised
clustering can be significantly improved using some weak form of supervision. This may be in
the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same or
different cluster). Such a constrained clustering process is called semi-supervised clustering.

Efficient constraint-based clustering methods can be developed for large data sets by handling such constraints directly within the clustering process rather than only as a preprocessing or postprocessing step.

5.11 Outlier Analysis


Outlier: Very often, there exist data objects that do not comply with the general behavior or
model of the data. Such data objects, which are grossly different from or inconsistent with the
remaining set of data, are called outliers.
Outliers can be caused by measurement or execution error. For example, the display
of a person‘s age as −999 could be caused by a program default setting of an unrecorded age.
Alternatively, outliers may be the result of inherent data variability. The salary of the chief
executive officer of a company, for instance, could naturally stand out as an outlier among the
salaries of the other employees in the firm.
Many data mining algorithms try to minimize the influence of outliers or eliminate
them all together. This, however, could result in the loss of important hidden information
because one person‘s noise could be another person‘s signal. In other words, the outliers may
be of particular interest, such as in the case of fraud detection, where outliers may indicate
fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task,
referred to as outlier mining.
Outlier mining Applications:
It can be used in fraud detection, for example, by detecting unusual usage of credit cards or
telecommunication services.
It is useful in customized marketing for identifying the spending behavior of customers with
extremely low or extremely high incomes, or in medical analysis for finding unusual responses
to various medical treatments.
Outlier mining Working:
Given a set of n data points or objects and k, the expected number of outliers, find the top k
objects that are considerably dissimilar, exceptional, or inconsistent with respect to the
remaining data. The outlier mining problem can be viewed as two subproblems:
(1) define what data can be considered as inconsistent in a given data set, and
(2) find an efficient method to mine the outliers so defined.
The problem of defining outliers is nontrivial. If a regression model is used for data modeling,
analysis of the residuals can give a good estimation for data ―extremeness.‖ The task becomes
tricky, however, when finding outliers in time-series data, as they may be hidden in trend,
seasonal, or other cyclic changes. When multidimensional data are analyzed, not any particular
one but rather a combination of dimension values may be extreme. For nonnumeric (i.e.,
categorical) data, the definition of outliers requires special consideration.
Using data visualization methods for outlier detection:
This may seem like an obvious choice, since human eyes are very fast and effective at noticing
data inconsistencies. However, this does not apply to data containing cyclic plots, where values
that appear to be outliers could be perfectly valid values in reality. Data visualization methods
are weak in detecting outliers in data with many categorical attributes or in data of high
dimensionality, since human eyes are good at visualizing numeric data of only two to three
dimensions.

Approaches for Outlier Detection


Outlier detection can be categorized into four approaches:
 Statistical approach
 Distance-based approach
 Density-based local outlier approach
 Deviation-based approach

5.11.1 Statistical Distribution-Based Outlier Detection


The statistical distribution-based approach to outlier detection assumes a distribution
or probability model for the given data set (e.g., a normal or Poisson distribution) and then
identifies outliers with respect to the model using a discordancy test. Application of the test
requires knowledge of the data set parameters (such as the assumed data distribution),
knowledge of distribution parameters (such as the mean and variance), and the expected
number of outliers.

Discordancy testing: A statistical discordancy test examines two hypotheses: a working


hypothesis and an alternative hypothesis. A working hypothesis, H, is a statement that the
entire data set of n objects comes from an initial distribution model, F, that is,
H : oi ∈ F, where i = 1, 2, ..., n.
The hypothesis is retained if there is no statistically significant evidence supporting


its rejection. A discordancy test verifies whether an object, oi, is significantly large (or small)
in relation to the distribution F. Different test statistics have been proposed for use as a
discordancy test, depending on the available knowledge of the data. Assuming that some
statistic, T , has been chosen for discordancy testing, and the value of the statistic for object oi
is vi, then the distribution of T is constructed. Significance probability, SP(vi) = Prob(T > vi), is
evaluated. If SP(vi) is sufficiently small, then oi is discordant and the working hypothesis is
rejected. An alternative hypothesis, H̄, which states that oi comes from another distribution
model, G, is adopted. The result is very much dependent on which model F is chosen because
oi may be an outlier under one model and a perfectly valid value under another.
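As an informal illustration, the sketch below assumes a normal working distribution F, uses the standardized deviation of each object as the test statistic T, and flags objects whose significance probability SP(vi) falls below a chosen level; the data values and the significance level are illustrative assumptions, and SciPy is assumed to be available.

# A minimal sketch of a discordancy test under an assumed normal distribution F.
import numpy as np
from scipy import stats

data = np.array([23.0, 25.1, 24.7, 26.0, 24.3, 25.5, 60.0])   # 60.0 is the suspect value
mu, sigma = data.mean(), data.std(ddof=1)

alpha = 0.05
for oi in data:
    vi = abs(oi - mu) / sigma                 # test statistic value for object oi
    sp = 2 * stats.norm.sf(vi)                # SP(vi) = Prob(|T| > vi), two-sided
    if sp < alpha:
        print(f"{oi} is discordant (SP = {sp:.4f}); reject the working hypothesis for it")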

The alternative distribution is very important in determining the power of the test, that is, the
probability that the working hypothesis is rejected when oi is really an outlier. There are
different kinds of alternative distributions.

Inherent alternative distribution: In this case, the working hypothesis that all of the objects
come from distribution F is rejected in favor of the alternative hypothesis that all of the objects
arise from another distribution, G:

F and G may be different distributions or differ only in parameters of the same distribution.
There are constraints on the form of the G distribution in that it must have potential to produce
outliers. For example, it may have a different mean or dispersion, or a longer tail.
Mixture alternative distribution: The mixture alternative states that discordant values are not
outliers in the F population, but contaminants from some other population, G. In this case, the
alternative hypothesis is

Slippage alternative distribution: This alternative states that all of the objects (apart from
some prescribed small number) arise independently from the initial model, F, with its given
parameters, whereas the remaining objects are independent observations from a modified
version of F in which the parameters have been shifted.

There are two basic types of procedures for detecting outliers:


 Block procedures: In this case, either all of the suspect objects are treated as outliers
or all of them are accepted as consistent.
 Consecutive (or sequential) procedures: An example of such a procedure is the inside-
out procedure. Its main idea is that the object that is least ―likely‖ to be an outlier is
tested first. If it is found to be an outlier, then all of the more extreme values are also
considered outliers; otherwise, the next most extreme object is tested, and so on. This
procedure tends to be more effective than block procedures.

Effectiveness of statistical approach in outlier detection: A major drawback is that most


tests are for single attributes, yet many data mining problems require finding outliers in
multidimensional space. Moreover, the statistical approach requires knowledge about
parameters of the data set, such as the data distribution. However, in many cases, the data
distribution may not be known. Statistical methods do not guarantee that all outliers will be
found for the cases where no specific test was developed, or where the observed distribution
cannot be adequately modeled with any standard distribution.

5.11.2 Density-Based Local Outlier Detection

Statistical and distance-based outlier detection both depend on the overall or


―global‖ distribution of the given set of data points, D. However, data are usually not
uniformly distributed. These methods encounter difficulties when analyzing data with rather
different density distributions, as illustrated in the following example.
Example:
Figure 5.13 shows a simple 2-D data set containing 502 objects, with two obvious clusters.
Cluster C1 contains 400 objects. Cluster C2 contains 100 objects. Two additional objects, o1
and o2 are clearly outliers. However, by distance-based outlier detection (which generalizes
many notions from statistical-based outlier detection), only o1 is a reasonable DB(pct, dmin)-
outlier, because if dmin is set to be less than the minimum distance between o2 andC2, then all
501 objects are further away from o2 than dmin. Thus, o2 would be considered a DB(pct,
dmin)- outlier, but so would all of the objects in C1! On the other hand, if dmin is set to be
greater than the minimum distance between o2 and C2, then even when o2 is not regarded as
an outlier, some points in C1 may still be considered outliers.

Figure 5.13 The necessity of density-based local outlier analysis.


This brings us to the notion of local outliers. An object is a local outlier if it is
outlying relative to its local neighborhood, particularly with respect to the density of the
neighborhood. In this view, o2 in Figure 5.13 is a local outlier relative to the density of C2.
Object o1 is an outlier as well, and no objects in C1 are mislabeled as outliers. This forms the
basis of density-based local outlier detection. Another key idea of this approach to outlier
detection is that, unlike previous methods, it does not consider being an outlier as a binary
property. Instead, it assesses the degree to which an object is an outlier. This degree of
―outlierness‖ is computed as the local outlier factor (LOF) of an object. It is local in the sense
that the degree depends on how isolated the object is with respect to the surrounding
neighborhood. This approach can detect both global and local outliers.
To define the local outlier factor of an object, the concepts of k-distance, k-distance
neighborhood, reachability distance and local reachability density need to be defined. These
are defined as follows:
 The k-distance of an object p is the maximal distance that p gets from its k-nearest
neighbors. This distance is denoted as k-distance (p). It is defined as the distance, d(p,
o), between p and an object o ∈ D, such that (1) for at least k objects, o′ ∈ D, it holds that
d(p, o′) ≤ d(p, o). That is, there are at least k objects in D that are as close as or closer
to p than o, and (2) for at most k−1 objects, o′′ ∈ D, it holds that d(p, o′′) < d(p, o). That
is, there are at most k−1 objects that are closer to p than o. The LOF method links to
density-based clustering in that it sets k to the parameter MinPts, which specifies the
minimum number of points for use in identifying clusters based on density. Here,
MinPts (as k) is used to define the local neighborhood of an object, p.
 The k-distance neighborhood of an object p is denoted Nk-distance(p)(p), or Nk(p) for short.
By setting k to MinPts, we get NMinPts(p). It contains the MinPts-nearest neighbors of p.
That is, it contains every object whose distance is not greater than the MinPts-distance
of p.
 The reachability distance of an object p with respect to object o (where o is within the
MinPts-nearest neighbors of p), is defined as reach_dist MinPts(p, o) =
max{MinPts-distance(o), d(p, o)}. Intuitively, if an object p is far away from o, then the
reachability distance between the two is simply their actual distance. However, if they
are ―sufficiently‖ close (i.e., where p is within the MinPts-distance neighborhood of o),
then the actual distance is replaced by the MinPts-distance of o. This helps to
significantly reduce the statistical fluctuations of d(p, o) for all of the p close to o. The
higher the value of MinPts is, the more similar is the reachability distance for objects
within the same neighborhood.
 Intuitively, the local reachability density of p is the inverse of the average reachability
distance based on the MinPts-nearest neighbors of p. It is defined as
lrd_MinPts(p) = |N_MinPts(p)| / Σ o∈N_MinPts(p) reach_dist_MinPts(p, o).
 The local outlier factor (LOF) of p captures the degree to which we call p an outlier. It
is defined as
LOF_MinPts(p) = [ Σ o∈N_MinPts(p) lrd_MinPts(o) / lrd_MinPts(p) ] / |N_MinPts(p)|.
 It is the average of the ratio of the local reachability density of p and those of p‘s
MinPts-nearest neighbors. It is easy to see that the lower p‘s local reachability density
is, and the higher the local reachability density of p‘s MinPts-nearest neighbors are, the
higher LOF(p) is.

From this definition, if an object p is not a local outlier, LOF(p) is close to 1. The more that p
is qualified to be a local outlier, the higher LOF(p) is. Therefore, we can determine
whether a point p is a local outlier based on the computation of LOF(p). Experiments
based on both synthetic and real-world large data sets have demonstrated the power of
LOF at identifying local outliers.

UNIT-V QUESTION BANK


PART A
1. What are the requirements of clustering?
2. What are the applications of spatial data bases?
3. What is text mining?
4. Distinguish between classification and clustering.
5. Define a Spatial database.
6. List out any two commercial data mining tools.
7. What is the objective function of K-means algorithm?
8. Mention the advantages of Hierarchical clustering.
9. Distinguish between classification and clustering.
10. List requirements of clustering in data mining.
11. What is web usage mining?
12. What are the requirements of clustering?
13. What are the applications of spatial databases?
14. What is text mining?
15. What is cluster analysis?
16. What are the two data structures in cluster analysis?
17. What is an outlier? Give example.
18. List two application of data mining.
19. Difference between density and grid clustering.
20. Define divisive hierarchical clustering.
21. Classify hierarchical clustering methods.
PART-B
1. BIRCH and CLARANS are two interesting clustering algorithms that perform effective
clustering in large data sets. (i) Outline how BIRCH performs clustering in large data sets.
(ii) Compare and outline the major differences of the two scalable clustering algorithms
BIRCH and CLARANS.
2. Write a short note on web mining taxonomy. Explain the different activities of text mining.
3. Discuss and elaborate the current trends in data mining.
4. Discuss spatial data bases and Text databases
5. What is a multimedia database? Explain the methods of mining multimedia database?
6. Explain the following clustering methods in detail. (i) BIRCH (ii) CURE
7. Discuss in detail about any four data mining applications.
8. Write short notes on (i) Partitioning methods (ii) Outlier analysis
9. Describe K means clustering with an example.
10. Describe in detail about Hierarchical methods.
11. The data mining task is to cluster the following eight points into three clusters: (16)
A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,9), B3(6,4), C1(1,2), C2(5,9). The distance function is
Euclidean distance. Suppose initially we assign A1, B1 and C1 as the center of each cluster,
respectively. Apply the K-means algorithm to show the three cluster centers after the first round
execution and the final three clusters.
12. Explain hierarchical and density based clustering methods with example.
13. Write the types of data in cluster analysis and explain.
14. What is an outlier? Explain outlier analysis with example.
15. Explain with an example density based outlier detection.
16. Discuss the following clustering algorithms with example 1. K-Means 2. K- Medoids
17. Explain the working of PAM algorithm.
18. Explain how data mining is used for intrusion detection.
19. Write the difference between CLARA and CLARANS.
