
Syllabus

UNIT I
Introduction: Fundamentals of data mining, Data Mining Functionalities, Classification of Data Mining Systems, Data Mining Task Primitives, Integration of a Data Mining System with a Database or a Data Warehouse System, Major Issues in Data Mining.
Data Preprocessing: Need for Preprocessing the Data, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization and Concept Hierarchy Generation.

UNIT II
Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, From Data Warehousing to Data Mining.
Data Cube Computation and Data Generalization: Efficient Methods for Data Cube Computation, Further Development of Data Cube and OLAP Technology, Attribute-Oriented Induction.


UNIT III
Mining Frequent Patterns, Associations and Correlations: Basic Concepts, Efficient and Scalable Frequent Itemset Mining Methods, Mining Various Kinds of Association Rules, From Association Mining to Correlation Analysis, Constraint-Based Association Mining.

UNIT IV
Classification and Prediction: Issues Regarding Classification and Prediction, Classification by Decision Tree Induction, Bayesian Classification, Rule-Based Classification, Classification by Backpropagation, Support Vector Machines, Associative Classification, Lazy Learners, Other Classification Methods, Prediction, Accuracy and Error Measures, Evaluating the Accuracy of a Classifier or a Predictor, Ensemble Methods.

UNIT V
Cluster Analysis Introduction: Types of Data in Cluster Analysis, A Categorization of Major Clustering Methods, Partitioning Methods, Hierarchical Methods, Density-Based Methods, Grid-Based Methods, Model-Based Clustering Methods, Clustering High-Dimensional Data, Constraint-Based Cluster Analysis, Outlier Analysis.

UNIT VI
Mining Streams, Time Series and Sequence Data: Mining Data Streams, Mining Time-Series Data, Mining Sequence Patterns in Transactional Databases, Mining Sequence Patterns in Biological Data, Graph Mining, Social Network Analysis and Multirelational Data Mining.

UNIT VII
Mining Object, Spatial, Multimedia, Text and Web Data: Multidimensional Analysis and Descriptive Mining of Complex Data Objects, Spatial Data Mining, Multimedia Data Mining, Text Mining, Mining the World Wide Web.

UNIT VIII
Applications and Trends in Data Mining: Data Mining Applications, Data Mining System Products and Research Prototypes, Additional Themes on Data Mining and Social Impacts of Data Mining.

Text Book: Data Mining – Concepts and Techniques, Jiawei Han & Micheline Kamber, Morgan Kaufmann Publishers, Elsevier, Second Edition, 2006.
UNIT – I
Data Mining: Concepts and Techniques & Data Preprocessing

About the Unit

Introduction
 Fundamentals of data mining
 Data Mining Functionalities
 Classification of Data Mining systems
 Data Mining Task Primitives
 Integration of a Data Mining System with a Database or Data Warehouse System
 Major issues in Data Mining

Data Preprocessing
 Need for Preprocessing the Data
 Data Cleaning
 Data Integration and Transformation
 Data Reduction
 Discretization and Concept Hierarchy Generation


Data
 Data are any facts, numbers, or text that can be processed by a computer

Information
 The patterns, associations, or relationships among all this data can provide information

Knowledge
 Information can be converted into knowledge about historical patterns and future trends

Evolution of Database Technology
 1960s:
 Data collection, database creation
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web databases
 2000s:
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems
What Is Data Mining?
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
 Watch out: Is everything "data mining"?
 Simple search and query processing
 (Deductive) expert systems

Data Mining Activity or Not?
 Dividing the products of an item according to their brand.
 Dividing the products of an item according to their profitability.
 Computing the total sales of a company.
 Sorting a student database based on student identification numbers.
 Predicting the future stock price of a company using historical records.
 Predicting the sales of an item using historical records.
 Monitoring the heart rate of a patient for abnormalities.

Why Data Mining?
 The Explosive Growth of Data: from terabytes to petabytes
 Data collection and data availability
 Automated data collection tools, database systems, Web, computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 "Necessity is the mother of invention"—Data mining—Automated analysis of massive data sets

Why Data Mining?—Potential Applications
 Data analysis and decision support
 Market analysis and management
 Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
 Risk analysis and management
 Forecasting, customer retention, improved underwriting, quality control, competitive analysis
 Fraud detection and detection of unusual patterns (outliers)
 Other Applications
 Text mining (news groups, email, documents) and Web mining
 Stream data mining
 Bioinformatics and bio-data analysis

Ex. 1: Market Analysis and Management
 Where does the data come from?—Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
 Target marketing
 Find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
 Determine customer purchasing patterns over time
 Cross-market analysis—Find associations/correlations between product sales, and predict based on such associations
 Customer profiling—What types of customers buy what products (clustering or classification)
 Customer requirement analysis
 Identify the best products for different groups of customers
 Predict what factors will attract new customers

Ex. 2: Corporate Analysis & Risk Management
 Finance planning and asset evaluation
 Cash flow analysis and prediction
 Contingent claim analysis to evaluate assets
 Cross-sectional and time series analysis (financial ratio, trend analysis, etc.)
 Resource planning
 Summarize and compare the resources and spending
 Competition
 Monitor competitors and market directions
 Group customers into classes and use a class-based pricing procedure
 Set pricing strategy in a highly competitive market


Ex. 3: Fraud Detection & Mining Unusual Patterns
 Approaches: clustering & model construction for frauds, outlier analysis
 Applications: health care, retail, credit card service, telecom.
 Auto insurance: ring of collisions
 Money laundering: suspicious monetary transactions
 Medical insurance
 Professional patients, ring of doctors, and ring of references
 Unnecessary or correlated screening tests
 Telecommunications: phone-call fraud
 Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
 Retail industry
 Analysts estimate that 38% of retail shrink is due to dishonest employees
 Anti-terrorism

Knowledge Discovery (KDD) Process
 Data mining—core of the knowledge discovery process
 Steps (bottom to top in the figure): Databases → Data Cleaning and Data Integration → Data Warehouse → Data Selection → Task-relevant Data → Data Transformation & Data Reduction → Data Mining → Pattern Evaluation → Knowledge Presentation
Architecture of a Typical Data Mining System
 Components (top to bottom in the figure): Graphical user interface → Pattern evaluation → Data mining engine → Database or data warehouse server (supported by a Knowledge base) → Data cleaning, data integration, and filtering → Databases and Data Warehouse

Data Mining: On What Kinds of Data?
 In principle, data mining should be applicable to any kind of data repository, as well as to transient data, such as data streams
 Database-oriented data sets and applications
 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structured data, graphs, social networks and multi-linked data
 Object-relational databases, heterogeneous databases and legacy databases, spatial data and spatiotemporal data
 Multimedia databases, text databases
 The World-Wide Web

On What Kinds of Data?

A relational database (DBMS)
 It is a database that groups data using common attributes found in the data set. The resulting "clumps" of organized data are much easier for people to understand.
 A relational database is a collection of tables, each of which is assigned a unique name.
 Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows)

Data warehouse (centralized data management and retrieval)
 It is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.
 A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site
 Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing
 A data warehouse is usually modeled by a multidimensional database structure – a data cube
 A data cube provides a multidimensional view of data and allows the pre-computation and fast accessing of summarized data.

Transactional databases
 Consists of a file where each record represents a transaction
 A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction (such as items purchased in a store).

Temporal Databases, Sequence Databases, and Time-Series Databases
 A temporal database is a database with built-in time aspects
 A sequence database stores sequences of ordered events, with or without a concrete notion of time
 Customer shopping sequences, Web click streams, and biological sequences
 A time-series database stores sequences of values or events obtained over repeated measurements of time (e.g., hourly, daily, weekly).
 Stock exchange, inventory control, and the observation of natural phenomena

Spatial Databases and Spatiotemporal Databases
 A spatial database is optimized to store and query data related to objects in space, including points, lines and polygons.
 Geographic (map) databases, very large-scale integration (VLSI) or computer-aided design databases, and medical and satellite image databases
 Spatiotemporal databases additionally store spatial-related information that changes over time


Text Databases and Multimedia Databases
 Text databases contain word descriptions for objects.
 They may be highly unstructured, semi-structured, or well structured
 Multimedia databases store image, audio, and video data.

Heterogeneous Databases and Legacy Databases
 A heterogeneous database consists of a set of interconnected, autonomous component databases
 A legacy database is a group of heterogeneous databases that combines different kinds of data systems
 Such as relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases

Data Mining: Classification Schemes
 Kinds of data patterns that can be mined:

Descriptive modeling
 Find human-interpretable patterns that describe the data
 Characterize the general properties of the data in the database
 • Clustering • Association rule discovery • Sequential pattern discovery • Deviation detection

Predictive modeling
 Use some variables (attributes) to predict another variable
 Perform inference on the current data in order to make predictions
 • Classification • Regression

Data Mining Functionalities

 Multidimensional concept description: characterization and discrimination
 Data can be associated with classes or concepts
 Classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders.
 Descriptions of a class or a concept are called class/concept descriptions.
 Data characterization
 It is a summarization of the general characteristics or features of a target class of data
 Data discrimination
 It is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes

 Frequent patterns, associations, correlations
 Patterns that occur frequently in data
 Itemsets
 An itemset typically refers to a set of items that frequently appear together
 Subsequences
 A frequently occurring subsequence is a sequential pattern
 Substructures
 Different structural forms (graphs, trees, lattices)
 Example association rule: buys(X, "computer")  buys(X, "software") [support = 1%; confidence = 50%]
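As an illustration of how the support and confidence of such a rule are obtained from transaction data, here is a minimal Python sketch; the tiny transaction list and item names are invented for the example, not taken from the slides:

```python
# Minimal sketch: support and confidence of the rule {computer} -> {software}
# over a toy list of transactions (the transactions themselves are invented).
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer"},
    {"printer", "scanner"},
]

antecedent, consequent = {"computer"}, {"software"}

n_total = len(transactions)
n_antecedent = sum(1 for t in transactions if antecedent <= t)
n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

support = n_both / n_total           # fraction of all transactions containing both itemsets
confidence = n_both / n_antecedent   # fraction of antecedent transactions that also contain the consequent

print(f"support = {support:.2f}, confidence = {confidence:.2f}")  # support = 0.50, confidence = 0.67
```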

 Classification and prediction
 Construct models (functions) that describe and distinguish classes or concepts for future prediction
 The derived model is based on the analysis of a set of training data (class label is known)
 The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks
 E.g., classify countries based on (climate), or classify cars based on (gas mileage)
 Prediction: predict some unknown or missing numerical values
 Models continuous-valued functions
 Regression analysis

 Cluster analysis
 Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
 Maximizing intra-class similarity & minimizing inter-class similarity

 Outlier analysis
 Outlier: a data object that does not comply with the general behavior of the data
 Rare events can be more interesting than the more regularly occurring ones
 Noise or exception? Useful in fraud detection, rare events analysis

 Trend and evolution analysis (objects whose behavior changes over time)
 Trend and deviation: e.g., regression analysis
 Sequential pattern mining: e.g., digital camera  large SD memory
 Periodicity analysis
 Similarity-based analysis

Top-10 Data Mining Algorithms
#1: C4.5
#2: K-Means
#3: SVM
#4: Apriori
#5: EM (Expectation Maximization)
#6: PageRank
#7: AdaBoost
#7: kNN
#7: Naive Bayes
#10: CART

Are All the "Discovered" Patterns Interesting?
 Data mining may generate thousands of patterns: not all of them are interesting
 What makes a pattern interesting?
 Can a data mining system generate all of the interesting patterns?
 Can a data mining system generate only interesting patterns?
 A pattern is interesting if it is:
 1) easily understood by humans,
 2) valid on new or test data with some degree of certainty,
 3) potentially useful, and
 4) novel


 Objective vs. subjective interestingness measures
 Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
 Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
 Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
 Find all the interesting patterns: completeness
 It is often unrealistic and inefficient for data mining systems to generate all of the possible patterns
 Can a data mining system find only the interesting patterns?
 Search for only interesting patterns: an optimization problem
 Approaches
 First generate all the patterns and then filter out the uninteresting ones
 Generate only the interesting patterns—mining query optimization

Data Mining: Confluence of Multiple Disciplines
 Database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines all contribute to data mining

Multi-Dimensional View of Data Mining
 Data to be mined
 Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
 Knowledge to be mined
 Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
 Multiple/integrated functions and mining at multiple levels
 Techniques utilized
 Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Mining Task Primitives
 Each user will have a data mining task in mind, that is, some form of data analysis that he or she would like to have performed.
 A data mining query is defined in terms of data mining task primitives

What Defines a Data Mining Task?
Data mining primitives define a data mining task, which can be specified in the form of a data mining query.

Task-relevant data (relevant attributes)
 This specifies the portions of the database or the set of data in which the user is interested.

Type of knowledge to be mined
 Characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis

Background knowledge
 Used in the discovery process and for evaluating the patterns found
 Concept hierarchies are a popular form of background knowledge - multiple levels of abstraction

Pattern interestingness measurements
 They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns - e.g., support and confidence
 1. Simplicity 2. Certainty (e.g., confidence) 3. Utility (e.g., support) 4. Novelty

Visualization of discovered patterns
 This refers to the form in which discovered patterns are to be displayed
An example using DMQL


Integration of a Data Mining System with a Database or Data Warehouse
 With the popular and diverse applications of data mining, it is expected that a good variety of data mining systems will be designed and developed.
 Comprehensive information processing and data analysis will be continuously and systematically surrounded by data warehouses and databases.
 A critical question in design is whether we should integrate data mining systems with database systems.
 This gives rise to four architectures:
- No coupling
- Loose coupling
- Semi-tight coupling
- Tight coupling

Architectures of Data Mining Systems
 No Coupling: the DM system will not utilize any functionality of a DB or DW system
 Loose Coupling: the DM system will use some facilities of the DB and DW systems, such as storing the data in either the DB or DW system and using these systems for data retrieval - main-memory based.
 Semi-tight Coupling: besides linking a DM system to a DB/DW system, efficient implementations of a few DM primitives are provided.
 Tight Coupling: the DM system is smoothly integrated with the DB/DW system. Each of DM and DB/DW is treated as a main functional component of an information retrieval system.

Major Issues in Data Mining
 Mining methodology
 Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
 Interactive mining of knowledge at multiple levels of abstraction
 Incorporation of background knowledge
 Data mining query languages and ad hoc data mining
 Presentation and visualization of data mining results
 Handling noisy or incomplete data
 Pattern evaluation—the interestingness problem
 Performance issues
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, and incremental mining algorithms
 Issues relating to the diversity of database types
 Handling of relational and complex types of data
 Mining information from heterogeneous databases and global information systems


Data Preprocessing
 Why preprocess the data?
 Descriptive data summarization
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation

Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
 e.g., occupation=" "
 noisy: containing errors or outliers
 e.g., Salary="-10"
 inconsistent: containing discrepancies in codes or names
 e.g., Age="42", Birthday="03/07/1997"
 e.g., was rating "1,2,3", now rating "A, B, C"
 e.g., discrepancy between duplicate records

Why Is Data Dirty?
 Incomplete data may come from
 "Not applicable" data values when collected (e.g., customer information for a sales transaction)
 Different considerations between the time when the data was collected and when it is analyzed
 Data may not be considered important at the time of entry
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission, limited buffer size
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning

Why Is Data Preprocessing Important?
 No quality data, no quality mining results!
 Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even misleading statistics.
 A data warehouse needs consistent integration of quality data
 Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse


Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files (remove redundancy)
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains a reduced representation in volume but produces the same or similar analytical results (data aggregation, attribute subset selection, numerosity reduction)
 Data discretization
 Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing
 (Figure: the slide illustrates each form of preprocessing—cleaning, integration, transformation, reduction—graphically)

Descriptive Data Summarization
 Descriptive data summarization techniques are used to identify the typical properties of the data and highlight which data values should be treated as noise or outliers.
 We want to learn about data characteristics regarding both the central tendency and the dispersion of the data.
 Measures of central tendency: mean, median, mode, and midrange.
 Dispersion, or variance, of the data is the degree to which numerical data tend to spread.
 Data dispersion measures: range, the five-number summary (based on quartiles), the interquartile range, and standard deviation.

Measuring the Central Tendency
 Mean (algebraic measure) (sample vs. population):
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$    $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
 Weighted arithmetic mean:
  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
 Trimmed mean: chopping extreme values
 Median: a holistic measure
 Middle value if odd number of values, or average of the middle two values otherwise
 Estimated by interpolation (for grouped data):
  $median = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{median}}\right) c$
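A minimal Python sketch of these central-tendency measures on a small, invented list of values (the numbers are only for illustration):

```python
# Sketch: mean, weighted mean and median for a small, invented sample.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
weights = [1] * len(data)               # equal weights; replace with real weights if available

mean = sum(data) / len(data)
weighted_mean = sum(w * x for w, x in zip(weights, data)) / sum(weights)

ordered = sorted(data)
n = len(ordered)
if n % 2:                               # odd number of values: take the middle value
    median = ordered[n // 2]
else:                                   # even: average the two middle values
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(mean, weighted_mean, median)      # 20.33..., 20.33..., 22.5
```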

Measuring the Central Tendency (contd.)
 Mode
 Value that occurs most frequently in the data
 Unimodal, bimodal, trimodal
 For unimodal frequency curves that are moderately skewed, we have the following empirical relation:
  mean – mode = 3 × (mean – median)

Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
 The median is the 50th percentile
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Interquartile range: IQR = Q3 – Q1
 Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1

Boxplot Analysis
 Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extend to Minimum and Maximum

Measuring the Dispersion of Data (contd.)
 Standard deviation s (or σ) is the square root of the variance s² (or σ²)
  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2\right]$
  $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2$
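A small Python sketch of the five-number summary, IQR-based outlier test, and standard deviation; the sample values are the same illustrative list used above, not data from the slides:

```python
# Sketch: five-number summary, IQR outlier rule and standard deviation.
import statistics

data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

q1, median, q3 = statistics.quantiles(data, n=4)   # quartiles (default "exclusive" method)
iqr = q3 - q1
five_number = (min(data), q1, median, q3, max(data))

# Usual outlier rule: more than 1.5 * IQR below Q1 or above Q3
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

sample_std = statistics.stdev(data)                # divides by n - 1
population_std = statistics.pstdev(data)           # divides by N

print(five_number, iqr, outliers, round(sample_std, 2), round(population_std, 2))
```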


Boxplot Analysis (example boxplots are shown on the slide)

Graphic Displays of Basic Descriptive Data Summaries
 Popular types of graphs for the display of data summaries and distributions
 Histograms
 Quantile plots
 q-q plots
 Scatter plots
 Loess curves

Histogram Analysis
 Graph displays of basic statistical class descriptions
 A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets
 Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
 Univariate graphical method


Quantile Plot
 Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
 Plots quantile information
 For data xi sorted in increasing order, fi indicates that approximately 100·fi% of the data are below or equal to the value xi

Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
 Allows the user to view whether there is a shift in going from one distribution to another

Scatter Plot
 Provides a first look at bivariate data to see clusters of points, outliers, etc.
 Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Loess Curve
 Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
 A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression


Data Cleaning
 Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration

Missing Data
 Data is not always available
 E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistency with other recorded data, and thus deletion
 data not entered due to misunderstanding
 certain data not being considered important at the time of entry
 not registering history or changes of the data
 Missing data may need to be inferred.

How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing (assuming the task is classification)—not effective when the percentage of missing values per attribute varies considerably.
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
 a global constant: e.g., "unknown", a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same class: smarter
 the most probable value: inference-based, such as a Bayesian formula or a decision tree

How to Handle Noisy Data?
 Noise: random error or variance in a measured variable
 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means or smooth by bin boundaries
 Regression
 smooth by fitting the data to regression functions
 Linear regression and multiple linear regression
 Clustering
 detect and remove outliers
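A minimal pandas sketch of two of these fill-in strategies (attribute mean and per-class attribute mean) on an invented table; the column names and values are assumptions for illustration only:

```python
# Sketch: filling missing values with the attribute mean and the per-class mean.
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30_000, None, 50_000, None, 70_000],
})

# Strategy 1: fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Strategy 2: fill with the attribute mean of samples in the same class (smarter)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(df)
```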



Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
 * Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
 * Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
 * Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

Regression:
 Data can be smoothed by fitting the data to a function, such as with regression
 Linear regression involves finding the "best" line to fit two attributes, so that one attribute can be used to predict the other
 Multiple linear regression is used where more than two attributes are involved and the data are fit to a multidimensional surface.
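A short Python sketch that reproduces this equal-frequency binning with smoothing by bin means and by bin boundaries (bin means are rounded here to match the slide's values):

```python
# Sketch: equal-frequency binning with smoothing by means and by boundaries.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
n_bins = 3
depth = len(prices) // n_bins                             # 4 values per bin

bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

by_boundaries = [
    [b[0] if x - b[0] <= b[-1] - x else b[-1] for x in b]  # snap each value to the closer boundary
    for b in bins
]

print(bins)            # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```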
Cluster Analysis (figure: the slide shows data grouped into clusters; values falling outside the clusters can be treated as outliers)

Data Cleaning as a Process
 The first step in data cleaning as a process is data discrepancy detection
 Discrepancies can arise from:
 Poorly designed data entry forms
 Human error in data entry
 Deliberate errors
 Data decay
 Inconsistent data representations and the use of codes
 How can we proceed with discrepancy detection?
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule and null rule


Data Cleaning as a Process (contd.)
 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
 Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
 The second step in data cleaning as a process is data transformation
 Data migration and integration
 Data migration tools: allow transformations to be specified (e.g., gender to sex)
 ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
 Integration of the two processes
 Iterative
 New approaches to data cleaning emphasize increased interactivity
 Potter's Wheel - a publicly available data cleaning tool

Data Preprocessing
 Data integration and transformation

Data Integration
 Data integration:
 Combines data from multiple sources into a coherent data store
 Issues to consider: schema integration & object matching
 How can entities from multiple data sources be matched up?
 Entity identification problem
 e.g., A.cust-id  B.cust-#
 Integrate metadata from different sources
 The solution is the metadata.
 Detecting and resolving data value conflicts
 Possible reasons: different representations, different scales, e.g., metric vs. British units
 E.g., price (dollars, euros), weight (kg, grams), total sales, etc.

Handling Redundancy in Data Integration
 Redundant data occur often when integrating multiple databases
 Object identification: the same attribute or object may have different names in different databases
 Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
 Redundant attributes may be detected by correlation analysis
 Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality


Correlation Analysis (Numerical Data)
 Correlation coefficient (also called Pearson's product moment coefficient):
  $r_{A,B} = \frac{\sum (A-\bar{A})(B-\bar{B})}{(n-1)\sigma_A \sigma_B} = \frac{\sum (AB) - n\bar{A}\bar{B}}{(n-1)\sigma_A \sigma_B}$
 where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-product.
 If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
 rA,B = 0: independent; rA,B < 0: negatively correlated

Correlation Analysis (Categorical Data)
 Χ² (chi-square) test:
  $\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$
 where Expected = count(A = ai) × count(B = bj) / N
 The larger the Χ² value, the more likely the variables are related
 The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count
 Correlation does not imply causality
 # of hospitals and # of car thefts in a city may be correlated
 Both are causally linked to a third variable: population
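A small NumPy sketch of the Pearson correlation coefficient rA,B defined above for numerical data; the two attribute columns are invented for illustration:

```python
# Sketch: Pearson's product moment correlation between two numeric attributes.
import numpy as np

A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # invented attribute values
B = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(A)
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

print(round(r, 3))                               # ~0.999, i.e. strongly positively correlated
print(round(float(np.corrcoef(A, B)[0, 1]), 3))  # same value from NumPy's built-in
```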

Chi-Square Calculation: An Example

                           Male       Female      Sum (row)
Like science fiction       250 (90)   200 (360)   450
Not like science fiction   50 (210)   1000 (840)  1050
Sum (col.)                 300        1200        1500

 Χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):
  $\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
 For this 2 × 2 table, the degrees of freedom are (2-1)(2-1) = 1.
 For 1 degree of freedom, the Χ² value needed to reject the hypothesis at the 0.001 significance level is 10.828 (taken from the table of upper percentage points of the χ² distribution, typically available from any textbook on statistics)
 Since 507.93 is far larger than 10.828, gender and preferred reading are correlated in the group
 So we can conclude that the two attributes are strongly correlated
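A short Python sketch that reproduces this worked example: expected counts from the row and column totals, then the Χ² statistic:

```python
# Sketch: chi-square statistic for the 2x2 "gender vs. preferred reading" table.
observed = [
    [250, 200],   # like science fiction:      male, female
    [50, 1000],   # does not like science fiction
]

row_sums = [sum(row) for row in observed]                  # [450, 1050]
col_sums = [sum(col) for col in zip(*observed)]            # [300, 1200]
total = sum(row_sums)                                      # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_sums[i] * col_sums[j] / total       # e.g. 450 * 300 / 1500 = 90
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))   # 507.94 (the slide rounds this to 507.93) -> far above 10.828
```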

Prepared by S.Palaniappan, Assoc.Prof 77 Prepared by S.Palaniappan, Assoc.Prof 78


GRIET GRIET

Data Transformation
 The data are transformed or consolidated into forms appropriate for mining
 Smoothing: remove noise from data
 Aggregation: summarization, data cube construction
 Generalization: low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies
 Normalization: scaled to fall within a small, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Attribute/feature construction
 New attributes constructed from the given ones

Min-Max Normalization
 Min-max normalization: maps to [new_minA, new_maxA]
  $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
 Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively.
 We would like to map income to the range [0.0, 1.0]
 Then $73,600 is mapped to
  $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
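A one-function Python sketch of min-max normalization, reproducing the income example above:

```python
# Sketch: min-max normalization of a value into [new_min, new_max].
def min_max(v, v_min, v_max, new_min=0.0, new_max=1.0):
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

# income attribute: min = 12,000, max = 98,000; map 73,600 into [0.0, 1.0]
print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
```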

Z-Score Normalization
 Z-score normalization (μ: mean, σ: standard deviation):
  $v' = \frac{v - \mu_A}{\sigma_A}$
 The values for an attribute, A, are normalized based on the mean and standard deviation of A
 Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to
  $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$

Normalization by Decimal Scaling
 Normalizes by moving the decimal point of values of attribute A.
 The number of decimal points moved depends on the maximum absolute value of A.
  $v' = \frac{v}{10^j}$
 Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986.
 To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
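A short Python sketch of these two normalizations, reproducing both worked examples:

```python
# Sketch: z-score normalization and normalization by decimal scaling.
import math

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(values):
    # j is chosen so that the largest absolute value moves below 1 (here j = 3)
    j = math.ceil(math.log10(max(abs(v) for v in values)))
    return [v / 10 ** j for v in values], j

print(z_score(73_600, 54_000, 16_000))       # 1.225

scaled, j = decimal_scaling([-986, 917])
print(j, scaled)                              # 3 [-0.986, 0.917]
```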

Data Reduction Strategies
 Why data reduction?
 A database/data warehouse may store terabytes of data
 Complex data analysis/mining may take a very long time to run on the complete data set
 Data reduction
 Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
 Data reduction strategies
 Data cube aggregation
 Attribute subset selection
 Dimensionality reduction — e.g., reduce the data size by encoding
 Numerosity reduction — e.g., fit data into models
 Discretization and concept hierarchy generation

Data Cube Aggregation
 Data cubes store multidimensional aggregated information
 The cube created at the lowest level of abstraction is the base cuboid
 The aggregated data for an individual entity of interest
 A cube at the highest level of abstraction is the apex cuboid.
 Multiple levels of aggregation in data cubes
 Further reduce the size of data to deal with
 Data cubes created for varying levels of abstraction are often referred to as cuboids, so that a data cube may instead refer to a lattice of cuboids.
 The smallest available cuboid relevant to the given task should be used

Attribute Subset Selection
 Feature selection
 Heuristic methods
 Attribute subset selection:
 The goal is to select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features
 Removes:
 Redundant features
 Irrelevant features


Heuristic Feature Selection Methods
 How can we find a 'good' subset of the original attributes?
 For n attributes, there are 2^n possible subsets
 1. Step-wise feature selection:
 The best single feature is picked first
 Then the next best feature conditioned on the first, ...
 2. Step-wise feature elimination:
 Repeatedly eliminate the worst feature
 3. Combined feature selection and elimination
 4. Decision tree induction
 (A sketch of step-wise forward selection follows.)
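A greedy step-wise (forward) selection sketch. The scoring function is a placeholder assumption; in practice it would be, for example, cross-validated classification accuracy or an information-gain style measure:

```python
# Sketch: greedy step-wise forward feature selection with a pluggable score() function.
def forward_selection(all_features, score, k):
    """Pick k features greedily; score(subset) is assumed to return 'higher is better'."""
    selected = []
    remaining = list(all_features)
    while remaining and len(selected) < k:
        # pick the feature whose addition gives the best score
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage with an invented score that simply prefers "income", then "age".
toy_scores = {"income": 3, "age": 2, "zip": 1, "id": 0}
score = lambda subset: sum(toy_scores[f] for f in subset)
print(forward_selection(["id", "zip", "age", "income"], score, k=2))   # ['income', 'age']
```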
Top-Down Induction of a Decision Tree
 Attributes = {Outlook, Temperature, Humidity, Wind}; PlayTennis = {yes, no}
 Example tree (from the figure): the root splits on Outlook (sunny / overcast / rain); the overcast branch predicts yes; the sunny branch splits on Humidity (high → no, normal → yes); the rain branch splits on Wind (strong → no, weak → yes)

Dimensionality Reduction (WT, PCA): Wavelet Transformation
 Discrete wavelet transform (DWT): linear signal processing, multi-resolutional analysis
 Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients
 Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
 Method (hierarchical pyramid algorithm):
 Length, L, must be an integer power of 2 (padding with 0's when necessary)
 Each transform has 2 functions: smoothing, difference
 Applies to pairs of data, resulting in two sets of data of length L/2
 Applies the two functions recursively, until it reaches the desired length
 Example wavelet families (from the figure): Haar-2, Daubechies-4

Dimensionality Reduction: Principal Component Analysis (PCA)
 Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
 Steps
 Normalize input data: each attribute falls within the same range
 Compute k orthonormal (unit) vectors, i.e., principal components
 Each input data vector is a linear combination of the k principal component vectors
 The principal components are sorted in order of decreasing "significance" or strength
 Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (i.e., using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
 Works for numeric data only
 Used when the number of dimensions is large
 (A small NumPy sketch of these steps follows the Numerosity Reduction overview below.)

Numerosity Reduction
 Reduce data volume by choosing alternative, smaller forms of data representation
 Parametric methods
 Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
 Example: Log-linear models—obtain a value at a point in m-D space as the product of appropriate marginal subspaces
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling
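A compact NumPy sketch of the PCA steps listed above (center the data, eigen-decompose the covariance matrix, keep the strongest components); the random data matrix is only for illustration:

```python
# Sketch: PCA via eigen-decomposition of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 data vectors, n = 5 dimensions
k = 2                                    # keep the k strongest components

Xc = X - X.mean(axis=0)                  # step 1: center (normalize) the input data
cov = np.cov(Xc, rowvar=False)           # covariance matrix of the attributes
eigvals, eigvecs = np.linalg.eigh(cov)   # orthonormal eigenvectors, ascending eigenvalues

order = np.argsort(eigvals)[::-1]        # sort components by decreasing "significance"
components = eigvecs[:, order[:k]]       # n x k matrix of principal components

Z = Xc @ components                      # reduced representation (100 x k)
X_approx = Z @ components.T + X.mean(axis=0)   # approximate reconstruction of X

print(Z.shape, X_approx.shape)           # (100, 2) (100, 5)
```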

Regression Analysis and Log-Linear Models
 Linear regression: Y = wX + b
 Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
 Multiple regression: Y = b0 + b1X1 + b2X2
 Many nonlinear functions can be transformed into the above
 Log-linear models:
 Approximate discrete multidimensional probability distributions
 Can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes
 Allow a higher-dimensional data space to be constructed from lower-dimensional spaces

Histograms
 Divide data into buckets and store the average (or sum) for each bucket
 Partitioning rules:
 Equal-width: equal bucket range (e.g., age ranges)
 Equal-frequency (or equal-depth)
 V-optimal: with the least histogram variance (weighted sum of the original values that each bucket represents)
 MaxDiff: set a bucket boundary between each pair of adjacent values for the pairs having the β–1 largest differences, where β is the user-specified number of buckets


Clustering
 Partition the data set into clusters based on similarity, and store the cluster representation only
 The "quality" of a cluster may be represented by its diameter & centroid distance
 In database systems, multidimensional index trees are primarily used for providing fast data access.
 An index tree recursively partitions the multidimensional space for a given set of data objects
 If each child of a parent node is regarded as a bucket, then an index tree can be considered a hierarchical histogram

Sampling
 Sampling: obtaining a small sample s to represent the whole data set D containing N tuples
 Simple random sample without replacement (SRSWOR) of size s:
 This is created by drawing s of the N tuples from D (s < N)
 Simple random sample with replacement (SRSWR) of size s:
 After a tuple is drawn, it is placed back in D so that it may be drawn again.
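A small Python sketch of the two sampling schemes; the toy "tuples" are just integers for illustration:

```python
# Sketch: simple random sampling without (SRSWOR) and with (SRSWR) replacement.
import random

D = list(range(1, 101))     # the data set: 100 toy tuples
s = 10                      # desired sample size

random.seed(42)
srswor = random.sample(D, s)              # without replacement: no tuple appears twice
srswr = random.choices(D, k=s)            # with replacement: a tuple may be drawn again

print(srswor)
print(srswr)
```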

 Cluster sample
 Apply SRSWOR to the pages, resulting in a cluster sample of the tuples
 Stratified sample:
 It is a method of sampling from a population
 It is the process of dividing members of the population into homogeneous subgroups before sampling

Discretization and Concept Hierarchy
 Discretization
 Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
 Interval labels can then be used to replace actual data values
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Concept hierarchy formation
 Recursively reduce the data by collecting and replacing low-level concepts by higher-level concepts


Discretization and Concept Hierarchy Generation for Numeric Data
 Typical methods (all the methods can be applied recursively):
 Binning
 Top-down split, unsupervised
 Histogram analysis
 Top-down split, unsupervised
 Clustering analysis
 Either top-down split or bottom-up merge, unsupervised
 Entropy-based discretization: supervised, top-down split
 Interval merging by χ² analysis: unsupervised, bottom-up merge
 Segmentation by natural partitioning: top-down split, unsupervised

Entropy-Based Discretization
 Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the information gain after partitioning is
  $I(S,T) = \frac{|S_1|}{|S|}\,Entropy(S_1) + \frac{|S_2|}{|S|}\,Entropy(S_2)$
 Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is
  $Entropy(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
 where pi is the probability of class i in S1
 The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
 The process is recursively applied to the partitions obtained until some stopping criterion is met
 Such a boundary may reduce data size and improve classification accuracy
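A minimal Python sketch of one entropy-based split: try each candidate boundary and keep the one that minimizes I(S, T). The small (value, class) samples are invented for illustration:

```python
# Sketch: choose the binary discretization boundary that minimizes I(S, T).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(samples):
    """samples: list of (value, class_label); returns (boundary, I(S, T))."""
    samples = sorted(samples)
    values = [v for v, _ in samples]
    labels = [c for _, c in samples]
    n = len(samples)
    best = None
    for i in range(1, n):                      # candidate boundary between samples i-1 and i
        t = (values[i - 1] + values[i]) / 2
        info = (i / n) * entropy(labels[:i]) + ((n - i) / n) * entropy(labels[i:])
        if best is None or info < best[1]:
            best = (t, info)
    return best

data = [(1, "no"), (2, "no"), (3, "no"), (7, "yes"), (8, "yes"), (9, "no")]
print(best_split(data))   # boundary 5.0 separates the classes best here
```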
Interval Merge by 2 Analysis Segmentation by Natural Partitioning
 Merging-based (bottom-up)
 A simply 3-4-5 rule can be used to segment numeric data into
 Merge: Find the best neighboring intervals and merge them to form larger
intervals recursively relatively uniform, “natural” intervals.

 ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]  If an interval covers 3, 6, 7 or 9 distinct values at the most
 Initially, each distinct value of a numerical attr. A is considered to be one significant digit, partition the range into 3 equi-width intervals
interval  If it covers 2, 4, or 8 distinct values at the most significant digit,
 2 tests are performed for every pair of adjacent intervals partition the range into 4 intervals
 Adjacent intervals with the least 2 values are merged together, since low 2  If it covers 1, 5, or 10 distinct values at the most significant digit,
values for a pair indicate similar class distributions partition the range into 5 intervals
 This merge process proceeds recursively until a predefined stopping
criterion is met (such as significance level, max-interval, max inconsistency,
etc.)
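A minimal Python sketch of one level of the 3-4-5 rule. It assumes, as a simplification, that the interval boundaries are already rounded at the most significant digit (as in Step 2 of the example on the next slide):

```python
# Sketch: one level of the 3-4-5 rule for segmenting a numeric range.
import math

def three_four_five(low, high):
    width = high - low
    msd = 10 ** int(math.floor(math.log10(width)))   # value of the most significant digit place
    distinct = round(width / msd)                     # distinct values covered at that digit
    if distinct in (3, 6, 7, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    else:                                             # 1, 5, or 10 distinct values
        parts = 5
    step = width / parts
    return [(low + i * step, low + (i + 1) * step) for i in range(parts)]

print(three_four_five(-1000, 2000))   # three equi-width intervals of width 1000
```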


Example of the 3-4-5 Rule
 Step 1: For the attribute profit: Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700
 Step 2: msd = 1,000, so Low is rounded to -$1,000 and High to $2,000
 Step 3: The interval (-$1,000 … $2,000) covers 3 distinct values at the msd, so it is partitioned into 3 equi-width intervals: (-$1,000 … $0], ($0 … $1,000], ($1,000 … $2,000]
 Step 4: The boundaries are adjusted to cover Min and Max, giving the top-level intervals (-$400 … $0], ($0 … $1,000], ($1,000 … $2,000], ($2,000 … $5,000]
 Each top-level interval is then recursively partitioned by the same rule (per the figure: (-$400 … $0] into 4 sub-intervals of width $100; ($0 … $1,000] and ($1,000 … $2,000] into 5 sub-intervals of width $200 each; ($2,000 … $5,000] into 3 sub-intervals of width $1,000)

Concept Hierarchy Generation for Categorical Data
 Categorical attributes have a finite number of distinct values, with no ordering among the values – e.g., item type, job category.
 Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by explicit data grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 vague idea about what should be included in a hierarchy
 E.g., only street < city, not others
Automatic Concept Hierarchy Generation
 Specification of a set of attributes, but not of their partial ordering
 A high concept level will usually contain a smaller number of distinct values than an attribute defining a lower concept level
 E.g., for a set of attributes: {street, city, state, country}
 country: 15 distinct values
 province_or_state: 365 distinct values
 city: 3,567 distinct values
 street: 674,339 distinct values
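A tiny Python sketch of this heuristic: order the attributes by their number of distinct values, with the fewest values at the top of the hierarchy. The counts are the ones quoted above:

```python
# Sketch: derive a hierarchy ordering from distinct-value counts (fewest values = highest level).
distinct_counts = {
    "street": 674_339,
    "city": 3_567,
    "province_or_state": 365,
    "country": 15,
}

hierarchy = sorted(distinct_counts, key=distinct_counts.get)   # highest concept level first
print(" < ".join(reversed(hierarchy)))   # street < city < province_or_state < country
```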



Exercises: Finding the median, quartiles and inter-quartile range

Example 1: Find the median and quartiles for the data below.
12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10

Example 2: Find the median and quartiles for the data below.
6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10

Exercise: Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order):
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
 (a) What is the mean of the data? What is the median?
 (b) What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
 (c) What is the midrange of the data?
 (d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
 (e) Give the five-number summary of the data.
 (f) Show a boxplot of the data.
 (g) How is a quantile-quantile plot different from a quantile plot?

Exercise (continuing with the age data above):
 (a) Use smoothing by bin means to smooth the data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.
 (b) How might you determine outliers in the data?

Exercise: Using the data for age, answer the following:
 (a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
 (b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94 years.
 (c) Use normalization by decimal scaling to transform the value 35 for age.

Exercise: Use the two methods below to normalize the following group of data:
200, 300, 400, 600, 1000
 (a) min-max normalization by setting min = 0 and max = 1
 (b) z-score normalization
