Information Technology Fundamentals: CCIT4085
2.2 Data Processing, Modeling and Analysis
23-24s1
▪ Knowledge Discovery in Databases (KDD)
▪ Data selection
▪ Data pre-processing
▪ Data transformation
▪ Data mining
▪ Business Intelligence
(Recall from 1.1)
Data and Information
▪ Data: raw facts, with no meaning by themselves
▪ e.g. values in a spreadsheet
KNOWLEDGE DISCOVERY IN
DATABASES (KDD)
• Massive amounts of data are being collected and stored in
databases.
▪ E.g.
▪ web data, e-commerce
▪ purchasing records in business
▪ bank/credit card transactions
▪ medical records, biological studies
2. DATA PREPROCESSING
2. DATA PREPROCESSING –
DATA INTEGRATION
• Combine data from multiple sources into a coherent store
• Combine two or more attributes into one
3. DATA TRANSFORMATION
Equi-depth binning example — bin boundaries: 0-21, 22-31, 32-37, 38-43, 44-47, 48-54, 55-61, 62-80
Equal Width Binning
▪ A method that divides the data range into k intervals of equal width.
Equal Width Binning
▪ Data: 0, 4, 12, 16, 16, 18, 24, 26, 28;
▪ We first determine the (equal) interval width before filling each bin, e.g. with k = 3 bins the width is (28 − 0) / 3 ≈ 9.33;
▪ Note that the width of each interval is fixed, but the number of data items per bin is not;
▪ Then we count the number of items in each interval (namely a conversion of continuous data to discrete data).
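The steps above can be sketched in Python (a minimal sketch; the function name and the choice of k = 3 are illustrative, not from the slides):

```python
def equal_width_bins(data, k):
    """Split the data range into k intervals of equal width and count items per bin."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k              # the interval width is fixed
    counts = [0] * k
    for x in data:
        # items equal to the maximum fall into the last bin
        i = min(int((x - lo) / width), k - 1)
        counts[i] += 1
    edges = [lo + i * width for i in range(k + 1)]
    return edges, counts

edges, counts = equal_width_bins([0, 4, 12, 16, 16, 18, 24, 26, 28], 3)
print(counts)  # [2, 4, 3] — bin sizes vary even though the widths are equal
```

Note how the bin populations (2, 4, 3) are unequal: equal width fixes the intervals, not the counts.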
Equal Depth (Frequency)
Binning
▪ This method divides data into k groups where each group
contains approximately the same number of values.
Equal Depth/Frequency
Binning
▪ Data: 0, 4, 12, 16, 16, 18, 24, 26, 28;
▪ Assume k = 3, so each bin holds approximately 3 data items (equal depth/frequency);
▪ We decide which 3 data items go into each bin before determining each interval;
▪ Note that the interval width is not fixed, but the number of items in each bin is;
▪ Then we can determine the width of each interval (a conversion of continuous data to discrete data).
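The equal-depth procedure can likewise be sketched in Python (again a minimal sketch; the function name is illustrative):

```python
def equal_depth_bins(data, k):
    """Split sorted data into k bins, each holding roughly the same number of items."""
    s = sorted(data)
    size = len(s) // k                   # items per bin (approximately)
    bins = [s[i * size:(i + 1) * size] for i in range(k - 1)]
    bins.append(s[(k - 1) * size:])      # the last bin takes any remainder
    return bins

print(equal_depth_bins([0, 4, 12, 16, 16, 18, 24, 26, 28], 3))
# [[0, 4, 12], [16, 16, 18], [24, 26, 28]]
```

Here the counts are fixed at 3 per bin, while the resulting interval widths (0–12, 16–18, 24–28) differ.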
[Figure: splitting the continuous attribute Taxable Income — a binary split (Taxable Income > 80K? Yes/No) versus a multi-way split into ranges (e.g. < 10K, …, ≥ 80K)]
EXERCISES

Outlook   Go to Play  Temperature  Humidity  Windy
Sunny     No          35           high      No
Sunny     No          34           high      Yes
Overcast  Yes         32           high      N
Rain      Yes         25           high      N
Rain      Yes         18           normal    N
Rain      N           16           normal    true
Overcast  Y           15                     true
Sunny     No          23           high      false
Sunny     Yes         17           normal    false
Rain      Yes         24           normal    false
Sunny     Yes         26           normal    true
Overcast  Yes         23           high      true
Overcast  Yes         33           normal    false
Rain      No          25           high      True
                      20           normal

(The inconsistent and missing values above are reproduced from the original data set.)

1. How many attributes in the data set?
2. How many records in the data set?
3. Identify problem(s) in the data set, and provide possible solution(s).
4. Perform discretization on the attribute temperature with equal-width binning.
WHAT IS DATA MINING?
▪ The process of discovering useful patterns and knowledge from large amounts of data — the core analysis step of the KDD process.
WHAT IS MACHINE LEARNING?
▪ The study of algorithms that learn models from data rather than being explicitly programmed; many data mining techniques are machine learning techniques.
DATA MODELING AND
ANALYSIS
• Classification (predict an item class)
• Association (correlations between items)
• Clustering (finding groups)
CLASSIFICATION
• Given a collection of records (training set)
▪ Each record contains a set of attributes, one of the
attributes is defined as class (target).
• Classification algorithms are used to develop models for
predicting the class from the other attributes.
• To determine the accuracy of the model, a test set is used.
EXAMPLE OF CLASSIFICATION
TASKS
• Classify credit card transactions as legitimate or fraudulent
• Predict customers’ spending
• Predict patients’ risk of lung cancer
• Classify an email as spam or not
• Detect illegal cargo in imports and exports
• Predict molecular toxicology
• Identify loyalty customers
• etc.
CLASSIFICATION MODELING
• Decision Tree
• Nearest Neighbor
• Bayesian
• Artificial neural network
• Support Vector Machines
• Random Forest
• etc.
CLASSIFICATION MODELING – DECISION TREE

Training Set (80% of the data):

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No

Induction: a tree induction algorithm learns a model (a decision tree) from the training set.

Test Set (20% of the data):

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?

Apply Model: the learned decision tree is then applied to predict the class of the test records.
CLASSIFICATION MODELING – DECISION TREE

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree

MarSt
├─ Married → NO
└─ Single, Divorced → Refund
   ├─ Yes → NO
   └─ No → TaxInc
      ├─ < 80K → NO
      └─ ≥ 80K → YES

There could be more than one tree/model that fits the same data!
CLASSIFICATION MODELING – DECISION TREE

Start from the root of the tree and, for each test record, follow the branch that matches its attribute value at every node.

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             No

Refund
├─ Yes → NO
└─ No → MarSt
   ├─ Married → NO
   └─ Single, Divorced → TaxInc
      ├─ < 80K → NO
      └─ ≥ 80K → YES

For this record: Refund = No → MarSt = Married → predict Cheat = NO, which matches the actual class.
CLASSIFICATION MODELING – DECISION TREE

Validate the decision tree model with the test data; the validated model can then classify new data.

New Data:

Refund  Marital Status  Taxable Income  Cheat
No      Single          100K            ?

Following the tree: Refund = No → MarSt = Single → TaxInc = 100K ≥ 80K → predict Cheat = YES.

Rules can also be read directly off the tree, e.g. (Refund=Yes) ==> No.
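The tree traversal described above can be written directly as nested conditions. A minimal Python sketch (the function name and the string encoding of attribute values are illustrative; the splits come from the tree on these slides):

```python
def classify(refund, marital_status, taxable_income):
    """Apply the decision tree: root = Refund, then MarSt, then TaxInc at 80K."""
    if refund == "Yes":
        return "No"                    # the rule (Refund=Yes) ==> No
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income (in thousands) at 80K
    return "Yes" if taxable_income >= 80 else "No"

# Test record from the slides: Refund=No, Married, 80K -> No
print(classify("No", "Married", 80))   # No
# New data: Refund=No, Single, 100K -> Yes
print(classify("No", "Single", 100))   # Yes
```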
CLASSIFICATION MODELING – DECISION TREE

If multiple trees are created, which one is better? Compare the class purity of the leaves: a nearly pure leaf (e.g. YES: 10%, NO: 90%) gives a much more confident prediction than an evenly mixed one (e.g. YES: 50%, NO: 50%).

[Figure: two candidate trees — one rooted at Refund, one at MarSt — with leaf class distributions such as YES: 10% / NO: 90%, YES: 50% / NO: 50%, YES: 60% / NO: 40%, and YES: 0%]
ASSOCIATION MODELING (EXTRA)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Itemset
▪ A collection of one or more items
▪ E.g. {Milk, Diaper}
▪ k-itemset — an itemset that contains k items
▪ E.g. {Milk, Diaper} is a 2-itemset

Support count (σ)
▪ Frequency of occurrence of an itemset
▪ E.g. σ({Milk, Diaper}) = 3

Support
▪ Fraction of transactions that contain an itemset
▪ E.g. support({Milk, Diaper}) = 3/5 = 0.6

Frequent Itemset
▪ An itemset whose support is greater than or equal to a minimum support threshold (e.g. 0.5)
ASSOCIATION MODELING (EXTRA)
Basic steps:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ min support
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
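Step 1 can be sketched as a brute-force enumeration over the five-transaction example. (Real miners such as Apriori or FP-Growth prune the search space instead of enumerating everything; the function name here is illustrative.)

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def frequent_itemsets(transactions, min_support):
    """Enumerate every candidate itemset; keep those with support >= min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            support = sum(1 for t in transactions if set(cand) <= t) / n
            if support >= min_support:
                frequent[frozenset(cand)] = support
    return frequent

freq = frequent_itemsets(transactions, 0.5)
print(freq[frozenset({"Milk", "Diaper"})])  # 0.6
```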
ASSOCIATION MODELING (EXTRA)

Rule Evaluation Metrics — example: {Milk, Diaper} → Beer

– Support (s)
◆ Fraction of transactions that contain both X and Y
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4

– Confidence (c)
◆ Measures how often items in Y appear in transactions that contain X
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
ASSOCIATION MODELING (EXTRA)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
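The support and confidence figures above can be reproduced directly from the transaction table (a minimal sketch; the function name is illustrative):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def rule_metrics(transactions, lhs, rhs):
    """Support = fraction containing lhs ∪ rhs; confidence = σ(lhs ∪ rhs) / σ(lhs)."""
    both = sum(1 for t in transactions if lhs | rhs <= t)
    lhs_count = sum(1 for t in transactions if lhs <= t)
    return both / len(transactions), both / lhs_count

s, c = rule_metrics(transactions, {"Milk", "Diaper"}, {"Beer"})
print(round(s, 2), round(c, 2))  # 0.4 0.67
```

Running it on {Milk, Beer} → {Diaper} gives s = 0.4, c = 1.0, matching the rule list above.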
CLUSTERING
Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different from
(or unrelated to) the objects in other groups.
▪ Intra-cluster distances are minimized
▪ Inter-cluster distances are maximized
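One common clustering algorithm (not covered in detail on these slides) is k-means, which alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its cluster. A minimal 1-D sketch — the deterministic initialization, the sample data, and k ≥ 2 are illustrative assumptions:

```python
def kmeans_1d(points, k, iters=10):
    """Minimal 1-D k-means: intra-cluster distances shrink as centroids settle."""
    s = sorted(points)
    # simple deterministic initialization: spread centroids across the data range
    centroids = [s[i * (len(s) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in s:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return clusters

data = [1, 2, 3, 20, 21, 22, 50, 51, 52]
print(kmeans_1d(data, 3))  # [[1, 2, 3], [20, 21, 22], [50, 51, 52]]
```

The three groups it recovers are exactly the visually obvious ones: points within a group are close together, and the groups are far apart.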
CLUSTERING
How many clusters? [Figure: scatter plot of points — one possible grouping is four clusters]
Business Intelligence (BI)
▪ Business intelligence (BI) is a technology-driven process for
analyzing data and delivering actionable information that helps
executives, managers and workers make informed business
decisions.
▪ The BI process includes:
▪ collecting data from internal IT systems and external sources, and
preparing it for analysis
▪ running queries against the data and creating data visualizations, e.g.
dashboards and reports to make the analytics results available to
business users for operational decision-making and strategic planning.
▪ The ultimate goal is to drive better business decisions that enable
organizations to increase revenue, improve operational efficiency
and gain competitive advantages over business rivals.
Benefits of BI in various
industries
• Identify profitable customers and devise strategies to mitigate
churn. Create robust marketing plans to attract new customers.
• Conduct accurate creditworthiness assessments of customers to gauge
customer profitability and curb fraudulent behavior.
• Determine what purchases customers are willing to make and
when they want to buy based on past activities.
• In the insurance industry, set precise rates for premiums.
• In the medical industry, analyze clinical trials for new drugs and
compounds.
• Leverage predictive maintenance to reduce equipment downtime.