
CCIT4085
Information Technology Fundamentals
2.2 Data Processing, Modeling and Analysis
23-24s1
Data Processing, Modeling and Analysis
▪ Knowledge Discovery in Databases (KDD)
  ▪ Data selection
  ▪ Data pre-processing
  ▪ Data transformation
  ▪ Data mining
▪ Data modeling and analysis
  ▪ Classification – Decision Tree
  ▪ Association
▪ Business Intelligence
(Recall from 1.1)
Data and Information
▪ Data: raw facts with no meaning on their own, e.g. data in a spreadsheet
▪ Information: data that are processed and integrated to be meaningful, e.g. average, maximum, differences, chart, etc.
▪ Knowledge: useful patterns from organized data and information, e.g. IF <condition(s)> THEN <result>
▪ Wisdom: evaluating the discovered knowledge and making recommendations for the future

Russ Ackoff, “From Data to Wisdom”, Journal of Applied Systems Analysis, Volume 16, 1989, pp. 3-9
KNOWLEDGE DISCOVERY IN DATABASES (KDD)
KDD refers to the overall process of discovering useful knowledge from data; data mining refers to a particular step in this process.
KNOWLEDGE DISCOVERY IN DATABASES (KDD)
• Massive amounts of data are being collected and stored in databases, e.g.
  ▪ web data, e-commerce
  ▪ purchasing records in business
  ▪ bank/credit card transactions
  ▪ medical records, biological studies
• KDD is the process of discovering hidden knowledge in data.
• Data is in raw form (simply a collection of elements).
1. DATA SELECTION
• What data is available for the task?
• Is the data relevant?
• Is additional relevant data available?
• How much historical data is available?
• Who is the data expert?

Example records (attributes: Tid, Refund, Marital Status, Taxable Income, Cheat):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
2. DATA PREPROCESSING
• Data in the real world is dirty:
  ▪ incomplete data
  ▪ duplicate records
  ▪ noise (errors or outliers)
  ▪ inconsistency (e.g. 2015/09/01, 1 Sep 2015, 1/9/2015)
• Low quality data, low quality mining results!
• Major tasks: Data Cleaning and Data Integration

Example of dirty data:

SID       Age  Gender  GPA
20000001       M       3.8
20000002  18   Female  3.9
20000003  10   F
20000003  19   Female  3.2

(missing values, an outlier age, inconsistent gender formats, and a duplicate SID)
2. DATA PREPROCESSING – DATA CLEANING
• Fill in missing values
  ▪ ignore, fill manually, global constant, mean value, estimation
• Identify or remove outliers
  ▪ define with domain knowledge and common sense
• Resolve inconsistencies
• Unify the format, e.g.
  ▪ use M/F instead of Male/Female
  ▪ use YYYYMMDD or the database date format instead of free text
• Use standard file formats (csv, tab, excel, etc.)
(A sketch of these steps in code follows below.)
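To make these steps concrete, here is a minimal pandas sketch of the cleaning tasks above, using the dirty student table from the previous slide; the exact fill, format and outlier rules are illustrative assumptions, not prescribed by the slides:

import pandas as pd

# Hypothetical dirty student records (from the earlier example slide)
df = pd.DataFrame({
    "SID":    [20000001, 20000002, 20000003, 20000003],
    "Age":    [None, 18, 10, 19],
    "Gender": ["M", "Female", "F", "Female"],
    "GPA":    [3.8, 3.9, None, 3.2],
})

df = df.drop_duplicates(subset="SID", keep="last")   # remove duplicate records
df["Age"] = df["Age"].fillna(df["Age"].mean())       # fill missing values with the mean
df["Gender"] = df["Gender"].str[0].str.upper()       # unify format: Male/Female -> M/F
df = df[df["Age"].between(15, 80)]                   # drop outliers (domain knowledge)
df.to_csv("students_clean.csv", index=False)         # save in a standard format (csv)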
2. DATA PREPROCESSING – DATA INTEGRATION
• Combine data from multiple sources into a coherent store
• Combine two or more attributes into one (see the sketch below)
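A minimal pandas sketch of both ideas; the table and column names (students/grades, First/Last) are hypothetical:

import pandas as pd

students = pd.DataFrame({"SID": [1, 2], "First": ["Amy", "Bob"], "Last": ["Chan", "Lee"]})
grades   = pd.DataFrame({"SID": [1, 2], "GPA": [3.8, 3.2]})

# Combine data from multiple sources into one coherent table (join on a shared key)
merged = pd.merge(students, grades, on="SID")

# Combine two attributes into one
merged["Name"] = merged["First"] + " " + merged["Last"]
print(merged[["SID", "Name", "GPA"]])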
3. DATA TRANSFORMATION
• Conversion of data values from continuous to discrete
• Feature extraction (e.g. AM/PM from time)
• Data discretization (split into partitions)
  ▪ Equal-width binning (e.g. 1-10, 11-20, 21-30)
  ▪ Equal-depth binning (each range has approximately the same number of samples)

Example of equal-depth binning intervals: 0-21, 22-31, 32-37, 38-43, 44-47, 48-54, 55-61, 62-80
Equal Width Binning
▪ A method to divide data into k intervals of equal size.
▪ The width of each interval is:
  w = (max – min)/k
▪ The interval boundaries are:
  min+w, min+2w, …, min+(k-1)w
Equal Width Binning
▪ Data: 0, 4, 12, 16, 16, 18, 24, 26, 28
▪ Assume k=3; then the approximate width is (28-0)/3 = 9.33 ≈ 10
▪ We determine the (equal-width) intervals first, then fill each bin
▪ Note that the width of each interval is fixed, but the number of data items per bin is not
▪ Then we count the number of items in each interval (namely a conversion of continuous data to discrete data)

Bin                    Interval   No. of items in interval
Bin-1: 0, 4            [0 - 10)   2
Bin-2: 12, 16, 16, 18  [10 - 20)  4
Bin-3: 24, 26, 28      [20 - 30)  3
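A small pure-Python sketch of equal-width binning that reproduces the counts above (the function name is ours; exact widths are used rather than the slide's rounded 10):

def equal_width_bins(data, k):
    """Split data into k bins of equal width; return (interval, items) pairs."""
    lo, hi = min(data), max(data)
    w = (hi - lo) / k                      # w = (max - min) / k
    bins = []
    for i in range(k):
        left, right = lo + i * w, lo + (i + 1) * w
        # the last bin is closed on the right so max(data) is included
        items = [x for x in data if left <= x < right or (i == k - 1 and x == hi)]
        bins.append(((left, right), items))
    return bins

data = [0, 4, 12, 16, 16, 18, 24, 26, 28]
for (left, right), items in equal_width_bins(data, 3):
    print(f"[{left:.2f} - {right:.2f}): {items} ({len(items)} items)")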
Equal Depth (Frequency) Binning
▪ This method divides data into k groups where each group contains approximately the same number of values.
▪ For both methods, the best way to determine k is to look at a histogram and try different intervals or groups.
Equal Depth/Frequency Binning
▪ Data: 0, 4, 12, 16, 16, 18, 24, 26, 28
▪ Assume k=3, so each bin holds approximately 3 data items (equal depth/frequency)
▪ We decide which 3 data items go into each bin first, then determine each interval
▪ Note that the interval width is not fixed, but the number of items in each bin is
▪ Then we can determine the width of each interval (a conversion of continuous data to discrete data)

Bin                Interval   Width of interval
Bin-1: 0, 4, 12    [0 - 14)   14
Bin-2: 16, 16, 18  [14 - 21)  7
Bin-3: 24, 26, 28  [21 - 30)  9
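A matching sketch for equal-depth binning on the same data; the boundary rule (midpoint between neighbouring bins) is an assumption of ours that happens to reproduce the slide's intervals:

def equal_depth_bins(data, k):
    """Split sorted data into k bins of ~equal counts; boundaries at midpoints."""
    data = sorted(data)
    n = len(data)
    # slice indices so each bin gets roughly n/k items
    groups = [data[i * n // k:(i + 1) * n // k] for i in range(k)]
    # boundary between two bins = midpoint of the neighbouring values
    edges = [data[0]]
    for left, right in zip(groups, groups[1:]):
        edges.append((left[-1] + right[0]) / 2)
    edges.append(data[-1] + 2)   # pad the top edge slightly (the slide uses 30)
    return list(zip(groups, zip(edges, edges[1:])))

data = [0, 4, 12, 16, 16, 18, 24, 26, 28]
for items, (lo, hi) in equal_depth_bins(data, 3):
    print(f"{items} -> [{lo} - {hi}), width {hi - lo}")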
3. DATA TRANSFORMATION

(i) Binary split:     Taxable Income > 80K?  →  Yes / No
(ii) Multi-way split: Taxable Income?  →  < 10K, [10K,25K), [25K,50K), [50K,80K), ≥ 80K

Which split is better? It depends on the usage of the data.
EXERCISES

Outlook   Go to Play  Temperature  Humidity  Windy
Sunny     No          35           high      No
Sunny     No          34           high      Yes
Overcast  Yes         32           high      N
Rain      Yes         25           high      N
Rain      Yes         18           normal    N
Rain      N           16           normal    true
Overcast  Y           15                     true
Sunny     No          23           high      false
Sunny     Yes         17           normal    false
Rain      Yes         24           normal    false
Sunny     Yes         26           normal    true
Overcast  Yes         23           high      true
Overcast  Yes         33           normal    false
Rain      No          25           high      True
                      20           normal

1. How many attributes are in the data set?
2. How many records are in the data set?
3. Identify problem(s) in the data set, and provide possible solution(s).
4. Perform discretization on the attribute temperature with equal-width binning.
WHAT IS DATA MINING?
▪ The process of discovering patterns in large data sets involving methods at the intersection of:
  - Machine Learning,
  - Statistics, and
  - Database systems

[Figure: Venn diagram of Statistics, Machine Learning and Databases, with Data Mining at their intersection]
WHAT IS MACHINE LEARNING?
• Machine learning is the scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead.
• Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.
WHY DO WE NEED DATA MINING?
1. Data Explosion
  ▪ Worldwide mobile data traffic has increased to 75 exabytes (10^18 bytes)
  ▪ Google processes 3.5 billion searches/day and 1.2 trillion searches/year
2. Competitive Pressure in Business
  ▪ Personalized services
  ▪ Understanding consumers' experience
  ▪ Automation
3. Traditional techniques are insufficient and infeasible for raw, massive data
DATA MODELING AND ANALYSIS
• Classification (predict an item's class)
• Association (correlations between items)
• Clustering (finding groups)
CLASSIFICATION
• Given a collection of records (training set)
  ▪ Each record contains a set of attributes; one of the attributes is defined as the class (target).
• Classification algorithms are used to develop models for predicting the class from the other attributes.
• To determine the accuracy of the model, a test set is used.
CLASSIFICATION

[Figure: a classification model assigning a class label (Yes/No) to each record in a table]
EXAMPLES OF CLASSIFICATION TASKS
• Classify credit card transactions as legitimate or fraudulent
• Predict customers' spending
• Predict patients' risk of lung cancer
• Classify whether an email is spam or not
• Detect illegal cargo in imports and exports
• Predict molecular toxicology
• Identify loyal customers
• etc.
CLASSIFICATION MODELING
• Decision Tree
• Nearest Neighbor
• Bayesian
• Artificial Neural Network
• Support Vector Machines
• Random Forest
• etc.
CLASSIFICATION MODELING – DECISION TREE

Training Set (80% of the data) → Induction: a tree induction algorithm learns a Model (Decision Tree)
Test Set (20% of the data) → Deduction: apply the model to predict the class

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

A training set is a set of data used to discover potentially predictive relationships.
A test set is a set of data used to assess the strength and utility of a predictive relationship.
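A minimal sketch of this induction/deduction cycle with scikit-learn, assuming pandas and scikit-learn are available; the one-hot encoding step is our addition, since DecisionTreeClassifier needs numeric inputs (Attrib3 is given in thousands):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Training records modelled on the slide's table
train = pd.DataFrame({
    "Attrib1": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Attrib2": ["Large", "Medium", "Small", "Medium", "Large", "Medium",
                "Large", "Small", "Medium", "Small"],
    "Attrib3": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Class":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

X = pd.get_dummies(train[["Attrib1", "Attrib2", "Attrib3"]])  # encode categoricals
y = train["Class"]
model = DecisionTreeClassifier().fit(X, y)                    # induction: learn the tree

test = pd.DataFrame({"Attrib1": ["No"], "Attrib2": ["Small"], "Attrib3": [55]})
X_test = pd.get_dummies(test).reindex(columns=X.columns, fill_value=0)
print(model.predict(X_test))                                  # deduction: apply the model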
CLASSIFICATION MODELING – DECISION TREE

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund?
├─ Yes → NO
└─ No → MarSt?
        ├─ Married → NO
        └─ Single, Divorced → TaxInc?
                              ├─ < 80K → NO
                              └─ ≥ 80K → YES
CLASSIFICATION MODELING – DECISION TREE

An alternative model fitted to the same training data:
MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
                      ├─ Yes → NO
                      └─ No → TaxInc?
                              ├─ < 80K → NO
                              └─ ≥ 80K → YES

There could be more than one tree/model that fits the same data!
CLASSIFICATION MODELING – DECISION TREE

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             No

Start from the root of the tree and, at each node, follow the branch that matches the record:
▪ Refund = No → take the "No" branch to MarSt
▪ MarSt = Married → take the "Married" branch → leaf NO
The model predicts Cheat = No, which matches the test record.
CLASSIFICATION MODELING – DECISION TREE

Validate the decision tree model with the test data.

New Data:
Refund  Marital Status  Taxable Income  Cheat
No      Single          100K            ?

Predict the unknown class label with the decision tree model:
Refund = No → MarSt = Single → TaxInc = 100K ≥ 80K → predict YES
CLASSIFICATION MODELING – DECISION TREE
• Alternatively, we can generate a collection of "If… then…" rules from the decision tree
• {condition} → result

Classification Rules:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income < 80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income ≥ 80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
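Such rules can also be kept as data and matched generically instead of being hard-coded; a small sketch, where the representation (predicate functions over a record dict, income in thousands) is our own choice:

# Each rule is a (condition, result) pair; conditions mirror the slide's rules
RULES = [
    (lambda r: r["Refund"] == "Yes", "No"),
    (lambda r: r["Refund"] == "No" and r["MarSt"] in {"Single", "Divorced"}
               and r["TaxInc"] < 80, "No"),
    (lambda r: r["Refund"] == "No" and r["MarSt"] in {"Single", "Divorced"}
               and r["TaxInc"] >= 80, "Yes"),
    (lambda r: r["Refund"] == "No" and r["MarSt"] == "Married", "No"),
]

def classify(record):
    for condition, result in RULES:
        if condition(record):          # the first matching rule fires
            return result
    return None                        # no rule matched

print(classify({"Refund": "No", "MarSt": "Single", "TaxInc": 100}))  # -> Yes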
CLASSIFICATION MODELING – DECISION TREE
If multiple trees are created, which one is better?

[Figure: two alternative trees for the same data (one splitting on Refund first, one on MarSt first), with the class distribution (YES/NO percentages) shown at each leaf. Purer leaves, e.g. YES: 90% / NO: 10% or YES: 0% / NO: 100%, indicate better splits than mixed leaves such as YES: 50% / NO: 50%.]

• A better tree (better splits) can have higher accuracy in prediction.
ASSOCIATION MODELING
To predict which items are most likely to appear together, and the strength of the relationship between them.

Market-Basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of Association Rules:
{Bread} → {Milk},
{Milk, Bread} → {Diaper},
{Diaper} → {Bread, Beer}

Implication means co-occurrence, not causality!
ASSOCIATION MODELING (EXTRA)
(Using the market-basket transactions from the previous slide.)

Itemset
▪ A collection of one or more items, e.g. {Milk, Diaper}
▪ k-itemset: an itemset that contains k items, e.g. {Milk, Diaper} is a 2-itemset

Support count (σ)
▪ Frequency of occurrence of an itemset
▪ E.g. σ({Milk, Diaper}) = 3

Support
▪ Fraction of transactions that contain an itemset
▪ E.g. support({Milk, Diaper}) = 3/5 = 0.6

Frequent Itemset
▪ An itemset whose support is greater than or equal to a minimum support threshold (e.g. 0.5)
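A small sketch computing support count and support over the slide's five transactions (the function names are ours):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

s = support({"Milk", "Diaper"}, transactions)
print(s, s >= 0.5)   # 0.6 True -> {Milk, Diaper} is frequent at threshold 0.5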
ASSOCIATION MODELING (EXTRA)

Given a set of transactions T, the goal of association rule mining is to find all rules having
• support ≥ min support threshold
• confidence ≥ min confidence threshold
Remark: the min support/confidence thresholds are defined manually.

Basic steps (a brute-force sketch of step 1 follows below):
1. Frequent Itemset Generation
   – Generate all itemsets whose support ≥ min support
2. Rule Generation
   – Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
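A brute-force sketch of frequent itemset generation over the slide's transactions, enumerating every candidate itemset; real miners such as Apriori prune this search, so this naive version is only illustrative:

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
MIN_SUPPORT = 0.5

items = sorted(set().union(*transactions))
frequent = []
for k in range(1, len(items) + 1):              # try every itemset size k
    for candidate in combinations(items, k):
        candidate = set(candidate)
        s = sum(candidate <= t for t in transactions) / len(transactions)
        if s >= MIN_SUPPORT:                    # keep itemsets meeting min support
            frequent.append((candidate, s))

for itemset, s in frequent:
    print(sorted(itemset), s)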
ASSOCIATION MODELING (EXTRA)

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s)
  ◆ Fraction of transactions that contain both X and Y
– Confidence (c)
  ◆ Measures how often items in Y appear in transactions that contain X

Example, over the same five transactions: {Milk, Diaper} → {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
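Extending the earlier support sketch with confidence, reproducing the slide's numbers (the helper name sigma is ours):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y) / len(transactions)   # support:    sigma(X u Y) / |T|      = 2/5
c = sigma(X | Y) / sigma(X)            # confidence: sigma(X u Y) / sigma(X) = 2/3
print(f"{{Milk, Diaper}} -> {{Beer}}: s = {s:.2f}, c = {c:.2f}")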
ASSOCIATION MODELING (EXTRA)

Examples of Rules (over the same transactions):
{Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
CLUSTERING
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Intra-cluster distances are minimized; inter-cluster distances are maximized.
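A minimal clustering sketch with scikit-learn's k-means; the toy 2-D points are our own illustrative data:

import numpy as np
from sklearn.cluster import KMeans

# Two loose groups of 2-D points (hypothetical data)
points = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
                   [5.0, 5.2], [5.3, 4.9], [4.8, 5.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)     # cluster label for each point
print(labels)                           # e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)          # one centre per cluster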
CLUSTERING
How many clusters?

[Figure: the same set of points grouped into two, four, or six clusters; the "right" number of clusters is ambiguous.]
Business Intelligence (BI)
▪ Business intelligence (BI) is a technology-driven process for analyzing data and delivering actionable information that helps executives, managers and workers make informed business decisions.
▪ The BI process includes:
  ▪ collecting data from internal IT systems and external sources, and preparing it for analysis
  ▪ running queries against the data and creating data visualizations, e.g. dashboards and reports, to make the analytics results available to business users for operational decision-making and strategic planning
▪ The ultimate goal is to drive better business decisions that enable organizations to increase revenue, improve operational efficiency and gain competitive advantages over business rivals.

Ref: What is Business Intelligence (BI)? | Definition from TechTarget


BI Dashboard - example

[Figure: example of a BI dashboard]
Benefits of BI in various industries
• Identify profitable customers and devise strategies to mitigate churn. Create robust marketing plans to attract new customers.
• Conduct accurate creditworthiness assessments to gauge customer profitability and curb fraudulent behavior.
• Determine what purchases customers are willing to make and when they want to buy, based on past activities.
• In the insurance industry, set precise rates for premiums.
• In the medical industry, analyze clinical trials for new drugs and compounds.
• Leverage predictive maintenance to reduce equipment downtime.

Ref: Business Intelligence Concepts, Components & Applications In 2023 (selecthub.com)
