
CCIT4085
Information Technology Fundamentals
2.2 Data Processing, Modeling and Analysis
23-24s1
Data Processing, Modeling and Analysis
▪ Knowledge Discovery in Databases (KDD)
  ▪ Data selection
  ▪ Data pre-processing
  ▪ Data transformation
  ▪ Data mining
▪ Data modeling and analysis
  ▪ Classification – Decision Tree
  ▪ Association
▪ Business Intelligence
(Recall from 1.1)
Data and Information
▪ Data: raw facts with no meaning on their own, e.g. data in a spreadsheet
▪ Information: data that are processed and integrated to be meaningful, e.g. average, maximum, differences, chart, etc.
▪ Knowledge: useful patterns from organized data and information, e.g. IF <condition(s)> THEN <result>
▪ Wisdom: evaluating the discovered knowledge and making recommendations for the future

Russ Ackoff, “From Data to Wisdom”, Journal of Applied Systems Analysis, Volume 16, 1989, pp. 3-9
KNOWLEDGE DISCOVERY IN DATABASES (KDD)
KDD refers to the overall process of discovering useful knowledge from data; data mining refers to a particular step in this process.
KNOWLEDGE DISCOVERY IN DATABASES (KDD)
• Massive amounts of data are being collected and stored in databases, e.g.
  ▪ web data, e-commerce
  ▪ purchasing records in business
  ▪ bank/credit card transactions
  ▪ medical records, biological studies
• KDD is the process of discovering hidden knowledge in data.
• Data is in raw form (simply a collection of elements).
1. DATA SELECTION
• What data is available for the task?
• Is the data relevant?
• Is additional relevant data available?
• How much historical data is available?
• Who is the data expert?

Example records (attributes: Tid, Refund, Marital Status, Taxable Income, Cheat):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
2. DATA PREPROCESSING
• Data in the real world is dirty:
  ▪ incomplete data
  ▪ duplicate records
  ▪ noise (errors or outliers)
  ▪ inconsistency (e.g. 2015/09/01, 1 Sep 2015, 1/9/2015)
• Low quality data, low quality mining results!
• Major tasks: Data Cleaning and Data Integration

Example of dirty data:

SID       Age  Gender  GPA
20000001       M       3.8
20000002  18   Female  3.9
20000003  10   F
20000003  19   Female  3.2

(missing values, an outlier age, inconsistent gender formats, and a duplicate SID)
2. DATA PREPROCESSING – DATA CLEANING
• Fill in missing values
  ▪ ignore, fill manually, global constant, mean value, estimation
• Identify or remove outliers
  ▪ define with domain knowledge and common sense
• Resolve inconsistencies
• Unify the format, e.g.
  ▪ use M/F instead of Male/Female
  ▪ use YYYYMMDD or the database date format instead of free text
• Use standard file formats (csv, tab, excel, etc.)
(A sketch of these steps in code follows below.)
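To make these steps concrete, here is a minimal pandas sketch of the cleaning tasks above, using the dirty student table from the previous slide; the exact fill, format and outlier rules are illustrative assumptions, not prescribed by the slides:

import pandas as pd

# Hypothetical dirty student records (from the earlier example slide)
df = pd.DataFrame({
    "SID":    [20000001, 20000002, 20000003, 20000003],
    "Age":    [None, 18, 10, 19],
    "Gender": ["M", "Female", "F", "Female"],
    "GPA":    [3.8, 3.9, None, 3.2],
})

df = df.drop_duplicates(subset="SID", keep="last")   # remove duplicate records
df["Age"] = df["Age"].fillna(df["Age"].mean())       # fill missing values with the mean
df["Gender"] = df["Gender"].str[0].str.upper()       # unify format: Male/Female -> M/F
df = df[df["Age"].between(15, 80)]                   # drop outliers (domain knowledge)
df.to_csv("students_clean.csv", index=False)         # save in a standard format (csv)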
2. DATA PREPROCESSING – DATA INTEGRATION
• Combine data from multiple sources into a coherent store
• Combine two or more attributes into one (see the sketch below)
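A minimal pandas sketch of both ideas; the table and column names (students/grades, First/Last) are hypothetical:

import pandas as pd

students = pd.DataFrame({"SID": [1, 2], "First": ["Amy", "Bob"], "Last": ["Chan", "Lee"]})
grades   = pd.DataFrame({"SID": [1, 2], "GPA": [3.8, 3.2]})

# Combine data from multiple sources into one coherent table (join on a shared key)
merged = pd.merge(students, grades, on="SID")

# Combine two attributes into one
merged["Name"] = merged["First"] + " " + merged["Last"]
print(merged[["SID", "Name", "GPA"]])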
3. DATA TRANSFORMATION
• Conversion of data values from continuous to discrete
• Feature extraction (e.g. AM/PM from time)
• Data discretization (split into partitions)
  ▪ Equal-width binning (e.g. 1-10, 11-20, 21-30)
  ▪ Equal-depth binning (each range has approximately the same number of samples)

Example of equal-depth binning intervals: 0-21, 22-31, 32-37, 38-43, 44-47, 48-54, 55-61, 62-80
Equal Width Binning
▪ A method to divide data into k intervals of equal size.
▪ The width of each interval is:
  w = (max – min)/k
▪ The interval boundaries are:
  min+w, min+2w, …, min+(k-1)w
Equal Width Binning
▪ Data: 0, 4, 12, 16, 16, 18, 24, 26, 28
▪ Assume k=3; then the approximate width is (28-0)/3 = 9.33 ≈ 10
▪ We determine the (equal-width) intervals first, then fill each bin
▪ Note that the width of each interval is fixed, but the number of data items per bin is not
▪ Then we count the number of items in each interval (namely a conversion of continuous data to discrete data)

Bin                    Interval   No. of items in interval
Bin-1: 0, 4            [0 - 10)   2
Bin-2: 12, 16, 16, 18  [10 - 20)  4
Bin-3: 24, 26, 28      [20 - 30)  3
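A small pure-Python sketch of equal-width binning that reproduces the counts above (the function name is ours; exact widths are used rather than the slide's rounded 10):

def equal_width_bins(data, k):
    """Split data into k bins of equal width; return (interval, items) pairs."""
    lo, hi = min(data), max(data)
    w = (hi - lo) / k                      # w = (max - min) / k
    bins = []
    for i in range(k):
        left, right = lo + i * w, lo + (i + 1) * w
        # the last bin is closed on the right so max(data) is included
        items = [x for x in data if left <= x < right or (i == k - 1 and x == hi)]
        bins.append(((left, right), items))
    return bins

data = [0, 4, 12, 16, 16, 18, 24, 26, 28]
for (left, right), items in equal_width_bins(data, 3):
    print(f"[{left:.2f} - {right:.2f}): {items} ({len(items)} items)")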
Equal Depth (Frequency) Binning
▪ This method divides data into k groups where each group contains approximately the same number of values.
▪ For both methods, the best way to determine k is to look at a histogram and try different intervals or groups.
Equal Depth/Frequency Binning
▪ Data: 0, 4, 12, 16, 16, 18, 24, 26, 28
▪ Assume k=3, so each bin holds approximately 3 data items (equal depth/frequency)
▪ We decide which 3 data items go into each bin first, then determine each interval
▪ Note that the interval width is not fixed, but the number of items in each bin is
▪ Then we can determine the width of each interval (a conversion of continuous data to discrete data)

Bin                Interval   Width of interval
Bin-1: 0, 4, 12    [0 - 14)   14
Bin-2: 16, 16, 18  [14 - 21)  7
Bin-3: 24, 26, 28  [21 - 30)  9
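A matching sketch for equal-depth binning on the same data; the boundary rule (midpoint between neighbouring bins) is an assumption of ours that happens to reproduce the slide's intervals:

def equal_depth_bins(data, k):
    """Split sorted data into k bins of ~equal counts; boundaries at midpoints."""
    data = sorted(data)
    n = len(data)
    # slice indices so each bin gets roughly n/k items
    groups = [data[i * n // k:(i + 1) * n // k] for i in range(k)]
    # boundary between two bins = midpoint of the neighbouring values
    edges = [data[0]]
    for left, right in zip(groups, groups[1:]):
        edges.append((left[-1] + right[0]) / 2)
    edges.append(data[-1] + 2)   # pad the top edge slightly (the slide uses 30)
    return list(zip(groups, zip(edges, edges[1:])))

data = [0, 4, 12, 16, 16, 18, 24, 26, 28]
for items, (lo, hi) in equal_depth_bins(data, 3):
    print(f"{items} -> [{lo} - {hi}), width {hi - lo}")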
3. DATA TRANSFORMATION

(i) Binary split:     Taxable Income > 80K?  →  Yes / No
(ii) Multi-way split: Taxable Income?  →  < 10K, [10K,25K), [25K,50K), [50K,80K), ≥ 80K

Which split is better? It depends on the usage of the data.
EXERCISES

Outlook   Go to Play  Temperature  Humidity  Windy
Sunny     No          35           high      No
Sunny     No          34           high      Yes
Overcast  Yes         32           high      N
Rain      Yes         25           high      N
Rain      Yes         18           normal    N
Rain      N           16           normal    true
Overcast  Y           15                     true
Sunny     No          23           high      false
Sunny     Yes         17           normal    false
Rain      Yes         24           normal    false
Sunny     Yes         26           normal    true
Overcast  Yes         23           high      true
Overcast  Yes         33           normal    false
Rain      No          25           high      True
                      20           normal

1. How many attributes are in the data set?
2. How many records are in the data set?
3. Identify problem(s) in the data set, and provide possible solution(s).
4. Perform discretization on the attribute temperature with equal-width binning.
WHAT IS DATA MINING?
▪ The process of discovering patterns in large data sets involving methods at the intersection of:
  - Machine Learning,
  - Statistics, and
  - Database systems

[Figure: Venn diagram of Statistics, Machine Learning and Databases, with Data Mining at their intersection]
WHAT IS MACHINE LEARNING?
• Machine learning is the scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead.
• Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.
WHY DO WE NEED DATA MINING?
1. Data Explosion
  ▪ Worldwide mobile data traffic has increased to 75 exabytes (10^18 bytes)
  ▪ Google processes 3.5 billion searches/day and 1.2 trillion searches/year
2. Competitive Pressure in Business
  ▪ Personalized services
  ▪ Understanding consumers' experience
  ▪ Automation
3. Traditional techniques are insufficient and infeasible for raw, massive data
DATA MODELING AND ANALYSIS
• Classification (predict an item's class)
• Association (correlations between items)
• Clustering (finding groups)
CLASSIFICATION
• Given a collection of records (training set)
  ▪ Each record contains a set of attributes; one of the attributes is defined as the class (target).
• Classification algorithms are used to develop models for predicting the class from the other attributes.
• To determine the accuracy of the model, a test set is used.
CLASSIFICATION

[Figure: a classification model assigning a class label (Yes/No) to each record in a table]
EXAMPLES OF CLASSIFICATION TASKS
• Classify credit card transactions as legitimate or fraudulent
• Predict customers' spending
• Predict patients' risk of lung cancer
• Classify whether an email is spam or not
• Detect illegal cargo in imports and exports
• Predict molecular toxicology
• Identify loyal customers
• etc.
CLASSIFICATION MODELING
• Decision Tree
• Nearest Neighbor
• Bayesian
• Artificial Neural Network
• Support Vector Machines
• Random Forest
• etc.
CLASSIFICATION MODELING – DECISION TREE

Training Set (80% of the data) → Induction: a tree induction algorithm learns a Model (Decision Tree)
Test Set (20% of the data) → Deduction: apply the model to predict the class

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

A training set is a set of data used to discover potentially predictive relationships.
A test set is a set of data used to assess the strength and utility of a predictive relationship.
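A minimal sketch of this induction/deduction cycle with scikit-learn, assuming pandas and scikit-learn are available; the one-hot encoding step is our addition, since DecisionTreeClassifier needs numeric inputs (Attrib3 is given in thousands):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Training records modelled on the slide's table
train = pd.DataFrame({
    "Attrib1": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Attrib2": ["Large", "Medium", "Small", "Medium", "Large", "Medium",
                "Large", "Small", "Medium", "Small"],
    "Attrib3": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Class":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

X = pd.get_dummies(train[["Attrib1", "Attrib2", "Attrib3"]])  # encode categoricals
y = train["Class"]
model = DecisionTreeClassifier().fit(X, y)                    # induction: learn the tree

test = pd.DataFrame({"Attrib1": ["No"], "Attrib2": ["Small"], "Attrib3": [55]})
X_test = pd.get_dummies(test).reindex(columns=X.columns, fill_value=0)
print(model.predict(X_test))                                  # deduction: apply the model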
CLASSIFICATION MODELING – DECISION TREE

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund?
├─ Yes → NO
└─ No → MarSt?
        ├─ Married → NO
        └─ Single, Divorced → TaxInc?
                              ├─ < 80K → NO
                              └─ ≥ 80K → YES
CLASSIFICATION MODELING – DECISION TREE

An alternative model fitted to the same training data:
MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
                      ├─ Yes → NO
                      └─ No → TaxInc?
                              ├─ < 80K → NO
                              └─ ≥ 80K → YES

There could be more than one tree/model that fits the same data!
CLASSIFICATION MODELING – DECISION TREE

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             No

Start from the root of the tree and, at each node, follow the branch that matches the record:
▪ Refund = No → take the "No" branch to MarSt
▪ MarSt = Married → take the "Married" branch → leaf NO
The model predicts Cheat = No, which matches the test record.
CLASSIFICATION MODELING – DECISION TREE

Validate the decision tree model with the test data.

New Data:
Refund  Marital Status  Taxable Income  Cheat
No      Single          100K            ?

Predict the unknown class label with the decision tree model:
Refund = No → MarSt = Single → TaxInc = 100K ≥ 80K → predict YES
CLASSIFICATION MODELING – DECISION TREE
• Alternatively, we can generate a collection of "If… then…" rules from the decision tree
• {condition} → result

Classification Rules:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income < 80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income ≥ 80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
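Such rules can also be kept as data and matched generically instead of being hard-coded; a small sketch, where the representation (predicate functions over a record dict, income in thousands) is our own choice:

# Each rule is a (condition, result) pair; conditions mirror the slide's rules
RULES = [
    (lambda r: r["Refund"] == "Yes", "No"),
    (lambda r: r["Refund"] == "No" and r["MarSt"] in {"Single", "Divorced"}
               and r["TaxInc"] < 80, "No"),
    (lambda r: r["Refund"] == "No" and r["MarSt"] in {"Single", "Divorced"}
               and r["TaxInc"] >= 80, "Yes"),
    (lambda r: r["Refund"] == "No" and r["MarSt"] == "Married", "No"),
]

def classify(record):
    for condition, result in RULES:
        if condition(record):          # the first matching rule fires
            return result
    return None                        # no rule matched

print(classify({"Refund": "No", "MarSt": "Single", "TaxInc": 100}))  # -> Yes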
CLASSIFICATION MODELING – DECISION TREE
If multiple trees are created, which one is better?

[Figure: two alternative trees for the same data (one splitting on Refund first, one on MarSt first), with the class distribution (YES/NO percentages) shown at each leaf. Purer leaves, e.g. YES: 90% / NO: 10% or YES: 0% / NO: 100%, indicate better splits than mixed leaves such as YES: 50% / NO: 50%.]

• A better tree (better splits) can have higher accuracy in prediction.
ASSOCIATION MODELING
To predict which items are most likely to appear together, and the strength of the relationship between them.

Market-Basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of Association Rules:
{Bread} → {Milk},
{Milk, Bread} → {Diaper},
{Diaper} → {Bread, Beer}

Implication means co-occurrence, not causality!
ASSOCIATION MODELING (EXTRA)
(Using the market-basket transactions from the previous slide.)

Itemset
▪ A collection of one or more items, e.g. {Milk, Diaper}
▪ k-itemset: an itemset that contains k items, e.g. {Milk, Diaper} is a 2-itemset

Support count (σ)
▪ Frequency of occurrence of an itemset
▪ E.g. σ({Milk, Diaper}) = 3

Support
▪ Fraction of transactions that contain an itemset
▪ E.g. support({Milk, Diaper}) = 3/5 = 0.6

Frequent Itemset
▪ An itemset whose support is greater than or equal to a minimum support threshold (e.g. 0.5)
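A small sketch computing support count and support over the slide's five transactions (the function names are ours):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

s = support({"Milk", "Diaper"}, transactions)
print(s, s >= 0.5)   # 0.6 True -> {Milk, Diaper} is frequent at threshold 0.5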
ASSOCIATION MODELING (EXTRA)

Given a set of transactions T, the goal of association rule mining is to find all rules having
• support ≥ min support threshold
• confidence ≥ min confidence threshold
Remark: the min support/confidence thresholds are defined manually.

Basic steps (a brute-force sketch of step 1 follows below):
1. Frequent Itemset Generation
   – Generate all itemsets whose support ≥ min support
2. Rule Generation
   – Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
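A brute-force sketch of frequent itemset generation over the slide's transactions, enumerating every candidate itemset; real miners such as Apriori prune this search, so this naive version is only illustrative:

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
MIN_SUPPORT = 0.5

items = sorted(set().union(*transactions))
frequent = []
for k in range(1, len(items) + 1):              # try every itemset size k
    for candidate in combinations(items, k):
        candidate = set(candidate)
        s = sum(candidate <= t for t in transactions) / len(transactions)
        if s >= MIN_SUPPORT:                    # keep itemsets meeting min support
            frequent.append((candidate, s))

for itemset, s in frequent:
    print(sorted(itemset), s)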
ASSOCIATION MODELING (EXTRA)

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s)
  ◆ Fraction of transactions that contain both X and Y
– Confidence (c)
  ◆ Measures how often items in Y appear in transactions that contain X

Example, over the same five transactions: {Milk, Diaper} → {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
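Extending the earlier support sketch with confidence, reproducing the slide's numbers (the helper name sigma is ours):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y) / len(transactions)   # support:    sigma(X u Y) / |T|      = 2/5
c = sigma(X | Y) / sigma(X)            # confidence: sigma(X u Y) / sigma(X) = 2/3
print(f"{{Milk, Diaper}} -> {{Beer}}: s = {s:.2f}, c = {c:.2f}")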
ASSOCIATION MODELING (EXTRA)

Examples of Rules (over the same transactions):
{Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
CLUSTERING
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Intra-cluster distances are minimized; inter-cluster distances are maximized.
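A minimal clustering sketch with scikit-learn's k-means; the toy 2-D points are our own illustrative data:

import numpy as np
from sklearn.cluster import KMeans

# Two loose groups of 2-D points (hypothetical data)
points = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
                   [5.0, 5.2], [5.3, 4.9], [4.8, 5.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)     # cluster label for each point
print(labels)                           # e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)          # one centre per cluster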
CLUSTERING
How many clusters?

[Figure: the same set of points grouped into two, four, or six clusters; the "right" number of clusters is ambiguous.]
Business Intelligence (BI)
▪ Business intelligence (BI) is a technology-driven process for analyzing data and delivering actionable information that helps executives, managers and workers make informed business decisions.
▪ The BI process includes:
  ▪ collecting data from internal IT systems and external sources, and preparing it for analysis
  ▪ running queries against the data and creating data visualizations, e.g. dashboards and reports, to make the analytics results available to business users for operational decision-making and strategic planning
▪ The ultimate goal is to drive better business decisions that enable organizations to increase revenue, improve operational efficiency and gain competitive advantages over business rivals.

Ref: What is Business Intelligence (BI)? | Definition from TechTarget


BI Dashboard - example

[Figure: example of a BI dashboard]
Benefits of BI in various industries
• Identify profitable customers and devise strategies to mitigate churn. Create robust marketing plans to attract new customers.
• Conduct accurate creditworthiness assessments to gauge customer profitability and curb fraudulent behavior.
• Determine what purchases customers are willing to make and when they want to buy, based on past activities.
• In the insurance industry, set precise rates for premiums.
• In the medical industry, analyze clinical trials for new drugs and compounds.
• Leverage predictive maintenance to reduce equipment downtime.

Ref: Business Intelligence Concepts, Components & Applications In 2023 (selecthub.com)
