4-1 - Machine Learning - Intro-Classification
[Figure: Learning Analytics overview — How? (Methods): Machine Learning / Data Mining, Learner Modeling, Information Visualization & Visual Analytics, Social Network Analysis (SNA); Who? (Stakeholders); Privacy]
[Figure: machine learning — a model is trained using data and then used for prediction, i.e., to answer questions. Workflow steps shown include: 4. Training, 5. Evaluation, 6. Parameter tuning]
[Figure: data sources — databases, data warehouse]
Questions:
• What is the effect of smoking and drinking on a person’s body weight?
• Do people that smoke also drink?
• What factors influence a person’s life expectancy the most?
• Can we identify groups of people having a similar lifestyle?
Questions:
• Which products are frequently purchased together?
• When do people buy a particular product?
• Is it possible to characterize typical customer groups?
• Clustering
• Identifying groups of similar examples in the data
• E.g., can we identify groups of people having a similar lifestyle? (see the sketch below)
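To make the clustering idea concrete, here is a minimal sketch assuming NumPy and scikit-learn are available; the two features (cigarettes per day, drinks per week) and the data are invented for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    # Invented data: one row per person, columns = cigarettes per day, drinks per week
    people = np.array([[0, 1], [1, 2], [0, 0],
                       [20, 14], [25, 10], [18, 12]])

    # Ask k-means for two groups ("lifestyles") and read off each person's group
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(people)
    print(kmeans.labels_)  # one cluster index per person: two lifestyle groups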
Example dataset (student grades):

Linear algebra | Logic | Programming | Operations research | Workflow systems | … | Duration | Result
9 | 8 | 8 | 9 | 9 | … | 36 | Cum laude
7 | 6 | - | 8 | 8 | … | 42 | Passed
- | - | 5 | 4 | 6 | … | 54 | Failed
8 | 6 | 6 | 6 | 5 | … | 38 | Passed
6 | 7 | 6 | - | 8 | … | 39 | Passed
9 | 9 | 9 | 9 | 8 | … | 38 | Cum laude
5 | 5 | - | 6 | 6 | … | 52 | Failed
… | … | … | … | … | … | … | …
Concepts to identify in the table above:
• Example / Instance?
• Features / Attributes?
• Labeled / Unlabeled data?
• Discrete / Continuous variables?
• Nominal / Ordinal variables?
• Two Phases (see the sketch below):
• Training Phase (Model Construction)
• Prediction Phase (Inference)
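A minimal sketch of the two phases, assuming scikit-learn and its bundled iris dataset; any of the classifiers listed below could replace LogisticRegression:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)        # training phase: construct the model from labeled data
    print(model.predict(X_test[:5]))   # prediction phase: infer labels for unseen examples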
• Logistic Regression
• Support Vector Machines (SVM)
• Neural Networks
• Bayes’ theorem
• P(A ∧ B) = P(A|B) ⋅ P(B)
• P(B ∧ A) = P(B|A) ⋅ P(A)
• Since P(A ∧ B) = P(B ∧ A):
  P(A|B) ⋅ P(B) = P(B|A) ⋅ P(A)  ⇒  P(A|B) = (P(B|A) ⋅ P(A)) / P(B)
Bayesian Classifiers – Components
P(C|X) = (P(X|C) ⋅ P(C)) / P(X)

posterior probability = (likelihood ⋅ prior probability) / evidence
• Let X be a data example (“evidence”) whose class label is unknown
• Let C be the hypothesis that X belongs to class C
• Classification determines P(C|X) (the posterior probability): the probability that X belongs to class C, given the observed example X
• P(C) (prior probability): the initial probability of the class
  • E.g., the probability that a customer will buy a computer, regardless of age, income, …
• P(X) (evidence): the probability that the example X is observed
• P(X|C) (likelihood): the probability of observing the example X, given that the hypothesis holds
  • E.g., given that a customer buys a computer, the probability that the customer is aged 31–40 with medium income
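A quick worked example with invented numbers: if P(C) = 0.5 (half of all customers buy a computer), P(X|C) = 0.3, and P(X) = 0.25, then P(C|X) = (0.3 ⋅ 0.5) / 0.25 = 0.6, i.e., observing the evidence X raises the probability of class C from 0.5 to 0.6.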
• Assuming class-conditional independence of the attributes (the “naive” assumption):
  P(X|Cᵢ) = ∏ₖ₌₁ⁿ P(xₖ|Cᵢ) = P(x₁|Cᵢ) ⋅ P(x₂|Cᵢ) ⋅ … ⋅ P(xₙ|Cᵢ)
• If the k-th attribute is categorical:
  P(xₖ|Cᵢ) is estimated as the relative frequency of samples having value xₖ as their k-th attribute in class Cᵢ in the training set
• If the k-th attribute is continuous:
  P(xₖ|Cᵢ) can be estimated through a Gaussian distribution with mean μ and standard deviation σ:
  P(xₖ|Cᵢ) = g(xₖ, μ_Cᵢ, σ_Cᵢ), where
  g(x, μ, σ) = (1 / (√(2π) ⋅ σ)) ⋅ e^(−(x−μ)² / (2σ²))
• Computationally easy in both cases (see the sketch below)
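A minimal sketch of both estimates in plain Python; the attribute values and the Gaussian parameters are invented for illustration:

    from math import exp, pi, sqrt

    def categorical_likelihood(class_values, x):
        """P(x_k | C_i) as the relative frequency of x among the class's attribute values."""
        return class_values.count(x) / len(class_values)

    def gaussian_likelihood(x, mu, sigma):
        """g(x, mu, sigma) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)**2 / (2 * sigma**2))."""
        return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

    # Categorical attribute: 3 of 4 training samples in class C_i have value "High"
    print(categorical_likelihood(["High", "High", "Normal", "High"], "High"))  # 0.75
    # Continuous attribute: density of x_k = 35 under the class's fitted Gaussian
    print(gaussian_likelihood(35.0, mu=38.0, sigma=4.0))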
Conditions for stopping the partitioning:
• All examples for a given node belong to the same class
• There are no remaining attributes for further partitioning
• There are no examples left
[Figure: example decision tree with class labels “high risk” / “low risk”]
Node impurity:
• Non-homogeneous (C0: 5, C1: 5): high degree of impurity
• Homogeneous (C0: 9, C1: 1): low degree of impurity
New data to classify:
Day | Outlook | Humidity | Wind | Play
D15 | Rain | High | Weak | ?
[Figure: growing the tree on the training data]
• Splitting on Outlook: the Overcast branch (D3, D7, D12, D13) is a pure subset (4 yes / 0 no) and becomes a “yes” leaf
• The Sunny branch (D1, D2, D8, D9, D11) must be split further, on Humidity: High (D1, D2, D8) → no; Normal (D9, D11) → yes
• The Rain branch (D4, D5, D6, D10, D14; 3 yes / 2 no) must be split further, on Wind: Weak (D4, D5, D10) → yes; Strong (D6, D14) → no
Decision Tree – Example
[Figure: the resulting decision tree — root: Outlook; Overcast → yes; Sunny → subtree on Humidity; Rain → subtree on Wind]
• Wanted:
• A measure for the heterogeneity of a set T of training objects with respect to their class membership
• A split of T into partitions T₁, T₂, …, Tₘ such that the heterogeneity is minimized
• Proposals: entropy / information gain, Gini index
Entropy of a set T with relative class frequencies pᵢ (with 0 ⋅ log₂(0) ≔ 0):
entropy(T) = −Σᵢ₌₁ᵏ pᵢ ⋅ log₂(pᵢ)

• Pure set (4 yes / 0 no): entropy(T) = −(4/4) ⋅ log₂(4/4) − (0/4) ⋅ log₂(0/4) = 0 bits

Split on Wind:
• Weak: 6 yes / 2 no → entropy(Wind_Weak) = −(6/8) ⋅ log₂(6/8) − (2/8) ⋅ log₂(2/8) = 0.811
• Strong: 3 yes / 3 no → entropy(Wind_Strong) = −(3/6) ⋅ log₂(3/6) − (3/6) ⋅ log₂(3/6) = 1.0

information gain(T, A) = entropy(T) − Σᵢ₌₁ᵐ (|Tᵢ| / |T|) ⋅ entropy(Tᵢ)

information gain(T, Wind) = entropy(T) − (8/14) ⋅ entropy(Wind_Weak) − (6/14) ⋅ entropy(Wind_Strong)
                          = 0.94 − (8/14) ⋅ 0.811 − (6/14) ⋅ 1.0 = 0.048
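A minimal sketch in plain Python that reproduces these numbers (0.811, 1.0, and the gain of roughly 0.048):

    from math import log2

    def entropy(pos, neg):
        """Entropy in bits of a set with pos yes- and neg no-examples (0*log2(0) := 0)."""
        total = pos + neg
        return -sum(c / total * log2(c / total) for c in (pos, neg) if c)

    def information_gain(parent, partitions):
        """entropy(T) minus the size-weighted entropies of the partitions T_1..T_m."""
        total = sum(p + n for p, n in partitions)
        return entropy(*parent) - sum((p + n) / total * entropy(p, n) for p, n in partitions)

    print(entropy(6, 2))   # Wind = Weak   -> 0.811
    print(entropy(3, 3))   # Wind = Strong -> 1.0
    print(information_gain((9, 5), [(6, 2), (3, 3)]))  # -> 0.048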
Entropy / Information Gain – Example
Starting set: 9 yes / 5 no:
entropy(T) = −(9/14) ⋅ log₂(9/14) − (5/14) ⋅ log₂(5/14) = 0.94

Split on Humidity:
• High: 3 yes / 4 no → entropy(Humidity_High) = −(3/7) ⋅ log₂(3/7) − (4/7) ⋅ log₂(4/7) = 0.985
• Normal: 6 yes / 1 no → entropy(Humidity_Normal) = −(6/7) ⋅ log₂(6/7) − (1/7) ⋅ log₂(1/7) = 0.592

information gain(T, Humidity) = entropy(T) − (7/14) ⋅ entropy(Humidity_High) − (7/14) ⋅ entropy(Humidity_Normal)
                              = 0.94 − (7/14) ⋅ 0.985 − (7/14) ⋅ 0.592 = 0.151
Entropy / Information Gain – Example
Starting set: 9 yes / 5 no, entropy(T) = 0.94

Split on Outlook (Sunny: 2 yes / 3 no; Overcast: 4 yes / 0 no; Rain: 3 yes / 2 no):
information gain(T, Outlook) = entropy(T) − (5/14) ⋅ entropy(Outlook_Sunny) − (4/14) ⋅ entropy(Outlook_Overcast) − (5/14) ⋅ entropy(Outlook_Rain)
                             = 0.94 − (5/14) ⋅ 0.971 − (4/14) ⋅ 0 − (5/14) ⋅ 0.971 = 0.246
Entropy / Information Gain – Example
Comparing the three candidate splits of the full training set (9 yes / 5 no, entropy(T) = 0.940):
• information gain(T, Outlook) = 0.94 − (5/14) ⋅ 0.971 − (4/14) ⋅ 0 − (5/14) ⋅ 0.971 = 0.246
• information gain(T, Humidity) = 0.94 − (7/14) ⋅ 0.985 − (7/14) ⋅ 0.592 = 0.151
• information gain(T, Wind) = 0.94 − (8/14) ⋅ 0.811 − (6/14) ⋅ 1.0 = 0.048
• Result: “Outlook” yields the highest information gain and becomes the root of the tree
[Figure: root split on Outlook — Sunny → ?, Overcast → yes, Rain → ?]
Entropy / Information Gain – Example
Day | Outlook | Humidity | Wind | Play
D1 | Sunny | High | Weak | No
D2 | Sunny | High | Strong | No
D3 | Overcast | High | Weak | Yes
D4 | Rain | High | Weak | Yes
D5 | Rain | Normal | Weak | Yes
D6 | Rain | Normal | Strong | No
D7 | Overcast | Normal | Strong | Yes
D8 | Sunny | High | Weak | No
D9 | Sunny | Normal | Weak | Yes
D10 | Rain | Normal | Weak | Yes
D11 | Sunny | Normal | Strong | Yes
D12 | Overcast | High | Strong | Yes
D13 | Overcast | Normal | Weak | Yes
D14 | Rain | High | Strong | No

Final decision tree (a runnable sketch follows below):
• Root {1, …, 14}: split on Outlook
• Overcast {3, 7, 12, 13} → yes
• Sunny {1, 2, 8, 9, 11} → split on Humidity: High {1, 2, 8} → no; Normal {9, 11} → yes
• Rain {4, 5, 6, 10, 14} → split on Wind: Weak {4, 5, 10} → yes; Strong {6, 14} → no
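For comparison, a minimal sketch that fits a decision tree to the same table, assuming pandas and scikit-learn are available; scikit-learn builds binary splits on one-hot columns, so the printed tree is equivalent to, but not shaped exactly like, the multiway tree above:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # The play-tennis table from above (D1..D14)
    data = pd.DataFrame({
        "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                     "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
        "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                     "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
        "Wind":     ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                     "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
        "Play":     ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                     "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
    })

    X = pd.get_dummies(data[["Outlook", "Humidity", "Wind"]])  # one-hot encode the categories
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, data["Play"])
    print(export_text(tree, feature_names=list(X.columns)))

    # The new example D15 (Rain, High, Weak) is classified "Yes", matching the tree above
    d15 = pd.get_dummies(pd.DataFrame([{"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}]))
    print(tree.predict(d15.reindex(columns=X.columns, fill_value=0)))  # -> ['Yes']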
• Eager evaluation
• Create models from the data (training phase) and then use these models for classification (test phase)
• Examples: decision tree, Bayes classifier
• Algorithm (k-nearest neighbors; a minimal sketch follows below):
• Given a new object x
• Compute the distance from x to every training example
• Select the k-neighborhood of x: the k closest training instances
• Label x with the most frequent class in its k-neighborhood (majority vote)
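A minimal sketch of this algorithm in plain Python, using Euclidean distance; the training data and the class labels (echoing the “high risk” / “low risk” figure earlier) are invented for illustration:

    from collections import Counter
    from math import dist

    def knn_classify(x, training_data, k=3):
        """Label x with the most frequent class among its k closest training examples."""
        neighbors = sorted(training_data, key=lambda item: dist(x, item[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    # Invented 2-D training examples: (features, class label)
    training_data = [((1.0, 1.0), "low risk"), ((1.2, 0.8), "low risk"),
                     ((0.9, 1.1), "low risk"), ((5.0, 5.0), "high risk"),
                     ((5.5, 4.5), "high risk")]
    print(knn_classify((1.1, 0.9), training_data))  # -> low risk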
• Bayesian Classifiers
• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
• Based on Bayes’ theorem