4-1 - Machine Learning - Intro-Classification
[Figure: Learning Analytics overview — How? (Methods): Machine Learning / Data Mining, Learner Modeling, Information Visualization & Visual Analytics, Social Network Analysis (SNA); Who? (Stakeholders); Privacy]
[Figure: machine learning — a model is trained using data and then used for prediction, i.e., to answer questions. Workflow steps shown include: 4. Training, 5. Evaluation, 6. Parameter tuning]
[Figure: data sources — databases, data warehouse]
Questions:
• What is the effect of smoking and drinking on a person’s body weight?
• Do people that smoke also drink?
• What factors influence a person’s life expectancy the most?
• Can we identify groups of people having a similar lifestyle?
Questions:
• Which products are frequently purchased together?
• When do people buy a particular product?
• Is it possible to characterize typical customer groups?
• Clustering
• Identifying groups of similar examples in the data
• E.g., can we identify groups of people having a similar lifestyle? (see the sketch below)
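To make the clustering idea concrete, here is a minimal sketch assuming NumPy and scikit-learn are available; the two features (cigarettes per day, drinks per week) and the data are invented for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    # Invented data: one row per person, columns = cigarettes per day, drinks per week
    people = np.array([[0, 1], [1, 2], [0, 0],
                       [20, 14], [25, 10], [18, 12]])

    # Ask k-means for two groups ("lifestyles") and read off each person's group
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(people)
    print(kmeans.labels_)  # one cluster index per person: two lifestyle groups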
Example dataset (student grades):

Linear algebra | Logic | Programming | Operations research | Workflow systems | … | Duration | Result
9 | 8 | 8 | 9 | 9 | … | 36 | Cum laude
7 | 6 | - | 8 | 8 | … | 42 | Passed
- | - | 5 | 4 | 6 | … | 54 | Failed
8 | 6 | 6 | 6 | 5 | … | 38 | Passed
6 | 7 | 6 | - | 8 | … | 39 | Passed
9 | 9 | 9 | 9 | 8 | … | 38 | Cum laude
5 | 5 | - | 6 | 6 | … | 52 | Failed
… | … | … | … | … | … | … | …
Concepts to identify in the table above:
• Example / Instance?
• Features / Attributes?
• Labeled / Unlabeled data?
• Discrete / Continuous variables?
• Nominal / Ordinal variables?
• Two Phases (see the sketch below):
• Training Phase (Model Construction)
• Prediction Phase (Inference)
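A minimal sketch of the two phases, assuming scikit-learn and its bundled iris dataset; any of the classifiers listed below could replace LogisticRegression:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)        # training phase: construct the model from labeled data
    print(model.predict(X_test[:5]))   # prediction phase: infer labels for unseen examples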
• Logistic Regression
• Support Vector Machines (SVM)
• Neural Networks
• Bayes’ theorem
• P(A ∧ B) = P(A|B) ⋅ P(B)
• P(B ∧ A) = P(B|A) ⋅ P(A)
• Since P(A ∧ B) = P(B ∧ A):
  P(A|B) ⋅ P(B) = P(B|A) ⋅ P(A)  ⇒  P(A|B) = (P(B|A) ⋅ P(A)) / P(B)
Bayesian Classifiers – Components
P(C|X) = (P(X|C) ⋅ P(C)) / P(X)

posterior probability = (likelihood ⋅ prior probability) / evidence
• Let X be a data example (“evidence”) whose class label is unknown
• Let C be the hypothesis that X belongs to class C
• Classification determines P(C|X) (the posterior probability): the probability that X belongs to class C, given the observed example X
• P(C) (prior probability): the initial probability of the class
  • E.g., the probability that a customer will buy a computer, regardless of age, income, …
• P(X) (evidence): the probability that the example X is observed
• P(X|C) (likelihood): the probability of observing the example X, given that the hypothesis holds
  • E.g., given that a customer buys a computer, the probability that the customer is aged 31–40 with medium income
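A quick worked example with invented numbers: if P(C) = 0.5 (half of all customers buy a computer), P(X|C) = 0.3, and P(X) = 0.25, then P(C|X) = (0.3 ⋅ 0.5) / 0.25 = 0.6, i.e., observing the evidence X raises the probability of class C from 0.5 to 0.6.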
• Assuming class-conditional independence of the attributes (the “naive” assumption):
  P(X|Cᵢ) = ∏ₖ₌₁ⁿ P(xₖ|Cᵢ) = P(x₁|Cᵢ) ⋅ P(x₂|Cᵢ) ⋅ … ⋅ P(xₙ|Cᵢ)
• If the k-th attribute is categorical:
  P(xₖ|Cᵢ) is estimated as the relative frequency of samples having value xₖ as their k-th attribute in class Cᵢ in the training set
• If the k-th attribute is continuous:
  P(xₖ|Cᵢ) can be estimated through a Gaussian distribution with mean μ and standard deviation σ:
  P(xₖ|Cᵢ) = g(xₖ, μ_Cᵢ, σ_Cᵢ), where
  g(x, μ, σ) = (1 / (√(2π) ⋅ σ)) ⋅ e^(−(x−μ)² / (2σ²))
• Computationally easy in both cases (see the sketch below)
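A minimal sketch of both estimates in plain Python; the attribute values and the Gaussian parameters are invented for illustration:

    from math import exp, pi, sqrt

    def categorical_likelihood(class_values, x):
        """P(x_k | C_i) as the relative frequency of x among the class's attribute values."""
        return class_values.count(x) / len(class_values)

    def gaussian_likelihood(x, mu, sigma):
        """g(x, mu, sigma) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)**2 / (2 * sigma**2))."""
        return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

    # Categorical attribute: 3 of 4 training samples in class C_i have value "High"
    print(categorical_likelihood(["High", "High", "Normal", "High"], "High"))  # 0.75
    # Continuous attribute: density of x_k = 35 under the class's fitted Gaussian
    print(gaussian_likelihood(35.0, mu=38.0, sigma=4.0))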
Conditions for stopping the partitioning:
• All examples for a given node belong to the same class
• There are no remaining attributes for further partitioning
• There are no examples left
[Figure: example decision tree with class labels “high risk” / “low risk”]
Node impurity:
• Non-homogeneous (C0: 5, C1: 5): high degree of impurity
• Homogeneous (C0: 9, C1: 1): low degree of impurity
New data to classify:
Day | Outlook | Humidity | Wind | Play
D15 | Rain | High | Weak | ?
[Figure: growing the tree on the training data]
• Splitting on Outlook: the Overcast branch (D3, D7, D12, D13) is a pure subset (4 yes / 0 no) and becomes a “yes” leaf
• The Sunny branch (D1, D2, D8, D9, D11) must be split further, on Humidity: High (D1, D2, D8) → no; Normal (D9, D11) → yes
• The Rain branch (D4, D5, D6, D10, D14; 3 yes / 2 no) must be split further, on Wind: Weak (D4, D5, D10) → yes; Strong (D6, D14) → no
Decision Tree – Example
[Figure: the resulting decision tree — root: Outlook; Overcast → yes; Sunny → subtree on Humidity; Rain → subtree on Wind]
• Wanted:
• A measure for the heterogeneity of a set T of training objects with respect to their class membership
• A split of T into partitions T₁, T₂, …, Tₘ such that the heterogeneity is minimized
• Proposals: entropy / information gain, Gini index
Entropy of a set T with relative class frequencies pᵢ (with 0 ⋅ log₂(0) ≔ 0):
entropy(T) = −Σᵢ₌₁ᵏ pᵢ ⋅ log₂(pᵢ)

• Pure set (4 yes / 0 no): entropy(T) = −(4/4) ⋅ log₂(4/4) − (0/4) ⋅ log₂(0/4) = 0 bits

Split on Wind:
• Weak: 6 yes / 2 no → entropy(Wind_Weak) = −(6/8) ⋅ log₂(6/8) − (2/8) ⋅ log₂(2/8) = 0.811
• Strong: 3 yes / 3 no → entropy(Wind_Strong) = −(3/6) ⋅ log₂(3/6) − (3/6) ⋅ log₂(3/6) = 1.0

information gain(T, A) = entropy(T) − Σᵢ₌₁ᵐ (|Tᵢ| / |T|) ⋅ entropy(Tᵢ)

information gain(T, Wind) = entropy(T) − (8/14) ⋅ entropy(Wind_Weak) − (6/14) ⋅ entropy(Wind_Strong)
                          = 0.94 − (8/14) ⋅ 0.811 − (6/14) ⋅ 1.0 = 0.048
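A minimal sketch in plain Python that reproduces these numbers (0.811, 1.0, and the gain of roughly 0.048):

    from math import log2

    def entropy(pos, neg):
        """Entropy in bits of a set with pos yes- and neg no-examples (0*log2(0) := 0)."""
        total = pos + neg
        return -sum(c / total * log2(c / total) for c in (pos, neg) if c)

    def information_gain(parent, partitions):
        """entropy(T) minus the size-weighted entropies of the partitions T_1..T_m."""
        total = sum(p + n for p, n in partitions)
        return entropy(*parent) - sum((p + n) / total * entropy(p, n) for p, n in partitions)

    print(entropy(6, 2))   # Wind = Weak   -> 0.811
    print(entropy(3, 3))   # Wind = Strong -> 1.0
    print(information_gain((9, 5), [(6, 2), (3, 3)]))  # -> 0.048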
Entropy / Information Gain – Example
Starting set: 9 yes / 5 no:
entropy(T) = −(9/14) ⋅ log₂(9/14) − (5/14) ⋅ log₂(5/14) = 0.94

Split on Humidity:
• High: 3 yes / 4 no → entropy(Humidity_High) = −(3/7) ⋅ log₂(3/7) − (4/7) ⋅ log₂(4/7) = 0.985
• Normal: 6 yes / 1 no → entropy(Humidity_Normal) = −(6/7) ⋅ log₂(6/7) − (1/7) ⋅ log₂(1/7) = 0.592

information gain(T, Humidity) = entropy(T) − (7/14) ⋅ entropy(Humidity_High) − (7/14) ⋅ entropy(Humidity_Normal)
                              = 0.94 − (7/14) ⋅ 0.985 − (7/14) ⋅ 0.592 = 0.151
Entropy / Information Gain – Example
Starting set: 9 yes / 5 no, entropy(T) = 0.94

Split on Outlook (Sunny: 2 yes / 3 no; Overcast: 4 yes / 0 no; Rain: 3 yes / 2 no):
information gain(T, Outlook) = entropy(T) − (5/14) ⋅ entropy(Outlook_Sunny) − (4/14) ⋅ entropy(Outlook_Overcast) − (5/14) ⋅ entropy(Outlook_Rain)
                             = 0.94 − (5/14) ⋅ 0.971 − (4/14) ⋅ 0 − (5/14) ⋅ 0.971 = 0.246
Entropy / Information Gain – Example
Comparing the three candidate splits of the full training set (9 yes / 5 no, entropy(T) = 0.940):
• information gain(T, Outlook) = 0.94 − (5/14) ⋅ 0.971 − (4/14) ⋅ 0 − (5/14) ⋅ 0.971 = 0.246
• information gain(T, Humidity) = 0.94 − (7/14) ⋅ 0.985 − (7/14) ⋅ 0.592 = 0.151
• information gain(T, Wind) = 0.94 − (8/14) ⋅ 0.811 − (6/14) ⋅ 1.0 = 0.048
• Result: “Outlook” yields the highest information gain and becomes the root of the tree
[Figure: root split on Outlook — Sunny → ?, Overcast → yes, Rain → ?]
Entropy / Information Gain – Example
Day | Outlook | Humidity | Wind | Play
D1 | Sunny | High | Weak | No
D2 | Sunny | High | Strong | No
D3 | Overcast | High | Weak | Yes
D4 | Rain | High | Weak | Yes
D5 | Rain | Normal | Weak | Yes
D6 | Rain | Normal | Strong | No
D7 | Overcast | Normal | Strong | Yes
D8 | Sunny | High | Weak | No
D9 | Sunny | Normal | Weak | Yes
D10 | Rain | Normal | Weak | Yes
D11 | Sunny | Normal | Strong | Yes
D12 | Overcast | High | Strong | Yes
D13 | Overcast | Normal | Weak | Yes
D14 | Rain | High | Strong | No

Final decision tree (a runnable sketch follows below):
• Root {1, …, 14}: split on Outlook
• Overcast {3, 7, 12, 13} → yes
• Sunny {1, 2, 8, 9, 11} → split on Humidity: High {1, 2, 8} → no; Normal {9, 11} → yes
• Rain {4, 5, 6, 10, 14} → split on Wind: Weak {4, 5, 10} → yes; Strong {6, 14} → no
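For comparison, a minimal sketch that fits a decision tree to the same table, assuming pandas and scikit-learn are available; scikit-learn builds binary splits on one-hot columns, so the printed tree is equivalent to, but not shaped exactly like, the multiway tree above:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # The play-tennis table from above (D1..D14)
    data = pd.DataFrame({
        "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                     "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
        "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                     "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
        "Wind":     ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                     "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
        "Play":     ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                     "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
    })

    X = pd.get_dummies(data[["Outlook", "Humidity", "Wind"]])  # one-hot encode the categories
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, data["Play"])
    print(export_text(tree, feature_names=list(X.columns)))

    # The new example D15 (Rain, High, Weak) is classified "Yes", matching the tree above
    d15 = pd.get_dummies(pd.DataFrame([{"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}]))
    print(tree.predict(d15.reindex(columns=X.columns, fill_value=0)))  # -> ['Yes']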
• Eager evaluation
• Create models from the data (training phase) and then use these models for classification (test phase)
• Examples: decision tree, Bayes classifier
• Algorithm (k-nearest neighbors; a minimal sketch follows below):
• Given a new object x
• Compute the distance from x to every training example
• Select the k-neighborhood of x: the k closest training instances
• Label x with the most frequent class in its k-neighborhood (majority vote)
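A minimal sketch of this algorithm in plain Python, using Euclidean distance; the training data and the class labels (echoing the “high risk” / “low risk” figure earlier) are invented for illustration:

    from collections import Counter
    from math import dist

    def knn_classify(x, training_data, k=3):
        """Label x with the most frequent class among its k closest training examples."""
        neighbors = sorted(training_data, key=lambda item: dist(x, item[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    # Invented 2-D training examples: (features, class label)
    training_data = [((1.0, 1.0), "low risk"), ((1.2, 0.8), "low risk"),
                     ((0.9, 1.1), "low risk"), ((5.0, 5.0), "high risk"),
                     ((5.5, 4.5), "high risk")]
    print(knn_classify((1.1, 0.9), training_data))  # -> low risk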
• Bayesian Classifiers
• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
• Based on Bayes’ theorem