Lectures


Lecture 1: Introduction to Data Science

in Forensics and Security

IMT4133 – Data Science for Forensics and Security


Andrii Shalaginov, andrii.shalaginov@ntnu.no
NTNU i Gjøvik
Week 3: 20.01.2022
Reading: Kononenko 1&2; Chio 1
Norwegian University of Science and Technology
Content of the lecture

• Administrative announcements
• Data Science in InfoSec and Forensics
• Computational Forensics
• Machine Learning
• Hard & Soft Computing



IMT 4133 – the course team

• Teacher / Course Responsible: Andrii Shalaginov


andrii.shalaginov@ntnu.no

• Teaching Assistant: Kyle Porter


kyle.porter@ntnu.no



Overview of the topics
• Lecture 1: (Kononenko 1,2; Chio 1) Data Analysis;
Learning and Intelligence
• Lecture 2: (Kononenko 3; Chio 2): Machine Learning Basics;
Knowledge Representation
• Lecture 3: (Kononenko 4,5): Knowledge Representation;
Learning as Search
• Lecture 4: (Kononenko 6,7): Attribute Quality Measures; Data
Pre-processing
• Lecture 5: (Kononenko 9,10): Symbolic and Statistical
Learning
• Lecture 6: (Kononenko 11*; Chio 2): Artificial Neural
Networks; Deep Learning; Support Vector Machines
• Lecture 7: (Kononenko 12; Chio 2): Unsupervised Learning;
Cluster Analysis



Plan for semester
Semester: Weeks 3-14; Weeks 16-18 including MOCK exam
Written examination: 20.05.2022 (make sure to cross-check)

Curriculum includes:
• 7 theoretical Lectures (material from the textbook)
• 7 practical Exercise/Tutorials (practical tasks and applications) to
solve tasks given after each lecture
• Exercises are released after each lecture – try to solve them by the
following Thursday, before the Tutorial!
• Control questions for progress
• Q & A sessions before the exam
• MOCK exam (optional – preparation for final exam)
• A guest lecture on selected topic

Need help / Questions ? -> email



Examination (1)

• 4 Assignments – 40% of the final grade (10% each)


– Each assignment covers several topics
– Mainly requires application of learnt theory and data analysis
– Need to program (which language – up to you)
– Familiarize yourself with statistical software (Weka, RapidMiner)

• Final written exam – 60% of the grade


– Materials covered in lectures and textbook
– Does not require extensive manual calculations
– 3 hours



Examination (2)
• Spring 2022 – 3 hours written exam
• The exam will contain open questions, not multiple-
choice questions
• For the most up-to-date information, please check:
https://www.ntnu.edu/studies/courses/IMT4133#tab=timeplan
• Make sure to read and understand the guidelines:
https://innsida.ntnu.no/wiki/-/wiki/English/Cheating+on+exams
• You are expected to demonstrate your own independent
work on the exam.
• Feel free to send us email any time if anything is unclear



The textbook

- Igor Kononenko and Matjaz Kukar


“Machine Learning and Data Mining”
ISBN 1-904275-21-4, 2007

• Check with the library

• Good theoretical foundations



Recommended reading
• Chio, Clarence, and David Freeman. Machine Learning and Security:
Protecting Systems with Data and Algorithms. O'Reilly Media, Inc., 2018.

• http://noracook.io/Books/MachineLearning/machinelearningandsecurity.pdf

• Practical point of view



Semester Plan (1)
Week 3 (20.01.2022) Lecture 1: (Kononenko 1,2; Chio 1) Introduction to the team / Data
Analysis / ML methods / Artificial Intelligence / Big Data / Data Analytics problems in Digital
Forensics and Information Security / Computational Forensics

Week 4 (27.01.2022) Tutorial 1: Data Analysis; Learning and Intelligence

Week 5 (03.02.2022) Lecture 2: (Kononenko 3; Chio 2): ML Basics; Hybrid Intelligence;


Performance Evaluation

Week 6 (10.02.2022) Tutorial 2: Machine Learning Basics

Week 7 (17.02.2022) Lecture 3: (Kononenko 4,5) Knowledge Representation; Learning as


Search

Week 8 (24.02.2022) Tutorial 3: Learning as Search; Knowledge Representation

Week 9 (03.03.2022) Lecture 4: (Kononenko 6,7) Attribute Quality Measures; PCA; LDA;
Feature Selection



Semester Plan (2)
Week 10 (10.03.2022) Tutorial 4: Attribute Quality Measures. Data Pre-processing

Week 11 (17.03.2022) Lecture 5: (Kononenko 9,10) Symbolic & Statistical learning; Visualization

Week 12 (24.03.2022) Tutorial 5: Symbolic and Statistical learning

Week 13 (31.03.2022) Lecture 6: (Kononenko 11*; Chio 2) Artificial Neural Networks; Deep
Learning; Support Vector Machines

Week 14 (07.04.2022) Tutorial 6: Support Vector Machine & Artificial Neural Network

Week 15 Påske/Easter

Week 16 (21.04.2022) Lecture 7: (Kononenko 12; Chio 2) Unsupervised Learning; Cluster


Analysis

Week 17 (28.04.2022) Tutorial 7: Cluster Analysis

Week 18 (05.05.2022) Guest lecture / MOCK exam Preparation for the exam; Q & A



The Reference Group
• Shares student feedback on teaching quality
– At the end of the course

• Needs at Least 3 Members (remote, part time, full time)

– https://innsida.ntnu.no/wiki/-/wiki/English/Reference+groups+-+quality+assurance+of+education
– Establish ongoing dialogue with fellow students
• Meetings?
• Surveys?

• We *Prefer* volunteers ☺

• Email us by 15.02.2022



Time for introduction
What is your motivation in taking this course?
Did you use Machine Learning before?



It is all about Data



From 4Vs of Big Data …

Ibm.com
… to 5 Vs of Big Data paradigm

http://bigdata.black/
What stops us?
The 42 V's of Big Data and Data Science !!!

https://www.elderresearch.com/blog/42-v-of-big-data



Big Data becomes more complex

https://www.elderresearch.com/blog/42-v-of-big-data



Growth in data exchange rate



Digital Footprint



Everything becomes Cyber

blogspot.com



geckoandfly.com
PC Storage Capacity Trends

https://www.anandtech.com/show/10315/market-views-hdd-shipments-down-q1-2016/3
https://www.backblaze.com/blog/hard-drive-benchmark-stats-2016/
Smartphones Storage Trends

https://www.gizmochina.com/2017/06/20/antutu-report-smartphone-pref-052017/
Data generated by devices - trends

https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/mobile-white-paper-c11-520862.html
Digital Forensics,
Computational Forensics



ICT-related crimes

Cyber-enabled and cyber-dependent crimes


General categories:
• ICT as a target
• ICT as a tool
• ICT is affiliated with a crime
• Crimes against ICT industry
www.tunisiancloud.com
Digital Forensics

• Computer Crimes
• Fields of DF:
– Malware analysis
– Network Forensics
– Social Network Mining
– Content identification
– etc

• Digital Forensics Process:


– Identification, Preparation, Approach Strategy, Preservation,
– Collection, Examination, Analysis and Presentation.

http://www.itsgov.com/


Digital Forensics Process



Computational Forensics
Computational Forensics (CF) is a quantitative approach
to the methodology of the forensic sciences. It involves
computer-based modeling, computer simulation, analysis,
and recognition in studying and solving problems posed in
various forensic disciplines. CF integrates expertise from
computational science and forensic sciences.

http://www.tamingdata.com/2011/01/06/beyond-c-s-i-the-rise-of-computational-forensics/
http://ieeesmc.org/newsletters/back/2009_12/main_article1.html
AI, ML, Data Analytics in the real world



Digital Forensics & Hybrid Intelligence
– Too much data for manual processing
– Emergence of new ways to commit cyber crime, e.g. malware, network attacks

+ Ability to handle real-world data
+ Human-understandable ML models



Expectations

https://www.kdnuggets.com/2015/12/top-tweets-dec14-20.html



Are we there?

http://wmbriggs.com/post/24784/



“AI Winters”

https://www.actuaries.digital/2018/09/05/history-of-ai-winters/



https://www.analyticsvidhya.com/blog/2015/07/difference-machine-learning-statistical-modeling/
Machine Learning (1)

• Machine Learning – a set of methods capable of learning and making
decisions from data. Requires data to be represented as numerical
features (attributes). Comprises two main stages: Training and
Testing. Follows this simple dataflow:
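The Training/Testing dataflow above can be sketched in miniature. This is a toy sketch with a made-up dataset and a trivial threshold "model"; all names and values are illustrative:

```python
import random

# Hypothetical dataset: each sample is (feature_vector, label).
data = [([float(i), float(i % 7)], int(i > 50)) for i in range(100)]

# Split into Training and Testing sets (here 70/30).
random.seed(0)
random.shuffle(data)
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

# Training: "learn" a threshold on feature 0, halfway between the
# mean of the positive class and the mean of the negative class.
pos = [x[0] for x, y in train if y == 1]
neg = [x[0] for x, y in train if y == 0]
threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

# Testing: evaluate accuracy on the held-out data.
correct = sum(1 for x, y in test if (x[0] > threshold) == bool(y))
accuracy = correct / len(test)
```

Even this trivial model follows the full dataflow: represent samples as numerical features, train on one partition, test on the other.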



Machine Learning (2)

Classical ML:
• Supervised learning
• Unsupervised learning
• Regression / Forecasting
• Rules learning
• Reinforcement learning
Also
• New trends – Deep Learning
• Nature-inspired methods
• Big Data-oriented improvements



Machine Learning – real life

• In reality ML requires thorough selection of methods:



Hard Computing & Soft Computing (1)

• What is Soft Computing?


– inexact solutions
– better tolerance to imprecise / missing data
– ability to extract human-understandable model
• By contrast, conventional Hard Computing
– requires specific definitions
– crisp classical logic
– deterministic with exact specifications



Hard Computing & Soft Computing (2)



Hard Computing & Soft Computing (3)

Decision Tree Fuzzy Rules

www.data-machine.com orthojournal.wordpress.com



Thank you for your attention!
Andrii Shalaginov
Department of Information Security and Communication
Technology
Faculty of Information Technology and Electrical Engineering
Norwegian University of Science and Technology
andrii.shalaginov@ntnu.no



IMT 4133
Data Science for Security and Forensics

Knowledge Representation

Carl Stuart Leichter, Andrii Shalaginov, Katrin Franke,


(NTNU/Testimon)
Knowledge Representation
• A field of Artificial Intelligence
• Uses existing knowledge to create new
  – perspectives on the data
  – knowledge from the data

• Raw data is often not understandable or informative, so it needs
  – additional transformation
  – representation

• General approaches to Knowledge Representation:
  – first-order logic
  – probability distributions
  – regression
Attributes Representation
• In a general Machine Learning problem, the attributes' value
domains can be characterized by the following properties:

• All possible attributes (i.e., variables): A = {A0, ..., An}
  – The things we measure (to collect data)
  – These choices affect the data structure

• Allowable values for a particular attribute Ai:
  – Vi = {V0, ..., Vn} in the case of a discrete attribute
  – Vi = [Vmin, Vmax] in the case of a continuous attribute

• The output attribute C to be predicted/classified:
  – Ci = {C0, ..., Cn} in the case of discrete classification
  – Ci = [Cmin, Cmax] (a continuous value) in the case of regression

• NB: discrete values can be binary: {0, 1}
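As a sketch, the attribute domains above can be written as plain data structures. The dictionary layout and the `is_valid` helper below are illustrative choices, not from the textbook:

```python
# Discrete attribute: an explicit set of allowable values Vi.
colour = {"name": "colour", "type": "discrete", "values": {"red", "green", "blue"}}

# Continuous attribute: an interval [Vmin, Vmax].
length = {"name": "length", "type": "continuous", "values": (0.0, 10.0)}

# The output attribute C: discrete for classification (here binary {0, 1}).
target = {"name": "class", "type": "discrete", "values": {0, 1}}

def is_valid(attribute, v):
    """Check that a value lies in the attribute's domain."""
    if attribute["type"] == "discrete":
        return v in attribute["values"]
    lo, hi = attribute["values"]
    return lo <= v <= hi
```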
Attributes Representation

• Attributes can also be classified by their properties


and/or heuristics:
– missing
– correlated
– noisy
– redundant
– random.

Knowledge Representation
• Logical Descriptions
– describing data samples themselves
– describing relationships between data samples
– describing relationships between data and outputs

Every skier likes the snow:

∀x Skier(x) => LikesSnow(x)

All brothers are siblings:

∀x ∀y Brother(x, y) => Siblings(x, y)

http://people.westminstercollege.edu/faculty/ggagne/fall2014/301/chapters/chapter8/index.html
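On a finite domain, such a universally quantified implication can be checked mechanically. A toy sketch (the sets and helper are invented for illustration):

```python
# Illustrative finite-domain check of: forall x, Skier(x) => LikesSnow(x).
skiers = {"anna", "ben"}
likes_snow = {"anna", "ben", "carl"}   # carl likes snow but does not ski
domain = {"anna", "ben", "carl"}

def forall_implies(domain, antecedent, consequent):
    # An implication holds for x whenever the antecedent is false
    # or the consequent is true.
    return all((x not in antecedent) or (x in consequent) for x in domain)

holds = forall_implies(domain, skiers, likes_snow)   # True on this domain
```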
Logical Order to Attributes

http://people.westminstercollege.edu
DIKW Pyramid

DIKW Adaptation in use within the US Army KM Community of Practice, by way of
https://commons.wikimedia.org/wiki/File:KM_Pyramid_Adaptation.png
DIKW Progression

DIKW Progression

Data: Raw Packet Data
  ↓ (Analysis)
Information: Network Resources Utilization
  ↓ (Interpretation)
Knowledge: Intrusion Detection
  ↓ (Understanding)
Wisdom: IDS Policy


Knowledge Representation

• Important area of Data Mining


• Represent data in a form suitable for further analysis.
• Many different approaches:
– mapping functions
– regression
– first-order logic rules from the attributes space
– others

IMT 4133
Data Science for Security and Forensics

Basic Machine Learning

Andrii Shalaginov (NTNU),


Sargur N. Srihari (University at Buffalo),
Carl Stuart Leichter (NTNU)
ML Basics

• Data as Features
• Feature Space
• Polynomial Curve Fitting
• Model Selection
• Performance Testing
• Curse of Dimensionality

https://medium.com/@manveetdn/understanding-machine-learning-as-6-jars-eecfafc77051
Data Everywhere…

https://www.analyticsvidhya.com/blog/2015/12/hilarious-jokes-videos-statistics-data-science/
Analogue vs Digital

https://techdifferences.com/difference-between-analog-and-digital-signal.html
https://www.reddit.com/r/Damnthatsinteresting/comments/jt87tl/bill_gates_showing_how_much_data_a_cdrom_can_hold/
What Does the Data Represent?
• The input attributes are the features
– Length
– Weight
– Duration
– Intensity
– Variation
– Etc

Data Types

https://www.etsfl.com/do-you-know-the-types-of-data/
https://medium.com/@manveetdn/understanding-machine-learning-as-6-jars-eecfafc77051
Vs of Big Data

Some “real-world” numbers

Tasks like…analysing 13 TBytes of
viruses

High volume data
Can we find such publicly available data?

Kaggle (1)

https://www.kaggle.com/datasets?sizeStart=90%2CGB&sizeEnd=1000%2CGB
Kaggle (2)

https://www.kaggle.com/niveditjain/human-faces-dataset
UC Irvine Machine Learning
Repository (1)

https://archive.ics.uci.edu/ml/index.php
UC Irvine Machine Learning
Repository (2)

https://archive.ics.uci.edu/ml/datasets/
UC Irvine Machine Learning
Repository (3)

Feature Spaces
Wood Classification Example
• Have a big pile of mixed wooden blocks
• Mixture of 3 different kinds of wood
– Ash
– Pine
– Birch

• Want to be able to measure a wooden block’s attributes


and use them to determine the type of wood
• Decided on 2 optical attributes or features
1. Overall brightness
2. Wood grain prominence (peak to peak variation)

Feature Spaces

• Wood Classification Problem

– 2 Optical Features
• Overall brightness of the wood
• Wood grain prominence (peak to peak variation)

– Results in a 2-Dimensional Feature Space


– SUPERVISED learning:
• We start with known pieces of wood
• Measure each piece’s features
• Plot those measurements in the feature space
• Give each plotted point its class LABEL

– If we have chosen our features well, then we will see good
clustering/separation of the different classes in the
feature space.
Wood Classification Feature Space

[Figure: 2-D feature-space scatter plot, brightness vs. grain prominence]
• Note the separate clusters
• If you had an unknown piece of wood, you could
measure its features and then find which class it
belongs in
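A minimal sketch of that idea: a nearest-centroid classifier over the two optical features. The measurement values below are made up for illustration:

```python
# Hypothetical (brightness, grain prominence) measurements for labeled blocks.
training = {
    "ash":   [(0.80, 0.10), (0.78, 0.12), (0.82, 0.09)],
    "pine":  [(0.60, 0.40), (0.62, 0.38), (0.58, 0.42)],
    "birch": [(0.90, 0.30), (0.88, 0.32), (0.92, 0.28)],
}

def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

centroids = {wood: centroid(pts) for wood, pts in training.items()}

def classify(sample):
    # Assign an unknown block to the class with the nearest cluster centroid.
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(centroids, key=lambda wood: dist2(sample, centroids[wood]))

label = classify((0.61, 0.39))   # falls inside the "pine" cluster
```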

ML Development Data

• “Toy” Data
– Well known, well understood and commonly used data sets:
• https://en.wikipedia.org/wiki/Iris_flower_data_set
– You already know what the results should be:
• Can compare your results with the ones in the literature

• Synthetic Data
– You KNOW the data structure, because you have created it

• Data from your application domain


– Start simple and clean first (may be synthesized)
– Progress to more and more realistic data sets

Simple Regression Problem
• Observe Real-valued input variable x
• Use x to predict value of target variable t

• Experiment with synthesized* data:


– Target function to be learned via regression: sin(2πx)
– Add some random noise to the data!

Polynomial Curve Fitting

• N observations of x
– x = (x1 ,..,xN )
– t = (t1 ,..,tN )

• Goal is to exploit the training set to predict the value of t from x


• Inherently difficult problem
• Probability theory allows us to make a prediction

• Data Generation:
– N = 10
– Spaced uniformly in range [0,1]
– Generated from sin(2πx)
– Adding Gaussian noise
– Noise typically due to unobserved variables
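The data-generation recipe above can be sketched directly; the seed and noise level below are arbitrary choices:

```python
import math
import random

random.seed(1)
N = 10
# N points spaced uniformly in [0, 1].
x = [i / (N - 1) for i in range(N)]
# Targets generated from sin(2*pi*x) plus additive Gaussian noise,
# standing in for unobserved variables.
t = [math.sin(2 * math.pi * xi) + random.gauss(0.0, 0.2) for xi in x]
```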

Polynomial Curve Fitting

• M is the order (degree) of the polynomial


• Nonlinear function of x,
• Linear function of coefficients wj
• These types of problems are often called Linear
Models

• Is higher value of M better? We’ll see shortly!


• Coefficients w0 ,…wM are denoted by vector w
– In ML, w is often called the “weight vector”

Polynomial Curve Fitting

Fit a polynomial of order M to find the weights wj:

y(x, w) = w0 + w1 x + w2 x^2 + ... + wM x^M = Σ_{j=0..M} wj x^j

Sum-of-Squares Error Function

E(w) = (1/2) Σ_{n=1..N} { y(xn, w) − tn }^2
Objective Functions
• Measures a figure of merit to be optimized

– Sum of Squares (for this example)


– Mean Square Error (MSE)
• Average of sum of squares
– Least Mean Squares (LMS)

– Statistical Measurements
• Variance
• Kurtosis

– Information Theoretical Metrics


• Mutual Information
• Information Entropy
– Negentropy

Objective Functions
– Sum of Squares

– Mean Square Error (MSE)

– Variance

Learn by Optimizing the Objective Function

Want to find the w that minimizes the Sum-of-Squares Error
Optimizing the Objective Function

• The error function is a quadratic in coefficients w


• Optimization requires?

• Partial Differential Calculus!


– <and the crowd says: “yay”>

Partially Differentiating

∂E/∂wi = Σ_{n=1..N} { y(xn, w) − tn } (xn)^i
Partially Differentiating

∂E/∂wi = 0

• How do we find a minimum with differentiation?
• Set the partial derivatives to zero
• After expansion, we find the zeros at:

Σ_{j=0..M} Aij wj = Ti,  where Aij = Σ_n (xn)^(i+j) and Ti = Σ_n (xn)^i tn
Optimizing the Objective Function

• We end up with a set of M + 1 equations in M + 1 unknowns


• We need to solve them to get the elements of the vector
w

• So what is the minimum number of data points (x, t) we need to have
any chance of our learning converging on a solution?
• We need at least M + 1 data points!
• The model complexity drives the training data
requirements!
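A sketch of solving those M + 1 equations numerically via the normal equations (NumPy assumed; the seed and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3                        # N data points, polynomial order M
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)

# Design matrix: row n is [1, x_n, x_n**2, ..., x_n**M].
X = np.vander(x, M + 1, increasing=True)

# Setting dE/dw_i = 0 for the sum-of-squares error yields the normal
# equations (X^T X) w = X^T t: M + 1 linear equations in M + 1 unknowns.
w = np.linalg.solve(X.T @ X, X.T @ t)

def y(xq, w=w):
    """Evaluate the fitted polynomial at xq."""
    return sum(wj * xq ** j for j, wj in enumerate(w))
```

With N ≥ M + 1 well-spread points the system is solvable; fewer points leave the equations underdetermined.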
A Central Principle in ML

• The model complexity drives the training data


requirements!

0th Order Polynomial (Constant)

Real World System

Curve from Regression/Learning


1st Order Polynomial (Linear)

Real World System

Curve from Regression/Learning


3rd Order Polynomial (Cubic)

Real World System

Curve from Regression/Learning


38
9th Order Polynomial (Nonic)

What Happened?
Generalized Performance Analysis
• Several separate tests, with M = 0,1,2 …9
• For each test with a different M
– N = 100 (# data points)

• Evaluate the performance, by measuring the error.


• Can use a different error metric than the one used by the
ML algorithm.
– Use RMS

• Division by N allows different sizes of N to be compared
– Can see how # data points used for training affects performance
(an E vs N graph)

– Can use experiments to find the # data points required for model
complexity M to converge on its minimum error.
• As M increases, so does N

Training/Testing Data Partition

• Not all of the data is used to find the best fit


• Some of the data is held back, to test the fit
• A good model with sufficient data will learn to
“generalize”
– It will converge on the hidden structure in the data
– If the data contains a good representation of the system
under study (by implication, the structure in the system)

Training Data and Testing Data

Root-Mean-Square (RMS) Error:

E_RMS = sqrt( 2 E(w*) / N )

RMS is another error measure; it could be used as an objective function.
Over-fitting (N = 10)

• Poor performance at low M, due to a simplistic model
• Best performance at intermediate M
• At high M, a complex model that overfits (overlearns): it perfectly
fits the training data, but it cannot generalize, so it fails with test data
Polynomial Curve Fitting

Find the weights wj

Polynomial Coefficients (weights wj)
How Can We Fix the Overfitting Problem?

• N= 10 Data Points

• N= 15 Data Points

• N= 100 Data Points


How Can We Fix the Overfitting Problem?

• Regularization: add a term to the objective function that
penalizes large coefficient values (large weight vectors)

• Regularization is like model order estimation
  – What is max "N"?
  – Akaike Information Criterion (AIC)
  – Bayesian Information Criterion (BIC)
How Can We Fix the Overfitting Problem?

Regularized error = "Goodness of Fit" term + "Model Complexity" term:

E~(w) = (1/2) Σ_n { y(xn, w) − tn }^2 + (λ/2) ||w||^2
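A sketch of the regularized fit: adding λ·I to the normal equations implements the complexity penalty. The values mirror the M = 9, ln(λ) = −18 example; NumPy is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 9
lam = np.exp(-18)                       # ln(lambda) = -18, lambda ~ 1.5e-8
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)
X = np.vander(x, M + 1, increasing=True)

# Unregularized M = 9 fit: interpolates the noise, weights blow up.
w_plain = np.linalg.lstsq(X, t, rcond=None)[0]

# Regularized normal equations: (X^T X + lambda*I) w = X^T t.
# The lambda*I term is the "model complexity" penalty on large weights.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M + 1), X.T @ t)

# The regularizer shrinks the weight vector.
shrunk = np.linalg.norm(w_ridge) < np.linalg.norm(w_plain)
```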
Effect of Regularizer
M = 9 polynomials using the regularized error function:

• No regularizer: ln(λ) → −∞
• Optimal regularizer: ln(λ) = −18 (λ ≈ 1.53e−8)
• Very large regularizer: ln(λ) = 0
Impact of Regularization on Error

• λ controls the complexity of the model and hence


degree of over-fitting
– Similar to the choice of M
• Doesn’t completely eliminate the higher order terms
– Reduces their influence

Classifier Performance and Evaluation
• Classification is closely related to logistic regression
• Regression PLUS some logic (0/1, True/False)
  – Within class / outside of class
  – Can have several classes (like our wood problem)
  – Data sample classification:
    • Where in the feature space does the data sample belong?
    • Which side of the feature-space boundary do the data sample's features fall on?
Classifier Evaluation Metrics:
Confusion Matrix

Confusion Matrix:

Actual \ Predicted |         C1           |        ¬C1
C1                 | True Positives (TP)  | False Negatives (FN)
¬C1                | False Positives (FP) | True Negatives (TN)

• Given m classes
• an entry, CMi,j in a confusion matrix indicates:
– # of tuples in class i that were labeled by the classifier as class j

• May have extra rows/columns to provide totals

Classifier Evaluation Metrics:
Confusion Matrix

Actual \ Predicted  | buy_computer = yes | buy_computer = no | Total
buy_computer = yes  | 6954               | 46                | 7000
buy_computer = no   | 412                | 2588              | 3000
Total               | 7366               | 2634              | 10000

Classifier Evaluation Metrics:
Accuracy
A \ P | C  | ¬C |
C     | TP | FN | P
¬C    | FP | TN | N
      | P' | N' | All

• Classifier Accuracy, or recognition rate:


– percentage of test set tuples that are correctly classified
Accuracy = (TP + TN)/All

• Error rate: 1 – accuracy, or


Error rate = (FP + FN)/All

Classifier Evaluation Metrics:
Sensitivity and Specificity
• Sensitivity: True Positive recognition rate
  – Sensitivity = TP/P = TP/(TP + FN)
• Specificity: True Negative recognition rate
  – Specificity = TN/N = TN/(TN + FP)

Class Imbalance Problem:

▪ One class may be rare, e.g. fraud, or HIV-positive


▪ Significant majority in the negative class
▪ Small minority in the positive class

Classifier Evaluation Metrics:
Precision and Recall
• Precision (exactness): what % of tuples that the classifier labeled
as positive are actually positive?

  Precision = TP / (TP + FP)

• Recall (completeness): what % of positive tuples did the classifier
label as positive? (TP + FN are the "should have been positives")

  Recall = TP / (TP + FN)

• Perfect score is 1.0
• Inverse relationship between precision and recall
Classifier Evaluation Metrics:
Example

Actual \ Predicted | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes       | 90           | 210         | 300   | 30.00 (sensitivity)
cancer = no        | 140          | 9560        | 9700  | 98.56 (specificity)
Total              | 230          | 9770        | 10000 | 96.50 (accuracy)

– Precision = 90/230 = 39.13%   Recall = 90/300 = 30.00%
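The metrics above can be checked directly from the confusion-matrix counts (a minimal sketch):

```python
# Confusion-matrix counts from the cancer example above.
TP, FN = 90, 210        # actual cancer = yes
FP, TN = 140, 9560      # actual cancer = no

P, N = TP + FN, FP + TN
total = P + N

accuracy = (TP + TN) / total          # 0.965
sensitivity = TP / P                  # 0.30 (recall on the positive class)
specificity = TN / N                  # ~0.9856
precision = TP / (TP + FP)            # 90/230 ~ 0.3913
recall = TP / (TP + FN)               # 0.30
```

Note how the high accuracy hides the poor recall on the rare positive class, which is exactly the class-imbalance problem described above.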

Cross-Validation Methods
• Cross-validation (k-fold, where k = 10 is most popular)

– Randomly partition data into k mutually exclusive


subsets (D1…Di…Dk)
• each subset Di is approximately equal in size

– At i-th iteration, use Di as test set and others as training


set
• Leave-one-out:
– for small sized data
– k folds where k = # of tuples
• Stratified cross-validation:
– folds are stratified
– class distribution in each fold is approximately the same
as that in entire data set
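A sketch of the k-fold partitioning step, over indices only (the helper names are illustrative):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate_splits(n, k=10):
    """At iteration i, fold i is the test set, the rest are training."""
    folds = k_fold_indices(n, k)
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits

splits = cross_validate_splits(100, k=10)
```

Each sample appears in exactly one test fold, so every sample is used for testing exactly once across the k iterations.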
ROC Curve:
Receiver Operating Characteristic
ROC Curve Explained
• It shows the tradeoff between sensitivity and specificity
(any increase in sensitivity will be accompanied by a
decrease in specificity).
• The closer the curve follows the left-hand border and
then the top border of the ROC space, the more accurate
the test.

NB: specificity is 1 – FP rate

ROC Curve Explained

• The closer the curve comes to the 45-degree diagonal


of the ROC space, the less accurate the test.
• The area under the curve is a measure of accuracy.
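A minimal sketch of computing ROC points and the area under the curve by sweeping the decision threshold (trapezoidal rule; assumes no tied scores):

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold."""
    P = sum(labels)
    N = len(labels) - P
    pts = [(0.0, 0.0)]
    tp = fp = 0
    # Sort by descending score; lowering the threshold admits one
    # sample at a time.
    for s, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / N, tp / P))
    return pts

def auc(pts):
    """Area under the curve via the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# A classifier that ranks all positives above all negatives: AUC = 1.0.
perfect = auc(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```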

Measures for Multiclass Classifiers

• Confusion Matrix – the most widely used technique; gives an idea of
the distribution of accuracy with respect to all classes
(rows: actual class, columns: predicted class).
Curse of Dimensionality

Thank you for your attention!
Lecture 2: Machine Learning Basics;
Knowledge Representation

IMT4133 – Data Science for Forensics and Security


Andrii Shalaginov, andrii.shalaginov@ntnu.no
NTNU i Gjøvik
Week 5: 03.02.2022
Reading: Chapter 3 – Kononenko, Chapter 2 - Chio
Content of the lecture

• Administrative announcements
• Reference group ☺
• Assignments
• Knowledge Representation
• Machine Learning Basics





Machine Learning 101: What the heck
is this

Kyle Porter & Katrin Franke


Covariance Review

Carl Leichter, Andrii Shalaginov


NTNU i Gjøvik
[Figure: two sources s1, s2 mixed through weights a11…a23 into observations x1, x2, x3]
A Less Messy Representation

What Relationships Can We Find in the Data Space?

• Covariance
• Correlation
• Etc

Correlation and Covariance

Covariance in terms of Correlation:

Cov(X, Y) = ρ(X, Y) · σ_X · σ_Y
Correlation (Normalized Covariance)

https://www.researchgate.net/figure/259147064_fig19_Figure-8-33-Illustration-of-covariance-and-correlation
Covariance Matrix For Variables X, Y

Σ = [ Var(X)     Cov(X, Y)
      Cov(Y, X)  Var(Y)   ]
Covariance Matrix For Variables X, Y

http://www.sharetechnote.com/html/Handbook_EngMath_CovarianceMatrix.html

8
Covariance Matrix For Variables
X, Y, Z

http://gael-varoquaux.info/science/ica_vs_pca.html

9
Covariance Matrix For Variables X, Y, Z

10
Generalized Correlation Matrix

https://www.value-at-risk.net/parameters-of-random-vectors/

11
Generalized Covariance Matrix

https://www.value-at-risk.net/parameters-of-random-vectors/

12
Correlation and Covariance

Covariance in terms of Correlation: Cov(X, Y) = Corr(X, Y) · σX · σY

13
Eigenanalysis of Covariance
(First Step in PCA)

http://www.visiondummy.com/wp-content/uploads/2014/04/eigenvectors_covariance.png

14
Eigenanalysis of Covariance

http://www.visiondummy.com/wp-content/uploads/2014/04/eigenvectors.png

15
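This first PCA step can be sketched in pure Python for the 2x2 case. The data points are made up for illustration; the eigenvalues of the symmetric 2x2 covariance matrix follow from the closed-form quadratic formula.

```python
import math

# Hypothetical 2-D sample: y grows roughly with x, so the point cloud
# is elongated along one direction (illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.9]

def cov(a, b):
    """Sample covariance with the usual n-1 denominator."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (n - 1)

# Symmetric 2x2 covariance matrix [[sxx, sxy], [sxy, syy]]
sxx, syy, sxy = cov(xs, xs), cov(ys, ys), cov(xs, ys)

# Closed-form eigenvalues of a symmetric 2x2 matrix:
#   lambda = (trace +/- sqrt(trace^2 - 4*det)) / 2
tr = sxx + syy
det = sxx * syy - sxy * sxy
disc = math.sqrt(tr * tr - 4.0 * det)
lam1, lam2 = (tr + disc) / 2.0, (tr - disc) / 2.0

# The dominant eigenvalue captures almost all of the variance here,
# which is exactly what PCA exploits when dropping small components.
print(lam1, lam2)
```

Note that the two eigenvalues always sum to the trace of the covariance matrix, i.e., to the total variance.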
Zero Correlation Does NOT Imply
Statistical Independence

http://stats.stackexchange.com/questions/12842/covariance-and-independence
16
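A minimal sketch of this pitfall (the symmetric grid of points is an illustrative choice): y = x^2 is completely determined by x, so the variables are clearly dependent, yet their sample covariance, and hence their correlation, is exactly zero.

```python
# y is a deterministic function of x, but on a symmetric grid the
# positive and negative contributions to the covariance cancel.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
print(cov_xy)  # 0.0: zero covariance despite full dependence
```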
Learning as a Search
Andrii Shalaginov, Carl Stuart Leichter, Jayson Mackie
Objectives

• Understand the concepts


• Understand the algorithms
– Through the description
– Through the examples
• Pros and cons of algorithms
• Applications and Demo
Search algorithms in machine learning

• Exhaustive search
• Bounded exhaustive search
• Best-first search
• Greedy search
• Beam search
• Gradient search
• Simulated annealing
• Genetic algorithms
What are search algorithms?

• In computer science, a search algorithm, broadly speaking, is


an algorithm that takes a problem as input and returns a
solution to the problem, usually after evaluating a number of
possible solutions.

• The set of all possible solutions to a problem is called the


search space.
Exhaustive search

• The simplest method


– derives the next hypothesis from the current one and
evaluate it until the end
– requires a constant space and a linear time complexity with
respect to the search depth
If the search depth is not known, we cannot finish the search

(Based on slides of
Watanabe, 2010)
Exhaustive search

• Use a tree representation


– In an evolutionary system, we can still report candidate
solutions even if we stop in the middle
Exhaustive search: BFS

• Breadth-first search (BFS)


– Evaluate the current node
– Queue the successor of the current node
– Take the next node from the queue
• Pros & Cons
+ Easy to understand
– Space-complexity grows
exponentially with
the depth of the
search tree
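The three BFS steps above can be sketched as follows; the search tree and node names are hypothetical.

```python
from collections import deque

# Hypothetical search tree as an adjacency dict (names are made up).
tree = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}

def bfs(tree, root):
    """Evaluate nodes level by level using a FIFO queue."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()      # take the next node from the queue
        order.append(node)          # "evaluate" the current node
        queue.extend(tree[node])    # queue the successors of the node
    return order

print(bfs(tree, "A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

The queue is what makes the space complexity grow with the width of each level, and hence exponentially with depth for bushy trees.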
Exhaustive search: DFS

• Depth-first search (DFS)


– Evaluate the current node’s unevaluated successors first
• Pros & Cons
+ Linear space-complexity for the search depth
– Does not work if the tree has
unlimited depth or a cycle in
the search space
DFS Cycle

Performing the same search


without remembering previously
visited nodes results in visiting
nodes in the order
A, B, D, F, E,
A, B, D, F, E, ...

https://en.wikipedia.org/wiki/Depth-first_search
Exhaustive search: IDS
• Iterative deepening search (IDS)
– Set search depth = 1
– Depth First Search for the search depth
– Increase the search depth
• Pros & Cons
+ No cycle-problem, Linear-space complexity
– Increased time-complexity
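A minimal IDS sketch under the same hypothetical-tree assumption: it re-runs a depth-limited DFS with depth 1, 2, ... until the goal node appears, trading repeated work for linear space.

```python
def dls(tree, node, limit, order):
    """Depth-limited DFS: stop expanding below the given depth."""
    order.append(node)
    if limit > 0:
        for child in tree.get(node, []):
            dls(tree, child, limit - 1, order)

def ids(tree, root, goal, max_depth):
    """Iterative deepening: grow the depth until the goal is seen."""
    for depth in range(1, max_depth + 1):
        order = []
        dls(tree, root, depth, order)
        if goal in order:
            return depth, order
    return None, []

# Hypothetical tree; the goal "D" sits at depth 2.
tree = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
depth, order = ids(tree, "A", "D", max_depth=5)
print(depth, order)  # 2 ['A', 'B', 'D', 'C']
```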
Example of Exhaustive Search

Max Z = 5x1 + 8x2
subject to
x1 + x2 ≤ 6
5x1 + 9x2 ≤ 45
x1, x2 ≥ 0 integer

[Plot: feasible region bounded by x1 + x2 = 6 and 5x1 + 9x2 = 45]
Example of BFS

Max Z = 5x1 + 8x2
subject to
x1 + x2 ≤ 6
5x1 + 9x2 ≤ 45
x1, x2 ≥ 0 integer

[Plot: the integer lattice points of the feasible region, numbered
in the order a breadth-first search visits them]
Example of DFS

Max Z = 5x1 + 8x2
subject to
x1 + x2 ≤ 6
5x1 + 9x2 ≤ 45
x1, x2 ≥ 0 integer

[Plot: the integer lattice points of the feasible region, numbered
in the order a depth-first search visits them]
Exhaustive search

• Conclusion
– If the depth of the search is known
Use Depth-first search
– else
Use Iterative deepening search
• Applications
– Crypto attacks
– Dictionary attacks
– Guess passwords
Branch and Bound
• Pros & Cons
+ Faster than exhaustive search
+ In many cases, it can find the global optimal solution
- Not easy to implement
• Applications
– Optimization problems
– Feature selection
Best-first search

• Improvement of the bounded exhaustive search based on the


Breadth-first search
• Only follow the most promising hypothesis
• Pros & Cons
+ Search time is short
- Space-complexity is still exponential
- No guarantee of the global optimal solution
Greedy search

• A greedy algorithm works in phases. At each phase:


– You take the best you can get right now, without regard for
future consequences
– You hope that by choosing a local optimum at each step,
you will end up at a global optimum
Example: Counting money
• Suppose you want to count out a certain amount of money,
using the fewest possible bills and coins
• A greedy algorithm to do this would be: at each step,
take the largest possible bill or coin that does not overshoot
– Example: To make $6.39, you can choose:
• a $5 bill
• a $1 bill, to make $6
• a 25¢ coin, to make $6.25
• A 10¢ coin, to make $6.35
• four 1¢ coins, to make $6.39
• For US money, the greedy algorithm always gives the
optimum solution

26
A failure of the greedy algorithm

• In some (fictional) monetary system, “krons” come in 1


kron, 7 kron, and 10 kron coins
• Using a greedy algorithm to count out 15 krons, you would
get
– A 10 kron piece
– Five 1 kron pieces, for a total of 15 krons
– This requires six coins
• A better solution would be to use two 7 kron pieces and one
1 kron piece
– This only requires three coins
• The greedy algorithm results in a solution, but not in an
optimal solution

27
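Both slides can be reproduced with one short greedy change-making routine; the denomination lists mirror the slides (US values in cents, plus the fictional krons).

```python
def greedy_change(amount, denominations):
    """Repeatedly take the largest coin that does not overshoot."""
    coins = []
    for d in sorted(denominations, reverse=True):
        while amount >= d:
            amount -= d
            coins.append(d)
    return coins

# US-style denominations (in cents): greedy is optimal here.
print(greedy_change(639, [500, 100, 25, 10, 1]))
# [500, 100, 25, 10, 1, 1, 1, 1] -> 8 pieces, as on the slide

# Fictional "kron" coins: greedy needs 6 coins for 15 ...
print(greedy_change(15, [1, 7, 10]))  # [10, 1, 1, 1, 1, 1]
# ... while the optimum is only three coins: [7, 7, 1]
```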
Beam search

• Extension of Greedy Search:


–Instead of following only one “best-first”-choice, choose k
“best” solutions
–Not equal to parallel Greedy Searches: The k “best” children
choices are chosen from all current choices
Gradient search (Gradient descent)

• Gradient of a scalar field or function is a vector field which


points in the direction of the greatest rate of increase of the
scalar field or function, and whose magnitude is the greatest
rate of change.
• The gradient of a function F is defined as ∇F = (∂F/∂x1, …, ∂F/∂xn)
Gradient search (Gradient descent)
• Gradient search is based on the observation that if the
real-valued function F(x) is defined and differentiable in a
neighborhood of a point x0, F(x) decreases fastest if one goes
from x0 in the direction of the negative gradient of F(x) at x0:
x1 = x0 - µ∇F(x0), µ: jumping parameter (step size)
• Consider the sequence x0, x1, …, xn, such that:
xn+1 = xn - µn∇F(xn)
• Then we have:
F(x0) ≥ F(x1) ≥ F(x2) ≥ F(x3) ≥ …
• So the sequence converges to a local minimum.
Example of gradient search
• Find the minimal value of the quadratic function F(x) = x².
• Consider the sequence x0, x1, …, xn, such that:
xn+1 = xn - µn∇F(xn)
• ∇F(x) = 2x
• x0 = 1
• µ0 = 0.001
• µn will be increased
• We will reach the optimum at (0, 0)
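A minimal sketch of this example. A fixed step size µ = 0.1 is used instead of the slide's growing schedule, which is an implementation choice; any sufficiently small step converges for this function.

```python
# Gradient descent on F(x) = x^2, so grad F(x) = 2x as on the slide.
def gradient_descent(x0, mu, steps):
    x = x0
    for _ in range(steps):
        x = x - mu * 2.0 * x   # x_{n+1} = x_n - mu * grad F(x_n)
    return x

x = gradient_descent(x0=1.0, mu=0.1, steps=100)
print(x)  # very close to the optimum at 0
```

Each step multiplies x by (1 - 2µ) = 0.8, so the iterates shrink geometrically toward 0.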
Gradient search (Gradient descent)
• Pros & Cons
+ Easy to understand and implement
- The function needs to be defined and differentiable.
Hill climbing
• Candidate is always and only accepted if cost is lower (or
fitness is higher) than current configuration
• Stop when no neighbor with lower cost (higher fitness) can be
found
Hill climbing

Pros and cons:


+ Fast to solutions
- Local optimum as best result
- Local optimum depends on initial configuration
- Generally no upper bound on iteration length
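A hill-climbing sketch on a convex cost; the cost function, step size, and iteration budget are illustrative choices. A random neighbour is accepted only when it strictly lowers the cost, which is exactly why the method can get trapped on multimodal landscapes.

```python
import random

def hill_climb(f, x, step=0.1, iters=1000, seed=0):
    """Accept a random neighbour only if it improves the cost f."""
    rng = random.Random(seed)
    for _ in range(iters):
        cand = x + rng.uniform(-step, step)
        if f(cand) < f(x):      # accept only strictly better neighbours
            x = cand
    return x

# Convex cost with a single minimum at x = 3: here the local
# optimum that hill climbing finds is also the global one.
best = hill_climb(lambda x: (x - 3.0) ** 2, x=0.0)
print(best)  # close to 3.0
```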
How to cope with disadvantages
• Repeat algorithm many times with different initial
configurations
• Use information gathered in previous runs
• Use a more complex Generation Function to jump out of local
optimum
• Use a more complex Evaluation Criterion that accepts
sometimes (randomly) also solutions away from the (local)
optimum
Simulated Annealing
Think of a bouncing ball that explores the cost space.

So instead of using a fitness measure as our OBJECTIVE function to
optimize, we are using a cost measure.

[Figure: a ball bouncing across the cost landscape, settling toward
the optimal cost]
Simulated annealing
• Do sometimes accept candidates with higher cost to escape
from local optimum
• Adapt the parameters of this Evaluation Function during
execution
• Exploits an analogy between the annealing process and the
search for the optimum in a more general system.
Simulated annealing
• Annealing Process
– Raising the temperature up to a very high level (melting temperature, for
example), the atoms have a higher energy state and a high possibility to re-
arrange the crystalline structure.
– Cooling down slowly, the atoms have a lower and lower energy state and a
smaller and smaller possibility to re-arrange the crystalline structure.
• Analogy
– Metal  Problem
– Energy State  Cost Function
– Temperature  Control Parameter
– A completely ordered crystalline structure
 the optimal solution for the problem
Metropolis Criterion
• Let
– X be the current solution and X’ be the new solution
– C(x) (C(x’))be the energy state (cost) of x (x’)
• Probability Paccept = exp [(C(x)-C(x’))/ T]
• Let N=Random(0,1)
• Unconditional accepted if
– C(x’) < C(x), the new solution is better
• Probably accepted if
– C(x’) >= C(x), the new solution is worse . Accepted only
when N < Paccept
Algorithm
Initialize initial solution x , highest temperature Th, and coolest
temperature Tl
T= Th
When the temperature is higher than Tl
While not in equilibrium
Search for the new solution X’
Accept or reject X’ according to Metropolis Criterion
End
Decrease the temperature T
End
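The algorithm above can be sketched as follows. The cost function, temperatures, step size, and geometric cooling factor are illustrative choices; acceptance follows the Metropolis criterion from the previous slide, and the fixed-length inner loop plays the role of the "equilibrium" loop.

```python
import math
import random

def simulated_annealing(cost, x0, t_high, t_low, cooling=0.95,
                        inner=50, step=1.0, seed=0):
    """SA sketch: Metropolis acceptance, geometric cooling T' = T * c."""
    rng = random.Random(seed)
    x, t = x0, t_high
    while t > t_low:
        for _ in range(inner):          # fixed-length "equilibrium" loop
            cand = x + rng.uniform(-step, step)
            delta = cost(cand) - cost(x)
            # Metropolis criterion: always accept improvements; accept
            # worse moves with probability exp(-delta / T).
            if delta < 0 or rng.random() < math.exp(-delta / t):
                x = cand
        t *= cooling                    # decrease the temperature
    return x

# Cost with two minima at x = -2 and x = +2 (illustrative choice):
# at high temperature the search can hop between the two basins.
cost = lambda x: (x * x - 4.0) ** 2
best = simulated_annealing(cost, x0=5.0, t_high=10.0, t_low=0.01)
print(best)  # close to one of the minima at +2 or -2
```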
Control Parameters
• Definition of equilibrium
– Cannot yield any significant improvement after certain number
of loops
– A constant number of loops
• Annealing schedule (i.e. How to reduce the temperature)
– A constant value, T’ = T - Td
– A constant scale factor, T’= T * Rd
• A scale factor usually can achieve better performance
Control Parameters
• Temperature determination
– Artificial, without physical significant
– Initial temperature
• 80-90% acceptance rate
– Final temperature
• A constant value, i.e., based on the total number of
solutions searched
• No improvement during the entire Metropolis loop
• Acceptance rate falling below a given (small) value
– Problem specific and may need to be tuned
Simulated annealing
• Pros & Cons
+ Allow to escape from local optimums
+ Easy to implement
- No guarantee of the global optimum
Genetic algorithms

• Cellular automata
– John Holland, university of Michigan, 1975.
• Until the early 80s, the concept was studied theoretically.
• In 80s, the first “real world” GAs were designed.

(Based on slides of
Popovic, 2001)
Branch-and-Bound Technique
for Solving Integer Programs

Andrii Shalaginov, Carl Stuart Leichter

http://www.ohio.edu/people/melkonia/math4630/slides/bb1.ppt.
Basic Concepts

The basic concept underlying the branch-and-bound technique is to


divide and conquer.

Since the original “large” problem is hard to solve directly, it is


divided into smaller….

…. and smaller subproblems …

…. until these subproblems can be conquered.


Branching, Fathoming and Bounding

The dividing (branching) is done by partitioning the entire set of


feasible solutions into smaller and smaller subsets.

The conquering (fathoming) is done partially by


(i) giving a bound for the best solution in the subset;

(ii) discarding the subset if the bound indicates that the subset can’t
contain an optimal solution

These three basic steps – branching, bounding, and fathoming –


are illustrated on the following example.
Linear Programing Terminology
Example of Branch-and-Bound

Max Z = 5x1 + 8x2
s.t. x1 + x2 ≤ 6
     5x1 + 9x2 ≤ 45
     x1, x2 ≥ 0 integer

[Plot: feasible region bounded by x1 + x2 = 6 and 5x1 + 9x2 = 45;
LP-relaxation optimum at (2.25, 3.75) with Z = 41.25; objective
contour Z = 20 shown; objective slope m = 5/8]
Utilizing the information about the optimal
solution of the LP-relaxation
We have relaxed the constraint that x1 and x2 must be integers,
i.e., we are not restricting ourselves to the IP.
Utilizing the information about the optimal
solution of the LP-relaxation

➢ Fact: OPT(LP-relaxation) ≥ OPT(IP)

(for maximization problems)

That is, the optimal value of the LP-relaxation


is an upper bound for the optimal value of the integer
program.

• In this particular case, 41.25 is an upper bound for OPT(IP).


Branching step
• In an attempt to narrow down the location of the IP’s optimal
solution, we will partition (branch) the feasible region of the LP-
relaxation.

• So, we choose a variable that is fractional in the optimal


solution to the LP-relaxation – say, x2 .

• Then observe that every feasible IP point must have either
x2 ≤ 3 or x2 ≥ 4.
Branching step
• With this in mind, we branch on the variable x2 to create the
following two subproblems:

Subproblem 1              Subproblem 2
Max Z = 5x1 + 8x2         Max Z = 5x1 + 8x2
s.t. x1 + x2 ≤ 6          s.t. x1 + x2 ≤ 6
     5x1 + 9x2 ≤ 45            5x1 + 9x2 ≤ 45
     x2 ≤ 3                    x2 ≥ 4
     x1, x2 ≥ 0                x1, x2 ≥ 0

• Solve both subproblems

(note that the original optimal solution (2.25, 3.75) can’t recur)
Branching step (graphically)

Subproblem 1: Opt. solution (3, 3) with value 39
Subproblem 2: Opt. solution (1.8, 4) with value 41

[Plot: feasible regions of Subproblems 1 and 2, with the contours
Z = 20, Z = 39, and Z = 41]
Start Creating a Solution Tree

[Tree: root "All" (2.25, 3.75), Z=41.25
 → S1: x2 ≤ 3 → (3, 3), Z=39 (integral solution)
 → S2: x2 ≥ 4 → (1.8, 4), Z=41]

For each subproblem, we will record
• the restriction that creates the subproblem
• the optimal LP solution
• the optimal LP value
S1: x2  3
(3, 3)
All
Z=39
(2.25, 3.75)
Z=41.25 S2: x2 ≥ 4
(1.8, 4)
Z=41

The optimal solution for Subproblem S1 is integral: (3, 3).


➢ If further branching on a subproblem will yield no useful
information, then we can fathom (dismiss) further branching of
the subproblem.
In our case, we can fathom Subproblem 1 because its solution is integral.
(it’s the best solution we will find, in that sub-branch
S1: x2  3
(3, 3)
All
Z=39
(2.25, 3.75)
Z=41.25 S2: x2 ≥ 4
(1.8, 4)
Z=41

The optimal solution for Subproblem S1 is integral: (3, 3).

➢ The best integer solution found so far is stored as incumbent.


The value of the incumbent is denoted by Z*.
In our case, the first incumbent is (3, 3), and Z*=39.

➢ Z* is a lower bound for OPT(IP): OPT(IP) ≥ Z* .


In our case, OPT(IP) ≥ 39. The upper bound is 41: OPT(IP)  41.
Recall why the upper bound is 41

[Plot: LP-relaxation optimum (2.25, 3.75), Z = 41.25]
Next Branch Subproblem 2 on x1:

Subproblem 3: New restriction is x1 ≤ 1.
Opt. solution (1, 4.44) with value 40.55

Subproblem 4: New restriction is x1 ≥ 2.
The subproblem is infeasible

[Plot: feasible regions of Subproblems 3 and 4; Z = 40.55 contour]
Solution tree (cont.)

[Tree: All (2.25, 3.75), Z=41.25
 → S1: x2 ≤ 3 → (3, 3) int., Z=39
 → S2: x2 ≥ 4 → (1.8, 4), Z=41
     → S3: x1 ≤ 1 → (1, 4.44), Z=40.55
     → S4: x1 ≥ 2 → infeasible]

➢ If a subproblem is infeasible, then it is fathomed.
In our case, Subproblem 4 is infeasible; fathom it.

➢ The upper bound for OPT(IP) is updated: 39 ≤ OPT(IP) ≤ 40.55.

➢ Next branch Subproblem 3 on x2.
(Note that the branching variable might recur.)
Branch Subproblem 3 on x2:

Subproblem 5: New restriction is x2 ≤ 4.
Feasible region: the segment joining (0, 4) and (1, 4)
Opt. solution (1, 4): Z = 5x1 + 8x2 = 37

Subproblem 6: New restriction is x2 ≥ 5.
Feasible region is just one point: (0, 5)
Opt. solution (0, 5): Z = 5x1 + 8x2 = 40

[Plots: feasible regions of Subproblems 5 and 6]
Final Solution Tree

[Tree: All (2.25, 3.75), Z=41.25
 → S1: x2 ≤ 3 → (3, 3) int., Z=39
 → S2: x2 ≥ 4 → (1.8, 4), Z=41
     → S3: x1 ≤ 1 → (1, 4.44), Z=40.55
         → S5: x2 ≤ 4 → (1, 4) int., Z=37
         → S6: x2 ≥ 5 → (0, 5) int., Z=40
     → S4: x1 ≥ 2 → infeasible]
S1: x2  3 S5: x2  4
int.
All (3, 3) S3: x1  1 (1, 4) int.
(2.25, 3.75) Z=39 Z=37
(1, 4.44)
Z=41.25
S2: x2 ≥ 4 Z=40.55 S6: x2 ≥ 5
(1.8, 4) (0, 5)
int.
Z=41 S4: x1 ≥ 2 Z=40
infeasible

➢ If the optimal value of a subproblem is  Z*, then it is fathomed.


• In our case, Subproblem 5 is fathomed because 37  39 = Z*.
S1: x2  3 S5: x2  4
int.
All (3, 3) S3: x1  1 (1, 4) int.
(2.25, 3.75) Z=39 Z=37
(1, 4.44)
Z=41.25
S2: x2 ≥ 4 Z=40.55 S6: x2 ≥ 5
(1.8, 4) (0, 5)
int.
Z=41 S4: x1 ≥ 2 Z=40
infeasible

➢ If a subproblem has integral optimal solution x*,


and its value > Z*, then x* replaces the current incumbent.

• In our case, Subproblem 6 has integral optimal solution, and its value
40>39=Z*. Thus, (0,5) is the new incumbent x* , and new Z*=40.
S1: x2  3 S5: x2  4
int.
All (3, 3) S3: x1  1 (1, 4) int.
(2.25, 3.75) Z=39 Z=37
(1, 4.44)
Z=41.25
S2: x2 ≥ 4 Z=40.55 S6: x2 ≥ 5
(1.8, 4) (0, 5)
int.
Z=41 S4: x1 ≥ 2 Z=40
infeasible

➢ If there are no unfathomed subproblems left, then the current


incumbent is an optimal solution for (IP).
• In our case, (0, 5) is an optimal solution with optimal value 40.
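The final answer can be cross-checked by brute force over the small integer grid; branch-and-bound reaches the same optimum while pruning most of these points via its bounds.

```python
# Brute-force check of the worked example:
#   max Z = 5*x1 + 8*x2
#   s.t.  x1 + x2 <= 6,  5*x1 + 9*x2 <= 45,  x1, x2 >= 0 integer
best_z, best_pt = None, None
for x1 in range(0, 10):
    for x2 in range(0, 10):
        if x1 + x2 <= 6 and 5 * x1 + 9 * x2 <= 45:
            z = 5 * x1 + 8 * x2
            if best_z is None or z > best_z:
                best_z, best_pt = z, (x1, x2)

print(best_pt, best_z)  # (0, 5) 40
```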
Genetic Algorithms

Andrii Shalaginov, Carl Stuart Leichter

enginfo.ut.ac.ir/keramati/Course%20pages/.../Meta-%20Heuristic%20Algorithms.ppt
Genetic Algorithm Is Not...

Gene coding...
Speaking of Genetic Coding…

http://www.digit.in/general/scientists-create-the-first-living-organism-with-synthetic-alien-dna-20809.html
Genetic Algorithm Is...
… Computer algorithm
That resides on principles of genetics
and evolution

• The main principle of evolution used in GA


is “survival of the fittest”.

• The good solutions survive, while bad ones die.


(mostly)
The History of GA

• Cellular automata
– John Holland, university of Michigan, 1975.
• Until the early 80s, the concept was studied
theoretically.
• In 80s, the first “real world” GAs were designed.
Hill climbing • Multi-climbers • Genetic algorithm

[Cartoon: climbers on a landscape with a local and a global optimum;
one climber at a local peak says "I am at the top", another says
"I am not at the top. My height is better! I will continue"]
GA Concept
• Genetic algorithm (GA) introduces the principle of evolution
and genetics into search among possible solutions
to given problem.

• The idea is to simulate the process in natural systems.

• This is done by the creation, within a machine, of
a population of simulated “individuals”
– who are represented by their “chromosomes”
• A set of character strings (~DNA)
Nature and GA...

Nature       Genetic algorithm
Chromosome   String
Gene         Character
Locus        String position
Genotype     Population
Phenotype    Decoded structure
Algorithmic Phases

Initialize the population
↓
Select individuals for the mating pool
↓
Perform crossover
↓
Perform mutation
↓
Insert offspring into the population
↓
Stop? If no: loop back to selection; if yes: The End
Designing GA Requires Answers to:

⚫ How to represent genomes?


⚫ How to define the crossover operator?
⚫ Sexual Reproduction
⚫ How to define the mutation operator?
⚫ How to define fitness function?
⚫ How to generate next generation?
⚫ How to define stopping criteria?
Representing Genomes...

Representation     Example
string             1 0 1 1 1 0 0 1
array of strings   http | avala | yubc | net | ~apopovic
tree               (genetic programming), e.g. (a xor b) > c
Crossover
• Crossover is concept from genetics.
• Crossover is sexual reproduction.
• Crossover combines genetic material from
two parents,
in order to produce superior offspring.
• Few types of crossover:
– One-point
– Multiple point.

http://findwallpapershd.com/rabbit/tiger-rabbit-wallpaper/
One-point Crossover

Parent #1: 0 1 2 3 4 5 6 7
Parent #2: 7 6 5 4 3 2 1 0

[Figure: genes are exchanged between the two parents at the
crossover point, producing two offspring]
Mutation
• Mutation introduces randomness into the
population.
• Mutation is asexual reproduction.
• The idea of mutation
is to reintroduce divergence
into a converging population.
• Mutation is performed
on small part of population,
in order to avoid entering unstable state.
Mutation...

Parent: 1 1 0 1 0 0 0 1
Child:  0 1 0 1 0 1 0 1   (mutated bits flipped)
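The two operators can be sketched as follows; this version swaps the whole tail at the cut point, which is one common form of one-point crossover, and flips each bit independently with a small mutation rate. The genomes are illustrative.

```python
import random

def one_point_crossover(p1, p2, point):
    """Swap the tails of two parent genomes at a fixed cut point."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(genome, rate, rng):
    """Flip each bit independently with a small probability."""
    return [1 - g if rng.random() < rate else g for g in genome]

# Integer genomes, cut after the first gene:
c1, c2 = one_point_crossover([1, 2, 3, 4], [5, 6, 7, 8], point=1)
print(c1, c2)  # [1, 6, 7, 8] [5, 2, 3, 4]

# Bit-string genome with a 10% per-bit mutation rate:
rng = random.Random(0)
child = mutate([1, 1, 0, 1, 0, 0, 0, 1], rate=0.1, rng=rng)
print(child)
```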
Setting the Probabilities...

• Average probability for individual to crossover


is, in most cases, about 80%.

• Average probability for individual to mutate


is about 1-2%.

• The better solutions reproduce more often.


Fitness Function
• Fitness function is the OBJECTIVE function,
that determines what solutions are better than others.

• Fitness is computed for each individual.

• Fitness function is application dependent.


– It depends upon your application (IDUYA)
Artificial Selection
• The selection operation probabilistically selects specific
individuals, based on fitness,
– These are the next generation of the population
– Those that are not selected, are culled.

• There are few possible ways to implement selection:


– “Only the strongest survive”
• Choose the individuals with the highest fitness
for next generation

– “Some weak solutions survive”


• Assign a probability that a particular individual
will be selected for the next generation
• More diversity
• Why?
– Some bad solutions might have good parts!
Selection - Survival of The Strongest
Fitness Measurements
Previous generation

0.93 0.51 0.72 0.31 0.12 0.64

Next generation
0.93 0.72 0.64
Selection - Some Weak Solutions Survive
Previous generation

0.93 0.51 0.72 0.31 0.12 0.64

Next generation
0.93 0.72 0.64 0.12
Stopping Criteria
• Final problem is to decide when to stop execution of algorithm.
• There are two possible solutions to this problem:

– First approach:
• Stop after production
of definite number of generations

– Second approach:
• Stop when the improvement in average fitness
over two generations is below a threshold
GA Vs. Ad-hoc Algorithms

                Genetic Algorithm   Ad-hoc Algorithms
Speed           Slow *              Generally fast
Human work      Minimal             Long and exhaustive
Applicability   General             There are problems that cannot
                                    be solved analytically
Performance     Excellent           Depends

* Not necessarily!

Ad-hoc = conventional algorithm development based on
prior knowledge of the problem, its math, logic
Problem With GAs
• Sometimes GA is extremely slow…..

……but….
Advantages of GAs

• Concept is easy to understand.


• Minimum human involvement.
• The computer is not taught how to use an existing solution,
but how to find a new solution!
• Modular, separate from application
• Supports multi-objective optimization
• Always an answer; answer gets better with time !!!
• Inherently parallel; easily distributed
• Many ways to speed up and improve a GA-based
application as knowledge about problem domain is
gained
• Easy to exploit previous or alternate solutions
Advantages of GAs
• Concept is easy to understand.

• Minimum human involvement.

• The computer is not taught how to use an existing solution,

but how to find a new solution!

• Modular, separate from application

• Supports multi-objective optimization


Advantages of GAs

• Inherently parallel; easily distributed


– Not so slow, after all

• Many ways to speed up and improve a GA-based application


as knowledge about problem domain is gained

• Easy to exploit previous or alternate solutions

• Always have an answer; answer gets better with time !!!


– BUT: No guarantee that it’s a good answer or it will improve enough
GA: An Example - Diophantine Equations

Diophantine equation (n=4):

a*x + b*y + c*z + d*q = s

For given a, b, c, d, and s, find x, y, z, q

Genome: (x, y, z, q)
GA:An Example - Diophantine Equations(2)
(Using easy to track #s)

Crossover: ( 1, 2, 3, 4 ) × ( 5, 6, 7, 8 ) → ( 1, 6, 3, 4 ) and ( 5, 2, 7, 8 )

Mutation: ( 1, 2, 3, 4 ) → ( 1, 2, 3, 9 )
Diophantine Equations(3)
• First generation is usually randomly generated of numbers
lower than sum (s).

• Fitness is defined as absolute value of difference


between total and given sum (s):

Fitness = abs ( total - sum ) ,

• The algorithm enters a loop in which the operators are performed
on genomes: crossover, mutation, selection.

• After a number of generations, an optimal solution is reached.
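The loop above can be sketched end to end. The coefficients, target sum, population size, and crossover/mutation rates below are illustrative choices; fitness is the absolute difference from the target, as on the slide, so fitness 0 means an exact solution.

```python
import random

# GA sketch for a Diophantine equation a*x + b*y + c*z + d*q = s
# (coefficients and target are illustrative).
A, B, C, D, S = 1, 2, 3, 4, 30

def fitness(g):
    """abs(total - sum): lower is better, 0 is an exact solution."""
    return abs(A * g[0] + B * g[1] + C * g[2] + D * g[3] - S)

def solve(pop_size=40, generations=200, seed=1):
    rng = random.Random(seed)
    # First generation: random numbers not exceeding the target sum.
    pop = [[rng.randint(0, S) for _ in range(4)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        if fitness(pop[0]) == 0:
            break
        parents = pop[:pop_size // 2]            # survival of the fittest
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randint(1, 3)
            child = p1[:cut] + p2[cut:]          # one-point crossover
            if rng.random() < 0.1:               # occasional mutation
                child[rng.randrange(4)] = rng.randint(0, S)
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)

best = solve()
print(best, fitness(best))
```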


Some Applications of GAs

[Diagram: GA linked to its application areas: control systems design,
software-guided circuit design, optimization, Internet search,
path finding, mobile robots, data mining, trend spotting, and
stock price prediction]
Lecture 3: Learning as a Search

IMT4133 – Data Science for Forensics and Security


Andrii Shalaginov andrii.shalaginov@ntnu.no
NTNU i Gjøvik
Week 7: 17.02.2022
Reading: Chapter 4&5 - Kononenko
Norwegian University of Science and Technology
Content of the lecture

• Administrative announcements
• Learning as a Search
• Genetic Algorithm
• Branch and Bound
• DFS / BFS

Norwegian University of Science and Technology 2


The textbook

- Igor Kononenko and Matjaz Kukar


“Machine Learning and Data Mining”
ISBN 1-904275-21-4, 2007

• Check with the library

• Good theoretical foundations

Norwegian University of Science and Technology 3


Semester Plan (1)
Week 3 (20.01.2022) Lecture 1: (Kononenko 1,2; Chio 1) Introduction to the team / Data
Analysis / ML methods / Artificial Intelligence / Big Data / Data Analytics problems in Digital
Forensics and Information Security / Computational Forensics

Week 4 (27.01.2022) Tutorial 1: Data Analysis; Learning and Intelligence

Week 5 (03.02.2022) Lecture 2: (Kononenko 3; Chio 2): ML Basics; Hybrid Intelligence;


Performance Evaluation

Week 6 (10.02.2022) Tutorial 2: Machine Learning Basics

Week 7 (17.02.2022) Lecture 3: (Kononenko 4,5) Knowledge Representation; Learning as


Search

Week 8 (24.02.2022) Tutorial 3: Learning as Search; Knowledge Representation

Week 9 (03.03.2022) Lecture 4: (Kononenko 6,7) Attribute Quality Measures; PCA; LDA;
Feature Selection

Norwegian University of Science and Technology 4


Semester Plan (2)
Week 10 (10.03.2022) Tutorial 4: Attribute Quality Measures. Data Pre-processing

Week 11 (17.03.2022) Lecture 5: (Kononenko 9,10) Symbolic & Statistical learning; Visualization

Week 12 (24.03.2022) Tutorial 5: Symbolic and Statistical learning

Week 13 (31.03.2022) Lecture 6: (Kononenko 11*; Chio 2) Artificial Neural Networks; Deep
Learning; Support Vector Machines

Week 14 (07.04.2022) Tutorial 6: Support Vector Machine & Artificial Neural Network

Week 15 Påske/Easter

Week 16 (21.04.2022) Lecture 7: (Kononenko 12; Chio 2) Unsupervised Learning; Cluster
Analysis

Week 17 (28.04.2022) Tutorial 7: Cluster Analysis

Week 18 (05.05.2022) Guest lecture / MOCK exam Preparation for the exam; Q & A

Norwegian University of Science and Technology 5


Recommended reading
• Chio, Clarence, and David
Freeman. Machine Learning and
Security: Protecting Systems
with Data and Algorithms.
O'Reilly Media, Inc., 2018.

• http://noracook.io/Books/MachineLearning/machinelearningandsecurity.pdf

• Practical point of view

Norwegian University of Science and Technology 6


Thank you for your attention!
Andrii Shalaginov

Department of Information Security and Communication Technology


Faculty of Information Technology and Electrical Engineering
Norwegian University of Science and Technology
andrii.shalaginov@ntnu.no

Norwegian University of Science and Technology


Today’s Agenda

• Basic Ideas of Machine Learning


• Concluding Items:
– What about [INSERT BUZZWORD HERE]

2
ML vs AI

• Machine Learning (ML) is good at learning specific tasks.
• AI uses various components of machine learning to make
decisions and automate control in some smart / data-driven way.
• ML is really just sort of a toolbox.

[Diagram: ML shown as a subset inside AI]
Machine Learning
Warnings

4
Machine Learning Warnings

• Machine learning is not magic.


– Machine learning will not solve all your problems.
– But it is good at solving specific tasks/problems.
Example is finding optimal values.

5
Things ML is good at

• Great success in image classification. Given some


image, is it dog or a cat?

• Sounds easy? What if you have 1000000 images


to look through?

6
Things ML is good at

• Good at finding (some) patterns in data.

7
Machine Learning Warnings

• Machine learning is mostly just statistics, linear


algebra, and some rule of thumb algorithms.

8
Machine Learning Warnings

• Results from your machine learning will suck if


your data sucks.
– Garbage in, garbage out.

9
Machine Learning Warnings

• Many machine learning programs accomplish the


same task, just in different ways.
– Support Vector Machine and Artificial Neural Networks
and Naïve Bayes, oh my!
• It’s like getting hung up on the differences between
dish washers. The point is to get clean dishes.

10
Machine Learning Warnings

• There is a lot of HYPE about machine learning.


– Therefore, be careful. Silver bullet still doesn’t exist.

11
How Machine
Learning Works
(using an example)

12
(Basically) How Machine Learning
Works
• We feed data to an algorithm.
• This algorithm constructs (trains) a model.
• If we can test the model, we do so.
• We apply the model.

TRAINING DATA → ALGORITHM → MODEL → Apply it!


13
Machine Learning Models?

• A model is just a compact representation of your


data, that hopefully represents reality.
– Maybe the model is an equation.
– Maybe the model is a graph
• The model should provide functionality.

14
Our Example

Goal: Only using the weight and horsepower of a car,


make a model that can predict (classify) if a car has
good or bad gas mileage.

Good/Bad
MODEL MPG?

15
Our Data

• We assume we have these 9 cars for training data.


– The cars have labels of good or bad gas mileage.
– We know their features: weight and horsepower

16
The Model (what we want)

• Here is a finished model.


– We needed to choose a model and an algorithm to make
it.
• As you see, models do not really have any
“intelligence”.

17
Training a Model

• Train model with data.


– To make a good model (for it to generalize well) you need
lots of samples.

Machine Learning
Training Algorithm

18
What is training?

• Training is mostly just adjusting values in your


model until it works well.
– Guided by trying to minimize/maximize some
performance equation.
• It is a little bit like tuning a radio. If you move the
dial and you hear more static, you turn it the other
direction. You do this until the audio output is
“good enough”.

19
Applying the Model!

• The outcome is hopefully a good representation of


reality.
– Once the model is trained, we can test unseen samples
or ensure the quality of the model.
– For new cars, we only know the weight and horsepower.

New Car, but


we don’t know
mileage

20
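A deliberately tiny sketch of this whole pipeline. The nine training cars and their numbers are invented, and "training" here is just a grid search for a weight threshold that minimises errors on the labelled data; horsepower is carried along but unused in this toy model.

```python
# (weight_kg, horsepower, good_mpg?) -- made-up labelled training cars.
train = [
    (900, 60, True), (1000, 70, True), (1100, 75, True), (1200, 90, True),
    (1300, 100, True), (1500, 120, False), (1600, 140, False),
    (1700, 150, False), (1800, 160, False),
]

def errors(threshold):
    """Misclassifications if we predict good mileage for w < threshold."""
    return sum((w < threshold) != label for w, _, label in train)

# "Training": adjust the model parameter until it fits the data best.
best_t = min(range(800, 2000, 50), key=errors)

def predict(weight, horsepower):
    """Apply the trained model to a new, unlabelled car."""
    return weight < best_t       # True = good mileage predicted

print(best_t, predict(1000, 70), predict(1750, 155))
```

Just as the slides say, the resulting "model" has no intelligence: it is a single learned number plus a comparison.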
A note about classifying or
predicting
• Usually, we are considering whether samples are
positive or negative.
– Think in terms of medical practice. You may test positive
for some condition.
• In our case, a sample is positive if it has (or we
predict it has) good gas mileage.

21
False Positives

• But the models we make aren’t always so peachy.

Fancy new
police car

????

• But when we test the gas mileage, it turns out our


model was wrong! It has BAD mileage, and this is
an example of a false positive.
– i.e., a false alarm.
False Negatives

• Another kind of error may occur as well.

Fancy new
police car 2 ????

• But when we test the gas mileage, it turns out our


model was wrong! It has GOOD mileage, and this
is an example of a false negative.
– i.e., we missed it!
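Counting the four outcome types is one comparison each; the actual/predicted labels below are made up, with "positive" meaning predicted good gas mileage, as in the slides.

```python
# Made-up ground truth and model output for six test cars.
actual    = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]

tp = sum(a and p for a, p in zip(actual, predicted))           # hits
fp = sum((not a) and p for a, p in zip(actual, predicted))     # false alarms
fn = sum(a and (not p) for a, p in zip(actual, predicted))     # misses
tn = sum((not a) and (not p) for a, p in zip(actual, predicted))

print(tp, fp, fn, tn)  # 2 1 1 2
```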
Do I really need a model for this?

• Why not just closely examine all the cars?


– There might be a lot, and you might not have that much
time for close examination.
• Another similar example.
– You have a conveyor belt of randomly caught fish (100s
pass by every minute!) and need to categorize them.

Trained Machine
Salmon or Cod?
Learning Model

24
Machine Learning Basics Summary

• A machine learning model is a compact


representation of your data, that hopefully
represents reality.
– It should provide some function (like classifying or
predicting).

25
Machine Learning Basics Summary

• You will need lots of training samples to make a good model.
– This helps generalize the model. The more samples it sees, the better it “understands” your data.

26
Machine Learning Basics Summary

• You should pick good measurements (features) to train your model.
– Were weight and horsepower really so good at predicting good or bad MPG? What about users? What about other engine details?

27
Machine Learning Basics Summary

• Even if your model is good, there will always be errors.
– Anybody who says their model is 100% accurate is probably 100% full of it.

28
We have only spoken about
classification so far!
• There are 3 Main Functions:
– Classification – for this new sample, which category does
it belong to?
– Regression/Prediction – Can I use my data to make
guesses about unseen data?
– Clustering – what are the groupings in my data?
• And how can it help me explore my data?
• There are more functionalities that can be
executed with machine learning as well.

29
Conclusion

• Basically, think about which of these functions would make certain tasks you do easier.
– Avoid this: “if all you have is a hammer, everything looks like a nail”

30
Conclusion: What about [INSERT
BUZZWORD HERE]

• What about Deep Learning?
– It is really just a (very successful) method/model to implement some of these concepts (classification, regression, clustering).
– In a nutshell, these are universal function approximators.
• What about Big Data?
– This is ideally the point of applying these methods.
– BE CAREFUL: some algorithms take a LONG time to build models.
• What about AI?
– ‘AI is very, very stupid,’ says Google’s AI leader (2018)

https://www.cnet.com/news/ai-is-very-stupid-says-google-ai-leader-compared-to-humans/
31
Thanks for your attention!

• Questions?

32
Written By
Presenter:
Hai Nguyen
Norwegian Information Security Laboratory
Gjøvik University College, Norway

Edited and Presented By: Carl Leichter


Part I:
• Introduction to Feature Selection Methods

Part II:
• Application of Feature Selection for Intrusion Detection
 Motivation from an application
 Feature selection problem
 Feature selection methods
 Search strategies
 Challenges
Raw traffic: …0101100101001001010010100101001010101010…

Features: (Duration, protocol_type, service, flag, src_bytes, dst_bytes, …)

A sample: (100, tcp, telnet, SF, 1000, 1000, …) → DoS attack

Are all extracted features important for detecting attacks?


Web attack detection – the process of identifying activities which try to compromise the confidentiality, integrity or availability of Web applications.
How can we detect web attacks?

HTTP traffic: http://www.google.es/#

Feature extraction

Features: ( length_of_path, length_of_arg, number_of_special_chars )

Observations:
F1 F2 F3 F4 F5 Classes
1000 100 100 0 0 XSS
1000 10 20 0 0 SQL-inject
1000 1 1 0 1 Buffer-overf
1000 1 1 100 100 LDAP-inject
1000 1 1 0 0 Normal
 What is feature selection?

• The process of removing irrelevant and redundant features from the data to improve the performance of a predictor (classifier).

 Why is feature selection important?

• To improve the performance of the predictor (classifier) (time, resources, …)
• To better understand the domain

 How do we define irrelevant and redundant features?


 Filter Model (FM):
 Ranks features or feature subsets independently of the classifier (predictor)

 Wrapper Model (WM):


 Uses a classifier to assess feature or feature subsets

 Embedded Model: a combination of FM and WM


 Feature Ranking:
• Ranks all features

 Feature Subset Selection:


• Ranks all possible subsets of features
 Information gain
 Distance measures
 Minimum description length (MDL)
 J-measure
 Gini-index
 Statistical tests: Chi-square, F-test
 … and so on
 Entropy is a measure of the uncertainty associated with X:
H(X) = − Σ_x p(x) log p(x)

 Mutual information (MI) is a measure of the mutual dependence between feature X and target Y:
I(X;Y) = Σ_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]

 Feature selection by using mutual information:
• Using mutual information as a measure of relevance
• Calculating scores for all features
• Selecting the features with the highest scores.
 Feature ranking is fast, with complexity linear in the number of features
 But feature ranking may fail:

Guyon-Elisseeff, JMLR 2004; Springer 2006

 We need to consider feature subsets!


 We need to consider feature subsets:
• Relevance
• Redundancy
 Some measures:
• Correlation feature selection (CFS)
• The consistency feature selection
• The mRMR measure
• … so on
 Considering all feature subsets has exponential complexity, O(2^n), in general
 The relevance of a feature subset S for the target Y:
V_S = (1/|S|) Σ_{x_i ∈ S} I(x_i; Y)

 The redundancy of all features in the set S:
W_S = (1/|S|²) Σ_{x_i, x_j ∈ S} I(x_i; x_j)

 The mRMR (minimal-redundancy–maximal-relevance) measure is defined as follows:
mRMR = max_S ( V_S − W_S )

 Using the performance of the classifier as a measure for ranking feature subsets:

All features → multiple feature subsets → Predictor

 The best feature subset is the subset with which the Predictor gives the best result.
Filter model:
 Fast
 Does not always fit the classifier

Wrapper model:
 Slow
 Always fits the classifier

 Neither model is free of drawbacks, BUT the embedded model sits between the filter and wrapper models.

 The simplest way is:
• Remove some irrelevant features by using the filter model.
• Apply the wrapper model to the rest.
Optimal search:
 Exhaustive search
 Branch and Bound algorithm

Heuristic search:
 Genetic search
 Greedy search
 Beam search
 Floating search
 Forward selection
 Backward elimination
 … and so on
 More effective search methods
 More effective feature-selection measures for high-dimensional data
 Explanation:
• Why does the filter model work?
• How well does a method help a classifier (predictor) in terms of accuracy measures?
 Since there are plenty of feature-selection measures:
• How can we generalize them into a few main measures?
Hai Thanh Nguyen, Carmen Torrano-Gimenez, Gonzalo Alvarez, Slobodan Petrovic, and Katrin Franke

Norwegian Information Security Laboratory
Gjøvik University College, Norway

Instituto de Fisica Aplicada, Consejo Superior de Investigaciones Cientificas

1
 Motivation:
• Web attack detection
• Feature selection for Web attack detection
 Generic Feature Selection (GeFS) measure
• The CFS and the mRMR measures
• Optimizing the GeFS measure
• New feature-selection method: Opt-GeFS
 Experimental results
• CSIC 2010 HTTP dataset
• ECML-PKDD 2007 HTTP dataset
 Conclusions

2
Web attack detection – the process of identifying activities which try to compromise the confidentiality, integrity or availability of Web applications.
How can we detect web attacks?

HTTP traffic: http://www.google.es/#sclient=psy&hl=es&sourc….

Feature extraction

Features: ( length_of_path, length_of_arg, number_of_special_chars, … )

Observations:
F1 F2 F3 F4 F5 … Classes
1000 100 100 0 0 … XSS
1000 10 20 0 0 … SQL-inject
1000 1 1 0 1 … Buffer-overf
1000 1 1 100 100 … LDAP-inject
1000 1 1 0 0 … Normal
3
 Relevance – not all features are relevant for detecting attacks:
 Feature F1 is irrelevant for detecting attacks

 Redundancy – not all relevant features are necessary for detecting attacks:
• Feature F5 is redundant for detecting the LDAP-injection attack, since feature F4 is enough

F1 F2 F3 F4 F5 … Classes
1000 100 100 0 0 … XSS
1000 10 20 0 0 … SQL-injection
1000 1 1 0 1 … Buffer-overflow
1000 1 1 100 100 … LDAP-injection
1000 1 1 0 0 … Normal

How to measure Relevance and Redundancy?

4
 Correlation indicates the linear relationship between two random variables:
ρ_XY = cov(X, Y) / (σ_X σ_Y)
5
 Mutual information measures the mutual dependence (non-linear relationship) of two random variables:
I(X;Y) = Σ_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]

6
 M. Hall proposed the correlation feature-selection (CFS) measure.
Given a feature subset S with k features, there is a score:

Merit_S = k · r̄_cf / √( k + k(k−1) · r̄_ff )

where r̄_cf is the average class–feature correlation and r̄_ff is the average feature–feature correlation.

 Feature selection by means of the CFS measure: choose the subset S that maximizes Merit_S.
7
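Hall's merit can be sketched for continuous data using absolute Pearson correlations (a real CFS implementation also searches over subsets; the data here are synthetic and the assignment of "relevant"/"noise" columns is our own construction):

```python
import numpy as np

def cfs_merit(X, y):
    """CFS merit: Merit_S = k * r_cf / sqrt(k + k*(k-1) * r_ff),
    with r_cf the mean |feature-class correlation| and r_ff the mean
    |feature-feature correlation| (continuous-data sketch with Pearson r)."""
    k = X.shape[1]
    r_cf = np.mean([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(k)])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                    for i in range(k) for j in range(i + 1, k)])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

rng = np.random.default_rng(0)
y = rng.normal(size=200)
relevant = y + 0.1 * rng.normal(size=200)   # strongly correlated with the class
noise = rng.normal(size=200)                # irrelevant feature

m_single = cfs_merit(np.column_stack([relevant]), y)
m_dup = cfs_merit(np.column_stack([relevant, relevant]), y)   # pure redundancy
m_noisy = cfs_merit(np.column_stack([relevant, noise]), y)    # added irrelevance

print(m_single, m_dup, m_noisy)
```

Duplicating a feature brings no gain in merit (the r̄_ff term cancels it), and adding an irrelevant feature lowers the merit, which is exactly the relevance/redundancy trade-off CFS encodes.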
 In 2005, Peng et al. proposed a feature selection method using mutual information.

 In terms of mutual information, the relevance of a feature set S for the class c is defined as follows:
V_S = (1/|S|) Σ_{x_i ∈ S} I(x_i; c)

 The redundancy of all features in the set S is defined as follows:
W_S = (1/|S|²) Σ_{x_i, x_j ∈ S} I(x_i; x_j)

 The feature selection measure (mRMR measure) is a combination of the two measures given above:
mRMR = max_S ( V_S − W_S )
8
 Definition 1: A generic-feature-selection measure for intrusion detection is a
function GeFS(x), which has the following form:

 Definition 2: The feature selection problem by means of the generic-feature-


selection measure is to find x that maximizes the function GeFS(x):

 Proposition: The CFS and mRMR measures are instances of the GeFS measure.

9
Exhaustive search:
 Globally optimal feature subsets
 Slow, with exponential complexity O(2^n)

Heuristic search:
 Locally optimal feature subsets
 Faster than exhaustive search

Can we find globally optimal feature subsets without exhaustive search?

 The answer is Yes.
10
 Proposition: The feature selection problem max_x GeFS(x) is a polynomial 0-1 fractional programming problem (P01FP)

 The solution of the optimization problem P01FP indicates the features in the best feature subset S.
11
Chang’s method for solving P01FP:
 Linearizes the P01FP problem into a mixed 0-1 linear programming (M01LP) problem.
 The number of variables & constraints:
 Branch and Bound algorithm.

Our method for solving P01FP:
 Differently linearizes the P01FP problem into a mixed 0-1 linear programming (M01LP) problem.
 The number of variables & constraints:
 Branch and Bound algorithm.

12
Chang’s method:
 Proposition 1: A polynomial mixed 0-1 term from (7) can be represented by the following program [9]:
 Proposition 2: A polynomial mixed 0-1 term from (8) can be represented by a continuous variable, subject to the following linear inequalities [9]:

Our method:
 Proposition 4: A polynomial mixed 0-1 term from (12) can be represented by the following program:
 Proposition 5: A polynomial mixed 0-1 term from (8) can be represented by a continuous variable, subject to the following linear inequalities:

13
 Step 0: Analyze statistical properties of datasets before
choosing GeFS_CFS OR GeFS_mRMR.

 Step 1: Calculate all the parameters, such as correlation or mutual information coefficients.

 Step 2: Construct the optimization problem of GeFS from the parameters calculated above.

M01LP
 Step 3: Transform the optimization problem of GeFS to
a mixed 0-1 linear programming (M01LP) problem,
which can be solved by the branch-and-bound algorithm.
Branch & Bound
Opt-GeFS
14
 Objective: Apply the generic-feature-selection (GeFS) measure for Web attack detection.
 DARPA Benchmarking dataset for Intrusion Detection Systems:
• Out of date: 1998
• Does not include many actual Web attacks

 ECML-PKDD 2007 HTTP dataset


• The dataset was generated for ECML-PKDD 2007 Discovery Challenge.
• The European Conference on Machine Learning and Principles and Practice of
Knowledge Discovery in Databases (ECML- PKDD)

 Our own generated CSIC 2010 HTTP dataset.

15
 Dataset description:
• Traffic targeted to a real-world web application: E-commerce web application.
• 36000 normal requests.
• 25000 anomalous requests (SQL injection, buffer overflow, XSS, etc.)

 Feature extraction: 30 features.

 Feature selection by means of GeFS measure


• Analyze the statistical properties of the dataset

• Select the GeFS_CFS measure to select features, instead of the GeFS_mRMR.

16
[Bar chart – number of selected features (on average): Full-set 30; the GeFS_CFS and GeFS_mRMR subsets contain roughly 11 and 14 features.]

GeFS: Generic Feature Selection

17
[Bar chart – classification accuracies (on average) for C4.5, CART, RandomTree and RandomForest: approximately 93.65% on the full set and 93.53% with GeFS_CFS, dropping to about 75.67% with GeFS_mRMR.]

GeFS: Generic Feature Selection

18
[Bar chart – false positive rate (on average) for C4.5, CART, RandomTree and RandomForest: approximately 6.9% on the full set and 7.1% with GeFS_CFS, rising to about 28% with GeFS_mRMR.]

GeFS: Generic Feature Selection

19
 Dataset description:
• 40,000 normal requests
• 10,000 attacks (Cross-Site Scripting, SQL Injection, LDAP Injection, etc.)

 Feature extraction: 30 features.

 Feature selection by means of GeFS measure


• Analyze the statistical properties of the dataset

• Select the GeFS_mRMR measure to select features, instead of the GeFS_CFS.

20
[Bar chart – number of selected features (on average): Full-set 30; the GeFS-selected subsets contain only about 2 and 6 features.]

GeFS: Generic Feature Selection

21
[Bar chart – classification accuracies (on average) for C4.5, CART, RandomTree and RandomForest: approximately 97.04% on the full set, 86.42% with GeFS_CFS and 92.93% with GeFS_mRMR.]

GeFS: Generic Feature Selection

22
[Bar chart – false positive rate (on average) for C4.5, CART, RandomTree and RandomForest: approximately 2.95% on the full set, 17.6% with GeFS_CFS and 7.8% with GeFS_mRMR.]

GeFS: Generic Feature Selection

23
 Feature selection is important for Web attack detection.

 The generic-feature-selection method improves the effectiveness of Web attack detection systems. Depending on the statistical properties of a dataset, an appropriate instance of the GeFS measure should be chosen.

 The proposed feature-selection method (Opt-GeFS) ensures globally optimal feature subsets.

 The Opt-GeFS method is domain-independent.

24
Questions?

25
The Nature of Data Itself and
Our Models of the World

and Some Considerations on


Covariance, Correlation, PCA

Andrii Shalaginov, Carl Leichter, Katrin Franke


Department of Information Security and Communication
Technology
NTNU i Gjøvik
Norway

1
Data Acquisition and Feature Extraction

• Human perceptual feature space is limited
– Tactile (touch: smooth, wet, warm)
• Callouses
• Irritability
– Sonic vibrations (20 Hz – 20 kHz)
• Hearing loss
– Chemical (taste and smell)
• Head cold
– Electromagnetic
• λ ≈ 380 nm – 750 nm (violet to red)

2
We Perceive a Narrow Slice of Reality

• Snakes can see infrared


• Pigeons sense earth’s magnetic field
• Sharks can sense electric fields

3
Projection of Real World onto Sensory
Space
• In our minds, we build a model of the real world to account for our experiences (data in our sensory feature space).

• These models may be called “Intelligibility Strategies”*: they are the framework we use to guide our actions.

• In Science and Engineering, we build formal models of the real world to account for our data.

• These models are called “Theories”: they are the framework we use to guide our experiments and applications.

4
Projection of Real World onto a Sensor
Space
• Is the origin of all Data Spaces
• Scientific models are built to account for the data;
they make testable predictions/estimates of the Real
World
• Rigorously tested models lead to useful applications
• Research is the process of exploring the boundaries
of our models:
– Extend the model’s boundaries
– Replace the model completely.
• “Paradigm Shift” Thomas Kuhn “The Structure of Scientific
Revolutions”

5
How is Any of This Relevant?!

• Why are we here?
– To explore/extend the boundaries of knowledge and applications
– We must understand the tools we use: their limits and drawbacks

6
Linear Mixture Models

7
Linear Mixture Models

[Diagram: a single source s1 observed at a single sensor x1.]

8
[Diagram: source s1 observed at two sensors, x1 and x2.]

9
[Diagram: source s1 coupled to sensors x1 and x2 via coupling weights a11 and a12, where a11 > a12.]

10
Many of The Data we Collect Come From
Linear Mixtures
“A linear mixed model (LMM) is a parametric linear model
for clustered, longitudinal, or repeated-measures data that
quantifies the relationships between a continuous
dependent variable and various predictor variables.”

LMMs aren’t just for signals

West, Brady T., Kathleen B. Welch, and Andrzej T. Galecki.


Linear mixed models: a practical guide using statistical
software. CRC Press, 2014.

11
[Diagram: two sources s1 and s2 mixed into three sensors x1, x2 and x3 through coupling weights a11 … a23.]
12
[Diagram repeated: the same two-source, three-sensor mixture.]
13
A Less Messy Representation: the matrix form As = x

14
What Relationships Can We Find in the Data Space?
• Covariance
• Correlation
• Etc.

What Relationships Can We Find in the Feature Space?
• Covariance
• Correlation
• Etc.

15
A Vector Space Perspective
of Projection

Vector a is projected onto vector b.

Decreasing the angle of separation will increase vector a’s projection onto vector b.
16
Projections, Inner Products and
Statistics

17
This “dot product” between a and b is also called the “inner product” and can be evaluated as abᵀ (for row vectors a and b).

So we can understand correlation as a complementary measurement of the orthogonality between two vectors, two signals, or time-series data streams.
18
From the perspective of linear algebra, completely uncorrelated vectors, signals or time-series data streams are orthogonal to each other. Their correlation coefficient resolves to zero, as does their inner product.

If feature vectors a and b were uncorrelated, then they would be orthogonal to each other.
19
Inner Product, Magnitude, Variance, Etc

20
Inner Product and Covariance

• Inner Product: ⟨x, y⟩ = Σᵢ xᵢ yᵢ

• Covariance: cov(x, y) = (1/(n−1)) Σᵢ (xᵢ − x̄)(yᵢ − ȳ)
21
Inner Product, Magnitude and Variance
• Inner Product: ⟨x, x⟩ = Σᵢ xᵢ²

• Magnitude: ‖x‖ = √⟨x, x⟩

• Variance: var(x) = (1/(n−1)) Σᵢ (xᵢ − x̄)²
22
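The parallel between inner products and (co)variance can be checked numerically; a small NumPy sketch with arbitrary example values:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 4.0])
xc, yc = x - x.mean(), y - y.mean()   # center both variables

n = len(x)
var_x = (xc @ xc) / (n - 1)           # variance = scaled inner product <xc, xc>
cov_xy = (xc @ yc) / (n - 1)          # covariance = scaled inner product <xc, yc>
mag_xc = np.sqrt(xc @ xc)             # magnitude of the centered vector

print(var_x, cov_xy, mag_xc)
```

The results match NumPy's own `np.var(x, ddof=1)` and `np.cov(x, y)[0, 1]`, confirming that variance and covariance are just scaled inner products of centered vectors.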
Standard Deviation and Magnitude

• Standard deviation: σₓ = √var(x) – up to the 1/√(n−1) scaling, the magnitude of the mean-centered vector
23
Correlation and Covariance

Covariance in terms of correlation: cov(x, y) = ρₓᵧ σₓ σᵧ

Covariance can be arbitrarily large, while the correlation coefficient is always bounded: −1 ≤ ρₓᵧ ≤ 1.
25
Correlation and Covariance

• A Standardized Data Covariance Matrix is a Correlation


Matrix

26
Correlation

• When ρxy is the correlation coefficient between x and y:
– ρxy is sometimes called the Pearson Correlation Coefficient
– It is analogous to the inner product of two unit vectors
– It is also analogous to the cosine of the angle between two vectors
• This angle is independent of the vector magnitudes
– How does “cosine similarity” differ from correlation?
– For zero-mean (centered) data, cosine similarity is the same as ρxy
27
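The last point is easy to verify numerically: centering the data first makes cosine similarity coincide with the Pearson coefficient (arbitrary example values):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: the inner product of a and b over their magnitudes."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([3.0, 1.0, 5.0, 6.0])

pearson = np.corrcoef(x, y)[0, 1]
cos_raw = cosine(x, y)                            # sensitive to the means
cos_centered = cosine(x - x.mean(), y - y.mean()) # equals Pearson's r

print(pearson, cos_raw, cos_centered)
```

For raw (uncentered) data the two measures differ; subtracting the means is exactly what turns cosine similarity into correlation.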
Correlation

https://www.aplustopper.com/correlation/

28
PCA
• Principal component analysis (PCA)
– Data Covariance Analysis
– PCA is a way to reduce data dimensionality
– PCA projects high dimensional data to a lower dimension
– Retains most of the sample's variation.
– Useful for the compression and classification of data.
– Auxiliary variables, called “principal components” are uncorrelated
– Ordered in descending variance

NB: The Principal Components are NOT the Original Sources (s) From the
Mixture Model As = x !!

29
PCA

https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c

30
PCA
• Principal Component Analysis (PCA) extracts the most
important information. This in turn leads to compression
since the less important information are discarded. With
fewer data points to consider, it becomes simpler to
describe and analyze the dataset.

• PCA can be seen a trade-off between faster computation


and less memory consumption versus information loss. It's
considered as one of the most useful tools for data
analysis.

https://devopedia.org/principal-component-analysis

31
PCA

https://devopedia.org/principal-component-analysis

32
PCA
• We can describe the shape of a fish with two variables: height and width.
However, these two variables are not independent of each other. In fact, they
have a strong correlation. Given the height, we can probably estimate the
width; and vice versa. Thus, we may say that the shape of a fish can be
described with a single component.
• This doesn't mean that we simply ignore either height or width. Instead, we
transform our two original variables into two orthogonal (independent)
components that give a complete alternative description. The first component
(blue line) will explain most of the variation in the data. The second component
(dotted line) will explain the remaining variation. Note that both components
are derived from both height and width.

https://devopedia.org/principal-component-analysis

33
PCA

https://devopedia.org/principal-component-analysis

34
PCA: advantages
• PCA minimizes information loss even when fewer principal components are
considered for analysis. This is because each principal component is along a
direction that maximizes variation, that is, the spread of data. More importantly,
the components themselves need not be identified a priori: they are identified
by PCA from the dataset. Thus, PCA is an adaptive data analysis technique. In
other words, PCA is an unsupervised learning method.

• By reducing the number of dimensions, PCA enables easier data visualization.


Visualization helps us to identify clusters, patterns and trends more easily.
Fewer dimensions means less computation and lower error rate. PCA reduces
noise and makes algorithms work better.

https://devopedia.org/principal-component-analysis

35
PCA: drawbacks
• PCA works only if the observed variables are linearly correlated. If
there's no correlation, PCA will fail to capture adequate variance with
fewer components.
• PCA is lossy. Information is lost when we discard insignificant
components.
• Scaling of variables can yield different results. Hence, scaling that you
use should be documented. Scaling should not be adjusted to match
prior knowledge of data.
• Since each principal component is a linear combination of the original features, visualizations are not easy to interpret or relate to the original features.

36
Eigen decomposition (analysis)

https://guzintamath.com/textsavvy/2019/02/02/eigenvalue-decomposition/

37
PCA: original data

38
PCA: data - zero mean

39
PCA: covariance matrix
• Since the data is 2-dimensional, the covariance matrix will be 2x2.

• Since the non-diagonal elements in this covariance matrix are positive, we should expect that the x and y variables increase together.

40
PCA: eigen decomposition
• It is important to notice that these eigenvectors are both unit eigenvectors, i.e., their lengths are both 1. This is very important for PCA; luckily, most maths packages, when asked for eigenvectors, will give you unit eigenvectors.

41
PCA: final components

42
PCA: new dataset
• After selecting n eigenvalues:

43
PCA: new dataset

44
PCA: new dataset

45
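The whole pipeline from the preceding slides (zero-mean, covariance matrix, eigen decomposition, projection) can be sketched in NumPy; the 2-D sample values below are illustrative, chosen so that x and y tend to increase together:

```python
import numpy as np

# Illustrative 2-D data where x and y tend to increase together.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

Xc = X - X.mean(axis=0)                # step 1: subtract the mean
C = np.cov(Xc, rowvar=False)           # step 2: 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # step 3: unit eigenvectors of C
order = np.argsort(eigvals)[::-1]      # sort components by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                  # step 4: data expressed in PC coordinates

print(eigvals)
print(eigvals[0] / eigvals.sum())      # share of variance on the first PC
```

The first component carries most of the variance, the projected coordinates are uncorrelated, and the eigenvectors returned by `np.linalg.eigh` already have unit length, exactly as the slides require.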
Weka example

46
PCA Produces Orthogonal Basis

64
Matrix A is a Linear Transform of Source
Space Vectors into the Sensor/Data Space

65
Machine Learning Builds Useful Models

• GIGO: garbage in, garbage out
• “Good” data/features are a pre-requisite for machine learning to build a useful model.

66
Image Credits
• https://commons.wikimedia.org/wiki/File:Goodmans_Axiette_101_a.png
• https://commons.wikimedia.org/wiki/File:Us664a_microphone.jpg
• https://commons.wikimedia.org/wiki/File:Animal_hearing_frequency_range.svg
• https://en.wikipedia.org/wiki/Dipole_antenna#/media/File:Half_–_Wave_Dip
• https://en.wikipedia.org/wiki/File:Plato_-_Allegory_of_the_Cave.png
• https://en.wikipedia.org/wiki/Optical_illusion#/media/File:Optical-illusion-
checkerboard-twisted-cord.svg
• https://en.wikipedia.org/wiki/Ren%C3%A9_Descartes#/media/File:Cartesian_coordin
ates_2D.svg
• https://upload.wikimedia.org/wikipedia/commons/4/41/Kevin_Mitnick_2008.jpeg
• https://en.wikipedia.org/wiki/Computer#/media/File:Dell_PowerEdge_Servers.jpg
• https://upload.wikimedia.org/wikipedia/commons/2/2c/G5_supplying_Wikipedia_via_
Gigabit_at_the_Lange_Nacht_der_Wissenschaften_2006_in_Dresden.JPG
• https://en.wikipedia.org/wiki/Computer#/media/File:Acer_Aspire_8920_Gemstone.jpg
• http://mathinsight.org/dot_product

72
Thank you for your attention!

73
Lecture 4: Data Pre-processing;
Attribute Quality Measures

IMT4133 – Data Science for Forensics and Security


Andrii Shalaginov, andrii.shalaginov@ntnu.no
NTNU i Gjøvik
Week 9: 03.03.2022
Reading: Chapter 6&7 - Kononenko
Norwegian University of Science and Technology
Content of the lecture

• Feature Selection
• Principal Component Analysis

Norwegian University of Science and Technology 2


Semester Plan (1)
Week 3 (20.01.2022) Lecture 1: (Kononenko 1,2; Chio 1) Introduction to the team / Data
Analysis / ML methods / Artificial Intelligence / Big Data / Data Analytics problems in Digital
Forensics and Information Security / Computational Forensics

Week 4 (27.01.2022) Tutorial 1: Data Analysis; Learning and Intelligence

Week 5 (03.02.2022) Lecture 2: (Kononenko 3; Chio 2): ML Basics; Hybrid Intelligence;


Performance Evaluation

Week 6 (10.02.2022) Tutorial 2: Machine Learning Basics

Week 7 (17.02.2022) Lecture 3: (Kononenko 4,5) Knowledge Representation; Learning as


Search

Week 8 (24.02.2022) Tutorial 3: Learning as Search; Knowledge Representation

Week 9 (03.03.2022) Lecture 4: (Kononenko 6,7) Attribute Quality Measures; PCA; LDA;
Feature Selection

Norwegian University of Science and Technology 3


Semester Plan (2)
Week 10 (10.03.2022) Tutorial 4: Attribute Quality Measures. Data Pre-processing

Week 11 (17.03.2022) Lecture 5: (Kononenko 9,10) Symbolic & Stat learning; Visualization

Week 12 (24.03.2022) Tutorial 5: Symbolic and Statistical learning

Week 13 (31.03.2022) Lecture 6: (Kononenko 11*; Chio 2) Artificial Neural Networks; Deep
Learning; Support Vector Machines

Week 14 (07.04.2022) Tutorial 6: Support Vector Machine & Artificial Neural Network

Week 15 Påske/Easter

Week 16 (21.04.2022) Lecture 7: (Kononenko 12; Chio 2) Unsupervised Learning; Cluster


Analysis

Week 17 (28.04.2022) Tutorial 7: Cluster Analysis

Week 18 (05.05.2022) Guest lecture / MOCK exam Preparation for the exam; Q & A

Norwegian University of Science and Technology 4


Features in Machine Learning

• Features = Attributes = Properties = Characteristics


• Feature Extraction – the main process:
– Feature Construction
– Feature Selection

Norwegian University of Science and Technology Slideshare 5


Features in Information Security (1)

• Intrusion Detection

Norwegian University of Science and Technology MDPI 6


Features in Information Security (2)

• Malware Detection

Norwegian University of Science and Technology http://hpimentel.github.io/Android-Malware-Detection/ 7


Principal Component Analysis

http://www.nlpca.org/pca_principal_component_analysis.html
Norwegian University of Science and Technology 8
Thank you for your attention!
Andrii Shalaginov
Department of Information Security and Communication
Technology
Faculty of Information Technology and Electrical Engineering
Norwegian University of Science and Technology
andrii.shalaginov@ntnu.no

Norwegian University of Science and Technology


Lecture 5: Symbolic and Statistical
Learning. Neuro-Fuzzy Algorithm

IMT4133 – Data Science for Forensics and Security


Andrii Shalaginov, andrii.shalaginov@ntnu.no
NTNU i Gjøvik
Week 12: 21.03.2022
Reading: Chapter 9&10 - Kononenko
Norwegian University of Science and Technology
Thank you for your attention!
Andrii Shalaginov
Department of Information Security and Communication
Technology
Faculty of Information Technology and Electrical Engineering
Norwegian University of Science and Technology
andrii.shalaginov@ntnu.no

Norwegian University of Science and Technology


Statistical Learning
by Barbora Micenkova and Vladimir Smida

 Prevalence of uncertainty in the real world


 Can be handled by using methods of probability
 Some approaches:
 Bayes classification & maximum likelihood
estimation
 Nearest-neighbor models
 Discriminant functions & neural networks
 Linear regression
 Support vector machines
Discriminant Functions

 Curve or hypersurface separating the feature


space into subspaces containing only samples
of one class = decision boundary
 We estimate straight the parameters of the
discriminant functions
 Functions may be
 linear
 quadratic – by adding additional terms

 general – Φ functions
Linear Discriminant Functions
 Linear discriminant function:
g(x) = wᵀx + w₀
 maps the feature space into a real number which can be viewed as a distance from the decision boundary
 This is how the decision boundary is obtained: g(x) = 0
 Obtaining a value g(x) > 0, the classifier decides for class ω1
 The classifier has to find a function g that minimizes the classification error on the training examples
Multi-Category Case

 For multi-category case we need to divide the


feature space into as many regions as the
number of classes
 Usually we define also
as many discriminant
functions
 Differences between
weight vectors
important

Source: Duda&Hart&Stork
Two-Category Case

 The task is to determine the weights w of the discriminant function g(x) = wᵀx
 Imagine the weight vector in weight space
 Each sample vector will affect the solution
 Several methods how to find the weight vector
– the solution is not unique!
 We define some criterion function that is
minimized when w is a solution vector.
Perceptron Criterion Function

J w = ∑ −w x 
T

x ∈M

 M(w) is the set of samples misclassified by w


 We count the sum of distances of the
misclassified samples to the decision boundary
 Can be solved by gradient descent
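A batch gradient-descent sketch of minimizing the perceptron criterion, on made-up, linearly separable data; class-ω2 samples are negated (a standard trick) so a solution satisfies wᵀx > 0 for every sample:

```python
import numpy as np

# Batch gradient descent on the perceptron criterion J(w) = sum_{x in M} (-w^T x).
# The first column is a constant 1 absorbing the bias w0, and class-omega2
# samples are negated so a solution satisfies w^T x > 0 for all rows.
X = np.array([[ 1.0,  2.0,  1.0],    # class omega1
              [ 1.0,  3.0,  2.0],    # class omega1
              [-1.0, -1.0, -4.0],    # class omega2 (negated)
              [-1.0, -2.0, -3.0]])   # class omega2 (negated)

w, eta = np.zeros(3), 0.1
for _ in range(100):
    mis = X[X @ w <= 0]              # samples still on the wrong side
    if len(mis) == 0:
        break                        # J(w) = 0: everything classified correctly
    w += eta * mis.sum(axis=0)       # step along the negative gradient of J

print(w, bool((X @ w > 0).all()))
```

Each update moves the boundary toward the misclassified samples, shrinking the summed distances the criterion counts, until the set M(w) is empty.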
Linear Regression

 Prediction of a continuous target
 Linear combination of coefficients
 Criterion function: minimum squared error (MSE)
 trying to find a line which best fits the examples
 MSE tries to make the sum of squares of the “errors” as small as possible
 we search for the minimum of the function:

J(w) = Σᵢ₌₁ⁿ ( wᵀxᵢ − yᵢ )²
Linear Regression: An Example

 Four data points: (1,6), (2,5), (3,7), and (4,10)
 We want to find the line y = w₁ + w₂x
 Actually, we want to solve the system:
w₁ + 1·w₂ = 6
w₁ + 2·w₂ = 5
w₁ + 3·w₂ = 7
w₁ + 4·w₂ = 10
 The system is overdetermined – we try to make the sum of squares of the “errors” as small as possible
Linear Regression: An Example II.

 Find the minimum of the function:

J(w) = [6 − (w₁ + 1·w₂)]² + [5 − (w₁ + 2·w₂)]² + [7 − (w₁ + 3·w₂)]² + [10 − (w₁ + 4·w₂)]²

 The minimum can be found by calculating the partial derivatives (gradient) of J(w) with respect to w and setting them to 0
 … solution: w₁ = 3.5, w₂ = 1.4
 The line of best fit: y = 3.5 + 1.4x
Linear Regression: An Example III.

 Solution in a figure

 Red spots: learning


examples
 Blue line: line of best fit
 Green lines: residuals

Source: Wikipedia
Nearest Neighbors

 Lazy learning method
 Transductive reasoning – does not induce a general hypothesis
 A new example is labeled by the subset of learning samples most similar to it
 Used for classification & regression problems
 According to how the target class of a new example is predicted:
 k-Nearest Neighbors
 Weighted k-Nearest Neighbors
k-Nearest Neighbors

 k stored learning examples make the prediction for a new sample
 Classification
 The prevalent class in the set of k nearest neighbors is predicted:

c_x = argmax_{c ∈ {C₁, …, C_m}} Σᵢ₌₁ᵏ δ(c, cᵢ)

where c_x is the target value, {C₁, …, C_m} is the set of possible classes, and
δ(a, b) = 1 if a = b, 0 if a ≠ b

Source: Wikipedia
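A minimal k-NN classifier matching the majority-vote formula; the 2-D training points and labels are made up, echoing the earlier conveyor-belt fish example:

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Label a query point with the majority class among the k training
    points closest in Euclidean distance."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Made-up 2-D measurements for the salmon/cod example.
train = [((1.0, 1.0), "salmon"), ((1.2, 0.8), "salmon"), ((0.8, 1.1), "salmon"),
         ((4.0, 4.0), "cod"),    ((4.2, 3.9), "cod"),    ((3.8, 4.2), "cod")]

print(knn_classify(train, (1.1, 0.9), k=3))  # salmon
print(knn_classify(train, (4.1, 4.0), k=3))  # cod
```

There is no training step: the "model" is just the stored examples, queried lazily at prediction time.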
k-Nearest Neighbors

 Regression
 the mean target value of the k nearest neighbor examples:

c_x = (1/k) Σᵢ₌₁ᵏ cᵢ

 Training set
 Neighbors are taken from a set of objects for which the correct classification (value of the property) is known
 Although there is no explicit training phase!
k-Nearest Neighbors
• How to identify neighbors?
  • Objects = position vectors in a multidimensional
feature space
• Euclidean distance
  • The distance between two attributes equals their absolute
difference: d(v_{i,j}, v_{i,l}) = |v_{i,j} − v_{i,l}|
  • Distance between two examples:

D(t_j, t_l) = √( Σ_{i=1}^{a} d(v_{i,j}, v_{i,l})² )

• Manhattan distance
• ...
Weighted k-Nearest Neighbors
• Deals with a drawback of k-Nearest Neighbors
  • Classes with more frequent examples
tend to dominate the prediction of a new
sample
• The prediction of the class of a new sample is also based
on the distances from the neighbors – weights
• Impact of distance – linear, polynomial,
exponential, ... function = kernel function
• Classification

c(x) = argmax_{c ∈ {C1, ..., Cm}} Σ_{i=1}^{k} δ(c, c_i) / D(t_x, t_i)²
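Both variants can be sketched in a few lines of Python; the training data here is hypothetical, and the weighted variant scores each class by 1 / D(t_x, t_i)² as in the formula above (NumPy assumed available):

```python
# Plain and weighted k-NN classification on hypothetical 2-D data.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    # Euclidean distance from the query to every stored example
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    idx = np.argsort(d)[:k]                  # the k nearest neighbours
    if not weighted:
        # plain k-NN: majority vote among the k neighbours
        return Counter(y_train[i] for i in idx).most_common(1)[0][0]
    # weighted k-NN: votes weighted by 1 / D(t_x, t_i)^2
    scores = {}
    for i in idx:
        w = 1.0 / (d[i] ** 2 + 1e-12)        # epsilon avoids division by zero
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + w
    return max(scores, key=scores.get)

X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.1]])
y = np.array(['A', 'A', 'B', 'B', 'B'])
```

Note that there is no training step at all: the prediction is computed from the stored examples at query time, which is exactly what makes the method "lazy".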
Thank you for your attention!


Symbolic Learning

IMT4612
Gjøvik University College
Kjell Tore Fossbakk, Katrin Franke
Symbolic Learning

• Uses symbols as the means to learn new
information

  Topics
  Decision Trees
  Decision Rules
  Association Rules
  Regression Trees
Decision Trees: The tree

• Nodes
  • Attributes
• Connections/Edges
  • Values
• Leaves/terminal nodes
  • Class labels
Decision Trees: Paths

  Path from the root node to a terminal node


Decision Trees: Pruning

• A too fragmented tree has poor reliability in the
bottom leaves
  • Perfect accuracy on the learning set
  • Bad accuracy on the test set
• Overfitting
• Removing irrelevant nodes
  • Post-pruning
Decision Rules: Made from decision trees

• At most as many rules as there are leaves
• Conditions on attributes and values
  • Either discrete or continuous
Association Rules: Basics
• Find relations between attributes in datasets
A → B
• Antecedent
  • Left-hand side of an association rule
• Consequent
  • Right-hand side of an association rule
• Support
  • Used in computing the confidence
  • Proportion of transactions containing Z among all transactions T, Z ⊆ T
• Confidence
  • Confidence in a statement
  • "If you buy a white t-shirt, you also buy blue ones"

Confidence = Support(antecedent ∪ consequent) / Support(antecedent)
Association Rules: Formal model (1)

• I = I1, I2, ..., In (n items)
• T = t1, t2, ..., tm (m transactions)
• Binary vector
  • tk = 1 if the transaction purchased item Ik, 0 if not.
• Association Rule
  • A transaction t satisfies X when tk = 1 for all items Ik
in X.
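The support and confidence definitions can be sketched in a few lines of Python; the six basket contents here are hypothetical, chosen so the counts reproduce the grocery-store example that follows:

```python
# Support and confidence over a hypothetical set of six transactions.
transactions = [
    {'fish', 'chips', 'soda'},
    {'fish', 'chips'},
    {'milk', 'bread'},
    {'bread', 'butter'},
    {'milk', 'soda'},
    {'butter', 'soda'},
]

def support(itemset, ts):
    # proportion of transactions that contain every item in the itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in ts) / len(ts)

def confidence(antecedent, consequent, ts):
    # Support(antecedent U consequent) / Support(antecedent)
    return support(set(antecedent) | set(consequent), ts) / support(antecedent, ts)

conf = confidence({'fish', 'chips'}, {'soda'}, transactions)
# support({fish, chips, soda}) = 1/6, support({fish, chips}) = 2/6,
# so the confidence of {fish, chips} -> {soda} is 0.5
```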
Association Rules: Example (1)

• Grocery store with milk, bread, butter, fish,
chips and soda.
• Transactions = items in a basket
• Market Basket Analysis
• Statement:
  • People that buy fish and chips also buy soda.

Association Rules: Example (2)

• Support {fish, chips, soda} = 1/6
• Support {fish, chips} = 2/6
• Confidence = 50%
  • In 50% of the purchases containing fish and chips,
the statement is true
Regression Trees: Elements

  Similar to decision trees


  Nodes (attributes), edges (values), leaves (class
labels)
  Regression rule: Path from root node to leaf
  Pruning
  Difference
  Values of the leaves (target values) can be
continuous
  Function of values
  Constant, Linear, Arbitrary
Applications

• Supervised Learning
  • Classify HTML pages as spam/not spam
  • Pattern recognition (handwriting)
  • Speech recognition
Conclusion

  Decision Trees are simple and easy to build


  Complex data can be structured in a way, so
that new relations can be found
  Both discrete and continuous
  Binary Decision Trees for performance
End of lecture – Questions?
Bibliography:
• Igor Kononenko and Matjaž Kukar. Machine Learning and Data Mining:
Introduction to Principles and Algorithms. Horwood Publishing Limited,
2007.
• V. Berikov and A. Litvinenko. Methods for statistical data analysis with
decision trees. Sobolev Institute of Mathematics, 2003.
• Teemu Hirsimäki. Decision trees in speech recognition. Helsinki
University of Technology, 2003.
• Madan Kumar Pernati. Web spam detection using decision trees. Indian
Institute of Information Technology Allahabad, 2007.
• R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between
sets of items in large databases. In SIGMOD Conference (207–216),
1993.
• Noboru Takagi. An application of binary decision trees to pattern
recognition. Department of Intelligent Systems Design Engineering,
Toyama Prefectural University, Tokyo, 2006.
Artificial Neural Networks
Sukalpa Chanda, Katrin Franke & Carl Leichter
Department of Information Security and
Communication Technology
NTNU Digital Forensics Group

1
Outline
• What is ANN and where do we apply
• What are different variants
• Basic building block (Perceptron)
• What is multi-layer perceptron and why we
need that.

• Learning algorithm in Multilayer Perceptron


• Drawbacks with that learning algorithm.
2
Artificial Neural Network
• An Artificial Neural Network (ANN) mimics the
human brain
• It is a mathematical or computational
model that tries to simulate the structure
and/or functional aspects of biological neural
networks.

3
Artificial Neural Network
• Models of the brain and nervous system
• Highly parallel
– Process information much more like the brain
than a serial computer
• Learning
• Very simple principles
• Very complex behaviours
• Applications
– As powerful problem solvers
– As biological models
4
Types of ANN
ANNs can mainly be classified into the
following types:
• According to Topology
• According to Learning Rule
• According to Activation function
• According to Application

6
Based on Topology
• ANN without layers
• Two Layered FeedForward ANN
• Multi-layered FeedForward ANN
• Bi-Directional Two layered ANN
• Picture of a Multi-layered
Feed forward ANN

7
Based On learning Rule
• Hebbian Learning Rule
• Delta Learning Rule ( Back propagation
learning for Multi-layered Perceptron)
• Competitive learning
• Forgetting

8
Tapson, Jonathan, et al. "Synthesis of neural networks for spatio‐temporal spike pattern 
recognition and processing."
Perceptron

Adds a Bias Term
Can Learn Linearly Separable Feature Spaces

Θ
OR & AND Decision Boundaries
XOR Decision Boundaries
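The contrast between the two slides above can be shown in code: a minimal sketch of the perceptron update rule (bias folded into the weights; NumPy assumed available) learns the linearly separable AND function, whereas the same loop would never converge on XOR:

```python
# The perceptron update rule trained on the AND function.
import numpy as np

def train_perceptron(X, y, epochs=20, lr=0.1):
    Xa = np.column_stack([np.ones(len(X)), X])  # prepend the bias input 1
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        for xi, target in zip(Xa, y):
            out = 1 if xi @ w > 0 else 0        # threshold activation
            w += lr * (target - out) * xi       # update only on mistakes
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])
w = train_perceptron(X, y_and)
preds = [1 if np.r_[1, x] @ w > 0 else 0 for x in X]
# preds reproduces AND: [0, 0, 0, 1]
```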
Solution: Multi-layer
Perceptron
• Designed to handle "non-linear" classification
problems.
• Achieved by introducing one or more hidden
layers between the input layer and the final output
layer.
• Activation functions are generally sigmoid or
Gaussian, giving an output ranging from 0 to 1.

13
Sophisticated Activation Functions
• Threshold based activation function outputs
are not compatible with sophisticated weight
correction methods, like gradient descent.
• Activation function should be continuous and
differentiable in nature in order to use
gradient descent for correcting weights
• Solutions are as follows:

11
Sigmoid Activation Function
Error Back‐Propagation Through Activation!
MultiLayer Perceptron
• An interconnected network of single
perceptrons.
• Consists of multiple layers of
perceptrons/neurons/nodes.
• Input layer nodes are feature components.
• Output layer nodes are the desired
classification labels/regression outputs.
• Each node gets input from all or
some of the nodes of the previous layer
14
Feed-forward Nets

Information flow is unidirectional


Data is presented to Input layer
Passed on to Hidden Layer
Passed on to Output layer

Information is distributed

Information processing is parallel

Internal representation (interpretation) of data


Internal Sigmoid 
Internal Model Principle
(Control Theory)
How Do MLP‐BP Construct Model of Data from Sigmoids?
Combine Two Sigmoids Into a Ridge
Combine Two Ridges Into a Pseudobump
Weights Adjusted to Smooth Bump
Bump is Internal Modeling “Voxel”
BackPropagation –
Learning in an MLP
• For each layer, the difference between the value of the output
node and the desired value of that output node is
computed by an error function.
• The weights associated with the output node are
updated by minimizing the error function.
• The direction of the weight correction is
determined by taking the derivative of the error
function. The process starts from the outermost
layer and moves successively to deeper layers.
• Stops when the input layer is reached.
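The loop described above can be sketched for a tiny network; the architecture (2 inputs, 4 hidden sigmoid units, 1 output), learning rate, and XOR task are hypothetical choices, and NumPy is assumed available:

```python
# A tiny multilayer perceptron trained by backpropagation on XOR.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: not linearly separable, so a hidden layer is required
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.uniform(-1, 1, (2, 4)); b1 = np.zeros(4)    # input -> hidden
W2 = rng.uniform(-1, 1, (4, 1)); b2 = np.zeros(1)    # hidden -> output
lr, losses = 0.5, []

for _ in range(2000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    losses.append(float(((y - t) ** 2).mean()))
    # backward pass: the error starts at the output layer and is
    # propagated toward the input, one layer at a time
    dy = (y - t) * y * (1 - y)          # derivative through output sigmoid
    dh = (dy @ W2.T) * h * (1 - h)      # error at the hidden layer
    W2 -= lr * (h.T @ dy); b2 -= lr * dy.sum(axis=0)
    W1 -= lr * (X.T @ dh); b1 -= lr * dh.sum(axis=0)
```

Tracking `losses` over the iterations makes the gradient descent visible: the mean squared error shrinks as the weights are corrected.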
Drawbacks of
BackPropagation
• It might not always give the optimal
network (e.g. the overfitting problem).
• A wrong step size might bypass the
optimum.
• Might get stuck in a local minimum.
• The number of hidden layers
and the number of nodes in each hidden layer
must be fixed empirically.
• Learning requires a large number of passes
and a huge amount of training data.
Utility of Optimal
Step Size (Learning rate)

18
Alternative to Gradient Decent
• Genetic Algorithms

19
Network Topologies
Take Home Message
• Use Neural Network when you have enough
samples.
• Training is very time consuming.

20
Thank you for your attention!

21
Support Vector Machine

Sukalpa Chanda, Katrin Franke, Carl Leichter


Department of Information Security and
Communication Technology

NTNU Digital Forensics Group


1
Outline
• An Introduction to SVMs
• The geometrical representation
• SVMs for binary classification.
• SVM for multi-class classification.
• Why Kernel SVM
• Kernel trick
• Different Kernel Types.

2
Linearly Separable
Binary Classification
• L training points.
• Each training point is a D-dimensional vector.
• Training data is of the form
{xi, yi} where i = 1, 2, 3, ..., L, yi ∈ {−1, +1} and xi ∈ R^D
• We assume the data is linearly separable
3
Pictorial Motivation
• So many possible Hyperplanes – Which one to Choose?

4
Best Solution

5
How do we get it?
• Solving an
Optimization Problem
(Convex Optimization,
precisely)
• Let's formulate the
problem
mathematically
• This hyperplane can be
described by w · x + b = 0
The Margin
• x is my Data ( Feature
Vector)
• b is a Bias
• w is the weight vector
Why Max. Margin?
• According to Statistical Learning Theory
Max. generalization can be achieved by Max.
margin.
• We need to define distance/metric in the
feature space.
• We implicitly fix a scale
• How???
Margin
• We introduce
canonical hyper
plane for both
classes.
• x· w + b = +1
• x · w + b = −1
Margin
• Let us take two
arbitrary points from both
class examples.
• The distance between
them X1- X2. (Red Line)

• The margin/distance
between X1 and X2 can
be obtained by
projecting it on the
vector normal to the
hyperplane.(Green Line)
Margin
• x1 · w + b = +1
• x2 · w + b = -1 On
Subtraction Gives
• w · ( x1-x2)=2, here w is
the green line.
• Canonical hyperplanes
in yellow
• Data points on Yellow
line are Support
vectors
Margin
• Since total distance =2
between the two samples
• distance of 1 between
hyperplane & each sample
on canonical hyperplanes
(see dotted black line)

• Recall how we define our


canonical hyperplanes
(i) x1 · w + b = +1
(ii) x2 · w + b = -1
Mathematical
Formulation
• It turns out that for each sample we need to do
the following optimization computation:
– Maximize 1 / ||w||
(we normalized the distance/margin by dividing it by ||w||, so w · (x1 − x2) = 2
becomes w · (x1 − x2)/||w|| = 2/||w||)

Subject to yi (xi · w + b) − 1 ≥ 0 ∀i, i = 1, 2, ..., L

Mathematical
Formulation
• For computational convenience we make it a minimization
problem and introduce a square term to make it a Quadratic
Programming problem. (well, Convex Optimization to be
honest in this case)
• The original problem is redefined as:
Min. ||w||² / 2 (here by norm we mean the L2 norm)
Subject to
yi (xi · w + b) − 1 ≥ 0 ∀i, i = 1, 2, ..., L
• Hmm… a Constrained Optimization Problem!!!
• Solution – Lagrange multipliers
Lagrange Multiplier
• So, we start by trying to find the extreme
values of f(x, y) subject to a constraint of
the form g(x, y) = k.
• In other words, we seek the extreme values
of f(x, y) when the point (x, y) is restricted to lie
on the level curve g(x, y) = k.
• These have the
equations f(x, y) = c,
where c = 7, 8, 9,
10, 11
Lagrange Multiplier
• To maximize f(x, y) subject to g(x, y) = k
is to find:

– The largest value of c


such that the level
curve f(x, y) = c
intersects g(x, y) = k.
Lagrange Multiplier
• This means that the normal lines at
the point (x0, y0) where they touch are identical.
– So the gradient vectors are parallel.
– That is,
∇f(x0, y0) = λ ∇g(x0, y0)
for some scalar λ.
• Please note that the gradient vector of a function at
a point is always perpendicular to its level curve
at that point, and hence perpendicular to the
function's tangent plane.
Multi-class SVM
• Generally solved by two approaches:
• 1-against-all
• 1-against-1
• Majority voting between the different decision
functions.
• In case of a tie, the maximum actual score value is
considered.
Depends Upon Feature Space Topology

http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
Easy Linear Separation

http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
What Hyperplane Works Here?

http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
The Kernel Trick!
Different Kinds of Building Blocks
• Basis Functions
• Building blocks that span a vector space (can be a feature space)
• Can be a transformation (time to frequency)
• Not used to change dimensionality

• Kernel Functions
• Building blocks to create new representation in a different feature space
• Always a transformation
• Used to go from a lower dimension to a higher dimension
• RKHS
• The Kernel Trick (Appearing at an SVM near you….)
Kernel Evolution
• In 1992, Bernhard Boser, Isabelle Guyon and
Vapnik suggested a way to create nonlinear
classifiers by applying the kernel trick.
• Largely inspired by Aizerman et al.
• M. Aizerman, E. Braverman, and L. Rozonoer
(1964). "Theoretical foundations of the
potential function method in pattern
recognition learning". Automation and
Remote Control 25: 821–837.
Why Kernel SVM?
• What if the classes are not linearly separable in
the original feature space?
• A linear SVM cannot handle such cases.
• The solution is to use the kernel trick
– The main idea is to represent the original
feature vectors in a higher-dimensional
space
– Features in the higher-dimensional space
become linearly separable.
Transformation Function
• Map Data from input space to high
dimensional feature space where they are
linearly separable.
Examples of Kernel
Function
• Gaussian Kernel

• Polynomial Kernel

• Can we compute our Kernel from a feature


vector ??? ( Homework) !!!
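As a starting point for the homework question, both kernels above can indeed be computed directly from a pair of feature vectors; a minimal NumPy sketch (parameter values are hypothetical defaults):

```python
# The Gaussian and polynomial kernels computed from feature vectors.
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def polynomial_kernel(x, z, degree=2, c=1.0):
    # K(x, z) = (x . z + c)^degree
    return (np.dot(x, z) + c) ** degree

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
g = gaussian_kernel(x, z)     # exp(-1), since ||x - z||^2 = 2
p = polynomial_kernel(x, z)   # (0 + 1)^2 = 1
```

Note that the Gaussian kernel of any vector with itself is 1, and it decays toward 0 as the vectors move apart; this is exactly the similarity interpretation of a kernel.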
Some Advantages
with SVM
• Generalization is achieved
• Can efficiently handle non-linear classification
problem
• Easier to interpret its results compared to a
Neural Network.
Some Problems
with SVM
• Need to select the correct kernel function
• Need to set the right value for the kernel's
parameters
– Setting the value of sigma in the case of the Gaussian
kernel,
– Setting the right degree for the polynomial
kernel
• Is there an answer to those problems? Yes!!!
• MKL SVM (Multiple Kernel Learning SVM)
Acknowledgements
http://www.cs.nott.ac.uk/~pszgxk/courses/g5aiai/006neuralnetworks/neural-networks.htm

Tapson, Jonathan, et al. "Synthesis of neural networks for spatio‐
temporal spike pattern recognition and processing." arXiv preprint 
arXiv:1304.7118 (2013).

https://upload.wikimedia.org/wikipedia/commons/2/20/ASR‐
9_Radar_Antenna.jpg

http://ieeebooks.blogspot.no/2011/02/lessons‐in‐electric‐circuits‐
volume‐ii_4086.html
https://en.wikipedia.org/wiki/Air_traffic_control_radar_beacon_system#
/media/File:ASR‐9_Radar_Antenna.jpg
• https://en.wikiversity.org/wiki/Learning_and_neural_networks

• http://www.ece.utep.edu/research/webfuzzy/docs/kk‐thesis/kk‐thesis‐html/node18.html

• http://www.cs.bham.ac.uk/~jxb/NN/l3.pdf
• http://www.cs.stir.ac.uk/courses/ITNP4B/lectures/kms/2‐Perceptrons.pdf
• http://www.math.washington.edu/~palmieri/Courses/2008/Math326/pictures.php
• https://i.imgur.com/Jl4gIBl.jpg
• https://www.youtube.com/watch?v=3liCbRZPrZA
• https://www.youtube.com/watch?v=9wijQD8DPc4
• https://www.youtube.com/watch?v=UFnjV1E615I
• https://www.youtube.com/watch?v=NmhbQ‐ag2z0
Thank you for your attention!

39
Lecture 6: Artificial Neural Networks;
Support Vector Machines

IMT4133 – Data Science for Forensics and Security


Andrii Shalaginov, andrii.shalaginov@ntnu.no
NTNU i Gjøvik
Week 14: 31.03.2022
Reading: Kononenko 11; Chio 2
Norwegian University of Science and Technology
IMT4133 Data Science for Security and
Forensics

Lecture 7
Clustering

Andrii Shalaginov, Carl Leichter

Norwegian University of Science and Technology.

1
Types of Clustering

• Hierarchical
– Taxonomies
– Organizational Charts

• Partition
– Feature Space Regions

2
Clustering Methods

• K-Means Clustering
• Gaussian Mixture Models
• Canopy Clustering
• Vector Quantization

3
Essentials of Clustering

• Similarities
– Natural Associations
– Proximate*

• Differences
– Distant*

*Implies a distance metric

4
Clustering considerations
• What does it mean for objects to be similar?
• What algorithm and approach do we take?
– k-means
– hierarchical
• Bottom up agglomerative clustering (HAC)
• Top Down divisive
• Do we need a hierarchical arrangement of clusters?
• How many clusters?
• Can we label or name the clusters?
• How do we make it efficient and scalable?
– Canopy Clustering

5
Hierarchical Clustering

• Build a tree-based hierarchical taxonomy


(dendrogram) from a set of documents.
animal

vertebrate invertebrate

fish reptile amphib. mammal worm insect crustacean

How could you do this with k-means?

6
Clustering: Corpus browsing

[Figure: the www.yahoo.com/Science directory as a browsable cluster
hierarchy – top-level topics (agriculture, biology, physics, CS, space)
branch into subtopics such as dairy, crops, botany, cell, AI, courses,
craft, magnetism, forestry, agronomy, evolution, HCI, missions, relativity]
Dendrogram: Hierarchical Clustering

Cut the dendrogram at a desired level

Each connected node forms a cluster


Dendrogram: Hierarchical Clustering
• Yes: you can have a
cluster with only one
node
Hierarchical Clustering
• Also produces a set of nested clusters

[Figure: six points grouped into nested clusters, shown next to the
corresponding dendrogram – the merge heights (roughly 0.05 to 0.2)
record the order in which points 1–6 are joined]
Hierarchical Clustering
• Assumes a similarity function for determining the similarity of two entities
in the clustering (nodes or other clusters).
• Two main types of hierarchical clustering

– Agglomerative (Bottom Up):


• Start with the nodes as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) left

– Divisive (Top Down):


• Start with one, all-inclusive cluster
• Eventually each node forms a cluster on its own.
–(or there are k clusters)
–Merge or split one cluster at a time

• Traditional hierarchical algorithms use a similarity or distance metric

• Needs a termination/convergence/readout condition

11
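The agglomerative (bottom-up) procedure just described can be sketched directly; the 1-D points are hypothetical, and single linkage (distance between the closest members) is used as the cluster similarity:

```python
# Bottom-up agglomerative clustering with single linkage on 1-D points:
# start with singleton clusters, repeatedly merge the closest pair
# until only k clusters remain.
def agglomerative(points, k):
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(abs(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return clusters

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
result = agglomerative(pts, 3)
```

Stopping the merge loop at different values of k corresponds to cutting the dendrogram at different levels.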
Agglomerative versus Divisive
• Data set {a, b, c, d, e}

[Figure: agglomerative clustering proceeds bottom-up through steps 0–4,
merging a+b -> ab, d+e -> de, c+de -> cde, and finally ab+cde -> abcde;
divisive clustering performs the same steps in reverse, top-down]
Strengths of Hierarchical Clustering
• No assumptions on the number of clusters
– Any desired number of clusters can be obtained by ‘cutting’ the
dendrogram at the proper level

– But there are limits

• Hierarchical clusterings may correspond to meaningful


taxonomies
– Biological sciences
– Websites
– Product Catalogues

13
Agglomerative

• Start with clusters of individual points

14
Intermediate State
• After some merging steps, we have some clusters

C3

C4

C1

C2 C5

15
Intermediate State
• Merge the two closest clusters (C2 and C5)

C3

C4

C1

C2 C5

16
After Merging

C3

C4

C1

C2 ∪ C5

17
Means of Clustering
• K-means

• GMM
– Expectation Maximization (EM)

• Canopy

• Hard+Sharp/Soft+Fuzzy
– Hard = No Cluster Overlap
– Soft = Some Cluster Overlap

18
Essentials of Clustering

• What is a “Good” Cluster?


– Members are very “similar” to each other
• Within Cluster Divergence Metric σi
– Variance also works
• Relative Cluster Sizes versus Data Spread

19
Evaluating Clusters
• What does it mean to say that a cluster is “good”?

– Clusters should have members that have a high degree of similarity


• Within Cluster Divergence Metric
– Std Dev σi
– Variance σi2

– Standard way to measure within-cluster similarity is variance* –


• clusters with the lowest variance are considered best

• Cluster size is also important so alternate approach is to use


average cluster variance*

• Relative Cluster Sizes versus Data Spread

20
What are we optimizing to get good
clusters?
• Given: Final number of clusters
• Optimize:
– “Tightness” of clusters
• {average/min/max/} distance of points to each other in the same
cluster
• {average/min/max} distance of points to each cluster’s center

– Distance between clusters

21
The Distance Metric
• How the similarity of two elements in a
set is determined, e.g.
– Euclidean Distance
– Inner Product Space (*)
– Manhattan Distance
– Maximum Norm
– Mahalanobis Distance
– Hamming Distance
– Or any metric you define over the
space…

22
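Three of the metrics listed above, evaluated on the same hypothetical pair of points (NumPy assumed available):

```python
# Euclidean, Manhattan and maximum-norm distance for one pair of points.
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance: 5.0
manhattan = np.sum(np.abs(a - b))           # sum of per-axis distances: 7.0
chebyshev = np.max(np.abs(a - b))           # maximum norm: 4.0
```

The choice of metric changes which points count as "similar", and therefore which clusters emerge.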
Manhattan Distance

https://www.quora.com/What-is-the-difference-between-Manhattan-and-
Euclidean-distance-measures

23
Mahalanobis Distance

http://www.jennessent.com/arcview/mahalanobis_description.htm

24
Mahalanobis Distance

http://stats.stackexchange.com/questions/62092/bottom-to-top-
explanation-of-the-mahalanobis-distance
25
Partitional Clustering

• Partitions set into all clusters simultaneously.


26
Partitional Clustering

• Partitions set into all clusters simultaneously.


27
K-Means Clustering
• Simple Partitional Clustering

• Choose the number of clusters, k

• Choose k points to be cluster centers

• Then…

28
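The "Then…" consists of alternating two steps until the centers settle; a minimal NumPy sketch (the data, k, and iteration count are hypothetical):

```python
# A minimal k-means: alternate assignment and update steps.
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # choose k data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each center moves to the mean of its points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# two well-separated blobs of five identical points each
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
labels, centers = kmeans(X, 2)
```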
But!
• The complexity is pretty high:
– k · n · i · O(d)

k = # clusters
n = # data points
i = # iterations
O(d) = computational complexity
of the distance metric
(Motivation for Canopy Clustering)

29
Canopy Clustering for Big Data

• Preliminary step to help parallelize computation.


• Clusters data into overlapping Canopies using super
cheap distance metric.
• Efficient
• Semi-Accurate

It’s Like Using Postal Codes!

30
What Does This Remind You Of?

• Nations
–States (Regions)
• Cities
–Postal Codes
» Street Addresses

31
Canopy Clustering
• Use “cheap” method in order to create some number of
overlapping subsets, called canopies.

• A canopy is a subset of data that, according to the


approximate similarity measure, are within some
distance threshold from a central point.

• An element may appear under more than one canopy

• Canopies are created with the intention that points not


appearing in any common canopy are far enough
apart that they could not possibly be in the same
cluster.

32
Creating Canopies
• Define two thresholds
– Tight: T2
– Loose: T1

• All elements initialized in set S

• While S is not empty


– Remove any element r from S and create a canopy centered at r
– For each other element ri, compute cheap distance d from r to ri
• If d < T1, place ri in r’s canopy
• If d < T2, remove ri from S

When will S be empty?

After the last element has been found to be within T2
of the last canopy center
– or –
if only one element is left, when we start the next iteration

33
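The two-threshold procedure above translates almost line by line into code; the 1-D points are hypothetical and |a − b| plays the role of the "cheap" distance:

```python
# Canopy assignment with a loose threshold t1 and a tight threshold t2.
def canopy(points, t1, t2):
    remaining = list(range(len(points)))   # the set S
    canopies = []
    while remaining:
        center = remaining.pop(0)          # remove an element r from S
        members = [center]
        for i in remaining[:]:             # iterate over a copy of S
            d = abs(points[i] - points[center])   # "cheap" distance
            if d < t1:
                members.append(i)          # within the loose threshold
            if d < t2:
                remaining.remove(i)        # within the tight threshold
        canopies.append(members)
    return canopies

pts = [0.0, 0.6, 1.2]
result = canopy(pts, t1=1.0, t2=0.5)
# the middle point falls inside two canopies, illustrating that
# canopies are allowed to overlap
```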
Single Canopy

https://www.codeboy.me/2014/11/02/datami
ne-canopy/
34
Single Canopy

35
Single Canopy

36
Multiple Canopies

37
Canopy Clustering

• Start with computationally inexpensive


approximation
– Household address, the postal code

• Finish with a more expensive and accurate


similarity measure
– Household address detailed field-by-field string
comparison

38
Partitioning Large Data Sets
• Start with Cheap Canopy Clustering
• Finish with Expensive K-Means Clustering

39
Hybrid Clustering

• Get Data into a form you can use


• (Feature extraction & selection)
• Pick Canopy Centers
– Assign Data Points to Canopies
• Pick K-Means Cluster Centers
– K-Means algorithm
– Iterate!

40
Gaussian Mixture Models in 1-D

41
Gaussian Data

42
Parameter Estimation

43
“True” Distribution:

The Estimated Distribution:

44
Estimated Distribution is Based on
“Expectation Maximization”

45
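Expectation Maximization for a 1-D two-component mixture can be sketched in plain NumPy; the data, initial guesses, and iteration count here are hypothetical:

```python
# EM for a 1-D Gaussian mixture with two components.
import numpy as np

rng = np.random.default_rng(1)
# hypothetical data drawn from two well-separated Gaussians
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(8, 1, 200)])

# initial guesses for the means, variances and mixing weights
mu = np.array([1.0, 6.0])
var = np.array([1.0, 1.0])
mix = np.array([0.5, 0.5])

def gauss(x, m, v):
    # Gaussian density, broadcast over both components
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E-step: responsibility of each component for each point
    r = mix * gauss(data[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the weighted data
    n = r.sum(axis=0)
    mu = (r * data[:, None]).sum(axis=0) / n
    var = (r * (data[:, None] - mu) ** 2).sum(axis=0) / n
    mix = n / len(data)
```

After a few dozen iterations the estimated means approach the "true" component means (near 0 and 8) that generated the data.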
Acknowledgements and References
• https://www.youtube.com/watch?v=_aWzGGNrcic
• https://www.youtube.com/watch?v=REypj2sy_5U
• https://www.youtube.com/watch?v=qMTuMa86NzU
• https://www.youtube.com/watch?v=B36fzChfyGU
• https://www.youtube.com/watch?v=jgQhzl3djM8
• https://en.wikipedia.org/wiki/File:Svg-cards-2.0.svg

• McCallum, Andrew, Kamal Nigam, and Lyle H. Ungar.


"Efficient clustering of high-dimensional data sets with
application to reference matching." Proceedings of the sixth
ACM SIGKDD international conference on Knowledge
discovery and data mining. ACM, 2000.

51
Acknowledgements and References

• http://www.fallacyfiles.org/taxonomy.html
• http://www.indiana.edu/~hlw/Meaning/senses.html
• http://bioweb.uwlax.edu/bio203/s2007/barger_rach/
• http://ocw.mit.edu/courses/electrical-engineering-and-
computer-science/6-345-automatic-speech-recognition-
spring-2003/lecture-notes/lecture6new.pdf
• www.cs.cmu.edu/~knigam/15-505/clustering-lecture.ppt
• courses.cs.washington.edu/courses/cse590q/04au/slides
/DannyMcCallumKDD00.ppt

52
IMT4133 – Data Science for Security
and Forensics

Uncovering the Data Structure Via


Unsupervised Learning;
Self Organizing Feature Maps;
Multi-Dimensional Scaling

Andrii Shalaginov, Carl Stuart Leichter


We Are Here, Because…..

• We believe our data has a structure that reflects its


system of origin

• We believe that proper analysis of the data will


reveal the data’s structure

• We believe that the data structure we discover,


will give us useful information about the system of
origin

2
Overview

• Recall mixture model of data


– The structure of the data
– The data’s relation to the system we are studying

• How we can extract information about the system, from


the data
– Analytical Model
• The Math of Extracting (unmixing the mixture model)
– Empirical Methods
• Data analysis to extract information about the data structure
– Unsupervised Learning
– Self Organizing Maps (SOM)
• Self Organizing Feature Maps

3
Analytical vs Empirical
• Analytical
– The true nature of the system under study
– Idealized model (usually mathematical)
– Allegory of the cave: the objects, not their shadows
– We Can Never* Have Direct Knowledge

• Empirical
– What we can actually know about the system under study
– Data
– Our analysis of the data
• Estimates of the true nature
– Always indirect knowledge of the true nature of the system
• Recall limitations of the senses

4
Supervised v Unsupervised
• Supervised Learning Vectors are Labeled
– Explicit preconceptions about data structure
– Costly

• Unsupervised Learning Vectors: Unlabeled


– Are there implicit preconceptions?
• There is at least one
– Lower Cost

Why is labelling costly?


5
5 Reasons for Unsupervised Learning

1. Cost of Labelling
2. Data Mining
3. Dynamic Classes
4. Identify Useful Features
5. Initial Exploratory Data Analysis

6
Types of Unsupervised Learning

• Clustering
• K-mean
• GMM

• Self Organization
• What is the organizational principal?
• Data topology
• Want a topology preserving projection to lower
dimensional space
• Say What?
• Some/all of the data structure is preserved

7
Topology Preserving Projections I

8
http://www.cita.utoronto.ca/~murray/GLG130/Exercises/F2.gif
9
http://www.cita.utoronto.ca/~murray/GLG130/Exercises/F2.gif
Topology Preserving Projections I

10
http://www.cita.utoronto.ca/~murray/GLG130/Exercises/F2.gif
Topology Preserving Projections

11
https://commons.wikimedia.org/wiki/File:Europe_topography_map_en.png
Topology Preserving Projections
• Geographic terrain projections are limited.
– Restricted to 3D -> 2D
• 2D map (isomorphic projection):
– N, S, E, W -> Top, Bottom, Right, Left
• 3D -> 2D map: N, S, E, W, Higher, Lower
– How do we visualize nD -> 2D (n>3) ???
• n = 4, Iris Flower Data Set
– What relationship(s) can we generalize to n-
dimensional spaces?
– What do all these feature spaces have in
common?
• A distance metric!
12
Topology Preserving Projections

• How will the distance metric handle polymorphous data?


– Units of time (different units of time?)
• Sprint performance data: years of age and seconds to finish
– Units of space
• (meters, lightyears)
• Surface area
• Volumetric

– Units of mass (grams, kilograms, tonnes)

– Units of $$$
• NOK
• USD
– Benjamins
13
Topology Preserving Projections

• How will the distance metric handle polymorphous data?

– Explicit Data Standardization (z-Statistics)

– No Data Standardization (Input raw data numbers)


• Units are dropped, but dynamic ranges are preserved.
– 40 years old (range: 20-65)
– 5 years of college (0-8)
– 50000 NOK (0-100000)

– Fuzzification of Data input into Membership Function Values


(Topic for next week)

14
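Explicit standardization with z-statistics is one line of NumPy; the mixed-unit feature matrix below (age in years, education in years, salary in NOK) is hypothetical, echoing the dynamic ranges on the slide:

```python
# z-score standardization of hypothetical mixed-unit features.
import numpy as np

X = np.array([[40.0, 5.0, 50000.0],
              [25.0, 3.0, 30000.0],
              [60.0, 8.0, 90000.0],
              [35.0, 0.0, 40000.0]])

# subtract each column's mean and divide by its standard deviation;
# every feature then has mean 0 and unit variance, so the salary
# column no longer dominates the distance metric
Z = (X - X.mean(axis=0)) / X.std(axis=0)
```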
Topology Preserving Projections

• What level of preservation is required?


– What information can we do without?

• Lossy PCA reduction for classification


– Discarding principal components containing information
– Do all PCs contain information?
• Some components can be pure normal/gaussian noise
– WARNING!
• *DO NOT RECONSTRUCT THE DATA WITH LOSSY PCA *
– Discuss it with me, first (cf PhD thesis: “Eigenspecters”)
• Using Lossy PCA, without data reconstruction, is OK

15
Topology Preserving Projections

16 Raw Data
Lossy PCA Reduction for Classification

17
First PC
Topology Preserving Projections

This PC has all


information required
for Classification

So We Don’t Need This PC


for classification
18
What Type of Simple Classifier Can We Use?

LDA!

19
Un/Supervised Clustering
• Recall k-means
– It is semi-supervised in that we have pre-determined the
number of means (number of clusters)

• Recall G-MM
– Note how the results are affected by the initial estimate for
the number of clusters

20
Un/Supervised Clustering
• Recall k-means and GMM-EM clustering

watch videos

www.youtube.com/watch?v=_aWzGGNrcic

www.youtube.com/watch?v=qMTuMa86NzU

www.youtube.com/watch?v=B36fzChfyGU

21
Un/Supervised Clustering
• Recall k-means
– It is semi-supervised in that we have pre-determined the
number of means (number of clusters)

• Recall G-MM
– Note how the results are affected by the initial estimate for
the number of clusters

• Many Artificial Neural Networks are like doing


statistics with black boxes.

• An SOM is like doing k-means with ANN


– We pick the number of output neurons
– Training the SOM moves the output neurons wrt the data
22
SOM and Topology Preservation

• What is actually preserved?


– Spatial Relationships
• So, we would like a way to take high-dimensional
data and reduce it down to a 2-D map that
preserves the spatial relationships of the higher
dimensions.
• How do we do that?
– Distance (Things nearby are similar)
– Colour (Things with similar colors are similar)
– Location (E, W N S –Right Left Top Bottom)*
*Might not always mean what you think

23
Self Organizing Maps Architecture

Output Neurons

Output Layer

Connection Weights

24
Proximity By Colour and Location
Poverty Map of the World (1997)

25
http://www.cis.hut.fi/research/som-research/worldmap.html
If ML Is Statistics By Other Means,
Why Use ML Instead of Stats?

26
Is Map Orientation Important?
Are the Map Axes Informative?
• Proximity is the most important relation
– Data points that are in the same neighbourhood, have the
closest resemblance to each other

27
Are the Map Axes Informative?
– Data points that are to the left, right, above or below are
indicating their relationship to neighbourhoods that are
further away
• Further Away = data with a less close resemblance

28
29
How Does the SOM Work?
• A competitive learning algorithm.
– The neuron “closest to the input vector” is the winner
• The neuron that most closely resembles a sample input.

• Its weight vector is adjusted to move even closer to the current input
vector xi

– The neurons that are too far away lose out completely
• No weight adjustment for them!

• A cooperative learning algorithm


– But the neurons in the “same neighbourhood” as the winner are
partial winners
• Their weight vectors are adjusted, based on their proximity to
"winning" neurons
• The closer the neighbour is to the winner, the more its weight vector
is adjusted
30
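The competitive and cooperative steps above can be sketched for a 1-D chain of output neurons fitted to 2-D inputs; all sizes, rates, and decay schedules below are hypothetical choices:

```python
# A minimal self-organizing map: competition plus cooperation.
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(0, 1, (200, 2))       # 2-D input vectors

n_out = 10                                # neurons on a 1-D output map
W = rng.uniform(0, 1, (n_out, 2))         # one weight vector per neuron

lr, radius = 0.5, 3.0
for _ in range(1000):
    x = data[rng.integers(len(data))]
    # competition: the neuron closest to the input wins
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    # cooperation: map neighbours of the winner move too,
    # scaled down with their distance on the map
    map_dist = np.abs(np.arange(n_out) - winner)
    h = np.exp(-(map_dist ** 2) / (2 * radius ** 2))
    W += lr * h[:, None] * (x - W)
    # both the step size and the neighbourhood shrink over time
    lr *= 0.999
    radius = max(0.5, radius * 0.999)
```

Because each update moves a weight vector toward a data point, the neurons end up spread across the region the data occupies, with map neighbours close together in the input space.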
How Large is the Neighbourhood?

• How big would you like it?

– It’s a training parameter that can be set

– A parameter that also gets smaller as training


progresses

• Like the ANN weight training step size gets smaller as training
progresses

31
Neighbour Interconnection Topologies

32 http://users.ics.aalto.fi/jhollmen/dippa/node9.html
Neighbourhoods in a Rectangular Map

33
The Hexagonal Neighbourhood

34 http://users.ics.aalto.fi/jhollmen/dippa/node9.html
Image Credits
• https://12095675emilygrant3ddunitx.files.wordpress.com/2013/05/mapprojectio
n5.gif?w=450&h=299
• https://en.wikipedia.org/wiki/Self-organizing_map#/media/File:Somtraining.svg
• https://en.wikipedia.org/wiki/File:Europe_topography_map.png
• http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
• By User:W!B: - http://www.maps-for-free.com/, GFDL,
https://commons.wikimedia.org/w/index.php?curid=5115489

35
Lecture 7: Unsupervised Learning;
Cluster Analysis

IMT4133 – Data Science for Forensics and Security


Andrii Shalaginov, andrii.shalaginov@ntnu.no
NTNU i Gjøvik
Week 16: 21.04.2022
Reading: Kononenko 12; Chio 2
Semester Plan (1)
Week 3 (20.01.2022) Lecture 1: (Kononenko 1,2; Chio 1) Introduction to the team / Data
Analysis / ML methods / Artificial Intelligence / Big Data / Data Analytics problems in Digital
Forensics and Information Security / Computational Forensics

Week 4 (27.01.2022) Tutorial 1: Data Analysis; Learning and Intelligence

Week 5 (03.02.2022) Lecture 2: (Kononenko 3; Chio 2): ML Basics; Hybrid Intelligence;


Performance Evaluation

Week 6 (10.02.2022) Tutorial 2: Machine Learning Basics

Week 7 (17.02.2022) Lecture 3: (Kononenko 4,5) Knowledge Representation; Learning as


Search

Week 8 (24.02.2022) Tutorial 3: Learning as Search; Knowledge Representation

Week 9 (03.03.2022) Lecture 4: (Kononenko 6,7) Attribute Quality Measures; PCA; LDA;
Feature Selection

Semester Plan (2)
Week 10 (10.03.2022) Tutorial 4: Attribute Quality Measures. Data Pre-processing

Week 11 (17.03.2022) Lecture 5: (Kononenko 9,10) Symbolic & Statistical Learning; Visualization

Week 12 (24.03.2022) Tutorial 5: Symbolic and Statistical learning

Week 13 (31.03.2022) Lecture 6: (Kononenko 11*; Chio 2) Artificial Neural Networks; Deep
Learning; Support Vector Machines

Week 14 (07.04.2022) Tutorial 6: Support Vector Machine & Artificial Neural Network

Week 15 Påske/Easter

Week 16 (21.04.2022) Lecture 7: (Kononenko 12; Chio 2) Unsupervised Learning; Cluster


Analysis

Week 17 (28.04.2022) Tutorial 7: Cluster Analysis

Week 18 (05.05.2022) MOCK exam Preparation for the exam; Q & A

Announcements

• 4 assignments – by 27.05.2022 (mandatory part)


• Exam – on 20.05.2022 (digital home exam)

• Make sure to check


https://www.ntnu.edu/studies/courses/IMT4133#tab=omEksamen

Assessment Forms and Academic
Writing
Some motivation before the exam

Assessment in higher education
• Assessment is a central feature of teaching and the
curriculum. It powerfully frames how students learn and what
students achieve.

• It is one of the most significant influences on students’


experience of higher education and all that they gain from it.

• The reason for an explicit focus on improving assessment


practice is the huge impact it has on the quality of learning.

• Students themselves need to develop the capacity to make


judgements about both their own work and that of others in
order to become effective continuing learners and
practitioners.

Boud, David, and Filip Dochy. "Assessment 2020. Seven propositions for assessment reform in higher education." (2010).
Things changed in 2020/2021
• Standard exams (e.g. 3 hours in classrooms) nearly gone
• All assessment forms became digital
• Several assessment forms proved effective:
– Verbal exam
– Final written report
– Essay-based 3-hour home exam
– Portfolio of assignments during the semester
– Case study over a few days
• Many courses with an A/F grading scale moved to Pass/Fail
• Many struggle with digital meetings and online tasks
• Teachers in non-IT fields had to put in enormous effort to adapt
• New digital systems

Unexpected challenges (1)
Including, but not limited to:
• Presenting someone else’s work entirely or partially as own
work from websites, reports, articles, blogs, etc.
• Submitting answers partially or fully from the Internet.
• Delivering pictures, diagrams, tables, figures from other
sources without referencing.
• Unauthorized work between students in the way that it was
not permitted.
• Using illegal advantages to fulfill the exam or assessment
tasks.
• Misusing plagiarism control tools.

• https://i.ntnu.no/wiki/-/wiki/English/Cheating+on+exams

Unexpected challenges (2)

Recommendations: academic writing

• We expect you to deliver your own independent work


• Follow academic writing best practices when delivering a report:
– Abstract
– Introduction
– Background Literature
– Methodology
– Results & Analysis
– Conclusions & Discussions
• Always reference / cite / quote used sources
• There are plenty of tools that can help you with that:
– Google Scholar
– Grammarly
– MS Office / LaTeX

Use case: copied text with/without
quotes (1)

Use case: adopted text without quotes (2)

Solution: read and understand the
rules
Only your own independent work will be graded

Thank you for your attention!
Andrii Shalaginov

andrii.shalaginov@ntnu.no



IMT 4133
Data Science for Security and Forensics

Introduction to Neuro-Fuzzy Methods

Andrii Shalaginov, Carl Stuart Leichter


Neuro Fuzzy Overview
• Neuro-Fuzzy (NF) is a hybrid intelligence / soft computing
– (*Soft?)

• A combination of Artificial Neural Networks (ANN) and Fuzzy


Logic (FL)

• Opposite of fuzzy logic is


– Crisp
– Sharp
– Hard

• ANN are black box statistics, modelled to simulate the activity


of biological neurons

• FL extracts human-explainable linguistic fuzzy rules

• Applications in Decision Support Systems and Expert Systems


2
Fuzzy Basics
• Classical Logic uses only TRUE (1) or FALSE (0) values

• FL is a concept of PARTIAL TRUTH


– Why not partially false?

• FL assigns a truth value from the interval [0,1]


– Membership degree
– Similar to the probability of belonging to a set
• Except cumulative fuzzy set membership is non sum-normal
– Summation of membership values doesn’t equal unity (1)

3
Fuzzy Basics

• FL uses linguistic variables that can contains several


linguistic terms
• Temperature (linguistic variable)
– Hot (linguistic terms)
– Warm
– Cold

• Consistency (linguistic variable)


– Watery (linguistic terms)
– Gooey
– Soft
– Firm
– Hard
– Crunchy
– Crispy

4
5
Variables and Terms
[Figure: linguistic variables are decomposed into linguistic terms, which map to a crisp output]

http://radio.feld.cvut.cz/matlab/toolbox/fuzzy/fuzzyt27.html


Types of Membership Functions

Which One Is the Best?


http://wing.comp.nus.edu.sg/pris/FuzzyLogic/DescriptionDetailed2.html 6
Multiple Triangular MFs

http://sci2s.ugr.es/keel/links.php
7
Multiple Gaussian/Normal MFs

[Figure: overlapping Gaussian MFs; one input can belong to two terms with membership 0.6 each – non sum-normal]

http://cdn.intechopen.com/pdfs-wm/6928.pdf
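The non-sum-normal behaviour illustrated on this slide is easy to check numerically. A minimal sketch in Python; the linguistic terms, centres and widths below are invented for illustration:

```python
import math

def gaussian_mf(x, c, s):
    # Gaussian membership function: exp(-(x - c)^2 / (2 s^2)), always in (0, 1]
    return math.exp(-((x - c) ** 2) / (2 * s ** 2))

# Two overlapping linguistic terms, e.g. "warm" and "hot" (centres/widths invented)
x = 25.0                                   # input midway between the two centres
mu = (gaussian_mf(x, c=20.0, s=5.0),       # membership in "warm"
      gaussian_mf(x, c=30.0, s=5.0))       # membership in "hot"
total = mu[0] + mu[1]                      # roughly 0.61 + 0.61 > 1: non sum-normal
```

Unlike probabilities over a partition, the two memberships sum to more than 1, which is exactly the non-sum-normal property of fuzzy sets.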
Fuzzy Inference
● Fuzzy rules are conditional statements in the form:
IF x is A THEN z is C

● Where 'x is A' is an atom of the fuzzy rule and also


called antecedent
● A – fuzzy set
● x – raw input variable

● 'z is C' – a consequent of the fuzzy rule.


● C – fuzzy set

http://ispac.diet.uniroma1.it/scarpiniti/files/NNs/Less9.pdf 9
10
Fuzzy Inference

● Sharp antecedent: “If the tomato is red, then it is


sweet”

● Fuzzy antecedent:
● “If the tomato is more or less red (μRED = 0.7)”

● Fuzzy consequent(s):
● “The tomato is more or less sweet (μSWEET = 0.64)”
● “The tomato is more or less sour (μSOUR = 0.36)”

Does the sourness have to be (1 − μSWEET)?


No! We aren’t using sum-normal math

http://ispac.diet.uniroma1.it/scarpiniti/files/NNs/Less9.pdf
Fuzzy Reasoning
● There can be more than one atom in one antecedent

● Mamdani-type rules:
● IF x is A AND y is B THEN z is C

● Takagi-Sugeno-type rules:
● IF x is A AND y is B THEN z = f(x,y)
● eg: z = ax + by + c

● IF the tomato is more or less red ….


AND the tomato is more or less firm …
THEN the tomato is ???
Sweet (μSWEET = 0.84)
Sour (μSOUR = 0.26)
Throwable (μToss = 0.76)
http://ispac.diet.uniroma1.it/scarpiniti/files/NNs/Less9.pdf 11
Artificial Neural Network (1)

http://pharmacyebooks.com/2010/10/artifitial-neural-networks-hot-topic-pharmaceutical-research.html
12
Artificial Neural Network (2)

ANN optimization leads to error function minimization:

http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/lguo/ann.html
13
Combining ANN/FL
● ANN black box approach requires sufficient data to find
the structure (generalization learning)
● NO PRIORS required
● Cannot extract linguistically meaningful rules from trained ANN

● Fuzzy rules require prior knowledge


● Based on linguistically meaningful rules

● Combining the two gives us higher level of system


intelligence
● Intelligence(?)

● Can handle the usual ML tasks


● (regression, classification, etc)

http://www.scholarpedia.org/article/Fuzzy_neural_network
14
Combining ANN/FL
● How to resolve these issues
● Can’t extract linguistic rules from ANN
● FL requires manual heuristics for training

● Combine advantages of FL and ANN!

● Several different ways of building such Neuro-Fuzzy


systems

http://www.scholarpedia.org/article/Fuzzy_neural_network 16
Example Hybrid NF 1
(Fuzzy > ANN)
First, the fuzzy inference block receives linguistic statements and processes them.

Second, the fuzzy block output serves as the input to the ANN block

17
Robert Fuller, Neural Fuzzy Systems, 1995
Example Hybrid NF 2
(ANN > Fuzzy)
First, the raw data are delivered to the ANN through input neurons.

Second, neural outputs drive the fuzzy inference block that builds
decision statements.

Robert Fuller, Neural Fuzzy Systems, 1995
18
Cooperative NF

Both FL and ANN can work independently

http://www.scholarpedia.org/article/Fuzzy_neural_network
19
Fuzzy patches (1)
• Fuzzy rules approximate the mapping function
[y=f(x)] of an input value region for X to an output
value region Y

• Despite the fact that such mapping can be done


manually, the fuzzy rules can be determined
automatically by fuzzy-function approximation

• Bart Kosko proposed a framework for adapting fuzzy rules to
data.

– The approach is based on the geometric interpretation


of fuzzy rules as so-called fuzzy patches.

http://kyfranke.com/uploads/Publications/kyfranke-PhD-thesis-2007-Acrobat7.pdf
20
Fuzzy patches (2)
There are two steps in fuzzy rules adaptation:
1. unsupervised learning procedure for a rough
placement of the rule parameters
Cheap!

2. fine-tuning of the roughly placed rules by means


of a supervised learning procedure
Expensive!

What ML1 clustering method does this remind you


of? (From cheap to expensive)

http://kyfranke.com/uploads/Publications/kyfranke-PhD-thesis-2007-Acrobat7.pdf 21
24
The Fuzzy Tomato

[Figure: two linguistic variables – “FLAVOR” with the terms “sweet”, “tart”, “sour”,
and “COLOUR” with the terms “green”, “pink”, “red”]

http://www.sciencedirect.com/science/article/pii/S0196890405003225
● Smaller and more numerous patches will provide
more precision in the estimation of f(x)
● Increasing the crispness of the rules
● But increasing the complexity
● More terms
● More rules
● More training data required for accurate training
26
http://kyfranke.com/uploads/Publications/kyfranke-PhD-thesis-2007-Acrobat7.pdf
Kosko Method (2 Steps)
1 Rough Placement of Patches
● Unsupervised Learning (eg SOM) is applied to data
● Seeking rough approximation of rule parameters
(Rough location of fuzzy patches)

2 Refinement
● Supervised learning (eg gradient descent)

27
http://kyfranke.com/uploads/Publications/kyfranke-PhD-thesis-2007-Acrobat7.pdf
NF general overview

● Input Fuzzification (Data)


● Transformation from numerical input data to linguistic variables

● Rule discovery/evaluation (Model)


● MIN-MAX principle

● Output Defuzzification (Application)


● Deriving crisp quantifiable results based on the rules

http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/sbaa/report.fuzrules.html 28
Fuzzy Inference Principles

● Inference means reasoning that the


system provides (makes a decision)
based on the extracted rules

● Fuzzy inference means mapping from


an input variable space to an output
variable space with help of the fuzzy
rules

● The input and output are associated by


means of fuzzy patches
29
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/sbaa/report.fuzrules.html
Fuzzy MIN-MAX Principle

● MIN-MAX principle:
● MIN: perform AND operation among atoms
to define fuzzy membership degree

● MAX: perform OR operation among atoms


to define fuzzy membership degree

30
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/sbaa/report.fuzrules.html
31
Fuzzy Logic and Math Operations

Fuzzy set A is a collection of pairs (x, μ(x)):  A = { (x, μ(x)) }

Continuous Gaussian MF in linguistic term “i”:

    μ_Ai(x_j) = exp( −(x − c)² / (2s²) )

Fuzzy inference principles (MIN-MAX):

    μ_Ri = min( μ_A1(X_1), μ_A2(X_2) )        Class_1 ∨ Class_2 = max( μ_R1(X), μ_R2(X) )

Fuzzy rules:

    IF x_1 is A ∧ x_2 is B THEN Class Y        R_i: IF x_1 is A_1 ∧ x_2 is A_2 THEN X is Class_X
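The MIN-MAX inference principle can be sketched directly in code. A minimal Python illustration; the rule structure follows the Mamdani form from earlier slides, and the membership degrees are invented for the example:

```python
# MIN-MAX fuzzy inference over two Mamdani-style rules (a sketch):
#   R1: IF x1 is A1 AND x2 is A2 THEN Class1
#   R2: IF x1 is B1 AND x2 is B2 THEN Class2
# Rule firing strength = MIN over its atoms; class support = MAX over rules.

def rule_strength(*atom_memberships):
    return min(atom_memberships)       # AND among atoms -> min

def class_support(*rule_strengths):
    return max(rule_strengths)         # OR among rules -> max

# Invented membership degrees for one input sample
mu_A1, mu_A2 = 0.7, 0.9   # atoms of rule 1
mu_B1, mu_B2 = 0.4, 0.8   # atoms of rule 2

r1 = rule_strength(mu_A1, mu_A2)   # min(0.7, 0.9) = 0.7
r2 = rule_strength(mu_B1, mu_B2)   # min(0.4, 0.8) = 0.4
winner = class_support(r1, r2)     # max(0.7, 0.4) = 0.7 -> the sample favours Class1
```

Because `min`/`max` only select among existing membership values, the result stays interpretable: the winning class is backed by the rule whose weakest atom is strongest.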
Hybrid NeuroFuzzy Networks

Extracted fuzzy rules => input neuron weights

Learning input patterns d_i by optimizing a cost function:

    C = Σ_i (y_i − d_i)²

Weight adjustment using Gradient Descent optimization:

    ∂C/∂ω = ∂(y_i − d_i)²/∂ω = −2 · (y_i − d_i) · ∂d_i/∂ω

Delta learning rule:

    ω_{i+1} = ω_i − α · (y_i − d_i) · x_i

Activation function (sigmoid):

    y_i = 1 / (1 + e^(−Σ ω_i·x_i))

32
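The delta learning rule for a single sigmoid neuron can be sketched as follows. This is a minimal Python illustration: the variable names, data, and learning rate are our own, and the gradient is taken through the sigmoid via the chain rule, which the slide's compact notation leaves implicit.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def delta_step(w, x, target, lr):
    """One gradient-descent update of the weights for a single sigmoid neuron.
    Minimizes the squared error (y - target)^2."""
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    # dC/dw_j = 2 (y - target) * y (1 - y) * x_j   (chain rule through the sigmoid)
    grad = [2 * (y - target) * y * (1 - y) * xj for xj in x]
    return [wi - lr * gj for wi, gj in zip(w, grad)], y

w, x, t = [0.1, -0.2], [1.0, 0.5], 1.0
_, y0 = delta_step(w, x, t, lr=0.0)   # output before any learning: 0.5
for _ in range(500):
    w, y = delta_step(w, x, t, lr=1.0)
# after repeated delta-rule updates, the output y approaches the target
```

Each update moves the weights a small step down the gradient of the squared-error cost, which is exactly the minimization pictured on the slide.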
Fuzzy Inference System

http://cnmat.berkeley.edu/publication/real_time_neuro_fuzzy_systems_adaptive_control_musical_processes 33
2 Rule Fuzzy Inference z = f(x,y)

[Figure: two-rule fuzzy inference – Input 1 and Input 2 are evaluated by Rule 1 and Rule 2, whose outputs combine into a crisp z = f(x,y)]

http://www.cs.princeton.edu/courses/archive/fall07/cos436/HIDDEN/Knapp/fuzzy004.htm 34
Neuro-Fuzzy Architectures

http://www.sciencedirect.com/science/article/pii/S0307904X1200025X
38
Complex Neuro-Fuzzy Architecture

http://scialert.net/fulltext/?doi=jas.2008.309.315 39
Drawbacks

● Linguistic terms have to be defined beforehand


● Impossible to determine proper MF without
preliminary data analysis
● Different MF results in different fuzzy patches
● Optimization of ANN weights for each layer can be
difficult and computationally expensive task
● Especially in case of complex NF model

40
Strengths

● Human-understandable FL
● Human-like reasoning in ANN
● Enables modelling of complex I/O relationships
● results in simple fuzzy rules
● ANN is one of the best statistical approaches
● NF is flexible and easily adjusted by modifying
various parameters.
● ANN is amenable to parallel optimization

41
Thank you for your attention!
A tutorial on Principal Components Analysis

Lindsay I Smith

February 26, 2002


Chapter 1

Introduction

This tutorial is designed to give the reader an understanding of Principal Components


Analysis (PCA). PCA is a useful statistical technique that has found application in
fields such as face recognition and image compression, and is a common technique for
finding patterns in data of high dimension.
Before getting to a description of PCA, this tutorial first introduces mathematical
concepts that will be used in PCA. It covers standard deviation, covariance, eigenvec-
tors and eigenvalues. This background knowledge is meant to make the PCA section
very straightforward, but can be skipped if the concepts are already familiar.
There are examples all the way through this tutorial that are meant to illustrate the
concepts being discussed. If further information is required, the mathematics textbook
“Elementary Linear Algebra 5e” by Howard Anton, Publisher John Wiley & Sons Inc,
ISBN 0-471-85223-6 is a good source of information regarding the mathematical back-
ground.

1
Chapter 2

Background Mathematics

This section will attempt to give some elementary background mathematical skills that
will be required to understand the process of Principal Components Analysis. The
topics are covered independently of each other, and examples given. It is less important
to remember the exact mechanics of a mathematical technique than it is to understand
the reason why such a technique may be used, and what the result of the operation tells
us about our data. Not all of these techniques are used in PCA, but the ones that are not
explicitly required do provide the grounding on which the most important techniques
are based.
I have included a section on Statistics which looks at distribution measurements,
or, how the data is spread out. The other section is on Matrix Algebra and looks at
eigenvectors and eigenvalues, important properties of matrices that are fundamental to
PCA.

2.1 Statistics
The entire subject of statistics is based around the idea that you have this big set of data,
and you want to analyse that set in terms of the relationships between the individual
points in that data set. I am going to look at a few of the measures you can do on a set
of data, and what they tell you about the data itself.

2.1.1 Standard Deviation


To understand standard deviation, we need a data set. Statisticians are usually con-
cerned with taking a sample of a population. To use election polls as an example, the
population is all the people in the country, whereas a sample is a subset of the pop-
ulation that the statisticians measure. The great thing about statistics is that by only
measuring (in this case by doing a phone survey or similar) a sample of the population,
you can work out what is most likely to be the measurement if you used the entire pop-
ulation. In this statistics section, I am going to assume that our data sets are samples

2
of some bigger population. There is a reference later in this section pointing to more
information about samples and populations.



Here’s an example set:

    X = [1 2 4 6 12 15 25 45 68 67 65 98]

I could simply use the symbol X to refer to this entire set of numbers. If I want to
refer to an individual number in this data set, I will use subscripts on the symbol X to
indicate a specific number. Eg. X_3 refers to the 3rd number in X, namely the number
4. Note that X_1 is the first number in the sequence, not X_0 like you may see in some
textbooks. Also, the symbol n will be used to refer to the number of elements in the
set.
There are a number of things that we can calculate about a data set. For example,
we can calculate the mean of the sample. I assume that the reader understands what the

mean of a sample is, and will only give the formula:

    X̄ = ( Σ_{i=1}^{n} X_i ) / n

Notice the symbol X̄ (said “X bar”) to indicate the mean of the set X. All this formula
says is “Add up all the numbers and then divide by how many there are”.
Unfortunately, the mean doesn’t tell us a lot about the data except for a sort of
middle point. For example, these two data sets have exactly the same mean (10), but

    [0 8 12 20]     and     [8 9 11 12]

are obviously quite different:

So what is different about these two sets? It is the spread of the data that is different.
The Standard Deviation (SD) of a data set is a measure of how spread out the data is.
How do we calculate it? The English definition of the SD is: “The average distance

from the mean of the data set to a point”. The way to calculate it is to compute the
squares of the distance from each data point to the mean of the set, add them all up,
divide by (n − 1), and take the positive square root. As a formula:

    s = sqrt( Σ_{i=1}^{n} (X_i − X̄)² / (n − 1) )
Where s is the usual symbol for standard deviation of a sample. I hear you asking “Why
are you using (n − 1) and not n?”. Well, the answer is a bit complicated, but in general,
if your data set is a sample data set, ie. you have taken a subset of the real-world (like
surveying 500 people about the election) then you must use (n − 1) because it turns out
that this gives you an answer that is closer to the standard deviation that would result
if you had used the entire population, than if you’d used n. If, however, you are not
calculating the standard deviation for a sample, but for an entire population, then you
should divide by n instead of (n − 1). For further reading on this topic, the web page
http://mathcentral.uregina.ca/RR/database/RR.09.95/weston2.html describes standard
deviation in a similar way, and also provides an example experiment that shows the

3
Set 1:

    X_i                 (X_i − X̄)    (X_i − X̄)²
    0                   -10           100
    8                   -2            4
    12                  2             4
    20                  10            100
    Total                             208
    Divided by (n−1)                  69.333
    Square Root                       8.3266

Set 2:

    X_i                 (X_i − X̄)    (X_i − X̄)²
    8                   -2            4
    9                   -1            1
    11                  1             1
    12                  2             4
    Total                             10
    Divided by (n−1)                  3.333
    Square Root                       1.8257

Table 2.1: Calculation of standard deviation

difference between each of the denominators. It also discusses the difference between
samples and populations.
So, for our two data sets above, the calculations of standard deviation are in Ta-
ble 2.1.
And so, as expected, the first set has a much larger standard deviation due to the

fact that the data is much more spread out from the mean. Just as another example, the
data set:

    [10 10 10 10]

also has a mean of 10, but its standard deviation is 0, because all the numbers are the
same. None of them deviate from the mean.
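These calculations are easy to check in code. A short Python sketch (the helper names are ours, not the tutorial's) reproduces the numbers in Table 2.1:

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sample_sd(xs):
    # sum of squared deviations, divided by (n - 1), then the square root
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

set1 = [0, 8, 12, 20]
set2 = [8, 9, 11, 12]
# Both sets have mean 10, but very different spreads:
sd1 = sample_sd(set1)   # close to 8.3266, as in Table 2.1
sd2 = sample_sd(set2)   # close to 1.8257, as in Table 2.1
```

The variance of the next section is simply `sample_sd(xs) ** 2`, i.e. the same computation without the final square root.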

2.1.2 Variance
Variance is another measure of the spread of data in a data set. In fact it is almost

identical to the standard deviation. The formula is this:

    s² = Σ_{i=1}^{n} (X_i − X̄)² / (n − 1)

You will notice that this is simply the standard deviation squared, in both the symbol
(s²) and the formula (there is no square root in the formula for variance). s² is the
usual symbol for variance of a sample. Both these measurements are measures of the
spread of the data. Standard deviation is the most common measure, but variance is
also used. The reason why I have introduced variance in addition to standard deviation
is to provide a solid platform from which the next section, covariance, can launch from.

Exercises

Find the mean, standard deviation, and variance for each of these data sets.

• [12 23 34 44 59 70 98]
• [12 15 25 27 32 88 99]
• [15 35 78 82 90 95 97]

2.1.3 Covariance
The last two measures we have looked at are purely 1-dimensional. Data sets like this
could be: heights of all the people in the room, marks for the last COMP101 exam etc.
However many data sets have more than one dimension, and the aim of the statistical
analysis of these data sets is usually to see if there is any relationship between the
dimensions. For example, we might have as our data set both the height of all the
students in a class, and the mark they received for that paper. We could then perform
statistical analysis to see if the height of a student has any effect on their mark.
Standard deviation and variance only operate on 1 dimension, so that you could
only calculate the standard deviation for each dimension of the data set independently
of the other dimensions. However, it is useful to have a similar measure to find out how
much the dimensions vary from the mean with respect to each other.
Covariance is such a measure. Covariance is always measured between 2 dimen-

sions. If you calculate the covariance between one dimension and itself, you get the
variance. So, if you had a 3-dimensional data set (x, y, z), then you could measure the
covariance between the x and y dimensions, the x and z dimensions, and the y and z
dimensions. Measuring the covariance between x and x, or y and y, or z and z would
give you the variance of the x, y and z dimensions respectively.
The formula for covariance is very similar to the formula for variance. The formula

for variance could also be written like this:

    var(X) = Σ_{i=1}^{n} (X_i − X̄)(X_i − X̄) / (n − 1)

where I have simply expanded the square term to show both parts. So given that knowl-
edge, here is the formula for covariance:

    cov(X, Y) = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / (n − 1)

5

Figure 2.1: A plot of the covariance data showing positive relationship between the
number of hours studied against the mark received

It is exactly the same except that in the second set of brackets, the X’s are replaced by
Y’s. This says, in English, “For each data item, multiply the difference between the x
value and the mean of x, by the difference between the y value and the mean of y.
Add all these up, and divide by (n − 1)”.
How does this work? Lets use some example data. Imagine we have gone into the
world and collected some 2-dimensional data, say, we have asked a bunch of students

how many hours in total that they spent studying COSC241, and the mark that they
received. So we have two dimensions, the first is the H dimension, the hours studied,
and the second is the M dimension, the mark received. Table 2.2 holds my imaginary
data, and the calculation of cov(H, M), the covariance between the Hours of study
done and the Mark received.
So what does it tell us? The exact value is not as important as it’s sign (ie. positive
or negative). If the value is positive, as it is here, then that indicates that both di-
mensions increase together, meaning that, in general, as the number of hours of study
increased, so did the final mark.
If the value is negative, then as one dimension increases, the other decreases. If we
had ended up with a negative covariance here, then that would have said the opposite,
that as the number of hours of study increased the the final mark decreased.
In the last case, if the covariance is zero, it indicates that the two dimensions are
independent of each other.
The result that mark given increases as the number of hours studied increases can
be easily seen by drawing a graph of the data, as in Figure 2.1.3. However, the luxury
of being able to visualize data is only available at 2 and 3 dimensions. Since the co-
variance value can be calculated between any 2 dimensions in a data set, this technique

where visualisation is difficult.

You might ask “is cov(X, Y) equal to cov(Y, X)”? Well, a quick look at the for-
mula for covariance tells us that yes, they are exactly the same since the only dif-
ference between cov(X, Y) and cov(Y, X) is that (X_i − X̄)(Y_i − Ȳ) is replaced by
(Y_i − Ȳ)(X_i − X̄). And since multiplication is commutative, which means that it
. And since multiplication is commutative, which means that it
doesn’t matter which way around I multiply two numbers, I always get the same num-
ber, these two equations give the same answer.
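The covariance formula, the sign interpretation, and the symmetry just discussed can all be checked in a few lines of Python. The data below are invented for the demonstration (they are not the study-hours table):

```python
def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    # cov(X, Y) = sum((X_i - mean(X)) * (Y_i - mean(Y))) / (n - 1)
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

hours = [2, 4, 6, 8]          # invented data for illustration
marks = [50, 60, 65, 85]
c = cov(hours, marks)         # positive: the two dimensions increase together
# symmetry: swapping the arguments gives the same value,
# because each term (x - mx)*(y - my) commutes
same = cov(hours, marks) == cov(marks, hours)
```

Only the sign of `c` carries the interpretation here; its magnitude depends on the units of the two dimensions.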

2.1.4 The covariance Matrix


Recall that covariance is always measured between 2 dimensions. If we have a data set

with more than 2 dimensions, there is more than one covariance measurement that can
be calculated. For example, from a 3 dimensional data set (dimensions x, y, z) you
could calculate cov(x, y), cov(x, z), and cov(y, z). In fact, for an n-dimensional data
set, you can calculate n! / ((n − 2)! × 2) different covariance values.

6
              Hours(H)   Mark(M)
    Data      9          39
              15         56
              25         93
              14         61
              10         50
              18         75
              0          32
              16         85
              5          42
              19         70
              16         66
              20         80
    Totals    167        749
    Averages  13.92      62.42

Covariance:

    H     M     (H_i − H̄)   (M_i − M̄)   (H_i − H̄)(M_i − M̄)
    9     39    -4.92        -23.42       115.23
    15    56    1.08         -6.42        -6.93
    25    93    11.08        30.58        338.83
    14    61    0.08         -1.42        -0.11
    10    50    -3.92        -12.42       48.69
    18    75    4.08         12.58        51.33
    0     32    -13.92       -30.42       423.45
    16    85    2.08         22.58        46.97
    5     42    -8.92        -20.42       182.15
    19    70    5.08         7.58         38.51
    16    66    2.08         3.58         7.45
    20    80    6.08         17.58        106.89
                             Total        1149.89
                             Average      104.54

Table 2.2: 2-dimensional data set and covariance calculation

7
A useful way to get all the possible covariance values between all the different
dimensions is to calculate them all and put them in a matrix. I assume in this tutorial


that you are familiar with matrices, and how they can be defined. So, the definition for
the covariance matrix for a set of data with n dimensions is:

    C^(n×n) = ( c_i,j , c_i,j = cov(Dim_i, Dim_j) ),

where C^(n×n) is a matrix with n rows and n columns, and Dim_x is the xth dimension.
All that this ugly looking formula says is that if you have an n-dimensional data set,
then the matrix has n rows and n columns (so is square) and each entry in the matrix is
the result of calculating the covariance between two separate dimensions. Eg. the entry
on row 2, column 3, is the covariance value calculated between the 2nd dimension and
the 3rd dimension.

An example. We’ll make up the covariance matrix for an imaginary 3 dimensional
data set, using the usual dimensions x, y and z. Then, the covariance matrix has 3 rows
and 3 columns, and the values are this:

        ( cov(x, x)  cov(x, y)  cov(x, z) )
    C = ( cov(y, x)  cov(y, y)  cov(y, z) )
        ( cov(z, x)  cov(z, y)  cov(z, z) )

Some points to note: Down the main diagonal, you see that the covariance value is
between one of the dimensions and itself. These are the variances for that dimension.
The other point is that since cov(a, b) = cov(b, a), the matrix is symmetrical about the
main diagonal.
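A covariance matrix with these properties can be produced directly with NumPy's `np.cov` (not part of the original tutorial; the 3-dimensional data below are invented for the demonstration):

```python
import numpy as np

# Three invented dimensions x, y, z; each row holds one dimension's values
data = np.array([
    [1.0, 2.0, 4.0, 6.0],    # x
    [3.0, 3.0, 5.0, 9.0],    # y
    [9.0, 7.0, 4.0, 2.0],    # z
])
C = np.cov(data)             # the 3x3 covariance matrix, with (n - 1) denominators

# The main diagonal holds the variances of the individual dimensions,
# and cov(a, b) = cov(b, a) makes the matrix symmetric about that diagonal.
diag_is_variance = np.allclose(np.diag(C), data.var(axis=1, ddof=1))
symmetric = np.allclose(C, C.T)
```

`np.cov` treats each row as one dimension by default (`rowvar=True`) and uses the sample (n − 1) denominator, matching the formulas above.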

Exercises
Work out the covariance between the x and y dimensions in the following 2 dimen-
sional data set, and describe what the result indicates about the data.

    Item Number:   1    2    3    4    5
    x              10   39   19   23   28
    y              43   13   32   21   20

Calculate the covariance matrix for this 3 dimensional set of data.

    Item Number:   1    2    3
    x              1    -1   4
    y              2    1    3
    z              1    3    -1

2.2 Matrix Algebra


This section serves to provide a background for the matrix algebra required in PCA.
Specifically I will be looking at eigenvectors and eigenvalues of a given matrix. Again,
I assume a basic knowledge of matrices.

8
    ( 2  3 ) ( 1 ) = ( 11 )
    ( 2  1 ) ( 3 )   ( 5 )

    ( 2  3 ) ( 3 ) = ( 12 ) = 4 × ( 3 )
    ( 2  1 ) ( 2 )   ( 8 )        ( 2 )

Figure 2.2: Example of one non-eigenvector and one eigenvector

    2 × ( 3 ) = ( 6 )
        ( 2 )   ( 4 )

    ( 2  3 ) ( 6 ) = ( 24 ) = 4 × ( 6 )
    ( 2  1 ) ( 4 )   ( 16 )       ( 4 )

Figure 2.3: Example of how a scaled eigenvector is still an eigenvector

2.2.1 Eigenvectors
As you know, you can multiply two matrices together, provided they are compatible
sizes. Eigenvectors are a special case of this. Consider the two multiplications between
a matrix and a vector in Figure 2.2.
In the first example, the resulting vector is not an integer multiple of the original

vector, whereas in the second example, the example is exactly 4 times the vector we
began with. Why is this? Well, the vector (3, 2) is a vector in 2 dimensional space. The
vector (3, 2) (from the second example multiplication) represents an arrow pointing
from the origin, (0, 0), to the point (3, 2). The other matrix, the square one, can be
thought of as a transformation matrix. If you multiply this matrix on the left of a
vector, the answer is another vector that is transformed from it’s original position.

It is the nature of the transformation that the eigenvectors arise from. Imagine a
transformation matrix that, when multiplied on the left, reflected vectors in the line
y = x. Then you can see that if there were a vector that lay on the line y = x, its
reflection is itself. This vector (and all multiples of it, because it wouldn’t matter how
long the vector was), would be an eigenvector of that transformation matrix.
What properties do these eigenvectors have? You should first know that eigenvec-

tors can only be found for square matrices. And, not every square matrix has eigen-
vectors. And, given an n × n matrix that does have eigenvectors, there are n of them.
Given a 3 × 3 matrix, there are 3 eigenvectors.
Another property of eigenvectors is that even if I scale the vector by some amount
before I multiply it, I still get the same multiple of it as a result, as in Figure 2.3. This
is because if you scale a vector by some amount, all you are doing is making it longer,
not changing its direction. Lastly, all the eigenvectors of a matrix are perpendicular,
ie. at right angles to each other, no matter how many dimensions you have (strictly
speaking, this is guaranteed for symmetric matrices, which covariance matrices always
are). By the way, another word for perpendicular, in maths talk, is orthogonal. This is
important because it means that you can express the data in terms of these perpendicular
eigenvectors, instead of expressing them in terms of the x and y axes. We will be doing
this later in the section on PCA.
Another important thing to know is that when mathematicians find eigenvectors,
they like to find the eigenvectors whose length is exactly one. This is because, as you
know, the length of a vector doesn’t affect whether it’s an eigenvector or not, whereas
the direction does. So, in order to keep eigenvectors standard, whenever we find an
eigenvector we usually scale it to make it have a length of 1, so that all eigenvectors
have the same length. Here's a demonstration from our example above.

(3, 2) is an eigenvector, and the length of that vector is

    √(3² + 2²) = √13

so we divide the original vector by this much to make it have a length of 1:

    (3, 2) ÷ √13 = (3/√13, 2/√13)

How does one go about finding these mystical eigenvectors? Unfortunately, it's
only easy(ish) if you have a rather small matrix, like no bigger than about 3 × 3. After
that, the usual way to find the eigenvectors is by some complicated iterative method
which is beyond the scope of this tutorial (and this author). If you ever need to find the
eigenvectors of a matrix in a program, just find a maths library that does it all for you.
A useful maths package, called newmat, is available at http://webnz.com/robert/ .
Further information about eigenvectors in general, how to find them, and orthogo-
nality, can be found in the textbook “Elementary Linear Algebra 5e” by Howard Anton,
Publisher John Wiley & Sons Inc, ISBN 0-471-85223-6.
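In a program, finding eigenvectors is indeed one library call. Here is a sketch using NumPy (my choice of library, not one used in this tutorial), applied to the 2 × 2 transformation matrix of the running example; the matrix entries are an assumption inferred from the worked results quoted in the text (it maps (3, 2) to exactly 4 × (3, 2)):

```python
import numpy as np

# The 2 x 2 transformation matrix of the running example (assumed from the
# worked results in the text: it maps (3, 2) to exactly 4 * (3, 2)).
A = np.array([[2.0, 3.0],
              [2.0, 1.0]])

# numpy returns the eigenvalues and unit-length eigenvectors (as columns).
eigenvalues, eigenvectors = np.linalg.eig(A)

# Pick the eigenvalue 4 and its eigenvector (the other eigenvalue is -1).
idx = int(np.argmax(eigenvalues))
v = eigenvectors[:, idx]

# Multiplying by A just scales the eigenvector by its eigenvalue...
assert np.allclose(A @ v, 4.0 * v)
# ...and scaling v first changes nothing, as in Figure 2.3.
assert np.allclose(A @ (2.0 * v), 4.0 * (2.0 * v))

# (3, 2) scaled to length 1 gives the same direction (up to sign).
assert np.allclose(np.abs(v), np.array([3.0, 2.0]) / np.sqrt(13))
```

Note that `eig` already returns unit-length eigenvectors, which is exactly the standardisation described above.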

2.2.2 Eigenvalues
Eigenvalues are closely related to eigenvectors; in fact, we saw an eigenvalue in Figure 2.2.
Notice how, in both those examples, the amount by which the original vector
was scaled after multiplication by the square matrix was the same? In that example,
the value was 4. 4 is the eigenvalue associated with that eigenvector. No matter what
multiple of the eigenvector we took before we multiplied it by the square matrix, we
would always get 4 times the scaled vector as our result (as in Figure 2.3).
So you can see that eigenvectors and eigenvalues always come in pairs. When you
get a fancy programming library to calculate your eigenvectors for you, you usually get
the eigenvalues as well.

Exercises

For the following square matrix:

     3   0   1
    -4   1   2
    -6   0  -2

decide which, if any, of the following vectors are eigenvectors of that matrix, and
give the corresponding eigenvalue:

    (2, 2, -1)    (-1, 0, 2)    (1/2, 0, 1)    (0, 1, 0)    (3, 2, 1)
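A result like this is easy to check numerically. The sketch below (NumPy again) defines a small helper function; the 3 × 3 matrix is the exercise matrix as I have transcribed it, so treat its specific entries as an assumption:

```python
import numpy as np

def eigenvalue_of(matrix, vector):
    """Return the eigenvalue if `vector` is an eigenvector of `matrix`, else None."""
    v = np.asarray(vector, dtype=float)
    result = matrix @ v
    # Work out the scale factor from one non-zero component,
    # then check that every component is scaled the same way.
    nonzero = np.flatnonzero(v)
    scale = result[nonzero[0]] / v[nonzero[0]]
    return scale if np.allclose(result, scale * v) else None

# Exercise matrix (entries transcribed from the exercise; an assumption).
A = np.array([[ 3.0, 0.0,  1.0],
              [-4.0, 1.0,  2.0],
              [-6.0, 0.0, -2.0]])

print(eigenvalue_of(A, [0, 1, 0]))   # an eigenvector, eigenvalue 1.0
print(eigenvalue_of(A, [2, 2, -1]))  # not an eigenvector -> None
```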

Chapter 3

Principal Components Analysis

Finally we come to Principal Components Analysis (PCA). What is it? It is a way


of identifying patterns in data, and expressing the data in such a way as to highlight
their similarities and differences. Since patterns in data can be hard to find in data of
high dimension, where the luxury of graphical representation is not available, PCA is
a powerful tool for analysing data.
The other main advantage of PCA is that once you have found these patterns in the
data, you can compress the data, ie. reduce the number of dimensions, without
much loss of information. This technique is used in image compression, as we will see
in a later section.
This chapter will take you through the steps needed to perform a Principal
Components Analysis on a set of data. I am not going to describe exactly why the
technique works, but I will try to provide an explanation of what is happening at each
point so that you can make informed decisions when you try to use this technique
yourself.

3.1 Method
Step 1: Get some data
In my simple example, I am going to use my own made-up data set. It’s only got 2
dimensions, and the reason why I have chosen this is so that I can provide plots of the
data to show what the PCA analysis is doing at each step.
The data I have used is found in Figure 3.1, along with a plot of that data.

Step 2: Subtract the mean

For PCA to work properly, you have to subtract the mean from each of the data dimensions.
The mean subtracted is the average across each dimension. So, all the x values
have x̄ (the mean of the x values of all the data points) subtracted, and all the y values
have ȳ subtracted from them. This produces a data set whose mean is zero.

x y x y
2.5 2.4 .69 .49
0.5 0.7 -1.31 -1.21
2.2 2.9 .39 .99
1.9 2.2 .09 .29
Data = 3.1 3.0 DataAdjust = 1.29 1.09
2.3 2.7 .49 .79
2 1.6 .19 -.31
1 1.1 -.81 -.81
1.5 1.6 -.31 -.31
1.1 0.9 -.71 -1.01

[Plot: original PCA data]

Figure 3.1: PCA example data, original data on the left, data with the means subtracted
on the right, and a plot of the data

Step 3: Calculate the covariance matrix

This is done in exactly the same way as was discussed in section 2.1.4. Since the data
is 2 dimensional, the covariance matrix will be 2 × 2. There are no surprises here, so I
will just give you the result:

    cov = ( .616555556  .615444444 )
          ( .615444444  .716555556 )

So, since the non-diagonal elements in this covariance matrix are positive, we should
expect that both the x and y variables increase together.
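Steps 2 and 3 can be reproduced in a few lines of NumPy (my tooling choice; the tutorial's own code, in the appendix, is Scilab), using the data set of Figure 3.1:

```python
import numpy as np

# The example data set from Figure 3.1.
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Step 2: subtract the mean of each dimension.
data_adjust = data - data.mean(axis=0)
assert np.allclose(data_adjust.mean(axis=0), 0.0)

# Step 3: the covariance matrix (np.cov treats columns as variables when
# rowvar=False, and divides by n - 1, matching the tutorial).
cov = np.cov(data_adjust, rowvar=False)
print(cov)
# [[0.61655556 0.61544444]
#  [0.61544444 0.71655556]]
```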

Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix

Since the covariance matrix is square, we can calculate the eigenvectors and eigenvalues
for this matrix. These are rather important, as they tell us useful information about
our data. I will show you why soon. In the meantime, here are the eigenvectors and
eigenvalues:

    eigenvalues = ( .0490833989 )
                  ( 1.28402771  )

    eigenvectors = ( -.735178656  -.677873399 )
                   (  .677873399  -.735178656 )

It is important to notice that these eigenvectors are both unit eigenvectors ie. their
lengths are both 1. This is very important for PCA, but luckily, most maths packages,
when asked for eigenvectors, will give you unit eigenvectors.
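The same numbers drop out of a single library call; here is a sketch with NumPy's `eigh`, which is the appropriate routine here because a covariance matrix is always symmetric:

```python
import numpy as np

# Covariance matrix from Step 3.
cov = np.array([[0.616555556, 0.615444444],
                [0.615444444, 0.716555556]])

# eigh returns the eigenvalues in ascending order and the unit
# eigenvectors as the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
print(eigenvalues)   # approximately [0.0490834, 1.28402771]

# Both eigenvectors are unit length, as the text requires; note that a
# library may flip the sign of either eigenvector, which is harmless.
assert np.allclose(np.linalg.norm(eigenvectors, axis=0), 1.0)
```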
So what do they mean? If you look at the plot of the data in Figure 3.2 then you can
see how the data has quite a strong pattern. As expected from the covariance matrix,
the two variables do indeed increase together. On top of the data I have plotted both
the eigenvectors as well. They appear as diagonal dotted lines on the plot. As stated
in the eigenvector section, they are perpendicular to each other. But, more importantly,
they provide us with information about the patterns in the data. See how one of the
eigenvectors goes through the middle of the points, like drawing a line of best fit? That
eigenvector is showing us how these two data sets are related along that line. The
second eigenvector gives us the other, less important, pattern in the data, that all the
points follow the main line, but are off to the side of the main line by some amount.
So, by this process of taking the eigenvectors of the covariance matrix, we have
been able to extract lines that characterise the data. The rest of the steps involve
transforming the data so that it is expressed in terms of those lines.

Step 5: Choosing components and forming a feature vector


Here is where the notion of data compression and reduced dimensionality comes into
it. If you look at the eigenvectors and eigenvalues from the previous section, you

[Plot: mean-adjusted data with the eigenvectors overlayed]

Figure 3.2: A plot of the normalised data (mean subtracted) with the eigenvectors of
the covariance matrix overlayed on top.

will notice that the eigenvalues are quite different values. In fact, it turns out that
the eigenvector with the highest eigenvalue is the principal component of the data set.
In our example, the eigenvector with the largest eigenvalue was the one that pointed
down the middle of the data. It is the most significant relationship between the data
dimensions.
In general, once eigenvectors are found from the covariance matrix, the next step
is to order them by eigenvalue, highest to lowest. This gives you the components in
order of significance. Now, if you like, you can decide to ignore the components of
lesser significance. You do lose some information, but if the eigenvalues are small, you
don't lose much. If you leave out some components, the final data set will have fewer
dimensions than the original. To be precise, if you originally have n dimensions in
your data, and so you calculate n eigenvectors and eigenvalues, and then you choose
only the first p eigenvectors, then the final data set has only p dimensions.
What needs to be done now is you need to form a feature vector, which is just
a fancy name for a matrix of vectors. This is constructed by taking the eigenvectors
that you want to keep from the list of eigenvectors, and forming a matrix with these
eigenvectors in the columns:

    FeatureVector = ( eig1  eig2  eig3  ...  eign )

Given our example set of data, and the fact that we have 2 eigenvectors, we have
two choices. We can either form a feature vector with both of the eigenvectors:

    ( -.677873399  -.735178656 )
    ( -.735178656   .677873399 )

or, we can choose to leave out the smaller, less significant component and only have a
single column:

    ( -.677873399 )
    ( -.735178656 )

We shall see the result of each of these in the next section.
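Sorting by eigenvalue and keeping the first p columns is mechanical; here is a sketch continuing in NumPy (the signs of individual eigenvectors may differ from the printout above, which is harmless):

```python
import numpy as np

cov = np.array([[0.616555556, 0.615444444],
                [0.615444444, 0.716555556]])
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Order the eigenvectors (columns) by eigenvalue, highest first.
order = np.argsort(eigenvalues)[::-1]
feature_vector = eigenvectors[:, order]     # keep both components
feature_vector_1d = feature_vector[:, :1]   # keep only the principal component

print(feature_vector)
```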

Step 6: Deriving the new data set

This is the final step in PCA, and is also the easiest. Once we have chosen the components
(eigenvectors) that we wish to keep in our data and formed a feature vector, we simply
take the transpose of the vector and multiply it on the left of the original data set,
transposed:

    FinalData = RowFeatureVector × RowDataAdjust

where RowFeatureVector is the matrix with the eigenvectors in the columns transposed
so that the eigenvectors are now in the rows, with the most significant eigenvector
at the top, and RowDataAdjust is the mean-adjusted data transposed, ie. the data
items are in each column, with each row holding a separate dimension. I'm sorry if
this sudden transpose of all our data confuses you, but the equations from here on are
easier if we take the transpose of the feature vector and the data first, rather than having
a little T symbol above their names from now on. FinalData is the final data set, with
data items in columns, and dimensions along rows.

What will this give us? It will give us the original data solely in terms of the vectors
we chose. Our original data set had two axes, x and y, so our data was in terms of
them. It is possible to express data in terms of any two axes that you like. If these
axes are perpendicular, then the expression is the most efficient. This was why it was
important that eigenvectors are always perpendicular to each other. We have changed
our data from being in terms of the axes x and y, and now they are in terms of our 2
eigenvectors. In the case of when the new data set has reduced dimensionality, ie. we
have left some of the eigenvectors out, the new data is only in terms of the vectors that
we decided to keep.
To show this on our data, I have done the final transformation with each of the
possible feature vectors. I have taken the transpose of the result in each case to bring
the data back to the nice table-like format. I have also plotted the final points to show
how they relate to the components.
In the case of keeping both eigenvectors for the transformation, we get the data and
the plot found in Figure 3.3. This plot is basically the original data, rotated so that the
eigenvectors are the axes. This is understandable since we have lost no information in
this decomposition.
The other transformation we can make is by taking only the eigenvector with the
largest eigenvalue. The table of data resulting from that is found in Figure 3.4. As
expected, it only has a single dimension. If you compare this data set with the one
resulting from using both eigenvectors, you will notice that this data set is exactly the

first column of the other. So, if you were to plot this data, it would be 1 dimensional,
and would be points on a line in exactly the x positions of the points in the plot in
Figure 3.3. We have effectively thrown away the whole other axis, which is the other
eigenvector.
So what have we done here? Basically we have transformed our data so that it is
expressed in terms of the patterns between them, where the patterns are the lines that
most closely describe the relationships between the data. This is helpful because we
have now classified our data point as a combination of the contributions from each of
those lines. Initially we had the simple x and y axes. This is fine, but the x and y
values of each data point don't really tell us exactly how that point relates to the rest of
the data. Now, the values of the data points tell us exactly where (ie. above/below) the
trend lines the data point sits. In the case of the transformation using both eigenvectors,
we have simply altered the data so that it is in terms of those eigenvectors instead of
the usual axes. But the single-eigenvector decomposition has removed the contribution
due to the smaller eigenvector and left us with data that is only in terms of the other.
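The whole transformation, applied to the example data, is a single matrix product; here is a NumPy sketch, using the feature-vector entries printed earlier in this chapter:

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
data_adjust = data - data.mean(axis=0)

# Feature vector: eigenvectors of the covariance matrix as columns,
# most significant first (signs as in the tutorial's printout).
feature_vector = np.array([[-0.677873399, -0.735178656],
                           [-0.735178656,  0.677873399]])

# FinalData = RowFeatureVector x RowDataAdjust
final_data = feature_vector.T @ data_adjust.T

# Transposed back to the table-like layout used in Figure 3.3:
print(final_data.T[0])   # first row: roughly (-0.82797, -0.17512)
```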

3.1.1 Getting the old data back


Wanting to get the original data back is obviously of great concern if you are using
the PCA transform for data compression (an example of which you will see in the next
section). This content is taken from
http://www.vision.auc.dk/~sig/Teaching/Flerdim/Current/hotelling/hotelling.html

Transformed Data =
         x              y
    -.827970186    -.175115307
     1.77758033     .142857227
    -.992197494     .384374989
    -.274210416     .130417207
    -1.67580142    -.209498461
    -.912949103     .175282444
     .0991094375   -.349824698
     1.14457216     .0464172582
     .438046137     .0177646297
     1.22382056    -.162675287
[Plot: data transformed with 2 eigenvectors]

Figure 3.3: The table of data obtained by applying the PCA analysis using both
eigenvectors, and a plot of the new data points.

Transformed Data (Single eigenvector)
         x

-.827970186
1.77758033
-.992197494
-.274210416
-1.67580142
-.912949103
.0991094375
1.14457216
.438046137
1.22382056

Figure 3.4: The data after transforming using only the most significant eigenvector

So, how do we get the original data back? Before we do that, remember that only if
we took all the eigenvectors in our transformation will we get exactly the original data
back. If we have reduced the number of eigenvectors in the final transformation, then
the retrieved data has lost some information.

Recall that the final transform is this:

    FinalData = RowFeatureVector × RowDataAdjust

which can be turned around so that, to get the original data back,

    RowDataAdjust = RowFeatureVector⁻¹ × FinalData

where RowFeatureVector⁻¹ is the inverse of RowFeatureVector. However, when
we take all the eigenvectors in our feature vector, it turns out that the inverse of our
feature vector is actually equal to the transpose of our feature vector. This is only true
because the elements of the matrix are all the unit eigenvectors of our data set. This
makes the return trip to our data easier, because the equation becomes

    RowDataAdjust = RowFeatureVectorᵀ × FinalData

But, to get the actual original data back, we need to add on the mean of that original
data (remember we subtracted it right at the start). So, for completeness,

    RowOriginalData = (RowFeatureVectorᵀ × FinalData) + OriginalMean

This formula also applies to when you do not have all the eigenvectors in the feature
vector. So even when you leave out some eigenvectors, the above equation still makes
the correct transform.
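These reconstruction formulas can also be checked numerically. In the NumPy sketch below, using all the eigenvectors restores the data exactly (up to rounding), while the single-eigenvector version is lossy:

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
mean = data.mean(axis=0)
data_adjust = data - mean

feature_vector = np.array([[-0.677873399, -0.735178656],
                           [-0.735178656,  0.677873399]])

row_feature = feature_vector.T             # eigenvectors in rows
final_data = row_feature @ data_adjust.T

# Full reconstruction: the transpose undoes the rotation exactly,
# then we add the mean back on.
restored = (row_feature.T @ final_data).T + mean
assert np.allclose(restored, data)

# Lossy reconstruction from the single most significant eigenvector.
row_feature_1 = feature_vector[:, :1].T    # keep only the first row
lossy = (row_feature_1.T @ (row_feature_1 @ data_adjust.T)).T + mean
print(lossy[0])   # an approximation of the original first point (2.5, 2.4)
```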
I will not perform the data re-creation using the complete feature vector, because the
result is exactly the data we started with. However, I will do it with the reduced feature
vector to show you how information has been lost. Figure 3.5 shows this plot. Compare

[Plot: original data restored using only a single eigenvector]

Figure 3.5: The reconstruction from the data that was derived using only a single eigen-
vector

it to the original data plot in Figure 3.1 and you will notice how, while the variation
along the principal eigenvector (see Figure 3.2 for the eigenvector overlayed on top of
the mean-adjusted data) has been kept, the variation along the other component (the
other eigenvector that we left out) has gone.

Exercises

• What do the eigenvectors of the covariance matrix give us?

• At what point in the PCA process can we decide to compress the data? What
  effect does this have?

• For an example of PCA and a graphical representation of the principal eigenvectors,
  research the topic 'Eigenfaces', which uses PCA to do facial recognition
Chapter 4

Application to Computer Vision

This chapter will outline the way that PCA is used in computer vision, first showing
how images are usually represented, and then showing what PCA can allow us to do
with those images. The information in this section regarding facial recognition comes
from “Face Recognition: Eigenface, Elastic Matching, and Neural Nets”, Jun Zhang et
al. Proceedings of the IEEE, Vol. 85, No. 9, September 1997. The representation
information is taken from "Digital Image Processing" Rafael C. Gonzalez and Paul Wintz,
Addison-Wesley Publishing Company, 1987. It is also an excellent reference for further
information on the K-L transform in general. The image compression information is
taken from http://www.vision.auc.dk/~sig/Teaching/Flerdim/Current/hotelling/hotelling.html,
which also provides examples of image reconstruction using a varying amount of eigen-
vectors.

4.1 Representation

When using these sort of matrix techniques in computer vision, we must consider
representation of images. A square, N by N image can be expressed as an N²-dimensional
vector

    X = ( x₁ x₂ x₃ ... x_N² )

where the rows of pixels in the image are placed one after the other to form a one-dimensional
image. E.g. the first N elements (x₁ → x_N) will be the first row of the
image, the next N elements are the next row, and so on. The values in the vector are
the intensity values of the image, possibly a single greyscale value.
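In code this row-by-row flattening is a one-liner. Here is a small sketch with NumPy and a made-up 4 × 4 image (the pixel values are arbitrary stand-ins):

```python
import numpy as np

# A small "image": N = 4, so the vector has N*N = 16 elements.
N = 4
image = np.arange(N * N).reshape(N, N)   # stand-in greyscale values

# Rows placed one after the other -> one N^2-dimensional vector.
x = image.flatten()                      # row-major order, matching the text
assert x.shape == (N * N,)
assert (x[:N] == image[0]).all()         # first N elements = first row
```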

4.2 PCA to find patterns


Say we have 20 images. Each image is N pixels high by N pixels wide. For each
image we can create an image vector as described in the representation section. We
can then put all the images together in one big image-matrix like this:

    ImagesMatrix = ( ImageVec1  )
                   ( ImageVec2  )
                   (    ...     )
                   ( ImageVec20 )
which gives us a starting point for our PCA analysis. Once we have performed PCA,
we have our original data in terms of the eigenvectors we found from the covariance
matrix. Why is this useful? Say we want to do facial recognition, and so our original
images were of peoples faces. Then, the problem is, given a new image, whose face
from the original set is it? (Note that the new image is not one of the 20 we started
with.) The way this is done in computer vision is to measure the difference between
the new image and the original images, but not along the original axes; rather, along
the new axes derived from the PCA analysis.
It turns out that these axes work much better for recognising faces, because the
PCA analysis has given us the original images in terms of the differences and
similarities between them. The PCA analysis has identified the statistical patterns in the
data.
Since all the vectors are N² dimensional, we will get N² eigenvectors. In practice,
we are able to leave out some of the less significant eigenvectors, and the recognition
still performs well.

4.3 PCA for image compression

Using PCA for image compression is also known as the Hotelling, or Karhunen-Loeve
(KL), transform. If we have 20 images, each with N² pixels, we can form N² vectors,
each with 20 dimensions. Each vector consists of all the intensity values from the same
pixel from each picture. This is different from the previous example because before we
had a vector for each image, and each item in that vector was a different pixel, whereas now
we have a vector for each pixel, and each item in the vector is from a different image.
Now we perform the PCA on this set of data. We will get 20 eigenvectors because
each vector is 20-dimensional. To compress the data, we can then choose to transform
the data only using, say, 15 of the eigenvectors. This gives us a final data set with
only 15 dimensions, which has saved us 1/4 of the space. However, when the original
data is reproduced, the images have lost some of the information. This compression
technique is said to be lossy because the decompressed image is not exactly the same
as the original, generally worse.
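The whole compression pipeline can be sketched end to end. In the NumPy sketch below the 20 "images" are random stand-ins, and keeping 15 of the 20 components is the example from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 images of 8x8 = 64 pixels -> 64 vectors of 20 dimensions each,
# one vector per pixel position (toy data, values are random).
images = rng.random((20, 64))
pixel_vectors = images.T                   # shape (64, 20)

mean = pixel_vectors.mean(axis=0)
adjusted = pixel_vectors - mean

# Eigenvectors of the 20x20 covariance matrix, most significant first.
cov = np.cov(adjusted, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
feature_vector = eigenvectors[:, np.argsort(eigenvalues)[::-1]]

# Keep 15 of the 20 components: 1/4 of the storage is saved.
compressed = adjusted @ feature_vector[:, :15]     # shape (64, 15)

# Lossy reconstruction: close to, but not exactly, the original.
restored = compressed @ feature_vector[:, :15].T + mean
print(np.abs(restored - pixel_vectors).max())
```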

Appendix A

Implementation Code

This is code for use in Scilab, a freeware alternative to Matlab. I used this code to
generate all the examples in the text. Apart from the first macro, all the rest were
written by me.

// This macro taken from


// http://www.cs.montana.edu/~harkin/courses/cs530/scilab/macros/cov.sci
// No alterations made

// Return the covariance matrix of the data in x, where each column of x


// is one dimension of an n-dimensional data set. That is, x has n columns
// and m rows, and each row is one sample.
//
// For example, if x is three dimensional and there are 4 samples.
// x = [1 2 3;4 5 6;7 8 9;10 11 12]
// c = cov (x)

function [c]=cov (x)


// Get the size of the array
sizex=size(x);
// Get the mean of each column
meanx = mean (x, "r");
// For each pair of variables, x1, x2, calculate
// sum ((x1 - meanx1)(x2-meanx2))/(m-1)
for var = 1:sizex(2),
x1 = x(:,var);
mx1 = meanx (var);
for ct = var:sizex (2),
x2 = x(:,ct);
mx2 = meanx (ct);
v = ((x1 - mx1)’ * (x2 - mx2))/(sizex(1) - 1);

cv(var,ct) = v;
cv(ct,var) = v;
// do the lower part of c also.
end,
end,
c=cv;

// This a simple wrapper function to get just the eigenvectors


// since the system call returns 3 matrices
function [x]=justeigs (x)
// This just returns the eigenvectors of the matrix

[a, eig, b] = bdiag(x);

x= eig;

// this function makes the transformation to the eigenspace for PCA


// parameters:
// adjusteddata = mean-adjusted data set
// eigenvectors = SORTED eigenvectors (by eigenvalue)
// dimensions = how many eigenvectors you wish to keep
//
// The first two parameters can come from the result of calling
// PCAprepare on your data.
// The last is up to you.

function [finaldata] = PCAtransform(adjusteddata,eigenvectors,dimensions)


finaleigs = eigenvectors(:,1:dimensions);
prefinaldata = finaleigs’*adjusteddata’;
finaldata = prefinaldata’;

// This function does the preparation for PCA analysis


// It adjusts the data to subtract the mean, finds the covariance matrix,
// and finds normal eigenvectors of that covariance matrix.
// It returns 4 matrices
// meanadjust = the mean-adjust data set
// covmat = the covariance matrix of the data
// eigvalues = the eigenvalues of the covariance matrix, IN SORTED ORDER
// normaleigs = the normalised eigenvectors of the covariance matrix,
// IN SORTED ORDER WITH RESPECT TO
// THEIR EIGENVALUES, for selection for the feature vector.

//
// NOTE: This function cannot handle data sets that have any eigenvalues
// equal to zero. It’s got something to do with the way that scilab treats
// the empty matrix and zeros.
//
function [meanadjusted,covmat,sorteigvalues,sortnormaleigs] = PCAprepare (data)
// Calculates the mean adjusted matrix, only for 2 dimensional data
means = mean(data,"r");
meanadjusted = meanadjust(data);
covmat = cov(meanadjusted);
eigvalues = spec(covmat);
normaleigs = justeigs(covmat);
sorteigvalues = sorteigvectors(eigvalues’,eigvalues’);
sortnormaleigs = sorteigvectors(eigvalues’,normaleigs);

// This removes a specified column from a matrix


// A = the matrix
// n = the column number you wish to remove
function [columnremoved] = removecolumn(A,n)
inputsize = size(A);
numcols = inputsize(2);
temp = A(:,1:(n-1));
for var = 1:(numcols - n)
temp(:,(n+var)-1) = A(:,(n+var));
end,
columnremoved = temp;

// This finds the column number that has the


// highest value in its first row.
function [column] = highestvalcolumn(A)
inputsize = size(A);
numcols = inputsize(2);
maxval = A(1,1);
maxcol = 1;
for var = 2:numcols
if A(1,var) > maxval
maxval = A(1,var);
maxcol = var;
end,
end,
column = maxcol

// This sorts a matrix of vectors, based on the values of
// another matrix
//
// values = the list of eigenvalues (1 per column)
// vectors = The list of eigenvectors (1 per column)
//
// NOTE: The values should correspond to the vectors
// so that the value in column x corresponds to the vector
// in column x.
function [sortedvecs] = sorteigvectors(values,vectors)
inputsize = size(values);
numcols = inputsize(2);
highcol = highestvalcolumn(values);
sorted = vectors(:,highcol);
remainvec = removecolumn(vectors,highcol);
remainval = removecolumn(values,highcol);
for var = 2:numcols
highcol = highestvalcolumn(remainval);
sorted(:,var) = remainvec(:,highcol);
remainvec = removecolumn(remainvec,highcol);
remainval = removecolumn(remainval,highcol);
end,
sortedvecs = sorted;

// This takes a set of data, and subtracts


// the column mean from each column.
function [meanadjusted] = meanadjust(Data)
inputsize = size(Data);
numcols = inputsize(2);
means = mean(Data,"r");
tmpmeanadjusted = Data(:,1) - means(:,1);
for var = 2:numcols
tmpmeanadjusted(:,var) = Data(:,var) - means(:,var);
end,
meanadjusted = tmpmeanadjusted

Semester Plan (1)
Week 3 (20.01.2022) Lecture 1: (Kononenko 1,2; Chio 1) Introduction to the team / Data
Analysis / ML methods / Artificial Intelligence / Big Data / Data Analytics problems in Digital
Forensics and Information Security / Computational Forensics

Week 4 (27.01.2022) Tutorial 1: Data Analysis; Learning and Intelligence

Week 5 (03.02.2022) Lecture 2: (Kononenko 3; Chio 2) ML Basics; Hybrid Intelligence;
Performance Evaluation

Week 6 (10.02.2022) Tutorial 2: Machine Learning Basics

Week 7 (17.02.2022) Lecture 3: (Kononenko 4,5) Knowledge Representation; Learning as
Search

Week 8 (24.02.2022) Tutorial 3: Learning as Search; Knowledge Representation

Week 9 (03.03.2022) Lecture 4: (Kononenko 6,7) Attribute Quality Measures; PCA; LDA;
Feature Selection



Semester Plan (2)
Week 10 (10.03.2022) Tutorial 4: Attribute Quality Measures. Data Pre-processing

Week 11 (17.03.2022) Lecture 5: (Kononenko 9,10) Symbolic & Statistical learning; Visualization

Week 12 (24.03.2022) Tutorial 5: Symbolic and Statistical learning

Week 13 (31.03.2022) Lecture 6: (Kononenko 11*; Chio 2) Artificial Neural Networks;
Deep Learning; Support Vector Machines

Week 14 (07.04.2022) Tutorial 6: Support Vector Machine & Artificial Neural Network

Week 15 Påske/Easter

Week 16 (21.04.2022) Lecture 7: (Kononenko 12; Chio 2) Unsupervised Learning; Cluster
Analysis

Week 17 (28.04.2022) Tutorial 7: Cluster Analysis

Week 18 (05.05.2022) Guest lecture / MOCK exam Preparation for the exam; Q & A



Thank you for your attention!
Andrii Shalaginov

Department of Information Security and Communication Technology


Faculty of Information Technology and Electrical Engineering
Norwegian University of Science and Technology
andrii.shalaginov@ntnu.no

Norwegian University of Science and Technology
