Lectures
• Administrative announcements
• Data Science in InfoSec and Forensics
• Computational Forensics
• Machine Learning
• Hard & Soft Computing
Curriculum includes:
• 7 theoretical Lectures (material from the textbook)
• 7 practical Exercise/Tutorials (practical tasks and applications) to
solve tasks given after each lecture
• Exercises are released after each lecture – try to solve them by the following
Thursday, before the Tutorial!
• Control questions for progress
• Q & A sessions before the exam
• MOCK exam (optional – preparation for final exam)
• A guest lecture on selected topic
• http://noracook.io/Books/MachineLearning/machinelearningandsecurity.pdf
Week 9 (03.03.2022) Lecture 4: (Kononenko 6,7) Attribute Quality Measures; PCA; LDA;
Feature Selection
Week 11 (17.03.2022) Lecture 5: (Kononenko 9,10) Symbolic & Statistical learning; Visualization
Week 13 (31.03.2022) Lecture 6: (Kononenko 11*; Chio 2) Artificial Neural Networks; Deep
Learning; Support Vector Machines
Week 14 (07.04.2022) Tutorial 6: Support Vector Machine & Artificial Neural Network
Week 15 Påske/Easter
Week 18 (05.05.2022) Guest lecture / MOCK exam Preparation for the exam; Q & A
– https://innsida.ntnu.no/wiki/-/wiki/English/Reference+groups+-+quality+assurance+of+education
– Establish ongoing dialogue with fellow students
• Meetings?
• Surveys?
• We *Prefer* volunteers ☺
• Email us by 15.02.2022
Ibm.com
… to 5 Vs of Big Data paradigm
http://bigdata.black/
What stops us?
The 42 V's of Big Data and Data Science !!!
https://www.elderresearch.com/blog/42-v-of-big-data
blogspot.com
https://www.anandtech.com/show/10315/market-views-hdd-shipments-down-q1-2016/3
https://www.backblaze.com/blog/hard-drive-benchmark-stats-2016/
Smartphones Storage Trends
https://www.gizmochina.com/2017/06/20/antutu-report-smartphone-pref-052017/
Data generated by devices - trends
https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/mobile-white-paper-c11-520862.html
Digital Forensics,
Computational Forensics
• Computer Crimes
• Fields of DF:
– Malware analysis
– Network Forensics
– Social Network Mining
– Content identification
– etc
http://www.tamingdata.com/2011/01/06/beyond-c-s-i-the-rise-of-computational-forensics/
http://ieeesmc.org/newsletters/back/2009_12/main_article1.html
AI, ML, Data Analytic in real world
https://www.kdnuggets.com/2015/12/top-tweets-dec14-20.html
http://wmbriggs.com/post/24784/
https://www.actuaries.digital/2018/09/05/history-of-ai-winters/
Classical ML:
• Supervised learning
• Unsupervised learning
• Regression / Forecasting
• Rules learning
• Reinforcement learning
Also
• New trends – Deep Learning
• Nature-inspired methods
• Big Data-oriented improvements
SlideShare
www.data-machine.com orthojournal.wordpress.com
Knowledge Representation
2
Attributes Representation
• In a general Machine Learning problem, the attribute value domains can be
characterized by the following properties:
• NB:
– discrete values can be binary: {0,1}
Attributes Representation
4
Knowledge Representation
• Logical Descriptions
– describing data samples themselves
– describing relationships between data samples
– describing relationships between data and outputs
http://people.westminstercollege.edu/faculty/ggagne/fall2014/301/chapters/chapter8/index.html
Logical Order to Attributes
6
http://people.westminstercollege.edu
DIKW Pyramid
8
DIKW Progression
Data Raw Packet Data
Analysis
10
IMT 4133
Data Science for Security and Forensics
• Data as Features
• Feature Space
• Polynomial Curve Fitting
• Model Selection
• Performance Testing
• Curse of Dimensionality
2
https://medium.com/@manveetdn/understanding-machine-learning-as-6-jars-eecfafc77051
Data Everywhere…
https://www.analyticsvidhya.com/blog/2015/12/hilarious-jokes-videos-statistics-data-science/
Analogue vs Digital
https://techdifferences.com/difference-between-analog-and-digital-signal.html
https://www.reddit.com/r/Damnthatsinteresting/comments/jt87tl/bill_gates_showing_how_much_data_a_cdrom_can_hold/
What Does the Data Represent?
• The input attributes are the features
– Length
– Weight
– Duration
– Intensity
– Variation
– Etc
7
Data Types
https://www.etsfl.com/do-you-know-the-types-of-data/
https://medium.com/@manveetdn/understanding-machine-learning-as-6-jars-eecfafc77051
Vs of Big Data
10
Tasks like…analysing 13 TBytes of
viruses
High volume data
Can we find such publicly available data?
12
Kaggle (1)
https://www.kaggle.com/datasets?sizeStart=90%2CGB&sizeEnd=1000%2CGB
Kaggle (2)
https://www.kaggle.com/niveditjain/human-faces-dataset
UC Irvine Machine Learning
Repository (1)
https://archive.ics.uci.edu/ml/index.php
UC Irvine Machine Learning
Repository (2)
https://archive.ics.uci.edu/ml/datasets/
UC Irvine Machine Learning
Repository (3)
Feature Spaces
Wood Classification Example
• Have a big pile of mixed wooden blocks
• Mixture of 3 different kinds of wood
– Ash
– Pine
– Birch
18
Feature Spaces
– 2 Optical Features
• Overall brightness of the wood
• Wood grain prominence (peak to peak variation)
(Figure: feature space of overall brightness vs. grain prominence)
• Note the separate clusters
• If you had an unknown piece of wood, you could
measure its features and then find which class it
belongs in
21
ML Development Data
• “Toy” Data
– Well known, well understood and commonly used data sets:
• https://en.wikipedia.org/wiki/Iris_flower_data_set
– You already know what the results should be:
• Can compare your results with the ones in the literature
• Synthetic Data
– You KNOW the data structure, because you have created it
22
Simple Regression Problem
• Observe Real-valued input variable x
• Use x to predict value of target variable t
23
Polynomial Curve Fitting
• N observations of x
– x = (x1 ,..,xN )
– t = (t1 ,..,tN )
• Data Generation:
– N = 10
– Spaced uniformly in range [0,1]
– Generated from sin(2πx)
– Adding Gaussian noise
– Noise typically due to unobserved variables
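A minimal sketch of this data-generation step in Python (the noise standard deviation 0.3 is an assumption; the slides do not state it):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0.0, 1.0, N)                         # spaced uniformly in [0, 1]
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)  # sin(2*pi*x) plus Gaussian noise

# Fit an M-th order polynomial by least squares (M = 3 chosen only as an example)
M = 3
w = np.polyfit(x, t, M)
print(w)
```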
24
Polynomial Curve Fitting
25
Polynomial Curve Fitting
27
Objective Functions
• Measures a figure of merit to be optimized
– Statistical Measurements
• Variance
• Kurtosis
28
Objective Functions
– Sum of Squares
– Variance
29
Learn by Optimizing the Objective Function
30
Want to find w that minimizes the Sum of Squares Error
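In the usual (Bishop-style) notation, the sum-of-squares error to minimize is:

\[
E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl\{\,y(x_n,\mathbf{w}) - t_n\,\bigr\}^2,
\qquad
y(x,\mathbf{w}) = \sum_{j=0}^{M} w_j\,x^{j}
\]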
Optimizing the Objective Function
31
Partially Differentiating
• Set each partial derivative of the error to zero, ∂E(w)/∂w_i = 0, and solve for w
Optimizing the Objective Function
35
0th Order Polynomial (Constant)
What Happened?
39
Generalized Performance Analysis
• Several separate tests, with M = 0,1,2 …9
• For each test with a different M
– N = 100 (# data points)
40
• Division by N allows different sizes of N to be compared
– Can see how # data points used for training affects performance
(an E vs N graph)
– Can use experiments to find the # data points required for model
complexity M to converge on its minimum error.
• As M increases, so does the N required for the error to converge
41
Training/Testing Data Partition
42
Training Data and Testing Data
Best Performance
46
How Can We Fix the Overfitting Problem?
• N= 10 Data Points
• N= 15 Data Points
49
Effect of Regularizer
M=9 polynomials using regularized error function
No Regularizer
50
Effect of Regularizer
M=9 polynomials using regularized error function
51
Effect of Regularizer
M=9 polynomials using regularized error function
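The regularized error function referred to on these slides (λ is the regularization coefficient):

\[
\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl\{\,y(x_n,\mathbf{w}) - t_n\,\bigr\}^2
+ \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert^{2}
\]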
52
Impact of Regularization on Error
53
Classifier Performance and Evaluation
• Classification is also called "Logistic Regression"
• Regression PLUS some logic (0/1, True/False)
– Within Class/Outside of Class
– Can have several classes (like our wood problem)
– Data sample classification:
• Where in the feature space does the data sample belong?
• Which side of the feature-space boundary does the data sample's feature vector fall on?
54
Classifier Evaluation Metrics:
Confusion Matrix
Confusion Matrix:
Actual class \ Predicted class |  C1                    |  ¬C1
C1                             |  True Positives (TP)   |  False Negatives (FN)
¬C1                            |  False Positives (FP)  |  True Negatives (TN)
• Given m classes
• an entry, CMi,j in a confusion matrix indicates:
– # of tuples in class i that were labeled by the classifier as class j
55
Classifier Evaluation Metrics:
Confusion Matrix
56
Classifier Evaluation Metrics:
Accuracy
A\P  |  C   |  ¬C  |
C    |  TP  |  FN  |  P
¬C   |  FP  |  TN  |  N
     |  P'  |  N'  |  All
• Accuracy = (TP + TN) / All
Classifier Evaluation Metrics:
Sensitivity and Specificity
• Sensitivity: True Positive recognition rate
– Sensitivity = TP/P = TP/(TP + FN)
• Specificity: True Negative recognition rate
– Specificity = TN/N = TN/(TN + FP)
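A minimal sketch of these metrics computed from raw counts (the TP/FN/FP/TN numbers are illustrative only):

```python
TP, FN, FP, TN = 90, 10, 5, 95       # illustrative confusion-matrix counts

P, N = TP + FN, FP + TN              # actual positives and negatives
accuracy    = (TP + TN) / (P + N)
sensitivity = TP / P                 # = TP / (TP + FN), true positive recognition rate
specificity = TN / N                 # = TN / (TN + FP), true negative recognition rate
precision   = TP / (TP + FP)

print(accuracy, sensitivity, specificity, precision)
```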
Class Imbalance Problem:
A\P  |  C   |  ¬C  |
C    |  TP  |  FN  |  P
¬C   |  FP  |  TN  |  N
     |  P'  |  N'  |  All
Classifier Evaluation Metrics:
Precision and Recall
• Precision: exactness – what % of tuples that the classifier labeled as
positive are actually positive: Precision = TP/(TP + FP)
• Recall: completeness – what % of positive tuples the classifier labeled as
positive: Recall = TP/(TP + FN)
Cross-Validation Methods
• Cross-validation (k-fold, where k = 10 is most popular)
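A minimal 10-fold cross-validation sketch (scikit-learn is assumed to be available; the iris data and decision tree are illustrative choices only):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)  # k = 10 folds
print(scores.mean(), scores.std())
```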
63
ROC Curve Explained
• It shows the tradeoff between sensitivity and specificity
(any increase in sensitivity will be accompanied by a
decrease in specificity).
• The closer the curve follows the left-hand border and
then the top border of the ROC space, the more accurate
the test.
64
ROC Curve Explained
65
Measures for Multiclass Classifiers
(Confusion matrix with Actual classes as rows and Predicted classes as columns)
Curse of Dimensionality
67
Thank you for your attention!
Lecture 2: Machine Learning Basics;
Knowledge Representation
• Administrative announcements
• Reference group ☺
• Assignments
• Knowledge Representation
• Machine Learning Basics
(Diagram: two sources s1, s2 mixed into three sensors x1, x2, x3 through coupling weights a11…a23)
A Less Messy Representation
3
What Relationships Can We Find in the Data Space?
• Covariance
• Correlation
• Etc
4
Correlation and Covariance
5
Correlation (Normalized Covariance)
https://www.researchgate.net/figure/259147064_fig19_Figure-8-33-Illustration-of-covariance-and-correlation
6
Covariance Matrix For Variables X, Y
7
Covariance Matrix For Variables X, Y
x
http://www.sharetechnote.com/html/Handbook_EngMath_CovarianceMatrix.html
8
Covariance Matrix For Variables
X, Y, Z
http://gael-varoquaux.info/science/ica_vs_pca.html
9
Covariance Matrix For Variables X, Y, Z
10
Generalized Correlation Matrix
https://www.value-at-risk.net/parameters-of-random-vectors/
11
Generalized Covariance Matrix
https://www.value-at-risk.net/parameters-of-random-vectors/
12
Correlation and Covariance
13
Eigenanalysis of Covariance
(First Step in PCA)
http://www.visiondummy.com/wp-content/uploads/2014/04/eigenvectors_covariance.png
14
Eigenanalysis of Covariance
http://www.visiondummy.com/wp-content/uploads/2014/04/eigenvectors.png
15
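A minimal NumPy sketch of this eigenanalysis step, on synthetic correlated 2-D data (the data themselves are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + 0.2 * rng.normal(size=500)     # y is strongly correlated with x
data = np.vstack([x, y])

C = np.cov(data)                             # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues (ascending) and unit eigenvectors
print(C, eigvals, eigvecs, sep="\n")
```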
Zero Correlation Does NOT Imply
Statistical Independence
http://stats.stackexchange.com/questions/12842/covariance-and-independence
16
Learning as a Search
Andrii Shalaginov, Carl Stuart Leichter, Jayson Mackie
Objectives
• Exhaustive search
• Bounded exhaustive search
• Best-first search
• Greedy search
• Beam search
• Gradient search
• Simulated annealing
• Genetic algorithms
What are search algorithms?
(Based on slides of
Watanabe, 2010)
Exhaustive search
https://en.wikipedia.org/wiki/Depth-first_search
Exhaustive search: IDS
• Iterative deepening search (IDS)
– Set search depth = 1
– Depth First Search for the search depth
– Increase the search depth
• Pros & Cons
+ No cycle-problem, Linear-space complexity
– Increased time-complexity
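A minimal iterative deepening search sketch over a small dictionary-based graph (the graph, start and goal are illustrative assumptions):

```python
def dls(graph, node, goal, depth):
    # depth-limited depth-first search; returns a path or None
    if node == goal:
        return [node]
    if depth == 0:
        return None
    for child in graph.get(node, []):
        path = dls(graph, child, goal, depth - 1)
        if path is not None:
            return [node] + path
    return None

def ids(graph, start, goal, max_depth=10):
    # iterative deepening: repeat DFS with an increasing depth limit
    for depth in range(max_depth + 1):
        path = dls(graph, start, goal, depth)
        if path is not None:
            return path
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": ["F"], "F": []}
print(ids(graph, "A", "F"))     # ['A', 'C', 'E', 'F']
```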
Example of Exhaustive Search
Max Z = 5x1 + 8x2
subject to
x1 + x2 ≤ 6
5x1 + 9x2 ≤ 45
x1, x2 ≥ 0 integer
(Figure: the integer feasible points inside the region bounded by x1 + x2 = 6 and 5x1 + 9x2 = 45)
Example of BFS
Max Z = 5x1 + 8x2
subject to
x1 + x2 ≤ 6
5x1 + 9x2 ≤ 45
x1, x2 ≥ 0 integer
(Figure: the same search space explored level by level – breadth-first numbering of the nodes)
Example of DFS
Max Z = 5x1 + 8x2
subject to
x1 + x2 ≤ 6
5x1 + 9x2 ≤ 45
x1, x2 ≥ 0 integer
(Figure: the same search space explored branch by branch – depth-first numbering of the nodes)
Exhaustive search
• Conclusion
– If the depth of the search is known
Use Depth-first search
– else
Use Iterative deepening search
• Applications
– Crypto attacks
– Dictionary attacks
– Guess passwords
Branch and Bound
• Pros & Cons
+ Faster than exhaustive search
+ In many cases, it can find the global optimal solution
- Not easy to implement
• Applications
– Optimization problems
– Feature selection
Best-first search
26
A failure of the greedy algorithm
27
Beam search
Cost
Bounce
Optimal Cost
Simulated annealing
• Do sometimes accept candidates with higher cost to escape
from local optimum
• Adapt the parameters of this Evaluation Function during
execution
• Exploits an analogy between the annealing process and the
search for the optimum in a more general system.
Simulated annealing
• Annealing Process
– Raising the temperature up to a very high level (melting temperature, for
example), the atoms have a higher energy state and a high possibility to re-
arrange the crystalline structure.
– Cooling down slowly, the atoms have a lower and lower energy state and a
smaller and smaller possibility to re-arrange the crystalline structure.
• Analogy
– Metal Problem
– Energy State Cost Function
– Temperature Control Parameter
– A completely ordered crystalline structure
the optimal solution for the problem
Metropolis Criterion
• Let
– X be the current solution and X’ be the new solution
– C(x) (C(x’))be the energy state (cost) of x (x’)
• Probability Paccept = exp [(C(x)-C(x’))/ T]
• Let N=Random(0,1)
• Unconditional accepted if
– C(x’) < C(x), the new solution is better
• Probably accepted if
– C(x’) >= C(x), the new solution is worse . Accepted only
when N < Paccept
Algorithm
Initialize initial solution x , highest temperature Th, and coolest
temperature Tl
T= Th
When the temperature is higher than Tl
While not in equilibrium
Search for the new solution X’
Accept or reject X’ according to Metropolis Criterion
End
Decrease the temperature T
End
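A minimal Python sketch of the loop above (the cost function, neighbour move, equilibrium length and cooling factor are illustrative assumptions):

```python
import math, random

def cost(x):
    return (x - 3.0) ** 2            # toy objective with its optimum at x = 3

def neighbour(x):
    return x + random.uniform(-0.5, 0.5)

x, T, T_low = 0.0, 10.0, 1e-3        # initial solution, Th and Tl
while T > T_low:                     # while the temperature is higher than Tl
    for _ in range(50):              # "equilibrium": a constant number of loops
        x_new = neighbour(x)
        delta = cost(x) - cost(x_new)
        # Metropolis criterion: always accept improvements, sometimes accept worse
        if delta > 0 or random.random() < math.exp(delta / T):
            x = x_new
    T *= 0.95                        # decrease T by a constant scale factor
print(x, cost(x))
```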
Control Parameters
• Definition of equilibrium
– Cannot yield any significant improvement after certain number
of loops
– A constant number of loops
• Annealing schedule (i.e. How to reduce the temperature)
– A constant value, T’ = T - Td
– A constant scale factor, T’= T * Rd
• A scale factor usually can achieve better performance
Control Parameters
• Temperature determination
– Artificial, without physical significance
– Initial temperature
• 80-90% acceptance rate
– Final temperature
• A constant value, i.e., based on the total number of
solutions searched
• No improvement during the entire Metropolis loop
• Acceptance rate falling below a given (small) value
– Problem specific and may need to be tuned
Simulated annealing
• Pros & Cons
+ Allow to escape from local optimums
+ Easy to implement
- No guarantee of the global optimum
Genetic algorithms
• Cellular automata
– John Holland, University of Michigan, 1975.
• Until the early 80s, the concept was studied theoretically.
• In 80s, the first “real world” GAs were designed.
(Based on slides of
Popovic, 2001)
Branch-and-Bound Technique
for Solving Integer Programs
http://www.ohio.edu/people/melkonia/math4630/slides/bb1.ppt.
Basic Concepts
(ii) discarding the subset if the bound indicates that the subset can’t
contain an optimal solution
(Figure: feasible region of the LP relaxation with the constraint line 5x1 + 9x2 = 45 and objective contours Z = 20 and Z = 41.25)
Example of Branch-and-Bound
Max Z = 5x1 + 8x2
s.t. x1 + x2 ≤ 6
5x1 + 9x2 ≤ 45
x1, x2 ≥ 0 integer
(Figure: the LP relaxation attains its optimum at (2.25, 3.75) with Z = 41.25;
the objective contours have slope m = 5/8)
Utilizing the information about the optimal
solution of the LP-relaxation
We have relaxed the constraint that x1 and x2 must be integers, i.e., we are not
restricting ourselves to the integer program (IP).
Utilizing the information about the optimal
solution of the LP-relaxation
Subproblem 1: Max Z = 5x1 + 8x2, s.t. x1 + x2 ≤ 6, 5x1 + 9x2 ≤ 45, x2 ≤ 3, x1, x2 ≥ 0
Subproblem 2: Max Z = 5x1 + 8x2, s.t. x1 + x2 ≤ 6, 5x1 + 9x2 ≤ 45, x2 ≥ 4, x1, x2 ≥ 0
(note that the original optimal solution (2.25, 3.75) can’t recur)
Branching step (graphically)
Subproblem 1: Opt. solution (3, 3) with value 39
Subproblem 2: Opt. solution (1.8, 4) with value 41
(Figure: the feasible regions of Subproblems 1 and 2 after branching on x2)
Start Creating a Solution Tree
All: (2.25, 3.75), Z = 41.25
  S1: x2 ≤ 3 → (3, 3), Z = 39 (integral solution)
  S2: x2 ≥ 4 → (1.8, 4), Z = 41
Recall why the upper bound is 41: it is the value of the LP-relaxation optimum of S2.
Next Branch Subproblem 2 on x1:
Subproblem 3: New restriction is x1 ≤ 1.
  Opt. solution (1, 4.44) with value 40.55
Subproblem 4: New restriction is x1 ≥ 2.
  The subproblem is infeasible
(Figure: feasible regions of Subproblems 3 and 4)
Solution tree (cont.)
All: (2.25, 3.75), Z = 41.25
  S1: x2 ≤ 3 → (3, 3), Z = 39 (int.)
  S2: x2 ≥ 4 → (1.8, 4), Z = 41
    S3: x1 ≤ 1 → (1, 4.44), Z = 40.55
    S4: x1 ≥ 2 → infeasible
Branch Subproblem 3 on x2:
Subproblem 5: New restriction is x2 ≤ 4.
  Feasible region: the segment joining (0, 4) and (1, 4)
  Opt. solution (1, 4)
Next branching step (graphically)
Subproblem 6: New restriction is x2 ≥ 5.
  Feasible region is just one point: (0, 5)
Final Solution Tree
All: (2.25, 3.75), Z = 41.25
  S1: x2 ≤ 3 → (3, 3), Z = 39 (int.)
  S2: x2 ≥ 4 → (1.8, 4), Z = 41
    S3: x1 ≤ 1 → (1, 4.44), Z = 40.55
      S5: x2 ≤ 4 → (1, 4), Z = 37 (int.)
      S6: x2 ≥ 5 → (0, 5), Z = 40 (int.)
    S4: x1 ≥ 2 → infeasible
• In our case, Subproblem 6 has an integral optimal solution, and its value
40 > 39 = Z*. Thus, (0, 5) is the new incumbent x*, and the new Z* = 40.
enginfo.ut.ac.ir/keramati/Course%20pages/.../Meta-%20Heuristic%20Algorithms.ppt
Genetic Algorithm Is Not...
Gene coding...
Speaking of Genetic Coding…
http://www.digit.in/general/scientists-create-the-first-living-
organism-with-synthetic-alien-dna-20809.html
Genetic Algorithm Is...
… a computer algorithm
that rests on principles of genetics
and evolution
• Cellular automata
– John Holland, University of Michigan, 1975.
• Until the early 80s, the concept was studied
theoretically.
• In 80s, the first “real world” GAs were designed.
• Hill climbing
(Figure: a single climber can get stuck on a local peak instead of the global one)
• Multi-climbers
• Genetic algorithm
(Figure: climbers exchange information – "I am at the top", "I am not at the top,
my height is better, I will continue" – and together approach the global optimum)
GA Concept
• Genetic algorithm (GA) introduces the principle of evolution
and genetics into search among possible solutions
to given problem.
Chromosome String
Gene Character
Genotype Population
Perform crossover
Perform mutation
no
Stop?
yes
The End
Designing GA Requires Answers to:
– string encoding: 1 0 1 1 1 0 0 1
– or tree encoding (genetic programming), e.g. an expression tree for (a xor b) > c
Crossover
• Crossover is concept from genetics.
• Crossover is sexual reproduction.
• Crossover combines genetic material from
two parents,
in order to produce superior offspring.
• Few types of crossover:
– One-point
– Multiple point.
http://findwallpapershd.com/rabbit/tiger-rabbit-wallpaper/
One-point Crossover
(Figure: Parent #1 = 0 1 2 3 4 5 6 7 and Parent #2 = 7 6 5 4 3 2 1 0; after the
crossover point the tails are swapped to produce the offspring)
Mutation
• Mutation introduces randomness into the
population.
• Mutation is asexual reproduction.
• The idea of mutation
is to reintroduce divergence
into a converging population.
• Mutation is performed
on small part of population,
in order to avoid entering unstable state.
Mutation...
Parent: 1 1 0 1 0 0 0 1
Child:  0 1 0 1 0 1 0 1
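A minimal sketch of these two operators on binary chromosomes (the crossover point and mutation probability are illustrative assumptions):

```python
import random

def one_point_crossover(p1, p2):
    point = random.randint(1, len(p1) - 1)          # crossover point
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(chrom, p_mut=0.05):
    # flip each bit independently with a small probability
    return [bit ^ 1 if random.random() < p_mut else bit for bit in chrom]

parent1 = [1, 1, 0, 1, 0, 0, 0, 1]
parent2 = [0, 1, 0, 1, 0, 1, 0, 1]
child1, child2 = one_point_crossover(parent1, parent2)
print(mutate(child1), mutate(child2))
```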
Setting the Probabilities...
(Figure: next generation selected by fitness, e.g. 0.93, 0.72, 0.64)
Selection – Some Weak Solutions Survive
(Figure: previous vs. next generation – a weak solution with fitness 0.12 still survives)
Stopping Criteria
• Final problem is to decide when to stop execution of algorithm.
• There are two possible solutions to this problem:
– First approach:
• Stop after production
of definite number of generations
– Second approach:
• Stop when the improvement in average fitness
over two generations is below a threshold
GA Vs. Ad-hoc Algorithms
Genetic Algorithm Ad-hoc Algorithms
* Not necessary!
……but….
Advantages of GAs
Genome:
(X, y, z, p) =
GA:An Example - Diophantine Equations(2)
(Using easy to track #s)
( 1, 2, 3, 4 ) Crossover ( 1, 6, 3, 4 )
( 5, 6, 7, 8 ) ( 5, 2, 7, 8 )
Mutation
( 1, 2, 3, 4 ) ( 1, 2, 3, 9 )
Diophantine Equations(3)
• The first generation is usually randomly generated from numbers
lower than the sum (s).
Optimization
• Administrative announcements
• Learning as a Search
• Genetic Algorithm
• Branch and Bound
• DFS / BFS
ML vs AI
• Machine Learning (ML) is good at learning specific tasks.
• AI uses various components of machine learning to make decisions and
automate control in some smart/data-driven way.
• ML is really just sort of a toolbox.
(Figure: ML shown as a component inside AI)
Machine Learning
Warnings
4
Machine Learning Warnings
5
Things ML is good at
6
Things ML is good at
7
Machine Learning Warnings
8
Machine Learning Warnings
9
Machine Learning Warnings
10
Machine Learning Warnings
11
How Machine
Learning Works
(using an example)
12
(Basically) How Machine Learning
Works
• We feed data to an algorithm.
• This algorithm constructs (trains) a model.
• If we can test the model, we do so.
• We apply the model.
TRAINING
DATA ALGORITHM
14
Our Example
Good/Bad
MODEL MPG?
15
Our Data
16
The Model (what we want)
17
Training a Model
18
What is training?
19
Applying the Model!
20
A note about classifying or
predicting
• Usually, we are considering whether samples are
positive or negative.
– Think in terms of medical practice. You may test positive
for some condition.
• In our case, a sample is positive if it has (or we
predict it has) good gas mileage.
21
False Positives
(Figure: pictures of a "fancy new police car" fed into a trained "Salmon or Cod?"
machine learning model – which class will it output? ????)
Machine Learning Basics Summary
25
Machine Learning Basics Summary
26
Machine Learning Basics Summary
27
Machine Learning Basics Summary
28
We have only spoken about
classification so far!
• There are 3 Main Functions:
– Classification – for this new sample, which category does
it belong to?
– Regression/Prediction – Can I use my data to make
guesses about unseen data?
– Clustering – what are the groupings in my data?
• And how can it help me explore my data?
• There are more functionalities that can be
executed with machine learning as well.
29
Conclusion
30
Conclusion: What about [INSERT
BUZZWORD HERE]
https://www.cnet.com/news/ai-is-very-stupid-says-google-ai-leader-compared-to-humans/
Thanks for your attention!
• Questions?
32
Written By
Presenter:
Hai Nguyen
Norwegian Information Security Laboratory
Gjøvik University College, Norway
Part II:
• Application of Feature Selection for Intrusion Detection
Motivation from an application
Feature selection problem
Feature selection methods
Search strategies
Challenges
Raw traffic:… .0101100101001001010010100101001010101010… .
Feature extraction
Observations:
F1    F2   F3   F4   F5   Class
1000  100  100  0    0    XSS
1000  10   20   0    0    SQL-inject
1000  1    1    0    1    Buffer-overf
1000  1    1    100  100  LDAP-inject
1000  1    1    0    0    Normal
What is feature selection ?
• The process of removing irrelevant and redundant features from the data for improving
the performance of predictor (classifier).
(Diagram: All features → Multiple feature subsets → Predictor)
The best feature subset is the subset with which the Predictor gives the best result.
Filter model: fast, BUT…
Wrapper model: slow, …
1
Motivation:
• Web attack detection
• Feature selection for Web attack detection
Generic Feature Selection (GeFS) measure
• The CFS and the mRMR measures
• Optimizing the GeFS measure
• New feature-selection method: Opt-GeFS
Experimental results
• CSIC 2010 HTTP dataset
• ECML-PKDD 2007 HTTP dataset
Conclusions
2
Web attack detection – The process of identifying activities which try to compromise
the confidentiality, integrity or availability of Web applications
How can we detect web attacks?
Feature extraction
Observations:
F1    F2   F3   F4   F5   …  Class
1000  100  100  0    0    …  XSS
1000  10   20   0    0    …  SQL-inject
1000  1    1    0    1    …  Buffer-overf
1000  1    1    100  100  …  LDAP-inject
1000  1    1    0    0    …  Normal
3
Relevance - Not all features are relevant for detecting attacks:
Feature F1 is irrelevant for detecting attacks
Redundancy - Not all relevant features are necessary for detecting attacks:
• Feature F5 is redundant for detecting LDAP-injection attack, since feature F4 is enough
F1    F2   F3   F4   F5   …  Class
1000  100  100  0    0    …  XSS
1000  10   20   0    0    …  SQL-injection
1000  1    1    0    1    …  Buffer-overflow
1000  1    1    100  100  …  LDAP-injection
1000  1    1    0    0    …  Normal
5
Mutual information measures the mutual dependence (non-linear relationship) of
two random variables:
6
M. Hall proposed correlation feature-selection (CFS) measure:
Given a feature subset S with k features, there is a score:
where the score is built from the average class–feature correlation and the
average feature–feature correlation (see below)
Feature selection by means of the CFS measure: choose the subset with the highest score
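Hall's CFS merit of a subset S with k features is usually written as (\overline{r_{cf}}: mean class–feature correlation, \overline{r_{ff}}: mean feature–feature correlation):

\[
\mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}
\]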
In 2005, Peng et. al. proposed a feature selection method using mutual information.
In terms of mutual information, the relevance of a feature set S for the class c is
defined as follows:
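In the standard mRMR notation, the relevance D and the redundancy R it is balanced against are:

\[
D(S,c) = \frac{1}{|S|}\sum_{x_i \in S} I(x_i;c),
\qquad
R(S) = \frac{1}{|S|^2}\sum_{x_i, x_j \in S} I(x_i;x_j),
\qquad
\max_S \bigl[\,D(S,c) - R(S)\,\bigr]
\]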
Definition 1: A generic-feature-selection measure for intrusion detection is a
function GeFS(x), which has the following form:
Proposition: The CFS and mRMR measures are instances of the GeFS measure.
9
Exhaustive search: Heuristic search:
10
Proposition: The feature selection problem:
11
Chang’s method for solving P01FP: Our method for solving P01FP:
The number of variables & constraints: The number of variables & constraints:
12
Chang’s method: Our method:
Proposition 1: A polynomial mixed 0-1 Proposition 4: A polynomial mixed 0-1
term from (7) can be represented term from (12) can be
by the following program [9]: represented by the following program:
13
Step 0: Analyze statistical properties of datasets before
choosing GeFS_CFS OR GeFS_mRMR.
M01LP
Step 3: Transform the optimization problem of GeFS to
a mixed 0-1 linear programming (M01LP) problem,
which can be solved by the branch-and-bound algorithm.
Branch & Bound
Opt-GeFS
14
Objective: Apply the generic-feature-selection (GeFS) measure for Web attack detection.
DARPA Benchmarking dataset for Intrusion Detection Systems:
• Out of date: 1998
• Does not include many actual Web attacks
15
Dataset description:
• Traffic targeted to a real-world web application: E-commerce web application.
• 36000 normal requests.
• 25000 anomalous requests (SQL injection, buffer overflow, XSS, etc.)
16
Number of selected features (on average)
(Bar chart: Full-set 30; GeFS_CFS and GeFS_mRMR reduce this to 14 and 11 on average)
(Bar chart: detection accuracy (%) for C4.5, CART, RandomTree and RandomForest on
Full-set, GeFS_CFS and GeFS_mRMR; labeled values include 93.65, 93.53 and 75.67)
(Bar chart: a further per-classifier comparison of Full-set, GeFS_CFS and GeFS_mRMR
for C4.5, CART, RandomTree and RandomForest; labeled values include 28, 6.9 and 7.1)
Number of selected features (on average)
(Bar chart: Full-set 30; GeFS_CFS and GeFS_mRMR reduce this to 6 and 2 on average)
(Bar chart: detection accuracy (%) for C4.5, CART, RandomTree and RandomForest on
Full-set, GeFS_CFS and GeFS_mRMR; labeled values include 97.04, 92.93 and 86.42)
(Bar chart: a further per-classifier comparison of Full-set, GeFS_CFS and GeFS_mRMR
for C4.5, CART, RandomTree and RandomForest; labeled values include 17.6, 7.8 and 2.95)
Questions?
25
The Nature of Data Itself and
Our Models of the World
1
Data Acquisition and Feature Extraction
2
We Perceive a Narrow Slice of Reality
3
Projection of Real World onto Sensory
Space
• In our minds, we build a model of the real world to account
for our experiences (data in our sensory feature space).
4
Projection of Real World onto a Sensor
Space
• Is the origin of all Data Spaces
• Scientific models are built to account for the data;
they make testable predictions/estimates of the Real
World
• Rigorously tested models lead to useful applications
• Research is the process of exploring the boundaries
of our models:
– Extend the model’s boundaries
– Replace the model completely.
• “Paradigm Shift” Thomas Kuhn “The Structure of Scientific
Revolutions”
5
How is Any of This Relevant?!
6
Linear Mixture Models
7
Linear Mixture Models
(Diagram: one source s1 observed by a single sensor x1)
(Diagram: the same source s1 observed by two sensors x1 and x2)
(Diagram: source s1 coupled to sensors x1 and x2 through coupling weights a11 and a12)
Coupling Weights: a11 > a12
Many of The Data we Collect Come From
Linear Mixtures
“A linear mixed model (LMM) is a parametric linear model
for clustered, longitudinal, or repeated-measures data that
quantifies the relationships between a continuous
dependent variable and various predictor variables.”
11
(Diagram: two sources s1, s2 mixed into three sensors x1, x2, x3 through coupling weights a11…a23)
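In matrix form (following the diagram's weight naming, a sketch of the mixing model As = x):

\[
\mathbf{x} = A\,\mathbf{s},\qquad
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \\ a_{13} & a_{23} \end{pmatrix}
\begin{pmatrix} s_1 \\ s_2 \end{pmatrix}
\]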
A Less Messy Representation
14
What Relationships Can We Find in the Data Space?
• Covariance
• Correlation
• Etc
What Relationships Can We Find in the Feature Space?
• Covariance
• Correlation
• Etc
15
A Vector Space Perspective
of Projection
Vector a is projected
onto Vector b
16
Projections, Inner Products and
Statistics
17
This “dot product” between a and b is
also called the “inner product” and
can be evaluated as abT
18
From the perspective of linear algebra, completely uncorrelated vectors, signals,
or time-series data streams are orthogonal to each other: their correlation
coefficient resolves to zero, as does their inner product.
19
Inner Product, Magnitude, Variance, Etc
20
Inner Product and Covariance
• Inner Product:
• Covariance:
21
Inner Product, Magnitude and Variance
• Inner Product:
Magnitude:
• Variance:
22
Standard Deviation and Magnitude
23
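For zero-mean vectors a and b of length N, the quantities on these slides relate as (using the 1/N convention; some texts use 1/(N−1)):

\[
\mathbf{a}\cdot\mathbf{b} = \mathbf{a}\mathbf{b}^{T} = \sum_{i=1}^{N} a_i b_i,\qquad
\lVert\mathbf{a}\rVert = \sqrt{\mathbf{a}\cdot\mathbf{a}},\qquad
\operatorname{var}(\mathbf{a}) = \frac{1}{N}\,\mathbf{a}\cdot\mathbf{a},\qquad
\operatorname{cov}(\mathbf{a},\mathbf{b}) = \frac{1}{N}\,\mathbf{a}\cdot\mathbf{b},\qquad
\rho_{ab} = \frac{\operatorname{cov}(\mathbf{a},\mathbf{b})}{\sigma_a \sigma_b}
\]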
Correlation and Covariance
25
Correlation and Covariance
26
Correlation
27
Correlation
https://www.aplustopper.com/correlation/
28
PCA
• Principal component analysis (PCA)
– Data Covariance Analysis
– PCA is a way to reduce data dimensionality
– PCA projects high dimensional data to a lower dimension
– Retains most of the sample's variation.
– Useful for the compression and classification of data.
– Auxiliary variables, called “principal components” are uncorrelated
– Ordered in descending variance
NB: The Principal Components are NOT the Original Sources (s) From the
Mixture Model As = x !!
29
PCA
https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c
30
PCA
• Principal Component Analysis (PCA) extracts the most
important information. This in turn leads to compression
since the less important information are discarded. With
fewer data points to consider, it becomes simpler to
describe and analyze the dataset.
https://devopedia.org/principal-component-analysis
31
PCA
https://devopedia.org/principal-component-analysis
32
PCA
• We can describe the shape of a fish with two variables: height and width.
However, these two variables are not independent of each other. In fact, they
have a strong correlation. Given the height, we can probably estimate the
width; and vice versa. Thus, we may say that the shape of a fish can be
described with a single component.
• This doesn't mean that we simply ignore either height or width. Instead, we
transform our two original variables into two orthogonal (independent)
components that give a complete alternative description. The first component
(blue line) will explain most of the variation in the data. The second component
(dotted line) will explain the remaining variation. Note that both components
are derived from both height and width.
https://devopedia.org/principal-component-analysis
33
PCA
https://devopedia.org/principal-component-analysis
34
PCA: advantages
• PCA minimizes information loss even when fewer principal components are
considered for analysis. This is because each principal component is along a
direction that maximizes variation, that is, the spread of data. More importantly,
the components themselves need not be identified a priori: they are identified
by PCA from the dataset. Thus, PCA is an adaptive data analysis technique. In
other words, PCA is an unsupervised learning method.
https://devopedia.org/principal-component-analysis
35
PCA: drawbacks
• PCA works only if the observed variables are linearly correlated. If
there's no correlation, PCA will fail to capture adequate variance with
fewer components.
• PCA is lossy. Information is lost when we discard insignificant
components.
• Scaling of variables can yield different results. Hence, scaling that you
use should be documented. Scaling should not be adjusted to match
prior knowledge of data.
• Since each principal component is a linear combination of the original
features, visualizations are not easy to interpret or relate to the original
features.
36
Eigen decomposition (analysis)
https://guzintamath.com/textsavvy/2019/02/02/eigenvalue-decomposition/
37
PCA: original data
38
PCA: data - zero mean
39
PCA: covariance matrix
• Since the data is 2 dimensional, the covariance
matrix will be 2x2.
40
PCA: eigen decomposition
• It is important to notice that these eigenvectors are both unit
eigenvectors, i.e. their lengths are both 1. This is very important
for PCA, but luckily most maths packages, when asked for
eigenvectors, will give you unit eigenvectors.
41
PCA: final components
42
PCA: new dataset
• After selecting n eigenvalues:
43
PCA: new dataset
44
PCA: new dataset
45
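A minimal NumPy sketch of the PCA steps walked through above (the toy data and the choice n = 1 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.8], [0.8, 0.6]])  # correlated 2-D data

X_centered = X - X.mean(axis=0)               # step 1: subtract the mean (zero mean)
C = np.cov(X_centered, rowvar=False)          # step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)          # step 3: eigen decomposition (unit eigenvectors)
order = np.argsort(eigvals)[::-1]             # order components by descending variance
eigvecs = eigvecs[:, order]

n = 1                                         # step 4: keep n principal components
X_new = X_centered @ eigvecs[:, :n]           # step 5: project to get the new dataset
print(eigvals[order], X_new.shape)
```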
Weka example
46
PCA Produces Orthogonal Basis
64
Matrix A is a Linear Transform of Source
Space Vectors into the Sensor/Data Space
65
Machine Learning Builds Useful Models
• GIGO
• “good” data/features are a pre-requisite for machine
learning to build a useful model.
66
Image Credits
• https://commons.wikimedia.org/wiki/File:Goodmans_Axiette_101_a.png
• https://commons.wikimedia.org/wiki/File:Us664a_microphone.jpg
• https://commons.wikimedia.org/wiki/File:Animal_hearing_frequency_range.svg
• https://en.wikipedia.org/wiki/Dipole_antenna#/media/File:Half_–_Wave_Dip
• https://en.wikipedia.org/wiki/File:Plato_-_Allegory_of_the_Cave.png
• https://en.wikipedia.org/wiki/Optical_illusion#/media/File:Optical-illusion-
checkerboard-twisted-cord.svg
• https://en.wikipedia.org/wiki/Ren%C3%A9_Descartes#/media/File:Cartesian_coordin
ates_2D.svg
• https://upload.wikimedia.org/wikipedia/commons/4/41/Kevin_Mitnick_2008.jpeg
• https://en.wikipedia.org/wiki/Computer#/media/File:Dell_PowerEdge_Servers.jpg
• https://upload.wikimedia.org/wikipedia/commons/2/2c/G5_supplying_Wikipedia_via_
Gigabit_at_the_Lange_Nacht_der_Wissenschaften_2006_in_Dresden.JPG
• https://en.wikipedia.org/wiki/Computer#/media/File:Acer_Aspire_8920_Gemstone.jpg
• http://mathinsight.org/dot_product
72
Thank you for your attention!
73
Lecture 4: Data Pre-processing;
Attribute Quality Measures
• Feature Selection
• Principal Component Analysis
• Intrusion Detection
• Malware Detection
http://www.nlpca.org/pca_principal_component_analysis.html
Thank you for your attention!
Andrii Shalaginov
Department of Information Security and Communication
Technology
Faculty of Information Technology and Electrical Engineering
Norwegian University of Science and Technology
andrii.shalaginov@ntnu.no
general – Φ functions
Linear Discriminant Functions
Linear discriminant function:
g(x) = w^T x + w_0
maps the feature space into a real number which can be viewed as a distance
from the decision boundary
This is how the decision boundary is obtained:
g(x) = 0
Source: Duda&Hart&Stork
Two-Category Case
J(w) = Σ_{x ∈ M} (−w^T x)   (M: the set of misclassified samples)
Linear Regression: An Example
Solution in a figure
Source: Wikipedia
Nearest Neighbors
c(x) = argmax_{c ∈ {C_1, …, C_m}} Σ_{i=1}^{k} δ(c, c_i)
c(x): target value
{C_1, …, C_m}: set of possible classes
δ(a, b) = 1 if a = b, 0 if a ≠ b
Source: Wikipedia
k-Nearest Neighbors
Regression
mean target value from the k nearest neighbour examples:
c(x) = (1/k) Σ_{i=1}^{k} c_i
Training set
Neighbors are taken from a set of objects for which
the correct classification (value of property) is known
Although no training phase!
k-Nearest Neighbors
How to identify neighbors?
k-Nearest Neighbors
How to identify neighbors?
Objects = position vectors in a multidimensional
feature space
Euclidean distance
Distance of two attribute values is equal to their absolute difference:
d(v_{i,j}, v_{i,l}) = |v_{i,j} − v_{i,l}|
D(t_j, t_l) = sqrt( Σ_{i=1}^{a} d(v_{i,j}, v_{i,l})^2 )
Manhattan distance
...
Weighted k-Nearest Neighbors
Deal with drawback of k-Nearest Neighbors
Classes with more frequent examples
tend to dominate the prediction of new
sample
Prediction of the class of a new sample is also based on
distances from the neighbors – weights
Impact of distance – linear, polynomial,
exponential, ... function = kernel function
Classification
c(x) = argmax_{c ∈ {C_1, …, C_m}} Σ_{i=1}^{k} δ(c, c_i) / D(t_x, t_i)^2
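A minimal scikit-learn sketch (assumed available); weights="distance" weighs neighbours by inverse distance, a close analogue of the 1/D² weighting above:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                     # illustrative data set
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X, y)
print(knn.predict(X[:3]))
```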
Thank you!
IMT4612
Gjøvik University College
Kjell Tore Fossbakk, Katrin Franke
Symbolic Learning
Topics
Decision Trees
Decision Rules
Association Rules
Regression Trees
Decision Trees: The tree
Nodes
Attributes
Connections/Edges
Values
Leaves/terminal
nodes
Class labels
Decision Trees: Paths
Overfitting
Removing irrelevant nodes
Post-pruning
Decision Rules: Made from decision trees
Consequent
Right hand side of an association rule
Support
How often the rule's items occur in the data (used together with the confidence)
Supervised Learning
Classify HTML pages as spam/not spam
Pattern recognition (handwriting)
Speech recognition
Conclusion
1
Outline
• What is ANN and where do we apply
• What are different variants
• Basic building block (Perceptron)
• What is multi-layer perceptron and why we
need that.
3
Artificial Neural Network
• Models of the brain and nervous system
• Highly parallel
– Process information much more like the brain
than a serial computer
• Learning
• Very simple principles
• Very complex behaviours
• Applications
– As powerful problem solvers
– As biological models
4
Types of ANN
Mainly ANN can be classified to following
types
• According to Topology
• According to Learning Rule
• According to Activation function
• According to Application
6
Based on Topology
• ANN without layers
• Two Layered FeedForward ANN
• Multi-layered FeedForward ANN
• Bi-Directional Two layered ANN
• Picture of a Multi-layered
Feed forward ANN
7
Based On learning Rule
• Hebbian Learning Rule
• Delta Learning Rule ( Back propagation
learning for Multi-layered Perceptron)
• Competitive learning
• Forgetting
8
Tapson, Jonathan, et al. "Synthesis of neural networks for spatio‐temporal spike pattern
recognition and processing."
Perceptron
Adds a Bias Term
Can Learn Linearly Separable Feature Spaces
Θ
OR & AND Decision Boundaries
XOR Decision Boundaries
Solution-Multi-layer
Perceptron
• Designed to Handle "non-linear" classification
problem.
• Achieved by introducing one or more hidden
layers between input layer and final output
layer.
• Activation functions are generally Sigmoid or
Gaussian, giving an output ranging from 0 to 1.
13
Sophisticated Activation Functions
• Threshold based activation function outputs
are not compatible with sophisticated weight
correction methods, like gradient descent.
• Activation function should be continuous and
differentiable in nature in order to use
gradient descent for correcting weights
• Solutions are as follows:
11
Sigmoid Activation Function
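The logistic sigmoid and its derivative (the derivative is what makes gradient-based weight correction possible):

\[
\sigma(x) = \frac{1}{1 + e^{-x}},\qquad
\frac{d\sigma}{dx} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
\]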
Error Back‐Propagation Through Activation!
MultiLayer Perceptron
• An interconnected network of single perceptrons.
• Consists of multiple layers of perceptrons/neurons/nodes.
• Input layer nodes are feature components.
• Output layer nodes are the desired classification labels/regression outputs.
• Each node gets input from all or some of the nodes of the earlier layer
14
Feed-forward Nets
Information is distributed
18
Alternative to Gradient Descent
• Genetic Algorithms
19
Network Topologies
Take Home Message
• Use Neural Network when you have enough
samples.
• Training is very time consuming.
20
Thank you for your attention!
21
Support Vector Machine
2
Linearly Separable
Binary Classification
• L training points.
• Each training point consists of a D-dimensional vector.
• Training data is of the form
{x_i, y_i} where i = 1, 2, 3, …, L, y_i ∈ {−1, +1} and x_i ∈ R^D
3
Pictorial Motivation
• So many possible hyperplanes – which one to choose?
4
Best Solution
5
How do we get it!
• Solving an
Optimization Problem
(Convex Optimization
Precisely)
• Let's formulate the problem mathematically
• This hyperplane can be described by w · x + b = 0
The Margin
• x is my Data ( Feature
Vector)
• b is a Bias
• w is the weight vector
Why Max. Margin?
• According to Statistical Learning Theory
Max. generalization can be achieved by Max.
margin.
• We need to define distance/metric in the
feature space.
• We implicitly fix a scale
• How???
Margin
• We introduce
canonical hyper
plane for both
classes.
• x· w + b = +1
• x · w + b = −1
Margin
• Let us take two
arbitrary points from both
class examples.
• The distance between
them X1- X2. (Red Line)
• The margin/distance
between X1 and X2 can
be obtained by
projecting it on the
vector normal to the
hyperplane.(Green Line)
Margin
• x1 · w + b = +1
• x2 · w + b = -1 On
Subtraction Gives
• w · ( x1-x2)=2, here w is
the green line.
• Canonical hyperplanes
in yellow
• Data points on Yellow
line are Support
vectors
Margin
• Since w · (x1 − x2) = 2, the distance between the two canonical hyperplanes
along the unit normal w/||w|| is 2/||w||
• So each sample on a canonical hyperplane is at distance 1/||w|| from the
separating hyperplane (see dotted black line)
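Putting the pieces together, the hard-margin optimization problem in its standard form:

\[
\min_{\mathbf{w},\,b}\ \frac{1}{2}\lVert\mathbf{w}\rVert^{2}
\quad\text{s.t.}\quad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,\ i = 1,\dots,L,
\qquad\text{margin} = \frac{2}{\lVert\mathbf{w}\rVert}
\]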
http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
Easy Linear Separation
http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
What Hyperplane Works Here?
http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
The Kernel Trick!
Different Kinds of Building Blocks
• Basis Functions
• Building blocks that span a vector space (can be a feature space)
• Can be a transformation (time to frequency)
• Not used to change dimensionality
• Kernel Functions
• Building blocks to create new representation in a different feature space
• Always a transformation
• Used to go from a lower dimension to a higher dimension
• RKHS
• The Kernel Trick (Appearing at an SVM near you….)
Kernel Evolution
• In 1992, Bernhard Boser, Isabelle Guyon and
Vapnik suggested a way to create nonlinear
classifiers by applying the kernel trick.
• Largely inspired by Aizerman et al.
• M. Aizerman, E. Braverman, and L. Rozonoer
(1964). "Theoretical foundations of the
potential function method in pattern
recognition learning". Automation and
Remote Control 25: 821–837.
Why Kernel SVM?
• What if our feature per class is non-linear by
nature.
• Linear SVM cannot handle such cases.
• The solution is to use Kernel trick
- Main idea is to represent the original
feature vector in a higher dimensional
space
- Features in higher dimensional space
gets linearly separable.
Transformation Function
• Map Data from input space to high
dimensional feature space where they are
linearly separable.
Examples of Kernel
Function
• Gaussian Kernel
• Polynomial Kernel
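In their usual form (σ, c and d are kernel parameters):

\[
K_{\text{Gauss}}(\mathbf{x},\mathbf{x}') = \exp\!\Bigl(-\tfrac{\lVert\mathbf{x}-\mathbf{x}'\rVert^{2}}{2\sigma^{2}}\Bigr),
\qquad
K_{\text{poly}}(\mathbf{x},\mathbf{x}') = (\mathbf{x}\cdot\mathbf{x}' + c)^{d}
\]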
Tapson, Jonathan, et al. "Synthesis of neural networks for spatio‐
temporal spike pattern recognition and processing." arXiv preprint
arXiv:1304.7118 (2013).
https://upload.wikimedia.org/wikipedia/commons/2/20/ASR‐
9_Radar_Antenna.jpg
http://ieeebooks.blogspot.no/2011/02/lessons‐in‐electric‐circuits‐
volume‐ii_4086.html
https://en.wikipedia.org/wiki/Air_traffic_control_radar_beacon_system#
/media/File:ASR‐9_Radar_Antenna.jpg
• https://en.wikiversity.org/wiki/Learning_and_neural_networks
• http://www.ece.utep.edu/research/webfuzzy/docs/kk‐thesis/kk‐thesis‐html/node18.html
• http://www.cs.bham.ac.uk/~jxb/NN/l3.pdf
• http://www.cs.stir.ac.uk/courses/ITNP4B/lectures/kms/2‐Perceptrons.pdf
• http://www.math.washington.edu/~palmieri/Courses/2008/Math326/pictures.php
• https://i.imgur.com/Jl4gIBl.jpg
• https://www.youtube.com/watch?v=3liCbRZPrZA
• https://www.youtube.com/watch?v=9wijQD8DPc4
• https://www.youtube.com/watch?v=UFnjV1E615I
• https://www.youtube.com/watch?v=NmhbQ‐ag2z0
Thank you for your attention!
39
Lecture 6: Artificial Neural Networks;
Support Vector Machines
Lecture 7
Clustering
1
Types of Clustering
• Hierarchical
– Taxonomies
– Organizational Charts
• Partition
– Feature Space Regions
2
Clustering Methods
• K-Means Clustering
• Gaussian Mixture Models
• Canopy Clustering +
• Vector Quantization
3
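A minimal k-means sketch (scikit-learn assumed; synthetic blob data and k = 3 are illustrative choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)     # one centre per cluster
print(km.labels_[:10])         # cluster assignment of the first samples
```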
Essentials of Clustering
• Similarities
– Natural Associations
– Proximate*
• Differences
– Distant*
4
Clustering considerations
• What does it mean for objects to be similar?
• What algorithm and approach do we take?
– k-means
– hierarchical
• Bottom up agglomerative clustering (HAC)
• Top Down divisive
• Do we need a hierarchical arrangement of clusters?
• How many clusters?
• Can we label or name the clusters?
• How do we make it efficient and scalable?
– Canopy Clustering
5
Hierarchical Clustering
vertebrate invertebrate
6
Clustering: Corpus browsing
www.yahoo.com/Science
… (30)
6 5
0.2
4
3 4
0.15 2
5
0.1 2
0.05
1
3 1
0
1 3 2 5 4 6
1 2 3
4
10
Hierarchical Clustering
• Assumes a similarity function for determining the similarity of two entities
in the clustering (nodes or other clusters).
• Two main types of hierarchical clustering
11
Agglomerative versus Divisive
• Data set {a, b, c, d, e}
(Figure: agglomerative clustering merges a+b → ab and d+e → de, then c+de → cde,
and finally ab+cde → abcde; divisive clustering performs the same steps in
reverse, from Step 0 back to Step 4)
Strengths of Hierarchical Clustering
• No assumptions on the number of clusters
– Any desired number of clusters can be obtained by ‘cutting’ the
dendrogram at the proper level
13
Agglomerative
14
Intermediate State
• After some merging steps, we have some clusters
(Figure: current clusters C1, C2, C3, C4, C5)
Intermediate State
• Merge the two closest clusters (C2 and C5)
(Figure: clusters C1–C5, with C2 and C5 being the two closest)
After Merging
(Figure: clusters C1, C3, C4 and the merged cluster C2 ∪ C5)
17
Means of Clustering
• K-means
• GMM
– Expectation Maximization (EM)
• Canopy
• Hard+Sharp/Soft+Fuzzy
– Hard = No Cluster Overlap
– Soft = Some Cluster Overlap
18
Essentials of Clustering
19
Evaluating Clusters
• What does it mean to say that a cluster is “good”?
20
What are we optimizing to get good
clusters?
• Given: Final number of clusters
• Optimize:
– “Tightness” of clusters
• {average/min/max/} distance of points to each other in the same
cluster
• {average/min/max} distance of points to each cluster’s center
21
The Distance Metric
• How the similarity of two elements in a
set is determined, e.g.
– Euclidean Distance
– Inner Product Space (*)
– Manhattan Distance
– Maximum Norm
– Mahalanobis Distance
– Hamming Distance
– Or any metric you define over the
space…
22
Manhattan Distance
https://www.quora.com/What-is-the-difference-between-Manhattan-and-Euclidean-distance-measures
23
Mahalanobis Distance
http://www.jennessent.com/arcview/mahalanobis_description.htm
24
Mahalanobis Distance
http://stats.stackexchange.com/questions/62092/bottom-to-top-explanation-of-the-mahalanobis-distance
25
Partitional Clustering
• Then…
28
But!
• The complexity is pretty high:
– k · n · i · O(d)
k = # clusters
n = # data points
i = # iterations
O(d) = computational complexity of the distance metric
(Motivation for Canopy Clustering)
29
Canopy Clustering for Big Data
30
What Does This Remind You Of?
• Nations
–States (Regions)
• Cities
–Postal Codes
» Street Addresses
31
Canopy Clustering
• Use “cheap” method in order to create some number of
overlapping subsets, called canopies.
32
Creating Canopies
• Define two thresholds
– Tight: T2
– Loose: T1
33
Single Canopy
https://www.codeboy.me/2014/11/02/datamine-canopy/
34
Single Canopy
35
Single Canopy
36
Multiple Canopies
37
Canopy Clustering
38
Partitioning Large Data Sets
• Start with Cheap Canopy Clustering
• Finish with Expensive K-Means Clustering
39
Hybrid Clustering
40
Gaussian Mixture Models in 1-D
41
Gaussian Data
42
Parameter Estimation
43
“True” Distribution:
44
Estimated Distribution is Based on
“Expectation Maximization”
45
Acknowledgements and References
• https://www.youtube.com/watch?v=_aWzGGNrcic
• https://www.youtube.com/watch?v=REypj2sy_5U
• https://www.youtube.com/watch?v=qMTuMa86NzU
• https://www.youtube.com/watch?v=B36fzChfyGU
• https://www.youtube.com/watch?v=jgQhzl3djM8
• https://en.wikipedia.org/wiki/File:Svg-cards-2.0.svg
51
Acknowledgements and References
• http://www.fallacyfiles.org/taxonomy.html
• http://www.indiana.edu/~hlw/Meaning/senses.html
• http://bioweb.uwlax.edu/bio203/s2007/barger_rach/
• http://ocw.mit.edu/courses/electrical-engineering-and-
computer-science/6-345-automatic-speech-recognition-
spring-2003/lecture-notes/lecture6new.pdf
• www.cs.cmu.edu/~knigam/15-505/clustering-lecture.ppt
• courses.cs.washington.edu/courses/cse590q/04au/slides
/DannyMcCallumKDD00.ppt
52
IMT4133 – Data Science for Security
and Forensics
2
Overview
3
Analytical vs Empirical
• Analytical
– The true nature of the system under study
– Idealized model (usually mathematical)
– Allegory of the cave: the objects, not their shadows
– We Can Never* Have Direct Knowledge
• Empirical
– What we can actually know about the system under study
– Data
– Our analysis of the data
• Estimates of the true nature
– Always indirect knowledge of the true nature of the system
• Recall limitations of the senses
4
Supervised v Unsupervised
• Supervised Learning Vectors are Labeled
– Explicit preconceptions about data structure
– Costly
1. Cost of Labelling
2. Data Mining
3. Dynamic Classes
4. Identify Useful Features
5. Initial Exploratory Data Analysis
6
Types of Unsupervised Learning
• Clustering
• K-mean
• GMM
• Self Organization
• What is the organizational principal?
• Data topology
• Want a topology preserving projection to lower
dimensional space
• Say What?
• Some/all of the data structure is preserved
7
Topology Preserving Projections I
8
http://www.cita.utoronto.ca/~murray/GLG130/Exercises/F2.gif
9
http://www.cita.utoronto.ca/~murray/GLG130/Exercises/F2.gif
Topology Preserving Projections I
10
http://www.cita.utoronto.ca/~murray/GLG130/Exercises/F2.gif
Topology Preserving Projections
11
https://commons.wikimedia.org/wiki/File:Europe_topography_map_en.png
Topology Preserving Projections
• Geographic terrain projections are limited.
– Restricted to 3D -> 2D
• 2D map (isomorphic projection):
– N, S, E W -> Top, Bottom, Right, Left
• 3D ->2D map: N, S, E, W, Higher, Lower
– Units of $$$
• NOK
• USD
– Benjamins
13
Topology Preserving Projections
14
Topology Preserving Projections
15
Topology Preserving Projections
16 Raw Data
Lossy PCA Reduction for Classification
17
First PC
Topology Preserving Projections
LDA!
19
Un/Supervised Clustering
• Recall k-means
– It is semi-supervised in that we have pre-determined the
number of means (number of clusters)
• Recall G-MM
– Note how the results are affected by the initial estimate for
the number of clusters
20
Un/Supervised Clustering
• Recall k-means and GMM-EM clustering
watch videos
www.youtube.com/watch?v=_aWzGGNrcic
www.youtube.com/watch?v=qMTuMa86NzU
www.youtube.com/watch?v=B36fzChfyGU
21
Un/Supervised Clustering
• Recall k-means
– It is semi-supervised in that we have pre-determined the
number of means (number of clusters)
• Recall G-MM
– Note how the results are affected by the initial estimate for
the number of clusters
23
Self Organizing Maps Architecture
Output Neurons
Output Layer
Connection Weights
24
Proximity By Colour and Location
Poverty Map of the World (1997)
25
http://www.cis.hut.fi/research/som-research/worldmap.html
If ML Is Statistics By Other Means,
Why Use ML Instead of Stats?
26
Is Map Orientation Important?
Are the Map Axes Informative?
• Proximity is the most important relation
– Data points that are in the same neighbourhood, have the
closest resemblance to each other
27
Are the Map Axes Informative?
– Data points that are to the left, right, above or below are
indicating their relationship to neighbourhoods that are
further away
• Further Away = data with a less close resemblance
28
29
How Does the SOM Work?
• A competitive learning algorithm.
– The neuron “closest to the input vector” is the winner
• The neuron that most closely resembles a sample input.
• Its weight vector is adjusted to move even closer to the current input
vector xi
– The neurons that are too far away lose out completely
• No weight adjustment for them!
• Like the ANN weight training step size gets smaller as training
progresses
31
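A minimal sketch of one SOM training step as described above (map size, learning rate and neighbourhood radius are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
grid = 10
W = rng.random((grid, grid, 3))              # 10x10 map of 3-D weight vectors

def train_step(x, W, lr=0.1, radius=2.0):
    d = np.linalg.norm(W - x, axis=2)        # distance of every neuron's weights to x
    wi, wj = np.unravel_index(np.argmin(d), d.shape)   # the winning neuron
    ii, jj = np.indices((grid, grid))
    h = np.exp(-((ii - wi) ** 2 + (jj - wj) ** 2) / (2 * radius ** 2))
    W += lr * h[:, :, None] * (x - W)        # winner moves most, far neurons barely move
    return W

for _ in range(1000):
    W = train_step(rng.random(3), W)
```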
Neighbour Interconnection Topologies
32 http://users.ics.aalto.fi/jhollmen/dippa/node9.html
Neighbourhoods in a Rectangular Map
33
The Hexagonal Neighbourhood
34 http://users.ics.aalto.fi/jhollmen/dippa/node9.html
Image Credits
https://12095675emilygrant3ddunitx.files.wordpress.com/2013/05/mapprojection5.gif?w=450&h=299
• https://en.wikipedia.org/wiki/Self-organizing_map#/media/File:Somtraining.svg
• https://en.wikipedia.org/wiki/File:Europe_topography_map.png
• http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
• By User:W!B: - http://www.maps-for-free.com/, GFDL,
https://commons.wikimedia.org/w/index.php?curid=5115489
35
Lecture 7: Unsupervised Learning;
Cluster Analysis
Boud, David, and Filip Dochy. "Assessment 2020. Seven propositions for assessment reform in higher education." (2010).
Things changed in 2020/2021
• Standard exams (e.g. 3 hours in a classroom) are nearly gone
• All assessment forms became digital
• The main forms that proved effective:
– Oral exam
– Final written report
– Essay-based 3-hour home exam
– Portfolio of assignments during the semester
– Case study lasting a few days
• In many courses the A–F grading scale has been replaced with Pass/Fail
• Many students struggle with digital meetings and online tasks
• Teachers in non-IT fields had to put in enormous effort to adapt to new
digital systems
• https://i.ntnu.no/wiki/-/wiki/English/Cheating+on+exams
3
Fuzzy Basics
4
5
Variables and Terms
Linguistic Variables
Linguistic Terms
http://sci2s.ugr.es/keel/links.php
7
Multiple Gaussian/Normal MFs
(figure: several overlapping Gaussian membership functions; they are not sum-normal,
i.e. the memberships at a given point do not have to add up to 1)
http://cdn.intechopen.com/pdfs-wm/6928.pdf
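A Gaussian membership function is just exp(-(x - c)^2 / (2*sigma^2)). The following Python/NumPy fragment (an illustrative sketch with made-up centres and widths, not values from the slide) evaluates several of them and shows that at a given x the memberships need not sum to 1.

import numpy as np

def gauss_mf(x, c, sigma):
    # Gaussian membership function centred at c with width sigma
    return np.exp(-((x - c) ** 2) / (2 * sigma ** 2))

x = 5.0
centres = [2.0, 5.0, 8.0]        # e.g. terms "low", "medium", "high"
sigmas  = [1.5, 1.5, 1.5]
memberships = [gauss_mf(x, c, s) for c, s in zip(centres, sigmas)]
print(memberships, sum(memberships))   # the sum is generally not 1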
Fuzzy Inference
● Fuzzy rules are conditional statements in the form:
IF x is A THEN z is C
http://ispac.diet.uniroma1.it/scarpiniti/files/NNs/Less9.pdf 9
10
Fuzzy Inference
● Fuzzy antecedent:
● “If the tomato is more or less red (μRED = 0.7)”
● Fuzzy consequent(s):
● “The tomato is more or less sweet (μSWEET = 0.64)”
● “The tomato is more or less sour (μSOUR = 0.36)”
http://ispac.diet.uniroma1.it/scarpiniti/files/NNs/Less9.pdf
Fuzzy Reasoning
● There can be more than one atom in one antecedent
● Mamdani-type rules:
● IF x is A AND y is B THEN z is C
● Takagi-Sugeno-type rules:
● IF x is A AND y is B THEN z = f(x,y)
● eg: z = ax + by + c
http://pharmacyebooks.com/2010/10/artifitial-neural-networks-hot-topic-pharmaceutical-research.html
12
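To make the two rule types concrete, here is a small Python sketch (illustrative only; it reuses μRED = 0.7 from the tomato example, while the second membership value, the "sweet" membership function and the Takagi–Sugeno coefficients are made-up assumptions). AND between atoms is taken as min; a Mamdani rule clips its output set, while a Takagi–Sugeno rule returns a crisp function of the inputs.

def mamdani_rule(mu_a, mu_b, mu_c_of_z):
    # IF x is A AND y is B THEN z is C: clip C's membership at the firing strength
    firing = min(mu_a, mu_b)                      # AND = min
    return lambda z: min(firing, mu_c_of_z(z))

def takagi_sugeno_rule(mu_a, mu_b, f, x, y):
    # IF x is A AND y is B THEN z = f(x, y): crisp output weighted by the firing strength
    firing = min(mu_a, mu_b)
    return firing, f(x, y)

mu_red, mu_round = 0.7, 0.9                       # "more or less red", "more or less round"
sweet = lambda z: max(0.0, 1.0 - abs(z - 10.0) / 5.0)   # triangular MF for "sweet"
clipped = mamdani_rule(mu_red, mu_round, sweet)          # fuzzy output set of the Mamdani rule
w, z = takagi_sugeno_rule(mu_red, mu_round,
                          lambda x, y: 0.5 * x + 0.3 * y + 1.0, x=6.0, y=4.0)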
Artificial Neural Network (2)
http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/lguo/ann.html
13
Combining ANN/FL
● ANN black box approach requires sufficient data to find
the structure (generalization learning)
● NO PRIORS required
● Cannot extract linguistically meaningful rules from trained ANN
http://www.scholarpedia.org/article/Fuzzy_neural_network
14
Combining ANN/FL
● How to resolve these issues
● Can’t extract linguistic rules from ANN
● FL requires manual heuristics for training
http://www.scholarpedia.org/article/Fuzzy_neural_network 16
Example Hybrid NF 1
(Fuzzy > ANN)
First, the fuzzy inference block receives linguistic statements and processes them.
Second, the fuzzy block output serves as the input to the ANN block
17
Robert Fuller, Neural Fuzzy Systems, 1995
Example Hybrid NF 1
(ANN > Fuzzy )
First, the raw data are delivered to the ANN through input neurons.
Second, neural outputs drive the fuzzy inference block that builds
decision statements.
Robert Fuller, Neural Fuzzy Systems, 1995
18
Cooperative NF
http://www.scholarpedia.org/article/Fuzzy_neural_network
19
Fuzzy patches (1)
• Fuzzy rules approximate the mapping function y = f(x)
from an input value region of X to an output value
region of Y
http://kyfranke.com/uploads/Publications/kyfranke-PhD-thesis-2007-Acrobat7.pdf
20
Fuzzy patches (1)
There are two steps in fuzzy rules adaptation:
1. unsupervised learning procedure for a rough
placement of the rule parameters
Cheap!
http://kyfranke.com/uploads/Publications/kyfranke-PhD-thesis-2007-Acrobat7.pdf 21
The Fuzzy Tomato
(figure: two linguistic VARIABLES – “FLAVOR”, with TERMS “sweet”, “tart”, “sour”,
and “COLOUR”, with its own terms)
http://www.sciencedirect.com/science/article/pii/S0196890405003225
24
● Smaller and more numerous patches will provide
more precision in the estimation of f(x)
● Increasing the crispness of the rules
● But increasing the complexity
● More terms
● More rules
● More training data required for accurate training
26
http://kyfranke.com/uploads/Publications/kyfranke-PhD-thesis-2007-Acrobat7.pdf
Kosko Method (2 Steps)
1 Rough Placement of Patches
● Unsupervised Learning (eg SOM) is applied to data
● Seeking rough approximation of rule parameters
(Rough location of fuzzy patches)
2 Refinement
● Supervised learning (eg gradient descent)
27
http://kyfranke.com/uploads/Publications/kyfranke-PhD-thesis-2007-Acrobat7.pdf
NF general overview
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/sbaa/report.fuzrules.html 28
Fuzzy Inference Principles
● MIN–MAX principle:
● MIN: perform the AND operation among the atoms of a rule
to obtain the rule's fuzzy membership (firing) degree
● MAX: aggregate the outputs of the fired rules (fuzzy OR)
30
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/sbaa/report.fuzrules.html
31
Fuzzy Logic and Math Operations
Fuzzy set A is a collection of pairs (x, μ(x)):  A = {(x, μ_A(x))}
Fuzzy rules:
IF x₁ is A ∧ x₂ is B THEN Class Y        R_i: IF x₁ is A₁ ∧ x₂ is A₂ THEN X is Class X
Hybrid NeuroFuzzy Networks
∂C/∂ω = ∂(yᵢ − dᵢ)²/∂ω = −2 (yᵢ − dᵢ) · ∂dᵢ/∂ω
http://cnmat.berkeley.edu/publication/real_time_neuro_fuzzy_systems_adaptive_control_musical_processes 33
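The derivative above is exactly what a gradient-descent weight update uses. A minimal Python sketch (illustrative; the simple model dᵢ = ω·xᵢ, the toy data and the learning rate are assumptions, not from the slide):

import numpy as np

def sgd_step(w, x, y, lr=0.01):
    d = w * x                       # model output d_i for a single weight w
    grad = -2.0 * (y - d) * x       # dC/dw = -2 (y_i - d_i) * dd_i/dw, here dd_i/dw = x
    return w - lr * grad            # move against the gradient

w = 0.0
for x_i, y_i in [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]:   # toy data
    w = sgd_step(w, x_i, y_i)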
2-Rule Fuzzy Inference, z = f(x,y)
(figure: Input 1 and Input 2 feed Rule 1 and Rule 2; the rule outputs are combined into z = f(x, y))
http://www.cs.princeton.edu/courses/archive/fall07/cos436/HIDDEN/Knapp/fuzzy004.htm 34
Neuro-Fuzzy Architectures
http://www.sciencedirect.com/science/article/pii/S0307904X1200025X
38
Complex Neuro-Fuzzy Architecture
http://scialert.net/fulltext/?doi=jas.2008.309.315 39
Drawbacks
40
Strengths
● Human-understandable FL
● Human-like reasoning in ANN
● Enables modelling of complex I/O relationships
● results in simple fuzzy rules
● ANN is one of the best statistical approaches
● NF is flexible and easily adjusted by modifying
various parameters.
● ANN is amenable to parallel optimization
41
Thank you for your attention!
A tutorial on Principal Components Analysis
Lindsay I Smith
Introduction
1
Chapter 2
Background Mathematics
This section will attempt to give some elementary background mathematical skills that
will be required to understand the process of Principal Components Analysis. The
topics are covered independently of each other, and examples given. It is less important
to remember the exact mechanics of a mathematical technique than it is to understand
the reason why such a technique may be used, and what the result of the operation tells
us about our data. Not all of these techniques are used in PCA, but the ones that are not
explicitly required do provide the grounding on which the most important techniques
are based.
I have included a section on Statistics which looks at distribution measurements,
or, how the data is spread out. The other section is on Matrix Algebra and looks at
eigenvectors and eigenvalues, important properties of matrices that are fundamental to
PCA.
2.1 Statistics
The entire subject of statistics is based around the idea that you have this big set of data,
and you want to analyse that set in terms of the relationships between the individual
points in that data set. I am going to look at a few of the measures you can do on a set
of data, and what they tell you about the data itself.
2
of some bigger population. There is a reference later in this section pointing to more
information about samples and populations.
Here’s an example set:
I could simply use the symbol X to refer to this entire set of numbers. If I want to
refer to an individual number in this data set, I will use subscripts on the symbol to
indicate a specific number. Eg. X₃ refers to the 3rd number in X, namely the number
4. Note that X₁ is the first number in the sequence, not X₀ like you may see in some
textbooks. Also, the symbol n will be used to refer to the number of elements in the
set X.
There are a number of things that we can calculate about a data set. For example,
we can calculate the mean of the sample. I assume that the reader understands what the
mean of a sample is, and will only give the formula:

$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$

Notice the symbol X̄ (said “X bar”) to indicate the mean of the set X. All this formula
says is “Add up all the numbers and then divide by how many there are”.
Unfortunately, the mean doesn’t tell us a lot about the data except for a sort of
middle point. For example, these two data sets have exactly the same mean (10), but
are obviously quite different:

[0 8 12 20]    and    [8 9 11 12]
So what is different about these two sets? It is the spread of the data that is different.
The Standard Deviation (SD) of a data set is a measure of how spread out the data is.
How do we calculate it? The English definition of the SD is: “The average distance
from the mean of the data set to a point”. The way to calculate it is to compute the
squares of the distance from each data point to the mean of the set, add them all up,
divide by n−1, and take the positive square root. As a formula:

$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$

Where s is the usual symbol for standard deviation of a sample. I hear you asking “Why
are you using (n−1) and not n?”. Well, the answer is a bit complicated, but in general,
if your data set is a sample data set, ie. you have taken a subset of the real-world (like
surveying 500 people about the election) then you must use (n−1) because it turns out
that this gives you an answer that is closer to the standard deviation that would result
if you had used the entire population, than if you’d used n. If, however, you are not
calculating the standard deviation for a sample, but for an entire population, then you
should divide by n instead of (n−1). For further reading on this topic, the web page
http://mathcentral.uregina.ca/RR/database/RR.09.95/weston2.html describes standard
deviation in a similar way, and also provides an example experiment that shows the
3
Set 1:
X        (X − X̄)     (X − X̄)²
0          -10          100
8           -2            4
12           2            4
20          10          100
Total                   208
Divided by (n−1)         69.333
Square Root               8.3266

Set 2:
X        (X − X̄)     (X − X̄)²
8           -2            4
9           -1            1
11           1            1
12           2            4
Total                    10
Divided by (n−1)          3.333
Square Root               1.8257
difference between each of the denominators. It also discusses the difference between
samples and populations.
So, for our two data sets above, the calculations of standard deviation are in Ta-
ble 2.1.
And so, as expected, the first set has a much larger standard deviation because its data
is much more spread out from the mean. Just as another example, the data set [10 10 10 10]
also has a mean of 10, but its standard deviation is 0, because all the numbers are the
same. None of them deviate from the mean.
2.1.2 Variance
Variance is another measure of the spread of data in a data set. In fact it is almost
identical to the standard deviation. The formula is this:

$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$

You will notice that this is simply the standard deviation squared, in both the symbol
(s²) and the formula (there is no square root in the formula for variance). s² is the
usual symbol for variance of a sample. Both these measurements are measures of the
spread of the data. Standard deviation is the most common measure, but variance is
also used. The reason why I have introduced variance in addition to standard deviation
is to provide a solid platform from which the next section, covariance, can launch from.
Exercises
Find the mean, standard deviation, and variance for each of these data sets.
• [12 23 34 44 59 70 98]
• [12 15 25 27 32 88 99]
• [15 35 78 82 90 95 97]
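If you want to check your answers, a few lines of Python/NumPy (an illustrative aid, not part of the tutorial) compute all three quantities with the sample (n−1) denominator discussed above:

import numpy as np

for data in ([12, 23, 34, 44, 59, 70, 98],
             [12, 15, 25, 27, 32, 88, 99],
             [15, 35, 78, 82, 90, 95, 97]):
    x = np.asarray(data, dtype=float)
    mean = x.mean()
    var = x.var(ddof=1)        # ddof=1 gives the (n-1) sample variance
    std = x.std(ddof=1)        # sample standard deviation
    print(mean, std, var)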
2.1.3 Covariance
The last two measures we have looked at are purely 1-dimensional. Data sets like this
could be: heights of all the people in the room, marks for the last COMP101 exam etc.
However many data sets have more than one dimension, and the aim of the statistical
analysis of these data sets is usually to see if there is any relationship between the
dimensions. For example, we might have as our data set both the height of all the
students in a class, and the mark they received for that paper. We could then perform
statistical analysis to see if the height of a student has any effect on their mark.
Standard deviation and variance only operate on 1 dimension, so that you could
only calculate the standard deviation for each dimension of the data set independently
of the other dimensions. However, it is useful to have a similar measure to find out how
much the dimensions vary from the mean with respect to each other.
Covariance is such a measure. Covariance is always measured between 2 dimensions.
If you calculate the covariance between one dimension and itself, you get the
variance. So, if you had a 3-dimensional data set (x, y, z), then you could measure the
covariance between the x and y dimensions, the x and z dimensions, and the y and z
dimensions. Measuring the covariance between x and x, or y and y, or z and z would
give you the variance of the x, y and z dimensions respectively.
The formula for covariance is very similar to the formula for variance. The formula
for variance could also be written like this:

$var(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}{n-1}$

where I have simply expanded the square term to show both parts. So given that knowledge,
here is the formula for covariance:
5
Figure 2.1: A plot of the covariance data showing positive relationship between the
number of hours studied against the mark received
$cov(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$

It is exactly the same except that in the second set of brackets, the X’s are replaced by
Y’s. So, how does this work in practice? Imagine we have gone into the world and collected
some 2-dimensional data: say, we have asked a bunch of students
how many hours in total that they spent studying COSC241, and the mark that they
received. So we have two dimensions, the first is the H dimension, the hours studied,
and the second is the M dimension, the mark received. Figure 2.2 holds my imaginary
data, and the calculation of cov(H, M), the covariance between the Hours of study
done and the Mark received.
So what does it tell us? The exact value is not as important as its sign (ie. positive
or negative). If the value is positive, as it is here, then that indicates that both di-
mensions increase together, meaning that, in general, as the number of hours of study
increased, so did the final mark.
If the value is negative, then as one dimension increases, the other decreases. If we
had ended up with a negative covariance here, then that would have said the opposite,
that as the number of hours of study increased the final mark decreased.
In the last case, if the covariance is zero, it indicates that the two dimensions are
independent of each other.
The result that mark given increases as the number of hours studied increases can
be easily seen by drawing a graph of the data, as in Figure 2.1.3. However, the luxury
of being able to visualize data is only available at 2 and 3 dimensions. Since the co-
variance value can be calculated between any 2 dimensions in a data set, this technique
is often used to find relationships between dimensions in high-dimensional data sets
where visualisation is difficult.
You might ask “is cov(X,Y) equal to cov(Y,X)?” Well, a quick look at the for-
mula for covariance tells us that yes, they are exactly the same, since the only dif-
ference between cov(X,Y) and cov(Y,X) is that (Xᵢ − X̄)(Yᵢ − Ȳ) is replaced by
(Yᵢ − Ȳ)(Xᵢ − X̄). And since multiplication is commutative, which means that it
doesn’t matter which way around I multiply two numbers, I always get the same num-
ber, these two equations give the same answer.
6
Hours(H) Mark(M)
Data 9 39
15 56
25 93
14 61
10 50
18 75
0 32
16 85
5 42
19 70
16 66
20 80
Totals 167 749
Averages 13.92 62.42
Covariance:
H     M      (Hᵢ − H̄)   (Mᵢ − M̄)   (Hᵢ − H̄)(Mᵢ − M̄)
9     39      -4.92      -23.42       115.23
15    56       1.08       -6.42        -6.93
25    93      11.08       30.58       338.83
14    61       0.08       -1.42        -0.11
10    50      -3.92      -12.42        48.69
18    75       4.08       12.58        51.33
0     32     -13.92      -30.42       423.45
16    85       2.08       22.58        46.97
5     42      -8.92      -20.42       182.15
19    70       5.08        7.58        38.51
16    66       2.08        3.58         7.45
20    80       6.08       17.58       106.89
Total                                1149.89
Average                               104.54
7
A useful way to get all the possible covariance values between all the different
dimensions is to calculate them all and put them in a matrix. I assume in this tutorial
that you are familiar with matrices, and how they can be defined. So, the definition for
the covariance matrix of a set of data with n dimensions is:

$C^{n \times n} = (c_{i,j}), \quad c_{i,j} = cov(Dim_i, Dim_j)$

where $C^{n \times n}$ is a matrix with n rows and n columns, and $Dim_x$ is the x-th dimension.
All that this ugly looking formula says is that if you have an n-dimensional data set,
then the matrix has n rows and n columns (so is square) and each entry in the matrix is
the result of calculating the covariance between two separate dimensions. Eg. the entry
on row 2, column 3, is the covariance value calculated between the 2nd dimension and
the 3rd dimension.
An example. We’ll make up the covariance matrix for an imaginary 3 dimensional
data set, using the usual dimensions x, y and z. Then, the covariance matrix has 3 rows
and 3 columns, and the values are this:

$C = \begin{pmatrix} cov(x,x) & cov(x,y) & cov(x,z) \\ cov(y,x) & cov(y,y) & cov(y,z) \\ cov(z,x) & cov(z,y) & cov(z,z) \end{pmatrix}$
Exercises
Work out the covariance between the x and y dimensions in the following 2 dimen-
sional data set, and describe what the result indicates about the data.

Item Number:    1     2     3     4     5
           x:  10    39    19    23    28
           y:  43    13    32    21    20
Calculate the covariance matrix for this 3 dimensional set of data.

Item Number:    1     2     3
           x:   1    -1     4
           y:   2     1     3
           z:   1     3    -1
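For checking the covariance exercises numerically, here is a short Python/NumPy fragment (an illustrative aid, not part of the tutorial) using the data from the first exercise; np.cov with its default settings uses the same (n−1) denominator as the formula above.

import numpy as np

def cov(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)

x = [10, 39, 19, 23, 28]
y = [43, 13, 32, 21, 20]
print(cov(x, y))             # sample covariance between the two dimensions
print(np.cov(x, y))          # full 2x2 covariance matrix, same (n-1) denominator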
8
2.2 Matrix Algebra

Figure 2.2: Example of one non-eigenvector and one eigenvector
Figure 2.3: Example of how a scaled eigenvector is still an eigenvector
(both figures show a square transformation matrix multiplied by a 2-dimensional vector)
2.2.1 Eigenvectors
As you know, you can multiply two matrices together, provided they are compatible
sizes. Eigenvectors are a special case of this. Consider the two multiplications between
a matrix and a vector in Figure 2.2.
In the first example, the resulting vector is not an integer multiple of the original
vector, whereas in the second example, the result is exactly 4 times the vector we
began with. Why is this? Well, the vector (3, 2) is a vector in 2 dimensional space. The
vector (3, 2) (from the second example multiplication) represents an arrow pointing
from the origin, (0, 0), to the point (3, 2). The other matrix, the square one, can be
thought of as a transformation matrix. If you multiply this matrix on the left of a
vector, the answer is another vector that is transformed from its original position.
It is the nature of the transformation that the eigenvectors arise from. Imagine a
transformation matrix that, when multiplied on the left, reflected vectors in the line
y = x. Then you can see that if there were a vector that lay on the line y = x, its
reflection is itself. This vector (and all multiples of it, because it wouldn’t matter how
long the vector was), would be an eigenvector of that transformation matrix.
What properties do these eigenvectors have? You should first know that eigenvec-
tors can only be found for square matrices. And, not every square matrix has eigen-
vectors. And, given an n × n matrix that does have eigenvectors, there are n of them.
Given a 3 × 3 matrix, there are 3 eigenvectors.
Another property of eigenvectors is that even if I scale the vector by some amount
before I multiply it, I still get the same multiple of it as a result, as in Figure 2.3. This
is because if you scale a vector by some amount, all you are doing is making it longer,
9
not changing its direction. Lastly, all the eigenvectors of a matrix are perpendicular,
ie. at right angles to each other, no matter how many dimensions you have. By the way,
another word for perpendicular, in maths talk, is orthogonal. This is important because
it means that you can express the data in terms of these perpendicular eigenvectors,
instead of expressing them in terms of the x and y axes. We will be doing this later in
the section on PCA.
Another important thing to know is that when mathematicians find eigenvectors,
they like to find the eigenvectors whose length is exactly one. This is because, as you
know, the length of a vector doesn’t affect whether it’s an eigenvector or not, whereas
the direction does. So, in order to keep eigenvectors standard, whenever we find an
eigenvector we usually scale it to make it have a length of 1, so that all eigenvectors
have the same length. Here’s a demonstration from our example above: the eigenvector
(3, 2) has length $\sqrt{3^2 + 2^2} = \sqrt{13}$, so dividing it by $\sqrt{13}$ gives the unit
eigenvector $(3/\sqrt{13},\ 2/\sqrt{13})$.
How does one go about finding these mystical eigenvectors? Unfortunately, it’s
only easy(ish) if you have a rather small matrix, like no bigger than about 3 × 3. After
that, the usual way to find the eigenvectors is by some complicated iterative method
which is beyond the scope of this tutorial (and this author). If you ever need to find the
eigenvectors of a matrix in a program, just find a maths library that does it all for you.
A useful maths package, called newmat, is available at http://webnz.com/robert/ .
Further information about eigenvectors in general, how to find them, and orthogo-
nality, can be found in the textbook “Elementary Linear Algebra 5e” by Howard Anton,
Publisher John Wiley & Sons Inc, ISBN 0-471-85223-6.
2.2.2 Eigenvalues
Eigenvalues are closely related to eigenvectors, in fact, we saw an eigenvalue in Fig-
ure 2.2. Notice how, in both those examples, the amount by which the original vector
was scaled after multiplication by the square matrix was the same? In that example,
the value was 4. 4 is the eigenvalue associated with that eigenvector. No matter what
multiple of the eigenvector we took before we multiplied it by the square matrix, we
would always get 4 times the scaled vector as our result (as in Figure 2.3).
So you can see that eigenvectors and eigenvalues always come in pairs. When you
get a fancy programming library to calculate your eigenvectors for you, you usually get
the eigenvalues as well.
10
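The eigenvector/eigenvalue pair discussed above is easy to check numerically. A small Python/NumPy sketch (the 2 × 2 matrix below is an assumed example chosen so that it has eigenvalue 4 and eigenvector direction (3, 2), matching the values quoted in the text; it is not taken verbatim from the garbled figures):

import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])          # a transformation matrix with eigenvalue 4
v = np.array([3.0, 2.0])            # candidate eigenvector

print(A @ v)                         # -> [12. 8.] = 4 * [3. 2.], so v is an eigenvector
print(A @ (2 * v))                   # scaling v first still gives 4 times the scaled vector

vals, vecs = np.linalg.eig(A)        # library routine: eigenvalues and unit eigenvectors
print(vals)                          # contains 4 (and -1)
print(vecs[:, np.argmax(vals)])      # unit eigenvector for the largest eigenvalue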
Exercises
d %
For the following square matrix:
-- p% -
Decide which, if any, of the following vectors are eigenvectors of that matrix and
-% - % d
give the corresponding eigenvalue.
- d %
11
Chapter 3
3.1 Method
Step 1: Get some data
In my simple example, I am going to use my own made-up data set. It’s only got 2
dimensions, and the reason why I have chosen this is so that I can provide plots of the
data to show what the PCA analysis is doing at each step.
The data I have used is found in Figure 3.1, along with a plot of that data.
Step 2: Subtract the mean
For PCA to work properly, you have to subtract the mean from each of the data dimen-
sions. The mean subtracted is the average across each dimension. So, all the x values
have x̄ (the mean of the x values of all the data points) subtracted, and all the y values
have ȳ subtracted from them. This produces a data set whose mean is zero.
12
         Data                DataAdjust
       x      y              x       y
      2.5    2.4            .69     .49
      0.5    0.7          -1.31   -1.21
      2.2    2.9            .39     .99
      1.9    2.2            .09     .29
      3.1    3.0           1.29    1.09
      2.3    2.7            .49     .79
      2      1.6            .19    -.31
      1      1.1           -.81    -.81
      1.5    1.6           -.31    -.31
      1.1    0.9           -.71   -1.01

Figure 3.1: PCA example data, original data on the left, data with the means subtracted
on the right, and a plot of the data
13
Step 3: Calculate the covariance matrix
Since the data is 2 dimensional, the covariance matrix will be 2 × 2, so I
will just give you the result:

$cov = \begin{pmatrix} .616555556 & .615444444 \\ .615444444 & .716555556 \end{pmatrix}$

So, since the non-diagonal elements in this covariance matrix are positive, we should
expect that both the x and y variables increase together.
14
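Steps 2 and 3, together with the eigendecomposition used next, can be reproduced in a few lines of Python/NumPy (an illustrative aid, not from the tutorial), using the data set from Figure 3.1; running it should reproduce the covariance matrix above and, up to sign, the eigenvector directions plotted in Figure 3.2.

import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

data_adjust = data - data.mean(axis=0)          # Step 2: subtract the mean of each dimension
cov = np.cov(data_adjust, rowvar=False)         # Step 3: 2x2 covariance matrix (n-1 denominator)
eigvals, eigvecs = np.linalg.eig(cov)           # eigenvalues and unit eigenvectors

order = np.argsort(eigvals)[::-1]               # sort components by significance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(cov)
print(eigvals)       # one large and one small eigenvalue
print(eigvecs)       # columns are the (unit) principal directions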
Mean adjusted data with eigenvectors overlayed
(plot: the mean-adjusted data "PCAdataadjust.dat" with two straight lines through the
origin, one drawn along each eigenvector of the covariance matrix)
Figure 3.2: A plot of the normalised data (mean subtracted) with the eigenvectors of
the covariance matrix overlayed on top.
15
You will notice that the eigenvalues are quite different values. In fact, it turns out that
the eigenvector with the highest eigenvalue is the principal component of the data set.
In our example, the eigenvector with the largest eigenvalue was the one that pointed
down the middle of the data. It is the most significant relationship between the data
dimensions.
In general, once eigenvectors are found from the covariance matrix, the next step
is to order them by eigenvalue, highest to lowest. This gives you the components in
order of significance. Now, if you like, you can decide to ignore the components of
lesser significance. You do lose some information, but if the eigenvalues are small, you
don’t lose much. If you leave out some components, the final data set will have less
dimensions than the original. To be precise, if you originally have n dimensions in
your data, and so you calculate n eigenvectors and eigenvalues, and then you choose
only the first p eigenvectors, then the final data set has only p dimensions.
What needs to be done now is you need to form a feature vector, which is just
a fancy name for a matrix of vectors. This is constructed by taking the eigenvectors
that you want to keep from the list of eigenvectors, and forming a matrix with these
eigenvectors in the columns:

$FeatureVector = (eig_1\ eig_2\ eig_3\ \ldots\ eig_p)$

Given our example set of data, and the fact that we have 2 eigenvectors, we have two
choices: we can form a feature vector with both of the eigenvectors, or we can leave out
the smaller, less significant component and keep only the eigenvector with the larger
eigenvalue.
16
a little T symbol above their names from now on. The maths is easier if we take the
transpose of the feature vector and the data first, rather than having to do it later:

$FinalData = RowFeatureVector \times RowDataAdjust$

where RowFeatureVector is the matrix with the eigenvectors in its rows, RowDataAdjust
is the mean-adjusted data transposed, and FinalData is the final data set, with
data items in columns, and dimensions along rows.
What will this give us? It will give us the original data solely in terms of the vectors
we chose. Our original data set had two axes, x and y, so our data was in terms of
them. It is possible to express data in terms of any two axes that you like. If these
axes are perpendicular, then the expression is the most efficient. This was why it was
important that eigenvectors are always perpendicular to each other. We have changed
our data from being in terms of the axes x and y, and now they are in terms of our 2
eigenvectors. In the case of when the new data set has reduced dimensionality, ie. we
have left some of the eigenvectors out, the new data is only in terms of the vectors that
we decided to keep.
To show this on our data, I have done the final transformation with each of the
possible feature vectors. I have taken the transpose of the result in each case to bring
the data back to the nice table-like format. I have also plotted the final points to show
how they relate to the components.
In the case of keeping both eigenvectors for the transformation, we get the data and
the plot found in Figure 3.3. This plot is basically the original data, rotated so that the
eigenvectors are the axes. This is understandable since we have lost no information in
this decomposition.
The other transformation we can make is by taking only the eigenvector with the
largest eigenvalue. The table of data resulting from that is found in Figure 3.4. As
expected, it only has a single dimension. If you compare this data set with the one
resulting from using both eigenvectors, you will notice that this data set is exactly the
first column of the other. So, if you were to plot this data, it would be 1 dimensional,
and would be points on a line in exactly the x positions of the points in the plot in
Figure 3.3. We have effectively thrown away the whole other axis, which is the other
eigenvector.
So what have we done here? Basically we have transformed our data so that it is
expressed in terms of the patterns between them, where the patterns are the lines that
most closely describe the relationships between the data. This is helpful because we
have now classified our data point as a combination of the contributions from each of
those lines. Initially we had the simple x and y axes. This is fine, but the x and y
values of each data point don’t really tell us exactly how that point relates to the rest of
the data. Now, the values of the data points tell us exactly where (ie. above/below) the
trend lines the data point sits. In the case of the transformation using both eigenvectors,
we have simply altered the data so that it is in terms of those eigenvectors instead of
the usual axes. But the single-eigenvector decomposition has removed the contribution
due to the smaller eigenvector and left us with data that is only in terms of the other.
17
Transformed Data =
        x               y
   -.827970186     -.175115307
    1.77758033      .142857227
   -.992197494      .384374989
   -.274210416      .130417207
   -1.67580142     -.209498461
   -.912949103      .175282444
    .0991094375    -.349824698
    1.14457216      .0464172582
    .438046137      .0177646297
    1.22382056     -.162675287
Data transformed with 2 eigenvectors
(plot of "./doublevecfinal.dat")
Figure 3.3: The table of data by applying the PCA analysis using both eigenvectors,
and a plot of the new data points.
18
Transformed Data (Single eigenvector), x values:
-.827970186
1.77758033
-.992197494
-.274210416
-1.67580142
-.912949103
.0991094375
1.14457216
.438046137
1.22382056
Figure 3.4: The data after transforming using only the most significant eigenvector
So, how do we get the original data back? Before we do that, remember that only if
we took all the eigenvectors in our transformation will we get exactly the original data
back. If we have reduced the number of eigenvectors in the final transformation, then
the retrieved data has lost some information.
Recall that the final transform was FinalData = RowFeatureVector × RowDataAdjust,
which can be turned around to give

$RowDataAdjust = RowFeatureVector^{-1} \times FinalData$

where $RowFeatureVector^{-1}$ is the inverse of RowFeatureVector. However, when
we take all the eigenvectors in our feature vector, it turns out that the inverse of our
feature vector is actually equal to the transpose of our feature vector. This is only true
because the elements of the matrix are all the unit eigenvectors of our data set. This
makes the return trip easier.
But, to get the actual original data back, we need to add on the mean of that original
data (remember we subtracted it right at the start). So, for completeness,

$RowOriginalData = (RowFeatureVector^{T} \times FinalData) + OriginalMean$
This formula also applies to when you do not have all the eigenvectors in the feature
vector. So even when you leave out some eigenvectors, the above equation still makes
the correct transform.
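Continuing the earlier Python/NumPy sketch (illustrative, not from the tutorial), projecting onto a reduced feature vector and then reconstructing shows exactly the lossy behaviour described here:

import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
mean = data.mean(axis=0)
row_data_adjust = (data - mean).T                     # dimensions along rows, items in columns

eigvals, eigvecs = np.linalg.eig(np.cov(row_data_adjust))
order = np.argsort(eigvals)[::-1]
feature_vector = eigvecs[:, order][:, :1]             # keep only the most significant eigenvector

row_feature_vector = feature_vector.T                 # eigenvectors in rows
final_data = row_feature_vector @ row_data_adjust     # FinalData = RowFeatureVector x RowDataAdjust

# RowOriginalData = (RowFeatureVector^T x FinalData) + OriginalMean
restored = (row_feature_vector.T @ final_data).T + mean
print(restored)    # lies on a line: variation along the discarded eigenvector is gone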
I will not perform the data re-creation using the complete feature vector, because the
result is exactly the data we started with. However, I will do it with the reduced feature
vector to show you how information has been lost. Figure 3.5 shows this plot. Compare
19
Original data restored using only a single eigenvector
(plot of "./lossyplusmean.dat")
Figure 3.5: The reconstruction from the data that was derived using only a single eigen-
vector
it to the original data plot in Figure 3.1 and you will notice how, while the variation
along the principal eigenvector (see Figure 3.2 for the eigenvector overlayed on top of
the mean-adjusted data) has been kept, the variation along the other component (the
other eigenvector that we left out) has gone.
Exercises
• What do the eigenvectors of the covariance matrix give us?
• At what point in the PCA process can we decide to compress the data? What
effect does this have?
• For an example of PCA and a graphical representation of the principal eigenvec-
tors, research the topic ’Eigenfaces’, which uses PCA to do facial recognition
20
Chapter 4
This chapter will outline the way that PCA is used in computer vision, first showing
how images are usually represented, and then showing what PCA can allow us to do
with those images. The information in this section regarding facial recognition comes
from “Face Recognition: Eigenface, Elastic Matching, and Neural Nets”, Jun Zhang et
al. Proceedings of the IEEE, Vol. 85, No. 9, September 1997. The representation infor-
mation, is taken from “Digital Image Processing” Rafael C. Gonzalez and Paul Wintz,
Addison-Wesley Publishing Company, 1987. It is also an excellent reference for further
information on the K-L transform in general. The image compression information is
taken from http://www.vision.auc.dk/ sig/Teaching/Flerdim/Current/hotelling/hotelling.html,
which also provides examples of image reconstruction using a varying amount of eigen-
vectors.
4.1 Representation
4
When using these sorts of matrix techniques in computer vision, we must consider repre-
sentation of images. A square, N by N image can be expressed as an N²-dimensional
vector, where the rows of pixels in the image are placed one after the other to form a one-
dimensional image: e.g. the first N elements will be the first row of pixels, the next N
elements the next row, and so on. This gives one long vector per image,
which gives us a starting point for our PCA analysis. Once we have performed PCA,
we have our original data in terms of the eigenvectors we found from the covariance
matrix. Why is this useful? Say we want to do facial recognition, and so our original
images were of people’s faces. Then, the problem is, given a new image, whose face
from the original set is it? (Note that the new image is not one of the 20 we started
with.) The way this is done in computer vision is to measure the difference between
the new image and the original images, but not along the original axes; rather, along
the new axes derived from the PCA analysis.
It turns out that these axes work much better for recognising faces, because the
PCA analysis has given us the original images in terms of the differences and simi-
larities between them. The PCA analysis has identified the statistical patterns in the
data.
Since all the vectors are N² dimensional, we will get N² eigenvectors. In practice,
we are able to leave out some of the less significant eigenvectors, and the recognition
still performs well.
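As a rough illustration of the recognition scheme just described (a sketch under assumptions: random arrays stand in for real face images, and the image size and number of kept components are arbitrary), in Python/NumPy:

import numpy as np

rng = np.random.default_rng(0)
N = 16
faces = rng.random((20, N * N))                 # 20 training images, each flattened to N^2 values

mean_face = faces.mean(axis=0)
A = faces - mean_face                            # mean-adjusted data, one image per row
eigvals, eigvecs = np.linalg.eigh(np.cov(A, rowvar=False))   # N^2 x N^2 covariance matrix
eigenfaces = eigvecs[:, np.argsort(eigvals)[::-1]][:, :15]   # keep the 15 most significant eigenvectors

train_coords = A @ eigenfaces                    # each face described by 15 numbers
new_face = rng.random(N * N)                     # a new image to identify
new_coords = (new_face - mean_face) @ eigenfaces
best = np.argmin(np.linalg.norm(train_coords - new_coords, axis=1))
print("closest training face:", best)            # difference measured along the PCA axes

Real implementations usually avoid forming the full N² × N² covariance matrix, but the direct form above follows the description in the text.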
For image compression the data is arranged the other way around: with 20 images we
can form N² vectors, one per pixel position, so each vector is 20-dimensional. To compress
the data, we can then choose to transform the data only using, say, 15 of the eigenvectors.
This gives us a final data set with only 15 dimensions, which has saved us 1/4 of the space.
However, when the original data is reproduced, the images have lost some of the
information. This compression technique is said to be lossy because the decompressed
image is not exactly the same as the original, and is generally worse.
22
Appendix A
Implementation Code
This is code for use in Scilab, a freeware alternative to Matlab. I used this code to
generate all the examples in the text. Apart from the first macro, all the rest were
written by me.
23
// store the covariance v between dimensions 'var' and 'ct';
// the matrix is symmetric, so fill both (var,ct) and (ct,var)
cv(var,ct) = v;
cv(ct,var) = v;
// do the lower part of c also.
end,
end,
c=cv;
// return just the eigenvector matrix
x= eig;
24
//
// NOTE: This function cannot handle data sets that have any eigenvalues
// equal to zero. It’s got something to do with the way that scilab treats
// the empty matrix and zeros.
//
function [meanadjusted,covmat,sorteigvalues,sortnormaleigs] = PCAprepare (data)
// Calculates the mean adjusted matrix, only for 2 dimensional data
means = mean(data,"r");
meanadjusted = meanadjust(data);                          // subtract the column means
covmat = cov(meanadjusted);                               // covariance matrix of the adjusted data
eigvalues = spec(covmat);                                 // eigenvalues
normaleigs = justeigs(covmat);                            // unit eigenvectors
sorteigvalues = sorteigvectors(eigvalues',eigvalues');    // eigenvalues, highest first
sortnormaleigs = sorteigvectors(eigvalues',normaleigs);   // eigenvectors in the same order
25
// This sorts a matrix of vectors, based on the values of
// another matrix
//
// values = the list of eigenvalues (1 per column)
// vectors = The list of eigenvectors (1 per column)
//
// NOTE: The values should correspond to the vectors
// so that the value in column x corresponds to the vector
// in column x.
function [sortedvecs] = sorteigvectors(values,vectors)
inputsize = size(values);
numcols = inputsize(2);
highcol = highestvalcolumn(values);
sorted = vectors(:,highcol);
remainvec = removecolumn(vectors,highcol);
remainval = removecolumn(values,highcol);
for var = 2:numcols
highcol = highestvalcolumn(remainval);
sorted(:,var) = remainvec(:,highcol);
remainvec = removecolumn(remainvec,highcol);
remainval = removecolumn(remainval,highcol);
end,
sortedvecs = sorted;
26
Semester Plan (1)
Week 3 (20.01.2022) Lecture 1: (Kononenko 1,2; Chio 1) Introduction to the team / Data
Analysis / ML methods / Artificial Intelligence / Big Data / Data Analytics problems in Digital
Forensics and Information Security / Computational Forensics
Week 9 (03.03.2022) Lecture 4: (Kononenko 6,7) Attribute Quality Measures; PCA; LDA;
Feature Selection
Week 11 (17.03.2022) Lecture 5: (Kononenko 9,10) Symbolic & Stat learning; Visualization
Week 14 (07.04.2022) Tutorial 6: Support Vector Machine & Artificial Neural Network
Week 15 Påske/Easter
Week 18 (05.05.2022) Guest lecture / MOCK exam Preparation for the exam; Q & A