
02450: Introduction to Machine Learning and Data Mining

Data, feature extraction and PCA

Georgios Arvanitidis

DTU Compute, Technical University of Denmark (DTU)


Today
Reading material: Chapter 2, Chapter 3
Feedback Groups of the day:
Abrahim Deiaa El Din Abbas, Johanne Abildgaard,
Julieta Aceves, Helle Achari, Magnus Møller
Aggernæs, Subhayan Akhuli, Melis Cemre Akyol,
Malek Al Abed, Mohammad Al-Ansari, Maximillian
Al-Helo, Ismail Ali, Yusuf Mohamed Alin, Mads
Albert Alkjærsig, Javier Alonso Fernandez, Rikke
Alstrup, Muhammad Hussain Rashid Al-Takmaji,
Mohamad Malaz Mohamed Alzarrad, Saeed
Mohamud Amin, William Kirk Andersen, Mikkel Arn
Andersen, Simon Rung Andersen, Oline Melinda
Andersen, Jeppe Aarup Andersen, Mathias Vith
Andersen, Giulia Andreatta, Sander Skjolden
Andresen, Željko Antunović, Theo Rønne Appel,
Pedro Aragon Fernandez, Ivan Antonino Arena, Alba
Arias Martínez, Enerel Ariunbold, Amari Karakandi
Arun, Matthew Hiroto Asano, Jacob Frederik
Aslan-Lassen, Bergur Ástrásson, Salomé Anaïs
Aubri, Mike Auer, Eseme Ida Elena Ayiwe, Md Amin
Azad, Gursharandeep Singh Badhesha, Haydar Hamid
Abbas Bahr, Eline Agnes Jacoueline Balland,
Volodymyr Baran, Samira Sanjay Barve, Laura Bauer,
Quim Bech Vilaseca, Aslan Dalhoff Behbahani,
Nikolaj Ivø Beier, Alex Belai, Magnus Johan
Berg-Arnbak, Toms Rudolfs Berzins, Kaushik Amol
Bhat, Mark Bidstrup, Kawa Shawki Bilal, Christian
Lundgaard Bjerregaard, Magnus Bjørnskov, Emma
Louise Blair, Louise Toft Blankensteiner

2 DTU Compute Lecture 2 5 September, 2023


Lecture Schedule
1 Introduction
  29 August: C1

Data: Feature extraction, and visualization
2 Data, feature extraction and PCA
  5 September: C2, C3
3 Measures of similarity, summary statistics and probabilities
  12 September: C4, C5
4 Probability densities and data visualization
  19 September: C6, C7

Supervised learning: Classification and regression
5 Decision trees and linear regression
  26 September: C8, C9
6 Overfitting, cross-validation and Nearest Neighbor
  3 October: C10, C12 (Project 1 due before 13:00)
7 Performance evaluation, Bayes, and Naive Bayes
  10 October: C11, C13
8 Artificial Neural Networks and Bias/Variance
  24 October: C14, C15
9 AUC and ensemble methods
  31 October: C16, C17

Unsupervised learning: Clustering and density estimation
10 K-means and hierarchical clustering
  7 November: C18
11 Mixture models and density estimation
  14 November: C19, C20 (Project 2 due before 13:00)
12 Association mining
  21 November: C21

Recap
13 Recap and discussion of the exam
  28 November: C1-C21

Online help: Discussion Forum (Piazza) on DTU Learn
Videos of lectures: https://panopto.dtu.dk
Streaming of lectures: Zoom (link on DTU Learn)
Learning Objectives
• Understand the types of data, their attributes and data issues
• Understand the bag of words representation
• Be able to apply principal component analysis for data visualization and feature
extraction


What is data?


Discrete / continuous attributes


Types of attributes



Quiz 1: Attribute types (Spring 2012)

In a study of healthy breakfast habits 77 cereal brands were investigated. The attributes of the data are given in Table 1. There are a total of 14 attributes denoted x1–x14 and one output variable y which defines the average rating of the cereal products by the consumers.

No.   Attribute description                                               Abbrev.
x1    Type (0 = served cold, 1 = served hot)                              TYPE
x2    Calories per serving                                                CAL
x3    Grams of protein                                                    PROT
x4    Grams of fat                                                        FAT
x5    Milligrams of sodium                                                SOD
x6    Grams of dietary fiber                                              FIB
x7    Grams of complex carbohydrates                                      CARB
x8    Grams of sugars                                                     SUG
x9    Milligrams of potassium                                             POT
x10   Vitamins and minerals in 0%, 25%, or 100% of FDA recommendations    VIT
x11   Shelf position (1, 2, or 3, counting from the floor)                SHELF
x12   Weight in ounces of one serving                                     WEIGHT
x13   Number of cups in one serving                                       CUPS
x14   Name of cereal brand                                                NAME
y     Average rating of the cereal (from 0 to 100)                        RAT
Table 1: Attributes in a study of cereals (i.e. breakfast products, taken from http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html).

Which statement about the attributes in the data set is incorrect?
A. NAME is discrete and nominal.
B. PROT, FAT and SOD are all continuous and ratio.
C. TYPE and VIT are both discrete and ordinal.
D. An attribute that is ratio will also be interval.
E. Don’t know.



Solution:

There is a finite set of brands, thus NAME is discrete, and as the only operators that can be applied to NAME are equal and not equal, NAME is nominal. PROT, FAT and SOD are all continuous, and since zero means absence they are ratio. TYPE and VIT are both discrete; however, TYPE is not ordinal, i.e. Hot is not better than Cold, thus TYPE must be considered nominal. VIT, on the other hand, is ordinal as 0% is less than 25%, which in turn is less than 100%. Statement C is therefore the incorrect one. An attribute that is ratio will also be interval, ordinal and nominal, i.e. we can apply all the operations =, ≠, >, <, +, −, ∗, / to a ratio attribute.
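
The nesting of attribute types used in the solution can be sketched in a few lines of Python. This is only an illustration: the names `LEVELS`, `NEW_OPS` and `allowed_ops` are made up for this example, and the grouping of operations per level follows the usual convention (comparison for ordinal, addition and subtraction for interval, multiplication and division for ratio).

```python
# Attribute levels from weakest to strongest; each level inherits the
# operations of the levels before it.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]
NEW_OPS = {
    "nominal": {"==", "!="},
    "ordinal": {"<", ">"},
    "interval": {"+", "-"},
    "ratio": {"*", "/"},
}

def allowed_ops(level):
    """Return every operation valid for `level`, including inherited ones."""
    ops = set()
    for name in LEVELS[: LEVELS.index(level) + 1]:
        ops |= NEW_OPS[name]
    return ops

# Attribute types as argued in the solution to Quiz 1.
cereal_types = {"NAME": "nominal", "TYPE": "nominal", "VIT": "ordinal", "PROT": "ratio"}

for attr, level in cereal_types.items():
    print(attr, level, sorted(allowed_ops(level)))
```

Because the sets are cumulative, a ratio attribute automatically supports everything an interval, ordinal or nominal attribute supports.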


Types of data sets


Record data example: Market basket data


Relational data example: Who knows who?



Ordered data example: Time series


Data quality


Noise


Outliers


Missing values
• Definition
  • No value is stored for an attribute in a data object
• Reasons for missing values
  • Information is not collected or measured
    • People decline to give their age
  • Attribute is not applicable
    • Annual income is not applicable to children
• Handling missing values
  • Eliminate data objects
  • Eliminate attributes
  • Estimate missing values (e.g. an average)
  • Ignore the missing value in analysis
  • Model the missing values
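
Three of the handling strategies listed above can be sketched with NumPy. The data matrix is made up for illustration, and marking missing entries with `np.nan` is an assumption of this example, not something the slides prescribe.

```python
import numpy as np

# Toy data: rows are data objects, columns are attributes (age, income).
# np.nan marks a missing value; the numbers are invented for illustration.
X = np.array([
    [25.0, 300.0],
    [np.nan, 420.0],
    [41.0, np.nan],
    [33.0, 380.0],
])

missing = np.isnan(X)

# Eliminate data objects: keep only rows without missing values.
X_rows = X[~missing.any(axis=1)]

# Eliminate attributes: keep only columns without missing values.
X_cols = X[:, ~missing.any(axis=0)]

# Estimate missing values: replace each gap with its column mean,
# computed from the observed entries only.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(missing, col_means, X)

print(X_rows.shape, X_cols.shape)
print(X_imputed)
```

In this toy case eliminating attributes leaves no columns at all, which shows why the choice of strategy depends on how the missing values are distributed.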

Dataset manipulations


Feature processing


Common feature transformations


One-out-of-K encoding
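
One-out-of-K encoding replaces a categorical attribute with K binary columns, exactly one of which is 1 for each observation. A minimal sketch in plain Python follows; the function name `one_out_of_k` and the colour values are hypothetical and only serve the example.

```python
def one_out_of_k(values):
    """Encode a list of categorical values as one-out-of-K binary rows."""
    categories = sorted(set(values))              # fixed column order
    index = {c: j for j, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1                         # exactly one active column
        rows.append(row)
    return categories, rows

# Hypothetical nominal attribute, invented for this example.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_out_of_k(colors)

print(categories)
for row in encoded:
    print(row)
```

The encoding avoids imposing an artificial order on a nominal attribute, which a single integer code would do.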


Bag of words representation
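
A bag of words representation describes each document by how often each vocabulary word occurs, ignoring word order. The sketch below builds such a document-term matrix in plain Python; the two documents are invented, and splitting on whitespace is a simplifying assumption (no stop-word removal or stemming).

```python
from collections import Counter

# Two toy documents, invented for this example.
docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every distinct word across the documents, in sorted order.
tokenized = [d.lower().split() for d in docs]
vocabulary = sorted({w for words in tokenized for w in words})

# Document-term matrix: entry (i, j) counts word j in document i.
counts = [Counter(words) for words in tokenized]
X = [[c[w] for w in vocabulary] for c in counts]

print(vocabulary)
for row in X:
    print(row)
```

Each row of `X` is a vector in a space with one dimension per vocabulary word, so documents can be compared with ordinary vector operations.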


Image representation


Vector space representation



Plan for the rest of today:

• Linear algebra recap (subspaces and projections)


• The goal of Principal Component Analysis (PCA)
• Derivation of PCA
• Singular Value Decomposition used to implement PCA
• Use of PCA for data visualization


Vectors and matrices


Matrix multiplication

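Example:
$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}$$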
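
The matrix product from the example can be checked with NumPy. The sketch also multiplies in the opposite order to show that the order of the factors matters; the variable names are chosen for this example only.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Entry (i, j) of AB is the dot product of row i of A and column j of B.
AB = A @ B
BA = B @ A

print(AB)
print(BA)
```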

Matrix transpose


The identity matrix


In general $AB \neq BA$


Norms


Vector spaces

Note: any point in $\mathbb{R}^n$ can be found by using a linear combination of these vectors.


Subspaces

Any $x \in \mathbb{R}^3$ can be written as $x = a_1 e_1 + a_2 e_2 + a_3 e_3$, with $e_1 = (1, 0, 0)$, $e_2 = (0, 1, 0)$, $e_3 = (0, 0, 1)$.
Any $w$ on this plane can be written as $w = b_1 x_1 + b_2 x_2$, with $b_1, b_2 \in \mathbb{R}$.


Basis of a (sub)space


Any point in $\mathbb{R}^3$: $x = a_1 e_1 + a_2 e_2 + a_3 e_3$, where $e_1, e_2, e_3$ are basis vectors.
Any point on the plane: $y = b_1 v_1 + b_2 v_2$.
Each $v_i$ is a vector, and by linear combinations we can find any point in a subspace/vector space.


Projection


Projection onto a subspace


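$V = [v_1, v_2, \ldots, v_n]$, an $M \times n$ matrix
$b^\top = x^\top V = [x^\top v_1, \ldots, x^\top v_n]$, with $b \in \mathbb{R}^n$
$x' = b_1 v_1 + b_2 v_2 + \cdots + b_n v_n \in \mathbb{R}^M$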
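
Projection onto a subspace can be sketched with NumPy as follows. The sketch assumes the basis vectors are orthonormal, in which case the coordinates are plain inner products; the vectors `v1`, `v2` and the point `x` are chosen only for illustration.

```python
import numpy as np

# Two orthonormal vectors spanning a plane in R^3 (chosen for illustration).
v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2.0)
V = np.column_stack([v1, v2])        # shape (M, n) = (3, 2)

x = np.array([2.0, 3.0, 1.0])        # a point in R^3

# Coordinates in the subspace: b_i = x^T v_i.
b = x @ V                            # shape (n,)

# Projected point expressed in the original space: sum_i b_i v_i.
x_proj = V @ b                       # shape (M,)

# The residual is orthogonal to every basis vector of the subspace.
residual = x - x_proj

print(b)
print(x_proj)
print(V.T @ residual)
```

If the basis were not orthonormal, the coordinates would instead require solving a small linear system, which this sketch does not cover.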


PCA for high-dimensional data

PCA derivation

$$\arg\max_{v} \mathrm{Var}[b] = \arg\max_{v} \; v^\top \tilde{X}^\top \tilde{X} v \quad \text{s.t. } \|v\|^2 = v^\top v = 1$$

$$L(v, \lambda) = v^\top \tilde{X}^\top \tilde{X} v - \lambda (v^\top v - 1), \qquad \frac{\partial L}{\partial v} = 2 \tilde{X}^\top \tilde{X} v - 2 \lambda v = 0$$

or $\tilde{X}^\top \tilde{X} v = \lambda v$

This means that $\mathrm{Var}[b] = \frac{1}{N-1} v^\top \lambda v = \frac{1}{N-1} \lambda$
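
The result of the derivation can be checked numerically. The sketch below uses synthetic data (an assumption of this example) and takes the eigenvector of the centered data's cross-product matrix with the largest eigenvalue, then compares the variance of the projection with that eigenvalue divided by N - 1.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: N = 200 observations of M = 3 correlated attributes.
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.2]])
N = X.shape[0]
X_tilde = X - X.mean(axis=0)                 # center each attribute

# Eigen-decomposition of the symmetric matrix X_tilde^T X_tilde.
eigvals, eigvecs = np.linalg.eigh(X_tilde.T @ X_tilde)
lam, v = eigvals[-1], eigvecs[:, -1]         # eigh sorts eigenvalues ascending

b = X_tilde @ v                              # projections onto v
var_b = b.var(ddof=1)                        # sample variance, 1/(N-1) scaling

print(var_b, lam / (N - 1))

# A random unit vector should not give a larger variance.
u = rng.normal(size=3)
u /= np.linalg.norm(u)
var_u = (X_tilde @ u).var(ddof=1)
print(var_u <= var_b)
```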


The Singular Value Decomposition (SVD)


Principal component analysis (PCA)


[Figure panels: Data; Centered; Reconstructions (centered)]

Explained Variance
Recall that from SVD: $\tilde{X} = U \Sigma V^\top$
In the original space, the coordinates of $\tilde{X}$ projected onto the first $K$ components are:
$$X' = U \Sigma_{(K)} V_{(K)}^\top$$
where $U \Sigma_{(K)}$ is $N \times K$ and $V_{(K)}^\top$ is $K \times M$.
We can measure how much variance is retained in the reconstruction $X'$:
$$\text{Explained var.} = \frac{\|X'\|_F^2}{\|\tilde{X}\|_F^2} = \frac{\sum_{i=1}^{K} \sigma_i^2}{\sum_{i=1}^{M} \sigma_i^2}$$
Similarly, the fraction of explained variance for the $i$'th component is
$$\text{Explained var.} = \frac{\sigma_i^2}{\sum_{j=1}^{M} \sigma_j^2}$$
This uses $\mathrm{cov}(\tilde{X}) = \frac{1}{N-1} \tilde{X}^\top \tilde{X}$, $\mathrm{trace}(AB) = \mathrm{trace}(BA)$, and
$$\begin{aligned}
\|\tilde{X}\|_F^2 &= \mathrm{trace}(\tilde{X}^\top \tilde{X}) = \mathrm{trace}(\tilde{X} \tilde{X}^\top) \\
&= \mathrm{trace}(U \Sigma V^\top (U \Sigma V^\top)^\top) \\
&= \mathrm{trace}(U \Sigma V^\top V \Sigma^\top U^\top) \\
&= \mathrm{trace}(U \Sigma \Sigma^\top U^\top) \\
&= \mathrm{trace}(U^\top U \Sigma \Sigma^\top) \\
&= \mathrm{trace}(\Sigma \Sigma^\top) = \sum_i \sigma_i^2
\end{aligned}$$
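
The explained-variance formulas can be checked with NumPy's SVD. The data below is synthetic and the choice K = 2 is arbitrary; both are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data with attributes of different scale.
X = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
X_tilde = X - X.mean(axis=0)

# Thin SVD: X_tilde = U @ diag(s) @ Vt, singular values sorted descending.
U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)

K = 2
# Rank-K reconstruction X' = U Sigma_(K) V_(K)^T.
X_rec = (U[:, :K] * s[:K]) @ Vt[:K, :]

# Ratio of squared Frobenius norms versus ratio of squared singular values.
ratio_norms = np.linalg.norm(X_rec, "fro") ** 2 / np.linalg.norm(X_tilde, "fro") ** 2
ratio_sigma = np.sum(s[:K] ** 2) / np.sum(s ** 2)

# Fraction of variance explained by each individual component.
per_component = s ** 2 / np.sum(s ** 2)

print(ratio_norms, ratio_sigma)
print(per_component)
```

The two ratios agree up to floating-point error, and the per-component fractions sum to one.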
Quiz 2: PCA (Fall 2012)

No.   Attribute description                   Abbrev.
x1    Age (in years)                          AGE
x2    Gender (Female=0, Male=1)               GDR
x3    Total Bilirubin                         TB
x4    Direct Bilirubin                        DB
x5    Alkaline Phosphatase                    AP
x6    Alanine Aminotransferase                AlA
x7    Aspartate Aminotransferase              AsA
x8    Total Proteins                          TP
x9    Albumin                                 AB
x10   Albumin to Globulin ratio               A/G
y     0=No liver disease, 1=Liver disease     LD
Table 1: Attributes in a study on liver disease among Indians living in the north eastern part of Andhra Pradesh, India (taken from http://archive.ics.uci.edu/ml/datasets/ILPD+%28Indian+Liver+Patient+Dataset%29). The data has 10 input attributes x1–x10 and one output variable y which defines whether the subject considered has a liver disease (y = 1) or not (y = 0). x3–x9 are non-negative measurements giving the concentrations of various quantities measured in a blood test. x10 gives the ratio of Albumin to Globulin in the blood.

A PCA analysis is applied to the standardized data based on the attributes x1–x10. The squared Frobenius norm of the standardized data matrix $X$ is given by $\|X\|_F^2 = 5780.0$. The first four singular values are $\sigma_1 = 40.1$, $\sigma_2 = 34.2$, $\sigma_3 = 28.1$, and $\sigma_4 = 24.8$. Which of the following statements is correct?
A. The first PCA component accounts for more than 35 % of the variation.
B. The second PCA component accounts for more than 30 % of the variation.
C. The first three PCA components account for less than 70 % of the variation in the data.
D. The fourth PCA component accounts for less than 10 % of the variation in the data.
E. Don’t know.



Solution:

The $i$th principal component accounts for $\frac{\sigma_i^2}{\sum_j \sigma_j^2} = \frac{\sigma_i^2}{\|X\|_F^2}$ of the variation. We therefore have that the first PCA component accounts for $\frac{40.1^2}{5780.0} = 27.8\%$, the second $\frac{34.2^2}{5780.0} = 20.2\%$, and the first three principal components account for $\frac{40.1^2 + 34.2^2 + 28.1^2}{5780.0} = 61.7\%$ of the variation, whereas the fourth principal component accounts for $\frac{24.8^2}{5780.0} = 10.6\%$. Thus, the first three PCA components account for less than 70% of the variation in the data (statement C).
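
The arithmetic in the solution can be reproduced in a few lines of Python, using only the numbers quoted in the quiz; the variable names are chosen for this example.

```python
# Values quoted in the quiz.
frob_sq = 5780.0
sigma = [40.1, 34.2, 28.1, 24.8]

# Fraction of variation explained by each of the first four components.
fractions = [s ** 2 / frob_sq for s in sigma]
first_three = sum(fractions[:3])

for i, f in enumerate(fractions, start=1):
    print(f"component {i}: {100 * f:.1f} %")
print(f"first three: {100 * first_three:.1f} %")
```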


Fisher's Iris Data


3D scatter plot of Iris Data



Visualization of the PCA projections of the data



Quiz 3: PCA Cont. (Fall 2012)
No.   Attribute description                   Abbrev.
x1    Age (in years)                          AGE
x2    Gender (Female=0, Male=1)               GDR
x3    Total Bilirubin                         TB
x4    Direct Bilirubin                        DB
x5    Alkaline Phosphatase                    AP
x6    Alanine Aminotransferase                AlA
x7    Aspartate Aminotransferase              AsA
x8    Total Proteins                          TP
x9    Albumin                                 AB
x10   Albumin to Globulin ratio               A/G
y     0=No liver disease, 1=Liver disease     LD

Figure 1: Principal component 1 (PCA1) plotted against principal component 2 (PCA2).

The first and second principal component directions of the liver-dataset are
$$v_1 = \begin{bmatrix} -0.1404 \\ -0.1090 \\ -0.4115 \\ -0.4179 \\ -0.2468 \\ -0.2682 \\ -0.3009 \\ 0.2781 \\ 0.4375 \\ 0.3638 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} -0.2859 \\ 0.0130 \\ 0.2510 \\ 0.2622 \\ 0.0525 \\ 0.4162 \\ 0.3927 \\ 0.4197 \\ 0.4323 \\ 0.3052 \end{bmatrix}.$$

In the figure, the data projected onto the first two principal components is plotted, and the colors indicate the presence of liver disease. Which of the following statements is correct?
A. Relatively high values of AGE, GDR, TB, DB, AP, AlA, and AsA and low values of TP, AB, and A/G will result in a positive projection onto the first principal component.
B. Relatively low values of the projection onto PCA1 and high values of the projection onto PCA2 indicate the subject does not have a liver disease.
C. PCA2 mainly discriminates between old subjects with low measurements of TB, DB, AlA, AsA, TP, AB, and A/G from young subjects with high values of TB, DB, AlA, AsA, TP, AB, and A/G.
D. The principal component directions are not guaranteed to be orthogonal to each other since the data has been standardized.
Solution:

AGE, GDR, TB, DB, AP, AlA, and AsA have negative coefficients of PCA1 whereas TP, AB, and A/G have positive coefficients, resulting in a negative projection onto the first principal component, thus A is incorrect. From the figure we observe that observations with low values of PCA1 and high values of PCA2 in general have a red dot, meaning they have a liver disease. For PCA2 we observe that AGE has a negative value whereas the remaining entries have positive values, while GDR and AP have small amplitudes. As a result PCA2 mainly discriminates between young subjects with high measurements of TB, DB, AlA, AsA, TP, AB, and A/G from old subjects with low values of TB, DB, AlA, AsA, TP, AB, and A/G, hence C is correct. The principal component directions are always orthogonal to each other irrespective of the data preprocessing.
has a negative value whereas the remaining entities


Visualization of hand written digits



PCA as compression
Note: add the mean to the centered reconstructions to obtain the actual reconstructions.
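
PCA as compression can be sketched with NumPy: store K coordinates per observation instead of M attributes, then reconstruct and add the mean back. The data is synthetic and K = 2 is an arbitrary choice made for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data with a non-zero mean (a stand-in for e.g. image vectors).
X = rng.normal(size=(100, 5)) * np.array([4.0, 2.0, 0.3, 0.2, 0.1]) + 10.0

mean = X.mean(axis=0)
X_tilde = X - mean
U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)

K = 2
Z = X_tilde @ Vt[:K].T            # compressed representation, shape (N, K)
X_rec_centered = Z @ Vt[:K]       # reconstruction of the centered data
X_rec = X_rec_centered + mean     # add the mean back: actual reconstruction

# Squared reconstruction error versus the discarded squared singular values.
err = np.linalg.norm(X - X_rec, "fro") ** 2
discarded = np.sum(s[K:] ** 2)
print(Z.shape, err, discarded)
```

The squared reconstruction error equals the sum of the discarded squared singular values, which links compression quality to the explained variance.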


Data and domain driven feature extraction



Resources

• Our online PCA demo, which highlights key concepts of PCA such as the effect of normalization, variance explained, and much more: http://www2.imm.dtu.dk/courses/02450/DemoPCA.html
• A great and more in-depth tutorial on PCA: https://arxiv.org/abs/1404.1100
• A great, animated recap of linear algebra: https://www.3blue1brown.com/essence-of-linear-algebra-page/
