
6 Feb 2024 | 30554 - Aula 4 [L] (in-person lecture)

Data: vectors in D dimensions.

Data matrix with N rows and D columns.
There are usually labels y_n associated with the vectors x_1, ..., x_N.
Example dataset: the Iris dataset: y is the type of flower, x the flower's characteristics.
When data points are associated with labels, you can set up a learning process that allows you to make predictions of y given x based on the data (supervised learning). Aim: predict the target (label) given the features.
When labels are not present we call it unsupervised learning.
Covariate types: numerical, binary, categorical, depending on the set-up.
Encoding categorical variables:
Modern problem: understanding the internal representation of information in machines (high-dimensional spaces). Example: experiments on the activity of the hippocampus under different inputs. It turns out the hippocampus has a non-static encoding that changes over time (something that doesn't exist in ML).

One-hot encoding: A -> (1,0,0,…), B -> (0,1,0,…), C -> (0,0,1,…)


Useful for performing computations on the data. It is not the only possible encoding.
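A minimal sketch of one-hot encoding in Python (the category names are illustrative):

import numpy as np

categories = ["A", "B", "C"]                       # illustrative category set
index = {c: i for i, c in enumerate(categories)}   # map each category to a position

def one_hot(label):
    v = np.zeros(len(categories))
    v[index[label]] = 1.0
    return v

print(one_hot("B"))   # [0. 1. 0.]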

Data generating process.


Assume the data can be represented by a probability distribution, since we assume it is generated by a stochastic process. We could use this distribution as a prior if we had information about P(x, y). E.g., if I knew the distribution was Gaussian, the problem would reduce to finding its mean and variance. However, having such a prior is usually not feasible. In practice we proceed by assuming that we don't know anything about the prior (the contemporary approach). The brain, by contrast, does have some priors due to evolution: it has priors it uses to learn.
Data representation: take a data point (a 2-dimensional binary string) x = (x1, x2), with x1, x2 in {0, 1}.
Aim: represent the function y = (x1 + x2) (see the 5 Feb notes).
Suppose you expand to 3 dimensions: by lifting up the points 00 and 11 you can separate the data with a hyperplane. Recap: higher-dimensional representations can be very useful (an idea from the 90s).
Main idea: enlarge the space to perform optimal classification of the data. In this way we can find out which filters are most relevant for the problem: the machine can tell you what is relevant without you having to guess it a priori.

Case 1: assume the ys are given by a function and you want to guess the function.
Case 2: the ys are stochastic: predict a probability distribution. Plus generalization: predict the optimal parameters not only for a specific probability distribution but for the whole family of distributions compatible with the data.

Models as functions: the aim is to learn information about f, assuming it is linear.

Unknowns: D+1 parameters (weights plus bias).
Recall linear regression (statistics).
Change the assumption (the ys are not deterministic but stochastic) -> models as probability distributions.

Given a function/probability distribution with unknown parameters, learning means finding these parameters.

Empirical risk minimisation ("empirical" = on the given data, on which you perform the computations) -> the main operation is an optimization process (a minimization). Traditional ML took the data and performed transformations on it so that the optimization problem became convex, to make it easier. Modern ML works in the opposite way, with mostly non-convex functions: it is the machine that tells you which transformation is optimal to represent the data (it learns what is relevant).
Different views on data: data shouldn't be viewed only as a worst-case scenario, but as a typical scenario (assuming everything is worst case, the learning process was built with the aim of obtaining a convex problem).
Modern AI: the machine aims to be able to represent any type of function and proceeds to find the optimal one, depending on many (unknown) parameters.
Functions view: reduce the error on the data points.
Stochastic view: maximize the likelihood.

Hyper-parameters: the family of functions/probability distributions is characterized by parameters θ. There exist other parameters, tuned on a different timescale: parameters that are chosen a priori, tested for their performance and then changed. Example: choosing the dimension in which to represent the data. PROBLEM: assume we have no information on the data-generating process and the data are large enough -> concept of cross-validation.
Split the data points into 3 sets: a training set (to identify the optimal parameters, e.g. by minimising the MSE): the largest; a validation set (to fix the hyperparameters, e.g. running time, dimension); a test set (to measure performance).
NB: the result is not fully general, since the performance is optimized on a specific dataset. Parameters and hyperparameters are both tuned on data, which is why we need a third set of untouched data to test on. The data used in the training process cannot be used when testing the performance, because it has already affected the estimate of the parameters.
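A minimal sketch of the three-way split with NumPy (the 60/20/20 proportions and the toy data are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
N = 1000
X = rng.normal(size=(N, 5))     # toy features
y = rng.normal(size=N)          # toy targets

perm = rng.permutation(N)
n_train, n_val = int(0.6 * N), int(0.2 * N)
train_idx = perm[:n_train]
val_idx = perm[n_train:n_train + n_val]
test_idx = perm[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]   # fit the parameters here
X_val, y_val = X[val_idx], y[val_idx]           # choose the hyperparameters here
X_test, y_test = X[test_idx], y[test_idx]       # report performance here only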

Key problems:
- the space of possible models
- the measurement of error (MSE, cross-entropy, …): define a loss function (the object to minimize)
- the real objective: minimize the error on new data points (the minimization on the given data needs to be useful when the machine encounters new data). If I have a lot of data and increase the degree of the model, I can fit all the given points with a single function; however, the variance when a new point is added may be very large.

Modern AI: use linearity (in a high-dimensional space) to pin the given points: in this way you can use a high-dimensional function that is still relevant for new data. Optimization is also easier with more parameters, yet we still get generalization. Why is optimization good with many parameters? The larger the dimension, the better the data can be separated (by hyperplanes), which means optimization becomes easier. AIM: stabilize the function while also minimizing the loss function.

LOSS FUNCTION for training: l such that if ŷ = y then l = 0, and if ŷ ≠ y then l > 0.
Any function that is positive when the prediction differs from the target is fine (absolute value, power of 2, 4, …).
Ex. 1: L = |f(x) − y| (prediction minus data).
Ex. 2: use the square: large deviations are amplified by the square compared with the absolute value.
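A small sketch comparing the absolute and squared losses on a few predictions (the values are illustrative):

import numpy as np

y_true = 2.0
y_pred = np.array([2.0, 2.5, 4.0])      # exact, small error, large error

abs_loss = np.abs(y_pred - y_true)      # [0.   0.5  2.  ]
sq_loss = (y_pred - y_true) ** 2        # [0.   0.25 4.  ]  large deviations amplified
print(abs_loss, sq_loss)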

Choose a loss function whose minimum sits in a flat region, so that if you move slightly away from the true parameter value it still works (on unseen data).

[gap in notes]

AIM: avoid solutions which are sensitive to noise in data.


In some cases some directions may become irrelevant (when the gradient along them is 0); this happens when trying to fit data linearly in high dimensions. The idea is to add to the loss function a term which turns the (convex) function into a strongly convex one, in order to avoid fluctuations.
Through cross-validation we study how the model complexity (dimension of the parameters) changes the accuracy of the model. We find that very large systems don't overfit and actually reduce the error.
The previous approach was to avoid overfitting by not using all the data and stopping at the first local minimum.
[gap in notes]
K-fold cross-validation: average the results over different splits of the data.
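A minimal sketch of K-fold cross-validation with NumPy (fit and score are stand-ins for whatever model and metric you use):

import numpy as np

def k_fold_score(X, y, K, fit, score):
    # Train on K-1 folds, evaluate on the held-out fold, average over the K splits.
    idx = np.random.permutation(len(y))
    folds = np.array_split(idx, K)
    scores = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])
        scores.append(score(model, X[val], y[val]))
    return np.mean(scores)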

08.02.24
Take a data point x. Given a choice of probability density for the data, what is the probability of observing x for given parameters, P(X|θ)? We want to maximize this object (find θ such that the probability is maximized). Use MLE via the negative log-likelihood: L_X(θ) = −log P(X|θ).
Supervised learning
Dataset: {(x1, y1), …, (xN, yN)}; predict y given x: in this case we model P(Y|X; θ). We don't model x and y jointly because of the problem definition, in principle.
In this case you construct the negative log-likelihood as L_X(θ) = −log P(Y|X; θ).
DEFINE: Y the set of all ys, Y = {y1, ..., yN}, to be predicted when you're given X = {x1, ..., xN}: find a probability distribution.

Slide 66:
we use the log transform to make the optimization easier (the log of a product of likelihoods becomes a sum of logs).
Ex. 1: linear function. Fit the data using a linear function plus some noise ε. Y is a probability distribution because it is not deterministic: it is a function of the noise too.
Assume Gaussian noise (sometimes with unknown σ):
y = f(x; θ) + ε, with ε random and f(x; θ) deterministic. In this framework f(x; θ) acts as a shift of the Gaussian, becoming its mean (0 + f(x; θ)).
Therefore we obtain: P(y|x; θ) = N(f(x; θ), σ²).
This procedure is what in statistical physics is known as the spherical cow: assume f linear and the noise Gaussian. It has many limitations, even though the CLT justifies the Gaussian part.
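A short derivation (under the Gaussian-noise assumption above, with fixed σ) showing why maximizing this likelihood is equivalent to minimizing a squared-error loss:

\[
P(y \mid x;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - f(x;\theta))^2}{2\sigma^2}\right)
\;\Rightarrow\;
-\log P(y \mid x;\theta) = \frac{(y - f(x;\theta))^2}{2\sigma^2} + \tfrac{1}{2}\log(2\pi\sigma^2),
\]

so minimizing the negative log-likelihood over θ amounts to minimizing Σ_n (y_n − f(x_n; θ))².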
Bayes' theorem (operative interpretation): P(X|θ) is easy to compute (x is given, we make an assumption on θ), i.e. the stochastic process producing y is easy to compute. On the other hand, P(θ|X) is harder to compute, which is why we resort to Bayes' theorem. We can compute P(θ|X) at the price of knowing p(θ), which can also be very hard.

Suppose the dataset is large.

Slide 70:
θ_MAP = argmax_θ log p(θ | X, Y) = argmin_θ { −Σ_n log P(y_n | x_n, θ) − log p(θ) }
As the amount of data grows, the likelihood term dominates and the relative weight of the prior decreases.
This is maximization of the a posteriori probability (MAP estimation).

Assume the data are generated by a probability distribution and that we measure many data points -> compute the product of likelihoods: p(x1, ..., xN | θ) = Π_n p(x_n | θ).
The result of the MLE is not surprising: I assumed the values were Gaussian and computed the parameters of a Gaussian. However, this is not a very informative calculation: it shows the limitation of MLE when there is no prior over the distribution.
EXERCISE: take the problem and add a prior p(θ) using Bayes' theorem (remember that the denominator p(X) is irrelevant for the optimization). Imposing a good prior is very important to avoid degenerate solutions (e.g. collapsing onto a delta function).
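A sketch of the structure of this exercise, assuming a Gaussian prior over the parameters (that specific prior is an assumption, not stated in the notes); it also shows where a strongly convex extra term in the loss can come from:

\[
p(\theta) = \mathcal{N}(\theta \mid 0, \lambda^{-1} I)
\;\Rightarrow\;
\theta_{\mathrm{MAP}} = \arg\min_\theta \Big\{ -\sum_n \log P(y_n \mid x_n, \theta) + \tfrac{\lambda}{2}\,\|\theta\|^2 \Big\},
\]

i.e. the MLE objective plus an L2 penalty on the parameters.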

To optimize the a posteriori probability you need to know the prior over the parameters.

When talking about fitting data and finding optimal parameters, remember that mistakes are still highly likely.
In the previous example the assumed probability distribution was Gaussian.
Imagine you find the answer to the optimization problem: it may still be very far from reality because of the assumptions. ML is about finding general probability distributions that can fit anything (through neural networks any function can be expressed).

Bayesian inference: imagine the data are generated by a 2-step process. Start from the space of all possible models: there are two levels of stochasticity. Assume a probability distribution over functions (you need a prior over the parameters), and each function can itself be stochastic. How can you use the likelihood in this context?
Aim: in the space of functions determined by the parameters, we want to know where the joint probability between data and parameters is maximal.

Bayesian prediction: keep the full probability distribution over the parameters: to make predictions you need to take into account all models that have high probability, through integration.
Example: choosing the best movie to see:
asking the most representative person is like using MLE;
taking a majority vote is like following the Bayesian approach.

Problem: there is no way to do the integral in a very high-dimensional space. Bayesian inference is nice in theory but often not useful in practice: integrating in high dimensions is computationally impractical even if you know p(θ).
Recap: p(θ) is the prior over models (choosing θ picks a model).

Slide 81: choose θ, pick a model in the space of models, and generate many data points. P(D) is not relevant for the estimation.

For N very large the prior becomes irrelevant! If you have a lot of data you can appreciate the full power of Bayes' theorem. There is a high chance of making mistakes when constructing a prior, but if you have a lot of data points which are stochastically independent, then the posterior will be dominated by the likelihood p(X|θ). Key point: with a lot of data points the contribution of the prior becomes irrelevant. Using the MLE approach (with no consideration of the prior) is useful/right only with a lot of data points; in that case the problem would be solved, at least from a statistical-inference point of view. But the Bayesian approach is still significant, since it takes information from the whole probability distribution instead of being confined to a single value of θ.

Assume you have a noisy linear function and the data belong to a d-dimensional space: the number of parameters equals the dimension of the data. By lifting the dimension you gain a greater capability of discriminating points. Given a vector x of dimension d, through a non-linear transformation you create a new vector Φ(x) = (φ1(x), …, φN(x)) and use this as your new data. Each coordinate can depend on all the coordinates of the original space. You get another dataset D_Φ in a larger dimension, on which you can then fit a probability distribution, etc. (a generalized model). The key point is to make this transformation as smart as possible. Data can be processed with a kernel and then you perform inference and everything else.
We get:
before: y = x^T θ + ε; after: y = Φ(x)^T θ + ε.
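A minimal sketch of this lifting, assuming a simple polynomial feature map (the specific choice of φ is illustrative):

import numpy as np

def phi(x):
    # One possible non-linear map from a 2D point to a 6D feature vector.
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # original d = 2 data
X_phi = np.array([phi(x) for x in X])                     # lifted dataset D_phi
# The model stays linear in the new coordinates: y = X_phi @ theta + noise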

Dimensionality reduction and principal component analysis


Two main concepts. Data compression: find a low-dimensional representation of the data in which you don't lose information and from which you can go back to the original representation. Through this you can also learn to generate data.
Before I enlarge the representation and do classification, I need to make sure the data are not just noise but actually relevant.
[gap in notes]

Note that (b^T S)^T = S b since S is symmetric, so the stationarity condition gives S b = λ b.
b1 is the eigenvector corresponding to λ1 (the largest eigenvalue).
Result: build the covariance matrix from the data, compute its eigenvalues, compute the eigenvector of the largest one and normalize it. It gives you the direction of maximum variance, i.e. the first axis of the low-dimensional space. REVIEW EIGENVECTORS AND EIGENVALUES (aside: powers of the adjacency matrix count the number of paths of a given length from v to w).

PCA steps: prepare the data (subtract the mean and divide by the standard deviation), eigendecompose the covariance matrix, take the eigenvectors you want to keep, and project: you can then represent the data efficiently.
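A minimal sketch of these steps with NumPy (the number of retained components M is an arbitrary choice):

import numpy as np

def pca(X, M):
    # Project N x D data onto its M leading principal directions.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)     # subtract the mean and standardize
    S = np.cov(Xs, rowvar=False)                  # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)          # eigendecomposition (ascending eigenvalues)
    B = eigvecs[:, ::-1][:, :M]                   # top-M eigenvectors as columns
    Z = Xs @ B                                    # compressed representation, N x M
    X_rec = Z @ B.T                               # reconstruction in the standardized space
    return Z, X_rec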

Example: a 784-dimensional vector (a digit image) is turned into a scalar, and then we can bring it back to x and get an image that is pretty close to the original one. However, the process makes all the 8s look similar, so it is not useful when trying to identify, with high accuracy, the person who wrote the digit: it waters down the image.

Exam: computations with PCA and the Kullback-Leibler divergence.

12.02 ML (Lucibello)
Unsupervised learning: data without labels.
Notation:
- dataset D: a collection of examples D = {x^μ}_{μ=1}^{M} = {x^1, x^2, …, x^M};
  μ indexes the different data samples.
Unsupervised learning task -> clustering. Assume x^μ belongs to R^2 (in general R^d). Under this assumption we can represent the data on a plane. It may be that the data have a clustered structure. Aim: construct an algorithm that can find the clusters without defining said clusters in advance.
Network/graph: a collection of nodes and edges.
Social network: people are the nodes, and I add an edge if two people know each other. It may be a directed graph; assume for simplicity it is undirected. If I know two people, the probability that they also know each other is high. We can set a clustering task on this type of graph.
Example 2: analyze all products: I want to find the outliers without a ground truth of what an outlier is (e.g. a plot with respect to temperature).
Example 3: dimensionality reduction. Data usually live in high-dimensional spaces, for example images (represented as arrays of numbers); an image might have side 1024 and 3 color channels, so x belongs to R^(1024×1024×3).
Dimensionality reduction: f: R^D -> R^D' with D' << D (D' = 2 for visualization).
Generative modelling: p(x) is the true distribution of, say, cat images.
You observe {x^μ}_{μ=1}^{M}, i.i.d. samples from p.
Task (generative modelling): train a model p_θ(x) so that p_θ(x) is close to p (the true distribution).
Now I can generate new samples x ~ p_θ(x): cats that are not the same as the training ones (generalization).

SUPERVISED LEARNING: the dataset is composed of pairs

D = {(x^μ, y^μ)}_{μ=1}^{M},
x^μ in R^D,
y^μ in R for regression; y^μ in {0, 1} for binary classification.
Dtest = {(x^μ, y^μ)}_{μ=1}^{Mtest}, used not for training but for performance evaluation.
Train a model (function): ŷ = f_θ(x) is the prediction.
Linear model with parameters θ = (w1, ..., wD, b):
ŷ = Σ_i w_i x_i + b.
How do we typically perform this training operation?
Frame the learning problem as an optimization problem (minimizing something).
Loss/cost function: l(y, ŷ) = (y − ŷ)², the mean squared error (MSE) loss for regression.
Optimization problem: θ* = argmin_θ Σ_μ l(y^μ, f_θ(x^μ)).
Generalization metric: ε_test = test loss = E_{(x,y)~Dtest}[ l(y, f(x)) ] = (1/Mtest) Σ_{(x,y) in Dtest} l(y, f(x)).
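A minimal sketch of this pipeline for the linear model, using a closed-form least-squares fit on synthetic data (all the numbers are illustrative):

import numpy as np

rng = np.random.default_rng(0)
M, D = 200, 3
X = rng.normal(size=(M, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.1 * rng.normal(size=M)   # noisy linear data

X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

A = np.hstack([X_train, np.ones((len(X_train), 1))])     # extra column for the bias b
theta, *_ = np.linalg.lstsq(A, y_train, rcond=None)      # minimizes the training MSE

A_test = np.hstack([X_test, np.ones((len(X_test), 1))])
test_loss = np.mean((y_test - A_test @ theta) ** 2)      # generalization metric (test MSE)
print(theta, test_loss)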

OBJECT DETECTION
y = (e1, e2, w, h), a bounding box (position and size); ŷ = f_θ(x).
IMAGE SEGMENTATION: identify the different segments of the image, e.g. the sky. This is a per-pixel classification task (a more involved task).

REINFORCEMENT LEARNING: weak supervision. It's like training a dog: rewarding the dog based on how it acts, without being able to tell it how to act.
We have an actor and the environment: the actor acts in the environment.
s^t: the state at time t, belonging to the state space.
Probability of moving to the next state given the previous one and the action performed by the actor: p(s^{t+1} | s^t, a^t).
The actor is codified by another conditional probability (the policy): π(a^t | s^t).
Reward: r(s^t, a^t).
Objective: max over π of E[ Σ_t γ^t r(s^t, a^t) ], where γ is a discount factor.
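A small sketch of the discounted return being maximized, evaluated on one fixed trajectory of rewards (the reward values and γ are illustrative):

import numpy as np

gamma = 0.9
rewards = np.array([1.0, 0.0, 0.0, 5.0])            # r(s^t, a^t) along one trajectory
discounts = gamma ** np.arange(len(rewards))
print(np.sum(discounts * rewards))                   # sum_t gamma^t r(s^t, a^t) = 4.645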
Recommended practical ("dirty hands") book: Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow.

13.02 ZECCHINA
Compressing signals: see notes.
Ex.: information theory in biology (applied to DNA, a redundant code).
Lecture 3, slide 12: the aim is to find a projection of the data points.
Take the projection along a direction b; find the direction such that the variance of the projection is maximal.
Since we subtracted the mean, it is now 0 and the variance loses its second term.
Slide 13:
Maximize the variance while keeping b a unit vector (norm 1): find the direction that maximizes the variance of the projection of the data matrix, using the Lagrange multiplier technique.
For constrained optimization problems, maximize f(x) + λ g(x): this transforms a constrained problem into a smooth optimization problem (solvable, in principle, using derivatives).
In this case g(b) = ||b||² − 1 = 0.
Recall derivatives with respect to vectors.
The maximization problem becomes an eigensystem problem (λ eigenvalue, b1 eigenvector).
Instead of checking the second derivative, substitute S b = λ b back into the variance -> since we are maximizing, we have to choose the largest eigenvalue.
Compute the eigenvectors, normalize them and decide the dimension of the reduction, e.g. 5: take the first five and project onto z, where z1 = x^T b1, … (project onto the direction of the first eigenvector: a change of variables). For the second, third direction -> same reasoning.
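A compact version of this derivation (with S the covariance matrix of the centered data):

\[
\max_{b}\; b^\top S b \;\; \text{s.t.}\;\; \|b\|^2 = 1
\;\Rightarrow\;
\mathcal{L}(b,\lambda) = b^\top S b + \lambda\,(1 - b^\top b),
\qquad
\frac{\partial \mathcal{L}}{\partial b} = 2Sb - 2\lambda b = 0
\;\Rightarrow\; S b = \lambda b,
\]

and since the variance of the projection is then b^T S b = λ, maximizing it means picking the largest eigenvalue.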

PCA steps: subtract the mean (the calculations assume mean = 0; this is avoidable but makes the calculations more complicated); standardize; eigendecompose the covariance matrix; project (after deciding the dimension, ...). By applying the inverse transformation you can go back.
Slide 20 (at least 300 of the dimensions are useless).
PCA from the compression perspective (slide 25):
Take the projections of x_n onto the directions b_i.
Consider the decomposition of x_n in the orthonormal basis {b_i}: the first M components are the data-dependent projections; the remaining coefficients do not depend on the data point. For small M there is a larger reconstruction error.
x_n = Σ_{i=1}^{d} (b_i^T x_n) b_i,
x̃_n = Σ_{i=1}^{M} z_{ni} b_i + Σ_{i=M+1}^{d} c_i b_i (the second sum is needed to obtain a d-dimensional vector).
The first terms cancel up to M.
Reconstruction error: the average squared norm of the difference between the original vector and the compressed one,
RE = (1/N) Σ_n ||x_n − x̃_n||².
Find the {b_i, z_ni} that minimize the RE for the chosen M -> take derivatives.

Take the derivative with respect to z_ni; exchange the derivative and the sum.

NOTES

Why did we choose to measure the error with the expression (1/N) Σ_n ||x_n − x̃_n||²? All the calculations follow from this choice. The quantity is a squared distance, so the optimum is a minimum. We could choose any function with the same property: the choice is arbitrary, but it simplifies the calculations. This argument applies to any measure of deviation (e.g. the variance).

Slide 28:
write the squared norm as a scalar product, ||v||² = v^T v (and take the log transformation);
second-to-last line: bring the (1/N) Σ_n inside;
(1/N) Σ_n (x_n − x̄)^T (x_n − x̄) involves S, where by definition S = (1/N) Σ_n x_n x_n^T for centered data;
check the two expressions are equal: the other terms cancel out.

Ask for the notes on slides 30-35.

Autoencoders: start from a high-dimensional input, reduce it to a small-dimensional hidden representation z (obtained through a non-linear transformation) and reconstruct the original.
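A minimal sketch of the autoencoder structure with untrained random weights, just to show the shape flow (the dimensions are illustrative):

import numpy as np

rng = np.random.default_rng(0)
D, H = 784, 32                                    # input and bottleneck dimensions
W_enc = 0.01 * rng.normal(size=(H, D))
W_dec = 0.01 * rng.normal(size=(D, H))

def encode(x):
    return np.tanh(W_enc @ x)                     # non-linear map to the hidden layer z

def decode(z):
    return W_dec @ z                              # map back to the original space

x = rng.normal(size=D)
x_rec = decode(encode(x))                         # training would minimize ||x - x_rec||^2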

CLUSTERING
Unsupervised learning: put together data such that similar data are in the same group.
Recall the Hamming distance d_H to compare strings (used e.g. to compare DNA sequences).
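A tiny sketch of the Hamming distance between two equal-length strings:

def hamming_distance(a, b):
    # Number of positions at which the two strings differ.
    assert len(a) == len(b)
    return sum(ca != cb for ca, cb in zip(a, b))

print(hamming_distance("ACGTACGT", "ACGAACGA"))   # 2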

Example: patients for whom you know the symptoms. The aim is to group together patients with similar symptoms. The same procedure is used by Netflix to recommend shows.

An unsupervised learning problem (unlabeled data); the parameter K (number of clusters) has to be given.


Objective: put in the same cluster points that are similar. However, everything depends on what we call similarity.
Defining dissimilarity/distance:

• [Cluster 1 slide] Given an input, what you want to do is put together points that are
similar or connected => indeed, we see in the desired clustering image that the two
spirals are separated and some points are grouped

Ask for the notes.
Hierarchical clustering =>
• Step 1: each point is its own cluster
• Step 2: find the pair of clusters that have minimum distance
• Step 3: merge them
• Step 4: repeat steps 2 & 3

Single linkage (minimum distance): able to put together data that are similar along some stringy shape of the cluster (e.g. many pictures of the same person from the same angle).

Complete linkage: use the maximum distance between clusters (not the minimum).

Average linkage: the average distance between all pairs of points (see the sketch below for all three linkages).
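A minimal sketch of agglomerative clustering with the three linkage choices, using SciPy (the toy data and the cut into 2 clusters are illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),       # two toy blobs
               rng.normal(3, 0.3, size=(20, 2))])

for method in ["single", "complete", "average"]:
    Z = linkage(X, method=method)                      # repeatedly merge the closest clusters
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes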

Example:

LUCIBELLO, LECTURE 3 (one lecture missing):

Slide 15: NumPy.

See the Python notebook:
#PANDAS: 26.02.24; Carlo Lucibello
# In [Part 2](02.00-Introduction-to-NumPy.ipynb), we dove into detail on NumPy and its
`ndarray` object, which enables efficient storage and manipulation of dense typed arrays in
Python.
# Here we'll build on this knowledge by looking in depth at the data structures provided by
the Pandas library.
# Pandas is a newer package built on top of NumPy that provides an efficient
implementation of a `DataFrame`.
# ``DataFrame``s are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data.

# As well as offering a convenient storage interface for labeled data, Pandas implements a
number of powerful data operations familiar to users of both database frameworks and
spreadsheet programs.
# As we've seen, NumPy's `ndarray` data structure provides essential features for the type
of clean, well-organized data typically seen in numerical computing tasks.
# While it serves this purpose very well, its limitations become clear when we need more
flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting
operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.),
each of which is an important piece of analyzing the less structured data available in many
forms in the world around us.
# Pandas, and in particular its `Series` and `DataFrame` objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.

# In this part of the book, we will focus on the mechanics of using `Series`, `DataFrame`, and
related structures effectively.
# We will use examples drawn from real datasets where appropriate, but these examples are
not necessarily the focus

# Learn how to install packages using Anaconda (conda) and pip install.

# At a very basic level, Pandas objects can be thought of as enhanced versions of NumPy
structured arrays in which the rows and columns are identified with labels rather than simple
integer indices.
# As we will see during the course of this chapter, Pandas provides a host of useful tools,
methods, and functionality on top of the basic data structures, but nearly everything that
follows will require an understanding of what these structures are.
# Thus, before we go any further, let's take a look at these three fundamental Pandas data
structures: the `Series`, `DataFrame`, and `Index`.
# We will start our code sessions with the standard NumPy and Pandas imports:

# How to efficiently store data. The most prominent framework since the 1970s: relational databases.
# Relational databases: a collection of data is a collection of tables.
The mathematical formalism is relational algebra (it formalizes the operations you can do). (DSL = domain-specific language, e.g. SQL.) In this course we stick to Python and use Pandas for the same functions as SQL.
This is not the only way to store data: there also exist non-relational databases (NoSQL): graph-based, document-based, …
Pandas (built on top of NumPy).
Objects: Series (a single column, a vector) and DataFrame (a whole table, a matrix).
How to build a Series: pd.Series([...], index=[...]) -> each value has an explicit index, starting from 0 by default but changeable (different from NumPy arrays).
We can also build a Series directly from a dictionary using the Series constructor. In this way we store the column of populations while also keeping track of the row label "state".
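A minimal sketch of both constructions (the population numbers are made-up, illustrative values):

import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])      # explicit index instead of 0, 1, 2

population = pd.Series({"California": 39_000_000,        # illustrative values
                        "Texas": 29_000_000,
                        "New York": 20_000_000})
print(population["Texas"])                                # access by row label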

How to construct a DataFrame: build another column, then combine the Series into a DataFrame:

states = pd.DataFrame({...})
NB: the source Series need not be in the same order, but when we construct the DataFrame they will be correctly aligned based on the index (takeaway: there is no need to keep track of the order).
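A small sketch of this alignment behaviour (areas and populations are illustrative and deliberately given in different key orders):

import pandas as pd

population = pd.Series({"California": 39_000_000, "Texas": 29_000_000})
area = pd.Series({"Texas": 695_662, "California": 423_967})   # note the different order

states = pd.DataFrame({"population": population, "area": area})
print(states)   # rows are aligned by index label, not by insertion order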

27.02.24 ML Lucibello
Pandas has a columnar structure, i.e. each column is a NumPy array of its own.
The problem of missing data:
None -> dtype object (the most generic type in Python). In this way you have an array which can contain anything, which is not very useful since we'd like to restrict the types.
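A small sketch of this behaviour, together with the basic missing-data tools (mean imputation here is just one simple choice, related to the imputation mentioned below):

import numpy as np
import pandas as pd

a = np.array([1, None, 3])        # None forces dtype=object: too generic, slow
b = np.array([1, np.nan, 3])      # np.nan keeps a float dtype
print(a.dtype, b.dtype)           # object float64

s = pd.Series([1.0, np.nan, 3.0])
print(s.isnull())                 # boolean mask of missing entries
print(s.fillna(s.mean()))         # simple imputation: fill with the column mean
print(s.dropna())                 # or drop the missing entries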

Not in notebook: imputation, another procedure to deal with missing values.

29.02
Perceptron algorithm
