Aula 4 (L) - Oggi La Tua Lezione È in Presenza
Assume the ys are given by a function and you want to guess that function.
If the ys are stochastic: predict a probability distribution. Generalization: predict not only the optimal parameters for one specific probability distribution but also the whole family of probability distributions compatible with the data.
Given a function or probability distribution with unknown parameters, learning means finding out these parameters.
Empirical Risk Minimisation (empirical = on the given data, on which you perform computations) -> the main operation is an optimization process (min). Traditional ML took the data and transformed them so that the optimization became a convex problem, which is easier to solve.
Modern ML works the opposite way, with mostly non-convex functions. It's the machine which tells you which transformation is optimal to represent the data (learning what is relevant).
Different views on data: data shouldn't be viewed just as a worst-case scenario, since they are a typical scenario (assuming everything is worst case is why the learning process was built to obtain a convex problem).
Modern AI: the machine aims to represent any type of function and proceeds to find the optimal ones, which depend on many unknown parameters.
Functions: reduce errors on data points.
Stoch: maximize the likelihood
Key problems:
- space of possible models
- measurement of error (MSE, cross-entropy, …) define loss function (the object to
minimize)
- real objective: minimize the error on new data points (minimization on the given data needs to be useful for the machine when it encounters new data). If I have a lot of data and increase the degree of the function I can fit all the given points; however, the variance when we add a new point may be very big.
Modern AI: use linearity to pin the given points: in this way, you can use a high dimensional function that is still relevant for new data. Optimization is easier with more parameters too, but we still have generalization. Why is optimization good for many parameters: the larger the dimension, the more easily the data are classified (divided by hyperplanes), which means optimization becomes easier. AIM: stabilize the function while also minimizing the loss function.
LOSS FUNCTION for training: l: if y^ = y then l = 0, and if y^ != y then l > 0
Any function that is positive whenever the difference y^ - y is different from 0 is ok (absolute value, even powers 2, 4, …)
EX: L = |f(x) - y| (prediction - data)
EX2: use the square: large deviations are amplified by the square compared with the absolute value.
Choose a loss function such that the loss is very flat, so that if you move away from the true value it still works (on unseen data).
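A minimal sketch comparing the two example losses, to see how the square amplifies large deviations. The function names `abs_loss` and `squared_loss` and the sample deviations are illustrative choices, not from the lecture.

```python
import numpy as np

def abs_loss(y_pred, y):
    # L = |f(x) - y|: zero when the prediction equals the data, positive otherwise
    return np.abs(y_pred - y)

def squared_loss(y_pred, y):
    # L = (f(x) - y)^2: same zero, but deviations larger than 1 are amplified
    return (y_pred - y) ** 2

y = 1.0
for deviation in [0.0, 0.5, 1.0, 3.0]:
    y_pred = y + deviation
    print(deviation, abs_loss(y_pred, y), squared_loss(y_pred, y))
```

For deviations below 1 the squared loss is smaller than the absolute loss, for deviations above 1 it is larger.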
08.02.24
Take a data point x. Given a choice of probability density for the data, what is the probability of observing x for given parameters, P(X|Theta)? This is the object we want to maximize (find theta such that the probability is maximized). Use MLE: minimize Lx(theta) = - log P(X|Theta), which is the same as maximizing the likelihood.
Supervised learning
dataset: {(x1,y1), …, (xn,yn)}; predict y given x: in this case we have P(Y|X;Theta). We don't predict x and y jointly because of how the problem is defined.
In this case you construct the likelihood loss as Lx(Th) = -log P(Y|X;Th)
DEFINE: Y is the set of all ys, Y={y1,...,yn}, to predict when you're given X={x1,...,xn}: find a probability distribution
slide 66
We use the log transformation to make optimization easier (the product of likelihoods becomes a sum of log-likelihoods).
Ex1 linear function: fit the data using a linear function, adding e as some noise. Y follows a probability distribution because it is not deterministic: it is a function of the noise too.
Assume the noise is Gaussian (sometimes with unknown sigma)
y = f(x;th) + e, e random; f(x;th) deterministic. In this framework f(x;th) works as a shifter of the Gaussian, becoming its mean (0 + f(x;th)).
Therefore we obtain: P(y|x;th) = N(f(x;th), sigma)
This procedure is known in statistical physics as the spherical cow: assume f linear and noise Gaussian. It has a lot of limitations, even though the CLT explains the Gaussian part.
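A small sketch of this linear-plus-Gaussian-noise setting, under assumptions not in the notes: a one-dimensional x, a known sigma, and made-up true parameters. With Gaussian noise the negative log-likelihood is a sum of squared residuals plus a constant, so the MLE coincides with ordinary least squares, which is what `np.linalg.lstsq` computes.

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([2.0, -1.0])  # hypothetical slope and intercept
sigma = 0.1                         # assumed known noise level

x = rng.uniform(-1, 1, size=200)
X = np.column_stack([x, np.ones_like(x)])             # f(x; th) = th[0] * x + th[1]
y = X @ true_theta + sigma * rng.normal(size=x.size)  # y = f(x; th) + e

def neg_log_likelihood(theta):
    # -log prod_n N(y_n | f(x_n; theta), sigma^2)
    residuals = y - X @ theta
    constant = y.size * np.log(sigma * np.sqrt(2 * np.pi))
    return 0.5 * np.sum(residuals**2) / sigma**2 + constant

# Minimizing the sum of squared residuals minimizes the negative log-likelihood
theta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_mle, neg_log_likelihood(theta_mle))
```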
Bayes theorem (operative interpretation): P(x|th) is easy to compute (x is given, make an assumption on th). In this way the stochastic process producing y is easy to compute. On the other hand, P(th|x) is harder to compute, which is why we resort to Bayes' theorem. We can compute p(th|x) at the price of knowing p(th), which can also be very hard.
th_MAP = arg min {-sum_n log P(yn|xn,th) - log p(th)}. As the amount of data grows, the sum over the data points dominates and the relevance of the prior decreases.
This is maximum a posteriori (MAP) estimation: maximize the posterior probability of th given the data.
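A short derivation sketch of where the arg min expression comes from, assuming independent data points. The symbol \theta_{\mathrm{MAP}} is a label introduced here for the minimizer.

```latex
\begin{aligned}
p(\theta \mid X, Y) &= \frac{p(Y \mid X, \theta)\, p(\theta)}{p(Y \mid X)} \\
\theta_{\mathrm{MAP}} &= \arg\max_{\theta} \; p(Y \mid X, \theta)\, p(\theta)
  \qquad \text{(the denominator does not depend on } \theta\text{)} \\
&= \arg\min_{\theta} \Big\{ -\sum_{n} \log P(y_n \mid x_n, \theta) - \log p(\theta) \Big\}
\end{aligned}
```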
Assume the data are generated by a probability distribution. Assume we measure many independent data points -> compute the product of likelihoods p(x1,...,xn|th) = prod_n p(xn|th)
The results of the MLE are not surprising: I modelled the values with a Gaussian and computed the parameters of that Gaussian. However this is not a very efficient calculation. Limitation of MLE: no prior on the probability distribution.
EXERCISE: take the problem and add a prior p(th) using Bayes' theorem (remember the denominator p(x) is irrelevant for optimization). Imposing a good prior is very important to avoid getting a delta function (?)
When talking about fitting data and finding optimal parameters, remember that mistakes are still highly likely.
In the previous example the ball of probability distributions was Gaussian.
Imagine you find the answer to the optimization problem, but it may be very far from reality because of the assumptions. ML is about finding general probability distributions that can fit anything (through neural networks any function can be expressed)
Bayesian inference: imagine the data are generated by a 2-step process. Start from the space of all possible models. There are two levels of stochasticity: assume a probability distribution over functions (this needs a prior over parameters), and the function itself can still be stochastic. How can you compute the likelihood in this context?
Aim: in the space of functions determined by the parameters, we want to know where the joint probability of data and parameters is maximum.
Bayesian prediction: full probability distribution over parameters: to make predictions you need to take into account all models that have high probability, through integration.
example: choosing the best movie to see
asking one representative is like using MLE
taking a majority vote is like following the Bayesian model
Problem: there is no way to do the integral in a very high dimensional space. Bayesian inference is nice in theory but not useful in practice: integrating in high dimension is computationally impractical even if you know p(th)
recap p(th) = p(mth)
81. choose theta and pick a model in the space of models and generate many data points.
P(D) not relevant for estimation.
For N very large the prior becomes irrelevant! If you have a lot of data you can appreciate the full power of Bayes' theorem. There is a high chance of making mistakes when constructing a prior, but if you have a lot of data which are stochastically independent then the posterior will depend essentially only on p(x|th). Key point: with a lot of data points the contribution of the prior becomes irrelevant. Using the MLE approach is useful/right only with a lot of data points.
Using MLE (with no consideration of the prior) is ok if you have a lot of data. The problem would be solved at least from a statistical inference point of view. But the Bayesian approach is significant since it takes information from the whole probability distribution instead of being confined to only one value of theta.
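A toy illustration of the prior fading as N grows, under assumptions not in the notes: Gaussian data with known sigma, a Gaussian prior on the mean, and a deliberately wrong prior mean. In this conjugate case the MAP estimate has a closed form, so it can be compared directly with the MLE.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean, sigma = 3.0, 1.0       # hypothetical data-generating Gaussian
prior_mean, prior_std = 0.0, 1.0  # hypothetical (deliberately wrong) prior on the mean

def mle_and_map(x):
    # MLE of a Gaussian mean is the sample mean
    mle = x.mean()
    # MAP with a Gaussian prior: precision-weighted average of data and prior
    precision = x.size / sigma**2 + 1 / prior_std**2
    map_estimate = (x.sum() / sigma**2 + prior_mean / prior_std**2) / precision
    return mle, map_estimate

gaps = {}
for n in [10, 100, 10_000]:
    x = rng.normal(true_mean, sigma, size=n)
    mle, map_estimate = mle_and_map(x)
    gaps[n] = abs(mle - map_estimate)
    print(n, mle, map_estimate, gaps[n])
```

The gap between the two estimates shrinks roughly like 1/N, which is the sense in which the prior becomes irrelevant.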
Assume you have a noisy linear function and the data belong to a d-dimensional space. The number of parameters equals the dimension of the data. By lifting the dimension you gain a greater capability of discriminating points. Given a vector x of dimension d, through a non-linear transformation you create a new vector Phi(x) = (phi1(x), …, phin(x)) and use this as your new data. Each coordinate can depend on all coordinates of the original space. You get another dataset Dphi in a larger dimension, on which you can fit a probability distribution… (generalized model). The key point is to make this transformation as smart as possible. Data can be processed with a kernel, and then you perform inference and everything else.
We get
before: y = x^T.th + e; then: y = phi(x)^T.th + e
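A minimal sketch of lifting the data with a feature map, assuming a made-up one-dimensional dataset and a hand-picked polynomial map `phi` (both hypothetical). The model stays linear in the parameters, so the same least-squares fit applies before and after the lift.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=100)
y = x**2 + 0.05 * rng.normal(size=x.size)  # hypothetical non-linear target with noise

def phi(x):
    # Feature map: lift each scalar x to the vector (1, x, x^2)
    return np.column_stack([np.ones_like(x), x, x**2])

# Before: linear model on the original coordinate (plus a constant)
X_lin = np.column_stack([np.ones_like(x), x])
th_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)

# After: the same linear fit, but on the lifted data phi(x)
th_phi, *_ = np.linalg.lstsq(phi(x), y, rcond=None)

mse_lin = np.mean((X_lin @ th_lin - y) ** 2)
mse_phi = np.mean((phi(x) @ th_phi - y) ** 2)
print(mse_lin, mse_phi)
```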
b^T S = (S b)^T because S is symmetric
b1 is the eigenvector corresponding to lambda1
Result: build the covariance matrix from the data. Compute its eigenvalues. Compute the eigenvector of the biggest one and normalize it. It will give you the direction of maximum variance, which defines the low dimensional space. REVIEW EIGENVECTORS and EIGENVALUES.
Powers of the adjacency matrix count paths: entry (v, w) of A^k is the number of paths of length k from v to w.
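A quick check of the adjacency-matrix fact on a hypothetical 3-node triangle graph (the graph is an illustrative choice). Paths here may revisit nodes, i.e. they are walks.

```python
import numpy as np

# Undirected triangle: every node is connected to the other two
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])

A2 = np.linalg.matrix_power(A, 2)
A3 = np.linalg.matrix_power(A, 3)

# Entry (v, w) of A^k counts the paths of length k from v to w
print(A2[0, 0])  # 2: 0-1-0 and 0-2-0
print(A3[0, 0])  # 2: 0-1-2-0 and 0-2-1-0
```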
PCA steps: prepare the data, subtract the mean and divide by the standard deviation. Eigendecomposition of the covariance matrix. Compute the eigenvectors of the eigenvalues you want to keep and you can represent the data efficiently.
Example: a 794-dimensional vector is turned into a scalar, and then we can bring it back to x and get an image that is pretty close to the original one. However, the process makes all number 8s look similar, so it is not useful when trying to find out with great accuracy the person who wrote it, i.e. it waters down the image.
12.02 ML (Lucibello)
Unsupervised learning: data without labels
Notation:
- dataset D: collection of examples, D = {x^mu}_{mu=1}^{M} = {x^1, x^2, …, x^M}
mu indexes different data samples
Unsupervised learning task -> clustering. Assume x^mu belongs to R^2 (but generally R^d). Under this assumption we can represent the data on a plane. It may be that the data have a clustered structure. Aim: construct an algorithm that can find the clusters without the clusters being defined in advance.
Network/graph: a collection of nodes and edges.
Social network: people are different nodes and I grow an edge if two nodes know each other.
It may be a directed graph. Assume for simplicity it is undirected. If I know two people the
probability that they also know each other is high. We can set a clustering task on this type
of graph.
Example 2: analyze all products: I want to find the outliers without ground truth of what an outlier is. Graph wrt temperature
Example 3: dimensionality reduction. Data usually live in high-dimensional spaces. For example images, represented as arrays of numbers: for a 1024x1024 RGB image, x belongs to R^(1024x1024x3).
Dimensionality reduction: f: R^D -> R^D' with D' << D (D' = 2 for visualization)
Generative modelling: p(x) is the true distribution of cat images.
You observe x^1, …, x^M distributed according to p (i.i.d. samples from the distribution)
Task: generative modelling
Train a model pth(x) so that pth(x) is close to p (the true distribution).
Now I can generate new samples x distributed according to pth(x): cats that are not the same as the ones seen before (generalized).
OBJECT DETECTION
y=(e1, e2, w,h)
y^=fth (x)
IMAGE SEGMENTATION: identify the different segments of the image, e.g. sky. This is a per-pixel classification task (a more involved task)
REINFORCEMENT LEARNING: weak supervision. It's like training a dog! Rewarding the dog based on how it acts without being able to tell it how to act.
We have an actor and the environment: the actor acts in the environment
s^t (state at time t) belongs to the state space.
probability of moving to the next state given the previous one and the action performed by the actor: p(s^t+1 | s^t, a^t)
the actor is codified by another conditional probability (policy): pi(a^t | s^t)
reward: r(s^t, a^t)
Max over pi of E(SUM over t of gamma^t r(s^t, a^t)), where gamma is a discount factor
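A tiny sketch of the discounted sum inside the expectation, for one already-observed sequence of rewards. The function name `discounted_return` and the reward values are illustrative; the expectation over trajectories and the maximization over the policy are not shown.

```python
def discounted_return(rewards, gamma):
    # SUM over t of gamma^t * r_t for a single trajectory
    total = 0.0
    for t, reward in enumerate(rewards):
        total += gamma**t * reward
    return total

rewards = [1.0, 0.0, 2.0]  # hypothetical rewards at t = 0, 1, 2
print(discounted_return(rewards, gamma=0.5))  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```

With gamma < 1, later rewards count less than earlier ones.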
DIRTY BOOK: Hands-on ML with Scikit-Learn, Keras & Tensorflow!!!!!!!!!!!!! WOW!!
13.02 ZECCHINA
Compressing signals: see notes
Ex. information biology (uses information theory on DNA, redundant code)
Lect 3 slide 12: the aim is to find a projection of the data points
Take the projection with direction b. Find the direction such that the variance of the projection is maximal
Since we subtracted the mean, it is now 0 and the variance loses its second term
Slide 13
Maximize the variance while keeping b a unit vector (norm 1). Find the direction that maximizes the variance of the projection of the data matrix using the Lagrange multiplier technique
For constrained optimization problems, optimize f(x) + lambda g(x): this transforms the constrained problem into a smooth optimization problem (solvable in principle using derivatives)
in this case g = (norm(b))^2 - 1 = 0
Recall derivatives with respect to vectors.
The maximization problem becomes an eigensystem problem (lambda eigenvalue, b1 eigenvector)
Instead of checking the second derivative, put Sb = lambda b inside the variance -> b^T S b = lambda; since it is a maximization I have to choose the largest eigenvalue.
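A sketch of the Lagrange multiplier step written out, assuming zero-mean data with covariance matrix S. The constraint is written here as 1 - b^T b = 0 (the same constraint as norm(b)^2 - 1 = 0, with the sign chosen so that the multiplier comes out directly as the eigenvalue).

```latex
\begin{aligned}
\mathcal{L}(b, \lambda) &= b^{T} S b + \lambda\,(1 - b^{T} b) \\
\frac{\partial \mathcal{L}}{\partial b} &= 2 S b - 2 \lambda b = 0
  \;\Longrightarrow\; S b = \lambda b \\
\frac{\partial \mathcal{L}}{\partial \lambda} &= 1 - b^{T} b = 0 \\
b^{T} S b &= \lambda\, b^{T} b = \lambda
\end{aligned}
```

The variance of the projection equals the eigenvalue, so the maximum is reached at the eigenvector of the largest eigenvalue.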
Compute the eigenvectors, normalize them and decide the dimension of the reduction, e.g. 5: take the first five and project to z, where z1 = x^T.b1, … (project in the direction of the first eigenvector: a change of variable) (?) For the second, third direction -> same reasoning
PCA steps: subtract the mean (the calculations assume mean = 0; avoidable, but it makes the calculations more complicated); standardization; eigendecomposition of the covariance matrix; projection (decide the dimension, ...). By applying the inverse transformation you can go back
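The PCA steps can be sketched with NumPy as follows. This is a minimal version on made-up toy data: the standardization step is skipped for brevity, and the variable names (`B`, `Z`, `X_back`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical toy data: 500 points in 3 dimensions, most variance along one direction
latent = rng.normal(size=(500, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(500, 3))

# Step 1: subtract the mean
mean = X.mean(axis=0)
Xc = X - mean

# Step 2: eigendecomposition of the covariance matrix S = 1/N sum_n xn xn^T
S = Xc.T @ Xc / Xc.shape[0]
eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns ascending eigenvalues, unit eigenvectors
order = np.argsort(eigvals)[::-1]     # reorder so the largest eigenvalue comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 3: projection on the first M eigenvectors
M = 1
B = eigvecs[:, :M]
Z = Xc @ B                            # z_n = B^T x_n for every data point

# Inverse transformation: back to the original space (approximately)
X_back = Z @ B.T + mean
print(eigvals, np.mean(np.sum((X - X_back) ** 2, axis=1)))
```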
Slide 20 (at least 300 dim are useless)
PCA compression perspective
??
Slide 25
Take the projections of xn in the directions bi.
Consider the decomposition of x in the new orthonormal basis: the first M components are the projections of the data point, the remaining d - M components are constants zeta_i that don't depend on the data point. So the first part depends on the data point (projection of the original vector on the new basis up to M), while the second part is the same for every data point. For M small there is a bigger reconstruction error.
Reconstruction error: average squared norm of the difference between the original vector and the compressed one
xn = sum from i = 1 to d of (projection of xn in direction bi) times bi.
x~n (compressed vector) = sum from i = 1 to M of the same thing + sum from i = M+1 to d of zeta_i times bi (which we need to obtain a d-dimensional vector).
In the difference xn - x~n the terms up to M cancel
find bi and zeta_i so that the reconstruction error is minimized, for the chosen M
we need to identify {bi, zeta_i} -> take derivatives
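A numerical sketch of how the reconstruction error depends on M, on made-up toy data. It assumes mean-centered data, so the constant components are zero, and it checks the standard result that the minimal error equals the sum of the discarded eigenvalues of S.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))  # hypothetical correlated toy data
Xc = X - X.mean(axis=0)                                  # centered data

S = Xc.T @ Xc / Xc.shape[0]
eigvals, B = np.linalg.eigh(S)
eigvals, B = eigvals[::-1], B[:, ::-1]                   # largest eigenvalue first

def reconstruction_error(M):
    # Keep the projections on the first M directions, drop the rest
    B_M = B[:, :M]
    X_rec = Xc @ B_M @ B_M.T
    # Average squared norm of the difference between original and compressed vectors
    return np.mean(np.sum((Xc - X_rec) ** 2, axis=1))

errors = [reconstruction_error(M) for M in range(6)]
for M, err in enumerate(errors):
    print(M, err, eigvals[M:].sum())
```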
NOTES
Why did we choose to measure the error with the expression 1/N…? All calculations follow from this choice. The quantity is a squared distance, so the optimum is a minimum. We could choose any function with the same property. The choice is arbitrary but it simplifies the calculations. This argument applies to any measure of deviation (e.g. variance).
slide 28
log of a vector V^T scal prod V (taking log transformation)
second-to-last line: put 1/N sum over n inside
1/N sum_n (xn - xbar)(xn - xbar)^T = S
definition of S for zero-mean data: S = 1/N sum_n xn xn^T
check that they are equal: the other terms cancel out
Autoencoders: start from a high-dimensional input, reduce it to a small dimension and reconstruct the original. The hidden layer z is obtained by a non-linear transformation.
CLUSTERING!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! <3!!!!
Unsupervised learning
Put data together such that similar data are in the same group.
Recall the Hamming distance dH to compare strings. Used to compare DNA.
Example: patients of whom you know the symptoms. The aim is to group patients with similar symptoms. The same procedure is used by Netflix to recommend shows.
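The Hamming distance recalled above counts the positions at which two equal-length strings differ. A minimal sketch (the function name and the DNA-like strings are illustrative):

```python
def hamming_distance(a: str, b: str) -> int:
    # The Hamming distance is only defined for strings of equal length
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    # Count the positions where the two strings differ
    return sum(char_a != char_b for char_a, char_b in zip(a, b))

print(hamming_distance("GATTACA", "GACTATA"))  # prints 2
```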
• [Cluster 1 slide] Given an input, what you want to do is put together points that are similar or connected => indeed, we see in the desired clustering image that the two spirals are separated and some points are grouped
Ask for notes
Hierarchical Clusters =>
• Step 1: each point is a cluster
• Step 2: find the pair of clusters that have minimum distance
• Step 3: you merge them
• Step 4: repeat steps 2 & 3
Single linkage: able to put together data that are similar due to some stringy shape of the cluster (e.g. many pictures of the same person from the same angle)
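The four steps can be sketched as a naive agglomerative loop. This is an illustrative, unoptimized version: it assumes Euclidean distance and single linkage (distance between clusters = distance between their two closest points), stops at a chosen number of clusters, and uses made-up points.

```python
import numpy as np

def single_linkage(points, n_clusters):
    # Step 1: each point is its own cluster (clusters hold point indices)
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(c1, c2):
        # Single linkage: distance between the two closest members
        return min(np.linalg.norm(points[i] - points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        # Step 2: find the pair of clusters with minimum distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Step 3: merge them
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        # Step 4: the while loop repeats steps 2 and 3
    return clusters

points = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 5.0]])
clusters = single_linkage(points, n_clusters=2)
print(clusters)
```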
Example:
See python
#PANDAS: 26.02.24; Carlo Lucibello
# In [Part 2](02.00-Introduction-to-NumPy.ipynb), we dove into detail on NumPy and its
`ndarray` object, which enables efficient storage and manipulation of dense typed arrays in
Python.
# Here we'll build on this knowledge by looking in depth at the data structures provided by
the Pandas library.
# Pandas is a newer package built on top of NumPy that provides an efficient
implementation of a `DataFrame`.
# `DataFrame`s are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data.
# As well as offering a convenient storage interface for labeled data, Pandas implements a
number of powerful data operations familiar to users of both database frameworks and
spreadsheet programs.
# As we've seen, NumPy's `ndarray` data structure provides essential features for the type
of clean, well-organized data typically seen in numerical computing tasks.
# While it serves this purpose very well, its limitations become clear when we need more
flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting
operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.),
each of which is an important piece of analyzing the less structured data available in many
forms in the world around us.
# Pandas, and in particular its `Series` and `DataFrame` objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.
# In this part of the book, we will focus on the mechanics of using `Series`, `DataFrame`, and
related structures effectively.
# We will use examples drawn from real datasets where appropriate, but these examples are not necessarily the focus.
# At a very basic level, Pandas objects can be thought of as enhanced versions of NumPy
structured arrays in which the rows and columns are identified with labels rather than simple
integer indices.
# As we will see during the course of this chapter, Pandas provides a host of useful tools,
methods, and functionality on top of the basic data structures, but nearly everything that
follows will require an understanding of what these structures are.
# Thus, before we go any further, let's take a look at these three fundamental Pandas data
structures: the `Series`, `DataFrame`, and `Index`.
# We will start our code sessions with the standard NumPy and Pandas imports:
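The import cell itself is missing from these notes. The conventional form, using the usual `np` and `pd` aliases, would be:

```python
import numpy as np
import pandas as pd
```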
# How to efficiently store data. Most prominent framework since the 1970s: relational databases
# Relational databases: a collection of data is a collection of tables.
Mathematical formalism: relational algebra (formalizes the operations you can do). (DSL = Domain Specific Language, e.g. SQL.) In this course we stick to Python and use pandas for the same purpose as SQL.
This is not the only way to store data: there exist non-relational databases (NoSQL) (graph-based, document-based…)
Pandas (built on NumPy)
objects: Series (single column, vector), DataFrames (whole table, matrix)
How to build a Series: pd.Series([...], index=[...]) -> for each value there is an explicit index, starting from 0 by default, but which can be changed (different from arrays).
We can build a Series directly from a dictionary using the Series constructor. In this way we store the column "population" while also keeping track of the row label "state".
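A short sketch of the three ways of building a `Series` mentioned here. The values, labels and population figures are made up for illustration.

```python
import pandas as pd

# Default index: 0, 1, 2, ...
s_default = pd.Series([0.25, 0.5, 0.75])

# Explicit index: labels replace the default integer positions
s_labeled = pd.Series([0.25, 0.5, 0.75], index=["a", "b", "c"])

# From a dictionary: keys become the index (state), values the column (population)
population = pd.Series({"California": 39_000_000, "Texas": 29_000_000})  # hypothetical figures

print(s_default.index.tolist())
print(s_labeled["b"])
print(population["Texas"])
```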
27.02.24 ML Lucibello
Pandas has a columnar structure, i.e., each column is a NumPy array on its own
Problem of missing data:
None -> object (the most generic type in Python). In this way you have an array which can contain anything, which is not very useful since we'd like to restrict types.
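A quick check of the dtype issue, as a sketch. The comparison with `np.nan` and with a pandas `Series` is an addition for illustration and goes slightly beyond what the notes state.

```python
import numpy as np
import pandas as pd

# None forces NumPy to fall back to the generic object dtype
with_none = np.array([1, None, 3])

# NaN is a float, so the array keeps a numeric dtype
with_nan = np.array([1, np.nan, 3])

# In a numeric Series, pandas converts None to NaN
series = pd.Series([1, None, 3])

print(with_none.dtype, with_nan.dtype, series.dtype)
```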
29.02
Perceptron algorithm