Lecture 1: Intro ML
Mauricio A. Álvarez
Textbooks
❑ Rogers and Girolami, A First Course in Machine Learning, Chapman and Hall/CRC Press, 2nd Edition, 2016.
❑ Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006.
❑ Hardt and Recht, Patterns, Predictions, and Actions: Foundations of Machine Learning, Princeton University Press, 2022.
❑ Zhang et al., Dive into Deep Learning, Cambridge University Press, 2023.
Contents
Machine learning
Definitions
Review of probability
Random variables
Discrete random variables
Continuous random variables
Additional comments
Machine learning or Statistical Learning
Examples of ML problems
❑ Stock market
❑ Recommendation systems
ML has contributed to advances in AI
❑ AlphaFold
Generative AI
❑ DALL-E
❑ ChatGPT
❑ GitHub Copilot
Basic definitions
❑ Handwritten digit recognition
❑ Variability
❑ An instance consists of the pair $(x, y)$, where $y$ is the label of the image $x$.
Supervised and unsupervised learning
❑ Supervised learning: we learn a mapping from inputs $x$ to outputs $y$ using a dataset of labelled pairs $(x, y)$.
❑ Unsupervised learning: we look for structure in the inputs $x$ alone, without labels $y$.
Olympic 100m Data
(Figure: winning time in seconds for the Olympic 100m, plotted against the year of the competition.)
Model
❑ We will use a linear model $f(x, \mathbf{w})$, where $y$ is the time in seconds and $x$ is the year of the competition:
$$y = w_1 x + w_0.$$
Objective function
❑ To fit $\mathbf{w} = (w_0, w_1)$, we minimise the sum of squared errors between the model and the observed times over the training data,
$$E(\mathbf{w}) = \sum_{n=1}^{N} \left(y_n - f(x_n, \mathbf{w})\right)^2.$$
Data and model
(Figure: the fitted line $f(x, \mathbf{w})$ overlaid on the Olympic 100m winning times in seconds.)
Predictions
❑ For the year 2012, the model predicts
$$y = f(x = 2012, \mathbf{w}) = w_1 x + w_0 = (-1.34 \times 10^{-2}) \times 2012 + 36.4 = 9.59.$$
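To make the fitting step concrete, here is a minimal Python sketch of the least-squares fit; the (year, seconds) pairs are illustrative stand-ins for the full Olympic dataset, so the fitted coefficients will not exactly match the values above.

```python
import numpy as np

# Illustrative (year, winning time in seconds) pairs; stand-ins for the
# Olympic 100m dataset used in the lecture.
x = np.array([1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008])
y = np.array([10.25, 9.99, 9.92, 9.96, 9.84, 9.87, 9.85, 9.69])

# Least-squares fit of the linear model y = w1*x + w0.
w1, w0 = np.polyfit(x, y, deg=1)
print(f"w1 = {w1:.4e}, w0 = {w0:.2f}")
print(f"prediction for 2012: {w1 * 2012 + w0:.2f} seconds")
```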
Main challenges of machine learning
❑ Poor-quality data.
❑ Irrelevant features.
Definition
❑ A random variable assigns a number to each outcome of a random experiment. For example, when flipping a coin, we assign the number 0 to the outcome "tails" and the number 1 to the outcome "heads".
Discrete and continuous random variables
❑ A random variable can be either discrete or continuous.
❑ Examples of continuous random variables include the time that a cyclist takes to finish the Tour de France, and the exchange rate between the British pound and the US dollar on June 30, 2023.
Notation
❑ We use uppercase letters to denote random variables, $X, Y, Z, \ldots$
❑ We use lowercase letters to denote the values that the random variable takes, $x, y, z, \ldots$
Probability mass function
❑ Properties:
1. $P(X = x_i) \geq 0$, $i = 1, \ldots, n$.
2. $\sum_{i=1}^{n} P(X = x_i) = 1$.
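As a minimal sketch (the probabilities below are arbitrary), a pmf can be stored as a dictionary and both properties checked directly:

```python
# Represent a pmf as a dict {value: probability} and verify both properties.
pmf = {1: 0.3, 2: 0.2, 3: 0.1, 4: 0.4}

assert all(p >= 0 for p in pmf.values())        # property 1: non-negativity
assert abs(sum(pmf.values()) - 1.0) < 1e-12     # property 2: sums to one
```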
Two discrete RVs
❑ In machine learning, we are usually interested in more than one random variable.
❑ Properties:
1. $P(X = x_i, Y = y_j) \geq 0$, $i = 1, \ldots, n$, $j = 1, \ldots, m$.
2. $\sum_{i=1}^{n} \sum_{j=1}^{m} P(X = x_i, Y = y_j) = 1$.
Rules of probability
❑ Marginal:
$$P(X = x_i) = \sum_{j=1}^{m} P(X = x_i, Y = y_j).$$
❑ Conditional:
$$P(X = x_i \mid Y = y_j) = \frac{P(X = x_i, Y = y_j)}{P(Y = y_j)}, \quad P(Y = y_j) \neq 0.$$
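A small Python sketch of both rules, using a made-up joint pmf stored as a 2-D array (rows index the values of X, columns the values of Y):

```python
import numpy as np

# Made-up joint pmf P(X, Y): rows are values of X, columns are values of Y.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.15],
                  [0.05, 0.25]])

p_x = joint.sum(axis=1)               # marginal P(X = x_i): sum over j
p_y = joint.sum(axis=0)               # marginal P(Y = y_j): sum over i
p_x_given_y1 = joint[:, 0] / p_y[0]   # conditional P(X | Y = y_1)

print(p_x, p_y, p_x_given_y1)
```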
How do we compute $P(X = x_i)$ from data?
❑ We can estimate $P(X = x_i)$ as the relative frequency of the outcome $x_i$:
$$P(X = x_i) = \lim_{N \to \infty} \frac{n_{X = x_i}}{N},$$
where $n_{X = x_i}$ is the number of times we observe $X = x_i$ in $N$ observations.
What about $P(X = x_i, Y = y_j)$ and $P(X = x_i \mid Y = y_j)$?
❑ We can follow a similar procedure to compute $P(X = x_i, Y = y_j)$:
$$P(X = x_i, Y = y_j) = \lim_{N \to \infty} \frac{n_{X = x_i, Y = y_j}}{N},$$
where $n_{X = x_i, Y = y_j}$ is the number of times we observe a simultaneous occurrence of $X = x_i$ and $Y = y_j$.
❑ To compute $P(X = x_i \mid Y = y_j)$, we can use the definition of conditional probability:
$$P(X = x_i \mid Y = y_j) = \frac{P(X = x_i, Y = y_j)}{P(Y = y_j)} \approx \frac{n_{X = x_i, Y = y_j}/N}{n_{Y = y_j}/N} = \frac{n_{X = x_i, Y = y_j}}{n_{Y = y_j}}.$$
❑ In the limit $N \to \infty$,
$$P(X = x_i \mid Y = y_j) = \lim_{N \to \infty} \frac{n_{X = x_i, Y = y_j}}{n_{Y = y_j}}.$$
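The counting argument can be simulated directly. In this sketch, the pair (X, Y) and its dependence structure are made up purely to illustrate the relative-frequency estimates:

```python
import numpy as np

# Draw N samples of a hypothetical pair (X, Y) and estimate joint and
# conditional probabilities from relative frequencies (counts).
rng = np.random.default_rng(0)
N = 100_000
x = rng.integers(0, 2, size=N)              # X takes values in {0, 1}
y = (x + rng.integers(0, 2, size=N)) % 2    # Y is correlated with X

n_joint = np.sum((x == 1) & (y == 1))       # n_{X=1, Y=1}
n_y = np.sum(y == 1)                        # n_{Y=1}

print("P(X=1, Y=1) ~", n_joint / N)
print("P(X=1 | Y=1) ~", n_joint / n_y)
```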
Examples of the different pmfs
(Figure: bar plots of a joint pmf $p(X, Y)$ for $Y = 1$ and $Y = 2$, together with the marginals $p(X)$ and $p(Y)$ and the conditional $p(X \mid Y = 1)$.)
Bayes theorem
$$P(X = x \mid Y = y) = \frac{P(Y = y \mid X = x)\, P(X = x)}{P(Y = y)}.$$
Example: Bayes theorem
There are two barrels in front of you. Barrel One contains 20 apples and 4
oranges. Barrel Two contains 4 apples and 8 oranges. You choose a barrel
randomly and select a fruit. It is an apple. What is the probability that the
barrel was Barrel One?
Answer (I)
❑ Let $B$ be the barrel chosen, taking values One or Two, and let $F$ be the fruit selected, with $F = A$ for an apple. We want $P(B = \text{One} \mid F = A)$.
Answer (II)
❑ The statement says "You choose a barrel randomly", which means that $P(B = \text{One}) = P(B = \text{Two}) = \frac{1}{2}$.
Answer (III)
❑ Using the sum rule of probability and the product rule of probability,
$$P(F = A) = \sum_{B} P(F = A, B) = \sum_{B} P(F = A \mid B)\, P(B) = P(F = A \mid B = \text{One}) P(B = \text{One}) + P(F = A \mid B = \text{Two}) P(B = \text{Two}).$$
❑ We then have $P(F = A) = \frac{20}{24} \cdot \frac{1}{2} + \frac{4}{12} \cdot \frac{1}{2} = \frac{7}{12}$.
Answer (IV)
❑ Finally, Bayes theorem gives
$$P(B = \text{One} \mid F = A) = \frac{P(F = A \mid B = \text{One})\, P(B = \text{One})}{P(F = A)} = \frac{(20/24)(1/2)}{7/12} = \frac{5}{7} \approx 0.71.$$
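The whole example can be checked in a few lines of Python using exact fractions:

```python
from fractions import Fraction

# Priors and likelihoods from the barrel example, as exact fractions.
p_b_one = Fraction(1, 2)             # P(B = One)
p_b_two = Fraction(1, 2)             # P(B = Two)
p_a_given_one = Fraction(20, 24)     # P(F = A | B = One)
p_a_given_two = Fraction(4, 12)      # P(F = A | B = Two)

p_a = p_a_given_one * p_b_one + p_a_given_two * p_b_two   # sum rule
posterior = p_a_given_one * p_b_one / p_a                 # Bayes theorem
print(posterior)                                          # 5/7
```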
Expected value and statistical moments
❑ The expected value (mean) of a discrete RV $X$ is $\mu_X = \mathbb{E}[X] = \sum_{i=1}^{n} x_i P(X = x_i)$.
❑ The variance is $\sigma_X^2 = \mathbb{E}[(X - \mu_X)^2] = \mathbb{E}[X^2] - \mu_X^2$.
Example: Expected values
❑ Consider the following discrete RV $X$ and its pmf. Compute $\mu_X$ and $\sigma_X^2$.

X      1    2    3    4
P(X)   0.3  0.2  0.1  0.4

❑ For the mean $\mu_X$, we have
$$\mu_X = \sum_{i=1}^{n} x_i P(X = x_i) = (1)(0.3) + (2)(0.2) + (3)(0.1) + (4)(0.4) = 2.6.$$
❑ For the variance, $\mathbb{E}[X^2] = (1)(0.3) + (4)(0.2) + (9)(0.1) + (16)(0.4) = 8.4$, so $\sigma_X^2 = \mathbb{E}[X^2] - \mu_X^2 = 8.4 - 2.6^2 = 1.64$.
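A quick numerical check of this worked example:

```python
import numpy as np

# Mean and variance of the pmf from the worked example above.
x = np.array([1, 2, 3, 4])
p = np.array([0.3, 0.2, 0.1, 0.4])

mu = np.sum(x * p)                  # E[X]
var = np.sum((x - mu) ** 2 * p)     # E[(X - mu)^2]
print(mu, var)                      # 2.6 1.64
```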
Notation
Probability density function
❑ Properties of a pdf:
1. $p_X(x) \geq 0$.
2. $\int_{-\infty}^{\infty} p_X(x)\, dx = 1$.
3. $P(X \leq a) = \int_{-\infty}^{a} p_X(x)\, dx$.
4. $P(a \leq X \leq b) = \int_{a}^{b} p_X(x)\, dx$.
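These properties can be verified numerically. The sketch below uses a standard Gaussian pdf, with a fine grid and a Riemann sum standing in for the integrals (the finite limits [-10, 10] approximate the infinite range):

```python
import numpy as np

# Numerical check of the pdf properties for a standard Gaussian.
x = np.linspace(-10, 10, 200_001)
dx = x[1] - x[0]
pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

assert np.all(pdf >= 0)              # property 1: non-negative
print(np.sum(pdf) * dx)              # property 2: ~ 1
print(np.sum(pdf[x <= 0]) * dx)      # property 3 with a = 0: ~ 0.5
mask = (x >= -1) & (x <= 1)
print(np.sum(pdf[mask]) * dx)        # property 4 with a = -1, b = 1: ~ 0.683
```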
Two continuous RVs
Rules of probability (continuous RVs)
❑ Sum rule of probability. In the case of continuous RVs, we replace the sums we had before with an integral:
$$p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x, y)\, dy.$$
❑ Conditional pdf:
$$p_{X|Y}(x \mid y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}.$$
Rearranged as $p_{X,Y}(x, y) = p_{X|Y}(x \mid y)\, p_Y(y)$, this last expression is known as the product rule of probability for two continuous RVs.
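A grid-based sketch of the sum rule: the joint pdf below is a made-up product of two independent standard Gaussians, and integrating it over y recovers a marginal that itself integrates to one:

```python
import numpy as np

# Made-up joint pdf of two independent standard Gaussians on a grid.
x = np.linspace(-6, 6, 1201)
y = np.linspace(-6, 6, 1201)
dy = y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")
joint = np.exp(-(X**2 + Y**2) / 2) / (2 * np.pi)

# Sum rule: p_X(x) = integral of p_{X,Y}(x, y) dy, here a Riemann sum.
p_x = joint.sum(axis=1) * dy
print(np.sum(p_x) * (x[1] - x[0]))   # ~ 1
```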
Bayes theorem and statistical independence
❑ Bayes theorem for continuous RVs:
$$p_{Y|X}(y \mid x) = \frac{p_{X|Y}(x \mid y)\, p_Y(y)}{p_X(x)}.$$
❑ $X$ and $Y$ are statistically independent if $p_{X,Y}(x, y) = p_X(x)\, p_Y(y)$.
Expected values and statistical moments
❑ For a continuous RV, the mean and variance are defined with integrals instead of sums: $\mu_X = \int_{-\infty}^{\infty} x\, p_X(x)\, dx$ and $\sigma_X^2 = \int_{-\infty}^{\infty} (x - \mu_X)^2\, p_X(x)\, dx$.
Notation
What if we don't have the pmf or pdf?
❑ There are advanced methods to model both pmfs and pdfs, but we will not study those in this module.
What about moments?
❑ We can estimate $\mu_X$ and $\sigma_X^2$ when we have access to observations $x_1, x_2, \ldots, x_N$ of the random variable $X$, but no access to the pmf or the pdf.
❑ In statistics, these are called "estimators" for $\mu_X$ and $\sigma_X^2$, denoted as $\hat{\mu}_X$ and $\hat{\sigma}_X^2$:
$$\hat{\mu}_X = \frac{1}{N} \sum_{k=1}^{N} x_k, \qquad \hat{\sigma}_X^2 = \frac{1}{N-1} \sum_{k=1}^{N} (x_k - \hat{\mu}_X)^2.$$
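As a final sketch, the estimators can be checked on samples drawn from the discrete distribution of the earlier worked example (mean 2.6, variance 1.64):

```python
import numpy as np

# Draw samples from the pmf of the earlier worked example and estimate
# its moments from the samples alone.
rng = np.random.default_rng(42)
samples = rng.choice([1, 2, 3, 4], size=10_000, p=[0.3, 0.2, 0.1, 0.4])

mu_hat = samples.mean()          # sample mean, (1/N) * sum of x_k
var_hat = samples.var(ddof=1)    # unbiased estimator with the 1/(N-1) factor
print(mu_hat, var_hat)           # should be close to 2.6 and 1.64
```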
What if we have more than two RVs?
❑ In ML, we are usually faced with problems where we have more than two RVs.
❑ The ideas we saw before extend to these cases; we will see some examples in the following lectures.