Lecture 05 Reasoning Under Uncertainty


Vietnam National University of HCMC

International University
School of Computer Science and Engineering

INTRODUCTION TO ARTIFICIAL INTELLIGENCE


(IT097IU)
LECTURE 05: REASONING UNDER UNCERTAINTY

Instructor: Nguyen Trung Ky


Our Status in Intro to AI
 We’re done with Part I: Intelligent Agents and Search!

 Part II: Reasoning Under Uncertainty and Machine Learning
 Diagnosis
 Speech recognition
 Tracking objects
 Robot mapping
 Genetics
 Spell corrector
 … lots more!
Inference in Ghostbusters
 A ghost is in the grid somewhere
 Sensor readings tell how close a square is to the ghost
   On the ghost: red
   1 or 2 away: orange
   3 or 4 away: yellow
   5+ away: green

 Sensors are noisy, but we know P(Color | Distance)

[Demo: Ghostbuster – no probability (L12D1) ]


Video of Demo Ghostbuster – No probability
Uncertainty
 General situation:
   Observed variables (evidence): Agent knows certain things about the state of the world (e.g., sensor readings or symptoms)
   Unobserved variables: Agent needs to reason about other aspects (e.g., where an object is or what disease is present)
   Model: Agent knows something about how the known variables relate to the unknown variables

 Reasoning under uncertainty:
   A rational agent is one that makes rational decisions in order to maximize its performance measure
   A rational decision depends on the likelihood of outcomes and the degree of belief that they will be achieved
   Probability theory is the main tool for handling degrees of belief and uncertainty
Today
 Probability
 Random Variables and Events
 Joint and Marginal Distributions
 Conditional Distribution
 Product Rule, Chain Rule, Bayes’ Rule
 Inference
 Independence

 You’ll need all this stuff A LOT for the next few weeks, so make sure you go over it now!
Random Variables
 A random variable is some aspect of the world about
which we (may) have uncertainty
 L = Where is the ghost?
 R = Is it raining?
 T = Is it hot or cold?
 D = How long will it take to drive to work?

 We denote random variables with capital letters

 Like variables in a CSP, random variables have domains
   L in possible locations, maybe {(0,0), (0,1), …}
   R in {true, false} (often written as {+r, -r})
   T in {hot, cold}
   D in [0, ∞)
Probability Distributions
 Associate a probability with each value

 Temperature:             Weather:

   T     P                  W       P
   hot   0.5                sun     0.6
   cold  0.5                rain    0.1
                            fog     0.3
                            meteor  0.0
Probability Distributions
 Unobserved random variables have distributions

   T     P                  W       P
   hot   0.5                sun     0.6
   cold  0.5                rain    0.1
                            fog     0.3
                            meteor  0.0

 Shorthand notation: P(hot) = P(T = hot), OK if all domain entries are unique

 A distribution is a TABLE of probabilities of values

 A probability (lower-case value) is a single number

 Must have:  P(X = x) ≥ 0 for every value x,  and  Σ_x P(X = x) = 1


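As a quick illustration (not from the slides), here is a minimal Python sketch that stores the Weather distribution above as a dictionary and checks the two conditions, nonnegativity and summing to one:

```python
# Hypothetical helper, not part of the lecture code: a distribution as a dict.
W = {"sun": 0.6, "rain": 0.1, "fog": 0.3, "meteor": 0.0}

def is_valid_distribution(dist, tol=1e-9):
    """Check P(x) >= 0 for every value and that the probabilities sum to 1."""
    non_negative = all(p >= 0 for p in dist.values())
    sums_to_one = abs(sum(dist.values()) - 1.0) < tol
    return non_negative and sums_to_one

print(is_valid_distribution(W))   # True
```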
Joint Probability Distributions
 A joint probability distribution (JPD) over a set of random variables X1, …, Xn
  specifies a real number for each assignment (or outcome):  P(X1 = x1, X2 = x2, …, Xn = xn)

   T     W     P
   hot   sun   0.4
   hot   rain  0.1
   cold  sun   0.2
   cold  rain  0.3

 Must obey:  P(x1, …, xn) ≥ 0  and  the probabilities over all assignments sum to 1

 Size of distribution if n variables with domain sizes d?  d^n entries

 For all but the smallest distributions, impractical to write out!
Events
 An event is a set E of outcomes; its probability is the sum of the probabilities of the outcomes in E

   T     W     P
   hot   sun   0.4
   hot   rain  0.1
   cold  sun   0.2
   cold  rain  0.3

 From a joint distribution, we can calculate the probability of any event
   Probability that it’s hot AND sunny?
   Probability that it’s hot?
   Probability that it’s hot OR sunny?

 Typically, the events we care about are partial assignments, like P(T = hot)
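To make the event computation concrete, here is a small Python sketch (an assumed representation, not course code) that stores the joint P(T, W) as a dict keyed by (t, w) tuples and sums over outcome sets:

```python
# The joint P(T, W) from the table on this slide, keyed by (t, w) tuples.
joint_TW = {
    ("hot", "sun"): 0.4, ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2, ("cold", "rain"): 0.3,
}

def prob_event(joint, event):
    """P(E) = sum of P(outcome) over the outcomes in the event set E."""
    return sum(p for outcome, p in joint.items() if outcome in event)

# Hot AND sunny: the single outcome (hot, sun)
print(prob_event(joint_TW, {("hot", "sun")}))                                     # 0.4
# Hot: both outcomes with T = hot
print(prob_event(joint_TW, {("hot", "sun"), ("hot", "rain")}))                    # 0.5
# Hot OR sunny: union of the two partial assignments
print(prob_event(joint_TW, {("hot", "sun"), ("hot", "rain"), ("cold", "sun")}))   # ≈ 0.7
```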
Quiz: Events
   X    Y    P
   +x   +y   0.2
   +x   -y   0.3
   -x   +y   0.4
   -x   -y   0.1

 P(+x, +y) ?
 P(+x) ?
 P(-y OR +x) ?
Marginal Distributions
 Marginal distributions are sub-tables which eliminate variables
 Marginalization (summing out): combine collapsed rows by adding

   T     W     P
   hot   sun   0.4
   hot   rain  0.1
   cold  sun   0.2
   cold  rain  0.3

   P(T):                  P(W):
   T     P                W     P
   hot   0.5              sun   0.6
   cold  0.5              rain  0.4
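Marginalization is a one-pass sum over the joint table. A minimal sketch (assumed tuple-keyed dictionary representation):

```python
from collections import defaultdict

joint_TW = {
    ("hot", "sun"): 0.4, ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2, ("cold", "rain"): 0.3,
}

def marginal(joint, keep_index):
    """Sum out every variable except the one at keep_index in the outcome tuple."""
    result = defaultdict(float)
    for outcome, p in joint.items():
        result[outcome[keep_index]] += p
    return dict(result)

print(marginal(joint_TW, 0))   # P(T): {'hot': 0.5, 'cold': 0.5}
print(marginal(joint_TW, 1))   # P(W): ≈ {'sun': 0.6, 'rain': 0.4}
```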
Quiz: Marginal Distributions
   X    Y    P
   +x   +y   0.2
   +x   -y   0.3
   -x   +y   0.4
   -x   -y   0.1

   X    P                 Y    P
   +x                     +y
   -x                     -y
Conditional Probabilities
 A simple relation between joint and conditional probabilities
 In fact, this is taken as the definition of a conditional probability:

   P(a | b) = P(a, b) / P(b)

   T     W     P
   hot   sun   0.4
   hot   rain  0.1
   cold  sun   0.2
   cold  rain  0.3

              W = sun   W = rain   P(T)
   T = hot      0.4       0.1       0.5
   T = cold     0.2       0.3       0.5
   P(W)         0.6       0.4       1
Conditional Distributions
 Conditional distributions are probability distributions over some variables given fixed values of others

 Joint distribution P(T, W):
   T     W     P
   hot   sun   0.4
   hot   rain  0.1
   cold  sun   0.2
   cold  rain  0.3

 Conditional distributions:
   P(W | T = hot):          P(W | T = cold):
   W     P                  W     P
   sun   0.8                sun   0.4
   rain  0.2                rain  0.6
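A small sketch that computes P(W | T = t) from the joint table by applying the definition P(w | t) = P(t, w) / P(t):

```python
joint_TW = {
    ("hot", "sun"): 0.4, ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2, ("cold", "rain"): 0.3,
}

def conditional_W_given_T(joint, t):
    """P(W | T = t) = P(t, w) / P(t), using the definition of conditional probability."""
    p_t = sum(p for (ti, _), p in joint.items() if ti == t)           # marginal P(T = t)
    return {w: p / p_t for (ti, w), p in joint.items() if ti == t}

print(conditional_W_given_T(joint_TW, "hot"))    # {'sun': 0.8, 'rain': 0.2}
print(conditional_W_given_T(joint_TW, "cold"))   # {'sun': 0.4, 'rain': 0.6}
```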
Quiz: Conditional Probabilities
   X    Y    P
   +x   +y   0.2
   +x   -y   0.3
   -x   +y   0.4
   -x   -y   0.1

          Y = +y   Y = -y   P(X)
 X = +x     0.2      0.3     0.5
 X = -x     0.4      0.1     0.5
 P(Y)       0.6      0.4     1

 P(+x | +y) ?
 P(-x | +y) ?
 P(-y | +x) ?
Normalization Trick
 A trick to get a whole conditional distribution at once:
   Select the joint probabilities matching the evidence
   Normalize the selection (make it sum to one)

 Joint P(T, W):          Select (W = rain):      Normalize → P(T | W = rain):
   T     W     P           T     W     P           T     P
   hot   sun   0.4         hot   rain  0.1         hot   0.25
   hot   rain  0.1         cold  rain  0.3         cold  0.75
   cold  sun   0.2
   cold  rain  0.3

 Why does this work? Sum of selection is P(evidence)! (P(r), here)
To Normalize
 (Dictionary) To bring or restore to a normal condition

   All entries sum to ONE

 Procedure:
   Step 1: Compute Z = sum over all entries
   Step 2: Divide every entry by Z

 Example 1 (Z = 0.5):
   W     P       normalizes to     W     P
   sun   0.2                       sun   0.4
   rain  0.3                       rain  0.6

 Example 2 (Z = 50):
   T     W     P    normalizes to    T     W     P
   hot   sun   20                    hot   sun   0.4
   hot   rain   5                    hot   rain  0.1
   cold  sun   10                    cold  sun   0.2
   cold  rain  15                    cold  rain  0.3
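The procedure is two lines of code. A minimal sketch applied to both examples above:

```python
def normalize(table):
    """Step 1: Z = sum over all entries.  Step 2: divide every entry by Z."""
    z = sum(table.values())
    return {k: v / z for k, v in table.items()}

# Example 1: a selected sub-table that sums to Z = 0.5
print(normalize({"sun": 0.2, "rain": 0.3}))     # {'sun': 0.4, 'rain': 0.6}

# Example 2: raw counts with Z = 50 become probabilities
counts = {("hot", "sun"): 20, ("hot", "rain"): 5, ("cold", "sun"): 10, ("cold", "rain"): 15}
print(normalize(counts))                        # 0.4, 0.1, 0.2, 0.3
```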
Probabilistic Models
 A probabilistic model is a joint distribution over a set of variables

 Inference: given a joint distribution, we can reason about unobserved variables given observations (evidence)

 General form of a query:  P(Q | e1, …, ek)

 This conditional distribution is called a posterior distribution or the belief function of an agent which uses this model
Probabilistic Inference
 Probabilistic inference: compute a desired probability
from other known probabilities (e.g. conditional from
joint)

 We generally compute conditional probabilities


 P(on time | no reported accidents) = 0.90
 These represent the agent’s beliefs given the evidence

 Probabilities change with new evidence:


 P(on time | no accidents, 5 a.m.) = 0.95
 P(on time | no accidents, 5 a.m., raining) = 0.80
 Observing new evidence causes beliefs to be updated
Inference by Enumeration
 General case:
   Evidence variables:  E1, …, Ek  with observed values  e1, …, ek
   Query* variable:     Q
   Hidden variables:    H1, …, Hr
   (Together these are all the variables of the model)
   * Works fine with multiple query variables, too

 We want:  P(Q | e1, …, ek)

 Step 1: Select the entries consistent with the evidence
 Step 2: Sum out H to get the joint of Query and evidence
 Step 3: Normalize
Inference by Enumeration
   S       T     W     P
   summer  hot   sun   0.30
   summer  hot   rain  0.05
   summer  cold  sun   0.10
   summer  cold  rain  0.05
   winter  hot   sun   0.10
   winter  hot   rain  0.05
   winter  cold  sun   0.15
   winter  cold  rain  0.20

 P(W)?
 P(W | winter)?
 P(W | winter, hot)?
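The three steps can be written as one short, generic routine. A minimal sketch (assumed dictionary representation of the joint; variables addressed by position, 0 = S, 1 = T, 2 = W):

```python
joint_STW = {
    ("summer", "hot", "sun"): 0.30, ("summer", "hot", "rain"): 0.05,
    ("summer", "cold", "sun"): 0.10, ("summer", "cold", "rain"): 0.05,
    ("winter", "hot", "sun"): 0.10, ("winter", "hot", "rain"): 0.05,
    ("winter", "cold", "sun"): 0.15, ("winter", "cold", "rain"): 0.20,
}

def enumerate_query(joint, query_index, evidence):
    """P(Query | evidence): select rows consistent with the evidence (Step 1),
    sum out the hidden variables (Step 2), then normalize (Step 3)."""
    selected = {}
    for outcome, p in joint.items():
        if all(outcome[i] == v for i, v in evidence.items()):
            q = outcome[query_index]
            selected[q] = selected.get(q, 0.0) + p
    z = sum(selected.values())
    return {q: p / z for q, p in selected.items()}

print(enumerate_query(joint_STW, 2, {}))                        # P(W): sun 0.65, rain 0.35
print(enumerate_query(joint_STW, 2, {0: "winter"}))             # P(W | winter): 0.5 / 0.5
print(enumerate_query(joint_STW, 2, {0: "winter", 1: "hot"}))   # P(W | winter, hot): ≈ 0.67 / 0.33
```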
Inference by Enumeration

 Obvious problems:
 Worst-case time complexity O(d^n)
 Space complexity O(d^n) to store the joint distribution
 Solution
 Better techniques
 Better representation
 Simplifying assumptions

Bayesian Network
Bayes’ Rule

 Two ways to factor a joint distribution over two variables:

   P(x, y) = P(x | y) P(y) = P(y | x) P(x)

 Dividing, we get Bayes’ rule:

   P(x | y) = P(y | x) P(x) / P(y)

 Why is this at all helpful?
   Lets us build one conditional from its reverse
   Often one conditional is tricky but the other one is simple
   Foundation of many systems we’ll see later (e.g. ASR, MT)

 In the running for most important AI equation!


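As a concrete sketch, the numbers from the earlier P(T, W) example can be pushed through Bayes’ rule to recover P(T | W = rain), matching the normalization-trick result:

```python
# Prior and the "easy direction" conditional, consistent with the earlier P(T, W) table.
P_T = {"hot": 0.5, "cold": 0.5}
P_W_given_T = {"hot": {"sun": 0.8, "rain": 0.2}, "cold": {"sun": 0.4, "rain": 0.6}}

def bayes(P_T, P_W_given_T, w):
    """P(t | w) = P(w | t) P(t) / P(w), with P(w) = sum over t of P(w | t) P(t)."""
    unnormalized = {t: P_W_given_T[t][w] * P_T[t] for t in P_T}
    p_w = sum(unnormalized.values())
    return {t: p / p_w for t, p in unnormalized.items()}

print(bayes(P_T, P_W_given_T, "rain"))   # ≈ {'hot': 0.25, 'cold': 0.75}
```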
The Product Rule
 Sometimes have conditional distributions but want the joint:

   P(x, y) = P(x | y) P(y)

 Example: combine P(W) and P(D | W) to get P(D, W)

   P(W):             P(D | W):              P(D, W):
   W     P           D    W     P           D    W     P
   sun   0.8         wet  sun   0.1         wet  sun   0.08
   rain  0.2         dry  sun   0.9         dry  sun   0.72
                     wet  rain  0.7         wet  rain  0.14
                     dry  rain  0.3         dry  rain  0.06
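The same example in a few lines of Python (a sketch; the dictionary names are mine, not from the lecture):

```python
P_W = {"sun": 0.8, "rain": 0.2}
P_D_given_W = {"sun": {"wet": 0.1, "dry": 0.9}, "rain": {"wet": 0.7, "dry": 0.3}}

# Product rule: P(d, w) = P(d | w) * P(w)
P_DW = {(d, w): P_D_given_W[w][d] * P_W[w] for w in P_W for d in P_D_given_W[w]}
print(P_DW)   # ≈ {('wet','sun'): 0.08, ('dry','sun'): 0.72, ('wet','rain'): 0.14, ('dry','rain'): 0.06}
```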
The Chain Rule
 More generally, can always write any joint distribution as an incremental product of conditional distributions:

   P(x1, x2, x3) = P(x1) P(x2 | x1) P(x3 | x1, x2)

   P(x1, x2, …, xn) = Π_i P(xi | x1, …, xi-1)

 Why is this always true?


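A short numerical check (a sketch, reusing the P(S, T, W) table from the inference-by-enumeration example earlier) that the chain-rule factorization reproduces every entry of the joint:

```python
# Verify P(s, t, w) = P(s) * P(t | s) * P(w | s, t) entry by entry.
joint_STW = {
    ("summer", "hot", "sun"): 0.30, ("summer", "hot", "rain"): 0.05,
    ("summer", "cold", "sun"): 0.10, ("summer", "cold", "rain"): 0.05,
    ("winter", "hot", "sun"): 0.10, ("winter", "hot", "rain"): 0.05,
    ("winter", "cold", "sun"): 0.15, ("winter", "cold", "rain"): 0.20,
}

def total(pred):
    """Sum the joint over all outcomes satisfying a predicate (marginalization)."""
    return sum(p for outcome, p in joint_STW.items() if pred(outcome))

for (s, t, w), p in joint_STW.items():
    p_s = total(lambda o: o[0] == s)                        # P(s)
    p_t_given_s = total(lambda o: o[:2] == (s, t)) / p_s    # P(t | s) = P(s, t) / P(s)
    p_w_given_st = p / total(lambda o: o[:2] == (s, t))     # P(w | s, t) = P(s, t, w) / P(s, t)
    assert abs(p - p_s * p_t_given_s * p_w_given_st) < 1e-12
print("chain rule reproduces every entry of the joint")
```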
Independence
 Two variables are independent if:

   P(x, y) = P(x) P(y)   for all x, y

 This says that their joint distribution factors into a product of two simpler distributions

 Another form:

   P(x | y) = P(x)   for all x, y

 We write:  X ⊥ Y

 Independence is a simplifying modeling assumption
   Empirical joint distributions: at best “close” to independent
   What could we assume for {Weather, Traffic, Cavity, Toothache}?
Example: Independence?
   P(T):              P(W):
   T     P            W     P
   hot   0.5          sun   0.6
   cold  0.5          rain  0.4

   Joint 1, P1(T, W):       Joint 2, P2(T, W):
   T     W     P            T     W     P
   hot   sun   0.4          hot   sun   0.3
   hot   rain  0.1          hot   rain  0.2
   cold  sun   0.2          cold  sun   0.3
   cold  rain  0.3          cold  rain  0.2
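A small sketch (an assumed helper, not course code) that tests the factorization P(T, W) = P(T) P(W) on the two joint tables above: the first fails, the second factors exactly into the marginals shown:

```python
def is_independent(joint, tol=1e-9):
    """Check whether every joint entry equals the product of its marginals."""
    P_T, P_W = {}, {}
    for (t, w), p in joint.items():
        P_T[t] = P_T.get(t, 0.0) + p
        P_W[w] = P_W.get(w, 0.0) + p
    return all(abs(p - P_T[t] * P_W[w]) < tol for (t, w), p in joint.items())

joint1 = {("hot", "sun"): 0.4, ("hot", "rain"): 0.1, ("cold", "sun"): 0.2, ("cold", "rain"): 0.3}
joint2 = {("hot", "sun"): 0.3, ("hot", "rain"): 0.2, ("cold", "sun"): 0.3, ("cold", "rain"): 0.2}
print(is_independent(joint1))   # False: 0.4 != 0.5 * 0.6
print(is_independent(joint2))   # True:  0.3 == 0.5 * 0.6, 0.2 == 0.5 * 0.4, etc.
```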
Example: Independence
 N fair, independent coin flips:

   P(X1):        P(X2):        …        P(Xn):
   H   0.5       H   0.5                H   0.5
   T   0.5       T   0.5                T   0.5
Conditional Independence
 P(Toothache, Cavity, Catch)
 If I have a cavity, the probability that the probe catches in it doesn't
depend on whether I have a toothache:
 P(+catch | +toothache, +cavity) = P(+catch | +cavity)
 The same independence holds if I don’t have a cavity:
 P(+catch | +toothache, -cavity) = P(+catch| -cavity)
 Catch is conditionally independent of Toothache given Cavity:
 P(Catch | Toothache, Cavity) = P(Catch | Cavity)
 Equivalent statements:
 P(Toothache | Catch , Cavity) = P(Toothache | Cavity)
 P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
 One can be derived from the other easily
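To see the equivalence numerically, here is a sketch with made-up numbers (the lecture gives no probabilities for this domain) that builds the joint under the conditional-independence assumption and checks that P(+catch | +toothache, +cavity) equals P(+catch | +cavity):

```python
# All numbers here are hypothetical; the lecture states the structure but gives no values.
P_cav = {"+cavity": 0.1, "-cavity": 0.9}
P_tooth = {"+cavity": {"+toothache": 0.6, "-toothache": 0.4},
           "-cavity": {"+toothache": 0.1, "-toothache": 0.9}}
P_catch = {"+cavity": {"+catch": 0.9, "-catch": 0.1},
           "-cavity": {"+catch": 0.2, "-catch": 0.8}}

# Joint built assuming Toothache and Catch are conditionally independent given Cavity:
# P(t, c, cav) = P(cav) P(t | cav) P(c | cav)
joint = {(t, c, cav): P_cav[cav] * P_tooth[cav][t] * P_catch[cav][c]
         for cav in P_cav
         for t in ("+toothache", "-toothache")
         for c in ("+catch", "-catch")}

def p(pred):
    """Sum the joint over all outcomes satisfying a predicate."""
    return sum(v for k, v in joint.items() if pred(k))

# P(+catch | +toothache, +cavity) versus P(+catch | +cavity)
lhs = (p(lambda k: k == ("+toothache", "+catch", "+cavity"))
       / p(lambda k: k[0] == "+toothache" and k[2] == "+cavity"))
rhs = (p(lambda k: k[1] == "+catch" and k[2] == "+cavity")
       / p(lambda k: k[2] == "+cavity"))
print(abs(lhs - rhs) < 1e-12)   # True: knowing Toothache adds nothing once Cavity is known
```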
Conditional Independence
 Unconditional (absolute) independence very rare (why?)

 Conditional independence is our most basic and robust form of knowledge about uncertain environments:

 What about this domain:


 Traffic
 Umbrella
 Raining
Probability Summary
Model-based Classification with Naïve Bayes

 A general Naive Bayes model:

   P(Y, F1, …, Fn) = P(Y) Π_i P(Fi | Y)

   Structure: a class variable Y with feature variables F1, F2, …, Fn, each depending only on Y

   Full joint table: |Y| x |F|^n values
   Naive Bayes parameters: |Y| parameters for P(Y), plus n x |F| x |Y| parameters for the P(Fi | Y) tables

 We only have to specify how each feature depends on the class
 Total number of parameters is linear in n
 Model is very simplistic, but often works anyway
Inference for Naïve Bayes
 Goal: compute posterior distribution over label variable Y

 Step 1: get joint probability of label and evidence for each label
   P(y, f1, …, fn) = P(y) Π_i P(fi | y)   for each value y of Y

 Step 2: sum to get probability of evidence
   P(f1, …, fn) = Σ_y P(y, f1, …, fn)

 Step 3: normalize by dividing Step 1 by Step 2
   P(y | f1, …, fn) = P(y, f1, …, fn) / P(f1, …, fn)


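The three steps translate directly into code. Below is a minimal sketch; the prior P(Y) and the P(Fi | Y) tables are hypothetical spam-filter numbers, not values from the lecture:

```python
# Hypothetical parameters for illustration only.
P_Y = {"spam": 0.4, "ham": 0.6}
P_F_given_Y = {
    "spam": {"free": 0.30, "money": 0.20, "meeting": 0.01},
    "ham":  {"free": 0.02, "money": 0.03, "meeting": 0.10},
}

def posterior(features):
    # Step 1: joint probability of each label with the evidence: P(y) * prod_i P(f_i | y)
    joint = {}
    for y, prior in P_Y.items():
        p = prior
        for f in features:
            p *= P_F_given_Y[y][f]
        joint[y] = p
    # Step 2: probability of the evidence is the sum over labels
    p_evidence = sum(joint.values())
    # Step 3: normalize
    return {y: p / p_evidence for y, p in joint.items()}

print(posterior(["free", "money"]))   # spam gets almost all of the posterior mass here
```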
General Naïve Bayes
 What do we need in order to use Naïve Bayes?

 Inference method (we just saw this part)


 Start with a bunch of probabilities: P(Y) and the P(Fi|Y) tables
 Use standard inference to compute P(Y|F1…Fn)
 Nothing new here

 Estimates of local conditional probability tables


 P(Y), the prior over labels
 P(Fi|Y) for each feature (evidence variable)
 These probabilities are collectively called the parameters of the model and denoted by θ
 Up until now, we assumed these appeared by magic, but…
 …they typically come from training data counts: we’ll look at this
soon
Example: Spam Filter
 Input: an email
 Output: spam/ham

 Setup:
   Get a large collection of example emails, each labeled “spam” or “ham”
   Note: someone has to hand label all this data!
   Want to learn to predict labels of new, future emails

 Features: the attributes used to make the ham / spam decision
   Words: FREE!
   Text Patterns: $dd, CAPS
   Non-text: SenderInContacts
   …

 Example emails:
   “Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …”
   “TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.”
   “99 MILLION EMAIL ADDRESSES FOR ONLY $99”
   “Ok, Iknow this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.”
A Spam Filter
 Naïve Bayes spam filter

 Data:
   Collection of emails, labeled spam or ham
   Note: someone has to hand label all this data!
   Split into training, validation, test sets

 Classifiers:
   Learn on the training set
   (Tune it on a validation set)
   Test it on new emails
Naïve Bayes for Text
 Bag-of-words Naïve Bayes:
   Features: Wi is the word at position i (not the ith word in the dictionary!)
   As before: predict label conditioned on feature variables (spam vs. ham)
   As before: assume features are conditionally independent given label
   New: each Wi is identically distributed

 Generative model:

   P(Y, W1, …, Wn) = P(Y) Π_i P(Wi | Y)

 “Tied” distributions and bag-of-words:
   Usually, each variable gets its own conditional probability distribution P(F | Y)
   In a bag-of-words model:
    Each position is identically distributed
    All positions share the same conditional probs P(W | Y)
    Why make this assumption?
   Called “bag-of-words” because model is insensitive to word order or reordering
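A small sketch of bag-of-words scoring with tied distributions: one shared P(W | Y) table is reused at every word position, and log probabilities are used to avoid numerical underflow on long emails. The word probabilities below are made up for illustration:

```python
import math

# Hypothetical tied word distributions: every position uses the same P(W | Y) table.
P_Y = {"spam": 0.5, "ham": 0.5}
P_W_given_Y = {
    "spam": {"free": 0.05, "click": 0.04, "meeting": 0.001, "the": 0.06},
    "ham":  {"free": 0.005, "click": 0.002, "meeting": 0.02, "the": 0.07},
}

def log_score(words, y):
    """log P(y) + sum_i log P(w_i | y); logs keep long products from underflowing."""
    return math.log(P_Y[y]) + sum(math.log(P_W_given_Y[y][w]) for w in words)

email = ["free", "click", "the", "free"]   # "free" counts twice: positions, not dictionary entries
best = max(P_Y, key=lambda y: log_score(email, y))
print(best)                                # "spam" for this made-up example
```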
Training and Testing
Important Concepts
 Data: labeled instances, e.g. emails marked spam/ham
   Training set
   Validation set
   Test set

 Features: attribute-value pairs which characterize each x

 Experimentation cycle:
   Learn parameters (e.g. model probabilities) on the training set
   (Tune hyperparameters on the validation set)
   Compute accuracy on the test set
   Very important: never “peek” at the test set!

 Evaluation:
   Accuracy: fraction of instances predicted correctly

 Overfitting and generalization:
   Want a classifier which does well on test data
   Overfitting: fitting the training data very closely, but not generalizing well
   We’ll investigate overfitting and generalization formally in a few lectures
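A tiny sketch of the evaluation step, computing accuracy as the fraction of test instances predicted correctly (the labels and predictions below are made up):

```python
def accuracy(predictions, labels):
    """Evaluation metric from the slide: fraction of instances predicted correctly."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Made-up test-set labels and classifier predictions:
test_labels = ["spam", "ham", "ham", "spam", "ham"]
predictions = ["spam", "ham", "spam", "spam", "ham"]
print(accuracy(predictions, test_labels))   # 0.8
```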
