– prior to covering machine learning approaches
• There are many different approaches to handling uncertainty
  – formal approaches based on mathematics (probabilities)
  – formal approaches based on logic
  – informal approaches
• Many questions arise
  – How do we combine uncertainty values?
  – How do we obtain uncertainty values?
  – How do we interpret uncertainty values?
  – How do we add uncertainty values to our knowledge and inference mechanisms?

Why Is Uncertainty Needed?
• We will find none of the approaches to be entirely adequate, so the natural question is: why even bother?
  – input data may be questionable
    • to what extent is a patient demonstrating some symptom?
    • do we rely on their word?
  – knowledge may be questionable
    • is this really a fact?
  – knowledge may not be truth-preserving
    • if I apply this piece of knowledge, does the conclusion necessarily hold true? Associational knowledge, for instance, is not truth-preserving, but is used all the time in diagnosis
  – input may be ambiguous or unclear
    • this is especially true if we are dealing with real-world inputs from sensors, or with situations where ambiguity readily exists (natural language, for instance)
  – output may be expected as a plausibility/probability, such as "what is the likelihood that it will rain today?"
• The world is not just T/F, so our reasoners should be able to model this and reason over the shades of grey we find in the world

Methods to Handle Uncertainty
• Fuzzy Logic
  – logic that extends traditional 2-valued logic to a continuous logic (values from 0 to 1)
  – although developed early on to handle natural-language ambiguities such as "you are very tall", it has been applied more successfully to device controllers
• Probabilistic Reasoning
  – using probabilities as part of the data and using Bayes' theorem or variants to reason about what is most likely
• Hidden Markov Models
  – a variant of probabilistic reasoning in which internal states are not observable (so they are called hidden)
• Certainty Factors and Qualitative Fuzzy Logics
  – more ad hoc (non-formal) approaches that may be more flexible, or at least more human-like
• Neural Networks
  – we will skip these in this lecture, as we want to discuss NNs more with respect to learning

Bayesian Probabilities
• Bayes' theorem is given below:
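  P(H0 | E) = P(E | H0) * P(H0) / P(E)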
– P(H0 | E) = probability that H0 is true given evidence E (the conditional probability)
– P(E | H0) = probability that E will arise given that H0 has occurred (the evidential probability)
– P(H0) = probability that H0 will arise (the prior probability)
– P(E) = probability that evidence E will arise
– usually we normalize our probabilities so that P(E) = 1
• The idea is that you are given some evidence E = {e1, e2, …, en} and you have a collection of hypotheses H1, H2, …, Hm
  – using a collection of evidential and prior probabilities, compute the most likely hypothesis

Independence of Evidence
• Since E is a collection of some evidence, but not all possible evidence, you will need a whole lot of probabilities
  – P(E1 & E2 | H0), P(E1 & E3 | H0), P(E1 & E2 & E3 | H0), …
  – if you have n items that could be evidence, you will need 2^n different evidential probabilities for every hypothesis
• To get around the problem of needing an exponential number of probabilities, one might assume that pieces of evidence are independent
  – under such an assumption
    • P(E1 & E2 | H) = P(E1 | H) * P(E2 | H)
    • P(E1 & E2) = P(E1) * P(E2)
  – is this a reasonable assumption?

Continued
• Example: a patient is suffering from a fever and nausea
  – can we treat these two symptoms as independent?
    • one might be causally linked to the other
    • the two combined may help identify a cause (disease) that the symptoms separately might not
• A weaker form of independence is conditional independence
  – if hypothesis H is known to be true, then whether E1 is true should not impact P(E2 | H) or P(H | E2)
  – again, is this a reasonable assumption?
• Consider as an example:
  – you want to run the sprinkler system if it is not going to rain, and you base your decision about whether it will rain on whether it is cloudy
    • the grass is wet, and we want to know the probability that you ran the sprinkler versus the probability that it rained
    • the evidential probabilities P(sprinkler | wet) and P(rain | wet) are not independent of whether it was cloudy

Probability Basics
• Marginal probability: the probability of an event irrespective of the outcomes of other random variables, e.g. P(A)
• Joint probability: the probability of two (or more) simultaneous events, e.g. P(A and B) or P(A, B)
• Conditional probability: the probability of one (or more) event given the occurrence of another event, e.g. P(A given B) or P(A | B)
• The joint probability can be calculated using the conditional probability, for example:
  – P(A, B) = P(A | B) * P(B)
• The conditional probability can be calculated using the joint probability, for example:
  – P(A | B) = P(A, B) / P(B)
• Bayes' theorem is a principled way of calculating a conditional probability without the joint probability
  – it is often the case that we do not have direct access to the denominator, e.g. P(B), but it can be expanded as
    • P(B) = P(B | A) * P(A) + P(B | not A) * P(not A)
  – therefore, P(A | B) = P(B | A) * P(A) / (P(B | A) * P(A) + P(B | not A) * P(not A))
• The terms are named as follows:
  – P(A | B): posterior probability
  – P(A): prior probability
  – P(B | A): likelihood
  – P(B): evidence
• This allows Bayes' theorem to be restated as:
  – Posterior = Likelihood * Prior / Evidence
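• A minimal sketch of hypothesis selection under the independence assumption, scoring each hypothesis and normalizing; the hypotheses, symptoms, and all probability values are invented purely for illustration:

```python
# Choose the most likely hypothesis with Bayes' theorem under the
# independence (naive) assumption. All numbers are illustrative only.

priors = {"flu": 0.10, "cold": 0.25, "healthy": 0.65}          # P(H)
likelihoods = {                                                 # P(e | H)
    "flu":     {"fever": 0.90, "nausea": 0.40},
    "cold":    {"fever": 0.30, "nausea": 0.10},
    "healthy": {"fever": 0.01, "nausea": 0.02},
}

evidence = ["fever", "nausea"]                                  # observed E

# Score each hypothesis: P(H) times the product of P(e | H),
# treating the pieces of evidence as independent.
scores = {}
for h, prior in priors.items():
    p = prior
    for e in evidence:
        p *= likelihoods[h][e]
    scores[h] = p

# Normalize so the posteriors sum to 1; this plays the role of
# dividing by P(E), so the evidence term is never computed directly.
total = sum(scores.values())
posteriors = {h: p / total for h, p in scores.items()}

best = max(posteriors, key=posteriors.get)
print(posteriors, "->", best)   # flu is the most likely hypothesis here
```

  – note that normalizing by the sum of the scores stands in for dividing by P(E), which is why the exact value of the evidence term is not needed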
Bayes' Theorem
• Bayes' theorem is a way of finding a probability when we know certain other probabilities:
  – P(A | B) = P(A) * P(B | A) / P(B)
  – which tells us how often A happens given that B happens, written P(A | B), when we know:
    • how often B happens given that A happens, written P(B | A)
    • how likely A is on its own, written P(A)
    • how likely B is on its own, written P(B)
• Example: three different machines, A, B and C, are used to produce a particular manufactured item
  – the three machines produce 20%, 30% and 50% of the items, respectively
  – machines A, B and C produce defective items at rates of 1%, 2% and 3%, respectively
  – suppose that we pick an item from the final batch at random and it is found to be defective: what is the probability that the item was produced by machine B?
  – working this through with the theorem:
    • P(defective) = 0.20 * 0.01 + 0.30 * 0.02 + 0.50 * 0.03 = 0.023
    • P(B | defective) = P(defective | B) * P(B) / P(defective) = (0.02 * 0.30) / 0.023 ≈ 0.26

Bayesian Networks
• We can avoid the assumption of independence by including causality in our knowledge
  – for this, we enhance our previous approach by using a network in which directed edges denote some form of dependence or causality
• An example of a causal network is the cloudy/sprinkler/rain/wet-grass network from the earlier example, annotated with its probabilities (evidential and prior)
  – we cannot use Bayes' theorem directly because the evidential probabilities are based on the prior probability of cloudy
• However, a propagation algorithm can be applied in which the prior probability for cloudiness impacts the evidential probabilities of sprinkler and rain
  – from there, we can finally compute the likelihood of rain versus sprinkler, as in the sketch below
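• To make this concrete, here is a minimal inference-by-enumeration sketch over a cloudy/sprinkler/rain/wet-grass network; the conditional probability values are illustrative placeholders, not the ones from the slide's figure:

```python
from itertools import product

# Enumeration over a cloudy -> {sprinkler, rain} -> wet-grass network.
# All probability values below are invented for illustration.

P_c = 0.5                                         # P(cloudy)
P_s = {True: 0.1, False: 0.5}                     # P(sprinkler | cloudy)
P_r = {True: 0.8, False: 0.2}                     # P(rain | cloudy)
P_w = {(True, True): 0.99, (True, False): 0.90,   # P(wet | sprinkler, rain)
       (False, True): 0.90, (False, False): 0.01}

def joint(c, s, r, w):
    """Joint probability of one full assignment, following the edges."""
    p = P_c if c else 1 - P_c
    p *= P_s[c] if s else 1 - P_s[c]
    p *= P_r[c] if r else 1 - P_r[c]
    p *= P_w[(s, r)] if w else 1 - P_w[(s, r)]
    return p

# Condition on wet grass: sum the joint over the unobserved variables,
# then divide by P(wet).
p_wet = sum(joint(c, s, r, True) for c, s, r in product([True, False], repeat=3))
p_sprinkler = sum(joint(c, True, r, True) for c, r in product([True, False], repeat=2)) / p_wet
p_rain = sum(joint(c, s, True, True) for c, s in product([True, False], repeat=2)) / p_wet
print(f"P(sprinkler | wet) = {p_sprinkler:.3f}, P(rain | wet) = {p_rain:.3f}")
```

  – enumeration like this is exact but exponential in the number of variables; practical propagation algorithms exploit the network structure to avoid visiting every assignment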
Real World Example
• Here is a Bayesian net for the classification of intruders in an operating system
  – notice that it contains no cycles (a Bayesian network must be a directed acyclic graph)
  – the probabilities on the edges are learned by sorting through log data (see the sketch below)
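• As a sketch of what learning from log data might look like, co-occurrence counts give maximum-likelihood estimates of the conditional probabilities; the log records and field names below are hypothetical:

```python
from collections import Counter

# Estimate one conditional probability of the network by counting over
# log records. The records and field names are invented for illustration.

logs = [
    {"failed_logins": True,  "intrusion": True},
    {"failed_logins": True,  "intrusion": False},
    {"failed_logins": False, "intrusion": False},
    {"failed_logins": True,  "intrusion": True},
    {"failed_logins": False, "intrusion": False},
]

counts = Counter((rec["failed_logins"], rec["intrusion"]) for rec in logs)
n_intrusion = counts[(True, True)] + counts[(False, True)]

# Maximum-likelihood estimate of P(failed_logins | intrusion).
p = counts[(True, True)] / n_intrusion
print(f"P(failed_logins | intrusion) = {p:.2f}")   # 2/2 = 1.00 here
```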
HMMs
• A Markov model is a state transition diagram with probabilities on the edges
  – we use a Markov model to compute the probability of a certain sequence of states
    • see the figure to the right
• This is extremely useful for recognition problems where we know the end result but not how the end result was produced
  – we know the patient's symptoms but not the disease that caused the symptoms to appear
  – we know the speech signal that the speaker uttered, but not the phonemes that made up the speech signal
• In many problems, we have observations to tell us what states have been reached, but observations may not show us all of the states
  – intermediate states (those that are not identifiable from observations) are hidden
  – in the figure on the right, the observations are Y1, Y2, Y3, Y4 and the hidden states are Q1, Q2, Q3, Q4
• The HMM allows us to compute the most probable path that led to a particular observable state
  – this allows us to find which hidden states were most likely to have occurred, as in the sketch below
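• A minimal sketch of the Viterbi algorithm, which recovers the most probable hidden-state path for an observation sequence; the states, observations, and all probabilities are invented for illustration:

```python
# Viterbi: most probable hidden-state path. All numbers are illustrative.

states = ["healthy", "sick"]
start = {"healthy": 0.6, "sick": 0.4}                       # P(q1)
trans = {"healthy": {"healthy": 0.7, "sick": 0.3},          # P(q_t | q_{t-1})
         "sick":    {"healthy": 0.4, "sick": 0.6}}
emit = {"healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},  # P(y | q)
        "sick":    {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

def viterbi(obs):
    # V[t][s] = probability of the best path ending in state s at time t;
    # back[t][s] remembers which predecessor achieved that probability.
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = [{}]
    for y in obs[1:]:
        V.append({})
        back.append({})
        for s in states:
            prev, p = max(((r, V[-2][r] * trans[r][s]) for r in states),
                          key=lambda x: x[1])
            V[-1][s] = p * emit[s][y]
            back[-1][s] = prev
    # Trace the best final state backwards through the back-pointers.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), V[-1][last]

print(viterbi(["normal", "cold", "dizzy"]))
# -> (['healthy', 'healthy', 'sick'], 0.01512)
```

  – in practice, log probabilities are used instead of raw products to avoid numerical underflow on long observation sequences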