20    LEARNING PROBABILISTIC MODELS

In which we view learning as a form of uncertain reasoning from observations.

Chapter 13 pointed out the prevalence of uncertainty in real environments. Agents can handle uncertainty by using the methods of probability and decision theory, but first they must learn their probabilistic theories of the world from experience. This chapter explains how they can do that, by formulating the learning task itself as a process of probabilistic inference (Section 20.1). We will see that a Bayesian view of learning is extremely powerful, providing general solutions to the problems of noise, overfitting, and optimal prediction. It also takes into account the fact that a less-than-omniscient agent can never be certain about which theory of the world is correct, yet must still make decisions by using some theory of the world.
We describe methods for learning probability models (primarily Bayesian networks) in Sections 20.2 and 20.3. Some of the material in this chapter is fairly mathematical, although the general lessons can be understood without plunging into the details. It may benefit the reader to review Chapters 13 and 14 and peek at Appendix A.

20.1    STATISTICAL LEARNING

The key concepts in this chapter, just as in Chapter 18, are data and hypotheses. Here the data are evidence, that is, instantiations of some or all of the random variables describing the domain. The hypotheses in this chapter are probabilistic theories of how the domain works, including logical theories as a special case.
Consider a simple example. Our favorite Surprise candy comes in two flavors: cherry (yum) and lime (ugh). The manufacturer has a peculiar sense of humor and wraps each piece of candy in the same opaque wrapper, regardless of flavor. The candy is sold in very large bags, of which there are known to be five kinds, again indistinguishable from the outside:
h1: 100% cherry,
h2: 75% cherry + 25% lime,
h3: 50% cherry + 50% lime,
h4: 25% cherry + 75% lime,
h5: 100% lime.



Given a new bag of candy, the random variable H (for hypothesis) denotes the type of the bag, with possible values h1 through h5. H is not directly observable, of course. As the pieces of candy are opened and inspected, data are revealed: D1, D2, ..., DN, where each Di is a random variable with possible values cherry and lime. The basic task faced by the agent is to predict the flavor of the next piece of candy.¹ Despite its apparent triviality, this scenario serves to introduce many of the major issues. The agent really does need to infer a theory of its world, albeit a very simple one.
Bayesian learning simply calculates the probability of each hypothesis, given the data, and makes predictions on that basis. That is, the predictions are made by using all the hypotheses, weighted by their probabilities, rather than by using just a single "best" hypothesis. In this way, learning is reduced to probabilistic inference. Let D represent all the data, with observed value d; then the probability of each hypothesis is obtained by Bayes' rule:

P(hi | d) = α P(d | hi) P(hi) .    (20.1)
Now suppose we want to make a prediction about an unknown quantity X. Then we have

P(X | d) = Σi P(X | d, hi) P(hi | d) = Σi P(X | hi) P(hi | d) ,    (20.2)

where we have assumed that each hypothesis determines a probability distribution over X. This equation shows that predictions are weighted averages over the predictions of the individual hypotheses. The hypotheses themselves are essentially "intermediaries" between the raw data and the predictions. The key quantities in the Bayesian approach are the hypothesis prior, P(hi), and the likelihood of the data under each hypothesis, P(d | hi).
For our candy example, we will assume for the time being that the prior distribution over h1, ..., h5 is given by ⟨0.1, 0.2, 0.4, 0.2, 0.1⟩, as advertised by the manufacturer. The likelihood of the data is calculated under the assumption that the observations are i.i.d. (independently and identically distributed),² so that

P(d | hi) = Πj P(dj | hi) .    (20.3)

For example, suppose the bag is really an all-lime bag (h5) and the first 10 candies are all lime; then P(d | h3) is 0.5^10, because half the candies in an h3 bag are lime. Figure 20.1(a) shows how the posterior probabilities of the five hypotheses change as the sequence of 10 lime candies is observed. Notice that the probabilities start out at their prior values, so h3 is initially the most likely choice and remains so after 1 lime candy is unwrapped. After 2 lime candies are unwrapped, h4 is most likely; after 3 or more, h5 (the dreaded all-lime bag) is the most likely. After 10 in a row, we are fairly certain of our fate. Figure 20.1(b) shows the predicted probability that the next candy is lime, based on Equation (20.2). As we would expect, it increases monotonically toward 1.
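The calculation behind Figure 20.1 is small enough to carry out directly. The following sketch (our illustration in plain Python, not code from the text) applies Equations (20.1)-(20.3) to the five candy hypotheses; the prior and the per-hypothesis lime probabilities are the ones given above, and the function names are ours.

# Bayesian learning for the candy example: posteriors over the five bag
# hypotheses and the predictive probability that the next candy is lime.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]           # P(h1) .. P(h5)
lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]      # P(lime | hi) for each hypothesis

def posteriors(data):
    """P(hi | d) for a list of observations drawn from {'cherry', 'lime'}."""
    unnorm = []
    for p_h, p_lime in zip(priors, lime_prob):
        likelihood = 1.0
        for d in data:
            likelihood *= p_lime if d == 'lime' else (1.0 - p_lime)
        unnorm.append(p_h * likelihood)
    alpha = 1.0 / sum(unnorm)                # the normalization constant
    return [alpha * u for u in unnorm]

def predict_lime(data):
    """P(d_{N+1} = lime | d): the weighted average over all hypotheses."""
    return sum(p * q for p, q in zip(posteriors(data), lime_prob))

for n in range(11):                          # reproduces the curves of Figure 20.1
    d = ['lime'] * n
    print(n, [round(p, 3) for p in posteriors(d)], round(predict_lime(d), 4))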
¹ Statistically sophisticated readers will recognize this scenario as a variant of the urn-and-ball setup. We find urns and balls less compelling than candy; furthermore, candy lends itself to other tasks, such as deciding whether to trade the bag with a friend (see the exercises at the end of the chapter).
² We stated earlier that the bags of candy are very large; otherwise, the i.i.d. assumption fails to hold. Technically, it is more correct (but less hygienic) to rewrap each candy after inspection and return it to the bag.
Figure 20.1   (a) Posterior probabilities P(hi | d1, ..., dN) from Equation (20.1). The number of observations N ranges from 1 to 10, and each observation is of a lime candy. (b) Bayesian prediction P(dN+1 = lime | d1, ..., dN) from Equation (20.2).

The example shows that the Bayesian prediction eventually agrees with the true hypothesis. This is characteristic of Bayesian learning. For any fixed prior that does not rule out the true hypothesis, the posterior probability of any false hypothesis will, under certain technical conditions, eventually vanish. This happens simply because the probability of generating "uncharacteristic" data indefinitely is vanishingly small. (This point is analogous to one made in the discussion of PAC learning in Chapter 18.) More important, the Bayesian prediction is optimal, whether the data set be small or large. Given the hypothesis prior, any other prediction is expected to be correct less often.
The optimality of Bayesian learning comes at a price, of course. For real learning problems, the hypothesis space is usually very large or infinite, as we saw in Chapter 18. In some cases, the summation in Equation (20.2) (or integration, in the continuous case) can be carried out tractably, but in most cases we must resort to approximate or simplified methods.
A very common approximation, one that is usually adopted in science, is to make predictions based on a single most probable hypothesis, that is, an hi that maximizes P(hi | d). This is often called a maximum a posteriori or MAP (pronounced "em ay pee") hypothesis. Predictions made according to an MAP hypothesis hMAP are approximately Bayesian to the extent that P(X | d) ≈ P(X | hMAP). In our candy example, hMAP = h5 after three lime candies in a row, so the MAP learner then predicts that the fourth candy is lime with probability 1.0, a much more dangerous prediction than the Bayesian prediction of 0.8 shown in Figure 20.1(b). As more data arrive, the MAP and Bayesian predictions become closer, because the competitors to the MAP hypothesis become less and less probable.
Although our example doesn't show it, finding MAP hypotheses is often much easier than full Bayesian learning, because it requires solving an optimization problem instead of a large summation (or integration) problem. We will see examples of this later in the chapter.

In both Bayesian learning and MAP learning, the hypothesis prior P(hi) plays an important role. We saw in Chapter 18 that overfitting can occur when the hypothesis space is too expressive, so that it contains many hypotheses that fit the data set well. Rather than placing an arbitrary limit on the hypotheses to be considered, Bayesian and MAP learning methods use the prior to penalize complexity. Typically, more complex hypotheses have a lower prior probability, in part because there are usually many more complex hypotheses than simple hypotheses. On the other hand, more complex hypotheses have a greater capacity to fit the data. (In the extreme case, a lookup table can reproduce the data exactly with probability 1.) Hence, the hypothesis prior embodies a tradeoff between the complexity of a hypothesis and its degree of fit to the data.
We can see the effect of this tradeoff most clearly in the logical case, where H contains only deterministic hypotheses. In that case, P(d | hi) is 1 if hi is consistent and 0 otherwise. Looking at Equation (20.1), we see that hMAP will then be the simplest logical theory that is consistent with the data. Therefore, maximum a posteriori learning provides a natural embodiment of Ockham's razor.
Another insight into the tradeoff between complexity and degree of fit is obtained by taking the logarithm of Equation (20.1). Choosing hMAP to maximize P(d | hi) P(hi) is equivalent to minimizing

−log2 P(d | hi) − log2 P(hi) .

Using the connection between information encoding and probability that we introduced in Chapter 18, we see that the −log2 P(hi) term equals the number of bits required to specify the hypothesis hi. Furthermore, −log2 P(d | hi) is the additional number of bits required to specify the data, given the hypothesis. (To see this, consider that no bits are required if the hypothesis predicts the data exactly, as with h5 and the string of lime candies, and log2 1 = 0.) Hence, MAP learning is choosing the hypothesis that provides maximum compression of the data. The same task is addressed more directly by the minimum description length, or MDL, learning method. Whereas MAP learning expresses simplicity by assigning higher probabilities to simpler hypotheses, MDL expresses it directly by counting the bits in a binary encoding of the hypotheses and data.
A final simplification is provided by assuming a uniform prior over the space of hypotheses. In that case, MAP learning reduces to choosing an hi that maximizes P(d | hi). This is called a maximum-likelihood (ML) hypothesis, hML. Maximum-likelihood learning is very common in statistics, a discipline in which many researchers distrust the subjective nature of hypothesis priors. It is a reasonable approach when there is no reason to prefer one hypothesis over another a priori, for example, when all hypotheses are equally complex. It provides a good approximation to Bayesian and MAP learning when the data set is large, because the data swamps the prior distribution over hypotheses, but it has problems (as we shall see) with small data sets.

20.2    LEARNING WITH COMPLETE DATA

The general task of learning a probability model, given data that are assumed to be generated from that model, is called density estimation. (The term applied originally to probability density functions for continuous variables, but is used now for discrete distributions too.) This section covers the simplest case, where we have complete data. Data are complete when each data point contains values for every variable in the probability model being learned. We focus on parameter learning, that is, finding the numerical parameters for a probability model whose structure is fixed. For example, we might be interested in learning the conditional probabilities in a Bayesian network with a given structure. We will also look briefly at the problem of learning structure and at nonparametric density estimation.

20.2.1  Maximum-likelihood parameter learning: Discrete models


Suppose we buy a bag of lime and cherry candy from a new manufacturer whose lime–cherry proportions are completely unknown; the fraction could be anywhere between 0 and 1. In that case, we have a continuum of hypotheses. The parameter in this case, which we call θ, is the proportion of cherry candies, and the hypothesis is hθ. (The proportion of limes is just 1 − θ.) If we assume that all proportions are equally likely a priori, then a maximum-likelihood approach is reasonable. If we model the situation with a Bayesian network, we need just one random variable, Flavor (the flavor of a randomly chosen candy from the bag). It has values cherry and lime, where the probability of cherry is θ (see Figure 20.2(a)). Now suppose we unwrap N candies, of which c are cherries and ℓ = N − c are limes. According to Equation (20.3), the likelihood of this particular data set is

P(d | hθ) = Πj=1..N P(dj | hθ) = θ^c (1 − θ)^ℓ .

The maximum-likelihood hypothesis is given by the value of θ that maximizes this expression. The same value is obtained by maximizing the log likelihood,

L(d | hθ) = log P(d | hθ) = Σj=1..N log P(dj | hθ) = c log θ + ℓ log(1 − θ) .

(By taking logarithms, we reduce the product to a sum over the data, which is usually easier to maximize.) To find the maximum-likelihood value of θ, we differentiate L with respect to θ and set the resulting expression to zero:

dL(d | hθ)/dθ = c/θ − ℓ/(1 − θ) = 0    ⇒    θ = c/(c + ℓ) = c/N .

In English, then, the maximum-likelihood hypothesis hML asserts that the actual proportion of cherries in the bag is equal to the observed proportion in the candies unwrapped so far!
It appears that we have done a lot of work to discover the obvious. In fact, though, we have laid out one standard method for maximum-likelihood parameter learning, a method with broad applicability:
Figure 20.2   (a) Bayesian network model for the case of candies with an unknown proportion of cherries and limes. (b) Model for the case where the wrapper color depends probabilistically on the candy flavor.

1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.
The trickiest step is usually the last. In our example, it was trivial, but we will see that in many cases we need to resort to iterative solution algorithms or other numerical optimization techniques. The example also illustrates a significant problem with maximum-likelihood learning in general: when the data set is small enough that some events have not yet been observed (for instance, no cherry candies), the maximum-likelihood hypothesis assigns zero probability to those events. Various tricks are used to avoid this problem, such as initializing the counts for each event to 1 instead of zero.
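As a minimal illustration of this recipe (ours, not the book's), the maximum-likelihood estimate for the candy proportion is just an observed fraction, and initializing the counts to 1 (often called add-one smoothing) keeps unseen flavors from being judged impossible:

def ml_theta(candies):
    """Maximum-likelihood estimate of the cherry proportion: theta = c / N."""
    c = sum(1 for x in candies if x == 'cherry')
    return c / len(candies)

def smoothed_theta(candies):
    """Same estimate with counts initialized to 1, so unseen events stay possible."""
    c = sum(1 for x in candies if x == 'cherry')
    return (c + 1) / (len(candies) + 2)        # +2 because there are two flavors

print(ml_theta(['lime', 'lime', 'lime']))        # 0.0 -- cherry judged impossible
print(smoothed_theta(['lime', 'lime', 'lime']))  # 0.2 -- small but nonzero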
Let us look at another example. Suppose this new candy manufacturer wants to give a little hint to the consumer and uses candy wrappers colored red and green. The Wrapper for each candy is selected probabilistically, according to some unknown conditional distribution, depending on the flavor. The corresponding probability model is shown in Figure 20.2(b). Notice that it has three parameters: θ, θ1, and θ2. With these parameters, the likelihood of seeing, say, a cherry candy in a green wrapper can be obtained from the standard semantics for Bayesian networks (Chapter 14):

P(Flavor = cherry, Wrapper = green | hθ,θ1,θ2)
  = P(Flavor = cherry | hθ,θ1,θ2) P(Wrapper = green | Flavor = cherry, hθ,θ1,θ2)
  = θ · (1 − θ1) .

Now we unwrap N candies, of which c are cherries and ℓ are limes. The wrapper counts are as follows: rc of the cherries have red wrappers and gc have green, while rℓ of the limes have red and gℓ have green. The likelihood of the data is given by

P(d | hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^rc (1 − θ1)^gc · θ2^rℓ (1 − θ2)^gℓ .

This looks pretty horrible, but taking logarithms helps:

L = [c log θ + ℓ log(1 − θ)] + [rc log θ1 + gc log(1 − θ1)] + [rℓ log θ2 + gℓ log(1 − θ2)] .

The benefit of taking logs is clear: the log likelihood is the sum of three terms, each of which contains a single parameter. When we take derivatives with respect to each parameter and set them to zero, we get three independent equations, each containing just one parameter:

∂L/∂θ  = c/θ − ℓ/(1 − θ) = 0        ⇒  θ  = c/(c + ℓ)
∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0    ⇒  θ1 = rc/(rc + gc)
∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0    ⇒  θ2 = rℓ/(rℓ + gℓ) .

The solution for θ is the same as before. The solution for θ1, the probability that a cherry candy has a red wrapper, is the observed fraction of cherry candies with red wrappers, and similarly for θ2.
These results are very comforting, and it is easy to see that they can be extended to any Bayesian network whose conditional probabilities are represented as tables. The most important point is that, with complete data, the maximum-likelihood parameter-learning problem for a Bayesian network decomposes into separate learning problems, one for each parameter. (See the exercises for the nontabulated case, where each parameter affects several conditional probabilities.) The second point is that the parameter values for a variable, given its parents, are just the observed frequencies of the variable values for each setting of the parent values. As before, we must be careful to avoid zeroes when the data set is small.
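The decomposition can be seen directly in code. This sketch (ours) takes the six observed counts and returns the three maximum-likelihood parameters; note that each estimate touches only its own slice of the data. The counts in the example call are made up.

def ml_candy_wrapper(c, l, rc, gc, rl, gl):
    """ML parameters for the model of Figure 20.2(b) from complete-data counts.
    c, l   : numbers of cherry and lime candies
    rc, gc : red / green wrappers among the cherries
    rl, gl : red / green wrappers among the limes
    """
    theta  = c / (c + l)        # P(Flavor = cherry)
    theta1 = rc / (rc + gc)     # P(Wrapper = red | Flavor = cherry)
    theta2 = rl / (rl + gl)     # P(Wrapper = red | Flavor = lime)
    return theta, theta1, theta2

print(ml_candy_wrapper(c=60, l=40, rc=45, gc=15, rl=10, gl=30))   # (0.6, 0.75, 0.25)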

20.2.2  Naive Bayes models


Probably the most common Bayesian network model used in machine learning is the naive Bayes model first introduced in Chapter 13. In this model, the "class" variable C (which is to be predicted) is the root and the "attribute" variables Xi are the leaves. The model is "naive" because it assumes that the attributes are conditionally independent of each other, given the class. (The model in Figure 20.2(b) is a naive Bayes model with class Flavor and just one attribute, Wrapper.) Assuming Boolean variables, the parameters are

θ = P(C = true),  θi1 = P(Xi = true | C = true),  θi2 = P(Xi = true | C = false).

The maximum-likelihood parameter values are found in exactly the same way as for Figure 20.2(b). Once the model has been trained in this way, it can be used to classify new examples for which the class variable C is unobserved. With observed attribute values x1, ..., xn, the probability of each class is given by

P(C | x1, ..., xn) = α P(C) Πi P(xi | C) .

A deterministic prediction can be obtained by choosing the most likely class. Figure 20.3 shows the learning curve for this method when it is applied to the restaurant problem from Chapter 18. The method learns fairly well, but not as well as decision-tree learning; this is presumably because the true hypothesis, which is a decision tree, is not representable exactly using a naive Bayes model.
Figure 20.3   The learning curve for naive Bayes learning applied to the restaurant problem from Chapter 18; the learning curve for decision-tree learning is shown for comparison.

Naive Bayes learning turns out to do surprisingly well in a wide range of applications; the boosted version (see the exercises) is one of the most effective general-purpose learning algorithms. Naive Bayes learning scales well to very large problems: with n Boolean attributes, there are just 2n + 1 parameters, and no search is required to find hML, the maximum-likelihood naive Bayes hypothesis. Finally, naive Bayes learning systems have no difficulty with noisy or missing data and can give probabilistic predictions when appropriate.
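As an illustration (our sketch, not the book's code), here is maximum-likelihood naive Bayes for Boolean attributes: training is pure counting, and classification applies the normalization formula above. In practice the counts would be smoothed, as discussed earlier, to avoid zero probabilities.

def train_naive_bayes(examples):
    """examples: list of (attrs, label), attrs a tuple of booleans, label a boolean.
    Returns theta = P(C=true) and per-attribute pairs (theta_i1, theta_i2).
    Assumes both classes appear in the training set (no smoothing here)."""
    n_attrs = len(examples[0][0])
    n_true = sum(1 for _, c in examples if c)
    theta = n_true / len(examples)
    cond = []
    for i in range(n_attrs):
        t = sum(1 for a, c in examples if c and a[i]) / n_true
        f = sum(1 for a, c in examples if not c and a[i]) / (len(examples) - n_true)
        cond.append((t, f))                    # (theta_i1, theta_i2)
    return theta, cond

def classify(theta, cond, attrs):
    """Return P(C = true | attrs) by normalizing the two class scores."""
    p_true, p_false = theta, 1.0 - theta
    for x, (t, f) in zip(attrs, cond):
        p_true  *= t if x else (1.0 - t)
        p_false *= f if x else (1.0 - f)
    return p_true / (p_true + p_false)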

20.2.3  Maximum-likelihood parameter learning: Continuous models


Continuous probability models, such as the linear Gaussian model, were introduced in Chapter 14. Because continuous variables are ubiquitous in real-world applications, it is important to know how to learn the parameters of continuous models from data. The principles for maximum-likelihood learning are identical in the continuous and discrete cases.
Let us begin with a very simple case: learning the parameters of a Gaussian density function on a single variable. That is, the data are generated as follows:

P(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)) .

The parameters of this model are the mean μ and the standard deviation σ. (Notice that the normalizing "constant" depends on σ, so we cannot ignore it.) Let the observed values be x1, ..., xN. Then the log likelihood is

L = Σj=1..N log (1/(σ√(2π))) e^(−(xj−μ)²/(2σ²)) = N(−log √(2π) − log σ) − Σj=1..N (xj − μ)²/(2σ²) .

Setting the derivatives to zero as usual, we obtain

∂L/∂μ = −(1/σ²) Σj=1..N (xj − μ) = 0          ⇒  μ = (Σj xj)/N
∂L/∂σ = −N/σ + (1/σ³) Σj=1..N (xj − μ)² = 0   ⇒  σ = √( Σj (xj − μ)² / N ) .

That is, the maximum-likelihood value of the mean is the sample average and the maximum-likelihood value of the standard deviation is the square root of the sample variance. Again, these are comforting results that confirm "commonsense" practice.
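In code, the maximum-likelihood Gaussian fit is nothing more than the sample mean and the (biased, divide-by-N) sample standard deviation; a short sketch of ours:

from math import sqrt

def fit_gaussian_ml(xs):
    """Maximum-likelihood mean and standard deviation for 1-D data xs."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = sqrt(sum((x - mu) ** 2 for x in xs) / n)   # divide by N, not N - 1
    return mu, sigma

print(fit_gaussian_ml([2.1, 1.9, 2.4, 2.0, 1.6]))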
Figure 20.4   (a) A linear Gaussian model described as y = θ1x + θ2 plus Gaussian noise with fixed variance. (b) A set of 50 data points generated from this model.

Now consider a linear Gaussian model with one continuous parent X and a continuous child Y. As explained in Chapter 14, Y has a Gaussian distribution whose mean depends linearly on the value of X and whose standard deviation is fixed. To learn the conditional distribution P(Y | X), we can maximize the conditional likelihood

P(y | x) = (1/(σ√(2π))) e^(−(y − (θ1x + θ2))²/(2σ²)) .

Here, the parameters are θ1, θ2, and σ. The data are a collection of (xj, yj) pairs, as illustrated in Figure 20.4(b). Using the usual methods (see the exercises), we can find the maximum-likelihood values of the parameters. The point here is different. If we consider just the parameters θ1 and θ2 that define the linear relationship between x and y, it becomes clear that maximizing the log likelihood with respect to these parameters is the same as minimizing the numerator (y − (θ1x + θ2))² in the exponent above. This is the L2 loss, the squared error between the actual value y and the prediction θ1x + θ2. This is the quantity minimized by the standard linear regression procedure described in Chapter 18. Now we can understand why: minimizing the sum of squared errors gives the maximum-likelihood straight-line model, provided that the data are generated with Gaussian noise of fixed variance.
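The correspondence with linear regression can be made concrete. The closed-form least-squares solution below (a standard derivation written out by us, not code from the text) maximizes the conditional likelihood above for any fixed noise variance σ.

def fit_linear_gaussian_ml(pairs):
    """Least-squares fit of y = theta1 * x + theta2, which is the ML fit
    under Gaussian noise of fixed variance."""
    n = len(pairs)
    sx  = sum(x for x, _ in pairs)
    sy  = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    theta1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    theta2 = (sy - theta1 * sx) / n
    return theta1, theta2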

20.2.4  Bayesian parameter learning


Maximum-likelihood learning gives rise to some very simple procedures, but it has some serious deficiencies with small data sets. For example, after seeing one cherry candy, the maximum-likelihood hypothesis is that the bag is 100% cherry (i.e., θ = 1.0). Unless one's hypothesis prior is that bags must be either all cherry or all lime, this is not a reasonable conclusion. It is more likely that the bag is a mixture of lime and cherry. The Bayesian approach to parameter learning starts by defining a prior probability distribution over the possible hypotheses. We call this the hypothesis prior. Then, as data arrive, the posterior probability distribution is updated.
Figure 20.5   Examples of the beta[a, b] distribution for different values of [a, b].

The candy example in Figure 20.2(a) has one parameter, θ: the probability that a randomly selected piece of candy is cherry flavored. In the Bayesian view, θ is the (unknown) value of a random variable Θ that defines the hypothesis space; the hypothesis prior is just the prior distribution P(Θ). Thus, P(Θ = θ) is the prior probability that the bag has a fraction θ of cherry candies.
If the parameter θ can be any value between 0 and 1, then P(Θ) must be a continuous distribution that is nonzero only between 0 and 1 and that integrates to 1. The uniform density P(θ) = Uniform[0, 1](θ) is one candidate (see Chapter 13). It turns out that the uniform density is a member of the family of beta distributions. Each beta distribution is defined by two hyperparameters³ a and b such that

beta[a, b](θ) = α θ^(a−1) (1 − θ)^(b−1) ,

for θ in the range [0, 1]. The normalization constant α, which makes the distribution integrate to 1, depends on a and b (see the exercises). Figure 20.5 shows what the distribution looks like for various values of a and b. The mean value of the distribution is a/(a + b), so larger values of a suggest a belief that Θ is closer to 1 than to 0. Larger values of a + b make the distribution more peaked, suggesting greater certainty about the value of Θ. Thus, the beta family provides a useful range of possibilities for the hypothesis prior.
Besides its flexibility, the beta family has another wonderful property: if Θ has a prior beta[a, b], then, after a data point is observed, the posterior distribution for Θ is also a beta distribution. In other words, beta is closed under update. The beta family is called the conjugate prior⁴ for the family of distributions for a Boolean variable. Let's see how this works.
³ They are called hyperparameters because they parameterize a distribution over θ, which is itself a parameter.
⁴ Other conjugate priors include the Dirichlet family for the parameters of a discrete multivalued distribution and the Normal–Wishart family for the parameters of a Gaussian distribution. See Bernardo and Smith (1994).
Figure 20.6   A Bayesian network that corresponds to a Bayesian learning process. Posterior distributions for the parameter variables Θ, Θ1, and Θ2 can be inferred from their prior distributions and the evidence in the Flavori and Wrapperi variables.

Suppose we observe a cherry candy; then we have

P(θ | D1 = cherry) = α P(D1 = cherry | θ) P(θ)
  = α′ θ · beta[a, b](θ) = α′ θ · θ^(a−1) (1 − θ)^(b−1)
  = α′ θ^a (1 − θ)^(b−1) = beta[a + 1, b](θ) .

Thus, after seeing a cherry candy, we simply increment the a parameter to get the posterior; similarly, after seeing a lime candy, we increment the b parameter. Thus, we can view the a and b hyperparameters as virtual counts, in the sense that a prior beta[a, b] behaves exactly as if we had started out with a uniform prior beta[1, 1] and seen a − 1 actual cherry candies and b − 1 actual lime candies.
By examining a sequence of beta distributions for increasing values of a and b, keeping the proportions fixed, we can see vividly how the posterior distribution over the parameter Θ changes as data arrive. For example, suppose the actual bag of candy is 75% cherry. Figure 20.5(b) shows the sequence beta[3, 1], beta[6, 2], beta[30, 10]. Clearly, the distribution is converging to a narrow peak around the true value of Θ. For large data sets, then, Bayesian learning (at least in this case) converges to the same answer as maximum-likelihood learning.
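The update rule is trivial to implement. The sketch below (ours) tracks the hyperparameters directly and reports the posterior mean a/(a + b), which serves as the Bayesian point estimate of θ.

def update_beta(a, b, observations):
    """Conjugate update of a beta[a, b] prior on theta = P(cherry):
    each cherry increments a, each lime increments b."""
    for obs in observations:
        if obs == 'cherry':
            a += 1
        else:
            b += 1
    return a, b

a, b = 1, 1                                    # uniform prior beta[1, 1]
a, b = update_beta(a, b, ['cherry', 'cherry', 'lime', 'cherry'])
print(a, b, a / (a + b))                       # beta[4, 2]; posterior mean 0.667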
Now let us consider a more complicated case. The network in Figure 20.2(b) has three parameters, θ, θ1, and θ2, where θ1 is the probability of a red wrapper on a cherry candy and θ2 is the probability of a red wrapper on a lime candy. The Bayesian hypothesis prior must cover all three parameters: that is, we need to specify P(Θ, Θ1, Θ2). Usually, we assume parameter independence:

P(Θ, Θ1, Θ2) = P(Θ) P(Θ1) P(Θ2) .

With this assumption, each parameter can have its own beta distribution that is updated separately as data arrive. Figure 20.6 shows how we can incorporate the hypothesis prior and any data into one Bayesian network. The nodes Θ, Θ1, Θ2 have no parents. But each time we make an observation of a wrapper and corresponding flavor of a piece of candy, we add a node Flavori, which is dependent on the flavor parameter Θ:

P(Flavori = cherry | Θ = θ) = θ .

We also add a node Wrapperi, which is dependent on Θ1 and Θ2:

P(Wrapperi = red | Flavori = cherry, Θ1 = θ1) = θ1
P(Wrapperi = red | Flavori = lime, Θ2 = θ2) = θ2 .

Now, the entire Bayesian learning process can be formulated as an inference problem. We add new evidence nodes, then query the unknown nodes (in this case, Θ, Θ1, Θ2). This formulation of learning and prediction makes it clear that Bayesian learning requires no extra "principles of learning." Furthermore, there is, in essence, just one learning algorithm, namely, the inference algorithm for Bayesian networks. Of course, the nature of these networks is somewhat different from those of Chapter 14 because of the potentially huge number of evidence variables representing the training set and the prevalence of continuous-valued parameter variables.

20.2.5  Learning Bayes net structures


So far, we have assumed that the structure of the Bayes net is given and we are just trying to learn the parameters. The structure of the network represents basic causal knowledge about the domain that is often easy for an expert, or even a naive user, to supply. In some cases, however, the causal model may be unavailable or subject to dispute (for example, certain corporations have long claimed that smoking does not cause cancer), so it is important to understand how the structure of a Bayes net can be learned from data. This section gives a brief sketch of the main ideas.
The most obvious approach is to search for a good model. We can start with a model containing no links and begin adding parents for each node, fitting the parameters with the methods we have just covered and measuring the accuracy of the resulting model. Alternatively, we can start with an initial guess at the structure and use hill-climbing or simulated annealing search to make modifications, retuning the parameters after each change in the structure. Modifications can include reversing, adding, or deleting links. We must not introduce cycles in the process, so many algorithms assume that an ordering is given for the variables and that a node can have parents only among those nodes that come earlier in the ordering (just as in the construction process in Chapter 14). For full generality, we also need to search over possible orderings.
There are two alternative methods for deciding when a good structure has been found. The first is to test whether the conditional independence assertions implicit in the structure are actually satisfied in the data. For example, the use of a naive Bayes model for the restaurant problem assumes that

P(Fri/Sat, Bar | WillWait) = P(Fri/Sat | WillWait) P(Bar | WillWait)

and we can check in the data that the same equation holds between the corresponding conditional frequencies. But, even if the structure describes the true causal nature of the domain, statistical fluctuations in the data set mean that the equation will never be satisfied exactly, so we need to perform a suitable statistical test to see if there is sufficient evidence that the independence hypothesis is violated. The complexity of the resulting network will depend on the threshold used for this test: the stricter the independence test, the more links will be added and the greater the danger of overfitting.
An approach more consistent with the ideas in this chapter is to assess the degree to which the proposed model explains the data (in a probabilistic sense). We must be careful how we measure this, however. If we just try to find the maximum-likelihood hypothesis, we will end up with a fully connected network, because adding more parents to a node cannot decrease the likelihood (see the exercises). We are forced to penalize model complexity in some way. The MAP (or MDL) approach simply subtracts a penalty from the likelihood of each structure (after parameter tuning) before comparing different structures. The Bayesian approach places a joint prior over structures and parameters. There are usually far too many structures to sum over (superexponential in the number of variables), so most practitioners use MCMC to sample over structures.
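As a sketch of the MAP/MDL idea (our illustration, restricted to Boolean variables with tabular CPTs), one can score a candidate structure by fitting the parameters from counts, computing the log likelihood, and subtracting a penalty proportional to the number of independent parameters; the variable names and the particular penalty of (log2 N)/2 bits per parameter are our choices.

from math import log

def mdl_score(data, parents):
    """data: list of dicts mapping variable name -> 0/1.
    parents: dict mapping variable name -> tuple of parent names.
    Returns log-likelihood minus an MDL penalty of (log2 N)/2 per parameter."""
    N = len(data)
    loglik, n_params = 0.0, 0
    for var, pa in parents.items():
        counts = {}                                   # (parent values) -> [n0, n1]
        for row in data:
            key = tuple(row[p] for p in pa)
            counts.setdefault(key, [0, 0])[row[var]] += 1
        n_params += 2 ** len(pa)                      # one parameter per parent setting
        for n0, n1 in counts.values():
            for k in (n0, n1):
                if k > 0:
                    loglik += k * log(k / (n0 + n1))  # ML-fitted log likelihood
    return loglik - 0.5 * log(N, 2) * n_params

# Hypothetical usage: compare a structure with and without an extra link, e.g.
# mdl_score(data, {'WillWait': (), 'Bar': ('WillWait',)})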
Penalizing complexity (whether by MAP or Bayesian methods) introduces an important connection between the optimal structure and the nature of the representation for the conditional distributions in the network. With tabular distributions, the complexity penalty for a node's distribution grows exponentially with the number of parents, but with, say, noisy-OR distributions, it grows only linearly. This means that learning with noisy-OR (or other compactly parameterized) models tends to produce learned structures with more parents than does learning with tabular distributions.

20.2.6  Density estimation with nonparametric models


It is possible to learn a probability model without making any assumptions about its structure and parameterization by adopting the nonparametric methods of Chapter 18. The task of nonparametric density estimation is typically done in continuous domains, such as that shown in Figure 20.7(a). The figure shows a probability density function on a space defined by two continuous variables. In Figure 20.7(b) we see a sample of data points from this density function. The question is, can we recover the model from the samples?
First we will consider k-nearest-neighbors models. In Chapter 18 we saw nearest-neighbor models for classification and regression; here we see them for density estimation. Given a sample of data points, to estimate the unknown probability density at a query point x, we can simply measure the density of the data points in the neighborhood of x. Figure 20.7(b) shows two query points (small squares). For each query point we have drawn the smallest circle that encloses 10 neighbors, the 10-nearest-neighborhood. We can see that the central circle is large, meaning there is a low density there, and the circle on the right is small, meaning there is a high density there. In Figure 20.8 we show three plots of density estimation using k-nearest-neighbors for different values of k. It seems clear that (b) is about right, while (a) is too spiky (k is too small) and (c) is too smooth (k is too big).
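A minimal version of the k-nearest-neighbors density estimate (our sketch, two-dimensional to match the figures): the density at a query point is approximately k divided by N times the area of the smallest disc around the query that contains k data points.

from math import pi, dist

def knn_density_2d(query, points, k=10):
    """Estimate the density at `query` from 2-D data `points`:
    k / (N * area of the smallest disc around query containing k points)."""
    radii = sorted(dist(query, p) for p in points)
    r = radii[k - 1]                   # radius of the k-nearest-neighborhood
    return k / (len(points) * pi * r * r)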
Figure 20.7   (a) A 3D plot of the mixture of Gaussians from Figure 20.11(a). (b) A sample of points from the mixture, together with two query points (small squares) and their 10-nearest-neighborhoods (medium and large circles).

Figure 20.8   Density estimation using k-nearest-neighbors, applied to the data in Figure 20.7(b), for k = 3, 10, and 40 respectively. k = 3 is too spiky, 40 is too smooth, and 10 is just about right. The best value for k can be chosen by cross-validation.

Figure 20.9   Kernel density estimation for the data in Figure 20.7(b), using Gaussian kernels with w = 0.02, 0.07, and 0.20 respectively. w = 0.07 is about right.

Another possibility is to use kernel functions, as we did for locally weighted regression. To apply a kernel model to density estimation, assume that each data point generates its own little density function, using a Gaussian kernel. The estimated density at a query point x is then the average density as given by each kernel function:

P(x) = (1/N) Σj=1..N K(x, xj) .

We will assume spherical Gaussians with standard deviation w along each axis:

K(x, xj) = (1/√((2πw²)^d)) e^(−D(x, xj)²/(2w²)) ,

where d is the number of dimensions in x and D is the Euclidean distance function. We still have the problem of choosing a suitable value for kernel width w; Figure 20.9 shows values that are too small, just right, and too large. A good value of w can be chosen by using cross-validation.
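The corresponding kernel estimate is a direct transcription of the two equations above; another sketch of ours, assuming Euclidean distance and a spherical Gaussian kernel of width w:

from math import exp, pi, sqrt, dist

def kernel_density(query, points, w=0.07):
    """Average of Gaussian kernels of width w centered on the data points."""
    d = len(query)                                  # number of dimensions
    norm = 1.0 / sqrt((2 * pi * w * w) ** d)        # normalizer of a spherical Gaussian
    total = sum(exp(-dist(query, p) ** 2 / (2 * w * w)) for p in points)
    return norm * total / len(points)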

20.3    LEARNING WITH HIDDEN VARIABLES: THE EM ALGORITHM

The preceding section dealt with the fully observable case. Many real-world problems have hidden variables (sometimes called latent variables), which are not observable in the data that are available for learning. For example, medical records often include the observed symptoms, the physician's diagnosis, the treatment applied, and perhaps the outcome of the treatment, but they seldom contain a direct observation of the disease itself! (Note that the diagnosis is not the disease; it is a causal consequence of the observed symptoms, which are in turn caused by the disease.) One might ask, "If the disease is not observed, why not construct a model without it?" The answer appears in Figure 20.10, which shows a small, fictitious diagnostic model for heart disease. There are three observable predisposing factors and three observable symptoms (which are too depressing to name). Assume that each variable has three possible values (e.g., none, moderate, and severe). Removing the hidden variable from the network in (a) yields the network in (b); the total number of parameters increases from 78 to 708. Thus, latent variables can dramatically reduce the number of parameters required to specify a Bayesian network. This, in turn, can dramatically reduce the amount of data needed to learn the parameters.
Hidden variables are important, but they do complicate the learning problem. In Figure 20.10(a), for example, it is not obvious how to learn the conditional distribution for HeartDisease, given its parents, because we do not know the value of HeartDisease in each case; the same problem arises in learning the distributions for the symptoms. This section describes an algorithm called expectation–maximization, or EM, that solves this problem in a very general way. We will show three examples and then provide a general description. The algorithm seems like magic at first, but once the intuition has been developed, one can find applications for EM in a huge range of learning problems.
Figure 20.10   (a) A simple diagnostic network for heart disease, which is assumed to be a hidden variable. Each variable has three possible values and is labeled with the number of independent parameters in its conditional distribution; the total number is 78. (b) The equivalent network with HeartDisease removed. Note that the symptom variables are no longer conditionally independent given their parents. This network requires 708 parameters.

20.3.1  Unsupervised clustering: Learning mixtures of Gaussians


Unsupervised clustering is the problem of discerning multiple categories in a collection of objects. The problem is unsupervised because the category labels are not given. For example, suppose we record the spectra of a hundred thousand stars; are there different types of stars revealed by the spectra, and, if so, how many types and what are their characteristics? We are all familiar with terms such as "red giant" and "white dwarf," but the stars do not carry these labels on their hats; astronomers had to perform unsupervised clustering to identify these categories. Other examples include the identification of species, genera, orders, and so on in the Linnaean taxonomy and the creation of natural kinds for ordinary objects (see Chapter 12).
Unsupervised clustering begins with data. Figure 20.11(b) shows 500 data points, each of which specifies the values of two continuous attributes. The data points might correspond to stars, and the attributes might correspond to spectral intensities at two particular frequencies. Next, we need to understand what kind of probability distribution might have generated the data. Clustering presumes that the data are generated from a mixture distribution, P. Such a distribution has k components, each of which is a distribution in its own right. A data point is generated by first choosing a component and then generating a sample from that component. Let the random variable C denote the component, with values 1, ..., k; then the mixture distribution is given by

P(x) = Σi=1..k P(C = i) P(x | C = i) ,

where x refers to the values of the attributes for a data point. For continuous data, a natural choice for the component distributions is the multivariate Gaussian, which gives the so-called mixture of Gaussians family of distributions. The parameters of a mixture of Gaussians are
Figure 20.11   (a) A Gaussian mixture model with three components; the weights (left-to-right) are 0.2, 0.3, and 0.5. (b) 500 data points sampled from the model in (a). (c) The model reconstructed by EM from the data in (b).

wi = P(C = i) (the weight of each component), μi (the mean of each component), and Σi (the covariance of each component). Figure 20.11(a) shows a mixture of three Gaussians; this mixture is in fact the source of the data in (b), as well as being the model shown in Figure 20.7(a).
The unsupervised clustering problem, then, is to recover a mixture model like the one in Figure 20.11(a) from raw data like that in Figure 20.11(b). Clearly, if we knew which component generated each data point, then it would be easy to recover the component Gaussians: we could just select all the data points from a given component and then apply (a multivariate version of) the maximum-likelihood fitting equations of Section 20.2.3 for fitting the parameters of a Gaussian to a set of data. On the other hand, if we knew the parameters of each component, then we could, at least in a probabilistic sense, assign each data point to a component. The problem is that we know neither the assignments nor the parameters.
The basic idea of EM in this context is to pretend that we know the parameters of the model and then to infer the probability that each data point belongs to each component. After that, we refit the components to the data, where each component is fitted to the entire data set with each point weighted by the probability that it belongs to that component. The process iterates until convergence. Essentially, we are "completing" the data by inferring probability distributions over the hidden variables (which component each data point belongs to) based on the current model. For the mixture of Gaussians, we initialize the mixture-model parameters arbitrarily and then iterate the following two steps:
1. E-step: Compute the probabilities pij = P(C = i | xj), the probability that datum xj was generated by component i. By Bayes' rule, we have pij = α P(xj | C = i) P(C = i). The term P(xj | C = i) is just the probability at xj of the ith Gaussian, and the term P(C = i) is just the weight parameter for the ith Gaussian. Define ni = Σj pij, the effective number of data points currently assigned to component i.
2. M-step: Compute the new mean, covariance, and component weights using the following steps in sequence:
μi ← Σj pij xj / ni
Σi ← Σj pij (xj − μi)(xj − μi)⊤ / ni
wi ← ni / N ,
where N is the total number of data points. The E-step, or expectation step, can be viewed as computing the expected values pij of the hidden indicator variables Zij, where Zij is 1 if datum xj was generated by the ith component and 0 otherwise. The M-step, or maximization step, finds the new values of the parameters that maximize the log likelihood of the data, given the expected values of the hidden indicator variables.
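The two steps translate directly into array operations. The sketch below (ours, written with NumPy rather than any code from the text) performs one E-step and one M-step for a k-component Gaussian mixture; iterating it until the log likelihood stops improving yields the EM procedure described here.

import numpy as np

def em_step(X, weights, means, covs):
    """One EM iteration for a mixture of Gaussians.
    X: (N, d) data; weights: (k,); means: (k, d); covs: list of (d, d) matrices."""
    N, d = X.shape
    k = len(weights)
    # E-step: responsibilities p[i, j] = P(C = i | x_j).
    p = np.zeros((k, N))
    for i in range(k):
        diff = X - means[i]
        inv = np.linalg.inv(covs[i])
        norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(covs[i]))
        exponent = -0.5 * np.sum(diff @ inv * diff, axis=1)
        p[i] = weights[i] * norm * np.exp(exponent)
    p /= p.sum(axis=0)                       # normalize over components (the alpha)
    # M-step: re-estimate the parameters from the weighted data.
    n = p.sum(axis=1)                        # effective counts n_i
    new_weights = n / N
    new_means = (p @ X) / n[:, None]
    new_covs = []
    for i in range(k):
        diff = X - new_means[i]
        new_covs.append((p[i, :, None] * diff).T @ diff / n[i])
    return new_weights, new_means, new_covs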
The final model that EM learns when it is applied to the data in Figure 20.11(b) is shown in Figure 20.11(c); it is virtually indistinguishable from the original model from which the data were generated. Figure 20.12(a) plots the log likelihood of the data according to the current model as EM progresses.
There are two points to notice. First, the log likelihood for the final learned model slightly exceeds that of the original model, from which the data were generated. This might seem surprising, but it simply reflects the fact that the data were generated randomly and might not provide an exact reflection of the underlying model. The second point is that EM increases the log likelihood of the data at every iteration. This fact can be proved in general. Furthermore, under certain conditions (that hold in most cases), EM can be proven to reach a local maximum in likelihood. (In rare cases, it could reach a saddle point or even a local minimum.) In this sense, EM resembles a gradient-based hill-climbing algorithm, but notice that it has no "step size" parameter.

 
 
 
Figure 20.12   Graphs showing the log likelihood of the data, L, as a function of the EM iteration. The horizontal line shows the log likelihood according to the true model. (a) Graph for the Gaussian mixture model in Figure 20.11. (b) Graph for the Bayesian network in Figure 20.13(a).

Figure 20.13   (a) A mixture model for candy. The proportions of different flavors, wrappers, and presence of holes depend on the bag, which is not observed. (b) Bayesian network for a Gaussian mixture. The mean and covariance of the observable variables X depend on the component C.

Things do not always go as well as Figure 20.12(a) might suggest. It can happen, for example, that one Gaussian component shrinks so that it covers just a single data point. Then its variance will go to zero and its likelihood will go to infinity! Another problem is that two components can "merge," acquiring identical means and variances and sharing their data points. These kinds of degenerate local maxima are serious problems, especially in high dimensions. One solution is to place priors on the model parameters and to apply the MAP version of EM. Another is to restart a component with new random parameters if it gets too small or too close to another component. Sensible initialization also helps.

20.3.2  Learning Bayesian networks with hidden variables


To learn a Bayesian network with hidden variables, we apply the same insights that worked for mixtures of Gaussians. Figure 20.13(a) represents a situation in which there are two bags of candies that have been mixed together. Candies are described by three features: in addition to the Flavor and the Wrapper, some candies have a Hole in the middle and some do not. The distribution of candies in each bag is described by a naive Bayes model: the features are independent, given the bag, but the conditional probability distribution for each feature depends on the bag. The parameters are as follows: θ is the prior probability that a candy comes from Bag 1; θF1 and θF2 are the probabilities that the flavor is cherry, given that the candy comes from Bag 1 or Bag 2 respectively; θW1 and θW2 give the probabilities that the wrapper is red; and θH1 and θH2 give the probabilities that the candy has a hole. Notice that the overall model is a mixture model. (In fact, we can also model the mixture of Gaussians as a Bayesian network, as shown in Figure 20.13(b).) In the figure, the bag is a hidden variable because, once the candies have been mixed together, we no longer know which bag each candy came from.

In such a case, can we recover the descriptions of the two bags by observing candies from the mixture? Let us work through an iteration of EM for this problem.
First, let's look at the data. We generated 1000 samples from a model whose true parameters are as follows:

θ = 0.5,  θF1 = θW1 = θH1 = 0.8,  θF2 = θW2 = θH2 = 0.3 .

That is, the candies are equally likely to come from either bag; the first is mostly cherries with red wrappers and holes; the second is mostly limes with green wrappers and no holes. The counts for the eight possible kinds of candy are as follows:

                  W = red            W = green
               H = 1   H = 0      H = 1   H = 0
F = cherry      273      93        104      90
F = lime         79     100         94     167

We start by initializing the parameters. For numerical simplicity, we arbitrarily choose⁵

θ(0) = 0.6,  θF1(0) = θW1(0) = θH1(0) = 0.6,  θF2(0) = θW2(0) = θH2(0) = 0.4 .
First, let us work on the θ parameter. In the fully observable case, we would estimate this directly from the observed counts of candies from bags 1 and 2. Because the bag is a hidden variable, we calculate the expected counts instead. The expected count N̂(Bag = 1) is the sum, over all candies, of the probability that the candy came from bag 1:

θ(1) = N̂(Bag = 1)/N = Σj=1..N P(Bag = 1 | flavorj, wrapperj, holesj) / N .

These probabilities can be computed by any inference algorithm for Bayesian networks. For a naive Bayes model such as the one in our example, we can do the inference "by hand," using Bayes' rule and applying conditional independence:

θ(1) = (1/N) Σj=1..N [P(flavorj | Bag = 1) P(wrapperj | Bag = 1) P(holesj | Bag = 1) P(Bag = 1)] / [Σi P(flavorj | Bag = i) P(wrapperj | Bag = i) P(holesj | Bag = i) P(Bag = i)] .

Applying this formula to, say, the 273 red-wrapped cherry candies with holes, we get a contribution of

(273/1000) · θF1(0) θW1(0) θH1(0) θ(0) / (θF1(0) θW1(0) θH1(0) θ(0) + θF2(0) θW2(0) θH2(0) (1 − θ(0))) ≈ 0.22797 .

Continuing with the other seven kinds of candy in the table of counts, we obtain θ(1) = 0.6124.
Now let us consider the other parameters, such as θF1. In the fully observable case, we would estimate this directly from the observed counts of cherry and lime candies from bag 1. The expected count of cherry candies from bag 1 is given by

Σ_{j: Flavorj = cherry} P(Bag = 1 | Flavorj = cherry, wrapperj, holesj) .

⁵ It is better in practice to choose the initial parameters randomly, to avoid local maxima due to symmetry.

Again, these probabilities can be calculated by any Bayes net algorithm. Completing this process, we obtain the new values of all the parameters:

θ(1) = 0.6124,  θF1(1) = 0.6684,  θW1(1) = 0.6483,  θH1(1) = 0.6558,
θF2(1) = 0.3887,  θW2(1) = 0.3817,  θH2(1) = 0.3827 .

The log likelihood of the data increases from about −2044 initially to about −2021 after the first iteration, as shown in Figure 20.12(b). That is, the update improves the likelihood itself by a factor of about e^23 ≈ 10^10. By the tenth iteration, the learned model is a better fit than the original model (L = −1982.214). Thereafter, progress becomes very slow. This is not uncommon with EM, and many practical systems combine EM with a gradient-based algorithm such as Newton–Raphson for the last phase of learning.
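The expected-count update for θ can be reproduced in a few lines. The sketch below (ours) uses the table of counts and the initial parameters given above; the function names are our own.

def lik(flavor, wrapper, hole, params):
    """P(flavor, wrapper, hole | Bag) for one bag's parameters (thetaF, thetaW, thetaH)."""
    tF, tW, tH = params
    return ((tF if flavor == 'cherry' else 1 - tF) *
            (tW if wrapper == 'red' else 1 - tW) *
            (tH if hole else 1 - tH))

def update_theta(counts, theta, params1, params2):
    """One EM update of theta = P(Bag = 1) from expected counts."""
    expected_bag1, N = 0.0, 0
    for (flavor, wrapper, hole), n in counts.items():
        p1 = theta * lik(flavor, wrapper, hole, params1)
        p2 = (1 - theta) * lik(flavor, wrapper, hole, params2)
        expected_bag1 += n * p1 / (p1 + p2)     # expected number from Bag 1
        N += n
    return expected_bag1 / N

counts = {('cherry', 'red', True): 273, ('cherry', 'red', False): 93,
          ('cherry', 'green', True): 104, ('cherry', 'green', False): 90,
          ('lime', 'red', True): 79, ('lime', 'red', False): 100,
          ('lime', 'green', True): 94, ('lime', 'green', False): 167}
print(update_theta(counts, 0.6, (0.6, 0.6, 0.6), (0.4, 0.4, 0.4)))  # about 0.6124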
The general lesson from this example is that the parameter updates for Bayesian network learning with hidden variables are directly available from the results of inference on each example. Moreover, only local posterior probabilities are needed for each parameter. Here, "local" means that the CPT for each variable Xi can be learned from posterior probabilities involving just Xi and its parents Ui. Defining θijk to be the CPT parameter P(Xi = xij | Ui = uik), the update is given by the normalized expected counts as follows:

θijk ← N̂(Xi = xij, Ui = uik) / N̂(Ui = uik) .

The expected counts are obtained by summing over the examples, computing the probabilities P(Xi = xij, Ui = uik) for each by using any Bayes net inference algorithm. For the exact algorithms (including variable elimination), all these probabilities are obtainable directly as a by-product of standard inference, with no need for extra computations specific to learning. Moreover, the information needed for learning is available locally for each parameter.

20.3.3  Learning hidden Markov models


Our final application of EM involves learning the transition probabilities in hidden Markov models (HMMs). Recall from Chapter 15 that a hidden Markov model can be represented by a dynamic Bayes net with a single discrete state variable, as illustrated in Figure 20.14. Each data point consists of an observation sequence of finite length, so the problem is to learn the transition probabilities from a set of observation sequences (or from just one long sequence).
We have already worked out how to learn Bayes nets, but there is one complication: in Bayes nets, each parameter is distinct; in a hidden Markov model, on the other hand, the individual transition probabilities from state i to state j at time t, θijt = P(Xt+1 = j | Xt = i), are repeated across time; that is, θijt = θij for all t. To estimate the transition probability from state i to state j, we simply calculate the expected proportion of times that the system undergoes a transition to state j when in state i:

θij ← Σt N̂(Xt+1 = j, Xt = i) / Σt N̂(Xt = i) .

The expected counts are computed by an HMM inference algorithm. The forward–backward algorithm of Chapter 15 can be modified very easily to compute the necessary probabilities.
Figure 20.14   An unrolled dynamic Bayesian network that represents a hidden Markov model (the umbrella DBN of Chapter 15).

One important point is that the probabilities required are obtained by smoothing rather than filtering; that is, we need to pay attention to subsequent evidence in estimating the probability that a particular transition occurred. The evidence in a murder case is usually obtained after the crime (i.e., the transition from state i to state j) has taken place.
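Assuming the smoothed pairwise posteriors have already been computed, the update itself is a simple ratio of expected counts. In the sketch below (ours), xi[t][i][j] stands for P(Xt = i, Xt+1 = j | all evidence), as would be produced by a suitably modified forward–backward pass.

def update_transitions(xi, num_states):
    """Re-estimate theta[i][j] = P(X_{t+1} = j | X_t = i) from smoothed
    pairwise posteriors xi[t][i][j]."""
    theta = [[0.0] * num_states for _ in range(num_states)]
    for i in range(num_states):
        visits = sum(xi[t][i][j] for t in range(len(xi)) for j in range(num_states))
        for j in range(num_states):
            transitions = sum(xi[t][i][j] for t in range(len(xi)))
            theta[i][j] = transitions / visits    # expected proportion of i -> j moves
    return theta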

20.3.4  The general form of the EM algorithm


We have seen several instances of the EM algorithm. Each involves computing expected values of hidden variables for each example and then recomputing the parameters, using the expected values as if they were observed values. Let x be all the observed values in all the examples, let Z denote all the hidden variables for all the examples, and let θ be all the parameters for the probability model. Then the EM algorithm is

θ(i+1) = argmax_θ Σz P(Z = z | x, θ(i)) L(x, Z = z | θ) .

This equation is the EM algorithm in a nutshell. The E-step is the computation of the summation, which is the expectation of the log likelihood of the "completed" data with respect to the distribution P(Z = z | x, θ(i)), which is the posterior over the hidden variables, given the data. The M-step is the maximization of this expected log likelihood with respect to the parameters. For mixtures of Gaussians, the hidden variables are the Zij's, where Zij is 1 if example j was generated by component i. For Bayes nets, Zij is the value of unobserved variable Xi in example j. For HMMs, Zjt is the state of the sequence in example j at time t. Starting from the general form, it is possible to derive an EM algorithm for a specific application once the appropriate hidden variables have been identified.
As soon as we understand the general idea of EM, it becomes easy to derive all sorts of variants and improvements. For example, in many cases the E-step (the computation of posteriors over the hidden variables) is intractable, as in large Bayes nets. It turns out that one can use an approximate E-step and still obtain an effective learning algorithm. With a sampling algorithm such as MCMC (see Chapter 14), the learning process is very intuitive: each state (configuration of hidden and observed variables) visited by MCMC is treated exactly as if it were a complete observation. Thus, the parameters can be updated directly after each MCMC transition. Other forms of approximate inference, such as variational and loopy methods, have also proved effective for learning very large networks.

20.3.5  Learning Bayes net structures with hidden variables

)N 3ECTION  WE DISCUSSED THE PROBLEM OF LEARNING "AYES NET STRUCTURES WITH COMPLETE
DATA 7HEN UNOBSERVED VARIABLES MAY BE INmUENCING THE DATA THAT ARE OBSERVED THINGS GET
MORE DIFlCULT )N THE SIMPLEST CASE A HUMAN EXPERT MIGHT TELL THE LEARNING ALGORITHM THAT CER
TAIN HIDDEN VARIABLES EXIST LEAVING IT TO THE ALGORITHM TO lND A PLACE FOR THEM IN THE NETWORK
STRUCTURE &OR EXAMPLE AN ALGORITHM MIGHT TRY TO LEARN THE STRUCTURE SHOWN IN &IGURE A
ON PAGE  GIVEN THE INFORMATION THAT HeartDisease A THREE VALUED VARIABLE SHOULD BE IN
CLUDED IN THE MODEL !S IN THE COMPLETE DATA CASE THE OVERALL ALGORITHM HAS AN OUTER LOOP THAT
SEARCHES OVER STRUCTURES AND AN INNER LOOP THAT lTS THE NETWORK PARAMETERS GIVEN THE STRUCTURE
)F THE LEARNING ALGORITHM IS NOT TOLD WHICH HIDDEN VARIABLES EXIST THEN THERE ARE TWO
CHOICES EITHER PRETEND THAT THE DATA IS REALLY COMPLETEˆWHICH MAY FORCE THE ALGORITHM TO
LEARN A PARAMETER INTENSIVE MODEL SUCH AS THE ONE IN &IGURE B ˆOR LQYHQW NEW HIDDEN
VARIABLES IN ORDER TO SIMPLIFY THE MODEL 4HE LATTER APPROACH CAN BE IMPLEMENTED BY INCLUDING
NEW MODIlCATION CHOICES IN THE STRUCTURE SEARCH IN ADDITION TO MODIFYING LINKS THE ALGORITHM
CAN ADD OR DELETE A HIDDEN VARIABLE OR CHANGE ITS ARITY /F COURSE THE ALGORITHM WILL NOT KNOW
THAT THE NEW VARIABLE IT HAS INVENTED IS CALLED HeartDisease NOR WILL IT HAVE MEANINGFUL
NAMES FOR THE VALUES &ORTUNATELY NEWLY INVENTED HIDDEN VARIABLES WILL USUALLY BE CONNECTED
TO PREEXISTING VARIABLES SO A HUMAN EXPERT CAN OFTEN INSPECT THE LOCAL CONDITIONAL DISTRIBUTIONS
INVOLVING THE NEW VARIABLE AND ASCERTAIN ITS MEANING
As in the complete-data case, pure maximum-likelihood structure learning will result in a completely connected network (moreover, one with no hidden variables), so some form of complexity penalty is required. We can also apply MCMC to sample many possible network structures, thereby approximating Bayesian learning. For example, we can learn mixtures of Gaussians with an unknown number of components by sampling over the number; the approximate posterior distribution for the number of Gaussians is given by the sampling frequencies of the MCMC process.
For the complete-data case, the inner loop to learn the parameters is very fast: it is just a matter of extracting conditional frequencies from the data set. When there are hidden variables, the inner loop may involve many iterations of EM or a gradient-based algorithm, and each iteration involves the calculation of posteriors in a Bayes net, which is itself an NP-hard problem. To date, this approach has proved impractical for learning complex models. One possible improvement is the so-called structural EM algorithm, which operates in much the same way as ordinary (parametric) EM except that the algorithm can update the structure as well as the parameters. Just as ordinary EM uses the current parameters to compute the expected counts in the E-step and then applies those counts in the M-step to choose new parameters, structural EM uses the current structure to compute expected counts and then applies those counts in the M-step to evaluate the likelihood for potential new structures. This contrasts with the outer-loop/inner-loop method, which computes new expected counts for each potential structure. In this way, structural EM may make several structural alterations to the network without once recomputing the expected counts, and it is capable of learning nontrivial Bayes net structures. Nonetheless, much work remains to be done before we can say that the structure-learning problem is solved.
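A schematic sketch of this control flow is shown below; the model-specific pieces (expected_counts, ml_params, neighbors, expected_score) are hypothetical placeholders supplied by the caller, not a real API. The point it illustrates is that several structural moves are evaluated and applied per E-step, with the same expected counts reused rather than recomputed for each candidate structure.

```python
# Schematic sketch of structural EM with caller-supplied, hypothetical helpers:
#   expected_counts(structure, params, data) -> expected sufficient statistics
#   ml_params(structure, counts)             -> parameters maximizing expected log likelihood
#   neighbors(structure)                     -> candidate structural modifications
#   expected_score(structure, counts)        -> score of a structure under fixed counts
def structural_em(data, structure, params, expected_counts, ml_params,
                  neighbors, expected_score, iterations=20):
    for _ in range(iterations):
        counts = expected_counts(structure, params, data)  # one (possibly costly) E-step
        params = ml_params(structure, counts)              # parametric M-step
        improved = True
        while improved:                                    # structural M-step: reuse counts
            improved = False
            for candidate in neighbors(structure):
                if expected_score(candidate, counts) > expected_score(structure, counts):
                    structure = candidate
                    params = ml_params(structure, counts)
                    improved = True
                    break
    return structure, params
```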
SUMMARY
Statistical learning methods range from simple calculation of averages to the construction of complex models such as Bayesian networks. They have applications throughout computer science, engineering, computational biology, neuroscience, psychology, and physics. This chapter has presented some of the basic ideas and given a flavor of the mathematical underpinnings. The main points are as follows:
• Bayesian learning methods formulate learning as a form of probabilistic inference, using the observations to update a prior distribution over hypotheses. This approach provides a good way to implement Ockham's razor, but quickly becomes intractable for complex hypothesis spaces.
• Maximum a posteriori (MAP) learning selects a single most likely hypothesis given the data. The hypothesis prior is still used, and the method is often more tractable than full Bayesian learning.
• Maximum-likelihood learning simply selects the hypothesis that maximizes the likelihood of the data; it is equivalent to MAP learning with a uniform prior. In simple cases such as linear regression and fully observable Bayesian networks, maximum-likelihood solutions can be found easily in closed form. Naive Bayes learning is a particularly effective technique that scales well.
• When some variables are hidden, local maximum-likelihood solutions can be found using the EM algorithm. Applications include clustering using mixtures of Gaussians, learning Bayesian networks, and learning hidden Markov models.
• Learning the structure of Bayesian networks is an example of model selection. This usually involves a discrete search in the space of structures. Some method is required for trading off model complexity against degree of fit.
• Nonparametric models represent a distribution using the collection of data points. Thus, the number of parameters grows with the training set. Nearest-neighbors methods look at the examples nearest to the point in question, whereas kernel methods form a distance-weighted combination of all the examples (a short sketch follows this list).
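As a small illustration of the last point, here is a sketch (not from the text) of one-dimensional kernel and k-nearest-neighbor density estimates; the Gaussian kernel width h and the value of k are illustrative parameters.

```python
# Illustrative sketch: two nonparametric density estimates for 1-D data.
import numpy as np

def kernel_density(query, data, h=0.5):
    """Distance-weighted combination of all examples, with a Gaussian kernel of width h."""
    weights = np.exp(-0.5 * ((query - data) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return weights.mean()

def knn_density(query, data, k=10):
    """Crude k-nearest-neighbor estimate: k points lie within distance d_k of the query."""
    d_k = np.sort(np.abs(data - query))[k - 1]
    return k / (len(data) * 2 * d_k)
```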
Statistical learning continues to be a very active area of research. Enormous strides have been made in both theory and practice, to the point where it is possible to learn almost any model for which exact or approximate inference is feasible.

" )",)/'2!0()#!, !.$ ( )34/2)#!, . /4%3

The application of statistical learning techniques in AI was an active area of research in the early years (see Duda and Hart), but became separated from mainstream AI as the latter field concentrated on symbolic methods. A resurgence of interest occurred shortly after the introduction of Bayesian network models in the late 1980s; at roughly the same time,
a statistical view of neural network learning began to emerge. In the late 1990s there was a noticeable convergence of interests in machine learning, statistics, and neural networks, centered on methods for creating large probabilistic models from data.
The naive Bayes model is one of the oldest and simplest forms of Bayesian network, dating back to the 1950s; its origins were mentioned earlier in the book. Its surprising success is partially explained by Domingos and Pazzani. A boosted form of naive Bayes learning won the first KDD Cup data mining competition (Elkan). Heckerman gives an excellent introduction to the general problem of Bayes net learning. Bayesian parameter learning with Dirichlet priors for Bayesian networks was discussed by Spiegelhalter et al. The BUGS software package (Gilks et al.) incorporates many of these ideas and provides a very powerful tool for formulating and learning complex probability models.
The first algorithms for learning Bayes net structures used conditional independence tests (Pearl; Pearl and Verma). Spirtes et al. developed a comprehensive approach embodied in the TETRAD package for Bayes net learning. Algorithmic improvements since then led to a clear victory for a Bayes net learning method in a later KDD Cup data mining competition (Cheng et al.); the specific task there was a bioinformatics problem with a very large number of features. A structure-learning approach based on maximizing likelihood was developed by Cooper and Herskovits and improved by Heckerman et al. Several algorithmic advances since that time have led to quite respectable performance in the complete-data case (Moore and Wong; Teyssier and Koller). One important component is an efficient data structure, the AD-tree, for caching counts over all possible combinations of variables and values (Moore and Lee). Friedman and Goldszmidt pointed out the influence of the representation of local conditional distributions on the learned structure.
The general problem of learning probability models with hidden variables and missing data was addressed by Hartley, who described the general idea of what was later called EM and gave several examples. Further impetus came from the Baum-Welch algorithm for HMM learning (Baum and Petrie), which is a special case of EM. The paper by Dempster, Laird, and Rubin, which presented the EM algorithm in general form and analyzed its convergence, is one of the most cited papers in both computer science and statistics. Dempster himself views EM as a schema rather than an algorithm, since a good deal of mathematical work may be required before it can be applied to a new family of distributions. McLachlan and Krishnan devote an entire book to the algorithm and its properties. The specific problem of learning mixture models, including mixtures of Gaussians, is covered by Titterington et al. Within AI, the first successful system that used EM for mixture modeling was AUTOCLASS (Cheeseman et al.; Cheeseman and Stutz). AUTOCLASS has been applied to a number of real-world scientific classification tasks, including the discovery of new types of stars from spectral data (Goebel et al.) and of new classes of proteins and introns in DNA/protein sequence databases (Hunter and States).
For maximum-likelihood parameter learning in Bayes nets with hidden variables, EM and gradient-based methods were introduced around the same time by Lauritzen, Russell et al., and Binder et al. The structural EM algorithm was developed by Friedman and applied to maximum-likelihood learning of Bayes net structures with
latent variables. Friedman and Koller describe Bayesian structure learning.
The ability to learn the structure of Bayesian networks is closely connected to the issue of recovering causal information from data. That is, is it possible to learn Bayes nets in such a way that the recovered network structure indicates real causal influences? For many years, statisticians avoided this question, believing that observational data (as opposed to data generated from experimental trials) could yield only correlational information; after all, any two variables that appear related might in fact be influenced by a third, unknown causal factor rather than influencing each other directly. Pearl has presented convincing arguments to the contrary, showing that there are in fact many cases where causality can be ascertained, and developing the causal network formalism to express causes and the effects of intervention, as well as ordinary conditional probabilities.
Nonparametric density estimation, also called Parzen window density estimation, was investigated initially by Rosenblatt and Parzen. Since that time, a huge literature has developed investigating the properties of various estimators; Devroye gives a thorough introduction. There is also a rapidly growing literature on nonparametric Bayesian methods, originating with the seminal work of Ferguson on the Dirichlet process, which can be thought of as a distribution over Dirichlet distributions. These methods are particularly useful for mixtures with unknown numbers of components. Ghahramani and Jordan provide useful tutorials on the many applications of these ideas to statistical learning. The text by Rasmussen and Williams covers the Gaussian process, which gives a way of defining prior distributions over the space of continuous functions.
The material in this chapter brings together work from the fields of statistics and pattern recognition, so the story has been told many times in many ways. Good texts on Bayesian statistics include those by DeGroot, Berger, and Gelman et al. Bishop and Hastie et al. provide an excellent introduction to statistical machine learning. For pattern classification, the classic text for many years has been Duda and Hart, now updated (Duda et al.). The annual NIPS (Neural Information Processing Systems) conference, whose proceedings are published as the series Advances in Neural Information Processing Systems, is now dominated by Bayesian papers. Papers on learning Bayesian networks also appear in the Uncertainty in AI and Machine Learning conferences and in several statistics conferences. Journals specific to neural networks include Neural Computation, Neural Networks, and the IEEE Transactions on Neural Networks; specifically Bayesian venues include the Valencia International Meetings on Bayesian Statistics and the journal Bayesian Analysis.

EXERCISES
The data used for the first figure in this chapter can be viewed as being generated by h5. For each of the other four hypotheses, generate a data set of the same length and plot the corresponding graphs for P(hi | d1, ..., dN) and P(DN+1 = lime | d1, ..., dN). Comment on your results.
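One possible starting point (an illustrative sketch, not a prescribed solution) is to sample candy sequences and update the posterior recursively, assuming the chapter's prior of (0.1, 0.2, 0.4, 0.2, 0.1) over h1 through h5; the function name and the default sequence length are arbitrary choices.

```python
# Illustrative starting point: sample a candy sequence from one hypothesis and track
# the posterior P(h_i | d_1..d_N) and the prediction P(D_{N+1} = lime | d_1..d_N).
import numpy as np

priors = np.array([0.1, 0.2, 0.4, 0.2, 0.1])         # prior over h1..h5 (from the chapter)
lime_probs = np.array([0.0, 0.25, 0.5, 0.75, 1.0])    # P(lime | h_i)

def run(true_hypothesis, n=100, seed=0):
    rng = np.random.default_rng(seed)
    data = rng.random(n) < lime_probs[true_hypothesis]   # True = lime, False = cherry
    posterior = priors.copy()
    posteriors, predictions = [posterior.copy()], [posterior @ lime_probs]
    for is_lime in data:
        likelihood = np.where(is_lime, lime_probs, 1.0 - lime_probs)
        posterior = posterior * likelihood
        posterior /= posterior.sum()                     # Bayes' rule, renormalized
        posteriors.append(posterior.copy())
        predictions.append(posterior @ lime_probs)       # P(next candy is lime | data)
    return np.array(posteriors), np.array(predictions)

post, pred = run(true_hypothesis=1)   # e.g. data generated by h2 (25% lime)
```

Plotting the columns of post and the values of pred against the number of observed candies produces the kind of curves the exercise asks about.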
Repeat the preceding exercise, this time plotting the values of P(DN+1 = lime | hMAP) and P(DN+1 = lime | hML).
Suppose that Ann's utilities for cherry and lime candies are cA and ℓA, whereas Bob's utilities are cB and ℓB. (But once Ann has unwrapped a piece of candy, Bob won't buy it.) Presumably, if Bob likes lime candies much more than Ann does, it would be wise for Ann to sell her bag of candies once she is sufficiently sure of its lime content. On the other hand, if Ann unwraps too many candies in the process, the bag will be worth less. Discuss the problem of determining the optimal point at which to sell the bag. Determine the expected utility of the optimal procedure, given the prior distribution over bag types from the beginning of the chapter.
Two statisticians go to the doctor and are both given the same prognosis: a 40% chance that the problem is the deadly disease A, and a 60% chance of the fatal disease B. Fortunately, there are anti-A and anti-B drugs that are inexpensive, 100% effective, and free of side effects. The statisticians have the choice of taking one drug, both, or neither. What will the first statistician (an avid Bayesian) do? How about the second statistician, who always uses the maximum likelihood hypothesis?
The doctor does some research and discovers that disease B actually comes in two versions, dextro-B and levo-B, which are equally likely and equally treatable by the anti-B drug. Now that there are three hypotheses, what will the two statisticians do?
Explain how to apply the boosting method described earlier in the book to naive Bayes learning. Test the performance of the resulting algorithm on the restaurant learning problem.
Consider N data points (xj, yj), where the yj's are generated from the xj's according to the linear Gaussian model described earlier in the chapter. Find the values of θ1, θ2, and σ that maximize the conditional log likelihood of the data.
Consider the noisy-OR model for fever described earlier in the book. Explain how to apply maximum-likelihood learning to fit the parameters of such a model to a set of complete data. (Hint: use the chain rule for partial derivatives.)
This exercise investigates properties of the Beta distribution defined earlier in the chapter.
(a) By integrating over the range [0, 1], show that the normalization constant for the distribution beta[a, b] is given by α = Γ(a + b)/(Γ(a)Γ(b)), where Γ(x) is the Gamma function, defined by Γ(x + 1) = x · Γ(x) and Γ(1) = 1. (For integer x, Γ(x + 1) = x!.)
(b) Show that the mean is a/(a + b).
(c) Find the mode(s) (the most likely value(s) of θ).
(d) Describe the distribution beta[ε, ε] for very small ε. What happens as such a distribution is updated?

Consider an arbitrary Bayesian network, a complete data set for that network, and the likelihood for the data set according to the network. Give a simple proof that the likelihood of the data cannot decrease if we add a new link to the network and recompute the maximum-likelihood parameter values.
Consider the application of EM to learn the parameters for the candy network with the hidden Bag variable used in the EM example of this chapter, given the true parameters stated there.
(a) Explain why the EM algorithm would not work if there were just two attributes in the model rather than three.
(b) Show the calculations for the first iteration of EM, starting from the initial parameter values given in the chapter.
(c) What happens if we start with all the parameters set to the same value p? (Hint: you may find it helpful to investigate this empirically before deriving the general result.)
(d) Write out an expression for the log likelihood of the tabulated candy data given in the chapter in terms of the parameters, calculate the partial derivatives with respect to each parameter, and investigate the nature of the fixed point reached in part (c).
