Learning Probabilistic Models
An earlier chapter pointed out the prevalence of uncertainty in real environments. Agents can handle uncertainty by using the methods of probability and decision theory, but first they must learn their probabilistic theories of the world from experience. This chapter explains how they can do that, by formulating the learning task itself as a process of probabilistic inference. We will see that a Bayesian view of learning is extremely powerful, providing general solutions to the problems of noise, overfitting, and optimal prediction. It also takes into account the fact that a less-than-omniscient agent can never be certain about which theory of the world is correct, yet must still make decisions by using some theory of the world.

We describe methods for learning probability models, primarily Bayesian networks, in the sections that follow. Some of the material in this chapter is fairly mathematical, although the general lessons can be understood without plunging into the details. It may benefit the reader to review the earlier chapters on probability and Bayesian networks and to peek at the mathematical appendix.

The key concepts in this chapter, just as in the chapters on inductive learning, are data and hypotheses. Here, the data are evidence: that is, instantiations of some or all of the random variables describing the domain. The hypotheses in this chapter are probabilistic theories of how the domain works, including logical theories as a special case.
Consider a simple example. Our favorite Surprise candy comes in two flavors: cherry (yum) and lime (ugh). The manufacturer has a peculiar sense of humor and wraps each piece of candy in the same opaque wrapper, regardless of flavor. The candy is sold in very large bags, of which there are known to be five kinds, again indistinguishable from the outside:

    h1: 100% cherry
    h2: 75% cherry + 25% lime
    h3: 50% cherry + 50% lime
    h4: 25% cherry + 75% lime
    h5: 100% lime
Given a new bag of candy, the random variable H (for hypothesis) denotes the type of the bag, with possible values h1 through h5. H is not directly observable, of course. As the pieces of candy are opened and inspected, data are revealed: D1, D2, ..., DN, where each Di is a random variable with possible values cherry and lime. The basic task faced by the agent is to predict the flavor of the next piece of candy. Despite its apparent triviality, this scenario serves to introduce many of the major issues. The agent really does need to infer a theory of its world, albeit a very simple one.

Bayesian learning simply calculates the probability of each hypothesis, given the data, and makes predictions on that basis. That is, the predictions are made by using all the hypotheses, weighted by their probabilities, rather than by using just a single "best" hypothesis. In this way, learning is reduced to probabilistic inference. Let D represent all the data, with observed value d; then the probability of each hypothesis is obtained by Bayes' rule:
    P(hi | d) = α P(d | hi) P(hi) .
Now suppose we want to make a prediction about an unknown quantity X. Then we have

    P(X | d) = Σi P(X | d, hi) P(hi | d) = Σi P(X | hi) P(hi | d) ,

where we have assumed that each hypothesis determines a probability distribution over X.
This equation shows that predictions are weighted averages over the predictions of the individual hypotheses. The hypotheses themselves are essentially "intermediaries" between the raw data and the predictions. The key quantities in the Bayesian approach are the hypothesis prior, P(hi), and the likelihood of the data under each hypothesis, P(d | hi).

For our candy example, we will assume for the time being that the prior distribution over h1, ..., h5 is given by ⟨0.1, 0.2, 0.4, 0.2, 0.1⟩, as advertised by the manufacturer. The likelihood of the data is calculated under the assumption that the observations are iid (independent and identically distributed), so that
    P(d | hi) = Πj P(dj | hi) .
For example, suppose the bag is really an all-lime bag (h5) and the first 10 candies are all lime; then P(d | h3) is 0.5^10, because half the candies in an h3 bag are lime. Figure (a) shows how the posterior probabilities of the five hypotheses change as the sequence of 10 lime candies is observed. Notice that the probabilities start out at their prior values, so h3 is initially the most likely choice and remains so after 1 lime candy is unwrapped. After 2 lime candies are unwrapped, h4 is most likely; after 3 or more, h5 (the dreaded all-lime bag) is the most likely. After 10 in a row, we are fairly certain of our fate. Figure (b) shows the predicted probability that the next candy is lime, based on the prediction equation above. As we would expect, it increases monotonically toward 1.
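The posterior update and the resulting prediction can be sketched in a few lines of Python. This is only an illustration of the two equations above; the prior and the per-bag lime probabilities are the values assumed in the example:

```python
# A minimal sketch of Bayesian learning for the candy example; the prior and
# the per-bag lime probabilities are the values assumed in the text.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1) .. P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | hi) for each bag type

def posteriors(n_limes):
    """P(hi | first n_limes candies all lime), by Bayes' rule."""
    unnorm = [p * (pl ** n_limes) for p, pl in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def predict_lime(n_limes):
    """P(next candy is lime | data) = sum_i P(lime | hi) P(hi | data)."""
    return sum(pl * post for pl, post in zip(p_lime, posteriors(n_limes)))

for n in range(11):
    print(n, [round(p, 3) for p in posteriors(n)], round(predict_lime(n), 3))
```

Running the loop reproduces the behavior described above: h3 gives way to h4 after two limes and to h5 after three, while the predicted lime probability climbs toward 1.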
1. Statistically sophisticated readers will recognize this scenario as a variant of the urn-and-ball setup. We find urns and balls less compelling than candy; furthermore, candy lends itself to other tasks, such as deciding whether to trade the bag with a friend (see the exercises).
2. We stated earlier that the bags of candy are very large; otherwise, the iid assumption fails to hold. Technically, it is more correct (but less hygienic) to rewrap each candy after inspection and return it to the bag.
[Figure: (a) posterior probabilities of the five hypotheses, and (b) the Bayesian prediction that the next candy is lime, each plotted against the number of observations in d.]
The example shows that the Bayesian prediction eventually agrees with the true hypothesis. This is characteristic of Bayesian learning. For any fixed prior that does not rule out the true hypothesis, the posterior probability of any false hypothesis will, under certain technical conditions, eventually vanish. This happens simply because the probability of generating "uncharacteristic" data indefinitely is vanishingly small. (This point is analogous to one made in the discussion of PAC learning.) More important, the Bayesian prediction is optimal, whether the data set be small or large. Given the hypothesis prior, any other prediction is expected to be correct less often.
The optimality of Bayesian learning comes at a price, of course. For real learning problems, the hypothesis space is usually very large or infinite. In some cases, the summation over hypotheses (or integration, in the continuous case) can be carried out tractably, but in most cases we must resort to approximate or simplified methods.

A very common approximation, one that is usually adopted in science, is to make predictions based on a single most probable hypothesis: that is, an hi that maximizes P(hi | d). This is often called a maximum a posteriori or MAP (pronounced "em-ay-pee") hypothesis. Predictions made according to a MAP hypothesis hMAP are approximately Bayesian to the extent that P(X | d) ≈ P(X | hMAP). In our candy example, hMAP = h5 after three lime candies in a row, so the MAP learner then predicts that the fourth candy is lime with probability 1.0, a much more dangerous prediction than the Bayesian prediction of about 0.8 shown in Figure (b). As more data arrive, the MAP and Bayesian predictions become closer, because the competitors to the MAP hypothesis become less and less probable.
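A small sketch makes the contrast concrete. The setup repeats the candy example from above, and the numbers match the discussion: MAP jumps to probability 1.0 after three limes, while the Bayesian prediction stays near 0.8:

```python
# Sketch comparing the MAP prediction with the full Bayesian prediction after
# observing n lime candies in a row (same priors/likelihoods as in the text).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def posteriors(n):
    unnorm = [p * pl ** n for p, pl in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def bayes_pred(n):
    """Weighted average of the hypotheses' predictions."""
    return sum(pl * po for pl, po in zip(p_lime, posteriors(n)))

def map_pred(n):
    """Prediction of the single most probable hypothesis."""
    post = posteriors(n)
    h_map = max(range(5), key=lambda i: post[i])
    return p_lime[h_map]

for n in [0, 1, 2, 3, 5, 10]:
    print(n, round(bayes_pred(n), 3), map_pred(n))
```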
Although our example doesn't show it, finding MAP hypotheses is often much easier than full Bayesian learning, because it requires solving an optimization problem instead of a large summation (or integration) problem. We will see examples of this later in the chapter.
In both Bayesian learning and MAP learning, the hypothesis prior P(hi) plays an important role. We saw earlier that overfitting can occur when the hypothesis space is too expressive, so that it contains many hypotheses that fit the data set well. Rather than placing an arbitrary limit on the hypotheses to be considered, Bayesian and MAP learning methods use the prior to penalize complexity. Typically, more complex hypotheses have a lower prior probability, in part because there are usually many more complex hypotheses than simple ones. On the other hand, more complex hypotheses have a greater capacity to fit the data. (In the extreme case, a lookup table can reproduce the data exactly with probability 1.) Hence, the hypothesis prior embodies a tradeoff between the complexity of a hypothesis and its degree of fit to the data.

We can see the effect of this tradeoff most clearly in the logical case, where H contains only deterministic hypotheses. In that case, P(d | hi) is 1 if hi is consistent and 0 otherwise. Looking at Bayes' rule above, we see that hMAP will then be the simplest logical theory that is consistent with the data. Therefore, maximum a posteriori learning provides a natural embodiment of Ockham's razor.
Another insight into the tradeoff between complexity and degree of fit is obtained by taking the logarithm of Bayes' rule. Choosing hMAP to maximize P(d | hi) P(hi) is equivalent to minimizing

    − log2 P(d | hi) − log2 P(hi) .

Using the connection between information encoding and probability, we see that the − log2 P(hi) term equals the number of bits required to specify the hypothesis hi. Furthermore, − log2 P(d | hi) is the additional number of bits required to specify the data, given the hypothesis. (To see this, consider that no bits are required if the hypothesis predicts the data exactly, as with h5 and the string of lime candies, and log2 1 = 0.) Hence, MAP learning is choosing the hypothesis that provides maximum compression of the data. The same task is addressed more directly by the minimum description length, or MDL, learning method. Whereas MAP learning expresses simplicity by assigning higher probabilities to simpler hypotheses, MDL expresses it directly by counting the bits in a binary encoding of the hypotheses and data.
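The equivalence is easy to check numerically. The following sketch scores each candy hypothesis both ways, by posterior mass and by total bits, for a run of three lime candies:

```python
import math

# Sketch: choosing the MAP hypothesis is the same as minimizing total bits
# (hypothesis bits plus data-given-hypothesis bits). Candy setup as above.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
n_limes = 3  # data: three lime candies in a row

def bits(p):
    """Code length in bits for an event of probability p."""
    return float('inf') if p == 0 else -math.log2(p)

post_scores = [pl ** n_limes * pr for pl, pr in zip(p_lime, priors)]
bit_costs = [bits(pl ** n_limes) + bits(pr) for pl, pr in zip(p_lime, priors)]

h_map = max(range(5), key=lambda i: post_scores[i])
h_mdl = min(range(5), key=lambda i: bit_costs[i])
assert h_map == h_mdl  # maximizing probability == minimizing description length
print(h_map, [round(b, 2) for b in bit_costs])
```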
A final simplification is provided by assuming a uniform prior over the space of hypotheses. In that case, MAP learning reduces to choosing an hi that maximizes P(d | hi). This is called a maximum-likelihood (ML) hypothesis, hML. Maximum-likelihood learning is very common in statistics, a discipline in which many researchers distrust the subjective nature of hypothesis priors. It is a reasonable approach when there is no reason to prefer one hypothesis over another a priori, for example, when all hypotheses are equally complex. It provides a good approximation to Bayesian and MAP learning when the data set is large, because the data swamps the prior distribution over hypotheses, but it has problems (as we shall see) with small data sets.
The general task of learning a probability model, given data that are assumed to be generated from that model, is called density estimation. The term applied originally to probability density functions for continuous variables, but is used now for discrete distributions too.

This section covers the simplest case, where we have complete data. Data are complete when each data point contains values for every variable in the probability model being learned. We focus on parameter learning: finding the numerical parameters for a probability model whose structure is fixed. For example, we might be interested in learning the conditional probabilities in a Bayesian network with a given structure. We will also look briefly at the problem of learning structure and at nonparametric density estimation.
[Figure: (a) Bayesian network model for the case of candies with an unknown proportion of cherries and limes: a single node Flavor with parameter θ = P(F = cherry). (b) Model for the case where the wrapper color depends probabilistically on the candy flavor: Flavor (parameter θ) with child Wrapper (parameters θ1 and θ2).]
1. Write down an expression for the likelihood of the data as a function of the parameters.
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.
The trickiest step is usually the last. In our example it was trivial, but we will see that in many cases we need to resort to iterative solution algorithms or other numerical optimization techniques. The example also illustrates a significant problem with maximum-likelihood learning in general: when the data set is small enough that some events have not yet been observed (for instance, no cherry candies), the maximum-likelihood hypothesis assigns zero probability to those events. Various tricks are used to avoid this problem, such as initializing the counts for each event to 1 instead of 0.
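A minimal sketch of the zero-count problem and the add-one fix (the three-lime sample here is made up for illustration):

```python
from collections import Counter

# Candy flavors observed in a small sample; no cherry has been seen yet.
data = ['lime', 'lime', 'lime']
flavors = ['cherry', 'lime']
counts = Counter(data)

# Maximum-likelihood estimate: raw relative frequencies.
ml = {f: counts[f] / len(data) for f in flavors}
# Add-one (Laplace) smoothing: start every count at 1 instead of 0.
smoothed = {f: (counts[f] + 1) / (len(data) + len(flavors)) for f in flavors}

print(ml)        # cherry gets probability 0 -- never recoverable by ML
print(smoothed)  # cherry gets a small nonzero probability instead
```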
Let us look at another example. Suppose this new candy manufacturer wants to give a little hint to the consumer and uses candy wrappers colored red and green. The Wrapper for each candy is selected probabilistically, according to some unknown conditional distribution, depending on the flavor. The corresponding probability model is shown in Figure (b). Notice that it has three parameters: θ, θ1, and θ2. With these parameters, the likelihood of seeing, say, a cherry candy in a green wrapper can be obtained from the standard semantics for Bayesian networks:

    P(Flavor = cherry, Wrapper = green | h_{θ,θ1,θ2})
      = P(Flavor = cherry | h_{θ,θ1,θ2}) P(Wrapper = green | Flavor = cherry, h_{θ,θ1,θ2})
      = θ · (1 − θ1) .
Now we unwrap N candies, of which c are cherries and ℓ are limes. The wrapper counts are as follows: r_c of the cherries have red wrappers and g_c have green, while r_ℓ of the limes have red and g_ℓ have green. The likelihood of the data is given by

    P(d | h_{θ,θ1,θ2}) = θ^c (1 − θ)^ℓ · θ1^{r_c} (1 − θ1)^{g_c} · θ2^{r_ℓ} (1 − θ2)^{g_ℓ} .
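Setting the derivatives of the log likelihood to zero gives the intuitive count-ratio estimates θ = c/N, θ1 = r_c/c, and θ2 = r_ℓ/ℓ. A sketch with made-up counts:

```python
# Sketch: maximum-likelihood estimates for the three-parameter wrapper model.
# Setting the log-likelihood derivatives to zero gives simple count ratios;
# the candy counts below are invented for illustration.
c, ell = 60, 40            # cherries and limes among N = 100 candies
rc, gc = 45, 15            # red / green wrappers among the cherries
rl, gl = 10, 30            # red / green wrappers among the limes

theta = c / (c + ell)      # P(Flavor = cherry)
theta1 = rc / (rc + gc)    # P(Wrapper = red | Flavor = cherry)
theta2 = rl / (rl + gl)    # P(Wrapper = red | Flavor = lime)
print(theta, theta1, theta2)   # 0.6 0.75 0.25
```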
[Figure: The learning curve for naive Bayes learning applied to the restaurant problem; the learning curve for decision-tree learning is shown for comparison. Curves labeled "Decision tree" and "Naive Bayes"; x-axis: training set size.]
general-purpose learning algorithms. Naive Bayes learning scales well to very large problems: with n Boolean attributes, there are just 2n + 1 parameters, and no search is required to find hML, the maximum-likelihood naive Bayes hypothesis. Finally, naive Bayes learning systems have no difficulty with noisy or missing data and can give probabilistic predictions when appropriate.
[Figure: a linear Gaussian model P(y | x) over continuous parent x and child y, together with sample data points.]
Now consider a linear Gaussian model with one continuous parent X and a continuous child Y. As explained earlier, Y has a Gaussian distribution whose mean depends linearly on the value of X and whose standard deviation is fixed. To learn the conditional distribution P(Y | X), we can maximize the conditional likelihood

    P(y | x) = (1 / (σ √(2π))) e^{ −(y − (θ1 x + θ2))² / (2σ²) } .
Here, the parameters are θ1, θ2, and σ. The data are a collection of (xj, yj) pairs. Using the usual methods (see the exercises), we can find the maximum-likelihood values of the parameters. The point here is different. If we consider just the parameters θ1 and θ2 that define the linear relationship between x and y, it becomes clear that maximizing the log likelihood with respect to these parameters is the same as minimizing the numerator (y − (θ1 x + θ2))² in the exponent of the equation above. This is the L2 loss: the squared error between the actual value y and the prediction θ1 x + θ2. This is the quantity minimized by the standard linear regression procedure. Now we can understand why minimizing the sum of squared errors gives the maximum-likelihood straight-line model, provided that the data are generated with Gaussian noise of fixed variance.
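As a sketch, the closed-form least-squares solution for θ1 and θ2 (the familiar regression formulas, which by the argument above are also the maximum-likelihood estimates under fixed-variance Gaussian noise) can be computed directly; the data here are synthetic:

```python
# Sketch: fitting theta1, theta2 by minimizing the sum of squared errors,
# which yields the maximum-likelihood line when the noise is Gaussian with
# fixed variance. Synthetic data, roughly y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Standard least-squares slope and intercept.
theta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
theta2 = mean_y - theta1 * mean_x
print(round(theta1, 3), round(theta2, 3))
```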
[Figure: Examples of the beta[a, b] distribution for different values of [a, b]; panels (a) and (b), x-axis: parameter θ, y-axis: P(θ).]
The candy example in Figure (a) has one parameter, θ: the probability that a randomly selected piece of candy is cherry-flavored. In the Bayesian view, θ is the (unknown) value of a random variable Θ that defines the hypothesis space; the hypothesis prior is just the prior distribution P(Θ). Thus, P(Θ = θ) is the prior probability that the bag has a fraction θ of cherry candies.

If the parameter θ can be any value between 0 and 1, then P(Θ) must be a continuous distribution that is nonzero only between 0 and 1 and that integrates to 1. The uniform density P(θ) = Uniform[0, 1](θ) is one candidate. It turns out that the uniform density is a member of the family of beta distributions. Each beta distribution is defined by two hyperparameters, a and b, such that

    beta[a, b](θ) = α θ^{a−1} (1 − θ)^{b−1} ,

for θ in the range [0, 1]. The normalization constant α, which makes the distribution integrate to 1, depends on a and b (see the exercises). The figure above shows what the distribution looks like for various values of a and b. The mean value of the distribution is a/(a + b), so larger values of a suggest a belief that Θ is closer to 1 than to 0. Larger values of a + b make the distribution more peaked, suggesting greater certainty about the value of Θ. Thus, the beta family provides a useful range of possibilities for the hypothesis prior.
"ESIDES ITS mEXIBILITY THE BETA FAMILY HAS ANOTHER WONDERFUL PROPERTY IF Θ HAS A PRIOR
beta[a, b] THEN AFTER A DATA POINT IS OBSERVED THE POSTERIOR DISTRIBUTION FOR Θ IS ALSO A BETA
DISTRIBUTION )N OTHER WORDS beta IS CLOSED UNDER UPDATE 4HE BETA FAMILY IS CALLED THE
CONJUGATE PRIOR FRQMXJDWH SULRU FOR THE FAMILY OF DISTRIBUTIONS FOR A "OOLEAN VARIABLE ,ETS SEE HOW THIS
WORKS 3UPPOSE WE OBSERVE A CHERRY CANDY THEN WE HAVE
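Conjugacy makes the bookkeeping trivial: updating on a Boolean observation just increments one hyperparameter. A sketch:

```python
# Sketch of conjugate updating: a beta[a, b] prior on theta stays a beta
# distribution after each Boolean observation -- just bump a or b.
def update(a, b, observation):
    """Return the posterior hyperparameters after seeing one candy."""
    return (a + 1, b) if observation == 'cherry' else (a, b + 1)

a, b = 1, 1                     # beta[1, 1] is the uniform prior
for candy in ['cherry', 'cherry', 'lime']:
    a, b = update(a, b, candy)

print((a, b), a / (a + b))      # beta[3, 2]; posterior mean a/(a+b) = 0.6
```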
3. They are called hyperparameters because they parameterize a distribution over θ, which is itself a parameter.
4. Other conjugate priors include the Dirichlet family for the parameters of a discrete multivalued distribution and the Normal–Wishart family for the parameters of a Gaussian distribution. See Bernardo and Smith.
[Figure: A Bayesian network that corresponds to a Bayesian learning process. Posterior distributions for the parameter variables Θ, Θ1, and Θ2 can be inferred from their prior distributions and the evidence in the Flavor_i and Wrapper_i variables.]
With this assumption, each parameter can have its own beta distribution that is updated separately as data arrive. The figure above shows how we can incorporate the hypothesis prior and any data into one Bayesian network. The nodes Θ, Θ1, Θ2 have no parents. But each time we make an observation of a wrapper and corresponding flavor of a piece of candy, we add a node Flavor_i, which is dependent on the flavor parameter Θ:

    P(Flavor_i = cherry | Θ = θ) = θ .

We also add a node Wrapper_i, which is dependent on Θ1 and Θ2:

    P(Wrapper_i = red | Flavor_i = cherry, Θ1 = θ1) = θ1
    P(Wrapper_i = red | Flavor_i = lime, Θ2 = θ2) = θ2 .
Now the entire Bayesian learning process can be formulated as an inference problem. We add new evidence nodes, then query the unknown nodes (in this case, Θ, Θ1, Θ2). This formulation of learning and prediction makes it clear that Bayesian learning requires no extra "principles of learning." Furthermore, there is, in essence, just one learning algorithm: the inference algorithm for Bayesian networks. Of course, the nature of these networks is somewhat different from those seen earlier, because of the potentially huge number of evidence variables representing the training set and the prevalence of continuous-valued parameter variables.
and we can check in the data that the same equation holds between the corresponding conditional frequencies. But even if the structure describes the true causal nature of the domain, statistical fluctuations in the data set mean that the equation will never be satisfied exactly, so we need to perform a suitable statistical test to see if there is sufficient evidence that the independence hypothesis is violated. The complexity of the resulting network will depend on the threshold used for this test: the stricter the independence test, the more links will be added, and the greater the danger of overfitting.
An approach more consistent with the ideas in this chapter is to assess the degree to which the proposed model explains the data, in a probabilistic sense. We must be careful how we measure this, however. If we just try to find the maximum-likelihood hypothesis, we will end up with a fully connected network, because adding more parents to a node cannot decrease the likelihood (see the exercises). We are forced to penalize model complexity in some way. The MAP (or MDL) approach simply subtracts a penalty from the likelihood of each structure (after parameter tuning) before comparing different structures. The Bayesian approach places a joint prior over structures and parameters. There are usually far too many structures to sum over (superexponential in the number of variables), so most practitioners use MCMC to sample over structures.
Penalizing complexity, whether by MAP or Bayesian methods, introduces an important connection between the optimal structure and the nature of the representation for the conditional distributions in the network. With tabular distributions, the complexity penalty for a node's distribution grows exponentially with the number of parents, but with, say, noisy-OR distributions, it grows only linearly. This means that learning with noisy-OR (or other compactly parameterized) models tends to produce learned structures with more parents than does learning with tabular distributions.
[Figure: (a) A 3D plot of a mixture of Gaussians (vertical axis: density). (b) A sample of points from the mixture, together with two query points (small squares) and their 10-nearest-neighborhoods (medium and large circles).]
[Figure: Density estimation using k-nearest-neighbors, applied to the data above, for k = 3, 10, and 40, respectively: k = 3 is too spiky, k = 40 is too smooth, and k = 10 is just about right. The best value for k can be chosen by cross-validation.]
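A one-dimensional sketch of the idea: the density at a query point x is estimated as k divided by N times the volume (here, an interval length) needed to enclose the k nearest neighbors. The data are made up:

```python
# Sketch of k-nearest-neighbor density estimation in one dimension:
# p(x) ~ k / (N * volume of the smallest interval around x holding k points).
def knn_density(x, data, k):
    dists = sorted(abs(x - xj) for xj in data)
    radius = dists[k - 1]          # distance to the k-th nearest neighbor
    volume = 2 * radius            # length of the enclosing interval in 1-D
    return k / (len(data) * volume)

data = [0.1, 0.12, 0.15, 0.2, 0.8, 0.85]
print(knn_density(0.13, data, k=3))   # dense region -> high estimate
print(knn_density(0.5, data, k=3))    # sparse region -> low estimate
```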
[Figure: Kernel density estimation for the data above, using Gaussian kernels with w = 0.02, 0.07, and 0.20, respectively; w = 0.07 is about right.]
Another possibility is to use kernel functions, as we did for locally weighted regression. To apply a kernel model to density estimation, assume that each data point generates its own little density function, using a Gaussian kernel. The estimated density at a query point x is then the average density as given by each kernel function:

    P(x) = (1/N) Σ_{j=1}^{N} K(x, x_j) .

We will assume spherical Gaussians with standard deviation w along each axis:

    K(x, x_j) = (1 / (w² · 2π)^{d/2}) e^{ −D(x, x_j)² / (2w²) } ,

where d is the number of dimensions in x and D is the Euclidean distance function. We still have the problem of choosing a suitable value for the kernel width w; the figure above shows values that are too small, just right, and too large. A good value of w can be chosen by using cross-validation.
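The two formulas above translate directly into code; this sketch uses a tiny made-up data set:

```python
import math

# A minimal kernel density estimator with spherical Gaussian kernels of
# width w, following the formulas above (data and w are illustrative).
def kernel(x, xj, w, d):
    dist2 = sum((a - b) ** 2 for a, b in zip(x, xj))   # squared Euclidean D^2
    return math.exp(-dist2 / (2 * w * w)) / (w * math.sqrt(2 * math.pi)) ** d

def density(x, data, w):
    """Average of the per-point kernel densities at query point x."""
    d = len(x)
    return sum(kernel(x, xj, w, d) for xj in data) / len(data)

data = [(0.1,), (0.15,), (0.2,), (0.8,)]
print(density((0.15,), data, w=0.07))   # high: near a cluster of points
print(density((0.5,), data, w=0.07))    # low: far from all points
```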
The preceding section dealt with the fully observable case. Many real-world problems have hidden variables (sometimes called latent variables), which are not observable in the data that are available for learning. For example, medical records often include the observed symptoms, the physician's diagnosis, the treatment applied, and perhaps the outcome of the treatment, but they seldom contain a direct observation of the disease itself. (Note that the diagnosis is not the disease; it is a causal consequence of the observed symptoms, which are in turn caused by the disease.) One might ask, "If the disease is not observed, why not construct a model without it?" The answer appears in the figure below, which shows a small, fictitious diagnostic model for heart disease. There are three observable predisposing factors and three observable symptoms (which are too depressing to name). Assume that each variable has three possible values (e.g., none, moderate, and severe). Removing the hidden variable from the network in (a) yields the network in (b); the total number of parameters increases from 78 to 708. Thus, latent variables can dramatically reduce the number of parameters required to specify a Bayesian network. This, in turn, can dramatically reduce the amount of data needed to learn the parameters.
Hidden variables are important, but they do complicate the learning problem. In panel (a) of the figure below, for example, it is not obvious how to learn the conditional distribution for HeartDisease, given its parents, because we do not know the value of HeartDisease in each case; the same problem arises in learning the distributions for the symptoms. This section describes an algorithm called expectation–maximization, or EM, that solves this problem in a very general way. We will show three examples and then provide a general description. The algorithm seems like magic at first, but once the intuition has been developed, one can find applications for EM in a huge range of learning problems.
Learning with Hidden Variables: The EM Algorithm
[Figure: (a) A simple diagnostic network for heart disease, which is assumed to be a hidden variable: observable nodes Smoking, Diet, and Exercise are parents of HeartDisease, which in turn is the parent of Symptom1, Symptom2, and Symptom3. Each variable has three possible values and is labeled with the number of independent parameters in its conditional distribution; the total number is 78. (b) The equivalent network with HeartDisease removed. Note that the symptom variables are no longer conditionally independent given their parents. This network requires 708 parameters.]
[Figure: (a) A Gaussian mixture model with three components. (b) Data points sampled from the model in (a). (c) The model reconstructed by EM from the data in (b).]
where N is the total number of data points. The E-step, or expectation step, can be viewed as computing the expected values p_ij of the hidden indicator variables Z_ij, where Z_ij is 1 if datum x_j was generated by the i-th component and 0 otherwise. The M-step, or maximization step, finds the new values of the parameters that maximize the log likelihood of the data, given the expected values of the hidden indicator variables.

The final model that EM learns when it is applied to the data in panel (a) is shown in panel (c); it is virtually indistinguishable from the original model from which the data were generated. The figure below plots the log likelihood of the data according to the current model as EM progresses.
There are two points to notice. First, the log likelihood for the final learned model slightly exceeds that of the original model, from which the data were generated. This might seem surprising, but it simply reflects the fact that the data were generated randomly and might not provide an exact reflection of the underlying model. The second point is that EM increases the log likelihood of the data at every iteration. This fact can be proved in general. Furthermore, under certain conditions (that hold in most cases), EM can be proven to reach a local maximum in likelihood. (In rare cases, it could reach a saddle point or even a local minimum.) In this sense, EM resembles a gradient-based hill-climbing algorithm, but notice that it has no "step size" parameter.
[Figure: Graphs showing the log likelihood of the data, L, as a function of the EM iteration. The horizontal line shows the log likelihood according to the true model. (a) Graph for the Gaussian mixture model. (b) Graph for the Bayesian network (candy mixture) example.]
[Figure: (a) A mixture model for candy. The proportions of different flavors, wrappers, and presence of holes depend on the bag, which is not observed: Bag is the parent of the observable attributes. (b) Bayesian network for a Gaussian mixture. The mean and covariance of the observable variables X depend on the component C.]
Things do not always go as well as the figure might suggest, however. It can happen, for example, that one Gaussian component shrinks so that it covers just a single data point. Then its variance will go to zero and its likelihood will go to infinity. Another problem is that two components can "merge," acquiring identical means and variances and sharing their data points. These kinds of degenerate local maxima are serious problems, especially in high dimensions. One solution is to place priors on the model parameters and to apply the MAP version of EM. Another is to restart a component with new random parameters if it gets too small or too close to another component. Sensible initialization also helps.
The parameters are learned by observing candies drawn from the mixture. Let us work through an iteration of EM for this problem. First, let's look at the data. We generated 1000 samples from a model whose true parameters are as follows:

    θ = 0.5, θF1 = θW1 = θH1 = 0.8, θF2 = θW2 = θH2 = 0.3 .

That is, the candies are equally likely to come from either bag; the first is mostly cherries with red wrappers and holes; the second is mostly limes with green wrappers and no holes. The counts for the eight possible kinds of candy are as follows:
                      W = red           W = green
                   H = 1   H = 0     H = 1   H = 0
    F = cherry       273      93       104      90
    F = lime          79     100        94     167
Applying this formula to, say, the 273 red-wrapped cherry candies with holes, we get a contribution of

    (273/1000) · θ^(0) θF1^(0) θW1^(0) θH1^(0) / (θ^(0) θF1^(0) θW1^(0) θH1^(0) + (1 − θ^(0)) θF2^(0) θW2^(0) θH2^(0)) ≈ 0.22797 .

Continuing with the other seven kinds of candy in the table of counts, we obtain θ^(1) = 0.6124.
Now let us consider the other parameters, such as θF1. In the fully observable case, we would estimate this directly from the observed counts of cherry and lime candies from bag 1. The expected count of cherry candies from bag 1 is given by

    Σ_{j : Flavor_j = cherry} P(Bag = 1 | Flavor_j = cherry, wrapper_j, holes_j) .
5. It is better in practice to choose them randomly, to avoid local maxima due to symmetry.
Again, these probabilities can be calculated by any Bayes net algorithm. Completing this process, we obtain the new values of all the parameters:

    θ^(1) = 0.6124,  θF1^(1) = 0.6684,  θW1^(1) = 0.6483,  θH1^(1) = 0.6558,
    θF2^(1) = 0.3887,  θW2^(1) = 0.3817,  θH2^(1) = 0.3827 .
The log likelihood of the data increases from about −2044 initially to about −2021 after the first iteration, as shown in panel (b) of the log-likelihood figure. That is, the update improves the likelihood itself by a factor of about e^23 ≈ 10^10. By the tenth iteration, the learned model is a better fit than the original model (L = −1982.214). Thereafter, progress becomes very slow. This is not uncommon with EM, and many practical systems combine EM with a gradient-based algorithm, such as Newton–Raphson, for the last phase of learning.
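The whole iteration fits in a short sketch. The counts are those in the table above; the initial parameters (θ and the bag-1 parameters at 0.6, the bag-2 parameters at 0.4) are the values implied by the 0.22797 computation:

```python
# One EM iteration for the two-bag candy mixture, using the counts from the
# table above. Initial parameters: theta and all bag-1 parameters 0.6, all
# bag-2 parameters 0.4 (consistent with the 0.22797 contribution computed
# in the text).
counts = {  # (flavor, wrapper, holes) -> count, out of N = 1000 candies
    ('cherry', 'red', 1): 273, ('cherry', 'red', 0): 93,
    ('cherry', 'green', 1): 104, ('cherry', 'green', 0): 90,
    ('lime', 'red', 1): 79, ('lime', 'red', 0): 100,
    ('lime', 'green', 1): 94, ('lime', 'green', 0): 167,
}
theta, (f1, w1, h1), (f2, w2, h2) = 0.6, (0.6, 0.6, 0.6), (0.4, 0.4, 0.4)

def lik(f, w, h, pf, pw, ph):
    """P(candy | bag) for one bag's parameters (pf, pw, ph)."""
    return ((pf if f == 'cherry' else 1 - pf) *
            (pw if w == 'red' else 1 - pw) *
            (ph if h == 1 else 1 - ph))

# E-step: posterior P(Bag = 1 | candy) for each of the eight candy kinds.
post1 = {c: theta * lik(*c, f1, w1, h1) /
            (theta * lik(*c, f1, w1, h1) + (1 - theta) * lik(*c, f2, w2, h2))
         for c in counts}

# M-step: normalized expected counts give the new parameter values.
n1 = sum(counts[c] * post1[c] for c in counts)   # expected bag-1 total
theta_new = n1 / 1000
f1_new = sum(counts[c] * post1[c] for c in counts if c[0] == 'cherry') / n1
print(round(theta_new, 4), round(f1_new, 4))     # 0.6124 0.6684
```

The printed values match θ^(1) and θF1^(1) above; the remaining parameters follow from the analogous expected-count ratios.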
The general lesson from this example is that the parameter updates for Bayesian network learning with hidden variables are directly available from the results of inference on each example. Moreover, only local posterior probabilities are needed for each parameter. Here, "local" means that the CPT for each variable X_i can be learned from posterior probabilities involving just X_i and its parents U_i. Defining θ_ijk to be the CPT parameter P(X_i = x_ij | U_i = u_ik), the update is given by the normalized expected counts:

    θ_ijk ← N̂(X_i = x_ij, U_i = u_ik) / N̂(U_i = u_ik) .

The expected counts are obtained by summing over the examples, computing the probabilities P(X_i = x_ij, U_i = u_ik) for each by using any Bayes net inference algorithm. For the exact algorithms, including variable elimination, all these probabilities are obtainable directly as a by-product of standard inference, with no need for extra computations specific to learning. Moreover, the information needed for learning is available locally for each parameter.
[Figure: An unrolled dynamic Bayesian network that represents a hidden Markov model: a chain of Rain_t state variables with their transition tables.]
This requires smoothing rather than filtering; that is, we need to pay attention to subsequent evidence in estimating the probability that a particular transition occurred. (The evidence in a murder case is usually obtained after the crime, i.e., the transition from state i to state j, has taken place.)
In an earlier section, we discussed the problem of learning Bayes net structures with complete data. When unobserved variables may be influencing the data that are observed, things get more difficult. In the simplest case, a human expert might tell the learning algorithm that certain hidden variables exist, leaving it to the algorithm to find a place for them in the network structure. For example, an algorithm might try to learn the structure shown in panel (a) of the heart-disease figure, given the information that HeartDisease (a three-valued variable) should be included in the model. As in the complete-data case, the overall algorithm has an outer loop that searches over structures and an inner loop that fits the network parameters given the structure.
If the learning algorithm is not told which hidden variables exist, then there are two choices: either pretend that the data is really complete, which may force the algorithm to learn a parameter-intensive model such as the one in part (b) of the figure mentioned above, or invent new hidden variables in order to simplify the model. The latter approach can be implemented by including new modification choices in the structure search: in addition to modifying links, the algorithm can add or delete a hidden variable or change its arity. Of course, the algorithm will not know that the new variable it has invented is called HeartDisease, nor will it have meaningful names for the values. Fortunately, newly invented hidden variables will usually be connected to preexisting variables, so a human expert can often inspect the local conditional distributions involving the new variable and ascertain its meaning.
As in the complete-data case, pure maximum-likelihood structure learning will result in a completely connected network (moreover, one with no hidden variables), so some form of complexity penalty is required. We can also apply MCMC to sample many possible network structures, thereby approximating Bayesian learning. For example, we can learn mixtures of Gaussians with an unknown number of components by sampling over the number; the approximate posterior distribution for the number of Gaussians is given by the sampling frequencies of the MCMC process.
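One common form of complexity penalty (an illustration, not the only option) is the BIC score, which subtracts (log N)/2 per free parameter from the maximum-likelihood log-likelihood. Below is a minimal sketch for discrete networks; the data and structure encodings are hypothetical conventions chosen for the example.

```python
import math
from collections import Counter

def bic_score(data, structure, arity):
    """BIC score for a discrete Bayes net structure.

    `data`: list of dicts mapping variable name -> value.
    `structure`: dict mapping variable name -> list of parent names.
    `arity`: dict mapping variable name -> number of values.
    The log-likelihood term uses maximum-likelihood CPT estimates.
    """
    N = len(data)
    score = 0.0
    for var, parents in structure.items():
        counts = Counter((tuple(ex[p] for p in parents), ex[var]) for ex in data)
        parent_counts = Counter(tuple(ex[p] for p in parents) for ex in data)
        for (u, x), n in counts.items():
            score += n * math.log(n / parent_counts[u])    # ML log-likelihood
        n_params = (arity[var] - 1) * math.prod(arity[p] for p in parents)
        score -= 0.5 * math.log(N) * n_params              # complexity penalty
    return score
```

Under this score, adding a link helps only if the likelihood gain outweighs the penalty for the extra CPT entries, which is exactly the trade-off the text calls for.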
For the complete-data case, the inner loop to learn the parameters is very fast, just a matter of extracting conditional frequencies from the data set. When there are hidden variables, the inner loop may involve many iterations of EM or a gradient-based algorithm, and each iteration involves the calculation of posteriors in a Bayes net, which is itself an NP-hard problem. To date, this approach has proved impractical for learning complex models. One possible improvement is the so-called structural EM algorithm, which operates in much the same way as ordinary parametric EM except that the algorithm can update the structure as well as the parameters. Just as ordinary EM uses the current parameters to compute the expected counts in the E-step and then applies those counts in the M-step to choose new parameters, structural EM uses the current structure to compute expected counts and then applies those counts in the M-step to evaluate the likelihood for potential new structures. (This contrasts with the outer-loop/inner-loop method, which computes new expected counts for each potential structure.) In this way, structural EM may make several structural alterations to the network without once recomputing the expected counts, and is capable of learning nontrivial Bayes net structures. Nonetheless, much work remains to be done before we can say that the structure learning problem is solved.
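The control flow of structural EM can be summarized in a short skeleton. Everything here is schematic: e_step, score, m_step, and neighbors are caller-supplied stand-ins for real inference, scoring, parameter-estimation, and structure-modification routines. The one essential point is visible in the loop: expected counts are computed once per iteration and reused to evaluate every candidate structure.

```python
def structural_em(data, init_structure, init_params,
                  neighbors, e_step, score, m_step, iters=10):
    """Skeleton of the structural EM loop (control flow only).

    e_step(structure, params, data) -> expected counts
    score(structure, counts)        -> expected score of a structure
    m_step(structure, counts)       -> new parameters
    neighbors(structure)            -> candidate structural modifications
    """
    structure, params = init_structure, init_params
    for _ in range(iters):
        counts = e_step(structure, params, data)   # inference runs once here
        candidates = [structure] + neighbors(structure)
        # Reuse the same counts to score every candidate structure.
        structure = max(candidates, key=lambda s: score(s, counts))
        params = m_step(structure, counts)         # M-step for the winner
    return structure, params
```

In a real implementation the counts would be the expected sufficient statistics from Bayes net inference, and score would be a penalized likelihood such as BIC evaluated on those counts.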
SUMMARY
Statistical learning methods range from simple calculation of averages to the construction of complex models such as Bayesian networks. They have applications throughout computer science, engineering, computational biology, neuroscience, psychology, and physics. This chapter has presented some of the basic ideas and given a flavor of the mathematical underpinnings. The main points are as follows:
• Bayesian learning methods formulate learning as a form of probabilistic inference, using the observations to update a prior distribution over hypotheses. This approach provides a good way to implement Ockham's razor, but quickly becomes intractable for complex hypothesis spaces.
• Maximum a posteriori (MAP) learning selects a single most likely hypothesis given the data. The hypothesis prior is still used, and the method is often more tractable than full Bayesian learning.
• Maximum-likelihood learning simply selects the hypothesis that maximizes the likelihood of the data; it is equivalent to MAP learning with a uniform prior. In simple cases such as linear regression and fully observable Bayesian networks, maximum-likelihood solutions can be found easily in closed form. Naive Bayes learning is a particularly effective technique that scales well.
• When some variables are hidden, local maximum-likelihood solutions can be found using the EM algorithm. Applications include clustering using mixtures of Gaussians, learning Bayesian networks, and learning hidden Markov models.
• Learning the structure of Bayesian networks is an example of model selection. This usually involves a discrete search in the space of structures. Some method is required for trading off model complexity against degree of fit.
• Nonparametric models represent a distribution using the collection of data points. Thus, the number of parameters grows with the training set. Nearest-neighbors methods look at the examples nearest to the point in question, whereas kernel methods form a distance-weighted combination of all the examples.
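The first three points can be illustrated on the chapter's candy example. The sketch below assumes the five bag types have lime fractions 0, 1/4, 1/2, 3/4, and 1, with the prior (0.1, 0.2, 0.4, 0.2, 0.1) used in the chapter's running example; it compares the full Bayesian prediction with the MAP and ML predictions after observing a sequence of lime candies.

```python
# Five candy-bag hypotheses: fraction of lime candies in each bag type,
# and the prior over bag types (values from the chapter's running example).
LIME_FRACTION = [0.0, 0.25, 0.5, 0.75, 1.0]
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]

def likelihoods(n_lime, n):
    """P(d | hi) for n i.i.d. candies, n_lime of them lime."""
    return [f ** n_lime * (1 - f) ** (n - n_lime) for f in LIME_FRACTION]

def predict_lime(n_lime, n):
    """Return P(next candy = lime) under Bayesian, MAP, and ML prediction."""
    lik = likelihoods(n_lime, n)
    weights = [p * l for p, l in zip(PRIOR, lik)]
    posterior = [w / sum(weights) for w in weights]
    bayes = sum(p * f for p, f in zip(posterior, LIME_FRACTION))
    h_map = max(range(5), key=lambda i: posterior[i])   # most probable hypothesis
    h_ml = max(range(5), key=lambda i: lik[i])          # highest likelihood
    return bayes, LIME_FRACTION[h_map], LIME_FRACTION[h_ml]
```

After ten limes in a row, MAP and ML both predict lime with certainty, while the Bayesian prediction still hedges slightly because the three-quarters-lime hypothesis retains some posterior mass.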
Statistical learning continues to be a very active area of research. Enormous strides have been made in both theory and practice, to the point where it is possible to learn almost any model for which exact or approximate inference is feasible.
BIBLIOGRAPHICAL AND HISTORICAL NOTES

The application of statistical learning techniques in AI was an active area of research in the early years (see Duda and Hart) but became separated from mainstream AI as the latter field concentrated on symbolic methods. A resurgence of interest occurred shortly after the introduction of Bayesian network models in the late 1980s; at roughly the same time, a statistical view of neural network learning began to emerge. In the late 1990s, there was a noticeable convergence of interests in machine learning, statistics, and neural networks, centered on methods for creating large probabilistic models from data.
The naive Bayes model is one of the oldest and simplest forms of Bayesian network, dating back to the 1950s. Its origins were mentioned in an earlier chapter. Its surprising success is partially explained by Domingos and Pazzani. A boosted form of naive Bayes learning won the first KDD Cup data mining competition (Elkan). Heckerman gives an excellent introduction to the general problem of Bayes net learning. Bayesian parameter learning with Dirichlet priors for Bayesian networks was discussed by Spiegelhalter et al. The BUGS software package (Gilks et al.) incorporates many of these ideas and provides a very powerful tool for formulating and learning complex probability models. The first algorithms for learning Bayes net structures used conditional independence tests (Pearl; Pearl and Verma). Spirtes et al. developed a comprehensive approach, embodied in the TETRAD package, for Bayes net learning. Algorithmic improvements since then led to a clear victory in the KDD Cup data mining competition for a Bayes net learning method (Cheng et al.); the specific task here was a bioinformatics problem with a very large number of features. A structure-learning approach based on maximizing likelihood was developed by Cooper and Herskovits and improved by Heckerman et al. Several algorithmic advances since that time have led to quite respectable performance in the complete-data case (Moore and Wong; Teyssier and Koller). One important component is an efficient data structure, the AD-tree, for caching counts over all possible combinations of variables and values (Moore and Lee). Friedman and Goldszmidt pointed out the influence of the representation of local conditional distributions on the learned structure.
The general problem of learning probability models with hidden variables and missing data was addressed by Hartley, who described the general idea of what was later called EM and gave several examples. Further impetus came from the Baum-Welch algorithm for HMM learning (Baum and Petrie), which is a special case of EM. The paper by Dempster, Laird, and Rubin, which presented the EM algorithm in general form and analyzed its convergence, is one of the most cited papers in both computer science and statistics. (Dempster himself views EM as a schema rather than an algorithm, since a good deal of mathematical work may be required before it can be applied to a new family of distributions.) McLachlan and Krishnan devote an entire book to the algorithm and its properties. The specific problem of learning mixture models, including mixtures of Gaussians, is covered by Titterington et al. Within AI, the first successful system that used EM for mixture modeling was AUTOCLASS (Cheeseman et al.; Cheeseman and Stutz). AUTOCLASS has been applied to a number of real-world scientific classification tasks, including the discovery of new types of stars from spectral data (Goebel et al.) and new classes of proteins and introns in DNA/protein sequence databases (Hunter and States).
For maximum-likelihood parameter learning in Bayes nets with hidden variables, EM and gradient-based methods were introduced at around the same time by Lauritzen, Russell et al., and Binder et al. The structural EM algorithm was developed by Friedman and applied to maximum-likelihood learning of Bayes net structures with latent variables. Friedman and Koller describe Bayesian structure learning.
The ability to learn the structure of Bayesian networks is closely connected to the issue of recovering causal information from data. That is, is it possible to learn Bayes nets in such a way that the recovered network structure indicates real causal influences? For many years, statisticians avoided this question, believing that observational data (as opposed to data generated from experimental trials) could yield only correlational information; after all, any two variables that appear related might in fact be influenced by a third, unknown causal factor rather than influencing each other directly. Pearl has presented convincing arguments to the contrary, showing that there are in fact many cases where causality can be ascertained, and developing the causal network formalism to express causes and the effects of intervention, as well as ordinary conditional probabilities.
Nonparametric density estimation, also called Parzen window density estimation, was investigated initially by Rosenblatt and Parzen. Since that time, a huge literature has developed investigating the properties of various estimators. Devroye gives a thorough introduction. There is also a rapidly growing literature on nonparametric Bayesian methods, originating with the seminal work of Ferguson on the Dirichlet process, which can be thought of as a distribution over Dirichlet distributions. These methods are particularly useful for mixtures with unknown numbers of components. Ghahramani and Jordan provide useful tutorials on the many applications of these ideas to statistical learning. The text by Rasmussen and Williams covers the Gaussian process, which gives a way of defining prior distributions over the space of continuous functions.
The material in this chapter brings together work from the fields of statistics and pattern recognition, so the story has been told many times in many ways. Good texts on Bayesian statistics include those by DeGroot, Berger, and Gelman et al. Bishop and Hastie et al. provide an excellent introduction to statistical machine learning. For pattern classification, the classic text for many years has been Duda and Hart, now updated (Duda et al.). The annual NIPS (Neural Information Processing Conference) conference, whose proceedings are published as the series Advances in Neural Information Processing Systems, is now dominated by Bayesian papers. Papers on learning Bayesian networks also appear in the Uncertainty in AI and Machine Learning conferences and in several statistics conferences. Journals specific to neural networks include Neural Computation, Neural Networks, and the IEEE Transactions on Neural Networks; specifically Bayesian venues include the Valencia International Meetings on Bayesian Statistics and the journal Bayesian Analysis.
EXERCISES
The data used for the earlier figure can be viewed as being generated by h5. For each of the other four hypotheses, generate a data set of length 100 and plot the corresponding graphs for P(hi | d1, ..., dN) and P(DN+1 = lime | d1, ..., dN). Comment on your results.
Repeat the previous exercise, this time plotting the values of P(DN+1 = lime | hMAP) and P(DN+1 = lime | hML).
Suppose that Ann's utilities for cherry and lime candies are cA and ℓA, whereas Bob's utilities are cB and ℓB. (But once Ann has unwrapped a piece of candy, Bob won't buy it.) Presumably, if Bob likes lime candies much more than Ann, it would be wise for Ann to sell her bag of candies once she is sufficiently sure of its lime content. On the other hand, if Ann unwraps too many candies in the process, the bag will be worth less. Discuss the problem of determining the optimal point at which to sell the bag. Determine the expected utility of the optimal procedure, given the prior distribution from the beginning of the chapter.
Two statisticians go to the doctor and are both given the same prognosis: a 40% chance that the problem is the deadly disease A, and a 60% chance of the fatal disease B. Fortunately, there are anti-A and anti-B drugs that are inexpensive, effective, and free of side effects. The statisticians have the choice of taking one drug, both, or neither. What will the first statistician (an avid Bayesian) do? How about the second statistician, who always uses the maximum-likelihood hypothesis?

The doctor does some research and discovers that disease B actually comes in two versions, dextro-B and levo-B, which are equally likely and equally treatable by the anti-B drug. Now that there are three hypotheses, what will the two statisticians do?
Explain how to apply the boosting method of an earlier chapter to naive Bayes learning. Test the performance of the resulting algorithm on the restaurant learning problem.
Consider N data points (xj, yj), where the yj's are generated from the xj's according to the linear Gaussian model given earlier in the chapter. Find the values of θ1, θ2, and σ that maximize the conditional log likelihood of the data.
Consider the noisy-OR model for fever described earlier in the book. Explain how to apply maximum-likelihood learning to fit the parameters of such a model to a set of complete data. (Hint: use the chain rule for partial derivatives.)
This exercise investigates properties of the Beta distribution defined earlier in the chapter.
(a) By integrating over the range [0, 1], show that the normalization constant for the distribution beta[a, b] is given by α = Γ(a + b)/(Γ(a)Γ(b)), where Γ(x) is the Gamma function, defined by Γ(x + 1) = x · Γ(x) and Γ(1) = 1. (For integer x, Γ(x + 1) = x!.)
(b) Show that the mean is a/(a + b).
(c) Find the mode(s) (the most likely value(s) of θ).
(d) Describe the distribution beta[ε, ε] for very small ε. What happens as such a distribution is updated?
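The claims in parts (a) and (b) can be sanity-checked numerically. This is a check, not a solution to the exercise; the midpoint-rule integrator and the test values a = 3, b = 2 are arbitrary choices.

```python
import math

def beta_pdf(theta, a, b):
    """beta[a, b] density, using the normalization constant from part (a)."""
    alpha = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return alpha * theta ** (a - 1) * (1 - theta) ** (b - 1)

def integrate(f, n=20000):
    """Crude midpoint-rule integration of f over [0, 1]."""
    h = 1.0 / n
    return sum(f((k + 0.5) * h) for k in range(n)) * h

a, b = 3.0, 2.0
total = integrate(lambda t: beta_pdf(t, a, b))     # should be close to 1
mean = integrate(lambda t: t * beta_pdf(t, a, b))  # should be close to a/(a+b)
```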
Consider an arbitrary Bayesian network, a complete data set for that network, and the likelihood for the data set according to the network. Give a simple proof that the likelihood of the data cannot decrease if we add a new link to the network and recompute the maximum-likelihood parameter values.
Consider the application of EM to learn the parameters for the network in part (a) of the earlier figure, given the true parameters in the earlier equation.
(a) Explain why the EM algorithm would not work if there were just two attributes in the model rather than three.
(b) Show the calculations for the first iteration of EM, starting from the initial parameter values given earlier.
(c) What happens if we start with all the parameters set to the same value p? (Hint: you may find it helpful to investigate this empirically before deriving the general result.)
(d) Write out an expression for the log likelihood of the tabulated candy data in terms of the parameters, calculate the partial derivatives with respect to each parameter, and investigate the nature of the fixed point reached in part (c).