Machine Learning Notes


Chapter Topics

•Hypothesis Testing Methodology


•Z Test for the Mean (σ Known)
• p-Value Approach to Hypothesis Testing
•Connection to Confidence Interval Estimation
•One Tail Test
• t Test of Hypothesis for the Mean
•Z Test of Hypothesis for the Proportion
© 1999 Prentice-Hall, Inc. Chap. 8 - 1
Review of ANOVA and linear
regression
Review of simple ANOVA
ANOVA
for comparing means between
more than 2 groups
Hypotheses of One-Way
ANOVA
 H0 : μ1 = μ2 = μ3 = ⋯ = μc
 All population means are equal
 i.e., no treatment effect (no variation in means among
groups)


H1 : Not all of the population means are the same
 At least one population mean is different
 i.e., there is a treatment effect
 Does not mean that all population means are different
(some pairs may be the same)
The F-distribution
 A ratio of variances follows an F-distribution:

   σ²between / σ²within ~ F(n, m)

 The F-test tests the hypothesis that two variances are equal.
 F will be close to 1 if the sample variances are equal.

   H0: σ²between = σ²within
   Ha: σ²between ≠ σ²within
How to calculate ANOVA’s by
hand…
Treatment 1 Treatment 2 Treatment 3 Treatment 4
y11 y21 y31 y41
y12 y22 y32 y42 n=10 obs./group
y13 y23 y33 y43
y14 y24 y34 y44 k=4 groups
y15 y25 y35 y45
y16 y26 y36 y46
y17 y27 y37 y47
y18 y28 y38 y48
y19 y29 y39 y49
y110 y210 y310 y410
The group means:

  ȳ1· = Σ_{j=1}^{10} y1j / 10,   ȳ2· = Σ_{j=1}^{10} y2j / 10,   ȳ3· = Σ_{j=1}^{10} y3j / 10,   ȳ4· = Σ_{j=1}^{10} y4j / 10

The (within-)group variances:

  s_i² = Σ_{j=1}^{10} (y_ij − ȳ_i·)² / (10 − 1),   for i = 1, 2, 3, 4
Sum of Squares Within (SSW),
or Sum of Squares Error (SSE)
Each group's numerator, Σ_{j=1}^{10} (y_ij − ȳ_i·)², comes from the (within-)group variances; summing them over the four groups gives

  Σ_{j=1}^{10} (y1j − ȳ1·)² + Σ_{j=1}^{10} (y2j − ȳ2·)² + Σ_{j=1}^{10} (y3j − ȳ3·)² + Σ_{j=1}^{10} (y4j − ȳ4·)²

  = Σ_{i=1}^{4} Σ_{j=1}^{10} (y_ij − ȳ_i·)² = Sum of Squares Within (SSW), or SSE (for chance error)
Sum of Squares Between (SSB), or
Sum of Squares Regression (SSR)
Overall mean of all 40 observations ("grand mean"):

  ȳ·· = Σ_{i=1}^{4} Σ_{j=1}^{10} y_ij / 40

  SSB = 10 × Σ_{i=1}^{4} (ȳ_i· − ȳ··)²

SSB measures the variability of the group means compared to the grand mean (the variability due to the treatment).
Total Sum of Squares (SST)

Total sum of squares (TSS):

  TSS = Σ_{i=1}^{4} Σ_{j=1}^{10} (y_ij − ȳ··)²

The squared difference of every observation from the overall mean (this is the numerator of the variance of Y!).
Partitioning of Variance

  Σ_{i=1}^{4} Σ_{j=1}^{10} (y_ij − ȳ_i·)²  +  10 × Σ_{i=1}^{4} (ȳ_i· − ȳ··)²  =  Σ_{i=1}^{4} Σ_{j=1}^{10} (y_ij − ȳ··)²

  SSW + SSB = TSS

ANOVA Table
Source of variation      d.f.    Sum of squares                             Mean Sum of Squares     F-statistic                        p-value
Between (k groups)       k−1     SSB (sum of squared deviations of          SSB/(k−1)               F_{k−1,nk−k} =                     Go to
                                 group means from grand mean)                                       [SSB/(k−1)] / [SSW/(nk−k)]         F chart
Within (n per group)     nk−k    SSW (sum of squared deviations of          s² = SSW/(nk−k)
                                 observations from their group mean)
Total variation          nk−1    TSS (sum of squared deviations of          TSS = SSB + SSW
                                 observations from grand mean)
Example
Treatment 1 Treatment 2 Treatment 3 Treatment 4
60 inches 50 48 47
67 52 49 67
42 43 50 54
67 67 55 67
56 67 56 68
62 59 61 65
64 67 61 65
59 64 60 56
72 63 59 60
71 65 64 65
Example

Step 1) Calculate the sum of squares between groups:

  Mean for group 1 = 62.0
  Mean for group 2 = 59.7
  Mean for group 3 = 56.3
  Mean for group 4 = 61.4

  Grand mean = 59.85

  SSB = [(62 − 59.85)² + (59.7 − 59.85)² + (56.3 − 59.85)² + (61.4 − 59.85)²] × n per group
      = 19.65 × 10 = 196.5
Example

Step 2) Calculate the sum of squares within groups:

  (60 − 62)² + (67 − 62)² + (42 − 62)² + (67 − 62)² + (56 − 62)² + (62 − 62)² + (64 − 62)² + (59 − 62)² + (72 − 62)² + (71 − 62)²
  + (50 − 59.7)² + (52 − 59.7)² + (43 − 59.7)² + (67 − 59.7)² + (67 − 59.7)² + … (the sum of all 40 squared deviations) = 2060.6
Step 3) Fill in the ANOVA table

Source of variation   d.f.   Sum of squares   Mean Sum of Squares   F-statistic   p-value
Between                 3         196.5              65.5               1.14        .344
Within                 36        2060.6              57.2
Total                  39        2257.1

INTERPRETATION of ANOVA:
How much of the variance in height is explained by treatment group?
R² = "Coefficient of Determination" = SSB/TSS = 196.5/2257.1 ≈ 9%
Coefficient of Determination

  R² = SSB / (SSB + SSE) = SSB / SST
The amount of variation in the outcome variable (dependent
variable) that is explained by the predictor (independent variable).
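
Not part of the original slides: a minimal Python sketch (assuming NumPy/SciPy are available) that reproduces the hand calculation above for the four-treatment height example and checks it against scipy.stats.f_oneway.

```python
# Reproduce SSB, SSW, F and R^2 for the four-treatment height example.
import numpy as np
from scipy import stats

groups = [
    np.array([60, 67, 42, 67, 56, 62, 64, 59, 72, 71]),  # Treatment 1
    np.array([50, 52, 43, 67, 67, 59, 67, 64, 63, 65]),  # Treatment 2
    np.array([48, 49, 50, 55, 56, 61, 61, 60, 59, 64]),  # Treatment 3
    np.array([47, 67, 54, 67, 68, 65, 65, 56, 60, 65]),  # Treatment 4
]

grand_mean = np.mean(np.concatenate(groups))
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # between groups
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)             # within groups
tss = ssb + ssw

k = len(groups)
n_total = sum(len(g) for g in groups)
f_stat = (ssb / (k - 1)) / (ssw / (n_total - k))
r_squared = ssb / tss

print(f"SSB={ssb:.1f} SSW={ssw:.1f} F={f_stat:.2f} R^2={r_squared:.2f}")
print(stats.f_oneway(*groups))   # should agree: F ≈ 1.14, p ≈ 0.34
```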
ANOVA example
Table 6. Mean micronutrient intake from the school lunch by school

                        S1a, n=25   S2b, n=25   S3c, n=25   P-valued
Calcium (mg)   Mean       117.8       158.7       206.5       0.000
               SDe         62.4        70.5        86.2
Iron (mg)      Mean         2.0         2.0         2.0       0.854
               SD           0.6         0.6         0.6
Folate (μg)    Mean        26.6        38.7        42.6       0.000
               SD          13.1        14.5        15.1
Zinc (mg)      Mean         1.9         1.5         1.3       0.055
               SD           1.0         1.2         0.4

a School 1 (most deprived; 40% subsidized lunches).
b School 2 (medium deprived; <10% subsidized).
c School 3 (least deprived; no subsidization, private school).
d ANOVA; significant differences are highlighted in bold (P<0.05).

FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in England - are the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92.
Answer
Step 1) calculate the sum of squares between groups:
Mean for School 1 = 117.8
Mean for School 2 = 158.7
Mean for School 3 = 206.5

Grand mean: 161

SSB = [(117.8 − 161)² + (158.7 − 161)² + (206.5 − 161)²] × 25 per group = 98,113
Answer
Step 2) calculate the sum of squares within groups:

S.D. for S1 = 62.4


S.D. for S2 = 70.5
S.D. for S3 = 86.2

Therefore, the sum of squares within is:

  (24) × [62.4² + 70.5² + 86.2²] = 391,066
Answer
Step 3) Fill in your ANOVA table

Source of variation   d.f.   Sum of squares   Mean Sum of Squares   F-statistic   p-value
Between                 2        98,113             49,056               9          <.05
Within                 72       391,066              5,431
Total                  74       489,179

**R² = 98,113/489,179 = 20%
School explains 20% of the variance in lunchtime calcium
intake in these kids.
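
A small sketch, not from the slides, of the same calculation done purely from summary statistics (group means, SDs, n), as in the calcium answer above; small differences from the slide's 98,113 come from rounding in the reported means.

```python
# One-way ANOVA from summary statistics only (means, SDs, n per group).
import numpy as np
from scipy import stats

means = np.array([117.8, 158.7, 206.5])
sds = np.array([62.4, 70.5, 86.2])
n = 25                                   # per school

grand_mean = means.mean()                # equal n, so a simple average is the grand mean
ssb = (n * (means - grand_mean) ** 2).sum()
ssw = ((n - 1) * sds ** 2).sum()

df_between, df_within = len(means) - 1, len(means) * (n - 1)
f_stat = (ssb / df_between) / (ssw / df_within)
p_value = stats.f.sf(f_stat, df_between, df_within)   # upper-tail probability

print(f"SSB={ssb:.0f}, SSW={ssw:.0f}, F={f_stat:.1f}, p={p_value:.4f}")
print("R^2 =", ssb / (ssb + ssw))
```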
Beyond one-way ANOVA
Often, you may want to test more than 1
treatment. ANOVA can accommodate
more than 1 treatment or factor, so long
as they are independent. Again, the
variation partitions beautifully!

TSS = SSB1 + SSB2 + SSW


Linear regression review
What is “Linear”?
 Remember this:
 Y=mX+B?

What’s Slope?

A slope of 2 means that every 1-unit change in X


yields a 2-unit change in Y.
Regression equation…
Expected value of y at a given level of x:

  E(y_i | x_i) = α + β·x_i

Predicted value for an individual:

  y_i = α + β·x_i + random error_i

The fixed part, α + β·x_i, lies exactly on the line; the random error follows a normal distribution.
Assumptions (or the fine print)
 Linear regression assumes that…
 1. The relationship between X and Y is linear
 2. Y is distributed normally at each value of X
 3. The variance of Y at every value of X is the
same (homogeneity of variances)
 4. The observations are independent**

 **When we talk about repeated measures


starting next week, we will violate this
assumption and hence need more
sophisticated regression models!
The standard error of Y given X (S_y/x) is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.
Regression Picture
  ŷ_i = α + β·x_i

Least squares estimation gave us the line (β) that minimizes C², where for each observation:

  A = y_i − ȳ    (distance of the observation from the naïve mean of y)
  B = ŷ_i − ȳ    (distance from the regression line to the naïve mean of y)
  C = ŷ_i − y_i  (distance of the observation from the regression line)

  Σ_{i=1}^{n} (y_i − ȳ)²  =  Σ_{i=1}^{n} (ŷ_i − ȳ)²  +  Σ_{i=1}^{n} (ŷ_i − y_i)²
       A²: SStotal              B²: SSreg                  C²: SSresidual

  R² = SSreg / SStotal

SStotal: total squared distance of observations from the naïve mean of y (total variation).
SSreg: variability due to x (the regression).
SSresidual: variance around the regression line; the additional variability not explained by x, which the least squares method aims to minimize.
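
An illustrative sketch (made-up data, not the vitamin D example) of fitting a least-squares line and verifying the SStotal = SSreg + SSresidual decomposition and R².

```python
# Fit a least-squares line and check the sums-of-squares decomposition.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)   # true intercept 2, slope 0.5

beta, alpha = np.polyfit(x, y, deg=1)                  # returns (slope, intercept)
y_hat = alpha + beta * x

ss_total = ((y - y.mean()) ** 2).sum()                 # A^2
ss_reg = ((y_hat - y.mean()) ** 2).sum()               # B^2
ss_resid = ((y - y_hat) ** 2).sum()                    # C^2

print(f"slope={beta:.3f}, intercept={alpha:.3f}")
print(f"SStotal={ss_total:.1f}  SSreg+SSres={ss_reg + ss_resid:.1f}")
print(f"R^2 = {ss_reg / ss_total:.3f}")
```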
Recall example: cognitive
function and vitamin D
 Hypothetical data loosely based on [1];
cross-sectional study of 100 middle-
aged and older European men.
 Cognitive function is measured by the Digit
Symbol Substitution Test (DSST).

1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged
and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.
Distribution of vitamin D

Mean= 63 nmol/L
Standard deviation = 33 nmol/L
Distribution of DSST
Normally distributed
Mean = 28 points
Standard deviation = 10 points
Four hypothetical datasets
 I generated four hypothetical datasets,
with increasing TRUE slopes (between
vit D and DSST):
 0
 0.5 points per 10 nmol/L
 1.0 points per 10 nmol/L
 1.5 points per 10 nmol/L
Dataset 1: no relationship
Dataset 2: weak relationship
Dataset 3: weak to moderate
relationship
Dataset 4: moderate
relationship
The “Best fit” line

Regression
equation:
E(Yi) = 28 + 0*vit
Di (in 10 nmol/L)
The “Best fit” line

Note how the line is


a little deceptive; it
draws your eye,
making the
relationship appear
stronger than it
really is!

Regression
equation:
E(Yi) = 26 + 0.5*vit
Di (in 10 nmol/L)
The “Best fit” line

Regression equation:
E(Yi) = 22 + 1.0*vit
Di (in 10 nmol/L)
The “Best fit” line

Regression equation:
E(Yi) = 20 + 1.5*vit Di
(in 10 nmol/L)

Note: all the lines go


through the point
(63, 28)!
Significance testing…

Slope: the distribution of the estimated slope is ~ T_{n−2}(β, s.e.(β̂))

  H0: β1 = 0 (no linear relationship)
  H1: β1 ≠ 0 (a linear relationship does exist)

  T_{n−2} = (β̂ − 0) / s.e.(β̂)
Example: dataset 4
 Standard error (beta) = 0.03
 T98 = 0.15/0.03 = 5, p<.0001

 95% Confidence interval = 0.09 to 0.21
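
A short sketch of the slope test for dataset 4, using only the numbers quoted on the slide (β̂ = 0.15, s.e. = 0.03, n = 100); the underlying data are not reproduced here.

```python
# t-test and 95% CI for the slope, from the quoted estimate and standard error.
from scipy import stats

beta_hat, se, n = 0.15, 0.03, 100
t_stat = beta_hat / se                                   # = 5.0
p_value = 2 * stats.t.sf(t_stat, df=n - 2)               # two-sided, p < .0001
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (beta_hat - t_crit * se, beta_hat + t_crit * se)    # ≈ (0.09, 0.21)
print(t_stat, p_value, ci)
```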


Multiple linear regression…
 What if age is a confounder here?
 Older men have lower vitamin D
 Older men have poorer cognition
 “Adjust” for age by putting age in the
model:
 DSST score = intercept + slope1xvitamin D
+ slope2 xage
2 predictors: age and vit D…
Different 3D view…
Fit a plane rather than a line…

On the plane, the


slope for vitamin
D is the same at
every age; thus,
the slope for
vitamin D
represents the
effect of vitamin
D when age is
held constant.
Equation of the “Best fit”
plane…
 DSST score = 53 + 0.0039xvitamin D
(in 10 nmol/L) - 0.46 xage (in years)

 P-value for vitamin D >>.05


 P-value for age <.0001

 Thus, relationship with vitamin D was


due to confounding by age!
Multiple Linear Regression
 More than one predictor…

E(y)= α + β1*X + β2 *W + β3 *Z…

Each regression coefficient is the amount of


change in the outcome variable that would be
expected per one-unit change of the
predictor, if all other variables in the model
were held constant.
Functions of multivariate
analysis:
 Control for confounders
 Test for interactions between predictors
(effect modification)
 Improve predictions
ANOVA is linear regression!

 Divide vitamin D into three groups:


 Deficient (<25 nmol/L)
 Insufficient (>=25 and <50 nmol/L)
 Sufficient (>=50 nmol/L), reference group
DSST= α (=value for sufficient) + βinsufficient*(1
if insufficient) + β2 *(1 if deficient)
This is called “dummy coding”—where multiple
binary variables are created to represent
being in each category (or not) of a
categorical variable
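
A hedged sketch of dummy coding with statsmodels on made-up data (the variable names dsst and vitd_group are illustrative, not from the slides); C(…, Treatment(reference='sufficient')) creates the 0/1 indicator variables with "sufficient" as the reference group.

```python
# ANOVA as a regression with dummy-coded group membership (illustrative data).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "dsst": [40, 35, 28, 33, 25, 31, 30, 22, 27, 29],
    "vitd_group": ["sufficient", "sufficient", "insufficient", "sufficient",
                   "deficient", "insufficient", "sufficient", "deficient",
                   "insufficient", "sufficient"],
})

# Coefficients are differences from the reference ("sufficient") group;
# the intercept is the reference-group mean.
model = smf.ols(
    "dsst ~ C(vitd_group, Treatment(reference='sufficient'))", data=df
).fit()
print(model.params)
```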
The picture…
Sufficient vs.
Insufficient

Sufficient vs.
Deficient
Results…
Parameter Estimates

Variable       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept       1        40.07407            1.47817        27.11     <.0001
deficient       1        -9.87407            3.73950        -2.64     0.0096
insufficient    1        -6.87963            2.33719        -2.94     0.0041

 Interpretation:
 The deficient group has a mean DSST 9.87 points
lower than the reference (sufficient) group.
 The insufficient group has a mean DSST 6.87
points lower than the reference (sufficient) group.
What is a Hypothesis?

A hypothesis is an assumption about the population parameter.
("I assume the mean GPA of this class is 3.5!")

 A parameter is a population mean or proportion.
 The parameter must be identified before analysis.

© 1999 Prentice-Hall, Inc. Chap. 8 - 2


The Null Hypothesis, H0
• States the Assumption (numerical) to be tested
e.g. The average # TV sets in US homes is at
least 3 (H0: µ ≥ 3)
• Begin with the assumption that the null
hypothesis is TRUE.
(Similar to the notion of innocent until proven guilty)

•Refers to the Status Quo


•Always contains the ‘ = ‘ sign
•The Null Hypothesis may or may not be rejected.
© 1999 Prentice-Hall, Inc. Chap. 8 - 3
The Alternative Hypothesis, H1

• Is the opposite of the null hypothesis


e.g. The average # TV sets in US homes is
less than 3 (H1: µ < 3)
• Challenges the Status Quo
• Never contains the ‘=‘ sign
• The Alternative Hypothesis may or may
not be accepted

© 1999 Prentice-Hall, Inc. Chap. 8 - 4


Identify the Problem

Steps:
 State the Null Hypothesis (H0: µ ≥ 3)

 State its opposite, the Alternative


Hypothesis (H1: µ < 3)
 Hypotheses are mutually exclusive &
exhaustive
 Sometimes it is easier to form the
alternative hypothesis first.

© 1999 Prentice-Hall, Inc. Chap. 8 - 5


Hypothesis Testing Process

Assume the population mean age is 50 (the null hypothesis).

The sample mean turns out to be 20. Is X̄ = 20 ≅ µ = 50? No, not likely!

REJECT the null hypothesis.

© 1999 Prentice-Hall, Inc. Chap. 8 - 6


Reason for Rejecting H0

Sampling Distribution

It is unlikely that we would get a sample mean of this value (20) if in fact the population mean were µ = 50. Therefore, we reject the null hypothesis that µ = 50.
© 1999 Prentice-Hall, Inc. Chap. 8 - 7
Level of Significance, α
• Defines Unlikely Values of Sample
Statistic if Null Hypothesis Is True
 Called Rejection Region of Sampling
Distribution
• Designated α (alpha)
 Typical values are 0.01, 0.05, 0.10
• Selected by the Researcher at the Start
• Provides the Critical Value(s) of the Test
© 1999 Prentice-Hall, Inc. Chap. 8 - 8
Level of Significance, α and
the Rejection Region
Lower-tail test:  H0: µ ≥ 3, H1: µ < 3  →  rejection region of size α in the lower tail (one critical value)
Upper-tail test:  H0: µ ≤ 3, H1: µ > 3  →  rejection region of size α in the upper tail (one critical value)
Two-tail test:    H0: µ = 3, H1: µ ≠ 3  →  rejection regions of size α/2 in each tail (two critical values)

© 1999 Prentice-Hall, Inc. Chap. 8 - 9


Errors in Making Decisions

• Type I Error
 Reject True Null Hypothesis
 Has Serious Consequences

 Probability of Type I Error Is α


 Called Level of Significance

• Type II Error
 Do Not Reject False Null Hypothesis
 Probability of Type II Error Is β (Beta)

© 1999 Prentice-Hall, Inc. Chap. 8 - 10


Result Possibilities (H0: Innocent)

Jury Trial                          Hypothesis Test
         Actual Situation                               Actual Situation
Verdict    Innocent   Guilty        Decision            H0 True               H0 False
Innocent   Correct    Error         Do Not Reject H0    Correct (1 − α)       Type II Error (β)
Guilty     Error      Correct       Reject H0           Type I Error (α)      Power (1 − β)
© 1999 Prentice-Hall, Inc. Chap. 8 - 11


α & β Have an
Inverse Relationship
Reduce probability of one error
and the other one goes up.

© 1999 Prentice-Hall, Inc. Chap. 8 - 12


Z-Test Statistics (σ Known)

• Convert Sample Statistic (e.g., X ) to


Standardized Z Variable
  Test Statistic:  Z = (X̄ − µ_X̄) / σ_X̄ = (X̄ − µ) / (σ/√n)

• Compare to Critical Z Value(s)


 If Z test Statistic falls in Critical Region,
Reject H0; Otherwise Do Not Reject H0
© 1999 Prentice-Hall, Inc. Chap. 8 - 13
p Value Test

• Probability of Obtaining a Test Statistic


More Extreme (≤ or ≥) than Actual
Sample Value Given H0 Is True
• Called Observed Level of Significance
 Smallest Value of α at Which H0 Can Be Rejected
• Used to Make Rejection Decision
 If p value ≥ α, Do Not Reject H0
 If p value < α, Reject H0
© 1999 Prentice-Hall, Inc. Chap. 8 - 14
Hypothesis Testing: Steps

Test the Assumption that the true mean #


of TV sets in US homes is at least 3.
1. State H0 H0 : µ ≥ 3
2. State H1 H1 : µ < 3
3. Choose α α = .05
4. Choose n n = 100
5. Choose Test: Z Test (or p Value)

© 1999 Prentice-Hall, Inc. Chap. 8 - 15


Hypothesis Testing: Steps
(continued)

Test the Assumption that the average # of


TV sets in US homes is at least 3.
6. Set Up Critical Value(s) Z = -1.645
7. Collect Data 100 households surveyed
8. Compute Test Statistic Computed Test Stat.= -2
9. Make Statistical Decision Reject Null Hypothesis
10. Express Decision: The true mean # of TV sets is less than 3 in US households.
© 1999 Prentice-Hall, Inc. Chap. 8 - 16
One-Tail Z Test for Mean
(σ Known)
• Assumptions
 Population Is Normally Distributed
 If Not Normal, use large samples

 Null Hypothesis Has ≤ or ≥ Sign Only

• Z Test Statistic:
  z = (x̄ − µ_x̄) / σ_x̄ = (x̄ − µ) / (σ/√n)

© 1999 Prentice-Hall, Inc. Chap. 8 - 17


Rejection Region

Lower-tail test: H0: µ ≥ 0, H1: µ < 0 → reject H0 in the lower tail (area α); Z must be significantly below 0.
Upper-tail test: H0: µ ≤ 0, H1: µ > 0 → reject H0 in the upper tail (area α); small values don't contradict H0, so don't reject H0!
© 1999 Prentice-Hall, Inc. Chap. 8 - 18
Example: One Tail Test

Does an average box of cereal contain more than 368 grams of cereal? A random sample of 25 boxes showed X̄ = 372.5. The company has specified σ to be 15 grams. Test at the α = 0.05 level.

  H0: µ ≤ 368
  H1: µ > 368

© 1999 Prentice-Hall, Inc. Chap. 8 - 19


Finding Critical Values:
One Tail
Standardized Normal
What Is Z Given α = 0.05? Probability Table (Portion)
.50
-.05 σZ = 1 Z .04 .05 .06
.45
1.6 .5495 .5505 .5515
α = .05
1.7 .5591 .5599 .5608

0 1.645 Z 1.8 .5671 .5678 .5686


Critical Value 1.9 .5738 .5744 .5750
= 1.645
© 1999 Prentice-Hall, Inc. Chap. 8 - 20
Example Solution: One Tail

H0: µ ≤ 368
H1: µ > 368
α = 0.05
n = 25
Critical Value: 1.645

Test Statistic:  Z = (X̄ − µ) / (σ/√n) = (372.5 − 368) / (15/√25) = 1.50

Decision: Do Not Reject at α = .05
Conclusion: No evidence that the true mean is more than 368
© 1999 Prentice-Hall, Inc. Chap. 8 - 21
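
A small sketch of the one-tail cereal example in Python (SciPy assumed): Z = 1.50, upper-tail p ≈ 0.0668, critical value 1.645.

```python
# One-tail Z test for the cereal example (sigma known).
import math
from scipy import stats

x_bar, mu0, sigma, n, alpha = 372.5, 368, 15, 25, 0.05
z = (x_bar - mu0) / (sigma / math.sqrt(n))        # 1.50
p_value = stats.norm.sf(z)                        # upper-tail p-value ≈ 0.0668
z_crit = stats.norm.ppf(1 - alpha)                # 1.645
print(z, p_value, "reject" if z > z_crit else "do not reject")
```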
p Value Solution
p Value is P(Z ≥ 1.50) = 0.0668

Use the alternative hypothesis to find the direction of the test.
From the Z table, look up 1.50: P(Z < 1.50) = 0.9332, so p = 1.0000 − 0.9332 = 0.0668.

(Z = 1.50 is the Z value of the sample statistic.)
© 1999 Prentice-Hall, Inc. Chap. 8 - 22
p Value Solution

(p Value = 0.0668) ≥ (α = 0.05). Do Not Reject.

The test statistic (Z = 1.50) falls in the Do Not Reject region; the rejection region (α = 0.05) lies beyond it.
© 1999 Prentice-Hall, Inc. Chap. 8 - 23
Example: Two Tail Test

Does an average box of cereal contain 368 grams of cereal? A random sample of 25 boxes showed X̄ = 372.5. The company has specified σ to be 15 grams. Test at the α = 0.05 level.

  H0: µ = 368
  H1: µ ≠ 368

© 1999 Prentice-Hall, Inc. Chap. 8 - 24


Example Solution: Two Tail
H0: µ = 368
H1: µ ≠ 368
α = 0.05
n = 25
Critical Values: ±1.96

Test Statistic:  Z = (X̄ − µ) / (σ/√n) = (372.5 − 368) / (15/√25) = 1.50

Decision: Do Not Reject at α = .05
Conclusion: No evidence that the true mean is not 368
© 1999 Prentice-Hall, Inc. Chap. 8 - 25
Connection to
Confidence Intervals
For X̄ = 372.5, σ = 15 and n = 25, the 95% confidence interval is:

  372.5 − (1.96)(15/√25)  to  372.5 + (1.96)(15/√25)
  or  366.62 ≤ µ ≤ 378.38

If this interval contains the hypothesized mean (368), we do not reject the null hypothesis. It does, so do not reject.
© 1999 Prentice-Hall, Inc. Chap. 8 - 26
t-Test: σ Unknown
Assumptions
 Population is normally distributed
 If not normal, only slightly skewed & a large
sample taken
Parametric test procedure
t test statistic:  t = (X̄ − µ) / (S/√n)

© 1999 Prentice-Hall, Inc. Chap. 8 - 27


Example: One Tail t-Test

Does an average box of cereal contain more than 368 grams of cereal? A random sample of 36 boxes showed X̄ = 372.5 and s = 15. Test at the α = 0.01 level.

σ is not given, so use the t test.

  H0: µ ≤ 368
  H1: µ > 368

© 1999 Prentice-Hall, Inc. Chap. 8 - 28


Example:Z Test for Proportion

•Problem: A marketing company claims


that it receives 4% responses from its
Mailing.
•Approach: To test this claim, a random
sample of 500 were surveyed with 25
responses.
•Solution: Test at the α = .05 significance
level.
© 1999 Prentice-Hall, Inc. Chap. 8 - 29
Z Test for Proportion:
Solution
H0: p = .04
H1: p ≠ .04
α = .05
n = 500
Critical Values: ±1.96

Test Statistic:  Z ≅ (p_s − p) / √(p(1 − p)/n) = (.05 − .04) / √(.04(1 − .04)/500) = 1.14

Decision: Do not reject at α = .05
Conclusion: We do not have sufficient evidence to reject the company's claim of a 4% response rate.
© 1999 Prentice-Hall, Inc. Chap. 8 - 30
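
A minimal sketch of the proportion test above (SciPy assumed): 25 responses out of 500 versus the claimed 4%.

```python
# One-sample Z test for a proportion.
import math
from scipy import stats

p0, n, successes = 0.04, 500, 25
p_hat = successes / n                                   # 0.05
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)         # ≈ 1.14
p_value = 2 * stats.norm.sf(abs(z))                     # two-sided ≈ 0.25
print(z, p_value)                                       # |z| < 1.96 -> do not reject
```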
Machine Learning
(UCSE603)
Module –I:
Introduction
DR. RANJAN MAITY
CENTRAL INSTITUTE OF TECHNOLOGY
KOKRAJHAR
Syllabus
Module 1: Introduction to Machine learning [3L] Brief Introduction to Machine Learning, its advantage and disadvantages, Supervised Learning
Unsupervised Learning Reinforcement Learning.

Module 2: Bayesian Learning [5L] Probability Basics, Bayes Theorem, Naive Bayes Classifier, Gaussian Naive Bayes Classifier, Bayesian Networks.

Module 3: Artificial Neural Network [5L] Perceptron Learning Algorithm: Delta Rule and Gradient Descent. Multi-layer Perceptron Learning:
Backpropagation and Stochastic Gradient Descent. Hypotheses Space, Inductive Bias and Convergence. Variants Structures of Neural Network.

Module 4: Support Vector Machines [4L] Decision Boundary and Support Vector: Optimization and Primal-Dual Problem. Extension to SVM: Soft Margin
and Non-linear Decision Boundary. Kernel Functions and Radial Basis Functions (detailed later).

Module 5: Linear Models and Regression [3L] Linear Regression. Linear Classification. Logistic Regression. Non-linear Transformation.

Module 6: Decision Tree Learning [4L] Decision Tree Representation and Learning Algorithm (ID3). Attribute Selection using Entropy Measures and
Gains. Hypotheses Space and Inductive bias.

Module 7: Clustering [5L] Partitional Clustering and Hierarchical Clustering. Cluster Types, Attributes and Salient Features. k-Means, k-Nearest
Neighbour (kNN) Classifier. Hierarchical and Density-based Clustering Algorithms. Inter and Intra Clustering Similarity, Cohesion and Separation.

Module 8: Some Other Learning Concept [1L] Active Learning, Deep Learning, Transfer Learning.
Books
TEXT BOOKS:
1. Christopher Bishop. Pattern Recognition and Machine Learning. 2e.
2. The Elements of Statistical Learning Data Mining, Inference, and Prediction by T. Hastie, R. Tibshirani, J. Friedman

REFERENCES:
1. Machine Learning, Tom Mitchell, McGraw Hill, 1997
2. Introduction to Machine Learning, Third edition, Ethem Alpaydin, The MIT Press, September 2014.
3. Understanding Machine Learning: From Theory to Algorithms, First Edition by Shwartz, Shai Ben-David, Cambridge University
Press, 2014.
4. Learning From Data, First Edition by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin, AML Book, 2012.
Evaluation
Continuous (40):

1. Attendance : Min 75%

2. Class Test I (Open book): 5

3. Mid term : 15

4. Class Test II (Open Book): 5

5. Term Project : 15

End Term (60):


Prerequisites

1. Number System

2. Linear Algebra

3. Probability Theory
Quick Recap : Number System

Source: https://en.wikipedia.org/wiki/Natural_number
Cont.…

1. Complex Number (ℂ): C = A + iB [A real, B imaginary and i² = −1], C’ = A – iB (complex conjugate)

2. Real Number (ℝ):

3. Rational Number (ℚ): any number x that can be represented as a/b, where a and b are both integers and b ≠ 0. Example: 1.5 = 3/2. Try for √2, √3 ???

4. Integers (ℤ): Whole number without fraction, can be +ve as well as –ve. Example - 3, 59, -113

5. Natural Numbers (ℕ): +ve integers, includes 0 according to ISO 80000-2,


Recap on Linear Algebra

▪Linear Algebra : Vector + Linear Functions

▪Vector (literally "carrier"): a list of numbers

▪Things we can do with vectors: addition and (scalar) multiplication


▪ Example: Numbers, polynomials, power series,
Vector in Space – n-vectors

▪ An n-vector is a vector with n components

▪ a_n denotes the nth component of the vector, not a multiplied n times

▪ Set of all n-vectors (with real components) ⇒ ℝⁿ

▪ Zero vector: all components are zero

Transpose of Vector

• Converts column to row or row to column


Vector Conversion

• Conversion between two vectors x and y with sizes m and n respectively

•Convert x to y???

•Example :

•Got y
Linear Function

▪Let X, Y be two vectors, c a number (scalar), and F a linear function


▪F(X) and F(Y) are also vector
▪Additivity
▪ F(X+Y)= F(X)+F(Y)

▪Homogeneity
▪ F(cX) = c F(X)

▪Linear: If the function F obeys Additivity and Homogeneity then we call F as Linear
▪Consider F(x) = x². Does F(x + x) = F(x) + F(x)? Test with a value…… Is it linear??
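
A quick numeric check (not from the slides) that F(x) = x² fails additivity and is therefore not linear:

```python
# F(x) = x**2 violates additivity, so it is not a linear function.
def F(x):
    return x ** 2

x = 3
print(F(x + x), F(x) + F(x))   # 36 vs 18 -> F(x + x) != F(x) + F(x)
```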
Matrix

▪ Matrix : 2D – info related to a linear function :

▪ A gray image (e.g., of Kalu) = a matrix, where each element is a pixel intensity

Tensor.…

• n dimensional vector space

•Generally more than 3…

•Consider a colour image – each pixel -> 24 bits


• Red Component (8 bits)
• Green Component (8 bits)
• Blue Component (8 bits)
Cont.…
Cont.…

• Colour Image of KALU :


Dot Products of Vector

▪ Two vectors

▪Dot Product
Probability

• Def: How likely an event will occur

• Toss a Coin : Space ={H, T},

•Roll a Die: Space = {1, 2, 3, 4, 5, 6}


Cont…

▪Problem: Flip three fair coins one by one, what will be the probability of at least two heads ??
▪ Space S= {TTT, TTH, THT, THH, HTT, HTH, HHT, HHH} ,
▪ Can map with digital logic ??
▪ Total Space = 8
▪ At least two heads M = { THH, HTH, HHT, HHH}
▪ Total Space =4
▪ Probability of at least two head =4/8 =1/2

▪Problem: Consider 2 fair coins flipped and 2 dice rolled independently. What will be the probability of {HH66}?
▪ Size of the space = 2×2×6×6 = 144
▪ Probability = 1/144
▪ Independently, P(H) = 1/2, P(6) = 1/6
▪ P{HH66} = P(H)×P(H)×P(6)×P(6) -------- independent events
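
A small enumeration check (illustrative, not from the slides) for both problems above:

```python
# Enumerate the sample spaces and verify the two probabilities.
from itertools import product

coins3 = list(product("HT", repeat=3))
p_two_heads = sum(c.count("H") >= 2 for c in coins3) / len(coins3)   # 0.5

space = list(product("HT", "HT", range(1, 7), range(1, 7)))
p_hh66 = sum(o == ("H", "H", 6, 6) for o in space) / len(space)      # 1/144
print(p_two_heads, p_hh66)
```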
Conditional Probability
▪ For two events A and B, where the occurrence of A depends on the occurrence of B and P(B) > 0, the conditional probability is P(A|B) = P(A∩B)/P(B)

▪ P(A∩B) is the probability that both A and B occur

▪Example: Consider the case of flipping three coins : S= {TTT, TTH, THT, THH, HTT, HTH, HHT, HHH}

▪What is the probability that the first coin is a head, given that exactly two of the three are heads???
▪ A = first coin is head, B = exactly two coins are heads
▪ P(A) = P({HTT, HTH, HHT, HHH}) = 4/8 = 1/2
▪ P(B) = P({THH, HTH, HHT}) = 3/8
▪ P(A∩B) = P({HTH, HHT}) = 2/8
▪ P(A|B) = P(A∩B)/P(B) = (2/8)/(3/8) = 2/3
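
The same answer can be checked by enumeration; a tiny sketch:

```python
# P(first coin is H | exactly two of three are H) = 2/3, by enumeration.
from itertools import product

space = list(product("HT", repeat=3))
B = [s for s in space if s.count("H") == 2]      # exactly two heads
A_and_B = [s for s in B if s[0] == "H"]
print(len(A_and_B) / len(B))                     # 0.666...
```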
Law of Total Probability, conditioned
▪Consider a sample space S,

▪ S is partitioned in A1, A2,…..An sample spaces.

▪ If B is an event, then P(B) = Σᵢ P(B|Aᵢ)P(Aᵢ)

▪Prob: Consider a train has 40% AC coach, while the rest are non AC. Suppose 50% of AC coaches have lower berth, while
lower berths in non AC rakes are only 30%. A berth is picked up randomly, What is the probability that it will be lower??
▪ A1 = coach is AC, A2 = coach is non-AC; B = berth is lower
▪ P(A1) = 0.4, P(A2) = 0.6, P(B|A1) = 0.5, P(B|A2) = 0.3
▪ P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) = 0.5×0.4 + 0.3×0.6 = 0.38
Bayes Theorem

▪For two events A and B with P(B) > 0:  P(A|B) = P(B|A)P(A)/P(B)

▪Example: Consider two gift boxes G1= {5 Red pens, 10 Green pens}, G2 = {4 Red Pens, 6 Green Pens}, What is the
probability that a green pen is selected from G1?

▪P(A) = probability of choosing G1 = 1/2

▪P(B) = probability of drawing a green pen = P(green|G1)P(G1) + P(green|G2)P(G2) = (10/15)(1/2) + (6/10)(1/2) = 19/30

▪P(A|B) = P(B|A)P(A)/P(B) = (10/15)(1/2)/(19/30) = 10/19
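
A sketch checking the computation above with exact fractions; it assumes the question asks for P(G1 | green), which is how the ??? prompt is completed above.

```python
# Bayes theorem for the gift-box example, with exact arithmetic.
from fractions import Fraction as F

p_g1, p_g2 = F(1, 2), F(1, 2)
p_green_g1, p_green_g2 = F(10, 15), F(6, 10)

p_green = p_green_g1 * p_g1 + p_green_g2 * p_g2
p_g1_given_green = p_green_g1 * p_g1 / p_green
print(p_green, p_g1_given_green)    # 19/30, 10/19
```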
Dr. Ranjan Maity
Assistant Professor
Central Institute of Technology Kokrajhar
r.maity@cit.ac.in

Machine Learning, UCSE603 Dr. Ranjan Maity, Central Institute of Technology Kokrajhar
Genesis
 Alan Turing: I believe that in about fifty
years’ time it will be possible to programme
computers, with a storage capacity of about 10⁹,
to make them play the imitation game so well
that an average interrogator will not have more
than 70 percent chance of making the right
identification after five minutes of questioning.
… I believe that at the end of the century the use
of words and general educated opinion will have
altered so much that one will be able to speak of
machines thinking without expecting to be
contradicted.

Machine Learning, UCSE603 https://plato.stanford.edu/entries/turing-test/


Turing Test

Who is the Machine?

(Figure: the interrogator questions a machine and a human, without knowing which is which.)
Machine Learning, UCSE603 Dr. Ranjan Maity


Cont.…
 1952: Arthur Samuel - Game of Checkers
 1957: Frank Rosenblatt – neural network – simulates human brain
 1967: Cover and Hart – nearest neighbor – pattern recognition
 1997: Deep Blue – beats Garry Kasparov
 2006: Hinton – Deep learning
 Used to identify text and objects in images/videos
 2011: IBM Watson wins Jeopardy, an American Quiz Show
 2014: DeepFace – used to identify human faces

Machine Learning, UCSE603 Dr. Ranjan Maity


Cont…
 Auto Pilot – Tesla
 ADAS (Advanced Driver
Assistance System)

Machine Learning, UCSE603 Dr. Ranjan Maity


More Applications of M/L



Dr. Ranjan Maity
Assistant Professor
Central Institute of Technology Kokrajhar
r.maity@cit.ac.in

Machine Learning, UCSE603 Dr. Ranjan Maity, Central Institute of Technology Kokrajhar
Types
 Supervised

 Unsupervised

 Semi-supervised

 Reinforcement

Machine Learning, UCSE603 https://plato.stanford.edu/entries/turing-test/


Supervised Learning
 Consider how you learned the alphabet – from where did you learn it??
 Training data with outputs (labels)
 Example: learning the Japanese alphabet

https://www.busuu.com/en/japanese/alphabet
Machine Learning, UCSE603, Dr. Ranjan Maity
Cont…
 Consider the images –
 Apple (R, G) – class 1
 Lemon & Orange –class 2
 Watermelon – class 3
 Banana – class 4
 A model is developed (trained) based on the data
 It is then tested with a test set

https://www.diegocalvo.es/en/supervised-learning/
Machine Learning, UCSE603, Dr. Ranjan Maity
Mathematical Formulation
 Two types –
 Classification
 Given a set of labelled pairs (x1,y1), (x2,y2), (x3,y3), …, (xn,yn)
 Learn a function f(x) that predicts y (a class label) for a given x
 Regression
 Given a set of labelled pairs (x1,y1), (x2,y2), (x3,y3), …, (xn,yn)
 Learn a function f(x) that predicts y (a numeric value) for a given x

Machine Learning, UCSE603 Dr. Ranjan Maity
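
An illustrative sketch of the two settings with scikit-learn on made-up toy data (the cat/tiger and stock-market figures from the following slides are not reproduced):

```python
# Classification vs. regression on tiny toy datasets.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: predict a class label y from features x (here: height, weight).
X_cls = [[0.3, 4], [0.4, 5], [2.8, 180], [3.0, 220]]
y_cls = ["cat", "cat", "tiger", "tiger"]
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[0.5, 6]]))          # -> ['cat']

# Regression: predict a numeric value y from x.
X_reg = [[1], [2], [3], [4]]
y_reg = [2.1, 3.9, 6.2, 8.1]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[5]]))               # ≈ 10
```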


Classification

 Example: classify Cat vs. Tiger from body measurements.
(Figure: scatter plot of weight vs. height, with a decision function f(x) separating the two classes.)

Machine Learning, UCSE603 Dr. Ranjan Maity


Regression

 Stock market

Machine Learning, UCSE603 Dr. Ranjan Maity


Unsupervised Learning
 No labelled data

 Given x1, x2,


x3,…….,xn

 O/P – Structures of xi

 Clustering

Machine Learning, UCSE603 Dr. Ranjan Maity


Reinforcement Learning

 Def: learning the optimal


behavior in an environment
to obtain maximum reward
(https://www.synopsys.com/ai/what-is-reinforcement-learning.html)

 Agent
 Environment
 Policy
 Reward

Machine Learning, UCSE603 Dr. Ranjan Maity


Dr. Ranjan Maity
Assistant Professor
Central Institute of Technology Kokrajhar
r.maity@cit.ac.in

Machine Learning, UCSE603 Dr. Ranjan Maity, Central Institute of Technology Kokrajhar
Norms of Vector
 Length/magnitude of a vector

 A norm is a function f : u → R, such that:
 1. f is a function, u a vector space, R the real line

 2. f(x) ≥ 0, ∀x ∈ u

 3. Triangle inequality: f(x + y) ≤ f(x) + f(y), ∀x, y ∈ u

 4. Positive homogeneity: f(αx) = |α|·f(x), ∀x ∈ u, ∀α ∈ C

 5. f(x) = 0 iff x = 0
Machine Learning, UCSE603 /


Cont.…

 1st norm:  ‖v‖₁ = |v₁| + |v₂| + … + |vₙ| = Σᵢ |vᵢ|

 2nd norm / Euclidean norm:  ‖v‖₂ = √(v₁² + v₂² + … + vₙ²) = √(Σᵢ vᵢ²)

 Max norm (∞ norm):  ‖v‖∞ = max(|v₁|, |v₂|, …, |vₙ|)

 Frobenius norm (for a matrix):  ‖M‖_F = √(Σᵢ Σⱼ M²ᵢⱼ)

 Problems

Machine Learning, UCSE603 Dr. Ranjan Maity
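
A quick NumPy check of the norms listed above (not from the slides):

```python
# The four norms on a small example vector and matrix.
import numpy as np

v = np.array([3.0, -4.0, 12.0])
M = np.array([[1.0, 2.0], [3.0, 4.0]])

print(np.linalg.norm(v, 1))        # 1st norm: |3| + |-4| + |12| = 19
print(np.linalg.norm(v, 2))        # Euclidean norm: sqrt(9 + 16 + 144) = 13
print(np.linalg.norm(v, np.inf))   # max norm: 12
print(np.linalg.norm(M, "fro"))    # Frobenius norm: sqrt(1 + 4 + 9 + 16) ≈ 5.48
```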


Linear Combination
 Vector v is a linear combination of vectors v1, v2, …, vn if

   v = α₁v₁ + α₂v₂ + … + αₙvₙ

 Example, with v₁ = [5, −2] and v₂ = [−4, 4]:

   v = 2v₂ + v₁ = [−8, 8] + [5, −2] = [−3, 6]

Machine Learning, UCSE603 Dr. Ranjan Maity


Span and Linear Independence
 Span: the set of all vectors reachable from a given set of vectors by linear combinations

 Linear independence: a set of vectors v1, v2, …, vn is linearly independent if no vᵢ can be represented as a linear combination of the others. Check:

   v₁ = [0, 5],  v₂ = [1, 3]

Machine Learning, UCSE603 Dr. Ranjan Maity
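
A small sketch checking linear independence of v₁ and v₂ via matrix rank (NumPy assumed):

```python
# Stack the vectors as columns; full column rank means linear independence.
import numpy as np

v1, v2 = np.array([0, 5]), np.array([1, 3])
A = np.column_stack([v1, v2])
print(np.linalg.matrix_rank(A))   # 2 -> the vectors are linearly independent
```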


Trace of a Matrix
 Trace = the sum of the diagonal elements of a square matrix. Example:

   v1 = [[1, 3, 4], [2, 7, 9], [5, 6, 8]],   T(v1) = 1 + 7 + 8 = 16

 Properties:
 1. T(u + v) = T(u) + T(v)

 2. T(uv) = T(vu)

 3. T(u) = T(uᵀ)

 Compute the trace of v1 = [[1, 2, 6, 7, 9], [3, 4, 5, 8, 10]] ?

Machine Learning, UCSE603 Dr. Ranjan Maity


Determinant with Volume
 Determinant of M = [[a, b, c], [d, e, f], [g, h, i]]:

   |M| = a(ei − hf) − b(di − fg) + c(dh − eg)

 General cofactor-expansion formula (along row i):

   |M| = Σ_{j=1}^{n} (−1)^{i+j} m_ij |M_ij|,   where m_ij is the (i, j) entry and M_ij its minor

 Connection with volume: consider the two vectors u = (3, 4) and v = (3, −4) and compute the area of the parallelogram they span.
Machine Learning, UCSE603 Dr. Ranjan Maity
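
A small sketch (NumPy assumed) of the determinant-as-area interpretation for the two vectors above:

```python
# |det| of the matrix whose columns are u and v equals the parallelogram area.
import numpy as np

u, v = np.array([3, 4]), np.array([3, -4])
A = np.column_stack([u, v])
print(np.linalg.det(A))        # -24.0 -> area of the parallelogram = |det| = 24
```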


Naïve Bayes Classifier

Eamonn Keogh
UCR

This is a high level overview only. For details, see:
Pattern Recognition and Machine Learning, Christopher Bishop, Springer-Verlag, 2006.
Or
Pattern Classification by R. O. Duda, P. E. Hart, D. Stork, Wiley and Sons.

(Portrait: Thomas Bayes, 1702–1761)

We will start off with a visual intuition, before looking at the math…
Grasshoppers vs. Katydids

(Figure: scatter plot of Antenna Length vs. Abdomen Length, both on 1–10 scales, for the two insect classes.)
Remember this example?
Let’s get lots more data…
With a lot of data, we can build a histogram. Let us
just build one for “Antenna Length” for now…
(Figure: the same Antenna Length vs. Abdomen Length plot, now with many more labelled Katydid and Grasshopper examples.)
We can leave the histograms as they are, or we can summarize them with two normal distributions. Let us use two normal distributions for ease of visualization in the following slides…
• We want to classify an insect we have found. Its antennae are 3 units long.
How can we classify it?

• We can just ask ourselves: given the distributions of antennae lengths we have seen, is it more probable that our insect is a Grasshopper or a Katydid?
• There is a formal way to discuss the most probable classification…
p(cj | d) = probability of class cj, given that we have observed d

Antennae length is 3
p(cj | d) = probability of class cj, given that we have observed d

P(Grasshopper | 3 ) = 10 / (10 + 2) = 0.833


P(Katydid | 3 ) = 2 / (10 + 2) = 0.166


Antennae length is 3
p(cj | d) = probability of class cj, given that we have observed d

P(Grasshopper | 7 ) = 3 / (3 + 9) = 0.250
P(Katydid | 7 ) = 9 / (3 + 9) = 0.750


Antennae length is 7
p(cj | d) = probability of class cj, given that we have observed d

P(Grasshopper | 5 ) = 6 / (6 + 6) = 0.500
P(Katydid | 5 ) = 6 / (6 + 6) = 0.500


Antennae length is 5
Bayes Classifiers
That was a visual intuition for a simple case of the Bayes classifier,
also called:

• Idiot Bayes
• Naïve Bayes
• Simple Bayes

We are about to see some of the mathematical formalisms, and


more examples, but keep in mind the basic idea.

Find out the probability of the previously unseen instance


belonging to each class, then simply pick the most probable class.
Bayes Classifiers
• Bayesian classifiers use Bayes theorem, which says
  p(cj | d) = p(d | cj) p(cj) / p(d)

• p(cj | d) = probability of instance d being in class cj,


This is what we are trying to compute
• p(d | cj) = probability of generating instance d given class cj,
We can imagine that being in class cj, causes you to have feature d
with some probability
• p(cj) = probability of occurrence of class cj,
This is just how frequent the class cj, is in our database
• p(d) = probability of instance d occurring
This can actually be ignored, since it is the same for all classes
Assume that we have two classes: c1 = male, and c2 = female.
(Note: "Drew" can be a male or a female name; think Drew Barrymore and Drew Carey.)

We have a person whose sex we do not know, say "drew" or d. Classifying drew as male or female is equivalent to asking which is more probable: p(male | drew) or p(female | drew).

  p(male | drew) = p(drew | male) p(male) / p(drew)

p(drew | male) is the probability of being called "drew" given that you are a male; p(male) is the probability of being a male; p(drew) is the probability of being named "drew" (actually irrelevant, since it is the same for all classes).
This is Officer Drew (who arrested me in
1997). Is Officer Drew a Male or Female?
Luckily, we have a small database with names and sex. We can use it to apply Bayes rule to Officer Drew…

  Name      Sex
  Drew      Male
  Claudia   Female
  Drew      Female
  Drew      Female
  Alberto   Male
  Karin     Female
  Nina      Female
  Sergio    Male

  p(cj | d) = p(d | cj) p(cj) / p(d)
Applying this to Officer Drew:

  p(male | drew)   ∝ p(drew | male) p(male)     = 1/3 × 3/8 = 0.125
  p(female | drew) ∝ p(drew | female) p(female) = 2/5 × 5/8 = 0.250

(the common denominator p(drew) = 3/8 is the same for both classes)

Officer Drew is more likely to be a Female.
Officer Drew IS a female!

  p(male | drew)   ∝ 1/3 × 3/8 = 0.125
  p(female | drew) ∝ 2/5 × 5/8 = 0.250
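
A short sketch, not from the slides, reproducing the Officer Drew calculation from the eight-row name database:

```python
# Unnormalized posteriors p(name | sex) * p(sex) for the name "Drew".
from collections import Counter

data = [("Drew", "Male"), ("Claudia", "Female"), ("Drew", "Female"), ("Drew", "Female"),
        ("Alberto", "Male"), ("Karin", "Female"), ("Nina", "Female"), ("Sergio", "Male")]

sex_counts = Counter(sex for _, sex in data)        # Male: 3, Female: 5

def unnormalized_posterior(name, sex):
    # The evidence p(name) is the same for every class, so it can be ignored
    # when comparing classes.
    in_class = [nm for nm, s in data if s == sex]
    p_name_given_sex = sum(nm == name for nm in in_class) / len(in_class)
    return p_name_given_sex * sex_counts[sex] / len(data)

print(unnormalized_posterior("Drew", "Male"))       # 1/3 * 3/8 = 0.125
print(unnormalized_posterior("Drew", "Female"))     # 2/5 * 5/8 = 0.25
```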
So far we have only considered Bayes classification when we have one attribute (the "antennae length", or the "name"). But we may have many features. How do we use all the features?

  p(cj | d) = p(d | cj) p(cj) / p(d)

  Name      Over 170cm   Eye     Hair length   Sex
  Drew      No           Blue    Short         Male
  Claudia   Yes          Brown   Long          Female
  Drew      No           Blue    Long          Female
  Drew      No           Blue    Long          Female
  Alberto   Yes          Brown   Short         Male
  Karin     No           Blue    Long          Female
  Nina      Yes          Brown   Short         Female
  Sergio    Yes          Blue    Long          Male
• To simplify the task, naïve Bayesian classifiers assume
attributes have independent distributions, and thereby estimate

p(d|cj) = p(d1|cj) * p(d2|cj) * ….* p(dn|cj)

The probability of
class cj generating
instance d, equals….
The probability of class cj
generating the observed
value for feature 1,
multiplied by..
The probability of class cj
generating the observed
value for feature 2,
multiplied by..
• To simplify the task, naïve Bayesian classifiers
assume attributes have independent distributions, and
thereby estimate
p(d|cj) = p(d1|cj) * p(d2|cj) * ….* p(dn|cj)

p(officer drew|cj) = p(over_170cm = yes|cj) * p(eye =blue|cj) * ….

Officer Drew
is blue-eyed, p(officer drew| Female) = 2/5 * 3/5 * ….
over 170cm
tall, and has p(officer drew| Male) = 2/3 * 2/3 * ….
long hair
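
A sketch of the multi-feature step for Officer Drew (blue-eyed, over 170 cm, long hair) using the eight-row table above; the helper names are illustrative.

```python
# Naive Bayes with independent features, computed directly from the table.
rows = [  # (name, over_170cm, eye, hair, sex)
    ("Drew", "No", "Blue", "Short", "Male"),
    ("Claudia", "Yes", "Brown", "Long", "Female"),
    ("Drew", "No", "Blue", "Long", "Female"),
    ("Drew", "No", "Blue", "Long", "Female"),
    ("Alberto", "Yes", "Brown", "Short", "Male"),
    ("Karin", "No", "Blue", "Long", "Female"),
    ("Nina", "Yes", "Brown", "Short", "Female"),
    ("Sergio", "Yes", "Blue", "Long", "Male"),
]

def score(sex, over_170, eye, hair):
    in_class = [r for r in rows if r[4] == sex]
    prior = len(in_class) / len(rows)
    def p(idx, value):                       # p(feature value | class)
        return sum(r[idx] == value for r in in_class) / len(in_class)
    return prior * p(1, over_170) * p(2, eye) * p(3, hair)

print(score("Female", "Yes", "Blue", "Long"))   # 5/8 * 2/5 * 3/5 * 4/5
print(score("Male", "Yes", "Blue", "Long"))     # 3/8 * 2/3 * 2/3 * 1/3
```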
The Naive Bayes classifiers
is often represented as this
type of graph…
cj
Note the direction of the
arrows, which state that
each class causes certain
features, with a certain
probability

p(d1|cj) p(d2|cj) … p(dn|cj)


Naïve Bayes is fast and cj
space efficient

We can look up all the probabilities


with a single scan of the database and
store them in a (small) table…

p(d1|cj) p(d2|cj) … p(dn|cj)

  Sex      Over 190cm              Sex      Long Hair
  Male     Yes 0.15                Male     Yes 0.05
           No  0.85                         No  0.95
  Female   Yes 0.01                Female   Yes 0.70
           No  0.99                         No  0.30
  … (one such table per feature)
Naïve Bayes is NOT sensitive to irrelevant features...

Suppose we are trying to classify a person's sex based on several features, including eye color. (Of course, eye color is completely irrelevant to a person's gender.)
p(Jessica |cj) = p(eye = brown|cj) * p( wears_dress = yes|cj) * ….

p(Jessica | Female) = 9,000/10,000 * 9,975/10,000 * ….


p(Jessica | Male) = 9,001/10,000 * 2/10,000 * ….
Almost the same!

However, this assumes that we have good enough estimates of


the probabilities, so the more data the better.
An obvious point. I have used a
simple two class problem, and
cj
two possible values for each
example, for my previous
examples. However we can have
an arbitrary number of classes, or
feature values

p(d1|cj) p(d2|cj) … p(dn|cj)

Animal Mass >10kg Animal Color Animal


Cat Yes 0.15 Cat Black 0.33 Cat
No 0.85 White 0.23
Dog Yes 0.91 Brown 0.44 Dog

No 0.09 Dog Black 0.97


Pig
Pig Yes 0.99 White 0.03
No 0.01 Brown 0.90
Pig Black 0.04
White 0.01
Problem! Naïve Bayesian
p(d|cj)
Classifier
Naïve Bayes assumes
independence of
features…

p(d1|cj) p(d2|cj) p(dn|cj)

  Sex      Over 6 foot             Sex      Over 200 pounds
  Male     Yes 0.15                Male     Yes 0.11
           No  0.85                         No  0.80
  Female   Yes 0.01                Female   Yes 0.05
           No  0.99                         No  0.95
Solution Naïve Bayesian
p(d|cj)
Classifier
Consider the
relationships between
attributes…

p(d1|cj) p(d2|cj) p(dn|cj)

  Sex      Over 6 foot             Sex (Male)   Over 200 pounds
  Male     Yes 0.15                Yes and Over 6 foot       0.11
           No  0.85                No and Over 6 foot        0.59
  Female   Yes 0.01                Yes and NOT Over 6 foot   0.05
           No  0.99                No and NOT Over 6 foot    0.35
Solution Naïve Bayesian
p(d|cj)
Classifier
Consider the
relationships between
attributes…

p(d1|cj) p(d2|cj) p(dn|cj)

But how do we find the set of connecting arcs??


The Naïve Bayesian Classifier has a piecewise quadratic decision boundary

Grasshoppers
Katydids
Ants

Adapted from slide by Ricardo Gutierrez-Osuna


(Figure: one second of audio from the laser sensor; only Bombus impatiens (Common Eastern Bumble Bee) is in the insectary. The trace shows background noise, then the bee beginning to cross the laser. The single-sided amplitude spectrum of Y(t), over a 0–1000 Hz frequency axis, shows 60 Hz interference and a peak at 197 Hz with harmonics: the wing-beat frequency.)
(Figure: wing-beat frequency distributions, 400–700 Hz. Anopheles stephensi, female: mean = 475, Std = 30. Aedes aegyptii, female: mean = 567, Std = 43. The two curves cross near 517 Hz.)

If I see an insect with a wingbeat frequency of 500, what is it?


  P(wingbeat = 500 | Anopheles) = (1 / (√(2π)·30)) · e^(−(500−475)² / (2·30²))

(Figure: the two density curves again, 400–700 Hz, annotated at 517 Hz.)

What is the error rate? 12.2% of the area under the pink curve and 8.02% of the area under the red curve lie on the wrong side of the decision point.
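
A tiny sketch (SciPy assumed) comparing the two wing-beat Gaussians at 500 Hz, which is the one-feature Gaussian Bayes step above:

```python
# Evaluate each class-conditional density at the observed wing-beat frequency.
from scipy.stats import norm

anopheles = norm(loc=475, scale=30)
aedes = norm(loc=567, scale=43)

x = 500
print(anopheles.pdf(x), aedes.pdf(x))   # higher density under Anopheles -> classify as Anopheles
```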

Can we get more features?


Circadian Features

(Figure: daily activity profile of Aedes aegypti, the yellow fever mosquito, from midnight to midnight, with dawn and dusk marked.)
(Figure: observations of wing-beat frequency, 400–600 Hz, plotted against time of day, midnight to midnight, for the three species.)

Suppose I observe an insect with a wing-beat frequency of 420 Hz. What is it?

Suppose I observe an insect with a wing-beat frequency of 420 Hz at 11:00 am. What is it?

(Culex | [420Hz,11:00am]) = (6/ (6 + 6 + 0)) * (2/ (2 + 4 + 3)) = 0.111


(Anopheles | [420Hz,11:00am]) = (6/ (6 + 6 + 0)) * (4/ (2 + 4 + 3)) = 0.222
(Aedes | [420Hz,11:00am]) = (0/ (6 + 6 + 0)) * (3/ (2 + 4 + 3)) = 0.000
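
A sketch of the two-feature calculation above; the per-class counts (6/6/0 at 420 Hz and 2/4/3 at 11:00 am) are read straight off the slide's figures.

```python
# Product of the two per-feature class proportions, as in the slide.
counts_420hz = {"Culex": 6, "Anopheles": 6, "Aedes": 0}
counts_11am = {"Culex": 2, "Anopheles": 4, "Aedes": 3}

for species in counts_420hz:
    p1 = counts_420hz[species] / sum(counts_420hz.values())
    p2 = counts_11am[species] / sum(counts_11am.values())
    print(species, round(p1 * p2, 3))   # Culex 0.111, Anopheles 0.222, Aedes 0.0
```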
Which of the "Pigeon Problems" can be solved by a decision tree?

(Figure: three two-dimensional "pigeon problem" datasets, plotted on 10×10 and 100×100 grids.)
Advantages/Disadvantages of Naïve Bayes
• Advantages:
– Fast to train (single scan). Fast to classify
– Not sensitive to irrelevant features
– Handles real and discrete data
– Handles streaming data well
• Disadvantages:
– Assumes independence of features
A Tutorial on
Inference and Learning
in Bayesian Networks

Irina Rish

IBM T.J.Watson Research Center


rish@us.ibm.com
http://www.research.ibm.com/people/r/rish/
Outline
 Motivation: learning probabilistic models from data
 Representation: Bayesian network models
 Probabilistic inference in Bayesian Networks
 Exact inference
 Approximate inference
 Learning Bayesian Networks
 Learning parameters
 Learning graph structure (model selection)
 Summary
Bayesian Networks
Structured, graphical representation of probabilistic
relationships between several random variables
Explicit representation of conditional independencies
Missing arcs encode conditional independence
Efficient representation of joint PDF P(X)
Generative model (not just discriminative): allows
arbitrary queries to be answered, e.g.

P (lung cancer=yes | smoking=no, positive X-ray=yes ) = ?


Bayesian Network: BN = (G, Θ)

G – directed acyclic graph (DAG): nodes are random variables, edges are direct dependencies.
Θ – set of parameters in all conditional probability distributions (CPDs).

Example network: Smoking (S) with children lung Cancer (C) and Bronchitis (B); X-ray (X) with parents C and S; Dyspnoea (D) with parents C and B. The CPDs are P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B).

CPD of node X: P(X | parents(X)). For example, the CPD P(D|C,B):

  C  B   D=0   D=1
  0  0   0.1   0.9
  0  1   0.7   0.3
  1  0   0.8   0.2
  1  1   0.9   0.1

Compact representation of the joint distribution in product form (chain rule):

  P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)

  1 + 2 + 2 + 4 + 4 = 13 parameters instead of 2⁵ = 32
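
A hedged sketch of evaluating the factored joint with plain Python dictionaries; only P(D|C,B) comes from the CPD table above, the other numbers are assumed for illustration.

```python
# Evaluate P(S,C,B,X,D) = P(S)P(C|S)P(B|S)P(X|C,S)P(D|C,B) for binary variables.
P_S = 0.3                                                               # assumed P(S=1)
P_C_given_S = {1: 0.1, 0: 0.01}                                         # assumed P(C=1 | S)
P_B_given_S = {1: 0.3, 0: 0.1}                                          # assumed P(B=1 | S)
P_X_given_CS = {(1, 1): 0.9, (1, 0): 0.8, (0, 1): 0.2, (0, 0): 0.05}    # assumed P(X=1 | C,S)
P_D_given_CB = {(0, 0): 0.9, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}     # P(D=1 | C,B) from the table

def bernoulli(p_one, value):
    return p_one if value == 1 else 1 - p_one

def joint(s, c, b, x, d):
    return (bernoulli(P_S, s)
            * bernoulli(P_C_given_S[s], c)
            * bernoulli(P_B_given_S[s], b)
            * bernoulli(P_X_given_CS[(c, s)], x)
            * bernoulli(P_D_given_CB[(c, b)], d))

print(joint(s=1, c=0, b=1, x=0, d=1))
```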
Example: Printer Troubleshooting
Application
Output OK
Spool Print
Process OK Spooling On
Spooled GDI Data
Data OK Input OK
Local Disk Uncorrupted
Space Adequate Driver
Correct GDI Data
Driver Output OK Correct
Driver
Settings
Correct
Printer Print
Selected Data OK
Network Net/Local
Up Printing
Correct
Correct Net PC to Printer Local Local Port
Printer Path Path OK Transport OK Path OK
Local Cable
Connected
Net Cable Paper
Connected Loaded
Printer On Printer
and Online Data OK
Printer Memory
Print Output Adequate
OK
26 variables
Instead of 2²⁶ parameters we get

  99 = 17×1 + 1×2¹ + 2×2² + 3×2³ + 3×2⁴

[Heckerman, 95]
“Moral” graph of a BN
Moralization algorithm:
P(S)
1. Connect (“marry”) parents Smoking
P(B|S)
of each node. P(C|S)
2. Drop the directionality of lung Cancer Bronchitis
the edges.

P(D|C,B)
Resulting undirected graph is X-ray Dyspnoea
called the “moral” graph of BN P(X|C,S)

Interpretation:
every pair of nodes that occur together in a CPD is connected by an edge in the moral graph.

The CPD for X and its k parents (called a "family") is represented by a clique of size (k+1) in the moral graph, and contains d^k·(d − 1) probability parameters, where d is the number of values each variable can have (domain size).
Conditional Independence in BNs:
Three types of connections
Serial connection:  Visit to Asia (A) → Tuberculosis (T) → Chest X-ray (X).
  Knowing T makes A and X independent (intermediate cause).

Diverging connection:  Smoking (S) → Lung Cancer (L) and S → Bronchitis (B).
  Knowing S makes L and B independent (common cause).

Converging connection:  Lung Cancer (L) → Dyspnoea (D) ← Bronchitis (B), with D → Running Marathon (M).
  NOT knowing D or M makes L and B independent (common effect).
d-separation
Nodes X and Y are d-separated if on any (undirected) path between X and
Y there is some variable Z such that is either
Z is in a serial or diverging connection and Z is known, or
Z is in a converging connection and neither Z nor any of Z’s descendants are
known

(Figure: the three connection types, with the blocking node Z on the path between X and Y, and descendant M of Z in the converging case.)

Nodes X and Y are called d-connected if they are not d-separated


(there exists an undirected path between X and Y not d-
separated by any node or a set of nodes)
If nodes X and Y are d-separated by Z, then X and Y are
conditionally independent given Z (see Pearl, 1988)
Independence Relations in BN
A variable (node) is conditionally independent of its
non-descendants given its parents

Smoking

Lung Cancer
Bronchitis

Given Bronchitis and


Lung Cancer,
Chest X-ray Dyspnoea is independent
Dyspnoea
of X-ray (but may depend
Running on Running Marathon)
Marathon
Markov Blanket
A node is conditionally independent of ALL other nodes
given its Markov blanket, i.e. its parents, children, and
“spouses’’ (parents of common children)
(Proof left as a homework problem ☺)

Age Gender

Exposure to Toxins Smoking

Diet Cancer

Serum Calcium Lung Tumor

[Breese & Koller, 97]


What are BNs useful for?
 Diagnosis: P(cause|symptom)=? cause

 Prediction: P(symptom|cause)=?
C1 C2
 Classification: max P(class|data)
class
 Decision-making (given a cost function) symptom

Medicine
Speech Bio-
informatics
recognition

Text
Classification Computer
Stock market
troubleshooting
Application Examples
APRI system developed at AT&T Bell Labs
learns & uses Bayesian networks from data to identify customers
liable to default on bill payments
NASA Vista system
predict failures in propulsion systems
considers time criticality & suggests highest utility action
dynamically decide what information to show
Application Examples
Office Assistant in MS Office 97/ MS Office 95
Extension of Answer wizard
uses naïve Bayesian networks
help based on past experience (keyboard/mouse use) and task user is doing currently
This is the “smiley face” you get in your MS Office applications
Microsoft Pregnancy and Child-Care
Available on MSN in Health section
Frequently occurring children’s symptoms are linked to expert modules that repeatedly
ask parents relevant questions
Asks next best question based on provided information
Presents articles that are deemed relevant based on information provided
IBM’s systems management applications
Machine Learning for Systems @ Watson

(Hellerstein, Jayram, Rish (2000)) (Rish, Brodie, Ma (2001))

End-user transaction recognition: Remote Procedure Calls (RPCs) R1, R2, R3, R5 are grouped into transactions (Transaction1, Transaction2, …) such as BUY?, SELL?, OPEN_DB?, SEARCH?.

Fault diagnosis using probes: software or hardware components X1, X2, X3 are probed by T1–T4; from the probe outcomes, the goal is finding the most likely diagnosis:

  (x₁*, …, xₙ*) = argmax P(x₁, …, xₙ | t₁, …, tₖ)

Issues:
 Efficiency (scalability)
 Missing data/noise: sensitivity analysis
 "Adaptive" probing: selecting the "most-informative" probes on-line
 On-line learning/model updates
 On-line diagnosis

Pattern discovery, classification, diagnosis and prediction
Probabilistic Inference Tasks
 Belief updating:

   BEL(Xi) = P(Xi = xi | evidence)

 Finding the most probable explanation (MPE):

   x* = argmax_x P(x, e)

 Finding the maximum a-posteriori (MAP) hypothesis:

   (a1*, …, ak*) = argmax_a Σ_{X\A} P(x, e),   where A ⊆ X are the hypothesis variables

 Finding the maximum-expected-utility (MEU) decision:

   (d1*, …, dk*) = argmax_d Σ_{X\D} P(x, e) U(x),   where D ⊆ X are the decision variables and U(x) is the utility function
Belief Updating Task: Example
Smoking

lung Cancer Bronchitis

X-ray Dyspnoea

P (smoking| dyspnoea=yes ) = ?
Belief updating: find P(X | evidence)

  P(s | d=1) ∝ P(s, d=1) = Σ_{b,x,c} P(s) P(c|s) P(b|s) P(x|c,s) P(d=1|c,b)
             = P(s) Σ_b P(b|s) Σ_x Σ_c P(c|s) P(x|c,s) P(d=1|c,b)

(Variable elimination on the "moral" graph; the largest function created here is f(s, d, b, x), so W* = 4.)

Complexity: O(n·exp(w*)), where w* is the "induced width" (max induced clique size).
Efficient inference relies on variable orderings, conditioning, and approximations.
Variable elimination algorithms
(also called “bucket-elimination”)

Belief updating: VE-algorithm elim-bel (Dechter 1996)

Elimination operator: Σ_b Π (sum out the bucket's variable over the product of its functions).

  bucket B:  P(b|a), P(d|b,a), P(e|b,c)   →  h^B(a, d, c, e)
  bucket C:  P(c|a), h^B(a, d, c, e)      →  h^C(a, d, e)
  bucket D:  h^C(a, d, e)                 →  h^D(a, e)
  bucket E:  e=0, h^D(a, e)               →  h^E(a)
  bucket A:  P(a), h^E(a)                 →  P(a | e=0)

W* = 4, the "induced width" (max clique size).
Finding MPE = max_x P(x)

VE-algorithm elim-mpe (Dechter 1996): Σ is replaced by max:

  MPE = max_{a,e,d,c,b} P(a) P(c|a) P(b|a) P(d|a,b) P(e|b,c)

Elimination operator: max_b Π.

  bucket B:  P(b|a), P(d|b,a), P(e|b,c)   →  h^B(a, d, c, e)
  bucket C:  P(c|a), h^B(a, d, c, e)      →  h^C(a, d, e)
  bucket D:  h^C(a, d, e)                 →  h^D(a, e)
  bucket E:  e=0, h^D(a, e)               →  h^E(a)
  bucket A:  P(a), h^E(a)                 →  MPE probability

W* = 4, the "induced width" (max clique size).
Generating the MPE-solution (assign variables back, starting from the last bucket processed):

  1. a' = argmax_a P(a)·h^E(a)                          A: P(a), h^E(a)
  2. e' = 0                                             E: e=0, h^D(a, e)
  3. d' = argmax_d h^C(a', d, e')                       D: h^C(a, d, e)
  4. c' = argmax_c P(c|a')·h^B(a', d', c, e')           C: P(c|a), h^B(a, d, c, e)
  5. b' = argmax_b P(b|a')·P(d'|b, a')·P(e'|b, c')      B: P(b|a), P(d|b,a), P(e|b,c)

  Return (a', b', c', d', e')

Complexity of VE-inference: O(n·exp(w*_o)), where w*_o is the induced width of the moral graph along the ordering o; w*_o + 1 is the size of the largest clique created during inference.

The width wo (X) of a variable X in graph G along the ordering o is the number
of nodes preceding X in the ordering and connected to X (earlier neighbors ).
The width wo of a graph is the maximum width wo (X) among all nodes.
The induced graph G' along the ordering o is obtained by recursively connecting
earlier neighbors of each node, from last to first in the ordering.
The width of the induced graph G' is called the induced width of the graph G
(denoted wo* ).
(Figure: the "moral" graph on A, B, C, D, E and its induced graphs under two orderings, giving w*_{o1} = 4 and w*_{o2} = 2.)

Ordering is important! But finding the min-w* ordering is NP-hard…
Inference is also NP-hard in the general case [Cooper].
Learning Bayesian Networks

 Combining domain expert <9.7 0.6 8 14 18>


<0.2 1.3 5 ?? ??>
knowledge with data <1.3 2.8 ?? 0 1 >
<?? 5.6 0 10 ??>
……………….

 Efficient representation and


inference

 Incremental learning: P(H) or

 Handling missing data: <1.3 2.8 ?? 0 1 >

 Learning causal relationships: S C


Learning tasks: four main cases
 Known graph – learn parameters
   Complete data: parameter estimation (ML, MAP)
   Incomplete data: non-linear parametric optimization (gradient descent, EM)

 Unknown graph – learn graph and parameters
   Complete data: optimization (search in the space of graphs)
   Incomplete data: structural EM, mixture models

   Ĝ = argmax_G Score(G)
Learning Parameters: complete data
(overview)

 ML-estimate: max_Θ log P(D | Θ) – decomposable!

   ML(θ_{x,pa_X}) = N_{x,pa_X} / Σ_x N_{x,pa_X}        (multinomial counts N_{x,pa_X} of value x with parent configuration pa_X)

 MAP-estimate (Bayesian statistics): max_Θ log P(D | Θ) P(Θ)

   Conjugate priors – Dirichlet:  Dir(θ_{pa_X} | α_{1,pa_X}, …, α_{m,pa_X})

   MAP(θ_{x,pa_X}) = (N_{x,pa_X} + α_{x,pa_X}) / (Σ_x N_{x,pa_X} + Σ_x α_{x,pa_X})

   The α's act as an equivalent sample size (prior knowledge).
Learning Parameters
(details)
Learning Parameters: incomplete data

Non-decomposable marginal likelihood (hidden nodes). The data may contain missing entries, e.g.
  <? 0 1 0 1>, <1 1 ? 0 1>, <0 0 0 ? ?>, <? ? 0 ? 1>, …

EM-algorithm (iterate until convergence), starting from initial parameters and the current model (G, Θ):

  Expectation: compute EXPECTED counts via inference in the BN,
    E_{P(X)}[N_{x,pa_X}] = Σ_{k=1}^{N} P(x, pa_X | y^k, Θ, G)

  Maximization: update the parameters (ML, MAP) using the expected counts.
Learning graph structure

Find Ĝ = argmax_G Score(G): an NP-hard optimization.

 Heuristic search:
   Complete data – local computations
   Incomplete data (score non-decomposable): stochastic methods
   Local greedy search; K2 algorithm (local moves such as add S→B, delete S→B, reverse S→B)

 Constraint-based methods (PC/IC algorithms):
   Data impose independence relations (constraints) on the graph structure
Scoring function:
Minimum Description Length (MDL)

 Learning ⇔ data compression

<9.7 0.6 8 14 18>


<0.2 1.3 5 ?? ??>
<1.3 2.8 ?? 0 1 >
<?? 5.6 0 10 ??>
……………….

  MDL(BN | D) = − log P(D | Θ, G) + (log N / 2)·|Θ|
                   DL(Data | model)     DL(Model)

 Other: MDL = -BIC (Bayesian Information Criterion)


 Bayesian score (BDe) - asymptotically equivalent to MDL
Model selection trade-offs
Naïve Bayes – too simple Unrestricted BN – too complex
(less parameters, but bad model) (possible overfitting + complexity)

Class Class

P(f1 | class) P(f 2 | class) P(fn | class) P(f1 | class) P(f 2 | class) P(fn | class)

feature f1 feature f 2 feature f n feature f1 feature f 2 feature f n

Various approximations between the two extremes

TAN:
tree-augmented Naïve Bayes Class
[Friedman et al. 1997]
P(f1 | class) P(f 2 | class) P(fn | class)
Based on Chow-Liu Tree Method
feature f1 feature f 2 feature f n
(CL) for learning trees
[Chow-Liu, 1968]
Tree-structured distributions
A joint probability distribution is tree-structured if it can be written as

  P(x) = Π_{i=1}^{n} P(x_i | x_{j(i)})

where x_{j(i)} is the parent of x_i in the Bayesian network for P(x) (a directed tree).

(Figure: a directed tree with root A, where P(A,B,C,D,E) = P(A)P(B|A)P(C|A)P(D|C)P(E|B), versus a graph that is not a tree because it contains an (undirected) cycle.)

A tree requires only [(d-1) + d(d-1)(n-1)] parameters, where d is domain size


Moreover, inference in trees is O(n) (linear) since their w*=1
Approximations by trees
True distribution P(X) Tree-approximation P’(X)

A A
C C
B B

D E E
D

How good is approximation? Use cross-entropy (KL-divergence):

  D(P, P') = Σ_x P(x) log [ P(x) / P'(x) ]
D(P,P’) is non-negative, and D(P,P’)=0 if and only if P coincides with P’ (on a set of measure 1)

How to find the best tree-approximation?


Optimal trees: Chow-Liu result
 Lemma
Given a joint PDF P(x) and a fixed tree structure T, the best
approximation P’(x) (i.e., P’(x) that minimizes D(P,P’) ) satisfies

P ' ( xi | x j ( i ) ) = P ( xi | x j (i ) ) for all i = 1,...,n


Such P’(x) is called the projection of P(x) on T.

 Theorem [Chow and Liu, 1968]


Given a joint PDF P(x), the KL-divergence D(P,P’) is minimized by
projecting P(x) on a maximum-weight spanning tree (MSWT) over
nodes in X, where the weight on the edge (X_i, X_j) is defined by the mutual information measure

  I(X_i; X_j) = Σ_{x_i, x_j} P(x_i, x_j) log [ P(x_i, x_j) / (P(x_i) P(x_j)) ]

Note, that I ( X ;Y ) = 0 when X and Y are independent


and that I ( X ;Y ) = D(P( x, y), P( x)P( y))
Proofs

Proof of Lemma:

  D(P, P') = −Σ_x P(x) Σ_{i=1}^{n} log P'(x_i | x_{j(i)}) + Σ_x P(x) log P(x)
           = −Σ_{i=1}^{n} Σ_x P(x) log P'(x_i | x_{j(i)}) − H(X)
           = −Σ_{i=1}^{n} Σ_{x_i, x_{j(i)}} P(x_i, x_{j(i)}) log P'(x_i | x_{j(i)}) − H(X)                         (1)
           = −Σ_{i=1}^{n} Σ_{x_{j(i)}} P(x_{j(i)}) Σ_{x_i} P(x_i | x_{j(i)}) log P'(x_i | x_{j(i)}) − H(X)          (2)

A known fact: given P(x), the maximum of Σ_x P(x) log P'(x) is achieved by the choice P'(x) = P(x).
Therefore, for any value of i and x_{j(i)}, the term Σ_{x_i} P(x_i | x_{j(i)}) log P'(x_i | x_{j(i)}) is maximized by
choosing P'(x_i | x_{j(i)}) = P(x_i | x_{j(i)}) (and thus the total D(P, P') is minimized), which proves the Lemma.

Proof of Theorem:

Replacing P'(x_i | x_{j(i)}) = P(x_i | x_{j(i)}) in expression (1) yields

  D(P, P') = −Σ_{i=1}^{n} Σ_{x_i, x_{j(i)}} P(x_i, x_{j(i)}) log [ P(x_i, x_{j(i)}) / P(x_{j(i)}) ] − H(X)
           = −Σ_{i=1}^{n} Σ_{x_i, x_{j(i)}} P(x_i, x_{j(i)}) [ log ( P(x_i, x_{j(i)}) / (P(x_{j(i)}) P(x_i)) ) + log P(x_i) ] − H(X)
           = −Σ_{i=1}^{n} I(X_i, X_{j(i)}) − Σ_{i=1}^{n} Σ_{x_i} P(x_i) log P(x_i) − H(X).

The last two terms are independent of the choice of the tree, and thus D(P, P') is minimized by
maximizing the sum of edge weights I(X_i, X_{j(i)}).
Chow-Liu algorithm
[As presented in Pearl, 1988]

1. From the given distribution P(x) (or from data generated by P(x)),
compute the pairwise joint distributions P(x_i, x_j) for all i ≠ j.
2. Using the pairwise distributions from step 1, compute the mutual
information I(X_i; X_j) for each pair of nodes and assign it as the
weight to the corresponding edge (X_i, X_j).
3. Compute the maximum-weight spanning tree (MWST):
a. Start from the empty tree over n variables
b. Insert the two largest-weight edges
c. Find the next largest-weight edge and add it to the tree if no cycle is
formed; otherwise, discard the edge and repeat this step.
d. Repeat step (c) until n-1 edges have been selected (a tree is
constructed).
4. Select an arbitrary root node, and direct the edges outwards from
the root.
5. Tree approximation P’(x) can be computed as a projection of P(x) on
the resulting directed tree (using the product-form of P’(x)).
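A minimal Python sketch of the procedure above, under the assumption that the data are complete discrete records stored as a list of dicts; the function names and the toy data are illustrative, not part of the original slides.

    # Chow-Liu sketch: pairwise mutual information from counts, Kruskal's algorithm
    # for the maximum-weight spanning tree, then direct edges away from a root.
    import math
    from collections import Counter, defaultdict

    def mutual_information(data, xi, xj):
        N = len(data)
        pij = Counter((row[xi], row[xj]) for row in data)
        pi = Counter(row[xi] for row in data)
        pj = Counter(row[xj] for row in data)
        return sum((c / N) * math.log((c / N) / ((pi[a] / N) * (pj[b] / N)))
                   for (a, b), c in pij.items())

    def chow_liu_tree(data, variables, root):
        # all candidate edges, sorted by decreasing mutual information
        edges = sorted(((mutual_information(data, u, v), u, v)
                        for i, u in enumerate(variables) for v in variables[i + 1:]),
                       reverse=True)
        parent_of = {v: v for v in variables}          # union-find for cycle detection
        def find(v):
            while parent_of[v] != v:
                parent_of[v] = parent_of[parent_of[v]]
                v = parent_of[v]
            return v
        tree = defaultdict(list)                       # undirected adjacency list
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:                               # adding this edge creates no cycle
                parent_of[ru] = rv
                tree[u].append(v)
                tree[v].append(u)
        # direct the edges outward from the chosen root (BFS)
        directed, frontier, seen = {}, [root], {root}
        while frontier:
            u = frontier.pop()
            for v in tree[u]:
                if v not in seen:
                    directed[v] = u                    # v's parent is u
                    seen.add(v)
                    frontier.append(v)
        return directed                                # child -> parent map

    # toy usage
    data = [{"A": 0, "B": 0, "C": 1}, {"A": 1, "B": 1, "C": 0},
            {"A": 1, "B": 1, "C": 1}, {"A": 0, "B": 0, "C": 0}]
    print(chow_liu_tree(data, ["A", "B", "C"], root="A"))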
Summary:
Learning and inference in BNs
 Bayesian Networks – graphical probabilistic models
 Efficient representation and inference
 Expert knowledge + learning from data
 Learning:
 parameters (parameter estimation, EM)

 structure (optimization w/ score functions – e.g., MDL)

 Complexity trade-off:

 NB, BNs and trees


 There is much more: causality, modeling time (DBNs, HMMs),
approximate inference, on-line learning, active learning, etc.
Online/print resources on BNs
Conferences & Journals
UAI, ICML, AAAI, AISTAT, KDD
MLJ, DM&KD, JAIR, IEEE KDD, IJAR, IEEE PAMI
Books and Papers
Bayesian Networks without Tears by Eugene Charniak. AI
Magazine: Winter 1991.
Probabilistic Reasoning in Intelligent Systems by Judea Pearl.
Morgan Kaufmann: 1988.
Probabilistic Reasoning in Expert Systems by Richard
Neapolitan. Wiley: 1990.
CACM special issue on Real-world applications of BNs, March
1995
Online/Print Resources on BNs

AUAI online: www.auai.org. Links to:


Electronic proceedings for UAI conferences
Other sites with information on BNs and reasoning under
uncertainty
Several tutorials and important articles
Research groups & companies working in this area
Other societies, mailing lists and conferences
Publicly available s/w for BNs

List of BN software maintained by Russell Almond at


bayes.stat.washington.edu/almond/belief.html
several free packages: generally research only
commercial packages: most powerful (& expensive) is
HUGIN; others include Netica and Dxpress
we are working on developing a Java based BN toolkit here at
Watson
Lecture 13: Bayesian networks I

CS221 / Spring 2019 / Charikar & Sadigh


Pac-Man competition

1. (1783) Adam Klein


2. (1769) Jonathan Hollenbeck
3. (1763) Michael Du

CS221 / Spring 2019 / Charikar & Sadigh 1


cs221.stanford.edu/q Question
Earthquakes and burglaries are independent events that will cause an
alarm to go off. Suppose you hear an alarm. How does hearing on the
radio that there’s an earthquake change your beliefs?

it increases the probability of burglary

it decreases the probability of burglary

it does not change the probability of burglary

CS221 / Spring 2019 / Charikar & Sadigh 2


• Situations like these arise all the time in practice: we have a lot of unknowns which are all dependent on
one another. If we obtain evidence on some of these unknowns, how does that affect our belief about the
other unknowns?
• In this lecture, we’ll see how we can perform this type of reasoning under uncertainty in a principled way
using Bayesian networks.
Review: definition

X1 X2 X3

f1 f2 f3 f4

Definition: factor graph

Variables:
X = (X1 , . . . , Xn ), where Xi ∈ Domaini
Factors:
f1 , . . . , fm , with each fj (X) ≥ 0
Weight(x) = ∏_{j=1}^{m} f_j(x)

CS221 / Spring 2019 / Charikar & Sadigh 4


• Last week, we talked about factor graphs, which use local factors to specify a weight Weight(x) for each
assignment x in a compact way. The stated objective was to find the maximum weight assignment.
• Given any factor graph, we saw a number of algorithms (backtracking search, beam search, Gibbs sampling,
variable elimination) for (approximately) optimizing this objective.
Review: person tracking

Problem: person tracking

Sensors report positions: 0, 2, 2. Objects don’t move very fast and


sensors are a bit noisy. What path did the person take?
t1 t2
X1 X2 X3
o1 o2 o3

• Variables Xi : location of object at time i


• Transition factors ti (xi , xi+1 ): incorporate physics
• Observation factors oi (xi ): incorporate sensors
[demo: maxVariableElimination()]
What do the factors mean?

CS221 / Spring 2019 / Charikar & Sadigh 6


• As an example, recall the object tracking example. We defined observation factors to capture the fact that
the true object position is close to the sensor reading, and the transition factors to capture the fact that
the true object positions across time are close to each other.
• We just set them rather arbitrarily. Is there a more principled way to think about these factors beyond
being non-negative functions?
Course plan

Search problems
Constraint satisfaction problems
Markov decision processes
Adversarial games Bayesian networks

Reflex States Variables Logic


”Low-level intelligence” ”High-level intelligence”

Machine learning

CS221 / Spring 2019 / Charikar & Sadigh 8


• Much of this class has been on developing modeling frameworks. We started with state-based models,
where we cast real-world problems as finding paths or policies through a state graph.
• Then, we saw that for a large class of problems (such as scheduling), it was much more convenient to use
the language of factor graphs.
• While factor graphs could be reduced to state-based models by fixing the variable ordering, we saw that
they also led to notions of treewidth and variable elimination, which allowed us to understand our models
much better.
• In this lecture, we will introduce another modeling framework, Bayesian networks, which are factor graphs
imbued with the language of probability. This will give probabilistic life to the factors of factor graphs.
Roadmap

Basics

Probabilistic programs

Inference

CS221 / Spring 2019 / Charikar & Sadigh 10


• Bayesian networks were popularized in AI by Judea Pearl in the 1980s, who showed that having a coherent
probabilistic framework is important for reasoning under uncertainty.
• There is a lot to say about Bayesian networks (CS228 is an entire course about them and their cousins,
Markov networks), so we will devote most of this lecture to modeling.
Review: probability (example)
Random variables: sunshine S ∈ {0, 1}, rain R ∈ {0, 1}

Joint distribution P(S, R):
s r P(S = s, R = r)
0 0 0.20
0 1 0.08
1 0 0.70
1 1 0.02

Marginal distribution P(S) (aggregate rows):
s P(S = s)
0 0.28
1 0.72

Conditional distribution P(S | R = 1) (select rows, normalize):
s P(S = s | R = 1)
0 0.8
1 0.2

CS221 / Spring 2019 / Charikar & Sadigh 12


• Before introducing Bayesian networks, let’s review probability (at least the relevant parts). We start with
an example about the weather. Suppose we have two boolean random variables, S and R representing
sunshine and rain. Think of an assignment to (S, R) as representing a possible state of the world.
• The joint distribution specifies a probability for each assignment to (S, R) (state of the the world). We
use lowercase letters (e.g., s and r) to denote values and uppercase letters (e.g., S and R) to denote
random variables. Note that P(S = s, R = r) is a probability (a number) while P(S, R) is a distribution (a
table of probabilities). We don’t know what state of the world we’re in, but we know what the probabilities
are (there are no unknown unknowns). The joint distribution contains all the information and acts as the
central source of truth.
• From it, we can derive a marginal distribution over a subset of the variables. We get this by aggregating
the rows that share the same value of S. The interpretation is that we are interested in S. We don’t
explicitly care about R, but we want to take into account R’s effect on S. We say that R is marginalized
out. This is a special form of elimination. In the last lecture, we leveraged max-elimination, where we
took the max over the eliminated variables; here, we are taking a sum.
• The conditional distribution selects rows of the table matching the condition (right of the bar), and then
normalizes the probabilities so that they sum to 1. The interpretation is that we observe the condition
(R = 1) and are interested in S. This is the conditioning that we saw for factor graphs, but where we
normalize the selected rows to get probabilities.
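• A minimal Python sketch (assumed, not part of the lecture) that carries out the same two operations on the table above: marginalization by aggregating rows, and conditioning by selecting rows and normalizing.

    # joint distribution P(S, R) as a dictionary keyed by (s, r)
    joint = {(0, 0): 0.20, (0, 1): 0.08, (1, 0): 0.70, (1, 1): 0.02}

    # marginal P(S): aggregate over r
    P_S = {}
    for (s, r), p in joint.items():
        P_S[s] = P_S.get(s, 0.0) + p
    print(P_S)                       # {0: 0.28, 1: 0.72}

    # conditional P(S | R = 1): select rows with r = 1, then normalize
    selected = {s: p for (s, r), p in joint.items() if r == 1}
    Z = sum(selected.values())
    P_S_given_R1 = {s: p / Z for s, p in selected.items()}
    print(P_S_given_R1)              # approximately {0: 0.8, 1: 0.2}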
Review: probability (general)
Random variables:
X = (X1 , . . . , Xn ) partitioned into (A, B)
Joint distribution:
P(X) = P(X1 , . . . , Xn )
Marginal distribution:
P(A) = ∑_b P(A, B = b)
Conditional distribution:
P(A | B = b) ∝ P(A, B = b)

CS221 / Spring 2019 / Charikar & Sadigh 14


• In general, we have n random variables X1 , . . . , Xn and let X denote all of them. Suppose X is partitioned
into A and B (e.g., A = (X1 , X3 ) and B = (X2 , X4 , X5 ) if n = 5).
• The marginal and conditional distributions can be defined over the subsets A and B rather than just single
variables.
• Of course, we can also have a hybrid too: for n = 3, P(X1 | X3 = 1) marginalizes out X2 and conditions
on X3 = 1.
• It is important to remember the types of objects here: P(A) is a table where rows are possible assignments
to A, whereas P(A = a) is a number representing the probability of the row corresponding to assignment
a.
Probabilistic inference task
Random variables: unknown quantities in the world
X = (S, R, T, A)
In words:
• Observe evidence (traffic in autumn): T = 1, A = 1
• Interested in query (rain?): R
In symbols:

P( R | T = 1, A = 1 )
   query    condition
(S is marginalized out)

CS221 / Spring 2019 / Charikar & Sadigh 16


• At this point, you should have all the definitions to compute any marginal or conditional distribution given
access to a joint probability distribution. But what is this really doing and how is this useful?
• We should think about each assignment x as a possible state of the world (it’s raining, it’s not sunny,
there is traffic, it is autumn, etc.). Think of the joint distribution as one giant database that contains full
information about how the world works.
• In practice, we’d like to ask questions by querying this probabilistic database. First, we observe some
evidence, which effectively fixes some of the variables. Second, we are interested in the distribution of
some set of variables which we didn’t observe. This forms a query, and the process of answering this query
(computing the desired distribution) is called probabilistic inference.
Challenges
Modeling: How to specify a joint distribution P(X1 , . . . , Xn ) com-
pactly?

Bayesian networks (factor graphs for probability distributions)

Inference: How to compute queries P(R | T = 1, A = 1) efficiently?

Variable elimination, Gibbs sampling, particle filtering (analogue of


algorithms for finding maximum weight assignment)

CS221 / Spring 2019 / Charikar & Sadigh 18


• In general, a joint distribution over n variables has size exponential in n. From a modeling perspective,
how do we even specify an object that large? Here, we will see that Bayesian networks, based on factor
graphs, offer an elegant solution.
• From an algorithms perspective, there is still the question of how we perform probabilistic inference ef-
ficiently. In the next lecture, we will see how we can adapt all of the algorithms that we saw before for
computing maximum weight assignments in factor graphs, essentially by replacing a max with a sum.
• The two desiderata are rather synergistic, and it is the same property — conditional independence — that
makes both possible.
Bayesian network (alarm)
P(B = b, E = e, A = a) = p(b) p(e) p(a | b, e)   (by definition)

(Graph: B → A ← E)

b p(b)        e p(e)        b e a p(a | b, e)
1 ε           1 ε           0 0 0 1
0 1 − ε       0 1 − ε       0 0 1 0
                            0 1 0 0
                            0 1 1 1
                            1 0 0 0
                            1 0 1 1
                            1 1 0 0
                            1 1 1 1

p(b) = ε · [b = 1] + (1 − ε) · [b = 0]
p(e) = ε · [e = 1] + (1 − ε) · [e = 0]
p(a | b, e) = [a = (b ∨ e)]
CS221 / Spring 2019 / Charikar & Sadigh 20
• Let us try to model the situation. First, we establish that there are three variables, B (burglary), E
(earthquake), and A (alarm). Next, we connect up the variables to model the dependencies.
• Unlike in factor graphs, these dependencies are represented as directed edges. You can intuitively think
about the directionality as suggesting causality, though what this actually means is a deeper question and
beyond the scope of this class.
• For each variable, we specify a local conditional distribution (a factor) of that variable given its parent
variables. In this example, B and E have no parents while A has two parents, B and E. This local
conditional distribution is what governs how a variable is generated.
• We are writing the local conditional distributions using p, while P is reserved for the joint distribution over
all random variables, which is defined as the product.
Bayesian network (alarm)
(Left: Bayesian network B → A ← E. Right: the same network as a factor graph with factors p(b), p(e), and p(a | b, e).)

P(B = b, E = e, A = a) = p(b) p(e) p(a | b, e)

Bayesian networks are a special case of factor graphs!

CS221 / Spring 2019 / Charikar & Sadigh 22


• Note that the local conditional distributions (e.g., p(a | b, e)) are non-negative so they can be thought
of simply as factors of a factor graph. The joint probability of an assignment is then the weight of that
assignment.
• In this light, Bayesian networks are just a type of factor graphs, but with additional structure and inter-
pretation.
Probabilistic inference (alarm)
Joint distribution:
b e a P(B = b, E = e, A = a)
0 0 0 (1 − ε)²
0 0 1 0
0 1 0 0
0 1 1 ε(1 − ε)
1 0 0 0
1 0 1 ε(1 − ε)
1 1 0 0
1 1 1 ε²

Queries: P(B)? P(B | A = 1)? P(B | A = 1, E = 1)?

[demo: ε = 0.05]

CS221 / Spring 2019 / Charikar & Sadigh 24


• Bayesian networks can be used to capture common reasoning patterns under uncertainty (which was one
of their first applications).
• Consider the following model: Suppose the probability of an earthquake is ε and the probability of a
burglary is ε, and both are independent. Suppose that the alarm always goes off if either an earthquake or
a burglary occurs.
• In the prior, we can eliminate A and E and get that the probability of the burglary is ε.
• Now suppose we hear the alarm A = 1. The probability of burglary is now P(B = 1 | A = 1) = 1 / (2 − ε).
• Now suppose that you hear on the radio that there was an earthquake (E = 1). Then the probability of
burglary goes down to P(B = 1 | A = 1, E = 1) = ε again.
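• A minimal Python sketch (assumed, not part of the lecture) that reproduces these numbers by brute-force enumeration of the joint distribution; the helper names are illustrative.

    eps = 0.05

    def p_b(b): return eps if b == 1 else 1 - eps
    def p_e(e): return eps if e == 1 else 1 - eps
    def p_a(a, b, e): return 1.0 if a == (b | e) else 0.0     # alarm = OR of the causes

    def query_B(condition):
        """Return P(B = 1 | condition), where condition maps a subset of {'E','A'} to values."""
        weights = {0: 0.0, 1: 0.0}
        for b in (0, 1):
            for e in (0, 1):
                for a in (0, 1):
                    if all({'E': e, 'A': a}[k] == v for k, v in condition.items()):
                        weights[b] += p_b(b) * p_e(e) * p_a(a, b, e)
        return weights[1] / (weights[0] + weights[1])

    print(query_B({}))                      # prior: eps = 0.05
    print(query_B({'A': 1}))                # 1 / (2 - eps), about 0.51: alarm raises belief
    print(query_B({'A': 1, 'E': 1}))        # back down to eps = 0.05: explaining away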
Explaining away

B E

Key idea: explaining away

Suppose two causes positively influence an effect. Conditioned on


the effect, conditioning on one cause reduces the probability of the
other cause.

CS221 / Spring 2019 / Charikar & Sadigh 26


• This last phenomenon has a special name: explaining away. Suppose we have two cause variables B
and E, which are parents of an effect variable A. Assume the causes influence the effect positively (e.g.,
through the OR function).
• Conditioned on the effect A = 1, there is some posterior probability of B. Conditioned on the effect A = 1
and the other cause E = 1, the new posterior probability is reduced. We then say that the other cause E
has explained away B.
Definition

Definition: Bayesian network

Let X = (X1 , . . . , Xn ) be random variables.


A Bayesian network is a directed acyclic graph (DAG) that spec-
ifies a joint distribution over X as a product of local conditional
distributions, one for each node:
P(X1 = x1, . . . , Xn = xn) = ∏_{i=1}^{n} p(xi | xParents(i))   (by definition)

CS221 / Spring 2019 / Charikar & Sadigh 28


• Without further ado, let’s define a Bayesian network formally. A Bayesian network defines a large joint
distribution in a modular way, one variable at a time.
• First, the graph structure captures what other variables a given variable depends on.
• Second, we specify a local conditional distribution for variable Xi , which is a function that specifies a
distribution over Xi given an assignment xParents(i) to its parents in the graph (possibly none). The joint
distribution is simply defined to be the product of all of the local conditional distributions together.
• Notationally, we use lowercase p (in p(xi | xParents(i) )) to denote a local conditional distribution, and
uppercase P to denote the induced joint distribution over all variables. While the two can coincide, it is
important to keep these things separate in your head!
Special properties
Key difference from general factor graphs:

Key idea: locally normalized

All factors (local conditional distributions) satisfy:
∑_{x_i} p(x_i | x_{Parents(i)}) = 1  for each x_{Parents(i)}

Implications:
• Consistency of sub-Bayesian networks
• Consistency of conditional distributions

CS221 / Spring 2019 / Charikar & Sadigh 30


• But Bayesian networks are more than that. The key property is that all the local conditional distributions,
being distributions, sum to 1 over the first argument.
• This simple property results in two important properties of Bayesian networks that are not present in
general factor graphs.
Consistency of sub-Bayesian networks

B E

A short calculation:
P(B = b, E = e) = ∑_a P(B = b, E = e, A = a)
                = ∑_a p(b) p(e) p(a | b, e)
                = p(b) p(e) ∑_a p(a | b, e)
                = p(b) p(e)

CS221 / Spring 2019 / Charikar & Sadigh 32


• First, let’s see what happens when we marginalize A (by performing algebra on the joint probability). We
see that we end up with p(b)p(e), which actually defines a sub-Bayesian network with one fewer variable,
and the same local conditional probabilities.
• If one marginalizes out all the variables, then one gets 1, which verifies that a Bayesian network actually
defines a probability distribution.
• The philosophical ramification of this property is that there could be many other variables that depend
on the variables you’ve modeled (earthquakes also impacts traffic) but as long as you don’t observe them,
they can be ignored mathematically (ignorance is bliss). Note that this doesn’t mean that knowing about
the other things isn’t useful.
Consistency of sub-Bayesian networks

Key idea: marginalization

Marginalization of a leaf node yields a Bayesian network without


the node.

(Diagram: marginalizing out the leaf node A turns the network B → A ← E, with factors p(b), p(e), p(a | b, e), into the network over B and E alone, with factors p(b) and p(e).)
CS221 / Spring 2019 / Charikar & Sadigh 34
• This property is very attractive, because it means that whenever we have a large Bayesian network, where
we don’t care about some of the variables, we can just remove them (graph operations), and this encodes
the same distribution as we would have gotten from marginalizing out variables (algebraic operations).
The former, being visual, can be more intuitive.
Consistency of local conditionals

Key idea: local conditional distributions

Local conditional distributions (factors) are the true conditional


distributions.

A B C

D E

F G H

P(D = d | A = a, B = b) = p(d | a, b)
(left side: from probabilistic inference; right side: by definition)

CS221 / Spring 2019 / Charikar & Sadigh 36


• Note that the local conditional distributions p(d | a, b) are simply defined by the user. On the other hand,
the quantity P(D = d | A = a, B = b) is not defined, but follows from probabilistic inference on the joint
distribution defined by the Bayesian network.
• It’s not clear a priori that the two have anything to do with each other. The second special property that
we get from using Bayesian networks is that the two are actually the same.
• To show this, we can remove all the descendants of D by the consistency of sub-Bayesian networks,
leaving us with the Bayesian network P(A = a, B = b, D = d) = p(a)p(b)p(d | a, b). By the chain rule,
P(A = a, B = b, D = d) = P(A = a, B = b)P(D = d | A = a, B = b). If we marginalize out D, then
we are left with the Bayesian network P(A = a, B = b) = p(a)p(b). From this, we can conclude that
P(D = d | A = a, B = b) = p(d | a, b).
• This argument generalizes to any Bayesian network and local conditional distribution.
Medical diagnosis

Problem: cold or allergies?

You are coughing and have itchy eyes. Do you have a cold or
allergies?

[demo]
Variables: Cold, Allergies, Cough, Itchy eyes
Bayesian network: Factor graph:

C A C A

H I H I

CS221 / Spring 2019 / Charikar & Sadigh 38


• Here is another example (a cartoon version of Bayesian networks for medical diagnosis). Allergies and cold
are the two hidden variables that we’d like to infer (we have some prior over these two). Cough and itchy
eyes are symptoms that we observe as evidence, and we have some likelihood model of these symptoms
given the hidden causes.
• We can use the demo to infer the hidden state given the evidence.
Summary so far

B E

• Set of random variables capture state of world

• Local conditional distributions ⇒ joint distribution

• Probabilistic inference task: ask questions

• Captures reasoning patterns (e.g., explaining away)

• Factor graph interpretation (for inference later)

CS221 / Spring 2019 / Charikar & Sadigh 40


Roadmap

Basics

Probabilistic programs

Inference

CS221 / Spring 2019 / Charikar & Sadigh 41


Probabilistic programs
Goal: make it easier to write down complex Bayesian networks

Key idea: probabilistic program

Write a program to generate an assignment (rather than specifying


the probability of an assignment).

CS221 / Spring 2019 / Charikar & Sadigh 42


Probabilistic programs
B E

Probabilistic program: alarm

B ∼ Bernoulli(ε)
E ∼ Bernoulli(ε)
A = B ∨ E

Key idea: probabilistic program

A randomized program that sets the random variables.

import random

def Bernoulli(epsilon):
    return random.random() < epsilon

CS221 / Spring 2019 / Charikar & Sadigh 43


• There is another way of writing down Bayesian networks other than graphically or mathematically, and that
is as a probabilistic program. A probabilistic program is a randomized program that invokes a random
number generator to make random choices. Executing this program will assign values to a collection of
random variables X1 , . . . , Xn ; that is, generating an assignment.
• The probability (e.g., fraction of times) that the program generates that assignment is exactly the proba-
bility under the joint distribution specified by that program.
• We should think of this program as outputting the state of the world (or at least the part of the world
that we care about for our task).
• Note that the probabilistic program is only used to define joint distributions. We usually wouldn’t actually
run this program directly.
• For example, we show the probabilistic program for alarm. B ∼ Bernoulli(ε) simply means that P(B =
1) = ε. Here, we can think about Bernoulli(ε) as a randomized function (random() < epsilon) that
returns 1 with probability ε and 0 with probability 1 − ε.
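• A minimal sketch (assumed) of the full alarm probabilistic program, using a 0/1 variant of the Bernoulli helper above; sampling it many times gives an empirical check of the explaining-away numbers.

    import random

    def Bernoulli01(epsilon):
        return 1 if random.random() < epsilon else 0

    def alarm_program(epsilon=0.05):
        B = Bernoulli01(epsilon)
        E = Bernoulli01(epsilon)
        A = B | E                      # A = B or E
        return B, E, A

    # empirical check: fraction of samples with B = 1 among those with A = 1
    samples = [alarm_program() for _ in range(100000)]
    with_alarm = [s for s in samples if s[2] == 1]
    print(sum(s[0] for s in with_alarm) / len(with_alarm))   # close to 1 / (2 - 0.05)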
Probabilistic program: example

Probabilistic program: object tracking

X0 = (0, 0)
For each time step i = 1, . . . , n:
With probability α:
Xi = Xi−1 + (1, 0) [go right]
With probability 1 − α:
Xi = Xi−1 + (0, 1) [go down]

Bayesian network structure:

X1 X2 X3 X4 X5

CS221 / Spring 2019 / Charikar & Sadigh 45


• This is a more interesting generative model since it has a for loop, which allows us to determine the
distribution over a templatized set of n variables rather than just 3 or 4.
• In these cases, variables are generally indexed by something like time or location.
• We can also draw the Bayesian network. Each Xi only depends on Xi−1 . This is a chain-structured
Bayesian network, called a Markov model.
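• A minimal sketch (assumed) of this probabilistic program; each call returns one sampled trajectory.

    import random

    def tracking_program(n=10, alpha=0.5):
        X = [(0, 0)]                                   # X_0 = (0, 0)
        for i in range(1, n + 1):
            if random.random() < alpha:
                X.append((X[-1][0] + 1, X[-1][1]))     # go right with probability alpha
            else:
                X.append((X[-1][0], X[-1][1] + 1))     # go down with probability 1 - alpha
        return X

    print(tracking_program())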
Probabilistic program: example



Run

CS221 / Spring 2019 / Charikar & Sadigh 47


• Try clicking [Run]. Each time a new assignment of (X1 , . . . , Xn ) is chosen.
Probabilistic inference: example

Query: what are possible trajectories given evidence X10 = (8, 2)?



Run

CS221 / Spring 2019 / Charikar & Sadigh 49


• This program only serves for defining the distribution. Now we can query that distribution and ask the
question: suppose the program set X10 = (8, 2); what is the distribution over the other variables?
• In the demo, note that all trajectories are constrained to go through (8, 2) at time step 10.
Application: language modeling

Probabilistic program: Markov model

For each position i = 1, 2, . . . , n:


Generate word Xi ∼ p(Xi | Xi−1 )

Wreck a nice beach

X1 X2 X3 X4

CS221 / Spring 2019 / Charikar & Sadigh 51


• In the context of natural language, a Markov model is known as a bigram model. A higher-order general-
ization of bigram models are n-gram models (more generally known as higher-order Markov models).
• Language models are often used to measure the ”goodness” of a sentence, mostly within the context of a
larger system such as speech recognition or machine translation.
Application: object tracking

Probabilistic program: hidden Markov model (HMM)

For each time step t = 1, . . . , T :


Generate object location Ht ∼ p(Ht | Ht−1 )
Generate sensor reading Et ∼ p(Et | Ht )
(3,1) (3,2)

H1 H2 H3 H4 H5

E1 E2 E3 E4 E5
4 5

Applications: speech recognition, information extraction, gene finding

CS221 / Spring 2019 / Charikar & Sadigh 53


• Markov models are limiting because they do not have a way of talking about noisy evidence (sensor
readings). They can be extended quite easily to hidden Markov models, which introduce a parallel sequence
of observation variables.
• For example, in object tracking, Ht denotes the true object location, and Et denotes the noisy sensor
reading, which might be (i) the location Ht plus noise, or (ii) the distance from Ht plus noise, depending
on the type of sensor.
• In speech recognition, Ht would be the phonemes or words and Et would be the raw acoustic signal.
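• A minimal sketch (assumed) of an HMM as a probabilistic program on a 1-D grid; the particular transition and emission choices here (stay/step, ±1 sensor noise) are illustrative, not the lecture's exact model.

    import random

    def hmm_program(T=5):
        H, E = [], []
        h = 0                                                  # initial location
        for t in range(T):
            h = h + random.choice([0, 1])                      # H_t ~ p(H_t | H_{t-1}): stay or move right
            e = h + random.choice([-1, 0, 1])                  # E_t ~ p(E_t | H_t): noisy sensor reading
            H.append(h)
            E.append(e)
        return H, E

    print(hmm_program())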
Application: multiple object tracking

Probabilistic program: factorial HMM

For each time step t = 1, . . . , T :


For each object o ∈ {a, b}:
Generate location H_t^o ∼ p(H_t^o | H_{t−1}^o)
Generate sensor reading E_t ∼ p(E_t | H_t^a, H_t^b)

H1a H2a H3a H4a

H1b H2b H3b H4b

E1 E2 E3 E4

CS221 / Spring 2019 / Charikar & Sadigh 55


• An extension of an HMM, called a factorial HMM, can be used to track multiple objects. We assume
that each object moves independently according to a Markov model, but that we get one sensor reading
which is some noisy aggregated function of the true positions.
• For example, Et could be the set {Hta , Htb }, which reveals where the objects are, but doesn’t say which
object is responsible for which element in the set.
Application: document classification
Question: given a text document, what is it about?

Probabilistic program: naive Bayes

Generate label Y ∼ p(Y )


For each position i = 1, . . . , L:
Generate word Wi ∼ p(Wi | Y )

travel

W1 W2 ... WL
beach Paris
CS221 / Spring 2019 / Charikar & Sadigh 57
• Naive Bayes is a very simple model which can be used for classification. For document classification, we
generate a label and all the words in the document given that label.
• Note that the words are all generated independently, which is not a very realistic model of language, but
naive Bayes models are surprisingly effective for tasks such as document classification.
Application: topic modeling
Question: given a text document, what topics is it about?

Probabilistic program: latent Dirichlet allocation

Generate a distribution over topics α ∈ RK


For each position i = 1, . . . , L:
Generate a topic Zi ∼ p(Zi | α)
Generate a word Wi ∼ p(Wi | Zi )

α {travel:0.8,Europe:0.2}

travel Z1 Z2 ... ZL Europe

beach W1 W2 ... WL Euro


CS221 / Spring 2019 / Charikar & Sadigh 59
• A more sophisticated model of text is latent Dirichlet Allocation (LDA), which allows a document to not
just be about one topic (which was true in naive Bayes), but about multiple topics.
• Here, the distribution over topics α is chosen per document from a Dirichlet distribution. Note that α is a
continuous-valued random variable. For each position, we choose a topic according to that per-document
distribution and generate a word given that topic.
• Latent Dirichlet Allocation (LDA) has been very influential for modeling not only text but images, videos,
music, etc.; any sort of data with hidden structure. It is very related to matrix factorization.
Application: medical diagnostics
Question: If a patient has a cough and fever, what disease(s) does
he/she have?
1 Pneumonia 0 Cold 0 Malaria

1 Fever 1 Cough 0 Vomit

Probabilistic program: diseases and symptoms

For each disease i = 1, . . . , m:


Generate activity of disease Di ∼ p(Di )
For each symptom j = 1, . . . , n:
Generate activity of symptom Sj ∼ p(Sj | D1:m )

CS221 / Spring 2019 / Charikar & Sadigh 61


• We already saw a special case of this model. In general, we would like to diagnose many diseases and
might have measured many symptoms and vitals.
Application: social network analysis
Question: Given a social network (graph over n people), what types of
people are there?
politician H1

0 E12 E13 0

scientist H2 E23 H3 scientist

Probabilistic program: stochastic block model

For each person i = 1, . . . , n:


Generate person type Hi ∼ p(Hi )
For each pair of people i ≠ j:
Generate connectedness Eij ∼ p(Eij | Hi , Hj )

CS221 / Spring 2019 / Charikar & Sadigh 63


• One can also model graphs such as social networks. A very naive-Bayes-like model is that each node
(person) has a ”type”. Whether two people interact with each other is determined solely by their types
and random chance.
• Note: there are extensions called mixed membership models which, like LDA, allow each person to have
multiple types.
• In summary, it is quite easy to come up with probabilistic programs that tell a story of how the world works
for the domain of interest. These probabilistic programs define joint distributions over assignments to a
collection of variables. Usually, these programs describe how some collection of hidden variables H that
you’re interested in behave, and then describe the generation of the evidence E that you see conditioned
on H. After defining the model, one can do probabilistic inference to compute P(H | E = e).
Roadmap

Basics

Probabilistic programs

Inference

CS221 / Spring 2019 / Charikar & Sadigh 65


Review: probabilistic inference
Input
Bayesian network: P(X1 = x1 , . . . , Xn = xn )
Evidence: E = e where E ⊆ X is subset of variables
Query: Q ⊆ X is subset of variables

Output
P(Q = q | E = e) for all values q

Example: if coughing and itchy eyes, have a cold?


P(C | H = 1, I = 1)
CS221 / Spring 2019 / Charikar & Sadigh 66
Example: Markov model
X1 X2 X3 X4

Query: P(X3 = x3 | X2 = 5) for all x3


Tedious way:
P(X3 = x3 | X2 = 5) ∝ ∑_{x1, x4} p(x1) p(x2 = 5 | x1) p(x3 | x2 = 5) p(x4 | x3)
                    ∝ ( ∑_{x1} p(x1) p(x2 = 5 | x1) ) p(x3 | x2 = 5)
                    ∝ p(x3 | x2 = 5)

Fast way:
[whiteboard]
CS221 / Spring 2019 / Charikar & Sadigh 67
• Let’s first compute the query the old-fashioned way by grinding through the algebra. Then we’ll see a
faster, more graphical way, of doing this.
• We start by transforming the query into an expression that references the joint distribution, which allows
us to rewrite as the product of the local conditional probabilities. To do this, we invoke the definition of
marginal and conditional probability.
• One convenient shortcut we will take is make use of the proportional-to (∝) relation. Note that in the end,
we need to construct a distribution over X3 . This means that any quantity (such as P(X2 = 5)) which
doesn’t depend on X3 can be folded into the proportionality constant. If you don’t believe this, keep it
around to convince yourself that it doesn’t matter. Using ∝ can save you a lot of work.
• Next, we do some algebra to push the summations inside. We notice that ∑_{x4} p(x4 | x3) = 1 because
it’s a local conditional distribution. The factor ∑_{x1} p(x1) p(x2 = 5 | x1) can also be folded into the
proportionality constant.
• The final result is p(x3 | x2 = 5), which matches the query as we expected by the consistency of local
conditional distributions.
General strategy
Query:

P(Q | E = e)

Algorithm: general probabilistic inference strategy


• Remove (marginalize) variables that are not ancestors of Q
or E.
• Convert Bayesian network to factor graph.
• Condition on E = e (shade nodes + disconnect).
• Remove (marginalize) nodes disconnected from Q.
• Run probabilistic inference algorithm (manual, variable elim-
ination, Gibbs sampling, particle filtering).

CS221 / Spring 2019 / Charikar & Sadigh 69


• Our goal is to compute the conditional distribution over the query variables Q ⊆ H given evidence E = e.
We can do this with our bare hands by chugging through all the algebra starting with the definition of
marginal and conditional probability, but there is an easier way to do this that exploits the structure of the
Bayesian network.
• Step 1: remove variables which are not ancestors of Q or E. Intuitively, these don’t have an influence on Q
and E, so they can be removed. Mathematically, this is due to the consistency of sub-Bayesian networks.
• Step 2: turn this Bayesian network into a factor graph by simply introducing one factor per node which
is connected to that node and its parents. It’s important to include all the parents and the child into one
factor, not separate factors. From here out, all we need to think about is factor graphs.
• Step 3: condition on the evidence variables. Recall that conditioning on nodes in a factor graph shades
them in, and as a graph operation, rips out those variables from the graph.
• Step 4: remove nodes which are not connected to Q. These are independent of Q, so they have no impact
on the results.
• Step 5: Finally, run a standard probabilistic inference algorithm on the reduced factor graph. We’ll do this
manually for now using variable elimination. Later we’ll see automatic methods for doing this.
Example: alarm
(Graph: B → A ← E)

b p(b)        e p(e)        b e a p(a | b, e)
1 ε           1 ε           0 0 0 1
0 1 − ε       0 1 − ε       0 0 1 0
                            0 1 0 0
                            0 1 1 1
                            1 0 0 0
                            1 0 1 1
                            1 1 0 0
                            1 1 1 1

[whiteboard]
Query: P(B)
• Marginalize out A, E
Query: P(B | A = 1)

• Condition on A = 1

CS221 / Spring 2019 / Charikar & Sadigh 71


• Here is another example: the simple v-structured alarm network from last time.
• P(B) = p(b) trivially after marginalizing out A and E (step 1).
• For P(B | A = 1), step 1 doesn’t do anything. Conditioning (step 3) creates a factor graph with factors
p(b), p(e), and p(a = 1 | b, e). In step 5, we eliminate E by replacing it and its incident factors with a
new factor f(b) = ∑_e p(e) p(a = 1 | b, e). Then, we multiply all the factors (which should only be unary
factors on the query variable B) and normalize: P(B = b | A = 1) ∝ p(b) f(b).
• To flesh this out, for b = 1, we have ε(ε + (1 − ε)) = ε. For b = 0, we have (1 − ε)(ε + 0) = ε(1 − ε). The
normalized result is thus P(B = 1 | A = 1) = ε / (ε + ε(1 − ε)) = 1 / (2 − ε).
• For a probabilistic interpretation, note that all we’ve done is calculate
P(B = b | A = 1) = P(B = b) P(A = 1 | B = b) / P(A = 1) = p(b) f(b) / ∑_{b_i ∈ Domain(B)} p(b_i) f(b_i),
where the first equality follows from Bayes’ rule and the second
follows from the fact that the local conditional distributions are the true conditional distributions. The
Bayesian network has simply given us a methodical, algorithmic way to calculate this probability.
Example: A-H (section)
A B C

D E

F G H

[whiteboard]
Query: P(C | B = b)
• Marginalize out everything else; note C ⊥ B
Query: P(C, H | E = e)
• Marginalize out A, D, F, G; note C ⊥ H | E

CS221 / Spring 2019 / Charikar & Sadigh 73


• In the first example, once we marginalize out all variables we can, we are left with C and B, which are
disconnected. We condition on B, which just removes that node, and so we’re just left with P(C) = p(c),
as expected.
• In the second example, note that the two query variables are independent, so we can compute them
separately. The result is P(C = c, H = h | E = e) ∝ p(c) p(h | e) ∑_b p(b) p(e | b, c).
• If we had the actual values of these probabilities, we can compute these quantities.
Summary
Bayesian networks: modular definition of large joint distribution over
variables

Probabilistic inference: condition on evidence, query variables of interest

Next time: algorithms for probabilistic inference

CS221 / Spring 2019 / Charikar & Sadigh 75


Multiple Linear Regression
Multiple Regression for k = 2,
Graphical Demonstration
The simple linear regression model allows for one independent variable, “X”:
Y = β0 + β1X + ε

The multiple linear regression model allows for more than one independent variable:
Y = β0 + β1X1 + β2X2 + ε

(For k = 2 the fitted surface is a plane in (X1, X2, Y) space rather than a line.)
Model development
Cont…
• We may write

• In other way
Cont…
• We can write Y = Xβ + ε, where Y, X, β and ε are matrices. (Compute their sizes: Y is n×1, X is n×(p+1), β is (p+1)×1, ε is n×1.)
• For any Y_i, we can say
y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ..... + β_p x_ip + ε_i
ε_i = y_i − ∑_{j=0}^{p} β_j x_ij   (with x_i0 = 1)

• Optimize
SSE = ∑_{i=1}^{n} ε_i² = ∑_{i=1}^{n} ( y_i − ∑_{j=0}^{p} β_j x_ij )²
dSSE/dβ = 0

Cont…
• SSE is a scalar quantity and can be written as ε^T ε:
SSE = ε^T ε = (y − Xβ)^T (y − Xβ)

• Now
d(SSE)/dβ = d[ (y − Xβ)^T (y − Xβ) ]/dβ = 0
−2 X^T (y − Xβ) = 0
X^T X β = X^T y
(X^T X)^{-1} X^T X β = (X^T X)^{-1} X^T y
β = (X^T X)^{-1} X^T y
Do it yourself…..
• Problem ….
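A minimal Python/NumPy sketch (with synthetic data, since no dataset is given here) of the normal-equation solution β = (XᵀX)⁻¹Xᵀy derived above; solving the linear system is used instead of forming the inverse explicitly, which is the standard numerically preferable choice.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(scale=0.3, size=n)   # synthetic data

    X = np.column_stack([np.ones(n), x1, x2])         # n x (p+1) design matrix (intercept column of ones)
    beta = np.linalg.solve(X.T @ X, X.T @ y)          # solves X^T X beta = X^T y
    print(beta)                                       # close to [2.0, 1.5, -0.7]

    residuals = y - X @ beta
    print(residuals @ residuals)                      # SSE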
MULTIPLE LINEAR REGRESSION
Part II
Dr. Ranjan Maity
CIT Kokrajhar
REQUIREMENTS
More than one independent variables
One dependent variable
Rule of Thumb: Min No of observations = 10* No of independent Variables
ASSUMPTIONS
Independence
Normality
Homoscedasticity
Linearity
THE R AND BETA
R = the magnitude of the relationship between the dependent variable and the best linear
combination of the predictor variables
R2 = the proportion of variation in Y accounted for by the set of independent variables (X’s).
Testing R2
Test R2 through an F test
Test of each partial regression coefficient (b) by t-tests
Test of competing models (difference between R2) through an F test of the difference of R2s
Comparison of partial regression coefficients with each other - t-test of the difference between standardized partial regression coefficients (β)
R2: a measure that provides information about the goodness of fit of a model
Lecture 9: Linear
Regression
Goals

• Develop basic concepts of linear regression from


a probabilistic framework

• Estimating parameters and hypothesis testing


with linear models

• Linear regression in R
Regression

• Technique used for the modeling and analysis of


numerical data

• Exploits the relationship between two or more


variables so that we can gain information about one of
them through knowing values of the other

• Regression can be used for prediction, estimation,


hypothesis testing, and modeling causal relationships
Regression Lingo

Y = X1 + X2 + X3

Dependent Variable Independent Variable

Outcome Variable Predictor Variable

Response Variable Explanatory Variable


Why Linear Regression?
• Suppose we want to model the dependent variable Y in terms
of three predictors, X1, X2, X3

Y = f(X1, X2, X3)

• Typically will not have enough data to try and directly


estimate f

• Therefore, we usually have to assume that it has some


restricted form, such as linear

Y = X1 + X2 + X3
Linear Regression is a Probabilistic Model

• Much of mathematics is devoted to studying variables


that are deterministically related to one another

y = " 0 + "1 x

y "y
! "0
"x
"1 =
#y
#x
!
! x
!
!
!
• But we’re interested in understanding the relationship
between variables related in a nondeterministic fashion
!
A Linear Probabilistic Model
• Definition: There exist parameters β0, β1, and σ², such that for
any fixed value of the independent variable x, the dependent
variable is related to x through the model equation

y = β0 + β1x + ε

• ε is a rv assumed to be N(0, σ²)

(Figure: the true regression line y = β0 + β1x, with deviations ε1, ε2, ε3 of observations from the line.)
Implications
• The expected value of Y is a linear function of X, but for fixed
x, the variable Y differs from its expected value by a random
amount

• Formally, let x* denote a particular value of the independent


variable x, then our linear probabilistic model says:

E(Y | x*) = µ_{Y|x*} = mean value of Y when x is x*

V(Y | x*) = σ²_{Y|x*} = variance of Y when x is x*
Graphical Interpretation
(Figure: the regression line y = β0 + β1x, with µ_{Y|x1} = β0 + β1x1 and µ_{Y|x2} = β0 + β1x2 marked at x1 and x2.)

• For example, if x = height and y = weight then µ_{Y|x=60} is the average
weight for all individuals 60 inches tall in the population
One More Example
Suppose the relationship between the independent variable height
(x) and dependent variable weight (y) is described by a simple
linear regression model with true regression line
y = 7.5 + 0.5x and σ = 3
• Q1: What is the interpretation of β1 = 0.5?
The expected change in weight associated with a 1-unit increase in height

• Q2: If x = 20 what is the expected value of Y?
µ_{Y|x=20} = 7.5 + 0.5(20) = 17.5

• Q3: If x = 20 what is P(Y > 22)?

P(Y > 22 | x = 20) = P( Z > (22 − 17.5)/3 ) = 1 − Φ(1.5) = 0.067
Estimating Model Parameters
• Point estimates of β̂0 and β̂1 are obtained by the principle of least squares

f(β0, β1) = ∑_{i=1}^{n} [ y_i − (β0 + β1 x_i) ]²

(Figure: data points scattered around a fitted line with intercept β0.)

β̂0 = ȳ − β̂1 x̄
Predicted and Residual Values
• Predicted, or fitted, values are values of y predicted by the least-squares
regression line obtained by plugging x1, x2, …, xn into the estimated regression line
ŷ1 = β̂0 + β̂1 x1
ŷ2 = β̂0 + β̂1 x2

• Residuals are the deviations of observed and predicted values
e1 = y1 − ŷ1
e2 = y2 − ŷ2

(Figure: residuals e1, e2, e3 shown as vertical distances between the observed points and the fitted line.)
Residuals Are Useful!
• They allow us to calculate the error sum of squares (SSE):

SSE = ∑_{i=1}^{n} (e_i)² = ∑_{i=1}^{n} (y_i − ŷ_i)²

• Which in turn allows us to estimate σ²:

σ̂² = SSE / (n − 2)

• As well as an important statistic referred to as the coefficient of determination:

r² = 1 − SSE/SST,   where SST = ∑_{i=1}^{n} (y_i − ȳ)²
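A minimal Python/NumPy sketch (synthetic data, illustrative only) computing the least-squares estimates and then SSE, σ̂², and r² exactly as defined above.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, size=30)
    y = 7.5 + 0.5 * x + rng.normal(scale=3.0, size=30)     # synthetic data

    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)    # slope estimate
    b0 = y.mean() - b1 * x.mean()                          # intercept estimate
    y_hat = b0 + b1 * x

    SSE = np.sum((y - y_hat) ** 2)
    SST = np.sum((y - y.mean()) ** 2)
    sigma2_hat = SSE / (len(x) - 2)
    r2 = 1 - SSE / SST
    print(b0, b1, sigma2_hat, r2)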
Multiple Linear Regression
• Extension of the simple linear regression model to two or more independent variables

y = β0 + β1x1 + β2x2 + ... + βnxn + ε

Expression = Baseline + Age + Tissue + Sex + Error
• Partial Regression Coefficients: βi ≡ effect on the
dependent variable when increasing the ith independent
variable by 1 unit, holding all other predictors
constant
Categorical Independent Variables
• Qualitative variables are easily incorporated in regression
framework through dummy variables

• Simple example: sex can be coded as 0/1

• What if my categorical variable contains three levels:


0 if AA
xi = 1 if AG
2 if GG
Categorical Independent Variables
• Previous coding would result in collinearity

• Solution is to set up a series of dummy variables. In general, for k levels you need k-1 dummy variables
x1 = 1 if AA, 0 otherwise
x2 = 1 if AG, 0 otherwise

     x1  x2
AA    1   0
AG    0   1
GG    0   0
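A minimal Python sketch (assumed) of this k−1 dummy-variable coding for the three genotype levels, with GG as the reference level.

    genotypes = ["AA", "AG", "GG", "AG", "AA"]

    x1 = [1 if g == "AA" else 0 for g in genotypes]   # dummy for AA
    x2 = [1 if g == "AG" else 0 for g in genotypes]   # dummy for AG; GG is coded (0, 0)
    print(list(zip(genotypes, x1, x2)))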
Hypothesis Testing: Model Utility Test (or
Omnibus Test)
• The first thing we want to know after fitting a model is whether
any of the independent variables (X’s) are significantly related to
the dependent variable (Y):

H0: β1 = β2 = ... = βk = 0
HA: at least one βi ≠ 0

f = ( R² / k ) / ( (1 − R²) / (n − (k + 1)) )

Rejection region: F_{α, k, n−(k+1)}
Equivalent ANOVA Formulation of Omnibus Test

• We can also frame this in our now familiar ANOVA framework


- partition total variation into two components: SSE (unexplained
variation) and SSR (variation explained by linear model)

Source of Variation   df          Sum of Squares          MS                     F
Regression            k           SSR = ∑(ŷ_i − ȳ)²       MSR = SSR/k            MSR/MSE
Error                 n−(k+1)     SSE = ∑(y_i − ŷ_i)²     MSE = SSE/(n−(k+1))
Total                 n−1         SST = ∑(y_i − ȳ)²

Rejection region: F_{α, k, n−(k+1)}
F Test For Subsets of Independent Variables

• A powerful tool in multiple regression analyses is the ability to


compare two models

• For instance say we want to compare:


Full model:    y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ε
Reduced model: y = β0 + β1x1 + β2x2 + ε

• Again, another example of ANOVA:

f = [ (SSE_R − SSE_F) / (k − l) ] / [ SSE_F / (n − (k + 1)) ]

where SSE_R = error sum of squares for the reduced model with l predictors, and
SSE_F = error sum of squares for the full model with k predictors
Example of Model Comparison
• We have a quantitative trait and want to test the effects at two
markers, M1 and M2.

Full model:    Trait = Mean + M1 + M2 + (M1*M2) + error
Reduced model: Trait = Mean + M1 + M2 + error

f = [ (SSE_R − SSE_F) / (3 − 2) ] / [ SSE_F / (100 − (3 + 1)) ] = (SSE_R − SSE_F) / (SSE_F / 96)

Rejection region: F_{α, 1, 96}
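A minimal Python/NumPy sketch (synthetic data, illustrative only) of this partial F statistic for comparing a full and a reduced least-squares model.

    import numpy as np

    def sse(X, y):
        beta = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares fit
        r = y - X @ beta
        return r @ r

    rng = np.random.default_rng(2)
    n = 100
    m1 = rng.normal(size=n)
    m2 = rng.normal(size=n)
    y = 1.0 + 0.8 * m1 + 0.5 * m2 + 0.3 * m1 * m2 + rng.normal(size=n)   # synthetic trait

    X_full = np.column_stack([np.ones(n), m1, m2, m1 * m2])   # k = 3 predictors
    X_red = np.column_stack([np.ones(n), m1, m2])             # l = 2 predictors
    k, l = 3, 2
    f = ((sse(X_red, y) - sse(X_full, y)) / (k - l)) / (sse(X_full, y) / (n - (k + 1)))
    print(f)    # compare to the F_{alpha, k-l, n-(k+1)} critical value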
Hypothesis Tests of Individual Regression
Coefficients
• Hypothesis tests for each β̂_i can be done by simple t-tests:

H0: β_i = 0
HA: β_i ≠ 0

T = (β̂_i − β_i) / se(β̂_i)

Critical value: t_{α/2, n−(k+1)}

• Confidence intervals are equally easy to obtain:

β̂_i ± t_{α/2, n−(k+1)} · se(β̂_i)


Checking Assumptions
• Critically important to examine data and check assumptions
underlying the regression model
 Outliers
 Normality
 Constant variance
 Independence among residuals

• Standard diagnostic plots include:


 scatter plots of y versus xi (outliers)
 qq plot of residuals (normality)
 residuals versus fitted values (independence, constant variance)
 residuals versus xi (outliers, constant variance)

• We’ll explore diagnostic plots in more detail in R


Fixed -vs- Random Effects Models
• In ANOVA and Regression analyses our independent variables can
be treated as Fixed or Random

• Fixed Effects: variables whose levels are either sampled


exhaustively or are the only ones considered relevant to the
experimenter

• Random Effects: variables whose levels are randomly sampled


from a large population of levels

• Example from our recent AJHG paper:

Expression = Baseline + Population + Individual + Error


Simple Linear Regression
The Model

The model has a deterministic and a probabilistic component.

(Figure: house cost vs. house size; most lots sell for $25,000, so the deterministic part is a line.)

However, house costs vary even among houses of the same size! Since cost behaves unpredictably, we add a random component.

(Figure: the same plot with costs scattered around the line.)
 The first order linear model

Y = β0 + β1X + ε

Y = dependent variable
X = independent variable
β0 = Y-intercept
β1 = slope of the line (β1 = Rise/Run)
ε = error variable

β0 and β1 are unknown population parameters, and are therefore estimated from the data.
Estimating the Coefficients

 The estimates are determined by


 drawing a sample from the population of interest,
 calculating sample statistics.
 producing a straight line that cuts into the data.

(Figure: a scatter of data points in the X–Y plane.)

 Question: What should be considered a good line?
The Least Squares
(Regression) Line

A good line is one that minimizes the sum of squared differences between the points and the line.

Let us compare two lines for the points (1,2), (2,4), (3,1.5), (4,3.2); the second line is horizontal at 2.5.

Sum of squared differences (line 1) = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
Sum of squared differences (line 2) = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

The smaller the sum of squared differences, the better the fit of the line to the data.
The Estimated Coefficients

To calculate the estimates of the line coefficients that minimize the differences between the data points and the line, use the formulas:

b1 = cov(X,Y) / s_X²   ( = s_XY / s_X² )
b0 = Ȳ − b1 X̄

The regression equation that estimates the equation of the first order linear model is:

Ŷ = b0 + b1 X
The Simple Linear Regression Line
• Example 17.2 (Xm17-02)
 A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
 A random sample of 100 cars is selected, and the data recorded.
 Find the regression line.

Car  Odometer (independent variable X)  Price (dependent variable Y)
1    37388                              14636
2    44758                              14122
3    45833                              14016
4    30862                              15590
5    31705                              15568
6    34010                              14718
.    .                                  .
• Solution
– Solving by hand: calculate a number of statistics (n = 100)

X̄ = 36,009.45;   s_X² = ∑(X_i − X̄)² / (n − 1) = 43,528,690
Ȳ = 14,822.82;   cov(X,Y) = ∑(X_i − X̄)(Y_i − Ȳ) / (n − 1) = −2,712,511

b1 = cov(X,Y) / s_X² = −2,712,511 / 43,528,690 = −.06232
b0 = Ȳ − b1 X̄ = 14,822.82 − (−.06232)(36,009.45) = 17,067

Ŷ = b0 + b1 X = 17,067 − .0623X
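A minimal Python/NumPy sketch (using only the six cars listed above, since the full Xm17-02 file is not reproduced here) of the same hand calculation b1 = cov(X,Y)/s_X² and b0 = Ȳ − b1X̄.

    import numpy as np

    odometer = np.array([37388, 44758, 45833, 30862, 31705, 34010], dtype=float)   # first 6 cars only
    price = np.array([14636, 14122, 14016, 15590, 15568, 14718], dtype=float)

    cov_xy = np.cov(odometer, price, ddof=1)[0, 1]
    s2_x = np.var(odometer, ddof=1)
    b1 = cov_xy / s2_x
    b0 = price.mean() - b1 * odometer.mean()
    print(b1, b0)   # with the full 100-car sample these come out to about -0.0623 and 17,067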
• Solution – continued
– Using the computer (Xm17-02)

Tools > Data Analysis > Regression >


[Shade the Y range and the X range] > OK
Xm17-02
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.8063
R Square 0.6501
Adjusted R Square 0.6466

Yˆ = 17,067 − .0623X
Standard Error 303.1
Observations 100

ANOVA
df SS MS F Significance F
Regression 1 16734111 16734111 182.11 0.0000
Residual 98 9005450 91892
Total 99 25739561

Coefficients Standard Error t Stat P-value


Intercept 17067 169 100.97 0.0000
Odometer -0.0623 0.0046 -13.49 0.0000
Interpreting the Linear Regression Equation

Ŷ = 17,067 − .0623X

[Figure: Odometer Line Fit Plot, Price (roughly 13,000 to 16,000) against Odometer; there are no data near Odometer = 0.]

The intercept is b0 = $17,067. Do not interpret the intercept as the "price of cars that have not been driven": the sample contains no cars with odometer readings near zero.

The slope is b1 = −.0623. For each additional mile on the odometer, the price decreases by an average of $0.0623.
Error Variable: Required Conditions

 The error ε is a critical part of the regression model.
 Four requirements involving the distribution of ε must be satisfied:
  The probability distribution of ε is normal.
  The mean of ε is zero: E(ε) = 0.
  The standard deviation of ε is σε for all values of X.
  The errors associated with different values of Y are all independent.
The Normality of ε

[Figure: three normal curves centered at E(Y|X1) = β0 + β1X1, E(Y|X2) = β0 + β1X2 and E(Y|X3) = β0 + β1X3. The standard deviation remains constant, but the mean value changes with X.]

From the first three assumptions we have: Y is normally distributed with mean E(Y) = β0 + β1X and a constant standard deviation σε.
Assessing the Model

 The least squares method will produce a regression line whether or not there is a linear relationship between X and Y.
 Consequently, it is important to assess how well the linear model fits the data.
 Several methods are used to assess the model. All are based on the sum of squares for errors, SSE.
Sum of Squares for Errors
 This is the sum of squared differences between the points and the regression line.
 It can serve as a measure of how well the line fits the data. SSE is defined by

SSE = Σ (Yi − Ŷi)²   (summed over i = 1, …, n)

– A shortcut formula:

SSE = (n − 1) [ sY² − (cov(X,Y))² / sX² ]
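A brief R check (on the built-in cars data, purely as an illustration) that the definition and the shortcut formula give the same SSE:

x   <- cars$speed
y   <- cars$dist
fit <- lm(y ~ x)
sse_definition <- sum((y - fitted(fit))^2)                         # sum of squared residuals
sse_shortcut   <- (length(y) - 1) * (var(y) - cov(x, y)^2 / var(x))
c(sse_definition, sse_shortcut)                                    # equal up to rounding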
Standard Error of Estimate
 The mean error is equal to zero.
 If σε is small, the errors tend to be close to zero (close to the mean error). Then the model fits the data well.
 Therefore, we can use σε as a measure of the suitability of using a linear model.
 An estimator of σε is given by sε, the standard error of estimate:

sε = √( SSE / (n − 2) )
 Example 17.3
 Calculate the standard error of estimate for Example 17.2, and describe what it tells you about the model fit.
 Solution

sY² = Σ (Yi − Ȳ)² / (n − 1) = 259,996   (calculated before)

SSE = (n − 1) [ sY² − (cov(X,Y))² / sX² ] = 99 [ 259,996 − (−2,712,511)² / 43,528,690 ] = 9,005,450

sε = √( SSE / (n − 2) ) = √( 9,005,450 / 98 ) = 303.13

It is hard to assess the model based on sε alone, even when it is compared with the mean value of Y: sε = 303.1, Ȳ = 14,823.
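The arithmetic above can be reproduced directly in R from the summary statistics; small differences arise because the inputs are rounded.

n     <- 100
s2_y  <- 259996        # sample variance of Y
s2_x  <- 43528690      # sample variance of X
covxy <- -2712511      # sample covariance
sse   <- (n - 1) * (s2_y - covxy^2 / s2_x)   # about 9.0 million (slide: 9,005,450)
s_eps <- sqrt(sse / (n - 2))                 # about 303.1
c(SSE = sse, s_eps = s_eps)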
Testing the Slope
 When no linear relationship exists between two variables, the regression line should be horizontal.

[Figure: two scatter plots. Left, a linear relationship: different inputs (X) yield different outputs (Y); the slope is not equal to zero. Right, no linear relationship: different inputs (X) yield the same output (Y); the slope is equal to zero.]
 We can draw inference about β1 from b1 by testing
   H0: β1 = 0
   H1: β1 ≠ 0 (or < 0, or > 0)
 The test statistic is

t = (b1 − β1) / s_b1,   where   s_b1 = sε / √( (n − 1) sX² )

s_b1 is the standard error of b1. If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n − 2.
 Example 17.4
 Test to determine whether there is enough evidence to infer that there is a linear relationship between the car auction price and the odometer reading for all three-year-old Tauruses in Example 17.2. Use α = 5%.
 Solving by hand
 To compute t we need the values of b1 and s_b1:

b1 = −.0623

s_b1 = sε / √( (n − 1) sX² ) = 303.1 / √( (99)(43,528,690) ) = .00462

t = (b1 − β1) / s_b1 = (−.0623 − 0) / .00462 = −13.49

 The rejection region is t > t.025 or t < −t.025 with ν = n − 2 = 98. Approximately, t.025 = 1.984.
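The same hand calculation, including the critical value, can be checked in R (values taken from this slide):

n      <- 100
s_eps  <- 303.1
s2_x   <- 43528690
b1     <- -0.0623
s_b1   <- s_eps / sqrt((n - 1) * s2_x)   # about .00462
t_obs  <- (b1 - 0) / s_b1                # about -13.49
t_crit <- qt(0.975, df = n - 2)          # about 1.984
c(s_b1 = s_b1, t = t_obs, t_crit = t_crit)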
Xm17-02
• Using the computer
(The Price and Odometer data columns appear alongside the output in the worksheet.)

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.8063
  R Square            0.6501
  Adjusted R Square   0.6466
  Standard Error      303.1
  Observations        100

ANOVA
              df        SS        MS        F     Significance F
  Regression   1   16734111  16734111   182.11       0.0000
  Residual    98    9005450     91892
  Total       99   25739561

              Coefficients  Standard Error   t Stat   P-value
  Intercept      17067            169        100.97    0.0000
  Odometer      -0.0623          0.0046      -13.49    0.0000

Since the p-value for the Odometer coefficient is 0.0000, there is overwhelming evidence to infer that the odometer reading affects the auction selling price.
Coefficient of Determination
 To measure the strength of the linear relationship we use the coefficient of determination:

R² = (cov(X,Y))² / (sX² sY²)    (or, equivalently, R² = r²XY)

or

R² = 1 − SSE / Σ (Yi − Ȳ)²    (see p. 18 above)
• To understand the significance of this coefficient note:

[Figure: two data points (X1,Y1) and (X2,Y2) of a certain sample are shown, together with the regression line and the overall mean of Y. The overall variability in Y splits into the part captured by the regression model and the error.]

Total variation in Y = Variation explained by the regression line + Unexplained variation (error)

Variation in Y = SSR + SSE

For the two points shown:
(Y1 − Ȳ)² + (Y2 − Ȳ)² = (Ŷ1 − Ȳ)² + (Ŷ2 − Ȳ)² + (Y1 − Ŷ1)² + (Y2 − Ŷ2)²
• R² measures the proportion of the variation in Y that is explained by the variation in X:

R² = 1 − SSE / Σ (Yi − Ȳ)² = ( Σ (Yi − Ȳ)² − SSE ) / Σ (Yi − Ȳ)² = SSR / Σ (Yi − Ȳ)²

• R² takes on any value between zero and one.
  R² = 1: perfect match between the line and the data points.
  R² = 0: there is no linear relationship between X and Y.
 Example 17.5
 Find the coefficient of determination for Example 17.2. What does this statistic tell you about the model?
 Solution
 Solving by hand:

R² = (cov(X,Y))² / (sX² sY²) = (−2,712,511)² / ( (43,528,690)(259,996) ) = .6501

– Using the computer
From the regression output shown earlier we have R Square = 0.6501 (Multiple R = 0.8063, Adjusted R Square = 0.6466, Standard Error = 303.1, Observations = 100).

65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
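A quick R check of this arithmetic, using the summary statistics and the ANOVA table values from the earlier slides:

covxy <- -2712511
s2_x  <- 43528690
s2_y  <- 259996
covxy^2 / (s2_x * s2_y)      # about 0.6501
1 - 9005450 / 25739561       # same value from 1 - SSE/SS(Total)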
22/03/2022
Vector, Matrix, Tensor

[Handwritten notes; much of the scan is illegible. The legible parts are summarized below.]

NOTE (compatibility condition): matrix dimensions must be compatible for multiplication. An a×b matrix times a b×c matrix gives an a×c matrix; for example, a 1×4 matrix times a 4×2 matrix gives a 1×2 matrix.

LINEAR FUNCTION
Let x and y be two variables (or vectors) and c a scalar. F is a linear function if it satisfies both:
ADDITIVITY:   F(x + y) = F(x) + F(y)
HOMOGENEITY:  F(cx) = cF(x)
Example: F(x) = 4x is linear; F(x) = 2x + 2 is NOT LINEAR (the added constant breaks both conditions).

MATRIX
A 2-dimensional array of numbers. [Worked example illegible.]

TENSOR
An n-dimensional data structure; a generalization of vectors and matrices to ranks greater than or equal to 3. Example: a colour image (24-bit pixels) is naturally a rank-3 tensor (height × width × colour channel).