Machine Learning Notes
H1: Not all of the population means are the same.
At least one population mean is different (i.e., there is a treatment effect).
This does not mean that all population means are different; some pairs may be the same.
The F-distribution
A ratio of variances follows an F-distribution:
σ²_between / σ²_within ~ F_{n,m}
The group means (10 observations per group):
ȳ_1• = (Σ_{j=1..10} y_1j)/10,  ȳ_2• = (Σ_{j=1..10} y_2j)/10,  ȳ_3• = (Σ_{j=1..10} y_3j)/10,  ȳ_4• = (Σ_{j=1..10} y_4j)/10

The (within) group variances:
s_i² = Σ_{j=1..10} (y_ij − ȳ_i•)² / (10 − 1),  for i = 1, 2, 3, 4

Sum of Squares Within (SSW), or Sum of Squares Error (SSE): add up the numerators of the four group variances:
Σ_{j=1..10}(y_1j − ȳ_1•)² + Σ_{j=1..10}(y_2j − ȳ_2•)² + Σ_{j=1..10}(y_3j − ȳ_3•)² + Σ_{j=1..10}(y_4j − ȳ_4•)²
= Σ_{i=1..4} Σ_{j=1..10} (y_ij − ȳ_i•)²  — the Sum of Squares Within (SSW) (or SSE, for chance error)
Sum of Squares Between (SSB), or Sum of Squares Regression (SSR)

Overall mean of all 40 observations (the "grand mean"):
ȳ•• = Σ_{i=1..4} Σ_{j=1..10} y_ij / 40

Total sum of squares (TSS): the squared difference of every observation from the overall mean (the numerator of the variance of Y!):
TSS = Σ_{i=1..4} Σ_{j=1..10} (y_ij − ȳ••)²
Partitioning of Variance
Σ_{i=1..4} Σ_{j=1..10} (y_ij − ȳ_i•)² + 10 × Σ_{i=1..4} (ȳ_i• − ȳ••)² = Σ_{i=1..4} Σ_{j=1..10} (y_ij − ȳ••)²
i.e., SSW + SSB = TSS. Here SSW (the sum of squared within-group deviations) = 2060.6.
Step 3) Fill in the ANOVA table

Source of variation   d.f.   Sum of squares   Mean Sum of Squares   F-statistic   p-value
Between groups        3      196.5            65.5                  1.14
Within groups         36     2060.6           57.2
Total                 39     2257.1
INTERPRETATION of ANOVA:
How much of the variance in height is explained by treatment group?
R² = "Coefficient of Determination" = SSB/TSS = 196.5/2257.1 ≈ 9%

Coefficient of Determination
R² = SSB/(SSB + SSE) = SSB/SST
The amount of variation in the outcome variable (dependent variable) that is explained by the predictor (independent variable).
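As a concrete illustration, here is a minimal Python sketch of the sum-of-squares partition, F-statistic, and R² described above (the four treatment groups and all numbers are made up for illustration):

import numpy as np

# Hypothetical data: 4 treatment groups, 10 observations each (made-up values)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=7, size=10) for m in (60, 62, 65, 68)]

all_y = np.concatenate(groups)            # all 40 observations
grand_mean = all_y.mean()                 # the "grand mean"

ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within-group
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between-group
tss = ((all_y - grand_mean) ** 2).sum()
assert np.isclose(ssw + ssb, tss)         # the partition: TSS = SSW + SSB

k, n = len(groups), len(all_y)
f_stat = (ssb / (k - 1)) / (ssw / (n - k))   # F = MSB / MSW
r_squared = ssb / tss                        # share of variance explained
print(f"F = {f_stat:.2f}, R^2 = {r_squared:.2%}")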
ANOVA example
Table 6. Mean micronutrient intake from the school lunch by school

                          S1ᵃ, n=25   S2ᵇ, n=25   S3ᶜ, n=25   P-valueᵈ
Calcium (mg)    Mean      117.8       158.7       206.5       0.000
                SDᵉ       62.4        70.5        86.2
Iron (mg)       Mean      2.0         2.0         2.0         0.854
                SD        0.6         0.6         0.6
Folate (μg)     Mean      26.6        38.7        42.6        0.000
                SD        13.1        14.5        15.1
Zinc (mg)       Mean      1.9         1.5         1.3         0.055
                SD        1.0         1.2         0.4

a School 1 (most deprived; 40% subsidized lunches).
b School 2 (medium deprived; <10% subsidized).
c School 3 (least deprived; no subsidization, private school).
d ANOVA; significant differences are highlighted in bold (P<0.05).

FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in England - are the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92.
Answer
Step 1) Calculate the sum of squares between groups:
Mean for School 1 = 117.8
Mean for School 2 = 158.7
Mean for School 3 = 206.5
SSB = 98,113; from the ANOVA table, Total: d.f. = 74, sum of squares = 489,179.
R² = 98,113/489,179 ≈ 20%
School explains 20% of the variance in lunchtime calcium intake in these kids.
Beyond one-way ANOVA
Often you may want to test more than one treatment. ANOVA can accommodate more than one treatment or factor, so long as they are independent. Again, the variation partitions beautifully!
What's Slope?
E(y_i | x_i) = α + βx_i
Predicted value for an individual:
y_i = α + β·x_i + random error_i

Regression Picture
[Figure: observed values y_i scattered around the fitted line ŷ_i = α + βx_i; S_{y/x} denotes the scatter of the points about the line.]
1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.

Distribution of vitamin D: mean = 63 nmol/L, standard deviation = 33 nmol/L.
Distribution of DSST: normally distributed, mean = 28 points, standard deviation = 10 points.
Four hypothetical datasets
I generated four hypothetical datasets, with increasing TRUE slopes (between vit D and DSST):
Dataset 1: 0 points per 10 nmol/L (no relationship)
Dataset 2: 0.5 points per 10 nmol/L (weak relationship)
Dataset 3: 1.0 points per 10 nmol/L (weak to moderate relationship)
Dataset 4: 1.5 points per 10 nmol/L (moderate relationship)
The "best fit" line for each dataset (vit D in 10 nmol/L):
Dataset 1: E(Yi) = 28 + 0*vit Di
Dataset 2: E(Yi) = 26 + 0.5*vit Di
Dataset 3: E(Yi) = 22 + 1.0*vit Di
Dataset 4: E(Yi) = 20 + 1.5*vit Di
Significance test for the slope:
T_{n−2} = (β̂ − 0) / s.e.(β̂)
Example: dataset 4
Standard error (beta) = 0.03
T98 = 0.15/0.03 = 5, p < .0001
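A minimal Python sketch of this slope test (the simulated vitamin D/DSST data below are hypothetical, chosen to mimic dataset 4; scipy is assumed to be available):

import numpy as np
from scipy import stats

# Hypothetical data mimicking dataset 4: true slope 0.15 DSST points per nmol/L
rng = np.random.default_rng(1)
n = 100
x = rng.normal(63, 33, size=n)               # vitamin D, nmol/L
y = 20 + 0.15 * x + rng.normal(0, 10, n)     # DSST points

# Least-squares slope and its standard error
b1 = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
se_b1 = np.sqrt((resid @ resid) / (n - 2) / ((x - x.mean()) ** 2).sum())

t = (b1 - 0) / se_b1                         # T_{n-2} = (beta_hat - 0) / s.e.(beta_hat)
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"slope = {b1:.3f}, t({n - 2}) = {t:.2f}, p = {p:.2g}")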
Sufficient vs. Deficient
Results…
Parameter Estimates (Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|)
Interpretation:
The deficient group has a mean DSST 9.87 points lower than the reference (sufficient) group.
The insufficient group has a mean DSST 6.87 points lower than the reference (sufficient) group.
What is a Hypothesis?
A hypothesis is an assumption about the population parameter. ("I assume the mean GPA of this class is 3.5!")
A parameter is a population mean or proportion.
The parameter must be identified before analysis.
Steps:
State the Null Hypothesis (e.g., H0: µ ≥ 3).

Example: Assume the population mean age is 50 (the null hypothesis). The sample mean is 20. Is X̄ = 20 ≅ µ = 50? No, not likely! REJECT the null hypothesis.

Sampling Distribution
It is unlikely that we would get a sample mean of this value if in fact this were the population mean. Therefore, we reject the null hypothesis that µ = 50.
[Figure: sampling distribution under H0, centered at µ = 50, with the observed sample mean of 20 far out in the tail.]
Level of Significance, α
• Defines Unlikely Values of Sample
Statistic if Null Hypothesis Is True
Called Rejection Region of Sampling
Distribution
• Designated α (alpha)
Typical values are 0.01, 0.05, 0.10
• Selected by the Researcher at the Start
• Provides the Critical Value(s) of the Test
Level of Significance, α, and the Rejection Region
α determines the critical value(s) of the test:
H0: µ ≥ 3 vs. H1: µ < 3 — one rejection region of area α in the lower tail.
H0: µ ≤ 3 vs. H1: µ > 3 — one rejection region of area α in the upper tail.
H0: µ = 3 vs. H1: µ ≠ 3 — two rejection regions of area α/2, one in each tail.
• Type I Error
Reject True Null Hypothesis
Has Serious Consequences
• Type II Error
Do Not Reject False Null Hypothesis
Probability of Type II Error Is β (Beta)
• Z Test Statistic:
z = (x̄ − µ_x̄) / σ_x̄ = (x̄ − µ) / (σ/√n)
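A quick sketch of this statistic in Python (the sample, hypothesized mean, and σ are made up for illustration):

import math

sample = [52, 47, 55, 49, 51, 48, 53, 50]   # hypothetical observations
mu0, sigma = 50.0, 3.0                       # hypothesized mean, known population SD

n = len(sample)
xbar = sum(sample) / n
z = (xbar - mu0) / (sigma / math.sqrt(n))    # z = (xbar - mu) / (sigma / sqrt(n))
print(f"xbar = {xbar:.2f}, z = {z:.2f}")     # compare |z| with the critical value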
One-tailed tests:
H0: µ ≥ 0 vs. H1: µ < 0 — reject H0 in the lower tail (area α); the statistic must be significantly below µ = 0.
H0: µ ≤ 0 vs. H1: µ > 0 — reject H0 in the upper tail (area α); small values don't contradict H0 — don't reject H0!

Example: One Tail Test
With α = 0.05, a test statistic of Z = 1.50 is in the do-not-reject region.

Example: Two Tail Test
Module 2: Bayesian Learning [5L] Probability Basics, Bayes Theorem, Naive Bayes Classifier, Gaussian Naive Bayes Classifier, Bayesian Networks.
Module 3: Artificial Neural Network [5L] Perceptron Learning Algorithm: Delta Rule and Gradient Descent. Multi-layer Perceptron Learning:
Backpropagation and Stochastic Gradient Descent. Hypotheses Space, Inductive Bias and Convergence. Variant Structures of Neural Networks.
Module 4: Support Vector Machines [4L] Decision Boundary and Support Vector: Optimization and Primal-Dual Problem. Extension to SVM: Soft Margin
and Non-linear Decision Boundary. Kernel Functions and Radial Basis Functions (detailed later).
Module 5: Linear Models and Regression [3L] Linear Regression. Linear Classification. Logistic Regression. Non-linear Transformation.
Module 6: Decision Tree Learning [4L] Decision Tree Representation and Learning Algorithm (ID3). Attribute Selection using Entropy Measures and
Gains. Hypotheses Space and Inductive bias.
Module 7: Clustering [5L] Partitional Clustering and Hierarchical Clustering. Cluster Types, Attributes and Salient Features. k-Means, k-Nearest
Neighbour (kNN) Classifier. Hierarchical and Density-based Clustering Algorithms. Inter and Intra Clustering Similarity, Cohesion and Separation.
Module 8: Some Other Learning Concepts [1L] Active Learning, Deep Learning, Transfer Learning.
Books
TEXT BOOKS:
1. Christopher Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
2. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
REFERENCES:
1. Tom Mitchell, Machine Learning, McGraw Hill, 1997.
2. Ethem Alpaydin, Introduction to Machine Learning, Third Edition, The MIT Press, September 2014.
3. Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, First Edition, Cambridge University Press, 2014.
4. Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin, Learning From Data, First Edition, AMLBook, 2012.
Evaluation
Continuous (40):
3. Mid term : 15
5. Term Project : 15
1. Number System
2. Linear Algebra
3. Probability Theory
Quick Recap : Number System
Source: https://en.wikipedia.org/wiki/Natural_number
Cont.…
3. Rational Numbers (ℚ): any number x that can be represented as a/b, where a and b are both integers and b ≠ 0. Example: 1.5 = 3/2. Try it for √2, √3 — can they be written this way?
4. Integers (ℤ): whole numbers without a fractional part; can be positive as well as negative. Examples: 3, 59, -113.
Linear Function
▪ Homogeneity: F(cX) = c·F(X)
▪ Additivity: F(X + Y) = F(X) + F(Y)
▪ Linear: if the function F obeys Additivity and Homogeneity, then we call F linear.
▪ Consider F(x) = x². Test F(x+x) = F(x) + F(x) with a value: for x = 1, F(2) = 4 but F(1) + F(1) = 2. Is it linear?? No.
Matrix
▪ Two vectors
▪Dot Product
Probability
▪ Problem: Flip three fair coins one by one. What is the probability of at least two heads?
▪ Sample space S = {TTT, TTH, THT, THH, HTT, HTH, HHT, HHH} (can map with digital logic??)
▪ Total outcomes = 8
▪ At least two heads: M = {THH, HTH, HHT, HHH}, so |M| = 4
▪ Probability of at least two heads = 4/8 = 1/2
▪ Problem: Two fair coins are flipped and two fair dice are rolled independently. What is the probability of {HH66}?
▪ Sample space size = 2×2×6×6 = 144, so the probability = 1/144
▪ Independently, P(H) = 1/2 and P(6) = 1/6, so
▪ P{HH66} = P(H)×P(H)×P(6)×P(6) = 1/144 -------- independent events
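Both answers can be checked by brute-force enumeration of the equally likely outcomes (a sketch in Python):

from itertools import product
from fractions import Fraction

# Problem 1: three fair coins, P(at least two heads)
coins = list(product("HT", repeat=3))                    # 8 equally likely outcomes
at_least_two = [o for o in coins if o.count("H") >= 2]
print(Fraction(len(at_least_two), len(coins)))           # 1/2

# Problem 2: two fair coins and two fair dice, P(H, H, 6, 6)
space = list(product("HT", "HT", range(1, 7), range(1, 7)))  # 2*2*6*6 = 144 outcomes
event = [o for o in space if o == ("H", "H", 6, 6)]
print(Fraction(len(event), len(space)))                  # 1/144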
Conditional Probability
▪ Given two events A and B with P(B) > 0, the conditional probability of A given B is P(A|B) = P(A∩B)/P(B).
▪ Example: Consider the case of flipping three coins: S = {TTT, TTH, THT, THH, HTT, HTH, HHT, HHH}
▪ What is the probability that the first coin is a head, given that exactly two of the three are heads???
▪ A = first coin head, B = exactly two coins are heads
▪ A = {HTT, HTH, HHT, HHH}, so P(A) = 4/8 = 1/2
▪ B = {THH, HTH, HHT}, so P(B) = 3/8
▪ A∩B = {HTH, HHT}, so P(A∩B) = 2/8
▪ P(A|B) = P(A∩B)/P(B) = (2/8)/(3/8) = 2/3
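The same answer falls out of a direct enumeration (a Python sketch):

from itertools import product
from fractions import Fraction

space = list(product("HT", repeat=3))            # 8 equally likely outcomes
B = [o for o in space if o.count("H") == 2]      # exactly two heads
A_and_B = [o for o in B if o[0] == "H"]          # ...and the first coin is a head
print(Fraction(len(A_and_B), len(B)))            # 2/3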
Law of Total Probability, conditioned
▪ Consider a sample space S partitioned into events A1, …, An. If B is an event, then P(B) = Σ_i P(B|Ai)·P(Ai).
▪ Problem: A train has 40% AC coaches, while the rest are non-AC. 50% of the berths in AC coaches are lower berths, while lower berths in non-AC rakes are only 30%. A berth is picked randomly. What is the probability that it is a lower berth?
▪ A1 = berth is in an AC coach, A2 = berth is in a non-AC coach
▪ P(A1) = 0.4, P(A2) = 0.6
▪ P(Lower) = P(Lower|A1)·P(A1) + P(Lower|A2)·P(A2) = 0.5×0.4 + 0.3×0.6 = 0.38
Bayes Theorem
▪ P(A|B) = P(B|A)·P(A) / P(B)
▪ Example: Consider two gift boxes, G1 = {5 red pens, 10 green pens} and G2 = {4 red pens, 6 green pens}. A pen drawn from one of the boxes turns out to be green. What is the probability that the green pen was selected from G1?
▪ P(G1 | Green) = ???
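Under the natural (but unstated) assumption that each box is equally likely to be chosen, P(G1) = P(G2) = 1/2, and Bayes' theorem gives:
P(Green | G1) = 10/15 = 2/3,  P(Green | G2) = 6/10 = 3/5
P(G1 | Green) = P(Green|G1)·P(G1) / [P(Green|G1)·P(G1) + P(Green|G2)·P(G2)] = (1/3) / (1/3 + 3/10) = 10/19 ≈ 0.53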
Dr. Ranjan Maity
Assistant Professor
Central Institute of Technology Kokrajhar
r.maity@cit.ac.in
Genesis
Alan Turing: I believe that in about fifty
years’ time it will be possible to programme
computers, with a storage capacity of about 10⁹,
to make them play the imitation game so well
that an average interrogator will not have more
than 70 percent chance of making the right
identification after five minutes of questioning.
… I believe that at the end of the century the use
of words and general educated opinion will have
altered so much that one will be able to speak of
machines thinking without expecting to be
contradicted.
Types
Supervised
Unsupervised
Semi-supervised
Reinforcement
Cont…
Consider the images:
Apple (red, green) – class 1
Lemon & Orange – class 2
Watermelon – class 3
Banana – class 4
A model will be developed based on the data, then tested with a test set.
https://www.diegocalvo.es/en/supervised-learning/
Mathematical Formulation
Two types –
Classification
Given a set of labelled pairs (x1,y1), (x2,y2), (x3,y3), …, (xn,yn),
learn a function f(x) which will predict y (in terms of a class) for a given x.
[Figure: weight vs. height scatter, with f(x) separating the cat class from the tiger class.]
Regression
Given a set of labelled pairs (x1,y1), (x2,y2), (x3,y3), …, (xn,yn),
learn a function f(x) which will predict y (in terms of values) for a given x (e.g., the stock market).
Unsupervised
O/P – structure of the xi, e.g., Clustering
Reinforcement
Agent, Environment, Policy, Reward
Norms of Vector
Length/magnitude of a vector: a function f : u → R, where
1. f is a function, u is a vector space, R is the real space
2. f(x) ≥ 0, ∀x ∈ u
4. Positive homogeneity: f(αx) = |α|·f(x), ∀x ∈ u, ∀α ∈ C
Frobenius norm (for a matrix):
‖M‖_F = √( Σ_{i=1..n} Σ_{j=1..n} M_ij² )
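For instance, a quick NumPy check (the matrix is arbitrary):

import numpy as np

M = np.array([[1.0, 2.0], [3.0, 4.0]])
manual = np.sqrt((M ** 2).sum())           # square root of the sum of squared entries
library = np.linalg.norm(M, ord="fro")     # NumPy's built-in Frobenius norm
print(manual, library)                     # both: 5.4772...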
Problems
1. Linear combination: v = α1·v1 + α2·v2 + … + αn·vn
Example with v1 = (5, −2) and v2 = (−4, 4):
v = 2v2 + v1 = (−8, 8) + (5, −2) = (−3, 6)
2. Trace: Tr(UV) = Tr(VU)
3. Tr(U) = Tr(Uᵀ)
Exercise: compute the trace of [1 2; 3 4] (= 1 + 4 = 5).
Determinant (cofactor expansion along row i):
|M| = Σ_{j=1..n} (−1)^{i+j} M_ij |M^(ij)|, where M^(ij) is the minor of entry (i, j).
Exercise: compute the dot product u·v for u = (3, 4) and v = (3, −4).
Eamonn Keogh, UCR
This is a high level overview only. For details, see:
Pattern Recognition and Machine Learning, Christopher Bishop, Springer-Verlag, 2006, or
Pattern Classification by R. O. Duda, P. E. Hart, D. Stork, Wiley and Sons.
[Portrait: Thomas Bayes, 1702–1761]
We will start off with a visual intuition, before looking at the math…
Grasshoppers vs. Katydids
[Scatter plot: Antenna Length (1–10) vs. Abdomen Length (1–10) for the two insect classes.]
Remember this example? Let's get lots more data…
With a lot of data, we can build a histogram. Let us just build one for "Antenna Length" for now…
[Histograms of antenna length (1–10) for Katydids and for Grasshoppers.]
We can leave the histograms as they are, or we can summarize them with two normal distributions.
• We can just ask ourselves: given the distributions of antennae lengths we have seen, is it more probable that our insect is a Grasshopper or a Katydid?
• There is a formal way to discuss the most probable classification…
p(cj | d) = probability of class cj, given that we have observed d

Antennae length is 3: compare the two histogram heights at 3.
Antennae length is 7:
P(Grasshopper | 7) = 3 / (3 + 9) = 0.250
P(Katydid | 7) = 9 / (3 + 9) = 0.750
Antennae length is 5:
P(Grasshopper | 5) = 6 / (6 + 6) = 0.500
P(Katydid | 5) = 6 / (6 + 6) = 0.500
Bayes Classifiers
That was a visual intuition for a simple case of the Bayes classifier,
also called:
• Idiot Bayes
• Naïve Bayes
• Simple Bayes
[Photo: Drew Carey]
What is the probability of being called "Drew" given that you are a male?
p(male | drew) = p(drew | male) · p(male) / p(drew)
where p(male) is the probability of being a male, and p(drew) is the probability of being named "Drew" (actually irrelevant, since it is the same for all classes).
This is Officer Drew (who arrested me in
1997). Is Officer Drew a Male or Female?
Luckily, we have a small
database with names and sex.
Officer Drew
The probability of class cj generating instance d equals the probability of class cj generating the observed value for feature 1, multiplied by the probability of class cj generating the observed value for feature 2, multiplied by…
• To simplify the task, naïve Bayesian classifiers
assume attributes have independent distributions, and
thereby estimate
p(d|cj) = p(d1|cj) * p(d2|cj) * ….* p(dn|cj)
Officer Drew is blue-eyed, over 170 cm tall, and has long hair:
p(officer drew | Female) = 2/5 × 3/5 × …
p(officer drew | Male) = 2/3 × 2/3 × …
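A minimal naive Bayes sketch of this computation in Python (the tiny name/sex table below is hypothetical, standing in for the slide's small database):

from collections import Counter

# Hypothetical training database of (name, sex) records
db = [("drew", "male"), ("claudia", "female"), ("drew", "female"),
      ("drew", "female"), ("alberto", "male"), ("karin", "female"),
      ("nina", "female"), ("sergio", "male")]

def posterior(name):
    class_counts = Counter(sex for _, sex in db)
    scores = {}
    for c, n_c in class_counts.items():
        # p(name | c) estimated by counting, times the class prior p(c)
        p_name_given_c = sum(1 for nm, s in db if nm == name and s == c) / n_c
        scores[c] = p_name_given_c * (n_c / len(db))
    z = sum(scores.values())            # p(name): the same for all classes
    return {c: v / z for c, v in scores.items()}

print(posterior("drew"))                # e.g. {'male': 0.33..., 'female': 0.66...}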
The Naive Bayes classifiers
is often represented as this
type of graph…
cj
Note the direction of the
arrows, which state that
each class causes certain
features, with a certain
probability
Grasshoppers, Katydids, Ants
[Figure: single-sided amplitude spectrum of Y(t) — |Y(f)| vs. frequency (Hz, 0–1000) — with a peak at the wing-beat frequency (197 Hz), its harmonics, and 60 Hz interference.]
[Figures: histograms of wing-beat frequency (Hz, 0–700) for each insect class, and insect activity over the day from midnight to midnight, with dawn and dusk marked.]
Suppose I observe an insect with a wing-beat frequency of 420 Hz. What is it?
Suppose I observe an insect with a wing-beat frequency of 420 Hz at 11:00am. What is it?
(Adding the time-of-day feature can change the most probable class.)
Advantages/Disadvantages of Naïve Bayes
• Advantages:
– Fast to train (single scan). Fast to classify
– Not sensitive to irrelevant features
– Handles real and discrete data
– Handles streaming data well
• Disadvantages:
– Assumes independence of features
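As a usage sketch, scikit-learn's GaussianNB fits one Gaussian per feature per class (scikit-learn is assumed to be installed; the two-feature insect data below are synthetic stand-ins for the wing-beat example):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in: [wing-beat frequency (Hz), hour of day] -> class 0 or 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([400, 11], [30, 2], (50, 2)),
               rng.normal([500, 23], [30, 2], (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)
print(clf.predict([[420, 11]]))          # most probable class for the observed insect
print(clf.predict_proba([[420, 11]]))    # posterior class probabilities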
A Tutorial on
Inference and Learning
in Bayesian Networks
Irina Rish
1 + 2 + 2 + 4 + 4 = 13 parameters instead of 2⁵ = 32 (for a five-variable network such as Smoking, Cancer, Bronchitis, X-ray, Dyspnoea)
Example: Printer Troubleshooting
[Bayesian network over 26 variables, including: Application Output OK, Spool Process OK, Spooling On, Spooled Data OK, GDI Data Input OK, Local Disk Space Adequate, Correct Driver, Driver Settings Correct, Correct Driver Output, Correct Printer Selected, Print Data OK, Network Up, Net/Local Printing, Correct Printer Path, Net Path OK, PC-to-Printer Transport OK, Local Path OK, Local Port OK, Local Cable Connected, Net Cable Connected, Paper Loaded, Printer On and Online, Printer Data OK, Printer Memory Adequate, Print Output OK, …]
26 variables. Instead of 2²⁶ parameters we get 99 = 17×1 + 1×2¹ + 2×2² + 3×2³ + 3×2⁴ [Heckerman, 95]
"Moral" graph of a BN
Moralization algorithm:
1. Connect ("marry") the parents of each node.
2. Drop the directionality of the edges.
The resulting undirected graph is called the "moral" graph of the BN.
[Example network: Smoking → Lung Cancer, Bronchitis; Lung Cancer, Smoking → X-ray; Lung Cancer, Bronchitis → Dyspnoea; with CPDs P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B).]
Interpretation: every pair of nodes that occur together in a CPD is connected by an edge in the moral graph. The CPD for X and its k parents (called a "family") is represented by a clique of size (k+1) in the moral graph, and contains d^k·(d − 1) probability parameters, where d is the number of values each variable can have (domain size).
Conditional Independence in BNs: three types of connections
Serial (intermediate cause): Visit to Asia (A) → Tuberculosis (T) → Chest X-ray (X). Knowing T makes A and X independent.
Diverging (common cause): Lung Cancer (L) ← Smoking (S) → Bronchitis (B). Knowing S makes L and B independent.
Converging (common effect): Lung Cancer (L) → Dyspnoea (D) ← Bronchitis (B), with D → Running Marathon (M). NOT knowing D or M makes L and B independent.
d-separation
Nodes X and Y are d-separated if, on every (undirected) path between X and Y, there is some variable Z such that either:
Z is in a serial or diverging connection and Z is known, or
Z is in a converging connection and neither Z nor any of Z's descendants are known.
[Diagrams of the three cases.]
[Example networks: Smoking → Lung Cancer, Bronchitis; Age, Gender, Diet → Cancer (C1, C2) → symptom.]
Prediction: P(symptom | cause) = ?
Classification: max over class of P(class | data)
Decision-making (given a cost function)
Application areas: medicine, speech recognition, bioinformatics, text classification, computer troubleshooting, the stock market.
Application Examples
APRI system developed at AT&T Bell Labs
learns & uses Bayesian networks from data to identify customers
liable to default on bill payments
NASA Vista system
predict failures in propulsion systems
considers time criticality & suggests highest utility action
dynamically decide what information to show
Application Examples
Office Assistant in MS Office 97/ MS Office 95
Extension of Answer wizard
uses naïve Bayesian networks
help based on past experience (keyboard/mouse use) and task user is doing currently
This is the “smiley face” you get in your MS Office applications
Microsoft Pregnancy and Child-Care
Available on MSN in Health section
Frequently occurring children’s symptoms are linked to expert modules that repeatedly
ask parents relevant questions
Asks next best question based on provided information
Presents articles that are deemed relevant based on information provided
IBM's systems management applications
Machine Learning for Systems @ Watson:
End-user transaction recognition
Fault diagnosis using probes over software or hardware components (X1, X2, X3), with probe outcomes observed via Remote Procedure Calls (RPCs) along routes such as R1, R2, R3, R5
[Diagram: Transaction1 (BUY?, SELL?), Transaction2 (OPEN_DB?, SEARCH?), tests T1–T4, probe outcomes feeding on-line diagnosis.]
Issues: efficiency (scalability); missing data/noise (sensitivity analysis); "adaptive" probing (selecting the most informative probes); on-line learning/model updates; on-line diagnosis.
Goal: finding the most likely diagnosis.
Pattern discovery, classification, diagnosis and prediction.
Probabilistic Inference Tasks
Belief updating: BEL(Xi) = P(Xi = xi | evidence)
Example (the Smoking network, with X-ray and Dyspnoea): P(smoking | dyspnoea = yes) = ?

Belief updating: find P(X | evidence).
P(s | d=1) ∝ P(s, d=1) = Σ_{b,x,c} P(s)·P(c|s)·P(b|s)·P(x|c,s)·P(d=1|c,b)
Pushing the summations inside:
= P(s) Σ_b P(b|s) Σ_x Σ_c P(c|s)·P(x|c,s)·P(d=1|c,b)
This is variable elimination over the "moral" graph; the intermediate function here is f(s, d, b, x), and W* = 4.
Complexity: O(n·exp(w*)), where w* is the "induced width" (max induced clique size).
Efficient inference: variable orderings, conditioning, approximations.
Variable elimination algorithms (also called "bucket elimination")
For MPE, Σ is replaced by max:
MPE = max_{a,e,d,c,b} P(a)·P(c|a)·P(b|a)·P(d|a,b)·P(e|b,c)
Each bucket is processed with the elimination operator max_b Π:
bucket B: P(b|a), P(d|b,a), P(e|b,c) → h^B(a, d, c, e)
bucket C: P(c|a), h^B(a, d, c, e) → h^C(a, d, e)
bucket D: h^C(a, d, e) → h^D(a, e)
bucket E: e=0, h^D(a, e) → h^E(a)
bucket A: P(a), h^E(a) → the MPE probability
W* = 4, the "induced width" (max clique size).
Generating the MPE-solution (backwards pass over the buckets):
1. a' = arg max_a P(a)·h^E(a)
2. e' = 0
3. d' = arg max_d h^C(a', d, e')
4. c' = arg max_c P(c|a')·h^B(a', d', c, e')
5. b' = arg max_b P(b|a')·P(d'|b, a')·P(e'|b, c')
The width w_o(X) of a variable X in graph G along the ordering o is the number of nodes preceding X in the ordering and connected to X (its earlier neighbors).
The width w_o of a graph is the maximum width w_o(X) among all nodes.
The induced graph G' along the ordering o is obtained by recursively connecting the earlier neighbors of each node, from the last to the first in the ordering.
The width of the induced graph G' is called the induced width of the graph G (denoted w_o*).
[Example: a moral graph over A, B, C, D, E; one ordering gives w* = 4, another gives w* = 2.]
Ordering is important! But finding the min-w* ordering is NP-hard…
Inference is also NP-hard in the general case [Cooper].
Learning Bayesian Networks
Incomplete data: structural EM, mixture models.
Structure learning: Ĝ = arg max_G Score(G)
Learning parameters from complete data (overview): estimate the CPDs from the counts N_{x, pa_X}; with incomplete data, EM alternates computing expected counts E_{P(X)}[N_{x, pa_X}] (expectation) and updating the parameters (maximization).
MDL score:
MDL(BN | D) = −log P(D | Θ, G) + (|Θ|/2)·log N
            = DL(Data | model) + DL(Model)
Naive Bayes: Class → feature f1, feature f2, …, feature fn, with CPDs P(f1 | class), P(f2 | class), …, P(fn | class).
TAN: tree-augmented Naive Bayes [Friedman et al. 1997] — the same Class → features structure, plus a tree over the features. Based on the Chow-Liu Tree Method (CL) for learning trees [Chow-Liu, 1968].
Tree-structured distributions
A joint probability distribution is tree-structured if it can be written as
P(x) = Π_{i=1..n} P(x_i | x_{j(i)})
where x_{j(i)} is the parent of x_i in the Bayesian network for P(x) (a directed tree).
[Example: P(A,B,C,D,E) = P(A)·P(B|A)·P(C|A)·P(D|C)·P(E|B) — a tree with root A. A graph containing an (undirected) cycle is not a tree.]
Kullback-Leibler divergence:
D(P, P') = Σ_x P(x) log [P(x)/P'(x)]
D(P,P') is non-negative, and D(P,P') = 0 if and only if P coincides with P' (on a set of measure 1).

A known fact: given P(x), the maximum of Σ_x P(x) log P'(x) is achieved by the choice P'(x) = P(x).
Therefore, for any value of i and x_{j(i)}, the term Σ_{x_i} P(x_i | x_{j(i)}) log P'(x_i | x_{j(i)}) is maximized by choosing P'(x_i | x_{j(i)}) = P(x_i | x_{j(i)}) (and thus the total D(P, P') is minimized), which proves the Lemma.

Proof of Theorem:
Replacing P'(x_i | x_{j(i)}) = P(x_i | x_{j(i)}) in expression (1) yields
D(P, P') = −Σ_{i=1..n} Σ_{x_i, x_{j(i)}} P(x_i, x_{j(i)}) log[ P(x_i, x_{j(i)}) / P(x_{j(i)}) ] − H(X)
         = −Σ_{i=1..n} Σ_{x_i, x_{j(i)}} P(x_i, x_{j(i)}) { log[ P(x_i, x_{j(i)}) / (P(x_{j(i)})·P(x_i)) ] + log P(x_i) } − H(X)
         = −Σ_{i=1..n} I(X_i, X_{j(i)}) − Σ_{i=1..n} Σ_{x_i} P(x_i) log P(x_i) − H(X).
The last two terms are independent of the choice of the tree, and thus D(P, P') is minimized by maximizing the sum of edge weights I(X_i, X_{j(i)}).
Chow-Liu algorithm
[As presented in Pearl, 1988]
1. From the given distribution P(x) (or from data generated by P(x)), compute the joint distributions P(x_i, x_j) for all i ≠ j.
2. Using the pairwise distributions from step 1, compute the mutual information I(X_i; X_j) for each pair of nodes and assign it as the weight to the corresponding edge (X_i, X_j).
3. Compute the maximum-weight spanning tree (MWST):
a. Start from the empty tree over n variables.
b. Insert the two largest-weight edges.
c. Find the next largest-weight edge and add it to the tree if no cycle is formed; otherwise, discard the edge and repeat this step.
d. Repeat step (c) until n−1 edges have been selected (a tree is constructed).
5. The tree approximation P'(x) can be computed as a projection of P(x) onto the resulting directed tree (using the product form of P'(x)), after step 4: select an arbitrary root node and direct the edges outwards from the root.
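A compact sketch of this procedure in Python (assuming binary data in a NumPy array; the mutual-information and spanning-tree code below is a bare-bones illustration, not an optimized implementation):

import numpy as np
from itertools import combinations

def mutual_information(x, y):
    # Empirical mutual information I(X;Y) for two binary columns
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((x == a) & (y == b))
            p_a, p_b = np.mean(x == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data):
    # Steps 1-2: pairwise mutual-information edge weights
    n_vars = data.shape[1]
    weights = {(i, j): mutual_information(data[:, i], data[:, j])
               for i, j in combinations(range(n_vars), 2)}
    # Step 3: maximum-weight spanning tree (Kruskal-style, skipping cycle-forming edges)
    parent = list(range(n_vars))
    def find(u):
        while parent[u] != u:
            u = parent[u]
        return u
    tree = []
    for (i, j), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
        if len(tree) == n_vars - 1:
            break
    return tree    # step 4 (choosing a root and directing edges) is left to the caller

# Toy usage: X2 copies X0 with a little noise, X1 is independent
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 500)
noise = (rng.random(500) < 0.1).astype(int)
data = np.column_stack([x0, rng.integers(0, 2, 500), x0 ^ noise])
print(chow_liu_tree(data))               # the edge (0, 2) should appear in the tree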
Summary: learning and inference in BNs
Bayesian Networks – graphical probabilistic models
Efficient representation and inference
Expert knowledge + learning from data
Learning: parameters (parameter estimation, EM) and structure
Complexity trade-off.
[Factor graph: variables X1, X2, X3 connected to factors f1, f2, f3, f4.]
Variables: X = (X1, …, Xn), where Xi ∈ Domain_i
Factors: f1, …, fm, with each fj(X) ≥ 0
Weight(x) = Π_{j=1..m} fj(x)
Context: search problems, constraint satisfaction problems, Markov decision processes, adversarial games, Bayesian networks, machine learning.
Basics
Probabilistic programs
Inference
Joint distribution:
s  r  P(S = s, R = r)
0  0  0.20
0  1  0.08
1  0  0.70
1  1  0.02
Query: P(R | T = 1, A = 1), where R is the query and T = 1, A = 1 is the condition (S is marginalized out).
The alarm model's local conditional distributions (with noise parameter ε):
p(b) = ε·[b = 1] + (1 − ε)·[b = 0]
p(e) = ε·[e = 1] + (1 − ε)·[e = 0]
p(a | b, e) = [a = (b ∨ e)]
• Let us try to model the situation. First, we establish that there are three variables, B (burglary), E
(earthquake), and A (alarm). Next, we connect up the variables to model the dependencies.
• Unlike in factor graphs, these dependencies are represented as directed edges. You can intuitively think
about the directionality as suggesting causality, though what this actually means is a deeper question and
beyond the scope of this class.
• For each variable, we specify a local conditional distribution (a factor) of that variable given its parent
variables. In this example, B and E have no parents while A has two parents, B and E. This local
conditional distribution is what governs how a variable is generated.
• We are writing the local conditional distributions using p, while P is reserved for the joint distribution over
all random variables, which is defined as the product.
Bayesian network (alarm)
[Network: B → A ← E, with local distributions p(b), p(e), p(a | b, e).]
P(B = b, E = e, A = a) = p(b)·p(e)·p(a | b, e)
[demo: ε = 0.05]
All factors (local conditional distributions) satisfy:
Σ_{x_i} p(x_i | x_{Parents(i)}) = 1 for each x_{Parents(i)}
Implications:
• Consistency of sub-Bayesian networks
• Consistency of conditional distributions
A short calculation:
P(B = b, E = e) = Σ_a P(B = b, E = e, A = a)
               = Σ_a p(b)·p(e)·p(a | b, e)
               = p(b)·p(e)·Σ_a p(a | b, e)
               = p(b)·p(e)
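A tiny Python enumeration of this alarm network (ε = 0.05 as in the demo) that verifies the marginalization above and answers a query by enumeration:

from itertools import product

eps = 0.05
p_b = lambda b: eps if b == 1 else 1 - eps           # p(b)
p_e = lambda e: eps if e == 1 else 1 - eps           # p(e)
p_a = lambda a, b, e: 1.0 if a == (b | e) else 0.0   # p(a | b, e) = [a = b OR e]

def joint(b, e, a):
    return p_b(b) * p_e(e) * p_a(a, b, e)            # product of local conditionals

# Check: marginalizing out A gives p(b)p(e)
for b, e in product((0, 1), repeat=2):
    assert abs(sum(joint(b, e, a) for a in (0, 1)) - p_b(b) * p_e(e)) < 1e-12

# Query: P(B = 1 | A = 1) by enumeration
num = sum(joint(1, e, 1) for e in (0, 1))
den = sum(joint(b, e, 1) for b, e in product((0, 1), repeat=2))
print(num / den)   # ~0.51: a burglary is far more likely once the alarm rings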
[Diagrams: the full alarm network B → A ← E with p(b), p(e), p(a | b, e), and the sub-network over B and E obtained by removing A.]
• This property is very attractive, because it means that whenever we have a large Bayesian network, where
we don’t care about some of the variables, we can just remove them (graph operations), and this encodes
the same distribution as we would have gotten from marginalizing out variables (algebraic operations).
The former, being visual, can be more intuitive.
Consistency of local conditionals
[Three-layer network: A, B, C; D, E; F, G, H — with D's parents A and B.]
P(D = d | A = a, B = b) = p(d | a, b)
(left side: from probabilistic inference; right side: by definition)
You are coughing and have itchy eyes. Do you have a cold or allergies? [demo]
Variables: Cold, Allergies, Cough, Itchy eyes
[Diagrams: the Bayesian network C, A → H, I and the corresponding factor graph over C (Cold), A (Allergies), H (Cough), I (Itchy eyes).]
Basics
Probabilistic programs
Inference
B ∼ Bernoulli(ε)
E ∼ Bernoulli(ε)
A = B ∨ E

import random

def Bernoulli(epsilon):
    # returns True with probability epsilon
    return random.random() < epsilon
X0 = (0, 0)
For each time step i = 1, …, n:
  With probability α:     Xi = Xi−1 + (1, 0)  [go right]
  With probability 1 − α: Xi = Xi−1 + (0, 1)  [go down]
Query: what are the possible trajectories given the evidence X10 = (8, 2)?
[Diagrams: a Markov model X1 → X2 → X3 → X4; a hidden Markov model with hidden states H1 … H5 and emissions E1 … E5; and a document model in which a label such as "travel" generates the words W1, W2, …, WL (e.g., "beach", "Paris").]
• Naive Bayes is a very simple model which can be used for classification. For document classification, we
generate a label and all the words in the document given that label.
• Note that the words are all generated independently, which is not a very realistic model of language, but
naive Bayes models are surprisingly effective for tasks such as document classification.
Application: topic modeling
Question: given a text document, what topics is it about?
[Example: a document's topic distribution α, e.g., {travel: 0.8, Europe: 0.2}, generating the observed words.]
Basics
Probabilistic programs
Inference
Output
P(Q = q | E = e) for all values q
Example: P(x3 | x2 = 5) ∝ [Σ_{x1} p(x1)·p(x2 = 5 | x1)]·p(x3 | x2 = 5) ∝ p(x3 | x2 = 5)
Fast way: [whiteboard]
• Let’s first compute the query the old-fashioned way by grinding through the algebra. Then we’ll see a
faster, more graphical way, of doing this.
• We start by transforming the query into an expression that references the joint distribution, which allows
us to rewrite as the product of the local conditional probabilities. To do this, we invoke the definition of
marginal and conditional probability.
• One convenient shortcut we will take is make use of the proportional-to (∝) relation. Note that in the end,
we need to construct a distribution over X3 . This means that any quantity (such as P(X2 = 5)) which
doesn’t depend on X3 can be folded into the proportionality constant. If you don’t believe this, keep it
around to convince yourself that it doesn’t matter. Using ∝ can save you a lot of work.
• Next, we do some algebra to push the summations inside. We notice that Σ_{x4} p(x4 | x3) = 1 because it's a local conditional distribution. The factor Σ_{x1} p(x1)·p(x2 = 5 | x1) can also be folded into the proportionality constant.
• The final result is p(x3 | x2 = 5), which matches the query as we expected by the consistency of local
conditional distributions.
General strategy
Query: P(Q | E = e)
[whiteboard]
Query: P(B) — marginalize out A, E.
Query: P(B | A = 1) — condition on A = 1.
[Three-layer network: A, B, C; D, E; F, G, H.]
[whiteboard]
Query: P(C | B = b) — marginalize out everything else; note C ⊥ B.
Query: P(C, H | E = e) — marginalize out A, D, F, G; note C ⊥ H | E.
Cont…
• In another way:
• We can write Y = Xβ + ε, where Y, X, β and ε are matrices… Compute their sizes??
• For any Yi, we can say
y_i = β0 + β1·x_i1 + β2·x_i2 + … + βp·x_ip + ε_i
ε_i = y_i − Σ_{j=0..p} βj·x_ij
• Optimize:
SSE = Σ_{i=1..n} ε_i² = Σ_{i=1..n} ( y_i − Σ_{j=0..p} βj·x_ij )²
dSSE/dβ = 0
Cont…
• SSE is a scalar quantity, and can be written as εᵀε:
SSE = εᵀε = (y − Xβ)ᵀ(y − Xβ)
• Now
d(SSE)/dβ = d[(y − Xβ)ᵀ(y − Xβ)]/dβ = 0
−2Xᵀ(y − Xβ) = 0
XᵀXβ = Xᵀy
(XᵀX)⁻¹XᵀXβ = (XᵀX)⁻¹Xᵀy
β = (XᵀX)⁻¹Xᵀy
Do it yourself…
• Problem ….
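As a sketch, the closed-form solution in NumPy (the design matrix and coefficients are made up; np.linalg.solve is used instead of forming the inverse explicitly):

import numpy as np

# Hypothetical data: n observations, intercept plus p = 2 predictors
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n),             # column of 1s for beta_0
                     rng.normal(size=n),
                     rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(0, 0.1, n)    # y = X beta + epsilon

# Normal equations: X^T X beta = X^T y  =>  beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                              # close to [1.0, 2.0, -0.5]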
MULTIPLE LINEAR REGRESSION
Part II
Dr. Ranjan Maity
CIT Kokrajhar
REQUIREMENTS
More than one independent variables
One dependent variable
Rule of Thumb: minimum number of observations = 10 × number of independent variables
ASSUMPTIONS
Independence
Normality
Homoscedasticity
Linearity
THE R AND BETA
R = the magnitude of the relationship between the dependent variable and the best linear
combination of the predictor variables
R2 = the proportion of variation in Y accounted for by the set of independent variables (X’s).
Testing R²
Test R² through an F test.
Test each partial regression coefficient (b) by a t-test.
Test competing models (the difference between R²s) through an F test of the difference of R²s.
Compare partial regression coefficients with each other: a t-test of the difference between standardized partial regression coefficients (β).
• Linear regression in R
Regression: Y ~ X1 + X2 + X3
Linear Regression is a Probabilistic Model
y = β0 + β1·x
[Figure: a line with intercept β0 and slope β1 = Δy/Δx.]
• But we're interested in understanding the relationship between variables related in a nondeterministic fashion.
A Linear Probabilistic Model
• Definition: There exist parameters β0, β1, and σ², such that for any fixed value of the independent variable x, the dependent variable is related to x through the model equation
y = β0 + β1·x + ε
• ε is a random variable assumed to be N(0, σ²).

True Regression Line
[Figure: the line y = β0 + β1·x with error terms ε1, ε2, ε3 at sample points.]
Implications
• The expected value of Y is a linear function of X, but for fixed x, the variable Y differs from its expected value by a random amount.
Graphical Interpretation
[Figure: the line y = β0 + β1·x, with conditional means µ_{Y|x1} = β0 + β1·x1 and µ_{Y|x2} = β0 + β1·x2 at x1 and x2.]
One More Example
Suppose the relationship between the independent variable height (x) and dependent variable weight (y) is described by a simple linear regression model with true regression line y = 7.5 + 0.5x and σ = 3.
• Q1: What is the interpretation of β1 = 0.5?
The expected change in weight associated with a 1-unit increase in height.
• Q2: If x = 20, what is the expected value of Y?
µ_{Y|x=20} = 7.5 + 0.5(20) = 17.5
"ˆ = y # "ˆ x
0 1
!
Predicted and Residual Values
• Predicted, or fitted, values are values of y predicted by the least-squares regression line obtained by plugging x1, x2, …, xn into the estimated regression line:
ŷ1 = β̂0 + β̂1·x1
ŷ2 = β̂0 + β̂1·x2
• Residuals are the deviations of observed from predicted values:
e1 = y1 − ŷ1
e2 = y2 − ŷ2
[Figure: residuals e1, e2, e3 shown as vertical distances from the data points to the fitted line.]
Residuals Are Useful!
• They allow us to calculate the error sum of squares (SSE):
SSE = Σ_{i=1..n} (e_i)² = Σ_{i=1..n} (y_i − ŷ_i)²
Multiple Linear Regression
• Extension of the simple linear regression model to two or more independent variables:
y = β0 + β1·x1 + β2·x2 + … + βn·xn + ε
• Partial Regression Coefficients: βi ≡ the effect on the dependent variable of increasing the ith independent variable by 1 unit, holding all other predictors constant.
Categorical Independent Variables
• Qualitative variables are easily incorporated in the regression framework through dummy variables, e.g., for genotypes:
Genotype  x1  x2
AA        1   0
AG        0   1
GG        0   0
Hypothesis Testing: Model Utility Test (or Omnibus Test)
• The first thing we want to know after fitting a model is whether any of the independent variables (X's) are significantly related to the dependent variable (Y).
Equivalent ANOVA Formulation of Omnibus Test

Source  d.f.    Sum of squares         Mean square
Error   n − 2   SSE = Σ(y_i − ŷ_i)²   SSE/(n − 2)
Total   n − 1   SST = Σ(y_i − ȳ)²
• Again, another example of ANOVA — comparing nested models:
SSE_R = error sum of squares for the reduced model with l predictors
SSE_F = error sum of squares for the full model with k predictors
f = [(SSE_R − SSE_F)/(k − l)] / [SSE_F/(n − (k + 1))]
Example of Model Comparison
• We have a quantitative trait and want to test the effects at two markers, M1 and M2.
Testing an individual coefficient, H_A: β̂_i ≠ 0:
T = (β̂_i − β_i) / se(β̂_i)
• Confidence intervals are equally easy to obtain.
House Cost vs. House Size
[Figure: house cost rising with house size.]
However, house cost varies even among same-size houses! Since cost behaves unpredictably, we add a random component.
The first order linear model
Y = β0 + β1·X + ε
Y = dependent variable; X = independent variable; β0 = Y-intercept; β1 = slope of the line; ε = error variable.
β0 and β1 are unknown population parameters, and are therefore estimated from the data.
[Figure: the line with intercept β0; β1 = Rise/Run.]
Estimating the Coefficients
Question: What should be considered a good line?
[Figure: a scatter of points with candidate lines.]
The Least Squares (Regression) Line
To calculate the estimates of the line coefficients that minimize the differences between the data points and the line, use the formulas:
b1 = cov(X,Y)/s_X² = s_XY/s_X²
b0 = Ȳ − b1·X̄
The regression equation that estimates the equation of the first order linear model is:
Ŷ = b0 + b1·X
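A minimal Python sketch of these two formulas (made-up data; ddof=1 so cov and variance match the sample formulas above):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # made-up (x, y) sample
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)   # b1 = cov(X,Y) / s_X^2
b0 = y.mean() - b1 * x.mean()                      # b0 = Ybar - b1 * Xbar
print(f"Y_hat = {b0:.3f} + {b1:.3f} X")
print(np.polyfit(x, y, 1))                         # cross-check: [slope, intercept]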
The Simple Linear Regression Line
• Example 17.2 (Xm17-02)
A car dealer wants to find the relationship between the odometer reading and the selling price of used cars. A random sample of 100 cars is selected, and the data recorded. Find the regression line.

Car   Odometer (X, independent)   Price (Y, dependent)
1     37388                        14636
2     44758                        14122
3     45833                        14016
4     30862                        15590
5     31705                        15568
6     34010                        14718
…     …                            …
• Solution
– Solving by hand: calculate a number of statistics (n = 100):
X̄ = 36,009.45;  s_X² = Σ(X_i − X̄)²/(n − 1) = 43,528,690
Ȳ = 14,822.82;  cov(X,Y) = Σ(X_i − X̄)(Y_i − Ȳ)/(n − 1) = −2,712,511
b1 = cov(X,Y)/s_X² = −2,712,511/43,528,690 = −.06232
b0 = Ȳ − b1·X̄ = 14,822.82 − (−.06232)(36,009.45) = 17,067
Ŷ = b0 + b1·X = 17,067 − .0623X
• Solution – continued
– Using the computer (Xm17-02):

Regression Statistics
Multiple R          0.8063
R Square            0.6501
Adjusted R Square   0.6466
Standard Error      303.1
Observations        100

ANOVA
            df   SS        MS        F       Significance F
Regression  1    16734111  16734111  182.11  0.0000
Residual    98   9005450   91892
Total       99   25739561

Ŷ = 17,067 − .0623X
[Scatter plot: Price (13000–16000) vs. Odometer with the fitted line; no data near Odometer = 0.]
[Figure: at X1, X2, X3 the distribution of Y is normal around the line β0 + β1·X, with means µ1, µ2, µ3.]
From the first three assumptions we have: Y is normally distributed with mean E(Y) = β0 + β1·X, and a constant standard deviation σε.
Assessing the Model
SSE = Σ_{i=1..n} (Y_i − Ŷ_i)²
– A shortcut formula:
SSE = (n − 1)·[ s_Y² − cov(X,Y)²/s_X² ]
Standard Error of Estimate
The mean error is equal to zero. If σε is small, the errors tend to be close to zero (close to the mean error), and the model fits the data well. Therefore, we can use σε as a measure of the suitability of using a linear model. An estimator of σε is given by sε:
sε = √( SSE/(n − 2) ) = √( Σ(Y_i − Ŷ_i)²/(n − 2) )
Calculated before: s_Y² = Σ(Y_i − Ȳ)²/(n − 1) = 259,996
SSE = (n − 1)·[ s_Y² − cov(X,Y)²/s_X² ] = 99·[ 259,996 − (−2,712,511)²/43,528,690 ] = 9,005,450
sε = √( SSE/(n − 2) ) = √( 9,005,450/98 ) = 303.13
It is hard to assess the model based on sε even when compared with the mean value of Y: sε = 303.1, ȳ = 14,823.
Testing the Slope
When no linear relationship exists between two variables, the regression line should be horizontal. Test H0: β1 = 0 with
t = (b1 − β1)/s_b1,  where  s_b1 = sε/√( (n − 1)·s_X² )
If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n − 2.

Here b1 = −.0623 and
s_b1 = sε/√( (n − 1)·s_X² ) = 303.1/√( (99)(43,528,690) ) = .00462
t = (b1 − β1)/s_b1 = (−.0623 − 0)/.00462 = −13.49
The rejection region is t > t.025 or t < −t.025 with ν = n − 2 = 98. Approximately, t.025 = 1.984.
Xm17-02
• Using the computer:

            Coefficients   Standard Error   t Stat    P-value
Intercept   17067          169              100.97    0.0000
Odometer    −0.0623        0.0046           −13.49    0.0000

There is overwhelming evidence to infer that the odometer reading affects the auction selling price.
Coefficient of Determination
To measure the strength of the linear relationship we use the coefficient of determination:
R² = cov(X,Y)²/(s_X²·s_Y²)   (or, equivalently, R² = r_XY²);
or R² = 1 − SSE/Σ(Y_i − Ȳ)²
• To understand the significance of this coefficient, note:
[Figure: two data points (X1,Y1) and (X2,Y2) of a certain sample, showing the error around the line.]
Total variation in Y = variation explained by the regression line + unexplained variation (error)
R² = 1 − SSE/Σ(Y_i − Ȳ)² = [Σ(Y_i − Ȳ)² − SSE]/Σ(Y_i − Ȳ)² = SSR/Σ(Y_i − Ȳ)²
R² = cov(X,Y)²/(s_X²·s_Y²) = (−2,712,511)²/[(43,528,690)(259,996)] = .6501
– Using the computer
From the regression output (R Square = 0.6501): 65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
[Handwritten notes, partially legible:]
NOTE: a matrix product (a×b)·(b×c) is defined only when the inner dimensions match (the compatibility condition), e.g., a 2×4 matrix times a 4×2 matrix.
LINEAR FUNCTION: F(cx) = c·F(x) and F(x+y) = F(x) + F(y); e.g., F(x) = 4x is linear, while F(x) = 2x² is NOT LINEAR.
MATRIX; TENSOR: an n-dimensional array, generally of order greater than or equal to 3, e.g., a colour image (24-bit pixels).