FUNDAMENTALS IN AI AND ML
On the Verge of Major Breakthroughs
Work Economy
Medicine
Mobility
Fundamental in AI and ML 2
Applications of AI
Fundamental in AI and ML 3
But, what is AI?
AI can be broadly defined as technology that can learn and produce intelligent behavior.
Example (Computer Vision): input pixels → AI process → output "Tuberculosis".
Fundamental in AI and ML 4
Applications of AI
AI can be broadly defined as technology that can learn and produce intelligent behavior.
Example (Speech Recognition): input audio clip → AI process → output "I feel some eye pain".
Fundamental in AI and ML 6
Artificial Intelligence
Fundamental in AI and ML 7
Artificial Intelligence
Fundamental in AI and ML 8
AI in real time systems
• Search engines (such as Google Search), targeting online advertisements,
recommendation systems (offered by Netflix, YouTube or Amazon),
• Driving internet traffic,
• Targeted advertising (AdSense, Facebook),
• Virtual assistants (such as Siri or Alexa),
• Autonomous vehicles (including drones and self-driving cars),
• Automatic language translation (Microsoft Translator, Google Translate),
• Facial recognition (Apple's Face ID or Microsoft's DeepFace),
• Image labeling (used by Facebook, Apple's iPhoto and TikTok) and spam
filtering.
Fundamental in AI and ML 9
Agents in Artificial Intelligence
An AI system can be defined as the study of the rational agent and its environment. The
agents sense the environment through sensors and act on their environment through actuators.
An AI agent can have mental properties such as knowledge, belief, intention, etc.
What is an Agent?
An agent can be anything that perceives its environment through sensors and acts upon that environment through actuators. An agent runs in a cycle of perceiving, thinking, and acting. An agent can be:
• Human agent: a human agent has eyes, ears, and other organs that work as sensors, and hands, legs, and a vocal tract that work as actuators.
Fundamental in AI and ML 11
Agents in Artificial Intelligence
Effectors: Effectors are the devices which affect the environment. Effectors
can be legs, wheels, arms, fingers, wings, fins, and display screen.
Fundamental in AI and ML 12
Intelligent Agent
Fundamental in AI and ML 13
Rational Agent
• A rational agent is an agent that has clear preferences, models uncertainty, and acts in a way that maximizes its performance measure over all possible actions.
Fundamental in AI and ML 15
Structure of an AI Agent
The task of AI is to design an agent program which implements the agent function. The structure of an intelligent agent is a combination of architecture and agent program. The agent function can be viewed as a mapping from percept sequences to actions:
f : P* → A
PEAS is a type of model on which an AI agent works. When we define an AI agent or rational agent, we can group its properties under the PEAS representation model. It is made up of four terms:
• P: Performance measure
• E: Environment
• A: Actuators
• S: Sensors
Here the performance measure is the objective for the success of an agent's behavior.
Fundamental in AI and ML 17
PEAS for self-driving cars:
For a self-driving car, the PEAS representation is:
• Performance measure: safety, time, legal driving, comfort
Fundamental in AI and ML 18
Example of Agents with their PEAS representation
Agent: Medical diagnosis system
• Performance measure: healthy patient
• Environment: hospital
• Actuators: prescription
• Sensors: symptoms
Fundamental in AI and ML 19
Example of Agents with their PEAS representation
Agent: Subject tutoring system
• Performance measure: maximize scores
• Environment: classroom, desk, chair, board, staff, students
• Actuators: smart displays, corrections
• Sensors: eyes, ears, notebooks
Fundamental in AI and ML 20
Example of Agents with their PEAS representation
Agent: Vacuum cleaner
• Performance measure: cleanness, efficiency, battery life, security
• Environment: room, table, wood floor, carpet, various obstacles
• Actuators: wheels, brushes, vacuum extractor
• Sensors: camera, dirt detection sensor, cliff sensor, bump sensor, infrared wall sensor
Fundamental in AI and ML 21
Object Detection
Fundamental in AI and ML 22
Activity Recognition
Fundamental in AI and ML 23
Semantic Segmentation
Fundamental in AI and ML 24
Disease Detection
Fundamental in AI and ML 25
Image colorization
Fundamental in AI and ML 26
Style Transfer
Fundamental in AI and ML 27
Style transfer
Fundamental in AI and ML 28
Lip Sync
Fundamental in AI and ML 29
Image to Image Translation
Fundamental in AI and ML 30
Why interest in AI?
Fundamental in AI and ML 31
Agents
Fundamental in AI and ML 33
Agent / Robot: e.g., the vacuum-cleaner world (iRobot Corporation, founded by Rodney Brooks, MIT).
The right action is the one that will cause the agent to be most successful.
Performance measure of a game-playing agent: win/loss percentage (maximize), robustness, unpredictability (to "confuse" the opponent), etc.
Fundamental in AI and ML 35
Rational Agents
Fundamental in AI and ML 38
Characterizing a Task Environment
Fundamental in AI and ML 40
Task Environments
Fundamental in AI and ML 41
Task Environments
2) Deterministic / Stochastic
○ An environment is deterministic if the next state of the environment
is completely determined by the current state of the environment and
the action of the agent;
○ In a stochastic environment, there are multiple, unpredictable
outcomes. (If the environment is deterministic except for the actions
of other agents, then the environment is strategic).
In a fully observable, deterministic environment, the agent need not
deal with uncertainty.
Note: Uncertainty can also arise because of computational
limitations. E.g., we may be playing an omniscient (“all knowing”)
opponent but we may not be able to compute his/her moves.
Fundamental in AI and ML 43
Task Environments
3) Episodic / Sequential
4) Static / Dynamic
○ An environment is semi-dynamic if the environment itself does not change with the passage of time but the agent's performance score does.
Fundamental in AI and ML 45
Task Environments
5) Discrete / Continuous
○ If the number of distinct percepts and actions is limited, the environment is
discrete, otherwise it is continuous.
Fundamental in AI and ML 48
Exercise on Task environment
Fundamental in AI and ML 49
Agents and Environment
• Drawbacks:
– Huge table (often simply too large)
– Takes a long time to build/learn the table
Fundamental in AI and ML 52
I) Table-lookup driven agents
Toy example: the vacuum world.
Percepts: the robot senses its location and "cleanliness."
So, location and contents, e.g., [A, Dirty], [B, Clean].
With 2 locations, we get 4 different possible sensor
inputs.
Actions: Left, Right, Suck, NoOp
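As an illustrative sketch (not from the slides), a table-driven agent for this vacuum world can be written directly from the percept and action sets above; the table contents below are assumptions covering only length-1 percept sequences.
# Minimal table-driven agent for the two-location vacuum world.
# The percept is (location, status); the table maps percept sequences to actions.
table = {
    (("A", "Dirty"),): "Suck",
    (("A", "Clean"),): "Right",
    (("B", "Dirty"),): "Suck",
    (("B", "Clean"),): "Left",
}

percept_sequence = []

def table_driven_agent(percept):
    """Append the percept and look up the whole sequence in the table."""
    percept_sequence.append(percept)
    return table.get(tuple(percept_sequence), "NoOp")

print(table_driven_agent(("A", "Dirty")))  # -> Suck
The table grows with the length of the percept sequence, which is exactly the drawback quantified on the next slide.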
Fundamental in AI and ML 53
Table driven agent
Fundamental in AI and ML 54
Table Lookup
An action sequence of length K gives 4^K different possible percept sequences, and at least that many entries are needed in the table. So even in this very small toy world, with K = 20, you need a table with over 4^20 > 10^12 entries.
Fundamental in AI and ML 57
II) Simple reflex agents
Fundamental in AI and ML 58
II) Simple reflex agents
Fundamental in AI and ML 59
II) Simple reflex agents
Closely related to “behaviorism” (psychology; quite effective in explaining
lower-level animal behaviors, such as the behavior of ants and mice).
The Roomba robot largely behaves like this. Behaviors are robust and can be
quite effective and surprisingly complex.
But how does complex behavior arise from simple reflex behavior? E.g., ant colonies and beehives are quite complex.
See work by the A-life (artificial life) community, and Wolfram's cellular automata.
Fundamental in AI and ML 60
III) --- Model-based reflex agents
Fundamental in AI and ML 61
III) --- Model-based reflex agents
Module:
Logical Agents How detailed?
Representation and Reasoning: Part III/IV R&N
Fundamental in AI and ML 62
III) --- Model-based agents
Fundamental in AI and ML 63
III) --- Model-based agents An example: Brooks’
Subsumption Architecture
Main idea: build complex, intelligent robots by decomposing behaviors
into a hierarchy of skills, each defining a percept-action cycle for one
very specific task.
Examples: collision avoidance, wandering, exploring, recognizing
doorways, etc.
Each behavior is modeled by a finite-state machine with a few states
(though each state may correspond to a complex function or module;
provides internal state to the agent).
Behaviors are loosely coupled via asynchronous interactions.
Note: minimal internal state representation. p. 1003 R&N
Fundamental in AI and ML 64
III) --- Model-based agents An example: Brooks’
Subsumption Architecture
In subsumption architecture, increasingly complex behaviors arise from
the combination of simple behaviors.
The most basic simple behaviors are on the level of reflexes: • avoid an
object; • go toward food if hungry, • move randomly.
Fundamental in AI and ML 66
How much of an internal model of the world?
Brooks (mid 80s and 90s) challenged this view:
The philosophy behind Subsumption Architecture is that the world
should be used as its own model. According to Brooks, storing models of
the world is dangerous in dynamic, unpredictable environments because
representations might be incorrect or outdated. What is needed is the
ability to react quickly to the present. So, use minimal internal state
representation, complement at each time step with sensor input.
Debate continues to this day: How much of our world do we (should we)
represent explicitly? Subsumption architecture worked well in robotics.
Fundamental in AI and ML 67
IV) Goal-based agents
Key difference wrt Model-Based Agents:
In addition to state information, have goal information that describes
desirable situations to be achieved.
Agent keeps track of the world state as well as set of goals it’s trying to achieve: chooses
actions that will (eventually) lead to the goal(s).
More flexible than reflex agents; may involve search and planning.
Fundamental in AI and ML 69
V) Utility-based agents
When there are multiple possible alternatives, how to decide which one is
best?
Goals are qualitative: A goal specifies a crude distinction between a happy
and unhappy state, but often need a more general performance measure
that describes “degree of happiness.”
Utility function U: State → R indicating a measure of success or happiness
when at a given state.
Important for making tradeoffs: allows decisions comparing choices between conflicting goals, and between likelihood of success and importance of a goal (if achievement is uncertain).
Use decision-theoretic models: e.g., faster vs. safer.
V) Utility-based agents (Module: Decision Making)
VI) --- Learning agents
Adapt and improve over time.
More complicated when the agent needs to learn utility information: reinforcement learning (based on action payoff). (Module: Learning)
Fundamental in AI and ML 75
Summary: Agent Types
● An agent program maps from percept to action and updates its internal
state.
○ Reflex agents (simple / model-based) respond immediately to percepts.
○ Goal-based agents act in order to achieve their goal(s), possible
sequence of steps.
○ Utility-based agents maximize their own utility function.
○ Learning agents improve their performance through learning.
● Representing knowledge is important for successful agent design.
● The most challenging environments are partially observable, stochastic,
sequential, dynamic, and continuous, and contain multiple intelligent
agents.
Fundamental in AI and ML 76
Fundamental in AI and ML 77
Searching for a (shortest / least cost) path to goal
state(s)
Note: 1) Here we only check a node for possibly being a goal state after we select the node for expansion.
2) A "node" is a data structure containing the state plus additional info (parent node, etc.).
Tree-search algorithm- Example
Node selected for expansion.
Fundamental in AI and ML 80
Tree-search algorithm- Example
Nodes added to tree.
Fundamental in AI and ML 81
Tree-search algorithm- Example
Selected for expansion.
Added to tree.
Note:
1) Uses “explored” set to avoid visiting already explored states.
2) Uses “frontier” set to store states that remain to be explored and
expanded.
3) However, with, e.g., uniform-cost search, we need to make a special check when a node (i.e., state) is on the frontier. Details later.
Fundamental in AI and ML 83
Implementation: states vs. nodes
The fringe is the collection of nodes that have been generated but not (yet) expanded. Each node of the fringe is a leaf node.
The Expand function creates new nodes, filling in the various fields and using the SuccessorFn of the problem to create the corresponding states.
Fundamental in AI and ML 84
Implementation: General Tree Search
Fundamental in AI and ML 85
Search Strategies
A search strategy is defined by picking the order of node expansion.
Fundamental in AI and ML 86
Uninformed Search Strategies
Uninformed (blind) search strategies use only the information available
in the problem definition:
○ Breadth-first search
○ Uniform-cost search
○ Depth-first search
○ Depth-limited search
○ Iterative deepening search
○ Bidirectional search
Key issue: type of queue used for the fringe of the search tree (collection
of tree nodes that have been generated but not yet expanded)
Fundamental in AI and ML 87
Breadth-First Search
Gives queue: <B, C>.
Fundamental in AI and ML 88
Breadth-First Search
Select B from the front of queue <B, C> and expand. Gives <C, D, E>.
Breadth-First Search
Fundamental in AI and ML 90
Breadth-First Search
Fundamental in AI and ML 92
Properties of breadth-first search
Fundamental in AI and ML 94
Uniform Cost Search
Fundamental in AI and ML 95
Uniform Cost Search
Fundamental in AI and ML 96
Uniform Cost Search
Fundamental in AI and ML 97
Uniform Cost Search
Expand the least-cost (path cost so far) unexpanded node (e.g., useful for finding the shortest path on a map).
Implementation:
○ fringe = queue ordered by path cost
So, B next.
Fundamental in AI and ML 100
Depth-First Search
Expanding B gives stack: <D, E, C>. So, D next.
Expanding D gives stack: <H, I, E, C>. So, H next. Etc.
Fundamental in AI and ML 102
Iterative deepening search
The idea was a breakthrough in game playing; essentially all game-tree search uses iterative deepening nowadays. What is the added advantage in games? Its "anytime" nature.
Fundamental in AI and ML 121
Iterative deepening search
Iterative deepening is the preferred uninformed search method when there is a large search space and the depth of the solution is not known.
Number of nodes generated in an iterative deepening search to depth d, for b = 10, d = 5:
○ N(BFS) = 10 + 100 + 1,000 + 10,000 + 100,000 = 111,110
○ N(IDS) = 50 + 400 + 3,000 + 20,000 + 100,000 = 123,450
Fundamental in AI and ML 122
Iterative deepening search
Complete? Yes
(b finite)
Space? O(bd)
Bidirectional search:
• Checking a node for membership in the other search tree can be done in constant time (hash table).
• Key limitation: space O(b^(d/2)).
• Also, how to search backwards can be an issue (e.g., in Chess). What's tricky? Problem: lots of states satisfy the goal, and we don't know which one is relevant.
• Aside: the predecessor of a node should be easily computable (i.e., actions should be easily reversible).
Fundamental in AI and ML 126
Repeated States
Failure to detect repeated states can
turn linear problem into an
exponential one!
Problems in which actions are reversible (e.g., routing problems or sliding-block puzzles) are prone to repeated states. Also in, e.g., Chess: hash tables are used to check for repeated states; huge tables (100M+ entries) but very useful.
See Tree-Search vs. Graph-Search. Need to be careful to maintain (path) optimality and completeness.
Fundamental in AI and ML 127
Bi directional Search - Example
In the search tree below, the bidirectional search algorithm is applied. The algorithm divides one graph/tree into two sub-graphs. It starts traversing from node 1 in the forward direction and from goal node 16 in the backward direction.
• FOL not only assumes that the world contains facts (as PL does), but it also assumes the following:
• Objects: A, B, people, numbers, colors, wars, theories, squares, pits, etc.
Fundamental in AI and ML 156
First Order Predicate Logic
• Relations: unary relations such as red, round, or n-ary relations such as sister of, brother of, etc.
• Functions: father of, best friend, third inning of, end of, etc.
Example:
John and Michael are colleagues → Colleagues(John, Michael)
German Shepherd is a dog → Dog(German Shepherd)
Example:
1. Colleague(Oliver, Benjamin) ∧ Colleague(Benjamin, Oliver)
2. "x is an integer"
Fundamental in AI and ML 162
Atomic and Complex Sentences in FOL:
It has two parts;
First, x is the subject.
Second, “is an integer” is called a predicate.
Explanation:
So, in logical notation it can be written as: ∀x student(x) → likes(x, Educative).
This can be interpreted as: for every x, where x is a student, x likes Educative.
Fundamental in AI and ML 166
Existential Quantifiers:
Existential quantifiers are used to express that the statement within their
scope is true for at least one instance of something.
Example: Some people like Football: ∃x person(x) ∧ likes(x, Football).
It can be interpreted as: there is some x where x is a person who likes football.
To know (properties of a game):
• Deterministic or stochastic?
• Teams or individuals?
• Turn-taking or simultaneous?
• Zero sum?
Saviour: Alpha-Beta Pruning
Alpha bound of a node: the maximum current value of all Max ancestors of the node. Exploration of a Min node is stopped when its value equals or falls below alpha.
Beta bound of a node: the minimum current value of all Min ancestors of the node. Exploration of a Max node is stopped when its value equals or exceeds beta.
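A minimal sketch of minimax with alpha-beta pruning over an explicit game tree; the tree and leaf values below are assumptions for illustration, not from the slides.
# Minimax with alpha-beta pruning on a nested-list game tree (leaves are ints).
# Exploration stops when alpha >= beta, matching the bound definitions above.
def alphabeta(node, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    if isinstance(node, int):              # leaf: static evaluation
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:              # beta cut-off
                break
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:              # alpha cut-off
                break
        return value

# Hypothetical 2-ply tree: root is Max, children are Min nodes over leaf values.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree))  # -> 3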
Fundamental in AI and ML 177
Fundamental in AI and ML 178
What can we prune?
For chess: only about 35^50 nodes instead of 35^100 (in the best case)!
• Advantages:
– Use very little memory
– Can often find reasonable solutions in large or infinite (continuous) state spaces.
• Random Sampling
– Generate a state randomly
• Random Walk
– Randomly pick a neighbor of the current state
current ← MAKE-NODE(INITIAL-STATE[problem])
loop do
neighbor ← a highest valued successor of current
if VALUE [neighbor] ≤ VALUE[current] then return STATE[current]
current ← neighbor
min version will reverse inequalities and look for lowest valued successor
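A small runnable Python version of the hill-climbing loop above (a sketch; the toy objective and neighbor function are assumptions for illustration).
# Hill climbing: repeatedly move to the highest-valued neighbor until no improvement.
def hill_climb(start, value, neighbors):
    current = start
    while True:
        best = max(neighbors(current), key=value, default=None)
        if best is None or value(best) <= value(current):
            return current                 # local maximum (or shoulder)
        current = best

# Toy example: maximize f(x) = -(x - 7)^2 over integer steps of +/- 1.
f = lambda x: -(x - 7) ** 2
step = lambda x: [x - 1, x + 1]
print(hill_climb(0, f, step))  # -> 7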
Fundamental in AI and ML 213
Hill-climbing Search
• “a loop that continuously moves towards increasing value”
– terminates when a peak is reached
– Aka greedy local search
• Value can be either
– Objective function value
– Heuristic function value (minimized)
• State
– All 8 queens on the board in some configuration
• Successor function
– move a single queen to another square in the same column.
• Is this a solution?
• What is h?
Fundamental in AI and ML 219
Hill-climbing on 8-queens
• However…
– Takes only 4 steps on average when it succeeds
– And 3 on average when it gets stuck
– (for a state space with 8^8 ≈ 17 million states)
Fundamental in AI and ML 220
Hill Climbing Drawbacks
• Local maxima
• Plateaus
• Diagonal ridges
Escaping Shoulders: Sideways Move
• If no downhill (uphill) moves, allow sideways moves in hope that
algorithm can escape
– Need to place a limit on the possible number of sideways moves to
avoid infinite loops
• For 8-queens
– Now allow sideways moves with a limit of 100
– Raises the percentage of problem instances solved from 14% to 94%
– However…
• 21 steps on average for every successful solution
• 64 steps on average for each failure
Fundamental in AI and ML 222
Tabu Search
Properties:
– As the size of the tabu list grows, hill-climbing will asymptotically
become “non-redundant” (won’t look at the same state twice)
– In practice, a reasonable sized tabu list (say 100 or so) improves the
performance of hill climbing in many problems
Fundamental in AI and ML 223
Escaping Shoulders/local Optima Enforced Hill
Climbing
• Perform breadth-first search from a local optimum
– to find the next state with a better h value
• Typically,
– prolonged periods of exhaustive search
– bridged by relatively quick periods of hill-climbing
• Stochastic hill-climbing
– Random selection among the uphill moves.
– The selection probability can vary with the steepness of the uphill
move.
When the state-space landscape has local minima, any search that moves only
in the greedy direction cannot be complete
current ← MAKE-NODE(INITIAL-STATE[problem])
for t ← 1 to ∞ do
    T ← schedule[t]
    if T = 0 then return current
    next ← a randomly selected successor of current
    ΔE ← VALUE[next] − VALUE[current]
    if ΔE > 0 then current ← next
    else current ← next only with probability e^(ΔE/T)
Fundamental in AI and ML 232
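A compact runnable sketch of the schedule-driven simulated-annealing loop above; the cooling schedule and toy objective are assumptions for illustration.
import math, random

# Simulated annealing following the pseudocode above (maximization).
def simulated_annealing(start, value, neighbors, schedule, max_steps=10000):
    current = start
    for t in range(1, max_steps):
        T = schedule(t)
        if T <= 0:
            return current
        nxt = random.choice(neighbors(current))
        dE = value(nxt) - value(current)
        if dE > 0 or random.random() < math.exp(dE / T):
            current = nxt                  # accept uphill, or downhill with prob e^(dE/T)
    return current

# Toy example: maximize f(x) = -(x - 7)^2 with a geometric cooling schedule.
f = lambda x: -(x - 7) ** 2
step = lambda x: [x - 1, x + 1]
cooling = lambda t: 10 * (0.95 ** t)
print(simulated_annealing(0, f, step, cooling))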
How to convert a fitness value into a probability of being in the next generation:
probability of being in the next generation = fitness_i / (Σ_i fitness_i)
• Fitness function: number of non-attacking queen pairs (min = 0, max = 8 × 7 / 2 = 28)
• Σ_i fitness_i = 24 + 23 + 20 + 11 = 78
• P(pick child_1 for next generation) = fitness_1 / (Σ_i fitness_i) = 24/78 ≈ 31%
• P(pick child_2 for next generation) = fitness_2 / (Σ_i fitness_i) = 23/78 ≈ 29%; etc.
Fundamental in AI and ML 249
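A short sketch of fitness-proportional (roulette-wheel) selection, using the example fitness values above; the child names are placeholders.
import random

# Fitness-proportional selection: P(select) = fitness / sum(fitness).
def roulette_select(population, fitnesses, k):
    total = sum(fitnesses)
    weights = [f / total for f in fitnesses]
    return random.choices(population, weights=weights, k=k)

# Example with the fitness values from the 8-queens slide above.
children = ["child_1", "child_2", "child_3", "child_4"]
fitness = [24, 23, 20, 11]            # sum = 78; P(child_1) = 24/78 ≈ 31%
print(roulette_select(children, fitness, k=4))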
Genetic algorithms
Has the effect of “jumping” to a completely different new part of the search space (quite
non-local)
• Negative points
– Large number of “tunable” parameters
• Difficult to replicate performance from one problem to another
– Lack of good empirical studies comparing to simpler methods
– Useful on some (small?) set of problems, but no convincing evidence that GAs are better than hill-climbing with random restarts in general
Fundamental in AI and ML 251
MODULE – 3 MACHINE LEARNING
HOWEVER
• To get really useful results, you need good mathematical intuitions about
certain general machine learning principles, as well as the inner workings of
the individual algorithms.
When the process is repeated a large number of times, each outcome occurs
with a characteristic relative frequency, or probability.
Example: You need a model for how people's heights are distributed. You choose a normal distribution (bell-shaped curve) to represent the expected relative probabilities.
Fundamental in AI and ML 259
Probability spaces
A probability space is a random process or experiment with three components:
I. Ω, the set of possible outcomes O
• number of possible outcomes = | Ω | = N
1. Non-negativity:
for any event E ∈ F, p( E ) ≥ 0
• example: E = ( O ∈ { HHT, HTH, THH } ), i.e. exactly two flips are heads
• example: E = ( O ∈ { THT, TTT } ), i.e. the first and third flips are tails
[Figure: example distributions, e.g., the sum of two fair dice (discrete) and the waiting time between eruptions of Old Faithful in minutes (continuous).]
p( X = x, Y = y )
Fundamental in AI and ML 269
Example of multivariate distribution
• Marginal probability
• Probability distribution of a single variable in a joint distribution
• Example: two random variables X and Y:
p( X = x ) = ∑b=all values of Y p( X = x, Y = b )
• Conditional probability
• Probability distribution of one variable given that another variable takes a certain value
• Example: two random variables X and Y:
p( X = x | Y = y ) = p( X = x, Y = y ) / p( Y = y )
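An illustrative sketch of computing marginal and conditional probabilities from a joint table; the joint values below are assumptions that sum to 1.
# Marginal and conditional probabilities from a small joint distribution p(X, Y).
joint = {
    ("sedan", "American"): 0.20, ("sedan", "Asian"): 0.15, ("sedan", "European"): 0.05,
    ("SUV",   "American"): 0.25, ("SUV",   "Asian"): 0.10, ("SUV",   "European"): 0.05,
    ("sport", "American"): 0.05, ("sport", "Asian"): 0.05, ("sport", "European"): 0.10,
}

def marginal_x(x):
    # p(X = x) = sum over all y of p(X = x, Y = y)
    return sum(p for (xv, _), p in joint.items() if xv == x)

def conditional_x_given_y(x, y):
    # p(X = x | Y = y) = p(X = x, Y = y) / p(Y = y)
    p_y = sum(p for (_, yv), p in joint.items() if yv == y)
    return joint[(x, y)] / p_y

print(marginal_x("SUV"))                         # -> 0.40
print(conditional_x_given_y("SUV", "American"))  # -> 0.25 / 0.50 = 0.5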
[Figure: example joint distribution p(X = model type, Y = manufacturer) over manufacturers (American, Asian, European) and model types (sport, SUV, minivan, sedan), with probabilities in the range 0.05 to 0.2.]
Fundamental in AI and ML 273
Continuous multivariate distribution
E(f) = ∫ from a to b of p(x) · f(x) dx
With f(x_i) = (x_i − μ)²:  σ² = Σ_i p(x_i) · (x_i − μ)²
• Average value of the squared deviation of X = x_i from the mean μ, taking into account the probability of the various x_i
• Most common measure of the "spread" of a distribution
• σ is the standard deviation
[Figure: scatter plots illustrating high (positive) covariance vs. no covariance.]
• Compare to formula for covariance of actual samples
[Figure: Venn diagram of events A and B within Ω, with regions (A, B), (A, not B), (not A, B), (not A, not B).]
• Probability that a man has white hair (event A) and is over 65 (event B):
• p(B) = 0.18
• p(A | B) = 0.78
• p(A, B) = p(A | B) · p(B) = 0.78 · 0.18 ≈ 0.14
p( A | B ) = p( A ) or p( A, B ) = p( A ) ⋅ p( B )
[Figure: Venn diagram of events A and B within Ω.]
• Dependence:
• Height of two related individuals
• Duration of successive eruptions of Old Faithful
• Probability of getting a king on successive draws from a deck, if card from
each draw is not replaced Fundamental in AI and ML 287
Example of independence vs. dependence
• Independence: All manufacturers have identical product mix. p( X = x | Y = y ) =
p( X = x ).
• Dependence: American manufacturers love SUVs, Europeans manufacturers
don’t.
Bayes' rule: p(B | A) = p(A | B) · p(B) / p(A)
The result seems unintuitive but is correct. Even when the weatherman predicts rain, it only rains about 11% of the time. Despite the weatherman's gloomy prediction, it is unlikely Marie will get rained on at her wedding.
Fundamental in AI and ML 291
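A worked sketch of Bayes' rule consistent with the roughly 11% figure above; the exact priors of the original example are not shown on the slide, so the numbers here are assumptions.
# Bayes' rule: p(rain | predicted) = p(predicted | rain) * p(rain) / p(predicted).
# Assumed numbers: it rains 5 days a year; the forecaster correctly predicts rain
# 90% of the time when it rains and wrongly predicts rain 10% of the time otherwise.
p_rain = 5 / 365
p_pred_given_rain = 0.9
p_pred_given_dry = 0.1

p_pred = p_pred_given_rain * p_rain + p_pred_given_dry * (1 - p_rain)
p_rain_given_pred = p_pred_given_rain * p_rain / p_pred
print(round(p_rain_given_pred, 3))  # ~0.111, i.e. about 11%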
Probabilities: when to add, when to multiply
• ADD: When you want to allow for occurrence of any of several possible
outcomes of a single process. Comparable to logical OR.
Matrix
Fundamental in AI and ML 294
Vectors
Definition: an n-tuple of values (usually real numbers).
• n referred to as the dimension of the vector
• n can be any positive integer, from 1 to infinity
• result is a vector
• result is a vector
Fundamental in AI and ML 297
Vector arithmetic
• Dot product of two vectors
• multiply corresponding elements, then add the products
• result is a scalar
• Multiplication is associative
A⋅(B⋅C)=(A⋅B)⋅C
• Multiplication is not commutative
A ⋅ B ≠ B ⋅ A (generally)
• Transposition rule:
( A ⋅ B )T = B T ⋅ A T
Right:
A ⋅ B ⋅ AT CT ⋅ A ⋅ B AT ⋅ A ⋅ B C ⋅ CT ⋅ A
Wrong:
A⋅B⋅A C ⋅ A ⋅ B A ⋅ AT ⋅ B CT ⋅ C ⋅ A
• Maximum likelihood
• Expectation maximization
• Gradient descent
• How precisely does a given sample mean (x-bar) reflect underlying population
mean (μ)? How reliable are our inferences?
• To answer these questions, we consider a simulation experiment in which we
take all possible samples of size n taken from the population
• The standard deviation of the sampling distribution of the mean has a special
name: standard error of the mean (denoted σxbar or SExbar)
Quadrupling the sample size cuts the standard error of the mean in half:
For n = 1 ⇒ SE = σ
For n = 4 ⇒ SE = σ / 2
For n = 16 ⇒ SE = σ / 4
• The sampling distribution of x-bar tends to be Normal with mean µ and σxbar = σ / √n
• Example: Let X represent Weschler Adult Intelligence Scores; X ~ N(100, 15).
▪ Take an SRS of n = 10
▪ σxbar = σ / √n = 15/√10 = 4.7
▪ Thus, xbar ~ N(100, 4.7)
• How large does the sample have to be to apply the Normal approximation? ⇒
One rule says that the Normal approximation applies when npq ≥ 5
• Machine learning usually refers to changes in systems that perform tasks associated with artificial intelligence (AI). Such tasks involve recognition, diagnosis, planning, robot control, prediction, etc.
[Figure: training data feed a learning algorithm, which produces a trained machine that answers queries.]
Fundamental in AI and ML 355
Steps in machine learning
1. Data collection.
2. Representation.
3. Modeling.
4. Estimation.
5. Validation.
6. Apply learned model to new “test” data
[Figure: block diagram of a learning system: problem solving, teacher, results, and performance evaluation.]
2. Real world problems have too many variables and sensors might be too noisy.
3. Computational complexity.
1) Unsupervised Learning .
2) Semi-Supervised (reinforcement).
3) Supervised Learning.
Advantage
• Most of the laws of science were developed through unsupervised learning.
Disadvantage
• The identification of the features itself is a complex problem in many situations.
Positive and negative instances of a problem are given, and the learner has to form a concept that supports most of the positive and none of the negative instances.
Unlike this, analogical learning can be accomplished from a single example. For instance, given the following training instance, one has to determine the plural form of bacillus.
3. Apply Mapping Function: Apply the mapping function to transform the new
problem from the given domain to the target domain.
5. Learning: If the validation is found to work well, the new knowledge is encoded
and saved for future usage.
However, for the sake of simplicity, we presume the restriction to Boolean outputs.
Each node in a decision tree represents ‘a test of some attribute of the instance, and
each branch descending from that node corresponds to one of the possible values
for this attribute’.
Fundamental in AI and ML 371
Learning by Decision Tree
To illustrate the contribution of a decision tree, we consider a set of instances, some
of which result in a true value for the decision. Those instances are called positive
instances. On the other hand, when the resulting decision is false, we call the
instance ‘a negative instance’. We now consider the learning problem of a bird’s
flying. Suppose a child sees different instances of birds as tabulated below.
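The table of bird instances itself was not recovered, so a sketch of the entropy and information-gain computation used by ID3-style decision-tree learners is given here with a tiny stand-in "can it fly" dataset (an assumption for illustration).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
    total = entropy(labels)
    n = len(labels)
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        total -= len(subset) / n * entropy(subset)
    return total

# Hypothetical bird instances; label = does it fly?
rows = [{"has_wings": True,  "injured": False},
        {"has_wings": True,  "injured": True},
        {"has_wings": False, "injured": True},
        {"has_wings": True,  "injured": False}]
labels = [True, False, False, True]
print(information_gain(rows, labels, "injured"))    # -> 1.0 (perfect split here)
print(information_gain(rows, labels, "has_wings"))  # -> ~0.31 (weaker attribute)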
[Figure: example imaging modalities: photo, MRI, CT.]
Fundamental in AI and ML 377
Iris verification
[Figure: iris-verification system with a training phase and a testing phase.]
Intermediate nodes: attributes
[Figure: decision tree with attributes a1 to a6 at intermediate nodes and class labels X, Y, Z at the leaves.]
Other options: gain ratio, etc.
When an instance arrives for testing, run the algorithm to get the class prediction.
Alternative distance measure:
1. Boolean distance (1 if same, 0 if different)
Alternative prediction:
1. Average the values returned by the K nearest neighbours
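A minimal K-nearest-neighbours sketch with Euclidean distance and majority vote; the toy training points are assumptions.
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label); returns the majority label of the k nearest."""
    dist = lambda a, b: math.dist(a, b)              # Euclidean distance
    nearest = sorted(train, key=lambda xy: dist(xy[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
print(knn_predict(train, (1.1, 0.9), k=3))  # -> "A"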
[Figure: decision-list learner. From a set of candidate feature functions, for each h_i with Q_i = P_i ∪ N_i the utility is U_i = max{ |P_i| − p_n·|N_i| , |N_i| − p_p·|P_i| }; the feature h_k with the highest utility is selected, and the unit outputs 1 if |P_i| − p_n·|N_i| > |N_i| − p_p·|P_i|, else 0. A test instance is passed through the learned list to obtain its class label.]
Pruning: h_i is not required if
1. c_i = c_(r+1), or
2. there is no h_j (j > i) such that Q_i = Q_j.
[Figure: Bayesian network over variables such as Loan, Age, Marital status, and Family status.]
Add edges as required
Examples of algorithms: TAN, K2
[Figure: perceptron with inputs x1 … xn, weights w1 … wn, and threshold weight w0.]
Output: activation function p(v), where
p(v) = sgn(w0 + w1·x1 + … + wn·xn)
Fundamental in AI and ML 410
Perceptron learning algorithm
Initialize the weight values.
Apply training instances and get the output.
Update the weights according to the update rule:
w_i ← w_i + η (t − o) x_i
η : learning rate
t : target output
o : observed output
A large learning factor means the learner may skip the global minimum
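A runnable sketch of the perceptron update rule above, trained on a hypothetical linearly separable toy set (the AND function).
# Perceptron training with the rule w_i <- w_i + eta * (t - o) * x_i.
def sgn(v):
    return 1 if v >= 0 else 0

def train_perceptron(data, eta=0.1, epochs=20):
    w = [0.0, 0.0, 0.0]                       # w[0] is the threshold/bias weight
    for _ in range(epochs):
        for x, t in data:
            o = sgn(w[0] + w[1] * x[0] + w[2] * x[1])
            w[0] += eta * (t - o)
            w[1] += eta * (t - o) * x[0]
            w[2] += eta * (t - o) * x[1]
    return w

# Toy AND dataset (linearly separable).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(data)
print([sgn(w[0] + w[1] * x[0] + w[2] * x[1]) for x, _ in data])  # -> [0, 0, 0, 1]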
Addition of momentum
But why?
Fundamental in AI and ML 416
Support vector machines
Basic idea: the "maximum separating-margin classifier."
[Figure: two classes (+1 / −1) separated by the hyperplane w·x + b = 0; the support vectors are the points on the margin boundaries.]
[Figure: ensemble (majority-vote) learning scheme: samples D1 … Dn are drawn from the training set, a classifier model M1 … Mn is trained on each, and the class label of a test instance is decided by majority vote.]
• Dimensionality reduction
• Transform instances into 'smaller' instances:
s = W x, where W is a k × p transformation matrix, x is the original p-dimensional instance, and s is the transformed instance in k dimensions, with k < p.
• The eigenvector matrix is p × p; the first k eigenvectors are the k principal components (PCs) and form W (k × p), so S (k × n) = W (k × p) · X (p × n) for n instances.
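A NumPy sketch of PCA by eigendecomposition of the covariance matrix, matching the S = W·X formulation above; the random data are placeholders.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))            # p = 5 features, n = 100 instances (columns)

Xc = X - X.mean(axis=1, keepdims=True)   # center each feature
cov = np.cov(Xc)                         # p x p covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition (symmetric matrix)

k = 2
order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
W = eigvecs[:, order[:k]].T              # first k PCs as rows -> W is k x p
S = W @ Xc                               # S is k x n: instances in k dimensions
print(W.shape, S.shape)                  # (2, 5) (2, 100)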
• TAN :
• Use Naïve Bayes as the baseline network
• Add different edges to the network based on utility
•Flat algorithms
• Usually start with a random (partial) partitioning
• Refine it iteratively
• K means clustering
• (Model based clustering)
•Hierarchical algorithms
• Bottom-up, agglomerative
• (Top-down, divisive)
Pick seeds
Reassign clusters
Compute centroids
Reassign clusters
Compute centroids
Reassign clusters
Converged!
•Recomputing the centroid after every assignment (rather than after all points are
re-assigned) can improve speed of convergence of K-means
•Assumes clusters are spherical in vector space
• Sensitive to coordinate changes, weighting etc.
•Disjoint and exhaustive
• Doesn’t have a notion of “outliers” by default
• But can add outlier filtering
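A compact K-means sketch following the seed / reassign / recompute loop listed above; centroids are recomputed after all points are reassigned, and the random 2-D data and K = 2 are assumptions for illustration.
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # pick seeds
    for _ in range(iters):
        # reassign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break                                                    # converged
        centroids = new
    return labels, centroids

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])  # two blobs
labels, centroids = kmeans(X, k=2)
print(centroids)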
Dhillon et al. ICDM 2002 – variation to fix some issues with small document clusters
•Given a clustering, define the Benefit for a doc to be the cosine similarity to
its centroid
•Define the Total Benefit to be the sum of the individual doc Benefits.
[Figure: hierarchical clustering, e.g., animal split into vertebrate and invertebrate.]
Note: the resulting clusters are still "hard" and induce a partition.
• Can result in “straggly” (long and thin) clusters due to chaining effect.
• After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is (single-link): sim(c_i ∪ c_j, c_k) = max( sim(c_i, c_k), sim(c_j, c_k) )
•In the first iteration, all HAC methods need to compute similarity of all pairs of N initial
instances, which is O(N2).
•In each of the subsequent N−2 merging iterations, compute the distance between the most
recently created cluster and all other existing clusters.
•In order to maintain an overall O(N2) performance, computing similarity to each other cluster
must be done in constant time.
• Often O(N3) if done naively or O(N2 log N) if done more cleverly
•Quality measured by its ability to discover some or all of the hidden patterns or
latent classes in gold standard data
•Assesses a clustering with respect to ground truth … requires labeled data
•Assume documents with C gold standard classes, while our clustering algorithms
produce K clusters, ω1, ω2, …, ωK with ni members.
• Simple measure: purity, the ratio between the dominant class in the cluster πi and
the size of cluster ωi
People also define and use a cluster F-measure, which is probably a better measure.
• Resources
• IIR 16 except 16.5
• IIR 17.1–17.3
• Measures the relative strength of the linear relationship between two variables
• Unit-less
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker any positive linear relationship
[Figure: scatter plots illustrating r = −1, r = −0.6, r = 0, r = +1, r = +0.3, and r = 0 (no relationship).]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004
Fundamental in AI and ML 486
Calculating by hand:
r = Σ(x_i − x̄)(y_i − ȳ) / sqrt( Σ(x_i − x̄)² · Σ(y_i − ȳ)² )
(the numerator is the numerator of the covariance; the two sums under the square root are the numerators of the variances of x and y)
*Note: like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself; substitute in the estimated r.
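A small sketch computing r by hand from the formula above, with toy data.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))       # covariance numerator
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]
print(round(pearson_r(xs, ys), 3))   # ~0.85: strong positive linear relationship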
If you know something about X, this knowledge helps you predict something about
Y. (Sound familiar?…sound like conditional probabilities?)
[Figure: scatter plot of y vs. x with the least-squares regression line; for each observation y_i, the distances to the regression line and to the naïve mean of y are marked (A, B, C). Least-squares estimation gave us the line (β) that minimized the sum of squared residuals.]
SStotal: total squared distance of observations from the naïve mean of y (total variation).
SSreg: distance from the regression line to the naïve mean of y (variability due to x, i.e., the regression).
SSresidual: variance around the regression line (additional variability not explained by x; what the least-squares method aims to minimize).
Fundamental in AI and ML 498
Recall example: cognitive function and vitamin D
1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry. 2009
Jul;80(7):722-9. Fundamental in AI and ML 499
Distribution of vitamin D
• I generated four hypothetical datasets, with increasing TRUE slopes (between vit D
and DSST):
• 0
• 0.5 points per 10 nmol/L
• 1.0 points per 10 nmol/L
• 1.5 points per 10 nmol/L
• Regression equation: E(Y_i) = 26 + 0.5·vitD_i (vitamin D in units of 10 nmol/L)
Intercept = 26
In correlation, the two variables are treated as equals. In regression, one variable is
considered independent (=predictor) variable (X) and the other the dependent
(=outcome) variable Y.
Slope
Distribution of the estimated slope: β̂ ~ T_(n−2)( β, s.e.(β̂) )
•The residual for observation i, ei, is the difference between its observed and predicted value
•Check the assumptions of regression by examining the residuals
• Examine for linearity assumption
• Examine for constant variance for all levels of X (homoscedasticity)
• Evaluate normal distribution assumption
• Evaluate independence assumption
•Graphical Analysis of Residuals
• Can plot residuals vs. X
[Figure: residual plots against X. Patterns indicate violations: curvature suggests non-linearity, a funnel shape suggests non-constant variance, and a trend over observation order suggests non-independence; a patternless plot (✔ Independent) is what we want.]
Each regression coefficient is the amount of change in the outcome variable that
would be expected per one-unit change of the predictor, if all other variables in the
model were held constant.
[Table: regression output with columns Variable, Parameter Estimate, Standard Error, t Value, Pr > |t|.]
Sufficient vs.
Insufficient
Sufficient vs.
Deficient
[Table: regression output with columns Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|.]
Interpretation:
The deficient group has a mean DSST 9.87 points lower than the reference (sufficient) group.
The insufficient group has a mean DSST 6.87 points lower than the reference (sufficient)
group
Continuous outcome (e.g., blood pressure): linear regression. blood pressure (mmHg) = α + β_salt·salt consumption (tsp/day) + β_age·age (years) + β_smoker·ever smoker (yes=1/no=0). The slopes tell you how much the outcome variable increases for every 1-unit increase in each predictor.
Binary outcome (e.g., high blood pressure yes/no): logistic regression. ln(odds of high blood pressure) = α + β_salt·salt consumption + β_age·age + β_smoker·ever smoker. The odds ratios tell you how much the odds of the outcome increase for every 1-unit increase in each predictor.
Time-to-event outcome (e.g., time to death): Cox regression. ln(rate of death) = α + β_salt·salt consumption + β_age·age + β_smoker·ever smoker. The hazard ratios tell you how much the rate of the outcome increases for every 1-unit increase in each predictor.
• Multicollinearity arises when two variables that measure the same thing or similar
things (e.g., weight and BMI) are both included in a multiple regression model;
they will, in effect, cancel each other out and generally destroy your model.
• You cannot completely wipe out confounding simply by adjusting for variables in
multiple regression unless variables are measured with zero error (which is usually
impossible).
• In multivariate modeling, you can get highly significant but meaningless results if
you put too many predictors in the model.
• The model is fit perfectly to the quirks of your particular sample, but has no
predictive ability in a new sample
[Table: regression output with columns Variable, Parameter Estimate, Standard Error, Type II SS, F Value, Pr > F.]
• Exercise, sleep, and high ratings for Clinton are negatively related to optimism
(highly significant!) and high ratings for Obama and high love of math are
positively related to optimism (highly significant!).
Fundamental in AI and ML 543
Overfitting
• Pure noise variables still produce good R2
values if the model is overfitted. The
distribution of R2 values from a series of
simulated regression models containing only
noise variables.
• (Figure 1 from: Babyak, MA. What You See May Not Be What You Get: A Brief,
Nontechnical Introduction to Overfitting in Regression-Type Models.
Psychosomatic Medicine 66:411-421 (2004).)
•Bias – error caused because the model can not represent the concept
•Variance – error caused because the learning algorithm overreacts to small changes
(noise) in the training data
[Figure: a true concept and training data; a linear model is fit, with regions where the model predicts + and −.]
Fundamental in AI and ML 549
Visualizing Variance
• Goal: produce a model that matches this concept
• New data, new model
[Figure: a linear model fit to the new data; its mistakes reflect a different bias, shown by the + / − prediction regions.]
Fundamental in AI and ML 550
Visualizing Variance
• Goal: produce a model that matches this concept
• New data, new model
• New data, new model… the fitted model and its mistakes will vary.
[Figure: linear models fit to different datasets, with varying + / − prediction regions.]
Fundamental in AI and ML 551
Another way to think about Bias & Variance
[Figure: a model that predicts − everywhere. Not good!]
• Increase minToSplit
• Increase minGainToSplit
• Limit total number of Nodes
• Penalize complexity
• Regularization
• Data
• Learning Algorithms
• Feature Sets
• Complexity of Concept
• Search and Computation
• Parameter sweeps!
# examine the parameters that seem best and adjust whatever you can…
Fundamental in AI and ML 562
Types of Parameter Sweeps
posterior = (likelihood × prior) / evidence
• Actions: α_i
• Loss of taking action α_i when the state is C_k: λ_ik
• Expected risk (Duda and Hart, 1973): R(α_i | x) = Σ_k λ_ik · P(C_k | x); choose the action with minimum expected risk
• Log odds: log( P(C_1 | x) / P(C_2 | x) )
Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?
Causal inference:
P(W|C) = P(W|R,S) P(R,S|C) +
P(W|~R,S) P(~R,S|C) +
P(W|R,~S) P(R,~S|C) +
P(W|~R,~S) P(~R,~S|C)
Diagnostic: P(C|W ) = ?
P (F | C) =
?
P (C,F) = ∑S ∑R ∑W P (C,S,R,W,F)
P (C | x )
decision node
• Association rule: X → Y
• Support (X → Y): the fraction of transactions containing both X and Y, i.e., P(X, Y)
• Confidence (X → Y): the fraction of transactions containing X that also contain Y, i.e., P(Y | X)
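A tiny sketch computing support and confidence for one rule over a hypothetical transaction list.
# Support and confidence of an association rule X -> Y over market-basket data.
transactions = [
    {"milk", "bread"}, {"milk", "diapers", "beer"}, {"bread", "diapers"},
    {"milk", "bread", "diapers"}, {"milk", "bread", "beer"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    return support(X | Y) / support(X)

X, Y = {"milk"}, {"bread"}
print(support(X | Y))      # fraction containing both -> 0.6
print(confidence(X, Y))    # P(bread | milk) -> 0.6 / 0.8 = 0.75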
•Transductive transfer
• No labeled target domain data available
• Focus of most transfer research in NLP
• E.g. Domain adaptation
•Inductive transfer
• Labeled target domain data available
• Goal: improve performance on the target task by training on other task(s)
• Jointly training on >1 task (multi-task learning)
• Pre-training (e.g. word embeddings)
The regular SGD update is θ_t = θ_(t−1) − η · ∇_θ J(θ), where η is the learning rate and ∇_θ J(θ) is the gradient with regard to the model's objective function.
• Discriminative fine-tuning:
• Split the parameters into {θ^1, …, θ^L}, where θ^l contains the parameters of the model at the l-th layer and L is the number of layers of the model
• Obtain learning rates {η^1, …, η^L}, where η^l is the learning rate of the l-th layer
• SGD update with discriminative fine-tuning: θ_t^l = θ_(t−1)^l − η^l · ∇_(θ^l) J(θ)
• Fine-tune the last layer and use a reduced learning rate for the lower layers (ULMFiT uses η^(l−1) = η^l / 2.6)
•Input sequences may consist of hundreds of words information may get lost if we
only use the last hidden state of the model
• Concatenate the hidden state at the last time step h_T of the document with:
• the max-pooled representation of the hidden states
• the mean-pooled representation of the hidden states
over as many time steps as fit in GPU memory, H = {h_1, ..., h_T}:
h_c = [h_T, maxpool(H), meanpool(H)], where [] is concatenation
• Low-shot learning – training a model for a task with small number of labeled
samples
• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.
• Then do it again.
• It can be proved that each time we add another layer of features we improve a variational lower bound on the log probability of generating the training data.
• The proof is complicated and only applies to unreal cases. It is based on a neat equivalence between an RBM and an infinitely deep belief net (see lecture 14b).
[Figure: stack of layers; train the bottom RBM on the data first.]
Fundamental in AI and ML 613
An aside: Averaging factorial distributions
• If you average some factorial distributions, you do NOT get a factorial distribution.
• In an RBM, the posterior over 4 hidden units is factorial for each visible vector.
• Posterior for v1: 0.9, 0.9, 0.1, 0.1
• Posterior for v2: 0.1, 0.1, 0.9, 0.9
• Aggregated = 0.5, 0.5, 0.5, 0.5
• Consider the binary vector 1,1,0,0:
• in the posterior for v1, p(1,1,0,0) = 0.9^4 ≈ 0.66
• in the posterior for v2, p(1,1,0,0) = 0.1^4 = 0.0001
• in the aggregated posterior, p(1,1,0,0) ≈ 0.33
• If the aggregated posterior were factorial it would have p = 0.5^4 = 0.0625.
Sampling is still easy, though learning needs to be much slower.
Fundamental in AI and ML 630
Inference in an infinite sigmoid belief net
• The variables in h0 are conditionally independent given v0.
• Inference is trivial: just multiply v0 by the transposed weight matrix W^T.
• The model above h0 implements a complementary prior.
• Multiplying v0 by W^T gives the product of the likelihood term and the prior term.
• The complementary prior cancels the explaining away.
• Inference in the directed net is exactly equivalent to letting an RBM settle to equilibrium starting at the data.
[Figure: infinite directed net with layers … h2, v2, h1, v1, h0, v0 and tied weights.]
Fundamental in AI and ML 631
• The learning rule for a sigmoid belief net is Δw_ij ∝ s_j (s_i − p_i).
• Think of the symmetric connections as a shorthand notation for an infinite directed net with tied weights.
• We ought to use maximum likelihood learning, but we use CD1 as a shortcut.
[Figure: the infinite directed net (… h2, v2, h1, v1, h0, v0) with tied weights.]
Fundamental in AI and ML 633
Learning a deep directed network
• Then freeze the first layer of weights in both directions and learn the remaining weights (still tied together).
• This is equivalent to learning another RBM, using the aggregated posterior distribution of h0 as the data.
[Figure: stacked layers h2, v2, h1, v1, h0, v0 with the first layer of weights frozen.]
Fundamental in AI and ML 634
What happens when the weights in higher layers become
different from the weights in the first layer?
• The higher layers no longer implement a complementary prior.
• So performing inference using the frozen weights in the first layer is no longer correct. But it's still pretty good.
• Using this incorrect inference procedure gives a variational lower bound on the log probability of the data.
• The higher layers learn a prior that is closer to the aggregated posterior distribution of the first hidden layer.
• This improves the network's model of the data.
• Hinton, Osindero and Teh (2006) prove that this improvement is always bigger than the loss in the variational bound caused by using less accurate inference.
Fundamental in AI and ML 636
Why is it OK to ignore the derivatives in higher
layers?
• When the weights are small, the Markov chain mixes fast, so the higher layers will be close to the equilibrium distribution (i.e., they will have "forgotten" the data vector).
• At equilibrium the derivatives must average to zero, because the current weights are a perfect model of the equilibrium distribution!
• As the weights grow we may need to run more iterations of CD. This allows CD to continue to be a good approximation to maximum likelihood.
• But for learning layers of features, it does not need to be a good approximation to maximum likelihood!
• Precision = (good messages kept) / (all messages kept)
• Recall = (good messages kept) / (all good messages)
Comparing against another system: it may be better for some users and worse for others (you can't tell just by comparing F-measures).
• F-measure = ( (precision^(−1) + recall^(−1)) / 2 )^(−1), the harmonic mean of precision and recall.
Move from high precision to high recall by deleting fewer messages (raise the threshold).
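A quick sketch computing precision, recall, and F-measure for the message-filter framing above; the counts are hypothetical.
# Precision, recall, and F-measure for a message filter (illustrative counts).
good_kept = 90      # good messages kept
all_kept = 100      # all messages kept
all_good = 120      # all good messages in the stream

precision = good_kept / all_kept
recall = good_kept / all_good
f_measure = 2 / (1 / precision + 1 / recall)   # harmonic mean

print(round(precision, 2), round(recall, 2), round(f_measure, 2))  # 0.9 0.75 0.82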
[Figure: confusion matrix of true class vs. predicted class; the diagonal cells are the correct predictions.]
Fundamental in AI and ML 644
Fancier Perfomance Metrics
• For multi-way classifiers:
• Average accuracy (or precision or recall) of 2-way distinctions: Sports or not, News or not,
etc.
• Better, estimate the cost of different kinds of errors
• e.g., how bad is each of the following?
• putting Sports articles in the News section
• putting Fashion articles in the News section
• putting News articles in the Fashion section
• Now tune system to minimize total cost
generates a whole
bunch of features – use
data to find out which
ones work best
Fundamental in AI and ML 655
600.465 - Intro to NLP - J. Eisner 655
Feature Templates
This feature is relatively weak, but weak features are still useful, especially since very few features will fire in a given context.
A merged ranking of all features of all these types.
Fundamental in AI and ML 656
600.465 - Intro to NLP - J. Eisner 656
Final decision list for lead (abbreviated)
CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per
round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR,
immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase
took effect Thursday night and applies to most routes where it competes against discount carriers, such
as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.
A feature of this tag sequence might give a positive or negative weight to this B_ORG in conjunction with some subset of the nearby attributes.
[Figure: phrase-structure tree for "The plan to swallow Wanda has been thrilling Otto," with each constituent annotated with its head word (S and VP headed by "thrill," NP headed by "plan," the lower VP headed by "swallow," and NPs headed by "Otto" and "Wanda").]
Fundamental in AI and ML 668
Dependency Trees
2. Each word is the head of a whole connected subgraph.
[Figure: the same head-annotated tree for "The plan to swallow Wanda has been thrilling Otto."]
Fundamental in AI and ML 669
Dependency Trees
2. Each word is the head of a whole connected subgraph.
[Figure: the same tree with the head-word annotations removed.]
Fundamental in AI and ML 670
Dependency Trees
3. Just look at which words are related.
[Figure: word-to-word dependency links among "The plan to swallow Wanda has been thrilling Otto."]
Fundamental in AI and ML 671
Dependency Trees
4. Optionally flatten the drawing.
• Shows which words modify (“depend on”) another word
• Each subtree of the dependency tree is still a constituent
• But not all of the original constituents are subtrees (e.g., VP)
[Figure: parse tree for "The big dog with the boy bit a girl," with constituents color-coded by semantic role: agent, patient, source, destination, instrument, beneficiary, or not-a-role.]
Fundamental in AI and ML 675
Parse tree paths as classification features
The path feature V ↑ VP ↑ S ↓ NP tends to be associated with the agent role.
[Figure: the path from the verb "bit" up to S and down to the subject NP "the big dog with the boy."]
The path feature V ↑ VP ↑ S ↓ NP ↓ PP ↓ NP tends to be associated with no role.
[Figure: the path from the verb "bit" up to S and down into the PP-internal NP "the boy."]
• Like finding the intent (= frame) for each “trigger phrase,” and filling its slots
(a) (b)
Clustering in (a) YCbCr and (b) V vs. H space. Red is non-face and blue is
face data
Fundamental in AI and ML 682
Result of color segmentation using Global
thresholding
EigenFaces
[Figure: (a), (b): the headband template.]
Fundamental in AI and ML 695
Table of results for training images
Approx. 95% accuracy with about 75 seconds runtime
1 22 21 21 0 0 15.9311 71.91 1
2 22 21 23 0 2 13.6109 82.96 1
3 25 25 25 0 0 9.8625 80.48 0
4 22 22 24 0 2 11.3667 81.15 0
5 24 24 24 0 0 9.5960 69.59 0
6 23 23 23 0 0 11.5512 80.25 0
7 22 21 21 0 0 14.1537 71.52 1
Sentiment Analysis
[Figure: mind map of Sentiment Analysis: challenges, lexical resources, SA approaches, subjectivity detection, and applications; a term's sense/position may be subjective (with polarity) or objective.]
[Figure: seed lists L_p and L_n expanded along synonymy and antonymy links.]
The sets at the end of kth step are called Tr(k,p) and Tr(k,n)
Tr(k,o) is the set that is not present in Tr(k,p) and Tr(k,n)
• Train a committee of classifiers of different types and different K-values for the
given data
• Observations:
• Low values of K give high precision and low recall
• Accuracy in determining positivity or negativity, however, remains almost
constant
• Partition cost:
Sample cuts:
[Figure: pipeline: Document → Subjectivity detector → Subjective / Objective → Polarity classifier.]
[Figure: reinforcement-learning example with states s0–s3, actions a0–a2, rewards r0–r2, and a sample training sequence of visited cells (1,1) (1,2) (1,3) (1,2) (1,3) (1,2) (1,1) (2,1) (3,1) (4,1) (4,2), ending with reward −1.]
• Key idea: updating the utility value using the given training sequences.
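A sketch of a temporal-difference style utility update driven by observed training sequences; it uses the sample sequence shown above, and the terminal handling, rewards, and parameters are assumptions for illustration.
# TD(0)-style utility update from observed training sequences:
# U(s) <- U(s) + alpha * (r' + gamma * U(s') - U(s))
from collections import defaultdict

def td_update(utilities, sequence, alpha=0.1, gamma=0.9):
    """sequence: list of (state, reward received on entering that state)."""
    for (s, _), (s_next, r_next) in zip(sequence, sequence[1:]):
        target = r_next + gamma * utilities[s_next]
        utilities[s] += alpha * (target - utilities[s])
    return utilities

U = defaultdict(float)                    # terminal (4,2) keeps utility 0 in this sketch
trajectory = [((1, 1), 0), ((1, 2), 0), ((1, 3), 0), ((1, 2), 0), ((1, 3), 0),
              ((1, 2), 0), ((1, 1), 0), ((2, 1), 0), ((3, 1), 0), ((4, 1), 0),
              ((4, 2), -1)]
for _ in range(50):                       # replay the training sequence
    td_update(U, trajectory)
print(dict(U))                            # utilities drift negative near the -1 outcome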