
CSA2001

FUNDAMENTALS IN AI AND ML
On the Verge of Major Breakthroughs

Artificial Intelligence (AI) has been moving extremely quickly in the last few years, demonstrating the potential to revolutionize every aspect of our lives:

Work and the economy
Medicine
Mobility

Fundamental in AI and ML 2
Applications of AI

But, What is AI ?
AI can be broadly defined as technology that can learn and
produce intelligent behavior

Input: pixels → an AI process → Output: “Tuberculosis”

(Computer Vision)

But, What is AI ?
AI can be broadly defined as technology that can learn and
produce intelligent behavior

Input: pixels → an AI process → Output: “Four kids are playing with a ball”

More than just a category about the image!

(Computer Vision)

Applications of AI
AI can be broadly defined as technology that can learn and
produce intelligent behavior

Input: audio clip → an AI process → Output: “I feel some eye pain”

(Speech Recognition)

Artificial Intelligence

Artificial Intelligence

Deep Learning (DL) is a subset of Machine Learning (ML); ML is a subset of Artificial Intelligence (AI); and AI sits within Computer Science, which in turn draws on mathematics, physics, chemistry, and biology.
AI in real-time systems
• Search engines (such as Google Search),
• Recommendation systems (offered by Netflix, YouTube or Amazon),
• Driving internet traffic,
• Targeted advertising (AdSense, Facebook),
• Virtual assistants (such as Siri or Alexa),
• Autonomous vehicles (including drones and self-driving cars),
• Automatic language translation (Microsoft Translator, Google Translate),
• Facial recognition (Apple's Face ID or Microsoft's DeepFace),
• Image labeling (used by Facebook, Apple's iPhoto and TikTok), and
• Spam filtering.

Agents in Artificial Intelligence
AI can be defined as the study of rational agents and their environments. Agents sense the environment through sensors and act on their environment through actuators. An AI agent can have mental properties such as knowledge, beliefs, and intentions.

What is an Agent?
An agent is anything that perceives its environment through sensors and acts upon that environment through actuators. An agent runs in a cycle of perceiving, thinking, and acting. An agent can be:
Human agent: eyes, ears, and other organs serve as sensors; hands, legs, and the vocal tract serve as actuators.
Robotic agent: cameras and infrared range finders serve as sensors; various motors serve as actuators.
Software agent: keystrokes and file contents serve as sensory input; it acts on those inputs and displays output on the screen.
Agents in Artificial Intelligence
Sensors: A sensor is a device that detects changes in the environment and sends the information to other electronic devices. An agent observes its environment through sensors.
Actuators: Actuators are the components of a machine that convert energy into motion. Actuators are responsible for moving and controlling a system; examples include electric motors, gears, and rails.

Agents in Artificial Intelligence

Effectors: Effectors are the devices that affect the environment, e.g., legs, wheels, arms, fingers, wings, fins, and a display screen.

Intelligent Agent

An intelligent agent is an autonomous entity that acts upon an environment using sensors and actuators to achieve goals. An intelligent agent may learn from the environment to achieve its goals. A thermostat is an example of an intelligent agent.
Following are the four main rules for an AI agent:
Rule 1: An AI agent must have the ability to perceive the environment.
Rule 2: The observations must be used to make decisions.
Rule 3: Decisions should result in an action.
Rule 4: The action taken by an AI agent must be a rational action.


Rational Agent
• A rational agent is an agent that has clear preferences, models uncertainty, and acts in a way that maximizes its performance measure over all possible actions.

• A rational agent is said to do the right thing. AI is about creating rational agents, drawing on game theory and decision theory for various real-world scenarios.

• For an AI agent, rational action is most important: in reinforcement learning, the agent receives a positive reward for each best possible action and a negative reward for each wrong action.
Rationality
The rationality of an agent is measured by its performance measure. Rationality can be judged on the basis of the following points:

The performance measure, which defines the success criterion.
The agent's prior knowledge of its environment.
The best possible actions that the agent can perform.
The sequence of percepts.
Structure of an AI Agent
The task of AI is to design an agent program that implements the agent function. The structure of an intelligent agent is a combination of architecture and agent program. It can be viewed as:

Agent = Architecture + Agent program

Architecture: the machinery that the AI agent executes on.

Agent function: maps a percept sequence to an action:

f : P* → A

Agent program: an implementation of the agent function. The agent program executes on the physical architecture to produce the function f.
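The Agent = Architecture + Agent program split can be sketched in code. This is a minimal, hypothetical Python sketch; the class, the reflex_program rule, and the percept strings are illustrative, not from the slides:

```python
class Agent:
    """Agent = architecture + program: the program is the agent
    function f: P* -> A mapping percept sequences to actions; the
    architecture feeds it percepts and executes the chosen actions."""

    def __init__(self, program):
        self.program = program     # the agent function f
        self.percepts = []         # the percept sequence P*

    def step(self, percept):
        self.percepts.append(percept)
        return self.program(self.percepts)

# A toy agent program: act only on the most recent percept.
def reflex_program(percepts):
    return "brake" if percepts[-1] == "red light" else "drive"

agent = Agent(reflex_program)
print(agent.step("green light"))  # drive
print(agent.step("red light"))    # brake
```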
PEAS Representation

PEAS is a model used to describe an AI agent. When we define an AI agent or rational agent, we can group its properties under the PEAS representation model. It is made up of four terms:

P: Performance measure

E: Environment

A: Actuators

S: Sensors

Here the performance measure is the objective criterion for the success of an agent's behavior.
PEAS for self-driving cars:
Let's consider a self-driving car; its PEAS representation will be:
Performance: safety, time, legal driving, comfort
Environment: roads, other vehicles, road signs, pedestrians
Actuators: steering, accelerator, brake, signal, horn
Sensors: camera, GPS, speedometer, odometer, accelerometer, sonar
Example of Agents with their PEAS representation

Agent: Medical diagnosis
Performance measure: healthy patient, minimized cost
Environment: hospital, doctors, patients
Actuators: prescription, diagnosis, scan report
Sensors: symptoms, patient's response
Example of Agents with their PEAS representation

Agent: Subject tutoring
Performance measure: maximize scores
Environment: classroom, desk, chair, board, staff, students
Actuators: smart displays, corrections
Sensors: eyes, ears, notebooks
Example of Agents with their PEAS representation

Agent: Vacuum cleaner
Performance measure: cleanness, efficiency, battery life, security
Environment: room, table, wood floor, carpet, various obstacles
Actuators: wheels, brushes, vacuum extractor
Sensors: camera, dirt-detection sensor, cliff sensor, bump sensor, infrared wall sensor
Object Detection
Activity Recognition
Semantic Segmentation
Disease Detection
Image Colorization
Style Transfer
Lip Sync
Image-to-Image Translation

Why interest in AI?
Agents

Definition: An agent perceives its environment via sensors and acts upon that environment through its actuators.
Agent / Robot: e.g., the vacuum-cleaner world
(iRobot Corporation; founder Rodney Brooks, MIT)

Percepts: location and contents, e.g., [A, Dirty]
Actions: Left, Right, Suck, NoOp
Rational Agents

An agent should strive to "do the right thing", based on:
• what it can perceive, and
• the actions it can perform.

The right action is the one that will cause the agent to be most successful.
Rational Agents

Performance measure: an objective criterion for success of an agent's behavior.

Performance measures of a vacuum-cleaner agent: amount of dirt cleaned up, amount of time taken, amount of electricity consumed, level of noise generated, etc.

Performance measures of a self-driving car: time to reach destination (minimize), safety, predictability of behavior for other agents, reliability, etc.

Performance measure of a game-playing agent: win/loss percentage (maximize), robustness, unpredictability (to "confuse" the opponent), etc.
Rational Agents
For each possible percept sequence, a rational agent should select an action that maximizes its performance measure (in expectation), given the evidence provided by the percept sequence and whatever built-in knowledge the agent has.

Why "in expectation"?
This captures actions with stochastic or uncertain effects, or actions performed in stochastic environments: we can then look at the expected value of an action.
In high-risk settings, we may also want to limit worst-case behavior.
Rational Agents
Notes:
• Rationality is distinct from omniscience ("all-knowing"). We can behave rationally even when faced with incomplete information.
• Agents can perform actions in order to modify future percepts so as to obtain useful information: information gathering, exploration.
• An agent is autonomous if its behavior is determined by its own experience (with the ability to learn and adapt).
Characterizing a Task Environment

We must first specify the setting for intelligent agent design.

PEAS: Performance measure, Environment, Actuators, Sensors

Example: the task of designing a self-driving car
• Performance measure: safe, fast, legal, comfortable trip
• Environment: roads, other traffic, pedestrians
• Actuators: steering wheel, accelerator, brake, signal, horn
• Sensors: cameras, LIDAR (light/radar), speedometer, GPS, odometer, engine sensors, keyboard
PEAS Examples

Task Environments

1) Fully observable / Partially observable
If an agent's sensors give it access to the complete state of the environment needed to choose an action, the environment is fully observable.
(e.g., chess; what about Kriegspiel?)
Task Environments

Making things a bit more challenging: in Kriegspiel, you can't see your opponent!

• Incomplete / uncertain information is inherent in the game.
• Balance exploitation (the best move given current knowledge) and exploration (moves to explore where the opponent's pieces might be).
• Use probabilistic reasoning techniques.
Task Environments

2) Deterministic / Stochastic
○ An environment is deterministic if the next state of the environment
is completely determined by the current state of the environment and
the action of the agent;
○ In a stochastic environment, there are multiple, unpredictable
outcomes. (If the environment is deterministic except for the actions
of other agents, then the environment is strategic).
In a fully observable, deterministic environment, the agent need not
deal with uncertainty.
Note: Uncertainty can also arise because of computational
limitations. E.g., we may be playing an omniscient (“all knowing”)
opponent but we may not be able to compute his/her moves.
Task Environments

3) Episodic / Sequential

○ In an episodic environment, the agent’s experience is divided into


atomic episodes. Each episode consists of the agent perceiving and
then performing a single action.

○ Subsequent episodes do not depend on what actions occurred in


previous episodes. Choice of action in each episode depends only on
the episode itself. (E.g., classifying images.)

○ In a sequential environment, the agent engages in a series of


connected episodes. Current decision can affect future decisions.
(E.g., chess and driving.)
Task Environments

4) Static / Dynamic

A static environment does not change while the agent is thinking: the passage of time as the agent deliberates is irrelevant.

The environment is semidynamic if the environment itself does not change with the passage of time, but the agent's performance score does.
Task Environments

5) Discrete / Continuous
○ If the number of distinct percepts and actions is limited, the environment is
discrete, otherwise it is continuous.

6) Single agent / Multi-agent


○ If the environment contains other intelligent agents, the agent needs to be
concerned about strategic, game-theoretic aspects of the environment (for
either cooperative or competitive agents).
○ Most engineering environments don’t have multi-agent properties, whereas
most social and economic systems get their complexity from the interactions
of (more or less) rational agents.
Example Tasks and Environment Types

How to make the right decisions? Decision theory


Task Environment

Exercise on Task environment

Agents and Environment

The agent function f maps from percept histories to actions.
The agent program runs (internally) on the physical architecture to produce f.

agent = architecture + program
I) Table-lookup driven agents

Uses a percept-sequence/action table in memory to find the next action. Implemented as a (large) lookup table.

• Drawbacks:
– Huge table (often simply too large)
– Takes a long time to build/learn the table
I) Table-lookup driven agents
Toy example: the vacuum world.
Percepts: the robot senses its location and its "cleanliness": location and contents, e.g., [A, Dirty], [B, Clean]. With 2 locations, we get 4 different possible sensor inputs.
Actions: Left, Right, Suck, NoOp
Table driven agent

Table Lookup
An action sequence of length K gives 4^K different possible percept sequences, so at least that many entries are needed in the table. Even in this very toy world, with K = 20, you need a table with over 4^20 > 10^12 entries.

In more realistic scenarios there are many more distinct percepts (e.g., many more locations, say >= 100). There will then be 100^K different possible sequences of length K. For K = 20, this would require a table with 100^20 = 10^40 entries: infeasible to even store.
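These counts are easy to verify with a small Python check (table_size is a helper name introduced here, not from the slides):

```python
# Number of distinct percept sequences of length K, given p distinct
# percepts per step: a lookup table needs p**K entries.
def table_size(p, K):
    return p ** K

print(table_size(4, 20))    # vacuum world: 4^20, just over 10^12
print(table_size(100, 20))  # 100 percepts: 100^20 = 10^40
```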
Table Lookup
The table-lookup formulation is mainly of theoretical interest. For practical agent systems, we need much more compact representations: for example, logic-based representations, Bayesian network representations, or neural-network-style representations; or we can use a different agent architecture that "ignores the past": reflex agents.
II) Simple reflex agents

Simple reflex agents do not have memory of past world states or percepts. Actions depend solely on the current percept, so the action becomes a "reflex." They use condition-action rules.
II) Simple reflex agents

The agent selects actions on the basis of the current percept only. Example: if the tail-light of the car in front is red, then brake.
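A condition-action rule agent of this kind can be sketched as follows; the rule table and percept format are hypothetical illustrations:

```python
# A simple reflex agent: a list of (condition, action) rules applied
# to the current percept only -- no memory of past percepts.
RULES = [
    (lambda p: p["tail_light_ahead"] == "red", "brake"),
    (lambda p: p["tail_light_ahead"] == "off", "drive"),
]

def simple_reflex_agent(percept):
    for condition, action in RULES:
        if condition(percept):
            return action
    return "noop"  # no rule matched

print(simple_reflex_agent({"tail_light_ahead": "red"}))  # brake
```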
II) Simple reflex agents

II) Simple reflex agents
Closely related to “behaviorism” (psychology; quite effective in explaining
lower-level animal behaviors, such as the behavior of ants and mice).
The Roomba robot largely behaves like this. Behaviors are robust and can be
quite effective and surprisingly complex.

But how does complex behavior arise from simple reflex behavior? E.g., ant colonies and beehives are quite complex.

Simple rules in a diverse environment can give rise to surprising complexity.

See A-life work (artificial life) community, and Wolfram’s cellular automata.
III) --- Model-based reflex agents

Key differences (w.r.t. simple reflex agents):
○ Agents have internal state, which is used to keep track of past states of the world.
○ Agents have the ability to represent change in the world.

Example: Rodney Brooks' Subsumption Architecture (behavior-based robots).
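The internal-state idea can be sketched like this; the percept strings and the inferred fact are hypothetical illustrations:

```python
# A model-based reflex agent keeps internal state updated from the
# percept stream, then applies rules to that state, so it can react
# to facts it inferred earlier even when the current percept is bland.
class ModelBasedAgent:
    def __init__(self):
        self.state = {"dangerous_driver_ahead": False}

    def update_state(self, percept):
        # Infer a persistent fact about the world from the percept.
        if percept == "car ahead swerving":
            self.state["dangerous_driver_ahead"] = True

    def act(self, percept):
        self.update_state(percept)
        if self.state["dangerous_driver_ahead"]:
            return "keep distance"
        return "drive normally"

agent = ModelBasedAgent()
print(agent.act("clear road"))          # drive normally
print(agent.act("car ahead swerving"))  # keep distance
print(agent.act("clear road"))          # keep distance (state remembers)
```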
III) --- Model-based reflex agents
Module: Logical Agents (Representation and Reasoning; Part III/IV of R&N). How detailed should the model be?
Example: the agent infers "potentially dangerous driver in front"; rule: if "dangerous driver in front," then "keep distance."
III) --- Model-based agents

III) --- Model-based agents An example: Brooks’
Subsumption Architecture
Main idea: build complex, intelligent robots by decomposing behaviors
into a hierarchy of skills, each defining a percept-action cycle for one
very specific task.
Examples: collision avoidance, wandering, exploring, recognizing
doorways, etc.
Each behavior is modeled by a finite-state machine with a few states
(though each state may correspond to a complex function or module;
provides internal state to the agent).
Behaviors are loosely coupled via asynchronous interactions.
Note: minimal internal state representation. p. 1003 R&N
III) --- Model-based agents An example: Brooks’
Subsumption Architecture
In subsumption architecture, increasingly complex behaviors arise from
the combination of simple behaviors.

The most basic simple behaviors are on the level of reflexes:
• avoid an object;
• go toward food if hungry;
• move randomly.

A more complex behavior that sits on top of the simple behaviors might be "go across the room." The more complex behaviors subsume the less complex ones to accomplish their goal.
How much of an internal model of the world?
Planning in and reasoning about our surroundings appears to require
some kind of internal representation of our world.
We can "try" things out in this representation, much like running a "simulation" of the effect of an action or a sequence of actions in our head.
General assumption for many years: the more detailed the internal model, the better.
How much of an internal model of the world?
Brooks (mid 80s and 90s) challenged this view:
The philosophy behind Subsumption Architecture is that the world
should be used as its own model. According to Brooks, storing models of
the world is dangerous in dynamic, unpredictable environments because
representations might be incorrect or outdated. What is needed is the
ability to react quickly to the present. So, use minimal internal state
representation, complement at each time step with sensor input.
Debate continues to this day: How much of our world do we (should we)
represent explicitly? Subsumption architecture worked well in robotics.
IV) Goal-based agents
Key difference w.r.t. model-based agents: in addition to state information, goal-based agents have goal information that describes desirable situations to be achieved.

Agents of this kind take future events into consideration: what sequence of actions can I take to achieve certain goals?

Choose actions so as to (eventually) achieve a (given or computed) goal: problem solving and search! (R&N, Part II, chapters 3 to 6)
IV) Goal-based agents
Module: Problem Solving

Example: "clean kitchen"; the agent considers the "future". The agent keeps track of the world state as well as the set of goals it is trying to achieve, and chooses actions that will (eventually) lead to the goal(s).
More flexible than reflex agents; may involve search and planning.
V) Utility-based agents

When there are multiple possible alternatives, how to decide which one is
best?
Goals are qualitative: A goal specifies a crude distinction between a happy
and unhappy state, but often need a more general performance measure
that describes “degree of happiness.”
Utility function U: State → R indicating a measure of success or happiness
when at a given state.
Important for making tradeoffs: Allows decisions comparing choice
between conflicting goals, and choice between likelihood of success and
importance of goal (if achievement
Fundamentalis
in AI uncertain).
and ML 70
Use decision theoretic models: e.g., faster vs. safer.
V) Utility-based agents
Module: Decision Making

Decision-theoretic actions: e.g., faster vs. safer.
VI) --- Learning agents
Learning agents adapt and improve over time. Things get more complicated when the agent needs to learn utility information: reinforcement learning (based on action payoff).
Module: Learning

Figure labels: "Quick turn is not safe" → no quick turn; road conditions, etc.; takes percepts and selects actions; try out the brakes on different road surfaces.
Summary: Agent Types
(1) Table-driven agents
○ use a percept sequence/action table in memory to find the next
action. They are implemented by a (large) lookup table.
(2) Simple reflex agents
○ are based on condition-action rules, implemented with an
appropriate production system. They are stateless devices which do
not have memory of past world states.
(3) Agents with memory - Model-based reflex agents
○ have internal state, which is used to keep track of past states of the
world.
Summary: Agent Types
(4) Agents with goals – Goal-based agents
○ are agents that, in addition to state information, have goal
information that describes desirable situations. Agents of this kind
take future events into consideration.
(5) Utility-based agents
○ base their decisions on classic axiomatic utility theory in order to
act rationally.
(6) Learning agents
○ they have the ability to improve performance through learning.
Summary: Agent Types
● An agent perceives and acts in an environment, has an architecture,
and is implemented by an agent program.
● A rational agent always chooses the action which maximizes its
expected performance, given its percept sequence so far.
● An autonomous agent uses its own experience rather than built-in
knowledge of the environment by the designer.

Summary: Agent Types
● An agent program maps from percept to action and updates its internal
state.
○ Reflex agents (simple / model-based) respond immediately to percepts.
○ Goal-based agents act in order to achieve their goal(s), possible
sequence of steps.
○ Utility-based agents maximize their own utility function.
○ Learning agents improve their performance through learning.
● Representing knowledge is important for successful agent design.
● The most challenging environments are partially observable, stochastic,
sequential, dynamic, and continuous, and contain multiple intelligent
agents.
Searching for a (shortest / least cost) path to goal
state(s)

Search through the state space. We will consider search techniques that use an explicit search tree that is generated by the initial state + successor function.

initialize (initial node)
loop:
    choose a node for expansion according to strategy
    goal node? done
    expand node with successor function
Tree-search algorithms
Basic idea:
○ simulated exploration of the state space by generating successors of already-explored states (a.k.a. "expanding" states)

Notes:
1) We only check whether a node is a goal state after we select that node for expansion.
2) A "node" is a data structure containing the state plus additional info (parent node, etc.).
Tree-search algorithm - Example

Node selected for expansion.
Nodes added to tree.
Selected for expansion; added to tree.

Note: Arad is added (again) to the tree, since it is reachable from Sibiu. This is not necessarily a problem, but in Graph-Search we will avoid it by maintaining an "explored" list.
Graph-search

Note:
1) Uses an "explored" set to avoid visiting already-explored states.
2) Uses a "frontier" set to store states that remain to be explored and expanded.
3) However, with e.g. uniform-cost search, we need to make a special check when a node (i.e., a state) is already on the frontier. Details later.
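The Graph-Search idea (frontier queue plus explored set) can be sketched in Python; the example graph is hypothetical:

```python
from collections import deque

# Graph-Search sketch: breadth-first traversal with an explored set,
# so already-visited states are never expanded twice.
def graph_search(graph, start, goal):
    frontier = deque([[start]])       # queue of paths
    explored = set()
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        if node in explored:
            continue
        explored.add(node)
        for neighbor in graph.get(node, []):
            if neighbor not in explored:
                frontier.append(path + [neighbor])
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(graph_search(graph, "A", "D"))  # ['A', 'B', 'D']
```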
Implementation: states vs. nodes

A state is a representation of a physical configuration.
A node is a data structure constituting part of a search tree; it includes the state, the tree parent node, the action (applied to the parent), the path cost g(x) (from the initial state to the node), and the depth.

The fringe is the collection of nodes that have been generated but not (yet) expanded. Each node of the fringe is a leaf node.
The Expand function creates new nodes, filling in the various fields and using the SuccessorFn of the problem to create the corresponding states.

Implementation: General Tree Search
Search Strategies
A search strategy is defined by picking the order of node expansion.

Strategies are evaluated along the following dimensions:


○ completeness: does it always find a solution if one exists?
○ time complexity: number of nodes generated
○ space complexity: maximum number of nodes in memory
○ optimality: does it always find a least-cost solution?

Time and space complexity are measured in terms of


○ b: maximum branching factor of the search tree
○ d: depth of the least-cost solution
○ m: maximum depth of the state space (may be ∞)

Uninformed Search Strategies
Uninformed (blind) search strategies use only the information available
in the problem definition:
○ Breadth-first search
○ Uniform-cost search
○ Depth-first search
○ Depth-limited search
○ Iterative deepening search
○ Bidirectional search

Key issue: type of queue used for the fringe of the search tree (collection
of tree nodes that have been generated but not yet expanded)
Breadth-First Search

Expand the shallowest unexpanded node.
Implementation: the fringe is a FIFO (First-In-First-Out) queue, i.e., new nodes go at the end.

Fringe queue: <A>. Select A from the queue and expand; this gives <B, C>.
Queue: <B, C>. Select B from the front and expand, putting its children at the end; this gives <C, D, E>.
Fringe queue: <C, D, E>. Expanding C gives <D, E, F, G>.
Assuming no further children, the queue becomes <E, F, G>, then <F, G>, <G>, <>. Each time, the dequeued node is checked for the goal state.
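The FIFO-queue trace above can be sketched as code; the example tree and helper names are illustrative:

```python
from collections import deque

# Breadth-first search over a tree given as a successor dict: the
# fringe is a FIFO queue, and the goal test happens at expansion.
def breadth_first_search(start, successors, is_goal):
    fringe = deque([start])
    order = []                      # expansion order, for illustration
    while fringe:
        node = fringe.popleft()     # shallowest unexpanded node
        order.append(node)
        if is_goal(node):
            return node, order
        fringe.extend(successors.get(node, []))
    return None, order

tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
goal, order = breadth_first_search("A", tree, lambda n: n == "F")
print(goal, order)  # F ['A', 'B', 'C', 'D', 'E', 'F']
```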
Properties of breadth-first search

Complete? Yes (if b is finite).
Time? 1 + b + b^2 + b^3 + ... + b^d + b(b^d - 1) = O(b^(d+1))
Space? O(b^(d+1)) (keeps every node in memory; also needed to reconstruct the solution path)
Optimal solution found? Yes (if all step costs are identical).
Space is the bigger problem (more than time).
• b: maximum branching factor of the search tree
• d: depth of the least-cost solution
Note: the goal check happens only when a node is expanded.
Time and Space requirement for Breadth First
Search with bf = 10

Uniform Cost Search
Expand the least-cost (path-to) unexpanded node (e.g., useful for finding the shortest path on a map).
Implementation: the fringe is a queue ordered by path cost g, the cost of reaching a node.

Complete? Yes, if step cost ≥ ε (> 0).
Time? Number of nodes with g ≤ cost of the optimal solution C*: O(b^(1+⌊C*/ε⌋)).
Space? Number of nodes with g ≤ cost of the optimal solution: O(b^(1+⌊C*/ε⌋)).
Optimal? Yes: nodes are expanded in increasing order of g(n).
Note: there are some subtleties (e.g., checking for the goal state). See p. 84 of R&N and the next slide.
Uniform Cost Search

Two subtleties (bottom of p. 83, R&N):

1) Do the goal-state test only when a node is selected for expansion. (Reason: Bucharest may occur on the frontier with a longer-than-optimal path. It won't be selected for expansion yet; other nodes will be expanded first, leading us to uncover a shorter path to Bucharest.)

2) The graph-search algorithm says "don't add a child node to the frontier if it is already on the explored list or already on the frontier." BUT the child may give a shorter path to a state already on the frontier. In that case, we need to replace the existing frontier node with the shorter path.
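A uniform-cost search sketch that respects both subtleties (goal test at expansion; cheaper paths supersede frontier entries), on a hypothetical weighted graph:

```python
import heapq

# Uniform-cost search over {node: [(neighbor, step_cost), ...]}.
# The goal test happens when a node is selected for expansion; a
# shorter path found later supersedes the frontier entry, because
# stale, costlier entries are skipped when popped.
def uniform_cost_search(graph, start, goal):
    frontier = [(0, start, [start])]      # (g, state, path)
    best_g = {start: 0}
    while frontier:
        g, state, path = heapq.heappop(frontier)
        if g > best_g.get(state, float("inf")):
            continue                      # stale entry: a cheaper path won
        if state == goal:                 # goal test at expansion
            return g, path
        for neighbor, cost in graph.get(state, []):
            new_g = g + cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                heapq.heappush(frontier, (new_g, neighbor, path + [neighbor]))
    return None

graph = {
    "S": [("A", 1), ("B", 5)],
    "A": [("B", 1)],
    "B": [("G", 1)],
}
print(uniform_cost_search(graph, "S", "G"))  # (3, ['S', 'A', 'B', 'G'])
```

Note how the direct edge S→B with cost 5 is superseded by the cheaper path S→A→B with cost 2, exactly the situation subtlety 2 describes.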
Depth-First Search
"Expand the deepest unexpanded node."
Implementation:
○ fringe = LIFO (Last-In-First-Out) stack, i.e., put successors at the front ("push on stack").

Fringe stack: <A>. Expanding A gives the stack <B, C>; so B is next.
Expanding B gives the stack <D, E, C>; so D is next.
Expanding D gives the stack <H, I, E, C>; so H is next, etc.
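The LIFO-stack trace above can be sketched as code; the example tree is hypothetical:

```python
# Depth-first search sketch: the fringe is a LIFO stack, so the
# deepest unexpanded node is always expanded first.
def depth_first_search(start, successors, is_goal):
    fringe = [start]                 # stack
    order = []                       # expansion order, for illustration
    while fringe:
        node = fringe.pop()          # deepest unexpanded node
        order.append(node)
        if is_goal(node):
            return node, order
        # Push children reversed so the leftmost child is expanded first.
        fringe.extend(reversed(successors.get(node, [])))
    return None, order

tree = {"A": ["B", "C"], "B": ["D", "E"], "D": ["H", "I"]}
goal, order = depth_first_search("A", tree, lambda n: n == "E")
print(goal, order)  # E ['A', 'B', 'D', 'H', 'I', 'E']
```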


Properties of Depth-First Search

Complete? No: fails in infinite-depth spaces and in spaces with loops. Modified to avoid repeated states along the current path, it is complete in finite spaces.
Time? O(b^m): bad if m is much larger than d, but if solutions are dense it may be much faster than breadth-first.
Space? O(bm), i.e., linear space! (The solution path can also be reconstructed from the single stored branch.)
Guarantee that the optimal solution is found? No.
• b: maximum branching factor of the search tree
• d: depth of the shallowest (least-cost) solution
• m: maximum depth of the state space

Depth-Limited Search


Complexity Analysis (depth limit l)

Completeness: if l < d, incomplete
Time complexity: O(b^l)
Space complexity: O(bl)
Optimality: if l > d, non-optimal


Iterative deepening search

Run depth-limited search with increasing depth limits l = 0, 1, 2, 3, ...

Why would one do that?


Why would one do that?

It combines the good memory requirements of depth-first search with the completeness of breadth-first search when the branching factor is finite, and it is optimal when the path cost is a non-decreasing function of the depth of the node.

The idea was a breakthrough in game playing: all game-tree search uses iterative deepening nowadays. What's the added advantage in games? Its "anytime" nature.
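Iterative deepening can be sketched as repeated depth-limited search; the tree and function names are illustrative:

```python
# Depth-limited DFS returning the solution path (or None).
def depth_limited(node, successors, is_goal, limit):
    if is_goal(node):
        return [node]
    if limit == 0:
        return None
    for child in successors.get(node, []):
        path = depth_limited(child, successors, is_goal, limit - 1)
        if path is not None:
            return [node] + path
    return None

# Iterative deepening: retry with limits 0, 1, 2, ... so the
# shallowest solution is found first with only linear space.
def iterative_deepening_search(start, successors, is_goal, max_depth=50):
    for limit in range(max_depth + 1):
        path = depth_limited(start, successors, is_goal, limit)
        if path is not None:
            return path
    return None

tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
print(iterative_deepening_search("A", tree, lambda n: n == "F"))
# ['A', 'C', 'F']
```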
Iterative deepening search
Iterative deepening is the preferred uninformed search method when there is a large search space and the depth of the solution is not known.
Number of nodes generated in an iterative deepening search to depth d with branching factor b (looks quite wasteful; is it?):

NIDS = (d+1)b^0 + d b^1 + (d-1)b^2 + ... + 3b^(d-2) + 2b^(d-1) + 1b^d

Nodes generated in a breadth-first search with branching factor b:

NBFS = b^1 + b^2 + ... + b^(d-2) + b^(d-1) + b^d

For b = 10, d = 5:
○ NBFS = 10 + 100 + 1,000 + 10,000 + 100,000 = 111,110
○ NIDS = 6 + 50 + 400 + 3,000 + 20,000 + 100,000 = 123,456
Iterative deepening search

Fundamental in AI and ML 123


Iterative deepening search

Complete? Yes (if b is finite)

Time? d·b^1 + (d-1)·b^2 + … + b^d = O(b^d)

Space? O(bd)

Optimal? Yes, if step costs are identical
Fundamental in AI and ML 124
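The IDS loop from the slides, sketched in Python: restart a depth-limited DFS with limit 0, 1, 2, …, so the shallowest solution is found with only linear space. The graph is the same hypothetical example used above, defined here again so the sketch is self-contained.

```python
def iterative_deepening_search(graph, start, goal, max_depth=50):
    """Run depth-limited DFS with increasing limits until the goal is found."""
    def dls(node, limit, path):
        if node == goal:
            return path
        if limit == 0:
            return None
        for child in graph.get(node, []):
            if child not in path:              # avoid cycles along the path
                found = dls(child, limit - 1, path + [child])
                if found:
                    return found
        return None

    for limit in range(max_depth + 1):         # depth 0, 1, 2, ... (re-expands
        result = dls(start, limit, [start])    # shallow nodes each iteration)
        if result:
            return result
    return None

graph = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F'], 'E': ['G'], 'F': ['G']}
print(iterative_deepening_search(graph, 'A', 'G'))   # ['A', 'B', 'E', 'G']
```

Because the deepest level dominates the node count, the repeated shallow work costs little, which is exactly the 123,456 vs. 111,110 comparison above.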


Bidirectional Search
• Simultaneously:
○ Search forward from start
○ Search backward from the goal
Stop when the two searches meet.

• If branching factor = b in each direction, with solution at depth d:
  only b^(d/2) + b^(d/2) = O(b^(d/2)) nodes are generated, far fewer than
  the O(b^d) of a single search
Aside: The predecessor of a node should be easily computable (i.e.,


actions are easily reversible).
Fundamental in AI and ML 125
Bidirectional Search

• Checking a node for membership in the other search tree can be done
in constant time (hash table)

• Key limitations:
  Space: O(b^(d/2))
  Also, how to search backwards can be an issue (e.g., in Chess). What's
  tricky? Problem: lots of states satisfy the goal; we don't know which one
  is relevant.
Fundamental in AI and ML 126
Repeated States
Failure to detect repeated states can turn a linear problem into an
exponential one!

Don't return to the parent node:
  don't generate a successor equal to the node's parent.
Don't allow cycles:
  don't revisit a state on the current path.
Don't revisit any state:
  keep every visited state in memory! O(b^d) (can be expensive)

Arises in problems where actions are reversible (e.g., routing problems or
sliding-block puzzles). Also in, e.g., Chess: hash tables are used to check for
repeated states; huge tables (100M+ entries) but very useful.
See Tree-Search vs. Graph-Search. Need to be careful to maintain (path)
optimality and completeness. Fundamental in AI and ML 127
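The strictest option above, keeping every visited state in memory, is exactly graph search. A minimal BFS sketch with an explored set (the adjacency dict is an illustrative example, not from the slides):

```python
from collections import deque

def graph_search_bfs(graph, start, goal):
    """BFS that remembers every visited state, so no state is expanded twice."""
    frontier = deque([[start]])
    explored = {start}                  # the O(b^d) "keep every state" memory
    while frontier:
        path = frontier.popleft()       # FIFO: shallowest node first
        node = path[-1]
        if node == goal:
            return path
        for child in graph.get(node, []):
            if child not in explored:   # blocks parent-returns, cycles, repeats
                explored.add(child)
                frontier.append(path + [child])
    return None

graph = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F'], 'E': ['G'], 'F': ['G']}
print(graph_search_bfs(graph, 'A', 'G'))   # ['A', 'B', 'E', 'G']
```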
Bidirectional Search - Example
In the below search tree, bidirectional
search algorithm is applied. This
algorithm divides one graph/tree into two
sub-graphs. It starts traversing from node
1 in the forward direction and starts from
goal node 16 in the backward direction.

The algorithm terminates at node 9 where


two searches meet.

Fundamental in AI and ML 128


Bidirectional Search
Completeness: Bidirectional search is complete if we use BFS in both
directions.

Time Complexity: Time complexity of bidirectional search using BFS
is O(b^(d/2)).

Space Complexity: Space complexity of bidirectional search is O(b^(d/2)).

Optimal: Bidirectional search is optimal (using BFS with unit step costs).

Fundamental in AI and ML 129
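The two-frontier idea can be sketched as below. To stay short, this illustrative version returns only the meeting state (a full implementation would follow the two parent maps outward to reconstruct the path); the undirected example graph is made up, and reversibility of actions is assumed, as the slides note.

```python
from collections import deque

def bidirectional_search(graph, start, goal):
    """Alternate one BFS step from each end; stop where the frontiers meet."""
    if start == goal:
        return [start]
    # Each parents dict records which states that direction has reached.
    fwd_parents, bwd_parents = {start: None}, {goal: None}
    fwd, bwd = deque([start]), deque([goal])
    while fwd and bwd:
        for frontier, parents, others in ((fwd, fwd_parents, bwd_parents),
                                          (bwd, bwd_parents, fwd_parents)):
            if not frontier:
                return None
            node = frontier.popleft()
            for child in graph.get(node, []):
                if child not in parents:
                    parents[child] = node
                    if child in others:      # the two searches meet here
                        return child
                    frontier.append(child)
    return None

# Hypothetical undirected graph: A-B, A-C, B-D, C-D, D-E.
graph = {'A': ['B', 'C'], 'B': ['A', 'D'], 'C': ['A', 'D'],
         'D': ['B', 'C', 'E'], 'E': ['D']}
print(bidirectional_search(graph, 'A', 'E'))   # 'D'
```

The membership test `child in others` is the constant-time hash-table check mentioned on the previous slide.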


Summary: General Uninformed Search
● Original search ideas in AI were inspired by studies of human problem
solving in, e.g., puzzles, math, and games, but a great many AI tasks now
require some form of search (e.g. find optimal agent strategy; active
learning; constraint reasoning; NP-complete problems require search).
● Problem formulation usually requires abstracting away real-world
details to define a state space that can feasibly be explored.
● Variety of uninformed search strategies
● Iterative deepening search uses only linear space and not much more
time than other uninformed algorithms.
● Avoid repeating states / cycles.
Fundamental in AI and ML 130
Searching with Partial Observations

Fundamental in AI and ML 131


Conformant (Sensorless) search: Example Space

Fundamental in AI and ML 132


Conformant (Sensorless) search: Example Space

Fundamental in AI and ML 133


Searching with no observations

Fundamental in AI and ML 134


Searching with observations

Fundamental in AI and ML 135


Searching with observations

Fundamental in AI and ML 136


Constraint Satisfaction Problems
Constraint Satisfaction

• It is a search procedure that operates in a space of constraint sets


• Constraint satisfaction problems in AI have the goal of discovering some
problem state that satisfies a given set of constraints
Process:

• Constraints are discovered and propagated throughout the system

• If there is still no solution, search begins
Fundamental in AI and ML 137
Constraint Satisfaction Problems
•Constraint Satisfaction Problems:
•A CSP consists of three components: V, D, C
•V: the variable set {v1, v2, v3, …, vn}
•D: the domains {D1, D2, D3, …, Dn} => one domain for each variable
•C: the constraints => specify allowable combinations of values
•Ci = (scope, relation), where scope = the set of variables that participate in
the constraint and relation = defines the values those variables can take.
•Intelligent backtracking methods are used to solve Constraint Satisfaction
Problems.
•Only backtrack to where the conflict occurred.
Fundamental in AI and ML 138
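The (V, D, C) triple can be written down directly as Python data. This sketch uses the standard Australia map-coloring example (the region names and adjacencies are the usual textbook ones); every constraint's scope is a pair of adjacent regions and its relation is "not equal".

```python
# V: the variables (regions of Australia)
variables = ['WA', 'NT', 'SA', 'Q', 'NSW', 'V', 'T']
# D: one domain per variable
domains = {v: ['red', 'green', 'blue'] for v in variables}
# C: each constraint is (scope = an adjacent pair, relation = "!=")
constraints = [('WA', 'NT'), ('WA', 'SA'), ('NT', 'SA'), ('NT', 'Q'),
               ('SA', 'Q'), ('SA', 'NSW'), ('SA', 'V'), ('Q', 'NSW'),
               ('NSW', 'V')]

def consistent(assignment):
    """True if no adjacency constraint is violated by a (partial) assignment."""
    return all(assignment[a] != assignment[b]
               for a, b in constraints
               if a in assignment and b in assignment)

print(consistent({'WA': 'red', 'NT': 'green', 'SA': 'blue'}))   # True
print(consistent({'WA': 'red', 'NT': 'red'}))                   # False
```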
Map Coloring

Fundamental in AI and ML 139


Map Coloring

Fundamental in AI and ML 140


Map Coloring Example

Fundamental in AI and ML 141


Constraint Graph

Fundamental in AI and ML 142


Constraint Graph

Fundamental in AI and ML 143


Constraint Satisfaction Problems

•CSP can be viewed as a standard search problem as follows :


•Initial state : the empty assignment {},in which all variables are unassigned.
•Successor function : a value can be assigned to any unassigned variable,
provided that it does not conflict with previously assigned variables.
•Goal test : the current assignment is complete.
•Path cost : a constant cost (e.g., 1) for every step.

Fundamental in AI and ML 144


Backtracking Algorithm
•Backtracking can be defined as a general algorithmic technique that
considers searching every possible combination in order to solve a
computational problem.

What is Backtracking Algorithm?


•Backtracking is an algorithmic technique for solving problems recursively by
trying to build a solution incrementally, one piece at a time, removing those
solutions that fail to satisfy the constraints of the problem at any point of time.

Fundamental in AI and ML 145


Backtracking Algorithm

When to use a Backtracking algorithm?


•When we have multiple choices, then we make the decisions from the
available choices. In the following cases, we need to use the backtracking
algorithm:
•Sufficient information is not available to make the best choice, so we use
the backtracking strategy to try out all the possible solutions.
•Each decision leads to a new set of choices. Then again, we backtrack to
make new decisions. In this case, we need to use the backtracking strategy.

Fundamental in AI and ML 146


Backtracking Algorithm
How does Backtracking work?
•Backtracking is a systematic method of trying out various sequences of
decisions until you find one that works. Let's understand through an example.
•We start with a start node. First, we move to node A. Since it is not a feasible
solution, we move to the next node, i.e., B. B is also not a feasible solution
and a dead end, so we backtrack from node B to node A.

Fundamental in AI and ML 147


Backtracking Algorithm

Fundamental in AI and ML 148


Backtracking Algorithm
The terms related to backtracking are:
•Live node: a node that can still generate further children.
•E-node: the live node whose children are currently being generated (the node being expanded).
•Success node: a node that provides a feasible solution.
•Dead node: a node that cannot be generated further and also does not provide a
feasible solution.
Applications of Backtracking
•N-queen problem
•Sum of subset problem
•Graph coloring
Fundamental in AI and ML 149
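Backtracking for a CSP, sketched in Python on a small map-coloring instance (the four-region example is illustrative): assign variables one at a time, check constraints after each assignment, and undo the assignment on failure.

```python
def backtrack(assignment, variables, domains, constraints):
    """Classic backtracking: assign one variable at a time, undo on conflict."""
    if len(assignment) == len(variables):
        return assignment                     # complete assignment: success
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if all(assignment[a] != assignment[b] for a, b in constraints
               if a in assignment and b in assignment):
            result = backtrack(assignment, variables, domains, constraints)
            if result:
                return result
        del assignment[var]                   # backtrack: undo this choice
    return None                               # dead node: no value works

variables = ['WA', 'NT', 'SA', 'Q']
domains = {v: ['red', 'green', 'blue'] for v in variables}
constraints = [('WA', 'NT'), ('WA', 'SA'), ('NT', 'SA'), ('NT', 'Q'), ('SA', 'Q')]
solution = backtrack({}, variables, domains, constraints)
print(solution)   # {'WA': 'red', 'NT': 'green', 'SA': 'blue', 'Q': 'red'}
```

This plain version backtracks chronologically; the "intelligent backtracking" mentioned earlier would jump directly to the variable that caused the conflict.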
Constraint propagation
•Constraint propagation is the general term for propagating the implications
of a constraint on one variable onto other variables.
•Constraint propagation repeatedly enforces constraints locally to detect
inconsistencies. This propagation can be done with different types of
consistency techniques. They are:
•Node consistency (one consistency)
•Arc consistency (two consistency)
•Path consistency (K-consistency)

Fundamental in AI and ML 150


Constraint propagation
Node consistency
• Simplest consistency technique
• The node representing a variable V in constraint graph is node consistent if
for every value X in the current domain of V, each unary constraint on V is
satisfied.
• The node inconsistency can be eliminated by simply removing those values
from the domain D of each variable V that do not satisfy unary constraint on
V.

Fundamental in AI and ML 151


Constraint propagation
Arc Consistency
•Here, 'arc’ refers to a directed arc in the constraint graph, such as the arc
from SA to NSW. Given the current domains of SA and NSW, the arc is
consistent if, for every value x of SA, there is some value y of NSW that is
consistent with x.
•In the constraint graph, binary constraint corresponds to arc. Therefore this
type of consistency is called arc consistency.
•Arc (vi, vj) is arc consistent if for every value X in the current domain of vi
there is some value Y in the domain of vj such that vi = X and vj = Y is
permitted by the binary constraint between vi and vj.

Fundamental in AI and ML 152
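Arc consistency is usually enforced with the AC-3 algorithm. The sketch below is simplified for illustration: the binary relation is hard-wired to "not equal" (as in map coloring), whereas a general AC-3 would take an arbitrary constraint function per arc.

```python
from collections import deque

def ac3(domains, neighbors):
    """Enforce arc consistency in place; return False if a domain empties."""
    queue = deque((xi, xj) for xi in domains for xj in neighbors[xi])
    while queue:
        xi, xj = queue.popleft()
        # Revise: keep only values of xi that have some support in xj's domain.
        revised_domain = [x for x in domains[xi]
                          if any(x != y for y in domains[xj])]
        if len(revised_domain) < len(domains[xi]):
            domains[xi] = revised_domain
            if not revised_domain:
                return False                  # inconsistency detected
            for xk in neighbors[xi]:          # xi changed: recheck arcs into xi
                if xk != xj:
                    queue.append((xk, xi))
    return True

# Tiny illustrative example: A is fixed to red, so B must lose red.
domains = {'A': ['red'], 'B': ['red', 'green']}
neighbors = {'A': ['B'], 'B': ['A']}
ok = ac3(domains, neighbors)
print(ok, domains['B'])   # True ['green']
```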


Constraint propagation
k-Consistency (path Consistency)
•A CSP is k-consistent if, for any set of k - 1 variables and for any consistent
assignment to those variables, a consistent value can always be assigned to
any kth variable
•1-consistency means that each individual variable by itself is consistent; this
is also called node consistency.
•2-consistency is the same as arc consistency.
•3-consistency means that any pair of adjacent variables can always be
extended to a third neighboring variable; this is also called path consistency

Fundamental in AI and ML 153


Knowledge
Representation

Fundamental in AI and ML 154


First Order Predicate Logic
—First Order Logic (FOL) can be simply put as a collection of objects, their
attributes and relations among them to represent knowledge. It is also known
as Predicate Logic.

—First-order logic is another way of knowledge representation in artificial


intelligence. It is an extension to propositional logic. FOL is sufficiently
expressive to represent the natural language statements in a concise way.
First-order logic is also known as Predicate logic or First-order predicate
logic.

Fundamental in AI and ML 155


What is First Order Logic ?
• FOL is a mode of representation in AI. It is an extension of PL.

• FOL represents natural language statements in a concise way.

• FOL is also called predicate logic. It is a powerful language used to develop


information about an object and express the relationship between objects.

• FOL not only assumes that the world contains facts (as PL does); it also
assumes the following:

• Objects: A, B, people, numbers, colors, wars, theories, squares, pit, etc.
Fundamental in AI and ML 156
First Order Predicate Logic
Relations: unary or n-ary relations such as red, round, sister of, brother of, etc.

Functions: father of, best friend, third inning of, end of, etc.

Fundamental in AI and ML 157


Representing Simple Statements in FOL:
It is important that you know the logical operators/connectives that are used in
Propositional Logic.

Fundamental in AI and ML 158


Part of First Order Logic
FOL also has two parts:
1.Syntax
2.Semantics
—

Syntax: The syntax of FOL decides which collection of symbols is a logical


expression. The basic syntactic elements of FOL are symbols. We use
symbols to write statements in shorthand notation.

Fundamental in AI and ML 159


Basic Elements of FOL:

Fundamental in AI and ML 160


Atomic and Complex Sentences in FOL:
Atomic Sentence:

This is a basic sentence of FOL formed from a predicate symbol followed by


a parenthesis with a sequence of terms.
We can represent atomic sentences as a predicate (value1, value2…., value n).

Example-
John and Michael are colleagues → Colleagues(John, Michael)
German Shepherd is a dog → Dog(German Shepherd)

Fundamental in AI and ML 161


Atomic and Complex Sentences in FOL:
Complex Sentence:
Complex sentences are made by combining atomic sentences using
connectives.

A statement in FOL is further divided into two parts:

Subject: the main part of the statement.
Predicate: a relation that binds atoms together.

Example-
1.Colleague (Oliver, Benjamin) ∧ Colleague (Benjamin, Oliver)
2.“x is an integer” Fundamental in AI and ML 162
Atomic and Complex Sentences in FOL:
It has two parts;
First, x is the subject.
Second, “is an integer” is called a predicate.

Fundamental in AI and ML 163


Quantifiers and their use in FOL:
Quantifiers generate quantification and specify the number of specimens in
the universe to which an expression applies.
Quantifiers allow us to determine or identify the range and scope of the
variable in a logical expression.

There are two types of quantifiers:


1.Universal quantifier: for all, everyone, everything.
2.Existential quantifier: for some, at least one.

Fundamental in AI and ML 164


Universal Quantifiers:
Universal quantifiers specify that the statement within the range is true for everything
or every instance of a particular thing.

Universal quantifiers are denoted by the symbol ∀ (an inverted A).

With a universal quantifier, we use the implication (→) connective.

•If x is a variable, then ∀x reads as:
1. For all x
2. For every x
3. For each x

Fundamental in AI and ML 165


Example- Every student likes Educative.

Explanation:
In logical notation, it can be written as:

∀x Student(x) → Likes(x, Educative)

This can be interpreted as: for every x, if x is a student, then x likes Educative.
Fundamental in AI and ML 166
Existential Quantifiers:

Existential quantifiers are used to express that the statement within their
scope is true for at least one instance of something.

The symbol ∃, which looks like an inverted E, is used to represent them. With
an existential quantifier, we always use the AND (∧, conjunction) connective.

Fundamental in AI and ML 167


Existential Quantifiers:
If x is a variable, the existential quantifier will be
1.For some x
2.There exists an x
3.For at least one x

Example-
Some people like Football.

Fundamental in AI and ML 168


Existential Quantifiers:
Explanation:
In logical notation, it can be written as:

∃x People(x) ∧ Likes(x, Football)

It can be interpreted as: there is some x such that x is a person who likes football.

Fundamental in AI and ML 169


Games: Minimax and Alpha-Beta Pruning

Fundamental in AI and ML 170


Outline:
1. Overview
2. Minimax for Zero-Sum Games
3. α-β Pruning

Fundamental in AI and ML 171


Types of Games:
Game = task environment with > 1 agent

To Know:
1. Deterministic or stochastic?
2. Perfect information (fully observable)?
3. Two, three, or more players?
4. Teams or individuals?
5. Turn-taking or simultaneous?
6. Zero sum?

Output: algorithms for calculating a contingent plan (a.k.a. strategy or policy)
which recommends a move for every possible eventuality
Fundamental in AI and ML 172
Example of a Game-Tree

Fundamental in AI and ML 173


Standard Games
Rational Agent → A rational player
Game formulation: Assume all future moves will be optimal
1.Initial state: s0
2.Players: Player(s) indicates whose move it is
3.Actions: Actions(s) for player on move
4.Transition model: Result(s,a)
5.Terminal test: Terminal-Test(s)
6.Terminal values: Utility(s,p) for player p
7.Or just Utility(s) for player making the decision at root

Minimax is like DFS. Time: O(b^m), Space: O(bm)

For chess, b ≈ 35, m ≈ 100
# Exact solution is completely infeasible
# Humans can't do this either, so how do we play chess?
Fundamental in AI and ML 174
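The game formulation above maps directly onto a recursive minimax. This illustrative sketch abstracts the formulation into two callbacks (`successors` plays the role of Actions/Result, `utility` the terminal value); the example tree is a standard textbook two-ply instance, where MIN reduces the three subtrees to 3, 2, 2 and MAX picks 3.

```python
def minimax(state, maximizing, successors, utility):
    """Minimax over a full game tree. successors(state) returns child states
    (empty at terminal states); utility(state) scores terminal states."""
    children = successors(state)
    if not children:
        return utility(state)                 # Terminal-Test: no moves left
    values = [minimax(c, not maximizing, successors, utility) for c in children]
    return max(values) if maximizing else min(values)

# Tuples are internal nodes, ints are terminal utilities (for the MAX player).
tree = ((3, 12, 8), (2, 4, 6), (14, 5, 2))
succ = lambda s: s if isinstance(s, tuple) else ()
util = lambda s: s
print(minimax(tree, True, succ, util))   # 3
```

Note the exhaustive recursion: every leaf is visited, which is the O(b^m) cost the slide quotes.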
Zero-Sum Games

Zero-Sum Games General-Sum Games


Agents have opposite utilities Agents have independent utilities
Pure competition: Cooperation, indifference,
One maximizes, the other minimizes
competition, shifting alliances, and
more are all possible
Fundamental in AI and ML 175
Game Tree Complexity

Saviour
Alpha-beta
Pruning

Fundamental in AI and ML 176


Alpha-Beta Pruning
Principle:
In any game tree, for any node n: if the player has a better choice m, either
at the parent of n or at any choice point further up, then n will never be
reached in the actual play!

Alpha bound of a node: the max current value of all Max ancestors of the
node. Exploration of a Min node is stopped when its value equals or falls
below alpha.

Beta bound of a node: the min current value of all Min ancestors of the
node. Exploration of a Max node is stopped when its value equals or exceeds
beta.
Fundamental in AI and ML 177
Fundamental in AI and ML 178
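The alpha and beta bounds described above can be threaded through the same recursive minimax. In this illustrative sketch (same tuple-tree convention as before), `alpha` is the best value a Max ancestor already has and `beta` the best a Min ancestor has; a branch is cut as soon as its value crosses the relevant bound.

```python
def alphabeta(state, maximizing, successors, utility,
              alpha=float('-inf'), beta=float('inf')):
    """Minimax with alpha-beta pruning; same result as plain minimax."""
    children = successors(state)
    if not children:
        return utility(state)
    if maximizing:
        value = float('-inf')
        for c in children:
            value = max(value, alphabeta(c, False, successors, utility,
                                         alpha, beta))
            alpha = max(alpha, value)
            if value >= beta:        # a Min ancestor has a better option:
                break                # prune the remaining children
        return value
    value = float('inf')
    for c in children:
        value = min(value, alphabeta(c, True, successors, utility,
                                     alpha, beta))
        beta = min(beta, value)
        if value <= alpha:           # a Max ancestor has a better option
            break
    return value

tree = ((3, 12, 8), (2, 4, 6), (14, 5, 2))
succ = lambda s: s if isinstance(s, tuple) else ()
util = lambda s: s
print(alphabeta(tree, True, succ, util))   # 3, same answer with fewer leaves
```

On this tree, the second subtree is cut after seeing the leaf 2 (since 2 ≤ α = 3), exactly the pruning the principle predicts.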
What can we Prune ?
For chess: only about 35^50 nodes instead of 35^100!!
Yaaay!!!!!

Fundamental in AI and ML 179


Why game playing ?

Fundamental in AI and ML 180


Characteristics of Game playing ?

Fundamental in AI and ML 181


Games vs. Search Problems

Fundamental in AI and ML 182


Two Player game

Fundamental in AI and ML 183


Games as Adversarial Search

Fundamental in AI and ML 184


Two player Game Tree (Example)

Fundamental in AI and ML 185


Mini-Max Algorithm

Fundamental in AI and ML 186


Mini Max Toy Game

Fundamental in AI and ML 187


Mini-Max Search Tree

Fundamental in AI and ML 188


Mini-Max Search Tree

Fundamental in AI and ML 189


Mini-Max Search Tree

Fundamental in AI and ML 190


Mini-Max Search Tree

Fundamental in AI and ML 191


Mini-Max Search Tree

Fundamental in AI and ML 192


Mini-Max Search Tree

Fundamental in AI and ML 193


Mini-Max Search Tree

Fundamental in AI and ML 194


Mini-Max Search Tree

Fundamental in AI and ML 195


Fundamental in AI and ML 196
Properties of Mini-Max

Fundamental in AI and ML 197


Why expand Unnecessary nodes ?

Fundamental in AI and ML 198


Mini-Max Search Tree

Fundamental in AI and ML 199


Mini-Max Search Tree

Fundamental in AI and ML 200


Alpha Beta

Fundamental in AI and ML 201


Mini-Max Search Tree

Fundamental in AI and ML 202


Mini-Max Search Tree

Fundamental in AI and ML 203


Mini-Max Search Tree

Fundamental in AI and ML 204


Bad and Good Cases for Alpha-Beta Pruning

Fundamental in AI and ML 205


Properties of Alpha-Beta Pruning

Fundamental in AI and ML 206


Local Search and Optimization

Fundamental in AI and ML 207


Outline

• Local search techniques and optimization


– Hill-climbing
– Gradient methods
– Simulated annealing
– Genetic algorithms
– Issues with local search

Fundamental in AI and ML 208


Local Search and Optimization

• Previous lecture: path to goal is solution to problem


– systematic exploration of search space.

• This lecture: a state is solution to problem


– for some problems path is irrelevant.
– E.g., 8-queens

• Different algorithms can be used


– Depth First Branch and Bound
– Local search
Fundamental in AI and ML 209
Local Search and Optimization

Goal Satisfaction                     Optimization

Reach the goal node                   Optimize(objective fn)

Constraint satisfaction               Constraint optimization

You can go back and forth between the two problems;
they are typically in the same complexity class.

Fundamental in AI and ML 210


Local Search and Optimization
• Local search
– Keep track of single current state
– Move only to neighboring states
– Ignore paths

• Advantages:
– Use very little memory
– Can often find reasonable solutions in large or infinite (continuous) state spaces.

• “Pure optimization” problems


– All states have an objective function
– Goal is to find state with max (or min) objective value
– Does not quite fit into path-cost/goal-state formulation
– Local search can do quite well on these problems.
Fundamental in AI and ML 211
Trivial Algorithms

• Random Sampling
– Generate a state randomly

• Random Walk
– Randomly pick a neighbor of the current state

• Both algorithms asymptotically complete.

Fundamental in AI and ML 212


Hill-climbing (Greedy Local Search) max version
function HILL-CLIMBING( problem) return a state that is a local maximum
input: problem, a problem
local variables: current, a node.
neighbor, a node.

current ← MAKE-NODE(INITIAL-STATE[problem])
loop do
neighbor ← a highest valued successor of current
if VALUE [neighbor] ≤ VALUE[current] then return STATE[current]
current ← neighbor
min version will reverse inequalities and look for lowest valued successor
Fundamental in AI and ML 213
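The HILL-CLIMBING pseudocode above translates almost line for line into Python. This is a sketch of the max version on a toy one-dimensional objective (the objective and neighbor functions are illustrative, not from the slides):

```python
def hill_climbing(initial, neighbors, value):
    """Greedy local search (max version): move to the best neighbor until
    no neighbor improves on the current state."""
    current = initial
    while True:
        best = max(neighbors(current), key=value, default=current)
        if value(best) <= value(current):
            return current               # local maximum (or plateau edge)
        current = best

# Toy objective: maximize -(x - 3)^2 over the integers, stepping by +/- 1.
value = lambda x: -(x - 3) ** 2
neighbors = lambda x: [x - 1, x + 1]
print(hill_climbing(0, neighbors, value))   # climbs 0 -> 1 -> 2 -> 3
```

The min version only flips the comparison and takes the lowest-valued successor, as the slide notes.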
Hill-climbing Search
• “a loop that continuously moves towards increasing value”
– terminates when a peak is reached
– Aka greedy local search
• Value can be either
– Objective function value
– Heuristic function value (minimized)

• Hill climbing does not look ahead of the immediate neighbors


• Can randomly choose among the set of best successors
– if multiple have the best value
• “climbing Mount Everest in a thick fog with amnesia”
Fundamental in AI and ML 214
Search “Landscape”

Hill climbing gets stuck in local minima, depending on where it starts.
Fundamental in AI and ML 215
Example: n-queens

• Put n queens on an n x n board with no two queens on the same row,


column, or diagonal

• Is it a satisfaction problem or optimization?


Fundamental in AI and ML 216
Hill-climbing search: 8-queens problem

• Need to convert to an optimization problem


• h = number of pairs of queens that are attacking each other
• h = 17 for the above state
Fundamental in AI and ML 217
Search Space

• State
– All 8 queens on the board in some configuration

• Successor function
– move a single queen to another square in the same column.

• Example of a heuristic function h(n):


– the number of pairs of queens that are attacking each other
– (so we want to minimize this)
Fundamental in AI and ML 218
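The heuristic h defined above (pairs of mutually attacking queens) is easy to compute with the one-queen-per-column encoding, where `state[i]` is the row of the queen in column i. A minimal sketch, shown on 4-queens for brevity:

```python
def attacking_pairs(state):
    """h for n-queens: number of pairs of queens attacking each other."""
    n = len(state)
    h = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_row = state[i] == state[j]
            same_diag = abs(state[i] - state[j]) == j - i   # |row diff| == col diff
            if same_row or same_diag:
                h += 1
    return h

print(attacking_pairs((0, 1, 2, 3)))   # all on one diagonal: C(4,2) = 6 pairs
print(attacking_pairs((1, 3, 0, 2)))   # a 4-queens solution: h = 0
```

Column conflicts never occur in this encoding, since each column holds exactly one queen; that is also why the successor function only moves queens within their column.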
Hill-climbing search: 8-queens problem

• Is this a solution?
• What is h?
Fundamental in AI and ML 219
Hill-climbing on 8-queens

• Randomly generated 8-queens starting states…


• 14% of the time it solves the problem
• 86% of the time it gets stuck at a local minimum

• However…
– Takes only 4 steps on average when it succeeds
– And 3 on average when it gets stuck
– (for a state space with 8^8 =~17 million states)
Fundamental in AI and ML 220
Hill Climbing Drawbacks

• Local maxima

• Plateaus

• Diagonal ridges
Escaping Shoulders: Sideways Move
• If no downhill (uphill) moves, allow sideways moves in hope that
algorithm can escape
– Need to place a limit on the possible number of sideways moves to
avoid infinite loops

• For 8-queens
– Now allow sideways moves with a limit of 100
– Raises percentage of problem instances solved from 14% to 94%
– However…
• 21 steps on average for every successful solution
• 64 on average for each failure
Fundamental in AI and ML 222
Tabu Search

• Prevent returning quickly to the same state


• Keep fixed length queue (“tabu list”)
• Add most recent state to queue; drop oldest
• Never make the step that is currently tabu’ed

Properties:
– As the size of the tabu list grows, hill-climbing will asymptotically
become “non-redundant” (won’t look at the same state twice)
– In practice, a reasonable sized tabu list (say 100 or so) improves the
performance of hill climbing in many problems
Fundamental in AI and ML 223
Escaping Shoulders/local Optima Enforced Hill
Climbing
• Perform breadth first search from a local optima
– to find the next state with better h function

• Typically,
– prolonged periods of exhaustive search
– bridged by relatively quick periods of hill-climbing

• Middle ground b/w local and systematic search


Fundamental in AI and ML 224
Hill-climbing: stochastic variations

• Stochastic hill-climbing
– Random selection among the uphill moves.
– The selection probability can vary with the steepness of the uphill
move.

• To avoid getting stuck in local minima


– Random-walk hill-climbing
– Random-restart hill-climbing
– Hill-climbing with both

Fundamental in AI and ML 225


Hill Climbing: stochastic variations

When the state-space landscape has local minima, any search that moves only
in the greedy direction cannot be complete

Random walk, on the other hand, is asymptotically complete

Idea: Put random walk into greedy hill-climbing

Fundamental in AI and ML 226


Hill-climbing with random restarts

• If at first you don’t succeed, try, try again!


• Different variations
– For each restart: run until termination vs. run for a fixed time
– Run a fixed number of restarts or run indefinitely
• Analysis
– Say each search has probability p of success
• E.g., for 8-queens, p = 0.14 with no sideways moves
– Expected number of restarts?
– Expected number of steps taken?
• If you want to pick one local search algorithm, learn this one!!
Fundamental in AI and ML 227
Hill-climbing with random walk

• At each step do one of the two


– Greedy: With prob p move to the neighbor with largest value
– Random: With prob 1-p move to a random neighbor

Fundamental in AI and ML 228


Hill-climbing with both
• At each step do one of the three
– Greedy: move to the neighbor with largest value
– Random Walk: move to a random neighbor
– Random Restart: Resample a new current state

Fundamental in AI and ML 229


Simulated Annealing
• Simulated Annealing = physics inspired twist on random walk
• Basic ideas:
– like hill-climbing identify the quality of the local improvements
– instead of picking the best move, pick one randomly
– say the change in objective function is δ
– if δ is positive, then move to that state
– otherwise:
• move to this state with probability e^(δ/T), which shrinks as δ becomes more negative
• thus: worse moves (very large negative δ) are executed less often
– however, there is always a chance of escaping from local maxima
– over time, make it less likely to accept locally bad moves
– (Can also make the size of the move random as well, i.e., allow “large”
steps in state space)
Fundamental in AI and ML 230
Physical Interpretation of Simulated Annealing
• A Physical Analogy:
• imagine letting a ball roll downhill on the function surface
– this is like hill-climbing (for minimization)
• now imagine shaking the surface, while the ball rolls, gradually reducing the
amount of shaking
– this is like simulated annealing

• Annealing = physical process of cooling a liquid or metal until particles


achieve a certain frozen crystal state
• simulated annealing:
– free variables are like particles
– seek “low energy” (high quality) configuration
Fundamental in AI and ML 231
– slowly reducing temp. T with particles moving around randomly
Simulated annealing
function SIMULATED-ANNEALING( problem, schedule) return a solution state
   input: problem, a problem
          schedule, a mapping from time to temperature
   local variables: current, a node.
                    next, a node.
                    T, a “temperature” controlling the prob. of downward steps

   current ← MAKE-NODE(INITIAL-STATE[problem])
   for t ← 1 to ∞ do
       T ← schedule[t]
       if T = 0 then return current
       next ← a randomly selected successor of current
       ∆E ← VALUE[next] - VALUE[current]
       if ∆E > 0 then current ← next
       else current ← next only with probability e^(∆E/T)
Fundamental in AI and ML 232
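The pseudocode above can be sketched directly in Python. The cooling schedule, neighbor function, and objective here are illustrative choices (a linear schedule on the same toy objective used earlier), not part of the algorithm itself.

```python
import math
import random

def simulated_annealing(initial, neighbor, value, schedule):
    """SA per the pseudocode: always accept uphill moves; accept downhill
    moves with probability e^(dE/T); stop when the temperature reaches 0."""
    current = initial
    t = 1
    while True:
        T = schedule(t)
        if T <= 0:
            return current
        nxt = neighbor(current)
        dE = value(nxt) - value(current)
        if dE > 0 or random.random() < math.exp(dE / T):
            current = nxt                 # uphill always; downhill sometimes
        t += 1

# Toy run: maximize -(x - 3)^2 with random +/- 1 steps, linear cooling.
random.seed(0)
value = lambda x: -(x - 3) ** 2
neighbor = lambda x: x + random.choice([-1, 1])
result = simulated_annealing(0, neighbor, value,
                             schedule=lambda t: max(0, 2 - 0.002 * t))
print(result)
```

Early on (high T) the run behaves like a random walk; as T falls it becomes greedy, which is the ball-on-a-shaken-surface picture from the previous slide.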
Temperature T

• high T: probability of “locally bad” move is higher


• low T: probability of “locally bad” move is lower
• typically, T is decreased as the algorithm runs longer
• i.e., there is a “temperature schedule”

Fundamental in AI and ML 233


Simulated Annealing in Practice

– method proposed in 1983 by IBM researchers for solving VLSI layout


problems (Kirkpatrick et al, Science, 220:671-680, 1983).
• theoretically will always find the global optimum

– Other applications: Traveling salesman, Graph partitioning, Graph


coloring, Scheduling, Facility Layout, Image Processing, …

– useful for some problems, but can be very slow


• slowness comes about because T must be decreased very gradually
to retain optimality
Fundamental in AI and ML 234
Local beam search
• Idea: Keeping only one node in memory is an extreme reaction to memory
problems.

• Keep track of k states instead of one


– Initially: k randomly selected states
– Next: determine all successors of k states
– If any of successors is goal → finished
– Else select k best from successors and repeat

Fundamental in AI and ML 235


Local Beam Search (contd.)

• Not the same as k random-start searches run in parallel!


• Searches that find good states recruit other searches to join them
• Problem: quite often, all k states end up on same local hill
• Idea: Stochastic beam search
– Choose k successors randomly, biased towards good ones
• Observe the close analogy to natural selection!

Fundamental in AI and ML 236


Hey! Perhaps sex
can improve
search?

Fundamental in AI and ML 237


Sure! Check out
ye book.

Fundamental in AI and ML 238


Genetic algorithms
• Twist on Local Search: successor is generated by combining two parent states

• A state is represented as a string over a finite alphabet (e.g. binary)


– 8-queens
– State = position of 8 queens each in a column

• Start with k randomly generated states (population)

• Evaluation function (fitness function):


– Higher values for better states.
– Opposite to heuristic function, e.g., # non-attacking pairs in 8-queens

• Produce the next generation of states by “simulated evolution”


– Random selection
– Crossover
– Random mutation Fundamental in AI and ML 239
Genetic algorithms
[Board figure: rows 1–8] String representation of an 8-queens state, one
digit per column giving each queen's row: 16257483

Can we evolve 8-queens through genetic algorithms?


Fundamental in AI and ML 240
Genetic algorithms

[Figure: GA pipeline] 4 states for the 8-queens problem → 2 pairs of states
randomly selected based on fitness (random crossover points selected) →
new states after crossover → random mutation applied
• Fitness function: number of non-attacking pairs of queens (min = 0, max = 8 × 7/2 = 28)
• 24/(24+23+20+11) = 31%
• 23/(24+23+20+11) = 29%, etc.
Fundamental in AI and ML 241
241
Genetic algorithms
• State = a string over a finite alphabet (an individual)
– A successor state is generated by combining two parent states

• Start with k randomly generated states (population)

• Evaluation function (fitness function).


– Higher values for better states.

• Select individuals for next generation based on fitness


– P(indiv. in next gen) = indiv. fitness / total population fitness

• Crossover: fit parents to yield next generation (offspring)


• Mutate the offspring randomly with some low probability
Fundamental in AI and ML 242
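The loop described above (fitness-proportional selection, one-point crossover, low-probability mutation) can be sketched for 8-queens as follows. The population size, generation count, and 0.1 mutation rate are illustrative choices, not values from the slides, and whether a run reaches a perfect state depends on the random seed.

```python
import random

def non_attacking_pairs(state):
    """Fitness for 8-queens: non-attacking pairs (max = 8*7/2 = 28)."""
    n = len(state)
    attacking = sum(1 for i in range(n) for j in range(i + 1, n)
                    if state[i] == state[j]
                    or abs(state[i] - state[j]) == j - i)
    return n * (n - 1) // 2 - attacking

def mutate(state):
    """Move one queen to a random row in its column."""
    s = list(state)
    s[random.randrange(len(s))] = random.randrange(len(s))
    return tuple(s)

def genetic_algorithm(pop_size=20, generations=500):
    population = [tuple(random.randrange(8) for _ in range(8))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Assumes at least one individual has positive fitness (virtually
        # always true for random 8-queens states).
        weights = [non_attacking_pairs(ind) for ind in population]
        new_population = []
        for _ in range(pop_size):
            x, y = random.choices(population, weights=weights, k=2)  # selection
            point = random.randrange(1, 8)                           # crossover
            child = x[:point] + y[point:]
            if random.random() < 0.1:                                # mutation
                child = mutate(child)
            new_population.append(child)
        population = new_population
        best = max(population, key=non_attacking_pairs)
        if non_attacking_pairs(best) == 28:      # all pairs non-attacking
            return best
    return max(population, key=non_attacking_pairs)

random.seed(1)
best = genetic_algorithm()
print(best, non_attacking_pairs(best))
```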
Steps in Genetic algorithms

Fundamental in AI and ML 243


Steps in Genetic algorithms

Fundamental in AI and ML 244


Steps in Genetic algorithms

Fundamental in AI and ML 245


Steps in Genetic algorithms

Fundamental in AI and ML 246


Steps in Genetic algorithms

Fundamental in AI and ML 247


Genetic algorithms

• Fitness function: number of non-attacking pairs of queens (min = 0,
max = 8 × 7/2 = 28)
• 24/(24+23+20+11) = 31%
• 23/(24+23+20+11) = 29%, etc.
Fundamental in AI and ML 248

fitness = #non-attacking queen pairs
probability of being in next generation = fitness/(Σ_i fitness_i)

How to convert a fitness value into a probability of being in the next generation:
• Fitness function: #non-attacking queen pairs (min = 0, max = 8 × 7/2 = 28)
• Σ_i fitness_i = 24+23+20+11 = 78
• P(pick child_1 for next gen.) = fitness_1/(Σ_i fitness_i) = 24/78 = 31%
• P(pick child_2 for next gen.) = fitness_2/(Σ_i fitness_i) = 23/78 = 29%; etc.
Fundamental in AI and ML 249
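The fitness-to-probability conversion above is a two-line computation; this snippet reproduces the slide's numbers (31% and 29% for the first two states):

```python
fitnesses = [24, 23, 20, 11]            # non-attacking pairs for the 4 states
total = sum(fitnesses)                  # 78
probs = [round(100 * f / total) for f in fitnesses]
print(probs)   # [31, 29, 26, 14]
```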
Genetic algorithms

Has the effect of “jumping” to a completely different new part of the search space (quite
non-local)

Fundamental in AI and ML 250


Comments on Genetic Algorithms

Genetic algorithm is a variant of “stochastic beam search”


• Positive points
– Random exploration can find solutions that local search can’t
• (via crossover primarily)
– Appealing connection to human evolution
• “neural” networks, and “genetic” algorithms are metaphors!

• Negative points
– Large number of “tunable” parameters
• Difficult to replicate performance from one problem to another
– Lack of good empirical studies comparing to simpler methods
– Useful on some (small?) set of problems, but no convincing evidence that GAs are
better than hill-climbing w/ random restarts in general
Fundamental in AI and ML 251
MODULE – 3 MACHINE LEARNING

Fundamental in AI and ML 252


Areas of math essential to machine learning
Machine learning is part of both statistics and computer science
• Probability
• Statistical inference
• Validation
• Estimates of error, confidence intervals
Linear algebra
• Hugely useful for compact representation of linear transformations on data
• Dimensionality reduction techniques
Optimization theory

Fundamental in AI and ML 253


Why worry about the math?
• There are lots of easy-to-use machine learning packages out there.
• After this course, you will know how to apply several of the most
general-purpose algorithms.

HOWEVER
• To get really useful results, you need good mathematical intuitions about
certain general machine learning principles, as well as the inner workings of
the individual algorithms.

Fundamental in AI and ML 254


Why worry about the math?
These intuitions will allow you to:
• Choose the right algorithm(s) for the problem
• Make good choices on parameter settings, validation strategies
• Recognize over- or underfitting
• Troubleshoot poor / ambiguous results
• Put appropriate bounds of confidence / uncertainty on results
• Do a better job of coding algorithms or incorporating them into more complex
analysis pipelines

Fundamental in AI and ML 255


Notation
a∈A set membership: a is member of set A
|B| cardinality: number of items in set B
|| v || norm: length of vector v
∑ summation
∫ integral
ℜ the set of real numbers
ℜn real number space of dimension n
n = 2 : plane or 2-space
n = 3 : 3- (dimensional) space
n > 3 : n-space or hyperspace

Fundamental in AI and ML 256


Notation
• x, y, z, vector (bold, lower case)
u, v
• A, B, X matrix (bold, upper case)
• y = f( x ) function (map): assigns unique value in
range of y to each value in domain of x
• dy / dx derivative of y with respect to single
variable x
• y = f( x ) function on multiple variables, i.e. a
vector of variables; function in n-space
• ∂y / ∂xi partial derivative of y with respect to
element i of vector x
Fundamental in AI and ML 257
The concept of probability
Intuition:

In some process, several outcomes are possible.

When the process is repeated a large number of times, each outcome occurs
with a characteristic relative frequency, or probability.

If a particular outcome happens more often than another outcome, we say it


is more probable.

Fundamental in AI and ML 258


The concept of probability
Arises in two contexts:
In actual repeated experiments.
Example: You record the color of 1000 cars driving by. 57 of them are
green. You estimate the probability of a car being green as 57 / 1000 =
0.057.

In idealized conceptions of a repeated process.


Example: You consider the behavior of an unbiased six-sided die. The
expected probability of rolling a 5 is 1 / 6 = 0.1667.

Example: You need a model for how people’s heights are distributed. You
choose a normal distribution (bell-shaped curve) to represent the expected
relative probabilities.
Fundamental in AI and ML 259
Probability spaces
A probability space is a random process or experiment with three components:
I. Ω, the set of possible outcomes O
• number of possible outcomes = | Ω | = N

II. F, the set of possible events E


• an event comprises 0 to N outcomes
• number of possible events = | F | = 2^N

III. P, the probability distribution


• function mapping each outcome and event to real number between 0 and 1 (the probability of O or E)
• probability of an event is sum of probabilities of possible outcomes in event

Fundamental in AI and ML 260


Axioms of probability

1. Non-negativity:
for any event E ∈ F, p( E ) ≥ 0

2. All possible outcomes:


p( Ω ) = 1

3. Additivity of disjoint events:


for all events E, E’ ∈ F where E ∩ E’ = ∅,
p( E U E’ ) = p( E ) + p( E’ )
Fundamental in AI and ML 261
Types of probability spaces

Define | Ω | = number of possible outcomes

• Discrete space | Ω | is finite


• Analysis involves summations ( ∑ )

• Continuous space | Ω | is infinite


• Analysis involves integrals ( ∫ )

Fundamental in AI and ML 262


Example of discrete probability space
Single roll of a six-sided die
• 6 possible outcomes: O = 1, 2, 3, 4, 5, or 6
• 2^6 = 64 possible events

• example: E = ( O ∈ { 1, 3, 5 } ), i.e. outcome is odd

• If die is fair, then probabilities of outcomes are equal


p( 1 ) = p( 2 ) = p( 3 ) =
p( 4 ) = p( 5 ) = p( 6 ) = 1 / 6

• example: probability of event E = ( outcome is odd ) is


p( 1 ) + p( 3 ) + p( 5 ) = 1 / 2
Fundamental in AI and ML 263
Example of discrete probability space
Three consecutive flips of a coin
• 8 possible outcomes: O = HHH, HHT, HTH, HTT, THH, THT, TTH, TTT
• 2^8 = 256 possible events

• example: E = ( O ∈ { HHT, HTH, THH } ), i.e. exactly two flips are heads
• example: E = ( O ∈ { THT, TTT } ), i.e. the first and third flips are tails

• If coin is fair, then probabilities of outcomes are equal


p( HHH ) = p( HHT ) = p( HTH ) = p( HTT ) =
p( THH ) = p( THT ) = p( TTH ) = p( TTT ) = 1 / 8

• example: probability of event E = ( exactly two heads ) is


p( HHT ) + p( HTH ) + p( THH ) = 3 / 8
Fundamental in AI and ML 264
Example of continuous probability space

Height of a randomly chosen American male


• Infinite number of possible outcomes: O has some single value in range 2 feet to 8 feet
• Infinite number of possible events
• example: E = ( O | O < 5.5 feet ), i.e. individual chosen is less than 5.5 feet tall
• Probabilities of outcomes are not equal, and are described by a continuous function, p( O )

Fundamental in AI and ML 265


Example of continuous probability space
Height of a randomly chosen American male
• Probabilities of outcomes O are not equal, and are described by a continuous function, p(O)
• p( O ) is a relative, not an absolute probability
• p( O ) for any particular O is zero
• ∫ p( O ) from O = - ∞ to ∞ (i.e. area under curve) is 1
• example: p( O = 5’8” ) > p( O = 6’2” )
• example: p( O < 5’6” ) = (∫ p( O ) from O = - ∞ to 5’6” ) ≈ 0.25

Fundamental in AI and ML 266


Probability distributions
• Discrete: probability mass function (pmf)

example:
sum of two
fair dice

• Continuous: probability density function (pdf)

example:
waiting time between
eruptions of Old Faithful
(minutes)

Fundamental in AI and ML 267


Random variables
• A random variable X is a function that associates a number x with each outcome O of a
process
• Common notation: X( O ) = x, or just X = x
• Basically a way to redefine (usually simplify) a probability space to a new probability space
• X must obey axioms of probability (over the possible values of x)
• X can be discrete or continuous
• Example: X = number of heads in three flips of a coin
• Possible values of X are 0, 1, 2, 3
• p( X = 0 ) = p( X = 3 ) = 1 / 8 p( X = 1 ) = p( X = 2 ) = 3 / 8
• Size of space (number of “outcomes”) reduced from 8 to 4
• Example: X = average height of five randomly chosen American men
• Size of space unchanged (X can range from 2 feet to 8 feet), but pdf of X different than for single man
Fundamental in AI and ML 268
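The coin-flip random variable above can be checked by brute-force enumeration (an illustrative sketch, not from the slides):

```python
# Brute-force check of X = number of heads in three fair coin flips.
from itertools import product

outcomes = list(product("HT", repeat=3))        # the 8 equally likely outcomes
pmf = {}
for outcome in outcomes:
    x = outcome.count("H")                      # value of the random variable
    pmf[x] = pmf.get(x, 0) + 1 / len(outcomes)  # each outcome has p = 1/8

# pmf gives p(X=0) = p(X=3) = 1/8 and p(X=1) = p(X=2) = 3/8;
# the space shrinks from 8 outcomes to 4 values
```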
Multivariate probability distributions
• Scenario
• Several random processes occur (doesn’t matter whether in parallel or in
sequence)
• Want to know probabilities for each possible combination of outcomes

• Can describe as joint probability of several random variables


• Example: two processes whose outcomes are represented by random
variables X and Y. Probability that process X has outcome x and process Y
has outcome y is denoted as:

p( X = x, Y = y )
Fundamental in AI and ML 269
Example of multivariate distribution

Fundamental in AI and ML 270


Multivariate probability distributions

• Marginal probability
• Probability distribution of a single variable in a joint distribution
• Example: two random variables X and Y:
p( X = x ) = ∑b=all values of Y p( X = x, Y = b )

• Conditional probability
• Probability distribution of one variable given that another variable takes a certain value
• Example: two random variables X and Y:
p( X = x | Y = y ) = p( X = x, Y = y ) / p( Y = y )

Fundamental in AI and ML 271
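Both formulas can be exercised on the minivan column quoted on the surrounding slides (a sketch; the rest of the joint table is not reproduced here):

```python
# The X = minivan column of the joint distribution, as quoted on the slides.
joint = {("minivan", "American"): 0.0741,
         ("minivan", "Asian"):    0.1111,
         ("minivan", "European"): 0.1481}

# marginal probability: sum the joint over all values of Y
p_minivan = sum(p for (x, y), p in joint.items() if x == "minivan")

# conditional probability: joint divided by marginal
p_euro_given_minivan = joint[("minivan", "European")] / p_minivan
# p_minivan is about 0.3333, p_euro_given_minivan about 0.4443
```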


Example of marginal probability
marginal probability: p( X = minivan ) = 0.0741 + 0.1111 + 0.1481 = 0.3333

Fundamental in AI and ML 272


Example of conditional probability
Conditional probability: p( Y = European | X = minivan ) = 0.1481 / ( 0.0741 + 0.1111 + 0.1481 ) = 0.1481 / 0.3333 = 0.4443

[Figure: 3-D bar chart of the joint distribution — probability (0 to 0.2) on the vertical axis, X = model type (sedan, minivan, SUV, sport), Y = manufacturer (American, Asian, European)]
Fundamental in AI and ML 273
Continuous multivariate distribution

Same concepts of joint, marginal, and


conditional probabilities apply
(except use integrals)

Example: three-component Gaussian


mixture in two dimensions

Fundamental in AI and ML 274


Expected value
Given:
• A discrete random variable X, with possible values x = x1, x2, … xn
• Probabilities p( X = xi ) that X takes on the various values of xi
• A function yi = f( xi ) defined on X

The expected value of f is the probability-weighted “average” value of f( xi ):


E( f ) = ∑i p( xi ) ⋅ f( xi )

Fundamental in AI and ML 275


Example of expected value
• Process: game where one card is drawn from the deck
• If face card, dealer pays you $10
• If not a face card, you pay dealer $4
• Random variable X = { face card, not face card }
• p( face card ) = 3/13
• p( not face card ) = 10/13
• Function f( X ) is payout to you
• f( face card ) = 10
• f( not face card ) = -4
• Expected value of payout is:
E( f ) = ∑i p( xi ) ⋅ f( xi ) = (3/13) ⋅ 10 + (10/13) ⋅ (-4) = -10/13 ≈ -0.77
Fundamental in AI and ML 276
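The same expectation, computed directly (an illustrative sketch):

```python
# Expected payout of the card game, computed from the definition.
p = {"face card": 3 / 13, "not face card": 10 / 13}
f = {"face card": 10, "not face card": -4}      # payout to you

expected_payout = sum(p[x] * f[x] for x in p)   # -10/13, about -0.77
```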
Expected value in continuous spaces

E( f ) = ∫x = a → b p( x ) ⋅ f( x ) dx

Fundamental in AI and ML 277


Common forms of expected value (1)
• Mean (μ)
f( xi ) = xi ⇒ μ = E( f ) = ∑i p( xi ) ⋅ xi

• Average value of X = xi, taking into account probability of the various xi


• Most common measure of “center” of a distribution

• Compare to formula for mean of an actual sample

Fundamental in AI and ML 278


Common forms of expected value (2)
• Variance (σ²)

f( xi ) = ( xi - μ )² ⇒ σ² = ∑i p( xi ) ⋅ ( xi - μ )²
• Average value of squared deviation of X = xi from mean μ, taking into account probability
of the various xi
• Most common measure of “spread” of a distribution
• σ is the standard deviation

• Compare to formula for variance of an actual sample

Fundamental in AI and ML 279


Common forms of expected value (3)
Covariance
f( xi ) = ( xi - μx ), g( yi ) = ( yi - μy ) ⇒
cov( x, y ) = ∑i p( xi , yi ) ⋅ ( xi - μx ) ⋅ ( yi - μy )
• Measures tendency for x and y to deviate from their means in same (or opposite)
directions at same time

[Figure: scatterplots showing high (positive) covariance vs. no covariance]
• Compare to formula for covariance of actual samples

Fundamental in AI and ML 280


Correlation
• Pearson’s correlation coefficient is covariance normalized by the standard deviations of the two
variables

• Always lies in range -1 to 1


• Only reflects linear dependence between variables
[Figures: linear dependence with noise; linear dependence without noise; various nonlinear dependencies]


Fundamental in AI and ML 281
Complement rule
Given: event A, which can occur or not
p( not A ) = 1 - p( A )

A not A

areas represent relative probabilities


Fundamental in AI and ML 282
Product rule
Given: events A and B, which can co-occur (or not)
p( A, B ) = p( A | B ) ⋅ p( B )
(same expression given previously to define conditional probability)

(not A, not B)
A ( A, B ) B

Ω
(A, not B) (not A, B)

areas represent relative probabilities
Fundamental in AI and ML 283


Example of product rule

• Probability that a man has white hair (event A) and is over 65 (event B)
• p( B ) = 0.18
• p( A | B ) = 0.78
• p( A, B ) = p( A | B ) ⋅ p( B ) =
0.78 ⋅ 0.18 =
0.14

Fundamental in AI and ML 284


Rule of total probability
Given: events A and B, which can co-occur (or not)
p( A ) = p( A, B ) + p( A, not B )
(same expression given previously to define marginal probability)

(not A, not B)
A ( A, B ) B

Ω
(A, not B) (not A, B)

areas represent relative probabilities


Fundamental in AI and ML 285
Independence
Given: events A and B, which can co-occur (or not)

p( A | B ) = p( A ) or p( A, B ) = p( A ) ⋅ p( B )
Ω

(not A, not B) (not A, B)

B
(A, not B) A ( A, B )

areas represent relative probabilities


Fundamental in AI and ML 286
Examples of independence / dependence
• Independence:
• Outcomes on multiple rolls of a die
• Outcomes on multiple flips of a coin
• Height of two unrelated individuals
• Probability of getting a king on successive draws from a deck, if card from
each draw is replaced

• Dependence:
• Height of two related individuals
• Duration of successive eruptions of Old Faithful
• Probability of getting a king on successive draws from a deck, if card from each draw is not replaced
Fundamental in AI and ML 287
Example of independence vs. dependence
• Independence: All manufacturers have identical product mix. p( X = x | Y = y ) =
p( X = x ).
• Dependence: American manufacturers love SUVs, Europeans manufacturers
don’t.

Fundamental in AI and ML 288


Bayes rule
Posterior probability ∝ likelihood × prior probability

p( B | A ) = p( A | B ) ⋅ p( B ) / p( A )

(not A, not B)
A ( A, B ) B

Ω
(A, not B) (not A, B)

Fundamental in AI and ML 289


Example of Bayes rule
Marie is getting married tomorrow at an outdoor ceremony in the desert. In recent
years, it has rained only 5 days each year. Unfortunately, the weatherman is
forecasting rain for tomorrow. When it actually rains, the weatherman has
forecast rain 90% of the time. When it doesn't rain, he has forecast rain 10% of
the time. What is the probability it will rain on the day of Marie's wedding?
Event A: The weatherman has forecast rain.
Event B: It rains.
We know:
• p( B ) = 5 / 365 = 0.0137 [ It rains 5 days out of the year. ]
• p( not B ) = 360 / 365 = 0.9863
• p( A | B ) = 0.9 [ When it rains, the weatherman has forecast rain 90% of the time. ]
• p( A | not B ) = 0.1 [When it does not rain, the weatherman has forecast rain 10% of the
time.]
Fundamental in AI and ML 290
Example of Bayes rule, cont’d.
We want to know p( B | A ), the probability it will rain on the day of Marie's
wedding, given a forecast for rain by the weatherman.

The answer can be determined from Bayes rule:


1. p( B | A ) = p( A | B ) p( B ) / p( A )
2. p( A ) = p( A | B ) p( B ) + p( A | not B ) p( not B ) =
(0.9)(0.014) + (0.1)(0.986) = 0.111
3. p( B | A ) = (0.9)(0.0137) / 0.111 = 0.111

The result seems unintuitive but is correct. Even when the weatherman predicts rain, it rains only about 11% of the time.
Despite the weatherman's gloomy prediction, it is unlikely Marie will get rained on at her wedding.
Fundamental in AI and ML 291
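The whole calculation can be reproduced in a few lines (an illustrative sketch):

```python
# Bayes rule for the wedding-forecast problem.
p_rain = 5 / 365                 # prior: it rains 5 days a year
p_forecast_given_rain = 0.9      # likelihoods of a rain forecast
p_forecast_given_dry = 0.1

# rule of total probability: overall chance of a rain forecast
p_forecast = (p_forecast_given_rain * p_rain
              + p_forecast_given_dry * (1 - p_rain))

# posterior: chance of rain, given the forecast (about 0.111)
p_rain_given_forecast = p_forecast_given_rain * p_rain / p_forecast
```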
Probabilities: when to add, when to multiply

• ADD: When you want to allow for occurrence of any of several possible
outcomes of a single process. Comparable to logical OR.

• MULTIPLY: When you want to allow for simultaneous occurrence of particular


outcomes from more than one process. Comparable to logical AND.
• But only if the processes are independent.

Fundamental in AI and ML 292


Linear algebra applications

1) Operations on or between vectors and matrices


2) Coordinate transformations
3) Dimensionality reduction
4) Linear regression
5) Solution of linear systems of equations
6) Many others

Applications 1) – 4) are directly relevant to this course. Today we’ll start


with 1).
Fundamental in AI and ML 293
Why vectors and matrices?
vector
• Most common form of data organization
for machine learning is a 2D array, where
• rows represent samples (records, items,
datapoints)
• columns represent attributes (features, variables)

• Natural to think of each sample as a vector


of attributes, and whole array as a matrix

Matrix
Fundamental in AI and ML 294
Vectors
Definition: an n-tuple of values (usually real numbers).
• n referred to as the dimension of the vector
• n can be any positive integer, from 1 to infinity

Can be written in column form or row form


• Column form is conventional
• Vector elements referenced by subscript

Fundamental in AI and ML 295


Vectors
• Can think of a vector as:
• a point in space or
• a directed line segment with a magnitude and direction

Fundamental in AI and ML 296


Vector arithmetic
• Addition of two vectors
• add corresponding elements

• result is a vector

• Scalar multiplication of a vector


• multiply each element by scalar

• result is a vector
Fundamental in AI and ML 297
Vector arithmetic
• Dot product of two vectors
• multiply corresponding elements, then add products
• result is a scalar
• Dot product alternative form: x ⋅ y = || x || ⋅ || y || ⋅ cos(θ)

[Figure: vectors x and y with angle θ between them]

Fundamental in AI and ML 298


Matrices
• Definition: an m x n two-dimensional array of values (usually real numbers).
• m rows
• n columns
• Matrix referenced by two-element subscript
• first element in
subscript is row
• second element in
subscript is column
• example: A24 or a24 is element in second row, fourth column of A

Fundamental in AI and ML 299


Matrices
• A vector can be regarded as special case of a matrix, where one of matrix
dimensions = 1.
• Matrix transpose (denoted T)
• swap columns and rows
• row 1 becomes column 1, etc.
• m x n matrix becomes n x m matrix
• example:

Fundamental in AI and ML 300


Matrix arithmetic
Addition of two matrices
• matrices must be same size
• add corresponding elements:
cij = aij + bij
• result is a matrix of same size

Scalar multiplication of a matrix


• multiply each element by scalar:
bij = d ⋅ aij
• result is a matrix of same size

Fundamental in AI and ML 301


Matrix arithmetic
• Matrix-matrix multiplication
• vector-matrix multiplication just a special case

TO THE BOARD!!

• Multiplication is associative
A⋅(B⋅C)=(A⋅B)⋅C
• Multiplication is not commutative
A ⋅ B ≠ B ⋅ A (generally)
• Transposition rule:
( A ⋅ B )^T = B^T ⋅ A^T

Fundamental in AI and ML 302


Matrix arithmetic
• RULE: In any chain of matrix multiplications, the column dimension of one
matrix in the chain must match the row dimension of the following matrix in the
chain.
• Examples
A (3×5), B (5×5), C (3×1)

Right:
A ⋅ B ⋅ A^T    C^T ⋅ A ⋅ B    A^T ⋅ A ⋅ B    C ⋅ C^T ⋅ A

Wrong:
A ⋅ B ⋅ A    C ⋅ A ⋅ B    A ⋅ A^T ⋅ B    C^T ⋅ C ⋅ A

Fundamental in AI and ML 303
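The rule can be checked mechanically by walking the chain of shapes (an illustrative sketch; `chain_ok` is a hypothetical helper, not part of the course material):

```python
# A sketch of the dimension rule: a chain multiplies only if each matrix's
# column count matches the next matrix's row count.
def chain_ok(*shapes):
    """shapes: (rows, cols) pairs, in multiplication order."""
    return all(left[1] == right[0] for left, right in zip(shapes, shapes[1:]))

def T(shape):
    # transposing swaps the two dimensions
    return (shape[1], shape[0])

A, B, C = (3, 5), (5, 5), (3, 1)

right = [chain_ok(A, B, T(A)), chain_ok(T(C), A, B),
         chain_ok(T(A), A, B), chain_ok(C, T(C), A)]
wrong = [chain_ok(A, B, A), chain_ok(C, A, B),
         chain_ok(A, T(A), B), chain_ok(T(C), C, A)]
```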


Vector projection
Orthogonal projection of y onto x
• Can take place in any space of dimensionality ≥ 2
• Unit vector in direction of x is x / || x ||
• Length of projection of y in direction of x is || y || ⋅ cos(θ)
• Orthogonal projection of y onto x is the vector
projx( y ) = x ⋅ || y || ⋅ cos(θ) / || x || = [ ( x ⋅ y ) / || x ||² ] x (using dot product alternative form)

[Figure: vectors x and y with angle θ, and projx( y ) along x]

Fundamental in AI and ML 304
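The projection formula can be sketched directly from the dot-product form (illustrative helper functions, not course code):

```python
# Orthogonal projection of y onto x via the dot-product form.
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def project(y, x):
    """proj_x(y) = [ (x . y) / ||x||^2 ] * x"""
    scale = dot(x, y) / dot(x, x)
    return [scale * xi for xi in x]

y = [3.0, 4.0]
x = [1.0, 0.0]
p = project(y, x)                               # component of y along x
residual = [yi - pi for yi, pi in zip(y, p)]    # the part orthogonal to x
```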


Optimization theory topics

• Maximum likelihood
• Expectation maximization
• Gradient descent

Fundamental in AI and ML 305


Convex Optimization

Fundamental in AI and ML 306


General Optimization

Constraints do not need to be linear

Fundamental in AI and ML 307


Example

Fundamental in AI and ML 308


Example

Fundamental in AI and ML 309


Convex Optimization

Fundamental in AI and ML 310


Convex Optimization

Fundamental in AI and ML 311


Local and Global Optima

Fundamental in AI and ML 312


Local and Global Optima

Every local optimum of a convex optimization problem is a global optimum: suppose, for contradiction, that 𝑥 is a local minimum, 𝑦 is a global minimum, and 𝑓(𝑥) > 𝑓(𝑦)

Fundamental in AI and ML 313




Local Optima

Fundamental in AI and ML 316


Linear Programming Problems

Fundamental in AI and ML 317


Quadratic Programming Problems

Fundamental in AI and ML 318


Least Squares Regression

Fundamental in AI and ML 319


Least Squares Regression

Fundamental in AI and ML 320


Least Squares Regression

Fundamental in AI and ML 321


Projections

Fundamental in AI and ML 322


Projections

Fundamental in AI and ML 323


Projections

Fundamental in AI and ML 324


Projections

Fundamental in AI and ML 325


Smallest Enclosing Ball

Fundamental in AI and ML 326


Introduction to Statistical Inference

Fundamental in AI and ML 327


Statistical inference - Concepts
Statistical inference is the act of generalizing from a sample to a population with a calculated degree of certainty.

We want to learn about population parameters…

…but we can only calculate sample statistics


Fundamental in AI and ML 328
Parameters and Statistics

It is essential that we draw distinctions between parameters and statistics

Fundamental in AI and ML 329


Parameters and Statistics
We are going to illustrate inferential concepts by considering how well a given sample mean “x-bar” reflects an underlying population mean μ

Fundamental in AI and ML 330


Precision and reliability

• How precisely does a given sample mean (x-bar) reflect underlying population
mean (μ)? How reliable are our inferences?
• To answer these questions, we consider a simulation experiment in which we take all possible samples of size n from the population

Fundamental in AI and ML 331


Simulation Experiment

• Population (Figure A, next slide)


N = 10,000
Lognormal shape (positive skew)
μ = 173
σ = 30
• Take repeated SRSs, each of n = 10
• Calculate x-bar in each sample
• Plot x-bars (Figure B , next slide)

Fundamental in AI and ML 332


Simulation Experiment
A. Population (individual values) B. Sampling distribution of
x-bars

Fundamental in AI and ML 333


Simulation Experiment Results
1. Distribution B is more Normal than
distribution A ⇔ Central Limit
Theorem
2. Both distributions centered on µ ⇔
x-bar is unbiased estimator of μ
3. Distribution B is skinnier than
distribution A ⇔ related to “square
root law”

Fundamental in AI and ML 334
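A small-scale version of the simulation experiment, using hypothetical lognormal parameters chosen so the population has roughly μ = 173 and σ = 30 (the slides' exact population is not reproduced):

```python
# Small-scale version of the simulation experiment.
import random
import statistics

random.seed(0)
# lognormal population with mean ~173 and sd ~30 (parameters are assumptions)
population = [random.lognormvariate(5.14, 0.172) for _ in range(10_000)]

xbars = []
for _ in range(2_000):
    sample = random.sample(population, 10)      # one SRS of n = 10
    xbars.append(statistics.mean(sample))       # one point of distribution B

# xbars center on the population mean (unbiasedness) and are much less
# spread out than individual values (square root law: sigma / sqrt(10))
```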


Reiteration of Key Findings

• Finding 1 (central limit theorem): the sampling distribution of x-bar tends


toward Normality even when the population is not Normal (esp. strong in large
samples).
• Finding 2 (unbiasedness): the expected value of x-bar is μ
• Finding 3 is related to the square root law, which says:

Fundamental in AI and ML 335


Standard Deviation of the Mean

• The standard deviation of the sampling distribution of the mean has a special
name: standard error of the mean (denoted σxbar or SExbar)

• The square root law says:

Fundamental in AI and ML 336


Square Root Law Example: σ = 15

Quadrupling the sample size cuts the standard error of the mean in half

For n = 1 ⇒ SE = 15 / √1 = 15

For n = 4 ⇒ SE = 15 / √4 = 7.5

For n = 16 ⇒ SE = 15 / √16 = 3.75

Fundamental in AI and ML 337


Putting it together:
x ~ N(µ, SE)

• The sampling distribution of x-bar tends to be Normal with mean µ and σxbar = σ / √n
• Example: Let X represent Weschler Adult Intelligence Scores; X ~ N(100, 15).
▪ Take an SRS of n = 10
▪ σxbar = σ / √n = 15/√10 = 4.7
▪ Thus, xbar ~ N(100, 4.7)

Fundamental in AI and ML 338


Individual WAIS (population) and mean WAIS
when n = 10

Fundamental in AI and ML 339


68-95-99.7 rule applied to the SDM
▪ We’ve established xbar ~ N(100, 4.7).
Therefore,

• 68% of x-bars within


µ ± σxbar
= 100 ± 4.7
= 95.3 to 104.7

• 95% of x-bars within


µ ± 2 · σxbar
= 100 ± (2·4.7)
= 90.6 to 109.4
Fundamental in AI and ML 340
Law of Large Numbers
As a sample gets larger and larger, x-bar approaches μ. The figure demonstrates results from an experiment done in a population with μ = 173.3

Fundamental in AI and ML 341


Sampling Behavior of Counts and Proportions

Recall: a binomial random variable represents the random number of successes in n independent Bernoulli trials, each with probability of success p; notation X~b(n,p)

X~b(10,0.2) is shown on the next slide. Note that


μ=2

Re-express the count of successes as the proportion p-hat = x / n. For this re-expression,


μ = 0.2

Fundamental in AI and ML 342


Sampling Behavior of Counts and Proportions

Fundamental in AI and ML 343


Normal Approximation to the Binomial (“npq
rule”)

• When n is large, the binomial distribution approximates a Normal distribution


(“the Normal Approximation”)

• How large does the sample have to be to apply the Normal approximation? ⇒
One rule says that the Normal approximation applies when npq ≥ 5

Fundamental in AI and ML 344


Top figure: X~b(10,0.2)
npq = 10 · 0.2 · (1 – 0.2) = 1.6 ⇒ Normal approximation does not apply

Bottom figure: X~b(100,0.2)
npq = 100 · 0.2 · (1 − 0.2) = 16 ⇒ Normal approximation applies
Fundamental in AI and ML 345
Normal Approximation for a Binomial Count

When Normal approximation applies:

Fundamental in AI and ML 346


Normal Approximation for a Binomial Proportion

Fundamental in AI and ML 347


“p-hat” represents the sample proportion

Fundamental in AI and ML 348


Illustrative Example: Normal Approximation to
the Binomial
• Suppose the prevalence of a risk factor in a population is 20%
• Take an SRS of n = 100 from population
• The number of cases in the sample will follow a binomial distribution with n = 100 and p = 0.2

Fundamental in AI and ML 349


Illustrative Example: Normal Approximation to
the Binomial
• The Normal approximation for the count is:

• The Normal approximation for the proportion is:

Fundamental in AI and ML 350


Illustrative Example: Normal Approximation to
the Binomial
• Statement of problem: Recall X ~ N(20, 4). Suppose we observe 30 cases in a sample. What is the probability of observing at least 30 cases under these circumstances, i.e., Pr(X ≥ 30) = ?

• Standardize: z = (30 – 20) / 4 = 2.5

• Sketch: next slide

• Table B: Pr(Z ≥ 2.5) = 0.0062

Fundamental in AI and ML 351
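The z-score and tail probability can be computed without a table lookup using the error function (a sketch; `math.erf` stands in for Table B):

```python
# Normal approximation to Pr(X >= 30) for X ~ b(100, 0.2).
import math

def normal_cdf(z):
    """Standard Normal cumulative distribution via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p = 100, 0.2
mu = n * p                        # 20
sd = math.sqrt(n * p * (1 - p))   # sqrt(16) = 4

z = (30 - mu) / sd                # 2.5
p_at_least_30 = 1 - normal_cdf(z) # about 0.0062, matching Table B
```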


Illustrative Example: Normal Approximation to
the Binomial
• Binomial and superimposed Normal distributions

This model suggests 0.0062 of samples will see 30 or more cases.

Fundamental in AI and ML 352


Module 4

Fundamental in AI and ML 353


What is Machine Learning?

• Machine Learning is the study of methods for programming computers to learn.

• Building machines that automatically learn from experience.

• Machine learning usually refers to the changes in systems that perform tasks associated with artificial intelligence (AI). Such tasks involve recognition, diagnosis, planning, robot control, prediction, etc.

Fundamental in AI and ML 354


What is Machine Learning?

[Diagram: TRAINING DATA → Learning algorithm → Trained machine; a Query is put to the trained machine, which returns an Answer]
Fundamental in AI and ML 355
Steps in machine learning
1. Data collection.
2. Representation.
3. Modeling.
4. Estimation.
5. Validation.
6. Apply learned model to new “test” data

Fundamental in AI and ML 356


General structure of a learning system
[Diagram of a learning system: Data enters the Learning Process, guided by feed-back from a Teacher; the Problem Solving component produces Results, which go to Performance Evaluation]

Fundamental in AI and ML 357


Advantages of ML

1) Solving vision problems through statistical inference.

2) Intelligence from common-sense AI.

3) Reducing the constraints over time, achieving complete autonomy.

Fundamental in AI and ML 358


Disadvantages of ML
1. Application specific algorithms.

2. Real world problems have too many variables and sensors might be too noisy.

3. Computational complexity.

Fundamental in AI and ML 359


Types of machine Learning

1) Unsupervised Learning.

2) Semi-Supervised (reinforcement).

3) Supervised Learning.

Fundamental in AI and ML 360


Unsupervised Learning
1. Studies how input patterns can be represented to reflect the statistical structure
of the overall collection of input patterns
2. No outputs are used (unlike supervised learning and reinforcement learning)
3. Learner is provided only unlabeled data.
4. No feedback is provided from the environment

Fundamental in AI and ML 361


Unsupervised Learning

Advantage
• Most of the laws of science were developed through unsupervised learning.

Disadvantage
• The identification of the features itself is a complex problem in many situations.

Fundamental in AI and ML 362


Semi-Supervised (reinforcement)

• It lies in between supervised and unsupervised learning techniques in the amount of labeled and unlabeled data required for training.

• The goal is to reduce the amount of supervision required compared to supervised learning.

• At the same time, it improves the results of unsupervised clustering toward the expectations of the user.

Fundamental in AI and ML 363


Semi-Supervised (reinforcement)

•Semi-supervised learning is an area of increasing importance in Machine Learning.


•Automatic methods of collecting data make it more important than ever to develop
methods to make use of unlabeled data.

Fundamental in AI and ML 364


Supervised Learning
1) Analogical Learning.

2) Learning by Decision Tree.

Fundamental in AI and ML 365


Analogical Learning

In inductive learning, the learner is given instances of a problem and has to form a concept that supports most of the positive and no negative instances. This demonstrates that a number of training instances are required to form a concept in inductive learning.

Unlike this, analogical learning can be accomplished from a single example. For instance, given the following training instance, one has to determine the plural form of “bacillus”.

Fundamental in AI and ML 366


Analogical Learning

Fundamental in AI and ML 367


The main steps in analogical learning are now
formalized below.

1.Identifying Analogy: Identify the similarity between an experienced problem


instance and a new problem.

2. Determining the Mapping Function: Relevant parts of the experienced problem


are selected and the mapping is determined.

3. Apply Mapping Function: Apply the mapping function to transform the new
problem from the given domain to the target domain.

Fundamental in AI and ML 368


The main steps in analogical learning are now
formalized below.

4. Validation: The newly constructed solution is validated for its applicability through trial processes like theorem proving or simulation.

5. Learning: If the validation is found to work well, the new knowledge is encoded
and saved for future usage.

Fundamental in AI and ML 369


Analogical Learning

Fundamental in AI and ML 370


Learning by Decision Tree
A decision tree receives a set of attributes (or properties) of the objects as inputs
and yields a binary decision of true or false values as output. Decision trees, thus,
generally represent Boolean functions. Besides a range of {0,1} other non-binary
ranges of outputs are also allowed.

However, for the sake of simplicity, we presume the restriction to Boolean outputs.
Each node in a decision tree represents ‘a test of some attribute of the instance, and
each branch descending from that node corresponds to one of the possible values
for this attribute’.
Fundamental in AI and ML 371
Learning by Decision Tree
To illustrate the contribution of a decision tree, we consider a set of instances, some
of which result in a true value for the decision. Those instances are called positive
instances. On the other hand, when the resulting decision is false, we call the
instance ‘a negative instance’. We now consider the learning problem of a bird’s
flying. Suppose a child sees different instances of birds as tabulated below.

Fundamental in AI and ML 372


Learning by Decision Tree

Fundamental in AI and ML 373


Decision Tree example

Fundamental in AI and ML 374


Decision Tree example

Fundamental in AI and ML 375


Applications of Machine Learning – Drug
Discovery

Fundamental in AI and ML 376


Medical diagnosis

Photo MRI CT
Fundamental in AI and ML 377
Iris verification

Fundamental in AI and ML 378


Hand-written digits

Fundamental in AI and ML 379


Radar Imaging

Fundamental in AI and ML 380


Speech Recognition

Fundamental in AI and ML 381


Finger print

Fundamental in AI and ML 382


Signature Verification

Fundamental in AI and ML 383


Face Recognition

Fundamental in AI and ML 384


Target Recognition

Fundamental in AI and ML 385


Robotics vision

Fundamental in AI and ML 386


Traffic Monitoring

Fundamental in AI and ML 387


Introduction to Classification

Fundamental in AI and ML 388


Classification learning

Training phase: learning the classifier from the available labeled data (the ‘training set’)
Testing phase: applying the learned classifier to unseen data

Fundamental in AI and ML 389


Generating datasets
Methods:
• Holdout (2/3rd training, 1/3rd testing)
• Cross validation (n – fold)
• Divide into n parts
• Train on (n-1), test on last
• Repeat for different combinations
• Bootstrapping
• Select random samples (with replacement) to form the training set

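As a concrete sketch of the n-fold cross-validation procedure above (an illustrative implementation; the function name and the seeding are my own, not from the slides):

```python
import random

def n_fold_splits(n_samples, n_folds, seed=0):
    """Partition sample indices into n_folds parts; each part serves once
    as the test set while the remaining (n-1) parts form the training set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# Every index appears exactly once as test data across the n folds.
splits = list(n_fold_splits(12, 3))
```

Each (train, test) pair is disjoint, and the union of the test parts covers the whole dataset, which is exactly the n-fold property described above.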


Evaluating classifiers
Outcome:
•Accuracy
•Confusion matrix
•If cost-sensitive, the expected cost of classification (attribute test cost + misclassification cost), etc.



Decision Trees - Example
Example algorithms: ID3, C4.5, SPRINT, CART

Intermediate nodes: attributes
Edges: attribute-value tests
Leaf nodes: class predictions



Decision Tree schematic

Training data set with attributes a1 a2 a3 a4 a5 a6 and class labels X, Y, Z.
At an impure node, select the best attribute and continue splitting; a pure node becomes a leaf node labelled with its class (e.g., RED).
Decision Tree Issues
How to avoid overfitting?
Problem: Classifier performs well on training data, but fails
to give good results on test data

Example: splitting on a primary key gives pure nodes and good accuracy on the training data, but not on the test data
Alternatives:
• Pre-prune : Halting construction at a certain level of tree /
level of purity
• Post-prune : Remove a node if the error rate remains
the same without it. Repeat process for all nodes in the decision tree
Decision Tree Issues
How does the type of attribute affect the split?

• Discrete-valued: each branch corresponds to one value of the attribute

• Continuous-valued: each branch may cover a range of values
(e.g., splits may be age < 30, 30 ≤ age ≤ 50, age > 50)
(split points are chosen to maximize the gain / gain ratio)



Decision Tree Issues
How to determine the attribute for split?
Alternatives:
Information Gain

Gain(A, S) = Entropy(S) − Σj ( (|Sj| / |S|) · Entropy(Sj) )

Other options:
Gain ratio, etc.

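The information-gain formula above can be computed directly; a minimal sketch (the function names and the toy data are my own illustrations):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain(A, S) = Entropy(S) - sum_j (|S_j|/|S|) * Entropy(S_j),
    where S_j groups the instances by their value of attribute A."""
    n = len(labels)
    groups = {}
    for lbl, val in zip(labels, attribute_values):
        groups.setdefault(val, []).append(lbl)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = ['yes', 'yes', 'no', 'no']
perfect = information_gain(labels, ['a', 'a', 'b', 'b'])  # fully predictive
useless = information_gain(labels, ['a', 'b', 'a', 'b'])  # uninformative
```

A perfectly predictive attribute recovers the full entropy of the labels (1 bit here), while an uninformative attribute has zero gain, which is why gain ranks attributes for the split.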


Lazy learners
‘Lazy’: Do not create a model of the training instances in advance

When an instance arrives for testing, runs the algorithm to get the class prediction

Example: the K-nearest neighbour (K-NN) classifier

“One is known by the company one keeps”



K-NN classifier schematic
For a test instance,
1) Calculate distances from all training points
2) Find the K nearest neighbours (say, K = 3)
3) Assign the class label based on a majority vote

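The three steps above can be sketched as follows (an illustrative implementation, not from the slides; the toy training set is my own):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, test_point, k=3):
    """1) distances to all training points, 2) take the K nearest,
    3) majority vote over their class labels."""
    neighbours = sorted(train, key=lambda xy: dist(xy[0], test_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0, 0), 'red'), ((1, 0), 'red'), ((0, 1), 'red'),
         ((5, 5), 'blue'), ((6, 5), 'blue'), ((5, 6), 'blue')]
near_red = knn_predict(train, (0.5, 0.5), k=3)
near_blue = knn_predict(train, (5.5, 5.5), k=3)
```

"One is known by the company one keeps": the prediction is simply the label of the local neighbourhood.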


K-NN classifier Issues
How to determine distances between values of categorical attributes?

Alternatives:
1. Boolean distance (1 if same, 0 if different)

2. Differential grading (e.g., for weather, ‘drizzling’ and ‘rainy’ are closer than ‘rainy’ and ‘sunny’)

How to make real-valued prediction?

Alternative:
1. Average the values returned by K-nearest neighbours



K-NN classifier Issues
How to determine value of K?
Alternatives:
1. Determine K experimentally. The K that gives minimum
error is selected.

Any other modifications?
Alternatives:
1. Weight the attributes (or neighbours) when deciding the final label
2. Assign the maximum possible distance to missing values
3. K = 1 returns the class label of the nearest neighbour

How good is it?
• Susceptible to noisy values
• Slow because of the distance calculations
Alternate approaches:
• Compute distances to representative points only
• Partial distance
Decision Lists
• A sequence of Boolean functions that lead to a result

f(y) = c_j, if j = min { i | h_i(y) = 1 } exists;
f(y) = 0 otherwise



Decision List example

A test instance passes through a sequence of (h_i, c_i) units; the first unit whose test h_i fires outputs its class label c_i.


Decision List learning
From the set of candidate feature functions, select h_k, the feature with the highest utility
U_i = max { |P_i| − p_n · |N_i| , |N_i| − p_p · |P_i| }
where Q_i = P_i ∪ N_i is the set of examples covered by h_i (h_i = 1).
Append (h_k, c) to R, with c = 1 if |P_k| − p_n · |N_k| > |N_k| − p_p · |P_k|, else c = 0.
Then remove the covered examples, S’ = S − Q_k, and repeat.



Decision list Issues
What is the terminating condition?
1. Size of R (an upper threshold)
2. Qk = null
3. S’ contains examples of same class

Accuracy / Complexity tradeoff?


Size of R : Complexity (Length of the list)
S’ contains examples of both classes : Accuracy (Purity)

Pruning?
h_i is not required if:
1. c_i = c_(r+1)
2. There is no h_j (j > i) such that Q_i = Q_j



Probabilistic classifiers : Naïve Bayes
Based on Bayes rule
Naïve Bayes : Conditional independence assumption



Naïve Bayes Issues
Problems due to sparsity of data?
Problem : Probabilities for some values may be zero
Solution : Laplace smoothing

How are different types of attributes handled?


1. Discrete-valued : P ( X | Ci ) is according to formula
2. Continuous-valued : Assume a Gaussian distribution; plug in the mean and variance for the attribute to obtain P ( X | Ci )

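A minimal sketch of Naïve Bayes with Laplace smoothing for discrete attributes (the function names and the toy weather data are my own illustrations, not from the slides):

```python
from collections import Counter, defaultdict

def train_nb(data):
    """data: list of (feature_tuple, class_label). Returns class priors and
    per-class, per-position value counts for smoothed probability lookup."""
    priors = Counter(label for _, label in data)
    counts = defaultdict(Counter)  # (class, position) -> value counts
    for features, label in data:
        for pos, val in enumerate(features):
            counts[(label, pos)][val] += 1
    return priors, counts

def predict_nb(priors, counts, features, vocab_sizes):
    """P(C|X) is proportional to P(C) * product of P(x_i|C). Laplace smoothing
    adds 1 to every count, so unseen values get a small non-zero probability."""
    n = sum(priors.values())
    best, best_p = None, -1.0
    for label, prior_count in priors.items():
        p = prior_count / n
        for pos, val in enumerate(features):
            c = counts[(label, pos)][val]
            p *= (c + 1) / (prior_count + vocab_sizes[pos])
        if p > best_p:
            best, best_p = label, p
    return best

data = [(('sunny', 'hot'), 'no'), (('sunny', 'mild'), 'no'),
        (('rain', 'mild'), 'yes'), (('rain', 'cool'), 'yes')]
priors, counts = train_nb(data)
pred = predict_nb(priors, counts, ('rain', 'hot'), vocab_sizes=(2, 3))
```

Without smoothing, the unseen pair ('hot', 'yes') would zero out the whole product; with Laplace smoothing the dominant 'rain' evidence still wins.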


Probabilistic classifiers : BBN
• Bayesian belief networks : Attributes ARE dependent
• A directed acyclic graph and conditional probability tables

An added term captures the conditional probability between attributes:



BBN learning
(when network structure known)

• Input : Network topology of BBN


• Output : Calculate the entries in conditional probability table

(when network structure not known)

• Learn the structure as well (see ‘Learning structure of BBN’)



Learning structure of BBN
Use Naïve Bayes as the basis pattern (e.g., a class node Loan with attribute nodes Age, Marital status and Family status), then add edges as required.
Examples of algorithms: TAN, K2



Artificial Neural Networks
Based on biological concept of neurons
Structure of a fundamental unit of ANN:

inputs x_1 … x_n with weights w_1 … w_n, plus a threshold weight w_0
output: activation function p(v), where
p(v) = sgn(w_0 + w_1 x_1 + … + w_n x_n)
Perceptron learning algorithm
Initialize values of weights
Apply training instances and get output
Update weights according to the update rule: w_i ← w_i + η (t − o) x_i
η : learning rate
t : target output
o : observed output

Repeat till converges

Can represent linearly separable functions only

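The training loop above, with the standard update rule w_i ← w_i + η(t − o)x_i, can be sketched as follows (an illustrative implementation; Boolean AND, a linearly separable function, is my own choice of example):

```python
def perceptron_train(data, lr=0.1, epochs=100):
    """data: list of (inputs, target) with targets in {-1, +1}.
    Applies w_i <- w_i + lr * (t - o) * x_i (w[0] is the threshold weight,
    whose implicit input is 1), repeating until convergence."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        errors = 0
        for x, t in data:
            v = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if v >= 0 else -1  # sgn activation
            if o != t:
                errors += 1
                w[0] += lr * (t - o)
                for i, xi in enumerate(x):
                    w[i + 1] += lr * (t - o) * xi
        if errors == 0:  # converged: every instance classified correctly
            break
    return w

def perceptron_predict(w, x):
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) >= 0 else -1

# Boolean AND encoded with -1/+1 targets (linearly separable).
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = perceptron_train(and_data)
```

Because AND is linearly separable, the loop converges; for XOR it would not, which is the limitation stated above.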


Sigmoid perceptron
Basis for multilayer feedforward networks



Multilayer feedforward networks
Multilayer? Feedforward?

Input layer Hidden layer Output layer



Backpropagation
• Apply training instances as input and produce output
• Update weights in the ‘reverse’ direction as follows:



ANN Issues
Learning the structure of the network

1. Construct a complete network


2. Prune using heuristics:
• Remove edges with weights nearly zero
• Remove edges if the removal does not affect accuracy

What are the types of learning approaches?

Deterministic (batch): update the weights after summing the errors over all examples

Stochastic: update the weights after each example



ANN Issues
Choosing the learning factor
A small learning factor means many iterations are required.

A large learning factor means the learner may skip over the global minimum

Addition of momentum

But why?
Support vector machines
Basic idea: a “maximum-margin” classifier
Separating hyperplane: w·x + b = 0, with margin boundaries at +1 and −1
Support vectors: the training points that lie on the margin boundaries



SVM training

The training objective involves the dot product of x_k and x_l.
The Lagrangian multipliers are zero for data instances other than the support vectors.



Focussing on dot product
• For non-linearly separable points, we map them to a higher-dimensional (and linearly separable) space
• Computing this product can be time-consuming, so we use kernel functions



Kernel functions
Without having to know the non-linear mapping explicitly, apply a kernel function (e.g., a polynomial or RBF kernel).
This reduces the number of computations required to generate the Q_kl values.



Testing SVM

A test instance is fed to the trained SVM, which outputs its class label.



SVM Issues
What if n-classes are to be predicted?
Problem : SVMs deal with two-class classification
Solution : Train multiple SVMs, one per class (one-vs-rest)

SVMs are immune to the removal of non-support-vector points



Combining Classifiers
• ‘Ensemble’ learning
• Use a combination of models for prediction
• Bagging : Majority votes
• Boosting : Attention to the ‘weak’ instances
• Goal : An improved combined model



Bagging
From the total training dataset D, draw samples D_1 … D_n at random (bootstrap sampling with replacement may be used).
Learn a classifier model M_i from each sample, under the same classifier-learning scheme.
For a test instance, the class label is decided by a majority vote over the models M_1 … M_n.


Data preprocessing
• Attribute subset selection
• Select a subset of total attributes to reduce complexity

• Dimensionality reduction
• Transform instances into ‘smaller’ instances



Attribute subset selection
Information gain measure for attribute selection in decision trees

Stepwise forward / backward elimination of attributes



Dimensionality reduction
High dimensionality (the number of attributes of a data instance) means high computational complexity.
For an instance x in p dimensions, s = Wx, where W is a k × p transformation matrix, yields the instance in k dimensions (k < p).



Principal Component Analysis
• Computes k orthonormal vectors: the principal components
• These essentially provide a new set of axes, in decreasing order of variance
• (k × n reduced data) = (k × p eigenvector matrix, whose first k rows are the k PCs) × (p × n data)
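A minimal PCA sketch via eigendecomposition of the covariance matrix (assumes NumPy is available; the function name and the toy data are my own illustrations):

```python
import numpy as np

def pca_transform(X, k):
    """Project an n x p data matrix X onto its top-k principal components.
    The columns of W are orthonormal eigenvectors of the covariance matrix,
    sorted by decreasing eigenvalue (variance)."""
    Xc = X - X.mean(axis=0)                 # centre each attribute
    cov = np.cov(Xc, rowvar=False)          # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort: decreasing variance
    W = eigvecs[:, order[:k]]               # p x k transformation
    return Xc @ W                           # n x k reduced data

# Points lying near the line y = x: one component captures nearly all variance.
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]])
Z = pca_transform(X, 1)
```

The first axis of Z carries almost the entire spread of the data, which is exactly the “new axes in decreasing order of variance” idea above.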


Learning structure of BBN
• K2 Algorithm :
• Consider nodes in an order
• For each node, calculate utility to add an edge from previous nodes to this one

• TAN :
• Use Naïve Bayes as the baseline network
• Add different edges to the network based on utility

• Examples of algorithms: TAN, K2



Delta rule
The delta rule enables convergence to a best fit even when the points are not linearly separable
It uses gradient descent to search the hypothesis space



Clustering
• Document clustering
• Motivations
• Document representations
• Success criteria
• Clustering algorithms
• Partitional
• Hierarchical



What is clustering?
• Clustering: the process of grouping a set of objects into classes of similar objects

• Documents within a cluster should be similar.


• Documents from different clusters should be dissimilar.

• The commonest form of unsupervised learning


• Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of the examples is given
• A common and important task that finds many applications in IR and other places



A data set with clear cluster structure
• How would you design an algorithm
for finding the three clusters in this
case?



Applications of clustering in IR
• Whole corpus analysis/navigation
• Better user interface: search without typing
• For improving recall in search applications
• Better search results (like pseudo RF)
• For better navigation of search results
• Effective “user recall” will be higher
• For speeding up vector space retrieval
• Cluster-based retrieval gives faster search



Yahoo! Hierarchy isn’t clustering but is the kind
of output you want from clustering
www.yahoo.com/Science … (30)
Example subtree: agriculture (dairy, crops, forestry, agronomy), biology (botany, cell, evolution), physics (magnetism, relativity), CS (AI, HCI, courses), space (craft)


Google News: automatic clustering gives an
effective news presentation metaphor



Scatter/Gather: Cutting, Karger, and Pedersen



For visualizing a document collection and its
themes
Wise et al, “Visualizing the non-visual” PNNL
ThemeScapes, Cartia
[Mountain height = cluster size]



For improving search recall
• Cluster hypothesis - Documents in the same cluster behave similarly with respect
to relevance to information needs
• Therefore, to improve search recall:
• Cluster docs in corpus a priori
• When a query matches a doc D, also return other docs in the cluster containing D
• Hope if we do this: the query “car” will also return docs containing automobile
• Because clustering grouped together docs containing car with those containing automobile
• Why might this happen?



yippy.com – grouping search results



Issues for clustering
• Representation for clustering
• Document representation
• Vector space? Normalization?
• Centroids aren’t length normalized
• Need a notion of similarity/distance
• How many clusters?
• Fixed a priori?
• Completely data driven?
• Avoid “trivial” clusters - too large or small
• If a cluster's too large, then for navigation purposes you've wasted an extra user click without whittling down the
set of documents much.



Notion of similarity/distance

•Ideal: semantic similarity.


•Practical: term-statistical similarity
•We will use cosine similarity.
•Docs as vectors.
•For many algorithms, easier to think in terms of a distance (rather
than similarity) between docs.
•We will mostly speak of Euclidean distance
• But real implementations use cosine similarity

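Cosine similarity over term-count vectors can be sketched as follows (an illustrative implementation; the document vectors are my own toy data):

```python
from math import sqrt

def cosine_similarity(a, b):
    """cos(a, b) = a.b / (|a| |b|): 1 for identical directions,
    0 for documents sharing no terms (orthogonal vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Term-count vectors: d1 and d2 share vocabulary, d3 shares none.
d1, d2, d3 = [2, 1, 0], [4, 2, 0], [0, 0, 3]
s12 = cosine_similarity(d1, d2)
s13 = cosine_similarity(d1, d3)
```

Note that d2 is just d1 scaled by 2, so their cosine is exactly 1: cosine compares direction, not length, which is why it suits documents of different sizes.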


Clustering Algorithms

•Flat algorithms
• Usually start with a random (partial) partitioning
• Refine it iteratively
• K means clustering
• (Model based clustering)

•Hierarchical algorithms
• Bottom-up, agglomerative
• (Top-down, divisive)



Hard vs. soft clustering

•Hard clustering: Each document belongs to exactly one cluster


• More common and easier to do
•Soft clustering: A document can belong to more than one cluster.
• Makes more sense for applications like creating browsable hierarchies
• You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
• You can only do that with a soft clustering approach.
•We won’t do soft clustering today. See IIR 16.5, 18



Partitioning Algorithms
• Partitioning method: Construct a partition of n documents into a set of K
clusters
• Given: a set of documents and the number K
• Find: a partition of K clusters that optimizes the chosen partitioning criterion
• Globally optimal solution: would require exhaustively enumerating all partitions, which is intractable for many objective functions
• Effective heuristic methods: K-means and K-medoids algorithms



K-Means

•Assumes documents are real-valued vectors.


•Clusters based on centroids (aka the center of gravity or mean) of points in a
cluster, c:

•Reassignment of instances to clusters is based on distance to the current cluster


centroids.
• (Or one can equivalently phrase it in terms of similarities)



K-Means Algorithm

• Select K random docs {s1, s2,… sK} as seeds.


• Until clustering converges (or other stopping criterion):
• For each doc di:
• Assign di to the cluster cj such that dist(di, sj) is minimal.
• (Next, update the seeds to the centroid of each cluster)
• For each cluster cj
• sj = μ(cj)

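The seed / reassign / recompute loop above can be sketched as follows (a pure-Python illustration, not from the slides; the stopping criterion used here is "centroid positions don't change"):

```python
import random

def kmeans(docs, k, iters=100, seed=1):
    """docs: list of real-valued vectors (tuples). Returns the final
    centroids and the cluster assignment of each doc."""
    def dist2(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def centroid(cluster):  # mean (centre of gravity) of a cluster
        n = len(cluster)
        return tuple(sum(xs) / n for xs in zip(*cluster))

    centroids = list(random.Random(seed).sample(docs, k))  # K random seeds
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for d in docs:  # reassign each doc to its nearest centroid
            j = min(range(k), key=lambda i: dist2(d, centroids[i]))
            clusters[j].append(d)
        new = [centroid(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:  # centroid positions unchanged: converged
            break
        centroids = new
    assign = [min(range(k), key=lambda i: dist2(d, centroids[i])) for d in docs]
    return centroids, assign

docs = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (4.9, 5.1)]
cents, assign = kmeans(docs, 2)
```

With two well-separated groups the loop converges in a handful of iterations to the obvious partition, regardless of which two docs are drawn as seeds.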


K Means Example (K=2)

Pick seeds
Reassign clusters
Compute centroids
Reassign clusters
Compute centroids
Reassign clusters
Converged!



Termination conditions
• Several possibilities, e.g.,
• A fixed number of iterations.
• Doc partition unchanged.
• Centroid positions don’t change.

Does this mean that the docs in a cluster are unchanged?



Convergence

•Why should the K-means algorithm ever reach a fixed point?


• A state in which clusters don’t change.
•K-means is a special case of a general procedure known as the Expectation
Maximization (EM) algorithm.
• EM is known to converge.
• Number of iterations could be large.
• But in practice usually isn’t



Convergence of K-Means

•Define goodness measure of cluster k as sum of squared distances from


cluster centroid:
• Gk = Σi (di − ck)² (sum over all di in cluster k)
• G = Σk Gk
•Reassignment monotonically decreases G since each vector is assigned to the
closest centroid.



Convergence of K-Means
• Recomputation monotonically decreases each Gk since (mk is number of members
in cluster k):
• Σ (di − a)² reaches its minimum for:
• Σ −2(di − a) = 0
• Σ di = Σ a
• mk · a = Σ di
• a = (1/mk) Σ di = ck
• K-means typically converges quickly



Time Complexity
• Computing distance between two docs is O(M) where M is the dimensionality of
the vectors.
• Reassigning clusters: O(KN) distance computations, or O(KNM).
• Computing centroids: Each doc gets added once to some centroid: O(NM).
• Assume these two steps are each done once for I iterations: O(IKNM).



Seed Choice
• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
• Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
• Try out multiple starting points
• Initialize with the results of another method
Example showing sensitivity to seeds: with points {A, …, F}, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.



K-means issues, variations, etc.

•Recomputing the centroid after every assignment (rather than after all points are
re-assigned) can improve speed of convergence of K-means
•Assumes clusters are spherical in vector space
• Sensitive to coordinate changes, weighting etc.
•Disjoint and exhaustive
• Doesn’t have a notion of “outliers” by default
• But can add outlier filtering

Dhillon et al. ICDM 2002 – variation to fix some issues with small document clusters



How Many Clusters?
• Number of clusters K is given
• Partition n docs into predetermined number of clusters
• Finding the “right” number of clusters is part of the problem
• Given docs, partition into an “appropriate” number of subsets.
• E.g., for query results - ideal value of K not known up front - though UI may
impose limits.
• Can usually take an algorithm for one flavor and convert to the other.



K not specified in advance

• Say, the results of a query.


• Solve an optimization problem: penalize having lots of clusters
• application dependent, e.g., compressed summary of search results list.
• Tradeoff between having more clusters (better focus within each cluster)
and having too many clusters



K not specified in advance

•Given a clustering, define the Benefit for a doc to be the cosine similarity to
its centroid
•Define the Total Benefit to be the sum of the individual doc Benefits.

Why is there always a clustering of Total Benefit n?



Penalize lots of clusters

•For each cluster, we have a Cost C.


•Thus for a clustering with K clusters, the Total Cost is KC.
•Define the Value of a clustering to be =
Total Benefit - Total Cost.
•Find the clustering of highest value, over all choices of K.
• Total benefit increases with increasing K. But can stop when it doesn’t increase by “much”. The Cost term
enforces this.



Hierarchical Clustering

•Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.

animal → vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean)

•One approach: recursive application of a partitional clustering algorithm.



Dendrogram: Hierarchical Clustering
Clustering obtained by cutting the
dendrogram at a desired level: each
connected component forms a cluster.



Hierarchical Agglomerative Clustering (HAC)

•Starts with each doc in a separate cluster


•then repeatedly joins the closest pair of clusters, until there is only
one cluster.
•The history of merging forms a binary tree or hierarchy.

Note: the resulting clusters are still “hard” and induce a partition



Closest pair of clusters

•Many variants to defining closest pair of clusters


•Single-link
• Similarity of the most cosine-similar (single-link)
•Complete-link
• Similarity of the “furthest” points, the least cosine-similar
•Centroid
• Clusters whose centroids (centers of gravity) are the most cosine-similar
•Average-link
• Average cosine between pairs of elements



Single Link Agglomerative Clustering
• Use maximum similarity of pairs: sim(ci, cj) = max over x ∈ ci, y ∈ cj of sim(x, y)

• Can result in “straggly” (long and thin) clusters due to a chaining effect.
• After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
sim((ci ∪ cj), ck) = max( sim(ci, ck), sim(cj, ck) )



Single Link Example



Complete Link
Use minimum similarity of pairs: sim(ci, cj) = min over x ∈ ci, y ∈ cj of sim(x, y)

Makes “tighter,” spherical clusters that are typically preferable.

After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
sim((ci ∪ cj), ck) = min( sim(ci, ck), sim(cj, ck) )



Complete Link Example



Computational Complexity

•In the first iteration, all HAC methods need to compute the similarity of all pairs of N initial instances, which is O(N²).
•In each of the subsequent N−2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
•In order to maintain an overall O(N²) performance, computing similarity to each other cluster must be done in constant time.
• Often O(N³) if done naively, or O(N² log N) if done more cleverly



Group Average
•Similarity of two clusters = average similarity of all pairs within the merged cluster.

•Compromise between single and complete link.


•Two options:
• Averaged across all ordered pairs in the merged cluster
• Averaged over all pairs between the two original clusters
•No clear difference in efficacy



Computing Group Average Similarity
• Always maintain sum of vectors in each cluster.

• Compute similarity of clusters in constant time:



What Is A Good Clustering?

•Internal criterion: A good clustering will produce high quality clusters in


which:
• the intra-class (that is, intra-cluster) similarity is high
• the inter-class similarity is low
• The measured quality of a clustering depends on both the document
representation and the similarity measure used



External criteria for clustering quality

•Quality measured by its ability to discover some or all of the hidden patterns or
latent classes in gold standard data
•Assesses a clustering with respect to ground truth … requires labeled data
•Assume documents with C gold standard classes, while our clustering algorithms
produce K clusters, ω1, ω2, …, ωK with ni members.



External Evaluation of Cluster Quality

• Simple measure: purity, the ratio between the dominant class in the cluster πi and
the size of cluster ωi

• Biased because having n clusters (one document per cluster) maximizes purity


• Others are entropy of classes in clusters (or mutual information between classes and
clusters)



Purity example

Three clusters of labelled points (majority class counts 5, 4 and 3):

Cluster I: Purity = 1/6 · max(5, 1, 0) = 5/6

Cluster II: Purity = 1/6 · max(1, 4, 1) = 4/6

Cluster III: Purity = 1/5 · max(2, 0, 3) = 3/5
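The purity computation in the example can be sketched as follows (an illustrative implementation; the label encoding is my own):

```python
from collections import Counter

def purity(clusters):
    """clusters: list of lists of gold-standard class labels.
    Per-cluster purity = (count of dominant class) / cluster size;
    overall purity weights each cluster by its size."""
    total = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / total

# Dominant class counts 5, 4 and 3, matching the three-cluster example.
clusters = [['x'] * 5 + ['o'],          # Cluster I  : purity 5/6
            ['o'] * 4 + ['x', 'd'],     # Cluster II : purity 4/6
            ['d'] * 3 + ['x'] * 2]      # Cluster III: purity 3/5
overall = purity(clusters)
```

The overall purity is (5 + 4 + 3) / 17 = 12/17, which illustrates the bias noted above: splitting every document into its own cluster would drive this ratio to 1.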
The Rand Index measures agreement between pair decisions.
Here RI = 0.68



Rand index and Cluster F-measure

Compare with standard Precision and Recall:

People also define and use a cluster F-measure, which is probably a better measure.



Final word and resources
• In clustering, clusters are inferred from the data without human input
(unsupervised learning)
• However, in practice, it’s a bit less clear: there are many ways of influencing the
outcome of clustering: number of clusters, similarity measure, representation of
documents, . . .

• Resources
• IIR 16 except 16.5
• IIR 17.1–17.3



Continuous outcome (means)



Recall: Covariance



Interpreting Covariance

cov(X,Y) > 0 : X and Y are positively correlated

cov(X,Y) < 0 : X and Y are inversely correlated
cov(X,Y) = 0 : X and Y are uncorrelated (independence implies zero covariance, but zero covariance does not imply independence)



Correlation coefficient
Pearson’s Correlation Coefficient is standardized covariance (unitless):



Correlation

• Measures the relative strength of the linear relationship between two variables
• Unit-less
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker any linear relationship

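Pearson's r as standardised covariance can be computed directly; an illustrative sketch (the function name and toy data are my own):

```python
from math import sqrt

def pearson_r(xs, ys):
    """r = cov(X, Y) / (sd(X) * sd(Y)): unit-less, always in [-1, 1].
    The (n - 1) factors cancel, so sums of squares suffice."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

xs = [1, 2, 3, 4, 5]
r_pos = pearson_r(xs, [2, 4, 6, 8, 10])   # exact positive linear relation
r_neg = pearson_r(xs, [10, 8, 6, 4, 2])   # exact negative linear relation
```

An exact positive linear relationship gives r = 1 and an exact negative one gives r = -1, matching the interpretation bullets above.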


Scatter Plots of Data with Various Correlation
Coefficients
[Scatter plots illustrating correlation coefficients r = −1, r = −0.6, r = 0, r = +1, r = +0.3 and r = 0]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall


Linear Correlation
[Panels contrasting linear relationships with curvilinear relationships]


Linear Correlation
[Panels contrasting strong relationships with weak relationships]



Linear Correlation

[Panel showing no relationship]
Calculating by hand…



Simpler calculation formula…

r = SS_xy / √(SS_xx · SS_yy), where SS_xy is the numerator of the covariance and SS_xx, SS_yy are the numerators of the variances.


Distribution of the correlation coefficient:
The sample correlation coefficient follows a T-distribution with n-2 degrees of
freedom (since you have to estimate the standard error).

*Note: like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself, so substitute in the estimated r.



Continuous outcome (means)



Linear regression
In correlation, the two variables are treated as equals. In regression, one variable is
considered independent (=predictor) variable (X) and the other the dependent
(=outcome) variable Y.



What is “Linear”?
• Remember this:
• Y=mX+B?



What’s Slope?

A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.



Prediction

If you know something about X, this knowledge helps you predict something about
Y. (Sound familiar?…sound like conditional probabilities?)



Regression equation…

Expected value of y at a given level of x: E(y|x) = α + β·x



Predicted value for an individual…

y_i = α + β·x_i + random error_i
(α + β·x_i is fixed, exactly on the line; the random error follows a normal distribution)



Assumptions (or the fine print)
• Linear regression assumes that…
• 1. The relationship between X and Y is linear
• 2. Y is distributed normally at each value of X
• 3. The variance of Y at every value of X is the same (homogeneity of variances)
• 4. The observations are independent



Regression Picture
R² = SS_reg / SS_total
Least-squares estimation gives the line (β) that minimizes C², the squared distances of observations from the line.
SS_total = SS_reg + SS_residual, where:
• SS_total (A²): total squared distance of observations from the naïve mean of y (total variation)
• SS_reg (B²): distance from the regression line to the naïve mean of y (variability due to x, the regression)
• SS_residual (C²): variance around the regression line (additional variability not explained by x, which the least-squares method aims to minimize)
Recall example: cognitive function and vitamin D

•Hypothetical data loosely based on [1]; cross-sectional study of 100 middle-aged


and older European men.
• Cognitive function is measured by the Digit Symbol Substitution Test (DSST).

1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.
Distribution of vitamin D



Distribution of DSST
• Normally distributed
• Mean = 28 points
• Standard deviation = 10 points



Four hypothetical datasets

• I generated four hypothetical datasets, with increasing TRUE slopes (between vit D
and DSST):

• 0
• 0.5 points per 10 nmol/L
• 1.0 points per 10 nmol/L
• 1.5 points per 10 nmol/L



Dataset 1: no relationship



Dataset 2: weak relationship



Dataset 3: weak to moderate relationship



Dataset 4: moderate relationship



The “Best fit” line
• Regression equation:
• E(Yi) = 28 + 0·vit Di (in 10 nmol/L)



The “Best fit” line
Note how the line is a little
deceptive; it draws your eye,
making the relationship appear
stronger than it really is!

•Regression equation:
• E(Yi) = 26 + 0.5·vit Di (in 10 nmol/L)



The “Best fit” line
Regression equation:
E(Yi) = 22 + 1.0·vit Di (in 10 nmol/L)



The “Best fit” line
Regression equation:
E(Yi) = 20 + 1.5·vit Di (in 10 nmol/L)

Note: all the lines go through the point (63, 28)!



Estimating the intercept and slope: least squares
estimation



Resulting formulas…

Slope (beta coefficient) = SS_xy / SS_xx = cov(x, y) / var(x)

Intercept = ȳ − β̂ · x̄

The regression line always goes through the point (x̄, ȳ).
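The slope and intercept formulas can be computed directly; an illustrative sketch (the function name and the noise-free toy data are my own):

```python
def least_squares(xs, ys):
    """Slope = cov(x, y) / var(x) = SS_xy / SS_xx;
    intercept = mean(y) - slope * mean(x). The fitted line therefore
    always passes through (mean(x), mean(y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope  # (intercept, slope)

# Data generated from y = 3 + 2x with no noise is recovered exactly.
a, b = least_squares([0, 1, 2, 3], [3, 5, 7, 9])
```

With noise-free data the estimates equal the true intercept and slope; with noisy data they are the values that minimize the sum of squared residuals.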


Relationship with correlation

In correlation, the two variables are treated as equals. In regression, one variable is
considered independent (=predictor) variable (X) and the other the dependent
(=outcome) variable Y.



Example: dataset 4
• SDx = 33 nmol/L
• SDy= 10 points
• Cov(X,Y) = 163 points*nmol/L
• Beta = 163/33² ≈ 0.15 points per nmol/L
• = 1.5 points per 10 nmol/L
• r = 163/(10*33) = 0.49
Or
• r = 0.15 * (33/10) = 0.49
Significance testing…

Slope
Distribution of the estimated slope: β̂ ~ T(n−2)(β, s.e.(β̂))

H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (a linear relationship does exist)

T(n−2) = β̂ / s.e.(β̂)


Formula for the standard error of beta (you will
not have to calculate by hand!):



Example: dataset 4
• Standard error (beta) = 0.03
• T98 = 0.15/0.03 = 5, p<.0001

• 95% Confidence interval = 0.09 to 0.21



Residual Analysis: check assumptions

•The residual for observation i, ei, is the difference between its observed and predicted value
•Check the assumptions of regression by examining the residuals
• Examine for linearity assumption
• Examine for constant variance for all levels of X (homoscedasticity)
• Evaluate normal distribution assumption
• Evaluate independence assumption
•Graphical Analysis of Residuals
• Can plot residuals vs. X



Predicted values…

For vitamin D = 95 nmol/L (i.e., 9.5 in 10-nmol/L units): ŷ = 20 + 1.5 × 9.5 = 34.25 points



Residual = observed - predicted



Residual Analysis for Linearity

[Residual-vs-x plots: a curved pattern in the residuals indicates a non-linear relationship; a patternless horizontal band indicates linearity]
Residual Analysis for Homoscedasticity

[Residual plots: a fan-shaped spread indicates non-constant variance; an even band indicates constant variance (homoscedasticity)]



Residual Analysis for Independence

[Residual plots: a systematic pattern over X indicates the observations are not independent; random scatter indicates independence]


Residual plot, dataset 4



Multiple linear regression…

•What if age is a confounder here?


• Older men have lower vitamin D
• Older men have poorer cognition
•“Adjust” for age by putting age in the model:
• DSST score = intercept + slope1 × vitamin D + slope2 × age

Fundamental in AI and ML 525


2 predictors: age and vit D…

Fundamental in AI and ML 526


Different 3D view…

Fundamental in AI and ML 527


Fit a plane rather than a line…

On the plane, the slope for vitamin D is the same at every age; thus, the slope for vitamin D represents the effect of vitamin D when age is held constant.

Fundamental in AI and ML 528


Equation of the “Best fit” plane…
DSST score = 53 + 0.0039 × vitamin D (in 10 nmol/L) − 0.46 × age (in years)

P-value for vitamin D >>.05


P-value for age <.0001

Thus, relationship with vitamin D was due to confounding by age!

Fundamental in AI and ML 529


Multiple Linear Regression

•More than one predictor…

E(y) = α + β1*X + β2*W + β3*Z + …

Each regression coefficient is the amount of change in the outcome variable that
would be expected per one-unit change of the predictor, if all other variables in the
model were held constant.
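A minimal numpy sketch of "adjusting" for a confounder. The data here are synthetic (hypothetical numbers, built so that age drives both vitamin D and DSST, mirroring the slide's confounding story):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(60, 90, n)
vitd = 120 - 0.8 * age + rng.normal(0, 5, n)   # older men -> lower vitamin D
dsst = 80 - 0.5 * age + rng.normal(0, 3, n)    # older men -> poorer cognition

# Design matrix: intercept, vitamin D, age
X = np.column_stack([np.ones(n), vitd, age])
coef, *_ = np.linalg.lstsq(X, dsst, rcond=None)
intercept, b_vitd, b_age = coef

# With age in the model, the vitamin D slope should be near zero:
print(round(b_vitd, 2), round(b_age, 2))
```

Dropping the age column from `X` would make the vitamin D coefficient appear strongly positive, which is exactly the confounded result the slide warns about.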

Fundamental in AI and ML 530


Functions of multivariate analysis:
• Control for confounders
• Test for interactions between predictors (effect modification)
• Improve predictions

Fundamental in AI and ML 531


A t-test is linear regression!

•Divide vitamin D into two groups:


• Insufficient vitamin D (<50 nmol/L)
• Sufficient vitamin D (>=50 nmol/L), reference group
•We can evaluate these data with a t-test or a linear regression…

Fundamental in AI and ML 532


As a linear regression…

The intercept represents the mean value in the sufficient (reference) group. The slope represents the difference in means between the groups; the difference is significant.

Variable     Parameter Estimate   Standard Error   t Value   Pr > |t|

Intercept    40.07407             1.47511          27.17     <.0001
insuff       -7.53060             2.17493          -3.46     0.0008

Fundamental in AI and ML 533
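The equivalence can be demonstrated directly: regress the outcome on a 0/1 dummy and the intercept and slope are exactly the group mean and the mean difference. Group values below are invented for illustration:

```python
import numpy as np

suff  = np.array([42.0, 38.0, 41.0, 39.0, 45.0, 40.0])  # sufficient vit D
insuf = np.array([33.0, 35.0, 30.0, 36.0, 31.0])        # insufficient

y = np.concatenate([suff, insuf])
d = np.concatenate([np.zeros(len(suff)), np.ones(len(insuf))])  # dummy: 1 = insufficient

X = np.column_stack([np.ones_like(y), d])
(intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)

# Intercept = mean of the reference (sufficient) group,
# slope     = difference in group means
print(round(intercept, 2), round(slope, 2))   # → 40.83 -7.83
assert np.isclose(intercept, suff.mean())
assert np.isclose(slope, insuf.mean() - suff.mean())
```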
ANOVA is linear regression!

•Divide vitamin D into three groups:


• Deficient (<25 nmol/L)
• Insufficient (>=25 and <50 nmol/L)
• Sufficient (>=50 nmol/L), reference group
DSST = α (= mean of the sufficient group) + β1*(1 if insufficient) + β2*(1 if deficient)
This is called “dummy coding”—where multiple binary variables are created to
represent being in each category (or not) of a categorical variable

Fundamental in AI and ML 534


The picture…

Sufficient vs.
Insufficient

Sufficient vs.
Deficient

Fundamental in AI and ML 535


Results…
Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 40.07407 1.47817 27.11 <.0001


deficient 1 -9.87407 3.73950 -2.64 0.0096
insufficient 1 -6.87963 2.33719 -2.94 0.0041

Interpretation:
The deficient group has a mean DSST 9.87 points lower than the reference (sufficient) group.
The insufficient group has a mean DSST 6.87 points lower than the reference (sufficient) group.
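Dummy coding is easy to verify numerically. A sketch with invented group values, where least squares recovers the reference mean and each group's offset from it:

```python
import numpy as np

groups = ["sufficient"] * 4 + ["insufficient"] * 3 + ["deficient"] * 3
y = np.array([40.0, 41.0, 39.0, 40.0,  33.0, 34.0, 32.0,  30.0, 31.0, 29.0])

# One binary indicator per non-reference category
d_insuff = np.array([1.0 if g == "insufficient" else 0.0 for g in groups])
d_defic  = np.array([1.0 if g == "deficient" else 0.0 for g in groups])

X = np.column_stack([np.ones(len(y)), d_insuff, d_defic])
(alpha, b_insuff, b_defic), *_ = np.linalg.lstsq(X, y, rcond=None)

print(alpha, b_insuff, b_defic)
assert np.isclose(alpha, 40.0)      # mean of the sufficient (reference) group
assert np.isclose(b_insuff, -7.0)   # insufficient group is 7 points lower
assert np.isclose(b_defic, -10.0)   # deficient group is 10 points lower
```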

Fundamental in AI and ML 536


Other types of multivariate regression

• Multiple linear regression is for normally distributed outcomes

• Logistic regression is for binary outcomes

• Cox proportional hazards regression is used when time-to-event is the outcome

Fundamental in AI and ML 537


Common multivariate regression models
Continuous outcome (e.g. blood pressure) – Linear regression:
  blood pressure (mmHg) = α + βsalt*salt consumption (tsp/day) + βage*age (years) + βsmoker*ever smoker (yes=1/no=0)
  Coefficients are slopes: how much the outcome variable increases for every 1-unit increase in each predictor.

Binary outcome (e.g. high blood pressure, yes/no) – Logistic regression:
  ln(odds of high blood pressure) = α + βsalt*salt consumption (tsp/day) + βage*age (years) + βsmoker*ever smoker (yes=1/no=0)
  Coefficients give odds ratios: how much the odds of the outcome increase for every 1-unit increase in each predictor.

Time-to-event outcome (e.g. time to death) – Cox regression:
  ln(rate of death) = α + βsalt*salt consumption (tsp/day) + βage*age (years) + βsmoker*ever smoker (yes=1/no=0)
  Coefficients give hazard ratios: how much the rate of the outcome increases for every 1-unit increase in each predictor.

Fundamental in AI and ML 538


Multivariate regression pitfalls
● Multi-collinearity
● Residual confounding
● Overfitting

Fundamental in AI and ML 539


Multicollinearity

• Multicollinearity arises when two variables that measure the same thing or similar
things (e.g., weight and BMI) are both included in a multiple regression model;
they will, in effect, cancel each other out and generally destroy your model.

• Model building and diagnostics are tricky business!

Fundamental in AI and ML 540


Residual confounding

• You cannot completely wipe out confounding simply by adjusting for variables in
multiple regression unless variables are measured with zero error (which is usually
impossible).

• Example: meat eating and mortality

Fundamental in AI and ML 541


Overfitting

• In multivariate modeling, you can get highly significant but meaningless results if
you put too many predictors in the model.

• The model is fit perfectly to the quirks of your particular sample, but has no
predictive ability in a new sample

Fundamental in AI and ML 542


Overfitting: class data example

• I asked SAS to automatically find predictors of optimism in our class dataset.


Here’s the resulting linear regression model.

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 11.80175 2.98341 11.96067 15.65 0.0019


exercise -0.29106 0.09798 6.74569 8.83 0.0117
sleep -1.91592 0.39494 17.98818 23.53 0.0004
obama 1.73993 0.24352 39.01944 51.05 <.0001
Clinton -0.83128 0.17066 18.13489 23.73 0.0004
mathLove 0.45653 0.10668 13.99925 18.32 0.0011

• Exercise, sleep, and high ratings for Clinton are negatively related to optimism
(highly significant!) and high ratings for Obama and high love of math are
positively related to optimism (highly significant!).
Fundamental in AI and ML 543
Overfitting
• Pure noise variables still produce good R2
values if the model is overfitted. The
distribution of R2 values from a series of
simulated regression models containing only
noise variables.
• (Figure 1 from: Babyak, MA. What You See May Not Be What You Get: A Brief,
Nontechnical Introduction to Overfitting in Regression-Type Models.
Psychosomatic Medicine 66:411-421 (2004).)

Rule of thumb: You need at least 10 subjects for


each additional predictor variable in the
multivariate regression model.
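The Babyak point is easy to reproduce: regress a pure-noise outcome on many pure-noise predictors and R² still looks "good". A short simulation sketch (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 20, 15                      # 20 subjects, 15 noise predictors: far too many
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)             # outcome is unrelated to every predictor

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - resid.var() / y.var()
print(round(r2, 2))                # typically large, even though everything is noise
assert r2 > 0.3
```

With the 10-subjects-per-predictor rule of thumb, this design would call for roughly 150 subjects, not 20.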

Fundamental in AI and ML 544


Overfitting and Underfitting
• Don’t expect your favorite learner to always be best

Fundamental in AI and ML 545


Bias and Variance

•Bias – error caused because the model cannot represent the concept

•Variance – error caused because the learning algorithm overreacts to small changes
(noise) in the training data

TotalLoss = Bias + Variance (+ noise)
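The decomposition can be estimated empirically: refit a model on many fresh training samples drawn from a known generating process and measure bias² and variance at one test point. A sketch (the sine target, sample sizes, and polynomial degrees are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)
x_test = 0.3

def predictions(degree, reps=200):
    # Fit the model on `reps` independent training sets; collect predictions at x_test
    preds = []
    for _ in range(reps):
        x = rng.uniform(0, 1, 15)
        y = true_f(x) + rng.normal(0, 0.3, 15)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    return np.array(preds)

simple, complex_ = predictions(1), predictions(6)   # underfits vs. overfits
bias2_simple = (simple.mean() - true_f(x_test)) ** 2
bias2_complex = (complex_.mean() - true_f(x_test)) ** 2
print(round(bias2_simple, 3), round(simple.var(), 3))
print(round(bias2_complex, 3), round(complex_.var(), 3))
# Typically: the linear model shows high bias^2 / low variance,
# the degree-6 model low bias^2 / high variance.
```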

Fundamental in AI and ML 546


Visualizing Bias
Goal: produce a model that matches this concept

True
Concept

Fundamental in AI and ML 547


Visualizing Bias
• Goal: produce a model that matches this concept
• Training Data for the concept

Training Data

Fundamental in AI and ML 548


Visualizing Bias
• Goal: produce a model that matches this concept
• Training Data for concept
(Figure: a linear model fit to the training data; the regions where the model predicts + vs − miss parts of the true concept, and these systematic errors are the bias mistakes.)
Fundamental in AI and ML 549
Visualizing Variance
• Goal: produce a model that matches this concept
• New data, new model
(Figure: a linear model fit to a new training sample; the fit shifts, so the bias mistakes now fall in different places.)
Fundamental in AI and ML 550
Visualizing Variance
• Goal: produce a model that matches this concept
• New data, new model… the mistakes will vary
(Figure: each new training sample produces a different linear fit; the variation in predictions across fits is the variance.)
Fundamental in AI and ML 551
Another way to think about Bias & Variance

Fundamental in AI and ML 552


Bias and Variance: More Powerful Model
• Powerful Models can represent complex concepts
• No mistakes on this training sample!
(Figure: a complex decision boundary that separates every + from every − in the training data.)
Fundamental in AI and ML 553


Bias and Variance: More Powerful Model
• But get more data…
(Figure: on a larger sample, the same complex boundary misclassifies many points.)
• Not good!

Fundamental in AI and ML 554


Overfitting vs Underfitting
• Overfitting
 • Fitting the data too well
 • Features are noisy / uncorrelated to concept
 • Modeling process very sensitive (powerful)
 • Too much search
• Underfitting
 • Learning too little of the true concept
 • Features don't capture concept
 • Too much bias in model
 • Too little search to fit model

Fundamental in AI and ML 555


The Effect of Noise

Fundamental in AI and ML 556


The Power of a Model Building Process

• Weaker Modeling Process (higher bias)
 • Simple Model (e.g. linear)
 • Fixed sized Model (e.g. fixed # weights)
 • Small Feature Set (e.g. top 10 tokens)
 • Constrained Search (e.g. few iterations of gradient descent)

• More Powerful Modeling Process (higher variance)
 • Complex Model (e.g. high order polynomial)
 • Scalable Model (e.g. decision tree)
 • Large Feature Set (e.g. every token in data)
 • Unconstrained Search (e.g. exhaustive search)

Fundamental in AI and ML 557


Example of Under/Over-fitting

Fundamental in AI and ML 558


Ways to Control Decision Tree Learning

• Increase minToSplit
• Increase minGainToSplit
• Limit total number of Nodes
• Penalize complexity

Fundamental in AI and ML 559


Ways to Control Logistic Regression

• Adjust Step Size

• Adjust Iterations / stopping criteria of Gradient Descent

• Regularization
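The three knobs above can all be seen in a small numpy sketch of gradient-descent logistic regression (data and settings are invented for illustration):

```python
import numpy as np

def train_logreg(X, y, step=0.1, iters=1000, l2=0.0):
    """Gradient descent on the logistic loss with an optional L2 penalty."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):                       # iterations = a stopping criterion
        p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted probabilities
        grad = X.T @ (p - y) / len(y) + l2 * w   # loss gradient + L2 penalty term
        w -= step * grad                         # step size controls each update
    return w

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = (X[:, 1] + rng.normal(0, 0.5, 100) > 0).astype(float)

w_free = train_logreg(X, y)            # unregularized
w_reg  = train_logreg(X, y, l2=1.0)    # strong L2 pulls the weights toward 0
print(w_free, w_reg)
assert abs(w_reg[1]) < abs(w_free[1])
```

Shrinking the weights (larger `l2`), taking fewer iterations, or using a smaller step all act as brakes on how closely the model can chase the training data.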

Fundamental in AI and ML 560


Modeling to Balance Under & Overfitting

• Data
• Learning Algorithms
• Feature Sets
• Complexity of Concept
• Search and Computation

• Parameter sweeps!

Fundamental in AI and ML 561


Parameter Sweep

# optimize first parameter
for p in [ setting_certain_to_underfit, …, setting_certain_to_overfit ]:
    # do cross validation to estimate accuracy
# find the setting that balances overfitting & underfitting

# optimize second parameter
# etc…

# examine the parameters that seem best and adjust whatever you can…
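A concrete version of the sweep above, tuning one parameter (here, a ridge penalty λ, chosen only to keep the example short) with k-fold cross-validation. All data and settings are invented:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge regression: (X'X + lam*I)^-1 X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(X, y, lam, k=5):
    # k-fold cross-validation estimate of mean squared error
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[f] @ w - y[f]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = X[:, 0] + rng.normal(0, 1, 60)     # only the first feature matters

# Sweep from "certain to overfit" (tiny lam) to "certain to underfit" (huge lam)
lams = [1e-4, 1e-2, 1, 10, 100, 1000]
scores = {lam: cv_error(X, y, lam) for lam in lams}
best = min(scores, key=scores.get)
print(best, round(scores[best], 3))
```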
Fundamental in AI and ML 562
Types of Parameter Sweeps

• Optimize one parameter at a time
 • Optimize one, update, move on
 • Iterate a few times

• Gradient descent on meta-parameters
 • Start somewhere 'reasonable'
 • Computationally calculate gradient wrt change in parameters

Fundamental in AI and ML 563


Summary of Overfitting and Underfitting

•Bias / Variance tradeoff a primary challenge in machine learning

•Internalize: More powerful modeling is not always better

•Learn to identify overfitting and underfitting

•Tuning parameters & interpreting output correctly is key

Fundamental in AI and ML 564


Bayesian Decision Theory – Probability and Inference

•Result of tossing a coin is ∈ {Heads, Tails}

•Random var X ∈ {1, 0}
Bernoulli: P{X = x} = p0^x (1 − p0)^(1−x)
•Sample: X = {x^t}, t = 1…N
Estimation: p0 = #{Heads}/#{Tosses} = Σt x^t / N
•Prediction of next toss:
Heads if p0 > ½, Tails otherwise
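The estimation and prediction rule in two lines of Python (the sample of tosses is made up):

```python
tosses = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]    # made-up sample, 1 = Heads

p0 = sum(tosses) / len(tosses)             # MLE: #Heads / #Tosses
prediction = "Heads" if p0 > 0.5 else "Tails"
print(p0, prediction)                      # → 0.7 Heads
```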

Fundamental in AI and ML 565


Classification
• Credit scoring: Inputs are income and savings.
Output is low-risk vs high-risk
• Input: x = [x1, x2]ᵀ, Output: C ∈ {0, 1}
• Prediction: choose C = 1 if P(C = 1|x) > 0.5, else choose C = 0

Fundamental in AI and ML 566


Bayes’ Rule

P(C|x) = P(C) p(x|C) / p(x)   (posterior = prior × likelihood / evidence)

Fundamental in AI and ML 567


Bayes’ Rule: K>2 Classes

Fundamental in AI and ML 568


Losses and Risks

•Actions: αi
•Loss of αi when the state is Ck : λik
•Expected risk (Duda and Hart, 1973): R(αi|x) = Σk λik P(Ck|x)

Fundamental in AI and ML 569


Losses and Risks: 0/1 Loss

For minimum risk, choose the most probable class

Fundamental in AI and ML 570


Losses and Risks: Reject

Fundamental in AI and ML 571


Discriminant Functions

K decision regions R1,...,RK

Fundamental in AI and ML 572


K=2 Classes

•Dichotomizer (K=2) vs Polychotomizer (K>2)


•g(x) = g1(x) – g2(x)

•Log odds: log [ P(C1|x) / P(C2|x) ]

Fundamental in AI and ML 573


Utility Theory

•Prob of state k given evidence x: P(Sk|x)

•Utility of αi when state is k: Uik
•Expected utility: EU(αi|x) = Σk Uik P(Sk|x)

Fundamental in AI and ML 574


Value of Information

•Expected utility using x only

•Expected utility using x and new feature z

•z is useful if EU (x,z) > EU (x)

Fundamental in AI and ML 575


Bayesian Networks

•Aka graphical models, probabilistic networks


•Nodes are hypotheses (random vars) and the prob corresponds to our belief in the
truth of the hypothesis
•Arcs are direct influences between hypotheses
•The structure is represented as a directed acyclic graph (DAG)
•The parameters are the conditional probs in the arcs
•(Pearl, 1988, 2000; Jensen, 1996; Lauritzen, 1996)

Fundamental in AI and ML 576


Causes and Bayes’ Rule

Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?
(Figure: rain causes wet grass; the causal arc points from rain to wet grass, and diagnostic inference runs in the opposite direction.)

Fundamental in AI and ML 577


Causal vs Diagnostic Inference

Causal inference: If the sprinkler is on, what is the probability that the grass is wet?

P(W|S) = P(W|R,S) P(R|S) + P(W|~R,S) P(~R|S)
       = P(W|R,S) P(R) + P(W|~R,S) P(~R)
       = 0.95 × 0.4 + 0.9 × 0.6 = 0.92

Diagnostic inference: If the grass is wet, what is the probability that the sprinkler is on? P(S|W) = 0.35 > 0.2 = P(S)
P(S|R,W) = 0.21. Explaining away: knowing that it has rained decreases the probability that the sprinkler is on.
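The causal-inference arithmetic can be reproduced directly from the slide's numbers (R and S are independent in this network, so P(R|S) = P(R)):

```python
P_R = 0.4                 # P(rain)
P_W_given_R_S = 0.95      # P(wet | rain, sprinkler on)
P_W_given_notR_S = 0.9    # P(wet | no rain, sprinkler on)

# Sum out rain: P(W|S) = P(W|R,S) P(R) + P(W|~R,S) P(~R)
P_W_given_S = P_W_given_R_S * P_R + P_W_given_notR_S * (1 - P_R)
print(round(P_W_given_S, 2))   # → 0.92
```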
Fundamental in AI and ML 578
Bayesian Networks: Causes

Causal inference:
P(W|C) = P(W|R,S) P(R,S|C) +
P(W|~R,S) P(~R,S|C) +
P(W|R,~S) P(R,~S|C) +
P(W|~R,~S) P(~R,~S|C)

and use the fact that


P(R,S|C) = P(R|C) P(S|C)

Diagnostic: P(C|W ) = ?

Fundamental in AI and ML 579


Bayesian Nets: Local structure

P(F|C) = ?

Fundamental in AI and ML 580


Bayesian Networks: Inference
P (C,S,R,W,F) = P (C) P (S|C) P (R|C) P (W|R,S) P (F|R)

P (C,F) = ∑S ∑R ∑W P (C,S,R,W,F)

P (F|C) = P (C,F) / P(C) Not efficient!

Belief propagation (Pearl, 1988)


Junction trees (Lauritzen and Spiegelhalter, 1988)

Fundamental in AI and ML 581


Bayesian Networks: Classification

Bayes’ rule inverts the arc:


diagnostic

P (C | x )

Fundamental in AI and ML 582


Naive Bayes’ Classifier

Given C, xj are independent:

p(x|C) = p(x1|C) p(x2|C) ... p(xd|C)
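A minimal sketch of the naive Bayes factorization: given the class C, the joint likelihood is the product of per-feature likelihoods. The priors and conditional probabilities below are invented toy numbers:

```python
priors = {"spam": 0.3, "ham": 0.7}
# p(feature present | class), one entry per feature, assumed independent given C
likelihood = {
    "spam": {"offer": 0.8, "meeting": 0.1},
    "ham":  {"offer": 0.2, "meeting": 0.6},
}

def posterior(features):
    # Unnormalized p(C|x) ∝ p(C) * Π_j p(x_j|C), then normalize over classes
    scores = {}
    for c in priors:
        s = priors[c]
        for f in features:
            s *= likelihood[c][f]
        scores[c] = s
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

post = posterior(["offer"])
print(round(post["spam"], 3))   # → 0.632
assert post["spam"] > priors["spam"]   # seeing "offer" raises the spam posterior
```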

Fundamental in AI and ML 583


Influence Diagrams

(Figure: an influence diagram with a chance node, a decision node, and a utility node.)

Fundamental in AI and ML 584


Association Rules

•Association rule: X → Y
•Support (X → Y): P(X, Y) = #{transactions containing both X and Y} / #{transactions}

•Confidence (X → Y): P(Y|X) = #{transactions containing both X and Y} / #{transactions containing X}
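Support and confidence for a rule X → Y, computed over a toy set of transactions (the transactions are invented):

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(x, y):
    # P(X and Y): fraction of transactions containing both items
    both = sum(1 for t in transactions if x in t and y in t)
    return both / len(transactions)

def confidence(x, y):
    # P(Y | X): of the transactions containing X, the fraction also containing Y
    has_x = sum(1 for t in transactions if x in t)
    both = sum(1 for t in transactions if x in t and y in t)
    return both / has_x

print(support("bread", "milk"), confidence("bread", "milk"))  # → 0.6 0.75
```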

Fundamental in AI and ML 585


MODULE 5 Pre-training and
transfer learning

Fundamental in AI and ML 586


Real-life challenges in NLP tasks

•Deep learning methods are data-hungry


•>50K data items needed for training
•The distributions of the source and target data must be the same
•Labeled data in the target domain may be limited
•This problem is typically addressed with transfer learning

Fundamental in AI and ML 587


Transfer Learning Approaches

Fundamental in AI and ML 588


Transductive vs Inductive Transfer Learning

•Transductive transfer
• No labeled target domain data available
• Focus of most transfer research in NLP
• E.g. Domain adaptation

•Inductive transfer
• Labeled target domain data available
• Goal: improve performance on the target task by training on other task(s)
• Jointly training on >1 task (multi-task learning)
• Pre-training (e.g. word embeddings)

Fundamental in AI and ML 589


Pre-training – Word Embeddings

•Pre-trained word embeddings have been an essential component of most deep


learning models
•Problems with pre-trained word embeddings:
• Shallow approaches – trade expressivity for efficiency
• Learned word representations are not context sensitive
• No distinction between senses
• Only the first layer (embedding layer) of the model is pre-trained
• The rest of the model must be trained from scratch

Fundamental in AI and ML 590


Recent paradigm shift in pre-training for NLP
• Inspired by the success of models pre-trained on ImageNet in Computer Vision
(CV)
• The use of models pre-trained on ImageNet is now standard in CV

Fundamental in AI and ML 591


Recent paradigm shift in pre-training for NLP
• What is a good equivalent of an ImageNet task in NLP?
• Key desiderata:
• An ImageNet-like dataset should be sufficiently large, i.e. on the order of millions of
training examples.
• It should be representative of the problem space of the discipline.
• Contenders to that role:
• Reading Comprehension (SQuAD dataset, 100K Q-A pairs)
• Natural Language Inference (SNLI corpus, 570K sentence pairs)
• Machine Translation (WMT 2014, 40M French-English sentence pairs)
• Constituency parsing (millions of weakly labeled parses)
• Language Modeling (unlimited data, current benchmark dataset: 1B words
http://www.statmt.org/lm-benchmark/)
Fundamental in AI and ML 592
The case for Language Modeling

•LM captures many aspects of language:


• Long-term dependencies
• Hierarchical relations
• Sentiment
• Etc.
•Training data is unlimited

Fundamental in AI and ML 593


LM as pre-training – Approaches

•Embeddings from Language Models (ELMo) (Peters et al., 2018)


•Universal Language Model Fine-tuning (ULMFiT) (Howard and Ruder, 2018)
•OpenAI Transformer (Radford et al., 2018)

•Overview of the above approaches: https://thegradient.pub/nlp-imagenet/

Fundamental in AI and ML 594


Approach

•Inductive transfer setting:


• Given a static source task TS and any target task TT with TS ≠ TT, we would like to improve the performance on TT
•Pre-train a language model (LM) on a large general-domain corpus
•Fine-tune it on the target task using novel techniques
•The method is universal
• Works across tasks varying in document size, number and label type
• Uses single architecture and training process
• Requires no custom feature engineering or pre-processing
• Does not require additional in-domain documents or labels

Fundamental in AI and ML 595


Steps

Fundamental in AI and ML 596


Step 1: General domain LM pre-training
General domain LM pre-training

• Used the ASGD Weight-Dropped LSTM (AWD-LSTM, Merity et al. 2017)


• LM pre-trained on Wikitext-103 (103M words)
• Expensive, but performed only once
• Improves performance and convergence of downstream tasks

Fundamental in AI and ML 597


Step 2: Target task LM fine-tuning
•Data for the target task is likely from a different distribution
•Fine-tune the LM on data of the target task
•This stage converges faster
•Allows training a robust LM even on small datasets
•Two approaches to fine-tuning:
• Discriminative fine-tuning
• Slanted triangular learning rates

Fundamental in AI and ML 598


Discriminative fine-tuning
• Different layers capture different types of information, hence, they should be
fine-tuned to different extents
• Instead of using one learning rate for all layers, tune each layer with different
learning rates.
• Regular SGD: θ_t = θ_{t−1} − η · ∇_θ J(θ)

where η is the learning rate and ∇_θ J(θ) is the gradient with regard to the model's objective function

Fundamental in AI and ML 599


Discriminative fine-tuning (cont.)

•Discriminative fine-tuning:
• Split the parameters θ into {θ^1, …, θ^L}, where θ^l contains the parameters of the model at the 𝑙-th layer and 𝑳 is the number of layers of the model
• Obtain learning rates {η^1, …, η^L}, where η^l is the learning rate of the 𝑙-th layer
• SGD update with discriminative fine-tuning: θ_t^l = θ_{t−1}^l − η^l · ∇_{θ^l} J(θ)
• Fine-tune the last layer with η^L and use η^{l−1} = η^l / 2.6 as the learning rate for lower layers (the factor used by Howard and Ruder, 2018)

Fundamental in AI and ML 600


Slanted triangular learning rates (STLR)
• Intuition for adapting parameters to task-specific features:
• At the beginning of training: quickly converge to a suitable region of the
parameter space
• Later during training: refine the parameters

Fundamental in AI and ML 601


Step 3: Target task classifier fine-tuning
• Augment the pre-trained language model with two
additional linear layers:
• ReLU activations in the intermediate layer
• Softmax as the last layer
• These are the only layers whose parameters are learned from scratch
• First layer takes as input the pooled last hidden layer
states

Fundamental in AI and ML 602


Concat pooling

•Input sequences may consist of hundreds of words, so information may get lost if we only use the last hidden state of the model
•Concatenate the hidden state at the last time step hT of the document with:
 • Max-pooled representation of the hidden states
 • Mean-pooled representation of the hidden states
over as many time steps as fit in GPU memory, H = {h1, ..., hT}:

hc = [hT, maxpool(H), meanpool(H)], where [] is concatenation

Fundamental in AI and ML 603


Fine-tuning Procedure – Gradual Unfreezing

•Overly aggressive fine-tuning causes catastrophic forgetting


•Too cautious fine-tuning leads to slow convergence and overfitting
•Proposed approach: gradual unfreezing
• First unfreeze the last layer and fine-tune the unfrozen layer for one epoch
• Then unfreeze the next lower frozen layer and fine-tune all unfrozen layers
• Repeat until all layers have been unfrozen, fine-tuning until convergence in the last iteration
•The combination of discriminative fine-tuning, slanted triangular learning rates and
gradual unfreezing leads to best performance

Fundamental in AI and ML 604


Tasks and Datasets

•Sentiment analysis: binary (positive-negative) classification; the IMDb movie


review dataset
•Question classification: broad semantic categories; small TREC dataset
•Topic classification: large-scale AG news and DBPedia ontology datasets

Fundamental in AI and ML 605


Evaluation measure: error rate (lower is better)

Fundamental in AI and ML 606


Analysis

• Low-shot learning – training a model for a task with a small number of labeled samples

Fundamental in AI and ML 607


Different methods to fine-tune the classifier

Fundamental in AI and ML 608


“Full” vs ULMFiT

Fundamental in AI and ML 609


Conclusions

•ULMFiT is useful for a variety of tasks (different datasets, sizes, domains)


•Proposed approach to fine-tuning prevents catastrophic forgetting of knowledge
learned during pre-training
•Achieves good results even with 100 training data items
•Generally LM pre-training and task-specific fine-tuning will be useful for scenarios
where:
• Training data is limited
• New NLP tasks where no state-of-the-art architecture exists

Fundamental in AI and ML 610


Training a deep network by stacking RBMs

• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.
• Then do it again.

• It can be proved that each time we add another layer of features we improve a variational lower bound on the log probability of generating the training data.
 • The proof is complicated and only applies to unreal cases.
 • It is based on a neat equivalence between an RBM and an infinitely deep belief net (see lecture 14b).

Fundamental in AI and ML 611


Combining two RBMs to make a DBN
(Figure: train the bottom RBM on the data first; then train a second RBM on the binary states copied from the first RBM's hidden units; finally compose the two RBM models into a single DBN model.)
Fundamental in AI and ML 612


The generative model after learning 3 layers
To generate data:
1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time.
2. Perform a top-down pass to get states for all the other layers.
The lower level bottom-up connections are not part of the generative model. They are just used for inference.
(Figure: a DBN with layers data, h1, h2, h3.)
Fundamental in AI and ML 613
An aside: Averaging factorial distributions
• If you average some factorial distributions, you do NOT get a factorial distribution.
• In an RBM, the posterior over 4 hidden units is factorial for each visible vector.
 • Posterior for v1: 0.9, 0.9, 0.1, 0.1
 • Posterior for v2: 0.1, 0.1, 0.9, 0.9
 • Aggregated = 0.5, 0.5, 0.5, 0.5
• Consider the binary vector 1,1,0,0:
 • in the posterior for v1, p(1,1,0,0) = 0.9^4 = 0.43
 • in the posterior for v2, p(1,1,0,0) = 0.1^4 = .0001
 • in the aggregated posterior, p(1,1,0,0) = 0.215.
• If the aggregated posterior were factorial it would have p = 0.5^4

Fundamental in AI and ML 614


Why does greedy learning work?
• The weights, W, in the bottom level RBM define many different distributions:
p(v|h); p(h|v); p(v,h); p(h); p(v).

• We can express the RBM model as: p(v) = Σh p(h) p(v|h)

• If we leave p(v|h) alone and improve p(h), we will improve p(v).


• To improve p(h), we need it to be a better model than p(h;W) of the aggregated
posterior distribution over hidden vectors produced by applying W transpose to
the data.

Fundamental in AI and ML 615


Fine-tuning with a contrastive version of the
wake-sleep algorithm
After learning many layers of features, we can fine-tune the features to improve
generation.
1. Do a stochastic bottom-up pass
• Then adjust the top-down weights of lower layers to be good at reconstructing the feature
activities in the layer below.
2. Do a few iterations of sampling in the top level RBM
• Then adjust the weights in the top-level RBM using CD.
3. Do a stochastic top-down pass
• Then adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.

Fundamental in AI and ML 616


The DBN used for modeling the joint distribution
of MNIST digits and their labels

•The first two hidden layers are learned without using labels.
•The top layer is learned as an RBM for modeling the labels concatenated with the features in the second hidden layer.
•The weights are then fine-tuned to be a better generative model using contrastive wake-sleep.
(Figure: 28 × 28 pixel image → 500 units → 500 units → top-level RBM of 2000 units joined with the 10 labels.)

Fundamental in AI and ML 617


What happens during discriminative fine-tuning?
Learning Dynamics of Deep Nets

Before fine-tuning After fine-tuning

Fundamental in AI and ML 618


Effect of Unsupervised Pre-training

Fundamental in AI and ML 619


Effect of Depth

Fundamental in AI and ML 620


Trajectories of the learning in function space
(a 2-D visualization produced with t-SNE)

•Each point is a model in function


space
•Color = epoch
•Top: trajectories without
pre-training. Each trajectory
converges to a different local
min.
•Bottom: Trajectories with
pre-training.
•No overlap!
Fundamental in AI and ML 621
Why unsupervised pre-training makes sense
(Figure: two generative stories. Left: "stuff" causes the image and the label separately. Right: "stuff" causes the image through a high-bandwidth pathway and the label through a low-bandwidth one.)

If image-label pairs were generated the first way, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?

If image-label pairs are generated the second way, it makes sense to first learn to recover the stuff that caused the image by inverting the high bandwidth pathway.
Fundamental in AI and ML 622


Modeling real-valued data
• For images of digits, intermediate intensities can be represented as if they were probabilities by using "mean-field" logistic units.
 • We treat intermediate values as the probability that the pixel is inked.

• This will not work for real images.
 • In a real image, the intensity of a pixel is almost always, almost exactly the average of the neighboring pixels.
 • Mean-field logistic units cannot represent precise intermediate values.
Fundamental in AI and ML 623


A standard type of real-valued visible unit

• Model pixels as Gaussian variables. Alternating Gibbs sampling is still easy, though learning needs to be much slower.
(Figure: the energy E of a Gaussian visible unit is a parabolic containment function; the total input to the visible unit produces an energy gradient.)
Fundamental in AI and ML 624
Gaussian-Binary RBM’s
• Lots of people have failed to get these
to work properly. Its extremely hard to
learn tight variances for the visible units.
• It took a long time for us to figure out why it
is so hard to learn the visible variances.
• When sigma is small, we need many
more hidden units than visible units.
• This allows small weights to produce big When sigma is much less
top-down effects. than 1, the bottom-up
effects are too big and the
top-down effects are too
small.
Fundamental in AI and ML 625
Stepped sigmoid units: A neat way to implement
integer values
• Make many copies of a stochastic binary unit.
• All copies have the same weights and the same adaptive bias, b, but they
have different fixed offsets to the bias:

Fundamental in AI and ML 626


Fast approximations
• Contrastive divergence learning works well for the sum of stochastic
logistic units with offset biases. The noise variance is
• It also works for rectified linear units. These are much faster to compute
than the sum of many logistic units with different biases.

Fundamental in AI and ML 627


A nice property of rectified linear units

•If a relu has a bias of zero, it exhibits scale equivariance: R(a·x) = a·R(x) for a > 0


•This is a very nice property to have for images.

•It is like the equivariance to translation exhibited by


convolutional nets.

Fundamental in AI and ML 628


Another view of why layer-by-layer learning
works (Hinton, Osindero & Teh 2006)

• There is an unexpected equivalence between RBMs and directed networks with many layers that all share the same weight matrix.
 • This equivalence also gives insight into why contrastive divergence learning works.
• An RBM is actually just an infinitely deep sigmoid belief net with a lot of weight sharing.
 • The Markov chain we run when we want to sample from the equilibrium distribution of an RBM can be viewed as a sigmoid belief net.

Fundamental in AI and ML 629


An infinite sigmoid belief net that is equivalent to an RBM

•The distribution generated by this infinite directed net with replicated weights is the equilibrium distribution for a compatible pair of conditional distributions, p(v|h) and p(h|v), that are both defined by W
• A top-down pass of the directed net is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium.
• So this infinite directed net defines the same distribution as an RBM.
(Figure: an infinite stack of layers v0, h0, v1, h1, v2, … with tied weights.)

Fundamental in AI and ML 630
Inference in an infinite sigmoid belief net
• The variables in h0 are conditionally independent given v0.
 • Inference is trivial. Just multiply v0 by Wᵀ.
 • The model above h0 implements a complementary prior.
 • Multiplying v0 by Wᵀ gives the product of the likelihood term and the prior term.
 • The complementary prior cancels the explaining away.
• Inference in the directed net is exactly equivalent to letting an RBM settle to equilibrium starting at the data.
(Figure: the infinite stack v0, h0, v1, h1, v2, h2, … with tied weights.)
Fundamental in AI and ML 631
• The learning rule for a sigmoid belief net is: Δw_ij ∝ s_j (s_i − p_i)
• With replicated weights, the rule is applied at every tied layer; the intermediate terms cancel, and the state one layer up is an unbiased sample from p_i.
(Figure: the infinite stack v0, h0, v1, h1, v2, h2, … with tied weights.)
Fundamental in AI and ML 632
Learning a deep directed network
• First learn with all the weights tied. This is exactly equivalent to learning an RBM.
• Think of the symmetric connections as a shorthand notation for an infinite directed net with tied weights.
• We ought to use maximum likelihood learning, but we use CD1 as a shortcut.
(Figure: the infinite tied-weight stack collapses to a single RBM on v0 and h0.)
Fundamental in AI and ML 633
Learning a deep directed network
• Then freeze the first layer of weights in both directions and learn the remaining weights (still tied together).
 • This is equivalent to learning another RBM, using the aggregated posterior distribution of h0 as the data.
(Figure: the frozen first layer maps v0 to h0; the layers above form a new RBM on h0 and v1.)
Fundamental in AI and ML 634
What happens when the weights in higher layers become
different from the weights in the first layer?
• The higher layers no longer implement a complementary prior.
 • So performing inference using the frozen weights in the first layer is no longer correct.
 • But it's still pretty good.
 • Using this incorrect inference procedure gives a variational lower bound on the log probability of the data.
• The higher layers learn a prior that is closer to the aggregated posterior distribution of the first hidden layer.
 • This improves the network's model of the data.
 • Hinton, Osindero and Teh (2006) prove that this improvement is always bigger than the loss in the variational bound caused by using less accurate inference.

Fundamental in AI and ML 635


What is really happening in contrastive divergence learning?
Contrastive divergence learning in this RBM is equivalent to ignoring the small derivatives contributed by the tied weights in higher layers.
(Figure: the infinite tied-weight stack v0, h0, v1, h1, v2, h2, …)
Fundamental in AI and ML 636
Why is it OK to ignore the derivatives in higher
layers?
• When the weights are small, the Markov chain mixes fast, so the higher layers will be close to the equilibrium distribution (i.e., they will have "forgotten" the data vector).
• At equilibrium the derivatives must average to zero, because the current weights are a perfect model of the equilibrium distribution!
• As the weights grow we may need to run more iterations of CD. This allows CD to continue to be a good approximation to maximum likelihood.
• But for learning layers of features, it does not need to be a good approximation to maximum likelihood!

Fundamental in AI and ML 637
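The CD1 shortcut discussed above can be sketched in a few lines for a binary RBM. This is an illustrative implementation, not code from the course; the variable names and learning rate are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1):
    """One CD-1 update for a binary RBM.
    v0: (n_vis,) data vector; W: (n_vis, n_hid) weights; b, c: biases."""
    # Positive phase: sample hidden units given the data vector.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one step of Gibbs sampling (the CD-1 shortcut,
    # instead of running the chain to equilibrium).
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Approximate gradient: <v h>_data - <v h>_reconstruction.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c
```

Stacking such RBMs, each trained on the aggregated posterior of the layer below, is exactly the greedy layer-wise procedure on the preceding slides.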


Sentiment classification
What features of the text could help predict the number of stars? (e.g., using a log-linear model) How could we identify more? Are the features hard to compute? (syntax? sarcasm?)

Fundamental in AI and ML 638


Other text categorization tasks
• Is it spam? (see features)
• What medical billing code for this visit?
• What grade, as an answer to this essay question?
• Is it interesting to this user?
• News filtering; helpdesk routing
• Is it interesting to this NLP program?
• Skill classification for a digital assistant!
• If it’s Spanish, translate it from Spanish
• If it’s subjective, run the sentiment classifier
• If it’s an appointment, run information extraction
• Where should it be filed?
• Which mail folder? (work, friends, junk, urgent ...)
• Yahoo! / Open Directory / digital libraries
Fundamental in AI and ML 639
Measuring Performance
• Classification accuracy: What % of messages were classified correctly?
• Is this what we care about?

• Which system do you prefer?

Fundamental in AI and ML 640


Measuring Performance
• Precision = good messages kept / all messages kept
• Recall = good messages kept / all good messages
Move from high precision to high recall by deleting fewer messages (delete only if spamminess > high threshold)

Fundamental in AI and ML 641


Measuring Performance
(Figure: the precision–recall tradeoff curve as the threshold varies; we would prefer to be in the corner where both are high.)
• High threshold: all we keep is good, but we don't keep much. OK for search engines (users only want the top 10).
• Low threshold: keep all the good stuff, but a lot of the bad too. OK for spam filtering and legal search.
• The point where precision = recall is occasionally reported.
Fundamental in AI and ML 642
600.465 - Intro to NLP - J. Eisner 642
Measuring Performance
• Precision = good messages kept / all messages kept
• Recall = good messages kept / all good messages
• F-measure = ((precision⁻¹ + recall⁻¹) / 2)⁻¹, the harmonic mean of precision and recall
Move from high precision to high recall by deleting fewer messages (raise the threshold). Another system may be better for some users and worse for others (you can't tell just by comparing F-measures).
It is conventional to tune the system and threshold to optimize F-measure on dev data. But it's more informative to report the whole curve, since in real life the user should be able to pick a tradeoff point they like.
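The precision, recall and F-measure definitions above can be sketched directly as set arithmetic; the message sets in the example are invented for illustration:

```python
def precision_recall_f1(kept, good):
    """kept: set of messages the filter kept; good: set of truly good messages."""
    good_kept = kept & good
    precision = len(good_kept) / len(kept)
    recall = len(good_kept) / len(good)
    # F-measure is the harmonic mean: ((P^-1 + R^-1) / 2)^-1.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = precision_recall_f1(kept={1, 2, 3, 4}, good={1, 2, 3, 5})
# p == 0.75, r == 0.75, f == 0.75
```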
More than 2 classes
• Report F-measure for each class
• Show a confusion matrix (rows: true class; columns: predicted class; the diagonal cells count the correct predictions)
Fundamental in AI and ML 644
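A confusion matrix like the one described can be built in a few lines; the class labels here are illustrative:

```python
from collections import Counter

def confusion_matrix(true_labels, pred_labels, classes):
    """Rows: true class; columns: predicted class."""
    counts = Counter(zip(true_labels, pred_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]

m = confusion_matrix(["A", "A", "B", "B"], ["A", "B", "B", "B"], ["A", "B"])
# m == [[1, 1], [0, 2]]; the diagonal holds the correct predictions
```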
Fancier Performance Metrics
• For multi-way classifiers:
• Average accuracy (or precision or recall) of 2-way distinctions: Sports or not, News or not,
etc.
• Better, estimate the cost of different kinds of errors
• e.g., how bad is each of the following?
• putting Sports articles in the News section
• putting Fashion articles in the News section
• putting News articles in the Fashion section
• Now tune system to minimize total cost

• For ranking systems:


• Correlate with human rankings?
• Get active feedback from user?
• Measure user’s wasted time by tracking clicks?
Fundamental in AI and ML 645
Text Annotation Tasks

1.Classify the entire document


2.Classify individual word tokens

Fundamental in AI and ML 646


p(class | token in context)
(WSD)

Build a special classifier just for tokens of “plant”

Fundamental in AI and ML 647


slide courtesy of D. Yarowsky
p(class | token in context)
WSD for

Build a special classifier just for tokens of “sentence”

Fundamental in AI and ML 648




slide courtesy of D. Yarowsky
What features? Example: “word to left”
• Spelling correction using an n-gram language model (n ≥ 2) would use words to the left and right to help predict the true word.
• Similarly, an HMM would predict a word's class using the classes to the left and right.
• But we'd like to throw in all kinds of other features, too …
Fundamental in AI and ML 654
Feature Templates

generates a whole
bunch of features – use
data to find out which
ones work best
Fundamental in AI and ML 655
Feature Templates

This feature is
relatively
weak, but weak
features are
still useful,
especially since
very few
features will
fire in a given
context.

merged ranking
of all features
of all these types
Fundamental in AI and ML 656
Final decision list for lead (abbreviated)

List of all features,


ranked by their weight.
(These weights are for a simple
“decision list” model where the single
highest-weighted feature that fires gets
to make the decision all by itself.

However, a log-linear model, which


adds up the weights of all features that
fire, would be roughly similar.)

Fundamental in AI and ML 657


Text Annotation Tasks

1. Classify the entire document


2. Classify individual word tokens
3. Identify phrases (“chunking”)

Fundamental in AI and ML 658


Named Entity Recognition

CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per
round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit AMR,
immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase
took effect Thursday night and applies to most routes where it competes against discount carriers, such
as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

Fundamental in AI and ML 659


NE Types

Fundamental in AI and ML 660


Identifying phrases (chunking)
• Phrases that are useful for information extraction:
• Named entities
• As on previous slides
• Relationship phrases
• “said”, “according to”, …
• “was born in”, “hails from”, …
• “bought”, “hopes to acquire”, “formed a joint agreement with”, …
• Simple syntactic chunks (e.g., non-recursive NPs)
• “Syntactic chunking” sometimes done before (or instead of) parsing
• Also, “segmentation”: divide Chinese text into words (no spaces)
• So, how do we learn to mark phrases?
• Earlier, we built an FST to mark dates by inserting brackets
• But, it’s common to set this up as a tagging problem …
Fundamental in AI and ML 661
Reduce to a tagging problem …
• The IOB encoding (Ramshaw & Marcus 1995):
• B_X = “beginning” (first word of an X)
• I_X = “inside” (non-first word of an X)
• O = “outside” (not in any phrase)
• Does not allow overlapping or recursive phrases

…United Airlines said Friday it has increased …


B_ORG I_ORG O O O O O
… the move , spokesman Tim Wagner said …
O O O O B_PER I_PER O

What if this were tagged as B_ORG instead?


Fundamental in AI and ML 662
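A small sketch of decoding an IOB tag sequence back into phrase spans (assuming well-formed tags; the helper name is made up):

```python
def iob_to_spans(tags):
    """Convert an IOB tag sequence to (label, start, end) spans, end exclusive.
    Assumes well-formed tags: every I_X is preceded by B_X or I_X."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        if tag.startswith("B_") or tag == "O":
            if label is not None:
                spans.append((label, start, i))
                label = None
            if tag.startswith("B_"):
                start, label = i, tag[2:]
        # an I_X tag simply continues the current span
    return spans

tags = ["B_ORG", "I_ORG", "O", "O", "B_PER", "I_PER", "O"]
# iob_to_spans(tags) == [("ORG", 0, 2), ("PER", 4, 6)]
```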
Attributes and
Feature Templates for NER
POS tags and chunks Now predict NER tagseq
from earlier processing

A feature of this
tagseq might give a
positive or negative
weight to this
B_ORG in conjunction
with some subset of
the nearby
attributes

Or even faraway attributes:


B_ORG is more likely in a
sentence with a spokesman!
Fundamental in AI and ML 663
Slide adapted from Jim Martin
Alternative: CRF Chunker

• Log-linear model of p(structure | sentence):


• CRF tagging takes O(n) time (“Markov property”)
• Score each possible tag or tag bigram at each position
• Then use dynamic programming to find best overall tagging

• CRF chunking takes O(n²) time (“semi-Markov”)


• Score each possible chunk, e.g., using a BiRNN-CRF
• Then use dynamic programming to find best overall chunking
• Forward algorithm:
• for all j: for all i < j: for all labels L:
α(j) += α(i) * score(possible chunk from i to j with label L)

• CRF parsing takes O(n³) time


• Score each possible rule at each position
• Then use dynamic programming to find best overall parse
Fundamental in AI and ML 664
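The semi-Markov forward recursion in the pseudocode above can be sketched as follows; `score` is a stand-in for whatever chunk-scoring model (e.g., a BiRNN-CRF) supplies the non-negative chunk scores:

```python
def semi_markov_forward(n, labels, score, max_len=None):
    """alpha[j] = total score of all chunkings of the first j words.
    score(i, j, L) is a hypothetical non-negative score for a chunk
    spanning words i..j-1 with label L (O(n^2) chunks overall)."""
    max_len = max_len or n
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0  # empty prefix
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            for L in labels:
                alpha[j] += alpha[i] * score(i, j, L)
    return alpha[n]

# With score == 1 and a single label, alpha[n] counts the 2^(n-1)
# ways to segment n words into contiguous chunks.
total = semi_markov_forward(4, ["X"], lambda i, j, L: 1.0)
# total == 8.0
```

Replacing the sum with a max (and keeping backpointers) gives the Viterbi variant that finds the best overall chunking.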
Text Annotation Tasks
1. Classify the entire document
2. Classify individual word tokens
3. Identify phrases (“chunking”)
4. Syntactic annotation (parsing)

Fundamental in AI and ML 665


Parser Evaluation Metrics
• Runtime
• Exact match
• Is the parse 100% correct?
• Labeled precision, recall, F-measure of constituents
• Precision: You predicted (NP,5,8); was it right?
• Recall: (NP,5,8) was right; did you predict it?
• Easier versions:
• Unlabeled: Don’t worry about getting (NP,5,8) right, only (5,8)
• Short sentences: Only test on sentences of ≤ 15, ≤ 40, ≤ 100 words
• Dependency parsing: Labeled and unlabeled attachment accuracy
• Crossing brackets
• You predicted (…,5,8), but there was really a constituent (…,6,10)

Fundamental in AI and ML 666
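Labeled precision, recall and F-measure over constituents can be computed from sets of (label, start, end) triples; the gold and predicted trees in this example are invented:

```python
def constituent_prf(gold, predicted):
    """gold, predicted: sets of (label, start, end) constituents."""
    correct = gold & predicted
    precision = len(correct) / len(predicted)
    recall = len(correct) / len(gold)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

gold = {("NP", 5, 8), ("VP", 3, 8), ("S", 0, 8)}
pred = {("NP", 5, 8), ("VP", 4, 8), ("S", 0, 8)}
# precision == recall == 2/3: (NP,5,8) and (S,0,8) match, (VP,4,8) does not
```

For the unlabeled version, simply drop the label from each triple before comparing.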


Labeled Dependency Parsing
Raw sentence
He reckons the current account deficit will narrow to only 1.8 billion in September.
Part-of-speech tagging
POS-tagged sentence
He reckons the current account deficit will narrow to only 1.8 billion in September.
PRP VBZ DT JJ NN NN MD VB TO RB CD CD IN NNP .

Word dependency parsing
Word dependency parsed sentence:
He reckons the current account deficit will narrow to only 1.8 billion in September .
(Figure: dependency arcs over the sentence with labels SUBJ, COMP, S-COMP, SPEC, MOD, ROOT.)

Fundamental in AI and ML 667


Dependency Trees
1. Assign heads.
(Figure: phrase-structure parse of “The plan to swallow Wanda has been thrilling Otto”, with each node annotated with its head word, e.g. S[head=thrill], NP[head=plan], VP[head=swallow].)
Fundamental in AI and ML 668
Dependency Trees
2. Each word is the head of a whole connected subgraph.
(Figure: the same parse tree with head annotations; each head word dominates a connected region of the tree.)
Fundamental in AI and ML 669
Dependency Trees
2. (continued) Each word is the head of a whole connected subgraph.
(Figure: the same tree with the head annotations removed and each head word's connected region outlined.)
Fundamental in AI and ML 670
Dependency Trees
3. Just look at which words are related.
(Figure: the words connected head-to-dependent, e.g. plan → The, swallow → Wanda, thrilling → Otto.)
Fundamental in AI and ML 671
Dependency Trees
4. Optionally flatten the drawing.
• Shows which words modify (“depend on”) another word
• Each subtree of the dependency tree is still a constituent
• But not all of the original constituents are subtrees (e.g., VP)

The plan to swallow Wanda has been thrilling Otto.


• Easy to spot semantic relations (“who did what to whom?”)
• Good source of syntactic features for other tasks
• Easy to annotate (high agreement)
• Easy to evaluate (what % of words have correct parent?)
• Data available in more languages (Universal Dependencies)
Fundamental in AI and ML 672
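The dependency evaluation mentioned above ("what % of words have correct parent?") can be sketched as attachment accuracy; the head arrays and labels in the example are illustrative:

```python
def attachment_scores(gold_heads, pred_heads, gold_labels=None, pred_labels=None):
    """Heads are parent indices, one per word (0 = root).
    UAS = fraction of words with the correct parent; LAS also requires
    the correct dependency label."""
    n = len(gold_heads)
    uas = sum(g == p for g, p in zip(gold_heads, pred_heads)) / n
    if gold_labels is None:
        return uas
    las = sum(g == p and gl == pl
              for g, p, gl, pl in zip(gold_heads, pred_heads,
                                      gold_labels, pred_labels)) / n
    return uas, las

uas = attachment_scores([2, 0, 2, 2], [2, 0, 2, 1])
# uas == 0.75: three of four words got the right parent
```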
Text Annotation Tasks
1. Classify the entire document
2. Classify individual word tokens
3. Identify phrases (“chunking”)
4. Syntactic annotation (parsing)
5. Semantic annotation

Fundamental in AI and ML 673


Semantic Role Labeling (SRL)
• For each predicate (e.g., verb)
1. find its arguments (e.g., NPs)
2. determine their semantic roles

John drove Mary from Austin to Dallas in his Toyota Prius.

The hammer broke the window.

• agent: Actor of an action


• patient: Entity affected by the action
• source: Origin of the affected entity
• destination: Destination of the affected entity
• instrument: Tool used in performing action.
• beneficiary: Entity for whom action is performed

Fundamental in AI and ML 674


Might be helped by syntactic parsing first …
• Consider one verb at a time: “bit”
• Classify the role (if any) of each of the 3 NPs
(Figure: parse tree for “The big dog with the boy bit a girl”, with candidate NPs color-coded by role: not-a-role, agent, patient, source, destination, instrument, beneficiary.)
Fundamental in AI and ML 675
Parse tree paths as classification features
Path feature is V ↑ VP ↑ S ↓ NP, which tends to be associated with the agent role.
(Figure: the path from “bit” up to S and down to the subject NP “The big dog with the boy”.)

Fundamental in AI and ML 676


Parse tree paths as classification features
Path feature is V ↑ VP ↑ S ↓ NP ↓ PP ↓ NP, which tends to be associated with no role.
(Figure: the path from “bit” up to S and down into the PP-internal NP “the boy”.)

Fundamental in AI and ML 677


Head words as features
• Some roles prefer to be filled by certain kinds of NPs.
• This can give us useful features for classifying accurately:
• “John ate the spaghetti with chopsticks.” (instrument)
“John ate the spaghetti with meatballs.” (patient)
“John ate the spaghetti with Mary.”
• Instruments should be tools
• Patient of “eat” should be edible

• “John bought the car for $21K.” (instrument)


“John bought the car for Mary.” (beneficiary)
• Instrument of “buy” should be Money
• Beneficiaries should be animate (things with desires)

• “John drove Mary to school in the van”


“John drove the van to work with Mary.”
• What do you think?
Fundamental in AI and ML 678
Semantic roles can feed into further tasks
• Find the answer to a user’s question
• “Who” questions usually want Agents
• “What” question usually want Patients
• “How” and “with what” questions usually want Instruments
• “Where” questions frequently want Sources/Destinations.
• “For whom” questions usually want Beneficiaries
• “To whom” questions usually want Destinations
• Generate text
• Many languages have specific syntactic constructions that must or should be used for specific semantic roles.
• Word sense disambiguation, using selectional restrictions
• The bat ate the bug. (what kind of bat? what kind of bug?)
• Agents (particularly of “eat”) should be animate – animal bat, not baseball bat
• Patients of “eat” should be edible – animal bug, not software bug
• John fired the secretary.
John fired the rifle.
Patients of fire1 are different than patients of fire2
Fundamental in AI and ML 679
Other Current Semantic Annotation Tasks
(similar to SRL)
• PropBank – coarse-grained roles of verbs
• NomBank – similar, but for nouns
• TimeBank – temporal expressions
• FrameNet – fine-grained roles of any word

• Like finding the intent (= frame) for each “trigger phrase,” and filling its slots

Alarm(delay=“20 minutes”, content=“turn off the rice”)

intent slot filler slot filler

Fundamental in AI and ML 680


Face Detection Overview
• Problem Identification
• Methods Adopted
• Color Segmentation
• Morphological Processing
• Template Matching
• EigenFaces
• Gender Classification

Fundamental in AI and ML 681


Color Segmentation
• Use the color information
• Two approaches:
• Global thresholding in HSV and YCbCr space using a set of linear equations. A lot of overlap exists.
(Figure: clustering in (a) YCbCr and (b) V vs. H space. Red is non-face and blue is face data.)
Fundamental in AI and ML 682
Result of color segmentation using Global
thresholding

Fundamental in AI and ML 683


Overlap exists in RGB space also

Sample Blue vs. Green plot for face (blue) and non-face (red) data.

• Second approach involves RGB vector quantization


(Linde, Buzo, Gray)
• Use RGB as a 3-D vector and quantize the RGB space
for the face and non-face regions
Fundamental in AI and ML 684
Results from initial quantization
Common problems identified

Fundamental in AI and ML 685


Better Code book developed
Problem areas broken up

Fundamental in AI and ML 686


Face Detection
• Initial step of open and close performed to fill holes in faces
• Elongated objects removed by check on aspect ratio and small areas discarded

Fundamental in AI and ML 687


Morphological Processing
• Segmented and processed Image consists of all skin regions (face, arms and fists)
• Need to identify centers of all objects, including individual faces among
connected faces
• Repeated EROSION is done with specific structuring element

Fundamental in AI and ML 688


Face Detection

Previous state stored to identify new regions when split occurs

Superimposed mask image with eroded regions for estimate of centroids


Fundamental in AI and ML 689
Template Matching
• Data set has 145 male and 19 female faces
• Need to identify region around estimated centroids as face or non-face
• Multi-resolution was attempted. But distortion from neighboring faces gives false
values
• Smaller template has better result for all face shapes
• Template used is the mean face of 50x50 pixels

Mean face used for template matching
Fundamental in AI and ML 690
• Illumination problem identified
• Top has low lighting, lower part is brighter
• Left and right edges of images do not have people
• 2-D weighting function for correlation values applied

(Figures: the 2-D weighting function and a sample correlation result.)


Fundamental in AI and ML 691

Result from template matching and thresholding.


Rejected - Red ‘x’. Detected Faces – Green ‘x’
Fundamental in AI and ML 692
EigenFace based detection
• Decompose faces into set of basis images
• Different methods of candidate face extraction from image

EigenFaces

(a) (b)

Candidate face extraction: (a) conservative, (b) multi-resolution with side distortion
Fundamental in AI and ML 693
Sample result of eigenface. Red ‘+’ is from morphological
processing and green ‘O’ is from eigenfaces
• Minimum Distance between
vector of coefficients to that of
the face dataset was the metric.
• It depends very much on spatial
similarity to trained dataset
• Slight changes give incorrect
results
• Hence, only template matching
was used

Fundamental in AI and ML 694


Gender classification
• Eigenfaces and template matching for specific face features do not yield good
results
• Other features for specific females were used – the headband
• Template matching was performed for it
• Conservative estimate was done to prevent falsely identifying males as a female

The headband template
Fundamental in AI and ML 695
Table of results for training images
Approx. 95% accuracy with about 75 seconds runtime

Training Image | Final Score | Detect Score | Num Hits | Num Repeat | Num False Positives | Distance | Runtime | Bonus
1 | 22 | 21 | 21 | 0 | 0 | 15.9311 | 71.91 | 1
2 | 22 | 21 | 23 | 0 | 2 | 13.6109 | 82.96 | 1
3 | 25 | 25 | 25 | 0 | 0 |  9.8625 | 80.48 | 0
4 | 22 | 22 | 24 | 0 | 2 | 11.3667 | 81.15 | 0
5 | 24 | 24 | 24 | 0 | 0 |  9.5960 | 69.59 | 0
6 | 23 | 23 | 23 | 0 | 0 | 11.5512 | 80.25 | 0
7 | 22 | 21 | 21 | 0 | 0 | 14.1537 | 71.52 | 1

Fundamental in AI and ML 696


Training 1
Fundamental in AI and ML 697
Training 2
Fundamental in AI and ML 698
Training 3
Fundamental in AI and ML 699
Training 4
Fundamental in AI and ML 700
Training 5
Fundamental in AI and ML 701
Training 6
Fundamental in AI and ML 702
Training 7
Fundamental in AI and ML 703
Sentiment Analysis

Fundamental in AI and ML 704


What is SA & OM?

Identify the orientation of opinion in a piece of text

“The movie was fabulous!” (positive)
“The movie stars Mr. X” (objective)
“The movie was horrible!” (negative)

Can be generalized to a wider set of emotions

Fundamental in AI and ML 705


Motivation
• Knowing sentiment is a very natural ability of a human being.
Can a machine be trained to do it?

• SA aims at getting sentiment-related knowledge especially from the huge amount


of information on the internet

• Can be generally used to understand opinion in a set of documents

Fundamental in AI and ML 706


Tripod of Sentiment Analysis
Cognitive Science

Sentiment Analysis

Machine Learning Natural Language


Processing
Fundamental in AI and ML 707
Sentiment Analysis
(Figure: the facets of SA: challenges, lexical resources, approaches, subjectivity detection, and applications.)
Fundamental in AI and ML 708


SentiWordNet
• Lexical resource for sentiment analysis
• Built on the top of WordNet synsets
• Attaches sentiment-related information with synsets

Fundamental in AI and ML 709


Quantifying sentiment
Each term sense has a Positive, Negative and Objective score; the three scores sum to one.
(Figure: a term sense positioned along two axes: polarity (positive vs. negative) and subjectivity (subjective vs. objective).)
scores sum to one.

Fundamental in AI and ML 710


Building SentiWordNet

• Ln, Lo, Lp are the three seed sets


• Iteratively expand the seed sets through K steps
• Train the classifier for the expanded sets

Fundamental in AI and ML 711


Expansion of seed sets
Lp and Ln are expanded iteratively by following WordNet relations such as synonymy and antonymy.
The sets at the end of the kth step are called Tr(k,p) and Tr(k,n).
Tr(k,o) is the set that is not present in Tr(k,p) and Tr(k,n).

Fundamental in AI and ML 712


Committee of classifiers

• Train a committee of classifiers of different types and different K-values for the
given data
• Observations:
• Low values of K give high precision and low recall
• Accuracy in determining positivity or negativity, however, remains almost
constant

Fundamental in AI and ML 713


WordNet Affect
• Similar to SentiWordNet (an earlier work)
• WordNet-Affect: WordNet + annotated affective concepts in hierarchical order
• Hierarchy called ‘affective domain labels’
• behaviour
• personality
• cognitive state

Fundamental in AI and ML 714


Subjectivity detection
• Aim: To extract subjective portions of text
• Algorithm used: Minimum cut algorithm

Fundamental in AI and ML 715


Constructing the graph
• Build an undirected graph G with vertices {v1, v2…,s, t} (sentences and s,t)
• Add edges (s, vi) each with weight ind1(xi)
• Add edges (t, vi) each with weight ind2(xi)
• Add edges (vi, vk) with weight assoc (vi, vk)

• Partition cost: cost(C1, C2) = Σ_{x∈C1} ind2(x) + Σ_{x∈C2} ind1(x) + Σ_{xi∈C1, xk∈C2} assoc(xi, xk)

Fundamental in AI and ML 716
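Under the graph construction above, the cost of a candidate partition can be sketched directly; `ind1`, `ind2` and `assoc` are stand-ins for the individual and association score functions:

```python
def partition_cost(subjective, objective, ind1, ind2, assoc):
    """Cost of placing sentences in `subjective` (the s side) vs `objective`
    (the t side). Each sentence pays the individual score of the side it
    did NOT join, plus assoc penalties for every separated pair."""
    cost = sum(ind2(x) for x in subjective)   # cut edges to t
    cost += sum(ind1(x) for x in objective)   # cut edges from s
    cost += sum(assoc(x, y) for x in subjective for y in objective)
    return cost
```

A minimum-cut solver then finds the partition minimizing this cost in polynomial time, which is why the graph formulation is attractive.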


Example

Sample cuts:

Fundamental in AI and ML 717


Results (1/2)
• Naïve Bayes, no extraction : 82.8%
• Naïve Bayes, subjective extraction : 86.4%
• Naïve Bayes, ‘flipped experiment’ : 71 %

(Pipeline: Document → Subjectivity detector → Subjective / Objective portions → POLARITY CLASSIFIER.)
Fundamental in AI and ML 718


Results

Fundamental in AI and ML 719


Reinforcement Learning: What is learning?
• Learning types
• Supervised learning: a situation in which sample (input, output) pairs of the function to be learned can be perceived or are given. You can think of it as having a kind teacher.
• Reinforcement learning: the agent acts on its environment and receives some evaluation of its action (reinforcement), but is not told which action is the correct one to achieve its goal.

Fundamental in AI and ML 720


Reinforcement learning
• Task
Learn how to behave successfully to achieve a goal while interacting with an external
environment
• Learn via experiences!
• Examples
• Game playing: the player knows whether it won or lost, but not how it should have moved at each step.
• Control: a traffic system can measure the delay of cars, but does not know how to decrease it.

Fundamental in AI and ML 721


RL is learning from interaction

Fundamental in AI and ML 722


RL model
• Each percept (e) is enough to determine the State (the state is accessible)
• The agent can decompose the Reward component from a percept.
• The agent's task: to find an optimal policy, mapping states to actions, that maximizes a long-run measure of the reinforcement
• Think of reinforcement as reward
• Can be modeled as an MDP!

Fundamental in AI and ML 723


Review of the MDP model
• MDP model <S, A, T, R>
• S – set of states
• A – set of actions
• T(s,a,s’) = P(s’|s,a) – the probability of transition from s to s’ given action a
• R(s,a) – the expected reward for taking action a in state s
(Figure: the agent–environment loop: the agent observes state and reward and emits an action; trajectory s0 –a0→ s1 –a1→ s2 –a2→ s3 with rewards r0, r1, r2.)

Fundamental in AI and ML 724


Model-based vs. model-free approaches
• But we don't know anything about the environment model: the transition function T(s,a,s’)
• Here come two approaches
• Model-based RL: learn the model, and use it to derive the optimal policy. e.g., the adaptive dynamic programming (ADP) approach
• Model-free RL: derive the optimal policy without learning the model. e.g., LMS and temporal difference approaches
• Which one is better?

Fundamental in AI and ML 725


Passive learning vs. active learning
• Passive learning
• The agent simply watches the world going by and tries to learn the utilities of being in various states
• Active learning
• The agent does not simply watch, but also acts

Fundamental in AI and ML 726


Example environment

Fundamental in AI and ML 727


Passive learning scenario
• The agent sees sequences of state transitions and associated rewards
• The environment generates state transitions and the agent perceives them
e.g. (1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3)[+1]
(1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (1,2) → (1,1) → (2,1) → (3,1) → (4,1) → (4,2)[-1]
• Key idea: update the utility value using the given training sequences.

Fundamental in AI and ML 728


Adaptive dynamic programming (ADP) in passive learning
• Different from LMS and TD methods (model-free approaches): ADP is a model-based approach!
• The updating rule for passive learning: U(s) ← R(s) + Σ_s’ T(s,s’) U(s’)
• However, in an unknown environment, T is not given; the agent must learn T itself from experience with the environment.
• How to learn T?

Fundamental in AI and ML 729


Active learning
• An active agent must consider
• what actions to take
• what their outcomes may be (both for learning and for receiving rewards in the long run)
• Update utility equation: U(s) ← R(s) + max_a Σ_s’ T(s,a,s’) U(s’)
• Rule to choose the action: a = argmax_a Σ_s’ T(s,a,s’) U(s’)

Fundamental in AI and ML 730


How to learn the model?
• Use the transition tuple <s, a, s’, r> to learn T(s,a,s’) and R(s,a). That's supervised learning!
• Since the agent gets every transition (s, a, s’, r) directly, take (s,a)/s’ as an input/output example of the transition probability function T.
• Different techniques from supervised learning apply (see further reading for detail)
• Use r and T(s,a,s’) to learn R(s,a)

Fundamental in AI and ML 731
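The counting scheme described above can be sketched as maximum-likelihood estimation from observed transitions; the class name and method names are made up for illustration:

```python
from collections import defaultdict

class ModelLearner:
    """Maximum-likelihood estimates of T(s,a,s') and R(s,a) from experience."""
    def __init__(self):
        self.n_sa = defaultdict(int)      # visits to (s, a)
        self.n_sas = defaultdict(int)     # visits to (s, a, s')
        self.r_sum = defaultdict(float)   # summed rewards for (s, a)

    def observe(self, s, a, s2, r):
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s2)] += 1
        self.r_sum[(s, a)] += r

    def T(self, s, a, s2):
        # fraction of times action a in state s led to s'
        return self.n_sas[(s, a, s2)] / self.n_sa[(s, a)]

    def R(self, s, a):
        # average observed reward for taking a in s
        return self.r_sum[(s, a)] / self.n_sa[(s, a)]
```

These estimates plug directly into the ADP updating rule in place of the unknown true T and R.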


ADP approach: pros and cons
• Pros:
• The ADP algorithm converges far faster than LMS and temporal-difference learning, because it uses the information from the model of the environment.
• Cons:
• Intractable for large state spaces
• In each step, U is updated for all states
• Improve this by prioritized sweeping (see further reading for detail)

Fundamental in AI and ML 732


Exploration problem in active learning
• An action has two kinds of outcome
• Gain rewards on the current experience tuple (s,a,s’)
• Affect the percepts received, and hence the ability of the agent to learn
• When choosing an action there is a tradeoff between
• its immediate good (reflected in the current utility estimates from what has been learned)
• its long-term good (exploring more of the environment helps the agent behave optimally in the long run)
• Two extreme approaches
• "wacky" approach: act randomly, in the hope of eventually exploring the entire environment
• "greedy" approach: act to maximize utility under the current model estimate (see Figure 20.10)
• Just like humans in the real world! People need to decide between
• continuing in a comfortable existence
• or striking out into the unknown in the hope of discovering a new and better life

Fundamental in AI and ML 733


Exploration problem in active learning
• One kind of solution: the agent should be more wacky when it has little idea of the environment, and more greedy when it has a model that is close to being correct
• In a given state, the agent should give some weight to actions that it has not tried very often, while tending to avoid actions that are believed to be of low utility
• Implemented by an exploration function f(u,n):
• assigns a higher utility estimate to relatively unexplored action-state pairs
• Change the updating rule of the value function to use U⁺, the optimistic estimate of the utility

Fundamental in AI and ML 734
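A minimal sketch of such an exploration function f(u, n); the optimistic reward and the trial threshold are illustrative constants, not values from the course:

```python
def explore(u, n, r_plus=2.0, n_min=5):
    """Optimistic exploration function f(u, n): pretend the utility is the
    best possible value r_plus until the state-action pair has been tried
    at least n_min times; afterwards trust the learned estimate u."""
    return r_plus if n < n_min else u

# explore(0.3, 2) == 2.0   (barely tried: be optimistic, i.e. wacky)
# explore(0.3, 9) == 0.3   (well explored: trust the estimate, i.e. greedy)
```

Plugging f into the value update makes under-explored actions look attractive exactly until they have been sampled enough, which is the "wacky when ignorant, greedy when informed" behavior the slide describes.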


Generalization in Reinforcement Learning
• So far we assumed that all the functions learned by the agent (U, T, R, Q) are in tabular form, i.e., it is possible to enumerate the state and action spaces.
• Use generalization techniques to deal with large state or action spaces.
• Function approximation techniques

Fundamental in AI and ML 735


Thank you!
