
CSA2001

FUNDAMENTALS IN AI AND ML
On the Verge of Major Breakthroughs

Artificial Intelligence (AI) has been moving extremely quickly in the last few years, demonstrating the potential to revolutionize every aspect of our lives:

Work and the economy
Medicine
Mobility

Fundamental in AI and ML 2
Applications of AI

But, What is AI ?
AI can be broadly defined as technology that can learn and
produce intelligent behavior

Input: pixels → an AI process → Output: “Tuberculosis”

(Computer Vision)

But, What is AI ?
AI can be broadly defined as technology that can learn and
produce intelligent behavior

Input: pixels → an AI process → Output: “Four kids are playing with a ball”

More than just a category about the image!

(Computer Vision)

Applications of AI
AI can be broadly defined as technology that can learn and
produce intelligent behavior

Input: audio clip → an AI process → Output: “I feel some eye pain”

(Speech Recognition)

Artificial Intelligence

Artificial Intelligence

Deep Learning (DL) is a subset of Machine Learning (ML); ML is a subset of Artificial Intelligence (AI); and AI sits within Computer Science, which in turn draws on mathematics, physics, chemistry, and biology.
AI in real-time systems
• Search engines (such as Google Search),
• Recommendation systems (offered by Netflix, YouTube or Amazon),
• Driving internet traffic,
• Targeted advertising (AdSense, Facebook),
• Virtual assistants (such as Siri or Alexa),
• Autonomous vehicles (including drones and self-driving cars),
• Automatic language translation (Microsoft Translator, Google Translate),
• Facial recognition (Apple's Face ID or Microsoft's DeepFace),
• Image labeling (used by Facebook, Apple's iPhoto and TikTok), and
• Spam filtering.

Agents in Artificial Intelligence
AI can be defined as the study of rational agents and their environments. Agents sense the environment through sensors and act on their environment through actuators. An AI agent can have mental properties such as knowledge, beliefs, and intentions.

What is an Agent?
An agent is anything that perceives its environment through sensors and acts upon that environment through actuators. An agent runs in a cycle of perceiving, thinking, and acting. An agent can be:
Human agent: eyes, ears, and other organs serve as sensors; hands, legs, and the vocal tract serve as actuators.
Robotic agent: cameras and infrared range finders serve as sensors; various motors serve as actuators.
Software agent: keystrokes and file contents serve as sensory input; it acts on those inputs and displays output on the screen.
Agents in Artificial Intelligence
Sensors: A sensor is a device that detects changes in the environment and sends the information to other electronic devices. An agent observes its environment through sensors.
Actuators: Actuators are the components of a machine that convert energy into motion. Actuators are responsible for moving and controlling a system; examples include electric motors, gears, and rails.

Agents in Artificial Intelligence

Effectors: Effectors are the devices that affect the environment, e.g., legs, wheels, arms, fingers, wings, fins, and a display screen.

Intelligent Agent

An intelligent agent is an autonomous entity that acts upon an environment using sensors and actuators to achieve goals. An intelligent agent may learn from the environment to achieve its goals. A thermostat is an example of an intelligent agent.
Following are the four main rules for an AI agent:
Rule 1: An AI agent must have the ability to perceive the environment.
Rule 2: The observations must be used to make decisions.
Rule 3: Decisions should result in an action.
Rule 4: The action taken by an AI agent must be a rational action.


Rational Agent
• A rational agent is an agent that has clear preferences, models uncertainty, and acts in a way that maximizes its performance measure over all possible actions.

• A rational agent is said to do the right thing. AI is about creating rational agents, drawing on game theory and decision theory for various real-world scenarios.

• For an AI agent, rational action is most important: in reinforcement learning, the agent receives a positive reward for each best possible action and a negative reward for each wrong action.
Rationality
The rationality of an agent is measured by its performance measure. Rationality can be judged on the basis of the following points:

The performance measure, which defines the success criterion.
The agent's prior knowledge of its environment.
The best possible actions that the agent can perform.
The sequence of percepts.
Structure of an AI Agent
The task of AI is to design an agent program that implements the agent function. The structure of an intelligent agent is a combination of architecture and agent program. It can be viewed as:

Agent = Architecture + Agent program

Architecture: the machinery that the AI agent executes on.

Agent function: maps a percept sequence to an action:

f : P* → A

Agent program: an implementation of the agent function. The agent program executes on the physical architecture to produce the function f.
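The Agent = Architecture + Agent program split can be sketched in code. This is a minimal, hypothetical Python sketch; the class, the reflex_program rule, and the percept strings are illustrative, not from the slides:

```python
class Agent:
    """Agent = architecture + program: the program is the agent
    function f: P* -> A mapping percept sequences to actions; the
    architecture feeds it percepts and executes the chosen actions."""

    def __init__(self, program):
        self.program = program     # the agent function f
        self.percepts = []         # the percept sequence P*

    def step(self, percept):
        self.percepts.append(percept)
        return self.program(self.percepts)

# A toy agent program: act only on the most recent percept.
def reflex_program(percepts):
    return "brake" if percepts[-1] == "red light" else "drive"

agent = Agent(reflex_program)
print(agent.step("green light"))  # drive
print(agent.step("red light"))    # brake
```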
PEAS Representation

PEAS is a model used to describe an AI agent. When we define an AI agent or rational agent, we can group its properties under the PEAS representation model. It is made up of four terms:

P: Performance measure

E: Environment

A: Actuators

S: Sensors

Here the performance measure is the objective criterion for the success of an agent's behavior.
PEAS for self-driving cars:
Let's consider a self-driving car; its PEAS representation will be:
Performance: safety, time, legal driving, comfort
Environment: roads, other vehicles, road signs, pedestrians
Actuators: steering, accelerator, brake, signal, horn
Sensors: camera, GPS, speedometer, odometer, accelerometer, sonar
Example of Agents with their PEAS representation

Agent: Medical diagnosis
Performance measure: healthy patient, minimized cost
Environment: hospital, doctors, patients
Actuators: prescription, diagnosis, scan report
Sensors: symptoms, patient's response
Example of Agents with their PEAS representation

Agent: Subject tutoring
Performance measure: maximize scores
Environment: classroom, desk, chair, board, staff, students
Actuators: smart displays, corrections
Sensors: eyes, ears, notebooks
Example of Agents with their PEAS representation

Agent: Vacuum cleaner
Performance measure: cleanness, efficiency, battery life, security
Environment: room, table, wood floor, carpet, various obstacles
Actuators: wheels, brushes, vacuum extractor
Sensors: camera, dirt-detection sensor, cliff sensor, bump sensor, infrared wall sensor
Object Detection
Activity Recognition
Semantic Segmentation
Disease Detection
Image Colorization
Style Transfer
Lip Sync
Image-to-Image Translation

Why interest in AI?
Agents

Definition: An agent perceives its environment via sensors and acts upon that environment through its actuators.
Agent / Robot: e.g., the vacuum-cleaner world
(iRobot Corporation; founder Rodney Brooks, MIT)

Percepts: location and contents, e.g., [A, Dirty]
Actions: Left, Right, Suck, NoOp
Rational Agents

An agent should strive to "do the right thing", based on:
• what it can perceive, and
• the actions it can perform.

The right action is the one that will cause the agent to be most successful.
Rational Agents

Performance measure: an objective criterion for success of an agent's behavior.

Performance measures of a vacuum-cleaner agent: amount of dirt cleaned up, amount of time taken, amount of electricity consumed, level of noise generated, etc.

Performance measures of a self-driving car: time to reach destination (minimize), safety, predictability of behavior for other agents, reliability, etc.

Performance measure of a game-playing agent: win/loss percentage (maximize), robustness, unpredictability (to "confuse" the opponent), etc.
Rational Agents
For each possible percept sequence, a rational agent should select an action that maximizes its performance measure (in expectation), given the evidence provided by the percept sequence and whatever built-in knowledge the agent has.

Why "in expectation"?
This captures actions with stochastic or uncertain effects, or actions performed in stochastic environments: we can then look at the expected value of an action.
In high-risk settings, we may also want to limit worst-case behavior.
Rational Agents
Notes:
• Rationality is distinct from omniscience ("all-knowing"). We can behave rationally even when faced with incomplete information.
• Agents can perform actions in order to modify future percepts so as to obtain useful information: information gathering, exploration.
• An agent is autonomous if its behavior is determined by its own experience (with the ability to learn and adapt).
Characterizing a Task Environment

We must first specify the setting for intelligent agent design.

PEAS: Performance measure, Environment, Actuators, Sensors

Example: the task of designing a self-driving car
• Performance measure: safe, fast, legal, comfortable trip
• Environment: roads, other traffic, pedestrians
• Actuators: steering wheel, accelerator, brake, signal, horn
• Sensors: cameras, LIDAR (light/radar), speedometer, GPS, odometer, engine sensors, keyboard
PEAS Examples

Task Environments

1) Fully observable / Partially observable
If an agent's sensors give it access to the complete state of the environment needed to choose an action, the environment is fully observable.
(e.g., chess; what about Kriegspiel?)
Task Environments

Making things a bit more challenging: in Kriegspiel, you can't see your opponent!

• Incomplete / uncertain information is inherent in the game.
• Balance exploitation (the best move given current knowledge) and exploration (moves to explore where the opponent's pieces might be).
• Use probabilistic reasoning techniques.
Task Environments

2) Deterministic / Stochastic
○ An environment is deterministic if the next state of the environment
is completely determined by the current state of the environment and
the action of the agent;
○ In a stochastic environment, there are multiple, unpredictable
outcomes. (If the environment is deterministic except for the actions
of other agents, then the environment is strategic).
In a fully observable, deterministic environment, the agent need not
deal with uncertainty.
Note: Uncertainty can also arise because of computational
limitations. E.g., we may be playing an omniscient (“all knowing”)
opponent but we may not be able to compute his/her moves.
Task Environments

3) Episodic / Sequential

○ In an episodic environment, the agent’s experience is divided into


atomic episodes. Each episode consists of the agent perceiving and
then performing a single action.

○ Subsequent episodes do not depend on what actions occurred in


previous episodes. Choice of action in each episode depends only on
the episode itself. (E.g., classifying images.)

○ In a sequential environment, the agent engages in a series of


connected episodes. Current decision can affect future decisions.
(E.g., chess and driving.)
Task Environments

4) Static / Dynamic

A static environment does not change while the agent is thinking: the passage of time as the agent deliberates is irrelevant.

The environment is semidynamic if the environment itself does not change with the passage of time, but the agent's performance score does.
Task Environments

5) Discrete / Continuous
○ If the number of distinct percepts and actions is limited, the environment is
discrete, otherwise it is continuous.

6) Single agent / Multi-agent


○ If the environment contains other intelligent agents, the agent needs to be
concerned about strategic, game-theoretic aspects of the environment (for
either cooperative or competitive agents).
○ Most engineering environments don’t have multi-agent properties, whereas
most social and economic systems get their complexity from the interactions
of (more or less) rational agents.
Example Tasks and Environment Types

How to make the right decisions? Decision theory


Task Environment

Exercise on Task environment

Agents and Environment

The agent function f maps from percept histories to actions.
The agent program runs (internally) on the physical architecture to produce f.

agent = architecture + program
I) Table-lookup driven agents

Uses a percept-sequence/action table in memory to find the next action. Implemented as a (large) lookup table.

• Drawbacks:
– Huge table (often simply too large)
– Takes a long time to build/learn the table
I) Table-lookup driven agents
Toy example: the vacuum world.
Percepts: the robot senses its location and its "cleanliness": location and contents, e.g., [A, Dirty], [B, Clean]. With 2 locations, we get 4 different possible sensor inputs.
Actions: Left, Right, Suck, NoOp
Table driven agent

Table Lookup
An action sequence of length K gives 4^K different possible percept sequences, so at least that many entries are needed in the table. Even in this very toy world, with K = 20, you need a table with over 4^20 > 10^12 entries.

In more realistic scenarios there are many more distinct percepts (e.g., many more locations, say >= 100). There will then be 100^K different possible sequences of length K. For K = 20, this would require a table with 100^20 = 10^40 entries: infeasible to even store.
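These counts are easy to verify with a small Python check (table_size is a helper name introduced here, not from the slides):

```python
# Number of distinct percept sequences of length K, given p distinct
# percepts per step: a lookup table needs p**K entries.
def table_size(p, K):
    return p ** K

print(table_size(4, 20))    # vacuum world: 4^20, just over 10^12
print(table_size(100, 20))  # 100 percepts: 100^20 = 10^40
```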
Table Lookup
The table-lookup formulation is mainly of theoretical interest. For practical agent systems, we need much more compact representations: for example, logic-based representations, Bayesian network representations, or neural-network-style representations; or we can use a different agent architecture that "ignores the past": reflex agents.
II) Simple reflex agents

Simple reflex agents do not have memory of past world states or percepts. Actions depend solely on the current percept, so the action becomes a "reflex." They use condition-action rules.
II) Simple reflex agents

The agent selects actions on the basis of the current percept only. Example: if the tail-light of the car in front is red, then brake.
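A condition-action rule agent of this kind can be sketched as follows; the rule table and percept format are hypothetical illustrations:

```python
# A simple reflex agent: a list of (condition, action) rules applied
# to the current percept only -- no memory of past percepts.
RULES = [
    (lambda p: p["tail_light_ahead"] == "red", "brake"),
    (lambda p: p["tail_light_ahead"] == "off", "drive"),
]

def simple_reflex_agent(percept):
    for condition, action in RULES:
        if condition(percept):
            return action
    return "noop"  # no rule matched

print(simple_reflex_agent({"tail_light_ahead": "red"}))  # brake
```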
II) Simple reflex agents

II) Simple reflex agents
Closely related to “behaviorism” (psychology; quite effective in explaining
lower-level animal behaviors, such as the behavior of ants and mice).
The Roomba robot largely behaves like this. Behaviors are robust and can be
quite effective and surprisingly complex.

But how does complex behavior arise from simple reflex behavior? E.g., ant colonies and beehives are quite complex.

Simple rules in a diverse environment can give rise to surprising complexity.

See A-life work (artificial life) community, and Wolfram’s cellular automata.
III) --- Model-based reflex agents

Key differences (w.r.t. simple reflex agents):
○ Agents have internal state, which is used to keep track of past states of the world.
○ Agents have the ability to represent change in the world.

Example: Rodney Brooks' Subsumption Architecture (behavior-based robots).
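The internal-state idea can be sketched like this; the percept strings and the inferred fact are hypothetical illustrations:

```python
# A model-based reflex agent keeps internal state updated from the
# percept stream, then applies rules to that state, so it can react
# to facts it inferred earlier even when the current percept is bland.
class ModelBasedAgent:
    def __init__(self):
        self.state = {"dangerous_driver_ahead": False}

    def update_state(self, percept):
        # Infer a persistent fact about the world from the percept.
        if percept == "car ahead swerving":
            self.state["dangerous_driver_ahead"] = True

    def act(self, percept):
        self.update_state(percept)
        if self.state["dangerous_driver_ahead"]:
            return "keep distance"
        return "drive normally"

agent = ModelBasedAgent()
print(agent.act("clear road"))          # drive normally
print(agent.act("car ahead swerving"))  # keep distance
print(agent.act("clear road"))          # keep distance (state remembers)
```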
III) --- Model-based reflex agents
Module: Logical Agents (Representation and Reasoning; Part III/IV of R&N). How detailed should the model be?
Example: the agent infers "potentially dangerous driver in front"; rule: if "dangerous driver in front," then "keep distance."
III) --- Model-based agents

III) --- Model-based agents An example: Brooks’
Subsumption Architecture
Main idea: build complex, intelligent robots by decomposing behaviors
into a hierarchy of skills, each defining a percept-action cycle for one
very specific task.
Examples: collision avoidance, wandering, exploring, recognizing
doorways, etc.
Each behavior is modeled by a finite-state machine with a few states
(though each state may correspond to a complex function or module;
provides internal state to the agent).
Behaviors are loosely coupled via asynchronous interactions.
Note: minimal internal state representation. p. 1003 R&N
III) --- Model-based agents An example: Brooks’
Subsumption Architecture
In subsumption architecture, increasingly complex behaviors arise from
the combination of simple behaviors.

The most basic simple behaviors are on the level of reflexes:
• avoid an object;
• go toward food if hungry;
• move randomly.

A more complex behavior that sits on top of the simple behaviors might be "go across the room." The more complex behaviors subsume the less complex ones to accomplish their goal.
How much of an internal model of the world?
Planning in and reasoning about our surroundings appears to require
some kind of internal representation of our world.
We can "try" things out in this representation, much like running a "simulation" of the effect of an action or a sequence of actions in our head.
General assumption for many years: the more detailed the internal model, the better.
How much of an internal model of the world?
Brooks (mid 80s and 90s) challenged this view:
The philosophy behind Subsumption Architecture is that the world
should be used as its own model. According to Brooks, storing models of
the world is dangerous in dynamic, unpredictable environments because
representations might be incorrect or outdated. What is needed is the
ability to react quickly to the present. So, use minimal internal state
representation, complement at each time step with sensor input.
Debate continues to this day: How much of our world do we (should we)
represent explicitly? Subsumption architecture worked well in robotics.
IV) Goal-based agents
Key difference w.r.t. model-based agents: in addition to state information, goal-based agents have goal information that describes desirable situations to be achieved.

Agents of this kind take future events into consideration: what sequence of actions can I take to achieve certain goals?

Choose actions so as to (eventually) achieve a (given or computed) goal: problem solving and search! (R&N, Part II, chapters 3 to 6)
IV) Goal-based agents
Module: Problem Solving

Example: "clean kitchen"; the agent considers the "future". The agent keeps track of the world state as well as the set of goals it is trying to achieve, and chooses actions that will (eventually) lead to the goal(s).
More flexible than reflex agents; may involve search and planning.
V) Utility-based agents

When there are multiple possible alternatives, how to decide which one is
best?
Goals are qualitative: A goal specifies a crude distinction between a happy
and unhappy state, but often need a more general performance measure
that describes “degree of happiness.”
Utility function U: State → R indicating a measure of success or happiness
when at a given state.
Important for making tradeoffs: Allows decisions comparing choice
between conflicting goals, and choice between likelihood of success and
importance of goal (if achievement
Fundamentalis
in AI uncertain).
and ML 70
Use decision theoretic models: e.g., faster vs. safer.
V) Utility-based agents
Module: Decision Making

Decision-theoretic actions: e.g., faster vs. safer.
VI) --- Learning agents
Learning agents adapt and improve over time. Things get more complicated when the agent needs to learn utility information: reinforcement learning (based on action payoff).
Module: Learning

Figure labels: "Quick turn is not safe" → no quick turn; road conditions, etc.; takes percepts and selects actions; try out the brakes on different road surfaces.
Summary: Agent Types
(1) Table-driven agents
○ use a percept sequence/action table in memory to find the next
action. They are implemented by a (large) lookup table.
(2) Simple reflex agents
○ are based on condition-action rules, implemented with an
appropriate production system. They are stateless devices which do
not have memory of past world states.
(3) Agents with memory - Model-based reflex agents
○ have internal state, which is used to keep track of past states of the
world.
Summary: Agent Types
(4) Agents with goals – Goal-based agents
○ are agents that, in addition to state information, have goal
information that describes desirable situations. Agents of this kind
take future events into consideration.
(5) Utility-based agents
○ base their decisions on classic axiomatic utility theory in order to
act rationally.
(6) Learning agents
○ they have the ability to improve performance through learning.
Summary: Agent Types
● An agent perceives and acts in an environment, has an architecture,
and is implemented by an agent program.
● A rational agent always chooses the action which maximizes its
expected performance, given its percept sequence so far.
● An autonomous agent uses its own experience rather than built-in
knowledge of the environment by the designer.

Summary: Agent Types
● An agent program maps from percept to action and updates its internal
state.
○ Reflex agents (simple / model-based) respond immediately to percepts.
○ Goal-based agents act in order to achieve their goal(s), possible
sequence of steps.
○ Utility-based agents maximize their own utility function.
○ Learning agents improve their performance through learning.
● Representing knowledge is important for successful agent design.
● The most challenging environments are partially observable, stochastic,
sequential, dynamic, and continuous, and contain multiple intelligent
agents.
Searching for a (shortest / least cost) path to goal
state(s)

Search through the state space. We will consider search techniques that use an explicit search tree that is generated by the initial state + successor function.

initialize (initial node)
loop:
    choose a node for expansion according to strategy
    goal node? done
    expand node with successor function
Tree-search algorithms
Basic idea:
○ simulated exploration of the state space by generating successors of already-explored states (a.k.a. "expanding" states)

Notes:
1) We only check whether a node is a goal state after we select that node for expansion.
2) A "node" is a data structure containing the state plus additional info (parent node, etc.).
Tree-search algorithm - Example

Node selected for expansion.
Nodes added to tree.
Selected for expansion; added to tree.

Note: Arad is added (again) to the tree, since it is reachable from Sibiu. This is not necessarily a problem, but in Graph-Search we will avoid it by maintaining an "explored" list.
Graph-search

Note:
1) Uses an "explored" set to avoid visiting already-explored states.
2) Uses a "frontier" set to store states that remain to be explored and expanded.
3) However, with e.g. uniform-cost search, we need to make a special check when a node (i.e., a state) is already on the frontier. Details later.
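The Graph-Search idea (frontier queue plus explored set) can be sketched in Python; the example graph is hypothetical:

```python
from collections import deque

# Graph-Search sketch: breadth-first traversal with an explored set,
# so already-visited states are never expanded twice.
def graph_search(graph, start, goal):
    frontier = deque([[start]])       # queue of paths
    explored = set()
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        if node in explored:
            continue
        explored.add(node)
        for neighbor in graph.get(node, []):
            if neighbor not in explored:
                frontier.append(path + [neighbor])
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(graph_search(graph, "A", "D"))  # ['A', 'B', 'D']
```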
Implementation: states vs. nodes

A state is a representation of a physical configuration.
A node is a data structure constituting part of a search tree; it includes the state, the tree parent node, the action (applied to the parent), the path cost g(x) (from the initial state to the node), and the depth.

The fringe is the collection of nodes that have been generated but not (yet) expanded. Each node of the fringe is a leaf node.
The Expand function creates new nodes, filling in the various fields and using the SuccessorFn of the problem to create the corresponding states.

Implementation: General Tree Search
Search Strategies
A search strategy is defined by picking the order of node expansion.

Strategies are evaluated along the following dimensions:


○ completeness: does it always find a solution if one exists?
○ time complexity: number of nodes generated
○ space complexity: maximum number of nodes in memory
○ optimality: does it always find a least-cost solution?

Time and space complexity are measured in terms of


○ b: maximum branching factor of the search tree
○ d: depth of the least-cost solution
○ m: maximum depth of the state space (may be ∞)

Uninformed Search Strategies
Uninformed (blind) search strategies use only the information available
in the problem definition:
○ Breadth-first search
○ Uniform-cost search
○ Depth-first search
○ Depth-limited search
○ Iterative deepening search
○ Bidirectional search

Key issue: type of queue used for the fringe of the search tree (collection
of tree nodes that have been generated but not yet expanded)
Breadth-First Search

Expand the shallowest unexpanded node.
Implementation: the fringe is a FIFO (First-In-First-Out) queue, i.e., new nodes go at the end.

Fringe queue: <A>. Select A from the queue and expand; this gives <B, C>.
Queue: <B, C>. Select B from the front and expand, putting its children at the end; this gives <C, D, E>.
Fringe queue: <C, D, E>. Expanding C gives <D, E, F, G>.
Assuming no further children, the queue becomes <E, F, G>, then <F, G>, <G>, <>. Each time, the dequeued node is checked for the goal state.
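The FIFO-queue trace above can be sketched as code; the example tree and helper names are illustrative:

```python
from collections import deque

# Breadth-first search over a tree given as a successor dict: the
# fringe is a FIFO queue, and the goal test happens at expansion.
def breadth_first_search(start, successors, is_goal):
    fringe = deque([start])
    order = []                      # expansion order, for illustration
    while fringe:
        node = fringe.popleft()     # shallowest unexpanded node
        order.append(node)
        if is_goal(node):
            return node, order
        fringe.extend(successors.get(node, []))
    return None, order

tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
goal, order = breadth_first_search("A", tree, lambda n: n == "F")
print(goal, order)  # F ['A', 'B', 'C', 'D', 'E', 'F']
```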
Properties of breadth-first search

Complete? Yes (if b is finite).
Time? 1 + b + b^2 + b^3 + ... + b^d + b(b^d - 1) = O(b^(d+1))
Space? O(b^(d+1)) (keeps every node in memory; also needed to reconstruct the solution path)
Optimal solution found? Yes (if all step costs are identical).
Space is the bigger problem (more than time).
• b: maximum branching factor of the search tree
• d: depth of the least-cost solution
Note: the goal check happens only when a node is expanded.
Time and Space requirement for Breadth First
Search with bf = 10

Uniform Cost Search
Expand the least-cost (path-to) unexpanded node (e.g., useful for finding the shortest path on a map).
Implementation: the fringe is a queue ordered by path cost g, the cost of reaching a node.

Complete? Yes, if step cost ≥ ε (> 0).
Time? Number of nodes with g ≤ cost of the optimal solution C*: O(b^(1+⌊C*/ε⌋)).
Space? Number of nodes with g ≤ cost of the optimal solution: O(b^(1+⌊C*/ε⌋)).
Optimal? Yes: nodes are expanded in increasing order of g(n).
Note: there are some subtleties (e.g., checking for the goal state). See p. 84 of R&N and the next slide.
Uniform Cost Search

Two subtleties (bottom of p. 83, R&N):

1) Do the goal-state test only when a node is selected for expansion. (Reason: Bucharest may occur on the frontier with a longer-than-optimal path. It won't be selected for expansion yet; other nodes will be expanded first, leading us to uncover a shorter path to Bucharest.)

2) The graph-search algorithm says "don't add a child node to the frontier if it is already on the explored list or already on the frontier." BUT the child may give a shorter path to a state already on the frontier. In that case, we need to replace the existing frontier node with the shorter path.
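A uniform-cost search sketch that respects both subtleties (goal test at expansion; cheaper paths supersede frontier entries), on a hypothetical weighted graph:

```python
import heapq

# Uniform-cost search over {node: [(neighbor, step_cost), ...]}.
# The goal test happens when a node is selected for expansion; a
# shorter path found later supersedes the frontier entry, because
# stale, costlier entries are skipped when popped.
def uniform_cost_search(graph, start, goal):
    frontier = [(0, start, [start])]      # (g, state, path)
    best_g = {start: 0}
    while frontier:
        g, state, path = heapq.heappop(frontier)
        if g > best_g.get(state, float("inf")):
            continue                      # stale entry: a cheaper path won
        if state == goal:                 # goal test at expansion
            return g, path
        for neighbor, cost in graph.get(state, []):
            new_g = g + cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                heapq.heappush(frontier, (new_g, neighbor, path + [neighbor]))
    return None

graph = {
    "S": [("A", 1), ("B", 5)],
    "A": [("B", 1)],
    "B": [("G", 1)],
}
print(uniform_cost_search(graph, "S", "G"))  # (3, ['S', 'A', 'B', 'G'])
```

Note how the direct edge S→B with cost 5 is superseded by the cheaper path S→A→B with cost 2, exactly the situation subtlety 2 describes.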
Depth-First Search
"Expand the deepest unexpanded node."
Implementation:
○ fringe = LIFO (Last-In-First-Out) stack, i.e., put successors at the front ("push on stack").

Fringe stack: <A>. Expanding A gives the stack <B, C>; so B is next.
Expanding B gives the stack <D, E, C>; so D is next.
Expanding D gives the stack <H, I, E, C>; so H is next, etc.
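The LIFO-stack trace above can be sketched as code; the example tree is hypothetical:

```python
# Depth-first search sketch: the fringe is a LIFO stack, so the
# deepest unexpanded node is always expanded first.
def depth_first_search(start, successors, is_goal):
    fringe = [start]                 # stack
    order = []                       # expansion order, for illustration
    while fringe:
        node = fringe.pop()          # deepest unexpanded node
        order.append(node)
        if is_goal(node):
            return node, order
        # Push children reversed so the leftmost child is expanded first.
        fringe.extend(reversed(successors.get(node, [])))
    return None, order

tree = {"A": ["B", "C"], "B": ["D", "E"], "D": ["H", "I"]}
goal, order = depth_first_search("A", tree, lambda n: n == "E")
print(goal, order)  # E ['A', 'B', 'D', 'H', 'I', 'E']
```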


Properties of Depth-First Search

Complete? No: fails in infinite-depth spaces and in spaces with loops. Modified to avoid repeated states along the current path, it is complete in finite spaces.
Time? O(b^m): bad if m is much larger than d, but if solutions are dense it may be much faster than breadth-first.
Space? O(bm), i.e., linear space! (The solution path can also be reconstructed from the single stored branch.)
Guarantee that the optimal solution is found? No.
• b: maximum branching factor of the search tree
• d: depth of the shallowest (least-cost) solution
• m: maximum depth of the state space

Depth-Limited Search


Complexity Analysis (depth limit l)

Completeness: if l < d, incomplete
Time complexity: O(b^l)
Space complexity: O(bl)
Optimality: if l > d, non-optimal


Iterative deepening search

Run depth-limited search with increasing depth limits l = 0, 1, 2, 3, ...

Why would one do that?


Why would one do that?

It combines the good memory requirements of depth-first search with the completeness of breadth-first search when the branching factor is finite, and it is optimal when the path cost is a non-decreasing function of the depth of the node.

The idea was a breakthrough in game playing: all game-tree search uses iterative deepening nowadays. What's the added advantage in games? Its "anytime" nature.
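Iterative deepening can be sketched as repeated depth-limited search; the tree and function names are illustrative:

```python
# Depth-limited DFS returning the solution path (or None).
def depth_limited(node, successors, is_goal, limit):
    if is_goal(node):
        return [node]
    if limit == 0:
        return None
    for child in successors.get(node, []):
        path = depth_limited(child, successors, is_goal, limit - 1)
        if path is not None:
            return [node] + path
    return None

# Iterative deepening: retry with limits 0, 1, 2, ... so the
# shallowest solution is found first with only linear space.
def iterative_deepening_search(start, successors, is_goal, max_depth=50):
    for limit in range(max_depth + 1):
        path = depth_limited(start, successors, is_goal, limit)
        if path is not None:
            return path
    return None

tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
print(iterative_deepening_search("A", tree, lambda n: n == "F"))
# ['A', 'C', 'F']
```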
Iterative deepening search
Iterative deepening is the preferred uninformed search method when there is a large search space and the depth of the solution is not known.
Number of nodes generated in an iterative deepening search to depth d with branching factor b (looks quite wasteful; is it?):

NIDS = (d+1)b^0 + d b^1 + (d-1)b^2 + ... + 3b^(d-2) + 2b^(d-1) + 1b^d

Nodes generated in a breadth-first search with branching factor b:

NBFS = b^1 + b^2 + ... + b^(d-2) + b^(d-1) + b^d

For b = 10, d = 5:
○ NBFS = 10 + 100 + 1,000 + 10,000 + 100,000 = 111,110
○ NIDS = 6 + 50 + 400 + 3,000 + 20,000 + 100,000 = 123,456
Iterative deepening search

Fundamental in AI and ML 123


Iterative deepening search

Complete? Yes (if b is finite)

Time? d·b^1 + (d-1)·b^2 + … + b^d = O(b^d)

Space? O(bd)

Optimal? Yes, if step costs are identical
Fundamental in AI and ML 124
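The IDS loop from the slides, sketched in Python: restart a depth-limited DFS with limit 0, 1, 2, …, so the shallowest solution is found with only linear space. The graph is the same hypothetical example used above, defined here again so the sketch is self-contained.

```python
def iterative_deepening_search(graph, start, goal, max_depth=50):
    """Run depth-limited DFS with increasing limits until the goal is found."""
    def dls(node, limit, path):
        if node == goal:
            return path
        if limit == 0:
            return None
        for child in graph.get(node, []):
            if child not in path:              # avoid cycles along the path
                found = dls(child, limit - 1, path + [child])
                if found:
                    return found
        return None

    for limit in range(max_depth + 1):         # depth 0, 1, 2, ... (re-expands
        result = dls(start, limit, [start])    # shallow nodes each iteration)
        if result:
            return result
    return None

graph = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F'], 'E': ['G'], 'F': ['G']}
print(iterative_deepening_search(graph, 'A', 'G'))   # ['A', 'B', 'E', 'G']
```

Because the deepest level dominates the node count, the repeated shallow work costs little, which is exactly the 123,456 vs. 111,110 comparison above.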


Bidirectional Search
• Simultaneously:
○ Search forward from start
○ Search backward from the goal
Stop when the two searches meet.

• If branching factor = b in each direction, with solution at depth d:
  only b^(d/2) + b^(d/2) = O(b^(d/2)) nodes are generated, far fewer than
  the O(b^d) of a single search
Aside: The predecessor of a node should be easily computable (i.e.,


actions are easily reversible).
Fundamental in AI and ML 125
Bidirectional Search

• Checking a node for membership in the other search tree can be done
in constant time (hash table)

• Key limitations:
  Space: O(b^(d/2))
  Also, how to search backwards can be an issue (e.g., in Chess). What's
  tricky? Problem: lots of states satisfy the goal; we don't know which one
  is relevant.
Fundamental in AI and ML 126
Repeated States
Failure to detect repeated states can turn a linear problem into an
exponential one!

Don't return to the parent node:
  don't generate a successor equal to the node's parent.
Don't allow cycles:
  don't revisit a state on the current path.
Don't revisit any state:
  keep every visited state in memory! O(b^d) (can be expensive)

Arises in problems where actions are reversible (e.g., routing problems or
sliding-block puzzles). Also in, e.g., Chess: hash tables are used to check for
repeated states; huge tables (100M+ entries) but very useful.
See Tree-Search vs. Graph-Search. Need to be careful to maintain (path)
optimality and completeness. Fundamental in AI and ML 127
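The strictest option above, keeping every visited state in memory, is exactly graph search. A minimal BFS sketch with an explored set (the adjacency dict is an illustrative example, not from the slides):

```python
from collections import deque

def graph_search_bfs(graph, start, goal):
    """BFS that remembers every visited state, so no state is expanded twice."""
    frontier = deque([[start]])
    explored = {start}                  # the O(b^d) "keep every state" memory
    while frontier:
        path = frontier.popleft()       # FIFO: shallowest node first
        node = path[-1]
        if node == goal:
            return path
        for child in graph.get(node, []):
            if child not in explored:   # blocks parent-returns, cycles, repeats
                explored.add(child)
                frontier.append(path + [child])
    return None

graph = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F'], 'E': ['G'], 'F': ['G']}
print(graph_search_bfs(graph, 'A', 'G'))   # ['A', 'B', 'E', 'G']
```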
Bidirectional Search - Example
In the below search tree, bidirectional
search algorithm is applied. This
algorithm divides one graph/tree into two
sub-graphs. It starts traversing from node
1 in the forward direction and starts from
goal node 16 in the backward direction.

The algorithm terminates at node 9 where


two searches meet.

Fundamental in AI and ML 128


Bidirectional Search
Completeness: Bidirectional search is complete if we use BFS in both
directions.

Time Complexity: Time complexity of bidirectional search using BFS
is O(b^(d/2)).

Space Complexity: Space complexity of bidirectional search is O(b^(d/2)).

Optimal: Bidirectional search is optimal (using BFS with unit step costs).

Fundamental in AI and ML 129
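The two-frontier idea can be sketched as below. To stay short, this illustrative version returns only the meeting state (a full implementation would follow the two parent maps outward to reconstruct the path); the undirected example graph is made up, and reversibility of actions is assumed, as the slides note.

```python
from collections import deque

def bidirectional_search(graph, start, goal):
    """Alternate one BFS step from each end; stop where the frontiers meet."""
    if start == goal:
        return [start]
    # Each parents dict records which states that direction has reached.
    fwd_parents, bwd_parents = {start: None}, {goal: None}
    fwd, bwd = deque([start]), deque([goal])
    while fwd and bwd:
        for frontier, parents, others in ((fwd, fwd_parents, bwd_parents),
                                          (bwd, bwd_parents, fwd_parents)):
            if not frontier:
                return None
            node = frontier.popleft()
            for child in graph.get(node, []):
                if child not in parents:
                    parents[child] = node
                    if child in others:      # the two searches meet here
                        return child
                    frontier.append(child)
    return None

# Hypothetical undirected graph: A-B, A-C, B-D, C-D, D-E.
graph = {'A': ['B', 'C'], 'B': ['A', 'D'], 'C': ['A', 'D'],
         'D': ['B', 'C', 'E'], 'E': ['D']}
print(bidirectional_search(graph, 'A', 'E'))   # 'D'
```

The membership test `child in others` is the constant-time hash-table check mentioned on the previous slide.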


Summary: General Uninformed Search
● Original search ideas in AI were inspired by studies of human problem
solving in, e.g., puzzles, math, and games, but a great many AI tasks now
require some form of search (e.g. find optimal agent strategy; active
learning; constraint reasoning; NP-complete problems require search).
● Problem formulation usually requires abstracting away real-world
details to define a state space that can feasibly be explored.
● Variety of uninformed search strategies
● Iterative deepening search uses only linear space and not much more
time than other uninformed algorithms.
● Avoid repeating states / cycles.
Fundamental in AI and ML 130
Searching with Partial Observations

Fundamental in AI and ML 131


Conformant (Sensorless) search: Example Space

Fundamental in AI and ML 132


Conformant (Sensorless) search: Example Space

Fundamental in AI and ML 133


Searching with no observations

Fundamental in AI and ML 134


Searching with observations

Fundamental in AI and ML 135


Searching with observations

Fundamental in AI and ML 136


Constraint Satisfaction Problems
Constraint Satisfaction

• It is a search procedure that operates in a space of constraint sets


• Constraint satisfaction problems in AI have the goal of discovering some
problem state that satisfies a given set of constraints
Process:

• Constraints are discovered and propagated throughout the system

• If there is still no solution, search begins
Fundamental in AI and ML 137
Constraint Satisfaction Problems
•Constraint Satisfaction Problems:
•A CSP consists of three components: V, D, C
•V: the variable set {v1, v2, v3, …, vn}
•D: the domains {D1, D2, D3, …, Dn} => one domain for each variable
•C: the constraints => specify allowable combinations of values
•Ci = (scope, relation), where scope = the set of variables that participate in
the constraint and relation = defines the values those variables can take.
•Intelligent backtracking methods are used to solve Constraint Satisfaction
Problems.
•Only backtrack to where the conflict occurred.
Fundamental in AI and ML 138
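The (V, D, C) triple can be written down directly as Python data. This sketch uses the standard Australia map-coloring example (the region names and adjacencies are the usual textbook ones); every constraint's scope is a pair of adjacent regions and its relation is "not equal".

```python
# V: the variables (regions of Australia)
variables = ['WA', 'NT', 'SA', 'Q', 'NSW', 'V', 'T']
# D: one domain per variable
domains = {v: ['red', 'green', 'blue'] for v in variables}
# C: each constraint is (scope = an adjacent pair, relation = "!=")
constraints = [('WA', 'NT'), ('WA', 'SA'), ('NT', 'SA'), ('NT', 'Q'),
               ('SA', 'Q'), ('SA', 'NSW'), ('SA', 'V'), ('Q', 'NSW'),
               ('NSW', 'V')]

def consistent(assignment):
    """True if no adjacency constraint is violated by a (partial) assignment."""
    return all(assignment[a] != assignment[b]
               for a, b in constraints
               if a in assignment and b in assignment)

print(consistent({'WA': 'red', 'NT': 'green', 'SA': 'blue'}))   # True
print(consistent({'WA': 'red', 'NT': 'red'}))                   # False
```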
Map Coloring

Fundamental in AI and ML 139


Map Coloring

Fundamental in AI and ML 140


Map Coloring Example

Fundamental in AI and ML 141


Constraint Graph

Fundamental in AI and ML 142


Constraint Graph

Fundamental in AI and ML 143


Constraint Satisfaction Problems

•CSP can be viewed as a standard search problem as follows :


•Initial state : the empty assignment {},in which all variables are unassigned.
•Successor function : a value can be assigned to any unassigned variable,
provided that it does not conflict with previously assigned variables.
•Goal test : the current assignment is complete.
•Path cost : a constant cost (e.g., 1) for every step.

Fundamental in AI and ML 144


Backtracking Algorithm
•Backtracking can be defined as a general algorithmic technique that
considers searching every possible combination in order to solve a
computational problem.

What is Backtracking Algorithm?


•Backtracking is an algorithmic technique for solving problems recursively by
trying to build a solution incrementally, one piece at a time, removing those
solutions that fail to satisfy the constraints of the problem at any point of time.

Fundamental in AI and ML 145


Backtracking Algorithm

When to use a Backtracking algorithm?


•When we have multiple choices, then we make the decisions from the
available choices. In the following cases, we need to use the backtracking
algorithm:
•Sufficient information is not available to make the best choice, so we use
the backtracking strategy to try out all the possible solutions.
•Each decision leads to a new set of choices. Then again, we backtrack to
make new decisions. In this case, we need to use the backtracking strategy.

Fundamental in AI and ML 146


Backtracking Algorithm
How does Backtracking work?
•Backtracking is a systematic method of trying out various sequences of
decisions until you find one that works. Let's understand through an example.
•We start with a start node. First, we move to node A. Since it is not a feasible
solution, we move to the next node, i.e., B. B is also not a feasible solution
and a dead end, so we backtrack from node B to node A.

Fundamental in AI and ML 147


Backtracking Algorithm

Fundamental in AI and ML 148


Backtracking Algorithm
The terms related to backtracking are:
•Live node: a node that can still generate further children.
•E-node: the live node whose children are currently being generated (the node being expanded).
•Success node: a node that provides a feasible solution.
•Dead node: a node that cannot be generated further and also does not provide a
feasible solution.
Applications of Backtracking
•N-queen problem
•Sum of subset problem
•Graph coloring
Fundamental in AI and ML 149
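Backtracking for a CSP, sketched in Python on a small map-coloring instance (the four-region example is illustrative): assign variables one at a time, check constraints after each assignment, and undo the assignment on failure.

```python
def backtrack(assignment, variables, domains, constraints):
    """Classic backtracking: assign one variable at a time, undo on conflict."""
    if len(assignment) == len(variables):
        return assignment                     # complete assignment: success
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if all(assignment[a] != assignment[b] for a, b in constraints
               if a in assignment and b in assignment):
            result = backtrack(assignment, variables, domains, constraints)
            if result:
                return result
        del assignment[var]                   # backtrack: undo this choice
    return None                               # dead node: no value works

variables = ['WA', 'NT', 'SA', 'Q']
domains = {v: ['red', 'green', 'blue'] for v in variables}
constraints = [('WA', 'NT'), ('WA', 'SA'), ('NT', 'SA'), ('NT', 'Q'), ('SA', 'Q')]
solution = backtrack({}, variables, domains, constraints)
print(solution)   # {'WA': 'red', 'NT': 'green', 'SA': 'blue', 'Q': 'red'}
```

This plain version backtracks chronologically; the "intelligent backtracking" mentioned earlier would jump directly to the variable that caused the conflict.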
Constraint propagation
•Constraint propagation is the general term for propagating the implications
of a constraint on one variable onto other variables.
•Constraint propagation repeatedly enforces constraints locally to detect
inconsistencies. This propagation can be done with different types of
consistency techniques. They are:
•Node consistency (one consistency)
•Arc consistency (two consistency)
•Path consistency (K-consistency)

Fundamental in AI and ML 150


Constraint propagation
Node consistency
• Simplest consistency technique
• The node representing a variable V in constraint graph is node consistent if
for every value X in the current domain of V, each unary constraint on V is
satisfied.
• The node inconsistency can be eliminated by simply removing those values
from the domain D of each variable V that do not satisfy unary constraint on
V.

Fundamental in AI and ML 151


Constraint propagation
Arc Consistency
•Here, 'arc’ refers to a directed arc in the constraint graph, such as the arc
from SA to NSW. Given the current domains of SA and NSW, the arc is
consistent if, for every value x of SA, there is some value y of NSW that is
consistent with x.
•In the constraint graph, binary constraint corresponds to arc. Therefore this
type of consistency is called arc consistency.
•Arc (vi, vj) is arc consistent if for every value X in the current domain of vi
there is some value Y in the domain of vj such that vi = X and vj = Y is
permitted by the binary constraint between vi and vj.

Fundamental in AI and ML 152
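Arc consistency is usually enforced with the AC-3 algorithm. The sketch below is simplified for illustration: the binary relation is hard-wired to "not equal" (as in map coloring), whereas a general AC-3 would take an arbitrary constraint function per arc.

```python
from collections import deque

def ac3(domains, neighbors):
    """Enforce arc consistency in place; return False if a domain empties."""
    queue = deque((xi, xj) for xi in domains for xj in neighbors[xi])
    while queue:
        xi, xj = queue.popleft()
        # Revise: keep only values of xi that have some support in xj's domain.
        revised_domain = [x for x in domains[xi]
                          if any(x != y for y in domains[xj])]
        if len(revised_domain) < len(domains[xi]):
            domains[xi] = revised_domain
            if not revised_domain:
                return False                  # inconsistency detected
            for xk in neighbors[xi]:          # xi changed: recheck arcs into xi
                if xk != xj:
                    queue.append((xk, xi))
    return True

# Tiny illustrative example: A is fixed to red, so B must lose red.
domains = {'A': ['red'], 'B': ['red', 'green']}
neighbors = {'A': ['B'], 'B': ['A']}
ok = ac3(domains, neighbors)
print(ok, domains['B'])   # True ['green']
```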


Constraint propagation
k-Consistency (path Consistency)
•A CSP is k-consistent if, for any set of k - 1 variables and for any consistent
assignment to those variables, a consistent value can always be assigned to
any kth variable
•1-consistency means that each individual variable by itself is consistent; this
is also called node consistency.
•2-consistency is the same as arc consistency.
•3-consistency means that any pair of adjacent variables can always be
extended to a third neighboring variable; this is also called path consistency

Fundamental in AI and ML 153


Knowledge
Representation

Fundamental in AI and ML 154


First Order Predicate Logic
—First Order Logic (FOL) can be simply put as a collection of objects, their
attributes and relations among them to represent knowledge. It is also known
as Predicate Logic.

—First-order logic is another way of knowledge representation in artificial


intelligence. It is an extension to propositional logic. FOL is sufficiently
expressive to represent the natural language statements in a concise way.
First-order logic is also known as Predicate logic or First-order predicate
logic.

Fundamental in AI and ML 155


What is First Order Logic ?
• FOL is a mode of representation in AI. It is an extension of PL.

• FOL represents natural language statements in a concise way.

• FOL is also called predicate logic. It is a powerful language used to develop


information about an object and express the relationship between objects.

• FOL not only assumes that the world contains facts (as PL does); it also
assumes the following:

• Objects: A, B, people, numbers, colors, wars, theories, squares, pit, etc.
Fundamental in AI and ML 156
First Order Predicate Logic
Relations: unary or n-ary relations such as red, round, sister of, brother of, etc.

Functions: father of, best friend, third inning of, end of, etc.

Fundamental in AI and ML 157


Representing Simple Statements in FOL:
It is important that you know the logical operators/connectives that are used in
Propositional Logic.

Fundamental in AI and ML 158


Part of First Order Logic
FOL also has two parts:
1.Syntax
2.Semantics
—

Syntax: The syntax of FOL decides which collection of symbols is a logical


expression. The basic syntactic elements of FOL are symbols. We use
symbols to write statements in shorthand notation.

Fundamental in AI and ML 159


Basic Elements of FOL:

Fundamental in AI and ML 160


Atomic and Complex Sentences in FOL:
Atomic Sentence:

This is a basic sentence of FOL formed from a predicate symbol followed by


a parenthesis with a sequence of terms.
We can represent atomic sentences as a predicate (value1, value2…., value n).

Example-
John and Michael are colleagues → Colleagues(John, Michael)
German Shepherd is a dog → Dog(German Shepherd)

Fundamental in AI and ML 161


Atomic and Complex Sentences in FOL:
Complex Sentence:
Complex sentences are made by combining atomic sentences using
connectives.

A statement in FOL is further divided into two parts:

Subject: the main part of the statement.
Predicate: a relation that binds atoms together.

Example-
1.Colleague (Oliver, Benjamin) ∧ Colleague (Benjamin, Oliver)
2.“x is an integer” Fundamental in AI and ML 162
Atomic and Complex Sentences in FOL:
It has two parts;
First, x is the subject.
Second, “is an integer” is called a predicate.

Fundamental in AI and ML 163


Quantifiers and their use in FOL:
Quantifiers generate quantification and specify the number of specimens in
the universe to which an expression applies.
Quantifiers allow us to determine or identify the range and scope of the
variable in a logical expression.

There are two types of quantifiers:


1.Universal quantifier: for all, everyone, everything.
2.Existential quantifier: for some, at least one.

Fundamental in AI and ML 164


Universal Quantifiers:
Universal quantifiers specify that the statement within the range is true for everything
or every instance of a particular thing.

Universal quantifiers are denoted by the symbol ∀ (an inverted A).

With a universal quantifier, we use the implication (→) connective.

•If x is a variable, then ∀x reads as:
1. For all x
2. For every x
3. For each x

Fundamental in AI and ML 165


Example- Every student likes Educative.

Explanation:
In logical notation, it can be written as:

∀x Student(x) → Likes(x, Educative)

This can be interpreted as: for every x, if x is a student, then x likes Educative.
Fundamental in AI and ML 166
Existential Quantifiers:

Existential quantifiers are used to express that the statement within their
scope is true for at least one instance of something.

The symbol ∃, which looks like an inverted E, is used to represent them. With
an existential quantifier, we always use the AND (∧, conjunction) connective.

Fundamental in AI and ML 167


Existential Quantifiers:
If x is a variable, the existential quantifier will be
1.For some x
2.There exists an x
3.For at least one x

Example-
Some people like Football.

Fundamental in AI and ML 168


Existential Quantifiers:
Explanation:
In logical notation, it can be written as:

∃x People(x) ∧ Likes(x, Football)

It can be interpreted as: there is some x such that x is a person who likes football.

Fundamental in AI and ML 169


Games: Minimax and Alpha-Beta Pruning

Fundamental in AI and ML 170


Outline:
1. Overview
2. Minimax for Zero-Sum Games
3. α-β Pruning

Fundamental in AI and ML 171


Types of Games:
Game = task environment with > 1 agent

To Know:
1. Deterministic or stochastic?
2. Perfect information (fully observable)?
3. Two, three, or more players?
4. Teams or individuals?
5. Turn-taking or simultaneous?
6. Zero sum?

Output: algorithms for calculating a contingent plan (a.k.a. strategy or policy)
which recommends a move for every possible eventuality
Fundamental in AI and ML 172
Example of a Game-Tree

Fundamental in AI and ML 173


Standard Games
Rational Agent → A rational player
Game formulation: Assume all future moves will be optimal
1.Initial state: s0
2.Players: Player(s) indicates whose move it is
3.Actions: Actions(s) for player on move
4.Transition model: Result(s,a)
5.Terminal test: Terminal-Test(s)
6.Terminal values: Utility(s,p) for player p
7.Or just Utility(s) for player making the decision at root

Minimax is like DFS. Time: O(b^m), Space: O(bm)

For chess, b ≈ 35, m ≈ 100
# Exact solution is completely infeasible
# Humans can't do this either, so how do we play chess?
Fundamental in AI and ML 174
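The game formulation above maps directly onto a recursive minimax. This illustrative sketch abstracts the formulation into two callbacks (`successors` plays the role of Actions/Result, `utility` the terminal value); the example tree is a standard textbook two-ply instance, where MIN reduces the three subtrees to 3, 2, 2 and MAX picks 3.

```python
def minimax(state, maximizing, successors, utility):
    """Minimax over a full game tree. successors(state) returns child states
    (empty at terminal states); utility(state) scores terminal states."""
    children = successors(state)
    if not children:
        return utility(state)                 # Terminal-Test: no moves left
    values = [minimax(c, not maximizing, successors, utility) for c in children]
    return max(values) if maximizing else min(values)

# Tuples are internal nodes, ints are terminal utilities (for the MAX player).
tree = ((3, 12, 8), (2, 4, 6), (14, 5, 2))
succ = lambda s: s if isinstance(s, tuple) else ()
util = lambda s: s
print(minimax(tree, True, succ, util))   # 3
```

Note the exhaustive recursion: every leaf is visited, which is the O(b^m) cost the slide quotes.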
Zero-Sum Games

Zero-Sum Games General-Sum Games


Agents have opposite utilities Agents have independent utilities
Pure competition: Cooperation, indifference,
One maximizes, the other minimizes
competition, shifting alliances, and
more are all possible
Fundamental in AI and ML 175
Game Tree Complexity

Saviour
Alpha-beta
Pruning

Fundamental in AI and ML 176


Alpha-Beta Pruning
Principle:
In any game tree, for any node n: if the player has a better choice m, either
at the parent of n or at any choice point further up, then n will never be
reached in the actual play!

Alpha bound of a node: the max current value of all Max ancestors of the
node. Exploration of a Min node is stopped when its value equals or falls
below alpha.

Beta bound of a node: the min current value of all Min ancestors of the
node. Exploration of a Max node is stopped when its value equals or exceeds
beta.
Fundamental in AI and ML 177
Fundamental in AI and ML 178
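The alpha and beta bounds described above can be threaded through the same recursive minimax. In this illustrative sketch (same tuple-tree convention as before), `alpha` is the best value a Max ancestor already has and `beta` the best a Min ancestor has; a branch is cut as soon as its value crosses the relevant bound.

```python
def alphabeta(state, maximizing, successors, utility,
              alpha=float('-inf'), beta=float('inf')):
    """Minimax with alpha-beta pruning; same result as plain minimax."""
    children = successors(state)
    if not children:
        return utility(state)
    if maximizing:
        value = float('-inf')
        for c in children:
            value = max(value, alphabeta(c, False, successors, utility,
                                         alpha, beta))
            alpha = max(alpha, value)
            if value >= beta:        # a Min ancestor has a better option:
                break                # prune the remaining children
        return value
    value = float('inf')
    for c in children:
        value = min(value, alphabeta(c, True, successors, utility,
                                     alpha, beta))
        beta = min(beta, value)
        if value <= alpha:           # a Max ancestor has a better option
            break
    return value

tree = ((3, 12, 8), (2, 4, 6), (14, 5, 2))
succ = lambda s: s if isinstance(s, tuple) else ()
util = lambda s: s
print(alphabeta(tree, True, succ, util))   # 3, same answer with fewer leaves
```

On this tree, the second subtree is cut after seeing the leaf 2 (since 2 ≤ α = 3), exactly the pruning the principle predicts.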
What can we Prune ?
For chess: only about 35^50 nodes instead of 35^100!!
Yaaay!!!!!

Fundamental in AI and ML 179


Why game playing ?

Fundamental in AI and ML 180


Characteristics of Game playing ?

Fundamental in AI and ML 181


Games vs. Search Problems

Fundamental in AI and ML 182


Two Player game

Fundamental in AI and ML 183


Games as Adversarial Search

Fundamental in AI and ML 184


Two player Game Tree (Example)

Fundamental in AI and ML 185


Mini-Max Algorithm

Fundamental in AI and ML 186


Mini Max Toy Game

Fundamental in AI and ML 187


Mini-Max Search Tree

Fundamental in AI and ML 188


Mini-Max Search Tree

Fundamental in AI and ML 189


Mini-Max Search Tree

Fundamental in AI and ML 190


Mini-Max Search Tree

Fundamental in AI and ML 191


Mini-Max Search Tree

Fundamental in AI and ML 192


Mini-Max Search Tree

Fundamental in AI and ML 193


Mini-Max Search Tree

Fundamental in AI and ML 194


Mini-Max Search Tree

Fundamental in AI and ML 195


Fundamental in AI and ML 196
Properties of Mini-Max

Fundamental in AI and ML 197


Why expand Unnecessary nodes ?

Fundamental in AI and ML 198


Mini-Max Search Tree

Fundamental in AI and ML 199


Mini-Max Search Tree

Fundamental in AI and ML 200


Alpha Beta

Fundamental in AI and ML 201


Mini-Max Search Tree

Fundamental in AI and ML 202


Mini-Max Search Tree

Fundamental in AI and ML 203


Mini-Max Search Tree

Fundamental in AI and ML 204


Bad and Good Cases for Alpha-Beta Pruning

Fundamental in AI and ML 205


Properties of Alpha-Beta Pruning

Fundamental in AI and ML 206


Local Search and Optimization

Fundamental in AI and ML 207


Outline

• Local search techniques and optimization


– Hill-climbing
– Gradient methods
– Simulated annealing
– Genetic algorithms
– Issues with local search

Fundamental in AI and ML 208


Local Search and Optimization

• Previous lecture: path to goal is solution to problem


– systematic exploration of search space.

• This lecture: a state is solution to problem


– for some problems path is irrelevant.
– E.g., 8-queens

• Different algorithms can be used


– Depth First Branch and Bound
– Local search
Fundamental in AI and ML 209
Local Search and Optimization

Goal Satisfaction                     Optimization

Reach the goal node                   Optimize(objective fn)

Constraint satisfaction               Constraint optimization

You can go back and forth between the two problems;
they are typically in the same complexity class.

Fundamental in AI and ML 210


Local Search and Optimization
• Local search
– Keep track of single current state
– Move only to neighboring states
– Ignore paths

• Advantages:
– Use very little memory
– Can often find reasonable solutions in large or infinite (continuous) state spaces.

• “Pure optimization” problems


– All states have an objective function
– Goal is to find state with max (or min) objective value
– Does not quite fit into path-cost/goal-state formulation
– Local search can do quite well on these problems.
Fundamental in AI and ML 211
Trivial Algorithms

• Random Sampling
– Generate a state randomly

• Random Walk
– Randomly pick a neighbor of the current state

• Both algorithms asymptotically complete.

Fundamental in AI and ML 212


Hill-climbing (Greedy Local Search) max version
function HILL-CLIMBING( problem) return a state that is a local maximum
input: problem, a problem
local variables: current, a node.
neighbor, a node.

current ← MAKE-NODE(INITIAL-STATE[problem])
loop do
neighbor ← a highest valued successor of current
if VALUE [neighbor] ≤ VALUE[current] then return STATE[current]
current ← neighbor
min version will reverse inequalities and look for lowest valued successor
Fundamental in AI and ML 213
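The HILL-CLIMBING pseudocode above translates almost line for line into Python. This is a sketch of the max version on a toy one-dimensional objective (the objective and neighbor functions are illustrative, not from the slides):

```python
def hill_climbing(initial, neighbors, value):
    """Greedy local search (max version): move to the best neighbor until
    no neighbor improves on the current state."""
    current = initial
    while True:
        best = max(neighbors(current), key=value, default=current)
        if value(best) <= value(current):
            return current               # local maximum (or plateau edge)
        current = best

# Toy objective: maximize -(x - 3)^2 over the integers, stepping by +/- 1.
value = lambda x: -(x - 3) ** 2
neighbors = lambda x: [x - 1, x + 1]
print(hill_climbing(0, neighbors, value))   # climbs 0 -> 1 -> 2 -> 3
```

The min version only flips the comparison and takes the lowest-valued successor, as the slide notes.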
Hill-climbing Search
• “a loop that continuously moves towards increasing value”
– terminates when a peak is reached
– Aka greedy local search
• Value can be either
– Objective function value
– Heuristic function value (minimized)

• Hill climbing does not look ahead of the immediate neighbors


• Can randomly choose among the set of best successors
– if multiple have the best value
• “climbing Mount Everest in a thick fog with amnesia”
Fundamental in AI and ML 214
Search “Landscape”

Hill climbing gets stuck in local minima, depending on where it starts.
Fundamental in AI and ML 215
Example: n-queens

• Put n queens on an n x n board with no two queens on the same row,


column, or diagonal

• Is it a satisfaction problem or optimization?


Fundamental in AI and ML 216
Hill-climbing search: 8-queens problem

• Need to convert to an optimization problem


• h = number of pairs of queens that are attacking each other
• h = 17 for the above state
Fundamental in AI and ML 217
Search Space

• State
– All 8 queens on the board in some configuration

• Successor function
– move a single queen to another square in the same column.

• Example of a heuristic function h(n):


– the number of pairs of queens that are attacking each other
– (so we want to minimize this)
Fundamental in AI and ML 218
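The heuristic h defined above (pairs of mutually attacking queens) is easy to compute with the one-queen-per-column encoding, where `state[i]` is the row of the queen in column i. A minimal sketch, shown on 4-queens for brevity:

```python
def attacking_pairs(state):
    """h for n-queens: number of pairs of queens attacking each other."""
    n = len(state)
    h = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_row = state[i] == state[j]
            same_diag = abs(state[i] - state[j]) == j - i   # |row diff| == col diff
            if same_row or same_diag:
                h += 1
    return h

print(attacking_pairs((0, 1, 2, 3)))   # all on one diagonal: C(4,2) = 6 pairs
print(attacking_pairs((1, 3, 0, 2)))   # a 4-queens solution: h = 0
```

Column conflicts never occur in this encoding, since each column holds exactly one queen; that is also why the successor function only moves queens within their column.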
Hill-climbing search: 8-queens problem

• Is this a solution?
• What is h?
Fundamental in AI and ML 219
Hill-climbing on 8-queens

• Randomly generated 8-queens starting states…


• 14% of the time it solves the problem
• 86% of the time it gets stuck at a local minimum

• However…
– Takes only 4 steps on average when it succeeds
– And 3 on average when it gets stuck
– (for a state space with 8^8 =~17 million states)
Fundamental in AI and ML 220
Hill Climbing Drawbacks

• Local maxima

• Plateaus

• Diagonal ridges
Escaping Shoulders: Sideways Move
• If no downhill (uphill) moves, allow sideways moves in hope that
algorithm can escape
– Need to place a limit on the possible number of sideways moves to
avoid infinite loops

• For 8-queens
– Now allow sideways moves with a limit of 100
– Raises percentage of problem instances solved from 14% to 94%
– However…
• 21 steps on average for every successful solution
• 64 on average for each failure
Fundamental in AI and ML 222
Tabu Search

• Prevent returning quickly to the same state


• Keep fixed length queue (“tabu list”)
• Add most recent state to queue; drop oldest
• Never make the step that is currently tabu’ed

Properties:
– As the size of the tabu list grows, hill-climbing will asymptotically
become “non-redundant” (won’t look at the same state twice)
– In practice, a reasonable sized tabu list (say 100 or so) improves the
performance of hill climbing in many problems
Fundamental in AI and ML 223
Escaping Shoulders/local Optima Enforced Hill
Climbing
• Perform breadth first search from a local optima
– to find the next state with better h function

• Typically,
– prolonged periods of exhaustive search
– bridged by relatively quick periods of hill-climbing

• Middle ground b/w local and systematic search


Fundamental in AI and ML 224
Hill-climbing: stochastic variations

• Stochastic hill-climbing
– Random selection among the uphill moves.
– The selection probability can vary with the steepness of the uphill
move.

• To avoid getting stuck in local minima


– Random-walk hill-climbing
– Random-restart hill-climbing
– Hill-climbing with both

Fundamental in AI and ML 225


Hill Climbing: stochastic variations

When the state-space landscape has local minima, any search that moves only
in the greedy direction cannot be complete

Random walk, on the other hand, is asymptotically complete

Idea: Put random walk into greedy hill-climbing

Fundamental in AI and ML 226


Hill-climbing with random restarts

• If at first you don’t succeed, try, try again!


• Different variations
– For each restart: run until termination vs. run for a fixed time
– Run a fixed number of restarts or run indefinitely
• Analysis
– Say each search has probability p of success
• E.g., for 8-queens, p = 0.14 with no sideways moves
– Expected number of restarts?
– Expected number of steps taken?
• If you want to pick one local search algorithm, learn this one!!
Fundamental in AI and ML 227
Hill-climbing with random walk

• At each step do one of the two


– Greedy: With prob p move to the neighbor with largest value
– Random: With prob 1-p move to a random neighbor

Fundamental in AI and ML 228


Hill-climbing with both
• At each step do one of the three
– Greedy: move to the neighbor with largest value
– Random Walk: move to a random neighbor
– Random Restart: Resample a new current state

Fundamental in AI and ML 229


Simulated Annealing
• Simulated Annealing = physics inspired twist on random walk
• Basic ideas:
– like hill-climbing identify the quality of the local improvements
– instead of picking the best move, pick one randomly
– say the change in objective function is δ
– if δ is positive, then move to that state
– otherwise:
• move to this state with probability e^(δ/T), which shrinks as δ becomes more negative
• thus: worse moves (very large negative δ) are executed less often
– however, there is always a chance of escaping from local maxima
– over time, make it less likely to accept locally bad moves
– (Can also make the size of the move random as well, i.e., allow “large”
steps in state space)
Fundamental in AI and ML 230
Physical Interpretation of Simulated Annealing
• A Physical Analogy:
• imagine letting a ball roll downhill on the function surface
– this is like hill-climbing (for minimization)
• now imagine shaking the surface, while the ball rolls, gradually reducing the
amount of shaking
– this is like simulated annealing

• Annealing = physical process of cooling a liquid or metal until particles


achieve a certain frozen crystal state
• simulated annealing:
– free variables are like particles
– seek “low energy” (high quality) configuration
Fundamental in AI and ML 231
– slowly reducing temp. T with particles moving around randomly
Simulated annealing
function SIMULATED-ANNEALING( problem, schedule) return a solution state
   input: problem, a problem
          schedule, a mapping from time to temperature
   local variables: current, a node.
                    next, a node.
                    T, a “temperature” controlling the prob. of downward steps

   current ← MAKE-NODE(INITIAL-STATE[problem])
   for t ← 1 to ∞ do
       T ← schedule[t]
       if T = 0 then return current
       next ← a randomly selected successor of current
       ∆E ← VALUE[next] - VALUE[current]
       if ∆E > 0 then current ← next
       else current ← next only with probability e^(∆E/T)
Fundamental in AI and ML 232
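The pseudocode above can be sketched directly in Python. The cooling schedule, neighbor function, and objective here are illustrative choices (a linear schedule on the same toy objective used earlier), not part of the algorithm itself.

```python
import math
import random

def simulated_annealing(initial, neighbor, value, schedule):
    """SA per the pseudocode: always accept uphill moves; accept downhill
    moves with probability e^(dE/T); stop when the temperature reaches 0."""
    current = initial
    t = 1
    while True:
        T = schedule(t)
        if T <= 0:
            return current
        nxt = neighbor(current)
        dE = value(nxt) - value(current)
        if dE > 0 or random.random() < math.exp(dE / T):
            current = nxt                 # uphill always; downhill sometimes
        t += 1

# Toy run: maximize -(x - 3)^2 with random +/- 1 steps, linear cooling.
random.seed(0)
value = lambda x: -(x - 3) ** 2
neighbor = lambda x: x + random.choice([-1, 1])
result = simulated_annealing(0, neighbor, value,
                             schedule=lambda t: max(0, 2 - 0.002 * t))
print(result)
```

Early on (high T) the run behaves like a random walk; as T falls it becomes greedy, which is the ball-on-a-shaken-surface picture from the previous slide.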
Temperature T

• high T: probability of “locally bad” move is higher


• low T: probability of “locally bad” move is lower
• typically, T is decreased as the algorithm runs longer
• i.e., there is a “temperature schedule”

Fundamental in AI and ML 233


Simulated Annealing in Practice

– method proposed in 1983 by IBM researchers for solving VLSI layout


problems (Kirkpatrick et al, Science, 220:671-680, 1983).
• theoretically will always find the global optimum

– Other applications: Traveling salesman, Graph partitioning, Graph


coloring, Scheduling, Facility Layout, Image Processing, …

– useful for some problems, but can be very slow


• slowness comes about because T must be decreased very gradually
to retain optimality
Fundamental in AI and ML 234
Local beam search
• Idea: Keeping only one node in memory is an extreme reaction to memory
problems.

• Keep track of k states instead of one


– Initially: k randomly selected states
– Next: determine all successors of k states
– If any of successors is goal → finished
– Else select k best from successors and repeat

Fundamental in AI and ML 235


Local Beam Search (contd.)

• Not the same as k random-start searches run in parallel!


• Searches that find good states recruit other searches to join them
• Problem: quite often, all k states end up on same local hill
• Idea: Stochastic beam search
– Choose k successors randomly, biased towards good ones
• Observe the close analogy to natural selection!

Fundamental in AI and ML 236


Hey! Perhaps sex
can improve
search?

Fundamental in AI and ML 237


Sure! Check out
ye book.

Fundamental in AI and ML 238


Genetic algorithms
• Twist on Local Search: successor is generated by combining two parent states

• A state is represented as a string over a finite alphabet (e.g. binary)


– 8-queens
– State = position of 8 queens each in a column

• Start with k randomly generated states (population)

• Evaluation function (fitness function):


– Higher values for better states.
– Opposite to heuristic function, e.g., # non-attacking pairs in 8-queens

• Produce the next generation of states by “simulated evolution”


– Random selection
– Crossover
– Random mutation Fundamental in AI and ML 239
Genetic algorithms
[Board figure: rows 1–8] String representation of an 8-queens state, one
digit per column giving each queen's row: 16257483

Can we evolve 8-queens through genetic algorithms?


Fundamental in AI and ML 240
Genetic algorithms

[Figure: GA pipeline] 4 states for the 8-queens problem → 2 pairs of states
randomly selected based on fitness (random crossover points selected) →
new states after crossover → random mutation applied
• Fitness function: number of non-attacking pairs of queens (min = 0, max = 8 × 7/2 = 28)
• 24/(24+23+20+11) = 31%
• 23/(24+23+20+11) = 29%, etc.
Fundamental in AI and ML 241
241
Genetic algorithms
• State = a string over a finite alphabet (an individual)
– A successor state is generated by combining two parent states

• Start with k randomly generated states (population)

• Evaluation function (fitness function).


– Higher values for better states.

• Select individuals for next generation based on fitness


– P(indiv. in next gen) = indiv. fitness / total population fitness

• Crossover: fit parents to yield next generation (offspring)


• Mutate the offspring randomly with some low probability
Fundamental in AI and ML 242
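The loop described above (fitness-proportional selection, one-point crossover, low-probability mutation) can be sketched for 8-queens as follows. The population size, generation count, and 0.1 mutation rate are illustrative choices, not values from the slides, and whether a run reaches a perfect state depends on the random seed.

```python
import random

def non_attacking_pairs(state):
    """Fitness for 8-queens: non-attacking pairs (max = 8*7/2 = 28)."""
    n = len(state)
    attacking = sum(1 for i in range(n) for j in range(i + 1, n)
                    if state[i] == state[j]
                    or abs(state[i] - state[j]) == j - i)
    return n * (n - 1) // 2 - attacking

def mutate(state):
    """Move one queen to a random row in its column."""
    s = list(state)
    s[random.randrange(len(s))] = random.randrange(len(s))
    return tuple(s)

def genetic_algorithm(pop_size=20, generations=500):
    population = [tuple(random.randrange(8) for _ in range(8))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Assumes at least one individual has positive fitness (virtually
        # always true for random 8-queens states).
        weights = [non_attacking_pairs(ind) for ind in population]
        new_population = []
        for _ in range(pop_size):
            x, y = random.choices(population, weights=weights, k=2)  # selection
            point = random.randrange(1, 8)                           # crossover
            child = x[:point] + y[point:]
            if random.random() < 0.1:                                # mutation
                child = mutate(child)
            new_population.append(child)
        population = new_population
        best = max(population, key=non_attacking_pairs)
        if non_attacking_pairs(best) == 28:      # all pairs non-attacking
            return best
    return max(population, key=non_attacking_pairs)

random.seed(1)
best = genetic_algorithm()
print(best, non_attacking_pairs(best))
```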
Steps in Genetic algorithms

Fundamental in AI and ML 243


Steps in Genetic algorithms

Fundamental in AI and ML 244


Steps in Genetic algorithms

Fundamental in AI and ML 245


Steps in Genetic algorithms

Fundamental in AI and ML 246


Steps in Genetic algorithms

Fundamental in AI and ML 247


Genetic algorithms

• Fitness function: number of non-attacking pairs of queens (min = 0,
max = 8 × 7/2 = 28)
• 24/(24+23+20+11) = 31%
• 23/(24+23+20+11) = 29%, etc.
Fundamental in AI and ML 248

fitness = #non-attacking queen pairs
probability of being in next generation = fitness/(Σ_i fitness_i)

How to convert a fitness value into a probability of being in the next generation:
• Fitness function: #non-attacking queen pairs (min = 0, max = 8 × 7/2 = 28)
• Σ_i fitness_i = 24+23+20+11 = 78
• P(pick child_1 for next gen.) = fitness_1/(Σ_i fitness_i) = 24/78 = 31%
• P(pick child_2 for next gen.) = fitness_2/(Σ_i fitness_i) = 23/78 = 29%; etc.
Fundamental in AI and ML 249
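The fitness-to-probability conversion above is a two-line computation; this snippet reproduces the slide's numbers (31% and 29% for the first two states):

```python
fitnesses = [24, 23, 20, 11]            # non-attacking pairs for the 4 states
total = sum(fitnesses)                  # 78
probs = [round(100 * f / total) for f in fitnesses]
print(probs)   # [31, 29, 26, 14]
```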
Genetic algorithms

Has the effect of “jumping” to a completely different new part of the search space (quite
non-local)

Fundamental in AI and ML 250


Comments on Genetic Algorithms

Genetic algorithm is a variant of “stochastic beam search”


• Positive points
– Random exploration can find solutions that local search can’t
• (via crossover primarily)
– Appealing connection to human evolution
• “neural” networks, and “genetic” algorithms are metaphors!

• Negative points
– Large number of “tunable” parameters
• Difficult to replicate performance from one problem to another
– Lack of good empirical studies comparing to simpler methods
– Useful on some (small?) set of problems, but no convincing evidence that GAs are
better than hill-climbing w/ random restarts in general
Fundamental in AI and ML 251
MODULE – 3 MACHINE LEARNING

Fundamental in AI and ML 252


Areas of math essential to machine learning
Machine learning is part of both statistics and computer science
• Probability
• Statistical inference
• Validation
• Estimates of error, confidence intervals
Linear algebra
• Hugely useful for compact representation of linear transformations on data
• Dimensionality reduction techniques
Optimization theory

Fundamental in AI and ML 253


Why worry about the math?
• There are lots of easy-to-use machine learning packages out there.
• After this course, you will know how to apply several of the most
general-purpose algorithms.

HOWEVER
• To get really useful results, you need good mathematical intuitions about
certain general machine learning principles, as well as the inner workings of
the individual algorithms.

Fundamental in AI and ML 254


Why worry about the math?
These intuitions will allow you to:
• Choose the right algorithm(s) for the problem
• Make good choices on parameter settings, validation strategies
• Recognize over- or underfitting
• Troubleshoot poor / ambiguous results
• Put appropriate bounds of confidence / uncertainty on results
• Do a better job of coding algorithms or incorporating them into more complex
analysis pipelines

Fundamental in AI and ML 255


Notation
a∈A set membership: a is member of set A
|B| cardinality: number of items in set B
|| v || norm: length of vector v
∑ summation
∫ integral
ℜ the set of real numbers
ℜn real number space of dimension n
n = 2 : plane or 2-space
n = 3 : 3- (dimensional) space
n > 3 : n-space or hyperspace

Fundamental in AI and ML 256


Notation
• x, y, z, vector (bold, lower case)
u, v
• A, B, X matrix (bold, upper case)
• y = f( x ) function (map): assigns unique value in
range of y to each value in domain of x
• dy / dx derivative of y with respect to single
variable x
• y = f( x ) function on multiple variables, i.e. a
vector of variables; function in n-space
• ∂y / ∂xi partial derivative of y with respect to
element i of vector x
Fundamental in AI and ML 257
The concept of probability
Intuition:

In some process, several outcomes are possible.

When the process is repeated a large number of times, each outcome occurs
with a characteristic relative frequency, or probability.

If a particular outcome happens more often than another outcome, we say it


is more probable.

Fundamental in AI and ML 258


The concept of probability
Arises in two contexts:
In actual repeated experiments.
Example: You record the color of 1000 cars driving by. 57 of them are
green. You estimate the probability of a car being green as 57 / 1000 =
0.057.

In idealized conceptions of a repeated process.


Example: You consider the behavior of an unbiased six-sided die. The
expected probability of rolling a 5 is 1 / 6 = 0.1667.

Example: You need a model for how people’s heights are distributed. You
choose a normal distribution (bell-shaped curve) to represent the expected
relative probabilities.
Fundamental in AI and ML 259
Probability spaces
A probability space is a random process or experiment with three components:
I. Ω, the set of possible outcomes O
• number of possible outcomes = | Ω | = N

II. F, the set of possible events E


• an event comprises 0 to N outcomes
• number of possible events = | F | = 2^N

III. P, the probability distribution


• function mapping each outcome and event to real number between 0 and 1 (the probability of O or E)
• probability of an event is sum of probabilities of possible outcomes in event

Fundamental in AI and ML 260


Axioms of probability

1. Non-negativity:
for any event E ∈ F, p( E ) ≥ 0

2. All possible outcomes:


p( Ω ) = 1

3. Additivity of disjoint events:


for all events E, E’ ∈ F where E ∩ E’ = ∅,
p( E U E’ ) = p( E ) + p( E’ )
Fundamental in AI and ML 261
Types of probability spaces

Define | Ω | = number of possible outcomes

• Discrete space | Ω | is finite


• Analysis involves summations ( ∑ )

• Continuous space | Ω | is infinite


• Analysis involves integrals ( ∫ )

Fundamental in AI and ML 262


Example of discrete probability space
Single roll of a six-sided die
• 6 possible outcomes: O = 1, 2, 3, 4, 5, or 6
• 2^6 = 64 possible events

• example: E = ( O ∈ { 1, 3, 5 } ), i.e. outcome is odd

• If die is fair, then probabilities of outcomes are equal


p( 1 ) = p( 2 ) = p( 3 ) =
p( 4 ) = p( 5 ) = p( 6 ) = 1 / 6

• example: probability of event E = ( outcome is odd ) is


p( 1 ) + p( 3 ) + p( 5 ) = 1 / 2
Fundamental in AI and ML 263
Example of discrete probability space
Three consecutive flips of a coin
• 8 possible outcomes: O = HHH, HHT, HTH, HTT, THH, THT, TTH, TTT
• 2^8 = 256 possible events

• example: E = ( O ∈ { HHT, HTH, THH } ), i.e. exactly two flips are heads
• example: E = ( O ∈ { THT, TTT } ), i.e. the first and third flips are tails

• If coin is fair, then probabilities of outcomes are equal


p( HHH ) = p( HHT ) = p( HTH ) = p( HTT ) =
p( THH ) = p( THT ) = p( TTH ) = p( TTT ) = 1 / 8

• example: probability of event E = ( exactly two heads ) is


p( HHT ) + p( HTH ) + p( THH ) = 3 / 8
Fundamental in AI and ML 264
Example of continuous probability space

Height of a randomly chosen American male


• Infinite number of possible outcomes: O has some single value in range 2 feet to 8 feet
• Infinite number of possible events
• example: E = ( O | O < 5.5 feet ), i.e. individual chosen is less than 5.5 feet tall
• Probabilities of outcomes are not equal, and are described by a continuous function, p( O )

Fundamental in AI and ML 265


Example of continuous probability space
Height of a randomly chosen American male
• Probabilities of outcomes O are not equal, and are described by a continuous function, p(O)
• p( O ) is a relative, not an absolute probability
• p( O ) for any particular O is zero
• ∫ p( O ) from O = - ∞ to ∞ (i.e. area under curve) is 1
• example: p( O = 5’8” ) > p( O = 6’2” )
• example: p( O < 5’6” ) = (∫ p( O ) from O = - ∞ to 5’6” ) ≈ 0.25

Fundamental in AI and ML 266


Probability distributions
• Discrete: probability mass function (pmf)

example:
sum of two
fair dice

• Continuous: probability density function (pdf)

example:
waiting time between
eruptions of Old Faithful
(minutes)

Fundamental in AI and ML 267


Random variables
• A random variable X is a function that associates a number x with each outcome O of a
process
• Common notation: X( O ) = x, or just X = x
• Basically a way to redefine (usually simplify) a probability space to a new probability space
• X must obey axioms of probability (over the possible values of x)
• X can be discrete or continuous
• Example: X = number of heads in three flips of a coin
• Possible values of X are 0, 1, 2, 3
• p( X = 0 ) = p( X = 3 ) = 1 / 8 p( X = 1 ) = p( X = 2 ) = 3 / 8
• Size of space (number of “outcomes”) reduced from 8 to 4
• Example: X = average height of five randomly chosen American men
• Size of space unchanged (X can range from 2 feet to 8 feet), but pdf of X different than for single man
Fundamental in AI and ML 268
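The coin-flip random variable above can be checked by brute-force enumeration (an illustrative sketch, not from the slides):

```python
# Brute-force check of X = number of heads in three fair coin flips.
from itertools import product

outcomes = list(product("HT", repeat=3))        # the 8 equally likely outcomes
pmf = {}
for outcome in outcomes:
    x = outcome.count("H")                      # value of the random variable
    pmf[x] = pmf.get(x, 0) + 1 / len(outcomes)  # each outcome has p = 1/8

# pmf gives p(X=0) = p(X=3) = 1/8 and p(X=1) = p(X=2) = 3/8;
# the space shrinks from 8 outcomes to 4 values
```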
Multivariate probability distributions
• Scenario
• Several random processes occur (doesn’t matter whether in parallel or in
sequence)
• Want to know probabilities for each possible combination of outcomes

• Can describe as joint probability of several random variables


• Example: two processes whose outcomes are represented by random
variables X and Y. Probability that process X has outcome x and process Y
has outcome y is denoted as:

p( X = x, Y = y )
Fundamental in AI and ML 269
Example of multivariate distribution

Fundamental in AI and ML 270


Multivariate probability distributions

• Marginal probability
• Probability distribution of a single variable in a joint distribution
• Example: two random variables X and Y:
p( X = x ) = ∑b=all values of Y p( X = x, Y = b )

• Conditional probability
• Probability distribution of one variable given that another variable takes a certain value
• Example: two random variables X and Y:
p( X = x | Y = y ) = p( X = x, Y = y ) / p( Y = y )

Fundamental in AI and ML 271
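Both formulas can be exercised on the minivan column quoted on the surrounding slides (a sketch; the rest of the joint table is not reproduced here):

```python
# The X = minivan column of the joint distribution, as quoted on the slides.
joint = {("minivan", "American"): 0.0741,
         ("minivan", "Asian"):    0.1111,
         ("minivan", "European"): 0.1481}

# marginal probability: sum the joint over all values of Y
p_minivan = sum(p for (x, y), p in joint.items() if x == "minivan")

# conditional probability: joint divided by marginal
p_euro_given_minivan = joint[("minivan", "European")] / p_minivan
# p_minivan is about 0.3333, p_euro_given_minivan about 0.4443
```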


Example of marginal probability
marginal probability: p( X = minivan ) = 0.0741 + 0.1111 + 0.1481 = 0.3333

Fundamental in AI and ML 272


Example of conditional probability
Conditional probability: p( Y = European | X = minivan ) = 0.1481 / ( 0.0741 + 0.1111 + 0.1481 ) = 0.1481 / 0.3333 = 0.4443

[Figure: 3-D bar chart of the joint distribution — probability (0 to 0.2) on the vertical axis, X = model type (sedan, minivan, SUV, sport), Y = manufacturer (American, Asian, European)]
Fundamental in AI and ML 273
Continuous multivariate distribution

Same concepts of joint, marginal, and


conditional probabilities apply
(except use integrals)

Example: three-component Gaussian


mixture in two dimensions

Fundamental in AI and ML 274


Expected value
Given:
• A discrete random variable X, with possible values x = x1, x2, … xn
• Probabilities p( X = xi ) that X takes on the various values of xi
• A function yi = f( xi ) defined on X

The expected value of f is the probability-weighted “average” value of f( xi ):


E( f ) = ∑i p( xi ) ⋅ f( xi )

Fundamental in AI and ML 275


Example of expected value
• Process: game where one card is drawn from the deck
• If face card, dealer pays you $10
• If not a face card, you pay dealer $4
• Random variable X = { face card, not face card }
• p( face card ) = 3/13
• p( not face card ) = 10/13
• Function f( X ) is payout to you
• f( face card ) = 10
• f( not face card ) = -4
• Expected value of payout is:
E( f ) = ∑i p( xi ) ⋅ f( xi ) = (3/13) ⋅ 10 + (10/13) ⋅ (-4) = -10/13 ≈ -0.77
Fundamental in AI and ML 276
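The same expectation, computed directly (an illustrative sketch):

```python
# Expected payout of the card game, computed from the definition.
p = {"face card": 3 / 13, "not face card": 10 / 13}
f = {"face card": 10, "not face card": -4}      # payout to you

expected_payout = sum(p[x] * f[x] for x in p)   # -10/13, about -0.77
```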
Expected value in continuous spaces

E( f ) = ∫x = a → b p( x ) ⋅ f( x ) dx

Fundamental in AI and ML 277


Common forms of expected value (1)
• Mean (μ)
f( xi ) = xi ⇒ μ = E( f ) = ∑i p( xi ) ⋅ xi

• Average value of X = xi, taking into account probability of the various xi


• Most common measure of “center” of a distribution

• Compare to formula for mean of an actual sample

Fundamental in AI and ML 278


Common forms of expected value (2)
• Variance (σ²)

f( xi ) = ( xi - μ )² ⇒ σ² = ∑i p( xi ) ⋅ ( xi - μ )²
• Average value of squared deviation of X = xi from mean μ, taking into account probability
of the various xi
• Most common measure of “spread” of a distribution
• σ is the standard deviation

• Compare to formula for variance of an actual sample

Fundamental in AI and ML 279


Common forms of expected value (3)
Covariance
f( xi ) = ( xi - μx ), g( yi ) = ( yi - μy ) ⇒
cov( x, y ) = ∑i p( xi , yi ) ⋅ ( xi - μx ) ⋅ ( yi - μy )
• Measures tendency for x and y to deviate from their means in same (or opposite)
directions at same time

[Figure: scatterplots showing high (positive) covariance vs. no covariance]
• Compare to formula for covariance of actual samples

Fundamental in AI and ML 280


Correlation
• Pearson’s correlation coefficient is covariance normalized by the standard deviations of the two
variables

• Always lies in range -1 to 1


• Only reflects linear dependence between variables
[Figures: linear dependence with noise; linear dependence without noise; various nonlinear dependencies]


Fundamental in AI and ML 281
Complement rule
Given: event A, which can occur or not
p( not A ) = 1 - p( A )

A not A

areas represent relative probabilities


Fundamental in AI and ML 282
Product rule
Given: events A and B, which can co-occur (or not)
p( A, B ) = p( A | B ) ⋅ p( B )
(same expression given previously to define conditional probability)

(not A, not B)
A ( A, B ) B

Ω
(A, not B) (not A, B)

areas represent relative probabilities
Fundamental in AI and ML 283


Example of product rule

• Probability that a man has white hair (event A) and is over 65 (event B)
• p( B ) = 0.18
• p( A | B ) = 0.78
• p( A, B ) = p( A | B ) ⋅ p( B ) =
0.78 ⋅ 0.18 =
0.14

Fundamental in AI and ML 284


Rule of total probability
Given: events A and B, which can co-occur (or not)
p( A ) = p( A, B ) + p( A, not B )
(same expression given previously to define marginal probability)

(not A, not B)
A ( A, B ) B

Ω
(A, not B) (not A, B)

areas represent relative probabilities


Fundamental in AI and ML 285
Independence
Given: events A and B, which can co-occur (or not)

p( A | B ) = p( A ) or p( A, B ) = p( A ) ⋅ p( B )
Ω

(not A, not B) (not A, B)

B
(A, not B) A ( A, B )

areas represent relative probabilities


Fundamental in AI and ML 286
Examples of independence / dependence
• Independence:
• Outcomes on multiple rolls of a die
• Outcomes on multiple flips of a coin
• Height of two unrelated individuals
• Probability of getting a king on successive draws from a deck, if card from
each draw is replaced

• Dependence:
• Height of two related individuals
• Duration of successive eruptions of Old Faithful
• Probability of getting a king on successive draws from a deck, if card from each draw is not replaced
Fundamental in AI and ML 287
Example of independence vs. dependence
• Independence: All manufacturers have identical product mix. p( X = x | Y = y ) =
p( X = x ).
• Dependence: American manufacturers love SUVs, Europeans manufacturers
don’t.

Fundamental in AI and ML 288


Bayes rule
Posterior probability ∝ likelihood × prior probability

p( B | A ) = p( A | B ) ⋅ p( B ) / p( A )

(not A, not B)
A ( A, B ) B

Ω
(A, not B) (not A, B)

Fundamental in AI and ML 289


Example of Bayes rule
Marie is getting married tomorrow at an outdoor ceremony in the desert. In recent
years, it has rained only 5 days each year. Unfortunately, the weatherman is
forecasting rain for tomorrow. When it actually rains, the weatherman has
forecast rain 90% of the time. When it doesn't rain, he has forecast rain 10% of
the time. What is the probability it will rain on the day of Marie's wedding?
Event A: The weatherman has forecast rain.
Event B: It rains.
We know:
• p( B ) = 5 / 365 = 0.0137 [ It rains 5 days out of the year. ]
• p( not B ) = 360 / 365 = 0.9863
• p( A | B ) = 0.9 [ When it rains, the weatherman has forecast rain 90% of the time. ]
• p( A | not B ) = 0.1 [When it does not rain, the weatherman has forecast rain 10% of the
time.]
Fundamental in AI and ML 290
Example of Bayes rule, cont’d.
We want to know p( B | A ), the probability it will rain on the day of Marie's
wedding, given a forecast for rain by the weatherman.

The answer can be determined from Bayes rule:


1. p( B | A ) = p( A | B ) p( B ) / p( A )
2. p( A ) = p( A | B ) p( B ) + p( A | not B ) p( not B ) =
(0.9)(0.014) + (0.1)(0.986) = 0.111
3. p( B | A ) = (0.9)(0.0137) / 0.111 = 0.111

The result seems unintuitive but is correct. Even when the weatherman predicts rain, it rains only about 11% of the time.
Despite the weatherman's gloomy prediction, it is unlikely Marie will get rained on at her wedding.
Fundamental in AI and ML 291
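The whole calculation can be reproduced in a few lines (an illustrative sketch):

```python
# Bayes rule for the wedding-forecast problem.
p_rain = 5 / 365                 # prior: it rains 5 days a year
p_forecast_given_rain = 0.9      # likelihoods of a rain forecast
p_forecast_given_dry = 0.1

# rule of total probability: overall chance of a rain forecast
p_forecast = (p_forecast_given_rain * p_rain
              + p_forecast_given_dry * (1 - p_rain))

# posterior: chance of rain, given the forecast (about 0.111)
p_rain_given_forecast = p_forecast_given_rain * p_rain / p_forecast
```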
Probabilities: when to add, when to multiply

• ADD: When you want to allow for occurrence of any of several possible
outcomes of a single process. Comparable to logical OR.

• MULTIPLY: When you want to allow for simultaneous occurrence of particular


outcomes from more than one process. Comparable to logical AND.
• But only if the processes are independent.

Fundamental in AI and ML 292


Linear algebra applications

1) Operations on or between vectors and matrices


2) Coordinate transformations
3) Dimensionality reduction
4) Linear regression
5) Solution of linear systems of equations
6) Many others

Applications 1) – 4) are directly relevant to this course. Today we’ll start


with 1).
Fundamental in AI and ML 293
Why vectors and matrices?
vector
• Most common form of data organization
for machine learning is a 2D array, where
• rows represent samples (records, items,
datapoints)
• columns represent attributes (features, variables)

• Natural to think of each sample as a vector


of attributes, and whole array as a matrix

Matrix
Fundamental in AI and ML 294
Vectors
Definition: an n-tuple of values (usually real numbers).
• n referred to as the dimension of the vector
• n can be any positive integer, from 1 to infinity

Can be written in column form or row form


• Column form is conventional
• Vector elements referenced by subscript

Fundamental in AI and ML 295


Vectors
• Can think of a vector as:
• a point in space or
• a directed line segment with a magnitude and direction

Fundamental in AI and ML 296


Vector arithmetic
• Addition of two vectors
• add corresponding elements

• result is a vector

• Scalar multiplication of a vector


• multiply each element by scalar

• result is a vector
Fundamental in AI and ML 297
Vector arithmetic
• Dot product of two vectors
• multiply corresponding elements, then add products
• result is a scalar
• Dot product alternative form: x ⋅ y = || x || ⋅ || y || ⋅ cos(θ)

[Figure: vectors x and y with angle θ between them]

Fundamental in AI and ML 298


Matrices
• Definition: an m x n two-dimensional array of values (usually real numbers).
• m rows
• n columns
• Matrix referenced by two-element subscript
• first element in
subscript is row
• second element in
subscript is column
• example: A24 or a24 is element in second row, fourth column of A

Fundamental in AI and ML 299


Matrices
• A vector can be regarded as special case of a matrix, where one of matrix
dimensions = 1.
• Matrix transpose (denoted T)
• swap columns and rows
• row 1 becomes column 1, etc.
• m x n matrix becomes n x m matrix
• example:

Fundamental in AI and ML 300


Matrix arithmetic
Addition of two matrices
• matrices must be same size
• add corresponding elements:
cij = aij + bij
• result is a matrix of same size

Scalar multiplication of a matrix


• multiply each element by scalar:
bij = d ⋅ aij
• result is a matrix of same size

Fundamental in AI and ML 301


Matrix arithmetic
• Matrix-matrix multiplication
• vector-matrix multiplication just a special case

TO THE BOARD!!

• Multiplication is associative
A⋅(B⋅C)=(A⋅B)⋅C
• Multiplication is not commutative
A ⋅ B ≠ B ⋅ A (generally)
• Transposition rule:
( A ⋅ B )^T = B^T ⋅ A^T

Fundamental in AI and ML 302


Matrix arithmetic
• RULE: In any chain of matrix multiplications, the column dimension of one
matrix in the chain must match the row dimension of the following matrix in the
chain.
• Examples
A (3×5), B (5×5), C (3×1)

Right:
A ⋅ B ⋅ A^T    C^T ⋅ A ⋅ B    A^T ⋅ A ⋅ B    C ⋅ C^T ⋅ A

Wrong:
A ⋅ B ⋅ A    C ⋅ A ⋅ B    A ⋅ A^T ⋅ B    C^T ⋅ C ⋅ A

Fundamental in AI and ML 303
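The rule can be checked mechanically by walking the chain of shapes (an illustrative sketch; `chain_ok` is a hypothetical helper, not part of the course material):

```python
# A sketch of the dimension rule: a chain multiplies only if each matrix's
# column count matches the next matrix's row count.
def chain_ok(*shapes):
    """shapes: (rows, cols) pairs, in multiplication order."""
    return all(left[1] == right[0] for left, right in zip(shapes, shapes[1:]))

def T(shape):
    # transposing swaps the two dimensions
    return (shape[1], shape[0])

A, B, C = (3, 5), (5, 5), (3, 1)

right = [chain_ok(A, B, T(A)), chain_ok(T(C), A, B),
         chain_ok(T(A), A, B), chain_ok(C, T(C), A)]
wrong = [chain_ok(A, B, A), chain_ok(C, A, B),
         chain_ok(A, T(A), B), chain_ok(T(C), C, A)]
```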


Vector projection
Orthogonal projection of y onto x
• Can take place in any space of dimensionality ≥ 2
• Unit vector in direction of x is x / || x ||
• Length of projection of y in direction of x is || y || ⋅ cos(θ)
• Orthogonal projection of y onto x is the vector
projx( y ) = x ⋅ || y || ⋅ cos(θ) / || x || = [ ( x ⋅ y ) / || x ||² ] x (using dot product alternative form)

[Figure: vectors x and y with angle θ, and projx( y ) along x]

Fundamental in AI and ML 304
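The projection formula can be sketched directly from the dot-product form (illustrative helper functions, not course code):

```python
# Orthogonal projection of y onto x via the dot-product form.
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def project(y, x):
    """proj_x(y) = [ (x . y) / ||x||^2 ] * x"""
    scale = dot(x, y) / dot(x, x)
    return [scale * xi for xi in x]

y = [3.0, 4.0]
x = [1.0, 0.0]
p = project(y, x)                               # component of y along x
residual = [yi - pi for yi, pi in zip(y, p)]    # the part orthogonal to x
```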


Optimization theory topics

• Maximum likelihood
• Expectation maximization
• Gradient descent

Fundamental in AI and ML 305


Convex Optimization

Fundamental in AI and ML 306


General Optimization

Constraints do not need to be linear

Fundamental in AI and ML 307


Example

Fundamental in AI and ML 308


Example

Fundamental in AI and ML 309


Convex Optimization

Fundamental in AI and ML 310


Convex Optimization

Fundamental in AI and ML 311


Local and Global Optima

Fundamental in AI and ML 312


Local and Global Optima

Every local optimum of a convex optimization problem is a global optimum: suppose, for contradiction, that 𝑥 is a local minimum, 𝑦 is a global minimum, and 𝑓(𝑥) > 𝑓(𝑦)

Fundamental in AI and ML 313




Local Optima

Fundamental in AI and ML 316


Linear Programming Problems

Fundamental in AI and ML 317


Quadratic Programming Problems

Fundamental in AI and ML 318


Least Squares Regression

Fundamental in AI and ML 319


Least Squares Regression

Fundamental in AI and ML 320


Least Squares Regression

Fundamental in AI and ML 321


Projections

Fundamental in AI and ML 322


Projections

Fundamental in AI and ML 323


Projections

Fundamental in AI and ML 324


Projections

Fundamental in AI and ML 325


Smallest Enclosing Ball

Fundamental in AI and ML 326


Introduction to Statistical Inference

Fundamental in AI and ML 327


Statistical inference - Concepts
Statistical inference is the act of generalizing from a sample to a population with a calculated degree of certainty.

We want to learn about population parameters…

…but we can only calculate sample statistics


Fundamental in AI and ML 328
Parameters and Statistics

It is essential that we draw distinctions between parameters and statistics

Fundamental in AI and ML 329


Parameters and Statistics
We are going to illustrate inferential concepts by considering how well a given sample mean “x-bar” reflects an underlying population mean μ

Fundamental in AI and ML 330


Precision and reliability

• How precisely does a given sample mean (x-bar) reflect underlying population
mean (μ)? How reliable are our inferences?
• To answer these questions, we consider a simulation experiment in which we take all possible samples of size n from the population

Fundamental in AI and ML 331


Simulation Experiment

• Population (Figure A, next slide)


N = 10,000
Lognormal shape (positive skew)
μ = 173
σ = 30
• Take repeated SRSs, each of n = 10
• Calculate x-bar in each sample
• Plot x-bars (Figure B , next slide)

Fundamental in AI and ML 332


Simulation Experiment
A. Population (individual values) B. Sampling distribution of
x-bars

Fundamental in AI and ML 333


Simulation Experiment Results
1. Distribution B is more Normal than
distribution A ⇔ Central Limit
Theorem
2. Both distributions centered on µ ⇔
x-bar is unbiased estimator of μ
3. Distribution B is skinnier than
distribution A ⇔ related to “square
root law”

Fundamental in AI and ML 334
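A small-scale version of the simulation experiment, using hypothetical lognormal parameters chosen so the population has roughly μ = 173 and σ = 30 (the slides' exact population is not reproduced):

```python
# Small-scale version of the simulation experiment.
import random
import statistics

random.seed(0)
# lognormal population with mean ~173 and sd ~30 (parameters are assumptions)
population = [random.lognormvariate(5.14, 0.172) for _ in range(10_000)]

xbars = []
for _ in range(2_000):
    sample = random.sample(population, 10)      # one SRS of n = 10
    xbars.append(statistics.mean(sample))       # one point of distribution B

# xbars center on the population mean (unbiasedness) and are much less
# spread out than individual values (square root law: sigma / sqrt(10))
```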


Reiteration of Key Findings

• Finding 1 (central limit theorem): the sampling distribution of x-bar tends


toward Normality even when the population is not Normal (esp. strong in large
samples).
• Finding 2 (unbiasedness): the expected value of x-bar is μ
• Finding 3 is related to the square root law, which says:

Fundamental in AI and ML 335


Standard Deviation of the Mean

• The standard deviation of the sampling distribution of the mean has a special
name: standard error of the mean (denoted σxbar or SExbar)

• The square root law says:

Fundamental in AI and ML 336


Square Root Law Example: σ = 15

Quadrupling the sample size cuts the standard error of the mean in half

For n = 1 ⇒ SE = 15 / √1 = 15

For n = 4 ⇒ SE = 15 / √4 = 7.5

For n = 16 ⇒ SE = 15 / √16 = 3.75

Fundamental in AI and ML 337


Putting it together:
x ~ N(µ, SE)

• The sampling distribution of x-bar tends to be Normal with mean µ and σxbar = σ / √n
• Example: Let X represent Weschler Adult Intelligence Scores; X ~ N(100, 15).
▪ Take an SRS of n = 10
▪ σxbar = σ / √n = 15/√10 = 4.7
▪ Thus, xbar ~ N(100, 4.7)

Fundamental in AI and ML 338


Individual WAIS (population) and mean WAIS
when n = 10

Fundamental in AI and ML 339


68-95-99.7 rule applied to the SDM
▪ We’ve established xbar ~ N(100, 4.7).
Therefore,

• 68% of x-bars within


µ ± σxbar
= 100 ± 4.7
= 95.3 to 104.7

• 95% of x-bars within


µ ± 2 · σxbar
= 100 ± (2·4.7)
= 90.6 to 109.4
Fundamental in AI and ML 340
Law of Large Numbers
As a sample gets larger and larger, x-bar approaches μ. The figure demonstrates results from an experiment done in a population with μ = 173.3

Fundamental in AI and ML 341


Sampling Behavior of Counts and Proportions

Recall: a binomial random variable represents the random number of successes in n independent Bernoulli trials, each with probability of success p; notation X~b(n,p)

X~b(10,0.2) is shown on the next slide. Note that


μ=2

Re-express the count of successes as the proportion p-hat = x / n. For this re-expression,


μ = 0.2

Fundamental in AI and ML 342


Sampling Behavior of Counts and Proportions

Fundamental in AI and ML 343


Normal Approximation to the Binomial (“npq
rule”)

• When n is large, the binomial distribution approximates a Normal distribution


(“the Normal Approximation”)

• How large does the sample have to be to apply the Normal approximation? ⇒
One rule says that the Normal approximation applies when npq ≥ 5

Fundamental in AI and ML 344


Top figure: X~b(10,0.2)
npq = 10 · 0.2 · (1 – 0.2) = 1.6 ⇒ Normal approximation does not apply

Bottom figure: X~b(100,0.2)
npq = 100 · 0.2 · (1 − 0.2) = 16 ⇒ Normal approximation applies
Fundamental in AI and ML 345
Normal Approximation for a Binomial Count

When Normal approximation applies:

Fundamental in AI and ML 346


Normal Approximation for a Binomial Proportion

Fundamental in AI and ML 347


“p-hat” represents the sample proportion

Fundamental in AI and ML 348


Illustrative Example: Normal Approximation to
the Binomial
• Suppose the prevalence of a risk factor in a population is 20%
• Take an SRS of n = 100 from population
• The number of cases in the sample will follow a binomial distribution with n = 100 and p = 0.2

Fundamental in AI and ML 349


Illustrative Example: Normal Approximation to
the Binomial
• The Normal approximation for the count is:

• The Normal approximation for the proportion is:

Fundamental in AI and ML 350


Illustrative Example: Normal Approximation to
the Binomial
• Statement of problem: Recall X ~ N(20, 4). Suppose we observe 30 cases in a sample. What is the probability of observing at least 30 cases under these circumstances, i.e., Pr(X ≥ 30) = ?

• Standardize: z = (30 – 20) / 4 = 2.5

• Sketch: next slide

• Table B: Pr(Z ≥ 2.5) = 0.0062

Fundamental in AI and ML 351
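The z-score and tail probability can be computed without a table lookup using the error function (a sketch; `math.erf` stands in for Table B):

```python
# Normal approximation to Pr(X >= 30) for X ~ b(100, 0.2).
import math

def normal_cdf(z):
    """Standard Normal cumulative distribution via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p = 100, 0.2
mu = n * p                        # 20
sd = math.sqrt(n * p * (1 - p))   # sqrt(16) = 4

z = (30 - mu) / sd                # 2.5
p_at_least_30 = 1 - normal_cdf(z) # about 0.0062, matching Table B
```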


Illustrative Example: Normal Approximation to
the Binomial
• Binomial and superimposed Normal distributions

This model suggests 0.0062 of samples will see 30 or more cases.

Fundamental in AI and ML 352


Module 4

Fundamental in AI and ML 353


What is Machine Learning?

• Machine Learning is the study of methods for programming computers to learn.

• Building machines that automatically learn from experience.

• Machine learning usually refers to the changes in systems that perform tasks associated with artificial intelligence (AI). Such tasks involve recognition, diagnosis, planning, robot control, prediction, etc.

Fundamental in AI and ML 354


What is Machine Learning?

[Diagram: TRAINING DATA → Learning algorithm → Trained machine; a Query is put to the trained machine, which returns an Answer]
Fundamental in AI and ML 355
Steps in machine learning
1. Data collection.
2. Representation.
3. Modeling.
4. Estimation.
5. Validation.
6. Apply learned model to new “test” data

Fundamental in AI and ML 356


General structure of a learning system
[Diagram of a learning system: Data enters the Learning Process, guided by feed-back from a Teacher; the Problem Solving component produces Results, which go to Performance Evaluation]

Fundamental in AI and ML 357


Advantages of ML

1) Solving vision problems through statistical inference.

2) Intelligence from common-sense AI.

3) Reducing the constraints over time, achieving complete autonomy.

Fundamental in AI and ML 358


Disadvantages of ML
1. Application specific algorithms.

2. Real world problems have too many variables and sensors might be too noisy.

3. Computational complexity.

Fundamental in AI and ML 359


Types of machine Learning

1) Unsupervised Learning.

2) Semi-Supervised (reinforcement).

3) Supervised Learning.

Fundamental in AI and ML 360


Unsupervised Learning
1. Studies how input patterns can be represented to reflect the statistical structure
of the overall collection of input patterns
2. No outputs are used (unlike supervised learning and reinforcement learning)
3. Learner is provided only unlabeled data.
4. No feedback is provided from the environment

Fundamental in AI and ML 361


Unsupervised Learning

Advantage
• Most of the laws of science were developed through unsupervised learning.

Disadvantage
• The identification of the features itself is a complex problem in many situations.

Fundamental in AI and ML 362


Semi-Supervised (reinforcement)

• It lies in between supervised and unsupervised learning techniques in the amount of labeled and unlabeled data required for training.

• The goal is to reduce the amount of supervision required compared to supervised learning.

• At the same time, it improves the results of unsupervised clustering toward the expectations of the user.

Fundamental in AI and ML 363


Semi-Supervised (reinforcement)

•Semi-supervised learning is an area of increasing importance in Machine Learning.


•Automatic methods of collecting data make it more important than ever to develop
methods to make use of unlabeled data.

Fundamental in AI and ML 364


Supervised Learning
1) Analogical Learning.

2) Learning by Decision Tree.

Fundamental in AI and ML 365


Analogical Learning

In inductive learning, the learner is given instances of a problem and has to form a concept that supports most of the positive and no negative instances. This demonstrates that a number of training instances are required to form a concept in inductive learning.

Unlike this, analogical learning can be accomplished from a single example. For instance, given the following training instance, one has to determine the plural form of “bacillus”.

Fundamental in AI and ML 366


Analogical Learning

Fundamental in AI and ML 367


The main steps in analogical learning are now
formalized below.

1.Identifying Analogy: Identify the similarity between an experienced problem


instance and a new problem.

2. Determining the Mapping Function: Relevant parts of the experienced problem


are selected and the mapping is determined.

3. Apply Mapping Function: Apply the mapping function to transform the new
problem from the given domain to the target domain.

Fundamental in AI and ML 368


The main steps in analogical learning are now
formalized below.

4. Validation: The newly constructed solution is validated for its applicability through trial processes like theorem proving or simulation.

5. Learning: If the validation is found to work well, the new knowledge is encoded
and saved for future usage.

Fundamental in AI and ML 369


Analogical Learning

Fundamental in AI and ML 370


Learning by Decision Tree
A decision tree receives a set of attributes (or properties) of the objects as inputs
and yields a binary decision of true or false values as output. Decision trees, thus,
generally represent Boolean functions. Besides a range of {0,1} other non-binary
ranges of outputs are also allowed.

However, for the sake of simplicity, we presume the restriction to Boolean outputs.
Each node in a decision tree represents ‘a test of some attribute of the instance, and
each branch descending from that node corresponds to one of the possible values
for this attribute’.
Fundamental in AI and ML 371
Learning by Decision Tree
To illustrate the contribution of a decision tree, we consider a set of instances, some
of which result in a true value for the decision. Those instances are called positive
instances. On the other hand, when the resulting decision is false, we call the
instance ‘a negative instance’. We now consider the learning problem of a bird’s
flying. Suppose a child sees different instances of birds as tabulated below.

Fundamental in AI and ML 372


Learning by Decision Tree

Fundamental in AI and ML 373


Decision Tree example

Fundamental in AI and ML 374


Decision Tree example

Fundamental in AI and ML 375


Applications of Machine Learning – Drug
Discovery

Fundamental in AI and ML 376


Medical diagnosis

Photo MRI CT
Fundamental in AI and ML 377
Iris verification

Fundamental in AI and ML 378


Hand-written digits

Fundamental in AI and ML 379


Radar Imaging

Fundamental in AI and ML 380


Speech Recognition

Fundamental in AI and ML 381


Finger print

Fundamental in AI and ML 382


Signature Verification

Fundamental in AI and ML 383


Face Recognition

Fundamental in AI and ML 384


Target Recognition

Fundamental in AI and ML 385


Robotics vision

Fundamental in AI and ML 386


Traffic Monitoring

Fundamental in AI and ML 387


Introduction to Classification

Fundamental in AI and ML 388


Classification learning

Training phase: learning the classifier from the available labeled data (the ‘training set’)
Testing phase: applying the learned classifier to unseen data

Fundamental in AI and ML 389


Generating datasets
Methods:
• Holdout (2/3rd training, 1/3rd testing)
• Cross validation (n – fold)
• Divide into n parts
• Train on (n-1), test on last
• Repeat for different combinations
• Bootstrapping
• Select random samples (with replacement) to form the training set

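As a concrete sketch of the n-fold cross-validation procedure above (an illustrative implementation; the function name and the seeding are my own, not from the slides):

```python
import random

def n_fold_splits(n_samples, n_folds, seed=0):
    """Partition sample indices into n_folds parts; each part serves once
    as the test set while the remaining (n-1) parts form the training set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# Every index appears exactly once as test data across the n folds.
splits = list(n_fold_splits(12, 3))
```

Each (train, test) pair is disjoint, and the union of the test parts covers the whole dataset, which is exactly the n-fold property described above.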


Evaluating classifiers
Outcome:
•Accuracy
•Confusion matrix
•If cost-sensitive, the expected cost of classification (attribute test cost + misclassification cost), etc.



Decision Trees - Example
Example algorithms: ID3, C4.5, SPRINT, CART

Intermediate nodes: attributes
Edges: attribute-value tests
Leaf nodes: class predictions



Decision Tree schematic

Training data set with attributes a1 a2 a3 a4 a5 a6 and class labels X, Y, Z.
At an impure node, select the best attribute and continue splitting; a pure node becomes a leaf node labelled with its class (e.g., RED).
Decision Tree Issues
How to avoid overfitting?
Problem: Classifier performs well on training data, but fails
to give good results on test data

Example: splitting on a primary key gives pure nodes and good accuracy on the training data, but not on the test data
Alternatives:
• Pre-prune : Halting construction at a certain level of tree /
level of purity
• Post-prune : Remove a node if the error rate remains
the same without it. Repeat process for all nodes in the decision tree
Decision Tree Issues
How does the type of attribute affect the split?

• Discrete-valued: each branch corresponds to one value of the attribute

• Continuous-valued: each branch may cover a range of values
(e.g., splits may be age < 30, 30 ≤ age ≤ 50, age > 50)
(split points are chosen to maximize the gain / gain ratio)



Decision Tree Issues
How to determine the attribute for split?
Alternatives:
Information Gain

Gain(A, S) = Entropy(S) − Σj ( (|Sj| / |S|) · Entropy(Sj) )

Other options:
Gain ratio, etc.

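The information-gain formula above can be computed directly; a minimal sketch (the function names and the toy data are my own illustrations):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain(A, S) = Entropy(S) - sum_j (|S_j|/|S|) * Entropy(S_j),
    where S_j groups the instances by their value of attribute A."""
    n = len(labels)
    groups = {}
    for lbl, val in zip(labels, attribute_values):
        groups.setdefault(val, []).append(lbl)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = ['yes', 'yes', 'no', 'no']
perfect = information_gain(labels, ['a', 'a', 'b', 'b'])  # fully predictive
useless = information_gain(labels, ['a', 'b', 'a', 'b'])  # uninformative
```

A perfectly predictive attribute recovers the full entropy of the labels (1 bit here), while an uninformative attribute has zero gain, which is why gain ranks attributes for the split.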


Lazy learners
‘Lazy’: Do not create a model of the training instances in advance

When an instance arrives for testing, runs the algorithm to get the class prediction

Example: the K-nearest neighbour (K-NN) classifier

“One is known by the company one keeps”



K-NN classifier schematic
For a test instance,
1) Calculate distances from all training points
2) Find the K nearest neighbours (say, K = 3)
3) Assign the class label based on a majority vote

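The three steps above can be sketched as follows (an illustrative implementation, not from the slides; the toy training set is my own):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, test_point, k=3):
    """1) distances to all training points, 2) take the K nearest,
    3) majority vote over their class labels."""
    neighbours = sorted(train, key=lambda xy: dist(xy[0], test_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0, 0), 'red'), ((1, 0), 'red'), ((0, 1), 'red'),
         ((5, 5), 'blue'), ((6, 5), 'blue'), ((5, 6), 'blue')]
near_red = knn_predict(train, (0.5, 0.5), k=3)
near_blue = knn_predict(train, (5.5, 5.5), k=3)
```

"One is known by the company one keeps": the prediction is simply the label of the local neighbourhood.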


K-NN classifier Issues
How to determine distances between values of categorical attributes?

Alternatives:
1. Boolean distance (1 if same, 0 if different)

2. Differential grading (e.g., for weather, ‘drizzling’ and ‘rainy’ are closer than ‘rainy’ and ‘sunny’)

How to make real-valued prediction?

Alternative:
1. Average the values returned by K-nearest neighbours



K-NN classifier Issues
How to determine value of K?
Alternatives:
1. Determine K experimentally. The K that gives minimum
error is selected.

Any other modifications?
Alternatives:
1. Weight the attributes (or neighbours) when deciding the final label
2. Assign the maximum possible distance to missing values
3. K = 1 returns the class label of the nearest neighbour

How good is it?
• Susceptible to noisy values
• Slow because of the distance calculations
Alternate approaches:
• Compute distances to representative points only
• Partial distance
Decision Lists
• A sequence of Boolean functions that lead to a result

f(y) = c_j, if j = min { i | h_i(y) = 1 } exists;
f(y) = 0 otherwise



Decision List example

A test instance passes through a sequence of (h_i, c_i) units; the first unit whose test h_i fires outputs its class label c_i.


Decision List learning
From the set of candidate feature functions, select h_k, the feature with the highest utility
U_i = max { |P_i| − p_n · |N_i| , |N_i| − p_p · |P_i| }
where Q_i = P_i ∪ N_i is the set of examples covered by h_i (h_i = 1).
Append (h_k, c) to R, with c = 1 if |P_k| − p_n · |N_k| > |N_k| − p_p · |P_k|, else c = 0.
Then remove the covered examples, S’ = S − Q_k, and repeat.



Decision list Issues
What is the terminating condition?
1. Size of R (an upper threshold)
2. Qk = null
3. S’ contains examples of same class

Accuracy / Complexity tradeoff?


Size of R : Complexity (Length of the list)
S’ contains examples of both classes : Accuracy (Purity)

Pruning?
h_i is not required if:
1. c_i = c_(r+1)
2. There is no h_j (j > i) such that Q_i = Q_j



Probabilistic classifiers : Naïve Bayes
Based on Bayes rule
Naïve Bayes : Conditional independence assumption



Naïve Bayes Issues
Problems due to sparsity of data?
Problem : Probabilities for some values may be zero
Solution : Laplace smoothing

How are different types of attributes handled?


1. Discrete-valued : P ( X | Ci ) is according to formula
2. Continuous-valued : Assume a Gaussian distribution; plug in the mean and variance for the attribute to obtain P ( X | Ci )

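A minimal sketch of Naïve Bayes with Laplace smoothing for discrete attributes (the function names and the toy weather data are my own illustrations, not from the slides):

```python
from collections import Counter, defaultdict

def train_nb(data):
    """data: list of (feature_tuple, class_label). Returns class priors and
    per-class, per-position value counts for smoothed probability lookup."""
    priors = Counter(label for _, label in data)
    counts = defaultdict(Counter)  # (class, position) -> value counts
    for features, label in data:
        for pos, val in enumerate(features):
            counts[(label, pos)][val] += 1
    return priors, counts

def predict_nb(priors, counts, features, vocab_sizes):
    """P(C|X) is proportional to P(C) * product of P(x_i|C). Laplace smoothing
    adds 1 to every count, so unseen values get a small non-zero probability."""
    n = sum(priors.values())
    best, best_p = None, -1.0
    for label, prior_count in priors.items():
        p = prior_count / n
        for pos, val in enumerate(features):
            c = counts[(label, pos)][val]
            p *= (c + 1) / (prior_count + vocab_sizes[pos])
        if p > best_p:
            best, best_p = label, p
    return best

data = [(('sunny', 'hot'), 'no'), (('sunny', 'mild'), 'no'),
        (('rain', 'mild'), 'yes'), (('rain', 'cool'), 'yes')]
priors, counts = train_nb(data)
pred = predict_nb(priors, counts, ('rain', 'hot'), vocab_sizes=(2, 3))
```

Without smoothing, the unseen pair ('hot', 'yes') would zero out the whole product; with Laplace smoothing the dominant 'rain' evidence still wins.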


Probabilistic classifiers : BBN
• Bayesian belief networks : Attributes ARE dependent
• A directed acyclic graph and conditional probability tables

An added term captures the conditional probability between attributes:



BBN learning
(when network structure known)

• Input : Network topology of BBN


• Output : Calculate the entries in conditional probability table

(when network structure not known)

• Learn the structure as well (see ‘Learning structure of BBN’)



Learning structure of BBN
Use Naïve Bayes as the basis pattern (e.g., a class node Loan with attribute nodes Age, Marital status and Family status), then add edges as required.
Examples of algorithms: TAN, K2



Artificial Neural Networks
Based on biological concept of neurons
Structure of a fundamental unit of ANN:

inputs x_1 … x_n with weights w_1 … w_n, plus a threshold weight w_0
output: activation function p(v), where
p(v) = sgn(w_0 + w_1 x_1 + … + w_n x_n)
Perceptron learning algorithm
Initialize values of weights
Apply training instances and get output
Update weights according to the update rule: w_i ← w_i + η (t − o) x_i
η : learning rate
t : target output
o : observed output

Repeat till converges

Can represent linearly separable functions only

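The training loop above, with the standard update rule w_i ← w_i + η(t − o)x_i, can be sketched as follows (an illustrative implementation; Boolean AND, a linearly separable function, is my own choice of example):

```python
def perceptron_train(data, lr=0.1, epochs=100):
    """data: list of (inputs, target) with targets in {-1, +1}.
    Applies w_i <- w_i + lr * (t - o) * x_i (w[0] is the threshold weight,
    whose implicit input is 1), repeating until convergence."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        errors = 0
        for x, t in data:
            v = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if v >= 0 else -1  # sgn activation
            if o != t:
                errors += 1
                w[0] += lr * (t - o)
                for i, xi in enumerate(x):
                    w[i + 1] += lr * (t - o) * xi
        if errors == 0:  # converged: every instance classified correctly
            break
    return w

def perceptron_predict(w, x):
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) >= 0 else -1

# Boolean AND encoded with -1/+1 targets (linearly separable).
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = perceptron_train(and_data)
```

Because AND is linearly separable, the loop converges; for XOR it would not, which is the limitation stated above.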


Sigmoid perceptron
Basis for multilayer feedforward networks



Multilayer feedforward networks
Multilayer? Feedforward?

Input layer Hidden layer Output layer



Backpropagation
• Apply training instances as input and produce output
• Update weights in the ‘reverse’ direction as follows:



ANN Issues
Learning the structure of the network

1. Construct a complete network


2. Prune using heuristics:
• Remove edges with weights nearly zero
• Remove edges if the removal does not affect accuracy

What are the types of learning approaches?

Deterministic (batch): update the weights after summing the errors over all examples

Stochastic: update the weights after each example



ANN Issues
Choosing the learning factor
A small learning factor means many iterations are required.

A large learning factor means the learner may skip over the global minimum

Addition of momentum

But why?
Support vector machines
Basic idea: a “maximum-margin” classifier
Separating hyperplane: w·x + b = 0, with margin boundaries at +1 and −1
Support vectors: the training points that lie on the margin boundaries



SVM training

The training objective involves the dot product of x_k and x_l.
The Lagrangian multipliers are zero for data instances other than the support vectors.



Focussing on dot product
• For non-linearly separable points, we map them to a higher-dimensional (and linearly separable) space
• Computing this product can be time-consuming, so we use kernel functions



Kernel functions
Without having to know the non-linear mapping explicitly, apply a kernel function (e.g., a polynomial or RBF kernel).
This reduces the number of computations required to generate the Q_kl values.



Testing SVM

A test instance is fed to the trained SVM, which outputs its class label.



SVM Issues
What if n-classes are to be predicted?
Problem : SVMs deal with two-class classification
Solution : Train multiple SVMs, one per class (one-vs-rest)

SVMs are immune to the removal of non-support-vector points



Combining Classifiers
• ‘Ensemble’ learning
• Use a combination of models for prediction
• Bagging : Majority votes
• Boosting : Attention to the ‘weak’ instances
• Goal : An improved combined model



Bagging
From the total training dataset D, draw samples D_1 … D_n at random (bootstrap sampling with replacement may be used).
Learn a classifier model M_i from each sample, under the same classifier-learning scheme.
For a test instance, the class label is decided by a majority vote over the models M_1 … M_n.


Data preprocessing
• Attribute subset selection
• Select a subset of total attributes to reduce complexity

• Dimensionality reduction
• Transform instances into ‘smaller’ instances



Attribute subset selection
Information gain measure for attribute selection in decision trees

Stepwise forward / backward elimination of attributes



Dimensionality reduction
High dimensionality (the number of attributes of a data instance) means high computational complexity.
For an instance x in p dimensions, s = Wx, where W is a k × p transformation matrix, yields the instance in k dimensions (k < p).



Principal Component Analysis
• Computes k orthonormal vectors: the principal components
• These essentially provide a new set of axes, in decreasing order of variance
• (k × n reduced data) = (k × p eigenvector matrix, whose first k rows are the k PCs) × (p × n data)
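A minimal PCA sketch via eigendecomposition of the covariance matrix (assumes NumPy is available; the function name and the toy data are my own illustrations):

```python
import numpy as np

def pca_transform(X, k):
    """Project an n x p data matrix X onto its top-k principal components.
    The columns of W are orthonormal eigenvectors of the covariance matrix,
    sorted by decreasing eigenvalue (variance)."""
    Xc = X - X.mean(axis=0)                 # centre each attribute
    cov = np.cov(Xc, rowvar=False)          # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort: decreasing variance
    W = eigvecs[:, order[:k]]               # p x k transformation
    return Xc @ W                           # n x k reduced data

# Points lying near the line y = x: one component captures nearly all variance.
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]])
Z = pca_transform(X, 1)
```

The first axis of Z carries almost the entire spread of the data, which is exactly the “new axes in decreasing order of variance” idea above.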


Learning structure of BBN
• K2 Algorithm :
• Consider nodes in an order
• For each node, calculate utility to add an edge from previous nodes to this one

• TAN :
• Use Naïve Bayes as the baseline network
• Add different edges to the network based on utility

• Examples of algorithms: TAN, K2



Delta rule
The delta rule enables convergence to a best fit even when the points are not linearly separable
It uses gradient descent to search the hypothesis space



Clustering
• Document clustering
• Motivations
• Document representations
• Success criteria
• Clustering algorithms
• Partitional
• Hierarchical



What is clustering?
• Clustering: the process of grouping a set of objects into classes of similar objects

• Documents within a cluster should be similar.


• Documents from different clusters should be dissimilar.

• The commonest form of unsupervised learning


• Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of the examples is given
• A common and important task that finds many applications in IR and other places



A data set with clear cluster structure
• How would you design an algorithm
for finding the three clusters in this
case?



Applications of clustering in IR
• Whole corpus analysis/navigation
• Better user interface: search without typing
• For improving recall in search applications
• Better search results (like pseudo RF)
• For better navigation of search results
• Effective “user recall” will be higher
• For speeding up vector space retrieval
• Cluster-based retrieval gives faster search



Yahoo! Hierarchy isn’t clustering but is the kind
of output you want from clustering
www.yahoo.com/Science … (30)
Example subtree: agriculture (dairy, crops, forestry, agronomy), biology (botany, cell, evolution), physics (magnetism, relativity), CS (AI, HCI, courses), space (craft)


Google News: automatic clustering gives an
effective news presentation metaphor



Scatter/Gather: Cutting, Karger, and Pedersen



For visualizing a document collection and its
themes
Wise et al, “Visualizing the non-visual” PNNL
ThemeScapes, Cartia
[Mountain height = cluster size]



For improving search recall
• Cluster hypothesis - Documents in the same cluster behave similarly with respect
to relevance to information needs
• Therefore, to improve search recall:
• Cluster docs in corpus a priori
• When a query matches a doc D, also return other docs in the cluster containing D
• Hope if we do this: the query “car” will also return docs containing automobile
• Because clustering grouped together docs containing car with those containing automobile
• Why might this happen?



yippy.com – grouping search results



Issues for clustering
• Representation for clustering
• Document representation
• Vector space? Normalization?
• Centroids aren’t length normalized
• Need a notion of similarity/distance
• How many clusters?
• Fixed a priori?
• Completely data driven?
• Avoid “trivial” clusters - too large or small
• If a cluster's too large, then for navigation purposes you've wasted an extra user click without whittling down the
set of documents much.



Notion of similarity/distance

•Ideal: semantic similarity.


•Practical: term-statistical similarity
•We will use cosine similarity.
•Docs as vectors.
•For many algorithms, easier to think in terms of a distance (rather
than similarity) between docs.
•We will mostly speak of Euclidean distance
• But real implementations use cosine similarity

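Cosine similarity over term-count vectors can be sketched as follows (an illustrative implementation; the document vectors are my own toy data):

```python
from math import sqrt

def cosine_similarity(a, b):
    """cos(a, b) = a.b / (|a| |b|): 1 for identical directions,
    0 for documents sharing no terms (orthogonal vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Term-count vectors: d1 and d2 share vocabulary, d3 shares none.
d1, d2, d3 = [2, 1, 0], [4, 2, 0], [0, 0, 3]
s12 = cosine_similarity(d1, d2)
s13 = cosine_similarity(d1, d3)
```

Note that d2 is just d1 scaled by 2, so their cosine is exactly 1: cosine compares direction, not length, which is why it suits documents of different sizes.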


Clustering Algorithms

•Flat algorithms
• Usually start with a random (partial) partitioning
• Refine it iteratively
• K means clustering
• (Model based clustering)

•Hierarchical algorithms
• Bottom-up, agglomerative
• (Top-down, divisive)



Hard vs. soft clustering

•Hard clustering: Each document belongs to exactly one cluster


• More common and easier to do
•Soft clustering: A document can belong to more than one cluster.
• Makes more sense for applications like creating browsable hierarchies
• You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
• You can only do that with a soft clustering approach.
•We won’t do soft clustering today. See IIR 16.5, 18



Partitioning Algorithms
• Partitioning method: Construct a partition of n documents into a set of K
clusters
• Given: a set of documents and the number K
• Find: a partition of K clusters that optimizes the chosen partitioning criterion
• Globally optimal solution: would require exhaustively enumerating all partitions, which is intractable for many objective functions
• Effective heuristic methods: K-means and K-medoids algorithms



K-Means

•Assumes documents are real-valued vectors.


•Clusters based on centroids (aka the center of gravity or mean) of points in a
cluster, c:

•Reassignment of instances to clusters is based on distance to the current cluster


centroids.
• (Or one can equivalently phrase it in terms of similarities)



K-Means Algorithm

• Select K random docs {s1, s2,… sK} as seeds.


• Until clustering converges (or other stopping criterion):
• For each doc di:
• Assign di to the cluster cj such that dist(di, sj) is minimal.
• (Next, update the seeds to the centroid of each cluster)
• For each cluster cj
• sj = μ(cj)

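The seed / reassign / recompute loop above can be sketched as follows (a pure-Python illustration, not from the slides; the stopping criterion used here is "centroid positions don't change"):

```python
import random

def kmeans(docs, k, iters=100, seed=1):
    """docs: list of real-valued vectors (tuples). Returns the final
    centroids and the cluster assignment of each doc."""
    def dist2(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def centroid(cluster):  # mean (centre of gravity) of a cluster
        n = len(cluster)
        return tuple(sum(xs) / n for xs in zip(*cluster))

    centroids = list(random.Random(seed).sample(docs, k))  # K random seeds
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for d in docs:  # reassign each doc to its nearest centroid
            j = min(range(k), key=lambda i: dist2(d, centroids[i]))
            clusters[j].append(d)
        new = [centroid(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:  # centroid positions unchanged: converged
            break
        centroids = new
    assign = [min(range(k), key=lambda i: dist2(d, centroids[i])) for d in docs]
    return centroids, assign

docs = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (4.9, 5.1)]
cents, assign = kmeans(docs, 2)
```

With two well-separated groups the loop converges in a handful of iterations to the obvious partition, regardless of which two docs are drawn as seeds.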


K Means Example (K=2)

Pick seeds
Reassign clusters
Compute centroids
Reassign clusters
Compute centroids
Reassign clusters
Converged!



Termination conditions
• Several possibilities, e.g.,
• A fixed number of iterations.
• Doc partition unchanged.
• Centroid positions don’t change.

Does this mean that the docs in a cluster are unchanged?



Convergence

•Why should the K-means algorithm ever reach a fixed point?


• A state in which clusters don’t change.
•K-means is a special case of a general procedure known as the Expectation
Maximization (EM) algorithm.
• EM is known to converge.
• Number of iterations could be large.
• But in practice usually isn’t



Convergence of K-Means

•Define goodness measure of cluster k as sum of squared distances from


cluster centroid:
• Gk = Σi (di − ck)² (sum over all di in cluster k)
• G = Σk Gk
•Reassignment monotonically decreases G since each vector is assigned to the
closest centroid.



Convergence of K-Means
• Recomputation monotonically decreases each Gk since (mk is number of members
in cluster k):
• Σ (di − a)² reaches its minimum for:
• Σ −2(di − a) = 0
• Σ di = Σ a
• mk · a = Σ di
• a = (1/mk) Σ di = ck
• K-means typically converges quickly



Time Complexity
• Computing distance between two docs is O(M) where M is the dimensionality of
the vectors.
• Reassigning clusters: O(KN) distance computations, or O(KNM).
• Computing centroids: Each doc gets added once to some centroid: O(NM).
• Assume these two steps are each done once for I iterations: O(IKNM).



Seed Choice
• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
• Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
• Try out multiple starting points
• Initialize with the results of another method
Example showing sensitivity to seeds: with points {A, …, F}, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.



K-means issues, variations, etc.

•Recomputing the centroid after every assignment (rather than after all points are
re-assigned) can improve speed of convergence of K-means
•Assumes clusters are spherical in vector space
• Sensitive to coordinate changes, weighting etc.
•Disjoint and exhaustive
• Doesn’t have a notion of “outliers” by default
• But can add outlier filtering

Dhillon et al. ICDM 2002 – variation to fix some issues with small document clusters



How Many Clusters?
• Number of clusters K is given
• Partition n docs into predetermined number of clusters
• Finding the “right” number of clusters is part of the problem
• Given docs, partition into an “appropriate” number of subsets.
• E.g., for query results - ideal value of K not known up front - though UI may
impose limits.
• Can usually take an algorithm for one flavor and convert to the other.



K not specified in advance

• Say, the results of a query.


• Solve an optimization problem: penalize having lots of clusters
• application dependent, e.g., compressed summary of search results list.
• Tradeoff between having more clusters (better focus within each cluster)
and having too many clusters



K not specified in advance

•Given a clustering, define the Benefit for a doc to be the cosine similarity to
its centroid
•Define the Total Benefit to be the sum of the individual doc Benefits.

Why is there always a clustering of Total Benefit n?



Penalize lots of clusters

•For each cluster, we have a Cost C.


•Thus for a clustering with K clusters, the Total Cost is KC.
•Define the Value of a clustering to be =
Total Benefit - Total Cost.
•Find the clustering of highest value, over all choices of K.
• Total benefit increases with increasing K. But can stop when it doesn’t increase by “much”. The Cost term
enforces this.



Hierarchical Clustering

•Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.

animal → vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean)

•One approach: recursive application of a partitional clustering algorithm.



Dendrogram: Hierarchical Clustering
Clustering obtained by cutting the
dendrogram at a desired level: each
connected component forms a cluster.



Hierarchical Agglomerative Clustering (HAC)

•Starts with each doc in a separate cluster


•then repeatedly joins the closest pair of clusters, until there is only
one cluster.
•The history of merging forms a binary tree or hierarchy.

Note: the resulting clusters are still “hard” and induce a partition



Closest pair of clusters

•Many variants to defining closest pair of clusters


•Single-link
• Similarity of the most cosine-similar (single-link)
•Complete-link
• Similarity of the “furthest” points, the least cosine-similar
•Centroid
• Clusters whose centroids (centers of gravity) are the most cosine-similar
•Average-link
• Average cosine between pairs of elements



Single Link Agglomerative Clustering
• Use maximum similarity of pairs: sim(ci, cj) = max over x ∈ ci, y ∈ cj of sim(x, y)

• Can result in “straggly” (long and thin) clusters due to a chaining effect.
• After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
sim((ci ∪ cj), ck) = max( sim(ci, ck), sim(cj, ck) )



Single Link Example



Complete Link
Use minimum similarity of pairs: sim(ci, cj) = min over x ∈ ci, y ∈ cj of sim(x, y)

Makes “tighter,” spherical clusters that are typically preferable.

After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
sim((ci ∪ cj), ck) = min( sim(ci, ck), sim(cj, ck) )



Complete Link Example



Computational Complexity

•In the first iteration, all HAC methods need to compute the similarity of all pairs of N initial instances, which is O(N²).
•In each of the subsequent N−2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
•In order to maintain an overall O(N²) performance, computing similarity to each other cluster must be done in constant time.
• Often O(N³) if done naively, or O(N² log N) if done more cleverly



Group Average
•Similarity of two clusters = average similarity of all pairs within the merged cluster.

•Compromise between single and complete link.


•Two options:
• Averaged across all ordered pairs in the merged cluster
• Averaged over all pairs between the two original clusters
•No clear difference in efficacy



Computing Group Average Similarity
• Always maintain sum of vectors in each cluster.

• Compute similarity of clusters in constant time:



What Is A Good Clustering?

•Internal criterion: A good clustering will produce high quality clusters in


which:
• the intra-class (that is, intra-cluster) similarity is high
• the inter-class similarity is low
• The measured quality of a clustering depends on both the document
representation and the similarity measure used



External criteria for clustering quality

•Quality measured by its ability to discover some or all of the hidden patterns or
latent classes in gold standard data
•Assesses a clustering with respect to ground truth … requires labeled data
•Assume documents with C gold standard classes, while our clustering algorithms
produce K clusters, ω1, ω2, …, ωK with ni members.



External Evaluation of Cluster Quality

• Simple measure: purity, the ratio between the dominant class in the cluster πi and
the size of cluster ωi

• Biased because having n clusters (one document per cluster) maximizes purity


• Others are entropy of classes in clusters (or mutual information between classes and
clusters)



Purity example

Three clusters of labelled points (majority class counts 5, 4 and 3):

Cluster I: Purity = 1/6 · max(5, 1, 0) = 5/6

Cluster II: Purity = 1/6 · max(1, 4, 1) = 4/6

Cluster III: Purity = 1/5 · max(2, 0, 3) = 3/5
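The purity computation in the example can be sketched as follows (an illustrative implementation; the label encoding is my own):

```python
from collections import Counter

def purity(clusters):
    """clusters: list of lists of gold-standard class labels.
    Per-cluster purity = (count of dominant class) / cluster size;
    overall purity weights each cluster by its size."""
    total = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / total

# Dominant class counts 5, 4 and 3, matching the three-cluster example.
clusters = [['x'] * 5 + ['o'],          # Cluster I  : purity 5/6
            ['o'] * 4 + ['x', 'd'],     # Cluster II : purity 4/6
            ['d'] * 3 + ['x'] * 2]      # Cluster III: purity 3/5
overall = purity(clusters)
```

The overall purity is (5 + 4 + 3) / 17 = 12/17, which illustrates the bias noted above: splitting every document into its own cluster would drive this ratio to 1.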
The Rand Index measures agreement between pair decisions.
Here RI = 0.68



Rand index and Cluster F-measure

Compare with standard Precision and Recall:

People also define and use a cluster F-measure, which is probably a better measure.



Final word and resources
• In clustering, clusters are inferred from the data without human input
(unsupervised learning)
• However, in practice, it’s a bit less clear: there are many ways of influencing the
outcome of clustering: number of clusters, similarity measure, representation of
documents, . . .

• Resources
• IIR 16 except 16.5
• IIR 17.1–17.3



Continuous outcome (means)



Recall: Covariance



Interpreting Covariance

cov(X,Y) > 0 : X and Y are positively correlated

cov(X,Y) < 0 : X and Y are inversely correlated
cov(X,Y) = 0 : X and Y are uncorrelated (independence implies zero covariance, but zero covariance does not imply independence)



Correlation coefficient
Pearson’s Correlation Coefficient is standardized covariance (unitless):



Correlation

• Measures the relative strength of the linear relationship between two variables
• Unit-less
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker any linear relationship

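Pearson's r as standardised covariance can be computed directly; an illustrative sketch (the function name and toy data are my own):

```python
from math import sqrt

def pearson_r(xs, ys):
    """r = cov(X, Y) / (sd(X) * sd(Y)): unit-less, always in [-1, 1].
    The (n - 1) factors cancel, so sums of squares suffice."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

xs = [1, 2, 3, 4, 5]
r_pos = pearson_r(xs, [2, 4, 6, 8, 10])   # exact positive linear relation
r_neg = pearson_r(xs, [10, 8, 6, 4, 2])   # exact negative linear relation
```

An exact positive linear relationship gives r = 1 and an exact negative one gives r = -1, matching the interpretation bullets above.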


Scatter Plots of Data with Various Correlation
Coefficients
[Scatter plots illustrating correlation coefficients r = −1, r = −0.6, r = 0, r = +1, r = +0.3 and r = 0]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall


Linear Correlation
[Panels contrasting linear relationships with curvilinear relationships]


Linear Correlation
[Panels contrasting strong relationships with weak relationships]



Linear Correlation

[Panel showing no relationship]
Calculating by hand…



Simpler calculation formula…

r = SS_xy / √(SS_xx · SS_yy), where SS_xy is the numerator of the covariance and SS_xx, SS_yy are the numerators of the variances.


Distribution of the correlation coefficient:
The sample correlation coefficient follows a T-distribution with n-2 degrees of
freedom (since you have to estimate the standard error).

*Note: like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself, so substitute in the estimated r.



Continuous outcome (means)



Linear regression
In correlation, the two variables are treated as equals. In regression, one variable is
considered independent (=predictor) variable (X) and the other the dependent
(=outcome) variable Y.



What is “Linear”?
• Remember this:
• Y=mX+B?



What’s Slope?

A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.



Prediction

If you know something about X, this knowledge helps you predict something about
Y. (Sound familiar?…sound like conditional probabilities?)



Regression equation…

Expected value of y at a given level of x: E(y|x) = α + β·x



Predicted value for an individual…

y_i = α + β·x_i + random error_i
(α + β·x_i is fixed, exactly on the line; the random error follows a normal distribution)



Assumptions (or the fine print)
• Linear regression assumes that…
• 1. The relationship between X and Y is linear
• 2. Y is distributed normally at each value of X
• 3. The variance of Y at every value of X is the same (homogeneity of variances)
• 4. The observations are independent



Regression Picture
R² = SS_reg / SS_total
Least-squares estimation gives the line (β) that minimizes C², the squared distances of observations from the line.
SS_total = SS_reg + SS_residual, where:
• SS_total (A²): total squared distance of observations from the naïve mean of y (total variation)
• SS_reg (B²): distance from the regression line to the naïve mean of y (variability due to x, the regression)
• SS_residual (C²): variance around the regression line (additional variability not explained by x, which the least-squares method aims to minimize)
Recall example: cognitive function and vitamin D

•Hypothetical data loosely based on [1]; cross-sectional study of 100 middle-aged


and older European men.
• Cognitive function is measured by the Digit Symbol Substitution Test (DSST).

1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.
Distribution of vitamin D



Distribution of DSST
• Normally distributed
• Mean = 28 points
• Standard deviation = 10 points



Four hypothetical datasets

• I generated four hypothetical datasets, with increasing TRUE slopes (between vit D
and DSST):

• 0
• 0.5 points per 10 nmol/L
• 1.0 points per 10 nmol/L
• 1.5 points per 10 nmol/L



Dataset 1: no relationship



Dataset 2: weak relationship



Dataset 3: weak to moderate relationship



Dataset 4: moderate relationship



The “Best fit” line
• Regression equation:
• E(Yi) = 28 + 0·vit Di (in 10 nmol/L)



The “Best fit” line
Note how the line is a little
deceptive; it draws your eye,
making the relationship appear
stronger than it really is!

•Regression equation:
• E(Yi) = 26 + 0.5·vit Di (in 10 nmol/L)



The “Best fit” line
Regression equation:
E(Yi) = 22 + 1.0·vit Di (in 10 nmol/L)



The “Best fit” line
Regression equation:
E(Yi) = 20 + 1.5·vit Di (in 10 nmol/L)

Note: all the lines go through the point (63, 28)!



Estimating the intercept and slope: least squares
estimation



Resulting formulas…

Slope (beta coefficient) = SS_xy / SS_xx = cov(x, y) / var(x)

Intercept = ȳ − β̂ · x̄

The regression line always goes through the point (x̄, ȳ).
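The slope and intercept formulas can be computed directly; an illustrative sketch (the function name and the noise-free toy data are my own):

```python
def least_squares(xs, ys):
    """Slope = cov(x, y) / var(x) = SS_xy / SS_xx;
    intercept = mean(y) - slope * mean(x). The fitted line therefore
    always passes through (mean(x), mean(y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope  # (intercept, slope)

# Data generated from y = 3 + 2x with no noise is recovered exactly.
a, b = least_squares([0, 1, 2, 3], [3, 5, 7, 9])
```

With noise-free data the estimates equal the true intercept and slope; with noisy data they are the values that minimize the sum of squared residuals.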


Relationship with correlation

In correlation, the two variables are treated as equals. In regression, one variable is
considered independent (=predictor) variable (X) and the other the dependent
(=outcome) variable Y.



Example: dataset 4
• SDx = 33 nmol/L
• SDy= 10 points
• Cov(X,Y) = 163 points*nmol/L
• Beta = 163/33² ≈ 0.15 points per nmol/L
• = 1.5 points per 10 nmol/L
• r = 163/(10*33) = 0.49
Or
• r = 0.15 * (33/10) = 0.49
Significance testing…

Slope
Distribution of the estimated slope: β̂ ~ T(n−2)(β, s.e.(β̂))

H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (a linear relationship does exist)

T(n−2) = β̂ / s.e.(β̂)


Formula for the standard error of beta (you will
not have to calculate by hand!):



Example: dataset 4
• Standard error (beta) = 0.03
• T98 = 0.15/0.03 = 5, p<.0001

• 95% Confidence interval = 0.09 to 0.21



Residual Analysis: check assumptions

•The residual for observation i, ei, is the difference between its observed and predicted value
•Check the assumptions of regression by examining the residuals
• Examine for linearity assumption
• Examine for constant variance for all levels of X (homoscedasticity)
• Evaluate normal distribution assumption
• Evaluate independence assumption
•Graphical Analysis of Residuals
• Can plot residuals vs. X



Predicted values…

For vitamin D = 95 nmol/L (i.e., 9.5 in 10-nmol/L units): ŷ = 20 + 1.5 × 9.5 = 34.25 points



Residual = observed - predicted



Residual Analysis for Linearity

[Residual-vs-x plots: a curved pattern in the residuals indicates a non-linear relationship; a patternless horizontal band indicates linearity]
Residual Analysis for Homoscedasticity

[Residual plots: a fan-shaped spread indicates non-constant variance; an even band indicates constant variance (homoscedasticity)]



Residual Analysis for Independence

[Residual plots: a systematic pattern over X indicates the observations are not independent; random scatter indicates independence]


Residual plot, dataset 4



Multiple linear regression…

•What if age is a confounder here?


• Older men have lower vitamin D
• Older men have poorer cognition
•“Adjust” for age by putting age in the model:
• DSST score = intercept + slope1 × vitamin D + slope2 × age

Fundamental in AI and ML 525


2 predictors: age and vit D…

Fundamental in AI and ML 526


Different 3D view…

Fundamental in AI and ML 527


Fit a plane rather than a line…

On the plane, the slope for vitamin D is the same at every age; thus, the slope for vitamin D represents the effect of vitamin D when age is held constant.

Fundamental in AI and ML 528


Equation of the “Best fit” plane…
DSST score = 53 + 0.0039 × vitamin D (in 10 nmol/L) − 0.46 × age (in years)

P-value for vitamin D >>.05


P-value for age <.0001

Thus, relationship with vitamin D was due to confounding by age!

Fundamental in AI and ML 529


Multiple Linear Regression

•More than one predictor…

E(y) = α + β1*X + β2*W + β3*Z + …

Each regression coefficient is the amount of change in the outcome variable that
would be expected per one-unit change of the predictor, if all other variables in the
model were held constant.
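A minimal numpy sketch of "adjusting" for a confounder. The data here are synthetic (hypothetical numbers, built so that age drives both vitamin D and DSST, mirroring the slide's confounding story):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(60, 90, n)
vitd = 120 - 0.8 * age + rng.normal(0, 5, n)   # older men -> lower vitamin D
dsst = 80 - 0.5 * age + rng.normal(0, 3, n)    # older men -> poorer cognition

# Design matrix: intercept, vitamin D, age
X = np.column_stack([np.ones(n), vitd, age])
coef, *_ = np.linalg.lstsq(X, dsst, rcond=None)
intercept, b_vitd, b_age = coef

# With age in the model, the vitamin D slope should be near zero:
print(round(b_vitd, 2), round(b_age, 2))
```

Dropping the age column from `X` would make the vitamin D coefficient appear strongly positive, which is exactly the confounded result the slide warns about.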

Fundamental in AI and ML 530


Functions of multivariate analysis:
• Control for confounders
• Test for interactions between predictors (effect modification)
• Improve predictions

Fundamental in AI and ML 531


A t-test is linear regression!

•Divide vitamin D into two groups:


• Insufficient vitamin D (<50 nmol/L)
• Sufficient vitamin D (>=50 nmol/L), reference group
•We can evaluate these data with a t-test or a linear regression…

Fundamental in AI and ML 532


As a linear regression…

The intercept represents the mean value in the sufficient (reference) group. The slope represents the difference in means between the groups; the difference is significant.

Variable     Parameter Estimate   Standard Error   t Value   Pr > |t|

Intercept    40.07407             1.47511          27.17     <.0001
insuff       -7.53060             2.17493          -3.46     0.0008

Fundamental in AI and ML 533
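The equivalence can be demonstrated directly: regress the outcome on a 0/1 dummy and the intercept and slope are exactly the group mean and the mean difference. Group values below are invented for illustration:

```python
import numpy as np

suff  = np.array([42.0, 38.0, 41.0, 39.0, 45.0, 40.0])  # sufficient vit D
insuf = np.array([33.0, 35.0, 30.0, 36.0, 31.0])        # insufficient

y = np.concatenate([suff, insuf])
d = np.concatenate([np.zeros(len(suff)), np.ones(len(insuf))])  # dummy: 1 = insufficient

X = np.column_stack([np.ones_like(y), d])
(intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)

# Intercept = mean of the reference (sufficient) group,
# slope     = difference in group means
print(round(intercept, 2), round(slope, 2))   # → 40.83 -7.83
assert np.isclose(intercept, suff.mean())
assert np.isclose(slope, insuf.mean() - suff.mean())
```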
ANOVA is linear regression!

•Divide vitamin D into three groups:


• Deficient (<25 nmol/L)
• Insufficient (>=25 and <50 nmol/L)
• Sufficient (>=50 nmol/L), reference group
DSST = α (= mean of the sufficient group) + β1*(1 if insufficient) + β2*(1 if deficient)
This is called “dummy coding”—where multiple binary variables are created to
represent being in each category (or not) of a categorical variable

Fundamental in AI and ML 534


The picture…

Sufficient vs.
Insufficient

Sufficient vs.
Deficient

Fundamental in AI and ML 535


Results…
Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 40.07407 1.47817 27.11 <.0001


deficient 1 -9.87407 3.73950 -2.64 0.0096
insufficient 1 -6.87963 2.33719 -2.94 0.0041

Interpretation:
The deficient group has a mean DSST 9.87 points lower than the reference (sufficient) group.
The insufficient group has a mean DSST 6.87 points lower than the reference (sufficient) group.
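Dummy coding is easy to verify numerically. A sketch with invented group values, where least squares recovers the reference mean and each group's offset from it:

```python
import numpy as np

groups = ["sufficient"] * 4 + ["insufficient"] * 3 + ["deficient"] * 3
y = np.array([40.0, 41.0, 39.0, 40.0,  33.0, 34.0, 32.0,  30.0, 31.0, 29.0])

# One binary indicator per non-reference category
d_insuff = np.array([1.0 if g == "insufficient" else 0.0 for g in groups])
d_defic  = np.array([1.0 if g == "deficient" else 0.0 for g in groups])

X = np.column_stack([np.ones(len(y)), d_insuff, d_defic])
(alpha, b_insuff, b_defic), *_ = np.linalg.lstsq(X, y, rcond=None)

print(alpha, b_insuff, b_defic)
assert np.isclose(alpha, 40.0)      # mean of the sufficient (reference) group
assert np.isclose(b_insuff, -7.0)   # insufficient group is 7 points lower
assert np.isclose(b_defic, -10.0)   # deficient group is 10 points lower
```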

Fundamental in AI and ML 536


Other types of multivariate regression

• Multiple linear regression is for normally distributed outcomes

• Logistic regression is for binary outcomes

• Cox proportional hazards regression is used when time-to-event is the outcome

Fundamental in AI and ML 537


Common multivariate regression models
Continuous outcome (e.g. blood pressure) – Linear regression:
  blood pressure (mmHg) = α + βsalt*salt consumption (tsp/day) + βage*age (years) + βsmoker*ever smoker (yes=1/no=0)
  Coefficients are slopes: how much the outcome variable increases for every 1-unit increase in each predictor.

Binary outcome (e.g. high blood pressure, yes/no) – Logistic regression:
  ln(odds of high blood pressure) = α + βsalt*salt consumption (tsp/day) + βage*age (years) + βsmoker*ever smoker (yes=1/no=0)
  Coefficients give odds ratios: how much the odds of the outcome increase for every 1-unit increase in each predictor.

Time-to-event outcome (e.g. time to death) – Cox regression:
  ln(rate of death) = α + βsalt*salt consumption (tsp/day) + βage*age (years) + βsmoker*ever smoker (yes=1/no=0)
  Coefficients give hazard ratios: how much the rate of the outcome increases for every 1-unit increase in each predictor.

Fundamental in AI and ML 538


Multivariate regression pitfalls
● Multi-collinearity
● Residual confounding
● Overfitting

Fundamental in AI and ML 539


Multicollinearity

• Multicollinearity arises when two variables that measure the same thing or similar
things (e.g., weight and BMI) are both included in a multiple regression model;
they will, in effect, cancel each other out and generally destroy your model.

• Model building and diagnostics are tricky business!

Fundamental in AI and ML 540


Residual confounding

• You cannot completely wipe out confounding simply by adjusting for variables in
multiple regression unless variables are measured with zero error (which is usually
impossible).

• Example: meat eating and mortality

Fundamental in AI and ML 541


Overfitting

• In multivariate modeling, you can get highly significant but meaningless results if
you put too many predictors in the model.

• The model is fit perfectly to the quirks of your particular sample, but has no
predictive ability in a new sample

Fundamental in AI and ML 542


Overfitting: class data example

• I asked SAS to automatically find predictors of optimism in our class dataset.


Here’s the resulting linear regression model.

Parameter Standard
Variable Estimate Error Type II SS F Value Pr > F

Intercept 11.80175 2.98341 11.96067 15.65 0.0019


exercise -0.29106 0.09798 6.74569 8.83 0.0117
sleep -1.91592 0.39494 17.98818 23.53 0.0004
obama 1.73993 0.24352 39.01944 51.05 <.0001
Clinton -0.83128 0.17066 18.13489 23.73 0.0004
mathLove 0.45653 0.10668 13.99925 18.32 0.0011

• Exercise, sleep, and high ratings for Clinton are negatively related to optimism
(highly significant!) and high ratings for Obama and high love of math are
positively related to optimism (highly significant!).
Fundamental in AI and ML 543
Overfitting
• Pure noise variables still produce good R2
values if the model is overfitted. The
distribution of R2 values from a series of
simulated regression models containing only
noise variables.
• (Figure 1 from: Babyak, MA. What You See May Not Be What You Get: A Brief,
Nontechnical Introduction to Overfitting in Regression-Type Models.
Psychosomatic Medicine 66:411-421 (2004).)

Rule of thumb: You need at least 10 subjects for


each additional predictor variable in the
multivariate regression model.
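The Babyak point is easy to reproduce: regress a pure-noise outcome on many pure-noise predictors and R² still looks "good". A short simulation sketch (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 20, 15                      # 20 subjects, 15 noise predictors: far too many
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)             # outcome is unrelated to every predictor

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - resid.var() / y.var()
print(round(r2, 2))                # typically large, even though everything is noise
assert r2 > 0.3
```

With the 10-subjects-per-predictor rule of thumb, this design would call for roughly 150 subjects, not 20.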

Fundamental in AI and ML 544


Overfitting and Underfitting
• Don’t expect your favorite learner to always be best

Fundamental in AI and ML 545


Bias and Variance

•Bias – error caused because the model cannot represent the concept

•Variance – error caused because the learning algorithm overreacts to small changes
(noise) in the training data

TotalLoss = Bias + Variance (+ noise)
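The decomposition can be estimated empirically: refit a model on many fresh training samples drawn from a known generating process and measure bias² and variance at one test point. A sketch (the sine target, sample sizes, and polynomial degrees are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)
x_test = 0.3

def predictions(degree, reps=200):
    # Fit the model on `reps` independent training sets; collect predictions at x_test
    preds = []
    for _ in range(reps):
        x = rng.uniform(0, 1, 15)
        y = true_f(x) + rng.normal(0, 0.3, 15)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    return np.array(preds)

simple, complex_ = predictions(1), predictions(6)   # underfits vs. overfits
bias2_simple = (simple.mean() - true_f(x_test)) ** 2
bias2_complex = (complex_.mean() - true_f(x_test)) ** 2
print(round(bias2_simple, 3), round(simple.var(), 3))
print(round(bias2_complex, 3), round(complex_.var(), 3))
# Typically: the linear model shows high bias^2 / low variance,
# the degree-6 model low bias^2 / high variance.
```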

Fundamental in AI and ML 546


Visualizing Bias
Goal: produce a model that matches this concept

True
Concept

Fundamental in AI and ML 547


Visualizing Bias
• Goal: produce a model that matches this concept
• Training Data for the concept

Training Data

Fundamental in AI and ML 548


Visualizing Bias
• Goal: produce a model that matches this concept
• Training Data for concept
(Figure: a linear model fit to the training data; the regions where the model predicts + vs − miss parts of the true concept, and these systematic errors are the bias mistakes.)
Fundamental in AI and ML 549
Visualizing Variance
• Goal: produce a model that matches this concept
• New data, new model
(Figure: a linear model fit to a new training sample; the fit shifts, so the bias mistakes now fall in different places.)
Fundamental in AI and ML 550
Visualizing Variance
• Goal: produce a model that matches this concept
• New data, new model… the mistakes will vary
(Figure: each new training sample produces a different linear fit; the variation in predictions across fits is the variance.)
Fundamental in AI and ML 551
Another way to think about Bias & Variance

Fundamental in AI and ML 552


Bias and Variance: More Powerful Model
• Powerful Models can represent complex concepts
• No mistakes on this training sample!
(Figure: a complex decision boundary that separates every + from every − in the training data.)
Fundamental in AI and ML 553


Bias and Variance: More Powerful Model
• But get more data…
(Figure: on a larger sample, the same complex boundary misclassifies many points.)
• Not good!

Fundamental in AI and ML 554


Overfitting vs Underfitting
• Overfitting
 • Fitting the data too well
 • Features are noisy / uncorrelated to concept
 • Modeling process very sensitive (powerful)
 • Too much search
• Underfitting
 • Learning too little of the true concept
 • Features don't capture concept
 • Too much bias in model
 • Too little search to fit model

Fundamental in AI and ML 555


The Effect of Noise

Fundamental in AI and ML 556


The Power of a Model Building Process

• Weaker Modeling Process (higher bias)
 • Simple Model (e.g. linear)
 • Fixed sized Model (e.g. fixed # weights)
 • Small Feature Set (e.g. top 10 tokens)
 • Constrained Search (e.g. few iterations of gradient descent)

• More Powerful Modeling Process (higher variance)
 • Complex Model (e.g. high order polynomial)
 • Scalable Model (e.g. decision tree)
 • Large Feature Set (e.g. every token in data)
 • Unconstrained Search (e.g. exhaustive search)

Fundamental in AI and ML 557


Example of Under/Over-fitting

Fundamental in AI and ML 558


Ways to Control Decision Tree Learning

• Increase minToSplit
• Increase minGainToSplit
• Limit total number of Nodes
• Penalize complexity

Fundamental in AI and ML 559


Ways to Control Logistic Regression

• Adjust Step Size

• Adjust Iterations / stopping criteria of Gradient Descent

• Regularization
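The three knobs above can all be seen in a small numpy sketch of gradient-descent logistic regression (data and settings are invented for illustration):

```python
import numpy as np

def train_logreg(X, y, step=0.1, iters=1000, l2=0.0):
    """Gradient descent on the logistic loss with an optional L2 penalty."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):                       # iterations = a stopping criterion
        p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted probabilities
        grad = X.T @ (p - y) / len(y) + l2 * w   # loss gradient + L2 penalty term
        w -= step * grad                         # step size controls each update
    return w

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = (X[:, 1] + rng.normal(0, 0.5, 100) > 0).astype(float)

w_free = train_logreg(X, y)            # unregularized
w_reg  = train_logreg(X, y, l2=1.0)    # strong L2 pulls the weights toward 0
print(w_free, w_reg)
assert abs(w_reg[1]) < abs(w_free[1])
```

Shrinking the weights (larger `l2`), taking fewer iterations, or using a smaller step all act as brakes on how closely the model can chase the training data.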

Fundamental in AI and ML 560


Modeling to Balance Under & Overfitting

• Data
• Learning Algorithms
• Feature Sets
• Complexity of Concept
• Search and Computation

• Parameter sweeps!

Fundamental in AI and ML 561


Parameter Sweep

# optimize first parameter
for p in [ setting_certain_to_underfit, …, setting_certain_to_overfit ]:
    # do cross validation to estimate accuracy
# find the setting that balances overfitting & underfitting

# optimize second parameter
# etc…

# examine the parameters that seem best and adjust whatever you can…
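A concrete version of the sweep above, tuning one parameter (here, a ridge penalty λ, chosen only to keep the example short) with k-fold cross-validation. All data and settings are invented:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge regression: (X'X + lam*I)^-1 X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(X, y, lam, k=5):
    # k-fold cross-validation estimate of mean squared error
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[f] @ w - y[f]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = X[:, 0] + rng.normal(0, 1, 60)     # only the first feature matters

# Sweep from "certain to overfit" (tiny lam) to "certain to underfit" (huge lam)
lams = [1e-4, 1e-2, 1, 10, 100, 1000]
scores = {lam: cv_error(X, y, lam) for lam in lams}
best = min(scores, key=scores.get)
print(best, round(scores[best], 3))
```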
Fundamental in AI and ML 562
Types of Parameter Sweeps

• Optimize one parameter at a time
 • Optimize one, update, move on
 • Iterate a few times

• Gradient descent on meta-parameters
 • Start somewhere 'reasonable'
 • Computationally calculate gradient wrt change in parameters

Fundamental in AI and ML 563


Summary of Overfitting and Underfitting

•Bias / Variance tradeoff a primary challenge in machine learning

•Internalize: More powerful modeling is not always better

•Learn to identify overfitting and underfitting

•Tuning parameters & interpreting output correctly is key

Fundamental in AI and ML 564


Bayesian Decision Theory – Probability and Inference

•Result of tossing a coin is ∈ {Heads, Tails}

•Random var X ∈ {1, 0}
Bernoulli: P{X = x} = p0^x (1 − p0)^(1−x)
•Sample: X = {x^t}, t = 1…N
Estimation: p0 = #{Heads}/#{Tosses} = Σt x^t / N
•Prediction of next toss:
Heads if p0 > ½, Tails otherwise
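The estimation and prediction rule in two lines of Python (the sample of tosses is made up):

```python
tosses = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]    # made-up sample, 1 = Heads

p0 = sum(tosses) / len(tosses)             # MLE: #Heads / #Tosses
prediction = "Heads" if p0 > 0.5 else "Tails"
print(p0, prediction)                      # → 0.7 Heads
```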

Fundamental in AI and ML 565


Classification
• Credit scoring: Inputs are income and savings.
Output is low-risk vs high-risk
• Input: x = [x1, x2]ᵀ, Output: C ∈ {0, 1}
• Prediction: choose C = 1 if P(C = 1|x) > 0.5, else choose C = 0

Fundamental in AI and ML 566


Bayes’ Rule

P(C|x) = P(C) p(x|C) / p(x)   (posterior = prior × likelihood / evidence)

Fundamental in AI and ML 567


Bayes’ Rule: K>2 Classes

Fundamental in AI and ML 568


Losses and Risks

•Actions: αi
•Loss of αi when the state is Ck : λik
•Expected risk (Duda and Hart, 1973): R(αi|x) = Σk λik P(Ck|x)

Fundamental in AI and ML 569


Losses and Risks: 0/1 Loss

For minimum risk, choose the most probable class

Fundamental in AI and ML 570


Losses and Risks: Reject

Fundamental in AI and ML 571


Discriminant Functions

K decision regions R1,...,RK

Fundamental in AI and ML 572


K=2 Classes

•Dichotomizer (K=2) vs Polychotomizer (K>2)


•g(x) = g1(x) – g2(x)

•Log odds: log [ P(C1|x) / P(C2|x) ]

Fundamental in AI and ML 573


Utility Theory

•Prob of state k given evidence x: P(Sk|x)

•Utility of αi when state is k: Uik
•Expected utility: EU(αi|x) = Σk Uik P(Sk|x)

Fundamental in AI and ML 574


Value of Information

•Expected utility using x only

•Expected utility using x and new feature z

•z is useful if EU (x,z) > EU (x)

Fundamental in AI and ML 575


Bayesian Networks

•Aka graphical models, probabilistic networks


•Nodes are hypotheses (random vars) and the prob corresponds to our belief in the
truth of the hypothesis
•Arcs are direct influences between hypotheses
•The structure is represented as a directed acyclic graph (DAG)
•The parameters are the conditional probs in the arcs
•(Pearl, 1988, 2000; Jensen, 1996; Lauritzen, 1996)

Fundamental in AI and ML 576


Causes and Bayes’ Rule

Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?
(Figure: rain causes wet grass; the causal arc points from rain to wet grass, and diagnostic inference runs in the opposite direction.)

Fundamental in AI and ML 577


Causal vs Diagnostic Inference

Causal inference: If the sprinkler is on, what is the probability that the grass is wet?

P(W|S) = P(W|R,S) P(R|S) + P(W|~R,S) P(~R|S)
       = P(W|R,S) P(R) + P(W|~R,S) P(~R)
       = 0.95 × 0.4 + 0.9 × 0.6 = 0.92

Diagnostic inference: If the grass is wet, what is the probability that the sprinkler is on? P(S|W) = 0.35 > 0.2 = P(S)
P(S|R,W) = 0.21. Explaining away: knowing that it has rained decreases the probability that the sprinkler is on.
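The causal-inference arithmetic can be reproduced directly from the slide's numbers (R and S are independent in this network, so P(R|S) = P(R)):

```python
P_R = 0.4                 # P(rain)
P_W_given_R_S = 0.95      # P(wet | rain, sprinkler on)
P_W_given_notR_S = 0.9    # P(wet | no rain, sprinkler on)

# Sum out rain: P(W|S) = P(W|R,S) P(R) + P(W|~R,S) P(~R)
P_W_given_S = P_W_given_R_S * P_R + P_W_given_notR_S * (1 - P_R)
print(round(P_W_given_S, 2))   # → 0.92
```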
Fundamental in AI and ML 578
Bayesian Networks: Causes

Causal inference:
P(W|C) = P(W|R,S) P(R,S|C) +
P(W|~R,S) P(~R,S|C) +
P(W|R,~S) P(R,~S|C) +
P(W|~R,~S) P(~R,~S|C)

and use the fact that


P(R,S|C) = P(R|C) P(S|C)

Diagnostic: P(C|W ) = ?

Fundamental in AI and ML 579


Bayesian Nets: Local structure

P(F|C) = ?

Fundamental in AI and ML 580


Bayesian Networks: Inference
P (C,S,R,W,F) = P (C) P (S|C) P (R|C) P (W|R,S) P (F|R)

P (C,F) = ∑S ∑R ∑W P (C,S,R,W,F)

P (F|C) = P (C,F) / P(C) Not efficient!

Belief propagation (Pearl, 1988)


Junction trees (Lauritzen and Spiegelhalter, 1988)

Fundamental in AI and ML 581


Bayesian Networks: Classification

Bayes’ rule inverts the arc:


diagnostic

P (C | x )

Fundamental in AI and ML 582


Naive Bayes’ Classifier

Given C, xj are independent:

p(x|C) = p(x1|C) p(x2|C) ... p(xd|C)
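A minimal sketch of the naive Bayes factorization: given the class C, the joint likelihood is the product of per-feature likelihoods. The priors and conditional probabilities below are invented toy numbers:

```python
priors = {"spam": 0.3, "ham": 0.7}
# p(feature present | class), one entry per feature, assumed independent given C
likelihood = {
    "spam": {"offer": 0.8, "meeting": 0.1},
    "ham":  {"offer": 0.2, "meeting": 0.6},
}

def posterior(features):
    # Unnormalized p(C|x) ∝ p(C) * Π_j p(x_j|C), then normalize over classes
    scores = {}
    for c in priors:
        s = priors[c]
        for f in features:
            s *= likelihood[c][f]
        scores[c] = s
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

post = posterior(["offer"])
print(round(post["spam"], 3))   # → 0.632
assert post["spam"] > priors["spam"]   # seeing "offer" raises the spam posterior
```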

Fundamental in AI and ML 583


Influence Diagrams

(Figure: an influence diagram with a chance node, a decision node, and a utility node.)

Fundamental in AI and ML 584


Association Rules

•Association rule: X → Y
•Support (X → Y): P(X, Y) = #{transactions containing both X and Y} / #{transactions}

•Confidence (X → Y): P(Y|X) = #{transactions containing both X and Y} / #{transactions containing X}
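Support and confidence for a rule X → Y, computed over a toy set of transactions (the transactions are invented):

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(x, y):
    # P(X and Y): fraction of transactions containing both items
    both = sum(1 for t in transactions if x in t and y in t)
    return both / len(transactions)

def confidence(x, y):
    # P(Y | X): of the transactions containing X, the fraction also containing Y
    has_x = sum(1 for t in transactions if x in t)
    both = sum(1 for t in transactions if x in t and y in t)
    return both / has_x

print(support("bread", "milk"), confidence("bread", "milk"))  # → 0.6 0.75
```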

Fundamental in AI and ML 585


MODULE 5 Pre-training and
transfer learning

Fundamental in AI and ML 586


Real-life challenges in NLP tasks

•Deep learning methods are data-hungry


•>50K data items needed for training
•The distributions of the source and target data must be the same
•Labeled data in the target domain may be limited
•This problem is typically addressed with transfer learning

Fundamental in AI and ML 587


Transfer Learning Approaches

Fundamental in AI and ML 588


Transductive vs Inductive Transfer Learning

•Transductive transfer
• No labeled target domain data available
• Focus of most transfer research in NLP
• E.g. Domain adaptation

•Inductive transfer
• Labeled target domain data available
• Goal: improve performance on the target task by training on other task(s)
• Jointly training on >1 task (multi-task learning)
• Pre-training (e.g. word embeddings)

Fundamental in AI and ML 589


Pre-training – Word Embeddings

•Pre-trained word embeddings have been an essential component of most deep


learning models
•Problems with pre-trained word embeddings:
• Shallow approaches – trade expressivity for efficiency
• Learned word representations are not context sensitive
• No distinction between senses
• Only the first layer (embedding layer) of the model is pre-trained
• The rest of the model must be trained from scratch

Fundamental in AI and ML 590


Recent paradigm shift in pre-training for NLP
• Inspired by the success of models pre-trained on ImageNet in Computer Vision
(CV)
• The use of models pre-trained on ImageNet is now standard in CV

Fundamental in AI and ML 591


Recent paradigm shift in pre-training for NLP
• What is a good equivalent of an ImageNet task in NLP?
• Key desiderata:
• An ImageNet-like dataset should be sufficiently large, i.e. on the order of millions of
training examples.
• It should be representative of the problem space of the discipline.
• Contenders to that role:
• Reading Comprehension (SQuAD dataset, 100K Q-A pairs)
• Natural Language Inference (SNLI corpus, 570K sentence pairs)
• Machine Translation (WMT 2014, 40M French-English sentence pairs)
• Constituency parsing (millions of weakly labeled parses)
• Language Modeling (unlimited data, current benchmark dataset: 1B words
http://www.statmt.org/lm-benchmark/)
Fundamental in AI and ML 592
The case for Language Modeling

•LM captures many aspects of language:


• Long-term dependencies
• Hierarchical relations
• Sentiment
• Etc.
•Training data is unlimited

Fundamental in AI and ML 593


LM as pre-training – Approaches

•Embeddings from Language Models (ELMo) (Peters et al., 2018)


•Universal Language Model Fine-tuning (ULMFiT) (Howard and Ruder, 2018)
•OpenAI Transformer (Radford et al., 2018)

•Overview of the above approaches: https://thegradient.pub/nlp-imagenet/

Fundamental in AI and ML 594


Approach

•Inductive transfer setting:


• Given a static source task TS and any target task TT with TS ≠ TT, we would like to improve the performance on TT
•Pre-train a language model (LM) on a large general-domain corpus
•Fine-tune it on the target task using novel techniques
•The method is universal
• Works across tasks varying in document size, number and label type
• Uses single architecture and training process
• Requires no custom feature engineering or pre-processing
• Does not require additional in-domain documents or labels

Fundamental in AI and ML 595


Steps

Fundamental in AI and ML 596


Step 1: General domain LM pre-training
General domain LM pre-training

• Used the ASGD Weight-Dropped LSTM (AWD-LSTM, Merity et al. 2017)


• LM pre-trained on Wikitext-103 (103M words)
• Expensive, but performed only once
• Improves performance and convergence of downstream tasks

Fundamental in AI and ML 597


Step 2: Target task LM fine-tuning
•Data for the target task is likely from a different distribution
•Fine-tune the LM on data of the target task
•This stage converges faster
•Allows training a robust LM even on small datasets
•Two approaches to fine-tuning:
• Discriminative fine-tuning
• Slanted triangular learning rates

Fundamental in AI and ML 598


Discriminative fine-tuning
• Different layers capture different types of information, hence, they should be
fine-tuned to different extents
• Instead of using one learning rate for all layers, tune each layer with different
learning rates.
• Regular SGD: θ_t = θ_{t−1} − η · ∇_θ J(θ)

where η is the learning rate and ∇_θ J(θ) is the gradient with regard to the model's objective function

Fundamental in AI and ML 599


Discriminative fine-tuning (cont.)

•Discriminative fine-tuning:
• Split the parameters θ into {θ^1, …, θ^L}, where θ^l contains the parameters of the model at the 𝑙-th layer and 𝑳 is the number of layers of the model
• Obtain learning rates {η^1, …, η^L}, where η^l is the learning rate of the 𝑙-th layer
• SGD update with discriminative fine-tuning: θ_t^l = θ_{t−1}^l − η^l · ∇_{θ^l} J(θ)
• Fine-tune the last layer with η^L and use η^{l−1} = η^l / 2.6 as the learning rate for lower layers (the factor used by Howard and Ruder, 2018)

Fundamental in AI and ML 600


Slanted triangular learning rates (STLR)
• Intuition for adapting parameters to task-specific features:
• At the beginning of training: quickly converge to a suitable region of the
parameter space
• Later during training: refine the parameters

Fundamental in AI and ML 601


Step 3: Target task classifier fine-tuning
• Augment the pre-trained language model with two
additional linear layers:
• ReLU activations in the intermediate layer
• Softmax as the last layer
• These are the only layers whose parameters are learned from scratch
• First layer takes as input the pooled last hidden layer
states

Fundamental in AI and ML 602


Concat pooling

•Input sequences may consist of hundreds of words, so information may get lost if we only use the last hidden state of the model
•Concatenate the hidden state at the last time step hT of the document with:
 • Max-pooled representation of the hidden states
 • Mean-pooled representation of the hidden states
over as many time steps as fit in GPU memory, H = {h1, ..., hT}:

hc = [hT, maxpool(H), meanpool(H)], where [] is concatenation

Fundamental in AI and ML 603


Fine-tuning Procedure – Gradual Unfreezing

•Overly aggressive fine-tuning causes catastrophic forgetting


•Too cautious fine-tuning leads to slow convergence and overfitting
•Proposed approach: gradual unfreezing
• First unfreeze the last layer and fine-tune the unfrozen layer for one epoch
• Then unfreeze the next lower frozen layer and fine-tune all unfrozen layers
• Repeat until all layers have been unfrozen, fine-tuning until convergence in the last iteration
•The combination of discriminative fine-tuning, slanted triangular learning rates and
gradual unfreezing leads to best performance

Fundamental in AI and ML 604


Tasks and Datasets

•Sentiment analysis: binary (positive-negative) classification; the IMDb movie


review dataset
•Question classification: broad semantic categories; small TREC dataset
•Topic classification: large-scale AG news and DBPedia ontology datasets

Fundamental in AI and ML 605


Evaluation measure: error rate (lower is better)

Fundamental in AI and ML 606


Analysis

• Low-shot learning – training a model for a task with a small number of labeled samples

Fundamental in AI and ML 607


Different methods to fine-tune the classifier

Fundamental in AI and ML 608


“Full” vs ULMFiT

Fundamental in AI and ML 609


Conclusions

•ULMFiT is useful for a variety of tasks (different datasets, sizes, domains)


•Proposed approach to fine-tuning prevents catastrophic forgetting of knowledge
learned during pre-training
•Achieves good results even with 100 training data items
•Generally LM pre-training and task-specific fine-tuning will be useful for scenarios
where:
• Training data is limited
• New NLP tasks where no state-of-the-art architecture exists

Fundamental in AI and ML 610


Training a deep network by stacking RBMs

• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.
• Then do it again.

• It can be proved that each time we add another layer of features we improve a variational lower bound on the log probability of generating the training data.
 • The proof is complicated and only applies to unreal cases.
 • It is based on a neat equivalence between an RBM and an infinitely deep belief net (see lecture 14b).

Fundamental in AI and ML 611


Combining two RBMs to make a DBN
(Figure: train the bottom RBM on the data first; then train a second RBM on the binary states copied from the first RBM's hidden units; finally compose the two RBM models into a single DBN model.)
Fundamental in AI and ML 612


The generative model after learning 3 layers
To generate data:
1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time.
2. Perform a top-down pass to get states for all the other layers.
The lower level bottom-up connections are not part of the generative model. They are just used for inference.
(Figure: a DBN with layers data, h1, h2, h3.)
Fundamental in AI and ML 613
An aside: Averaging factorial distributions
• If you average some factorial distributions, you do NOT get a factorial distribution.
• In an RBM, the posterior over 4 hidden units is factorial for each visible vector.
 • Posterior for v1: 0.9, 0.9, 0.1, 0.1
 • Posterior for v2: 0.1, 0.1, 0.9, 0.9
 • Aggregated = 0.5, 0.5, 0.5, 0.5
• Consider the binary vector 1,1,0,0:
 • in the posterior for v1, p(1,1,0,0) = 0.9^4 = 0.43
 • in the posterior for v2, p(1,1,0,0) = 0.1^4 = .0001
 • in the aggregated posterior, p(1,1,0,0) = 0.215.
• If the aggregated posterior were factorial it would have p = 0.5^4

Fundamental in AI and ML 614


Why does greedy learning work?
• The weights, W, in the bottom level RBM define many different distributions:
p(v|h); p(h|v); p(v,h); p(h); p(v).

• We can express the RBM model as: p(v) = Σh p(h) p(v|h)

• If we leave p(v|h) alone and improve p(h), we will improve p(v).


• To improve p(h), we need it to be a better model than p(h;W) of the aggregated
posterior distribution over hidden vectors produced by applying W transpose to
the data.

Fundamental in AI and ML 615


Fine-tuning with a contrastive version of the
wake-sleep algorithm
After learning many layers of features, we can fine-tune the features to improve
generation.
1. Do a stochastic bottom-up pass
• Then adjust the top-down weights of lower layers to be good at reconstructing the feature
activities in the layer below.
2. Do a few iterations of sampling in the top level RBM
• Then adjust the weights in the top-level RBM using CD.
3. Do a stochastic top-down pass
• Then adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.

Fundamental in AI and ML 616


The DBN used for modeling the joint distribution
of MNIST digits and their labels

•The first two hidden layers are learned without using labels.
•The top layer is learned as an RBM for modeling the labels concatenated with the features in the second hidden layer.
•The weights are then fine-tuned to be a better generative model using contrastive wake-sleep.
(Figure: 28 × 28 pixel image → 500 units → 500 units → top-level RBM of 2000 units joined with the 10 labels.)

Fundamental in AI and ML 617


What happens during discriminative fine-tuning?
Learning Dynamics of Deep Nets

Before fine-tuning After fine-tuning

Fundamental in AI and ML 618


Effect of Unsupervised Pre-training

Fundamental in AI and ML 619


Effect of Depth

Fundamental in AI and ML 620


Trajectories of the learning in function space
(a 2-D visualization produced with t-SNE)

•Each point is a model in function


space
•Color = epoch
•Top: trajectories without
pre-training. Each trajectory
converges to a different local
min.
•Bottom: Trajectories with
pre-training.
•No overlap!
Fundamental in AI and ML 621
Why unsupervised pre-training makes sense
(Figure: two generative stories. Left: "stuff" causes the image and the label separately. Right: "stuff" causes the image through a high-bandwidth pathway and the label through a low-bandwidth one.)

If image-label pairs were generated the first way, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?

If image-label pairs are generated the second way, it makes sense to first learn to recover the stuff that caused the image by inverting the high bandwidth pathway.
Fundamental in AI and ML 622


Modeling real-valued data
• For images of digits, intermediate intensities can be represented as if they were probabilities by using "mean-field" logistic units.
 • We treat intermediate values as the probability that the pixel is inked.

• This will not work for real images.
 • In a real image, the intensity of a pixel is almost always, almost exactly the average of the neighboring pixels.
 • Mean-field logistic units cannot represent precise intermediate values.
Fundamental in AI and ML 623


A standard type of real-valued visible unit

• Model pixels as Gaussian variables. Alternating Gibbs sampling is still easy, though learning needs to be much slower.
(Figure: the energy E of a Gaussian visible unit is a parabolic containment function; the total input to the visible unit produces an energy gradient.)
Fundamental in AI and ML 624
Gaussian-Binary RBM’s
• Lots of people have failed to get these
to work properly. Its extremely hard to
learn tight variances for the visible units.
• It took a long time for us to figure out why it
is so hard to learn the visible variances.
• When sigma is small, we need many
more hidden units than visible units.
• This allows small weights to produce big When sigma is much less
top-down effects. than 1, the bottom-up
effects are too big and the
top-down effects are too
small.
Fundamental in AI and ML 625
Stepped sigmoid units: A neat way to implement
integer values
• Make many copies of a stochastic binary unit.
• All copies have the same weights and the same adaptive bias, b, but they
have different fixed offsets to the bias:

Fundamental in AI and ML 626


Fast approximations
• Contrastive divergence learning works well for the sum of stochastic
logistic units with offset biases. The noise variance is
• It also works for rectified linear units. These are much faster to compute
than the sum of many logistic units with different biases.

Fundamental in AI and ML 627


A nice property of rectified linear units

•If a relu has a bias of zero, it exhibits scale equivariance: R(a·x) = a·R(x) for a > 0


•This is a very nice property to have for images.

•It is like the equivariance to translation exhibited by


convolutional nets.

Fundamental in AI and ML 628


Another view of why layer-by-layer learning
works (Hinton, Osindero & Teh 2006)

• There is an unexpected equivalence between RBMs and directed networks with many layers that all share the same weight matrix.
 • This equivalence also gives insight into why contrastive divergence learning works.
• An RBM is actually just an infinitely deep sigmoid belief net with a lot of weight sharing.
 • The Markov chain we run when we want to sample from the equilibrium distribution of an RBM can be viewed as a sigmoid belief net.

Fundamental in AI and ML 629


An infinite sigmoid belief net that is equivalent to an RBM

•The distribution generated by this infinite directed net with replicated weights is the equilibrium distribution for a compatible pair of conditional distributions, p(v|h) and p(h|v), that are both defined by W
• A top-down pass of the directed net is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium.
• So this infinite directed net defines the same distribution as an RBM.
(Figure: an infinite stack of layers v0, h0, v1, h1, v2, … with tied weights.)

Fundamental in AI and ML 630
Inference in an infinite sigmoid belief net
• The variables in h0 are conditionally independent given v0.
 • Inference is trivial. Just multiply v0 by Wᵀ.
 • The model above h0 implements a complementary prior.
 • Multiplying v0 by Wᵀ gives the product of the likelihood term and the prior term.
 • The complementary prior cancels the explaining away.
• Inference in the directed net is exactly equivalent to letting an RBM settle to equilibrium starting at the data.
(Figure: the infinite stack v0, h0, v1, h1, v2, h2, … with tied weights.)
Fundamental in AI and ML 631
• The learning rule for a sigmoid belief net is: Δw_ij ∝ s_j (s_i − p_i)
• With replicated weights, the rule is applied at every tied layer; the intermediate terms cancel, and the state one layer up is an unbiased sample from p_i.
(Figure: the infinite stack v0, h0, v1, h1, v2, h2, … with tied weights.)
Fundamental in AI and ML 632
Learning a deep directed network
• First learn with all the weights tied. This is exactly equivalent to learning an RBM.
• Think of the symmetric connections as a shorthand notation for an infinite directed net with tied weights.
• We ought to use maximum likelihood learning, but we use CD1 as a shortcut.
(Figure: the infinite tied-weight stack collapses to a single RBM on v0 and h0.)
Fundamental in AI and ML 633
Learning a deep directed network
• Then freeze the first layer of weights in both directions and learn the remaining weights (still tied together).
 • This is equivalent to learning another RBM, using the aggregated posterior distribution of h0 as the data.
(Figure: the frozen first layer maps v0 to h0; the layers above form a new RBM on h0 and v1.)
Fundamental in AI and ML 634
What happens when the weights in higher layers become
different from the weights in the first layer?
• The higher layers no longer implement a complementary prior.
 • So performing inference using the frozen weights in the first layer is no longer correct.
 • But it's still pretty good.
 • Using this incorrect inference procedure gives a variational lower bound on the log probability of the data.
• The higher layers learn a prior that is closer to the aggregated posterior distribution of the first hidden layer.
 • This improves the network's model of the data.
 • Hinton, Osindero and Teh (2006) prove that this improvement is always bigger than the loss in the variational bound caused by using less accurate inference.

Fundamental in AI and ML 635


What is really happening in contrastive divergence learning?
Contrastive divergence learning in this RBM is equivalent to ignoring the small derivatives contributed by the tied weights in higher layers.
(Figure: the infinite tied-weight stack v0, h0, v1, h1, v2, h2, …)
Fundamental in AI and ML 636
Why is it OK to ignore the derivatives in higher
layers?
• When the weights are small, the Markov chain mixes fast, so the higher layers will be close to the equilibrium distribution (i.e., they will have "forgotten" the data vector).
• At equilibrium the derivatives must average to zero, because the current weights are a perfect model of the equilibrium distribution!
• As the weights grow we may need to run more iterations of CD. This allows CD to continue to be a good approximation to maximum likelihood.
• But for learning layers of features, it does not need to be a good approximation to maximum likelihood!

Fundamental in AI and ML 637
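The CD1 shortcut discussed above can be sketched in a few lines for a binary RBM. This is an illustrative implementation, not code from the course; the variable names and learning rate are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1):
    """One CD-1 update for a binary RBM.
    v0: (n_vis,) data vector; W: (n_vis, n_hid) weights; b, c: biases."""
    # Positive phase: sample hidden units given the data vector.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one step of Gibbs sampling (the CD-1 shortcut,
    # instead of running the chain to equilibrium).
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Approximate gradient: <v h>_data - <v h>_reconstruction.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c
```

Stacking such RBMs, each trained on the aggregated posterior of the layer below, is exactly the greedy layer-wise procedure on the preceding slides.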


Sentiment classification
What features of the text could help predict the number of stars? (e.g., using a log-linear model) How could we identify more? Are the features hard to compute? (syntax? sarcasm?)

Fundamental in AI and ML 638


Other text categorization tasks
• Is it spam? (see features)
• What medical billing code for this visit?
• What grade, as an answer to this essay question?
• Is it interesting to this user?
• News filtering; helpdesk routing
• Is it interesting to this NLP program?
• Skill classification for a digital assistant!
• If it’s Spanish, translate it from Spanish
• If it’s subjective, run the sentiment classifier
• If it’s an appointment, run information extraction
• Where should it be filed?
• Which mail folder? (work, friends, junk, urgent ...)
• Yahoo! / Open Directory / digital libraries
Fundamental in AI and ML 639
Measuring Performance
• Classification accuracy: What % of messages were classified correctly?
• Is this what we care about?

• Which system do you prefer?

Fundamental in AI and ML 640


Measuring Performance
• Precision = good messages kept / all messages kept
• Recall = good messages kept / all good messages
Move from high precision to high recall by deleting fewer messages (delete only if spamminess > high threshold)

Fundamental in AI and ML 641


Measuring Performance
(Figure: the precision–recall tradeoff curve as the threshold varies; we would prefer to be in the corner where both are high.)
• High threshold: all we keep is good, but we don't keep much. OK for search engines (users only want the top 10).
• Low threshold: keep all the good stuff, but a lot of the bad too. OK for spam filtering and legal search.
• The point where precision = recall is occasionally reported.
Fundamental in AI and ML 642
600.465 - Intro to NLP - J. Eisner 642
Measuring Performance
• Precision = good messages kept / all messages kept
• Recall = good messages kept / all good messages
• F-measure = ((precision⁻¹ + recall⁻¹) / 2)⁻¹, the harmonic mean of precision and recall
Move from high precision to high recall by deleting fewer messages (raise the threshold). Another system may be better for some users and worse for others (you can't tell just by comparing F-measures).
It is conventional to tune the system and threshold to optimize F-measure on dev data. But it's more informative to report the whole curve, since in real life the user should be able to pick a tradeoff point they like.
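The precision, recall and F-measure definitions above can be sketched directly as set arithmetic; the message sets in the example are invented for illustration:

```python
def precision_recall_f1(kept, good):
    """kept: set of messages the filter kept; good: set of truly good messages."""
    good_kept = kept & good
    precision = len(good_kept) / len(kept)
    recall = len(good_kept) / len(good)
    # F-measure is the harmonic mean: ((P^-1 + R^-1) / 2)^-1.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = precision_recall_f1(kept={1, 2, 3, 4}, good={1, 2, 3, 5})
# p == 0.75, r == 0.75, f == 0.75
```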
More than 2 classes
• Report F-measure for each class
• Show a confusion matrix (rows: true class; columns: predicted class; the diagonal cells count the correct predictions)
Fundamental in AI and ML 644
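A confusion matrix like the one described can be built in a few lines; the class labels here are illustrative:

```python
from collections import Counter

def confusion_matrix(true_labels, pred_labels, classes):
    """Rows: true class; columns: predicted class."""
    counts = Counter(zip(true_labels, pred_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]

m = confusion_matrix(["A", "A", "B", "B"], ["A", "B", "B", "B"], ["A", "B"])
# m == [[1, 1], [0, 2]]; the diagonal holds the correct predictions
```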
Fancier Performance Metrics
• For multi-way classifiers:
• Average accuracy (or precision or recall) of 2-way distinctions: Sports or not, News or not,
etc.
• Better, estimate the cost of different kinds of errors
• e.g., how bad is each of the following?
• putting Sports articles in the News section
• putting Fashion articles in the News section
• putting News articles in the Fashion section
• Now tune system to minimize total cost

• For ranking systems:


• Correlate with human rankings?
• Get active feedback from user?
• Measure user’s wasted time by tracking clicks?
Fundamental in AI and ML 645
Text Annotation Tasks

1.Classify the entire document


2.Classify individual word tokens

Fundamental in AI and ML 646


p(class | token in context)
(WSD)

Build a special classifier just for tokens of “plant”

Fundamental in AI and ML 647


slide courtesy of D. Yarowsky
p(class | token in context)
WSD for

Build a special classifier just for tokens of “sentence”

Fundamental in AI and ML 648




slide courtesy of D. Yarowsky
What features? Example: “word to left”
• Spelling correction using an n-gram language model (n ≥ 2) would use words to the left and right to help predict the true word.
• Similarly, an HMM would predict a word's class using the classes to the left and right.
• But we'd like to throw in all kinds of other features, too …
Fundamental in AI and ML 654
Feature Templates

generates a whole
bunch of features – use
data to find out which
ones work best
Fundamental in AI and ML 655
Feature Templates

This feature is
relatively
weak, but weak
features are
still useful,
especially since
very few
features will
fire in a given
context.

merged ranking
of all features
of all these types
Fundamental in AI and ML 656
Final decision list for lead (abbreviated)

List of all features,


ranked by their weight.
(These weights are for a simple
“decision list” model where the single
highest-weighted feature that fires gets
to make the decision all by itself.

However, a log-linear model, which


adds up the weights of all features that
fire, would be roughly similar.)

Fundamental in AI and ML 657


Text Annotation Tasks

1. Classify the entire document


2. Classify individual word tokens
3. Identify phrases (“chunking”)

Fundamental in AI and ML 658


Named Entity Recognition

CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per
round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit AMR,
immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase
took effect Thursday night and applies to most routes where it competes against discount carriers, such
as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

Fundamental in AI and ML 659


NE Types

Fundamental in AI and ML 660


Identifying phrases (chunking)
• Phrases that are useful for information extraction:
• Named entities
• As on previous slides
• Relationship phrases
• “said”, “according to”, …
• “was born in”, “hails from”, …
• “bought”, “hopes to acquire”, “formed a joint agreement with”, …
• Simple syntactic chunks (e.g., non-recursive NPs)
• “Syntactic chunking” sometimes done before (or instead of) parsing
• Also, “segmentation”: divide Chinese text into words (no spaces)
• So, how do we learn to mark phrases?
• Earlier, we built an FST to mark dates by inserting brackets
• But, it’s common to set this up as a tagging problem …
Fundamental in AI and ML 661
Reduce to a tagging problem …
• The IOB encoding (Ramshaw & Marcus 1995):
• B_X = “beginning” (first word of an X)
• I_X = “inside” (non-first word of an X)
• O = “outside” (not in any phrase)
• Does not allow overlapping or recursive phrases

…United Airlines said Friday it has increased …


B_ORG I_ORG O O O O O
… the move , spokesman Tim Wagner said …
O O O O B_PER I_PER O

What if this were tagged as B_ORG instead?


Fundamental in AI and ML 662
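A small sketch of decoding an IOB tag sequence back into phrase spans (assuming well-formed tags; the helper name is made up):

```python
def iob_to_spans(tags):
    """Convert an IOB tag sequence to (label, start, end) spans, end exclusive.
    Assumes well-formed tags: every I_X is preceded by B_X or I_X."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        if tag.startswith("B_") or tag == "O":
            if label is not None:
                spans.append((label, start, i))
                label = None
            if tag.startswith("B_"):
                start, label = i, tag[2:]
        # an I_X tag simply continues the current span
    return spans

tags = ["B_ORG", "I_ORG", "O", "O", "B_PER", "I_PER", "O"]
# iob_to_spans(tags) == [("ORG", 0, 2), ("PER", 4, 6)]
```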
Attributes and
Feature Templates for NER
POS tags and chunks Now predict NER tagseq
from earlier processing

A feature of this
tagseq might give a
positive or negative
weight to this
B_ORG in conjunction
with some subset of
the nearby
attributes

Or even faraway attributes:


B_ORG is more likely in a
sentence with a spokesman!
Fundamental in AI and ML 663
Slide adapted from Jim Martin
Alternative: CRF Chunker

• Log-linear model of p(structure | sentence):


• CRF tagging takes O(n) time (“Markov property”)
• Score each possible tag or tag bigram at each position
• Then use dynamic programming to find best overall tagging

• CRF chunking takes O(n²) time (“semi-Markov”)


• Score each possible chunk, e.g., using a BiRNN-CRF
• Then use dynamic programming to find best overall chunking
• Forward algorithm:
• for all j: for all i < j: for all labels L:
α(j) += α(i) * score(possible chunk from i to j with label L)

• CRF parsing takes O(n³) time


• Score each possible rule at each position
• Then use dynamic programming to find best overall parse
Fundamental in AI and ML 664
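The semi-Markov forward recursion in the pseudocode above can be sketched as follows; `score` is a stand-in for whatever chunk-scoring model (e.g., a BiRNN-CRF) supplies the non-negative chunk scores:

```python
def semi_markov_forward(n, labels, score, max_len=None):
    """alpha[j] = total score of all chunkings of the first j words.
    score(i, j, L) is a hypothetical non-negative score for a chunk
    spanning words i..j-1 with label L (O(n^2) chunks overall)."""
    max_len = max_len or n
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0  # empty prefix
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            for L in labels:
                alpha[j] += alpha[i] * score(i, j, L)
    return alpha[n]

# With score == 1 and a single label, alpha[n] counts the 2^(n-1)
# ways to segment n words into contiguous chunks.
total = semi_markov_forward(4, ["X"], lambda i, j, L: 1.0)
# total == 8.0
```

Replacing the sum with a max (and keeping backpointers) gives the Viterbi variant that finds the best overall chunking.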
Text Annotation Tasks
1. Classify the entire document
2. Classify individual word tokens
3. Identify phrases (“chunking”)
4. Syntactic annotation (parsing)

Fundamental in AI and ML 665


Parser Evaluation Metrics
• Runtime
• Exact match
• Is the parse 100% correct?
• Labeled precision, recall, F-measure of constituents
• Precision: You predicted (NP,5,8); was it right?
• Recall: (NP,5,8) was right; did you predict it?
• Easier versions:
• Unlabeled: Don’t worry about getting (NP,5,8) right, only (5,8)
• Short sentences: Only test on sentences of ≤ 15, ≤ 40, ≤ 100 words
• Dependency parsing: Labeled and unlabeled attachment accuracy
• Crossing brackets
• You predicted (…,5,8), but there was really a constituent (…,6,10)

Fundamental in AI and ML 666
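Labeled precision, recall and F-measure over constituents can be computed from sets of (label, start, end) triples; the gold and predicted trees in this example are invented:

```python
def constituent_prf(gold, predicted):
    """gold, predicted: sets of (label, start, end) constituents."""
    correct = gold & predicted
    precision = len(correct) / len(predicted)
    recall = len(correct) / len(gold)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

gold = {("NP", 5, 8), ("VP", 3, 8), ("S", 0, 8)}
pred = {("NP", 5, 8), ("VP", 4, 8), ("S", 0, 8)}
# precision == recall == 2/3: (NP,5,8) and (S,0,8) match, (VP,4,8) does not
```

For the unlabeled version, simply drop the label from each triple before comparing.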


Labeled Dependency Parsing
Raw sentence
He reckons the current account deficit will narrow to only 1.8 billion in September.
Part-of-speech tagging
POS-tagged sentence
He reckons the current account deficit will narrow to only 1.8 billion in September.
PRP VBZ DT JJ NN NN MD VB TO RB CD CD IN NNP .

Word dependency parsing
Word dependency parsed sentence:
He reckons the current account deficit will narrow to only 1.8 billion in September .
(Figure: dependency arcs over the sentence with labels SUBJ, COMP, S-COMP, SPEC, MOD, ROOT.)

Fundamental in AI and ML 667


Dependency Trees
1. Assign heads.
(Figure: phrase-structure parse of “The plan to swallow Wanda has been thrilling Otto”, with each node annotated with its head word, e.g. S[head=thrill], NP[head=plan], VP[head=swallow].)
Fundamental in AI and ML 668
Dependency Trees
2. Each word is the head of a whole connected subgraph.
(Figure: the same parse tree with head annotations; each head word dominates a connected region of the tree.)
Fundamental in AI and ML 669
Dependency Trees
2. (continued) Each word is the head of a whole connected subgraph.
(Figure: the same tree with the head annotations removed and each head word's connected region outlined.)
Fundamental in AI and ML 670
Dependency Trees
3. Just look at which words are related.
(Figure: the words connected head-to-dependent, e.g. plan → The, swallow → Wanda, thrilling → Otto.)
Fundamental in AI and ML 671
Dependency Trees
4. Optionally flatten the drawing.
• Shows which words modify (“depend on”) another word
• Each subtree of the dependency tree is still a constituent
• But not all of the original constituents are subtrees (e.g., VP)

The plan to swallow Wanda has been thrilling Otto.


• Easy to spot semantic relations (“who did what to whom?”)
• Good source of syntactic features for other tasks
• Easy to annotate (high agreement)
• Easy to evaluate (what % of words have correct parent?)
• Data available in more languages (Universal Dependencies)
Fundamental in AI and ML 672
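The dependency evaluation mentioned above ("what % of words have correct parent?") can be sketched as attachment accuracy; the head arrays and labels in the example are illustrative:

```python
def attachment_scores(gold_heads, pred_heads, gold_labels=None, pred_labels=None):
    """Heads are parent indices, one per word (0 = root).
    UAS = fraction of words with the correct parent; LAS also requires
    the correct dependency label."""
    n = len(gold_heads)
    uas = sum(g == p for g, p in zip(gold_heads, pred_heads)) / n
    if gold_labels is None:
        return uas
    las = sum(g == p and gl == pl
              for g, p, gl, pl in zip(gold_heads, pred_heads,
                                      gold_labels, pred_labels)) / n
    return uas, las

uas = attachment_scores([2, 0, 2, 2], [2, 0, 2, 1])
# uas == 0.75: three of four words got the right parent
```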
Text Annotation Tasks
1. Classify the entire document
2. Classify individual word tokens
3. Identify phrases (“chunking”)
4. Syntactic annotation (parsing)
5. Semantic annotation

Fundamental in AI and ML 673


Semantic Role Labeling (SRL)
• For each predicate (e.g., verb)
1. find its arguments (e.g., NPs)
2. determine their semantic roles

John drove Mary from Austin to Dallas in his Toyota Prius.

The hammer broke the window.

• agent: Actor of an action


• patient: Entity affected by the action
• source: Origin of the affected entity
• destination: Destination of the affected entity
• instrument: Tool used in performing action.
• beneficiary: Entity for whom action is performed

Fundamental in AI and ML 674


Might be helped by syntactic parsing first …
• Consider one verb at a time: “bit”
• Classify the role (if any) of each of the 3 NPs
(Figure: parse tree for “The big dog with the boy bit a girl”, with candidate NPs color-coded by role: not-a-role, agent, patient, source, destination, instrument, beneficiary.)
Fundamental in AI and ML 675
Parse tree paths as classification features
Path feature is V ↑ VP ↑ S ↓ NP, which tends to be associated with the agent role.
(Figure: the path from “bit” up to S and down to the subject NP “The big dog with the boy”.)

Fundamental in AI and ML 676


Parse tree paths as classification features
Path feature is V ↑ VP ↑ S ↓ NP ↓ PP ↓ NP, which tends to be associated with no role.
(Figure: the path from “bit” up to S and down into the PP-internal NP “the boy”.)

Fundamental in AI and ML 677


Head words as features
• Some roles prefer to be filled by certain kinds of NPs.
• This can give us useful features for classifying accurately:
• “John ate the spaghetti with chopsticks.” (instrument)
“John ate the spaghetti with meatballs.” (patient)
“John ate the spaghetti with Mary.”
• Instruments should be tools
• Patient of “eat” should be edible

• “John bought the car for $21K.” (instrument)


“John bought the car for Mary.” (beneficiary)
• Instrument of “buy” should be Money
• Beneficiaries should be animate (things with desires)

• “John drove Mary to school in the van”


“John drove the van to work with Mary.”
• What do you think?
Fundamental in AI and ML 678
Semantic roles can feed into further tasks
• Find the answer to a user’s question
• “Who” questions usually want Agents
• “What” question usually want Patients
• “How” and “with what” questions usually want Instruments
• “Where” questions frequently want Sources/Destinations.
• “For whom” questions usually want Beneficiaries
• “To whom” questions usually want Destinations
• Generate text
• Many languages have specific syntactic constructions that must or should be used for specific semantic roles.
• Word sense disambiguation, using selectional restrictions
• The bat ate the bug. (what kind of bat? what kind of bug?)
• Agents (particularly of “eat”) should be animate – animal bat, not baseball bat
• Patients of “eat” should be edible – animal bug, not software bug
• John fired the secretary.
John fired the rifle.
Patients of fire1 are different than patients of fire2
Fundamental in AI and ML 679
Other Current Semantic Annotation Tasks
(similar to SRL)
• PropBank – coarse-grained roles of verbs
• NomBank – similar, but for nouns
• TimeBank – temporal expressions
• FrameNet – fine-grained roles of any word

• Like finding the intent (= frame) for each “trigger phrase,” and filling its slots

Alarm(delay=“20 minutes”, content=“turn off the rice”)

intent slot filler slot filler

Fundamental in AI and ML 680


Face Detection Overview
• Problem Identification
• Methods Adopted
• Color Segmentation
• Morphological Processing
• Template Matching
• EigenFaces
• Gender Classification

Fundamental in AI and ML 681


Color Segmentation
• Use the color information
• Two approaches:
• Global thresholding in HSV and YCbCr space using a set of linear equations. A lot of overlap exists.
(Figure: clustering in (a) YCbCr and (b) V vs. H space. Red is non-face and blue is face data.)
Fundamental in AI and ML 682
Result of color segmentation using Global
thresholding

Fundamental in AI and ML 683


Overlap exists in RGB space also

Sample Blue vs. Green plot for face (blue) and non-face (red) data.

• Second approach involves RGB vector quantization


(Linde, Buzo, Gray)
• Use RGB as a 3-D vector and quantize the RGB space
for the face and non-face regions
Fundamental in AI and ML 684
Results from initial quantization
Common problems identified

Fundamental in AI and ML 685


Better Code book developed
Problem areas broken up

Fundamental in AI and ML 686


Face Detection
• Initial step of open and close performed to fill holes in faces
• Elongated objects removed by check on aspect ratio and small areas discarded

Fundamental in AI and ML 687


Morphological Processing
• Segmented and processed Image consists of all skin regions (face, arms and fists)
• Need to identify centers of all objects, including individual faces among
connected faces
• Repeated EROSION is done with specific structuring element

Fundamental in AI and ML 688


Face Detection

Previous state stored to identify new regions when split occurs

Superimposed mask image with eroded regions for estimate of centroids


Fundamental in AI and ML 689
Template Matching
• Data set has 145 male and 19 female faces
• Need to identify region around estimated centroids as face or non-face
• Multi-resolution was attempted. But distortion from neighboring faces gives false
values
• Smaller template has better result for all face shapes
• Template used is the mean face of 50x50 pixels

Mean face used for template matching
Fundamental in AI and ML 690
• Illumination problem identified
• Top has low lighting, lower part is brighter
• Left and right edges of images do not have people
• 2-D weighting function for correlation values applied

(Figures: the 2-D weighting function and a sample correlation result.)


Fundamental in AI and ML 691

Result from template matching and thresholding.


Rejected - Red ‘x’. Detected Faces – Green ‘x’
Fundamental in AI and ML 692
EigenFace based detection
• Decompose faces into set of basis images
• Different methods of candidate face extraction from image

EigenFaces

(a) (b)

Candidate face extraction: (a) conservative, (b) multi-resolution with side distortion
Fundamental in AI and ML 693
Sample result of eigenface. Red ‘+’ is from morphological
processing and green ‘O’ is from eigenfaces
• Minimum Distance between
vector of coefficients to that of
the face dataset was the metric.
• It depends very much on spatial
similarity to trained dataset
• Slight changes give incorrect
results
• Hence, only template matching
was used

Fundamental in AI and ML 694


Gender classification
• Eigenfaces and template matching for specific face features do not yield good
results
• Other features for specific females were used – the headband
• Template matching was performed for it
• Conservative estimate was done to prevent falsely identifying males as a female

The headband template
Fundamental in AI and ML 695
Table of results for training images
Approx. 95% accuracy with about 75 seconds runtime

Training Image | Final Score | Detect Score | Num Hits | Num Repeat | Num False Positives | Distance | Runtime | Bonus
1 | 22 | 21 | 21 | 0 | 0 | 15.9311 | 71.91 | 1
2 | 22 | 21 | 23 | 0 | 2 | 13.6109 | 82.96 | 1
3 | 25 | 25 | 25 | 0 | 0 |  9.8625 | 80.48 | 0
4 | 22 | 22 | 24 | 0 | 2 | 11.3667 | 81.15 | 0
5 | 24 | 24 | 24 | 0 | 0 |  9.5960 | 69.59 | 0
6 | 23 | 23 | 23 | 0 | 0 | 11.5512 | 80.25 | 0
7 | 22 | 21 | 21 | 0 | 0 | 14.1537 | 71.52 | 1

Fundamental in AI and ML 696


Training 1
Fundamental in AI and ML 697
Training 2
Fundamental in AI and ML 698
Training 3
Fundamental in AI and ML 699
Training 4
Fundamental in AI and ML 700
Training 5
Fundamental in AI and ML 701
Training 6
Fundamental in AI and ML 702
Training 7
Fundamental in AI and ML 703
Sentiment Analysis

Fundamental in AI and ML 704


What is SA & OM?

Identify the orientation of opinion in a piece of text

“The movie was fabulous!” (positive)
“The movie stars Mr. X” (objective)
“The movie was horrible!” (negative)

Can be generalized to a wider set of emotions

Fundamental in AI and ML 705


Motivation
• Knowing sentiment is a very natural ability of a human being.
Can a machine be trained to do it?

• SA aims at getting sentiment-related knowledge especially from the huge amount


of information on the internet

• Can be generally used to understand opinion in a set of documents

Fundamental in AI and ML 706


Tripod of Sentiment Analysis
Cognitive Science

Sentiment Analysis

Machine Learning Natural Language


Processing
Fundamental in AI and ML 707
Sentiment Analysis
(Figure: the facets of SA: challenges, lexical resources, approaches, subjectivity detection, and applications.)
Fundamental in AI and ML 708


SentiWordNet
• Lexical resource for sentiment analysis
• Built on the top of WordNet synsets
• Attaches sentiment-related information with synsets

Fundamental in AI and ML 709


Quantifying sentiment
Each term sense has a Positive, Negative and Objective score; the three scores sum to one.
(Figure: a term sense positioned along two axes: polarity (positive vs. negative) and subjectivity (subjective vs. objective).)
scores sum to one.

Fundamental in AI and ML 710


Building SentiWordNet

• Ln, Lo, Lp are the three seed sets


• Iteratively expand the seed sets through K steps
• Train the classifier for the expanded sets

Fundamental in AI and ML 711


Expansion of seed sets
Lp and Ln are expanded iteratively by following WordNet relations such as synonymy and antonymy.
The sets at the end of the kth step are called Tr(k,p) and Tr(k,n).
Tr(k,o) is the set that is not present in Tr(k,p) and Tr(k,n).

Fundamental in AI and ML 712


Committee of classifiers

• Train a committee of classifiers of different types and different K-values for the
given data
• Observations:
• Low values of K give high precision and low recall
• Accuracy in determining positivity or negativity, however, remains almost
constant

Fundamental in AI and ML 713


WordNet Affect
• Similar to SentiWordNet (an earlier work)
• WordNet-Affect: WordNet + annotated affective concepts in hierarchical order
• Hierarchy called ‘affective domain labels’
• behaviour
• personality
• cognitive state

Fundamental in AI and ML 714


Subjectivity detection
• Aim: To extract subjective portions of text
• Algorithm used: Minimum cut algorithm

Fundamental in AI and ML 715


Constructing the graph
• Build an undirected graph G with vertices {v1, v2…,s, t} (sentences and s,t)
• Add edges (s, vi) each with weight ind1(xi)
• Add edges (t, vi) each with weight ind2(xi)
• Add edges (vi, vk) with weight assoc (vi, vk)

• Partition cost: cost(C1, C2) = Σ_{x∈C1} ind2(x) + Σ_{x∈C2} ind1(x) + Σ_{xi∈C1, xk∈C2} assoc(xi, xk)

Fundamental in AI and ML 716
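Under the graph construction above, the cost of a candidate partition can be sketched directly; `ind1`, `ind2` and `assoc` are stand-ins for the individual and association score functions:

```python
def partition_cost(subjective, objective, ind1, ind2, assoc):
    """Cost of placing sentences in `subjective` (the s side) vs `objective`
    (the t side). Each sentence pays the individual score of the side it
    did NOT join, plus assoc penalties for every separated pair."""
    cost = sum(ind2(x) for x in subjective)   # cut edges to t
    cost += sum(ind1(x) for x in objective)   # cut edges from s
    cost += sum(assoc(x, y) for x in subjective for y in objective)
    return cost
```

A minimum-cut solver then finds the partition minimizing this cost in polynomial time, which is why the graph formulation is attractive.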


Example

Sample cuts:

Fundamental in AI and ML 717


Results (1/2)
• Naïve Bayes, no extraction : 82.8%
• Naïve Bayes, subjective extraction : 86.4%
• Naïve Bayes, ‘flipped experiment’ : 71 %

(Pipeline: Document → Subjectivity detector → Subjective / Objective portions → POLARITY CLASSIFIER.)
Fundamental in AI and ML 718


Results

Fundamental in AI and ML 719


Reinforcement Learning: What is learning?
• Learning types
• Supervised learning: a situation in which sample (input, output) pairs of the function to be learned can be perceived or are given. You can think of it as having a kind teacher.
• Reinforcement learning: the agent acts on its environment and receives some evaluation of its action (reinforcement), but is not told which action is the correct one to achieve its goal.

Fundamental in AI and ML 720


Reinforcement learning
• Task
Learn how to behave successfully to achieve a goal while interacting with an external
environment
• Learn via experiences!
• Examples
• Game playing: the player knows whether it won or lost, but not how it should have moved at each step.
• Control: a traffic system can measure the delay of cars, but does not know how to decrease it.

Fundamental in AI and ML 721


RL is learning from interaction

Fundamental in AI and ML 722


RL model
• Each percept (e) is enough to determine the State (the state is accessible)
• The agent can decompose the Reward component from a percept.
• The agent's task: to find an optimal policy, mapping states to actions, that maximizes a long-run measure of the reinforcement
• Think of reinforcement as reward
• Can be modeled as an MDP!

Fundamental in AI and ML 723


Review of the MDP model
• MDP model <S, A, T, R>
• S – set of states
• A – set of actions
• T(s,a,s’) = P(s’|s,a) – the probability of transition from s to s’ given action a
• R(s,a) – the expected reward for taking action a in state s
(Figure: the agent–environment loop: the agent observes state and reward and emits an action; trajectory s0 –a0→ s1 –a1→ s2 –a2→ s3 with rewards r0, r1, r2.)

Fundamental in AI and ML 724


Model-based vs. model-free approaches
• But we don't know anything about the environment model: the transition function T(s,a,s’)
• Here come two approaches
• Model-based RL: learn the model, and use it to derive the optimal policy. e.g., the adaptive dynamic programming (ADP) approach
• Model-free RL: derive the optimal policy without learning the model. e.g., LMS and temporal difference approaches
• Which one is better?

Fundamental in AI and ML 725


Passive learning vs. active learning
• Passive learning
• The agent simply watches the world going by and tries to learn the utilities of being in various states
• Active learning
• The agent does not simply watch, but also acts

Fundamental in AI and ML 726


Example environment

Fundamental in AI and ML 727


Passive learning scenario
• The agent sees sequences of state transitions and associated rewards
• The environment generates state transitions and the agent perceives them
e.g. (1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3)[+1]
(1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (1,2) → (1,1) → (2,1) → (3,1) → (4,1) → (4,2)[-1]
• Key idea: update the utility value using the given training sequences.

Fundamental in AI and ML 728


Adaptive dynamic programming (ADP) in passive learning
• Different from LMS and TD methods (model-free approaches): ADP is a model-based approach!
• The updating rule for passive learning: U(s) ← R(s) + Σ_s’ T(s,s’) U(s’)
• However, in an unknown environment, T is not given; the agent must learn T itself from experience with the environment.
• How to learn T?

Fundamental in AI and ML 729


Active learning
• An active agent must consider
• what actions to take
• what their outcomes may be (both for learning and for receiving rewards in the long run)
• Update utility equation: U(s) ← R(s) + max_a Σ_s’ T(s,a,s’) U(s’)
• Rule to choose the action: a = argmax_a Σ_s’ T(s,a,s’) U(s’)

Fundamental in AI and ML 730


How to learn the model?
• Use the transition tuple <s, a, s’, r> to learn T(s,a,s’) and R(s,a). That's supervised learning!
• Since the agent gets every transition (s, a, s’, r) directly, take (s,a)/s’ as an input/output example of the transition probability function T.
• Different techniques from supervised learning apply (see further reading for detail)
• Use r and T(s,a,s’) to learn R(s,a)

Fundamental in AI and ML 731
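The counting scheme described above can be sketched as maximum-likelihood estimation from observed transitions; the class name and method names are made up for illustration:

```python
from collections import defaultdict

class ModelLearner:
    """Maximum-likelihood estimates of T(s,a,s') and R(s,a) from experience."""
    def __init__(self):
        self.n_sa = defaultdict(int)      # visits to (s, a)
        self.n_sas = defaultdict(int)     # visits to (s, a, s')
        self.r_sum = defaultdict(float)   # summed rewards for (s, a)

    def observe(self, s, a, s2, r):
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s2)] += 1
        self.r_sum[(s, a)] += r

    def T(self, s, a, s2):
        # fraction of times action a in state s led to s'
        return self.n_sas[(s, a, s2)] / self.n_sa[(s, a)]

    def R(self, s, a):
        # average observed reward for taking a in s
        return self.r_sum[(s, a)] / self.n_sa[(s, a)]
```

These estimates plug directly into the ADP updating rule in place of the unknown true T and R.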


ADP approach: pros and cons
• Pros:
• The ADP algorithm converges far faster than LMS and temporal-difference learning, because it uses the information from the model of the environment.
• Cons:
• Intractable for large state spaces
• In each step, U is updated for all states
• Improve this by prioritized sweeping (see further reading for detail)

Fundamental in AI and ML 732


Exploration problem in active learning
• An action has two kinds of outcome
• Gain rewards on the current experience tuple (s,a,s’)
• Affect the percepts received, and hence the ability of the agent to learn
• When choosing an action there is a tradeoff between
• its immediate good (reflected in the current utility estimates from what has been learned)
• its long-term good (exploring more of the environment helps the agent behave optimally in the long run)
• Two extreme approaches
• "wacky" approach: act randomly, in the hope of eventually exploring the entire environment
• "greedy" approach: act to maximize utility under the current model estimate (see Figure 20.10)
• Just like humans in the real world! People need to decide between
• continuing in a comfortable existence
• or striking out into the unknown in the hope of discovering a new and better life

Fundamental in AI and ML 733


Exploration problem in active learning
• One kind of solution: the agent should be more wacky when it has little idea of the environment, and more greedy when it has a model that is close to being correct
• In a given state, the agent should give some weight to actions that it has not tried very often, while tending to avoid actions that are believed to be of low utility
• Implemented by an exploration function f(u,n):
• assigns a higher utility estimate to relatively unexplored action-state pairs
• Change the updating rule of the value function to use U⁺, the optimistic estimate of the utility

Fundamental in AI and ML 734
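A minimal sketch of such an exploration function f(u, n); the optimistic reward and the trial threshold are illustrative constants, not values from the course:

```python
def explore(u, n, r_plus=2.0, n_min=5):
    """Optimistic exploration function f(u, n): pretend the utility is the
    best possible value r_plus until the state-action pair has been tried
    at least n_min times; afterwards trust the learned estimate u."""
    return r_plus if n < n_min else u

# explore(0.3, 2) == 2.0   (barely tried: be optimistic, i.e. wacky)
# explore(0.3, 9) == 0.3   (well explored: trust the estimate, i.e. greedy)
```

Plugging f into the value update makes under-explored actions look attractive exactly until they have been sampled enough, which is the "wacky when ignorant, greedy when informed" behavior the slide describes.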


Generalization in Reinforcement Learning
• So far we assumed that all the functions learned by the agent (U, T, R, Q) are in tabular form, i.e., it is possible to enumerate the state and action spaces.
• Use generalization techniques to deal with large state or action spaces.
• Function approximation techniques

Fundamental in AI and ML 735


Thank you!
