
• Unit-8

• Chapter-18 & 27

Russell Stuart, Norvig Peter, Artificial Intelligence: A Modern Approach, 1995


LEARNING FROM
OBSERVATIONS
• All the "intelligence" in an agent has been built in by the
agent's designer. Whenever the designer has incomplete
knowledge of the environment that the agent will live in,
learning is the only way that the agent can acquire what it
needs to know. Learning thus provides autonomy.
• The field of machine learning - the subfield of AI
concerned with programs that learn from experience.
Chapter 18 introduces the basic design for learning agents,
and addresses the general problem of learning from
examples.



Forms of learning
A learning agent can be divided into four conceptual
components:
• The most important distinction is between the learning
element, which is responsible for making improvements,
and the performance element, which is responsible for
selecting external actions.
• The critic is designed to tell the learning element how well
the agent is doing.
• The problem generator is responsible for suggesting
actions that will lead to new and informative experiences.



Figure 18.1 A general model of learning agents.



The design of the learning element is affected by four major
issues:
• Which components of the performance element are to be
improved.
• What representation is used for those components.
• What feedback is available.
• What prior information is available.



Components
• 1. A direct mapping from conditions on the current state to actions.
• 2. A means to infer relevant properties of the world from the percept sequence.
• 3. Information about the way the world evolves and about the results of possible actions the agent can take.
• 4. Utility information indicating the desirability of world states.
• 5. Action-value information indicating the desirability of actions.
• 6. Goals that describe classes of states whose achievement maximizes the agent's utility.


Available feedback

• The type of feedback available for learning is usually the most important factor in determining the nature of the learning problem that the agent faces.
• The field of machine learning usually distinguishes three
cases: supervised, unsupervised, and reinforcement
learning
• supervised learning
• reinforcement learning
• unsupervised learning

Russell Stuart, Norvig Peter, Artificial Intelligence: A Modern Approach, 1995


Representation of components

• linear weighted polynomials for utility functions in game-playing programs;
• propositional and first-order logical sentences for all of the components in a logical agent;
• and probabilistic descriptions such as Bayesian networks for the inferential components of a decision-theoretic agent.


Prior knowledge

• The last major factor in the design of learning systems is the availability of prior knowledge.
• The majority of learning research in AI, computer science, and psychology has studied the case in which the agent begins with no knowledge at all about what it is trying to learn.


Bringing it all together
Each component of the performance element can be
described mathematically as a function:
• information about the way the world evolves can be described
as a function from a world state (the current state) to a world
state (the next state or states);
• a goal can be described as a function from a state to a Boolean
value (0 or 1) indicating whether the state satisfies the goal.
The key point is that all learning can be seen as learning the
representation of a function.
• We can choose which component of the performance element to
improve and how it is to be represented.



18.2 INDUCTIVE LEARNING

• In supervised learning, the learning element is given the correct (or approximately correct) value of the function for particular inputs, and changes its representation of the function to try to match the information provided by the feedback.
• An example is a pair (x, f(x)), where x is the input and f(x) is the output of the function applied to x.
• The task of pure inductive inference (or induction): given a collection of examples of f, return a function h that approximates f. The function h is called a hypothesis.


Figure 18.2 In (a) we have some example (input,output) pairs.
In (b), (c), and (d) we have three hypotheses for functions from which these examples could be drawn.



Inductive learning
• Simplest form: learn a function from examples
  – f is the target function
  – An example is a pair (x, f(x))
• Problem: find a hypothesis h such that h ≈ f, given a training set of examples
• (This is a highly simplified model of real learning:
  – Ignores prior knowledge
  – Assumes examples are given)

Inductive learning method

• Construct/adjust h to agree with f on training set
• (h is consistent if it agrees with f on all examples)
• E.g., curve fitting:


• Ockham’s razor: prefer the simplest hypothesis consistent with data
Ockham’s Razor

• How do we choose from among multiple consistent hypotheses?
• One answer is Ockham's razor: prefer the simplest hypothesis consistent with the data.
• Intuitively, this makes sense, because hypotheses that are no simpler than the data themselves are failing to extract any pattern from the data.


18.3 LEARNING DECISION TREES

• Decision tree induction is one of the simplest and yet most successful forms of learning algorithm. It serves as a good introduction to the area of inductive learning, and is easy to implement.
• We first describe the performance element, and then show how to learn it. Along the way, we will introduce many of the ideas and terms that appear in all areas of inductive learning.


Decision trees as performance
elements
• A decision tree takes as input an object or situation described
by a set of properties, and outputs a yes/no "decision."
• Decision trees therefore represent Boolean functions.
Functions with a larger range of outputs can also be
represented, but for simplicity we will usually stick to the
Boolean case.
• Each internal node in the tree corresponds to a test of the
value of one of the properties, and the branches from the node
are labelled with the possible values of the test.
• Each leaf node in the tree specifies the Boolean value to be
returned if that leaf is reached.
• the goal predicate (goal concept)



• A decision tree takes as input an object or situation
described by a set of attributes and returns a predicted
output value for the input.
• The input attributes can be discrete or continuous.
• The output value can also be discrete or continuous;
• learning a discrete-valued function is called classification learning;
• learning a continuous function is called regression.


Expressiveness of decision trees
• If decision trees correspond to sets of implication sentences, a
natural question is whether they can represent any set.
• Decision trees are fully expressive within the class of
propositional languages, that is, any Boolean function can be
written as a decision tree.



Expressiveness
• Decision trees can express any function of the input attributes
• E.g., for Boolean functions, truth table row → path to leaf

• Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples

• Prefer to find more compact decision trees


Hypothesis spaces
How many distinct decision trees with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n) (for each of the 2^n rows of the truth table, the function may return 0 or 1)

• E.g., with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 (more than 18 quintillion) trees

How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?

• Each attribute can be in (positive), in (negative), or out
  ⇒ 3^n distinct conjunctive hypotheses
• More expressive hypothesis space
  – increases chance that target function can be expressed
  – increases number of hypotheses consistent with training set
  ⇒ may get worse predictions
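The counts above are easy to verify directly:

```python
# Counting hypothesis spaces for n Boolean attributes, reproducing the
# numbers quoted on the slide.

def num_boolean_functions(n):
    # a truth table has 2**n rows, and each row can output 0 or 1
    return 2 ** (2 ** n)

def num_conjunctive_hypotheses(n):
    # each attribute is positive, negative, or absent from the conjunction
    return 3 ** n

trees = num_boolean_functions(6)       # 18,446,744,073,709,551,616
conjunctions = num_conjunctive_hypotheses(6)   # 729
```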
Figure 18.4 A decision tree for deciding whether to wait for a table.



Inducing decision trees from examples

• An example is described by the values of the attributes and the value of the goal predicate. We call the value of the goal predicate the classification of the example. If the goal predicate is true for some example, we call it a positive example; otherwise we call it a negative example. A set of examples X1, …, X12 for the restaurant domain is shown in Figure 18.5. The positive examples are ones where the goal WillWait is true (X1, X3, …) and negative examples are ones where it is false (X2, X5, …).
• The complete set of examples is called the training set.


Learning decision trees
Problem: decide whether to wait for a table at a restaurant, based on the
following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Attribute-based representations
• Examples described by attribute values (Boolean, discrete, continuous)
• E.g., situations where I will/won't wait for a table:

• Classification of examples is positive (T) or negative (F)



Example Attributes Goal
Alt Bar Fri Hun Pat Price Rain Res Type Est WillWait
X1 Yes No No Yes Some $$$ No Yes French 0-10 Yes
X2 Yes No No Yes Full $ No No Thai 30-60 No
X3 No Yes No No Some $ No No Burger 0-10 Yes
X4 Yes No Yes Yes Full $ No No Thai 10-30 Yes
X5 Yes No Yes No Full $$$ No Yes French >60 No
X6 No Yes No Yes Some $$ Yes Yes Italian 0-10 Yes
X7 No Yes No No None $ Yes No Burger 0-10 No
X8 No No No Yes Some $$ Yes Yes Thai 0-10 Yes
X9 No Yes Yes No Full $ Yes No Burger >60 No
X10 Yes Yes Yes Yes Full $$$ No Yes Italian 10-30 No
X11 No No No No None $ No No Thai 0-10 No
X12 Yes Yes Yes Yes Full $ No No Burger 30-60 Yes

Figure 18.5 Examples for the restaurant domain.
Choosing an attribute

• Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"

• Patrons? is a better choice


Figure 18.6 Splitting the examples by testing on attributes. In (a) Patrons is a good attribute to test first; in (b) Type is a poor one; and in (c) Hungry is a fairly good second test, given that Patrons is the first test.


Figure 18.8 The decision tree induced from the 12-example training set.



Decision tree learning
• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose "most significant" attribute as root of
(sub)tree
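The aim and idea above can be sketched as a toy recursive learner. The four-example dataset and attribute names below are illustrative stand-ins, not the full restaurant data:

```python
# Sketch of the decision-tree learning (DTL) idea: recursively choose
# the most significant attribute (highest information gain), split the
# examples on its values, and recurse on each subset.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    labels = [label for _, label in examples]
    remainder = 0.0
    for v in {inputs[attr] for inputs, _ in examples}:
        subset = [label for inputs, label in examples if inputs[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def dtl(examples, attrs):
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:              # uniformly classified: leaf
        return labels[0]
    if not attrs:                          # no attributes left: majority
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(examples, a))
    branches = {}
    for v in {inputs[best] for inputs, _ in examples}:
        subset = [(i, l) for i, l in examples if i[best] == v]
        branches[v] = dtl(subset, [a for a in attrs if a != best])
    return (best, branches)

def classify(tree, inputs):
    while isinstance(tree, tuple):         # descend until a leaf value
        attr, branches = tree
        tree = branches[inputs[attr]]
    return tree

# Toy training set: WillWait depends only on Patrons, not Rain.
data = [({"Patrons": "Some", "Rain": "No"}, True),
        ({"Patrons": "Some", "Rain": "Yes"}, True),
        ({"Patrons": "None", "Rain": "No"}, False),
        ({"Patrons": "None", "Rain": "Yes"}, False)]
tree = dtl(data, ["Patrons", "Rain"])
```

On this data the learner picks Patrons as the root (its gain is 1 bit versus 0 for Rain) and needs no further splits, illustrating the preference for small trees.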
Using information theory

• To implement Choose-Attribute in the DTL algorithm
• Information Content (Entropy):
  I(P(v1), …, P(vn)) = Σ_{i=1..n} −P(vi) log2 P(vi)
• For a training set containing p positive examples and n negative examples:
  I(p/(p+n), n/(p+n)) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
Information gain

• A chosen attribute A divides the training set E into subsets E1, …, Ev according to their values for A, where A has v distinct values.
  remainder(A) = Σ_{i=1..v} ((pi + ni)/(p + n)) · I(pi/(pi + ni), ni/(pi + ni))
• Information Gain (IG) or reduction in entropy from the attribute test:
  IG(A) = I(p/(p+n), n/(p+n)) − remainder(A)
• Choose the attribute with the largest IG
Information gain

For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit

Consider the attributes Patrons and Type (and others too):

  IG(Patrons) = 1 − [ (2/12)·I(0, 1) + (4/12)·I(1, 0) + (6/12)·I(2/6, 4/6) ] ≈ 0.541 bits

  IG(Type) = 1 − [ (2/12)·I(1/2, 1/2) + (2/12)·I(1/2, 1/2) + (4/12)·I(2/4, 2/4) + (4/12)·I(2/4, 2/4) ] = 0 bits

Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root
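These gain figures can be checked numerically. A short sketch, with the function name `I` mirroring the slides' notation:

```python
# Verify the information-gain numbers for the 12-example restaurant
# training set (p = n = 6, so the prior entropy is 1 bit).
from math import log2

def I(*ps):
    """Information content of a distribution, skipping zero terms."""
    return -sum(p * log2(p) for p in ps if p > 0)

# Patrons splits the examples into None (0+,2-), Some (4+,0-), Full (2+,4-)
gain_patrons = 1 - (2/12 * I(0, 1)
                    + 4/12 * I(1, 0)
                    + 6/12 * I(2/6, 4/6))        # ≈ 0.541 bits

# Type splits into French (1+,1-), Italian (1+,1-), Thai (2+,2-), Burger (2+,2-)
gain_type = 1 - (2/12 * I(1/2, 1/2) + 2/12 * I(1/2, 1/2)
                 + 4/12 * I(2/4, 2/4) + 4/12 * I(2/4, 2/4))   # 0 bits
```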
Example contd.
• Decision tree learned from the 12 examples:

• Substantially simpler than the “true” tree: a more complex hypothesis isn’t justified by the small amount of data
Assessing the performance of the
learning algorithm
• A learning algorithm is good if it produces hypotheses that
do a good job of predicting the classifications of unseen
examples.
1. How prediction quality can be estimated in advance.
2. A methodology for assessing prediction quality after the fact.
• Obviously, a prediction is good if it turns out to be true, so we can assess the quality of a hypothesis by checking its predictions against the correct classification once we know it. We do this on a set of examples known as the test set.


If we train on all our available examples, then we will have to
go out and get some more to test on, so often it is more
convenient to adopt the following methodology:
1. Collect a large set of examples.
2. Divide it into two disjoint sets: the training set and the
test set.
3. Use the learning algorithm with the training set as
examples to generate a hypothesis H.
4. Measure the percentage of examples in the test set that are
correctly classified by H.
5. Repeat steps 1 to 4 for different sizes of training sets and
different randomly selected training sets of each size.
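The methodology above can be sketched directly. The "learner" here is a deliberately trivial majority-class predictor (an assumption for illustration), and only one split is shown rather than the full sweep over training-set sizes:

```python
# Holdout evaluation: split the examples into disjoint training and
# test sets, induce a hypothesis H from the training set, and measure
# the percentage of test examples H classifies correctly.
import random
from collections import Counter

def majority_learner(train):
    """Trivial learner: always predict the most common training label."""
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority            # ignores the input entirely

def holdout_accuracy(examples, test_fraction=0.25, seed=0):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)          # step 2: random disjoint split
    cut = int(len(shuffled) * (1 - test_fraction))
    train, test = shuffled[:cut], shuffled[cut:]
    h = majority_learner(train)                    # step 3: generate hypothesis H
    return sum(h(x) == y for x, y in test) / len(test)   # step 4: measure accuracy

data = [(i, i % 3 == 0) for i in range(100)]       # roughly 2/3 negative
acc = holdout_accuracy(data)
```

Repeating this for several training-set sizes and averaging (step 5) yields the learning curve discussed below.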



Performance measurement
• How do we know that h ≈ f ?
1. Use theorems of computational/statistical learning theory
2. Try h on a new test set of examples
(use same distribution over example space as training set)
Learning curve = % correct on test set as a function of training set size
Noise and overfitting

• Whenever there is a large set of possible hypotheses, one has to be careful not to use the resulting freedom to find meaningless "regularity" in the data. This problem is called overfitting.
• A very general phenomenon, overfitting occurs even when the target function is not random. It afflicts every kind of learning algorithm, not just decision trees.


Decision tree pruning

• Here we present a simple technique called decision tree pruning to deal with the problem.
• Pruning works by preventing recursive splitting on attributes that are not clearly relevant, even when the data at that node in the tree are not uniformly classified.
• The question is: how do we detect an irrelevant attribute?
• Use information gain: if it is close to zero, the attribute is irrelevant.


Significance test
• How large a gain do we require in order to split on an attribute?
• We can answer this question by using a statistical significance test.
• Such a test begins by assuming that there is no underlying pattern
(the so-called null hypothesis).
• Then the actual data are analyzed to calculate the extent to which
they deviate from a perfect absence of pattern. If the degree of
deviation is statistically unlikely (usually taken to mean a 5%
probability or less), then that is considered to be good evidence for
the presence of a significant pattern in the data.
• The probabilities are calculated from standard distributions of the
amount of deviation one would expect to see in random sampling.
• Chi square test



Cross validation
• Cross-validation is another technique that reduces overfitting.
• It can be applied to any learning algorithm, not just decision tree learning.
• The basic idea is to estimate how well each hypothesis will predict unseen
data.
• This is done by setting aside some fraction of the known data and using it
to test the prediction performance of a hypothesis induced from the
remaining data.
• K-fold cross-validation means that you run k experiments, each time
setting aside a different 1/k of the data to test on, and average the results.
Popular values for k are 5 and 10. The extreme is k = n, also known as
leave-one-out cross-validation.
• Crossvalidation can be used in conjunction with any tree-construction
method (including pruning) in order to select a tree with good prediction
performance.
• To avoid peeking, we must then measure this performance with a new test set.
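The k-fold procedure described above can be sketched for any learner that returns a classify function; a trivial majority-class learner stands in here for illustration:

```python
# K-fold cross-validation: run k experiments, each time setting aside a
# different 1/k of the data to test on, and average the accuracies.
from collections import Counter

def majority_learner(train):
    """Trivial learner: always predict the most common training label."""
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority

def k_fold_accuracy(examples, k, learner):
    scores = []
    for i in range(k):
        test = examples[i::k]                          # every k-th example held out
        train = [e for j, e in enumerate(examples) if j % k != i]
        h = learner(train)
        scores.append(sum(h(x) == y for x, y in test) / len(test))
    return sum(scores) / k

data = [(i, i % 2 == 0) for i in range(20)]            # balanced labels
acc = k_fold_accuracy(data, 5, majority_learner)       # k = 5 folds
```

With k = n this becomes leave-one-out cross-validation; to compare tree variants (e.g., pruned vs. unpruned) one would pass a different `learner`.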



Broadening the applicability of
decision trees
• Missing data: In many domains, not all the attribute
values will be known for every example. The values might
have gone unrecorded, or they might be too expensive to
obtain. This gives rise to two problems:
1. How should one classify an object that is missing one of the test attributes?
2. How should one modify the information gain formula when some examples have unknown values for the attribute?


Solution to missing values

• One way to handle this case is to pretend that the example has all possible values for the attribute, but to weight each value according to its frequency among all of the examples that reach that node in the decision tree.
• The classification algorithm should follow all branches at any node for which a value is missing and should multiply the weights along each path.
• The information gain calculation must then be modified accordingly.


How to deal with Multivalued attributes?
When an attribute has many possible values, the information
gain measure gives an inappropriate indication of the
attribute's usefulness. In the extreme case, we could use an
attribute, such as RestaurantName, that has a different
value for every example. Then each subset of examples
would be a singleton with a unique classification, so the
information gain measure would have its highest value for
this attribute.
• Nonetheless, the attribute could be irrelevant or useless.
• One solution is to use the gain ratio



Continuous-valued outputs

• Continuous-valued output attributes: If we are trying to predict a numerical value, such as the price of a work of art, rather than a discrete classification, then we need a regression tree.


• Continuous and integer-valued input attributes: Continuous or integer-valued attributes, such as Height and Weight, have an infinite set of possible values. Rather than generate infinitely many branches, decision-tree learning algorithms typically find the split point that gives the highest information gain. For example, at a given node in the tree, it might be the case that testing on Weight > 160 gives the most information.
• Efficient dynamic programming methods exist for finding good split points, but this is still by far the most expensive part of real-world decision tree learning applications.


Outline for Ensemble Learning and
Boosting
• Ensemble Learning
– Bagging
– Boosting
Ensemble Learning

• Sometimes each learning technique yields a different hypothesis, but no perfect hypothesis.
• Could we combine several imperfect hypotheses into a better hypothesis?
• The idea of ensemble learning methods is to select a whole collection, or ensemble, of hypotheses from the hypothesis space and combine their predictions.
• For example, we might generate a hundred different decision trees from the same training set and have them vote on the best classification for a new example.
Ensemble Learning

• Analogies:
– Elections combine voters’ choices to pick a good
candidate
– Committees combine experts’ opinions to make better
decisions
• Intuitions:
– Individuals often make mistakes, but the “majority” is
less likely to make mistakes.
– Individuals often have partial knowledge, but a
committee can pool expertise to make better decisions
Ensemble Learning
• Definition: method to select and
combine an ensemble of hypotheses
into a (hopefully) better hypothesis
• Can enlarge hypothesis space
– Perceptron (a simple kind of
neural network)
• linear separator
– Ensemble of perceptrons
• polytope
Bagging

• Assumptions:
  – Each hypothesis hi makes an error with probability p
  – The hypotheses are independent
• Majority voting of n hypotheses:
  – k hypotheses make an error with probability C(n, k) · p^k · (1 − p)^(n−k)
  – the majority errs when more than n/2 hypotheses err
• With n = 5 and p = 0.1, error(majority) < 0.01
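The n = 5, p = 0.1 claim can be checked by summing the binomial tail (majority of 5 errs when 3 or more err):

```python
# Probability that a majority vote of n independent hypotheses errs,
# when each errs with probability p.
from math import comb

def majority_error(n, p):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n // 2) + 1, n + 1))

err = majority_error(5, 0.1)   # 0.0081 + 0.00045 + 0.00001 ≈ 0.00856
```

The independence assumption does the heavy lifting here; correlated hypotheses gain far less, which motivates the weighted-majority idea on the next slide.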
Weighted Majority

• In practice
– Hypotheses rarely independent
– Some hypotheses make fewer errors than others
• Let’s take a weighted majority
• Intuition:
– Decrease weight of correlated hypotheses
– Increase weight of good hypotheses
Boosting

• Most popular ensemble technique
• Computes a weighted majority
• Can “boost” a “weak learner”
• Operates on a weighted training set
Weighted Training Set

• Learning with a weighted training set
  – Supervised learning → minimize training error
  – Bias the algorithm to learn correctly the instances with high weights
• Idea: when an instance is misclassified by a hypothesis, increase its weight so that the next hypothesis is more likely to classify it correctly
Boosting Framework

Read the figure left to right: the algorithm builds a hypothesis on a weighted set of four examples, one hypothesis per column.
AdaBoost (Adaptive Boosting)

• There are N examples.
• There are M “columns” (hypotheses), each of which has weight zm.
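A minimal sketch of this scheme, assuming ±1 labels, a decision-stump weak learner over one real-valued feature, and the usual AdaBoost weight update zm = ½ ln((1 − err)/err); the dataset and round count are illustrative:

```python
# AdaBoost sketch: N weighted examples, M rounds, each round m yielding
# a hypothesis h_m with weight z_m; the final output is the sign of the
# weighted vote. The weak learner is a one-threshold decision stump.
from math import log, exp

def stump_learner(examples, w):
    """Return (weighted error, stump) minimizing error under weights w."""
    best = None
    for t in sorted({x for x, _ in examples}):
        for s in (+1, -1):
            h = lambda x, t=t, s=s: s if x > t else -s
            err = sum(wi for (x, y), wi in zip(examples, w) if h(x) != y)
            if best is None or err < best[0]:
                best = (err, h)
    return best

def adaboost(examples, M):
    N = len(examples)
    w = [1.0 / N] * N                     # uniform initial example weights
    ensemble = []                         # list of (z_m, h_m) pairs
    for _ in range(M):
        err, h = stump_learner(examples, w)
        err = max(err, 1e-10)             # guard against a perfect stump
        z = 0.5 * log((1 - err) / err)    # hypothesis weight z_m
        # raise weights of misclassified examples, lower the rest
        w = [wi * exp(-z * y * h(x)) for (x, y), wi in zip(examples, w)]
        total = sum(w)
        w = [wi / total for wi in w]      # renormalize
        ensemble.append((z, h))
    return lambda x: 1 if sum(z * h(x) for z, h in ensemble) > 0 else -1

train = [(0.0, -1), (1.0, -1), (2.0, +1), (3.0, +1)]
H = adaboost(train, 3)
```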
What can we boost?

• Weak learner: produces hypotheses at least slightly better than random guessing.
• Examples:
– Rules of thumb
– Decision stumps (decision trees of one node)
– Perceptrons
– Naïve Bayes models
Boosting Paradigm

• Advantages
– No need to learn a perfect hypothesis
– Can boost any weak learning algorithm
– Boosting is very simple to program
– Good generalization
• Paradigm shift
– Don’t try to learn a perfect hypothesis
– Just learn simple rules of thumb and boost them
Boosting Paradigm

• When we already have a bunch of hypotheses, boosting provides a principled approach to combining them
• Useful for
– Sensor fusion
– Combining experts
Boosting Applications

• Any supervised learning task
– Spam filtering
– Speech recognition/natural language processing
– Data mining
– Etc.
Computational learning theory

• The intersection of AI, statistics, and theoretical computer science.
• The underlying principle is the following: any hypothesis that is seriously wrong will almost certainly be "found out" with high probability after a small number of examples, because it will make an incorrect prediction. Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: that is, it must be probably approximately correct.


Computational Learning Theory
How many examples are needed?

• Call a hypothesis "seriously wrong" if its error exceeds ε, and let Hbad be the set of such hypotheses.
• The probability that Hbad contains a hypothesis consistent with all N examples is at most |H|(1 − ε)^N.
• This falls below δ once N ≥ (1/ε)(ln(1/δ) + ln |H|).
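The standard PAC sample-complexity bound, N ≥ (1/ε)(ln(1/δ) + ln |H|), can be computed directly; the ε, δ, and |H| values below are illustrative choices, not figures from the slides:

```python
# PAC sample complexity: the number of examples after which, with
# probability at least 1 - delta, no hypothesis with error > eps
# survives as consistent with the training set.
from math import log, ceil

def pac_examples_needed(eps, delta, hypothesis_space_size):
    return ceil((log(1 / delta) + log(hypothesis_space_size)) / eps)

# e.g. all Boolean functions of 6 attributes: |H| = 2**(2**6) = 2**64
N = pac_examples_needed(0.1, 0.05, 2 ** 64)
```

Note how the bound grows only logarithmically in |H|, but |H| itself is doubly exponential for unrestricted Boolean functions, which is why restricted spaces such as decision lists are attractive.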
18.6 WHY LEARNING WORKS:
COMPUTATIONAL LEARNING
THEORY
• Learning means behaving better as a result of experience.
• computational learning theory - a field at the intersection
of AI and theoretical computer science.

• error(h) = P(h(x) ≠ f(x) | x drawn from D)


error(h) = P(h(x) ≠ f(x) | x drawn from D)

Figure 18.15 Schematic diagram of hypothesis space, showing the "ε-ball" around the true function f.


Decision list

• A decision list is a logical expression of a restricted form.
• It consists of a series of tests, each of which is a conjunction of literals. If a test succeeds when applied to an example description, the decision list specifies the value to be returned. If the test fails, processing continues with the next test in the list.
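A decision list as described above is just an ordered sequence of (test, value) pairs. A sketch, with the restaurant list of Figure 18.16 encoded by hand (attribute names are illustrative encodings of the slide's literals):

```python
# A decision list: the first test (a conjunction of attribute=value
# literals) that an example satisfies determines the output.

def decision_list(tests, example, default=False):
    for test, value in tests:
        if all(example.get(k) == v for k, v in test.items()):
            return value
    return default

# WillWait <=> Patrons = Some  OR  (Patrons = Full AND Fri/Sat)
will_wait = [
    ({"Patrons": "Some"}, True),
    ({"Patrons": "Full", "FriSat": True}, True),
    ({}, False),                      # empty test: always matches
]

r1 = decision_list(will_wait, {"Patrons": "Some", "FriSat": False})
r2 = decision_list(will_wait, {"Patrons": "Full", "FriSat": True})
r3 = decision_list(will_wait, {"Patrons": "Full", "FriSat": False})
```

The ordering matters: moving the empty test first would make the list constant, which is why decision-list learning searches over both tests and their order.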



Learning decision lists

∀x WillWait(x) ⇔ Patrons(x, Some) ∨ (Patrons(x, Full) ∧ Fri/Sat(x))

Figure 18.16 A decision list for the restaurant problem.



Figure 18.18 Graph showing the predictive performance of the DECISION-LIST-LEARNING algorithm on the restaurant data, as a function of the number of examples seen. The curve for DECISION-TREE-LEARNING is shown for comparison.


18.7 SUMMARY
All learning can be seen as learning a function, and in this chapter we
concentrate on induction: learning a function from example
input/output pairs. The main points were as follows:
• Learning in intelligent agents is essential for dealing with
unknown environments (i.e., compensating for the designer's lack of
omniscience about the agent's environment).
• Learning is also essential for building agents with a reasonable amount
of effort (i.e., compensating for the designer's laziness, or lack of
time).
• Learning agents can be divided conceptually into a performance
element, which is responsible for selecting actions, and a learning
element, which is responsible for modifying the performance element.



• Learning takes many forms, depending on the nature of the
performance element, the available feedback, and the
available knowledge.
• Learning any particular component of the performance
element can be cast as a problem of learning an accurate
representation of a function.
• Learning a function from examples of its inputs and outputs
is called inductive learning.
• The difficulty of learning depends on the chosen
representation. Functions can be represented by logical
sentences, polynomials, belief networks, neural networks,
and others.
• Decision trees are an efficient method for learning
deterministic Boolean functions.



• Ockham's razor suggests choosing the simplest hypothesis that
matches the observed examples. The information gain heuristic allows
us to find a simple decision tree.
• The performance of inductive learning algorithms is measured by
their learning curve, which shows the prediction accuracy as a
function of the number of observed examples.
• We presented two general approaches for learning logical theories. The
current-best-hypothesis approach maintains and adjusts a single
hypothesis, whereas the version space approach maintains a
representation of all consistent hypotheses. Both are vulnerable to noise
in the training set.
• Computational learning theory analyses the sample complexity and
computational complexity of inductive learning. There is a trade-off
between the expressiveness of the hypothesis language and the ease of
learning.



Chapter 27

• AI present and future

• Important topics:
1. Components of agents
2. Agent architectures
3. Future directions


Agent components



Components

• Interaction with the environment through sensors and actuators
• Keeping track of the state of the world
• Projecting, evaluating, and selecting future courses of action
• Utility as an expression of preferences
• Learning


Interaction through sensors and actuators

• The situation has changed rapidly in recent years with the availability of ready-made programmable robots, such as the four-legged robots.
• These, in turn, have benefited from small, cheap, high-resolution CCD cameras and compact, reliable motor drives. MEMS (micro-electromechanical systems) technology has supplied miniaturized accelerometers and gyroscopes and is now producing actuators that will, for example, power an artificial flying insect.
• The Internet


• Keeping track of the state of the world: This is one of the core capabilities required for an intelligent agent.
• It requires both perception and updating of internal representations.
  – Earlier chapters described methods for keeping track of worlds described by propositional logic,
  – and extended this to first-order logic.


Projecting, evaluating, and selecting future courses of action:
• The basic knowledge representation requirements here are
  the same as for keeping track of the world; the primary
  difficulty is coping with courses of action, such as having a
  conversation or a cup of tea, that eventually consist of
  thousands or millions of primitive steps for a real agent.
• It is only by imposing hierarchical structure on behavior
  that we humans cope at all.
• There is clearly a great deal of work to do here, perhaps
  along the lines of recent developments in hierarchical
  reinforcement learning.
• Acting under uncertainty adds further difficulty.
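The hierarchical structure mentioned above can be sketched as action refinement: a high-level action expands into sub-actions until only primitive steps remain, so the agent never reasons directly over millions of steps. The action names and refinements below are hypothetical illustrations.

```python
# Toy hierarchical decomposition: expand a high-level action into
# a flat sequence of primitive steps via a refinement table.

REFINEMENTS = {
    "MakeTea": ["BoilWater", "Steep", "Pour"],
    "BoilWater": ["FillKettle", "SwitchOn", "WaitForBoil"],
}

def refine(action):
    """Recursively expand an action; actions with no entry are primitive."""
    if action not in REFINEMENTS:
        return [action]
    steps = []
    for sub in REFINEMENTS[action]:
        steps.extend(refine(sub))
    return steps

print(refine("MakeTea"))
# ['FillKettle', 'SwitchOn', 'WaitForBoil', 'Steep', 'Pour']
```

The agent can plan at the level of three sub-actions, while the refinement table carries the burden of producing the primitive sequence.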


Utility as an expression of preferences:
• In principle, basing rational decisions on the maximization
  of expected utility is completely general and avoids many
  of the problems of purely goal-based approaches, such as
  conflicting goals and uncertain attainment.
• As yet, however, there has been very little work on
  constructing realistic utility functions.
• One reason may be that preferences over states are really
  compiled from preferences over state histories, which are
  described by reward functions.
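The expected-utility principle above fits in a few lines of code: weight each action's possible outcomes by their probabilities and pick the action with the highest weighted sum. The actions, probabilities, and utilities below are invented purely for illustration.

```python
# Decision-making by maximizing expected utility.

OUTCOMES = {  # action -> list of (probability, utility) pairs
    "TakeUmbrella": [(0.7, 8), (0.3, 8)],     # dry either way, mild hassle
    "GoWithout":    [(0.7, 10), (0.3, -20)],  # great if sunny, soaked if not
}

def expected_utility(action):
    """Sum of utility over outcomes, weighted by outcome probability."""
    return sum(p * u for p, u in OUTCOMES[action])

def best_action(actions):
    return max(actions, key=expected_utility)

print(best_action(OUTCOMES))  # TakeUmbrella
```

Note how the single utility scale resolves what a goal-based agent would see as conflicting goals ("stay dry" versus "travel light").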


• Learning: Earlier chapters described how learning in an
  agent can be formulated as inductive learning (supervised,
  unsupervised, or reinforcement-based) of the functions that
  constitute the various components of the agent.
• Very powerful logical and statistical techniques have been
  developed that can cope with quite large problems, often
  reaching or exceeding human capabilities in the
  identification of predictive patterns defined on a given
  vocabulary.
• On the other hand, machine learning has made very little
  progress on the important problem of constructing new
  representations at levels of abstraction higher than the
  input vocabulary.
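A minimal instance of inductive (supervised) learning is a one-attribute "decision stump": from labeled examples, pick the attribute whose value best predicts the label. The tiny dataset and attribute names below are made up for illustration.

```python
# Fit a decision stump: choose the single attribute that, mapped to
# the majority label of each of its values, classifies the training
# examples most accurately.

def majority_rule(examples, attr):
    """Map each value of attr to the majority label among its examples."""
    counts = {}
    for x, y in examples:
        counts.setdefault(x[attr], []).append(y)
    return {v: max(set(ys), key=ys.count) for v, ys in counts.items()}

def fit_stump(examples, attributes):
    """Return (best attribute, value -> predicted label mapping)."""
    def accuracy(attr):
        rule = majority_rule(examples, attr)
        return sum(rule[x[attr]] == y for x, y in examples)
    best = max(attributes, key=accuracy)
    return best, majority_rule(examples, best)

examples = [
    ({"Hungry": True,  "Raining": False}, True),
    ({"Hungry": True,  "Raining": True},  True),
    ({"Hungry": False, "Raining": False}, False),
    ({"Hungry": False, "Raining": True},  False),
]
attr, rule = fit_stump(examples, ["Hungry", "Raining"])
print(attr)  # Hungry -- it perfectly predicts the label on this data
```

The stump finds a predictive pattern over the given vocabulary of attributes; inventing a *new* attribute not in that vocabulary is exactly what the last bullet says machine learning still struggles with.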


Agent architecture


• We have seen that reflex responses are needed for
situations in which time is of the essence, whereas
knowledge-based deliberation allows the agent to plan
ahead.
• A complete agent must be able to do both, using a hybrid
architecture.
• One important property of hybrid architectures is that the
boundaries between different decision components are not
fixed.


• Agents also need ways to control their own deliberations.
• They must be able to cease deliberating when action is
  demanded, and they must be able to use the time available
  for deliberation to execute the most profitable
  computations.


Real-time AI

• These issues are usually studied under the heading of
  real-time AI.
• As AI systems move into more complex domains, all
  problems will become real-time, because the agent will
  never have long enough to solve the decision problem
  exactly. One response is decision-theoretic metareasoning.


• This method applies the theory of information value to the
  selection of computations. The value of a computation
  depends on both its cost (in terms of delaying action) and
  its benefits (in terms of improved decision quality).
• Metareasoning techniques can be used to design better
search algorithms and to guarantee that the algorithms
have the anytime property.
• Metareasoning is expensive, of course, and compilation
methods can be applied so that the overhead is small
compared to the costs of the computations being
controlled.
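The anytime property mentioned above can be illustrated with a toy iterative-improvement solver: it holds a valid best-so-far answer at every moment, so metareasoning may cut it off whenever further computation stops being worth the delay. The objective function and candidate set are arbitrary illustrations, not from the book.

```python
# A toy anytime maximizer: repeatedly sample candidates, always
# keeping the best seen so far. Interrupting the loop at any point
# still yields a usable (if suboptimal) answer.
import random

def anytime_maximize(f, candidates, time_budget_steps):
    """Sample up to time_budget_steps candidates; return the best seen."""
    best = random.choice(candidates)
    for _ in range(time_budget_steps):   # the interruptible deliberation
        c = random.choice(candidates)
        if f(c) > f(best):
            best = c
    return best                          # valid at ANY cutoff point

random.seed(0)
result = anytime_maximize(lambda x: -(x - 3) ** 2, list(range(10)), 200)
print(result)
```

A real metareasoner would choose `time_budget_steps` dynamically, stopping when the expected improvement from one more step no longer exceeds the cost of the delay, which is the cost-versus-benefit trade-off described in the first bullet.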


ARE WE GOING IN THE RIGHT
DIRECTION?
• Consider the goals of AI.
• Build agents with these goals in mind.
• How far have we come?


Perfect rationality

• A perfectly rational agent acts at every instant in such a
  way as to maximize its expected utility, given the
  information it has acquired from the environment.
• We have seen that the calculations necessary to achieve
  perfect rationality in most environments are too
  time-consuming, so perfect rationality is not a realistic goal.


Calculative rationality
• This is the notion of rationality that we have used implicitly
  in designing logical and decision-theoretic agents.
• A calculatively rational agent eventually returns what would
  have been the rational choice at the beginning of its
  deliberation.
• This is an interesting property for a system to exhibit, but in
  most environments, the right answer at the wrong time is of
  no value.
• AI system designers are forced to compromise on decision
  quality to obtain reasonable overall performance.


Bounded rationality

• Herbert Simon (1957) rejected the notion of perfect (or
  even approximately perfect) rationality and replaced it with
  bounded rationality, a descriptive theory of decision making
  by real agents.
• It is not a formal specification for intelligent agents,
  however, because the definition of "good enough" is not
  given by the theory.
• Furthermore, satisficing seems to be just one of a large
  range of methods used to cope with bounded resources.
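Simon's satisficing can be sketched in a few lines: rather than searching for the best option, accept the first one that clears an aspiration threshold. The options and the threshold below are arbitrary illustrations; as the slide notes, the theory itself does not say where "good enough" should be set.

```python
# Satisficing choice: return the first option whose utility meets
# the aspiration level; fall back to the best seen if none does.

def satisfice(options, utility, good_enough):
    for option in options:               # examine options in order
        if utility(option) >= good_enough:
            return option                # stop at the first acceptable one
    return max(options, key=utility)     # no acceptable option: take the best

apartments = [("A", 6), ("B", 8), ("C", 9)]
choice = satisfice(apartments, lambda a: a[1], good_enough=7)
print(choice)  # ('B', 8) -- acceptable, found before the optimal ('C', 9)
```

The agent saves the cost of evaluating option C, at the price of settling for a merely acceptable answer, which is exactly the bounded-resources trade-off Simon described.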


Bounded optimality (BO)
• A bounded optimal agent behaves as well as possible, given
  its computational resources.
• That is, the expected utility of the agent program for a
  bounded optimal agent is at least as high as the expected
  utility of any other agent program running on the same
  machine.
• Of these four possibilities, bounded optimality seems to offer
  the best hope for a strong theoretical foundation for AI. It
  has the advantage of being possible to achieve: there is
  always at least one best program, something that perfect
  rationality lacks. Bounded optimal agents are actually useful
  in the real world, whereas calculatively rational agents
  usually are not, and satisficing agents might or might not be,
  depending on their own whims.


What if AI does succeed?
• There are ethical issues to consider.
• Those who strive to develop AI have a responsibility to see
  that the impact of their work is a positive one. The scope of
  the impact will depend on the degree of success of AI. Even
  modest successes in AI have already changed the ways in
  which computer science is taught (Stein, 2002) and software
  development is practiced. AI has made possible new
  applications such as speech recognition systems, inventory
  control systems, surveillance systems, robots, and search
  engines.
• We can expect that medium-level successes in AI would
  affect all kinds of people in their daily lives, as computerized
  communication networks, such as cell phones and the
  Internet, already have.


• Finally, it seems likely that a large-scale success in AI, the
  creation of human-level intelligence and beyond, would
  change the lives of a majority of humankind. The very nature
  of our work and play would be altered, as would our view of
  intelligence, consciousness, and the future destiny of the
  human race. At this level, AI systems could pose a more
  direct threat to human autonomy, freedom, and even
  survival. For these reasons, we cannot divorce AI research
  from its ethical consequences.
