
UNIT III

1. State the conditions under which uncertainty can arise.

Ans: Uncertainty can arise because of incompleteness and incorrectness in the agent's understanding of the properties of the environment.

2. What are the three reasons why FOL fails in medical diagnosis?
Ans: The three reasons why FOL fails in medical diagnosis are:
(i) Laziness: Too much work to list the complete set of antecedents and consequents needed.
(ii) Theoretical ignorance: Medical science has no complete theory for the domain.
(iii) Practical ignorance: Even if we know all the rules, uncertainty arises because some tests cannot be run on the patient's body.

3. What is the tool that is used to deal with degree of belief?
Ans: The tool that is used to deal with degree of belief is probability theory, which assigns a numeric degree of belief between 0 and 1.

4. For what is utility theory useful, and in what way is it related to decision theory?
Ans: Utility theory is useful to represent and reason with preferences. Decision theory is related to utility theory as the combination of probability theory and utility theory.

5. What is the fundamental idea of decision theory?
Ans: The fundamental idea of decision theory is that an agent is rational if and only if it chooses the action that yields the highest expected utility.

6. How are probabilities over simple and complex propositions classified?
Ans: Probabilities over simple and complex propositions are classified as
(i) Unconditional or prior probabilities.
(ii) Conditional or posterior probabilities.

7. State the axioms of probability.
Ans: The axioms are:
(i) All probabilities are between 0 and 1: 0 <= P(A) <= 1.
(ii) Necessarily true propositions have probability 1 and necessarily false propositions have probability 0: P(True) = 1, P(False) = 0.
(iii) The probability of a disjunction is given by P(A v B) = P(A) + P(B) - P(A ^ B).
From these axioms we can derive P(~A) = 1 - P(A):
(iv) Let B = ~A in axiom (iii): P(A v ~A) = P(A) + P(~A) - P(A ^ ~A)
(v) P(True) = P(A) + P(~A) - P(False) (by logical equivalence)
(vi) 1 = P(A) + P(~A) (by axiom (ii))
(vii) P(~A) = 1 - P(A) (by algebra)

8. What is a Joint Probability Distribution?
Ans: An agent's probability assignments to all propositions in the domain (both simple and complex) is called the Joint Probability Distribution.

9. What is the disadvantage of Bayes' rule?
Ans: It requires three terms to compute one conditional probability P(B|A):
* one conditional probability -> P(A|B)
* two unconditional probabilities -> P(B) and P(A)

10. What is the advantage of Bayes' rule?
Ans: If the three values are known, then the unknown fourth value P(B|A) is computed easily.

11. What is a belief network?
Ans: A belief network is a graph in which the following holds:
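As a small illustration of questions 9 and 10, here is a minimal Python sketch (the three probability values are made up for the example, not taken from the text) that recovers the unknown fourth value P(B|A) from the three known terms:

# Bayes' rule: P(B|A) = P(A|B) * P(B) / P(A)
# The three probabilities below are illustrative values only.
p_a_given_b = 0.9   # conditional probability P(A|B)
p_b = 0.01          # unconditional (prior) probability P(B)
p_a = 0.05          # unconditional probability P(A)

p_b_given_a = p_a_given_b * p_b / p_a
print(f"P(B|A) = {p_b_given_a:.3f}")   # 0.180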

1. A set of random variables makes up the nodes of the network.
2. A set of directed links or arrows connects pairs of nodes: an arrow X -> Y means X has a direct influence on Y.
3. Each node has a conditional probability table that quantifies the effects that the parents have on the node. The parents of a node are all nodes that have arrows pointing to it.
4. The graph has no directed cycles: it is a directed acyclic graph (DAG).

12. What is the task of any probabilistic inference system?
Ans: The task of any probabilistic inference system is to compute the posterior probability distribution for a set of query variables, given exact values of some evidence variables.

13. State the uses of belief networks.
Ans:
1. Making decisions based on probabilities in the network and on the agent's utilities.
2. Deciding which additional evidence variables should be observed in order to gain useful information.
3. Sensitivity analysis: determining which aspects of the model have the greatest impact.
4. Explaining the results of probabilistic inference to the user.

14. What are the two ways in which one can understand the semantics of belief networks?
Ans: The two ways are:
1. Seeing the network as a representation of the joint probability distribution - used to know how to construct networks.
2. Seeing it as an encoding of a collection of conditional independence statements - used in designing inference procedures.

15. What is Probabilistic Reasoning? Explain.
Ans: Probabilistic reasoning explains how to build network models to reason under uncertainty according to the laws of probability theory.

16. What are the disadvantages of the full joint probability distribution?
Ans: The disadvantage is that as the interaction between the domain and the agent increases, the number of variables also increases; to overcome this we use a new data structure called a Bayesian network.

17. Explain Bayesian Network.
Ans: A Bayesian network is used to represent the dependencies among variables and to give a concise specification of any full joint probability distribution.

18. What makes up the nodes of a Bayesian network and how are they connected?
Ans: A set of random variables makes up the nodes of the network. Variables may be discrete or continuous, and a set of directed links or arrows connects pairs of nodes.

19. What is a Conditional Probability Table (CPT)?
Ans: The table attached to each node of a Bayesian network, giving the distribution over the node for each combination of parent values, is called the conditional probability table.

20. What is a Conditioning Case?
Ans: A conditioning case is just a possible combination of values for the parent nodes.

21. What are the semantics of a Bayesian network?
Ans: The semantics can be represented as:
1. Global semantics (full joint probability distribution).
2. Local semantics (conditional independence statements).

22. State the way of representing the full joint distribution.
Ans: A generic entry in the joint distribution is the probability of a conjunction of particular assignments to each variable, such as P(X1 = x1 ^ ... ^ Xn = xn), abbreviated P(x1, ..., xn). In a Bayesian network it is given by the product of the conditional probabilities of each variable given its parents: P(x1, ..., xn) = P(x1 | parents(X1)) * ... * P(xn | parents(Xn)).

23. When is a Bayesian network said to be compact?
Ans: A Bayesian network is compact when the information is complete and non-redundant.
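To make the factorization in question 22 concrete, here is a minimal Python sketch using a hypothetical two-node network Cloudy -> Rain with made-up CPT numbers; each entry of the full joint distribution is obtained by multiplying the relevant CPT entries:

# Hypothetical two-node network: Cloudy -> Rain (illustrative numbers).
# P(Cloudy=c, Rain=r) = P(Cloudy=c) * P(Rain=r | Cloudy=c)
p_cloudy = {True: 0.5, False: 0.5}             # prior for Cloudy
p_rain_given_cloudy = {True: 0.8, False: 0.1}  # P(Rain=True | Cloudy)

def joint(cloudy, rain):
    p_rain = p_rain_given_cloudy[cloudy]
    return p_cloudy[cloudy] * (p_rain if rain else 1 - p_rain)

total = sum(joint(c, r) for c in (True, False) for r in (True, False))
print(joint(True, True), total)   # 0.4, and the four entries sum to 1.0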

The compactness of a Bayesian network is an example of a very general property of locally structured systems.

24. Explain Node Ordering.
Ans: The correct order in which to add nodes is to add the root causes first (parents), then the variables they influence, and so on; the process continues until we reach the leaves, which have no direct causal influence on other variables.

25. What is Topological Semantics?
Ans: Topological semantics specifies conditional independence relationships directly from the graph structure, and from these we can derive the numerical statements.

26. What is the specification for the topological semantics?
Ans:
1. A node is conditionally independent of its non-descendants, given its parents.
2. A node is conditionally independent of all other nodes in the network, given its parents, children, and children's parents; this set is called the Markov blanket.

27. What are the two ways to represent canonical distributions?
Ans:
1. Deterministic nodes
2. Noisy-OR relation

28. What are Deterministic nodes?
Ans: A deterministic node is used to represent certain knowledge, by means of

X = f(Parents(X)) for some function f.

29. Explain the Noisy-OR Relationship.
Ans: It is a generalization of logical OR and it is used to represent uncertain

knowledge.

30. What are the four types of inferences?
Ans:
1. Diagnostic inferences
2. Causal inferences

3. Intercausal inferences
4. Mixed inferences

31. What are the three basic classes of algorithms for evaluating multiply connected networks?
Ans: The three basic classes of algorithms for evaluating multiply connected networks are clustering methods, conditioning methods and stochastic simulation methods.

32. What is done in clustering?
Ans: Clustering transforms the network into a probabilistically equivalent polytree by merging offending nodes.

33. What is a cutset?
Ans: A set of variables that can be instantiated to yield polytrees is called a cutset, i.e. this method transforms the network into several simpler polytrees.

34. What is a utility function?
Ans: An agent's preferences between world states are captured by a utility function, which assigns a single number to express the desirability of a state; utilities are combined with the outcome probabilities for actions to give an expected utility for each action.

35. What is the principle behind maximum expected utility?
Ans: The principle of MEU (Maximum Expected Utility) is that a rational agent should choose an action that maximizes the agent's expected utility.

36. Explain orderability.
Ans: Given two states, the agent should prefer one or the other, or else rate the two as equally preferable, i.e. exactly one of (A > B), (B > A), (A ~ B) holds.

37. What is meant by multiattribute utility theory (MAUT)?
Ans: Problems in which outcomes are characterized by two or more attributes are handled by MAUT. The basic approach adopted in MAUT is to identify regularities in the preference behavior, and to use representation theorems to show that an agent with a certain kind of preference structure has a utility function of a particular form.

38. What are the two types of dominance?
Ans: The two types of dominance are strict dominance and stochastic dominance.

39. What are the two roles in decision analysis?
Ans: The two roles in decision analysis are the decision maker and the decision analyst.

40. What are the axioms of utility theory?
Ans: They are orderability, transitivity, continuity, substitutability, monotonicity and decomposability.

41. What is non-monotonic reasoning?
Ans: Non-monotonic reasoning is one in which axioms and/or rules of inference are extended to make it possible to reason with incomplete information. These systems preserve the property that, at any given moment, a statement is either believed to be true, believed to be false, or not believed to be either.

42. What is meant by nonmonotonic logic?
Ans: One system that provides a basis for default reasoning is nonmonotonic logic, in which the operators of first-order logic are augmented with a modal operator M, which is read as "is consistent".

43. Give an example of nonmonotonic logic.
For all x, y: Related(x,y) ^ M GetAlong(x,y) -> WillDefend(x,y)
It is read as: for all x and y, if x and y are related and if the fact that x gets along with y is consistent with everything else that is believed, then conclude that x will defend y.

44. What are the 2 kinds of nonmonotonic reasoning?
Ans: The 2 kinds of nonmonotonic reasoning are:
a) Abduction
b) Inheritance

45. What is meant by ATMS?

Ans: ATMS - Assumption-based truth maintenance system. An ATMS simply labels all the states that have been considered at the same time. An ATMS keeps track, for each sentence, of which assumptions would cause the sentence to be true.

46. What is meant by JTMS?
Ans: JTMS - Justification-based truth maintenance system. A JTMS simply labels each

sentence as being in or out. The maintenance of justifications allows us to move quickly from one state to another by making a few retractions and assertions, but only one state is represented at a time.

47. Define belief revision.
Ans: Many of the inferences drawn by a knowledge representation system will have only default status. Some of these inferred facts will turn out to be wrong and will have to be retracted in the face of new information. This process is called belief revision.

48. List the limitations of CWA.
Ans: The limitations of CWA (the closed world assumption) are:
1. It operates on individual predicates without considering the interactions among predicates that are defined in the knowledge base.
2. It assumes that all predicates have all their instances listed. Although in many database applications this is true, in many knowledge-based systems it is not.

49. Give the difference between ATMS and JTMS.
Ans:
ATMS:
A) An ATMS simply labels all the states that have been considered at the same time.
B) An ATMS keeps track, for each sentence, of which assumptions would cause the sentence to be true.
JTMS:
A) A JTMS simply labels each sentence as being in or out.
B) The maintenance of justifications allows us to move quickly from one state to another by making a few retractions and assertions, but only one state is represented at a time.

50. Define Markov Decision Process (MDP).
Ans: The specification of a sequential decision problem for a fully observable environment with a Markovian transition model and additive rewards is called a Markov Decision Process.

51. What are the 3 components that define an MDP?
Ans: The 3 components that define an MDP are:
Initial state S0
Transition model T(s, a, s')
Reward function R(s)

52. What is a policy and how is it denoted?
Ans: A solution must specify what the agent should do for any state that the agent may reach. This solution is referred to as a policy. It is denoted by π.

53. When are complex decisions made?
Ans: Complex decisions are made when the outcomes of an action are uncertain. Decisions are taken in fully observable, partially observable and uncertain environments.

54. What do complex decisions deal with?
Ans: Complex decisions deal with sequential decision problems, where the agent's utility depends on a sequence of decisions.

55. Define transition model.
Ans: The specification of the outcome probabilities for each action in each possible state is called a transition model.

56. What is meant by a Markovian transition?
Ans: When action a is done in state s, the probability of reaching s' is denoted by T(s, a, s'). This is referred to as a Markovian transition because the probability of reaching s' from s depends only on s and not on the history of earlier states.

57. What is an optimal policy and how is it denoted?

Ans: An optimal policy is a policy that yields the highest expected utility. It is denoted by π*.

58. Define proper policy.
Ans: A policy that is guaranteed to reach a terminal state is called a proper policy.

59. What are the types of horizons for decision making?
Ans: The 2 types of horizons for decision making are:
Finite horizon
Infinite horizon

60. What is meant by a finite horizon?
Ans: A finite horizon means that there is a fixed time after which the game is over. With a finite horizon, the optimal action in a given state could change over time. Therefore, the optimal policy for a finite horizon is nonstationary.

61. Differentiate between finite horizon and infinite horizon.
Ans: FINITE HORIZON: With a finite horizon, the optimal action in a given state could change over time. The optimal policy for a finite horizon is nonstationary.

INFINITE HORIZON: For an infinite horizon there is no fixed deadline. Since the optimal policy depends only on the current state, it is stationary.
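To illustrate questions 50-61, here is a minimal value-iteration sketch in Python for a tiny, invented two-state MDP (the states, transition probabilities, rewards and discount factor are all assumptions for the example); it computes utilities and extracts a stationary policy for the discounted, infinite-horizon case:

# Value iteration for a made-up MDP with states A and B.
# T[s][a] is a list of (probability, next_state); R gives the reward of a state.
T = {
    "A": {"stay": [(1.0, "A")], "go": [(0.8, "B"), (0.2, "A")]},
    "B": {"stay": [(1.0, "B")], "go": [(0.8, "A"), (0.2, "B")]},
}
R = {"A": 0.0, "B": 1.0}
gamma = 0.9                       # discount factor

U = {s: 0.0 for s in T}           # initial utility estimates
for _ in range(100):              # repeat the Bellman update
    U = {s: R[s] + gamma * max(sum(p * U[s2] for p, s2 in T[s][a]) for a in T[s])
         for s in T}

policy = {s: max(T[s], key=lambda a: sum(p * U[s2] for p, s2 in T[s][a])) for s in T}
print(U, policy)   # the extracted policy is stationary: "go" in A, "stay" in B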

62. What is mechanism design (inverse game theory)?
Ans: Designing a game whose solutions consist of each agent having its own rational strategy is called mechanism design (inverse game theory). It can be used to construct intelligent multi-agent systems that solve complex problems in a distributed way.

63. What does a mechanism consist of?
Ans: A mechanism consists of

1. A language for describing the set of allowable strategies that the agents may adopt.
2. An outcome rule G that determines the payoffs to the agents, given the strategy profile of allowable strategies.

64. Define dominant strategy.
Ans: A dominant strategy is a strategy that dominates all others. A strategy s for player p strongly dominates strategy s' if the outcome of s is better for p than the outcome of s' for every choice of strategies by the other players.

65. Define weak dominance.
Ans: Strategy s weakly dominates s' if s is better than s' on at least one strategy profile and no worse on any other.

66. What is Nash equilibrium?
Ans: The mathematician John Nash proved that every game has an equilibrium of the type defined above; it is called a Nash equilibrium.

67. What is the maximin technique?
Ans: In order to find the equilibrium of a zero-sum game, Von Neumann developed a method for finding the optimal mixed strategy. This is called the maximin technique.

68. What is dominant strategy equilibrium?
Ans: In the case of a two-player game, when both players have a dominant strategy, the combination of those strategies is called a dominant strategy equilibrium.

69. What is meant by Pareto optimal and Pareto dominated?
Ans: An outcome is referred to as Pareto optimal if there is no other outcome that all players would prefer. An outcome is Pareto dominated by another if all players prefer the other outcome.

70. What are the 2 types of strategy?
Ans: The 2 types of strategy are:

1. Pure strategy
2. Mixed strategy

71. Define pure strategy.
Ans: A pure strategy is a deterministic policy specifying a particular action to take in each situation.

72. Define mixed strategy.
Ans: A mixed strategy is a randomized policy that selects particular actions according to a specified probability distribution over actions.

73. What are the components of a game in game theory?
Ans: The components of a game in game theory are:
1. Players or agents who will make decisions.
2. Actions that the players can choose.
3. A payoff matrix that gives the utility for each combination of actions by all the players.

74. What is a contraction?
Ans: A contraction is a function of one argument that, when applied to 2 different inputs, produces 2 output values that are closer together, by at least some constant factor, than the original values.

75. What are the properties of a contraction?
Ans: The properties of a contraction are:
A contraction has only one fixed point; if there were 2 fixed points, they would not get closer together when the function is applied, so it would not be a contraction.
When the function is applied to any argument, the value must get closer to the fixed point, so repeated application of a contraction always reaches the fixed point.
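Questions 67 and 73 above mention the maximin technique and the payoff matrix. The sketch below shows only the simpler pure-strategy version of maximin for a made-up two-player zero-sum game (Von Neumann's full result concerns optimal mixed strategies); the payoff values are invented for illustration:

# Payoff matrix for the row player in a made-up zero-sum game:
# payoff[i][j] = row player's utility when row plays i and column plays j.
payoff = [
    [3, -1],
    [0,  2],
]

# Pure-strategy maximin: for each row strategy assume the opponent responds
# with the worst case for us, then pick the row with the best worst case.
row_values = [min(row) for row in payoff]
best_row = max(range(len(payoff)), key=lambda i: row_values[i])
print(best_row, row_values[best_row])   # row 1, guaranteed value 0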

Unit-IV
1.What is planning in AI? Planning is done by a planning agent or a planner to solve the current world problem. It has certain algorithms to implement the solutions for the given problem and this is called ideal planner algorithm

2. How does a planner implement the solution?
Unify action and goal representation to allow selection (use a logical language for both)
Divide and conquer by subgoaling
Relax the requirement for sequential construction of solutions

3.What is the difference between planner and problem solving agents? Problem solving agent:it will represent the task of the problem and solves it using search techniques and any other algorithm. Planner:It overcomes the difficulties that arise in the problem solving agent.

4. What are the effects of non-planning?
An infinite branching factor in the case of many tasks
Having to choose a heuristic function and work in the same fixed sequence/order

5. What are the key ideas of the planning approach?
The 3 key ideas of the planning approach are:
Open up the representation of states, goals and actions using FOL.
The planner is free to add actions to the plan whenever and wherever they are needed, rather than building an incremental sequence.
Most parts of the world are independent of the other parts.

6.State the 3 components of operators in representation of planning? Action description Pre-condition Effect

7. What is STRIPS?
STanford Research Institute Problem Solver
Tidily arranged action descriptions
Restricted language (function-free literals)
Efficient algorithms

8. What is a situation space planner?
The planner takes the situation and searches for it in the KB and locates it. The plan is reused if it is already present; otherwise a new plan is made for the situation and executed.

9. What are the drawbacks of the progression planner?
High branching factor during searching
Huge search space

10. What are the planning algorithms for searching a world space?
There are two algorithms:

Progression: An algorithm that searches for the goal state by searching through the states generated by actions that can be performed in the given state, starting from the initial state.

Regression: An algorithm that searches backward from the goal state by finding actions whose effects satisfy one or more of the posted goals, and posting the chosen action's preconditions as goals ( goal regression).
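A minimal Python sketch of progression (forward state-space) search over STRIPS-style actions follows; the tiny "door" domain and its literals are invented purely for illustration:

from collections import deque

# Each action: (name, preconditions, add effects, delete effects), all sets of literals.
actions = [
    ("walk_to_door", {"at_start"},               {"at_door"},   {"at_start"}),
    ("open_door",    {"at_door", "door_closed"}, {"door_open"}, {"door_closed"}),
]

def progression(initial, goal):
    # Breadth-first forward search from the initial state to a state containing the goal.
    frontier = deque([(frozenset(initial), [])])
    visited = {frozenset(initial)}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:
            return plan
        for name, pre, add, delete in actions:
            if pre <= state:
                nxt = frozenset((state - delete) | add)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, plan + [name]))
    return None

print(progression({"at_start", "door_closed"}, {"door_open"}))
# ['walk_to_door', 'open_door']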

11. What is a partial plan?
A partial plan is an incomplete plan which may be produced during the initial phase of planning. There are 2 main operations allowed in planning:

Refinement operator
Modification operator
A (partial) plan consists of:
a set of operator applications Si
partial (temporal) order constraints Si < Sj
causal links Si --c--> Sj (Si achieves condition c for Sj)

12. What is a fully instantiated plan?
A fully instantiated plan is formally defined as a data structure consisting of the following 4 components:
A set of plan steps
A set of step ordering constraints
A set of variable binding constraints
A set of causal links

13. What is a complete plan?
A plan is complete iff every precondition is achieved. A precondition c of a step Sj is achieved (by Si) if Si < Sj, c is an effect of Si, and there is no step Sk with Si < Sk < Sj and ~c an effect of Sk (otherwise Sk is called a clobberer or threat).

14. What are the properties of POP?
Non-deterministic search for a plan; backtracks over choice points on failure:
. choice of the step Sadd used to achieve an open condition of Sneed
. choice of promotion or demotion for a clobberer
Sound and complete. There are extensions for disjunction, universal quantification, negation and conditionals. Efficient with good heuristics derived from the problem description, and good for problems with loosely related subgoals. But: very sensitive to subgoal ordering.

15. What is conditional planning?
Conditional planning: plan to obtain information (observation actions) and build a subplan for each contingency.
Example: [Check(Tire1); If(Intact(Tire1); [Inflate(Tire1)]; [CallHelp])]
Disadvantage: expensive because it plans for many unlikely cases.

It is similar to POP. If an open condition can be established by an observation action:
. add the action to the plan
. complete the plan for each possible observation outcome

16. What is monitoring / replanning?
Monitoring / replanning: assume normal states and outcomes, check progress during execution, and replan if necessary.
Disadvantage: unanticipated outcomes may lead to failure.

17. Distinguish between the learning element and performance element Learning element is responsible for making improvements.

The learning element takes some knowledge about the learning element and some feedback on how the agent is doing, and determines how the performance element should be modified to (hopefully) do better in the future. Performance element is responsible for selecting external actions.

The performance element is what we have previously considered to be the entire agent: it takes in percepts and decides on actions.

18. What is the role of critic in the general model of learning. The critic is designed to tell the learning element how well the agent is doing. The critic employs a fixed standard of performance. This is necessary because the percepts themselves provide no indication of the agent's success.

19. Describe the problem generator. It is responsible for suggesting actions that will lead to new and informative experiences. The point is that if the performance element had its way, it would keep doing the actions that are best, given what it knows. But if the agent is willing to explore a little, and do some perhaps suboptimal actions in the short run, it might discover much better actions for the long run.

20. What is meant by Speedup Learning? The learning element is also responsible for improving the efficiency of the performance

element. For example, when asked to make a trip to a new destination, the taxi might take a while to consult its map and plan the best route. But the next time a similar trip is requested, the planning process should be much faster. This is called speedup learning

21. What are the issues in the design of learning agents. The design of the learning element is affected by four major issues: Which components of the performance element are to be improved. What representation is used for those components. What feedback is available. What prior information is available.

22. List the Components of the performance element A direct mapping from conditions on the current state to actions. A means to infer relevant properties of the world from the percept sequence. Information about the way the world evolves. Information about the results of possible actions the agent can take. Utility information indicating the desirability of world states. Action-value information indicating the desirability of particular actions in particular states. Goals that describe classes of states whose achievement maximizes the agent's utility.

23. What are the types of learning Supervised learning Unsupervised learning Reinforcement learning

24. What are the approaches for learning logical sentences. The two approaches to learning logical sentences: Decision tree methods, which use a restricted representation of logical sentences specifically designed for learning, Version-space approach, which is more general but often rather inefficient.

25. What are decision trees. A decision tree takes as input an object or situation described by a set of properties, and outputs a yes/no "decision." Decision trees therefore represent Boolean functions. Each internal node in the tree corresponds to a test of the value of one of the properties, and the branches from the node are labelled with the possible values of the test. Each leaf node in the tree specifies the Boolean value to be returned if that leaf is reached.

26. Give an example of a decision tree path expressed as a logical sentence.
The path for a restaurant full of patrons, with an estimated wait of 10-30 minutes when the agent is not hungry, is expressed by the logical sentence:
For all r: Patrons(r, Full) ^ WaitEstimate(r, 10-30) ^ Hungry(r, N) => WillWait(r)

27. What are the Parity Function and Majority Function?
The parity function returns 1 if and only if an even number of its inputs are 1; to represent it exactly, an exponentially large decision tree is needed. The majority function returns 1 if more than half of its inputs are 1.

28. Draw an example decision tree

29. Explain the terms positive example, negative example and training set.
An example is described by the values of the attributes and the value of the goal predicate. We call the value of the goal predicate the classification of the example. If the goal predicate is true for some example, we call it a positive example; otherwise we call it a negative example. Consider a set of examples X1, ..., X12 for the restaurant domain. The positive examples are the ones where the goal WillWait is true (X1, X3, ...) and the negative examples are the ones where it is false (X2, X5, ...). The complete set of examples is called the training set.

30. Explain the methodology used for assessing the performance of a learning algorithm.
1. Collect a large set of examples.
2. Divide it into two disjoint sets: the training set and the test set.
3. Use the learning algorithm with the training set as examples to generate a hypothesis H.
4. Measure the percentage of examples in the test set that are correctly classified by H.
5. Repeat steps 1 to 4 for different sizes of training sets and different randomly selected training sets of each size.
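The five steps above can be sketched in a few lines of Python; the "majority-class" learner and the random data below are placeholders for a real learning algorithm and training set:

import random

def majority_learner(train):
    # Stand-in learning algorithm: always predict the most common label in the training set.
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return lambda x: guess

# Made-up examples: (attribute value, Boolean classification).
examples = [(random.random(), random.random() > 0.5) for _ in range(200)]

random.shuffle(examples)
split = int(0.7 * len(examples))
train, test = examples[:split], examples[split:]   # step 2: disjoint training and test sets

h = majority_learner(train)                        # step 3: induce a hypothesis H
accuracy = sum(h(x) == y for x, y in test) / len(test)
print(f"test-set accuracy: {accuracy:.2f}")        # step 4: fraction correctly classified by H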

31. Draw an example of training set.

32. What is Overfitting?
Whenever there is a large set of possible hypotheses, one has to be careful not to use the resulting freedom to find meaningless "regularity" in the data. This problem is called overfitting. It is a very general phenomenon, and occurs even when the target function is not at all random. It afflicts every kind of learning algorithm, not just decision trees.

33. What is Decision tree pruning Pruning works by preventing recursive splitting on attributes that are not clearly relevant, even when the data at that node in the tree is not uniformly classified.

34. What is Pruning. The probability that the attribute is really irrelevant can be calculated with the help of standard statistical software. This is called pruning.

35. What is Cross Validation. Cross-validation is another technique that eliminates the dangers of overfitting. The basic idea of cross-validation is to try to estimate how well the current hypothesis will predict unseen data. This is done by setting aside some fraction of the known data, and using it to test the prediction performance of a hypothesis induced from the rest of the known data. 36. What is Missing Data In many domains, not all the attribute values will be known for every example. The values may not have been recorded, or they may be too expensive to obtain. This gives rise to two problems. First, given a complete decision tree, how should one classify an object that is missing one of the test attributes? Second, how should one modify the information gain formula when some examples have unknown values for the attribute?

37. What are multivalued attributes?
When an attribute has a large number of possible values, the information gain measure gives an inappropriate indication of the attribute's usefulness. Consider the extreme case where every example has a different value for the attribute, for instance if we were to use an attribute RestaurantName in the restaurant domain. In such a case, each subset of examples is a singleton and therefore has a unique classification, so the information gain measure would have its highest value for this attribute, even though the attribute may be irrelevant or useless. One solution is to use the gain ratio instead of the raw information gain.
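A minimal Python sketch of the information-gain computation discussed above; the attribute split used here (6 positive and 6 negative examples divided into three subsets) is a made-up example:

from math import log2

def entropy(pos, neg):
    # Entropy of a Boolean-classified set with pos positive and neg negative examples.
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * log2(p)
    return h

parent = (6, 6)                       # positives and negatives before the split
subsets = [(4, 0), (0, 2), (2, 4)]    # (positives, negatives) in each branch

total = sum(sum(s) for s in subsets)
remainder = sum((sum(s) / total) * entropy(*s) for s in subsets)
gain = entropy(*parent) - remainder
print(f"information gain: {gain:.3f}")   # about 0.541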

38.What is continuous valued attribute Attributes such as Height and Weight have a large or infinite set of possible values. They are therefore not well-suited for decision-tree learning in raw form. An obvious way to deal with this problem is to discretize the attribute. For example, the Price attribute for restaurants was

discretized into $, $$, and $$$ values. Normally, such discrete ranges would be defined by hand. A better approach is to preprocess the raw attribute values during the tree-growing process in order to find out which ranges give the most useful information for classification purposes.

39. What are version space methods?
Although version space methods are probably not practical in most real-world learning problems, mainly because of noise, they provide a good deal of insight into the logical structure of hypothesis space.

40. What is PAC learning?
Any hypothesis that is seriously wrong will almost certainly be "found out" with high probability after a small number of examples, because it will make an incorrect prediction. Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong; that is, it must be Probably Approximately Correct. PAC-learning is the subfield of computational learning theory that is devoted to this idea.

41. Difference between the learning element and the performance element of an agent?
Learning agents can be divided conceptually into a performance element, which is responsible for selecting actions, and a learning element, which is responsible for modifying the performance element.

42. Give a function for decision list learning.

function DECISION-LIST-LEARNING(examples) returns a decision list, or failure
  if examples is empty then return the trivial decision list No
  t <- a test that matches a nonempty subset examples_t of examples
       such that the members of examples_t are all positive or all negative
  if there is no such t then return failure
  if the examples in examples_t are positive then o <- Yes else o <- No
  return a decision list with initial test t and outcome o and remaining elements
       given by DECISION-LIST-LEARNING(examples - examples_t)

43.What is decision list?

A decision list is a logical expression of a restricted form. It consists of a series of tests, each of which is a conjunction of literals. If a test succeeds when applied to an example description,the decision list specifies the value to be returned. If the test fails, processing continues with the next test in the list.
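A runnable Python rendering of the pseudocode in question 42, under the simplifying assumption that a test is a single (attribute, value) equality check; the function name and the tiny data set are our own:

def decision_list_learning(examples):
    # examples: list of (attribute_dict, label). Returns a list of (test, outcome)
    # pairs ending with (None, False) as the trivial "No" list, or None on failure.
    if not examples:
        return [(None, False)]
    for attr in examples[0][0]:
        for value in {ex[attr] for ex, _ in examples}:
            matched = [(ex, y) for ex, y in examples if ex[attr] == value]
            labels = {y for _, y in matched}
            if matched and len(labels) == 1:          # all positive or all negative
                outcome = labels.pop()
                rest = [(ex, y) for ex, y in examples if ex[attr] != value]
                tail = decision_list_learning(rest)
                if tail is None:
                    return None
                return [((attr, value), outcome)] + tail
    return None                                       # no suitable test exists

data = [({"raining": True}, False), ({"raining": False}, True)]
print(decision_list_learning(data))
# e.g. [(('raining', True), False), (('raining', False), True), (None, False)]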

UNIT III
Reasoning under uncertainty: Logics of non-monotonic reasoning - Implementation - Basic probability notation - Bayes rule - Certainty factors and rule based systems - Bayesian networks - Dempster-Shafer Theory - Fuzzy Logic.

1. What is uncertainty? Explain.

Uncertainty


Let action At = leave for airport t minutes before flight Will At get me there on time?

Problems:
1) partial observability (road state, other drivers' plans, etc.)
2) noisy sensors (KCBS traffic reports)
3) uncertainty in action outcomes (flat tire, etc.)
4) immense complexity of modelling and predicting traffic

Hence a purely logical approach either
1) risks falsehood: "A25 will get me there on time", or
2) leads to conclusions that are too weak for decision making: "A25 will get me there on time if there's no accident on the bridge and it doesn't rain and my tires remain intact, etc., etc."

Methods for handling uncertainty

Default or nonmonotonic logic: Assume my car does not have a flat tire; assume A25 works unless contradicted by evidence. Issues: What assumptions are reasonable? How to handle contradiction?
Rules with fudge factors:

Issues: Problems with combination, e.g., Sprinkler causes Rain??

2. What is non-monotonic reasoning? Explain.


A non-monotonic logic is a formal logic whose consequence relation is not monotonic. Most studied formal logics have a monotonic consequence relation, meaning that adding a formula to a theory never produces a reduction of its set of consequences. Intuitively, monotonicity indicates that learning a new piece of knowledge cannot reduce the set of what is known. A monotonic logic cannot handle various reasoning tasks such as reasoning by default (consequences may be derived only because of lack of evidence to the contrary), abductive reasoning (consequences are only deduced as most likely explanations), some important approaches to reasoning about knowledge, and belief revision.

Default reasoning
An example of a default assumption is that the typical bird flies. As a result, if a given animal is known to be a bird, and nothing else is known, it can be assumed to be able to fly. The default assumption must however be retracted if it is later learned that the considered animal is a penguin. This example shows that a logic that models default reasoning should not be monotonic. Logics formalizing default reasoning can be roughly divided into two categories: logics able to deal with arbitrary default assumptions (default logic, defeasible logic/defeasible reasoning/argument (logic), and answer set programming) and logics that formalize the specific default assumption that facts that are not known to be true can be assumed false by default (closed world assumption and circumscription).

Abductive reasoning
Abductive reasoning is the process of deriving the most likely explanations of the known facts. An abductive logic should not be monotonic because the most likely explanations are not necessarily correct. For example, the most likely explanation for seeing wet grass is that it rained; however, this explanation has to be retracted when learning that the real cause of the grass being wet was a sprinkler. Since the old explanation (it rained) is retracted because of the addition of a piece of knowledge (a sprinkler was active), any logic that models explanations is non-monotonic.

Reasoning about knowledge
If a logic includes formulae that mean that something is not known, this logic should not be monotonic. Indeed, learning something that was previously not known leads to the removal of the formula specifying that this piece of knowledge is not known. This second change (a removal caused by an addition) violates the condition of monotonicity. A logic for reasoning about knowledge is the autoepistemic logic.

Belief revision
Belief revision is the process of changing beliefs to accommodate a new belief that might be inconsistent with the old ones. In the assumption that the new belief is correct, some of the old ones have to be retracted in order to maintain consistency. This retraction in response to an addition of a new belief makes any logic for belief revision non-monotonic. The belief revision approach is an alternative to paraconsistent logics, which tolerate inconsistency rather than attempting to remove it.

3. What is probability? Explain basic probability notation.

Probability
Given the available evidence, A25 will get me there on time with probability 0.04. (Fuzzy logic handles degree of truth, NOT uncertainty; e.g., WetGrass is true to degree 0.2.) Probabilistic assertions summarize the effects of laziness (failure to enumerate exceptions, qualifications, etc.) and ignorance (lack of relevant facts, initial conditions, etc.). Subjective or Bayesian probability: probabilities relate propositions to one's own state of knowledge, e.g., P(A25 | no reported accidents) = 0.06. These are not claims of a "probabilistic tendency" in the current situation (but might be learned from past experience of similar situations). Probabilities of propositions change with new evidence: e.g., P(A25 | no reported accidents, 5 a.m.) = 0.15. (Analogous to logical entailment status KB |= alpha, not truth.)

Making decisions under uncertainty


Suppose I believe the following:
P(A25 gets me there on time | ...) = 0.04
P(A90 gets me there on time | ...) = 0.70
P(A120 gets me there on time | ...) = 0.95
P(A1440 gets me there on time | ...) = 0.9999

Which action to choose? It depends on my preferences for missing the flight vs. airport cuisine, etc. Utility theory is used to represent and infer preferences. Decision theory = utility theory + probability theory.
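Continuing the example, decision theory combines these probabilities with utilities. A minimal Python sketch follows; the probabilities are the ones quoted above, but the utility numbers for arriving on time versus missing the flight (or waiting at the airport) are invented for illustration:

# (probability of being on time, utility if on time, utility if not).
actions = {
    "A25":   (0.04,   100, -50),
    "A90":   (0.70,    90, -40),
    "A120":  (0.95,    80, -30),
    "A1440": (0.9999,  10, -100),   # leave a day early: on time, but miserable
}

def expected_utility(p, u_on_time, u_late):
    return p * u_on_time + (1 - p) * u_late

for a, params in actions.items():
    print(a, round(expected_utility(*params), 2))
print("MEU choice:", max(actions, key=lambda a: expected_utility(*actions[a])))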

Probabilistic Reasoning Using logic to represent and reason we can represent knowledge about the world with facts and rules, like the following ones: bird(tweety). fly(X) :- bird(X). We can also use a theorem-prover to reason about the world and deduct new facts about the world, for e.g., ?- fly(tweety). Yes However, this often does not work outside of toy domains - non-tautologous certain rules are hard to find. A way to handle knowledge representation in real problems is to extend logic by using certainty factors. In other words, replace IF condition THEN fact with

IF condition with certainty x THEN fact with certainty f(x). Unfortunately we cannot really adapt logical inference to probabilistic inference, since the latter is not context-free. Replacing rules with conditional probabilities makes inferencing simpler: replace "smoking -> lung cancer" (or "lots of conditions, smoking -> lung cancer") with P(lung cancer | smoking) = 0.6. Uncertainty is represented explicitly and quantitatively within probability theory, a formalism that has been developed over centuries. A probabilistic model describes the world in terms of a set S of possible states - the sample space. We don't know the true state of the world, so we (somehow) come up with a probability distribution over S which gives the probability of any state being the true one. The world is usually described by a set of variables or attributes. Consider the probabilistic model of a fictitious medical expert system. The world is described by 8 binary valued variables:
Visit to Asia? A
Tuberculosis? T
Either tub. or lung cancer? E

Lung cancer? L
Smoking? S
Bronchitis? B
Dyspnoea? D
Positive X-ray? X


We have 2^8 = 256 possible states or configurations, and so 256 probabilities to find.

Review of Probability Theory

The primitives in probabilistic reasoning are random variables, just as the primitives in propositional logic are propositions. A random variable is not in fact a variable, but a function from a sample space S to another space, often the real numbers. For example, let the random variable Sum (representing the outcome of two die throws) be defined thus: Sum(die1, die2) = die1 + die2. Each random variable has an associated probability distribution determined by the underlying distribution on the sample space. Continuing our example: P(Sum = 2) = 1/36, P(Sum = 3) = 2/36, ..., P(Sum = 12) = 1/36. Consider the probabilistic model of the fictitious medical expert system mentioned before. The sample space is described by 8 binary valued variables:
Visit to Asia? A
Tuberculosis? T
Either tub. or lung cancer? E
Lung cancer? L
Smoking? S
Bronchitis? B
Dyspnoea? D
Positive X-ray? X
There are 2^8 = 256 events in the sample space. Each event is determined by a joint instantiation of all of the variables.

S = {(A = f, T = f,E = f,L = f, S = f,B = f,D = f,X = f), (A = f, T = f,E = f,L = f, S = f,B = f,D = f,X = t), . . . (A = t, T = t,E = t,L = t, S = t,B = t,D = t,X = t)}

Since S is defined in terms of joint instantiations, any distribution defined on it is called a joint distribution. All underlying distributions will be joint distributions in this module. The variables {A,T,E,L,S,B,D,X} are in fact random variables, which project values.

L(A = f, T = f,E = f,L = f, S = f,B = f,D = f,X = f) = f L(A = f, T = f,E = f,L = f, S = f,B = f,D = f,X = t) = f L(A = t, T = t,E = t,L = t, S = t,B = t,D = t,X = t) = t

Each of the random variables {A,T,E,L,S,B,D,X} has its own distribution, determined by the underlying joint distribution. This is known as the marginal distribution. For example, the distribution for L is denoted P(L), and this distribution is defined by the two probabilities P(L = f) and P(L = t). For example,
P(L = f) = P(A = f, T = f, E = f, L = f, S = f, B = f, D = f, X = f)
+ P(A = f, T = f, E = f, L = f, S = f, B = f, D = f, X = t)
+ P(A = f, T = f, E = f, L = f, S = f, B = f, D = t, X = f)
+ ...
+ P(A = t, T = t, E = t, L = f, S = t, B = t, D = t, X = t)
P(L) is an example of a marginal distribution.
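A minimal Python sketch of the marginalization just described, using a made-up joint distribution over only two binary variables (L and S) rather than all eight:

# Made-up joint distribution over (L, S); the four entries sum to 1.
joint = {
    (False, False): 0.4, (False, True): 0.3,
    (True,  False): 0.1, (True,  True): 0.2,
}

# Marginal distribution P(L): sum the joint over all values of the other variable.
p_l = {l: sum(joint[(l, s)] for s in (False, True)) for l in (False, True)}
print(p_l)   # {False: 0.7, True: 0.3}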

5.Explain Bayes' Rule

Bayes' Rule and conditional independence

Wumpus World

Specifying the probability model

Observations and query

Using conditional independence


Basic insight: observations are conditionally independent of other hidden squares given neighbouring hidden squares

Manipulate query into a form where we can use this!
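The headings above are only an outline, so here is a minimal Python sketch of what Bayes' rule plus conditional independence buys us: combining two observations E1 and E2 that are assumed conditionally independent given a single cause (a naive-Bayes-style update). All of the numbers are illustrative:

# Prior over Cause and per-evidence likelihoods, assuming E1 and E2 are
# conditionally independent given Cause (illustrative numbers only).
p_cause = {True: 0.1, False: 0.9}
p_e1 = {True: 0.8, False: 0.2}   # P(E1=true | Cause)
p_e2 = {True: 0.7, False: 0.1}   # P(E2=true | Cause)

# P(Cause | e1, e2) is proportional to P(e1 | Cause) * P(e2 | Cause) * P(Cause).
unnorm = {c: p_e1[c] * p_e2[c] * p_cause[c] for c in (True, False)}
z = sum(unnorm.values())
posterior = {c: v / z for c, v in unnorm.items()}
print(posterior)   # Cause=True rises from the 0.1 prior to about 0.76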

6.Explain Bayesian networks


A simple, graphical notation for conditional independence assertions and hence for compact specification of full joint distributions. Syntax: a set of nodes, one per variable; a directed, acyclic graph (a link means "directly influences"); and a conditional distribution for each node given its parents:

In the simplest case, conditional distribution represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values

Example
Topology of network encodes conditional independence assertions:

Weather is independent of the other variables. Toothache and Catch are conditionally independent given Cavity.

Example: I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?
Variables: Burglar, Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects "causal" knowledge:

A burglar can set the alarm off An earthquake can set the alarm off The alarm can cause Mary to call The alarm can cause John to call

Compactness

Constructing Bayesian networks


Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics

Example

Deciding conditional independence is hard in noncausal directions (Causal models and conditional independence seem hardwired for humans!) Assessing conditional probabilities is hard in noncausal directions Network is less compact: 1 + 2 + 4 + 2 + 4=13 numbers needed

Example: Car diagnosis


Initial evidence: the car won't start. Testable variables (green), "broken, so fix it" variables (orange). Hidden variables (gray) ensure sparse structure and reduce parameters.

Example: Car insurance

Compact conditional distributions

Hybrid (discrete+continuous) networks

Option 1: discretization - possibly large errors, large CPTs.
Option 2: finitely parameterized canonical families:
1) Continuous variable, discrete+continuous parents (e.g., Cost)
2) Discrete variable, continuous parents (e.g., Buys?)

Continuous child variables


Need one conditional density function for child variable given continuous parents, for each possible assignment to discrete parents

Most common is the linear Gaussian model, e.g.,:

Mean Cost varies linearly with Harvest, variance is fixed Linear variation is unreasonable over the full range but works OK if the likely range of Harvest is narrow

All-continuous network with LG distributions full joint distribution is a multivariate Gaussian Discrete+continuous LG network is a conditional Gaussian network i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values

Discrete variable w/ continuous parents

Inference in Bayesian networks Inference tasks

Inference by enumeration
Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation
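A minimal Python sketch of enumeration for the Burglary/Earthquake/Alarm fragment of the network described above; the CPT numbers are assumed for the example. It answers P(Burglary | Alarm = true) by summing the joint over the hidden variable Earthquake and normalizing:

# Illustrative CPTs (numbers assumed for the sketch).
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(Alarm=true | Burglary, Earthquake)
       (False, True): 0.29, (False, False): 0.001}

def joint(b, e, a):
    pa = P_A[(b, e)]
    return P_B[b] * P_E[e] * (pa if a else 1 - pa)

# Sum out the hidden variable Earthquake, then normalize over the query variable.
unnorm = {b: sum(joint(b, e, True) for e in (True, False)) for b in (True, False)}
z = sum(unnorm.values())
print({b: round(v / z, 4) for b, v in unnorm.items()})   # P(Burglary | alarm)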

7.Explain Dempster - Shafer Theory in AI


The Dempster-Shafer theory (DST) is a mathematical theory of evidence. It allows one to combine evidence from different sources and arrive at a degree of belief (represented by a belief function) that takes into account all the available evidence. The theory was first developed by Arthur P. Dempster and Glenn Shafer.

In a narrow sense, the term Dempster-Shafer theory refers to the original conception of the theory by Dempster and Shafer. However, it is more common to use the term in the wider sense of the same general approach, as adapted to specific kinds of situations. In particular, many authors have proposed different rules for combining evidence, often with a view to handling conflicts in evidence better. Dempster-Shafer theory is a generalization of the Bayesian theory of subjective probability; whereas the latter requires probabilities for each question of interest, belief functions base degrees of belief (or confidence, or trust) for one question on the probabilities for a related question. These degrees of belief may or may not have the mathematical properties of probabilities; how much they differ depends on how closely the two questions are related.[4] Put another way, it is a way of representing epistemic plausibilities, but it can yield answers that contradict those arrived at using probability theory. Dempster-Shafer theory is based on two ideas: obtaining degrees of belief for one question from subjective probabilities for a related question, and Dempster's rule for combining such degrees of belief when they are based on independent items of evidence. In essence, the degree of belief in a proposition depends primarily upon the number of answers (to the related questions) containing the proposition, and the subjective probability of

each answer. Also contributing are the rules of combination that reflect general assumptions about the data. In this formalism a degree of belief (also referred to as a mass) is represented as a belief function rather than a Bayesian probability distribution. Probability values are assigned to sets of possibilities rather than single events: their appeal rests on the fact that they naturally encode evidence in favor of propositions. Dempster-Shafer theory assigns its masses to all of the non-empty subsets of the entities that compose a system.

Belief and plausibility


Shafer's framework allows for belief about propositions to be represented as intervals, bounded by two values, belief (or support) and plausibility:
belief <= plausibility.

Belief in a hypothesis is constituted by the sum of the masses of all sets enclosed by it (i.e. the sum of the masses of all subsets of the hypothesis). It is the amount of belief that directly supports a given hypothesis at least in part, forming a lower bound. Belief (usually denoted Bel) measures the strength of the evidence in favor of a set of propositions. It ranges from 0 (indicating no evidence) to 1 (denoting certainty). Plausibility is 1 minus the sum of the masses of all sets whose intersection with the hypothesis is empty. It is an upper bound on the possibility that the hypothesis could be true, i.e. it could possibly be the true state of the system up to that value, because there is only so much evidence that contradicts that hypothesis. Plausibility (denoted by Pl) is defined to be Pl(s) = 1 - Bel(~s). It also ranges from 0 to 1 and measures the extent to which evidence in favor of ~s leaves room for belief in s. For example, suppose we have a belief of 0.5 and a plausibility of 0.8 for a proposition, say "the cat in the box is dead". This means that we have evidence that allows us to state strongly that the proposition is true with a confidence of 0.5. However, the evidence contrary to that hypothesis (i.e. "the cat is alive") only has a confidence of 0.2.

The remaining mass of 0.3 (the gap between the 0.5 supporting evidence on the one hand, and the 0.2 contrary evidence on the other) is indeterminate, meaning that the cat could either be dead or alive. This interval represents the level of uncertainty based on the evidence in your system.
Hypothesis                      Mass   Belief   Plausibility
Null (neither alive nor dead)   0      0        0
Alive                           0.2    0.2      0.5
Dead                            0.5    0.5      0.8
Either (alive or dead)          0.3    1.0      1.0

The null hypothesis is set to zero by definition (it corresponds to "no solution"). The orthogonal hypotheses "Alive" and "Dead" have probabilities of 0.2 and 0.5, respectively. This could correspond to "Live/Dead Cat Detector" signals, which have respective reliabilities of 0.2 and 0.5. Finally, the all-encompassing "Either" hypothesis (which simply acknowledges there is a cat in the box) picks up the slack so that the sum of the masses is 1. The belief for the Alive and Dead hypotheses matches their corresponding masses because they have no subsets; belief for Either consists of the sum of all three masses (Either, Alive, and Dead) because Alive and Dead are each subsets of Either. The Alive plausibility is 1 - m(Dead) and the Dead plausibility is 1 - m(Alive). Finally, the Either plausibility sums m(Alive) + m(Dead) + m(Either). The universal hypothesis ("Either") will always have 100% belief and plausibility; it acts as a checksum of sorts. Here is a somewhat more elaborate example where the behavior of belief and plausibility begins to emerge: we're looking through a variety of detector systems at a single faraway signal light, which can only be coloured in one of three colours (red, yellow, or green).
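Returning to the cat example, its belief and plausibility values can be reproduced with a short Python sketch that derives them directly from the mass assignment in the table above:

# Mass assignment over the non-empty subsets of {alive, dead}, from the table.
masses = {
    frozenset({"alive"}): 0.2,
    frozenset({"dead"}): 0.5,
    frozenset({"alive", "dead"}): 0.3,   # "Either"
}

def belief(hypothesis):
    # Sum of the masses of all subsets of the hypothesis.
    return sum(m for s, m in masses.items() if s <= hypothesis)

def plausibility(hypothesis):
    # 1 minus the masses of all sets whose intersection with the hypothesis is empty.
    return 1 - sum(m for s, m in masses.items() if not (s & hypothesis))

for h in ({"alive"}, {"dead"}, {"alive", "dead"}):
    h = frozenset(h)
    print(sorted(h), belief(h), plausibility(h))
# alive: Bel 0.2, Pl 0.5;  dead: Bel 0.5, Pl 0.8;  either: Bel 1.0, Pl 1.0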

Combining beliefs

Beliefs corresponding to independent pieces of information are combined using Dempster's rule of combination, which is a generalization of the special case of Bayes' theorem where events are independent. Note that the probability masses from propositions that contradict each other can also be used to obtain a measure of how much conflict there is in a system. This measure has been used as a criterion for clustering multiple pieces of seemingly conflicting evidence around competing hypotheses. In addition, one of the computational advantages of the Dempster-Shafer framework is that priors and conditionals need not be specified, unlike Bayesian methods, which often use a symmetry (minimax error) argument to assign prior probabilities to random variables (e.g. assigning 0.5 to binary values for which no information is available about which is more likely). However, any information contained in the missing priors and conditionals is not used in the Dempster-Shafer framework unless it can be obtained indirectly, and arguably it is then available for calculation using Bayes' equations.

8.Explain Fuzzy Logic in AI.


Fuzzy logic is a form of many-valued logic or probabilistic logic; it deals with reasoning that is approximate rather than fixed and exact. In contrast with traditional logic theory, where binary sets have two-valued logic (true or false), fuzzy logic variables may have a truth value that ranges in degree between 0 and 1. Fuzzy logic has been extended to handle the concept of partial truth, where the truth value may range between completely true and completely false.[1] Furthermore, when linguistic variables are used, these degrees may be managed by specific functions.

Overview
The reasoning in fuzzy logic is similar to human reasoning. It allows for approximate values and inferences as well as incomplete or ambiguous data (fuzzy data), as opposed to only relying on crisp data (binary yes/no choices). Fuzzy logic is able to process incomplete data and provide approximate solutions to problems other methods find difficult to solve.

Degrees of truth

Fuzzy logic and probabilistic logic are mathematically similar (both have truth values ranging between 0 and 1) but conceptually distinct, due to different interpretations; see interpretations of probability theory. Fuzzy logic corresponds to "degrees of truth", while probabilistic logic corresponds to "probability, likelihood"; as these differ, fuzzy logic and probabilistic logic yield different models of the same real-world situations. Both degrees of truth and probabilities range between 0 and 1 and hence may seem similar at first. For example, let a 100 ml glass contain 30 ml of water. Then we may consider two concepts: Empty and Full. The meaning of each of them can be represented by a certain fuzzy set. Then one might define the glass as being 0.7 empty and 0.3 full.

Applying truth values
A basic application might characterize subranges of a continuous variable. For instance, a temperature measurement for anti-lock brakes might have several separate membership functions defining particular temperature ranges needed to control the brakes properly. Each function maps the same temperature value to a truth value in the 0 to 1 range. These truth values can then be used to determine how the brakes should be controlled. In this image, the meanings of the expressions cold, warm, and hot are
represented by functions mapping a temperature scale. A point on that scale has three "truth values"one for each of the three functions. The vertical line in the image represents a particular temperature that the three arrows (truth values) gauge. Since the red arrow points to zero, this temperature may be interpreted as "not hot". The orange arrow (pointing at 0.2) may describe it as "slightly warm" and the blue arrow (pointing at 0.8) "fairly cold".
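The temperature example can be sketched with simple piecewise-linear membership functions; the breakpoints below (10-20 and 25-35 degrees) are invented for illustration and do not come from the figure:

def cold(t):
    # Membership in "cold": 1 below 10 degrees, falling linearly to 0 at 20 degrees.
    return max(0.0, min(1.0, (20 - t) / 10))

def hot(t):
    # Membership in "hot": 0 below 25 degrees, rising linearly to 1 at 35 degrees.
    return max(0.0, min(1.0, (t - 25) / 10))

def warm(t):
    # Membership in "warm": whatever is left between cold and hot (one simple choice).
    return max(0.0, 1.0 - cold(t) - hot(t))

t = 18.0
print(f"cold={cold(t):.1f} warm={warm(t):.1f} hot={hot(t):.1f}")
# cold=0.2 warm=0.8 hot=0.0 -- one temperature can belong to several sets at once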

Linguistic variables
While variables in mathematics usually take numerical values, in fuzzy logic applications non-numeric linguistic variables are often used to facilitate the expression of rules and facts. A linguistic variable such as age may have a value such as young or its antonym old. However, the great utility of linguistic variables is that they can be modified via linguistic hedges applied to primary terms. The linguistic hedges can be associated with certain functions.

The most important propositional fuzzy logics are:

Monoidal t-norm-based propositional fuzzy logic MTL is an axiomatization of logic where conjunction is defined by a left continuous t-norm, and implication is defined as the residuum of the t-norm. Its models correspond to MTL-algebras that are prelinear commutative bounded integral residuated lattices.

Basic propositional fuzzy logic BL is an extension of MTL logic where conjunction is defined by a continuous t-norm, and implication is also defined as the residuum of the t-norm. Its models correspond to BL-algebras.

Łukasiewicz fuzzy logic is the extension of basic fuzzy logic BL where standard conjunction is the Łukasiewicz t-norm. It has the axioms of basic fuzzy logic plus an axiom of double negation, and its models correspond to MV-algebras.

Gödel fuzzy logic is the extension of basic fuzzy logic BL where conjunction is the Gödel t-norm. It has the axioms of BL plus an axiom of idempotence of conjunction, and its models are called G-algebras.

Product fuzzy logic is the extension of basic fuzzy logic BL where conjunction is the product t-norm. It has the axioms of BL plus another axiom for cancellativity of conjunction, and its models are called product algebras.

Fuzzy logic with evaluated syntax (sometimes also called Pavelka's logic), denoted by EVŁ, is a further generalization of mathematical fuzzy logic. While the above kinds of fuzzy logic have traditional syntax and many-valued semantics, in EVŁ the syntax is evaluated as well. This means that each formula has an evaluation. Axiomatization of EVŁ stems from Łukasiewicz fuzzy logic. A generalization of the classical Gödel completeness theorem is provable in EVŁ.

UNIT IV
Planning and Learning: Planning with state space search - conditional planning-continuous planning - Multi-Agent planning. Forms of learning - inductive learning - Reinforcement Learning - learning decision trees - Neural Net learning and Genetic learning

1.What is Planning in AI?Explain the Planning done by an agent?


Planning problem

Find a sequence of actions that achieves a given goal when executed from a given initial
world state. That is, given

a set of operator descriptions (defining the possible primitive actions by the agent), an initial state description, and a goal state description or predicate,
compute a plan, which is

a sequence of operator instances, such that executing them in the initial state will change
the world to a state satisfying the goal-state description.

Goals are usually specified as a conjunction of goals to be achieved


An Agent Architecture

Planning vs. problem solving

Planning and problem solving methods can often solve the same sorts of problems.
Planning is more powerful because of the representations and methods used.
States, goals, and actions are decomposed into sets of sentences (usually in first-order logic).
Search often proceeds through plan space rather than state space (though there are also state-space planners).
Subgoals can be planned independently, reducing the complexity of the planning problem.

Typical assumptions

Atomic time: Each action is indivisible.
No concurrent actions are allowed (though actions do not need to be ordered with respect to each other in the plan).
Deterministic actions: The result of actions is completely determined; there is no uncertainty in their effects.
Agent is the sole cause of change in the world.
Agent is omniscient: Has complete knowledge of the state of the world.
Closed World Assumption: everything known to be true in the world is included in the state description. Anything not listed is false.

Blocks world

The blocks world is a micro-world that consists of a table, a set of blocks and a robot hand. Some domain constraints:

Only one block can be on another block.
Any number of blocks can be on the table.
The hand can only hold one block.
Typical representation:

ontable(a) ontable(c) on(b,a) handempty clear(b) clear(c)

General Problem Solver

The General Problem Solver (GPS) system was an early planner (Newell, Shaw, and Simon).

GPS generated actions that reduced the difference between some state and a goal state.
GPS used Means-Ends Analysis: compare what is given or known with what is desired and select a reasonable thing to do next.
Use a table of differences to identify procedures to reduce types of differences.
GPS was a state space planner: it operated in the domain of state space problems specified by an initial state, some goal states, and a set of operations.

Situation calculus planning

Intuition: Represent the planning problem using first-order logic.
Situation calculus lets us reason about changes in the world.
Use theorem proving to prove that a particular sequence of actions, when applied to the situation characterizing the world state, will lead to a desired result.

Situation calculus

Initial state: a logical sentence about (situation) S0:
At(Home, S0) ^ ~Have(Milk, S0) ^ ~Have(Bananas, S0) ^ ~Have(Drill, S0)

Goal state:
∃s At(Home,s) ^ Have(Milk,s) ^ Have(Bananas,s) ^ Have(Drill,s)

Operators are descriptions of how the world changes as a result of the agent's actions:

∀a,s Have(Milk, Result(a,s)) <=> ((a = Buy(Milk) ^ At(Grocery,s)) v (Have(Milk,s) ^ a ~= Drop(Milk)))

Result(a,s) names the situation resulting from executing action a in situation s. Action sequences are also useful: Result'(l,s) is the result of executing the list of actions l starting in s:
∀s Result'([],s) = s
∀a,p,s Result'([a|p],s) = Result'(p, Result(a,s))
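A minimal Python sketch of the Result' recursion above (the single-step transition function result is a hypothetical argument supplied by the caller, not something defined in the text):

def result_list(actions, state, result):
    # Result'(l, s): the state reached by executing the action list l starting in s.
    if not actions:                       # Result'([], s) = s
        return state
    first, rest = actions[0], actions[1:]
    # Result'([a|p], s) = Result'(p, Result(a, s))
    return result_list(rest, result(first, state), result)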

Basic representations for planning

Classic approach first used in the STRIPS planner circa 1970.
States are represented as a conjunction of ground literals: at(Home) ^ ~have(Milk) ^ ~have(bananas) ...
Goals are conjunctions of literals, but may have variables which are assumed to be existentially quantified: at(?x) ^ have(Milk) ^ have(bananas) ...
Do not need to fully specify the state: non-specified literals are either don't-care or assumed false. This represents many cases in small storage, and often only changes in state are represented rather than the entire situation.
Unlike a theorem prover, we are not asking whether the goal is true, but whether there is a sequence of actions to attain it.

Operator/action representation

Operators contain three components:
Action description
Precondition - conjunction of positive literals
Effect - conjunction of positive or negative literals which describe how the situation changes when the operator is applied

Example: Op[Action: Go(there), Precond: At(here) ^ Path(here,there), Effect: At(there) ^ ~At(here)]

All variables are universally quantified.
Situation variables are implicit: preconditions must be true in the state immediately before the operator is applied; effects are true immediately after.

Blocks world operators

Here are the classic basic operations for the blocks world:

stack(X,Y): put block X on block Y
unstack(X,Y): remove block X from block Y
pickup(X): pick up block X
putdown(X): put block X on the table
Each will be represented by

a list of preconditions
a list of new facts to be added (add-effects)
a list of facts to be removed (delete-effects)
optionally, a set of (simple) variable constraints
For example:
preconditions(stack(X,Y), [holding(X), clear(Y)])
deletes(stack(X,Y), [holding(X), clear(Y)])
adds(stack(X,Y), [handempty, on(X,Y), clear(X)])
constraints(stack(X,Y), [X\==Y, Y\==table, X\==table])
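The same operator can also be written in executable form. Below is a minimal Python sketch (the set-of-ground-literals state representation and the dictionary layout are assumptions, not the original notation) that builds the stack(X,Y) operator from its precondition, delete, and add lists and applies it to a state:

def stack(x, y):
    return {
        "name": f"stack({x},{y})",
        "preconds": {f"holding({x})", f"clear({y})"},
        "deletes":  {f"holding({x})", f"clear({y})"},
        "adds":     {"handempty", f"on({x},{y})", f"clear({x})"},
    }

def applicable(op, state):
    return op["preconds"] <= state            # every precondition holds in the state

def apply_op(op, state):
    assert applicable(op, state)
    return (state - op["deletes"]) | op["adds"]

state = {"holding(a)", "clear(b)", "ontable(b)", "ontable(c)", "clear(c)"}
state = apply_op(stack("a", "b"), state)
print(sorted(state))   # now contains on(a,b), clear(a), handempty, ...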

STRIPS planning

STRIPS maintains two additional data structures:
State List - all currently true predicates.
Goal Stack - a push-down stack of goals to be solved, with the current goal on top of the stack.

If the current goal is not satisfied by the present state, examine the add lists of operators, and push the operator and its precondition list on the stack (subgoals).
When a current goal is satisfied, POP it from the stack.
When an operator is on top of the stack, record the application of that operator in the plan sequence and use the operator's add and delete lists to update the current state.

Typical BW planning problem

Initial state: clear(a) clear(b) clear(c) ontable(a) ontable(b) ontable(c) handempty
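To see how the state list and the operators' add and delete lists interact, here is a minimal sketch of a simple forward search from an initial state like the one above (this is plain state-space search, not the goal-stack algorithm itself; ground_operators is a hypothetical helper that enumerates the operator instances applicable in a state):

from collections import deque

def forward_plan(initial, goal, ground_operators):
    # Breadth-first search from the initial state to any state containing the goal literals.
    start = frozenset(initial)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:                     # the goal is a conjunction of literals
            return plan
        for op in ground_operators(state):
            nxt = frozenset((state - op["deletes"]) | op["adds"])
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [op["name"]]))
    return None                               # no plan found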

Goal interaction

Simple planning algorithms assume that the goals to be achieved are independent

Each can be solved separately and then the solutions concatenated

This planning problem, called the Sussman Anomaly, is the classic example of the goal interaction problem:

Solving on(A,B) first (by doing unstack(C,A), stack(A,B)) will be undone when solving the second goal on(B,C) (by doing unstack(A,B), stack(B,C)).

Solving on(B,C) first will be undone when solving on(A,B)

Classic STRIPS could not handle this, although minor modifications can get it to do simple cases

State-space planning

We initially have a space of situations (where you are, what you have, etc.).
The plan is a solution found by searching through the situations to get to the goal.
A progression planner searches forward from the initial state to the goal state.
A regression planner searches backward from the goal; this works if operators have enough information to go both ways.
Ideally this leads to reduced branching: you are only considering things that are relevant to the goal.

Plan-space planning

An alternative is to search through the space of plans, rather than situations. Start from a partial plan which is expanded and refined until a complete plan that solves the problem is generated.

Refinement operators add constraints to the partial plan; modification operators make other kinds of changes.

We can still use STRIPS-style operators:


Op(ACTION: RightShoe, PRECOND: RightSockOn, EFFECT: RightShoeOn)
Op(ACTION: RightSock, EFFECT: RightSockOn)
Op(ACTION: LeftShoe, PRECOND: LeftSockOn, EFFECT: LeftShoeOn)
Op(ACTION: LeftSock, EFFECT: LeftSockOn)
could result in a partial plan of [RightShoe, LeftShoe]

2. What is learning and how is it represented in AI? Explain.


Learning is an important area in AI, perhaps more so than planning.

Problems are hard -- harder than planning. Recognised solutions are not as common as in planning. A goal of AI is to enable computers that can be taught rather than programmed.

Learning is an area of AI that focuses on processes of self-improvement. Information processes that improve their performance or enlarge their knowledge bases are said to learn. Why is it hard?

Intelligence implies that an organism or machine must be able to adapt to new situations.

It must be able to learn to do new things. This requires knowledge acquisition, inference, updating/refinement of knowledge base, acquisition of heuristics, applying faster searches, etc.

How can we learn?

Many approaches have been taken to attempt to provide a machine with learning capabilities. This is because learning tasks cover a wide range of phenomena. Listed below are a few examples of how one may learn. We will look at these in detail shortly
Skill refinement

-- one can learn by practicing, e.g. playing the piano.


Knowledge acquisition

-- one can learn by experience and by storing the experience in a knowledge base. One basic example of this type is rote learning.
Taking advice

-- Similar to rote learning although the knowledge that is input may need to be transformed (or operationalised) in order to be used effectively.
Problem Solving

-- if we solve a problem one may learn from this experience. The next time we see a similar problem we can solve it more efficiently. This does not usually involve gathering new knowledge but may involve reorganisation of data or remembering how to achieve the solution.
Induction

-- One can learn from examples. Humans often classify things in the world without knowing explicit rules. Usually involves a teacher or trainer to aid the classification.
Discovery

Here one learns knowledge without the aid of a teacher.


Analogy

If a system can recognise similarities in information already stored then it may be able to transfer some knowledge to improve the solution of the task at hand.

General model:

Learning is essential for unknown environments, i.e., when the designer lacks omniscience.
Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down.

Learning modifies the agent's decision mechanisms to improve performance

Learning element

Design of a learning element is affected by


Which components of the performance element are to be learned
What feedback is available to learn these components
What representation is used for the components

Type of feedback:
Supervised learning: correct answers for each example
Unsupervised learning: correct answers not given
Reinforcement learning: occasional rewards

3. Explain Inductive Learning

Inductive learning is a kind of learning in which, given a set of examples, an agent tries to estimate or create an evaluation function. Most inductive learning is supervised learning, in which examples are provided with classifications. (The alternative is clustering.) More formally, an example is a pair (x, f(x)), where x is the input and f(x) is the output of the function applied to x. The task of pure inductive inference (or induction) is, given a set of examples of f, to find a hypothesis h that approximates f.

Given an example: a pair <x, f(x)> where x is the input and f(x) is the result.
Generate a hypothesis: a function h(x) that approximates f(x).
That will generalize well: correctly predict values for unseen samples.
How do we do this? We must determine a hypothesis space:
the set of all hypotheses we are willing to consider (e.g. all functions of degree less than 10)

that is realizable, i.e. contains the true function.

Inductive learning method

Simplest form: learn a function from examples (tabula rasa). f is the target function.

An example is a pair (x, f(x)).

Problem: find a hypothesis h such that h ≈ f, given a training set of examples. (This is a highly simplified model of real learning: it ignores prior knowledge, assumes a deterministic, observable "environment", assumes examples are given, and assumes that the agent wants to learn f (why?).)

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting:


"The most likely hypothesis is the simplest one consistent with the data." Since there can be noise in the measurements, in practice we need to make a trade-off between the simplicity of the hypothesis and how well it fits the data.
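A minimal sketch of this trade-off (the data set and the polynomial hypothesis space are made-up assumptions): fit hypotheses of increasing degree to noisy samples of a target function and watch the training error shrink even as the hypotheses become less simple.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)   # noisy samples of the target f

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)            # hypothesis h: a degree-d polynomial
    err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training error {err:.4f}")      # lower error, but less simple h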

5. Explain Reinforcement Learning

As opposed to supervised learning, reinforcement learning takes place in an environment where the agent cannot directly compare the results of its action to a desired result. Instead, it is given some reward or punishment that relates to its actions. It may win or lose a game, or be told it has made a good move or a poor one. The job of reinforcement learning is to find a successful function using these rewards. The reason reinforcement learning is harder than supervised learning is that the agent is never told what the right action is, only whether it is doing well or poorly, and in some cases (such as chess) it may only receive feedback after a long string of actions. There are two basic kinds of information an agent can try to learn.

utility function -- The agent learns the utility of being in various states, and chooses actions to maximize the expected utility of their outcomes. This requires the agent to keep a model of the environment.

action-value -- The agent learns an action-value function giving the expected utility of performing an action in a given state. This is called Q-learning. This is the model-free approach.

Passive Learning in a Known Environment

Assuming an environment consisting of a set of states, some terminal and some nonterminal, and a model that specifies the probabilities of transition from state to state, an agent learns passively by observing a set of training sequences, which consist of a set of state transitions followed by a reward. The goal is to use the reward information to learn the expected utility of each of the nonterminal states. An important simplifying assumption is that the utility of a sequence is the sum of the rewards accumulated in the states of the sequence. That is, the utility function is additive. A passive learning agent keeps an estimate U of the utility of each state, a table N of how many times each state was seen, and a table M of transition probabilities. There are a variety of ways the agent can update its table U.

Naive Updating

One simple updating method is the least mean squares (LMS) approach [Widrow and Hoff, 1960]. It assumes that the observed reward-to-go of a state in a sequence provides direct evidence of the actual reward-to-go. The approach is simply to keep the utility as a running average of the rewards based upon the number of times the state has been seen. This approach minimizes the mean square error with respect to the observed data. It converges very slowly, because it ignores the fact that the actual utility of a state is the probability-weighted average of its successors' utilities, plus its own reward. LMS disregards these probabilities.

Adaptive Dynamic Programming

If the transition probabilities and the rewards of the states are known (which will usually happen after a reasonably small set of training examples), then the actual utilities can be computed directly as

U(i) = R(i) + SUM_j M_ij U(j)

where U(i) is the utility of state i, R is its reward, and M_ij is the probability of transition from state i to state j. This is identical to a single value determination in the policy iteration algorithm for Markov decision processes. Adaptive dynamic programming is any kind of reinforcement learning method that works by solving the utility equations using a dynamic programming algorithm. It is exact, but of course highly inefficient in large state spaces.

Temporal Difference Learning [Richard Sutton]

Temporal difference learning uses the difference in utility values between successive states to adjust them from one epoch to another. The key idea is to use the observed transitions to adjust the values of the observed states so that they agree with the ADP constraint equations. Practically, this means updating the utility of state i so that it agrees better with its successor j. This is done with the temporal-difference (TD) equation:

U(i) <- U(i) + a(R(i) + U(j) - U(i))

where a is a learning rate parameter. Temporal difference learning is a way of approximating the ADP constraint equations without solving them for all possible states. The idea generally is to define conditions that hold over local transitions when the utility estimates are correct, and then create update rules that nudge the estimates toward this equation. This approach will cause U(i) to converge to the correct value if the learning rate parameter decreases with the number of times a state has been visited [Dayan, 1992]. In general, as the number of training sequences tends to infinity, TD will converge on the same utilities as ADP.

Passive Learning in an Unknown Environment

Since neither temporal difference learning nor LMS actually uses the model M of state transition probabilities, they will operate unchanged in an unknown environment. The ADP approach, however, updates its estimated model of an unknown environment after each step, and this model is used to revise the utility estimates. Any method for learning stochastic functions can be used to learn the environment model; in particular, in a simple environment the transition probability M_ij is just the percentage of times state i has transitioned to j.
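A minimal Python sketch of the TD update above applied along one observed training sequence (the 1/N(i) learning-rate schedule is an assumed choice; any rate that decreases with the visit count works):

from collections import defaultdict

def td_passive_update(U, visits, sequence, rewards):
    # sequence is a list of observed states i0, i1, ...; rewards[i] is R(i).
    for i, j in zip(sequence, sequence[1:]):
        visits[i] += 1
        alpha = 1.0 / visits[i]                      # assumed learning-rate schedule
        U[i] += alpha * (rewards[i] + U[j] - U[i])   # U(i) <- U(i) + a(R(i) + U(j) - U(i))
    return U

U = defaultdict(float)
visits = defaultdict(int)
td_passive_update(U, visits, ["s0", "s1", "s2"], {"s0": 0.0, "s1": 0.0, "s2": 1.0})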

The basic difference between TD and ADP is that TD adjusts a state to agree with the observed successor, while ADP makes a state agree with all successors that might occur, weighted by their probabilities. More importantly, ADP's adjustments may need to be propagated across all of the utility equations, while TD's affect only the current equation. TD is essentially a crude first approximation to ADP. A middle ground can be found by bounding or ordering the number of adjustments made in ADP, beyond the simple one made in TD. The prioritized-sweeping heuristic prefers only to make adjustments to states whose likely successors have just undergone large adjustments in their utility estimates. Such approximate ADP systems can be very nearly as efficient as ADP in terms of convergence, but operate much more quickly.

Active Learning in an Unknown Environment

The difference between active and passive agents is that passive agents learn a fixed policy, while the active agent must decide what action to take and how it will affect its rewards. To represent an active agent, the environment model M is extended to give the probability of a transition from a state i to a state j, given an action a. Utility is modified to be the reward of the state plus the maximum utility expected depending upon the agent's action:

U(i) = R(i) + max_a SUM_j M^a_ij U(j)

An ADP agent is extended to learn transition probabilities given actions; this is simply another dimension in its transition table. A TD agent must similarly be extended to have a model of the environment.

Learning Action-Value Functions

An action-value function assigns an expected utility to the result of performing a given action in a given state. If Q(a, i) is the value of doing action a in state i, then

U(i) = max_a Q(a, i)

The equations for Q-learning are similar to those for state-based learning agents. The difference is that Q-learning agents do not need models of the world. The equilibrium equation, which can be used directly (as with ADP agents) is

Q(a, i) = R(i) + SUM_j M^a_ij max_a' Q(a', j)

The temporal difference version does not require that a model be learned; its update equation is

Q(a, i) <- Q(a, i) + a(R(i) + max_a' Q(a', j) - Q(a, i))

Applications of Reinforcement Learning

The first significant reinforcement learning system was used in Arthur Samuel's checker-playing program. It used a weighted linear function to evaluate positions, though it did not use observed rewards in its learning process. TD-gammon [Tesauro, 1992] has an evaluation function represented by a fully-connected neural network with one hidden layer of 80 nodes; with the inclusion of some precomputed board features in its input, it was able to reach world-class play after about 300,000 training games. A case of reinforcement learning in robotics is the famous cart-pole balancing problem. The problem is to control the position of the cart (along a single axis) so as to keep a pole balanced on top of it upright, while staying within the limits of the track length. Actions are usually to jerk left or right, the so-called bang-bang control approach. The first work on this problem was the BOXES system [Michie and Chambers, 1968], in which state space was partitioned into boxes, and reinforcement propagated into the boxes. More recent simulated work using neural networks [Furuta et al., 1984] addressed the triple-inverted-pendulum problem, in which three poles balance one atop another on a cart.
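A minimal sketch of the model-free Q-learning update given earlier in this answer (discounting is omitted to match the additive-utility formulation used here; the state and action names are placeholders):

from collections import defaultdict

Q = defaultdict(float)                           # keyed by (action, state)

def q_update(state, action, reward, next_state, actions, alpha=0.1):
    # Q(a,i) <- Q(a,i) + a(R(i) + max_a' Q(a',j) - Q(a,i))
    best_next = max(Q[(a, next_state)] for a in actions)
    Q[(action, state)] += alpha * (reward + best_next - Q[(action, state)])

# one observed transition: state i, action taken, reward R(i), successor j
q_update("i", "left", 0.0, "j", actions=["left", "right"])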

6. Discuss Neural Net learning and Genetic learning

Neural Networks

A neural network consists of a set of nodes (or units), links that connect one node to another, and weights associated with each link. Some nodes receive inputs via links; others receive them directly from the environment, and some nodes send outputs out of the network. Learning usually occurs by adjusting the weights on the links. Each unit has a set of weighted inputs, an activation level, and a way to compute its activation level at the next time step. This is done by applying an activation function to the weighted sum of the node's inputs. Generally, the weighted sum (also called the input function) is a strictly linear sum, while the activation function may be nonlinear. If the value of the activation function is above a threshold, the node "fires." Generally, all nodes share the same activation function and threshold value, and only the topology and weights change.

g is a non-linear function which takes as input a weighted sum of the input link signals (as well as an intrinsic bias weight) and outputs a certain signal strength.
g is commonly a threshold function or sigmoid function.

Network Structures

The two fundamental types of network structure are feed-forward and recurrent. A feed-forward network is a directed acyclic graph; information flows in one direction only, and there are no cycles. Such networks cannot represent internal state. Usually, neural networks are also layered, meaning that nodes are organized into groups of layers, and links only go from nodes to nodes in adjacent layers.

Recurrent networks allow loops, and as a result can represent state, though they are much more complex to analyze. Hopfield networks and Boltzmann machines are examples of recurrent networks; Hopfield networks are the best understood. All connections in Hopfield networks are bidirectional with symmetric weights, all units have outputs of 1 or -1, and the activation function is the sign function. Also, all nodes in a Hopfield network are both input and output nodes. Interestingly, it has been shown that a Hopfield network can reliably recognize 0.138N training examples, where N is the number of units in the network. Boltzmann machines allow non-input/output units, and they use a stochastic evaluation function that is based upon the sum of the total weighted input. Boltzmann machines are formally equivalent to a certain kind of belief network evaluated with a stochastic simulation algorithm.

One problem in building neural networks is deciding on the initial topology, e.g., how many nodes there are and how they are connected. Genetic algorithms have been used to explore this problem, but it is a large search space and this is a computationally intensive approach. The optimal brain damage method uses information theory to determine whether weights can be removed from the network without loss of performance, possibly even improving it. The alternative of making the network larger has been tested with the tiling algorithm [Mezard and Nadal, 1989], which takes an approach similar to induction on decision trees; it expands a unit by adding new ones to cover instances it misclassified. Cross-validation techniques can be used to determine when the network size is right.

Perceptrons

Perceptrons are single-layer, feed-forward networks that were first studied in the 1950's. They are only capable of learning linearly separable functions. That is, if we view F features as defining an F-dimensional space, the network can recognize any class that involves placing a single hyperplane between the instances of two classes. So, for example, they can easily represent AND, OR, or NOT, but cannot represent XOR. Perceptrons learn by updating the weights on their links in response to the difference between their output value and the correct output value. The updating rule (due to Frank Rosenblatt, 1960) is as follows. Define Err as the difference between the correct output and the actual output. Then the learning rule for each weight is

Wj <- Wj + A x Ij x Err

where A is a constant called the learning rate. Of course, this was too good to last, and in Perceptrons [Minsky and Papert, 1969] it was observed how limited linearly separable functions were. Work on perceptrons withered, and neural networks didn't come into vogue again until the 1980's, when multi-layer networks became the focus.
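A minimal sketch of the perceptron rule above applied to a threshold unit learning the (linearly separable) OR function; the learning rate, epoch count, and bias handling are arbitrary assumptions:

def perceptron_train(examples, n_inputs, rate=0.1, epochs=20):
    w = [0.0] * (n_inputs + 1)                   # last weight acts as the bias
    for _ in range(epochs):
        for inputs, target in examples:
            x = list(inputs) + [1.0]             # append a constant bias input
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            err = target - out                   # Err = correct output - actual output
            w = [wi + rate * xi * err for wi, xi in zip(w, x)]   # Wj <- Wj + A x Ij x Err
    return w

or_examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(perceptron_train(or_examples, n_inputs=2))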

Multi-Layer Feed-Forward Networks

The standard method for learning in multi-layer feed-forward networks is back-propagation [Bryson and Ho, 1969]. Such networks have an input layer, an output layer, and one or more hidden layers in between. The difficulty is to divide the blame for an erroneous output among the nodes in the hidden layers. The back-propagation rule is similar to the perceptron learning rule. If Erri is the error at the output node, then the weight update for the link from unit j to unit i (the output node) is
Wj,i <- Wj,i + A x aj x Erri x g'(ini)

where g' is the derivative of the activation function, and aj is the activation of the unit j. (Note that this means the activation function must have a derivative, so the sigmoid function is usually used rather than the step function.) Define Di as Erri x g'(ini).

This updates the weights leading to the output node. To update the weights on the interior links, we use the idea that the hidden node j is responsible for part of the error in each of the nodes to which it connects. Thus the error at the output is divided according to the strength of the connection between the output node and the hidden node, and propagated backward to previous layers. Specifically,

Dj = g'(inj) x SUMi Wj,i Di

Thus the updating rule for internal nodes is

Wj,i <- Wj,i + A x aj x Di

Lastly, the weight updating rule for the weights from the input layer to the hidden layer is

Wk,j <- Wk,j + A x Ik x Dj

where k is the input node and j the hidden node, and Ik is the input value of k.

A neural network requires 2^n/n hidden units to represent all Boolean functions of n inputs. For m training examples and W weights, each epoch in the learning process takes O(mW) time; but in the worst case, the number of epochs can be exponential in the number of inputs. In general, if the number of hidden nodes is too large, the network may learn only the training examples, while if the number is too small it may never converge on a set of weights consistent with the training examples. Multi-layer feed-forward networks can represent any continuous function with a single hidden layer, and any function with two hidden layers [Cybenko, 1988, 1989].

Applications of Neural Networks

John Denker remarked that "neural networks are the second best way of doing just about anything." They provide passable performance on a wide variety of problems that are difficult to solve well using other methods. NETtalk [Sejnowski and Rosenberg, 1987] was designed to learn how to pronounce written text. Input was a seven-character window centered on the target character, and output was a set of Booleans controlling the form of the sound to be produced. It learned 95% accuracy on its training set, but had only 78% accuracy on the test set. Not spectacularly good, but important because it impressed many people with the potential of neural networks. Other applications include a ZIP code recognition system [Le Cun et al., 1989] that achieves 99% accuracy on handwritten codes, and driving [Pomerleau, 1993] in the ALVINN system at CMU. ALVINN controls the NavLab vehicles, and translates inputs from a video image into steering control directions. ALVINN performs exceptionally well on the particular road type it learns, but poorly on other terrain types. The extended MANIAC system [Jochem et al., 1993] has multiple ALVINN subnets combined to handle different road types.
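For completeness, here is a minimal numpy sketch of one back-propagation step following the update rules above for a single hidden layer (layer sizes, initial weights, and the learning rate A are arbitrary assumptions; bias weights are omitted for brevity):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, target, W_in_hid, W_hid_out, A=0.5):
    # forward pass
    in_hid = x @ W_in_hid                  # in_j for each hidden unit
    a_hid = sigmoid(in_hid)                # a_j
    in_out = a_hid @ W_hid_out             # in_i for each output unit
    a_out = sigmoid(in_out)                # actual outputs

    # backward pass
    err = target - a_out                               # Err_i
    d_out = err * a_out * (1 - a_out)                  # D_i = Err_i x g'(in_i)
    d_hid = a_hid * (1 - a_hid) * (W_hid_out @ d_out)  # D_j = g'(in_j) x SUM_i W_j,i D_i

    # weight updates: W <- W + A x activation x delta
    W_hid_out += A * np.outer(a_hid, d_out)
    W_in_hid += A * np.outer(x, d_hid)
    return a_out

x = np.array([0.0, 1.0])
W1 = np.random.default_rng(0).normal(size=(2, 3))    # input -> hidden weights
W2 = np.random.default_rng(1).normal(size=(3, 1))    # hidden -> output weights
backprop_step(x, np.array([1.0]), W1, W2)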

7. What is conditional planning? Explain.

8. Describe continuous planning.
