
Module 5

Machine Learning
Definition of learning:
• A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks T, as
measured by P, improves with experience E.
Examples
i) Handwriting recognition learning problem
• Task T: Recognising and classifying handwritten words within images
• Performance P: Percent of words correctly classified
• Training experience E: A dataset of handwritten words with given classifications
ii) A robot driving learning problem
• Task T: Driving on highways using vision sensors
• Performance measure P: Average distance traveled before an error
• Training experience E: A sequence of images and steering commands recorded while
observing a human driver
iii) A chess learning problem
• Task T: Playing chess
• Performance measure P: Percent of games won against opponents
• Training experience E: Playing practice games against itself.
A computer program which learns from experience is called a machine
learning program or simply a learning program. Such a program is sometimes
also referred to as a learner.
1. Data storage
• Facilities for storing and retrieving huge amounts of data are an important component of the learning process.
• Humans and computers alike utilize data storage as a foundation for advanced reasoning.
• In a human being, the data is stored in the brain and retrieved using electrochemical signals.
• Computers use hard disk drives, flash memory, random access memory and similar devices to store data and use cables and other
technology to retrieve data.
2. Abstraction
• The second component of the learning process is known as abstraction.
• Abstraction is the process of extracting knowledge from stored data.
• This involves creating general concepts about the data as a whole.
• The creation of knowledge involves application of known models and creation of new models. The process of fitting a
model to a dataset is known as training.
• When the model has been trained, the data is transformed into an abstract form that summarizes the original information.
3. Generalization
• The third component of the learning process is known as generalization.
• The term generalization describes the process of turning the knowledge about stored data into a form that can be
utilized for future action.
• These actions are to be carried out on tasks that are similar, but not identical, to those that have been seen before.
• In generalization, the goal is to discover those properties of the data that will be most relevant to future tasks.
4. Evaluation
• Evaluation is the last component of the learning process.
• It is the process of giving feedback to the user to measure the utility of the learned knowledge. This feedback is then
utilized to effect improvements in the whole learning process.
Definition- Machine Learning:
• “Field of study that gives computers the capability to learn without
being explicitly programmed” - Arthur Samuel.
• Machine Learning is an application of Artificial Intelligence (AI) which
enables a program (software) to learn from experience and improve
itself at a task without being explicitly programmed.
• For example, how would you write a program that can identify fruits
based on their various properties, such as color, shape, size or any other
property?
• Machine Learning is the study of making machines more human-like in
their behavior and decision making by giving them the ability to learn
with minimum human intervention, i.e., no explicit programming.
• Machine Learning needs huge computational power, a lot of data, and
devices capable of storing such vast data.
• How can a program attain any experience and from where does it
learn?
• Data is also called the fuel for Machine Learning and we can safely say
that there is no machine learning without data.
Need :
• Machine Learning can automate many tasks, especially ones that
previously only humans could perform with their innate intelligence.
• Replicating this intelligence to machines can be achieved only with
the help of machine learning.
• Machine Learning helps in creating models that can process and
analyze large amounts of complex data to deliver accurate results.
• These models are precise and scalable and function with less
turnaround time.
• Image recognition, text generation, and many other use-cases are
finding applications in the real world.
Work:
• A machine learning model learns from the historical data fed to it
and then builds prediction algorithms to predict the output for the
new set of data that comes in as input to the system.
• The accuracy of these models would depend on the quality and
amount of input data.
• A large amount of data will help build a better model which
predicts the output more accurately.
Machine Learning History
Features:
• Automation
• In your Gmail account, there is a spam folder that contains all the spam emails.
• ML recognizes the spam emails and thus, it is easy to automate this process.
• The ability to automate repetitive tasks is one of the biggest characteristics of machine
learning.
• Improved customer experience
• For any business, one of the most crucial ways to drive engagement, promote brand
loyalty and establish long-lasting customer relationships is by providing a customized
experience and better services.
• ML has enabled us to make amazing recommendation systems that are accurate. They
help us customize the user experience.
• An example of this is Eva from AirAsia airlines.
• Automated Data Visualization
• Take an example of companies like Google, Twitter, Facebook. How much data are they
generating per day?
• We can use this data and visualize the notable relationships, thus giving businesses the
ability to make better decisions that can actually benefit both companies as well as
customers.
• User-friendly automated data visualization platforms such as AutoViz make this process easier.
• Business Intelligence
• From retail to financial services to healthcare, and many more, ML has already become
one of the most effective technologies to boost business operations.
Why Python?
• Python provides flexibility in choosing
between object-oriented programming
or scripting.
• There is also no need to recompile the
code; developers can implement any
changes and instantly see the results.
• Python is a versatile programming
language and can run on any platform
including Windows, MacOS, Linux, Unix,
and others.
• Migrating Python code to another platform needs only minor
modification.
Types of ML
• Machine learning has been broadly categorized into three categories
1.Supervised Learning
2.Unsupervised Learning
3.Reinforcement Learning
• Supervised Learning
• The learning is monitored or supervised in the sense that we already know the
output, and the algorithm is corrected each time to optimize its results.
• The algorithm is trained over the data set and amended until it achieves an
acceptable level of performance.
• We have two sets of variables. One is the target variable, or label (the
variable we want to predict); the other is the features (variables that help us
predict the target variable).
• We show the program (model) the features and the label associated with these
features, and the program is then able to find the underlying pattern in the data.
• We can group the supervised learning problems as:
• Regression problems – Used to predict future values; the model is trained with
historical data. E.g., predicting the future price of a house.
• Classification problems – Various labels train the algorithm to identify items within a
specific category. E.g., dog or cat (as in the example that follows), apple or
orange, beer or wine or water.
Example
• Teaching a kid to differentiate dogs from cats. How would you do it?
• You may show him/her a dog and say “here is a dog,” and when you encounter a cat you
would point it out as a cat.
• When you show the kid enough dogs and cats, he may learn to differentiate between them. If
he is trained well, he may even be able to recognize breeds of dogs he has never seen.
• The supervised learning model has a set of input variables (x), and an output variable (y). An
algorithm identifies the mapping function between the input and output variables. The
relationship is y = f(x).
• Take this example dataset, used to predict the price of a house given its size. Which is the
feature, and which is the target variable?

Number of rooms    Price
1                  $100
3                  $300
5                  $500
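Under the hood, a regression model simply fits parameters to such a table. Below is a minimal sketch (pure Python, not from the slides), assuming price is proportional to the number of rooms, so a single slope fitted by least squares suffices:

```python
# A minimal supervised-regression sketch for the house-price table above.
# Fit price = w * rooms by least squares (no intercept term, since the
# data passes through the origin), then predict an unseen size.

rooms = [1, 3, 5]            # feature x
price = [100, 300, 500]      # target y (labels)

# closed-form least-squares slope: w = sum(x*y) / sum(x*x)
w = sum(x * y for x, y in zip(rooms, price)) / sum(x * x for x in rooms)

def predict(x):
    return w * x

print(w)           # 100.0
print(predict(4))  # 400.0 -- generalizes to a size not in the training set
```

The fitted mapping y = f(x) = 100·x recovers the pattern in the table and generalizes to sizes never seen during training.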
Unsupervised Learning
• This approach is the one where we have no target variable, and we have only the
input variables (features) at hand. The algorithm learns by itself and discovers
interesting structure in the data.
• The goal is to decipher the underlying distribution of the data to gain more
knowledge about it.
• We can group the unsupervised learning problems as:
• Clustering: This means bundling the input variables with the same characteristics together.
E.g., grouping users based on search history
• Association: Here, we discover the rules that govern meaningful associations among the data
set. E.g., People who watch ‘X’ will also watch ‘Y’.
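Clustering can be sketched in a few lines. Below is a tiny k-means (k = 2) on 1-D data; the data points and the naive initialization are illustrative assumptions, not from the slides. Note there are no labels: the algorithm discovers the grouping by itself.

```python
# A minimal clustering sketch (k-means with k = 2) on 1-D data.
# Unsupervised: no labels are given; the two groups emerge from the data.

data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]   # two obvious groups
centers = [data[0], data[3]]             # naive initialization

for _ in range(10):                      # Lloyd's iterations
    clusters = [[], []]
    for x in data:
        # assign each point to its nearest center
        i = min((0, 1), key=lambda k: abs(x - centers[k]))
        clusters[i].append(x)
    # move each center to the mean of its assigned points
    centers = [sum(c) / len(c) for c in clusters]

print(sorted(round(c, 1) for c in centers))  # [1.0, 8.1]
```

Real systems would use a library implementation (e.g., scikit-learn's KMeans) with multiple restarts, but the assign-then-update loop above is the whole idea.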
• Reinforcement Learning
• In this approach, machine learning models are trained to make a series of decisions
based on the rewards and feedback they receive for their actions. The machine
learns to achieve a goal in complex and uncertain situations and is rewarded each
time it achieves it during the learning period.
• Reinforcement learning is different from supervised learning in the sense that there
is no answer key available, so the reinforcement agent decides the steps to perform a
task. The machine learns from its own experience when there is no training data set
present.
Designing a Learning system
• For any learning system, we must know three elements: T (Task), P (Performance Measure), and
E (Training Experience). At a high level, the learning process looks as below.

• The learning process starts with task T, performance measure P and training experience E, and the
objective is to find an unknown target function.
• The target function is the exact knowledge to be learned from the training experience, and it is unknown.
• For example, in the case of credit approval, the learning system will have customer application records as
experience, and the task would be to classify whether a given customer application is eligible for a loan.
• So in this case, the training examples can be represented as (x1, y1), (x2, y2), …, (xn, yn), where X represents
customer application details and y represents the status of credit approval.
• With these details, what is the exact knowledge to be learned from the training experience?
• So the target function to be learned in the credit approval learning system is a mapping function f : X → y.
• This function represents the exact knowledge defining the relationship between input variable X and output variable y.
Design of a learning system
• Just now we looked into the learning process and also understood the goal of the learning.
• When we want to design a learning system that follows the learning process, we need to consider a few design choices. The design
choices will be to decide the following key components:
1. Type of training experience
2. Choosing the Target Function
3. Choosing a representation for the Target Function
4. Choosing an approximation algorithm for the Target Function
5. The final Design
• We will look into the game - checkers learning problem and apply the above design choices. For a checkers learning problem, the
three elements will be,
1. Task T: To play checkers
2. Performance measure P: Percent of games won in the tournament.
3. Training experience E: A set of games played against itself
1.1 Type of training experience
• During the design of the checkers learning system, the type of training experience available for a learning system will have a
significant effect on the success or failure of the learning.
1. Direct or Indirect training experience
• In the case of direct training experience, individual board states and the correct move for each board state are given.
• In case of indirect training experience, the move sequences for a game and the final result (win, loss or draw) are given for a
number of games.
• How to assign credit or blame to individual moves is the credit assignment problem.
2. Teacher or Not
• Supervised — The training experience will be labeled, which means, all the board states will be labeled with
the correct move. So the learning takes place in the presence of a supervisor or a teacher.
• Unsupervised — The training experience will be unlabeled, which means, all the board states will not have
the moves. So the learner generates random games and plays against itself with no supervision or teacher
involvement.
• Semi-supervised — Learner generates game states and asks the teacher for help in finding the correct move
if the board state is confusing.
3. Is the training experience good
• Do the training examples represent the distribution of examples over which the final system performance
will be measured?
• Performance is best when training examples and test examples are from the same/a similar distribution.
• The checkers player learns by playing against itself. Its experience is indirect.
• It may not encounter moves that are common in human expert play.
• Once the proper training experience is available, the next design step will be choosing the Target Function.
1.2 Choosing the Target Function
• When you are playing the checkers game, at any moment of time, you make a decision on choosing the best
move from different possibilities.
• You think and apply the learning that you have gained from the experience.
• Here the learning is, for a specific board, you move a checker such that your board state tends towards the
winning situation. Now the same learning has to be defined in terms of the target function.
• Here there are 2 considerations — direct and indirect experience.
• During the direct experience,
• The checkers learning system needs only to learn how to choose the best move from some large
search space.
• We need to find a target function that will help us choose the best move among alternatives.
• Let us call this function ChooseMove and use the notation ChooseMove : B →M to indicate that this
function accepts as input any board from the set of legal board states B and produces as output some
move from the set of legal moves M.
• When there is an indirect experience,
• It becomes difficult to learn such a function. How about assigning a real score to the board state?
• So let the function be V : B → R, indicating that it accepts as input any board from the set of legal board
states B and produces as output a real score. This function assigns higher scores to better board
states.
• Let us therefore define the target value V(b) for an arbitrary board state b in B, as follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b’), where b’ is the best final board state that can be
achieved starting from b and playing optimally until the end of the game.
• Definition (4) is recursive: to determine the value V(b) for a particular board state, it requires
searching ahead for the optimal line of play, all the way to the end of the game.
• Since this definition is not efficiently computable by our checkers-playing program, we say that it is a
nonoperational definition.
• The goal of learning, in this case, is to discover an operational description of V ; that is, a description that
can be used by the checkers-playing program to evaluate states and select moves within realistic time
bounds.
• It may be very difficult in general to learn such an operational form of V perfectly. We expect learning
algorithms to acquire only some approximation ^V to the target function V.
1.3 Choosing a representation for the Target Function
• Now that we have specified the ideal target function V, we must choose a representation that the learning program
will use to describe the function ^V that it will learn.
• As with earlier design choices, we again have many options.
• We could, for example, allow the program to represent using a large table with a distinct entry specifying the value
for each distinct board state. Or we could allow it to represent using a collection of rules that match against
features of the board state, or a quadratic polynomial function of predefined board features, or an artificial neural
network.
• In general, this choice of representation involves a crucial tradeoff. On one hand, we wish to pick a very expressive
representation to allow representing as close an approximation as possible to the ideal target function V.
• On the other hand, the more expressive the representation, the more training data the program will require in
order to choose among the alternative hypotheses it can represent.
• To keep the discussion brief, let us choose a simple representation: for any given board state, the function ^V will
be calculated as a linear combination of the following board features:
• x1(b) — number of black pieces on board b
• x2(b) — number of red pieces on b
• x3(b) — number of black kings on b
• x4(b) — number of red kings on b
• x5(b) — number of red pieces threatened by black (i.e., which can be taken on black’s next turn)
• x6(b) — number of black pieces threatened by red
^V(b) = w0 + w1·x1(b) + w2·x2(b) + w3·x3(b) + w4·x4(b) + w5·x5(b) + w6·x6(b)
• Where w0 through w6 are numerical coefficients or weights to be obtained by a learning algorithm.
• Weights w1 to w6 will determine the relative importance of different board features.
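The linear evaluation function above is easy to express in code. In the sketch below, the example feature values and weights are arbitrary assumptions for illustration (a real learner would obtain the weights from training):

```python
# A sketch of the linear board-evaluation function ^V described above.

def v_hat(features, weights):
    """features = (x1..x6), weights = (w0..w6); returns w0 + sum(wi * xi)."""
    w0, *ws = weights
    return w0 + sum(w * x for w, x in zip(ws, features))

# Example board: 3 black pieces, 1 black king, nothing else on the board,
# with illustrative weights that favor black material.
features = (3, 0, 1, 0, 0, 0)                       # x1..x6
weights = (0.0, 2.0, -2.0, 3.0, -3.0, 1.0, -1.0)    # w0..w6
print(v_hat(features, weights))                     # 2*3 + 3*1 = 9.0
```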
Specification of the Machine Learning Problem at this time
• Till now we worked on choosing the type of training experience, choosing the target function and its
representation. The checkers learning task can be summarized as below.
• Task T : Play Checkers
• Performance Measure : % of games won in world tournament
• Training Experience E : opportunity to play against itself
• Target Function : V : Board → R
• Target Function Representation : ^V(b) = w0 + w1·x1(b) + w2·x2(b) + w3·x3(b) + w4·x4(b) +
w5·x5(b) + w6·x6(b)
• The first three items above correspond to the specification of the learning task, whereas the final two items
constitute design choices for the implementation of the learning program.
1.4 Choosing an approximation algorithm for the Target Function
• Generating training data — To train our learning program, we need a set of training data, each describing a
specific board state b and the training value V_train (b) for b. Each training example is an ordered pair
<b,V_train(b)>
• For example, a training example may be <(x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0), +100>. This is an
example where black has won the game, since x2 = 0 means red has no remaining pieces. However, such clean
values of V_train(b) can be obtained only for board states b that are a clear win, loss or draw.
• In the above case, assigning a training value V_train(b) for the specific boards b that are a clear win, loss or draw
is direct, as they are direct training experience.
• But in the case of indirect training experience, assigning a training value V_train(b) for the intermediate
boards is difficult. In such a case, the training values are updated using temporal difference learning.
• Temporal difference (TD) learning is a concept central to reinforcement learning, in which learning
happens through the iterative correction of your estimated returns towards a more accurate target return.
• Let Successor(b) denote the next board state following b for which it is again the program’s turn to move,
and let ^V be the learner’s current approximation to V. Using this information, assign the training value
V_train(b) for any intermediate board state b as follows: V_train(b) ← ^V(Successor(b))
Adjusting the weights
• Now it is time to define the learning algorithm for choosing the weights that best fit the set of training
examples.
• One common approach is to define the best hypothesis as the one that minimizes the squared error E
between the training values and the values predicted by the hypothesis ^V:

E ≡ Σ ( V_train(b) − ^V(b) )², summed over the training examples <b, V_train(b)>
• The learning algorithm should incrementally refine the weights as more training examples become
available, and it needs to be robust to errors in the training data. The Least Mean Square (LMS) training
rule is one training algorithm that adjusts the weights a small amount in the direction that reduces the error.
• The LMS algorithm is defined as follows. For each training example <b, V_train(b)>:
1. Use the current weights to calculate ^V(b)
2. For each weight wi, update it as wi ← wi + η ( V_train(b) − ^V(b) ) xi
where η is a small constant (e.g., 0.1) that moderates the size of the weight update.
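The LMS rule applied to the linear ^V can be sketched as below. The single training pair, its feature values, and η = 0.1 are illustrative assumptions; a real learner would cycle over many board states:

```python
# A sketch of the LMS weight update for the linear evaluation function ^V:
# wi <- wi + eta * (V_train(b) - ^V(b)) * xi, with x0 = 1 for the bias w0.

ETA = 0.1   # small learning-rate constant

def v_hat(features, weights):
    w0, *ws = weights
    return w0 + sum(w * x for w, x in zip(ws, features))

def lms_update(weights, features, v_train):
    """One LMS step over a single training example <b, V_train(b)>."""
    error = v_train - v_hat(features, weights)
    xs = (1,) + tuple(features)          # x0 = 1 handles the bias weight w0
    return [w + ETA * error * x for w, x in zip(weights, xs)]

weights = [0.0] * 7
for _ in range(100):                     # repeated passes shrink the error
    weights = lms_update(weights, (3, 0, 1, 0, 0, 0), 100.0)
print(round(v_hat((3, 0, 1, 0, 0, 0), weights), 2))   # 100.0
```

Each step moves the prediction a small amount toward the training value, which is exactly the incremental, error-tolerant behavior asked for above.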
1.5 Final Design for Checkers Learning system
• The final design of our checkers learning system can be naturally described by four distinct program modules
that represent the central components in many learning systems.
1. The performance System — Takes a new board as input and outputs a trace of the game it played against
itself.
2. The Critic — Takes the trace of a game as an input and outputs a set of training examples of the target
function.
3. The Generalizer — Takes training examples as input and outputs a hypothesis that estimates the target
function. Good generalization to new cases is crucial.
4. The Experiment Generator — Takes the current hypothesis (currently learned function) as input and outputs
a new problem (an initial board state) for the performance system to explore.
PERSPECTIVES AND ISSUES IN MACHINE LEARNING
Perspectives in Machine Learning
• One useful perspective on machine learning is that it involves searching a very large space of possible
hypotheses to determine one that best fits the observed data and any prior knowledge held by the learner.
• For example, consider the space of hypotheses that could in principle be output by the above checkers
learner.
• This hypothesis space consists of all evaluation functions that can be represented by some choice of values
for the weights w0 through w6.
• The learner's task is thus to search through this vast space to locate the hypothesis that is most consistent
with the available training examples.
• The Least Mean Square (LMS) algorithm for fitting weights achieves this goal by iteratively tuning the
weights, adding a correction to each weight each time the hypothesized evaluation function predicts a value
that differs from the training value.
• This algorithm works well when the hypothesis representation considered by the learner defines a
continuously parameterized space of potential hypotheses.
• Many learning algorithms search a hypothesis space defined by some underlying representation (e.g., linear
functions, logical descriptions, decision trees, artificial neural networks).
• These different hypothesis representations are appropriate for learning different kinds of target functions.
• For each of these hypothesis representations, the corresponding learning algorithm takes advantage of a
different underlying structure to organize the search through the hypothesis space.
Issues in Machine Learning
• Our checkers example raises a number of generic questions about machine learning. The field of machine
learning, and much of this book, is concerned with answering questions such as the following:
• What algorithms exist for learning general target functions from specific training examples? In what settings
will particular algorithms converge to the desired function, given sufficient training data? Which algorithms
perform best for which types of problems and representations?
• How much training data is sufficient? What general bounds can be found to relate the confidence in learned
hypotheses to the amount of training experience and the character of the learner's hypothesis space?
• When and how can prior knowledge held by the learner guide the process of generalizing from examples?
Can prior knowledge be helpful even when it is only approximately correct?
• What is the best strategy for choosing a useful next training experience, and how does the choice of this
strategy alter the complexity of the learning problem?
• What is the best way to reduce the learning task to one or more function approximation problems? Put
another way, what specific functions should the system attempt to learn? Can this process itself be
automated?
• How can the learner automatically alter its representation to improve its ability to represent and learn the
target function?
Concept Learning
• Inducing general functions from specific training examples is a main
issue of machine learning.
• Concept Learning: Acquiring the definition of a general category from
given sample positive and negative training examples of the category.
• Concept Learning can be seen as a problem of searching through a
predefined space of potential hypotheses for the hypothesis that
best fits the training examples.
• The hypothesis space has a general-to-specific ordering of hypotheses,
and the search can be efficiently organized by taking advantage of a
naturally occurring structure over the hypothesis space.

Machine Learning 1
Concept Learning
• A Formal Definition for Concept Learning: Inferring a boolean-valued
function from training examples of its input and output.
• An example of concept learning is learning the bird-concept from given
examples of birds (positive examples) and non-birds (negative examples).
• We are trying to learn the definition of a concept from given examples.

A Concept Learning Task – Enjoy Sport
Training Examples

Example Sky AirTemp Humidity Wind Water Forecast EnjoySport

1 Sunny Warm Normal Strong Warm Same YES

2 Sunny Warm High Strong Warm Same YES

3 Rainy Cold High Strong Warm Change NO

4 Sunny Warm High Strong Warm Change YES

(The first six columns are the attributes; EnjoySport is the target concept.)

• A set of example days, and each is described by six attributes.


• The task is to learn to predict the value of EnjoySport for an arbitrary day,
based on the values of its attributes.
EnjoySport – Hypothesis Representation
• Each hypothesis consists of a conjunction of constraints on the
instance attributes.
• Each hypothesis will be a vector of six constraints, specifying the values
of the six attributes
– (Sky, AirTemp, Humidity, Wind, Water, and Forecast).
• Each attribute will be:
? - indicating any value is acceptable for the attribute (don’t care)
single value – specifying a single required value (ex. Warm) (specific)
0 - indicating no value is acceptable for the attribute (no value)

Hypothesis Representation
• A hypothesis:
Sky AirTemp Humidity Wind Water Forecast
< Sunny, ? , ? , Strong , ? , Same >
• The most general hypothesis – that every day is a positive example
<?, ?, ?, ?, ?, ?>
• The most specific hypothesis – that no day is a positive example
<0, 0, 0, 0, 0, 0>
• EnjoySport concept learning task requires learning the sets of days for
which EnjoySport=yes, describing this set by a conjunction of
constraints over the instance attributes.
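The three constraint forms above can be sketched as a small matching predicate (an illustrative helper, not from the slides): "?" accepts any value, "0" accepts none, and a literal must match exactly.

```python
# A sketch of the hypothesis representation: does an instance satisfy a
# hypothesis? '?' matches anything, a literal must match exactly, and '0'
# (or any other mismatch) fails.

def satisfies(hypothesis, instance):
    """Return True iff the instance is classified positive by the hypothesis."""
    return all(h == "?" or h == x for h, x in zip(hypothesis, instance))

h = ("Sunny", "?", "?", "Strong", "?", "Same")
day = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")
print(satisfies(h, day))           # True
print(satisfies(("0",) * 6, day))  # False: most specific, no day is positive
print(satisfies(("?",) * 6, day))  # True: most general, every day is positive
```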

EnjoySport Concept Learning Task
Given
– Instances X : set of all possible days, each described by the attributes
• Sky – (values: Sunny, Cloudy, Rainy)
• AirTemp – (values: Warm, Cold)
• Humidity – (values: Normal, High)
• Wind – (values: Strong, Weak)
• Water – (values: Warm, Cold)
• Forecast – (values: Same, Change)
– Target Concept (Function) c : EnjoySport : X → {0, 1}
– Hypotheses H : Each hypothesis is described by a conjunction of constraints on
the attributes.
– Training Examples D : positive and negative examples of the target function
Determine
– A hypothesis h in H such that h(x) = c(x) for all x in D.
The Inductive Learning Hypothesis
• Although the learning task is to determine a hypothesis h identical to the
target concept c over the entire set of instances X, the only information
available about c is its value over the training examples.
– Inductive learning algorithms can at best guarantee that the output hypothesis fits the target
concept over the training data.
– Lacking any further information, our assumption is that the best hypothesis regarding
unseen instances is the hypothesis that best fits the observed training data. This is the
fundamental assumption of inductive learning.

• The Inductive Learning Hypothesis - Any hypothesis found to approximate
the target function well over a sufficiently large set of training examples
will also approximate the target function well over other unobserved
examples.

Concept Learning As Search
• Concept learning can be viewed as the task of searching through a large
space of hypotheses implicitly defined by the hypothesis representation.
• The goal of this search is to find the hypothesis that best fits the training
examples.
• By selecting a hypothesis representation, the designer of the learning
algorithm implicitly defines the space of all hypotheses that the program
can ever represent and therefore can ever learn.

Enjoy Sport - Hypothesis Space
• Sky has 3 possible values, and the other 5 attributes have 2 possible values each.
• There are 96 (= 3·2·2·2·2·2) distinct instances in X.
• There are 5120 (= 5·4·4·4·4·4) syntactically distinct hypotheses in H.
– Two more values for each attribute: ? and 0
• Every hypothesis containing one or more 0 symbols represents the
empty set of instances; that is, it classifies every instance as negative.
• There are 973 (= 1 + 4·3·3·3·3·3) semantically distinct hypotheses in H.
– Only one more value for each attribute (?), plus one hypothesis representing the empty set of
instances.
• Although EnjoySport has small, finite hypothesis space, most learning
tasks have much larger (even infinite) hypothesis spaces.
– We need efficient search algorithms on the hypothesis spaces.
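The counts above can be verified with a few lines of arithmetic:

```python
# Instance- and hypothesis-space sizes for EnjoySport:
# Sky has 3 values; the other five attributes have 2 values each.

values = [3, 2, 2, 2, 2, 2]

instances = 1
for v in values:
    instances *= v          # distinct instances |X|

syntactic = 1
for v in values:
    syntactic *= v + 2      # each attribute also allows '?' and '0'

semantic = 1
for v in values:
    semantic *= v + 1       # only '?' extra per attribute...
semantic += 1               # ...plus the single all-'0' (empty-set) hypothesis

print(instances, syntactic, semantic)   # 96 5120 973
```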

General-to-Specific Ordering of Hypotheses
• Many algorithms for concept learning organize the search through the hypothesis
space by relying on a general-to-specific ordering of hypotheses.
• By taking advantage of this naturally occurring structure over the hypothesis space, we
can design learning algorithms that exhaustively search even infinite hypothesis spaces
without explicitly enumerating every hypothesis.

• Consider two hypotheses:
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)

• Now consider the sets of instances that are classified positive by h1 and by h2.
– Because h2 imposes fewer constraints on the instance, it classifies more instances as
positive.
– In fact, any instance classified positive by h1 will also be classified positive by h2.
– Therefore, we say that h2 is more general than h1.

More-General-Than Relation
• For any instance x in X and hypothesis h in H, we say that x satisfies h
if and only if h(x) = 1.

• More-General-Than-Or-Equal Relation:
Let h1 and h2 be two boolean-valued functions defined over X.
Then h1 is more-general-than-or-equal-to h2 (written h1 ≥ h2)
if and only if any instance that satisfies h2 also satisfies h1.

• h1 is more-general-than h2 (h1 > h2) if and only if h1 ≥ h2 is true and
h2 ≥ h1 is false. We also say h2 is more-specific-than h1.
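For hypotheses in this conjunctive representation, the relation can be checked attribute by attribute. This is an illustrative sketch, not from the slides; note that a hypothesis containing a 0 constraint matches no instance at all, so the relation holds vacuously against it:

```python
# A sketch of the more-general-than-or-equal relation h1 >= h2 for
# conjunctive hypotheses: every instance satisfying h2 must satisfy h1.

def more_general_or_equal(h1, h2):
    if "0" in h2:   # h2 matches no instance, so the relation holds vacuously
        return True
    # position-wise: h1's constraint must be '?' or equal to h2's value
    return all(a == "?" or a == b for a, b in zip(h1, h2))

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")
print(more_general_or_equal(h2, h1))   # True: h2 imposes fewer constraints
print(more_general_or_equal(h1, h2))   # False
```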

More-General-Relation
• h2 > h1 and h2 > h3
• But there is no more-general relation between h1 and h3

FIND-S Algorithm
• FIND-S Algorithm starts from the most specific hypothesis and
generalizes it by considering only positive examples.
• FIND-S algorithm ignores negative examples.
– As long as the hypothesis space contains a hypothesis that describes the true target concept,
and the training data contains no errors, ignoring negative examples does not cause any
problems.
• FIND-S algorithm finds the most specific hypothesis within H that is
consistent with the positive training examples.
– The final hypothesis will also be consistent with negative examples if the correct target
concept is in H, and the training examples are correct.

FIND-S Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
For each attribute constraint ai in h
If the constraint ai is satisfied by x
Then do nothing
Else replace ai in h by the next more general constraint that is
satisfied by x
3. Output hypothesis h
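The steps above can be sketched in runnable form. The training data used here is the standard EnjoySport example (assumed, since the example slide's table is not reproduced in this text):

```python
def find_s(examples, n_attributes=6):
    """FIND-S: start with the most specific hypothesis and minimally
    generalize it on each positive example; negatives are ignored."""
    h = ["0"] * n_attributes          # most specific hypothesis in H
    for x, label in examples:
        if label != "Yes":            # FIND-S ignores negative examples
            continue
        for i, value in enumerate(x):
            if h[i] == "0":           # first positive example: copy its values
                h[i] = value
            elif h[i] != value:       # conflict: generalize the constraint to "?"
                h[i] = "?"
    return h

# Assumed EnjoySport training data (standard textbook example)
data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]
print(find_s(data))  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```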

FIND-S Algorithm - Example

Unanswered Questions by FIND-S Algorithm
• Has FIND-S converged to the correct target concept?
– Although FIND-S will find a hypothesis consistent with the training data, it has no way to
determine whether it has found the only hypothesis in H consistent with the data (i.e., the
correct target concept), or whether there are many other consistent hypotheses as well.
– We would prefer a learning algorithm that could determine whether it had converged and,
if not, at least characterize its uncertainty regarding the true identity of the target concept.

• Why prefer the most specific hypothesis?


– In case there are multiple hypotheses consistent with the training examples, FIND-S will
find the most specific.
– It is unclear whether we should prefer this hypothesis over, say, the most general, or some
other hypothesis of intermediate generality.

Unanswered Questions by FIND-S Algorithm
• Are the training examples consistent?
– In most practical learning problems there is some chance that the training examples will
contain at least some errors or noise.
– Such inconsistent sets of training examples can severely mislead FIND-S, given the fact
that it ignores negative examples.
– We would prefer an algorithm that could at least detect when the training data is
inconsistent and, preferably, accommodate such errors.

• What if there are several maximally specific consistent hypotheses?


– In the hypothesis language H for the EnjoySport task, there is always a unique, most
specific hypothesis consistent with any set of positive examples.
– However, for other hypothesis spaces there can be several maximally specific hypotheses
consistent with the data.
– In this case, FIND-S must be extended to allow it to backtrack on its choices of how to
generalize the hypothesis, to accommodate the possibility that the target concept lies along
a different branch of the partial ordering than the branch it has selected.

Candidate-Elimination Algorithm
• FIND-S outputs a hypothesis from H that is consistent with the training
examples, but this is just one of many hypotheses from H that might fit
the training data equally well.
• The key idea in the Candidate-Elimination algorithm is to output a
description of the set of all hypotheses consistent with the training
examples.
– Candidate-Elimination algorithm computes the description of this set without explicitly
enumerating all of its members.
– This is accomplished by using the more-general-than partial ordering and maintaining a
compact representation of the set of consistent hypotheses.

Consistent Hypothesis

• Note the key difference between this definition of consistent and the earlier definition of satisfies:


• An example x is said to satisfy hypothesis h when h(x) = 1,
regardless of whether x is a positive or negative example of
the target concept.
• However, whether such an example is consistent with h depends
on the target concept, and in particular, whether h(x) = c(x).

Version Spaces
• The Candidate-Elimination algorithm represents the set of
all hypotheses consistent with the observed training examples.
• This subset of all hypotheses is called the version space with
respect to the hypothesis space H and the training examples D,
because it contains all plausible versions of the target concept.

List-Then-Eliminate Algorithm
• List-Then-Eliminate algorithm initializes the version space to contain all
hypotheses in H, then eliminates any hypothesis found inconsistent with
any training example.
• The version space of candidate hypotheses thus shrinks as more
examples are observed, until ideally just one hypothesis remains that is
consistent with all the observed examples.
– Presumably, this is the desired target concept.
– If insufficient data is available to narrow the version space to a single hypothesis, then the
algorithm can output the entire set of hypotheses consistent with the observed data.
• List-Then-Eliminate algorithm can be applied whenever the hypothesis
space H is finite.
– It has many advantages, including the fact that it is guaranteed to output all hypotheses
consistent with the training data.
– Unfortunately, it requires exhaustively enumerating all hypotheses in H - an unrealistic
requirement for all but the most trivial hypothesis spaces.

List-Then-Eliminate Algorithm
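Because the conjunctive hypothesis space for EnjoySport is finite, List-Then-Eliminate can be run directly. A brute-force sketch, assuming the standard EnjoySport attribute value sets and training data (the always-negative all-"0" hypothesis is omitted since it cannot be consistent with any positive example):

```python
from itertools import product

def satisfies(x, h):
    return all(c == "?" or c == v for c, v in zip(h, x))

def consistent(h, examples):
    return all(satisfies(x, h) == (label == "Yes") for x, label in examples)

# Assumed EnjoySport attribute value sets
values = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
          ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]

# 1. Initialize the version space to every hypothesis in H
#    (each attribute is either "?" or one concrete value): 4*3*3*3*3*3 = 972
H = list(product(*[("?",) + v for v in values]))

data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]

# 2. Eliminate every hypothesis inconsistent with some training example
version_space = [h for h in H if consistent(h, data)]
print(len(version_space))  # 6 hypotheses remain
```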

Compact Representation of Version Spaces
• A version space can be represented with its general and specific
boundary sets.
• The Candidate-Elimination algorithm represents the version space
by storing only its most general members G and its most specific
members S.
• Given only these two sets S and G, it is possible to enumerate all
members of a version space by generating hypotheses that lie between
these two sets in general-to-specific partial ordering over hypotheses.
• Every member of the version space lies between these boundaries

where x ≥ y means x is more-general-than-or-equal-to y.


Example Version Space

• A version space with its general and specific boundary sets.


• The version space includes all six hypotheses shown here,
but can be represented more simply by S and G.
Candidate-Elimination Algorithm
• The Candidate-Elimination algorithm computes the version space containing all
hypotheses from H that are consistent with an observed sequence of training examples.
• It begins by initializing the version space to the set of all hypotheses in H; that is, by
initializing the G boundary set to contain the most general hypothesis in H
G0 ← { <?, ?, ?, ?, ?, ?> }
and initializing the S boundary set to contain the most specific hypothesis
S0 ← { <0, 0, 0, 0, 0, 0> }
• These two boundary sets delimit the entire hypothesis space, because every other
hypothesis in H is both more general than S0 and more specific than G0.
• As each training example is considered, the S and G boundary sets are generalized and
specialized, respectively, to eliminate from the version space any hypotheses found
inconsistent with the new training example.
• After all examples have been processed, the computed version space contains all the
hypotheses consistent with these examples and only these hypotheses.

Candidate-Elimination Algorithm
• Initialize G to the set of maximally general hypotheses in H
• Initialize S to the set of maximally specific hypotheses in H
• For each training example d, do
– If d is a positive example
• Remove from G any hypothesis inconsistent with d
• For each hypothesis s in S that is not consistent with d
– Remove s from S
– Add to S all minimal generalizations h of s such that
» h is consistent with d, and some member of G is more general than h
– Remove from S any hypothesis that is more general than another hypothesis in S
– If d is a negative example
• Remove from S any hypothesis inconsistent with d
• For each hypothesis g in G that is not consistent with d
– Remove g from G
– Add to G all minimal specializations h of g such that
» h is consistent with d, and some member of S is more specific than h
– Remove from G any hypothesis that is less general than another hypothesis in G
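The steps above can be sketched as a runnable implementation for conjunctive hypotheses. This is an illustrative sketch, not a definitive implementation; the EnjoySport attribute value sets and training data are assumed from the standard example:

```python
def satisfies(x, h):
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    return all(c1 == "?" or c2 == "0" or c1 == c2 for c1, c2 in zip(h1, h2))

def min_generalizations(s, x):
    # For conjunctive hypotheses the minimal generalization covering x is unique
    return [tuple(v if c in ("0", v) else "?" for c, v in zip(s, x))]

def min_specializations(g, x, values):
    # Replace one "?" in g by any value that excludes the negative instance x
    results = []
    for i, c in enumerate(g):
        if c == "?":
            results += [g[:i] + (v,) + g[i + 1:] for v in values[i] if v != x[i]]
    return results

def candidate_elimination(examples, values):
    n = len(values)
    S, G = [("0",) * n], [("?",) * n]
    for x, label in examples:
        if label == "Yes":
            G = [g for g in G if satisfies(x, g)]           # drop inconsistent g
            for s in [s for s in S if not satisfies(x, s)]:
                S.remove(s)
                S += [h for h in min_generalizations(s, x)  # h covers x by construction
                      if any(more_general_or_equal(g, h) for g in G)]
            S = [s for s in S                                # keep only minimal members
                 if not any(s != t and more_general_or_equal(s, t) for t in S)]
        else:
            S = [s for s in S if not satisfies(x, s)]       # drop inconsistent s
            for g in [g for g in G if satisfies(x, g)]:
                G.remove(g)
                G += [h for h in min_specializations(g, x, values)
                      if any(more_general_or_equal(h, s) for s in S)]
            G = [g for g in G                                # keep only maximal members
                 if not any(g != t and more_general_or_equal(t, g) for t in G)]
    return S, G

values = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
          ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]
data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]
S, G = candidate_elimination(data, values)
print(S)  # [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
print(G)  # [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]
```

Running it on the four EnjoySport examples reproduces the S4 and G4 boundary sets discussed in the following slides.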

Candidate-Elimination Algorithm - Example
• S0 and G0 are the initial
boundary sets corresponding to
the most specific and most
general hypotheses.

• Training examples 1 and 2 force the S boundary to become
more general.

• They have no effect on the G boundary.

Candidate-Elimination Algorithm - Example

Candidate-Elimination Algorithm - Example
• Given that there are six attributes that could be specified to specialize
G2, why are there only three new hypotheses in G3?
• For example, the hypothesis h = <?, ?, Normal, ?, ?, ?> is a minimal
specialization of G2 that correctly labels the new example as a negative
example, but it is not included in G3.
– The reason this hypothesis is excluded is that it is inconsistent with S2.
– The algorithm determines this simply by noting that h is not more general than the current
specific boundary, S2.
• In fact, the S boundary of the version space forms a summary of the
previously encountered positive examples that can be used to determine
whether any given hypothesis is consistent with these examples.
• The G boundary summarizes the information from previously
encountered negative examples. Any hypothesis more specific than G is
assured to be consistent with past negative examples

Candidate-Elimination Algorithm - Example

Candidate-Elimination Algorithm - Example
• The fourth training example further generalizes the S boundary of the
version space.
• It also results in removing one member of the G boundary, because this
member fails to cover the new positive example.
– To understand the rationale for this step, it is useful to consider why the offending
hypothesis must be removed from G.
– Notice it cannot be specialized, because specializing it would not make it cover the new
example.
– It also cannot be generalized, because by the definition of G, any more general hypothesis
will cover at least one negative training example.
– Therefore, the hypothesis must be dropped from the G boundary, thereby removing an
entire branch of the partial ordering from the version space of hypotheses remaining under
consideration

Candidate-Elimination Algorithm – Example
Final Version Space

Candidate-Elimination Algorithm – Example
Final Version Space
• After processing these four examples, the boundary sets S4 and G4
delimit the version space of all hypotheses consistent with the set of
incrementally observed training examples.
• This learned version space is independent of the sequence in which the
training examples are presented (because in the end it contains all
hypotheses consistent with the set of examples).
• As further training data is encountered, the S and G boundaries will
move monotonically closer to each other, delimiting a smaller and
smaller version space of candidate hypotheses.

Will Candidate-Elimination Algorithm
Converge to Correct Hypothesis?
• The version space learned by the Candidate-Elimination Algorithm will
converge toward the hypothesis that correctly describes the target
concept, provided
– There are no errors in the training examples, and
– there is some hypothesis in H that correctly describes the target concept.
• What will happen if the training data contains errors?
– The algorithm removes the correct target concept from the version space.
– S and G boundary sets eventually converge to an empty version space if sufficient
additional training data is available.
– Such an empty version space indicates that there is no hypothesis in H consistent with all
observed training examples.
• A similar symptom will appear when the training examples are correct,
but the target concept cannot be described in the hypothesis
representation.
– e.g., if the target concept is a disjunction of feature attributes and the hypothesis space
supports only conjunctive descriptions
What Training Example Should the Learner Request Next?

• We have assumed that training examples are provided to the learner by
some external teacher.
• Suppose instead that the learner is allowed to conduct experiments in
which it chooses the next instance, then obtains the correct classification
for this instance from an external oracle (e.g., nature or a teacher).
– This scenario covers situations in which the learner may conduct experiments in nature or in
which a teacher is available to provide the correct classification.
– We use the term query to refer to such instances constructed by the learner, which are then
classified by an external oracle.
• Considering the version space learned from the four training examples
of the EnjoySport concept.
– What would be a good query for the learner to pose at this point?
– What is a good query strategy in general?

What Training Example Should the Learner Request Next?

• The learner should attempt to discriminate among the alternative competing
hypotheses in its current version space.
– Therefore, it should choose an instance that would be classified positive by some of these
hypotheses, but negative by others.
– One such instance is <Sunny, Warm, Normal, Light, Warm, Same>
– This instance satisfies three of the six hypotheses in the current version space.
– If the trainer classifies this instance as a positive example, the S boundary of the version
space can then be generalized.
– Alternatively, if the trainer indicates that this is a negative example, the G boundary can
then be specialized.
• In general, the optimal query strategy for a concept learner is to generate instances that
satisfy exactly half the hypotheses in the current version space.
• When this is possible, the size of the version space is reduced by half with each new
example, and the correct target concept can therefore be found with only ⌈log2 |VS|⌉
experiments.
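The query suggested above can be checked directly against the six-hypothesis version space. The hypotheses listed here are assumed from the final EnjoySport version space shown earlier:

```python
def satisfies(x, h):
    return all(c == "?" or c == v for c, v in zip(h, x))

# Assumed: the six hypotheses of the final EnjoySport version space
VS = [("Sunny", "Warm", "?", "Strong", "?", "?"),
      ("Sunny", "?",    "?", "Strong", "?", "?"),
      ("Sunny", "Warm", "?", "?",      "?", "?"),
      ("?",     "Warm", "?", "Strong", "?", "?"),
      ("Sunny", "?",    "?", "?",      "?", "?"),
      ("?",     "Warm", "?", "?",      "?", "?")]

# The query from the slide: Wind = Light rules out the "Strong" hypotheses
query = ("Sunny", "Warm", "Normal", "Light", "Warm", "Same")
positives = sum(satisfies(query, h) for h in VS)
print(positives)  # 3 of the 6 hypotheses classify the query as positive
```

A positive answer from the oracle would then generalize S; a negative answer would specialize G, halving the version space either way.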

How Can Partially Learned Concepts Be Used?
• Even though the learned version space still contains multiple
hypotheses, indicating that the target concept has not yet been fully
learned, it is possible to classify certain examples with the same degree
of confidence as if the target concept had been uniquely identified.

• Let us assume that the following are new instances to be classified:

How Can Partially Learned Concepts Be Used?
• Instance A is classified as a positive instance by every hypothesis in the current
version space.
• Because the hypotheses in the version space unanimously agree that this is a positive
instance, the learner can classify instance A as positive with the same confidence it
would have if it had already converged to the single, correct target concept.
• Regardless of which hypothesis in the version space is eventually found to be the
correct target concept, it is already clear that it will classify instance A as a positive
example.
• Notice furthermore that we need not enumerate every hypothesis in the version space
in order to test whether each classifies the instance as positive.
– This condition will be met if and only if the instance satisfies every member of S.
– The reason is that every other hypothesis in the version space is at least as general as some
member of S.
– By our definition of more-general-than, if the new instance satisfies all members of S it
must also satisfy each of these more general hypotheses.

How Can Partially Learned Concepts Be Used?
• Instance B is classified as a negative instance by every hypothesis in
the version space.
– This instance can therefore be safely classified as negative, given the partially learned
concept.
– An efficient test for this condition is that the instance satisfies none of the members of G.
• Half of the version space hypotheses classify instance C as positive and
half classify it as negative.
– Thus, the learner cannot classify this example with confidence until further training
examples are available.
• Instance D is classified as positive by two of the version space
hypotheses and negative by the other four hypotheses.
– In this case we have less confidence in the classification than in the unambiguous cases of
instances A and B.
– Still, the vote is in favor of a negative classification, and one approach we could take would
be to output the majority vote, perhaps with a confidence rating indicating how close the
vote was.
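The boundary-set tests described above (satisfies every member of S ⇒ positive; satisfies no member of G ⇒ negative) can be sketched as follows. The boundary sets and instances A and B are assumed from the standard EnjoySport illustration, since the slide's table is not reproduced here:

```python
def satisfies(x, h):
    return all(c == "?" or c == v for c, v in zip(h, x))

def classify(x, S, G):
    """Classify with a partially learned concept: positive if x satisfies
    every member of S, negative if it satisfies no member of G,
    otherwise leave the instance unclassified."""
    if all(satisfies(x, s) for s in S):
        return "Yes"
    if not any(satisfies(x, g) for g in G):
        return "No"
    return "Unknown"

# Assumed final EnjoySport boundary sets S4 and G4
S = [("Sunny", "Warm", "?", "Strong", "?", "?")]
G = [("Sunny", "?", "?", "?", "?", "?"), ("?", "Warm", "?", "?", "?", "?")]

A = ("Sunny", "Warm", "Normal", "Strong", "Cool", "Change")
B = ("Rainy", "Cold", "Normal", "Light",  "Warm", "Same")
print(classify(A, S, G))  # Yes: A satisfies every member of S
print(classify(B, S, G))  # No: B satisfies no member of G
```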

Inductive Bias - Fundamental Questions
for Inductive Inference
• The Candidate-Elimination Algorithm will converge toward the true
target concept provided it is given accurate training examples and
provided its initial hypothesis space contains the target concept.

• What if the target concept is not contained in the hypothesis space?


• Can we avoid this difficulty by using a hypothesis space that includes
every possible hypothesis?
• How does the size of this hypothesis space influence the ability of the
algorithm to generalize to unobserved instances?
• How does the size of the hypothesis space influence the number of
training examples that must be observed?

Inductive Bias - A Biased Hypothesis Space
• In EnjoySport example, we restricted the hypothesis space to include only
conjunctions of attribute values.
– Because of this restriction, the hypothesis space is unable to represent even simple
disjunctive target concepts such as "Sky = Sunny or Sky = Cloudy."

• From the first two examples: S2 = <?, Warm, Normal, Strong, Cool, Change>
• This is inconsistent with the third example, and there are no hypotheses consistent
with these three examples
PROBLEM: We have biased the learner to consider only conjunctive hypotheses.
⇒ We require a more expressive hypothesis space.
Inductive Bias - An Unbiased Learner
• The obvious solution to the problem of assuring that the target concept
is in the hypothesis space H is to provide a hypothesis space capable of
representing every teachable concept.
– Every possible subset of the instances X, i.e., the power set of X.

• What is the size of the hypothesis space H (the power set of X) ?


– In EnjoySport, the size of the instance space X is 96.
– The size of the power set of X is 2^|X| ⇒ the size of H is 2^96.
– Our conjunctive hypothesis space is able to represent only 973 of these hypotheses.
⇒ a very biased hypothesis space
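The counts above follow from simple arithmetic over the EnjoySport attributes (Sky has 3 values; the other five attributes have 2 each):

```python
# Instance space: Sky has 3 values, the other five attributes have 2 each
n_instances = 3 * 2 * 2 * 2 * 2 * 2
print(n_instances)       # 96

# Syntactically distinct conjunctive hypotheses: each attribute may also
# be "?" or "0" (the empty constraint)
n_syntactic = 5 * 4 * 4 * 4 * 4 * 4
print(n_syntactic)       # 5120

# Semantically distinct: every hypothesis containing a "0" classifies all
# instances as negative, so they collapse into a single hypothesis
n_semantic = 1 + 4 * 3 * 3 * 3 * 3 * 3
print(n_semantic)        # 973

# An unbiased hypothesis space (the power set of X) is astronomically larger
print(2 ** n_instances)  # 2^96, roughly 7.9 * 10^28
```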

Inductive Bias - An Unbiased Learner : Problem
• Let the hypothesis space H to be the power set of X.
– A hypothesis can be represented with disjunctions, conjunctions, and negations of our
earlier hypotheses.
– The target concept "Sky = Sunny or Sky = Cloudy" could then be described as
<Sunny, ?, ?, ?, ?, ?> ∨ <Cloudy, ?, ?, ?, ?, ?>

NEW PROBLEM: Our concept learning algorithm is now completely
unable to generalize beyond the observed examples.
– Suppose we give three positive examples (x1, x2, x3) and two negative examples (x4, x5) to the learner.
– S : { x1 ∨ x2 ∨ x3 } and G : { ¬(x4 ∨ x5) } ⇒ NO GENERALIZATION
– Therefore, the only examples that will be unambiguously classified by S and G are the
observed training examples themselves.

Inductive Bias –
Fundamental Property of Inductive Inference

• A learner that makes no a priori assumptions regarding the identity
of the target concept has no rational basis for classifying any unseen
instances.

• Inductive Leap: A learner should be able to generalize training data
using prior assumptions in order to classify unseen instances.
• The generalization is known as inductive leap and our prior
assumptions are the inductive bias of the learner.
• The inductive bias (prior assumptions) of the Candidate-Elimination
algorithm is that the target concept can be represented by a conjunction
of attribute values, the target concept is contained in the hypothesis
space, and the training examples are correct.

Inductive Bias – Formal Definition
Inductive Bias:
Consider a concept learning algorithm L for the set of instances X.
Let c be an arbitrary concept defined over X, and
let Dc = {<x , c(x)>} be an arbitrary set of training examples of c.
Let L(xi, Dc) denote the classification assigned to the instance xi by L
after training on the data Dc.
The inductive bias of L is any minimal set of assertions B such that for
any target concept c and corresponding training examples Dc the
following formula holds:

(∀ xi ∈ X) [ (B ∧ Dc ∧ xi) ⊢ L(xi, Dc) ]

where y ⊢ z indicates that z follows deductively from y.
Inductive Bias – Three Learning Algorithms
ROTE-LEARNER: Learning corresponds simply to storing each observed training
example in memory. Subsequent instances are classified by looking them up in
memory. If the instance is found in memory, the stored classification is returned.
Otherwise, the system refuses to classify the new instance.
Inductive Bias: No inductive bias

CANDIDATE-ELIMINATION: New instances are classified only in the case where all
members of the current version space agree on the classification. Otherwise, the
system refuses to classify the new instance.
Inductive Bias: the target concept can be represented in its hypothesis space.

FIND-S: This algorithm, described earlier, finds the most specific hypothesis consistent
with the training examples. It then uses this hypothesis to classify all subsequent
instances.
Inductive Bias: the target concept can be represented in its hypothesis space, and all
instances are negative instances unless the opposite is entailed by its other knowledge.

Concept Learning - Summary
• Concept learning can be seen as a problem of searching through a large
predefined space of potential hypotheses.
• The general-to-specific partial ordering of hypotheses provides a useful
structure for organizing the search through the hypothesis space.
• The FIND-S algorithm utilizes this general-to-specific ordering,
performing a specific-to-general search through the hypothesis space
along one branch of the partial ordering, to find the most specific
hypothesis consistent with the training examples.
• The CANDIDATE-ELIMINATION algorithm utilizes this general-to-
specific ordering to compute the version space (the set of all hypotheses
consistent with the training data) by incrementally computing the sets of
maximally specific (S) and maximally general (G) hypotheses.

Concept Learning - Summary
• Because the S and G sets delimit the entire set of hypotheses consistent
with the data, they provide the learner with a description of its
uncertainty regarding the exact identity of the target concept. This
version space of alternative hypotheses can be examined
– to determine whether the learner has converged to the target concept,
– to determine when the training data are inconsistent,
– to generate informative queries to further refine the version space, and
– to determine which unseen instances can be unambiguously classified based on the partially
learned concept.
• The CANDIDATE-ELIMINATION algorithm is not robust to noisy
data or to situations in which the unknown target concept is not
expressible in the provided hypothesis space.

Concept Learning - Summary
• Inductive learning algorithms are able to classify unseen examples only
because of their implicit inductive bias for selecting one consistent
hypothesis over another.
• If the hypothesis space is enriched to the point where there is a
hypothesis corresponding to every possible subset of instances (the
power set of the instances), this will remove any inductive bias from the
CANDIDATE-ELIMINATION algorithm .
– Unfortunately, this also removes the ability to classify any instance beyond the observed
training examples.
– An unbiased learner cannot make inductive leaps to classify unseen examples.

Linear Discriminant Analysis
• Linear Discriminant Analysis (LDA), also known as Normal Discriminant
Analysis or Discriminant Function Analysis, is a dimensionality
reduction technique commonly used for supervised classification
problems.
• It is used for modelling differences in groups, i.e., separating two or
more classes, by projecting features from a higher-dimensional space
into a lower-dimensional space.
• For example, we have two classes and we need to separate them
efficiently. Classes can have multiple features.
• Using only a single feature to classify them may result in some
overlapping, as shown in the figure below. So, we keep increasing
the number of features for proper classification.
• Example:
Suppose we have two sets of data points belonging to two different
classes that we want to classify.
• As shown in the given 2D graph, when the data points are plotted on
the 2D plane, there’s no straight line that can separate the two
classes of the data points completely.
• Hence, in this case, LDA (Linear Discriminant Analysis) is used which
reduces the 2D graph into a 1D graph in order to maximize the
separability between the two classes.
• Here, Linear Discriminant Analysis uses both the
axes (X and Y) to create a new axis and projects
data onto a new axis in a way to maximize the
separation of the two categories and hence,
reducing the 2D graph into a 1D graph.
• Two criteria are used by LDA to create a new axis:
1. Maximize the distance between the means of the two classes.
2. Minimize the variation within each class.
• In the above graph, it can be seen that a new axis
(in red) is generated and plotted in the 2D graph such
that it maximizes the distance between the means of
the two classes and minimizes the variation within each class.
• In simple terms, this newly generated axis increases the separation
between the data points of the two classes.
• After generating this new axis using the above-mentioned criteria, all
the data points of the classes are plotted on this new axis and are
shown in the figure given below.
• But Linear Discriminant Analysis fails when the means of the distributions are shared, as it becomes
impossible for LDA to find a new axis that makes both classes linearly separable.
• In such cases, we use non-linear discriminant analysis.
Applications:
Face Recognition:
• In the field of Computer Vision, face recognition is a very popular application in which each face is
represented by a very large number of pixel values.
• Linear discriminant analysis (LDA) is used here to reduce the number of features to a more manageable
number before the process of classification.
• Each of the new dimensions generated is a linear combination of pixel values, which form a template. The
linear combinations obtained using Fisher’s linear discriminant are called Fisher’s faces.
Medical:
• In this field, Linear discriminant analysis (LDA) is used to classify the patient's disease state as mild, moderate,
or severe based upon the patient's various parameters and the medical treatment they are undergoing. This
helps doctors to intensify or reduce the pace of the treatment.
Customer Identification:
• Suppose we want to identify the type of customers who are most likely to buy a particular product in a
shopping mall.
• By doing a simple question and answers survey, we can gather all the features of the customers.
• Here, a Linear discriminant analysis will help us to identify and select the features which can describe the
characteristics of the group of customers that are most likely to buy that particular product in the shopping
mall.
Extensions to LDA:
1. Quadratic Discriminant Analysis (QDA): Each class uses its own
estimate of variance (or covariance when there are multiple input
variables).
2. Flexible Discriminant Analysis (FDA): Non-linear combinations of
inputs are used, such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization
into the estimate of the variance (actually covariance), moderating
the influence of different variables on LDA.
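The two criteria described above (maximize the separation of the class means, minimize the within-class variation) are what Fisher's linear discriminant optimizes: the new axis is w ∝ Sw⁻¹(m1 − m2). A self-contained two-class, two-feature sketch, using toy data invented for illustration:

```python
def fisher_lda_direction(class1, class2):
    """Fisher's linear discriminant for two 2-D classes: w = Sw^-1 (m1 - m2),
    where Sw is the total within-class scatter matrix."""
    def mean(points):
        n = len(points)
        return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

    def scatter(points, m):
        # Within-class scatter matrix: sum over (x - m)(x - m)^T
        sxx = sxy = syy = 0.0
        for x, y in points:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
        return [[sxx, sxy], [sxy, syy]]

    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    Sw = [[s1[0][0] + s2[0][0], s1[0][1] + s2[0][1]],
          [s1[1][0] + s2[1][0], s1[1][1] + s2[1][1]]]

    # Invert the 2x2 matrix Sw and apply it to the mean difference
    det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    return [( Sw[1][1] * d[0] - Sw[0][1] * d[1]) / det,
            (-Sw[1][0] * d[0] + Sw[0][0] * d[1]) / det]

# Toy 2-D data for two classes (hypothetical, for illustration only)
c1 = [(4.0, 2.0), (2.0, 4.0), (2.0, 3.0), (3.0, 6.0), (4.0, 4.0)]
c2 = [(9.0, 10.0), (6.0, 8.0), (9.0, 5.0), (8.0, 7.0), (10.0, 8.0)]
w = fisher_lda_direction(c1, c2)

# Projecting onto w reduces the 2-D data to 1-D, as described above
proj1 = [w[0] * x + w[1] * y for x, y in c1]
proj2 = [w[0] * x + w[1] * y for x, y in c2]
print(max(proj1) < min(proj2) or min(proj1) > max(proj2))  # True: classes separate
```

In practice a library routine such as scikit-learn's `LinearDiscriminantAnalysis` would be used; this sketch only illustrates the criterion itself.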
