
Over a period of time we collect data, and that data is given as input to the system. The system learns from this data; once the system has learned, the resulting model can be used to classify new examples.

Over a period of time, the brain is trained to classify the animal as a tiger or a lion.


AI is the broader concept of creating intelligent machines that can simulate human thinking and behavior, whereas machine learning is an application or subset of AI that allows machines to learn from data without being explicitly programmed.

E.g., Siri on the phone.


To work with machine learning projects, we need a huge amount of data, because without data one cannot train ML/AI models. Collecting and preparing the dataset is one of the most crucial parts of creating an ML/AI project.

Sources of datasets:
Kaggle Datasets
UCI Machine Learning Repository
Indian Government datasets
◦ https://data.gov.in/
A Dataset is a table with the data from which the machine learns. The dataset contains the features and the target to predict. When used to induce a model, the dataset is called training data.
An Instance is a row in the dataset. Other names for 'instance' are: (data) point, example, observation.
The Features are the inputs used for prediction or classification. A feature is a column in the dataset.
In the above example, the agent is given two options, i.e., a path with water or a path with fire.
A reinforcement learning algorithm works on a reward system: if the agent takes the fire path, rewards are subtracted, and the agent learns that it should avoid the fire path.
If it chooses the water path, the safe path, points are added to its reward, and the agent learns which paths are safe and which are not.
In short, by leveraging the rewards it obtains, the agent improves its knowledge of the environment in order to select the next action. A minimal sketch of this loop follows.
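A minimal sketch of this reward-driven loop, assuming a toy two-action setting (the action names, reward values, and learning rate below are illustrative, not from the source):

```python
import random

# Toy reward-driven learning: the agent keeps a value estimate per action
# and shifts its preference toward actions that earned higher rewards.
REWARDS = {"water_path": +10, "fire_path": -10}  # assumed reward values
values = {"water_path": 0.0, "fire_path": 0.0}   # initial value estimates
alpha = 0.1                                      # learning rate (assumed)

for episode in range(100):
    # Explore randomly some of the time, otherwise exploit learned values.
    if random.random() < 0.2:
        action = random.choice(list(REWARDS))
    else:
        action = max(values, key=values.get)
    reward = REWARDS[action]
    # Move the estimate a small step toward the observed reward.
    values[action] += alpha * (reward - values[action])

print(values)  # the water path ends up with the higher estimated value
```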
For a learning system to be successful, it must be designed properly and effectively. The following steps are involved in designing a learning system:
1. Choosing the Training Experience
2. Choosing the Target Function
3. Choosing a Representation for the Target Function
4. Choosing a Function Approximation Algorithm
5. The Final Design
Step 1. Choosing the Training Experience
(i) The type of training experience, i.e., direct or indirect.
(ii) The degree to which the learner can control the sequence of training examples.
(iii) How well the training experience represents the distribution of examples over which the final system performance P must be measured.

E.g., learning to drive:
◦ with a trainer's help
◦ with partial help from a trainer
◦ completely on one's own
ChooseMove : B -> M indicates that this function accepts as input any board from the set of legal board states B and produces as output some move from the set of legal moves M.
Alternatively, the target function V : B -> R assigns a real-valued score to any board state from the set of legal board states B.


Learning algorithms often acquire only some approximation to the target function, and for this reason the process of learning the target function is often called function approximation.
w0 through w6 are numerical coefficients, or weights. The weight of each feature determines the importance or weightage of that feature; w0 is a constant (bias) term. The resulting linear form is shown below.
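Combining the weights with the board features x1 through x6 (piece and king counts in Mitchell's checkers example) gives the linear representation of the learned evaluation function:

```latex
\hat{V}(b) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + w_6 x_6
```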

We will use the symbol V̂ ("V cap") to refer to the function that is actually learned by our program, to distinguish it from the ideal target function V.
To train our learning program, we need a set of training data, each describing a
specific board state b and the training value V_train (b) for b. Each training
example is an ordered pair (b,V_train(b))
For example, a training example may be
((x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0), +100)
This is an example where black has won the game, since x2 = 0 means red has no remaining pieces.
Estimating Training Values: at each step we consider the successor state (the board resulting after the opponent's next move).
An effective approach is to use the current approximation V̂ evaluated on the next state:
V_train(b) <- V̂(Successor(b))

There are several algorithms for finding the weights of a linear function. Here we use LMS (Least Mean Squares).
Adjusting the weights -> to minimize the error in prediction, i.e., minimizing the squared error E between the training values and the values predicted by the hypothesis V̂.
For each observed training example, LMS adjusts the weights by a small amount in the direction that reduces the error on that training example, as sketched below.
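A minimal sketch of the LMS update rule, assuming the six numeric board features above and a small learning rate eta (the feature values and eta are illustrative):

```python
# LMS weight update: nudge each weight in the direction that reduces
# the error (V_train(b) - V_hat(b)) on the current training example.
def v_hat(weights, features):
    """Linear evaluation: w0 + w1*x1 + ... + w6*x6."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, v_train, eta=0.01):
    error = v_train - v_hat(weights, features)
    weights[0] += eta * error  # bias term uses a constant input of 1
    for i, x in enumerate(features):
        weights[i + 1] += eta * error * x
    return weights

# Example: one training pair (b, V_train(b)) like the one in the text.
weights = [0.0] * 7
features = [3, 0, 1, 0, 0, 0]  # x1..x6 for a winning board
weights = lms_update(weights, features, v_train=100)
```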
The final design of our checkers learning system can be naturally described by four distinct program modules:
1. Performance system - the module that must solve the given performance task, in this case playing checkers, by using the learned target function(s). It takes an instance of a new problem (a new game) as input and produces a trace of its solution (the game history) as output.
2. Critic - takes as input the history or trace of the game and produces as output a set of training examples of the target function.
3. Generalizer - takes as input the training examples and produces an output hypothesis that is its estimate of the target function.
4. Experiment generator - takes as input the current hypothesis (the currently learned function) and outputs a new problem (i.e., an initial board state) for the performance system to explore.
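A schematic of how these four modules fit together as a loop (the function bodies below are placeholders standing in for the modules described above, not an implementation from the source):

```python
# Skeleton of the four-module checkers learner; each function is a stub
# for the corresponding module.
def experiment_generator(hypothesis):
    return "initial board state"                 # propose a new problem

def performance_system(problem, hypothesis):
    return ["board state 1", "board state 2"]    # play a game, return its trace

def critic(trace):
    return [("board state", 100.0)]              # derive (b, V_train(b)) pairs

def generalizer(training_examples, hypothesis):
    return hypothesis                            # fit/update the hypothesis

hypothesis = [0.0] * 7  # e.g., the weights w0..w6 of the linear function
for _ in range(10):
    problem = experiment_generator(hypothesis)
    trace = performance_system(problem, hypothesis)
    examples = critic(trace)
    hypothesis = generalizer(examples, hypothesis)
```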
What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?
What is the best way to reduce the learning task to one or more function approximation problems?
What specific functions should be learned? Can this process be automated?
How can the learner automatically alter its representation to improve its ability to represent and learn the target function?
Machine learning provides businesses with the knowledge to make more informed, data-driven decisions, faster than traditional approaches. But machine learning presents its own set of challenges.

Here are five common machine learning challenges:
Understanding which processes need automation
Lack of quality data
Inadequate infrastructure
Implementation
Lack of skilled resources
Concepts or categories:
“birds”
“car”
“situations in which I should study more in order to pass the exam”
Concept
Each concept can be viewed as some subset of objects or events defined over a larger set, or as a Boolean-valued function defined over this larger set (e.g., a function defined over all animals whose value is true for birds and false for other animals).

Concept learning: inferring a Boolean-valued function from training examples of its input and output.
Learning
inducing general functions from specific training examples
Concept Learning
acquiring the definition of a general category given a sample of positive and negative training
examples of the category
Target Concept
◦ “days on which Aldo enjoys water sport”
Hypothesis
◦ a vector of 6 constraints on the attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast (EnjoySport being the target)
Many possible representations exist.
Here, h is a conjunction of constraints on attributes. Each constraint can be:
a specific value (e.g., Water = Warm)
don't care (e.g., Water = ?)
no value allowed (e.g., Water = Ø)
In the current example, the target concept corresponds to the value of the attribute EnjoySport (i.e., c(x) = 1 if
EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).
The main goal of this search is to find the hypothesis that best fits the training
examples.
Sizes of the spaces in EnjoySport:
1. 3*2*2*2*2*2 = 96 distinct instances.
2. 5*4*4*4*4*4 = 5120 syntactically distinct hypotheses (each attribute gains the two extra values ? and Ø).
3. 1 + (4*3*3*3*3*3) = 973 semantically distinct hypotheses (every hypothesis containing Ø classifies all instances as negative, so all such hypotheses collapse into a single one, counted outside the product).
After enumerating the syntactically and semantically distinct hypotheses, we look for the best match among them, i.e., the one closest to the learning problem. The counts can be verified as below.
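A quick arithmetic check of the three counts (a verification, not from the source):

```python
# EnjoySport: Sky has 3 values, the other five attributes have 2 each.
instances = 3 * 2**5                  # 96 distinct instances
syntactic = 5 * 4**5                  # 5120: each attribute adds '?' and 'Ø'
semantic = 1 + 4 * 3**5               # 973: all Ø-hypotheses collapse to one
print(instances, syntactic, semantic)  # 96 5120 973
```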
To illustrate the general-to-specific ordering, consider the two hypotheses:
Algorithm               Order                  Strategy        Examples used
FIND-S                  Specific-to-general    Top-down        Positive
LIST-THEN-ELIMINATE     General-to-specific    Bottom-up       Negative
CANDIDATE-ELIMINATION   Bi-directional         Bi-directional  Both
Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
Sunny  Warm     Normal    Strong  Warm   Same      Yes
Sunny  Warm     High      Strong  Warm   Same      Yes
Rainy  Cold     High      Strong  Warm   Change    No
Sunny  Warm     High      Strong  Cool   Change    Yes

The first step of FIND-S is to initialize h to the most specific hypothesis in H.
1. The hypothesis space search is performed by FIND-S.
2. The search begins (h0) with the most specific hypothesis in H, then considers increasingly general hypotheses (h1 through h4) as mandated by the training examples.
3. In the instance space diagram, positive training examples are denoted by "+", negative ones by "-", and instances that have not been presented as training examples are denoted by a solid circle.
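A compact sketch of FIND-S on the EnjoySport data above (a minimal illustration; the attribute encoding is my own):

```python
# FIND-S: start from the most specific hypothesis and minimally
# generalize it on each positive example; negatives are ignored.
data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]

h = ["Ø"] * 6  # most specific hypothesis: no value allowed anywhere
for x, label in data:
    if label != "Yes":
        continue  # FIND-S ignores negative examples
    for i, value in enumerate(x):
        if h[i] == "Ø":
            h[i] = value   # first positive example: copy its values
        elif h[i] != value:
            h[i] = "?"     # conflicting value: generalize to 'don't care'

print(h)  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```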
Hypothesis

A hypothesis literally means an idea or theory that the researcher sets as the goal of a study and examines; it is elevated to a theory when the study's conclusions confirm that the hypothesis is true.

A hypothesis (plural: hypotheses), in a scientific context, is a testable statement about the relationship
between two or more variables or a proposed explanation for some observed phenomenon.
Hypothesis testing is the core of the scientific method.

A statistical hypothesis is an explanation about the relationship between data populations that is interpreted
probabilistically.

A machine learning hypothesis is a candidate model that approximates a target function for mapping
inputs to outputs.

Example:
If you increase the duration of light, (then) corn plants will grow more each day.
This hypothesis establishes two variables: length of light exposure and the rate of plant growth. An experiment could be designed to test whether the rate of growth depends on the duration of light.
c(x) = target function
h(x) = hypothesis
If h(x) = c(x) for every training example, the given hypothesis is consistent.
VS – Version Space
H – Hypothesis space
D – Training samples
The LIST-THEN-ELIMINATE Algorithm
List-Then-Eliminate Algorithm
◦ Applicable only to a finite H
◦ The algorithm is inefficient
◦ Performs poorly with noisy training data (see the sketch after this list)
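A minimal sketch of LIST-THEN-ELIMINATE, assuming H is small enough to enumerate (the hypothesis encoding matches the FIND-S sketch above; Ø-hypotheses are omitted since they classify everything as negative):

```python
from itertools import product

# LIST-THEN-ELIMINATE: enumerate every hypothesis in H and keep only
# those consistent with all training examples (the version space).
def matches(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def consistent(h, data):
    return all(matches(h, x) == (label == "Yes") for x, label in data)

attribute_values = [
    ["Sunny", "Rainy", "?"], ["Warm", "Cold", "?"], ["Normal", "High", "?"],
    ["Strong", "?"], ["Warm", "Cool", "?"], ["Same", "Change", "?"],
]

data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
]
version_space = [h for h in product(*attribute_values) if consistent(h, data)]
print(len(version_space))  # number of hypotheses surviving elimination
```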
The CANDIDATE-ELIMINATION algorithm also uses the version space, and it considers both positive and negative examples. It maintains both a specific boundary S and a general boundary G.
Positive example -> the specific boundary S is generalized.
Negative example -> the general boundary G is made more specific.
Remark 1 Will the Candidate-Elimination Algorithm Converge to the
Correct Hypothesis?
Remark 2 How Can Partially Learned Concepts Be Used?
Remark 3 What Training Example Should the Learner Request Next?
A: classified as positive
B: classified as negative
C: 3 positive, 3 negative
D: 2 positive, 4 negative
Biased hypothesis space:
The target concept is not present, because the space does not consider all types of training examples.
Solution -> include every possible hypothesis.
Inductive learning: deriving rules from examples (e.g., constructing a house without knowing the rules in advance).
Deductive learning: already existing rules are applied to examples (e.g., a civil engineer applying known rules of construction).
Decision tree learning is a method for approximating discrete-valued target
functions, in which the learned function is represented by a decision tree.
The learned tree can also be re-represented as a set of if-then rules to improve human readability.
Decision trees are used for tree-structured classification and regression.
There are two types of nodes:
1. Decision node
2. Leaf node
Decision trees classify instances by sorting them down the tree from the root to some leaf
node, which provides the classification of the instance.
Each node in the tree specifies a test of some attribute of the instance, and each branch
descending from that node corresponds to one of the possible values for this attribute.
A decision tree for the concept PlayTennis.
Node types: root node, internal node, and terminal (leaf) node.
Splitting of branches: two-way splitting, three-way splitting, or multi-way splitting (three or more branches).
This tree classifies Saturday mornings according to whether or not they are suitable for playing tennis.
For example, the above instance would be sorted down the leftmost branch of this decision
tree and would therefore be classified as a negative instance (i.e., the tree predicts that
PlayTennis = no)
Decision trees represent a disjunction of conjunctions of constraints on the attribute values of
instances.
Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the
tree itself to a disjunction of these conjunctions.
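For the PlayTennis tree, this disjunction of conjunctions is the familiar expression given by Mitchell:

```latex
(\textit{Outlook} = \textit{Sunny} \wedge \textit{Humidity} = \textit{Normal})
\vee (\textit{Outlook} = \textit{Overcast})
\vee (\textit{Outlook} = \textit{Rain} \wedge \textit{Wind} = \textit{Weak})
```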
Decision trees are best suited to problems with the following characteristics:
1. Instances are represented by attribute-value pairs. -> Instances are described by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot).
2. The target function has discrete output values. -> It assigns a Boolean classification (e.g., yes or no) to each example.
3. Disjunctive descriptions may be required. -> Decision trees naturally represent disjunctive expressions.
4. The training data may contain errors. -> Decision tree learning methods are robust to errors, both errors in the classifications of the training examples and errors in the attribute values that describe these examples.
5. The training data may contain missing attribute values. -> Decision tree methods can be used even when some training examples have unknown values.
Decision tree learning has therefore been applied to problems such as learning
to classify medical patients by their disease, equipment malfunctions by their
cause, and loan applicants by their likelihood of defaulting on payments.
Such problems, in which the task is to classify examples into one of a discrete set
of possible categories, are often referred to as classification problems.
Two types of trees arise in decision tree learning:
Binary tree – splits the tree based on attributes and values.
Bushy tree – splits the tree based on values.
The splitting condition is chosen based on the Gini index, entropy, and information gain.
We calculate the information gain of each attribute and compute the cost of each candidate split.
Recursive partitioning is also called a greedy algorithm, because at each step it chooses the split that minimizes this cost.
ID3 (Quinlan, 1986) is a basic algorithm for learning decision trees.
Given a training set of examples, the algorithm builds the tree by searching the space of decision trees.
The construction of the tree is top-down, and the algorithm is greedy.
The fundamental question is: which attribute should be tested next? Which question gives us the most information?
Select the best attribute. A descendant node is then created for each possible value of this attribute, and the examples are partitioned according to these values.
The process is repeated for each successor node until all the examples are classified correctly or there are no attributes left, as sketched below.
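A compact recursive sketch of ID3 following these rules (a simplified illustration; the dictionary-based tree representation and helper names are my own):

```python
import math
from collections import Counter

# ID3 sketch: pick the highest-information-gain attribute, partition the
# examples on its values, and recurse until the node is pure or no
# attributes remain.
def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, labels, attr):
    total = len(labels)
    gain = entropy(labels)
    for value in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

def id3(examples, labels, attributes):
    if len(set(labels)) == 1:            # pure node: all one class
        return labels[0]
    if not attributes:                   # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, labels, a))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree[best][value] = id3([examples[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != best])
    return tree
```

Calling id3(examples, labels, attributes) with examples as dictionaries of attribute values returns a nested dictionary representing the learned tree.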
A statistical property called information gain measures how well a given attribute separates the training examples.
Information gain uses the notion of entropy, commonly used in information
theory
Information gain = expected reduction of entropy
Entropy
Entropy measures the impurity of a collection of examples. It depends on the distribution of the classes.
◦ S is a collection of training examples
◦ p+ is the proportion of positive examples in S
◦ p– is the proportion of negative examples in S
To illustrate, suppose S is a collection of 14 examples of some Boolean concept, including 9 positive and 5 negative examples (we adopt the notation [9+, 5–] to summarize such a sample of data). Then the entropy of S relative to this Boolean classification is
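The standard two-class entropy and its value for this sample (reconstructed here, consistent with the definitions above):

```latex
\mathrm{Entropy}(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}
\qquad
\mathrm{Entropy}([9+,5-]) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940
```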

The entropy is 0 if all members of S belong to the same class.


The entropy is 1 when the collection contains an equal
number of positive and negative examples.
If the collection contains unequal numbers of positive and
negative examples, the entropy is between 0 and 1.
Figure 3.2 shows the form of the entropy function relative to a Boolean classification, as p+ varies between 0 and 1.
◦ Information gain is the expected reduction in entropy caused by partitioning the examples on an attribute.
◦ The higher the information gain, the more effective the attribute is at classifying the training data.
◦ The expected reduction in entropy from knowing A is
Gain(S, A) = Entropy(S) – Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)
where Values(A) is the set of possible values for A, and Sv is the subset of S for which A has value v.
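A quick numeric check of this formula using the standard PlayTennis Wind example (S = [9+, 5–], Wind = Weak gives [6+, 2–], Wind = Strong gives [3+, 3–]):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

s = entropy(9, 5)                              # about 0.940
gain_wind = s - (8/14) * entropy(6, 2) - (6/14) * entropy(3, 3)
print(round(gain_wind, 3))                     # about 0.048
```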
The search space consists of partial decision trees.
The algorithm is hill-climbing, and the evaluation function is information gain.
The hypothesis space is complete (it can represent any discrete-valued function).
The search maintains a single current hypothesis.
There is no backtracking, and hence no guarantee of optimality.
It uses all the available examples at once (it is not incremental).
It may terminate early, accepting a hypothesis that tolerates noisy classes.
ID3 can be characterized as searching a space of hypotheses for one that fits the training examples.
ID3 searches the set of possible decision trees within the available hypothesis space.
ID3 performs a simple-to-complex search: it first starts with an empty tree and keeps adding to it.
Every discrete-valued (finite) function can be described by some decision tree, so ID3 avoids the major risk of searching an incomplete hypothesis space.
It has only a single current hypothesis, so it cannot determine the alternative decision trees that are also consistent with the data.
Backtracking is not possible.
It can be extended easily to noisy data as well.
Inductive bias: a set of assumptions.
The inductive bias of ID3 consists of describing the basis by which ID3 chooses one consistent decision tree over all the other possible trees.
ID3's search strategy:
1. Selects shorter trees in favor of longer ones.
2. Selects the attribute with the highest information gain as the root, over lower-information-gain attributes.
Types of inductive bias:
1. Restriction bias – based on conditions that limit the hypothesis space.
2. Preference bias – based on priorities over hypotheses.
ID3 has a preference bias; version spaces and Candidate-Elimination have a restriction bias.
Why prefer shorter hypotheses?
Arguments in favor:
There are fewer short hypotheses than long ones
If a short hypothesis fits the data, it is unlikely to be a coincidence.
Arguments against:
Not every short hypothesis is a reasonable one.
Occam's razor: "The simplest explanation is usually the best one."
It is a principle usually (though incorrectly) attributed to the 14th-century English logician and Franciscan friar William of Ockham.
The term razor refers to the act of shaving away unnecessary assumptions to get
to the simplest explanation.
Natural bias of information gain: it favours attributes with many possible
values.
Consider the attribute Date in the PlayTennis example.
Date would have the highest information gain since it perfectly separates
the training data.
It would be selected at the root, resulting in a very broad tree.
Although very good on the training data, this tree would perform poorly in predicting unknown instances: overfitting.
The problem is that the partition is too specific; too many small classes are generated.
We need to look at alternative measures …
The gain ratio measure penalizes attributes such as Date by incorporating a term, called split information, that is sensitive to how broadly and uniformly the attribute splits the data:
SplitInformation(S, A) = – Σ_{i=1..c} (|Si| / |S|) · log2(|Si| / |S|)
where S1 through Sc are the sets obtained by partitioning S on the c values of A.
SplitInformation measures the entropy of S with respect to the values of A; the more uniformly dispersed the data, the higher it is.
The Gain Ratio measure is defined in terms of the earlier Gain measure, as well as this SplitInformation, as follows:
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
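Continuing the numeric sketch from the information gain section, split information and gain ratio for the same Wind split (8 examples go to Weak, 6 to Strong):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

def split_information(sizes):
    total = sum(sizes)
    return -sum((s / total) * math.log2(s / total) for s in sizes if s)

# Same Wind example: S = [9+,5-], Weak = [6+,2-], Strong = [3+,3-].
gain = entropy(9, 5) - (8/14) * entropy(6, 2) - (6/14) * entropy(3, 3)
split_info = split_information([8, 6])        # about 0.985
print(round(gain / split_info, 3))            # gain ratio, about 0.049
```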
How do we cope with the problem that the value of some attribute may be missing?
◦ Example: Blood-Test-Result in a medical diagnosis problem.
The strategy: use the other examples to guess the attribute's value.
Assign the value that is most common among the training examples at the node, or assign a probability to each value, based on frequencies, and assign values to the missing attribute according to this probability distribution.
Missing values in new instances to be classified are treated accordingly, and the most probable classification is chosen (C4.5). A sketch of the first strategy follows.
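A minimal sketch of the most-common-value strategy (an illustration, not the C4.5 implementation; the attribute name reuses the example above):

```python
from collections import Counter

# Replace missing values (None) of an attribute with the most common
# value among the training examples that reached this node.
def impute_most_common(examples, attr):
    observed = [ex[attr] for ex in examples if ex[attr] is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    for ex in examples:
        if ex[attr] is None:
            ex[attr] = most_common
    return examples

node_examples = [{"Blood-Test-Result": "Positive"},
                 {"Blood-Test-Result": "Positive"},
                 {"Blood-Test-Result": None}]
print(impute_most_common(node_examples, "Blood-Test-Result"))
```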
Instance attributes may have an associated cost: we would prefer decision trees that use low-cost attributes.
ID3 can be modified to take costs into account, e.g., by replacing information gain with a cost-sensitive measure:
Tan and Schlimmer (1990): Gain²(S, A) / Cost(A)
Nunez (1988): (2^Gain(S, A) – 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the importance of cost.
