Subject Notes
Unit I
VIII Semester
Subject Name: IT 802-Machine Learning

Unit-1 Content: Introduction, Examples of various Learning Paradigms, Perspectives and Issues, Concept Learning, Version Spaces, Finite and Infinite Hypothesis Spaces, PAC Learning, VC Dimension

INTRODUCTION

Machine learning (ML) is a class of algorithms that allows software applications to become more accurate at predicting outcomes without being explicitly programmed. The basic premise of machine learning is to build algorithms that receive input data and use statistical analysis to predict an output, updating those outputs as new data becomes available.

Traditional software engineering combines human-created rules with data to produce answers to a problem. Machine learning inverts this: it uses data and answers to discover the rules behind a problem. To learn the rules governing a phenomenon, a machine has to go through a learning process, trying different rules and learning from how well they perform. Hence the name Machine Learning.

There are multiple forms of Machine Learning: supervised, unsupervised, semi-supervised and reinforcement learning. Each form takes a different approach, but they all follow the same underlying process and theory.

The basic terminology used in Machine Learning is:

• Dataset: A set of data examples that contain features important to solving the
problem.

• Features: Important pieces of data that help us understand a problem. These are fed
into a Machine Learning algorithm to help it learn.

• Model: The representation (internal model) of a phenomenon that a Machine Learning algorithm has learnt. It learns this from the data it is shown during training. The model is the output you get after training an algorithm. For example, a decision tree algorithm would be trained and produce a decision tree model.

The process of Machine Learning includes the following steps (a minimal end-to-end sketch follows the list):

• Data Collection: Collect the data that the algorithm will learn from.

• Data Preparation: Clean and engineer the data into a suitable format, extracting important features and performing dimensionality reduction where needed.


• Training: This is where the Machine Learning algorithm learns, by being shown the data that has been collected and prepared.

• Evaluation: Test the model to see how well it performs.

• Tuning: Fine tune the model to maximize its performance.
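As a minimal end-to-end sketch of these steps, the workflow below uses scikit-learn and a built-in toy dataset; the dataset, model, and hyperparameters are illustrative assumptions, not part of the notes above.

# A minimal end-to-end sketch of the workflow above using scikit-learn (illustrative choices only).
from sklearn.datasets import load_iris                     # Data Collection: a ready-made toy dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler           # Data Preparation
from sklearn.tree import DecisionTreeClassifier            # Training
from sklearn.metrics import accuracy_score                 # Evaluation

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)                     # prepare: scale the features
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)    # train a decision tree model
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))    # evaluate on held-out data

# Tuning: adjust hyperparameters such as max_depth and retrain to improve performance.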

There are many approaches that can be taken when conducting Machine Learning. Supervised and unsupervised learning are well-established approaches and the most widely used. Semi-supervised and reinforcement learning are newer and more complex but have shown impressive results.

EXAMPLES OF VARIOUS LEARNING PARADIGMS

A learning paradigm describes the particular pattern by which something or someone learns. In machine learning, a learning paradigm describes how a machine learns when it is given data, i.e., its pattern of approach to that data.

There are three basic types of learning paradigms widely associated with machine learning,
namely

1. Supervised Learning

2. Unsupervised Learning

3. Reinforcement Learning

Figure 1.1: Basic types of Machine Learning

Supervised Learning

Supervised learning is the machine learning task of learning a function that maps inputs to outputs from provided example input-output pairs.


Figure 1.2: Supervised Learning

In this type of learning, you need to give both the input and the output (usually in the form of labels) to the computer for it to learn from. The computer generates a function based on this data, which can be anything from a simple line to a complex curve, depending on the data provided.

This is the most basic type of learning paradigm, and most of the algorithms we study today are based on it. For example:

Linear Regression (fitting a line to predict a continuous value)

Figure 1.3: Linear regression

Logistic Regression (0 or 1 logic, meaning yes or no)

Figure 1.4: Logistic regression model

Classification: The machine is trained to classify an input into one of a set of classes.


• Classifying whether a patient has a disease or not

• Classifying whether an email is spam or not

Regression: The machine is trained to predict a continuous value such as price, weight, or height.

• Predicting house/property price

• Predicting stock market price
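A small sketch of both supervised tasks, assuming scikit-learn is available; the tiny datasets below are made up purely for illustration.

# Sketch of supervised learning: the learner sees input-output pairs (X, y).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value (e.g., a price) from one feature.
X = np.array([[1.0], [2.0], [3.0], [4.0]])      # inputs (e.g., house size)
y = np.array([100.0, 200.0, 300.0, 400.0])      # labelled outputs (e.g., price)
reg = LinearRegression().fit(X, y)
print(reg.predict([[5.0]]))                     # about 500.0

# Classification: predict a discrete class (e.g., spam = 1, not spam = 0).
X_cls = np.array([[0.1], [0.4], [0.6], [0.9]])  # e.g., fraction of "spammy" words
y_cls = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[0.8]]))                     # likely class 1 (spam)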

Unsupervised Learning

In this type of learning paradigm, the computer is provided with only the inputs from which to develop a learning pattern. It is, in essence, learning without any labelled outputs.

Figure 1.5: Unsupervised Learning

This means that the computer has to recognize patterns in the given input and develop a model accordingly. In other words, “the machine learns through observation and finds structures in the data”. This remains a comparatively unexplored area of machine learning, and large technology companies such as Google and Microsoft are actively researching it.

Clustering: A clustering problem is where you want to discover the inherent groupings in the
data

• such as grouping customers by purchasing behavior

Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data

• such as people that buy X also tend to buy Y
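A brief clustering sketch, assuming scikit-learn is available; the customer figures below are invented to illustrate grouping by purchasing behavior (association rule mining is not shown).

# Sketch of unsupervised learning: only inputs are given, no labels.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: [annual spend, number of visits].
customers = np.array([
    [200,  5], [220,  6], [250,  7],    # low-spend group
    [900, 40], [950, 42], [880, 38],    # high-spend group
])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)            # cluster assignment discovered from the data alone
print(kmeans.cluster_centers_)   # one centre per discovered group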

Reinforcement Learning


Reinforcement Learning is a type of Machine Learning, and thereby also a branch of Artificial Intelligence. It allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize their performance.

Figure 1.6: Reinforcement Learning

There is an excellent analogy to explain this type of learning paradigm, “training a dog”.

This learning paradigm is like a dog trainer who teaches the dog how to respond to specific signs, like a whistle, a clap, or anything else. Whenever the dog responds correctly, the trainer gives the dog a reward, which can be a bone or a biscuit.

A variety of different problems can be solved using Reinforcement Learning. Because RL agents can learn without expert supervision, the types of problems best suited to RL are complex problems for which there appears to be no obvious or easily programmable solution. Two of the main ones are:

Game playing — determining the best move to make in a game often depends on a number
of different factors; hence the number of possible states that can exist in a particular game
is usually very large.

Control problems — such as elevator scheduling. Again, it is not obvious what strategies
would provide the best, most timely elevator service. For control problems such as this, RL
agents can be left to learn in a simulated environment and eventually they will come up
with good controlling policies.
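The reward-driven loop described above can be sketched with tabular Q-learning on a made-up one-dimensional corridor; the environment, learning rate, and discount factor below are assumptions for illustration, not part of the notes.

# Minimal tabular Q-learning sketch (illustrative; the environment is a made-up 1-D corridor).
import numpy as np

n_states, n_actions = 5, 2          # states 0..4; actions: 0 = left, 1 = right
goal = 4                            # reaching state 4 gives a reward (the "bone or biscuit")
alpha, gamma, episodes = 0.5, 0.9, 200

Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

for _ in range(episodes):
    s = 0
    while s != goal:
        # epsilon-greedy: explore 20% of the time, otherwise take the best known action
        a = rng.integers(n_actions) if rng.random() < 0.2 else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == goal else 0.0
        # Q-learning update: move Q(s, a) toward reward plus discounted best future value.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))   # learned policy: should prefer action 1 (right) in every non-goal state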

PERSPECTIVE & ISSUES

Perspective:

Machine learning involves searching a very large space of possible hypotheses to determine the one that best fits the observed data. Machine perception is the capability of a computer system to interpret data in a manner similar to the way humans use their senses to relate to the world around them. Computers take in and respond to their environment through attached hardware. Until recently, input was limited to a keyboard or a mouse, but advances in technology, both in hardware and software, have
allowed computers to take in sensory input in a way similar to humans. Machine perception allows the computer to use this sensory input, as well as conventional computational means, to gather information with greater accuracy and to present it in a way that is more comfortable for the user.

The end goal of machine perception is to give machines the ability to see, feel and perceive the world as humans do, and therefore to be able to explain in a human way why they are making their decisions, to warn us when they are failing and, more importantly, why they are failing. This purpose is very similar to the proposed purposes for artificial intelligence generally, except that machine perception would only grant machines limited sentience, rather than bestow upon machines full consciousness, self-awareness, and intentionality.

Issues:

Some of the issues that the science of machine perception still has to overcome include:

• Embodied Cognition - The theory that cognition is a full-body experience, and can therefore only exist, and be measured and analyzed in full, if all required human abilities and processes are working together through a mutually aware and supportive systems network.

• The Principle of Similarity - The ability young children develop to determine what family a newly introduced stimulus falls under, even when that stimulus differs from the members the child usually associates with that family. (For example, a child deciding that a Chihuahua is a dog and a house pet rather than vermin.)

• The Unconscious Inference - The natural human behavior of determining whether a new stimulus is dangerous or not, what it is, and then how to relate to it, without ever requiring any new conscious effort.

• The innate human ability to follow the Likelihood Principle in order to learn from
circumstances and others over time.

• The Recognition-by-components theory - Being able to mentally analyze and break down complicated mechanisms into manageable parts with which to interact. For example: a person seeing both the cup and the handle that make up a mug of hot cocoa, and using the handle to hold the mug so as to avoid being burned.

• The Free energy principle - Determining long beforehand how much energy one can safely devote to being aware of things outside oneself without losing the energy needed to sustain one's life and function satisfactorily. This allows one to become optimally aware of the surrounding world without depleting one's energy so much that one experiences damaging stress, decision fatigue, or exhaustion.


CONCEPT LEARNING

Concept learning, also known as category learning, is the search for and listing of attributes that can be used to distinguish exemplars from non-exemplars of various categories. More
simply put, concepts are the mental categories that help us classify objects, events, or ideas,
building on the understanding that each object, event, or idea has a set of common relevant
features. Thus, concept learning is a strategy which requires a learner to compare and
contrast groups or categories that contain concept-relevant features with groups or
categories that do not contain concept-relevant features.

A concept in Machine Learning can be thought of as a Boolean-valued function defined over a set of training examples. For example, one possible target concept may be the days on which a person enjoys his favorite sport. We have some attributes/features of the day, such as Sky, Air Temperature, Humidity, Wind, Water and Forecast, and based on these we have a target concept named EnjoySport.

We have the following training examples available:

Table 1.1: Training data for the EnjoySport concept

Let's design the problem formally with TPE (Task, Performance, Experience):

Problem: Learning the days on which a person enjoys the sport.

Task T: Learn to predict the value of EnjoySport for an arbitrary day, based on the values of
the attributes of the day.

Performance measure P: Total percent of days (EnjoySport) correctly predicted.

Training experience E: A set of days with given labels (EnjoySport: Yes/No)

Let's take a very simple hypothesis representation which consists of a conjunction of constraints on the instance attributes. We get a hypothesis h_i with the help of example i from our training set as below:

h_i(x) := <x1, x2, x3, x4, x5, x6>


Where x1, x2, x3, x4, x5 and x6 are the values of Sky, Air-Temp, Humidity, Wind, Water and
Forecast.

Hence h_1 will look like this (from the first row of the table above):

h_1(x=1): <Sunny, Warm, Normal, Strong, Warm, Same>   (Note: x=1 represents a positive example)

We want to find the most suitable hypothesis which can represent the concept. For
example, the person enjoys his favorite sport only on cold days with high humidity.

h(x=1) = <?, Cold, High, ?, ?, ?>

Here ‘?’ indicates that any value of the attribute is acceptable. The most general hypothesis is <?, ?, ?, ?, ?, ?>, which classifies every day as a positive example, and the most specific hypothesis is <∅, ∅, ∅, ∅, ∅, ∅>, which classifies no day as a positive example. The two most popular approaches for finding a suitable hypothesis are:

1. Find-S Algorithm

2. List-Then-Eliminate Algorithm

Find-S Algorithm:

Following are the steps of the Find-S algorithm (a runnable sketch is given after the steps):

• Initialize h to the most specific hypothesis in H

• For each positive training example x,

o For each attribute constraint ai in h

▪ If the constraint ai is satisfied by x, then do nothing

▪ Else replace ai in h by the next more general constraint that is satisfied by x

• Output hypothesis h
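A runnable sketch of FIND-S follows. Only the first training row is quoted in the notes above; the remaining rows are assumed here for illustration (EnjoySport-style data) so that the generalization step has something to act on.

# A sketch of FIND-S on EnjoySport-style data. Only the first row below is quoted in the
# notes; the remaining rows are assumed for illustration.
def find_s(examples):
    h = None                            # start from the most specific hypothesis
    for x, label in examples:
        if label != "Yes":              # FIND-S ignores negative examples
            continue
        if h is None:
            h = list(x)                 # first positive example: adopt it exactly
        else:
            # generalize every constraint the new positive example violates
            h = [hi if hi == xi else "?" for hi, xi in zip(h, x)]
    return h

data = [  # (Sky, AirTemp, Humidity, Wind, Water, Forecast), EnjoySport
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]
print(find_s(data))   # -> ['Sunny', 'Warm', '?', 'Strong', '?', '?']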

The LIST-THEN-ELIMINATE Algorithm:

Following are the steps of the LIST-THEN-ELIMINATE algorithm (a small sketch follows the steps):

VersionSpace <- a list containing every hypothesis in H

For each training example <x, c(x)>

• Remove from VersionSpace any hypothesis h for which h(x) != c(x)

Output the list of hypotheses in VersionSpace.
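A small sketch of LIST-THEN-ELIMINATE follows, using a deliberately tiny two-attribute hypothesis space, since enumerating every hypothesis is only feasible when H is small; the attribute values and examples are assumed for illustration.

# Sketch of LIST-THEN-ELIMINATE over a tiny, assumed hypothesis space of two attributes.
from itertools import product

sky_vals, temp_vals = ["Sunny", "Rainy"], ["Warm", "Cold"]
# Each hypothesis is a pair of constraints; "?" accepts any value.
H = list(product(sky_vals + ["?"], temp_vals + ["?"]))

def predict(h, x):
    # A conjunctive hypothesis classifies x as positive only if every constraint matches.
    return all(c == "?" or c == xi for c, xi in zip(h, x))

examples = [(("Sunny", "Warm"), True), (("Rainy", "Cold"), False)]

version_space = [h for h in H
                 if all(predict(h, x) == label for x, label in examples)]
print(version_space)   # the hypotheses consistent with every training example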


VERSION SPACE

A version space is a hierarchical representation of knowledge that enables you to keep track
of all the useful information supplied by a sequence of learning examples without
remembering any of the examples.

The version space method is a concept learning process accomplished by managing multiple
models within a version space.

Version Space Characteristics

• Tentative heuristics are represented using version spaces.

• A version space represents all the alternative plausible descriptions of a heuristic.

• A plausible description is one that is applicable to all known positive examples and
no known negative example.

• A version space description consists of two complementary trees:

o One that contains nodes connected to overly general models.

o One that contains nodes connected to overly specific models.

FINITE AND INFINITE HYPOTHESIS SPACES

A hypothesis is a function on the sample space, giving a value for each point in the sample
space. If the possible values are {0, 1} then we can identify a hypothesis with the subset of
those points that are given value 1. The error of a hypothesis is the probability of that
subset where the hypothesis disagrees with the true hypothesis. Learning from examples is
the process of making independent random observations and eliminating those hypotheses
that disagree with observations.

The hypothesis space is the set of all possible hypotheses (i.e., functions from inputs to the
outputs) that can be returned by a model. The hypothesis space is important because it
specifies what types of functions you can model and what types you cannot. The absolute
best error you can achieve on a dataset is lower bounded by the error of the “best” function
in your hypothesis space.

Suppose we have a finite set of hypotheses, H, and that we make m observations. If h is a hypothesis with error greater than E, then the probability that it will be consistent with a given observation is less than 1 - E, and the probability that it will be consistent with all m observations is less than (1 - E)^m, which is less than exp(-Em). Therefore the total probability that some hypothesis with error greater than E remains after m observations is less than |H| exp(-Em). We can set this bound at some desired level, say d, and solve for m, giving

m > (ln |H| + ln (1/d)) / E
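This bound can be evaluated directly. The sketch below wraps it in a small function; the numbers plugged in are illustrative assumptions (|H| = 973 is the usual count of semantically distinct conjunctive hypotheses for the EnjoySport attributes, with error tolerance 0.1 and failure probability 0.05).

# Sketch: the sample-size bound above as a function (epsilon plays the role of E, delta of d).
from math import log, ceil

def sample_bound(h_size, epsilon, delta):
    # Number of examples sufficient by the bound m > (ln|H| + ln(1/delta)) / epsilon, rounded up.
    return ceil((log(h_size) + log(1.0 / delta)) / epsilon)

print(sample_bound(973, epsilon=0.1, delta=0.05))   # roughly 99 examples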


PAC (Probably Approximately Correct) analysis is the framework machine learning uses to make such statements precise. The PAC analysis above assumes that the true answer/concept lies in the given hypothesis space H. A consistent machine learning algorithm L with hypothesis space H is one that, given a training data set D, will always return a hypothesis h in H consistent with D if one exists, and otherwise will indicate that no such hypothesis exists. The bound above shows that every finite hypothesis space has polynomial sample complexity. For an infinite hypothesis space, |H| can no longer be used in the bound, and the sample complexity is instead characterized by the VC dimension, discussed later in this unit.

PAC LEARNING

Probably Approximately Correct (PAC) learning is a framework for the mathematical analysis of machine learning. PAC Learning deals with the question of how to choose the size of the training set.

In this framework, the learner receives samples and must select a generalization function
(called the hypothesis) from a certain class of possible functions. The goal is that, with high
probability (the "probably" part), the selected function will have low generalization error
(the "approximately correct" part). The learner must be able to learn the concept given any
arbitrary approximation ratio, probability of success, or distribution of the samples.

Probably approximately correct (PAC) learning theory helps analyze whether and under
what conditions a learner L will probably output an approximately correct classifier.

Approximate: A hypothesis h ∈ H is approximately correct if its error over the distribution of inputs is bounded by some ϵ, 0 ≤ ϵ ≤ 1/2; i.e., error_D(h) < ϵ, where D is the distribution over inputs.

Probably: If L will output such a classifier with probability 1−δ, with 0 ≤ δ ≤ (1/2), we call
that classifier probably approximately correct.

Knowing that a target concept is PAC-learnable allows us to bound the sample size necessary to probably learn an approximately correct classifier, as in the bound from the previous section, now written with ϵ in place of E and δ in place of d:

m > (ln |H| + ln (1/δ)) / ϵ

To gain some intuition about this, note the effects on m when you alter the variables on the right-hand side. As the allowable error ϵ decreases, the necessary sample size grows. Likewise, it grows with the desired probability of an approximately correct learner (i.e., as δ decreases) and with the size of the hypothesis space H. (Loosely, a hypothesis space is the set of classifiers the algorithm considers.) More plainly, as we consider more possible classifiers, or desire a lower error or a higher probability of correctness, we need more data to distinguish between them.


VC DIMENSION

The Vapnik–Chervonenkis (VC) dimension is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a set of functions that can be learned by a statistical binary classification algorithm. It is defined as the cardinality of the largest set of points that the algorithm can shatter.

The capacity of a classification model is related to how complicated it can be. For example,
consider the threshold of a high-degree polynomial: if the polynomial evaluates above zero,
that point is classified as positive, otherwise as negative. A high-degree polynomial can be
wiggly, so it can fit a given set of training points well. But one can expect that the classifier
will make errors on other points, because it is too wiggly. Such a polynomial has a high
capacity. A much simpler alternative is to threshold a linear function.

The VC dimension provides a measure of the complexity of a hypothesis space, or the “power” of a learning machine. A higher VC dimension implies the ability to represent more complex functions.
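Shattering can be checked by brute force for a very simple family of classifiers. The sketch below uses one-dimensional threshold classifiers, a toy family assumed here for illustration rather than taken from the notes: any single point can be shattered, but no set of two points can, so this family's VC dimension is 1.

# Brute-force illustration of "shattering" for 1-D threshold classifiers h_t(x) = 1 if x >= t else 0.
from itertools import product

def threshold_classifiers(points):
    # Candidate thresholds: below all points, between adjacent points, above all points.
    xs = sorted(points)
    cuts = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return [lambda x, t=t: 1 if x >= t else 0 for t in cuts]

def shattered(points):
    # The set is shattered if every possible labelling is produced by some classifier.
    clfs = threshold_classifiers(points)
    achievable = {tuple(c(x) for x in points) for c in clfs}
    return all(lab in achievable for lab in product([0, 1], repeat=len(points)))

print(shattered([2.0]))        # True: a single point can be labelled either 0 or 1
print(shattered([1.0, 3.0]))   # False: the labelling (1, 0) is impossible for x >= t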

Suppose we want a model (e.g., some classifier) that generalizes well on unseen data, and we are limited to a specific amount of sample data.

The following figure shows some models (S1 up to Sk) of differing complexity (VC dimension), shown on the x-axis and denoted h.

Figure 1.7: S1 up to Sk models of differing VC dimension

The diagram shows that a higher VC dimension allows for a lower empirical risk (the error a
model makes on the sample data), but also introduces a higher confidence interval. This

interval can be seen as the confidence in the model's ability to generalize.

Low VC dimension (high bias)

If we use a model of low complexity, we introduce assumptions (bias) about the dataset; e.g., when using a linear classifier, we assume the data can be described by a linear model. If this is not the case, for example because the problem is nonlinear in nature, the given problem cannot be solved by a linear model. We will end up with a poorly performing model that is unable to learn the data's structure. We should therefore try to avoid introducing too strong a bias.

High VC dimension (greater confidence interval)

On the other side of the x-axis, we see models of higher complexity, which might have such great capacity that they memorize the data rather than learning its general underlying structure, i.e., the model overfits. Realizing this problem, it seems that we should avoid complex models.

This may seem contradictory: we should not introduce a strong bias (i.e., have too low a VC dimension), but we should also not have too high a VC dimension. This problem has deep roots in statistical learning theory and is known as the bias-variance tradeoff. What we should do is choose a model that is as complex as necessary and as simple as possible; when two models end up with the same empirical error, we should prefer the less complex one.
