Unit - 1 Deep Learning 3-2


ST.ANN'S COLLEGE OF ENGINEERING & TECHNOLOGY: CHIRALA

Deep Learning
Unit-1
Fundamentals of Deep Learning: Artificial Intelligence, History of Machine learning: Probabilistic
Modeling, Early Neural Networks, Kernel Methods, Decision Trees, Random forests and Gradient
Boosting Machines, Fundamentals of Machine Learning: Four Branches of Machine Learning,
Evaluating Machine learning Models, Overfitting and Underfitting. [Text Book 2]

------------------------------------------------------------------------------------------------------------------------

Introduction to Deep Learning (Part-1)


Deep learning is a specific subfield of machine learning: a new take on learning
representations from data that puts an emphasis on learning successive layers of
increasingly meaningful representations.

The “deep” here stands for the idea of successive layers of representations. How many
layers contribute to a model of the data is called the depth of the model. Other appropriate
names for the field could have been layered representations learning and hierarchical
representations learning. Modern deep learning often involves tens or even hundreds of
successive layers of representations. In deep learning, these layered representations are
(almost always) learned via models called neural networks.

Artificial Intelligence, Machine Learning, and Deep Learning

Figure 1: Artificial intelligence, machine learning, and deep learning


Artificial intelligence:
 Artificial intelligence was born in the 1950s, when some pioneers from
computer science started asking whether computers could be made to
“think”.

DEPARTMENT OF CSE- DATA SCIENCE III Year-II Sem


 A concise definition of the field is: “the effort to automate intellectual tasks normally
performed by humans”.
 AI is a general field that encompasses machine learning and deep learning, but it
also includes approaches that do not involve any learning.
 Early chess programs, for instance, used only hardcoded rules and do not qualify as
machine learning. For a long time, many experts believed that human-level
intelligence could be achieved by having programmers handcraft a sufficiently
large set of explicit rules for taking decisions as humans do. This type of
approach is called Symbolic AI.
 It was the dominant paradigm in AI from the 1950s to the late 1980s, and it
reached its peak popularity during the expert systems boom of the 1980s.
Machine learning:
 Lady Ada Lovelace was a friend and collaborator of Charles Babbage, the
inventor of the Analytical Engine.
 The Analytical Engine was designed to use mechanical operations to automate
certain mathematical computations.
 Its limitation, as Lovelace remarked, is that it could only do “whatever we know
how to order it to perform”: it could assist humans, but it could not originate
anything or take decisions on its own.
 In 1950, Alan Turing introduced the Turing test along with key
concepts that went on to shape AI.
 Machine learning arises from this question: could a computer go beyond
“what we know how to order it to perform” and learn on its own how to perform
a task? In other words, could a computer learn the data processing rules from
the data itself?
 In classical programming, the paradigm of Symbolic AI, the programmer inputs the
rules and the data to be processed using these rules, and the system produces
output in the form of answers. In machine learning, the programmer inputs the
data together with the answers expected from the data, and the system produces
the rules.
 A machine-learning system is trained rather than explicitly programmed.
 Machine learning started to flourish in the 1990s and has become the most
successful subfield of AI.
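The contrast between the two paradigms can be sketched with a toy example. Everything here is an assumption made for illustration: the temperature data, the function names `classical_rule` and `learn_threshold`, and the choice of a single threshold as the "rule". It is a minimal sketch, not a real learning algorithm from the text.

```python
# Classical programming: a human writes the rule.
def classical_rule(temperature):
    """Hand-written rule: anything at or above 30 degrees is 'hot'."""
    return "hot" if temperature >= 30 else "cold"


# Machine learning: the rule (here, a single threshold) is found from data.
def learn_threshold(temperatures, labels):
    """Try every midpoint between sorted inputs and keep the most accurate."""
    values = sorted(temperatures)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_correct = None, -1
    for threshold in candidates:
        correct = sum(
            ("hot" if t >= threshold else "cold") == label
            for t, label in zip(temperatures, labels)
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


# Hypothetical training data: inputs plus examples of the expected output.
temperatures = [12, 18, 22, 27, 31, 35, 38]
labels = ["cold", "cold", "cold", "cold", "hot", "hot", "hot"]

learned = learn_threshold(temperatures, labels)
```

In the first function the rule comes from the programmer; in the second, the data and the expected answers go in and the rule comes out.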

Figure: Symbolic AI (Classical Programming) versus Machine Learning

Learning representations from data:


 Before looking at the difference between deep learning and other
machine learning approaches, it helps to understand what machine learning
algorithms do.
 Every machine learning model expects THREE things:
o Input data points
o Examples of the expected output

o A way to measure whether the algorithm is doing a good job
 A machine-learning model transforms its input data into meaningful outputs.
The central problem in machine learning and deep learning is to meaningfully
transform data, that is, to learn useful representations of the input data.
 Let us take an example to understand these three things. Consider an x-axis, a
y-axis, and some points represented by their coordinates in the (x, y) system,
as shown in Figure 2.

Figure 2: Different Transformations


 As you can see, we have a few white points and a few black points. Let’s
develop a model that takes the coordinates of a point as input and determines
whether that point is “BLACK” or “WHITE”.
 In this case:
o The inputs are the coordinates of the points.
o The expected outputs are the colors “BLACK” and “WHITE”.
o The measure is the percentage of points that are
correctly classified.
 What we need here is a new representation of the data that cleanly separates
the white points from the black points, for example a coordinate change.
 If we systematically search through different possible coordinate changes,
using the percentage of correctly classified points as feedback, and arrive at
one that classifies a good percentage of the points correctly, then we are
doing machine learning.
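The search over coordinate changes described above can be sketched as follows. The six points, the restriction to rotations, and the rule "new x > 0 means BLACK" are assumptions chosen for illustration; the figure in these notes may use different data.

```python
import math

# Hypothetical data: (x, y) coordinates with a colour label for each point.
points = [(1.0, 2.0), (2.0, 3.0), (0.5, 2.5), (2.0, 1.0), (3.0, 2.0), (2.5, 0.5)]
colours = ["WHITE", "WHITE", "WHITE", "BLACK", "BLACK", "BLACK"]


def rotate(point, angle):
    """Coordinate change: rotate a point about the origin by `angle` radians."""
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


def accuracy(angle):
    """Fraction of points classified correctly by the rule 'new x > 0 is BLACK'."""
    correct = 0
    for point, colour in zip(points, colours):
        new_x, _ = rotate(point, angle)
        predicted = "BLACK" if new_x > 0 else "WHITE"
        correct += predicted == colour
    return correct / len(points)


# Search over candidate coordinate changes, using accuracy as the feedback signal.
candidate_angles = [math.radians(d) for d in range(0, 360, 15)]
best_angle = max(candidate_angles, key=accuracy)
best_accuracy = accuracy(best_angle)
```

The three ingredients from the list are all visible: the inputs (`points`), the expected outputs (`colours`), and the measure (`accuracy`) that guides the search.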

Deep learning:

 Deep learning is a mathematical framework for learning representations from
data. It is a subfield of machine learning, which is itself a subfield of AI.
 Modern deep learning often involves tens or even hundreds of successive
layers of representations, all learned automatically from
exposure to training data.
 In deep learning, these layered representations are (almost always) learned
via models called neural networks.
 The term neural network is a reference to neurobiology; some of the
central concepts were developed by drawing inspiration from the human brain.
 Let us look at an example of how a deep learning network recognizes the digit
in an image of handwriting.

 The network transforms the digit image into representations that are
increasingly different from the original image and increasingly informative
about the final result.

Figure 3: A deep neural network for digit classification


 A deep network can be thought of as a multistage information-distillation
operation, where information goes through successive filters and comes out
increasingly purified.
How Deep Learning Works
 Machine learning maps inputs to targets by observing many examples of
inputs and targets.
 Deep learning does this input-to-target mapping via a deep
sequence of simple data transformations (layers), and these transformations
are learned from the examples.
 The specification of what each layer does to its input is stored in the
layer’s weights, which are in essence a bunch of numbers.

Figure 4: The loss score is used as a feedback signal to adjust the weights

 In technical terms, the transformation implemented by a layer is parametrized
by the layer’s weights. These weights are sometimes also called the parameters
of the layer. Initially they are set to random values.

 In this context, learning means finding a set of values for the weights of all
layers in a network, such that the network will correctly map example inputs
to their associated targets.
 Finding the correct value for all of them may seem a daunting task,
because changing the value of one parameter affects the behavior of all the
others.
 To control the output of a neural network, we first have to observe the
predicted value and measure how far it is from what we expected. This is
the job of the loss function of the network, also called the objective function.
 The loss function takes the predictions of the network and the true target
(what you wanted the network to output) and computes a distance score.
 Since the weights are initialized with random values, the loss
score is initially very high.
 But with every example (item or image) the network processes, the weights
are adjusted a little in the correct direction, and the loss score decreases.
 This is the training loop, which is repeated a sufficient number of times to
reduce the loss score. The outputs will then be close to the targets.
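The training loop described above can be sketched for the smallest possible "network": a single weight. This is a minimal illustration under stated assumptions (the data follow y = 3x, the loss is the squared error, and the learning rate of 0.01 is an arbitrary choice); real networks have many layers and compute their gradients with backpropagation.

```python
import random

# Hypothetical training data generated from the target relationship y = 3x.
inputs = [1.0, 2.0, 3.0, 4.0]
targets = [3.0, 6.0, 9.0, 12.0]

random.seed(0)
weight = random.uniform(-1.0, 1.0)   # weights start as random values
learning_rate = 0.01


def loss(w):
    """Mean squared distance between predictions and true targets."""
    return sum((w * x - y) ** 2 for x, y in zip(inputs, targets)) / len(inputs)


initial_loss = loss(weight)

for epoch in range(200):
    for x, y in zip(inputs, targets):
        prediction = weight * x              # forward pass
        error = prediction - y               # how far from the target
        gradient = 2 * error * x             # d(error^2)/d(weight)
        weight -= learning_rate * gradient   # small step in the right direction

final_loss = loss(weight)
```

The loss starts high because the weight is random, and each small adjustment moves the weight toward 3, the value that maps the inputs onto their targets.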
Applications of Deep Learning
In particular, deep learning has achieved the following breakthroughs, all in
historically difficult areas of machine learning:

 Near-human-level image classification
 Near-human-level speech recognition
 Near-human-level handwriting transcription
 Improved machine translation
 Improved text-to-speech conversion
 Digital assistants such as Google Now and Amazon Alexa
 Near-human-level autonomous driving
 Improved ad targeting, as used by Google, Baidu, and Bing
 Improved search results on the web
 Ability to answer natural-language questions
 Superhuman Go playing

A brief history of machine learning


Deep learning has reached a level of public attention and industry investment
never before seen in the history of AI. Deep learning cannot solve every
problem, however: it needs sufficient data, and sometimes other machine learning
methods solve a problem more efficiently than deep learning does.

Probabilistic Modeling:

 Probabilistic modeling is the application of the principles of statistics to
data analysis.
 It was one of the earliest forms of machine learning. One of the best-known
algorithms in this category is the Naive Bayes algorithm.
 Naive Bayes is a type of machine-learning classifier based on applying Bayes’
theorem while assuming that the features in the input data are all
independent (a strong, or “naive”, assumption, which is where the name comes
from).
 This form of data analysis predates computers and was applied by hand
long before its first computer implementation.
 The foundation for Bayes’ theorem was laid in the 18th century.
 A closely related model is logistic regression (used for classification
problems), which is sometimes considered the “hello world” of
machine learning.
 The logistic regression model predicts a categorical dependent variable from
a set of independent variables. Despite its name, it is a classification
algorithm rather than a regression algorithm.
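For reference, the two models above can be written out as formulas. The symbols are assumptions introduced here rather than notation from these notes: $y$ is the class, $x_1, \dots, x_n$ are the input features, $w$ and $b$ are the weights and bias of the logistic regression model, and $\sigma$ is the sigmoid function.

```latex
% Bayes' theorem applied to classification
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}

% Naive Bayes: the features are assumed independent given the class
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)

% Logistic regression: a probability for the positive class
P(y = 1 \mid x) = \sigma(w^{\top} x + b) = \frac{1}{1 + e^{-(w^{\top} x + b)}}
```

In both cases the predicted class is the one with the highest probability, which is why logistic regression counts as a classifier even though its output is a continuous number between 0 and 1.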

Early Neural Networks:

Figure 5: Structure of Neural Networks

 Early neural networks have been completely replaced by their modern
variants.
 They nevertheless laid the path to deep learning. The core
ideas of neural networks were investigated as early as the 1950s, but the
approach was largely ignored for decades because there was no efficient way to
train large neural networks.
 Interest in neural networks revived in the mid-1980s, when several people
independently rediscovered the Backpropagation algorithm and started applying
it to neural networks.

 Backpropagation is used to optimize the parameters (weights) of the neural
network with gradient-descent optimization: it computes how each weight should
change to reduce the loss, and the learning rate controls the size of each
adjustment.
 The first successful practical application of neural nets came in 1989 from Bell
Labs, when Yann LeCun combined the earlier ideas of convolutional neural
networks and backpropagation, and applied them to the problem of
classifying handwritten digits.

Kernel Methods:

Kernel methods are a group of classification algorithms. The support vector
machine (SVM) is the best-known algorithm in this category. The modern
formulation of the SVM was developed by Vladimir Vapnik and Corinna Cortes in
the early 1990s at Bell Labs. SVMs aim at solving classification
problems by finding good decision boundaries between two sets of points belonging to two
different categories. A decision boundary is a line or surface, linear or non-linear, that
separates the space into two regions corresponding to the two categories. SVMs proceed to
find these boundaries in two steps:

 The data is mapped to a new high-dimensional representation where the
decision boundary can be expressed as a hyperplane.
 A good decision boundary is computed by maximizing the distance between
the hyperplane and the closest data points from each class. This distance is
called the “margin”.

The mapping of the data to a high-dimensional space can be carried out using
kernel functions. An example is given below.

 Kernel functions are used to transform data that is not linearly separable
into a representation in which it is (Ex: y=power(x,2)).

 Let us consider a small dataset as shown below:

 If we use only the first feature, that is x, the data appears to be non-linear.

x      y=power(x,2)
1.2    1.44
1.4    1.96
1.3    1.69
1.5    2.25
1.3    1.69
1.2    1.44

[Chart of the feature x; axis tick values omitted]

But, if we add second feature using the polynomial expression y=power(x,2), then the
dataset becomes linearly separable as shown below.

[Chart: y=power(x,2) plotted against x, for x between 1.15 and 1.55]

Figure 6: Polynomial Kernel
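The effect of adding the feature power(x,2) can be checked on a small labelled dataset. The data and the helper names `separable_by_threshold` and `feature_map` are assumptions made for illustration. The sketch also computes the mapping explicitly, whereas an SVM would use a kernel function to avoid building the new representation.

```python
# Hypothetical 1-D dataset: class 1 lies on both sides of class 0,
# so no single threshold on x separates the classes.
xs = [-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0]
classes = [1, 1, 0, 0, 0, 1, 1]


def separable_by_threshold(values, labels):
    """True if some threshold t puts all of one class on each side of t."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    changes = sum(a != b for a, b in zip(ordered, ordered[1:]))
    return changes <= 1


def feature_map(x):
    """Polynomial feature map: x -> (x, x**2)."""
    return (x, x ** 2)


mapped = [feature_map(x) for x in xs]
second_feature = [point[1] for point in mapped]

before = separable_by_threshold(xs, classes)             # using x alone
after = separable_by_threshold(second_feature, classes)  # using x**2
```

In the new two-dimensional representation, a horizontal line such as x**2 = 1 separates the two classes, which is exactly a linear decision boundary.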

Decision Trees:

Decision trees are tree-like structures that let you classify input data points or
predict output values given inputs, as shown in Figure 7. A decision tree is a supervised
learning technique that can be used for both classification and regression problems, but
it is mostly preferred for solving classification problems. Decision trees are easy to
visualize and interpret. A tree contains three main elements: decision nodes, branches, and
leaf nodes. Decision nodes can have multiple branches, whereas leaf nodes cannot have any
further branches.

Figure 7: Decision Tree

Example:

Figure 8: Example Decision Tree to Accept Offer
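A decision tree of this kind can be written down as nested decision nodes and leaf nodes. The features (`salary`, `commute_hours`) and thresholds below are hypothetical and are not taken from Figure 8; the sketch only shows how a prediction follows branches from the root to a leaf.

```python
# Hypothetical tree; the features and thresholds are made up for illustration
# and are not taken from Figure 8.
tree = {
    "question": ("salary", 50000),          # decision node: salary >= 50000?
    "no": "DECLINE",                        # leaf node
    "yes": {
        "question": ("commute_hours", 1),   # decision node: commute >= 1 hour?
        "yes": "DECLINE",                   # leaf node
        "no": "ACCEPT",                     # leaf node
    },
}


def predict(node, sample):
    """Follow branches from the root until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        feature, threshold = node["question"]
        branch = "yes" if sample[feature] >= threshold else "no"
        node = node[branch]
    return node


offer = {"salary": 60000, "commute_hours": 0.5}
decision = predict(tree, offer)
```

A learning algorithm would choose the questions and thresholds from training data; here they are fixed by hand so that only the structure is visible.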

Random Forest:
Random Forest is a popular machine learning algorithm that belongs to the
supervised learning family. It can be used for both classification and regression
problems in ML. It builds a large number of specialized decision trees and combines
their outputs. It is based on the concept of ensemble learning, which is the process of
combining multiple models to solve a complex problem and to improve the performance of
the overall model. A larger number of trees in the forest generally leads to higher
accuracy and reduces the risk of overfitting.

Figure 9: Random Forest

Different decision trees are created from the same data. Instead of depending on a single
decision tree, the random forest takes the prediction from each tree and outputs the
class that receives the majority of the votes.
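The majority vote can be sketched in a few lines. The three "trees" below are hand-written rules on made-up features, used only to show the voting step; in an actual random forest each tree is learned from a random sample of the training data.

```python
from collections import Counter


# Three hypothetical "trees", each a simple rule on a different feature.
def tree_a(sample):
    return "spam" if sample["links"] > 3 else "ham"


def tree_b(sample):
    return "spam" if sample["exclamations"] > 5 else "ham"


def tree_c(sample):
    return "spam" if sample["length"] < 20 else "ham"


forest = [tree_a, tree_b, tree_c]


def forest_predict(sample):
    """Collect one vote per tree and return the most common class."""
    votes = Counter(tree(sample) for tree in forest)
    return votes.most_common(1)[0][0]


message = {"links": 5, "exclamations": 2, "length": 10}
prediction = forest_predict(message)
```

Two of the three trees vote "spam" for this message, so the forest outputs "spam" even though one tree disagrees. For regression problems, the vote is typically replaced by an average of the trees' outputs.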

Advantages of Random Forests:


1. It often takes less time to train than many other algorithms.
2. It generally gives high accuracy.
3. It can maintain its accuracy even when a large proportion of the data is missing.

Gradient Boosting Machine:


A gradient boosting machine, much like a random forest, is a machine learning
technique based on ensembling weak prediction models, generally decision trees. It uses
gradient boosting, a way to improve any machine learning model by iteratively training
new models that specialize in addressing the weak points of the previous models. When
gradient boosting is applied to decision trees, the resulting models outperform random
forests most of the time. It is one of the best algorithms for dealing with
non-perceptual data.
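The idea of each new model addressing the weak points of the previous ones can be sketched for a small regression problem. This is a minimal illustration under stated assumptions: the data are made up, the weak learner is a one-split "stump", the loss is squared error (so the weak points are simply the residuals), and the learning rate of 0.5 is arbitrary. It is not the implementation used by any particular library.

```python
# Hypothetical 1-D regression data (xs must be sorted and distinct here).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]


def fit_stump(inputs, residuals):
    """Weak learner: the single split on x that best fits the residuals."""
    best = None
    for split in inputs[:-1]:
        left = [r for x, r in zip(inputs, residuals) if x <= split]
        right = [r for x, r in zip(inputs, residuals) if x > split]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((r - (left_mean if x <= split else right_mean)) ** 2
                    for x, r in zip(inputs, residuals))
        if best is None or error < best[0]:
            best = (error, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x <= split else right_mean


def mse(predictions):
    return sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys)


learning_rate = 0.5
base = sum(ys) / len(ys)                 # stage 0: predict the mean
predictions = [base] * len(xs)
initial_error = mse(predictions)

stumps = []
for stage in range(20):
    # For squared error, the negative gradient is just the residual.
    residuals = [y - p for y, p in zip(ys, predictions)]
    stump = fit_stump(xs, residuals)
    stumps.append(stump)
    predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]

final_error = mse(predictions)
```

Each stump on its own is a poor model, but because every stage is fitted to what the ensemble still gets wrong, the combined error shrinks from stage to stage.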

Fundamentals of Machine Learning (Part-2)


Here we bring together concepts such as model evaluation, data pre-processing
for deep learning, feature engineering, and tackling overfitting into a seven-step
workflow that applies to any machine learning approach.
Four branches of machine learning
Three common types of machine learning problems are binary
classification, multiclass classification, and scalar regression. All three are instances of
supervised learning, where the goal is to learn the relationship between input variables
and targets. Machine learning algorithms as a whole fall into four categories:
1. Supervised learning
2. Unsupervised learning
3. Self-supervised learning
4. Reinforcement learning
Supervised learning:
This is the most familiar category. It consists of learning to map inputs to known
targets, given a set of examples. Almost all deep learning applications belong to this
category. Supervised learning is used for both classification and regression tasks, and
also for more exotic applications such as the following:
 Sequence generation – Given a picture, it predicts a caption describing the
picture.

 Syntax tree prediction – Given a sentence, it predicts the decomposition of the
sentence into a syntax tree.
 Object detection – Given a picture, it draws a bounding box around certain
objects inside the picture.
 Image segmentation – Given a picture, it draws a pixel-level mask on a specific
object, dividing the image into parts.
Unsupervised learning:
Unsupervised learning finds interesting transformations of the input data without the
help of any known targets. It is mainly used for data visualization, data compression, data
denoising, and understanding the correlations present in the data. It is the
bread and butter of data analysts, and is often a necessary step in understanding a
dataset before attempting a supervised learning technique. There are two well-known
categories of unsupervised learning:
 Dimensionality reduction
 Clustering
Self-supervised learning:
This is a specific type of supervised learning, but it is different enough to be
considered a separate category. It is supervised learning without human-annotated labels.
Labels are still involved, but they are generated from the input data, typically using a
heuristic algorithm.
 For instance, autoencoders are a well-known instance of self-supervised learning,
where the generated targets are the input, unmodified.
 In the same way, predicting the next frame of a video, given the past frames, is
an instance of self-supervised learning.
 So is predicting the next word in a text, given the previous words.
Reinforcement learning:
In reinforcement learning, an agent receives information about its environment and
learns to choose actions that will maximize some reward. So far it has mostly been used
in games, where the agent learns to choose the next move that maximizes the reward. Some
of the potential applications are as follows:
 Self-driving cars
 Robotics
 Education

Evaluating Machine learning Models


Once the model is trained, it is evaluated to measure its performance. The model must be
evaluated on data it has never seen before. If the evaluation is done on the same
data the model was trained on, the score is misleadingly good, because the model may have
overfit that data. Hence the available data is split into three sets.

Training, Validation and Test sets:


The data is split into three sets: training set, validation set, and test set. The
model is trained on the training set and evaluated on the validation set. This helps
to fine-tune the model. One final test is then conducted on the test set. One may ask
why not use just two sets, a training set and a test set, which would be much simpler.
That is possible, and it is commonly done, but it has a cost that becomes clear once
tuning is considered. The main goal of any machine learning approach is to obtain models
that generalize well to never-before-seen data. Tuning is itself a form of
learning: a search for a good solution in some parameter space.
Tuning the model based on its performance on the validation set can therefore
quickly result in overfitting to the validation set. The central problem is that every
time the model is tuned using its performance on the validation set, some information
about the validation data leaks into the model. If this is repeated many times on the
same validation set, a significant amount of information leaks into the model.
In the end the model performs well on the validation set, but when new data that it
has never seen before is given, its performance decreases. We have to take
care to test our model on completely different data, not only on the validation set. This
is the role of the test set. There are different techniques for dividing the data into
training, validation, and test sets. They are as follows:
1. Simple Hold-out validation,
2. K-fold validation,
3. Iterated K-fold validation with shuffling
SIMPLE HOLD-OUT VALIDATION:
Here the dataset is divided into two parts: a training set and a hold-out validation set.
The model is trained on the training set and evaluated on the validation set. To
prevent information leaks, the model should not be tuned on the test set, which is why a
validation set is held out in addition to the test set. Before splitting, the data can be
randomly shuffled to mix it well.

Figure 10: Hold-out validation split
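A hold-out split can be sketched as follows. The dataset of 100 indices and the 60/20/20 proportions are assumptions chosen for illustration, and `train_model` and `evaluate` in the comments are placeholders rather than real functions.

```python
import random

# Hypothetical dataset: 100 samples, represented here by their indices.
data = list(range(100))

random.seed(42)
random.shuffle(data)                 # shuffle before splitting

num_test = 20                        # assumed sizes, chosen for illustration
num_validation = 20

test_set = data[:num_test]
validation_set = data[num_test:num_test + num_validation]
training_set = data[num_test + num_validation:]

# Typical use (train_model / evaluate are placeholders, not real functions):
#   model = train_model(training_set)
#   validation_score = evaluate(model, validation_set)   # used for tuning
#   ...tune, retrain, re-evaluate...
#   test_score = evaluate(model, test_set)                # used once, at the end
```

The three sets never overlap, so the final test score is computed on samples that influenced neither the training nor the tuning.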


Drawback:
Though this is a simple protocol, it suffers from one drawback. If the dataset is
small, the validation set contains only a few samples, too few to be
statistically representative of the data. Hence the model may not be tuned well. This can
be addressed with the help of K-fold validation and iterated K-fold validation.

K-FOLD VALIDATION
Here we split the data into K partitions of equal size. For each partition i, train a
model on the remaining K – 1 partitions, and evaluate it on partition i. The same process is
repeated K times, once per partition. The final score of the model is the average of the K
scores obtained. This method is preferred when the performance of the model shows
significant variance depending on how the data is split. Here every fold serves as the
validation set exactly once, rather than a single fixed fold.

Figure 11: K-Fold Validation
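K-fold validation can be sketched as follows. The data and the `evaluate` function are placeholders: a real `evaluate` would train a fresh model on the training folds and score it on the validation fold. The sketch also assumes the number of samples is divisible by K; otherwise the leftover samples would only ever be used for training.

```python
def k_fold_splits(data, k):
    """Yield (training, validation) pairs, one per fold."""
    fold_size = len(data) // k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        validation = data[start:stop]
        training = data[:start] + data[stop:]
        yield training, validation


def evaluate(training, validation):
    """Placeholder score; a real version would train and evaluate a model."""
    mean = sum(training) / len(training)
    return sum((v - mean) ** 2 for v in validation) / len(validation)


data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]   # hypothetical samples
k = 4

scores = [evaluate(training, validation)
          for training, validation in k_fold_splits(data, k)]
final_score = sum(scores) / len(scores)
```

Each sample is used for validation exactly once and for training K – 1 times, and the final score is the average over the K folds.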

ITERATED K-FOLD VALIDATION WITH SHUFFLING


This one is for situations in which you have relatively little data available and you need to
evaluate your model as precisely as possible. It consists of applying K-fold validation
multiple times, shuffling the data every time before splitting it K ways. The final score is
the average of the scores obtained at each run of K-fold validation.
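Iterated K-fold validation with shuffling can be sketched in the same style. The data, the number of iterations, and `toy_evaluate` are assumptions for illustration; a real evaluation function would train and score a model for every fold.

```python
import random


def k_fold_score(data, k, evaluate):
    """Average validation score over K folds (equal-sized folds assumed)."""
    fold_size = len(data) // k
    scores = []
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        scores.append(evaluate(data[:start] + data[stop:], data[start:stop]))
    return sum(scores) / k


def iterated_k_fold_score(data, k, iterations, evaluate, seed=0):
    """Repeat K-fold validation, reshuffling the data before every run."""
    rng = random.Random(seed)
    run_scores = []
    for _ in range(iterations):
        shuffled = data[:]
        rng.shuffle(shuffled)
        run_scores.append(k_fold_score(shuffled, k, evaluate))
    return sum(run_scores) / iterations


def toy_evaluate(training, validation):
    """Placeholder for 'train a model, then score it on the validation fold'."""
    mean = sum(training) / len(training)
    return sum(abs(v - mean) for v in validation) / len(validation)


data = [float(i) for i in range(12)]          # hypothetical samples
score = iterated_k_fold_score(data, k=3, iterations=5, evaluate=toy_evaluate)
```

With 5 iterations and K = 3, the evaluation function runs 15 times, which is why this protocol can become expensive when each run trains a full model.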

Overfitting and Underfitting


The central challenge in machine learning is that we must perform well on new, previously
unseen inputs—not just those on which our model was trained. The ability to perform well
on previously unseen inputs is called generalization.

We can compute some error measure on the training set called the training error, and we
reduce this training error. What separates machine learning from optimization is that we
want the generalization error, also called the test error to be low as well. The
generalization error is defined as the expected value of the error on a new input.

We typically estimate the generalization error of a machine learning model by measuring its
performance on a test set of examples that were collected separately from the training set.
The test error can be computed using the MSE (Mean Squared Error) as follows:

1. Measure the distance between the observed y-value and the predicted y-value at each
value of x: $y_i - \hat{y}_i$

2. Square each of these distances: $(y_i - \hat{y}_i)^2$

3. Calculate the mean of the squared distances: $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
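The three steps translate directly into code. The observed and predicted values below are made up for illustration.

```python
def mean_squared_error(observed, predicted):
    """MSE: mean of the squared differences between observed and predicted."""
    differences = [y - y_hat for y, y_hat in zip(observed, predicted)]   # step 1
    squared = [d ** 2 for d in differences]                              # step 2
    return sum(squared) / len(squared)                                   # step 3


observed = [3.0, 5.0, 7.0]        # hypothetical test targets
predicted = [2.0, 5.0, 9.0]       # hypothetical model outputs
test_error = mean_squared_error(observed, predicted)
```

Here the differences are 1, 0 and -2, the squares are 1, 0 and 4, and their mean is 5/3.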

The factors determining how well a machine learning algorithm will perform are its ability to:

1. Make the training error small.


2. Make the gap between training and test error small.

These two factors correspond to the two central challenges in machine learning: underfitting
and overfitting. Underfitting occurs when the model is not able to obtain a sufficiently low
error value on the training set. That means the model has not learned enough from the
training data. Overfitting occurs when the gap between the training error and the test error is
too large. In this case the model has fitted the training set too closely: the training error
is low, but when new samples are given the test error is much higher than the
training error.

Capacity plays the major role in controlling underfitting and overfitting. Informally, the
capacity of a model is its ability to fit a wide variety of functions. Models with low
capacity may struggle to fit the training set. Models with high capacity can overfit.

To prevent a model from learning misleading or irrelevant patterns found in the training
data, the best solution is to get more training data. A model trained on more data will
naturally generalize better. When that is not possible, the next-best solution is to limit
the amount of information the model is allowed to store, or to add constraints on what
information it is allowed to store. Fighting overfitting in this way is called
regularization. Let’s review some of the most common regularization techniques:

1. Reducing the network’s size

The simplest way to prevent overfitting is to reduce the size of the model, that is, the
number of learnable parameters in the model. The number of learnable parameters is often
referred to as the model’s capacity.

2. Adding weight regularization

A simple model in this context is a model where the distribution of parameter values
has less entropy (or a model with fewer parameters, as described under reducing the
network’s size). Thus a common way to mitigate overfitting is to put constraints on the
complexity of a network by forcing its weights to take only small values, which
makes the distribution of weight values more regular. This is called weight
regularization. It is done by adding to the loss function of the network a cost associated
with having large weights. This cost comes in two flavors:

 L1 regularization: The cost added is proportional to the absolute value
of the weight coefficients (the L1 norm of the weights).

 L2 regularization: The cost added is proportional to the square of the
value of the weight coefficients (the L2 norm of the weights). L2
regularization is also called weight decay in the context of neural
networks.
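The two penalties can be sketched as plain functions. The weights, the data loss, and the regularization coefficient `strength` are hypothetical values chosen for illustration; deep learning libraries provide their own built-in regularizers.

```python
def l1_penalty(weights, strength):
    """L1 cost: strength times the sum of absolute weight values."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """L2 cost: strength times the sum of squared weight values."""
    return strength * sum(w ** 2 for w in weights)


weights = [0.5, -2.0, 1.0]        # hypothetical layer weights
data_loss = 0.30                  # hypothetical loss on the training batch
strength = 0.01                   # assumed regularization coefficient

total_loss_l1 = data_loss + l1_penalty(weights, strength)
total_loss_l2 = data_loss + l2_penalty(weights, strength)
```

Because the penalty is added to the loss that training minimizes, large weights are only kept when they reduce the data loss by more than they cost. Note how the L2 penalty is dominated by the largest weight (-2.0), while the L1 penalty grows only linearly with it.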

3. Adding dropout

Dropout is one of the most effective and most commonly used regularization
techniques for neural networks, developed by Geoff Hinton and his students at the
University of Toronto. Dropout, applied to a layer, consists of randomly dropping out
(setting to zero) a number of output features of the layer during training.

Let’s say a given layer would normally return a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a
given input sample during training. After applying dropout, this vector will have a
few zero entries distributed at random: for example, [0, 0.5, 1.3, 0, 1.1].
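Dropout on the example vector can be sketched as follows. The dropout rate of 0.4 and the fixed random seed are assumptions for illustration. The sketch also rescales the surviving values by 1/(1 - rate), a common variant usually called inverted dropout, which the example in the text does not show; it keeps the expected sum of the outputs unchanged so that nothing needs to be adjusted at test time.

```python
import random


def dropout(outputs, rate, rng):
    """Zero each output with probability `rate`; scale survivors by 1/(1 - rate)."""
    kept = []
    for value in outputs:
        if rng.random() < rate:
            kept.append(0.0)
        else:
            kept.append(value / (1.0 - rate))
    return kept


layer_output = [0.2, 0.5, 1.3, 0.8, 1.1]
rng = random.Random(0)                      # fixed seed, for reproducibility only
dropped = dropout(layer_output, rate=0.4, rng=rng)
```

Which entries are zeroed changes on every call during training, so the network cannot rely on any single output feature. Dropout is applied only during training; at test time the layer output is used as it is.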

-------------------End of Unit-1-----------
