AI Chapter 4
Institute of Technology
University of Gondar
Biomedical Engineering Department
Outline:
» Deep learning
Deep Learning
» Machine learning is using algorithms to extract information from raw data and represent it in some
type of model
» The model is then used to infer things about other data we have not yet modeled.
» A computer learns something about the structures that represent the information in the raw data.
» Structural descriptions are another term for the models we build to contain the information extracted
from the raw data, and we can use those structures or models to predict unknown data.
» Structural descriptions (or models) can take many forms, including the following:
o Decision trees
o Linear regression
o Neural network weights
Deep Learning(DL)
» Deep learning is a subset of the field of machine learning, which is a subfield of AI.
» Deep learning is an area of machine learning that emerged from the intersection of neural networks,
artificial intelligence, graphical modeling, optimization, pattern recognition and signal processing.
» Deep learning is about supervised or unsupervised learning from data using multiple layered machine
learning models.
» The development of deep learning was motivated in part by the failure of traditional algorithms to
generalize well on such AI tasks
» Deep multi-layer neural networks contain many levels of nonlinearities which allow them to compactly
represent highly non-linear and/ or highly-varying functions.
Deep Learning(DL)
» The power of deep learning models comes from their ability to classify or predict nonlinear
data using a modest number of parallel nonlinear steps.
» A deep learning model learns the input data features hierarchy all the way from raw data input
to the actual classification of the data.
» Each layer extracts features from the output of the previous layer.
Why deep learning? Why now?
o Datasets
o Algorithmic advances
» Machine learning isn’t mathematics or physics, where major advances can be made
with a pen and a piece of paper. It’s an engineering science.
What makes deep learning different?
» A feedforward network defines a mapping y = f (x; θ) and learns the value of the
parameters θ that result in the best function approximation.
Deep Feedforward Networks
» For example, we may have three functions f(1), f(2) and f(3) connected in a
chain to form f(x) = f(3)(f(2)(f(1)(x))).
» 𝑓 (1) is called the first layer of the network, 𝑓 (2) is the second layer and so on.
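As a minimal sketch of such a chain (the layer functions and weights below are made-up illustrative values, not from the text):

```python
import numpy as np

# Toy 3-layer chain f(x) = f3(f2(f1(x))); the weights are invented
# for illustration only.
def f1(x):
    return np.maximum(0, 2.0 * x + 1.0)   # first layer: affine map + ReLU

def f2(x):
    return np.maximum(0, 0.5 * x - 1.0)   # second layer

def f3(x):
    return 3.0 * x                        # final layer: linear output

def f(x):
    return f3(f2(f1(x)))                  # the composed network

print(f(1.0))  # f1(1)=3.0, f2(3.0)=0.5, f3(0.5)=1.5
```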
» To extend linear models to represent nonlinear functions of x, we can apply the linear
model not to x itself but to a transformed input φ(x), where φ is a nonlinear
transformation.
» y = f(x; θ, w) = φ(x; θ)ᵀw.
» We now have parameters θ that we use to learn φ from a broad class of functions, and
parameters w that map from φ(x) to the desired output.
» Parameter learning means finding a set of values for the weights of all
layers in a network.
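A hypothetical one-hidden-layer network makes this concrete: φ(x; θ) is a ReLU hidden layer with parameters θ, and w is the linear readout (all sizes and values below are assumptions for illustration):

```python
import numpy as np

# Hypothetical one-hidden-layer network: phi(x; theta) is the learned
# nonlinear feature map, w maps the features to the output.
rng = np.random.default_rng(0)
theta_W = rng.normal(size=(4, 3))     # theta: hidden weights (4 units, 3 inputs)
theta_b = np.zeros(4)                 # theta: hidden biases
w = rng.normal(size=4)                # output weights

def phi(x):
    # nonlinear transformation of the input, phi(x; theta)
    return np.maximum(0, theta_W @ x + theta_b)

def predict(x):
    # y = f(x; theta, w) = phi(x; theta)^T w
    return phi(x) @ w

x = np.array([1.0, -2.0, 0.5])
print(predict(x))                     # a scalar prediction
```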
Understanding how deep learning works
The loss function takes the predictions of the network and the true target (what function you wanted
the network to output) and computes a distance score, capturing how well the network has done on
this specific example. A loss function measures the quality of the network’s output.
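For instance, mean squared error is one such distance score (a minimal sketch; the toy predictions and targets are invented):

```python
import numpy as np

# Mean squared error: a simple loss measuring how far the network's
# predictions are from the true targets.
def mse_loss(predictions, targets):
    return np.mean((predictions - targets) ** 2)

y_pred = np.array([0.9, 0.2, 0.8])   # invented network outputs
y_true = np.array([1.0, 0.0, 1.0])   # invented true targets
print(mse_loss(y_pred, y_true))      # small score -> predictions close to targets
```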
Understanding how deep learning works
» Repeating the training loop a sufficient number of times yields weight values that minimize the loss
function
» Once the network architecture is defined, you still have to choose two more things:
o Loss function (objective function)—The quantity that will be minimized during training.
o Optimizer—Determines how the network will be updated based on the loss function. It
implements a specific variant of stochastic gradient descent (SGD)
» Choosing the right objective function for the right problem is extremely important
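The two choices can be sketched together in a toy training loop: the loss is mean squared error and the optimizer is plain stochastic gradient descent on one example at a time (the data, learning rate, and step count are assumptions for illustration):

```python
import numpy as np

# Fit y = w*x + b to toy data with single-example SGD on the MSE loss.
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 3.0 * x + 1.0                  # true relation (noiseless, for clarity)

w, b, lr = 0.0, 0.0, 0.1           # initial weights and learning rate
for step in range(500):
    i = rng.integers(len(x))       # pick one training example (stochastic)
    err = (w * x[i] + b) - y[i]    # gradient of 0.5*err**2 w.r.t. the prediction
    w -= lr * err * x[i]           # descend along the gradient for w
    b -= lr * err                  # ... and for b

print(round(w, 2), round(b, 2))    # moves toward w = 3.0, b = 1.0
```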
Convolutional Neural Networks
o Speech recognition
o Translation
o Self-driving etc.
» Convolution leverages three important ideas that can help improve a machine learning
system:
o Sparse interactions
o parameter sharing and
o equivariant representation
» Moreover, convolution provides a means for working with inputs of variable size.
Sparse interactions:
» For example:
o when processing an image, the input image might have thousands or millions of pixels, but
we can detect small, meaningful features such as edges with kernels that occupy only tens or
hundreds of pixels.
o This means that we need to store fewer parameters, which both reduces the memory
requirements of the model and improves its statistical efficiency
CNN: Motivation
» If there are m inputs and n outputs, then matrix multiplication requires m × n
parameters, and the algorithms used in practice have O(m × n) runtime (per example).
» If we limit the number of connections each output may have to k, then the sparsely
connected approach requires only k × n parameters and O(k × n) runtime.
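Plugging in example sizes shows the scale of the saving (the particular m, n, and k below are assumptions):

```python
# Parameter-count comparison: dense vs. sparse connectivity.
m, n, k = 1_000_000, 1_000, 9   # e.g. a megapixel input, 1000 outputs, a 3x3 kernel

dense_params = m * n            # fully connected: m x n parameters
sparse_params = k * n           # each output connects to only k inputs

print(dense_params)             # 1000000000
print(sparse_params)            # 9000
```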
Parameter sharing:
» It refers to using the same parameter for more than one function in a
model.
» In the case of convolution, the particular form of parameter sharing causes the
layer to have a property called equivariance to translation.
» Equivariant means that if the input changes, the output changes in the same way.
» In the case of convolution, if we let g be any function that translates the input,
i.e., shifts it, then the convolution function is equivariant to g.
CNN: Motivation
» Let g be a function mapping one image function to another image function, such that
I′ = g(I) is the image function with I′(x, y) = I(x − 1, y).
» If we apply this transformation to I and then apply convolution, the result will be the
same as if we applied convolution to I, then applied the transformation g to the
output.
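This equivariance can be checked numerically. The sketch below uses a 1-D circular convolution and circular shifts (a toy signal and kernel of my own choosing) so the comparison is exact at the boundaries:

```python
import numpy as np

# Equivariance to translation: shifting the input and then convolving
# gives the same result as convolving first and then shifting the output.
def circ_conv(signal, kernel):
    n = len(signal)
    return np.array([sum(kernel[j] * signal[(i - j) % n]
                         for j in range(len(kernel)))
                     for i in range(n)])

I = np.array([0., 1., 4., 2., 0., 3.])   # toy 1-D "image"
K = np.array([1., -1.])                  # simple edge-detecting kernel

shift_then_conv = circ_conv(np.roll(I, 1), K)
conv_then_shift = np.roll(circ_conv(I, K), 1)
print(np.allclose(shift_then_conv, conv_then_shift))  # True
```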
CNN: Motivation
» For example, when processing images, it is useful to detect edges in the first
layer of a convolutional network.
» This implies that the inputs of hidden units in layer l belong to a subset of units in
layer l-1 which have “spatially contiguous” receptive fields.
» Each unit does not respond to the variations which are outer to its receptive field
with respect to the input.
» This ensures that the learnt “filters” result in the strongest response to a spatially
local input pattern.
The Convolution Operation
If we have an n × n image matrix and an f × f filter, the general formula for the dimension
of the output image matrix is (n − f + 1) × (n − f + 1)
The Convolution Operation
For example:
» We will consider a 6 × 6 matrix and apply a 3 × 3 filter, which is a hyperparameter, over the
matrix and interpret the results.
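A minimal sketch of this computation in Python (numpy, the toy image values, and the vertical-edge filter are my assumptions; like most deep-learning libraries, it applies the filter as cross-correlation rather than flipped convolution):

```python
import numpy as np

# Valid convolution of a 6x6 input with a 3x3 filter: the output is
# (n - f + 1) x (n - f + 1) = 4 x 4, matching the formula above.
def conv2d_valid(image, kernel):
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # elementwise product of the kernel with each f x f window
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
kernel = np.array([[1., 0., -1.]] * 3)             # vertical-edge filter
print(conv2d_valid(image, kernel).shape)           # (4, 4)
```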
» In the 1st stage, the layer performs several convolutions in parallel to produce a set of
linear activations.
» In the 2nd stage, each linear activation is run through a nonlinear activation function,
such as the rectified linear activation function. This stage is sometimes called the
detector stage.
» In the 3rd stage, we use a pooling function to modify the output of the layer further.
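The three stages can be sketched on a 1-D signal (the signal, kernel, and pooling width below are invented for illustration):

```python
import numpy as np

# The three stages of a convolutional layer on a 1-D signal:
# (1) convolution, (2) detector (ReLU), (3) max pooling.
def conv1d_valid(x, k):
    return np.array([np.dot(x[i:i+len(k)], k)
                     for i in range(len(x) - len(k) + 1)])

x = np.array([0., 2., 1., 5., 0., 1., 3., 2., 4.])  # toy input signal
k = np.array([1., -1.])                             # toy kernel

a = conv1d_valid(x, k)                   # stage 1: linear activations
h = np.maximum(0, a)                     # stage 2: rectified linear (detector)
pooled = h.reshape(-1, 2).max(axis=1)    # stage 3: max pooling, width 2
print(pooled)                            # [1. 5. 0. 1.]
```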
Pooling
» A pooling function replaces the output of the net at a certain location with a summary
statistic of the nearby outputs.
» For example, the max pooling operation reports the maximum output within a rectangular
neighborhood.
» In all cases, pooling helps make the representation approximately invariant to
small translations of the input.
» Invariance to translation means that if we translate the input by a small amount, the values
of most of the pooled outputs do not change.
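A small numeric sketch of this invariance (toy detector outputs, max pooling over windows of width 3):

```python
import numpy as np

# Max pooling invariance: translate the detector outputs by one position
# and most of the pooled values stay the same.
def max_pool(x, width=3, stride=1):
    return np.array([x[i:i+width].max()
                     for i in range(0, len(x) - width + 1, stride)])

h = np.array([0., 0., 7., 0., 0., 5., 0., 0.])   # toy detector outputs
h_shift = np.roll(h, 1)                          # input translated by one step

print(max_pool(h))        # [7. 7. 7. 5. 5. 5.]
print(max_pool(h_shift))  # most pooled values unchanged
```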
Pooling
» The pooling (POOL) layer reduces the dimensions of the input, reduces computation,
and makes feature detectors more invariant (invariance implies that we can recognize an
object even when its appearance varies) to their position in the input.
» ‘Translation invariance’ means that the system produces exactly the same response,
regardless of how the input is translated.
» For example, a dog detector might report “dog identified” wherever the dog appears in the input image.
Pooling
o Average-pooling layer: slides the filter over the input with a stride s and stores the average
value of the region of the input covered by the filter in the output.
[Figure: a typical CNN architecture — the input image passes through alternating Convolution and Max Pooling layers (which can repeat many times), is then flattened, and fed into a fully connected feedforward network that outputs class scores such as “cat” and “dog”.]
Sequence Modeling: Recurrent and Recursive Nets
» Parameter sharing makes it possible to extend and apply the model to examples of
different forms (different lengths, here) and generalize across them.
» RNNs work on sequences of data, as in natural language processing (e.g. deciding whether a given
sentence is positive or negative) and time-series data (e.g. sales forecasting).
Sequence Modeling: Recurrent and Recursive Nets
o Each member of the output is a function of the previous members of the output.
o Each member of the output is produced using the same update rule applied to the previous
outputs.
» We can view RNNs as operating on a sequence that contains vectors x(t), with the time step index t
ranging from 1 to τ.
Unfolding Computational Graphs
» The idea is to transform a recurrent computation into an unfolded computational graph that
has a repetitive structure (i.e. a chain of events).
» Unfolding this graph results in the sharing of parameters across a deep network structure.
Unfolding Computational Graphs
» For example:
» For a finite number of time steps τ, the graph can be unfolded by applying
the definition τ − 1 times. For example, with τ = 3:
s(3) = f(s(2); θ) = f(f(s(1); θ); θ)
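Numerically, unfolding just means applying the same update function repeatedly (the state-update rule and values below are made up for illustration):

```python
# Unfolding s(t) = f(s(t-1); theta) for tau = 3 time steps:
# s(3) = f(s(2); theta) = f(f(s(1); theta); theta).
theta = 0.5

def f(s, theta):
    return theta * s + 1.0   # an invented state-update function

s1 = 4.0
s2 = f(s1, theta)            # t = 2
s3 = f(s2, theta)            # t = 3: same f and same theta at every step

print(s3)                    # equals f(f(s1, theta), theta)
```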
Unfolding Computational Graphs
» Each node represents the state at some time t and the function f maps the
state at t to the state at t + 1.
» The same parameters (the same value of θ used to parametrize f) are used for
all time steps.
Unfolding Computational Graphs
» Many recurrent neural networks use the following equation to define the values of their
hidden units:
h(t) = f(h(t−1), x(t); θ),
where we see that the state now contains information about the whole past sequence.
» The network typically adds extra architecture, such as output layers that read information
out of the state h to make predictions.
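A minimal forward pass implementing this update (the weight shapes, the tanh choice of f, and the toy inputs are assumptions):

```python
import numpy as np

# Minimal RNN forward pass: the same parameters (W, U, b) are reused at
# every time step, so h accumulates information about the whole past.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 3))   # hidden-to-hidden weights
U = rng.normal(scale=0.5, size=(3, 2))   # input-to-hidden weights
b = np.zeros(3)                          # hidden bias

def step(h_prev, x_t):
    # h(t) = f(h(t-1), x(t); theta), with f chosen here as tanh(affine)
    return np.tanh(W @ h_prev + U @ x_t + b)

xs = [np.array([1.0, 0.0]),              # a toy input sequence, tau = 3
      np.array([0.0, 1.0]),
      np.array([1.0, 1.0])]
h = np.zeros(3)                          # h(0)
for x_t in xs:                           # unfold over the sequence
    h = step(h, x_t)
print(h.shape)                           # (3,)
```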
Unfolding Computational Graphs
» When the recurrent network is trained to perform a task that requires predicting the future
from the past, the network typically learns to use h(t) as a lossy summary of the task-relevant
aspects of the past sequence of inputs up to t.
» In the left diagram, the black square indicates a delay of 1 time step.
» The right diagram shows the same network as an unfolded computational graph, where each node
is now associated with one particular time instance.
» We call unfolding the operation that maps a circuit, as in the left side
of the figure, to a computational graph with repeated pieces, as in the right side.
» The unfolded graph now has a size that depends on the sequence length.
Unfolding Computational Graphs
» We can represent the unfolded recurrence after t steps with a function g(t):
h(t) = g(t)(x(t), x(t−1), …, x(2), x(1)) = f(h(t−1), x(t); θ)
» The unfolded graph also helps to illustrate the idea of information flow forward in
time (computing outputs and losses) and backward in time (computing gradients)
by explicitly showing the path along which this information flows
Recurrent Neural Networks
» The computational graph to compute the training loss of a recurrent network that
maps an input sequence of x values to a corresponding sequence of output o values.
» A loss L measures how far each o is from the corresponding training target y.
Recurrent networks that produce an output at each time step and have recurrent
connections between hidden units.
Recurrent Neural Networks
» On the left, the RNN and its loss are drawn with recurrent connections.
» Keras is a deep-learning framework that provides a convenient way to define and train
almost any kind of deep-learning model.
» Keras was initially developed for researchers, with the aim of enabling fast experimentation.
o It has a user-friendly API that makes it easy to quickly prototype deep-learning models.
o It has built-in support for convolutional networks (for computer vision), recurrent networks (for
sequence processing), and any combination of both.
Introduction to Keras
» Keras is used at Google, Netflix, Uber, CERN, Yelp, Square, and hundreds of startups
working on a wide range of problems
» The TensorFlow backend is the default for most of your deep-learning needs
o It is the most widely adopted
» Via TensorFlow (or Theano, or CNTK), Keras is able to run seamlessly on both CPUs and
GPUs.
» When running on CPU, TensorFlow is itself wrapping a low-level library for tensor
operations called Eigen.
» To get started with Keras, you need to install the keras R package, the core Keras
library, as well as a backend tensor engine (e.g. TensorFlow):
o install.packages("keras") # install the keras R package
o library(keras)
o install_keras()
Developing with Keras: a quick overview
» You’ve already seen one example of a Keras model: the MNIST example.
The typical Keras workflow looks just like that example:
o Define your training data: input tensors and target tensors.
o Define a network of layers (or model) that maps your inputs to your targets.
1. Develop an ECG signal feature extractor using deep learning (the ECG features
will be P-peak, P-duration, Q-peak, R-peak, S-peak, QRS-duration, T-peak and
T-duration)