Deep Learning Theory

Deep Learning
Introduction to Neural Network

A neural network is a series of algorithms that endeavors to recognize underlying
relationships in a set of data through a process that mimics the way the human brain
operates. In this sense, neural networks refer to systems of neurons, either organic or
artificial in nature.

Each neuron is made up of a cell body (the central mass of the cell) with a number of
connections coming off it: numerous dendrites (the cell's inputs, carrying information
toward the cell body) and a single axon (the cell's output, carrying information away).
Neurons are so tiny that you could pack about 100 of their cell bodies into a single
millimeter. (It's also worth noting, briefly in passing, that neurons make up only 10–50
percent of all the cells in the brain.)
How does a neural network learn things?

Information flows through a neural network in two ways. When it's learning (being
trained) or operating normally (after being trained), patterns of information are fed into
the network via the input units, which trigger the layers of hidden units, and these in turn
arrive at the output units. This common design is called a feedforward network. Not all
units "fire" all the time. Each unit receives inputs from the units to its left, and the inputs
are multiplied by the weights of the connections they travel along. Every unit adds up all
the inputs it receives in this way and (in the simplest type of network) if the sum is more
than a certain threshold value, the unit "fires" and triggers the units it's connected to (those
on its right).
Step 1:
For each input, multiply the input value xᵢ by its weight wᵢ and sum all the products.
Weights represent the strength of the connection between neurons and decide how much
influence a given input has on the neuron's output. If the weight w₁ has a higher value
than the weight w₂, then the input x₁ will have a higher influence on the output than x₂.
The row vectors of the inputs and weights are x = [x₁, x₂, … , xₙ] and w = [w₁, w₂, … , wₙ]
respectively, and their dot product is given by
x · w = x₁w₁ + x₂w₂ + … + xₙwₙ
Step 2:
Add the bias b to the sum of the multiplied values and call the result z. The bias, also
known as the offset, is needed in most cases to move the entire activation function to the
left or right so that it can generate the required output values: z = x · w + b
Step 3:
Pass the value of z to a non-linear activation function. Activation functions are used to
introduce non-linearity into the output of the neurons, without which the neural network
would just be a linear function. Moreover, they have a significant impact on the learning
speed of the neural network. Perceptrons use a binary step function as their activation
function. However, we shall use the sigmoid, also known as the logistic function, as our
activation function.
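The three steps above can be sketched in a few lines of NumPy. This is only an illustration, not code from the original notes: the helper names `sigmoid` and `neuron_output` and the sample values are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b):
    """Steps 1-3 for a single neuron (hypothetical helper)."""
    z = np.dot(x, w) + b      # Step 1 (weighted sum) and Step 2 (add bias)
    return sigmoid(z)         # Step 3 (non-linear activation)

# Illustrative values only
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25, 0.0])
b = 0.0
a = neuron_output(x, w, b)    # z = 0.5 - 0.5 + 0.0 = 0, so a = sigmoid(0) = 0.5
print(a)
```

Changing `b` shifts z, and therefore the point at which the neuron's output crosses 0.5, without touching the inputs.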
Bias
This means when calculating the output of a node, the inputs are multiplied by weights, and
a bias value is added to the result. The bias value allows the activation function to be
shifted to the left or right, to better fit the data. Hence changes to the weights alter the
steepness of the sigmoid curve, whilst the bias offsets it, shifting the entire curve so it fits
better. Note also how the bias only influences the output values, it doesn’t interact with the
actual input data.
link: https://towardsdatascience.com/why-we-need-bias-in-neural-networks-db8f7e07cb98
Activation Function
An activation function is simply the function used to compute the output of a node. It is
also known as a transfer function.
It determines the output of the neural network, such as yes or no, by mapping the resulting
values into a range such as 0 to 1 or -1 to 1, depending on the function.
The Activation Functions can be basically divided into 2 types-
1. Linear Activation Function
2. Non-linear Activation Functions
1. Linear or Identity Activation Function

As you can see the function is a line or linear. Therefore, the output of the functions will not
be confined between any range.
Equation : f(x) = x
Range : (-infinity to infinity)
It doesn’t help with the complexity or various parameters of usual data that is fed to the
neural networks.
2. Non-linear Activation Function

Nonlinear activation functions are the most widely used activation functions. A nonlinear
function produces a curved graph rather than a straight line.
Derivative or differential: the change along the y-axis with respect to the change along the x-axis. It is also known as the slope.
Monotonic function: a function that is either entirely non-increasing or entirely non-decreasing.
2.1 Sigmoid or Logistic Activation Function

The sigmoid function curve is S-shaped.
The main reason we use the sigmoid function is that its output lies between 0 and 1.
Therefore, it is especially used for models where we have to predict a probability as the
output. Since a probability only takes values between 0 and 1, sigmoid is a natural
choice. The derivative of the sigmoid function lies between 0 and 0.25.
2.2 Tanh or Hyperbolic Tangent Activation Function

tanh is similar to the logistic sigmoid, but its output is zero-centered. The range of the tanh function is (-1, 1).
tanh is also sigmoidal (S-shaped). The derivative of tanh lies between 0 and 1, reaching 1 only at z = 0.
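The derivative ranges quoted for sigmoid and tanh can be checked numerically. The sketch below is an illustration added to these notes; the function names and the evaluation grid are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_derivative(z):
    # d/dz tanh(z) = 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

z = np.linspace(-10.0, 10.0, 2001)   # grid that contains z = 0
sig_d = sigmoid_derivative(z)
tanh_d = tanh_derivative(z)
print(sig_d.max(), tanh_d.max())     # both peaks occur at z = 0
```

The sigmoid derivative never exceeds 0.25, while the tanh derivative peaks at 1, which is one reason tanh gradients shrink more slowly across layers.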
2.3 ReLU (Rectified Linear Unit) Activation Function
ReLU is currently one of the most widely used activation functions, since it is used in
almost all convolutional neural networks and deep learning models. Its derivative takes
only two values: 1 when z > 0 and 0 when z < 0.
As you can see, the ReLU is half rectified (from the bottom). f(z) is zero when z is less than zero
and f(z) is equal to z when z is above or equal to zero.
Range: [0, infinity)
The function and its derivative are both monotonic.
Differentiation is used in almost every part of Machine Learning and Deep Learning.
##### Leaky ReLU and dead neurons:
If z < 0, the derivative of ReLU is 0, so the newly calculated weight equals the old weight.
A neuron stuck in this state is called a dead neuron. To fix this we use Leaky ReLU.
This is known as the dying ReLU problem: some ReLU neurons essentially die for all inputs
and remain inactive no matter what input is supplied. No gradient flows through them, and
if a neural network contains a large number of dead neurons its performance suffers. Leaky
ReLU corrects this by giving the function a small non-zero slope to the left of x = 0,
which creates a "leak" and extends the range of ReLU to negative values.
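A minimal NumPy sketch of ReLU and Leaky ReLU together with their derivatives. The slope `alpha=0.01` for negative inputs is an assumed, commonly used value, and the function names are chosen for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 1 where z > 0, otherwise 0 (the value at exactly z = 0 is a convention)
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # alpha is the small slope used for negative inputs (assumed value)
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), relu_grad(z))
print(leaky_relu(z), leaky_relu_grad(z))
```

Because `leaky_relu_grad` is never exactly zero, a weight update always receives some gradient, which is what keeps a neuron from dying.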
ELU
SoftMax
Softmax is a very interesting activation function because it not only maps our output to a
[0,1] range but also maps each output in such a way that the total sum is 1. The output of
Softmax is therefore a probability distribution.
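As a small illustration of the property described above, the sketch below implements softmax in NumPy. The function name and the sample scores are assumptions; subtracting the maximum is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)   # avoids overflow in exp; output is unchanged
    e = np.exp(shifted)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # illustrative raw outputs
probs = softmax(scores)
print(probs, probs.sum())
```

Every entry of `probs` lies between 0 and 1, the entries sum to 1, and the largest score keeps the largest probability.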
Optimizers
link: https://ruder.io/optimizing-gradient-descent/
link: https://heartbeat.fritz.ai/exploring-optimizers-in-machine-learning-7f18d94cd65b
Optimizers tie the loss function and the model parameters together by updating the model,
i.e. the weights and biases of each node, based on the output of the loss function.
Epoch:
One epoch is when the ENTIRE dataset is passed forward and backward through the neural
network exactly ONCE.
Iteration:
An iteration is one pass of a single batch of data through the algorithm. In the case of
neural networks, that means one forward pass and one backward pass. So, every time you
pass a batch of data through the network, you have completed an iteration.
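The relationship between epochs, batches and iterations is simple arithmetic. The numbers below are made up for illustration, and the sketch assumes the final, smaller batch is also processed.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    # The last batch may be smaller than batch_size, so round up
    return math.ceil(n_samples / batch_size)

n_samples, batch_size, epochs = 1000, 32, 10    # illustrative values
per_epoch = iterations_per_epoch(n_samples, batch_size)
total_iterations = per_epoch * epochs
print(per_epoch, total_iterations)
```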
Gradient Descent
Gradient descent is an iterative optimization algorithm. It relies on the derivatives
of the loss function to find minima. Running the algorithm for many iterations and
epochs helps it reach the global minimum (or get close to it).
• Gradient descent considers all data points when calculating the error (like a whole
population: with 1000 data points, all 1000 are used for every update)
• It requires large computational power for large data sets
Stochastic Gradient Descent (SGD)

SGD randomly picks one data point from the whole data set at each iteration, which greatly
reduces the computation per update.
• The number of iterations is larger for large data sets, which increases the total
training time
Mini-batch Stochastic Gradient Descent
Sampling a small number of data points instead of just one point at each step is
called “mini-batch” gradient descent. Mini-batch tries to strike a balance between the
goodness of gradient descent and the speed of SGD.
• Because of their zig-zag movement, mini-batch and stochastic gradient descent produce
noisy updates; momentum is introduced to overcome this
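The three variants differ only in how many data points feed each update. The sketch below fits a one-feature linear model with a mean-squared-error loss; the function `train_linear`, the synthetic data and the learning rates are all assumptions made for this illustration.

```python
import numpy as np

def train_linear(X, y, batch_size, lr=0.1, epochs=200, seed=0):
    """Gradient descent on MSE; batch_size selects the variant."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):      # one iteration per batch
            idx = order[start:start + batch_size]
            err = X[idx] @ w + b - y[idx]
            w -= lr * 2.0 * X[idx].T @ err / len(idx)
            b -= lr * 2.0 * err.mean()
    return w, b

# Synthetic, noise-free data: y = 3x + 1
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 1.0

w_batch, b_batch = train_linear(X, y, batch_size=len(X))                # gradient descent
w_sgd, b_sgd = train_linear(X, y, batch_size=1, lr=0.01, epochs=20)     # SGD
w_mini, b_mini = train_linear(X, y, batch_size=32, lr=0.05, epochs=100) # mini-batch
print(w_batch, b_batch, w_sgd, b_sgd, w_mini, b_mini)
```

With `batch_size=len(X)` there is one update per epoch, with `batch_size=1` there are 200, and with `batch_size=32` there are 7, which is the trade-off between cost per update and number of updates described above.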
SGD with momentum

• Noise is smoothed out with the help of an exponential moving average of the gradients
The exponential moving average (EMA) is a type of weighted moving average (WMA) that
gives more weight to recent values than to older ones. The term comes from finance, where
the EMA is a chart indicator that tracks the price of an investment (like a stock or
commodity) over time; in SGD with momentum the same averaging is applied to the gradients.
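A sketch of how an exponential moving average smooths noisy gradients, and of one possible momentum update built on it. The formulation `v = beta * v + (1 - beta) * grad` is one common convention (some libraries use `v = beta * v + grad` instead); the names and `beta=0.9` are assumptions for this example.

```python
import numpy as np

def ema(values, beta=0.9):
    """Exponential moving average: recent values get more weight."""
    avg, out = 0.0, []
    for v in values:
        avg = beta * avg + (1.0 - beta) * v
        out.append(avg)
    return np.array(out)

def sgd_momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    # velocity is the EMA of past gradients; the weight follows the velocity
    velocity = beta * velocity + (1.0 - beta) * grad
    return w - lr * velocity, velocity

rng = np.random.default_rng(0)
noisy_grads = 1.0 + rng.normal(scale=0.5, size=500)   # true gradient 1.0 plus noise
smoothed = ema(noisy_grads)
print(noisy_grads[100:].std(), smoothed[100:].std())

w_new, v_new = sgd_momentum_step(w=1.0, grad=2.0, velocity=0.0)
```

The smoothed series fluctuates far less than the raw gradients, which is the noise reduction that momentum provides.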
Adagrad Optimizer
Adaptive Gradient Algorithm (Adagrad) is an algorithm for gradient-based optimization. ...
It performs smaller updates for parameters whose gradients have been large or frequent. As a
result, it is well suited to sparse data (NLP or image recognition). Each parameter has its
own learning rate, which improves performance on problems with sparse gradients.
Adagrad uses a different learning rate at every iteration, so the learning rate is
recomputed at every iteration.
dense:
most of the features are non-zero
sparse:
most of the features are zero
Advantages of Using AdaGrad

• It eliminates the need to manually tune the learning rate
• Convergence is faster and more reliable than simple SGD when the scaling of the
weights is unequal
• It is not very sensitive to the size of the master step
• One disadvantage is that the accumulated squared-gradient term (alpha t) can become
very large as the number of iterations increases, which makes the effective learning rate
very small
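A minimal sketch of the Adagrad update on a toy quadratic loss, showing how the per-parameter learning rate shrinks as squared gradients accumulate. The function name, the loss f(w) = w₁² + w₂² and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    accum = accum + grad ** 2                   # running sum of squared gradients
    w = w - lr * grad / (np.sqrt(accum) + eps)  # per-parameter scaled step
    return w, accum

w = np.array([1.0, 1.0])
accum = np.zeros(2)
effective_lr = []
for _ in range(100):
    grad = 2.0 * w                              # gradient of w1^2 + w2^2
    w, accum = adagrad_step(w, grad, accum)
    effective_lr.append(0.1 / (np.sqrt(accum[0]) + 1e-8))

print(effective_lr[0], effective_lr[-1])
```

The effective learning rate only ever decreases, which illustrates both the automatic tuning and the risk of the step size becoming too small late in training.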
AdaDelta
Adadelta is a more robust extension of Adagrad that adapts learning rates based on a
moving window of gradient updates, instead of accumulating all past gradients. Compared
to Adagrad, in the original version of Adadelta you don't have to set an initial learning rate.
RMSprop
RMSprop is a gradient based optimization technique used in training neural networks. ...
This normalization balances the step size (momentum), decreasing the step for large
gradients to avoid exploding, and increasing the step for small gradients to avoid vanishing.
ADAM Optimizer
Stochastic gradient descent vs gradient descent vs mini-batch
• Gradient descent moves toward the minimum along a smooth, direct path
• Stochastic and mini-batch gradient descent take longer to approach the minimum because
they follow a zig-zag path
Global minima & Local minima
A local minimum of a function is a point where the function value is smaller than at nearby
points, but possibly greater than at a distant point. A global minimum is a point where the
function value is smaller than at all other feasible points.
In some loss functions we encounter both local minima and a global minimum.
Convex function and non-convex function

• A convex function has only a global minimum
• A non-convex function can have both a global minimum and local minima
Chain rule
The chain rule is used to compute the derivatives needed to readjust the weights during backpropagation.
Vanishing Gradient Problem
In machine learning, the vanishing gradient problem is encountered when training artificial
neural networks with gradient-based learning methods and backpropagation.
The derivative of the sigmoid function always lies between 0 and 0.25. Because of that, when
the network has many layers the new weights are almost equal to the old weights. This is
the cause of the vanishing gradient problem and the main reason we do not use the sigmoid
activation function in all layers. It can be overcome by using ReLU (rectified linear unit).
Certain activation functions, like the sigmoid function, squash a large input space into a
small output range between 0 and 1. Therefore, a large change in the input of the sigmoid
function will cause a small change in the output. Hence, the derivative becomes small.
However, when n hidden layers use an activation like the sigmoid function, n small
derivatives are multiplied together. Thus, the gradient decreases exponentially as we
propagate down to the initial layers. A small gradient means that the weights and biases of
the initial layers will not be updated effectively with each training session. Since these
initial layers are often crucial to recognizing the core elements of the input data, it can lead
to overall inaccuracy of the whole network.
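The exponential decay can be made concrete with a rough upper bound. The sketch below multiplies the largest possible sigmoid derivative (0.25) once per layer; it deliberately ignores the weights, which also enter the real gradient, so it is an illustration of the trend rather than an exact gradient calculation.

```python
MAX_SIGMOID_DERIVATIVE = 0.25

def gradient_upper_bound(n_layers):
    # Product of n_layers sigmoid derivatives, each at most 0.25
    return MAX_SIGMOID_DERIVATIVE ** n_layers

bounds = {n: gradient_upper_bound(n) for n in (1, 5, 10, 20)}
for n, bound in bounds.items():
    print(f"{n:2d} sigmoid layers -> derivative product <= {bound:.3e}")
```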
Exploding gradient problem

Exploding gradients are a problem where large error gradients accumulate and result in
very large updates to neural network model weights during training. This has the effect of
your model being unstable and unable to learn from your training data.
Dropout layers & Regularization

Dropout may be implemented on any or all hidden layers in the network as well as the
visible or input layer. It is not used on the output layer. The term “dropout” refers to
dropping out units (hidden and visible) in a neural network. — Dropout: A Simple Way to
Prevent Neural Networks from Overfitting.
There are 2 ways to overcome overfitting:
1. Regularization, 2. Dropout
Dropout: before applying dropout we select a dropout ratio p (a probability between 0 and
1). On each iteration a random subset of the input features and of the hidden-layer units is
selected according to p.
The selected features and units are deactivated, and the rest of the process is unchanged:
the new weights are calculated in the 1st iteration. In the 2nd iteration a new random
subset is selected according to p, and so on. Dropout is applied only to the training data.
For test data no units are dropped; instead, the weights obtained from training are scaled
by the probability that a unit was kept (1 - p when p is the dropout ratio).
How to select p: p is treated as a hyperparameter and chosen by hyperparameter tuning
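A minimal sketch of dropout on a vector of activations. It assumes p is the probability of dropping a unit, so units are kept with probability 1 - p during training and activations are scaled by 1 - p at test time; the function names and the value p = 0.3 are chosen for illustration.

```python
import numpy as np

def dropout_train(activations, p, rng):
    # Each unit is kept with probability 1 - p and zeroed otherwise
    mask = rng.random(activations.shape) >= p
    return activations * mask

def dropout_test(activations, p):
    # No units are dropped; scale to match the expected training-time output
    return activations * (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones(100_000)
p = 0.3
train_mean = dropout_train(a, p, rng).mean()
test_mean = dropout_test(a, p).mean()
print(train_mean, test_mean)
```

Both means are close to 0.7, which shows why the test-time scaling keeps the expected activation consistent with training.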
Weights initialization Techniques

• Weights should be small
• Weights should not all be the same
• Weights should have good variance
No single technique gives the best results in every case; the choice has to be made by experiment.
1. Uniform distribution
2. Xavier/Glorot
a. Xavier normal
b. Xavier uniform
3. He init
a. He uniform
b. He normal
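For reference, the uniform variants can be sketched in NumPy as below. The limits used are the standard ones, sqrt(6 / (fan_in + fan_out)) for Glorot/Xavier uniform and sqrt(6 / fan_in) for He uniform; the function names and layer sizes are assumptions for this example.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_uniform(fan_in, fan_out, rng):
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
w_glorot = glorot_uniform(11, 6, rng)   # e.g. 11 inputs feeding 6 hidden units
w_he = he_uniform(11, 6, rng)
print(np.abs(w_glorot).max(), np.abs(w_he).max())
```

Both schemes give small, distinct weights whose spread depends on the layer size, which matches the three requirements listed above.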
Types of loss function (also called cost function or error function)
1. Regression
1.1 MSE (mean squared error):
Advantages:
1. It has the form of a quadratic equation, ax^2 + bx + c
2. Plotting the quadratic equation gives a curve with only one global minimum and no
local minima, which suits gradient descent
Disadvantage:
It is not robust to outliers
1.2 Absolute error loss (MAE)

MAE takes the absolute value of the error.
Advantage:
1. More robust to outliers than MSE
Disadvantage:
1. MAE is not differentiable at zero, so optimizing it is harder and takes more time
2. It may have local minima

1.3 Huber Loss:
Huber loss is a loss function used in robust regression that is less sensitive to outliers in
data than the squared error loss. A variant for classification is also sometimes used. It is
a combination of MAE and MSE.
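The three regression losses can be compared on a small example containing one outlier. This is an illustrative sketch; the function names, the threshold `delta=1.0` and the sample values are assumptions.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2                 # MSE-like for small errors
    linear = delta * (err - 0.5 * delta)       # MAE-like for large errors
    return np.mean(np.where(err <= delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 14.0])       # last prediction is an outlier
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
```

The single outlier dominates MSE (25.125) far more than MAE (2.75) or Huber (2.4375), which is the robustness difference described above.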
2. Classification
https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451
https://keras.io/api/losses/
2.1 Cross Entropy:
Cross-entropy loss, or log loss, measures the performance of a classification model whose
output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted
probability diverges from the actual label. In this binary form it applies only to binary classification.
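A short NumPy sketch of binary cross-entropy, showing that the loss grows as predictions move away from the true labels. The function name, the clipping constant `eps` and the sample values are assumptions for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    p = np.clip(p_pred, eps, 1.0 - eps)        # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

y = np.array([1.0, 0.0, 1.0])
good = binary_cross_entropy(y, np.array([0.9, 0.1, 0.8]))   # close to the labels
bad = binary_cross_entropy(y, np.array([0.2, 0.9, 0.3]))    # far from the labels
print(good, bad)
```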
2.2 Multi Class Cross Entropy Loss:

How to train neural network on back propagation
how to train multi layer neural network

Multilayer networks solve the classification problem for non linear sets by employing
hidden layers, whose neurons are not directly connected to the output. The additional
hidden layers can be interpreted geometrically as additional hyper-planes, which enhance
the separation capacity of the network.
ANN
Create Artificial Neural Network using Weights initialization Tricks

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('bank.csv')
x = data.iloc[:, 3:13]
y = data.iloc[:,13]
geography = pd.get_dummies(x['Geography'], drop_first = True)

gender = pd.get_dummies(x['Gender'], drop_first = True)
data.head()
   RowNumber  CustomerId   Surname  CreditScore Geography  Gender  Age  \
0          1    15634602  Hargrave          619    France  Female   42
1          2    15647311      Hill          608     Spain  Female   41
2          3    15619304      Onio          502    France  Female   42
3          4    15701354      Boni          699    France  Female   39
4          5    15737888  Mitchell          850     Spain  Female   43

   Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  \
0       2       0.00              1          1               1
1       1   83807.86              1          0               1
2       8  159660.80              3          1               0
3       1       0.00              2          0               0
4       2  125510.82              1          1               1

   EstimatedSalary  Exited
0        101348.88       1
1        112542.58       0
2        113931.57       1
3         93826.63       0
4         79084.10       0
geography.head()
Germany Spain
0 0 0
1 0 1
2 0 0
3 0 0
4 0 1
gender.head()
Male
0 0
1 0
2 0
3 0
4 0
x = pd.concat([x,geography,gender],axis = 1)
x.head()
   CreditScore Geography  Gender  Age  Tenure    Balance  NumOfProducts  \
0          619    France  Female   42       2       0.00              1
1          608     Spain  Female   41       1   83807.86              1
2          502    France  Female   42       8  159660.80              3
3          699    France  Female   39       1       0.00              2
4          850     Spain  Female   43       2  125510.82              1

   HasCrCard  IsActiveMember  EstimatedSalary  Germany  Spain  Male
0          1               1        101348.88        0      0     0
1          0               1        112542.58        0      1     0
2          1               0        113931.57        0      0     0
3          0               0         93826.63        0      0     0
4          1               1         79084.10        0      1     0
x = x.drop(['Geography','Gender'], axis = 1)
x.head()
CreditScore Age Tenure Balance NumOfProducts HasCrCard \

0 619 42 2 0.00 1 1
1 608 41 1 83807.86 1 0
2 502 42 8 159660.80 3 1
3 699 39 1 0.00 2 0
4 850 43 2 125510.82 1 1
IsActiveMember EstimatedSalary Germany Spain Male

0 1 101348.88 0 0 0
1 1 112542.58 0 1 0
2 0 113931.57 0 0 0
3 0 93826.63 0 0 0
4 1 79084.10 0 1 0
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y, test_size = 0.3,
random_state = 0 )
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
importing deep learning libraries

import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LeakyReLU,PReLU,ELU
from keras.layers import Dropout
initializing the ANN
classifier = Sequential()
Adding the input layer and 1st Hidden layer

classifier.add(Dense(units = 6, kernel_initializer = 'he_uniform',
                     activation = 'relu', input_dim = 11))
• In the above code, units = 6 is the number of hidden neurons
• kernel_initializer = 'he_uniform' is the weight initialization technique
• activation = 'relu' is the ReLU activation function
• input_dim = 11 is the number of input features
Adding 2nd Hidden Layer

classifier.add(Dense(units = 6, kernel_initializer = 'he_uniform',
activation = 'relu'))
Adding output layer

classifier.add(Dense(units = 1, kernel_initializer = 'glorot_uniform',
activation = 'sigmoid'))
• In the above code, units = 1 indicates one output value
Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy',
                   metrics = ['accuracy'])
• optimizer = 'adam' specifies which optimization algorithm we are using
• loss = 'binary_crossentropy' is used because the output is 0 or 1; for multiple output
classes a different loss function is used (see the Keras documentation for more info)
Fitting the ANN to the training data

batch size = number of data points processed per weight update
epochs = number of full passes over the training data
# fit() returns a History object (per-epoch loss and metrics), not the model itself
history = classifier.fit(x_train, y_train, validation_split = 0.33,
                         batch_size = 10, epochs = 100)
# summary() is a method of the model; calling it on the History object raises
# AttributeError: 'History' object has no attribute 'summary'
classifier.summary()
y_pred = classifier.predict(x_test)
y_pred = (y_pred > 0.5)
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test,y_pred)
# accuracy
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)  # signature is (y_true, y_pred)
score
0.8596666666666667
How to select hidden layers and hidden neurons in an ANN using Keras Tuner
link: https://www.tensorflow.org/tutorials/keras/keras_tuner
Hyperparameters
• How many hidden layers should we have?
• How many neurons should we have in each hidden layer?
• Learning rate
import pandas as pd
import numpy as np
import tensorflow
from tensorflow import keras
from tensorflow.keras import layers
from kerastuner.tuners import RandomSearch
df = pd.read_csv('Real.csv')
df.head()
T TM Tm SLP H VV V VM PM 2.5
0 7.4 9.8 4.8 1017.6 93.0 0.5 4.3 9.4 219.720833
1 7.8 12.7 4.4 1018.5 87.0 0.6 4.4 11.1 182.187500
2 6.7 13.4 2.4 1019.4 82.0 0.6 4.8 11.1 154.037500
3 8.6 15.5 3.3 1018.7 72.0 0.8 8.1 20.6 223.208333
4 12.4 20.9 4.4 1017.3 61.0 1.3 8.7 22.2 200.645833
x = df.iloc[:,:-1]
y = df.iloc[:,-1]
def model_builder(hp):
    # Regression model for the PM 2.5 target: the tuner objective used with this
    # builder is 'val_mean_absolute_error', so the model must track that metric
    model = keras.Sequential()
    model.add(keras.Input(shape=(x.shape[1],)))  # one input per feature column
    # Tune the number of units in the first Dense layer
    # Choose an optimal value between 32-512
    hp_units = hp.Int('units', min_value=32, max_value=512, step=32)
    model.add(keras.layers.Dense(units=hp_units, activation='relu'))
    model.add(keras.layers.Dense(1))  # single continuous output
    # Tune the learning rate for the optimizer
    # Choose an optimal value from 0.01, 0.001, or 0.0001
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
                  loss='mean_absolute_error',
                  metrics=['mean_absolute_error'])
    return model
# Note: the keyword is executions_per_trial; the misspelling executions_per_trails raises
# TypeError: __init__() got an unexpected keyword argument 'executions_per_trails'
tuner = RandomSearch(model_builder,
                     objective='val_mean_absolute_error',
                     max_trials = 5,
                     executions_per_trial = 3,
                     directory='projects',
                     project_name='intro_to_kt')
CNN (Convolutional Neural Network)

https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
• Used for inputs in the form of images and for live video processing
cerebral cortex:
The cerebral cortex is the largest site of neural integration in the central nervous system. It
plays a key role in attention, perception, awareness, thought, memory, language, and
consciousness.
visual cortex:
The visual cortex of the brain is the area of the cerebral cortex that processes visual
information. It is located in the occipital lobe. Sensory input originating from the eyes
travels through the lateral geniculate nucleus in the thalamus and then reaches the visual
cortex.
It has different areas, V1, V2, V3, V4, V5 and V6, each of which plays an important role.
• V1 is responsible for finding the edges of the image and passes its output on to V2, V3, V4 and V5.
• V2 is responsible for orientation, spatial frequency, and colour.
• V3 seems to play a role in processing motion.
• V4 is mainly used for face recognition.
• V5 handles perceiving motion and processing of complex stimuli.
• V6 responds to visual stimuli associated with self-motion and wide-field stimulation.
Each area has a specific function for extracting information from images. In a CNN we
implement layers in a similar way to process images.
What is convolution:
In mathematics (in particular, functional analysis), convolution is a mathematical operation
on two functions (f and g) that produces a third function (f ∗ g) that expresses how the shape of
one is modified by the other. The term convolution refers to both the result function and to
the process of computing it.
Example: in a CNN we multiply the image with a filter (kernel) element-wise and sum the
results to extract information. The filter may be, for example, a horizontal edge filter or
a vertical edge filter.
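The filtering step can be sketched with plain NumPy loops. This illustration slides the kernel without flipping it, which is the cross-correlation form that CNN libraries typically compute; the function name, the 6x6 image and the vertical-edge kernel are assumptions for the example.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply the window with the kernel, then sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Dark left half, bright right half: one vertical edge in the middle
image = np.array([[0, 0, 0, 10, 10, 10]] * 6, dtype=float)
vertical_edge = np.array([[1, 0, -1]] * 3, dtype=float)
edges = conv2d_valid(image, vertical_edge)
print(edges)
```

The output is non-zero only in the columns where the window straddles the dark-to-bright boundary, so the filter has located the vertical edge.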
What is Padding & Stride:

Padding is a term relevant to convolutional neural networks; it refers to the number of
pixels added to an image when it is being processed by the kernel of a CNN. For example, if
the padding in a CNN is zero padding, then every pixel value that is added will be of value
zero.
Stride denotes how many positions the filter moves at each step of the convolution. By
default it is one. With a stride of 1 and no padding, the output is smaller than the input.
To keep the output the same size as the input, we use padding. Padding is the process of
adding zeros to the input matrix symmetrically.
Use of padding:
Padding is simply the process of adding layers of zeros around our input images so as to
avoid the shrinking described above. If p is the number of layers of zeros
added to the border of the image, then our (n x n) image becomes an (n + 2p) x (n + 2p) image
after padding.
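The effect of padding and stride on the output size follows the standard formula floor((n + 2p - f) / s) + 1 for an n x n input, an f x f filter, padding p and stride s. The helper below is a small illustration; its name is an assumption.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output width/height of a convolution over an n x n input."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(128, 3))        # 3x3 filter, no padding: output shrinks
print(conv_output_size(128, 3, p=1))   # one layer of zeros keeps the size
print(conv_output_size(7, 3, s=2))     # stride 2 roughly halves the size
```

For a 128 x 128 input and a 3 x 3 filter with no padding the formula gives 126, and applying it again gives 124.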
Max Pooling:
Max pooling is a sample-based discretization process. The objective is to down-sample an
input representation (image, hidden-layer output matrix, etc.), reducing its dimensionality
and allowing for assumptions to be made about features contained in the sub-regions
binned.
Use of max pooling:

Max pooling selects the brighter pixels from the image. It is useful when the background of
the image is dark and we are interested in only the lighter pixels of the image
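A minimal sketch of 2x2 max pooling with stride 2 in NumPy. The function name and the sample feature map are assumptions made for this illustration.

```python
import numpy as np

def max_pool_2x2(x):
    """Keep the largest value in each non-overlapping 2x2 block."""
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]              # drop an odd last row/column
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 1],
                        [0, 2, 9, 8],
                        [1, 1, 7, 5]], dtype=float)
pooled = max_pool_2x2(feature_map)
print(pooled)
```

The 4x4 input is reduced to 2x2, and each output value is the brightest pixel of its block.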
Data Augmentation in CNN:
link: https://machinelearningmastery.com/how-to-configure-image-data-augmentation-when-training-deep-learning-neural-networks/
Data augmentation is a technique to artificially create new training data from existing
training data. This is done by applying domain-specific techniques to examples from the
training data that create new and different training examples.
Creating a data set using Data Augmentation
import pandas as pd
import numpy as np
import keras
from keras.preprocessing.image import (ImageDataGenerator, array_to_img,
                                       img_to_array, load_img)
datagen = ImageDataGenerator(rotation_range=40, width_shift_range=0.2,
                             height_shift_range=0.2, zoom_range=0.2,
                             channel_shift_range=0.2, horizontal_flip=True,
                             fill_mode='nearest')
img = load_img('hero1.png')  # load image
# img.show()  # see image
x = img_to_array(img)
x = x.reshape((1,) + x.shape)  # add a batch dimension
i = 0
# datagen.flow() loops forever, so stop once 21 augmented images have been saved
for batch in datagen.flow(x, batch_size=1, save_to_dir='Preview',
                          save_prefix='cat', save_format='jpeg'):
    i += 1
    if i > 20:
        break
Creating CNN Model and Optimize using keras tuner

import pandas as pd
import numpy as np
import os
import tensorflow as tf
import cv2
import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization
from keras.callbacks import ModelCheckpoint
from keras.models import Sequential, model_from_json
import kerastuner as kt
import PIL
import theano
link: https://vijayabhaskar96.medium.com/tutorial-on-keras-flow-from-dataframe-1fd4493d237c (loading image data from different files)
link: https://github.com/keras-team/keras-tuner
link: https://keras-team.github.io/keras-tune
# loading the images
df = pd.read_csv(r'C:\Users\tharu\OneDrive\Desktop\DataScience\Deep-Learning\emrgency vehicle data\train_SOaYf6m\labels.csv', dtype=str)
# data type must be str, list or tuple
datagen = ImageDataGenerator(rescale=1./255., validation_split=0.25)
data_train = datagen.flow_from_dataframe(
    dataframe=df,
    directory=r'C:\Users\tharu\OneDrive\Desktop\DataScience\Deep-Learning\emrgency vehicle data\train_SOaYf6m\images',
    x_col="id",
    y_col="label",
    subset='training',
    batch_size=32,
    target_size=(128, 128),  # resize all images to a single size
    color_mode='rgb',        # needed for color images
    seed=42,
    shuffle=True,
    class_mode="categorical")
Found 1235 validated image filenames belonging to 2 classes.
data_test = datagen.flow_from_dataframe(
    dataframe=df,
    directory=r"C:\Users\tharu\OneDrive\Desktop\DataScience\Deep-Learning\emrgency vehicle data\train_SOaYf6m\images",
    x_col="id",
    y_col="label",
    subset="validation",
    batch_size=32,
    target_size=(128, 128),
    color_mode='rgb',
    seed=42,
    shuffle=True,
    class_mode="categorical")  # the call was cut off here; class_mode assumed to match data_train
Building Model
create model
The model type that we will be using is Sequential. Sequential is the easiest way to build a
model in Keras. It allows you to build a model layer by layer.
model = Sequential()
adding layers
We use the ‘add()’ function to add layers to our model.
• Our first 2 layers are Conv2D layers. These are convolution layers that will deal with
our input images, which are seen as 2-dimensional matrices.
• 64 in the first layer and 32 in the second layer are the number of filters in each
layer. This number can be adjusted to be higher or lower, depending on the size of the
dataset. In our case, 64 and 32 work well, so we will stick with this for now.
• Kernel size is the size of the filter matrix for our convolution. So a kernel size of 3
means we will have a 3x3 filter matrix.
• Activation is the activation function for the layer. The activation function we will be
using for our first 2 layers is the ReLU, or Rectified Linear Activation. This activation
function works well in practice in neural networks.
• Our first layer also takes in an input shape. This is the shape of each input image,
128,128,3, with the 3 signifying that the images are RGB colour images.
• In between the Conv2D layers and the dense layer, there is a ‘Flatten’ layer. Flatten
serves as a connection between the convolution and dense layers.
• ‘Dense’ is the layer type we will use for our output layer. Dense is a standard layer
type that is used in many cases for neural networks.
• We will have 2 nodes in our output layer, one for each class. The
activation is ‘softmax’. Softmax makes the output sum up to 1 so the output can be
interpreted as probabilities. The model will then make its prediction based on which
option has the highest probability.
model.add(Conv2D(64, padding='valid', kernel_size=3, activation='relu',
                 input_shape=(128, 128, 3)))
model.add(Conv2D(32, kernel_size=3, activation='relu'))
model.add(Flatten())
model.add(Dense(2, activation='softmax'))
Compile the model, using accuracy to measure model performance

model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
train the model

STEP_SIZE_TRAIN = data_train.n//data_train.batch_size
STEP_SIZE_VALID = data_test.n//data_test.batch_size
model.fit(x = data_train,steps_per_epoch=STEP_SIZE_TRAIN,
validation_data = data_test,validation_steps = STEP_SIZE_VALID,
epochs=10
)
Epoch 1/10
38/38 [==============================] - 119s 3s/step - loss: 2.0871 -
accuracy: 0.5894 - val_loss: 0.5837 - val_accuracy: 0.7005
Epoch 2/10
38/38 [==============================] - 33s 858ms/step - loss: 0.4566
- accuracy: 0.7858 - val_loss: 0.6593 - val_accuracy: 0.7448
Epoch 3/10
38/38 [==============================] - 34s 908ms/step - loss: 0.2640
Epoch 4/10
38/38 [==============================] - 38s 989ms/step - loss: 0.1519
Epoch 5/10
38/38 [==============================] - 33s 864ms/step - loss: 0.0367
Epoch 6/10
38/38 [==============================] - 34s 903ms/step - loss: 0.0156
Epoch 7/10
38/38 [==============================] - 34s 887ms/step - loss: 0.0274
Epoch 8/10
38/38 [==============================] - 31s 812ms/step - loss: 0.0261
Epoch 9/10
38/38 [==============================] - 31s 806ms/step - loss: 0.0140
Epoch 10/10
38/38 [==============================] - 31s 810ms/step - loss: 0.0162
<tensorflow.python.keras.callbacks.History at 0x15435a28d60>
import cv2
img = cv2.imread(r'C:\Users\tharu\OneDrive\Desktop\DataScience\Deep-
Learning\emrgency vehicle data\train_SOaYf6m\vehicle.jpg',1)
width = 128
height = 128
dim = (width, height)
resized = cv2.resize(img, dim, interpolation = cv2.INTER_AREA)
x_val = np.array(resized) / 255

x_val = x_val.reshape(-1, 128, 128, 3)
predicting given image is emergency vehicle or not

pred = model.predict(x_val)
print(pred)
[[0.99537885 0.0046212 ]]
img1 = cv2.imread(r'C:\Users\tharu\OneDrive\Desktop\DataScience\Deep-
Learning\emrgency vehicle data\train_SOaYf6m\car.jpg',1)
width = 128
height = 128
resized1 = cv2.resize(img1, dim, interpolation = cv2.INTER_AREA)
x_val1 = np.array(resized1) / 255

x_val1 = x_val1.reshape(-1, 128, 128, 3)
pred1 = model.predict(x_val1)
print(pred1)
[[0.9767471 0.02325287]]
img2 = cv2.imread(r'C:\Users\tharu\OneDrive\Desktop\DataScience\Deep-
width = 128
height = 128
resized2 = cv2.resize(img2, dim, interpolation = cv2.INTER_AREA)
x_val2 = np.array(resized2) / 255

x_val2 = x_val2.reshape(-1, 128, 128, 3)
pred2 = model.predict(x_val2)
print(pred2)
[[0.9767471 0.02325287]]
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 126, 126, 64)      1792
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 124, 124, 32)      18464
_________________________________________________________________
flatten (Flatten)            (None, 492032)            0
_________________________________________________________________
dense (Dense)                (None, 2)                 984066
=================================================================
Total params: 1,004,322
Trainable params: 1,004,322
Non-trainable params: 0
_________________________________________________________________
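The output shapes and parameter counts in the summary above can be reproduced by hand. The sketch below assumes stride 1 and the standard counting rule (one weight per kernel element per input channel per filter, plus one bias per filter or unit); the helper names are made up for this example.

```python
def conv2d_output_size(size, kernel_size, padding="valid"):
    """Spatial output size of a stride-1 convolution."""
    # 'valid' trims kernel_size - 1 pixels; 'same' keeps the input size.
    return size - kernel_size + 1 if padding == "valid" else size

def conv2d_params(in_channels, filters, kernel_size):
    """Weights (k * k * in_channels per filter) plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters

def dense_params(in_features, units):
    """One weight per input-output pair plus one bias per unit."""
    return in_features * units + units

# Layer 1: Conv2D(64, kernel_size=3) on a 128x128x3 input.
side1 = conv2d_output_size(128, 3)          # 126
params1 = conv2d_params(3, 64, 3)           # 1792

# Layer 2: Conv2D(32, kernel_size=3) on the 126x126x64 feature map.
side2 = conv2d_output_size(side1, 3)        # 124
params2 = conv2d_params(64, 32, 3)          # 18464

# Flatten has no parameters; it only reshapes the feature map to a vector.
flat = side2 * side2 * 32                   # 492032

# Output layer: Dense(2).
params3 = dense_params(flat, 2)             # 984066

total = params1 + params2 + params3         # 1004322
print(side1, side2, flat, total)
```

Almost all of the parameters sit in the final dense layer, because the flattened feature map is so large. This is one reason pooling layers are commonly placed before Flatten.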
x = tf.random.normal((4, 28, 28, 3))
x.shape
TensorShape([4, 28, 28, 3])
input_shape = (4, 28, 28, 3)
input_shape[1:]
(28, 28, 3)
building model with different layers

model.add(Conv2D(16, padding='same', kernel_size=3, activation='relu',
                 input_shape=(128,128,3)))
model.add(Conv2D(16, 3, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(rate=0.25))
model.add(Dense(units=128, activation='relu'))
model.compile(optimizer='adam', loss='categorical_crossentropy',
epochs=10
)
Epoch 1/10
38/38 [==============================] - 26s 666ms/step - loss: 0.9023
Epoch 2/10
38/38 [==============================] - 25s 668ms/step - loss: 0.5373
Epoch 3/10
38/38 [==============================] - 25s 656ms/step - loss: 0.4629
Epoch 4/10
38/38 [==============================] - 27s 704ms/step - loss: 0.4224
Epoch 5/10
38/38 [==============================] - 26s 675ms/step - loss: 0.4203
Epoch 6/10
38/38 [==============================] - 26s 677ms/step - loss: 0.3727
Epoch 7/10
38/38 [==============================] - 28s 739ms/step - loss: 0.2902
Epoch 8/10
38/38 [==============================] - 27s 702ms/step - loss: 0.2439
Epoch 9/10
38/38 [==============================] - 26s 688ms/step - loss: 0.2069
Epoch 10/10
38/38 [==============================] - 26s 672ms/step - loss: 0.1792
<tensorflow.python.keras.callbacks.History at 0x1e7a1200f70>
Learning\emrgency vehicle data\train_SOaYf6m\train.jpg',1)
width = 128
height = 128
resized = cv2.resize(img, dim, interpolation = cv2.INTER_AREA)
print(pred)
[[0.4878648 0.5121352]]
building model
input_shape=(128,128,3)))
model.add(MaxPooling2D((2, 2)))
model.add(BatchNormalization())
model.add(Conv2D(64, kernel_size=3,padding='same', activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(BatchNormalization())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.compile(loss='binary_crossentropy',
optimizer='adam',
epochs=10
)
Epoch 1/10
38/38 [==============================] - 22s 525ms/step - loss: 6.4985
Epoch 2/10
38/38 [==============================] - 20s 521ms/step - loss: 6.2590
Epoch 3/10
38/38 [==============================] - 20s 516ms/step - loss: 6.5250
Epoch 4/10
38/38 [==============================] - 19s 509ms/step - loss: 6.1739
Epoch 5/10
38/38 [==============================] - 20s 531ms/step - loss: 5.6002
Epoch 6/10
38/38 [==============================] - 20s 532ms/step - loss: 6.4007
Epoch 7/10
38/38 [==============================] - 19s 509ms/step - loss: 6.4062
Epoch 8/10
38/38 [==============================] - 21s 544ms/step - loss: 6.0466
Epoch 9/10
38/38 [==============================] - 19s 510ms/step - loss: 6.2585
Epoch 10/10
38/38 [==============================] - 20s 533ms/step - loss: 6.5785
<tensorflow.python.keras.callbacks.History at 0x1e7a16a28e0>
Learning\emrgency vehicle data\train_SOaYf6m\vehicle.jpg',1)
width = 128
height = 128
resized = cv2.resize(img, dim, interpolation = cv2.INTER_AREA)
print(pred)
[[1. 0.]]
imge = cv2.imread(r'C:\Users\tharu\OneDrive\Desktop\DataScience\Deep-
Learning\emrgency vehicle data\train_SOaYf6m\train.jpg',1)
width = 128
height = 128
resize = cv2.resize(imge, dim, interpolation = cv2.INTER_AREA)
x_vale = np.array(resize) / 255
x_vale = x_vale.reshape(-1, 128, 128, 3)
pred = model.predict(x_vale)
print(pred)
[[1.0000000e+00 1.5728423e-38]]
cnn using keras tuner

import pandas as pd
import numpy as np
import os
import cv2
import keras
from keras.layers import Dense, Dropout, Flatten
import PIL
import theano
x_col = "id",
y_col = "label",
batch_size=32,
seed=42,
shuffle=True,
dataframe= df,
x_col="id",
y_col="label",
batch_size=32,
color_mode = 'rgb',
seed=42,
shuffle=True,

input_shape=(64,64,3)) )
model.add(Dense(units=128, activation='relu'))
model.compile(
optimizer=keras.optimizers.Adam(1e-3),
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
epochs=10
)
Epoch 1/10
----------------------------------------------------------------------
-----
InvalidArgumentError Traceback (most recent call
last)
<ipython-input-83-f39f8a4579a2> in <module>
1 STEP_SIZE_TRAIN = data_train.n//data_train.batch_size
2 STEP_SIZE_VALID = data_test.n//data_test.batch_size
----> 3 model.fit(x = data_train,steps_per_epoch=STEP_SIZE_TRAIN,
4 epochs=10
5 )
~\anaconda3\lib\site-packages\tensorflow\python\keras\engine\
training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks,
validation_split, validation_data, shuffle, class_weight,
sample_weight, initial_epoch, steps_per_epoch, validation_steps,
validation_batch_size, validation_freq, max_queue_size, workers,
use_multiprocessing)
1098 _r=1):
1099 callbacks.on_train_batch_begin(step)
-> 1100 tmp_logs = self.train_function(iterator)
1101 if data_handler.should_sync:
1102 context.async_wait()
~\anaconda3\lib\site-packages\tensorflow\python\eager\def_function.py
in __call__(self, *args, **kwds)
826 tracing_count = self.experimental_get_tracing_count()
827 with trace.Trace(self._name) as tm:
--> 828 result = self._call(*args, **kwds)
829 compiler = "xla" if self._experimental_compile else
"nonXla"
830 new_tracing_count =
self.experimental_get_tracing_count()
~\anaconda3\lib\site-packages\tensorflow\python\eager\def_function.py
in _call(self, *args, **kwds)
886 # Lifting succeeded, so variables are initialized and
we can run the
887 # stateless function.
--> 888 return self._stateless_fn(*args, **kwds)
889 else:
890 _, _, _, filtered_flat_args = \
~\anaconda3\lib\site-packages\tensorflow\python\eager\function.py in
__call__(self, *args, **kwargs)
2940 (graph_function,
2941 filtered_flat_args) = self._maybe_define_function(args,
kwargs)
-> 2942 return graph_function._call_flat(
2943 filtered_flat_args,
captured_inputs=graph_function.captured_inputs) # pylint:
disable=protected-access
2944
_call_flat(self, args, captured_inputs, cancellation_manager)
1916 and executing_eagerly):
1917 # No tape is watching; skip to running the function.
-> 1918 return
self._build_call_outputs(self._inference_function.call(
1919 ctx, args,
cancellation_manager=cancellation_manager))
1920 forward_backward =
self._select_forward_and_backward_functions(
call(self, ctx, args, cancellation_manager)
553 with _InterpolateFunctionError(self):
554 if cancellation_manager is None:
--> 555 outputs = execute.execute(
556 str(self.signature.name),
557 num_outputs=self._num_outputs,
~\anaconda3\lib\site-packages\tensorflow\python\eager\execute.py in
quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
57 try:
58 ctx.ensure_initialized()
---> 59 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle,
device_name, op_name,
60 inputs, attrs,
num_outputs)
61 except core._NotOkStatusException as e:
InvalidArgumentError: logits and labels must have the same first dimension, got logits shape [32,2] and labels shape [64]
[[node sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at <ipython-input-83-f39f8a4579a2>:3) ]] [Op:__inference_train_function_14342]
Function call stack:
train_function
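One plausible reading of this error (an assumption, since the generator call is not fully shown) is that the data generator yields one-hot labels of shape (32, 2), while 'sparse_categorical_crossentropy' expects one integer class index per sample, i.e. shape (32,). Flattening 32 x 2 one-hot values gives exactly the 64 reported in the message. The sketch below uses made-up labels to show the two label layouts.

```python
batch_size = 32
num_classes = 2

# Hypothetical one-hot labels: alternate class 0 and class 1.
one_hot = [[1.0, 0.0] if i % 2 == 0 else [0.0, 1.0] for i in range(batch_size)]

# Flattening the one-hot batch gives batch_size * num_classes values,
# which matches the "labels shape [64]" in the error message.
flattened = [v for row in one_hot for v in row]

# A sparse loss expects one integer class index per sample instead.
sparse = [row.index(max(row)) for row in one_hot]

print(len(flattened))  # 64
print(len(sparse))     # 32
```

If this is the cause, two common remedies are to keep the one-hot labels and use loss='categorical_crossentropy', or to have the generator emit integer labels (for Keras' flow_from_dataframe, class_mode='sparse') and keep the sparse loss.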
TensorBoard: TensorFlow's visualization toolkit

link:https://www.tensorflow.org/tensorboard/get_started
TensorBoard provides the visualization and tooling needed for machine learning
experimentation:
• Tracking and visualizing metrics such as loss and accuracy
• Visualizing the model graph (ops and layers)
• Viewing histograms of weights, biases, or other tensors as they change over time
• Projecting embeddings to a lower dimensional space
• Displaying images, text, and audio
TRANSFER LEARNING
Transfer learning is a machine learning method in which a model developed for one task is
reused as the starting point for a model on a second task.
It is a popular approach in deep learning: pre-trained models are used as the starting
point for computer vision and natural language processing tasks, because developing
neural network models for these problems from scratch takes vast compute and time
resources, and because pre-trained models give large jumps in skill on related problems.
link:https://www.analyticsvidhya.com/blog/2020/08/top-4-pre-trained-models-for-image-classification-with-python-code/
link:https://keras.io/api/applications/
import pandas as pd
import numpy as np
import os
import cv2
import keras
from keras.layers import Dense, Dropout, Flatten, Lambda
import PIL
import theano
import glob
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
x_col = "id",
y_col = "label",
batch_size=32,
seed=42,
shuffle=True,
dataframe= df,
x_col="id",
y_col="label",
batch_size=32,
color_mode = 'rgb',
seed=42,
shuffle=True,

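# Load ResNet50 pre-trained on ImageNet, without its classification head
base_model = ResNet50(input_shape=(64, 64, 3), weights='imagenet', include_top=False)
# Freeze the pre-trained layers so that only the new head is trained
for layer in base_model.layers:
    layer.trainable = False
# Add a single sigmoid unit for binary classification
x = layers.Flatten()(base_model.output)
x = layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.models.Model(base_model.input, x)
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.0001),
              loss='binary_crossentropy', metrics=['acc'])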
vgghist = model.fit(x = data_train,steps_per_epoch=STEP_SIZE_TRAIN,
epochs=10)
Epoch 1/10
38/38 [==============================] - 23s 437ms/step - loss: 0.7749
- acc: 0.5000 - val_loss: 0.7174 - val_acc: 0.5000
Epoch 2/10
38/38 [==============================] - 14s 376ms/step - loss: 0.7110
Epoch 3/10
38/38 [==============================] - 15s 383ms/step - loss: 0.6974
Epoch 4/10
38/38 [==============================] - 14s 381ms/step - loss: 0.6941
Epoch 5/10
38/38 [==============================] - 16s 414ms/step - loss: 0.6933
Epoch 6/10
38/38 [==============================] - 14s 381ms/step - loss: 0.6932
Epoch 7/10
38/38 [==============================] - 15s 383ms/step - loss: 0.6931
Epoch 8/10
38/38 [==============================] - 15s 384ms/step - loss: 0.6931
Epoch 9/10
38/38 [==============================] - 14s 378ms/step - loss: 0.6931
Epoch 10/10
38/38 [==============================] - 14s 375ms/step - loss: 0.6931
imge = cv2.imread(r'C:\Users\tharu\OneDrive\Desktop\DataScience\Deep-
width = 64
height = 64
dim = (width, height)
resize = cv2.resize(imge, dim, interpolation = cv2.INTER_AREA)
x_vale = np.array(resize) / 255
x_vale = x_vale.reshape(-1, 64, 64, 3)
pred = model.predict(x_vale)
print(pred)
[[0.49948993]]
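The loss plateau at 0.6931 in the training log and the prediction close to 0.5 are consistent with a model that outputs roughly 0.5 for every image: the binary cross-entropy of a constant 0.5 prediction is ln 2 ≈ 0.6931, whatever the labels are. The short check below uses made-up labels to confirm the arithmetic. It does not diagnose why the model failed to learn. One thing worth checking (an assumption, not something the source confirms) is whether the inputs are scaled the way ResNet50's preprocess_input would scale them.

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """Mean binary cross-entropy over a list of labels and predictions."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, y_pred)
    ]
    return sum(losses) / len(losses)

# Hypothetical balanced labels with a constant prediction of 0.5.
labels = [0, 1, 0, 1]
constant_preds = [0.5] * len(labels)

loss = binary_cross_entropy(labels, constant_preds)
print(round(loss, 4))  # 0.6931
```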
RNN (Recurrent Neural Network)
RNNs work very well with sequence data, such as in NLP and time series analysis.
A recurrent neural network is a neural network that is specialized for processing a
sequence of data x(1), . . . , x(τ), with the time step index t ranging from 1 to τ. For
tasks that involve sequential inputs, such as speech and language, it is often better to use
RNNs. In an NLP problem, if you want to predict the next word in a sentence, it is important
to know the words before it. RNNs are called recurrent because they perform the same task
for every element of a sequence, with the output depending on the previous
computations. Another way to think about RNNs is that they have a “memory” which
captures information about what has been calculated so far.
The left side of the above diagram shows a notation of an RNN and on the right side an RNN
being unrolled (or unfolded) into a full network. By unrolling we mean that we write out
the network for the complete sequence. For example, if the sequence we care about is a
sentence of 3 words, the network would be unrolled into a 3-layer neural network, one
layer for each word.
Input: x(t) is taken as the input to the network at time step t. For example, x(1) could be a
one-hot vector corresponding to a word of a sentence.
Hidden state: h(t) represents the hidden state at time t and acts as the “memory” of the network.
h(t) is calculated from the current input and the previous time step’s hidden state:
h(t) = f(U x(t) + W h(t−1)). The function f is a non-linear transformation such as tanh or ReLU.
Weights: The RNN has input-to-hidden connections parameterized by a weight matrix U,
hidden-to-hidden recurrent connections parameterized by a weight matrix W, and
hidden-to-output connections parameterized by a weight matrix V. All these weights (U, V, W)
are shared across time.
Output: o(t) is the output of the network at time step t. In the figure, o(t) is followed only by
an arrow, but the output is often also passed through a non-linearity, especially when the
network contains further layers downstream.
Forward Pass
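A minimal sketch of the forward pass defined by h(t) = f(U x(t) + W h(t−1)), with f = tanh and the output taken as o(t) = V h(t). Bias terms are omitted to match the formula in the text. The matrix sizes, weight values, and helper names are assumptions chosen only for illustration.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_forward(inputs, U, W, V, h0):
    """Run a simple RNN over a sequence and return hidden states and outputs."""
    h = h0
    hidden_states, outputs = [], []
    for x in inputs:
        # h(t) = tanh(U x(t) + W h(t-1)); the same U and W are reused at every step.
        pre_activation = [a + b for a, b in zip(matvec(U, x), matvec(W, h))]
        h = [math.tanh(z) for z in pre_activation]
        hidden_states.append(h)
        # o(t) = V h(t); a softmax or other non-linearity could follow.
        outputs.append(matvec(V, h))
    return hidden_states, outputs

# Illustrative sizes and weights (assumed, not from the original text):
# 3-dimensional one-hot inputs, 2 hidden units, 3 outputs.
U = [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]]      # hidden x input
W = [[0.1, 0.2], [-0.2, 0.3]]                 # hidden x hidden
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]      # output x hidden
sequence = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # a "sentence" of 3 one-hot words

hidden_states, outputs = rnn_forward(sequence, U, W, V, h0=[0.0, 0.0])
for t, (h, o) in enumerate(zip(hidden_states, outputs), start=1):
    print(t, h, o)
```

The loop body is the unrolled network: one pass through the loop corresponds to one layer of the unrolled diagram, and the hidden state h carried between iterations is the network's memory.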
Problems with simple RNN

• RNNs suffer from the problem of vanishing gradients, which hampers
learning of long data sequences. The gradients carry the information used in the RNN
parameter update; when the gradient becomes smaller and smaller, the
parameter updates become insignificant, which means no real learning is done.
• With the vanishing gradient problem, the further back you go through the network, the
smaller the gradient is and the harder it is to train the weights, which has a domino
effect on all of the further weights throughout the network. That was the main
roadblock to using Recurrent Neural Networks.
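The shrinking gradient can be illustrated with a scalar RNN, h(t) = tanh(w h(t−1) + u x(t)). By the chain rule, the gradient of h(t) with respect to h(0) is a product of one factor w (1 − h(k)²) per time step. When each factor is smaller than 1 in magnitude, the product decays roughly exponentially with sequence length. The weights and inputs below are assumed values for illustration only.

```python
import math

def gradient_through_time(w, u, inputs, h0=0.0):
    """Return |d h(t) / d h(0)| after each step of a scalar tanh RNN."""
    h = h0
    grad = 1.0
    magnitudes = []
    for x in inputs:
        h = math.tanh(w * h + u * x)
        # Chain rule: each step multiplies by w * tanh'(.) = w * (1 - h^2).
        grad *= w * (1.0 - h * h)
        magnitudes.append(abs(grad))
    return magnitudes

# Illustrative values (assumed): recurrent weight 0.5, input weight 1.0,
# and a constant input of 1.0 for 20 time steps.
magnitudes = gradient_through_time(w=0.5, u=1.0, inputs=[1.0] * 20)
print(magnitudes[0], magnitudes[9], magnitudes[19])
```

With these values every factor is below 0.5, so after 20 steps the gradient is smaller than 0.5 to the power 20, about one millionth. The earliest inputs then have almost no influence on the weight update.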
File "<ipython-input-17-6c86d668c54d>", line 1

+b = 5
^
SyntaxError: cannot assign to operator
File "<ipython-input-18-855ab69d5122>", line 1

9hero = 20
^
SyntaxError: invalid syntax
----------------------------------------------------------------------
-----
TypeError Traceback (most recent call
last)
<ipython-input-22-86a4cc37efd5> in <module>
----> 1 'srtt'-'dfdddf'
TypeError: unsupported operand type(s) for -: 'str' and 'str'

