
Study of ensemble of activation functions in Deep Learning

Name: Rehan Khan

Roll: 19EARCS095

ABSTRACT
Artificial neural networks (ANN), commonly known simply as neural networks, are a class of machine learning algorithms developed in part to mimic the architecture of the human brain. Because they can approximate complex functions from data, neural networks are inherently powerful. This generalization capability has influenced transdisciplinary fields such as image recognition, speech recognition, and natural language processing. Activation functions are a crucial component of neural networks: they define a network node's output in terms of a set of inputs. This paper gives a brief introduction to deep neural networks, summarizes what activation functions are and how they are used in neural networks, looks at their most prevalent characteristics and the various types of activation functions, discusses some of the problems, constraints, and potential alternatives that activation functions face, and closes with a conclusion.

1. INTRODUCTION

Neural networks are multi-layered networks of neurons, made up of nodes, that use input data to classify and predict outcomes. They have three kinds of layers: an input layer, one or more hidden layers, and an output layer. Information is processed from one layer to the next through the nodes, each of which has a weight that is taken into account.
Activation functions make the neural network non-linear, i.e., the output no longer depends linearly on the input features. Although linear equations are simple and easy to comprehend, their complexity is constrained and they cannot learn or identify complicated mappings from the data. Without an activation function, a neural network performs like, and has the limited capacity of, a linear regression model. A neural network is expected to handle more complex tasks than learning and computing a linear function: it must model complex kinds of information such as images, videos, audio, speech, and text. To handle complicated, high-dimensional, and non-linear data sets, we apply activation functions in artificial neural networks with several hidden layers for extracting knowledge and meaningful information.
In simple terms, the activation function decides whether a neuron in a neural network should be activated or not, which is similar to the working of our brain. These functions also play an important role in backpropagation, which is used to update the weights and biases of a neural network.
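As a minimal illustrative sketch (the weights, bias, and input values below are made up, and ReLU is used purely as an example activation), a neuron's output is the chosen activation applied to the weighted sum of its inputs, and the activation's derivative is the factor that backpropagation multiplies in when updating the weights:

import numpy as np

def relu(z):
    # ReLU activation: pass positive values through, zero out negatives
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative of ReLU, used while propagating errors backwards
    return (z > 0).astype(float)

# Hypothetical neuron with three inputs
x = np.array([0.5, -1.2, 2.0])   # inputs
w = np.array([0.4, 0.3, 0.2])    # weights
b = 0.1                          # bias

z = np.dot(w, x) + b       # weighted sum of the inputs
a = relu(z)                # the neuron "fires" only if z > 0
local_grad = relu_grad(z)  # factor used by backpropagation to update w and b
print(a, local_grad)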

2. LIST OF ACTIVATION FUNCTIONS


2.1 Linear:

The linear activation function, sometimes referred to as "no activation" or the "identity function," is one in which the activation is proportional to the input (the input is simply multiplied by 1). The function just passes on the value it is given, doing nothing to the weighted sum of the input.
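Written as a formula, the identity activation and its derivative are:

f(x) = x,   f'(x) = 1

Because the derivative is a constant, stacking layers that use this activation still produces an overall linear model.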
2.2 Sigmoid:

Sigmoid transforms its input into a value in the range 0 to 1. Because probabilities also lie between 0 and 1, it is often used in models whose output must be a probability prediction. The function is differentiable and its gradient is smooth.
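In standard form, the sigmoid function and its gradient are:

sigmoid(x) = 1 / (1 + e^(-x)),   sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))

The gradient peaks at 0.25 near x = 0 and decays towards zero for large |x|, which is relevant to the vanishing gradient problem discussed later.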

2.3 Tanh:
The tanh function, also known as the hyperbolic tangent, has gained greater
traction than the sigmoid function because it often provides superior training
results for multi-layer neural networks. All of the sigmoid function's beneficial
characteristics are inherited by the tanh function. A formulation of the tanh
function is:
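tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Equivalently, tanh(x) = 2 * sigmoid(2x) - 1, so its output lies in the range (-1, 1) and is centred around zero.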

2.4 ReLU:

Although ReLU gives the impression of being a linear function, it has a derivative and allows backpropagation while remaining computationally efficient. What is crucial here is that not all of the neurons are activated simultaneously by the ReLU function: a neuron becomes inactive only if the output of the linear transformation is less than 0. Because it activates only a subset of neurons, the ReLU function is far more computationally efficient than the sigmoid and tanh functions. Due to ReLU's linear, non-saturating nature, it also speeds up the convergence of gradient descent towards the global minimum of the loss function.
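In standard form, ReLU and its derivative are:

ReLU(x) = max(0, x),   ReLU'(x) = 1 for x > 0 and 0 for x < 0

(at x = 0 the derivative is undefined and is conventionally taken to be 0 in practice).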


2.5 Leaky ReLU:

Leaky ReLU has the same benefits as ReLU, and it additionally allows backpropagation even for negative input values. This small adjustment ensures that the gradient on the left side of the graph is non-zero for negative input values, so we no longer encounter dead neurons in that region. The Leaky ReLU function and its derivative are given below.
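f(x) = x for x > 0, and f(x) = a*x for x <= 0, where a is a small constant (commonly 0.01)

f'(x) = 1 for x > 0, and f'(x) = a for x <= 0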

2.6 Swish:
This function is bounded below but unbounded above, meaning that as x approaches negative infinity, y approaches a constant value, but as x approaches infinity, y approaches infinity. Swish is a smooth function, so it does not abruptly change direction near x = 0 the way ReLU does; instead, it bends smoothly from 0 towards slightly negative values and then upwards.
Because the swish function is non-monotonic, the expression of the input data and of the weights to be learned is enhanced. It has the following mathematical representation:
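swish(x) = x * sigmoid(x) = x / (1 + e^(-x))

A common variant scales the input inside the sigmoid by a fixed or learnable parameter beta, giving x * sigmoid(beta * x).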

2.7 Softmax Function:


The sigmoid function's output falls between 0 and 1, which is comparable to a probability. But suppose we have five output values of 0.8, 0.9, 0.7, 0.8, and 0.6, respectively: they do not sum to one, so how do we proceed? The Softmax function, which can be described as a combination of several sigmoids, computes the relative probabilities. Like the sigmoid/logistic activation function, softmax yields the likelihood of each class in the neural network. It is most frequently used as the activation function for the last layer of the neural network in multi-class classification. It has the following mathematical representation:
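softmax(x_i) = e^(x_i) / sum_j e^(x_j),   for i = 1, ..., K classes

The outputs all lie between 0 and 1 and sum to 1 across the K classes, which resolves the example above, where the individual sigmoid outputs (0.8, 0.9, 0.7, 0.8, 0.6) did not form a valid probability distribution.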

2.8 Mish:
Like swish, it is unbounded above and bounded below. Because of its non-monotonic nature, it offers advantages comparable to those of swish. It is also self-gated; here the gate is tanh(softplus(x)) rather than sigmoid(x), although the gate's role remains the same. This difference in the gate is the actual cause of the performance difference between swish and mish. Mish is continuously differentiable up to infinite order: it belongs to the C∞ class, whereas ReLU belongs to the C0 class. For an activation function, this high degree of differentiability is particularly beneficial.
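For reference, the standard definition of Mish is:

mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))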

3. CHALLENGES FACED BY ACTIVATION FUNCTIONS


The vanishing gradient problem and the dead neuron problem are the two main difficulties that commonly employed activation functions must overcome. Both topics are covered in this section.

3.1 Problem with vanishing gradients

As backpropagation advances deeper into the network, this issue arises when the gradient values approach zero, causing the weights to saturate and stop being properly updated. As a result, the loss stops dropping and the network is unable to complete its training. This is known as the vanishing gradient problem. Saturated neurons are those whose weights are no longer adequately updated.
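As a small, hypothetical illustration of this effect (the layer count and pre-activation values are made up), multiplying together the local sigmoid gradients of a deep chain of units, each of which is at most 0.25, shows how little gradient reaches the earliest layers:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # never exceeds 0.25

# Chain rule across a hypothetical 10-layer stack of sigmoid units:
# the gradient reaching the early layers is the product of the local gradients.
grad = 1.0
for z in np.linspace(-3.0, 3.0, 10):
    grad *= sigmoid_grad(z)
print(grad)   # a vanishingly small number: hardly any signal reaches the first layers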
3.2 Dead neuron problem

As mentioned previously, several activation functions produce output values between 0 and 1. When a value is almost 0, the corresponding neurons are forced to become inactive and stop contributing to the output. Furthermore, the weights may be updated in a way that makes the weighted sum zero for a significant part of the network. This can effectively deactivate a large portion of the input, impairing network performance in a way that cannot be fixed. Neurons that have been forcibly turned off in this manner are referred to as "dead neurons," and the issue is called the "dead neuron problem."
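A minimal sketch of this effect for the ReLU case (the weights and inputs below are made up): once the pre-activation is negative, both the output and the gradient are zero, so the weights stop updating and the neuron stays dead.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

# Hypothetical neuron whose weights have drifted so far negative that the
# pre-activation is negative for typical inputs.
w = np.array([-0.8, -0.5])
x = np.array([1.0, 2.0])
z = np.dot(w, x)          # z = -1.8 < 0

print(relu(z))            # 0.0: the neuron contributes nothing to the output
print(relu_grad(z) * x)   # [0. 0.]: the gradient w.r.t. w is zero, so w never updates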

4. CONCLUSION
Activation functions suffer from many issues, such as vanishing gradients, exploding gradients, and local oscillation. Better performing activation functions are those that are highly differentiable and vary within limits. Based on the use case, it is beneficial to find the optimum, since different functions give different results depending on their nature.
Many different activation functions have been invented that deal with the aforementioned problems while still providing nonlinearity to the neural network model at hand. The functions inside a model need to be ordered in such a way that the effect of one does not get nullified by the next function in line. Fine-tuning hyperparameters such as the number of neurons in a hidden layer, the dropout rate, and the learning rate helps achieve the desired results, in addition to the activation function selection.
The sigmoid function and its variations usually work best in classification problems. When in doubt, the ReLU function provides better results in most cases and is therefore widely used in deep learning networks. While designing your own activation function, it is important to keep in mind that it will be used in the backpropagation of errors and weight updates; hence, its effectiveness must be studied on different kinds of datasets.

