
Chapter 1
Preliminaries: Biological Neuron to Artificial Neural Network


Artificial neural network (ANN) is designed with the goal of building intelligent
machines that solve complex perceptual problems by mimicking special features of
biological neurons in the human brain. Over the last half century, ANN has become one
of the most successful tools in machine learning, marked by a series of notable landmarks.
Recently, deep architectures of ANN have gained new influence in this field. This chapter
explains the background and basics of ANN on the way towards deep architectures.
The chapter is organized as follows. Section 1.1 briefly explains the motivation of
ANN development. Section 1.2 describes the distinguishing features of the biological
neuron and the development of the artificial neuron mimicking it. Sections 1.3 and 1.4
describe the common ANN structure and a sample ANN structure in detail, respectively.
Section 1.5 explains the training of ANN, and Section 1.6 presents a Matlab
implementation of a sample NN. Performance measures and benchmark problems with
their preprocessing are briefly explained in Sections 1.7 and 1.8, respectively. Section 1.9
summarizes distinguishing features of NN, and finally Section 1.10 explains the
motivation for deep architectures.

1.1 Motivation
Artificial neural network (ANN) is designed with the goal of building intelligent
machines to solve complex perceptual problems, such as pattern recognition and
optimization, by mimicking the networks of real neurons in the human brain. Biological
neural system possesses massive parallelism, distributed representation and computation,
learning ability, generalization ability, adaptability, inherent contextual information
processing, fault tolerance, and low energy consumption (Jain et al., 1996). An ANN is a
collection of simple processing units which has a natural propensity for storing
experiential knowledge and resembles the human brain in two respects (Haykin, 1999):
1. Knowledge is acquired by it from its environment through a learning process.
2. Interneuron connection strengths, known as synaptic weights, are used to store the
knowledge.
The procedure used to perform the learning process is called a learning algorithm.


Since synaptic weights store the knowledge, the goal of a learning algorithm is to modify
the synaptic weights of a network in an orderly fashion to attain a desired objective. Back-
propagation (BP) (Rumelhart et al., 1986) is the most popular ANN learning method for
multi-layered networks. In BP, synaptic weights are adjusted as the output error is
propagated backward from the output layer to the input layer.

1.2 Biological Neuron to Artificial Neuron


A neuron (or nerve cell) is a special biological cell, the essence of life, with
information processing ability. It has been estimated that the human brain contains around
10^11 neurons, and each one is connected to 10^3 to 10^4 other neurons for information
gathering as well as response transfer. The introduction of neurons as basic structural
constituents of the brain is credited to Ramon y Cajal, who won the 1906 Nobel Prize
in physiology and medicine for the crucial discovery of the extensive interconnections
within the cerebral cortex, the portion of the brain where approximately 90% of the neurons
in the human brain are located.
A sketch of a biological neuron is shown in Fig. 1.1. It is composed of a cell body,
or soma, and two types of out-reaching tree-like branches: the axon and the dendrites.
The cell body has a nucleus that contains information about hereditary traits and a plasma
that holds the molecular equipment for producing material needed by the neuron. A
neuron receives signals (impulses) from other neurons through its dendrites (receivers)
and transmits signals generated by its cell body along the axon (transmitter), which
eventually branches into strands and substrands. At the terminals of these strands are the
synapses. A synapse is an elementary structure and functional unit between two neurons

Figure 1.1: A sketch of a biological neuron.



(an axon strand of one neuron and a dendrite of another). When the impulse reaches the
synapse's terminal, certain chemicals called neurotransmitters are released. The
neurotransmitters diffuse across the synaptic gap to enhance or inhibit, depending on the
type of the synapse, the receptor neuron's own tendency to emit electrical impulses. The
synapse's effectiveness can be adjusted by the signals passing through it, so that
synapses can learn from the activities in which they participate. This dependence on
past activity acts as a memory, which is possibly the basis of human memory.
McCulloch and Pitts proposed a binary threshold unit as a computational model for
an artificial neuron (see Figure 1.2). This mathematical neuron computes a weighted sum
of its n input signals, x_j, j = 1, 2, ..., n, and generates an output of 1 if this sum is above a
certain threshold u. Otherwise, an output of 0 results. Mathematically,

y = \theta\Big(\sum_{j=1}^{n} w_j x_j - u\Big)    (1.1)

where θ(·) is a unit step function at 0, and w_j is the synaptic weight associated with the jth
input. Positive weights correspond to excitatory synapses, while negative weights model
inhibitory ones. There is a crude analogy here to a biological neuron: wires and
interconnections model axons and dendrites, connection weights represent synapses, and
the threshold function approximates the activity in a soma. The McCulloch-Pitts
model, however, contains a number of simplifying assumptions that do not reflect the true
behavior of biological neurons.
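A minimal Matlab sketch of the McCulloch-Pitts unit of Eq. (1.1) is given below; the weight
and threshold values are illustrative assumptions that happen to realize the two-input AND
logic, and are not part of the original model description.

% Minimal sketch of a McCulloch-Pitts neuron (Eq. 1.1):
% output 1 if the weighted sum of the inputs reaches the threshold u, else 0.
mp_neuron = @(x, w, u) double(sum(w .* x) >= u);

% Illustrative (assumed) values: two inputs, both weights 1, threshold 1.5,
% which realizes the two-input AND logic.
w = [1 1];  u = 1.5;
y00 = mp_neuron([0 0], w, u)   % gives 0
y11 = mp_neuron([1 1], w, u)   % gives 1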
The McCulloch-Pitts neuron has been generalized in many ways. An obvious one is
to use activation functions other than the threshold function, such as piecewise linear,
sigmoid, or Gaussian. For simplicity of notation, the threshold u is often considered as
another weight w_0 = -u attached to the neuron with a constant input x_0 = 1. Fig. 1.3
shows a generalized form of the artificial neuron, which consists of a set of connection
weights, a summing unit, and an activation function. Each input signal is weighted, that
is, it is multiplied by the weight value of the corresponding input (with an analogy to the
synaptic strength of the connections of real neurons).
Figure 1.2: McCulloch-Pitts model of a neuron.


4 | Chapter 1. Preliminaries: Biological Neuron to Artificial Neural Network

Figure 1.3: Generalized model of an artificial neuron.

The output of the summing unit is, therefore, a combination of the weighted input signals
and an externally applied bias. The bias has the effect of increasing or decreasing the net
input of the activation function, depending on whether it is positive or negative,
respectively. Finally, the output of the neuron comes from the activation function. The
activation function is also referred to as a squashing function in the sense that it squashes
the permissible amplitude range of the output signal to some finite value.
The activation function may be a threshold unit, a linear function or a nonlinear function, as
shown in Fig. 1.4. The most commonly used nonlinear activation function is the sigmoid
function (Haykin, 1999). The sigmoid function is defined by y = 1/(1 + exp(-ax)), where a
is the slope parameter. By varying the parameter a, one can obtain sigmoid functions of
different slopes, as illustrated in Fig. 1.5. Since the output of a sigmoid function is bounded
between 0 and 1, increasing or decreasing the input value (x) by a large amount pushes the
output into the saturated region. The important characteristics of the sigmoid function are
that it is bounded above and below, it is monotonically increasing, and it is continuous and
differentiable everywhere.
Figure 1.4: Different types of activation functions: (a) threshold, (b) piecewise linear,
(c) sigmoid, and (d) Gaussian.

Figure 1.5: Sigmoid function with various slope (a) values.
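As a small illustration, the sigmoid of Fig. 1.5 may be sketched in Matlab as follows; the
plotted range and slope values are illustrative choices only.

% Sigmoid activation with slope parameter a (Fig. 1.5).
sigmoid = @(x, a) 1 ./ (1 + exp(-a .* x));

x = -10:0.1:10;                               % input range for plotting
plot(x, sigmoid(x, 2), x, sigmoid(x, 1), x, sigmoid(x, 0.5));
legend('a = 2', 'a = 1', 'a = 1/2');
xlabel('x'); ylabel('y');                     % output saturates near 0 and 1 for large |x|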

1.2.1 Solving Linear Problem with a Single Neuron


Synaptic weight values (including the bias weight) are the parameters to adjust in order
to solve different problems. For simplicity, consider the two-input logic operations OR,
AND, XOR, NOR, NAND and XNOR. The logic output Y for the different combinations of
the two inputs X1 and X2 is shown in Table 1.1. Fig. 1.6 shows the boundary lines of the
logic gates in the X1-X2 plane. From the figure it is observed that for a logic (e.g., OR) and
its inverse logic (NOR) the boundary line position is the same and the output responses are
exchanged across the boundary.
As a sample case, Fig. 1.7 shows the solution for the OR operation with a threshold
activation function, weight values +1 for both w1 and w2, and a bias weight value of -0.5.
As seen in Fig. 1.6, the OR problem has a single decision line, and this architecture is able
to solve it. The AND, NOR and NAND logic operations are also single-decision-line
problems and can be solved by a single neuron. Table 1.2 shows the weight values for these
operations.

Table 1.1: Logic operations for two inputs.


Input Output (Y) for logic gate operation
X1 X2 OR AND NOR NAND XOR XNOR
0 0 0 0 1 1 0 1
0 1 1 0 0 1 1 0
1 0 1 0 0 1 1 0
1 1 1 1 0 0 0 1

Figure 1.6: Boundary lines for the logic gates: (a) OR, (b) AND, (c) XOR, (d) NOR, (e) NAND, (f) XNOR.
It is noticeable from the table that by only changing the bias weight value from -0.5 to -1.5,
the architecture transforms from the OR to the AND logic operation. On the other hand,
changing the sign of the weight values transforms the OR logic into the NOR logic. Any
such single-boundary problem (e.g., AND, NOR, NAND) may be solved by a single neuron.
Such a problem is also called a linearly separable problem. On the other hand, XOR and
XNOR are nonlinear problems that require two boundary lines. It is also noticeable that the
total number of binary logic functions depends on the number of inputs; for n inputs it is
2^(2^n) (Graupe, 2007), and therefore for n = 2 the value is 16. The six logics presented in
Table 1.1, with boundary lines shown in Fig. 1.6, are the common ones, and the remaining
ones may be formed as well. Among the 16 cases, only XOR and XNOR are nonlinear and
the remaining 14 are linearly separable. For the three-input case, the total number of
problems is 256, of which 104 are linearly separable and the remaining 152 are nonlinear.
With an increasing number of inputs, the proportion of linear problems decreases and that
of nonlinear problems increases. In general, real-world problems with many inputs fall into
the nonlinear category.

Table 1.2: Weight values for two input single decision line logic operations.
Logic Operation W1 W2 W0
OR +1 +1 - 0.5
AND +1 +1 - 1.5
NOR -1 -1 + 0.5
NAND -1 -1 + 1.5
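A minimal Matlab sketch, with illustrative function and variable names, verifies that the
weight values of Table 1.2 together with a threshold activation at 0 reproduce these four
logic operations.

% Single threshold neuron: y = 1 if w0 + w1*x1 + w2*x2 >= 0, else 0.
neuron = @(x, w) double(w(1) + w(2)*x(1) + w(3)*x(2) >= 0);

X = [0 0; 0 1; 1 0; 1 1];          % all two-input combinations
W = [-0.5  1  1;                   % OR    (rows are [w0 w1 w2] from Table 1.2)
     -1.5  1  1;                   % AND
      0.5 -1 -1;                   % NOR
      1.5 -1 -1];                  % NAND
for i = 1:4
    fprintf('%d %d : OR=%d AND=%d NOR=%d NAND=%d\n', X(i,:), ...
            neuron(X(i,:), W(1,:)), neuron(X(i,:), W(2,:)), ...
            neuron(X(i,:), W(3,:)), neuron(X(i,:), W(4,:)));
end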

Figure 1.7: Architecture for the OR operation.

1.2.2 Solving Nonlinear Problem


A single neuron is able to draw a single hyperplane to separate the input space into two
portions. For the two-input case the hyperplane is a line; for the three-input case it is a
two-dimensional plane; and so on. Arranging neuron(s) to solve nonlinear problems is the
aim of NNs, so that they perform well on real-world problems. Among nonlinear problems,
the two-input XOR is the basic as well as the most commonly cited challenging problem in
machine learning and NN research.
No arrangement of values in the three weights of a single neuron (i.e., the bias and two
input weights) is able to solve the XOR problem, because it requires two boundary lines and
a neuron is able to draw only one. It is noticeable from the truth table that the responses
of OR and XOR are the same for the three inputs (0, 0), (0, 1) and (1, 0). Only for input
(1, 1) is the XOR response 0, whereas it is 1 for OR. Therefore, manipulation of the OR
logic operation in the case of (1, 1) may give the XOR logic. Several solutions have been
investigated to solve XOR. One way is to transfer the problem from the two-dimensional
space to a three-dimensional space. A common statement is that a nonlinear problem of
lower dimension may become linearly separable if transferred to a higher dimension.
Figure 1.8 shows the transformation and architecture for solving XOR in three
dimensions. The XOR output is taken from a three-input OR unit, where the additional
input is the outcome of an AND unit over the inputs (X1 and X2) connected through a
weight value of -2. The three-input unit works as a two-input OR unit except for the input
(1, 1). For (1, 1), the AND logic is activated and makes the output of the three-input unit 0
due to the weight value -2. Another way of solving XOR is shown in Fig. 1.9, where the
solution comes in two stages: AND (i.e., H1) and OR (i.e., H2) logics are prepared first from
the original inputs, and finally the XOR response is generated through a third unit. In this
case the XOR logic comes from two inputs (i.e., H1 and H2) which are produced by the AND
and OR logic units. The final output may be represented as
Y = \overline{H_1} \,\text{AND}\, H_2 = \overline{(X_1 \,\text{AND}\, X_2)} \,\text{AND}\, (X_1 \,\text{OR}\, X_2) .
The operational procedure is similar to the previous model in such a way that for inputs (0,
0), (0, 1) and (1, 0), H1 = 0 (AND logic) and Y = H2 (OR logic), with outputs 0, 1, and 1,
respectively. On the other hand, for (1, 1), both H1 and H2 are activated but the output is
Y = 0 because H1 is connected through the weight value -2. In general terms, the
preparation of the intermediate AND and OR logics is termed a hidden representation.
Finally, to solve a problem, an appropriate architecture as well as a proper weight
assignment is essential.

(a) The third dimension lifts the XOR input (1, 1) to (1, 1, 1) in three-dimensional space.

(b) XOR solution with an additional input from an AND gate.

Figure 1.8: Solution of XOR in three dimensions.



Figure 1.9: XOR logic combining AND and OR logic operations.
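A small Matlab sketch of the two-stage construction of Fig. 1.9 is given below; the
threshold-unit weights are illustrative choices consistent with Table 1.2 and the text (H1
feeds the output unit through the weight -2).

% XOR from AND and OR logics (Fig. 1.9), built from threshold units.
step   = @(v) double(v >= 0);
h_and  = @(x) step(-1.5 + x(1) + x(2));              % H1 = X1 AND X2
h_or   = @(x) step(-0.5 + x(1) + x(2));              % H2 = X1 OR  X2
xor_nn = @(x) step(-0.5 - 2*h_and(x) + h_or(x));     % Y  = (NOT H1) AND H2

X = [0 0; 0 1; 1 0; 1 1];
for i = 1:4
    fprintf('%d XOR %d = %d\n', X(i,1), X(i,2), xor_nn(X(i,:)));
end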

1.3 Artificial Neural Network


The multi-layer feed-forward neural network with several layers of neurons is the most
common form of artificial neural network (NN) architecture for solving various real-world
problems such as classification. This type of network is also known as a multilayer
perceptron (MLP), and a typical architecture is shown in Fig. 1.10. The units, i.e., the
artificial neurons, are arranged in layers, and each unit in a layer has all its inputs
connected to the units of a preceding layer (or to the inputs from the external world in the
case of the units in the first layer), but it does not have any connections to the units of the
same layer to which it belongs. The layers are arranged one after another so that there is
an input layer, multiple intermediate layers, and finally an output layer. The intermediate
layers don’t have input or output to the external world, and are called hidden layers.
Generally, the input layer is considered just as a distributor of the signals from the external
world.
There is no constraint on the number of hidden layers or the number of neurons in the
hidden layers. The purpose of a hidden layer is to increase the functional adaptability of a
network. As the complexity of the relationship between the input data and the desired
output increases, the number of neurons in the hidden layer should also increase. On the
other hand, the number of hidden layers may also vary depending on the problem. However,
it has been reported that a three-layer MLP with a hidden layer having an appropriate
number of neurons may approximate any arbitrary function (Krose & Smagt, 1996).
From an architectural point of view, a NN to classify input data (e.g., image pixels or features)

Figure 1.10: A typical multi-layer feed-forward neural network.

may be expressed as

I → {Wh1} → H1 → {Wh2} → H2 → ... → {Wo} → O ,

where I, H and O correspond to the input, hidden and output neuron sets, respectively, and
W represents the connection weight set between two consecutive layers. Generally, a
nonlinear activation function is used in the hidden units, and depending on the problem a
linear or nonlinear activation function is used in the output units. For a particular class
definition, the corresponding output node is defined as 1 while the others are considered
zero.

1.4 Details of a Sample NN


Classification is one of the most frequently encountered decision making tasks in
human activities. A classification task occurs when an object needs to be assigned into a
predefined group or class based on a number of observed attributes related to that object.
To use a NN as a classifier, the number of output neurons is selected according to the
number of distinct classes of a given classification task. The number of inputs depends
on the input features of the problem. As an example, for handwritten digit classification,
the number of nodes in the output layer will be 10, and the number of nodes in the input
layer depends on the representation of the scanned handwritten digit. If the pixel values of
28×28 sized images are considered as the NN input, then the number of input nodes will be
784 (= 28×28). A standard NN architecture with a hidden layer of 300 neurons to classify
handwritten numeral images is

NN: I784 → {Wh784×300} → H300 → {Wo300×10} → O10 .


In the input layer (I), the 28×28 pixels are considered as 784 linear nodes. The number of
output nodes is 10, where each output node represents a particular digit. In the architecture,
the bias node and its weights are not shown for simplicity. Including the bias node with the
input, the actual dimension of Wh is Wh(784+1)×300; a similar consideration applies to the
output layer. In this sample architecture, 300 intermediate features are generated from the
784 input pixel values in the hidden layer, and finally the class labels are generated from
those 300 features.
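As a rough sanity check (a sketch, not part of the original text), the number of adjustable
weights in this sample architecture, including the bias nodes, can be computed as follows.

% Weight counts for the sample 784-300-10 NN, including bias nodes.
n_in = 784; n_hid = 300; n_out = 10;
w_hidden = (n_in  + 1) * n_hid;    % Wh: (784+1) x 300 = 235500
w_output = (n_hid + 1) * n_out;    % Wo: (300+1) x 10  = 3010
total    = w_hidden + w_output     % 238510 adjustable weights in total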
On the other hand, the Iris problem has four continuous inputs, in centimeters, for the
attributes: (i) sepal length, (ii) sepal width, (iii) petal length and (iv) petal width. Based on
the attribute values, the task is to classify a sample as one of the plant types (i) Setosa, (ii)
Versicolour, and (iii) Virginica. Therefore, an MLP architecture for Iris classification will
have four nodes in the input layer and three nodes in the output layer. Due to the small
number of input attributes and classes, a small NN is suitable for Iris classification. An
architecture with a single hidden layer of five hidden nodes may be represented as

NN: I4 → {Wh4×5} → H5 → {Wo5×3} → O3 .
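As a minimal sketch, assuming random (untrained) weights and sigmoid activations, the
forward pass of this 4-5-3 architecture may be written in Matlab as follows; the example
attribute values are illustrative only.

% Forward pass of the 4-5-3 Iris architecture with random weights (untrained).
sigm = @(v) 1 ./ (1 + exp(-v));

Wh = randn(5, 4 + 1);                 % hidden weights, +1 column for the bias
Wo = randn(3, 5 + 1);                 % output weights, +1 column for the bias

x  = [5.1; 3.5; 1.4; 0.2];            % one example: sepal/petal lengths and widths (cm)
h  = sigm(Wh * [1; x]);               % hidden responses (bias input fixed at 1)
y  = sigm(Wo * [1; h]);               % three class scores between 0 and 1

[~, predicted_class] = max(y)         % index of the most activated output node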

1.5 Training of a Neural Network


For a NN to perform a particular task, a proper architecture as well as appropriate weight
values are essential. The architecture mainly depends on the problem, as discussed above.
But obtaining the weight set is the most important and challenging issue in the field of NN,
because the same architecture performs different functions for different weight values, as
discussed for the single neuron in Section 1.2.1. For very small problems (such as the
common binary logics), the weight values may be assigned directly, but for large real-world
problems this is practically impossible. In this regard, an ANN has a natural propensity for
storing experiential knowledge in its synaptic weights, and this knowledge is acquired from
its environment through a learning process (Haykin, 1999).
The purpose of learning in a NN is to modify its synaptic weights so that, after
learning, the network correctly recognizes the class label of a particular pattern from its
input set or features (Siddique and Adeli, 2013). The procedure used to perform the learning
process is called training. Since synaptic weights store the knowledge, the goal of a
training algorithm is to modify the synaptic weights of a NN in an orderly fashion to attain
a desired objective. Back-propagation (BP) (Rumelhart et al., 1986) is the most popular
training method for multi-layered NNs. In BP, synaptic weights are adjusted as the output
error is propagated backward from the output layer to the input layer. To apply BP, the
connection weights of a network are initialized with random values in a small range.
Consider the most common three-layer NN architecture, shown in Fig. 1.11, to explain the
BP algorithm. The BP algorithm consists of two basic steps: a forward pass and a
backward pass. In the forward pass, the input values of an example or pattern are presented
to the network, the actual outputs are measured at the output layer by passing the responses
from the input layer to the output layer through the hidden layer, and then the error for the
pattern is calculated from the actual output and the desired output of that pattern. In the
backward pass, the connection weights are adjusted based on the calculated error. The
weights between the hidden and output layers are updated first, then the weights between
the input and hidden layers.
If a weight w sends input x to a neuron and f is the output of that neuron, then according to
BP learning, the weight correction (Δw) for that weight is given by the following equation:

\Delta w = \eta \, \delta \, x ,    (1.2)

where δ is the local gradient of the neuron and η is the learning rate. The learning rate merely
indicates the relative size of the change in weights and therefore affects the learning speed.
For a high value of η, the update due to one example may alter a weight value adversely with
respect to the other examples, resulting in oscillation. In general, the value of η is chosen in
a small range, such as between 0.1 and 0.3.
The local gradients of an output unit (δo) and a hidden unit (δh) are defined by Eqs. (1.3) and (1.4).

Figure 1.11: A typical three-layer feed-forward NN, I → {Wh} → H → {Wo} → O.



e f o
o   (1.3)
f o xo
f h
 h    o wo (1.4)
o xh
Here, xo and fo represents the net input (weighted sum) and the output of an output neuron,
respectively. e is the error that is defined by the difference between desired output and
actual response. The error function for n-th training pattern may be defined by the
following equation.
1
e(n)  (d (n)  f o (n)) 2 , (1.5)
2
where fo(n) is the actual output and d(n) is the desired output. To update weights, BP
algorithm requires partial derivative of Eq. (1.5) with respect to the output fo(n) and is
calculated as follows.
eo (n)
 d (n)  f o (n)  (1.6)
f o (n)
For sigmoid activation function f o (n)  1 / 1  exp  xo (n) and therefore
f o (n)
 f o (n)1  f o (n)  (1.7)
xo (n)
Now the local gradient of output unit (δo) becomes
 o  d (n)  f o (n) f o (n)1  f o (n) (1.8)
For the same sigmoid activation function the local gradient of hidden unit (δh) becomes

 h  f h (n)1  f h (n)   o wo (1.9)


o

Figure 1.12 shows the operational flowchart of the NN shown in Fig. 1.11. In the figure, the
upper part is the forward pass from the input to the actual output and then the error. The lower
portion demonstrates the generation of the local gradients and weight corrections from the
output layer back to the input layer. Every directed edge indicates the components required to
calculate a quantity. As an example, according to Eq. (1.8), the local gradient of the output
layer (i.e., δo) requires the actual output (fo) and the desired output (do). On the other hand,
according to Eq. (1.9), the local gradient of the hidden layer (i.e., δh) requires the hidden layer
output (fh), the local gradient of the output layer (i.e., δo) and the connecting weights between
the particular hidden node and the output nodes (wo).

Figure 1.12: Operational flowchart of a NN of Fig. 1.11.

In a NN architecture, the number of neurons in the hidden layer is chosen based on the
problem complexity. Such an architecture is commonly known as a multilayer perceptron
(MLP). Moreover, multiple hidden layers with distinct numbers of neurons may also be
considered for a better outcome. In that case, more steps are followed to update the NN,
calculating local gradients for the additional hidden layer units, but the calculation is similar
in fashion. Suppose the architecture of Fig. 1.11 with two hidden layers is

I → {Wh1} → H1 → {Wh2} → H2 → {Wo} → O
For such a case, Eq. (1.9) changes for the local gradient of the second hidden layer (H2) as

\delta_{h2} = f_{h2}(n)\,\big(1 - f_{h2}(n)\big)\,\sum_o \delta_o w_o .    (1.10)

On the other hand, the local gradient of the first hidden layer (H1) will be

\delta_{h1} = f_{h1}(n)\,\big(1 - f_{h1}(n)\big)\,\sum_{h2} \delta_{h2} w_{h2}    (1.11)
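To make the update rules concrete, a minimal Matlab sketch of one BP iteration for the
three-layer network of Fig. 1.11 is given below, implementing Eqs. (1.2), (1.8) and (1.9) with
sigmoid units; the network size, learning rate and the single training pattern are illustrative
assumptions only.

% One back-propagation iteration for a three-layer NN (Fig. 1.11),
% using sigmoid activations and the local gradients of Eqs. (1.8) and (1.9).
sigm = @(v) 1 ./ (1 + exp(-v));
eta  = 0.2;                               % learning rate (assumed, within 0.1-0.3)

Wh = 0.1 * randn(3, 2 + 1);               % 2 inputs -> 3 hidden units (+ bias column)
Wo = 0.1 * randn(1, 3 + 1);               % 3 hidden -> 1 output unit  (+ bias column)

x = [1; 0];  d = 1;                       % one training pattern (e.g., 1 XOR 0 = 1)

% Forward pass
fh = sigm(Wh * [1; x]);                   % hidden outputs
fo = sigm(Wo * [1; fh]);                  % actual output

% Backward pass: local gradients
delta_o = (d - fo) .* fo .* (1 - fo);                     % Eq. (1.8)
delta_h = fh .* (1 - fh) .* (Wo(:, 2:end)' * delta_o);    % Eq. (1.9), bias column excluded

% Weight corrections, delta_w = eta * delta * input  (Eq. (1.2))
Wo = Wo + eta * delta_o * [1; fh]';
Wh = Wh + eta * delta_h * [1; x]';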

1.6 Matlab Implementation of Sample NN


In the present version of Matlab, functions and tools are available for NN with various
features, and various tutorials with code are available to implement NN in Matlab (Palm,
2012). The code segments below follow the Deep Learning Toolbox of Palm (2012). Brief
descriptions of the basic functions, according to the operational steps, are as follows:

1. NN Setup:
The function nnsetup(.) creates a feedforward NN (i.e., MLP), taking the NN
architecture as a parameter and returning the NN as an object. CS1 is the code segment
that defines the NN for the sample case of classifying handwritten numerals from
28×28 (= 784) pixel images. In the function call, the parameters 784, 300 and 10 are
the numbers of neurons in the input, hidden and output layers, respectively. The function
returns the NN as the object nn. The setup function also defines other parameters, such
as the activation function and the learning rate.
CS1:
nn = nnsetup([784 300 10]);

2. NN Training:
The function nntrain(.) trains the NN; it takes the network object together with the
training data and returns the trained network. CS2 is the code segment for training: it
trains the neural network nn with input train_x and output train_y using the settings in
opts (epochs and batch size). It returns the NN nn with updated activations, errors,
weights and biases, and L, the sum squared error for each training batch.

CS2:
[nn, L] = nntrain(nn, train_x, train_y, opts);

The training operation is performed in three steps, and three functions are invoked in it:
(i) nnff(.) - performs a feedforward pass
(ii) nnbp(.) - performs backpropagation and calculates the gradients
(iii) nnapplygrads(.) - updates the weights and biases with the calculated gradients

3. NN Evaluation:
The nntest(.) function evaluates the performance of the NN, and CS3 is its code segment.
The function takes the network nn, the test input features test_x, and the desired
outputs test_y. It measures the actual response of the NN for the inputs, compares it
with the desired outputs, and finally returns the error rate er and the misclassified
pattern indices bad.
CS3:
[er, bad] = nntest(nn, test_x, test_y);
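Putting CS1-CS3 together, a possible end-to-end script is sketched below. The data file
mnist_uint8.mat, the pixel rescaling, and the option field names numepochs and batchsize
are assumptions based on the examples bundled with the toolbox (Palm, 2012), and should
be checked against the installed version.

% End-to-end sketch: set up, train and test the sample 784-300-10 NN
% with the Deep Learning Toolbox of Palm (2012). File and field names are assumed.
load mnist_uint8;                         % provides train_x, train_y, test_x, test_y
train_x = double(train_x) / 255;          % rescale pixel values to [0, 1]
test_x  = double(test_x)  / 255;
train_y = double(train_y);
test_y  = double(test_y);

nn = nnsetup([784 300 10]);               % CS1: define the architecture
opts.numepochs = 10;                      % passes over the training data (assumed)
opts.batchsize = 100;                     % examples per weight update (assumed)
[nn, L] = nntrain(nn, train_x, train_y, opts);   % CS2: train the network

[er, bad] = nntest(nn, test_x, test_y);   % CS3: testing error rate and misclassified indices
fprintf('Testing error rate: %.4f\n', er);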

1.7 Performance Measures


Learning and generalization are the most important topics in machine learning and neural
network research. Learning is the ability to approximate the underlying behavior
adaptively from the training data, while generalization is the ability to predict well beyond
the training data. Generalization is the more desirable and critical feature because the most
common use of a classifier is to make good predictions on new or unknown objects.
Commonly, generalization ability is measured on testing set data that are reserved from
the available data at the time of training. The testing error rate (TER), i.e., the rate of wrong
classification on the testing set, is a widely accepted quantitative measure, for which a
lower value is better. A number of benchmark problems are available to measure the TER
or generalization ability of neural networks or any other machine learning system.

1.8 Benchmark Data and Preprocessing for NNs


A benchmark is a point of reference by which something can be measured. For neural
network or machine learning research, benchmark problems are open problems from
different domains used to measure the effectiveness of a method. In general, the datasets
of benchmark problems are publicly available; a method is tested on a subset of the
problems and its performance is compared with other methods. Benchmark problems
cover diverse application domains such as human life, finance, animals, plants and
molecular biology.
Table 1.3: Characteristics of benchmark datasets (Cont. and Discr. are the numbers of
continuous and discrete input features).

Dataset                           Examples  Classes  Cont.  Discr.  NN Input  Hidden Nodes
Australian Credit Card                 690        2      6       9        51            10
Breast Cancer Wisconsin                699        2      9       -         9             5
Diabetes                               768        2      8       -         8             5
Heart Disease Cleveland                303        2      6       7        35             5
Hypothyroid                           7200        3      6      15        21             5
Ionosphere                             351        2     34       -        34            10
Iris Plants                            150        3      4       -         4             5
King+Rook vs King+Pawn                3196        2      -      36        74            10
Lungcancer                              32        3      -      56        56            10
Mushroom                              8124        2      -      22       125            10
Soybean                                683       19      -      35        82            25
Segmentation                          2310        7     19       -        19            10
Sonar                                  208        2     60       -        60            10
Splice Junction Gene Sequences        3175        3      -      60        60            10
Satellite                             6435        6     36       -        36            10
Wine                                   178        3     13       -        13             5
Waveform                              5000        3     21       -        21            10
Zoo                                    101        7     15       1        16            10


The most popular benchmark dataset collection is the University of California, Irvine (UCI)
machine learning repository (Newman et al., 1998). The UCI collection contains only raw
data, which must be preprocessed before use in neural networks. Some preprocessed
versions of UCI datasets have also been made publicly available; Proben1 (Prechelt, 1994)
is an example of such preprocessed data. Besides, various individuals and groups maintain
different benchmark dataset collections for specific purposes, for example, Delve
(Rasmussen & Neal, 2003) and the Orange Datasets (Demsar et al., 2004). Table 1.3
summarizes several benchmark datasets from UCI that have been used in the experiments
of various NN studies (Akhand et al., 2009; Opitz & Maclin, 1999). The description of the
problems is available at the UCI website (Newman et al., 1998).
An artificial NN manipulates numeric inputs, and it stores knowledge as the numeric weight
values of interconnected neurons, i.e., synaptic weights. However, benchmark problems are
generally represented as a combination of numeric values (e.g., age, salary) and various
discrete values (e.g., sex, color). For this reason, the raw data of a benchmark problem must
be processed before use in NNs. For input and output presentation, there is a general rule
to prepare a dataset (Prechelt, 1994). Each numeric feature (attribute) is considered as one
input and is rescaled between 0 and 1 with a linear function to avoid the saturation region
of the sigmoid output. For a binary discrete feature, one 0/1 input is used. For a multi-valued
discrete feature, the number of input nodes is normally the number of distinct discrete
values; the node corresponding to a particular discrete value is set to 1 while the others are
set to 0. In the UCI datasets, some problems contain missing values for some attributes. In
general, a missing value is replaced by the average of the existing values of that attribute.
The output is encoded in a 1-of-c representation for c classes, where the output node of the
target class takes the highest activation (i.e., 1) while the other output nodes take the lowest
activation (i.e., 0).
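The rules above may be sketched in Matlab as follows; the attribute names and toy values
are purely illustrative assumptions, not taken from any benchmark dataset.

% Sketch of the general preprocessing rules for NN input/output preparation.
% Continuous attribute: replace missing values (NaN) by the mean, then rescale to [0, 1].
age = [25; 40; NaN; 63];                          % toy continuous attribute
age(isnan(age)) = mean(age(~isnan(age)));         % missing value -> attribute mean
age_scaled = (age - min(age)) / (max(age) - min(age));

% Multi-valued discrete attribute: one input node per distinct value (1-of-k coding).
color    = [1; 3; 2; 1];                          % e.g., 1 = red, 2 = green, 3 = blue
color_in = zeros(numel(color), 3);
color_in(sub2ind(size(color_in), (1:numel(color))', color)) = 1;

% Output: 1-of-c coding for c classes.
class  = [2; 1; 3; 2];                            % class labels of the four examples
target = zeros(numel(class), 3);
target(sub2ind(size(target), (1:numel(class))', class)) = 1;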
In Table 1.3, the NN Input column gives the number of input nodes actually used for a
particular problem after this manipulation of its original input features (attributes). The
number of hidden neurons in each network is chosen based on the number of inputs and
outputs of the particular problem, as used in various studies (Opitz & Maclin, 1999;
Akhand et al., 2009; Akhand & Murase, 2010). The general criteria for selecting hidden
neurons are: at least one hidden neuron per output, at least one hidden neuron for every ten
inputs, and at least five hidden neurons in total.

1.9 Features of NN

An artificial NN possesses several distinguishing features. It requires numerical values as
inputs and generates layer-wise responses. From the final output it measures the deviation
from the desired output as an error, which is the element used to update the weights of the
NN. At the beginning, the weights are initialized with random values. Some features of NN
are discussed below.

a. Numerical Linear Input


A NN takes numerical feature values as a linear input and generates its output through one
or more hidden layers. A problem may contain both numerical and discrete values; in such
a case, the discrete values must be converted to numerical values. Moreover, regardless of
the problem dimension, a NN takes its input as a single dimension. Two-dimensional image
data are also considered as one-dimensional pixel values or as features extracted from the
images.

b. Fully Connected and Separate Weight for Each Neuron


In a feed-forward NN architecture, a neuron takes the responses of all the neurons of the
preceding layer and contributes to generating the responses of all the neurons of the
succeeding layer. Each neuron maintains a separate and dedicated weight connection to
each individual neuron of the preceding layer to scale its response up or down when
generating its own response. Therefore, the number of weights increases with the number
of hidden layers and the number of neurons in a layer. Since there is no weight reuse, the
computation increases along with the NN size as the number of weights grows.

c. Train Whole Network Together


Training a NN follows several sequential steps. A NN first calculates the responses of the
neurons in the hidden layers and then the final NN output in the output layer neurons. To
update the weight values, the local gradients of each layer's neurons are calculated
cumulatively from the output layer back to the hidden layers. The local gradient of a neuron
is a weighted sum of the local gradients of all neurons of the succeeding layer, as in Eq.
(1.10) and Eq. (1.11). Consequently, updating the weights connected to the input layer
requires the local gradients of the layers toward the output layer to be computed first.
Therefore, the whole NN is updated together and no partial update is possible.

d. Update Depends on Final NN Output


The local gradient of the output layer neurons depends on the error generated by the
mismatch between the actual output and the desired output. The output of a neuron comes
from the activation function applied to the weighted sum of the previous layer's responses.
If the final output matches the desired output, the local gradient of the output layer neurons
will be zero. Since the local gradient of the output layer is transferred backward toward the
input layer, the weights of a layer will not be updated if the weights connected to the output
layer are not updated. In other words, training a NN by BP starting from randomly
initialized parameters does not work very well and easily gets stuck in undesired local
optima. This prevents the lower layers from learning useful features.

1.10 Motivation to Deep Architectures


Deep learning is a new area of machine learning research that introduces more of the
human learning perspective into ANN. In general, a traditional NN may be transformed into
a deep architecture with several hidden layers, but training such a NN architecture is
difficult. On the other hand, a NN is limited in its ability to process natural data: it requires
the important features of the input to be designed manually and/or presented linearly to the
NN regardless of the variety of the data. A NN automatically learns to map the input
features to the outputs without any sense of the data. This works well for simple pattern
recognition problems, but to obtain better performance on complex and large cases an
updated learning methodology is required so that the system can make sense of the input
data. In this context, deep learning is about learning multiple levels of representation and
abstraction that help to make sense of data such as images, sound, and text. The main
motivations for learning deep architectures are: (1) insufficient depth can hurt, (2) the brain
has a deep architecture, and (3) cognitive processes seem deep.

a. Insufficient Depth can Hurt


A shallow NN with a single hidden layer is enough in many cases (e.g., logic gates, formal
threshold neurons, sigmoid neurons) to represent any function with a given target accuracy,
but this may come at the price that the required number of nodes (i.e., computations) grows
very large. Many problems can be represented efficiently with a deep architecture that
cannot be represented efficiently with a shallow one. Moreover, training a NN by BP
starting from randomly initialized parameters does not work very well and easily gets stuck
in undesired local optima. Therefore, a kind of factorization is considered in deep learning
for efficient training.

b. The Brain has a Deep Architecture


The human brain maintains connections in a deep architecture, but it is not fully connected.
The visual cortex is well studied and shows a sequence of areas, each of which contains a
representation of the input, with signals flowing from one to the next. There are also skip
connections and, at some levels, parallel paths. Each level of this feature hierarchy represents
the input at a different level of abstraction, with more abstract features further up in the
hierarchy, defined in terms of the lower-level ones. The connections in the brain are largely
local and sparse: only about 1% of neurons are active simultaneously in the brain. Given the
huge number of neurons, this is still a very efficient representation.

c. Cognitive Processes seem Deep


Humans organize their ideas and concepts hierarchically. Humans first learn simpler
concepts and then compose them to represent more abstract ones. Likewise, engineers break
up solutions into multiple levels of abstraction and processing. A traditional NN suffers from
poor introspection ability (i.e., the ability to focus on a small important area) due to its global
connectivity. Such concepts are adopted in deep learning for better performance.

References
Akhand, M. A. H., Islam, M. M., & Murase, K. (2009). A Comparative Study of Data
Sampling Techniques for Constructing Neural Network Ensembles, International
Journal of Neural Systems 19(2), 67-89.

Akhand, M. A. H. & Murase, K. (2010). Neural Networks Ensembles: Existing Methods and
New Techniques, ISBN-10: 3838391373 & ISBN-13: 978-3838391373, LAP LAMBERT Academic
Publishing, 2010.

Demsar, J., Zupan, B., & Leban, G. (2004). Orange Datasets, AI Laboratory, Faculty of
Computer and Information Science, University of Ljubljana.
(http://www.ailab.si/orange/datasets.asp)

Graupe, D. (2007). Principles of Artificial Neural Networks. World Scientific Publishing
Co., 2nd edition.

Haykin, S. (1999). Neural Networks – A Comprehensive Foundation. Prentice Hall, 2nd
edition.

Jain, A. K., Mao, J., & Mohiuddin, K. M. (1996). Artificial Neural Networks: A Tutorial.
IEEE Computer 29 (3), 31-44.
Krose, B. & Smagt, P. (1996). An Introduction to Neural Networks. 8th edition.

Newman, D. J., Hettich, S., Blake, C. L., & Merz, C. J. (1998). UCI Repository of
Machine Learning Databases. Department of Information and Computer Science,
University of California Irvine. (http://www.ics.uci.edu/~mlearn/)

Opitz, D. W. & Maclin, R. (1999). Popular Ensemble Methods: An Empirical Study.
Journal of Artificial Intelligence Research 11, 169-198.

Prechelt, L. (1994). Proben1- A Set of Benchmarks and Benching Rules for Neural
Network Training Algorithms. Tech. rep. 21/94, Fakultat fur Informatik, University
of Karlsruhe, Germany.

Palm, R. B. (2012). Deep Learning Toolbox. Available at
https://github.com/rasmusbergpalm/DeepLearnToolbox/

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning internal representations by
error propagation. In Rumelhart, D., & McClelland, J. (Eds.), Parallel Distributed
Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations,
pp. 318-363. MIT Press, Cambridge, MA.

Rasmussen, C. E. & Neal, R. M. (2003). Delve - Data for Evaluating Learning in Valid
Experiments, Department of Computer Science, University of Toronto, Canada.
(http://www.cs.toronto.edu/~delve/data/datasets.html)

Siddique, N. & Adeli, H. (2013). Computational Intelligence: Synergies of Fuzzy Logic,
Neural Networks and Evolutionary Computing. John Wiley and Sons, ISBN: 978-1-
1183-3784-4, 536 pages.
