Data-Mining and Knowledge Discovery, Neural Networks in
that is, a mapping from observations about an item to conclusions about its target value. Their article concentrates on the machine learning view.

Orlov, Sipper and Hauptman (see Genetic and Evolutionary Algorithms and Programming: General Introduction and Application to Game Playing) cover genetic and evolutionary algorithms, which are a family of search algorithms inspired by the process of (Darwinian) evolution in Nature. Common to all the different family members is the notion of solving problems by evolving an initially random population of candidate solutions, through the application of operators like crossover and mutation inspired by natural genetics and natural selection, such that in time “fitter” (i.e., better) solutions emerge. The field, whose origins can be traced back to the 1950s and 1960s, has come into its own over the past two decades, proving successful in solving multitudinous problems from highly diverse domains including (to mention but a few): optimization, automatic programming, electronic-circuit design, telecommunications, networks, finance, economics, image analysis, signal processing, music, and art.

Berkhin and Dhillon (see Knowledge Discovery: Clustering) discuss the condition where data found in scientific and business applications usually do not fit a particular parametrized probability distribution. In other words, the data are complex. Knowledge discovery starts with exploration of this complexity in order to find inconsistencies, artifacts, errors, etc. in the data. After the data are cleaned, they are usually still extremely complex. Descriptive data mining deals with comprehending and reducing this complexity. Clustering is a premier methodology in descriptive unsupervised data mining. A cluster could represent an important subset of the data such as a galaxy in astronomical data or a segment of customers in marketing applications. Clustering is important as a fundamental technology to reduce data complexity and to find data patterns in an unsupervised fashion. It is universally used as a first technology of choice in data exploration.

Džeroski, Panov and Zenko (see Machine Learning, Ensemble Methods in) cover ensemble methods, which are machine learning methods that construct a set of predictive models and combine their outputs into a single prediction. The purpose of combining several models together is to achieve better predictive performance, and it has been shown in a number of cases that ensembles can be more accurate than single models. While some work on ensemble methods was already done in the 1970s, it was not until the 1990s, and the introduction of methods such as bagging and boosting, that ensemble methods started to be more widely used. Today, they represent a standard machine learning method which has to be considered whenever good predictive accuracy is demanded.

Liu and Zhao (see Manipulating Data and Dimension Reduction Methods: Feature Selection) cover feature selection, which is the study of algorithms for reducing dimensionality of data for various purposes. One of the most common purposes is to improve machine learning performance. The other purposes include simplifying data description, streamlining data collection, improving comprehensibility of the learned models, and helping gain insight through learning. The objective of feature selection is to remove irrelevant and/or redundant features and retain only relevant features. Irrelevant features can be removed without affecting learning performance. Redundant features are a type of irrelevant features. The distinction is that a redundant feature implies the co-presence of another feature; individually, each feature is relevant, but the removal of either one will not affect learning performance. As a plethora of data are generated by every possible means with the exponentially decreasing costs of data storage and computer processing power, data dimensionality increases on a scale beyond imagination in cases ranging from transactional data to high-throughput data. In many fields such as medicine, health care, Web search, and bioinformatics, it is imperative to reduce high dimensionality such that efficient data processing and meaningful data analysis can be conducted in order to mine nuggets from high-dimensional, massive data.

Data-Mining and Knowledge Discovery, Neural Networks in

MARKUS BRAMEIER
Bioinformatics Research Center, University of Aarhus, Århus, Denmark

Article Outline

Glossary
Definition of the Subject
Introduction
Neural Network Learning
Feedforward Neural Networks
Backpropagation
Other Learning Rules
Other Neural Network Architectures
Self-organizing Maps
Future Directions
Bibliography
Hardware implementations of larger neural networks were realized on parallel computers or in neural network chips with multiple units working simultaneously.

[...] between neurons also do not permit simpler substructures in the model to be identified. Therefore, an extraction of the acquired knowledge from such black box predictors to understand the underlying model is almost impossible.
[...] vectors $\vec{x}$. That means, it calculates how far away the current state $f$ of the network is from the optimal solution $f^*$ with $E(f^*) \le E(f)\ \forall f \in F$.

A neural network cannot perfectly learn a mapping if the input data does not contain enough information to derive the desired outputs. It may also not converge if there is not enough data available (see also below).

Unsupervised learning uses no external teacher and only local information. It is distinguished from supervised learning by the fact that there is no a priori output. In unsupervised learning we are given some input data $\vec{x}$, and the cost function to be minimized can be any function of $\vec{x}$ and the network output $f(\vec{x})$. Unsupervised learning incorporates self-organization, i.e., organizes the input data by using only their inherent properties to reveal their emergent collective properties.

A neural network learns offline if learning phase and operation (application) phase are separated. A neural network learns online if both happen at the same time. Usually, supervised learning is performed offline, whereas unsupervised learning is performed online.

In reinforcement learning, neither inputs $\vec{x}$ nor outputs $\vec{y}$ are given explicitly, but are generated by the interactions of an agent within an environment. The agent performs an action $\vec{y}$ with costs $c$ according to an observation $\vec{x}$ made in the environment. The aim is to discover a policy or plan for selecting actions that minimizes some measure of the expected total costs.

Overtraining and Generalization

The overall motivation and most desirable property of neural networks is their ability to generalize to new unknown data, i.e., to correctly classify patterns on which they have not been trained. Minimizing the network error on the training examples only does not automatically minimize the real error of the unknown underlying function. This important problem is called overfitting or overtraining.

A regular distribution of training examples over the input data space is important. Generalization is reasonable only as long as the data inputs remain inside the range for which the network was trained. If the training set only included vectors from a certain part of the data space, predictions on other parts are random and likely wrong.

Overtraining may also occur when the iterative training algorithm is run for too long or if the network is too complex for the problem to solve or for the available quantity of data. A larger neural network with more weights models a more complex function and invariably achieves a lower error, but is prone to overfitting. A network with fewer weights, on the other hand, may not be sufficiently powerful to model the underlying function.

A simple heuristic, called early stopping, helps to ensure that the network will generalize well to examples not in the training set. One solution is to check progress during training against an independent data set, the validation set. As training progresses, the training error naturally decreases monotonically and, provided training is minimizing the true error function, the validation error decreases as well. However, if the validation error stops dropping or even starts to increase again, this is an indication that the network is starting to overfit the data. Then the optimization process has become stuck in a local minimum and training should be stopped. The weights that produced the minimum validation error are then used for the final model.

In this case of overtraining, the size of the network, i.e., the number of hidden units and/or hidden layers, may be decreased. Neural networks typically involve experimenting with a large number of different configurations, training each one a number of times while observing the validation error. A problem with repeated experimentation is that the validation set is actually part of the training process. One may just find a network by chance that happens to perform well on the validation set. It is therefore normal practice to reserve a third set of examples, the test set, for testing the final model.

In many cases a sufficient amount of data is not available, however. Then we have to get around this problem by resampling techniques, like cross validation. In principle, multiple experiments are conducted, each using a different division of the available data into training and validation set. This should remove any sampling bias. For small data sets, where splitting the data would leave too few observations for training, leave-one-out validation may be used to determine when to stop training or the optimal network size.

Machine-learning techniques, like neural networks, require both positive and negative training examples for solving classification problems. Because they minimize an overall error, the proportion of positive and negative examples in the training set is critical. Ideally, the relation should be close to the (usually unknown) real distribution in the data space. Otherwise, it may bias the network’s decision to be more often wrong on unknown data.

Feedforward Neural Networks

In feedforward neural networks the information is passed in only one direction (forward) from the inputs, through the hidden nodes (if any) to the output nodes. There are
no connections backwards to neurons of upper layers, i.e., there are no feedback loops or cycles in the network.

In feedforward networks with a single layer of weights, the inputs are directly connected to the output units (single-layer neural networks). Multi-layer feedforward networks use additional intermediate layers of hidden units. Neural networks with two or more processing layers may have far greater processing power than networks with only one layer. Single-layer neural networks are only capable of learning linearly separable patterns and functions.

The Perceptron

The simplest kind of feedforward neural network is a perceptron, which consists of a single pseudo input layer and one or more processing nodes in the output layer. All inputs are weighted and fed directly to the output neuron(s) (see Fig. 1). Each node calculates the sum of the products of weights and inputs. If this value is above some threshold (typically 0) the neuron takes the activated value 1, otherwise it outputs 0. Neurons with this kind of activation function are called threshold units. In the literature the term perceptron often refers to networks consisting of only one (output) neuron.

More formally, the perceptron is a linear binary classifier that maps a binary n-dimensional input vector $\vec{x} \in \{0,1\}^n$ to a binary output value $f(\vec{w} \cdot \vec{x}) \in \{0,1\}$ calculated as

$$f(s) = \begin{cases} 1 & \text{if } s > T \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

where $s = \vec{w} \cdot \vec{x} = \sum_{i=1}^{n} w_i x_i$ is the input sum and $\vec{w}$ is a vector of real-valued weights. The constant threshold $T$ does not depend on any input value.

In addition to the network topology, the learning rule is an important component of neural networks. Perceptrons can be trained by a simple learning algorithm, called the delta rule or perceptron learning rule. This realizes a simple stochastic gradient descent where the weights of the network are adjusted depending on the error between the predicted outputs of the network and the example outputs. The delta rule changes the weight vector such that the output error is minimized. McClelland and Rumelhart [30] proved that a neural network using the delta rule can learn associations whenever the inputs are linearly independent.

All neurons of a perceptron share the same structure and learning algorithm. Each weight $w_{ij}$, representing the influence of input $x_i$ on neuron $j$, is updated at time $t$ according to the rule:

$$w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij} \tag{2}$$

$$\Delta w_{ij} = \alpha (o_i - y_i)\, x_{ij} . \tag{3}$$

The network learns by updating the weight vector after each iteration (training example) by an amount proportional to the difference between given output $o_i$ and calculated output $y_i = f(s_i)$. The learning rate $\alpha$ is a constant with $0 < \alpha < 1$ and regulates the learning speed.

The training data set is linearly separable in n-dimensional data space if its two classes of vectors $\vec{x}$ can be separated by an $(n-1)$-dimensional hyperplane. If the training examples are not linearly separable, the perceptron learning algorithm is not guaranteed to converge. Linear classifiers, like single-unit perceptrons, are only able to learn, i.e., perfectly classify, linearly separable patterns because they can only implement a simple decision surface (single hyperplane) [33]. The same is true for single-layer neural networks with more than one output unit. This makes these linear neural networks unable to learn, for example, the XOR function [33]. Nevertheless, a problem that is thought to be highly complex may still be solved as well by a linear network as by a more powerful (non-linear) neural network.
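The perceptron learning rule of Eqs. (1) to (3) can be illustrated with a minimal Python sketch. It is not taken from the article: the function names, the learning rate of 0.5, the number of epochs, and the constant bias input (which lets the threshold T stay at 0 while an equivalent offset is learned) are assumptions made for illustration only.

```python
def step(s, threshold=0.0):
    """Threshold activation of Eq. (1): 1 if s > T, otherwise 0."""
    return 1 if s > threshold else 0


def predict(w, x):
    """Output of a single-unit perceptron for input x.

    Assumption of this sketch: a constant input of 1 is appended to x,
    so the last weight acts as a learned offset and T can remain 0.
    """
    xb = list(x) + [1]
    return step(sum(wi * xi for wi, xi in zip(w, xb)))


def train_perceptron(examples, alpha=0.5, epochs=20):
    """Train with the delta rule: w(t+1) = w(t) + alpha * (o - y) * x."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                # n input weights plus the bias weight
    for _ in range(epochs):
        for x, o in examples:          # update after every training example
            y = predict(w, x)
            xb = list(x) + [1]
            for i in range(n + 1):
                w[i] += alpha * (o - y) * xb[i]
    return w


# AND is linearly separable, XOR is not.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_and = train_perceptron(AND)
and_preds = [predict(w_and, x) for x, _ in AND]

w_xor = train_perceptron(XOR)
xor_preds = [predict(w_xor, x) for x, _ in XOR]

print("AND predictions:", and_preds)
print("XOR predictions:", xor_preds)
```

For the AND patterns the weights settle after a few epochs and all four examples are classified correctly. For XOR no weight vector can classify all four examples, so the predictions always contain at least one error, which is the limitation of a single hyperplane described above.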
Data-Mining and Knowledge Discovery, Neural Networks in, Figure 1: Principle structure of single-unit perceptron

Single-Layer Neural Networks

In general, the state of a neuron is represented by its activation value. An activation or transfer function $f$ calculates the activation value of a unit from the weighted sum $s$ of its inputs. In case of the perceptron $f$ is called step or threshold function, with the activation value being 1 if the network sum is greater than a constant $T$, and 0 otherwise (see Eq. (1)). Another common form of non-linear activation function is the logistic or sigmoid function:

$$f(s) = \frac{1}{1 + e^{-s}} . \tag{4}$$

This enables a neural network to compute a continuous network output between 0 and 1 instead of a step function. With
this choice, a single-layer network is identical to a logistic regression model. If the activation function is linear, i.e., the identity, then this is just a multiple linear regression and the output is proportional to the total weighted sum $s$.

Multi-Layer Neural Networks

The limitation that non-linearly separable functions cannot be represented by a single-layer network with fixed weights can be overcome by adding more layers. A multi-layer network is a feedforward network with two or more layers of computational units, interconnected such that the neurons’ outputs of one layer serve as inputs only to neurons of the directly subsequent layer (see Fig. 2). The input layer is not considered a real layer with processing neurons.

The number of units in the input layer is determined by the problem, i.e., the dimension of the input data space. The number of output units also depends on the output encoding (see Subsect. “Application Issues”).

By using hidden layers, the partitioning of the data space can be more effective. In principle, each hidden unit adds one hyperplane to divide the space and discriminate the solution. The XOR problem is solvable only if the outputs of at least two neurons are combined in a third neuron. Important issues in multi-layer NN design are, thus, the specification of the number of hidden layers and the number of units in these layers (see also Subsect. “Application Issues”). Both numbers determine the complexity of functions that can be modeled. There is no theoretical limitation on the number of hidden layers, but usually one or two are used.

The universal approximation theorem for neural networks states that any continuous function that maps intervals of real numbers to an output interval of real numbers can be approximated arbitrarily closely by a multi-layer neural network with only one hidden layer and certain types of non-linear activation functions. This gives, however, no indication about how fast or likely a solution is found. Networks with two hidden layers may work better for some problems. However, more than two hidden layers usually provide only marginal benefit compared to the significant increase in training time.

Any multi-layer network with fixed weights and linear activation function is equivalent to a single-layer (linear) network: In the case of a two-layer linear system, for instance, let all input vectors to the first layer form matrix $X$, and let $W_1$ and $W_2$ be the weight matrices of the two processing layers. Then the output $Y_1 = W_1 X$ of the first layer is input to the second layer, which produces output $Y_2 = W_2 (W_1 X) = (W_2 W_1) X$. This is equivalent to a single-layer network with weight matrix $W = W_2 W_1$. Only a multi-layer network that is non-linear can provide more computational power. In many applications these networks use a sigmoid function as non-linear activation function. This is the case at least for the hidden units. For the output layer, the sigmoid activation function is usually applied with classification problems, while a linear transfer function is applied with regression problems.

Error Function and Error Surface

The error function derives the overall network error from the difference of the network’s output $y_{ij}$ and target output $o_{ij}$ over all examples $i$ and output units $j$. The most common error function is the sum squared error:

$$E = \frac{1}{2} \sum_{i} \sum_{j} (o_{ij} - y_{ij})^2 . \tag{5}$$
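Two statements from this section lend themselves to a quick numerical check: the collapse of two linear layers into one, $W_2 (W_1 X) = (W_2 W_1) X$, and the sum squared error of Eq. (5). The following Python sketch uses small, hypothetical integer matrices chosen only for illustration; the helper names `matmul` and `sum_squared_error` are likewise assumptions of this sketch and do not come from the article.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def sum_squared_error(targets, outputs):
    """Eq. (5): half the sum of (o - y)^2 over all examples and output units.

    Both arguments are matrices of the same shape; since every entry is
    summed, it does not matter whether examples are stored as rows or columns.
    """
    return 0.5 * sum((o - y) ** 2
                     for o_row, y_row in zip(targets, outputs)
                     for o, y in zip(o_row, y_row))


# Hypothetical linear network: 2 inputs -> 3 units -> 1 output unit.
W1 = [[1, 2], [0, -1], [3, 1]]    # first processing layer (3 x 2)
W2 = [[2, 1, -1]]                 # second processing layer (1 x 3)
X = [[1, 0, 2],                   # each column is one input vector
     [0, 1, 1]]

Y2_two_layer = matmul(W2, matmul(W1, X))     # W2 (W1 X)
Y2_single = matmul(matmul(W2, W1), X)        # (W2 W1) X

targets = [[0, 2, 1]]                        # hypothetical target outputs
E = sum_squared_error(targets, Y2_single)

print(Y2_two_layer, Y2_single, E)
```

Both ways of computing the output give the same matrix, so the second linear layer adds no representational power, and the error of this toy network against the chosen targets follows directly from Eq. (5).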
The speed of learning is the rate of convergence between the current solution and the global minimum.

In a linear network with a sum-squared error function, the error surface is a multi-dimensional parabola, i.e., has only one minimum. In general, it is not possible to analytically determine where the global minimum of the error surface is. Training is essentially an exploration of the error surface. Because of the probabilistic and often highly non-linear modeling by neural networks, we cannot be sure that the error could not be lower still, i.e., that the minimum we found is the absolute one. Since the shape of the error space cannot be known a priori, neural network analysis requires a number of independent runs to determine the best solution. When different initial values for the weights are selected, different network models will be derived.

From an initially random configuration of the network, i.e., a random point on the error surface, the training algorithm starts seeking the global minimum. Small random values are typically used to initialize the network weights. Although neural networks resulting from different initial weights may have very different parameter settings, their prediction errors usually do not vary dramatically. Training is stopped when a maximum number of epochs has expired or when the network error does not improve any further.

Backpropagation

The best known and most popular training algorithm for multi-layer networks is backpropagation, short for backwards error propagation and also referred to as the generalized delta rule [46]. The algorithm involves two phases:

Forward pass. During the first phase, the free parameters (weights) of the network are fixed. An example pattern is presented to the network and the input signals are propagated through the network layers to calculate the network output at the output unit(s).

Backward pass. During the second phase, the model parameters are adjusted. The error signals at the output units, i.e., the differences between calculated and expected outputs, are propagated back through the network layers. In doing so, the error at each processing unit is calculated and used to make adjustments to its connecting weights such that the overall error of the network is reduced by some small amount.

After iteratively repeating both phases for a sufficiently large number of training cycles (epochs) the network will converge to a state where its output error is small enough. The backpropagation rule involves the repeated use of the chain rule, saying that the output error of a neuron can be ascribed partly to errors in the weights of its direct inputs and partly to errors in the outputs of higher-level (hidden) nodes [46].

Moreover, backpropagation learning may happen in two different modes. In sequential mode or online mode, weight adjustments are made example by example, i.e., each time an example pattern has been presented to the network. In batch mode or offline mode, adjustments are made epoch by epoch, i.e., only after all example patterns have been presented. Theoretically, the backpropagation algorithm performs gradient descent on the total error only if the weights are updated epoch-wise. There are empirical indications, however, that a pattern-wise update results in faster convergence. The training examples should be presented in random order. Then the precision of predictions will be more similar over all inputs.

Backpropagation learning requires a differentiable activation function. Besides adding non-linearity to multi-layer networks, the sigmoid activation function (see Eq. (4)) is often used in backpropagation networks because it has a continuous derivative that can be calculated easily:

$$f'(s) = f(s)\,(1 - f(s)) . \tag{6}$$

We further assume that there is only one hidden layer in order to keep notations and equations clear. A generalization to networks with more than one hidden layer is straightforward.

The backpropagation rule is a generalization of the delta learning rule (see Eq. (3)) to multi-layer networks with non-linear activation function. For an input vector $\vec{x}$ the output $y = f(s)$ is calculated at each output neuron of the network and compared with the desired target output $o$, resulting in an error $\delta$. Each weight is adjusted proportionally to its effect on the error. The weight of a connection between a unit $i$ and a unit $j$ is updated depending on the output of $i$ (as input to $j$) and the error signal at $j$:

$$\Delta w_{ij} = \alpha\, \delta_j\, y_i . \tag{7}$$

For an output node $j$ the error signal (error surface gradient) is given by:

$$\delta_j = (o_j - y_j)\, f'(s_j) = (o_j - y_j)\, y_j (1 - y_j) . \tag{8}$$

If the error is zero, no changes are made to the connection weight. The larger the absolute error, the more the responsible weight is changed, while the sign of the error determines the direction of change.

For a hidden neuron $j$ the error signal is calculated recursively using the signals of all directly connected output neurons $k$:

$$\delta_j = f'(s_j) \sum_{k} \delta_k w_{jk} = y_j (1 - y_j) \sum_{k} \delta_k w_{jk} . \tag{9}$$
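Equations (7) to (9) can be written out for a network with one hidden layer of sigmoid units, as a minimal Python sketch under stated assumptions. It covers a single pattern-wise update without bias terms (the equations above do not mention them), and the weight values, the learning rate of 0.25, and all function names are hypothetical choices made for illustration.

```python
import math


def sigmoid(s):
    """Logistic activation of Eq. (4)."""
    return 1.0 / (1.0 + math.exp(-s))


def forward(x, w_hidden, w_out):
    """Forward pass: returns the hidden-unit outputs and the network outputs."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w_out]
    return h, y


def error(x, o, w_hidden, w_out):
    """Sum squared error of Eq. (5) for one example."""
    _, y = forward(x, w_hidden, w_out)
    return 0.5 * sum((ok - yk) ** 2 for ok, yk in zip(o, y))


def backprop_step(x, o, w_hidden, w_out, alpha):
    """Backward pass: weight changes for one example.

    w_hidden[j][i] connects input i to hidden unit j;
    w_out[k][j] connects hidden unit j to output unit k.
    """
    h, y = forward(x, w_hidden, w_out)
    # Eq. (8): error signal at each output unit
    delta_out = [(ok - yk) * yk * (1 - yk) for ok, yk in zip(o, y)]
    # Eq. (9): error signal at each hidden unit, summed over connected outputs
    delta_hid = [h[j] * (1 - h[j]) *
                 sum(delta_out[k] * w_out[k][j] for k in range(len(w_out)))
                 for j in range(len(h))]
    # Eq. (7): change = alpha * (error signal at target) * (output of source)
    dw_out = [[alpha * delta_out[k] * h[j] for j in range(len(h))]
              for k in range(len(w_out))]
    dw_hid = [[alpha * delta_hid[j] * x[i] for i in range(len(x))]
              for j in range(len(h))]
    return dw_hid, dw_out


# Hypothetical 2-2-1 network and one training example.
x, o = [1.0, 0.0], [1.0]
w_hidden = [[0.5, -0.3], [0.1, 0.8]]
w_out = [[0.7, -0.4]]
alpha = 0.25

dw_hid, dw_out = backprop_step(x, o, w_hidden, w_out, alpha)
print(dw_hid, dw_out)
```

Dividing each weight change by the learning rate gives the negative partial derivative of the error with respect to that weight, which can be confirmed by perturbing a single weight slightly and recomputing `error`. Weights attached to an input of 0 receive no change, as Eq. (7) implies.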
The partial derivative of the error function with respect to the network weights can be calculated purely locally, such that each neuron needs information only from neurons directly connected to it. A theoretical foundation of the backpropagation algorithm can be found in [31].

The backpropagation algorithm performs a gradient descent by calculating the gradient vector of the error surface at the current search point. The negative of this vector points in the direction of the steepest descent. Moving in this direction will decrease the error and will eventually find a new (local) minimum, provided that the step size is adapted appropriately. Small steps slow down learning speed, i.e., require a larger number of iterations. Large steps may converge faster, but may also overstep the solution or make the algorithm oscillate around a minimum without convergence of the weights. Therefore, the step size is made proportional to the slope $\delta$, i.e., is reduced when the search point approaches a minimum, and to the learning rate $\alpha$. The constant $\alpha$ allows one to control the size of the gradient descent step and is usually set to be between 0.1 and 0.5. For practical purposes, it is recommended to choose the learning rate as large as possible without leading to oscillation.

Momentum

One possibility to avoid oscillation and to achieve faster convergence is the addition of a momentum term that is proportional to the previous weight change:

$$\Delta w_{ij}(t+1) = \alpha\, \delta_j\, y_i + \beta\, \Delta w_{ij}(t) . \tag{10}$$

The algorithm increases the step size, and thus the learning speed, if it has taken several steps in the same direction. This gives it the ability to overcome obstacles in the error surface, e.g., to avoid and escape from local minima, and to move faster over larger plateaus.

Finding the optimum learning rate $\alpha$ and momentum scale parameter $\beta$, i.e., the best trade-off between longer training time and instability, can be difficult and might require many experiments. Global or local adaptation techniques use, for instance, the partial derivative to automatically adapt the learning rate. Examples here are the Delta-Bar-Delta rule [18] and the SuperSAB algorithm [51].

Other Learning Rules

The backpropagation learning algorithm is computationally efficient in that its time complexity is linear in the number of weight parameters. Its learning speed is comparatively low, however, on the basis of epochs. This may result in long training times, especially for difficult and complex problems requiring larger networks or larger amounts of training data.

Another major limitation is that backpropagation does not always converge. Still, it is a widely used algorithm and has its advantages: It is relatively easy to apply and to configure and provides a quick, though not absolutely perfect solution. Its usually pattern-wise error adjustment is hardly affected by data that contains a larger number of redundant examples. Standard backpropagation also generalizes as well on small data sets as more advanced algorithms, e.g., if there is insufficient information available to find a more precise solution.

There are many variations of the backpropagation algorithm, like resilient propagation (Rprop) [42], quick propagation (Quickprop) [13], conjugate gradient descent [6], Levenberg–Marquardt [16], Delta-Bar-Delta [18], to mention the most popular. All these second-order algorithms are designed to deal with some of the limitations of the standard approach. Some work substantially faster in many problem domains, but require more control parameters than backpropagation, which makes them more difficult to use.

Resilient Propagation

Resilient propagation (Rprop) as proposed in [42,43] is a variant of standard backpropagation with very robust control parameters that are easy to adjust. The algorithm converges faster than the standard algorithm without being less accurate.

The size of the weight step $\Delta w_{ij}$ taken by standard backpropagation not only depends on the learning rate $\alpha$, but also on the size of the partial derivative (see Eq. (7)). This may have an unpredictable influence during training that is difficult to control. Therefore, Rprop uses only the sign of the derivative to adjust the weights. It necessarily requires learning by epoch, i.e., all adjustments take place after each epoch only.

One iteration of the Rprop algorithm involves two steps, the adjustment of the step size and the update of the weights. The amount of weight change is found by the following update rule:

$$\Delta_{ij}(t) = \begin{cases} \eta^{+}\, \Delta_{ij}(t-1) & \text{if } d_{ij}(t-1)\, d_{ij}(t) > 0 \\ \eta^{-}\, \Delta_{ij}(t-1) & \text{if } d_{ij}(t-1)\, d_{ij}(t) < 0 \\ \Delta_{ij}(t-1) & \text{otherwise} \end{cases} \tag{11}$$

with $0 < \eta^{-} < 1 < \eta^{+}$ and derivative $d_{ij} = \delta_j y_i$. Every time $t$ the derivative term changes its sign, indicating that the last update (at time $t-1$) was too large and the algo-
rithm has jumped over a local minimum, the update value (step size) $\Delta_{ij}(t-1)$ is decreased by a constant factor $\eta^{-}$.

The rule for updating the weights is straightforward:

$$w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t) \tag{12}$$

$$\Delta w_{ij}(t) = \begin{cases} -\Delta_{ij}(t) & \text{if } d_{ij}(t) > 0 \\ +\Delta_{ij}(t) & \text{if } d_{ij}(t) < 0 \\ 0 & \text{otherwise} \end{cases} \tag{13}$$

One advantage of the Rprop algorithm, compared to, for example, Quickprop [13], is its small set of parameters that hardly requires adaptation. Standard values for decrease factor $\eta^{-}$ and increase factor $\eta^{+}$ are 0.5 and 1.2, respectively. To avoid too large or too small weight values, this is bounded above by $\Delta_{\max}$ and bounded below by [...]

[...] accuracy, as long as cross validation is used to stop training in time.

For classification problems, the neural network assigns to each input case a class label or, more generally, estimates the probability of the case to fall into each class. The various output classes of a problem are normally represented in neural networks using one of two techniques: binary encoding and one-out-of-n encoding.

A binary encoding is only possible for two-class problems. A single unit calculates class 1 if its output is above the acceptance threshold. If the output is below the rejection threshold, class 0 is predicted. Otherwise, the output class is undecided. If the network output should always be defined, both threshold values must be equal (e.g., 0.5).

In one-out-of-n encoding one unit is allocated for each class. A class is selected if the corresponding output is [...]
the overall network error using an ordinary learning algorithm such as backpropagation. After that, the input-side weights of the new neuron are frozen. Thus, the neuron does not change anymore and becomes a permanent feature detector.

Cascade correlation networks have several advantages over multi-layer perceptrons: (1) Training time is much shorter, if only because the network size is relatively small. (2) They require little or no adjustment of parameters, especially not in terms of the number of hidden neurons to use. (3) They are more robust, and training is less likely to become stuck in local minima.

Recurrent Networks

A network architecture with cycles is adopted by recurrent or feedback neural networks such that outputs of some neurons are fed back as extra inputs. Because past outputs are used to calculate future outputs, the network is said to “remember” its previous state. Recurrent networks are designed to process sequential information, like time series data. Processing depends on the state of the network at the last time step. Consequently, the response to the current input depends on previous inputs.

Two similar types of recurrent network are extensions of the multi-layer perceptron: Jordan networks [19] feed back all network outputs into the input layer; Elman networks [12] feed back from the hidden units. State or context units are added to the input layer for the feedback connections, which all have constant weight one. At each time step t, an input vector is propagated in a standard feedforward fashion, and then a learning rule (usually backpropagation) is applied. The extra units always maintain a copy of the previous outputs at time step t − 1.

Radial Basis Function Networks

Radial basis function (RBF) networks [8,34,37,38] are another popular variant of two-layer feedforward neural networks which use radial basis functions as activation functions. The idea behind radial basis functions is to approximate the unknown function $f(\vec{x})$ by a weighted sum of non-linear basis functions $\phi$, which are often Gaussian functions with a certain standard deviation $\sigma$.

$$f(\vec{x}) = \sum_i w_i \, \phi(\lVert \vec{x} - \vec{c}_i \rVert) \qquad (14)$$

The basis functions operate on the Euclidean distance between the n-dimensional input vector $\vec{x}$ and the center vector $\vec{c}_i$. Once the center vectors $\vec{c}_i$ are fixed, the weight coefficients $w_i$ are found by simple linear regression.

The architecture of RBF networks is fixed to two processing layers. Each unit in the hidden layer represents a center vector and a basis function which realizes a non-linear transformation of the inputs. Each output unit calculates a weighted sum (linear combination) of the non-linear outputs from the hidden layer. Only the connections between hidden layer and output layer are weighted.

The use of a linear output layer in RBF networks is motivated by Cover’s theorem on the separability of patterns. The theorem states that if the transformation from the data (input) space to the feature (hidden) space is non-linear and the dimensionality of the feature space is relatively high compared to that of the data space, then there is a high likelihood that a non-separable pattern classification task in the input space is transformed into a linearly separable one in the feature space.

The center vectors are selected from the training data, either randomly or uniformly distributed in the input space. In principle, as many centers (and hidden units) may be used as there are data examples. Another method is to group the data in space using, for example, k-means clustering, and select the center vectors close to the cluster centers.

RBF learning is considered a curve-fitting problem in high-dimensional space, i.e., it approximates a surface with the basis functions that fits and interpolates the training data points best. The basis functions are well-suited to online learning applications, like adaptive process control. Adapting the network to new data and changing data statistics only requires a retraining by linear regression, which is fast. RBF networks are more local approximators than multi-layer perceptrons. New training data from one region of the input space have less effect on the learned model and its predictions in other regions.

Self-organizing Maps

A self-organizing map (SOM) or Kohonen map [22,23] applies an unsupervised and competitive learning scheme. That means that the class labels of the data vectors are unknown or not used for training and that each neuron improves through competition with other neurons. It is a non-deterministic machine-learning approach to data clustering that implements a mapping of the high-dimensional input data into a low-dimensional feature space. In doing so, SOMs filter and compress information while preserving the most relevant features of a data set. Complex, non-linear relationships and dependencies are revealed between data vectors and between clusters and are transformed into simple geometric distances. Such an abstraction facilitates both visualization and interpretation of the clustering result.

Typically, the units of a SOM network are arranged in a two-dimensional regular grid, the topological feature map, which defines a two-dimensional Euclidean distance between units. Each unit is assigned a center vector from the n-dimensional data space and represents a certain cluster.

Algorithm 1 describes the basic principle behind SOM training. Starting with an initially random set of center vectors, the algorithm iteratively adjusts them to reflect the clustering of the training data. In doing so, the two-dimensional order of the SOM units is imposed on the input vectors such that more similar clusters (center vectors) in the input space are closer to each other on the two-dimensional grid structure than more different clusters. One can think of the topological map as being folded and distorted into the n-dimensional input space, so as to preserve as much as possible the original structure of the data.

Algorithm 1 (Self-organizing Map)

1. Initialize the n-dimensional center vector $\vec{c}_i \in \mathbb{R}^n$ of each cluster randomly.
2. For each data point $\vec{p} \in \mathbb{R}^n$ find the nearest center vector $\vec{c}_w$ (called the “winner”) in n-dimensional space according to a distance metric d.
3. Move $\vec{c}_w$ and all centers $\vec{c}_i$ within its local neighborhood closer to $\vec{p}$:
   $$\vec{c}_i(t+1) = \vec{c}_i(t) + \alpha(t) \cdot h_{r(t)} \cdot (\vec{p} - \vec{c}_i(t))$$
   with learning rate $0 < \alpha < 1$ and neighborhood function h depending on a neighborhood radius r.
4. $\alpha(t+1) = \alpha(t) - \Delta\alpha$ where $\Delta\alpha = \alpha_0 / t_{\max}$
   $r(t+1) = r(t) - \Delta r$ where $\Delta r = r_0 / t_{\max}$
5. Repeat steps 2.–4. for each epoch $t = 1, \ldots, t_{\max}$.
6. Assign each data point to the cluster with the nearest center vector.

Each iteration involves randomly selecting a data point $\vec{p}$ and moving the closest center vector a bit in the direction of $\vec{p}$. Only the distance metric d (usually Euclidean) defined on the data space influences the selection of the closest cluster. The adjustment of centers is applied not just to the winning neuron, but to all the neurons of its neighborhood. The neighborhood function is often Gaussian. A simple definition calculates h = 1 if the Euclidean distance between the grid coordinates of the winning cluster w and a cluster i is below a radius r, i.e., $\lVert (x_w, y_w) - (x_i, y_i) \rVert < r$, and h = 0 otherwise.

Both learning rate $\alpha$ and neighborhood radius r monotonically decrease over time. Initially, quite large areas of the network are affected by the neighborhood update, leading to a rather rough topological order. As epochs pass, fewer neurons are altered with lower intensity and finer distinctions are drawn within areas of the map.

Unlike hierarchical clustering [52] and k-means clustering [29], which are both deterministic – apart from the randomized initialization in k-means clustering – and operate only locally, SOMs are less likely to become stuck in local minima and have a higher robustness and accuracy.

Once the network has been trained to recognize structures in the data, it can be used as a visualization tool and for exploratory data analysis. The example map in Fig. 3 shows a clustering of time series of gene expression values [48]. Clearly higher similarities between neighboring clusters are revealed when comparing the mean vectors.

If neurons in the feature map can be labeled, i.e., if common meaning can be derived from the vectors in a cluster, the network becomes capable of classification. If the winning neuron of an unknown input case has not been assigned a class label, labels of clusters in close or direct neighborhood may be considered. Ideally, higher similarities between neighboring data clusters are reflected in similar class labels. Alternatively, the network output is undefined in this case. SOM classifiers also make use of the distance of the winning neuron from the input case. If this distance exceeds a certain maximum threshold, the SOM is regarded as undecided. In this way, a SOM can be used for detecting novel data classes.

Future Directions

To date, neural networks are widely accepted as an alternative to classical statistical methods and are frequently used in medicine [4,5,11,25,27,44,45] with many applications related to cancer research [10,26,35,36,49]. In the first place, these comprise diagnostics and prognosis (i.e., classification) tasks, but also image analysis and drug design. Cancer prediction is often based on clustering of gene expression data [15,50] or microRNA expression profiles [28] which may involve both self-organizing maps and multi-layer feedforward neural networks.

Another broad application field of neural networks today is bioinformatics and, in particular, the analysis and classification of gene and protein sequences [3,20,56]. A well-known successful example is protein (secondary) structure prediction from sequence [39,40].

Even though the NN technology is clearly established today, the current period is rather characterized by stagnation. This is partly because of a redirection of research
to newer and often – but not generally – more powerful paradigms, like the popular support vector machines (SVMs) [9,47], or to more open and flexible methods, like genetic programming (GP) [7,24]. In many applications, for example, in bioinformatics, SVMs have already replaced conventional neural networks as the state-of-the-art black-box classifier.

Bibliography

Primary Literature

1. Ackley DH, Hinton GE, Sejnowski TJ (1985) A learning algorithm for Boltzmann machines. Cogn Sci 9:147–169
2. Anderson JA et al (1977) Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychol Rev 84:413–451
3. Baldi P, Brunak S (2001) Bioinformatics: The machine learning approach. MIT Press, Cambridge
4. Baxt WG (1995) Applications of artificial neural networks to clinical medicine. Lancet 346:1135–1138
5. Begg R, Kamruzzaman J, Sarkar R (2006) Neural networks in healthcare: Potential and challenges. Idea Group Publishing, Hershey
6. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, London
7. Brameier M, Banzhaf W (2007) Linear genetic programming. Springer, New York
8. Broomhead DS, Lowe D (1988) Multivariable functional interpolation and adaptive networks. Complex Syst 2:321–355
9. Cristianini N (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, London
10. Dybowski R (2000) Neural computation in medicine: Perspectives and prospects. In: Malmgren H, Borga M, Niklasson L (eds) Proceedings of the Conference on Artificial Neural Networks in Medicine and Biology (ANNIMAB). Springer, Berlin, pp 26–36
11. Dybowski R, Gant V (2001) Clinical Applications of Artificial Neural Networks. Cambridge University Press, London
12. Elman JL (1990) Finding structure in time. Cogn Sci 14:179–211
13. Fahlman SE (1989) Faster learning variations on backpropagation: An empirical study. In: Touretzky DS, Hinton GE, Sejnowski TJ (eds) Proceedings of the 1988 Connectionist Models Summer School. Morgan Kaufmann, San Mateo, pp 38–51
14. Fahlman SE, Lebiere C (1990) The cascade-correlation learning architecture. In: Touretzky DS (ed) Advances in Neural Information Processing Systems 2. Morgan Kaufmann, Los Altos
15. Golub TR et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537
16. Hagan MT, Menhaj M (1994) Training feedforward networks with the Marquardt algorithm. IEEE Trans Neural Netw 5(6):989–993
17. Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci 79(8):2554–2558
18. Jacobs RA (1988) Increased rates of convergence through learning rate adaptation. Neural Netw 1:295–307
19. Jordan MI (1986) Attractor dynamics and parallelism in a connectionist sequential machine. In: Proceedings of the Eighth Annual Conf of the Cogn Sci Society. Lawrence Erlbaum, Hillsdale, pp 531–546
20. Keedwell E, Narayanan A (2005) Intelligent bioinformatics: The application of artificial intelligence techniques to bioinformatics problems. Wiley, New York
21. Kohonen T (1977) Associative Memory: A System-Theoretical Approach. Springer, Berlin
22. Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern 43:59–69
23. Kohonen T (1995) Self-organizing maps. Springer, Berlin
24. Koza JR (1992) Genetic programming: On the programming of computer programs by natural selection. MIT Press, Cambridge
25. Lisboa PJG (2002) A review of evidence of health benefit from artificial neural networks in medical intervention. Neural Netw 15:11–39
26. Lisboa PJG, Taktak AFG (2006) The use of artificial neural networks in decision support in cancer: a systematic review. Neural Netw 19(4):408–415
27. Lisboa PJG, Ifeachor EC, Szczepaniak PS (2001) Artificial neural networks in biomedicine. Springer, Berlin
28. Lu J et al (2005) MicroRNA expression profiles classify human cancers. Nature 435:834–838
29. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol 1. University of California Press, Berkeley, pp 281–297
30. McClelland JL, Rumelhart DE (1986) Parallel distributed processing: Explorations in the microstructure of cognition. MIT Press, Cambridge
31. McClelland J, Rumelhart D (1988) Explorations in parallel distributed processing. MIT Press, Cambridge
32. McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 5:115–133
33. Minsky ML, Papert SA (1969/1988) Perceptrons. MIT Press, Cambridge
34. Moody J, Darken CJ (1989) Fast learning in networks of locally-tuned processing units. Neural Comput 1:281–294
35. Naguib RN, Sherbet GV (1997) Artificial neural networks in cancer research. Pathobiology 65(3):129–139
36. Naguib RNG, Sherbet GV (2001) Artificial neural networks in cancer diagnosis, prognosis, and patient management. CRC Press, Boca Raton
37. Poggio T, Girosi F (1990) Regularization algorithms for learning that are equivalent to multi-layer networks. Science 247:978–982
38. Poggio T, Girosi F (1990) Networks for approximation and learning. Proc IEEE 78:1481–1497
39. Qian N, Sejnowski TJ (1988) Predicting the secondary structure of globular proteins using neural network models. J Mol Biol 202:865–884
40. Rost B (2001) Review: Protein secondary structure prediction continues to rise. J Struct Biol 134:204–218
41. Rosenblatt F (1958) The perceptron: A probabilistic model for information storage and organization in the brain. Psychol Rev 65(6):386–408
42. Riedmiller M, Braun H (1992) Rprop – a fast adaptive learning algorithm. In: Proceedings of the International Symposium on Computer and Information Science VII
43. Riedmiller M, Braun H (1993) A direct adaptive method for faster backpropagation learning: The Rprop algorithm. In: Proceedings of the IEEE International Conference on Neural Networks. IEEE Press, Piscataway, pp 586–591
44. Ripley BD, Ripley RM (2001) Neural Networks as statistical methods in survival analysis. In: Dybowski R, Gant V (eds) Clinical Applications of Artificial Neural Networks. Cambridge Univ Press, London
45. Robert C et al (2004) Bibliometric overview of the utilization of artificial neural networks in medicine and biology. Scientometrics 59:117–130
46. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by backpropagating errors. Nature 323:533–536
47. Schölkopf B, Smola AJ (2001) Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
48. Spellman PT et al (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9(12):3273–97
49. Taktak AFG, Fisher AC (2007) Outcome prediction in cancer. Elsevier Science, London
50. Tamayo P et al (1999) Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. PNAS 96(6):2907–2912
51. Tollenaere T (1990) SuperSAB: Fast adaptive backpropagation with good scaling properties. Neural Netw 3:561–573
52. Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58(301):236–244
53. Werbos PJ (1974) Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University
54. Werbos PJ (1994) The roots of backpropagation. Wiley, New York
55. Widrow B, Hoff ME (1960) Adaptive switching circuits. In: IRE WESCON Convention Record, Institute of Radio Engineers (now IEEE), vol 4. pp 96–104
56. Wu CH, McLarty JW (2000) Neural networks and genome informatics. Elsevier Science, Amsterdam

Books and Reviews

Abdi H (1994) A neural network primer. J Biol Syst 2:247–281
Bishop CM (2008) Pattern recognition and machine learning. Springer, Berlin
Fausett L (1994) Fundamentals of neural networks: Architectures, algorithms, and applications. Prentice Hall, New York
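
As an illustration of the radial basis function networks described in this article (Eq. 14), the following minimal Python sketch fits the output weights of a small RBF network to the XOR problem. It is a sketch under stated assumptions, not the authors' implementation: the basis functions are Gaussian with a hand-picked standard deviation of 0.5, all four training points serve as centers, and the linear regression step is carried out through the normal equations with a small Gaussian elimination. The helper names (gaussian, solve, fit_rbf, predict) are hypothetical.

```python
import math


def gaussian(r, sigma):
    """Gaussian basis function of the distance r (assumed form of phi)."""
    return math.exp(-(r * r) / (2.0 * sigma * sigma))


def solve(a, b):
    """Solve the square system a w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][k] * w[k] for k in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w


def fit_rbf(xs, ys, centers, sigma):
    """Least-squares output weights w_i for fixed centers (normal equations)."""
    phi = [[gaussian(math.dist(x, c), sigma) for c in centers] for x in xs]
    k = len(centers)
    ata = [[sum(row[i] * row[j] for row in phi) for j in range(k)]
           for i in range(k)]
    aty = [sum(row[i] * y for row, y in zip(phi, ys)) for i in range(k)]
    return solve(ata, aty)


def predict(x, centers, weights, sigma):
    """Evaluate f(x) = sum_i w_i * phi(||x - c_i||) as in Eq. (14)."""
    return sum(w * gaussian(math.dist(x, c), sigma)
               for w, c in zip(weights, centers))


# XOR: not linearly separable in the two-dimensional input space.
xs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
ys = [0.0, 1.0, 1.0, 0.0]
centers = xs   # assumption: one center per training example
sigma = 0.5    # assumption: hand-picked width
weights = fit_rbf(xs, ys, centers, sigma)
preds = [predict(x, centers, weights, sigma) for x in xs]
print([round(p, 6) for p in preds])
```

With one center per training point the linear output layer can reproduce the XOR targets, which is in line with the motivation by Cover's theorem given in the article. With fewer centers, for example centers taken from a k-means clustering, the same least-squares step would yield an approximation instead of an exact interpolation.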
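
The training procedure of Algorithm 1 (Self-organizing Map) can be sketched in a few lines of Python. This is a minimal, hedged illustration and not the authors' code: it assumes the simple neighborhood function from the article (h = 1 if the grid distance to the winner is below the radius, h = 0 otherwise), a Euclidean distance metric, and linear decrements of learning rate and radius. The function names train_som and assign, the default values alpha0 = 0.5 and r0 = 1.5, and the toy data are assumptions made for this example.

```python
import math
import random


def train_som(data, rows, cols, t_max, alpha0=0.5, r0=1.5, seed=0):
    """Train a small SOM following steps 1-5 of Algorithm 1 (illustrative)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: one random center vector per grid unit (grid coordinates as keys).
    centers = {(gx, gy): [rng.random() for _ in range(dim)]
               for gx in range(rows) for gy in range(cols)}
    alpha, radius = alpha0, r0
    d_alpha, d_radius = alpha0 / t_max, r0 / t_max  # decrements used in step 4
    for _ in range(t_max):  # step 5: repeat for each epoch
        for p in rng.sample(data, len(data)):  # random presentation order
            # Step 2: the winner is the unit whose center is nearest to p.
            winner = min(centers, key=lambda u: math.dist(centers[u], p))
            # Step 3: move every unit whose grid distance to the winner is
            # below the radius (h = 1 inside the neighborhood, 0 outside).
            for unit, c in centers.items():
                if math.dist(unit, winner) < radius:
                    for k in range(dim):
                        c[k] += alpha * (p[k] - c[k])
        # Step 4: decrease learning rate and neighborhood radius linearly.
        alpha -= d_alpha
        radius -= d_radius
    return centers


def assign(data, centers):
    """Step 6: map each data point to the unit with the nearest center."""
    return [min(centers, key=lambda u: math.dist(centers[u], p)) for p in data]


# Toy data (assumed): two well-separated one-dimensional clusters.
data = [[0.0], [0.05], [0.1], [0.9], [0.95], [1.0]]
centers = train_som(data, rows=1, cols=2, t_max=20)
labels = assign(data, centers)
print(labels)
```

In this toy run the neighborhood initially covers both units, so they move together, and once the radius drops below the grid spacing only the winner is updated, which lets the two units settle on different clusters. A Gaussian neighborhood function, which the article names as the common choice, would replace the hard radius test by a distance-dependent factor.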