Data-Mining and Knowledge Discovery, Neural Networks in


that is, a mapping from observations about an item to conclusions about its target value. Their article concentrates on the machine learning view.

Orlov, Sipper and Hauptman (see Genetic and Evolutionary Algorithms and Programming: General Introduction and Application to Game Playing) cover genetic and evolutionary algorithms, which are a family of search algorithms inspired by the process of (Darwinian) evolution in Nature. Common to all the different family members is the notion of solving problems by evolving an initially random population of candidate solutions, through the application of operators like crossover and mutation inspired by natural genetics and natural selection, such that in time “fitter” (i.e., better) solutions emerge. The field, whose origins can be traced back to the 1950s and 1960s, has come into its own over the past two decades, proving successful in solving multitudinous problems from highly diverse domains including (to mention but a few): optimization, automatic programming, electronic-circuit design, telecommunications, networks, finance, economics, image analysis, signal processing, music, and art.

Berkhin and Dhillon (see Knowledge Discovery: Clustering) discuss the condition where data found in scientific and business applications usually do not fit a particular parametrized probability distribution. In other words, the data are complex. Knowledge discovery starts with exploration of this complexity in order to find inconsistencies, artifacts, errors, etc. in the data. After the data are cleaned, they are usually still extremely complex. Descriptive data mining deals with comprehending and reducing this complexity. Clustering is a premier methodology in descriptive unsupervised data mining. A cluster could represent an important subset of the data such as a galaxy in astronomical data or a segment of customers in marketing applications. Clustering is important as a fundamental technology to reduce data complexity and to find data patterns in an unsupervised fashion. It is universally used as a first technology of choice in data exploration.

Džeroski, Panov and Zenko (see Machine Learning, Ensemble Methods in) cover ensemble methods, which are machine learning methods that construct a set of predictive models and combine their outputs into a single prediction. The purpose of combining several models together is to achieve better predictive performance, and it has been shown in a number of cases that ensembles can be more accurate than single models. While some work on ensemble methods had already been done in the 1970s, it was not until the 1990s, and the introduction of methods such as bagging and boosting, that ensemble methods started to be more widely used. Today, they represent a standard machine learning method which has to be considered whenever good predictive accuracy is demanded.

Liu and Zhao (see Manipulating Data and Dimension Reduction Methods: Feature Selection) cover feature selection, which is the study of algorithms for reducing dimensionality of data for various purposes. One of the most common purposes is to improve machine learning performance. The other purposes include simplifying data description, streamlining data collection, improving comprehensibility of the learned models, and helping gain insight through learning. The objective of feature selection is to remove irrelevant and/or redundant features and retain only relevant features. Irrelevant features can be removed without affecting learning performance. Redundant features are a type of irrelevant feature. The distinction is that a redundant feature implies the co-presence of another feature; individually, each feature is relevant, but the removal of either one will not affect learning performance. As a plethora of data are generated by every possible means with the exponentially decreasing costs of data storage and computer processing power, data dimensionality increases on a scale beyond imagination in cases ranging from transactional data to high-throughput data. In many fields such as medicine, health care, Web search, and bioinformatics, it is imperative to reduce high dimensionality such that efficient data processing and meaningful data analysis can be conducted in order to mine nuggets from high-dimensional, massive data.

Data-Mining and Knowledge Discovery, Neural Networks in

MARKUS BRAMEIER
Bioinformatics Research Center, University of Aarhus, Århus, Denmark

Article Outline

Glossary
Definition of the Subject
Introduction
Neural Network Learning
Feedforward Neural Networks
Backpropagation
Other Learning Rules
Other Neural Network Architectures
Self-organizing Maps
Future Directions
Bibliography
Glossary

Artificial neural network: An artificial neural network is a system composed of many simple, but highly interconnected processing nodes or neurons which operate in parallel and collectively. It resembles biological nervous systems in two basic functions: (1) Experiential knowledge is acquired through a learning process and can be retrieved again later. (2) The knowledge is stored in the strength (weights) of the connections between the neurons.

Artificial neuron: An artificial neuron receives a number of inputs, which may be either external inputs to the neural network or outputs of other neurons. Each input connection is assigned a weight, similar to the synaptic efficacy of a biological neuron. The weighted sum of inputs is compared against an activation level (threshold) to determine the activation value of the neuron.

Activation function: The activation or transfer function transforms the weighted inputs of a neuron into an output signal. Activation functions often have a “squashing” effect. Common activation functions used in neural networks are: threshold, linear, sigmoid, hyperbolic tangent, and Gaussian.

Learning rule: The learning rule describes the way a neural network is trained, i.e., how its free parameters undergo changes to fit the network to the training data.

Feedforward network: Feedforward neural networks are organized in one or more layers of processing units (neurons). In a feedforward neural network the signal is allowed to flow one-way only, i.e., from inputs to outputs. There are no feedback loops, i.e., the outputs of a layer do not affect its inputs.

Feedback networks: In feedback or recurrent networks signals may flow in both directions. Feedback networks are dynamic in that they have a state that changes continuously until it reaches an equilibrium point.

Definition of the Subject

Neural networks (NNs) or, more precisely, artificial neural networks (ANNs) are mathematical and computational models that are inspired by the way biological nervous systems process information. A neural network model consists of a large number of highly interconnected, simple processing nodes or units which operate in parallel and perform functions collectively, roughly similar to biological neural networks. ANNs, like their biological counterpart, are adaptive systems that learn by example. Learning works by adapting free model parameters, i.e., the signal strength of the connections and the signal flow, to external information that is presented to the network. In other terms, the information is stored in the weighted connections. If not stated explicitly, let the term “neural network” mean “artificial neural network” in the following.

In more technical terms, neural networks are non-linear statistical data modeling tools. Neural networks are generally well suited for solving problem tasks that involve classification, pattern recognition and decision making. The power and usefulness of neural networks have been demonstrated in numerous application areas, like image processing, signal processing, biometric identification (including handwritten character, fingerprint, face, and speech recognition), robotic control, industrial engineering, and biomedicine. In many of these tasks neural networks outperform more traditional statistical or artificial intelligence techniques or may even achieve human-like performance. The most valuable characteristics of neural networks are adaptability and tolerance to noisy or incomplete data. Another important advantage is in solving problems that do not have an algorithmic solution or for which an algorithmic solution is too complex or time-consuming to be found.

The first neural network model was developed by McCulloch and Pitts in the 1940s [32]. In 1958 Rosenblatt [41] described the first learning algorithm for a single neuron, the perceptron model. After Rumelhart et al. [46] popularized the backpropagation learning algorithm for multi-layer networks in 1986, the field of neural networks gained incredible popularity in the 1990s.

Neural networks are regarded as a method of machine learning, the largest subfield of artificial intelligence (AI). Conventional AI mainly focuses on the development of expert systems and the design of intelligent agents. Today neural networks also belong to the more recent field of computational intelligence (CI), which also includes evolutionary algorithms (EAs) and fuzzy logic.

Introduction

Herein I provide a concise introduction to the two most popular neural network types used in application, backpropagation neural networks (BPNNs) and self-organizing maps (SOMs). The former learn high-dimensional non-linear functions from given input–output associations for solving classification and approximation (regression) problems. The latter are primarily used for data clustering and visualization and for revealing relationships between clusters. These models are discussed in context with alternative learning algorithms and neural network architectures.
Biological Motivation

Artificial neural networks are inspired by biological nervous systems, which are highly distributed and interconnected networks. The human brain is principally composed of a very large number of relatively simple neurons (approx. 100 billion), each of which is connected to several thousand other neurons, on average. A neuron is a specialized cell that consists of the cell body (the soma), multiple spine-like extensions (the dendrites) and a single nerve fiber (the axon). The axon connects to the dendrites of another neuron via a synapse. When a neuron is activated, it transmits an electrical impulse (action potential) along its axon. At the synapse the electric signal is transformed into a chemical signal such that a certain number of neurotransmitters cross the synaptic gap to the postsynaptic neuron, where the chemical signal is converted back to an electrical signal to be transported along the dendrites. The dendrites receive signals from the axons of other neurons. One very important feature of neurons is that they react with a delay. A neuron combines the strengths (energies) of all received input signals and sends out its own signal (“fires”) only if the total signal strength exceeds a certain critical activation level. A synapse can either be excitatory or inhibitory. Input signals from an excitatory synapse increase the activation level of the neuron while inputs from an inhibitory synapse reduce it. The strength of the input signals critically depends on modulations at the synapses. The brain learns basically by adjusting the number and strength of the synaptic connections.

Artificial neural networks copy only a small amount of the biological complexity by using a much smaller number of simpler neurons and connections. Nevertheless, artificial neural networks can perform remarkably complex tasks by applying a similar principle, i.e., the combination of simple and local processing units, each calculating a weighted sum of its inputs and sending out a signal if the sum exceeds a certain threshold.

History and Overview

The history of artificial neural networks begins with a discrete mathematical model of a biological neural network developed by pioneers McCulloch and Pitts in 1943 [32]. This model describes neurons as threshold logic units (TLUs) or binary decision neurons (BDNs) with multiple binary inputs and a single binary output. A neuron outputs 1 (is activated) if the sum of its unweighted inputs exceeds a certain specified threshold, otherwise it outputs 0. Each neuron can only represent simple logic functions like OR or AND, but any boolean function can be realized by combinations of such neurons.

In 1958 Rosenblatt [41] extended the McCulloch–Pitts model to the perceptron model. This network was based on a unit called the perceptron, which produces an output depending on the weighted linear combination of its inputs. The weights are adapted by the perceptron learning rule. Another single-layer neural network that is based on the McCulloch–Pitts neuron is the ADALINE (ADAptive Linear Element) which was invented in 1960 by Widrow and Hoff [55] and employs a Least-Mean-Squares (LMS) learning rule.

In 1969 Minsky and Papert [33] provided mathematical proofs that single-layer neural networks like the perceptron are incapable of representing functions which are linearly inseparable, including in particular the exclusive-or (XOR) function. This fundamental limitation led the research on neural networks to stagnate for many years, until it was found that a perceptron with more than one layer has far greater processing power.

The backpropagation learning method was first described by Werbos in 1974 [53,54], and further developed for multi-layer neural networks by Rumelhart et al. in 1986 [46]. Backpropagation networks are by far the most well known and most commonly used neural networks today.

Recurrent auto-associative networks were first described independently by Anderson [2] and Kohonen [21] in 1977. Invented in 1982, the Hopfield network [17] is a recurrent neural network in which all connections are symmetric. All neurons are both input and output neurons and update their activation values asynchronously and independently from each other. For each new input the network converges dynamically to a new stable state. A Hopfield network may serve as an associative, i.e., content-addressable, memory.

The Boltzmann machine by Ackley et al. [1] can be seen as an extension of the Hopfield network. It uses a stochastic instead of a deterministic update rule that simulates the physical principle of annealing. The Boltzmann machine is one of the first neural networks to demonstrate learning of an internal representation (hidden units).

It was also in 1982 that Kohonen first published his self-organizing maps [22], a neural network model based on unsupervised learning and competitive learning. SOMs produce a low-dimensional representation (feature map) of high-dimensional input data while preserving their most important topological features.

Significant progress was made in the 1990s in the field of neural networks, which attracted a great deal of attention both in research and in many application domains (see Sect. “Definition of the Subject”). Faster computers allowed more complex problems to be solved more efficiently.
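The McCulloch–Pitts threshold logic unit described in this section can be sketched in a few lines of Python. This is only an illustration: the function names `tlu`, `or_gate` and `and_gate`, and the threshold values 0 and 1 chosen for the two gates, are assumptions made here and are not taken from the original model description.

```python
from itertools import product


def tlu(inputs, threshold):
    """Threshold logic unit: output 1 if the sum of the (unweighted)
    binary inputs exceeds the threshold, otherwise output 0."""
    return 1 if sum(inputs) > threshold else 0


def or_gate(a, b):
    # Assumed threshold 0: fires as soon as at least one input is 1.
    return tlu((a, b), threshold=0)


def and_gate(a, b):
    # Assumed threshold 1: fires only if both inputs are 1.
    return tlu((a, b), threshold=1)


# Truth tables over all binary input pairs (0,0), (0,1), (1,0), (1,1).
pairs = list(product((0, 1), repeat=2))
or_table = [or_gate(a, b) for a, b in pairs]
and_table = [and_gate(a, b) for a, b in pairs]
```

The two gates differ only in their threshold, which shows how a single unit of this kind is limited to simple logic functions.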
Hardware implementations of larger neural networks were realized on parallel computers or in neural network chips with multiple units working simultaneously.

Characteristics of Neural Networks

Interesting general properties of neural networks are that they (1) mimic the way the brain works, (2) are able to learn by experience, (3) make predictions without having to know the precise underlying model, (4) have a high fault tolerance, i.e., can still give the correct output to missing, noisy or partially correct inputs, and (5) can work with data they have never seen before, provided that the underlying distribution fits the training data.

Computation in neural networks is local and highly distributed throughout the network such that each node operates by itself, but tries to minimize the overall network (output) error in cooperation with other nodes. Working like a cluster of interconnected processing nodes, neural networks automatically distribute both the problem and the workload among the nodes in order to find and converge to a common solution. Actually, parallel processing was one of the original motivations behind the development of artificial neural networks.

Neural networks have the topology of a directed graph. There are only one-way connections between nodes, just like in biological nervous systems. A two-way relationship requires two one-way connections. Many different network architectures are used, often with hundreds or thousands of adjustable parameters.

Neural networks are typically organized in layers which each consist of a number of interconnected nodes (see Fig. 2 below). Data patterns are presented to the system via the input layer. This non-computing layer is connected to one or more hidden layers where the actual processing is done. The hidden layers then link to an output layer which combines the results of multiple processing units to produce the final response of the network.

The network acts as a high-dimensional vector function, taking one vector as input and returning another vector as output. The modeled functions are general and complex enough to solve a large class of non-linear classification and estimation problems. No matter which problem domain a neural network is operating in, the input data always have to be encoded into numbers which may be continuous or discrete.

The high number of free parameters in a neural network and the high degree of collinearity between the neuron outputs let individual parameter settings (weight coefficients) become meaningless and make the network for the most part uninterpretable. High-order interactions between neurons also do not permit simpler substructures in the model to be identified. Therefore, an extraction of the acquired knowledge from such black box predictors to understand the underlying model is almost impossible.

Neural Network Learning

The human brain learns by practice and experience. The learned knowledge can change if more information is received. Another important element of learning is the ability to infer knowledge, i.e., to make assumptions based on what we know and to apply what we have learned in the past to similar problems and situations. One theory of the physiology of human learning by repetition is that repeated sequences of impulses strengthen connections between neurons and form memory paths. To retrieve the learned information, nerve impulses follow these paths to the correct information. If we get out of practice these paths may diminish over time and we forget what we have learned.

The information stored in a neural network is contained in its free parameters. In general, the network architecture and connections are held constant and only the connection weights are variable during training. Once the numbers of (hidden) layers and units have been selected, the free parameters (weights) are set to fit the model or function represented by the network to the training data, following a certain training algorithm or learning rule. A neural network learns, i.e., acquires knowledge, by adjusting the weights of connections between its neurons. This is also referred to as connectionist learning. Training occurs iteratively in multiple cycles during which the training examples are repeatedly presented to the network. During one epoch all data patterns pass through the network once.

The three major learning paradigms are supervised learning, unsupervised learning, and reinforcement learning.

Supervised learning, in general, means learning by an external teacher using global information. The problem the network is supposed to solve is defined through a given set of training examples. The learning algorithm searches the solution space $F$, the class of possible functions, for a function $f^* \in F$ that best matches this set of input–output associations $(\vec{x}, \vec{y})$. In other words, the mapping implied by the sample data has to be inferred.

Training a neural network means determining a set of weights which minimizes its prediction error on the training set. The cost or error function $E : F \to \mathbb{R}$ measures the error between the desired output values $\vec{y}$ and the predicted network outputs $f(\vec{x})$ over all input vectors $\vec{x}$. That means, it calculates how far away the current state $f$ of the network is from the optimal solution $f^*$ with $E(f^*) \le E(f)\ \forall f \in F$.

A neural network cannot perfectly learn a mapping if the input data do not contain enough information to derive the desired outputs. It may also not converge if there is not enough data available (see also below).

Unsupervised learning uses no external teacher and only local information. It is distinguished from supervised learning by the fact that there is no a priori output. In unsupervised learning we are given some input data $\vec{x}$, and the cost function to be minimized can be any function of $\vec{x}$ and the network output $f(\vec{x})$. Unsupervised learning incorporates self-organization, i.e., organizes the input data by using only their inherent properties to reveal their emergent collective properties.

A neural network learns offline if the learning phase and the operation (application) phase are separated. A neural network learns online if both happen at the same time. Usually, supervised learning is performed offline, whereas unsupervised learning is performed online.

In reinforcement learning, neither inputs $\vec{x}$ nor outputs $\vec{y}$ are given explicitly, but are generated by the interactions of an agent within an environment. The agent performs an action $\vec{y}$ with costs $c$ according to an observation $\vec{x}$ made in the environment. The aim is to discover a policy or plan for selecting actions that minimizes some measure of the expected total costs.

Overtraining and Generalization

The overall motivation and most desirable property of neural networks is their ability to generalize to new unknown data, i.e., to correctly classify patterns on which they have not been trained. Minimizing the network error on the training examples only does not automatically minimize the real error of the unknown underlying function. This important problem is called overfitting or overtraining.

A regular distribution of training examples over the input data space is important. Generalization is reasonable only as long as the data inputs remain inside the range for which the network was trained. If the training set only included vectors from a certain part of the data space, predictions on other parts are random and likely wrong.

Overtraining may also occur when the iterative training algorithm is run for too long or if the network is too complex for the problem to solve or the available quantity of data. A larger neural network with more weights models a more complex function and invariably achieves a lower error, but is prone to overfitting. A network with fewer weights, on the other hand, may not be sufficiently powerful to model the underlying function.

A simple heuristic, called early stopping, helps to ensure that the network will generalize well to examples not in the training set. One solution is to check progress during training against an independent data set, the validation set. As training progresses, the training error naturally decreases monotonically and, provided that training is minimizing the true error function, the validation error decreases as well. However, if the validation error stops dropping or even starts to increase again, this is an indication that the network is starting to overfit the data. Then the optimization process has become stuck in a local minimum and training should be stopped. The weights that produced the minimum validation error are then used for the final model.

In this case of overtraining, the size of the network, i.e., the number of hidden units and/or hidden layers, may be decreased. Working with neural networks typically involves experimenting with a large number of different configurations, training each one a number of times while observing the validation error. A problem with repeated experimentation is that the validation set is actually part of the training process. One may just find a network by chance that happens to perform well on the validation set. It is therefore normal practice to reserve a third set of examples, the test set, for testing the final model.

In many cases a sufficient amount of data is not available, however. Then we have to get around this problem by resampling techniques, like cross validation. In principle, multiple experiments are conducted, each using a different division of the available data into training and validation set. This should remove any sampling bias. For small data sets, where splitting the data would leave too few observations for training, leave-one-out validation may be used to determine when to stop training or the optimal network size.

Machine-learning techniques, like neural networks, require both positive and negative training examples for solving classification problems. Because they minimize an overall error, the proportion of positive and negative examples in the training set is critical. Ideally, the relation should be close to the (usually unknown) real distribution in the data space. Otherwise, it may bias the network’s decision to be more often wrong on unknown data.

Feedforward Neural Networks

In feedforward neural networks the information is passed in only one direction (forward) from the inputs, through the hidden nodes (if any) to the output nodes. There are
no connections backwards to neurons of upper layers, i.e., there are no feedback loops or cycles in the network.

In feedforward networks with a single layer of weights, the inputs are directly connected to the output units (single-layer neural networks). Multi-layer feedforward networks use additional intermediate layers of hidden units. Neural networks with two or more processing layers may have far greater processing power than networks with only one layer. Single-layer neural networks are only capable of learning linearly separable patterns and functions.

The Perceptron

The simplest kind of feedforward neural network is a perceptron, which consists of a single pseudo input layer and one or more processing nodes in the output layer. All inputs are weighted and fed directly to the output neuron(s) (see Fig. 1). Each node calculates the sum of the products of weights and inputs. If this value is above some threshold (typically 0) the neuron takes the activated value 1, otherwise it outputs 0. Neurons with this kind of activation function are called threshold units. In the literature the term perceptron often refers to networks consisting of only one (output) neuron.

Figure 1: Principle structure of single-unit perceptron network

More formally, the perceptron is a linear binary classifier that maps a binary $n$-dimensional input vector $\vec{x} \in \{0, 1\}^n$ to a binary output value $f(\vec{w} \cdot \vec{x}) \in \{0, 1\}$ calculated as

$$ f(s) = \begin{cases} 1 & \text{if } s > T \\ 0 & \text{otherwise} \end{cases} \tag{1} $$

where $s = \vec{w} \cdot \vec{x} = \sum_{i=1}^{n} w_i x_i$ is the input sum and $\vec{w}$ is a vector of real-valued weights. The constant threshold $T$ does not depend on any input value.

In addition to the network topology, the learning rule is an important component of neural networks. Perceptrons can be trained by a simple learning algorithm, called the delta rule or perceptron learning rule. This realizes a simple stochastic gradient descent where the weights of the network are adjusted depending on the error between the predicted outputs of the network and the example outputs. The delta rule changes the weight vector such that the output error is minimized. McClelland and Rumelhart [30] proved that a neural network using the delta rule can learn associations whenever the inputs are linearly independent.

All neurons of a perceptron share the same structure and learning algorithm. Each weight $w_{ij}$, representing the influence of input $x_i$ on neuron $j$, is updated at time $t$ according to the rule:

$$ w_{ij}(t + 1) = w_{ij}(t) + \Delta w_{ij} \tag{2} $$

$$ \Delta w_{ij} = \alpha (o_i - y_i)\, x_{ij} . \tag{3} $$

The network learns by updating the weight vector after each iteration (training example) by an amount proportional to the difference between given output $o_i$ and calculated output $y_i = f(s_i)$. The learning rate $\alpha$ is a constant with $0 < \alpha < 1$ and regulates the learning speed.

The training data set is linearly separable in $n$-dimensional data space if its two classes of vectors $\vec{x}$ can be separated by an $(n - 1)$-dimensional hyperplane. If the training examples are not linearly separable, the perceptron learning algorithm is not guaranteed to converge. Linear classifiers, like single-unit perceptrons, are only able to learn, i.e., perfectly classify, linearly separable patterns because they can only implement a simple decision surface (single hyperplane) [33]. The same is true for single-layer neural networks with more than one output unit. This makes these linear neural networks unable to learn, for example, the XOR function [33]. Nevertheless, a problem that is thought to be highly complex may still be solved as well by a linear network as by a more powerful (non-linear) neural network.

Single-Layer Neural Networks

In general, the state of a neuron is represented by its activation value. An activation or transfer function $f$ calculates the activation value of a unit from the weighted sum $s$ of its inputs. In the case of the perceptron, $f$ is called the step or threshold function, with the activation value being 1 if the network sum is greater than a constant $T$, and 0 otherwise (see Eq. (1)). Another common form of non-linear activation function is the logistic or sigmoid function:

$$ f(s) = \frac{1}{1 + e^{-s}} . \tag{4} $$

This enables a neural network to compute a continuous output between 0 and 1 instead of a step function. With
this choice, a single-layer network is identical to a logistic regression model. If the activation function is linear, i. e., the identity, then this is just a multiple linear regression and the output is proportional to the total weighted sum s.

Multi-Layer Neural Networks

The limitation that non-linearly separable functions cannot be represented by a single-layer network with fixed weights can be overcome by adding more layers. A multi-layer network is a feedforward network with two or more layers of computational units, interconnected such that the neurons' outputs of one layer serve as inputs only to neurons of the directly subsequent layer (see Fig. 2). The input layer is not considered a real layer with processing neurons.

The number of units in the input layer is determined by the problem, i. e., the dimension of the input data space. The number of output units also depends on the output encoding (see Subsect. “Application Issues”).

By using hidden layers, the partitioning of the data space can be more effective. In principle, each hidden unit adds one hyperplane to divide the space and discriminate the solution. The XOR problem is solvable only if the outputs of at least two neurons are combined in a third neuron. Important issues in multi-layer NN design are, thus, the specification of the number of hidden layers and the number of units in these layers (see also Subsect. “Application Issues”). Both numbers determine the complexity of functions that can be modeled. There is no theoretical limitation on the number of hidden layers, but usually one or two are used.

The universal approximation theorem for neural networks states that any continuous function that maps intervals of real numbers to an output interval of real numbers can be approximated arbitrarily closely by a multi-layer neural network with only one hidden layer and certain types of non-linear activation functions. This gives, however, no indication about how fast or how likely a solution is found. Networks with two hidden layers may work better for some problems. However, more than two hidden layers usually provide only marginal benefit compared to the significant increase in training time.

Any multi-layer network with fixed weights and linear activation function is equivalent to a single-layer (linear) network: In the case of a two-layer linear system, for instance, let all input vectors to the first layer form matrix $X$ and let $W_1$ and $W_2$ be the weight matrices of the two processing layers. Then the output $Y_1 = W_1 X$ of the first layer is input to the second layer, which produces output $Y_2 = W_2 (W_1 X) = (W_2 W_1) X$. This is equivalent to a single-layer network with weight matrix $W = W_2 W_1$.

Only a multi-layer network that is non-linear can provide more computational power. In many applications these networks use a sigmoid function as non-linear activation function. This is the case at least for the hidden units. For the output layer, the sigmoid activation function is usually applied with classification problems, while a linear transfer function is applied with regression problems.

Error Function and Error Surface

The error function derives the overall network error from the difference of the network's output $y_{ij}$ and target output $o_{ij}$ over all examples i and output units j. The most common error function is the sum squared error:

$$ E = \frac{1}{2} \sum_i \sum_j (o_{ij} - y_{ij})^2 \tag{5} $$
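The sum squared error of Eq. (5) can be written out directly. The following is a small sketch in which the function name and the sample targets and outputs are made up for illustration; rows correspond to examples i and columns to output units j.

```python
def sum_squared_error(targets, outputs):
    """Eq. (5): E = 1/2 * sum over examples i and output units j of (o_ij - y_ij)^2."""
    return 0.5 * sum(
        (o - y) ** 2
        for target_row, output_row in zip(targets, outputs)
        for o, y in zip(target_row, output_row)
    )

# Two examples with two output units each (illustrative values only).
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.8, 0.1], [0.3, 0.6]]
print(sum_squared_error(targets, outputs))
```

For these values the squared differences are 0.04, 0.01, 0.09 and 0.16, so E evaluates to 0.15 up to floating-point rounding.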

Data-Mining and Knowledge Discovery, Neural Networks in, Figure 2
Principle structure of multi-layer feedforward neural network

Neural network training performs a search within the space of solutions, i. e., all possible network configurations, towards a global minimum of the error surface. The global minimum is the best overall solution with the lowest possible error.

A helpful concept for understanding NN training is the error surface: The n weight parameters of the network model form the n dimensions of the search space. For any possible state of the network or configuration of weights the error is plotted in the (n + 1)th dimension. The objective of training is to find the lowest point on this n-dimensional surface. The error surface is seldom smooth. Indeed, for most problems, the surface is quite rugged with numerous hills and valleys which may cause the network search to run into a local minimum, i. e., a suboptimal solution.
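The effect of a rugged error surface can be illustrated with a toy example. The sketch below is not a neural network: it assumes a hypothetical error curve over a single weight w, chosen only because it has one local and one global minimum, and runs plain gradient descent from two different starting points.

```python
def error(w):
    """Hypothetical one-weight error curve with a local and a global minimum."""
    return w**4 - 3.0 * w**2 + w

def gradient(w):
    """Derivative of the hypothetical error curve above."""
    return 4.0 * w**3 - 6.0 * w + 1.0

def descend(w, learning_rate=0.01, steps=2000):
    """Plain gradient descent: repeatedly step against the slope."""
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

for start in (-2.0, 2.0):
    w = descend(start)
    print(f"start={start:+.1f}  final w={w:+.4f}  error={error(w):+.4f}")
```

Starting at -2.0 the search settles near w ≈ -1.30 with an error of about -3.51, whereas starting at +2.0 it settles near w ≈ 1.13 with an error of about -1.07. Both end points have zero slope, but only the first is the global minimum.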
The speed of learning is the rate of convergence between the current solution and the global minimum.

In a linear network with a sum-squared error function, the error surface is a multi-dimensional parabola, i. e., has only one minimum. In general, it is not possible to analytically determine where the global minimum of the error surface is. Training is essentially an exploration of the error surface. Because of the probabilistic and often highly non-linear modeling by neural networks, we cannot be sure that the error could not be lower still, i. e., that the minimum we found is the absolute one. Since the shape of the error space cannot be known a priori, neural network analysis requires a number of independent runs to determine the best solution. When different initial values for the weights are selected, different network models will be derived.

From an initially random configuration of the network, i. e., a random point on the error surface, the training algorithm starts seeking the global minimum. Small random values are typically used to initialize the network weights. Although neural networks resulting from different initial weights may have very different parameter settings, their prediction errors usually do not vary dramatically. Training is stopped when a maximum number of epochs has expired or when the network error does not improve any further.

Backpropagation

The best known and most popular training algorithm for multi-layer networks is backpropagation, short for backwards error propagation and also referred to as the generalized delta rule [46]. The algorithm involves two phases:

Forward pass. During the first phase, the free parameters (weights) of the network are fixed. An example pattern is presented to the network and the input signals are propagated through the network layers to calculate the network output at the output unit(s).

Backward pass. During the second phase, the model parameters are adjusted. The error signals at the output units, i. e., the differences between calculated and expected outputs, are propagated back through the network layers. In doing so, the error at each processing unit is calculated and used to make adjustments to its connecting weights such that the overall error of the network is reduced by some small amount.

After iteratively repeating both phases for a sufficiently large number of training cycles (epochs) the network will converge to a state where its output error is small enough. The backpropagation rule involves the repeated use of the chain rule, saying that the output error of a neuron can be ascribed partly to errors in the weights of its direct inputs and partly to errors in the outputs of higher-level (hidden) nodes [46].

Moreover, backpropagation learning may happen in two different modes. In sequential mode or online mode weight adjustments are made example by example, i. e., each time an example pattern has been presented to the network. In batch mode or offline mode adjustments are made epoch by epoch, i. e., only after all example patterns have been presented. Theoretically, the backpropagation algorithm performs gradient descent on the total error only if the weights are updated epoch-wise. There are empirical indications, however, that a pattern-wise update results in faster convergence. The training examples should be presented in random order. Then the precision of predictions will be more similar over all inputs.

Backpropagation learning requires a differentiable activation function. Besides adding non-linearity to multi-layer networks, the sigmoid activation function (see Eq. (4)) is often used in backpropagation networks because it has a continuous derivative that can be calculated easily:

$$ f'(s) = f(s)\,(1 - f(s)) \tag{6} $$

We further assume that there is only one hidden layer in order to keep notations and equations clear. A generalization to networks with more than one hidden layer is straightforward.

The backpropagation rule is a generalization of the delta learning rule (see Eq. (3)) to multi-layer networks with non-linear activation function. For an input vector $\vec{x}$ the output $y = f(s)$ is calculated at each output neuron of the network and compared with the desired target output $o$, resulting in an error $\delta$. Each weight is adjusted proportionally to its effect on the error. The weight of a connection between a unit i and a unit j is updated depending on the output of i (as input to j) and the error signal at j:

$$ \Delta w_{ij} = \alpha\, \delta_j\, y_i \tag{7} $$

For an output node j the error signal (error surface gradient) is given by:

$$ \delta_j = (o_j - y_j)\, f'(s_j) = (o_j - y_j)\, y_j (1 - y_j) \tag{8} $$

If the error is zero, no changes are made to the connection weight. The larger the absolute error, the more the responsible weight is changed, while the sign of the error determines the direction of change.

For a hidden neuron j the error signal is calculated recursively using the signals of all directly connected output neurons k:

$$ \delta_j = f'(s_j) \sum_k \delta_k w_{jk} = y_j (1 - y_j) \sum_k \delta_k w_{jk} \tag{9} $$
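Eqs. (7) to (9) translate fairly directly into code. The following is a minimal sketch of online (pattern-wise) backpropagation for a network with one hidden layer and sigmoid units, not a reference implementation. The function names, the extra bias weight at the end of each weight row, the initial weight range, the XOR training set and the learning rate are all assumptions made for illustration.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(x, w_hidden, w_out):
    """Forward pass: inputs -> hidden outputs h -> network outputs y.
    Assumption: each weight row ends with a bias weight fed by a constant 1."""
    xb = x + [1.0]
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = h + [1.0]
    y = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return h, y

def backprop_step(x, target, w_hidden, w_out, alpha=0.5):
    """One online update; changes the weights in place and returns the
    squared error of this example measured before the update."""
    h, y = forward(x, w_hidden, w_out)
    # Eq. (8): error signal at each output unit k.
    delta_out = [(o - yk) * yk * (1.0 - yk) for o, yk in zip(target, y)]
    # Eq. (9): error signal at each hidden unit j from the connected output units.
    delta_hid = [
        hj * (1.0 - hj) * sum(delta_out[k] * w_out[k][j] for k in range(len(w_out)))
        for j, hj in enumerate(h)
    ]
    # Eq. (7): weight change = alpha * error signal at target * output of source.
    hb, xb = h + [1.0], x + [1.0]
    for k, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] += alpha * delta_out[k] * hb[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += alpha * delta_hid[j] * xb[i]
    return 0.5 * sum((o - yk) ** 2 for o, yk in zip(target, y))

# Illustrative run on XOR with 2 inputs, 3 hidden units and 1 output.
random.seed(0)
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(1)]
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
for epoch in range(5000):
    random.shuffle(data)  # present the examples in random order
    epoch_error = sum(backprop_step(x, t, w_hidden, w_out) for x, t in data)
print("sum squared error in last epoch:", round(epoch_error, 4))
```

The error signals of the hidden units are computed before any weight is changed, so that Eq. (9) uses the same weights as the forward pass. Whether this small run reaches a low error depends on the random initialization, which is one reason why several independent runs are usually made.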
The partial derivative of the error function with respect to the network weights can be calculated purely locally, such that each neuron needs information only from neurons directly connected to it. A theoretical foundation of the backpropagation algorithm can be found in [31].

The backpropagation algorithm performs a gradient descent by calculating the gradient vector of the error surface at the current search point. This vector points into the direction of the steepest descent. Moving in this direction will decrease the error and will eventually find a new (local) minimum, provided that the step size is adapted appropriately. Small steps slow down learning speed, i. e., require a larger number of iterations. Large steps may converge faster, but may also overstep the solution or make the algorithm oscillate around a minimum without convergence of the weights. Therefore, the step size is made proportional to the slope $\delta$, i. e., is reduced when the search point approaches a minimum, and to the learning rate $\alpha$. The constant $\alpha$ allows one to control the size of the gradient descent step and is usually set to be between 0.1 and 0.5. For practical purposes, it is recommended to choose the learning rate as large as possible without leading to oscillation.

Momentum

One possibility to avoid oscillation and to achieve faster convergence is the addition of a momentum term that is proportional to the previous weight change:

$$ \Delta w_{ij}(t+1) = \alpha\, \delta_j\, y_i + \beta\, \Delta w_{ij}(t) \tag{10} $$

The algorithm increases the step size, and thus the learning speed, if it has taken several steps in the same direction. This gives it the ability to overcome obstacles in the error surface, e. g., to avoid and escape from local minima, and to move faster over larger plateaus.

Finding the optimum learning rate $\alpha$ and momentum scale parameter $\beta$, i. e., the best trade-off between longer training time and instability, can be difficult and might require many experiments. Global or local adaptation techniques use, for instance, the partial derivative to automatically adapt the learning rate. Examples here are the Delta-Bar-Delta rule [18] and the SuperSAB algorithm [51].

Other Learning Rules

The backpropagation learning algorithm is computationally efficient in that its time complexity is linear in the number of weight parameters. Its learning speed is comparatively low, however, on the basis of epochs. This may result in long training times, especially for difficult and complex problems requiring larger networks or larger amounts of training data.

Another major limitation is that backpropagation does not always converge. Still, it is a widely used algorithm and has its advantages: It is relatively easy to apply and to configure and provides a quick, though not absolutely perfect solution. Its usually pattern-wise error adjustment is hardly affected by data that contains a larger number of redundant examples. Standard backpropagation also generalizes equally well on small data sets as more advanced algorithms, e. g., if there is insufficient information available to find a more precise solution.

There are many variations of the backpropagation algorithm, like resilient propagation (Rprop) [42], quick propagation (Quickprop) [13], conjugate gradient descent [6], Levenberg–Marquardt [16], and Delta-Bar-Delta [18], to mention the most popular. All these algorithms are designed to deal with some of the limitations of the standard approach. Some work substantially faster in many problem domains, but require more control parameters than backpropagation, which makes them more difficult to use.

Resilient Propagation

Resilient propagation (Rprop) as proposed in [42,43] is a variant of standard backpropagation with very robust control parameters that are easy to adjust. The algorithm converges faster than the standard algorithm without being less accurate.

The size of the weight step $\Delta w_{ij}$ taken by standard backpropagation not only depends on the learning rate $\alpha$, but also on the size of the partial derivative (see Eq. (7)). This may have an unpredictable influence during training that is difficult to control. Therefore, Rprop uses only the sign of the derivative to adjust the weights. It necessarily requires learning by epoch, i. e., all adjustments take place after each epoch only.

One iteration of the Rprop algorithm involves two steps, the adjustment of the step size and the update of the weights. The amount of weight change is found by the following update rule:

$$ \Delta_{ij}(t) = \begin{cases} \eta^{+}\, \Delta_{ij}(t-1) & \text{if } d_{ij}(t-1)\, d_{ij}(t) > 0 \\ \eta^{-}\, \Delta_{ij}(t-1) & \text{if } d_{ij}(t-1)\, d_{ij}(t) < 0 \\ \Delta_{ij}(t-1) & \text{otherwise} \end{cases} \tag{11} $$

with $0 < \eta^{-} < 1 < \eta^{+}$ and error derivative $d_{ij} = \partial E / \partial w_{ij} = -\delta_j y_i$. Whenever the derivative changes its sign at time t, indicating that the last update (at time t − 1) was too large and the algorithm has jumped over a local minimum, the update value (step size) $\Delta_{ij}(t-1)$ is decreased by the constant factor $\eta^{-}$.

The rule for updating the weights is straightforward:

$$ w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t) \tag{12} $$

$$ \Delta w_{ij}(t) = \begin{cases} -\Delta_{ij}(t) & \text{if } d_{ij}(t) > 0 \\ +\Delta_{ij}(t) & \text{if } d_{ij}(t) < 0 \\ 0 & \text{otherwise} \end{cases} \tag{13} $$

One advantage of the Rprop algorithm, compared to, for example, Quickprop [13], is its small set of parameters that hardly requires adaptation. Standard values for decrease factor $\eta^{-}$ and increase factor $\eta^{+}$ are 0.5 and 1.2, respectively. To avoid too large or too small weight steps, the update value $\Delta_{ij}$ is bounded above by $\Delta_{max}$ and bounded below by $\Delta_{min}$, set by default to 50 and $10^{-6}$. The same initial value $\Delta_0 = 0.1$ is recommended for all $\Delta_{ij}$. The choice of the parameter settings is not critical: for most problems no other choice is needed to obtain the optimal or at least a nearly optimal solution.

Application Issues

The architecture of a neural network, i. e., the number of (hidden) neurons and layers, is an important decision. If a neural network is highly redundant and overparameterized, it might adapt too much to the data. Thus, there is a trade-off between reducing bias (fitting the training data) and reducing variance (fitting unknown data). The most common procedure is to select a network structure that has more than enough parameters and neurons and then to avoid overfitting solely by means of the training algorithm (see Subsect. “Overtraining and Generalization”).

There is no general best network structure for a particular type of application. There are only general rules for selecting the network architecture: (1) The more complex the relationships between input and output data are, the higher the number of hidden units should be selected. (2) If the modeled process is separable into multiple stages, more than one hidden layer may be beneficial. (3) An upper bound for the total number of hidden units may be set by the number of data examples divided by the number of input and output units and multiplied by a scaling factor. A simpler rule is to start with one hidden layer and half as many hidden units as there are input and output units.

One would expect that for a given data set there would be an optimal network size, lying between a minimum of one hidden neuron (high bias, low variance) and a very large number of neurons (low bias, high variance). While this is true for some data sets, in many cases increasing the number of hidden nodes continues to improve prediction accuracy, as long as cross validation is used to stop training in time.

For classification problems, the neural network assigns to each input case a class label or, more generally, estimates the probability of the case to fall into each class. The various output classes of a problem are normally represented in neural networks using one of two techniques, binary encoding and one-out-of-n encoding.

A binary encoding is only possible for two-class problems. A single unit calculates class 1 if its output is above the acceptance threshold. If the output is below the rejection threshold, class 0 is predicted. Otherwise, the output class is undecided. If the network output should always be defined, both threshold values must be equal (e. g. 0.5).

In one-out-of-n encoding one unit is allocated for each class. A class is selected if the corresponding output is above the acceptance threshold and all the other outputs are below the rejection threshold. If this condition is not met, the class is undecided. Alternatively, instead of using a threshold, a winner-takes-all decision may be made such that the unit with the highest output gives the class.

For regression problems, the objective is to estimate the value of a continuous output variable, given the input variables. Particularly important issues in regression are output scaling and interpolation. The most common NN architectures produce outputs in a limited range. Scaling algorithms may be applied to the training data to ensure that the target outputs are in the same range. Constraining the network's outputs limits its generalization performance. To overcome this, a linear activation function may be used for the output units. Then there is often no need for output scaling at all, since the units can in principle calculate any value.

Other Neural Network Architectures

This section summarizes some alternative NN architectures that are variants or extensions of multi-layer feedforward networks.

Cascade Correlation Networks

Cascade-correlation is a neural network architecture with variable size and topology [14]. The initial network has no hidden layer and grows during training by adding new hidden units one at a time. In doing so, a near minimal network topology is built. In the cascade architecture the outputs from all existing hidden neurons in the network are fed into a new neuron. In addition, all neurons, including the output neurons, receive all input values.

For each new hidden unit, the learning algorithm tries to maximize the correlation between this unit's output and
the overall network error using an ordinary learning algorithm, like, e. g., backpropagation. After that the input-side weights of the new neuron are frozen. Thus, it does not change anymore and becomes a permanent feature detector.

Cascade correlation networks have several advantages over multi-layer perceptrons: (1) Training time is much shorter already because the network size is relatively small. (2) They require only little or no adjustment of parameters, especially not in terms of the number of hidden neurons to use. (3) They are more robust and training is less likely to become stuck in local minima.

Recurrent Networks

A network architecture with cycles is adopted by recurrent or feedback neural networks such that outputs of some neurons are fed back as extra inputs. Because past outputs are used to calculate future outputs, the network is said to “remember” its previous state. Recurrent networks are designed to process sequential information, like time series data. Processing depends on the state of the network at the last time step. Consequently, the response to the current input depends on previous inputs.

Two similar types of recurrent network are extensions of the multi-layer perceptron: Jordan networks [19] feed back all network outputs into the input layer; Elman networks [12] feed back from the hidden units. State or context units are added to the input layer for the feedback connections which all have constant weight one. At each time step t, an input vector is propagated in a standard feedforward fashion, and then a learning rule (usually backpropagation) is applied. The extra units always maintain a copy of the previous outputs at time step t − 1.

Radial Basis Function Networks

Radial basis function (RBF) networks [8,34,37,38] are another popular variant of two-layer feedforward neural networks which use radial basis functions as activation functions. The idea behind radial basis functions is to approximate the unknown function $f(\vec{x})$ by a weighted sum of non-linear basis functions $\phi$, which are often Gaussian functions with a certain standard deviation $\sigma$.

$$ f(\vec{x}) = \sum_i w_i\, \phi(\lVert \vec{x} - \vec{c}_i \rVert) \tag{14} $$

The basis functions operate on the Euclidean distance between n-dimensional input vector $\vec{x}$ and center vector $\vec{c}_i$. Once the center vectors $\vec{c}_i$ are fixed, the weight coefficients $w_i$ are found by simple linear regression.

The architecture of RBF networks is fixed to two processing layers. Each unit in the hidden layer represents a center vector and a basis function which realizes a non-linear transformation of the inputs. Each output unit calculates a weighted sum (linear combination) of the non-linear outputs from the hidden layer. Only the connections between hidden layer and output layer are weighted.

The use of a linear output layer in RBF networks is motivated by Cover's theorem on the separability of patterns. The theorem states that if the transformation from the data (input) space to the feature (hidden) space is non-linear and the dimensionality of the feature space is relatively high compared to that of the data space, then there is a high likelihood that a non-separable pattern classification task in the input space is transformed into a linearly separable one in the feature space.

The center vectors are selected from the training data, either randomly or uniformly distributed in the input space. In principle, as many centers (and hidden units) may be used as there are data examples. Another method is to group the data in space using, for example, k-means clustering, and select the center vectors close to the cluster centers.

RBF learning is considered a curve-fitting problem in high-dimensional space, i. e., it approximates a surface with the basis functions that fits and interpolates the training data points best. The basis functions are well-suited to online learning applications, like adaptive process control. Adapting the network to new data and changing data statistics only requires a retraining by linear regression, which is fast. RBF networks are more local approximators than multi-layer perceptrons. New training data from one region of the input space have less effect on the learned model and its predictions in other regions.

Self-organizing Maps

A self-organizing map (SOM) or Kohonen map [22,23] applies an unsupervised and competitive learning scheme. That means that the class labels of the data vectors are unknown or not used for training and that each neuron improves through competition with other neurons. It is a non-deterministic machine-learning approach to data clustering that implements a mapping of the high-dimensional input data into a low-dimensional feature space. In doing so, SOMs filter and compress information while preserving the most relevant features of a data set. Complex, non-linear relationships and dependencies are revealed between data vectors and between clusters and are transformed into simple geometric distances. Such an abstraction facilitates both visualization and interpretation of the clustering result.

Typically, the units of a SOM network are arranged in a two-dimensional regular grid, the topological feature map, which defines a two-dimensional Euclidean distance between units. Each unit is assigned a center vector from the n-dimensional data space and represents a certain cluster.

Algorithm 1 describes the basic principle behind SOM training. Starting with an initially random set of center vectors, the algorithm iteratively adjusts them to reflect the clustering of the training data. In doing so, the two-dimensional order of the SOM units is imposed on the input vectors such that more similar clusters (center vectors) in the input space are closer to each other on the two-dimensional grid structure than more different clusters. One can think of the topological map as being folded and distorted into the n-dimensional input space, so as to preserve as much as possible the original structure of the data.

Algorithm 1 (Self-organizing Map)

1. Initialize the n-dimensional center vector $\vec{c}_i \in \mathbb{R}^n$ of each cluster randomly.
2. For each data point $\vec{p} \in \mathbb{R}^n$ find the nearest center vector $\vec{c}_w$ (called the “winner”) in n-dimensional space according to a distance metric d.
3. Move $\vec{c}_w$ and all centers $\vec{c}_i$ within its local neighborhood closer to $\vec{p}$:
   $$ \vec{c}_i(t+1) = \vec{c}_i(t) + \alpha(t)\, h_{r(t)}\, (\vec{p} - \vec{c}_i(t)) $$
   with learning rate $0 < \alpha < 1$ and neighborhood function h depending on a neighborhood radius r.
4. $\alpha(t+1) = \alpha(t) - \Delta\alpha$ where $\Delta\alpha = \alpha_0 / t_{max}$
   $r(t+1) = r(t) - \Delta r$ where $\Delta r = r_0 / t_{max}$
5. Repeat steps 2.–4. for each epoch $t = 1, \ldots, t_{max}$.
6. Assign each data point to the cluster with the nearest center vector.

Each iteration involves randomly selecting a data point $\vec{p}$ and moving the closest center vector a bit in the direction of $\vec{p}$. Only distance metric d (usually Euclidean) defined on the data space influences the selection of the closest cluster. The adjustment of centers is applied not just to the winning neuron, but to all the neurons of its neighborhood. The neighborhood function is often Gaussian. A simple definition calculates $h = 1$ if the Euclidean distance between the grid coordinates of the winning cluster w and a cluster i is below a radius r, i. e., $\lVert (x_w, y_w) - (x_i, y_i) \rVert < r$, and $h = 0$ otherwise.

Both learning rate $\alpha$ and neighborhood radius r monotonically decrease over time. Initially quite large areas of the network are affected by the neighborhood update, leading to a rather rough topological order. As epochs pass, fewer neurons are altered with lower intensity and finer distinctions are drawn within areas of the map.

Unlike hierarchical clustering [52] and k-means clustering [29], which are both deterministic (apart from the randomized initialization in k-means clustering) and operate only locally, SOMs are less likely to become stuck in local minima and have a higher robustness and accuracy.

Once the network has been trained to recognize structures in the data, it can be used as a visualization tool and for exploratory data analysis. The example map in Fig. 3 shows a clustering of time series of gene expression values [48]. Clearly higher similarities between neighboring clusters are revealed when comparing the mean vectors.

If neurons in the feature map can be labeled, i. e., if common meaning can be derived from the vectors in a cluster, the network becomes capable of classification. If the winning neuron of an unknown input case has not been assigned a class label, labels of clusters in close or direct neighborhood may be considered. Ideally, higher similarities between neighboring data clusters are reflected in similar class labels. Alternatively, the network output is undefined in this case. SOM classifiers also make use of the distance of the winning neuron from the input case. If this distance exceeds a certain maximum threshold, the SOM is regarded as undecided. In this way, a SOM can be used for detecting novel data classes.

Future Directions

To date, neural networks are widely accepted as an alternative to classical statistical methods and are frequently used in medicine [4,5,11,25,27,44,45] with many applications related to cancer research [10,26,35,36,49]. In the first place, these comprise diagnostics and prognosis (i. e. classification) tasks, but also image analysis and drug design. Cancer prediction is often based on clustering of gene expression data [15,50] or microRNA expression profiles [28] which may involve both self-organizing maps and multi-layer feedforward neural networks.

Another broad application field of neural networks today is bioinformatics and, in particular, the analysis and classification of gene and protein sequences [3,20,56]. A well-known successful example is protein (secondary) structure prediction from sequence [39,40].

Even though the NN technology is clearly established today, the current period is rather characterized by stagnation. This is partly because of a redirection of research
Data-Mining and Knowledge Discovery, Neural Networks in, Figure 3


6 x 6 SOM example clustering of gene expression data (time series over 24 time points) [48]. Mean expression vector plotted for each
cluster. Cluster sizes indicate the number of vectors (genes) in each cluster

to newer and often, but not generally, more powerful paradigms, like the popular support vector machines (SVMs) [9,47], or to more open and flexible methods, like genetic programming (GP) [7,24]. In many applications, for example, in bioinformatics, SVMs have already replaced conventional neural networks as the state-of-the-art black-box classifier.

Bibliography

Primary Literature

1. Ackley DH, Hinton GE, Sejnowski TJ (1985) A learning algorithm for Boltzmann machines. Cogn Sci 9:147–169
2. Anderson JA et al (1977) Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychol Rev 84:413–451
3. Baldi P, Brunak S (2001) Bioinformatics: The machine learning approach. MIT Press, Cambridge
4. Baxt WG (1995) Applications of artificial neural networks to clinical medicine. Lancet 346:1135–1138
5. Begg R, Kamruzzaman J, Sarkar R (2006) Neural networks in healthcare: Potential and challenges. Idea Group Publishing, Hershey
6. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, London
7. Brameier M, Banzhaf W (2007) Linear genetic programming. Springer, New York
8. Broomhead DS, Lowe D (1988) Multivariable functional interpolation and adaptive networks. Complex Syst 2:321–355
9. Cristianini N (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, London
10. Dybowski R (2000) Neural computation in medicine: Perspectives and prospects. In: Malmgren H, Borga M, Niklasson L (eds) Proceedings of the Conference on Artificial Neural Networks in Medicine and Biology (ANNIMAB). Springer, Berlin, pp 26–36
11. Dybowski R, Gant V (2001) Clinical Applications of Artificial Neural Networks. Cambridge University Press, London
12. Elman JL (1990) Finding structure in time. Cogn Sci 14:179–211
13. Fahlman SE (1989) Faster learning variations on backpropagation: An empirical study. In: Touretzky DS, Hinton GE, Sejnowski TJ (eds) Proceedings of the (1988) Connectionist Models Summer School. Morgan Kaufmann, San Mateo, pp 38–51
14. Fahlman SE, Lebiere C (1990) The cascade-correlation learning that are equivalent to multi-layer networks. Science 247:978–
architecture. In: Touretzky DS (ed) Advances in Neural Informa- 982
tion Processing Systems 2. Morgan Kaufmann, Los Altos 38. Poggio T, Girosi F (1990) Networks for approximation and
15. Golub TR et al (1999) Molecular classification of cancer: class learning. Proc IEEE 78:1481–1497
discovery and class prediction by gene expression monitoring. 39. Qian N, Sejnowski TJ (1988) Predicting the secondary struc-
Science 286(5439):531–537 ture of globular proteins using neural network models. J Mol
16. Hagan MT, Menhaj M (1994) Training feedforward net- Biol 202:865–884
works with the Marquardt algorithm. IEEE Trans Neural Netw 40. Rost B (2001) Review: Protein secondary structure prediction
5(6):989–993 continues to rise. J Struct Biol 134:204–218
17. Hopfield JJ (1982) Neural networks and physical systems with 41. Rosenblatt F (1958) The perceptron: A probabilistic model for
emergent collective computational abilities. Proc Natl Acad Sci information storage and organization in the brain. Psychol Rev
79(8):2554–2558 65(6):386–408
18. Jacobs RA (1988) Increased rates of convergence through 42. Riedmiller M, Braun H (1992) Rprop – a fast adaptive learning
learning rate adaptation. Neural Netw 1:295–307 algorithm. In: Proceedings of the International Symposium on
19. Jordan MI (1986) Attractor dynamics and parallelism in a con- Computer and Information Science VII
nectionist sequential machine. In: Proceedings of the Eighth 43. Riedmiller M, Braun H (1993) A direct adaptive method for
Annual Conf of the Cogn Sci Society. Lawrence Erlbaum, Hills- faster backpropagation learning: The Rprop algorithm. In: Pro-
dale, pp 531–546 ceedings of the IEEE International Conference on Neural Net-
20. Keedwell E, Narayanan A (2005) Intelligent bioinformatics: The works. IEEE Press, Piscataway, pp 586–591
application of artificial intelligence techniques to bioinformat- 44. Ripley BD, Ripley RM (2001) Neural Networks as statistical
ics problems. Wiley, New York methods in survival analysis. In: Dybowski R, Gant V (eds) Clini-
21. Kohonen T (1977) Associative Memory: A System-Theoretical cal Applications of Artificial Neural Networks. Cambridge Univ
Approach. Springer, Berlin Press, London
22. Kohonen T (1982) Self-organized formation of topologically 45. Robert C et al (2004) Bibliometric overview of the utilization of
correct feature maps. Biol Cybern 43:59–69 artificial neural networks in medicine and biology. Scientomet-
23. Kohonen T (1995) Self-organizing maps. Springer, Berlin rics 59:117–130
46. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning repre-
24. Koza JR (1992) Genetic programming: On the programming
sentations by backpropagating errors. Nature 323:533–536
of computer programs by natural selection. MIT Press, Cam-
47. Schlökopf B, Smola AJ (2001) Learning with kernels: Support
bridge
vector machines, regularization, optimization, and beyond.
25. Lisboa PJG (2002) A review of evidence of health benefit from
MIT Press, Cambridge
artificial neural networks in medical intervention. Neural Netw
48. Spellman PT et al (1998) Comprehensive identification of cell
15:11–39
cycle-regulated genes of the yeast Saccharomyces cerevisiae
26. Lisboa PJG, Taktak AFG (2006) The use of artificial neural net-
by microarray hybridization. Mol Biol Cell 9(12):3273–97
works in decision support in cancer: a systematic review. Neu-
49. Taktak AFG, Fisher AC (2007) Outcome prediction in cancer. El-
ral Netw 19(4):408–415
sevier Science, London
27. Lisboa PJG, Ifeachor EC, Szczepaniak PS (2001) Artificial neural
50. Tamayo P et al (1999) Interpreting patterns of gene expres-
networks in biomedicine. Springer, Berlin
sion with self-organizing maps: Methods and application to
28. Lu J et al (2005) MicroRNA expression profiles classify human hematopoietic differentiation. PNAS 96(6):2907–2912
cancers. Nature 435:834–838 51. Tollenaere T (1990) SuperSAB: Fast adaptive backpropagation
29. MacQueen J (1967) Some methods for classification and anal- with good scaling properties. Neural Netw 3:561–573
ysis of multivariate observations. In: Proceedings of the 5th 52. Ward JH (1963) Hierarchical grouping to optimize an objective
Berkeley Symposium on Mathematical Statistics and Probabil- function. J Am Stat Assoc 58(301):236–244
ity, vol 1. University of California Press, Berkeley, pp 281–297 53. Werbos PJ (1974) Beyond regression: New tools for prediction
30. McClelland JL, Rumelhart DE (1986) Parallel distributed pro- and analysis in the behavioral science. Ph D Thesis, Harvard
cessing: Explorations in the microstructure of cognition. MIT University
Press, Cambridge 54. Werbos PJ (1994) The roots of backpropagation. Wiley, New
31. McClelland J, Rumelhart D (1988) Explorations in parallel dis- York
tributed processing. MIT Press, Cambridge 55. Widrow B, Hoff ME (1960) Adaptive switching circuits. In:
32. McCulloch WS, Pitts W (1943) A logical calculus of the ideas im- IRE WESCON Convention Record, Institute of Radio Engineers
manent in nervous activity. Bull Math Biophys 5:115–133 (now IEEE), vol 4. pp 96–104
33. Minsky ML, Papert SA (1969/1988) Perceptrons. MIT Press, 56. Wu CH, McLarty JW (2000) Neural networks and genome infor-
Cambridge matics. Elsevier Science, Amsterdam
34. Moody J, Darken CJ (1989) Fast learning in networks of locally-
tuned processing units. Neural Comput 1:281–294
35. Naguib RN, Sherbet GV (1997) Artificial neural networks in can- Books and Reviews
cer research. Pathobiology 65(3):129–139 Abdi H (1994) A neural network primer. J Biol Syst 2:247–281
36. Naguib RNG, Sherbet GV (2001) Artificial neural networks in Bishop CM (2008) Pattern recognition and machine learning.
cancer diagnosis, prognosis, and patient management. CRC Springer, Berlin
Press, Boca Raton Fausett L (1994) Fundamentals of neural networks: Architectures,
37. Poggio T, Girosi F (1990) Regularization algorithms for learning algorithms, and applications. Prentice Hall, New York
Freeman JA, Skapura DM (1991) Neural networks: Algorithms, applications, and programming techniques. Addison, Reading
Gurney K (1997) An Introduction to neural networks. Routledge, London
Hastie T, Tibshirani R, Friedman JH (2003) The elements of statistical learning. Springer, Berlin
Haykin S (1999) Neural networks: A comprehensive foundation. Prentice Hall, New York
Hertz J, Krogh A, Palmer R (1991) Introduction to the theory of neural computation. Addison, Redwood City
Kröse B, van der Smagt P (1996) An introduction to neural networks. University of Amsterdam, Amsterdam
Masters T (1993) Practical neural network recipes in C++. Academic Press, San Diego
Masters T (1995) Advanced algorithms for neural networks: A C++ sourcebook. Wiley, New York
Parks R, Levine D, Long D (1998) Fundamentals of neural network modeling. MIT Press, Cambridge
Patterson D (1996) Artificial neural networks. Prentice Hall, New York
Peretto P (1992) An introduction to the modeling of neural networks. Cambridge University Press, London
Ripley BD (1996/2007) Pattern recognition and neural networks. Cambridge University Press, London
Smith M (1993) Neural networks for statistical modeling. Van Nostrand Reinhold, New York
Wasserman PD (1989) Neural Computing: Theory and Practice. Van Nostrand Reinhold, New York
Wasserman PD (1993) Advanced methods in neural computing. Van Nostrand Reinhold, New York
De Veaux RD, Ungar LH (1997) A brief introduction to neural networks. Technical Report, Williams College, University of Pennsylvania
Hinton GE (1992) How neural networks learn from experience. Sci Am 267:144–151
Lippman RP (1987) An introduction to computing neural networks. IEEE ASSP Mag 4(2):4–22
Reilly D, Cooper L (1990) An overview of neural networks: Early models to real world systems. Neural Electron Netw 2:229–250

Decision Trees

VILI PODGORELEC, MILAN ZORMAN
University of Maribor, Maribor, Slovenia

Article Outline

Glossary
Definition of the Subject
Introduction
The Basics of Decision Trees
Induction of Decision Trees
Evaluation of Quality
Applications and Available Software
Future Directions
Bibliography

Glossary

Accuracy The most important quality measure of an induced decision tree classifier. The most general is the overall accuracy, defined as the percentage of correctly classified instances among all instances (correctly classified and not correctly classified). The accuracy is usually measured both for the training set and the testing set.
Attribute A feature that describes an aspect of an object (both training and testing) used for a decision tree. An object is typically represented as a vector of attribute values. There are two types of attributes: continuous attributes, whose domain is numerical, and discrete attributes, whose domain is a set of predetermined values. There is one distinguished attribute called the decision class (a dependent attribute). The remaining attributes (the independent attributes) are used to determine the value of the decision class.
Attribute node Also called a test node. It is an internal node in the decision tree model that is used to determine a branch from this node based on the value of the corresponding attribute of an object being classified.
Classification A process of mapping instances (i.e. training or testing objects) represented by attribute-value vectors to decision classes. If the predicted decision class of an object is equal to the actual decision class of the object, then the classification of the object is accurate. The aim of classification methods is to classify objects with the highest possible accuracy.
Classifier A model built upon the training set used for classification. The input to a classifier is an object (a vector of known values of the attributes) and the output of the classifier is the predicted decision class for this object.
Decision node A leaf in a decision tree model (also called a decision) containing one of the possible decision classes. It is used to determine the predicted decision class of an object being classified that arrives at the leaf on its path through the decision tree model.
Instance Also called an object (training and testing), represented by attribute-value vectors. Instances are used to describe the domain data.
Induction Inductive inference is the process of moving from concrete examples to general models, where the goal is to learn how to classify objects by analyzing a set of instances (already solved cases) whose classes are known. Instances are typically represented as attribute-value vectors. Learning input consists of a set of such vectors, each belonging to a known class, and the output consists of a mapping from attribute val-
