DATA MINING USING NEURAL NETWORKS

By: Miss. Mukta Arankalle
CERTIFICATE
This is to certify that
Miss. Mukta Arankalle
Roll no. 201
BE II
Has completed the necessary seminar work and prepared the bona fide report on
DATA MINING USING NEURAL NETWORKS
In a satisfactory manner, in partial fulfillment of the requirements for the degree of
B.E (Computer)
Of
University of Pune
In the academic year 2002-2003
Date:
Place:
Prof.
Internal Guide
Prof. G P Potdar
Seminar coordinator
ACKNOWLEDGEMENTS
I would like to extend my sincere gratitude to Prof. G.P. Potdar, (H.O.D., I.T.), P.I.C.T.,
for his encouragement and guidance.
I would like to thank Mr. Piyush Menon, (B.E. Comp) A.I.T., for his valuable help.
I would also like to thank Prof. Dr. C. V. K. Rao, (H.O.D. Computer Dept) and Prof.
R. B. Ingle, our internal guide.
INDEX

1. Data Mining
   1.1 Introduction
   1.2 Definition
   1.3 Knowledge Discovery in Databases
   1.4
   1.5
2. Neural Networks
   2.1 Introduction
   2.2
   2.3 A Neural Net
   2.4
3. Data Mining Using Neural Networks
   3.1 Introduction
   3.2
   3.3 Challenges Involved
   3.4 Advantages
   3.5 Extraction Methods
   3.6
4. Conclusion
References
1. DATA MINING

1.1 Introduction
The past two decades have seen a dramatic increase in the amount of information, or data, being stored in electronic format. This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every 20 months, and the size and number of databases are increasing even faster. The increased use of electronic data-gathering devices, such as point-of-sale or remote-sensing devices, has contributed to this explosion of available data. Effectively utilizing these massive volumes of data is becoming a major problem for all enterprises.
Data storage became easier as large amounts of computing power became available at low cost; with the cost of processing power and storage falling, data became cheap to keep. New machine learning methods for knowledge representation, based on logic programming and the like, were also introduced alongside traditional statistical analysis of data. These new methods tend to be computationally intensive, hence the demand for more processing power.
It was recognized that information is at the heart of business operations and that decision-makers could make use of the data stored to gain valuable insight into the business. Database management systems gave access to the data stored, but this was only a small part of what could be gained from the data. Traditional on-line transaction processing systems (OLTPs) are good at putting data into databases quickly, safely and efficiently, but are not good at delivering meaningful analysis in return. Analyzing data can go beyond what is explicitly stored to derive new knowledge about a business. This is where data mining has obvious benefits for any enterprise.
1.2 Definition
Researchers William J. Frawley, Gregory Piatetsky-Shapiro and Christopher J. Matheus define it as "the non-trivial extraction of implicit, previously unknown, and potentially useful information from data."

1.2.2 Explanation
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analysis offered by data mining moves beyond the analysis of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.
The data mining process consists of three basic stages: exploration, model
building and pattern definition. Fig. 1.1 shows a simple data mining structure.
[Fig. 1.1: A simple data mining structure - discovery (conditional logic, affinities and associations, trends and variations), predictive modeling (outcome prediction, forecasting) and forensic analysis (deviation detection, link analysis) operating on the data.]
Again this is analogous to a mining operation where large amounts of low grade
materials are sifted through in order to find something of value.
1.2.3 Example

A home finance loan actually has an average life span of only 7 to 10 years, due to prepayment. Prepayment means the loan is paid off early, rather than at the end of, say, 25 years. People prepay loans when they refinance or when they sell their home. The financial return that a home-finance institution derives from a loan depends on its life span. Therefore it is necessary for financial institutions to be able to predict the life spans of their loans. Rule discovery techniques are used to accurately predict the aggregate number of loan prepayments in a given quarter (or in a year), as a function of prevailing interest rates, borrower characteristics, and account data. This information can be used to fine-tune loan parameters such as interest rates, points and fees, in order to maximize profits.
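A discovered rule of this kind can be sketched in a few lines. The rule, thresholds and field names below (contract_rate, a 1.5-point refinancing margin) are invented for illustration; they are not from the study described above.

```python
# Hypothetical illustration of applying a discovered prepayment rule
# to loan records. The rule and field names are invented for the sketch.

def predict_prepayment(loan, market_rate):
    """A discovered rule: a borrower is likely to prepay when the
    prevailing market rate is well below the loan's contract rate."""
    rate_advantage = loan["contract_rate"] - market_rate
    return rate_advantage > 1.5  # refinancing becomes attractive

loans = [
    {"id": 1, "contract_rate": 8.5},
    {"id": 2, "contract_rate": 6.0},
]
likely_prepayers = [l["id"] for l in loans if predict_prepayment(l, 6.2)]
print(likely_prepayers)
```

In practice such thresholds would themselves be learned from historical account data rather than fixed by hand.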
1.3 Knowledge Discovery in Databases
Fayyad distinguishes between KDD and data mining by giving the following
definitions:
Knowledge discovery in databases is the process of identifying valid, novel, potentially useful and ultimately understandable structure in data.
Data mining is a step in the KDD process concerned with the algorithmic means
by which patterns or structures are enumerated from the data under acceptable
computational efficiency limits.
The structures that are the outcome of the data mining process must meet certain
conditions so that these can be considered as knowledge. These conditions are: validity,
understandability, utility, novelty and interestingness.
[Figure: the KDD process - Data -> (Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Transformation) -> Transformed Data -> (Data Mining) -> Patterns -> (Interpretation & Evaluation) -> Knowledge.]
Selection: This stage is concerned with selecting or segmenting the data that are
relevant to some criteria. E.g.: for credit card customer profiling, we extract the type of
transactions for each type of customers and we may not be interested in the details of
the shop where the transaction takes place.
Preprocessing: Preprocessing is the data cleaning stage where unnecessary information
is removed. E.g.: it is unnecessary to note the sex of a patient when studying pregnancy.
This stage reconfigures the data to ensure a consistent format, as there is a possibility of
inconsistent formats.
Transformation: The data is not merely transferred across, but transformed in order to
be suitable for the task of data mining. In this stage, the data is made usable and
navigable.
Data Mining: This stage is concerned with the extraction of patterns from the data.
Interpretation and Evaluation: The patterns obtained in the data mining stage are
converted into knowledge, which in turn, is used to support decision-making.
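The stages above can be sketched as composable functions. This is a toy illustration only: the record fields and the trivial "pattern" (customers whose aggregate spend exceeds a threshold) are invented for the sketch.

```python
# A minimal sketch of the KDD stages (selection, preprocessing,
# transformation, mining) as composable functions. Fields are invented.

def select(records):
    # Selection: keep only the records relevant to the analysis.
    return [r for r in records if r["kind"] == "transaction"]

def preprocess(records):
    # Preprocessing: drop irrelevant fields, fix inconsistent formats.
    return [{"customer": r["customer"].lower(), "amount": r["amount"]}
            for r in records]

def transform(records):
    # Transformation: aggregate per customer so the data is navigable.
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals

def mine(totals, threshold=100):
    # Mining: a trivial "pattern" - customers whose spend exceeds a threshold.
    return sorted(c for c, t in totals.items() if t > threshold)

raw = [
    {"kind": "transaction", "customer": "Ann", "amount": 80},
    {"kind": "transaction", "customer": "ann", "amount": 40},
    {"kind": "log", "customer": "bob", "amount": 0},
    {"kind": "transaction", "customer": "Bob", "amount": 30},
]
patterns = mine(transform(preprocess(select(raw))))
print(patterns)  # interpretation: ann's total (120) exceeds the threshold
```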
1.4
1.4.1 Statistics
Statistics is a theory-rich approach to data analysis, which can generate results that are overwhelming and difficult to interpret. Notwithstanding this, statistics is one of the foundations on which data mining technology is built. Statistical analysis systems are used by analysts to detect unusual patterns and to explain patterns using statistical models. Statistics has an important role to play, and data mining will not replace such analyses; rather, statistical techniques can be applied in more directed analyses based on the results of data mining.
1.4.2 Machine Learning

Machine learning covers not only learning from examples, but also reinforcement learning, learning with a teacher, etc. A learning algorithm takes the data set and its accompanying information as input and returns a statement, e.g. a concept representing the results of learning, as output.
1.5

Discovery models operate without intervention or guidance from the user. The discovery, or data mining, tools aim to reveal a large number of facts about the data in as short a time as possible.
An example of such a model is a supermarket database, which is mined to
discover the particular groups of customers to target for a mailing campaign. The data is
searched with no hypothesis in mind other than for the system to group the customers
according to the common characteristics found.
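Such undirected grouping can be illustrated with a small 1-D k-means clustering run, a common way of grouping customers by a shared characteristic. The weekly-spend figures and the number of groups below are invented for the sketch.

```python
# Undirected discovery sketch: grouping supermarket customers by weekly
# spend with 1-D k-means (pure Python). Data and group count are invented.

def kmeans_1d(values, centers, iterations=10):
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assign each value to its nearest centre.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

spend = [12, 15, 14, 90, 95, 88, 40, 42]
centers, groups = kmeans_1d(spend, centers=[10.0, 50.0, 100.0])
print(sorted(round(c) for c in centers))
```

No hypothesis is supplied; the three customer groups (low, medium and high spenders) emerge from the data itself.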
Neural Networks
Genetic Algorithms
Cluster Analysis
Induction
OLAP
Data Visualization
2. NEURAL NETWORKS

2.1 Introduction
Anyone can see that the human brain is superior to a digital computer at many
tasks. A good example is the processing of visual information: a one-year-old baby is
much better and faster at recognizing objects, faces, and so on than even the most
advanced AI system running on the fastest supercomputer. The brain has many other
features that would be desirable in artificial systems.
This is the real motivation for studying neural computation. It is an alternative
paradigm to the usual one (based on a programmed instruction sequence), which was
introduced by von Neumann and has been used as the basis of almost all machine
computation to date. It is inspired by the knowledge from neuroscience, though it does
not try to be biologically realistic in detail.
Neural networks are an approach to computing that involves developing
mathematical structures with the ability to learn. The methods are the result of academic
investigations to model nervous system learning. Neural networks have the remarkable
ability to derive meaning from complicated or imprecise data and can be used to extract
patterns and detect trends that are too complex to be noticed by either humans or other
computer techniques. A trained neural network can be thought of as an "expert" in the
category of information it has been given to analyze. This expert can then be used to
provide projections given new situations of interest and answer "what if" questions.
Neural networks use a set of processing elements (or nodes) analogous to neurons in the brain. These processing elements are interconnected in a network that can then identify patterns in data once it is exposed to the data, i.e. the network learns from experience, much as people do. This distinguishes neural networks from traditional computer programs, which simply follow instructions in a fixed sequential order.
2.2

[Figure: a single processing element - inputs feed a linear combiner through weights wi1, wi2, wi3; the combiner's output passes through an activation function with a threshold to produce the output.]
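The processing element sketched in the figure can be written directly in code. This is a minimal illustration: the sigmoid activation, the weights and the threshold value are assumptions made for the sketch.

```python
import math

# A sketch of a single processing element: a linear combiner followed by
# an activation function with a threshold. Weights and threshold invented.

def neuron(inputs, weights, threshold):
    s = sum(w * x for w, x in zip(weights, inputs)) - threshold  # linear combiner
    return 1.0 / (1.0 + math.exp(-s))  # sigmoid activation

out = neuron([1.0, 0.0, 1.0], weights=[0.5, -0.3, 0.8], threshold=0.2)
print(round(out, 3))
```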
2.3 A Neural Net

A single neuron is insufficient for many practical problems, and networks with a large number of nodes are frequently used. The way the nodes are connected determines how computation proceeds and constitutes an important early design decision by a neural network developer.
2.3.1

[Figure: an asymmetric network and a symmetric network, each built from input, hidden and output nodes.]
2.3.2 Layered Networks

These are networks in which the nodes are partitioned into subsets called layers, with no connections leading from layer j to layer k if j > k.
Each node of the input layer, or layer 0, receives a single input and distributes it to other nodes; no other computation occurs at nodes in layer 0, and there are no intra-layer connections among nodes in this layer. Connections, with arbitrary weights, may exist from any node in layer i to any node in layer j for j >= i; intra-layer connections may exist.
Fig 2.4 A Layered Network (layers 1 and 2 are hidden layers)
2.3.3 Acyclic Networks
This is a subclass of layered networks in which there are no intra-layer
connections, as shown in the fig. 2.5. A connection may exist between any node in layer
i and any node in layer j for i < j, but a connection is not allowed for i = j. Networks that
are not acyclic are referred to as recurrent networks.
Fig 2.5 An Acyclic Network (layers 1 and 2 are hidden layers)
2.3.4 Feedforward Networks

Fig 2.6 A Feedforward 3-2-3-2 Network (layers 1 and 2 are hidden layers)
This is a subclass of acyclic networks in which a connection is allowed from a
node in layer i only to nodes in layer i+1 as shown in fig. 2.6. These networks are
succinctly described by a sequence of numbers indicating the number of nodes in each
layer.
These networks, generally with no more than 4 such layers, are among the most
common neural nets in use. Conceptually, nodes in successively higher layers abstract
successively higher-level features from preceding layers.
2.3.5 Modular Neural Networks
Most problems are solved using neural networks whose architecture consists of
several modules, with sparse interconnections between modules. Modularity allows the
neural network developer to solve smaller tasks separately using small (neural network)
modules and then combine these modules in a logical manner. Modules can be
organized in several different ways, some of which are: hierarchical organization,
successive refinement and input modularity.
2.4

2.4.1

2.4.2 The Multi-Layer Perceptron
The MLP overcomes the above shortcoming of the single layer perceptron. The
idea is to carry out the computation layer-wise, moving in the forward direction.
Similarly, the weight adjustment can be done layer-wise, by moving in a backward
direction. For the nodes in the output layer, it is easy to compute the error, as we know
the actual outcome and the desired result. For the nodes in the hidden layers, since we
do not know the desired result, we propagate the error computed in the last layer
backward. This process gives a change in the weight for the edges layer-wise. This
standard method used in training MLPs is called the back propagation algorithm.
Formally, the training steps consist of:
Forward Pass: The outputs and the error at the output units are calculated.
Backward Pass: The output unit error is used to alter weights on the output units. Then
the error, at the hidden nodes is calculated, and weights on hidden nodes are altered
using these values.
For each training example, a forward pass and a backward pass are performed. This is repeated over and over again, until the error is at an acceptably low level.
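The forward-pass/backward-pass loop can be sketched for a small 2-2-1 sigmoid network. This is a minimal illustration only: the AND training task, the learning rate and the architecture are assumptions made for the sketch, not the report's own experiment.

```python
import math
import random

# A minimal back-propagation sketch: a 2-2-1 sigmoid network trained on
# the AND function. Learning rate and architecture invented for the sketch.

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

random.seed(1)
w_h = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # +bias
w_o = [random.uniform(-1, 1) for _ in range(3)]                      # +bias
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

def net_forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_h]
    y = sigmoid(w_o[0] * h[0] + w_o[1] * h[1] + w_o[2])
    return h, y

def epoch(lr=0.5):
    err = 0.0
    for x, t in data:
        h, y = net_forward(x)               # forward pass
        err += (t - y) ** 2
        d_o = (t - y) * y * (1 - y)         # output-unit error
        for i in range(2):                  # backward pass: hidden weights
            d_h = d_o * w_o[i] * h[i] * (1 - h[i])
            w_h[i][0] += lr * d_h * x[0]
            w_h[i][1] += lr * d_h * x[1]
            w_h[i][2] += lr * d_h
        for i in range(2):                  # output-unit weights
            w_o[i] += lr * d_o * h[i]
        w_o[2] += lr * d_o
    return err

first = epoch()
for _ in range(500):
    last = epoch()
print(last < first)
```

Repeating the two passes drives the squared error down, exactly as the training loop described above prescribes.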
2.4.3

2.4.4 Competitive Learning
Competitive learning, or winner-takes-all learning, is regarded as the basis of a number of unsupervised learning strategies. It involves k units with weight vectors wk of the same dimension as the input data. During the learning process, the unit whose weight vector is closest to the input vector x is adapted in such a way that its weight vector becomes even closer to the input vector. The unit with the closest vector is termed the winner of the selection process. This learning strategy is generally implemented by gradually reducing the difference between the weight vector and the input vector; the actual amount of reduction at each learning step is governed by the learning rate.
During the learning process, the weight vectors converge towards the mean of the set of
input data.
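A single winner-takes-all learning step can be sketched as follows. The unit vectors and the learning rate of 0.5 are invented for the sketch.

```python
# A sketch of one winner-takes-all learning step: the unit whose weight
# vector is closest to the input is moved toward it by the learning rate.

def winner(units, x):
    dist = lambda w: sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(units)), key=lambda k: dist(units[k]))

def learn_step(units, x, eta=0.5):
    k = winner(units, x)
    units[k] = [wi + eta * (xi - wi) for wi, xi in zip(units[k], x)]
    return k

units = [[0.0, 0.0], [1.0, 1.0]]
k = learn_step(units, [0.9, 0.8])
print(k, units[k])
```

Only the winning unit moves; repeating the step over many inputs drives each weight vector toward the mean of the inputs it wins.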
2.4.5 Kohonen's SOM

The self-organising map (SOM) is a neural network model developed by Teuvo Kohonen during 1979-82. SOM is one of the most widely used unsupervised NN models and employs competitive learning steps. It consists of a layer of input units, each of which is fully connected to a set of output units. These output units are arranged in some topology (the most common choice is a 2-D grid). The input units, after receiving the input patterns X, propagate them as they are onto the output units. Each of the output units k is assigned a weight vector wk. During the learning step, the unit c exhibiting the highest activity level w.r.t. a randomly selected input pattern X is adapted in such a way that it exhibits an even higher activity level at a future presentation of X. This is accomplished by competitive learning.
The similarity metric is chosen to be the Euclidean distance. During the learning
steps of SOM, a set of units around the winner is tuned towards the currently presented
input pattern enabling a spatial arrangement of the input patterns, such that similar
inputs are mapped onto regions close to each other in the grid of output units. Thus, the
training process of SOM results in a topological organization of the input patterns.
Thus, SOM takes a high-dimensional input and clusters it, while retaining some topological ordering of the output. After training, an input will cause the output units in some area to become active. Such clustering (and dimensionality reduction) is very useful as a preprocessing stage, whether for further neural-network processing or for other techniques.
[Figure: a 2-D array of output units, each with a weight vector wk, fully connected to a high-dimensional input X.]
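One SOM learning step, with the set of units around the winner tuned toward the input, can be sketched as follows. The grid size, learning rate and neighbourhood radius are invented for the sketch.

```python
import random

# A sketch of one SOM learning step on a small 2-D grid of output units:
# the winner and its grid neighbours are tuned toward the input pattern.
# Grid size, learning rate and neighbourhood radius are invented.

random.seed(2)
grid = {(i, j): [random.random() for _ in range(3)]  # 3-D weight vectors
        for i in range(4) for j in range(4)}

def som_step(grid, x, eta=0.3, radius=1):
    # Winner c: unit whose weight vector is closest (Euclidean) to x.
    c = min(grid, key=lambda u: sum((w - xi) ** 2 for w, xi in zip(grid[u], x)))
    for u, w in grid.items():
        # Units within the grid-distance radius of the winner are adapted.
        if max(abs(u[0] - c[0]), abs(u[1] - c[1])) <= radius:
            grid[u] = [wi + eta * (xi - wi) for wi, xi in zip(w, x)]
    return c

x = [0.9, 0.1, 0.5]
before = {u: w[:] for u, w in grid.items()}
c = som_step(grid, x)
moved = sum(1 for u in grid if grid[u] != before[u])
print(c, moved)
```

Because neighbours move together, nearby grid units come to respond to similar inputs, which is what produces the topological ordering described above.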
3. DATA MINING USING NEURAL NETWORKS

3.1 Introduction

3.2
Although neural networks have an appropriate inductive bias for a wide class of
problems, they are not commonly used for data mining tasks. There are two reasons:
trained neural networks are usually not comprehensible and many neural network learning
methods are slow, making them impractical for very large data sets.
3.3 Challenges Involved
The hypothesis represented by a trained neural network is defined by:
(a) The topology of the network
(b) The transfer functions used for hidden and output units and
(c) The real-valued parameters associated with the network connections (i.e., the
weights) and the units (i.e., the biases of sigmoid units).
Such hypotheses are difficult to comprehend for several reasons. First, typical
systems have hundreds or thousands of real-valued parameters. These parameters encode
the relationships between the input features and target values. Although single-parameter
encodings are usually not hard to understand, the sheer number of parameters in a typical
network can make the task of understanding them quite difficult. Second, in multi-layer
networks, these parameters may represent non-linear, non-monotonic relationships
between the input features and the target values. Thus, it is usually not possible to determine, in isolation, the effect of a given feature on the target value, because this effect may be mediated by the values of other features.
These non-linear, non-monotonic relationships are represented by hidden units,
which combine the inputs of multiple features, thus allowing the model to take advantage
of dependencies among the features. Hidden units can be thought of, as representing
higher-level, derived features. Understanding of hidden units is usually difficult because
they learn distributed representations. In a distributed representation, the individual units
do not correspond to well understood features in a problem domain. Instead, features,
which are meaningful in the context of the problem domain, are often encoded by patterns
of activation across many hidden units. Similarly, each hidden unit may play a part in
representing many derived features.
Consider the issue of the learning time required by neural networks. The process of learning in most neural network methods involves using some gradient-based optimization method to adjust the network's parameters. Such optimization iteratively executes two basic steps: calculating the gradient of the error function (with respect to the network's adjustable parameters) and adjusting the network's parameters in the direction suggested by the gradient. Learning can be quite slow in such methods, because the optimization may involve a large number of small steps, and calculating the gradient at each step may be quite expensive.
3.4 Advantages
One appealing aspect of many neural-network learning methods is that they are online algorithms, meaning that they update their hypothesis after every example is presented. Because they update their parameters frequently, online neural-network learning algorithms often converge much faster than batch algorithms. This is especially the case for large data sets: often a reasonably good solution can be found in one pass through a large training set. For this reason, we argue that the training-time performance of neural-network algorithms may often be acceptable for data mining tasks, especially given the availability of high-performance desktop computers.
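The online-versus-batch distinction can be illustrated on a single linear unit trained by least-mean-squares updates. The data and learning rate are invented for the sketch; the point is that the online version adjusts the weight after every example, while the batch version makes one update per pass.

```python
# Online vs batch updates for a single linear unit (least-mean-squares).
# Data and learning rate are invented for the sketch; target is y = 2x.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def online_pass(w, lr=0.1):
    for x, y in data:
        w += lr * (y - w * x) * x  # update immediately after each example
    return w

def batch_pass(w, lr=0.1):
    grad = sum((y - w * x) * x for x, y in data)
    return w + lr * grad           # one update per pass over the data

w_online = online_pass(0.0)
w_batch = batch_pass(0.0)
print(round(w_online, 3), round(w_batch, 3))
```

With this particular data and learning rate, one online pass lands closer to the true weight than one batch pass, illustrating why a single pass through a large training set can already yield a reasonable solution.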
3.5 Extraction Methods
One approach to understanding the hypothesis represented by a trained neural
network is to translate the hypotheses into a more comprehensible language. Various
strategies using this approach have been investigated under the rubric of rule extraction.
Some keywords:
Representation Language: It is the language used by the extraction method to describe the neural network's learned model. The languages that have been used include conjunctive inference rules, fuzzy rules, m-of-n rules, decision trees and finite-state automata.
Extraction Strategy: It is the strategy used by the extraction method to map the model represented by the trained neural network into a model in the new representation language - specifically, how the method explores the candidate descriptions and what level of description it uses to characterize the neural network. That is, do the rules extracted describe the behaviour of the network as a whole (global methods), the behaviour of individual units (local methods), or something in between these two cases?
Network Requirements: The architectural and training requirements that the extraction
method imposes on neural networks. In other words, the range of networks to which the
method is applicable.
3.5.1

[Figure: a rule-extraction example - a unit with inputs x1 to x5, connection weights (one weight is -4) and a threshold of -9.]
3.6
An m-of-n expression is satisfied when at least m of its n conditions are satisfied. For example, suppose we have three Boolean features, a, b, and c; the m-of-n expression 2-of-{a, ¬b, c} is logically equivalent to (a ∧ ¬b) ∨ (a ∧ c) ∨ (¬b ∧ c).
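Evaluating an m-of-n expression is straightforward; a small helper matching the example in the text:

```python
# A small helper evaluating an m-of-n expression, matching the example
# in the text: 2-of-{a, not-b, c}.

def m_of_n(m, conditions):
    return sum(1 for c in conditions if c) >= m

def rule(a, b, c):
    return m_of_n(2, [a, not b, c])

# The 2-of-3 rule equals (a and not b) or (a and c) or (not b and c).
print(rule(True, False, False), rule(False, True, False))
```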
Split Selection. Split selection involves deciding how to partition the input space at a
given internal node in the tree. A limitation of conventional tree-induction algorithms is
that the amount of training data used to select splits decreases with the depth of the tree.
Thus splits near the bottom of a tree are often poorly chosen because these decisions are
based on few training examples. In contrast, because Trepan has an oracle available, it is
able to use as many instances as desired to select each split.
Stopping Criteria. Trepan uses two separate criteria to decide when to stop growing an
extracted decision tree. First, a given node becomes a leaf in the tree if, with high
probability, the node covers only instances of a single class. To make this decision, Trepan
determines the proportion of examples that fall into the most common class at a given
node, and then calculates a confidence interval around this proportion.
Trepan also accepts a parameter that specifies a limit on the number of internal
nodes in an extracted tree. This parameter can be used to control the comprehensibility of
extracted trees, since in some domains, it may require very large trees to describe
networks to a high level of fidelity.
3.6.3 Algorithm

Input: Oracle(), training set S, feature set F, min_sample

Initialize the root of the tree, R, as a leaf node
/* get a sample of instances */
use S to construct a model MR of the distribution of instances covered by node R
q := max(0, min_sample - |S|)
query_instancesR := a set of q instances generated using model MR
/* use the network to label all instances */
for each example x in (S U query_instancesR)
    label x with the class given by Oracle(x)
4. CONCLUSION
The advent of Data Mining is only the latest step in the extension of quantitative,
REFERENCES

[1] Sushmita Mitra, Sankar K. Pal and Pabitra Mitra, "Data Mining in a Soft Computing Framework: A Survey," IEEE Transactions on Neural Networks, Vol. 13, No. 1, January 2002.
[2] Mark W. Craven and Jude W. Shavlik, "Using Neural Networks for Data Mining."
[3] Arjun K. Pujari, "Data Mining Techniques."
[4] John Hertz, Anders Krogh and Richard G. Palmer, "Introduction to the Theory of Neural Computation."
[5] Kishan Mehrotra, Chilukuri K. Mohan and Sanjay Ranka, "Elements of Artificial Neural Networks."
[6] "Artificial Neural Networks," Galgotia Publications.
[7] Kanti Bansal, Sanjeev Vadhavkar and Amar Gupta, "Neural Networks Based Data Mining and Knowledge Discovery in Inventory Applications."
[8] Ruth Dilly, "Data Mining: An Introduction," Parallel Computer Centre, Queen's University Belfast: http://www.pcc.qub.ac.uk/tec/courses/datamining/stu_notes/dm_book_1.html
[9] "Introduction to Backpropagation Neural Networks": http://cortex.snowseed.com/index.html
[10] "Data Mining Techniques," electronic textbook, StatSoft: http://www.statsoftinc.com/textbook/stdatmin.html#neural