
Feature Selection for a real-world Learning Task

Kollmar, D.; Hellmann, D. H.


Universität Kaiserslautern, D-67663 Kaiserslautern, Germany {kollmar, hellmann}@rhrk.uni-kl.de

Abstract. Machine learning algorithms used in early fault detection for centrifugal pumps make it possible to better exploit the information content of measured signals, making machine monitoring more economical and application-oriented. The total number of sensors is reduced by exploiting the information derived from the sensors far beyond the scope of traditional engineering through the application of various features and high-dimensional decision-making. Feature selection plays a crucial role in modelling an early fault detection system. Due to the presence of noisy features with outliers and of correlations between features, a correctly determined subset of features will distinctly improve the classification rate. In addition, the hardware requirements for monitoring the pump decrease and, with them, its price. Wrappers and filters, the two major approaches for feature selection described in the literature [4], will be investigated and compared using real-world data.

The machine learning task


The process industry provides many commodities of our modern society such as chemicals, pharmaceuticals and food. In most plants pumps are the driving force and their availability directly determines the production output. Smaller process pumps are low-priced, highly standardised products; hundreds and even thousands of them are employed in one large plant alone. Due to the diversity of the processes and the media handled, the actual design of each pump as well as its operating range vary depending on the type of plant. Pump damage is caused not only by ageing and natural wear, but also by impermissible operating conditions such as operation at minimum flow or dry running. Market competition has put a focus on the decrease of operating costs, namely production outages due to pump downtimes and plant design costs, which are strongly driven by redundant pumps and the additional piping and instrumentation they require. Traditional machine monitoring techniques employ one or more sensors per fault whose signals are interpreted by a human expert. Sterile, hazardous or toxic environments impose strong requirements on sensors; the price of some measurement chains almost equals that of the pump. In former times, skilled service staff detected faults such as a bearing failure or cavitation from alterations in the sound emitted by a pump, even within the complex mixture of sounds caused by adjacent machinery.
[Illustration: the human expert as a model for machine learning, selectively filtering information from complex signals.]

Fig. 1. Extracting Information from Complex Signals

This implicit monitoring system fulfils the requirement of sensor minimisation and does not even require any media-wetted sensors, which could conflict with certain aggressive products. Therefore the human expert is the ideal model for the early fault detection system.

Machine Learning meets Pump Monitoring


In the DFG¹-funded project "Machine Learning applied to early fault detection for failure critical components" [12] the application of machine learning algorithms to pump monitoring has been investigated for the first time. The interdisciplinary collaboration of pump experts and computer scientists was a major asset in this new approach. The data is obtained from velocity probes mounted on the pump casing. The amplitude and phase frequency spectra carry a huge amount of information despite strong correlations between certain frequencies. The machine learning algorithm proposed for this project is See5 [11], a very fast and powerful inducer of decision trees. Many comparative studies of machine learning algorithms, such as those by Lim and Loh [5] or the Statlog project [7], proved the competitive classification ability and the outstanding computing performance of this program. A recent study by Martens et al. [6], comparing C4.5, the predecessor of See5, and NEFClass, a neural network combined with fuzzy logic, reflects these results:
[Bar chart: classification rates (0-100%) of C4.5 and NEFClass on several benchmark databases (among them Breast, Heart, Monks1, Monks2, Monks3 and Pima) and their average.]
Fig. 2. Comparison of C4.5 and NEFClass on various machine learning problems [6].

Neither algorithm has a consistently superior classification rate; depending on the type of learning problem, either the neural network or the decision tree is better. On average the decision tree even outperforms the neural network with regard to the classification rate. As shown by Quinlan in [9], neural networks are advantageous when most features are to be exploited in parallel for making a decision, whereas the decision tree can handle features which are only locally relevant. The major advantage of decision trees is their training time, which is 2-3 orders of magnitude shorter than that of neural networks. A further asset of See5 are its cost matrices, which allow the user to model a real-world problem: the individual importance of the correct classification of different classes as well as the ratio of false alarms to undetected faults can be influenced. Centrifugal pumps cease to operate when the medium contains too much gas. The gas content of the liquid is a continuous variable and any threshold for defining classes is arbitrary.

[Cost matrix: rows give the true class, columns the predicted class.]

Fig. 3. Cost matrix for modeling a pump monitoring task in See5

¹ Deutsche Forschungsgemeinschaft (German Research Association)

As shown in figure 3, costs can be employed to model the physical relationship among the different gas classes (GA, GB and GC): 0 refers to a correct classification, 1 describes a misclassification into a neighbouring class and 3 refers to a total misclassification. Blockage (pressure-side valve closed) is a purely binary fault, which is translated into misclassification costs of 3 for any different class. During optimisation of a pump monitoring system, single cost values can be adjusted to influence the classification ability of the decision tree. Although the overall error rate often increases, the errors can be shifted to harmless misclassifications, e. g. the confusion of neighbouring classes. In the following study the cost matrix remains unchanged in order to ensure the comparability of the results. The average costs, which will be used in the following to evaluate different classifiers, are obtained by dividing the cumulated costs for the unseen test data sets (3 cross validations) by their population.
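As an illustration, the cost structure described above can be written down as follows. This is a minimal sketch, not the original See5 configuration: it assumes, for simplicity, that only the three gas classes and the blockage fault are present and that classes are encoded as integer indices.

```python
# Illustrative reconstruction of the cost structure of figure 3, restricted to
# the three gas classes (GA, GB, GC) and the binary blockage fault.
import numpy as np

classes = ["GA", "GB", "GC", "Blockage"]

C = np.full((4, 4), 3.0)      # default: total misclassification
np.fill_diagonal(C, 0.0)      # correct classification
C[0, 1] = C[1, 0] = 1.0       # GA <-> GB are neighbouring gas classes
C[1, 2] = C[2, 1] = 1.0       # GB <-> GC are neighbouring gas classes

def average_costs_from_predictions(true_idx, pred_idx):
    """Cumulated costs on the test data divided by its population."""
    return C[np.asarray(true_idx), np.asarray(pred_idx)].mean()
```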

The process of modelling an early fault detection system is described in figure 4. The pump is equipped with different sensors and operated under normal operating conditions at various flow rates and under every fault to be detected. In a preliminary study, only the frequencies carrying information are preselected by a human expert in order to reduce the amount of data. Up to 40 features are generated from one sensor. A typical database consists of 30.000 data sets with 30 features each.

[Fig. 4 shows an amplitude spectrum (vibration velocity in mm/s over frequency in Hz, 0-200 Hz) for cavitation and normal operation, the preselection of characteristic features with a high content of information, and a decision tree with the nodes "Speed < 2950" and "1. Harmonic > 3,4" separating Dry Run, Cavitation and Normal Operating.]

Feature                      Normal Operating   Cavitation
Speed                        2943               2945
1. Harmonic (Amplitude)      2,7 mm/s           3,7 mm/s
1. Harmonic (Phase Angle)    112                180
4. Harmonic (Amplitude)      1,4 mm/s           2,8 mm/s

Fig. 4. Defining features from spectra and using them for growing a decision tree

Wrappers for Feature Selection


As shown in figure 5, an exhaustive exploration of the feature space is hardly possible with as few as 10 features due to the number of combinations arising and the resulting computing time.
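A quick check of this growth, sketched in Python: the number of non-empty subsets of n features equals 2^n - 1, so an exhaustive wrapper search explodes long before 25 features are reached.

```python
from math import comb

def n_subsets(n):
    """Number of non-empty feature subsets Z (equals 2**n - 1)."""
    return sum(comb(n, i) for i in range(1, n + 1))

print(n_subsets(10))   # 1023
print(n_subsets(25))   # 33554431
```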
$$Z = \sum_{i=1}^{n} \frac{n!}{i!\,(n-i)!}$$

[Plot: number of different feature subsets Z (logarithmic scale, 1 to 100 000) over the number of features n (1 to 15).]

Fig. 5. Evolution of the number of different feature subsets for an exhaustive search


Wrappers [4, 8] are auxiliary algorithms wrapped around the basic machine learning algorithm to be employed for classification. They create several different subsets of features and compute a classifier for each of them. According to a criterion, mainly the error rate on unseen data, the optimal subset is selected.
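The core of such a wrapper can be sketched as follows. This is a minimal illustration, not the setup used in the project: scikit-learn's DecisionTreeClassifier stands in for See5 (and, unlike See5, does not use the cost matrix during training), y is assumed to hold integer class codes indexing the cost matrix, and lower average costs are taken as better. The helper is reused in the sketches below.

```python
# Wrapper criterion: cross-validate a decision tree on a feature subset and
# average the misclassification costs defined by a cost matrix.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def average_costs(X, y, subset, cost_matrix, n_splits=3):
    """Average misclassification costs of a tree trained on the given subset."""
    X_sub = X[:, list(subset)]
    total_cost, n_records = 0.0, 0
    for train_idx, test_idx in StratifiedKFold(n_splits=n_splits).split(X_sub, y):
        tree = DecisionTreeClassifier().fit(X_sub[train_idx], y[train_idx])
        pred = tree.predict(X_sub[test_idx])
        total_cost += cost_matrix[y[test_idx], pred].sum()
        n_records += len(test_idx)
    return total_cost / n_records          # lower is better
```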

[Diagram: See5 is trained with a sequence of feature subsets (d1 ... dm features, each a subset of the D available features, e.g. speed, 1. harmonic); each run yields a classification model, and the best model determines the optimal feature subset.]

Fig. 6. Wrappers determine the optimal feature subset by running the machine learning algorithm with a sequence of subsets

Best Features (BF) [2]
Given a data set with N features, N classifiers are trained, each using one feature only. The classifiers are ranked according to the criterion and the best M < N are selected. Best Features takes into account neither dependencies between features (feature A is only effective in the presence of feature B) nor redundancy among the features and is therefore a fast but very suboptimal approach.
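A minimal sketch of this ranking, reusing the hypothetical average_costs() helper from the wrapper sketch above (lower costs assumed to be better):

```python
def best_features(X, y, cost_matrix, M):
    """Rank single features by the wrapper criterion and keep the best M."""
    ranking = sorted(range(X.shape[1]),
                     key=lambda f: average_costs(X, y, (f,), cost_matrix))
    return ranking[:M]     # ignores feature interactions and redundancy
```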
Sequential Forward Selection (SFS) [2]
In its basic form the feature selection starts with an empty subset. In each round all possible feature subsets are constructed which contain the best feature combination found in the previous round plus one extra feature. A classifier is trained and evaluated on each subset. This approach is capable of avoiding the selection of redundant features.

[Fig. 7 traces a double-stage SFS over the features A-E: in the 1st round all subsets with one or two features (A ... E, AB ... DE) are evaluated and BE is retained; in the 2nd round BE is extended by one or two features (ABE, BCE, BDE, ABCE, ABDE, ...) and BDE is retained; the 3rd round yields ABDE.]

Fig. 7. Double-stage sequential forward-algorithm with 5 features (A, B, C, D, E)

In order to determine dependencies between features, several features have to be introduced simultaneously; the basic scheme is depicted in figure 7. With 25 features, the first round requires training 25 classifiers in a single-stage SFS, 325 classifiers in a double-stage SFS and as many as 2625 classifiers in a triple-stage SFS.

Sequential Backward Selection (SBS) [2]
The SBS is initialised with the complete feature set. In each round all possible feature subsets are constructed from the current subset by omitting one feature. After each round the feature subset with the best criterion is retained. As in the case of the SFS, the single-stage version is capable of detecting redundant features, but higher stage orders have to be considered for dealing with dependent features. A sketch of the single-stage forward variant is given below; the backward variant mirrors it.
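The sketch again uses the hypothetical average_costs() helper from the wrapper sketch, with lower values taken as better; the backward variant is obtained by starting from the full set and removing one feature per round instead.

```python
def sfs(X, y, cost_matrix):
    """Single-stage sequential forward selection using average_costs()."""
    remaining = set(range(X.shape[1]))
    selected, history = [], []
    while remaining:
        costs = {f: average_costs(X, y, tuple(selected) + (f,), cost_matrix)
                 for f in remaining}
        best = min(costs, key=costs.get)
        selected.append(best)
        remaining.remove(best)
        history.append((tuple(selected), costs[best]))
    return history     # afterwards pick the subset with the lowest average costs
```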
[Fig. 8 illustrates a 2-stage SBS over the features A-E: starting from ABCDE, the 1st round forms all subsets with one or two features removed (ABCD ... CDE, ABC ... CDE) and retains ACDE (feature B removed); the 2nd round continues from ACDE (subsets such as AC, AD, AE, CD, CE, DE, ACD, ACE, ADE, CDE).]

Fig. 8. 2-stage sequential Backward-Algorithm with 5 features (A, B, C, D, E)

Branch and Bound (BB) [2]
In contrast to the selection algorithms described so far, the BB yields an optimal subset (as an exhaustive search would) provided the following assumption is fulfilled: adding an extra feature y to a feature subset does not lead to a deterioration of the criterion J (monotonicity).

Feature sets:

$$\{y_1, y_2, \ldots, y_D\} = Y = \xi_0 \supseteq \xi_1 \supseteq \xi_2 \supseteq \ldots \supseteq \xi_D = \{\,\}$$   (Eq. 1)

Criterion:

$$J(\xi_0) \geq J(\xi_1) \geq J(\xi_2) \geq \ldots \geq J(\xi_D)$$   (Eq. 2)

The BB-algorithm is depicted in figure 9:


[Fig. 9 shows a branch and bound tree: starting from the full set {1,2,3,4,5,6}, one feature is removed along each branch (e.g. to reach {1,3,5,6}) until leaves with 2 features such as {5,6}, {3,4}, {1,6} and {1,2} are reached; the numbers at the branches denote the removed features.]

Fig. 9. Branch and Bound-Tree for determining the optimal subset containing 2 out of 6 features

The tree is traversed from the right to the left side. The criterion c = J({1,2}) of the first leaf {1,2} is determined. Next the tree is searched for the first unprocessed node. Departing from this node the rightmost unprocessed path is chosen, and for each node the criterion is determined. If, for example, J({1,3,5,6}) < c = J({1,2}), the current path is abandoned. BB is ideal for any classifier which allows for recursive training when adding or removing a feature, e. g. linear discriminant analysis. In general the monotonicity assumption only holds when the error rate on training data is chosen as criterion, but it fails for any criterion based on unseen test data; in this case BB is no longer optimal.
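A compact sketch of this search under the stated monotonicity assumption; J() is a placeholder for a monotone criterion such as a class-separability measure (not the See5 test error, for which the assumption fails), and larger values are taken as better.

```python
def branch_and_bound(features, target_size, J):
    """Find the subset of `target_size` features maximising the monotone
    criterion J; branches whose value already falls below the best complete
    subset found so far are pruned."""
    best = {"subset": None, "value": float("-inf")}

    def descend(subset, start):
        value = J(subset)
        if value <= best["value"]:
            return                                   # bound: prune this branch
        if len(subset) == target_size:
            best["subset"], best["value"] = subset, value
            return
        # remove one more feature; `start` avoids visiting a subset twice
        for i in range(start, len(subset)):
            descend(subset[:i] + subset[i + 1:], i)

    descend(tuple(features), 0)
    return best["subset"], best["value"]
```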

Comparison of Wrappers
As BF and BB were considered unsuitable for the pump data, only SBS and SFS have been examined. The results depicted in figure 10 show a strong influence of feature selection on the quality of the resulting classifier. Whereas using all 25 features (i. e. directly employing the data without feature selection) yields average misclassification costs of 0,62, the optimum for a single-stage SFS occurs at 17 features with costs of 0,50, while a combination of only 6 features already leads to costs of 0,51. A double-stage SFS further reduces the misclassification costs to 0,38. Both SBS trials proved worse than their SFS counterparts, although SBS should theoretically retain dependent features due to its deselection strategy.
[Plot: average costs on test data (0,3 to 1,5) over the number of features (up to 25) for the 1-stage and 2-stage sequential forward and backward selections.]

Fig. 10. Comparison of SBS and SFS with pump data (30.000 data sets; 25 features; 3 fold crossvalidation averaged in each round) using See5 with a cost matrix (weighted misclassification rate)

A ranking of the features (figure 11) showed the varying effectiveness of the features depending on the selection method. The feature c1 was selected in the first round of the double-stage SFS, but only in the 21st round of the single-stage SFS. What is more, the second feature selected alongside c1 in the double-stage selection was n, which was selected in the 2nd round of the single-stage SFS. The chart shows a strong interdependency between the features and the non-linear behaviour of the inducer.
[Bar chart: relevance rank (1 = most relevant, up to 25) of the features vR, c1, c4, s8, v8a, v1a, c2, v2p, v0p, v1p, c0, vB2, vB1, v0a, s2, s0, v2a, s1, v4p, v4a, c8, veff, s4 and v8p in the four selections.]

Fig. 11. Ranking of the 25 features in the 4 selections shown in figure 10 (1 = most relevant)

A further obstacle is the influence of the two tuning parameters implemented in See5 on the actual feature selection. Not only does the value of the criterion change; the location of the optimum and the best feature subset are altered as well.
[Plot: average costs on test data over the number of features (up to 25) for three See5 parameter settings (Cases = 10, Pruning = 0,1; Cases = 10, Pruning = 0,01; Cases = 2, Pruning = 1).]

Fig. 12. Influence of the parameters in See5 on a double-stage SFS (30.000 data sets; 25 features; 3 fold crossvalidation averaged in each round). The curve for Pruning = 0,01 was started with a triple-stage SFS in the 1st round and then continued as a double-stage SFS

Genetic Wrapper
The comparison of wrappers showed a distinct advantage of multistage wrappers due to their ability to select dependent features. Even if the selection is stopped after reaching the optimum, the computing effort remains high because of the large number of combinations in the beginning. A further disadvantage of the sequential selection is its rigid strategy, which does not allow for the removal of features which, in the course of the selection process, have become obsolete. An alternative approach consists of a genetic selection which explores the feature space from a determined or from a random starting point. Genetic algorithms use two parallel strategies for optimising:
- evolution (small and steady development of the population and selection of the best)
- mutation (spontaneous creation of new specimens)


The flow charts of the genetic selection are shown in figures 13 & 14. The algorithm uses a result stack for the subsets already evaluated and a working stack for new subsets. The subsets in the working stack are sorted with respect to their prognosis, which is the evaluation criterion reached by their parent. In each round the first A subsets are evaluated.
Randomly create N subsets with M features and store them in the working stack
repeat until the stop condition is met or the working stack is empty:
    with the A subsets having the best prognosis:
        remove from working stack
        train classifier and evaluate criterion
        store subset and criterion in result stack
        expand subsets
    Mopt = number of features of the best subset
    per B subsets in the working stack:
        create 1 mutation with Mopt features and store it on top of the working stack

Fig. 13. Flow chart for genetic feature selection

The evolution is modelled by single-stage SFS and SBS. After training a classifier from a subset with M features, all possible sequential combinations with M+1 and M-1 features are composed. All subsets which have not been evaluated yet are stored in the working stack.
with each subset to be expanded:
    perform single-stage sequential forward selection:
        set prognosis = criterion(subset); add to templist
    perform single-stage sequential backward selection:
        set prognosis = criterion(subset); add to templist
for each subset in templist:
    if the subset is in the result stack: discard it
    else if the subset is already in the working stack:
        prognosis = MAX(prognosis(working stack), prognosis(templist))
    else:
        store the subset in the working stack

Fig. 14. Expansion of Subsets
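The complete loop of figures 13 and 14 might be sketched as follows. This is an illustrative reduction, not the original implementation: evaluate() is a placeholder for the cross-validated See5 criterion (lower taken as better, so the best prognosis is the lowest value), the parameter names mirror the description above (N, M, A, B), and mutations are put on top of the working stack by giving them the best possible prognosis.

```python
import random

def genetic_selection(n_features, evaluate, N=10, M=10, A=100, B=0.02,
                      n_rounds=20):
    """Genetic wrapper sketch; `evaluate(subset)` returns the criterion."""
    results = {}                                     # subset -> criterion
    working = {}                                     # subset -> prognosis
    for _ in range(N):                               # random initial population
        working[frozenset(random.sample(range(n_features), M))] = float("inf")

    def expand(subset, criterion):
        """Single-stage forward/backward neighbours inherit the parent's value."""
        neighbours = [subset | {f} for f in range(n_features) if f not in subset]
        neighbours += [subset - {f} for f in subset if len(subset) > 1]
        for nb in neighbours:
            if nb not in results:
                working[nb] = min(criterion, working.get(nb, float("inf")))

    for _ in range(n_rounds):
        if not working:
            break
        for s in sorted(working, key=working.get)[:A]:   # best prognosis first
            results[s] = evaluate(s)
            del working[s]
            expand(s, results[s])
        best = min(results, key=results.get)
        for _ in range(max(1, int(B * len(working)))):   # mutations
            m = frozenset(random.sample(range(n_features), len(best)))
            if m not in results:
                working[m] = float("-inf")               # top of the working stack

    best = min(results, key=results.get)
    return best, results[best]
```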

The algorithm will explore the feature space exhaustively if no stopping criterion is introduced. If memory is scarce, the size of the working stack can be limited by deleting the subsets with the worst prognosis. When the selection is attracted by a local minimum, the area will be fully explored before the search is continued outside, unless a mutation leads to a better subset. The double-stage SFS, applied to the pump data, determined a subset of 10 features with average costs on test data of 0,38. The least computing effort (stopping immediately after a minimum) would have been the training of 1386 decision trees.

[Plot: average costs of the three best subsets in each round of the genetic selection over 20 rounds.]

Fig. 15. Genetic feature selection with N = 10 subsets / M = 10 features / A = 100 subsets / B = 2%

After training 610 decision trees the genetic selection (figure 15) comes up with a subset of 9 features at costs of 0,38. Among the 1110 subsets evaluated up to round 12, there are 5 subsets with 9 or 10 features having costs of 0,38 each. The user can choose between different subsets and thereby minimise other criteria imposed on the features, such as the number or cost of the sensors required.

[Plot: average costs of the three best subsets in each round of the genetic selection over 18 rounds.]

Fig. 16. Genetic feature selection with N = 10 subsets / M = 4 features / A = 100 subsets / B = 2%

Due to the evolutionary strategy, the randomly selected initial features play a minor role for the convergence and the quality of the results. The computing time rises if the number of features in the initial selection is far off the optimum, as shown in figure 16. For an unknown data set it may be worthwhile to start with a single-stage SFS and set the initial value M to the number of features at the minimum of the criterion.

Filters for feature selection


Filters are standalone algorithms which determine an optimal subset regardless of the machine learning algorithm envisaged for classification. Filters are mainly used for ML algorithms with categorical classes; the effectiveness of continuous features for continuous classification problems can be evaluated directly using regression techniques. Three groups of filters can be discerned:
- Wrappers combined with a fast machine learning algorithm, e. g. nearest neighbour. The evaluation of the subsets is based on a criterion defined for the classifier of the secondary ML algorithm.
- Blackbox algorithms, which directly determine a suitable feature subset (e. g. Eubafes [10]).
- Evaluation of the feature subset based on the class topology in the feature space. Criteria are the distance between the classes in the subspace (based on the records in the training set) or the classification probability based on the estimated density functions of the classes, e. g. the Chernoff, Bhattacharyya or Mahalanobis distance [2].
Criteria which depend on distances in the feature space are strongly influenced by scaling (e. g. due to an altered amplification of a signal used to derive some features) and transformations applied to the features, as well as by the chosen metric:


[Sketch: two classes A and B in a two-dimensional feature space (feature x, feature y) with the City-Block, Euclidean and Tchebychev distances between two points 1 and 2.]

Fig. 17. Metrics for the distance between 2 points 1 and 2 in the feature space

In general the Euclidean metric is preferred, but the City-Block metric (sum of the components) and the Tchebycheff metric (largest component) deliver different kinds of information about the data at modest computing effort. The estimation of the probabilistic distance requires knowledge of the class density function. In general a normal distribution is assumed, but particularly in the case of multi-modal class densities this assumption is ill-conceived.
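For illustration, the three metrics of figure 17 written as a minimal NumPy sketch (p1 and p2 are points in the feature space, given as arrays):

```python
import numpy as np

def city_block(p1, p2):
    return np.abs(p1 - p2).sum()           # sum of the components

def euclidean(p1, p2):
    return np.sqrt(((p1 - p2) ** 2).sum())

def tchebycheff(p1, p2):
    return np.abs(p1 - p2).max()           # largest component
```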
Analysis of decision trees
Occasionally statistics derived from decision trees are used as a filter. In [1] the features are ranked according to their frequency of occurrence in the decision tree.

[Fig. 18 contrasts two two-class problems in a two-dimensional feature space: a class border orthogonal to the feature axes, which a compact decision tree (nodes MM1 > x1, MM2 > y1 with leaves K1, K2) can reproduce, and an oblique class border, which forces a deep tree with many alternating splits on both features.]

Fig. 18. Orthogonal and oblique class borders and corresponding decision trees

As shown in figure 18, this approach favours features which require oblique borders in the feature space for class separation. Some weighting rules have been considered, but none performed satisfactorily on the pump data (the sketch below illustrates the basic frequency count and the first of these weightings):
- weighting of the feature frequency with the number of data sets of the node
- weighting of the feature frequency with the number of correctly classified data sets of the node
- weighting of the feature frequency with the node level in the tree
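A sketch of the basic frequency ranking from [1], with a scikit-learn decision tree standing in for See5 and an optional weighting by the number of data sets in each node:

```python
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def frequency_ranking(X, y, weight_by_node_size=False):
    """Rank features by how often they appear as split attributes in a tree."""
    tree = DecisionTreeClassifier().fit(X, y).tree_
    counts = Counter()
    for node in range(tree.node_count):
        f = tree.feature[node]
        if f >= 0:                               # negative values mark leaf nodes
            counts[f] += tree.n_node_samples[node] if weight_by_node_size else 1
    return counts.most_common()                  # [(feature_index, weight), ...]
```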

Comparison of filter algorithms
The filtering of features for classification with See5 using a cost matrix is challenging because, due to the cost matrix in See5, the optimisation goal differs between filter and classifier.

Several filter criteria, applied in a single-stage SFS, are plotted in figure 19. All criteria have a monotonous evolution; therefore the determination of the most suitable feature subset is arbitrary and has to be based on the curvature or a similar criterion.
[Plot: relative criterion value (0% to 100%) over the number of features (1 to 20) for the Tschebychev, City-Block, Euklid J3, Euklid J2, Mahalanobis and Bhattacharyya criteria.]

Fig. 19. Criteria of different filters as in [13] for an increasing number of features (SBS), related to 100% for 20 features of the pump problem. The dot indicates the feature subset actually chosen for testing.
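As an example of the probabilistic criteria in figure 19, the Bhattacharyya distance between two classes might be computed as follows. This is a sketch that assumes normally distributed classes, which, as noted above, can be ill-conceived for multi-modal densities.

```python
import numpy as np

def bhattacharyya(X1, X2):
    """X1, X2: records of two classes (rows = data sets, columns = features)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.atleast_2d(np.cov(X1, rowvar=False))
    S2 = np.atleast_2d(np.cov(X2, rowvar=False))
    S = (S1 + S2) / 2.0
    diff = m1 - m2
    term1 = 0.125 * diff @ np.linalg.solve(S, diff)
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term1 + term2                 # larger = better class separability
```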

In addition, a Bayes classifier and a 1-nearest-neighbour classifier have been combined with a single-stage SFS. The best feature subset determined with each filter was subjected to a parameter study in See5 (figure 20). Most filters lead to poorly performing feature subsets. Only the Bhattacharyya distance achieves results similar to the single-stage SFS wrapper with See5. A general suitability of probabilistic filters is refuted by the Mahalanobis distance.
[Bar chart: average error costs for test data of the best feature subset determined with each method, grouped into Blackbox (Eubafes with weights 0,5 / 0,8 / 0,9 / 1,0), other classificators (Bayes, nearest neighbour), class distance (Tschebychev, City-Block, Euklid J2, Euklid J3) and probabilistic distance (Mahalanobis, Bhattacharyya) criteria, compared with the full feature set and the wrapper for C5; each bar is annotated with the number of selected features and the best See5 parameters (m = Cases, c = Pruning).]

Fig. 20. Average costs for test data (mean value of 3 cross validations) obtained with the best See5-decision tree after a parameter optimisation with 4 values for Cases (m) and 7 values for Pruning (c).

Conclusion
Several feature selection algorithms have been compared on a real-world machine learning problem. Due to its flexibility for modelling a complex relationship among classes with cost matrices, See5 has been chosen as the induction algorithm. Sequential forward and backward selection has been studied in depth; using multistage approaches, very effective feature subsets have been identified. A new genetic wrapper algorithm reduces the computing time while increasing the number of nearly optimal subsets determined. As expected, filter algorithms perform worse than wrappers because of the particular optimisation goal introduced by the cost matrix. The application of filters only seems advisable with slowly converging classifiers such as neural networks, when computing time limits the usage of wrappers.

References
1. Cardie, C.: Using decision trees to improve case-based learning; 10th Conference on Machine Learning; Wien; 1993
2. Devijver, P.; Kittler, J.: Pattern Recognition - A Statistical Approach; Prentice Hall; 1982
3. Duda, R.; Hart, P.: Pattern Classification and Scene Analysis; John Wiley & Sons; 1973
4. Kohavi, R.: Wrappers for Performance Enhancement and Oblivious Decision Graphs; Dissertation; Stanford University; 1995
5. Lim, T.-S.; Loh, W.-Y.; Shih, Y.-S.: A Comparison of Prediction Accuracy, Complexity, and Training Time of 33 Old and New Classification Algorithms; Machine Learning; preprint www.recursivepartitioning.com/mach1317.pdf; 1999
6. Martens et al.: An Initial Comparison of a Fuzzy Neural Classifier and a Decision Tree Based Classifier; Expert Systems with Applications (Pergamon) 15; 1998
7. Michie, D.; Spiegelhalter, D.; Taylor, C.: Machine Learning, Neural and Statistical Classification; Ellis Horwood; 1994
8. Perner, P.; Apte, C.: Empirical Evaluation of Feature Subset Selection Based on a Real World Data Set; in: Zighed, D. A.; Komorowski, J.; Zytkow, J. (eds.): Principles of Data Mining and Knowledge Discovery; Springer; 2000
9. Quinlan, J. R.: Comparing Connectionist and Symbolic Learning Methods; in: Computational Learning Theory and Natural Learning Systems: Constraints and Prospects (ed. R. Rivest); MIT Press; 1994
10. Scherf, M.; Brauer, W.: Feature Selection by Means of a Feature Weighting Approach; Technical Report No. FKI-221-97; Forschungsberichte Künstliche Intelligenz, Institut für Informatik, TU München; 1997
11. See5 / C5; Release 1.10; Quinlan, J. R.; www.rulequest.com; 1999
12. Spath, D.; Hellmann, D. H.: Automatisches Lernen zur Störungsfrüherkennung bei ausfallkritischen Anlageelementen; Abschlussbericht zum DFG-Forschungsprojekt Sp 448/7-3 und He 2585/1-3; 1999
13. Rauber, T.: Tooldiag - Pattern Recognition Toolbox; Version 2.1; www.inf.ufes.br/~thomas/home/tooldiag; 1994
