Feature Selection For A Real-World Learning Task
Abstract. Machine learning algorithms used in early fault detection for centrifugal pumps make it possible to better exploit the information content of measured signals, making machine monitoring more economical and application-oriented. The number of sensors is reduced by exploiting the information derived from each sensor far beyond the scope of traditional engineering, through the application of various features and high-dimensional decision-making. Feature selection plays a crucial role in modelling an early fault detection system. Due to the presence of noisy features with outliers and correlations between features, a correctly determined subset of features will distinctly improve the classification rate. In addition, the hardware requirements for monitoring the pump decrease, and with them its price. Wrappers and filters, the two major approaches for feature selection described in the literature [4], will be investigated and compared using real-world data.
This implicit monitoring system fulfils the requirement of sensor minimisation and does not even require any media-wetted sensors, which could conflict with certain aggressive products. Therefore the human expert is the ideal model for the early fault detection system.
[Fig. 2 residue: a table comparing the classification rates of C4.5 and NEFClass on the Breast, Heart, Monks1, Monks2, Monks3 and Pima databases, including the average.]
Fig. 2. Comparison of C4.5 and NEFClass on various machine learning problems [6].
Neither of the algorithms has a consistently superior classification rate; depending on the type of learning problem, either the neural network or the decision tree is better. On average the decision tree even outperforms the neural network with regard to the classification rate. As shown by Quinlan in [9], neural networks are advantageous when most features are to be exploited in parallel for making a decision, whereas the decision tree can handle features which are only locally relevant. The major advantage of decision trees is their training time, which is 2 to 3 orders of magnitude shorter than that of neural networks. A further asset of See5 is its cost matrices, which allow the user to model a real-world problem: the individual importance of the correct classification of different classes, as well as the ratio of false alarms to undetected faults, can be influenced. Centrifugal pumps cease to operate when the medium contains too much gas. The gas content of the liquid is a continuous variable, and any threshold for defining classes is arbitrary.
[Fig. 3: cost matrix over predicted class vs. true class for the gas classes GA, GB, GC and the blockage fault.]
As shown in figure 3, costs can be employed to model the physical relationship among the different gas classes (GA, GB and GC): 0 refers to a correct classification, 1 describes a misclassification into a neighbouring class and 3 refers to a total misclassification. Blockage (pressure-sided valve closed) is a purely binary fault, which is translated into misclassification costs of 3 for any different class. During optimisation of a pump monitoring system, single cost values can be adjusted to influence the classification ability of the decision tree. Although the overall error rate often increases, the errors can be shifted to harmless misclassifications, e.g. the confusion of neighbouring classes. In the following study the cost matrix remains unchanged in order to ensure the comparability of the results. The average costs, which in the following will be used to evaluate different classifiers, are obtained by dividing the cumulated costs for the unseen test data sets (3 cross validations) by their population.

The process of modelling an early fault detection system is described in figure 4. The pump is equipped with different sensors and operated under normal operating conditions at various flow rates and under every fault to be detected. In a preliminary study, only the frequencies carrying information are preselected by a human expert in order to reduce the amount of data. Up to 40 features are generated from one sensor. A typical database consists of 30,000 data sets with 30 features each.
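To make the evaluation procedure concrete, here is a minimal sketch of the average-cost computation described above. The class labels GA, GB, GC and BL (blockage) and the exact matrix entries are hypothetical, chosen to follow the scheme of figure 3; the paper's actual matrix is not reproduced here.

```python
import numpy as np

# Hypothetical cost matrix in the spirit of Fig. 3:
# rows = true class, columns = predicted class.
# 0 = correct, 1 = neighbouring gas class, 3 = total misclassification;
# blockage (BL) costs 3 against any different class.
CLASSES = ["GA", "GB", "GC", "BL"]
COSTS = np.array([
    [0, 1, 3, 3],   # true GA
    [1, 0, 1, 3],   # true GB
    [3, 1, 0, 3],   # true GC
    [3, 3, 3, 0],   # true BL
])

def average_costs(y_true, y_pred):
    """Cumulated misclassification costs divided by the population,
    as used in the paper to compare classifiers on unseen test data."""
    return COSTS[y_true, y_pred].mean()

# Example: four test samples, one neighbouring-class confusion (cost 1)
y_true = np.array([0, 1, 2, 3])
y_pred = np.array([0, 0, 2, 3])
print(average_costs(y_true, y_pred))  # 0.25
```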
[Fig. 4 residue: spectra (0 to 200 Hz) for normal operating, cavitation and dry run, the preselection step, and the resulting decision tree.]
Fig. 4. Defining features from spectra and using them for growing a decision tree
An exhaustive search over n features would have to evaluate

$$Z = \sum_{i=1}^{n} \frac{n!}{i!\,(n-i)!} = 2^n - 1$$

different feature subsets, e.g. $2^{25}-1 \approx 3.4 \times 10^7$ subsets for the 25 features of the pump data.

Fig. 5. Evolution of the number of different feature subsets for an exhaustive search (n = number of features)
Wrappers [4, 8] are auxiliary algorithms wrapped around the basic machine learning algorithm to be employed for classification. They create several different subsets of features and compute a classifier for each of them. According to a criterion, usually the error rate on unseen data, the optimal subset is selected.
[Fig. 6 residue: D features (speed, 1st harmonic, ...) are reduced to candidate subsets of d1 ... dm < D features, each classified with See5 to yield Model1 ... Modelm, from which Modelopt is chosen.]
Fig. 6. Wrappers determine the optimal feature subset by running the machine learning algorithm with a sequence of subsets
Best Features (BF) [2]. Given a data set with N features, N classifiers are trained, each using one feature only. The classifiers are ranked according to the criterion and the best M < N are selected. Best Features takes into account neither dependencies between features (feature A is only effective in the presence of feature B) nor redundancy among the features, and is therefore a fast but very suboptimal approach.

Sequential Forward Selection (SFS) [2]. In its basic form the feature selection starts with an empty subset. In each round all possible feature subsets are constructed which contain the best feature combination found in the round before plus one extra feature. A classifier is trained and evaluated on each subset. This approach is capable of avoiding the selection of redundant features.
[Fig. 7: 2-stage sequential Forward-Algorithm with 5 features (A, B, C, D, E); the best subsets BE, BDE and ABDE are expanded round by round.]
In order to determine dependencies between features, several features have to be introduced simultaneously; the basic scheme is depicted in figure 7. With 25 features the first round requires the training of 25 classifiers in a single-stage SFS, 325 classifiers in a double-stage SFS and as many as 2625 classifiers in a triple-stage SFS.

Sequential Backward Selection (SBS) [2]. The SBS is initialised with the complete feature set. In each round all possible feature subsets are constructed from the current subset by omitting one feature. After each round the feature subset with the best criterion is retained. As in the case of the SFS, the single-stage version is capable of detecting redundant features, but higher stage orders have to be considered for dealing with dependent features.
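The following sketch illustrates the (multi-stage) SFS loop; SBS is the mirror image, starting from the full set and removing features. A scikit-learn decision tree is used as a stand-in for See5, and plain cross-validated accuracy replaces the paper's cost-weighted criterion; both substitutions are assumptions for illustration only.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def evaluate(X, y, subset):
    """Criterion for a feature subset: mean accuracy of a decision tree
    (stand-in for See5) over a 3-fold cross validation."""
    clf = DecisionTreeClassifier(min_samples_leaf=10)
    return cross_val_score(clf, X[:, list(subset)], y, cv=3).mean()

def sfs(X, y, stages=1):
    """Sequential forward selection. In each round the best subset found
    so far is extended by up to `stages` extra features (stages=2 gives
    the double-stage variant: 25 + 300 = 325 candidates in round 1 for
    25 features); the expansion with the best criterion is retained."""
    n, selected, history = X.shape[1], frozenset(), []
    while len(selected) < n:
        candidates = [selected | set(extra)
                      for k in range(1, stages + 1)
                      for extra in combinations(set(range(n)) - selected, k)]
        scores = [evaluate(X, y, c) for c in candidates]
        selected = frozenset(candidates[int(np.argmax(scores))])
        history.append((sorted(selected), max(scores)))
    return history  # the user picks the subset at the criterion optimum
```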
[Fig. 8 residue: rounds 0 to 2 of the backward selection starting from ABCDE; best feature set ACDE, feature removed: B.]
Fig. 8. 2-stage sequential Backward-Algorithm with 5 features (A, B, C, D, E)
Branch and Bound (BB) [2]. In contrast to the selection algorithms described so far, the BB yields an optimal subset (as an exhaustive search would), provided the following assumption is fulfilled: adding an extra feature y to a feature subset does not lead to a deterioration of the criterion J (monotonicity). For nested feature sets

$$\chi_0 \subset \chi_1 \subset \chi_2 \subset \dots \subset \chi_D \qquad (\text{Eq. 1})$$

the criterion must satisfy

$$J(\chi_0) \le J(\chi_1) \le J(\chi_2) \le \dots \le J(\chi_D). \qquad (\text{Eq. 2})$$
[Fig. 9 residue: search tree over the feature set {1,2,3,4,5,6} with intermediate nodes such as {1,3,5,6} and leaves {5,6}, {3,4}, {1,6}, {1,2}.]
Fig. 9. Branch and Bound-Tree for determining the optimal subset containing 2 out of 6 features
The tree is traversed from the right to the left side. The criterion c = J({1,2}) of the first leaf {1,2} is determined. Next the tree is searched for the first unprocessed node. Departing from this node, the rightmost unprocessed path is chosen. For each node the criterion is determined. If the criterion J({1,3,5,6}) < c = J({1,2}), the current path is abandoned. BB is ideal for any classifier which allows for recursive training when adding or removing a feature, e.g. linear discriminant analysis. In general the monotonicity assumption only holds when the error rate on training data is chosen as criterion, but it fails for any criterion based on unseen test data. In that case BB is no longer optimal.
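A generic recursive formulation of BB (rather than the right-to-left traversal described above) might look as follows. The criterion J is assumed to be monotone in the sense of Eq. 2, and the merit-sum criterion at the end is a purely illustrative toy.

```python
def branch_and_bound(features, d, J):
    """Find the subset of `d` features maximising a monotone criterion J
    (J(subset) <= J(superset), Eq. 2). Each tree level removes one more
    feature from its parent's set; leaves have exactly d features."""
    best = {"subset": None, "score": float("-inf")}

    def search(subset, start):
        if J(subset) <= best["score"]:
            return  # bound: by monotonicity no leaf below can beat the best
        if len(subset) == d:
            best["subset"], best["score"] = subset, J(subset)
            return
        # branch: remove one feature; `start` avoids duplicate subtrees
        for i in range(start, len(subset)):
            search(subset[:i] + subset[i + 1:], i)

    search(tuple(features), 0)
    return best["subset"]

# Toy criterion: sum of per-feature "merits" (monotone by construction)
merits = {"f1": 3.0, "f2": 1.0, "f3": 2.5, "f4": 0.2}
J = lambda s: sum(merits[f] for f in s)
print(branch_and_bound(list(merits), 2, J))  # ('f1', 'f3')
```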
Comparison of Wrappers
As BF and BB were considered unsuitable for the pump data, only SBS and SFS have been examined. The results depicted in figure 10 show a strong influence of feature selection on the quality of the resulting classifier. Whereas using all 25 features (i.e. directly employing the data without feature selection) yields average misclassification costs of 0.62, the optimum for a single-stage SFS occurs at 17 features with costs of 0.50, but a combination of only 6 features leads to costs of 0.51. A double-stage SFS further reduces the misclassification costs to 0.38. Both SBS trials proved worse than their SFS counterparts, although SBS theoretically should retain dependent features due to its deselection strategy.
[Fig. 10 residue: average costs on test data (0.3 to 1.5) over the number of features (up to 25) for 1- and 2-stage sequential forward and backward selection.]
Fig. 10. Comparison of SBS and SFS with pump data (30,000 data sets; 25 features; 3-fold cross-validation averaged in each round) using See5 with a cost matrix (weighted misclassification rate)
A ranking of the features (figure 11) showed the varying effectiveness of the features depending on the selection method. The feature c1 was selected in the first round of the double-stage SFS, but only in the 21st round of the single-stage SFS. What is more, the second feature selected alongside c1 in the double-stage selection was n, which was selected in the 2nd round of the single-stage SFS. The chart shows a strong interdependency between the features and the non-linear behaviour of the inducer.
[Fig. 11 residue: relevance ranking (1 to 25) of the features vR, c1, c4, s8, v8a, v1a, c2, s2, s0, v2a, s1, v4p, v4a, c8, veff, s4, v8p.]
Fig. 11. Ranking of the 25 features in the 4 selections shown in figure 10 (1 = most relevant)
A further obstacle is the influence of the two tuning parameters implemented in See5 on the actual feature selection. Not only is the value of the criterion changed, but the location of the optimum and the best feature subset are altered as well.
[Fig. 12 residue: average costs on test data (0.3 to 1.5) over the number of features for the parameter settings Cases = 10, Pruning = 0.1; Cases = 10, Pruning = 0.01; Cases = 2, Pruning = 1.]
Fig. 12. Influence of the parameters in See5 on a double-stage SFS (30,000 data sets; 25 features; 3-fold cross-validation averaged in each round). The curve for Pruning = 0.01 was started with a triple-stage SFS in the 1st round and then continued as a double-stage SFS
Genetic Wrapper
The comparison of wrappers showed a distinct advantage of multistage wrappers due to their ability to select dependent features. Even if the selection is stopped after reaching the optimum, the computing effort remains high because of the large number of combinations at the beginning. A further disadvantage of the sequential selection is its rigid strategy, which does not allow for the removal of features which, in the course of the selection process, have become obsolete. An alternative approach consists of a genetic selection which explores the feature space from a determined or from a random starting point. Genetic algorithms use two parallel strategies for optimising:
- evolution (small and steady development of the population and selection of the best)
- mutation (spontaneous creation of new specimens)
The flow charts of the genetic selection are shown in figures 13 & 14. The algorithm uses a result stack for the subsets already evaluated and a working stack for new subsets. The subsets in the working stack are sorted with respect to their prognosis, which is the evaluation criterion reached by their parent. In each round the first A subsets are evaluated.
Fig. 13 (flow chart of the main loop):

    randomly create N subsets with M features and store them in the working stack
    repeat until stop condition or working stack is empty:
        for the A subsets having the best prognosis:
            remove from working stack
            train classifier and evaluate criterion
            store subset and criterion in result stack
            expand subsets
        Mopt = number of features of the best subset
        per B subsets in the working stack:
            create 1 mutation with Mopt features and store it on top of the working stack
The evolution is modelled by single-stage SFS and SBS. After training a classifier from a subset with M features, all possible sequential combinations with M+1 and M-1 features are composed. All subsets which have not been evaluated yet are stored in the working stack.
Fig. 14 (flow chart of the expansion):

    for each subset to be expanded:
        perform a single-stage sequential forward selection step
        perform a single-stage sequential backward selection step
        set prognosis = criterion(parent subset) and add the new subsets to a temporary list
    for each subset in the temporary list:
        if the subset is in the result stack: discard it
        else if the subset is in the working stack: prognosis = MAX(prognosis(working stack), prognosis(temporary list))
        else: store the subset in the working stack
The algorithm will fully explore the feature space if no stopping criterion is introduced. If memory is scarce, the size of the working stack can be limited by deleting the subsets having the worst prognosis. When the selection is attracted by a local minimum, the area will be fully explored before the search is continued outside, unless a mutation leads to a better subset. The double-stage SFS, applied to the pump data, determined a subset of 10 features with average costs on test data of 0.38. The least computing effort (stopping immediately after a minimum) would have been the training of 1386 decision trees.
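A compact sketch of the stack-based genetic selection of figures 13 and 14 is given below. `evaluate` is a user-supplied wrapper that trains a classifier on a subset and returns its criterion (taken here as higher-is-better), and the default parameters mirror those of figure 15. The mutation handling (prognosis set to infinity so the mutation lands on top of the working stack) is one possible reading of the flow chart.

```python
import random

def genetic_selection(n_features, evaluate, N=10, M=10, A=100, B=0.02,
                      n_rounds=20):
    """Stack-based genetic wrapper (Figs. 13 & 14). `evaluate(subset)`
    trains a classifier on the subset and returns its criterion.
    Returns the result stack mapping subset -> criterion."""
    all_feats = set(range(n_features))
    # working stack maps subset -> prognosis (criterion of its parent)
    working = {frozenset(random.sample(sorted(all_feats), M)): 0.0
               for _ in range(N)}
    results = {}
    for _ in range(n_rounds):
        if not working:
            break
        # evaluate the A subsets with the best prognosis
        batch = sorted(working, key=working.get, reverse=True)[:A]
        for s in batch:
            del working[s]
            results[s] = evaluate(s)
            # evolution: expand by single-stage SFS and SBS steps
            children = [s | {f} for f in all_feats - s] + \
                       [s - {f} for f in s if len(s) > 1]
            for c in children:
                if c not in results:
                    working[c] = max(working.get(c, float("-inf")),
                                     results[s])
        # mutation: one random subset per B share of the working stack,
        # sized like the best subset found so far
        m_opt = len(max(results, key=results.get))
        for _ in range(int(B * len(working))):
            mut = frozenset(random.sample(sorted(all_feats), m_opt))
            if mut not in results:
                working[mut] = float("inf")  # "top of the working stack"
    return results

# Usage: results = genetic_selection(25, my_evaluate)
#        best = max(results, key=results.get)
```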
[Fig. 15 residue: average costs on test data over the selection progress.]
Fig. 15. Genetic feature selection with N = 10 subsets / M = 10 features / A = 100 subsets / B = 2%
After training 610 decision trees, the genetic selection (figure 15) comes up with a subset of 9 features at costs of 0.38. Within the 1110 subsets evaluated (round 12) there are 5 subsets with 9 or 10 features having costs of 0.38 each. The user can choose between different subsets and can therefore minimise other criteria imposed on the features, such as the number or costs of the sensors required.
[Fig. 16 residue: average costs (0.3 to 1.0) over rounds 1 to 18 for three initial subsets.]
Fig. 16. Genetic feature selection with N = 10 subsets / M = 4 features / A = 100 subsets / B = 2%
Due to the evolutionary strategy, the randomly selected initial features play a minor role for the convergence and the quality of the results. The computing time rises if the number of features in the initial selection is far off the optimum, as shown in figure 16. For an unknown dataset it may be worthwhile to start with a single-stage SFS and to set the initial value M to the number of features at the minimum of the criterion.
[Fig. 17 residue: two points 1 and 2 in the (feature x, feature y) plane.]
Fig. 17. Metrics for the distance between 2 points 1 and 2 in the feature space
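Written out for two points $x^{(1)}$ and $x^{(2)}$ in a D-dimensional feature space, the three metrics of figure 17 are:

$$d_{\mathrm{Euclid}} = \sqrt{\sum_{i=1}^{D} \bigl(x_i^{(1)} - x_i^{(2)}\bigr)^2}, \qquad d_{\mathrm{City\text{-}Block}} = \sum_{i=1}^{D} \bigl|x_i^{(1)} - x_i^{(2)}\bigr|, \qquad d_{\mathrm{Tchebycheff}} = \max_{i} \bigl|x_i^{(1)} - x_i^{(2)}\bigr|$$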
In general the Euclidean metric is preferred, but City-Block (sum of components) and Tchebycheff (largest component) deliver different kinds of information about the data at modest computing effort. The estimation of a probabilistic distance requires knowledge of the class density function. In general a normal distribution is assumed, but particularly in the case of multi-modal class densities this assumption is ill-conceived.

Analysis of decision trees

Occasionally statistics based on decision trees are used as a filter. In [1] the features are ranked according to their frequency of occurrence in the decision tree.
[Fig. 18 residue: two scatter plots in the (feature 1, feature 2) plane, one with orthogonal and one with oblique borders between Class 1 and Class 2, and the corresponding decision trees with tests such as MM1 > x1 and MM2 > y1.]
Fig. 18. Orthogonal and oblique class borders and corresponding decision trees
As shown in figure 18, this approach favours features which require oblique borders in the feature space for class separation. Some weighting rules have been considered, but none performed satisfyingly on the pump data:
- weighting of the feature frequency with the number of data sets of the node
- weighting of the feature frequency with the number of correctly classified data sets of the node
- weighting of the feature frequency with the node level in the tree

Comparison of filter algorithms

The filtering of features for classification with See5 using a cost matrix is challenging because, due to the cost matrix in See5, the optimisation goal differs between filter and classifier.
Several filter criteria, applied in a single-stage SFS, are plotted in figure 19. All criteria evolve monotonically; the determination of the most suitable feature subset is therefore arbitrary and has to be based on the curvature or a similar criterion.
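One common criterion of this family is the scatter-matrix ratio trace(Sw^-1 Sb); the assumption that this corresponds to the J3 curve of figure 19 follows the usual Tooldiag [13] conventions and is not confirmed by the paper. A minimal sketch:

```python
import numpy as np

def j3_criterion(X, y):
    """Class-separability criterion trace(Sw^-1 Sb): ratio of
    between-class to within-class scatter; larger = better separated.
    (Assumed to correspond to the J3 curve in Fig. 19.)"""
    classes = np.unique(y)
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    Sw += 1e-9 * np.eye(d)  # ridge against singular scatter matrices
    return np.trace(np.linalg.solve(Sw, Sb))
```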
[Fig. 19 residue: relative criterion (20% to 100%) for the Tschebychev, City-Block, Euklid and J3 criteria over an increasing number of features.]
Fig. 19. Criteria of different filters as in [13] for an increasing number of features (SBS), related to 100% for 20 features of the pump problem. The dot indicates the feature subset actually chosen for testing.
In addition, a Bayes classifier and a 1-nearest-neighbour classifier have been combined with a single-stage SFS. The best feature subset determined with each filter was subjected to a parameter study in See5 (figure 20). Most filters lead to poorly performing feature subsets. Only the Bhattacharyya distance achieves results similar to those of the single-stage SFS wrapper with See5. A general suitability of probabilistic filters is refuted by the Mahalanobis distance.
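For reference, the Bhattacharyya distance between two classes has a standard closed form under the Gaussian assumption discussed above; the sketch below uses that textbook formula, since the paper does not spell out its exact implementation.

```python
import numpy as np

def bhattacharyya(X1, X2):
    """Bhattacharyya distance between two classes, assuming normal
    class densities (the probabilistic-distance filter above)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.atleast_2d(np.cov(X1, rowvar=False))
    S2 = np.atleast_2d(np.cov(X2, rowvar=False))
    S = (S1 + S2) / 2          # pooled covariance
    dm = m1 - m2
    term1 = dm @ np.linalg.solve(S, dm) / 8
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term1 + term2
```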
[Fig. 20 residue: average error costs for test data (0.40 to 0.90) for the See5 wrapper (blackbox) and the filters Eubafes 0.5 / 0.8 / 0.9 / 1.0, Bayes, nearest neighbour, Bhattacharyya, Mahalanobis, Euklid, City-Block, Tschebychev, J2 and J3, grouped into other classifiers, probabilistic distance and class distance; each bar is annotated with the number of features and the best parameter values m and c.]
Fig. 20. Average costs for test data (mean value of 3 cross validations) obtained with the best See5 decision tree after a parameter optimisation with 4 values for Cases (m) and 7 values for Pruning (c).
Conclusion
Several feature selection algorithms have been compared on a real-world machine learning problem. Due to its flexibility for modelling a complex relationship among classes with cost matrices, See5 has been chosen as the induction algorithm. Sequential forward and backward selection has been studied in depth. Using multistage approaches, very effective feature subsets have been identified. A new genetic wrapper algorithm reduces the computing time while increasing the number of nearly optimal subsets determined. As expected, filter algorithms perform worse than wrappers because of the particular optimisation goal introduced through the cost matrix. The application of filters only seems advisable with slowly converging classifiers such as neural networks, when computing time limits the usage of wrappers.
References
1. Cardie, C.: Using decision trees to improve case-based learning; 10th Conf. on Machine Learning; Wien; 1993
2. Devijver, P.; Kittler, J.: Pattern Recognition: A Statistical Approach; Prentice Hall; 1982
3. Duda, R.; Hart, P.: Pattern Classification and Scene Analysis; John Wiley & Sons; 1973
4. Kohavi, R.: Wrappers for Performance Enhancement and Oblivious Decision Graphs; Dissertation; Stanford University; 1995
5. Lim, T.-S.; Loh, W.-Y.; Shih, Y.-S.: A comparison of prediction accuracy, complexity, and training time of 33 old and new classification algorithms; Machine Learning; preprint www.recursivepartitioning.com/mach1317.pdf; 1999
6. Martens et al.: An initial comparison of a fuzzy neural classifier and a decision tree based classifier; Expert Systems with Applications (Pergamon) 15; 1998
7. Michie, D.; Spiegelhalter, D.; Taylor, C.: Machine Learning, Neural and Statistical Classification; Ellis Horwood; 1994
8. Perner, P.; Apte, C.: Empirical Evaluation of Feature Subset Selection Based on a Real-World Data Set; in: D. A. Zighed, J. Komorowski, J. Zytkow (eds.), Principles of Data Mining and Knowledge Discovery; Springer; 2000
9. Quinlan, J. R.: Comparing connectionist and symbolic learning methods; in: Computational Learning Theory and Natural Learning Systems: Constraints and Prospects, ed. R. Rivest; MIT Press; 1994
10. Scherf, M.; Brauer, W.: Feature Selection by Means of a Feature Weighting Approach; Technical Report No. FKI-221-97; Forschungsberichte Künstliche Intelligenz, Institut für Informatik, TU München; 1997
11. See5 / C5; Release 1.10; Quinlan, J. R.; www.rulequest.com; 1999
12. Spath, D.; Hellmann, D. H.: Automatisches Lernen zur Störungsfrüherkennung bei ausfallkritischen Anlageelementen; Abschlussbericht zum DFG-Forschungsprojekt Sp 448/7-3 und He 2585/1-3; 1999
13. Rauber, T.: Tooldiag - Pattern Recognition Toolbox; Version 2.1; www.inf.ufes.br/~thomas/home/tooldiag; 1994