Evolutionary Computation For Feature Selection and Feature Construction

Bing Xue and Mengjie Zhang
Evolutionary Computation Research Group
Victoria University of Wellington, New Zealand
Bing.Xue@ecs.vuw.ac.nz and Mengjie.Zhang@ecs.vuw.ac.nz
http://gecco-2022.sigevo.org/

• Bing Xue is currently a Professor in Computer Science and Deputy Head of School in the School of Engineering and Computer Science at VUW. She has over 300 papers published in fully refereed international journals and conferences, and her research focuses mainly on evolutionary computation, feature selection, feature construction, machine learning, classification, symbolic regression, evolving deep neural networks, image analysis, transfer learning, and multi-objective machine learning. Dr Xue was the founding Chair of the IEEE Computational Intelligence Society (CIS) Task Force on Evolutionary Feature Selection and Construction. She co-chaired Women@GECCO in 2018 and co-chaired the Evolutionary Machine Learning and NeuroEvolution Tracks in GECCO 2019 to 2022. She also serves as an associate editor of several international journals, such as IEEE Transactions on Evolutionary Computation and ACM Transactions on Evolutionary Learning and Optimization.

• Mengjie Zhang is a Fellow of the Royal Society of NZ, a Fellow of IEEE, a Fellow of Engineering NZ, and an IEEE Distinguished Lecturer. He is Professor of Computer Science at the School of Engineering and Computer Science, Victoria University of Wellington, New Zealand. His research is mainly focused on evolutionary computation, particularly genetic programming, evolutionary deep and transfer learning, image analysis, feature selection and reduction, and evolutionary scheduling and combinatorial optimisation. He has published over 700 academic papers in refereed international journals and conferences. He is currently an associate editor for over ten international journals (e.g. IEEE TEVC, ECJ, ACM TELO, IEEE TCYB, and IEEE TETCI). He has been serving as a steering committee member and a program committee member for over eighty international conferences. He is a reviewer of research grants for many countries/regions (e.g. Canada, Portugal, Spain, Germany, UK, the Netherlands, Austria, Mexico, Czech Republic, Italy, HK, Australia, NZ).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
GECCO '22 Companion, July 9–13, 2022, Boston, MA, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9268-6/22/07…$15.00
https://doi.org/10.1145/3520304.3533659
GECCO, Denver, Colorado, USA. 20-24 July 2016
What is a Good feature?
• The measure of goodness is subjective with respect to the type of classifier. The features in this figure, x1 and x2, are good for a linear classifier.
• The same set of features are not good for a decision tree classifier that is not able to transform its input space.

[Fig. 9. (a) Decision tree induced by the J48 (C4.5) algorithm using the observations in the previous figure. (b) Class boundary generalized by the decision tree. The decision tree inducer has failed to generalize the concept of a line.]

Why Use GP for Feature Construction?
• GP is flexible in making mathematical models.
• There isn't much structural (topological) information in the search space of possible functions, so a stochastic search approach (such as evolutionary computation) is suitable.
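The point above can be made concrete with a small self-contained sketch (not from the slides; the data and the threshold search are hypothetical): when the true concept is the diagonal line x1 = x2, no single axis-aligned split on x1 or x2 separates the classes, while the constructed feature x1 − x2 is separable by one threshold.

```python
import random

# Hypothetical toy data: the true concept is the diagonal line x1 = x2,
# so the label depends on both features jointly.
random.seed(42)
points = [(random.random(), random.random()) for _ in range(80)]
labels = [1 if x1 > x2 else 0 for (x1, x2) in points]

def best_stump_accuracy(values, labels):
    """Best accuracy achievable by thresholding one feature -- roughly what
    a single axis-aligned decision-tree split can do."""
    best = 0.0
    for t in set(values):
        for sign in (1, -1):
            pred = [1 if sign * v > sign * t else 0 for v in values]
            acc = sum(p == y for p, y in zip(pred, labels)) / len(labels)
            best = max(best, acc)
    return best

acc_x1 = best_stump_accuracy([p[0] for p in points], labels)          # x1 alone
acc_x2 = best_stump_accuracy([p[1] for p in points], labels)          # x2 alone
acc_new = best_stump_accuracy([p[0] - p[1] for p in points], labels)  # constructed

print(acc_x1, acc_x2, acc_new)  # acc_new reaches 1.0; x1 or x2 alone cannot
```

A decision tree would need many splits to approximate the diagonal boundary, while the single constructed feature makes the problem trivial for it.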
[Figure: 2018 Impact Factor of IEEE Transactions on Evolutionary Computation (TEVC); algorithms shown: NSGA-II, NSGA-III, MOEA/D]
[Figure: scikit-learn machine learning map — https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html]
• The quality of input features can drastically affect the learning performance.
• Even if the quality of the original features is good, transformations might be required to make them usable for certain types of classifiers.
• Feature construction does not add to the cost of extracting (measuring) original features; it only carries computational cost.
• In some cases, feature construction can lead to dimensionality reduction or implicit feature selection.

• Reduce the dimensionality (No. of features)
• Improve the (classification) performance
• Simplify the learnt model
• Speed up the processing time
• Help visualisation
• Improve interpretability and explainability
• Reduce the cost, e.g. save memory
• and ?
Challenges in FS and FC
• Large search space: 2^n possible feature subsets
  - 1990: n < 20
  - 1998: n <= 50
  - 2007: n ≈ 100s
  - Now: 1000s, 1 000 000s
• Feature interactions
• Good representation of solutions (subsets) is very important. A search algorithm can perform better by using the structural information available to subsets of features: the lattice of subsets.

Forward Selection:
• Start with the empty set.
• At each step examine all individual features by tentatively adding them to the current set of selected features. Choose the feature that yields the highest improvement and add it to the set of selected features.
• Repeat the above step until the performance can no longer be improved.
How does this algorithm compare to exhaustive search?
How would you use the lattice to perform a search?

General FS/FC System
[Diagram: Original Features → Evolutionary Feature Selection/Construction → Constructed/Selected Feature(s) → Learnt Classifier; Embedded Method → Selected Features]
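The forward-selection loop above can be sketched as follows. Here evaluate() is a hypothetical stand-in for a real wrapper evaluation (e.g. cross-validated classifier accuracy on the selected features); the per-feature scores are invented for illustration only.

```python
# Minimal sketch of sequential forward selection (hypothetical scoring).
def evaluate(subset):
    # Hypothetical: features 0 and 1 are redundant with each other,
    # feature 2 adds independent value, the rest add nothing.
    score = 0.0
    if 0 in subset or 1 in subset:
        score += 0.5
    if 2 in subset:
        score += 0.3
    return score

def forward_selection(n_features, evaluate):
    selected, best_score = set(), evaluate(set())
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Tentatively add each remaining feature; keep the best improvement.
        score, feat = max((evaluate(selected | {f}), f) for f in candidates)
        if score <= best_score:   # no further improvement: stop
            break
        selected.add(feat)
        best_score = score
    return selected, best_score

subset, score = forward_selection(5, evaluate)
print(subset, score)  # {1, 2} 0.8
```

Note the contrast with exhaustive search: forward selection needs at most O(n^2) wrapper evaluations rather than 2^n, at the price of greediness — it can miss features that are only useful in interaction with others.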
Bing Xue, Mengjie Zhang and Will Browne. "A Comprehensive Comparison on Feature Selection Approaches to Classification". International Journal of Computational Intelligence and Applications (IJCIA). Vol. 14, No. 2. 2015. pp. 1550008 (1-23).
Evolutionary Feature Selection

EC for Feature Selection
• EC Paradigms
• Evaluation
• Number of Objectives

[Taxonomy from the survey: EC Paradigms — Evolutionary Algorithms, Swarm Intelligence, Others; Evaluation — Filter Approaches, Wrapper Approaches, Combined Approaches, Embedded/Sparsity-based; Number of Objectives — Single Objective, Multi-objective]

• Over 25 years ago: first EC techniques
• Filter, Wrapper, Single Objective, Multi-objective
• Representation
  - Binary string: 1 0 1 1 0 0 1 0 0 1 1 0 0
• Search mechanisms
  - Genetic operators
• Multi-objective feature selection
• Scalability issue

[Flowchart: Population Initialisation → Evaluate fitness (Fitness/Objective function) → Stop? — No: Generate new population (Search/updating mechanism) and repeat; Yes: Return the best solution(s)]

R. Leardi, R. Boggia, and M. Terrile, "Genetic algorithms as a strategy for feature selection," Journal of Chemometrics, vol. 6, no. 5, pp. 267–281, 1992.
H. Ishibuchi and T. Nakashima, "Multi-objective pattern and feature selection by a genetic algorithm," in GECCO, pp. 1069–1076, 2000.
Z. Zhu, Y.-S. Ong, and M. Dash, "Markov blanket-embedded genetic algorithm for gene selection," Pattern Recognition, vol. 40, no. 11, pp. 3236–3248, 2007.
W. Sheng, X. Liu, and M. Fairhurst, "A niching memetic algorithm for simultaneous clustering and feature selection," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 7, pp. 868–879, 2008.
B. Xue, M. Zhang, W. N. Browne, and X. Yao, "A survey on evolutionary computation approaches to feature selection," IEEE Transactions on Evolutionary Computation, vol. 20, no. 4, pp. 606–626, 2016.
Z. Tao, H. Lu, W. Wang, and Y. Xia, "GA-SVM based feature selection and parameter optimization in hospitalization expense modeling," Applied Soft Computing, vol. 75, pp. 323–332, 2019.
S. Liu, H. Wang, W. Peng, and W. Yao, "A surrogate-assisted evolutionary feature selection algorithm with parallel random grouping for high-dimensional classification," IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2022.3149601, 2022.
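The binary-string representation and the evolutionary loop in the flowchart can be sketched as a minimal GA. The fitness function here is a hypothetical stand-in (rewarding an assumed set of relevant features and penalising subset size); in a real wrapper it would be the accuracy of a classifier trained on the selected features.

```python
import random

random.seed(1)
N_FEATURES = 13          # length of the bitstring, as in the slide's example
RELEVANT = {0, 2, 3, 9}  # hypothetical "truly useful" features

def fitness(bits):
    """Stand-in wrapper fitness: coverage of the (assumed) relevant features
    minus a small penalty per selected feature."""
    selected = {i for i, b in enumerate(bits) if b}
    return len(selected & RELEVANT) / len(RELEVANT) - 0.02 * len(selected)

def evolve(pop_size=30, generations=40, p_mut=0.05):
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        next_pop = pop[:2]                       # elitism: keep the two best
        while len(next_pop) < pop_size:
            p1, p2 = random.sample(pop[:10], 2)  # truncation selection
            cut = random.randrange(1, N_FEATURES)
            child = p1[:cut] + p2[cut:]          # one-point crossover
            child = [1 - b if random.random() < p_mut else b
                     for b in child]             # bit-flip mutation
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))
```

Each 1-bit selects the corresponding feature, so the GA searches the 2^n lattice of subsets directly; elitism makes the best fitness monotonically non-decreasing over generations.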
GP for Feature Selection
• Computationally expensive
• Scalability
[Figure: an example GP tree whose terminals include Feature 4 and Feature 5]

M. G. Smith and L. Bull, "Genetic programming with a genetic algorithm for feature construction and selection," Genetic Programming and Evolvable Machines, vol. 6, no. 3, pp. 265–281, 2005.
L. Jung-Yi, K. Hao-Ren, C. Been-Chian, and Y. Wei-Pang, "Classifier design with feature selection and feature extraction using layered genetic programming," Expert Systems with Applications, vol. 34, no. 2, pp. 1384–1393, 2008.
A. Purohit, N. Chaudhari, and A. Tiwari, "Construction of classifier with feature selection based on genetic programming," in IEEE Congress on Evolutionary Computation (CEC), pp. 1–5, 2010.
B. Tran, B. Xue, and M. Zhang, "Class dependent multiple feature construction using genetic programming for high-dimensional data," in Proceedings of the 30th Australasian Joint Conference on Artificial Intelligence (AI 2017), LNCS Vol. 10400, Springer, Melbourne, Australia, August 19-20th, 2017, pp. 182–194.
B. Tran, B. Xue, and M. Zhang, "Genetic programming for multiple-feature construction on high-dimensional classification," Pattern Recognition, vol. 93, pp. 404–417, 2019.
K. Nag and N. R. Pal, "Feature extraction and selection for parsimonious classifiers with multiobjective genetic programming," IEEE Transactions on Evolutionary Computation, vol. 24, no. 3, pp. 454–466, 2019.
A. Lensen, B. Xue, and M. Zhang, "Genetic programming for manifold learning: Preserving local topology," IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2021.3106672, 2021.
B. Xue, M. Zhang, W. N. Browne, and X. Yao, "A survey on evolutionary computation approaches to feature selection," IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2015.2504420, published online 30 Nov 2015.

PSO for Feature Selection
• Computationally expensive

E. K. Tang, P. Suganthan, and X. Yao, "Feature selection for microarray data using least squares SVM and particle swarm optimization," in IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pp. 1–8, 2005.
L. Y. Chuang, H. W. Chang, C. J. Tu, and C. H. Yang, "Improved binary PSO for feature selection using gene expression data," Computational Biology and Chemistry, vol. 32, pp. 29–38, 2008.
C. L. Huang and J. F. Dun, "A distributed PSO-SVM hybrid system with feature selection and parameter optimization," Applied Soft Computing, vol. 8, pp. 1381–1391, 2008.
B. Xue, M. Zhang, and W. N. Browne, "Multi-objective particle swarm optimisation (PSO) for feature selection," in Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation (GECCO), pp. 81–88, ACM, 2012.
F. Hafiz et al., "A two-dimensional (2-D) learning framework for particle swarm based feature selection," Pattern Recognition, vol. 76, pp. 416–433, 2018.
B. Tran, B. Xue, and M. Zhang, "A new representation in PSO for discretisation-based feature selection," IEEE Transactions on Cybernetics, vol. 48, no. 6, pp. 1733–1746, 2018.
Y. Xue, B. Xue, and M. Zhang, "Self-adaptive particle swarm optimization for large-scale feature selection in classification," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 13, no. 5, pp. 1–27, 2019.
B. Tran, B. Xue, and M. Zhang, "Adaptive multi-subswarm optimisation for feature selection on high-dimensional classification," in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 481–489, 2019.
B. H. Nguyen, B. Xue, M. Zhang, and F. Zhou, "A survey on swarm intelligence approaches to feature selection in data mining," Swarm and Evolutionary Computation, vol. 54, pp. 100663:1–30, May 2020.
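A minimal sketch of binary PSO for feature selection, using the standard sigmoid transfer function to turn velocities into bit-flip probabilities. The fitness function is again a hypothetical stand-in for a wrapper evaluation, and all parameter values are illustrative.

```python
import math
import random

random.seed(3)
D = 20           # number of features (hypothetical)
GOOD = {1, 4, 7} # hypothetical informative features

def fitness(bits):
    # Stand-in for wrapper accuracy: reward informative features, small subsets.
    sel = {i for i, b in enumerate(bits) if b}
    return len(sel & GOOD) / len(GOOD) - 0.01 * len(sel)

def binary_pso(n_particles=20, iters=60, w=0.7, c1=1.5, c2=1.5):
    pos = [[random.randint(0, 1) for _ in range(D)] for _ in range(n_particles)]
    vel = [[0.0] * D for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    gbest = max(pos, key=fitness)[:]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(D):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Sigmoid transfer: velocity -> probability of the bit being 1.
                s = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if random.random() < s else 0
            if fitness(pos[i]) > fitness(pbest[i]):
                pbest[i] = pos[i][:]
                if fitness(pbest[i]) > fitness(gbest):
                    gbest = pbest[i][:]
    return gbest

best = binary_pso()
print(fitness(best))
```

The inner loop is where the "computationally expensive" point bites: every position update triggers a wrapper fitness evaluation, i.e. a full classifier training run in practice.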
ACO for Feature Selection
S. Kashef and H. Nezamabadi-pour, "An advanced ACO algorithm for feature subset selection," Neurocomputing, 2014.
C.-K. Zhang and H. Hu, "Feature selection using the hybrid of ant colony optimization and mutual information for the forecaster," in International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1728–1732, 2005.
L. Ke, Z. Feng, and Z. Ren, "An efficient ant colony optimization approach to attribute reduction in rough set theory," Pattern Recognition Letters, vol. 29, no. 9, pp. 1351–1357, 2008.
P. Moradi and M. Rostami, "Integration of graph clustering with ant colony optimization for feature selection," Knowledge-Based Systems, vol. 84, pp. 144–161, 2015.
S. Tabakhi and P. Moradi, "Relevance–redundancy feature selection based on ant colony optimization," Pattern Recognition, vol. 48, no. 9, pp. 2798–2811, 2015.
V. J. Sara, S. Belina, and K. Kalaiselvi, "Ant colony optimization (ACO) based feature selection and extreme learning machine (ELM) for chronic kidney disease detection," International Journal of Advanced Studies of Scientific Research, vol. 4, no. 1, 2019.
M. Paniri, M. B. Dowlatshahi, and H. Nezamabadi-pour, "MLACO: A multi-label feature selection algorithm based on ant colony optimization," Knowledge-Based Systems, vol. 192, p. 105285, 2020.
B. H. Nguyen, B. Xue, M. Zhang, and F. Zhou, "A survey on swarm intelligence approaches to feature selection in data mining," Swarm and Evolutionary Computation, vol. 54, pp. 100663:1–30, May 2020.

DE and Memetic Algorithms for Feature Selection
I.-S. Oh, J.-S. Lee, and B.-R. Moon, "Hybrid genetic algorithms for feature selection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1424–1437, 2004.
Z. Zhu, S. Jia, and Z. Ji, "Towards a memetic feature selection paradigm [application notes]," IEEE Computational Intelligence Magazine, vol. 5, no. 2, pp. 41–53, 2010.
Y. Wen and H. Xu, "A cooperative coevolution-based Pittsburgh learning classifier system embedded with memetic feature selection," in IEEE Congress on Evolutionary Computation, pp. 2415–2422, 2011.
Z. Li, Z. Shang, B. Qu, and J. Liang, "Feature selection based on manifold-learning with dynamic constraint handling differential evolution," in IEEE Congress on Evolutionary Computation (CEC), pp. 332–337, 2014.
H. B. Nguyen, B. Xue, H. Ishibuchi, P. Andreae, and M. Zhang, "Multiple reference points MOEA/D for feature selection," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference (GECCO 2017) Companion, ACM Press, Berlin, Germany, 15–19 July 2017, pp. 157–158.
E. Hancer, B. Xue, and M. Zhang, "Differential evolution for filter feature selection based on information theory and feature ranking," Knowledge-Based Systems, vol. 140, pp. 103–119, 2018.
Y. Zhang, D. Gong, X. Gao, T. Tian, and X. Sun, "Binary differential evolution with self-learning for multi-objective feature selection," Information Sciences, vol. 507, pp. 67–85, 2020.
Feature Selection
• Network/web service
- Web service composition and development, network security, and email
spam detection.
• Business and financial problems
- Financial crisis, credit card issuing in bank systems, and customer churn
prediction.
• Others
- Power system optimisation, weed recognition in agriculture, melting point
prediction in chemistry, and weather prediction.
Xue, Bing, Mengjie Zhang, Will N. Browne, and Xin Yao. "A survey on evolutionary computation approaches to feature selection." IEEE Transactions on Evolutionary Computation 20, no. 4 (2016): 606–626.
Bach Hoai Nguyen, Bing Xue, Mengjie Zhang. "A survey on swarm intelligence approaches to feature selection in data mining", Swarm and Evolutionary Computation, vol. 54, pp. 100663:1–30, May 2020.
PSO for FS: initialisation and updating

The evolutionary training process of a PSO-based wrapper feature selection algorithm is shown in Figure 3.2. The key step is the goodness/fitness evaluation.
Initialization in Multi-objective FS / Duplication Analysis-Based MOFS

• IGD and HV results on test data (Tables 3 and 4) show that the segmented initialization mechanism (SIOM) and the offspring modification mechanism each contributed positively to the success of the new approach, while combining them together contributed the most: the SIOM plug-in versions of NSGA-II, MOEA/HD, and HypE perform better than their corresponding benchmark MOEAs on most of the test datasets.
• Compared algorithms: NSGA-II (a classic dominance-based MOEA), MOEA/HD (a hierarchical-decomposition-based MOEA), and HypE (a Hypervolume-based MOEA), each also tested with the SIOM plug-in.
• [Table 2: 18 datasets in ascending order of feature number, from Climate (18 features, 540 instances) to Leukemia (7070 features, 72 instances); the largest instance number is 2600 (Madelon). The datasets are comprehensive and complicated enough to challenge the efficiency and effectiveness of each comparison algorithm.]
• Experimental setup: the population size N is adaptively set to the number of features but capped at 200, to balance diversity and efficiency; the maximum number of objective function evaluations is set to 100 times the population size, capping the number of evolutionary generations at about 100. Each dataset is randomly divided into a training subset (~70%) and a test subset (~30%) by a stratified split, fixed for a specific feature selection problem.
• Performance metrics: Hypervolume (HV) and Inverted Generational Distance (IGD), measuring both convergence and diversity. The objective values obtained by each algorithm are normalized by the ideal and nadir points to the scale [0, 1] in each objective direction. The first nondominated front obtained on the training data is used to measure performance on the test data; some nondominated solutions on the training data can become dominated on the test data, so the final nondominated solutions on the test data are further selected from the previously sampled solutions.
• Duplication analysis (Algorithm 2, Dup_Analysis): solutions with the same objective vector are compared by their Manhattan distance in the decision space, normalized as Diss = (Diss/2)/t, where t is the number of selected features. They are separated into two groups according to their dissimilarity degrees: 1) remote solutions with dissimilarity higher than or equal to the removal threshold, and 2) clustered solutions with dissimilarity lower than the removal threshold. All remote solutions and only one clustered solution are reserved.
• The removal threshold δ is set adaptively: δ = 0.8 − 0.6 (t − 1)/(D − 1), where D is the total number of features, so δ varies from 0.2 to 0.8 as t changes from D to 1, making the removal criterion more inclusive. A larger δ makes duplicated solutions more likely to be removed; because δ is negatively correlated with t, a solution with a smaller number of selected features requires a higher dissimilarity to be reserved.
• Offspring reproduction (Algorithm 4, Reproduce): for each of the N offspring, with probability 0.8 a parent is randomly selected from the T nearest neighbours of the current solution (T = max(4, N × 0.2), measured by Euclidean distance among normalized objective vectors), otherwise from the whole population; crossover copies a random subset of differing bits from the chosen parent, followed by mutation (Mutate(S)).
• Diversity-based selection (Algorithm 3, Div_Select): candidates are sorted by the descending order of crowding distance; the diversity score of an unselected solution is the number of already-selected solutions with the same number of features, and the solution with the smallest diversity score is selected next, with ties broken in favour of the larger crowding distance. For example, the diversity scores for the eight solutions {x1, x2, ..., x8} in Fig. 2 are counted as {2, 4, 1, 2, 2, 1, 1, 1}.

Hang Xu, Bing Xue, and Mengjie Zhang. "Segmented Initialization and Offspring Modification in Evolutionary Algorithms for Bi-objective Feature Selection". Proceedings of the 2020 Genetic and Evolutionary Computation Conference (GECCO 2020). ACM Press. Cancun, Mexico. July 8th-12th 2020, 9pp.
Xu, H., Xue, B., & Zhang, M. (2020). A duplication analysis based evolutionary algorithm for bi-objective feature selection. IEEE Transactions on Evolutionary Computation.
GECCO,Dever,
Colorado, USA. 20-24
• Simultaneously minimise the number of features and the error • Multiple Reference Points based Decomposition for Multi-objective FS
rate
The fitness fun
sub-problem is d
f itnessS = eRa
1210
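The bi-objective view above (error rate and number of features, both minimised) can be sketched with a plain Pareto-dominance check; the candidate subsets and their objective values below are made up for illustration.

```python
# Minimal sketch of bi-objective comparison for feature selection:
# each solution is scored by (error rate, number of selected features),
# and both objectives are minimised. Toy objective values only.

def dominates(a, b):
    """True if solution a = (err, n_feat) Pareto-dominates b."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

candidates = {"s1": (0.10, 5), "s2": (0.12, 3), "s3": (0.15, 5)}
front = [n for n, obj in candidates.items()
         if not any(dominates(o, obj) for m, o in candidates.items() if m != n)]
# s3 is dominated by s1 (same size, higher error); s1 and s2 are nondominated.
```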
Evolutionary Multimodal Optimisation for FS
• The goal is to find multiple optimal feature subsets.
• Fitness function: minimise the classification error rate.
• In the Breast Cancer Wisconsin (Original) Data Set, there are 699 instances, 9 features and 2 classes. F1 is 'Clump Thickness' and F3 is 'Uniformity of Cell Shape'. Using subset [F1, F2, F7] or [F2, F3, F7] with KNN can achieve the same 97.81% classification accuracy.
• General procedure: Start → Initialize individuals → Get objective values of all individuals based on Eq. (7) → Generate offspring → Combine parents and offspring → Filter feature subsets → repeat until the population size is met.

Evolutionary Multimodal Optimisation for FS (multi-objective)
• The goal is to find multiple optimal feature subsets.
• MO: minimise the classification error rate and the number of selected features.

Algorithm 1: Subset filtration mechanism
Input:  S: a set of feature subsets
Output: US: a set of diverse feature subsets
1  begin
2    if all selected features are different between each other in S then
3      US = S
4    else
5      // S_dup: all duplicated feature subsets in S
6      S_dup = S_dup^1 ∪ S_dup^2 ∪ ... ∪ S_dup^p
7      US = S \ S_dup
8      for i = 1 to p do
9        calculate the SCR values of all individuals in S_dup^i based on Eq. (9)
10       index = find maximum index[SCR]

TABLE II: the results of HV on the test sets of GDE3, SPEA2, MOEA/D, NSGA-II, F-NSGAII and GF-NSGAII on the Wine, Zoo, SPECT, WBCD, Ionosphere, Sonar, Movement, Hillvally, Musk1, Multiple Features, Arrhythmia, SRBCT and Leukemia datasets (↑/↓/o mark significantly better/worse/similar results).
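The filtration idea of Algorithm 1 can be sketched as follows; the SCR-based choice among duplicates is simplified here to an illustrative "keep the first representative" rule, and the subset pool is made up.

```python
# Sketch of the subset-filtration mechanism: feature subsets that select
# exactly the same features are duplicates; only one representative of each
# duplicate group is kept so the returned set of subsets is diverse.

def filter_subsets(subsets):
    seen, unique = set(), []
    for s in subsets:
        key = frozenset(s)          # order of features inside a subset is irrelevant
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

pool = [["F1", "F2", "F7"], ["F2", "F3", "F7"], ["F2", "F1", "F7"]]
filter_subsets(pool)   # keeps the first and second subsets; the third duplicates the first
```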
• A new initialization strategy based on the feature relevance
• A dynamic hypervolume contribution indicator
• A dynamic crowding distance measure

P. Wang, B. Xue, J. Liang and M. Zhang, "Multiobjective Differential Evolution for Feature Selection in Classification," in IEEE Transactions on Cybernetics, doi: 10.1109/TCYB.2021.3128540.
P. Wang, B. Xue, J. Liang and M. Zhang, "Differential Evolution Based Feature Selection: A Niching-based Multi-objective Approach," in IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2022.3168052.
Niching-based Multi-objective FS
• A new mutation operator combining the local information of the niche and the global information of the whole population
• A subset repairing mechanism
  - Situation 1: some common features are included in all the ψ-quasi-equal feature subsets
  - Situation 2: no common feature
  - Situation 3: some solutions have common features

P. Wang, B. Xue, J. Liang and M. Zhang, "Differential Evolution Based Feature Selection: A Niching-based Multi-objective Approach," in IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2022.3168052.
Surrogate-Assisted Evolutionary FS

When a sampled training set is used to build the classifier, the corresponding fitness function is a surrogate fitness function, denoted by fitness_sur. If the whole training set is used to build the classifier, then the fitness function is the original one, called fitness_ori. In terms of PSO's representation, each feature corresponds to a position entry, whose value is either 1 or 0, indicating that the feature is selected or not selected, respectively. The overall proposed algorithm, named SurSammPSO, is described in Fig. 1 (overall algorithm: estimate promising areas for I iterations using fitness_sur, then continue with fitness_ori).

Hoai Bach Nguyen, Bing Xue, and Peter Andreae. "PSO with Surrogate Models for Feature Selection: Static and Dynamic Clustering-based Methods", Memetic Computing, vol. 10, pp. 291-300, 2018.
S. Liu, H. Wang, W. Peng and W. Yao, "A Surrogate-Assisted Evolutionary Feature Selection Algorithm with Parallel Random Grouping for High-Dimensional Classification," in IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2022.3149601.
K. Chen, B. Xue, M. Zhang and F. Zhou, "Correlation-Guided Updating Strategy for Feature Selection in Classification with Surrogate-Assisted Particle Swarm Optimisation," in IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2021.3134804.
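The 0/1 PSO representation described above can be sketched as follows; the position vector, dataset values and helper names are illustrative, not from the cited papers.

```python
# A particle's position has one entry per feature; entry 1 means "selected".
# Decoding a position yields the feature subset used to evaluate fitness.

def decode_particle(position):
    """Return the indices of the selected features."""
    return [i for i, bit in enumerate(position) if bit == 1]

def project(instances, selected):
    """Keep only the selected feature columns of each instance."""
    return [[row[i] for i in selected] for row in instances]

position = [1, 0, 1, 1, 0]                 # features 0, 2 and 3 are selected
data = [[5.1, 3.5, 1.4, 0.2, 9.0],
        [4.9, 3.0, 1.3, 0.2, 8.5]]

selected = decode_particle(position)       # -> [0, 2, 3]
reduced = project(data, selected)          # dataset restricted to the subset
```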
Correlation-Guided and Surrogate-Assisted PSO

Information Theory for Feature Selection
• fitness = Rel − Red

    Rel = Σ_{x_i ∈ S} I(x_i ; c)
    Red = Σ_{x_i, x_j ∈ S} I(x_i ; x_j)

  where S is the selected feature subset; x_i, x_j are single features in S; c is the class label; Rel is the relevance between S and c; Red is the redundancy in S.
• Relevance:
- Classification performance
- The relevance (mutual information) between each selected feature and the
class labels
• Redundancy:
- Number of features
- The relevance (mutual information) between the selected features
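The Rel − Red fitness above can be sketched for discrete features with plain counting estimates of mutual information; the function names and toy data are illustrative, and redundancy is summed over unordered feature pairs here.

```python
# Minimal mutual-information-based relevance/redundancy fitness sketch.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X; Y) in bits, estimated from two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def rel_red_fitness(features, labels):
    # Rel: sum of I(x_i; c) over the selected features
    rel = sum(mutual_information(f, labels) for f in features)
    # Red: sum of I(x_i; x_j) over pairs of selected features
    red = sum(mutual_information(fi, fj)
              for i, fi in enumerate(features)
              for fj in features[i + 1:])
    return rel - red
```

A feature identical to the class label has I = H(C) (maximal relevance), while a feature independent of it has I = 0.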
Peng, Hanchuan, Fuhui Long, and Chris Ding. "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy." IEEE Transactions on Pattern Analysis and Machine Intelligence 27.8 (2005): 1226-1238.
Liam Cervante, Bing Xue and Mengjie Zhang. "Binary Particle Swarm Optimisation for Feature Selection: A Filter Based Approach". Proceedings of 2012 IEEE World Congress on Computational Intelligence / IEEE Congress on Evolutionary Computation (WCCI/CEC 2012). Brisbane, Australia. 10-17 June 2012. IEEE Press. pp. 881-888.
K. Chen, B. Xue, M. Zhang and F. Zhou, "Correlation-Guided Updating Strategy for Feature Selection in Classification with Surrogate-Assisted Particle Swarm Optimisation," in IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2021.3134804.
… Journal on Artificial Intelligence Tools. Vol. 22, Issue 04, August 2013. pp. 1350024-1-31. DOI: 10.1142/S0218213013500243.
Bing Xue, Mengjie Zhang and Will Browne. "A Comprehensive Comparison on Feature Selection Approaches to Classification". International Journal of Computational Intelligence and Applications (IJCIA). Vol. 14, No. 2. 2015. pp. 1550008 (1-23).
Hancer, Emrah, Bing Xue, and Mengjie Zhang. "Differential evolution for filter feature selection based on information theory and feature ranking." Knowledge-Based Systems 140 (2018): 103-119.
Feature Selection Through Data Discretisation

Many different models can obtain the same accuracy. Furthermore, while the wrapper methods can produce solutions with high accuracy, filter methods are usually faster and more general. Combining the strengths of these two approaches in the evaluation function is expected to produce better solutions. In addition, combining the two measures can also better distinguish the subtle differences between feature subsets, providing a smoother fitness landscape to facilitate the search process. However, simply combining these measures may be impractical in terms of computation time. Therefore, a smart way is needed to combine them without requiring more running time. Among the commonly used filter measures, distance is a multivariate measure that can evaluate the discriminating ability of a feature set, and it is used as the base measure of KNN. Incorporating this measure with the KNN wrapper method will not increase the computation time much because the distance measure is calculated only once but used twice. Therefore, in the fitness function of the proposed PPSO, balanced KNN classification accuracy is combined with a distance measure using a weighting coefficient (γ), as described in Section 3.2.2. Its description is repeated here for reading convenience.

Bare-Bone Particle Swarm Optimisation
• Utilising information from MDLP
• Two-stage (PSO-FS) vs one-stage (PSO-DFS)

Potential cut-point table:

  Feature  #C   Cut-points
  F1       2    5.25   6.8
  F2       3    10.7   24.9   50.2
  F3       4    -0.45  0.67   5.2   20.5
  F4       1    -32.5
  ...
  FN       3    -3.72  -1.54  6.55

  Particle's position (cut-point index per feature):
  F1  F2  F3  F4  ...  FN
  2   0   3   0   ...  2

  Candidate solution (chosen cut-points): 6.8, 5.2, ..., -1.54

Figure 4.4: Particle representation of PPSO.
The search space is huge, especially in multivariate discretisation on high-dimensional data. Therefore, PPSO uses BBPSO to choose a cut-point from the potential cut-points of each feature to narrow the search space into potentially promising areas.

• Hybrid fitness function:

    fitness = γ · balanced_accuracy + (1 − γ) · distance        (4.3)

where

    distance = 1 / (1 + exp(−5 · (Db − Dw)))        (4.4)

    Db = (1/M) · Σ_{i=1}^{M} min_{j | j≠i, class(Ii) ≠ class(Ij)} Dis(Ii, Ij)        (4.5)

4.2.4.1 Particle representation
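The distance term of Eqs. (4.3)-(4.5) can be sketched numerically; the toy nominal instances are made up, and Dis is implemented here as the mismatch fraction between two nominal vectors (the complementary convention to counting matches).

```python
# Illustrative sketch of the hybrid distance term: D_b averages each instance's
# distance to its nearest instance of a *different* class, D_w its distance to
# the farthest instance of the *same* class, and both feed a sigmoid squashing.
from math import exp

def dis(a, b):
    """Assumed overlap distance: fraction of positions where the vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def d_b(X, y):
    # Eq. (4.5): mean distance to the nearest different-class instance.
    return sum(min(dis(X[i], X[j]) for j in range(len(X)) if y[j] != y[i])
               for i in range(len(X))) / len(X)

def d_w(X, y):
    # Mean distance to the farthest same-class instance.
    return sum(max(dis(X[i], X[j]) for j in range(len(X)) if j != i and y[j] == y[i])
               for i in range(len(X))) / len(X)

def distance_term(X, y):
    # Eq. (4.4): squash D_b - D_w into (0, 1); larger means better separation.
    return 1 / (1 + exp(-5 * (d_b(X, y) - d_w(X, y))))

X = [[0, 0], [0, 1], [1, 1], [1, 0]]   # toy nominal instances
y = [0, 0, 1, 1]
```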
Db is the average distance between each instance and the nearest instance of the other classes. Dw is the average distance between each instance and the farthest instance of the same class. Dis(Ii, Ij) is the number of matches or overlapping values between two nominal vectors divided by the size of the vectors, as in Chapter 3. KNN also works based on this overlapping distance, so it is calculated only once but used twice.

Each feature may have a different number of potential cut-points, which are calculated and stored in a potential cut-point table. Figure 4.4 shows an example of this table and a particle's position, as well as the corresponding candidate solution. Each particle position is an integer vector denoting the chosen cut-point indexes. The size of the vector, therefore, is equal to the number of the original features, and the evolved value needs to be between 1 and the number of potential cut-points of the corresponding feature. For example, in Figure 4.4, the first feature (F1) has two potential cut-points with indexes 1 and 2. Therefore, the first dimension of the particle needs to fall in the range [1, 2]. If it is 2, as in the example, then the cut-point 6.8 is chosen to discretise F1.

During the updating process, if the updated value of a dimension is outside the cut-point index range, then it is set to 0, which indicates that the corresponding feature does not have a good cut-point and therefore should be ignored, i.e. not selected.

Binh Tran, Bing Xue and Mengjie Zhang. "A New Representation in PSO for Discretisation-Based Feature Selection", IEEE Transactions on Cybernetics, vol. 48, no. 6, pp. 1733-1746, 2018.
Yu Zhou, Junhao Kang, Sam Kwong, Xu Wang, Qingfu Zhang. "An evolutionary multi-objective optimization framework of discretization-based feature selection for classification." Swarm and Evolutionary Computation 60 (2021): 100770.

Feature Selection and Classification
(Figure: a GP tree combining selected features F1, F3, F12 and F40 with operators +, −, * and % for classification.)
Soha Ahmed, Mengjie Zhang, Lifeng Peng and Bing Xue. "Genetic Programming for Measuring Peptide Detectability". Proceedings of the 10th International Conference on Simulated Evolution and Learning (SEAL 2014). Lecture Notes in Computer Science. Vol. 8886. Dunedin, New Zealand. December 15-18, 2014. pp. 593-604.
GP for Embedded Feature Selection
• Existing feature selection metrics have some biases in class imbalance problems
• Each terminal node (leaf) consists of a ("basic") feature selection metric, which returns a set of features considered highly discriminative by such a metric
• The set operations may be union (∪), intersection (∩), set difference (\), and so on

Viegas, Felipe, Leonardo Rocha, Marcos Gonçalves, Fernando Mourão, Giovanni Sá, Thiago Salles, Guilherme Andrade, and Isac Sandin. "A genetic programming approach for feature selection in highly dimensional skewed data." Neurocomputing 273 (2018): 554-569.
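The set-operation idea above can be sketched directly with Python sets; the metric names and their output feature sets are made-up placeholders, not taken from the cited paper.

```python
# Leaves return feature sets chosen by basic FS metrics; internal GP nodes
# combine them with union, intersection, or set difference.

top_by_chi2 = {"F1", "F3", "F7"}          # hypothetical leaf metric outputs
top_by_info_gain = {"F2", "F3", "F7"}
top_by_odds_ratio = {"F3", "F9"}

# One candidate GP tree: (chi2 ∩ info_gain) ∪ odds_ratio
selected = (top_by_chi2 & top_by_info_gain) | top_by_odds_ratio
# -> {'F3', 'F7', 'F9'}
```

Evolving such trees lets GP search over combinations of metrics rather than trusting any single biased metric.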
Feature Selection for Symbolic Regression: GP for Permutation Feature Importance

Figure 3.2 presents the main process of GPPI. While the left part of the figure shows the process of calculating permutation feature importance in one GP run, the right part describes the process of obtaining the permutation importance of one feature. The whole process is defined as:

1. Randomly split the training data into a sub-training set and a sub-test set.
2. Carry out a standard GP run and get the best-of-run individual Ib, which has the lowest training error over the sub-training set.
3. Compute the generalisation error of Ib over the sub-test set, which is referred to as Errorg(Ib).
4. For each distinct feature Xj in Ib, permute its values within the sub-test set, and get the test error of Ib on the permuted sub-test set, shown as Errpmt(Ib).
5. Calculate the distance between Errorg(Ib) and Errpmt(Ib), and use it to measure the raw feature importance of the feature, FIraw(Xj), i.e.

    FIraw(Xj) = Errpmt(Ib) − Errorg(Ib)        (3.3)

Figure 3.2: GP for Permutation Importance (GPPI).

Steps 4 and 5 need to be performed for each distinct feature {Xi, Xm, ..., Xk} in the best-of-run individual Ib. It is important to note that the importance values of features absent from Ib are defined to be 0. The whole process repeats n (n ≥ 30) times on the best-of-run individuals collected from n independent GP runs on the given training data. The condition n ≥ 30 is to reduce the bias of the random seed used in the GP runs on the importance values. The pseudo-code of this procedure is shown in Algorithm 1.

Permutation feature/variable importance in random forests (RF) is a widely used score to measure the importance of features [53]. Permuting a feature refers to rearranging the values of the feature within the dataset randomly. For example, if the values of a feature are denoted as {4, 7, 9}, a permuted version could be {9, 4, 7}.

Qi Chen, Mengjie Zhang, and Bing Xue. "Feature Selection to Improve Generalisation of Genetic Programming for High-Dimensional Symbolic Regression", IEEE Transactions on Evolutionary Computation, vol. 21, no. 5, pp. 792-806, 2017.
Dick, G., Rimoni, A. P., and Whigham, P. "A re-examination of the use of genetic programming on the oral bioavailability problem." In Proceedings of the 17th Annual Conference on Genetic and Evolutionary Computation (GECCO) (2015), ACM, pp. 1015-1022.
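Steps 4 and 5 above can be sketched as follows; the model and data are toy stand-ins for the best-of-run GP individual Ib and the sub-test set, and the function names are illustrative.

```python
# Minimal permutation-importance sketch in the GPPI style: permute one
# feature's column in the (sub-)test set and measure how much the error grows.
import random

def mse(model, X, y):
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, j, seed=0):
    """FI_raw(X_j) = Err_pmt(Ib) - Err_org(Ib) for feature column j."""
    err_org = mse(model, X, y)
    column = [row[j] for row in X]
    random.Random(seed).shuffle(column)            # permute feature j only
    X_pmt = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
    return mse(model, X_pmt, y) - err_org

model = lambda row: 2 * row[0]                     # uses only feature 0
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0]]
y = [2.0, 4.0, 6.0]
# Feature 1 never appears in the model, so permuting it changes nothing and
# its importance is exactly 0; permuting feature 0 can only raise the error.
```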
Results on the Test Sets — Generalisation Ability

Figure 3.5 shows the distribution of the generalisation errors (test NRMSEs) of the 100 best-of-run individuals of GP, GP-C50, GP-RF and GP-GPPI on DLBCL, CCUN and CCN. The overall trend is similar to that on the training set, i.e. GP-GPPI outperforms the other methods, particularly on LD50. With fewer programs manipulating the irrelevant features, the evolutionary process is more likely to be guided towards the better models, since the methods can discard more irrelevant features while converging to (near) optimal models. Examples of the evolved models, together with their mathematically simplified forms, are shown in Table 3.5 for an easier analysis of the models.

Table: training and test NRMSE of RF, GP, GP-GPPI and LASSO on F1, F2, LD50, DLBCL, CCUN and CCN (significance of the differences on the training and test sets in parentheses). The rows for F1 and F2 are:

  Benchmark  Method    Training NRMSE   Test NRMSE
  F1         RF        0.055±0.0013     0.16±0.0017
             GP        0.012±0.016      0.095±0.03
             GP-GPPI   0.037±0.043      0.049±0.064
             LASSO     0.11             0.09
  F2         RF        0.040±4.20E-4    0.078±5.61E-4
             GP        0.19±0.009       0.25±0.026
             GP-GPPI   0.21±4.45E-3     0.21±4.45E-3
             LASSO     0.18             0.22

Figure 3.5: Distribution of the Corresponding Test NRMSEs.

Qi Chen, Mengjie Zhang, and Bing Xue. "Feature Selection to Improve Generalisation of Genetic Programming for High-Dimensional Symbolic Regression", IEEE Transactions on Evolutionary Computation, vol. 21, no. 5, pp. 792-806, 2017.
Q. Ul Ain, H. Al-Sahaf, B. Xue, and M. Zhang, "A multi-tree genetic programming representation for melanoma detection using local and global features," in Proc. 31st Australas. Joint Conf. Artif. Intell., 2018, pp. 111-123.
Ain, Qurrat Ul, et al. "Generating knowledge-guided discriminative features using genetic programming for melanoma detection." IEEE Transactions on Emerging Topics in Computational Intelligence 5.4 (2020): 554-569.
Evolutionary Multitasking-Based FS (PSO-EMT)

The balanced classification error rate is used as the fitness measure:

    γR(D) = 1 − (1/c) · Σ_{i=1}^{c} TPRi        (7)

where c denotes the number of classes in a classification problem, and TPRi represents the proportion of correctly identified instances in class i. Since the balanced accuracy has no bias towards any class in the classification problem, the weight for each class is set to 1/c.

Task 1 keeps the size of its feature set unchanged; Equation (8) is used to determine the number of features updated:

    numChange = ρ ∗ numSelect        (8)

where numChange and numSelect represent the number of features updated and the number of features in Task 1, respectively. The scale factor ρ is used to control the number of updates. Since the promising feature subset may be updated during the search process, the combination of potentially complementary features in Task 1 will likely be enhanced to some extent; a high ρ value provided by the user may result in a lot of feature updates.

Fig. 2. Example of selecting promising features based on the knee point.

Knowledge transfer: for the related tasks, a multipopulation framework is adopted to transfer knowledge between Task 1 and Task 2. During the search process, a skill factor is used to allocate the particles to each FS task: if a particle's skill factor value is 1, the particle is assigned to solve Task 1; otherwise, it is used to address Task 2. Vertical cultural transmission is another important operator in PSO-EMT, which can change the value of the skill factor to transfer solutions between Task 1 and Task 2. The way of transferring knowledge plays an important role. A random mating probability (rmp) is defined to control knowledge transfer from the other task: at each generation, if a random number rand is less than or equal to rmp, (4) is adopted to update the particle's velocity; otherwise, (2) is applied.

Final selected features: the final goal in this article is to obtain a feature subset that can achieve high classification accuracy in high-dimensional classification problems. However, two feature subsets will be produced when applying evolutionary multitasking in step 2, from Task 1 and Task 2 respectively. The selected features from Task 1 are chosen as the final feature subset to return.

Further improvement mechanisms: PSO often suffers from premature convergence and easily falls into local optima. Two mechanisms are proposed to overcome these deficiencies: 1) a variable-range strategy, which reduces the search space by adapting the search range of each feature during the FS process; and 2) a feature subset updating mechanism. Both are designed to guide particles to focus on more promising areas.

Experiments: the datasets used in the experimental analyses are open source at https://ckzixf.github.io/dataset.html, and Table I shows their key characteristics. The distribution of data is highly unbalanced, and classification on such datasets is a challenging task. Table III shows the training time (Time), average feature subset size (Size), best classification accuracy (Best), average classification accuracy (Mean), and standard deviation (Std) of the classification accuracies of the involved methods. The training time of PSO-EMTv- is shorter than that of PSO-EMT on all datasets, which shows that the variable-range strategy can effectively reduce the training time. In terms of the feature subset size, PSO-EMT outperforms PSO-EMTv- on eight out of ten datasets; on the Lung dataset, PSO-EMT chooses about 24 fewer features. The effectiveness of PSO-EMT is contributed by evolutionary multitasking, which not only considers the entire feature set but also focuses on the important features; the accuracy is still increased by 4.10%, 1.24%, and 1.49%, respectively. On 9Tumor, the classification accuracy only increases by 6.78% on average accuracy and 9.06% on best accuracy. Although VLPSO achieves a smaller number of features than PSO-EMT, the feature subset size selected by PSO-EMT is already very small; for instance, the feature subset selected by PSO-EMT on Brain1 is 5.93% of the original features. In terms of dimensionality reduction, although PSO reduces the original feature size by around half on all datasets, the numbers of selected features achieved by PSO-EMT are still much smaller than those of PSO; the highest dimensionality reduction occurs on Leuk3, where PSO-EMT chooses about 20 times fewer features. Note that FCBF is an effective filter FS method for high-dimensional data; however, FCBF is prone to being trapped in local optima since it uses heuristic search strategies (i.e., sequential forward or backward selection). PSO-EMT can effectively overcome this issue and achieve better subsets due to its strong global searchability. Some earlier approaches accomplished implicit knowledge transfer between different tasks only by sharing global best solutions in PSO, without taking into account other useful information such as individual position and velocity.

K. Chen, B. Xue, M. Zhang and F. Zhou, "An Evolutionary Multitasking-Based Feature Selection Method for High-Dimensional Classification," in IEEE Transactions on Cybernetics, doi: 10.1109/TCYB.2020.3042243.
Q. Ul Ain, H. Al-Sahaf, B. Xue, and M. Zhang, "A multi-tree genetic programming representation for melanoma detection using local and global features," in Proc. 31st Australas. Joint Conf. Artif. Intell., 2018, pp. 111-123.
Qurrat Ul Ain, Harith Al-Sahaf, Bing Xue, Mengjie Zhang. "Generating Knowledge-guided Discriminative Features for Melanoma Detection", IEEE Transactions on Emerging Topics in Computational Intelligence, Online 2020 (DOI: 10.1109/TETCI.2020.2983426).

Feature Construction
(Figure: selected features F1, F3, F12 and F40 are combined by operators such as +, −, * and % into constructed features.)
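The balanced error rate of Eq. (7) can be sketched as follows; the toy label vectors are illustrative.

```python
# Balanced classification error: one minus the mean per-class true-positive
# rate, so every class carries weight 1/c regardless of its size.

def balanced_error_rate(y_true, y_pred):
    classes = sorted(set(y_true))
    tprs = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        tprs.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return 1 - sum(tprs) / len(classes)

# 4 instances of class 0 (3 correct) and 2 of class 1 (1 correct):
# TPRs are (0.75, 0.5), so the balanced error is 1 - 0.625 = 0.375.
```

Unlike the plain error rate, a classifier that always predicts the majority class scores poorly here, which is why the measure suits unbalanced data.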
persion of the instances of that class along the feature axis. The
dispersion of instances itself is related to the distribution of data
points in that class.
GECCO, Denver, Colorado, USA. 20-24 July 2016

GP for FC Measure: Measuring the Purity of Class Intervals

• An interval I is represented with a pair (lower, upper); I_c denotes the interval for class c.
• The interval of class c could be formulated as follows, if the class distributions were normal:

    I_c = [μ_c − 3σ_c, μ_c + 3σ_c]

• However, the normality assumption is not always satisfied.

Examples of good and bad class intervals (4 features, 3 classes):
• Overlapping intervals: bad
• Non-overlapping intervals: good

Given a discrete or categorical random variable C (the class label), which can take values c_1, c_2, ..., c_L with probabilities p(c_1), p(c_2), ..., p(c_L), the Shannon entropy of C is defined by

    H(C) = − Σ_{i=1}^{L} p(c_i) log_b p(c_i)

where b is the base of the logarithm and is usually 2.

A class interval establishes a new probability space. Therefore, the probability of the classes in the above equation should be conditioned on the values of the feature that fall in the class interval. Given a feature X, the set of all class labels C, and the class of interest c′ with corresponding interval I_c′, the Shannon entropy of the interval of class c′ is

    H(I_c′) = − Σ_{c∈C} p(c | X ∈ I_c′) log_2 p(c | X ∈ I_c′)

6.2.2 MGPFC Fitness Function

To evaluate an individual, its constructed features are used to transform the training set into a new training set with m features. The discriminating ability of the transformed dataset indicates how good the constructed features are. While a wrapper measure based on a classification algorithm can be a good indicator for searching for good feature sets, the resulting feature set ...

Fig. 9. (a) Decision tree induced by the J48 (C4.5) algorithm using the 1000 observations in the previous figure. (b) Class boundary generalized by the decision tree. The decision tree inducer has failed to generalize the concept of a line.

Fig. 11. Visualization of the features in the balance scale problem. The top four plots depict the original features of the problem and the bottom plot is for a constructed feature (x1·x2 − x3·x4). In each plot, the horizontal axis is the instance number and the vertical axis is the value of the feature for the given instance.

Neshatian, K.; Mengjie Zhang; Andreae, P., "A Filter Approach to Multiple Feature Construction for Symbolic Learning Classifiers Using Genetic Programming," IEEE Transactions on Evolutionary Computation, vol. 16, no. 5, pp. 645-661, Oct. 2012.
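The interval and purity definitions above can be sketched in a few lines of plain Python (a toy illustration; the function names and data are our own, not code from the cited paper):

```python
import math
from statistics import mean, stdev

def class_interval(values):
    """I_c = [mu_c - 3*sigma_c, mu_c + 3*sigma_c], assuming roughly normal values."""
    mu, sigma = mean(values), stdev(values)
    return (mu - 3 * sigma, mu + 3 * sigma)

def interval_entropy(feature, labels, interval):
    """H(I_c') = -sum over classes c of p(c | X in I_c') * log2 p(c | X in I_c')."""
    lower, upper = interval
    inside = [c for x, c in zip(feature, labels) if lower <= x <= upper]
    if not inside:
        return 0.0
    h = 0.0
    for c in set(inside):
        p = inside.count(c) / len(inside)
        h -= p * math.log2(p)
    return h

# A pure interval (only one class falls inside) has entropy 0; a maximally
# impure one approaches log2(number of classes).
feature = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2]
labels  = ["a", "a", "a", "b", "b", "b"]
i_a = class_interval([x for x, c in zip(feature, labels) if c == "a"])
print(interval_entropy(feature, labels, i_a))        # → 0.0
print(interval_entropy(feature, labels, (-1, 10)))   # → 1.0
```

A filter FC measure of this kind rewards GP trees whose constructed feature yields low-entropy (pure) class intervals.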
PSO-Based Hybrid Feature Selection (PSO-LSSU)

- The fitness function.
- The local search (resetting gbest).

• PSO-LSSU achieved much smaller feature subsets with significantly better classification accuracy in most cases.
• 5-6 times faster than PSO.

GP for Class-Dependent Feature Construction

[Diagram: the training set is used by class-dependent feature construction to produce constructed & selected features; both training and test sets are then transformed, and a learning algorithm is trained and tested on the transformed data to obtain the accuracy]

[Figure 6.7: Results of multiple-tree GP (CDFC) versus single-tree GP (1TGPFC): classification accuracy of (a) KNN, (b) NB, (c) DT, and (d) the accuracy improvement, on the CL, DL, LK, CN, PR, OV, LK1 and SR datasets]

[Figure 6.10: Constructed features on Leukemia1: (a) 1TGPFC, (b) CDFC, plotting constructed features CF1, CF2 and CF3]
Binh Ngan Tran, Bing Xue, Mengjie Zhang. "Class Dependent Multiple Feature Construction Using Genetic Programming for High-Dimensional Data". Proceedings of the 30th Australasian Joint Conference on Artificial Intelligence (AI2017). Lecture Notes in Computer Science. Vol. 10400. Springer. Melbourne, Australia. August 19-20th, 2017. pp. 182-194.

Binh Ngan Tran, Evolutionary Computation for Feature Manipulation in Classification on High-dimensional Data, PhD thesis, Victoria University of Wellington, New Zealand. http://researcharchive.vuw.ac.nz/xmlui/handle/10063/7078

Binh Tran, Mengjie Zhang and Bing Xue, "A PSO Based Hybrid Feature Selection Algorithm For High-Dimensional Classification". Proceedings of 2016 IEEE World Congress on Computational Intelligence / IEEE Congress on Evolutionary Computation (WCCI/CEC 2016). Vancouver, Canada. 24-29 July, 2016. pp. 3801-3808.

The average results over the 50 runs of both methods are displayed in Figure 6.7. Results from Figure 6.7 showed that, using the CDFC constructed features, all three learning algorithms obtained significantly higher accuracies than using the ones created by 1TGPFC on all datasets. KNN obtained the largest increase of 8.1% average accuracy on Prostate, NB had the biggest gap of 7.2% on CNS, and DT achieved a 13% increase on SRBCT. The comparisons in Figure 6.7(d) showed that KNN had the highest improvement on four datasets; on the other four datasets, DT had the largest differences, of 10.5% and 13% on Leukemia1 and SRBCT.

The superior results of CDFC mainly came from its fitness function. While both methods used an entropy-based measure to minimise the impurity of the constructed feature values within each class, CDFC incorporated an additional distance measure to evaluate the whole set of constructed features. The aim of the distance measure is to maximise the distance between instances of different classes and to minimise the distance between instances of the same class. This is the reason why the data clouds produced by CDFC are more compact and scattered far away from each other, separating instances of one class from the others.

6.6.5 Comparison with Feature Selection Results

The results of CDFC have shown that feature construction is a promising approach to dimensionality reduction on high-dimensional data. This section will compare the results of CDFC with the best results obtained by the feature selection methods in Chapters 3 and 4, using the datasets that have been used in all the experiments of the feature selection and feature construction methods in this thesis, namely Leukemia, DLBCL, Prostate, Leukemia1 and SRBCT.

GP for FS and FC in Biology

GP has the ability to select informative features and to build high-level features with strong discriminating ability, avoiding overfitting by constructing new high-level features with better distributions than the original skewed features. However, the number of features constructed by one best individual is still too small to be representative for the whole feature set with a large number of features. Increasing the number of constructed features may further improve the performance of these learning algorithms on high-dimensional data; our future work will focus on this direction.

5.2 The constructed feature

To see why SVM can improve its performance by using only one constructed feature, we pick the best constructed feature in one GP run on the Leukemia dataset to analyse. Figure 1 shows this GP tree, with a size of ten, constructing a new feature based on four original features: D13637_at, D42043_at, D78611_at and X95735_at. The values of these four selected features are plotted in Figure 2 and Figure 3. Among these features, X95735_at has the least number of overlapping values between the two classes. However, by combining these features, the constructed feature can split instances of different classes into two completely separate intervals. Figure 4 shows the feature values created by this constructed feature and by a similar constructed feature for the DLBCL dataset.

[Figure 1: Leukemia constructed feature, a GP tree of size ten over D13637_at, D42043_at, D78611_at and X95735_at]
[Figure 2: Features X95735_at and D42043_at]
[Figure 3: Features D13637_at and D78611_at]
[Figure 9: Feature distributions of the four selected features and the Leukemia constructed feature]

Comparing the features of these two datasets in Figure 7 and Figure 8, we find that Ovarian features are more correlated to each other (about 0.4 to 0.6) than Madelon features (about 0.01 to 0.03). This indicates that for non-linear data GP can work better than libSVM, which is a linear classification algorithm.

[Figure 7: Ovarian feature correlation]
[Figure 8: Madelon feature correlation]

5.3 Overfitting problem

A feature subset selected based on the training fold cannot always be generalised to correctly predict the unseen data in the test fold. This may be the reason why the training and test accuracies are so different. This explanation is concordant with the result on the Ovarian dataset, where both SVM and GP achieve similar performance on training and test sets. The boxplot of the first one hundred features of this dataset in Figure 6 shows that ... Affected by this overfitting problem, SVM has the poorest test performance on most of the datasets. However, only one constructed feature can significantly improve its performance: using only the constructed feature, SVM achieves higher accuracies than using all features and the other created subsets on most datasets. Among the six created subsets, the constructed feature and the terminal features achieve the highest performance in GP; the last four subsets achieve similar performance in kNN and SVM. The constructed feature can improve the performance of kNN, SVM and GP classifiers on high-dimensional problems.

[Figure 5: Colon, first one hundred features]
[Figure 6: Ovarian, first one hundred features]

Binh Tran, Bing Xue and Mengjie Zhang. "Genetic Programming for Feature Construction and Selection in Classification on High-dimensional Data", Memetic Computing, vol. 8, Issue 1, pp. 3-15. 2016.
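The compactness/separation idea behind this kind of distance measure can be illustrated with a toy score (a simplified sketch of our own, not the exact CDFC fitness function):

```python
def distance_score(points, labels):
    """Average between-class distance minus average within-class distance,
    computed over all pairs of instances in the constructed-feature space."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    within = between = 0.0
    nw = nb = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if labels[i] == labels[j]:
                within += dist(points[i], points[j]); nw += 1
            else:
                between += dist(points[i], points[j]); nb += 1
    # maximise the distance between classes, minimise it within a class
    return between / nb - within / nw

labels    = [0, 0, 1, 1]
compact   = [[0.0], [0.1], [5.0], [5.1]]   # tight class clouds, far apart
scattered = [[0.0], [2.0], [1.0], [3.0]]   # overlapping class clouds
print(distance_score(compact, labels) > distance_score(scattered, labels))  # → True
```

Combined with an entropy measure, a score like this pushes GP towards constructed features whose class clouds are compact and well separated.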
Redundancy-Based Feature Clustering Method

[Scatter plots: ALL vs AML class values for genes M19507_at, X61587_at, U09087_s_at and U82759_at, and for the constructed & selected features from the best individual]

Table 5.6 (excerpt): total number of features, average number of features input to GP, reduction proportion, and the average ASC over 10 folds of each dataset.

Dataset     #Features   Avg. #features into GP   Reduction   ASC
DLBCL       5469        819.20                   0.85        0.96
Leukemia    7129        901.30                   0.87        0.98
CNS         7129        79.30                    0.99        1.00
Prostate    10509       1634.80                  0.84        0.85
Ovarian     15154       601.20                   0.96        0.31
Alizadeh    1095        93.60                    0.91        0.94

As can be seen from the fourth column of Table 5.6, all datasets obtained at least 84% dimensionality reduction after applying the proposed feature clustering algorithm. The number of input features into GP was significantly reduced, with the largest reductions of 99% on CNS and 96% on Ovarian and Yeoh. The third column of Table 5.6 also showed differences in the number ...
GP for FC in Clustering: Multi-Tree GP Representation vs Vector Representation

• Multi-tree: each individual contains t trees, to give t constructed features; each tree creates a single constructed feature.
• Having to set t is annoying. Can we use a single tree?
• Vector: introduce a new concat operator which can create vectors of CFs.
  - Automatically build up a suitable-length vector.
  - Extend the function set to work on vectors.

Andrew Lensen, Bing Xue, and Mengjie Zhang. "New Representations in Genetic Programming for Feature Construction in k-means Clustering". Proceedings of the 11th International Conference on Simulated Evolution and Learning (SEAL 2017). Lecture Notes in Computer Science. Vol. 10593. Shenzhen, China. November 10-13, 2017. pp. 543-555.

A. Lensen, B. Xue and M. Zhang, "Genetic Programming for Manifold Learning: Preserving Local Topology," in IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2021.3106672.
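A minimal sketch of the vector idea (the node encoding and evaluator are our own illustration, not the paper's implementation): arithmetic nodes return a single constructed feature, while concat joins its children's outputs into a growing vector of constructed features.

```python
def evaluate(node, instance):
    """Evaluate a tree encoded as nested tuples; returns a list (vector) of
    constructed-feature values for one instance."""
    op, children = node[0], node[1:]
    if op == "feature":                     # terminal: index into the instance
        return [instance[children[0]]]
    if op == "concat":                      # builds a variable-length vector of CFs
        return [v for child in children for v in evaluate(child, instance)]
    if op in ("+", "*"):                    # arithmetic nodes expect scalar children
        (a,), (b,) = evaluate(children[0], instance), evaluate(children[1], instance)
        return [a + b] if op == "+" else [a * b]
    raise ValueError(f"unknown operator {op!r}")

# concat(f0 + f1, f2 * f3): a single tree that constructs two features.
tree = ("concat",
        ("+", ("feature", 0), ("feature", 1)),
        ("*", ("feature", 2), ("feature", 3)))
print(evaluate(tree, [1.0, 2.0, 3.0, 4.0]))  # → [3.0, 12.0]
```

Because concat nodes can nest, crossover and mutation can grow or shrink the constructed-feature vector without fixing t in advance.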
Image Recognition/Classification

• GP for image analysis: evolve image descriptors.
• Keypoint identification, feature extraction, feature construction/selection.

[Example GP tree: input image → feature extraction → feature construction/selection → feature vector]

• The traditional way:
  - The feature areas need to be designed by domain experts.
  - Transform the pixel values of the feature areas into features.
  - Select a subset out of the extracted features (optional).
  - Feed the features (with or without selection) to a classifier.

GP for Image Classification: FELGP

GP has a tree-based representation, which is known for the variable lengths of evolved solutions. The individual representation of the proposed FELGP approach is based on STGP [22], where each function has input and output types, and each terminal has an output type. To define the type constraints, a new program structure is developed, as shown in Fig. 8.2. The new program structure has the input, filtering & pooling, feature extraction, concatenation, classification, combination, and output layers. Each layer except for the input and output layers has different functions for different purposes. The input layer represents the input of the FELGP system, such as the terminals. The filtering & pooling layer has filtering and pooling functions, which operate on images. The feature extraction layer extracts informative features from the images using existing feature extraction methods. The concatenation layer concatenates features produced by its children nodes into a feature vector. The filtering & pooling, feature extraction and concatenation layers belong to the process of feature learning, where informative features are learned from raw images. The learned features can be directly fed into any classification algorithm for classification. Therefore, a classification layer is connected with the concatenation layer. The classification layer has several classification functions that can be used to train classifiers using the learned features. The combination layer has several combination functions to combine the outputs of the classification functions. The classification and combination layers belong to the process of ensemble learning, where the classification functions are selected and trained, and the outputs of the classifiers are combined. Finally, the output layer performs plurality voting on the outputs produced by the combination layer to obtain the combined predictions. The structure of the FELGP approach has one more layer, i.e., the feature extraction layer.

Fig. 8.2. The new program structure of FELGP and an example solution/program that can be evolved by FELGP. [Example: classification layer (SVM, RF, LR); concatenation layer (FeaCon2, FeaCon3 with Y_train); feature extraction layer (uLBP, SIFT, HOG); filtering & pooling layer (LoG1, SobelX, LBP-F, MaxP); input (X_train)]

... and Evolve Ensembles for Image Classification," in IEEE Transactions on Cybernetics, vol. 51, no. 4, pp. 1769-1783, April 2021, doi: 10.1109/TCYB.2020.2964566.
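The layered type constraints described above can be sketched as follows (the type tags and the checker are our own illustration of the STGP idea, not the FELGP code; the function names echo those in Fig. 8.2):

```python
FUNCTION_TYPES = {
    # name: (input types, output type)
    "LoG1":    (("Image",), "Image"),                   # filtering & pooling layer
    "MaxP":    (("Image",), "Image"),
    "SIFT":    (("Image",), "Features"),                # feature extraction layer
    "HOG":     (("Image",), "Features"),
    "FeaCon2": (("Features", "Features"), "Features"),  # concatenation layer
    "X_train": ((), "Image"),                           # input layer (terminal)
}

def output_type(node):
    """Type-check a (name, *children) tree; raise TypeError if a child's
    output type does not match the parent's declared input type."""
    name, children = node[0], node[1:]
    in_types, out_type = FUNCTION_TYPES[name]
    actual = tuple(output_type(c) for c in children)
    if actual != in_types:
        raise TypeError(f"{name} expects {in_types}, got {actual}")
    return out_type

program = ("FeaCon2",
           ("SIFT", ("LoG1", ("X_train",))),
           ("HOG", ("MaxP", ("X_train",))))
print(output_type(program))  # → Features
```

Crossover and mutation restricted by such type checks can only produce programs that respect the layer ordering.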
GP for Biomarker Detection in Mass Spectrometry Data

[Table: Prostate Cancer – 15000, 322, 4; Toxpath – 7105, 115, 4. For Apple minus m/z 463.0 (5 biomarkers): new method 5 ✓, Method 2 only 2 ✓]

Soha Ahmed, Mengjie Zhang, Lifeng Peng and Bing Xue. "Multiple Feature Construction for Effective Biomarker Identification and Classification using Genetic Programming". Proceedings of 2014 Genetic and Evolutionary Computation Conference (GECCO 2014). ACM Press. 2014. pp. 249-256.

Soha Ahmed, Genetic Programming for Biomarker Detection in Classification of Mass Spectrometry Data, PhD thesis, 2015, School of Engineering and Computer Science, Victoria University of Wellington, New Zealand.
GP for Evolving a Front of Interpretable Models for Data Visualization

Two objectives:
- visualization quality;
- visualization complexity.

... feature set, yet they are able to separate three classes of the dataset well. In particular, the blue, red, and purple classes can be separated out by drawing two vertical lines through the plot. In other words, two thresholds can be applied to the output of the top tree (x-axis) in order to roughly separate the classes into three groups: 1) red; 2) orange and green; and 3) purple and blue. Given that all features have the same range of [0, 1] and the only feature subtracted in the tree is f21, this suggests that the blue and purple classes have particularly small feature values for f21 (as they have high x values), and the red class has particularly large values (as it has low x values). f21 in the Dermatology dataset corresponds to the "thinning of the suprapapillary epidermis" feature, which is a feature commonly associated with the skin condition of psoriasis [39]. In the visualization in Fig. 11, the red class corresponds to a diagnosis (class label) of psoriasis, and indeed the red instances appear on the left side of the plot, which is consistent with having a high value of f21 being subtracted from the rest of the tree. This sort of analysis can be used by a clinician to understand that this feature alone could be used to diagnose psoriasis with reasonable accuracy, and it also provides greater confidence in the visualization's accuracy.

... the nf8 feature. This is, however, sufficient to cause some of the orange points to be separated from the yellow and green points along the x-axis, indicating that higher values of nf8 may indicate that a point belongs to the orange class rather than the yellow one. The major change in the bottom tree is the introduction of the nf21 and nf6 features, which helps to compact the blue class and starts to separate the red class more strongly. Fig. 14 further pushes the blue class away from the other classes at the expense of squashing the other classes together. This is achieved by making the top tree (x-axis) weight f3 and f32 even more strongly. It is interesting to note that all the trees analyzed so far have been strictly linear functions, utilizing only addition and subtraction. This is perhaps not unexpected: utilizing more sophisticated functions is likely to require complex trees to fully take advantage of them; for example, taking the maximum of two simple subtrees is unlikely to be more representative of the original dataset than simply adding together features. This also has the benefit of ...

[Figures: visualizations of models with complexity 13, 26, 33, 59 and 104]

Lensen, Andrew, Bing Xue, and Mengjie Zhang. "Genetic programming for evolving a front of interpretable models for data visualization." IEEE Transactions on Cybernetics 51.11 (2020): 5468-5482.

Feature Selection for Explainable ML

Bin Wang, Wenbin Pei, Bing Xue, and Mengjie Zhang. "Evolving Local Interpretable Model-agnostic Explanations for Deep Neural Networks in Image Classification". Proceedings of 2021 Genetic and Evolutionary Computation Conference (GECCO 2021 Companion).
Any problem?

Issues and Challenges

[Figure: (a) the whole dataset (training + test data) is used to find the (feature) subset; (b) only the training data is used to find the (feature) subset, which is then evaluated on the test data]
FS/FC Bias

• If the whole dataset is used during the FS/FC process, the experiments (or evaluation) have FS/FC bias.
• What if only a small number of instances is available?
  - In classification, use k-fold cross validation.
  - How to use k-fold cross validation in FS/FC to evaluate a FS/FC system?

Figure 4. Structures of experiments with and without feature selection bias.

... the index of the features, where the indices of the features are listed in descending order of their frequency. It would be much more interesting to discover the biological meaning of these genes, but since the original meanings of the features/genes in GEMS are not given, it is impossible to perform such research. In the future, we intend to collaborate with researchers from biology to deeply analyse the biological findings of the selected genes.

4.6. Further discussions

In this section, the re-substitution estimator was used to evaluate the performance of the feature selection algorithms, which is the same as in Chuang et al. (2008) and many other existing papers (Abedini et al., 2013; Ahmed et al., 2012; Alba et al., 2007; Babaoglu et al., 2010; Huang et al., 2007; Mishra et al., 2009; Mohamad et al., 2011, 2013; Santana et al., 2010; Shen et al., 2007; Yu et al., 2009). The re-substitution estimator, in other words, means the whole dataset is used during the evolutionary feature selection process (as shown in Figure 4(a)). There is no separate unseen data to test the generality of the selected features. According to Ambroise and McLachlan (2002), there is a feature selection bias issue here, so one cannot claim that the selected features can be used for future unseen data. Feature selection bias typically happens when the dataset includes only a small number of instances, especially on gene expression data, where n-fold CV (10-CV) or LOOCV is needed. Figure 4 compares the structures of experiments with and without feature selection bias. It can be seen that with selection bias, the algorithm reports the classification performance of the (inner) CV loop, and "such results are optimistically biased and are a subtle means of training on the test set" (Kohavi and John, 1997). Therefore, the conclusions drawn from the re-substitution estimator with selection bias may be different from those without bias. This, however, has not been seriously investigated in EC for gene selection.

5. Experiment II

In this section, a second set of experiments has been conducted, where the feature selection bias issue is removed.

5.1. Performance evaluation

To avoid feature selection bias and compare the performance of the algorithms with and without bias, the second set of experiments without feature selection bias has been ...

Binh Tran, Bing Xue and Mengjie Zhang. "Genetic Programming for Feature Construction and Selection in Classification on High-dimensional Data", Memetic Computing, vol. 8, Issue 1, pp. 3-15. 2016.

K-CV in Each Evaluation: the Inner Loop

• 3-CV as an inner loop to evaluate each feature subset.
• In each evaluation, to get the accuracy: the training set is split into three folds; each fold in turn is held out, the learning algorithm trains a classifier on the other two folds and tests it on the held-out fold, and the three accuracies are averaged.

Weakness and Issues

• Large search space:
  - a bit-string/vector with a length equal to the total number of features
  - classification accuracy or existing filter measures in the fitness function, which often cannot lead to a smooth fitness landscape, or have low locality
• Poor scalability:
  - the dimensionality of the search space often equals the total number of features: thousands, or even millions
  - the number of instances is large
• Generalisation issue:
  - especially wrappers
  - feature construction
• Long computational time
• Feature selection or construction bias issue

Kohavi, Ron, and George H. John. "Wrappers for feature subset selection." Artificial Intelligence 97.1-2 (1997): 273-324.
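The bias-free structure of Figure 4(b) can be sketched as follows: feature selection is re-run inside each training portion of an outer cross-validation, so the held-out fold never influences which features are chosen. The selector and the simple centroid classifier below are our own toy stand-ins, not the algorithms from the slides.

```python
import random

def select_features(X, y, k):
    """Rank features by |mean(class 1) - mean(class 0)|, using X, y only."""
    def gap(j):
        a = [row[j] for row, c in zip(X, y) if c == 1]
        b = [row[j] for row, c in zip(X, y) if c == 0]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def centroid_acc(tr_X, tr_y, te_X, te_y, feats):
    """Nearest-centroid classifier restricted to the selected features."""
    def centroid(cls):
        rows = [[r[j] for j in feats] for r, c in zip(tr_X, tr_y) if c == cls]
        return [sum(col) / len(col) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    correct = 0
    for row, c in zip(te_X, te_y):
        p = [row[j] for j in feats]
        correct += (0 if dist(p, c0) <= dist(p, c1) else 1) == c
    return correct / len(te_y)

def unbiased_cv(X, y, k_folds=3, k_feats=2):
    """Outer k-fold CV: feature selection is redone inside every fold."""
    idx = list(range(len(X)))
    folds = [idx[i::k_folds] for i in range(k_folds)]
    accs = []
    for held in folds:
        tr = [i for i in idx if i not in held]
        tr_X, tr_y = [X[i] for i in tr], [y[i] for i in tr]
        te_X, te_y = [X[i] for i in held], [y[i] for i in held]
        feats = select_features(tr_X, tr_y, k_feats)   # test data never seen
        accs.append(centroid_acc(tr_X, tr_y, te_X, te_y, feats))
    return sum(accs) / len(accs)

random.seed(1)
# Toy data: features 0 and 1 are informative, the remaining 8 are noise.
X = [[c + random.gauss(0, .3), c - random.gauss(0, .3)] +
     [random.gauss(0, 1) for _ in range(8)] for c in [0, 1] * 30]
y = [0, 1] * 30
print(unbiased_cv(X, y))
```

Running the selector on the whole dataset first and only then cross-validating the classifier would reproduce the optimistic bias described above.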
Future Directions

• Explainable machine learning:
  - increase the interpretability/understandability of the obtained feature set
  - simple models via feature selection/construction
• Feature construction:
  - construct multiple features
  - combine both feature selection and feature construction

Thanks to my colleagues, post-docs, visiting scholars, PhD and Master/Hons/Summer Research students!