
Evolutionary Computation for Feature Selection and Feature Construction

Bing Xue and Mengjie Zhang

Evolutionary Computation Research Group
Victoria University of Wellington, New Zealand

Bing.Xue@ecs.vuw.ac.nz and Mengjie.Zhang@ecs.vuw.ac.nz

http://gecco-2022.sigevo.org/

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
GECCO '22 Companion, July 9–13, 2022, Boston, MA, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9268-6/22/07…$15.00
https://doi.org/10.1145/3520304.3533659

Instructors

• Bing Xue is currently a Professor in Computer Science and Deputy Head of School in the School of Engineering and Computer Science at VUW. She has over 300 papers published in fully refereed international journals and conferences, and her research focuses mainly on evolutionary computation, feature selection, feature construction, machine learning, classification, symbolic regression, evolving deep neural networks, image analysis, transfer learning, and multi-objective machine learning. Dr Xue was the founding Chair of the IEEE Computational Intelligence Society (CIS) Task Force on Evolutionary Feature Selection and Construction. She co-chaired Women@GECCO in 2018 and co-chaired the Evolutionary Machine Learning and NeuroEvolution tracks in GECCO 2019 to 2022. She also serves as an associate editor of several international journals, such as IEEE Transactions on Evolutionary Computation and ACM Transactions on Evolutionary Learning and Optimization.

• Mengjie Zhang is a Fellow of the Royal Society of NZ, a Fellow of IEEE, a Fellow of Engineering NZ, and an IEEE Distinguished Lecturer. He is Professor of Computer Science at the School of Engineering and Computer Science, Victoria University of Wellington, New Zealand. His research is mainly focused on evolutionary computation, particularly genetic programming, evolutionary deep and transfer learning, image analysis, feature selection and reduction, and evolutionary scheduling and combinatorial optimisation. He has published over 700 academic papers in refereed international journals and conferences. He is currently an associate editor for over ten international journals (e.g. IEEE TEVC, ECJ, ACM TELO, IEEE TCYB, and IEEE TETCI). He has been serving as a steering committee member and a program committee member for over eighty international conferences. He is a reviewer of research grants for many countries/regions (e.g. Canada, Portugal, Spain, Germany, UK, the Netherlands, Austria, Mexico, Czech Republic, Italy, HK, Australia, NZ).

GECCO, Denver, Colorado, USA. 20-24 July 2016

Outline

• Feature Selection and Feature Construction
• Evolutionary Computation (EC) for Feature Selection and Feature Construction
• Evolutionary Feature Selection Methods
• Evolutionary Feature Construction Methods
• Issues and Challenges

Data set (Classification) — Example

Credit card application:
• 7 applicants (examples/instances/observations)
• 2 classes: Approve, Reject
• 3 features/variables/attributes

              Job    Saving  Family    Class
Applicant 1   true   high    single    Approve
Applicant 2   false  high    couple    Approve
Applicant 3   true   low     couple    Reject
Applicant 4   true   low     couple    Approve
Applicant 5   true   high    children  Reject
Applicant 6   false  low     single    Reject
Applicant 7   true   high    single    Approve
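The table above can be used to ask which feature best separates the two classes. As a hedged illustration (not part of the tutorial's own material), the following sketch ranks the three features by information gain over this 7-instance data set; the helper names are my own:

```python
from collections import Counter
from math import log2

data = [  # (Job, Saving, Family, Class) — the 7 applicants above
    ("true", "high", "single", "Approve"),
    ("false", "high", "couple", "Approve"),
    ("true", "low", "couple", "Reject"),
    ("true", "low", "couple", "Approve"),
    ("true", "high", "children", "Reject"),
    ("false", "low", "single", "Reject"),
    ("true", "high", "single", "Approve"),
]
features = ["Job", "Saving", "Family"]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(col):
    """Entropy of the class label minus the entropy remaining after
    splitting the instances on feature column `col`."""
    labels = [row[-1] for row in data]
    remainder = 0.0
    for value in {row[col] for row in data}:
        subset = [row[-1] for row in data if row[col] == value]
        remainder += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - remainder

gains = {name: info_gain(i) for i, name in enumerate(features)}
```

On this toy data, Family carries the most class information and Job the least, which is the sense in which one feature is "better" than another for a given learner.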

What is a Good Feature?

The measure of goodness is subjective with respect to the type of classifier. The features in this figure, x1 and x2, are good for a linear classifier.

• The same set of features are not good for a decision tree (DT) classifier that is not able to transform its input space.

Fig. 9. (a) Decision tree induced by the J48 (C4.5) algorithm using the observations in the previous figure. (b) Class boundary generalised by the decision tree. The decision tree inducer has failed to generalise the concept of a line.

Why Use GP for Feature Construction?

• GP is flexible in making mathematical and logical functions.
• There isn't much structural (topological) information in the search space of possible functions, so using a meta-heuristic approach (such as evolutionary computation) seems reasonable.

GP for Feature Construction: A System Diagram

Fig. 10. (a) Simple GPMFC-constructed feature. (b) J48 induced decision tree using the constructed feature, y.

E. Further Discussions

Theoretically, GP and symbolic learners both generate intelligible models: GP generates expression trees and symbolic learners generate chains of rules or trees of decisions. In practice, however, both of them can produce models (constructed features or classifiers) that are not easily comprehensible; very often constructed features are unnecessarily complicated and therefore unintelligible. Although in our empirical results the complexity of the constructed …

Feature Selection and Feature Construction

• Feature selection aims to select a small subset of relevant features from the original large set of features in order to maintain or even improve the performance.
• Feature construction is to construct new high-level features using original features to improve the classification performance.

A Sample Measure of Goodness: The Entropy of Class Intervals

Defining a measure of goodness for a single feature:
• The interval of a class along a feature is determined by the dispersion of the instances of that class along the feature axis. The dispersion of instances itself is related to the … points in that class.
• An interval I is represented with its lower and upper bounds; Ic indicates the interval for class c.
• The interval of class c could be Ic = [µc − 3σc, µc + 3σc] if class distributions were normal. However, the normality assumption …
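As a small sketch of the interval idea (my own illustration, under the slide's normality assumption), the interval Ic = [µc − 3σc, µc + 3σc] of each class along one feature, and a check for whether two class intervals overlap, can be computed as:

```python
from statistics import mean, pstdev

def class_interval(values):
    """[mu - 3*sigma, mu + 3*sigma] of one class's values along a feature,
    assuming (as the slide notes) the class distribution is normal."""
    mu, sigma = mean(values), pstdev(values)
    return (mu - 3 * sigma, mu + 3 * sigma)

def overlap(i, j):
    """Whether two class intervals overlap; less overlap suggests the
    feature separates the two classes better."""
    return max(i[0], j[0]) <= min(i[1], j[1])

# hypothetical feature values of two classes along the same feature axis
interval_a = class_interval([1.0, 2.0, 3.0])
interval_b = class_interval([10.0, 11.0, 12.0])
```

Here the two intervals are far apart, so the (hypothetical) feature would score well for these two classes.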
Increasingly Popular Topic

• 2018 IF of TEVC

Most Contributed Paper to IEEE Transactions on Evolutionary Computation

• NSGA-II
• NSGA-III
• MOEA/D
• In top 5 since 2017

Why Feature Selection?

• "Curse of dimensionality"
  - Large number of features: 100s, 1000s, even millions
• Not all features are useful (relevant)
• Redundant or irrelevant features may reduce the performance (e.g. classification accuracy)
• Costly: time, memory, and money
• Explainable/Interpretable AI
• Gain knowledge about the problem/data
• GIGO: garbage in, garbage out
  - Representation/feature learning

Scikit-Learn

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
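The scale problem behind the first bullet can be made concrete: with n features there are 2**n candidate feature subsets, so exhaustive evaluation becomes infeasible long before "millions" of features. A quick illustration (mine, not from the slides):

```python
# number of candidate feature subsets grows as 2**n with n features
for n in (10, 20, 50, 100):
    print(f"n = {n:3d}: {2**n:.3e} candidate subsets")
```

Already at n = 50 there are more than 10**15 subsets; at n = 100, around 1.3e30.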
Why Feature Construction?

• The quality of input features can drastically affect the learning performance.
• Even if the quality of the original features is good, transformations might be required to make them usable for certain types of classifiers.
• Feature construction does not add to the cost of extracting (measuring) original features; it only carries computational cost.
• In some cases, feature construction can lead to dimensionality reduction or implicit feature selection.

What can FS/FC do?

• Reduce the dimensionality (No. of features)
• Improve the (classification) performance
• Simplify the learnt model
• Speed up the processing time
• Help visualisation
• Improve interpretability and explainability
• Reduce the cost, e.g. save memory
• and?

Do we still need feature selection when we have deep learning?

Search Strategy

Basically, many different search algorithms can be used (e.g. exhaustive, random, GA). Good representation of solutions (subsets) is very important. A search algorithm can perform better by using the structural information available to subsets of features: the lattice of subsets.

Forward Selection Algorithm

• Start with the empty set.
• At each step, examine all individual features by tentatively adding them to the current set of selected features. Choose the feature that yields the highest improvement and add it to the set of selected features.
• Repeat the above step until the performance can no longer be improved.

How does this algorithm compare to exhaustive search?
How would you use the lattice to perform a search?

Backward Selection Algorithm

• Start with the set of all available features.
• At each step, examine all individual features by tentatively removing them from the current set of selected features. Pick the feature whose removal yields the highest improvement and remove it from the set of selected features.
• Repeat the above step until the performance can no longer be improved.

How does this algorithm compare to forward selection?

Challenges in FS and FC

• Large search space: 2^n possible feature subsets
  - 1990: n < 20
  - 1998: n <= 50
  - 2007: n ≈ 100s
  - Now: 1000s, 1 000 000s
• Feature interactions
  - Relevant features may become redundant
  - Weakly relevant or irrelevant features may become highly useful
• Slow processing time, or even not possible
• Multi-objective problems

General FS/FC System

Original features → Evolutionary Feature Selection/Construction → Constructed/Selected Feature(s)

Overcoming Computational Intensity in Feature Selection

Things to do to make searching the exponentially-growing space of subsets of features (with 2^m elements) feasible:
• Don't search the entire space; use some heuristics instead (e.g. Forward Selection, GA, PSO, ...)
• Choose computationally cheap learning algorithms (e.g. Nearest Centroid) or other measures (non-wrappers).
• Take a different approach (e.g. using GP to search the space implicitly).
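The greedy forward-selection loop described above can be sketched in a few lines; `evaluate` stands in for whatever subset-quality measure (wrapper accuracy, filter score) is chosen, and the toy measure below is an assumption for demonstration only:

```python
def forward_selection(n_features, evaluate):
    """Greedy sequential forward selection: grow the subset one feature
    at a time while the evaluation measure keeps improving."""
    selected, best_score = set(), evaluate(set())
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # tentatively add each remaining feature and keep the best
        score, feat = max((evaluate(selected | {f}), f) for f in candidates)
        if score <= best_score:  # no further improvement: stop
            break
        selected.add(feat)
        best_score = score
    return selected

# toy measure: features 0 and 2 are useful, every selected feature costs 0.1
useful = {0, 2}
score = lambda subset: len(subset & useful) - 0.1 * len(subset)
```

With this toy measure the loop stops after picking exactly {0, 2}, since adding any further feature lowers the score — illustrating both the greediness and the stopping rule (and, implicitly, why nested greedy choices can miss interacting features).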

FS/FC Process

• On the training set:
  Initialization → Constructed/Selected Feature(s) → Feature(s) evaluation → Stop? (No: generate new candidates; Yes: output the selected feature subset)
• A search mechanism is responsible for generating promising feature set candidates.
• Feature sets are evaluated by an evaluation criterion (fitness/objective function).

Feature Selection Approaches

• Based on evaluation — the role of the learning algorithm:
  - Three categories: Filter, Wrapper, Embedded
  - Hybrid (Combined): commonly wrapper + filter

Filter:   Original Features → Evaluation (Measure) → Selected Features
Wrapper:  Original Features → Learning Classifier → Selected Features
Embedded: Original Features → Embedded Method → Selected Features + Learnt Classifier

Feature Selection Approaches

• Generally:

            Classification Accuracy   Computational Cost   Generality (different classifiers)
  Filter    Low                       Low                  High
  Embedded  Medium                    Medium               Medium
  Wrapper   High                      High                 Low

Bing Xue, Mengjie Zhang and Will Browne. "A Comprehensive Comparison on Feature Selection Approaches to Classification". International Journal of Computational Intelligence and Applications (IJCIA). Vol. 14, No. 2. 2015. pp. 1550008 (1-23).

Feature Selection Approaches

• Conventional approaches
  - The Relief algorithm
  - The FOCUS algorithm
  - Sequential forward/backward floating selection
  - Statistical feature selection methods
  - Sparsity based feature selection methods
• Evolutionary Computation (EC) based approaches

Why Evolutionary Computation?

• Don't need domain knowledge
• Don't make any assumption
  - e.g. differentiable, linearity, separability, equality
• Easy to handle constraints
• EC can simultaneously build model structures and optimise parameters
• Population based search is particularly suitable for multi-objective optimisation

A.E. Eiben and J. Smith. "From evolutionary computation to the evolution of things", Nature, 521:476-482, 2015.

Jia, Weikuan, et al. "Feature dimensionality reduction: a review." Complex & Intelligent Systems (2022): 1-31.
Li, J., Guo, R., Liu, C., & Liu, H. "Adaptive unsupervised feature selection on attributed networks." In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2019): 92-100.
Tang, Chang, Xinwang Liu, Xinzhong Zhu, Jian Xiong, Miaomiao Li, Jingyuan Xia, Xiangke Wang, and Lizhe Wang. "Feature selective projection with low-rank embedding and dual Laplacian regularization." IEEE Transactions on Knowledge and Data Engineering (2019).
Li, Yun, Tao Li, and Huan Liu. "Recent advances in feature selection and its applications." Knowledge and Information Systems 53.3 (2017): 551-577.
Cheng, Kewei, Jundong Li, and Huan Liu. "FeatureMiner: a tool for interactive feature selection." Proceedings of the 25th ACM International Conference on Information and Knowledge Management. ACM, 2016.
Zhai, Yiteng, Yew-Soon Ong, and Ivor W. Tsang. "The Emerging 'Big Dimensionality'." IEEE Computational Intelligence Magazine 9.3 (2014): 14-26.
Gui, J., Sun, Z., Ji, S., Tao, D., & Tan, T. "Feature selection based on structured sparsity: A comprehensive study." IEEE Transactions on Neural Networks and Learning Systems, 28(7) (2017): 1490-1507.
Zhai, Yiteng, Yew-Soon Ong, and Ivor W. Tsang. "Making trillion correlations feasible in feature grouping and selection." IEEE Transactions on Pattern Analysis and Machine Intelligence 38.12 (2016): 2472-2486.
Guyon, Isabelle, and André Elisseeff. "An introduction to variable and feature selection." Journal of Machine Learning Research 3.Mar (2003): 1157-1182.
Dash, M. and Liu, H. "Feature selection for classification." Intelligent Data Analysis, 1(1-4) (1997): 131-156.
Kohavi, Ron, and George H. John. "Wrappers for feature subset selection." Artificial Intelligence 97.1-2 (1997): 273-324.

EC for Feature Selection

• EC Paradigms
  - Evolutionary Algorithms: GAs, GP, LCSs, ES, etc.
  - Swarm Intelligence: PSO, ACO
  - Others: DE, Memetic, ABC, et al.
• Evaluation
  - Filter Approaches, Wrapper Approaches, Combined Approaches
  - Embedded, Sparsity based, Others
• Number of Objectives
  - Single Objective, Multi-Objective

Fig. 3. Overall categories of EC for feature selection.

Filter evaluation draws on: Information Measure, Correlation Measure, Distance Measure, Consistency Measure, Fuzzy Set Theory, and Rough Set Theory.

Fig. 4. Different measures in EC based filter approaches.

Excerpts from the survey:

Filter approaches are relatively simple and can achieve good performance [35]. Sparse approaches have recently become popular, such as sparse logistic regression for feature selection [36], which has been used for feature selection tasks with millions of features. For example, the sparse logistic regression method [36] automatically assigns a weight to each feature showing its relevance. Irrelevant features are assigned weights close to zero, which has the effect of filtering out these features. Sparse learning based methods tend to learn simple models due to their bias to features with high weights.

… (e.g. 50), in many situations such methods are computationally too expensive to perform. Therefore, different heuristic search techniques have been applied to feature selection, such as greedy search algorithms, where typical examples are sequential forward selection (SFS) [23] and sequential backward selection (SBS) [24]. However, both methods suffer from the so-called "nesting effect" because a feature that is selected or removed cannot be removed or selected in later stages. "Plus-l-take-away-r" [25] compromises these two approaches by applying SFS l times and then SBS r times. This strategy can avoid the nesting effect in principle, but it is hard to determine appropriate values for l and r in practice. To avoid this problem, two methods called sequential backward floating …

Another issue is their stability, since the algorithms often select different features from different runs, which may require a further selection process for real-world users. Further research to address these issues is of great importance, as the increasingly large number of features increases the computational cost and lowers the stability of the algorithms in many real-world tasks.

2) Evaluation criteria: For wrapper feature selection approaches, the classification performance of the selected features is used as the evaluation criterion. Most of the popular classification algorithms, such as decision tree (DT), support vector machines (SVMs), Naïve Bayes (NB), K-nearest neighbour (KNN), artificial neural networks (ANNs), and linear … to avoid confusion with the concept of hybrid algorithms in the EC field, which hybridise multiple EC search techniques. According to the number of objectives, EC based feature selection …

Xue, Bing, Mengjie Zhang, Will N. Browne, and Xin Yao. "A survey on evolutionary computation approaches to feature selection." IEEE Transactions on Evolutionary Computation 20, no. 4 (2016): 606-626.

A General Approach

Population Initialisation (Representation) → Evaluate fitness (Fitness/Objective function) → Stop? — No: Generate new population (Search/updating mechanism); Yes: Return the best solution(s)

GAs for Feature Selection

• Over 25 years ago: among the first EC techniques applied
  - Filter, Wrapper, Single Objective, Multi-objective
• Representation
  - Binary string: 1 0 1 1 0 0 1 0 0 1 1 0 0
• Search mechanisms
  - Genetic operators
• Multi-objective feature selection
• Scalability issue

R. Leardi, R. Boggia, and M. Terrile, "Genetic algorithms as a strategy for feature selection," Journal of Chemometrics, vol. 6, no. 5, pp. 267–281, 1992.
Ishibuchi, Hisao, and Tomoharu Nakashima. "Multi-objective pattern and feature selection by a genetic algorithm." In GECCO, pp. 1069-1076, 2000.
Z. Zhu, Y.-S. Ong, and M. Dash, "Markov blanket-embedded genetic algorithm for gene selection," Pattern Recognition, vol. 40, no. 11, pp. 3236–3248, 2007.
W. Sheng, X. Liu, and M. Fairhurst, "A niching memetic algorithm for simultaneous clustering and feature selection," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 7, pp. 868–879, 2008.
Xue, Bing, Mengjie Zhang, Will N. Browne, and Xin Yao. "A survey on evolutionary computation approaches to feature selection." IEEE Transactions on Evolutionary Computation 20, no. 4 (2016): 606-626.
Tao, Zhou, Lu Huiling, Wang Wenwen, and Yong Xia. "GA-SVM based feature selection and parameter optimization in hospitalization expense modeling." Applied Soft Computing 75 (2019): 323-332.
S. Liu, H. Wang, W. Peng and W. Yao, "A Surrogate-Assisted Evolutionary Feature Selection Algorithm with Parallel Random Grouping for High-Dimensional Classification," IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2022.3149601, 2022.
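A hedged sketch of the binary-string GA described above (not code from any of the cited papers): each individual is a bitmask over the features, fitness is any subset-quality measure (a toy one here), and tournament selection, one-point crossover, and bit-flip mutation evolve the population. All names and parameter values are illustrative assumptions.

```python
import random

def selected_features(mask):
    """Decode a bitmask into the indices of the selected features."""
    return [i for i, bit in enumerate(mask) if bit]

def ga_feature_selection(n_features, fitness, pop_size=20, generations=30,
                         crossover_rate=0.9, mutation_rate=0.05, seed=0):
    """Minimal generational GA over binary feature masks, with elitism."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]

    def tournament():
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [max(pop, key=fitness)[:]]   # elitism: keep the best mask
        while len(children) < pop_size:
            p1, p2 = tournament()[:], tournament()[:]
            if rng.random() < crossover_rate:   # one-point crossover
                cut = rng.randrange(1, n_features)
                p1 = p1[:cut] + p2[cut:]
            for i in range(n_features):         # bit-flip mutation
                if rng.random() < mutation_rate:
                    p1[i] = 1 - p1[i]
            children.append(p1)
        pop = children
    return max(pop, key=fitness)

# toy fitness: reward features 0-2, penalise subset size (wrapper stand-in)
fit = lambda mask: 2 * sum(mask[:3]) - sum(mask)
best = ga_feature_selection(10, fit)
```

Elitism makes the best fitness non-decreasing across generations, which is why even this tiny sketch reliably converges on a small useful subset of the 10 toy features.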

GP for Feature Selection

• Implicit feature selection
  - Filter, Wrapper, Single Objective, Multi-objective
• Embedded feature selection
• Feature construction
• Scalability
• Computationally expensive

Example tree (root: *; children: Feature 1 and +; children of +: Feature 4 and Feature 5), i.e. Feature 1 * (Feature 4 + Feature 5) — the features appearing in the tree are implicitly selected.

M. G. Smith and L. Bull, "Genetic programming with a genetic algorithm for feature construction and selection," Genetic Programming and Evolvable Machines, vol. 6, no. 3, pp. 265–281, 2005.
L. Jung-Yi, K. Hao-Ren, C. Been-Chian, and Y. Wei-Pang, "Classifier design with feature selection and feature extraction using layered genetic programming," Expert Systems with Applications, vol. 34, no. 2, pp. 1384–1393, 2008.
Purohit, N. Chaudhari, and A. Tiwari, "Construction of classifier with feature selection based on genetic programming," in IEEE Congress on Evolutionary Computation (CEC), pp. 1–5, 2010.
B. Tran, B. Xue, M. Zhang. "Class Dependent Multiple Feature Construction Using Genetic Programming for High-Dimensional Data". Proceedings of AI 2017, LNCS Vol. 10400. Springer. Melbourne, Australia, August 19-20, 2017. pp. 182-194.
Tran, Binh, Bing Xue, and Mengjie Zhang. "Genetic programming for multiple-feature construction on high-dimensional classification." Pattern Recognition 93 (2019): 404-417.
Nag, Kaustuv, and Nikhil R. Pal. "Feature extraction and selection for parsimonious classifiers with multiobjective genetic programming." IEEE Transactions on Evolutionary Computation 24, no. 3 (2019): 454-466.
A. Lensen, B. Xue and M. Zhang, "Genetic Programming for Manifold Learning: Preserving Local Topology," IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2021.3106672, 2021.

PSO for Feature Selection

• Very popular in recent years
  - Filter, Wrapper, Single Objective, Multi-objective
• Representation, continuous PSO vs Binary PSO
  0.7 0.12 0.84 0.69 0.25 0.06 0.92 0.45 0.36 0.80 0.67 0.30 0.41
  →
  1 0 1 1 0 0 1 0 0 1 1 0 0
• Search mechanism
• Fitness function

Hafiz, Faizal, et al. "A two-dimensional (2-D) learning framework for Particle Swarm based feature selection." Pattern Recognition 76 (2018): 416-433.
E. K. Tang, P. Suganthan, and X. Yao, "Feature selection for microarray data using least squares SVM and particle swarm optimization," in IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pp. 1–8, 2005.
Binh Tran, Bing Xue and Mengjie Zhang. "A New Representation in PSO for Discretisation-Based Feature Selection", IEEE Transactions on Cybernetics, vol. 48, no. 6, pp. 1733-1746, 2018.
Xue, Y., Xue, B., & Zhang, M. "Self-adaptive particle swarm optimization for large-scale feature selection in classification." ACM Transactions on Knowledge Discovery from Data (TKDD), 13(5), 1-27, 2019.
L. Y. Chuang, H. W. Chang, C. J. Tu, and C. H. Yang, "Improved binary PSO for feature selection using gene expression data," Computational Biology and Chemistry, vol. 32, no. 29, pp. 29–38, 2008.
Tran, Binh, Bing Xue, and Mengjie Zhang. "Adaptive multi-subswarm optimisation for feature selection on high-dimensional classification." In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 481-489, 2019.
C. L. Huang and J. F. Dun, "A distributed PSO-SVM hybrid system with feature selection and parameter optimization," Applied Soft Computing, vol. 8, pp. 1381–1391, 2008.
Bach Hoai Nguyen, Bing Xue, Mengjie Zhang, and Fengyu Zhou. "A survey on swarm intelligence approaches to feature selection in data mining", Swarm and Evolutionary Computation, vol. 54, pp. 100663:1-30, May 2020. (doi.org/10.1016/j.swevo.2020.100663)
B. Xue, M. Zhang, and W. N. Browne, "Multi-objective particle swarm optimisation (PSO) for feature selection," in Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation (GECCO), pp. 81–88, ACM, 2012.
Bing Xue, Mengjie Zhang, Will Browne, Xin Yao. "A Survey on Evolutionary Computation Approaches to Feature Selection", IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2015.2504420, published online on 30 Nov 2015.
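A continuous-encoded particle can be turned into a feature subset by thresholding each position, which is how the two vectors on the slide relate. This sketch is my own; the threshold value 0.6 is an assumption, chosen because it reproduces the slide's example bitstring:

```python
def binarise(position, threshold=0.6):
    """Map a continuous PSO position to a binary feature mask:
    a feature is selected when its position value reaches the threshold."""
    return [1 if x >= threshold else 0 for x in position]

# the continuous position shown on the slide
position = [0.7, 0.12, 0.84, 0.69, 0.25, 0.06, 0.92, 0.45,
            0.36, 0.80, 0.67, 0.30, 0.41]
mask = binarise(position)
```

With threshold 0.6 this yields 1 0 1 1 0 0 1 0 0 1 1 0 0, i.e. 6 of the 13 features are selected; binary PSO instead updates the bits directly via a probabilistic velocity rule.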

ACO for Feature Selection

• Started from around 2003
  - Filter, Wrapper, Single Objective, Multi-objective
• Representation
• Search mechanism
• Filter approaches
  - Wrapper + filter
• Scalability

S. Kashef and H. Nezamabadi-pour, "An advanced ACO algorithm for feature subset selection," Neurocomputing, 2014.
C.-K. Zhang and H. Hu, "Feature selection using the hybrid of ant colony optimization and mutual information for the forecaster," in International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1728–1732, 2005.
L. Ke, Z. Feng, and Z. Ren, "An efficient ant colony optimization approach to attribute reduction in rough set theory," Pattern Recognition Letters, vol. 29, no. 9, pp. 1351–1357, 2008.
Moradi, Parham, and Mehrdad Rostami. "Integration of graph clustering with ant colony optimization for feature selection." Knowledge-Based Systems 84 (2015): 144-161.
Tabakhi, Sina, and Parham Moradi. "Relevance–redundancy feature selection based on ant colony optimization." Pattern Recognition 48, no. 9 (2015): 2798-2811.
Sara, V. J., S. Belina, and K. Kalaiselvi. "Ant colony optimization (ACO) based feature selection and extreme learning machine (ELM) for chronic kidney disease detection." International Journal of Advanced Studies of Scientific Research 4, no. 1 (2019).
Paniri, M., Dowlatshahi, M.B. and Nezamabadi-pour, H. "MLACO: A multi-label feature selection algorithm based on ant colony optimization." Knowledge-Based Systems, 192 (2020): 105285.
Bach Hoai Nguyen, Bing Xue, Mengjie Zhang, and Fengyu Zhou. "A survey on swarm intelligence approaches to feature selection in data mining", Swarm and Evolutionary Computation, vol. 54, pp. 100663:1-30, May 2020.

DE, LCSs, and Memetic

• DE: since 2008
  - potential for large-scale
• LCSs:
  - implicit feature selection
  - embedded feature selection
• Memetic:
  - population search + local search

Hancer, Emrah, Bing Xue, and Mengjie Zhang. "Differential evolution for filter feature selection based on information theory and feature ranking." Knowledge-Based Systems 140 (2018): 103-119.
Z. Li, Z. Shang, B. Qu, and J. Liang, "Feature selection based on manifold-learning with dynamic constraint handling differential evolution," in IEEE Congress on Evolutionary Computation (CEC), pp. 332–337, 2014.
Hoai Bach Nguyen, Bing Xue, Hisao Ishibuchi, Peter Andreae, and Mengjie Zhang. "Multiple Reference Points MOEA/D for Feature Selection". Proceedings of the 2017 Genetic and Evolutionary Computation Conference (GECCO 2017) Companion. ACM Press. Berlin, Germany, 15-19 July 2017. pp. 157-158.
I.-S. Oh, J.-S. Lee, and B.-R. Moon, "Hybrid genetic algorithms for feature selection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1424–1437, 2004.
Z. Zhu, S. Jia, and Z. Ji, "Towards a memetic feature selection paradigm [application notes]," IEEE Computational Intelligence Magazine, vol. 5, no. 2, pp. 41–53, 2010.
Y. Wen and H. Xu, "A cooperative coevolution-based pittsburgh learning classifier system embedded with memetic feature selection," in IEEE Congress on Evolutionary Computation, pp. 2415–2422, 2011.
Zhang, Yong, Dun-wei Gong, Xiao-zhi Gao, Tian Tian, and Xiao-yan Sun. "Binary differential evolution with self-learning for multi-objective feature selection." Information Sciences 507 (2020): 67-85.

Related Areas (Applications)

• Biological and biomedical tasks
  - gene analysis, biomarker detection, cancer classification, and disease diagnosis
• Image and signal processing
  - image analysis, face recognition, human action recognition, EEG brain-computer interface, speaker recognition, handwritten digit recognition, personal identification, and music instrument recognition
• Network/web service
  - web service composition and development, network security, and email spam detection
• Business and financial problems
  - financial crisis, credit card issuing in bank systems, and customer churn prediction
• Others
  - power system optimisation, weed recognition in agriculture, melting point prediction in chemistry, and weather prediction

Xue, Bing, Mengjie Zhang, Will N. Browne, and Xin Yao. "A survey on evolutionary computation approaches to feature selection." IEEE Transactions on Evolutionary Computation 20, no. 4 (2016): 606-626.
Bach Hoai Nguyen, Bing Xue, Mengjie Zhang. "A survey on swarm intelligence approaches to feature selection in data mining", Swarm and Evolutionary Computation, vol. 54, pp. 100663:1-30, May 2020.


A General Approach

[Flowchart: Population Initialisation (Representation) → Evaluate fitness (Fitness/Objective function) → Stop? → No: Generate new population (Search/updating mechanism); Yes: Return the best solution(s)]

Variable-length PSO for FS

• Comprehensive Learning PSO
• Feature Ranking

Binh Tran, Bing Xue and Mengjie Zhang. "Variable-Length Particle Swarm Optimisation for Feature Selection on High-Dimensional Classification." IEEE Transactions on Evolutionary Computation 23 (2019): 473-487.
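The general approach in the flowchart above can be condensed into a minimal sketch. This is an illustrative toy, not any specific published algorithm: the binary-mask representation, the bit-flip updating mechanism, and the `toy` fitness are all assumptions standing in for the components a concrete EC method would supply.

```python
import random

def ec_feature_selection(n_features, fitness, pop_size=30, max_gen=50):
    """Generic EC wrapper loop: initialise -> evaluate -> update -> return best.

    `fitness` maps a binary mask (one bit per original feature) to a score
    to be maximised, e.g. the classification accuracy of the subset.
    """
    # Population initialisation: random binary masks (the representation).
    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)                 # evaluate initial fitness
    for _ in range(max_gen):                     # "Stop?" = generation budget
        # Generate a new population (search/updating mechanism):
        # here, plain per-bit flip mutation of the current individuals.
        pop = [[1 - b if random.random() < 1.0 / n_features else b
                for b in ind] for ind in pop]
        cand = max(pop, key=fitness)             # evaluate fitness
        if fitness(cand) > fitness(best):
            best = cand
    return best                                  # return the best solution

# Hypothetical fitness: reward selecting the first three features,
# penalise the total subset size.
def toy(mask):
    return sum(mask[:3]) - 0.1 * sum(mask)
```

In a real wrapper method the `fitness` call would train and evaluate a classifier on the selected feature subset, which is where almost all the computation time goes.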


PSO for FS: initialisation and updating

• Initialisation:
  - Forward selection
  - Backward selection
  - Mixture of both
• Updating:
  - Consider the number of features in the pbest and gbest updating

[Figure: the evolutionary training process of a PSO based wrapper feature selection algorithm: initialise the position and velocity of each particle; collect the features selected by a particle and transform the training set; evaluate the classification performance; calculate the goodness of the particle according to the fitness function; update pbest and gbest; update the velocity and position of each particle; repeat until termination; return the best solution (the selected features)]

Bing Xue, Mengjie Zhang, Will N. Browne. "Particle Swarm Optimisation for Feature Selection in Classification: Novel Initialisation and Updating Mechanisms." Applied Soft Computing 18 (2014): 261-276.

Initialization in Multi-objective FS

• Segmented initialization mechanism respectively generates three sub-populations whose solutions are randomly located around the "forward", "middle" and "backward" areas
  - areas with a small, medium or large number of selected features respectively.
• Offspring modification:
  - find out all the duplicated solutions
  - modify duplicated solutions to become unique ones
    ‣ by flipping one or two dimensions each (each dimension corresponds to one original feature), according to the analysis of common features in the first nondominated front

Hang Xu, Bing Xue, and Mengjie Zhang. "Segmented Initialization and Offspring Modification in Evolutionary Algorithms for Bi-objective Feature Selection." Proceedings of the 2020 Genetic and Evolutionary Computation Conference (GECCO 2020). ACM Press, Cancun, Mexico, July 8-12, 2020.
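The segmented initialization idea can be sketched as follows. The three-way split of the subset-size range and the equal sub-population sizes are illustrative assumptions for a sketch, not the paper's exact settings.

```python
import random

def segmented_init(pop_size, n_features):
    """Three sub-populations with small / medium / large feature subsets.

    Each individual is a binary mask; its number of selected features is
    drawn from the "forward", "middle" or "backward" segment of
    [1, n_features]. Assumes n_features >= 3 and pop_size divisible by 3.
    """
    third = n_features // 3
    # (low, high) bounds on the subset size per segment; the equal split
    # is an assumption made for illustration.
    segments = [(1, third), (third + 1, 2 * third), (2 * third + 1, n_features)]
    pop = []
    for lo, hi in segments:
        for _ in range(pop_size // 3):
            k = random.randint(lo, hi)           # subset size in this segment
            chosen = set(random.sample(range(n_features), k))
            pop.append([1 if i in chosen else 0 for i in range(n_features)])
    return pop
```

Compared with fully random initialisation, this spreads the initial front across sparse, medium and dense subsets, so the "small number of features" end of the front is represented from the start.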
Initialization in Multi-objective FS

• Mean and standard deviation of IGD and HV on the test data of 18 datasets (listed in Table 2 of the paper, from 18 features, e.g. Climate, up to 7070 features, e.g. Leukemia) show that the SIOM plug-in versions of NSGA-II, MOEA/HD and HypE perform better than their corresponding benchmark MOEAs on most of the test datasets, in terms of either IGD or HV.
• The segmented initialization mechanism and the offspring modification mechanism each contributed positively to the success of the new plug-in MOEAs, while combining them together contributed the most.

Hang Xu, Bing Xue, and Mengjie Zhang. "Segmented Initialization and Offspring Modification in Evolutionary Algorithms for Bi-objective Feature Selection." Proceedings of the 2020 Genetic and Evolutionary Computation Conference (GECCO 2020). ACM Press, Cancun, Mexico, July 8-12, 2020.

Duplication Analysis-Based MOFS

• A duplication analysis method is proposed to filter out duplicated solutions, using the Manhattan distances among solutions in the decision space to estimate their dissimilarity degrees.
• A removal threshold (δ) is related to the number of selected features (t) and the number of total features (D): δ = 0.8 − 0.6 ∗ (t − 1)/(D − 1), so δ varies from 0.2 to 0.8 as t changes from D to 1, requiring a higher dissimilarity for a solution with a smaller number of selected features to be reserved.
  - Duplicated solutions are separated into two groups: remote solutions, with dissimilarity higher than the threshold, which are all reserved; and clustered solutions, with dissimilarity lower than the threshold, of which only one (randomly chosen) is reserved.
• A diversity-based selection method further selects solutions, with a diversity score as the main criterion and the crowding distance as a supplement.
• The reproduction process is modified to improve the quality of offspring: niche mating (parents are randomly selected from the local neighborhood with a proportion of 80%, or from the global population with 20%), k-bit crossover, and hybrid mutation.

H. Xu, B. Xue, and M. Zhang. "A duplication analysis based evolutionary algorithm for bi-objective feature selection." IEEE Transactions on Evolutionary Computation 25, no. 2 (2021).
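The removal threshold and the remote/clustered grouping of the duplication analysis can be sketched as below. The threshold formula is the one on the slide; applying it to a plain list of equal-size binary masks is a simplification, since in DAEA the analysis is run per group of solutions sharing the same objective vector.

```python
def removal_threshold(t, D):
    """delta = 0.8 - 0.6*(t-1)/(D-1), for t selected out of D total features.

    delta shrinks from 0.8 (t = 1) to 0.2 (t = D): the sparser a solution,
    the more dissimilar it must be to survive as a distinct duplicate.
    """
    return 0.8 - 0.6 * (t - 1) / (D - 1)

def split_duplicates(masks):
    """Split equal-size binary masks into remote / clustered index lists.

    Dissimilarity of a mask = its minimum Manhattan distance to the other
    masks, normalised as (dist/2)/t as in the paper's Algorithm 2.
    """
    t = sum(masks[0])                            # selected-feature count
    D = len(masks[0])
    delta = removal_threshold(t, D)
    remote, clustered = [], []
    for i, m in enumerate(masks):
        dist = min(sum(a != b for a, b in zip(m, o))
                   for j, o in enumerate(masks) if j != i)
        diss = (dist / 2) / t                    # normalised dissimilarity
        (remote if diss > delta else clustered).append(i)
    return remote, clustered
```

With t = 4 and D = 8 the threshold is about 0.543, matching the worked example in the paper where a solution with normalised dissimilarity 0.75 is kept as remote and the 0.25 ones fall into the clustered group.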
Multi-objective PSO for FS

• The first multi-objective PSO based approach to feature selection
• Simultaneously minimise the size of the selected feature subset and the classification error rate
• >900 citations since 2013

Bing Xue, Mengjie Zhang, Will Browne. "Particle swarm optimization for feature selection in classification: A multi-objective approach." IEEE Transactions on Cybernetics 43, no. 6 (2013): 1656-1671.
M. R. Sierra and C. A. C. Coello. "Improving PSO-based multi-objective optimization using crowding, mutation and epsilon-dominance." Proc. EMO (2005): 505-519.

PSO FS: with backward elimination

• Introduce and develop:
  - a filter measure based on mutual information, used in backward elimination within wrapper based PSO
  - strategies of elimination to balance between accuracy and efficiency
• Accurate wrapper (global) + fast filter (local)

Bach Hoai Nguyen, Bing Xue, Ivy Liu and Mengjie Zhang. "Filter based Backward Elimination in Wrapper based PSO for Feature Selection in Classification." Proceedings of 2014 IEEE Congress on Evolutionary Computation (CEC 2014). Beijing, China, 6-11 July 2014. IEEE Press, pp. 3111-3118.
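The filter side of backward elimination can be sketched with a plain mutual-information ranking. This relevance-only version, dropping the feature least informative about the class labels, is a simplification for illustration; it is not the exact measure of the cited paper, which also has to balance redundancy between features.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X;Y) in bits, estimated from two paired sequences of discrete values."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def backward_eliminate(features, labels, keep):
    """Repeatedly drop the feature with the lowest mutual information with
    the class labels until `keep` features remain.

    `features` maps a feature name to its list of discrete values (one per
    instance). Returns the sorted names of the surviving features.
    """
    selected = dict(features)
    while len(selected) > keep:
        worst = min(selected,
                    key=lambda f: mutual_information(selected[f], labels))
        del selected[worst]
    return sorted(selected)
```

Because each elimination step only needs counting, this filter pass is far cheaper than re-training a classifier, which is why combining it with a wrapper PSO trades a little accuracy for a large gain in efficiency.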


Multi-objective PSO for FS

• Simultaneously minimise the number of features and the error rate

Bing Xue, Mengjie Zhang, Will Browne. "Particle swarm optimization for feature selection in classification: A multi-objective approach." IEEE Transactions on Cybernetics 43, no. 6 (2013): 1656-1671.
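The two objectives and the Pareto comparison behind this kind of multi-objective formulation can be sketched with the standard definitions; the `error_rate` callable is a placeholder for whatever wrapper evaluation (e.g. KNN cross-validation accuracy) a concrete method uses.

```python
def objectives(mask, error_rate):
    """The two minimised objectives for a binary feature mask:
    f1 = ratio of selected features, f2 = classification error rate."""
    return (sum(mask) / len(mask), error_rate(mask))

def dominates(a, b):
    """Pareto dominance (minimisation): a is no worse than b in every
    objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated(points):
    """Keep the objective vectors not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

The algorithm returns the whole non-dominated front rather than a single subset, leaving the accuracy-versus-size trade-off to the user.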
MOEA/D for Feature Selection

• Multiple Reference Points based Decomposition for Multi-objective FS

[Figures from the papers: Fig. 1: Effect of biased Pareto fronts on MOEA/D ((a) Balance, (b) Towards f1, (c) Towards f2); Fig. 2: Characteristics of feature selection ((a) Weight vectors for FS, (b) Not always conflicting); Fig. 3: Multiple reference points in MOEA/D; dynamic reference points example: fixed points in green, moving points in red, a dashed line showing the interval where the moving points are located]

• Characteristics of feature selection:
  - eRate measures the ratio between the number of wrongly classified instances and the total number of instances m, so the interval between its adjacent values (the granularity) is 1/m; fRatio is the ratio between the number of selected features and the total number of original features n, with granularity 1/n. The Pareto front is therefore highly discontinuous.
  - The effectiveness of a weight vector set in standard MOEA/D depends on the shape of the true Pareto front, which is unknown and typically biased towards f1 or f2: with evenly distributed weight vectors, most solutions are obtained around the center of the Pareto front and only a few around its edges. Adaptively adjusting weight vectors based on the densities of regions can preserve diversity, but requires an additional computation cost.
  - The two objectives are not always conflicting: beyond a threshold feature ratio, all solutions are dominated by the solution at the threshold, so it is more effective to allocate more computational effort to regions below the threshold. However, the threshold is problem dependent and not easy to identify.
• Instead of adaptively adjusting weight vectors, a set of R reference points is allocated on the fRatio axis. A reference point placed at position refRatio represents an idealized solution with an accuracy of 100% (i.e. 0% eRate) using exactly (refRatio ∗ n) features. In the MOEA/D search, there is one individual in the population for each reference point, just as there is one individual for each weight vector.
• The fitness function of each sub-problem combines the error rate with a penalty factor on the number of selected features, which controls the priority between selecting a solution with fewer selected features and one with a lower classification error.
• The R reference points are divided into F fixed and M moving points (R = F + M). The F fixed points are evenly located across the I intervals; the M moving points all start in the first interval and are relocated by the dynamic mechanism after a certain number of iterations, while avoiding overlap between the two types of points.
• A repair mechanism helps to ensure that the decomposition constraints are not violated and the population diversity is preserved.

Hoai Bach Nguyen, Bing Xue, Hisao Ishibuchi, Peter Andreae, and Mengjie Zhang. "Multiple Reference Points MOEA/D for Feature Selection." Proceedings of 2017 Genetic and Evolutionary Computation Conference (GECCO 2017) Companion. ACM Press, Berlin, Germany, 15-19 July 2017, pp. 157-158.
Bach Hoai Nguyen, Bing Xue, Peter Andreae, Hisao Ishibuchi, Mengjie Zhang. "Multiple Reference Points based Decomposition for Multi-objective Feature Selection in Classification: Static and Dynamic Mechanisms." IEEE Transactions on Evolutionary Computation 24, no. 1 (2019): 170-184.
MOEA/D for Feature Selection reducing the number of selected features. Secondly, the two Evolutionary Multimodal Optimisation for FS
of feature selection is highly sion between the maximum
discontinuous. number
If weight of iterations
vectors
vector when theand the number
problem of inter- using weight vectors. axis prior to the
is decomposed
weight vector in t
are used to decompose feature vals, theselection,
moving points thesearevectors have3on
re-allocated
Fig. toshows
the next interval.
a set For example,
of reference points marked by blue dots.
objectives are not always in conflict. Therefore, some parts of make our algori
be carefully selected, otherwise, in Fig. 5.4,therein thewill
first be vectorsUsing
10 iterations, which multiple
the three reference
moving pointspoints, the multi-objective feature
are located points EMO algo
a Pareto front on the conflicting regions are more difficult to
approximate than other parts on the non-conflicting regions. • The
do not goalto is
correspond anyon to thefind
solution firston multiple
the Pareto
interval. In thefrontoptimal
next as10selection
shown
iterations, feature
problem
the moving subsets.
is decomposed
points areintore-a sub-problem for each
reference point. The solution of a sub-problem for a reference C. Reference poi
by the dashed line in allocated Fig. 2a. toFurthermore,
the second intervalalthough
and soboth on. Here, the 10th , 20th ... iterations
In standard MOEA/D, all sub-problems are treated equally, point at refRatio is the feature subset, whose size is at most The previous s
and they usually receive the same amount of computational • Fitness
objectives are in the function:
same
are called Minimize
rangeboundary
[0,1], iterations, classification
they typically
since(n onhave
these iterations error
the moving rate points
ref = bref Ratio ⇤ nc). Such a feature subset will be on can be used to ef
different granularity duearetore-allocated. the difference between 1/m and
the true Pareto front. The search space of each sub-problem is its discontinuous
resource [25]. However, it is shown that some parts of a Pareto
1/n. It has been shown that Thesolving multi-objective
re-allocation problems
smaller until
process is continued than the
the original
algorithm onedetects
since itthat
is limited by the number reference points
front can be more challenging to approximate than others [38].
where the objectives have different granularity usually
the two objectives are potentially not
ofresults
features defined by the corresponding reference point.
conflicting any more. As can be
is to fix location
It is natural to allocate resources (weight vectors) differently to In our proposed decomposition, the search space of a sub- which is called s
in imbalanced Pareto fronts [40].
different sub-problems with respect to their difficulties, which seen in Fig. 5.2b, most solutions inproblem the potentially non-conflicting regions
corresponding to a larger nref covers the search is to dynamically
The relationship between the two objectives in feature
results in better efficiency [39]. A similar question appears in (green) are dominated by a solution in theofconflicting
space a sub-problemregioncorresponding
(red). There- to a lower nref . If detect conflicting
selection makes it an unusually challenging multi-objective
feature selection. Possibly, better Pareto front approximations fore, to determine whether the two theobjectives
two search arespaces are disjointed,
still conflicting in thethe search space of the 1) Static alloc
problem. In feature selection, the classification performance former sub-problem is smaller and the two sub-problems does points are unifor
can be achieved by putting more efforts on the conflicting re-
is usually given a higher priority. For example, if a not feature
share the same solution. However, with the disjointed change during t
gions rather than evenly spending resources on both conflicting
set selects 10% more features than the other feature searchset spaces, a sub-problem probably does not contribute any points, the posit
and non-conflicting regions.
but achieves 10% better accuracy, the first set is definitely to the approximated Pareto front, which can affect the Notice that there
solution
The goal of this paper is to develop a decomposition
preferred. Furthermore, the two objectives are not always indiversity [19]. More importantly, the disjointed search since it defines a
front’s
mechanism which can help MOEA/D to cope with different spaces limit the assistance between different sub-problems, In MOEA/D, n
conflict. Specifically, removing irrelevant or redundant features
characteristics of feature selection such as its unknown and which is an essential property of MOEA/D. Although our which allows th
from a feature set may improve the classification performance, decomposition does not ensure the uniqueness between two candidate solutio
highly discontinuous Pareto front, the complicated relationship
which means that the two objectives are not conflicting neighboringin sub-problems, it satisfies the two above desirable sub-problems wh
between its two objectives. It is expected that the proposed
some regions. However, if all the features in a feature set
properties. problem’s referen
MOEA/D-based feature selection algorithm results in a set of
are relevant and complementary, removing any feature re-
non-dominated solutions with a wide-range number of features
duces the classification performance. So only after removing
and better classification performance.
all irrelevant/redundant features, the two objectives become
conflicting. In other words, there might be a threshold feature
III. PProceedings
Hoai Bach Nguyen, Bing Xue, Hisao Ishibuchi, Peter Andreae, and Mengjie Zhang. "Multiple Reference Points MOEA/D for Feature Selection". ROPOSED ALGORITHMS ratio beyond which the two objectives are mostly harmonious.
of 2017 Genetic and Evolutionary Computation Conference (GECCO 2017) Companion. ACM Press. Berlin, German, 15 - 19 July 2017.pp 157-158
The section starts by listing characteristics of feature se-
Bach Hoai Nguyen, Bing Xue, Peter Andreae, Hisao Ishibuchi, Mengjie Zhang. "Multiple Reference Points based Decomposition for Multi-objective Feature Fig. 2b illustrates the situation, in which each point is the best
lection
Selection in Classification: Static and Dynamic Mechanisms", IEEE Transactions on Evolutionary Computation, vol. 24, no. 1,that
pp.170illustrate
- 184, 2019 difficulties when applying MOEA/D to solution with the corresponding feature ratio. As can be seen

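The static part of the reference-point mechanism can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function names are hypothetical, and the even spacing of fixed points over the whole fRatio axis and of moving points inside the current interval is one simple way to realize "evenly located" points.

```python
import math

def allocate_reference_points(n_fixed, n_moving, n_intervals, current_interval=0):
    """Place reference points on the fRatio axis in [0, 1].

    Fixed points are spread evenly over the whole axis, while moving
    points are spread inside a single interval that advances as the
    search reaches each boundary iteration."""
    fixed = [(i + 1) / (n_fixed + 1) for i in range(n_fixed)]
    width = 1.0 / n_intervals
    start = current_interval * width
    moving = [start + (j + 1) * width / (n_moving + 1) for j in range(n_moving)]
    return fixed, moving

def max_subset_size(ref_ratio, n_features):
    """A reference point at refRatio bounds its sub-problem to at most
    floor(refRatio * n) selected features."""
    return math.floor(ref_ratio * n_features)
```

Moving points advance one interval at each boundary iteration simply by calling `allocate_reference_points` again with an incremented `current_interval`.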
Evolutionary Multimodal Optimisation for FS

• The goal is to find multiple optimal feature subsets.
• Fitness function: minimize the classification error rate.
• Multi-objective (MO): minimize the classification error rate and the number of selected features.

Why Multimodal Optimisation for FS?

• In the Breast Cancer Wisconsin (Original) Data Set, there are 699 instances, 9 features and 2 classes.
• Using subset [F1, F2, F7] or [F2, F3, F7] with KNN can achieve the same 97.81% classification accuracy.
• F1 is 'Clump Thickness' and F3 is 'Uniformity of Cell Shape'. Obviously, the first feature is easier to collect than the third one.

Grid-dominance based Multi-objective FS

Fig. 2: The flowchart of the proposed GF-NSGAII algorithm: Start → Initialize individuals → Get objective values of all individuals based on Eq. (7) → Generate offspring → Combine parents and offspring → Filter feature subsets (repeat until the population size is met) → Sort all individuals based on grid non-dominated sorting → Calculate fitness values of individuals based on Eqs. (3) to (6) → Select individuals to enter the next generation → repeat until the stop condition is met → End.

Algorithm 1: Subset filtration mechanism
  Input: S: a set of feature subsets
  Output: US: a set of diverse feature subsets
  begin
      if all selected features are different between each other in S then
          US ← S
      else
          // S_dup: all duplicated feature subsets in S
          // S_dup = S_dup_1 ∪ S_dup_2 ∪ ... ∪ S_dup_p
          US ← S \ S_dup
          for i = 1 to p do
              // calculate the SCR values of all individuals in S_dup_i based on Eq. (9)
              // index ← index of the maximum SCR
              US ← US ∪ S_dup_i(index)
          end
      end
  end

Fig. 1: Illustration of solutions in a grid environment in a bi-objective space. f1 and f2 are the classification error rate and the number of selected features, respectively. In this example, the number of divisions of the objective space is four (i.e., div = 4). Suppose that these points represent the distribution of solutions for a feature selection task in a certain generation. Solutions/subsets S4 and S5 sit in the same objective point in the space. S9 is a hypothetical solution, which does not exist in this generation. The idea of the proposed mechanism is to find and keep potentially good and different feature subsets during evolution.

G_i(x) and F_i(x) denote the grid coordinate and the actual objective value of individual x in the i-th objective, respectively. lb_i and ub_i represent the lower and upper boundaries of the grid, and the width of a hyperbox in the i-th objective is d_i = (ub_i − lb_i)/div, where div is the number of divisions of the objective space in each dimension (e.g., in Fig. 1, div = 4). Three grid-based indicators are used: GR and GCPD are concerned with the population convergence, while GCD is used to measure the population diversity in the objective space. GR is regarded as the primary criterion and GCD as the secondary one, activated when the GR values of individuals are the same; when the first two criteria fail to discriminate individuals, the third, GCPD, is used to break the tie. For these three indicators, the smaller the better.

    SCR(x) = Σ_{i=1}^{N} CR(i)        (9)

where CR(i) represents the confidence rate on the i-th feature, and f_i is the position entry in the i-th dimension (i ∈ {1, 2, ..., N}, N is the number of the original features).

IV. EXPERIMENTAL DESIGN

A. Baseline EMO Methods
To examine the performance of the proposed mechanisms for feature selection, two variant algorithms of NSGAII [19] are proposed:
• NSGAII with the proposed subset filtration mechanism is named F-NSGAII; the only difference between F-NSGAII and NSGAII is that the subset filtration mechanism is performed in F-NSGAII before the non-dominated sorting.
• The proposed GF-NSGAII algorithm utilizes both the subset filtration mechanism and the grid-dominance.
GF-NSGAII is compared with five algorithms: GDE3, SPEA2, MOEA/D, NSGAII, and F-NSGAII.

TABLE II: The results of HV on the test sets, and TABLE III: The results of IGD on the test sets (mean ± standard deviation of GDE3, SPEA2, MOEA/D, NSGAII, F-NSGAII and GF-NSGAII on the Wine, Zoo, SPECT, WBCD, Ionosphere, Sonar, Movement, Hillvally, Musk1, Multiple, Arrhythmia, SRBCT, Leukemia and DLBCL datasets, with significance markers and average ranks; GF-NSGAII obtains the best average rank on both measures).

Note that F-NSGAII employs the proposed subset filtration mechanism. The results show that using the proposed mechanism increases the population diversity and therefore helps the algorithm to achieve better HV and IGD results. As can be seen in Tables II and III, GF-NSGAII shows the best overall performance among the six algorithms. More specifically, GF-NSGAII achieves significantly better HV results than GDE3 and NSGAII on all the used datasets. However, F-NSGAII achieves better HV and IGD results than the proposed GF-NSGAII algorithm on four datasets, namely Musk1, Multiple, SRBCT, and DLBCL.

B. GF-NSGAII vs Others
Fig. 3 shows that, by using the proposed subset filtration mechanism and grid-dominance, the number of unique feature subsets in GF-NSGAII is slightly less than that of F-NSGAII during the evolutionary process. The reason is that, when using the relaxed form of Pareto-dominance, i.e., the grid-dominance, multiple feature subsets may sit within the same grid in the objective space. As can be seen in Figs. 4 and 5, the feature subsets obtained by GF-NSGAII achieve lower classification error rates than F-NSGAII on the test sets, although the shapes of the fronts of F-NSGAII and GF-NSGAII are similar on the training sets. Take the fronts on the SPECT dataset as an example: GF-NSGAII and F-NSGAII both contain 27 feature subsets in their fronts on the SPECT training set and show a similar front shape; meanwhile, the test error rate of the solutions obtained by F-NSGAII varies in the range [0.222, …].

Peng Wang, Bing Xue, Jing Liang and Mengjie Zhang. "Improved Crowding Distance in Multi-objective Optimization for Feature Selection in Classification". Proceedings of the European Conference on Applications of Evolutionary Computation (EvoApplications 2021). Lecture Notes in Computer Science. Seville, Spain, 7-9 April 2021.
Peng Wang, Bing Xue, Jing Liang and Mengjie Zhang. "A Grid-dominance based Multi-objective Algorithm for Feature Selection in Classification". IEEE Congress on Evolutionary Computation (CEC 2021). Krakow, Poland, 28 June - 1 July 2021, 8pp.
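The grid machinery above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the hyperbox width d_i = (ub_i − lb_i)/div comes from the text, while the floor-based coordinate and the clamping of boundary values are assumptions following the usual grid-based construction.

```python
def grid_coords(objs, lb, ub, div):
    """Grid coordinate per objective: G_i = floor((F_i - lb_i) / d_i),
    with hyperbox width d_i = (ub_i - lb_i) / div. Values on the upper
    boundary are clamped into the last box."""
    out = []
    for f, l, u in zip(objs, lb, ub):
        d = (u - l) / div
        out.append(min(int((f - l) // d), div - 1))
    return tuple(out)

def grid_dominates(a, b, lb, ub, div):
    """Relaxed (grid) dominance: a grid-dominates b when a's grid
    coordinates are no worse in every objective and strictly better in
    at least one; solutions falling in the same hyperbox therefore do
    not dominate each other."""
    ga, gb = grid_coords(a, lb, ub, div), grid_coords(b, lb, ub, div)
    return all(x <= y for x, y in zip(ga, gb)) and ga != gb
```

Because the comparison is made on box indices rather than raw objective values, subsets with slightly different error rates can share a box, which is exactly why GF-NSGAII keeps slightly fewer unique subsets than F-NSGAII.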
Multiobjective DE for Feature Selection

• A new initialization strategy based on the feature relevance
• A dynamic hypervolume contribution indicator
• A dynamic crowding distance measure

P. Wang, B. Xue, J. Liang and M. Zhang, "Multiobjective Differential Evolution for Feature Selection in Classification," in IEEE Transactions on Cybernetics, doi: 10.1109/TCYB.2021.3128540.

Multiobjective DE for Feature Selection

• MOCDE with and without the proposed PIM initialization (figure)

P. Wang, B. Xue, J. Liang and M. Zhang, "Multiobjective Differential Evolution for Feature Selection in Classification," in IEEE Transactions on Cybernetics, doi: 10.1109/TCYB.2021.3128540.

Niching-based Multi-objective FS

• Transforms the task of searching for multiple optimal feature subsets into searching for ψ-quasi equal feature subsets
  - ψ-quasi equal: feature subset S1 is a ψ-quasi equal feature subset to feature subset S2 if f2(S1) = f2(S2) and |f1(S1) − f1(S2)| ≤ ψ, where ψ is a small constant, 0 ≤ ψ < 1

P. Wang, B. Xue, J. Liang and M. Zhang, "Differential Evolution Based Feature Selection: A Niching-based Multi-objective Approach," in IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2022.3168052.
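The ψ-quasi equal relation is simple enough to state directly in code; the sketch below is an illustration with hypothetical argument names, where f1 is the classification error rate and f2 the number of selected features as in the definition.

```python
def is_quasi_equal(err1, size1, err2, size2, psi=0.01):
    """S1 is a psi-quasi equal feature subset to S2 when both select the
    same number of features (f2) and their classification error rates
    (f1) differ by at most psi, with 0 <= psi < 1."""
    if not (0 <= psi < 1):
        raise ValueError("psi must satisfy 0 <= psi < 1")
    return size1 == size2 and abs(err1 - err2) <= psi
```

Grouping ψ-quasi equal subsets is what lets a niching EA keep several near-equally good subsets of the same size instead of collapsing onto one.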
Niching-based Multi-objective FS

• A new mutation operator combining the local information of the niche and the global information of the whole population
• A subset repairing mechanism
  - Situation 1: some common features are included in all the ψ-quasi equal feature subsets
  - Situation 2: no common feature
  - Situation 3: some solutions have common features

P. Wang, B. Xue, J. Liang and M. Zhang, "Differential Evolution Based Feature Selection: A Niching-based Multi-objective Approach," in IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2022.3168052.

Surrogate Training Set / FS Processing

• Original fitness function (f_ori) vs surrogate fitness function (f_sur)
• Surrogate training set
• Surrogate FS/training process; surrogate rate: Is/I

Fig. 1: Overall algorithm. In the first Is iterations, f_sur is used to estimate promising areas; in the remaining I − Is iterations, f_ori is used.

If the surrogate training set is used to build the classifier, the corresponding fitness function is a surrogate fitness function, denoted by fitness_sur. If the whole training set is used to build the classifier, then the fitness function is the original one, called fitness_ori. In terms of PSO's representation, each feature corresponds to a position entry, whose value is either 1 or 0, indicating that the feature is selected or not selected, respectively. The overall proposed algorithm, named SurSammPSO, is described in Fig. 1.

Hoai Bach Nguyen, Bing Xue, and Peter Andreae. "PSO with Surrogate Models for Feature Selection: Static and Dynamic Clustering-based Methods", Memetic Computing, vol. 10, pp. 291-300, 2018.
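The two-phase schedule in Fig. 1 can be sketched as a plain loop. This is an assumption-laden illustration, not SurSammPSO itself: `step` is a hypothetical stand-in for one PSO update of the whole swarm under a given fitness function.

```python
def run_two_phase(swarm, step, f_sur, f_ori, total_iters, surrogate_rate):
    """Two-phase fitness schedule: the first Is = surrogate_rate * I
    iterations evaluate particles with the cheap surrogate fitness to
    locate promising areas; the remaining I - Is iterations use the
    original (full training set) fitness."""
    i_sur = int(round(surrogate_rate * total_iters))
    used_surrogate = []
    for it in range(total_iters):
        fitness = f_sur if it < i_sur else f_ori
        swarm = step(swarm, fitness)           # one hypothetical PSO update
        used_surrogate.append(fitness is f_sur)
    return swarm, used_surrogate
```

The surrogate rate Is/I directly controls how much of the run is spent on cheap evaluations before switching to the expensive original fitness.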
Surrogate-Assisted Evolutionary FS

1. Initialize population P based on sampled solutions.
2. Decompose a high-dimensional FS problem into several low-dimensional subproblems.
3. For each subproblem, construct a surrogate model and use P to initialize the subpopulation.
4. Optimize all subproblems in parallel using SAEAs.
5. Merge the new subpopulations into the new population P.
6. Stop if the termination criterion is satisfied; otherwise go to Step 2 for the next cycle.

S. Liu, H. Wang, W. Peng and W. Yao, "A Surrogate-Assisted Evolutionary Feature Selection Algorithm with Parallel Random Grouping for High-Dimensional Classification," in IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2022.3149601.
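Step 2's random grouping of features into low-dimensional subproblems can be sketched as below; this is an illustrative partitioning only (the strided split after shuffling is an assumption), not the cited algorithm's code.

```python
import random

def random_grouping(n_features, n_groups, seed=None):
    """Randomly partition the feature indices into n_groups
    low-dimensional sub-problems; drawing a fresh partition on every
    cycle makes the grouping change between cycles."""
    rng = random.Random(seed)
    idx = list(range(n_features))
    rng.shuffle(idx)
    return [sorted(idx[g::n_groups]) for g in range(n_groups)]
```

Each group can then be handed to its own surrogate-assisted optimizer in parallel, and the groups are redrawn at the start of the next cycle.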

Correlation-Guided and Surrogate-Assisted PSO

• A surrogate model is established based on the surrogate training set: the inputs and outputs are the particles' positions and their corresponding real fitness values in the previous iteration
• Calculate the Euclidean distance between the target particle and each particle in the surrogate training set, then rank in ascending order
• The top k particles in the sorted surrogate training set are selected, and their average fitness value is the surrogate fitness of the predicted particle

K. Chen, B. Xue, M. Zhang and F. Zhou, "Correlation-Guided Updating Strategy for Feature Selection in Classification with Surrogate-Assisted Particle Swarm Optimisation," in IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2021.3134804.
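The three bullets amount to a k-nearest-neighbour regression over the previous iteration's evaluations. A minimal sketch (requiring Python 3.8+ for `math.dist`; the data layout of (position, fitness) pairs is an assumption):

```python
import math

def surrogate_fitness(target, training_set, k=3):
    """Estimate a particle's fitness as the mean real fitness of its k
    nearest neighbours (Euclidean distance on positions) in the
    surrogate training set, which stores (position, real_fitness) pairs
    from the previous iteration."""
    ranked = sorted(training_set, key=lambda pair: math.dist(target, pair[0]))
    top_k = ranked[:k]
    return sum(fit for _, fit in top_k) / len(top_k)
```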

Information Theory Feature Selection

    Fitness = Rel − Red

    Rel = Σ_{x_i ∈ X} I(x_i ; c)

    Red = Σ_{x_i, x_j ∈ X} I(x_i ; x_j)

where X is the selected feature subset, x_i and x_j are single features in X, c is the class labels, Rel is the relevance between X and c, and Red is the redundancy in X.

• Relevance:
  - Classification performance
  - The relevance (mutual information) between each selected feature and the class labels
• Redundancy:
  - Number of features
  - The relevance (mutual information) between the selected features

Peng, Hanchuan, Fuhui Long, and Chris Ding. "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy." IEEE Transactions on Pattern Analysis and Machine Intelligence 27.8 (2005): 1226-1238.
Liam Cervante, Bing Xue and Mengjie Zhang. "Binary Particle Swarm Optimisation for Feature Selection: A Filter Based Approach". Proceedings of 2012 IEEE World Congress on Computational Intelligence / IEEE Congress on Evolutionary Computation (WCCI/CEC 2012). Brisbane, Australia. 10-17 June 2012. IEEE Press. pp. 881-888.
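The Rel and Red terms can be computed directly from empirical counts for discrete features. The sketch below is an illustration rather than the cited implementations; in particular, summing redundancy over unordered pairs (i < j) is an assumption about how the double sum is counted.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X; Y) in bits for two discrete variables given as parallel lists."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def filter_fitness(features, labels, subset):
    """Rel - Red: relevance of each selected feature to the class labels
    minus the pairwise redundancy inside the subset (to be maximized)."""
    rel = sum(mutual_information(features[i], labels) for i in subset)
    red = sum(mutual_information(features[i], features[j])
              for i in subset for j in subset if i < j)
    return rel - red
```

A duplicated feature adds as much redundancy as it adds relevance, so the fitness does not reward selecting it twice.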

GECCO, Denver, Colorado, USA. 20-24 July 2016

Information Theory Feature Selection
• MRFS based on mutual information
  - Relief
  - Fisher Score

Feature Subset Selection based on Ranking

• Information theory in evolutionary feature selection
  - Fast algorithm: mutual information
  - New measures, evaluate multiple features
  - Evolutionary multi-objective filter feature selection

Journal on Artificial Intelligence Tools. Vol. 22, Issue 04, August 2013. pp. 1350024 (1-31). DOI: 10.1142/S0218213013500243.
Bing Xue, Mengjie Zhang and Will Browne. "A Comprehensive Comparison on Feature Selection Approaches to Classification". International Journal of Computational Intelligence and Applications (IJCIA). Vol. 14, No. 2. 2015. pp. 1550008 (1-23).
Hancer, Emrah, Bing Xue, and Mengjie Zhang. "Differential evolution for filter feature selection based on information theory and feature ranking." Knowledge-Based Systems 140 (2018): 103-119.

Feature Selection Through Data Discretisation

While wrapper methods can produce solutions with high accuracy, filter methods are usually faster and more general. Combining the strengths of these two approaches in the evaluation function is expected to produce better solutions. In addition, combining the two measures can better distinguish the subtle differences between feature subsets, providing a smoother fitness landscape to facilitate the search process. However, simply combining these measures may be impractical in terms of computation time, so a smart way is needed to combine them without requiring more running time. Among the commonly used filter measures, distance is a multivariate measure that can evaluate the discriminating ability of a feature set, and it is used as the base measure of KNN. Incorporating this measure with the KNN wrapper method will not increase the computation time much, because the distance measure is calculated only once but used twice. Therefore, in the fitness function of PPSO, balanced KNN classification accuracy is combined with a distance measure using a weighting coefficient (γ):

    fitness = γ · balanced_accuracy + (1 − γ) · distance                        (4.3)

    distance = 1 / (1 + exp(−5 (Db − Dw)))                                      (4.4)

    Db = (1/M) Σ_{i=1}^{M} min_{j | j≠i, class(Ii)≠class(Ij)} Dis(Ii, Ij)       (4.5)

    Dw = (1/M) Σ_{i=1}^{M} max_{j | j≠i, class(Ii)=class(Ij)} Dis(Ii, Ij)       (4.6)

Db is the average distance between each instance and the nearest instance of the other classes, and Dw is the average distance between each instance and the farthest instance of the same class. Dis(Ii, Ij) is the overlapping measure between two nominal vectors, i.e., the number of matching positions divided by the size of the vectors, as in Chapter 3; KNN also works based on this overlapping distance.

Bare-Bone Particle Swarm Optimisation

• Utilising information from MDLP
• Two-stage (PSO-FS) vs one-stage (PSO-DFS)
• Hybrid fitness function

The search space is very large, especially in multivariate discretisation on high-dimensional data. Therefore, PPSO uses BBPSO to choose a cut-point from the potential cut-points of each feature, narrowing the search space into potentially promising areas.

Particle representation: potential cut-points are entropy-based cut-points whose information gain satisfies the MDLP criterion. Each feature may have a different number of potential cut-points, which are calculated once and stored in a potential cut-point table. Figure 4.4 shows an example of this table together with a particle's position and the corresponding candidate solution:

    Potential cut-point table (#C = number of cut-points)
    Feature  #C   C1      C2      C3     C4
    F1       2    5.25    6.8
    F2       3    10.7    24.9    50.2
    F3       4    -0.45   0.67    5.2    20.5
    F4       1    -32.5
    ...
    FN       3    -3.72   -1.54   6.55

    Particle's position:   F1   F2   F3   F4   ...  FN
                            2    0    3    0   ...   2
    Candidate solution:    6.8       5.2        ... -1.54

    Figure 4.4: Particle representation of PPSO.

Each particle's position is an integer vector denoting the chosen cut-point indexes. The size of the vector is therefore equal to the number of original features, and the evolved value needs to be between 1 and the number of potential cut-points of the corresponding feature. For example, in Figure 4.4, the first feature (F1) has two potential cut-points, with indexes 1 and 2, so the first dimension of the particle needs to fall in the range [1, 2]; since it is 2 in the example, the cut-point 6.8 is chosen to discretise F1. During the updating process, if the updated value of a dimension is outside the cut-point index range, it is set to 0, which indicates that the corresponding feature does not have a good cut-point and therefore should be ignored, i.e., not selected.

Binh Tran, Bing Xue and Mengjie Zhang. "A New Representation in PSO for Discretisation-Based Feature Selection", IEEE Transactions on Cybernetics, vol. 48, no. 6, pp. 1733-1746, 2018.
Yu Zhou, Junhao Kang, Sam Kwong, Xu Wang, Qingfu Zhang. "An evolutionary multi-objective optimization framework of discretization-based feature selection for classification." Swarm and Evolutionary Computation 60 (2021): 100770.
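The hybrid fitness of Eqs. (4.3)-(4.6) can be sketched as follows. This is an illustration under an assumption: Dis is implemented here as the fraction of differing positions, so that larger values mean farther apart, which is what Db (nearest different-class instance) and Dw (farthest same-class instance) require.

```python
from math import exp

def overlap_distance(a, b):
    """Dis between two nominal vectors, taken here as the fraction of
    positions where they differ (assumed reading of the overlap measure)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_term(instances, classes):
    """Eqs. (4.4)-(4.6): Db averages each instance's distance to its
    nearest different-class instance, Dw its distance to its farthest
    same-class instance; distance = 1 / (1 + exp(-5 (Db - Dw)))."""
    m = len(instances)
    db = sum(min(overlap_distance(instances[i], instances[j])
                 for j in range(m) if classes[j] != classes[i])
             for i in range(m)) / m
    dw = sum(max(overlap_distance(instances[i], instances[j])
                 for j in range(m) if j != i and classes[j] == classes[i])
             for i in range(m)) / m
    return 1.0 / (1.0 + exp(-5.0 * (db - dw)))

def ppso_fitness(balanced_accuracy, distance, gamma):
    """Eq. (4.3): weighted combination of the wrapper and filter measures."""
    return gamma * balanced_accuracy + (1 - gamma) * distance
```

Well-separated classes push Db − Dw up and the sigmoid toward 1, rewarding discretisations that keep classes apart.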

GP for Embedded Feature Selection GP for Embedded Feature Selection


July 2016 July 2016

example, in Figure 4.4, the first feature (F1 ) has two potential cut-points
with index 1 and 2. Therefore, the first dimension of the particle needs to
fall in the range [1,2]. If it is 2 as in the example, then the cut-point 6.8 is
Feature Selection Classification Feature Selection
chosen to discretise F1 . Classification
During the updating process, if the updated value of a dimension is
<0 >= outside the cut-point index range, then it is set to < 0 indicates that>=
0, which the
Class + Class - Class +
corresponding+feature does not have a good cut-point Class
and therefore should -
- - -
be ignored, not selected.
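The decoding described above can be sketched as follows. This is an illustrative reconstruction under assumed details, not the authors' implementation; the cut-point table values are made up for the example.

```python
# A minimal sketch of decoding a particle in the cut-point-index
# representation described above (names are illustrative assumptions).
def decode_particle(position, cut_point_table):
    """position[d] holds a (rounded) cut-point index for feature d.

    A value in [1, number of cut-points of feature d] selects that
    cut-point; any value outside the range is reset to 0, meaning the
    feature has no good cut-point and is not selected."""
    chosen = {}
    for d, idx in enumerate(position):
        n_cuts = len(cut_point_table[d])
        if idx < 1 or idx > n_cuts:   # outside the cut-point index range
            position[d] = 0           # feature d is ignored (not selected)
        else:
            chosen[d] = cut_point_table[d][idx - 1]
    return chosen

# Example in the spirit of Figure 4.4: the first feature has two
# potential cut-points, and index 2 picks the cut-point 6.8.
table = [[3.1, 6.8], [2.0], [0.5, 1.5, 2.5]]
print(decode_particle([2, 5, 1], table))   # {0: 6.8, 2: 0.5}
```

The second dimension (index 5, but only one cut-point exists) is reset to 0, so that feature is dropped from the discretised subset.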

GP for Embedded Feature Selection

[Figure: an example GP tree (operators −, *, + and %, i.e. protected division) over the features F1, F3, F12 and F40, illustrating embedded feature selection and classification, with instances separated into Class + and Class −.]

Soha Ahmed, Mengjie Zhang, Lifeng Peng and Bing Xue. "Genetic Programming for Measuring Peptide Detectability". Proceedings of the 10th International Conference on Simulated Evolution and Learning (SEAL 2014). Lecture Notes in Computer Science, Vol. 8886. Dunedin, New Zealand. December 15-18, 2014. pp. 593-604.

GECCO, Denver, Colorado, USA. 20-24 July 2016

GP for Embedded Feature Selection

• Existing feature selection metrics have some biases in class imbalance problems
• Each terminal node (leaf) consists of a ("basic") feature selection metric, which returns a set of features considered highly discriminative by that metric
• The set operations may be union (∪), intersection (∩), set difference (\), and so on

Viegas, Felipe, Leonardo Rocha, Marcos Gonçalves, Fernando Mourão, Giovanni Sá, Thiago Salles, Guilherme Andrade, and Isac Sandin. "A genetic programming approach for feature selection in highly dimensional skewed data." Neurocomputing 273 (2018): 554-569.
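The set-based representation above can be illustrated with a small sketch. The tree encoding and the example feature sets below are assumptions for illustration, not taken from the paper.

```python
# A hedged sketch of evaluating a GP tree whose leaves are feature sets
# returned by basic selection metrics and whose internal nodes are set
# operations, as described in the bullets above.
UNION, INTERSECT, DIFF = "union", "intersect", "diff"

def eval_tree(node):
    """node is either a frozenset of feature indices (a leaf) or a
    (set_operation, left_subtree, right_subtree) tuple."""
    if isinstance(node, frozenset):
        return node
    op, left, right = node
    a, b = eval_tree(left), eval_tree(right)
    if op == UNION:
        return a | b
    if op == INTERSECT:
        return a & b
    return a - b  # set difference

# Leaves: features ranked highly by (hypothetical) basic metrics.
info_gain = frozenset({1, 3, 12, 40})
chi2      = frozenset({3, 7, 12})
odds      = frozenset({3, 5})

tree = (DIFF, (INTERSECT, info_gain, chi2), odds)
print(sorted(eval_tree(tree)))   # [12]
```

Evolving such trees lets GP combine the (possibly biased) individual metrics into a composite selector, which is the idea behind the Viegas et al. approach.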

Feature Selection for Symbolic Regression

Permutation feature/variable importance in random forests (RF) is a widely used score to measure the importance of features [53]. Permuting a feature refers to randomly rearranging the values of the feature within the dataset; for example, the values {4, 7, 9} of a feature might become {9, 4, 7} after permutation.

Figure 3.2 presents the main process of GPPI. While the left part of the figure shows the process of calculating permutation feature importance in one GP run, the right part describes the process of obtaining the permutation importance of one feature. The whole process is defined as:

1. Randomly split the training data into a sub-training set and a sub-test set.
2. Carry out a standard GP run and get the best-of-run individual Ib, which has the lowest training error over the sub-training set.
3. Compute the generalisation error of Ib over the sub-test set, referred to as Errorg(Ib).
4. For each distinct feature Xj in Ib, permute its values within the sub-test set, and get the test error of Ib on the permuted sub-test set, shown as Errpmt(Ib).
5. Calculate the distance between Errorg(Ib) and Errpmt(Ib), and use it to measure the raw feature importance of the feature, FIraw(Xj), i.e.

    FIraw(Xj) = Errpmt(Ib) − Errorg(Ib)    (3.3)

Steps 4 and 5 need to be performed for each distinct feature in the best-of-run individual Ib. Note that the importance values of features absent from Ib are defined to be 0. The whole process repeats n (n ≥ 30) times on the best-of-run individuals collected from n independent GP runs on the given training data; the condition n ≥ 30 is to reduce the bias of the random seeds used in the GP runs on the importance values. The pseudo-code of this procedure is shown in Algorithm 1.

Figure 3.2: GP for Permutation Importance (GPPI).

Qi Chen, Mengjie Zhang, and Bing Xue. "Feature Selection to Improve Generalisation of Genetic Programming for High-Dimensional Symbolic Regression", IEEE Transactions on Evolutionary Computation, vol. 21, no. 5, pp. 792-806, 2017.
Dick, G., Rimoni, A. P., and Whigham, P. "A re-examination of the use of genetic programming on the oral bioavailability problem." In Proceedings of the 17th Annual Genetic and Evolutionary Computation Conference (GECCO 2015), ACM, pp. 1015–1022.
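Steps 1–5 above can be sketched as follows, with any fitted model exposing a `predict` method standing in for the best-of-run GP individual Ib. This is a hedged illustration, not the thesis code.

```python
import random

def rmse(model, X, y):
    """Root mean squared error of the model over a dataset."""
    preds = [model.predict(row) for row in X]
    return (sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)) ** 0.5

def raw_importance(model, X_subtest, y_subtest, j, rng):
    """FI_raw(Xj) = Err_pmt(Ib) - Error_g(Ib), as in Eq. (3.3)."""
    error_g = rmse(model, X_subtest, y_subtest)      # step 3
    permuted = [row[:] for row in X_subtest]         # copy the sub-test set
    col = [row[j] for row in permuted]
    rng.shuffle(col)                                 # step 4: permute only Xj
    for row, v in zip(permuted, col):
        row[j] = v
    err_pmt = rmse(model, permuted, y_subtest)
    return err_pmt - error_g                         # step 5
```

A feature the model never uses gets importance 0, since permuting it cannot change the predictions; in GPPI this is repeated over n ≥ 30 independent GP runs and features absent from Ib are scored 0.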
Feature Selection for Symbolic Regression

Table 3.6: Comparisons between LASSO, Random Forest (RF), GP, and GP-GPPI.

Benchmark  Method    Training NRMSE (Median±MAD)   Test NRMSE (Median±MAD)   Significance (with GP-GPPI) (training, test)
F1         LASSO     0.17                          0.22                      (−, −)
F1         RF        0.055±0.0013                  0.16±0.0017               (−, −)
F1         GP        0.012±0.016                   0.095±0.03                (+, −)
F1         GP-GPPI   0.037±0.043                   0.049±0.064
F2         LASSO     0.11                          0.09                      (−, −)
F2         RF        0.040±4.20E-4                 0.078±5.61E-4             (−, −)
F2         GP        0.002±2.97E-3                 0.005±4.45E-3             (=, =)
F2         GP-GPPI   0.005±4.45E-3                 0.004±2.97E-3
LD50       LASSO     0.04                          0.68                      (+, −)
LD50       RF        0.097±7.61E-4                 0.23±0.0013               (+, −)
LD50       GP        0.19±0.009                    0.25±0.026                (+, −)
LD50       GP-GPPI   0.21±4.45E-3                  0.21±4.45E-3
DLBCL      LASSO     0.18                          0.22                      (−, −)
DLBCL      RF        0.058±7.77E-4                 0.13±0.0014               (+, −)
DLBCL      GP        0.088±0.012                   0.182±0.032               (−, −)
DLBCL      GP-GPPI   0.081±0.012                   0.11±0.019
CCUN       LASSO     0.13                          0.15                      (−, −)
CCUN       RF        0.030±1.18E-4                 0.098±2.25E-4             (+, =)
CCUN       GP        0.073±1.48E-3                 0.099±2.22E-3             (+, =)
CCUN       GP-GPPI   0.076±1.48E-3                 0.097±2.97E-3
CCN        LASSO     0.21                          0.23                      (−, −)
CCN        RF        0.054±1.77E-4                 0.141±3.44E-4             (+, −)
CCN        GP        0.133±2.97E-3                 0.143±2.97E-3             (+, −)
CCN        GP-GPPI   0.139±2.22E-3                 0.139±2.97E-3

[Figure 3.5: Distribution of the Corresponding Test NRMSEs — box plots of the test NRMSE of GP, GP-C50, GP-RF and GP-GPPI on F1, F2, LD50, DLBCL, CCUN and CCN.]

Better feature selection methods can shrink the search space of GP to one that is much smaller but more effective, since they discard more irrelevant features while keeping the important ones, so the evolutionary process is more likely to be guided towards better models and less effort is needed for GP to converge to (near-)optimal models. This also explains why the difference in NRMSEs between GP-GPPI and the other three methods increases over generations, and why GP-GPPI has a distinguished advantage over the other methods.
Results on the Test Sets — Generalisation Ability

Figure 3.5 shows the distribution of the generalisation errors of the 100 best-of-run individuals. The overall trend is similar to that on the training sets, i.e. GP-GPPI outperforms the other methods, which is confirmed by the Wilcoxon test. While on DLBCL GP-GPPI has a notable generalisation gain over the other three methods, on the other tasks GP-GPPI still outperforms the other methods. On LD50 and DLBCL, while GP-RF has slightly but not significantly better generalisation performance than GP, GP-C5.0 has significantly higher test NRMSEs than GP on LD50 and a slightly better generalisation gain than GP on DLBCL. On CCUN and CCN, GP-RF can not improve the generalisation performance of GP to a significant level; GP-C5.0 achieves a significant generalisation gain on CCN.

Examples of the evolved models are shown in Table 3.5, together with their mathematically simplified forms for easier analysis. Regarding the relevant features in the target function F1 = g∗X1∗X2/X3² (g = 6.67408E−11), it can be observed that GP-GPPI includes all the important features and uses only the relevant features to construct the models, whereas the other three GPSR methods and standard GP either include some noisy features or can not include all the relevant features.

In summary, GPPI can enhance the generalisation of GP more effectively because it can discard more noisy/irrelevant features than the other feature selection methods (feature selection results are presented in more detail in Section 3.5), so that GP is more likely to construct models using the relevant features.

[Figure 3.6: The Testing Error Evolution Plots — testing NRMSE versus generation for GP, GP-C50, GP-RF and GP-GPPI on F1, F2, LD50, DLBCL, CCUN and CCN.]

Qi Chen, Mengjie Zhang, and Bing Xue. "Feature Selection to Improve Generalisation of Genetic Programming for High-Dimensional Symbolic Regression", IEEE Transactions on Evolutionary Computation, vol. 21, no. 5, pp. 792-806, 2017.

Multi-tree GP for FS: Wrapper and Embedded

Q. Ul Ain, H. Al-Sahaf, B. Xue, and M. Zhang. "A multi-tree genetic programming representation for melanoma detection using local and global features." In Proc. 31st Australasian Joint Conference on Artificial Intelligence, 2018, pp. 111–123.
Qurrat Ul Ain, Harith Al-Sahaf, Bing Xue, Mengjie Zhang. "Generating Knowledge-guided Discriminative Features Using Genetic Programming for Melanoma Detection." IEEE Transactions on Emerging Topics in Computational Intelligence, 2020 (DOI: 10.1109/TETCI.2020.2983426).

Evolutionary Multitasking-Based Feature Selection Method
• ReliefF:
  - Feature ranking/weighting
  - Based on the h nearest neighbours of the same category as the selected sample R (the hits Hj)
  - Find the h nearest neighbours from a different category (Ri)

In the variable-range strategy, features with low weights are limited to a smaller search space, that is, they are less likely to be selected: their search ranges are set to [0, δ], where δ denotes the upper bound value for the dimensions corresponding to these features. If a feature's weight lies between the weight of the knee point and 0, the search range of the corresponding dimension is linearly reduced from [0, 1] to [0, δ] based on the feature's weight. (Fig. 2 gives an example of selecting promising features based on the knee point.)

2) Subset Updating Mechanism: To help PSO escape from potential local optima, a promising subset updating mechanism is designed in Task 1. During the evolutionary process, if gbest does not change within the given number of iterations (m), the feature set in Task 1 is modified: some features are randomly selected from the unselected features of the first step to replace the same number of features in Task 1 (i.e., the size is kept unchanged). Equation (8) determines the number of features updated; since the promising feature subset may be updated during the search process, the combination of potentially complementary features in Task 1 will likely be enhanced to some extent:

    numChange = ρ ∗ numSelect    (8)

where numChange and numSelect represent the number of features updated and the number of features in Task 1, respectively, and the scale factor ρ controls the number of updates.

To handle unbalanced data in the FS process, a balanced accuracy [42] is used to determine the accuracy of the learning algorithm. Equation (7) shows the classification error rate:

    γR(D) = 1 − (1/c) ∗ Σ_{i=1}^{c} TPRi    (7)

where c denotes the number of classes in the classification problem and TPRi represents the proportion of correctly identified instances in class i. Since the balanced accuracy has no bias towards any class, the weight for each class is set to 1/c.

3) Final Selected Features: The final goal is to obtain a feature subset that achieves high classification accuracy in high-dimensional classification problems. However, two feature subsets are produced when applying evolutionary multitasking in step 2, one from Task 1 and one from Task 2; the features selected by Task 1 are returned as the final feature subset.

A multipopulation framework is adopted to transfer knowledge between Task 1 and Task 2. During the search process, a skill factor is used to allocate the particles to each FS task: a particle whose skill factor value is 1 is assigned to solve Task 1; otherwise it is used to address Task 2. Vertical cultural transmission is another important operator in PSO-EMT, which can change the value of the skill factor to transfer solutions between Task 1 and Task 2.

K. Chen, B. Xue, M. Zhang and F. Zhou. "An Evolutionary Multitasking-Based Feature Selection Method for High-Dimensional Classification." IEEE Transactions on Cybernetics, doi: 10.1109/TCYB.2020.3042243.
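Eq. (7) can be sketched directly. This is an illustrative reimplementation; the function and variable names are assumptions, not from the paper.

```python
def balanced_error_rate(y_true, y_pred):
    """Eq. (7): gamma_R(D) = 1 - (1/c) * sum_i TPR_i, each class weighted 1/c."""
    classes = sorted(set(y_true))
    c = len(classes)
    tpr_sum = 0.0
    for cls in classes:
        members = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in members if y_pred[i] == cls)
        tpr_sum += correct / len(members)     # TPR_i of class i
    return 1.0 - tpr_sum / c

# Predicting only the majority class looks accurate (75% plain accuracy)
# but has a balanced error of 0.5, because the minority class is missed.
print(balanced_error_rate([0, 0, 0, 1], [0, 0, 0, 0]))   # 0.5
```

Weighting every class by 1/c is what makes this error rate suitable for the unbalanced high-dimensional datasets targeted by PSO-EMT.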
C. Further Improvement Mechanisms

The way of transferring knowledge plays an important role: evolutionary multitasking not only evolves each subpopulation (task) but also provides assistance in finding better feature subsets. During the evolutionary process, however, PSO often suffers from premature convergence and easily falls into local optima. Two mechanisms are proposed to overcome these deficiencies: 1) a variable-range strategy and 2) a feature subset updating mechanism. A high ρ value provided by the user may result in a large number of Task 1 features being replaced at each update.

A random mating probability (rmp) is defined to control the transfer of knowledge from the other task: at each generation, if a random number rand is less than or equal to rmp, (4) is adopted to update a particle's velocity; otherwise, (2) is applied.

Note that FCBF is an effective filter FS method for high-dimensional data; however, it is prone to being trapped in local optima since it uses heuristic search strategies (i.e., sequential forward or backward selection). PSO-EMT in the second stage can effectively overcome this issue and achieve better subsets due to its strong global searchability.
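The two improvement mechanisms named above can be sketched as follows. The value of δ, the linear interpolation form, and all names are assumptions for illustration; the published equations should be consulted for the exact details.

```python
import random

def range_upper_bound(weight, knee_weight, delta=0.2):
    """Variable-range strategy: per-dimension upper bound of the search
    range, derived from the feature's weight and the knee-point weight."""
    if weight >= knee_weight:      # important feature: full range [0, 1]
        return 1.0
    if weight < 0:                 # likely irrelevant: shrunken range [0, delta]
        return delta
    # between 0 and the knee weight: linearly reduce [0, 1] -> [0, delta]
    return delta + (1.0 - delta) * (weight / knee_weight)

def update_subset(task1_features, all_features, rho, rng=random):
    """Subset updating mechanism: swap numChange = rho * numSelect features
    of Task 1 for randomly chosen unselected ones (Eq. (8)), keeping the
    subset size unchanged."""
    task1 = list(task1_features)
    unselected = [f for f in all_features if f not in task1]
    num_change = int(rho * len(task1))                # Eq. (8)
    positions = rng.sample(range(len(task1)), num_change)
    for pos, feat in zip(positions, rng.sample(unselected, num_change)):
        task1[pos] = feat
    return task1
```

The first function shrinks the search range of weakly weighted features so they are less likely to pass the selection threshold; the second re-injects diversity into Task 1 when gbest stagnates, without changing the subset size.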

Evolutionary Multitasking-Based Feature Selection Method

1) Variable-Range Strategy: When implementing PSO for FS, a particle represents a potential feature subset, and each position value lies within a fixed range (i.e., [0, 1]) which indicates, via a user-defined threshold (e.g., 0.6), whether the corresponding feature should be reserved or not. Since the importance of each feature in the dataset is different, using a fixed range for every feature makes it difficult to find a good feature subset due to the huge search space. A variable-range strategy is therefore proposed, which aims to reduce the search space by adapting the search range of each feature, so that features with higher (lower) importance have a higher (lower) probability of being selected. The search range of each dimension is determined based on two demarcation values (the weight of the knee point, described in Section III-A, and 0): if a feature's weight is greater than that of the knee point, the search range of the corresponding dimension is set to [0, 1].

2) Fitness Function: The k-nearest neighbour (KNN) algorithm [32] is adopted as an evaluator to assess a selected subset. Equation (6) gives the fitness function, which considers both the classification error rate and the number of selected features:

    fitness = α ∗ γR(D) + (1 − α) ∗ |S|/|N|    (6)
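Eq. (6) itself is a one-liner; the contrast between the two α settings can be seen in a small sketch (the error rate and subset sizes below are made-up numbers, not results from the paper).

```python
def fitness(error_rate, n_selected, n_total, alpha):
    """Eq. (6): fitness = alpha * gamma_R(D) + (1 - alpha) * |S|/|N|."""
    return alpha * error_rate + (1 - alpha) * n_selected / n_total

# Task 1 (alpha = 0.999999): the error term dominates, so subset size
# barely moves the fitness. Task 2 (alpha = 0.9): subset size matters,
# which speeds up the removal of redundant features.
print(fitness(0.10, 50, 2000, 0.999999))
print(fitness(0.10, 50, 2000, 0.9))
```

With α close to 1 the accuracy term always dominates; with α = 0.9 two subsets of equal accuracy are clearly separated by their size term.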
where γR(D) denotes the classification error rate of the learning algorithm, |S| represents the number of selected features, |N| indicates the total number of available features in the dataset, and α ∈ [0, 1] reflects the relative roles of the classification error rate and the number of selected features. Task 1 selects features from a subset of the available features and Task 2 from the entire feature set, so the number of features in Task 1 is much smaller than in Task 2. To ensure that the classification accuracy is always more important than the number of selected features in Task 1, α = 0.999999 is used; for Task 2, α = 0.9 is set to increase the impact of the number of selected features on subset evaluation, which can effectively speed up the removal of redundant features from the entire feature set.

• Knowledge transfer:
  - Multipopulation framework
  - Assortative mating (crossover)
  - Vertical cultural transmission
    ‣ assign the skill factor for each generated individual

A. Performance of PSO-EMT

Thirty independent runs are conducted for each method, and the Wilcoxon rank-sum test with a significance level of 0.05 is used to verify the performance of the proposed method. In the results, "+", "–", and "≈" indicate that the proposed FS method is significantly better than, worse than, and similar to the compared method, respectively. Table III shows the training time (Time), average feature subset size (Size), best classification accuracy (Best), average classification accuracy (Mean), and standard deviation (Std) of all involved methods. Figs. 5 and 6 show the average fitness values of Task 1 and Task 2 on the ten test datasets during the evolutionary process; in terms of the convergence rate, PSO-EMT displays superior convergence performance on all datasets.

1) PSO-EMT Versus FULL: Table III shows that PSO-EMT significantly improves the classification accuracy over using the full set of original features on all datasets. PSO-EMT achieves the highest improvement on the 9Tumor dataset, where the average and best accuracies are increased by 22.06% and 30.73%, respectively. On the Leuk3 dataset, PSO-EMT achieves a 5.37% improvement on the average accuracy and an 8.18% improvement on the best accuracy. In addition, the feature subset sizes obtained by PSO-EMT are much smaller than the original feature size on all the examined datasets; the feature size is reduced roughly by 98% on the DLBCL dataset, which is the largest reduction.

2) PSO-EMT Versus PSO: Compared with PSO, PSO-EMT achieves significantly better performance on all datasets in terms of both the classification accuracy and the number of selected features; the mean classification accuracies on eight out of ten datasets are increased by at least 10%, with the highest improvement on 9Tumor, 15.28% on the average classification accuracy. In terms of dimensionality reduction, although PSO reduces the original feature size by around half on all datasets, the numbers of selected features achieved by PSO-EMT are still much smaller than those of PSO on all datasets. The highest dimensionality reduction occurs on Leuk3, where PSO-EMT chooses about 20 times fewer features than PSO and still increases the average and best classification accuracies by 4.68% and 5.10%, respectively.

3) PSO-EMT Versus ReliefF: As can be seen from the fourth column of Table IV, PSO-EMT selects fewer features on all test classification datasets and achieves significantly better or similar average classification accuracy on six out of ten datasets. The main reason is that ReliefF evaluates the importance of features individually, so the selected feature subset may still contain some redundant features. On the 9Tumor dataset, PSO-EMT achieves a lower average classification accuracy than ReliefF, but obtains a 5.00% higher best accuracy.

PSO-EMT Versus CSO: PSO-EMT chooses fewer features on three out of ten datasets, and it obtains significantly better or similar classification performance than CSO on seven out of ten examined datasets. On the Leuk2 dataset, PSO-EMT selects about 166 fewer features than CSO, but its average and best classification accuracies are increased by 9.24% and 6.96%, respectively.

B. Training Time

The training times of PSO-EMT and the compared approaches are shown in the third column of Table III. The training time of PSO-EMT is the smallest among all five methods on three datasets (i.e., DLBCL, Brain2, and Leuk3). Compared with PSO, the training time of PSO-EMT is between 1/3 and 1/7 of that of PSO. CSO needs the longest computational time among all compared algorithms, mainly because CSO stores the fitness values of the previously chosen feature subsets to avoid repeated assessments. Compared with AMSO and VLPSO, the training time of PSO-EMT is a little longer on seven datasets (i.e., Leuk1, 9Tumor, Brain1, Prostate, Leuk2, 11Tumor, and Lung); this is because a feature ranking method was embedded in AMSO and VLPSO to design solution representation strategies, which can effectively reduce the search space during the FS process.

4) PSO-EMT Versus AMSO: In terms of dimension reduction, AMSO selects slightly fewer features than PSO-EMT on all datasets. However, PSO-EMT achieves better classification performance than AMSO on eight out of ten datasets; on the Brain1 dataset, PSO-EMT obtains the highest improvement, with the average and best classification accuracies enhanced by 16.02% and 18.94%, respectively.
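The multipopulation knowledge transfer with skill factors and a random mating probability (rmp) can be sketched as follows. The velocity rule and all parameter values here are generic PSO assumptions for illustration, not the exact published update equations (2) and (4).

```python
import random

def velocity_update(v, x, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=random):
    """A plain PSO velocity rule, assumed here for illustration."""
    r1, r2 = rng.random(), rng.random()
    return [w * vi + c1 * r1 * (p - xi) + c2 * r2 * (g - xi)
            for vi, xi, p, g in zip(v, x, pbest, gbest)]

def multitask_step(particle, gbests, rmp, rng=random):
    """particle: dict with 'skill' (1 or 2), 'x', 'v', 'pbest'.
    gbests: {1: gbest of Task 1, 2: gbest of Task 2}.

    With probability rmp the particle learns from the OTHER task's
    global best (knowledge transfer); otherwise from its own task's."""
    own = particle["skill"]
    other = 3 - own
    guide = gbests[other] if rng.random() <= rmp else gbests[own]
    particle["v"] = velocity_update(particle["v"], particle["x"],
                                    particle["pbest"], guide, rng=rng)
    particle["x"] = [xi + vi for xi, vi in zip(particle["x"], particle["v"])]
    return particle
```

The skill factor routes each particle to one task's evaluation; occasionally borrowing the other task's gbest is what lets the small Task 1 subset and the full-feature Task 2 assist each other.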
4) PSO-EMT Versus SBMLR: Although PSO-EMT is slightly worse than SBMLR in both the number of selected features and the training time, the classification accuracies of PSO-EMT are better than those of SBMLR on seven out of ten datasets. This is because SBMLR uses a sparse multinomial logistic regression method to determine feature subsets, which easily leads to overfitting of the trained model.

5) PSO-EMT Versus SPEC: Compared with SPEC, the proposed method obtains a higher average classification accuracy on nine out of ten datasets, and PSO-EMT evolves smaller feature subsets with higher classification accuracies on all datasets except Leuk1 and Prostate. On the Brain2 dataset, PSO-EMT achieves the highest improvement, where the average and best classification accuracies are improved by 24.27% and 32.00%, respectively. In terms of the training time, PSO-EMT is a little slower than SPEC; this is because SPEC uses spectral graph theory to measure features and requires fewer computing resources.

5) PSO-EMT Versus VLPSO: Compared with VLPSO, PSO-EMT achieves significantly better classification accuracies on seven datasets. Although VLPSO achieves a smaller number of features than PSO-EMT, the feature subset size selected by PSO-EMT is already very small; for instance, the feature subset selected by PSO-EMT on Brain1 is 5.93% of the original features, while the average and best classification accuracies are increased by 1.53% and 0.95%, respectively. In general, among the five algorithms, the proposed PSO-EMT method has the best tradeoff between effectiveness and efficiency.

B. Effect of the Subset Updating Mechanism

• Subset Updating Mechanism:
  - Update candidate features in Task 1, size unchanged

To investigate the influence of the subset updating mechanism in the proposed FS method, the results of PSO-EMT with (PSO-EMT) and without (PSO-EMTs-) the subset updating mechanism are compared, as shown in Table VI (compared results of PSO-EMT and PSO-EMTs-). The results of the significance test show that PSO-EMT is significantly better than PSO-EMTs- on three datasets (i.e., Brain1, Leuk2, and Leuk3) and similar on the remaining seven datasets.

Effect of Knowledge Transfer (PSO-EMT with vs. without knowledge transfer, Table VII): with knowledge transfer, the classification accuracy is still increased by 4.10%, 1.24%, and 1.49%, respectively, on three of the datasets.

In summary, the effectiveness of PSO-EMT is contributed by evolutionary multitasking, which not only considers the entire feature set but also focuses on the important features during the FS process. In addition, the variable-range strategy and the subset updating mechanism are designed to guide particles to focus on more promising areas. Therefore, PSO-EMT has better performance than its counterparts in evolving feature subsets with fewer features and better classification accuracies.
and chooses about 146
graph
improves the best to measure features
s-accuracy andtheby trainingless
requires timecomput-
than that
3) PSO-EMT Versus CSO:Types PSO-EMT chooses fewer fea- 3) PSO-EMT Versus
fewer features
of SPEC. thanclassification
PSO-EMT . Also, the 0.24%.
average Note
accuracy that
Downloaded onD. MayExperiments on More
VII. IEEEof
C ONCLUSION High-Dimensional
C. Comparisons With Traditional Methods ingAuthorized
isresources.
licensedIn use
general, PSO-EMT
limited to: Victoria Universityhas approach,
ofbetter tradeoff
Wellington.
whichof turesDatasets
02,2021 at 21:29:09 UTC from
on three out of ten datasets, and it outperforms CSO
Xplore. Restrictions apply. tures on three out of
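The significance comparisons reported in these results use the Wilcoxon rank-sum test at the 0.05 level, summarised as "+", "-", and "≈". A minimal sketch of that tabulation, using the normal approximation to the rank-sum statistic (tie correction omitted for brevity; the run counts and accuracy values below are illustrative, not the paper's data):

```python
import math

def _ranks(values):
    # 1-based ranks with averaged ties
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def ranksum_label(a, b, alpha=0.05):
    """'+' if sample a is significantly better (higher) than b,
    '-' if significantly worse, '≈' otherwise."""
    n1, n2 = len(a), len(b)
    ranks = _ranks(list(a) + list(b))
    w = sum(ranks[:n1])                     # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2
    var = n1 * n2 * (n1 + n2 + 1) / 12      # normal approximation, no tie correction
    z = (w - mean) / math.sqrt(var)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided p-value
    if p >= alpha:
        return "≈"
    return "+" if sum(a) / n1 > sum(b) / n2 else "-"

# 30 accuracy values per method, one per independent run (made-up numbers)
psoemt = [0.90 + 0.001 * i for i in range(30)]
other  = [0.80 + 0.001 * i for i in range(30)]
print(ranksum_label(psoemt, other))   # +
```

Counting the "+", "-", and "≈" labels over all dataset/method pairs gives the win/loss/draw totals reported in the comparison tables.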
Fig. 3. Framework of the PSO-EMT.

K. Chen, B. Xue, M. Zhang and F. Zhou, "An Evolutionary Multitasking-Based Feature Selection Method for High-Dimensional Classification," IEEE Transactions on Cybernetics, 2020, doi: 10.1109/TCYB.2020.3042243.

L. Feng et al., "An empirical study of multifactorial PSO and multifactorial DE," in Proc. IEEE Congr. Evol. Comput., San Sebastian, Spain, 2017, pp. 921-928.

GECCO, Denver, Colorado, USA, 20-24 July 2016

A. Datasets

Ten gene expression datasets with thousands of features are used in the experimental analyses. They are open sources on https://ckzixf.github.io/dataset.html. Table I shows the key characteristics of these datasets. The distribution of data is highly unbalanced, so the classification on such datasets is a challenging task.

Thirty independent runs are conducted for each method. The Wilcoxon rank-sum test with a significance level of 0.05 is used to verify the performance of the proposed method. In the following results, "+," "-," and "≈" indicate that the proposed FS method is significantly better than, worse than, and similar to the compared methods, respectively. The larger the number of "+" symbols, the better the proposed method.

PSO-EMT versus AMSO: In terms of dimension reduction, AMSO selects slightly fewer features than PSO-EMT on all datasets. However, PSO-EMT achieves better than or similar classification performance compared with AMSO on eight out of ten classification datasets. On the Brain1 dataset, PSO-EMT obtains the highest improvement; that is, the average and best classification accuracies are enhanced by 14.70% and 10.00%, respectively. In terms of the significance test, the proposed method achieves significantly better classification performance than AMSO on half of the datasets (9Tumor, Brain1, Leuk2, 11Tumor, and Lung).

C. Comparisons With Traditional Methods

In this section, we compare the results of PSO-EMT with five classical FS approaches (i.e., CFS, FCBF, ReliefF, SBMLR, and SPEC). Table IV shows the results of PSO-EMT and the five traditional methods on ten datasets.

1) PSO-EMT Versus CFS: Table IV shows that PSO-EMT achieves better classification accuracy than CFS on six out of ten datasets, where PSO-EMT obtains the highest enhancement on the Prostate dataset: the mean and best classification accuracies are improved by 9.70% and 6.22%, respectively. CFS is a filter-based and deterministic FS approach, which is usually faster than wrapper-based FS methods. However, its computational time is longer than that of PSO-EMT on seven datasets due to its pairwise correlation measures. This shows that PSO-EMT is more suitable for high-dimensional data.

2) PSO-EMT Versus FCBF: Compared with FCBF, PSO-EMT chooses more features than FCBF on all datasets, but its classification accuracies are increased by 1.02% and 1.25%, respectively.

In summary, among the 50 pairwise comparisons with the five traditional methods, the numbers of wins, losses, and draws of PSO-EMT are 32, 13, and 5, respectively. The results indicate that the proposed method can achieve significantly better classification accuracy with a smaller feature subset than the compared traditional methods on most of the investigated high-dimensional classification problems.

D. Experiments on More Types of High-Dimensional Datasets

In order to further examine the performance of the proposed method, we conduct more experiments on other types of high-dimensional datasets. Due to page limitations, we have put the details into the online supplementary materials. The results on these different types of high-dimensional datasets show that PSO-EMT is able to obtain a feature subset with higher training accuracy and a smaller feature subset than the compared methods.

C. Effect of Variable-Range Strategy

To investigate the contribution of the variable-range strategy, we compare the results of the proposed method with (PSO-EMT) and without (PSO-EMTv-) the variable-range strategy. The results are recorded in Table VII. As can be seen from Table VII, the training time of PSO-EMT is shorter than that of PSO-EMTv- on all datasets. This shows that the variable-range strategy can effectively reduce the search space by adapting the search range of each feature during the FS process. In terms of the feature subset size, PSO-EMT outperforms PSO-EMTv- on eight out of ten datasets. On the Lung dataset, PSO-EMT chooses about 24 fewer features than PSO-EMTv-.

The figure shows the average feature size of gbest from 10 (i.e., after the subset updating mechanism is used for the first time during the search process) to 100 of the search process. It shows that the subset updating mechanism provides an effective redundant and/or irrelevant feature removal capability during the evolutionary process, enabling PSO-EMT to search for a good feature subset. In addition, the subset updating mechanism can increase the feature interaction in Task 1 during the FS process.

VII. CONCLUSION

The goal of this article was to develop an effective hybrid EMT method for FS on high-dimensional classification. The goal has been successfully achieved by designing a novel FS method, PSO-EMT, based on PSO and MFO that can effectively select a feature subset via information sharing between two related tasks. The results showed that the proposed method is able to obtain a feature subset with higher classification accuracy and a smaller number of features than state-of-the-art FS methods on most of the examined datasets. This is due to the effectiveness of the information transfer between the two designed related tasks. In terms of the training time, the proposed method requires less computational cost than the other compared methods in most cases due to the search space control strategies, such as the variable-range strategy and the knee point selection scheme. In addition, the proposed subset updating mechanism can successfully maintain the diversity of the population during the search process. This mechanism can help PSO-EMT escape from potential local optima and explore more fruitful regions during the FS process. The proposed method accomplishes implicit knowledge transfer between the different tasks only by sharing global best solutions in PSO, without taking into account other useful information, such as individual position and velocity. This is the first work that aims to use the idea of evolutionary multitasking to handle FS in high-dimensional classification problems. It provides a new effective way for FS.

Why use GP for Feature Construction?
• GP is flexible in making mathematical and logical functions.
• There isn't much structural (topological) information in the search space of possible functions, so using a meta-heuristic approach (such as evolutionary computation) seems reasonable.
[Feature construction diagram: selected features F1, F3, F12, and F40 are combined with operators (e.g., +, %) to form constructed features.]
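The diagram above combines the selected features (F1, F3, F12, F40) with operators such as + and % into constructed features. A minimal sketch of evaluating such an expression tree follows, assuming "%" is GP's protected division; the tree shape and the instance values are illustrative assumptions, not taken from the slides:

```python
def protected_div(a, b):
    # Common GP convention for "%": return 0 when the divisor is 0
    return a / b if b != 0 else 0.0

OPS = {"+": lambda a, b: a + b, "%": protected_div}

def evaluate(tree, instance):
    """tree is either a feature name (terminal) or (operator, left, right)."""
    if isinstance(tree, str):
        return instance[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, instance), evaluate(right, instance))

# Hypothetical constructed feature: (F1 + F3) % (F12 + F40)
cf = ("%", ("+", "F1", "F3"), ("+", "F12", "F40"))

# Transforming a tiny made-up dataset: one new column per constructed feature
data = [{"F1": 2.0, "F3": 4.0, "F12": 1.0, "F40": 2.0},
        {"F1": 1.0, "F3": 1.0, "F12": 0.0, "F40": 0.0}]
new_column = [evaluate(cf, inst) for inst in data]   # [2.0, 0.0]
```

Evaluating each constructed feature over every instance yields the transformed columns that a learner is then trained on.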
persion of the instances of that class along the feature axis. The dispersion of instances itself is related to the distribution of data points in that class.

COMP422 Feature Manipulation: 43

Measuring the Purity of Class Intervals
• An interval I is represented with a pair (lower, upper) which shows the lower and upper boundaries of the interval. Ic is used to indicate an interval for class c.
• The interval of class c could be formulated as follows if the class distributions were normal:

  Ic = [µc − 3σc, µc + 3σc]

• However, the normality assumption is not always satisfied.

Examples of good and bad class intervals (4 features, 3 classes):
• Overlapping intervals (bad)
• Non-overlapping intervals (good)

IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 16, NO. 5, OCTOBER 2012:

Fig. 9. (a) Decision tree induced by the J48 (C4.5) algorithm using the 1000 observations in the previous figure. (b) Class boundary generalized by the decision tree. The decision tree inducer has failed to generalize the concept of a line.

Fig. 11. Visualization of the features in the balance scale problem. The top four plots depict the original features of the problem and the bottom plot is for a constructed feature (x1x2 − x3x4). In each plot, the horizontal axis is the instance number and the vertical axis is the value of the feature for the given instance. The instances are grouped based on their class labels. The vertical shaded areas from left to right correspond to the class labels: "left," "balance," and "right." The two dashed lines in the bottom plot depict the way a J48 decision tree inducer would partition the input (constructed) feature space in order to learn the three concepts (classes).

Neshatian, K.; Mengjie Zhang; Andreae, P., "A Filter Approach to Multiple Feature Construction for Symbolic Learning Classifiers Using Genetic Programming," IEEE Transactions on Evolutionary Computation, vol. 16, no. 5, pp. 645-661, Oct. 2012.

GP for FC Measure: Original VS Constructed

Given a discrete or categorical random variable C (the class label) which can take values c1, c2, ..., cL with probabilities p(c1), p(c2), ..., p(cL), the Shannon entropy of C is defined by

  H(C) = − Σ_{i=1}^{L} p(ci) logb p(ci)

where b is the base of the logarithm and is usually 2.

A class interval establishes a new probability space. Therefore, the probability of classes in the above equation should be conditioned on the values of the feature that fall in the interval. Given X, a feature, C, the set of all class labels, and c′, the class of interest with corresponding interval Ic′, the Shannon entropy of the interval of class c′ is

  H(Ic′) = − Σ_{c∈C} p(c | X ∈ Ic′) log2 p(c | X ∈ Ic′)
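The two measures above can be sketched directly: the class interval Ic = [µ − 3σ, µ + 3σ] under the normality assumption, and the entropy of an interval computed from the class probabilities conditioned on X falling inside it. The feature values and labels below are illustrative:

```python
import math

def class_interval(values):
    # Ic = [mu - 3*sigma, mu + 3*sigma], assuming the class is normally distributed
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return (mu - 3 * sigma, mu + 3 * sigma)

def interval_entropy(feature, labels, interval):
    # H(Ic') = -sum over classes c of p(c | X in Ic') * log2 p(c | X in Ic')
    lo, hi = interval
    inside = [c for x, c in zip(feature, labels) if lo <= x <= hi]
    if not inside:          # empty interval: define entropy as 0
        return 0.0
    h = 0.0
    for c in set(inside):
        p = inside.count(c) / len(inside)
        h -= p * math.log2(p)
    return h

# A pure (non-overlapping) interval contains only one class, so its entropy is 0.
x = [0.10, 0.20, 0.15, 5.0, 5.1, 5.2]
y = ["a", "a", "a", "b", "b", "b"]
ia = class_interval([v for v, c in zip(x, y) if c == "a"])
purity = interval_entropy(x, y, ia)     # 0.0: no "b" instance falls in Ia
```

Lower interval entropy means a purer interval; an interval overlapping two equally frequent classes would score 1 bit.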
Fig. 10. (a) Simple GPMFC-constructed feature for the 1000 observations. (b) J48 induced decision tree using the constructed feature, y.

Single vs. Multiple Feature Construction
• With only one constructed feature, the common option is to use an augmented dataset.
• Possible ways to make multiple features are: random restart and picking multiple individuals. However, these methods usually lead to very high correlation between the constructed features.
• Another way of making multiple features is to use a fitness function that has this potential.

Class-Dependent Terminal Sets
• A constructed feature aims at discriminating instances of a class (c) from the other classes => it is constructed based on features that are relevant to the class c.

E. Further Discussions

Theoretically, GP and symbolic learners both generate intelligible models: GP generates expression trees, and symbolic learners generate chains of rules or trees of decision stumps. In practice, however, both of them can produce solutions (constructed features or classifiers) that are not easily comprehensible; very often constructed features are unnecessarily complicated and therefore unintelligible. Although in some of our empirical results the complexity of the constructed features and the complexity of the decision tree induced on those features altogether were less than the complexity of the decision tree induced on the original features (see the examples in the previous subsection), in many cases the overall complexity did not change or even increased after feature construction. Two common causes of unnecessarily high complexity in GP are verbosity and introns, both of which can be addressed by algebraic or numeric simplification. Even though in some cases the proposed feature construction system may not reduce the overall complexity of classification systems (the complexity of a set of constructed features plus the complexity of the learnt classifier), throughout this paper the main focus was only on the complexity of induced decision trees. This is because a decision tree with too many nodes (on numeric features) can create serrated decision boundaries.

6.2.2 MGPFC Fitness Function

To evaluate an individual, its constructed features are used to transform the training set into a new training set with m features. The discriminating ability of the transformed dataset indicates how good the constructed features are. While a wrapper measure based on a classification algorithm can be a good indicator for searching for good feature sets, the resulting feature set may not be general for other classification algorithms. On the other hand, a filter measure is based on the intrinsic characteristics of the data, so its solutions may be effective for many learning algorithms, however, with the price of lower classification accuracy than wrapper approaches. Therefore, a hybrid approach that combines both wrapper and filter measures was proposed for feature construction.

Example Papers for Reading
• Kourosh Neshatian, Mengjie Zhang, Peter Andreae: Genetic Programming for Feature Ranking in Classification Problems. SEAL 2008: 544-554. http://dx.doi.org/10.1007/978-3-540-89694-4_55
• Kourosh Neshatian, Mengjie Zhang, Mark Johnston: Feature Construction and Dimension Reduction Using Genetic Programming. Australian Conference on Artificial Intelligence 2007: 160-170. http://dx.doi.org/10.1007/978-3-540-76928-6_18
• Kourosh Neshatian's PhD Thesis: http://homepages.ecs.vuw.ac.nz/~mengjie/students/KouroshPhd_thesis.p
• Kourosh Neshatian, Mengjie Zhang: Using genetic programming for context-sensitive feature scoring in classification problems. Connect. Sci. 23(3): 183-207 (2011). http://dx.doi.org/10.1080/09540091.2011.630065

Multi-tree GP for FC
• Each individual contains multiple trees, one constructed feature per class (e.g., CF1 for Class 1, CF2 for Class 2, CF3 for Class 3).
• [Tree diagram: three trees built with Min, Max, +, −, and * over the terminals F2–F9.]
• To evaluate an individual with m trees, its constructed features are used to transform the training set into a new dataset with m features.

Example:
1. Constructed feature set (CF): CF1, CF2, CF3
   • CF1 = Min((F7 + F2), (F4 * F2))
   • CF2 = (F9 − F4) + (F5 * F8)
   • CF3 = (F6 + F3) − Max(F8, F9)
2. Selected feature set (Ter): F2, F3, F4, F5, F6, F7, F8, F9
3. Combination set (CFTer): CF1, CF2, CF3, F2, F3, F4, F5, F6, F7, F8, F9

Class-relevant measure:
• Half of the top-ranked features will be used to form the terminal set of class c.
  - => may not eliminate irrelevant features
  - => will narrow the search space

Example Papers for Reading
• Bing Xue, Mengjie Zhang, Will Browne. Particle swarm optimization for feature selection in classification: A multi-objective approach. IEEE Transactions on Cybernetics, vol. 43, no. 6, pp. 1656-1671, 2013. http://dx.doi.org/10.1109/TSMCB.2012.2227469
• Bing Xue, Liam Cervante, Lin Shang, Will Browne, Mengjie Zhang. A Multi-Objective Particle Swarm Optimisation for Filter Based Feature Selection in Classification Problems. Connection Science, Vol. 24, No. 2-3, pp. 91-116, 2012. http://dx.doi.org/10.1080/09540091.2012.737765
• Bing Xue, Liam Cervante, Lin Shang, Will N. Browne, Mengjie Zhang. Binary PSO and rough set theory for feature selection: a multi-objective filter based approach. International Journal of Computational Intelligence and Applications (IJCIA), Vol. 13, No. 2 (2014), pp. 1450009-1 - 34.
• Bing Xue, Liam Cervante, Lin Shang, Will Browne, Mengjie Zhang. Multi-Objective Evolutionary Algorithms for Filter Based Feature Selection.
• Binh Ngan Tran, Bing Xue, Mengjie Zhang. "Class Dependent Multiple Feature Construction Using Genetic Programming for High-Dimensional Data". Proceedings of the 30th Australasian Joint Conference on Artificial Intelligence (AI2017), Lecture Notes in Computer Science, Vol. 10400, Springer, Melbourne, Australia, August 19-20th, 2017, pp. 182-194.
• Binh Tran, Bing Xue, Mengjie Zhang. "Genetic programming for multiple-feature construction on high-dimensional classification", Pattern Recognition, vol. 93, pp. 404-417, 2019.

A Decision Tree (DT) is used to evaluate the classification performance of the constructed feature set as it is a fast learning algorithm. The discriminating ability of the transformed training set is used to determine the fitness of the individual. Equation (6.1) describes the fitness function, which combines the classification performance and a distance measure using a weighting coefficient α:

  Fitness = α · Accuracy + (1 − α) · Distance    (6.1)

where Accuracy is the average accuracy of DT over K-fold (K=3) cross-validation (CV) on the transformed training set. To avoid overfitting, this
in Classification”.
Example Papers for Reading
http://dx.doi.org/10.1142/S1469026814500096 Example Papers for Reading
COMP422
ternational
Feature Manipulation: 47 K-fold CV is repeated
Journal on Artificial Intelligence Tools. Vol.
COMP422 L times (L=3)48with different data splitting similar to
Feature Manipulation:
• Bing Xue, Mengjie Zhang, Will N. Browne. Particle Swarm Optimisa- 22, Issue 04, August •2013.
Kourosh pp. Neshatian’s
1350024 – 1PhD - 31.Thesis:
tion for Feature Selection in Classification: Novel Initialisation and Updat-
Example
ing Mechanisms”.
• Bing Xue, Mengjie Papers
Applied Soft
Zhang, WillComputing.
Browne. for Reading
VolParticle
18, PP. 261–276,
swarm 2014.
optimization
• Bing Xue, Liam
Example [127].Papers
Cervante,
http://dx.doi.org/10.1142/S0218213013500243 Totally,
Lin Shang, ◊ L models were built to evaluate the set of constructed
Will
forKReadingBrowne, Mengjie
http://homepages.ecs.vuw.ac.nz/˜mengjie/students/KouroshPhd_thesis.p
1220
Zhang. A Multi-Objective Particle Swarm Optimisation for Fil-
http://dx.doi.org/10.1016/j.asoc.2013.09.018
for feature selection in classification: A multi-objective approach, IEEE features. Therefore, 9 models are trained in each evaluation, which requires an
• http://www.informatik.uni-trier.de/˜ley/pers/hd/x/Xue:Bing
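The fitness in Equation (6.1) can be sketched in a few lines. This is an illustrative stand-in, not the thesis implementation: a nearest-centroid classifier replaces the decision tree, and the Distance term is our own sigmoid-squashed "between-class minus within-class" mean distance rather than the exact measure used in the thesis; all function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def class_distance(X, y):
    """Sigmoid of (mean between-class - mean within-class distance), in (0, 1)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(len(y), k=1)              # each unordered pair once
    same = (y[:, None] == y[None, :])[iu]
    between, within = d[iu][~same].mean(), d[iu][same].mean()
    return 1.0 / (1.0 + np.exp(-(between - within)))

def repeated_cv_accuracy(X, y, k=3, reps=3):
    """Average accuracy over L repetitions of K-fold CV, so K x L (= 9)
    models are trained per fitness evaluation, as in the slides."""
    accs = []
    for _ in range(reps):
        order = rng.permutation(len(y))            # a different data splitting
        folds = np.empty(len(y), dtype=int)
        folds[order] = np.arange(len(y)) % k
        for f in range(k):
            tr, te = folds != f, folds == f
            labels = np.unique(y)
            cents = np.array([X[tr][y[tr] == c].mean(0) for c in labels])
            pred = np.argmin(np.linalg.norm(X[te][:, None] - cents, axis=-1), axis=1)
            accs.append((labels[pred] == y[te]).mean())
    return float(np.mean(accs))

def fitness(X, y, alpha=0.8):
    # Equation (6.1): alpha * Accuracy + (1 - alpha) * Distance
    return alpha * repeated_cv_accuracy(X, y) + (1 - alpha) * class_distance(X, y)
```

On a well-separated transformed training set both terms approach 1, so the fitness approaches 1; a poor set of constructed features lowers both the CV accuracy and the distance term.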
GECCO, Denver, Colorado, USA. 20-24 July 2016

Class-Dependent Multi-tree GP for FC (CDFC)
• Begin → GP for Class-Dependent Feature Construction, with one tree and terminal set per class (Class 1, ..., Class c) → Constructed & Selected Features → Data Transformation of the training set and test set → Learning Algorithm → Test Accuracy → End.
• [Figure 6.7: Results of multiple-tree GP (CDFC) versus single-tree GP (1TGPFC) for multiple feature construction: (a) KNN accuracy, (b) NB accuracy and (c) DT accuracy on the CL, DL, LK, CN, PR, OV, LK1 and SR datasets, and (d) accuracy improvement (%) of KNN, NB and DT.]
• [Figure 6.10: Constructed features (CF1, CF2, CF3) on Leukemia1: (a) 1TGPFC, (b) CDFC.]

Biology: PSO with local search on pbest and resetting gbest
• A PSO based hybrid FS algorithm for high-dimensional classification.
• PSO-LSSU combines wrapper and filter measures:
  - the fitness function;
  - the local search.
• PSO-LSSU achieved much smaller feature subsets with significantly better classification performance than the compared methods in most cases.
• 5-6 times faster than PSO.
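The PSO-based feature selection discussed in these slides builds on the standard binary PSO update (sigmoid transfer of the velocity, probabilistic bit sampling). The sketch below is a generic binary PSO for illustration only, not the PSO-LSSU algorithm itself: the local search and gbest-resetting components are omitted, the toy fitness is our own stand-in, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_fitness(mask, relevance, penalty=0.01):
    # stand-in for a wrapper/filter measure: mean relevance of the
    # selected features minus a penalty on the subset size
    if mask.sum() == 0:
        return -1.0
    return relevance[mask.astype(bool)].mean() - penalty * mask.sum()

def binary_pso(relevance, n_particles=10, iters=30, w=0.7, c1=1.5, c2=1.5):
    d = len(relevance)
    pos = (rng.random((n_particles, d)) < 0.5).astype(int)   # bit-string positions
    vel = np.zeros((n_particles, d))
    pbest = pos.copy()
    pbest_fit = np.array([toy_fitness(p, relevance) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, d))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        prob = 1.0 / (1.0 + np.exp(-vel))                    # sigmoid transfer
        pos = (rng.random((n_particles, d)) < prob).astype(int)
        fit = np.array([toy_fitness(p, relevance) for p in pos])
        better = fit > pbest_fit
        pbest[better], pbest_fit[better] = pos[better], fit[better]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest

# features 0-2 are relevant, the remaining 17 are noise (made-up scores)
relevance = np.array([0.9, 0.8, 0.85] + [0.05] * 17)
mask = binary_pso(relevance)
```

Each bit of `gbest` marks one feature as selected or not; a real wrapper would replace `toy_fitness` with classification accuracy from a learning algorithm.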

The average results over the 50 runs of both methods are displayed in Figure 6.7. Results from Figure 6.7 showed that, using the CDFC constructed features, all three learning algorithms obtained significantly higher accuracies than using the ones created by 1TGPFC on all datasets. KNN obtained the largest increase of 8.1% average accuracy on Prostate, NB had 7.2% on CNS, and DT achieved a 13% increase on SRBCT. Comparisons between the three learning algorithms in Figure 6.7(d) showed that KNN had the highest improvement on four datasets; on the other four datasets, DT had the largest differences, of 10.5% and 13% on Leukemia1 and SRBCT.

Since the constructed features aim to separate instances of one class from the others, the contributions to the superior results of CDFC mainly came from its fitness function. While both methods used an entropy-based measure to minimise the impurity of the constructed feature values within each class, CDFC incorporated an additional distance measure to evaluate the whole set of constructed features. The aim of the distance measure is to maximise the distance between instances of different classes and minimise the distance between instances of the same class. This is the reason why the data clouds produced by CDFC are more compact and scattered far away from each other.

6.6.5 Comparison with Feature Selection Results
The results of CDFC have shown that feature construction is a promising approach to dimensionality reduction on high-dimensional data. This section will compare the results of CDFC with the best results obtained by the feature selection methods in Chapters 3 and 4, using all four datasets that have been used in all the experiments of the feature selection and feature construction methods in this thesis, namely DLBCL, Prostate, Leukemia1 and SRBCT.

Binh Ngan Tran, Bing Xue, Mengjie Zhang. "Class Dependent Multiple Feature Construction Using Genetic Programming for High-Dimensional Data". Proceedings of the 30th Australasian Joint Conference on Artificial Intelligence (AI 2017). Lecture Notes in Computer Science. Vol. 10400. Springer. Melbourne, Australia, August 19-20th, 2017. pp. 182-194.
Binh Ngan Tran. Evolutionary Computation for Feature Manipulation in Classification on High-dimensional Data. PhD thesis, Victoria University of Wellington, New Zealand. http://researcharchive.vuw.ac.nz/xmlui/handle/10063/7078
Binh Tran, Mengjie Zhang and Bing Xue. "A PSO Based Hybrid Feature Selection Algorithm For High-Dimensional Classification". Proceedings of 2016 IEEE World Congress on Computational Intelligence / IEEE Congress on Evolutionary Computation (WCCI/CEC 2016). Vancouver, Canada. 24-29 July, 2016. pp. 3801-3808.

GP for FS and FC (in Biology)

5.2 The constructed feature
To see why SVM can improve its performance by using only one constructed feature, we pick the best constructed feature in one GP run on the Leukemia dataset to analyse. Figure 1 shows this GP tree with a size of ten, constructing a new feature based on four original features: D13637_at, D42043_at, D78611_at and X95735_at. The values of these four selected features are plotted in Figures 2 and 3. Among these features, X95735_at has the fewest overlapping values between the two classes. However, by combining these features, the constructed feature can split instances of different classes into two completely separate intervals. Figure 4 shows the feature values created by this constructed feature and a similar constructed feature for the DLBCL dataset. The result shows that GP has the ability to select informative features to build high-level features with better distributions than the original skewed features. However, the number of features constructed by one best individual is still too small to be representative of a whole feature set with a large number of features. Increasing the number of constructed features may further improve the performance of these learning algorithms on high-dimensional data. Our future work will focus on this direction.
[Figure 1: Leukemia constructed feature. Figure 2: Features X95735_at and D42043_at. Figure 3: Features D13637_at and D78611_at. Figure 4: Constructed features.]

5.3 Overfitting problem
To analyse the overfitting problem, we choose Colon, the smallest dataset, to have a closer look at the distribution of each feature. Figure 5 shows the boxplot of the first 100 features of the Colon dataset. We can see that all of these features have a skewed distribution. Each feature also has many outliers scattering far away from its mean value. In the experiments, Colon is divided into 10 folds, each of which has about 6 instances. Therefore, there is a very high chance that the distributions of the training and the test folds are very different. As a result, the constructed or selected features based on the training fold cannot be generalised to correctly predict the unseen data in the test fold. This may be the reason why the training and test accuracies are so different. This explanation is concordant with the result of the Ovarian dataset, where both SVM and GP achieve similar performance on the training and test sets. The boxplot of the first one hundred features of this dataset in Figure 6 shows that these features have a rather symmetric distribution without many outliers. This is also the only dataset on which KNN gives similar performance on the training and test sets.
[Figure 5: Colon, first one hundred features. Figure 6: Ovarian, first one hundred features.]

Similar to Ovarian, the Madelon dataset also has a symmetric distribution. However, GP and SVM have very different behaviours on this dataset. While GP has nearly the same classification accuracies on the training and test sets, at about 63%, SVM achieves 100% accuracy on the training set but only about 50% on the test set. By plotting the relationships between features of these two datasets in Figure 7 and Figure 8, we find that Ovarian features are more correlated to each other (about 0.4 to 0.6) than Madelon features (about 0.01 to 0.03). This indicates that for non-linear data GP can work better than libSVM, which is a linear classification algorithm.
[Figure 7: Ovarian feature correlation. Figure 8: Madelon feature correlation.]

Affected by this overfitting problem, SVM has the poorest test performance on most of the datasets. However, only one constructed feature can significantly improve its performance on 6 datasets, while the other created subsets cannot. This can be explained by looking at the distribution of the constructed feature. Figure 9 shows the boxplot of the constructed feature and its four original base features. Compared to the original features, the constructed feature has a much better distribution without outliers. Therefore, the overfitting problem may be alleviated in the transformed dataset using this constructed feature.
[Figure 9: Feature distributions: the Leukemia constructed feature and its four selected base features.]

6. Conclusions and future work
In general, the results show that GP constructed features can improve the performance of kNN, SVM and GP classifiers on high-dimensional problems. Using only the constructed feature, SVM achieves higher accuracies than using all features and the other created subsets on most datasets. Among the six created subsets, the constructed feature and the terminal features achieve the highest performance in GP. The last four subsets achieve similar performance in kNN and SVM on all datasets. Analysis of the constructed feature shows that, by choosing informative features, GP can construct new features which have higher discriminating ability than the original features. The big difference between training and test sets on most of the datasets indicates the problem of overfitting. By analysing the datasets, we found that this problem occurs when the data has a skewed distribution with many outliers. GP also shows its ability to alleviate the problem of overfitting.

Binh Tran, Bing Xue and Mengjie Zhang. "Genetic Programming for Feature Construction and Selection in Classification on High-dimensional Data", Memetic Computing, vol. 8, Issue 1, pp. 3-15. 2016.
Feature Clustering for GP-Based FC on High-Dimensional Data
• Eliminating redundant features may improve GP performance in FC.
• Training Set → Redundancy-based Feature Clustering Method → Feature Clusters C1, C2, ..., Cm → the best feature of each cluster → GP for Single Feature Construction → Constructed & Selected Features (from the best individual).

Table 5.6: Results of cluster analysis
  Dataset    #Features   #Clusters   %Dimensionality reduction   ASC
  Colon      2000        104.10      0.95                        0.80
  DLBCL      5469        819.20      0.85                        0.96
  Leukemia   7129        901.30      0.87                        0.98
  CNS        7129        79.30       0.99                        1.00
  Prostate   10509       1634.80     0.84                        0.85
  Ovarian    15154       601.20      0.96                        0.31
  Alizadeh   1095        93.60       0.91                        0.94
  Yeoh       2526        97.60       0.96                        1.00

The experiments were conducted based on a 10-fold CV framework; on each dataset, the average silhouette coefficient (ASC) over the 10 folds was calculated. Table 5.6 shows the original number of features, the average number of clusters generated with the redundancy level of 0.9, the percentage of dimensionality reduction, and the average ASC over 10 folds of each dataset.

As can be seen from the fourth column of Table 5.6, all datasets obtained at least 84% dimensionality reduction after applying the proposed feature clustering algorithm. The number of input features into GP was significantly reduced, with the largest reductions of 99% on CNS and 96% on Ovarian and Yeoh. The third column of Table 5.6 also showed differences in the number of clusters generated on different datasets, regardless of the original number of features. For example, CNS had a much smaller number of clusters than Colon, although its original feature set was more than three times larger than Colon's. This again confirms that it is very difficult to predefine an appropriate number of clusters for each dataset to maintain a certain level of redundancy among all features in the same cluster.

The silhouette coefficients shown in the last column of the table were 0.8 or above on all datasets except Ovarian. This indicates that the clusters generated by RFC were compact and separated. Only on Ovarian was this coefficient low. An investigation of this dataset showed that most of its features have a correlation coefficient with other features higher than 0.5. Therefore, features in different clusters might still be highly correlated, but at a lower level than the predefined threshold (0.9). However, even though its silhouette coefficient was low (0.31), the results of this dataset shown in Table 5.2 revealed that the feature clustering method enabled the constructed feature to perform significantly better than the one constructed from the whole feature set, thanks to the significant reduction of 96% in dimensionality.

5.5.2 The Constructed Features
This section investigates the reason why the constructed and selected features can achieve good performance, by showing a constructed feature from a GP run on one dataset as a typical example. Leukemia was chosen because the size of the constructed feature (the GP tree size) on this dataset is smaller than on the others. This is also a challenging dataset, as can be seen from its results in Tables 5.2 and 5.3. Figure 5.5 shows the GP tree of the constructed feature and its values returned by CGPFC in the run of seed 1 on fold 5 of the Leukemia dataset. It was constructed from four original features: M19507_at, X61587_at, U09087_s_at and U82759_at, whose values are plotted in Figure 5.6. Note that this dataset has two classes of leukemia patients, namely acute lymphoblastic leukemia (ALL) and acute myelogenous leukemia (AML). It can be seen from the scatter plots in Figure 5.6 that the selected features had low impurity, i.e., high correlation with the class labels.
[Figure 5.5: Leukemia constructed feature. Figure 5.6: Leukemia original features M19507_at, X61587_at, U09087_s_at and U82759_at, by class (ALL/AML).]

Binh Ngan Tran, Bing Xue, Mengjie Zhang. "Using Feature Clustering for GP-Based Feature Construction on High-Dimensional Data". Proceedings of the 20th European Conference on Genetic Programming (EuroGP 2017). Lecture Notes in Computer Science. Vol. 10196. Amsterdam. 18-21 April 2017. pp. 210-226.

Multi-objective GP for FC and FS
• Decompose a c-class problem into c binary classification problems.
• Evolve c sets of binary classifiers employing a steady-state multi-objective GP with three minimising objectives:
  - (i) false positives (FPs),
  - (ii) false negatives (FNs),
  - (iii) the number of leaf nodes of the corresponding encoding tree.
• Each binary classifier is composed of a binary tree and a linear support vector machine (SVM).
• During crossover and mutation, the SVM weights are used to determine the usefulness of the corresponding nodes.

Nag, Kaustuv, and Nikhil R. Pal. "Feature Extraction and Selection for Parsimonious Classifiers with Multiobjective Genetic Programming." IEEE Transactions on Evolutionary Computation (2019).
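Redundancy-based feature clustering as described above groups features whose pairwise correlation exceeds a threshold (0.9 in the slides) and keeps one representative per cluster. The following is our own greedy illustration of the idea, not the exact RFC algorithm from the thesis:

```python
import numpy as np

def redundancy_cluster(X, threshold=0.9):
    """Greedily group features whose |Pearson correlation| >= threshold.

    Returns a list of clusters (lists of column indices); the lowest
    index in each cluster acts as its seed/representative.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unassigned = set(range(X.shape[1]))
    clusters = []
    while unassigned:
        seed = min(unassigned)
        # the seed joins its own cluster since corr[seed, seed] == 1
        members = sorted(j for j in unassigned if corr[seed, j] >= threshold)
        for j in members:
            unassigned.discard(j)
        clusters.append(members)
    return clusters

rng = np.random.default_rng(1)
f1 = rng.normal(size=200)
f2 = f1 + 0.01 * rng.normal(size=200)   # near-duplicate of f1 (redundant)
f3 = rng.normal(size=200)               # independent feature
X = np.column_stack([f1, f2, f3])
clusters = redundancy_cluster(X)
```

Feeding only one feature per cluster into GP is what produces the large dimensionality reductions reported in Table 5.6.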
GP for FC in Clustering: Multi-Tree GP Representation
• Each tree creates a single constructed feature.
• Each individual contains t trees, to give t constructed features.

GP for FC in Clustering: Vector Representation
• Having to set t is annoying. Can we use a single tree?
• Introduce a new concat operator which can create vectors of CFs:
  - automatically build up a suitable-length vector;
  - extend the function set to work on vectors.
• Example output gives 4 features: [min(0.63, F23), F37/0.59, F86, F85]
• However, each tree must be larger.

Andrew Lensen, Bing Xue, and Mengjie Zhang. "New Representations in Genetic Programming for Feature Construction in k-means Clustering". Proceedings of the 11th International Conference on Simulated Evolution and Learning (SEAL 2017). Lecture Notes in Computer Science. Vol. 10593. Shenzhen, China. November 10-13, 2017. pp. 543-555.
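The example output above, [min(0.63, F23), F37/0.59, F86, F85], is simply a list of expressions over the original features; applying it to a dataset yields the constructed feature matrix that k-means then clusters. A sketch of that evaluation step (the data and feature indices are made up for illustration):

```python
import numpy as np

# each constructed feature is a function of the original feature matrix X
constructed = [
    lambda X: np.minimum(0.63, X[:, 23]),  # min(0.63, F23)
    lambda X: X[:, 37] / 0.59,             # F37 / 0.59
    lambda X: X[:, 86],                    # F86, copied through unchanged
    lambda X: X[:, 85],                    # F85, copied through unchanged
]

def transform(X, constructed):
    """Apply a vector of constructed features, giving the new data matrix."""
    return np.column_stack([cf(X) for cf in constructed])

rng = np.random.default_rng(0)
X = rng.random((30, 100))        # 30 instances, 100 original features
Z = transform(X, constructed)    # 30 instances, 4 constructed features
```

In the single-tree vector representation, the concat operator builds such a list inside one tree, so the number of constructed features t is evolved rather than fixed in advance.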

GP for Manifold Learning (GP-MaL-LT); Visualisation of GP-MaL-LT vs LLE
• A fitness function to measure how well local topology is preserved by the evolved mapping (represented by GP trees).
• A surrogate model to approximate nearest-neighbours, greatly reducing computational cost while still ensuring the preservation of local structure.

A. Lensen, B. Xue and M. Zhang, "Genetic Programming for Manifold Learning: Preserving Local Topology," IEEE Transactions on Evolutionary Computation, doi: 10.1109/TEVC.2021.3106672.
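One simple way to score how well a mapping preserves local topology, in the spirit of the fitness described above (though not the exact GP-MaL-LT measure), is the average overlap between each point's k nearest neighbours before and after the mapping:

```python
import numpy as np

def knn_indices(X, k):
    # brute-force k nearest neighbours, excluding the point itself
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def neighbour_preservation(X_high, X_low, k=5):
    """Mean fraction of each point's k-NN set preserved by the mapping."""
    nn_h = knn_indices(X_high, k)
    nn_l = knn_indices(X_low, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_h, nn_l)]
    return float(np.mean(overlaps))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
perfect = neighbour_preservation(X, X.copy())                  # identity mapping
random_map = neighbour_preservation(X, rng.normal(size=(50, 2)))  # unrelated embedding
```

The surrogate-model idea in the slides corresponds to replacing the exact brute-force neighbour search here with an approximation, which is what makes the fitness cheap enough for evolution.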
Image Recognition/Classification
• GP for image analysis: evolve image descriptors.
• Keypoint identification, feature extraction, feature construction/selection.
• The steps for generating a feature vector: Feature Detection → Feature Extraction → Feature Selection/Construction.

GP for Image Classification: the traditional way
• Domain-specific pre-extracted features approach (DS-GP):
  - The input is raw image pixel values.
  - The feature areas need to be designed by domain experts.
  - Transform the pixel values of the selected areas to a different domain.
  - Select a subset out of the extracted features (optional).
  - Feed the extracted features (with or without selection) to a GP-based classifier.

8.2.1 Individual Representation (FELGP)
GP has a tree-based representation, which is known for variable lengths of evolved solutions. The individual representation of the proposed FELGP approach is based on STGP [22], where each function has input and output types, and each terminal has an output type. To define the type constraints, a new program structure is developed, as shown in Fig. 8.2. The new program structure has the input, filtering & pooling, feature extraction, concatenation, classification, combination, and output layers. Each layer except for the input and output layers has different functions for different purposes. The input layer represents the input of the FELGP system, such as the terminals. The filtering & pooling layer has filtering and pooling functions, which operate on images. The feature extraction layer extracts informative features from the images using existing feature extraction methods. The concatenation layer concatenates features produced by its children nodes into a feature vector. The filtering & pooling, feature extraction and concatenation layers belong to the process of feature learning, where informative features are learned from raw images. The learned features can be directly fed into any classification algorithm for classification. Therefore, a classification layer is connected with the concatenation layer. The classification layer has several classification functions that can be used to train classifiers using the learned features. The combination layer has several combination functions to combine the outputs of the classification functions. The classification and combination layers belong to the process of ensemble learning, where the classification functions are selected and trained, and the outputs of the classifiers are combined. Finally, the output layer performs plurality voting on the outputs produced by the combination layer to obtain the combined predictions.

Fig. 8.2 The new program structure of FELGP and an example solution/program that can be evolved by FELGP: Input layer (X_train) → Filtering & Pooling layer (e.g., LoG1, SobelX, Med, Mean, LBP-F, HOG-F, MaxP) → Feature Extraction layer (e.g., uLBP, SIFT, HOG) → Concatenation layer (FeaCon2, FeaCon3) → Classification layer (SVM, RF, LR) → Combination layer (Comb3) → Output layer (predicted class labels Y_predict).

Compared with the program structure of the EGP method [3], the new program structure of the FELGP approach has one more layer, i.e., the feature extraction layer.

Y. Bi, B. Xue and M. Zhang, "Genetic Programming With a New Representation to Automatically Learn Features and Evolve Ensembles for Image Classification," IEEE Transactions on Cybernetics, vol. 51, no. 4, pp. 1769-1783, April 2021, doi: 10.1109/TCYB.2020.2964566.
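The output layer's plurality voting can be sketched as a plain majority vote over the per-classifier predictions. This is a generic illustration of the combination step, not the FELGP code itself, and the example predictions are made up:

```python
import numpy as np

def plurality_vote(predictions):
    """Combine per-classifier predictions (one row per classifier)
    into a single label per instance by majority vote."""
    predictions = np.asarray(predictions)
    n_instances = predictions.shape[1]
    combined = np.empty(n_instances, dtype=predictions.dtype)
    for i in range(n_instances):
        values, counts = np.unique(predictions[:, i], return_counts=True)
        combined[i] = values[counts.argmax()]   # most frequent label wins
    return combined

# three classifiers (e.g. the SVM, RF and LR nodes), five instances
preds = [[0, 1, 1, 2, 0],
         [0, 1, 0, 2, 1],
         [1, 1, 1, 2, 1]]
```

Because the classifiers sit inside the evolved tree, evolution can discover both which classifiers to train and how their votes are combined.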
Biological Datasets
  Data set            #Features   #Samples   #Classes
  Pancreatic Cancer   6771        181        2
  Ovarian Cancer 1    15154       253        2
  Ovarian Cancer 2    15000       216        2
  Prostate Cancer     15000       322        4
  Toxpath             7105        115        4
  Arcene              10,000      200        2
  Apple-plus          773         40         4
  Apple-minus         365         40         4

Biomarker Identification
m/z values in the Apple-plus data set (12 biomarkers); New Method found 9, Method 2 found 3:
  331.21: New Method ✗, Method 2 ✓
  471.09: New Method ✓, Method 2 ✓
  107.05, 169.05, 238.05, 275.09, 456.11, 459.13: New Method ✓, Method 2 ✗
  456.62, 475.10: New Method ✗, Method 2 ✗
  449.11: New Method ✓, Method 2 ✓
  229.09: New Method ✓, Method 2 ✗

Apple-minus m/z values (5 biomarkers); New Method found 5, Method 2 found 2:
  463.0: New Method ✓, Method 2 ✗
  447.09: New Method ✓, Method 2 ✓
  273.03: New Method ✓, Method 2 ✓
  435.13: New Method ✓, Method 2 ✗
  227.07: New Method ✓, Method 2 ✗

Soha Ahmed, Mengjie Zhang, Lifeng Peng and Bing Xue. "Multiple Feature Construction for Effective Biomarker Identification and Classification using Genetic Programming". Proceedings of 2014 Genetic and Evolutionary Computation Conference (GECCO 2014). ACM Press. 2014. pp. 249-256.
Soha Ahmed. Genetic Programming for Biomarker Detection in Classification of Mass Spectrometry Data. PhD thesis, 2015, School of Engineering and Computer Science, Victoria University of Wellington, New Zealand.

GP for FS/FC for Visualisations GP for FS/FC for Visualisations


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
• Multi-objective: 12
Fig. 13. This article has of
Complexity been26.
accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
IEEE TRANSACTIONS ON CYBERNETICS
LENSEN et al.: GP FOR EVOLVING FRONT OF INTERPRETABLE MODELS FOR DATA VISUALIZATION 13

- visualizations quality;
- visualizations complexity;

• Based on t-distributed Stochastic


Neighbor Embedding (t-SNE)

Fig. 12. Complexity of 19.

feature set, yet they are able to separate three classes of the
dataset well. In particular, the blue, red, and purple classes can
be separated
Fig. 11. out by drawing
Complexity two vertical lines through the plot.
of 13. Fig. 14. Complexity of 33. Fig. 16. Complexity of 104.
In other words, two thresholds can be applied to the output of
the top tree (x-axis) in order to roughly separate the classes and the only feature subtracted in the tree is f21 , this suggestsnf8 feature. This is, however, sufficient to cause some of the
into three groups: 1) red; 2) orange and green; and 3) purple that the blue and purple classes have particularly small orangefea-points to be separated from the yellow and green points
Lensen, Andrew, Bing Xue, and Mengjie Zhang. "Genetic programming for evolving a front of interpretable models for data visualization." IEEE Transactions on Cybernetics 51.11 (2020): 5468-5482.

[Fig. 13. Complexity of 26. Fig. 15. Complexity of 59.]

"...indicating that higher values of nf8 may indicate a point belongs to the orange class rather than the yellow one. The major change in the bottom tree is the introduction of the nf21 and nf6 features, which helps to compact the blue class, which has particularly large values (as it has low x values), and starts to separate the red class more strongly.

f21 in the Dermatology dataset corresponds to the “thinning of the suprapapillary epidermis” feature, which is a feature commonly associated with the skin condition of psoriasis [39]. In the visualization in Fig. 11, the red class corresponds to a diagnosis (class label) of psoriasis, and indeed, the red instances appear on the left side of the plot, which is consistent with having a high value of f21 being subtracted from the rest of the tree. This sort of analysis can be used by a clinician to understand that this feature alone could be used to diagnose psoriasis with reasonable accuracy, and also provides greater confidence in the visualization's accuracy.

Fig. 14 further pushes the blue class away from the other classes at the expense of squashing the other classes together. This is achieved by making the top tree (x-axis) weight f3 and f32 even more strongly. It is interesting to note that all the trees analyzed so far have been strictly linear functions, utilizing only addition and subtraction. This is perhaps not unexpected: utilizing more sophisticated functions is likely to require complex trees to fully take advantage of them; for example, taking the maximum of two simple subtrees is unlikely to be more representative of the original dataset than simply adding together features..."
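A strictly linear pair of trees like those discussed above can be sketched as follows. The feature indices and signs here are illustrative stand-ins, not the paper's actual evolved trees:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 34))  # e.g. 34 Dermatology features, all scaled to [0, 1]

# Each axis of the 2-D visualisation is one evolved GP tree. The trees
# discussed above are strictly linear: they only add and subtract features.
def x_axis(X):
    return X[:, 2] + X[:, 31] - X[:, 20]  # e.g. f3 + f32 - f21 (illustrative)

def y_axis(X):
    return X[:, 7] - X[:, 5]              # e.g. f8 - f6 (illustrative)

# Map every instance to a 2-D point for plotting.
embedding = np.column_stack([x_axis(X), y_axis(X)])
print(embedding.shape)  # (100, 2)
```

Because each axis is a short sum/difference of raw features, a clinician can read off directly which features push an instance left or right, which is exactly the kind of analysis performed above for f21.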
Feature Selection for Explainable ML

Bin Wang, Wenbin Pei, Bing Xue, and Mengjie Zhang. "Evolving Local Interpretable Model-agnostic Explanations for
Deep Neural Networks in Image Classification". Proceedings of 2021 Genetic and Evolutionary Computation
Conference (GECCO 2021 Companion)

GECCO, Denver, Colorado, USA. 20-24 July 2016

Issues and Challenges

• Any problem?

[Diagram: (a) feature selection applied to the whole dataset before the training/test split; (b) feature selection applied to the training data only, with the test data held out]
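A minimal sketch of why structure (a) is dangerous: the labels below are purely random, so the true accuracy of any classifier is 50%, and anything above chance is pure selection bias. The correlation-style filter and nearest-centroid classifier are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny-sample, high-dimensional setting typical of gene expression data:
# 40 instances, 2000 features, and PURELY RANDOM labels.
n, d, k_feat, k_folds = 40, 2000, 10, 5
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, n)

def select_top_features(X, y, k):
    """Score features by |covariance| with the labels; keep the top k."""
    scores = np.abs((X - X.mean(0)).T @ (y - y.mean()))
    return np.argsort(scores)[-k:]

def nearest_centroid_acc(X_tr, y_tr, X_te, y_te):
    """Minimal classifier: assign each test point to the closer class mean."""
    c0, c1 = X_tr[y_tr == 0].mean(0), X_tr[y_tr == 1].mean(0)
    pred = (np.linalg.norm(X_te - c1, axis=1) <
            np.linalg.norm(X_te - c0, axis=1)).astype(int)
    return float((pred == y_te).mean())

folds = np.array_split(rng.permutation(n), k_folds)

def cv_accuracy(select_outside_cv):
    accs = []
    if select_outside_cv:                    # structure (a): FS sees all data
        feats = select_top_features(X, y, k_feat)
    for te in folds:
        tr = np.setdiff1d(np.arange(n), te)
        if not select_outside_cv:            # structure (b): FS inside the loop
            feats = select_top_features(X[tr], y[tr], k_feat)
        accs.append(nearest_centroid_acc(X[tr][:, feats], y[tr],
                                         X[te][:, feats], y[te]))
    return float(np.mean(accs))

biased, unbiased = cv_accuracy(True), cv_accuracy(False)
print(f"with selection bias:    {biased:.2f}")   # far above chance
print(f"without selection bias: {unbiased:.2f}") # near chance (0.5)
```

The only difference between the two runs is *where* feature selection happens; re-selecting inside each training fold is what removes the bias.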

Feature Selection/Construction Bias

• Many works on bio-data contain feature selection
  - which leads to biased results
  - conclusions might change
• If the whole dataset is used during the FS/FC process, the experiments (or evaluation) have FS/FC bias
• What if only a small number of instances is available?
  - In classification, use k-fold cross-validation
  - How to use k-fold cross-validation in FS/FC to evaluate a FS/FC system?

Feature Selection Bias

Binh Tran, Bing Xue and Mengjie Zhang. "Genetic Programming for Feature Construction and Selection in Classification on High-dimensional Data", Memetic Computing, vol 8, issue 1, pp 3-15, 2016.

[Figure 4. Structures of experiments with and without feature selection bias.]

"...the index of the features, where the indexes of the features are listed in descending order of their frequency. It would be much more interesting to discover the biological findings of these genes, but since the original meanings of the features/genes in GEMS are not given, it is impossible to perform such research. In the future, we intend to collaborate with researchers from biology to analyse the biological findings of the selected genes in depth.

4.6. Further discussions

In this section, the re-substitution estimator was used to evaluate the performance of the feature selection algorithms, which is the same as in Chuang et al. (2008) and many other existing papers (Abedini et al., 2013; Ahmed et al., 2012; Alba et al., 2007; Babaoglu et al., 2010; Huang et al., 2007; Mishra et al., 2009; Mohamad et al., 2011, 2013; Santana et al., 2010; Shen et al., 2007; Yu et al., 2009). The re-substitution estimator, in other words, means the whole dataset is used during the evolutionary feature selection process (as shown in Figure 4(a)). There is no separate unseen data to test the generality of the selected features. According to Ambroise and McLachlan (2002), there is a feature selection bias issue here, so one cannot claim that the selected features can be used for future unseen data. Feature selection bias typically happens when the dataset includes only a small number of instances, especially on gene expression data, where n-fold CV (10-CV) or LOOCV is needed. Figure 4 compares the structures of experiments with and without feature selection bias. It can be seen that with selection bias, the algorithm reports the classification performance of the (inner) CV loop, and “such results are optimistically biased and are a subtle means of training on the test set” (Kohavi and John, 1997). Therefore, the conclusions drawn from the re-substitution estimator with selection bias may be different from those without bias. This, however, has not been seriously investigated in EC for gene selection.

5. Experiment II

In this section, the second set of experiments has been conducted, where the feature selection bias issue is removed.

5.1. Performance evaluation

To avoid feature selection bias and compare the performance of the algorithms with and without bias, the second set of experiments without feature selection bias has been..."

K-CV in Each Evaluation — Inner Loop

• 3-CV as an inner loop to evaluate each feature subset
• In each evaluation, to get Acc:

[Diagram: the training set is split into three folds; for each fold, a classifier is trained with the candidate feature subset on the other two folds and tested on the held-out fold; the three accuracies are averaged (Ave Acc)]

Kohavi, Ron, and George H. John. "Wrappers for feature subset selection." Artificial Intelligence 97.1-2 (1997): 273-324.

Weakness and Issues

• Large search space:
  - bit-string/vector with a length equal to the total number of features
• Classification accuracy or existing filter measures in the fitness function, which often cannot lead to a smooth fitness landscape, or have low locality
• Poor scalability
  - the dimensionality of the search space often equals the total number of features: thousands, or even millions
  - the number of instances is large
• Generalisation issue
  - especially wrappers
  - feature construction
• Long computational time
• Feature selection or construction bias issue
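The bit-string representation and wrapper-style fitness mentioned above can be sketched as follows. The subset evaluator here is a hypothetical stand-in; a real wrapper would train a classifier on the selected columns and return its (inner-loop cross-validated) accuracy:

```python
import random

random.seed(1)
N_FEATURES = 8  # real problems: thousands+, so up to 2^N candidate subsets

def random_individual():
    """Bit-string representation: bit i set to 1 means feature i is selected."""
    return [random.randint(0, 1) for _ in range(N_FEATURES)]

def fitness(bits, accuracy_of):
    """Wrapper-style fitness: subset accuracy minus a small size penalty,
    so that smaller subsets are preferred at equal accuracy."""
    selected = [i for i, b in enumerate(bits) if b]
    if not selected:
        return 0.0
    return accuracy_of(selected) - 0.01 * len(selected)

def toy_accuracy(selected):
    """Hypothetical evaluator: rewards picking the 'informative' features."""
    informative = {0, 3, 5}
    return 0.5 + 0.15 * len(informative & set(selected))

# Random search over 50 individuals as a placeholder for an EC search loop.
best = max((random_individual() for _ in range(50)),
           key=lambda ind: fitness(ind, toy_accuracy))
print(best, round(fitness(best, toy_accuracy), 3))
```

Even this toy makes the scalability point concrete: the search space doubles with every extra feature, which is why representation and search-mechanism design matter so much.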
Future Directions

• Fitness function:
  - reduce the computational cost
  - smooth the landscape of the search space
  - improve the learning and generalisation performance
  - efficient and effective filter measures
  - hybridise wrapper and filter measures
  - surrogate models

• Representation
  - reduce the search space
  - incorporate more information about the features, e.g. relative importance of features, feature interactions or feature similarity
  - embedded feature selection or construction

• Search mechanism
  - combinatorial optimisation
  - memetic computing
  - large-scale optimisation
  - adaptive parameter control techniques

• Multi-objective feature selection or feature construction
  - how to keep non-dominated solutions
  - objective space, solution space and search space
  - distance measure, e.g. crowding distance
  - how to maintain the archive set
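Keeping non-dominated solutions, as listed above, reduces to a Pareto-dominance filter. A minimal sketch for the bi-objective accuracy/subset-size trade-off, with illustrative numbers:

```python
def dominates(a, b):
    """a dominates b if it is no worse in both objectives and better in one.
    Solutions are (error_rate, n_features) pairs; smaller is better in both."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated(solutions):
    """Keep only solutions not dominated by any other (the current front)."""
    return [s for i, s in enumerate(solutions)
            if not any(dominates(o, s)
                       for j, o in enumerate(solutions) if j != i)]

# Illustrative candidate subsets: (error rate, number of selected features).
subsets = [(0.10, 12), (0.12, 5), (0.10, 20), (0.15, 3), (0.09, 30)]
front = nondominated(subsets)
print(front)  # (0.10, 20) is dominated by (0.10, 12) and drops out
```

Algorithms such as NSGA-II build on exactly this dominance test, adding a crowding-distance measure to spread the retained solutions along the front.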

Future Directions

• Explainable machine learning:
  - increase the interpretability/understandability of the obtained feature set
  - simple models via feature selection/construction

• Feature construction
  - construct multiple features
  - both feature selection and feature construction

• Transfer learning/multi-tasking via or for feature selection and/or construction

• Instance selection and construction

• Combining EC with machine learning approaches

• Feature selection and feature construction for other machine learning tasks: clustering, regression, text mining, etc.

Evolutionary Computation Research Group

• Thanks to my colleagues, post-docs, visiting scholars, PhD and Master/Hons/Summer Research students!

Recent Research Grants and Recruiting

Recent Research Grants
• MBIE SSIF and Endeavour Fund (2019, 2020, 2021)
  - $13 million NZD (Evo and Stat learning for Aquaculture)
  - $16.2 million NZD (AI/ML for Cyber-Marine models)
  - $3 million NZD (NZ-SG DS on Wellington Trees)
  - $9.8 million NZD (unmanned aerial vehicles (drones))
• Marsden Fund (2019, 2020)
  - $707,000 NZD (Symbolic regression)
  - $707,000 NZD (Evolutionary deep learning)
  - $300,000 NZD (Interpretable models for SR)
• NSC/SfTI Fund (2019)
  - $512,300 NZD (AI for Aquaculture)
  - $550,000 NZD (LCS and Robotics)

Recruiting
• Postdoc fellows
• PhD students

Thank you