Materials Today Communications

Comparison of boosting and genetic programming techniques for prediction of tensile


strain capacity of Engineered Cementitious Composites (ECC)
--Manuscript Draft--

Manuscript Number: MTCOMM-D-24-03395

Article Type: Full Length Article

Section/Category: Materials Data Science and AI

Keywords: Machine learning; Engineered cementitious composites; Tensile strain capacity;


Fibres; Shapley additive analysis

Corresponding Author: Muhammad Javed


Ghulam Ishaq Khan Institute of Engineering Sciences and Technology
PAKISTAN

First Author: Waleed Bin Inqiad

Order of Authors: Waleed Bin Inqiad

Muhammad Javed

Muhammad Shahid Siddique

Naseer Muhammad Khan

Abstract: Plain concrete is weak in tension and has a low Tensile Strain Capacity (TSC), which significantly affects its long-term performance. To overcome this issue, Engineered Cementitious Composites (ECC) were developed by incorporating polymer fibres into the cement matrix; the fibres increase ductility and provide a higher TSC than plain concrete, so ECC has emerged as a viable alternative to brittle plain concrete. This study develops empirical prediction models that estimate the TSC of ECC without requiring extensive experimental procedures. For this purpose, two evolutionary programming techniques, Multi Expression Programming (MEP) and Gene Expression Programming (GEP), along with two boosting-based techniques, AdaBoost and Extreme Gradient Boosting (XGB), were developed using data collected from the published literature. The gathered dataset had seven input parameters, including water-to-binder ratio, fibre content, cement content, and age, and one output parameter, TSC. The error assessment of the developed models was carried out using the correlation coefficient, Mean Absolute Error (MAE), and an Objective Function (OF), among other metrics. The comparison showed that XGB has the highest accuracy, with the lowest OF value of 0.081 compared with 0.11 for AdaBoost, 0.13 for GEP, and 0.16 for MEP. Shapley additive analysis was conducted on the XGB model, since it proved to be the most accurate, and the results highlighted that fibre content, age, and water-to-binder ratio are the most important features for predicting the TSC of ECC.

Suggested Reviewers: Mohsin Ali Khan


North Dakota State University
mohsin.khan@ndsu.edu

Muhamad Izhar Shah


University of Louisiana at Lafayette
muhammad-izhar.shah1@louisiana.edu

Muhammad Ali
Universiti Teknologi PETRONAS
ali.musarat@utp.edu.my

Opposed Reviewers:


Cover letter

Muhammad Faisal Javed

Ghulam Ishaq Khan Institute, GIKi

21st March 2024

Dear Editors-in-chief,

We wish to submit an original research article entitled "Comparison of boosting and genetic
programming techniques for prediction of tensile strain capacity of Engineered Cementitious
Composites (ECC)" for consideration in your journal.

We confirm that this work is original and has not been published elsewhere, nor is it currently
under consideration for publication elsewhere.

We have no conflicts of interest to disclose.

Please address all correspondence concerning this manuscript to me at arbabfaisal@giki.edu.pk.

Thank you for your consideration of this manuscript.

Sincerely,

Muhammad Faisal Javed

Associate Professor
Department of Civil Engineering,
Ghulam Ishaq Khan Institute of Engineering Sciences and Technology,
Topi, Swabi, Khyber Pakhtunkhwa, 23640, Pakistan

Comparison of boosting and genetic programming techniques for prediction of tensile strain capacity of Engineered Cementitious Composites (ECC)

Waleed Bin Inqiad¹, Muhammad Faisal Javed²*, Muhammad Shahid Siddique¹, Naseer Muhammad Khan³,⁴
1. Department of Structural Engineering, Military College of Engineering (MCE), National
University of Science and Technology (NUST), Islamabad 44000, Pakistan;
WaleedBinInqiad@gmail.com (W.B.I), mshahidsiddique@mce.nust.edu.pk (M.S.S)
2. Department of Civil Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences
and Technology, Topi, Swabi, Khyber Pakhtunkhwa, 23640, Pakistan
3. Department of Sustainable Advanced Geomechanical Engineering, Military College of
Engineering (MCE), National University of Science and Technology (NUST), Islamabad
44000, Pakistan; nmkhan@mce.nust.edu.pk
4. MEU Research Unit, Middle East University, Amman 11831, Jordan

*Corresponding author: arbabfaisal@giki.edu.pk

Abstract: Plain concrete is weak in tension and has a low Tensile Strain Capacity (TSC), which significantly affects its long-term performance. To overcome this issue, Engineered Cementitious Composites (ECC) were developed by incorporating polymer fibres into the cement matrix; the fibres increase ductility and provide a higher TSC than plain concrete, so ECC has emerged as a viable alternative to brittle plain concrete. This study develops empirical prediction models that estimate the TSC of ECC without requiring extensive experimental procedures. For this purpose, two evolutionary programming techniques, Multi Expression Programming (MEP) and Gene Expression Programming (GEP), along with two boosting-based techniques, AdaBoost and Extreme Gradient Boosting (XGB), were developed using data collected from the published literature. The gathered dataset had seven input parameters, including water-to-binder ratio, fibre content, cement content, and age, and one output parameter, TSC. The error assessment of the developed models was carried out using the correlation coefficient, Mean Absolute Error (MAE), and an Objective Function (OF), among other metrics. The comparison showed that XGB has the highest accuracy, with the lowest OF value of 0.081 compared with 0.11 for AdaBoost, 0.13 for GEP, and 0.16 for MEP. Shapley additive analysis was conducted on the XGB model, since it proved to be the most accurate, and the results highlighted that fibre content, age, and water-to-binder ratio are the most important features for predicting the TSC of ECC.
Keywords: Machine learning; Engineered cementitious composites; Tensile strain
capacity; Fibres; Shapley additive analysis.

1. Introduction
Concrete is undoubtedly the most widely used construction material due to its significant advantages over other building materials such as bricks, stones, and steel. Researchers in the field of concrete technology have been constantly working to enhance the desirable properties of concrete, and the development of ECC is an outcome of this research. Since plain concrete exhibits a brittle nature and has low tensile strength, researchers have tried to improve these properties, and the most effective way to overcome these limitations is the incorporation of fibres into the cement mortar, resulting in a fibre-reinforced mortar commonly known as ECC. The randomized arrangement of fibres in the cement matrix improves crack resistance, strain capacity, and tensile strength [1], thereby improving the post-cracking behaviour of concrete, limiting the propagation of small cracks, and enhancing the overall ductility of ECC [2]. An overview of desirable ECC properties is given in Figure 1.

ECC was first developed in 1992 by Li et al. [3,4] and has since emerged as a viable alternative to plain concrete. ECC is the preferred construction material for many maintenance activities, such as retrofitting structures damaged by seismic activity, patching slabs and concrete pavements, and restoring dams, owing to its resistance against spalling and micro-cracking [5-8]. The ductility of ECC enables it to undergo significant deformation during a seismic event without cracking or collapsing, which makes it a suitable option for earthquake-resistant construction and other structures that require high durability.

Generally, an ECC mixture comprises cement, sand, fly ash, fibres, and a superplasticizer, and the proportion of each of these components affects its properties [9]. The most widely used fibres in the industry include polyethylene (PE), polypropylene (PP), and polyvinyl alcohol (PVA) fibres. Since PVA fibres have a higher tensile strength than PP fibres, and the chief reason for using fibres is to enhance tensile behaviour, PVA is generally preferred over PP. ECC containing PVA fibres also exhibits greater flexural strength and toughness than ECC containing PP fibres [10], while the compressive strength of ECC is comparable to that of ordinary concrete [11]. The 28-day compressive strength is widely used in the civil engineering industry as an indication of concrete's general performance because most other properties of concrete are linked to it. In the case of ECC, however, the investigation of properties such as TSC and tensile strength is more crucial, since the basic idea behind developing ECC is to obtain a material with higher tensile performance than plain concrete. The determination of these properties requires extensive experimental work, including preparing many samples, curing, and testing them, which consumes time and resources [12-14]. Thus, a method for the swift estimation of the TSC of ECC is needed to foster its use in the industry without delay. It is, however, a significant challenge to estimate the TSC of ECC without experimental work because the mixture proportions of ECC strongly affect its TSC. The conventional method of determining the TSC of ECC is no longer practical given the extensive research conducted in this field using various superplasticizers, fibre types, and cement types. Therefore, a cost-effective and time-efficient procedure to forecast the experimental outputs is required, particularly for the TSC of ECC because of its high sensitivity to the mixture proportions. Such a prediction system can be developed using data-driven models owing to their robustness in modelling non-linear relationships that cannot be captured by conventional regression techniques [15-19].

Figure 1. Benefits of Engineered Cementitious Composites (ECC).

1.1. Overview of Artificial Intelligence (AI)

With the recent advancements in the realm of artificial intelligence (AI), the construction industry has transformed massively and is making a transition towards sustainable materials and data-driven methodologies. In the expansive realm of civil engineering, AI techniques are used to optimize material utilization and enhance the monitoring of critical infrastructure, thereby improving the overall safety and efficiency of the construction industry. Machine learning (ML) is a subset of AI in which machines learn patterns from data based on mathematical and statistical principles and make predictions on new data without requiring explicit human input. There are two prevalent types of ML algorithms: standalone approaches and ensemble approaches. Ensemble approaches combine two or more algorithms to enhance the learning ability of a particular algorithm [20]. The most common ensemble approaches are boosting and bagging, as shown in Figure 2. These techniques train multiple base learners and then combine their predictions to obtain the final result. Recently, ML-based prediction of the properties of concrete and cement composites has emerged as a feasible way to foster sustainability in civil engineering, owing to its effectiveness, accuracy, and ability to model the non-linear relationships between inputs and outputs [21-24]. ML algorithms such as MEP, GEP, XGB, and neural networks have been widely used to forecast various properties of concrete composites, slope failure susceptibility, and soil compaction and qualification parameters [25-35].
Figure 2. Artificial Intelligence (AI) and its subtypes.

2. Literature review
ML techniques have emerged as an effective tool for developing empirical prediction models for novel concrete composites, fostering their widespread use by providing efficient ways to calculate their engineering properties. For example, Nguyen et al. [36] employed several tree-based ML models to forecast the strength of eco-friendly geopolymer concrete. The authors utilized a dataset of 110 instances with 14 input parameters, and the employed algorithms included random forest (RF), decision tree (DT), and XGB. The results highlighted the superior accuracy of XGB compared with the other algorithms. Similarly, motivated by the advantages of fibre-reinforced concrete, Wang et al. [37] utilized several ensemble ML techniques, namely XGB, light gradient boosting machine (LGBM), CatBoost, and AdaBoost, to predict the compressive strength of fibre-reinforced concrete modified with nano silica on a dataset of 175 points. The study considered six input variables, and the comparative analysis showed that the XGB algorithm combined with a grey-wolf optimizer achieved the highest accuracy for strength estimation. In the same way, Lee et al. [38] predicted the strength of high-performance concrete using different boosting-based algorithms and a grid search approach on an extensive dataset of more than 1000 points. Moreover, Amin et al. [39] predicted the strength of eco-friendly and sustainable concrete made with rice husk ash in place of cement. The study was conducted using more than 1200 data points with seven input variables and a single output variable. The authors utilized evolutionary algorithms along with various boosting and bagging techniques, and the analysis showed that the RF algorithm proved to be the most accurate.

In addition to studies on the prediction of concrete properties, several studies address the prediction of properties of cement mortars. For instance, Oey et al. [40] utilized ML to predict the 28-day compressive strength of cementitious systems from their mixture composition using a dataset of 200 experiments obtained from different laboratories. Also, Amin et al. [41] used ML to forecast the flexural strength of mortar containing glass powder in place of cement for achieving sustainability goals; the authors revealed that a gradient boosting regressor exhibited greater accuracy than the other algorithms employed in the study. Moreover, Marangu [42] trained SVM and ANN models to estimate the strength of mortar made with calcined clay cement from experimental data and reported that the ANN showed a correlation of 0.95 between actual and predicted strength values compared with 0.88 for the SVM. Furthermore, Gayathri et al. [43] performed a comparative analysis of different ML algorithms, including RF, linear regression (LR), AdaBoost, DT, XGB, and several others, to predict the compressive strength of cement mortar. The models were trained on a dataset of 424 points with six inputs and one output parameter, and XGB turned out to be the most accurate algorithm while LR was the least accurate. The only study attributed to predicting the TSC of ECC containing fibres was conducted recently by Faraj et al. [44], in which the authors employed multi-linear regression (MLR), ANN, and three other algorithms and revealed that the ANN outperformed all other algorithms.

3. Research significance
As discussed in the introduction, ECC is widely used in structures where increased ductility is required. However, it is evident from the overview of the relevant literature in section 2 that few studies address the problem of estimating the TSC of ECC with fibre content and fly ash as input parameters, particularly using the algorithms considered in this study. This research gap limits the utilization of ECC in the construction industry. Thus, this study was conducted to provide the construction industry with reliable prediction models for forecasting the TSC of ECC. Another important contribution of this study is that it compares the most popular evolutionary techniques (MEP, GEP) and boosting techniques (XGB, AdaBoost) to establish the relative strengths and weaknesses of each for the prediction of the TSC of ECC. The rationale behind using the MEP and GEP techniques stems from their ability to express the output as an empirical equation. For this reason, MEP and GEP are widely regarded as grey-box models, in contrast to the black-box XGB and AdaBoost models, which provide no insight into the internal process by which predictions are made [45]. Furthermore, the techniques employed in this study do not require pre-optimization of the model architecture, as ANNs do, thus reducing the time and computing power required to train the algorithms [46].

4. Research methodology
The first step after the identification of the problem is to collect data from the internationally published literature for model development and to specify the input parameters. An extensive literature search was therefore performed, resulting in 122 gathered points, which were arranged in an Excel sheet. Then, various statistical analyses were performed on the data, as detailed in section 5. After the data analysis and splitting, the ML models were developed using the training data, while the testing data was later used to assess their accuracy. After the successful development of the data-driven models, their accuracy was checked using several commonly used error metrics, and their accuracies were compared to identify the most accurate algorithm so that further analysis could be performed on it. The methodology of the study is illustrated in Figure 3.

Figure 3. Overall methodology of the study.

4.1. Genetic Programming Techniques

John Holland proposed genetic algorithms, which are essentially a simplified form of the natural evolution process in which the solution is represented by a fixed-length chromosome. Genetic programming (GP) was initially developed by Cramer and further modified by Koza [47]. GP is an extension of the genetic algorithm that involves developing a population of individuals and selecting among them according to a specific fitness function. The main difference between GP and the genetic algorithm lies in how solutions are represented: a genetic algorithm evolves a set of numbers representing the solution, whereas GP evolves a computer program that solves the problem by making use of Darwin's principles of natural selection and genetic mutation [48,49]. Moreover, GP uses non-linear trees of various shapes and sizes, in contrast to the fixed-length solutions of the genetic algorithm. There are many subtypes of GP, but this study outlines the two most widely used subtypes for regression problems in the realm of concrete and civil engineering, i.e., GEP and MEP.
4.1.1. Gene expression programming (GEP)
Ferreira [50] devised a modified GP version on the basis of an evolutionary population algorithm and called it GEP. It utilizes a simple linear chromosome of fixed length, as opposed to the parse-tree structure used in GP. GEP is based on the genotype/phenotype principle, in which a simple genome is used to retain and transfer the genetic information and a phenotype is used to adapt to the complex environment posed by the problem [51]. It inherits the fixed-length linear chromosomes of genetic algorithms and the non-linear expression trees of GP. A simple criterion of genetic variety is used in GEP owing to its mechanism of genetic variation at the chromosome level. Moreover, its multi-gene nature allows the formation of complex, non-linear programs.

The GEP algorithm consists of five sets: functions, terminals, fitness functions, parameters, and the selection criteria for termination of the algorithm. It encodes each individual as a linear string called a genome. The size, shape, and linearity of these genomes are later changed, giving rise to expression trees (ETs). A generic representation of an ET is given in Figure 4. These ETs help to generate new individuals with greater accuracy for solving the problem. GEP uses the genetic processes of mutation and crossover, analogous to biological evolution, to create increasingly accurate ETs, which converge the model towards the most accurate solution while removing inaccurate individuals from the population [52]. The schematic representation of these genetic processes is given in Figure 4.

The main idea of GEP revolves around using a set of genes, each representing a smaller piece of code, to represent the solution. Several primitive functions, ranging from simple arithmetic operations such as addition and subtraction, are used to build these genes, and various evolutionary processes combine these genes in different ways to create a complete program [53]. The process of solving a problem with GEP begins with creating a random population of computer programs called chromosomes. These chromosomes are evaluated using a fitness function to check their ability to accurately solve the problem. The worst-performing chromosomes are discarded from the population, while the better-performing ones are used to develop a new generation of chromosomes [54]. This process of generating random chromosomes, evaluating their fitness, and using them to create another generation is repeated several times until the model achieves the highest possible accuracy. The chromosomes are continually improved and modified by the algorithm to come as close to the actual values as possible [55]. The flowchart methodology of the GEP algorithm is given in Figure 5.
Figure 4. Representation of expression tree and genetic processes.
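The evolutionary cycle described above (random population, fitness evaluation, discarding poor individuals, and creating offspring through crossover and mutation) is common to both GEP and MEP. The following is a hedged, illustrative skeleton of that generic loop only, not the actual GeneXproTools or MEPX implementations; the individual representation and the genetic operators are placeholders supplied by the caller.

```python
# Generic evolutionary loop sketch (illustration only, assumed interfaces):
# `fitness` returns an error to minimise, `random_individual` creates one
# candidate program, `crossover`/`mutate` are the genetic operators.
import random

def evolve(fitness, random_individual, crossover, mutate,
           pop_size=50, generations=500):
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)              # rank by fitness (lower is better)
        survivors = population[: pop_size // 2]   # discard the worst-performing half
        offspring = []
        while len(survivors) + len(offspring) < pop_size:
            parent_1, parent_2 = random.sample(survivors, 2)   # tournament-style parents
            offspring.append(mutate(crossover(parent_1, parent_2)))
        population = survivors + offspring        # next generation
    return min(population, key=fitness)           # best evolved program
```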

4.1.2. Multi expression programming (MEP)


MEP is also a subtype of GP that uses linear chromosomes to solve the problem. The basic idea of MEP is similar to that of GEP, with the exception that MEP can encode multiple results in a single chromosome, thus increasing the accuracy of the algorithm [56]. Out of all the chromosomes in the population developed by MEP, the chromosome having the highest accuracy according to a pre-determined fitness function is used to represent the solution of the problem. Two potential parents are selected by binary tournament and recombined to make new offspring [57]. The developed offspring are modified using the genetic processes of crossover and mutation and are in turn used to make new offspring. Crossover combines two individuals to create a new one, while mutation alters an existing individual. The schematic representation of both processes is given in Figure 4.

The MEP algorithm can be tuned using a wide range of parameters, including code length, crossover probability, subpopulation size, and the mathematical functions to be included in the final equation. The selection of optimal values for these parameters is necessary to increase the likelihood of an accurate model [58]. However, this increase in accuracy may require greater computational power: if the model has more subpopulations, it will require a longer time to complete the evolution process. Similarly, the code length is directly related to the length of the resulting MEP equation.
The MEP algorithm starts by developing a population of random expressions called chromosomes. The algorithm then measures the ability of these expressions to solve the problem by means of a fitness function [59]. The expressions giving accurate results are selected for the next generation to create even more accurate expressions, while inaccurate expressions are discarded. New expressions are created from existing ones through mutation, crossover, and recombination [60], by means of which the MEP algorithm explores new and potentially better solutions. This cycle of generating random expressions, evaluating their fitness, and using them to create a new set of expressions is termed one iteration or generation of MEP. The process repeats until the model discovers an expression whose predictions are as close to the actual values as possible or a stopping criterion is met. The flowchart methodology of MEP is given in Figure 5.

Figure 5. Prediction methodology followed by evolutionary techniques.

4.2. Boosting Techniques

Boosting refers to the process of repeatedly running a weak learning algorithm on different variations of the data and combining the resulting predictors to compute the final prediction of the model. The general boosting framework is given in Figure 6. Notice from the figure that the boosting process consists of sequential learning of predictors: when boosting starts, the algorithm first builds a model from the data, and the predictions of the previous tree are then used to build increasingly accurate subsequent trees [38]. Boosting has been extensively applied to tree-based algorithms for regression tasks, which has led to the development of some of the most accurate ML algorithms used today, prominently including XGB, AdaBoost, and CatBoost. In this study, two boosting-based algorithms, XGB and AdaBoost, are employed to predict the tensile strain capacity of ECC.

Figure 6. Schematic representation of boosting technique.

4.2.1. Extreme gradient boosting (XGB)


Extreme gradient boosting, also known as XGBoost and abbreviated as XGB, is a relatively recently developed boosting-based algorithm that is very popular for its remarkable accuracy in both regression and classification problems. Since its development, it has found applications in fields such as medicine, engineering, and business [48,49]. It is a highly efficient and scalable implementation of the gradient boosting technique. It leverages regularization to reduce model overfitting and parallel processing to efficiently combine several weak learners into a powerful learner [63]. It uses decision trees as base learners and applies gradient boosting to them to build more efficient trees. The trees used in XGB provide useful information about the importance of the various variables in forecasting the output, and XGB enhances this ability by optimizing a loss function to reduce the residual of previously generated trees [64]. The model's training speed is also remarkably fast owing to parallelization when selecting splitting points for enumeration. It also has the ability to set sample weights through the first derivative g and second derivative h.
The gradient boosting framework is traditionally described as:

$$F_m(x) = \sum_{i=1}^{m} \beta_i f_i(x) \quad (1)$$

In Equation (1), m denotes the number of weak learners, $f_i$ represents a weak learner, $\beta_i$ is its coefficient, and $F_m$ is the model. The goal of the XGB algorithm is to obtain a model whose loss function is as low as possible, as represented by Equation (2):

$$F_m = \arg\min \sum_{i=1}^{n} L\big(y_i, F_m(x_i)\big) = \arg\min_{\beta_m,\, f_m} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \beta_m f_m(x_i)\big) \quad (2)$$
Note that F is a constant function at the beginning, and at each step one learner and its coefficient are added to improve the accuracy of F. Gradient boosting minimizes the loss function by making the newly added term equal to the negative gradient of the loss function at each step. Thus, the new term can be written as:

$$\beta_m f_m(x_i) = \gamma\,\frac{\partial L\big(y_i, F_{m-1}(x_i)\big)}{\partial F_{m-1}(x_i)} \quad (3)$$

In Equation (3), $\gamma$ represents the step size, having the opposite sign of the negative gradient and $\beta$. The derivative term, obtained by removing $\gamma$, is called the pseudo-residual (also known as the gradient):

$$R_{im} = \frac{\partial L\big(y_i, F_{m-1}(x_i)\big)}{\partial F_{m-1}(x_i)} \quad (4)$$

By virtue of Equation (4), the gradient or pseudo-residual R can be computed for each training sample, because the model of the previous step, $F_{m-1}$, is known for every training sample. In this way, a weak learner $f_m$ can be trained on the x and y values of the training samples, and the trained weak learner that minimizes the loss function, together with $\gamma$, can be obtained. Thus, the final general model can be represented by Equation (5):

$$F_m(x) = F_{m-1}(x) + \beta_m f_m(x) \quad (5)$$

The XGB algorithm has great potential as a computing tool because it combines gradient boosting with the DT approach to enhance the accuracy of the prediction process. The algorithm combines various weak learners into a stronger learner through an iterative process [65] in which new trees are added in sequence to reduce the residual of the previous trees. Because of this feature, the algorithm fits the training data quickly, but it can also lead to overfitting if the residual reduction achieved by adding new trees happens too fast. The algorithm therefore limits the speed at which new trees are added by applying a factor to the predictions of newly generated trees, called the shrinking factor or learning rate [66]. The literature suggests values between 0.01 and 1, and it is recommended to keep this factor as small as possible so that the algorithm gradually adds new trees to reduce the residual of the previous trees without overfitting the training data [67]. The schematic representation of the XGB prediction process is given in Figure 7.
Figure 7. Prediction methodology followed by boosting techniques.
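The sequential residual fitting of Equations (1)-(5) and the role of the learning rate can be illustrated with a bare-bones sketch. This is a hedged, from-scratch illustration of gradient boosting with shrinkage for a squared loss, not the optimized XGB library itself; the default values mirror the Table 3 settings only for concreteness.

```python
# Minimal gradient boosting sketch: each tree is fitted to the pseudo-residuals
# of the current model, and its contribution is shrunk by the learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=60, learning_rate=0.08, max_depth=8):
    y = np.asarray(y, dtype=float)
    f0 = float(np.mean(y))                       # F_0: constant initial model
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residual = y - prediction                # pseudo-residual for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                    # weak learner f_m
        prediction += learning_rate * tree.predict(X)   # F_m = F_{m-1} + beta_m * f_m
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.08):
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)
```

A smaller learning rate forces more trees to reach the same training residual, which is why the shrinking factor is the main guard against overfitting in this scheme.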

4.2.2. AdaBoost
The DT algorithm is a practical and efficient algorithm that trains a decision tree to approximate a known target value. A general form of a tree is given in Figure 4: a sample starts from the root node and travels along branches containing functional nodes until it reaches a leaf (terminal) node. However, a single decision tree has limited capacity to predict accurately, and it is therefore commonly regarded as a weak learner. Schapire argued in 1990 that a strong learner can be obtained by combining several weak learners [68], which laid the groundwork for the boosting technique. Every time the AdaBoost algorithm adds a new tree, only the strongest tree is added and weaker trees are eliminated, and the algorithm improves its accuracy by repeating this process several times. In simpler terms, AdaBoost improves a weak regression process through continuous training. The training data fed to the algorithm is used to create the first weak learner, which then makes predictions on the training data. This initial learner predicts some values accurately and gives large errors for others. The samples for which the tree fails to make accurate predictions are combined with untrained data to form a new training sample, which is used to build another tree. The samples wrongly predicted by this second tree are again combined with untrained data to develop a third tree, and this process is repeated until an accurate learner is obtained. The AdaBoost algorithm also assigns weights to samples to increase the accuracy of the algorithm [69]: more weight is assigned to inaccurately predicted samples so that the algorithm pays more attention to correcting them [20]. The general framework of the AdaBoost algorithm is given by Equation (6):

$$F_n(x) = F_{m-1}(x) + \arg\min_{h} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + h(x_i)\big) \quad (6)$$

In the above equation, $F_n(x)$ represents the overall model, $F_{m-1}(x)$ is the previously obtained value, $h(x_i)$ is the newly added tree, and $y_i$ is the target value of the i-th sample. It is imperative to note that, while training the basic tree model, the weight distribution of each sample in the dataset must be adjusted. The training results vary as the training data changes, and they are eventually combined to give the final result. The overall AdaBoost model development process is given in Figure 7.

5. Data collection and analysis


To develop accurate and generalized ML models, it is imperative to collect data from various sources covering multiple input parameters. An extensive literature search was therefore performed, and a database of 122 points was gathered from the internationally published literature [70-92]. The data was collected from studies published across the world to mitigate bias. The collected dataset had seven input parameters, namely cement content, fly ash, water-to-binder ratio, quantity of fine aggregate (sand), superplasticizer, fibre content, and age, and one output parameter, TSC. It is also necessary to select the most influential input parameters that strongly affect the output parameter in order to create accurate and widely applicable models; the input selection was therefore based on recommendations from the previous literature [44].

5.1. Data description and splitting


Following suggestions in the literature, the gathered dataset was split into a training subset (70%) and a testing subset (30%), used respectively for training the ML algorithms and testing their accuracy [93,94]. The testing data is used to assess the accuracy of the developed models after they have been trained on the training set; this practice ensures that the models predict well on unseen data [95,96]. The statistical summary of the overall dataset used in the current study is given in Table 1.
Table 1. Statistical description of dataset.

| Statistic | Cement (kg/m3) | Fly ash (kg/m3) | W/b (-) | Sand (kg/m3) | Superplasticizer (kg/m3) | Fibre content (kg/m3) | Age (days) | Tensile Strain Capacity (%) |
| Symbol | d0 | d1 | d2 | d3 | d4 | d5 | d6 | y |
| Maximum | 963 | 1063 | 0.58 | 838 | 23 | 29 | 90 | 10.75 |
| Minimum | 190 | 0 | 0.19 | 380 | 2.0 | 0 | 3 | 0.02 |
| Mean | 430.68 | 787.77 | 0.27 | 470.7 | 7.71 | 22.93 | 32.49 | 3.44 |
| Standard deviation | 123.06 | 149.01 | 0.046 | 68.58 | 4.84 | 4.62 | 22.81 | 2.34 |
| 25th percentile | 368 | 684 | 0.25 | 454 | 5 | 19 | 28 | 2.377 |
| 50th percentile | 398 | 807 | 0.26 | 456 | 5.80 | 26 | 28 | 3.1 |
| 75th percentile | 502 | 886 | 0.56 | 838 | 23 | 29 | 90 | 10.75 |
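As a concrete illustration of the 70/30 split described above, the following hedged sketch loads the collected database and splits it with scikit-learn; the file name, column labels, and random seed are assumptions made only for illustration.

```python
# Illustrative 70 % / 30 % split of the 122 collected ECC mixes (assumed file
# and column names; d0..d6 follow the symbols in Table 1).
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_excel("ecc_tsc_dataset.xlsx")        # assumed name of the Excel sheet
X = data[["Cement", "Fly ash", "W/b", "Sand",
          "Superplasticizer", "Fiber content", "Age"]]   # seven inputs (d0..d6)
y = data["TSC"]                                     # tensile strain capacity (%)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)          # 70 % training, 30 % testing
```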

5.2. Data correlation and distribution


Figure 8 presents the Pearson correlation matrix developed for the dataset used in the current study. It depicts the correlation values between all inputs and the output for all possible variable combinations and is frequently used to assess the strength and direction of the relationship between different variables [97]. Each variable combination is assigned a correlation value ranging from -1 to 1, with 1 representing a perfect positive linear relationship between two variables and -1 representing a perfect negative one, because the correlation value between two variables is a measure of the linear relationship between them. The correlation matrix is diagonally symmetrical because corresponding entries of the lower and upper triangular parts relate the same pair of parameters. A correlation value close to zero indicates little or no linear correlation between two variables, while a value near one indicates a strong linear correlation. The sign of the correlation coefficient indicates the effect of an increase in one variable on the other: a positive correlation indicates that an increase in one variable is accompanied by an increase in the other, while a negative correlation implies that an increase in one variable is accompanied by a decrease in the other. Generally, a value greater than 0.8 indicates an excellent correlation between two variables [98]. It is important to check the interdependence of the variables involved in the model-building process before starting the actual model training, because highly correlated input variables can give rise to a problem known as multi-collinearity [99]. It is therefore advised that the correlation between the different input variable combinations be kept below 0.8 to avoid the risk of multi-collinearity [100,101]. Notice from Figure 8 that the correlation values for most of the variable combinations, whether positive or negative, are below the recommended value of 0.8, which means that the risk of multi-collinearity should not arise during model development.
Lower-triangular Pearson correlation values shown in Figure 8:

|  | Cement | Fly ash | W/b | Sand | Superplasticizer | Fibre content | Age | Tensile strain capacity |
| Cement | 1 |  |  |  |  |  |  |  |
| Fly ash | -0.84 | 1 |  |  |  |  |  |  |
| W/b | 0.27 | -0.24 | 1 |  |  |  |  |  |
| Sand | 0.20 | -0.52 | 0.070 | 1 |  |  |  |  |
| Superplasticizer | 0.13 | -0.40 | -0.13 | 0.58 | 1 |  |  |  |
| Fibre content | 0.035 | -0.075 | -0.066 | -0.080 | 0.092 | 1 |  |  |
| Age | -0.074 | -0.0093 | -0.076 | -0.011 | 0.095 | 0.069 | 1 |  |
| Tensile strain capacity | 0.15 | -0.23 | -0.15 | 0.22 | 0.20 | 0.029 | -0.100 | 1 |

Figure 8. Correlation matrix of input and output variables.
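The multicollinearity check described above can be reproduced in a few lines. This hedged sketch assumes the input DataFrame `X` from the earlier split sketch and simply flags any input pair whose absolute Pearson correlation exceeds the recommended limit of 0.8.

```python
# Compute the Pearson correlation matrix of the inputs and flag pairs above 0.8.
corr = X.corr(method="pearson")
high_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.8
]
print(high_pairs)   # per Figure 8, only cement vs fly ash (-0.84) exceeds the limit
```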

Data distribution is another important factor to consider before building data-driven models, because the quality of the developed models directly depends upon the quality of the collected data. It has been suggested that a wider range for the input and output variables results in more generalized and robust models that can be applied to a broad range of inputs [102]. The distribution of the variables involved in the current study is therefore shown by means of violin plots in Figure 9. The white dot represents the median of the values and the thick grey bar indicates the interquartile range, while the total area covered by the violin indicates the density of the values; the extended line shows the range within 1.5 times the interquartile range. Notice from the figure that the input variables are spread across a wide range of values. The cement and fly ash contents used in the current study reach maximum values of about 1000 kg/m3 and 1200 kg/m3 respectively, and the water-to-binder ratio spans roughly 0.2 to 0.6; a similar trend is followed by the other input variables. Furthermore, the output variable ranges from about 0% to 12%. It can thus be inferred that the variables involved in the development of the ML models are spread across a wide range, which will result in robust and widely applicable models for multiple data configurations.

[Figure 9 consists of violin plots (25%-75% box, range within 1.5 IQR, and median) for cement, fly ash, w/b, sand, superplasticizer, fibre content, age, and tensile strain capacity.]

Figure 9. Distribution of data by means of violin plots.
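For completeness, the violin plots in Figure 9 can be produced with standard plotting tools; the sketch below is only an illustration (seaborn is an assumption, and `data`/`X` refer to the objects assumed in the earlier split sketch).

```python
# Illustrative violin plots of the seven inputs and the output (one panel each).
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(2, 4, figsize=(14, 6))
for ax, column in zip(axes.ravel(), list(X.columns) + ["TSC"]):
    sns.violinplot(y=data[column], ax=ax, inner="box")   # white dot marks the median
    ax.set_title(column)
plt.tight_layout()
plt.show()
```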

6. Performance assessment
The performance of the deployed ML models in both the training and testing phases was assessed using the metrics listed in Table 2. It is necessary to evaluate the models using error metrics to make sure that they effectively solve the problem and do not give abnormally large errors [103]. The error evaluation criteria given in Table 2 show the range of values for these metrics and the recommended values for a good prediction model. The coefficient of correlation (R) is used to assess the general accuracy of ML models [94]. However, it should be noted that R is not sensitive to multiplication or division of the output by a constant, so other error metrics must be used alongside it [104]. The metrics MAE and RMSE are widely used for evaluating ML models: MAE depicts the average error between the actual and model-predicted values, while RMSE is used as an indication of larger errors in the model. Similarly, the a20-index is a widely used metric that indicates how much the predictions deviate from the actual values; it gives the proportion of predictions that lie within ±20% of the actual values [105]. Moreover, two further metrics, the performance index (PI) and objective function (OF), are used to assess the overall performance of the ML models. PI considers the collective effect of R and the relative root mean squared error (RRMSE), while OF additionally takes into account the relative proportions of data in the training and testing sets; both should be as close to zero as possible for a model to be deemed acceptable. These criteria are used to assess the performance of the models and check their suitability for predicting the TSC of ECC.
Table 2. Summary of error evaluation criteria.

| No. | Metric | Abbreviation | Formula | Range | Suggested value |
| 1. | Coefficient of correlation | R | $R = \dfrac{N\sum xy - (\sum x)(\sum y)}{\sqrt{\left(N\sum x^2 - (\sum x)^2\right)\left(N\sum y^2 - (\sum y)^2\right)}}$ | 0 to 1 | R > 0.8 |
| 2. | Root mean square error | RMSE | $RMSE = \sqrt{\dfrac{\sum (x-y)^2}{N}}$ | 0 to +∞ | Close to zero |
| 3. | Mean absolute error | MAE | $MAE = \dfrac{\sum \lvert x-y \rvert}{N}$ | 0 to +∞ | Close to zero |
| 4. | a20-index | a20 | $a20 = \dfrac{n_{20}}{n}$ | 0 to 1 | Close to one |
| 5. | Performance index | PI | $PI = \dfrac{RRMSE}{1+R}$ | 0 to +∞ | PI < 0.2 |
| 6. | Objective function | OF | $OF = \left(\dfrac{N_{Training}-N_{Testing}}{N}\right)PI_{Training} + 2\left(\dfrac{N_{Testing}}{N}\right)PI_{Testing}$ | 0 to +∞ | OF < 0.2 |
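The Table 2 metrics are straightforward to compute. The following hedged sketch implements them as reconstructed above; taking RRMSE as the RMSE normalised by the mean of the actual values is an assumption consistent with the PI definition, and n20 counts predictions within ±20% of the actual value.

```python
# Error metrics of Table 2 for one data subset (x = actual, y = predicted).
import numpy as np

def error_metrics(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    r = np.corrcoef(actual, predicted)[0, 1]                 # correlation coefficient R
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    mae = np.mean(np.abs(actual - predicted))
    ratio = predicted / actual
    a20 = np.mean((ratio >= 0.8) & (ratio <= 1.2))           # share within +/-20 %
    rrmse = rmse / np.mean(actual)                           # assumed RRMSE definition
    pi = rrmse / (1.0 + r)                                   # performance index
    return {"R": r, "RMSE": rmse, "MAE": mae, "a20": a20, "PI": pi}

def objective_function(pi_train, pi_test, n_train, n_test):
    n = n_train + n_test
    return ((n_train - n_test) / n) * pi_train + 2 * (n_test / n) * pi_test
```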

7. Results and discussion

7.1. GEP result


The GEP model was developed using the specialized software package GeneXproTools. Various fitting parameters of GEP need to be specified before the model starts the training process, and it is necessary to choose the values of these hyperparameters carefully because they directly affect the accuracy of the resulting equation. These hyperparameters were chosen using recommendations from previous similar studies together with a trial-and-error method [106]. The model development started with a head size of 5 and 10 chromosomes, and these were varied until the model with the highest accuracy was reached at a head size of 20 and 50 chromosomes. The complete set of hyperparameters of the GEP model employed in this study is given in Table 3.

Notice from Table 3 that only the simple arithmetic functions (+, −, ×, ÷) were selected for inclusion in the final equation, in an attempt to keep the resulting equation as simple as possible. Addition was chosen as the linking function, which means that the sub-expression from each gene is added to yield the final equation. The resulting GEP equation is given by Equation (7), and the predictive ability of the GEP equation is demonstrated by means of a curve fitting plot in Figure 10 (a). Notice from the plot that GEP predicted the TSC of ECC accurately at most points, and the difference between the real and model-predicted values is small at the remaining points, which depicts the robustness of the GEP algorithm.
$$y = A + B + C \quad (7)$$

where

$$A = \left[\frac{d_2\left(\dfrac{d_1}{d_2}\times 4.32 - 9.15\right)}{d_4 - \sqrt{d_0}}\right] + \left[\frac{\dfrac{d_2}{4.32 - \sqrt{d_5}}}{d_3 - \left[\sqrt{d_6}\times(d_5 - d_6)\right]}\right]$$

$$B = 3.01 + \sqrt{\frac{d_5}{d_6}}$$

$$C = \frac{\sqrt{4.744}}{\dfrac{-5.48 - \left\{\left(\dfrac{d_4}{\sqrt{d_2}}\right)\times 6.770\right\}}{\left(\dfrac{d_1}{d_5}\right) + \left(\dfrac{\sqrt{d_2}}{d_1 + 0.241}\right)} - \dfrac{d_4}{4.744}}$$

Table 3. Hyperparameter settings of ML models.

Parameters Settings
GEP Parameters
Constants per Gene 5
No. of Genes 5
Linking Function Addition
Head Size 20
No. of Chromosomes 50
Functions +, −,×,÷
MEP Parameters
Number of Generations 500
Subpopulation Size 200
Runs 10
Crossover Probability 0.8
No. of Subpopulations 500
Functions +, −,×,÷
Code Length 50
AdaBoost Parameters
Learning rate 0.01
n_estimators 100
Max. depth 10
XGB Parameters
Learning rate 0.08
n_estimators 60
Max. depth 8

The error metrics described in section 6, calculated for both GEP data subsets, are given in Table 4. Firstly, notice that the R values for both the training and testing phases are greater than the recommended value of 0.8, which means that the algorithm is generally accurate according to R. The MAE values are 0.77 and 0.588 for the training and testing sets respectively, showing that there is a small difference between the actual and GEP-predicted values, and the RMSE values follow the same trend. Moreover, the performance index values are 0.145 for training and 0.12 for testing, which are below the upper limit of 0.2. Furthermore, the OF value being close to zero (0.130) also shows that the GEP model is overall satisfactory for predicting the TSC of ECC.

7.2. MEP result


Similar to the GEP algorithm, MEP was employed using the specialized software MEPX 2021.05.18.0. The MEP algorithm also requires specifying the values of several hyperparameters before the model can be trained on the training dataset. The training started with an initial value of 100 generations, which was varied along with the other parameters until the model with the lowest RMSE and highest R value was reached. The optimal set of MEP hyperparameters used in the current study is given in Table 3. Increasing the subpopulation size and the number of generations generally increases the accuracy of the MEP model, but it also increases the required computational power and the complexity of the equation [96], so these values must be chosen carefully. The code length parameter is directly related to the length of the resulting MEP equation and should likewise be chosen carefully. Moreover, the crossover probability defines the probability of an offspring undergoing genetic crossover; in the current study its value was chosen as 0.8, meaning that 80 out of every 100 offspring undergo crossover.

The result of the MEP algorithm is given by Equation (8). Notice that this equation is also constructed using only simple arithmetic operations for simplicity, and it is more compact than the equation developed by GEP. The predictive capability of the MEP algorithm can be visualized from the curve fitting plot given in Figure 10 (b). Notice from the plot that the MEP algorithm failed to predict accurately at several points, resulting in large errors between the real and MEP-predicted values. Thus, it can be concluded that, while the resulting MEP equation is computationally simple, it is not as accurate as the GEP equation.
$$y = \left\{\frac{d_5}{d_0} - d_6 + \left(d_3 + \frac{d_6 + d_0}{d_6}\right)\right\} + \left\{\frac{d_6 + \left(d_3 + \dfrac{d_6 + d_0}{d_6}\right)}{\left(\dfrac{d_6 + d_0 + d_1}{2d_1}\right)\Big/ d_2}\right\} \quad (8)$$

The summary of the error evaluation of MEP for both data subsets is given in Table 4. Notice that the correlation between the actual and MEP-predicted values is below the recommended value of 0.8 for the training phase, which explains the larger errors visible in the MEP curve fitting plot. The average error is 1.18 for training and 0.73 for testing, larger than the corresponding GEP errors, and the training-phase performance index is 0.267, which exceeds the recommended limit of 0.2; the same trend holds for the RMSE and a20-index values. These values indicate that the MEP algorithm failed to accurately predict the tensile strain capacity of ECC, particularly in the training phase, which is also evident from the objective function value being very close to the upper limit of 0.2.
7.3. XGB result
In contrast to the GEP and MEP models, the XGB model was implemented in Python using Jupyter notebooks in the Anaconda environment. The optimal hyperparameters of the XGB algorithm were selected with a grid search approach, which involves varying a specific parameter across a range of possible values while keeping all other parameters constant. This approach helps to find the combination of hyperparameters that results in the highest R value and lowest RMSE. Two key hyperparameters need to be tuned for XGB, namely n_estimators and maximum depth, which specify the number of trees built by the algorithm and the depth of each tree respectively. The XGB algorithm adds new trees in sequence to reduce the residual of the previously constructed trees; by virtue of this technique it can fit the training data very quickly, but this can also lead to overfitting [66]. It is therefore beneficial to limit the rate at which the algorithm adds new trees by applying a shrinking factor, called the learning rate, to the predictions of the newly constructed trees, which limits the rate at which the algorithm fits the training data. The literature suggests keeping its value between 0.01 and 0.1 [64] so that the residual reduction by each added tree is gradual; the learning rate was accordingly set to 0.08 in the current study. The values of the other XGB hyperparameters found using the grid search approach are given in Table 3. The curve fitting plot between the actual and XGB-predicted values is given in Figure 10 (c). Notice from the figure that the predicted curve practically coincides with the actual curve, which shows the remarkable predictive capacity of XGB.
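The grid-search tuning described above can be sketched as follows. The candidate grids are assumptions made for illustration, but the selected values (learning_rate = 0.08, n_estimators = 60, max_depth = 8) match Table 3; `X_train`, `y_train`, `X_test`, and `y_test` are the splits assumed earlier.

```python
# Hedged sketch of grid-search tuning for the XGB regressor.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "learning_rate": [0.01, 0.05, 0.08, 0.1],   # shrinking factor
    "n_estimators": [40, 60, 80, 100],          # number of boosted trees
    "max_depth": [4, 6, 8, 10],                 # depth of each tree
}
search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)

xgb_model = search.best_estimator_
print(search.best_params_)
print(xgb_model.score(X_test, y_test))          # R^2 on the unseen testing subset
```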

The error metrics of the XGB algorithm are also given in Table 4. The correlation values for the training and testing data are 0.98 and 0.96 respectively, indicating excellent agreement between the actual and XGB-predicted values. Similarly, the MAE and RMSE values of the XGB algorithm are the lowest of all the algorithms for both the training and testing sets. Moreover, the a20-index values are the highest among the developed models, which shows that the deviation between the actual and XGB-predicted values is minimal, and the PI and OF values of XGB are very close to zero, again indicating the overall efficiency of the algorithm. Thus, it can be concluded that XGB is the most accurate algorithm employed in this study.

Figure 10. Curve fitting plots of algorithms; (a) GEP; (b) MEP; (c) XGB; (d) AdaBoost.

7.4. AdaBoost result


The AdaBoost technique, a boosting-based technique like XGB, was also implemented in Python within the Anaconda environment. The model fitting parameters for AdaBoost are of the same kind as for XGB and were likewise chosen using a grid search approach; the optimal set of parameters employed in the current study is given in Table 3. Notice that the number of trees in the AdaBoost model was set to 100, as shown by the n_estimators value, while the learning rate of the AdaBoost model is 0.01, in contrast to the XGB value of 0.08.
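A minimal sketch of this set-up with the Table 3 parameters is shown below; passing a depth-10 decision tree as the base learner is an assumption about how the "Max. depth" setting was applied, and `X_train`/`y_train` are the splits assumed earlier.

```python
# AdaBoost regressor with the Table 3 hyperparameters (illustrative sketch).
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

ada_model = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=10),   # base_estimator in older scikit-learn
    n_estimators=100,
    learning_rate=0.01,
    random_state=42,
)
ada_model.fit(X_train, y_train)
print(ada_model.score(X_test, y_test))   # R^2 on the unseen testing subset
```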

The predictive capability of AdaBoost is shown in Figure 10 (d). It can be seen from the curve fitting plot that the algorithm performed well at most points and shows only minor errors at a few points. The error evaluation of AdaBoost is also given in Table 4, where the correlation coefficient values for both the training and testing phases exceed 0.95, indicating excellent agreement between the actual and predicted tensile strain capacities of ECC. Similarly, the average error of the AdaBoost algorithm is smaller than for both MEP and GEP and close to zero, which shows that the actual and predicted values are very close to each other. The a20-index values are also closer to one for the AdaBoost algorithm, and the PI values being close to zero further testify to its accuracy. Furthermore, the objective function value of 0.11 shows that the model is accurate overall.

Table 4. Error metrics of the developed models.

| Metric | GEP Training | GEP Testing | MEP Training | MEP Testing | AdaBoost Training | AdaBoost Testing | XGB Training | XGB Testing |
| MAE | 0.777 | 0.588 | 1.18 | 0.733 | 0.35 | 0.45 | 0.21 | 0.48 |
| RMSE | 0.99 | 0.724 | 1.65 | 1.022 | 0.534 | 0.66 | 0.376 | 0.685 |
| R | 0.9173 | 0.926 | 0.758 | 0.860 | 0.972 | 0.967 | 0.986 | 0.965 |
| PI | 0.145 | 0.120 | 0.267 | 0.168 | 0.078 | 0.142 | 0.055 | 0.100 |
| a20-index | 0.517 | 0.6 | 0.418 | 0.638 | 0.860 | 0.666 | 0.906 | 0.666 |
| OF (overall) | 0.130 |  | 0.160 |  | 0.111 |  | 0.081 |  |

7.5. Comparison with Multi Linear Regression (MLR) model


Linear regression is a common tool for describing the relationship between two variables, and MLR is an extension of the simple regression model that captures a linear relationship among several variables. To the best of the authors' knowledge, there are no statistical models available for predicting the TSC of ECC using the set of input variables used in the current study. This study therefore also applies MLR to model the relationship between the inputs and the output. The resulting MLR equation, which models the linear relationship between TSC and the seven input variables, is given by Equation (9).
y = 10.04 − 0.001d0 − 0.004d1 − 10.61d2 + 0.002d3 + 0.01d4 + 0.003d5 − 0.01d6 (9)
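Equation (9) translates directly into code. The sketch below implements it as a function and, for comparison, refits the same multi-linear form with scikit-learn on the assumed training split; the refitted coefficients will differ slightly from the rounded values quoted in Equation (9).

```python
# Direct use of Equation (9) and an illustrative scikit-learn refit.
from sklearn.linear_model import LinearRegression

def tsc_mlr(d0, d1, d2, d3, d4, d5, d6):
    """TSC (%) from Equation (9): cement, fly ash, w/b, sand, SP, fibre content, age."""
    return (10.04 - 0.001 * d0 - 0.004 * d1 - 10.61 * d2
            + 0.002 * d3 + 0.01 * d4 + 0.003 * d5 - 0.01 * d6)

mlr = LinearRegression().fit(X_train, y_train)
print(mlr.intercept_, mlr.coef_)   # compare with the rounded coefficients above
```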

After building the MLR model, it is instructive to compare its accuracy with that of the developed ML models to investigate the strengths and weaknesses of statistical techniques versus ML algorithms. A Taylor-plot-based comparison between the ML models and the MLR model is therefore given in Figure 11. The Taylor plot is a useful technique for comparing the efficiency of different models using the combined effect of R and the standard deviation of the predicted values [107], and its interpretation is straightforward: the closer a model's point is to the reference data point, the more accurate it is, and vice versa. Notice from Figure 11 that the reference data point is shown by a red dot and that the XGB point is closest to it compared with the other algorithms, meaning that XGB is the most accurate algorithm deployed in the current study. After XGB, the AdaBoost point lies closest to the reference data, followed by GEP and lastly MEP. MLR is located at the greatest distance from the actual data point, which indicates that its accuracy in predicting the output is far lower than that of the ML algorithms. Thus, the order of model accuracy according to the Taylor diagram is XGB > AdaBoost > GEP > MEP > MLR. This further reinforces the value of ML algorithms for modelling complex non-linear relationships between input and output variables that cannot be modelled accurately using conventional statistical techniques.
Figure 11. Taylor plot-based comparison of algorithms with MLR model.

7.6. External Validation of algorithms


In addition to the error evaluation metrics described in section 6, various other metrics have been suggested by researchers to further validate the robustness of ML algorithms; these are commonly known as external validation criteria. A model exhibiting good performance according to the error evaluation metrics must also perform well against these external validation criteria to be declared acceptable and suitable for practical use. This study therefore applies several external validation criteria to the developed models, and a summary is given in Table 5. Notice that the values of the external validation metrics calculated for all models lie within the recommended ranges, indicating that all developed models are suitable for predicting the TSC of ECC. However, the XGB values are closest to the recommended values, indicating its overall robustness compared with the other algorithms.
Table 5. Summary of external validation criteria.

| No. | Reference | Expression | Criterion | MEP | GEP | XGB | AdaBoost |
| 1 | [96] | $s = \dfrac{\sum_{i=1}^{n}(x_i \times y_i)}{\sum_{i=1}^{n} x_i^2}$ | 0.85 < s < 1.15 | 0.854 | 0.929 | 0.974 | 0.96 |
| 2 | [108] | $s' = \dfrac{\sum_{i=1}^{n}(x_i \times y_i)}{\sum_{i=1}^{n} y_i^2}$ | 0.85 < s' < 1.15 | 1.019 | 1.022 | 1.012 | 1.01 |
| 3 | [109] | $R_o^2 = 1 - \dfrac{\sum_{i=1}^{n}(y_i - x_i^{o})^2}{\sum_{i=1}^{n}(y_i - \bar{y}_i)^2},\; x_i^{o} = s \times y_i$ | $R_o^2 \approx 1$ | 0.868 | 0.9137 | 0.9553 | 0.925 |
| 4 | [109] | $R_o'^2 = 1 - \dfrac{\sum_{i=1}^{n}(x_i - y_i^{o})^2}{\sum_{i=1}^{n}(y_i - \bar{y}_i)^2},\; y_i^{o} = s' \times x_i$ | $R_o'^2 \approx 1$ | 0.997 | 0.990 | 0.988 | 0.995 |
| 5 | [110] | $m = \dfrac{R^2 - R_o^2}{R_o^2}$ | m < 0.1 | -0.31 | -0.07 | 0.001 | 0.01 |
| 6 | [110] | $n = \dfrac{R^2 - R_o'^2}{R_o'^2}$ | n < 0.1 | -0.27 | -0.07 | -0.011 | 0.017 |
7.7. Error comparison and best model selection
The error evaluation summary of all developed ML models is given in Table 4. However, it
is essential to compare the error metrics of the developed models to highlight the strengths
and weaknesses of each and to identify the most accurate one. Thus, a comparison of the
error metrics of the developed models is given in Figure 12. As far as MAE is concerned,
Figure 12 shows that XGB has the lowest training MAE, while the testing MAE values of
XGB and AdaBoost are almost equal; the training MAE is highest for the MEP algorithm.
Thus, the XGB algorithm shows the smallest difference between actual and predicted
values. The trend in RMSE across all models is almost the same as that in MAE. Regarding
the correlation coefficient, the training R value is highest for the XGB model (0.986), and its
testing R value of 0.965 is comparable to that of AdaBoost, whose values are 0.972 and
0.967 for the training and testing sets, respectively. After AdaBoost, GEP has the next
highest correlation values, while the MEP correlation values are the lowest. Similarly, the
performance index, which accounts for RRMSE and correlation simultaneously, exhibits the
same pattern: the PI values of XGB are the lowest for both datasets, followed by AdaBoost,
GEP, and lastly MEP. Moreover, to compare the overall efficiency of the models across both
phases, the objective function values are also given in Figure 12. XGB has the lowest OF
value, depicting its overall good performance in predicting the TSC of ECC; AdaBoost has
the second lowest OF value, while GEP and MEP have higher values. The external
validation criteria likewise suggest that XGB is the most accurate of all the algorithms. Thus,
it can be inferred that XGB is the most accurate algorithm employed in this study, and it is
used to conduct the further analysis.
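
For completeness, the sketch below shows how such a comparison can be assembled from the raw predictions. The PI and OF expressions used here are the forms commonly adopted in this literature and are stated as assumptions; the authoritative definitions are those given in section 6.

```python
# Hedged sketch of the error metrics compared in Figure 12.  The PI and OF
# forms below are assumptions (common in this literature) and may differ in
# detail from the definitions adopted in section 6 of the manuscript.
import numpy as np

def error_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.mean(np.abs(y_true - y_pred))            # mean absolute error
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # root mean square error
    r = np.corrcoef(y_true, y_pred)[0, 1]             # correlation coefficient
    rrmse = rmse / y_true.mean()                      # relative RMSE
    pi = rrmse / (1.0 + r)                            # performance index (assumed form)
    return mae, rmse, r, pi

def objective_function(pi_train, pi_test, n_train, n_test):
    # Assumed train/test-weighted combination of the performance indices.
    n = n_train + n_test
    return ((n_train - n_test) / n) * pi_train + (2.0 * n_test / n) * pi_test
```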
[Bar charts comparing training and testing MAE, RMSE, correlation coefficient (R), performance index (PI), and objective function (OF) for the GEP, MEP, AdaBoost, and XGB models.]
Figure 12. Error comparison of developed models.

7.8. Shapley additive explanatory analysis


It is imperative to conduct some form of explanatory analysis on the developed ML
models to increase transparency and to gain insight into the prediction process of
black-box models [111]. Thus, the XGB model underwent Shapley analysis, since it was
established in section 7.7 that it is the most accurate of all the algorithms used in the
current study.

The Shapley analysis takes inspiration from coalitional game theory, with the feature
values treated as coalition members [112]. An additive feature attribution is employed in
the Shapley analysis to interpret a black-box ML model; it defines the output as a linear
summation of the contributions of the different inputs. Assume a model with inputs
x = (x₁, x₂, x₃, …, xₙ), where n denotes the total number of inputs. The model's explanation
G(x′) with the simplified input x′ from the model f(x) can be represented by Equation (10):

f(x) = G(x′) = Φ₀ + Σᵏᵢ₌₁ (Φᵢ × xᵢ′)    (10)

A mapping function x = hₓ(x′) is used to link the inputs x and x′. The SHAP values are
generally approximated by methods such as Deep SHAP and Tree SHAP. The Tree SHAP
method is used for the interpretation of random forests, decision trees, and boosting-based
models and is therefore utilized in the current study. The Shapley interpretation of an ML
model, given in the form of various plots, is divided into two broad categories known as
SHAP local interpretation and SHAP global interpretation.
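
As a rough illustration of this step, the following sketch computes Tree SHAP values for an XGB regressor. The feature names, dataset, and hyperparameters are synthetic placeholders standing in for the collected ECC data, not the study's actual configuration.

```python
# Minimal sketch of the Tree SHAP computation for a boosting-based model,
# using a synthetic stand-in for the ECC dataset.
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
cols = ["w/b", "fibre", "cement", "age", "sand", "fly ash", "SP"]   # placeholder input names
X = pd.DataFrame(rng.uniform(0, 1, (122, 7)), columns=cols)         # placeholder features
y = rng.uniform(0.5, 4.5, 122)                                      # placeholder TSC values

model = XGBRegressor(n_estimators=200, max_depth=3).fit(X, y)       # stand-in XGB model
explainer = shap.TreeExplainer(model)      # Tree SHAP is suited to tree/boosting models
shap_values = explainer.shap_values(X)     # one SHAP value per feature per sample
base_value = explainer.expected_value      # average prediction (force-plot base value)
```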
7.8.1. SHAP Global Interpretation

The SHAP global interpretation plots give insight into the overall model by outlining the
significance of the input variables in making predictions. The most commonly used plots for
global interpretation are the mean absolute SHAP plot and the bee-swarm SHAP plot. The
basic concept of mean SHAP values is that inputs with high absolute SHAP values are
more significant in predicting the output, and vice versa. The SHAP values for each variable
are calculated across the whole dataset and their absolute mean is then taken. The mean
absolute SHAP plot for the XGB model developed in the current study is given in Figure 13
(a). The parameters are arranged in order of decreasing importance from top to bottom.
The absolute SHAP values calculated for each variable are shown by red dots; the broader
the range of dots for a particular variable, the more significant it is in predicting the output,
and vice versa. The mean absolute SHAP value of each variable is also highlighted with a
black mark. Notice that the variable having the broadest range of absolute SHAP values,
and consequently the greatest significance, is fibre content, followed by water-to-binder
ratio, cement, and age, whereas the contributions of sand, chemical admixture
(superplasticizer), and mineral admixture (fly ash) are the lowest. Although Figure 13 (a)
provides useful information about the mean absolute SHAP values, it does not specify
which variables are positively and which are negatively correlated with the output. This
limitation can be mitigated by using another plot known as the SHAP summary plot or
bee-swarm plot, as given in Figure 13 (b). The SHAP summary plot considers both the
importance of the input variables and their impact on the output. The points are colour
coded so that their colour conveys the feature value, with blue indicating lower values and
red indicating higher values. The variables in this plot are also arranged in order of
decreasing importance from top to bottom, and the significant variables are the ones having
the broadest range of Shapley values, as in the mean absolute SHAP plot. Notice from the
summary plot in Figure 13 (b) that fibre content has the broadest range of both red and blue
points, indicating that it is the most important variable for predicting the TSC of ECC,
followed by water-to-binder ratio, cement, and age. Fly ash has the narrowest range of
SHAP values, with almost all points lying practically on the zero line, indicating that it has
the least contribution in predicting tensile strain capacity. The findings of this global
interpretation are in line with previous experimental investigations: it has been rigorously
reported that fibres are the most important component for increasing the ductility and TSC
of ECC [73], and the relative contributions of water-to-binder ratio and age in developing the
TSC of ECC are likewise documented [87,92,113,114]. Thus, the results of the Shapley
analysis resonate with the previous experimental studies.
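
Both global plots can be generated from the SHAP values with the shap library. The sketch below reuses the placeholder explainer output from the previous sketch and is only indicative of the plotting calls involved.

```python
# Global interpretation plots (reusing shap_values and X from the Tree SHAP sketch).
shap.summary_plot(shap_values, X, plot_type="bar")   # mean absolute SHAP values per input
shap.summary_plot(shap_values, X)                    # bee-swarm summary plot
```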

(a)

(b)

Figure 13. SHAP Global Interpretation; (a) mean absolute SHAP values; (b) SHAP summary
plot.

7.8.2. SHAP Local Interpretation

The plots given in Figure 13 give an overview of the variables contributing most to the
output. However, it is also important to check the local interpretability of the model
predictions to understand how individual predictions are made. Thus, the Shapley force
plots are given in Figure 14 to give insight into the model's prediction process. The force
plots for predicted TSC values of 2.11, 2.01, and 1.92 are given in Figure 14 (a). These
plots provide a compact and interactive way to interpret individual predictions. Each force
plot is centred at the base value, which is the average value of all the predictions, and each
arrow on the plot represents a variable that contributes to the prediction. The direction and
length of each arrow indicate the direction and magnitude of the effect of that particular
variable. The arrows are colour coded such that red represents higher feature values and
blue represents lower values. The point where the blue and red arrows meet on the force
plot marks the model-predicted value, as shown in Figure 14 (a). Notice from the force plots
that each input variable offers some contribution to the prediction; however, the
contributions of water-to-binder ratio, age, and cement are more pronounced than those of
the other input variables, because these variables are more significant in output prediction,
as previously indicated by the SHAP global interpretation plots.

After identifying the most influential variables and investigating their effect on individual
predictions, it is also beneficial to highlight the relation between different input variables and
their SHAP values. This can be done using SHAP partial dependence plots; a compact form
of the SHAP partial dependence plots between the variables used in this study is given in
Figure 14 (b). In these plots, the values of the variable of interest are given along the bottom
axis, the values of the interacting variable along the right-hand axis, and the corresponding
SHAP values of the variable on the x-axis along the left-hand axis. In this way, the variation
in the SHAP values of a particular variable with a change in the value of another variable
can be observed. Overall, the SHAP values of water-to-binder ratio, fibre content, age, and
cement vary greatly compared with the other variables, because these are the most
prominent variables in predicting the TSC of ECC. Notice that the SHAP values of cement
vary from -0.5 to 0.5 as the sand content changes from around 450 to 500 kg. Similarly, the
Shapley values of W/b change as the fibre content changes from 15 kg to 22 kg, and the
same trend is followed by the SHAP values of sand with changes in W/b. In contrast, the
Shapley values for fly ash appear unaffected by changes in fibre content, because fly ash is
the least significant variable in output prediction, as previously identified in the SHAP global
summary plot. Furthermore, the frequency distribution histogram of the variable on the
x-axis is given in the background of each partial dependence plot. Thus, the partial
dependence plots both highlight the relationship between the SHAP values of different
variables and provide the frequency distribution histograms of those variables.
(a)

(b)

Figure 14. SHAP Local Interpretation; (a) SHAP force plots; (b) SHAP partial dependence plots.

8. Conclusions
This study aimed to provide empirical prediction models for the fast and accurate
estimation of the TSC of ECC using boosting and genetic programming techniques,
thereby saving time and resources. The main conclusions of the study are:

• All algorithms employed in the current study exhibited good performance in predicting
the TSC of ECC. However, the XGB algorithm turned out to be the most accurate,
exhibiting the highest correlation (0.986) between actual and predicted TSC values.
• The MEP and GEP algorithms developed in the current study express their output as
an empirical equation which can be used to calculate the TSC of ECC, whereas the
two boosting-based algorithms do not provide an empirical equation.
• The accuracy of the developed algorithms was assessed using several error metrics
including MAE, RMSE, PI, and OF, and the results showed that XGB has the highest
accuracy with the lowest OF value of 0.081, compared with 0.11 for AdaBoost, 0.13
for GEP, and 0.16 for MEP.
• The external validation of the developed algorithms also highlighted their robustness
for practical use.
• Shapley additive analysis was performed on the XGB model since it was the most
accurate algorithm, and the results highlighted that fibre content, age, and
water-to-binder ratio are the most important features for predicting the TSC of ECC.

Although this study presents a significant contribution on the subject of TSC prediction
of ECC using soft-computing techniques, it is important to highlight its limitations and
provide avenues for future research:
• The models presented in this study were developed using a limited dataset of 122
points gathered from the literature. Considering larger datasets is of paramount
importance for developing more robust and generalized models in future studies.
• Several material properties, such as the fineness index of the fine aggregate and the
nominal diameter and length of the incorporated fibres, affect the TSC of ECC. While
this study did not consider these properties for simplification purposes, it is essential
to consider them as input parameters in future studies.
• In addition to TSC, other important properties of ECC, such as compressive strength
and split tensile strength, should be considered as outputs in building empirical
models.

Data availability
Data will be provided upon request.

References
[1] K. B. Ramkumar, P. R. Kannan Rajkumar, S. Noor Ahmmad, and M. Jegan, “A Review
on Performance of Self-Compacting Concrete – Use of Mineral Admixtures and Steel
Fibres with Artificial Neural Network Application,” Constr Build Mater, vol. 261, Nov.
2020, doi: 10.1016/J.CONBUILDMAT.2020.120215.
[2] H. P. Behbahani, B. Nematollahi, and M. Farasatpour, “Steel Fiber Reinforced
Concrete: A Review.”
[3] V. Li, “Performance driven design of fiber reinforced cementitious composites,” 1992,
Accessed: Feb. 23, 2024. [Online]. Available:
https://deepblue.lib.umich.edu/bitstream/handle/2027.42/84754/li_FRCC_PPDA_92.p
df?sequence=1
[4] V. C. Li, “Postcrack Scaling Relations for Fiber Reinforced Cementitious Composites,”
Journal of Materials in Civil Engineering, vol. 4, no. 1, pp. 41–57, Feb. 1992, doi:
10.1061/(asce)0899-1561(1992)4:1(41).
[5] S. Qudah and M. Maalej, “Application of Engineered Cementitious Composites (ECC)
in interior beam-column connections for enhanced seismic resistance,” Eng Struct,
vol. 69, pp. 235–245, Jun. 2014, doi: 10.1016/J.ENGSTRUCT.2014.03.026.
[6] M. Maalej, S. T. Quek, S. F. U. Ahmed, J. Zhang, V. W. J. Lin, and K. S. Leong,
“Review of potential structural applications of hybrid fiber Engineered Cementitious
Composites,” Constr Build Mater, vol. 36, pp. 216–227, Nov. 2012, doi:
10.1016/J.CONBUILDMAT.2012.04.010.
[7] M. Şahmaran, M. Al-Emam, G. Yıldırım, Y. E. Şimşek, T. K. Erdem, and M. Lachemi, “High-
early-strength ductile cementitious composites with characteristics of low early-age
shrinkage for repair of infrastructures,” Materials and Structures, vol. 48, no. 5, pp.
1389–1403, May 2015, doi: 10.1617/s11527-013-0241-z.
[8] Y. M. Lim and V. C. Lib, “Durable Repair of Aged Infrastructures Using Trapping
Mechanism of Engineered Cementitious Composites,” 1997.
[9] L. Z. Li, Y. Bai, K. Q. Yu, J. T. Yu, and Z. D. Lu, “Reinforced high-strength engineered
cementitious composite (ECC) columns under eccentric compression: Experiment
and theoretical model,” Eng Struct, vol. 198, Nov. 2019, doi:
10.1016/J.ENGSTRUCT.2019.109541.
[10] E. H. Yang and V. C. Li, “Strain-hardening fiber cement optimization and component
tailoring by means of a micromechanical model,” Constr Build Mater, vol. 24, no. 2,
pp. 130–139, Feb. 2010, doi: 10.1016/j.conbuildmat.2007.05.014.
[11] Kallepalli Bindu Madhavi, Mandala Venugopal, V Rajesh, and kunchepu suresh,
“Experimental Study on Bendable Concrete,” International Journal of Engineering
Research and, vol. V5, no. 10, Oct. 2016, doi: 10.17577/IJERTV5IS100400.
[12] S. Gupta and P. Sihag, “Prediction of the compressive strength of concrete using
various predictive modeling techniques,” Neural Comput Appl, vol. 34, no. 8, pp.
6535–6545, Apr. 2022, doi: 10.1007/S00521-021-06820-Y.
[13] D. Ma, H. Duan, J. Zhang, and H. Bai, “A state-of-the-art review on rock seepage
mechanism of water inrush disaster in coal mines,” International Journal of Coal
Science & Technology 2022 9:1, vol. 9, no. 1, pp. 1–28, Jul. 2022, doi:
10.1007/S40789-022-00525-W.
[14] F. Soleimani, G. Si, H. Roshan, and J. Zhang, “Numerical modelling of gas outburst
from coal: a review from control parameters to the initiation process,” Int J Coal Sci
Technol, vol. 10, no. 1, Dec. 2023, doi: 10.1007/S40789-023-00657-7.
[15] W. Gao, M. Karbasi, A. M. Derakhsh, and A. Jalili, “Development of a novel soft-
computing framework for the simulation aims: a case study,” Eng Comput, vol. 35, no.
1, pp. 315–322, Jan. 2019, doi: 10.1007/s00366-018-0601-y.
[16] A. D. Skentou et al., “Closed-Form Equation for Estimating Unconfined Compressive
Strength of Granite from Three Non-destructive Tests Using Soft Computing Models,”
Rock Mech Rock Eng, vol. 56, no. 1, pp. 487–514, Jan. 2023, doi: 10.1007/s00603-
022-03046-9.
[17] Z. Ali, M. Karakus, G. D. Nguyen, and K. Amrouch, “Effect of loading rate and time
delay on the tangent modulus method (TMM) in coal and coal measured rocks,” Int J
Coal Sci Technol, vol. 9, no. 1, pp. 1–13, Dec. 2022, doi: 10.1007/S40789-022-00552-
7/FIGURES/16.
[18] G. Wang et al., “Research and practice of intelligent coal mine technology systems in
China,” Int J Coal Sci Technol, vol. 9, no. 1, pp. 1–17, Dec. 2022, doi:
10.1007/S40789-022-00491-3/FIGURES/13.
[19] M. Wang, X. Xi, Q. Guo, J. Pan, M. Cai, and S. Yang, “Sulfate diffusion in coal pillar:
experimental data and prediction model,” Int J Coal Sci Technol, vol. 10, no. 1, pp. 1–
12, Dec. 2023, doi: 10.1007/S40789-023-00575-8/FIGURES/12.
[20] C. Ying, M. Qi-Guang, L. Jia-Chen, and G. Lin, “Advance and Prospects of AdaBoost
Algorithm”.
[21] S. K. Rahman and R. Al-Ameri, “Experimental investigation and artificial neural
network based prediction of bond strength in self-compacting geopolymer concrete
reinforced with basalt FRP bars,” Applied Sciences (Switzerland), vol. 11, no. 11, Jun.
2021, doi: 10.3390/app11114889.
[22] I. Nunez, A. Marani, and M. L. Nehdi, “Mixture optimization of recycled aggregate
concrete using hybrid machine learning model,” Materials, vol. 13, no. 19, pp. 1–24,
Oct. 2020, doi: 10.3390/ma13194331.
[23] S. Wang, J. Guo, Y. Yu, P. Shi, and H. Zhang, “Quality evaluation of land reclamation
in mining area based on remote sensing,” Int J Coal Sci Technol, vol. 10, no. 1, pp. 1–
10, Dec. 2023, doi: 10.1007/S40789-023-00601-9/TABLES/6.
[24] Y. Cai et al., “A review of monitoring, calculation, and simulation methods for ground
subsidence induced by coal mining,” International Journal of Coal Science &
Technology 2023 10:1, vol. 10, no. 1, pp. 1–23, Jun. 2023, doi: 10.1007/S40789-023-
00595-4.
[25] W. Bin Inqiad, M. S. Siddique, S. S. Alarifi, M. J. Butt, T. Najeh, and Y. Gamil,
“Comparative analysis of various machine learning algorithms to predict 28-day
compressive strength of Self-compacting concrete,” Heliyon, vol. 9, no. 11, p. e22036,
Nov. 2023, doi: 10.1016/j.heliyon.2023.e22036.
[26] M. Nematzadeh, A. A. Shahmansouri, and M. Fakoor, “Post-fire compressive strength
of recycled PET aggregate concrete reinforced with steel fibers: Optimization and
prediction via RSM and GEP,” Constr Build Mater, vol. 252, Aug. 2020, doi:
10.1016/J.CONBUILDMAT.2020.119057.
[27] M. A. Khan, A. Zafar, A. Akbar, M. F. Javed, and A. Mosavi, “Application of gene
expression programming (GEP) for the prediction of compressive strength of
geopolymer concrete,” Materials, vol. 14, no. 5, pp. 1–23, Mar. 2021, doi:
10.3390/ma14051106.
[28] F. E. Jalal, Y. Xu, M. Iqbal, B. Jamhiri, and M. F. Javed, “Predicting the compaction
characteristics of expansive soils using two genetic programming-based algorithms,”
Transportation Geotechnics, vol. 30, Sep. 2021, doi: 10.1016/j.trgeo.2021.100608.
[29] C. Qi and X. Tang, “Slope stability prediction using integrated metaheuristic and
machine learning approaches: A comparative study,” Comput Ind Eng, vol. 118, pp.
112–122, Apr. 2018, doi: 10.1016/j.cie.2018.02.028.
[30] F. Huang et al., “Slope stability prediction based on a long short-term memory neural
network: comparisons with convolutional neural networks, support vector machines
and random forest models,” Int J Coal Sci Technol, vol. 10, no. 1, pp. 1–14, Dec.
2023, doi: 10.1007/S40789-023-00579-4/FIGURES/5.
[31] H. Wu, Y. Chen, H. Lv, Q. Xie, Y. Chen, and J. Gu, “Stability analysis of rib pillars in
highwall mining under dynamic and static loads in open-pit coal mine,” Int J Coal Sci
Technol, vol. 9, no. 1, Dec. 2022, doi: 10.1007/S40789-022-00504-1.
[32] C. Zhang, P. Wang, E. Wang, D. Chen, and C. Li, “Characteristics of coal resources in
China and statistical analysis and preventive measures for coal mine accidents,” Int J
Coal Sci Technol, vol. 10, no. 1, pp. 1–13, Dec. 2023, doi: 10.1007/S40789-023-
00582-9/FIGURES/13.
[33] Q. Qi, X. Yue, X. Duo, Z. Xu, and Z. Li, “Spatial prediction of soil organic carbon in
coal mining subsidence areas based on RBF neural network,” Int J Coal Sci Technol,
vol. 10, no. 1, pp. 1–13, Dec. 2023, doi: 10.1007/S40789-023-00588-3/TABLES/4.
[34] F. Huang et al., “Slope stability prediction based on a long short-term memory neural
network: comparisons with convolutional neural networks, support vector machines
and random forest models,” Int J Coal Sci Technol, vol. 10, no. 1, pp. 1–14, Dec.
2023, doi: 10.1007/S40789-023-00579-4/FIGURES/5.
[35] M. Mirrashid and H. Naderpour, “Computational intelligence-based models for
estimating the fundamental period of infilled reinforced concrete frames,” Journal of
Building Engineering, vol. 46, p. 103456, Apr. 2022, doi:
10.1016/J.JOBE.2021.103456.
[36] M. H. Nguyen, H. V. T. Mai, S. H. Trinh, and H. B. Ly, “A comparative assessment of
tree-based predictive models to estimate geopolymer concrete compressive strength,”
Neural Comput Appl, vol. 35, no. 9, pp. 6569–6588, Mar. 2023, doi: 10.1007/s00521-
022-08042-2.
[37] R. Wang, J. Zhang, Y. Lu, and J. Huang, “Towards Designing Durable Sculptural
Elements: Ensemble Learning in Predicting Compressive Strength of Fiber-
Reinforced Nano-Silica Modified Concrete,” Buildings, vol. 14, no. 2, p. 396, Feb.
2024, doi: 10.3390/buildings14020396.
[38] S. Lee, N. H. Nguyen, A. Karamanli, J. Lee, and T. P. Vo, “Super learner machine-
learning algorithms for compressive strength prediction of high performance
concrete,” Structural Concrete, vol. 24, no. 2, pp. 2208–2228, Apr. 2023, doi:
10.1002/suco.202200424.
[39] M. N. Amin, K. Khan, A. M. Abu Arab, F. Farooq, S. M. Eldin, and M. F. Javed,
“Prediction of sustainable concrete utilizing rice husk ash (RHA) as supplementary
cementitious material (SCM): Optimization and hyper-tuning,” Journal of Materials
Research and Technology, vol. 25, pp. 1495–1536, Jul. 2023, doi:
10.1016/j.jmrt.2023.06.006.
[40] T. Oey, S. Jones, J. W. Bullard, and G. Sant, “Machine learning can predict setting
behavior and strength evolution of hydrating cement systems,” Journal of the American
Ceramic Society, vol. 103, no. 1, pp. 480–490, Jan. 2020, doi: 10.1111/JACE.16706.
[41] M. N. Amin, H. A. Alkadhim, W. Ahmad, K. Khan, H. Alabduljabbar, and A. Mohamed,
“Experimental and machine learning approaches to investigate the effect of waste
glass powder on the flexural strength of cement mortar,” PLoS One, vol. 18, no. 1, p.
e0280761, Jan. 2023, doi: 10.1371/JOURNAL.PONE.0280761.
[42] J. Marangu, “Prediction of compressive strength of calcined clay based cement mortars
using support vector machine and artificial neural network techniques,” Journal of
Sustainable Construction Materials and Technologies, vol. 5, no. 1, pp. 392–398, 2020,
doi: 10.29187/jscmt.2020.43.
[43] R. Gayathri, S. U. Rani, L. Čepová, M. Rajesh, and K. Kalita, “A comparative analysis of
machine learning models in prediction of mortar compressive strength,” Processes, 2022,
doi: 10.3390/pr10071387.
[44] R. H. Faraj, H. U. Ahmed, H. S. Fathullah, A. S. Abdulrahman, and F. Abed, “Tensile
Strain Capacity Prediction of Engineered Cementitious Composites (ECC) Using Soft
Computing Techniques,” Computer Modeling in Engineering & Sciences, vol. 138, no.
3, pp. 2925–2954, 2024, doi: 10.32604/cmes.2023.029392.
[45] M. Shahin, M. B. Jaksa, and H. R. Maier, “Physical Modeling of Rolling Dynamic
Compaction View project Artificial neural networks-pile capacity prediction View
project,” 2008. [Online]. Available:
https://www.researchgate.net/publication/228364758
[46] M. A. Khan, A. Zafar, A. Akbar, M. F. Javed, and A. Mosavi, “Application of gene
expression programming (GEP) for the prediction of compressive strength of
geopolymer concrete,” Materials, vol. 14, no. 5, pp. 1–23, Mar. 2021, doi:
10.3390/ma14051106.
[47] J. R. Koza, “Genetic programming as a means for programming computers by natural
selection,” Stat Comput, vol. 4, no. 2, pp. 87–112, Jun. 1994, doi:
10.1007/BF00175355/METRICS.
[48] M. Saridemir, “Genetic programming approach for prediction of compressive strength
of concretes containing rice husk ash,” Constr Build Mater, vol. 24, no. 10, pp. 1911–
1919, Oct. 2010, doi: 10.1016/j.conbuildmat.2010.04.011.
[49] J. R. Koza and R. Poli, “Chapter 5 GENETIC PROGRAMMING.”
[50] C. Ferreira, Gene expression programming: mathematical modeling by an artificial
intelligence. 2006. Accessed: Jan. 11, 2024. [Online]. Available:
https://books.google.com/books?hl=en&lr=&id=NkG7BQAAQBAJ&oi=fnd&pg=PR7&d
q=%5D+C.+Ferreira,+Gene+expression+programming:+mathematical+modeling+by+
an+artificial++intelligence,+Springer2006.+&ots=Y-
irsB1pyX&sig=eeYcsFzblKDNWCxnkUnfQH41BT0
[51] A. Gholampour, A. H. Gandomi, and T. Ozbakkaloglu, “New formulations for
mechanical properties of recycled aggregate concrete using gene expression
programming,” Constr Build Mater, vol. 130, pp. 122–145, Jan. 2017, doi:
10.1016/j.conbuildmat.2016.10.114.
[52] J. R. Koza and M. Jacks Hall, “SURVEY OF GENETIC ALGORITHMS AND GENETIC
PROGRAMMING.” [Online]. Available: http://www-cs-faculty.stanford.edu/~koza/
[53] A. H. Gandomi, A. H. Alavi, M. R. Mirzahosseini, and F. M. Nejad, “Nonlinear Genetic-
Based Models for Prediction of Flow Number of Asphalt Mixtures,” Journal of
Materials in Civil Engineering, vol. 23, no. 3, pp. 248–263, Mar. 2011, doi:
10.1061/(asce)mt.1943-5533.0000154.
[54] Z. Chen et al., “Strength evaluation of eco-friendly waste-derived self-compacting
concrete via interpretable genetic-based machine learning models,” Mater Today
Commun, vol. 37, Dec. 2023, doi: 10.1016/j.mtcomm.2023.107356.
[55] M. O. Crina and G. Gros¸an, “A Comparison of Several Linear GP Techniques A
Comparison of Several Linear Genetic Programming Techniques,” 2003. [Online].
Available: www.mep.cs.ubbcluj.ro.
[56] H. L. Wang and Z. Y. Yin, “High performance prediction of soil compaction parameters
using multi expression programming,” Eng Geol, vol. 276, Oct. 2020, doi:
10.1016/j.enggeo.2020.105758.
[57] M. Oltean, “Multi Expression Programming for solving classification problems,” 2022,
doi: 10.21203/rs.3.rs-1458572/v1.
[58] A. Fallahpour, E. U. Olugu, and S. N. Musa, “A hybrid model for supplier selection:
integration of AHP and multi expression programming (MEP),” Neural Comput Appl,
vol. 28, no. 3, pp. 499–504, Mar. 2017, doi: 10.1007/s00521-015-2078-6.
[59] A. H. Alavi, A. H. Gandomi, M. G. Sahab, and M. Gandomi, “Multi expression
programming: A new approach to formulation of soil classification,” Eng Comput, vol.
26, no. 2, pp. 111–118, Apr. 2010, doi: 10.1007/s00366-009-0140-7.
[60] Q. Zhang, X. Meng, B. Yang, and W. Liu, “MREP: Multi-reference expression
programming,” in Lecture Notes in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer Verlag,
2016, pp. 26–38. doi: 10.1007/978-3-319-42294-7_3.
[61] Y. Qu, Z. Lin, H. Li, and X. Zhang, “Feature Recognition of Urban Road Traffic
Accidents Based on GA-XGBoost in the Context of Big Data,” IEEE Access, vol. 7, pp.
170106–170115, 2019, doi: 10.1109/ACCESS.2019.2952655.
[62] H. Xu et al., “Identifying diseases that cause psychological trauma and social
avoidance by GCN-Xgboost,” BMC Bioinformatics, vol. 21, Dec. 2020, doi:
10.1186/s12859-020-03847-1.
[63] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings
of the ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, Association for Computing Machinery, Aug. 2016, pp. 785–794. doi:
10.1145/2939672.2939785.
[64] L. Cui, P. Chen, L. Wang, J. Li, and H. Ling, “Application of Extreme Gradient
Boosting Based on Grey Relation Analysis for Prediction of Compressive Strength of
Concrete,” Advances in Civil Engineering, vol. 2021, 2021, doi:
10.1155/2021/8878396.
[65] X. liang Jin et al., “Estimation of Wheat Agronomic Parameters using New Spectral
Indices,” PLoS One, vol. 8, no. 8, Aug. 2013, doi: 10.1371/journal.pone.0072736.
[66] S. R. Al-Taai, N. M. Azize, Z. A. Thoeny, H. Imran, L. F. A. Bernardo, and Z. Al-Khafaji,
“XGBoost Prediction Model Optimized with Bayesian for the Compressive Strength of
Eco-Friendly Concrete Containing Ground Granulated Blast Furnace Slag and
Recycled Coarse Aggregate,” Applied Sciences (Switzerland), vol. 13, no. 15, Aug.
2023, doi: 10.3390/app13158889.
[67] N. D. Hoang, “A novel ant colony-optimized extreme gradient boosting machine for
estimating compressive strength of recycled aggregate concrete,” Multiscale and
Multidisciplinary Modeling, Experiments and Design, 2023, doi: 10.1007/s41939-023-
00220-6.
[68] R. E. Schapire, “The Strength of Weak Learnability,” 1990.
[69] Y. Freund and R. E. Schapire, “A Short Introduction to Boosting,” Journal of Japanese
Society for Artificial Intelligence, vol. 14, no. 5, pp. 771–780, 1999, Accessed: Dec. 26,
2023. [Online]. Available: www.research.att.com/fyoav,
[70] X. Huang, R. Ranade, W. Ni, and V. C. Li, “Development of green engineered
cementitious composites using iron ore tailings as aggregates,” Constr Build Mater,
vol. 44, pp. 757–764, 2013, doi: 10.1016/j.conbuildmat.2013.03.088.
[71] X. Huang, R. Ranade, W. Ni, and V. C. Li, “On the use of recycled tire rubber to
develop low E-modulus ECC for durable concrete repairs,” Constr Build Mater, vol.
46, pp. 134–141, 2013, doi: 10.1016/j.conbuildmat.2013.04.027.
[72] D. Meng, C. K. Lee, and Y. X. Zhang, “Flexural and shear behaviours of plain and
reinforced polyvinyl alcohol-engineered cementitious composite beams,” Eng Struct,
vol. 151, pp. 261–272, Nov. 2017, doi: 10.1016/J.ENGSTRUCT.2017.08.036.
[73] K. T. Soe, Y. X. Zhang, and L. C. Zhang, “Material properties of a new hybrid fibre-
reinforced engineered cementitious composite,” Constr Build Mater, vol. 43, pp. 399–
407, 2013, doi: 10.1016/J.CONBUILDMAT.2013.02.021.
[74] W. Zhu, P. J. M. Bartos, and A. Porro, “Application of nanotechnology in construction,”
Mater Struct, vol. 37, no. 9, pp. 649–658, Nov. 2004, doi: 10.1007/BF02483294.
[75] H. Liu, Q. Zhang, V. C. Li, H. Su, et al., “Durability study on engineered cementitious
composites (ECC) under sulfate and chloride environment,” Constr Build Mater, vol. 133,
pp. 171–181, Feb. 2017, doi: 10.1016/j.conbuildmat.2016.12.074.
[76] L. li Kan, R. xin Shi, and J. Zhu, “Effect of fineness and calcium content of fly ash on
the mechanical properties of Engineered Cementitious Composites (ECC),” Constr
Build Mater, vol. 209, pp. 476–484, Jun. 2019, doi:
10.1016/J.CONBUILDMAT.2019.03.129.
[77] M. D. Lepech and V. C. Li, “Water permeability of engineered cementitious
composites,” Cem Concr Compos, vol. 31, no. 10, pp. 744–753, Nov. 2009, doi:
10.1016/J.CEMCONCOMP.2009.07.002.
[78] S. Z. Qian, J. Zhou, and E. Schlangen, “Influence of curing condition and precracking
time on the self-healing behavior of Engineered Cementitious Composites,” Cem
Concr Compos, vol. 32, no. 9, pp. 686–693, Oct. 2010, doi:
10.1016/J.CEMCONCOMP.2010.07.015.
[79] Y. Qian, W. Zhou, J. Yan, W. Li, and L. Han, “Comparing machine learning classifiers
for object-based land cover classification using very high resolution imagery,” Remote
Sens (Basel), vol. 7, no. 1, pp. 153–168, 2015, doi: 10.3390/rs70100153.
[80] M. Şahmaran, M. Li, and V. C. Li, “Transport properties of engineered cementitious
composites under chloride exposure,” ACI Materials Journal, 2007, Accessed: Feb. 02,
2024. [Online]. Available: http://acemrl.engin.umich.edu/wp-content/uploads/sites/412/2018/10/Transport-
Properties-of-Engineered-Cementitious-Composites-Under-Chloride-Exposure.pdf
[81] M. Şahmaran and V. C. Li, “De-icing salt scaling resistance of mechanically loaded
engineered cementitious composites,” Cem Concr Res, vol. 37, no. 7, pp. 1035–1046,
Jul. 2007, doi: 10.1016/J.CEMCONRES.2007.04.001.
[82] M. Şahmaran and V. C. Li, “Influence of microcracking on water absorption and
sorptivity of ECC,” Materials and Structures/Materiaux et Constructions, vol. 42, no. 5,
pp. 593–603, Jun. 2009, doi: 10.1617/S11527-008-9406-6.
[83] M. Şahmaran and V. C. Li, “Durability of mechanically loaded engineered cementitious
composites under highly alkaline environments,” Cem Concr Compos, vol. 30, no. 2,
pp. 72–81, Feb. 2008, doi: 10.1016/J.CEMCONCOMP.2007.09.004.
[84] M. Şahmaran, H. E. Yücel, S. Demirhan, M. T. Arık, and V. C. Li, “Combined effect
of aggregate and mineral admixtures on tensile ductility of engineered cementitious
composites,” ACI Materials Journal, vol. 109, no. 6, 2012, Accessed: Feb. 02, 2024.
[Online]. Available: http://acemrl.engin.umich.edu/wp-
content/uploads/sites/412/2018/10/Combined-Effect-of-Aggregate-and-Mineral-
Admixures-on-Tensile-Ductility-of-Engineered-Cementitious-Composites.pdf
[85] M. Şahmaran, E. Özbay, H. Yücel, M. Lachemi, et al., “Frost resistance and
microstructure of Engineered Cementitious Composites: Influence of fly ash and micro
poly-vinyl-alcohol fiber,” Cem Concr Compos, vol. 34, no. 2, pp. 156–165, Feb. 2012, doi:
10.1016/j.cemconcomp.2011.10.002.
[86] M. Şahmaran, M. Lachemi, et al., “Internal curing of engineered cementitious composites
for prevention of early age autogenous shrinkage cracking,” Cem Concr Res, vol. 39, no.
10, pp. 893–901, Oct. 2009, doi: 10.1016/j.cemconres.2009.07.006.
[87] M. Şahmaran and V. C. Li, “Durability properties of micro-cracked ECC containing high
volumes fly ash,” Cem Concr Res, vol. 39, no. 11, pp. 1033–1043, Nov. 2009, doi:
10.1016/j.cemconres.2009.07.009.
[88] E. H. Yang, Y. Yang, and V. C. Li, “Use of high volumes of fly ash to improve ECC
mechanical properties and material greenness,” ACI Materials Journal, 2007, Accessed:
Feb. 02, 2024. [Online]. Available:
https://search.proquest.com/openview/030ff19eb05b0b91d4900d49e0238a1c/1?pq-
origsite=gscholar&cbl=37076
[89] Y. Yao, Y. Zhu, and Y. Yang, “Incorporation superabsorbent polymer (SAP) particles as
controlling pre-existing flaws to improve the performance of engineered cementitious
composites (ECC),” Constr Build Mater, vol. 28, no. 1, pp. 139–145, Mar. 2012, doi:
10.1016/J.CONBUILDMAT.2011.08.032.
[90] B. Gencturk and F. Hosseini, “Evaluation of reinforced concrete and reinforced
engineered cementitious composite (ECC) members and structures using small-scale
testing,” Canadian Journal of Civil Engineering, vol. 42, no. 3, pp. 164–177, Jan. 2015, doi:
10.1139/cjce-2013-0445.
[91] Q. Y. Guan and P. Zhang, “Effect of clay dosage on mechanical properties of plastic
concrete,” Advanced Materials Research, 2011, Accessed: Jan. 16, 2024. [Online].
Available: https://www.scientific.net/AMR.250-253.664
[92] D. Meng, T. Huang, Y. X. Zhang, and C. K. Lee, “Mechanical behaviour of a polyvinyl
alcohol fibre reinforced engineered cementitious composite (PVA-ECC) using local
ingredients,” Constr Build Mater, vol. 141, pp. 259–270, Jun. 2017, doi:
10.1016/J.CONBUILDMAT.2017.02.158.
[93] S. M. Mousavi, A. H. Alavi, A. H. Gandomi, M. A. Esmaeili, and M. Gandomi, “A data
mining approach to compressive strength of CFRP-confined concrete cylinders,”
2010.
[94] A. H. Gandomi and D. A. Roke, “Assessment of artificial neural network and genetic
programming as predictive tools,” Advances in Engineering Software, vol. 88, pp. 63–
72, Jun. 2015, doi: 10.1016/j.advengsoft.2015.05.007.
[95] P. Jagadesh, J. de Prado-Gil, N. Silva-Monteiro, and R. Martínez-García, “Assessing
the compressive strength of self-compacting concrete with recycled aggregates from
mix ratio using machine learning approach,” Journal of Materials Research and
Technology, vol. 24, pp. 1483–1498, May 2023, doi: 10.1016/j.jmrt.2023.03.037.
[96] H. H. Chu et al., “Sustainable use of fly-ash: Use of gene-expression programming
(GEP) and multi-expression programming (MEP) for forecasting the compressive
strength geopolymer concrete,” Ain Shams Engineering Journal, vol. 12, no. 4, pp.
3603–3617, Dec. 2021, doi: 10.1016/j.asej.2021.03.018.
[97] Z. Li, X. Gao, et al., “Correlation analysis and statistical assessment of early hydration
characteristics and compressive strength for multi-composite cement paste,” Constr Build
Mater, 2021, Accessed: Feb. 10, 2024. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0950061821030014
[98] M. Sarveghadi, A. H. Gandomi, H. Bolandi, and A. H. Alavi, “Development of
prediction models for shear strength of SFRCB using a machine learning approach,”
Neural Computing and Applications, vol. 31, no. 7. Springer London, pp. 2085–2094,
Jul. 01, 2019. doi: 10.1007/s00521-015-1997-6.
[99] I. Ilyas et al., “Advanced Machine Learning Modeling Approach for Prediction of
Compressive Strength of FRP Confined Concrete Using Multiphysics Genetic
Expression Programming,” Polymers (Basel), vol. 14, no. 9, May 2022, doi:
10.3390/polym14091789.
[100] B. Iftikhar et al., “Predictive modeling of compressive strength of sustainable rice husk
ash concrete: Ensemble learner optimization and comparison,” J Clean Prod, vol.
348, May 2022, doi: 10.1016/j.jclepro.2022.131285.
[101] “Probability, Statistics, and Decision for Civil Engineers - Jack R Benjamin, C. Allin
Cornell - Google Books.” Accessed: Oct. 08, 2023. [Online]. Available:
https://books.google.com.pk/books?hl=en&lr=&id=Gqm-
AwAAQBAJ&oi=fnd&pg=PP1&dq=Probability,+Statistics+and+Decision+for+Civil+Eng
ineers%3B+Courier+Cooperation,+Dover+Publication,+Mineola:+New+York,+NY,+US
A,+2014%3B+p.+244.&ots=6cVO1xCc4F&sig=lFzwIUMnqpjNdyfRdnIGFNWfIi0&redir
_esc=y#v=onepage&q&f=false
[102] M. A. Khan et al., “Geopolymer Concrete Compressive Strength via Artificial Neural
Network, Adaptive Neuro Fuzzy Interface System, and Gene Expression
Programming With K-Fold Cross Validation,” Front Mater, vol. 8, May 2021, doi:
10.3389/fmats.2021.621163.
[103] A. Rostami, A. Raef, A. Kamari, M. W. Totten, M. Abdelwahhab, and E.
Panacharoensawad, “Rigorous framework determining residual gas saturations during
spontaneous and forced imbibition using gene expression programming,” J Nat Gas
Sci Eng, vol. 84, Dec. 2020, doi: 10.1016/J.JNGSE.2020.103644.
[104] F. E. Jalal, Y. Xu, M. Iqbal, M. F. Javed, and B. Jamhiri, “Predictive modeling of swell-
strength of expansive soils using artificial intelligence approaches: ANN, ANFIS and
GEP,” J Environ Manage, vol. 289, Jul. 2021, doi: 10.1016/j.jenvman.2021.112420.
[105] P. G. Asteris and V. G. Mokos, “Concrete compressive strength using artificial neural
networks,” Neural Comput Appl, vol. 32, no. 15, pp. 11807–11826, Aug. 2020, doi:
10.1007/s00521-019-04663-2.
[106] F. E. Jalal et al., “Indirect Estimation of Swelling Pressure of Expansive Soil: GEP
versus MEP Modelling,” Advances in Materials Science and Engineering, vol. 2023,
2023, doi: 10.1155/2023/1827117.
[107] K. E. Taylor, “Summarizing multiple aspects of model performance in a single
diagram,” Journal of Geophysical Research Atmospheres, vol. 106, no. D7, pp. 7183–
7192, Apr. 2001, doi: 10.1029/2000JD900719.
[108] H. H. Chu et al., “Sustainable use of fly-ash: Use of gene-expression programming
(GEP) and multi-expression programming (MEP) for forecasting the compressive
strength geopolymer concrete,” Ain Shams Engineering Journal, vol. 12, no. 4, pp.
3603–3617, Dec. 2021, doi: 10.1016/j.asej.2021.03.018.
[109] P. P. Roy and K. Roy, “On Some Aspects of Variable Selection for Partial Least
Squares Regression Models,” QSAR Comb Sci, vol. 27, no. 3, pp. 302–313, Mar.
2008, doi: 10.1002/QSAR.200710043.
[110] S. Nazar et al., “Machine learning interpretable-prediction models to evaluate the
slump and strength of fly ash-based geopolymer,” Journal of Materials Research and
Technology, vol. 24, pp. 100–124, May 2023, doi: 10.1016/j.jmrt.2023.02.180.
[111] M. F. Iqbal et al., “Sustainable utilization of foundry waste: Forecasting mechanical
properties of foundry sand based concrete using multi-expression programming,”
Science of the Total Environment, vol. 780, Aug. 2021, doi:
10.1016/j.scitotenv.2021.146524.
[112] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz,
J. Himmelfarb, et al., “From local explanations to global understanding with explainable AI
for trees,” Nature Machine Intelligence, 2020, Accessed: Feb. 06, 2024. [Online]. Available:
https://www.nature.com/articles/s42256-019-0138-9
[113] M. Alberti, A. Enfedaque, et al., “Optimisation of fibre reinforcement with a combination
strategy and through the use of self-compacting concrete,” Constr Build Mater, 2020,
Accessed: Dec. 19, 2023. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0950061819327412
[114] S. S. Vivek and C. Prabalini, “Experimental and microstructure study on coconut fibre
reinforced self compacting concrete (CFRSCC),” Asian Journal of Civil Engineering,
vol. 22, no. 1, pp. 111–123, Jan. 2021, doi: 10.1007/s42107-020-00302-7.
Declaration of Interest Statement

Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships
that could have appeared to influence the work reported in this paper.

☐The authors declare the following financial interests/personal relationships which may be considered
as potential competing interests:
Author Statement

All authors contributed equally in this manuscript.
