
Research article

Received: 3 October 2015, Revised: 10 March 2016, Accepted: 13 March 2016, Published online in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/cem.2800

Quantitative structure–activity relationship model for prediction study of corrosion inhibition efficiency using two-stage sparse multiple linear regression
Abdo Mohammed Al-Fakih (a,b), Zakariya Yahya Algamal (c,d), Muhammad Hisyam Lee (c), Hassan H. Abdallah (e), Hasmerya Maarof (a) and Madzlan Aziz (a,f)*

ABSTRACT: A new quantitative structure–activity relationship (QSAR) model of the inhibition of mild steel corrosion in 1 M hydrochloric acid by furan derivatives was developed by proposing a two-stage sparse multiple linear regression. Sparse multiple linear regression using ridge penalty (SMLRR) and sparse multiple linear regression using elastic net (SMLRE) were used to develop the QSAR model. The results show that the SMLRE-based model has higher predictive power than the SMLRR-based model according to the mean-squared errors for both the training and test datasets, the leave-one-out internal validation (Q²int = 0.98), and the external validation (Q²ext = 0.95). In addition, the results of the applicability domain assessment using the leverage approach indicate that the SMLRE-based model is reliable and robust. In conclusion, the QSAR model developed using SMLRE can be efficiently used in studies of corrosion inhibition efficiency. Copyright © 2016 John Wiley & Sons, Ltd.

Additional supporting information may be found in the online version of this article at the publisher’s web site.

Keywords: sure independence screening; elastic net penalty; QSAR; corrosion inhibitors; furan derivatives

* Correspondence to: Madzlan Aziz, Department of Chemistry, Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia
E-mail: madzlan@utm.my

a A. M. Al-Fakih, H. Maarof, M. Aziz
Department of Chemistry, Universiti Teknologi Malaysia, 81310, Skudai, Johor, Malaysia

b A. M. Al-Fakih
Department of Chemistry, Faculty of Science, Sana’a University, Sana’a, Yemen

c Z. Y. Algamal, M. H. Lee
Department of Mathematical Sciences, Universiti Teknologi Malaysia, 81310, Skudai, Johor, Malaysia

d Z. Y. Algamal
Department of Statistics and Informatics, College of Computer Science and Mathematics, University of Mosul, Mosul, Iraq

e H. H. Abdallah
Department of Chemistry, College of Education, Salahaddin University, Erbil, Iraq

f M. Aziz
Advanced Membrane Technology Centre, Universiti Teknologi Malaysia, 81310 UTM, Skudai, Johor, Malaysia

1. INTRODUCTION

Metal corrosion causes a waste of resources, affects the equipment’s lifetime, and harms the environment [1–3]. Considerable effort has been devoted to the prevention and inhibition of iron and mild steel corrosion in acid solutions [4–6]. Organic compounds with heteroatoms such as oxygen, nitrogen, sulfur, and phosphorus, or compounds containing multiple bonds, are the most reported inhibitors of metal corrosion. Organic inhibitors act by adsorption onto the metal surface, forming a layer and decreasing the corrosion rate [7,8]. Traditionally, experimental techniques are mainly used to study the performance of corrosion inhibitors. However, they are often expensive and time-consuming. With the improvement in computer hardware and software as well as in theoretical chemistry, computational chemistry has been increasingly used in the design and development of corrosion inhibitors. Computational chemistry is also applied to elucidate the corrosion inhibition mechanism [9,10].

Quantitative structure–activity relationship (QSAR) is a theoretical approach that has been successfully used in computational chemistry [11,12]. The principle of QSAR is to correlate the compounds’ structures quantitatively with their chemical activities. Multiple linear regression (MLR) analysis is one of the most important and widely used statistical methods for constructing QSAR models [13,14]. In MLR modeling, chemical activities and molecular descriptors are treated as response variables and predictor variables, respectively [15].

J. Chemometrics (2016) Copyright © 2016 John Wiley & Sons, Ltd.


A. M. Al-Fakih et al.

Typically, the two fundamental criteria for evaluating the quality of QSAR models are the prediction accuracy and prediction reliability [16].

Quantitative structure–activity relationship modeling is becoming more desirable for predicting corrosion inhibition properties of expected organic inhibitors [17]. Many QSAR approaches have been proposed and applied for building predictive QSAR models as effective tools for predicting corrosion inhibition properties. Many QSAR studies on several organic compounds (corrosion inhibitors) have been carried out. El Ashry and Senior [18] developed linear QSAR models for corrosion inhibition efficiencies (IE) of 20 compounds (lauric hydrazide and its salts) based on quantum descriptors and topological descriptors. The study concluded that the predicted IE using the developed QSAR models agreed with experimental IE. Khaled [17] used the genetic function approximation to develop a QSAR model for corrosion IE of 14 pyrimidine derivatives based on quantum descriptors. The study concluded that the predicted corrosion IE of the studied 14 compounds closely matched the experimental measurements. Eddy et al. [19] experimentally measured the corrosion inhibition of some amino acids for the corrosion of mild steel in 0.1 M HCl. Computational calculations were also carried out using quantum chemical approaches and QSAR modeling methods. A linear QSAR was developed for the experimentally determined IE based on two quantum descriptors, that is, the highest occupied molecular orbital energy (EHOMO) and the lowest unoccupied molecular orbital energy (ELUMO). Non-linear models were also developed. The study concluded that the correlations between the predicted IE (obtained from QSAR) and the experimental IE were excellent. Khaled and Al-Mobarak [20] experimentally investigated 11 thiophene compounds as corrosion inhibitors for mild steel in 0.5 M H2SO4. An artificial neural network was used for building a QSAR model based on quantum chemical descriptors. The study concluded that the developed QSAR model showed reliable predictions. Zhao et al. [21] developed a non-linear QSAR model for the corrosion IE of 19 amino acids based on quantum descriptors using the support vector machine. The study concluded that the QSAR model showed good prediction performance. Mousavi et al. [22] developed a quantitative structure inhibition relationship model for the corrosion IE of 11 compounds based on a combination of quantum descriptors (using quantum methods) and molecular descriptors (using DRAGON software; Talete srl, Milan, Italy). A total of 1,519 descriptors were calculated as follows: 1,497 descriptors were calculated using DRAGON software, and 22 descriptors were generated using quantum methods. After data processing, 1,050 descriptors remained. The stepwise MLR method was used to build linear models. Five descriptors out of the 1,050 descriptors were selected for the first proposed model, and four descriptors for the second model. The study concluded that, besides quantum descriptors, the new descriptors calculated by DRAGON software contributed excellently to describing the corrosion inhibition on iron.

Based on the literature reviewed above, the number of quantum descriptors used is limited, and only a few descriptors can be used for building linear and non-linear QSAR models in corrosion inhibition studies. The use of DRAGON-based molecular descriptors therefore contributes significantly to overcoming the limitation of an insufficient number of molecular descriptors in corrosion studies. Therefore, in this work, DRAGON-based descriptors were used.

In QSAR modeling, the trend today is towards producing thousands of molecular descriptors, for example with DRAGON 6, which can calculate 4,885 molecular descriptors [23]. Consequently, the data collected on individual compounds as molecular descriptors have dimensions in the thousands, while there are only a small number of compounds available for study. This results in a high-dimensional dataset, where the number of molecular descriptors, p, exceeds the number of compounds, n. In such a case, the MLR is neither applicable nor suitable. This is because the descriptor design matrix, X, has more columns than rows, so that $(X^T X)^{-1}$ cannot be computed [24–27].

To handle the high dimensionality problem, selection of the relevant descriptors is an essential step in constructing QSAR models. Sparse regression methods are an attractive framework that has been adapted and has gained popularity for performing descriptor selection and QSAR model estimation simultaneously in high-dimensional data [26–29].

In this work, a new QSAR model was developed by proposing two-stage sparse multiple linear regression (SMLR). In the first stage, the sure independence screening (SIS) approach was considered in order to reduce the dimensions of the molecular descriptors from high to low dimensionality. In the second stage, the SMLR with both ridge and elastic net methods was used. The performance and predictive capability of each developed method in the second stage were investigated and compared.

2. MATERIALS AND METHODS

2.1. Experimental details

Eighteen furan derivatives were obtained from Sigma-Aldrich and investigated as corrosion inhibitors for mild steel in 1 M hydrochloric acid (HCl). The names and structures of the furan derivatives are given in Figure 1.

Figure 1. Names and structures of the furan derivatives.

The mild steel specimens used were composed of (wt.%) the following: 0.036 C, 0.172 Mn, 0.082 Cu, 0.108 Ni, 0.053 Cr, 0.035 Al, 0.146 Zr, and Fe the balance. The surface of the mild steel was abraded using several grades, up to


1,500 grades, of sandpaper. The specimens were cleaned thoroughly with distilled water and then with acetone.

The measurements were carried out using the potentiodynamic polarization method at room temperature (25 ± 1 °C). Each experiment was carried out using 250 mL of 1 M HCl with and without addition of 0.002 M of the inhibitor. Before polarization measurements, the mild steel electrode was immersed in the test solutions for 30 min to attain a steady state (a stable value of open circuit potential). Polarization curves were recorded at a scan rate of 1 mV/s with a scan range from −0.25 to +0.25 V with respect to open circuit potential. A three-electrode cell assembly that contained a 1 cm² coupon of mild steel embedded in a specimen holder was used. The mild steel specimen acted as the working electrode. A platinum electrode was used as a counter electrode, and the reference electrode was a saturated calomel electrode. The polarization curves are shown in Figure S1. The electrochemical parameters and the corrosion IE of the inhibitors are given in Table S1.

2.2. Data set

In this QSAR study, 18 furan derivatives and their experimental IE were used. The data were randomly divided into a training dataset and a test dataset containing 70% and 30% of the compounds, respectively. The training dataset was used to construct the QSAR model, and the test dataset was used to evaluate the performance of the QSAR model based on several evaluation criteria.

2.3. Molecular descriptor calculation

The molecular structures of the 18 compounds were sketched using CHEM3D software (CambridgeSoft Corporation, Cambridge, MA, USA). The molecular structures were optimized using the molecular mechanics method and then by a molecular orbital package module in CHEM3D software. DRAGON software (version 6.0) was used to generate 4,885 molecular descriptors based on the optimized molecular structures [23]. To include consistent and useful molecular descriptors, preprocessing steps were carried out as follows: first, descriptors that have a constant value for all compounds were excluded; second, descriptors in which 50% of the values are equal to zero were removed; then, descriptors that have zero values for all compounds were discarded; after that, descriptors with a relative standard deviation less than 0.001 were removed; finally, only 1,922 descriptors remained for developing the QSAR model.

2.4. Quantitative structure–activity relationship model development

Consider the traditional MLR for a QSAR study

$$ y = X\beta + \varepsilon \tag{1} $$

where $y = (y_1, \ldots, y_n)^T \in \mathbb{R}^n$ is the response vector of the corrosion IE, $X = (x_0, x_1, \ldots, x_p)^T \in \mathbb{R}^{n \times (p+1)}$ is the design matrix of molecular descriptors, in which the first column is ones to account for the intercept $\beta_0$, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T \in \mathbb{R}^{p+1}$ is the unknown regression coefficient vector of the molecular descriptors, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T \in \mathbb{R}^n$ is the error vector, whose components are independently and identically distributed with a normal distribution of mean 0 and variance $\sigma_\varepsilon^2$. The most common estimation method for Equation 1 is the ordinary least squares (OLS) method, where the estimates are obtained by minimizing the residual sum of squares

$$ \hat{\beta}_{OLS} = \arg\min_{\beta} \, (y - X\beta)^T (y - X\beta) \tag{2} $$

The OLS estimator is then obtained by solving Equation 2 and is defined as

$$ \hat{\beta}_{OLS} = (X^T X)^{-1} X^T y \tag{3} $$

In QSAR studies, researchers are often able to collect a large number of molecular descriptors to be used for constructing MLR models. Because of the involvement of irrelevant descriptors and the multicollinearity that is likely to appear, the predictive ability of MLR can be significantly decreased. As a result, selecting descriptors that truly affect the chemical activity is needed at the initial stage of QSAR modeling. The SMLR method has attracted widespread attention. Without loss of generality, it is assumed that the molecular descriptors are standardized and the response variable is centered; then the SMLR is defined as

$$ \hat{\beta}_{SMLR} = \arg\min_{\beta} \left\{ (y - X\beta)^T (y - X\beta) + \lambda \, h(\beta) \right\} \tag{4} $$

The penalty function λ h(β) depends on the positive tuning parameter λ, which controls the trade-off between fitting the data to the model and the effect of the penalty. In other words, it controls the amount of shrinkage. For λ = 0, we obtain the OLS estimation. In contrast, for large values of λ, the influence of the penalty function on the coefficient estimates increases. Various penalty functions have been developed to identify the true variables and estimate their corresponding coefficients simultaneously, including ridge penalty [30], least absolute shrinkage and selection operator (LASSO) [31], smoothly clipped absolute deviation [32], and elastic net [33]. In QSAR studies, LASSO has been applied and compared with other methods [26,28,29,34].

Including more descriptors allows the QSAR model to adapt to more complicated relationships in the data. However, a model with too many molecular descriptors may overfit. Such overfitting leads to a QSAR model that may not provide a good description for new compounds. Depending on the type of sparse term used, the SMLR can alleviate the problems of multicollinearity and can also produce sparse QSAR models that are easier to interpret scientifically. The ridge regression (ridge) proposed by Hoerl and Kennard [30] is one of the most used sparse methods as a remedy for the multicollinearity problem in statistics. Ridge shrinks the descriptor coefficients towards zero but never sets them exactly to zero. However, ridge suffers from some limitations. In particular, when p > n, it does not have the capability to perform variable selection and therefore does not give an easily interpretable model. The LASSO, introduced by Tibshirani [31], is another frequently used sparse method. The LASSO can perform variable selection by setting some molecular descriptor coefficients to zero. Despite this advantage, LASSO has some shortcomings. First, it cannot select more molecular descriptors than the number of compounds. Second, when there is a group of correlated descriptors, LASSO arbitrarily selects one or a few of the correlated descriptors [26,33]. Elastic net is a sparse method for variable selection introduced by Zou and


Hastie [33]. It was originally proposed to select highly correlated variables as a group, using the ridge penalty to deal with the highly correlated variables and the LASSO penalty to perform variable selection.

In this study, to build an efficient QSAR model, a new QSAR model was developed by proposing two-stage SMLR. In the first stage, the SIS approach [35] is considered in order to reduce the dimensions of the molecular descriptors from high to low dimensionality. The SIS screens molecular descriptors by ranking the marginal correlations (Pearson correlation) of the descriptors with the IE. That is, the relevant descriptors are screened according to the highest values of the Pearson correlation between the IE and each descriptor. The descriptors are then ordered by their marginal correlation values from the highest to the lowest. After that, SIS selects the relevant molecular descriptors corresponding to the θ largest values of the marginal correlation, where θ is a threshold value. Usually, θ is chosen as θ = n − 1, where n is the number of compounds [35].

In the second stage, SMLR with both ridge and elastic net methods is used. The aim of this stage is to select the most relevant descriptors. Because high correlation is expected among the descriptors screened in the first stage, a QSAR model using MLR is not suitable. Therefore, SMLR using ridge penalty (SMLRR) and SMLR using elastic net (SMLRE) are employed to construct an efficient QSAR model. The difference between SMLRR and SMLRE is that SMLRR uses all the screened descriptors from the first stage, while SMLRE selects the most important descriptors out of the screened descriptors. The QSAR models constructed with SMLRR and SMLRE are defined by Equations 5 and 6, respectively.

$$ \hat{\beta}_{SMLRR} = \arg\min_{\beta} \left\{ (y - X\beta)^T (y - X\beta) + \kappa \sum_{j=1}^{p} \beta_j^2 \right\} \tag{5} $$

$$ \hat{\beta}_{SMLRE} = \arg\min_{\beta} \left\{ (y - X\beta)^T (y - X\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \right\} \tag{6} $$

where κ, λ1, λ2 ≥ 0 are the tuning parameters. The cross-validation (CV) method is often used to find the best values of these tuning parameters.

2.5. Quantitative structure–activity relationship evaluation criteria

To provide a satisfactory comparison between SMLRR and SMLRE in constructing an efficient QSAR model, four criteria were used. For the training dataset, the first two criteria are the mean-squared error of the training dataset (MSEtrain) and the leave-one-out internal validation (Q²int), which are defined by

$$ MSE_{train} = \frac{\sum_{i=1}^{n_{train}} \left( y_{i,train} - \hat{y}_{i,train} \right)^2}{n_{train}} \tag{7} $$

and

$$ Q^2_{int} = 1 - \left[ \frac{\sum_{i=1}^{n_{train}} \left( y_{i,train} - \hat{y}_{i,train} \right)^2}{\sum_{i=1}^{n_{train}} \left( y_{i,train} - \bar{y} \right)^2} \right] \tag{8} $$

respectively.

Furthermore, the test dataset was used to validate the model by computing the last two criteria, the mean-squared error of the test dataset (MSEtest) and the external validation (Q²ext). The two criteria are defined by

$$ MSE_{test} = \frac{\sum_{i=1}^{n_{test}} \left( y_{i,test} - \hat{y}_{i,test} \right)^2}{n_{test}} \tag{9} $$

and

$$ Q^2_{ext} = 1 - \left[ \frac{\sum_{i=1}^{n_{test}} \left( y_{i,test} - \hat{y}_{i,test} \right)^2}{\sum_{i=1}^{n_{test}} \left( y_{i,test} - \bar{y}_{train} \right)^2} \right] \tag{10} $$

respectively, where $n_{train}$ and $n_{test}$ represent the training and test sample sizes, and $y_{i,train}$, $y_{i,test}$, $\hat{y}_{i,train}$, and $\hat{y}_{i,test}$ stand for the IE values of the training dataset and test dataset and their corresponding predicted IE values. $\bar{y}$ and $\bar{y}_{train}$ represent the mean of all the IE values and the mean of the training IE values, respectively. In addition, the performance of the SMLRE was also compared with the partial least square (PLS).
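As a rough illustration of Equations 7–10, the sketch below computes the MSE and Q² criteria in plain Python. Several things here are assumptions made only for the example: the function names, the toy numbers, the use of a one-descriptor least-squares line as a stand-in for the SMLRR/SMLRE fits, and the reading of the Q²int predictions as leave-one-out predictions (as the name of the criterion suggests).

```python
def mse(y, y_hat):
    """Mean-squared error (form of Equations 7 and 9)."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def q2(y, y_hat, ref_mean):
    """1 - (prediction error sum of squares) / (sum of squares about ref_mean)."""
    press = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    total = sum((a - ref_mean) ** 2 for a in y)
    return 1.0 - press / total


def fit_line(x, y):
    """Least-squares intercept and slope for one descriptor (stand-in model)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def loo_predictions(x, y):
    """Leave-one-out: refit without compound i, then predict compound i."""
    preds = []
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(b0 + b1 * x[i])
    return preds


# Toy data (hypothetical, not taken from the study).
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [52.0, 58.5, 66.0, 71.0, 79.5]
x_test, y_test = [2.5, 4.5], [62.0, 75.0]

b0, b1 = fit_line(x_train, y_train)
y_fit = [b0 + b1 * v for v in x_train]
y_test_hat = [b0 + b1 * v for v in x_test]

# Reference means as described in the text: all IE values for Q2int,
# training IE values for Q2ext.
y_all_mean = sum(y_train + y_test) / (len(y_train) + len(y_test))
y_train_mean = sum(y_train) / len(y_train)

mse_train = mse(y_train, y_fit)
q2_int = q2(y_train, loo_predictions(x_train, y_train), y_all_mean)
mse_test = mse(y_test, y_test_hat)
q2_ext = q2(y_test, y_test_hat, y_train_mean)
```

Passing the reference mean explicitly keeps a single `q2` function usable for both the internal and the external criterion.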

Table I. The 12 screened descriptor names and their descriptions in the first stage

Descriptor name Group type Description


PJI3 Geometrical descriptors 3D Petitjean shape index
RDF035m RDF descriptors Radial distribution function – 035/weighted by mass RDF descriptors
VE2_G/D 3D matrix-based descriptors Average coefficient of the last eigenvector from geometrical matrix
Mor11v 3D-MoRSE descriptors Signal 11/weighted by van der Waals volume
H1p GETAWAY descriptors H autocorrelation of lag 1/weighted by polarizability
Mor11p 3D-MoRSE descriptors Signal 11/weighted by polarizability
P_VSA_p_3 P_VSA-like descriptors P_VSA-like on polarizability, bin 3
Eig02_AEA(ed) Edge adjacency indices Eigenvalue n. 2 from augmented edge adjacency mat. weighted by edge degree
Dp WHIM descriptors D total accessibility index/weighted by polarizability
P_VSA_e_2 P_VSA-like descriptors P_VSA-like on Sanderson electronegativity, bin 2
Mor12s 3D-MoRSE descriptors Signal 12/weighted by I-state
Mor12m 3D-MoRSE descriptors Signal 12/weighted by mass


Figure 2. The correlation matrix among the screened descriptors.

Table II. Evaluation criteria values for the training and test datasets

Methods   Training dataset        Test dataset
          MSEtrain    Q²int       MSEtest    Q²ext
PLS       9.88        0.87        7.02       0.86
SMLRR     10.13       0.92        7.75       0.89
SMLRE     6.25        0.98        2.34       0.95

PLS, partial least square; SMLRR, SMLR using ridge penalty; SMLRE, SMLR using elastic net.
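The relative differences between the MSE values in Table II can be checked with a few lines of arithmetic. The snippet below is only a convenience sketch; the helper name `percent_lower` is made up for this example, and the numbers are copied from the table.

```python
# MSE values copied from Table II.
mse_train = {"PLS": 9.88, "SMLRR": 10.13, "SMLRE": 6.25}
mse_test = {"PLS": 7.02, "SMLRR": 7.75, "SMLRE": 2.34}


def percent_lower(reference, value):
    """How much lower `value` is than `reference`, in percent."""
    return 100.0 * (reference - value) / reference


for label, table in (("train", mse_train), ("test", mse_test)):
    for method in ("SMLRR", "PLS"):
        drop = percent_lower(table[method], table["SMLRE"])
        print(f"MSE{label}: SMLRE is {drop:.1f}% lower than {method}")
```

For the tabulated values this gives reductions of about 38.3% and 36.7% for the training dataset and about 69.8% and 66.7% for the test dataset, relative to SMLRR and PLS, respectively.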

3. RESULTS AND DISCUSSION


3.1. First stage results
In QSAR modeling of inhibition efficiency, all 1,922 molecular descriptors were considered. The SIS was employed to screen the most relevant molecular descriptors. First, the Pearson correlation between each descriptor and the IE was calculated. Then, based on θ = ntrain − 1, 12 relevant descriptors were screened according to their highest Pearson correlation values. The names of the screened descriptors and their descriptions are listed in Table I. The correlation matrix of the screened descriptors is shown in Figure 2.
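The screening step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy descriptor names, and the toy values are hypothetical, and descriptors are ranked by the absolute value of the Pearson correlation, which is the usual convention for SIS.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant column."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def sis_screen(descriptors, y, keep=None):
    """Rank descriptors by |Pearson r| with y and keep the top `keep` names.

    `descriptors` maps a descriptor name to its list of values.
    By default keep = n - 1, following the threshold quoted in the text.
    """
    if keep is None:
        keep = len(y) - 1
    ranked = sorted(
        descriptors,
        key=lambda name: abs(pearson(descriptors[name], y)),
        reverse=True,
    )
    return ranked[:keep]


# Hypothetical toy data: 4 compounds, 5 descriptors (names are made up).
ie = [40.0, 55.0, 60.0, 85.0]
toy = {
    "d_noise": [3.0, 1.0, 4.0, 1.5],
    "d_strong": [1.0, 2.1, 2.4, 4.0],
    "d_inverse": [9.0, 6.5, 6.0, 2.0],
    "d_const": [1.0, 1.0, 1.0, 1.0],
    "d_weak": [2.0, 3.0, 1.0, 3.5],
}
selected = sis_screen(toy, ie)  # keeps n - 1 = 3 descriptors
```

With 13 training compounds, the default threshold keeps 12 descriptors, which matches the number of screened descriptors reported in Table I.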

Figure 3. Plot of experimental versus predicted IE values as obtained from the training and test datasets (a) SMLRR and (b) SMLRE.

3.2. Second stage results

The SMLRR and SMLRE methods were evaluated and validated to test their predictive ability in constructing the QSAR model. For the training dataset, the threefold CV method was conducted to find the optimal values of the tuning parameters of SMLRR and SMLRE. The training dataset was randomly divided into three folds, so that the data used for fitting differ each time. First, λ2 was set to take values between 0 and 100, and then for each λ2 value, the threefold CV was employed to find the best value of λ1, which was also set to take values between 0 and 100. Second, the values of κ were set in the range between 0 and 100. Again, the threefold CV was employed. Among the three folds, one fold was retained as the validation dataset, and the remaining two folds were used as the training dataset to fit the SMLRR and SMLRE with a specific tuning value. The CV process was repeated three times, and the optimal tuning parameter was the one with the minimum CV prediction error. The optimal values for κ, λ1, and λ2 were 4.52, 1.42, and 0.07, respectively. Consequently, SMLRE selected seven relevant molecular descriptors from the screened molecular descriptors. The prediction assessment criteria results are reported in Table II. In addition, PLS was also applied, and its prediction assessment criteria results are given in Table II. The performance of SMLRE was compared with the SMLRR and PLS methods.
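The tuning procedure described above can be sketched as a small grid search. The code below is a simplified, pure-Python stand-in, not the software used in the study: it minimizes the objective of Equation 6 by coordinate descent, assumes standardized descriptors and a centered response (so no intercept is fitted and the folds are not re-centered), and uses made-up toy data, grids, and function names.

```python
import random


def soft_threshold(value, cut):
    """Soft-thresholding operator arising from the L1 part of the penalty."""
    if value > cut:
        return value - cut
    if value < -cut:
        return value + cut
    return 0.0


def elastic_net(X, y, lam1, lam2, sweeps=200):
    """Coordinate descent for (y - Xb)'(y - Xb) + lam1*sum|b_j| + lam2*sum b_j^2."""
    p = len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta for beta = 0
    for _ in range(sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            zj = sum(v * v for v in col)
            if zj + lam2 == 0.0:
                continue  # empty column and no ridge term: nothing to update
            # inner product of column j with the partial residual
            rho = sum(v * r for v, r in zip(col, resid)) + zj * beta[j]
            new = soft_threshold(rho, lam1 / 2.0) / (zj + lam2)
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * v for r, v in zip(resid, col)]
                beta[j] = new
    return beta


def cv_error(X, y, lam1, lam2, folds):
    """Mean squared prediction error over the held-out folds."""
    total, count = 0.0, 0
    for held_out in folds:
        train = [i for i in range(len(y)) if i not in held_out]
        beta = elastic_net([X[i] for i in train], [y[i] for i in train], lam1, lam2)
        for i in held_out:
            pred = sum(b * v for b, v in zip(beta, X[i]))
            total += (y[i] - pred) ** 2
            count += 1
    return total / count


def tune(X, y, grid1, grid2, k=3, seed=0):
    """For each lam2, scan lam1; return the (lam1, lam2) pair with the lowest CV error."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    scored = ((cv_error(X, y, l1, l2, folds), l1, l2) for l2 in grid2 for l1 in grid1)
    return min(scored)[1:]


# Hypothetical toy data: 12 compounds, 4 descriptors, 2 of them informative.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(12)]
y = [3.0 * r[0] - 2.0 * r[1] + rng.gauss(0, 0.1) for r in X]
best_lam1, best_lam2 = tune(X, y, grid1=[0.0, 0.5, 1.0, 5.0], grid2=[0.0, 0.1, 1.0])
```

Setting `lam1 = 0` reduces the same routine to the ridge objective of Equation 5 with κ = `lam2`, so one function covers both second-stage models in this sketch.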


Table III. Experimental and predicted IE values of the training and test datasets

Compound no.   Compound name   Experimental IE (%)   Predicted IE (%), SMLRR   Predicted IE (%), SMLRE
1 Ethyl 5-(chloromethyl)-2-furoate 84.51 81.31 82.48
2 5-(2-Furyl)-1,3-cyclohexanedione 73.53 77.00 71.79
3 2-Furanmethanethiol 88.05 84.05 86.04
4 2-Furonitrile 72.68 73.69 74.49
5 5-Bromo-2-furoic acid 72.01 69.64 69.00
6* 5-Methylfurfurylamine 71.56 75.14 70.02
7* trans-3-Furanacrylic acid 61.41 65.55 63.36
8 2-Ethylfuran 59.53 54.95 62.35
9* Methyl 2-furoate 58.65 56.60 57.45
10 5-Methylfurfural 56.73 53.97 57.53
11 2-Furoic acid 57.33 59.77 59.33
12 5-(Dimethylaminomethyl)furfuryl alcohol hydrochloride 66.83 63.19 63.19
13 Methyl 2-methyl-3-furoate 55.97 51.35 51.50
14 2-Furoyl chloride 50.95 53.71 53.65
15 Furfuryl alcohol 52.55 54.36 53.18
16 Furfurylamine 40.11 42.73 42.20
17* 2-(2-Nitrovinyl)furan 34.76 36.60 34.54
18* Methyl 5-nitro-2-furoate 54.91 55.97 56.92
*The compound belongs to test data.
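As a worked check of the test-set mean-squared error (Equation 9), the sketch below recomputes MSEtest from the five test compounds marked with * in Table III. The values are copied from the table; the variable and function names are chosen only for this example.

```python
# (compound no., experimental IE, SMLRR prediction, SMLRE prediction)
# for the five test compounds marked with * in Table III.
test_rows = [
    (6, 71.56, 75.14, 70.02),
    (7, 61.41, 65.55, 63.36),
    (9, 58.65, 56.60, 57.45),
    (17, 34.76, 36.60, 34.54),
    (18, 54.91, 55.97, 56.92),
]


def mse(observed, predicted):
    """Mean of the squared differences between observed and predicted IE."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


observed = [row[1] for row in test_rows]
mse_smlrr = mse(observed, [row[2] for row in test_rows])
mse_smlre = mse(observed, [row[3] for row in test_rows])

print(f"MSEtest SMLRR: {mse_smlrr:.2f}")
print(f"MSEtest SMLRE: {mse_smlre:.2f}")
```

With the rounded values printed in Table III, this gives roughly 2.34 for SMLRE, in line with Table II, and roughly 7.73 for SMLRR, close to the tabulated 7.75; the small gap is presumably due to rounding of the listed predictions.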

It can be seen from Table II that the MSEtrain of the SMLRE is about 38.3% and 36.7% lower than that of SMLRR and PLS, respectively. In addition, the prediction performance of the SMLRE (Q²int = 0.98) for the training dataset is much better than that of the SMLRR (Q²int = 0.92) and PLS (Q²int = 0.87), indicating better predictive ability of the SMLRE than the SMLRR and PLS. For the test dataset, it is noteworthy that SMLRE reduces the MSEtest significantly in comparison with the SMLRR and PLS. The MSEtest value of SMLRE is about 69.8% and 66.7% lower than that of SMLRR and PLS, respectively. Furthermore, the Q²ext value for SMLRE (Q²ext = 0.95) is higher than that for SMLRR (Q²ext = 0.89) and PLS (Q²ext = 0.86), indicating the greater predictive ability of SMLRE.

Figure 3 shows the correlation between the experimental values of the IE and their corresponding predicted values using SMLRR and SMLRE for both the training and test datasets. Figure 3(b) clearly reveals that the predicted IE values are in good agreement with the experimental values, with correlations of 0.984 and 0.992 for the training and test datasets, respectively. All the predicted IE values for both the training and test datasets using SMLRR and SMLRE are listed in Table III.

Furthermore, a Y-randomization test was performed to confirm that the QSAR model constructed by SMLRE is not obtained by chance correlation. The IE values of the training dataset were repeatedly shuffled to construct new QSAR models. Each time, the Q²int of the SMLRE was calculated. For the SMLRE-based model to be considered reliable, all the new Q²int values of the SMLRE must be lower than the original Q²int of the SMLRE (Q²int = 0.98). Figure 4 shows the results obtained by repeating the Y-randomization test 500 times. As shown in Figure 4, the lower Q²int values obtained by the Y-randomization test compared with the original Q²int value indicate that the SMLRE-based model is not due to chance correlation.

Figure 4. Y-randomization test for SMLRE over 500 times.

For further evaluation of the SMLRE ability in constructing a robust QSAR model, the applicability domain was assessed using the Williams plot. This plot shows the standardized residuals versus the leverage. A compound is considered an outlier when its corresponding standardized residual falls outside the standardized residual limits (± 3σ). In addition, an influential compound can be detected when its leverage value is greater than the leverage threshold (h* = 3(p + 1)/n), where p represents the number of the selected descriptors in the final QSAR model, and n represents the number of compounds. Figure 5 displays the Williams plot of the leverage values against the standardized residuals for each compound for the SMLRR and SMLRE models. The dotted line indicates the leverage threshold, while the dashed line represents the standardized residual limits. It is


4. CONCLUSION
In the present work, a new QSAR of the inhibition of mild steel
corrosion in 1 M HCl using furan derivatives was developed by
proposing two-stage SMLR. The results gained by the internal
validation criteria (MSEtrain and Q2int) for training dataset and the
external validation parameters (MSEtest and Q2ext) for the test
dataset prove better predictive capability of the QSAR model
developed using SMLRE compared with SMLRR-based model.
In addition, the obtained results by the Y-randomization test
and applicability domain confirm that the SMLRE-based model
is reliable and robust. In conclusion, the current study proposes
SMLRE as a useful approach to be appropriately used in other
QSAR studies.

Acknowledgements
The authors acknowledge the Ministry of Higher Education of
Malaysia (MOHE), the Research Management Center (RMC) at
the University Technology Malaysia (UTM), grant with VOT no.
4F257 and financial support given to the first author by Sana’a
University, Sana’a, Yemen. We thank Dr. Mohamed Noor Hasan
for the permission of using DRAGON software.

REFERENCES
1. Amin MA, Khaled KF, Fadl-Allah SA. Testing validity of the Tafel
extrapolation method for monitoring corrosion of cold rolled steel
in HCl solutions – experimental and theoretical studies. Corr. Scien.
2010; 52: 140–151.
2. Hussin MH, Kassim MJ. The corrosion inhibition and adsorption
behavior of Uncaria gambir extract on mild steel in 1 M HCl. Mate.
Chem. and Phys. 2011; 125: 461–468.
3. Solmaz R. Investigation of adsorption and corrosion inhibition of mild
steel in hydrochloric acid solution by 5-(4-dimethylaminobenzylidene)
rhodanine. Corr. Scien. 2014; 79: 169–176.
4. Al-Turkustani AM, Arab ST, Al-Qarni LSS. Medicago Sative plant as
safe inhibitor on the corrosion of steel in 2.0 M H2SO4 solution.
J. Saudi Chem Soc 2011; 15: 73–82.
5. Moretti G, Guidi F, Fabris F. Corrosion inhibition of the mild steel in
0.5 M HCl by 2-butyl-hexahydropyrrolo[1,2-b][1,2]oxazole. Corr Scien
2013; 76: 206–218.
6. Raja PB, Qureshi AK, Abdul Rahim A, Osman H, Awang K.
Neolamarckia cadamba alkaloids as eco-friendly corrosion inhibitors
for mild steel in 1 M HCl media. Corr. Scien. 2013; 69: 292–301.
7. Ramde T, Rossi S, Zanella C. Inhibition of the Cu65/Zn35 brass
corrosion by natural extract of Camellia sinensis. Appl. Surf. Scien.
2014; 307: 209–216.
8. Zarrouk A, Hammouti B, Dafali A, Bouachrine M, Zarrok H, Boukhris S,
Al-Deyab SS. A theoretical study on the inhibition efficiencies of
some quinoxalines as corrosion inhibitors of copper in nitric acid.
J. Saudi Chem. Soc. 2014; 18: 450–455.
9. Bentiss F, Mernari B, Traisnel M, Vezin H, Lagrenée M. On the
relationship between corrosion inhibiting effect and molecular
structure of 2,5-bis(n-pyridyl)-1,3,4-thiadiazole derivatives in acidic
media: ac impedance and DFT studies. Corros. Sci. 2011; 53: 487–495.
10. Gholami M, Danaee I, Maddahy MH, RashvandAvei M. Correlated ab
initio and electroanalytical study on inhibition behavior of
2-mercaptobenzothiazole and its thiole–thione tautomerism effect
for the corrosion of steel (API 5 L X52) in sulphuric acid solution.
Ind. Eng. Chem. Res. 2013; 52: 14875–14889.
11. Pourbasheer E, Aalizadeh R, Ganjali MR. QSAR study of CK2 inhibitors
by GA-MLR and GA-SVM methods. Arab. J. Chem. 2015. [Available at
10.1016/j.arabjc.2014.12.021].
12. Pourbasheer E, Aalizadeh R, Shokouhi TS, Ganjali MR, Norouzi P,
Shadmanesh J. 2D and 3D quantitative structure–activity
relationship study of hepatitis C virus NS5B polymerase inhibitors
by comparative molecular field analysis and comparative molecular
similarity indices analysis methods. J. Chem. Inf. Model. 2014; 54:
2902–2914.

Figure 5. Williams plot for the training and test datasets: (a) SMLRR and
(b) SMLRE.

obvious from Figure 5(a) that two compounds, 14 and 7, are outliers in
the training and test datasets, respectively, because their standardized
residuals exceed ±3σ; however, they are not considered influential
compounds. It is clear from Figure 5(b) that there are no outlier
compounds and that none of the compounds is influential. In conclusion,
the evaluation and validation results suggest that the model constructed
using SMLRE is reliable and can be used to predict the inhibition
efficiency of new relevant compounds.
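
The quantities plotted in a Williams plot can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the authors' implementation: it uses an ordinary least-squares fit as a stand-in for the SMLRE model, takes the warning leverage as h* = 3k/n (a common convention, with k the number of model coefficients including the intercept and n the number of training compounds), and standardizes residuals by the training residual standard deviation. The function name `williams_plot_data` and the returned keys are hypothetical.

```python
import numpy as np


def williams_plot_data(X_train, y_train, X_test, y_test):
    """Leverages and standardized residuals for a Williams plot (sketch)."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)

    # Add an intercept column to the descriptor matrices.
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    n, k = A_train.shape  # k = number of descriptors + 1

    # Assumption: plain least squares stands in for the fitted QSAR model.
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

    # Leverage h_i = a_i^T (A^T A)^{-1} a_i, always based on the training set.
    xtx_inv = np.linalg.pinv(A_train.T @ A_train)
    h_train = np.einsum("ij,jk,ik->i", A_train, xtx_inv, A_train)
    h_test = np.einsum("ij,jk,ik->i", A_test, xtx_inv, A_test)

    # Residuals, scaled by the training residual standard deviation.
    res_train = y_train - A_train @ beta
    res_test = y_test - A_test @ beta
    sigma = np.sqrt(res_train @ res_train / (n - k))

    return {
        "h_train": h_train,
        "h_test": h_test,
        "std_res_train": res_train / sigma,
        "std_res_test": res_test / sigma,
        "h_star": 3.0 * k / n,  # assumed warning-leverage convention
    }


def flag_compounds(h, std_res, h_star, limit=3.0):
    """Boolean masks: response outliers (|residual| > limit) and high leverage."""
    return np.abs(std_res) > limit, h > h_star
```

With these outputs, a compound would be flagged as an outlier when its standardized residual lies outside ±3 and as a high-leverage compound when its leverage exceeds h*, which mirrors the reading of Figure 5 given in the text.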


13. Gupta S, Basant N, Singh KP. Qualitative and quantitative
structure-activity relationship modelling for predicting blood-brain
barrier permeability of structurally diverse chemicals. SAR QSAR
Environ. Res. 2015; 26: 95–124.
14. Pourbasheer E, Aalizadeh R, Ganjali MR, Norouzi P, Shadmanesh J.
QSAR study of ACK1 inhibitors by genetic algorithm–multiple linear
regression (GA–MLR). J. Saudi Chem. Soc. 2014; 18: 681–688.
15. Zhang K, Hughes-Oliver JM, Young SS. Analysis of high-dimensional
structure-activity screening datasets using the optimal bit string tree.
Technometrics. 2013; 55: 161–173.
16. Huang J, Fan X. Reliably assessing prediction reliability for high
dimensional QSAR data. Mol. Divers. 2013; 17: 63–73.
17. Khaled KF. Modeling corrosion inhibition of iron in acid medium by
genetic function approximation method: a QSAR model. Corros. Sci.
2011; 53: 3457–3465.
18. El Ashry ESH, Senior SA. QSAR of lauric hydrazide and its salts as
corrosion inhibitors by using the quantum chemical and topological
descriptors. Corros. Sci. 2011; 53: 1025–1034.
19. Eddy NO, Awe FE, Gimba CE, Ibisi NO, Ebenso EE. QSAR, experimental
and computational chemistry simulation studies on the inhibition
potentials of some amino acids for the corrosion of mild steel in
0.1 M HCl. Int. J. Electrochem. Sci. 2011; 6: 931–957.
20. Khaled KF, Al-Mobarak NA. A predictive model for corrosion
inhibition of mild steel by thiophene and its derivatives using
artificial neural network. Int. J. Electrochem. Sci. 2012; 7: 1045–1059.
21. Zhao H, Zhang X, Ji L, Hu H, Li Q. Quantitative structure–activity
relationship model for amino acids as corrosion inhibitors based
on the support vector machine and molecular design. Corros. Sci.
2014; 83: 261–271.
22. Mousavi M, Safarizadeh H, Khosravan A. A new cluster model based
descriptor for structure-inhibition relationships: a study of the effects
of benzimidazole, aniline and their derivatives on iron corrosion.
Corros. Sci. 2012; 65: 249–258.
23. Todeschini R, Consonni V, Mauri A, Pavan M. DRAGON, Software version
6.0, Talete srl. (2010). http://www.talete.mi.it/.
24. Filzmoser P, Gschwandtner M, Todorov V. Review of sparse methods
in regression and classification with application to chemometrics.
J. Chemom. 2012; 26: 42–51.
25. Al-Fakih AM, Aziz M, Abdallah HH, Algamal ZY, Lee MH, Maarof H.
High dimensional QSAR study of mild steel corrosion inhibition in
acidic medium by furan derivatives. Int. J. Electrochem. Sci. 2015;
10: 3568–3583.
26. Algamal ZY, Lee MH, Al-Fakih AM, Aziz M. High-dimensional QSAR
prediction of anticancer potency of imidazo[4,5-b]pyridine
derivatives using adjusted adaptive LASSO. J. Chemom. 2015; 29:
547–556.
27. Rasmussen MA, Bro R. A tutorial on the LASSO approach to sparse
modeling. Chemom. Intell. Lab. Syst. 2012; 119: 21–31.
28. Ross Kunz M, She Y. Multivariate calibration maintenance and
transfer through robust fused LASSO. J. Chemom. 2013; 27: 233–242.
29. ter Braak CJF. Regression by L1 regularization of smart contrasts and
sums (ROSCAS) beats PLS and elastic net in latent variable model.
J. Chemom. 2009; 23: 217–228.
30. Hoerl AE, Kennard RW. Ridge regression: biased estimation for
nonorthogonal problems. Technometrics. 1970; 12: 55–67.
31. Tibshirani R. Regression shrinkage and selection via the LASSO.
J. Roy. Statist. Soc. Ser. B. 1996; 58: 267–288.
32. Fan J, Li R. Variable selection via nonconcave penalized likelihood
and its oracle properties. J. Am. Stat. Assoc. 2001; 96: 1348–1360.
33. Zou H, Hastie T. Regularization and variable selection via the elastic
net. J. Roy. Statist. Soc. Ser. B. 2005; 67: 301–320.
34. Guo Y, Berman M. A comparison between subset selection and L1
regularisation with an application in spectroscopy. Chemom. Intell.
Lab. Syst. 2012; 118: 127–138.
35. Fan J, Lv J. Sure independence screening for ultra-high-dimensional
feature space. J. Roy. Statist. Soc. Ser. B. 2008; 70: 849–911.

5. SUPPORTING INFORMATION

Additional supporting information can be found in the online
version of this article at the publisher’s website.
