Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

Available online at www.sciencedirect.

com

ScienceDirect
Procedia Engineering 84 (2014) 879 – 886

“2014 ISSST”, 2014 International Symposium on Safety Science and Technology

Prediction on the auto-ignition temperature using substructural


molecular fragments
SHI Jingjie, CHEN Liping*, CHEN Wanghua
Department of Safety Engineering, School of Engineering, Nanjing University of Science & Technology, Nanjing 210094, China

Abstract

Quantitative structure-property relationship (QSPR) modeling of the auto-ignition temperature was performed using ensemble
multiple linear regression analysis and substructural molecular fragments method implemented in ISIDA software. In this paper,
molecular fragments were used as molecular descriptors and we chose the length of sequences respectively from two to eight and
twelve. Afterwards, the model was developed between the auto-ignition temperature and molecular descriptors and evaluated by
r2
the internal and external validations. Meanwhile, a new metric m was adopted to assess the built model further. The model
applicability domain was checked by the ISIDA software to verify prediction reliability. The results show that the prediction
2 'r 2
results are in good agreement with the experimental ones. The value of rm is more than 0.5, and the value of m is lower than
0.2. The above results indicate that the established model is great successful. This paper provides a new and effective method for
engineering to predict AIT of organic compounds.
© 2014
© 2014 The
The Authors.
Authors. Published
Published by
by Elsevier
Elsevier Ltd.
Ltd. This is an open access article under the CC BY-NC-ND license
Peer-review under responsibility of scientific committee of Beijing Institute of Technology.
(http://creativecommons.org/licenses/by-nc-nd/3.0/).
Peer-review under responsibility of scientific committee of Beijing Institute of Technology
Keywords: auto-ignition temperature; substructural molecular fragments;molecular descriptors; organic compounds; prediction

1. Introduction

The Auto-ignition temperature (AIT), which is also referred to as autogenous ignition temperature, spontaneous
ignition temperature, and self-ignition temperature, is one of the most important safety specifications used to
characterize the hazard potential of a chemical substance. In terms of the safety theory research and the production

* Corresponding author. Tel.: +086-13913897044


E-mail address: clp2005@hotmail.com

1877-7058 © 2014 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/3.0/).
Peer-review under responsibility of scientific committee of Beijing Institute of Technology
doi:10.1016/j.proeng.2014.10.510
880 Shi Jingjie et al. / Procedia Engineering 84 (2014) 879 – 886

safety of enterprise, the study on AIT is of great importance. At present, the measurement of AIT of compounds has
knowledge of the authors, QSPR is one of the most widely used theoretical methods for predicting AIT[1-5].
Quantitative Structure Property Relationship (QSPR) models are obtained by analyzing and calculating the
correlation between the property and a variety of structural information and physicochemical parameters[6]. John
Tetteh et al.[5] applied two types of feed forward neural modeling networks, radial basis function (RBF) and back
propagation function(BPF), to predict AIT of 233 compounds (a training set of 85 compounds and the validation set
of 148 compounds). Albahri et al.[1] used artificial neural network to investigate several structural group
contribution for predicting pure components AIT. Recently, the substructural molecular fragments (SMF) method
has been widely performed to predict many properties [7-9]. Solov’ev V.P. et al.[7] applied the SMF method to
model the relationships between the structure of organic molecules and their thermodynamic parameters of
complexationor extraction. In 2003, they assessed anti-HIV activity for large data sets for three families of
compounds:1-[2-hydroxyethoxymethyl]-6-(phenylthio)–thymine (HEPT) derivatives,
tetrahydroimidazobenzodiazepinone (TIBO) derivatives, and cyclic urea (CU) derivatives by using the SMF method.
What’s more, their group obtained very satisfactory results. So far, no one uses the SMF method to build the
quantitative relationship between AIT and molecular structures. Thus in this paper, we describe QSPR modeling of
AIT using SMF method implemented in the ISIDA (In Silico design and Data Analysis) software.

2. Materials and methodology

2.1. Data set

The initial dataset is composed of 265 compounds which are taken from The Chemical Database, which is one of
the most reliable databases at present. In order to obtain and validate a QSPR model, the dataset was randomly
divided into two subsets, training set consisting of 212 compounds and test set consisting of 53 compounds. The 2D
sketcher EdChemS and the Structure Data File (SDF) manager EDiSDF were used to prepare the 2D structures of
the compounds expressed with explicit hydrogen atoms. Distributions of AIT are given in Fig. 1. The values of AIT
vary in the range of 171–680ć.


Fig. 1. Distribution of experimental values of auto-ignition temperature in the modeling data set of 265 compounds.

2.2. Descriptors

The subgraphs of the molecular graph in Substructural Molecular Fragments (SMF) method are used as
descriptors[10]. It uses two types of topological descriptors (fragments) (Fig. 2): atom/bond sequences, and
“augmented atoms” (atoms with their nearest neighbors). Three sub-types of molecular fragments of AB, A and B
Shi Jingjie et al. / Procedia Engineering 84 (2014) 879 – 886 881

are defined for each class. For the fragments I, they represent sequences of atoms and bonds (AB), of atoms only
(A), or of bonds only (B). Shortest or all paths from one atom to the other are used. For each type of sequences, the
minimal (nmin) and maximal (nmax) number of constituted atoms must be defined. Thus, for the partitioning I (AB,
nmin - nmax), I(A, nmin - nmax) and I(B, nmin - nmax), the program generates “intermediate” sequences involving n atoms
(nmin d n d nmax). In the current version of ISIDA, nmin t 2 and nmax d 15. An “augmented atom” represents a selected
atom with its environment including either neighboring atoms or bonds (AB), or atoms only (A) (taking into account
atom hybridization, additional Hy type), or bonds only (B).

O
OH

O
N N O
OH
C
HO
C N N
O
N O
HO
O
SEQUENCES AUGMENTED ATOMS

ĉ II

ATOMS and BONDS (AB)


O=C-C-N; C-C-N; C-N; O=C-C; C=O; C-C C (-C) (-O) (=O)

C (C) (O) (O) or

O C C N; C C N; C N; O C C; C O; C C (Hy) Csp2(Csp3)(Osp3)(Osp2)

= - -; - -; -; = -; =; - C (-) (-) (=)

Fig. 2. Substructural molecular fragments: atom/bond sequences (A) and augmented atoms (B). Shortest path sequences (I) and augmented atoms
(II) including atoms and bonds (AB), only atoms (A) or only bonds (B). From top to bottom: the sequences (I) correspond to the I(AB, 2-4),
I(A, 2-4) and I(B, 2-4) types involving paths between each pair of atoms. The II(Hy) augmented atoms correspond to the II(A) type, where
hybridization of atom is taken into account.

The key problem of any QSPR study is related to selection of pertinent descriptors to model. In ISIDA software,
screening descriptors mainly follows three steps, namely filtering stage, forward stepwise pre-selection stage and
backward stepwise selection stage. In the first stage, the program eliminates variables which have small correlation
coefficient with the property, and those highly correlated with other variables, which were already selected for the
model. In the second stage, the suite of forward and backward stepwise algorithms has been used for variable pre-
selection in ISIDA studies by the Variable Selection Suite (VSS) program. The final selection is performed to use
backward stepwise variable selection procedure based on the t statistic criterion.

2.3. QSPR models

The ISIDA program realizes the SMF method for QSPR and QSAR modeling. The SMF method is based on
the splitting of a molecular graph into fragments, and on the calculations of their contributions to a given property.
When a compound is split into constitutive fragments, the fragments contributions to AIT or to any other physical or
chemical property are calculated using the linear fitting equation.
Y a0  ¦ ai N i  * (1)
i

where, Y is the modelling property (Y=ATI), ai is the fragment contributions, Ni is the number of fragments of i type,
882 Shi Jingjie et al. / Procedia Engineering 84 (2014) 879 – 886

the a0 term is the independent fragment and by default = 0.


A significant advantage of SMF method is the possibility to select during the training stage several best fit models
(instead of a single QSPR model) related to different fragmentation schemes. A Consensus Model (CM) can be
calculated by ISIDA/QSPR program which combines the information issued from several models. According to the
values of cross-validation correlation coefficient, Q2 t Q2lim, where Q2lim is user defined threshold. Here, Q2lim t0.85
has been used. Thus, for the test set, the property as an arithmetic mean of values is obtained by those models. Those
leading to outlying values were excluded according to Grubbs’s statistics. To our knowledge, such an ensemble
model can improve the quality of property predictions due to smoothing inaccuracies of individual models.

3. Model validation

The regulatory use of QSPRs requires validation to ensure that they have acceptable predictive power. So internal
validation (Q2loo) and external validation (Q2ext) should be applied for evaluating the model in this paper.
The internal validation for the model is necessary for robustness and possible high predictive power. In this
research, we have applied the leave-one-out (LOO) for the internal validation, which is calculated according to the
formula[11].
training

¦ (Yi  Yli ) 2
(2)
Q 2
loo 1 i 1
training

¦i 1
(Yi  Y ) 2

Y l
where, i , Yi and Y are the experimental, predicted, and averaged (over the entire training dataset) values of the
samples in the training set, respectively. A model is considered acceptable when the value of Q2loo exceeds 0.5.
Varying the minimal and maximal length of the fragments, descriptors were generated followed by the building of
QSPR model from each of them. The most robust models were selected according to leave-one-out cross-validation
correlation coefficient Q2loo > Q2loolim, where Q2loolim is a user-defined threshold. Q2loolim ≥0.85 have been used in this
study.
2
However, recent studies of Tropsha et al.[12] have systematically indicated that the Qloo is a necessary, but not
sufficient measure of a model’s true predictive power. Roy[13] and Pinheiro[14] et al. pointed out that the real
predictive capability of each model developed on the training set is verified on an external validation set. The
predictive ability of a model on external validation set can be expressed by the following equation(3)[11].
prediction

¦ (Yi  Yli ) 2
2
Q
ext 1 i 1 (3)
prediction

¦i 1
(Yi  Y ) 2

l
where , i and Yi are respectively the experimental, predicted values of the test set, and is averaged values of the
Y
samples in the training set.
The other useful parameters named squared correlation coefficient (R2) and root mean-squared error (RMSE)
were also employed to evaluate the performance of developed models, which are important indicators for linear
correlation between predicted and experimental data. They characterize an ability of the model to reproduce
quantitatively the experimental data. R2 is an indicator that measures the linear correlation degree between one
variable and another. RMSE indicates dispersion degree of the random error, which summarizes the overall error of
the model.
2
In addition to the classic validation, the Roy group proposed a novel metric rm as an additional validation
parameter, which is based on the correlations between the observed and predicted values with and without intercept
for the least squares regression lines. The mentioned equations are as follows.
rm2 r 2 u (1  r 2  r02 )
(4)
2
rm2  rm'
rm2
2 (5)
Shi Jingjie et al. / Procedia Engineering 84 (2014) 879 – 886 883

2
'rm2 rm2  rm'
(6)
where r 2 and r02 are respectively the determination coefficients of the regression function calculated using the
experimental and the predicted data of the prediction set, forcing respectively the origin of the axis ( r02 ) or not ( r 2 ).
2
rm2 is calculated using the experimental values on the ordinate axis, while rm' is on the abscissa. A web application
for calculation of the new metrics has been introduced here. For an acceptable QSPR model, the value of rm2 should
be more than 0.5, and the value of 'rm2 should preferably be lower than 0.2.
Model AD is an active area of modern QSAR research. In this study, the applicability domain (AD) of models
was defined by the software. In the software, two AD approaches have been simultaneously used: bounding box,
considering AD as a multidimension descriptor space confined by minimal and maximal values of counts of SMF
descriptors, and fragment control rejecting test compounds fragments non-occurring in the initial SMF descriptors
pool. That is to say that: (1) it does not predict if fragment count outside the min/max values of the model; (2) it
does not predict if compound includes fragments outside the model.

4. Results and discussion

4.1. Results and discussion

Based on the length of sequences from 2 to 15, the models are respectively established in this study. After that,
the developed models are analyzed by the leave-one-out (LOO) method. The Q2loo values are as shown in Fig. 3. The
Q2loo values of sequences 2 to 12, 13, 14 and 15 are the same, so we only need to choose any one of them. According
to the principle of Q2loo≥0.85, the best length of sequences are from 2 to 8 and from 2 to 12, respectively.
0.90
0.85
0.80
0.75
0.70
0.65
Q loo
2

0.60
0.55
0.50
0.45
0.40
2-3 2-4 2-5 2-6 2-7 2-8 2-9 2-102-112-122-132-142-15
the length of sequences

Fig. 3. The length of sequences vs Q2loo.

Afterwards, the applicability domain (AD) of this model is deeply analyzed. At the training stage, TRAIL
excluded 24compounds containing unique fragments (occurs only in one compound), thus reducing the training set
to 188 compounds. Among 53 compounds of the test set, 14 compounds found outside of the model’s applicability
domain and, therefore, have not been predicted. The following statistical parameters of predictions have been
obtained for the remaining 188 compounds for the training set and 39 compounds for the test set.
From Table 1, we can see that an ensemble model has improved the quality of property predictions due to
smoothing inaccuracies of individual models which are based on the length of sequences from 2 to 8 and from 2 to
12. The performance parameters of the ensemble model are better than the individual model. Hence, this research is
only in-depth discussion about the ensemble model. For the ensemble model, a reasonable performance of
prediction of AIT has been achieved: the squared determination coefficient is high. The Fig. 4 shows that the
predicted values are in good agreement with experimental values, which manifests the linear relation between AIT
and the molecular structure. Compared with the predicted result of the training and test set, the squared
determination coefficient (R2) is very high and the prediction error is quite low. In addition, r is much more than
2
m
884 Shi Jingjie et al. / Procedia Engineering 84 (2014) 879 – 886

0.5 and 'rm is lower than 0.2. All of analysis demonstrates that the developed model has not only stronger
2

prediction ability, but also better generalization performance than the previous one.

Table 1. Performance comparison.


performance IAB2-8 IAB2-12 average
parameters train set test set train set test set train set test set
R2 0.9228 0.7908 0.9216 0.7898 0.9311 0.8046
Q2loo 0.9227 -- 0.9216 -- 0.9309 --
Q2ext -- 0.8058 -- 0.7933 -- 0.8141
RMSE 27.29 40.32 27.47 41.59 25.80 39.44
rm2 0.8810 0.6744 0.8859 0.6804 0.8879 0.6831
'rm2 0.06784 0.1705 0.06825 0.1691 0.06126 0.1633
n 188 39 188 39 188 39

700

600

500
Predicted values/ć

400

300

200
training set
test set
100
100 200 300 400 500 600 700
Experimental values/ć

Fig. 4. Comparison between the predicted and experimental AIT for the training and test set.

In order to further analyze this model, the prediction error is calculated for the whole sample set. The formula of
the relative error is expressed by the equation (7).
n yi ,pred  yi ,
RE= ¦
i 1 yi ,obs
(7)
In terms of the relative error, the mean relative error is 6.456% and the maximum relative error is 33.82%. The
relative errors of 180 compounds are smaller than 10%, whose number accounts for almost 80% of the whole
samples. As shown in Fig. 5, the relative errors of compounds are mainly concentrated in the range of from 0 to 5,
which reveals that this model has strong predictive ability for the most compounds.
Although prediction effect of the developed model in this study is satisfying, the outlier still exists, such as N-n-
butyldiethanolamine, dibutylamine, and so on. The outlier is screened based on the standard that the absolute error is
more than twice the standard error, and the screened outliers are listed in Table 2.The outliers severely influence the
prediction performance of the models. If abnormal values are screened, the performance of the models is greatly
improved. For the training and test set, the squared correlation coefficient is 0.9413 and 0.9043, the corresponding
Q2loo and Q2ext is 0.9410 and 0.9117, RMSE is 23.77 and 25.38, respectively. In terms of the training set, r and
2
m

'rm2 are 0.9033 and 0.05234, and for the test set r and 'rm2 are 0.8646 and 0.05329. Comprehensive comparison
2
m

of the all the parameters reveals the improvement of the performance of the model.
Shi Jingjie et al. / Procedia Engineering 84 (2014) 879 – 886 885

120 115

Number of compounds for each interval


100

80
65
60

40
28
20
12
3 2 2
0
0-5 5-10 10-15 15-20 20-25 25-30 30-35
error interval/%

Fig. 5. No. of compounds for each interval and the percent errors obtained by the ensemble model.

Table 2. The outliers of the sample.


No. Compound Experimental value/ć Predicted value/ć Absolute residuals/ć
1 N-n-butyldiethanolamine 260 347.94 87.94
2 dibutylamine 260 343.86 83.86
3 catechol 567 492.25 74.76
4 isobutane 460 386.29 73.71
5 1-pentene 273 346.35 73.35
6 3-methoxyaniline 545 474.62 70.39
7 propane 432 366.43 65.58
8 benzyl alcohol 435 498.38 63.38
9 ethyl methacrylate 435 373.10 61.91
10 diisopropanolamine 290 351.30 61.30
11 n-methylacetamide 490 430.38 59.62
12 cis-2-pentene 288 347.48 59.48
13 sec-butylamine 290 347.94 57.94

4.2. Modeling of AIT: SMF method vs other QSPR techniques

As we known, the various models had been built based on different dataset and different methods, and each model
possesses its own advantages and disadvantages. So it is suggested that not only the prediction results but also more
other important characteristics of models should be taken into account and analyzed. Therefore, detailed comparison
between the present model and previous ones are presented as follows.
Chen Xi-jin et al.[2] modeled the AIT using a dataset of 52 compounds only including alkanes. They used atom-
type electrotopological state indices as molecular structure descriptors, and applied trial-and-error method to
determine the optimal parameters. They built Multiple Linear Regression (MLR) model and Artificial Neural Network
(ANN) model, and ANN model is superior to MLR model. Compared with the work of Chen Xi-jin et al., this
model has some advantages. (1) Only atom-type electrotopological state indices were used as molecular structure
descriptors in their model. Although this kind of descriptors can combine together both electronic and topological
characteristics of the analyzed molecules, they fail to fully characterize the molecules. In this study, the Dragon 2.1
program is used to calculate the molecular descriptors, which is a sophisticated program for the calculation of
molecular descriptors. A wide variety of descriptors have been calculated for each compound in the dataset, such as
topological descriptors, geometrical descriptors, electrostatic descriptors and quantum chemical descriptors and so
on. (2) The experimental data have been derived from different database, and the data of the same compound are not
the same provided by different researchers. The error of the experimental data will directly influence the reliability
of the QSPR model. In this paper, the data are all from The Chemical Database, which can provide a large number
886 Shi Jingjie et al. / Procedia Engineering 84 (2014) 879 – 886

of experimental values for organic compounds.


As the work of Yong et al.[3], a general comparison is presented. Pan Yong applied Multiple Linear Regression
(MLR), BP neural network (BPNN) and Support Vector Machine (SVM) to build models for 90 hydrocarbon
compounds. Both internal and external validations were performed to validate the performance of the resulting
models. The results showed that the prediction results of SVM were in good agreement with the experimental values,
and were the most satisfactory among the three methods. SVM has some advantages of converging to the global
optimum but not to the a local optimum. However, the models of Pan Yong still have some disadvantages.
Regarding the diversity of compounds, Pan only uses the 90 hydrocarbon compounds as the sample set, and the set
is too small to indicate the quantitative relationship between the auto-ignition temperature (AIT) and molecular
structures. The number of the compounds in this paper is almost three times as much as the one of the compounds in
Pan’s work. Regarding operational complexity, SVM needs programming in MATLAB software, choosing the
appropriate kernel function and determining the parameters of the kernel function. The above process is very
complicated, while this operation is simple and easy to apply. The whole process is accomplished only by ISIDA
software.

5. Conclusions

In the present work, an accurate QSPR model has been developed for predicting the AIT values of a diverse of
organic compounds by using SMF method. Internal validation, external validation and a new rm2 metric were
performed to validate the performance of the built model. What’s more, the application domain was also analyzed
by the ISIDA software. The results showed that the satisfactory model was obtained, and the prediction errors were
comparatively small and within the accepted error range. Thus, it can be concluded that the proposed method would
be expected to predict AIT for new organic compounds or for other organic compounds for which experimental
values are unknown.

References

[1] J. Van der Geer, J.A.J. Hanraads, R.A. Lupton, The art of writing a scientific article, J. Sci. Commun. 163 (2000) 51–59.
[2] W. Strunk Jr., E.B. White, The Elements of Style, third ed., Macmillan, New York, 1979.
[3] G.R. Mettam, L.B. Adams, How to prepare an electronic version of your article, in: B.S. Jones, R.Z. Smith (Eds.), Introduction to the
Electronic Age, E-Publishing Inc., New York, 1999, pp. 281–304.
[1] T.A. Albahri, R.S. George, Artificial Neural Network Investigation of the Structural Group Contribution Method for Predicting Pure
Components Auto Ignition Temperature. Industrial and Engineering Chemistry Research, 42(2003) 5708-5714.
[2] X.J. Chen, Y. Pan, Prediction of auto-ignition temperatures of alkanes by electrotopological state indices. Journal of Safety Science and
Technology, 5(2009)16-20.
[3] Pan, Y., Jiang, J.C., Cao, H.Y. & Wang, R. Prediction of flammability characteristics by using quantitative structure-property relationship
study based on neural network. Chemical Industry and Engineering Progress, 27(2008) 378-384.
[4] T. Suzuki, Quantitative Structure-Property Relationships for Auto-ignition Temperatures of Organic Compounds. Fire and Materials,
18(1994) 81-88.
[5] J. Tetteh, E.Metcalfe, S.L. Howells, Optimisation of radial basis and backpropagation neural networks for modelling auto-ignition
temperature by quantitative-structure property relationships. Chemometrics and Intelligent Laboratory Systems, 32(1996) 177-191.
[6] A. Golbraikh, A. Tropsha, Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set
selection. Journal of Computer-Aided Molecular Design, 5(2002) 231-243.
[7] V.P. Solov’ev, A. Varnek, Anti-HIV Activity of HEPT, TIBO, and Cyclic Urea Derivatives: Structure-Property Studies, Focused
Combinatorial Library Generation, and Hits Selection Using Substructural Molecular Fragments Method. Journal of Chemical Information
and Computer Science, 43(2003) 1703-1719.
[8] V.P. Solov’ev, A.A. Varnek, Structure-property modeling of metal binders using molecular fragments. Russian Chemical Bulletin,
International Edition, 53(2004) 1434-1445.
[9] A. Varnek, G. Wipff, Assessment of the Macrocyclic Effect for the Complexation of Crown-Ethers with Alkali Cations Using the
Substructural Molecular Fragments Method. Journal of Chemical Information and Computer Science, 42(2002) 812-829.
[10] V.P. Solov’ev, Modeling of Ion Complexation and Extraction Using Substructural Molecular Fragments. Journal of Chemical Information
and Computer Science, 40(2000) 847-858.
[11] J.J. Shi, L.P.Chen, W.H. Chen, N. Shi, H. Yang, W. Xu, Prediction of the Thermal Conductivity of Organic Compounds Using Heuristic and
Support Vector Machine Methods. Acta Physico-Chimica Sinica, 28(2012) 2790-2796.
[12] Golbraikh, A.& Tropsha, A. Beware of q2! Journal of Molecular Graphics and Modelling, 20(2002) 269-276.
[13] K. Roy, I. Mitra, S. Kar, K. P. Ojha, R.N. Das, H. Kabir, Comparative Studies on Some Metrics for External Validation of QSPR Models.
Journal of Chemical Information and Modeling, 52(2012) 396-408.
[14] L.M.V. Pinheiro, M.C.M.M. Ventura, M.L.C.J. Moita, Application of QSPR-MLR methodology to solvatochromic behavior of quinoline in
binary solvent HBD/DMF mixtures. Journal of Molecular Liquids, 154(2010) 102-110.

You might also like