
Full Papers

Prediction of ITQ-21 Zeolite Phase Crystallinity: Parametric Versus Non-parametric Strategies
Laurent A. Baumes, Manuel Moliner, Avelino Corma*
Instituto de Tecnología Química (UPV-CSIC), av. de los Naranjos, E-46022, Valencia, Spain, E-mail: acorma@itq.upv.es

Keywords: Data mining, High-throughput, ITQ-21 zeolite, Parametric, Regression, Statistics

Received: May 8, 2006; Accepted: July 13, 2006

DOI: 10.1002/qsar.200620064

Abstract
This work deals with data analysis techniques and high-throughput tools for synthesis and
characterization of solid materials. In previous studies, it was found that the final
properties of materials could be successfully modeled using learning systems. Machine
learning algorithms such as neural networks, support vector machines, and regression
trees are non-parametric strategies. They are compared to traditional parametric statistical
approaches. We review a wide range of statistical methodologies, and all the methods are
evaluated using experimental data derived from an exploration-optimization of the
material ITQ-21. The results are judged on the numerical prediction of the phase's crystallinity.
We discuss the theoretical aspects of such statistical techniques, which make them
an attractive method when compared to other learning strategies for modeling the
properties of the solids. Advantages and drawbacks are highlighted. We show that such
approaches, by offering broad solutions, can reach high-level performances while offering
ease of use, comprehensibility, and control. Finally, we shed light on both the interpretation
and stability of results, which remain the main drawbacks of the majority of
machine learning methodologies when trying to retrieve knowledge from the data
treatment.

1 Introduction

Molecular sieves, and more specifically zeolites, are materials of considerable interest in gas adsorption and separation, catalysis, and for electronics and medical uses [1, 2]. Recent research work from different groups has contributed to the understanding of the synthesis mechanism, as well as to the discovery of new zeolitic structures [3 – 11]. The discovery of new structures, the enlargement of the synthesis space, and the optimization of existing materials require a considerable experimental effort. This effort can be reduced by using High-Throughput (HT) synthesis and characterization techniques [12 – 14], since the number of samples that can be processed is tremendously increased, and consequently so is the number of parameters that can be explored simultaneously. Thus, the discovery of new materials or the better coverage of a phase diagram may be strongly accelerated. Advanced strategies that aim at optimizing the retrieval of knowledge from experiments while keeping their number at a reasonable level are a critical part of the discovery and optimization processes. Numerous Machine Learning (ML) techniques have been successfully applied for modeling experimental data obtained during the exploration of multi-component materials. However, the synthesis of zeolitic materials through HT experimentation has received less attention, and fewer studies have been reported. The models make it possible to predict the properties of unsynthesized materials (so-called “virtual” solids), taking into account their expected compositions or preparation conditions as input variables.

Among the different ML techniques, Neural Networks (NNs) have often yielded the best modeling results. They have been applied to the modeling and prediction of the catalytic performance of libraries for a variety of reactions; selected examples are the water gas shift reaction [15], the oxidative dehydrogenation of ethane [16], the oxidative dehydrogenation of propane to propene [17], and the oxidation of propene to propene oxide [18]. However, NNs may suffer from overfitting and reproducibility problems, and there is therefore still a need to use or even develop other techniques. In this sense, Support Vector Machines (SVMs) can be a suitable method for overcoming the pitfalls of NNs when they occur, and a first comparison has recently been made for the design of catalysts and materials [19, 20]. In this case, even if overfitting is largely avoided, the interpretation of results remains difficult when complex kernel functions are used.

QSAR Comb. Sci. 00, 0000, No. &, 1 – 18 I 2006 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim &1&
These are not the final page numbers! ÞÞ

ML methods do not assume any parametric form of the
appropriate model to use; they are classified in the set of
distribution-free methods. Instead of starting with assump-
tions on a particular problem, ML uses a toolbox approach
in order to identify the correct model structure directly
from the available data. One of the main consequences is
that the methods typically require larger datasets than
parametric statistics. In the materials science domain, even
when using HT techniques, the number of examples remains
low (i.e., fewer than 150). This makes it difficult for
non-parametric procedures to prevent overfitting.
Since the 1990s, a large number of publications have
appeared using only such ML methods, while traditional
parametric statistics remains relatively neglected as it has
been emphasized in [21]. In [22], the authors make use of
traditional statistical analysis while examining split-plot
design, and very recently a new hybrid statistical method-
ology has been proposed [23], which combines evolution-
ary algorithm operators with a statistical criterion for opti-
mizing the structure characterization of a given search
space taking into account an a priori limited amount of ex-
periments to be conducted.
This work, which deals with data analysis techniques
and HT tools for synthesis and characterization of solid
materials, aims at showing that statistics can enable a bet-
ter interpretation of results while showing similar quality
of performances and discarding ML pitfalls. We review a
wide range of statistical methodologies and discuss the
theoretical aspects of such techniques, which make them
an attractive method for modeling the properties of solids
when compared to the other learning strategies. Advantages and drawbacks are highlighted. We show that such statistical approaches, by offering broad solutions, make it possible to reach high-level performances while offering ease of use, comprehensibility, and control. Finally, we shed light on both the interpretation and stability of results, which remain the major drawbacks of black-box learning methodologies.

All the methods are evaluated here using experimental data derived from the exploration/optimization of the synthesis of a zeolitic material (ITQ-21) [24]. ITQ-21 is a zeolite with a three-dimensional pore network containing 1.18-nm-wide cavities, each of which is accessible through six circular 0.74-nm-wide windows. The structure is shown in Figure 1a. We have chosen this system because ITQ-21 is one of the most interesting large-pore zeolites, combining the catalytic properties of USY zeolites with a higher diffusivity and a lower rate of catalyst deactivation. There is therefore an incentive for better understanding and optimizing the synthesis and chemical composition of this material. The results are judged on the numerical prediction of phase crystallinity.

Figure 1. a) Structure of the ITQ-21 zeolite. b) Standard diffractogram of the ITQ-21 zeolite.

2 Experimental Section

2.1 Synthesis Experimental Data

A large number of parameters govern the hydrothermal crystallization processes of microporous materials, determining which phases are formed and their crystallization kinetics. In this study, a detailed exploration of the hydrothermal synthesis in the system SiO2/GeO2/Al2O3/F−/H2O/N(16)-methylsparteinium (MSPT) has been performed in order to understand the effect of these factors on the growth of ITQ-21. The synthesis variables have been selected in order to cover the broadest range of the most promising parameter space based on previous experience, while keeping the total number of experiments within a feasible and reasonable range. The five synthesis variables and their respective levels are: Si/Ge = {15, 20, 25, 50}, Al/(Si + Ge) = {0.02, 0.04, 0.06}, MSPT/(Si + Ge) = {0.25, 0.5}, H2O/(Si + Ge) = {2, 5, 10}, and time (day) = {1, 5}. A sixth variable, F/(Si + Ge), is always maintained


equal to the MSPT amount to obtain a neutral pH. The distribution of experiments comes from a full factorial design with Si/Ge, H2O, (MSPT & F−)/(Si + Ge), Al/(Si + Ge), and the synthesis duration, noted t, at 4, 3, 2, 3, and 2 levels, respectively. Therefore each experiment is one combination among the 144 possible. All the experiments were carried out in a random order.

The reagents employed for gel syntheses were ammonium fluoride (98%, Aldrich), germanium oxide (99.998%, Aldrich), aluminum isopropoxide (98%, Aldrich), methylsparteine, LUDOX (AS 40 wt%, Aldrich), MilliQ water (Millipore), and N(16)-methylsparteinium hydroxide. Automated gel synthesis was done inside Teflon vials (3 mL), which were finally inserted in a multi-autoclave of 15 positions, sealed with a Teflon-lined stainless-steel tip, and subsequently allowed to crystallize at 175 °C. The samples were then washed and filtered in parallel and dried at 100 °C overnight. Finally, the samples were weighed and characterized by XRD using a multi-sample Philips X'Pert diffractometer employing Cu Kα radiation. The standard X-ray diffractogram for ITQ-21 is shown in Figure 1b. The occurrence and crystallinity were calculated by integrating the area of the characteristic peaks. For ITQ-21, the range of the angle 2θ lies between 25.7° and 26.5°.

2.2 Computational Methods

In regression problems, the objective is to estimate the value of a continuous output variable, in our case the crystallinity of a given crystalline phase, from input variables such as the synthesis parameters. All the techniques used in this study are briefly described except NNs, which have already received considerable attention; see [15 – 20] for recent applications in materials science, and [25, 26] for more technical explanations. In order to provide a fair comparison between the different techniques investigated, 28% of the data, chosen randomly among the whole available dataset composed of 144 distinct experiments, is kept unused for the evaluation of model generalization.

2.2.1 Multiple Linear Regression (MLR)

An MLR model specifies the relationship between one dependent variable y and a set of predictor variables X, so that $y = b_0 + \sum_{i=1}^{k} b_i x_i$, where the $b_i$ are the regression coefficients.

2.2.2 Generalized Linear Model (GLZ)

GLZ can be used to predict responses both for dependent variables with discrete distributions and for dependent variables which are non-linearly related to the predictors. GLZ differs from the linear model mainly in the following major aspects: (i) the distribution of the dependent variable can be explicitly non-normal, and (ii) the dependent variable values are predicted from a linear combination of predictor variables, which are connected to the dependent variable via a function called the link function. The relationship in GLZ is assumed to be $y = g(b_0 + b_1 x_1 + \dots + b_k x_k) + e$, where e stands for the error variability. The inverse function $g^{-1} = f$ is the link function, so that $f(\tilde{y}) = b_0 + \sum_{i=1}^{k} b_i x_i$, where $\tilde{y}$ stands for the expected value of y. For additional information about GLZ, see [27, 28].

2.2.3 Piecewise Linear Regression (PLR)

Taking, for illustration, a problem with only two variables (x and y) and a breakpoint at 100, this model specifies a common intercept $b_0$ and a slope that is equal to $b_1$ if y ≤ 100 and to $b_2$ otherwise: $y = b_0 + b_1 x\,(y \le 100) + b_2 x\,(y > 100)$. Stepwise model-building techniques for regression designs with a single dependent variable are described in numerous sources [29, 30].

2.2.4 SVMs as Regression Tool

A general introduction to SVMs was already presented in [20]. With ε-SV regression [31], the goal is to find a function f(x) that has at most ε deviation from the targets $y_i$ for all the training data and that is, at the same time, as flat as possible. Formally, the problem is written as a convex optimization problem.

2.2.5 Regression Trees (RTs)

Regression trees may be considered as a variant of decision trees, designed to approximate real-valued functions instead of being used for classification tasks. An RT is built through a process known as binary recursive partitioning. This is an iterative process of splitting the data into partitions, and then splitting it up further on each of the branches. In our experiments the classical C&RT [32] tree is used.

3 Results of Parametric Statistics and Prediction of ITQ-21 Phase Crystallinity

3.1. Experimental Results

Figure 2 represents the effect of each synthesis variable on ITQ-21 crystallinity. It shows that ITQ-21 is favored by some combinations of synthesis variables. The highest values of crystallinity appear in concentrated gels [H2O/(Si + Ge) < 5] with low Si/Ge ratios. The presence of the zeolite is affected by the aluminum content: the more aluminum, the lower the crystallinity. Furthermore, high MSPT/(Si + Ge) and F/(Si + Ge) ratios play positive roles in the formation of ITQ-21.
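The full factorial design of Section 2.1 and the 28% hold-out of Section 2.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary keys, the random seed, and the use of a single shuffle for both the run order and the hold-out split are choices made here, not details given by the authors.

```python
import itertools
import random

# Factor levels as listed in Section 2.1.
# F/(Si+Ge) is not a free factor: it is kept equal to MSPT/(Si+Ge).
levels = {
    "Si/Ge": [15, 20, 25, 50],
    "Al/(Si+Ge)": [0.02, 0.04, 0.06],
    "MSPT/(Si+Ge)": [0.25, 0.5],
    "H2O/(Si+Ge)": [2, 5, 10],
    "time_days": [1, 5],
}

# Full factorial design: one experiment per combination of levels.
names = list(levels)
design = [dict(zip(names, combo))
          for combo in itertools.product(*levels.values())]

# Shuffle the experiments and keep about 28% aside for testing generalization.
rng = random.Random(0)  # hypothetical seed, only for reproducibility of the sketch
order = design[:]
rng.shuffle(order)
n_test = round(0.28 * len(order))
test_set, train_set = order[:n_test], order[n_test:]

print(len(design), len(train_set), len(test_set))
```

The product 4 × 3 × 2 × 3 × 2 gives the 144 combinations mentioned in the text, of which about 40 are held out.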


Figure 2. Variation of ITQ-21 phase crystallinity with the studied variables.


Table 1. Standardized regression coefficients (β) and raw regression coefficients (B) for the MLR.

| Variable | β | B | t(138) | p-Level | Partial | Semi-partial | Tolerance | R² |
|---|---|---|---|---|---|---|---|---|
| Intercept | – | 65.152 | 9.7772 | 0.000000 | – | – | – | – |
| T | 0.094280 | 1.205 | 1.7771 | 0.077757 | 0.149574 | 0.094273 | 0.999844 | 0.000156 |
| Si/Ge | -0.403721 | -0.741 | -7.6099 | 0.000000 | -0.543687 | -0.403695 | 0.999875 | 0.000125 |
| Al/c | -0.231247 | -317.365 | -4.3587 | 0.000025 | -0.347867 | -0.231227 | 0.999829 | 0.000171 |
| MSPT/c | 0.111600 | 22.344 | 2.1036 | 0.037230 | 0.176265 | 0.111593 | 0.999870 | 0.000130 |
| H2O/c | -0.611522 | -4.725 | -11.5273 | 0.000000 | -0.700392 | -0.611513 | 0.999971 | 0.000029 |

The magnitude of β allows the relative contributions of the independent variables to the prediction of the dependent variable to be compared. The squared semi-partial correlation is an indicator of the percent of total variance uniquely accounted for by the respective independent variable, while the squared partial correlation is an indicator of the percent of residual variance accounted for after adjusting the dependent variable for all other independent variables. Gray cells indicate that the estimates are significant (α = 5%).
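The columns of Table 1 are linked by standard regression identities, which gives a quick consistency check on the reported values. The sketch below assumes the usual definitions (semi-partial = β·√tolerance, and squared partial = squared semi-partial divided by 1 − R² plus the squared semi-partial) and takes the overall R² = 0.61164 reported in Section 3.2. The function names are illustrative, not taken from the software used by the authors.

```python
import math

R2 = 0.61164  # overall R^2 of the MLR, as reported in Section 3.2

# (beta, tolerance, partial, semi-partial) copied from Table 1
table1 = {
    "T":      (0.094280, 0.999844, 0.149574, 0.094273),
    "Si/Ge":  (-0.403721, 0.999875, -0.543687, -0.403695),
    "Al/c":   (-0.231247, 0.999829, -0.347867, -0.231227),
    "MSPT/c": (0.111600, 0.999870, 0.176265, 0.111593),
    "H2O/c":  (-0.611522, 0.999971, -0.700392, -0.611513),
}

def semi_partial(beta, tolerance):
    """Correlation of the adjusted predictor with the raw dependent variable."""
    return beta * math.sqrt(tolerance)

def partial(beta, tolerance, r2):
    """Correlation of the adjusted predictor with the adjusted dependent variable."""
    sr = semi_partial(beta, tolerance)
    return sr / math.sqrt(1.0 - r2 + sr * sr)

for name, (beta, tol, pr_tab, sr_tab) in table1.items():
    sr = semi_partial(beta, tol)
    pr = partial(beta, tol, R2)
    print(f"{name:7s} semi-partial {sr:+.4f} (table {sr_tab:+.4f})  "
          f"partial {pr:+.4f} (table {pr_tab:+.4f})")
```

Because the design is a balanced factorial, the tolerances are almost exactly 1, which is why β and the semi-partial correlations nearly coincide.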

Figure 3. Prediction of ITQ-21 phase crystallinity with synthesis variables as input, for GLZs using Gamma distribution with log
link function, normal distribution with identity link function, and normal distribution with log link function as modeling approaches.

3.2. MLR and First Inspection of the Dataset

In Figure 3, the MLR is calculated with the synthesis variables as input. Real ITQ-21 phase crystallinity is indicated on the y-axis while the expected one is represented on the x-axis. The adjustment was R² = 0.61164 [F(5, 138) = 43.46; p = 0.00000]. According to this method, 61.16% of the original variability has been explained, and (1 − R²) is the residual variability. Regression coefficients are given in Table 1, where highlighted values (gray background color) are significant. As indicated by the β values, Si/Ge and H2O/(Si + Ge) (respectively, variables 3 and 6) are the most important predictors of ITQ-21 phase crystallinity, and all variables are statistically significant (p < 5%) except for the synthesis duration (variable 2). If the risk α is increased to 10%, variable 2 becomes significant. A non-significant p-value does not mean that the null hypothesis* is true. It simply means that this dataset is not strong enough to convince us that the null hypothesis is not true. Concluding that a value is not statistically significant when the null hypothesis is false is called a Type II error. For more details on this aspect see [23].

* A significance test is performed to determine if an observed value of a statistic differs enough from a hypothesized value of a parameter to draw the inference that the hypothesized value of the parameter is not the true value. The hypothesized value of the parameter is called the “null hypothesis”.

Another way of looking at the unique contributions of each independent variable is to compute the partial and semi-partial correlations. In Table 1, partial correlations are the correlations between the respective independent variable adjusted by all other variables and the dependent variable adjusted by all other variables. The semi-partial correlation is the correlation of the respective independent variable adjusted by all other variables with the raw dependent variable. The values of these partial and semi-partial correlations in Table 1 appear relatively similar and confirm the trends observed with the β values. In Table 2, the partial correlation measures the correlation between two variables that remains after partialling out one other variable (the variable itself is indicated with “–”), while the correlation coefficient does not take such control into account. It can be observed that the correlations and partial correlations between each variable and ITQ-21 crystallinity are quite similar. However, one can note that when the effect of H2O is partialled out (i.e., the H2O column of partial correlations in Table 2), the correlation between Si/Ge and ITQ-21 crystallinity changes by about ten points (from -0.40 to -0.51). Actually, a similar jump is observed for all the correlations when H2O is partialled out: in the case of a positive response (MSPT or F) the effect is increased, while negative partial correlations are decreased. Surprisingly, it seems that an increase of H2O, which has a global negative effect on ITQ-21 crystallinity, when combined with the other variables has a good effect on the negative relations and a bad one on the unique positive relation.

Table 2. Partial correlations and correlation coefficients between all variables involved in the synthesis study.

| Variables | ITQ21 – partial correlation, controlling for Time | controlling for Si/Ge | controlling for Al | controlling for MSPT or F | controlling for H2O | ITQ21 – correlation |
|---|---|---|---|---|---|---|
| Time | – | 0.10 | 0.09 | 0.09 | 0.12 | 0.09 |
| Si/Ge | -0.40 | – | -0.41 | -0.41 | -0.51 | -0.40 |
| Al | -0.23 | -0.26 | – | -0.23 | -0.29 | -0.23 |
| MSPT or F | 0.11 | 0.12 | 0.11 | – | 0.13 | 0.11 |
| H2O | -0.62 | -0.67 | -0.63 | -0.62 | – | -0.61 |

The partial correlation measures a correlation between two variables that remains after controlling for (e.g., partialling out) one or more other variables. Gray cells contain significant values at p < 0.05.

Moreover, it is shown that the three variables that have the greatest influence on the formation of ITQ-21 are H2O, Si/Ge, and Al content. For the levels chosen in the present work, the water content is the variable that has the largest influence on ITQ-21 crystallinity. This phase prefers concentrated gels with H2O/(Si + Ge) ratios lower than 5. This can also indicate that a high concentration of F− has a positive effect on crystallization. The content of Ge in the framework of ITQ-21 is a critical factor. When the content of Ge decreases in the starting gel, the rate of crystallization of ITQ-21 decreases, and for high values of Si/Ge (> 30), only small amounts of ITQ-21 (low crystallinity) are obtained with the crystallization times and temperature reported here. Finally, the other factor that is statistically interesting is the Al content. The highest values of crystallinity have been obtained at low levels of Al. The reason is that the number of framework negative charges introduced by Al, which have to be neutralized by the Organic Structure Directing Agent (OSDA), is limited, because the OSDA also has to compensate the F− located within the double four-membered rings [33], and the void volume of the ITQ-21 structure can fit only a limited number of MSPT cations. It has to be noted that H2O has the largest effect on crystallization only within the chosen ranges of variation.

The use of parametric procedures allows one to take advantage of the whole theory behind the model. However, assumptions should always be verified first, otherwise the conclusions may not be accurate. For example, in MLR, it is assumed that the residuals are normally distributed. Many tests are robust with regard to violations of this assumption. The normal probability plot of residuals gives a quick indication of whether or not violations have occurred. If the observed residuals (plotted on the x-axis of Figure 4) are normally distributed, then all values should fall onto a straight line. If the residuals are not normally distributed, then they will deviate from the line. Figure 4 shows a particular lack of fit: the data seem to form an S-shape around the line. This pattern is characteristic of cases where the dependent variable may have to be transformed through a log-transformation to pull in the tails of the distribution.

Figure 4. Normal probability plot of residuals for the ITQ-21 phase crystallinity linear model. This visualization procedure permits a quick examination of whether the assumption of normally distributed residuals is respected. The tails show an S-shape pattern.

Another important step when building models is the detection of outliers. If one experiment is clearly an outlier, then there is a tendency for the regression line to be pulled by this outlier. As mentioned before, one can say that such a deviation would be rather low compared to the consequences (overfitting) which might be observed using ML models. As a result, if the respective cases were excluded, different B coefficients would be found. Figure 5 shows the “deleted residual” statistic, which is the standardized residual for the respective case that one would obtain if the case was excluded from the analysis. Therefore, if the deleted residual is different from the standardized residual, the regression analysis may be biased by the given case. However, no such case is found in our experimental dataset and therefore the entire set is kept.

Figure 5. Residuals vs. deleted residuals plot. This technique allows outliers to be separated from the dataset when they are relatively far from the line.

Other interesting tests, such as tests for heteroscedasticity, may be investigated. Homoscedasticity is the assumption that the variability in scores for one variable is roughly the same at all values of the other variable. It is related to normality: when normality is not met, the variables are not homoscedastic but heteroscedastic. For example, the Goldfeld – Quandt test is applicable if heteroscedasticity is thought to be related to only one of the x variables. This test is of great interest, for example, if the heating system of a multi-channel reactor becomes erratic on increasing the temperature, generating additional noise.

To summarize, the MLR fits the objective variable moderately (~ 61%), and fails to prevent the fitted ITQ-21 phase crystallinity from taking negative values. Moreover, the number of false positives is very high (i.e., the gray squares on the x-axis in Figure 3), and for the other experiments, the phase crystallinity is greatly underestimated. However, such a preliminary methodology has allowed us to obtain a first idea about the modeling of the dependent variable and its correlations with the synthesis variables through the estimates. Moreover, it has been shown that this technique allows assumptions that are too often accepted without verification to be tested. Assumptions about the normality of residuals, the detection of outliers, the significance of variables, and others are of great help to the user in taking the first steps toward understanding how the underlying mechanism works. The examination of the normality assumption requires more complex methodologies, which transform the dependent variable in order to respect the hypothesis of normal distribution while preventing negative predictions. The GLZ, as an extension of the MLR, is investigated below.

3.3. Generalized Linear Model

The construction of a GLZ starts by selecting an appropriate link function and response probability distribution. Two alternatives are investigated: fitting a distribution and choosing the corresponding link function, or only transforming the dependent variable through a chosen link function. There are many potential distributions (normal, exponential, Weibull, log-normal, Gamma, etc.) that could be used as a distributional model for the data. Therefore two basic questions are addressed: (i) Does a given distributional model provide an adequate fit to the data? (ii) Does one distribution fit the data better than another distribution? The use of Goodness-of-Fit (GoF) tests provides a method to answer these two questions. The Kolmogorov – Smirnov (KS) test is chosen for the following reason: unlike the parametric t-test for independent samples, which tests differences in means in the location of two samples, the KS test is also sensitive to differences in the general shapes of the distributions in the two samples (i.e., differences in dispersion, skewness, etc.). The GoF tests confirm that either the Gamma or the log-normal distribution would provide a good model for these data. Finally, the best fit is obtained with the Gamma distribution, which is defined as $f(x) = (x/b)^{c-1} e^{-x/b} [b\,\Gamma(c)]^{-1}$, with b > 0 the scale parameter, c > 0 the so-called shape parameter, and Γ the gamma function, given by $\Gamma(a) = \int_0^{\infty} t^{a-1} e^{-t}\,dt$. Here b = 39.3 and c = 0.456, see Figure 6. The corresponding link function for such a distribution is the log function. Considering the second option proposed earlier, the normal probability plot of residuals has given an indication of the non-normal distribution of the observed residuals. Since the data follow an S-shape pattern around the line, we have supposed that the dependent variable should be transformed into a new one such as g(y) = ln(y)


plicity, and the fractional factorial design to degree 3 since


it represents an intermediary solution. The relationship be-
tween predictors and their interactive effects (e.g., two
predictors masking the effects of a third) are much more
complex. However, it can be observed that the conclusion
drawn previously about the effect of Si/Ge, and Al con-
tents when considering or not the effect of the water is
confirmed here through the inspection of the significance
of 2-way interaction effects.
One can also make statistical inference about the param-
eters using confidence intervals and hypothesis tests. The
confidence intervals for specific statistics give a range of
values around the statistic where the “true” statistic can be
expected to be located with a given level of certainty (here
the level is set 90%). Therefore it is possible to provide con-
fidence intervals for predicted values. An example is given
Figure 6. Distribution fitting of ITQ-21 phase crystallinity with for the best model found earlier, i.e., quadratic response sur-
the Gamma function. face regression model, in Figure 8. As a decreases the inter-
val will be narrower. Here are examples of the numerous
advantages allowed using such a parametric modeling.
in order to pull in the tail of the distribution. Therefore,
this modification is handled through the log link function
3.4. Piecewise Linear Regression
which will force to maintain the values within a positive
range while the distribution of the dependent variable is The slope of a function at a particular point can be com-
still supposed to be normal. Figure 3 shows the predictions puted as the first-order derivative of the function at that
of ITQ-21 crystallinity for GLZ model using Gamma dis- point. The “slope of the slope” is the second-order deriva-
tribution with log link function, normal distribution as- tive, which tells us how fast the slope is changing at the re-
sumption and log link function, and MLR (i.e., normal dis- spective point, and in which direction. The quasi-Newton
tribution assumption and identity link function). A better method, at each step, evaluates the function at different
fitting of GLZ over MLR can be observed. However, points in order to estimate the first order derivatives and
GLZ using the normal distribution and log link function second order derivatives. It uses this information to follow
remains the best. In this situation, compared to Gamma a path toward the minimum of the loss function. We have
assumption, more weight is indirectly given to non-null crystallinity values of ITQ-21 phases, and therefore the variability of response for high crystallinity values is narrower.

The previous GLZ were defined with only first-order effects, i.e., $\beta_i x_i$. However, in GLZ, more advanced configurations, such as factorial, fractional, polynomial, and quadratic models, or even some special user-defined effects, can be defined. In Figure 7, all the models are estimated and their respective predicted values of ITQ-21 crystallinity are plotted, with the corresponding effect estimates given in Table 3. The values of the parameters ($\beta_i$ and the scale parameter) in the GLZ are obtained by maximum likelihood estimation. Note that highlighted values correspond to statistically significant estimates for $\alpha = 0.05$. On the basis of the estimate values and their significance across the different model forms, we can say that the MLR does not contain enough features to capture the underlying information, and consequently all the input variables are significant. On the contrary, the full factorial design takes into account too many variables; thus, the information is spread and smoothed over the numerous estimates. Finally, the models retained are the quadratic one for its overall performance, the fractional factorial to degree 2 for its sim-

chosen the quasi-Newton method since, for most applications, it yields good performance. Other procedures that use various geometrical approaches to function minimization may be more “robust,” that is, they are less likely to converge on a local minimum and are less sensitive to “bad” starting values. However, special attention has been given to such parameters, and care was taken to ensure the reproducibility of the results. The loss function is least squares, as in many other cases. In Figure 9 the predicted values of ITQ-21 crystallinity are plotted against the observed values. It is surprising to see that this very simple method, compared to all other approaches, allows us to obtain a quasi-perfect fitting of very low or even null crystallinity values, as shown in Figure 9. The equation of the PLR model with a breakpoint at 17.4582 is

$$5.1988 - 0.1013\,t - 0.0244\,\mathrm{Si/Ge} - 24.8966\,\mathrm{Al/(Si+Ge)} + 4.0457\,\mathrm{MSPT/(Si+Ge)} - 0.5039\,\mathrm{H_2O/(Si+Ge)}$$

and

$$113.28 + 3.2668\,t - 2.1384\,\mathrm{Si/Ge} - 557.013\,\mathrm{Al/(Si+Ge)} + 26.9783\,\mathrm{MSPT/(Si+Ge)} - 8.2507\,\mathrm{H_2O/(Si+Ge)}.$$
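A minimal sketch of how the two PLR segments could be evaluated is given here. Because the breakpoint is defined on the dependent variable, the side of the breakpoint must be decided beforehand; the argument `prior_estimate` is a hypothetical input standing for the output of such a previous model. The coefficient signs used here are an assumption and should be checked against the published equations.

```python
# Piecewise linear regression (PLR) sketch. Coefficient signs are an
# assumption; check them against the published equations before reuse.
BREAKPOINT = 17.4582  # breakpoint on the dependent variable (crystallinity)

# Order: intercept, t, Si/Ge, Al/(Si+Ge), MSPT/(Si+Ge), H2O/(Si+Ge)
LOW_SEGMENT = (5.1988, -0.1013, -0.0244, -24.8966, 4.0457, -0.5039)
HIGH_SEGMENT = (113.28, 3.2668, -2.1384, -557.013, 26.9783, -8.2507)


def linear(coeffs, x):
    """Intercept plus the weighted sum of the five synthesis descriptors."""
    intercept, *slopes = coeffs
    return intercept + sum(b * xi for b, xi in zip(slopes, x))


def plr_predict(x, prior_estimate):
    """Pick the segment from a prior crystallinity estimate (hypothetical
    input, e.g. from the quadratic model), then apply that segment."""
    segment = HIGH_SEGMENT if prior_estimate > BREAKPOINT else LOW_SEGMENT
    return linear(segment, x)
```

The quality of the final prediction therefore depends on the model that supplies `prior_estimate`, since a sample sent to the wrong segment is evaluated with the wrong equation.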

&8& I 2006 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.qcs.wiley-vch.de QSAR Comb. Sci. 00, 0000, No. &, 1 – 18

Table 3. Estimates of GLZ models using different configurations of effects. Gray cells contain significant values at p < 0.05.
Model columns: MLR = multiple regression; FF = full-factorial; Poly2 = polynomial degree 2; Quad = quadratic response surface regression; FrF3 = fractional factorial to degree 3; FrF2 = fractional factorial to degree 2.

Intercept and first order (main effects)
Effect          MLR        FF         Poly2      Quad       FrF3       FrF2
Intercept B0    5.8659     5.446      6.9819     3.4273     2.535      5.6146
t               0.0552     −0.324     0.0485     −0.0346    −0.004     −0.0240
Si/Ge           −0.0672    −0.032     0.0255     0.0191     0.051      −0.0256
Al/c            −13.7012   61.500     −11.3677   −1.3192    143.024    −10.6196
MSPT/c          0.7943     0.482      −11.1976   3.2865     4.124      −0.6261
H2O/c           −0.3214    −0.247     0.3193     0.7122     0.665      −0.0037

Two-way interaction
Effect              FF          Quad       FrF3        FrF2
t.(Si/Ge)           −0.006      0.0016     −0.002      0.0023
t.Al/c              −6.832      0.6027     −8.681      0.2978
t.MSPT/c            0.655       −0.0004    0.501       0.0192
t.H2O/c             0.076       0.0190     0.060       0.0085
(Si/Ge).Al/c        −2.568      −0.1918    −3.816      −0.1389
(Si/Ge).MSPT/c      −0.103      −0.0327    −0.129      −0.0242
(Si/Ge).H2O/c       0.003       33.8077    −0.007      −0.0171
(Al/c).MSPT/c       −162.920    −0.0270    −225.243    29.0218
(Al/c).H2O/c        −27.895     −9.0908    −49.030     −6.5146
(H2O/c).MSPT/c      −0.583      0.4554     −1.037      0.3316

Three-way interaction
Effect                        FF         FrF3
t.(Si/Ge).Al/c                1.203      0.379
t.(Si/Ge).MSPT/c              0.028      −0.011
t.(Si/Ge).H2O/c               0.009      −0.002
t.(Al/c).MSPT/c               18.467     −0.266
t.(Al/c).H2O/c                5.623      1.135
t.(MSPT/c).H2O/c              0.145      −0.126
(Si/Ge).Al/c.(MSPT/c)         9.108      5.223
(Si/Ge).Al/c.(H2O/c)          0.486      −0.095
(Si/Ge).MSPT/c.(H2O/c)        0.064      −0.004
(Al/c).MSPT/c.(H2O/c)         100.063    80.883
{4...5}-way                   ...        ...

Second order
Effect          Poly2       Quad
t^2             0.0000      0.0000
(Si/Ge)^2       −0.0002     −0.0003
(Al/c)^2        −33.0816    −65.1404
(MSPT/c)^2      15.6960     −5.3025
(H2O/c)^2       −0.0935     −0.0751

Scale
                MLR        FF         Poly2      Quad       FrF3       FrF2
Scale           10.2652    7.724      9.8454     7.5255     7.812      8.7050
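The effect configurations compared in Table 3 differ only in which products of the five synthesis variables enter the linear predictor. The following sketch, in plain Python, enumerates those terms and builds one row of the corresponding design matrix. The function names and the dictionary layout are illustrative assumptions, not the software actually used for the estimation.

```python
from itertools import combinations

FACTORS = ["t", "Si/Ge", "Al/c", "MSPT/c", "H2O/c"]


def effect_terms(factors, max_interaction=1, squares=False):
    """List the model terms (tuples of factor names) of a configuration."""
    terms = [(f,) for f in factors]                    # first-order effects
    for order in range(2, max_interaction + 1):
        terms += list(combinations(factors, order))    # k-way interactions
    if squares:
        terms += [(f, f) for f in factors]             # second-order effects
    return terms


def design_row(x, terms):
    """x maps factor name -> value; returns [1, product for each term]."""
    row = [1.0]                                        # intercept column
    for term in terms:
        value = 1.0
        for f in term:
            value *= x[f]
        row.append(value)
    return row


CONFIGURATIONS = {
    "multiple regression": effect_terms(FACTORS),
    "polynomial degree 2": effect_terms(FACTORS, squares=True),
    "fractional factorial to degree 2": effect_terms(FACTORS, 2),
    "quadratic response surface": effect_terms(FACTORS, 2, squares=True),
    "fractional factorial to degree 3": effect_terms(FACTORS, 3),
    "full factorial": effect_terms(FACTORS, 5),
}
```

With five factors this gives 5, 10, 15, 20, 25, and 31 terms, respectively (intercept excluded), which illustrates why the information is spread over many estimates in the full factorial model.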

One has to note that the breakpoint is defined on the dependent variable; therefore, in order to assign a value to a new experiment, it must first be determined on which side of the breakpoint the dependent variable will fall. A previous model can be used for this purpose, or a classification algorithm with a two-class system defined by the threshold (i.e., the breakpoint). The final PLR efficiency therefore depends on such a previous estimation. A quick classification using the quadratic model misclassified only six experiments.

4 Results of Non-parametric Approaches and Prediction of ITQ-21 Phase Crystallinity

Having previously estimated the distribution of the data collected in the ITQ-21 study, the predictions of the previous parametric statistics are compared with those of NNs, SVMs, and RTs. For each ML approach, the whole dataset, which contains 144 data points, is divided into three different sets, namely “training,” “selection,” and “test,” with 64, 40, and 40 individuals, respectively, in order to avoid overfitting. Thus, the test set represents 28% of the entire dataset, as mentioned before.

4.1 Comparison and Performance Assessments

As in the case of traditional MLR models, fitted GLZ can be summarized through statistics such as parameter estimates, their standard errors, and GoF statistics. Here, different statistics have been calculated: the correlation coefficient $r$ between the predicted and observed output values, the coefficient of determination ($R^2$, Eq. 3), the adjusted $R^2$ ($R^2_{adj}$, Eq. 4), the standard deviation (Eq. 1) of the target output variable ($s_y$), and the standard deviation of errors for the output variable ($s_e$). $r$ (Eq. 2) represents the linear relationship between two variables. A perfect prediction will have a correlation coefficient of 1. A correlation of 1 does not necessarily indicate a perfect prediction (only a prediction which is perfectly linearly correlated with the actual outputs), although in practice the correlation coefficient is a good indicator of performance. It also provides a simple and familiar way to compare the performance of statistical and ML methods. In Eqs. 1–4, formulas are given for each statistic, with $n$ the amount of data and $p$ the number of predictors. Adding more independent variables to a model can only increase $R^2$. Since the number of variables selected by the NN is different from the one used in the other approaches, $R^2_{adj}$ has also been used.

$$s = \sqrt{\frac{1}{n}\sum_i \left(x_i - \bar{x}\right)^2} \tag{1}$$

$$r = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{\left[n\sum_i x_i^2 - \left(\sum_i x_i\right)^2\right]^{1/2}\left[n\sum_i y_i^2 - \left(\sum_i y_i\right)^2\right]^{1/2}} \tag{2}$$

$$R^2 = 1 - \frac{\sum_i \left(y_i - \tilde{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2} \tag{3}$$

$$R^2_{adj} = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1} \tag{4}$$

Figure 7. ITQ-21 phase crystallinity fitting using GLZs; MLR is given as a reference.

4.2 Performances of Neural Networks, Regression Trees, and SVMs

The most common NN architectures have outputs in a limited range (e.g., 0–1 for the logistic activation function we use). When the desired output lies in such a range, this is of interest for classification problems, as has been investigated [15]. However, for regression problems there is clearly an issue to be resolved, and some of the consequences are quite subtle. A scaling algorithm can be applied to ensure that the network's output will be in a sensible range. The simplest scaling function finds the minimum and maximum values of a variable in the training data, and performs a linear transformation to convert the values into the target range. The network's output will therefore be constrained to lie within this range. However, this raises the problem of extrapolation to new materials outside the range defined by the training cases. For a fair simulation of the prediction of new materials' crystallinity, one has to consider that the expected values can reach levels lower than the worst experiment or higher than the best case previously seen in the current dataset. Thus, we have chosen to always rescale the training data within the range [0–0.9]. The lower bound is kept at zero because a crystallinity lower than zero (i.e., amorphous material) cannot be attained. However, it may be possible to obtain a more crystalline ITQ-21 sample than the one obtained up to now. The 100% crystallized material may not belong to the training set (e.g., random selection of the training set), and the 100% crystallinity has been arbitrarily defined by the best zeolite found in our experimentation. Nevertheless, a new synthesis could achieve an even better crystallized sample.

Figure 8. Confidence intervals of predicted values for the best GLZ models. Three different α values are considered (i.e., 10, 5, and 1%).

In all the cases, NNs of the Multi-Layer Perceptron (MLP) type and SVMs using the RBF kernel have reached the best performances. In Table 4, the best NN model for the prediction of ITQ-21 crystallinity is shown. Two points have to be underlined considering the performance assessment of NNs. (i) The work required to obtain and select the best NN is by far more time-consuming than for the other non-parametric approaches. Considerable attention has been given to NNs due to the high variability of the results we have obtained. Numerous architectures, activation functions, and other parameters have been tested. Several NN models have been discarded due to the great difference in performance between the training/selection and the test sets, indicating a clear overfitting phenomenon. (ii) Since a feature selection algorithm was combined with the NN, some of the first selected “good” networks are composed of very few input variables. Considering the synthesis of zeolites, it can be shown that none of the variables we have used is without effect, and none can be eliminated from the synthesis steps. However, the selection of input variables allows the elimination of variables whose information the network did not manage to use after exploiting the others. Moreover, reducing the pool of input variables inherently reduces the potential for extrapolation when using a broader range for the synthesis variables, since the role of the discarded variables could emerge. Both $R^2$ and corrected $R^2$ have been given, although the use of the latter can be questioned for the above reasons. It has to be noted that such a feature selection mechanism could also have been used for SVMs or regression trees. Conversely, the stability of these methodologies is usually better, partially due to their very small number of parameters compared to the numerous ones contained in the NN architecture, as will be shown later.

Figure 9. ITQ-21 phase crystallinity modeling with best GLZ, PLR, SVM, NN, and RT.
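The rescaling of the target into [0, 0.9] described in Section 4.2 can be sketched as follows. This is a minimal illustration in plain Python, assuming a simple min-max transformation fitted on the training targets only; the function names are hypothetical and do not come from the software used in the study.

```python
def fit_scaler(train_values, low=0.0, high=0.9):
    """Linear min-max map from the training range onto [low, high]."""
    v_min, v_max = min(train_values), max(train_values)
    span = v_max - v_min
    if span == 0:
        raise ValueError("training target is constant; cannot scale")

    def to_network(y):
        # crystallinity -> value presented to the network as target
        return low + (y - v_min) * (high - low) / span

    def from_network(z):
        # network output -> crystallinity
        return v_min + (z - low) * span / (high - low)

    return to_network, from_network


to_network, from_network = fit_scaler([0.0, 40.0, 100.0])
```

With this choice, the best training sample maps to 0.9, so a logistic output saturating at 1 still corresponds to a crystallinity of about 111, which leaves headroom for a sample better crystallized than any seen during training, while outputs cannot fall below zero crystallinity.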

Table 4. Description of all the selected models for the prediction of ITQ-21 phase's crystallinity.
Model                                             Correlation       R2      R2         Standard deviation
                                                  coefficient (r)           adjusted   of errors
MLR                                               0.782             0.611   0.597      15.931
GLZ (normal distribution, log link function)      0.919             0.844   0.839      10.151
Full factorial                                    0.953             0.909   0.906      7.695
Polynomial of degree 2                            0.923             0.853   0.847      9.813
Quadratic response surface regression             0.955             0.913   0.910      7.514
Fractional factorial to degree 3                  0.952             0.907   0.904      7.782
Fractional factorial to degree 2                  0.941             0.885   0.881      8.664
Piecewise linear regression                       0.962             0.925   0.923      6.978
Neural network MLP 4:4-5-1:1                      0.918             0.843   0.838      10.139
SVM radial basis function                         0.921             0.849   0.844      9.956
Regression tree                                   0.916             0.840   0.835      10.216
Black cells are used for non-parametric approaches and gray ones are the selected models. Mean of the whole dataset: 17.458.
Standard deviation of the whole dataset: 25.565.
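The statistics reported in Table 4 follow Eqs. 1–4 and can be computed as in the sketch below, written in plain Python. The population form of the standard deviation (division by the number of data) is assumed from Eq. 1, and the value p = 5 used in the usage note is an assumption (the five first-order synthesis variables).

```python
import math


def pearson_r(x, y):
    """Eq. 2: linear correlation between two series."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    )


def r_squared(observed, predicted):
    """Eq. 3: coefficient of determination."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Eq. 4: R2 penalized by the number of predictors p."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def std_dev(values):
    """Eq. 1: standard deviation (population form)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
```

For example, `adjusted_r_squared(0.611, 144, 5)` rounds to 0.597, the value listed for the MLR. The functions also illustrate why a correlation of 1 is not a perfect prediction: for observed values [1, 2, 3] and predictions [2, 4, 6], `pearson_r` returns 1 while `r_squared` is negative.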


Figure 10. Final Regression tree for prediction of ITQ-21 phase crystallinity after additional user pruning.

Figure 9 shows the predictions of ITQ-21 crystallinity for the given NN. Since the effect of the synthesis variable named “Time” is rather low (as indicated earlier), the NN has removed it from the input parameters. The number of false positives is much higher for NN-MLP and SVM-RBF (radial basis function) than for all other techniques. The SVM-RBF is the best among the non-parametric approaches considering the overall criteria given in Table 4, but, on the other hand, numerous negative crystallinity values can be observed. A k-fold (k = 10) Cross-Validation (CV) has been utilized for optimizing the capacity (C) and epsilon (ε) at the same time. For C = 10, gamma (γ) has been set at 0.2, and ε = 0.1. The Regression Tree (RT) produces accurate predictions based on a few logical if–then conditions. A ten-fold CV is used for pruning. The original version of the RT was composed of 13 non-terminal nodes and 14 (terminal) leaves. In Figure 10, some terminals have been pruned again (the leaves containing fewer than 20 individuals are removed), making the reading easier. It can be observed through the gray-scale rectangles that the RT succeeds in isolating the different levels of ITQ-21 crystallinity. It is interesting to observe that the position of the rectangles gives an intuitive classification of the samples studied, allowing an easy visualization of the crystallinity and the synthesis factors. The leaves in the left branches present an increasing crystallinity (dark rectangles) going down the splits, while in the right branches the crystallinity decreases (bright rectangles). For each leaf, the mean (μ) of the samples is indicated. Figure 10 shows that the most crystalline samples are obtained for concentrated gels (H2O < 3.5) and Si/Ge < 36. These results are in agreement with the conclusions obtained in Section 2 and with the MLR (Section 3.2). Moreover, previous work [33] suggests that ITQ-21 could be obtained for a Si/Ge ratio of 25, but not for 50.

SVM-RBF obtains the best results without requiring heavy pre- and post-treatments. However, the RT may be preferred because the RBF kernel makes the model interpretation more difficult than with simpler kernels. Such a kernel has been chosen to give the SVM the same chance as the NN.
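The ten-fold cross-validation used for tuning the SVM parameters and for pruning the RT relies on a partition of the data into ten folds, each used once for validation. A minimal sketch of such a partition in plain Python is given below; the function name and the fixed seed are illustrative assumptions, and the model fitting itself is left to whichever SVM or tree implementation is used.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, valid_idx) pairs; each sample is validated once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffling
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i, valid in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, valid


splits = list(k_fold_indices(144, k=10))
```

Each candidate pair (C, ε) would then be scored by averaging its validation error over the ten folds, and the pair with the lowest average retained.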


However, simpler kernels such as polynomials of degrees 2 and 3 have also been tested. Results for a 30% test set and 10-fold CV are the following:
{Degree, C, ε, γ, coeff.} = {3, 10, 0.1, 0.3} gives r = 0.915 (training), r = 0.85 (test)
{Degree, C, ε, γ, coeff.} = {2, 10, 0.1, 0.3} gives r = 0.920 (training), r = 0.87 (test)
These results confirm what was concluded through the MLR and GLZ examination, i.e., the use of second-degree effects is useful while the integration of higher-order effects is not. The difference in performance between the RBF and such a simple kernel is very low, and once again it discards all forms of more complex models in this study. Finally, both the RT and the SVM with a polynomial kernel of degree 2 are selected. The RT should be used for a quick overview of the system, while the SVM allows a contour plot to be drawn precisely.

5 Advanced Analysis of Methodologies and Interpretation

In order to show the difficulties encountered using NNs, and also to better justify the selection of the SVM methodology over NNs in this study, both techniques are compared based on the stability/variability of their performances depending on the amount of data available for the training step. We have chosen to assess performance generalization for only these two approaches since the SVM has been qualified as a more stable technique compared to the NN, and all other techniques used are either far less likely to overfit the data or can easily be combined with a fast post-processing treatment, as for the RT. Through this analysis it is also investigated whether decreasing the size of the test set, in order to allocate more “resources” to the training part, makes the variability of performance higher and thus increases the risk of false acceptance of the model.

The dataset is divided into two parts: training (Tr) and test (Te). Their respective sizes vary and the fitting capacity is assessed. The relative amount of data in the test subset is set to either 70 or 30% of the whole available dataset. Five different samplings for each distribution into training and test are presented for both NNs and SVMs. The frequencies of responses have been checked in order to have a minimum number of each type of experiment in both training and test sets, i.e., low and high ITQ-21 crystallinity values. This makes it possible to assess the performance of the modeling on three different ranges of crystallinity: {0, ]0...50], ]50...100]}. Table 5 gives the mean and standard deviation of each sample taking into account the ranges, while Tables 6 and 7 indicate the statistics for the predicted values. The best solution using RBF and MLP (three or four layers) is kept for NNs, while the SVM makes use of only the RBF model form. Considering NNs, the best network found is kept for each sampling after elimination of the networks that show a clear overfitting through the difference between training and test performances, as done before. However, set #2 in Table 6 shows an obvious estimation failure. Only one input has been kept; consequently, the range of maximum interest (i.e., “> 50”) is greatly underestimated while amorphous materials are overestimated. Obviously, such an NN has been trapped in a local optimum. In Tables 6–9, the gray cells indicate where a given failure has been encountered, while the black cells indicate the selected models. The different kinds of failure are underlined below. It has to be pointed out that the following criteria are not independent, and therefore only the most significant criteria of the failure are shown in gray. The first criteria are based on the traditional statistics listed in Tables 8 and 9.

(1) High performance drop from training to test sets, such as the NN using the MLP technique tested with set #1 (11.4% = 98.1 − 86.7, Table 8), which represents the greatest fall, but also set #5 for NN-MLP in Table 8.

(2) Relatively low performance compared to all other models of the same type. Therefore, set #4 for NN-RBF in Table 9 is discarded.

(3) Relatively high error standard deviation. One has to note that even if a prediction error mean extremely close to zero is expected, it is possible to get a zero prediction error mean simply by estimating the averaged training data value, without any recourse to the input variables or any advanced methodologies at all. Thus, the error standard deviation is of great interest in order not to accept falsely good models such as NN-RBF tested on set #4 in Tables 8 and 9. NN-RBF with set #2 in Table 8 could have been discarded directly on the basis of the error mean. Note that if the error standard deviation is no better than the training data standard deviation, then the technique has performed no better than a simple mean estimator.

(4) A weak (i.e., non-robust) architecture. Not only the NN-MLP tested on set #2 in Table 8, but also the NN-MLP and NN-RBF tested on set #5 in Table 9 possess a very low number of inputs, indicating that the networks did not manage to use the information brought by all variables. In Tables 6 and 7, predictions are followed on separate ranges of crystallinity.

(5) Difference between observed and predicted mean of ITQ-21 crystallinity. This is generally observed for high values of crystallinity (sets #2, #3, #5 for NN-MLP, and sets #4 and #5 for NN-RBF in Table 6, and sets #1, #3, #4 with NN-RBF in Table 7). This is due to the relatively low number of experiments belonging to the range “> 50.” On the other hand, in set #2 for both NN-RBF and NN-MLP in Table 6, a very bad recognition of amorphous materials is detected, as well as for set #1 for NN-MLP. The prediction for medium crystallized materials is overestimated in set #1 for NN-MLP, making the margin between the medium and highly crystallized zeolites very narrow.

(6) Overfitting phenomenon is also detected through the high standard deviation of the predicted ITQ-21 crystal-


Table 5. Two different partitions of the whole available experimental set of data are described. Real ITQ-21 crystallinity of the test sets: mean (SD).
Test set percentage (%)    Range         Set #1               Set #2               Set #3               Set #4               Set #5
(mean nominal value)
30 (44.6)                  0             0 (0)                0 (0)                0 (0)                0 (0)                0 (0)
                           ]0...50]      30.8362 (15.6366)    17.0754 (13.0485)    28.2170 (17.2106)    28.4415 (15.2259)    25.4530 (17.5253)
                           ]50...100]    68.1282 (13.1038)    69.9754 (12.9964)    72.1727 (9.4963)     63.9546 (3.7527)     71.8512 (15.1351)
                           Total         21.4306 (28.9117)    18.9280 (28.2568)    15.2557 (25.4654)    18.7385 (25.7839)    22.7955 (29.8045)
70 (100.8)                 0             0 (0)                0 (0)                0 (0)                0 (0)                0 (0)
                           ]0...50]      25.692 (13.3596)     23.674 (14.5605)     25.772 (14.4003)     22.415 (15.2069)     24.740 (15.085)
                           ]50...100]    65.853 (10.683)      65.077 (10.527)      66.109 (8.068)       66.822 (11.615)      69.327 (12.645)
                           Total         17.0139 (24.3044)    17.5168 (25.4006)    16.7966 (24.8081)    17.3316 (26.3934)    17.4238 (26.5437)
The first column gives the percentage and corresponding nominal mean of the test set. Then, considering a given partition, the real mean and standard deviation of each test set are given, taking into account different ranges of ITQ-21 phase crystallinity (entire set, null, ]0...50] and ]50...100]) noted in the second column.
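One way to guarantee that every crystallinity range is represented in both the training and the test set is to sample within each range separately. The sketch below, in plain Python, illustrates this idea; it is an assumed procedure for illustration only, since the partitions of Table 5 were obtained by checking the response frequencies rather than necessarily by stratified sampling.

```python
import random


def crystallinity_range(y):
    """Assign a crystallinity value to one of the three ranges."""
    if y == 0:
        return "0"
    return "]0...50]" if y <= 50 else "]50...100]"


def stratified_split(values, test_fraction, seed=0):
    """Return (train_idx, test_idx) with each range present on both sides."""
    rng = random.Random(seed)
    groups = {}
    for i, y in enumerate(values):
        groups.setdefault(crystallinity_range(y), []).append(i)
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        # keep at least one sample of the range on each side when possible
        n_test = min(max(n_test, 1), len(members) - 1) if len(members) > 1 else 0
        test += members[:n_test]
        train += members[n_test:]
    return sorted(train), sorted(test)
```

Repeating the call with five different seeds would give five samplings per test fraction, in the spirit of sets #1 to #5.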

Table 6. Mean and standard deviation of the predicted ITQ-21 crystallinity for a test set of 70% of the entire dataset. Test sets only; columns correspond to sets #1 to #5.

Neural networks, MLP
Statistic   Range         4:4-2-1:1    1:1-1-1-1:1    3:3-1-2-1:1    3:3-2-1:1    3:3-3-1:1
Mean        0             2.9502       5.1625         0.3841         2.4025       0.3139
Mean        ]0...50]      22.0063      21.1096        24.6572        26.2958      24.6209
Mean        ]50...100]    62.5251      27.1190        47.3451        60.8654      45.7861
SD          0             7.9992       6.3111         5.1928         8.8584       5.6070
SD          ]0...50]      13.5509      10.1026        16.3604        21.5982      17.2128
SD          ]50...100]    31.1547      6.7165         14.6670        17.3099      12.2742

Neural networks, RBF
Statistic   Range         3:3-9-1:1    3:3-10-1:1     3:3-9-1:1      4:4-9-1:1    4:4-10-1:1
Mean        0             0.5330       8.8558         0.8193         1.0616       1.7603
Mean        ]0...50]      30.0371      33.4153        33.7114        27.1984      24.8732
Mean        ]50...100]    58.6176      69.5698        60.7412        44.4586      40.9229
SD          0             7.6737       13.1794        7.3416         14.0595      11.2243
SD          ]0...50]      17.5384      19.2353        22.6005        12.2213      12.0779
SD          ]50...100]    24.2643      19.7210        23.1682        8.7685       11.6271

Support vector machines
Statistic   Range         RBF 1        RBF 2          RBF 3          RBF 4        RBF 5
Mean        0             1.3014       1.87664        −1.3818        0.7985       2.2881
Mean        ]0...50]      29.4619      30.8891        28.3283        29.4217      31.6706
Mean        ]50...100]    54.8544      54.5931        60.4716        60.6477      51.7803
SD          0             11.3509      14.4216        8.1337         10.7433      12.1001
SD          ]0...50]      15.1640      14.9366        15.0950        12.1353      15.8295
SD          ]50...100]    20.7396      12.4630        14.3953        13.7363      16.8359
The two statistics are given depending on both the methodologies employed and the ranges of the real ITQ-21 crystallinity.

Table 7. Mean and standard deviation of the predicted ITQ-21 crystallinity depending on both the methodologies employed and the ranges of the real ITQ-21 crystallinity. Test sets only; columns correspond to sets #1 to #5.

Neural networks, MLP
Statistic   Range         4:4-8-1:1     4:4-3-1:1     4:4-10-8-1:1    5:5-3-1:1    3:3-1-3-1:1
Mean        0             1.6462        0.4754        5.6347          0.3186       −0.3159
Mean        ]0...50]      39.6653       20.6251       24.4674         30.0032      31.9088
Mean        ]50...100]    51.9028       60.1008       61.6987         56.7048      61.3224
SD          0             2.8551        4.4862        10.0612         5.3826       4.0088
SD          ]0...50]      21.1517       15.4822       17.3254         18.4007      24.8200
SD          ]50...100]    20.4111       13.2987       26.5842         23.2101      9.8563

Neural networks, RBF
Statistic   Range         5:5-20-1:1    4:4-15-1:1    4:4-19-1:1      5:5-6-1:1    2:2-9-1:1
Mean        0             1.3549        0.1903        3.5478          3.1014       1.5366
Mean        ]0...50]      33.6393       23.2889       31.4614         29.4760      27.0582
Mean        ]50...100]    43.0530       56.7472       50.5591         41.0717      53.1617
SD          0             7.5802        7.4352        9.1270          14.8800      5.3202
SD          ]0...50]      14.3402       14.2591       20.1331         8.8716       20.6843
SD          ]50...100]    14.2721       12.0603       14.6490         9.5832       13.1921

Support vector machines
Statistic   Range         RBF 1         RBF 2         RBF 3           RBF 4        RBF 5
Mean        0             −0.3699       1.5211        2.0352          0.0455       −0.1315
Mean        ]0...50]      33.1845       24.0824       26.9391         29.1968      29.1049
Mean        ]50...100]    55.7318       59.0732       55.0579         62.3767      58.3044
SD          0             6.2682        8.2075        8.8295          7.2102       6.0857
SD          ]0...50]      17.0246       14.4054       11.9040         12.9160      16.2556
SD          ]50...100]    23.4673       13.2500       15.2324         13.7425      15.9692
Tables 6 and 7 differ in the percentage of the dataset used as test set. Here the test set represents 30% of the whole available dataset.
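The entries of Tables 6 and 7 are means and standard deviations of the predicted values, grouped by the range of the real crystallinity. A minimal sketch of this grouping in plain Python follows; the function names are illustrative, and the population form of the standard deviation is assumed from Eq. 1.

```python
import math


def range_label(y_real):
    """Range of the real (observed) crystallinity."""
    if y_real == 0:
        return "0"
    return "]0...50]" if y_real <= 50 else "]50...100]"


def stats_by_range(observed, predicted):
    """Mean and SD of the predictions within each range of observed values."""
    groups = {}
    for y_real, y_hat in zip(observed, predicted):
        groups.setdefault(range_label(y_real), []).append(y_hat)
    result = {}
    for label, preds in groups.items():
        mean = sum(preds) / len(preds)
        sd = math.sqrt(sum((p - mean) ** 2 for p in preds) / len(preds))
        result[label] = (mean, sd)
    return result
```

A predicted mean far from the real mean of the same range (Table 5), or an unusually large SD within a range, points to the failures discussed in criteria (5) and (6).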

Table 8. Statistics for each type of model, parameters, and test set: the mean error, the error standard deviation, the ratio of the prediction error standard deviation to the original output data standard deviation, noted “SD ratio,” and the Pearson correlation r for both training and test sets. Note that a lower SD ratio indicates a better prediction; its square corresponds to 1 minus the explained variance of the model. The percentage of test set used is 70.

70% of test set
Methodology               Form   Test set   Error     Error      SD       r Training      r Test     Parameters
                                            mean      SD         ratio    (+ selection)
Neural networks           MLP    Set #1     −0.0430   8.3626     0.5254   0.98191         0.86779    4:4-2-1:1
                                 Set #2     4.2375    12.8548    0.7489   0.55621         0.70001    1:1-1-1-1:1
                                 Set #3     2.8432    6.5612     0.4112   0.91686         0.91527    3:3-1-2-1:1
                                 Set #4     −1.3291   8.3130     0.4696   0.93298         0.88991    3:3-2-1:1
                                 Set #5     3.5427    7.8134     0.4612   0.96023         0.89572    3:3-3-1:1
                          RBF    Set #1     −0.6261   8.9759     0.5109   0.92194         0.87771    3:3-9-1:1
                                 Set #2     −8.3668   12.8457    0.5488   0.90154         0.86598    3:3-10-1:1
                                 Set #3     −1.8639   9.4848     0.5318   0.92553         0.87900    3:3-9-1:1
                                 Set #4     2.0565    13.1512    0.6105   0.82674         0.79190    4:4-9-1:1
                                 Set #5     3.4021    11.8073    0.5934   0.81881         0.80956    4:4-10-1:1
Support vector machines   RBF    Set #1     −0.3540   9.5929     0.5268   0.93276         0.8517     {C, ε, γ} = {10, 0.1, 0.3}
                                 Set #2     −1.2760   0.30301    0.4636   0.91980         0.8597
                                 Set #3     2.3972    8.7152     0.4634   0.93419         0.8781
                                 Set #4     0.5403    10.4749    0.4992   0.91399         0.8581
                                 Set #5     −0.4055   10.2129    0.5233   0.95273         0.8670


Table 9. Statistics for each type of model, parameters, and test set: the mean error, the error standard deviation, the ratio of the prediction error standard deviation to the original output data standard deviation, noted “SD ratio,” and the Pearson correlation r for both training and test sets. Note that a lower SD ratio indicates a better prediction; its square corresponds to 1 minus the explained variance of the model. The percentage of test set used is 30.

30% of test set
Methodology               Form   Test set   Error     Error      SD       r Training      r Test     Parameters
                                            mean      SD         ratio    (+ selection)
Neural networks           MLP    Set #1     1.7774    6.7509     0.3846   0.9309          0.9246     4:4-8-1:1
                                 Set #2     0.7065    6.6734     0.3610   0.9316          0.9336     4:4-3-1:1
                                 Set #3     −1.4581   7.7908     0.5077   0.9281          0.8625     4:4-10-8-1:1
                                 Set #4     0.7467    7.2215     0.4313   0.9520          0.9081     5:5-3-1:1
                                 Set #5     0.2503    8.2122     0.3942   0.9086          0.9197     3:3-1-3-1:1
                          RBF    Set #1     4.3222    6.9701     0.4833   0.8688          0.8891     5:5-20-1:1
                                 Set #2     0.7534    4.0173     0.4108   0.9018          0.9133     4:4-15-1:1
                                 Set #3     −0.6127   8.7909     0.5222   0.8497          0.8528     4:4-19-1:1
                                 Set #4     2.1398    11.4482    0.6267   0.7989          0.7793     5:5-6-1:1
                                 Set #5     2.5788    6.9778     0.4557   0.8225          0.8938     2:2-9-1:1
Support vector machines   RBF    Set #1     2.2549    7.5327     0.3688   0.9323          0.9297     {C, ε, γ} = {10, 0.1, 0.3}
                                 Set #2     −0.6212   8.3591     0.3921   0.9258          0.9208
                                 Set #3     0.3328    8.2413     0.4362   0.9420          0.9055
                                 Set #4     1.8904    6.8110     0.3529   0.9430          0.9452
                                 Set #5     1.6718    8.8380     0.3939   0.9207          0.9204
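Two of the failure criteria read from Tables 8 and 9, the drop in correlation from training to test and the SD ratio, reduce to simple arithmetic. The sketch below, in plain Python, illustrates them; the threshold `max_drop` is a hypothetical value chosen for illustration only, since no numerical cut-off is stated in the text.

```python
def sd_ratio(error_sd, output_sd):
    """Prediction error SD divided by the SD of the observed output."""
    return error_sd / output_sd


def performance_drop(r_train, r_test):
    """Loss of correlation when moving from training to test data."""
    return r_train - r_test


def flag_failures(r_train, r_test, error_sd, output_sd, max_drop=0.10):
    """Return the list of criteria violated (max_drop is hypothetical)."""
    flags = []
    if performance_drop(r_train, r_test) > max_drop:
        flags.append("training-to-test drop")
    if sd_ratio(error_sd, output_sd) >= 1.0:
        flags.append("no better than the mean estimator")
    return flags
```

For the NN-MLP tested on set #1 with 70% of test set, `performance_drop(0.98191, 0.86779)` gives 0.114, i.e., the 11.4% fall. As a consistency check on the SD ratio, the MLR values of Table 4 give 15.931/25.565 ≈ 0.62, whose square (about 0.39) is close to 1 − R2 = 0.389.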

linity for medium and highly crystallized materials. This is observed for the majority of NN models: set #1 for NN-MLP, sets #4 and #5 for NN-RBF in Table 6, and all NN-MLP except the one tested on set #2, and sets #3–5 for NN-RBF in Table 7. This statistic shows that NNs often fail to find a good and stable model over the whole range of ITQ-21 crystallinity.

Considering all these criteria, one can observe that NNs are much more affected than SVMs for both sizes of training sets. The number of detected failures increases as the amount of training data decreases, as expected. Relatively small test sets increase the risk of false model selection, as seen through the higher variability of the criteria. Considering the case with 70% of test set, it can be checked that the number of selected inputs for NNs is lower than in the other case. The relative lack of experiments does not allow the whole set of features to be exploited: the variability of responses is quickly associated with a few variables, and the others are considered to bring redundant or noisy information.

6 Conclusions

This work presents a broad investigation of different modeling techniques for the prediction of performances in materials science. Two types of methodologies are examined: on the one hand, the parametric strategies, and on the other hand, the non-parametric techniques. The non-parametric methods employed here are all ML algorithms, namely NNs, SVMs, and regression trees. They reach a reasonable fitting accuracy. However, considering ML techniques, the recurrent problem of overfitting had to be considered and investigated. The parametric methods are less subject to this important problem. The difference is due to the fact that the statistical models are inherently restricted in their model forms, while learning methods, and particularly NNs, possess a high flexibility and numerous setting parameters. The advanced performance assessment of NNs and SVMs has allowed us to verify such an assumption. As general advice, the parametric approach should always be employed as a reference for further work. Both approaches are compatible and the selection of a unique model is not compulsory. In contrast, we advocate the use


of the simplest methodologies at the beginning, while
more complicated but less informative techniques are reserved
for cases where complex underlying systems are detected. The
combination of multiple approaches is of great appeal and
should make it possible to reach higher and more stable
performances, taking advantage of each method's contribution,
while expected drawbacks might be eliminated or corrected
through the complementarity of each technique's strengths.
This study has pointed out the difficulty of selecting
models when any a priori or preference is given to a
particular modeling technique. Of course, such work is
problem-dependent, as the number of input variables, the
available amount of data, and the complexity of the system
investigated make a given approach more or less suitable.
Thus, such a preliminary inspection of techniques appears to
be mandatory to decrease the risk of deceptive results. Data
mining technology is increasingly applied in production
mode, which usually requires automatic analysis of data
and related results in order to draw conclusions. But
we have shown here that the selection of a given model
remains a difficult task, making the automation of the whole
combinatorial loop a problem which is too often
underestimated. Unlike traditional data mining contexts, which
deal with voluminous amounts of data, materials science is
actually characterized by a scarcity of data, owing to the
cost and time involved in conducting simulations or setting
up experimental apparatus for data collection. In such
domains, it is prudent to balance speed through automation
and the utility of data. For these reasons, human
interaction, verification, and guidance may lead to
better-quality output.
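
As an illustration only, the preliminary inspection and the combination of techniques discussed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in this work: the data are synthetic (not the ITQ-21 data), the parametric model is an ordinary least-squares fit, the non-parametric model is a k-nearest-neighbour average, and the helper names `fit_linear`, `fit_knn`, and `cv_rmse` are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Parametric model: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef

def fit_knn(X, y, k=3):
    """Non-parametric model: mean response of the k nearest training points."""
    def predict(Xn):
        # Pairwise Euclidean distances between new and training points.
        d = np.linalg.norm(Xn[:, None, :] - X[None, :, :], axis=2)
        idx = np.argsort(d, axis=1)[:, :k]
        return y[idx].mean(axis=1)
    return predict

def cv_rmse(X, y, n_folds=5, seed=0):
    """Cross-validated RMSE of each model and of their averaged prediction."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    sq_err = {"linear": [], "knn": [], "average": []}
    for test in folds:
        train = np.setdiff1d(np.arange(len(X)), test)
        p_lin = fit_linear(X[train], y[train])(X[test])
        p_knn = fit_knn(X[train], y[train])(X[test])
        preds = {"linear": p_lin, "knn": p_knn, "average": (p_lin + p_knn) / 2}
        for name, p in preds.items():
            sq_err[name].extend((p - y[test]) ** 2)
    return {name: float(np.sqrt(np.mean(e))) for name, e in sq_err.items()}

# Synthetic stand-in for a small synthesis data set (assumption, NOT ITQ-21 data).
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(60, 3))
y = 100 * X[:, 0] * (1 - X[:, 1]) + rng.normal(0, 5, size=60)

scores = cv_rmse(X, y)
print(scores)
```

With squared error, the averaged prediction cannot be worse than the worse of the two models, although it is not guaranteed to beat the better one. With scarce data, the ranking of the models may also change with the fold assignment, so repeating the comparison with several values of `seed` gives an indication of its stability.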

Acknowledgements

EU Commission (TOPCOMBI Project) is gratefully acknowledged.
[28] A. J. Dobson, An Introduction to Generalized Linear Mod-
els, Chapman & Hall, New York, 1990.
References [29] J. Stevens, Applied Multivariate Statistics for the Social Sci-
ences, Erlbaum, Hillsdale, NJ, 1986.
[30] M. S. Younger, A First Course in Linear Regression, 2nd ed,
[1] H. Lee, S. I. Zones, M. E. Davis, Nature 2003, 425, 385 – 387.
Duxbury Press, Boston, 1985.
[2] A. Corma, J. Catal. 2003, 216(1 – 2), 298 – 312.
[31] V. Vapnik, The Nature of Statistical Learning Theory,
[3] C. S. Cundy, P. A. Cox, Chem. Rev. 2003, 103, 663 – 702.
Springer, Berlin, Germany, 1995.
[4] S. I. Zones, S. J. Hwang, S. Elomari, I. Ogino, M. E. Davis,
[32] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Clas-
A. W. Burton, C. R. Chim. 2005, 8, 267 – 282.
sification and Regression Trees, Wadsworth & Brooks/Cole
[5] J. L. Paillaud, B. Harbuzaru, J. Patarin, N. Bats, Science
Advanced Books & Software, Monterey, CA, 1984.
2004, 304(5673), 990 – 992.
[33] T. Blasco, A. Corma, M. J. Diaz-Cabanas, F. Rey, J. Rius, G.
[6] K. G. Strohmaier, D. E. Vaughan, J. Am. Chem. Soc. 2003,
Sastre, J. A. Vidal-Moya, J. Am. Chem. Soc. 2004, 126,
125(51), 16035 – 16039.
13414 – 13423.
[7] R. Millini, C. Perego, L. Carluccio, G. Bellussi, D. E. Cox,
B. J. Campbell, A. K. Cheetham, Proceedings of the Interna-
