Goodness-of-Fit of Randomistic Models
Hugo Hernandez
ForsChem Research, 050030 Medellin, Colombia
hugo.hernandez@forschem.org
doi: 10.13140/RG.2.2.35386.34248
Abstract
Keywords
1. Introduction
Mathematical models are able to describe, explain and even predict the behavior of any
particular system. However, no mathematical model is perfect, because it cannot consider all
possible factors that may have an influence on the system, and because there is always
uncertainty in the information about the factors and effects that the model does consider. The
difference between the true behavior of the system and the model prediction (or estimation)
of the behavior of the system under any particular set of conditions is known as the residual
error of the model. Precisely because the residual error incorporates all unknown effects and
all uncertainties, it is clearly a random variable. A mathematical model that neglects these
residuals is basically a pure deterministic model. A mathematical model that takes the
residual error into account in its structure is usually denoted as a statistical model. However, since such a
model contains both a deterministic and a random component in its structure, it will be
denoted here as a randomistic model instead.[1]
Typically, the fitness of a mathematical model is determined by numerical criteria such as the
coefficient of determination ($R^2$),[2] the likelihood ratio ($LR$),[3] and the Akaike information
criterion ($AIC$).[4] The coefficient of determination is the complement of the relative error
observed with the model. The likelihood ratio can be used to test whether the model estimations are significantly
different from the error noise, since $R^2$ by itself does not provide an indication of the
significance of the model. On the other hand, $AIC$ also evaluates the simplicity of the model,
trying to overcome the overfitting which might occur by just maximizing $R^2$.
Although different implicit assumptions may be made during the development of the model
(e.g. normality, independence and homoscedasticity of residuals), none of the previous
performance criteria penalizes the model when those implicit assumptions are violated. Thus,
the deterministic component of the model might seem to perform very well while its random
counterpart is not satisfactory.
Random models (not to be confused with random effect models or random coefficients
models) are basically mathematical models representing the distribution of probabilities or
cumulative probabilities of a variable. They are also known as stochastic models, probability
models or probability distributions. Particularly, we are interested in “pure” random variables
where the mean or average value is exactly zero. Any randomistic variable can be transformed
into a pure random variable just by subtracting the mean value (a linear transformation). In
that case, the mean value can be interpreted as the deterministic component of the
randomistic variable.
The evaluation of the goodness-of-fit of a random model must be done by comparing the model
with experimental results. When comparing probability density functions (PDFs), it is found that
experimental PDFs are usually highly noisy and sample-size dependent,[5] and thus the quality
of the performance evaluation might be compromised. It is therefore recommended to
evaluate the performance of the model by comparing the cumulative probability model with
the experimental cumulative probability observed in the data. Three different performance
metrics have been previously proposed;[6] the third of them is a sum of squared deviations
between the model and the empirical cumulative probabilities:
$$SS_{\varepsilon} = \sum_{i=1}^{n} \varepsilon_i^2$$
(2.3)
where
$$\varepsilon_i = \begin{cases} F(x_{(i)}) - \dfrac{i}{n}, & F(x_{(i)}) > \dfrac{i}{n} \\ \dfrac{i-1}{n} - F(x_{(i)}), & F(x_{(i)}) < \dfrac{i-1}{n} \\ 0, & \text{otherwise} \end{cases}$$
(2.4)
$F$ represents the cumulative probability function of the random variable $X$, $x_{(i)}$ are the
experimental values of the variable sorted in ascending order, $i$ is the ascending rank of the experimental value $x_{(i)}$, and $n$
is the total number of experimental data.
Please notice that the third metric (Eq. 2.3) involves a sum of squares, and therefore it is
possible to define an analogue determination coefficient for the random model as:
$$R^2_{rand} = 1 - \frac{SS_{\varepsilon}}{SS_{max}}$$
(2.5)
where
$$SS_{max} = \sum_{i=1}^{n} \left(\frac{2i-n-1}{2n}\right)^2 = \frac{n^2-1}{12n}$$
(2.6)
thus, the coefficient of determination of the random model becomes:
$$R^2_{rand} = 1 - \left(\frac{12n}{n^2-1}\right) \sum_{i=1}^{n} \varepsilon_i^2$$
(2.7)
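As an illustration of how Eqs. (2.3) to (2.7) operate in practice, the following Python sketch mirrors the logic of the R implementation given in the Appendix (Python is used here only for illustration; the sample residuals are hypothetical):

```python
# Sketch: random-model determination coefficient R2_rand of Eq. (2.7),
# assuming a zero-mean normal random model for the residuals.
from statistics import NormalDist

def rand_r2(residuals, s_e):
    """R2_rand of Eq. (2.7) for a normal random model with std. dev. s_e."""
    x = sorted(residuals)
    n = len(x)
    model = NormalDist(0.0, s_e)
    ss = 0.0
    for i, xi in enumerate(x, start=1):
        phi = model.cdf(xi)
        # Eq. (2.4): deviation of the cumulative probability from the band [(i-1)/n, i/n]
        eps = max(0.0, phi - i / n, (i - 1) / n - phi)
        ss += eps * eps
    return 1.0 - (12.0 * n / (n * n - 1.0)) * ss

# Residuals placed exactly at the mid-band normal quantiles give a perfect fit:
n = 10
q = NormalDist()
perfect = [q.inv_cdf((2 * i - 1) / (2 * n)) for i in range(1, n + 1)]
print(rand_r2(perfect, 1.0))  # ~1.0
```

Note that, unlike the deterministic $R^2$, $R^2_{rand}$ penalizes any systematic departure of the empirical cumulative probabilities from the assumed distribution, even when the residuals are small.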
Let us now consider a general randomistic model describing the response variable $Y$ in terms of
a set of predictor variables $X_1, X_2, \ldots, X_m$:
$$Y = f(X_1, X_2, \ldots, X_m) + E$$
(3.1)
where $f$ represents any arbitrary deterministic model, and $E$ represents a pure (zero-mean)
random model, described by the following PDF:
$$\rho_E(\varepsilon) = g(\varepsilon)$$
(3.2)
where $g$ is any arbitrary positive function of $\varepsilon$ (realizations of the random model $E$), such that:
$$\int_{-\infty}^{\infty} g(\varepsilon)\, d\varepsilon = 1$$
(3.3)
$$E(E) = \int_{-\infty}^{\infty} \varepsilon\, g(\varepsilon)\, d\varepsilon = 0$$
(3.4)
$$Var(E) = \int_{-\infty}^{\infty} \varepsilon^2\, g(\varepsilon)\, d\varepsilon = \sigma_{\varepsilon}^2$$
(3.5)
$E(\cdot)$ represents the expected value (mean value) operator, $Var(\cdot)$ represents the variance operator,
and $\sigma_{\varepsilon}$ is the standard deviation of the residual error of the deterministic model.§
The determination coefficient of the deterministic model is then:
$$R^2_{det} = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} \left(y_i - \langle y \rangle\right)^2}$$
(3.6)
where
$$e_i = y_i - f(x_{1,i}, x_{2,i}, \ldots, x_{m,i})$$
(3.7)
$$\langle y \rangle = \frac{1}{n} \sum_{i=1}^{n} y_i$$
(3.8)
On the other hand, the determination coefficient of the random model is (from Eq. 2.7):
$$R^2_{rand} = 1 - \left(\frac{12n}{n^2-1}\right) \sum_{i=1}^{n} \varepsilon_i^2$$
(3.9)
where
$$\varepsilon_i = \begin{cases} F_E(e_{(i)}) - \dfrac{i}{n}, & F_E(e_{(i)}) > \dfrac{i}{n} \\ \dfrac{i-1}{n} - F_E(e_{(i)}), & F_E(e_{(i)}) < \dfrac{i-1}{n} \\ 0, & \text{otherwise} \end{cases}$$
(3.10)
§
Please notice that only when $E(E) = 0$, $Var(E) = E(E^2)$.
and $F_E$ represents the cumulative probability function of the residuals:
$$F_E(e) = \int_{-\infty}^{e} g(u)\, du$$
(3.11)
Particularly, if a normal random model is assumed for the distribution of the residuals, then:
$$g(\varepsilon) = \frac{1}{\sqrt{2\pi}\,\sigma_{\varepsilon}}\, e^{-\varepsilon^2 / 2\sigma_{\varepsilon}^2}$$
(3.12)
Since both models (deterministic and random) are important components of the
randomistic model, the goodness-of-fit (determination coefficient) of the randomistic model
can be expressed as follows (as proposed in [7]):
$$R^2 = R^2_{det} + R^2_{rand} - R^2_{det}\, R^2_{rand}$$
(3.13)
Eq. (3.13) indicates that the randomistic model has a good fit as long as at least one of its
components (deterministic or random) is good. For example, $R^2_{det} = 0.8$ and $R^2_{rand} = 0.9$ yield
$R^2 = 0.8 + 0.9 - 0.72 = 0.98$. On the other hand, if both models are poor, then the
fitness of the randomistic model will also be poor.
Goodness-of-fit metrics are useful for selecting and identifying the best model for describing a
particular system. According to Eq. (3.13), it is equally good to describe the system either with a
good deterministic model or with a good random model. However, for practical purposes, it is
desirable to reduce the uncertainty of the estimations of a good model; that is, minimizing the
estimated standard deviation of the residuals $s_E$ while maximizing $R^2$. This, however, might lead to overfitting of the model.
Overfitting will occur when the standard deviation of the estimation error is smaller than the
natural variance of the system ($\sigma_{nat}^2$). Such natural variance is caused by experimental
measurement uncertainties as well as by additional variations due to uncontrolled, unmeasured
factors influencing the response variable. If such natural variance is known, then the following
constraint should be present in the optimization problem:
$$s_E \geq \sigma_{nat}$$
(4.1)
On the other hand, since the proposed optimization problem is multiobjective (minimizing $s_E$
while maximizing $R^2$), a suitable multiobjective optimization method should be used for
solving the problem. One of those methods consists of transforming the problem into a single-objective
optimization. One possible single-objective constrained optimization problem is the
following:
$$\max \; P = \frac{R^2}{s_E^2} \quad \text{subject to} \quad s_E \geq \sigma_{nat}$$
(4.2)
If only measurement uncertainties in the response variable are considered for the natural
variation $\sigma_{nat}$, then, assuming a uniform distribution of the uncertain measured value:
$$\sigma_{nat} = \frac{\delta_Y}{\sqrt{12}}$$
(4.3)
where $\delta_Y$ is the resolution of the system used for measuring the response variable $Y$.
The effect of measurement uncertainties in the predictor variables on the natural variation will
depend on the specific deterministic model considered.
Particularly, if a dimensionless objective function is required for comparison purposes, then the
optimization problem can be expressed as:
$$\max \; P_r = \frac{R^2 \left(\max(y) - \min(y)\right)^2}{s_E^2} \quad \text{subject to} \quad s_E \geq \sigma_{nat}$$
(4.4)
where $\max(y)$ and $\min(y)$ represent the maximum and minimum observations found in the
measured response variable.
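The constraint and the dimensionless performance function can be sketched in a few lines of Python (shown for illustration; the resolution and sample values below are assumptions, not data from this paper):

```python
# Sketch: natural-variation bound (Eqs. 4.1, 4.3) and dimensionless
# performance function (Eq. 4.4). All numeric values are illustrative.
import math

def natural_sigma(delta_y):
    # Uniform measurement uncertainty over one resolution step delta_y
    return delta_y / math.sqrt(12)

def rel_performance(r2, s_e, y):
    # P_r = R^2 * (max(y) - min(y))^2 / s_E^2  (dimensionless)
    return r2 * (max(y) - min(y)) ** 2 / s_e ** 2

y = [1.1, 2.6, 5.4, 7.3, 13.1]        # hypothetical measured responses
s_e, r2, delta_y = 1.6, 0.7, 0.1      # hypothetical fit results and resolution
assert s_e >= natural_sigma(delta_y)  # constraint (4.1) is satisfied
print(f"P_r = {rel_performance(r2, s_e, y):.2f}")
```

Dividing by the squared range of the observations makes models fitted to differently scaled responses directly comparable, which is exploited in the multi-response example of Section 5.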
5. Examples
Whitman et al. [8] reported a correlation between the coloration of the nose of adult lions and
their age. Table 1 shows reported data of the proportion of black coloration of the nose of 32
adult male lions (from Serengeti and Ngorongoro, Tanzania) and their corresponding age.
Table 1. Nose coloration and age of adult male lions in Tanzania. Data from [9]

%black  Age (yr)   %black  Age (yr)   %black  Age (yr)   %black  Age (yr)
  21      1.1        23      2.4        30      4.3        48      7.3
  14      1.5        22      2.1        42      3.8        44      7.3
  11      1.9        20      1.9        43      4.2        34      7.8
  13      2.2        17      1.9        59      5.4        37      7.1
  12      2.6        15      1.9        60      5.8        34      7.1
  13      3.2        27      1.9        72      6.0        74     13.1
  12      3.2        26      2.8        29      3.4        79      8.8
  18      2.9        21      3.6        10      4.0        51      5.4
( ) ( )
(5.1)
The deterministic component of model (5.1) presents a determination coefficient (from Eq. 3.6)
. On the other hand, the random component of model (5.1) presents
a determination coefficient (from Eq. 3.9) . Thus, the
overall determination coefficient of the full randomistic model (5.1) is (from Eq. 3.13):
.
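The deterministic component of a linear model for this data can be reproduced directly from Table 1 with a short Python sketch (shown instead of the Appendix R code for brevity; only the least-squares fit and Eq. (3.6) are computed, not the full randomistic fit):

```python
# Sketch: least-squares fit of age vs. nose coloration (Table 1) and the
# deterministic determination coefficient R2_det of Eq. (3.6). Pure Python.
black = [21, 14, 11, 13, 12, 13, 12, 18, 23, 22, 20, 17, 15, 27, 26, 21,
         30, 42, 43, 59, 60, 72, 29, 10, 48, 44, 34, 37, 34, 74, 79, 51]
age = [1.1, 1.5, 1.9, 2.2, 2.6, 3.2, 3.2, 2.9, 2.4, 2.1, 1.9, 1.9, 1.9, 1.9, 2.8, 3.6,
       4.3, 3.8, 4.2, 5.4, 5.8, 6.0, 3.4, 4.0, 7.3, 7.3, 7.8, 7.1, 7.1, 13.1, 8.8, 5.4]
n = len(black)
mx, my = sum(black) / n, sum(age) / n
b1 = (sum((x - mx) * (y - my) for x, y in zip(black, age))
      / sum((x - mx) ** 2 for x in black))        # slope
b0 = my - b1 * mx                                  # intercept
res = [y - (b0 + b1 * x) for x, y in zip(black, age)]
r2_det = 1 - sum(e * e for e in res) / sum((y - my) ** 2 for y in age)
print(f"age ≈ {b0:.2f} + {b1:.3f}·(%black), R2_det = {r2_det:.2f}")
```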
On the other hand, let us compare the results obtained using a constant-average value model
of the data. Such a model will simply be expressed as:
( )
(5.2)
**
By performing the least-squares linear regression, the normality of the residuals is assumed.
Figure 1. Age estimation for adult lions from nose coloration, using the randomistic model (5.1).
Blue data points: Data reported in [9]. Solid red line: Prediction of the deterministic
component. Gray dotted lines: Represent the random component of the model. Green dashed
lines: 99% confidence prediction limits of the randomistic model.
Figure 2. Age estimation for adult lions from nose coloration, using the randomistic model (5.2).
Blue data points: Data reported in [9]. Solid red line: Prediction of the deterministic
component. Gray dotted lines: Represent the random component of the model. Green dashed
lines: 99% confidence prediction limits of the randomistic model.
It is also possible to determine the optimal randomistic model by solving the optimization
problem presented in Eq. (4.2). Considering a linear deterministic model with a normal random
model, the best randomistic model coincides with the linear regression model given in Eq. (5.1).
Table 2 summarizes the results obtained considering the multi-objective function defined in Eq.
(4.2), clearly demonstrating the superiority of the linear regression model.
Finally, the resolution in the determination of the age of the lions seems to be (according to the
data available) . Thus, the natural standard deviation due to measurement error is
, way below the standard deviations of the model residuals obtained in both
cases. Particularly for the linear regression model, if uncertainty in the determination of the
nose coloration is considered (using a resolution of 1% black), then:
( )
√
(5.3)
still below the standard deviation of the residuals for both models, satisfying constraint (4.1).
For the next example, let us consider some data reported for the average annual precipitation
between 1961 and 1990 at different locations in Central Italy.[11] The data are presented in Figure
3. The purpose of this example is to compare different polynomial regression models for fitting
the data. Table 3 summarizes the goodness-of-fit and multi-objective function performance
obtained for different randomistic models, each consisting of a polynomial deterministic
model paired with a normal random model. The deterministic models obtained by least-squares
regression, from zero degree to 5th degree polynomials, are graphically compared in
Figure 4.
Figure 3. Average annual precipitation 1961-1990 in different locations of Central Italy vs.
distance from the sea. Slight differences are expected with respect to the original data as a
result of data extraction from the plot reported in [11].
††
This model corresponds to a normal random model.
The standard deviations of the model residuals were estimated using the following equation:
$$s_E = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - (d+1)}}$$
(5.4)
where $e_i$ represents the model residuals, and $d$ is the degree of the polynomial. In other
words, $n - (d+1)$ represents the degrees of freedom of the model residuals.
Even though the goodness-of-fit of the deterministic model increases with the degree of the
polynomial, the goodness-of-fit of the randomistic model remains practically constant
from the first to the 5th degree. There is, however, a minimum standard deviation of the
residuals at the third degree (but still larger than the natural deviation due to measurement
errors), which leads to a maximum in the objective function. This optimum value
illustrates the parsimony principle, since increasing the complexity of the model does not
necessarily improve the performance of the randomistic model (even though the performance
of the deterministic model improves).
By increasing the degree of the polynomial model from zero to one, the objective function
clearly improves. However, increasing the degree of the polynomial from one to two yields
only a small relative increase in the objective function. A
question then arises: whether this difference in objective functions is significant or not.
Considering normal random models, it is possible to compare two objective functions using an
$F$-test of hypothesis. In this case, the objective function of a second model can be considered
to improve significantly with respect to a first model when:
$$\frac{P_2}{P_1} > F_{\alpha}$$
(5.5)
where $F_{\alpha}$ represents the critical value of the $F$ distribution with a significance level $\alpha$. Smaller
values of the significance level require larger differences between the objective functions of
the models in order to consider them different.
For this example, assuming a relatively high significance level, the minimum
relative difference between the objective functions of two models for considering them
significantly different can be computed.‡‡ Thus, there are no significant differences
between the polynomial models from first to 5th degree, even at a high significance level.
Therefore, even though the third degree polynomial model was found to be numerically the
best, it is not significantly different from the linear model (as can be observed in Figure 4).
Thus, from the parsimony principle, a linear model should be the right choice (among the
polynomial models considered), since it provides the same performance with the maximum
number of degrees of freedom for the residuals.
The third example was obtained from the calibration data of a high-performance liquid
chromatography (HPLC) method for quantifying caffeine in coffee samples. The calibration
data, extracted from plots reported by Sanchez [12], is presented in Figure 5. Clearly, a linear
calibration curve is adequate for describing the data. The model obtained by linear least-
squares regression is the following:
(5.6)
where $C$ is the concentration of caffeine in the sample in mg/L, $A$ is the peak area
obtained, and $Z$ is a standard normal random variable.
The goodness-of-fit coefficients for the different components of this randomistic model are:
‡‡
Since , ( ) , and √
(5.7)
(5.8)
(5.9)
Figure 5. Calibration data for the determination of caffeine concentration in coffee samples
using a chromatographic method. Data extracted from plots reported in [12].
Even though the performance of this model is very good, the residuals of the deterministic
model show heteroscedasticity (unequal variance) as can be seen in Figure 6:
Figure 6. Standardized residuals of the deterministic component of the randomistic model (5.6)
vs. caffeine concentration in the sample.
A simple way of handling this behavior is allowing the standard deviation of the residuals to
change linearly with the caffeine concentration:
$$\sigma_E(C) = k_1 + k_2 C$$
(5.10)
where $k_1$ and $k_2$ are constant parameters. Finding the optimum values of those parameters
by optimization, the following model is obtained:
( )
(5.11)
with goodness-of-fit coefficients:
(5.12)
(5.13)
(5.14)
Figure 7. Standardized residuals of the deterministic component of the randomistic model (5.11)
vs. caffeine concentration in the sample.
The standardized residuals for model (5.11) are presented in Figure 7. Please notice that
standardization is performed by dividing each residual by the value of the standard deviation
estimated at each caffeine concentration level. As can be seen, the heteroscedasticity of the
standardized residuals was reduced, and the performance of the model slightly improved.
Other non-linear models describing the change in standard deviation as a function of caffeine
concentration are also possible. Since models (5.6) and (5.11) present almost the same
performance,§§ the first model (with fewer parameters) would be preferred.
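The standardization step described above can be sketched as follows; the linear form of the standard deviation model and all numeric values are illustrative assumptions, not the fitted parameters of model (5.11):

```python
# Sketch: standardizing residuals under a concentration-dependent standard
# deviation sigma(C) = k1 + k2*C. All parameter and data values are hypothetical.
def sigma_model(c, k1=0.5, k2=0.02):
    return k1 + k2 * c  # residual std. dev. grows with concentration

conc = [10, 50, 100, 200, 400]           # hypothetical caffeine levels (mg/L)
residuals = [0.3, -0.8, 1.5, -3.0, 7.0]  # hypothetical raw residuals
std_res = [e / sigma_model(c) for e, c in zip(residuals, conc)]
```

If the variance model is adequate, the standardized residuals should show a roughly constant spread across concentration levels, as in Figure 7.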
The corresponding calibration curves obtained from these models will then be:
(5.15)
( )
(5.16)
This example considers the decomposition kinetics of biomass by pyrolysis under non-
isothermal conditions. The data summarized in Figure 8 corresponds to a thermal gravimetric
analysis (TGA) of biomass decomposition in the temperature range from 235°C to 278°C using a
heating rate of 10°C/min, reported by Mallick et al. [13].
Figure 8. Solid mass loss as a function of temperature during TGA of biomass pyrolysis. Data
extracted from [13].
§§
A hypothesis test can demonstrate that the differences are not significant even for less conservative
significance levels.
Following the Coats-Redfern method,[14] and assuming first-order decomposition kinetics, the TGA data can be linearized as:
$$\ln\left(\frac{-\ln(1-\alpha)}{T^2}\right) = \ln\left(\frac{A R}{\beta E_a}\left(1 - \frac{2RT}{E_a}\right)\right) - \frac{E_a}{RT}$$
(5.17)
where $\alpha$ represents the fractional solid mass loss, $T$ is the absolute temperature in $K$,
$R$ is the ideal gas constant, $\beta$ is the heating rate, and $A$ and $E_a$ are the
pre-exponential factor and the activation energy of an Arrhenius-type kinetic expression. Eq.
(5.17) represents a linear model between the variable transformations $\ln\left(\frac{-\ln(1-\alpha)}{T^2}\right)$ and $\frac{1}{T}$. The
transformed data, as well as the corresponding least-squares linear regression, is presented in
Figure 9.
Figure 9. Coats-Redfern transformation of TGA data for biomass pyrolysis.[13]. Blue points:
Transformed experimental data. Red solid line: Least-squares linear regression.
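The linear-regression step of the Coats-Redfern method can be sketched as follows (Python for illustration; the kinetic parameters, intercept and temperature grid are assumed values, not those fitted from the data of [13]):

```python
# Sketch: recovering the activation energy from the Coats-Redfern linearization
# of Eq. (5.17). The slope of y = ln(-ln(1-alpha)/T^2) vs. 1/T equals -Ea/R.
import math

R_GAS = 8.314    # J/(mol K)
EA_TRUE = 120e3  # J/mol (assumed activation energy)
C_TRUE = -15.0   # assumed intercept ln(A*R/(beta*Ea)*(1 - 2*R*T/Ea)), taken constant

temps = [520 + 5 * k for k in range(7)]                # K (roughly 247-277 C)
xcr = [1.0 / t for t in temps]                         # transformed abscissa
ycr = [C_TRUE - EA_TRUE / (R_GAS * t) for t in temps]  # exactly linear synthetic data

n = len(temps)
mx, my = sum(xcr) / n, sum(ycr) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xcr, ycr))
         / sum((x - mx) ** 2 for x in xcr))
ea_fit = -slope * R_GAS  # recovers EA_TRUE from the slope
```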
( )
( )
(5.18)
( )
(5.19)
( ) ( )
(5.20)
Assuming a normal distribution of the residuals from model (5.20) comparing with the original
experimental data (before transformation), the performance of the corresponding randomistic
model is: , , and .
However, from a randomistic point of view the final kinetic model obtained is:
( )
(5.21)
Expanding the exponential of the standard random variable as a Taylor series, and truncating it
to the first-order term, the following approximation is obtained:
( )
(5.22)
Expanding and approximating the exponential with the random variable again, we obtain:
( ) ( )
(5.23)
which indicates that the standard deviation of the residuals depends on the temperature. For
the range of temperatures considered, such standard deviation increases almost linearly with
temperature, taking values between and . Model (5.23) presents the
following performance: , , and .
This model provides a significant improvement in the goodness-of-fit of the random model for
the residuals.
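The temperature dependence of the residual standard deviation follows from the first-order expansion. If, for illustration, the back-transformed model is written in the multiplicative form $w = f(T)\, e^{\sigma Z}$ (with $Z$ a standard normal variable, and $f$ and $\sigma$ generic placeholders rather than the fitted values), then
$$e^{\sigma Z} \approx 1 + \sigma Z \quad \Rightarrow \quad w \approx f(T) + f(T)\,\sigma Z$$
so the standard deviation of the residuals is approximately $\sigma f(T)$, which grows with the deterministic prediction $f(T)$ and hence with temperature.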
On the other hand, if the original data is used to fit the parameters directly in model (5.20),
without the Coats-Redfern transformation, by maximizing the objective function , and
considering a normal distribution of the residuals, the following randomistic model is
obtained***:
( ) ( )
(5.24)
Figure 10. Comparison of three different models for the kinetics of biomass pyrolysis. Black
dots: Experimental data extracted from [13]. Red solid line: Model (5.20) Arrhenius equation
using parameters obtained by the Coats-Redfern transformation. Green dotted line: Model
(5.23) obtained directly from the Coats-Redfern equation. Blue dashed line: Model (5.24)
Arrhenius equation using parameters obtained by maximizing .
***
Using the generalized reduced gradient (GRG) nonlinear optimization method, and using parameters
of Model (5.20) as starting point.
The next example considers experimental data for the solubility of different types of solutes in
two solvents: tetrahydrofuran (THF) and 2-methyl tetrahydrofuran (2-MeTHF).[15] Two models
were previously obtained [16] considering a simple linear regression and a robust optimization
using the log-log transformation of the solubility data (see Figure 11). The corresponding
randomistic models are, respectively:
(5.25)
(5.26)
Transforming back the models into the original variables results in:
(5.27)
(5.28)
Figure 11. Solubility (mg/ml) of different solutes at room temperature for THF and 2-MeTHF in
decimal logarithm scale. Blue dots: Data sample. Dashed purple line: Least-squares fit (Eq.
5.25). Solid green line: Robust fit (Eq. 5.27). Only the deterministic components of the models
are shown.
The performance of the models, considering both the original and the transformed variables, is
summarized in Table 4. An additional model (optimal model) is included, which is obtained by
maximizing the dimensionless performance function of Eq. (4.4), calculated on the original dataset. This model considers the
same structure as the previous models, and only the values of the coefficients were used as
decision variables. The corresponding models obtained are the following:
(5.29)
(5.30)
Table 4. Comparative performance of different solubility models with respect to both original
and transformed data (decimal logarithm transformation). P = R²/s_E².

Data         Model                  R²_det    R²_rand   s_E      R²       P
Original     Least-squares model    39.68%    73.05%    11.266   83.75%   6.60x10-3
Original     Robust model           43.10%    80.49%    10.941   88.90%   7.43x10-3
Original     Optimal model          49.03%    81.17%    10.356   90.40%   8.43x10-3
Transformed  Least-squares model    72.78%    98.43%    0.3827   99.57%   6.800
Transformed  Robust model           62.59%    96.52%    0.4486   98.70%   4.905
Transformed  Optimal model          70.37%    95.91%    0.3992   98.79%   6.199
The deterministic predictions of the three models in original data values are presented in Figure
12.
Figure 12. Solubility (mg/ml) of different solutes at room temperature for THF and 2-MeTHF in
original scale. Blue dots: Data sample. Dashed purple line: Least-squares fit (Eq. 5.26). Solid
green line: Robust fit (Eq. 5.28). Dotted red line: Optimal fit (Eq. 5.30).
For this particular example, when the objective function of the optimization is defined in the
transformed domain, the optimal model corresponds to the least-squares model. Even though
in the domain of the logarithm transformations of the variables the best fit is obtained by the
least-squares model, its performance on the original scale of the variables is not optimal. Thus,
when fitting models to transformed data, it is very important to determine if the goal is
optimizing the performance of the model in the transformed domain or in the original domain
of the data, because the results may vary significantly. As it was previously mentioned,[16] it is
confirmed that the robust regression approach performs better in the original scale of the data
compared to the least-squares regression model; however, its performance is not optimal from
a randomistic point of view.
The last example consists of the comparison of three different dynamic models used to predict
the behavior of a particular fermentation process for ethanol production. The experimental
data and model results are reported by Ochoa et al.[17]. The key feature of this example is that
4 different response variables (cell, starch, glucose, and ethanol concentration) are considered
for assessing model performance. Figure 13 presents the experimental data and model
predictions at each data point, for all 4 response variables. Even though all the response
variables are expressed in the same units (g/L), the ranges of experimental values are different,
and thus it is preferable to use a dimensionless objective function for model comparison.
Furthermore, if all response variables are considered equally important, the following function
can be used for assessing the performance of each model reported:
$$\bar{P}_r = \frac{1}{4} \sum_{j=1}^{4} \frac{R_j^2 \left(\max(y_j) - \min(y_j)\right)^2}{s_{E,j}^2}$$
(5.31)
It is also important to notice that the number of experimental observations is different for each
response variable, but this has no effect on the value of the performance function previously
defined.
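The equal-weight multi-response function can be sketched in Python as follows (the $R^2$, $s_E$ and range values below are purely illustrative assumptions, not the values of Table 5):

```python
# Sketch: equal-weight multi-response performance, averaging the dimensionless
# performance of each response model. All numeric values are hypothetical.
def multi_perf(models):
    # models: list of (R2_randomistic, s_E, y_min, y_max), one per response
    return sum(r2 * (ymax - ymin) ** 2 / se ** 2
               for r2, se, ymin, ymax in models) / len(models)

models = [
    (0.95, 0.8, 0.0, 10.0),  # cell concentration (g/L)
    (0.90, 2.0, 0.0, 50.0),  # starch
    (0.85, 1.5, 0.0, 30.0),  # glucose
    (0.97, 1.0, 0.0, 25.0),  # ethanol
]
score = multi_perf(models)
```

Because each term is normalized by the squared range of its own response variable, no single response dominates the score merely due to its scale.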
Table 5 summarizes the performance comparison for the three models reported in [17],
assuming a zero-mean normal random model of the residuals. The relative performance
( ) of each individual model is presented for comparison.
The best overall performance was obtained with Model 3, whereas the worst overall
performance was obtained by Model 1. From a deterministic point of view, the best individual
model was glucose concentration for Model 3. It is also the model with the lowest standard
deviation of residuals relative to the range of the response variable, and highest relative
performance. On the other hand, the best individual random model was ethanol concentration
for Model 3, which is also the best model from a randomistic point of view. The random models
for starch and glucose concentration for Model 3 presented a poor fit. This is due to a drift of
the model with respect to the experimental data, which results in an average of the residuals
significantly different from zero. Eq. (5.31) can also be used as the objective function in an
optimization problem aiming to improve the estimation of the parameters of the models for
ethanol production from starch.
6. Conclusion
When analyzing the goodness-of-fit of a model, not only the performance of the deterministic
estimates should be considered. If the performance of the random model describing the
residuals of the deterministic model is also considered, it is possible to evaluate the goodness-of-fit
of the full randomistic model. Based on the definition of the determination coefficient
($R^2_{det}$) for deterministic models, analogue coefficients for the random ($R^2_{rand}$) and randomistic
($R^2$) models were presented. The best models are those providing the largest randomistic
coefficient $R^2$ with the minimum estimated variance of the model residuals. Such variance
cannot be smaller than the natural variation of the system, caused by measurement
uncertainties and pure experimental errors. Different examples were presented in order to
illustrate the concepts introduced in this paper. They included topics such as model
comparison, optimal parameter identification, handling heteroscedasticity of residuals,
analyzing non-linear transformations of the variables, and evaluating the performance of the
model when multiple response variables are considered. Algorithms for calculating the
goodness-of-fit of randomistic models were implemented in the R and MATLAB languages, and are
presented in the Appendix.
Acknowledgments
The author wishes to thank Prof. Dr. Silvia Ochoa (Universidad de Antioquia, Colombia), for
helpful discussions on this topic.
This research did not receive any specific grant from funding agencies in the public,
commercial, or not-for-profit sectors.
References
[1] Hernandez, H. (2018). The Realm of Randomistic Variables. ForsChem Research Reports
2018-10. doi: 10.13140/RG.2.2.29034.16326.
[2] Anderson-Sprecher, R. (1994). Model comparisons and R2. The American Statistician, 48(2),
113-117.
[3] Azzalini, A. (1996). Statistical Inference Based on the likelihood (Vol. 68). CRC Press.
[4] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on
Automatic Control, 19(6), 716-723.
[5] Hernandez, H. (2018). Comparison of Methods for the Reconstruction of Probability Density
Functions from Data Samples. ForsChem Research Reports 2018-12. doi:
10.13140/RG.2.2.30177.35686.
[6] Hernandez, H. (2018). Parameter Identification using Standard Transformations: An
Alternative Hypothesis Testing Method. ForsChem Research Reports 2018-04. doi:
10.13140/RG.2.2.14895.02728.
[7] Hernandez, H. (2019). Modeling and Identification of Noisy Dynamic Systems. ForsChem
Research Reports 2019-04. doi: 10.13140/RG.2.2.12571.72489.
[8] Whitman, K., Starfield, A. M., Quadling, H. S., & Packer, C. (2004). Sustainable trophy
hunting of African lions. Nature, 428(6979), 175.
[9] Whitlock, M. C., & Schluter, D. (2015). The Analysis of Biological Data. 2nd Ed. Macmillan
Learning. https://whitlockschluter.zoology.ubc.ca/data/chapter17
[10] Hernandez, H. (2018). Multidimensional Randomness, Standard Random Variables and
Variance Algebra. ForsChem Research Reports 2018-02. doi: 10.13140/RG.2.2.11902.48966.
[11] Gentilucci, M., Bisci, C., Burt, P., Fazzini, M., & Vaccaro, C. (2018). Interpolation of Rainfall
Through Polynomial Regression in the Marche Region (Central Italy). In: Mansourian, A., Pilesjö,
P., Harrie, L., & van Lammeren, R. (Eds.). Geospatial Technologies for All: Selected Papers of the
21st AGILE Conference on Geographic Information Science. Springer. pp. 55-73.
[12] Sanchez, J. (2018). Estimating detection limits in chromatography from calibration data:
ordinary least squares regression vs. weighted least squares. Separations, 5(4), 49.
[13] Mallick, D., Bora, B. J., Barbhuiya, S. A., Banik, R., Garg, J., Sarma, R., & Gogoi, A. K. (2019).
Detailed study of pyrolysis kinetics of biomass using thermogravimetric analysis. In: AIP
Conference Proceedings (Vol. 2091, No. 1, p. 020014). AIP Publishing.
[14] Coats, A. W., & Redfern, J. P. (1964). Kinetic parameters from thermogravimetric data.
Nature, 201(4914), 68.
[15] Qiu, J., & Albrecht, J. (2018). Solubility Correlations of Common Organic Solvents. Organic
Process Research & Development, 22(7), 829-835.
[16] Hernandez, H. (2018). Introduction to Randomistic Optimization. ForsChem Research
Reports 2018-11. doi: 10.13140/RG.2.2.30110.18246.
[17] Ochoa, S., Yoo, A., Repke, J. U., Wozny, G., & Yang, D. R. (2007). Modeling and parameter
identification of the simultaneous saccharification‐fermentation process for ethanol
production. Biotechnology progress, 23(6), 1454-1462.
Appendix. Algorithms for Calculating the Goodness-of-Fit of Randomistic Models

R implementation:

randgof<-function(datasample=NULL,estimates=NULL,nparam=1,rdist="Normal"){
#This function evaluates the goodness-of-fit and the performance of a randomistic model. The evaluation is
#performed considering both the original experimental data (datasample) and the deterministic predictions
#(estimates) of the model, under the exact same conditions of the original data. Both variables should be vectors
#with the same dimensions. The random model is evaluated assuming a certain distribution (rdist). Included in this
#code are the "Normal" and "Uniform" distributions. The standard deviation of the residuals is estimated using the
#number of parameters (nparam) considered in the deterministic model (including constant coefficients). The
#output includes: R2 (Randomistic goodness-of-fit), Perf (randomistic performance), detR2 (deterministic R2),
#randR2 (random fitness), sE (standard deviation of residuals), rPerf (relative performance), rsE (relative standard
#deviation of residuals)
#Calculation of residuals
res=datasample-estimates
#Calculation of deterministic R2
avg=mean(datasample)
SST=sum((datasample-avg)^2)
SSE=sum(res^2)
detR2=1-SSE/SST
#Calculation of random R2
res=sort(res)
n=length(res)
sE=sqrt(SSE/(n-nparam))
rsE=sE/(max(datasample)-min(datasample))
if (rdist=="Uniform") {
#Uniform distribution with mean 0 and standard deviation sE
#(punif is parameterized by min/max, not mean/sd)
phi=punif(res,min=-sE*sqrt(3),max=sE*sqrt(3))
} else {
phi=pnorm(res,mean=0,sd=sE)
}
err=0*phi
for (i in 1:n){
err[i]=max(0,phi[i]-(i/n),(i-1)/n-phi[i])
}
randR2=1-(12*n/(n^2-1))*sum(err^2)
#Calculation of randomistic R2 and performance
R2=detR2+randR2-detR2*randR2
Perf=R2/(sE^2)
rPerf=R2/(rsE^2)
#Output
output=data.frame(R2,Perf,detR2,randR2,sE,rPerf,rsE)
return(output)
}
MATLAB implementation:

function [R2,Perf,detR2,randR2,sE,rPerf,rsE]=randgof(datasample,estimates,nparam,rdist)
%This function evaluates the goodness-of-fit and the performance of a randomistic
%model. The evaluation is performed considering both the original experimental
%data (datasample) and the deterministic predictions (estimates) of the model,
%under the exact same conditions of the original data. Both variables should be
%vectors with the same dimensions. The random model is evaluated assuming a
%certain distribution (rdist). Included in this code are the 'Normal' and
%'Uniform' distributions. The standard deviation of the residuals is
%estimated using the number of parameters (nparam) considered in the
%deterministic model (including constant coefficients). The output includes: R2
%(Randomistic goodness-of-fit), Perf (randomistic performance), detR2
%(deterministic R2), randR2 (random fitness), sE (standard deviation of
%residuals), rPerf (relative performance), rsE (relative standard deviation of
%residuals)
%Calculation of residuals
res=datasample-estimates;
%Calculation of deterministic R2
avg=mean(datasample);
SST=sum((datasample-avg).^2);
SSE=sum(res.^2);
detR2=1-SSE/SST;
%Calculation of random R2
res=sort(res);
n=length(res);
sE=sqrt(SSE/(n-nparam));
rsE=sE/(max(datasample)-min(datasample));
if strcmp(rdist,'Uniform')
    %Standardized residuals: uniform with mean 0 and unit standard deviation
    phi=cdf('Uniform',res/sE,-sqrt(3),sqrt(3));
else
    phi=cdf('Normal',res/sE,0,1);
end
err=0*phi;
for i=1:n
err(i)=max([0,phi(i)-(i/n),(i-1)/n-phi(i)]);
end
randR2=1-(12*n/(n^2-1))*sum(err.^2);
%Calculation of randomistic R2 and performance
R2=detR2+randR2-detR2*randR2;
Perf=R2/(sE^2);
rPerf=R2/(rsE^2);
end