Chemometrics and Intelligent Laboratory Systems 216 (2021) 104372


A new approach using the genetic algorithm for parameter estimation in multiple linear regression with long-tailed symmetric distributed error terms: An application to the Covid-19 data

Abdullah Yalçınkaya a,*, İklim Gedik Balay b, Birdal Şenoğlu a

a Department of Statistics, Ankara University, 06100, Ankara, Turkey
b Business School, Ankara Yıldırım Beyazıt University, 06760, Ankara, Turkey

ARTICLE INFO

Keywords: Multiple linear regression; Long-tailed symmetric distribution; Maximum likelihood; Modified maximum likelihood; Genetic algorithm

ABSTRACT

Maximum likelihood (ML) estimators of the model parameters in multiple linear regression are obtained using the genetic algorithm (GA) when the distribution of the error terms is long-tailed symmetric (LTS). We compare the efficiencies of the ML estimators obtained using GA with the corresponding ML estimators obtained using other iterative techniques via an extensive Monte Carlo simulation study. Robust confidence intervals based on modified ML estimators are used as the search space in GA. Our simulation study shows that GA outperforms traditional algorithms in most cases. Therefore, we suggest using GA to obtain the ML estimates of the multiple linear regression model parameters when the distribution of the error terms is LTS. Finally, real data from the Covid-19 pandemic, a global health crisis in early 2020, are presented for illustrative purposes.

1. Introduction

Consider the following multiple linear regression model

Y = 1θ0 + Xθ + ϵ   (1)

where Y = [y1, y2, …, yn]' is an n × 1 vector of responses, 1 is an n × 1 vector of ones, X = [x1, x2, …, xq] is an n × q non-stochastic design matrix, θ = [θ1, θ2, …, θq]' is a q × 1 vector of unknown coefficients, and ϵ = [ϵ1, ϵ2, …, ϵn]' is a vector of independent and identically distributed error terms.

Linear regression models are very popular and therefore widely used in many different areas of science; see, for example, Ghosal et al. [11], Jomnonkwao et al. [19], Liao et al. [23], and McGarry et al. [25].

The distribution of the error terms is assumed to be normal in the context of the linear regression model in many cases encountered in real-life problems. However, the distribution of the error terms can depart from normality in practice because of the presence of heavy tails; see, for examples related to non-normality, Geary [9], Huber [14], Pearson [27], and Tiku et al. [34]. Therefore, in this paper, we strategically choose the long-tailed symmetric (LTS) distribution as the distribution of the error terms in the multiple linear regression model, since it is more flexible than the normal distribution with its heavy tails. It is also used for modeling outliers existing in the direction of the tails. Thus, the LTS distribution is utilized to obtain more robust solutions to statistical inference problems as an alternative to the normal distribution; see, for example, Tiku et al. [31] and Tiku and Suresh [33].

There exist plenty of studies about regression models with non-normal error terms in the literature. Islam et al. [17] and Tiku et al. [31] studied parameter estimation and hypothesis testing for simple linear regression under non-normal error distributions. Then, Islam and Tiku [16] discussed the multiple linear regression model with non-normal error terms. Furthermore, multivariate linear regression models under the assumption of non-normal error terms are considered by Islam [15] and Islam et al. [18].

It is very crucial to estimate the parameters in a multiple linear regression model efficiently. The least squares (LS) method is convenient for estimating the parameters of a multiple linear regression model when the error terms are normally distributed. However, LS estimators lose their efficiency when the normality assumption is not satisfied [35].

To estimate multiple linear regression model parameters under a non-normal error structure, the ML estimation method is widely used because ML estimators are unbiased and fully efficient under certain regularity conditions. However, the ML equations cannot be solved analytically when the distribution of the error terms is non-normal. For this reason, numerical methods are generally used to derive ML estimates of the unknown model parameters in the literature.

* Corresponding author.
E-mail address: ayalcinkaya@ankara.edu.tr (A. Yalçınkaya).

https://doi.org/10.1016/j.chemolab.2021.104372
Received 5 November 2020; Received in revised form 31 May 2021; Accepted 24 June 2021
Available online 29 June 2021
0169-7439/© 2021 Elsevier B.V. All rights reserved.

It should be noted that some problems, such as non-convergence of iterations, convergence to the wrong root, and slow convergence, may arise in numerical methods; see Refs. [1,2,28,36,37].

Differently from the earlier studies, the genetic algorithm (GA), which is a population-based heuristic optimization method, is utilized to obtain the ML estimates of the model parameters in multiple linear regression. It is a powerful method, besides being easy to apply for solving an optimization problem, and can comfortably be used for large-scale and complex nonlinear optimization problems; see Xia et al. [39]. Further, as Lu et al. [24] stated, satisfactory solutions can easily be obtained according to the optimization objectives, and the shortcomings of numerical optimization methods are overcome by using GA thanks to its characteristics, which give it outstanding advantages in iterative optimization. Garcia et al. [8] also reported that GA is one of the most robust automated heuristic methods for solving optimization problems. See also Yalcinkaya et al. [40,41] for the details and advantages of GA. The main reason for choosing GA in obtaining the estimates of the parameters is that the use of GA warrants convergence to the global optimal solution in optimization problems that are extremely complex and suspected to be multi-modal; see Goldberg [12]. We then compare the efficiencies of the ML estimators obtained using GA with the corresponding estimators obtained using traditional iterative techniques, such as Newton-Raphson (NR), Nelder-Mead (NM), and the iteratively re-weighting algorithm (IRA).

One of the main contributions of this paper is to propose confidence intervals based on Tiku's [29,30] modified maximum likelihood (MML) estimators of the corresponding multiple linear regression model parameters as the search space in GA. Yalcinkaya et al. [40,41] and Acitas et al. [3] use GA and particle swarm optimization to obtain the ML estimators of the parameters of the skew normal distribution and the Weibull distribution, respectively. They both strategically use the confidence intervals based on the MML estimators of the parameters of interest as the search space and show that the proposed approach provides a narrower search space, improving the GA's convergence performance.
The usage of heuristic methods is very limited in the context of regression. Therefore, we aim to implement these methods in solving optimization problems in multiple linear regression with non-normal error terms instead of classical approaches. To the best of our knowledge, this is the first study to obtain ML estimates of the parameters of a multiple linear regression model with LTS distributed error terms using GA.

It should be noted that there are many long-tailed symmetric distributions; see, for example, Lange et al. [21] and Lange and Sinsheimer [22]. Our results can easily be extended to these distributions, so they may be the topic of our future studies.

A virus called Covid-19, which appeared in the Wuhan province of China in late 2019, spread rapidly worldwide in early 2020. On July 7, 2020, the World Health Organization (WHO) reported that Covid-19 had infected over 11 million 500 thousand people worldwide and killed more than 500 thousand of them. In the Covid-19 pandemic, which turned into a global health crisis, the growth rate of cases and the number of deaths per million vary from country to country because of the different characteristics of their governments and people. It is important to model the mortality rate in the fight against Covid-19. Therefore, in this study, we analyze the Covid-19 data, which include characteristics of governments and people, both to make a scientific contribution to the fight against Covid-19 and to demonstrate an implementation of our proposed methodology.

The rest of the paper is organized as follows. Section 2 presents the LTS distribution. Section 3 includes the ML estimation of the multiple linear regression model parameters when the distribution of the error terms is LTS. The details of GA, the procedure for identifying the search space in GA, and the details of IRA are also given in Section 3. The simulation study and its results are presented in Section 4. Real data from the Covid-19 pandemic are examined in Section 5 to show the implementation of the proposed methodology. In the final section, the concluding remarks are given.

2. Long-tailed symmetric distribution

The probability density function (pdf) of the LTS distribution is given by

f(\epsilon) = \frac{1}{\sigma \sqrt{k}\, B(1/2,\, p - 1/2)} \left(1 + \frac{\epsilon^2}{k \sigma^2}\right)^{-p}, \quad -\infty < \epsilon < \infty   (2)

where k = 2p − 3, p ≥ 2, σ > 0, and B(·, ·) is the beta function. The expected value and variance of the LTS distribution are given by E(ϵ) = 0 and Var(ϵ) = σ², respectively. Also, the kurtosis value (γ) is evaluated by γ = 3(p − 3/2)/(p − 5/2). Table 1 shows the kurtosis values for some representative values of the shape parameter p.

Table 1
The kurtosis values for the LTS distribution.

p   2.5   3.0   3.5   5.0   10    ∞
γ   ∞     9.0   6.0   4.2   3.4   3.0

The LTS distribution has the following special cases:

● Assume X ~ LTS(μ, σ, p); then T = \sqrt{\nu/k}\,(X − μ)/σ reduces to the Student's t distribution with ν = 2p − 1 degrees of freedom.
● When p → ∞, the LTS distribution converges to the well-known normal distribution.

See Islam and Tiku [16] and Tiku and Kumra [32] for more details about the LTS distribution.
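The scaled Student's t representation above gives a direct way to evaluate and simulate the LTS distribution. The following R sketch (our own helper functions, not code from the paper) implements the pdf in (2) and a random-number generator based on that representation:

```r
# Density of LTS(mu, sigma, p) as given in (2)
dlts <- function(x, mu = 0, sigma = 1, p = 3) {
  k <- 2 * p - 3
  (1 / (sigma * sqrt(k) * beta(0.5, p - 0.5))) *
    (1 + (x - mu)^2 / (k * sigma^2))^(-p)
}

# Random generation: if T ~ t with nu = 2p - 1 degrees of freedom, then
# X = mu + sigma * sqrt(k / nu) * T follows LTS(mu, sigma, p)
rlts <- function(n, mu = 0, sigma = 1, p = 3) {
  nu <- 2 * p - 1
  k  <- 2 * p - 3
  mu + sigma * sqrt(k / nu) * rt(n, df = nu)
}

# quick check: the sample variance should be close to sigma^2 = 1
var(rlts(1e5, p = 3))
```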
3. Maximum likelihood estimation

In this section, we derive the maximum likelihood estimators of the parameters θ0, θ1, …, θq and σ in the multiple linear regression with LTS distributed error terms. The log-likelihood (lnL) function based on (2), where ϵi = yi − θ0 − Σ_{j=1}^{q} θj xij, i = 1, …, n, is given by

\ln L = -n \ln\!\big(\sigma \sqrt{k}\, B(1/2,\, p - 1/2)\big) - p \sum_{i=1}^{n} \ln\!\left(1 + \frac{\big(y_i - \theta_0 - \sum_{j=1}^{q} \theta_j x_{ij}\big)^2}{k \sigma^2}\right)   (3)

In order to obtain the ML estimators of the unknown parameters, partial derivatives of the lnL function with respect to the parameters of interest are taken and we set them equal to 0 as follows:

\frac{\partial \ln L}{\partial \theta_0} = 2p \sum_{i=1}^{n} \frac{y_i - \theta_0 - \sum_{j=1}^{q} \theta_j x_{ij}}{k \sigma^2 + \big(y_i - \theta_0 - \sum_{j=1}^{q} \theta_j x_{ij}\big)^2} = 0,   (4)

\frac{\partial \ln L}{\partial \theta_l} = 2p \sum_{i=1}^{n} \frac{x_{il}\big(y_i - \theta_0 - \sum_{j=1}^{q} \theta_j x_{ij}\big)}{k \sigma^2 + \big(y_i - \theta_0 - \sum_{j=1}^{q} \theta_j x_{ij}\big)^2} = 0,   (5)

(l = 1, …, q), and

\frac{\partial \ln L}{\partial \sigma} = -\frac{n}{\sigma} + \frac{2p}{\sigma} \sum_{i=1}^{n} \frac{\big(y_i - \theta_0 - \sum_{j=1}^{q} \theta_j x_{ij}\big)^2}{k \sigma^2 + \big(y_i - \theta_0 - \sum_{j=1}^{q} \theta_j x_{ij}\big)^2} = 0.   (6)

The solutions of these likelihood equations are the ML estimators of the parameters of interest. However, the equations have no explicit solutions [16], since the likelihood equations include intractable functions such as g(zi) = zi/(k + zi²), where zi = ϵi/σ, i = 1, …, n. Therefore, we resort to iterative techniques such as the GA, IRA, NR, and NM algorithms. The procedures of GA and IRA used here are introduced in the following subsections. See Ref. [40] for the details of the NR and NM algorithms.

Here, we do not present the NR and NM algorithms because they are very common methods for obtaining the ML estimators in the literature.
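Since all four algorithms work directly with the log-likelihood in (3), it is convenient to have it available as a single function. The following R sketch is our own helper (not code from the paper); it assumes the shape parameter p is known, as in the simulation study of Section 4:

```r
# Log-likelihood (3) for multiple linear regression with LTS(0, sigma, p) errors.
# theta = (theta0, theta1, ..., thetaq, sigma); y is the response vector and
# X the n x q design matrix (without the column of ones).
loglik_lts <- function(theta, y, X, p) {
  q     <- ncol(X)
  b     <- theta[1:(q + 1)]            # intercept and slopes
  sigma <- theta[q + 2]
  if (sigma <= 0) return(-Inf)         # keep the search inside the parameter space
  k   <- 2 * p - 3
  res <- as.numeric(y - cbind(1, X) %*% b)   # epsilon_i
  -length(y) * log(sigma * sqrt(k) * beta(0.5, p - 0.5)) -
    p * sum(log(1 + res^2 / (k * sigma^2)))
}
```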
3.1. Genetic algorithm

The genetic algorithm (GA), an iterative population-based search technique proposed by Holland [13], is a very popular heuristic algorithm for finding the optimum of an objective function. The idea underlying GA is to imitate the way the hereditary characteristics of chromosomes are passed from one generation to another through evolutionary mechanisms. Each solution and each set of solutions in a generation are called a chromosome and a population, respectively. The following procedure, given stage by stage, is used during GA.

● Generating the initial population: GA starts with an initial population of N chromosomes generated from the search space via an initialization strategy. Assume that the initial population is denoted by w_1^(0), w_2^(0), …, w_N^(0), where w = (θ0, θ1, …, θq, σ)' is the vector of unknown parameters. Here, q and N are the number of independent variables in the multiple linear regression model and the population size, respectively. Also, the vector w_r^(m), r = 1, 2, …, N, m = 0, 1, 2, …, represents the values of the rth chromosome in the population at the mth iteration.

● Fitness evaluation: Each chromosome has a fitness value evaluated according to the objective function. In this study, we use lnL as the objective function. lnL(w_r^(m)), m = 0, 1, 2, …, represents the fitness value of the rth chromosome at iteration m. First, the chromosomes in the mth population are sorted and classified by their fitness values from the best to the worst to obtain the new (m + 1)th population. It should be noted that classifying a chromosome as better, namely as having a better fitness value, means having a bigger/smaller fitness value for maximization/minimization problems.

● Elitism: A predetermined elite number (E) of chromosomes having the best fitness values are accepted as elite chromosomes and transferred directly to the new population without any modification.

● Selection: At a predetermined selection rate (SR), the worst chromosomes are replaced by new chromosomes generated randomly from the search space.

● Mutation and crossover: Finally, the mutation and crossover operators are applied, according to the mutation probability (MP) and the crossover probability (CP), to the candidate chromosomes selected from the chromosomes other than the elites via a selection strategy. Here, we prefer the roulette wheel selection strategy. The basic principle of this strategy is that chromosomes having better fitness values have a greater chance of being selected.

● Convergence check: Thus, the new population w_1^(m+1), w_2^(m+1), …, w_N^(m+1) is obtained. If the convergence criterion does not hold, this process is continued by setting m = m + 1. When the process stops, the values of the best chromosome in the last population are taken as the estimates of the parameters.

Identifying the search space of GA: We use the confidence intervals based on Tiku's MML estimators of the parameters θ0, θ1, …, θq and σ as the search space in GA. The asymptotic 100(1 − α)% confidence intervals for the parameters θ0, θ1, …, θq and σ are given as follows:

CI(\theta_0) = \big(\hat{\theta}_0 - z_{\alpha/2}\, se(\hat{\theta}_0),\; \hat{\theta}_0 + z_{\alpha/2}\, se(\hat{\theta}_0)\big),   (7)

CI(\theta_l) = \big(\hat{\theta}_l - z_{\alpha/2}\, se(\hat{\theta}_l),\; \hat{\theta}_l + z_{\alpha/2}\, se(\hat{\theta}_l)\big),   (8)

(l = 1, …, q), and

CI(\sigma) = \big(\hat{\sigma} - z_{\alpha/2}\, se(\hat{\sigma}),\; \hat{\sigma} + z_{\alpha/2}\, se(\hat{\sigma})\big).   (9)

Here, se(·) is the standard error of the estimator of interest. In this study, we evaluate the standard errors by using the asymptotic variance-covariance matrix of Tiku's MML estimators. The asymptotic variance-covariance matrix of the MML estimators is obtained by using the inverse of the expected Fisher information matrix given in Islam and Tiku [16].
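A minimal R sketch of this setup is given below, using the ga() function of the GA package (the paper reports using the 'ga' function with these settings in Section 4). The vectors lower and upper are assumed to hold the MML-based confidence limits from (7)-(9) for (θ0, θ1, …, θq, σ), loglik_lts() is the helper sketched at the start of Section 3, and the argument names follow recent versions of the GA package:

```r
library(GA)  # provides ga(); the GA package is assumed to be installed

# lower, upper: numeric vectors of length q + 2 with the MML-based confidence
# limits (7)-(9) for (theta0, theta1, ..., thetaq, sigma) -- assumed precomputed
fit_ga <- ga(type       = "real-valued",
             fitness    = function(w) loglik_lts(w, y = y, X = X, p = p),
             lower      = lower,
             upper      = upper,
             popSize    = 500,    # N
             pmutation  = 0.2,    # MP
             pcrossover = 0.8,    # CP
             elitism    = 4,      # E
             maxiter    = 1000)

ga_est <- as.numeric(fit_ga@solution[1, ])  # best chromosome = (theta_hat, sigma_hat)
```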
MML estimators are explicit estimators of the model parameters and are derived by using a non-iterative method; see Ref. [29]. To obtain the MML estimators, firstly the likelihood equations given in (4)-(6) are expressed in terms of the ordered statistics (for given θ0 and θj, 1 ≤ j ≤ q)

\epsilon_{(i)} = y_{[i]} - \theta_0 - \sum_{j=1}^{q} \theta_j x_{[i]j}, \quad 1 \le i \le n   (10)

where (y_[i], x_[i]1, …, x_[i]q) is called the concomitant vector of observations corresponding to the ith ordered ϵ_(i). Then the MML estimators of the multiple linear regression model parameters when the errors have the LTS distribution are formulated as

\hat{\theta}_0 = \bar{y}_{[\cdot]} - \sum_{j=1}^{q} \hat{\theta}_j \bar{x}_{[\cdot]j},   (11)

\hat{\theta} = K + D \hat{\sigma}   (12)

and

\hat{\sigma} = \frac{B + \sqrt{B^2 + 4nC}}{2\sqrt{n(n - q - 1)}}   (13)

where

\bar{y}_{[\cdot]} = \sum_{i=1}^{n} \beta_i y_{[i]} \Big/ \delta, \quad \bar{x}_{[\cdot]j} = \sum_{i=1}^{n} \beta_i x_{[i]j} \Big/ \delta, \quad \delta = \sum_{i=1}^{n} \beta_i,   (14)

K = (X'\beta X)^{-1}(X'\beta Y) = \{K_j\}, \quad \beta = \mathrm{diag}(\beta_i),   (15)

D = (X'\beta X)^{-1}(X'\alpha \mathbf{1}) = \{D_j\}, \quad \alpha = \mathrm{diag}(\alpha_i),   (16)

B = \frac{2p}{k} \sum_{i=1}^{n} \alpha_i \left\{ y_{[i]} - \bar{y}_{[\cdot]} - \sum_{j=1}^{q} K_j \big(x_{[i]j} - \bar{x}_{[\cdot]j}\big) \right\}   (17)

and

C = \frac{2p}{k} \sum_{i=1}^{n} \beta_i \left\{ y_{[i]} - \bar{y}_{[\cdot]} - \sum_{j=1}^{q} K_j \big(x_{[i]j} - \bar{x}_{[\cdot]j}\big) \right\}^2.   (18)

Here,

\alpha_i = \frac{(2/k)\, t_{(i)}^3}{\{1 + (1/k)\, t_{(i)}^2\}^2},   (19)

\beta_i = \frac{1 - (1/k)\, t_{(i)}^2}{\{1 + (1/k)\, t_{(i)}^2\}^2}   (20)

where t_(i) is the expected value of the ordered statistic Z_(i) = ϵ_(i)/σ, i.e., t_(i) = E(Z_(i)). If C in (18) is a negative value, then we use the following αi* and βi*,

\alpha_i^* = \frac{(1/k)\, t_{(i)}^3}{\{1 + (1/k)\, t_{(i)}^2\}^2},   (21)


Table 2
Simulated Mean, MSE, and Def values for the estimators of the parameters θ0, θ1, θ2, θ3 and σ.

              θ̂0            θ̂1            θ̂2            θ̂3            σ̂
n   Method    Mean   MSE    Mean   MSE    Mean   MSE    Mean   MSE    Mean   MSE    Def

p = 3

20 GA 0.0143 0.2519 1.0041 0.2814 0.9888 0.3940 1.0260 0.4764 0.9123 0.0368 1.4404
IRA 0.0271 0.0088 0.9613 0.3480 0.9601 0.7383 1.1101 1.1324 0.9873 0.0442 2.2716
NR 326E2 679E10 356E2 957E10 441E2 148E11 246E2 550E10 166.3 228E5 367E11
NM 0.0134 0.5006 1.0084 0.4934 0.9985 0.6622 1.0102 0.7341 0.8802 0.0525 2.4427
50 GA 0.0101 0.0750 1.0151 0.1288 0.9945 0.1614 0.9977 0.1457 0.9737 0.0129 0.5238
IRA 0.0338 0.0047 1.0062 0.1751 1.0112 0.2299 0.9843 0.1971 1.0116 0.0162 0.6231
NR 0.0024 0.1228 1.0185 0.1872 0.9632 0.2176 1.0065 0.2240 0.9516 0.0135 0.7650
NM 0.0019 0.1237 1.0122 0.1836 0.9868 0.2196 0.9932 0.2089 0.9599 0.0175 0.7532
100 GA 0.0025 0.0397 1.0044 0.0634 0.9948 0.0704 0.9901 0.0703 0.9874 0.0063 0.2500
IRA 0.0201 0.0016 1.0134 0.0811 0.9949 0.0842 0.9945 0.0811 1.0098 0.0079 0.2559
NR 0.0046 0.0699 1.0023 0.0920 0.9889 0.1025 0.9840 0.0979 0.9775 0.0083 0.3706
NM 0.0046 0.0699 1.0023 0.0920 0.9889 0.1025 0.9840 0.0979 0.9775 0.0083 0.3705
200 GA 0.0039 0.0199 0.9992 0.0322 0.9955 0.0317 1.0099 0.0336 0.9968 0.0029 0.1203
IRA 0.0204 0.0011 1.0048 0.0391 0.9948 0.0383 1.0142 0.0381 1.0148 0.0041 0.1206
NR 0.0027 0.0351 0.9995 0.0464 0.9869 0.0459 1.0064 0.0475 0.9885 0.0041 0.1791
NM 0.0027 0.0351 0.9995 0.0464 0.9870 0.0459 1.0064 0.0475 0.9885 0.0041 0.1790
500 GA 0.0076 0.0084 1.0099 0.0102 0.9920 0.0128 1.0164 0.0144 1.0027 0.0009 0.0468
IRA 0.0138 0.0004 1.0156 0.0127 0.9952 0.0151 1.0205 0.0145 1.0150 0.0016 0.0444
NR 0.0063 0.0156 1.0069 0.0147 0.9896 0.0187 1.0192 0.0218 0.9950 0.0015 0.0722
NM 0.0063 0.0156 1.0069 0.0147 0.9896 0.0187 1.0191 0.0218 0.9950 0.0015 0.0722

p = 5

20 GA 0.0117 0.2931 0.9986 0.3237 1.0062 0.4645 1.0047 0.5582 0.9112 0.0322 1.6716
IRA 0.0174 0.0069 0.9717 0.3298 0.9858 0.7563 1.0595 1.1332 0.9518 0.0375 2.2637
NR 126E2 106E10 122E2 101E10 7462.9 951E9 17,779 107E10 81.019 174E5 410E10
NM 0.0114 0.5590 1.0010 0.5583 1.0171 0.7812 0.9898 0.8560 0.8854 0.0448 2.7993
50 GA 0.0010 0.0938 0.9940 0.1513 1.0158 0.1810 0.9964 0.1680 0.9647 0.0110 0.6051
IRA 0.0196 0.0037 0.9914 0.1853 1.0285 0.2459 0.9892 0.1992 0.9824 0.0135 0.6476
NR 0.0053 0.1548 0.9947 0.2147 1.0108 0.2519 0.9920 0.2412 0.9521 0.0152 0.8778
NM 0.0053 0.1547 0.9947 0.2147 1.0108 0.2519 0.9921 0.2412 0.9521 0.0152 0.8777
100 GA 0.0099 0.0485 1.0097 0.0762 1.0067 0.0795 1.0068 0.0821 0.9874 0.0053 0.2917
IRA 0.0142 0.0014 1.0120 0.0879 1.0045 0.0864 1.0037 0.0855 0.9979 0.0064 0.2676
NR 0.0102 0.0857 1.0108 0.1077 1.0074 0.1127 1.0053 0.1134 0.9803 0.0068 0.4263
NM 0.0102 0.0857 1.0107 0.1077 1.0074 0.1127 1.0054 0.1134 0.9803 0.0068 0.4262
200 GA 0.0107 0.0237 0.9941 0.0419 1.0246 0.0382 1.0031 0.0369 0.9956 0.0025 0.1432
IRA 0.0131 0.0008 0.9881 0.0456 1.0290 0.0409 0.9995 0.0407 1.0032 0.0033 0.1312
NR 0.0105 0.0420 0.9918 0.0588 1.0288 0.0538 1.0007 0.0513 0.9903 0.0034 0.2093
NM 0.0106 0.0420 0.9919 0.0588 1.0288 0.0538 1.0007 0.0513 0.9903 0.0034 0.2093
500 GA 0.0021 0.0112 0.9959 0.0169 0.9953 0.0172 1.0069 0.0189 0.9999 0.0009 0.0652
IRA 0.0063 0.0003 1.0024 0.0188 0.9968 0.0188 1.0145 0.0211 1.0048 0.0013 0.0602
NR 0.0081 0.0216 0.9925 0.0266 0.9890 0.0258 1.0060 0.0278 0.9960 0.0013 0.1032
NM 0.0082 0.0216 0.9925 0.0267 0.9890 0.0258 1.0060 0.0279 0.9960 0.0013 0.1032

E: multiply the number preceding this symbol by 10 raised to the power that follows it (e.g., 326E2 = 326 × 10²).

\beta_i^* = \frac{1}{\{1 + (1/k)\, t_{(i)}^2\}^2}, \quad (1 \le i \le n)   (22)

instead of αi and βi, respectively, for mathematical convenience; see Ref. [16] for further details. MML estimators are asymptotically equivalent to the ML estimators; therefore, under regularity conditions, they are fully efficient. A remarkable property of MML estimators is that they are as efficient as the ML estimators even for small sample sizes, see Refs. [5,37,38]. The high performance of MML estimators has been shown in plenty of studies in the literature; see, for example, [6,17,20].
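As an illustration of how the MML weights can be computed in practice, the sketch below approximates the expected order statistics t_(i) = E(Z_(i)) by the standardized LTS quantiles at i/(n + 1), a common approximation when the exact tabulated values of Tiku and Kumra [32] are not at hand; the function name and the approximation are our own, not the paper's:

```r
# Approximate MML weights alpha_i, beta_i of (19)-(20) and the starred
# versions (21)-(22) used when C in (18) turns out negative
mml_weights <- function(n, p) {
  k  <- 2 * p - 3
  nu <- 2 * p - 1
  # t_(i) = E(Z_(i)) approximated by standardized LTS quantiles at i/(n + 1)
  t_i   <- sqrt(k / nu) * qt((1:n) / (n + 1), df = nu)
  denom <- (1 + t_i^2 / k)^2
  list(alpha      = (2 / k) * t_i^3 / denom,   # (19)
       beta       = (1 - t_i^2 / k) / denom,   # (20)
       alpha_star = (1 / k) * t_i^3 / denom,   # (21)
       beta_star  = 1 / denom,                 # (22)
       t          = t_i)
}
```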
3.2. Iterative re-weighting algorithm

In this study, we use IRA to compute the ML estimates of the parameters of interest. It should be noted that IRA is an EM-type algorithm and its convergence is guaranteed, see Ref. [7]. The steps of IRA used in this study are given below.

Step 1: Identify the initial values θ0^(0), θ1^(0), …, θq^(0) and σ^(0) for θ0, θ1, …, θq and σ.

Step 2: Compute the following equations

\theta_0^{(m+1)} = \left[ \sum_{i=1}^{n} y_i \gamma_i^{(m)} - \sum_{j=1}^{q} \theta_j^{(m)} \sum_{i=1}^{n} x_{ij} \gamma_i^{(m)} \right] \Big/ \sum_{i=1}^{n} \gamma_i^{(m)},

\theta_l^{(m+1)} = \left[ \sum_{i=1}^{n} x_{il} y_i \gamma_i^{(m)} - \theta_0^{(m)} \sum_{i=1}^{n} x_{il} \gamma_i^{(m)} - \sum_{\substack{j=1 \\ j \ne l}}^{q} \theta_j^{(m)} \sum_{i=1}^{n} x_{il} x_{ij} \gamma_i^{(m)} \right] \Big/ \sum_{i=1}^{n} x_{il}^2 \gamma_i^{(m)},

\sigma^{(m+1)} = \sqrt{ \frac{2p}{kn} \sum_{i=1}^{n} \frac{\big(y_i - \theta_0^{(m)} - \sum_{j=1}^{q} \theta_j^{(m)} x_{ij}\big)^2}{w_i^{(m)}} }   (23)

where

w_i^{(m)} = 1 + \frac{1}{k\big(\sigma^{(m)}\big)^2} \left( y_i - \theta_0^{(m)} - \sum_{j=1}^{q} \theta_j^{(m)} x_{ij} \right)^2   (24)

and γi^(m) = 1/wi^(m). Here, m = 0, 1, 2, … is the iteration number and l = 1, …, q indexes the regression coefficients.


Fig. 1. Scatter plots and the correlation coefficients for the Covid-19 data.

Step 3: Stop the iterations when the conditions |θ0^(m+1) − θ0^(m)| < ε, |θl^(m+1) − θl^(m)| < ε (l = 1, …, q), and |σ^(m+1) − σ^(m)| < ε hold. Here, ε > 0 is a predetermined small constant.
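A compact R sketch of this iteration is given below. For brevity it replaces the coordinate-wise updates of θ0 and θl in (23) by the equivalent weighted least-squares step with weights γi^(m) = 1/wi^(m), which has the same fixed point as the likelihood equations (4)-(6); the σ update and the stopping rule follow (23) and Step 3. This is our own condensed rendering, not the authors' code:

```r
# Iteratively re-weighting algorithm (IRA) for LTS errors -- sketch only
ira_lts <- function(y, X, p, eps = 1e-6, maxit = 200) {
  Xd <- cbind(1, X)                              # design matrix with intercept
  k  <- 2 * p - 3
  n  <- length(y)
  th <- as.numeric(lm.fit(Xd, y)$coefficients)   # Step 1: LS starting values
  sg <- sd(as.numeric(y - Xd %*% th))
  for (m in seq_len(maxit)) {                    # Step 2
    res    <- as.numeric(y - Xd %*% th)
    w      <- 1 + res^2 / (k * sg^2)             # (24)
    g      <- 1 / w                              # gamma_i
    th_new <- as.numeric(solve(t(Xd) %*% (g * Xd), t(Xd) %*% (g * y)))
    sg_new <- sqrt(2 * p / (k * n) * sum(res^2 / w))          # (23)
    done   <- max(abs(c(th_new - th, sg_new - sg))) < eps     # Step 3
    th <- th_new; sg <- sg_new
    if (done) break
  }
  list(theta = th, sigma = sg)
}
```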
4. Simulation study

In this section, we compare the efficiencies of the proposed parameter estimators for the multiple linear regression model with LTS distributed error terms via an extensive Monte Carlo simulation study. We run simulations for q = 1, 2, 3, and 4 separately and observe that similar results are obtained for all q values; therefore, we report only the results for q = 3 for the sake of brevity. The ML estimates of the parameters θ0, θ1, θ2, θ3, and σ are obtained by using the GA, IRA, NR, and NM algorithms. It should be noted that the value of the parameter p is assumed to be known throughout this section. All computations are conducted using the R statistical software environment.

For the computations of the GA algorithm, we use the 'ga' function with the following settings: the population size N = 500, the mutation probability MP = 0.2, the crossover probability CP = 0.8, and the elite number E = 4. The maximum number of iterations is taken to be 1000. The other settings are left at their default values in the 'ga' function.

Without loss of generality, we take θ0 = 0, θ1 = 1, θ2 = 1, θ3 = 1, and σ = 1. The values of xij (i = 1, …, n; j = 1, 2, 3) in the design matrix are randomly generated from the Uniform(0, 1) distribution. The sample sizes are taken to be n = 20, 50, 100, 200, and 500 for the small, moderate, large, very large, and extremely large sample size scenarios, respectively. We strategically choose p = 3 and p = 5 in the LTS distribution to compare the effects of the kurtosis levels γ = 9 and γ = 4.2, respectively. It is known that the LTS distributions with p = 3 and p = 5 are equivalent to the Student's t distribution with 5 and 9 degrees of freedom, respectively. Each Monte Carlo simulation run is independently replicated ⌊100,000/n⌋ times, where ⌊⋅⌋ denotes the greatest integer function.
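Under these settings, one Monte Carlo replication can be generated as in the sketch below (our own illustration, using the rlts() helper from Section 2; it is not the authors' simulation code):

```r
set.seed(1)                              # reproducibility of the sketch only
n     <- 50
p     <- 3
theta <- c(0, 1, 1, 1)                   # (theta0, theta1, theta2, theta3)
sigma <- 1

X <- matrix(runif(n * 3), nrow = n)      # x_ij ~ Uniform(0, 1)
y <- as.numeric(theta[1] + X %*% theta[-1] + rlts(n, sigma = sigma, p = p))

# y and X can now be passed to the GA, IRA, NR, or NM routines; repeating this
# floor(100000 / n) times and averaging gives the Mean, MSE, and DEF values.
```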
The simulated mean, mean square error (MSE), and deficiency (DEF) criteria are used in the comparisons. The mathematical expression of the DEF criterion, which is used to compare the joint efficiencies of the estimators, is given below:

DEF = MSE(θ̂0) + MSE(θ̂1) + MSE(θ̂2) + MSE(θ̂3) + MSE(σ̂).   (25)

We report the mean, MSE, and DEF values for the ML estimators of the parameters in Table 2. We present the discussions about the simulation results below.

It is seen from Table 2 that the NR algorithm gives non-meaningful values for the small sample size (i.e., n = 20) for all shape parameter values.

Considering the mean values in Table 2, it can easily be said that the biases of all estimators (except NR for n = 20) are negligible.

For θ0: GA performs better than NM and NR, with the smallest MSE values for all shape parameter values and all sample sizes. On the other hand, IRA outperforms all methods in all of the cases according to the MSE criterion. Therefore, GA has the second-best performance in estimating θ0, since it has quite small MSE values.

For θ1, θ2, θ3, and σ: GA has the best performance among NR, NM, and IRA according to the MSE criterion in all cases.

The simulation results show that GA outperforms the NM and NR algorithms according to the DEF criterion, providing the lowest deficiency in all of the cases. However, IRA has a slightly smaller deficiency than GA in a few cases, i.e., n = 100, 200, 500 when p = 5 and n = 500 when p = 3. On the other hand, GA is more efficient than IRA in the other cases (60% of the cases). Therefore, GA outperforms IRA according to the DEF criterion in most of the cases.

As a result, it can be said that GA is preferable to the other algorithms for obtaining the ML estimators of the parameters of the multiple linear regression model when the error terms have the LTS distribution, because of the superior performance of the GA.


Fig. 2. LTS Q-Q plot for the residuals (n = 52).

5. Application

In this section, a real-life data set about the Covid-19 pandemic is analyzed to show the implementation of the proposed methodology; see Ref. [10] for the detailed background of the data. The data set consists of measurements of some indicators and indices for 52 countries in the context of the Covid-19 pandemic. Therefore, it is called the Covid-19 data in the rest of this section.

The Covid-19 data include n = 52 observations on mortality rates (deaths from Covid-19 per million people), cultural tightness, government efficiency, and growth rates of the virus. We use the following multiple linear regression model

yi = θ0 + θ1 xi1 + θ2 xi2 + θ3 xi3 + ϵi, (1 ≤ i ≤ n)   (26)

and obtain the estimated regression equation. Here, the dependent (response) variable Y represents the mortality rate, and the independent (explanatory) variables X1, X2, and X3 represent the cultural tightness, the government efficiency, and the growth rate, respectively.

In regression analysis, it is an important concern to identify problems such as heteroscedasticity and multicollinearity and to take preventive action. Since the mortality rates for some countries (e.g., Italy, Spain) are far greater than those of the other countries, the mortality rates have high skewness and variance. Such response variables in a regression model can lead to heteroscedasticity. To prevent violations of homoscedasticity, we apply a log-transformation to the raw mortality rates, similar to Ref. [10]. In the regression analysis, the log-transformed mortality rates are used as the response.

Multicollinearity has been checked by Ref. [10] via the variance inflation factors, and it is seen that there is no problematic multicollinearity among the explanatory variables. To see the numerical value of the relationship between the explanatory variables, we give the correlations between them in Fig. 1. The entries on the main diagonal show the distributions of the variables, and the entries below and above the main diagonal display the scatter plots of each pair of variables and the Pearson correlation coefficients, respectively.

To find the appropriate regression equation, which provides a better fit to the Covid-19 data, we obtain the ML estimates of the model parameters in multiple linear regression by using the GA and IRA algorithms, because of their superior performances in the simulation study.

We consider two different cases in identifying the estimate of the shape parameter (p) of the LTS distribution. In Case 1, it is estimated by using the profile likelihood (PL) method given in Ref. [40]. In Case 2, we obtain the estimates of all parameters, including the shape parameter p, by using GA simultaneously.

Fig. 3. Box-plot for the residuals (n = 52).

To compare the fitting performances of the regression equations based on the ML estimates obtained by using the GA and IRA algorithms, we use the AIC (Akaike information criterion) and RMSE (root mean square error) criteria.

AIC is a popular measure proposed by Akaike [4] to compare the fitting performances of the possible models according to the log-likelihood values and the number of estimated parameters. The mathematical expression of AIC is

AIC = 2p − 2lnL   (27)

where p is the number of estimated parameters. AIC is a useful measure when the number of estimated parameters is different for the compared models. It should be noted that the number of estimated parameters is different for Case 1 and Case 2 in our problem.

It is known that the model having the lowest AIC value and the lowest RMSE value (the closest to zero) among the possible models provides the best fit to the data.

To identify the distribution of the residuals, a Q-Q plot of the estimated residuals

\hat{\epsilon}_i = y_i - \hat{\theta}_0 - \sum_{j=1}^{3} \hat{\theta}_j x_{ij}, \quad (1 \le i \le 52)   (28)

is given in Fig. 2. It is seen that there exist residuals which are grossly anomalous in the Covid-19 data. Therefore, we set aside the observations that correspond to these outliers (i.e., 7.82, 5.77, and 5.30); see also the box-plot in Fig. 3.

After excluding the outliers from the regression analysis, we estimate the residuals obtained from the regression model for the remaining n = 49 observations.


Fig. 4. LTS Q-Q plot for the residuals after excluding the outliers (n = 49).

Table 3
Estimates of the regression parameters, bootstrap standard errors (in parentheses), AIC, RMSE, and Dks values (n = 49).

Method   θ̂0               θ̂1               θ̂2               θ̂3               σ̂                AIC        RMSE     Dks

Case 1: p̂ = 10.00
GA       1.7189 (0.1088)  1.3878 (0.4931)  0.1093 (0.1048)  1.2453 (0.1701)  1.5864 (0.0021)  190.8778   1.5395   0.0620
IRA      0.1620 (0.1382)  1.2792 (0.7350)  0.3478 (0.4042)  1.0693 (0.6090)  1.6967 (0.6360)  199.2834   1.6552   0.0916

Case 2: p̂ = 9.68
GA       3.1170           1.4264           0.0943           1.5188           1.4659           186.2225   1.4349   0.0567

A Q-Q plot of these residuals,

\tilde{\epsilon}_i = y_i - \tilde{\theta}_0 - \sum_{j=1}^{3} \tilde{\theta}_j x_{ij}, \quad (1 \le i \le 49)   (29)

given in Fig. 4, indicates that the LTS distribution is appropriate.

To identify whether the LTS distribution is appropriate for the distribution of the error terms or not, we also use the Kolmogorov-Smirnov (K-S) test, which is a well-known and widely used goodness-of-fit test. To test the null hypothesis

H0: The error terms are distributed as LTS

versus the alternative

H1: The error terms are not distributed as LTS,

the K-S test statistic Dks is obtained as follows:

D_{ks} = \sup_{\epsilon} \big| F_n(\epsilon) - F_0(\epsilon) \big|   (30)

where F0(⋅) is the cdf of an LTS distribution with a known shape parameter p and Fn(⋅) is the empirical cdf of the error terms. At significance level α, if the calculated value Dks is greater than the tabulated value dα given in Ref. [26], or equivalently if the corresponding p-value is less than α, then the null hypothesis H0 is rejected.
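Because the LTS cdf with known shape parameter is just a rescaled Student's t cdf (Section 2), the statistic in (30) can be evaluated directly in R. The sketch below is our own illustration; it assumes the fitted residuals and parameter estimates are available as res_hat, sigma_hat, and p_hat:

```r
# F0: cdf of LTS(0, sigma, p), via the scaled Student's t representation
plts <- function(x, sigma, p) {
  k  <- 2 * p - 3
  nu <- 2 * p - 1
  pt(sqrt(nu / k) * x / sigma, df = nu)
}

# res_hat: residuals from the fitted regression as in (28)/(29), assumed precomputed
ks_out <- ks.test(res_hat, function(x) plts(x, sigma = sigma_hat, p = p_hat))
ks_out$statistic   # this is Dks in (30)
```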
The estimates of the regression parameters, the bootstrap standard errors of the corresponding estimates (given in parentheses), the AIC, the RMSE, and the K-S test statistic (i.e., Dks) values are given in Table 3.

According to the K-S results based on the GA and IRA estimates of the model parameters, the null hypothesis is not rejected at the significance level α = 0.05 for both tests, since the Dks values are less than the corresponding table value d(n = 49, α = 0.05) = 0.17128; see Table 3. This result implies that the distribution of the error terms obtained from the regression equation based on the GA and IRA estimates of the model parameters is LTS. The Q-Q plot given in Fig. 4 supports this result visually.

It is clear from Table 3 that the proposed method using GA has lower AIC and RMSE values than those of IRA in Case 1. Furthermore, GA has much smaller bootstrap standard errors than IRA for all parameter estimates. Therefore, GA is more reliable and preferable than IRA. Additionally, it is obvious from Table 3 that using GA is superior to using PL in model fitting according to the AIC and RMSE results given in Case 1 and Case 2. These conclusions show the superiority of the GA. As a result, we advise using GA to obtain the estimates of the parameters (including the shape parameter p) in the multiple linear regression model with LTS distributed error terms.

Additionally, to investigate the robustness of the ML estimates based on the GA and IRA methods to outliers, we give the results of the regression analysis for the data including the outliers (i.e., n = 52) in Table 4.

Table 4
Estimates of the regression parameters, bootstrap standard errors (in parentheses), AIC, RMSE, and Dks values (n = 52).

Method   θ̂0               θ̂1               θ̂2               θ̂3               σ̂                AIC       RMSE     Dks

Case 1: p̂ = 2.24
GA       3.3888 (0.1828)  1.4082 (0.3951)  0.0239 (0.1462)  1.7134 (0.2197)  2.1504 (0.0013)  220.598   1.9515   0.0847
IRA      0.2295 (0.1643)  1.2858 (0.7277)  0.4145 (0.3137)  1.1633 (0.5924)  2.5595 (0.4422)  237.293   2.2591   0.0881

Case 2: p̂ = 2.52
GA       3.6262           1.4299           0.0341           1.8993           2.0095           221.6684  1.9253   0.0997


It is seen from Table 3 and Table 4 that the model fitting performance of GA is less sensitive to the outliers than that of IRA. It is clear that the AIC and RMSE values decrease for the censored sample for both models based on GA and IRA; this is an indication of the negative effect of the outliers on the estimated regression equation. However, the reduction rate is much bigger for the model based on IRA.

6. Conclusion

In this study, we focus on the ML estimation of the parameters of the multiple linear regression model when the underlying distribution of the error terms is LTS. It should be noted that the ML estimators of the parameters cannot be obtained analytically. Therefore, we resort to GA and the traditional NR, NM, and IRA algorithms. To improve the performance of GA, we use the robust confidence intervals based on the MML estimators of the regression parameters as the search space. We compare the efficiencies of the ML estimators obtained by using the mentioned algorithms in terms of the MSE and DEF criteria. Our simulation study shows that GA outperforms the other traditional algorithms in most of the cases in obtaining the ML estimates. Eventually, we strongly advise using GA to obtain the ML estimates of the parameters of the multiple linear regression model when the error terms have the LTS distribution, because of the superior performance of the GA.

CRediT authorship contribution statement

Abdullah Yalçınkaya: Conceptualization, Methodology, Software, Formal analysis, Validation, Data curation, Investigation, Writing – original draft, Review & editing. İklim Gedik Balay: Conceptualization, Methodology, Software, Formal analysis, Resources, Validation, Writing – original draft, Review & editing. Birdal Şenoğlu: Conceptualization, Methodology, Supervision, Formal analysis, Validation, Writing – original draft, Review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References
statistics for a family of symmetric distributions (Student's t), selected tables in
[1] S. Acitas, P. Kasap, B. Senoglu, O. Arslan, One-step M-estimators: jones and Faddy's mathematical statistics 8 (1985) 141–270.
skewed t-distribution, J. Appl. Stat. 40 (7) (2013) 1545–1560. [33] M.L. Tiku, R.P. Suresh, A new method of estimation for location and scale
[2] S. Acitas, P. Kasap, B. Senoglu, O. Arslan, Robust estimation with the skew t2 parameters, J. Stat. Plann. Inference 30 (2) (1992) 281–292.
distribution, Pakistan Journal of Statistics 29 (4) (2013) 409–430. [34] M.L. Tiku, W.K. Wong, G. Bian, Estimating parameters in autoregressive models in
[3] S. Acitas, C.H. Aladag, B. Senoglu, A new approach for estimating the parameters of non-normal situations: symmetric innovations, Commun. Stat. Theor. Methods 28
Weibull distribution via particle swarm optimization: an application to the (2) (1999) 315–341.
strengths of glass fibre data, Reliab. Eng. Syst. Saf. 183 (2019) 116–127. [35] J.W. Tukey, A survey of sampling from contaminated distributions, Contributions to
[4] H. Akaike, Maximum likelihood identification of Gaussian autoregressive moving probability and statistics (1960) 448–485.
average models, Biometrika 60 (2) (1973) 255–265. [36] D.C. Vaughan, On the Tiku-Suresh method of estimation, Commun. Stat. Theor.
[5] G.K. Bhattacharyya, The asymptotics of maximum likelihood and related estimators Methods 21 (2) (1992) 451–469.
based on type II censored data, J. Am. Stat. Assoc. 80 (390) (1985) 398–404. [37] D.C. Vaughan, The generalized secant hyperbolic distribution and its properties,
[6] N. Celik, B. Senoglu, Robust estimation and testing in one-way ANOVA for Type II Commun. Stat. Theor. Methods 31 (2) (2002) 219–238.
censored samples: skew normal error terms, J. Stat. Comput. Simulat. 88 (7) (2018) [38] D.C. Vaughan, M.L. Tiku, Estimation and hypothesis testing for a nonnormal
1382–1393. bivariate distribution with applications, Math. Comput. Model. 32 (1–2) (2000)
[7] N. Celik, B. Senoglu, O. Arslan, Estimation and testing in one-way ANOVA when the 53–67.
errors are skew-normal, Rev. Colomb. Estadística 38 (1) (2015) 75–91. [39] Z. Xia, K. Mao, S. Wei, X. Wang, Y. Fang, S. Yang, Application of genetic algorithm-
[8] A.M. Garcia, I. Sante, M. Boullon, R. Crecente, Calibration of an urban cellular support vector regression model to predict damping of cantilever beam with
automaton model by using statistical techniques and a genetic algorithm. particle damper, J. Low Freq. Noise Vib. Act. Contr. 36 (2) (2017) 138–147.
Application to a small urban settlement of NW Spain, Int. J. Geogr. Inf. Sci. 27 (8) [40] A. Yalcinkaya, B. Senoglu, U. Yolcu, Maximum likelihood estimation for the
(2013) 1593–1611. parameters of skew normal distribution using genetic algorithm, Swarm and
[9] R.C. Geary, Testing for normality, Biometrika 34 (3/4) (1947) 209–242. Evolutionary Computation 38 (2018) 127–138.
[10] M.J. Gelfand, J.C. Jackson, X. Pan, D. Nau, M.M. Dagher, P.V. Lange, C. Chiu, The [41] A. Yalcinkaya, U. Yolcu, B. Senoglu, Maximum likelihood and maximum product of
importance of cultural tightness and government efficiency for understanding spacings estimations for the parameters of skew-normal distribution under doubly
COVID-19 growth and death rates. Preprint on PsyArXiv. https://doi.org/10 type II censoring using genetic algorithm, Expert Syst. Appl. 168 (2021) 114407.
.31234/osf.io/m7f8a, 2020.
