Alexandria Engineering Journal (2022) 61, 937–947

HOSTED BY
Alexandria University

Alexandria Engineering Journal


www.elsevier.com/locate/aej
www.sciencedirect.com

Missing data imputation of MAGDAS-9's ground electromagnetism with supervised machine learning and conventional statistical analysis models
Muhammad Asraf H. a, Nur Dalila K.A. a,*, Nooritawati Md Tahir b,
Zatul Iffah Abd Latiff a,b, Mohamad Huzaimy Jusoh b, Yoshikawa Akimasa c

a College of Engineering, Universiti Teknologi MARA Cawangan Johor, Kampus Pasir Gudang, Malaysia
b College of Engineering, Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia
c International Center for Space Weather Science and Education, Kyushu University, Fukuoka, Japan

Received 14 December 2020; revised 11 April 2021; accepted 27 April 2021
Available online 6 June 2021

KEYWORDS
Missing dataset; Imputation; Geomagnetic storm; Space weather; Supervised learning

Abstract
Data imputation studies include reconstruction or estimation of imperfect data gaps caused by system sensing failure, and non-responsive data transmission remains an open issue. In space weather applications, imputation of ground electromagnetism is significant in capturing the complex interaction of sun–earth prior to the subsequent analysis of the space weather effects. Key contributions to the demonstration of supervised machine learning (ML) imputation approach with artificial neural network, K-nearest neighbour, support vector regression (SVR), and General Regression Neural Network (GRNN) for MAGDAS-9 ground electromagnetism dataset have not yet been established. A total of 1,585,950 data points were analysed with supervised ML models which included performance benchmark with statistical analysis namely zero value substitution, listwise deletion, mean substitution, and hot deck imputation. To achieve low reconstruction errors, different imputation models with hyperparameter tuned settings are varied, and computational time execution has been shown to contribute to imputation performance. Performance metrics measured by mean square error (MSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and execution time respectively demonstrate the capability of SVR to perfectly impute missing data for all ground electromagnetism components at an average of 0.314 MSE, 0.738 MAPE, closeness to 0.510 MAE and 0.91-second at various percentage level of data missingness. A comparison with traditional imputation shows that the supervised ML with SVR model has improved imputation performance by up to 80% of data gap. The outcome of the proposed imputation will benefit space weather applications for event characterisation, which will cover a large number of missing data in the MAGDAS-9 dataset.
© 2021 THE AUTHORS. Published by Elsevier BV on behalf of Faculty of Engineering, Alexandria University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

* Corresponding author.
E-mail address: nurdalila306@uitm.edu.my (N.D. K.A.).
Peer review under responsibility of Faculty of Engineering, Alexandria University.
https://doi.org/10.1016/j.aej.2021.04.096
1110-0168 © 2021 THE AUTHORS. Published by Elsevier BV on behalf of Faculty of Engineering, Alexandria University.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Measurement of ground electromagnetism data of the Earth in the specific application of space weather faces challenges such as missing data. Studies focused on space weather applications have a profound effect on Earth because studying the interaction with the solar system opens up possibilities for a new data analytics field that can interpret the space effect [1] in terms of identifying the root cause of the phenomenon and classifying event occurrences in the space environment. To evaluate this interaction, systems for observing the electromagnetism of the Earth from the space environment have been actively installed; they include ground-based magnetometer networks around the globe such as the MAGnetic Data Acquisition System (MAGDAS), the International Real-time Magnetic Observatory Network (INTERMAGNET), and the Indian Institute of Geomagnetism (IIG) [2]. The magnetometer measures ground electromagnetism data and performs data logging to record the interaction activities, which may be inconsistent due to limitations of data logging applications. The major issues in measuring variables include the disconnection of network and hardware failure, system breakdown, sensing failure, and/or connectivity issues when transferring the data to the web server [3], which disable the main function of recording daily measurements. Such a system failure can lead to a missing gap in the data. Time-based variations on the order of seconds cannot be logged and therefore produce gaps or data imperfections; this is very typical for real-time datasets. Values of recorded variables are often missing from complete datasets. As an alternative option for data reconstruction from the missing gap, a data imputation technique can be selected as a viable solution. Data imputation is necessary for ground electromagnetism data to retain exclusive information on data gathering that possesses continuous variation as time progresses.

These challenges need to be overcome because a complete dataset is important for data pre-processing and cleaning, and for the subsequent process of interpreting and analyzing the event. Data imputation solves these issues by reconstructing missing data and gaps. Various techniques have been developed to handle missing data; the most common approach is to disregard the missing data or to substitute the mean [4] of the overall data into the empty gap. This approach is called the standard imputation approach or conventional statistical analysis; however, its disadvantage is that it introduces bias, which can subsequently influence the prediction performance [5] in data analysis. Kline [6] emphasizes the high error exhibited in the estimated missing data caused by the case deletion effect, with an incremental error for the mean substitution case. Although no specific algorithm is suitable for all applications, this imputation algorithm preserves the joint distribution of data [7].

Existing literature in the space weather field does not thoroughly discuss this issue. A literature review indicated that several traditional imputation approaches have been used as a solution for a variety of applications; literature on energy healthcare [8], wind turbines [9], climate [10], and crowdsourcing [11] applications revealed that the machine learning (ML) method and conventional statistical analysis are the two most popular techniques that are applied. The present technique involves the traditional method [12,13], which includes listwise deletion, mean substitution, and hotdeck. This implementation by Turrado et al. [14] and Guo et al. [15] suggests that deletion or substitution with a zero mean value is not acceptable. Further, exclusion or deletion of data causes a reduction in available data for further processing stages, severely affects the bias interpretation, and results in errors. This indicator of the reduced data demonstrates that the missingness mechanism contributes significantly toward the prediction performance of data imputation, and therefore, a simulation is necessary to prove the impact of missing data. For the traditional imputation approach, a high missingness ratio of data imputation results in inaccurate prediction errors caused by limited data input [16]. To overcome the abovementioned disadvantage, imputation by ML has been suggested by Poulos [17] as an approach for improving prediction performance while compensating for the bias in missing data. Ramirez [18] highlighted that the imputation process requires the fine-tuning of hyperparameters to produce the best model in ML during imputation, and most surprisingly, an artificial neural network (ANN) with a single hidden layer network [19] outperformed other proposed methods with a minimum of five hidden neurons trained with the Levenberg–Marquardt learning rule. Gustavo et al. [20] and Jadhav [4] demonstrated an opposition to ANN, in that K-NN imputation was proven to be significant among the ML methods available. This is because K-NN has the capability to adjust the missing value in the distance metric, and it alleviates the creation of an explicit model. In another case study by Fusco [21], SVM demonstrated excellent results in the reconstruction of missing data compared to conventional statistical algorithms in arbovirus infectious disease prediction modelling; it achieved an accuracy of 90% in the classification. The advantage of SVM is that it is not influenced by the data distribution, and it has strong output metrics for all distributions. Therefore, the variability of the ML network to generalise the abovementioned missing data can vary depending on the data itself based on distribution size and data type. The large size of the dataset can also accommodate the data gap better than the statistical approach.

Other points of view showed that the comparison of the verdict in the climate application that includes time series datasets with fluctuating pattern characteristics necessitates a preliminary assessment of the missingness mechanism [12,22] rather than concentrating on the imputation model availability. Imputation strategies depend on the nature and proportion of missing data. Preliminary consideration is necessary to ensure that successful prediction does not overlook the missing data mechanism. This consideration requires thorough research such as the examination of the missing analysis of the data pattern to decide the correct imputation technique [13] to improve accuracy. This study observes the nature of missingness and provides the assessment of a suitable approach for handling data imperfections. To understand the data missingness mechanism, the most commonly available models [23,24] are outlined as follows:

• Missing completely at random (MCAR): The assumption considered in this case is that missingness is unrelated to its value or to the value of other variables.
• Missing at random (MAR): The MAR case involves the probability of missingness that depends on observed data. The missing value depends on the data point value of other
variables to the extent that the missingness is correlated with other variables included in the analysis (i.e., the cause of missingness is considered).
• Missing not at random (MNAR): The assumption considered in the MNAR case is that the probability of missingness may depend on unobserved data. The missingness effect depends on other variables of unobserved data, whereas the data are not MCAR or MAR.

Nevertheless, the missingness mechanism has been unclear and difficult to identify [17] in the datasets, and its effect may be investigated to observe the performance of the data imputation approach. Furthermore, the dataset size and perturbation percentage of data missingness need further attention when imputing the missing data [25]. For the MAR case, the assumption holds that MAR cannot be generalised for missingness caused by routine maintenance or calibration, and further inquiry can be carried out randomly [7]. The MNAR missing pattern caused by sensor failure, which leads to a continuous missing sequence [26], demonstrated an improved prediction performance in the missing data case. Further, data missingness of the MCAR type [27] can be caused by data management, human resources, and instrumentation error, which demonstrated the effect of the missing data gap. Contradictory findings with different mechanism types trigger a closer analysis that includes the study of the data missingness level for further investigation. Differences between data distribution and the relationship between data missingness are recommended for further study to examine whether there is a pattern [28].

The wide range of earlier work on imputation models shall serve as a guideline to probe further the case of solving missing data issues in ground electromagnetism. Even though various strategies are available to overcome the missing gap in the datasets, a systematic approach to modify the data should be devised considering data dimensionality and incompleteness level. This includes preserving important data characteristics during the data imputation process. Despite the fact that ML techniques have been successfully applied for various applications, data dimensionality and type of data may have a significant influence on the developed model. Thus, there is no standardized technique for specific applications [29]. The solution remains an open problem because none of the best approaches generalised well for all ranges of the problem statement; hence, the selection of an appropriate model with good computational cost is preferred.

This study aims to compare various imputation techniques between conventional and supervised ML approaches and evaluate how the selection of different imputation methods affects reconstructed data. The proposed method, which estimates incomplete data through a supervised ML approach, can adapt new data based on existing patterns. To the best of the authors' knowledge, this is the first study to investigate ground electromagnetism data imputation of missing data for space weather applications in Malaysia. The study of data missingness involves random data points missing in a given interval range. The outcome of data imputation should provide reasonable data prediction and preserve the important association between ground electromagnetism data in the datasets.

The remainder of this paper is organized as follows: The preliminary knowledge on the theoretical methodology proposed and the comparisons between methods of supervised ML are presented in Section 2. The evaluation of the prediction performance of the model and the discussions are presented in Section 3, and finally, the conclusions with future works are presented in Section 4.

2. Methodology

The methodology of the data imputation is described in Fig. 1. The studies are performed with different models of supervised ML, including ANN, K-NN, SVR and GRNN, against standard imputation techniques (zero value substitution, listwise deletion, mean substitution, and the hotdeck approach), which were developed using the MATLAB platform. Imputation using model-based ML is developed as a prediction model for estimating the required missing values in gap interval measurements. Using supervised ML, these datasets were retrieved from the cloud and divided into training, validation, and testing datasets at a ratio of 70:15:15. The imputed dataset was benchmarked with the original dataset in which the data gap was set as the output attribute. Experiments to simulate the percentage value of missing data were explored prior to evaluation with a supervised learning model. In addition, performance measurements were developed on the basis of mean square error (MSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and computational time.

Fig. 1 Methodology framework for overall data imputation process.

2.1. Access and availability of data

The datasets used contain sampled data of ground electromagnetism located at the observatory station of the equatorial region in Johor, Malaysia. Measurements using a magnetometer device as the sensing element acquire the daily variation of the geomagnetic response, which represents the ground activities because of the reflections of the space weather activities from the Sun–Earth interactions. Four datasets are available online at http://magdas2.serc.kyushu-u.ac.jp/realtime/index.html, which comprise the horizontal (H) component, declination (D) component, vertical downwards (Z) component, and total magnetic field intensity (F), as depicted in Fig. 2; these are considered important parameters worthy of investigation. Such data are stored in the data logging card as primary storage and concurrently displayed on the website to demonstrate the current status of geomagnetic activities. Despite freely available data resources, data losses occurred because of connectivity issues and inconsistency of data supply, which further disturbed the full operation of presenting the current geomagnetic measurement. These datasets were collected within one (1) month, spanning from June 1, 2017 to June 30, 2017 at 1-s intervals daily. There are approximately 52,865 original data points per day with a total of 1,585,950 data points, captured from 8:11:54 am to 10:52:58 pm. The evaluation for validation purposes with a new dataset of 52,895 points retrieved on July 17, 2017 was conducted to observe the model performance.

Fig. 2 Sample datasets of ground electromagnetism, separated into three sections of data, acquired from the website.

2.2. Brief overview of ML versus conventional imputation approach

2.2.1. Artificial neural network (ANN)

In previous studies, ANNs demonstrated successful applications resembling human brain function to perform tasks such as prediction, estimation, and classification. The network structure is composed of three layers: input, hidden, and output layer, each possessing nodes or neurons with interconnected weights among the nodes. The input layer contains ground electromagnetism data that are transferred to the hidden layer and finally directed to the output layer. The interconnected nodes are randomly assigned weights and biases, and these are continuously updated by the backpropagation algorithm in what is known as a multilayer perceptron. Datasets were partitioned into a ratio of 70% training, 15% testing, and 15% validation. Subsequent to model development, the ground electromagnetism data were examined with new testing data retrieved on July 17, 2017. Meanwhile, the number of hidden neurons varied between 3 and 12 neurons in the hidden layer. A linear transfer function was selected as the transfer function to obtain the output.

2.2.2. K-Nearest Neighbour (K-NN)

Among the mentioned supervised ML models, K-NN has been widely selected and implemented in data imputation applications because of its capability of preserving the value of the missing data with the value of related cases (K-similarity of features) from the entire dataset. Missing data is calculated by defining a variable of K nearest neighbours and then averaging the non-missing values of its neighbours. The K value is selected heuristically based on the Euclidean distance, $D(a,b)$ (referring to Eq. (1)), calculated as the square root of the sum of the squared differences between the estimated new value ($\hat{y}_t$) and the original value ($y_t$). Depending on the dataset size and percentage of missing values, the imputation process requires good tuning to avoid susceptibility to overfitting and sensitive data points [20].

$$D(a,b) = \sqrt{\sum_{t=1}^{n} \left(\hat{y}_t - y_t\right)^2} \qquad (1)$$

2.2.3. Support vector regression (SVR)

Another supervised ML model that has recently received considerable attention is SVR (also known as support vector machine (SVM)) [30], which imputes the missing data by constructing the nth hyperplane to separate the data. The objective function is introduced in Eq. (2) to minimise the loss function while maximizing the distance to the nearest data points. Thus, linearly or nonlinearly separable data were handled by applying linear kernels or SVR's kernel tricks, which comprise polynomial and Gaussian types, as demonstrated in Eq. (3), where $m$ is the number of data points, $\omega$ is the weight coefficient, $C$ is the tuning parameter constant, $\xi$ is the slack variable, $b$ is the constant coefficient, $x_i$ and $x_j$ are the input data, $K$ is the kernel function, and $\Phi$ is the transformation function. The imputed missing value or predicted input value can be represented as the prediction function $f(x)$ shown in Eq. (4).

$$\min \; \frac{1}{2}\lVert \omega \rVert^2 + C \sum_{i=1}^{m} \left(\xi_i + \xi_i^{*}\right) \qquad (2)$$

$$K(x_i, x_j) = \Phi^{T}(x_i)\,\Phi(x_j) \qquad (3)$$

$$f(x) = \langle \omega, \Phi(x) \rangle + b \qquad (4)$$

2.2.4. General regression neural network (GRNN)

The GRNN implements a non-linear regression with a one-pass learning algorithm that can be used to impute missing data. Its architecture is similar to that of an ANN, consisting of an input layer, a radial layer with a Gaussian feature, and an output layer, but it needs less training data than neural networks. One of the interesting features to investigate is GRNN's ability to generate consistent prediction results [31] and to achieve an accurate prediction result in a short period of time [32]. Based on a probability density function, this model employs a functional relationship between dependent and target data. The efficiency of missing data imputation can thus be calculated by varying the smoothing factor, $\sigma$. The GRNN model's imputed missing data sequence follows Eqs. (5)–(7), which begins with the search for Euclidean distances, $D_i$, of the input vector, then moves to Gaussian distances, $G_i$, and finally evaluates the target data, $Y$.

$$D_i = \sqrt{\sum_{j=1}^{n} \left(y_{ij} - y_j\right)^2} \qquad (5)$$

$$G_i = e^{-\left(\frac{(D_i)^2}{\sigma^2}\right)} \qquad (6)$$

$$Y = \frac{\sum_{i=1}^{m} \left(D_i \times Y_i\right)}{\sum_{i=1}^{m} D_i} \qquad (7)$$

2.2.5. Conventional imputation model for predicting missing data gap

The most commonly executed approach to handle missing data is either eliminating the data or substituting with the mean value of the overall data. Although these issues were prolonged, this technique has become preferred when dealing with missing values in the dataset considered for the analysis. Therefore, statistical means were commonly adopted using listwise deletion, also known as complete case analysis, which disregards an empty dataset. The advantages of this technique include comparability across the analyses and the fact that it does not introduce bias to the parameter estimates; however, substantial loss of critical information occurs when large samples are excluded [33]. Another technique is imputing the missing data with a mean substitution or zero value substitution approach for the missing instances. For the case of mean and zero value substitution, the missing data in a variable are replaced with the mean of that variable or with zero; however, the zero value is unsuitable for a large number of missing data points, which leads to changes in data distribution [4]. For the case of hot deck imputation, the data gap is replaced by "similar" non-missing values of neighbours, which resembles the K-NN approach with k = 1. This approach does not rely on model fitting for the missing value that is to be imputed and is thus potentially less susceptible to missing data [34].

2.3. Performance model evaluation

Parameters such as MSE, MAE, and MAPE can be used to investigate the estimation accuracy and how the missing rate affects the performance [35] of the estimated missing data, thereby indicating the generalisation capability of the supervised ML. The MSE in Eq. (8) demonstrates the error distribution measured in terms of the mean squared deviation. The MAE in Eq. (10) evaluates the closeness of the absolute value between the imputed and actual data points, where a value approaching zero is the most ideal case. The model with the lowest MAE among the different models shows the best overall results; a low MAE is highly desirable because it indicates that the imputed data are as close as possible to the actual data points. MAPE in Eq. (9) evaluates the prediction accuracy of the imputation method, and it can be used to identify the loss function for the supervised ML model. The following notation is used throughout the analysis: $\hat{y}_t$ is the imputed value, $y_t$ is the actual value, and $n$ is the number of missing data points.

$$\mathrm{MSE} = \frac{\sum_{t=1}^{n} \left(\hat{y}_t - y_t\right)^2}{n} \qquad (8)$$

$$\mathrm{MAPE} = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{\hat{y}_t - y_t}{y_t} \right| \qquad (9)$$

$$\mathrm{MAE} = \frac{\sum_{t=1}^{n} \left| \hat{y}_t - y_t \right|}{n} \qquad (10)$$

3. Results and discussion

The estimation technique to accommodate missing data depends on the pattern and the extent of data severity missing (from little to moderate to large amounts of data affected). Experiments on several benchmarks of ground electromagnetism datasets simulated at 30% of missing values artificially articulate the effect of missing patterns in the training and testing datasets. The imputation was selected for the most occurrence of missingness, which is commonly experienced during

disconnection of data communication at night. The assessment of the best architecture to produce the most optimized imputation of data is conducted by trial and error, adjusting the parameters of the supervised ML model with the tuning values summarized in Table 1.

Table 1 Information on hyperparameter tuning used in the imputation prediction.

ML model algorithm   Hyperparameter    Tuning values
ANN                  Train function    {trainlm, trainbr, trainscg}
                     Hidden neuron     {3, 6, 9, 12}
K-NN                 K                 {5, 10, 50, 100, 150, 200, 300, 500}
                     d                 {euclidean, minkowski, correlation}
SVR                  ε                 {0.01, 0.1, 1}
                     C                 {0.01, 0.1, 1, 10, 100}
                     γ                 {0.001, 0.01, 0.1, 1}
                     kernel            {gaussian, polynomial, linear}
GRNN                 σ                 {0.5, 0.7, 1, 1.5, 2, 2.5, 3, 3.5, 6, 7, 10}

The best imputation output can be achieved by manipulating the associated parameters related to the supervised ML or the feed-forward architecture in ANN. A simpler structure was favored with a reduced number of hidden units, and a decreased computational processing time had the added advantage of being selected as an appropriate imputation model. Simulations to train several feed-forward neural networks using one hidden layer with different numbers of hidden neurons (from 3 to 12) and different activation functions for the hidden layers with the scaled conjugate gradient backpropagation network (trainscg), Levenberg–Marquardt (trainlm), and Bayesian regularisation backpropagation (trainbr) were used for the training function. The neural network has one hidden node with a linear activation function for the output layer. To avoid overfitting, the training stops if the number of iterations reaches 1000 epochs or if the magnitude of the gradient is less than $1 \times 10^{-7}$. Meanwhile, for K-NN, the tuning parameter for imputing the missing values varies from 5 to 500 nearest neighbours. As K-NN evaluates missing values based on their neighbours, it may be prone to overfitting and noise sensitivity when K is too small or covers a large number of data points far from the neighbour, while the imputed data may appear bias-prone when it covers more of the instance space. Because this imputation technique involves scanning the entire dataset to find the K nearest neighbours, it can be costly and suffer from poor performance, particularly for a large dataset. With SVR, the tuning process involves the ε parameter, the C parameter, and the gamma values γ; different kernel tricks were implemented to demonstrate the influence on performance when imputing the gap of missing data. Iterations between parameters were varied from 0.001 to 1 to obtain the best performance with the lowest performance indicators. Among all model structures, GRNN is the simplest, requiring tuning of a single parameter known as the smoothing factor, σ. This parameter was varied from 0.5 to 10 in order to achieve the best imputation technique. The simulated output was compared with that of actual datasets, and the predicted missing data are illustrated in Fig. 3.

The performance of the conventional and supervised ML models was quantitatively assessed and benchmarked against the complete datasets using MSE, MAE, and MAPE in Table 2 and Fig. 4. Among conventional approaches such as zero value substitution, mean imputation, and hotdeck imputation, the hotdeck represents the best predicted data imputation throughout all components as it follows pattern variation to match the actual datasets. Although the output representation

Fig. 3 Comparison between predicted and actual missing data of (a) H components, (b) D components, (c) Z components and (d) F components for ground electromagnetism data.
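
The kind of experiment compared in Fig. 3 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' MATLAB implementation: gaps are simulated completely at random, and neighbours are ranked by distance in the time index rather than by a distance metric over features. The function names `simulate_mcar` and `knn_impute` are hypothetical.

```python
import random


def simulate_mcar(series, rate, seed=0):
    """Blank out a fraction `rate` of points completely at random (assumed MCAR)."""
    rng = random.Random(seed)
    n_missing = round(rate * len(series))
    gaps = set(rng.sample(range(len(series)), n_missing))
    return [None if i in gaps else v for i, v in enumerate(series)]


def knn_impute(series, k):
    """Fill each gap with the mean of the k observed points nearest in time.

    Assumption: neighbours are chosen by distance in the sample index;
    ties are broken in favour of the earlier sample.
    """
    observed = [(i, v) for i, v in enumerate(series) if v is not None]
    if not observed:
        raise ValueError("series has no observed values to impute from")
    filled = list(series)
    for i, v in enumerate(series):
        if v is None:
            nearest = sorted(observed, key=lambda iv: (abs(iv[0] - i), iv[0]))[:k]
            filled[i] = sum(val for _, val in nearest) / len(nearest)
    return filled


# Illustrative data only: a short ramp with 30% of its points removed.
truth = [float(v) for v in range(10)]
masked = simulate_mcar(truth, rate=0.3, seed=1)
reconstructed = knn_impute(masked, k=2)
```

With `k=1` the same routine reduces to a nearest-neighbour fill, which is the behaviour the text attributes to hot deck imputation.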

Table 2 Performance of the statistical models for conventional missing data imputation in terms of mean square error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) for the H, D, Z, and F components of the ground electromagnetism data.

Component  Missing (%)  Zero value                  Listwise  Mean substitution          Hotdeck
                        MSE       MAE       MAPE    deletion  MSE       MAE      MAPE    MSE     MAE    MAPE
H          10           1.66E+09  4.08E+04  0.9998  –         18.3711   3.0295   0.0001  2.64    1.22   0.00
D          10           5.3695    2.0288    0.9998  –         62.0735   7.5411   0.9474  44.43   6.32   1.19
Z          10           9.51E+07  9.75E+03  0.9998  –         3.5762    1.8235   0.0002  367.66  19.11  0.00
F          10           1.76E+09  4.19E+04  0.9998  –         19.6161   3.347    0.0001  30.57   5.42   0.00
H          30           1.66E+09  4.08E+04  0.9999  –         83.5280   8.2768   0.0002  4.73    1.85   0.00
D          30           19.9695   3.8384    0.9999  –         40.1827   5.5776   0.6282  36.01   5.69   0.76
Z          30           9.51E+07  9.75E+03  0.9999  –         3.2679    1.7373   0.0002  332.92  18.21  0.00
F          30           1.76E+09  4.19E+04  0.9999  –         85.1576   8.4539   0.0002  36.80   5.97   0.00
H          50           1.66E+09  4.08E+04  1.0000  –         112.6472  10.0769  0.0002  5.09    1.98   0.00
D          50           49.6126   6.1177    1.0000  –         23.9827   3.8353   0.4451  23.31   4.12   0.52
Z          50           9.51E+07  9.75E+03  1.0000  –         3.0541    1.5953   0.0002  304.17  17.38  0.00
F          50           1.76E+09  4.19E+04  1.0000  –         113.6432  10.1571  0.0002  36.12   5.92   0.00
H          60           1.66E+09  4.08E+04  1.0000  –         138.6666  11.2841  0.0003  4.55    1.82   0.00
D          60           64.1393   7.0431    1.0000  –         19.4077   3.5461   0.4516  19.66   3.58   0.45
Z          60           9.51E+07  9.75E+03  1.0000  –         2.2585    1.3795   0.0001  285.89  16.80  0.00
F          60           1.76E+09  4.19E+04  1.0000  –         135.6664  11.1509  0.0003  33.09   5.62   0.00
H          70           1.66E+09  4.08E+04  1.0000  –         143.9195  11.2039  0.0003  6.44    2.08   0.00
D          70           73.0635   7.6413    1.0000  –         18.6062   3.8095   0.5663  16.92   3.15   0.39
Z          70           9.51E+07  9.75E+03  1.0000  –         2.9693    1.3604   0.0001  269.26  16.26  0.00
F          70           1.76E+09  4.19E+04  1.0000  –         135.6148  10.7864  0.0003  29.08   5.09   0.00
H          80           1.66E+09  4.08E+04  1.0000  –         112.4498  9.4319   0.0002  15.98   2.95   0.00
D          80           80.1732   8.1072    1.0000  –         29.9158   4.8573   1.1023  14.97   2.88   0.35
Z          80           9.51E+07  9.75E+03  1.0000  –         9.1377    2.4629   0.0003  257.33  15.88  0.00
F          80           1.76E+09  4.19E+04  1.0000  –         100.4164  8.9093   0.0002  29.66   5.16   0.00

Fig. 4 From top left: performance of the supervised machine learning model in terms of (a) mean square error, (b) mean absolute error, and (c) mean absolute percentage error for the H, D, Z, and F components.
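
The error measures of Eqs. (8)–(10) and the conventional baselines benchmarked in Table 2 can be sketched in Python as follows. This is a hedged illustration rather than the authors' MATLAB code: the helper names are hypothetical, the toy values are invented for the example, errors are scored only at the gap positions, and the hot deck donor is assumed to be the nearest observed sample in time.

```python
from statistics import mean


def mse(imputed, actual):
    """Mean square error, Eq. (8)."""
    return sum((p - a) ** 2 for p, a in zip(imputed, actual)) / len(actual)


def mae(imputed, actual):
    """Mean absolute error, Eq. (10)."""
    return sum(abs(p - a) for p, a in zip(imputed, actual)) / len(actual)


def mape(imputed, actual):
    """Mean absolute percentage error, Eq. (9), as a fraction (not multiplied by 100).

    Undefined when an actual value is zero.
    """
    return sum(abs((p - a) / a) for p, a in zip(imputed, actual)) / len(actual)


def zero_substitution(series):
    return [0.0 if v is None else v for v in series]


def mean_substitution(series):
    fill = mean(v for v in series if v is not None)
    return [fill if v is None else v for v in series]


def hot_deck(series):
    """Assumed donor rule: copy the nearest observed sample in time (earlier one on ties)."""
    donors = [i for i, v in enumerate(series) if v is not None]
    filled = list(series)
    for i, v in enumerate(series):
        if v is None:
            j = min(donors, key=lambda d: (abs(d - i), d))
            filled[i] = series[j]
    return filled


def score(method, series, truth):
    """Return (MSE, MAE, MAPE) of `method`, evaluated at the gap positions only."""
    filled = method(series)
    gaps = [i for i, v in enumerate(series) if v is None]
    imputed = [filled[i] for i in gaps]
    actual = [truth[i] for i in gaps]
    return mse(imputed, actual), mae(imputed, actual), mape(imputed, actual)


# Invented toy values for illustration; None marks a missing sample.
truth = [100.0, 102.0, 104.0, 106.0, 108.0, 110.0]
series = [100.0, None, 104.0, None, 108.0, 110.0]
for method in (zero_substitution, mean_substitution, hot_deck):
    print(method.__name__, score(method, series, truth))
```

Under this definition, zero substitution gives a MAPE of exactly 1 at every gap whose actual value is non-zero, since |0 − y| / |y| = 1, which is consistent with the zero-value MAPE entries close to 1 in Table 2.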

resulted in an average gap of 7.93 MAE, 102.62 MSE, and 0.19 MAPE between the predicted and actual patterns; all components loosely followed the actual output, with distorted magnitudes, although the Z component showed very large discrepancies of 18.21 MAE and 332.92 MSE. The remaining two conventional approaches, zero value substitution and mean imputation, did not produce predicted outputs resembling the actual data. For 30% missing data, the ANN with its best-tuned parameters (9 hidden neurons trained with the Levenberg–Marquardt function) closely followed the actual H, D, and F components but not the Z component, for which the predicted output was inaccurate at 11.028 MSE; however, the lowest MAPE of 0.35 was observed for the D component compared to the other components.

Nevertheless, the pattern showed that the ANN was capable of simulating the output of the H, D, and F components with a promising error of less than 1; however, it failed to impute the gap for the Z component. The case is similar for GRNN, which was capable of following the pattern of the H, D, and F components but produced an error greater than 1. The SVR produced the best simulated output for the H, D, Z, and F components and imputed the missing data accurately, at an average over all components of 0.314 MSE, 0.738 MAPE, and close to 0.510 MAE. The K-NN approach produced the worst imputation outcome for all components except H, which tracked the actual data. Regardless of the number of neighbours k, the imputation was imperfect for every distance used, namely Euclidean, Minkowski, and correlation. A greater number of neighbours significantly affected the error between the predicted and actual data for the H component, which reached approximately 53.03 MSE at k = 500 neighbours with the correlation distance. The remaining components did not experience major variations in MSE and MAE when k was increased further; however, the computational time increased.

In terms of computational time, Table 2 indicates that the tuning of the ε, C, and γ parameters influences the execution time of the predicted gap for missing data. Among the tuned γ values, the lowest, γ = 0.001, produced the predicted output in the least time for the Gaussian kernel function, followed by the highest C setting, C = 100, which gave the least testing time. Thus, the Gaussian function overall took about 0.02 s; however, a trade-off exists, as it caused the highest MSE, 136.09, for the Z component. For the linear kernel function, increasing both the C and γ parameters could reduce the computational time while keeping the MSE low. The best compromise between the lowest error and the predicted output for all components was achieved with a low ε = 0.01. Therefore, the best-tuned SVR uses a linear kernel function with ε = 0.01, C = 0.1, and γ = 1. The polynomial and Gaussian functions resulted in similar computational times; however, the quantitative assessment via MSE and MAE showed deviating and fluctuating errors, with one or all of the H, D, Z, and F components reaching undesirably high values. To examine the effect of the missing data percentage on the quantitative performance measures, missing rates above and below the 30% rate were simulated; the results are depicted in Fig. 4.

The increase in missing data up to 80% affected the predicted imputation across all magnetic data components, with performance either increasing or decreasing depending on the imputation model. Mean substitution showed a decreasing MSE for the D and Z component data as the missing rate was increased from 10% to 70%; the MSE then increased when the missing data reached 80%. Overall, the hotdeck and supervised ML models showed a decreasing MSE for the D, Z, and F components from 30% to 80% missing data. Except for SVR, performance fluctuated, either increasing or decreasing, as the missing rate continued to increase. In terms of MAE, the H, D, Z, and F components for all imputation models behaved inconsistently as the rate of missing data increased, with the MAE either decreasing or increasing. The MAPE for all imputation models was approximately zero, except for the H component, for which it increased slightly; however, the performance remained acceptable. As a result, the contrast between the ML models and the traditional methods revealed that missing values in a dataset can be replicated, which in turn affects the performance assessments. In our case, which involves a large number of data points, the variables found to affect this error evaluation include parameter tuning and the percentage of missing values. Furthermore, quantification of the execution time on the testing data, shown in Fig. 5, revealed that the linear-based SVR model takes more time during the training phase to fill the missing data; nevertheless, it executes in less than a second during the testing phase.

Our findings show that there is a relationship between the distribution of the data pattern and the output of the imputation model. Consequently, common dataset patterns benefit the most from different imputation techniques if different missing rates and related parameters are considered. The results show that SVR was the best-performing approach among those tested at every missing-data percentage considered. In addition, SVR does not seem to be influenced by the distribution of data trends, with good performance measures for all distributions. However, further analysis is required to investigate and understand the generalisation between missing data performance and its missingness mechanism. Among the three schemes mentioned above, ground electromagnetism data of the H, D, and Z components tend to exhibit MCAR missingness because of the sensing mechanism and system failure [36]. Meanwhile, the total magnetic field intensity (F) exhibits the MAR type of missingness mechanism. The findings revealed that no pattern could generalise the missingness correlation with either MCAR or MNAR. Visual assessment of all components of the datasets revealed that when both the H and D components were predicted well, the pattern also matched when the F component was imputed. This occurred perfectly for the ANN and with slight accuracy for the hotdeck and k-NN imputation models. Even though a certain model affected the F component, the trends in matching the actual dataset revealed that both the H and D components correlate with the F component.
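The missing-rate experiment discussed in this section can be illustrated with a short sketch. It is not the authors' pipeline: it assumes values go missing completely at random, uses a synthetic series in place of the MAGDAS-9 data, expresses MAPE in percent, and the function names (`mask_mcar`, `mean_substitute`, `hot_deck`, `scores`) are hypothetical. It hides a chosen fraction of values, fills them by mean substitution or by a random-donor hot deck, and scores only the hidden positions with MSE, MAE, and MAPE.

```python
import math
import random

def mask_mcar(values, rate, rng):
    """Pick the indices to hide, completely at random, at the given rate."""
    n_missing = round(len(values) * rate)
    return sorted(rng.sample(range(len(values)), n_missing))

def _observed(values, missing):
    """Values that remain visible after hiding the `missing` indices."""
    hidden = set(missing)
    return [v for i, v in enumerate(values) if i not in hidden]

def mean_substitute(values, missing):
    """Fill every hidden position with the mean of the observed values."""
    observed = _observed(values, missing)
    mean = sum(observed) / len(observed)
    return {i: mean for i in missing}

def hot_deck(values, missing, rng):
    """Fill every hidden position with a randomly drawn observed donor value."""
    observed = _observed(values, missing)
    return {i: rng.choice(observed) for i in missing}

def scores(actual, imputed):
    """MSE, MAE, and MAPE (percent) over the imputed positions only.

    Assumes no actual value is zero, otherwise MAPE is undefined.
    """
    pairs = [(actual[i], v) for i, v in imputed.items()]
    n = len(pairs)
    mse = sum((p - a) ** 2 for a, p in pairs) / n
    mae = sum(abs(p - a) for a, p in pairs) / n
    mape = 100.0 * sum(abs((p - a) / a) for a, p in pairs) / n
    return mse, mae, mape

if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for one magnetic component (illustrative values only).
    series = [40000.0 + 50.0 * math.sin(2 * math.pi * t / 1440) for t in range(2880)]
    for rate in (0.1, 0.3, 0.5, 0.8):
        missing = mask_mcar(series, rate, rng)
        candidates = (
            ("mean", mean_substitute(series, missing)),
            ("hot deck", hot_deck(series, missing, rng)),
        )
        for name, filled in candidates:
            mse, mae, mape = scores(series, filled)
            print(f"{rate:.0%} {name:8s} MSE={mse:.2f} MAE={mae:.2f} MAPE={mape:.4f}%")
```

Scoring only the hidden positions keeps the metrics comparable across missing rates, since the observed values would otherwise contribute zero error and dilute the averages.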

Fig. 5 Training time and testing time of the models with (a) Gaussian kernel; (b) polynomial kernel; and (c) linear kernel.
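The neighbour-based imputation compared in this section, and the testing time it incurs, can be illustrated with a small sketch. This is not the authors' configuration: it assumes the missing component is predicted from other, observed features of the same sample, uses the Minkowski distance (p = 2 gives the Euclidean case), and averages the targets of the k nearest complete rows. The names `minkowski` and `knn_impute` and the synthetic data are hypothetical.

```python
import time

def minkowski(a, b, p=2):
    """Minkowski distance between two equal-length vectors (p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_impute(train_x, train_y, query_x, k=3, p=2):
    """Estimate a missing target as the mean target of the k nearest training rows."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training rows")
    # Rank every complete row by its distance to the query row.
    ranked = sorted(range(len(train_x)),
                    key=lambda i: minkowski(train_x[i], query_x, p))
    nearest = ranked[:k]
    return sum(train_y[i] for i in nearest) / k

if __name__ == "__main__":
    # Synthetic rows: features stand in for observed components,
    # the target stands in for the component with a gap.
    train_x = [[float(i), float(i % 7)] for i in range(2000)]
    train_y = [2.0 * x[0] + x[1] for x in train_x]
    for k in (5, 50, 500):
        start = time.perf_counter()
        value = knn_impute(train_x, train_y, [1000.2, 3.0], k=k)
        elapsed_ms = (time.perf_counter() - start) * 1e3
        print(f"k={k:3d} imputed={value:.2f} testing time={elapsed_ms:.2f} ms")
```

A larger k averages over more distant rows, which is one way the error can grow with k as reported for the H component; the measured times depend on the machine and are shown only to indicate how testing time can be recorded.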

4. Conclusion and future work

In conclusion, we have demonstrated supervised ML for data imputation of ground electromagnetism in space weather applications. Using various strategies to reconstruct data gaps and obtain complete data enables further space weather event processing and subsequent analysis. The predicted tracking capability and the performance metrics (MSE, MAE, MAPE, and computational time) were observed to vary with the dataset pattern when different imputation models and hyperparameter settings were used. Different patterns and a large dataset call for different use of the imputation models; overall, for the MAGDAS-9 ground electromagnetism data, the data gaps could be imputed well with the linear-based SVR technique at the optimal hyperparameter setting of ε = 0.01, C = 0.1, and γ = 1, with low values of MSE, MAE, MAPE, and computational time at 0.00089, 0.01697, and 0.738, respectively, on average for all H, D, Z, and F components of the magnetic data and for up to 80% missingness. The suitability of a data imputation method can be concluded to depend on the data pattern, missingness mechanism, data type, and missingness percentage, all of which influenced the performance evaluation. Further visual assessment confirmed the findings and the models' ability to follow the data pattern. Therefore, given the success of this approach, future work will focus on developing an ensemble prediction method, such as ensemble SVR [37] or GRNN-SGTM [38], with optimised hyperparameters to obtain high-performance results while reducing computational time. This should allow further optimised outcomes that improve imputation model performance.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors acknowledge the grant support obtained from the Ministry of Higher Education (MoHE), Malaysia, and the Faculty of Electrical Engineering, Universiti Teknologi MARA under the Fundamental Research Grant Scheme, FRGS (Grant No. 600-IRMI / FRGS 5/3 (091/2019)). The work was also supported by JSPS KAKENHI Grant Number JP20H01961. The authors also acknowledge the collaboration with the Malaysian Space Agency (MYSA) and the International Center for Space Weather, Japan.

References

[1] E. Camporeale, The challenge of machine learning in space weather: nowcasting and forecasting, Sp. Weather. 17 (8) (2019) 1166–1207, https://doi.org/10.1029/2018SW002061.
[2] W.N.I. Ismail, N.S.A. Hamid, M. Abdullah, N.H.M. Shukur, A. Yoshikawa, Variation of equatorial electrojet current profiles over Solar Phases, ASM Sci. J. 12 (Special Issue 2) (2019) 125–133, https://kyushu-u.pure.elsevier.com/en/publications/variation-of-equatorial-electrojet-current-profiles-over-solar-ph.
[3] R. Tkachenko, O. Mishchuk, I. Izonin, N. Kryvinska, R. Stoliarchuk, A non-iterative neural-like framework for missing data imputation, Procedia Comput. Sci. 155 (2019) 319–326, https://doi.org/10.1016/j.procs.2019.08.046.
[4] A. Jadhav, D. Pramod, K. Ramanathan, Comparison of performance of data imputation methods for numeric dataset, Appl. Artif. Intell. 33 (10) (2019) 913–933, https://doi.org/10.1080/08839514.2019.1637138.
[5] S. Kang, E. Kim, J. Shim, W. Chang, S. Cho, Product failure prediction with missing data, Int. J. Prod. Res. 56 (14) (2018) 4849–4859, https://doi.org/10.1080/00207543.2017.1407883.
[6] D. Kline, R. Andridge, E. Kaizar, Comparing multiple imputation methods for systematically missing subject-level data, Res. Synth. Methods. 8 (2) (2017) 136–148.
[7] E. Afrifa-Yamoah, U.A. Mueller, S.M. Taylor, A.J. Fisher, Missing data imputation of high-resolution temporal climate time series data, Meteorol. Appl. 27 (1) (2020), https://doi.org/10.1002/met.1873.
[8] R. Ramezani, M. Maadi, S.M. Khatami, A novel hybrid intelligent system with missing value imputation for diabetes diagnosis, Alexandria Eng. J. 57 (3) (2018) 1883–1891, https://doi.org/10.1016/j.aej.2017.03.043.
[9] M. Martinez-Luengo, M. Shafiee, A. Kolios, Data management for structural integrity assessment of offshore wind turbine support structures: Data cleansing and missing data imputation, Ocean Eng. 173 (2019) 867–883, https://doi.org/10.1016/j.oceaneng.2019.01.003.
[10] D.A. Williams, B. Nelsen, C. Berrett, G.P. Williams, T.K. Moon, A comparison of data imputation methods using Bayesian compressive sensing and Empirical Mode Decomposition for environmental temperature data, Environ. Model. Softw. 102 (2018) 172–184, https://doi.org/10.1016/j.envsoft.2018.01.012.
[11] C. Ye, H. Wang, W. Lu, J. Li, Effective Bayesian-network-based missing value imputation enhanced by crowdsourcing, Knowledge-Based Syst. 190 (2020) 105199, https://doi.org/10.1016/j.knosys.2019.105199.
[12] M.S. Osman, A.M. Abu-Mahfouz, P.R. Page, A survey on data imputation techniques: water distribution system as a use case, IEEE Access 6 (2018) 63279–63291, https://doi.org/10.1109/ACCESS.2018.2877269.
[13] R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data, John Wiley & Sons, 2019.
[14] C. Crespo Turrado, F. Sánchez Lasheras, J.L. Calvo-Rollé, A.J. Piñón-Pazos, M.G. Melero, F.J. de Cos Juez, Hybrid algorithm for missing data imputation and its application to electrical data loggers, Sensors (Switzerland) 16 (9) (2016) 1467, https://doi.org/10.3390/s16091467.
[15] Z. Guo, Y. Wan, H. Ye, A data imputation method for multivariate time series based on generative adversarial network, Neurocomputing. 360 (2019) 185–197, https://doi.org/10.1016/j.neucom.2019.06.007.
[16] Y. Zhuang, R. Ke, Y. Wang, Innovative method for traffic data imputation based on convolutional neural network, IET Intell. Transp. Syst. 13 (4) (2019) 605–613, https://doi.org/10.1049/iet-its.2018.5114.
[17] J. Poulos, R. Valle, Missing data imputation for supervised learning, Appl. Artif. Intell. 32 (2) (2018) 186–196, https://doi.org/10.1080/08839514.2018.1448143.
[18] E.L. Silva-Ramírez, R. Pino-Mejías, M. López-Coello, Single imputation with multilayer perceptron and multiple imputation combining multilayer perceptron and k-nearest neighbours for monotone patterns, Appl. Soft Comput. J. 29 (2015) 65–74, https://doi.org/10.1016/j.asoc.2014.09.052.
[19] Y. Elfahham, Estimation and prediction of construction cost index using neural networks, time series, and regression, Alexandria Eng. J. 58 (2) (2019) 499–506, https://doi.org/10.1016/j.aej.2019.05.002.
[20] G.E.A.P.A. Batista, M.C. Monard, An analysis of four missing data treatment methods for supervised learning, Appl. Artif. Intell. 17 (5-6) (2003) 519–533, https://doi.org/10.1080/713827181.
[21] T. Fusco, Y. Bi, H. Wang, F. Browne, Data mining and machine learning approaches for prediction modelling of schistosomiasis disease vectors, Int. J. Mach. Learn. Cybern. 11 (6) (2020) 1159–1178, https://doi.org/10.1007/s13042-019-01029-x.
[22] M. Suresh, R. Taib, Y. Zhao, W. Jin, Sharpening the BLADE: Missing data imputation using supervised machine learning, in:

Australas. Jt. Conf. Artif. Intell, 11919, Springer, pp. 215–227.
[23] E.L. Silva-Ramírez, M. López-Coello, R. Pino-Mejías, An application sample of machine learning tools, such as SVM and ANN, for data editing and imputation, Stud. in Fuzziness and Soft Comput. (2018) 259–298, https://doi.org/10.1007/978-3-319-62359-7_13.
[24] M. Kokla, J. Virtanen, M. Kolehmainen, J. Paananen, K. Hanhineva, Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: A comparative study, BMC Bioinformatics 20 (1) (2019) 1–11, https://doi.org/10.1186/s12859-019-3110-0.
[25] S.J. Choudhury, N.R. Pal, Imputation of missing data with neural networks for classification, Knowledge-Based Syst. 182 (2019) 104838, https://doi.org/10.1016/j.knosys.2019.07.009.
[26] J. Ke, S. Zhang, H. Yang, X. Chen, PCA-based missing information imputation for real-time crash likelihood prediction under imbalanced data, Transp. A Transp. Sci. 15 (2) (2019) 872–895, https://doi.org/10.1080/23249935.2018.1542414.
[27] W.Y. Lai, K.K. Kuok, A study on bayesian principal component analysis for addressing missing rainfall data, Water Resour. Manag. 33 (8) (2019) 2615–2628, https://doi.org/10.1007/s11269-019-02209-8.
[28] R. Ratolojanahary, R. Houé Ngouna, K. Medjaher, J. Junca-Bourié, F. Dauriac, M. Sebilo, Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset, Expert Syst. Appl. 131 (2019) 299–307, https://doi.org/10.1016/j.eswa.2019.04.049.
[29] W. Van Echelpoel, P.L.M. Goethals, Variable importance for sustaining macrophyte presence via random forests: data imputation and model settings, Sci. Rep. 8 (2018) 14557, https://doi.org/10.1038/s41598-018-32966-2.
[30] S. Li, H. Fang, X. Liu, Parameter optimization of support vector regression based on sine cosine algorithm, Expert Syst. Appl. 91 (2018) 63–77, https://doi.org/10.1016/j.eswa.2017.08.038.
[31] O.A. Alomair, A.A. Garrouch, A general regression neural network model offers reliable prediction of CO2 minimum miscibility pressure, J. Pet. Explor. Prod. Technol. 6 (3) (2016) 351–365, https://doi.org/10.1007/s13202-015-0196-4.
[32] I. Izonin, N. Kryvinska, P. Vitynskyi, R. Tkachenko, K. Zub, GRNN approach towards missing data recovery between IoT systems, in: International Conference on Intelligent Networking and Collaborative Systems, 2019, pp. 445–453.
[33] R. Deb, A.W.C. Liew, Missing value imputation for the analysis of incomplete traffic accident data, Inf. Sci. (Ny) 339 (2016) 274–289, https://doi.org/10.1016/j.ins.2016.01.018.
[34] L.A. Hunt, Missing data imputation and its effect on the accuracy of classification, in: Data Sci., Springer, 2017, pp. 3–14.
[35] X. Yan, A. Mohammadian, Forecasting daily reference evapotranspiration for Canada using the Penman-Monteith model and statistically downscaled global climate model projections, Alexandria Eng. J. 59 (2) (2020) 883–891, https://doi.org/10.1016/j.aej.2020.03.020.
[36] S. Sankaranarayanan, G. Swaminathan, T.K. Radhakrishnan, N. Sivakumaran, Missing data estimation and IoT-based flyby monitoring of a water distribution system: Conceptual and experimental validation, Int. J. Commun. Syst. 22 (2019) e4135, https://doi.org/10.1002/dac.4135.
[37] Q. Shang, Z. Yang, S. Gao, D. Tan, An imputation method for missing traffic data based on FCM optimized by PSO-SVR, J. Adv. Transp. 2018 (2018) 1–21, https://doi.org/10.1155/2018/2935248.
[38] I. Izonin, R. Tkachenko, V. Verhun, K. Zub, An approach towards missing data management using improved GRNN-SGTM ensemble method, Eng. Sci. Technol. an Int. J. 24 (3) (2021) 749–759, https://doi.org/10.1016/j.jestch.2020.10.005.
