
Exploring the effect of normalization on medical data classification
Namrata Singh
Department of Computer Science and Engineering
National Institute of Technology Raipur
Chhattisgarh, India
nsingh.phd2016.cs@nitrr.ac.in

Pradeep Singh
Department of Computer Science and Engineering
National Institute of Technology Raipur
Chhattisgarh, India
psingh.cs@nitrr.ac.in

Abstract—Data normalization is a pre-processing strategy that transforms or scales the data so that each attribute contributes equally. For a given classification problem, the performance of any machine learning approach depends on the quality of the data used to build a generalized classifier. Various studies have shown that data normalization enhances the quality of the data and, in turn, the performance of machine learning techniques. However, there is a dearth of investigations into the effect of data normalization methods on the classification of medical datasets. This study therefore explores the effect of three data normalization techniques, namely min-max, z-score, and Median and Median Absolute Deviation (MMAD), on the performance of four classification algorithms, namely Naïve Bayes, Support Vector Machine with a Radial Basis Function kernel, Random Forest and k-Nearest Neighbour. The experiments, conducted on 20 publicly available medical datasets, use classification accuracy as the performance measure. The best results were obtained with the z-score normalization method combined with the Random Forest classifier.

Keywords—Data normalization, Naïve Bayes, Support Vector Machines, Random Forest, k-Nearest Neighbour, Classification

I. INTRODUCTION

Normalization is a part of data transformation in which data are normalized into a form appropriate for mining [1]. It is a critical step in data preprocessing before modelling with any machine learning (ML) algorithm. Various types of normalization procedures are used to improve the accuracy of medical data classification [2]. Normalization aims to bring the numeric column values in a dataset onto a common scale without losing information or distorting differences in the ranges of values [3].

In medical data classification, models perform poorly when appropriate pre-processing strategies are not applied [4]. These strategies include feature selection [5], elimination of noise and outliers from the data [6], instance filtering [7] and transformation of data, i.e., normalization. Among these, data normalization is an operation on raw data that either rescales or transforms it such that each attribute has a uniform contribution. It handles two basic problems of data mining that hamper the learning process of ML approaches, i.e., outliers and the presence of dominant attributes [8]. Several techniques have been developed for normalizing raw data within a specific range by using statistical procedures [9].

Consider a dataset $D$ with $m$ attributes and $n$ samples represented as $D = \{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^{m}$, $y_i \in \mathcal{Y}$, where $x_i$ denotes the data to be modelled by the ML technique and $y_i$ denotes the respective class label drawn from the label set $\mathcal{Y}$. In this study, three normalization approaches, namely min-max, z-score and MMAD, are used to examine their effect on the classification performance of medical datasets. These approaches are classified according to how statistical characteristics of the raw data are used to normalize it [10].

The significance of normalization in building accurate ML models has been observed in medical data classification for various algorithms such as Naïve Bayes (NB), Support Vector Machines (SVM), Random Forest (RF) and k-Nearest Neighbours (k-NN) [11]. For such techniques, the normalized input variable space can affect the learning speed and lead to a better mining result. Several authors have confirmed that normalization improves the outcome of medical data classification [11].

This paper is structured as follows. Section II covers the related work. Section III presents the details of the methodology. Section IV gives the experimental results and discussion. Finally, Section V concludes the study.

II. RELATED WORK

In medical data mining systems [12], data normalization plays an important role in bringing the various attributes onto a common scale. Researchers have used different normalization approaches to improve classification performance in medical application domains. These methods include min-max normalization, z-score normalization, decimal scaling, and median and median absolute deviation (MMAD). Selecting the normalization technique that yields optimal classification performance on a specific medical problem remains difficult.

Pires et al. [13] analyzed the difference between non-normalized and normalized data and concluded that the best results were obtained with the non-normalized data on the “Heterogeneity Activity Recognition Dataset”. They also concluded that the choice of a particular data normalization method depends on the dataset. Borkin et al. [14] studied the impact of data normalization on the performance of an XGBoost (Extreme Gradient Boosting) classification model for a Parkinson’s disease dataset. The results demonstrated that XGBoost performed better with the raw data than with data normalized using the min-max procedure. Cao et al. [15] presented the Generalized Logistic (GL) algorithm, an efficient data scaling method for classification modeling. Experimental results revealed that algorithms trained on data scaled by the GL algorithm outperformed those using the z-score and min-max algorithms. In addition, the algorithms trained on unscaled data performed poorly.

Mihaela and Ruxandra [16] analyzed the effect of applying various normalization approaches on the k-NN algorithm for the Pima Indian Diabetes dataset. Both min-max and z-score
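
As an illustration of the three normalization approaches named in the Introduction (min-max, z-score and MMAD), the following is a minimal sketch that applies each to a single numeric attribute. It assumes the commonly used textbook definitions: rescaling to a target range for min-max, centring on the mean and dividing by the population standard deviation for z-score, and centring on the median and dividing by the median absolute deviation for MMAD. The exact variants used in the study are not stated on this page, so the function names, the default range [0, 1], the handling of constant attributes and the sample values are all assumptions made for illustration.

```python
from statistics import mean, median, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max] (assumed default: [0, 1])."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: no spread to rescale
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant attribute
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def mmad(values):
    """Centre on the median and divide by the median absolute deviation (MAD)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:  # degenerate case, e.g. when most values equal the median
        return [0.0 for _ in values]
    return [(v - med) / mad for v in values]


# Hypothetical attribute with one extreme value (illustrative numbers only).
attribute = [2.0, 4.0, 6.0, 8.0, 100.0]
for name, fn in (("min-max", min_max), ("z-score", z_score), ("MMAD", mmad)):
    print(name, [round(v, 3) for v in fn(attribute)])
```

On this sample attribute, min-max compresses the four small values into a narrow band near 0 because the single extreme value sets the range. MMAD relies on the median and the MAD, which are less affected by that value, so it keeps the small values evenly spaced.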



To be published in Proceedings of IEEE conference,
Artificial Intelligence & Machine Vision (AIMV-2021)
