


Pre-Processing: A Data Preparation Step


Swarup Roy, Sikkim University, Gangtok, India and North-Eastern Hill University, Shillong, India
Pooja Sharma, Tezpur University, Tezpur, India
Keshab Nath, North-Eastern Hill University, Shillong, India
Dhruba K Bhattacharyya, Tezpur University, Tezpur, India
Jugal K Kalita, University of Colorado, Boulder, CO, United States
© 2019 Elsevier Inc. All rights reserved.

Introduction

High-throughput experimental processes in various data driven computing domains have led to the availability or production of
massive amounts of data. Explosive data growth can definitely be witnessed in biological research, and is due to revolutionary
changes in how high throughput experiments are conducted in biomedical sciences and in biotechnology. A wide spectrum of
biomedical data is generated by experiments in clinical and therapeutic investigations. Omics research, involving high throughput
Next Generation Sequencing (NGS) and Microarray technologies, has been instrumental in generating massive amounts of
mRNA, miRNA and gene expression data, as well as Protein-Protein Interaction data. Rapid and massive data generation sources
lead to subsequent challenges in effective data storage and transmission. Efficient and scalable exploratory data mining techniques
provide an emerging and important set of tools for knowledge discovery from in silico data sources (Roy et al., 2013). A plethora of
data mining and machine learning methods have been proposed, over the last several decades, to explore and discover hidden and
unknown biological facts and relationships among biological entities.
Inference and analysis outcomes produced by knowledge discovery methods are highly dependent on the quality of the input
data. A highly effective method is generally incapable of producing reliable results in the absence of high quality data. Biological
data are generated and distributed across the globe. Heterogeneity in data due to different data acquisition techniques and
standards, with non-uniform devices in geographically distributed research laboratories, makes the task of producing high quality
data a near impossibility in many cases.
In general, real world data are of poor quality and can not be directly input to sophisticated data mining techniques. Such data
are also often incomplete. As a result, it may be difficult to discover hidden characteristics, which may be of interest to the domain
expert, or the data may contain errors, technically referred to as outliers. The interpretation of this type of data may also require
understanding of the background knowledge used when analyzing the data. The same set of data can be represented in multiple
formats and the values may have been normalized to different ranges, depending upon the statistical validity and significance. To
use data mining techniques to mine interesting patterns from such data, they need to be suitably prepared beforehand, using data
pre-processing. Data pre-processing is a sequence of steps comprising Data Cleaning, Data Reduction, Data Transformation and
Data Integration. Each step is equally significant, independent and can be executed in isolation. It is the data and the tasks to be
performed which determine the step(s) to be executed and when. A domain expert's intervention may be necessary to determine
the appropriate steps for a particular dataset. It is not appropriate to use the steps simply as a black-box.
High throughput technologies such as Microarray and Next Generation Sequencing (NGS) typically assess relative expression levels of a
large number of cDNA sequences under various experimental conditions. Such data contain either time series measurements, collected
over a biological process, or comparative measurements of expression variation in target and control tissue samples (e.g., normal versus
cancerous tissues). The relative expression levels are represented as ratios. The original gene expression data derived after scanning the array,
i.e., the fluorescence intensities measured on a microarray, are not ready for data analysis, since they contain noise, missing values, and
variations arising from experimental procedures. Similar situations may arise even in RNASeq data derived from NGS. The in silico analysis
of large scale microarray or RNASeq expression data from such experiments commonly involves a series of preprocessing steps (Fig. 1).
These steps are indispensable, and must be completed before any gene expression analysis can be performed.
Data cleaning in terms of noise elimination, missing value estimation, and background correction, followed by data normalization, is important for high quality data processing. Gene selection filters out genes that do not change significantly in
comparison to untreated samples. In reality, not all genes actively take part in a biological activity, and hence are likely to be
irrelevant for effective analysis. The intention is to identify the smallest possible set of genes that can still achieve good performance for the analysis under consideration (Díaz-Uriarte and De Andrés, 2006). Gene expression levels of thousands of genes are stored in a matrix form, where rows represent the relative expression level of a gene w.r.t. sample or time. Performing a
logarithmic transformation on the expression level, or standardizing each row of the gene expression matrix to have a mean of zero
and a variance of one, are common pre-processing steps.
We now discuss the major data pre-processing steps in detail.

Data Cleaning

Quality of data plays a significant role in determining the quality of the resulting output when using any algorithm. Data cleaning
is one such step in creating quality data (Herbert and Wang, 2007). It usually handles two situations that give rise to inconsistent input, such as missing values and misspellings: the first is the missing value problem, and the second is the data duplication problem. The presence of duplicate data adds to the computation time without adding any extra benefit to the result.

Fig. 1 Commonly used preprocessing steps for gene expression data analysis.

Handling Missing Values


A major issue that arises during pre-processing of data is that of missing values. A missing value occurs when the value of a data example is not stored for a variable or feature of interest. Missing values can lead to serious issues, whether it is poor representation
of the overall dataset in terms of its distribution, or bias in the results obtained (Hyun, 2013). Therefore, they need to be handled
cautiously so as to enhance the performance of the methods. Missing data can occur because of a number of technical reasons.
These may be one of the following:

• Malfunctioning or physical limitations of the measuring equipment may lead to some values not being recorded. Even interruptions in communication media during the transmission of data may lead to missing values.
• A few missing values may not seem to be important during the data collection phase, and corrective measures may not have
been taken, at a later stage, to determine them.
• Some values may have been removed due to inconsistency with other recorded data.
There can be various types of missing data (Allison, 2012). These are broadly classified as follows.
○ Missing Completely at Random (MCAR): When the probability of an observation being missing is related neither to its own value nor to the other observed data.
○ Missing at Random (MAR): When the probability of a missing observation is unrelated to its own value, but depends on other aspects of the observed data.
○ Missing Not at Random (MNAR): If the missing data values do not belong to either of the above two categories, they are said to be MNAR, i.e., the probability of a missing observation is related to its own value.

We explain the above types of missing data values with the help of an example. Suppose that we have two variables a and b,
whose values are represented by vectors, A and B, respectively. Suppose that the individual values in vector B are always recorded,
however, vector A has missing values. A missing value in A is said to be MCAR if the chance of it being missing is independent of
values recorded in both A and B. On the other hand, if the probability of a missing value in A is likely to depend on the values in B,
but is independent of the values of A, it is termed as MAR.
There are various approaches to handling missing data. It is possible to ignore the missing data completely, but if the amount
of missing data is a large proportion of the dataset, this approach may severely affect the quality of the results. Also, if there are
cases of MNAR missing values, they require parameter estimation to model the missing data. In order to deal with missing data,
four broad classes of alternatives are available.

Ignoring and discarding data


Ignoring the missing values in the data is the simplest way to handle this issue. However, this is appropriate only for the MCAR
situation. If we apply this treatment to other situations, we are likely to introduce bias in the data, and ultimately in the end results.
Discarding missing data can be done in two ways: complete case analysis and attribute discarding. Complete case analysis is the de facto
method, and is available in many programs. It discards all instances with missing data, and further analysis is carried out only on
the remaining data.
The second way, attribute discarding, involves determining the extent to which values are missing relative to the number of instances, and the relevance of the attribute to the whole dataset. Based on that, one can discard attributes whose missing values are widespread or whose relevance w.r.t. the whole dataset is not very significant. Note that if an attribute with missing values has very
high relevance, it has to be retained in the dataset, as in Batista and Monard (2002).
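As a minimal illustration of these two discarding strategies, the sketch below drops incomplete instances and then drops attributes with too many missing values using pandas; the toy table, the column names, and the 30% cut-off are illustrative assumptions, not values prescribed above.

```python
# A minimal sketch of discarding strategies with pandas; the toy data,
# column names, and the 30% threshold are illustrative assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gene": ["g1", "g2", "g3", "g4"],
    "sample_1": [2.1, np.nan, 0.8, 1.5],
    "sample_2": [1.9, 2.2, np.nan, 1.4],
    "sample_3": [np.nan, 2.0, 0.9, np.nan],
})

# Complete case analysis: drop every instance (row) that has a missing value.
complete_cases = df.dropna()

# Attribute discarding: drop attributes (columns) whose fraction of missing
# values exceeds the chosen threshold (30% here); a highly relevant attribute
# would instead be retained, as noted above.
threshold = 0.3
reduced = df.loc[:, df.isna().mean() <= threshold]
```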

Parameter estimation
This approach is suitable for MNAR missing data. It requires estimation of parameters using suitable models such as probabilistic,
KNN, or SVD models. In order to estimate the parameters of a model, the maximum likelihood approach, using a variant of the
Expectation-Maximization algorithm, can be used (Moon, 1996). The most common method using this approach is called BPCA
(Oba et al., 2003) which consists of three elementary steps – principal components regression, Bayesian estimation, and an
iterative Expectation-Maximization (EM) algorithm. This is most commonly used in biological data cleaning.
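The BPCA procedure itself is specified in Oba et al. (2003); as a hedged stand-in, the sketch below uses scikit-learn's IterativeImputer, which likewise fits a model per attribute and refines the estimates over several rounds in the spirit of EM. The toy matrix is an illustrative assumption.

```python
# A hedged sketch of model-based imputation; IterativeImputer is a generic
# stand-in, not the BPCA method of Oba et al. (2003).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [2.1, 1.9, np.nan],
    [np.nan, 2.2, 2.0],
    [0.8, np.nan, 0.9],
    [1.5, 1.4, 1.6],
])

# Each attribute with missing entries is regressed on the others, and the
# estimates are refined iteratively, in the spirit of an EM procedure.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
```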

Imputation
In the context of missing data, imputation refers to the process of substituting values for the missing entries. The aim
of imputation is to replace the missing data with estimated values by using information in the measured values to infer population
parameters. It is unlikely to give the best prediction for the missing values, but can certainly add to the completeness of the dataset
(Little and Rubin, 2014). Imputation approaches can be broadly divided into Single value imputation, the Hot deck and cold deck
method, and Multiple imputation.

1. Single value imputation: Single value imputation replaces the missing value with a single estimated value, which is assumed to
be very close to the value that would have been observed if the dataset were complete. Various means for substituting a single
value for the missing data are used. Some of these are as follows (a brief code sketch follows this list):
(i) Mean imputation: The mean of the existing values of an attribute is used to replace a missing attribute value. This approach reduces the diversity of data and therefore tends to produce a poorer estimation of the standard deviation of the dataset
(Baraldi and Enders, 2010).
(ii) Regression imputation: A regression model can be used to predict observed values of a variable based on other variables,
and that value is used to impute values in cases where that variable is missing. It preserves the variability in the dataset
without introducing bias in the results. It overestimates the correlation between the attributes as compared to mean
imputation, which underestimates this measure due to the loss of variability.
(iii) Least squares regression (LS impute): It is an extended form of the regression model. It is calculated by developing an
equation for a function that minimizes the sum of squared errors from the model (Bø et al., 2004).
(iv) Local least squares (LLS impute): It is similar to LS impute, except that it involves an a priori step before applying the regression and estimation. It first represents a target gene that has missing values using a linear combination of similar genes by
identifying the K nearest neighbors that have large absolute values of Pearson correlation coefficients (Kim et al., 2004).
(v) K-Nearest Neighbor imputation (KNN impute): Missing data are replaced with the help of known data by minimizing the
distance between the observed and estimated values of the missing data. The K-neighbors closest to the missing data are
found, and the final result is estimated and replaces the missing value. This substitution of value depends on the type of
missing data (Batista and Monard, 2002).

2. Hot deck and cold deck methods: Hot deck imputation involves filling in missing data on variables of interest from non-
respondents (or recipients) using observed values from respondents (i.e., donors) within the same survey data set. Depending
on the donor with which the similarity is computed, the hot deck approach, which is also known as Last Observation Carried
Forward (LOCF), can be subdivided into random hot deck prediction and deterministic hot deck prediction (Andridge and Little,
2010). The hot deck approach offers certain advantages because it is free from parameter estimation and ambiguities in
prediction of values. The most essential feature of this technique is that it uses only logical and credible values for estimating
missing data, as these are obtained from the donor pool set (Andridge and Little, 2010). Cold-deck imputation, on the other
hand, selects donors from another dataset to replace a missing value of a variable or data item with a constant value from an
external donor.
3. Multiple imputation: Multiple imputation involves a series of calculations to decide upon the most appropriate value for the
missing data. This technique replaces each missing value with a set of the most suitable values, which are then further analyzed
to determine the final predicted value. It assumes a monotone missing pattern, which means that, for a sequence of values of a variable, if one value is missing then all subsequent values are also missing. Two common methods are used to find the set of
suitable values.

• Regression method: A regression model is fitted for each attribute having missing values, using the preceding attributes in the sequence as covariates, following the monotone property. Based on this model, a new model is formed, and this process is repeated sequentially for each attribute with missing values (Rubin, 2004).
• Propensity score method: The propensity score is the conditional probability of assignment to a particular treatment given a
set of observed covariates. In this method, a propensity score is calculated for each attribute with missing values to indicate
the probability of that observation being missing. A Bayesian bootstrap imputation is then used on the grouped data based
on the propensity score to get the set of values (Lavori et al., 1995).
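The sketch below illustrates two of the single value imputation strategies listed above, mean imputation and KNN imputation, using scikit-learn; the toy matrix and the choice K = 2 are illustrative assumptions.

```python
# Minimal sketches of mean and K-nearest neighbour imputation;
# the matrix and K = 2 are illustrative assumptions.
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([
    [2.1, 1.9, 2.0],
    [np.nan, 2.2, 2.1],
    [0.8, np.nan, 0.9],
    [1.5, 1.4, np.nan],
])

# Mean imputation: each missing entry is replaced by its column mean.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: each missing entry is estimated from the K closest rows,
# with distances computed on the features observed in both rows.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```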

Duplicate Data Detection


Data redundancy, due to the occurrence of duplicate values for data instances or attributes, wastes storage. The most common way of handling duplication is to find chunks of similar values and remove the duplicates from those chunks. However, this method is very time consuming, and hence a few other techniques have been developed for handling redundant data.

Knowledge-based methods
Incorporating domain dependent information from knowledge bases into the data cleaning task is one alternative for duplication
elimination. A working tool based on this technique is Intelliclean (Low et al., 2001). This tool standardizes abbreviations.
For example, if one record abbreviates the word street as St., another abbreviates it as Str, and a third uses the full word, all three records are standardized to the same form. Once the data has been standardized, it is cleaned using a set of
domain-specific rules that work with a knowledge base. These rules detect duplicates, merge appropriate records, and create
various alerts for any other anomalies.

ETL method
The most popular method for duplicate elimination, these days, is the ETL method (Rahm and Do, 2000). The ETL method
comprises three steps – extraction, transformation, and loading. Two types of duplicate data elimination can be performed using this
method – one at the instance-level and the other at the schema-level. Instance-level processing cleans errors within the data itself,
such as misspellings. Schema-level cleaning usually works by transforming the database into a new schema or a data warehouse.
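For instance-level cleaning of exact duplicates, a library call is often enough; the sketch below uses pandas, with illustrative record values. Fuzzy duplicates (e.g., the St./Str example above) still require knowledge-based rules or an ETL tool.

```python
# A minimal sketch of instance-level duplicate elimination; the records
# are illustrative.
import pandas as pd

records = pd.DataFrame({
    "gene": ["TP53", "TP53", "BRCA1", "EGFR"],
    "expression": [2.4, 2.4, 1.1, 3.7],
})

# Rows that are identical across all attributes are detected and dropped.
deduplicated = records.drop_duplicates()
```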

Data Reduction

Mining large scale data is time consuming and expensive in terms of memory. It may make the task of data mining impractical and
infeasible. Using the entire dataset is not always important, and in fact, may contribute little to the quality of the outcome,
compared to using a reduced version of it. Sometimes, the abundance of irrelevant data may lead to non-optimum results. Data
reduction reduces the data either in terms of volume or the number of attributes (also called dimensions) or both, without
compromising the integrity of the original data with regard to the results. A number of approaches are available for data reduction.
Ideally, any data reduction method should be efficient and yet produce nearly identical analytical results to those obtained with
the full data. Various approaches are briefly discussed below.

Parametric Data Reduction


In parametric data reduction, the volume of the original data is reduced by considering a relatively compact alternative way to
represent the data. It fits a parametric model based on the data distribution to the data, and estimates the optimal values of the
parameters required to represent that model. In this approach, only model parameters and the outliers are stored. Regression and
log-linear models are two well-known approaches for parametric data reduction.
Unlike parametric approaches, non-parametric approaches do not use any models. They summarize the data with sample statistics, as
discussed below.

Sampling
Processing a large data set at one time is expensive and time consuming. Instead of considering the whole data set, a small representative sample k of instances is taken from a data set G, where |k| ≤ |G|. This process is called sampling (a brief code sketch follows the list below). We can categorize sampling as follows.
• Simple random sampling without replacement: Here, a sample of size k is selected from a large data set of size |G|, each instance being chosen with equal probability, and the samples, once chosen, are not placed back in the dataset. Sampling without replacement gives a non-zero covariance between two chosen samples, which complicates the computations. If the number of data instances is very large, the covariance is very close to zero.
• Simple random sampling with replacement: Here, the samples, once chosen, are placed back in the dataset, and hence data may be duplicated in the sample. It gives a zero covariance between two chosen samples. In the case of a skewed data distribution, simple random sampling with replacement usually produces poor results with any data analysis method. In such a case, sampling with replacement is not much different from sampling without replacement. However, the precision of estimates is usually higher for sampling without replacement compared to sampling with replacement.
Adaptive sampling methods such as stratified sampling and cluster sampling are likely to improve performance in the case
of unbalanced datasets with relatively uneven sample distribution.
• Stratified sampling: In stratified sampling the original data is divided into strata (sub-groups), and from each stratum a
sample is generated by simple random sampling.
• Cluster sampling: In cluster sampling, the original data is divided into clusters (sub-groups), and out of a total of N clusters,
n are randomly selected, and from these clusters, elements are selected for the creation of the sample.
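The sketch below shows simple random sampling with and without replacement and a stratified draw using pandas (GroupBy.sample assumes pandas 1.1 or later); the data, stratum labels, and sample sizes are illustrative assumptions.

```python
# Minimal sampling sketches with pandas; data, strata, and sizes are
# illustrative. GroupBy.sample assumes pandas >= 1.1.
import pandas as pd

df = pd.DataFrame({
    "value": range(100),
    "stratum": ["A"] * 70 + ["B"] * 30,
})

# Simple random sampling without replacement.
srs_without = df.sample(n=10, replace=False, random_state=0)

# Simple random sampling with replacement (an instance may appear twice).
srs_with = df.sample(n=10, replace=True, random_state=0)

# Stratified sampling: a simple random sample (10%) drawn from each stratum.
stratified = df.groupby("stratum", group_keys=False).sample(frac=0.1, random_state=0)
```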

Dimension Reduction
Dimension or attribute reduction is simply the removal of unnecessary or insignificant attributes. The presence of irrelevant
attributes in the dataset may make the problem intractable and mislead the analysis as well, which is often termed the Curse of
Dimensionality. Data sets with high dimensionality are likely to be sparse as well. Machine learning techniques such as clustering,
which are based on data density and distance, may produce incorrect outcomes, in the case of sparse data. To overcome the low
performance of machine learning algorithms with high dimensional data sets, dimension reduction is vital. Dimension reduction
should be carried out in such a way that, as much as possible, the information content of the original data remains unchanged and
that the use of the reduced dataset does not change the final outcome. Dimension reduction can be obtained by selecting only the
relevant attributes and ignoring the rest. Principal Component Analysis (PCA) (Hotelling, 1933) is one of the most commonly used techniques for dimensionality reduction. PCA is a statistical procedure that uses an orthogonal transformation to map a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. It finds low dimensional approximations of the original data by projecting the data onto linear subspaces. It searches for the k n-dimensional orthogonal vectors (k < n) that best represent the original n-dimensional data. The original
data are thus mapped onto a much smaller number of dimensions, resulting in dimensionality reduction.
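A minimal PCA sketch with scikit-learn follows; the 100 × 50 random matrix and the choice k = 5 are illustrative assumptions.

```python
# A minimal sketch of PCA-based dimension reduction; the random matrix
# and k = 5 are illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))      # 100 observations, n = 50 attributes

pca = PCA(n_components=5)           # k = 5 orthogonal components, k < n
X_reduced = pca.fit_transform(X)    # shape (100, 5)

# Fraction of the original variance retained by the k components.
retained = pca.explained_variance_ratio_.sum()
```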

Data Transformation and Normalization

Transformation is the process of converting data from one format or structure into another format or structure. It is applied so that
the data may more closely meet the assumptions of a statistical inference procedure, or to improve interpretability. It uses a
transformation function, y = f(x), to change data from one domain to another. Often the term normalization is interchangeably
used with transformation. Normalization means applying a transformation so that the transformed data are roughly normally
distributed. Some transformation or normalization techniques are discussed below.

Log2 Transformation
It is the most widely used transformation in the case of microarray (Quackenbush, 2002) or RNASeq data (Anders and Huber,
2010). It produces a continuous series of values for which it is easy to deal with extreme values, heteroskedasticity, and skewed
data distributions.

Min–Max Normalization
It maps an attribute value, x, in the original dataset to a new value, x', given by

x' = (x - min) / (max - min)

Z-Score Normalization
It transforms the values of variables based on their mean and standard deviation. For a variable X represented by a vector {x1, x2, …, xn}, each value can be transformed using the formula

x'_i = (x_i - mean(X)) / stddev(X)

where x'_i is the Z-score value of x_i, mean(X) is the mean of X, and stddev(X) is the standard deviation, given by

stddev(X) = sqrt( (1 / (n - 1)) Σ_{i=1}^{n} (x_i - mean(X))^2 )

Decimal Scaling Normalization


Decimal scaling moves the decimal point of an attribute value. The number of decimal places moved depends on the maximum absolute value of the attribute. A value v of an attribute is normalized to v' in the following way:

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1. Suppose that the values of an attribute range from -523 to 237. The maximum absolute value is 523, so each value is divided by 1000 (i.e., j = 3): -523 normalizes to -0.523 and 237 normalizes to 0.237.
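The transformations above can be written in a few lines of NumPy, as in the sketch below; the example vector is illustrative, and a pseudo-count of 1 is assumed before the log2 transform so that zero intensities remain defined.

```python
# Minimal sketches of the log2, min-max, Z-score, and decimal scaling
# transformations; the vector and the pseudo-count of 1 are assumptions.
import numpy as np

x = np.array([3.0, 15.0, 60.0, 240.0, 523.0])

# Log2 transformation (a pseudo-count avoids log2(0)).
log2_x = np.log2(x + 1)

# Min-max normalization: x' = (x - min) / (max - min).
minmax_x = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: x' = (x - mean) / stddev, with the n - 1 form.
z_x = (x - x.mean()) / x.std(ddof=1)

# Decimal scaling: divide by 10^j with j the smallest integer such that
# max(|x'|) < 1 (assuming the maximum is not an exact power of ten).
# Here max |x| = 523, so j = 3.
j = int(np.ceil(np.log10(np.abs(x).max())))
decimal_x = x / 10 ** j
```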

Quantile Normalization
It is a technique for making two or more distributions identical in terms of statistical properties. It is frequently used in microarray
data analysis (Bolstad et al., 2003). To normalize gene expression data, it first assigns a rank to each column of values based on
lowest to highest values in that column. It then rearranges the column values based on their original values (before ranking), so
that each column is in order, going from the lowest to the highest value. The average for each row is then calculated using the
reordered values. The average calculated for the first row is the average of the lowest values (rank 1) from every column, the second is the average of the second lowest values (rank 2), and so on. Finally, it replaces the original values with the average values based on the ranks assigned during the first step. The new values in each column all have the same distribution.
The quantile normalization transforms the statistical distributions across samples to be the same and assumes there are no
global differences in the distributions of the data in each column. However, it is not clear how to proceed with normalization if
these assumptions are violated. Recently, a generalization of quantile normalization, referred to as smooth quantile normalization
(qsmooth) (Hicks et al., 2016), has been proposed that relaxes the assumptions by allowing differences in the distribution
between the groups.
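The basic procedure (rank, average the sorted rows, substitute back by rank) can be sketched directly in NumPy, as below; the matrix is illustrative, columns are assumed to be samples, and ties are not treated specially.

```python
# A minimal sketch of quantile normalization for a genes x samples matrix;
# ties are broken arbitrarily and qsmooth-style group differences are ignored.
import numpy as np

def quantile_normalize(X):
    """Force every column (sample) of X to share the same distribution."""
    ranks = X.argsort(axis=0).argsort(axis=0)     # rank of each value per column
    row_means = np.sort(X, axis=0).mean(axis=1)   # mean of each sorted row
    return row_means[ranks]                       # substitute the means by rank

X = np.array([
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
])
X_qn = quantile_normalize(X)   # every column now follows the same distribution
```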

Data Discretization
Data collected from different sources are found in different formats depending on the experimental set-up, prevailing conditions,
and variables of interest. To analyze such data, it is often convenient and effective to discretize the data (Dash et al., 2011).
Discretization transforms the domain of values for each quantitative attribute into discrete intervals. Using discrete values during
data processing offers a number of benefits such as the following:

• Discrete data consumes less memory for storage.


• Discrete data are likely to be simpler to visualize and understand.
• Performance, in some systems, often becomes more efficient and accurate using discrete data.

The two terms frequently used in the context of discretization or interval generation are discussed below.

• Cut-points: Cut-points are certain values within the range of each discretized quantitative attribute. Each cut-point divides a
range of continuous values into successive intervals (Liu et al., 2002). For example, a continuous range 〈a⋯b〉 may be
partitioned into intervals like [a, c] and [c, b], where the value c is a cut-point.
• Arity: Another term that is popular in discretization context is Arity. It is the total number of intervals generated for an attribute after
discretization. Arity for an attribute is one more than the number of cut-points used for discretization (Liu et al., 2002). An overly
high Arity can make the learning process longer while a very low Arity may negatively affect the predictive accuracy.

Any discretization method follows the enumerated steps below.

1. Sort the domain of continuous values for each attribute to be discretized.


2. Evaluate the cut-points for each of these attributes.
3. Create intervals using the deduced cut-points.
4. Replace each value with the discrete value corresponding to the interval in which it falls.
In order to accomplish this set of steps, a discretizer has to use a certain measure to evaluate cut-points and group different continuous values into separate discretized intervals. A few such measures are discussed below, followed by a short binning sketch.

• Binning: The simplest method to discretize a continuous-valued attribute is by creating a pre-specified number of bins. The cut-points of the bins are determined by user input.
• Entropy: It is one of the most commonly used discretization measures in the literature. It is the average amount of information
per event, where the information of an event is high for unlikely events and low otherwise. In every iteration, it calculates the
entropy of the bin after splitting and calculates the net entropy of the split and the information gain (Ross Quinlan, 1986). It
selects the split with the highest gain and iteratively partitions each split until the gain falls below a certain threshold.
• Dependency: It is a measure that finds the strength of association between a class and an attribute (depending upon certain computations or statistics) and accordingly determines the cut-points. In this measure, the Arity of each attribute has to be specified as a parameter.
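As promised above, a short binning sketch follows; it uses pandas to derive cut-points for equal-width and equal-frequency bins, with an illustrative value set and arity k = 3.

```python
# Minimal binning sketches with pandas; the values and arity (k = 3) are
# illustrative assumptions.
import pandas as pd

values = pd.Series([23.73, 5.45, 3.03, 10.17, 5.05, 7.60, 18.20, 12.40])

# Equal-width binning: k intervals of identical width between min and max.
equal_width = pd.cut(values, bins=3, labels=[1, 2, 3])

# Equal-frequency binning: k intervals holding roughly equal numbers of
# values; the cut-points fall on the empirical quantiles.
equal_freq = pd.qcut(values, q=3, labels=[1, 2, 3])
```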

Several discretization techniques have been proposed in the literature. They are broadly classified as Supervised and Unsupervised discretization techniques, depending on whether they use class labels (Mahanta et al., 2012). These are
discussed below.

Supervised data discretization


Such discretization uses class labels to convert continuous data to discrete ranges derived from the original dataset (Dougherty
et al., 1995). These are of two types.
(i) Entropy Based Discretization Method: This method of discretization uses the entropy measure to decide the boundaries.
Chiu et al. (1990) proposed a hierarchical method that maximizes the Shannon entropy over the discretized space. The
method starts with k-partitions and uses hill-climbing to optimize the partitions, using the same entropy measure to obtain finer intervals.

(ii) Chi-Square Based Discretization: It uses the statistical significance test (Chi-Square test) (Kerber, 1992) to determine the
probability of the similarity of data in two intervals. It starts with every unique value of the attribute in its own interval. It
computes χ2 for each initial interval. Next, it merges the intervals with the smallest χ2 values. It repeats the process of merging
until no more satisfactory merging is possible based on the χ2 values.

Unsupervised data discretization


The unsupervised data discretization process does not use any class information to decide upon the boundaries. A few discretization techniques are discussed below.

(i) Average and Mid-Ranged value discretization: A binary discretization technique that divides the class boundaries using the
average value of the data. Values less than the average correspond to one class, and values greater than the average correspond
to the other class (Dougherty et al., 1995).
For example, given A = {23.73, 5.45, 3.03, 10.17, 5.05}, the average discretization method first calculates the average score, mean(A) = (1/n) Σ_{i=1}^{n} A(i). Based on this value, it discretizes the vector using the following rule:

D_i = 1 if A(i) >= mean(A), and D_i = 0 otherwise.

In our example, mean(A) = 9.486, and therefore the discretized vector using average-value discretization is D = [1 0 0 1 0].
On the other hand, mid-range discretization uses the middle or mid-range value of an attribute in the whole dataset to decide the class boundary. The mid-range of a vector of values is obtained using the equation

M = (H + U) / 2

where H is the maximum value and U is the minimum value in the vector. Discretization of values occurs as follows:

D_i = 1 if A(i) >= M, and D_i = 0 otherwise.

For the vector A, we obtain M = 13.38. Therefore, the corresponding discretized vector is D = [1 0 0 0 0]. Mid-range discretization is not of much use as it lacks efficiency and robustness (outliers change it significantly) (Dougherty et al., 1995).
(ii) Equal Width Discretization: Divides the data into k equal intervals. The upper and lower bounds of the intervals are decided
by the difference between the maximum and minimum values for the attribute of the dataset (Catlett, 1991).
The number of intervals, k, is specified by the user, and the successive interval boundaries, p_r and p_{r+1}, are obtained from the maximum and minimum values of the vector (H and U, respectively) according to the equation

p_{r+1} = p_r + (H - U) / k

For our example, if k = 3, the class assignment is

D_i = 1 if p_0 <= A(i) < p_1; D_i = 2 if p_1 <= A(i) < p_2; D_i = 3 if p_2 <= A(i) <= p_3.

Therefore, the corresponding discretized vector is D = [3 1 1 2 1] (these computations are reproduced in the sketch below).
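The sketch below reproduces the worked example above with NumPy, recovering the same discretized vectors for average-value, mid-range, and equal-width discretization.

```python
# Reproducing the worked example for A = {23.73, 5.45, 3.03, 10.17, 5.05}.
import numpy as np

A = np.array([23.73, 5.45, 3.03, 10.17, 5.05])

# Average-value discretization: 1 if A(i) >= mean(A), else 0.
D_avg = (A >= A.mean()).astype(int)            # [1 0 0 1 0], mean(A) = 9.486

# Mid-range discretization: 1 if A(i) >= (H + U) / 2, else 0.
M = (A.max() + A.min()) / 2                    # 13.38
D_mid = (A >= M).astype(int)                   # [1 0 0 0 0]

# Equal-width discretization with k = 3 intervals between U and H.
k = 3
edges = np.linspace(A.min(), A.max(), k + 1)   # [3.03, 9.93, 16.83, 23.73]
D_ew = np.digitize(A, edges[1:-1]) + 1         # [3 1 1 2 1]
```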


Some of the frequently used discretizers, out of the plethora of techniques available in literature, are reported in Table 1.

Table 1 Some discretizers and their properties

Discretizer | Measure used | Procedure adopted | Learning model
Equal width | Binning | Splitting | Unsupervised
Equal frequency | Binning | Splitting | Unsupervised
ChiMerge (Kerber, 1992) | Dependency | Merging | Supervised
Chi2 (Liu and Setiono, 1997) | Dependency | Merging | Supervised
Ent-minimum description length principle (MDLP) (Fayyad and Irani, 1993) | Entropy | Splitting | Supervised
Zeta (Ho and Scott, 1997) | Dependency | Splitting | Supervised
Fixed frequency discretization (FFD) (Yang and Webb, 2009) | Binning | Splitting | Unsupervised
Optimal flexible frequency discretization (OFFD) (Wang et al., 2009) | Wrapping | Hybrid | Supervised
Class-attribute interdependence maximization (CAIM) (Kurgan and Cios, 2004) | Dependency | Splitting | Supervised
Class-attribute dependent discretization (CADD) (Ching et al., 1995) | Dependency | Hybrid | Supervised
Ameva (Gonzalez-Abril et al., 2009) | Dependency | Splitting | Supervised

Data Integration

Biological data deal with properties of living organisms, i.e., data are associated with metabolic mechanisms. Such data result from
the combined effect of complex mechanisms occurring in the cell, from DNA to RNA to Protein, and further to the metabolite
level. Understanding cellular mechanisms from all these perspectives is likely to produce more biologically relevant results
than simply analyzing them from mathematical or computational viewpoints. Such an analysis involves data integration. Data
integration is the process of analyzing data obtained from various sources, and obtaining a common model that explains the data
(Schneider and Jimenez, 2012). Such a model fits the prevailing conditions better and makes more accurate predictions. Literature
has shown that the use of data integration in the biological sciences has been effective in tasks such as identifying metabolic
and signaling changes in cancer by integrating gene expression and pathway data (Emmert-Streib and Glazko, 2011), and identifying protein-related disorders through combined analysis of genetic as well as genomic interactions (Bellay et al., 2011).
From the systems biological perspective, integration of data can be achieved at three levels, viz., at the data level, at the
processing level, or at the decision level (Milanesi et al., 2009). Integrating data at the first level requires combining data from
heterogeneous sources and implementing a universal query system for all types of data involved. This is the most time-consuming
step in data integration. One needs to consider various assumptions and experimental setup information before combining the
data. The second level of data integration involves an understanding and interpretation of the datasets. In this stage of integration,
one needs to identify associated correlations between the various datasets involved. The third stage of data integration is at the
decision level. In this stage, different participating datasets are first dealt with using specific procedures and individual results
are obtained. Finally, these individual results are mapped, according to some ontology or information known a priori, to interpret the results with more confidence.
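As a hedged illustration of the first, data-level stage, the sketch below joins two toy sources, expression values and pathway annotation, on a shared gene identifier using pandas; the column names and values are assumptions for illustration, not part of the frameworks cited above.

```python
# A hedged sketch of data-level integration: merging two illustrative
# sources on a shared gene identifier.
import pandas as pd

expression = pd.DataFrame({
    "gene": ["TP53", "BRCA1", "EGFR"],
    "log2_fold_change": [1.8, -0.6, 2.3],
})
pathways = pd.DataFrame({
    "gene": ["TP53", "EGFR", "MYC"],
    "pathway": ["p53 signalling", "ErbB signalling", "Myc targets"],
})

# An outer join keeps genes missing from either source, so the combined
# table can be queried uniformly afterwards.
integrated = expression.merge(pathways, on="gene", how="outer")
```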

Conclusions

Real-life data tend to be incomplete in nature. To handle incompleteness of different forms such as noise, missing values,
inconsistencies, and the curse of dimensionality, data preparation steps are essential. We presented a brief discussion of different
data preprocessing steps required to make data suitable for analysis. It is well-known that real data are not static in nature and data
distributions vary with time. State-of-the-art preprocessing steps are not appropriate to handle such changes of data distribution or
characteristics. Dealing with dynamic data is particularly challenging when data are produced rapidly. An empirical study on the
suitability and efficiency of pre-processing techniques applied to dynamic data is of great interest for the big data community.

See also: Data Mining in Bioinformatics. Knowledge Discovery in Databases. The Challenge of Privacy in the Cloud

References

Allison, P.D., 2012. Handling missing data by maximum likelihood. SAS Global Forum, Paper 312-2012. Available at: http://www.statisticalhorizons.com/wp-content/uploads/
MissingDataByML.pdf.
Anders, S., Huber, W., 2010. Differential expression analysis for sequence count data. Genome Biology 11 (10), R106.
Andridge, R.R., Little, R.J.A., 2010. A review of hot deck imputation for survey non-response. International Statistical Review 78 (1), 40–64.
Baraldi, A.N., Enders, C.K., 2010. An introduction to modern missing data analyses. Journal of School Psychology 48 (1), 5–37.
Batista, G.E.A.P.A., Monard, M.C., 2002. A study of K-nearest neighbour as an imputation method. In: Hybrid Intelligent Systems (HIS 2002), vol. 87, pp. 251–260.
Bellay, J., Han, S., Michaut, M., et al., 2011. Bringing order to protein disorder through comparative genomics and genetic interactions. Genome Biology 12 (2), R14.
Bø, T.H., Dysvik, B., Jonassen, I., 2004. LSimpute: Accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Research 32 (3), e34.
Bolstad, B.M., Irizarry, R.A., Åstrand, M., Speed, T.P., 2003. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias.
Bioinformatics 19 (2), 185–193.
Catlett, J., 1991. On changing continuous attributes into ordered discrete attributes. In: Machine Learning, EWSL-91, LNAI, vol. 26, pp. 164–178.
Ching, J.Y., Wong, A.K.C., Chan, K.C.C., 1995. Class-dependent discretization for inductive learning from continuous and mixed-mode data. IEEE Transactions on Pattern
Analysis and Machine Intelligence 17 (7), 641–651.
Chiu, D.K.Y., Cheung, B., Wong, A.K.C., 1990. Information synthesis based on hierarchical maximum entropy discretization. Journal of Experimental & Theoretical Artificial
Intelligence 2 (2), 117–129.
Dash, R., Paramguru, R.L., Dash, R., 2011. Comparative analysis of supervised and unsupervised discretization techniques. International Journal of Advances in Science and
Technology 2 (3), 29–37.
Díaz-Uriarte, R., De Andrés, S.A., 2006. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7 (1), 3.
Dougherty, J., Kohavi, R., Sahami, M., et al., 1995. Supervised and unsupervised discretization of continuous features. Machine Learning: Proceedings of the Twelfth
International Conference 12, 194–202.
Emmert-Streib, F., Glazko, G.V., 2011. Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLOS Computational Biology 7 (5),
e1002053.
Fayyad, U., Irani, K., 1993. Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the Thirteenth International Joint Conference on
Artificial Intelligence. 1022–1027.
Gonzalez-Abril, L., Cuberos, F.J., Velasco, F., Ortega, J.A., 2009. Ameva: An autonomous discretization algorithm. Expert Systems With Applications 36 (3), 5327–5332.
Herbert, K.G., Wang, J.T.L., 2007. Biological data cleaning: A case study. International Journal of Information Quality 1 (1), 60–82.
Hicks, S.C., Okrah, K., Paulson, J.N., et al., 2016. Smooth quantile normalization. BioRxiv. 085175.
Ho, K.M., Scott, P.D., 1997. Zeta: A global method for discretization of continuous variables. In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD-97), Newport Beach, USA, pp. 191–194, AAAI Press.
Hotelling, H., 1933. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24 (6), 417.
Hyun, K., 2013. The prevention and handling of the missing data. Korean Journal of Anesthesiology 64 (5), 402–406.
Kerber, R., 1992. Chimerge: Discretization of numeric attributes. In: Proceedings of the tenth national conference on Artificial intelligence, pp. 123–128, AAAI Press.
Kim, H., Golub, G.H., Park, H., 2004. Missing value estimation for DNA microarray gene expression data: Local least squares imputation. Bioinformatics 21 (2), 187–198.
Kurgan, L.A., Cios, K.J., 2004. CAIM discretization algorithm. IEEE Transactions on Knowledge and Data Engineering 16 (2), 145–153.
Lavori, P.W., Dawson, R., Shera, D., 1995. A multiple imputation strategy for clinical trials with truncation of patient data. Statistics in Medicine 14 (17), 1913–1925.
Little, R.J.A., Rubin, D.B., 2014. Statistical Analysis With Missing Data. John Wiley & Sons.
Liu, H., Setiono, R., 1997. Feature selection via discretization. IEEE Transactions on Knowledge and Data Engineering 9 (4), 642–645.
Liu, H., Hussain, F., Tan, C.L., Dash, M., 2002. Discretization: An enabling technique. Data Mining and Knowledge Discovery 6 (4), 393–423.
Low, W.L., Lee, M.L., Ling, T.W., 2001. A knowledge-based approach for duplicate elimination in data cleaning. Information Systems 26 (8), 585–606.
Mahanta, P., Ahmed, H.A., Kalita, J.K., Bhattacharyya, D.K., 2012. Discretization in gene expression data analysis: A selected survey. In: Proceedings of the Second
International Conference on Computational Science, Engineering and Information Technology, pp. 69–75, ACM.
Milanesi, L., Alfieri, R., Mosca, E., et al., 2009. Sys-Bio Gateway: A framework of bioinformatics database resources oriented to systems biology. In: Proceedings of
International Workshop on Portals for Life Sciences (IWPLS'09), Edinburgh, UK, CEUR.
Moon, T.K., 1996. The expectation-maximization algorithm. IEEE Signal Processing Magazine 13 (6), 47–60.
Oba, S., Sato, M.-A., Takemasa, I., et al., 2003. A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 19 (16), 2088–2096.
Quackenbush, J., 2002. Microarray data normalization and transformation. Nature Genetics 32, 496–501.
Rahm, E., Do, H.H., 2000. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin 23 (4), 3–13.
Ross Quinlan, J., 1986. Induction of decision trees. Machine Learning 1 (1), 81–106.
Roy, S., Bhattacharyya, D.K., Kalita, J.K., 2013. CoBi: Pattern based co-regulated biclustering of gene expression data. Pattern Recognition Letters 34 (14), 1669–1678.
Rubin, D.B., 2004. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.
Schneider, M.V., Jimenez, R.C., 2012. Teaching the fundamentals of biological data integration using classroom games. PLOS Computational Biology 8 (12), e1002789.
Wang, S., Min, F., Wang, Z., Cao, T., 2009. OFFD: Optimal flexible frequency discretization for naive Bayes classification. Advanced Data Mining and Applications. 704–712.
Yang, Y., Webb, G.I., 2009. Discretization for naive-Bayes learning: Managing discretization bias and variance. Machine Learning 74 (1), 39–74.
