Michael Melese (Ph.D.)

¡ Data Processing
¡ Data quality problems
¡ Data preprocessing
¡ Data cleaning
¡ Data integration
¡ Data transformation
¡ Data reduction
¡ In STEM, the term data processing is
considered too broad on its own; it is
typically used for the initial stage of the
overall data handling, which is followed
by data analysis.
¡ The collection and manipulation of items of
data to produce meaningful information.
§ The processing is in any manner detectable by an observer.
¡ The conversion of data into a
usable and desired form. The
conversion is carried out
using a predefined sequence
of operations, through either
§ Manual data processing or
§ Automatic (electronic) data processing

¡ Data processing may involve various processes;
§ Validation: ensuring that supplied data is correct and relevant.
§ Sorting: arranging items in some sequence and/or in
different sets.
§ Summarization: reducing detailed data to its main points.
§ Aggregation: combining multiple pieces of data.
§ Analysis: collection, organization, analysis, interpretation and
presentation of data.
§ Reporting: list detail or summary of the data processed.
§ Classification: separation of data into various categories.
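
The processes listed above can be sketched on a toy dataset. This is a minimal illustration, not a prescribed pipeline: the records, the field names (region, amount), and the threshold of 100 are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sales records; the field names are assumptions for illustration.
records = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": -5.0},   # invalid: negative amount
    {"region": "south", "amount": 200.0},
]

# Validation: keep only records whose amount is a non-negative number.
valid = [
    r for r in records
    if isinstance(r["amount"], (int, float)) and r["amount"] >= 0
]

# Sorting: arrange the valid records by amount, ascending.
ordered = sorted(valid, key=lambda r: r["amount"])

# Classification: separate records into categories with a simple threshold.
def size_class(record):
    return "large" if record["amount"] >= 100 else "small"

# Aggregation: combine the amounts per region.
totals = defaultdict(float)
for r in valid:
    totals[r["region"]] += r["amount"]

# Summarization: reduce the detail to a few main figures.
summary = {
    "count": len(valid),
    "total": sum(r["amount"] for r in valid),
    "mean": sum(r["amount"] for r in valid) / len(valid),
}

# Reporting: list a summary of the processed data.
print(summary)
print(dict(totals))
print([(r["region"], size_class(r)) for r in ordered])
```

Each step works on the output of validation, so the invalid record never reaches the totals or the report.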

¡ Incomplete Data
¡ Data duplication
¡ Inconsistent Formats
¡ Accessibility
¡ System upgrades
¡ Data purging and storage
¡ Poor organization
¡ Why data may be incomplete:
§ Data has not been entered in the
system correctly,
§ Certain files may have been
§ Some data has several missing
¡ If an address does not include a
zip code at all, the remaining
information can be of little value,
since the geographical aspect of it
would be hard to determine.
Duplicated data from source
¡ Multiple copies of the same
records take a toll on
computation and storage.
§ Duplicates produce skewed or incorrect
insights when they go undetected.
¡ One of the key problems could be
human error
§ Simply entering the data multiple
times by accident
§ Sometimes the problem might be
an algorithm that has gone wrong.
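
One way to catch accidental duplicates is to compare records on a normalized key. The sketch below assumes that two records are the same when name and email match after trimming whitespace and ignoring case; the sample records and that matching rule are illustrative assumptions, and real data often needs fuzzier matching.

```python
# Hypothetical customer records; rows 1, 2 and 4 describe the same person
# and differ only in letter case and surrounding whitespace.
customers = [
    {"name": "Abebe Kebede", "email": "abebe@example.com"},
    {"name": "abebe kebede ", "email": "Abebe@Example.com"},
    {"name": "Sara Tesfaye", "email": "sara@example.com"},
    {"name": "Abebe Kebede", "email": "abebe@example.com"},
]

def normalize(record):
    """Build a comparison key that ignores case and surrounding whitespace."""
    return (record["name"].strip().lower(), record["email"].strip().lower())

def deduplicate(rows):
    """Keep the first occurrence of each normalized key; collect the rest."""
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        key = normalize(row)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            unique.append(row)
    return unique, duplicates

unique, duplicates = deduplicate(customers)
print(len(unique), "unique,", len(duplicates), "duplicates")
```

Returning the duplicates instead of silently dropping them makes it possible to review what was removed.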

¡ Different organizations store
their data in different ways.
These include:
§ Name (First name, Last name),
§ Date of birth (US/UK style),
§ Phone number with or without
country code.
¡ How basic information is stored
should be pre-determined.
¡ If the data is stored inconsistently,
the systems used to analyze or store
the information may not interpret it
correctly.
¡ Inconsistent data may take data
scientists a considerable amount of
time to simply unravel the many
versions of data saved.
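
Converting every value to one agreed format on ingestion avoids this unravelling later. The following is a sketch under stated assumptions: the target formats (ISO 8601 dates, phone numbers with a leading plus sign) and the default country code 251 are choices made for the example, and the helper names are hypothetical.

```python
from datetime import datetime
import re

def parse_date(text, style):
    """Parse a date written in US (MM/DD/YYYY) or UK (DD/MM/YYYY) style.

    The style must be known in advance: 03/04/2021 is ambiguous on its own.
    """
    fmt = "%m/%d/%Y" if style == "US" else "%d/%m/%Y"
    return datetime.strptime(text, fmt).date().isoformat()

def normalize_phone(text, default_country_code="251"):
    """Strip punctuation and add a country code when it is missing.

    The default country code is an assumption made for this sketch.
    """
    digits = re.sub(r"\D", "", text)
    if text.strip().startswith("+"):
        return "+" + digits
    return "+" + default_country_code + digits.lstrip("0")

print(parse_date("03/04/2021", "US"))
print(parse_date("03/04/2021", "UK"))
print(normalize_phone("0911 234 567"))
print(normalize_phone("+251 911 234 567"))
```

The same input string yields two different dates depending on the declared style, which is why the storage convention has to be fixed before the data is collected.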

¡ Much of the data and information that scientists
use to create, evaluate, theorise and predict
results or end products often gets lost.
¡ The way data trickles down to business
analysts in big organizations from
departments, sub-divisions, branches, and
finally the teams who are working on the
¡ The next user may or may not have
complete access to this information.
¡ Sharing data and making information
available to everyone in an efficient manner is
the cornerstone of sharing corporate data.
¡ Every time the data
management system gets an
upgrade or the hardware is
updated, there is a chance of
information getting lost or
corrupted in the absence of
a data backup.
¡ Making several backups of
data and upgrading the
systems only through
authenticated sources is
always advisable.

¡ In an organization, there
is a chance that locally
saved information could
be deleted either by
mistake or deliberately.
§ Saving the data in a safe
manner, and sharing a
with the community is
¡ If we are not able to easily
search through the data, we
find that it becomes
significantly more difficult to
make use of.
§ Through different organizational
methods and procedures, there
are dozens of ways that data can
be represented.

¡ Noisy data due to
§ Faulty data collection instruments, entry errors,
transmission problems, technology limitations and
inconsistency in naming conventions.
¡ Duplication: a data set may include data objects that are
duplicates, or almost duplicates, of one another
§ A major issue when merging data from heterogeneous data
sources, such as a person with multiple email addresses.
¡ Impossible data combinations (e.g. Gender: Male, Pregnant: Yes)
¡ Data from multiple units and languages

¡ Data have quality if they satisfy the requirements of the
intended use, that is, when the data quality problems are solved.
The requirements include:
§ Accuracy, Completeness, Consistency, Timeliness,
Believability and Interpretability

¡ Data preprocessing is the theory and practice of
manipulating electronic data, often automatically, in a
way that it can be used for a specific application.
§ Preprocessing may have a different scope depending on the
application and domain.
¡ Trivial string manipulation programs are not
economical; performing these tasks requires
robust text processing.
§ Most widely used approach: RegEx for NLP
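
As an illustration of regular expressions for text preprocessing, the sketch below tokenizes a sentence, extracts email addresses, and masks numbers using Python's standard `re` module. The sample sentence and the patterns are assumptions chosen for brevity; the email pattern in particular is deliberately simple and is not a full address validator.

```python
import re

text = ("Contact Dr. Smith at smith@example.org or call "
        "+1 (555) 010-9999 before 12/05/2023!")

# Lowercase the text and pull out word-like tokens
# (runs of letters, digits, and apostrophes).
tokens = re.findall(r"[a-z0-9']+", text.lower())

# Extract email addresses with a deliberately simple pattern.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)

# Replace every run of digits with a placeholder token.
masked = re.sub(r"\d+", "<NUM>", text)

print(tokens)
print(emails)
print(masked)
```

Three short patterns replace what would otherwise be hand-written character loops, which is the economy the slide refers to.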

¡ Preprocessing ML data involves both data
engineering and feature engineering.
§ Data engineering is the process of converting raw
data into prepared data.
§ Feature engineering then tunes the prepared data to
create the features expected by the ML model.

¡ Raw data refers to the data in its source form, without
any prior preparation for ML.
§ The data might be in its raw form (flat file) or in a
transformed form (in a database).
¡ Transformed data might have been converted
from its original raw form to be used for
analytics, but not in the context of our ML task.
§ In addition, data sent from other systems
that eventually call ML models for predictions is
considered to be data in its raw form.
¡ Prepared data refers to the dataset in the form ready for your
machine learning task.
¡ Data sources have been parsed, joined, and put
into a tabular form after aggregating and
summarizing in the right granularity
§ Each row in the dataset represents a unique record, and
each column represents summary information for ML.
§ In the case of supervised learning tasks, the target
feature is present.
¡ Irrelevant columns have been dropped, and
invalid records have been filtered out.
¡ Engineered features refers to the dataset with the tuned features
expected by the model.
¡ This means performing certain ML-specific operations on the
columns in the prepared dataset, and creating new
features for your model during training and
prediction (preprocessing operations).
§ Examples: scaling numerical columns to a value between 0 and 1,
clipping values, and one-hot-encoding categorical features.
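
The three operations named above can be sketched in a few lines of plain Python. The column values, the clipping bounds (18 and 65), and the helper names are assumptions for illustration; libraries such as scikit-learn provide production versions of the same ideas.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def clip(values, lower, upper):
    """Force every value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def one_hot(values):
    """Encode each category as a 0/1 vector, one position per category."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return vectors, categories

ages = [18, 25, 40, 95]
colors = ["red", "green", "red", "blue"]

clipped = clip(ages, 18, 65)          # the outlier 95 becomes 65
scaled = min_max_scale(clipped)       # values now lie in [0, 1]
encoded, categories = one_hot(colors)

print(clipped)
print(scaled)
print(categories, encoded)
```

Clipping before scaling keeps a single extreme value from compressing all the other values toward zero.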

¡ Each operation aims to help ML build better predictive
models. Some of the operations for structured data:
¡ Data cleansing
§ Removing or correcting records with corrupted or invalid values from
raw data, as well as removing records that are missing a large number of
columns.
¡ Instance selection and partitioning
§ Selecting data points from the input dataset to create training,
evaluation (validation), and test sets using random sampling, minority
class oversampling, and stratified partitioning.
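
Stratified partitioning can be sketched as follows: the rows are grouped by class, each group is shuffled, and the same fractions are cut from every group so that each split keeps the class proportions. The 60/20/20 fractions, the fixed seed, and the function name `stratified_split` are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split rows into train/validation/test while keeping class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, valid, test = [], [], []
    for members in by_class.values():
        rng.shuffle(members)                       # random sampling within a class
        n_train = round(len(members) * fractions[0])
        n_valid = round(len(members) * fractions[1])
        train.extend(members[:n_train])
        valid.extend(members[n_train:n_train + n_valid])
        test.extend(members[n_train + n_valid:])   # remainder goes to test
    return train, valid, test

# Hypothetical imbalanced dataset: 80 negatives (label 0), 20 positives (label 1).
data = [(i, 0) for i in range(80)] + [(i, 1) for i in range(80, 100)]
train, valid, test = stratified_split(data, label_of=lambda row: row[1])

print(len(train), len(valid), len(test))
print(sum(label for _, label in train), "positives in train")
```

With purely random sampling, a small minority class could end up missing from the validation or test set; splitting per class prevents that.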
¡ Feature tuning
§ Improving the quality of a feature for ML, which includes scaling and
normalizing numeric values, imputing missing values, clipping outliers,
and adjusting values with skewed distributions.
¡ Representation transformation
§ Converting a numeric feature to a categorical feature and vice versa.
¡ Feature extraction
§ Reducing the number of features by creating lower-dimension and more
powerful data representations using PCA, embedding extraction,
and hashing.
¡ Feature selection
§ Selecting a subset of the input features for training the model, and
ignoring the irrelevant or redundant ones, using filter or wrapper methods.
It can also involve simply dropping features that are missing a large
number of values.
¡ Feature construction
§ Creating new features by using typical techniques, such
as polynomial expansion or feature crossing.
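
A minimal sketch of the two construction techniques named above: polynomial expansion derives powers of one numeric feature, and feature crossing combines two categorical features into one. The helper names and the `_x_` separator are assumptions for illustration.

```python
def polynomial_expand(x, degree=3):
    """Return [x, x^2, ..., x^degree] for a single numeric feature."""
    return [x ** d for d in range(1, degree + 1)]

def cross(a, b):
    """Cross two categorical values into one combined categorical value."""
    return f"{a}_x_{b}"

print(polynomial_expand(2.0))
print(cross("weekday", "evening"))
```

The crossed feature lets a linear model learn a weight for the specific combination (weekday and evening) rather than only for each value separately.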
¡ When working with unstructured data such as images, audio, or
text documents, deep learning has replaced domain
knowledge-based feature engineering by folding it into the
model architecture.
¡ A convolutional layer is an automatic feature preprocessor.
Constructing the right model architecture requires
some empirical knowledge of the data. In addition, some
amount of preprocessing is needed, such as:
§ Text documents: stemming and lemmatization, TF-IDF
calculation, n-gram extraction, and embedding lookup.
§ Images: clipping, resizing, cropping, Gaussian blur, and canary
filters.
§ All types of data, including text and images: transfer learning,
in which you treat all-but-last layers of the fully trained model
as a feature engineering step.
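
Two of the text preprocessing steps, n-gram extraction and TF-IDF calculation, can be sketched in plain Python. The three sample documents are assumptions, and the weighting used here (term frequency divided by document length, times the natural log of N over document frequency) is only one common TF-IDF variant; libraries often apply smoothing that gives different numbers.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, already lowercased and whitespace-tokenizable.
docs = [
    "data cleaning removes noisy data",
    "data integration merges sources",
    "feature scaling changes ranges",
]

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(documents):
    """Score each term per document as (count / length) * ln(N / df)."""
    tokenized = [d.split() for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(docs)
bigrams = ngrams(docs[0].split(), 2)

print(bigrams)
print(sorted(scores[0].items(), key=lambda kv: -kv[1]))
```

In the first document, "data" occurs twice but also appears in another document, so under this variant it scores lower than "cleaning", which is unique to that document.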
¡ Data cleaning is the process of preparing data for analysis by removing
or modifying data that is incorrect, incomplete,
irrelevant, duplicated, or improperly formatted.
¡ Data cleaning cleans the data by:
§ Filter unwanted outliers and smooth noisy data
§ Remove duplicate and irrelevant observations
§ Fix structural errors such as typos or inconsistent
§ Fill in missing values

¡ Ignore the tuple whenever a class label is missing
¡ Estimate missing values
§ Filling missing values manually is time consuming and
not feasible for a large data set
§ Use a global constant
§ Use the attribute or group mean
§ Use the most probable (most frequent) value
¡ Choosing the right technique is a choice that depends on
the problem domain.
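
The estimation strategies above can be compared side by side on a small table. The rows, column names, and helper functions below are assumptions for illustration; `None` stands for a missing value.

```python
from collections import Counter
from statistics import mean

# Hypothetical employee table with two missing entries.
rows = [
    {"dept": "A", "salary": 100.0, "city": "Addis Ababa"},
    {"dept": "A", "salary": None,  "city": "Addis Ababa"},
    {"dept": "B", "salary": 300.0, "city": None},
    {"dept": "B", "salary": 500.0, "city": "Adama"},
]

def fill_constant(rows, column, constant):
    """Replace missing entries with one global constant."""
    return [{**r, column: constant if r[column] is None else r[column]}
            for r in rows]

def fill_mean(rows, column):
    """Replace missing entries with the mean of the observed values."""
    m = mean(r[column] for r in rows if r[column] is not None)
    return fill_constant(rows, column, m)

def fill_group_mean(rows, column, group):
    """Replace missing entries with the mean of the row's own group."""
    observed = {}
    for r in rows:
        if r[column] is not None:
            observed.setdefault(r[group], []).append(r[column])
    return [
        {**r, column: mean(observed[r[group]]) if r[column] is None else r[column]}
        for r in rows
    ]

def fill_most_frequent(rows, column):
    """Replace missing entries with the most frequent observed value."""
    counts = Counter(r[column] for r in rows if r[column] is not None)
    return fill_constant(rows, column, counts.most_common(1)[0][0])

print(fill_mean(rows, "salary")[1]["salary"])
print(fill_group_mean(rows, "salary", "dept")[1]["salary"])
print(fill_most_frequent(rows, "city")[2]["city"])
```

The attribute mean and the group mean give different answers for the same missing salary, which illustrates why the choice depends on the problem domain. Note that `fill_group_mean` in this sketch fails if a group has no observed values at all.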

¡ Data integration is blending data from multiple sources into a
coherent data store.
¡ Issues to be considered during integration:
§ Some redundancies can be detected by correlation
analysis if the same entity appears in multiple sources.
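
Correlation analysis for redundancy can be sketched with the Pearson coefficient: two numeric attributes whose coefficient is close to +1 or -1 carry nearly the same information. The attribute names and values below are assumptions; the example stores the same heights in centimetres and in inches, as could happen when two sources are merged.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)   # undefined if either attribute is constant

# Hypothetical attributes after merging two sources.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [h / 2.54 for h in height_cm]   # same quantity, different unit
shoe_size = [38, 44, 39, 41, 40]

r_redundant = pearson(height_cm, height_in)
r_other = pearson(height_cm, shoe_size)

print(round(r_redundant, 6), round(r_other, 3))
```

A coefficient near 1 for the two height columns flags one of them as a candidate for removal, while the weak correlation with the third column gives no such signal.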

¡ Data consolidation
§ Brings data together from several separate systems to reduce the
number of data storage locations.
¡ Data propagation
§ Data propagation is the use of applications to copy data from one
location to another synchronously or asynchronously
¡ Data virtualization
§ Uses an interface to provide a near real-time, unified view of data from
disparate sources with different data models.
¡ Data federation
§ Uses a virtual database and creates a common data model for
heterogeneous data from different systems; it is a form of data virtualization.

¡ ML data may not be in the right format or may
require transformations to make it more useful.
¡ Data transformation activities and techniques include:
§ Categorical encoding
▪ Label encoding converts categorical variables to numerical
representation, something that is machine-readable.
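
A minimal label encoder, assuming categories are numbered in sorted order (the function name and sample values are illustrative):

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer, in sorted order."""
    return {category: index
            for index, category in enumerate(sorted(set(values)))}

sizes = ["small", "large", "medium", "small"]
mapping = fit_label_encoder(sizes)
encoded = [mapping[v] for v in sizes]

print(mapping)
print(encoded)
```

The integers impose an order (here large < medium < small) that has no meaning for the data, so for nominal categories a one-hot encoding is often preferred.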
§ Dealing with skewed data
▪ Algorithms such as linear regression or ANNs often show
improvements with a more symmetric distribution. To reduce skew,
you can use roots (square root, cube root), logarithms (base e
or base 10), reciprocals (positive or negative), or the Box-Cox
transformation.
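
The effect of these transformations can be checked by measuring skewness before and after. The sketch below uses the population skewness (third standardized moment) and a made-up right-skewed sample that doubles at each step; both are assumptions chosen so the result is easy to verify.

```python
import math

def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / n
    return sum((v - m) ** 3 for v in values) / n / var ** 1.5

# Hypothetical right-skewed sample: each value doubles the previous one.
raw = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]

sqrt_values = [math.sqrt(v) for v in raw]
log_values = [math.log(v) for v in raw]

print(skewness(raw), skewness(sqrt_values), skewness(log_values))
```

For this sample the square root reduces the skew and the logarithm removes it, because the logarithms of a doubling sequence are evenly spaced. Real data will rarely become perfectly symmetric.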

§ Bias mitigation
▪ If bias is detected in the data, it can be mitigated by replacing the
current values, or labels, with those that will result in a fairer model.
Techniques include reweighing, optimized preprocessing, learning fair
representations and the disparate impact remover.
§ Scaling
▪ Scaling is a method of transforming data into a particular range.
This is important when using regression algorithms and
algorithms using Euclidean distances (e.g. KNN or K-Means), as
they are sensitive to the variation in magnitude and range across
features.
▪ The goal of scaling is to change the values of each numerical
feature in the data set to a common scale, for example with min-max
scaling or z-score standardization.
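
The sensitivity of Euclidean distance to feature ranges can be shown with z-score standardization, where each value becomes (value - mean) / standard deviation. The three data points and the use of the population standard deviation are assumptions made for this sketch.

```python
from math import sqrt
from statistics import mean, pstdev

def z_score(values):
    """Standardize values to zero mean and unit (population) variance."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical people described by (age in years, income in currency units).
ages = [25, 27, 60]
incomes = [50_000, 50_600, 50_500]

raw = list(zip(ages, incomes))
scaled = list(zip(z_score(ages), z_score(incomes)))

# Which neighbour is nearest to person 0 before and after scaling?
print(euclidean(raw[0], raw[1]), euclidean(raw[0], raw[2]))
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```

On the raw values the income column dominates the distance and person 2 is the nearest neighbour of person 0, even though their ages differ by 35 years. After standardization both features contribute and person 1 becomes the nearest neighbour.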

¡ Most machine learning techniques may not be
effective for high-dimensional data. A database
or data warehouse may store terabytes of data.
§ Data analysis on such huge amounts of data
may take a very long time.
¡ Data reduction techniques can be applied to
obtain a reduced representation of the actual
data that is smaller in volume but still contains critical
information.
¡ Data Cube Aggregation
§ Aggregation operations are applied to the data in the construction of
a data cube.
¡ Dimensionality Reduction
§ In dimensionality reduction redundant attributes are detected and
removed which reduce the data set size.
¡ Data Compression
§ The data are encoded in a reduced or compressed representation of the
original data.
¡ Numerosity Reduction
§ The data are replaced or estimated by alternative, smaller representations.
¡ Discretization and concept hierarchy generation
§ Data values are replaced by ranges or higher conceptual levels.
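
Two of these strategies are easy to sketch: aggregation (rolling quarterly figures up to yearly totals, as in a data cube) and discretization (replacing raw values with ranges). The sales figures and the age bin edges below are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical quarterly sales, rolled up to one total per year.
quarterly = [
    (2022, "Q1", 224.0), (2022, "Q2", 408.0),
    (2022, "Q3", 350.0), (2022, "Q4", 586.0),
    (2023, "Q1", 300.0), (2023, "Q2", 410.0),
    (2023, "Q3", 390.0), (2023, "Q4", 600.0),
]
yearly = defaultdict(float)
for year, _quarter, amount in quarterly:
    yearly[year] += amount

# Discretization: replace raw ages with ranges (bin edges are assumptions).
def age_range(age):
    if age < 18:
        return "0-17"
    if age < 40:
        return "18-39"
    if age < 65:
        return "40-64"
    return "65+"

ages = [5, 17, 18, 39, 40, 70]
ranges = [age_range(a) for a in ages]

print(dict(yearly))
print(ranges)
```

Eight quarterly rows shrink to two yearly rows, and six distinct ages shrink to four range labels, at the cost of losing the finer detail.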


