Data Preparation in the Big Data Era
Best Practices for Data Integration
Federico Castanedo
Data Preparation in the Big Data Era
by Federico Castanedo
Copyright © 2015 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (http://safaribooksonline.com). For
more information, contact our corporate/institutional sales department:
800-998-9938 or corporate@oreilly.com.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Preparation
in the Big Data Era, the cover image, and related trade dress are trademarks of
O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-93895-9
CHAPTER 1
Data Preparation in the Era of Big Data
Introduction
Preparing and cleaning data for any kind of analysis is notoriously costly, time consuming, and prone to error, with conventional estimates holding that 80% of the total time spent on analysis is spent on data preparation.1 Accelerating investment in big data analytics ($44 billion total in 2014 alone, according to Gartner) elevates the stakes for successfully preparing data from many sources for use in data analysis.
Substantial ROI gains can be realized by modernizing the techniques and tools enterprises employ in cleaning, combining, and transforming data. This report introduces the business benefits of big data and analyzes the issues that organizations face today with traditional data preparation and integration. It also introduces the need for a new approach that scales, known as data curation, and discusses how to deliver the results.
This report will cover the following key topics:
• The business benefits of big data
• The issues organizations face today with traditional data preparation and integration
• Data curation, a new approach to integration that scales
• Delivering the results

1 T. Dasu and T. Johnson, Exploratory Data Mining and Data Cleaning (Wiley-IEEE, 2003).
Gain Insight
The data you analyze should support business operations and help drive decisions in the company. Any results should be integrated seamlessly with the existing business workflows, and will only be valuable if managers and frontline employees understand and use those results accordingly.
Here are four steps any company can take to gain specific insights into its business problems:
Data silos
One challenge that companies may face in implementing a big data strategy is the existence of data silos in different areas of the company. When data silos are present, your business’s data is scattered across them, with no communication or interfaces between them.
As part of your big data strategy, you should plan to integrate data projects into a coherent and unified view and, even more importantly, avoid (as much as possible) moving data from one place to another.
In a big data project, input data sources can come from different domains, not only from traditional transactions and social network data, and it is necessary to combine or fuse them. In order to combine your data successfully, it’s important to first understand both the data itself and your goals for it.
Data lakes
Until you have a solid grasp on the business purpose of your data, you can store it in a data lake. A data lake is a storage repository that holds raw input data, where it can be kept until your company’s goals are clear. An important drawback of data lakes is the generation of duplicate information, and the necessity of dealing with the data variety problem in order to perform correct data integration.

Data variety, together with velocity and volume, is one of the “three V’s” of big data characteristics. Data variety refers to the number of distinct types of data sources. Since the same information can be stored under different unique identifiers in each data source, it becomes extremely difficult to identify similar data.
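To illustrate the problem, here is a minimal sketch (the records, field names, and normalization rule are all hypothetical): the same company appears in two silos under different unique identifiers, so a naive join on the identifier finds no match, and matching has to fall back on comparing record contents.

# Hypothetical records: one customer stored in two silos under
# different unique identifiers and slightly different names.
crm = [{"id": "C-1042", "name": "ACME Corp.", "city": "Boston"}]
billing = [{"id": "784-BX", "name": "Acme Corporation", "city": "Boston"}]

# A naive join on the identifier finds nothing, even though both
# records describe the same entity.
print([(a, b) for a in crm for b in billing if a["id"] == b["id"]])  # []

# Matching instead requires comparing normalized record contents,
# which is exactly the data variety problem described above.
def key(rec):
    name = rec["name"].lower().rstrip(". ").replace(" corporation", " corp")
    return (name, rec["city"])

print(sum(key(a) == key(b) for a in crm for b in billing))  # 1 match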
When preparing data, consistency is typically checked at two levels:
1. In-record consistency
2. Cross-record consistency
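As a hedged illustration of the two levels (the order records and field names are hypothetical), an in-record check validates that the fields within a single record agree with each other, while a cross-record check validates constraints that span records:

orders = [
    {"order_id": 1, "shipped": "2015-03-02", "delivered": "2015-03-05"},
    {"order_id": 2, "shipped": "2015-03-04", "delivered": "2015-03-01"},
    {"order_id": 2, "shipped": "2015-03-06", "delivered": "2015-03-08"},
]

# In-record consistency: a delivery date cannot precede its own
# shipping date (ISO date strings compare correctly as text).
bad_records = [o for o in orders if o["delivered"] < o["shipped"]]

# Cross-record consistency: order_id must be unique across records.
ids = [o["order_id"] for o in orders]
duplicates = {i for i in ids if ids.count(i) > 1}

print(bad_records)  # order 2 violates in-record consistency
print(duplicates)   # {2} violates cross-record consistency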
Missing Values
Missing values (known as NAs) are one of the most basic inconsistencies. Some data analysis methods can deal with NAs, while others may fail when the data has missing input values, or may confuse a missing value with a default category.8

NAs are commonly confused with an unknown category; however, these are two different ideas. An NA states that the information is not available in the dataset, whereas an unknown value indicates that a value was recorded but its true meaning is not known. If the records may carry an unknown category, it should not be conflated with NA values. A simple approach to dealing with NAs is to ignore the records that contain them; when the ratio of NAs to the total data is high, however, it is better to use imputation techniques.
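As a minimal sketch of this decision (using pandas, with a hypothetical income column that mixes true NAs with an explicit “unknown” status category):

import pandas as pd

df = pd.DataFrame({
    "income": [52000, None, 48000, None, 61000],
    "status": ["employed", "unknown", "employed", "retired", "employed"],
})
# Note: the "unknown" status is a recorded value, not an NA.

na_ratio = df["income"].isna().mean()
if na_ratio < 0.1:
    # Few NAs: dropping the affected records is acceptable.
    df = df.dropna(subset=["income"])
else:
    # Many NAs: dropping would discard too much data, so impute
    # instead (here with the mean, the most basic technique).
    df["income"] = df["income"].fillna(df["income"].mean())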
Imputation Techniques
Imputation techniques correct NAs by estimating values based on other information. The most basic imputation method is to use the mean of the observed values, or any other measure of centrality. Another method is known as ratio imputation, where the estimate x̂i is given by an average ratio R̄ between x and a covariate y: x̂i = R̄yi. R̄ is commonly computed as the sum of the observed x values divided by the sum of the corresponding y values. This estimator has the property that x̂i = 0 when yi = 0, which is in general not guaranteed by linear regression.
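A minimal sketch of ratio imputation (the arrays are hypothetical; NaN marks the missing x values):

import numpy as np

x = np.array([10.0, np.nan, 30.0, np.nan])  # x has missing values
y = np.array([5.0, 11.0, 14.0, 0.0])        # fully observed covariate

# R is the sum of observed x values over the corresponding y values.
observed = ~np.isnan(x)
r = x[observed].sum() / y[observed].sum()

# Impute each missing x_i as R * y_i; note the imputed value is 0
# wherever y_i = 0, the property mentioned above.
x_imputed = np.where(observed, x, r * y)
print(x_imputed)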
Generalized linear regression models can also be used as an imputation method. In this case, missing values are estimated by using a linear regression from known variables.
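A sketch of this approach with scikit-learn (hypothetical data: income is imputed from age via a regression fitted on the complete rows only):

import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: age (always known), income (missing in some rows).
data = np.array([
    [25.0, 30000.0],
    [35.0, 45000.0],
    [45.0, np.nan],
    [55.0, 70000.0],
])
missing = np.isnan(data[:, 1])

# Fit income ~ age on the complete rows only.
model = LinearRegression()
model.fit(data[~missing, :1], data[~missing, 1])

# Estimate the missing incomes from the known variable.
data[missing, 1] = model.predict(data[missing, :1])
print(data)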
8 Maytal Saar-Tsechansky and Foster Provost, “Handling Missing Values when Applying Classification Models,” Journal of Machine Learning Research 8, 1625–1657 (2007).
Entity matching
Entity matching, or deduplicating records, is a complex task. The naive approach has quadratic time complexity, because it must check all possible pairs of records, which does not scale to large data. Tamr uses a proprietary blocking technique called “data binning” to approach the problem. A blocking technique divides data into K partitions so that candidate pairs are compared only within each partition, sharply reducing the number of comparisons.
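A minimal sketch of the general blocking idea (not Tamr’s proprietary data binning; the records and the blocking key are hypothetical):

from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "Acme Corp", "city": "Boston"},
    {"id": 2, "name": "Acme Corporation", "city": "Boston"},
    {"id": 3, "name": "Globex Inc", "city": "Denver"},
]

# Bin records by a cheap key (city plus first letter of the name),
# so that expensive pairwise comparison happens only within a bin.
blocks = defaultdict(list)
for rec in records:
    blocks[(rec["city"], rec["name"][0].lower())].append(rec)

candidates = [p for block in blocks.values() for p in combinations(block, 2)]
print(len(candidates))  # 1 candidate pair instead of 3 all-pairs

The trade-off is recall: true matches that land in different blocks are never compared, which is why the choice of blocking key matters.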