The Deal With Big Data
Recognizing this shift toward big data practices, quality professionals must step up their understanding of big data and how organizations can use and take advantage of their transactional data. Standards groups realize big data is here to stay and are beginning to develop foundational standards for big data and big data analytics.

The era of big data is upon us. While providing a formidable challenge to the classically trained quality practitioner, big data also offers substantial opportunities for redirecting a career path into a computational and data-intensive environment.
The change to big data analytics from the status quo of applying quality principles to manufacturing and service operations could be considered a paradigm shift comparable to the changes quality professionals experienced when statistical computing packages became widely available, or when control charts were first introduced.
The challenge for quality practitioners is to recognize this shift and secure the training and understanding necessary to take full advantage of the opportunities.

What's the big deal?
What exactly is big data? You've probably noticed that big data often is associated with transactional data sets (for example, American Express and Amazon), social media (for example, Facebook and Twitter) and, of course, search engines (for example, Google). Most formal definitions of big data involve some variant of the four V's:
++ Volume: Data set size.
++ Variety: Diverse data types residing in multiple locations.
++ Velocity: Speed of generation and transmission of data.
++ Variability: Nonconstancy of volume, variety and velocity.
This set of V's is attributable originally to Gartner Inc., a research and advisory company,1 and documented by the National Institute of Standards and Technology (NIST) in the first volume of a set of seven documents.2
Big data clearly is the order of the day when the quality practitioner is confronted with a data set that exceeds the laptop's memory, which may be by orders of magnitude.
In this article, we'll reveal the big data era to the quality practitioner and describe the strategy being taken by standardization bodies to streamline their entry into the exciting and emerging field of big data analytics. This is all done with an eye on preserving the inherently useful quality principles that underlie the core competencies of these standardization bodies.
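When a data set exceeds the laptop's memory, one standard workaround is to stream it rather than load it whole. The following is a minimal, illustrative sketch (the file and column names are hypothetical, not from the case study): a running total is kept while reading one row at a time, so memory use stays constant regardless of file size.

```python
import csv
import os
import tempfile

# Write a tiny sample file so the sketch is self-contained; in practice
# the CSV would be far larger than memory. File and column names are
# hypothetical examples, not actual Medicare data fields.
path = os.path.join(tempfile.gettempdir(), "claims_sample.csv")
with open(path, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["npi", "avg_submitted"])
    w.writerows([["100", "110.0"], ["101", "95.0"], ["102", "900.0"]])

# Stream the file row by row: only one record is held in memory at a
# time, so the same loop works whether the file has three rows or
# three billion.
total, n = 0.0, 0
with open(path, newline="") as f:
    for row in csv.DictReader(f):
        total += float(row["avg_submitted"])
        n += 1
print(f"rows={n}, mean submitted={total / n:.2f}")
```

The same streaming idea underlies most out-of-core and distributed tooling: compute incremental summaries instead of materializing the full data set.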
FIGURE 1   NIST big data reference architecture: a system orchestrator oversees the big data information flow through five stages—collection, preparation/curation, analytics, visualization and access—set against the IT value chain, with service use and the transfer of software tools and algorithms connecting the components. (NIST = National Institute of Standards and Technology)
2. To exercise the reference architecture developed by NIST (Figure 1).
3. To use ISO/TC 69's existing statistical standards where appropriate for the case study and to identify gaps in standards within TC 69 to conduct the case study. These gaps will be important elements to define a roadmap of future standards for TC 69 to develop.
The team selected a case study to better understand the challenges, "get its hands dirty" and achieve these objectives. The case study chosen was a fraud detection problem pertaining to Medicare payments to providers. The 2012 Medicare public data set was used for the case study and consisted of more than 9 million records aggregated by unique provider (NPI) codes and procedure (HCPCS) codes.
Further, the data set consisted of 29 columns, including a unique identifier for the provider, address, credentials, services aggregated by procedure, unique beneficiaries of these services, average amount submitted for repayment by Medicare, average allowed by Medicare and the average paid by Medicare, along with standard deviations of costs for a given procedure code.9
A specific set of analytic objectives was developed for the case study, referring to the analytics goals themselves:
++ Identify suspicious patterns of medical activities reflected in the sampled data.
++ Document algorithms to be used on the full data set, as well as the updated and appended data sets.
++ Ensure that the tools, techniques and algorithms apply to other large data sets and, in particular, are scalable (that is, capable of handling different data set sizes from different locations) to data sets stored on multiple nodes.
The team identified three key learning opportunities:
1. Analysts working on the application provider layer usually do not have the capability to analyze the totality of the data. This is due, in part, to the large size of the data, which may be stored in multiple nodes and potentially owned by different parties that use different security systems. This characteristic of big data requires new methods that deal with the application of learning from sample data to the original data set. This, in turn, requires standardization so all parties know how to interact.
2. Programming code (that is, code used to prepare and curate the full data set) must be presented to the information systems practitioner in a language (JavaScript, C, Python or some other programming language) that can be executed. This programming code is passed between the application provider layer and the data provider (see Figure 1).
3. Because different experts have different backgrounds, time and a willingness to understand each expert's point of view are required. Consequently, the dynamics of the team and the importance of the team leader cannot be minimized.
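As a toy illustration of the first analytic objective—flagging suspicious payment patterns in provider/procedure records—one simple screen compares each provider's average submitted charge against the median for that procedure code. The field names, sample values and the 3x-median cutoff below are illustrative assumptions, not the study team's actual algorithm:

```python
from statistics import median

# Illustrative records aggregated by provider (NPI) and procedure
# (HCPCS) code; values are made up for the sketch.
records = [
    {"npi": "100", "hcpcs": "99213", "avg_submitted": 110.0},
    {"npi": "101", "hcpcs": "99213", "avg_submitted": 95.0},
    {"npi": "102", "hcpcs": "99213", "avg_submitted": 105.0},
    {"npi": "103", "hcpcs": "99213", "avg_submitted": 100.0},
    {"npi": "104", "hcpcs": "99213", "avg_submitted": 900.0},
]

def flag_outliers(records, factor=3.0):
    """Group records by procedure code and flag providers whose average
    submitted charge exceeds `factor` times the group median. The median
    is used because it is robust to the very outliers being hunted."""
    by_code = {}
    for r in records:
        by_code.setdefault(r["hcpcs"], []).append(r)
    flagged = []
    for code, group in by_code.items():
        med = median(r["avg_submitted"] for r in group)
        for r in group:
            if r["avg_submitted"] > factor * med:
                flagged.append((r["npi"], code))
    return flagged

print(flag_outliers(records))  # provider 104 stands out within 99213
```

A documented screen like this also illustrates the second objective: the algorithm can be written down, passed between parties and rerun unchanged on the full or updated data set.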
…lem involving big data.
Teamwork—Due to the complexity of data mining software, analytical software, statistical techniques, hardware platforms, data structures and the knowledge required to understand the underlying processes that generate data, it is imperative that organizations implement a team-based approach.
The Venn diagram shown in Figure 3 illustrates the need for interactions, communications and teamwork. Of course, the degree of interactions and communications among the team members depends on the nature of the problem and the level of skills each member possesses.
Note that Figure 3 has introduced a team member called the end consumer. This team member is generally the individual who issued a request for data analysis or posed a question or problem to be addressed. This individual receives the output of the big data analysis.
In some cases, the five roles introduced in Figure 3 may be reduced. For example, the quality practitioner may be the end consumer, the process SME and the statistician—all depending on the skill level achieved in each discipline. Regardless of the number of roles involved, continuous and effective communication is required, and teamwork is paramount.

FIGURE 3   Teamwork facilitates successful big data applications: a Venn diagram of five overlapping roles—process SME, IT practitioner, statistician, quality practitioner and end consumer. (SME = subject matter expert)

Data integrity and quality—This issue is the scorn of big data. Any meaningful analysis demands a high level of the integrity and quality of the data. Often, this is not the case in the big data era. Such was the situation with the Medicare fraud case study data.
Due to multiple points of data entry (that is, medical providers across the country and beyond), the data for a single data field was expressed in different ways. For example, the "MD" suffix for a medical provider may have been entered as "MD" or "M.D." Generally, any analytical software recognizes these data as distinct data elements. Such situations—and there are many of them—must be identified separately and corrected. This requires significant time and effort, as well as teamwork among the IT practitioner and other team members.
Given these challenges, consider how difficult it would be for a global organization to analyze quality data across the organization.
For example, suppose multiple product lines exist within a site, multiple sites comprise a division, multiple divisions comprise a business unit and multiple business units comprise the total global organization. Such a structure is not uncommon. If the global organization grew through mergers and acquisitions, it would be reasonable to conclude that the collective quality data across the organization would constitute a problem of enormous inconsistency. Now expand the scenario to include multiple global organizations (think industry), and the analysis grinds to a halt.
In such situations, successful results can only be obtained when data are defined and interpreted in a uniform manner. For example, the United States has developed a unified set of state abbreviations (for example, PA, FL, AZ and IL). This set of abbreviations is now universally accepted and embedded in all data collection systems—often through the use of drop-down menus. Similarly, other examples can be readily found.
Incorrect data will never be eliminated, but it can be minimized. This is a substantial challenge in the big data era.

More contributions necessary
At its annual meeting in June in Cape Town, South Africa, TC 69 launched three new work items on big data analytics:
++ Statistical terminology.
++ Methodology to flesh out the application provider in Figure 1 and enhance the development of CRISP-DM.
++ Model validation in which a different approach to
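The "MD" versus "M.D." problem described earlier is a classic normalization task. A minimal sketch follows; the rule (strip everything but letters, uppercase the rest) is an illustrative assumption, and a real cleanup would validate results against a vetted reference list of credentials:

```python
import re

def normalize_credential(raw):
    """Collapse punctuation, spacing and case variants of a credential
    suffix (e.g., 'M.D.', 'm.d.', 'M D') to one canonical form ('MD').
    Illustrative rule only: keep uppercase letters, drop everything else."""
    return re.sub(r"[^A-Z]", "", raw.upper())

# Variants that analytical software would otherwise treat as distinct
# data elements.
variants = ["MD", "M.D.", "m.d.", "M D", " M.D"]
print({v: normalize_credential(v) for v in variants})
```

After normalization, all five variants map to a single value, so counts and joins on the credential field no longer fragment. Drop-down menus, as with the state-abbreviation example, prevent the problem at the point of entry instead.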