

The Deal With Big Data

Move over! Big data analytics and standardization are the next big thing in quality

by Michele Boulanger, Wo Chang, Mark Johnson and T.M. Kubiak

Just the Facts
++ More and more organizations have realized the important role big data plays in today's marketplaces.
++ Recognizing this shift toward big data practices, quality professionals must step up their understanding of big data and how organizations can use and take advantage of their transactional data.
++ Standards groups realize big data is here to stay and are beginning to develop foundational standards for big data and big data analytics.

The era of big data is upon us. While providing a formidable challenge to the classically trained quality practitioner, big data also offers substantial opportunities for redirecting a career path into a computational and data-intensive environment.

The change to big data analytics from the status quo of applying quality principles to manufacturing and service operations could be considered a paradigm shift comparable to the changes quality professionals experienced when statistical computing packages became widely available, or when control charts were first introduced.



The challenge for quality practitioners is to recognize this shift and secure the training and understanding necessary to take full advantage of the opportunities.

What's the big deal?
What exactly is big data? You've probably noticed that big data often is associated with transactional data sets (for example, American Express and Amazon), social media (for example, Facebook and Twitter) and, of course, search engines (for example, Google). Most formal definitions of big data involve some variant of the four V's:
++ Volume: Data set size.
++ Variety: Diverse data types residing in multiple locations.
++ Velocity: Speed of generation and transmission of data.
++ Variability: Nonconstancy of volume, variety and velocity.
This set of V's is attributable originally to Gartner Inc., a research and advisory company,1 and documented by the National Institute of Standards and Technology (NIST) in the first volume of a set of seven documents.2 Big data clearly is the order of the day when the quality practitioner is confronted with a data set that exceeds the laptop's memory, which may be by orders of magnitude.
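As a concrete illustration of working beyond memory limits, the following minimal sketch computes a grand mean by streaming a large CSV file in chunks with pandas instead of loading it whole. The file name and the "amount" column are hypothetical stand-ins, not a data set from this article:

```python
import pandas as pd

# Running totals, so the mean is computed without holding all rows at once.
total = 0.0
count = 0

# "transactions.csv" and "amount" are assumed names for a file that is
# orders of magnitude larger than available memory.
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()
    count += len(chunk)

print("grand mean:", total / count)
```

The same chunked pattern extends to most one-pass summaries (counts, sums, min/max), which is often all that is feasible on a single laptop.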
In this article, we'll reveal the big data era to the quality practitioner and describe the strategy being taken by standardization bodies to streamline their entry into the exciting and emerging field of big data analytics. This is all done with an eye on preserving the inherently useful quality principles that underlie the core competencies of these standardization bodies.

FIGURE 1: NIST big data reference architecture. The information value chain runs from the data provider, through the big data application provider (collection, preparation/curation, analytics, visualization and access), to the data consumer, coordinated by a system orchestrator. The IT value chain is supplied by the big data framework provider: processing frameworks (batch, streaming and messaging/communications, with resource management), platforms for data organization and distribution (file systems and indexed storage) and infrastructures for networking, computing and storage (physical and virtual resources). Security and privacy, and management, span all layers. NIST = National Institute of Standards and Technology.


Primary classes of big data problems
The 2016 ASQ Global State of Quality reports included a spotlight report titled "A Trend? A Fad? Or Is Big Data the Next Big Thing?"3-6 hinting that big data is here to stay. If the conversion from acceptance sampling, control charts or design of experiments seems a world away from the tools associated with big data, rest assured that the statistical bases still apply.

Of course, the actual data, per the four V's, are different. Relevant formulations of big data problems, however, enjoy solutions or approaches that are statistical, though the focus is more on retrospective data and causal models in traditional statistics, and more on forward-looking data and predictive analytics in big data analytics.7

Two primary classes of problems occur in big data (a brief illustration follows the list):
++ Supervised problems occur when there is a dependent variable of interest that relates to a potentially large number of independent variables. For this, regression analysis comes into play, for which the typical quality practitioner likely has some background.
++ Unsupervised problems occur when unstructured data are the order of the day (for example, doctor's notes, medical diagnostics, police reports or internet transactions). Unsupervised problems seek to find the associations among the variables. In these instances, cluster and association analysis can be used. The quality practitioner can easily pick up such techniques.
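To make the two classes concrete, here is a minimal sketch (synthetic data; scikit-learn is assumed to be available) that fits a regression model for a supervised problem and a k-means clustering for an unsupervised one:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Supervised: a dependent variable y related to independent variables X.
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)
model = LinearRegression().fit(X, y)
print("estimated coefficients:", model.coef_)

# Unsupervised: no dependent variable; look for groupings instead.
Z = np.vstack([rng.normal(0, 1, size=(250, 2)),
               rng.normal(5, 1, size=(250, 2))])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
print("cluster sizes:", np.bincount(clusters))
```

In the supervised case a known target trains the model; in the unsupervised case the algorithm must discover structure on its own.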
The world of standardization and big data analytics
Many fields of expertise share an interest in big data analytics, including statistics, artificial intelligence, information systems, and other fields in which the data are collected in great quantities (marketing, quality, finance, supply chain and engineering, for example).

The convergence of so many technical areas motivates standardization activities in areas such as terminology, algorithms and reporting. A search for big data analytics on the International Organization for Standardization (ISO) website, in fact, identified 54 items that include standards, reports and guidelines. Most of these come from Joint Technical Committee 1 (JTC 1) Information Technology. Others come from vertical committees, such as information security management, energy, transportation and farming.

ISO/IEC JTC 1 has established Working Group 9 (WG 9) to develop foundational standards for big data, including reference architecture and vocabulary standards for guiding big data efforts throughout JTC 1 upon which other standards can be developed.

NIST is the key contributor to JTC 1/WG 9. Its goal is to develop a standard interoperability framework that offers scientists and other experts an architecture to ease the process of big data analytics—whether during the collection, storage, analysis, or deployment of models or any other phase encountered in the big data arena.

The NIST Reference Architecture has been documented in a seven-volume publication that can be downloaded from the NIST website and is referenced as the NIST Big Data Interoperability Framework.8 Figure 1 shows a high-level view of the architecture.

The application provider in Figure 1 is the area in which ISO Technical Committee (TC) 69 Applications of Statistical Methods, statisticians and quality practitioners, are positioned. There are three opportunities to interface with other fields of expertise:
1. The data provider provides or feeds data or information into the big data system.
2. The framework provider executes certain statistical algorithms while protecting the privacy and integrity of the data.
3. The data (end) consumer in this framework constitutes end users or other systems that use the results of the big data application provider.

It is interesting to see that visualization—usually a step in the analysis that follows the data preparation—is now a step following the actual analysis of the data. The rationale is that visualization of the entire data set is unwieldy, if not impossible, while visualization of results at the post-analysis (that is, the presentation) stage facilitates understanding.

Joint partnership with JTC 1/WG 9 and NIST
Given the multidisciplinary nature of big data analytics, it is a natural step for ISO/TC 69 to establish a liaison with JTC 1/WG 9. A small joint team of TC 69 experts and JTC 1/WG 9 experts, along with NIST experts, was assembled with the following business and analytics objectives:
1. To mimic—as much as possible—the operation of a data mining or business analytics team assembled to address a case study selected by the team.
2. To exercise the reference architecture developed by NIST (Figure 1).
3. To use ISO/TC 69's existing statistical standards where appropriate for the case study and to identify gaps in standards within TC 69 to conduct the case study. These gaps will be important elements to define a roadmap of future standards for TC 69 to develop.


FIGURE 2: Preliminary roadmap of potential standards for big data, spanning the collect, prepare/curate, analyze, visualize and access phases. Entries include sampling guidance (random and stratified), assessing data quality, outlier identification, validation (a TR surveying/compiling the existing scientific literature and standards, with guidance for use), scoring the full data set and the interface between the application provider and the framework provider, the choice of analytical techniques based on storage platforms, ensuring the quality of input processes for data quality, and a new CRISP for each phase. CRISP = cross-industry standard process; TR = technical report.

The team selected a case study to better understand the challenges, "get its hands dirty" and achieve these objectives. The case study chosen was a fraud detection problem pertaining to Medicare payments to providers.

The 2012 Medicare public data set was used for the case study and consisted of more than 9 million records aggregated by unique provider (NPI) codes and procedure (HCPCS) codes. Further, the data set consisted of 29 columns, including a unique identifier for the provider, address, credentials, services aggregated by procedure, unique beneficiaries of these services, average amount submitted for repayment by Medicare, average allowed by Medicare and the average paid by Medicare, along with standard deviations of costs for a given procedure code.9

A specific set of analytic objectives was developed for the case study, referring to the analytics goals themselves:
++ Identify suspicious patterns of medical activities reflected in the sampled data.
++ Document algorithms to be used on the full data set, as well as the updated and appended data sets.
++ Ensure that the tools, techniques and algorithms apply to other large data sets and, in particular, are scalable (that is, capable of handling different data set sizes from different locations) to data sets stored on multiple nodes.

The team identified three key learning opportunities:
1. Analysts working on the application provider layer usually do not have the capability to analyze the totality of the data. This is due, in part, to the large size of the data, which may be stored in multiple nodes and potentially owned by different parties that use different security systems. This characteristic of big data requires new methods that deal with the application of learning from sample data to the original data set (see the sketch following this list). This, in turn, requires standardization so all parties know how to interact.
2. Programming code (that is, code used to prepare and curate the full data set) must be presented to the information systems practitioner in a language (JavaScript, C, Python or some other programming language) that can be executed. This programming code is passed between the application provider layer and the data provider (see Figure 1).
3. Because different experts have different backgrounds, time and a willingness to understand each expert's point of view is required. Consequently, the dynamics of the team and the importance of the team leader cannot be minimized.


TC 69 and big data
TC 69 decided to create an ad hoc group to explore the feasibility of developing standard methods in the area of big data analytics, in the context of the partnership with ISO/IEC JTC 1/WG 9 and NIST for the reference architecture they are developing.

The first item on the charter of that ad hoc group was to conduct a gap analysis to assess the applicability of 102 existing TC 69 standards and 34 documents in various stages of development.10

Gap analysis: This task consisted of developing a set of requirements characterizing the treatment of big data to evaluate the appropriateness of the TC 69 documents in the big data world. All 136 documents within TC 69 were evaluated against these requirements.

It was recognized that all 59 control chart, process management and acceptance sampling documents can be useful in big data analytics, especially in the preparation and curation phase, as illustrated in Figure 1. This phase relates to controlling the processes generating the data or assessing the quality of the data.

TC 69 has 63 of the 136 (46%) documents that can be viewed as relevant. None of these documents, however, were initially developed specifically to address the big data environment. Therefore, they are likely to require some level of interpretation and modification to accommodate big data.

Preliminary roadmap development: The next item for the ad hoc group consisted of creating a preliminary roadmap of documents that TC 69 should consider to standardize big data analytics. The roadmap shown in Figure 2 was developed following a discussion of the impact of the potential changes to existing standards that will be needed to address big data, or of the development of brand-new documents.

This roadmap defines a portfolio of documents to be developed in such a way that they link to each other and create a consistent set of tools and procedures. The roadmap is expected to be revised as standards are developed.

Challenges in big data analytics
The challenges with regard to big data analytics are formidable and, hopefully, not insurmountable. A few key examples include:

Problem formulation—Successful problems are solved, first and foremost, by defining the problem correctly and succinctly. This is particularly true with regard to big data analytical problems. With big data sets, the number of data fields can be enormous, the data fields may not be fully understood, necessary data fields may be missing or a direct means of identifying the solution is not available.

The Medicare fraud case study mentioned earlier illustrates the difficulty that can be incurred when trying to formulate a meaningful problem statement—particularly with regard to identifying the solution. Initially, the problem or question to address was, "Where does Medicare fraud exist?" After significant and time-consuming analyses, it became clear that a direct answer to this question did not exist because the analysis pointed toward real outliers, but only potentially fraudulent activities.

The reason for this dilemma was the simple fact that no data exist that can readily confirm the presence of fraud. Statistical tools can only suggest the presence of fraud. Therefore, the problem statement must be reformulated as follows: "Where does the potential for Medicare fraud exist?"

After such situations are identified statistically, further investigation is reduced to a hands-on approach. Such an approach might include visits to medical provider offices, patient interviews and record reviews.
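To make the reformulated problem concrete, here is a minimal sketch of the kind of screening involved (the file and column names are assumed for illustration). It flags provider/procedure combinations whose average submitted charge sits far from their procedure-level peers; the output is a list of leads for hands-on follow-up, not a finding of fraud:

```python
import pandas as pd

# Hypothetical provider-by-procedure table; names are assumptions only.
df = pd.read_csv("medicare_2012.csv")

# Standardize each average submitted charge against the distribution of
# its own procedure code, so specialties are not compared to each other.
grouped = df.groupby("hcpcs_code")["avg_submitted"]
df["z"] = (df["avg_submitted"] - grouped.transform("mean")) / grouped.transform("std")

# A large |z| only suggests *potential* fraud; confirmation is manual work.
suspects = df.loc[df["z"].abs() > 4, ["npi", "hcpcs_code", "avg_submitted", "z"]]
print(suspects.sort_values("z", ascending=False).head(10))
```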
Skill sets—In the past, a quality practitioner would often analyze processes using a spreadsheet or specialized software such as Minitab, JMP and SAS. Problems were usually limited to analyzing a single process using small structured data sets located in a single database on a single desktop computer.

Frequently, the quality practitioner would produce a variety of charts (such as control charts and Pareto charts) and occasionally run a simple linear regression or two. Problems dealing with determining the underlying relationships that might exist among multiple variables (that is, data fields) were often avoided. This occurred because the quality practitioner did not possess the appropriate statistical skill level, or the analytical tools were not available or did not exist.

Enter big data sets, and the landscape changes significantly, as described earlier. You can easily see from Figure 1 that such a framework necessitates the involvement of information technology specialists, process subject matter experts (SME), statisticians, quality practitioners and possibly many others.
These skill sets are not mutually exclusive; however, the individual disciplines are necessary to effectively analyze a problem involving big data.

Teamwork—Due to the complexity of data mining software, analytical software, statistical techniques, hardware platforms, data structures and the knowledge required to understand the underlying processes that generate data, it is imperative that organizations implement a team-based approach.

The Venn diagram shown in Figure 3 illustrates the need for interactions, communications and teamwork. Of course, the degree of interaction and communication among the team members depends on the nature of the problem and the level of skills each member possesses.

FIGURE 3: Teamwork facilitates successful big data applications. A Venn diagram of five overlapping roles: IT practitioner, process SME, statistician, quality practitioner and end consumer. SME = subject matter expert.

Note that Figure 3 has introduced a team member called the end consumer. This team member is generally the individual who issued a request for data analysis or posed a question or problem to be addressed. This individual receives the output of the big data analysis.

In some cases, the five roles introduced in Figure 3 may be reduced. For example, the quality practitioner may also be the end consumer, the process SME and the statistician—all depending on the skill level achieved in each discipline. Regardless of the number of roles involved, continuous and effective communication is required, and teamwork is paramount.
Data integrity and quality—This issue is the scourge of big data. Any meaningful analysis demands a high level of data integrity and quality. Often, this is not the case in the big data era. Such was the situation with the Medicare fraud case study data.

Due to multiple points of data entry (that is, medical providers across the country and beyond), the data for a single data field were expressed in different ways. For example, the "MD" suffix for a medical provider may have been entered as "MD" or "M.D." Generally, analytical software recognizes these entries as distinct data elements. Such situations—and there are many of them—must be identified separately and corrected. This requires significant time and effort, as well as teamwork between the IT practitioner and other team members.
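Cleanup of the kind described for the "MD" suffix is usually scripted rather than done by hand. A small illustrative sketch (the variants and rules shown are assumptions, not the case study's actual cleaning code) collapses spelling variants into one canonical value with pandas:

```python
import pandas as pd

# Assumed raw credential entries from multiple points of data entry.
raw = pd.Series(["MD", "M.D.", "m.d.", " M D ", "DO", "D.O."])

# Uppercase, then strip periods and whitespace, so "M.D." matches "MD".
canonical = (raw.str.upper()
                .str.replace(r"[.\s]", "", regex=True))

print(canonical.value_counts())  # MD -> 4, DO -> 2
```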
Given these challenges, consider how difficult it would be for a global organization to analyze quality data across the organization.

For example, suppose multiple product lines exist within a site, multiple sites comprise a division, multiple divisions comprise a business unit and multiple business units comprise the total global organization. Such a structure is not uncommon. If the global organization grew through mergers and acquisitions, it would be reasonable to conclude that the collective quality data across the organization would constitute a problem of enormous inconsistency. Now expand the scenario to include multiple global organizations (think industry), and the analysis grinds to a halt.

In such situations, successful results can only be obtained when data are defined and interpreted in a uniform manner. For example, the United States has developed a unified set of state abbreviations (for example, PA, FL, AZ and IL). This set of abbreviations is now universally accepted and embedded in all data collection systems—often through the use of drop-down menus. Similarly, other examples can be readily found.

Incorrect data will never be eliminated, but they can be minimized. This is a substantial challenge in the big data era.

More contributions necessary
At its annual meeting in June in Cape Town, South Africa, TC 69 launched three new work items on big data analytics:
++ Statistical terminology.
++ Methodology to flesh out the application provider in Figure 1 and enhance the development of CRISP-DM.
++ Model validation, in which a different approach to traditional statistical techniques was determined to be needed due to the large volume, variety, velocity and variability of the data (a minimal illustration follows this list).
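One way such validation can differ from traditional in-sample checks is a holdout split, in which the model is judged on data it never saw. This is one common approach, sketched here on synthetic data; it is not necessarily the approach the work item will standardize:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=1.0, size=10_000)

# Hold out 20% of the data; validate on rows the model never saw,
# rather than on the same data used to fit it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("holdout R^2:", r2_score(y_test, model.predict(X_test)))
```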


Many challenges remain for TC 69. One of the most difficult is the recruitment of analytical experts, information system experts and SMEs to participate in the development phase of these documents.

The partnership with ISO/IEC JTC 1/WG 9 and NIST is crucial to maintain connectivity to their expertise as needed by the statistically inclined participants. With an exciting program ahead, the challenge consists of recruiting additional experts on a volunteer basis and selling to management the advantages of standardization and the learning that comes with it.

Quality practitioners have an opportunity to join the group as SMEs initially while they assimilate the sophisticated techniques that make up the core of big data analytics. Experts are needed in the three areas targeted for documentation: terminology, methodology and model validation.

Michele Boulanger is an associate professor of statistics and international business at Rollins College in Winter Park, FL. She is chair of ISO/TC 69, past chair and current member of the U.S. Technical Advisory Group (TAG) to ISO/TC 69, and a member of the Z1S Standards Committee. She is an academician in the International Academy for Quality.

Wo Chang is the digital data advisor for the NIST IT Laboratory. He chairs the ISO/IEC Joint Technical Committee (JTC) 1 working group (WG) 9 on big data, the Institute of Electrical and Electronics Engineers Big Data Governance and Metadata Management group, and the ISO/IEC JTC 1 Standards Committee (SC) 29 WG 11 (MPEG) Multimedia Preservation ad hoc group.

Mark Johnson is professor of statistics at the University of Central Florida in Orlando. He is a fellow of the American Statistical Association, an elected member of the International Statistical Institute, a chartered statistician with the Royal Statistical Society, chair of ISO/TC 69/SC 1 on terminology, a member of the U.S. TAG to ISO/TC 69 and statistics coordinator for the ASQ Statistics Division. He has received the Brumbaugh Award, two Shewell Awards and the Jack Youden Award from ASQ. He has also received the Shin Excellence in Research Award, sponsored by the Geneva Association and the International Insurance Society.

T.M. Kubiak is president and founder of the consulting firm Performance Improvement Solutions, located in Fort Mohave, AZ. He is an ASQ Fellow, a member of the U.S. TAG to ISO/TC 69, a member of the Z1S Standards Committee and the delegated expert on Six Sigma for the U.S. TAG to ISO/TC 69.

REFERENCES AND NOTES
1. Frank Buytendijk and Thomas W. Oestreich, "Organizing for Big Data Through Better Process and Governance," Gartner Research Report, March 10, 2015, www.gartner.com/doc/3002918/organizing-big-data-better-process.
2. National Institute of Standards and Technology (NIST), "NIST Big Data Interoperability Framework Version 1.0," report, September 2015, https://bigdatawg.nist.gov/V1_output_docs.php. The framework includes big data definitions, taxonomies, use cases and requirements, security and privacy, architecture white paper survey, reference architecture and a standards roadmap.
3. ASQ and American Productivity and Quality Center (APQC), "The Global State of Quality 2 Research: Discoveries 2016," https://asq.org/quality-resources/research/global-state-of-quality/reports.
4. ASQ and APQC, "Spotlight Report: KPIs Key to Successful Supply Chain," https://asq.org/quality-resources/research/global-state-of-quality/reports.
5. ASQ and APQC, "Spotlight Report: A Trend? A Fad? Or Is Big Data the Next Best Thing?" https://asq.org/quality-resources/research/global-state-of-quality/reports.
6. ASQ and APQC, "Spotlight Report: Innovation and Quality Go Hand in Hand," https://asq.org/quality-resources/research/global-state-of-quality/reports.
7. Galit Shmueli, "To Explain or to Predict?" Statistical Science, 2010, Vol. 25, No. 3, pp. 289-310.
8. NIST, "NIST Big Data Interoperability Framework," see reference 2.
9. A description of the data is provided in the file PartBNationalSummaryReadmeFile2012.pdf, available at www.cms.com.
10. The program of work and existing standards are being handled by ISO/TC 69. See www.iso.org/iso/home/standards_development/list_of_iso_technical_committees/iso_technical_committee.htm?commid=49742 for more information.

