Business Analytics
A Management Approach
Richard Vidgen
Business School, University of New South Wales, Sydney, Australia
Samuel N. Kirshner
Business School, University of New South Wales, Sydney, Australia
Felix Tan
Business School, University of New South Wales, Sydney, Australia
ISBN 978-1-352-00725-1
e-ISBN 978-1-352-00726-8
https://doi.org/10.26777/978-1-352-00726-8
Preface
The content of this book has been developed through teaching MBA, undergraduate, and postgraduate courses on business analytics over several years. While the book is targeted at an MBA and business audience, we go reasonably deeply into data collection and exploration, predictive modelling techniques, and data communication. This helps managers gain insight into what data scientists actually do, to understand the impact of analytics on the organization, and to focus on how value can be created. While we do not expect managers to become data scientists (although some do), we aim to equip them with some basic skills in predictive modelling. Indeed, the introduction of automated machine learning (AML) with DataRobot takes this to a new level, since one benefit of AML is that advanced data science techniques become accessible to citizens and managers.
A further aim is to have all the software available via a web browser, hence the choice of SAS Visual Analytics and DataRobot. This facilitates distance-taught courses and avoids the installation and hosting issues associated with software in universities and organizations more generally. We also cover the programming language R, which, while being installed locally, is open source and free to use, for those with some familiarity with programming (or a willingness to learn).
SAS Visual Analytics can be accessed free of charge via Teradata University Network (TUN) by students and is therefore freely accessible for teaching. Students can gain access to DataRobot, subject to their institution joining the DataRobot faculty programme.
There is a companion website for the book (http://macmillanihe.com/vidgenbusiness-analytics) that contains resources for students and instructors. In particular, the site contains the datasets used in the book and further resources, such as the accompanying R code. We intend to grow the online resources for this book and welcome feedback in the form of contributions, suggestions for improvements, and, of course, corrections.
We thank SAS for giving us permission to include screenshots of their Visual Analytics product; IBM for permission to include screenshots of Watson Analytics; and DataRobot for giving us permission to reproduce screenshots of their DataRobot software.
Table Of Contents
List of Boxes, Tables and Figures
Preface
Part I Business Analytics in Context
1. Introduction
2. Business Analytics Development
3. Data and Information
Part II Tools and Techniques
4. Data Exploration
5. Clustering and Segmentation
6. Predictive Modelling with Regression
7. Predictive Modelling with Logistic Regression
8. Predictive Modelling with Classification and Regression Trees
9. Visualization and Communication
10. Automated Machine Learning
11. R
12. Working with Unstructured Data
13. Social Networks
Part III Organizational Aspects
14. Business Analytics Development Methodology
15. Design and Agile Thinking
16. Ethical Aspects of Business Analytics
Appendices:
Appendix A – Dataset Descriptions
Appendix B – GoGet Case Study
1.2 Open data available from the London Datastore (LDS) for ‘Crime
and Community Safety’
1.3 The Internet of Things
1.4 Google Glass (https://www.varifocals.net/google-glass/)
1.5 A taxonomy of disciplines related to analytics (Mortenson et al. 2015)
1.6 Business analytics function
2.1 Core elements of a business analytics development function
2.2 Steps in the analytics process
2.3 Phases of the CRISP-DM reference model (Chapman et al. 2000, p.13)
2.4 An A/B test
2.5 An A/B test in the UK courts service (Haynes et al. 2012, p. 10, fig. 5)
2.6 Artificial intelligence (AI), machine learning, and deep learning (reprinted from Chollet 2018, p.4, Copyright (2018), with permission from Manning Publications)
2.7 Data scientist attributes (Data Science Radar™, reprinted with permission from Mango Solutions 2019)
2.8 The DataRobot approach to automated machine learning (https://blog.datarobot.com/ai-simplified-what-is-automated-machine-learning)
2.9 Aligning the analytics development function
3.1 From data to wisdom
3.2 Farr’s analysis of mortality data (Farr 1885)
3.8 Exponential distribution
4.1 Anscombe’s quartet
4.2 Scatter plot showing the relationship between television, earnings, and age for a small sample of the dataset
4.3 Heat map showing the relationship between television, earnings,
and age for the entire dataset
4.4 The top of the SAS VA homepage window
4.5 Data Explorer window
4.6 Data options
4.7 Automatic chart
4.8 Properties of the automatic chart
4.9 Role tab options
4.10 Bar chart aggregated by the sum of each employee’s age
4.11 Change the aggregation on a bar chart
4.12 Bar chart aggregated by the average age of each employee
4.13 Bar chart of average age across job roles and gender
4.17 Creating a hierarchy for the dataset country
4.18 Creating a custom category for the dataset country
4.19 Creating a new variable for the dataset country
4.20 Viewing the properties of measure data
4.21 Bar chart in SAS VA
4.22 Bar chart with grouping in SAS VA
4.23 Histogram in SAS VA
4.24 Line chart in SAS VA
4.25 Scatter chart in SAS VA
4.34 Bar chart displaying the proportion of customers who are smokers
4.35 Histogram of the age variable
4.36 Setting a filter
4.37 Creating a new variable, age 2
4.38 Histogram of BMI
4.39 Bar chart visualization showing charges by region and sex
4.40 Bar chart visualization showing average charges by region and
smoker
4.41 Bar chart visualization showing average charges by region,
whether the charge is from a smoker and whether BMI is over or under
30
4.42 Line chart visualization showing average charges by age, whether the charge was made by a smoker, and whether BMI is over or under 30
4.43 Nested if statements
4.44 BMI and smoker grouped by age
4.45 Bubble chart of BMI and smoker
4.46 Bubble chart grouped by male and female
5.1 Clustering Mario Kart characters
5.8 Geo map of cultural clusters (based on three cluster groups)
5.9 Geo map cultural clusters (based on ten cluster groups)
6.1 Graph of exam marks – actual versus predicted (mean)
6.2 Scatter plot of hours of revision against exam mark with a fitted regression line
6.3 Scatter plot of hours of revision against exam mark with a fitted regression line and error terms
6.4 Creating a simple linear regression model in SAS VA
6.5 Linear regression model results in SAS VA
6.6 Multiple regression visualization produced in SAS VA
6.7 Residuals (scatter plot)
6.8 Residuals (histogram)
6.9 Residual plot – identifying outliers
6.10 Influence plot
6.11 Kitchen quality as a single, categorical predictor of sale price
6.12 Creating an interaction effect
6.13 Setting the variable selection parameter
6.14 House sale price model (variable selection = 0.01)
7.6 Setting the response event
7.7 Setting properties of the analysis
7.8 SAS VA logistic regression results
7.9 SAS VA logistic regression fit summary
7.10 SAS VA logistic regression assessment – misclassification
7.11 SAS VA logistic regression assessment – lift
7.12 SAS VA logistic regression assessment – ROC
7.13 SAS VA logistic regression assessment – inspection of residuals
7.14 SAS VA logistic regression assessment – residuals
7.15 SAS VA generalized linear model (GLM) applied to logistic regression
7.16 SAS VA GLM model results
8.3 Setting the event level to ‘Survived’
8.4 SAS VA decision tree model with Sex as a single predictor
8.5 SAS VA decision tree model with Sex and Age as predictors
8.6 Entropy graph
8.7 SAS VA decision tree variables and growth strategy
8.8 SAS VA decision tree
8.9 SAS VA decision tree model performance
8.10 SAS VA decision tree model performance – misclassification
8.11 SAS VA decision tree model advanced growth strategy
8.12 SAS VA decision tree model advanced growth strategy
8.13 SAS VA decision tree model custom growth strategy
8.14 Model comparison – selecting the models to be compared
8.15 Model comparison – logistic regression vs. decision tree
8.16 Decision tree with a continuous target
8.17 Decision tree with a continuous target
8.18 Variables used to predict house price (partial)
8.19 Model performance (ROC curve)
9.1 Example of a social network diagram
9.2 Unordered and ordered divergent colour spectrums
9.6 Sample visual discovery – exploring countries’ wine by price and
production quantity
9.7 Sample dashboard showing a report on sales execution
9.8 First bar chart in the sample report on sales execution
9.9 Two bar charts for the sample report on sales execution
9.10 First two bar charts with bullet gauges in the sample report on
sales execution
9.11 Formatted bullet gauges in the sample report on sales execution
9.12 Sample report on sales execution with controls to filter data on Performance and non-auto firms with 100K or less revenue
9.13 Interaction view for the sample report on sales execution
9.14 Using hierarchies in the sample report on sales execution
9.15 Dashboard on sales execution in the Report Viewer
10.1 The DataRobot predictive modelling process
10.2 Creating a new project in DataRobot
10.3 Uploading data
10.4 Exploring the dataset
10.5 Data exploration (Fare)
10.6 Creating a new feature (Child) using transform
10.10 DataRobot at work
10.11 Feature importance
10.12 Histogram post Autopilot
10.13 The DataRobot leaderboard
10.14 Training, validation, and Holdout partitions
10.15 Data partitioning (source: DataRobot documentation)
10.16 Blueprint for the recommended model – eXtreme Gradient Boosted Trees Classifier (M85)
10.17 Blueprint for the most accurate model – Advanced AVG Blender model (M88)
10.18 Performance – lift chart
10.19 Performance – ROC (confusion matrix)
10.20 Performance – prediction distribution
10.21 Performance – ROC (KS and AUC)
10.22 Feature impact
10.23 Feature effects – categorical feature (Pclass)
10.24 Feature fit – continuous feature (Age)
10.25 Feature effect – continuous feature (Age)
10.26 Prediction explanations
10.27 Insights menu
10.28 Insights from text analysis – text mining
10.29 Insights from text analysis – Word Cloud
10.33 Leaderboard – sorted by Gamma deviance
10.34 Leaderboard – sorted by R-squared
10.35 Lift chart
10.36 Feature impact
10.37 Feature effects (OverallQual)
10.38 Partial dependence (OverallQual) rescaled
10.39 House price prediction explanations
10.40 Learning curve (houseprice)
10.41 Speed vs. accuracy (houseprice)
11.3 The RStudio interface (RStudio is a trademark of RStudio, Inc)
11.4 Getting help in R – help(getwd) (RStudio is a trademark of
RStudio, Inc)
11.5 Installing the package ‘psych’ in RStudio (RStudio is a trademark of
RStudio, Inc)
11.6 Histogram for sales variable
11.7 Box plot for Press and Sales variables
11.8 QQ plot for sales variable
11.9 Scatter plot of TV and Sales
11.10 Enhanced scatter plot of TV and Sales using ggplot()
11.14 Checking for influential observations
11.15 R ROC curve with area under curve (AUC)
11.16 Decision tree
11.17 Variable importance (decision tree)
11.18 Variable importance (random forest)
12.1 Word cloud of Twitter data (produced in Voyant)
12.2 Histogram of sentiment scores for Twitter text @RealDonaldTrump (produced by the authors using R)
12.3 Testing for lycanthropy
12.4 Bayes’ theorem
12.5 Watson VR (https://www.ibm.com/cloud/watson-visual-recognition)
12.6 Watson VR
12.7 Text recognition
12.8 Zoomable map of actual food bank usage
12.9 London Ward Atlas (mortgage repossessions)
13.1 Network structures
13.2 Directed and undirected networks
13.3 Edge lists – directed and undirected networks
13.4 Matrix of a directed network
13.5 Matrix of an undirected network
13.6 Tie strength
13.10 Network reciprocity (directed networks only)
13.10
13.11 Degree centrality
13.12 Closeness centrality
13.13 Betweenness centrality
13.14 Eigenvector centrality
13.15 Clustering – clique and k-core
13.16 Clustering – ‘group in a box’ (Rodrigues et al. 2011)
13.17 Network components
13.18 Social network analysis for fraud detection (CGI Group 2011)
13.27 MBA network coloured by cohort, sized by in-degree
13.28 MBA network coloured by community cluster, sized by in-degree
14.1 Business analytics as a co-evolving ecosystem (reprinted from Vidgen et al. 2017, p.635, Copyright (2017), with permission from Elsevier)
14.8 The business model canvas with a generic analytics overlay
14.9 Analytics leverage matrix
15.1 Steps of design thinking (Doorley et al., 2018; reprinted with
permission from Stanford d.school)
15.2 Persona part A – out-of-town customer (template courtesy of Lucy Kimbell, Leeor Levy and University of the Arts London)
15.3 Persona part B – out-of-town customer (template courtesy of Lucy
Kimbell, Leeor Levy and University of the Arts London)
15.4 Persona part A – fraudulent customer (template courtesy of Lucy
Kimbell, Leeor Levy and University of the Arts London)
15.5 Persona part B – fraudulent customer (template courtesy of Lucy
Kimbell, Leeor Levy and University of the Arts London)
15.6 Storyboard – reducing vehicle collision damage (template courtesy
of Lucy Kimbell,
Leeor Levy and University of the Arts London)
B.5 GoGet organization chart
Tables
1.1 Delphi study rankings (reprinted from Vidgen et al. 2017, p.638,
Copyright (2017), with permission from Elsevier)
2.1 Some common data science techniques, with business applications
2.2 Data scientist tasks (adapted from Suda 2017, p.46 with permission
from O’Reilly Media)
2.3 Comparison of R, SAS, and DataRobot
6.1 Exam results (actual)
6.2 Predicted exam mark and error
6.3 Hours of revision (X) and exam mark (Y)
6.4 Predicted exam mark and error term
6.5 Variables in the advertising dataset (N = 250)
6.6 Overall ANOVA
6.7 Parameter estimates
6.8 Fit statistics
6.9 Calculation of R-square and F-value
6.10 Overall ANOVA
6.14 Estimates of sale price based on kitchen quality
6.15 Parameter estimates for a model with an interaction effect
7.1 Probabilities and odds
7.2 Natural logarithms for odds
7.3 Confusion matrix and analysis
7.4 SAS VA logistic regression details – parameter estimates
7.5 SAS VA logistic regression details – fit statistics
7.6 From logistic regression equation to case probabilities
7.7 Error distributions and link functions supported by SAS VA
8.3 Contingency table for Sex
8.4 Decision table rules
8.5 Confusion matrix for the advanced strategy decision tree
9.1 Encodings, order, values, and types of data (Iliinsky 2013)
9.2 Typology for strategically designing visualization
10.1 Autopilot steps (for datasets where cross-validation is performed/allowed)
10.2 Pre-processing of features
10.3 Dummy variable coding versus One Hot encoding
10.4 Data for scoring
10.5 Scored data downloaded from DataRobot
12.1 Top 10 negative sentiment tweets (produced by the authors using R with Twitter data from 25 May 2016), extreme language redacted
12.2 Top 10 positive sentiment tweets (produced by the authors using R with Twitter data from 25 May 2016)
12.3 Word clouds and top words of selected topics produced using LDA (produced by the author using R with Twitter data from 25 May 2016)
12.4 Confusion matrix for lycanthropy
14.2 Front-office business analytics opportunities matrix for GoGet
14.3 Back-office business analytics opportunities matrix for GoGet
15.1 Paradigm shift from an analytics world to a business universe
16.1 Framework for big-data ethics (Davis 2012, p.3. Adapted with permission from O’Reilly Media)
16.2 Sample questions for inquiring into big-data values (Davis 2012,
pp.47–48. Adapted with permission from O’Reilly Media)
A.1 The Titanic dataset
A.4 The countries dataset
A.5 The insurance dataset
A.6 The Hofstede dataset
A.7 The NBA dataset
A.8 The sale–win–loss dataset (source: IBM Watson)
A.9 The advertising dataset
C.1 The Business Analytics Capability Assessment Survey
© Richard Vidgen, Sam Kirshner and Felix Tan, under exclusive licence to Springer Nature
Limited 2019
R. Vidgen et al., Business Analytics
https://doi.org/10.26777/978-1-352-00726-8_1
1. Introduction
Richard Vidgen1 , Samuel N. Kirshner2 and Felix Tan3
(1) Business School, University of New South Wales, Sydney, Australia
(2) Business School, University of New South Wales, Sydney, Australia
(3) Business School, University of New South Wales, Sydney, Australia
Richard Vidgen (Corresponding author)
Email: r.vidgen@unsw.edu.au
Samuel N. Kirshner
Email: s.kirshner@unsw.edu.au
Felix Tan
Email: f.tan@unsw.edu.au
Learning Outcomes
After you have completed this chapter you should be able to:
Introduction
We are living in an age of the data deluge. Everywhere we go, everything we say, everything we buy, leaves a digital trace that may be recorded and stored. Consequently, there is much excitement – and some trepidation – around big data and business analytics as organizations of all types explore how they can use their data to create (and protect) value. Data analytic methods are being used in many and varied ways – for example, to predict consumer choices, to estimate the likelihood of a medical condition, to detect political extremism in social networks and social media, and to better manage traffic networks.
The opportunities opened up by big data and business analytics are leading academics and practitioners to explore ‘how ubiquitous data can generate new sources of value, as well as the routes through which such value is manifest (mechanisms of value creation) and how this value is apportioned among the parties and data contributors’ (George et al. 2014, p.324).
McAfee and Brynjolfsson (2012) find that data-driven companies are, on average, 5% more productive and 6% more profitable than their competitors. However, becoming a data-driven organization is a complex and significant challenge for managers: ‘Exploiting vast new flows of information can radically improve your company’s performance. But first you’ll have to change your decision-making culture’ (p.61).
A framework for business analytics
According to a succinct and widely adopted definition provided by Davenport and Harris (2007), ‘business analytics’ is concerned with ‘the extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions and actions’ (p.7). A key aspect of this definition is that analytics ultimately provides insight that is actionable. Other terms, such as data mining, knowledge discovery, machine learning, artificial intelligence (AI), and deep learning are commonly used in association with business analytics. These latter terms typically describe techniques deployed by analytics professionals (who may also be referred to as data scientists), the people who build the explanatory and predictive models that enable organizations to make better decisions.
There is a distinct sense that this is all new and that ‘machine learning algorithms were just invented last week and data was never “big” until Google came along’ (O’Neil & Schutt 2013, p.2). However, data science has been around for decades and, in the case of statistical techniques, for centuries. For example, the development of probability
Data sources
For an organization, data can be acquired from internal, external, and
open platforms. Internal data will typically be sourced from enterprise
systems and e-commerce applications. External data can be acquired
from third parties, for example credit scores, and from the Internet via
social media platforms.
Open data is data made freely available by other organizations, such as governments (e.g., Census data). More and more data is being made available by central and local government agencies. For example, the London Datastore (LDS) ‘has been created by the Greater London Authority (GLA) as a first step towards freeing London’s data. We want everyone to be able [to] access the data that the GLA and other public sector organizations hold, and to use that data however they see fit – for free’ (London Datastore, n.d.). The LDS (as of June 2018) has 723 datasets available for download, covering many topics, including health, employment, environment, and housing. For instance, there are numerous datasets for ‘Crime and Community Safety’ (Figure 1.2).
Combining open data with an organization’s own data can provide
much greater richness and depth of insight and open up new
commercial opportunities (e.g., via third-party app development).
Data generators
Data is being generated through developments including the Internet of
Things (IoT), ubiquitous computing, and social media.
Internet of Things (IoT)
Although the concept was not named until 1999, the IoT has been in development for decades. The first Internet appliance was a Coke machine at Carnegie Mellon University in the early 1980s. The programmers could connect to the machine over the Internet, check the status of the machine, and determine whether there would be a cold drink awaiting them, should they decide to make the trip down to the machine (Teicher 2018).
The IoT is a scenario in which objects, animals, or people are provided with unique identifiers that facilitate the automatic transfer of data over a network without requiring human-to-human or human-to-computer interaction (Figure 1.3). So far, the IoT has been most closely associated with machine-to-machine (M2M) communication in manufacturing and power, oil, and gas utilities. Products built with M2M communication capabilities are often referred to as being ‘smart’ (e.g., a smart utility meter).
IoT implementation?
3. Thinking about the organization for which you currently work (this can also be one that you have worked for previously or your current educational institution):
(a)
Is your organization using the IoT? If so, how?
(b)
What potential use cases can you see for the IoT in your
organization? How would business value be created from these
use cases?
Ubiquitous computing
Closely allied to the IoT is ubiquitous computing. Ubiquitous means ‘existing everywhere’ – ubiquitous computing devices are completely connected and constantly available. Ubiquitous computing relies on the convergence of wireless technologies, advanced electronics, and the Internet. The goal of researchers working in ubiquitous computing is to create smart products that communicate unobtrusively, particularly wearable computers such as Google Glass (Figure 1.4) and the Fitbit.
Figure 1.4 Google Glass (https://www.varifocals.net/google-glass/)
Social media
The Internet analytics company Hootsuite (2015) categorizes social media into eight archetypes (while recognizing that these boundaries are fluid and that a social media site may fit under multiple headings):
Relationship networks, such as Facebook, Twitter, and LinkedIn
Media sharing networks, such as Flickr and Instagram
Online reviews, such as TripAdvisor
Discussion forums, such as Reddit and Digg
Social publishing platforms, such as WordPress and Tumblr
velocity – information generated at a rate that exceeds that of traditional systems
variety – multiple emerging forms of data (structured and unstructured), such as text, social media data, and video
veracity – the trustworthiness and quality of the data.
(IBM 2016)
A fifth V is often added – value. Having large volumes of data is all very well, but can it be turned into value? In Figure 1.1 we show value as an end product that is generated through actionable insight leading to improved decision-making; that is, value is not embedded in data per se, but it can be extracted through the business analytics function.
Thamm (2017) argues that the term ‘big data’ is now redundant, having ‘emerged in a time when it was becoming more and more difficult to process the exponentially growing volume of data with the hardware available at the time’. Successful business analytics projects do not necessarily depend on access to big data; more important is having relevant data for building and evaluating models. Collecting data for its own sake can lead to expensive data warehouses and data lakes (see Exhibit 1.1) that ultimately deliver little business value.
Exhibit 1.1: Data lakes and data warehouses
Amazon defines a data lake as ‘a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics – from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.’ https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
Wikipedia defines data warehouses as ‘central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise.’ https://en.wikipedia.org/wiki/Data_warehouse
Campbell (2015) elaborates on this definition, arguing that a data warehouse represents an abstracted picture of the business. Campbell (2015) distinguishes data lakes from data warehouses. Data lakes retain all data (not just data identified as being useful for
useful for
a particular purpose), support all data types (e.g., text and video),
support all users (including operational users and data scientists),
adapt easily to changes, and, as a result, provide faster insights.
While data lakes sound like the best option, they come at a price
(e.g., software, servers, and data management needed to build and
maintain a data lake). And, operational users who simply want
reports and KPIs might not want to work with the unstructured raw
data in a data lake and so might be better served by a structured
data warehouse.
Data management
The cloud
As data becomes more varied and less structured, with greater volume and velocity, the challenge is to be able to capture and process it quickly enough to meet business needs (which, in the case of credit card transaction processing, may be measured in milliseconds). Traditional databases and technologies have struggled to keep up with the challenge of big data, and new architectures that can scale up to the volume, velocity, and variety of today’s data have emerged: ‘Big Data is
Data quality
Redman (2008) argues that care of data and information boils down to data quality. Organizations should ‘correctly create or otherwise obtain the data and information they really need, correctly, the first time’ (p.3). Data should be easy to find, access, and use, such that people have the confidence and trust to employ the data in powerful ways. Organizations also have a responsibility to protect their data and to prevent it being used in inappropriate ways. We will look at data quality issues more closely in later chapters. Much of the data scientist’s time is spent in ‘cleaning’ data to prepare it for use in predictive models. The better the quality of the source data, the less time will be needed for data cleaning (e.g., estimating missing values).
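Much of this cleaning can be scripted. As a minimal sketch in base R (the open-source language covered later in the book), assuming a small hypothetical data frame with missing values – the variables here are invented for illustration and are not taken from the book’s datasets:

```r
# Hypothetical data frame with missing values (NA) – invented for illustration
df <- data.frame(
  age     = c(23, NA, 41, 35, NA, 52),
  charges = c(1200, 950, NA, 2100, 1750, 3300)
)

# One simple way to estimate missing values: impute the column mean
df$age[is.na(df$age)]         <- mean(df$age, na.rm = TRUE)
df$charges[is.na(df$charges)] <- mean(df$charges, na.rm = TRUE)

# Check that no missing values remain in either column
colSums(is.na(df))
```

Mean imputation is only one of several approaches; a data scientist might instead drop incomplete rows or use model-based imputation, depending on how much data is missing and why.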
Analytics
Models
The data science activity involves the construction of models that are used to describe, predict, and prescribe.
Descriptive analytics: uses data visualizations and summaries to
make sense of data and to show what has already happened. For
example, we might produce a bar chart of sales by region or a report
of which customers have churned.
Predictive analytics: uses statistical models, forecasting methods,
and machine learning to show what could happen. For example, we
might build a model that predicts sales over time by product and
region or a model that predicts which of our customers are likely to
churn.
Prescriptive analytics: uses models, such as optimizations and
simulations, that give advice on possible outcomes and propose what
we should do. For example, the model might indicate a marketing
campaign to increase sales or an enhanced service package to
decrease the probability of customer churn.
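The descriptive and predictive levels above can be illustrated with a minimal sketch in base R; the regional sales figures and advertising spend below are invented for illustration (prescriptive analytics, which typically involves optimization or simulation, is omitted for brevity):

```r
# Hypothetical sales data – invented for illustration only
sales <- data.frame(
  region = c("North", "South", "East", "West"),
  total  = c(120, 95, 140, 110),
  spend  = c(10, 7, 12, 9)
)

# Descriptive: what has already happened – a bar chart of sales by region
barplot(sales$total, names.arg = sales$region, main = "Sales by region")

# Predictive: a simple linear model of sales against advertising spend,
# used to forecast sales at a new spend level
model <- lm(total ~ spend, data = sales)
predict(model, newdata = data.frame(spend = 11))
```

The forecast is probabilistic in spirit: the point prediction from `predict()` comes with uncertainty, which can be inspected by adding `interval = "prediction"`.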
The vast majority of reports in an organization are descriptive; they tell us what has happened in the past (e.g., financial accounts, management reports, sales analyses showing variance from budget, inventory accounts). Because the tools are available and it is easy to do, more and more descriptive analyses are being produced, with an attendant risk of information overload.
Predictive analytics give us insight into the future. However, no model can predict the future with 100% certainty; the results of a predictive model are probabilistic (e.g., the probability that a customer may churn).
Prescriptive analytics go further and attempt to recommend action (or, indeed, may actually instigate action). For example, a customer
attrition model might advise as to which of several courses of action, or
combinations of action, should be taken with customers at risk of
churning.
Data scientists
According to Thomas Davenport, being a data scientist is the ‘Sexiest job of the 21st century’ (Davenport & Patil 2012). The Economist says that data scientists are the New Rock Stars and that they will continue to be in short supply. McKinsey (2011) forecasted that the USA would face a shortage of up to 190,000 data scientists by 2018. Glassdoor, an online job site, produces an annual report of the 50 best jobs in America. They calculate the ranking based on median annual base salary, job satisfaction rating, and number of job openings. Data scientist is the top job in the USA in 2016, 2017, and 2018 (Glassdoor 2018), scoring 4.2/5.0 for job satisfaction, a median base salary of USD 110,000, and an overall score of 4.8/5.0. Glassdoor also provides data for the UK, where data scientist ranks 17th in 2018 with a job satisfaction score of 3.6/5.0, a median base salary of GBP 45,000, and an overall score of 4.0/5.0. Although these are small datasets that may not be truly representative of the state of data scientist jobs in the USA and the UK, it is tempting to speculate that the difference may be due to the USA being more advanced in its use of analytics than the UK.
Data scientists need a range of skills and personal characteristics. They need (1) computer science skills (e.g., programming, AI), (2) quantitative skills (e.g., statistics), and (3) an appreciation of the decisions that will be made in a particular domain (Figure 1.5). What happens when one of the three dimensions is missing? When programming and statistics come together, the result is typically
2. Using analytics for improved decision making – linking the analytics produced from big data with key decision making in the business
3. Creating a big data and analytics strategy – having a clear big data and analytics strategy that fits with the organisation's business strategy
4. Availability of data – the availability of appropriate data to support analytics (does the data exist?)
5. Building data skills in the organisation – the training and education required to upskill employees in general to utilise big data and analytics
6. Restrictions of existing IT platforms – existing IT platforms/architecture may make it difficult to migrate to and manage big data and analytics
7. Measuring customer value impact – can the real impact on the customer of managing big data be measured?
8. Analytics skills shortage – difficulty in acquiring the mathematical, statistical, visualisation skills for producing analytics
9. Establishing a business case – can 'tangible' benefits of big data be demonstrated (e.g., return on investment)?
10. Getting access to data sources – accessing appropriate data sources to produce and manage big data (can the data be accessed?)
11. Producing credible analytics – are the analytics produced from big data likely to be credible and trusted by the organisation?
12. Building a corporate data culture – e.g., are data and analytics taken seriously enough by the leaders at a strategic level in the business?
20. Managing data volume – does the organisation have effective ways (systems) for storing and managing large volumes of data
21. Data ownership – who owns the big data? Inside (e.g., which department) and outside of an organisation (e.g., Government, partners)
22. Managing costs – ability to manage the costs associated with big data
23. Defining the scope – difficulty in defining the scope of big data projects in the organisation (where does it start and stop?)
24. Defining what 'big' data is – difficulty in defining what 'big data' actually is
25. Securing investment – ability to secure the investment needed to build big data and analytics (infrastructure, skills, training, etc.)
26. Manipulating data – being able to process the data to produce analytic insight
27. Legislative and regulatory compliance – compliance with laws such as the Data Protection Act 1998/2003
28. Using the data ethically – using the data in an ethical way and ensuring all areas of the organisation are using it in acceptable ways
29. Performance management – ability to develop key indicators for big data and analytics performance reporting
30. Safeguarding reputation – e.g., reputation and brand damage caused by inappropriate use of data, data leakage, selling data
31. Working with academia – can the organisation build relationships and work effectively with academia?
The Delphi study identified 31 items (Table 1.1). The top five issues are (1) managing data quality, (2) using analytics for improved decision-making, (3) creating a big data and analytics strategy, (4) availability of data, and (5) building data skills in the organization.
Summary
Business analytics is a complex organizational field involving technology, data science, management, and organizational change (to processes and culture and possibly to business strategy). While managers might not need to know how big data technologies work and how the complex predictive models built by data scientists operate, they need to appreciate the management inputs required and the interconnection of these elements. Value creation is not solely the province of those organizations that have the 'biggest' data, the latest technologies, and the smartest data scientists. Success can be created from small data with technology that the organization is skilled at using (this might even include Microsoft's Excel) and a small project can demonstrate the business value of analytics and pave the way for further and more ambitious initiatives.
Whatever the circumstance, managers must be prepared to tackle
the following questions:
Where does our data come from?
Do we have the right data?
Is our data of sufficient quality?
How well is our data managed?
What technologies are needed to collect, store, and make available
our data?
How can data science/analytics be used to build models that lead to improved decision-making?
How can social media data be utilized?
What external and open data should we acquire to enrich our
internal data?
What human resources do we need for business analytics?
Do we have an effective business analytics strategy?
Is our business analytics strategy aligned with our business strategy?
References
Apprenda. (2016). IaaS, PaaS, SaaS (explained and compared). Apprenda (website). https://apprenda.com/library/paas/iaas-paas-saas-explained-compared/
BBC News. (2014). 'Internet of things' to get £45m funding boost. BBC News, 9 March. http://www.bbc.com/news/business-26504696
Campbell, C. (2015). Top five differences between data lakes and data warehouses. Blue Granite, 26 January. https://www.blue-granite.com/blog/bid/402596/top-five-differences-between-data-lakes-and-data-warehouses
Chrisos, M. (2018). 3 Benefits of analytics every HR manager should know. TechFunnel, 21 March. https://www.techfunnel.com/hr-tech/types-of-hr-analytics-every-manager-should-know/
Davenport, T. & Bean, R. (2017). Big Data Executive Survey 2017. NewVantage Partners. http://newvantage.com/wp-content/uploads/2017/01/Big-Data-Executive-Survey-2017-Executive-Summary.pdf
Davenport, T. & Harris, J. (2007). Competing on analytics: The new science of winning. Harvard Business Press, Cambridge, MA.
Davenport, T. & Patil, D. (2012). Data scientist: The sexiest job of the 21st century. Harvard Business Review, October: 70–76.
Díaz, A., Rowshankish, K., & Saleh, T. (2018). Why data culture matters. McKinsey Quarterly, September 2018.
Dun & Bradstreet. (2017). How Marketing, Procurement and Finance Departments Use Analytics. Dun & Bradstreet (website), 17 July. https://www.dnb.co.uk/perspectives/analytics/integrating-analytics-into-business-decisions.html
EMC. (2012). Big Data-as-a-Service: A market and technology perspective. White Paper. http://australia.emc.com/collateral/software/white-papers/h10839-big-data-as-a-service-perspt.pdf
Fagella, D. (2018). Predictive Analytics for Marketing – What's Possible and How it Works. Emerj, 29 November. https://emerj.com/ai-sector-overviews/predictive-analytics-for-marketing-whats-possible-and-how-it-works/
Fleming, O., Fountaine, T., Henke, N., & Saleh, T. (2018). Ten red flags signaling your analytics program will fail. McKinsey Quarterly, May 2018.
George, G., Haas, M., & Pentland, A. (2014). From the editors: Big data and management. Academy of Management Journal, 57(2): 321–332.
Glassdoor. (2018). 50 Best Jobs in America. https://www.glassdoor.com/List/Best-Jobs-in-America-LST_KQ0,20.htm
Harvey, C. (2017). Big Data Challenges. Datamation, 5 June. https://www.datamation.com/big-data/big-data-challenges.html
Hootsuite. (2015). 8 types of social media and how each can benefit your business. Hootsuite (blog), 12 March. https://blog.hootsuite.com/types-of-social-media/
IBM. (2016). The four V's of big data. Infographics & Animations, Big Data & Analytics Hub. IBM (website). http://www.ibmbigdatahub.com/infographic/four-vs-big-data
IT Knowledge Portal. (2016). Cloud computing. IT Knowledge Portal (website). http://itinfo.am/eng/cloud-computing/
LDS (London Datastore). (n.d.). About this website. LDS (website). http://data.london.gov.uk/about/
McAfee, A. & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard Business Review, October: 61–68.
McKinsey Global Institute. (2011). Big data: The next frontier for innovation, competition, and productivity. May. McKinsey Global Institute (website). https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation
MongoDB. (2016). Big data explained. MongoDB (website). https://www.mongodb.com/big-data-explained
Mortenson, M., Doherty, N. & Robinson, S. (2015). Operational research from Taylorism to terabytes: A research agenda for the analytics age. European Journal of Operational Research, 241: 585–595.
O'Neil, C. & Schutt, R. (2013). Doing data science: Straight talk from the frontline. O'Reilly Media, Sebastopol, CA.
Redman, T. C. (2008). Data driven: Profiting from our most important business asset. Harvard Business School, Cambridge, MA.
Teicher, J. (2018). The little-known story of the first IoT device. IBM. https://www.ibm.com/blogs/industries/little-known-story-first-iot-device/
Thamm, A. (2017). Big Data is dead. Data is "Just Data," regardless of quantity, structure, or speed. LinkedIn. https://www.linkedin.com/pulse/big-data-dead-just-regardless-quantity-structure-speed-thamm/
Turnbull, M. (2015). Internet of things summit. Speech, Australian Government, Ministers for the Department of Communications and the Arts, 15 March.
UK Government Chief Scientific Adviser. (2014). The Internet of Things: Making the most of the second digital revolution. The Government Office for Science, United Kingdom.
© Richard Vidgen, Sam Kirshner and Felix Tan, under exclusive licence to Springer Nature
Limited 2019
R. Vidgen et al., Business Analytics
https://doi.org/10.26777/978-1-352-00726-8_2
Samuel N. Kirshner
Email: s.kirshner@unsw.edu.au
Felix Tan
Email: f.tan@unsw.edu.au
Learning Outcomes
Introduction
In establishing an effective business analytics development function an organization will need to consider the composition of its data scientist team, the tools and techniques to be deployed, and the methodology used to guide the analytics development process (Figure 2.1).
1. Define the business objectives
An analytics project must address a business question. Therefore, the project should start with a well-defined business objective. Clearly stating that objective will allow the team to define the scope of the project and will provide them with a set of tests to measure the success of the project.
2. Collect data
The data is usually scattered across multiple sources: internal, external, and open. Collecting the data may involve a range of methods, such as SQL queries of corporate databases, searches of social media, web-scraping, and the inclusion of open and publicly available data, such as weather, crime, and social deprivation statistics. Assembling this data into a common and usable format constitutes a major part of any analytics project.
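The collection step can be sketched in miniature. Everything below is invented for illustration (an in-memory SQLite table stands in for the corporate database, and a small dictionary stands in for an open weather feed), but the pattern – query, enrich, assemble into a common format – is the one described above:

```python
# A minimal sketch of the 'collect data' step: query an 'internal' database
# with SQL, then enrich the result with an open dataset. All table names,
# columns, and values are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, week TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("S1", "2019-01", 1200.0), ("S2", "2019-01", 950.0)])

# SQL query of the 'corporate' database
rows = conn.execute("SELECT store, week, amount FROM sales").fetchall()

# Open data (e.g., weather statistics), keyed by week
weather = {"2019-01": {"rainfall_mm": 42}}

# Assemble both sources into a common, usable format (one record per row)
combined = [
    {"store": s, "week": w, "amount": a, **weather.get(w, {})}
    for (s, w, a) in rows
]
```

In a real project this assembly and cleaning work often dominates the schedule, which is why it is called out as a major part of the project.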
Analytics methodologies
While all analytics projects will need to address the six steps identified in Figure 2.2, these steps provide little guidance concerning how the steps will be accomplished. A methodology provides a framework that is used to structure, plan, and control the process of developing an analytics solution. Any organization embarking on analytics
Evidence: A/B testing
Having deployed a model and impacted on the decision-making in a business process, how do we know if the policies and interventions based on an analytics model actually work? Randomized controlled trials (RCTs) are used extensively in medicine, economic development,
Modelling techniques
Supervised and unsupervised learning
Having the output variable – that is, the thing we wish to predict – available is the hallmark of supervised learning. This is the most frequent scenario in analytics. The strength of this approach is that the training dataset contains the correct answers (e.g., which customers did actually churn?). In unsupervised learning, the output variable is not specified. While unsupervised learning is not used as frequently as supervised learning, it can be very useful for tasks such as customer segmentation, where we wish to establish segments based on how similar customers are to each other using all the customer features we have available (this process is called clustering and we will cover this in Chapter 5).
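A toy illustration of unsupervised learning may help. The sketch below runs a minimal k-means (Lloyd's algorithm) over six invented customers described by two features; a real project would use a library implementation such as scikit-learn, but the idea – grouping cases by similarity without any output variable – is the same:

```python
# Minimal k-means for customer segmentation. The customer data, features,
# and starting centroids are invented for illustration.
import math

customers = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),   # low spend/visits segment
             (8.0, 9.0), (8.5, 9.5), (7.5, 8.8)]   # high spend/visits segment

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else cen
            for cluster, cen in zip(clusters, centroids)
        ]
    return centroids, clusters

centroids, clusters = kmeans(customers, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Note that no 'correct answer' is supplied anywhere: the two segments emerge purely from the similarity of the cases.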
Regression and classification models
Regression and classification models are supervised techniques. They differ in the type of output they produce. Regression models are used to predict a numerical or quantitative value, such as the level of sales given a number of inputs (e.g., advertising, promotion, and press coverage). Classification models predict the class that a case belongs to, such as a person's gender, whether a customer will churn, or the socioeconomic class of a customer.
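The difference in output type can be seen in a small sketch. The regression half fits a least-squares line predicting sales (a number) from advertising spend; the classification half assigns a label (churn or no churn). All data and the one-rule classifier are invented for illustration:

```python
# Regression produces a numerical output; classification produces a class
# label. Data and thresholds are invented for illustration.

# --- Regression: ordinary least squares for y = a + b*x ---
ads =   [1.0, 2.0, 3.0, 4.0]        # advertising spend
sales = [3.1, 5.0, 6.9, 9.1]        # observed sales (roughly 1 + 2*x)

n = len(ads)
mean_x, mean_y = sum(ads) / n, sum(sales) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(ads, sales)) / \
    sum((x - mean_x) ** 2 for x in ads)   # slope
a = mean_y - b * mean_x                    # intercept
predicted_sales = a + b * 5.0              # numerical (regression) output

# --- Classification: a one-rule classifier on months since last purchase ---
def classify(months_inactive: int) -> str:
    return "churn" if months_inactive > 6 else "no churn"   # class output

label = classify(9)
```

The first model answers 'how much?'; the second answers 'which class?'.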
Deep learning
Over the last few years the term 'deep learning' has become popular. Deep learning can be thought of as a subset of machine learning in which the 'deep' refers to an AI model with multiple layers of representation. Machine learning is itself a subset of a broader class of applications – AI. The relationship of the three can be visualized as in Figure 2.6.
Figure 2.6 Artificial intelligence (AI), machine learning, and deep learning (reprinted from Chollet 2018, p.4, Copyright (2018) with permission from Manning Publications)
Model-building techniques
There are many modelling techniques available to the data scientist, and the list continues to grow. Table 2.1 lists some modelling techniques commonly used in business analytics applications. While the detailed workings of each of these techniques are beyond the scope of this book, it is important to be aware of the armoury of techniques that data scientists might deploy in building a model. Indeed, an individual data scientist is unlikely to be familiar with, and competent in, all of the techniques in Table 2.1. The data scientist is as much engaged in bricolage (improvisation and tinkering) as they are in engineering and will learn about and use techniques on a case-by-case basis as the situation and the data take them.
Table 2.1 Some common data science techniques, with business applications

Unsupervised learning

k-means clustering: k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It is a common unsupervised learning approach to clustering data (e.g., customer segmentation).

Principal components analysis (PCA): PCA is a way of reducing the number of dimensions in a dataset. It is a useful aid to data visualization and exploration when there is a large number of variables to be analysed.

Supervised learning

Linear regression: The most common form of regression model. One or more input variables (continuous and categorical) are used to predict a continuous output; for example, what amount of charges is likely to be incurred for an individual's health insurance policy?

Logistic regression: Logistic regression is used to make predictions in a dataset in which there are one or more independent variables that determine a dichotomous outcome (e.g., will this customer churn?). The binary model can be extended to a multinomial model to predict an output with more than two classes.

Artificial neural networks (ANNs) and deep learning: Artificial neural networks (ANNs) are a family of models inspired by biological neural networks (the central nervous systems of animals, in particular the brain) which are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. Neural networks may be more effective than linear and logistic regression when the feature space is large. Training ANNs requires substantial computing resources but as computing has become cheaper and more available ANNs have become more popular. ANNs are a core part of 'deep learning'.

Support vector machines (SVMs): A support vector machine (SVM) is a classifier formally defined by a separating hyperplane. Given labelled training data (supervised learning), the algorithm outputs an optimal hyperplane which is used to categorize new examples. SVMs are used in a wide range of prediction and classification applications.

Classification and regression trees (CARTs): Classification and regression trees (CARTs) are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. CART models are used for regression and classification problems (e.g., to generate a decision tree for deciding whether to approve a loan to a bank customer).

Gradient boosting: The next step on from regression trees is gradient boosting, such as implemented in the XGBoost package.

Naive Bayes: The Naive Bayesian classifier is based on Bayes' theorem. The assumption of independence between predictors means it is easy to build and efficient to run and particularly useful for very large datasets (e.g., to identify whether an email is spam).

Bayesian networks: A Bayesian network is a probabilistic directed acyclic graphical model that represents a set of random variables and their conditional dependencies.

k-nearest neighbours (kNN): k-nearest neighbours (kNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). It is used to classify many types of data (e.g., to classify images, to diagnose breast cancer).

Association rules: Association rules are created by analysing data for frequent if/then patterns and using the criteria support and confidence to identify the most important relationships. Association rules are useful for analysing and predicting customer behaviour (e.g., shopping basket data analysis).

Genetic algorithms: Genetic algorithms (GAs) are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection and genetics. They represent an intelligent exploitation of a random search used to solve optimization problems (e.g., the travelling salesman problem).

Time-series analysis: Time-series analysis comprises methods for analysing time-series data in order to extract meaningful statistics and other characteristics of the data. Time-series is often used to predict future values (e.g., sales) based on previously observed values.

Ensemble models: Ensemble models combine the decisions from multiple models to improve the overall performance. This can be done, for example, by averaging the results of the different models.

Text analysis

Natural language processing: Natural language processing (NLP) is the ability of a computer program to make sense of human speech and text. The NLP family includes techniques such as sentiment analysis and latent Dirichlet analysis.

Sentiment analysis: Sentiment analysis is used to extract subjective information behind a series of words. It is used to gather an understanding of the attitudes, opinions and emotions expressed within a text (particularly online and social media mentions).

Topic modelling: A popular form of topic modelling uses Latent Dirichlet analysis (LDA), a generative statistical model that allows a corpus of text documents to be explained by unobserved (i.e., latent) topics that explain why some parts of the data are similar. Each document in a corpus is modelled as a finite mixture over an underlying set of topics. LDA can be applied to social media data such as tweets to identify the underlying topics driving the content of the tweets.

Other

Social network analysis (SNA): Social network analysis (SNA) is used to make visible hidden network structures. Networks are modelled as nodes (individual actors, people, or things within the network) and connecting ties (relationships or interactions). SNA can be used to understand how customers are connected to each other and which ones are influential in forming opinion.

Simulations: A computer simulation uses an abstract model of a system to reproduce the behaviour of that system. Simulations are useful in areas such as logistics, cash-flow forecasting, and marketing strategies.

Geospatial and mapping applications: Data that are tagged with geospatial coordinates (e.g., latitude/longitude) or with postal codes are visualized and analysed. Geospatial mapping can be used to plan the location of new stores based on customer location, to understand customer demographics based on socioeconomic analysis of postal code.
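To give a flavour of how such techniques look in practice, the sketch below hand-codes the kind of decision tree a CART algorithm might learn for the loan-approval example mentioned in Table 2.1. The splits and thresholds are invented; a real tree would be learned from data:

```python
# Hand-coded illustration of a loan-approval decision tree. A real CART
# algorithm would learn these splits from data; the thresholds here are
# invented for illustration only.

def approve_loan(income: float, existing_debt: float) -> bool:
    """A two-level decision tree: split on income, then on existing debt."""
    if income < 30_000:                 # first split: applicant's income
        return False                    # leaf: decline
    return existing_debt < 10_000       # second split: leaf approve/decline

decisions = [approve_loan(25_000, 0),
             approve_loan(60_000, 5_000),
             approve_loan(60_000, 20_000)]
```

The appeal of tree models for managers is exactly this readability: each path from root to leaf is an explicit business rule.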
Class DataScientist
Is skeptical, curious. Has inquisitive mind. Knows machine learning, statistics, probability. Applies scientific method. Runs experiments. Is good at coding and hacking. Able to deal with IT and data engineering. Knows how to build data products. Able to find answers to known unknowns. Tells relevant business stories from data. Has domain knowledge.
These are rare people (often referred to as 'unicorns') and, even when they can be found, it is unlikely that one person will have all the capabilities needed in an analytics team. Therefore, a data science team will typically comprise team members with complementary skills.
How might we profile data scientists? In collaboration with their customers, Mango Solutions has defined six core attributes of the contemporary data scientist. The firm's survey questionnaire can be used to gain insight into the capabilities of individual data scientists and to help ensure a balance of skills in an organization's analytics team. After completion of the online survey a Data Science Radar™ chart is produced showing the profile of the data scientist – for example, Figure 2.7 suggests a person with particular strength in data visualization.
major involvement. Thus, 67% of respondents reported basic exploratory data analysis as a task in which they have major involvement, while at the other end of the scale only 4% reported the development of hardware as a major involvement task.
Table 2.2 Data scientist tasks (adapted from Suda 2017, p.46 with permission from O'Reilly Media)
Task %
1 Basic exploratory data analysis 67
2 Conduct data analysis to answer research questions 61
3 Communicate findings to business decision-makers 58
4 Data cleaning 53
5 Develop prototype model 49
6 Create visualizations 47
16 Develop dashboards 28
17 Set up/maintain data platforms 24
18 Develop data analytics software 21
19 Develop products that depend on real-time data analytics 18
20 Use dashboards and spreadsheets (made by others) to make decisions 15
21 Develop hardware (or work on software projects that require expert knowledge of hardware) 4
solutions.com/radar/) and complete the survey questionnaire 'What kind of data scientist are you?' It does not matter that you are not currently a data scientist – or that your responses might reflect your aspirations rather than your current skills.
1. If you were putting together a team of data scientists for your organization, what skill profiles would you need? What size might a data science team need to be to adequately cover the six attributes?
2. How might the data scientist recruitment process vary depending on the core attribute profile being sought?
Table 2.2 gives an in-depth insight into what data scientists actually do and the wide range of tasks they can be involved in. While basic data analysis and answering questions are key activities (tasks 1, 2), the majority of data scientists also report communicating with business as a major activity (task 3). Data preparation activities form a core part of the data scientist's work, reflected in tasks 4, 8, and 14 in particular.
Analytics toolsets
As well as building the human resources needed to execute analytics projects, the organization will need to select software tools to help capture, store, visualize, and model data. Adopting an analytics tool involves significant investment. Even if the software is free (e.g., open source), the complementary investment in training, installation, operation, support, and attracting analytics people with the appropriate skills will still be expensive.
While it is tempting to jump on the latest bandwagon, it is not
necessary to have the latest tools to create business value from
analytics. It is more important to build a competency with a toolset
than to continually seek to change horses, always looking for a silver
bullet.
Automated machine learning
Gartner (Sallam et al. 2017) argue that analytics is at a critical inflection point; organizations have easier-to-use tools and self-service analytics, but the processes of preparing and analysing data, building predictive models, and making sense of and communicating the results are still
As there is a standard interface the user does not need to know how to invoke the different algorithms, which might be implemented, for example, using Python or R. The output from DataRobot includes predictions, insights, and model validation, all in a standardized form regardless of the modelling technique used. The predictions can then be deployed as part of operational business processes by embedding calls to the DataRobot application program interface (API).
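The idea of embedding prediction calls in a business process can be sketched generically. The endpoint URL and payload format below are hypothetical (they are not DataRobot's actual API; consult the vendor's documentation for that), and the HTTP call is stubbed out so the flow is runnable offline:

```python
# Generic sketch of calling a hosted prediction API from a business process.
# The URL, payload shape, and response format are invented for illustration.
import json

API_URL = "https://example.com/predict"   # hypothetical endpoint

def build_request(customers):
    """Serialize a batch of customer records for the prediction service."""
    return json.dumps({"rows": customers})

def score_batch(customers, post):
    """POST the batch and return per-row churn probabilities."""
    response = post(API_URL, build_request(customers))
    return json.loads(response)["predictions"]

# A stub standing in for the real service so the flow runs offline;
# in production `post` would wrap an HTTP client call.
def fake_post(url, data):
    n = len(json.loads(data)["rows"])
    return json.dumps({"predictions": [0.8] * n})

probs = score_batch([{"id": 1}, {"id": 2}], post=fake_post)
```

The operational system only sees probabilities coming back; the modelling technique behind the endpoint can change without touching the calling process.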
For business analysts with limited technical background the platform allows them to build models without needing to know how the different techniques, such as neural networks, logistic regression, and support vector machines, work. DataRobot automatically divides the dataset into training and holdout sets (80/20 as a default) and further splits the training data into folds (five as a default) to allow model validation and cross-validation to be run. This approach ensures that best practice in training, validating, and deploying models is conducted by default and so brings a large element of safety to the model development process when it is in the hands of end-user data scientists.
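These defaults are easy to picture. The sketch below reproduces the same discipline – an 80/20 train/holdout split and five folds over the training data – using index numbers only; it is illustrative of the general practice, not of how any particular tool implements it:

```python
# Illustration of an 80/20 train/holdout split plus five-fold
# cross-validation, on row indices only. Invented for illustration.

records = list(range(100))               # stand-in for 100 dataset rows

split = int(len(records) * 0.8)          # 80/20 default split
training, holdout = records[:split], records[split:]

k = 5                                    # five folds by default
folds = [training[i::k] for i in range(k)]

# Each cross-validation round holds out one fold for validation and
# trains on the remaining k-1 folds.
rounds = [(folds[i], [r for j, f in enumerate(folds) if j != i for r in f])
          for i in range(k)]
```

The holdout set is never touched during model selection, which is what makes the final performance estimate trustworthy.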
For expert data scientists much of the pain of data preparation and remembering how to run models is removed. For all modellers, there is access to techniques that they might not have considered or, indeed, even heard of before. Indeed, one academic researcher found that DataRobot allowed them to replicate in one hour a predictive model that had taken them two to three months to develop. Further, the DataRobot model outperformed the academic's original predictive model 'by a factor of two' due to the model builder having 'missed a class of algorithms that worked really well for the data in question' (https://www.datarobot.com/product/).
Analytics tool comparison
The three tools – R, DataRobot, SAS VA – are compared and contrasted in Table 2.3. The three tools overlap in that they can all be used to explore data, build predictive models, and explore model results. However, this superficial similarity quickly breaks down once we dig deeper into their capabilities and costs of acquisition and operation.
Table 2.3 Comparison of R, SAS, and DataRobot (columns: DataRobot; SAS Visual Analytics (SAS VA); R)
into its customers so that it can understand segments and needs and
provide a better service. SBS is also interested in the geo-location of
its stores: Are they in the right locations? Where should new stores
be opened?
The company is considering hiring a full-time data scientist to carry out descriptive analytics and to build predictive models. SBS currently has no analytics methodology and develops data analytics solutions on an ad hoc basis using Excel. The financial director wants to use Excel as they are comfortable with spreadsheets. The IT director used SAS in a previous company and prefers an enterprise solution. The CEO went to a seminar hosted by DataRobot and saw automated machine learning in action – the CEO is a recent convert to business analytics and thinks DataRobot could be the silver bullet that SBS's managers are looking for.
Summary
In deploying analytics, an organization has to consider the methodology to use, the profiles of the data scientists to employ, and the toolset to use. The organization needs the three elements of the analytics function to be in alignment (Figure 2.9). The data scientists will need skills in the tools and techniques adopted by the organization and knowledge and acceptance of the way things are done (methodology). For example, it might not work to hire data scientists who subscribe to open toolsets such as Python and R and then require that they only use a proprietary tool such as SAS (and vice versa). While some systematicity is needed in the analytics development process, an overly formal and bureaucratic methodology may impede the effectiveness of the data scientist team. Further, we should also plan to see the data science role itself being automated in part through tools such as DataRobot. Lastly, while predictions are essential, the acid test of a successful analytics intervention is the extent to which it informs actions that create business value and, ideally, that value is demonstrable through A/B testing.
© Richard Vidgen, Sam Kirshner and Felix Tan, under exclusive licence to Springer Nature
Limited 2019
R. Vidgen et al., Business Analytics
https://doi.org/10.26777/978-1-352-00726-8_3
Samuel N. Kirshner
Email: s.kirshner@unsw.edu.au
Felix Tan
Email: f.tan@unsw.edu.au
Learning Outcomes
After you have completed this chapter you should be able to:
Explain how data and its sources are an asset to organizations, governments, and the lives of citizens
Explain the distinction between data, information, knowledge, and wisdom
Explain why data quality is important
Define and operationalize key data-quality attributes
Define attributes of datasets, such as missing values, outliers, and probability distributions.
Introduction
When someone thinks they have flu they are likely to use a search engine to find symptoms, treatments, and other information. Google decided to track online searches with the hope of being able to predict flu outbreaks faster than traditional means – for example, possibly two weeks earlier than health authorities such as the US Centers for Disease Control and Prevention (CDC). The developers of Google Flu Trends (GFT) claimed in the journal Nature that 'we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day' (Ginsberg et al. 2009, p.1012). In 2013 GFT failed spectacularly, missing the peak of the 2013 flu season by 140%, which led to the decommissioning of GFT (Lazer & Kennedy 2015). While the failure of GFT does not mean that big data does not have value, it does demonstrate the potential for 'big data hubris'.
In an article in Science, Lazer et al. (2014) explain that 'big data hubris' is the often implicit assumption that large volumes of data can be a substitute for, rather than a supplement to, traditional data collection and analysis. Smart data scientists with massive quantities of data may think that they can outsmart anyone and anything. However, GFT failed for a number of reasons. First, GFT overfitted the data, using seasonal search terms such as 'high school basketball', which were strongly correlated, but only by chance. Second, GFT did not take into
Data growth
A data deluge is sweeping almost invisibly across the planet. It is the result of the prevalence of automatic electronic data collection, instrumentation, and online transaction processing (OLTP). There is a growing recognition of the untapped value in these databases, which is in part driving the development of data science. This data comes in many forms. Some of the data will be structured – that is, in tabular form with regular columns and rows, as is typical of spreadsheets and relational databases. Other data will be unstructured, such as email, text documents, audio recordings, video, and images.
Unstructured data is in the ascendancy and will pose data storage as well as data analysis challenges for organizations. Rizkullah (2017) reports Gartner's estimate that unstructured data comprises around
every data collection protocol will generate a data problem eventually – proactively defining data will result in more useful information, leading to more useful analytics
every company will need analytics eventually – proactively analytical companies will compete more effectively
everyone will need analytics eventually – proactively analytical people will be more marketable and more successful in their work.
As data becomes cheaper and more plentiful, companies have begun to leverage information content that was previously impossible or unfeasible to access. And those companies that implement competitive analytics are likely to have greater influence on the shape and future of their industry.
2. How might this digital-trace data produced by the digital exhaust be misused?
In 1866, 5,500 people died in one square mile of London's East End. Geo-mapping of cholera deaths by Dr John Snow (1813–58) showed cholera deaths to be highly localized. This led Snow to speculate that cholera was a water-borne disease. Despite compiling a substantial body of evidence (including a map showing the location of 15 water pumps and the locations of the cholera-related deaths) that seemed to be irrefutable, such was the grip of the air-borne theory of cholera ('miasma theory') that the Government refused to accept Snow's conclusions. Snow died in 1858 and did not live to see his ideas become accepted. Eventually, politicians were forced to act and deal with London's polluted water sources, leading to the building of a sewerage system by Joseph Bazalgette and the eradication of cholera.
While Farr’s
contaminated later report
drinking waterthat cholera was
contradicts causedreport
his earlier by sewage--
sewage-
on -
elevation, this is a good example of data science in action – theories are
speculations that might be overturned by subsequent
subseq uent data. Howev
However
er,,
while there was compelling evidence that cholera was water-borne
rather than air-borne, the data alone was not suficient to break down
entrenched opinion. Eventually
Eventually, the weight of the data led to the
overturningg of the air-borne theory of cholera, but only through a social
overturnin
process where meaning is negotiated rather than absolute. For further
details about Farr and Snow
Snow,, see the Science Museum ((www.
www.
sciencemuseum.org.
sciencemuseum.org.uk/
uk/broughttolife/
broughttolife/people/
people/williamfarr
williamfarr)) and for the
Data summarization
Digesting data into a summary measure (e.g., creating a weighted average) is a one-way street. We lose information about the underlying data and are left with just a final single value. Summary data is easier to work with, but is the trade-off worth it? Consider two movies:
1. Eat Pray Love (2010) – IMDB movie rating 5.7/10
2. Inception (2010) – IMDB movie rating 8.8/10
Which is the better movie? Now consider the additional information – look at where the insight and surprise are (Figure 3.4). It is evident that Eat Pray Love has fewer people voting than Inception – 64,565 versus 1,494,360. More strikingly, the distribution of the votes is very different. Eat Pray Love has 7.1% of voters rating the movie 10 and 5.2% rating it 1. Movie-goers are polarized between loving and hating this movie. Inception has a distribution that looks more like a long tail – 36.8% rate the movie 10 with a fall-off thereafter (although some people don't like the movie as the proportion of movie-goers rating it 1 is 1.0% – more than the number of movie-goers rating it 4). Using a summary measure, such as the mean or the median, of necessity involves information loss.
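A tiny Python illustration of this one-way street (the two vote distributions below are invented, not IMDB data): two films with very different audiences collapse to exactly the same summary value.

```python
def mean_rating(distribution):
    """Collapse a full distribution of votes into a single summary number."""
    total_votes = sum(distribution.values())
    return sum(score * votes
               for score, votes in distribution.items()) / total_votes

# Two invented films with 100 votes each
polarized = {1: 50, 10: 50}   # half the audience hates it, half loves it
lukewarm  = {5: 50, 6: 50}    # everyone is mildly positive

m1, m2 = mean_rating(polarized), mean_rating(lukewarm)   # both 5.5
```

The mean cannot tell these two audiences apart; only the distribution can.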
Data quality
Redman (2008) places data quality between IT infrastructure and exploitation. Drawing on Redman's data quality framework, we recognize the pivotal role of data quality in business analytics, linking data with decisions (Figure 3.5).
the other end of the spectrum, there is often a lack of data ownership; the data is thrown into a data warehouse and it is assumed that the CIO is ultimately responsible for the organization's data. Data ownership and role definition is a fundamental part of data management: Which business unit creates the data? Which business units can access the data? Which can change the data?
Having data of an appropriate level of quality is a fundamental requirement for business analytics. While quality has been defined in different ways, two views dominate: the production view and the consumption view.
Accuracy
Accuracy is the degree to which data correctly describes the 'real-world' object or event being described. For example, a customer's family name may be incorrectly spelled as a result of a data entry error. A relevant measure might be the percentage of data entries that pass the data accuracy rules.
Completeness
Completeness is concerned with comprehensiveness. Data can be complete even if optional data is missing. As long as the data meets expectations then it is considered to be complete. For example, a customer's first name and last name are mandatory but middle name is optional and so a record can be considered complete even if a middle name is not available. A relevant measure for completeness might be the percentage of data fields complete.
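A completeness measure of this kind is easy to operationalize. The records and the choice of mandatory fields below are invented for illustration:

```python
# Invented customer records: first and last name are mandatory,
# middle name is optional and does not count against completeness.
customers = [
    {"first_name": "Ada",  "last_name": "Lovelace", "middle_name": None},
    {"first_name": "Alan", "last_name": None,       "middle_name": None},
]
MANDATORY = ("first_name", "last_name")

def completeness_pct(records, fields=MANDATORY):
    """Percentage of mandatory fields that are populated."""
    total = len(records) * len(fields)
    filled = sum(1 for r in records for f in fields if r[f] is not None)
    return 100.0 * filled / total

pct = completeness_pct(customers)   # 3 of 4 mandatory fields -> 75.0
```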
Timeliness
Timeliness is the extent to which information is available when it is expected and needed. Timeliness will vary depending on the context. Real-time data, measured in sub-milliseconds, might be needed for high-frequency trading while daily (every 24 hours) data might be acceptable for a corporate billing system. A relevant measure for timeliness is the time interval between the time period the data represents (or when it was generated) and that data being available.
Validity
Validity is concerned with the degree to which the data makes sense. For example, the age at entry to a UK primary & junior school is captured on the form for school applications. This is entered into a database and checked that it is between 4 and 11. If it were captured on
Data characteristics
When we do business analytics we need to know what types of data we are working with. For example, is it numeric? If it is numeric, does it represent financial data? If so, in which currency is it denominated? All
data management and data quality initiatives are built on the basic
building blocks of data types and all data instances should be consistent
with their data type and follow any rules that apply to that data type.
Data types
All data has an underlying type (or units) that helps us map it and set constraints on the values the data might take. For example, consider the following:
The type 'month' can be represented (or mapped) as either 'January' or 01.
Data can be of type 'number' (e.g., 2000) and the rules of a number provide us with validity constraints (e.g., must be in the range 1–10).
Common base data types include number, text, date, location, time, currency, and time interval.
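A short sketch of how a type both maps and constrains data (the function and names are ours, purely for illustration):

```python
MONTH_NAMES = ("January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December")

def month_name(month_number):
    """Map the numeric representation of 'month' to its name, enforcing the
    validity constraint that comes with the type."""
    if not 1 <= month_number <= 12:
        raise ValueError("month must be in the range 1-12")
    return MONTH_NAMES[month_number - 1]

jan = month_name(1)   # the same month mapped as 'January' rather than 01
```

Any instance that violates the rules of its type (a month of 13, say) is rejected, which is exactly the role a data-type constraint plays in a database.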
Variables
Variables used in models can be broadly distinguished as categorical or continuous. With categorical data, entities are divided into distinct categories:
Binary variable – there are only two categories (e.g., dead or alive).
Nominal variable – there are more than two categories (e.g., whether someone is an omnivore, vegetarian, vegan, or pescatarian).
Ordinal variable – this is similar to a nominal variable, but the categories have a logical order (e.g., whether a student earned a fail, a
Cardinality
Cardinality is the number of unique points within a column of data. Higher cardinality of the data implies more unique values. Unique identifier (ID) columns have full cardinality, since each value is, by definition, unique. The lowest cardinality is achieved when every row has the same value for a given column; such a variable would have no information content within the dataset, although it might be meaningful in a wider context (e.g., when cross-referenced to another dataset with different values for that variable).
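Cardinality is straightforward to compute – it is simply the count of distinct values in a column. The columns below are invented for illustration:

```python
def cardinality(column):
    """Number of unique values within a column of data."""
    return len(set(column))

# Invented columns for illustration
customer_id = [101, 102, 103, 104]            # unique ID: full cardinality
segment     = ["A", "B", "A", "B"]            # low cardinality
company     = ["SBS", "SBS", "SBS", "SBS"]    # constant column

full = cardinality(customer_id)   # 4: every row unique
low  = cardinality(segment)       # 2
one  = cardinality(company)       # 1: no information content in this dataset
```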
Data distributions
When we observe real-world data we find that many distribution patterns keep reappearing. For example, the height of humans is a classic example of the bell-shaped normal distribution (also known as the Gaussian distribution), shown in Figure 3.7 where the mean is zero and the standard deviation one (this is also known as a z-distribution). Another common distribution is the exponential (Figure 3.8), which takes a rate parameter that alters the steepness of the curve. Other common patterns have been found and named after their observers, for example Poisson and Weibull.
Some analysis techniques require data (or at least, the error terms) to be normally distributed. We can visually inspect the distribution with a histogram and perform statistical tests to assess skew (do the observations pile up on one side or the other?) and kurtosis (is the distribution too peaky or too flat?). If we need normally distributed data and this assumption is violated, then we might consider performing a transformation on the data to make it more approximately normal (i.e., look more like Figure 3.7).
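The following Python sketch simulates right-skewed data and shows a log transformation pulling the skewness back towards zero; the skewness function implements the usual third standardized moment, and the data is simulated, not real.

```python
import math
import random

def skewness(xs):
    """Sample skewness: the third standardized moment."""
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / sd) ** 3 for x in xs) / n

# Simulate right-skewed (log-normal) data: exponentiate normal draws
rng = random.Random(0)
data = [math.exp(rng.gauss(0, 1)) for _ in range(10_000)]

raw_skew = skewness(data)                          # strongly positive
log_skew = skewness([math.log(x) for x in data])   # near zero after transform
```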
(standard deviations away from the mean) several days in a row (Matthews 2016, p.203). Given that roughly 95% of observations would be expected to fall within two standard deviations in normally distributed data (inspection of Figure 3.7 shows that around 95% of the area under the curve is accounted for between −2 and +2), this data is remarkably unlikely. With four standard deviations (a 4-sigma event) the odds are around 16,000 to 1. A 25-sigma event should occur on average every 10^135 years – a figure that is astronomically unlikely (i.e., inconceivably longer than the age of the universe). The problem is that the financial analysts were relying on the data being normally distributed, and ratings agencies relied on instruments such as collateralized debt obligations (CDOs) being normally distributed,
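These probabilities can be checked directly from the normal distribution using the complementary error function:

```python
import math

def tail_probability(k):
    """P(|Z| > k) for a standard normal variable - a 'k-sigma event'."""
    return math.erfc(k / math.sqrt(2))

within_2sd = 1 - tail_probability(2)    # ~0.954: the 95% rule of thumb
odds_4sigma = 1 / tail_probability(4)   # ~15,800 to 1: 'around 16,000 to 1'
```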
Outliers
An outlier is an observation that is distinctly different from the other observations. It is typically judged to be an unusually high or low value on a variable or a unique combination of variables: IT STANDS OUT.
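One common way to operationalize 'stands out' is a z-score rule: flag values more than a chosen number of standard deviations from the mean. The salary figures below are invented, and with only ten observations we use a threshold of 2.5 rather than the conventional 3.

```python
def zscore_outliers(xs, threshold=2.5):
    """Flag values more than `threshold` standard deviations from the mean."""
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return [x for x in xs if abs(x - m) / sd > threshold]

# Invented salaries (in thousands): one observation stands out
salaries = [52, 48, 50, 51, 49, 50, 53, 47, 51, 250]
outliers = zscore_outliers(salaries)   # [250]
```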
Missing data
Missing data arises when observations are missing for a column (variable). For example, we might have recorded household income for customers, but find that not all customers are willing to divulge this information.
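A sketch of detecting – and crudely patching – missing values; the customer records are invented, and mean imputation is only one of several possible remedies.

```python
# Invented customer table; None marks income the customer declined to give
customers = [
    {"id": 1, "income": 52_000},
    {"id": 2, "income": None},
    {"id": 3, "income": 61_000},
    {"id": 4, "income": None},
]

missing = sum(1 for c in customers if c["income"] is None)
missing_pct = 100.0 * missing / len(customers)          # 50.0

# Mean imputation: replace missing values with the mean of observed ones
observed = [c["income"] for c in customers if c["income"] is not None]
mean_income = sum(observed) / len(observed)             # 56,500.0
imputed = [c["income"] if c["income"] is not None else mean_income
           for c in customers]
```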
Summary
Data is a fundamental part of our lives; it is how we make sense of the world in which we live. It is almost impossible to imagine a world in which we have no data. However, data is only useful if we are able to extract information and knowledge from it and then have the wisdom to make better decisions and take more effective action. If we are to rely on, and extract value from, data then it must be fit for purpose – that is, of sufficient quality.
The volume, velocity, and variety of data are all growing, and while big data opens up new opportunities, there is risk that we will be drowned in the data deluge. There is a further risk in big data that we rely on machines to build models based on correlation scavenging in N = all datasets (the so-called 'end of theory'). Such models can fit the data remarkably well, but can come up short when faced with unseen data.
Making sense of all this data and using it wisely to make decisions is a major challenge for the world today and one that affects all our lives.
References
Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired, 23 June.
Farr, W. (1852). Report on the mortality of cholera in England, 1848–49. W. Clowes, London. https://archive.org/details/b21516911/page/n79.
Farr, W. (1885). Vital statistics: A memorial volume of selections from the reports and writings of William Farr. Edited for the Sanitary Institute of Great Britain by Noel A. Humphreys. Available from the Hathi Trust: https://babel.hathitrust.org/cgi/pt?id=hvd.li3s12.
Ginsberg, J., Mohebbi, M., Patel, R., Brammer, L., Smolinski, M., & Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature, 457: 1012–1014.
Harford, T. (2014). Big data: Are we making a big mistake? FT Magazine, 28 March.
Igneous (2018). 2018 State of Unstructured Data Management. https://www.igneous.io/.
Lazer, D. & Kennedy, R. (2015). What we can learn from the epic failure of Google Flu Trends. Wired, 10 January. https://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google Flu: Traps in big data analysis. Science, 343(6176): 1203–1205.
Matthews, R. (2016). Chancing it: The laws of chance and how they can work for you. Profile Books, London.
Redman, T. C. (2008). Data driven: Profiting from our most important business asset. Harvard Business Press, Cambridge, MA.
Rizkullah, J. (2017). The big (unstructured) data problem. Forbes, 5 June. https://www.forbes.com/sites/forbestechcouncil/2017/06/05/the-big-unstructured-data-problem.
R. Vidgen et al., Business Analytics
https://doi.org/10.26777/978-1-352-00726-8_4
4. Data Exploration
Richard Vidgen1 , Samuel N. Kirshner2 and Felix Tan3
(1) Business School, University of New South Wales, Sydney, Australia
(2) Business School, University of New South Wales, Sydney, Australia
(3) Business School, University of New South Wales, Sydney, Australia
Richard Vidgen (Corresponding author)
Email: r.vidgen@unsw.edu.au
Samuel N. Kirshner
Email: s.kirshner@unsw.edu.au
Felix Tan
Email: f.tan@unsw.edu.au
Introduction
The increased volume of structured data implies that firms are not just collecting information on more subjects (data-table rows) but collecting more information (data-table columns, or variables) for each subject. With billions of rows and anywhere from hundreds to thousands of columns of data, gaining insight through human inspection of tables is virtually impossible. Statistical analysis is also difficult given the sheer number of variables and dependencies across datasets. Instead, firms are increasingly relying on visualization software to explore and understand data. Visualizations allow analysts to get to grips with and to comprehend massive amounts of data quickly. Unlike data in tabular form, visualizations make patterns, trends, and outliers easier to recognize.
Fundamentals of visualization and exploration
Data exploration is critical for understanding the underlying structure of each data column. In addition, exploration of the dataset provides an opportunity to look for patterns, trends, and relationships. Univariate analysis focuses on a single variable, whereas multivariate exploratory analysis familiarizes the analyst with the dataset by producing visualizations that provide different perspectives on the data. Visualizations help users identify relevant variables as well as correlations. This allows analysts to quickly determine what factors and measures are essential for further analysis and to build hypotheses which can be validated through models and experiments. As a result, visual exploration guides predictive model development and can identify potential data that the organization should acquire or collect.
To demonstrate the importance of graphing data before analysing it and the effect of outliers on statistical properties, the statistician Francis Anscombe constructed 'Anscombe's quartet' in 1973. The quartet comprises four datasets that have nearly identical simple statistical properties (mean, standard deviation, correlation), yet appear very different when graphed. Each dataset consists of 11 (x, y) points. If each of the four datasets is modelled with a line of best fit, all four models would be characterized by the same line: y = 3 + 0.5x. Moreover, the fit of the line for all four models is the same (this is measured by the R-squared value, which is 63% for each model).
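Using the published values for the first two of Anscombe's datasets, we can verify the identical best-fit lines directly in Python:

```python
# Published values for the first two of Anscombe's four datasets
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def best_fit(xs, ys):
    """Ordinary least squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
         / sum((xi - mx) ** 2 for xi in xs))
    return my - b * mx, b

(a1, b1), (a2, b2) = best_fit(x, y1), best_fit(x, y2)
# Both lines are (to two decimal places) y = 3 + 0.5x
```

The summary statistics agree; only a graph reveals how differently the two datasets are shaped.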
The first scatter plot (Figure 4.1 (X1, Y1)) appears to be a simple linear relationship, corresponding to two correlated variables. While the second graph (Figure 4.1 (X2, Y2)) shows a clear relationship
Visualization software
Although big data has provided new opportunities for utilizing visualization software, it has created non-trivial challenges to the display of visuals. In fact, standard graphing tools are mostly incapable of meaningfully displaying big data. Applications do not have the capacity to plot a billion points in a tractable amount of time. Thus,
the best visualization based on the input data and the user's objective, and (3) collapse results such that the graphs convey meaning without losing valuable information.
We use SAS Visual Analytics (SAS VA) and SAS Visual Statistics (SAS VS), which is an add-on to SAS VA, to explore and model data. SAS VA is a browser-based analytics platform that uses proprietary technology to analyse large datasets. SAS VA enables users to prepare, explore, and communicate data. The SAS VS component enables users to perform data mining and build predictive analytic models while taking advantage of SAS's powerful in-memory data capabilities.
Explorer application. Once the Explorer is opened, you can select a previously saved exploration to continue working on the saved exploration or you can choose to create a new exploration. To start a new exploration click 'Select a Data Source'. If the dataset employee_attrition (note that SAS VA automatically converts a lower-case file name into an upper-case dataset name) has already been loaded into the server, select it from the list of available datasets. If it is not available, then upload the data using the Import Data functionality on the right-hand side of the Open Data Source window. Once the data is loaded into SAS VA, it need not be loaded in again. In addition, both the Data Explorer and the Report Designer can access a dataset once it is uploaded to the in-memory server.
The Data Explorer application is shown in Figure 4.5. The Data Explorer application has a double-layered menu bar and three column panels: the left panel corresponds to the data, the middle panel is where the visualizations appear, and the right panel allows the user to edit the properties of their visualizations.
Data panel
In the data panel, the dataset is listed (in this case employee_attrition),
and there is a drop-down menu that allows the user to add additional
datasets to the exploration. Beside the drop-down menu is an options
button. The option button allows the user to change the data source,
create data hierarchies and new data variables (for example, based on
interaction effects or calculations using the existing variables), and
show/hide variables, as shown in Figure 4.6. There is a search bar, which allows the user to search for variables. For example, searching for the word 'job' provides the user with the variables JobRole, JobInvolvement, JobLevel, and JobSatisfaction. This is particularly helpful for large datasets.
Creating visualizations
There are several ways to create a visualization. The easiest way of creating a visualization and editing its properties is to drag a variable of interest into the middle panel, where it helpfully says 'Drop a data item here'. For example, dragging the category variable JobRole into the middle produces a histogram of JobRoles, which is shown in Figure 4.7.
Looking at the right panel, that is, the visualizations property panel, shows that the visualization created is an 'Automatic Chart'. SAS uses its best interpretation of the data to create (what is seemingly) the most useful chart. Since it selected a bar chart, you have the option to click on the button in the Roles tab of the property window, 'Use a Bar Chart' (see Figure 4.8). If you intend to create a Bar Chart, then there will be a greater selection of properties for the bar chart.
would remove all data from the visualization). Similarly, the user can add measures, which would then change the visualizations from a histogram, measuring the frequency of the number of employees in each role, to a bar graph of that measurement.
Figure 4.15 Better bar chart of average age across job roles and gender
The interface for creating visualizations is intuitive and you should try to create different visualizations either using the automatic chart process or by using the icons in the second row of the menu bar.
The data pane enables the user to create enriched subsets of uploaded data files in addition to presenting an overview of the data variables. This includes creating new variables through groupings, hierarchies, and defining new data variables using calculations.
There are important trade-offs between refinement prior to making the data available for analysis and allowing the analysts to create data subsets (from a singular uploaded data file). If a software such as Excel is utilized for data preparation or the data managers are not trained computer/data scientists, then refinement should be done after the (cleaned) raw data file is uploaded to SAS VA. The Data Explorer
Calculated items enable the user to build a new variable using the
existing variables and logical functions (e.g., equals, greater or equal to,
if and else statements) and numeric functions (e.g., absolute values, log,
power, root, round). The new variable will typically be either text or
numeric, which can be set by changing the result type at the top of the
new calculated item window (default is numeric). Users can build
new data variables by dragging the necessary logical and numeric
functions and variables into the middle area of the window. For
example, to make a custom variable which indicates whether a nation
has high GDP and education or not, we use an If statement where, if
Years of Education is higher than a threshold value AND (which
requires using an AND statement) GDP per Capita is higher than a
threshold value, then the variable returns ‘High’; otherwise the variable
returns ‘Low’ (Figure 4.19).
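Outside SAS VA, the same If/AND calculated-item logic can be expressed in a few lines of code. The sketch below is illustrative only: the country records, field names, and threshold values are invented for the example and are not taken from the book's dataset.

```python
# Hypothetical country records; names and thresholds are illustrative.
countries = [
    {"country": "A", "years_of_education": 14.0, "gdp_per_capita": 45000},
    {"country": "B", "years_of_education": 9.5,  "gdp_per_capita": 8000},
    {"country": "C", "years_of_education": 12.0, "gdp_per_capita": 30000},
]

EDU_THRESHOLD = 12.0
GDP_THRESHOLD = 25000

def development_level(row):
    """If Years of Education > threshold AND GDP per Capita > threshold
    -> 'High', otherwise 'Low' -- the same If/AND structure as the
    calculated item described above."""
    if (row["years_of_education"] > EDU_THRESHOLD
            and row["gdp_per_capita"] > GDP_THRESHOLD):
        return "High"
    return "Low"

labels = [development_level(c) for c in countries]
print(labels)  # ['High', 'Low', 'Low']
```

Only country A exceeds both thresholds, so only it is labelled 'High'.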
Histograms
Histograms are a particular type of bar chart that focus on a single
variable, plotting the frequency of discrete intervals (i.e., bins) of a
measure variable. If the variable is numerical, then, to create a
histogram, numerical values must be binned together. When numerical
data is binned, the data within the interval of an individual bin is
treated as having the same value, essentially creating a category for a
range of numerical values. Histograms provide valuable information on
the central tendency and distribution of values (Figure 4.23).
In SAS VA histograms are separated from bar charts because the
input to create the visualization is a measure variable, whereas in bar charts
the required input is a category variable. Thus, for categorical variables,
histograms are created using a bar chart, as they can plot the frequency
of data observations for each category.
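The binning step that underlies a histogram can be sketched directly. The values and the fixed bin width below are invented for illustration; each value is mapped to a bin index, and within a bin all values are treated as one category of the measure.

```python
from collections import Counter

# Illustrative numeric values to bin into fixed-width intervals.
values = [3.2, 4.8, 5.1, 7.9, 8.0, 8.4, 9.7, 12.3]
BIN_WIDTH = 5  # values in [0, 5) fall in bin 0, [5, 10) in bin 1, ...

# Map each value to its bin index and count the frequency per bin.
bin_counts = Counter(int(v // BIN_WIDTH) for v in values)

for bin_index in sorted(bin_counts):
    lo, hi = bin_index * BIN_WIDTH, (bin_index + 1) * BIN_WIDTH
    print(f"[{lo}, {hi}): {bin_counts[bin_index]}")
```

Plotting the counts per bin as bars yields the histogram; with a category variable, the same frequency count is taken per category instead of per interval.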
Scatter plot
In a line chart, each value on the x-axis has at most one corresponding
point on the y-axis. In a scatter plot, values are plotted based on the
Cartesian coordinates of two variables. The x-axis in scatter plots can
also be time or an independent variable thought to cause a response in
the measure on the y-axis. The x-axis can also be a location, in which
case, if the y-axis is a location, the data can be scattered on a geo map.
Although scatter plots can have multiple y values for a single x value,
trends are still visible from scatter plots. In fact, line charts are often
used to summarize scatter plot data by displaying the best-fit line from
scatter plot data. Like the line chart, adding colour to the plot can
increase the dimensionality to three factors (Figure 4.25).
Bubble chart
Bubble charts are an extension of scatter plots. In a standard scatter
plot, each data point, which consists of an x and a y value, has a uniform
size on the graph. In a bubble plot, a third factor is included, where the
magnitude of the quantity corresponds to the radius of the point.
Similar to how higher cardinality of the x-axis makes a line chart more
straightforward to understand than a bar chart, the higher cardinality
of a third category makes bubble charts easier to comprehend than
scatter plots representing different groups or categories with different
colours. The additional benefit of the bubble chart is that colour can be
utilized to incorporate a fourth factor with low cardinality. If colour is a
categorical variable, then differentiated colours will be used for the
visualization. If colour is a measure variable, then it will use a colour
spectrum. SAS VA enables animation for bubble charts, showing how
the size and position of each bubble changes over time (Figure 4.26).
Pie chart
A pie chart is a circular graph where the proportion of the pie’s slices
corresponds to the quantity value of an item in a category. When the
slices of the pie have similar proportions, value interpretation is difficult.
As a result, bar charts are often a better visualization tool than pie
charts for comparing quantities in a category.
Pie charts can be effective for illustrating a point when the number
of slices is low (2–6), and a few of the slices (1–3) are dominant. If there
are lots of categories with low quantities, then they can be grouped
together as an ‘other’ category. Pie charts are also useful when the
comparisons are ratios since the representation of the data is built into
the chart.
Figure 4.27 shows two different pie charts, with information on the
number of people in a dataset who are single, married, or divorced. The
second figure has additional information about how the breakdown of
marital status varies across different US states. Although the figure has
additional information, it is difficult to ascertain even the most basic
information on whether married people outweigh those who are single
and divorced. Because they are easy to abuse, pie charts should be
approached with caution.
Box plot
Box plots group data by quartiles, where the top of the box corresponds
to the 75th percentile (top quartile) and the bottom of the box
corresponds to the value of the 25th percentile (lowest quartile). A line
at the 50th percentile divides the box to separate the middle two
quartiles. Two vertical lines, known as whiskers, extend from the top
and bottom of the box to indicate the maximum and minimum expected
values.
Box plots are useful for examining outliers since they appear
outside of the whiskers. Comparing box plots across categories can
allow analysts to quickly determine when extreme values are
potentially meaningful and not just random noise. A firm can use the
information to explore potential factors that lead to these customers
having extreme lifetime values. Figure 4.29 shows that there are several
outlier insurance charges for non-smokers, which are higher than the
75th percentile. The asymmetry of the outliers helps explain why the
mean (the diamond marker) is above the median. For the smokers, the
mean is below the median, since the 25th–50th quartile has a greater
range of values in comparison to the 50th–75th quartile. A comparison
of the box plots shows that smokers have a significantly wider range of
insurance charges and that the charges are higher on average.
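The quartile and whisker arithmetic behind a box plot can be computed directly. The charges below are invented, and the 1.5 × IQR whisker rule used here is a common convention for flagging outliers; the exact rule SAS VA applies may differ.

```python
import statistics

# Illustrative charges; one deliberately extreme value.
charges = [1200, 1500, 1800, 2100, 2400, 2700, 3000, 9500]

# 25th, 50th, and 75th percentiles define the box.
q1, q2, q3 = statistics.quantiles(charges, n=4)
iqr = q3 - q1

# Whiskers mark the expected range; values beyond them are outliers
# (under the common 1.5 * IQR convention).
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr
outliers = [c for c in charges if c < lower_whisker or c > upper_whisker]

print("box:", q1, q2, q3)
print("outliers:", outliers)
```

Here only the 9500 charge falls above the upper whisker, which is exactly the kind of asymmetric outlier that pulls the mean above the median.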
Heat map
Heat maps represent data in a matrix or on a geospatial map. Like tree
maps, heat maps use colour to express different values for the
combination of categories (or quantitative factors) given by the matrix
position. In addition, as with tree maps, heat maps are useful for
displaying large datasets and for identifying outliers. Figure 4.31
shows an example of the same data as the previous tree map, that is,
customer lifetime value by education (x-axis) and marital status (y-axis).
The size and position of each box in a tree map is determined by the
data, whereas the size of a box in a heat map is fixed by the matrix
coordinates (or spatial location). Thus, a tree map is more useful for
hierarchical data and showing part-to-whole relationships. A heat map
is more useful for displaying data across multiple (non-hierarchical)
categories.
Geo map
SAS VA creates unique properties for geographical data so that it can be
plotted on a map with a variable of interest. A geo map plots the
data of interest on a physical map. The map can plot up to two different
additional measure variable dimensions, using bubble size and bubble
colour. Geo maps use two dimensions to plot categories: longitude and
latitude. This makes geo maps less effective when there are only a
limited number of regions being plotted (less than 5–6). However, if
there are many regions being plotted, then geo maps utilize location to
provide context to the categories.
Figure 4.32 uses the country dataset to plot GDP per capita (bubble
size) and years of education (colour). The geo maps in SAS VA are
interactive, so hovering over a bubble provides the specific values for
that country.
distributed, since this will help meet the requirements of linear
regression. If the data is skewed, then the user may consider
transforming the data. For data that is skewed to the right, create a new
By using the measure details, we can see that the average BMI is
30.66. Given that a healthy BMI is 25, and the average is approximately
30, we could be interested in how a BMI score of greater than 30 relates
to charges. In this case, we would want to create a binary value called
BMI30, based on whether the BMI is less than or equal to 30 or greater
than 30. We create a new custom category based on BMI, assigning
category 1 to be ‘below 30’ with the interval range from 0 to 29.99 and
category 2 to be ‘above 30’ with the interval range from 30 to 55.
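The BMI30 custom category amounts to a simple threshold rule, which can be sketched as follows. The BMI values used here are invented for illustration; the interval boundaries follow the text (0–29.99 is 'below 30', 30–55 is 'above 30').

```python
def bmi30_category(bmi):
    """Assign the BMI30 custom category described in the text:
    'below 30' for BMI in [0, 30), 'above 30' for BMI in [30, 55]."""
    return "below 30" if bmi < 30 else "above 30"

# Illustrative BMI values (not the insurance dataset's actual rows).
bmis = [22.5, 29.99, 30.0, 41.3]
categories = [bmi30_category(b) for b in bmis]
print(categories)  # ['below 30', 'below 30', 'above 30', 'above 30']
```

Note that the boundary value 30.0 falls in 'above 30', matching the interval definition in the text.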
After performing a univariate analysis and creating two new
variables (and saving the initial exploration), we begin the next step by
looking at the relationship between charges and other variables (i.e.,
bivariate and trivariate analysis). The multivariate visualization can
reveal outliers. When creating prediction models, knowledge of outliers
enables you to refine the dataset to exclude these values or alternatively
to create prediction models to try and identify what drives the outliers.
Colours and size in bubble charts can help provide insight into factors
that may explain the drivers of outliers.
To carry out the multivariate analysis on the insurance dataset, we
start by looking at the values of charges across region and sex. We see
that in all four regions males have more charges than females and that
in the southern regions the difference in average charges is significantly
greater than in the northern regions (Figure 4.39). We then change the
variable sex to smokers and see that smoking has a strong relation to
the amount of charges (Figure 4.40). For non-smokers, the average
charges appear to be consistent across regions. However, the average
charges for smokers are higher in the south.
Figure 4.39 Bar chart visualization showing charges by region and sex
The bar charts show the relationship between the variables charges,
region, and smoker. In the properties window the user can add more
category variables using lattices. Lattices create columns of the
visualization for each item in the category. For example, by grouping the
variables by BMI30 we can use smokers (yes or no) as a lattice category
to clearly see how BMI30 impacts charges for each region for smokers
and non-smokers. This is shown in Figure 4.41. From the figure, we can
quickly see the potential impact of smokers with a high BMI on charges.
The current visualization tells us that the charges for non-smokers are
not influenced by a high BMI to the extent that charges for smokers are.
In addition, the region appears to have little influence on our findings.
Figure 4.42 Line chart visualization showing average charges by age, whether the charge was
made by a smoker, and whether BMI is over or under 30
The figure shows that charges increase with age, but that BMI does
not matter for non-smokers. On the other hand, smokers with low BMI
have greater charges on average than older non-smokers, and this
impact is obviously exacerbated if the smokers also have a BMI over 30.
If we wanted to create the same figure, but with the four lines on a
single graph, we would need to create a new variable. Using nested if
statements (see Figure 4.43), we can create a variable that has four
categories: non-smoker with BMI under 30, smoker with BMI under 30,
non-smoker with BMI over 30, and smoker with BMI over 30. We can
then group the line graph with age and charges together (Figure 4.44).
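The nested-if construction above can be sketched in code. The function below is illustrative: the category labels are paraphrased from the text, and the boundary value of 30 is placed in the 'over 30' group to stay consistent with the BMI30 category defined earlier.

```python
def smoker_bmi_group(smoker, bmi):
    """Nested if statements producing the four categories from the text.
    BMI of exactly 30 counts as 'over 30', matching the BMI30 intervals."""
    if smoker == "yes":
        if bmi >= 30:
            return "smoker, BMI over 30"
        return "smoker, BMI under 30"
    if bmi >= 30:
        return "non-smoker, BMI over 30"
    return "non-smoker, BMI under 30"

print(smoker_bmi_group("yes", 33.1))  # smoker, BMI over 30
print(smoker_bmi_group("no", 24.0))   # non-smoker, BMI under 30
```

Grouping the line chart by this single variable gives one line per category, which is exactly what combining the four lines on to one graph requires.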
2.
How do gender and marital status relate to employee attrition?
3.
How does attrition vary across job roles?
4.
How does attrition vary across business travel?
5.
Based on the categorical data, what hypotheses can you make
regarding causes of attrition?
Summary
Visualizations enable comprehension of large data volumes.
Visualization is utilized to (1) assess quality of data, (2) address
descriptive questions to understand current and past performance, and
(3) look for trends to build hypotheses for further analysis and inform
future data collection. Visualizations play a crucial role in data
refinement and data exploration and are a crucial step in the predictive
modelling process.
Current visualization software can quickly generate figures from big
data and convey meaning through the automatic process of selecting
the best visualizations and collapsing data. In SAS VA,
analysis can be conducted by simply dragging variables on to the
canvas or by selecting a visualization and using the property window to
input the relevant variables, making data exploration accessible to a
wide range of employees, not just data scientists.
Further reading
Fayyad, U. M., Wierse, A. & Grinstein, G. G. (Eds.). (2002). Information visualization in data mining
and knowledge discovery. Morgan Kaufmann, San Francisco, CA.
Few, S. (2009). Now you see it: Simple visualization techniques for quantitative analysis. Analytics
Press.
Healy, K. (2018). Data visualization: A practical introduction. Princeton University Press,
Princeton.
Keim, D. A. (2002). Information visualization and visual data mining. IEEE Transactions on
Visualization & Computer Graphics, (1): 1–8.
Keller, P. R., Keller, M. M., Markel, S., Mallinckrodt, A. J., & McKay, S. (1994). Visual cues: Practical
data visualization. Computers in Physics, 8(3): 297–298.
Kirk, A. (2012). Data visualization: A successful design process. Packt Publishing, Birmingham,
United Kingdom.
Sahay, A. (2016). Data visualization, volume I: Recent trends and applications using conventional
and big data. Business Expert Press, New York.
Sahay, A. (2017). Data visualization, vol. II: Uncovering the hidden pattern in data using basic and
new quality tools. Business Expert Press, New York.
Samuels, M. & Samuels, N. (1975). Seeing with the mind’s eye: The history, techniques, and uses of
visualization. Random House, New York.
Tufte, E. & Graves-Morris, P. (1983). The visual display of quantitative information.
Yau, N. (2011). Visualize this: The FlowingData guide to design, visualization, and statistics. John
Wiley & Sons, Indianapolis, IN.
© Richard Vidgen, Sam Kirshner and Felix Tan, under exclusive licence to Springer Nature
Limited 2019
R. Vidgen et al., Business Analytics
https://doi.org/10.26777/978-1-352-00726-8_5
Samuel N. Kirshner
Email: s.kirshner@unsw.edu.au
Felix Tan
Email: f.tan@unsw.edu.au
Chapter Overview
Clustering is a machine learning approach for grouping data with
similar underlying characteristics. Clustering is
used to explore data without requiring a specific outcome variable,
that is, it is unsupervised learning. Clustering has a wide range of
applications, both in business applications, such as consumer
marketing, and in information system-driven applications, such as
image recognition and recommendation systems. This chapter
presents an overview of clustering and segmentation. After
providing a high-level overview of applications we detail the two
most common methods for clustering data: hierarchical clustering
and k-means clustering. We then cover how to implement clustering
in SAS Visual Analytics (SAS VA) and how to analyse the results using
cluster matrices and parallel coordinate plots.
Learning Outcomes
After completing this chapter you should be able to:
Introduction
Machine learning differs from traditional programming because the
machines are programmed to learn from data, similar to people
learning from experience, to improve performance in accomplishing a
task. Although the roots of machine learning date back to the late
1950s, with the rise in big data, machine learning has become
widespread in industries, serving as the foundation of predictive
analytics. Machine learning enables websites like Amazon, Google, and
Netflix to make product and media recommendations. Other canonical
applications of machine learning include algorithmic trading, medical
diagnosis, and image, video, and natural language processing. Part of
the importance of machine learning stems from the fact that the data is
often both growing and changing. In many applications, such as credit
card fraud, subtle changes in the data can have drastic implications for
an organization. Machine learning algorithms (unlike humans) can
detect small structural changes within ever-expanding datasets, helping
firms build and maintain competitive advantages.
In general machine learning is applied to the objectives of discovery
and prediction. These two objectives require different types of
algorithms and correspond to two different types of learning:
supervised and unsupervised learning. The objective of supervised
Segmentation
The difference between segmentation and clustering is similar to the
difference between data mining and predictive analytics: they are
mostly the same, but with a different emphasis. In business, particularly
in marketing applications, clusters are typically referred to as
segments. Clustering is the technical process for unsupervised
Forming clusters can be quickly done for small dimensional spaces. For
example, consider the eight characters from Mario Kart 64, whose
attributes vary between their speed (y-axis) and their strength (x-axis).
It is easy to see in Figure 5.1 that these observations can be grouped
into three different clusters: one group that scores high in the variable
speed and low in the variable strength (fast but weak characters); one
that scores high in the variable power and low in the variable speed
(strong but slow); and one that has average values of each attribute
(balanced). The naked eye can see similar groups when the data is
presented in two dimensions and can form clusters. However, if people
or objects have many different characteristics, it is challenging to form
interpretable groups that capture similarity of traits. The actual value of
clustering comes from data consisting of hundreds or even thousands
of dimensions as clustering can reveal the underlying structure of the
data that is unobservable to the human eye.
K-means clustering algorithm
k-means clustering is a centroid method and is the most common
method for determining clusters, due to its scalability to large datasets.
The first step in running a k-means clustering algorithm is specifying a
value for k, that is, the number of clusters that the algorithm will
produce. The value of k is often based on subject matter knowledge or
specific requirements. For example, if the product line has four different
variants of a product, then a marketer would likely select k = 4 to
determine consumer groups that are most likely to align with the
products. The number of clusters could also be based on a system of
trial and error, where the algorithm is run for a variety of values of k so
that the analyst can see the resulting clusters before finalizing the
choice of k. Alternatively, k can be selected via statistical techniques or
just by choosing some arbitrary value. If there is insufficient
information to select k, then hierarchical clustering can be used to
segment the data to provide some insight into an appropriate number
of clusters (value of k).
There are a variety of ways to start the algorithm once k has been
determined. For example, each of the objects can be randomly allocated
to one of the k clusters, or the centre of each cluster can be given
random values across the range of variables. To illustrate the algorithm,
assume that the k centres are randomly distributed across the variable
space. After the centres are selected, each observation is allocated to
the cluster of the closest centre. Based on these new clusters, the
midpoint readjusts such that it is at the centre of its cluster. Since the
centre of the cluster has changed, there may be observations in other
clusters that are now closer to a different centroid. Thus, the
observations’ clusters are reassigned after the centroids are updated. If
there are new additions or subtractions from a cluster, then the centre
of the cluster will again change. This process repeats itself until
changes in the centroid no longer result in changes to the clusters. At
this point, the algorithm terminates and the clusters are finalized.
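The assign-and-update loop just described can be sketched in a few lines of Python. This is an illustrative toy implementation, not SAS VA's: it initializes the centres at k randomly chosen observations (one of the starting variants mentioned above), uses Euclidean distance, and stops when no observation changes cluster. The two-blob dataset is invented for the example.

```python
import math
import random

def kmeans(points, k, seed=0):
    """Plain k-means: assign each point to the closest centre, move each
    centre to the mean of its cluster, repeat until assignments stop
    changing."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]  # start at k observations
    assignment = [None] * len(points)
    while True:
        # Assignment step: each observation joins the closest centre.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        if new_assignment == assignment:  # no changes -> terminate
            return assignment, centres
        assignment = new_assignment
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centres[j] = [sum(dim) / len(members) for dim in zip(*members)]

# Two well-separated blobs; with k = 2 each blob should form one cluster.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centres = kmeans(points, k=2)
print(labels)
```

On this toy data the loop converges in a couple of iterations, with each blob ending up in its own cluster, mirroring the step-by-step behaviour traced in Figure 5.3.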
To provide a concrete example of this method, consider the data
from the hierarchical clustering example. Since there are only a few data
points, k is set to 2. In Figure 5.3(a) the centres (represented by stars) are
randomly allocated and, based on a metric capturing distance, A, D, and
E are allocated to the dark star and B, C, F, and G are allocated to the
light star. The centres of each cluster are updated in Figure 5.3(b),
where the hollow stars are the previous positions and the solid stars
are the updated positions. Notice that the observations for D and E pull
the dark centre down, and the observations B and C pull the light centre
up. Because of the shifts in the centroids, Figure 5.3(c) shows that
observation A is reassigned to the light cluster. Since A is no longer in
the dark cluster, the centroid of the dark cluster moves further
down towards D and E, while the light centroid moves even further up
towards A, B, and C. This can be seen in Figure 5.3(d). The change in the
centroids causes observation G to be allocated to the dark cluster in
Figure 5.3(e). The switch of observation G causes the centroids to
update again in Figure 5.3(f). In Figure 5.3(g) the observation F
changes to the dark cluster. Finally, in Figure 5.3(h), the centroids
update, but no observations change clusters, which terminates the
algorithm.
Distance measures
There are several ways to measure the closeness of observations. The
most straightforward is Euclidean distance, which is an ordinary
straight line between two points in a multi-dimensional Euclidean
space. In two dimensions, the Euclidean distance is calculated through
the Pythagorean Theorem. Other measures include squared Euclidean
distance, which penalizes greater distances, and Mahalanobis distance,
which measures distance using standard deviations. Typically, before
measures of closeness are calculated, the variables are standardized
such that they all have the same magnitude regarding the range of
values. If customer data includes distances from the city centre and
income, the measures of closeness between the variables will be
skewed since distance and income are not on a similar scale.
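The distance measures and the standardization step can be sketched directly. The customer values below are invented; Mahalanobis distance is omitted since it additionally requires the covariance matrix of the variables.

```python
import math
import statistics

def euclidean(p, q):
    """Straight-line distance between two points (Pythagorean in 2-D)."""
    return math.dist(p, q)

def squared_euclidean(p, q):
    """Squaring penalizes larger distances more heavily."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def standardize(column):
    """Z-score standardization: express each value in standard deviations
    from the mean, so differently scaled variables become comparable."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]

# Illustrative customer data: distance from city centre (km) vs income ($).
distances_km = [1.0, 5.0, 9.0]
incomes = [30000, 50000, 70000]

# Before standardizing, income dwarfs distance in any closeness measure;
# after standardizing, both variables contribute on the same scale.
z_dist = standardize(distances_km)
z_income = standardize(incomes)
print(z_dist)    # [-1.0, 0.0, 1.0]
print(z_income)  # [-1.0, 0.0, 1.0]
```

After standardization both columns span the same range of z-scores, so a one-standard-deviation change in distance counts as much as a one-standard-deviation change in income.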
Clustering in SAS
To demonstrate clustering in SAS VA, we use a dataset called hofstede,
which provides values on Hofstede’s cultural dimensions for 70
countries. The dimensions are power distance, individualism–
collectivism, uncertainty avoidance, masculinity–femininity, short-term
orientation–long-term orientation, and indulgence–restraint. See Box
5.1 on Hofstede’s cultural dimensions for more information on each
construct. Hofstede’s dimensions are an ideal dataset for
demonstrating clustering since the variables are easily understandable,
each dimension is standardized (each dimension takes on a value
between 0 and 100), and each dimension is a measure variable. In SAS
VA, clustering only works for variables that are measures. The closeness
between categorical variables cannot be measured, which means
centrality measures cannot be calculated. As a result, clustering
typically pertains to measure variables.
When performing cluster analysis, it is essential to be mindful of the
variables being selected for clustering. The variables should be limited
to the most critical measures for differentiating objects or people. It is
also important that clusters exhibit a range of variables to better
understand how groups are differentiated. Recall that it was easy to
categorize the clusters in the Mario Kart example since there were only
two dimensions. The more variables that are used to cluster the data,
the less interpretable the groupings are. In the case of Hofstede’s
cultural dimensions, we will use each of the six dimensions.
cooperation (Wong and Ahuvia, 1998). Figure 5.4 shows the
individuality of the countries in the dataset. Uncertainty avoidance
pertains to the degree that society members accept and are
comfortable with ambiguity and uncertain situations. Masculinity is
the degree that society values assertiveness versus caring, which is
represented by femininity. In cultures displaying higher levels of
masculinity, there is greater admiration for the strong and men have
more dominant roles in organizations and families. In 1991 Hofstede
revised his cultural dimensions to introduce short-term
orientation–long-term orientation, where low values represent
greater valuing of societal traditions and high values represent
countries that favour adaptation. The last dimension, added in 2010,
is indulgence–restraint. Indulgence corresponds to a societal norm
promoting the enjoyment of life and having fun. A restraint society
attempts to control and regulate gratification through social norms.
The parallel coordinate plot bins each variable and draws a line for
each observation through the corresponding bin based on the
observation’s data for each variable. While the cluster matrix helps the
analyst understand how groups differ across combinations of variables,
the parallel coordinate plot enables the analyst to focus on a specific
group to understand the features defining the cluster. For example, the
parallel coordinate plot in Figure 5.5 shows that Cluster ID 0 represents
Figure 5.8 Geo map of cultural clusters (based on three cluster groups)
Figure 5.9 Geo map of cultural clusters (based on ten cluster groups)
can understand which types of players they are missing if they are to
be more successful and the number of available players of a certain
grouping (which can influence contract offers and salary
negotiations), and provide a greater understanding of which
combinations of player types lead to superior results. Thus,
clustering is an essential analytical tool for general managers in
basketball, as well as other sports.
1.
Use the NBA2018 dataset, which contains per game statistics for
259 players across the 20 major statistical categories, such as
minutes, points, assists, rebounds, and steals, to explore
potential clusters in basketball. To make appropriate clusters,
you will have to select the different measures that you think are
appropriate for creating groups. For example, you may just want
to consider basic stats, like points, rebounds, assists, steals,
blocks, and turnovers.
2.
Experiment with creating 5, 7, and 9 different cluster groups
based on the selected measure data. Also change the number of
bins from 16 to 10 and 4. After exploring these combinations,
decide which set of clusters may produce appropriate insights
on the data.
3.
Using your selected number of clusters and bins, use SAS VA’s
two cluster visualizations to create profiles for three different
groups of the data.
Summary
Clustering techniques enable a natural exploration of data by creating
groups of objects or segments of people to discover patterns and
similarities across clusters. These groupings, in turn, can be used by
firms to customize content, advertisements, services, products, and
other offerings, to create higher value for customers. With the level of
competition enabled through digital platforms and digital innovation,
firms need to provide greater customization to consumers. In many
applications, firms and analysts have a target outcome that they are
Further reading
Arora, P. & Varshney, S. (2016). Analysis of k-means and k-medoids algorithm for big data.
Procedia Computer Science, 78: 507–512.
Hofstede, G. (2011). Dimensionalizing cultures: The Hofstede model in context. Online Readings in
Psychology and Culture, 2(1), 8.
Kogan, J. (2007). Introduction to clustering large and high-dimensional data. Cambridge
University Press, New York.
Otto, C., Wang, D., & Jain, A. K. (2018). Clustering millions of faces by identity. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 40(2): 289–303.
Punj, G. & Stewart, D. W. (1983). Cluster analysis in marketing research: Review and suggestions
for application. Journal of Marketing Research, 134–148.
Sarstedt, M. & Mooi, E. (2014). Cluster analysis. In A concise guide to market research (pp. 273–
324). Springer, Berlin, Germany.
Wedel, M. & Kamakura, W. A. (2012). Market segmentation: Conceptual and methodological
foundations (Vol. 8). Springer Science & Business Media, New York.
Wong, N. Y. & Ahuvia, A. C. (1998). Personal taste and family face: Luxury consumption in
Confucian and Western societies. Psychology & Marketing, 15(5): 423–441.
Footnotes
1 Another common technique is dimension reduction, in which variables (columns) are grouped
to form higher level constructs, as is the case with principal components analysis. This is different
from clustering, in which rows (cases) are grouped together to form groups of cases that share
some common traits.
Felix Tan
Email: f.tan@unsw.edu.au
Introduction
Analytics is, fundamentally, concerned with making better decisions.
There has been a shift from using analytics to describe and understand
the business in its current form to using analytics for prediction and
even the reimagining of the business. The predictions that can be made
using analytics are broad and various. For example, we might be
interested in predicting the outcome of an election, foreseeing criminal
activity, estimating the cost of a storm or other natural disaster,
predicting which customers will churn, identifying fraudulent
transactions, identifying individuals most likely to donate to charity, or
targeting specific types of customers with advertising and special offers.
The business benefits of analytics can include identifying and
acquiring new customers, upselling or enhancing the relationship with
existing customers, retaining profitable customers, gaining a competitive
advantage by identifying new market opportunities, and finding patterns
in data that alert you to potential dangers and opportunities in the
environment.
Regardless of the context and the technique deployed, analytics
involves building a model based on data to detect patterns in that data in
order to make predictions that serve as the basis for action. Thus, we go
from real-world messy data to an abstraction (i.e., a model) that
produces generalizations that are the basis for creating specific
predictions.
Predictive models
However complex our models might be in terms of number of variables,
types of data, and statistical techniques, always remember that:

outcome_i = model + error_i

The outcome for observation i is given by the model plus some error
term for observation i. For example, assume that you recorded the marks
achieved by students in an exam and arrived at Table 6.1.
Table 6.1 Exam results (actual)
Case Exam mark (Y)
1 58