Richard Vidgen, Samuel N. Kirshner and Felix Tan

Business Analytics
A Management Approach

Richard Vidgen
Business School, University of New South Wales, Sydney, Australia
Samuel N. Kirshner
Business School, University of New South Wales, Sydney, Australia
Felix Tan
Business School, University of New South Wales, Sydney, Australia

ISBN 978-1-352-00725-1
e-ISBN 978-1-352-00726-8
https://doi.org/10.26777/978-1-352-00726-8

The registered company address is: The Campus, 4 Crinan Street,
London, N1 9XW, United Kingdom
A catalogue record for this book is available from the British Library.
Library of Congress Control Number: XXXXXXXX
Richard Vidgen, Sam Kirshner and Felix Tan have asserted their rights
to be identified as the authors of this work in accordance with the
Copyright, Designs and Patents Act 1988.
© Richard Vidgen, Sam Kirshner and Felix Tan, under exclusive licence
to Springer Nature Limited 2019
All rights reserved. No reproduction, copy or transmission of this
publication may be made without written permission. No portion of
this publication may be reproduced, copied or transmitted save with
written permission or in accordance with the provisions of the
Copyright, Designs and Patents Act 1988, or under the terms of any
licence permitting limited copying issued by the Copyright Licensing
Agency, Saffron House, 6–10 Kirby Street, London EC1N 8TS. Any
person who does any unauthorized act in relation to this publication
may be liable to criminal prosecution and civil claims for damages.
Cover illustration: 9781352007268
First published 2019 by
Red Globe Press
Red Globe Press in the UK is an imprint of Springer Nature Limited,
registered in England, company number 785998, of 4 Crinan Street,
London, N1 9XW.
Red Globe Press® is a registered trademark in the United States, the
United Kingdom, Europe and other countries.
ISBN 978-1-352-00725-1 hardback
ISBN 978-1-352-00726-8 ebook
 

Preface
The content of this book has been developed through teaching MBA,
undergraduate, and postgraduate courses on business analytics over
several years. While the book is targeted at an MBA and business
audience, we go reasonably deeply into data collection and exploration,
predictive modelling techniques, and data communication. This helps
managers gain insight into what data scientists actually do, to
understand the impact on the organization of analytics, and to focus on
how value can be created. While we do not expect managers to become
data scientists (although some do) we aim to equip them with some
basic skills in predictive modelling. Indeed, the introduction of
automated machine learning (AML) with DataRobot takes this to a new
level since one benefit of AML is that advanced data science techniques
become accessible to citizens and managers.
A further aim is to have all the software available via a web browser,
hence the choice of SAS Visual Analytics and DataRobot. This facilitates
distance-taught courses and avoids the installation and hosting issues
associated with software in universities and organizations more
generally. We also cover the programming language R, which, while
being installed locally, is open source and free to use, for those with
some familiarity with programming (or a willingness to learn).
SAS Visual Analytics can be accessed free of charge via Teradata
University Network (TUN) by students and is therefore freely
accessible for teaching. Students can gain access to DataRobot, subject
to their institution joining the DataRobot faculty programme.
There is a companion website for the book
(http://macmillanihe.com/vidgenbusiness-analytics) that contains
resources for students and instructors. In particular, the site contains
the datasets used in the book and further resources, such as the
accompanying R code. We intend to grow the online resources for this
book and welcome feedback in the form of contributions, suggestions
for improvements, and, of course, corrections.
We thank SAS for giving us permission to include screenshots of
their Visual Analytics product; IBM for permission to include
screenshots of Watson Analytics; DataRobot for giving us permission to
reproduce screenshots of their DataRobot software, and to include
selected extracts from their documentation; and NodeXL and Polinode
for permission to include screenshots of their social network analysis
packages.
 

Table of Contents
List of Boxes, Tables and Figures
Preface
Part I Business Analytics in Context
1. Introduction
2. Business Analytics Development
3. Data and Information
Part II Tools and Techniques
4. Data Exploration
5. Clustering and Segmentation
6. Predictive Modelling with Regression
7. Predictive Modelling with Logistic Regression
8. Predictive Modelling with Classification and Regression Trees
9. Visualization and Communication
10. Automated Machine Learning
11. R
12. Working with Unstructured Data
13. Social Networks
Part III Organizational Aspects
14. Business Analytics Development Methodology
15. Design and Agile Thinking
16. Ethical Aspects of Business Analytics
Appendices:
Appendix A – Dataset Descriptions
Appendix B – GoGet Case Study
Appendix C – Business Analytics Capability Assessment (BACA) Survey
Index
 

List of Figures and Tables

Figures

1.1 Business analytics in context (Vidgen 2014)
1.2 Open data available from the London Datastore (LDS) for ‘Crime and Community Safety’
1.3 The Internet of Things
1.4 Google Glass (https://www.varifocals.net/google-glass/)
1.5 A taxonomy of disciplines related to analytics (Mortenson et al. 2015)
1.6 Business analytics function
2.1 Core elements of a business analytics development function
2.2 Steps in the analytics process
2.3 Phases of the CRISP-DM reference model (Chapman et al. 2000, p.13)
2.4 An A/B test
2.5 An A/B test in the UK courts service (Haynes et al. 2012, p.10, fig. 5)
2.6 Artificial intelligence (AI), machine learning, and deep learning (reprinted from Chollet 2018, p.4, Copyright (2018) with permission from Manning Publications)
2.7 Data scientist attributes (Data Science Radar™, reprinted with permission from Mango Solutions 2019)
2.8 The DataRobot approach to automated machine learning (https://blog.datarobot.com/ai-simplified-what-is-automated-machine-learning)
2.9 Aligning the analytics development function
3.1 From data to wisdom
3.2 Farr’s analysis of mortality data (Farr 1885)
3.3 Farr’s analysis of cholera mortality data (Farr 1852)
3.4 Two movies compared
3.5 Data quality in context
3.6 Data quality in six dimensions
3.7 Normal distribution (mean = 0, sd = 1)
3.8 Exponential distribution
4.1 Anscombe’s quartet
4.2 Scatter plot showing the relationship between television, earnings and age for a small sample of the dataset
4.3 Heat map showing the relationship between television, earnings, and age for the entire dataset
4.4 The top of the SAS VA homepage window
4.5 Data Explorer window
4.6 Data options
4.7 Automatic chart
4.8 Properties of the automatic chart
4.9 Role tab options
4.10 Bar chart aggregated by the sum of each employee’s age
4.11 Change the aggregation on a bar chart
4.12 Bar chart aggregated by the average age of each employee
4.13 Bar chart of average age across job roles and gender
4.14 How to change properties of a graph so gender is grouped
4.15 Better bar chart of average age across job roles and gender
4.16 Data pane for the dataset country
4.17 Creating a hierarchy for the dataset country
4.18 Creating a custom category for the dataset country
4.19 Creating a new variable for the dataset country
4.20 Viewing the properties of measure data
4.21 Bar chart in SAS VA
4.22 Bar chart with grouping in SAS VA
4.23 Histogram in SAS VA
4.24 Line chart in SAS VA
4.25 Scatter chart in SAS VA
4.26 Bubble charts in SAS VA
4.27 Pie charts
4.28 Bar charts displaying the same information as the pie charts in Figure 4.6
4.29 Box plot showing outliers
4.30 Tree map
4.31 Heatmap
4.32 Geo map
4.33 Correlation matrix
4.34 Bar chart displaying the proportion of customers who are smokers
4.35 Histogram of the age variable
4.36 Setting a filter
4.37 Creating a new variable, age 2
4.38 Histogram of BMI
4.39 Bar chart visualization showing charges by region and sex
4.40 Bar chart visualization showing average charges by region and smoker
4.41 Bar chart visualization showing average charges by region, whether the charge is from a smoker and whether BMI is over or under 30
4.42 Line chart visualization showing average charges by age, whether the charge was made by a smoker, and whether BMI is over or under 30
4.43 Nested if statements
4.44 BMI and smoker grouped by age
4.45 Bubble chart of BMI and smoker
4.46 Bubble chart grouped by male and female
5.1 Clustering Mario Kart characters
5.2 Example of a dendrogram for hierarchical clustering
5.3 Example of k-means clustering
5.4 Individuality of countries in the dataset (higher scores represent greater individualism and lower scores represent more collectivist societies)
5.5 Default clustering of the Hofstede dataset
5.6 Cluster matrix for all six dimensions
5.7 Parallel coordinate plot for three clusters
5.8 Geo map of cultural clusters (based on three cluster groups)
5.9 Geo map cultural clusters (based on ten cluster groups)
6.1 Graph of exam marks – actual versus predicted (mean)
6.2 Scatter plot of hours of revision against exam mark with a fitted regression line
6.3 Scatter plot of hours of revision against exam mark with a fitted regression line and error terms
6.4 Creating a simple linear regression model in SAS VA
6.5 Linear regression model results in SAS VA
6.6 Multiple regression visualization produced in SAS VA
6.7 Residuals (scatter plot)
6.8 Residuals (histogram)
6.9 Residual plot – identifying outliers
6.10 Influence plot
6.11 Kitchen quality as a single, categorical predictor of sale price
6.12 Creating an interaction effect
6.13 Setting the variable selection parameter
6.14 House sale price model (variable selection = 0.01)
6.15 House sale price model – variables included
7.1 Online calculator of a natural logarithm for value 3 (http://www.1728.org/logrithm.htm)
7.2 Online calculator of a natural anti-logarithm (http://www.1728.org/logrithm.htm)
7.3 The logistic function
7.4 Expressing logit as a probability
7.5 Setting the response variable for logistic regression
7.6 Setting the response event
7.7 Setting properties of the analysis
7.8 SAS VA logistic regression results
7.9 SAS VA logistic regression fit summary
7.10 SAS VA logistic regression assessment – misclassification
7.11 SAS VA logistic regression assessment – lift
7.12 SAS VA logistic regression assessment – ROC
7.13 SAS VA logistic regression assessment – inspection of residuals
7.14 SAS VA logistic regression assessment – residuals
7.15 SAS VA generalized linear model (GLM) applied to logistic regression
7.16 SAS VA GLM model results
8.1 An illustration of a decision tree
8.2 Creating a SAS VA decision tree with Sex as predictor
8.3 Setting the event level to ‘Survived’
8.4 SAS VA decision tree model with Sex as a single predictor
8.5 SAS VA decision tree model with Sex and Age as predictors
8.6 Entropy graph
8.7 SAS VA decision tree variables and growth strategy
8.8 SAS VA decision tree
8.9 SAS VA decision tree model performance
8.10 SAS VA decision tree model performance – misclassification
8.11 SAS VA decision tree model advanced growth strategy
8.12 SAS VA decision tree model advanced growth strategy
8.13 SAS VA decision tree model custom growth strategy
8.14 Model comparison – selecting the models to be compared
8.15 Model comparison – logistic regression vs. decision tree
8.16 Decision tree with a continuous target
8.17 Decision tree with a continuous target
8.18 Variables used to predict house price (partial)
8.19 Model performance (ROC curve)
9.1 Example of a social network diagram
9.2 Unordered and ordered divergent colour spectrums
9.3 Sample idea illustration – publication process
9.4 Sample idea generation – brainstorming for health analytics (www.flickr.com/photos/juhansonin/3093096757)
9.5 Sample DataViz – a dashboard (commons.wikimedia.org/wiki/File:Opsview_Monitor_6.0_Dashboard.jpg)
9.6 Sample visual discovery – exploring countries’ wine by price and production quantity
9.7 Sample dashboard showing a report on sales execution
9.8 First bar chart in the sample report on sales execution
9.9 Two bar charts for the sample report on sales execution
9.10 First two bar charts with bullet gauges in the sample report on sales execution
9.11 Formatted bullet gauges in the sample report on sales execution
9.12 Sample report on sales execution with controls to filter data on Performance and non-auto firms with 100K or less revenue
9.13 Interaction view for the sample report on sales execution
9.14 Using hierarchies in the sample report on sales execution
9.15 Dashboard on sales execution in the Report Viewer
10.1 The DataRobot predictive modelling process
10.2 Creating a new project in DataRobot
10.3 Uploading data
10.4 Exploring the dataset
10.5 Data exploration (Fare)
10.6 Creating a new feature (Child) using transform
10.7 The new feature, Child, is shown as derived from Age
10.8 The Start button
10.9 The model repository (partial)
10.10 DataRobot at work
10.11 Feature importance
10.12 Histogram post Autopilot
10.13 The DataRobot leaderboard
10.14 Training, validation, and Holdout partitions
10.15 Data partitioning (source: DataRobot documentation)
10.16 Blueprint for the recommended model – eXtreme Gradient Boosted Trees Classifier (M85)
10.17 Blueprint for the most accurate model – Advanced AVG Blender model (M88)
10.18 Performance – lift chart
10.19 Performance – ROC (confusion matrix)
10.20 Performance – prediction distribution
10.21 Performance – ROC (KS and AUC)
10.22 Feature impact
10.23 Feature effects – categorical feature (Pclass)
10.24 Feature fit – continuous feature (Age)
10.25 Feature effect – continuous feature (Age)
10.26 Prediction explanations
10.27 Insights menu
10.28 Insights from text analysis – text mining
10.29 Insights from text analysis – Word Cloud
10.30 Leaderboard sorted by Holdout
10.31 Batch predictions
10.32 Houseprice data – continuous target
10.33 Leaderboard – sorted by Gamma deviance
10.34 Leaderboard – sorted by R-squared
10.35 Lift chart
10.36 Feature impact
10.37 Feature effects (OverallQual)
10.38 Partial dependence (OverallQual) rescaled
10.39 House price prediction explanations
10.40 Learning curve (houseprice)
10.41 Speed vs. accuracy (houseprice)
10.42 Model comparison (houseprice)
11.1 Installation of R (http://www.r-project.org)
11.2 Installation of RStudio (https://www.rstudio.com/)
11.3 The RStudio interface (RStudio is a trademark of RStudio, Inc)
11.4 Getting help in R – help(getwd) (RStudio is a trademark of RStudio, Inc)
11.5 Installing the package ‘psych’ in RStudio (RStudio is a trademark of RStudio, Inc)
11.6 Histogram for sales variable
11.7 Box plot for Press and Sales variables
11.8 QQ plot for sales variable
11.9 Scatter plot of TV and Sales
11.10 Enhanced scatter plot of TV and Sales using ggplot()
11.11 Scatter plot matrix
11.12 Enhanced scatter plot matrix using psych package
11.13 Residuals and normal quantile-quantile (Q-Q) plots
11.14 Checking for influential observations
11.15 R ROC curve with area under curve (AUC)
11.16 Decision tree
11.17 Variable importance (decision tree)
11.18 Variable importance (random forest)
12.1 Word cloud of Twitter data (produced in Voyant)
12.2 Histogram of sentiment scores for Twitter text @RealDonaldTrump (produced by the authors using R)
12.3 Testing for lycanthropy
12.4 Bayes’ theorem
12.5 Watson VR (https://www.ibm.com/cloud/watson-visual-recognition)
12.6 Watson VR
12.7 Text recognition
12.8 Zoomable map of actual food bank usage
12.9 London Ward Atlas (mortgage repossessions)
13.1 Network structures
13.2 Directed and undirected networks
13.3 Edge lists – directed and undirected networks
13.4 Matrix of a directed network
13.5 Matrix of an undirected network
13.6 Tie strength
13.7 Paths, distance, and geodesics
13.8 Network density
13.9 Network (diameter = 3, average geodesic = 1.81)
13.10 Network reciprocity (directed networks only)
13.11 Degree centrality
13.12 Closeness centrality
13.13 Betweenness centrality
13.14 Eigenvector centrality
13.15 Clustering – clique and k-core
13.16 Clustering – ‘group in a box’ (Rodrigues et al. 2011)
13.17 Network components
13.18 Social network analysis for fraud detection (CGI Group 2011)
13.19 Polinode network summary
13.20 Nodes coloured by cohort (X and Z) and sized by betweenness centrality
13.21 Nodes coloured by community and sized by in-degree
13.22 UNSW Business School Twitter account
13.23 @UNSWBusiness tweets retrieved and graphed using NodeXL
13.24 Graph metrics in NodeXL
13.25 Clustering the network in NodeXL
13.26 MBA network produced using R, nodes sized by in-degree
13.27 MBA network coloured by cohort, sized by in-degree
13.28 MBA network coloured by community cluster, sized by in-degree
14.1 Business analytics as a co-evolving ecosystem (reprinted from Vidgen et al. 2017, p.635, Copyright (2017), with permission from Elsevier)
14.2 Business analytics methodology (BAM) (reprinted from Hindle & Vidgen 2018, p.839, Copyright (2018), with permission from Elsevier)
14.3 BACA radar chart (Vidgen 2017)
14.4 Rich picture for GoGet car-sharing
14.5 Business model canvas (Osterwalder & Pigneur 2010 and Strategyzer.com)
14.6 Business model canvas for a budget airline (Strategyzer 2013; reprinted with permission of Strategyzer.com)
14.7 Business model canvas for GoGet
14.8 The business model canvas with a generic analytics overlay
14.9 Analytics leverage matrix
15.1 Steps of design thinking (Doorley et al., 2018; reprinted with permission from Stanford d.school)
15.2 Persona part A – out-of-town customer (template courtesy of Lucy Kimbell, Leeor Levy and University of the Arts London)
15.3 Persona part B – out-of-town customer (template courtesy of Lucy Kimbell, Leeor Levy and University of the Arts London)
15.4 Persona part A – fraudulent customer (template courtesy of Lucy Kimbell, Leeor Levy and University of the Arts London)
15.5 Persona part B – fraudulent customer (template courtesy of Lucy Kimbell, Leeor Levy and University of the Arts London)
15.6 Storyboard – reducing vehicle collision damage (template courtesy of Lucy Kimbell, Leeor Levy and University of the Arts London)
15.7 Example of analytics design thinking workshop outputs (Vidgen 2018)
15.8 Opportunity canvas
15.9 Opportunity canvas GoGet
15.10 The Agile Scrum framework (reprinted with permission from www.agileforall.com/resources/introduction-to-agile)
16.1 Spectrum of data, increasing in organizational involvement
16.2 Cycle of ethical decision-point activities (Davis 2012, p.46; reprinted with permission from O’Reilly Media)
16.3 Decision flow chart for publication of Tweets (Williams et al. 2017, p.1163)
B.1 Milestones in the Evolution of GoGet CarShare
B.2 GoGet member subscriptions
B.3 GoGet vehicle search
B.4 GoGet vehicle booking
B.5 GoGet organization chart
 
Tables

1.1 Delphi study rankings (reprinted from Vidgen et al. 2017, p.638, Copyright (2017), with permission from Elsevier)
2.1 Some common data science techniques, with business applications
2.2 Data scientist tasks (adapted from Suda 2017, p.46 with permission from O’Reilly Media)
2.3 Comparison of R, SAS, and DataRobot
6.1 Exam results (actual)
6.2 Predicted exam mark and error
6.3 Hours of revision (X) and exam mark (Y)
6.4 Predicted exam mark and error term
6.5 Variables in the advertising dataset (N = 250)
6.6 Overall ANOVA
6.7 Parameter estimates
6.8 Fit statistics
6.9 Calculation of R-square and F-value
6.10 Overall ANOVA
6.11 Parameter estimates
6.12 Fit statistics
6.13 Parameter estimates for kitchen quality
6.14 Estimates of sale price based on kitchen quality
6.15 Parameter estimates for a model with an interaction effect
7.1 Probabilities and odds
7.2 Natural logarithms for odds
7.3 Confusion matrix and analysis
7.4 SAS VA logistic regression details – parameter estimates
7.5 SAS VA logistic regression details – fit statistics
7.6 From logistic regression equation to case probabilities
7.7 Error distributions and link functions supported by SAS VA
7.8 GLM parameter estimates
8.1 SAS VA node statistics
8.2 SAS VA node rules
8.3 Contingency table for Sex
8.4 Decision table rules
8.5 Confusion matrix for the advanced strategy decision tree
9.1 Encodings, order, values, and types of data (Iliinsky 2013)
9.2 Typology for strategically designing visualization
10.1 Autopilot steps (for datasets where cross-validation is performed/allowed)
10.2 Pre-processing of features
10.3 Dummy variable coding versus One Hot encoding
10.4 Data for scoring
10.5 Scored data downloaded from DataRobot
12.1 Top 10 negative sentiment tweets (produced by the authors using R with Twitter data from 25 May 2016), extreme language redacted
12.2 Top 10 positive sentiment tweets (produced by the authors using R with Twitter data from 25 May 2016)
12.3 Word clouds and top words of selected topics produced using LDA (produced by the author using R with Twitter data from 25 May 2016)
12.4 Confusion matrix for lycanthropy
12.5 Likelihood table
13.1 Centrality measures (for further details see Cheliotis 2010; Disney 2014)
13.2 Network characteristics provided in NodeXL
13.3 Network and Twitter data provided in NodeXL (partial)
14.1 Root definition components
14.2 Front-office business analytics opportunities matrix for GoGet
14.3 Back-office business analytics opportunities matrix for GoGet
15.1 Paradigm shift from an analytics world to a business universe
16.1 Framework for big-data ethics (Davis 2012, p.3. Adapted with permission from O’Reilly Media)
16.2 Sample questions for inquiring into big-data values (Davis 2012, pp.47–48. Adapted with permission from O’Reilly Media)
A.1 The Titanic dataset
A.2 The house price dataset
A.3 The employee survey dataset
A.4 The countries dataset
A.5 The insurance dataset
A.6 The Hofstede dataset
A.7 The NBA dataset
A.8 The sale–win–loss dataset (source IBM: Watson)
A.9 The advertising dataset
C.1 The Business Analytics Capability Assessment Survey

 
 

Part I Business Analytics in Context 


 

© Richard Vidgen, Sam Kirshner and Felix Tan, under exclusive licence to Springer Nature
Limited 2019
R. Vidgen et al., Business Analytics
https://doi.org/10.26777/978-1-352-00726-8_1

1. Introduction
Richard Vidgen1  , Samuel N. Kirshner2  and Felix Tan3 
(1) Business School, University of New South Wales, Sydney, Australia
(2) Business School, University of New South Wales, Sydney, Australia
(3) Business School, University of New South Wales, Sydney, Australia
 
Richard Vidgen (Corresponding author)
Email: r.vidgen@unsw.edu.au
Samuel N. Kirshner
Email: s.kirshner@unsw.edu.au

Felix Tan
Email: f.tan@unsw.edu.au

Chapter Overview   In this chapter we discuss the importance of
business analytics in modern organizations and introduce an
analytics value creation framework that will provide a route map for
the book as a whole. The major elements of the framework are then
introduced and discussed: (1) Data: where does it come from and
how is it managed? (2) Analytics: what types of models can we build
and who builds them? (3) Organizational context: how do we
strategize and organize the business analytics function?

Learning Outcomes
After you have completed this chapter you should be able to:

Explain how business analytics is changing organizations,
governments, and the lives of citizens
Describe the sources of big data and the types of data used in
business analytics
Explain the different types of analytics models that are produced
by data scientists and how they can be deployed to create value
Describe the characteristics of a data scientist and their role in an
organization
Explain why business analytics is a strategic and cultural issue for
organizations.

Introduction
We are living in an age of the data deluge. Everywhere we go,
everything we say, everything we buy, leaves a digital trace that may be
recorded and stored. Consequently, there is much excitement – and
some trepidation – around big data and business analytics as
organizations of all types explore how they can use their data to create
(and protect) value. Data analytic methods are being used in many and
varied ways – for example, to predict consumer choices, to estimate the
likelihood of a medical condition, to detect political extremism in social
networks and social media, and to better manage traffic networks.
The opportunities opened up by big data and business analytics are
leading academics and practitioners to explore ‘how ubiquitous data
can generate new sources of value, as well as the routes through which
such value is manifest (mechanisms of value creation) and how this
value is apportioned among the parties and data contributors’ (George
et al. 2014, p.324).
McAfee and Brynjolfsson (2012) find that data-driven companies
are, on average, 5% more productive and 6% more profitable than their
competitors. However, becoming a data-driven organization is a
complex and significant challenge for managers: ‘Exploiting vast new
flows of information can radically improve your company’s
performance. But first you’ll have to change your decision-making
culture’ (p.61).
 

Exercise 1.1: Big data analytics in the workplace   Watch the
video ‘What is big data analytics?’ https://www.youtube.com/
watch?v=aeHqYLgZP84
Thinking about the organization for which you currently work
(this can also be one that you have worked for previously or your
current educational institution):
1. What barriers does your organization face in creating value from
data?
2. Who is championing business analytics in your organization?
Who should be?
3. How might big data analytics disrupt and reshape the industry
of which your organization is part?

A framework for business analytics
According to a succinct and widely adopted definition provided by
Davenport and Harris (2007), ‘business analytics’ is concerned with
‘the extensive use of data, statistical and quantitative analysis,
explanatory and predictive models, and fact-based management to
drive decisions and actions’ (p.7). A key aspect of this definition is that
analytics ultimately provides insight that is actionable. Other terms,
such as data mining, knowledge discovery, machine learning, artificial
intelligence (AI), and deep learning are commonly used in association
with business analytics. These latter terms typically describe
techniques deployed by analytics professionals (who may also be
referred to as data scientists), the people who build the explanatory
and predictive models that enable organizations to make better
decisions.
There is a distinct sense that this is all new and that ‘machine
learning algorithms were just invented last week and data was never
“big” until Google came along’ (O’Neil & Schutt 2013, p.2). However,
data science has been around for decades and, in the case of statistical
techniques, for centuries. For example, the development of probability
theory can be traced to Blaise Pascal (1623–1662) and Pierre de
Fermat (1601–1665), who laid out the fundamentals of probability
theory in response to gambling problems, such as calculating the
number of turns required to obtain a six in the roll of two dice. A
century later the Reverend Thomas Bayes (1701–1761) devised his
eponymous theorem to include the updating of beliefs using prior
probabilities – a method that is used in many machine learning
applications today to make predictions (e.g., the probability that this
email is spam or that this person has breast cancer). While the laws of
probability may be unchanged, at the same time, things really are
different – there is so much more data available and technology is so
much more powerful and accessible, and is getting cheaper every year.
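To make both ideas concrete, here is a minimal R sketch (ours, not from the original text); the dice calculation assumes the classic de Méré form of the problem (at least one double six in repeated throws of two dice), and the spam rates are purely illustrative:

# Pascal/Fermat: how many throws of two dice are needed before the
# chance of at least one double six exceeds 50%?
n <- 1:30
p_at_least_one <- 1 - (35/36)^n     # 35/36 = chance a single throw misses
min(n[p_at_least_one > 0.5])        # 25 throws (24 is not quite enough)

# Bayes' theorem: update the prior belief that an email is spam after
# seeing the word 'winner' (all rates here are invented for illustration)
prior_spam  <- 0.20                 # P(spam)
p_word_spam <- 0.40                 # P(word | spam)
p_word_ham  <- 0.02                 # P(word | not spam)
(p_word_spam * prior_spam) /
  (p_word_spam * prior_spam + p_word_ham * (1 - prior_spam))
# = 0.83, so the belief that the email is spam is revised sharply upwards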
Business analytics provides a larger framework for ensuring that
value is created from the application of data analytics – using models
built by data scientists – within an organization. Our aim is to use
business analytics as an organizing umbrella for creating value from a
major organizational asset – that is, data.
Figure 1.1 shows that data can come from many sources and, while
this data may be classified as ‘big’, it is not a requirement that it be so.
Indeed, many organizations can create significant business value from
relatively small volumes of data (which may not have been exploited
previously) to give insight into the organization’s customers, processes,
and competitive environment. The data must be captured, stored, and
managed and its quality assured. Analytics methods can then be
applied to the data in order to support better decision-making,
ultimately leading to the creation of business value.
In some cases the data an organization holds can be exploited not
only with improved decision-making, but also through the creation of
data products. A data product is a bundling of data and algorithms that
creates yet more data through being used. It is an economic engine that
derives value from data, creates more data, and produces more value as
a result. For example, the Fitbit technology collects data and creates yet
more data from that data to give insights that in turn influence human
behaviour, allowing questions such as ‘have I been sufficiently active
today?’ and ‘am I sleeping well?’ to be addressed.
Lastly, all of this analytics activity takes place in some organizational
context that is typified by cultural, social, ethical, political, and
economic dimensions. We will now look at the different parts of Figure
1.1 to provide managers with an overview understanding of business
analytics and its context.

Figure 1.1   Business analytics in context (Vidgen 2014)

Data sources
For an organization, data can be acquired from internal, external, and
open platforms. Internal data will typically be sourced from enterprise
systems and e-commerce applications. External data can be acquired
from third parties, for example credit scores, and from the Internet via
social media platforms.
Open data is data made freely available by other organizations, such
as governments (e.g., Census data). More and more data is being made
available by central and local government agencies. For example, the
London Datastore (LDS) ‘has been created by the Greater London
Authority (GLA) as a first step towards freeing London’s data. We want
everyone to be able [to] access the data that the GLA and other public
sector organizations hold, and to use that data however they see fit –
for free’ (London Datastore, n.d.). The LDS has 723 datasets available
for download (as of June 2018), covering many topics, including health,
employment, environment, and housing. For instance, there are
numerous datasets for ‘Crime and Community Safety’ (Figure 1.2).
Combining open data with an organization’s own data can provide
much greater richness and depth of insight and open up new
commercial opportunities (e.g., via third-party app development).
 

Figure 1.2   Open data available from the London Datastore (LDS) for ‘Crime and Community Safety’
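As a minimal sketch of how this combining of open and internal data works in practice (the URL and column names below are hypothetical, for illustration only), base R can read a CSV file directly from an open data portal's download link and join it to internal data:

# Open data: borough-level crime counts (hypothetical URL and columns)
open_crime <- read.csv("https://example.org/open-data/borough_crime.csv")

# Internal data: customers by borough (illustrative)
customers <- data.frame(borough = c("Camden", "Hackney"),
                        revenue = c(120500, 98700))

# Enrich the internal view by joining on the shared key
enriched <- merge(customers, open_crime, by = "borough")
head(enriched)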

Data generators
Data is being generated through developments including the Internet of
Things (IoT), ubiquitous computing, and social media.
Internet of Things (IoT)
Although the concept was not named until 1999, the IoT has been in
development for decades. The first Internet appliance was a Coke
machine at Carnegie Mellon University in the early 1980s. The
programmers could connect to the machine over the Internet, check the
status of the machine, and determine whether there would be a cold
drink awaiting them, should they decide to make the trip down to the
machine (Teicher 2018).
The IoT is a scenario in which objects, animals, or people are
provided with unique identifiers that facilitate the automatic transfer of
data over a network without requiring human-to-human or human-to-
computer interaction (Figure 1.3). So far, the IoT has been most closely
associated with machine-to-machine (M2M) communication in
manufacturing and power, oil, and gas utilities. Products built with
M2M communication capabilities are often referred to as being ‘smart’
(e.g., a smart utility meter).

Figure 1.3   The Internet of Things

A ‘thing’, in the IoT, can be:

A person with a heart monitor implant (physio sensing)
A person with a brain scanner (neuro sensing)
A farm animal with a biochip transponder
An automobile that has built-in sensors to alert the driver when tyre
pressure is low
Any other natural or man-made object that can be assigned an IP
address (a unique identifier for a device connected to the Internet)
and have data transferred from it over a network.
The former UK prime minister David Cameron announced funding
for technology firms working on the IoT, saying it represents a new
‘industrial revolution’: ‘I see the internet of things as a huge
transformative development – a way of boosting productivity, of
keeping us healthier, making transport more efficient, reducing energy
needs, tackling climate change’ (BBC News 2014; UK Government Chief
Scientific Adviser 2014). At the Internet of Things Summit in 2015, the
Australian prime minister, Malcolm Turnbull, recognized the potential
of the IoT, arguing that it ‘presents huge opportunities for city
management, traffic control, homes, health prevention and treatment,
agriculture, power and water efficiency, to name a few’. He goes on to
illustrate this: ‘Australian water utilities currently spend $1.4 billion per
annum on reactive repairs and maintenance, including the consequence
cost of social and economic impact. Focusing the asset maintenance
efforts on preventative rather than reactive repairs has the potential to
save the water industry $355 million’ (Turnbull 2015).

Exercise 1.2: Internet of Things (IoT) implementation by the
Hamburg Port Authority (HPA)
Watch the video ‘The Internet of Things changes the game for
Hamburg Port Authority’ https://www.youtube.com/watch?v=
V7wzmjPbDik and consider the following questions:
1. What business benefits has the HPA realized through the IoT?
2. How have other stakeholders in the HPA been affected by the
IoT implementation?
3. Thinking about the organization for which you currently work
(this can also be one that you have worked for previously or
your current educational institution):
(a) Is your organization using the IoT? If so, how?
(b) What potential use cases can you see for the IoT in your
organization? How would business value be created from these
use cases?

Ubiquitous computing
Closely allied to the IoT is ubiquitous computing. Ubiquitous means
‘existing everywhere’ – ubiquitous computing devices are completely
connected and constantly available. Ubiquitous computing relies on the
convergence of wireless technologies, advanced electronics, and the
Internet. The goal of researchers working in ubiquitous computing is to
create smart products that communicate unobtrusively, particularly
wearable computers such as Google Glass (Figure 1.4) and the Fitbit.

Figure 1.4   Google Glass (https://www.varifocals.net/google-glass/)

Social media
The Internet analytics company Hootsuite (2015) categorizes social
media into eight archetypes (while recognizing that these boundaries
are fluid and that a social media site may fit under multiple headings):

Relationship networks, such as Facebook, Twitter, and LinkedIn
Media sharing networks, such as Flickr and Instagram
Online reviews, such as TripAdvisor
Discussion forums, such as Reddit and Digg
Social publishing platforms, such as WordPress and Tumblr
Bookmarking sites, such as StumbleUpon and Pinterest
Interest-based networks, such as Last.fm and Goodreads
e-commerce sites, such as Amazon and eBay
Social media sites create large volumes of data of which
organizations need to be aware. For example, what is being said about 
your company and its products and services on Twitter? Is there a
Facebook user group for your products and services? What images,
blogs, or reviews are being posted? How are your products doing on e-
commerce sites and what prices do they achieve on the second-hand
market? Combining social media data with internal organizational data
can give a deeper and richer insight into customers, customer
behaviours, products, and competitors.
Social media is also an important source of network data as social
media users form connections – for example, by friending, following,
tagging photos, adding someone to an instant messenger (IM) list,
sending a message, favouriting, posting a comment, sending a poke,
being in the same group, or editing the same Wiki page. Social media
users leave behind traces that form an intricate web connecting them
with the people, locations, and digital objects around them. This data,
for example, allows organizations to identify customer groupings and
opinion formers. Analytics professionals can often access social media
data using application program interfaces (APIs) supplied by the
platform owner; where an API is not available then an organization
might engage in ‘screen-scraping’, that is, programmatically accessing
web pages and formatting the content into usable data. A word of 
caution, though: screen-scraping might contravene the terms and
conditions of the platform owner, so check first.
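As a flavour of what screen-scraping involves, here is a minimal sketch using the rvest package for R; the URL and CSS selectors are hypothetical, and an API should always be preferred where one is available:

library(rvest)

# Read a (hypothetical) product review page and pull out its reviews;
# always check the site's terms and conditions before scraping
page <- read_html("https://example.org/products/reviews")
reviews <- data.frame(
  text   = html_text2(html_elements(page, ".review-body")),
  rating = html_text2(html_elements(page, ".review-stars"))
)
head(reviews)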
Big data
The emergence of new data generators such as the IoT has led to an
explosion in the volume of data being captured and stored by
organizations. While business analytics can be applied to any scale of
data to extract value, the term ‘big data’ has been coined to reflect
changes in the characteristics of data (Zikopoulos et al. 2012). IBM
identifies ‘The four V’s of big data’:

volume – increasing amounts of data over traditional settings
velocity – information generated at a rate that exceeds that of
traditional systems
variety – multiple emerging forms of data (structured and
unstructured), such as text, social media data, and video
veracity – the trustworthiness and quality of the data.
(IBM 2016)
A ifth V is often added – value. Having large volumes of data is all
very well but can it be turned into value? In Figure 1.1 1.1 we
 we show value
as an end product that is generated through actionable insight leading
to improved
improved decision-making, that is, value is not embedded in data per
se, but it can be extracted through the business analytics function.
(2017) argues
Thamm (2017)  argues that the term ‘big data’ is now redundant,
having ‘emerged in a time when it was becoming more and more
dificult to process the exponentially growing volume of data with the
hardware
hardw are available at the time’
time’.. Successful business analytics projects
do not necessarily depend on access to big data; more important is
having relevant
relevant data for building and evaluating models. Collecting data
dat a
for its own sake can lead to expensive data warehouses and data lakes
(see Exhibit 1.1) that ultimately deliv
deliver
er little business value.

Exhibit 1.1: Data warehouses and data lakes
Amazon defines a data lake as ‘a centralized repository that
allows you to store all your structured and unstructured data at any
scale. You can store your data as-is, without having to first structure
the data, and run different types of analytics – from dashboards and
visualizations to big data processing, real-time analytics, and
machine learning to guide better decisions.’ https://aws.amazon.
com/big-data/datalakes-and-analytics/what-is-a-data-lake/
Wikipedia defines data warehouses as ‘central repositories of
integrated data from one or more disparate sources. They store
current and historical data in one single place that are used for
creating analytical reports for workers throughout the enterprise.’
https://en.wikipedia.org/wiki/Data_warehouse
Campbell (2015) elaborates on this definition, arguing that a data
warehouse represents an abstracted picture of the business
organized by subject area, and is highly transformed and structured.
Data is not loaded to the data warehouse until the use for it has been
defined, and in building a data warehouse we generally follow a
methodology.
Data lakes versus data warehouses
Campbell (2015) distinguishes data lakes from data warehouses.
Data lakes retain all data (not just data identified as being useful for
a particular purpose), support all data types (e.g., text and video),
support all users (including operational users and data scientists),
adapt easily to changes, and, as a result, provide faster insights.
While data lakes sound like the best option, they come at a price
(e.g., software, servers, and data management needed to build and
maintain a data lake). And, operational users who simply want
reports and KPIs might not want to work with the unstructured raw
data in a data lake and so might be better served by a structured
data warehouse.

Exercise 1.3: Data use in your organization   Watch the video
‘Explaining big data’ https://youtu.be/7D1CQ_LOizA and consider:
1. Why can’t traditional computing platforms cope with big data?
2. Thinking about the organization for which you currently work
(this can also be one that you have worked for previously or your
current educational institution), how has your organization’s data
collection, storage, and use been affected by the four V’s of big data?

Data management
The cloud
As data becomes more varied and less structured, with greater volume
and velocity, the challenge is to be able to capture and process it quickly
enough to meet business needs (which, in the case of credit card
transaction processing, may be measured in milliseconds). Traditional
databases and technologies have struggled to keep up with the
challenge of big data, and new architectures that can scale up to the
volume, velocity, and variety of today’s data have emerged: ‘Big Data is
fundamentally about massively distributed architectures and massively
parallel processing using commodity building blocks to manage and
analyze data’ (EMC 2012, p.7). This is typically achieved through cloud
technologies.
Cloud computing is a general term for anything that involves
delivering hosted services over the Internet. A cloud service has three
distinct characteristics that differentiate it from traditional hosting:

It is sold on demand, typically by the minute or the hour (and
therefore may be cheaper compared with the full costs of running an
in-house IT service)
It is elastic – a user can have as much or as little of a service as they
want at any given time (and therefore more flexible)
The service is fully managed by the provider – the consumer needs
nothing but a personal computer and Internet access (and therefore
more reliable).

Cloud services come in different flavours – see Exhibit 1.2.

Exhibit 1.2: Types of cloud services   Cloud applications or
Software-as-a-Service (SaaS)
‘[In this model] the vendor supplies the hardware infrastructure,
the software product and interacts with the user through a front-end
portal. SaaS is a very broad market. Services can be anything from
Web-based email to inventory control and database processing.
Because the service provider hosts both the application and the data,
the end user is free to use the service from anywhere.’ SaaS is
typically used for applications such as email (e.g., Gmail), customer
relationship management (e.g., Salesforce), expenses management
(e.g., Concur), and collaboration (e.g., GoToMeeting).
Cloud platforms or Platform-as-a-Service (PaaS)
‘[This is] defined as a set of software and product development
tools hosted on the provider’s infrastructure. Developers create
applications on the provider’s platform over the Internet. PaaS
providers may use APIs (application program interfaces), website
portals or gateway software installed on the customer’s computer.
Force.com (an outgrowth of Salesforce.com) and GoogleApps are
examples of PaaS. Developers need to know that currently, there are
[no] standards for interoperability or data portability in the cloud.
Some providers will not allow software created by their customers
to be moved off the provider’s platform.’ PaaS is typically used by an
organization’s software developers to create scalable applications
quickly (e.g., using the Apprenda platform).
Cloud infrastructure or Infrastructure-as-a-Service (IaaS)
‘Infrastructure-as-a-Service like Amazon Web Services provides
virtual server instances with unique IP addresses and blocks of
storage on demand. Customers use the provider’s application
program interface (API) to start, stop, access and configure their
virtual servers and storage. In the enterprise, cloud computing
allows a company to pay for only as much capacity as is needed, and
bring more online as soon as required. Because this pay-for-what-
you-use model resembles the way electricity, fuel and water are
consumed; it’s sometimes referred to as utility computing.’ With
IaaS, users are responsible for managing their applications, data,
middleware, and operating systems. In other words, they are still
running IT operations, but they no longer need to purchase
hardware outright. Instead they pay for it on a consumption basis.
Source: IT Knowledge Portal 2016; Apprenda 2016.

A cloud can be private or public. A public cloud sells services to
anyone on the Internet. (Currently, Amazon Web Services is the largest
public cloud provider.) A private cloud is a proprietary network or a
data centre that supplies hosted services to a limited number of people.
When a service provider uses public cloud resources to create its
private cloud, the result is called a virtual private cloud. Private or
public, the goal of cloud computing is to provide easy, scalable access to
computing resources and IT services.
Big data technologies
Big data (i.e., data with volume, velocity, and variety) has created a
technical challenge for organizations in their operational and analytical
systems. Operational applications have to cope with real-time,
interactive workloads where primary data is captured and stored.
Analytical applications have to provide analytical capabilities for
retrospective, complex analysis that may need to access most or all of
the data. These classes of technology are complementary and
frequently deployed together (MongoDB 2016). See Exhibit 1.3 for
further details of big data technologies.
further details of big data technologies.

Exhibit 1.3: Big data technologies   Operational big data
‘NoSQL technologies, which were developed to address the
shortcomings of relational databases in the modern computing
environment, are faster and scale much more quickly and
inexpensively than relational databases.
‘Critically, NoSQL Big Data systems are designed to take
advantage of new cloud computing architectures that have emerged
over the past decade to allow massive computations to be run
inexpensively and efficiently. This makes operational Big Data
workloads much easier to manage, and cheaper and faster to
implement.’
Analytical big data
‘Analytical Big Data workloads, on the other hand, tend to be
addressed by MPP (massively parallel processing) database systems
and MapReduce. These technologies are also a reaction to the
limitations of traditional relational databases and their lack of ability
to scale beyond the resources of a single server. Furthermore,
MapReduce provides a new method of analyzing data that is
complementary to the capabilities provided by SQL.
‘As applications gain traction and their users generate increasing
volumes of data, there are a number of retrospective analytical
workloads that provide real value to the business. Where these
workloads involve algorithms that are more sophisticated than
simple aggregation, MapReduce has emerged as the first choice for
Big Data analytics. Some NoSQL systems provide native MapReduce
functionality that allows for analytics to be performed on
operational data in place. Alternately, data can be copied from
NoSQL systems into analytical systems such as Hadoop for
MapReduce.’
Source: MongoDB 2016.
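To see what the MapReduce pattern amounts to, here is a minimal word-count sketch (ours, not from the MongoDB source) written in base R; a real Hadoop job distributes exactly these map, shuffle, and reduce stages across many servers:

docs <- c("big data is big", "data lakes store raw data")

# Map: each document emits a (word, 1) pair for every word it contains
words <- unlist(lapply(docs, function(d) strsplit(d, " ")[[1]]))

# Shuffle: group the emitted pairs by key (the word)
groups <- split(rep(1, length(words)), words)

# Reduce: sum the counts within each group
sapply(groups, sum)   # big: 2, data: 3, is: 1, lakes: 1, ...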

Data quality
Redman (2008) argues that care of data and information boils down to
data quality. Organizations should ‘correctly create or otherwise obtain
the data and information they really need, correctly, the first time’ (p.3).
Data should be easy to find, access, and use, such that people have the
confidence and trust to employ the data in powerful ways.
Organizations also have a responsibility to protect their data and to
prevent it being used in inappropriate ways. We will look at data
quality issues more closely in later chapters. Much of the data
scientist’s time is spent in ‘cleaning’ data to prepare it for use in
predictive models. The better the quality of the source data then the
less time will be needed for data cleaning (e.g., estimating missing
values).
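As a small taste of that cleaning work, the following base R sketch (with invented data) finds missing values and fills them in with a simple mean imputation; in practice the imputation method should be justified variable by variable:

customers <- data.frame(age    = c(34, NA, 52, 41),
                        income = c(58000, 72000, NA, 61000))

colSums(is.na(customers))   # count the missing values in each column

# Replace each missing value with the column mean (a simple first pass)
for (col in names(customers)) {
  miss <- is.na(customers[[col]])
  customers[[col]][miss] <- mean(customers[[col]], na.rm = TRUE)
}
customers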

Analytics
Models
The data science activity involves the construction of models that are
used to describe, predict, and prescribe.

Descriptive analytics: uses data visualizations and summaries to
make sense of data and to show what has already happened. For
example, we might produce a bar chart of sales by region or a report
of which customers have churned.
Predictive analytics: uses statistical models, forecasting methods,
and machine learning to show what could happen. For example, we
might build a model that predicts sales over time by product and
region or a model that predicts which of our customers are likely to
churn.
Prescriptive analytics: uses models, such as optimizations and
simulations, that give advice on possible outcomes and propose what
we should do. For example, the model might indicate a marketing
campaign to increase sales or an enhanced service package to
decrease the probability of customer churn.

The vast majority of reports in an organization are descriptive; they
tell us what has happened in the past (e.g., financial accounts,
management accounts showing variance from budget, inventory
reports, sales analyses). Because the tools are available and it is easy to
do, more and more descriptive analyses are being produced, with an
attendant risk of information overload.
Predictive analytics give us insight into the future. However, no
model can predict the future with 100% certainty; the results of a
predictive model are probabilistic (e.g., the probability that a customer
may churn).
Prescriptive analytics go further and attempt to recommend action
(or, indeed, may actually instigate action). For example, a customer
attrition model might advise as to which of several courses of action, or
combinations of action, should be taken with customers at risk of
churning.
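The distinction between describing and predicting is easy to see in R. In the following minimal sketch (simulated data; the variable names are illustrative rather than taken from the book's datasets), a frequency table describes churn that has already happened, while a logistic regression – a technique covered in Chapter 7 – estimates a churn probability for each customer:

set.seed(1)
n <- 500
customers <- data.frame(tenure     = round(runif(n, 1, 60)),
                        complaints = rpois(n, 1))
p <- plogis(-1 - 0.05 * customers$tenure + 0.9 * customers$complaints)
customers$churned <- rbinom(n, 1, p)

# Descriptive: what has already happened?
table(customers$churned)

# Predictive: estimate each customer's probability of churning
model <- glm(churned ~ tenure + complaints, data = customers,
             family = binomial)
customers$churn_prob <- predict(model, type = "response")
head(customers[order(-customers$churn_prob), ])   # most at-risk first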

Data scientists
According to Thomas Davenport, being a data scientist is the 'Sexiest job of the 21st century' (Davenport & Patil 2012). The Economist says that data scientists are the New Rock Stars and that they will continue to be in short supply. McKinsey (2011) forecasted that the USA would face a shortage of up to 190,000 data scientists by 2018. Glassdoor, an online job site, produces an annual report of the 50 best jobs in America. They calculate the ranking based on median annual base salary, job satisfaction rating, and number of job openings. Data scientist is the top job in the USA in 2016, 2017, and 2018 (Glassdoor 2018), scoring 4.2/5.0 for job satisfaction, a median base salary of USD 110,000, and an overall score of 4.8/5.0. Glassdoor also provides data for the UK, where data scientist ranks 17th in 2018 with a job satisfaction score of 3.6/5.0, a median base salary of GBP 45,000, and an overall score of 4.0/5.0. Although these are small datasets that may not be truly representative of the state of data scientist jobs in the USA and the UK, it is tempting to speculate that the difference may be due to the USA being more advanced in its use of analytics than the UK.
Data scientists need a range of skills and personal characteristics. They need (1) computer science skills (e.g., programming, AI), (2) quantitative skills (e.g., statistics), and (3) an appreciation of the decisions that will be made in a particular domain (Figure 1.5). What happens when one of the three dimensions is missing? When programming and statistics come together, the result is typically machine learning – fishing in large pools of data for patterns, building models that may work (e.g., using neural networks) but lacking sufficient insight into why they might work. The combination of programming and domain knowledge can be dangerous as the lack of statistical and mathematical expertise means that the models that are built may be wrong, unreliable, and lack validity – all contributing to poor decision-making. When statistics and maths are paired with domain knowledge we are in the realm of traditional research: many large organizations have had an 'operational research' department that applied mathematical and statistical models (e.g., discrete event simulation, optimization, linear programming, forecasting) to the business.

Figure 1.5   A taxonomy of disciplines related to analytics (Mortenson et al. 2015)

While different data scientists will have varying degrees of strength in each of the three areas, the team must be able to navigate all three, including the various intersections.

The organizational context
Fleming et al. (2018) identify ten reasons why analytics programs fail.
Five of these relate clearly to the organizational context in which
analytics is conducted:
The executive team not having a clear vision for its analytics
program.
Not identifying the value to be delivered by the initial analytics use
cases.
Lack of an analytics strategy.
The analytics function being isolated from the business.
Lack of oversight of potential ethical, social, and regulatory implications of analytics.
A large part of the organizational context is concerned with creating
a data culture:

When an organization's data mission is detached from business strategy and core operations, it should come as no surprise that the results of analytics initiatives may fail to meet expectations. But when excitement about data analytics infuses the entire organization, it becomes a source of energy and momentum. The technology, after all, is amazing. Imagine how far it can go with a culture to match.
(Díaz et al. 2018)
Business analytics strategy
The business analytics function takes place in the context of a focal organization, its network of partners, and broader socio-economic-political circumstances. In developing a business analytics strategy, we must consider the overall business strategy, business model, culture, people, technology, processes, and ethics of the organization. In the broader context there will be legal and regulatory factors to consider as well as a wide range of stakeholders, such as customers, suppliers, government, and the public.
A key aspect of business analytics is that it is part of an organizational change and transformation initiative to move the enterprise towards being data-driven – a situation where decisions are routinely taken on the basis of data and evidence rather than that of the 'Hippos' (highest paid person's opinion). Change of this order does not take place purely bottom-up and it certainly does not happen through technology investments alone – it needs senior management will and support.
Business analytics applications
While business analytics can be applied throughout the organization, crossing departmental and functional boundaries, there are some areas of application that are common to most organizations: marketing, human resources (HR), finance, and procurement:
Marketing analytics: can be used to create predictive models of customer behaviour, qualify and prioritize sales leads, bring the right products/services to market, target the right customers at the right time with the right content, and use predictive insights to drive marketing strategies (Fagella 2018);
HR analytics: key applications in HR analytics include modelling of employee churn rate (why are people leaving?), absenteeism (e.g., sick days), training effectiveness, revenue per employee, and employee engagement (Chrisos 2018);
Finance analytics: chief financial officers (CFOs) can use predictive models to improve operational efficiencies, to optimize tax, and to support long-term strategic planning (Dun & Bradstreet 2017);
Procurement analytics: chief procurement officers (CPOs) use analytics to control cost and improve efficiency of their supply chains, to manage financial risk of vendors, and to model their supplier spend (Dun & Bradstreet 2017).
The business analytics function
Organizations typically structure their business analytics function into data science, business analysis, IT operations, and data management (Figure 1.6). Data scientists develop predictive and prescriptive models. Business analysts work closely with the business to understand their requirements and to communicate the results of the data scientists' work. While data scientists may have well-developed IT skills, when it comes to making changes to live operational systems IT professionals are needed to ensure that the changes are tested and maintainable. They may also be needed to access data and to create data warehouses. The data management function is concerned with data definitions, data quality, data ownership, data access, and data governance. Even though the data management function might not be located within the analytics team, the analytics team will need to work closely with the data management function.

Figure 1.6   Business analytics function
 

Business analytics challenges


Big data and analytics are presenting organizations with challenges as well as opportunities. Datamation, a leading information management magazine, identifies seven challenges of big data (Harvey 2017):
1. Dealing with data growth. The amount of data to be stored is doubling every two years and much of that data is unstructured.
2. Generating insights in a timely manner. Data is only valuable if insights can be extracted and acted on in a timely manner.
3. Recruiting and retaining big data talent. Data engineers and data scientists are in short supply and salaries have risen accordingly.
4. Integrating disparate data sources. Data comes from many sources – enterprise systems, social media, email, etc. – and needs to be integrated for value to be extracted.
5. Validating data. Data needs to be of sufficient quality to be useful – there should not be conflicting data in different systems (e.g., customer address).
6. Securing data. Big data needs to be secured both in terms of legitimate users (what can they access, what can they change) and against malicious access by hackers.
7. Organizational resistance. According to a survey by NewVantage Partners (Davenport & Bean 2017) the great majority of organizations want to build a data-driven culture (85.5%) but fewer are successful in doing so (37.1%). One part of the organizational change process is to appoint a Chief Data Officer, who should report to the board.
Vidgen et al. (2017) used the Delphi technique to reach a consensus about the relative importance of the key challenges facing organizations in creating value from big data analytics. Delphi is an inductive and data-driven process and is a very efficient and effective way to canvass opinion from a large group of experts on a specific problem. Workshops were conducted to surface the challenges from a group of experts comprising practitioners, consultants, academics, and user representatives of organizations either considering the adoption of big data and predictive analytics or already on the journey. Thirty-one challenges were identified in the workshops (Table 1.1). These challenges were then subjected to a ranking exercise using an online survey. In the first round there were 72 respondents: 36 practitioners, 23 consultants, and 13 academics. The Delphi survey reached convergence in the second round.
Table 1.1  Delphi study rankings (reprinted from Vidgen et al. 2017, p.638, Copyright (2017), with permission from Elsevier)

1. Managing data quality: assuring data quality aspects, such as accuracy, data definitions, consistency, segmentation, timeliness, etc.
2. Using analytics for improved decision making: linking the analytics produced from big data with key decision making in the business
3. Creating a big data and analytics strategy: having a clear big data and analytics strategy that fits with the organisation's business strategy
4. Availability of data: the availability of appropriate data to support analytics (does the data exist?)
5. Building data skills in the organisation: the training and education required to upskill employees in general to utilise big data and analytics
6. Restrictions of existing IT platforms: existing IT platforms/architecture may make it difficult to migrate to and manage big data and analytics
7. Measuring customer value impact: can the real impact on the customer of managing big data be measured?
8. Analytics skills shortage: difficulty in acquiring the mathematical, statistical, visualisation skills for producing analytics
9. Establishing a business case: can 'tangible' benefits of big data be demonstrated (e.g., return on investment)?
10. Getting access to data sources: accessing appropriate data sources to produce and manage big data (can the data be accessed?)
11. Producing credible analytics: are the analytics produced from big data likely to be credible and trusted by the organisation?
12. Building a corporate data culture: e.g., are data and analytics taken seriously enough by the leaders at a strategic level in the business?
13. Making time available: will people have enough time to work with big data and analytics, over and above the 'day job'?
14. Managing data processes: managing the complexity of big data processes (e.g., generating, storing, cleaning data and producing analytics)
15. Technical skills shortage: difficulty in acquiring technical/IT skills for managing big data and operationalising analytics
16. Overcoming resistance to change: is there buy-in and engagement around the benefits of big data (the 'so what?')? Can barriers to change be overcome?
17. Managing and integrating data structures: data held in different business silos, systems, and segmented in various ways is difficult to structure for analysis
18. Managing data security and privacy: ensuring that data is stored securely, only available to intended recipients, and anonymised as needed
19. Data visualisation: ability to display and visualise the data to communicate insights clearly within the organisation
20. Managing data volume: does the organisation have effective ways (systems) for storing and managing large volumes of data
21. Data ownership: who owns the big data? Inside (e.g., which department) and outside of an organisation (e.g., Government, partners)
22. Managing costs: ability to manage the costs associated with big data
23. Defining the scope: difficulty in defining the scope of big data projects in the organisation (where does it start and stop?)
24. Defining what 'big' data is: difficulty in defining what 'big data' actually is
25. Securing investment: ability to secure the investment needed to build big data and analytics (infrastructure, skills, training, etc.)
26. Manipulating data: being able to process the data to produce analytic insight
27. Legislative and regulatory compliance: compliance with laws such as the Data Protection Act 1998/2003
28. Using the data ethically: using the data in an ethical way and ensuring all areas of the organisation are using it in acceptable ways
29. Performance management: ability to develop key indicators for big data and analytics performance reporting
30. Safeguarding reputation: e.g., reputation and brand damage caused by inappropriate use of data, data leakage, selling data
31. Working with academia: can the organisation build relationships and work effectively with academia?
 

The Delphi study identified 31 items (Table 1.1). The top five issues are (1) managing data quality, (2) using analytics for improved decision-making, (3) creating a big data and analytics strategy, (4) availability of data, and (5) building data skills in the organization.

Summary
Business analytics is a complex organizational field involving technology, data science, management, and organizational change (to processes and culture and possibly to business strategy). While managers might not need to know how big data technologies work and how the complex predictive models built by data scientists operate, they need to appreciate the management inputs required and the interconnection of these elements. Value creation is not solely the province of those organizations that have the 'biggest' data, the latest technologies, and the smartest data scientists. Success can be created from small data with technology that the organization is skilled at using (this might even include Microsoft's Excel) and a small project can demonstrate the business value of analytics and pave the way for further and more ambitious initiatives.
Whatever the circumstance, managers must be prepared to tackle the following questions:
Where does our data come from?
Do we have the right data?
Is our data of sufficient quality?
How well is our data managed?
What technologies are needed to collect, store, and make available our data?
How can data science/analytics be used to build models that lead to improved decision-making?
How can social media data be utilized?
What external and open data should we acquire to enrich our internal data?
What human resources do we need for business analytics?
Do we have an effective business analytics strategy?
Is our business analytics strategy aligned with our business strategy?
What are the organizational change and transformation implications of building a data-driven culture?
Ultimately, are we creating business value from our data?

References
Apprenda. (2016). IaaS, PaaS, SaaS (explained and compared), Apprenda (website). https://apprenda.com/library/paas/iaas-paas-saas-explained-compared/
BBC News. (2014). 'Internet of things' to get £45m funding boost, BBC News, 9 March. http://www.bbc.com/news/business-26504696
Campbell, C. (2015). Top five differences between data lakes and data warehouses. Blue Granite, 26 January. https://www.blue-granite.com/blog/bid/402596/top-five-differences-between-data-lakes-and-data-warehouses
Chrisos, M. (2018). 3 Benefits of analytics every HR manager should know. TechFunnel, 21 March. https://www.techfunnel.com/hr-tech/types-of-hr-analytics-every-manager-should-know/
Davenport, T. & Bean, R. (2017). Big Data Executive Survey 2017. NewVantage Partners. http://newvantage.com/wp-content/uploads/2017/01/Big-Data-Executive-Survey-2017-Executive-Summary.pdf
Davenport, T. & Harris, J. (2007). Competing on analytics: The new science of winning. Harvard Business Press, Cambridge, MA.
Davenport, T. & Patil, D. (2012). Data scientist: The sexiest job of the 21st century, Harvard Business Review, October: 70–76.
Díaz, A., Rowshankish, K., & Saleh, T. (2018). Why data culture matters. McKinsey Quarterly, September 2018.
Dun & Bradstreet. (2017). How Marketing, Procurement and Finance Departments Use Analytics. Dun & Bradstreet (website). 17 July. https://www.dnb.co.uk/perspectives/analytics/integrating-analytics-into-business-decisions.html
EMC. (2012). Big Data-as-a-Service: A market and technology perspective, White Paper. http://australia.emc.com/collateral/software/white-papers/h10839-big-data-as-a-service-perspt.pdf
Fagella, D. (2018). Predictive Analytics for Marketing – What's Possible and How it Works. Emerj. 29 November. https://emerj.com/ai-sector-overviews/predictive-analytics-for-marketing-whats-possible-and-how-it-works/
Fleming, O., Fountaine, T., Henke, N., & Saleh, T. (2018). Ten red flags signaling your analytics program will fail. McKinsey Quarterly, May 2018.
George, G., Haas, M., & Pentland, A. (2014). From the editors: Big data and management. Academy of Management Journal, 57(2): 321–332.
Glassdoor. (2018). 50 Best Jobs in America. https://www.glassdoor.com/List/Best-Jobs-in-America-LST_KQ0,20.htm
Harvey, C. (2017). Big Data Challenges. Datamation. 5 June. https://www.datamation.com/big-data/big-data-challenges.html
Hootsuite. (2015). 8 types of social media and how each can benefit your business, Hootsuite (blog), 12 March. https://blog.hootsuite.com/types-of-social-media/
IBM. (2016). The four V's of big data, Infographics & Animations, Big Data & Analytics Hub. IBM (website). http://www.ibmbigdatahub.com/infographic/four-vs-big-data
IT Knowledge Portal. (2016). Cloud computing. IT Knowledge Portal (website). http://itinfo.am/eng/cloud-computing/
LDS (London Datastore) (n.d.). About this website. LDS (website). http://data.london.gov.uk/about/
McAfee, A. & Brynjolfsson, E. (2012). Big data: The management revolution, Harvard Business Review, October: 61–68.
McKinsey Global Institute. (2011). Big data: The next frontier for innovation, competition, and productivity, May, McKinsey Global Institute (website). https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation
MongoDB. (2016). Big data explained, MongoDB (website). https://www.mongodb.com/big-data-explained
Mortenson, M., Doherty, N. & Robinson, S. (2015). Operational research from Taylorism to terabytes: A research agenda for the analytics age. European Journal of Operational Research, 241: 585–595.
O'Neil, C. & Schutt, R. (2013). Doing data science: Straight talk from the frontline. O'Reilly Media, Sebastopol, CA.
Redman, T. C. (2008). Data driven: Profiting from our most important business asset. Harvard Business School, Cambridge, MA.
Teicher, J. (2018). The little-known story of the first IoT device. IBM. https://www.ibm.com/blogs/industries/little-known-story-first-iot-device/
Thamm, A. (2017). Big Data is dead. Data is "Just Data," regardless of quantity, structure, or speed, LinkedIn. https://www.linkedin.com/pulse/big-data-dead-just-regardless-quantity-structure-speed-thamm/
Turnbull, M. (2015). Internet of things summit, Speech, Australian Government, Ministers for the Department of Communications and the Arts, 15 March.
UK Government Chief Scientific Adviser. (2014). The Internet of Things: Making the most of the second digital revolution. The Government Office for Science, United Kingdom.
Vidgen, R. (2014). Introduction to big data and data science. BigDataScience (website), 23 January. https://datasciencebusiness.wordpress.com/2014/01/23/introduction-to-big-data-and-data-science/
Vidgen, R., Shaw, S., & Grant, D. (2017). Management challenges in creating value from business analytics. European Journal of Operational Research, 261(2): 626–639.
Zikopoulos, P., Eaton, C., DeRoos, D., Deutsch, T., & Lapis, G. (2012). Understanding big data: Analytics for enterprise class Hadoop and streaming data. McGraw-Hill, New York.
 


2. Business Analytics Development 



Chapter Overview   In this chapter we consider what resources an organization needs to build a business analytics development function. To carry out business analytics development we need to consider three elements: (1) a methodology to guide the analytics process, (2) the data scientists who build models, and (3) a set of tools and techniques. Each of these aspects will be explored in greater depth in the chapter. As part of the discussion of toolsets, we introduce three analytics environments: DataRobot (an automated machine learning platform), the programming language R, and SAS Visual Analytics (SAS VA).

Learning Outcomes
After you have completed this chapter you should be able to:
Describe the steps in the process of analytics development
Describe the leading analytics methodologies
Identify common data science techniques used in business
Create a skills profile for an analytics team
Explain how A/B testing can be used to validate actions taken on the basis of analytics
Establish and implement a decision framework for analytics toolset selection.

Introduction
In establishing an effective business analytics development function an organization will need to consider the composition of its data scientist team, the tools and techniques to be deployed, and the methodology used to guide the analytics development process (Figure 2.1).

Figure 2.1   Core elements of a business analytics development function

Firstly, the data scientist team will require a mix of skills if it is to deal with the wide range of activities entailed in building analytics applications. For example, some data scientists might focus on building complex statistical models, while others concentrate on understanding and communicating the business requirements. Secondly, the data science team will need to be provided with tools to collect, explore, process, visualize, and model data. Thirdly, a methodology is needed to guide the data science team through the process of business analytics. What constitutes an appropriate analytics methodology for one organization will not necessarily be the same as that of another organization. In all three aspects (people, tools, methodology) we should avoid the trap of thinking that one size fits all. Further, the three aspects need to be in alignment. There is no point in hiring a data scientist versed in R and then asking them to use Microsoft Excel.

Exercise 2.1: Building an analytics development function
Watch the video 'How to create an effective data science department' https://youtu.be/9f-XXR9j6m8 and then consider the following questions:
1. What is the definition of 'data science' – as opposed to 'data' – given in the video? Is it reasonable to apply the scientific method to business?
2. Why are data visualizations and descriptive statistics (e.g., sales fell by 7% when it snowed) not data science?
3. Based on the video, what challenges do organizations face in building an analytics development function?
The analytics process
When a data scientist embarks on an analytics project, at the most basic level, they should follow the steps outlined in Figure 2.2.

Figure 2.2   Steps in the analytics process

1. Define the business objectives
An analytics project must address a business question. Therefore, the project should start with a well-defined business objective. Clearly stating that objective will allow the team to define the scope of the project and will provide them with a set of tests to measure the success of the project.

2. Collect data
The data is usually scattered across multiple sources: internal, external, and open. Collecting the data may involve a range of methods, such as SQL queries of corporate databases, searches of social media, web-scraping, and the inclusion of open and publicly available data, such as weather, crime, and social deprivation statistics. Assembling this data into a common and usable format constitutes a major part of any analytics project.

3. Prepare and explore data
Data may contain duplicate records and outliers; depending on the analysis and the business objective, the team has to decide whether to keep or remove them. Also, the data could have missing values (values for these may need to be imputed), may need to undergo some transformation (e.g., to make the distribution more closely normal), and may be used to generate derived attributes that will increase the predictive power of the models created (feature engineering). Assessing the quality of the data is a vital step in an analytics project. Indeed, collecting, cleaning, and assessing data can easily take up 80% of an analytics project's total time. Ultimately, the quality of the input data will impact on the quality of the model's outputs.
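A brief R sketch of these preparation steps, assuming a hypothetical data frame customers with columns income, amount, and signup_date:

# Remove exact duplicate records
customers <- unique(customers)

# Impute missing income values with the median (one simple strategy)
customers$income[is.na(customers$income)] <- median(customers$income, na.rm = TRUE)

# Transform a right-skewed variable so its distribution is closer to normal
customers$log_amount <- log1p(customers$amount)

# Feature engineering: derive a tenure attribute from the signup date
customers$tenure_days <- as.numeric(Sys.Date() - as.Date(customers$signup_date))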
4. Create training and test datasets
The data needs to be divided into two sets: training and test datasets. The model is built using the training dataset. The test dataset is used to verify the accuracy of the model's output. This is essential; otherwise there is a risk of 'overfitting' the model. Overfitting occurs when a model is trained with a limited dataset, to the extent that it picks up on all the characteristics (both the signal and the noise) that are only true for that particular dataset. A model that is overfitted for a specific dataset will perform poorly when it is run on other, previously unseen, data. Running the model on the test dataset will give an accurate assessment of the model's performance on a dataset that it has not seen previously. It is common to use 80% of the total dataset for training, with 20% held back for testing. Within the 80% training data further partitioning is done to create k-folds (e.g., five folds each containing 16% of the data) that can be used to cross-validate model performance based on its performance on the k-folds rather than simply assessing the model on its performance on the single training set containing 80% of the data.
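The split and the folds can be sketched in base R as follows, where df is a hypothetical data frame of model-ready cases:

set.seed(42)                                # make the random split reproducible
train_idx <- sample(nrow(df), size = round(0.8 * nrow(df)))
train <- df[train_idx, ]                    # 80% used for training
test  <- df[-train_idx, ]                   # 20% held back for testing

# Assign each training row to one of five folds for cross-validation
train$fold <- sample(rep(1:5, length.out = nrow(train)))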
5. Build and improve the model
Sometimes the data or the business objectives lend themselves to a specific algorithm or model. At other times the best approach is not so apparent. As part of the data exploration, relationships between variables should be explored and different algorithms run. The final model is then selected based on model performance. In some cases a range of approaches might be used simultaneously and then the 'best' model selected by comparing model output and predictive performance. For example, a model to predict customer churn might be built using logistic regression, support vector machines, decision trees, or neural networks. Which model is 'best' will depend on several factors, such as predictive performance, the time it takes the model to run (a model that takes three days to run to support a daily decision-making process will not be much help), and auditability (in some applications it is necessary to show how the model reached a decision).
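For instance, a logistic regression and a classification tree might be fitted to the train/test data sketched above and compared on test-set accuracy; the outcome churned (coded 0/1) and the predictors are hypothetical:

# Candidate 1: logistic regression
glm_fit  <- glm(churned ~ tenure_days + monthly_fee, data = train, family = binomial)
glm_prob <- predict(glm_fit, newdata = test, type = "response")
glm_acc  <- mean(as.numeric(glm_prob > 0.5) == test$churned)

# Candidate 2: classification tree
library(rpart)
tree_fit  <- rpart(factor(churned) ~ tenure_days + monthly_fee,
                   data = train, method = "class")
tree_pred <- predict(tree_fit, newdata = test, type = "class")
tree_acc  <- mean(tree_pred == factor(test$churned))

c(logistic = glm_acc, tree = tree_acc)   # compare before selecting the 'best' model

In practice the comparison would also use measures beyond accuracy (e.g., AUC) and would be run across the cross-validation folds rather than a single split.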
6. Deploy the model
Once the model has been built, it needs to be deployed if the business is to achieve benefits. Regardless of how well the model might work, it needs to be presented and communicated to business stakeholders in an understandable and convincing way if the business is to adopt the model. Model deployment may require coordination with other departments beyond the focal business department (e.g., a model built to support the sales department may have impact on order processing and inventory functions). Regardless, the IT department will likely need to be involved to implement and maintain the model in a production IT environment. Other areas of the business, such as compliance and risk, may also have a stake in the model's deployment. Once the model is deployed, its performance must be monitored and improved as necessary. The performance of most models decays over time and they will need to be retrained using new data and ultimately may need to be replaced altogether.

Analytics methodologies
While all analytics projects will need to address the six steps identified in Figure 2.2, these steps provide little guidance concerning how the steps will be accomplished. A methodology provides a framework that is used to structure, plan, and control the process of developing an analytics solution. Any organization embarking on analytics development will need a methodology as part of its governance regime. This methodology might be largely implicit ('that's the way we do it here') or it might be explicit but lightweight. Methodologies are commonplace in information systems development, with a focus today on lightweight and software-focused approaches typified by lean and agile software development (e.g., see Highsmith & Cockburn 2001). However, explicit methodologies are currently less prevalent in business analytics development, reflecting the culture and background of analytics and data science.
A review of the literature results in remarkably little concerning business analytics methodologies or data science methodologies, particularly those that address the organizational context and value creation. However, a notable exception is the area of data mining. A poll of 200 users of the KDnuggets website in 2014 (Piatetsky 2014) asked 'What main methodology are you using for your analytics, data mining, or data science projects?' and reported as follows (2007 percentages are shown in parentheses): 43% (42%) use CRISP-DM, 27.5% (19%) use their own methodology, 8.5% (13%) use the SAS Institute's SEMMA, and 7.5% (7.3%) use KDD. The remaining responses (covering 13.5% of respondents) include categories such as in-house methodology, non-domain specific approaches, and no methodology.
The CRISP-DM (Cross-Industry Standard Process for Data Mining) (Chapman et al. 2000) reference model, shown in Figure 2.3, consists of six phases: business understanding, data understanding, data preparation, modelling, evaluation, and deployment. As is evident, these six phases map closely to the six steps identified in Figure 2.2. The arrows show the most important dependencies between stages (although this sequence is not fixed) and the outer cycle reflects the ongoing nature of data analytics work. The business understanding phase is concerned with the project objectives and business requirements, which are then converted into an analytics problem definition and project plan. The data understanding phase is concerned with becoming familiar with the data, identifying data quality problems, discovering initial insights and finding interesting areas for making hypotheses. These two phases are reciprocally linked. Thus, the CRISP-DM reference model provides a more nuanced and real-world view of how analytics unfolds in practice than does the linear, six-step process shown in Figure 2.2.

Figure 2.3   Phases of the CRISP-DM reference model (Chapman et al. 2000, p.13)

Further methodologies for analytics include SEMMA and KDD. The SEMMA process was developed by the SAS Institute. The acronym SEMMA stands for 'Sample, Explore, Modify, Model and Assess' and refers to the process of conducting a data-mining project or analytics project. The KDD (Knowledge Discovery in Databases) process, as presented in Fayyad et al. (1996), consists of five stages: selection, pre-processing, transformation, data mining and interpretation/evaluation. The input to the KDD process is data and the output is knowledge. Given the discussion of data and knowledge in Chapter 2 we might argue that only humans can possess knowledge, although advances in artificial intelligence (AI) mean this might not necessarily hold in the longer term.
The KDD and SEMMA approaches are primarily data-driven and neither gives prominent attention to business context and business objectives. The CRISP-DM model takes greater account of the business context, breaking the business understanding phase into four tasks:
1. determine business objectives
2. assess situation
3. determine data-mining (analytics) goals
4. produce project plan.
 

The CRISP-DM model suggests that business objectives are couched in terms of business goals (e.g., a goal might be to retain customers) that can in turn be couched as business questions (e.g., will lower transaction fees reduce the number of customers who leave?). The outcomes from an analytics project should be assessed in business terms, ranging from the relatively objective (e.g., the reduction in customer churn) to the more subjective (e.g., to give richer insight into customer relationships).
It is clear from the CRISP-DM process that identifying business goals is viewed as an essential aspect of an analytics project. This view is further supported by Khabaza (2010), who proposes nine laws of data mining. Rule 1 ('Business Goals Law') argues that:
data mining is concerned with solving business problems and achieving business goals. Data mining is not primarily a technology; it is a process, which has one or more business objectives at its heart. Without a business objective (whether or not this is articulated), there is no data mining.
(Khabaza 2010)
Despite the high reported use of the CRISP-DM methodology, it appears it is no longer supported or in active development and has therefore not been developed to take account of more recent developments in big data and data science. Similarly, neither the SEMMA nor the KDD methodology appears to have been actively supported or developed in recent years. Regardless, the CRISP-DM model provides a well-articulated approach to analytics and offers a foundation and guide for an organization embarking on its analytics journey.

Evidence:
Evidence: A/B testing
Having deployed a model and impacted on the decision-making in a business process, how do we know if the policies and interventions based on an analytics model actually work? Randomized controlled trials (RCTs) are used extensively in medicine, economic development, and social policy, and are rapidly becoming the 'gold standard' in evaluation.
NICE, the UK's National Institute for Health and Care Excellence, defines an RCT as:
"A study in which a number of similar people are randomly assigned to 2 (or more) groups to test a specific drug, treatment or other intervention. One group (the experimental group) has the intervention being tested, the other (the comparison or control group) has an alternative intervention, a dummy intervention (placebo) or no intervention at all. The groups are followed up to see how effective the experimental intervention was. Outcomes are measured at specific times and any difference in response between the groups is assessed statistically. This method is also used to reduce bias." (https://www.nice.org.uk/glossary?letter=r)
For example, let's say that we have built a model that predicts which customers are most likely to churn. The model is validated by the auditors and we are satisfied that it performs well. On this basis, we might decide to allocate an account manager to those customers that have the highest probability of churning (i.e., an intervention is made on the basis of the predictions made by our model). Does the account manager intervention work? An RCT in the form of an A/B test might show that not only does the intervention fail to work, but it actually makes the situation worse (e.g., it is possible that the incidence of customer churn is higher for the group of customers who are given an account manager than for those customers not given an account manager). Thus, a model with high and proven predictive ability is no guarantee that effective interventions are made on the basis of that model.
While models are typically used to establish correlation, the A/B test helps us gain an insight into causation through the manipulation of an experimental condition while holding everything else constant (as best we can, at least). In the sciences this is reflected in the common mantra 'no causation without manipulation'. So, to find out whether our intervention to reduce customer churn is effective, we might run an A/B test (Figure 2.4).1

Figure 2.4   An A/B test

We start with a population, such as all the customers our model identifies as being at high risk of leaving (churning). The population of high-risk churners is allocated randomly to two groups: an experimental group and a control group. The experimental group will receive a treatment (intervention), such as being assigned an account manager. Both groups are then assessed and compared and the question is asked: does the intervention result in a significant reduction in customer churn? If it does not, then, although we have identified high-risk customers, our intervention has not been effective.
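In R, the comparison between the two groups reduces to a two-proportion test; the counts below are invented purely for illustration:

# Churn outcomes at the end of the trial (hypothetical counts)
churners <- c(120, 150)    # experimental (account manager), control
n_group  <- c(1000, 1000)  # customers randomly assigned to each group

# Is the difference in churn rates between the groups significant?
prop.test(churners, n_group)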
An A/B test can be used to trial multiple treatments. In the UK justice system the courts experimented with a range of treatments for encouraging repayments of fines (Haynes et al. 2012). One group was sent a standard text message, a second group a text message plus the amount due, a third group the text plus their name, and a fourth group the text plus their name and the amount due (Figure 2.5). A fifth group, the control group, received no text message. Individuals were assigned to each of the five groups at random. The results showed large improvements over the control group, with the best uplift achieved by sending a text message personalized with the individual's name.

Figure 2.5   An A/B test in the UK courts service (Haynes et al. 2012, p.10, fig. 5)
 

A/B tests are being used to improve business performance, particularly in the online world (Haynes et al. 2012):
Amazon and eBay test what works for driving purchases.
Wikipedia compared donation adverts with and without a picture of founder Jimmy Wales.
Netflix trialled a new service with four variants and four groups of 20,000 subscribers.
Delta Air Lines used an A/B test to improve website design in the flight booking process.
A/B testing is particularly powerful in online and e-commerce settings where different treatments can be served up easily to randomly selected groups to see what works best. It is not surprising that Internet behemoths such as Google deploy A/B tests on a regular basis and make experimental testing available to their customers via the Google Analytics platform.

Modelling techniques
Supervised and unsupervised learning
Having the output variable – that is, the thing we wish to predict – available is the hallmark of supervised learning. This is the most frequent scenario in analytics. The strength of this approach is that the training dataset contains the correct answers (e.g., which customers did actually churn?). In unsupervised learning, the output variable is not specified. While unsupervised learning is not used as frequently as supervised learning, it can be very useful for tasks such as customer segmentation, where we wish to establish segments based on how similar customers are to each other using all the customer features we have available (this process is called clustering and we will cover this in Chapter 5).
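As a small unsupervised sketch, k-means clustering (covered in Chapter 5) can produce such segments; the customers data frame and its columns are hypothetical:

# Segment customers into three clusters using scaled numeric features
features <- scale(customers[, c("income", "tenure_days", "monthly_fee")])
set.seed(1)
km <- kmeans(features, centers = 3)
customers$segment <- km$cluster   # cluster (segment) label for each customer

Note that no output variable is supplied: the algorithm groups customers purely on their similarity to each other.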
Regression and classification models
Regression and classification models are supervised techniques. They differ in the type of output they produce. Regression models are used to predict a numerical or quantitative value, such as the level of sales given a number of inputs (e.g., advertising, promotion, and press coverage). Classification models predict the class that a case belongs to, such as a person's gender, whether a customer will churn, or the socioeconomic class of a customer.
Deep learning
Over the last few years the term 'deep learning' has become popular. Deep learning can be thought of as a subset of machine learning in which the 'deep' refers to an AI model with multiple layers of representation. Machine learning is itself a subset of a broader class of applications – AI. The relationship of the three can be visualized as in Figure 2.6.

Figure 2.6   Artificial intelligence (AI), machine learning, and deep learning (reprinted from Chollet 2018, p.4, Copyright (2018) with permission from Manning Publications)

We summarize from Chollet (2018) to describe the three terms. AI has roots in the 1950s and tackled the challenge of building a machine that could think. Typically, the approach was to hard-code lots of rules, which worked well for well-defined problems, such as chess, but less well with the sorts of problems that humans are good at – such as recognizing images and speech and understanding and engaging in argumentation. Traditional approaches (also known as symbolic AI) take rules plus data as input and provide answers as the output. Machine learning works by taking data and answers as the inputs with the output being rules. Thus, a machine learning algorithm is trained through seeing lots of examples in which it detects patterns. Machine learning is empirical and is engineering-oriented rather than statistical. Deep learning is an approach in which multiple layers of data transformations are made to give increasingly meaningful representations of the data. In a layered and incremental approach inputs are turned into a final representation – for example, an image of a squiggly, handwritten digit can be turned into an output with a value in the range 0 through 9 using successive transformations of the image that get closer to the answer. Deep in this case refers to the layers of transformation of input to answer.
Deep learning has achieved much, despite only becoming a prominent technique in the early 2010s, with success in notoriously difficult application areas such as image and speech classification, digital assistants (e.g., Google Now and Alexa), autonomous driving, and natural language processing (Chollet 2018).
McKinsey's 'An executive's guide to AI' (Chui & McCarthy, n.d.) gives examples of business use cases for AI:
Diagnose health diseases from medical scans.
Understand customer brand perception and usage through images.
Detect a company logo in social media to better understand joint marketing opportunities (e.g., pairing of brands in one product).
Detect defective products on a production line through image analysis.
Davenport and Ronanki (2018) propose three types of AI for organizations: process automation (e.g., automatic email routing, reading contract documents to extract provisions), cognitive insight (e.g., to predict what a customer will buy, to detect insurance fraud), and cognitive engagement (e.g., intelligent agents in customer service, health treatment plans). For most organizations, AI and deep learning is an aspiration rather than a reality today. Davenport and Ronanki (2018) identify some of the challenges facing AI adoption:
It is hard to integrate cognitive projects with existing processes and systems.
Technologies and expertise are too expensive.
Managers do not understand cognitive technologies and how they work.
There is a shortage of people with expertise in the technology.
The technologies are immature.
The technologies have been oversold.
The issues identified by Davenport and Ronanki would apply to the adoption of any breakthrough technology. The problems of being overhyped are a particular concern – while deep learning can achieve near-human levels of performance in many areas (e.g., image recognition) it is in danger of being a victim of the type of hype that surrounded e-commerce in the 1990s and data science in the 2000s. Organizations should consequently tread carefully when investing in AI projects.

Exercise 2.2: Neural Networks
Watch the video 'Introduction to deep learning: What is deep learning?' https://www.youtube.com/watch?v=3cSjsTKtN9M&vl=en and then consider the following questions:
1. To what extent does a neural network work like the human brain?
2. What might be the danger of thinking of neural networks as working like a 'brain'?

Model-building techniques
There are many modelling techniques available to the data scientist, and the list continues to grow. Table 2.1 lists some modelling techniques commonly used in business analytics applications. While the detailed workings of each of these techniques is beyond the scope of this book, it is important to be aware of the armoury of techniques that data scientists might deploy in building a model. Indeed, an individual data scientist is unlikely to be familiar with, and competent in, all of the techniques in Table 2.1. The data scientist is as much engaged in bricolage (improvisation and tinkering) as they are in engineering and will learn about and use techniques on a case-by-case basis as the situation and the data take them.
Table 2.1   Some common data science techniques, with business applications

Unsupervised learning
k-means clustering: k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It is a common unsupervised learning approach to clustering data (e.g., customer segmentation).
Principal components analysis (PCA): PCA is a way of reducing the number of dimensions in a dataset. It is a useful aid to data visualization and exploration when there is a large number of variables to be analysed.

Supervised learning
Linear regression: the most common form of regression model. One or more input variables (continuous and categorical) are used to predict a continuous output; for example, what amount of charges is likely to be incurred for an individual's health insurance policy?
Logistic regression: logistic regression is used to make predictions in a dataset in which there are one or more independent variables that determine a dichotomous outcome (e.g., will this customer churn?). The binary model can be extended to a multinomial model to predict an output with more than two classes.
Artificial neural networks (ANNs) and deep learning: artificial neural networks are a family of models inspired by biological neural networks (the central nervous systems of animals, in particular the brain) which are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. Neural networks may be more effective than linear and logistic regression when the feature space is large. Training ANNs requires substantial computing resources but as computing has become cheaper and more available ANNs have become more popular. ANNs are a core part of 'deep learning'.
Support vector machines (SVMs): a support vector machine is a classifier formally defined by a separating hyperplane. Given labelled training data (supervised learning), the algorithm outputs an optimal hyperplane which is used to categorize new examples. SVMs are used in a wide range of prediction and classification applications.
Classification and regression trees (CARTs): classification and regression trees are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. CART models are used for regression and classification problems (e.g., to generate a decision tree for deciding whether to approve a loan to a bank customer).
Gradient boosting: the next step on from regression trees is gradient boosting, such as implemented in the XGBoost package.
Naive Bayes: the Naive Bayesian classifier is based on Bayes' theorem. The assumption of independence between predictors means it is easy to build and efficient to run and particularly useful for very large datasets (e.g., to identify whether an email is spam).
Bayesian networks: a Bayesian network is a probabilistic directed acyclic graphical model that represents a set of random variables and their conditional dependencies.
k-nearest neighbours (kNN): k-nearest neighbours is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). It is used to classify many types of data (e.g., to classify images, to diagnose breast cancer).
Association rules: association rules are created by analysing data for frequent if/then patterns and using the criteria support and confidence to identify the most important relationships. Association rules are useful for analysing and predicting customer behaviour (e.g., shopping basket data analysis).
Genetic algorithms: genetic algorithms (GAs) are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection and genetics. They represent an intelligent exploitation of a random search used to solve optimization problems (e.g., the travelling salesman problem).
Time-series analysis: time-series analysis comprises methods for analysing time-series data in order to extract meaningful statistics and other characteristics of the data. Time-series is often used to predict future values (e.g., sales) based on previously observed values.
Ensemble models: ensemble models combine the decisions from multiple models to improve the overall performance. This can be done, for example, by averaging the results of the different models.

Text analysis
Natural language processing: natural language processing (NLP) is the ability of a computer program to make sense of human speech and text. The NLP family includes techniques such as sentiment analysis and latent Dirichlet analysis.
Sentiment analysis: sentiment analysis is used to extract subjective information behind a series of words. It is used to gather an understanding of the attitudes, opinions and emotions expressed within a text (particularly online and social media mentions).
Topic modelling: a popular form of topic modelling uses latent Dirichlet analysis (LDA), a generative statistical model that allows a corpus of text documents to be explained by unobserved (i.e., latent) topics that explain why some parts of the data are similar. Each document in a corpus is modelled as a finite mixture over an underlying set of topics. LDA can be applied to social media data such as tweets to identify the underlying topics driving the content of the tweets.

Other
Social network analysis (SNA): social network analysis is used to make visible hidden network structures. Networks are modelled as nodes (individual actors, people, or things within the network) and connecting ties (relationships or interactions). SNA can be used to understand how customers are connected to each other and which ones are influential in forming opinion.
Simulations: a computer simulation uses an abstract model of a system to reproduce the behaviour of that system. Simulations are useful in areas such as logistics, cash-flow forecasting, and marketing strategies.
Geospatial and mapping applications: data that are tagged with geospatial coordinates (e.g., latitude/longitude) or with postal codes are visualized and analysed. Geospatial mapping can be used to plan the location of new stores based on customer location, or to understand customer demographics based on socioeconomic analysis of postal code.

Many of these techniques can be applied to a problem interchangeably – for example, a data scientist might build a Bayesian network, a linear regression model, and a neural network and then assess the performance of each on the same dataset to see which best predicts the target outcome. To complicate things further, ensemble models pool the results of multiple models using a form of 'voting' (e.g., by taking the average of the predictions returned by each model in the ensemble) to produce a better performing 'super' model. The data scientist will also be trading off predictive performance with the resources required to run the model, as some techniques may be too computationally intensive to cope with large datasets in a timely manner.
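To make the 'voting' idea concrete, the following sketch (ours, using simulated data and base R only) fits two base models to the same training data and averages their predictions for a few new cases:

# Minimal sketch of ensemble averaging using base R and simulated data
set.seed(42)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 1.5 * x + 0.3 * x^2 + rnorm(n, sd = 4)   # a non-linear target
train <- data.frame(x = x, y = y)

m1 <- lm(y ~ x, data = train)           # base model 1: straight-line fit
m2 <- lm(y ~ poly(x, 2), data = train)  # base model 2: quadratic fit

new_cases <- data.frame(x = c(2, 5, 8))

# The ensemble prediction is the average of the base models' predictions
p1 <- predict(m1, newdata = new_cases)
p2 <- predict(m2, newdata = new_cases)
ensemble_pred <- (p1 + p2) / 2
ensemble_pred

In practice the averaging would often be weighted by each base model's validation performance rather than treating all base models equally.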

Exercise 2.3: Understanding data science techniques

Select two techniques from Table 2.1 with which you are unfamiliar. For each technique:
1. Describe how the technique has been applied previously in a business setting.
2. Thinking about the organization for which you currently work (this can also be one that you have worked for previously or your current educational institution), give an example (use case) of how the technique could be used in your organization and the potential business benefit that might be generated from its application.

The data scientist 


 

At a basic level, the data scientist needs programming skills, mathematics and statistics expertise, and domain knowledge to be effective. Josh Wills (2012), a data scientist and commentator, defined a data scientist as:

Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.
Cathy O'Neil and Rachel Schutt (2013) provide a nuanced review of the role of data science and data scientists, from which we take three lessons:

Data science is special and distinct from statistics because a data product (e.g., a recommendation system) 'gets incorporated back into the real world, and users interact with that product, and that generates more data, which creates a feedback loop' (p.42). In other words, data models can be strongly 'performative', acting back on the business.

Data science is not just about models and algorithms: '[Data scientists] spend a lot more time trying to get data into shape than anyone cares to admit, maybe up to 90% of their time' (p.351). Getting data into shape involves data acquisition, data structuring, data cleaning (outliers, missing values, etc.), and transformations. Many statistics courses start with fairly clean and well-structured data, spending 90% of the time on models and only 10% on herding data.

Data scientists need to be able to ask questions, be able to say 'I don't know', and not be blinded by money: 'They seek out opportunities to solve problems of social value and they try to consider the consequences of their models' (p.355).
It seems that we are asking a lot of the data scientist role. They need to be capable in statistics and programming, have domain knowledge, be able to engage stakeholder support, tell convincing stories through data, understand how data can help the business create value, and have an unquenchable thirst for knowledge. Somohano (2013, slide 24) defines the data scientist class as follows:

Class DataScientist
Is skeptical, curious. Has inquisitive mind. Knows machine learning, statistics, probability. Applies scientific method. Runs experiments. Is good at coding and hacking. Able to deal with IT and data engineering. Knows how to build data products. Able to find answers to known unknowns. Tells relevant business stories from data. Has domain knowledge.

These are rare people (often referred to as 'unicorns') and, even when they can be found, it is unlikely that one person will have all the capabilities needed in an analytics team. Therefore, a data science team will typically comprise team members with complementary skills.
How might we proile data scientists? In collaboration with their
customers, Mango Solutions has deined six core attributes of the
contemporary data scientist. The irm’s survey questionnaire can be
used
and totohelp
gainensure
insightainto the capabilities
balance of organization’s
of skills in an individual dataanalytics
scientists
team. After completion of the online survey a Data Science RadarTM
chart is produced showing the proile of the data scientist – for
example, Figure 2.7 suggests
2.7 suggests a person with particular strength in data
visualization.

Figure 2.7   Data scientist attributes (Data Science Radar™, reprinted with permission from Mango Solutions 2019)

O'Reilly Media conducts an annual survey of data science salaries. The fifth edition (Suda 2017), conducted in 2017, surveyed 'nearly 800 participants from 69 countries, 42 US states, and Washington, DC' and explores 'everything from salaries and bonuses to tools, cloud providers, and reporting' (p.1). The survey asks respondents about a range of tasks and whether they have major, minor, or no involvement in those tasks. The tasks identified are shown in Table 2.2 in order of major involvement. Thus, 67% of respondents reported basic exploratory data analysis as a task in which they have major involvement, while at the other end of the scale only 4% reported the development of hardware as a major involvement task.
Table 2.2   Data scientist tasks (adapted from Suda 2017, p.46 with permission from O'Reilly Media)

   Task %
1  Basic exploratory data analysis 67
2  Conduct data analysis to answer research questions 61
3  Communicate findings to business decision-makers 58
4  Data cleaning 53
5  Develop prototype model 49
6  Create visualizations 47
7  Identify business problems that can be solved with analytics 44
8  Feature extraction 42
9  Organize and guide team projects 40
10 Implement models/algorithms into production 38
11 Collaborate on code projects (read/edit others' code, using the git repository) 38
12 Teach/train others 31
13 Communicate with people outside your company 30
14 ETL (extract, transform, load) 29
15 Plan large software projects or data systems 28
16 Develop dashboards 28
17 Set up/maintain data platforms 24
18 Develop data analytics software 21
19 Develop products that depend on real-time data analytics 18
20 Use dashboards and spreadsheets (made by others) to make decisions 15
21 Develop hardware (or work on software projects that require expert knowledge of hardware) 4

Exercise 2.4: Proiling the data scientist    Visit Mango Solutions’


(www.mango-
webpage on the irm’s ‘Data Science Radar Challenge’ (www.mango-
 

solutions.
solutions.com/
com/radar/
radar/)) and complete the survey questionnaire ‘What 
kind of data scientist are you?’
It does not matter that
t hat you are not currently a data scientist – or
that your responses might relect your aspirations rather than your
current skills.
1.
If you were putting together a team of data scientists for your  
organization, what skill proiles would you need? What size
might a data science team need to be to adequately cover the six
attributes?
2.
How might the data scientist recruitment process vary  
depending on the core attribute proile being sought?

Table 2.2 gives an in-depth insight into what data scientists actually do and the wide range of tasks they can be involved in. While basic data analysis and answering questions are key activities (tasks 1, 2), the majority of data scientists also report communicating with business as a major activity (task 3). Data preparation activities form a core part of the data scientist's work, reflected in tasks 4, 8, and 14 in particular.

Analytics toolsets

As well as building the human resources needed to execute analytics projects, the organization will need to select software tools to help it capture, store, visualize, and model data. Adopting an analytics toolkit involves significant investment. Even if the software is free (e.g., open source), the complementary investment in training, installation, operation, support, and attracting analytics people with the appropriate skills will still be expensive.

While it is tempting to jump on the latest bandwagon, it is not necessary to have the latest tools to create business value from analytics. It is more important to build a competency with a toolset than to continually seek to change horses, always looking for a silver bullet.
 

The O'Reilly 2017 Data Science Salary Survey (Suda 2017) asks respondents which programming languages they use. The top-three languages are SQL (structured query language) (64%), Python (63%), and R (54%). These tools are not mutually exclusive; a data scientist may well use all three languages. Indeed, given that SQL is primarily a language for querying database contents, then a basic data scientist toolset would be knowledge of SQL plus a statistical programming language (either R or Python). Statistical analysis and programming can also be conducted using proprietary packages such as SPSS (an IBM product), SAS, or Stata. However, the take-up of these tools by data scientists is much lower than that of R and Python, both of which are open source and free to use.
To provide broad coverage of tools used by data scientists we will illustrate three analytics development environments: R, SAS Visual Analytics (SAS VA), and DataRobot. Each of these environments has strengths and weaknesses and they might be used singly or in combination. We also acknowledge that Microsoft's Excel will likely be a part of the data scientist's toolset, regardless of which data science platforms are used.
R

R is an open source programming language and environment for statistical computing and graphics. R provides a wide variety of statistical techniques, such as linear and non-linear modelling, classical statistical tests, time-series analysis, classification, clustering, and graphical techniques, and is highly extensible. R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of Unix platforms and similar systems (including FreeBSD and Linux), Windows, and MacOS (https://www.r-project.org/about.html).
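To give a flavour of the language, a short base-R session might look like the following (a minimal sketch using R's built-in mtcars dataset):

# A short base-R session: explore data, fit a linear model, inspect results
data(mtcars)                               # built-in dataset of car performance
summary(mtcars$mpg)                        # classical descriptive statistics
fit <- lm(mpg ~ wt + hp, data = mtcars)    # fuel economy modelled on weight and power
summary(fit)                               # coefficients, R-squared, diagnostics
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")  # simple graphic
abline(lm(mpg ~ wt, data = mtcars))        # overlay a fitted line on the scatterplot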
R developed from the S language, which was created in 1976 at Bell Labs and released commercially as S-Plus. In the 1990s R was written as an open source implementation of S by (R)oss Ihaka and (R)obert Gentleman. In 1997 the Comprehensive R Archive Network (CRAN) was launched. In August 2014 the CRAN repository contained 5,789 packages; as of May 2019 the number of packages had risen to 14,164.
 

An alternative to R is Python, which is also open source and has a similar array of packages available. Python is a general-purpose programming language that has been used for Web development in particular while R was developed specifically as a statistical programming language. Over time, R has become a more general-purpose language while Python has added many packages for statistical analysis. For further discussion of the relative strengths and uses of R and Python see Data-Driven Science (2018).
SAS Visual Analytics (SAS VA)

SAS VA is a web-based commercial analytics platform that balances performance in terms of both computational processing and the range of statistical methods with an intuitive user interface. As a result, SAS VA is an analytics tool that is accessible for people with limited programming and/or statistical backgrounds.

With the increasing prevalence of big data, a key component of analytics technology is the computational performance, that is, how fast can the application process and model big data sets? To handle computational processing, SAS VA utilizes their patented in-memory analytics technology. Their in-memory analytics stores data in either local or web-based servers, which enables faster integration of the data with software compared to traditional solutions accessing data from external sources.

Using SAS VA a data scientist can perform cluster analysis, build linear regression and generalized linear regression models, tackle categorical targets with logistic regression and decision trees, and perform model comparisons. Additional functionality is available for data manipulation, visualization, and building dashboards. While not as powerful or comprehensive as the SAS Enterprise version, SAS VA has an easy-to-use interface and is quick and intuitive to learn.

Automated machine learning

Gartner (Sallam et al. 2017) argue that analytics is at a critical inflection point; organizations have easier-to-use tools and self-service analytics, but the processes of preparing and analysing data, building predictive models, and making sense of and communicating the results are still largely manual. Gartner identify 'augmented analytics' as the next wave, in which machine learning is applied throughout the data and analytics workflow (from insight to action). Augmented analytics, which we will refer to as automated machine learning, puts analytics capabilities into the hands of operational workers and citizen data scientists, thus freeing up expert data scientists to work on specialized problems.
One such platform for automated machine learning is DataRobot, who state on their website:

DataRobot captures the knowledge, experience, and best practices of the world's leading data scientists, delivering unmatched levels of automation and ease-of-use for machine learning initiatives. DataRobot enables users to build and deploy highly accurate machine learning models in a fraction of the time it takes using traditional data science methods. (https://www.datarobot.com/product/)
The DataRobot platform automates the data preprocessing and modelling activities (Figure 2.8). Features are created and selected, and models tuned automatically. DataRobot applies many algorithms to the same business problem and through a survival of the fittest contest creates a leaderboard based on model performance. DataRobot then goes on to create ensemble models, that is, models that combine the power of multiple base models to create 'supermodels'. The results from the base models are combined in the ensemble model by averaging the individual model results; in a sense, each of the models gets to vote on the outcome to come up with a 'crowd-sourced' prediction.

Figure 2.8   The DataRobot approach to automated machine learning (https://blog.datarobot.com/ai-simplified-what-is-automated-machine-learning)
 

As there is a standard interface the user does not need to know how to invoke the different algorithms, which might be implemented, for example, using Python or R. The output from DataRobot includes predictions, insights, and model validation, all in a standardized form regardless of the modelling technique used. The predictions can then be deployed as part of operational business processes by embedding calls to the DataRobot application program interface (API).
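The general shape of such a call can be sketched from R. The sketch below is purely schematic: the endpoint URL, token, and payload are hypothetical placeholders, not DataRobot's actual API, which is documented on the DataRobot website:

# Schematic only: a generic REST prediction call from R using the httr package.
# The URL, token, and payload below are hypothetical placeholders and do not
# describe DataRobot's real API specification.
library(httr)
library(jsonlite)

new_cases <- data.frame(age = c(34, 51), income = c(58000, 72000))  # example features

response <- POST(
  url = "https://example-ml-platform.com/api/predict",   # hypothetical endpoint
  add_headers(Authorization = "Bearer <API_TOKEN>"),      # placeholder credential
  body = toJSON(new_cases),                               # cases to score, as JSON
  content_type_json()
)
predictions <- fromJSON(content(response, as = "text"))   # parse returned predictions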
For business analysts with limited technical background the platform allows them to build models without needing to know how the different techniques, such as neural networks, logistic regression, and support vector machines, work. DataRobot automatically divides the dataset into training and holdout sets (80/20 as a default) and further splits the training data into folds (five as a default) to allow model validation and cross-validation to be run. This approach ensures that best practice in training, validating, and deploying models is conducted by default and so brings a large element of safety to the model development process when it is in the hands of end-user data scientists.
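The same discipline is straightforward to reproduce by hand. A minimal base-R sketch of an 80/20 training/holdout split and five-fold assignment (using a simulated stand-in dataset) follows:

# Minimal sketch: 80/20 training/holdout split and 5-fold assignment in base R
set.seed(123)
dataset <- data.frame(x = rnorm(100), y = rnorm(100))  # stand-in for real modelling data

n <- nrow(dataset)
holdout_idx <- sample(n, size = round(0.2 * n))  # 20% held out for final testing
holdout  <- dataset[holdout_idx, ]
training <- dataset[-holdout_idx, ]

# Assign each training row to one of five folds for cross-validation
folds <- sample(rep(1:5, length.out = nrow(training)))
for (k in 1:5) {
  cv_train <- training[folds != k, ]  # fit the model on four folds
  cv_test  <- training[folds == k, ]  # validate on the remaining fold
  # ... fit and score a candidate model here ...
}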
For expert data scientists much of the pain of data preparation and remembering how to run models is removed. For all modellers, there is access to techniques that they might not have considered or, indeed, even heard of before. Indeed, one academic researcher found that DataRobot allowed them to replicate in one hour a predictive model that had taken them two to three months to develop. Further, the DataRobot model outperformed the academic's original predictive model 'by a factor of two' due to the model builder having 'missed a class of algorithms that worked really well for the data in question' (https://www.datarobot.com/product/).

Exercise 2.5: Automated machine learning

Watch the video 'How DataRobot works' https://www.youtube.com/watch?v=RrbJLm6atwc and then consider the following questions:
1. What are the strengths of automated machine learning? What are the potential weaknesses?
2. Do you think that the data scientist role will be fully automated? Which parts of the data scientist role might be left?
 

Analytics tool comparison

The three tools – R, DataRobot, SAS VA – are compared and contrasted in Table 2.3. The three tools overlap in that they can all be used to explore data, build predictive models, and explore model results. However, this superficial similarity quickly breaks down once we dig deeper into their capabilities and costs of acquisition and operation.
Table 2.3   Comparison of R, SAS, and DataRobot

Functionality

Data handling
DataRobot: Limited data manipulation facilities – it may be necessary to supplement data preprocessing with tools such as SQL, Excel, R, and Python.
SAS VA: Has a visual interface for data manipulation but an experienced programmer will do it more quickly (and with more functionality) using R or Python.
R: Quick and efficient with data manipulation and scope for sophisticated feature engineering.

Data visualization and dashboard development
DataRobot: Limited dashboard building capabilities as the focus is squarely on automated machine learning (predictive model building). DataRobot would typically be used in conjunction with a visualization tool such as Tableau.
SAS VA: Intuitive interface for building visualizations and dashboards – suitable for data scientists.
R: Many visualization possibilities (e.g., ggplot2 and igraph packages) and dashboards can be built in RShiny – but this comes at the cost of effort and knowledge.

Predictive modelling capability
DataRobot: Draws on a repository of more than 80 model types which are further combined into ensembles.
SAS VA: Limited to a subset of methods (e.g., multiple regression, logistic regression, decision trees).
R: Has packages for pretty much everything and is constantly expanding (e.g., AI/deep learning support).

Acquisition and support

Platform
DataRobot: Runs in the cloud and is accessed via a web browser. Can also be run on-premises.
SAS VA: Typically runs on a corporate server and is accessed via a web browser. A cloud-based solution is planned.
R: Typically runs on a workstation. Supports multiple OS (e.g., Windows, Unix, MacOS). Most often used with an integrated development environment (IDE), such as RStudio.

Cost
DataRobot: Corporate usage is charged on a per seat basis. A faculty license is available for teaching purposes.
SAS VA: Various corporate pricing schemes. Educational institutions can partner with SAS to access the software free of charge for teaching purposes. Can also be accessed free of charge by students via the Teradata University network using preinstalled datasets (no data upload facility).
R: No acquisition cost (open source).

Organizational support/longevity/stability
DataRobot: Corporate support from a large organization with tried and tested software and models. Automated machine learning is a recent development and a fast-moving market and the winners may be yet to emerge.
SAS VA: Corporate support from a large and long-established organization with tried and tested software and models. An industry standard for corporate business analytics. SAS accreditation schemes are available for data scientists in conjunction with universities.
R: Support is decentralized and provided through communities such as stackoverflow and R-bloggers. Multiple versions of base R and packages mean that things can stop working when updates are applied. Packages may cease to be developed or supported. Testing and quality assurance may be more difficult.

Usability and learning

Ease of use
DataRobot: A single and unified interface for machine learning suitable for data scientists and end users.
SAS VA: Visual interface and menu options make it easy to use and quick to run.
R: While many R commands are intuitive, the simple command line interface and power of R make it hard to use for the occasional user. For experienced data scientists who are using R daily, R makes for a highly efficient and productive environment.

Skills required
DataRobot: The standardized output, regardless of model type, makes results interpretable by managers and employees at all levels in the organization. IT and statistical skills are not a prerequisite.
SAS VA: Can be learnt quickly by managers (days rather than hours). While model results and diagnostics are prepackaged, statistical knowledge is desirable if model results are to be interpreted safely.
R: R users need basic programming skills and solid statistical knowledge as models and diagnostics are configurable (the data scientist needs to know what they need to check and what the diagnostics mean).

Learning curve
DataRobot: A single license can be acquired and a DataRobot user will be able to produce results within hours.
SAS VA: Users should be able to produce results in a day or two, subject to a basic IT and statistical competence.
R: R has a steep learning curve. Even with good IT and statistical skills it will take days to learn R and weeks and months of regular usage to become a competent user.

DataRobot is a cloud-based offering; a corporate user can sign up and be using the software to analyse their data immediately. Because it is cloud-based there are no installation requirements. The subscription model and cloud-based implementation make it a cost-effective and quick option for data analytics for organizations of any size. However, it does mean that the organization's data is stored in the cloud by DataRobot and this may not be acceptable to all organizations, for example where regulatory requirements may mean that data cannot be stored on servers outside of the organization's country. In such a case, an on-premises version of DataRobot may be required. DataRobot is designed for all types of organizational user (it does not require either IT or statistical knowledge).
Both SAS and R are aimed at data scientists rather than the general user. SAS VA is a corporate solution that needs to be implemented by IT professionals with associated costs of installation and operation (although SAS is developing a cloud platform for a software as a service offering). As with DataRobot, it may be too expensive for a small organization or an independent data scientist to acquire.

R is an open source offering that is free to use. Whereas SAS VA has predefined functionality (e.g., multiple and logistic regression), R has almost unlimited functionality through the extensive range of packages available. On the basis of cost and functionality, R would seem the obvious choice of analytics tool over SAS. However, it is not so simple. R has a steep learning curve and does not have the benefit of a large corporate infrastructure to provide support (although the R community is large and self-supporting through various sites such as R-bloggers). While SAS has limited functionality, the functionality it does have is targeted squarely at corporate needs. For example, running a logistic regression and viewing the relevant outputs and diagnostics is quick and easy in SAS VA whereas in R it will take time to find code examples and stitch them together into a script. However, once the R script is written, it is easily reusable, making R (or Python) a great long-term investment for data scientists.

Exercise 2.6: Choosing an analytics tool

What factors should Stanley Building Services (SBS) consider in deciding on its analytics toolset? The aim is not to decide which analytics product to adopt but to arrive at a set of decision criteria (around 8 to 12 items) that should be considered and scored in making the decision.

Background
Stanley Building Services (SBS) is an SME that sells building supplies to the building trade and to the public. The company has 150 outlets across Europe and is keen to use analytics to get insight into its customers so that it can understand segments and needs and provide a better service. SBS is also interested in the geo-location of its stores: Are they in the right locations? Where should new stores be opened?

The company is considering hiring a full-time data scientist to carry out descriptive analytics and to build predictive models. SBS currently has no analytics methodology and develops data analytics solutions on an ad hoc basis using Excel. The financial director wants to use Excel as they are comfortable with spreadsheets. The IT director used SAS in a previous company and prefers an enterprise solution. The CEO went to a seminar hosted by DataRobot and saw automated machine learning in action – the CEO is a recent convert to business analytics and thinks DataRobot could be the silver bullet that SBS's managers are looking for.

Summary
In deploying analytics, an organization has to consider the methodology to use, the profiles of the data scientists to employ, and the toolset to use. The organization needs the three elements of the analytics function to be in alignment (Figure 2.9). The data scientists will need skills in the tools and techniques adopted by the organization and knowledge and acceptance of the way things are done (methodology). For example, it might not work to hire data scientists who subscribe to open toolsets such as Python and R and then require that they only use a proprietary tool such as SAS (and vice versa). While some systematicity is needed in the analytics development process, an overly formal and bureaucratic methodology may impede the effectiveness of the data scientist team. Further, we should also plan to see the data science role itself being automated in part through tools such as DataRobot. Lastly, while predictions are essential, the acid test of a successful analytics intervention is the extent to which it informs actions that create business value and, ideally, that value is demonstrable through A/B testing.
 

Figure 2.9   Aligning the analytics development function

References
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000). CRISP-DM: Step-by-step data mining guide. The CRISP-DM consortium, August 2000.

Chollet, F. (2018). Deep learning with R. Manning Publications, New York.

Chui, M. & McCarthy, B. (n.d.). An executive's guide to AI. McKinsey (website). https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/an-executives-guide-to-ai

Data-Driven Science. (2018). Python vs R for data science: And the winner is. Medium. https://medium.com/@data_driven/python-vs-r-for-data-science-and-the-winner-is-3ebb1a968197

Davenport, T. & Ronanki, R. (2018). Artificial intelligence for the real world. Harvard Business Review, January–February: 109–116.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3): 37–54.

Haynes, L., Service, O., Goldacre, B., & Torgerson, T. (2012). Test, learn, adapt: Developing public policy with randomised controlled trials. UK Cabinet Office Behavioural Insights Team. https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/62529/TLA-1906126.pdf

Highsmith, J. & Cockburn, A. (2001). Agile software development: The business of innovation. Computer, 34(9): 120–127.

Khabaza. (2010). Nine laws of data mining. Khabaza (website). http://khabaza.codimension.net/index_files/9laws.htm

Mango Solutions. (2019). What kind of data scientist are you? Mango Solutions (website). https://www.mango-solutions.com/radar/

O'Neil, C. & Schutt, R. (2013). Doing data science. O'Reilly Media, Sebastopol, CA.

Piatetsky, G. (2014). CRISP-DM, still the top methodology for analytics, data mining, or data science projects. KDnuggets (website). http://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html

Sallam, R., Howson, C. & Idoine, C. (2017). Augmented analytics is the future of data and analytics. Gartner, 27 July.

Somohano, C. (2013). Big data [sorry] and data science: What does a data scientist do? SlideShare (video). http://www.slideshare.net/datasciencelondon/big-data-sorry-data-science-what-does-a-data-scientist-do

Suda, B. (2017). 2017 Data science salary survey. O'Reilly Media (website). http://www.oreilly.com/data/free/2017-data-science-salary-survey.csp

Wills, J. (2012). Data scientist. Twitter. https://twitter.com/josh_wills?lang=en-gb
 

© Richard Vidgen, Sam Kirshner and Felix Tan, under exclusive licence to Springer Nature Limited 2019
R. Vidgen et al., Business Analytics
https://doi.org/10.26777/978-1-352-00726-8_3

3. Data and Information

Richard Vidgen1, Samuel N. Kirshner2 and Felix Tan3
(1) Business School, University of New South Wales, Sydney, Australia
(2) Business School, University of New South Wales, Sydney, Australia
(3) Business School, University of New South Wales, Sydney, Australia

Richard Vidgen (Corresponding author)
Email: r.vidgen@unsw.edu.au

Samuel N. Kirshner
Email: s.kirshner@unsw.edu.au

Felix Tan
Email: f.tan@unsw.edu.au

Chapter Overview   In this chapter we look at all things data. We start by considering the implications of the dramatic growth in data volumes. In everyday language the terms 'data', 'information', 'knowledge', and 'wisdom' are often used loosely and sometimes interchangeably. In the context of decision-making, it is useful to distinguish between these different concepts. Regardless, the ground rock of decision-making is data and that data needs to be of sufficient quality – not necessarily perfect, but it must be fit for use. Having established the role of data we will then dig deeper into the pragmatics of data, looking at some of the characteristics of data (e.g., different data types and distributions).
 

Learning Outcomes
After you have completed this chapter you should be able to:
Explain how data and its sources are an asset to organizations, governments, and the lives of citizens
Explain the distinction between data, information, knowledge, and wisdom
Explain why data quality is important
Define and operationalize key data-quality attributes
Define attributes of datasets, such as missing values, outliers, and probability distributions.

Introduction
When someone thinks they have flu they are likely to use a search engine to find symptoms, treatments, and other information. Google decided to track online searches with the hope of being able to predict flu outbreaks faster than traditional means – for example, possibly two weeks earlier than health authorities such as the US Centers for Disease Control and Prevention (CDC). The developers of Google Flu Trends (GFT) claimed in the journal Nature that 'we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day' (Ginsberg et al. 2009, p.1012). In 2013 GFT failed spectacularly, missing the peak of the 2013 flu season by 140% which led to the decommissioning of GFT (Lazer & Kennedy 2015). While the failure of GFT does not mean that big data does not have value, it does demonstrate the potential for 'big data hubris'.
In an article in Science, Lazer et al. (2014) explain that 'big data hubris' is the often implicit assumption that large volumes of data can be a substitute for, rather than a supplement to, traditional data collection and analysis. Smart data scientists with massive quantities of data may think that they can outsmart anyone and anything. However, GFT failed for a number of reasons. First, GFT overfitted the data, using seasonal search terms such as 'high school basketball', which were strongly correlated, but only by chance. Second, GFT did not take into account changes – in other words, the model itself (rather than simply the data) needed updating. Third, and most importantly, GFT was using intrinsically unreliable data. Writing in Forbes magazine, Steven Salzberg (2014) stated the problem as follows:

A bigger problem with Google Flu, though, is that most people who think they have 'the flu' do not. The vast majority of doctors' office visits for flu-like symptoms turn out to be other viruses. CDC tracks these visits under 'influenza-like illness' because so many turn out to be something else. To illustrate, the CDC reports that in the most recent week for which data is available, only 8.8% of specimens tested positive for influenza. When 80–90% of people visiting the doctor for 'flu' don't really have it, you can hardly expect their internet searches to be a reliable source of information.
While there is undoubtedly value to be extracted from big data, managers must not fall into the trap of big data hubris. Whether the data is big or small, as the GFT example shows, problems of data quality, model overfitting (the model follows the data too closely and does not adequately distinguish signal from noise), and model decay (relationships in the data change over time) can arise.

Data growth
A data deluge is sweeping almost invisibly across the planet. It is the result of the prevalence of automatic data collection, electronic instrumentation, and online transaction processing (OLTP). There is a growing recognition of the untapped value in these databases, which is in part driving the development of data science. This data comes in many forms. Some of the data will be structured – that is, in tabular form with regular columns and rows, as is typical of spreadsheets and relational databases. Other data will be unstructured, such as email, text documents, audio recordings, video, and images.
Unstructured data is in the ascendancy and will pose data storage as well as data analysis challenges for organizations. Rizkullah (2017) reports Gartner's estimate that unstructured data comprises around 80% of enterprise data and goes on to comment that organizations are unprepared for unstructured data management – they don't know what they have and they don't know how to protect it. Igneous (2018) commissioned a poll of 200 organizations and find that the typical organization is experiencing 23% annual growth of its unstructured data with around a quarter of organizations seeing growth rates of more than 40%. They likened this growth of unstructured data to a tsunami and find that organizations are struggling to manage this data, with particular issues around accessibility, governance, and insight.
The consequences of a data deluge for organizations means that defining a data management protocol is essential if they are to maximize the opportunity for obtaining data that can yield useful information. Accordingly, we should bear in mind that (Truxillo 2015, pp.1–6):

every problem will generate data eventually – proactively defining a data collection protocol will result in more useful information, leading to more useful analytics
every company will need analytics eventually – proactively analytical companies will compete more effectively
everyone will need analytics eventually – proactively analytical people will be more marketable and more successful in their work.
As data becomes cheaper and more plentiful, companies have begun to leverage information content that was previously impossible or unfeasible to access. And those companies that implement competitive analytics are likely to have greater influence on the shape and future of their industry.

Exercise 3.1: Digital exhaust

Please watch the video 'The human face of big data' [4:27] https://vimeo.com/103263590. Thinking about the organization for which you currently work (this can also be one that you have worked for previously or your current educational institution), consider the following questions:
1. How could you use digital traces to better understand the behaviours and characteristics of your customers/consumers?
2. How might this digital-trace data produced by the digital exhaust be misused?

From data to wisdom

It is through data that we interact with, experience, and make sense of the world. We can only think about our everyday understanding of the world through data, whether it be our perceptions of cities, crime, global trade, migration, or disease. Our lives are mediated almost entirely through data – examples include checking the time, reading an email or newspaper, monitoring our heart rate, or counting the number of steps we have taken today.

The term 'data' derives from the Latin datum, meaning 'that which is given'. Data can be descriptions (e.g., I'm doing exercise), counts (e.g., I've done 10,000 steps today) or measures (e.g., My weight is 75 kilograms). Data can be collected on anything – happiness, trade, weather, transport, weight, height, activity, ideas, behaviours, the economy, and so on. Data can also be in many formats – for example, numbers, words, sounds, images, or video.
When data is analysed it becomes information, which in turn can build to knowledge, and possibly to wisdom (Figure 3.1). To take a simple illustration, we might look in a data file and see a code, '802981'. Given the context (e.g., the organization that we work for) we can turn this into information – it is one of our customer codes. The meaning of this information might be that this is not just any customer but a valued customer that we are at risk of losing. Our insight from our knowledge of the customer is that we should be prepared to take action to retain the customer. All organizations need to be able to give context to their raw data, turn it into information, understand the meaning of their information to create knowledge, and – given insight – they must be able to turn knowledge into the wisdom to take effective action.
 

Figure 3.1   From data to wisdom

Data and data analysis have been around rather longer than digital computers and the Internet. In Victorian Britain the industrial revolution was creating booming cities and changing the landscape and daily lives of citizens beyond all recognition. William Farr (1807–83) pioneered the field of medical statistics. In 1838 he was appointed to the General Register Office (GRO), a government department responsible for recording births, deaths, and marriages. Farr set up a system that routinely recorded the cause of death, thus providing the raw data for detailed analysis of death within the general population (Theerman 2014). This data allowed mortality rates by location and profession to be compared (Figure 3.2), showing that life expectancy in Liverpool was far lower than in other areas.

Figure 3.2   Farr's analysis of mortality data (Farr 1885)

Without data collection and subsequent analysis of the differences in mortality rates between rural and metropolitan areas, the high mortality rates in cities such as Liverpool would not be visible. Using data, Farr could begin to speculate about causes of death. In 1852 Farr reported a relationship between death from cholera and elevation above the river Thames in London (Figure 3.3). Of course, this is an association between variables (a correlation) and not necessarily a causal link. However, the prevailing belief was that cholera was an air-borne disease and the evidence presented in Figure 3.3 (itself an interesting use of data visualization) helped shore up this belief. Those who lived higher above the river Thames were indeed less likely to die from cholera.
 

Figure 3.3   Farr's analysis of cholera mortality data (Farr 1852)
The figures in the centre express the number of deaths from cholera per 10,000 inhabitants living at the elevations expressed in feet on the sides of the diagram. For example, in districts 90 feet above the Thames, the average mortality from cholera was 22 in 10,000 inhabitants.

In 1866, 5,500 people died in one square mile of London's East End. Geo-mapping of cholera deaths by Dr John Snow (1813–58) showed cholera deaths to be highly localized. This led Snow to speculate that cholera was a water-borne disease. Despite compiling a substantial body of evidence (including a map showing the location of 15 water pumps and the locations of the cholera-related deaths) that seemed to be irrefutable, such was the grip of the air-borne theory of cholera ('miasma theory') that the Government refused to accept Snow's conclusions. Snow died in 1858 and did not live to see his ideas become accepted. Eventually, politicians were forced to act and deal with London's polluted water sources, leading to the building of a sewerage system by Joseph Bazalgette and the eradication of cholera.
While Farr's later report that cholera was caused by sewage-contaminated drinking water contradicts his earlier report on elevation, this is a good example of data science in action – theories are speculations that might be overturned by subsequent data. However, while there was compelling evidence that cholera was water-borne rather than air-borne, the data alone was not sufficient to break down entrenched opinion. Eventually, the weight of the data led to the overturning of the air-borne theory of cholera, but only through a social process where meaning is negotiated rather than absolute. For further details about Farr and Snow, see the Science Museum (www.sciencemuseum.org.uk/broughttolife/people/williamfarr) and for the story of cholera and the Thames see www.choleraandthethames.co.uk/.

Data summarization
Digesting data into a summary measure (e.g., creating a weighted average) is a one-way street. We lose information about the underlying data and are left with just a final single value. Summary data is easier to work with, but is the trade-off worth it? Consider two movies:
1. Eat Pray Love (2010) – IMDB movie rating 5.7/10
2. Inception (2010) – IMDB movie rating 8.8/10
Which is the better movie? Now consider the additional information – look at where the insight and surprise are (Figure 3.4). It is evident that Eat Pray Love has fewer people voting than Inception – 64,565 versus 1,494,360. More strikingly, the distribution of the votes is very different. Eat Pray Love has 7.1% of voters rating the movie 10 and 5.2% rating it 1. Movie-goers are polarized between loving and hating this movie. Inception has a distribution that looks more like a long tail – 36.8% rate the movie 10 with a fall-off thereafter (although some people don't like the movie as the proportion of movie-goers rating it 1 is 1.0% – more than the number of movie-goers rating it 4). Using a summary measure, such as the mean or the median, of necessity involves information loss.
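A small R sketch (with made-up vote shares, not the actual IMDB figures) shows how a weighted mean collapses two very differently shaped ratings distributions to single numbers, while a plot restores the lost information:

# Two illustrative ratings distributions (vote shares in %) and their means
ratings   <- 1:10
polarized <- c(15, 5, 4, 4, 4, 5, 8, 12, 18, 25)   # U-shaped: love it or hate it
long_tail <- c(1, 1, 2, 3, 4, 6, 10, 17, 25, 31)   # long tail towards high ratings

weighted.mean(ratings, polarized)   # one number; the polarization is invisible
weighted.mean(ratings, long_tail)   # one number; the tail shape is invisible

# Plotting the full distributions recovers the information the mean discards
barplot(rbind(Polarized = polarized, `Long tail` = long_tail), beside = TRUE,
        names.arg = ratings, legend.text = TRUE,
        xlab = "Rating", ylab = "% of votes")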

Figure 3.4   Two movies compared

Exercise 3.2: The social nature of 'facts'

Thinking about the organization for which you currently work (this can also be one that you have worked for previously or your current educational institution), identify something that is believed as a given fact in your organization that might be analogous to the air-borne 'miasma theory' of cholera. This belief might relate, for example, to customer behaviours, the organization's operations, its external partners, or its revenue models.
1. What evidence is there for this belief?
2. How is the belief sustained and communicated through time?
3. Who gains from this belief? Who might lose if the belief were shown to be incorrect?

Data quality
Redman (2008) places data quality between IT infrastructure and exploitation. Drawing on Redman's data quality framework, we recognize the pivotal role of data quality in business analytics linking data with decisions (Figure 3.5).

Figure 3.5   Data quality in context 

However, ensuring data access and data quality is surrounded by what Redman calls 'surprisingly brutal politics' (2008, p.35). In today's organizations ownership of data represents power and not everyone is willing to share data – either within or outside of their organization. At the other end of the spectrum, there is often a lack of data ownership; the data is thrown into a data warehouse and it is assumed that the CIO is ultimately responsible for the organization's data. Data ownership and role definition is a fundamental part of data management: Which business unit creates the data? Which business units can access the data? Which can change the data?

Having data of an appropriate level of quality is a fundamental requirement for business analytics. While quality has been defined in different ways, two views dominate: the production view and the consumption view.

Production view of data quality

The production view of quality is associated with the notion of conformance to specification and is often measured in terms of number of defects. In production contexts, the goal of a quality effort is to reduce the product's variance as compared to a specified ideal or product template. For example, a production view of data quality might consist of a set of characteristics, such as those identified by Strong et al. (1997). Here, data quality is viewed as a hierarchy of quality factors. Data quality categories are defined through a number of data quality dimensions (Strong et al. 1997):

Intrinsic data quality: accuracy, objectivity, believability, reputation
Accessibility data quality: accessibility, access security
Contextual data quality: relevancy, value-added, timeliness, completeness, amount of data
Representational data quality: interpretability, ease of understanding, concise representation, consistent representation

At the dimensional level, measures must be developed if quality is to be managed and improved – the old adage that you can't manage what you can't measure applies equally well to data.

In practice, data quality is usually assessed using between three and six dimensions (Figure 3.6). The number of dimensions can vary, as do the dimension labels. For each dimension KPIs should be defined and measures established.
 

Figure 3.6   Data quality in six dimensions

Accuracy
Accuracy is the degree to which data correctly describes the 'real-world' object or event being described. For example, a customer's family name may be incorrectly spelled as a result of a data entry error. A relevant measure might be the percentage of data entries that pass the data accuracy rules.

Completeness
Completeness is concerned with comprehensiveness. Data can be complete even if optional data is missing. As long as the data meets expectations then it is considered to be complete. For example, a customer's first name and last name are mandatory but middle name is optional and so a record can be considered complete even if a middle name is not available. A relevant measure for completeness might be percentage of data fields complete.

Timeliness
Timeliness is the extent to which information is available when it is expected and needed. Timeliness will vary depending on the context. Real-time data, measured in sub-milliseconds, might be needed for high-frequency trading while daily (every 24 hours) data might be acceptable for a corporate billing system. A relevant measure for timeliness is the time interval between the time period the data represents (or when it was generated) and that data being available.

Validity
Validity is concerned with the degree to which the data makes sense. For example, the age at entry to a UK primary & junior school is captured on the form for school applications. This is entered into a database and checked that it is between 4 and 11. If it were captured on the form as 14 or N/A it would be rejected as invalid. Validity might be measured as the percentage of data items deemed to be valid, or invalid.

Integrity
Integrity refers to the validity of data across the relationships and ensures that all data in a database can be traced and connected to other data. For example, in a customer database, there should be a valid customer, addresses, and relationship between them. If there is address data without a related customer, then that data is not valid and is considered an orphaned record. One measure of integrity would be the number and percentage of orphaned records.

Consistency
Consistency requires the data across all systems to reflect the same information. For example, the date of birth for an employee should be the same in all databases that record this datum. While, ideally, each datum should be recorded only once – the single source of truth – in practice data is often duplicated due to IT, performance, operational, and legacy system reasons. Consistency might be measured as the percentage of data items deemed to be consistent.
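As a minimal illustration (ours, using a toy customer table and made-up rules), completeness and validity measures of this kind can be computed in a few lines of R:

# Sketch: simple data-quality KPIs on a toy customer table (illustrative rules)
customers <- data.frame(
  first_name = c("Ann", "Bob", NA, "Dee"),
  last_name  = c("Lee", NA, "Cho", "Fox"),
  age        = c(34, 210, 28, 45)          # 210 will breach the validity rule
)

# Completeness: percentage of mandatory fields that are populated
mandatory <- customers[, c("first_name", "last_name")]
completeness <- 100 * mean(!is.na(as.matrix(mandatory)))

# Validity: percentage of ages inside a permitted range (an illustrative rule: 0-120)
validity <- 100 * mean(customers$age >= 0 & customers$age <= 120, na.rm = TRUE)

c(completeness = completeness, validity = validity)   # both 75% for this toy table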

Consumption view of data quality

The consumption view of quality is associated with the notion of 'fitness for use'. That is, does the data meet the needs and expectations of those who use it? In this view, the product (data) does not have to be used as designed; rather, it can be used in any way that the customer desires. For example, the quality of a fridge might be defined as how well a customer perceives that the fridge stores food and keeps it fresh. If the fridge does the job to the level required by the customer, then it can be said to be of high quality. If the food deteriorates because the fridge is not sufficiently powerful in a very hot climate, then it might appear to be of poor quality to a customer. If the customer decides instead to use the fridge as a chicken coop, then they might find it rather small and wonder why the door creates an airtight seal. Quality is therefore about the customer, the context, and how the customer chooses to use the product.

With data, a strict production view might result in the view that the data is of poor quality – for example, in objective terms it might be incomplete, inaccurate, and inconsistent. However, the data might still be useful for decision-making and thus be a potential source of business value. The consumption view further suggests that data might be used in ways different from what was envisaged when the data was collected. (While this is often the case, it can give rise to ethical, legal, and regulatory issues as data collected for one purpose cannot – or should not – necessarily be used for other purposes.)

In practice, the organization has to negotiate between the production and consumption views of data quality. Quality-improvement initiatives that seek to address data quality issues are an essential aspect of getting value from business analytics. At the same time, value can only be created from data once it is 'consumed' in some way through the application of business analytics.

Exercise 3.3: Data quality
Imagine that the organization for which you currently work (this can also be one that you have worked for previously or your current educational institution) wants to check the quality of its data as part of a change program concerned with creating a data-driven culture. The data management group has identified the following data quality factors:
accuracy
completeness
timeliness
validity
integrity
consistency
1. Using your chosen organization, provide examples of poor data quality that relate to TWO of the above quality factors.
2. How would you measure your selected quality factors?
3. What might be an appropriate quality improvement plan to remedy the data quality issues?
 

Data characteristics
When we do business analytics we need to know what types of data we are working with. For example, is it numeric? If it is numeric, does it represent financial data? If so, in which currency is it denominated? All data management and data quality initiatives are built on the basic building blocks of data types, and all data instances should be consistent with their data type and follow any rules that apply to that data type.

Data types
Data has an underlying type. All data has a type (or units) that helps us map it and set constraints on the values the data might take. For example, consider the following:
The type ‘month’ can be represented (or mapped) as either ‘January’ or 01.
Data can be of type ‘number’ (e.g., 2000) and the rules of a number provide us with validity constraints (e.g., must be in the range 1–10).
Common base data types include number, text, date, location, time, currency, and time interval.
Variables
Variables used in models can be broadly distinguished as categorical or continuous. With categorical data, entities are divided into distinct categories:
Binary variable – there are only two categories (e.g., dead or alive).
Nominal variable – there are more than two categories (e.g., whether someone is an omnivore, vegetarian, vegan, or pescatarian).
Ordinal variable – this is similar to a nominal variable, but the categories have a logical order (e.g., whether a student earned a fail, a pass, a credit, or a distinction in their exam).
With continuous data, entities get a distinct score:
Interval variable – equal intervals on the variable represent equal differences in the property being measured (e.g., the difference between 6 and 8 is equivalent to the difference between 13 and 15).
Ratio variable – this is the same as an interval variable, but the ratios of scores on the scale must also make sense and have a meaningful zero point (e.g., an income of 60,000 is twice as much as an income of 30,000 and an income of zero is meaningful).
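In code, these distinctions map onto data types. A minimal sketch in Python with pandas (the tiny example columns are invented for illustration; note the ordered categorical for the ordinal variable):

import pandas as pd

df = pd.DataFrame({
    "alive": [True, False, True],                  # binary
    "diet": ["vegan", "omnivore", "pescatarian"],  # nominal
    "grade": ["fail", "pass", "credit"],           # ordinal
    "temp_c": [21.5, 19.0, 23.2],                  # interval
    "income": [30000, 60000, 0],                   # ratio
})

# Declare 'grade' as an ordered categorical so its logical order is preserved
df["grade"] = pd.Categorical(
    df["grade"],
    categories=["fail", "pass", "credit", "distinction"],
    ordered=True)
print(df.dtypes)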

Cardinality
Cardinality is the number of unique points within a column of data. Higher cardinality of the data implies more unique values. Unique identifier (ID) columns have full cardinality, since each value is, by definition, unique. The lowest cardinality is achieved when every row has the same value for a given column; such a variable would have no information content within the dataset, although it might be meaningful in a wider context (e.g., when cross-referenced to another dataset with different values for that variable).
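Cardinality is easy to inspect programmatically. A minimal pandas sketch (invented columns again):

import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103],              # full cardinality
                   "diet": ["vegan", "vegan", "omnivore"],
                   "country": ["AU", "AU", "AU"]})     # no information content

# Unique values per column: an ID column equals len(df), a constant column is 1
print(df.nunique())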

Data distributions
When we observe real-world data we find that many distribution patterns keep reappearing. For example, the height of humans is a classic example of the bell-shaped normal distribution (also known as the Gaussian distribution), shown in Figure 3.7, where the mean is zero and the standard deviation one (this is also known as a z-distribution). Another common distribution is the exponential (Figure 3.8), which takes a rate parameter that alters the steepness of the curve. Other common patterns have been found and named after their observers, for example Poisson and Weibull.

Figure 3.7   Normal distribution (mean = 0, sd = 1)

Figure 3.8   Exponential distribution

Some analysis techniques require data (or at least, the error terms) to be normally distributed. We can visually inspect the distribution with a histogram and perform statistical tests to assess skew (do the observations pile up on one side or the other?) and kurtosis (is the distribution too peaky or too flat?). If we need normally distributed data and this assumption is violated, then we might consider performing a transformation on the data to make it more approximately normal (i.e., look more like Figure 3.7).
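As a sketch of this check-and-transform workflow in Python (numpy and scipy; the simulated exponential sample stands in for a real right-skewed variable):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=1000)  # strongly right-skewed sample

print("skew:", stats.skew(x))          # well above 0 for this sample
print("kurtosis:", stats.kurtosis(x))  # excess kurtosis; roughly 0 if normal

# A log transformation often pulls a right-skewed variable towards normality
x_log = np.log(x)
print("skew after log transform:", stats.skew(x_log))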

The dangers of assuming normally distributed data
Data distributions are something that we impose on the world as observers and there is a risk that we want to see everything through the lens of the normal distribution. In 2007 David Viniar, CFO at Goldman Sachs at that time, reported seeing things that were 25 standard deviations away from the mean several days in a row (Matthews 2016, p.203). Given that roughly 95% of observations would be expected to fall within two standard deviations in normally distributed data (inspection of Figure 3.7 shows that around 95% of the area under the curve is accounted for between −2 and +2), this data is remarkably unlikely. With four standard deviations (a 4-sigma event) the odds are around 16,000 to 1. A 25-sigma event should occur on average every 10^135 years – a figure that is astronomically unlikely (i.e., inconceivably longer than the age of the universe). The problem is that the financial analysts were relying on the data being normally distributed, and ratings agencies relied on instruments such as collateralized debt obligations (CDOs) being normally distributed, and thus severely underestimated the risk – a contributing factor to them not seeing the looming global financial crisis (GFC) of 2007–2008 (Matthews 2016).
The normal curve is appealing due to its elegance and simplicity and its seeming ubiquity in the real world. While height is probably the most commonly used example of a ‘naturally’ occurring normal distribution, even this is slightly misleading (Matthews 2016, p.205). The probability distribution of height might be bell-shaped, but it is not a bell curve. The curve is not symmetrical, has a dented peak, and the tails do not slope off to infinity at either side. By separating out males and females, better-looking curves are produced, but they are still not perfect. Subdividing by other factors, such as ethnicity, background, and nutritional status, can improve the curve further. However, to be truly normal all these factors should be working independently, which is rather implausible. Sometimes it does not matter that our data is bell-shaped rather than normal; at other times it can make a big difference, depending on the analysis method we are using.

Exercise 3.4: Data distributions
Consider the distribution of salaries in a banking organization. If you can get access to real data, then so much the better. If not, make rough calculations based on the number of employees and typical salaries.
1. How do you think the distribution will look? Is it normally distributed? Are there a few very high salaries?
2. What is the median salary and how does it compare with the mean salary? Which is a better measure of typical (average) salary?

Outliers
An outlier is an observation that is distinctly different from the other observations. It is typically judged to be an unusually high or low value on a variable or a unique combination of values across variables: IT STANDS OUT FROM THE OTHER OBSERVATIONS. As the size of a dataset increases, the chance of finding outliers increases.
Outliers have implications for data analysis. For example, consider a sample in which 20 individuals report an income in the range of $20,000–$100,000, with an average of $45,000; the 21st has an income of $1 million. Including this observation increases the average income to more than $90,000. It’s a valid observation, but which mean is more useful as an estimate – $45,000 or $90,000? Is the outlier representative of the population? If the $1 million income is the only one in the entire population (an extreme value), then it may be appropriate to delete it for analysis and model-building purposes.
Outliers can occur due to:
Procedural error: examples include data entry errors and mistakes in coding. These should be corrected or removed in the data cleaning stage.
Extraordinary event: for example, we might be tracking daily average rainfall when a cyclone hits with rainfall levels not comparable to anything recorded in ‘normal’ weather patterns. We would consider removing this extraordinary event.
Extraordinary observation: the data scientist has no explanation for the observation and feels it is not representative of the population and hence considers removing it.
Unique combination of variables: although falling in the ordinary range on each variable in the dataset, the observation is unique in the combination of values across the variables. These observations should be kept unless specific evidence is available that the observation is not a valid member of the population.
Deleting outliers may well make the model fit better to the training data. Unfortunately, when the model comes up against unseen test data it may perform poorly, an indication that it has been overfitted.
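One common screening rule – a starting point rather than a substitute for the judgments above – flags points lying more than 1.5 interquartile ranges beyond the quartiles, the same rule used for box-plot whiskers. A minimal sketch in Python with numpy, using invented income figures (in $000s):

import numpy as np

incomes = np.array([20, 35, 38, 41, 45, 47, 52, 58, 63, 100, 1000])

q1, q3 = np.percentile(incomes, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = incomes[(incomes < low) | (incomes > high)]
kept = incomes[(incomes >= low) & (incomes <= high)]
print("Flagged outliers:", outliers)
print("Mean with vs without outliers:",
      round(incomes.mean(), 1), "vs", round(kept.mean(), 1))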

Missing data
Missing data arises when observations are missing for a column (variable). For example, we might have recorded household income for customers, but find that not all customers are willing to divulge this information. Perhaps customers in higher income brackets are less willing to disclose their income. As a result, missing values may not be random and they may bias the results and insights from our models.
Missing data can be characterized as:
Missing completely at random (MCAR) cases are those that are indistinguishable from cases with complete data and can be removed without affecting the analysis (other than reducing the sample size).
Missing at random (MAR) can result when, for example, reporting income is systematically different between males and females or by age group. Not having this data may introduce bias and detract from the model’s performance.
Missing not at random (MNAR) arises within a single variable, for example when those with high incomes are more likely to not report their income. Removing these cases may also lead to biases in models built from the dataset and thus impact on the predictive ability and usefulness of the model.
One approach to dealing with problems of missingness is to simply delete cases with missing values. Although this reduces the sample size, this may not be an issue for large datasets. However, case deletion may limit the applicability of the models we build on the reduced dataset through the introduction of bias, depending on the cause of the missingness (MCAR, MAR, MNAR).
Rather than delete cases with missing values, the data scientist will often use direct imputation to create substitute values, for example replacing the missing value with the mean or median value for that variable. More advanced approaches include model imputation using maximum likelihood and expectation maximization and the use of machine learning to impute missing values. Given that much business data is incomplete, missing value analysis and imputation is an important part of the data scientist’s skill set.
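A minimal imputation sketch using scikit-learn’s SimpleImputer, with a tiny invented income column (median imputation shown; mean and most-frequent strategies follow the same pattern):

import numpy as np
from sklearn.impute import SimpleImputer

income = np.array([[42000.0], [55000.0], [np.nan], [61000.0], [np.nan]])

imputer = SimpleImputer(strategy="median")  # replace NaN with the median
print(imputer.fit_transform(income).ravel())
# The two NaNs become 55000.0, the median of the observed values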

Data does not speak for itself
Now that big data is becoming available with potentially billions of rows (observations) and tens of thousands of columns (variables), we are seeing the phenomenon of automated data mining in which machines look for patterns in the data (e.g., searching for high correlations in a dataset with thousands of variables), as we have seen with GFT.
It is tempting to think that data can somehow speak for itself, that we can abandon theory because, as Anderson (2008) writes in ‘The End of Theory’:
There is now a better way. Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
Anderson (2008) argues that numbers can speak for themselves and that we don’t need statistical models, sampling, or scientific models of causal relationships, since the machine can discover the patterns using a complete dataset (‘N = all’).
However, the renowned Cambridge statistician David Spiegelhalter’s view of this claim, expressed in Harford (2014), is that this is ‘complete bollocks. Absolute nonsense.’ Spiegelhalter continues: ‘There are a lot of small data problems that occur in big data. They don’t disappear because you’ve got lots of the stuff. They get worse’ (Harford 2014). Building a model that fits the data very well indeed can be automated. However, as we saw with GFT, the model can fit the training data very well, but when faced with new data the implausible and spurious relationships on which the model is built manifest in poor performance. Having more variables can produce more accurate models that fit the training data well, but the models might struggle with unseen data. If you are told that with enough observations and variables the data can ‘speak for itself’ you should be actively sceptical.

Summary
Data is a fundamental part of our lives; it is how we make sense of the world in which we live. It is almost impossible to imagine a world in which we have no data. However, data is only useful if we are able to extract information and knowledge from it and then have the wisdom to make better decisions and take more effective action. If we are to rely on, and extract value from, data then it must be fit for purpose – that is, of sufficient quality.
The volume, velocity, and variety of data are all growing, and while big data opens up new opportunities, there is a risk that we will be drowned in the data deluge. There is a further risk in big data that we rely on machines to build models based on correlation scavenging in N = all datasets (the so-called ‘end of theory’). Such models can fit the data remarkably well, but can come up short when faced with unseen data.
Making sense of all this data and using it wisely to make decisions is a major challenge for the world today and one that affects and impacts on all our lives.

References
Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete, Wired, 23 June.
Farr, W. (1852). Report on the mortality of cholera in England, 1848–49. W. Clowes, London. https://archive.org/details/b21516911/page/n79.
Farr, W. (1885). Vital statistics: A memorial volume of selections from the reports and writings of William Farr. Edited for the Sanitary Institute of Great Britain by Noel A. Humphreys. Available from the Hathi Trust: https://babel.hathitrust.org/cgi/pt?id=hvd.li3s12.
Ginsberg, J., Mohebbi, M., Patel, R., Brammer, L., Smolinski, M., & Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature, 457: 1012–1014.
Harford, T. (2014). Big data: Are we making a big mistake?, FT Magazine, 28 March.
Igneous (2018). 2018 State of Unstructured Data Management. https://www.igneous.io/.
Lazer, D. & Kennedy, R. (2015). What we can learn from the epic failure of Google Flu Trends, Wired, 10 January. https://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google Flu: Traps in big data analysis. Science, 343(6176): 1203–1205.
Matthews, R. (2016). Chancing it: The laws of chance and how they can work for you. Profile Books, London.
Redman, T. C. (2008). Data driven: Profiting from our most important business asset. Harvard Business Press, Cambridge, MA.
Rizkullah, J. (2017). The big (unstructured) data problem, Forbes, 5 June. https://www.forbes.com/sites/forbestechcouncil/2017/06/05/the-big-unstructured-data-problem.
Salzberg, S. (2014). Why Google Flu is a failure, Forbes, 23 March.
Strong, D. M., Lee, Y. W., & Wang, R. Y. (1997). Data quality in context. Communications of the ACM, 40(5): 103–110.
Theerman, P. (2014). Calculating lifetimes: Life expectancy and medical progress at the turn of the century, Center for the History of Medicine and Public Health (blog post), 18 August. https://nyamcenterforhistory.org/tag/william-farr/.
Truxillo, C. (2015). Strategies and concepts for data scientists and business analysts. Course notes, SAS Institute.
 

Part II Tools and Techniques


 


4. Data Exploration
Richard Vidgen1  , Samuel N. Kirshner2  and Felix Tan3 
(1) Business School, University of New South Wales, Sydney, Australia
(2) Business School, University of New South Wales, Sydney, Australia
(3) Business School, University of New South Wales, Sydney, Australia
 
Richard Vidgen (Corresponding author)
Email: r.vidgen@unsw.edu.au
Samuel N. Kirshner
Email: s.kirshner@unsw.edu.au

Felix Tan
Email: f.tan@unsw.edu.au

Chapter Overview   This chapter covers the fundamentals of data exploration and refinement using visualizations. The chapter starts with a discussion on the importance of visualization and discusses how big data has created a need for more advanced data visualization software. We then introduce the SAS Visual Analytics Explorer as well as definitions and concepts necessary for understanding the practical use of visualization. An essential aspect of effective visualizations is awareness of the benefits and disadvantages of each type of technique. After covering the fundamentals the chapter examines a methodology for using visualization to guide data exploration, including how to spot trends, discover relationships, and establish associations among variables.
 

Learning Outcomes   After you have completed this chapter you should be able to:
• Explain why big data has led to a growth in the importance of data visualization
• Identify organizational benefits achieved from modern visualization software
• Select the best chart type to visually address descriptive analytic questions
• Enrich datasets in SAS Visual Analytics (SAS VA) by creating hierarchies, groups, and calculations
• Apply the fundamentals of data exploration using SAS VA to a dataset from your organization.

Introduction
The increased volume of structured data implies that firms are not just collecting information on more subjects (data-table rows) but collecting more information (data-table columns, or variables) for each subject. With billions of rows and anywhere from hundreds to thousands of columns of data, gaining insight through human inspection of tables is virtually impossible. Statistical analysis is also difficult given the sheer number of variables and dependencies across datasets. Instead, firms are increasingly relying on visualization software to explore and understand data. Visualizations allow analysts to get to grips with and to comprehend massive amounts of data quickly. Unlike data in tabular form, visualizations make patterns, trends, and outliers easier to recognize.

Exercise 4.1: Visual information   According to David McCandless in his TED Talk ‘The beauty of data visualization’ https://youtu.be/5Zg-C8AAIGg [18:17], information design is important because ‘we’re all visualisers now; we’re all demanding a visual aspect to our information.’
Thinking about the organization for which you currently work (this can also be one that you have worked for previously or your current educational institution), please watch the video and answer the following questions:
1. How has the demand for visual information impacted your business?
2. What is an excellent example of how your business has responded to this demand?

Fundamentals of visualization and exploration
Data exploration is critical for understanding the underlying structure of each data column. In addition, exploration of the dataset provides an opportunity to look for patterns, trends, and relationships. Univariate analysis focuses on a single variable, whereas multivariate exploratory analysis familiarizes the analyst with the dataset by producing visualizations that provide different perspectives on the data. Visualizations help users identify relevant variables as well as correlations. This allows analysts to quickly determine what factors and measures are essential for further analysis and to build hypotheses which can be validated through models and experiments. As a result, visual exploration guides predictive model development and can identify potential data that the organization should acquire or collect.
To demonstrate the importance of graphing data before analysing it and the effect of outliers on statistical properties, the statistician Francis Anscombe constructed ‘Anscombe’s quartet’ in 1973. The quartet comprises four datasets that have nearly identical simple statistical properties (mean, standard deviation, correlation), yet appear very different when graphed. Each dataset consists of 11 (x, y) points. If each of the four datasets is modelled with a line of best fit, all four models would be characterized by the same line: y = 3 + 0.5x. Moreover, the fit of the line for all four models is the same (this is measured by the R-squared value, which is 63% for each model).
The first scatter plot (Figure 4.1 (X1, Y1)) appears to be a simple linear relationship, corresponding to two correlated variables. While the second graph (Figure 4.1 (X2, Y2)) shows a clear relationship between the two variables, the relationship is not linear. In the third graph (Figure 4.1 (X3, Y3)) the distribution is linear, but with a different line, which is offset by an outlier that exerts enough influence to alter the best-fit line. Finally, the fourth graph (Figure 4.1 (X4, Y4)) shows an example in which one outlier produces a high correlation coefficient, even though the relationship between the two variables may be non-existent. If we do not visualize our data (and rely instead purely on the model) we might find that, although the fit seems reasonable, the predictions the model makes on new data will be inaccurate to the point of being useless. In addition, visualizations can help identify outliers, which in turn can help direct the exploration process. Outliers can provide a source of insight since there could be hidden factors causing the data points to be significantly different from most data points.

Figure 4.1   Anscombe’s quartet
Source: ‘Introduction to correlation with R | Anscombe’s Quartet’, http://stats.seandolinar.com/introduction-to-correlation-with-r-anscombes-quartet/
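The quartet is easy to reproduce: seaborn ships it as a sample dataset (fetched over the internet on first use). A minimal sketch confirming the near-identical summary statistics:

import seaborn as sns

df = sns.load_dataset("anscombe")  # columns: dataset, x, y

for name, g in df.groupby("dataset"):
    print(name,
          "mean_x=%.2f" % g["x"].mean(),
          "mean_y=%.2f" % g["y"].mean(),
          "r=%.3f" % g["x"].corr(g["y"]))
# All four datasets report essentially the same means and correlation (~0.82)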

Visualization software
Although big data has provided new opportunities for utilizing visualization software, it has created non-trivial challenges to the display of visuals. In fact, standard graphing tools are mostly incapable of meaningfully displaying big data. Applications do not have the capacity to plot a billion points in a tractable amount of time. Thus, trying to explore the relationship between different measures to look for insights was previously unfeasible. Furthermore, even if applications can generate the plots in a timely manner, without additional advanced visualization techniques, viewing plots of a billion points is likely to be incomprehensible. Typically, organizations handle this problem by sampling data when generating plots. However, sampling data limits the potential value creation generated by big data. In addition, big data creates challenges in selecting appropriate visuals. Without the proper representation of the data, analysts and data scientists may fail to convey their results to stakeholders clearly.
Figures 4.2 and 4.3 both contain charts displaying age, hours of television watched per week, and earnings per week taken from a time-use survey with more than 100,000 respondents. Figure 4.2 uses a scatter plot with earnings on the x-axis and hours of television watched on the y-axis, and colour to depict the age range. The figure plots less than 1% of the available data and even then it is difficult to grasp the relevant information of how wages and age impact television viewing. In Figure 4.3 the same information is displayed using a heat map, where each square corresponds to an earning level and age group, the size of each box represents the number of data points (rows) from the survey, and the gradient of colour represents the hours of television watched.

Figure 4.2   Scatter plot showing the relationship between television, earnings, and age for a small sample of the dataset

Figure 4.3   Heat map showing the relationship between television, earnings, and age for the entire dataset

To address these challenges, modern visualization solutions, such as SAS Visual Analytics and Tableau, use proprietary technology to (1) quickly generate figures from big data sources, (2) automatically select the best visualization based on the input data and the user’s objective, and (3) collapse results such that the graphs convey meaning without losing valuable information.
We use SAS Visual Analytics (SAS VA) and SAS Visual Statistics (SAS VS), which is an add-on to SAS VA, to explore and model data. SAS VA is a browser-based analytics platform that uses proprietary technology to analyse large datasets. SAS VA enables users to prepare, explore, and communicate data. The SAS VS component enables users to perform data mining and build predictive analytic models while taking advantage of SAS’s powerful in-memory data capabilities.

Introduction to the SAS Visual Analytics (SAS VA) environment
SAS VA consists of several applications that are separated based on tasks and responsibilities in the data science process. Access to the interfaces is managed by the administrator. Figure 4.4 displays a screenshot of a sample homepage in SAS VA. The homepage displays tiles of different SAS VA applications – in this case the Data Explorer App, the Report Designer App, and the Report Viewer App. The Report Designer provides users with the ability to design analytic reports and the Report Viewer allows users to view (but not edit) the content in the report. Chapters 4–8 focus on the Data Explorer application, which is where users can access the add-on VS to develop and evaluate predictive models, and Chapter 9 discusses how to use the Report Designer and Report Viewer to create and view dashboards.

Figure 4.4   The top of the SAS VA homepage window

On the homepage you can add or remove application shortcuts, customize the colours and names, and add content to the tab marked favourites, which sits under the application tiles. Clicking on the banner menu, the three horizontal lines beside ‘SAS® Home’, allows you to access your applications using a side menu. Browse opens a window that lets you access previously saved explorations, reports, and data files. The shortcut button allows you to add application shortcuts. Once a shortcut is added to the homepage, you can edit the colour or name by clicking on the three vertical dots at the top right-hand corner of the tile. The collection button allows you to create a collection, which is like a group of bookmarks that might include favourite reports, explorations, and folders. Typically, collections are restricted to administrative and advanced users and will not be covered further here.

Introduction to the Data Explorer
The dataset employee_attrition.csv will be used to introduce the Data Explorer application. To start, click on the Data Explorer tile to open the Explorer application. Once the Explorer is opened, you can select a previously saved exploration to continue working on it, or you can choose to create a new exploration. To start a new exploration click ‘Select a Data Source’. If the dataset employee_attrition (note that SAS VA automatically converts a lower-case file name into an upper-case dataset name) has already been loaded into the server, select it from the list of available datasets. If it is not available, then upload the data using the Import Data functionality on the right-hand side of the Open Data Source window. Once the data is loaded into SAS VA, it does not need to be loaded again. In addition, both the Data Explorer and the Report Designer can access a dataset once it is uploaded to the in-memory server.
The Data Explorer application is shown in Figure 4.5. The Data Explorer application has a double-layered menu bar and three column panels: the left panel corresponds to the data, the middle panel is where the visualizations appear, and the right panel allows the user to edit the properties of their visualizations.

Figure 4.5   Data Explorer window

Data panel
In the data panel, the dataset is listed (in this case employee_attrition),
and there is a drop-down menu that allows the user to add additional
datasets to the exploration. Beside the drop-down menu is an options
button. The option button allows the user to change the data source,
create data hierarchies and new data variables (for example, based on
interaction effects or calculations using the existing variables), and
show/hide variables, as shown in Figure 4.6 4.6.. There is a search bar,
which allows the user to search for variables. For example, searching
for the word “job” provides the user with the variables JobRole,
JobInvolvemen
JobInv olvement,t, JobLevel, and JobSatisifcation. This is particularly
helpful for large datasets.

Figure 4.6  Data options

The variables, which are listed below the search bar


bar,, are organized
based on the classiication of the variable. SAS classiies variables as
either a category variable (which can be discrete numeric data or
character data), a measure variable (which is either discrete or
continuous), or a geographical variable. Underneath the data categories

are the properties, which allow the user to edit a variable’s


classiication type, model type (e.g., continuous vs. discrete), and how
the data is aggregated.

Creating visualizations
There are several andaediting
ways to create
ways its properties
propert
visualization. ies is to drag a
The easiest
variable of interest into the middle panel, where it helpfully says ‘Drop
a data item here’. For example, dragging the category variable JobRole
into the middle produces a histogram of JobRoles, which is shown in
Figure 4.7
4.7..

Figure 4.7  Automatic chart 

Looking at the right panel, that is, the visualization property panel, shows that the visualization created is an ‘Automatic Chart’. SAS uses its best interpretation of the data to create (what is seemingly) the most useful chart. Since it selected a bar chart, you have the option to click on the button in the Roles tab of the property window, ‘Use a Bar Chart’ (see Figure 4.8). If you intend to create a Bar Chart, then there will be a greater selection of properties for the bar chart.

Figure 4.8   Properties of the automatic chart 

The Roles tab enables the user to add additional categories to a visualization. By clicking on the arrow icon beside JobRole (see Figure 4.9), there will be the option to add a new category variable, replace JobRole with a different category, or remove JobRole entirely (which would remove all data from the visualization). Similarly, the user can add measures, which would then change the visualization from a histogram, measuring the frequency of the number of employees in each role, to a bar graph of that measurement.

Figure 4.9   Role tab options

For example, let’s say we wanted to know the average age for each job role. By dragging the Age variable onto the visualization or into the measure box, or by clicking on the arrow by the box underneath measures and selecting the variable Age, we can create a bar chart displaying the age for each role.
Figure 4.10 shows the resulting bar chart. Observe that the Age ranges from 0 to 12,500 and that the label on the y-axis says Age (Sum). This implies that each column – for example, Sales Executive – is displaying the cumulative age of the sales executives. This column is the highest because it is the role with the highest frequency (refer to Figure 4.7). To find the average age of each job role, click on the ‘Age (Sum)’ label, go to Aggregation, and change the aggregation to Average (see Figure 4.11).

Figure 4.10   Bar chart aggregated by the sum of each employee’s age
 

Figure 4.11   Change the aggregation on a bar chart 

Changing the aggregation to the average produces the bar chart in Figure 4.12, which shows that Managers have the highest average age.

Figure 4.12   Bar chart aggregated by the average age of each employee
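The sum-versus-average distinction behind Figures 4.10 and 4.12 arises in any analytics tool. For instance, a minimal pandas sketch (a few invented rows standing in for the employee_attrition data):

import pandas as pd

df = pd.DataFrame({"JobRole": ["Sales Executive", "Sales Executive", "Manager"],
                   "Age": [34, 29, 52]})

print(df.groupby("JobRole")["Age"].sum())   # cumulative age, as in Figure 4.10
print(df.groupby("JobRole")["Age"].mean())  # average age, as in Figure 4.12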

Next, add the variable gender to the visualization. Figure 4.13 shows that the bar graph is divided into two sections, a bar graph with the average ages for females across job roles on the left and a bar graph with average male ages by job role on the right. By changing the figure from Automatic to a Bar Chart on the Roles tab, the user is provided with more ways to organize the data on the chart. For example, by dragging gender from Lattice columns to Group, as in Figure 4.14, the visualization will change such that Males and Females have different coloured bars and are side by side for each job role, as in Figure 4.15.

Figure 4.13   Bar chart of average age across job roles and gender
 

Figure 4.14   How to change properties of a graph so gender is grouped

Figure 4.15   Better bar chart of average age across job roles and gender
The interface for creating visualizations is intuitive and you should try to create different visualizations either using the automatic chart process or by using the icons in the second row of the menu bar.

Data and data reinement in SAS Visual


 Analytics (SAS
(SAS VA)
VA)
Data in SAS VA
VA is characterized as either category data or measure data.
Category
someone data
is an consists
omnivore,of binary or nominal
vegetarian, veg
vegan, variables
an, or (e.g., whether
ppescatarian),
escatarian), whereas
measure data consists of continuous data. Discrete numeric data, such
as interval data or ordinal variables (e.g., year), can be assigned as
either categorical or discrete measure data. Ordinal variables can be
made into categories, effectively making the variable nominal, since
categorical data does not have a natural order
order.. With that said, there are
two distinctive types of data categories that have ordering: geography
and time–date data. Additional properties are added to these types of 
categories, for enhanced visualizations, which are described in further
detail in the ‘Geo map’ section later in the chapter
chapter..
 

To manage the properties of data, you can use the options button beside the data source and select data properties, or highlight an individual data item and go to the property window below the variables. Figure 4.16 shows an example of the variables in a dataset called countries. We see that the variables Class, Continent, Region, and Year are categorical. When the data was loaded, the variable Year was interpreted as a date, which is evident by the icon (a calendar and clock) beside the variable. There are three continuous variables, GDP per capita, socioeconomic status, and years of education, which are listed under measure. Finally, there is also a geographical variable called Country. When the dataset was initially loaded into VA, the variable Country was a standard categorical variable. By right-clicking on the variable, and selecting geography, you can change the properties of the variable such that the data represents countries or subdivisions, such as provinces, states, and zip codes. After doing so, the variable gets listed as a geographic variable (notice the globe icon).

Figure 4.16   Data pane for the dataset country

The data pane enables the user to create enriched subsets of uploaded data files in addition to presenting an overview of the data variables. This includes creating new variables through groupings, hierarchies, and defining new data variables using calculations.
There are important trade-offs between refinement prior to making the data available for analysis and allowing the analysts to create data subsets (from a singular uploaded data file). If software such as Excel is utilized for data preparation, or the data managers are not trained computer/data scientists, then refinement should be done after the (cleaned) raw data file is uploaded to SAS VA. The Data Explorer interface allows for more natural augmentation and transformation of the data (compared to using Excel), which can simplify the data manager’s and the analyst’s experiences. The options and capabilities of the data pane include showing and hiding variables, looking at hidden data columns, filtering data columns, and creating new columns using calculations, hierarchies for categorical data, and groups to bin numerical or categorical data. To create a hierarchy, click on the options button and then select New Hierarchy. A new window will open allowing you to name and select variables for the hierarchy. Observe that hierarchies only apply to categorical variables. With the Country dataset we can create a geographical hierarchy, with continents at the top of the hierarchy, followed by regions, and then countries, as shown in Figure 4.17.
countries, as show in Figure 4.17
4.17..

Figure 4.17   Creating a hierarchy for the dataset country

A custom category will create categories from categorical or measure data. For category data, you can create new groups consisting of the different observations. For example, based on the category Countries, you could create a new variable called Language, and have categories such as English, French, Spanish, and so on, and add the items from the variable Countries to the new category, as shown in Figure 4.18. With measure data, grouping can be used to create bins for the variable.
Figure 4.18   Creating a custom category for the dataset country

Calculated items enable the user to build a new variable using the existing variables and logical functions (e.g., equals, greater or equal to, if and else statements) and numeric functions (e.g., absolute values, log, power, root, round). The new variable will typically be either text or numeric, which can be set by changing the result type at the top of the new calculated item window (the default is numeric). Users can build new data variables by dragging the necessary logical and numeric functions and variables into the middle area of the window. For example, to make a custom variable which indicates whether a nation has high GDP and education or not, we use an If statement where if Years of Education is higher than a threshold value AND (which requires using an AND statement) GDP per Capita is higher than a threshold value, then the variable returns ‘High’, otherwise the variable returns ‘Low’ (Figure 4.19).

Figure 4.19   Creating a new variable for the dataset country
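The equivalent calculated column is a one-liner in most data tools. A minimal sketch in pandas with numpy (the rows and threshold values are invented for illustration):

import numpy as np
import pandas as pd

country = pd.DataFrame({"GDP_per_capita": [55000, 4000, 31000],
                        "Years_of_Education": [13.2, 6.1, 11.8]})

# If education AND GDP both exceed their thresholds, return 'High', else 'Low'
country["HighGDPEdu"] = np.where(
    (country["Years_of_Education"] > 10) & (country["GDP_per_capita"] > 20000),
    "High", "Low")
print(country)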

To see an overview of each measure variable, select measure details from the options list. It will open a window that provides information on the data distribution, skewness, kurtosis, the number of missing data points, and other aspects of the data (Figure 4.20). Inspection of the measure details provides an idea of the quality of the data for each variable. In addition, to the right of the information, there is a visual representation of the distribution of the data.
 

Figure 4.20   Viewing the properties of measure data

Reinement should be used immediately after importing data for


modiications applicable to the entire data science
s cience team. Reinement 
can also be used to streamline an individual analyst’s worklow
worklow.. For
example, when investigating correlations or creating predictions for a
speciied target variable, the user may want to remove irrelevan
irrelevant 

columns. This prevents the potential for spurious results and can
reduce computation time for large datasets. In addition,
ad dition, if multiple
users are working on the same dataset,
dataset , reinement can help coordinate
and control the analysis. For example,
example, the dataset can be segmented
based on the role of each analyst. Data managers can also reine data to
pre-ilter and aggregate sensitive data before allowing analysts access,
thus promoting the responsible and ethical use of data. As you will see
later in the ‘Exploration in SAS V
VA:
A: An illustration’ section later in the
chapter,, data reinement is also a part of the exploration process.
chapter
Exploring the data may provide insight into the creation of new data
columns or the basis for the creation of groupings and hierarchies.

Exercise 4.2: Data reinement 


Using the employee_attrition dataset, complete the following tasks:
1.
Upload the dataset (if you have not alread
alreadyy done so).  
2.
Examine the measure variables (columns) and their data types.  
Which variables have somewhat of a normal distribution?
3.
Use the data binning capability to create a logical grouping for  
the ield years at the company.
Step 1. Create a new custom category for the variable
‘YearsAtCompany’
‘YearsAtCompany’ and enter as the name for tthe he variable
YearsAtCompanyBinned.
Step 2. For label1, enter the name ‘0–05’
‘0–05’,, then push the plus
button and make the range of value go from 0 to 5.
Step 3. Add a new label, change the name to ‘06–10’ and add the
values 6 to 10. By default, bar graphs are often ordered based on
frequency. However, bar graphs can be ordered

alphabetically/numerically. The label ‘06–10’ will ensure that the


alphabetically/numerically
data for the category occurs before ‘11–15’
‘11–15’,, when sorting the data
numerically.
Step 4. Add two more bins for ‘11–15’ and ‘16 to 20’
20’..
Step 5.
selected andMake sureother
rename the option ‘group
as ‘over 20’. the remaining values’ is
Step 6. Plot an automatic chart for the custom category
YearsAtCompanyBinned.
4.
Create a hierarchy to enable drilling down from Department to  
the job role.
Step 1. Select create a new hierarchy.
Step 2. Make the irst
irst level ‘Department’
‘Department’..
Step 3. Make the second level ‘JobRole’.
Step 4. Name the hierarchy ‘Employment hierarchy’.

Data visualizations and exploration in SAS Visual Analytics
Visualizations in SAS VA
Visualizations are representations of data using charts, plots, maps, and tables. Traditionally, visualizations were utilized to communicate trends, findings, and information internally within an organization and externally to stakeholders and customers. However, the volume of data collected by organizations today has created new opportunities to use visualization to explore data to uncover novel business insights. Common visualization forms for exploring multivariate relationships include bar charts, line charts, scatter plots, bubble charts, pie charts, tree maps, and heat maps. It is important to understand when and how to use each of these forms.
Bar chart
Bar charts compare quantities across a range of categorical data or continuous data that has been segmented into distinct groups. The height of the bar corresponds to values for each category/group. Values are typically total values within the group, averages across the units within each group, or minimum or maximum values of the group. Bar charts are useful for categories or groups with low cardinality. The higher the cardinality, the more bars there will be on the graph, making it difficult to compare the values across categories.
In SAS VA, to construct a bar chart, you first create a bar chart visualization, which will create a blank visualization and the corresponding property window. The only mandatory input is the category for the x-axis. Once the category is set, by either dragging a variable from the data pane to the Roles tab in the visualization property window or selecting the arrow beside the input box under category, SAS VA will create a visualization showing the frequency of data observations for each input. For example, the insurance dataset has three categories: region (northeast, southeast, northwest, and southwest), sex (female and male), and smoker (yes and no). If we select region to be the category for the bar chart, then it will plot the frequency of data observations in each region, ordering the graph from the highest frequency (southeast) to the lowest (northeast). If we wanted to see the breakdown of charges by region, then we add the measure variable charges to the appropriate role in the visualization properties tab.
Figure 4.21 shows that the southeast region has the most charges. However, this is possibly driven by the fact that this region had the highest number of observations. The aggregation method of the variable charges needs to be changed from sum to average to see how the average charge compares by region. This can be done by either right-clicking on the label charges on the y-axis and changing the aggregation, or going to the variable in the data pane and changing the aggregation in the property window. The difference between these two methods is that the former will only change the aggregation for the individual visualization, whereas the latter will change the default aggregation for future visualizations.
 

Figure 4.21   Bar chart in SAS VA

Colour can be useful to add an additional dimension of categorical data to the chart. Again, colour is only useful if the category has low cardinality. For example, the average insurance charge could be dependent on gender, which has low cardinality, since gender is typically either female or male. To add colour based on gender, add the category variable Sex to the role group (Figure 4.22).

Figure 4.22   Bar chart with grouping in SAS VA

Histograms
Histograms are a particular type of bar chart that focus on a single variable, plotting the frequency of discrete intervals (i.e., bins) of a measure variable. If the variable is numerical, then, to create a histogram, numerical values must be binned together. When numerical data is binned, the data within the interval of an individual bin is treated as having the same value, essentially creating a category for a range of numerical values. Histograms provide valuable information on the central tendency and distribution of values (Figure 4.23).

Figure 4.23   Histogram in SAS VA

In SAS VA histograms are separated from bar charts because the input to create the variable is a measure variable, whereas in bar charts the required input is a category variable. Thus, for categorical variables, histograms are created using a bar chart, as they can plot the frequency of data observations for each category. It is important to note that, while the histogram only works for measure variables, a histogram is a bar chart and so the x-axis is technically a categorical variable that is ordinal. SAS VA automatically creates bins for measure variables to create the histogram, but bins are not a continuous measure, despite the fact that they have a natural order.
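Binning a measure variable is the key step behind a histogram. In pandas this is typically done with pd.cut; a minimal sketch with invented ages:

import pandas as pd

ages = pd.Series([23, 31, 38, 44, 52, 58, 61])

# Treat each interval of ages as a single category, then count rows per bin
bins = pd.cut(ages, bins=[20, 35, 50, 65])
print(bins.value_counts().sort_index())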
Line chart
Line charts are more appropriate than bar charts for comparing quantities across ordinal categories with high cardinality or groups of data that can be represented by continuous numbers. Line charts are good at showing trends in data or describing how the quantities (on the y-axis) change with increases in the value of the category (on the x-axis). Often the category on the x-axis is time, and charts contain multiple lines (differentiated by colour, texture, line width, or line style) to compare how the values across multiple categories change over time. Thus, line graphs can be used to show the relationship between at least three factors. Like the bar chart, the role group assigns a colour to add a further data dimension (Figure 4.24).

Figure 4.24   Line chart in SAS VA

Scatter plot
In a line chart, each value on the x-axis has at most one corresponding point on the y-axis. In a scatter plot, values are plotted based on the Cartesian coordinates of two variables. The x-axis in scatter plots can also be time or an independent variable thought to cause a response in the measure on the y-axis. The x-axis can also be a location, in which case, if the y-axis is a location, the data can be scattered on a geo map. Although scatter plots can have multiple y values for a single x value, trends are still visible from scatter plots. In fact, line charts are often used to summarize scatter plot data by displaying the best-fit line from scatter plot data. Like the line chart, adding colour to the plot can increase the dimensionality to three factors (Figure 4.25).

Figure 4.25   Scatter chart in SAS VA

Bubble chart
Bubble charts are an extension of scatter plots. In a standard scatter plot, each data point, which consists of an x and a y value, has a uniform size on the graph. In a bubble plot, a third factor is included, where the magnitude of the quantity corresponds to the radius of the point. Similar to how higher cardinality of the x-axis makes a line chart more straightforward to understand than a bar chart, the higher cardinality of a third category makes bubble charts easier to comprehend than scatter plots representing different groups or categories with different colours. The additional benefit of the bubble chart is that colour can be utilized to incorporate a fourth factor with low cardinality. If colour is a categorical variable, then differentiated colours will be used for the visualization. If colour is a measure variable, then it will use a colour spectrum. SAS VA enables animation for bubble charts, showing how the size and position of each bubble changes over time (Figure 4.26).
Figure 4.26   Bubble charts in SAS VA

Pie chart
A pie chart is a circular graph where the proportion of the pie’s slices corresponds to the quantity value of an item in a category. When the slices of the pie have similar proportions, value interpretation is difficult. As a result, bar charts are often a better visualization tool than pie charts for comparing quantities in a category.
Pie charts can be effective for illustrating a point when the number of slices is low (2–6), and a few of the slices (1–3) are dominant. If there are lots of categories with low quantities, then they can be grouped together as an ‘other’ category. Pie charts are also useful when the comparisons are ratios since the representation of the data is built into the chart.
Figure 4.27 shows two different pie charts, with information on the number of people in a dataset who are single, married, or divorced. The second figure has additional information about how the breakdown of marital status varies across different US states. Although the figure has additional information, it is difficult to ascertain even the most basic information on whether married people outweigh those who are single and divorced. Because they are easy to abuse, pie charts should be approached with caution.

Figure 4.27  Pie charts


Figure 4.28 shows the same data displayed in a bar chart. The bar chart version of the pie chart without the breakdown by state is easy to interpret, but it does not provide the viewer with the natural interpretation of proportions (Figure 4.27). The bar chart with the breakdown by state is more effective than the pie chart because the viewer can quickly see that the light green bars (married) are higher than the medium green bars (single), which are in turn higher than the heights of the dark green bars (divorced) in each state. Although the proportions are not clear from the bar graph, this information was also unclear in the initial pie chart, making the bar chart more effective when there is more information to present.

Figure 4.28   Bar charts displaying the same information as the pie charts in Figure 4.27

Box plot
Box plots group data by quartiles, where the top of the box corresponds to the 75th percentile (top quartile) and the bottom of the box corresponds to the value of the 25th percentile (lowest quartile). A line at the 50th percentile divides the box to separate the middle two quartiles. Two vertical lines, known as whiskers, extend from the top and bottom of the box to indicate the maximum and minimum expected values.
Box plots are useful for examining outliers since they appear outside of the whiskers. Comparing box plots across categories can allow analysts to quickly determine when extreme values are potentially meaningful and not just random noise. A firm can use this information to explore potential factors that lead to customers having extreme lifetime values. Figure 4.29 shows that there are several outlier insurance charges for non-smokers, which are higher than the 75th percentile. The asymmetry of the outliers helps explain why the mean (the diamond marker) is above the median. For the smokers, the mean is below the median, since the 25th–50th quartile has a greater range of values in comparison to the 50th–75th quartile. A comparison of the box plots shows that smokers have a significantly wider range of insurance charges and that the charges are higher on average.

Figure 4.29   Box plot showing outliers
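A comparable box plot can be sketched in Python with matplotlib; the two groups of charges below are simulated stand-ins, not the book’s insurance data:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Simulated right-skewed charges for two customer groups (invented data)
non_smokers = rng.lognormal(mean=9.0, sigma=0.4, size=200)
smokers = rng.lognormal(mean=10.0, sigma=0.3, size=50)

fig, ax = plt.subplots()
# showmeans=True adds a mean marker, like the diamond in Figure 4.29;
# points beyond the whiskers are drawn individually as outliers
ax.boxplot([non_smokers, smokers], labels=["Non-smoker", "Smoker"],
           showmeans=True)
ax.set_ylabel("Charges")
plt.show()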


Tree map
Hierarchies are often referred to as trees. Tree maps display
hierarchical data through a collection of nested rectangles. Each
category is a branch and is shaped as a rectangle. The size of the
rectangle represents the sum of the entire category. Each branch
consists of rectangles, which are individual leaves (although branches
can also have sub-branches, which also contain leaves). The leaf nodes
also represent categories and the size of each node corresponds to a
numeric quantity.
Leaves often have different colours to distinguish
their value from neighbouring leaves. The organized interaction
between size and colour in tree maps promotes the detection of
patterns. For example, it is easy to see if a branch has significantly
different dominant categories, which can direct exploration and
analysis. Tree maps are popular in big data applications because they
use space effectively to display many leaves in a single visual. Figure
4.30 is an example of a tree map showing the breakdown of customer
lifetime value by education level and marital status. The size of each
leaf represents the number of clients based on their education level and
marital status. Double-clicking on one of the levels of education
branches provides a breakdown of that branch by the individual leaves
(marital status).

Figure 4.30  Tree map

Heat map
Heat maps represent data in a matrix or on a geospatial map. Like tree
maps, heat maps use colour to express different values for the
combination of categories (or quantitative factors) given by the matrix
position. In addition, as with tree maps, heat maps are useful for
displaying large datasets and for identifying outliers. Figure 4.31
shows an example of the same data as the previous tree map, that is,
customer lifetime value by education (x-axis) and marital status (y-
axis).

Figure 4.31   Heat map



The size and position of each box in a tree map is determined by the
data, whereas the size of a box in a heat map is fixed by the matrix
coordinates (or spatial location). Thus, a tree map is more useful for
hierarchical data and showing part-to-whole relationships. A heat map
is more useful for displaying data across multiple (non-hierarchical)
categories.

Geo map
SAS VA creates unique properties for geographical data so that it can be
plotted on a map with a variable of interest. A geo map plots the data of
interest on a physical map. The map can plot up to two additional
measure variable dimensions, using bubble size and bubble colour. Geo
maps use two dimensions to plot categories: longitude and latitude.
This makes geo maps less effective when there is only a limited
number of regions being plotted (fewer than 5–6). However, if there are
many regions being plotted, then geo maps utilize location to provide
context to the categories.

Figure 4.32 uses the country dataset to plot GDP per capita (bubble
size) and years of education (colour). The geo maps in SAS VA are
interactive, so hovering over a bubble provides the specific values for
that country.

Figure 4.32  Geo map


Correlation matrix
A correlation matrix uses colour to show how strong the linear
relationship is between different measure variables. Although
correlations may reveal predictive relationships, a strong relationship
does not mean that there is necessarily a causal relationship between
the two variables. Correlation plots are useful because they can provide
a general overview of the many different data categories, which can
direct further exploration. For example, Figure 4.33, which shows a
correlation matrix of data from a time-use survey, clearly shows that
running has a weak relationship with the other variables. A strong
relationship can be positive or negative. For example, the measure
Weekly Hours Worked is negatively correlated with socializing and
relaxing and positively correlated with weekly earnings.

Figure 4.33  Correlation matrix
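A correlation matrix is also straightforward to compute outside SAS VA. The pandas sketch below invents a small time-use table purely for illustration; the column names are assumptions, not the survey’s actual fields:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Invented weekly time-use measures
time_use = pd.DataFrame({
    "hours_worked": rng.normal(38, 8, 300),
    "socializing": rng.normal(10, 3, 300),
    "running": rng.normal(2, 1, 300),
})
time_use["weekly_earnings"] = 25 * time_use["hours_worked"] + rng.normal(0, 100, 300)

corr = time_use.corr()              # pairwise Pearson correlations
print(corr.round(2))

# Colour-coded matrix, analogous to Figure 4.33
fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="RdBu")
ax.set_xticks(range(len(corr)), corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im, label="correlation")
plt.tight_layout()
plt.show()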

Exercise 4.3: Data visualization
Using the Country dataset complete the following tasks:
1.
Create a bar chart for GDP by continent.
a.
What continent has the highest GDP?
b.
How does the graph change when the aggregation of GDP is
changed to average?
c.
The dataset covers the years 1900–2010. Filter the data
such that the visualization is only covering years 1970–
2010, using the Filters tab in the visualization properties
tab. How do the results change?
2.
Create a bubble plot with years of education on the x-axis, GDP
on the y-axis, and socioeconomic status as the bubble size, with
each being aggregated as the average.
a.
How does the visualization change when the figure is
grouped by continent and coloured by continent?
3.
Using the bubble plot from the previous question, group the data
by country, and add year as an animation.
a.
What happens when you ‘play’ the figure?
b.
How many data dimensions does the visualization capture?
c.
Reset the figure by dragging the year to 1990. The three
countries with the highest educational average are
Switzerland, Canada, and the USA. Highlighting the three
countries on the visualization reveals their path over time.
Describe how the three countries differ over time in terms of
GDP per capita and years of education.

Exploration in SAS VA – An illustration
Having a good understanding of data types and categories is important
when exploring data. Viewing this information may give insight into
potential data collection issues or into columns that are unlikely to
provide meaning if included in the analysis. Although it is tempting to
dive straight into predictive modelling, visually checking the data is
critical to ensuring that the produced insights are not based on
incomplete or inaccurate data (e.g., as with the data in Anscombe’s
quartet in Figure 4.1). The exploration can also help identify
hypotheses that may require the collection of additional data. Going
straight to prediction without exploring the data means that the
explanatory variables are limited by the initial dataset.
To help illustrate how to explore data in SAS VA, we focus on the
insurance dataset with the objective of having a better understanding
of insurance charges. We first look at the categorical data. For the
categorical data, it is important to understand what proportion of our
population is male or female, and smoker or non-smoker. We are
interested in the percentage of charges that arise from smokers. After
creating a histogram of the frequency of smokers, we use the properties
tab to change the frequency from Count to Percent. We see that 80% of
our customers are non-smokers (Figure 4.34). This is important since it
affects how we aggregate our measure data. Similar bar charts showing
the proportion of females and males indicate that there is roughly an
equal number of observations for each gender. Bar charts of the regions
show that there are more charges in the southeast, while the other
three regions have approximately the same number of charges.

Figure 4.34   Bar chart displaying the proportion of customers who are smokers

Checking the properties of your data is critical, particularly to
determine if SAS VA is modelling the data as continuous or discrete. For
example, age and children are likely initialized as continuous; however,
the visualizations will be more useful if they are modelled as discrete. It
is also essential to set the aggregation correctly. Typically, SAS VA
assumes that data should be aggregated as a sum. This is useful for
looking at totals in terms of profits and revenues for different products,
customer types, or areas. However, if you are looking for expected
impacts of different measures, then it is better to set aggregation levels
to average. Recalling that 80% of the population is non-smokers, if
charges are aggregated as sums, then non-smokers are likely to have
higher charges overall (there are more of them), even though on
average smokers will have higher charges. Thus, the discrepancy
between the number of smokers and non-smokers implies that it is best
to use averages for each of the measure variables.
The most fundamental visualization for understanding the structure
of a variable is a histogram. Histograms should be used systematically
to individually look at each variable. Ideally, data is nearly normally
distributed, since this will help meet the requirements of linear
regression. If the data is skewed, then the user may consider
transforming the data. For data that is skewed to the right, create a new
variable by taking the square root (or logarithm) of the original
variable to compress the long right tail. If the data distribution is
skewed to the left, then take the square of the variable to help
normalize it. If the user wants to categorize numerical values, then bin
the data by creating groups.
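Each of these transformations takes one line in Python; the DataFrame below is an invented stand-in for whatever data is being explored:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Invented columns: one right-skewed, one left-skewed
df = pd.DataFrame({"charges": rng.lognormal(9, 0.5, 1000),
                   "score": 100 - 60 * rng.beta(2, 5, 1000)})

# Right skew: compress the long right tail
df["charges_sqrt"] = np.sqrt(df["charges"])      # or np.log(df["charges"])

# Left skew: stretch the values by squaring
df["score_sq"] = df["score"] ** 2

# Categorizing a numeric variable by binning it into groups
df["charges_band"] = pd.cut(df["charges"], bins=4,
                            labels=["low", "low-mid", "high-mid", "high"])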
We can see by the distribution that there is an almost even number
of observations across ages, except for the ages 18 and 19 (Figure 4.35),
which have more than double the observations of the other age groups.
Without having more information on the dataset, we do not know if the
company’s primary customers are young people or if there is an
alternative explanation. For example, this dataset may have been
collected over a two-year period and there was a lower bound on the
age variable captured, that is, 18 or younger. Thus, anyone younger
than 18 who had their data collected will be classified as 18. If these
people’s data were collected in year 1, then in year 2 all the customers
previously classified as 18 will be listed as 19. Regardless, if we are
trying to predict charges (the amount the healthcare service charges
the customer), where age is likely to be an important factor, looking at
the histogram reveals that we may want to consider filtering out
observations where age is less than 20.

Figure 4.35   Histogram of the age variable
Filtering data can be done either per individual visualization or
across all visualizations. To create a filter, use the visualization’s
property pane and click on the filter tab. There are two places to add
filters. The top section will filter data across all visualizations. The
bottom part of the filter tab is specifically for the individual
visualization that the property pane corresponds to. To add a variable
as a filter, drag the variable from the data pane to either the top or
bottom section of the filter tab, and then unselect the observations
that you want to filter out. In this case, we are filtering out the data of
people under 20 across all visualizations (Figure 4.36).

Figure 4.36   Setting a filter

Age is likely to have a significant impact on medical charges. To
account for the possible non-linear relationship between age and
charges (i.e., charges go up non-linearly as customers age), we can
introduce a new variable, age2, which is age squared. We create the age
squared variable using create a new calculated item. Specifically, we
use the numeric function ‘Power’ and then drag the variable age into
the left box of the equation and input a 2 into the second box (Figure
4.37).

Figure 4.37   Creating a new variable, age2
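The filter from Figure 4.36 and the calculated item from Figure 4.37 have compact pandas equivalents; the file name below is a placeholder for wherever the insurance data is stored:

import pandas as pd

insurance = pd.read_csv("insurance.csv")       # placeholder file name

# Filter out observations where age is less than 20 (Figure 4.36)
insurance = insurance[insurance["age"] >= 20]

# New calculated item: age squared, the 'Power' function in Figure 4.37
insurance["age2"] = insurance["age"] ** 2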

Next, we look at the distribution of BMI. Dragging the variable BMI
onto the canvas creates a histogram. To get better detail of the shape,
we change the number of bins to 15. The histogram illustrates that
the distribution of BMI is skewed (Figure 4.38).
 

Figure 4.38   Histogram of BMI

By using the measure details, we can see that the average BMI is
30.66. Given that a healthy BMI is 25, and the average is approximately
30, we could be interested in how a BMI score of greater than 30 relates
to charges. In this case, we would want to create a binary value called
BMI30, based on whether the BMI is less than or equal to 30 or greater
than 30. We create a new custom category based on BMI, assigning
category 1 to be ‘below 30’ with the interval range from 0 to 29.99 and
category 2 to be ‘above 30’ with the interval range from 30 to 55.
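A sketch of the equivalent custom category in pandas, assuming the insurance DataFrame from the earlier sketch and a bmi column:

import numpy as np

# Binary category: 'below 30' for BMI under 30, otherwise 'above 30'
# (assumes the `insurance` DataFrame from the previous sketch)
insurance["BMI30"] = np.where(insurance["bmi"] < 30, "below 30", "above 30")
print(insurance["BMI30"].value_counts())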
After performing a univariate analysis and creating two new
variables (and saving the initial exploration), we begin the next step by
looking at the relationship between charges and other variables (i.e.,
bivariate and trivariate analysis). The multivariate visualization can
reveal outliers. When creating prediction models, knowledge of outliers
enables you to refine the dataset to exclude these values or, alternatively,
to create prediction models to try and identify what drives the outliers.
Colours and size in bubble charts can help provide insight into factors
that may explain the drivers of outliers.
To carry out the multivariate analysis on the insurance dataset, we
start by looking at the values of charges across region and sex. We see
that in all four regions males have more charges than females and that
in the southern regions the difference in average charges is significantly
greater than in the northern regions (Figure 4.39). We then change the
variable sex to smoker and see that smoking has a strong relation to
the amount of charges (Figure 4.40). For non-smokers, the average
charges appear to be consistent across regions. However, the average
charges for smokers are higher in the south.

Figure 4.39   Bar chart visualization showing charges by region and sex
 

Figure 4.40   Bar chart visualization showing average charges by region and smoker

The bar charts show the relationship between the variables charges,
region, and smoker. In the properties window the user can add more
category variables using lattices. Lattices create columns of the
visualization for each item in the category. For example, by grouping the
variables by BMI30 we can use smoker (yes or no) as a lattice category
to clearly see how BMI30 impacts charges for each region for smokers
and non-smokers. This is shown in Figure 4.41. From the figure, we can
quickly see the potential impact of smokers with a high BMI on charges.
The current visualization tells us that the charges for non-smokers are
not influenced by a high BMI to the extent that charges for smokers are.
In addition, the region appears to have little influence on our findings.

Figure 4.41   Bar chart visualization showing average charges by region, whether the charge is
from a smoker and whether BMI is over or under 30

In these visualizations, the x-axis is being used to show geography.
However, since region appears to have little influence on the pattern of
charges, we may want to use the x-axis to display more useful
information, such as age. Since age is a measure variable, we cannot use
a bar chart, and, instead, we need to make a new visualization. A line
chart could be useful for seeing general trends, but, again, line charts
require categories for the x-axis. To make a line chart with age, we right
click on the variable age in the data pane and select duplicate data item.
Calling the new data item ‘age (discrete)’ and changing the data
classification to a category in the property window below the data
pane, we can add age to the x-axis of a line chart visualization. Next, we
add charges as the y-axis, BMI30 as groups, and smoker as a lattice
column, producing the visualization in Figure 4.42.

Figure 4.42   Line chart visualization showing average charges by age, whether the charge was
made by a smoker, and whether BMI is over or under 30

The igure shows that charges increase with age, but that BMI does
not matter for non-smokers. On the other hand, smokers with low BMI
have greater
greater charges on average than older non-smokers, and this
impact is obviously exacerbated if the smokers
s mokers also have a BMI over 30.
If we wanted
single to create
graph, we thecreate
need to sameaigure, but withUsing
new variable. the four linesif on a
nested
statements (see Figure 4.43),
4.43), we can create a variable that has four
categories: non-smoker with BMI under 30, smoker with BMI under 30,
non-smoker with BMI over 30, and smoker
s moker with BMI over 30. We can
( Figure 4.44
then group the line graph with age and charges together (Figure 4.44).
).

Figure 4.43   Nested if statements
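For readers following along in Python, the nested if logic of Figure 4.43 maps naturally onto numpy.select; the 'yes'/'no' coding of the smoker column is an assumption about the dataset:

import numpy as np

# Assumes the `insurance` DataFrame from the earlier sketches
smoker = insurance["smoker"] == "yes"
high_bmi = insurance["bmi"] >= 30

# Four categories from two binary conditions (the nested if equivalent)
insurance["smoker_bmi30"] = np.select(
    [~smoker & ~high_bmi, smoker & ~high_bmi,
     ~smoker & high_bmi, smoker & high_bmi],
    ["non-smoker, BMI under 30", "smoker, BMI under 30",
     "non-smoker, BMI over 30", "smoker, BMI over 30"])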

Figure 4.44   BMI and smoker grouped by age
 

Line graphs are useful for generating trends; however, average
values can mask heterogeneity in data. For example, the high charges
for smokers could be due to half the smokers having very high charges
and half of the smokers having low charges. To clearly see if there is any
masked heterogeneity, we construct a bubble chart. Rather than use
age, we use the age-squared variable, age2. This variable has the effect
of stretching out the data points for people of older age. For the bubble
chart, we make the bubble sizes BMI and then colour the data by the
combined BMI30 and smoker categories (Figure 4.45).

Figure 4.45   Bubble chart of BMI and smoker

From the bubble chart, we see that smokers have consistently high
charges and that non-smokers can have greater variability, seeing as
there are two different increasing clusters for both non-smoking
groups. We complete our exploration by separating the bubble plot by
gender using the column lattice functionality and changing the bubble
size to the number of children. The plot reveals that there is no
systematic difference in the previous observations between genders or
among the number of children (Figure 4.46).

Figure 4.46   Bubble chart grouped by male and female

Exercise 4.4: Data Exploration
Working with the dataset employee_attrition, use the SAS VA Data
Explorer interface to address the following questions.
1.
What is the value in knowing which factors lead to employee
attrition?
2.
How do gender and marital status relate to employee attrition?
3.
How does attrition vary across job roles?
4.
How does attrition vary across business travel?
5.
Based on the categorical data, what hypotheses can you make
regarding causes of attrition?

Summary
Visualizations enable comprehension of large volumes of data.
Visualization is utilized to (1) assess data quality, (2) address
descriptive questions to understand current and past performance, and
(3) look for trends to build hypotheses for further analysis and inform
future data collection. Visualizations play a crucial role in data
refinement and data exploration and are a crucial step in the predictive
modelling process.
Current visualization software can quickly generate figures from big
data and automatically convey meaning by selecting the best
visualizations and collapsing data. In SAS VA, analysis can be conducted
by simply dragging variables on to the canvas or by selecting a
visualization and using the property window to input the relevant
variables, making data exploration accessible to a wide range of
employees, not just data scientists.

Further reading
Fayyad, U. M., Wierse, A. & Grinstein, G. G. (Eds.). (2002). Information visualization in data mining
and knowledge discovery. Morgan Kaufmann, San Francisco, CA.
Few, S. (2009). Now you see it: Simple visualization techniques for quantitative analysis. Analytics
Press.
Healy, K. (2018). Data visualization: A practical introduction. Princeton University Press,
Princeton, NJ.
Keim, D. A. (2002). Information visualization and visual data mining. IEEE Transactions on
Visualization & Computer Graphics, (1): 1–8.
Keller, P. R., Keller, M. M., Markel, S., Mallinckrodt, A. J., & McKay, S. (1994). Visual cues: Practical
data visualization. Computers in Physics, 8(3): 297–298.
Kirk, A. (2012). Data visualization: A successful design process. Packt Publishing, Birmingham,
United Kingdom.
Sahay, A. (2016). Data visualization, volume I: Recent trends and applications using conventional
and big data. Business Expert Press, New York.
Sahay, A. (2017). Data visualization, vol. II: Uncovering the hidden pattern in data using basic and
new quality tools. Business Expert Press, New York.
Samuels, M. & Samuels, N. (1975). Seeing with the mind’s eye: The history, techniques, and uses of
visualization. Random House, New York.
Tufte, E. & Graves-Morris, P. (1983). The visual display of quantitative information.
Yau, N. (2011). Visualize this: The flowing data guide to design, visualization, and statistics. John
Wiley & Sons, Indianapolis, IN.
 

© Richard Vidgen, Sam Kirshner and Felix Tan, under exclusive licence to Springer Nature
Limited 2019
R. Vidgen et al., Business Analytics
https://doi.org/10.26777/978-1-352-00726-8_5

5. Clustering and Segmentation


Richard Vidgen1  , Samuel N. Kirshner2  and Felix Tan3 
(1) Business School, University of New South Wales, Sydney, Australia
(2) Business School, University of New South Wales, Sydney, Australia
(3) Business School, University of New South Wales, Sydney, Australia
 
  Richard Vidgen (Corresponding author)
Email: r.vidgen@unsw.edu.au

Samuel N. Kirshner
Email: s.kirshner@unsw.edu.au

Felix Tan
Email: f.tan@unsw.edu.au

Chapter Overview
Clustering is a machine learning approach for grouping data with
similar underlying characteristics. Clustering is used to explore data
without requiring a specific outcome variable, that is, it is
unsupervised learning. Clustering has a wide range of applications,
both in business applications, such as consumer marketing, and in
information system-driven applications, such as image recognition
and recommendation systems. This chapter presents an overview of
clustering and segmentation. After providing a high-level overview
of applications we detail the two most common methods for
clustering data: hierarchical clustering and k-means clustering. We
then cover how to implement clustering in SAS Visual Analytics
(SAS VA) and how to analyse the results using cluster matrices and
parallel coordinates plots.

Learning Outcomes
After completing this chapter you should be able to:

Identify differences between supervised and unsupervised
learning
Discuss applications of clustering
Describe different types of clustering algorithms
Run clustering in SAS Visual Analytics (SAS VA)
Use cluster matrices and parallel coordinates plots to refine the
number of clusters and understand the characteristics of different
clusters
Describe the role of clustering in predictive analytic models.

Introduction
Machine learning differs from traditional programming because the
machines are programmed to learn from data, similar to people
learning from experience, to improve performance in accomplishing a
task. Although the roots of machine learning date back to the late
1950s, with the rise in big data, machine learning has become
widespread in industries, serving as the foundation of predictive
analytics. Machine learning enables websites like Amazon, Google, and
Netflix to make product and media recommendations. Other canonical
applications of machine learning include algorithmic trading, medical
diagnosis, and image, video, and natural language processing. Part of
the importance of machine learning stems from the fact that the data is
often both growing and changing. In many applications, such as credit
card fraud, subtle changes in the data can have drastic implications for
an organization. Machine learning algorithms (unlike humans) can
detect small structural changes within ever-expanding datasets, helping
firms build and maintain competitive advantages.
In general machine learning is applied to the objectives of discovery
and prediction. These two objectives require different types of
algorithms and correspond to two different types of learning:
supervised and unsupervised learning. The objective of supervised
learning algorithms is to produce a set of rules that can take new input
data and generate an accurate prediction of the target. The supervised
learning algorithm is trained on data featuring input characteristics and
the predetermined target output to create these predictions.
Classification and regression are two canonical methods of supervised
learning. Classification is a machine learning process that maps the
inputs to a discrete category. Often the discrete category is binary, such
as classifying if email is spam or whether a customer churned.
Regression can be used to mathematically map a set of input data to an
output that is continuous or to a discrete output. These are the focuses
of later chapters.
of later chapters.
In this chapter we focus on unsupervised learning, which is
exploratory. Data that is run through unsupervised learning algorithms
does not have a specified target output or labelled response based on
the input data. Unsupervised learning is utilized for finding patterns
and groupings in data rather than for predicting a specific outcome
variable based on input data. As a result, unsupervised learning is
heavily used in marketing research and object recognition applications.
The most common unsupervised learning technique is clustering.1
Clustering is an unsupervised learning process that groups together
people or objects with similar characteristics. These groupings form
clusters, which represent groups that are relatively homogeneous
regarding the underlying data and differentiated from people or objects
in other clusters. Clustering is unsupervised learning since the
discovered groups are based on patterns that are unrelated to a specific
target or objective. Thus, clustering is exploratory and useful when
trying to understand a new dataset. Without a specific outcome or goal,
clustering enables a data scientist to discover if there are any naturally
occurring patterns within the data.

Segmentation
The difference between segmentation and clustering is similar to the
difference between data mining and predictive analytics: they are
mostly the same, but with a different emphasis. In business, particularly
in marketing applications, clusters are typically referred to as
segments. Clustering is the technical process for unsupervised
grouping, while segmentation is the application of creating segments of
customers or markets. Thus, clustering can be used to segment
consumer groups.
Traditionally, segmenting allows a firm to identify and target new
potential customers by using analytics to differentiate between people’s
needs and characteristics. This is particularly useful for firms
leveraging customization, whether it is personalized employee training
programs or personalized email offers to consumers. For many
organizations, the size of the workforce or customer base means that
personal customization is not scalable, despite its ability to improve
firm performance and profitability. Clustering helps firms to identify
meaningful customer segments, allowing them to target defined groups
rather than having to customize for each individual customer.
Although clustering is an unsupervised learning technique, firms
can utilize segments in predictive modelling by creating separate
predictive models for each segment. If the size of influencing factors is
likely to differ between heterogeneous groups, then creating separate
models based on segments will better capture the underlying
behaviours driving consumer decision-making in each group. While the
roots of segmentation in business are in marketing, segmentation is
now being used for a variety of information system analytic tasks such
as social network analysis, recommendation engines, and image
recognition.
Companies creating recommendations, such as Netflix, benefit from
the granular data provided by big data. Directly determining whether
people prefer superhero movies and documentaries to historical period
pieces, romantic comedies, and dramas is unlikely to produce valuable
recommendations when there are thousands of potential movies.
Instead, movies themselves can be broken down into components, from
actors, plot elements, sub-genre characteristics, budget, release dates,
promotions, movie locations, and so on. In addition to a breakdown of
movie characteristics, Netflix tracks every aspect of a user’s behaviour
based on their history, ratings, and searches and uses this to create
detailed profiles that group similar users based on their tastes and
preferences for the different aspects of movies. Netflix has created more
than a thousand different taste communities and clusters its almost
150 million users into these profiles. These clusters are a critical input
for creating new original content and its recommender algorithm to
provide users with suggestions of movies and series. Netflix’s success
shows that clustering can provide organizations with a better
understanding of their customers, as well as the various features of
products and services, which can lead to the design and delivery of
improved services and products. Similar to recommendations,
clustering is essential for targeted advertisements as well as for
selecting content for articles and news on social media sites.

Exercise 5.1: The power of clustering
Clustering was also at the centre of the techniques reportedly used
by Cambridge Analytica to sway voters through media campaigns
and digital advertisement in various stages of the US 2016
presidential election. Recognizing that using demographics to
promote messages is superficial and mostly ineffective, Cambridge
Analytica created data profiles based on people’s fundamental
values. Using these profiles, it developed messaging that was
personalized to a specific cluster. After watching the video ‘The
power of big data and psychographics’ (https://www.youtube.com/
watch?v=n8Dd5aVXLCc), answer the following questions:
1.
Explain why the strategy outlined by Cambridge Analytica is
useful in creating value from clusters.
2.
Thinking about the organization for which you currently work
(this can also be one that you have worked for previously or
your current educational institution), how could your
organization create value by using big data and psychographics,
and what, if any, are the ethical implications?
3.
Given the potential power in exploiting people’s data, is it ethical
to use clustering for targeted advertisements?
Clustering algorithms
Forming clusters can be quickly done for small dimensional spaces. For
example, consider the eight characters from Mario Kart 64, whose
attributes vary between their speed (y-axis) and their strength (x-axis).
It is easy to see in Figure 5.1 that these observations can be grouped
into three different clusters: one group that scores high in the variable
speed and low in the variable strength (fast but weak characters); one
that scores high in the variable strength and low in the variable speed
(strong but slow); and one that has average values of each attribute
(balanced). The natural eye can see similar groups when the data is
presented in two dimensions and can form clusters. However, if people
or objects have many different characteristics, it is challenging to form
interpretable groups that capture similarity of traits. The actual value of
clustering comes from data consisting of hundreds or even thousands
of dimensions as clustering can reveal the underlying structure of the
data that is unobservable to the human eye.

Figure 5.1   Clustering Mario Kart characters

Clustering algorithms take raw data and produce clusters using
measures of distance to quantify similarity in the data. The algorithms
create groupings so that each entity in the cluster is as close as possible,
while the actual clusters themselves are as far away as possible, based
on the distance measure. Although there are hundreds of different
algorithms for generating clusters, the two most common methods are
connectivity models and centroid models. In the same way that there
are many methods of creating clusters, there are also many ways of
calculating distance between objects. Two of the most common
connectivity and centroid models are hierarchical clustering and k-
means clustering.
Hierarchical clustering
A common connectivity model is hierarchical clustering. In the most
common variant of hierarchical clustering, each data point is initially
considered a unique cluster. If there are N data points, there will
initially be N clusters. The algorithm then proceeds by joining objects
together to form fewer clusters, but the clusters will contain a higher
number of objects. At the start, the two items that have the closest
distance to each other are joined together creating N-1 clusters, and
then among the N-1 clusters, the next two closest items are joined
together. This keeps occurring until there is only one cluster that
contains all N objects. This process, which can be illustrated using a
diagram known as a dendrogram, is a bottom-up process.
Figure 5.2(a) shows seven data objects labelled A-G in a two-
dimensional space. Figure 5.2(b) shows the associated dendrogram for
hierarchical clustering. The two points that are closest together are B
and C, so they are connected first. The next closest pair of objects is D
and E, followed by F and G. At this point, there are four clusters: Cluster
A, Cluster B-C, Cluster D-E, and Cluster F-G. Of the four clusters, A and
B-C are closest and are joined. Once cluster A-B-C is made, the next
closest clusters to be joined are D-E and F-G. Finally, the clusters A-B-C
and D-E-F-G are joined into a single cluster. The y-axis on the
dendrogram shows the average distance of the objects within the
cluster. The more objects are clustered together, the higher the average
distance will be, due to the greedy nature of the algorithm. Observe that
the algorithm produces N different options for the number of clusters,
ranging from one single cluster to all individual unique clusters. By
looking at the dendrogram, the analyst can use judgement to determine
the best possible number of clusters. One possible option would be to
go with three different clusters: A-B-C, D-E, and F-G. However, the best
heuristic for deciding the number of clusters is based on the largest
vertical line separating the various horizontal lines joining clusters.
From the diagram, the largest vertical distance is between the clusters
connecting D-E with F-G and the final cluster connecting A-B-C with
D-E-F-G, suggesting that there should be two clusters, that is, A-B-C and
D-E-F-G. A similar top-down algorithm can be created by starting with a
single cluster containing all objects and then proceeding to split the
data until each object is its own unique cluster.

Figure 5.2   Example of a dendrogram for hierarchical clustering
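The same bottom-up process can be reproduced with SciPy; the seven 2-D points below are invented to mimic Figure 5.2, not taken from it:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Invented coordinates standing in for objects A-G
labels = ["A", "B", "C", "D", "E", "F", "G"]
points = np.array([[1.0, 5.0], [2.0, 4.0], [2.2, 4.3],
                   [7.0, 1.0], [7.5, 1.4], [9.0, 5.0], [9.3, 5.5]])

# Agglomerative clustering: N singleton clusters merged pair by pair
Z = linkage(points, method="average", metric="euclidean")

dendrogram(Z, labels=labels)
plt.ylabel("Distance at which clusters are joined")
plt.show()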

K-means clustering algorithm
k-means clustering is a centroid method and is the most common
method for determining clusters, due to its scalability to large datasets.
The first step in running a k-means clustering algorithm is specifying a
value for k, that is, the number of clusters that the algorithm will
produce. The value of k is often based on subject matter knowledge or
specific requirements. For example, if the product line has four
different variants of a product, then a marketer would likely select k=4
to determine consumer groups that are most likely to align with the
products. The number of clusters could also be based on a system of
trial and error, where the algorithm is run for a variety of values of k so
that the analyst can see the resulting clusters before finalizing the
choice of k. Alternatively, k can be selected via statistical techniques or
just by choosing some arbitrary value. If there is insufficient
information to select k, then hierarchical clustering can be used to
segment the data to provide some insight into an appropriate number
of clusters (value of k).
There are a variety of ways to start the algorithm, once k has been
determined. For example, each of the objects can be randomly allocated
to one of the k clusters, or the centre of each cluster can be given
random values across the range of variables. To illustrate the algorithm,
assume that the k centres are randomly distributed across the variable
space. After the centres are selected, each observation is allocated to
the cluster of the closest centre. Based on these new clusters, the
midpoint readjusts such that it is at the centre of its cluster. Since the
centre of the cluster has changed, there may be observations in other
clusters that are now closer to a different centroid. Thus, the
observations’ clusters are reassigned after the centroids are updated. If
there are new additions or subtractions from a cluster, then the centre
of the cluster will again change. This process repeats itself until
changes in the centroid no longer result in changes to the clusters. At
this point, the algorithm terminates and the clusters are finalized.
To provide a concrete example of this method, consider the data
from the hierarchical clustering example. Since there are only a few
data points, k is set to 2. In Figure 5.3(a) the centres (represented by
stars) are randomly allocated and, based on a metric capturing
distance, A, D, and E are allocated to the dark centre and B, C, F, and G
are allocated to the light star. The centres of each cluster are updated
in Figure 5.3(b), where the hollow stars are the previous positions and
the solid stars are the updated positions. Notice that the observations
for D and E pull the dark centre down, and the observations B and C
pull the light centre up. Because of the shifts in the centroids, Figure
5.3(c) shows that observation A is reassigned to the light cluster. Since
A is no longer in the dark cluster, the centroid of the dark cluster
moves further down towards D and E, while the centroid of the light
cluster moves even further up towards A, B, and C. This can be seen in
Figure 5.3(d). The change in the centroids causes observation G to be
allocated to the dark cluster in Figure 5.3(e). The switch of observation
G causes the centroids to update again in Figure 5.3(f). In Figure 5.3(g)
the observation F changes to the dark cluster. Finally, in Figure 5.3(h),
the centroids update, but no observations change clusters, which
terminates the algorithm.
 

Figure 5.3   Example of k-means clustering
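The assign-then-update loop described above can be written directly in NumPy. The sketch below is a bare-bones illustration (it does not handle empty clusters, and in practice a library implementation such as scikit-learn's KMeans would be used):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: assign points to the nearest centre, then move centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # random start
    for _ in range(n_iter):
        # Assignment step: nearest centre by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centre moves to the mean of its cluster
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):   # no movement: clusters are final
            break
        centres = new_centres
    return labels, centres

X = np.array([[1.0, 5.0], [2.0, 4.0], [2.2, 4.3],
              [7.0, 1.0], [7.5, 1.4], [9.0, 5.0], [9.3, 5.5]])
labels, centres = kmeans(X, k=2)
print(labels)    # cluster membership for each of the seven points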

Distance measures
There are several ways to measure the closeness of observations. The
most straightforward is Euclidean distance, which is an ordinary
straight line between two points in a multi-dimensional Euclidean
space. In two dimensions, the Euclidean distance is calculated through
the Pythagorean Theorem. Other measures include squared Euclidean
distance, which penalizes greater distances, and Mahalanobis distance,
which measures distance using standard deviations. Typically, before
measures of closeness are calculated, the variables are standardized
such that they all have the same magnitude regarding the range of
values. If customer data includes distances from the city centre and
income, the measures of closeness between the variables will be
skewed since distance and income are not on a similar scale.
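A short NumPy illustration of these ideas, with invented customer values:

import numpy as np

# Invented customers: [distance from city centre (km), income]
X = np.array([[5.0, 60000.0], [8.0, 75000.0],
              [2.0, 40000.0], [12.0, 90000.0]])

# Raw Euclidean distance: the income column dominates the result
raw = np.linalg.norm(X[0] - X[1])
squared = raw ** 2            # squared Euclidean penalizes large gaps more

# Standardize each column to z-scores so both variables carry equal weight
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
standardized = np.linalg.norm(Xz[0] - Xz[1])
print(raw, squared, standardized)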

Exercise 5.2: k-means vs. hierarchical clustering
Hierarchical and k-means clustering are classical techniques for
unsupervised learning. However, there are many novel cutting-edge
algorithms, such as the PERCH (or ‘purity enhancing rotations for
cluster hierarchies’) method described in the video ‘An online
hierarchical algorithm for extreme clustering’ (www.youtube.com/
watch?v=t1XL1IptjAA). After watching the video, answer the
following questions:
1.
What are some shortcomings of k-means and hierarchical
clustering?
2.
Briefly explain the main ideas underlying how PERCH clusters
data.

Clustering in SAS
To demonstrate clustering in SAS VA, we use a dataset called hofstede,
which provides values on Hofstede’s cultural dimensions for 70
countries. The dimensions are power distance, individualism–
collectivism, uncertainty avoidance, masculinity–femininity, short-term
orientation–long-term orientation, and indulgence–restraint. See Box
5.1 on Hofstede’s cultural dimensions for more information on each
construct. Analysing Hofstede’s dimensions is an ideal dataset for
demonstrating clustering since the variables are easily understandable,
each dimension is standardized (each dimension takes on a value
between 0 and 100), and each dimension is a measure variable. In SAS
VA, clustering only works for variables that are measures. The closeness
between categorical variables cannot be measured, which means
centrality measures cannot be calculated. As a result, clustering
typically pertains to measured variables.
When performing cluster analysis, it is essential to be mindful of the
variables being selected for clustering. The variables should be limited
to the most critical measures for differentiating objects or people. It is
also important that clusters exhibit a range of variables to better
understand how groups are differentiated. Recall that it was easy to
categorize the clusters in the Mario Kart example since there were only
two dimensions. The more variables that are used to cluster the data,
the less interpretable the groupings are. In the case of Hofstede’s
cultural dimensions, we will use each of the six dimensions.

Box 5.1: Hofstede’s Cultural Dimensions
Power distance explains how people, notably less powerful members
of organizations and society, view inequality of power in society
(Hofstede, 2011). Followers within a society that accepts the
unequal distribution of power help reinforce the acceptance of
customs and behaviours that underpin higher levels of power
distance. Such behaviours typically see an emphasis on hierarchy
through organizational and family power structures. The second
dimension is individualism. Individualistic cultures place greater
importance on independence, freedom, high levels of competition
and pleasure, whereas collectivist cultures place greater emphasis
on interdependence, family security, social hierarchies, and
cooperation (Wong and Ahuvia, 1998). Figure 5.4 shows the
individuality of the countries in the dataset. Uncertainty avoidance
pertains to the degree that society members accept and are
comfortable with ambiguity and uncertain situations. Masculinity is
the degree that society values assertiveness versus caring, which is
represented by femininity. In cultures displaying higher levels of
masculinity, there is greater admiration for the strong and men have
more dominant roles in organizations and families. In 1991 Hofstede
revised his cultural dimensions to introduce short-term
orientation–long-term orientation, where low values represent
greater values of societal traditions and high values represent
countries that favour adaptation. The last dimension added in 2010
is indulgence–restraint. Indulgence corresponds to a societal norm
promoting the enjoyment of life and having fun. A restraint society
attempts to control and regulate gratification through social norms.

Figure 5.4   Individuality of countries in the dataset (higher scores represent greater
individualism and lower scores represent more collectivist societies)

To create a cluster analysis in SAS VA, select new visualization cluster,
or select the cluster icon. Once the cluster visualization is on the canvas,
drag the six variables onto the canvas (note that SAS VA requires at
least two measure variables to perform clustering). SAS VA will run its
clustering algorithm, creating five clusters by default. The high
dimensionality associated with clustering can make understanding the
natural clusters in the data challenging. To help illustrate the
relationships between the variables for each type of cluster SAS VA
creates two figures: a cluster matrix visualization and a parallel
coordinates visualization, which are shown in Figure 5.5. Despite the
clusters being created by all six cultural dimensions, by default the
cluster matrix and parallel coordinates graphs only display five visible
roles.

Figure 5.5   Default clustering of the Hofstede dataset

The cluster matrix is a matrix of different variables, like a
correlation matrix, except, rather than show the association between
the two variables, the figure is a scatter plot coloured and grouped by
the appropriate cluster. The figure enables analysts to ascertain
similarities and differences across the different cluster groups. By
default, the cluster matrix presents only five of the variables. To view all
six, go to the properties window and change the number of visible roles
from five to six. In addition, maximizing the window helps facilitate
drawing observations from the visualization. Figure 5.6 shows the
cluster matrix for all six variables. The cluster matrix demonstrates
several interesting insights into the grouping. Firstly, the red and blue
clusters are comparable in individuality but differ on indulgence and
orientation. Likewise, the yellow and teal clusters are similar in
individuality (opposing the blue and green clusters) and are opposite
on indulgence. Finally, we notice that the teal and green clusters do not
cover a wide range of data points, implying either that they are distinct
clusters or that there are few countries in both clusters.

Figure 5.6   Cluster matrix for all six dimensions

The parallel coordinates plot bins each variable and draws a line for
each observation through the corresponding bin based on the
observation’s data for each variable. While the cluster matrix helps the
analyst understand how groups differ across combinations of variables,
the parallel coordinates plot enables the analyst to focus on a specific
group to understand the features defining the cluster. For example, the
parallel coordinates plot in Figure 5.5 shows that Cluster ID 0 represents
countries that are low on individuality and long-term orientation, high
on indulgence, and mid-tier for masculinity and power distance. Thus,
the parallel plot is most useful for understanding an individual cluster.
The observations of the parallel coordinates plot are ordered based
on their cluster, which is given on the y-axis. The bandwidth for each
cluster on the axis indicates the magnitude in terms of the number of
observations belonging to the cluster. The parallel coordinates plot
confirms the observation that two of the clusters (cluster 1 and cluster
4) do not have a significant membership, suggesting that the analysis
could be re-run with only three clusters. In the properties window,
under the cluster matrix section, the number of clusters can be changed
from five to three. The parallel coordinates plot has a default setting of
16 bins. To create a more general picture of the data, change the
number to a lower number like four (this would correspond to having
bins representing low values, low–mid values, high–mid values and
high values). In addition, change the number of visible roles to six to
display all the dimensions.
Figure 5.7 shows the maximized parallel coordinates plot for three
clusters and four bins. Cluster 0 characterizes countries with high
indulgence and high uncertainty avoidance, low individuality and low
long-term orientation, and medium levels of masculinity and power
distance. Cluster 1 embodies countries that are highly individual, have
medium levels of indulgence, masculinity, power distance, and
uncertainty avoidance, and have a short-term orientation. Cluster 2
represents collectivist societies with low uncertainty avoidance,
low–medium indulgence, medium–high long-term orientation, and low
masculinity and power distance. Although reducing the number of
clusters creates greater generality, there is a trade-off. Small clusters
may indicate outliers, which can be removed for later modelling for
improved predictions or, in a business setting, can be used to design
new products and services to target niche, but highly profitable
consumer segments.
Figure 5.7   Parallel coordinates plot for three clusters
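An analogous workflow can be run outside SAS VA; in the sketch below the file and column names for the Hofstede scores are assumptions, and pandas draws the parallel coordinates plot:

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.cluster import KMeans

dims = ["power_distance", "individualism", "uncertainty_avoidance",
        "masculinity", "long_term_orientation", "indulgence"]
hofstede = pd.read_csv("hofstede.csv")          # placeholder file name

# Three clusters on the six standardized (0-100) dimensions
km = KMeans(n_clusters=3, n_init=10, random_state=0)
hofstede["cluster"] = km.fit_predict(hofstede[dims]).astype(str)

# One line per country, coloured by cluster membership
parallel_coordinates(hofstede[dims + ["cluster"]], class_column="cluster")
plt.ylabel("Score (0-100)")
plt.show()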

In the cluster matrix, by right-clicking on the visualization, there is
an option to derive a Cluster ID variable. This creates a variable, which
can be a category or a measure. If the groupings are interpretable, then
the create custom category functionality can be used to name the
different Cluster IDs. In this case we change the Cluster IDs to measures
in order to plot the clusters on a geo map. Figure 5.8 shows the
countries grouped into three clusters (light green is cluster group 0,
medium green is cluster group 1, and dark green is cluster group 2).
Interesting observations are that several European countries, such as
France, have cultural dimensions that are closer to Asian countries,
such as China and India, rather than Western traits, and Thailand is the
only Southeast Asian country that is not in Cluster 2. Naturally, by
having more clusters, there will be greater differentiation of countries.
Figure 5.9 shows a geo map with the algorithm re-run with ten cluster
groups. Having a greater number of clusters highlights the diversity of
countries in Europe and South America according to Hofstede’s
dimensions of culture.

Figure 5.8   Geo map of cultural clusters (based on three cluster groups)

Figure 5.9   Geo map of cultural clusters (based on ten cluster groups)

Exercise 5.3: Moneyball
The success of Moneyball in baseball created growth in analytics in
basketball in the late 2000s. At this point, analytics is widespread in
basketball, with many teams emphasizing analytics as a critical
component of their competitive advantage. One important area of
analytics is in team composition. Different players have different
attributes, which translates to different performance levels in the
various aspects of the game. By clustering different players, teams
can understand which types of players they are missing if they are to
be more successful and the number of available players of a certain
grouping (which can influence contract offers and salary
negotiations) and provide a greater understanding of which
combinations of player types leads to superior results. Thus,
clustering is an essential analytical tool for general managers in
basketball, as well as other sports.
1.
Use the NBA2018 dataset, which contains per game statistics for
259 players across the 20 major statistical categories, such as
minutes, points, assists, rebounds, and steals, to explore
potential clusters in basketball. To make appropriate clusters,
you will have to select the different measures that you think are
appropriate for creating groups. For example, you may just want
to consider basic stats, like points, rebounds, assists, steals,
blocks, and turnovers.
2.
Experiment with creating 5, 7, and 9 different cluster groups
based on the selected measure data. Also change the number of
bins from 16 to 10 and 4. After exploring these combinations,
decide which set of clusters may produce appropriate insights
on the data.
3.
Using your selected number of clusters and bins, use SAS VA’s
two cluster visualizations to create profiles for three different
groups of the data.

Summary
Clustering techniques enable a natural exploration of data by creating
groups of objects or segments of people to discover patterns and
similarities across clusters. These groupings, in turn, can be used by
firms to customize content, advertisements, services, products, and
other offerings, to create higher value for customers. With the level of
competition enabled through digital platforms and digital innovation,
firms need to provide greater customization to consumers. In many
applications, firms and analysts have a target outcome that they are
trying to predict, such as profit, customer lifetime value, and churn.
Although clustering is unsupervised, by transforming a large
assortment of continuous measure variables into discrete groups,
clustering can be used to enhance predictive models through data
reduction and removing outlier data.
Clustering is an important technique because the algorithms and
outcomes are intuitive and do not require a deep understanding of
mathematics or statistics. In addition, the groups created often have a
natural interpretation based on the data, which can provide actionable
insights. While clustering assumes that people fit into natural
groupings across various dimensions of data, in reality people’s
characteristics exist on a continuum. This makes selecting the number
of groups and interpreting the underlying meanings of each group in a
clustering an art as much as it is a science.

Further reading
Arora, P. & Varshney, S. (2016). Analysis of k-means and k-medoids algorithm for big data.
Procedia Computer Science, 78: 507–512.
Hofstede, G. (2011). Dimensionalizing cultures: The Hofstede model in context. Online Readings
in Psychology and Culture, 2(1), 8.
Kogan, J. (2007). Introduction to clustering large and high-dimensional data. Cambridge
University Press, New York.
Otto, C., Wang, D., & Jain, A. K. (2018). Clustering millions of faces by identity. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 40(2): 289–303.
Punj, G. & Stewart, D. W. (1983). Cluster analysis in marketing research: Review and suggestions
for application. Journal of Marketing Research, 134–148.
Sarstedt, M. & Mooi, E. (2014). Cluster analysis. In A concise guide to market research (pp. 273–
324). Springer, Berlin, Germany.
Wedel, M. & Kamakura, W. A. (2012). Market segmentation: Conceptual and methodological
foundations (Vol. 8). Springer Science & Business Media, New York.
Wong, N. Y. & Ahuvia, A. C. (1998). Personal taste and family face: Luxury consumption in
Confucian and Western societies. Psychology & Marketing, 15(5), 423–441.
Wu, J. (2012). Advances in K-means clustering: A data mining thinking. Springer Science &
Business Media, Berlin, Germany.

Footnotes
1 Another common technique is dimension reduction, in which variables (columns) are grouped to form higher-level constructs, as is the case with principal components analysis. This is different from clustering, in which rows (cases) are grouped together to form groups of cases that share some common traits. (The short sketch below contrasts the two.)
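A small, hypothetical Python sketch (scikit-learn, with randomly generated data) makes the contrast concrete: PCA condenses the columns, while k-means groups the rows.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 6))  # 100 cases (rows) x 6 variables (columns)

    # Dimension reduction: 6 variables combined into 2 components
    components = PCA(n_components=2).fit_transform(X)

    # Clustering: 100 cases grouped into 3 clusters
    labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

    print(components.shape)  # (100, 2): fewer columns, same cases
    print(labels.shape)      # (100,): one group label per case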

 
 

© Richard Vidgen, Sam Kirshner and Felix Tan, under exclusive licence to Springer Nature Limited 2019
R. Vidgen et al., Business Analytics
https://doi.org/10.26777/978-1-352-00726-8_6

6. Predictive Modelling with Regression


Richard Vidgen1  , Samuel N. Kirshner2  and Felix Tan3 
(1) Business School, University of New South Wales, Sydney, Australia
(2) Business School, University of New South Wales, Sydney, Australia
(3) Business School, University of New South Wales, Sydney, Australia
 
  Richard Vidgen (Corresponding author)
Email: r.vidgen@unsw.edu.au
Samuel N. Kirshner
Email: s.kirshner@unsw.edu.au

Felix Tan
Email: f.tan@unsw.edu.au

Chapter Overview   In this chapter we look in detail at how to build and interpret a predictive model. A basic model is a simple linear regression with one input (independent variable) and one continuous output (dependent variable). For example, we might want to predict exam performance based on the number of lectures a student has attended. We then move to multiple regression where there is still one continuous dependent variable (e.g., exam performance) but multiple predictors – for example, the number of lectures attended and the number of books read. All models have assumptions and it is important to check that these are met if the results of the model are to be trusted when making predictions.
Learning Outcomes
After you have completed this chapter you should be able to:
Specify a multiple linear regression model
Describe the output of a multiple linear regression model
Interpret the output of a multiple linear regression model in business terms
Assess whether the assumptions of multiple linear regression are met
Discuss the implications of 'overfitting' a predictive model
Use SAS Visual Analytics (SAS VA) to build a predictive model with multiple linear regression.

Introduction
Analytics is, fundamentally, concerned with making better decisions. There has been a shift from using analytics to describe and understand the business in its current form to using analytics for prediction and even the reimagining of the business. The predictions that can be made using analytics are broad and various. For example, we might be interested in predicting the outcome of an election, foreseeing criminal activity, estimating the cost of a storm or other natural disaster, predicting which customers will churn, identifying fraudulent transactions, identifying individuals most likely to donate to charity, or targeting specific types of customers with advertising and special offers. The business benefits of analytics can include identifying and acquiring new customers, upselling or enhancing the relationship with existing customers, retaining profitable customers, gaining a competitive advantage by identifying new market opportunities, and finding patterns in data that alert you to potential dangers and opportunities in the environment.

Regardless of the context and the technique deployed, analytics involves building a model based on data to detect patterns in that data in order to make predictions that serve as the basis for action. Thus, we go from real-world messy data to an abstraction (i.e., a model) that produces generalizations that are the basis for creating specific predictions.

Although models are generalizations, their predictive abilities are powerful, particularly when the volume of data is high. For example, sometimes, when we visit a website, the use of analytics can feel uncanny and even unsettling; the site seems to know us better than we know ourselves. Not everyone will be comfortable with this. Therefore, we must consider the ethical, legal, regulatory, social, and business implications of how analytics is used. Using data inappropriately could lead to customers leaving because they feel uncomfortable, even though the usage may be perfectly legal.
In this chapter we will look at one of the most widely used and
fundamental techniques for building predictive models – multiple linear
regression. Multiple linear regression is used to estimate the
relationships between data variables. Assuming that the future is related
to the past, using multiple regression to estimate relationships can
enable businesses to make forecasts and predictions.
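As a preview of what is to come, a model of this kind can be fitted in a few lines of code. The sketch below is a hypothetical Python illustration using the statsmodels library (the chapter itself builds such models in SAS VA); the data values are invented, echoing the lectures-and-books example from the chapter overview.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Invented data: exam mark with two predictors
    df = pd.DataFrame({
        'exam_mark': [58, 67, 60, 72, 66, 55, 74, 61],
        'lectures':  [8, 10, 7, 12, 9, 6, 11, 8],
        'books':     [1, 3, 2, 4, 2, 1, 3, 2],
    })

    # Multiple linear regression: one continuous outcome, two predictors
    model = smf.ols('exam_mark ~ lectures + books', data=df).fit()
    print(model.params)     # intercept plus one coefficient per predictor
    print(model.rsquared)   # proportion of variance explained

The fitted coefficients estimate how the expected exam mark changes with each additional lecture attended or book read, holding the other predictor constant.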
Exercise 6.1: All models are wrong   George Box said, 'All models are wrong but some are useful'.
1. What do you understand to be the meaning of this expression?
2. What might be some of the implications of the expression for an
organization that relies on predictive models as an essential part of its
business operations?

Predictive models
However complex our models might be in terms of number of variables, types of data, and statistical techniques, always remember that:

outcome_i = (model) + error_i

The outcome for observation i is given by the model plus some error term for observation i. For example, assume that you recorded the marks achieved by students in an exam and arrived at Table 6.1.
Table 6.1   Exam results (actual)

Case     Exam mark (Y)
1        58
2        67
3        60
4        60
5        62
6        59
7        52
8        76
9        60
10       66
Total    620

The most basic prediction (outcome) of the mark achieved by a student taking this exam is the mean:

mean = sum of Y / n = 620 / 10 = 62

Subtracting the predicted value (the mean) from the actual value gives the error term (Table 6.2). Note that the sum of the errors is equal to zero. Squaring the error term removes the sign and, when summed, gives a measure of the total error (Table 6.2).
Table 6.2   Predicted exam mark and error

Case     Exam mark (Y)   Exam mark mean   Mean error   Mean error squared (SSt)
1        58              62               −4           16
2        67              62               5            25
3        60              62               −2           4
4        60              62               −2           4
5        62              62               0            0
6        59              62               −3           9
7        52              62               −10          100
8        76              62               14           196
9        60              62               −2           4
10       66              62               4            16
Total    620             620              0            374
Assuming that we do not update it, our (very) simple model always produces the same prediction; if a new student takes the exam then our model predicts they will achieve 62%.
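The calculations in Tables 6.1 and 6.2 are easy to verify in a few lines of code. This small Python sketch (illustrative only; the book's own tooling is SAS VA) recomputes the mean prediction, the error column, and the total squared error:

    marks = [58, 67, 60, 60, 62, 59, 52, 76, 60, 66]   # Table 6.1

    mean_mark = sum(marks) / len(marks)        # 62.0 -- the 'model'
    errors = [y - mean_mark for y in marks]    # the mean error column
    sst = sum(e ** 2 for e in errors)          # total squared error

    print(mean_mark)    # 62.0
    print(sum(errors))  # 0.0: errors around the mean always sum to zero
    print(sst)          # 374.0, matching SSt in Table 6.2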


Using the mean does not provide a very good fit of model to data. The mean squared error calculation in Table 6.2 shows that the model poorly predicts the outcomes of cases 7 and 8, despite being reasonably accurate for students 3 to 6 and 9. We can visualize this in Figure 6.1, which plots the actual exam marks against the predicted outcome of 62%. The size of the error term, which is shown by the vertical lines (and is the same as in Table 6.2), reveals that there is potentially a better way to predict a student's mark.

Figure 6.1   Graph of exam marks – actual versus predicted (mean)

Simple linear regression
We will start with simple linear regression, where there is a single predictor. As an example of simple regression, let's assume that we asked each student how many hours of revision they did for the exam (Table 6.3).
Table 6.3   Hours of revision (X) and exam mark (Y)

Case     Hours of revision (X)   Exam mark (Y)
1        9.80                    58
2        9.40                    67
3        6.50                    60
4        11.80                   60
5        8.90                    62
6        5.40                    59
7        6.40                    52
8        15.00                   76
