
Table of Contents

Preface vii
Chapter 1: From Data to Decisions – Getting Started with Analytic Applications 1
Designing an advanced analytic solution 4
Data layer: warehouses, lakes, and streams 6
Modeling layer 8
Deployment layer 14
Reporting layer 15
Case study: sentiment analysis of social media feeds 16
Data input and transformation 17
Sanity checking 18
Model development 18
Scoring 19
Visualization and reporting 19
Case study: targeted e-mail campaigns 19
Data input and transformation 20
Sanity checking 21
Model development 21
Scoring 21
Visualization and reporting 21
Summary 23
Chapter 2: Exploratory Data Analysis and Visualization in Python 25
Exploring categorical and numerical data in IPython 26
Installing IPython notebook 27
The notebook interface 27
Loading and inspecting data 30
Basic manipulations – grouping, filtering, mapping, and pivoting 33
Charting with Matplotlib 38
Time series analysis 46
Cleaning and converting 46
Time series diagnostics 48
Joining signals and correlation 50
Working with geospatial data 53
Loading geospatial data 53
Working in the cloud 55
Introduction to PySpark 56
Creating the SparkContext 56
Creating an RDD 58
Creating a Spark DataFrame 59
Summary 61
Chapter 3: Finding Patterns in the Noise – Clustering and Unsupervised Learning 63
Similarity and distance metrics 64
Numerical distance metrics 64
Correlation similarity metrics and time series 70
Similarity metrics for categorical data 78
K-means clustering 83
Affinity propagation – automatically choosing cluster numbers 89
k-medoids 93
Agglomerative clustering 94
Where agglomerative clustering fails 96
Streaming clustering in Spark 100
Summary 104
Chapter 4: Connecting the Dots with Models – Regression Methods 105
Linear regression 106
Data preparation 109
Model fitting and evaluation 114
Statistical significance of regression outputs 119
Generalized estimating equations 124
Mixed effects models 126
Time series data 127
Generalized linear models 128
Applying regularization to linear models 129
Tree methods 132
Decision trees 132
Random forest 138
Scaling out with PySpark – predicting year of song release 141
Summary 143
Chapter 5: Putting Data in its Place – Classification Methods and Analysis 145
Logistic regression 146
Multiclass logistic classifiers: multinomial regression 150
Formatting a dataset for classification problems 151
Learning pointwise updates with stochastic gradient descent 155
Jointly optimizing all parameters with second-order methods 158
Fitting the model 162
Evaluating classification models 165
Strategies for improving classification models 169
Separating nonlinear boundaries with support vector machines 172
Fitting an SVM to the census data 174
Boosting – combining small models to improve accuracy 177
Gradient boosted decision trees 177
Comparing classification methods 180
Case study: fitting classifier models in PySpark 182
Summary 184
Chapter 6: Words and Pixels – Working with Unstructured Data 185
Working with textual data 186
Cleaning textual data 186
Extracting features from textual data 189
Using dimensionality reduction to simplify datasets 192
Principal component analysis 193
Latent Dirichlet Allocation 205
Using dimensionality reduction in predictive modeling 209
Images 209
Cleaning image data 210
Thresholding images to highlight objects 213
Dimensionality reduction for image analysis 216
Case study: training a recommender system in PySpark 220
Summary 222
Chapter 7: Learning from the Bottom Up – Deep Networks and Unsupervised Features 223
Learning patterns with neural networks 224
A network of one – the perceptron 224
Combining perceptrons – a single-layer neural network 226
Parameter fitting with back-propagation 229
Discriminative versus generative models 234
Vanishing gradients and explaining away 235
Pretraining belief networks 238
Using dropout to regularize networks 241
Convolutional networks and rectified units 242
Compressing data with autoencoder networks 246
Optimizing the learning rate 247
The TensorFlow library and digit recognition 249
The MNIST data 250
Constructing the network 252
Summary 256
Chapter 8: Sharing Models with Prediction Services 257
The architecture of a prediction service 258
Clients and making requests 260
The GET requests 260
The POST request 262
The HEAD request 262
The PUT request 262
The DELETE request 263
Server – the web traffic controller 263
Application – the engine of the predictive services 265
Persisting information with database systems 266
Case study – logistic regression service 267
Setting up the database 268
The web server 271
The web application 273
The flow of a prediction service – training a model 274
On-demand and bulk prediction 283
Summary 287
Chapter 9: Reporting and Testing – Iterating on Analytic Systems 289
Checking the health of models with diagnostics 290
Evaluating changes in model performance 290
Changes in feature importance 294
Changes in unsupervised model performance 295
Iterating on models through A/B testing 297
Experimental allocation – assigning customers to experiments 298
Deciding a sample size 299
Multiple hypothesis testing 302
Guidelines for communication 302
Translate terms to business values 303
Visualizing results 303
Case study: building a reporting service 304
The report server 304
The report application 305
The visualization layer 306
Summary 310
Index 311
