
Contents

About the Authors..................................................................................................xvii


About the Technical Reviewer ................................................................................xix
Acknowledgments ..................................................................................................xxi
Foreword ..............................................................................................................xxiii
Introduction ...........................................................................................................xxv

■Part I: Understanding Machine Learning ............................................ 1


■Chapter 1: Machine Learning Basics ..................................................................... 3
The Need for Machine Learning ....................................................................................... 4
Making Data-Driven Decisions ............................................................................................................... 4
Efficiency and Scale ............................................................................................................................... 5
Traditional Programming Paradigm ........................................................................................................ 5
Why Machine Learning? ......................................................................................................................... 6

Understanding Machine Learning .................................................................................... 8


Why Make Machines Learn?................................................................................................................... 8
Formal Definition .................................................................................................................................... 9
A Multi-Disciplinary Field ..................................................................................................................... 13

Computer Science .......................................................................................................... 14


Theoretical Computer Science.............................................................................................................. 15
Practical Computer Science ................................................................................................................. 15
Important Concepts .............................................................................................................................. 15

Data Science .................................................................................................................. 16


Mathematics .................................................................................................................. 18
Important Concepts .............................................................................................................................. 19

Statistics ........................................................................................................................ 24
Data Mining .................................................................................................................... 25
Artificial Intelligence ...................................................................................................... 25
Natural Language Processing ........................................................................................ 26
Deep Learning ................................................................................................................ 28
Important Concepts .............................................................................................................................. 31

Machine Learning Methods ............................................................................................ 34


Supervised Learning ...................................................................................................... 35
Classification ........................................................................................................................................ 36
Regression ............................................................................................................................................ 37
Unsupervised Learning .................................................................................................. 38
Clustering ............................................................................................................................................. 39
Dimensionality Reduction ..................................................................................................................... 40
Anomaly Detection................................................................................................................................ 41
Association Rule-Mining ....................................................................................................................... 41

Semi-Supervised Learning ............................................................................................. 42


Reinforcement Learning ................................................................................................. 42
Batch Learning ............................................................................................................... 43
Online Learning .............................................................................................................. 44
Instance Based Learning ................................................................................................ 44
Model Based Learning.................................................................................................... 45
The CRISP-DM Process Model........................................................................................ 45
Business Understanding ....................................................................................................................... 46
Data Understanding .............................................................................................................................. 48
Data Preparation ................................................................................................................................... 50
Modeling ............................................................................................................................................... 51
Evaluation ............................................................................................................................................. 52
Deployment........................................................................................................................................... 52


Building Machine Intelligence ........................................................................................ 52


Machine Learning Pipelines ................................................................................................................. 52
Supervised Machine Learning Pipeline ................................................................................................ 54
Unsupervised Machine Learning Pipeline ............................................................................................ 55

Real-World Case Study: Predicting Student Grant Recommendations ........................... 55


Objective ............................................................................................................................................... 56
Data Retrieval ....................................................................................................................................... 56
Data Preparation ................................................................................................................................... 57
Modeling ............................................................................................................................................... 60
Model Evaluation .................................................................................................................................. 61
Model Deployment ................................................................................................................................ 61
Prediction in Action............................................................................................................................... 62

Challenges in Machine Learning .................................................................................... 64


Real-World Applications of Machine Learning ............................................................... 64
Summary ........................................................................................................................ 65
■Chapter 2: The Python Machine Learning Ecosystem ......................................... 67
Python: An Introduction .................................................................................................. 67
Strengths .............................................................................................................................................. 68
Pitfalls................................................................................................................................................... 68
Setting Up a Python Environment ......................................................................................................... 69
Why Python for Data Science? ............................................................................................................. 71

Introducing the Python Machine Learning Ecosystem ................................................... 72


Jupyter Notebooks................................................................................................................................ 72
NumPy .................................................................................................................................................. 75
Pandas .................................................................................................................................................. 84
Scikit-learn ........................................................................................................................................... 96
Neural Networks and Deep Learning .................................................................................................. 102
Text Analytics and Natural Language Processing ............................................................................... 112
Statsmodels ........................................................................................................................................ 116

Summary ...................................................................................................................... 118


■Part II: The Machine Learning Pipeline ........................................... 119


■Chapter 3: Processing, Wrangling, and Visualizing Data................................... 121
Data Collection ............................................................................................................. 122
CSV ..................................................................................................................................................... 122
JSON ................................................................................................................................................... 124
XML..................................................................................................................................................... 128
HTML and Scraping ............................................................................................................................ 131
SQL ..................................................................................................................................................... 136

Data Description ........................................................................................................... 137


Numeric .............................................................................................................................................. 137
Text ..................................................................................................................................................... 137
Categorical ......................................................................................................................................... 137

Data Wrangling ............................................................................................................. 138


Understanding Data ............................................................................................................................ 138
Filtering Data ...................................................................................................................................... 141
Typecasting......................................................................................................................................... 144
Transformations .................................................................................................................................. 144
Imputing Missing Values ..................................................................................................................... 145
Handling Duplicates............................................................................................................................ 147
Handling Categorical Data .................................................................................................................. 147
Normalizing Values ............................................................................................................................. 148
String Manipulations .......................................................................................................................... 149

Data Summarization ..................................................................................................... 149


Data Visualization ......................................................................................................... 151
Visualizing with Pandas ...................................................................................................................... 152
Visualizing with Matplotlib.................................................................................................................. 161
Python Visualization Ecosystem ......................................................................................................... 176

Summary ...................................................................................................................... 176


■Chapter 4: Feature Engineering and Selection .................................................. 177


Features: Understand Your Data Better ........................................................................ 178
Data and Datasets .............................................................................................................................. 178
Features.............................................................................................................................................. 179
Models ................................................................................................................................................ 179

Revisiting the Machine Learning Pipeline .................................................................... 179


Feature Extraction and Engineering ............................................................................. 181
What Is Feature Engineering?............................................................................................................. 181
Why Feature Engineering?.................................................................................................................. 183
How Do You Engineer Features? ......................................................................................................... 184

Feature Engineering on Numeric Data ......................................................................... 185


Raw Measures .................................................................................................................................... 185
Binarization......................................................................................................................................... 187
Rounding ............................................................................................................................................ 188
Interactions......................................................................................................................................... 189
Binning ............................................................................................................................................... 191
Statistical Transformations ................................................................................................................. 197

Feature Engineering on Categorical Data ..................................................................... 200


Transforming Nominal Features ......................................................................................................... 201
Transforming Ordinal Features ........................................................................................................... 202
Encoding Categorical Features ........................................................................................................... 203

Feature Engineering on Text Data ................................................................................ 209


Text Pre-Processing ............................................................................................................................ 210
Bag of Words Model............................................................................................................................ 211
Bag of N-Grams Model ....................................................................................................................... 212
TF-IDF Model ...................................................................................................................................... 213
Document Similarity ........................................................................................................................... 214
Topic Models....................................................................................................................................... 216
Word Embeddings............................................................................................................................... 217


Feature Engineering on Temporal Data ........................................................................ 220


Date-Based Features .......................................................................................................................... 221
Time-Based Features ......................................................................................................................... 222

Feature Engineering on Image Data ............................................................................. 224


Image Metadata Features ................................................................................................................... 225
Raw Image and Channel Pixels .......................................................................................................... 225
Grayscale Image Pixels....................................................................................................................... 227
Binning Image Intensity Distribution .................................................................................................. 227
Image Aggregation Statistics.............................................................................................................. 228
Edge Detection ................................................................................................................................... 229
Object Detection ................................................................................................................................. 230
Localized Feature Extraction .............................................................................................................. 231
Visual Bag of Words Model ................................................................................................................. 233
Automated Feature Engineering with Deep Learning ......................................................................... 236

Feature Scaling ............................................................................................................ 239


Standardized Scaling .......................................................................................................................... 240
Min-Max Scaling................................................................................................................................. 240
Robust Scaling.................................................................................................................................... 241

Feature Selection ......................................................................................................... 242


Threshold-Based Methods.................................................................................................................. 243
Statistical Methods ............................................................................................................................. 244
Recursive Feature Elimination ............................................................................................................ 247
Model-Based Selection....................................................................................................................... 248

Dimensionality Reduction............................................................................................. 249


Feature Extraction with Principal Component Analysis ...................................................................... 250

Summary ...................................................................................................................... 252


■Chapter 5: Building, Tuning, and Deploying Models .......................................... 255
Building Models............................................................................................................ 256
Model Types ........................................................................................................................................ 257
Learning a Model ................................................................................................................................ 260
Model Building Examples ................................................................................................................... 263


Model Evaluation .......................................................................................................... 271


Evaluating Classification Models ........................................................................................................ 271
Evaluating Clustering Models ............................................................................................................. 278
Evaluating Regression Models............................................................................................................ 281

Model Tuning ................................................................................................................ 282


Introduction to Hyperparameters........................................................................................................ 283
The Bias-Variance Tradeoff ................................................................................................................. 284
Cross Validation .................................................................................................................................. 288
Hyperparameter Tuning Strategies ..................................................................................................... 291

Model Interpretation ..................................................................................................... 295


Understanding Skater ......................................................................................................................... 297
Model Interpretation in Action ............................................................................................................ 298

Model Deployment ....................................................................................................... 302


Model Persistence .............................................................................................................................. 302
Custom Development ......................................................................................................................... 303
In-House Model Deployment .............................................................................................................. 303
Model Deployment as a Service ......................................................................................................... 304

Summary ...................................................................................................................... 304

■Part III: Real-World Case Studies ................................................... 305


■Chapter 6: Analyzing Bike Sharing Trends ........................................................ 307
The Bike Sharing Dataset ............................................................................................. 307
Problem Statement ...................................................................................................... 308
Exploratory Data Analysis ............................................................................................. 308
Preprocessing ..................................................................................................................................... 308
Distribution and Trends ....................................................................................................................... 310
Outliers ............................................................................................................................................... 312
Correlations ........................................................................................................................................ 314


Regression Analysis ..................................................................................................... 315


Types of Regression............................................................................................................................ 315
Assumptions ....................................................................................................................................... 316
Evaluation Criteria .............................................................................................................................. 316

Modeling....................................................................................................................... 317
Linear Regression ............................................................................................................................... 319
Decision Tree Based Regression......................................................................................................... 323

Next Steps .................................................................................................................... 330


Summary ...................................................................................................................... 330
■Chapter 7: Analyzing Movie Reviews Sentiment ............................................... 331
Problem Statement ...................................................................................................... 332
Setting Up Dependencies ............................................................................................. 332
Getting the Data ........................................................................................................... 333
Text Pre-Processing and Normalization ....................................................................... 333
Unsupervised Lexicon-Based Models .......................................................................... 336
Bing Liu’s Lexicon ............................................................................................................................... 337
MPQA Subjectivity Lexicon ................................................................................................................. 337
Pattern Lexicon ................................................................................................................................... 338
AFINN Lexicon..................................................................................................................................... 338
SentiWordNet Lexicon ........................................................................................................................ 340
VADER Lexicon .................................................................................................................................... 342

Classifying Sentiment with Supervised Learning ......................................................... 345


Traditional Supervised Machine Learning Models........................................................ 346
Newer Supervised Deep Learning Models ................................................................... 349
Advanced Supervised Deep Learning Models .............................................................. 355
Analyzing Sentiment Causation.................................................................................... 363
Interpreting Predictive Models ........................................................................................................... 363
Analyzing Topic Models ...................................................................................................................... 368

Summary ...................................................................................................................... 372


■Chapter 8: Customer Segmentation and Effective Cross Selling ....................... 373


Online Retail Transactions Dataset ............................................................................... 374
Exploratory Data Analysis ............................................................................................. 374
Customer Segmentation ............................................................................................... 378
Objectives ........................................................................................................................................... 378
Strategies ........................................................................................................................................... 379
Clustering Strategy ............................................................................................................................. 380

Cross Selling ................................................................................................................ 392


Market Basket Analysis with Association Rule-Mining....................................................................... 393
Association Rule-Mining Basics ......................................................................................................... 394
Association Rule-Mining in Action ...................................................................................................... 396

Summary ...................................................................................................................... 405


■Chapter 9: Analyzing Wine Types and Quality ................................................... 407
Problem Statement ...................................................................................................... 407
Setting Up Dependencies ............................................................................................. 408
Getting the Data ........................................................................................................... 408
Exploratory Data Analysis ............................................................................................. 409
Process and Merge Datasets .............................................................................................................. 409
Understanding Dataset Features ........................................................................................................ 410
Descriptive Statistics .......................................................................................................................... 413
Inferential Statistics............................................................................................................................ 414
Univariate Analysis ............................................................................................................................. 416
Multivariate Analysis .......................................................................................................................... 419

Predictive Modeling...................................................................................................... 426


Predicting Wine Types .................................................................................................. 427
Predicting Wine Quality ................................................................................................ 433
Summary ...................................................................................................................... 446


■Chapter 10: Analyzing Music Trends and Recommendations............................ 447


The Million Song Dataset Taste Profile ......................................................................... 448
Exploratory Data Analysis ............................................................................................. 448
Loading and Trimming Data ................................................................................................................ 448
Enhancing the Data ............................................................................................................................ 451
Visual Analysis .................................................................................................................................... 452

Recommendation Engines............................................................................................ 456


Types of Recommendation Engines .................................................................................................... 457
Utility of Recommendation Engines .................................................................................................... 457
Popularity-Based Recommendation Engine ....................................................................................... 458
Item Similarity Based Recommendation Engine................................................................................. 459
Matrix Factorization Based Recommendation Engine ........................................................................ 461

A Note on Recommendation Engine Libraries .............................................................. 466


Summary ...................................................................................................................... 466

■Chapter 11: Forecasting Stock and Commodity Prices ..................................... 467
Time Series Data and Analysis ..................................................................................... 467
Time Series Components .................................................................................................................... 469
Smoothing Techniques ....................................................................................................................... 471

Forecasting Gold Price ................................................................................................. 474


Problem Statement ............................................................................................................................. 474
Dataset ............................................................................................................................................... 474
Traditional Approaches ....................................................................................................................... 474
Modeling ............................................................................................................................................. 476

Stock Price Prediction .................................................................................................. 483


Problem Statement ............................................................................................................................. 484
Dataset ............................................................................................................................................... 484
Recurrent Neural Networks: LSTM ..................................................................................................... 485
Upcoming Techniques: Prophet .......................................................................................................... 495

Summary ...................................................................................................................... 497


■Chapter 12: Deep Learning for Computer Vision ............................................... 499


Convolutional Neural Networks .................................................................................... 499
Image Classification with CNNs ................................................................................... 501
Problem Statement ............................................................................................................................. 501
Dataset ............................................................................................................................................... 501
CNN Based Deep Learning Classifier from Scratch ............................................................................ 502
CNN Based Deep Learning Classifier with Pretrained Models ............................................................ 505

Artistic Style Transfer with CNNs ................................................................................. 509


Background ........................................................................................................................................ 510
Preprocessing ..................................................................................................................................... 511
Loss Functions.................................................................................................................................... 513
Custom Optimizer ............................................................................................................................... 515
Style Transfer in Action ....................................................................................................................... 516

Summary ...................................................................................................................... 520

Index ..................................................................................................................... 521
