Data Science
Fundamentals
Pocket Primer
LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY

By purchasing or using this book and companion files (the “Work”), you
agree that this license grants permission to use the contents contained herein,
including the disc, but does not give you the right of ownership to any of the
textual content in the book / disc or ownership to any of the information or
products contained in it. This license does not permit uploading of the Work
onto the Internet or on a network (of any kind) without the written consent
of the Publisher. Duplication or dissemination of any text, code, simulations,
images, etc. contained herein is limited to and subject to licensing terms
for the respective products, and permission must be obtained from the
Publisher or the owner of the content, etc., in order to reproduce or network
any portion of the textual material (in any media) that is contained in the
Work.
Mercury Learning and Information (“MLI” or “the Publisher”) and anyone
involved in the creation, writing, or production of the companion disc,
accompanying algorithms, code, or computer programs (“the software”),
and any accompanying Web site or software of the Work, cannot and do
not warrant the performance or results that might be obtained by using
the contents of the Work. The author, developers, and the Publisher have
used their best efforts to ensure the accuracy and functionality of the textual
material and/or programs contained in this package; we, however, make no
warranty of any kind, express or implied, regarding the performance of these
contents or programs. The Work is sold “as is” without warranty (except
for defective materials used in manufacturing the book or due to faulty
workmanship).
The author, developers, and the publisher of any accompanying content,
and anyone involved in the composition, production, and manufacturing of
this work will not be liable for damages of any kind arising out of the use of
(or the inability to use) the algorithms, source code, computer programs, or
textual material contained in this publication. This includes, but is not limited
to, loss of revenue or profit, or other incidental, physical, or consequential
damages arising out of the use of this Work.
The sole remedy in the event of a claim of any kind is expressly limited
to replacement of the book and/or disc, and only at the discretion of the
Publisher. The use of “implied warranty” and certain “exclusions” varies from
state to state, and might not apply to the purchaser of this product.
Companion files for this title are available by writing to the publisher at
info@merclearning.com.
Data Science
Fundamentals
Pocket Primer

Oswald Campesato

Mercury Learning and Information


Dulles, Virginia
Boston, Massachusetts
New Delhi
Copyright ©2021 by Mercury Learning and Information LLC. All rights reserved.

This publication, portions of it, or any accompanying software may not be reproduced in any way,
stored in a retrieval system of any type, or transmitted by any means, media, electronic display
or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or
scanning, without prior permission in writing from the publisher.

Publisher: David Pallai


Mercury Learning and Information
22841 Quicksilver Drive
Dulles, VA 20166
info@merclearning.com
www.merclearning.com
800-232-0223

O. Campesato. Data Science Fundamentals Pocket Primer.


ISBN: 978-1-68392-733-4

The publisher recognizes and respects all marks used by companies, manufacturers, and developers as
a means to distinguish their products. All brand names and product names mentioned in this book are
trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of
service marks or trademarks, etc. is not an attempt to infringe on the property of others.

Library of Congress Control Number: 2021937777

21 22 23 3 2 1
This book is printed on acid-free paper in the United States of America.

Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc.
For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).

All of our titles are available in digital format at academiccourseware.com and other digital vendors.
Companion files (figures and code listings) for this title are available by contacting
info@merclearning.com. The sole obligation of Mercury Learning and Information to the purchaser is to replace the disc,
based on defective materials or faulty workmanship, but not based on the operation or functionality of
the product.
I’d like to dedicate this book to my parents–
may this bring joy and happiness into their lives.
Contents

Preface........................................................................................................... xix

Chapter 1 Working With Data................................................1


What are Datasets?.............................................................................................. 1
Data Preprocessing........................................................................................ 2
Data Types........................................................................................................... 3
Preparing Datasets.............................................................................................. 4
Discrete Data Versus Continuous Data........................................................ 4
“Binning” Continuous Data........................................................................... 5
Scaling Numeric Data via Normalization..................................................... 5
Scaling Numeric Data via Standardization................................................... 6
What to Look for in Categorical Data........................................................... 7
Mapping Categorical Data to Numeric Values............................................. 8
Working with Dates....................................................................................... 9
Working with Currency............................................................................... 10
Missing Data, Anomalies, and Outliers............................................................ 10
Missing Data................................................................................................ 10
Anomalies and Outliers............................................................................... 11
Outlier Detection......................................................................................... 11
What is Data Drift?.....................................................................................12
What is Imbalanced Classification?.................................................................. 12
What is SMOTE?.............................................................................................. 14
SMOTE Extensions..................................................................................... 14
Analyzing Classifiers (Optional)........................................................................ 14
What is LIME?............................................................................................ 15
What is ANOVA?......................................................................................... 15
The Bias-Variance Trade-Off............................................................................ 16
Types of Bias in Data................................................................................... 17
Summary............................................................................................................ 18

Chapter 2 Intro to Probability and Statistics........................19


What is a Probability?........................................................................................ 20
Calculating the Expected Value.................................................................. 20
Random Variables.............................................................................................. 22
Discrete versus Continuous Random Variables......................................... 22
Well-Known Probability Distributions........................................................ 22
Fundamental Concepts in Statistics................................................................. 23
The Mean..................................................................................................... 23
The Median.................................................................................................. 23
The Mode..................................................................................................... 23
The Variance and Standard Deviation........................................................ 24
Population, Sample, and Population Variance............................................ 25
Chebyshev’s Inequality................................................................................ 25
What is a P-Value?....................................................................................... 25
The Moments of a Function (Optional)........................................................... 26
What is Skewness?....................................................................................... 26
What is Kurtosis?......................................................................................... 26
Data and Statistics............................................................................................. 27
The Central Limit Theorem........................................................................ 27
Correlation versus Causation...................................................................... 28
Statistical Inferences.................................................................................... 28
Statistical Terms – RSS, TSS, R^2, and F1 Score............................................ 28
What is an F1 Score?................................................................................... 29
Gini Impurity, Entropy, and Perplexity............................................................ 30
What is the Gini Impurity?.......................................................................... 30
What is Entropy?......................................................................................... 30
Calculating Gini Impurity and Entropy Values.......................................... 31
Multidimensional Gini Index...................................................................... 32
What is Perplexity?...................................................................................... 32
Cross-Entropy and KL Divergence.................................................................. 32
What is Cross-Entropy?............................................................................... 33
What is KL Divergence?............................................................................. 33
What’s their Purpose?.................................................................................. 34
Covariance and Correlation Matrices............................................................... 34
The Covariance Matrix................................................................................ 34
Covariance Matrix: An Example................................................................. 35
The Correlation Matrix................................................................................ 36
Eigenvalues and Eigenvectors....................................................................36
Calculating Eigenvectors: A Simple Example.................................................. 36
Gauss Jordan Elimination (Optional).........................................................37
PCA (Principal Component Analysis).............................................................. 38
The New Matrix of Eigenvectors................................................................ 40
Well-Known Distance Metrics.......................................................................... 41
Pearson Correlation Coefficient.................................................................. 41
Jaccard Index (or Similarity).......................................................................41
Locality-Sensitive Hashing (Optional)......................................... 42

Types of Distance Metrics................................................................................. 42


What is Bayesian Inference?............................................................................. 44
Bayes’ Theorem........................................................................................... 44
Some Bayesian Terminology....................................................................... 44
What is MAP?.............................................................................................. 45
Why Use Bayes’ Theorem?......................................................................... 45
Summary............................................................................................................ 45

Chapter 3 Linear Algebra Concepts.....................................47


What is Linear Algebra?.................................................................................... 48
What are Vectors?.............................................................................................. 48
The Norm of a Vector.................................................................................. 48
The Inner Product of Two Vectors.............................................................. 48
The Cosine Similarity of Two Vectors......................................................... 49
Bases and Spanning Sets............................................................................. 50
Three Dimensional Vectors and Beyond.................................................... 50
What are Matrices?........................................................................................... 51
Add and Multiply Matrices.......................................................................... 51
The Determinant of a Square Matrix.........................................................51
Well-Known Matrices........................................................................................ 52
Properties of Orthogonal Matrices.............................................................53
Operations Involving Vectors and Matrices................................................ 53
Gauss Jordan Elimination (Optional)............................................................... 53
Covariance and Correlation Matrices............................................................... 54
The Covariance Matrix................................................................................ 55
Covariance Matrix: An Example................................................................. 56
The Correlation Matrix................................................................................ 56
Eigenvalues and Eigenvectors.......................................................................... 57
Calculating Eigenvectors: A Simple Example............................................ 57
What is PCA (Principal Component Analysis)?............................................... 58
The Main Steps in PCA..................................................................................... 58
The New Matrix of Eigenvectors................................................................ 60
Dimensionality Reduction................................................................................ 60
Dimensionality Reduction Techniques............................................................ 61
The Curse of Dimensionality...................................................................... 62
SVD (Singular Value Decomposition)........................................................ 62
LLE (Locally Linear Embedding).............................................................. 63
UMAP.......................................................................................................... 63
t-SNE........................................................................................................... 64
PHATE......................................................................................................... 64
Linear Versus Non-Linear Reduction Techniques.......................................... 65
Complex Numbers (Optional).......................................................................... 66
Complex Numbers on the Unit Circle........................................................ 66
Complex Conjugate Root Theorem............................................................ 67
Hermitian Matrices.....................................................................................67
Summary............................................................................................................ 67

Chapter 4 Introduction to Python........................................69


Tools for Python................................................................................................. 69
easy_install and pip...................................................................................... 70
virtualenv...................................................................................................... 70
Python Installation............................................................................................ 70
Setting the PATH Environment Variable (Windows Only)............................. 71
Launching Python on Your Machine................................................................ 71
The Python Interactive Interpreter............................................................ 71
Python Identifiers.............................................................................................. 72
Lines, Indentations, and Multi-Lines............................................................... 72
Quotation and Comments in Python................................................................ 73
Saving Your Code in a Module......................................................................... 74
Some Standard Modules in Python.................................................................. 75
The help() and dir() Functions................................................................... 76
Compile Time and Runtime Code Checking................................................... 77
Simple Data Types in Python............................................................................ 77
Working with Numbers..................................................................................... 77
Working with Other Bases........................................................................... 79
The chr() Function....................................................................................79
The round() Function in Python..............................................................80
Formatting Numbers in Python.................................................................. 80
Unicode and UTF-8.......................................................................................... 81
Working with Unicode...................................................................................... 81
Working with Strings......................................................................................... 82
Comparing Strings....................................................................................... 83
Formatting Strings in Python...................................................................... 84
Uninitialized Variables and the Value None in Python....................................84
Slicing and Splicing Strings............................................................................... 84
Testing for Digits and Alphabetic Characters............................................. 85
Search and Replace a String in Other Strings.................................................. 86
Remove Leading and Trailing Characters........................................................ 87
Printing Text without NewLine Characters..................................................... 88
Text Alignment.................................................................................................. 88
Working with Dates........................................................................................... 89
Converting Strings to Dates........................................................................ 90
Exception Handling in Python.......................................................................... 90
Handling User Input......................................................................................... 92
Command-Line Arguments.............................................................................. 94
Precedence of Operators in Python.................................................................. 95
Python Reserved Words.................................................................................... 95
Working with Loops in Python......................................................................... 96
Python For Loops........................................................................................ 96
A For Loop with try/except in Python................................................... 97
Numeric Exponents in Python.................................................................... 97
Nested Loops..................................................................................................... 98
The split() Function with For Loops..........................................................99

Using the split() Function to Compare Words...........................................99


Using the split() Function to Print Justified Text.....................................100
Using the split() Function to Print Fixed Width Text..............................101
Using the split() Function to Compare Text Strings................................102
Using the split() Function to Display Characters in a String............. 103
The join() Function.....................................................................................103
Python While Loops........................................................................................ 104
Conditional Logic in Python........................................................................... 105
The break/continue/pass Statements...................................................... 105
Comparison and Boolean Operators.............................................................. 106
The in/not in/is/is not Comparison Operators.............................106
The and, or, and not Boolean Operators................................................... 106
Local and Global Variables............................................................................. 107
Scope of Variables...........................................................................................107
Pass by Reference Versus Value...................................................................... 109
Arguments and Parameters............................................................................. 110
Using a While Loop to Find the Divisors of a Number................................110
Using a While Loop to Find Prime Numbers.......................................... 111
User-Defined Functions in Python................................................................ 112
Specifying Default Values in a Function........................................................ 113
Returning Multiple Values from a Function............................................ 113
Functions with a Variable Number of Arguments......................................... 114
Lambda Expressions....................................................................................... 114
Recursion......................................................................................................... 115
Calculating Factorial Values...................................................................... 115
Calculating Fibonacci Numbers................................................................ 116
Working with Lists........................................................................................... 117
Lists and Basic Operations........................................................................ 117
Reversing and Sorting a List..................................................................... 119
Lists and Arithmetic Operations............................................................... 120
Lists and Filter-related Operations........................................................... 120
Sorting Lists of Numbers and Strings............................................................121
Expressions in Lists......................................................................................... 122
Concatenating a List of Words........................................................................ 122
The Python range() Function....................................................................123
Counting Digits, Uppercase, and Lowercase Letters.............................. 123
Arrays and the append() Function...............................................................124
Working with Lists and the split() Function.............................................125
Counting Words in a List................................................................................ 125
Iterating Through Pairs of Lists...................................................................... 126
Other List-Related Functions......................................................................... 127
Working with Vectors...................................................................................... 128
Working with Matrices.................................................................................... 129
Queues............................................................................................................. 130
Tuples (Immutable Lists)................................................................................ 130

Sets................................................................................................................... 131
Dictionaries..................................................................................................... 132
Creating a Dictionary................................................................................ 132
Displaying the Contents of a Dictionary................................................... 132
Checking for Keys in a Dictionary............................................................ 133
Deleting Keys from a Dictionary.............................................................. 133
Iterating Through a Dictionary................................................................. 134
Interpolating Data from a Dictionary....................................................... 134
Dictionary Functions and Methods................................................................ 134
Dictionary Formatting.................................................................................... 135
Ordered Dictionaries...................................................................................... 135
Sorting Dictionaries................................................................................... 135
Python Multi Dictionaries......................................................................... 136
Other Sequence Types in Python................................................................... 136
Mutable and Immutable Types in Python...................................................... 137
The type() Function.....................................................................................138
Summary.......................................................................................................... 138

Chapter 5 Introduction to NumPy......................................139


What is NumPy?.............................................................................................. 139
Useful NumPy Features............................................................................ 140
What are NumPy Arrays?................................................................................ 140
Working with Loops........................................................................................ 141
Appending Elements to Arrays (1)................................................................. 142
Appending Elements to Arrays (2)................................................................. 142
Multiplying Lists and Arrays........................................................................... 143
Doubling the Elements in a List.................................................................... 144
Lists and Exponents........................................................................................ 144
Arrays and Exponents..................................................................................... 145
Math Operations and Arrays........................................................................... 145
Working with “-1” Sub-ranges with Vectors................................................... 146
Working with “-1” Sub-ranges with Arrays..................................................... 146
Other Useful NumPy Methods....................................................................... 147
Arrays and Vector Operations......................................................................... 147
NumPy and Dot Products (1)......................................................................... 148
NumPy and Dot Products (2)......................................................................... 149
NumPy and the Length of Vectors................................................................. 149
NumPy and Other Operations........................................................................ 150
NumPy and the reshape() Method............................................................. 151
Calculating the Mean and Standard Deviation.............................................. 152
Code Sample with Mean and Standard Deviation......................................... 153
Trimmed Mean and Weighted Mean........................................................ 154
Working with Lines in the Plane (Optional).................................................. 155
Plotting Randomized Points with NumPy and Matplotlib............................ 157
Plotting a Quadratic with NumPy and Matplotlib......................................... 157

What is Linear Regression?............................................................................ 158


What is Multivariate Analysis?.................................................................. 159
What about Non-Linear Datasets?........................................................... 159
The MSE (Mean Squared Error) Formula.................................................... 160
Other Error Types..................................................................................... 161
Non-Linear Least Squares........................................................................ 161
Calculating the MSE Manually....................................................................... 161
Find the Best-Fitting Line in NumPy............................................................ 163
Calculating MSE by Successive Approximation (1)....................................... 164
Calculating MSE by Successive Approximation (2)....................................... 166
Google Colaboratory....................................................................................... 168
Uploading CSV Files in Google Colaboratory.......................................... 169
Summary.......................................................................................................... 170

Chapter 6 Introduction to Pandas......................................171


What is Pandas?............................................................................................... 171
Pandas Options and Settings..................................................................... 172
Pandas Data Frames.................................................................................. 172
Data Frames and Data Cleaning Tasks..................................................... 173
Alternatives to Pandas............................................................................... 173
A Pandas Data Frame with a NumPy Example............................................. 174
Describing a Pandas Data Frame................................................................... 175
Pandas Boolean Data Frames......................................................................... 177
Transposing a Pandas Data Frame............................................................ 178
Pandas Data Frames and Random Numbers................................................. 179
Reading CSV Files in Pandas.......................................................................... 180
The loc() and iloc() Methods in Pandas.................................................. 181
Converting Categorical Data to Numeric Data............................................. 182
Matching and Splitting Strings in Pandas....................................................... 185
Converting Strings to Dates in Pandas........................................................... 187
Merging and Splitting Columns in Pandas..................................................... 189
Combining Pandas Data Frames.................................................................... 190
Data Manipulation with Pandas Data Frames (1)......................................... 191
Data Manipulation with Pandas Data Frames (2)......................................... 192
Data Manipulation with Pandas Data Frames (3)......................................... 193
Pandas Data Frames and CSV Files............................................................... 194
Managing Columns in Data Frames............................................................... 196
Switching Columns.................................................................................... 197
Appending Columns.................................................................................. 197
Deleting Columns...................................................................................... 198
Inserting Columns..................................................................................... 198
Scaling Numeric Columns......................................................................... 199
Managing Rows in Pandas.............................................................................. 201
Selecting a Range of Rows in Pandas........................................................ 201
Finding Duplicate Rows in Pandas........................................................... 202
Inserting New Rows in Pandas.................................................................. 205

Handling Missing Data in Pandas..................................................................206


Multiple Types of Missing Values.............................................................. 208
Test for Numeric Values in a Column....................................................... 209
Replacing NaN Values in Pandas.............................................................. 209
Sorting Data Frames in Pandas...................................................................... 211
Working with groupby() in Pandas.............................................................. 212
Working with apply() and applymap() in Pandas.....................214
Handling Outliers in Pandas........................................................................... 217
Pandas Data Frames and Scatterplots............................................................ 219
Pandas Data Frames and Simple Statistics.................................................... 220
Aggregate Operations in Pandas Data Frames.............................................. 222
Aggregate Operations with the titanic.csv Dataset................................. 223
Save Data Frames as CSV Files and Zip Files............................................... 225
Pandas Data Frames and Excel Spreadsheets............................................... 225
Working with JSON-based Data..................................................................... 226
Python Dictionary and JSON.................................................................... 227
Python, Pandas, and JSON........................................................................ 228
Useful One-line Commands in Pandas..........................................................229
What is Method Chaining?............................................................................. 230
Pandas and Method Chaining................................................................... 230
Pandas Profiling............................................................................................... 230
Summary.......................................................................................................... 231

Chapter 7 Introduction to R...............................................233


What is R?........................................................................................................ 234
Features of R.............................................................................................. 234
Installing R and RStudio........................................................................... 234
Variable Names, Operators, and Data Types in R......................................... 234
Assigning Values to Variables in R............................................................. 235
Operators in R............................................................................................ 235
Data Types in R.......................................................................................... 235
Working with Strings in R............................................................................... 236
Uppercase and Lowercase Strings............................................................ 237
String-Related Tasks.................................................................................. 237
Working with Vectors in R.............................................................................. 238
Finding NULL Values in a Vector in R.................................................... 240
Updating NA Values in a Vector in R........................................................ 241
Sorting a Vector of Elements in R............................................................. 242
Working with the Alphabet Variable in R................................................. 242
Working with Lists in R................................................................................... 243
Working with Matrices in R (1)...................................................................... 247
Working with Matrices in R (2)...................................................................... 250
Working with Matrices in R (3)...................................................................... 251
Working with Matrices in R (4)...................................................................... 252
Working with Matrices in R (5)...................................................................... 253
Updating Matrix Elements............................................................................. 254

Logical Constraints and Matrices................................................................... 255


Working with Matrices in R (6)...................................................................... 256
Combining Vectors, Matrices, and Lists in R................................................. 257
Working with Dates in R................................................................................. 258
The seq Function in R....................................................................................259
Basic Conditional Logic.................................................................................. 261
Compound Conditional Logic........................................................................ 261
Working with User Input................................................................................ 263
A Try/Catch Block in R................................................................................... 263
Linear Regression in R.................................................................................... 264
Working with Simple Loops in R.................................................................... 264
Working with Nested Loops in R................................................................... 265
Working with While Loops in R..................................................................... 266
Working with Conditional Logic in R............................................................. 266
Add a Sequence of Numbers in R.................................................................. 267
Check if a Number is Prime in R................................................................... 267
Check if Numbers in an Array are Prime in R............................................... 268
Check for Leap Years in R.............................................................................. 269
Well-formed Triangle Values in R................................................................... 270
What are Factors in R?................................................................................... 271
What are Data Frames in R?.......................................................................... 271
Working with Data Frames in R (1)............................................................... 273
Working with Data Frames in R (2)............................................................... 274
Working with Data frames in R (3)................................................................. 275
Sort a Data Frame by Column........................................................................ 276
Reading Excel Files in R................................................................................. 277
Reading SQLite Tables in R...........................................................278
Reading Text Files in R................................................................................... 279
Saving and Restoring Objects in R................................................................. 281
Data Visualization in R.................................................................................... 282
Working with Bar Charts in R (1)................................................................... 282
Working with Bar Charts in R (2)................................................................... 283
Working with Line Graphs in R...................................................................... 284
Working with Functions in R.......................................................................... 286
Math-related Functions in R.......................................................................... 287
Some Operators and Set Functions in R........................................................ 288
The “Apply Family” of Built-in Functions..................................................... 289
The dplyr Package in R.................................................................................291
The Pipe Operator %>%................................................................................ 292
Working with CSV Files in R.......................................................................... 294
Working with XML in R.................................................................................. 295
Reading an XML Document into an R Data Frame..................................... 296
Working with JSON in R..................................................................................297
Reading a JSON File into an R Data Frame...................................................298
Statistical Functions in R................................................................................ 298
Summary Functions in R................................................................................ 299

Defining a Custom Function in R.................................................................. 301


Recursion in R................................................................................................. 302
Calculating Factorial Values in R (Non-recursive)........................................ 302
Calculating Factorial Values in R (recursive)................................................. 303
Calculating Fibonacci Numbers in R (Non-recursive).................................. 303
Calculating Fibonacci Numbers in R (Recursive)......................................... 305
Convert a Decimal Integer to a Binary Integer in R..................................... 305
Calculating the GCD of Two Integers in R.................................................... 306
Calculating the LCM of Two Integers in R.................................................... 307
Summary.......................................................................................................... 308

Chapter 8 Regular Expressions...........................................309


What are Regular Expressions?...................................................................... 310
Metacharacters in Python............................................................................... 310
Character Sets in Python................................................................................. 312
Working with “^” and “\”........................................................................... 313
Character Classes in Python........................................................................... 313
Matching Character Classes with the re Module.......................................... 314
Using the re.match() Method..................................................................... 315
Options for the re.match() Method............................................................ 318
Matching Character Classes with the re.search() Method...................... 318
Matching Character Classes with the findAll() Method..........................319
Finding Capitalized Words in a String...................................................... 320
Additional Matching Function for Regular Expressions............................... 320
Grouping with Character Classes in Regular Expressions............................ 321
Using Character Classes in Regular Expressions........................................... 322
Matching Strings with Multiple Consecutive Digits................................ 323
Reversing Words in Strings....................................................................... 323
Modifying Text Strings with the re Module.................................................. 324
Splitting Text Strings with the re.split() Method.................................... 324
Splitting Text Strings Using Digits and Delimiters........................................ 325
Substituting Text Strings with the re.sub() Method..................................325
Matching the Beginning and the End of Text Strings................................... 326
Compilation Flags........................................................................................... 328
Compound Regular Expressions.................................................................... 328
Counting Character Types in a String............................................................ 329
Regular Expressions and Grouping................................................................ 330
Simple String Matches.................................................................................... 330
Pandas and Regular Expressions.................................................................... 331
Summary.......................................................................................................... 334
Exercises.......................................................................................................... 335

Chapter 9 SQL and NoSQL.................................................337


What is an RDBMS?...........................................................................................338
A Four-Table RDBMS........................................................................................339
The customers Table.................................................................................. 339

The purchase_orders Table....................................................................... 340


The line_items Table................................................................................. 342
The item_desc Table.................................................................................. 343
What is SQL?....................................................................................................344
What is DCL?..............................................................................................344
What is DDL?..............................................................................................345
Delete Vs. Drop Vs. Truncate................................................... 345
What is DQL?..............................................................................................345
What is DML?..............................................................................................346
What is TCL?..............................................................................................347
Data Types in MySQL.................................................................................347
Working with MySQL........................................................................................348
Logging into MySQL...................................................................................348
Creating a MySQL Database.......................................................................348
Creating and Dropping Tables........................................................................ 348
Manually Creating Tables for mytools.com.............................................. 349
Creating Tables via a SQL Script for mytools.com (1).............................. 350
Creating Tables via a SQL Script for mytools.com (2).............................. 350
Creating Tables from the Command Line................................................ 351
Dropping Tables via a SQL Script for mytools.com.................................. 352
Populating Tables with Seed Data.................................................................. 352
Populating Tables from Text Files.................................................................. 354
Simple SELECT Statements.......................................................................... 356
Select Statements with a WHERE Clause....................................................356
Select Statements with GROUP BY Clause............................................. 356
Select Statements with a HAVING Clause............................................... 357
Working with Indexes in SQL..........................................................................357
What are Keys in an RDBMS?...........................................................................357
Aggregate and Boolean Operations in SQL.................................................... 358
Joining Tables in SQL.......................................................................................361
Defining Views in MySQL................................................................................361
Entity Relationships........................................................................................ 363
One-to-Many Entity Relationships................................................................. 364
Many-to-Many Entity Relationships..............................................................364
Self-Referential Entity Relationships............................................................. 366
Working with Subqueries in SQL....................................................................366
Other Tasks in SQL..........................................................................................367
Reading MySQL Data from Pandas.................................................................368
Export SQL Data to Excel...............................................................................370
What is Normalization?................................................................................... 371
What are Schemas?......................................................................................... 372
Other RDBMS Topics..................................................................................... 373
Working with NoSQL........................................................................................373
Create MongoDB Cellphones Collection................................................. 374
Sample Queries in MongoDB.....................................................................375
Summary.......................................................................................................... 377

Chapter 10 Data Visualization............................................379


What is Data Visualization?............................................................................380
Types of Data Visualization....................................................................... 380
What is Matplotlib?......................................................................................... 381
Horizontal Lines in Matplotlib....................................................................... 381
Slanted Lines in Matplotlib.............................................................. 382
Parallel Slanted Lines in Matplotlib................................................. 383
A Grid of Points in Matplotlib.......................................................... 384
A Dotted Grid in Matplotlib........................................................................... 385
Lines in a Grid in Matplotlib.......................................................................... 386
A Colored Grid in Matplotlib......................................................................... 386
A Colored Square in an Unlabeled Grid in Matplotlib................ 387
Randomized Data Points in Matplotlib.......................................................... 389
A Histogram in Matplotlib.............................................................................. 390
A Set of Line Segments in Matplotlib............................................................ 391
Plotting Multiple Lines in Matplotlib............................................................. 391
Trigonometric Functions in Matplotlib.......................................................... 392
Display IQ Scores in Matplotlib..................................................................... 393
Plot a Best-Fitting Line in Matplotlib............................................................ 393
Introduction to Sklearn (scikit-learn)............................................................. 395
The Digits Dataset in Sklearn......................................................................... 395
The Iris Dataset in Sklearn (1)........................................................................ 398
Sklearn, Pandas, and the Iris Dataset....................................................... 400
The Iris Dataset in Sklearn (2)........................................................................ 402
The Faces Dataset in Sklearn (Optional)....................................................... 404
Working with Seaborn..................................................................................... 405
Features of Seaborn................................................................................... 406
Seaborn Built-in Datasets............................................................................... 406
The Iris Dataset in Seaborn............................................................................ 407
The Titanic Dataset in Seaborn...................................................................... 407
Extracting Data from the Titanic Dataset in Seaborn (1)..............................408
Extracting Data from the Titanic Dataset in Seaborn (2)..............................410
Visualizing a Pandas Dataset in Seaborn........................................................ 412
Data Visualization in Pandas........................................................................... 414
What is Bokeh?................................................................................................ 415
Summary.......................................................................................................... 417
Index............................................................................................ 419
Preface

WHAT IS THE PRIMARY VALUE PROPOSITION FOR THIS BOOK?

This book contains a fast-paced introduction to as much relevant informa-
tion about data analytics as possible that can be reasonably included in a book
of this size. Please keep in mind the following point: this book is intended to
provide you with a broad overview of many relevant technologies.
As such, you will be exposed to a variety of features of NumPy and Pandas,
how to write regular expressions (with an accompanying chapter), and how
to perform many data cleaning tasks. Keep in mind that some topics are pre-
sented in a cursory manner, which is for two main reasons. First, it’s important
that you be exposed to these concepts. In some cases, you will find topics that
might pique your interest, and hence motivate you to learn more about them
through self-study; in other cases, you will probably be satisfied with a brief
introduction. In other words, you can decide whether to delve into more detail
regarding the topics in this book.
Second, a full treatment of all the topics that are covered in this book would
significantly increase its size.
However, it’s important for you to decide if this approach is suitable for
your needs and learning style. If not, you can select one or more of the plethora
of data analytics books that are available.

THE TARGET AUDIENCE

This book is intended primarily for people who have worked with Python
and are interested in learning about several important Python libraries, such
as NumPy and Pandas.
This book is also intended to reach an international audience of read-
ers with highly diverse backgrounds. While many readers know how to read
English, their native spoken language is not English. Consequently, this book
uses standard English rather than colloquial expressions that might be confus-
ing to those readers. As you know, many people learn by different types of
imitation, which includes reading, writing, or hearing new material. This book
takes these points into consideration to provide a comfortable and meaningful
learning experience for the intended readers.

WHAT WILL I LEARN FROM THIS BOOK?

The first chapter contains a quick tour of basic Python 3, followed by a
chapter that introduces you to data types and data cleaning tasks, such as work-
ing with datasets that contain different types of data and how to handle missing
data. The third and fourth chapters introduce you to NumPy and Pandas (and
many code samples).
The fifth chapter contains fundamental concepts in probability and sta-
tistics, such as mean, mode, and variance and correlation matrices. You will
also learn about Gini impurity, entropy, and KL-divergence. You will also learn
about eigenvalues, eigenvectors, and PCA (Principal Component Analysis).
The sixth chapter of this book delves into Pandas, followed by Chapter 7
about R programming. Chapter 8 covers regular expressions and provides you
with plenty of examples. Chapter 9 discusses both SQL and NoSQL, and then
Chapter 10 discusses data visualization with numerous code samples for Mat-
plotlib, Seaborn, and Bokeh.

WHY ARE THE CODE SAMPLES PRIMARILY IN PYTHON?

Most of the code samples are short (usually less than one page and some-
times less than half a page), and if need be, you can easily and quickly copy/
paste the code into a new Jupyter notebook. For the Python code samples that
reference a CSV file, you do not need any additional code in the corresponding
Jupyter notebook to access the CSV file. Moreover, the code samples execute
quickly, so you won’t need to avail yourself of the free GPU that is provided in
Google Colaboratory.
If you do decide to use Google Colaboratory, you can avail yourself of many
useful features of Colaboratory (e.g., the upload feature to upload existing Ju-
pyter notebooks). If the Python code references a CSV file, make sure that you
include the appropriate code snippet (as explained in Chapter 1) to access the
CSV file in the corresponding Jupyter notebook in Google Colaboratory.

DO I NEED TO LEARN THE THEORY PORTIONS OF THIS BOOK?

Once again, the answer depends on the extent to which you plan to become
involved in data analytics. For example, if you plan to study machine learning,
then you will probably learn how to create and train a model, which is a task
that is performed after data cleaning tasks. In general, you will probably need
to learn everything that you encounter in this book if you are planning to be-
come a machine learning engineer.

WHY DOES THIS BOOK INCLUDE SKLEARN MATERIAL?

The Sklearn material in this book is minimalistic because this book is not
about machine learning. The Sklearn material is located in Chapter 6, where
you will learn about some of the Sklearn built-in datasets. If you decide to
delve into machine learning, you will have already been introduced to some
aspects of Sklearn.

WHY IS A REGEX CHAPTER INCLUDED IN THIS BOOK?

Regular expressions are supported in multiple languages (including Java
and JavaScript) and they enable you to perform complex tasks with very com-
pact regular expressions. Regular expressions can seem arcane and too com-
plex to learn in a reasonable amount of time. Chapter 2 contains some Pandas-
based code samples that use regular expressions to perform tasks that might
otherwise be more complicated.
If you plan to use Pandas extensively or you plan to work on NLP-related
tasks, then the code samples in this chapter will be very useful for you because
they are more than adequate for solving certain types of tasks, such as remov-
ing HTML tags. Moreover, your knowledge of RegEx will transfer instantly to
other languages that support regular expressions.

GETTING THE MOST FROM THIS BOOK

Some programmers learn well from prose, others learn well from sample
code (and lots of it), which means that there’s no single style that can be used
for everyone.
Moreover, some programmers want to run the code first, see what it does,
and then return to the code to delve into the details (and others use the op-
posite approach).
Consequently, there are various types of code samples in this book: some
are short, some are long, and other code samples “build” from earlier code
samples.

WHAT DO I NEED TO KNOW FOR THIS BOOK?

Current knowledge of Python 3.x is the most helpful skill. Knowledge of
other programming languages (such as Java) can also be helpful because of the
exposure to programming concepts and constructs. The less technical knowl-
edge that you have, the more diligence will be required to understand the
various topics that are covered.

If you want to be sure that you can grasp the material in this book, glance
through some of the code samples to get an idea of how much is familiar to you
and how much is new for you.

DON’T THE COMPANION FILES OBVIATE THE NEED FOR THIS BOOK?

The companion files contain all the code samples to save you time and
effort from the error-prone process of manually typing code into a text file. In
addition, there are situations in which you might not have easy access to the
companion files. Furthermore, the code samples in the book provide explana-
tions that are not available in the companion files.

DOES THIS BOOK CONTAIN PRODUCTION-LEVEL CODE SAMPLES?

The primary purpose of the code samples in this book is to show you Py-
thon-based libraries for solving a variety of data-related tasks in conjunction
with acquiring a rudimentary understanding of statistical concepts. Clarity has
higher priority than writing more compact code that is more difficult to under-
stand (and possibly more prone to bugs). If you decide to use any of the code
in this book in a production website, you should subject that code to the same
rigorous analysis as the other parts of your code base.

WHAT ARE THE NON-TECHNICAL PREREQUISITES FOR THIS BOOK?

Although the answer to this question is more difficult to quantify, it’s impor-
tant to have a strong desire to learn about data science, along with the motivation
and discipline to read and understand the code samples.

HOW DO I SET UP A COMMAND SHELL?

If you are a Mac user, there are three ways to do so. The first method is to
use Finder to navigate to Applications > Utilities and then double
click on the Terminal application. Next, if you already have a command
shell available, you can launch a new command shell by typing the following
command:
open /Applications/Utilities/Terminal.app

A second method for Mac users is to open a new command shell on a Macbook
from a command shell that is already visible simply by pressing command+n in
that command shell, and your Mac will launch another command shell.
If you are a PC user, you can install Cygwin (which is open source:
https://cygwin.com/) that simulates bash commands, or use another toolkit such
as MKS (a commercial product). Please read the online documentation that
describes the download and installation process. Note that custom aliases are
not automatically set if they are defined in a file other than the main start-up
file (such as .bash_login).

COMPANION FILES

All the code samples and figures in this book may be obtained by writing to
the publisher at info@merclearning.com.

WHAT ARE THE NEXT STEPS AFTER FINISHING THIS BOOK?

The answer to this question varies widely, mainly because the answer de-
pends heavily on your objectives. If you are interested primarily in NLP, then
you can learn more advanced concepts, such as attention, transformers, and
the BERT-related models.
If you are primarily interested in machine learning, there are some sub-
fields of machine learning, such as deep learning and reinforcement learning
(and deep reinforcement learning) that might appeal to you. Fortunately, there
are many resources available, and you can perform an Internet search for those
resources. One other point: the aspects of machine learning for you to learn
depend on who you are. The needs of a machine learning engineer, data scien-
tist, manager, student, or software developer are all different.

Oswald Campesato
April 2021
CHAPTER 1
Working With Data

This chapter introduces you to data types, how to scale data values, and
various techniques for handling missing data values. If most of the ma-
terial in this chapter is new to you, be assured that it’s not necessary
to understand everything in this chapter. It’s still a good idea to read as much
material as you can absorb, and perhaps return to this chapter again after you
have completed some of the other chapters in this book.
The first part of this chapter contains an overview of different types of data
and an explanation of how to normalize and standardize a set of numeric values
by calculating the mean and standard deviation of a set of numbers. You will
see how to map categorical data to a set of integers and how to perform one-
hot encoding.
The second part of this chapter discusses missing data, outliers, and anoma-
lies, as well as some techniques for handling these scenarios. The third section
discusses imbalanced data and the use of SMOTE (Synthetic Minority Over-
sampling Technique) to deal with imbalanced classes in a dataset.
The fourth section discusses ways to evaluate classifiers such as LIME and
ANOVA. This section also contains details regarding the bias-variance trade-
off and various types of statistical bias.

WHAT ARE DATASETS?

In simple terms, a dataset is a source of data (such as a text file) that con-
tains rows and columns of data. Each row is typically called a data point, and
each column is called a feature. A dataset can be in a range of formats: CSV
(comma separated values), TSV (tab separated values), Excel spreadsheet, a
table in an RDBMS (Relational Database Management System), a document
in a NoSQL database, or the output from a Web service. Someone needs to
analyze the dataset to determine which features are the most important and
which features can be safely ignored to train a model with the given dataset.
A dataset can vary from very small (a couple of features and 100 rows) to
very large (more than 1,000 features and more than one million rows). If you
are unfamiliar with the problem domain, then you might struggle to determine
the most important features in a large dataset. In this situation, you might need
a domain expert who understands the importance of the features, their inter-
dependencies (if any), and whether the data values for the features are valid.
In addition, there are algorithms (called dimensionality reduction algorithms)
that can help you determine the most important features. For example, PCA
(Principal Component Analysis) is one such algorithm, which is discussed in
more detail in Chapter 2.

Data Preprocessing
Data preprocessing is the initial step of validating the contents of
a dataset, which involves making decisions about missing data, duplicate data,
and incorrect data values:

• dealing with missing data values
• cleaning “noisy” text-based data
• removing HTML tags
• removing emoticons
• dealing with emojis/emoticons
• filtering data
• grouping data
• handling currency and date formats (i18n)

Cleaning data is done before data wrangling, and involves removing unwanted
data as well as handling missing data. In the case of text-based data, you might
need to remove HTML tags and punctuation. In the case of numeric data, it’s
less likely (though still possible) that alphabetic characters are mixed together
with numeric data. However, a dataset with numeric features might have in-
correct values or missing values (discussed later). In addition, calculating the
minimum, maximum, mean, median, and standard deviation of the values of a
feature obviously pertains only to numeric values.
After the preprocessing step is completed, data wrangling is performed,
which refers to transforming data into a new format. You might have to com-
bine data from multiple sources into a single dataset. For example, you might
need to convert between different units of measurement (such as date formats
or currency values) so that the data values can be represented in a consistent
manner in a dataset.
Currency and date values are part of i18n (internationalization), whereas
l10n (localization) targets a specific nationality, language, or region. Hard-
coded values (such as text strings) can be stored as resource strings in a file
called a resource bundle, where each string is referenced via a code. Each
language has its own resource bundle.

DATA TYPES

Explicit data types exist in many programming languages, such as C, C++,
Java, and TypeScript. Some programming languages, such as JavaScript and
awk, do not require initializing variables with an explicit type: the type of a
variable is inferred dynamically via an implicit type system (i.e., one that is not
directly exposed to a developer).
In machine learning, datasets can contain features that have different types
of data, such as a combination of one or more of the following:

• numeric data (integer/floating point and discrete/continuous)
• character/categorical data (different languages)
• date-related data (different formats)
• currency data (different formats)
• binary data (yes/no, 0/1, and so forth)
• nominal data (multiple unrelated values)
• ordinal data (multiple and related values)

Consider a dataset that contains real estate data, which can have as many as
thirty columns (or even more), often with the following features:

• the number of bedrooms in a house: numeric value and a discrete value
• the number of square feet: a numeric value and (probably) a continuous
value
• the name of the city: character data
• the construction date: a date value
• the selling price: a currency value and probably a continuous value
• the “for sale” status: binary data (either “yes” or “no”)

An example of nominal data is the seasons in a year. Although many (most?)
countries have four distinct seasons, some countries have two distinct seasons.
However, keep in mind that seasons can be associated with different tempera-
ture ranges (summer versus winter). An example of ordinal data is an employ-
ee’s pay grade: 1=entry level, 2=one year of experience, and so forth. Another
example of nominal data is a set of colors, such as {Red, Green, Blue}.
An example of binary data is the pair {Male, Female}, and some datasets
contain a feature with these two values. If such a feature is required for train-
ing a model, first convert {Male, Female} to a numeric counterpart, such as
{0,1}. Similarly, if you need to include a feature whose values are the previous
set of colors, you can replace {Red, Green, Blue} with the values {0,1,2}. Cat-
egorical data is discussed in more detail later in this chapter.

PREPARING DATASETS

If you have the good fortune to inherit a dataset that is in pristine condi-
tion, then data cleaning tasks (discussed later) are vastly simplified: in fact, it
might not be necessary to perform any data cleaning for the dataset. On the
other hand, if you need to create a dataset that combines data from multiple
datasets that contain different formats for dates and currency, then you need
to perform a conversion to a common format.
If you need to train a model that includes features that have categorical
data, then you need to convert that categorical data to numeric data. For in-
stance, the Titanic dataset contains a feature called “sex,” which is either male
or female. As you will see later in this chapter, Pandas makes it extremely sim-
ple to “map” male to 0 and female to 1.

Discrete Data Versus Continuous Data


Discrete data is a set of values that can be counted, whereas continuous
data must be measured. Discrete data can “reasonably” fit in a drop-down list
of values, but there is no exact value for making such a determination. One
person might think that a list of 500 values is discrete, whereas another person
might think it’s continuous.
For example, the list of provinces of Canada and the list of states of the
United States are discrete data values, but is the same true for the number
of countries in the world (roughly 200) or for the number of languages in the
world (more than 7,000)?
On the other hand, values for temperature, humidity, and barometric pres-
sure are considered continuous. Currency is also treated as continuous, even
though there is a measurable difference between two consecutive values. The
smallest unit of currency for U.S. currency is one penny, which is 1/100th of a
dollar (accounting-based measurements use the “mil,” which is 1/1,000th of a
dollar).
Continuous data types can have subtle differences. For example, someone
who is 200 centimeters tall is twice as tall as someone who is 100 centimeters
tall; the same is true for 100 kilograms versus 50 kilograms. However, tem-
perature is different: 80 degrees Fahrenheit is not twice as hot as 40 degrees
Fahrenheit.
Furthermore, keep in mind that the meaning of the word “continuous” in
mathematics is not necessarily the same as continuous in machine learning.
In the former, a continuous function (let’s say in the 2D Euclidean plane) can
have an uncountably infinite number of values. On the other hand, a feature
in a dataset that can have more values than can be “reasonably” displayed in a
drop-down list is treated as though it’s a continuous variable.
For instance, values for stock prices are discrete: they must differ by at least
a penny (or some other minimal unit of currency), which is to say, it’s mean-
ingless to say that the stock price changes by one-millionth of a penny. How-
ever, since there are so many possible stock values, it’s treated as a continuous
variable. The same comments apply to car mileage, ambient temperature, and
barometric pressure.

“Binning” Continuous Data


The concept of binning refers to subdividing a set of values into multiple
intervals, and then treating all the numbers in the same interval as though they
had the same value.
As a simple example, suppose that a feature in a dataset contains the age of
people in a dataset. The range of values is approximately between 0 and 120,
and we could bin them into 12 equal intervals, where each consists of 10 val-
ues: 0 through 9, 10 through 19, 20 through 29, and so forth.
However, partitioning the values of people’s ages as described in the pre-
ceding paragraph can be problematic. Suppose that person A, person B, and
person C are 29, 30, and 39, respectively. Then person A and person B are
probably more similar to each other than person B and person C, but because
of the way in which the ages are partitioned, B is classified as closer to C than
to A. In fact, binning can increase Type I errors (false positive) and Type II er-
rors (false negative), as discussed in the following blog post (along with some
alternatives to binning):
https://medium.com/@peterflom/why-binning-continuous-data-is-almost-always-a-mistake-ad0b3a1d141f.
As another example, using quartiles is even more coarse-grained than the ear-
lier age-related binning example. The issue with binning pertains to the con-
sequences of classifying people in different bins, even though they are in close
proximity to each other. For instance, some people struggle financially because
they earn a meager wage, and they are disqualified from financial assistance
because their salary is higher than the cutoff point for receiving any assistance.
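As a concrete sketch of the earlier age-binning example, the Pandas cut
function partitions values into intervals (the sample ages are hypothetical):

import pandas as pd

ages = pd.Series([3, 17, 25, 29, 30, 39, 64, 82])

# Bin ages into twelve 10-year intervals: [0,10), [10,20), ..., [110,120)
bins = pd.cut(ages, bins=range(0, 130, 10), right=False)
print(bins.value_counts().sort_index())
# Note that ages 29 and 30 (persons A and B) land in different bins.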

Scaling Numeric Data via Normalization


A range of values can vary significantly, and it’s important to note that they
often need to be scaled to a smaller range, such as values in the range [-1,1]
or [0,1], which you can do via the tanh function or the sigmoid function,
respectively.
For example, measuring a person’s height in terms of meters involves a
range of values between 0.50 meters and 2.5 meters (in the vast majority of
cases), whereas measuring height in terms of centimeters ranges between 50
centimeters and 250 centimeters: these two units differ by a factor of 100.
A person’s weight in kilograms generally varies between 5 kilograms and 200
kilograms, whereas measuring weight in grams differs by a factor of 1,000.
Distances between objects can be measured in meters or in kilometers, which
also differ by a factor of 1,000.
In general, use units of measure so that the data values in multiple features
belong to a similar range of values. In fact, some machine learning algorithms
require scaled data, often in the range of [0,1] or [-1,1]. In addition to the
tanh and sigmoid functions, there are other techniques for scaling data, such
as standardizing data (such as the Gaussian distribution) and normalizing data
(linearly scaled so that the new range of values is in [0,1]).
The following examples involve a floating point variable X with different
ranges of values that will be scaled so that the new values are in the interval
[0,1].

• Example 1: If the values of X are in the range [0,2], then X/2 is in the
range [0,1].
• Example 2: If the values of X are in the range [3,6], then X-3 is in the
range [0,3], and (X-3)/3 is in the range [0,1].
• Example 3: If the values of X are in the range [-10,20], then X+10 is in
the range [0,30], and (X+10)/30 is in the range [0,1].

In general, suppose that X is a random variable whose values are in the range
[a,b], where a < b. You can scale the data values by performing two steps:

Step 1: X-a is in the range [0,b-a]
Step 2: (X-a)/(b-a) is in the range [0,1]

If X is a random variable that has the values {x1, x2, x3, . . ., xn},
then the formula for normalization involves mapping each xi value to (xi –
min)/(max – min), where min is the minimum value of X and max is the
maximum value of X.
As a simple example, suppose that the random variable X has the values
{-1, 0, 1}. Then min and max are -1 and 1, respectively, and the normaliza-
tion of {-1, 0, 1} is the set of values {(-1-(-1))/2, (0-(-1))/2, (1-
(-1))/2}, which equals {0, 1/2, 1}.
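Here is a minimal NumPy sketch of this normalization formula (the function
name and the sample values are only for illustration):

import numpy as np

def normalize(x):
    # Map each value xi to (xi - min)/(max - min), which lies in [0,1].
    return (x - x.min()) / (x.max() - x.min())

x = np.array([-1.0, 0.0, 1.0])
print(normalize(x))  # [0.  0.5 1. ]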

Scaling Numeric Data via Standardization


The standardization technique involves finding the mean mu and the
standard deviation sigma, and then mapping each xi value to (xi – mu)/sigma.
Recall the following formulas:

mu = [SUM (x)]/n
variance(x) = [SUM (x - xbar)*(x - xbar)]/n
sigma = sqrt(variance)

As a simple illustration of standardization, suppose that the random variable X
has the values {-1, 0, 1}. Then mu and sigma are calculated as follows:

mu = (SUM xi)/n = (-1 + 0 + 1)/3 = 0
variance = [SUM (xi- mu)^2]/n
= [(-1-0)^2 + (0-0)^2 + (1-0)^2]/3
= 2/3
sigma = sqrt(2/3) = 0.816 (approximate value)

Hence, the standardization of {-1, 0, 1} is {-1/0.816, 0/0.816,
1/0.816}, which in turn equals the set of values {-1.2247, 0, 1.2247}
(approximately).

As another example, suppose that the random variable X has the values
{-6, 0, 6}. Then mu and sigma are calculated as follows:

mu = (SUM xi)/n = (-6 + 0 + 6)/3 = 0

variance = [SUM (xi - mu)^2]/n
= [(-6-0)^2 + (0-0)^2 + (6-0)^2]/3
= 72/3
= 24

sigma = sqrt(24) = 4.899 (approximate value)

Hence, the standardization of {-6, 0, 6} is {-6/4.899, 0/4.899,
6/4.899}, which in turn equals the set of values {-1.2247, 0, 1.2247}.
In the preceding two examples, the mean equals 0 in both cases, but the
variance and standard deviation are significantly different. The normalization
of a set of values always produces a set of numbers between 0 and 1.
On the other hand, the standardization of a set of values can generate num-
bers that are less than -1 or greater than 1: this will occur whenever sigma is
less than |mu - xi| for at least one xi, where |mu - xi| is the absolute value
of the difference between mu and that xi value. In the first example, the
nonzero differences equal 1, whereas sigma is 0.816, and therefore the largest
standardized value is greater than 1.
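Here is a minimal NumPy sketch of standardization that matches the preceding
calculations (note that np.std uses the population formula by default, which
is the division by n shown above):

import numpy as np

def standardize(x):
    # Map each value xi to (xi - mu)/sigma.
    return (x - x.mean()) / x.std()

x = np.array([-6.0, 0.0, 6.0])
print(standardize(x))  # [-1.2247  0.  1.2247] (approximately)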

What to Look for in Categorical Data


This section contains suggestions for handling inconsistent data values, and
you can determine which ones to adopt based on any additional factors that are
relevant to your particular task. For example, consider dropping columns that
have very low cardinality (equal to or close to 1), as well as numeric columns
with zero or very low variance.
Next, check the contents of categorical columns for inconsistent spellings
or errors. A good example pertains to the gender category, which can consist of
a combination of the following values:
male
Male
female
Female
m
f
M
F

The preceding categorical values for gender can be replaced with two categori-
cal values (unless you have a valid reason to retain some of the other values).
Moreover, if you are training a model whose analysis involves a single gender,
then you need to determine which rows (if any) of a dataset must be excluded.
Also check categorical data columns for redundant or missing white spaces.

Check for data values that have multiple data types, such as a numerical
column with numbers as numerals and some numbers as strings or objects.
Ensure there are consistent data formats: numbers as integers or floating point num-
bers. Ensure that dates have the same format (for example, do not mix mm/dd/
yyyy date formats with another date format, such as dd/mm/yyyy).

Mapping Categorical Data to Numeric Values


Character data is often called categorical data, examples of which include
people’s names, home or work addresses, and email addresses. Many types of
categorical data involve short lists of values. For example, the days of the week
and the months in a year involve seven and twelve distinct values, respectively.
Notice that the days of the week have a relationship: each day has a previous
day and a next day, and similarly for the months of a year.
On the other hand, the colors of an automobile are independent of each
other. The color red is not “better” or “worse” than the color blue. However,
cars of a certain color can have a statistically higher number of accidents, but
we won’t address that issue here.
There are several well-known techniques for mapping categorical values
to a set of numeric values. A simple example where you need to perform this
conversion involves the gender feature in the Titanic dataset. This feature is
one of the relevant features for training a machine learning model. The gen-
der feature has {M, F} as its set of possible values. As you will see later in this
chapter, Pandas makes it very easy to convert the set of values {M,F} to the
set of values {0,1}.
Another mapping technique involves mapping a set of categorical values to
a set of consecutive integer values. For example, the set {Red, Green, Blue}
can be mapped to the set of integers {0,1,2}. The set {Male, Female} can be
mapped to the set of integers {0,1}. The days of the week can be mapped to
{0,1,2,3,4,5,6}. Note that the first day of the week depends on the country (in
some cases it’s Sunday, and in other cases it’s Monday).
Another technique is called one-hot encoding, which converts each value to
a vector. Thus, {Male, Female} can be represented by the vectors [1,0] and
[0,1], and the colors {Red, Green, Blue} can be represented by the vectors
[1,0,0], [0,1,0], and [0,0,1]. If you vertically “line up” the two vectors
for gender, they form a 2×2 identity matrix, and doing the same for the colors
will form a 3×3 identity matrix as shown here:
[1,0,0]
[0,1,0]
[0,0,1]

If you are familiar with matrices, you probably noticed that the preceding set
of vectors looks like the 3×3 identity matrix. In fact, this technique general-
izes in a straightforward manner. Specifically, if you have n distinct categorical
values, you can map each of those values to one of the vectors in an nxn identity
matrix.
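Here is a minimal Pandas sketch of both techniques (the column names and
data values are hypothetical):

import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", "F", "M"],
                   "color":  ["Red", "Green", "Blue", "Red"]})

# Map a binary categorical feature to {0,1}:
df["gender"] = df["gender"].map({"M": 0, "F": 1})

# One-hot encode a multi-valued categorical feature:
df = pd.get_dummies(df, columns=["color"])
print(df)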

As another example, the set of titles {"Intern", "Junior", "Mid-
Range", "Senior", "Project Leader", "Dev Manager"} has a hierarchi-
cal relationship in terms of their salaries (which can also overlap, but we won’t
address that now).
Another set of categorical data involves the season of the year: {"Spring",
"Summer", "Autumn", "Winter"}, and while these values are generally in-
dependent of each other, there are cases in which the season is significant.
For example, the values for the monthly rainfall, average temperature, crime
rate, or foreclosure rate can depend on the season, month, week, or day of
the year.
If a feature has a large number of categorical values, then one-hot encoding
will produce many additional columns for each data point. Since the majority
of the values in the new columns equal 0, this can increase the sparsity of the
dataset, which in turn can result in more overfitting and hence adversely affect
the accuracy of machine learning algorithms that you adopt during the training
process.
Another solution is to use a sequence-based solution in which N categories
are mapped to the integers 1, 2, . . . , N. Another solution involves examin-
ing the row frequency of each categorical value. For example, suppose that
N equals 20, and there are three categorical values that occur in 95% of the
values for a given feature. You can try the following:

1. Assign the values 1, 2, and 3 to those three categorical values.
2. Assign numeric values that reflect the relative frequency of those cat-
egorical values.
3. Assign the category “OTHER” to the remaining categorical values (see the sketch after this list).
4. Delete the rows whose categorical values belong to the 5%.
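Here is a minimal Pandas sketch of option 3 (the Series values and the choice
of three retained categories are hypothetical):

import pandas as pd

s = pd.Series(["a", "a", "a", "b", "b", "b", "c", "c", "d", "e"])

# Keep the three most frequent categories and group the rest as "OTHER":
top = s.value_counts().nlargest(3).index
s = s.where(s.isin(top), other="OTHER")
print(s.value_counts())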

Working with Dates


The format for a calendar date varies among different countries, and this
belongs to something called the localization of data (not to be confused with
i18n, which is data internationalization). Some examples of date formats are
shown as follows (and the first four are probably the most common):

MM/DD/YY
MM/DD/YYYY
DD/MM/YY
DD/MM/YYYY
YY/MM/DD
M/D/YY
D/M/YY
YY/M/D
MMDDYY
DDMMYY
YYMMDD

If you need to combine data from datasets that contain different date formats,
then converting the disparate date formats to a single common date format will
ensure consistency.
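Here is a minimal Pandas sketch of such a conversion (the sample date
strings are hypothetical):

import pandas as pd

# Convert two sources with different date formats to a common representation:
us_dates = pd.to_datetime(pd.Series(["12/31/2021", "01/15/2022"]),
                          format="%m/%d/%Y")
eu_dates = pd.to_datetime(pd.Series(["31/12/2021", "15/01/2022"]),
                          format="%d/%m/%Y")
print(us_dates.equals(eu_dates))  # True: both now contain the same dates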

Working with Currency


The format for currency depends on the country, which includes different
interpretations for a “,” and “.” in the currency (and decimal values in general).
For example, 1,124.78 equals “one thousand one hundred twenty-four point
seven eight” in the United States, whereas 1.124,78 has the same meaning in
Europe (i.e., the “.” symbol and the “,” symbol are interchanged).
If you need to combine data from datasets that contain different currency
formats, then you probably need to convert all the disparate currency formats
to a single common currency format. There is another detail to consider: cur-
rency exchange rates can fluctuate on a daily basis, which in turn can affect the
calculation of taxes and late fees. Although you might be fortunate enough
where you won’t have to deal with these issues, it’s still worth being aware of
them.

MISSING DATA, ANOMALIES, AND OUTLIERS

Although missing data is not directly related to checking for anomalies and
outliers, in general you will perform all three of these tasks. Each task involves
a set of techniques to help you perform an analysis of the data in a dataset, and
the following subsections describe some of those techniques.

Missing Data
How you decide to handle missing data depends on the specific dataset.
Here are some ways to handle missing data (the first three techniques are
manual techniques, and the other techniques are algorithms):

1. Replace missing data with the mean/median/mode value.
2. Infer (“impute”) the value for missing data.
3. Delete rows with missing data.
4. Isolation forest (tree-based algorithm).
5. Use the minimum covariance determinant.
6. Use the local outlier factor.
7. Use the one-class SVM (Support Vector Machines).

In general, replacing a missing numeric value with zero is a risky choice:
this value is obviously incorrect if the values of a feature are between 1,000 and
5,000. For a feature that has numeric values, replacing a missing value with the
average value is better than the value zero (unless the average equals zero);
also consider using the median value. For categorical data, consider using the
mode to replace a missing value.

If you are not confident that you can impute a “reasonable” value, consider
excluding the row with a missing value, and then train a model with the im-
puted value and the deleted row.
One problem that can arise after removing rows with missing values is that
the resulting dataset is too small. In this case, consider using SMOTE, which is
discussed later in this chapter, to generate synthetic data.
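Here is a minimal Pandas sketch of the first technique (the DataFrame
contents are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age":  [25, np.nan, 40, 33],
                   "city": ["SF", "NYC", np.nan, "SF"]})

# Numeric feature: replace missing values with the mean (or the median):
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical feature: replace missing values with the mode:
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)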

Anomalies and Outliers


In simplified terms, an outlier is an abnormal data value that is outside the
range of “normal” values. For example, a person’s height in centimeters is typi-
cally between 30 centimeters and 250 centimeters. Hence, a data point (e.g., a
row of data in a spreadsheet) with a height of 5 centimeters or a height of 500
centimeters is an outlier. The consequences of these outlier values are unlikely
to involve a significant financial or physical loss (though they could adversely
affect the accuracy of a trained model).
Anomalies are also outside the “normal” range of values (just like outli-
ers), and they are typically more problematic than outliers: anomalies can have
more severe consequences than outliers. For example, consider the scenario in
which someone who lives in California suddenly makes a credit card purchase
in New York. If the person is on vacation (or a business trip), then the purchase
is an outlier (it’s outside the typical purchasing pattern), but it’s not an issue.
However, if that person was in California when the credit card purchase was
made, then it’s more likely to be credit card fraud, as well as an anomaly.
Unfortunately, there is no simple way to decide how to deal with anoma-
lies and outliers in a dataset. Although you can exclude rows that contain out-
liers, keep in mind that doing so might deprive the dataset—and therefore
the trained model—of valuable information. You can try modifying the data
values (described as follows), but again, this might lead to erroneous infer-
ences in the trained model. Another possibility is to train a model with the
dataset that contains anomalies and outliers, and then train a model with a
dataset from which the anomalies and outliers have been removed. Compare
the two results and see if you can infer anything meaningful regarding the
anomalies and outliers.

Outlier Detection
Although the decision to keep or drop outliers is your decision to make,
there are some techniques available that help you detect outliers in a dataset.
This section contains a short list of some techniques, along with a very brief
description and links for additional information.
Perhaps trimming is the simplest technique (apart from dropping outliers),
which involves removing rows whose feature value is in the upper 5% range or
the lower 5% range. Winsorizing the data is an improvement over trimming:
set the values in the top 5% range equal to the value at the 95th percentile,
and set the values in the bottom 5% range equal to the value at the 5th
percentile.
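Here is a minimal NumPy sketch of winsorizing (scipy.stats.mstats.winsorize
provides a ready-made alternative):

import numpy as np

def winsorize(x, lower=5, upper=95):
    # Clip values below the 5th percentile and above the 95th percentile.
    lo, hi = np.percentile(x, [lower, upper])
    return np.clip(x, lo, hi)

x = np.random.normal(size=1_000)
print(winsorize(x).min(), winsorize(x).max())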

The Minimum Covariance Determinant is a covariance-based technique,
and a Python-based code sample that uses this technique can be found online:
https://scikit-learn.org/stable/modules/outlier_detection.html.
The Local Outlier Factor (LOF) technique is an unsupervised technique that
calculates a local anomaly score via the kNN (k Nearest Neighbor) algorithm.
Documentation and short code samples that use LOF can be found online:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html.
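Here is a minimal Sklearn sketch of LOF (the data values are hypothetical):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[1.0], [1.1], [0.9], [1.2], [10.0]])  # 10.0 is an outlier

lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X)  # -1 for outliers, 1 for inliers
print(labels)                # expect the last point to be flagged as -1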
Two other techniques involve the Huber and the Ridge classes, both of which
are included as part of Sklearn. The Huber error is less sensitive to outliers
because it’s calculated via linear loss, similar to MAE (Mean Absolute Error).
A code sample that compares Huber and Ridge can be found online:
https://scikit-learn.org/stable/auto_examples/linear_model/plot_huber_vs_ridge.html.
You can also explore the Theil-Sen estimator and RANSAC, which are “robust”
against outliers, and additional information can be found online:
https://scikit-learn.org/stable/auto_examples/linear_model/plot_theilsen.html
and https://en.wikipedia.org/wiki/Random_sample_consensus.
Four algorithms for outlier detection are discussed at the following site:
https://www.kdnuggets.com/2018/12/four-techniques-outlier-detection.html.
One other scenario involves “local” outliers. For example, suppose that you use
kMeans (or some other clustering algorithm) and determine that a value is an
outlier with respect to one of the clusters. While this value is not necessarily an
“absolute” outlier, detecting such a value might be important for your use case.

What Is Data Drift?


The value of data is based on its accuracy, its relevance, and its age. Data
drift refers to data that has become less relevant over time. For example, on-
line purchasing patterns in 2010 are probably not as relevant as data from 2020
because of various factors (such as the profile of different types of customers).
Keep in mind that there might be multiple factors that can influence data drift
in a specific dataset.
Two techniques are domain classifier and the black-box shift detector, both
of which can be found online:
https://blog.dataiku.com/towards-reliable-mlops-with-drift-detectors.

WHAT IS IMBALANCED CLASSIFICATION?

Imbalanced classification involves datasets with imbalanced classes. For
example, suppose that class A has 99% of the data and class B has 1%. Which
classification algorithm would you use? Unfortunately, classification algorithms
don’t work well with this type of imbalanced dataset. Here is a list of several
well-known techniques for handling imbalanced datasets:

• Random resampling rebalances the class distribution.
• Random oversampling duplicates data in the minority class.
• Random undersampling deletes examples from the majority class.
• SMOTE

Random resampling transforms the training dataset into a new dataset, which
is effective for imbalanced classification problems.
The random undersampling technique removes samples from the dataset,
and involves the following:

• randomly remove samples from the majority class
• can be performed with or without replacement
• alleviates imbalance in the dataset
• may increase the variance of the classifier
• may discard useful or important samples

However, random undersampling does not work so well with a dataset that has
a 99%/1% split into two classes. Moreover, undersampling can result in losing
information that is useful for a model.
Instead of random undersampling, another approach involves generating
new samples from a minority class. The simplest such technique is random
oversampling, which duplicates existing examples from the minority class.
A better technique involves the following:

• synthesizing new examples from a minority class
• a type of data augmentation for tabular data
• generating new samples from a minority class
Another well-known technique is called SMOTE, which involves data aug-
mentation (i.e., synthesizing new data samples) well before you use a clas-
sification algorithm. SMOTE was initially developed by means of the kNN
algorithm (other options are available), and it can be an effective technique for
handling imbalanced classes.
Yet another option to consider is the Python package imbalanced-learn
in the scikit-learn-contrib project. This project provides various re-
sampling techniques for datasets that exhibit class imbalance. More details are
available online:
https://github.com/scikit-learn-contrib/imbalanced-learn.

WHAT IS SMOTE?

SMOTE is a technique for synthesizing new samples for a dataset. This
technique is based on linear interpolation:

Step 1: Select samples that are close in the feature space.
Step 2: Draw a line between the samples in the feature space.
Step 3: Draw a new sample at a point along that line.

A more detailed explanation of the SMOTE algorithm is here:

• Select a random sample “a” from the minority class.
• Now find k nearest neighbors for that example.
• Select a random neighbor “b” from the nearest neighbors.
• Create a line L that connects “a” and “b.”
• Randomly select one or more points “c” on line L.

If need be, you can repeat this process for the other (k-1) nearest neighbors
to distribute the synthetic values more evenly among the nearest neighbors.
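Here is a minimal sketch that uses SMOTE from the imbalanced-learn package
(mentioned earlier) on a synthetic dataset with a roughly 99%/1% class split:

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Create an imbalanced two-class dataset:
X, y = make_classification(n_samples=1000, weights=[0.99], random_state=42)
print(Counter(y))

# Synthesize new minority-class samples until the classes are balanced:
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))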

SMOTE Extensions
The initial SMOTE algorithm is based on the kNN classification algorithm,
which has been extended in various ways, such as replacing kNN with SVM. A
list of SMOTE extensions is shown as follows:

• selective synthetic sample generation
• Borderline-SMOTE (kNN)
• Borderline-SMOTE (SVM)
• Adaptive Synthetic Sampling (ADASYN)

More information can be found online:
https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis.

ANALYZING CLASSIFIERS (OPTIONAL)

This section is marked optional because its contents pertain to machine
learning classifiers, which are not the focus of this book. However, it’s still
worthwhile to glance through the material, or perhaps return to this section
after you have a basic understanding of machine learning classifiers.
Several well-known techniques are available for analyzing the quality of
machine learning classifiers. Two techniques are LIME and ANOVA, both of
which are discussed in the following subsections.

What is LIME?
LIME is an acronym for Local Interpretable Model-Agnostic Explanations.
LIME is a model-agnostic technique that can be used with machine learning
models. The methodology of this technique is straightforward: make small ran-
dom changes to data samples and then observe the manner in which predic-
tions change (or not). The approach involves changing the data just a little and
then observing what happens to the output.
By way of contrast, consider food inspectors who test for bacteria in truck-
loads of perishable food. Clearly, it’s infeasible to test every food item in a truck
(or a train car), so inspectors perform “spot checks” that involve testing ran-
domly selected items. Instead of sampling data, LIME makes small changes to
input data in random locations and then analyzes the changes in the associated
output values.
However, there are two caveats to keep in mind when you use LIME with
input data for a given model:

1. The actual changes to input values are model-specific.
2. This technique works on input that is interpretable.
Examples of interpretable input include machine learning classifiers (such as
trees and random forests) and NLP techniques such as BoW (Bag of Words).
Non-interpretable input involves “dense” data, such as a word embedding
(which is a vector of floating point numbers).
You could also substitute your model with another model that involves in-
terpretable data, but then you need to evaluate how accurate the approxima-
tion is to the original model.
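Here is a hedged sketch of LIME for tabular data, assuming the lime package
is installed (pip install lime); the dataset and classifier are placeholders:

from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(random_state=42).fit(iris.data, iris.target)

explainer = LimeTabularExplainer(iris.data,
                                 feature_names=iris.feature_names,
                                 class_names=iris.target_names,
                                 mode="classification")

# Perturb one sample slightly and observe how the predictions change:
exp = explainer.explain_instance(iris.data[0], clf.predict_proba,
                                 num_features=4)
print(exp.as_list())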

What is ANOVA?
ANOVA is an acronym for analysis of variance, which attempts to analyze
the differences among the mean values of a sample that’s taken from a popu-
lation. ANOVA enables you to test if multiple mean values are equal. More
importantly, ANOVA can assist in reducing Type I (false positive) errors and
Type II (false negative) errors. For example, suppose that person A is
diagnosed with cancer and person B is diagnosed as healthy, and that both di-
agnoses are incorrect. Then the result for person A is a false positive whereas
the result for person B is a false negative. In general, a test result of false posi-
tive is preferable to a test result of false negative.
ANOVA pertains to the design of experiments and hypothesis testing,
which can produce meaningful results in various situations. For example, sup-
pose that a dataset contains a feature that can be partitioned into several “rea-
sonably” homogenous groups. Next, analyze the variance in each group and
perform comparisons with the goal of determining different sources of vari-
ance for the values of a given feature.
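Here is a minimal SciPy sketch of a one-way ANOVA test (the group values
are hypothetical):

from scipy import stats

# Three "reasonably" homogeneous groups of values for a feature:
group1 = [22.1, 23.5, 21.8, 24.0]
group2 = [22.9, 23.1, 22.5, 23.8]
group3 = [28.4, 29.0, 27.7, 28.9]

# Test the null hypothesis that the three group means are equal:
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f_stat, p_value)  # a small p-value suggests unequal means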

More information about ANOVA is available online:
https://en.wikipedia.org/wiki/Analysis_of_variance.

THE BIAS-VARIANCE TRADE-OFF

This section is presented from the viewpoint of machine learning, but the
concepts of bias and variance are highly relevant outside of machine learning, so
it’s probably still worthwhile to read this section as well as the previous section.
Bias in machine learning can be due to an error from wrong assumptions
in a learning algorithm. High bias might cause an algorithm to miss relevant
relations between features and target outputs (underfitting). Prediction bias
can occur because of “noisy” data, an incomplete feature set, or a biased train-
ing sample.
Error due to bias is the difference between the expected (or average) pre-
diction of your model and the correct value that you want to predict. Repeat
the model building process multiple times, gather new data each time, and
perform an analysis to produce a new model. The resulting models have a
range of predictions because the underlying datasets have a degree of random-
ness. Bias measures the extent to which the predictions for these models devi-
ate from the correct value.
Variance in machine learning is the expected value of the squared devia-
tion from the mean. High variance might cause an algorithm to model the
random noise in the training data (aka overfitting), rather than the intended
outputs. Moreover, adding parameters to a model increases its complexity, in-
creases the variance, and decreases the bias.
The point to remember is that dealing with bias and variance involves dealing
with underfitting and overfitting.
Error due to variance is the variability of a model prediction for a given data
point. As before, repeat the entire model building process, and the variance
is the extent to which predictions for a given point vary among different “in-
stances” of the model.
If you have worked with datasets and performed data analysis, you already
know that finding well-balanced samples can be difficult or highly impractical.
Moreover, performing an analysis of the data in a dataset is vitally important,
yet there is no guarantee that you can produce a dataset that is 100% “clean.”
A biased statistic is a statistic that is systematically different from the entity
in the population that is being estimated. In more casual terminology, if a data
sample “favors” or “leans” toward one aspect of the population, then the sam-
ple has bias. For example, if you prefer movies that are comedies more so than
any other type of movie, then clearly you are more likely to select a comedy
instead of a dramatic movie or a science fiction movie. Thus, a frequency graph
of the movie types in a sample of your movie selections will be more closely
clustered around comedies.

On the other hand, if you have a wide-ranging set of preferences for mov-
ies, then the corresponding frequency graph will be more varied, and therefore
have a larger spread of values.
As a simple example, suppose that you are given an assignment that in-
volves writing a term paper on a controversial subject that has many opposing
viewpoints. Since you want a bibliography that supports your well-balanced
term paper that takes into account multiple viewpoints, your bibliography will
contain a wide variety of sources. In other words, your bibliography will have
a larger variance and a smaller bias. On the other hand, if most (or all) the
references in your bibliography espouse the same point of view, then you will
have a smaller variance and a larger bias (it’s just an analogy, so it’s not a perfect
counterpart to bias vs. variance).
The bias-variance trade-off can be stated in simple terms. In general, re-
ducing the bias in samples can increase the variance, whereas reducing the
variance tends to increase the bias.

Types of Bias in Data


In addition to the bias-variance trade-off that is discussed in the previous
section, there are several types of bias, some of which are listed as follows:

• Availability Bias
• Confirmation Bias
• False Causality
• Sunk Cost Fallacy
• Survivorship Bias

Availability bias is akin to making a “rule” based on an exception. For example,
there is a known link between smoking cigarettes and cancer, but there are
exceptions. If you find someone who has smoked three packs of cigarettes on
a daily basis for four decades and is still healthy, can you assert that smoking
does not lead to cancer?
Confirmation bias refers to the tendency to focus on data that confirms
one’s beliefs and simultaneously ignore data that contradicts a belief.
False causality occurs when you incorrectly assert that the occurrence of a
particular event causes another event to occur as well. One of the most well-
known examples involves ice cream consumption and violent crime in New
York during the summer. Since more people eat ice cream in the summer, that
“causes” more violent crime, which is a false causality. Other factors, such as
the increase in temperature, may be linked to the increase in crime. However,
it’s important to distinguish between correlation and causality. The latter is
a much stronger link than the former, and it’s also more difficult to establish
causality instead of correlation.
Sunk cost refers to something (often money) that has been spent or in-
curred that cannot be recouped. A common example pertains to gambling at a
casino. People fall into the pattern of spending more money in order to recoup
a substantial amount of money that has already been lost. While there are cases
in which people do recover their money, in many (most?) cases people simply
incur an even greater loss because they continue to spend their money. Hence
the expression, “it’s time to cut your losses and walk away.”
Survivorship bias refers to analyzing a particular subset of “positive” data
while ignoring the “negative” data. This bias occurs in various situations, such
as being influenced by individuals who recount their rags-to-riches success
story (“positive” data) while ignoring the fate of the people (which is often
a very high percentage) who did not succeed (the “negative” data) in a simi-
lar quest. So, while it’s certainly possible for an individual to overcome many
difficult obstacles in order to succeed, is the success rate one in one thousand
(or even lower)?

SUMMARY

This chapter started with an explanation of datasets, a description of data
wrangling, and details regarding various types of data. Then you learned about
techniques for scaling numeric data, such as normalization and standardiza-
tion. You saw how to convert categorical data to numeric values, and how to
handle dates and currency.
Then you learned some of the nuances of missing data, anomalies, and
outliers, and techniques for handling these scenarios. You also learned about
imbalanced data and evaluating the use of SMOTE to deal with imbalanced
classes in a dataset. In addition, you learned about classifiers using two tech-
niques, LIME and ANOVA. Finally, you learned about the bias-variance trade-
off and various types of statistical bias.
CHAPTER 2
Intro to Probability and Statistics

This chapter introduces you to concepts in probability as well as a wide
assortment of statistical terms and algorithms. If most of the material in this
chapter is new to you, be assured that it's not necessary to understand
everything in this chapter. It's still a good idea to read as much material as
you can absorb, and perhaps return to this chapter again after you have
completed some of the other chapters in this book.
The first section of this chapter starts with a discussion of probability, how
to calculate the expected value of a set of numbers (with associated probabili-
ties), the concept of a random variable (discrete and continuous), and a short
list of some well-known probability distributions.
The second section of this chapter introduces basic statistical concepts,
such as mean, median, mode, variance, and standard deviation, along with
simple examples that illustrate how to calculate these terms. You will also learn
about the terms RSS, TSS, R^2, and F1 score.
The third section of this chapter introduces the Gini impurity, entropy, per-
plexity, cross-entropy, and KL divergence. You will also learn about skewness
and kurtosis.
The fourth section explains covariance and correlation matrices and how to
calculate eigenvalues and eigenvectors.
The fifth section explains PCA (Principal Component Analysis), which is a
well-known dimensionality reduction technique. The final section introduces
you to Bayes’ Theorem.
WHAT IS A PROBABILITY?

If you have ever performed a science experiment in one of your classes, you
might remember that measurements have some uncertainty. In general, we
assume that there is a correct value, and we endeavor to find the best estimate
of that value.
When we work with an event that can have multiple outcomes, we try to
define the probability of an outcome as the chance that it will occur, which is
calculated as follows:
p(outcome) = (number of times the outcome occurs)/(total number of trials)

For example, in the case of a single balanced coin, the probability of tossing a
head H equals the probability of tossing a tail T:
p(H) = 1/2 = p(T)

The set of probabilities associated with the outcomes {H, T} is shown in the
set P:
P = {1/2, 1/2}
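As a quick sanity check, the following sketch (illustrative code, not from the original text; it assumes Python with NumPy) estimates p(H) empirically by simulating a large number of coin tosses:

# Estimate p(H) for a balanced coin by simulation (assumes NumPy).
import numpy as np

rng = np.random.default_rng(seed=0)
tosses = rng.choice(["H", "T"], size=100_000)  # each outcome has probability 1/2
p_heads = np.mean(tosses == "H")               # fraction of tosses that are heads
print(f"estimated p(H) = {p_heads:.3f}")       # close to the theoretical 0.5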

Some experiments involve replacement while others involve non-replacement. For example, suppose that an urn contains 10 red balls and 10 green balls. What is the probability that a randomly selected ball is red? The answer is 10/(10+10) = 1/2. What is the probability that the second ball is also red?
There are two scenarios with two different answers. If each ball is selected with replacement, then each ball is returned to the urn after selection, which means that the urn always contains 10 red balls and 10 green balls. In this case, the answer is 1/2 * 1/2 = 1/4: each selection is independent of all previous selections.
On the other hand, if balls are selected without replacement, then the answer is 10/20 * 9/19. As you undoubtedly know, card games are also examples of selecting cards without replacement.
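The following sketch (illustrative code, not from the original text) verifies both urn calculations by simulation:

# Simulate drawing two balls, with and without replacement.
import random

def both_red(replace):
    urn = ["red"] * 10 + ["green"] * 10
    first = random.choice(urn)
    if not replace:
        urn.remove(first)          # without replacement: the first ball stays out
    second = random.choice(urn)
    return first == "red" and second == "red"

trials = 100_000
for replace in (True, False):
    hits = sum(both_red(replace) for _ in range(trials))
    # Expected: 1/2 * 1/2 = 0.250 with replacement, 10/20 * 9/19 = 0.237 without
    print(f"replacement={replace}: P(both red) = {hits / trials:.3f}")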
One other concept is called conditional probability, which refers to the
likelihood of the occurrence of event E1 given that event E2 has occurred. A
simple example is the following statement:
"If it rains (E2), then I will carry an umbrella (E1)."

Calculating the Expected Value


Consider the following scenario involving a well-balanced coin: whenever a head appears, you earn $1, and whenever a tail appears, you also earn $1. If you toss the coin 100 times, how much money do you expect to earn? Since you will earn $1 regardless of the outcome, the expected value (in fact, the guaranteed value) is $100.
Now consider this scenario: whenever a head appears, you earn $1, and whenever a tail appears, you earn 0 dollars. If you toss the coin 100 times, how much money do you expect to earn? Since heads appears with probability 1/2, the expected value is 100 * (1/2 * $1 + 1/2 * $0) = $50. In general, the expected value of a set of discrete outcomes is the sum of each outcome's value multiplied by its probability.
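The following sketch (illustrative code, not from the original text) computes the expected value of a set of discrete outcomes with associated probabilities:

# Expected value of discrete outcomes: sum of probability * value.
def expected_value(values, probabilities):
    assert abs(sum(probabilities) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p * v for p, v in zip(probabilities, values))

# Second scenario: $1 per head, $0 per tail, with a balanced coin.
per_toss = expected_value(values=[1, 0], probabilities=[0.5, 0.5])
print(f"expected earnings over 100 tosses: ${100 * per_toss:.2f}")  # $50.00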