Data Science
Concepts and Practice
Second Edition

Vijay Kotu
Bala Deshpande
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
Copyright © 2019 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical,
including photocopying, recording, or any information storage and retrieval system, without permission in writing
from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies
and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency,
can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than
as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our
understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any
information, methods, compounds, or experiments described herein. In using such information or methods they
should be mindful of their own safety and the safety of others, including parties for whom they have a professional
responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for
any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any
use or operation of any methods, products, instructions, or ideas contained in the material herein.

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
ISBN: 978-0-12-814761-0

For Information on all Morgan Kaufmann publications


visit our website at https://www.elsevier.com/books-and-journals

Publisher: Jonathan Simpson


Acquisition Editor: Glyn Jones
Editorial Project Manager: Ana Claudia Abad Garcia
Production Project Manager: Sreejith Viswanathan
Cover Designer: Greg Harris
Typeset by MPS Limited, Chennai, India
Dedication

To all the mothers in our lives


Contents

FOREWORD................................................................................................. xi
PREFACE ................................................................................................... xv
ACKNOWLEDGMENTS............................................................................... xix

Chapter 1 Introduction ..................................................................................... 1


1.1 AI, Machine Learning, and Data Science..........................................2
1.2 What is Data Science? .......................................................................4
1.3 Case for Data Science........................................................................8
1.4 Data Science Classification .............................................................10
1.5 Data Science Algorithms .................................................................12
1.6 Roadmap for This Book ...................................................................12
References ................................................................................................18

Chapter 2 Data Science Process................................................................... 19


2.1 Prior Knowledge...............................................................................21
2.2 Data Preparation ..............................................................................25
2.3 Modeling............................................................................................29
2.4 Application ........................................................................................34
2.5 Knowledge ........................................................................................36
References ................................................................................................37

Chapter 3 Data Exploration ........................................................................... 39


3.1 Objectives of Data Exploration ........................................................40
3.2 Datasets ............................................................................................40
3.3 Descriptive Statistics........................................................................43
3.4 Data Visualization .............................................................................48
3.5 Roadmap for Data Exploration........................................................63
References ................................................................................................64

Chapter 4 Classification................................................................................. 65
4.1 Decision Trees ..................................................................................66
4.2 Rule Induction...................................................................87

4.3 k-Nearest Neighbors .......................................................................98


4.4 Naïve Bayesian ...............................................................................111
4.5 Artificial Neural Networks.............................................................124
4.6 Support Vector Machines ..............................................................135
4.7 Ensemble Learners........................................................................148
References ..............................................................................................161

Chapter 5 Regression Methods................................................................... 165


5.1 Linear Regression ..........................................................................166
5.2 Logistic Regression........................................................................185
5.3 Conclusion ......................................................................................196
References ..............................................................................................197

Chapter 6 Association Analysis ................................................................... 199


6.1 Mining Association Rules...............................................................201
6.2 Apriori Algorithm............................................................................206
6.3 Frequent Pattern-Growth Algorithm ............................................211
6.4 Conclusion ......................................................................................220
References ..............................................................................................220

Chapter 7 Clustering .................................................................................... 221


7.1 k-Means Clustering........................................................................226
7.2 DBSCAN Clustering .......................................................................238
7.3 Self-Organizing Maps ....................................................................247
References ..............................................................................................260

Chapter 8 Model Evaluation ........................................................................ 263


8.1 Confusion Matrix ............................................................................264
8.2 ROC and AUC..................................................................................266
8.3 Lift Curves.......................................................................................268
8.4 How to Implement..........................................................................271
8.5 Conclusion ......................................................................................276
References ..............................................................................................279

Chapter 9 Text Mining.................................................................................. 281


9.1 How It Works ..................................................................................283
9.2 How to Implement..........................................................................290
9.3 Conclusion ......................................................................................304
References ..............................................................................................305

Chapter 10 Deep Learning .......................................................................... 307


10.1 The AI Winter............................................................................. 309
10.2 How it Works ............................................................................. 315
10.3 How to Implement .................................................................... 335

10.4 Conclusion ................................................................................. 341


References ........................................................................................... 342

Chapter 11 Recommendation Engines ....................................................... 343


11.1 Recommendation Engine Concepts......................................... 346
11.2 Collaborative Filtering .............................................................. 353
11.3 Content-Based Filtering ........................................................... 373
11.4 Hybrid Recommenders............................................................. 389
11.5 Conclusion ................................................................................. 390
References ........................................................................................... 394

Chapter 12 Time Series Forecasting .......................................................... 395


12.1 Time Series Decomposition ..................................................... 400
12.2 Smoothing Based Methods ...................................................... 407
12.3 Regression Based Methods ..................................................... 413
12.4 Machine Learning Methods...................................................... 429
12.5 Performance Evaluation ........................................................... 439
12.6 Conclusion ................................................................................. 443
References ........................................................................................... 444

Chapter 13 Anomaly Detection.................................................................... 447


13.1 Concepts .................................................................................... 447
13.2 Distance-Based Outlier Detection ........................................... 453
13.3 Density-Based Outlier Detection ............................................. 457
13.4 Local Outlier Factor.................................................................. 460
13.5 Conclusion ................................................................................. 464
References ........................................................................................... 465

Chapter 14 Feature Selection ..................................................................... 467


14.1 Classifying Feature Selection Methods................................... 468
14.2 Principal Component Analysis ................................................. 470
14.3 Information Theory-Based Filtering........................................ 477
14.4 Chi-Square-Based Filtering ..................................................... 480
14.5 Wrapper-Type Feature Selection............................................. 483
14.6 Conclusion ................................................................................. 489
References ........................................................................................... 490

Chapter 15 Getting Started with RapidMiner ............................................. 491


15.1 User Interface and Terminology.............................................. 492
15.2 Data Importing and Exporting Tools ....................................... 497
15.3 Data Visualization Tools ........................................................... 501
15.4 Data Transformation Tools ...................................................... 504
15.5 Sampling and Missing Value Tools.......................................... 509

15.6 Optimization Tools .................................................................... 512


15.7 Integration with R ..................................................................... 520
15.8 Conclusion ................................................................................. 521
References ........................................................................................... 521

COMPARISON OF DATA SCIENCE ALGORITHMS ...................................... 523


ABOUT THE AUTHORS ............................................................................. 531
INDEX ...................................................................................................... 533
PRAISE .................................................................................................... 545
Foreword

A lot has happened since the first edition of this book was published in
2014. There is hardly a day where there is no news on data science, machine
learning, or artificial intelligence in the media. It is interesting that many of
those news articles have a skeptical, if not an even negative tone. All this
underlines two things: data science and machine learning are finally becom-
ing mainstream. And people know shockingly little about it. Readers of this
book will certainly do better in this regard. It continues to be a valuable
resource to not only educate about how to use data science in practice, but
also how the fundamental concepts work.
Data science and machine learning are fast-moving fields which is why this
second edition reflects a lot of the changes in our field. While we used to
talk a lot about “data mining” and “predictive analytics” only a couple of
years ago, we have now settled on the term “data science” for the broader
field. And even more importantly: it is now commonly understood that
machine learning is at the core of many current technological breakthroughs.
These are truly exciting times for all the people working in our field, then!
I have seen situations where data science and machine learning had an
incredible impact. But I have also seen situations where this was not the
case. What was the difference? In most cases where organizations fail with
data science and machine learning, it is because they used those techniques in
the wrong context. Data science models are not very helpful if you only
have one big decision you need to make. Analytics can still help you in
such cases by giving you easier access to the data you need to make this
decision. Or by presenting the data in a consumable fashion. But at the
end of the day, those single big decisions are often strategic. Building a
machine learning model to help you make this decision is not worth
doing. And often they also do not yield better results than just making the
decision on your own.


Here is where data science and machine learning can truly help: these
advanced models deliver the most value whenever you need to make lots of
similar decisions quickly. Good examples for this are:
• Defining the price of a product in markets with rapidly changing
demands.
• Making offers for cross-selling in an E-Commerce platform.
• Approving credit or not.
• Detecting customers with a high risk for churn.
• Stopping fraudulent transactions.
• And many others.
You can see that a human being who would have access to all relevant data
could make those decisions in a matter of seconds or minutes. Only that
they can’t without data science, since they would need to make this type of
decision millions of times, every day. Consider sifting through your customer
base of 50 million clients every day to identify those with a high churn risk.
Impossible for any human being. But no problem at all for a machine learn-
ing model.
So, the biggest value of artificial intelligence and machine learning is not to
support us with those big strategic decisions. Machine learning delivers most
value when we operationalize models and automate millions of decisions.
One of the shortest descriptions of this phenomenon comes from Andrew
Ng, who is a well-known researcher in the field of AI. Andrew describes what
AI can do as follows: “If a typical person can do a mental task with less than
one second of thought, we can probably automate it using AI either now or
in the near future.”
I agree with him on this characterization. And I like that Andrew puts the
emphasis on automation and operationalization of those models—because
this is where the biggest value is. The only thing I disagree with is the time
unit he chose. It is safe to already go with a minute instead of a second.
However, the quick pace of changes as well as the ubiquity of data science
also underlines the importance of laying the right foundations. Keep in
mind that machine learning is not completely new. It has been an active
field of research since the 1950s. Some of the algorithms used today have
even been around for more than 200 years now. And the first deep learning
models were developed in the 1960s with the term “deep learning” being
coined in 1984. Those algorithms are well understood now. And under-
standing their basic concepts will help you to pick the right algorithm for
the right task.
To support you with this, some additional chapters on deep learning and rec-
ommendation systems have been added to the book. Another focus area is

using text analytics and natural language processing. It became clear in the
past years that the most successful predictive models have been using
unstructured input data in addition to the more traditional tabular formats.
Finally, the expanded coverage of Time Series Forecasting should get you started on one
of the most widely applied data science techniques in business.
More algorithms could mean that there is a risk of increased complexity. But
thanks to the simplicity of the RapidMiner platform and the many practical
examples throughout the book this is not the case here. We continue our
journey towards the democratization of data science and machine learning.
This journey continues until data science and machine learning are as ubiqui-
tous as data visualization or Excel. Of course, we cannot magically transform
everybody into a data scientist overnight, but we can give people the tools to
help them on their personal path of development. This book is the only tour
guide you need on this journey.

Ingo Mierswa
Founder
RapidMiner Inc.
Massachusetts, USA
Preface

Our goal is to introduce you to Data Science.


We will provide you with a survey of the fundamental data science concepts
as well as step-by-step guidance on practical implementations—enough to
get you started on this exciting journey.

WHY DATA SCIENCE?


We have run out of adjectives and superlatives to describe the growth trends
of data. The technology revolution has brought about the need to process,
store, analyze, and comprehend large volumes of diverse data in meaningful
ways. However, the value of the stored data is zero unless it is acted upon. The scale
of data volume and variety places new demands on organizations to quickly
uncover hidden relationships and patterns. This is where data science techni-
ques have proven to be extremely useful. They are increasingly finding their
way into the everyday activities of many business and government functions,
whether in identifying which customers are likely to take their business else-
where, or in mapping a flu pandemic using social media signals.
Data science is a compilation of techniques that extract value from data.
Some of the techniques used in data science have a long history and trace
their roots to applied statistics, machine learning, visualization, logic, and
computer science. Some techniques have only recently reached the popularity they
deserve. Most emerging technologies go through what is termed the “hype
cycle.” This is a way of contrasting the amount of hyperbole or hype versus
the productivity that is engendered by the emerging technology. The hype
cycle has three main phases: peak of inflated expectation, trough of disillu-
sionment, and plateau of productivity. The third phase refers to the mature
and value-generating phase of any technology. The hype cycle for data sci-
ence indicates that it is in this mature phase. Does this imply that data sci-
ence has stopped growing or has reached a saturation point? Not at all. On
the contrary, this discipline has grown beyond the scope of its initial

applications in marketing and has advanced to applications in technology,


internet-based fields, health care, government, finance, and manufacturing.

WHY THIS BOOK?


The objective of this book is two-fold: to help clarify the basic concepts
behind many data science techniques in an easy-to-follow manner; and to
prepare anyone with a basic grasp of mathematics to implement these techni-
ques in their organizations without the need to write any lines of program-
ming code.
Beyond its practical value, we wanted to show you that the data science
learning algorithms are elegant, beautiful, and incredibly effective. You will
never look at data the same way once you learn the concepts of the learning
algorithms.
To make the concepts stick, you will have to build data science models.
While there are many data science tools available to execute algorithms and
develop applications, the approaches to solving a data science problem are
similar among these tools. We wanted to pick a fully functional, open source,
free to use, graphical user interface-based data science tool so readers can fol-
low the concepts and implement the data science algorithms. RapidMiner, a
leading data science platform, fit the bill and, thus, we used it as a compan-
ion tool to implement the data science algorithms introduced in every
chapter.

WHO CAN USE THIS BOOK?


The concepts and implementations described in this book are geared towards
business, analytics, and technical professionals who use data every day. You,
the reader of the book will get a comprehensive understanding of the differ-
ent data science techniques that can be used for prediction and for discover-
ing patterns, be prepared to select the right technique for a given data
problem, and you will be able to create a general-purpose analytics process.
We have tried to follow a process to describe this body of knowledge. Our
focus has been on introducing about 30 key algorithms that are in wide-
spread use today. We present these algorithms in the framework of:
1. A high-level practical use case for each algorithm.
2. An explanation of how the algorithm works in plain language. Many
algorithms have a strong foundation in statistics and/or computer
science. In our descriptions, we have tried to strike a balance between
being accessible to a wider audience and being academically rigorous.

3. A detailed review of implementation using RapidMiner, by describing


the commonly used setup and parameter options using a sample data
set. You can download the processes from the companion website
www.IntroDataScience.com, and we recommend you follow along by
building an actual data science process.
Analysts, finance, engineering, marketing, and business professionals, or any-
one who analyzes data, most likely will use data science techniques in their
job either now or in the near future. For business managers who are one step
removed from the actual data science process, it is important to know what
is possible and not possible with these techniques so they can ask the right
questions and set proper expectations. While basic spreadsheet analysis, slic-
ing, and dicing of data through standard business intelligence tools will con-
tinue to form the foundations of data exploration in business, data science
techniques are necessary to establish the full edifice of analytics in the
organizations.

Vijay Kotu
California, USA
Bala Deshpande, PhD
Michigan, USA
Acknowledgments

Writing a book is one of the most interesting and challenging endeavors one
can embark on. We grossly underestimated the effort it would take and the
fulfillment it brings. This book would not have been possible without the
support of our families, who granted us enough leeway in this time-
consuming activity. We would like to thank the team at RapidMiner, who
provided great help with everything, ranging from technical support to
reviewing the chapters to answering questions on the features of the product.
Our special thanks to Ingo Mierswa for setting the stage for the book through
the foreword. We greatly appreciate the thoughtful and insightful comments
from our technical reviewers: Doug Schrimager from Slalom Consulting,
Steven Reagan from L&L Products, and Philipp Schlunder, Tobias Malbrecht
and Ingo Mierswa from RapidMiner. Thanks to Mike Skinner of Intel and to
Dr. Jacob Cybulski of Deakin University in Australia. We had great support
and stewardship from the Elsevier and Morgan Kaufmann team: Glyn Jones,
Ana Claudia Abad Garcia, and Sreejith Viswanathan. Thanks to our collea-
gues and friends for all the productive discussions and suggestions regarding
this project.

CHAPTER 1

Introduction

Data science is a collection of techniques used to extract value from data. It


has become an essential tool for any organization that collects, stores, and
processes data as part of its operations. Data science techniques rely on find-
ing useful patterns, connections, and relationships within data. Being a buzz-
word, there is a wide variety of definitions and criteria for what constitutes
data science. Data science is also commonly referred to as knowledge discov-
ery, machine learning, predictive analytics, and data mining. However, each
term has a slightly different connotation depending on the context. In this
chapter, we attempt to provide a general overview of data science and point
out its important features, purpose, taxonomy, and methods.
In spite of the present growth and popularity, the underlying methods of
data science are decades if not centuries old. Engineers and scientists have
been using predictive models since the beginning of the nineteenth century.
Humans have always been forward-looking creatures and predictive sciences
are manifestations of this curiosity. So, who uses data science today? Almost
every organization and business. Of course, the methods that now fall under
data science were not always called “Data Science.” The use of the term science in data
science indicates that the methods are evidence based, and are built on
empirical knowledge, more specifically historical observations.
As the ability to collect, store, and process data has increased, in line with
Moore’s Law (which implies that computing hardware capabilities double
every two years), data science has found increasing applications in many
diverse fields. Just decades ago, building a production-quality regression
model took several dozen hours (Parr Rud, 2001). Technology has
come a long way. Today, sophisticated machine learning models can be run,
involving hundreds of predictors with millions of records in a matter of a
few seconds on a laptop computer.
The process involved in data science, however, has not changed since those
early days and is not likely to change much in the foreseeable future. To get
meaningful results from any data, a major effort in preparing, cleaning,
scrubbing, or standardizing the data is still required before the learning algo-
rithms can begin to crunch them. But what may change is the automation
available to do this. While today this process is iterative and requires ana-
lysts’ awareness of the best practices, soon smart automation may be
deployed. This will allow the focus to be put on the most important aspect
of data science: interpreting the results of the analysis in order to make deci-
sions. This will also increase the reach of data science to a wider audience.
When it comes to data science techniques, is there a core set of procedures
and principles one must master? It turns out that a vast majority of data science
practitioners today use a handful of very powerful techniques to accomplish
their objectives: decision trees, regression models, deep learning, and clustering
(Rexer, 2013). A majority of the data science activity can be accomplished
using relatively few techniques. However, as with all 80/20 rules, the long tail,
which is made up of a large number of specialized techniques, is where the
value lies, and depending on what is needed, the best approach may be a rela-
tively obscure technique or a combination of several not so commonly used
procedures. Thus, it will pay off to learn data science and its methods in a sys-
tematic way, and that is what is covered in these chapters. But, first, how are
the often-used terms artificial intelligence (AI), machine learning, and data sci-
ence explained?

1.1 AI, MACHINE LEARNING, AND DATA SCIENCE


Artificial intelligence, machine learning, and data science are all related to
each other. Unsurprisingly, they are often used interchangeably and conflated
with each other in popular media and business communication. However,
the three fields are distinct, and the meaning depends on the context. Fig. 1.1 shows
the relationship between artificial intelligence, machine learning, and
data science.
Artificial intelligence is about giving machines the capability of mimicking
human behavior, particularly cognitive functions. Examples would be: facial
recognition, automated driving, sorting mail based on postal code. In some
cases, machines have far exceeded human capabilities (sorting thousands of
postal mails in seconds) and in other cases we have barely scratched the
surface (search “artificial stupidity”). There are quite a range of techniques
that fall under artificial intelligence: linguistics, natural language processing,
decision science, bias, vision, robotics, planning, etc. Learning is an impor-
tant part of human capability. In fact, many other living organisms can
learn.
Machine learning, which can be considered either a sub-field of artificial
intelligence or one of its tools, provides machines with the capability of learning from experience.

FIGURE 1.1
Artificial intelligence, machine learning, and data science.

FIGURE 1.2
Traditional program and machine learning.

Experience for machines comes in the form of data. Data
that is used to teach machines is called training data. Machine learning turns
the traditional programing model upside down (Fig. 1.2). A program, a set
of instructions to a computer, transforms input signals into output signals
using predetermined rules and relationships. Machine learning algorithms,

also called “learners”, take both the known input and output (training data)
to figure out a model for the program which converts input to output. For
example, many organizations like social media platforms, review sites, or for-
ums are required to moderate posts and remove abusive content. How can
machines be taught to automate the removal of abusive content? The
machines need to be shown examples of both abusive and non-abusive posts
with a clear indication of which one is abusive. The learners will generalize a
pattern based on certain words or sequences of words in order to conclude
whether the overall post is abusive or not. The model can take the form of a
set of “if-then” rules. Once the data science rules or model is developed,
machines can start categorizing the disposition of any new posts.
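To make the contrast with traditional programming concrete, here is a minimal sketch in Python of a “learner” that derives a simple keyword rule from a handful of labeled posts rather than being handed the rule by a programmer. The training examples and the rule-extraction logic are invented for illustration only; they are not the method used by any real moderation system.

```python
# Minimal illustration: a "learner" derives an if-then rule from labeled
# examples instead of being handed the rule by a programmer. Hypothetical data.

training_data = [
    ("you are an idiot",              "abusive"),
    ("idiot comment, go away",        "abusive"),
    ("thanks for the helpful answer", "not abusive"),
    ("great post, very clear",        "not abusive"),
]

def learn_abusive_words(examples):
    """Collect words that appear only in abusive posts (a crude 'model')."""
    abusive_words, clean_words = set(), set()
    for text, label in examples:
        words = set(text.lower().split())
        (abusive_words if label == "abusive" else clean_words).update(words)
    return abusive_words - clean_words   # the learned if-then rule

model = learn_abusive_words(training_data)

def predict(text):
    # Apply the learned rule to a new, unseen post
    return "abusive" if model & set(text.lower().split()) else "not abusive"

print(predict("what an idiot"))           # abusive
print(predict("helpful and clear post"))  # not abusive
```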
Data science is the business application of machine learning, artificial intelli-
gence, and other quantitative fields like statistics, visualization, and mathe-
matics. It is an interdisciplinary field that extracts value from data. In the
context of how data science is used today, it relies heavily on machine learn-
ing and is sometimes called data mining. Examples of data science use cases
are: recommendation engines that can recommend movies for a particular
user, a fraud alert model that detects fraudulent credit card transactions, finding
customers who are most likely to churn next month, or predicting revenue for the
next quarter.

1.2 WHAT IS DATA SCIENCE?


Data science starts with data, which can range from a simple array of a few
numeric observations to a complex matrix of millions of observations with
thousands of variables. Data science utilizes certain specialized computational
methods in order to discover meaningful and useful structures within a dataset.
The discipline of data science coexists and is closely associated with a number
of related areas such as database systems, data engineering, visualization, data
analysis, experimentation, and business intelligence (BI). We can further define
data science by investigating some of its key features and motivations.

1.2.1 Extracting Meaningful Patterns


Knowledge discovery in databases is the nontrivial process of identifying
valid, novel, potentially useful, and ultimately understandable patterns or
relationships within a dataset in order to make important decisions (Fayyad,
Piatetsky-shapiro, & Smyth, 1996). Data science involves inference and itera-
tion of many different hypotheses. One of the key aspects of data science is
the process of generalization of patterns from a dataset. The generalization
should be valid, not just for the dataset used to observe the pattern, but also
for new unseen data. Data science is also a process with defined steps, each

FIGURE 1.3
Data science models.

with a set of tasks. The term novel indicates that data science is usually
involved in finding previously unknown patterns in data. The ultimate objec-
tive of data science is to find potentially useful conclusions that can be acted
upon by the users of the analysis.

1.2.2 Building Representative Models


In statistics, a model is the representation of a relationship between variables
in a dataset. It describes how one or more variables in the data are related to
other variables. Modeling is a process in which a representative abstraction is
built from the observed dataset. For example, based on credit score, income
level, and requested loan amount, a model can be developed to determine
the interest rate of a loan. For this task, previously known observational data
including credit score, income level, loan amount, and interest rate are
needed. Fig. 1.3 shows the process of generating a model. Once the represen-
tative model is created, it can be used to predict the value of the interest rate,
based on all the input variables.
Data science is the process of building a representative model that fits the
observational data. This model serves two purposes: on the one hand, it pre-
dicts the output (interest rate) based on the new and unseen set of input
variables (credit score, income level, and loan amount), and on the other
hand, the model can be used to understand the relationship between the
output variable and all the input variables. For example, does income level
really matter in determining the interest rate of a loan? Does income level
matter more than credit score? What happens when income levels double or
if credit score drops by 10 points? A model can be used for both predictive
and explanatory applications.
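As a sketch of how such a representative model could be built in code, the snippet below fits a linear model of interest rate on credit score, income level, and loan amount, then uses it both to predict and to explain. The observations are made up, and scikit-learn is used only for illustration; the book itself implements its models through RapidMiner.

```python
# Sketch: fit a representative model (interest rate as a function of
# credit score, income, and loan amount) from hypothetical observations.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: credit score, annual income ($k), loan amount ($k)
X = np.array([
    [780, 120, 300],
    [720,  90, 250],
    [650,  60, 200],
    [600,  45, 150],
])
y = np.array([3.5, 4.2, 5.8, 7.1])   # observed interest rates (%)

model = LinearRegression().fit(X, y)

# Predictive use: estimate the rate for a new, unseen applicant
print(model.predict([[700, 80, 220]]))

# Explanatory use: inspect how each input relates to the output
print(dict(zip(["credit_score", "income", "loan_amount"], model.coef_)))
```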

1.2.3 Combination of Statistics, Machine Learning, and


Computing
In the pursuit of extracting useful and relevant information from large data-
sets, data science borrows computational techniques from the disciplines of
statistics, machine learning, experimentation, and database theories. The
algorithms used in data science originate from these disciplines but have
since evolved to adopt more diverse techniques such as parallel computing,
evolutionary computing, linguistics, and behavioral studies. One of the key
ingredients of successful data science is substantial prior knowledge about
the data and the business processes that generate the data, known as subject
matter expertise. Like many quantitative frameworks, data science is an itera-
tive process in which the practitioner gains more information about the pat-
terns and relationships from data in each cycle. Data science also typically
operates on large datasets that need to be stored, processed, and computed.
This is where database techniques along with parallel and distributed com-
puting techniques play an important role in data science.

1.2.4 Learning Algorithms


We can also define data science as a process of discovering previously
unknown patterns in data using automatic iterative methods. The application of
sophisticated learning algorithms for extracting useful patterns from data dif-
ferentiates data science from traditional data analysis techniques. Many of
these algorithms were developed in the past few decades and are a part of
machine learning and artificial intelligence. Some algorithms are based on
the foundations of Bayesian probabilistic theories and regression analysis,
originating from hundreds of years ago. These iterative algorithms automate
the process of searching for an optimal solution for a given data problem.
Based on the problem, data science is classified into tasks such as classifica-
tion, association analysis, clustering, and regression. Each data science task
uses specific learning algorithms like decision trees, neural networks, k-near-
est neighbors (k-NN), and k-means clustering, among others. With increased
research on data science, such algorithms are increasing, but a few classic
algorithms remain foundational to many data science applications.

1.2.5 Associated Fields


While data science covers a wide set of techniques, applications, and disci-
plines, there are a few associated fields that data science heavily relies on. The
techniques used in the steps of a data science process and in conjunction
with the term “data science” are:
• Descriptive statistics: Computing the mean, standard deviation, correlation,
and other descriptive statistics quantifies the aggregate structure of a
dataset. This is essential information for understanding the structure of
the data and the relationships within the dataset. Descriptive statistics are
used in the exploration stage of the data science process (a short code
sketch follows after this list).
• Exploratory visualization: The process of expressing data in visual
coordinates enables users to find patterns and relationships in the data
and to comprehend large datasets. Similar to descriptive statistics, they
are integral in the pre- and post-processing steps in data science.
• Dimensional slicing: Online analytical processing (OLAP) applications,
which are prevalent in organizations, mainly provide information on
the data through dimensional slicing, filtering, and pivoting. OLAP
analysis is enabled by a unique database schema design where the data
are organized as dimensions (e.g., products, regions, dates) and
quantitative facts or measures (e.g., revenue, quantity). With a well-
defined database structure, it is easy to slice the yearly revenue by
products or combination of region and products. These techniques are
extremely useful and may unveil patterns in data (e.g., candy sales
decline after Halloween in the United States).
• Hypothesis testing: In confirmatory data analysis, experimental data are
collected to evaluate whether a hypothesis has enough evidence to be
supported or not. There are many types of statistical testing and they
have a wide variety of business applications (e.g., A/B testing in
marketing). In general, data science is a process where many hypotheses
are generated and tested based on observational data. Since the data
science algorithms are iterative, solutions can be refined in each step.
• Data engineering: Data engineering is the process of sourcing,
organizing, assembling, storing, and distributing data for effective
analysis and usage. Database engineering, distributed storage, and
computing frameworks (e.g., Apache Hadoop, Spark, Kafka), parallel
computing, extraction transformation and loading processing, and data
warehousing constitute data engineering techniques. Data engineering
helps source and prepare for data science learning algorithms.
• Business intelligence: Business intelligence helps organizations consume
data effectively. It helps users query data ad hoc without the need to
write technical query commands, and it uses dashboards or visualizations
to communicate facts and trends. Business intelligence specializes
in the secure delivery of information to the right roles and the distribution
of information at scale. Historical trends are usually reported, but in
combination with data science, both past data and predicted future
data can be presented. BI can hold and distribute the results of data
science.
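As promised in the first item above, here is a small sketch of descriptive statistics on a hypothetical dataset using pandas; any comparable statistical tool or spreadsheet would serve equally well.

```python
# Sketch: descriptive statistics for a small, hypothetical dataset.
import pandas as pd

data = pd.DataFrame({
    "credit_score":  [780, 720, 650, 600, 690],
    "income":        [120,  90,  60,  45,  75],
    "interest_rate": [3.5, 4.2, 5.8, 7.1, 4.9],
})

print(data.describe())   # count, mean, std, min, quartiles, max per column
print(data.corr())       # pairwise correlations between the variables
```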

1.3 CASE FOR DATA SCIENCE


In the past few decades, a massive accumulation of data has been seen with
the advancement of information technology, connected networks, and the
businesses it enables. This trend is also coupled with a steep decline in data
storage and data processing costs. The applications built on these advance-
ments like online businesses, social networking, and mobile technologies
unleash a large amount of complex, heterogeneous data that are waiting to
be analyzed. Traditional analysis techniques like dimensional slicing, hypoth-
esis testing, and descriptive statistics can only go so far in information dis-
covery. A paradigm is needed to manage the massive volume of data, explore
the inter-relationships of thousands of variables, and deploy machine learn-
ing algorithms to deduce optimal insights from datasets. A set of frameworks,
tools, and techniques are needed to intelligently assist humans to process all
these data and extract valuable information (Piatetsky-Shapiro, Brachman,
Khabaza, Kloesgen, & Simoudis, 1996). Data science is one such paradigm
that can handle large volumes with multiple attributes and deploy complex
algorithms to search for patterns from data. Each key motivation for using
data science techniques is explored here.

1.3.1 Volume
The sheer volume of data captured by organizations is exponentially increas-
ing. The rapid decline in storage costs and advancements in capturing every
transaction and event, combined with the business need to extract as much
leverage as possible using data, creates a strong motivation to store more
data than ever. As data become more granular, the need to use large volume
data to extract information increases. A rapid increase in the volume of data
exposes the limitations of current analysis methodologies. In a few imple-
mentations, the time to create generalization models is critical and data vol-
ume plays a major part in determining the time frame of development and
deployment.

1.3.2 Dimensions
The three characteristics of the Big Data phenomenon are high volume, high
velocity, and high variety. The variety of data relates to the multiple types of
values (numerical, categorical), formats of data (audio files, video files), and
the application of the data (location coordinates, graph data). Every single
record or data point contains multiple attributes or variables to provide con-
text for the record. For example, every user record of an ecommerce site can
contain attributes such as products viewed, products purchased, user demo-
graphics, frequency of purchase, clickstream, etc. Determining the most effec-
tive offer for an ecommerce user can involve computing information across

these attributes. Each attribute can be thought of as a dimension in the data


space. The user record has multiple attributes and can be visualized in multi-
dimensional space. The addition of each dimension increases the complexity
of analysis techniques.
A simple linear regression model that has one input dimension is relatively
easy to build compared to multiple linear regression models with multiple
dimensions. As the dimensional space of data increases, a scalable framework
that can work well with multiple data types and multiple attributes is
needed. In the case of text mining, a document or article becomes a data
point with each unique word as a dimension. Text mining yields a dataset
where the number of attributes can range from a few hundred to hundreds
of thousands of attributes.
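A short sketch makes the dimensionality point concrete: converting even three short documents into document vectors, where each unique word becomes an attribute, already produces a wide dataset. The documents are invented, and scikit-learn's CountVectorizer is used only as one convenient way to build the vectors.

```python
# Sketch: in text mining, each unique word becomes a dimension (attribute).
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "candy sales decline after halloween",
    "halloween drives candy sales up",
    "revenue by region and product",
]

vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the word dimensions
print(doc_vectors.toarray())               # one row (data point) per document
```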

1.3.3 Complex Questions


As more complex data are available for analysis, the complexity of informa-
tion that needs to get extracted from data is increasing as well. If the natural
clusters in a dataset, with hundreds of dimensions, need to be found, then
traditional analysis like hypothesis testing techniques cannot be used in a
scalable fashion. The machine-learning algorithms need to be leveraged in
order to automate searching in the vast search space.
Traditional statistical analysis approaches the data analysis problem by
assuming a stochastic model, in order to predict a response variable based
on a set of input variables. A linear regression is a classic example of this
technique where the parameters of the model are estimated from the data.
These hypothesis-driven techniques were highly successful in modeling sim-
ple relationships between response and input variables. However, there is a
significant need to extract nuggets of information from large, complex data-
sets, where the use of traditional statistical data analysis techniques is limited
(Breiman, 2001).
Machine learning approaches the problem of modeling by trying to find an
algorithmic model that can better predict the output from input variables.
The algorithms are usually recursive and, in each cycle, estimate the output
and “learn” from the predictive errors of the previous steps. This route of
modeling greatly assists in exploratory analysis since the approach here is not
validating a hypothesis but generating a multitude of hypotheses for a given
problem. In the context of the data problems faced today, both techniques
need to be deployed. John Tukey, in his article “We need both exploratory
and confirmatory,” stresses the importance of both exploratory and confir-
matory analysis techniques (Tukey, 1980). In this book, a range of data sci-
ence techniques, from traditional statistical modeling techniques like
regressions to the modern machine learning algorithms are discussed.

1.4 DATA SCIENCE CLASSIFICATION


Data science problems can be broadly categorized into supervised or unsuper-
vised learning models. Supervised or directed data science tries to infer a func-
tion or relationship based on labeled training data and uses this function to
map new unlabeled data. Supervised techniques predict the value of the out-
put variables based on a set of input variables. To do this, a model is devel-
oped from a training dataset where the values of input and output are
previously known. The model generalizes the relationship between the input
and output variables and uses it to predict for a dataset where only input
variables are known. The output variable that is being predicted is also called
a class label or target variable. Supervised data science needs a sufficient
number of labeled records to learn the model from the data. Unsupervised
or undirected data science uncovers hidden patterns in unlabeled data. In
unsupervised data science, there are no output variables to predict. The
objective of this class of data science techniques, is to find patterns in data
based on the relationship between data points themselves. An application
can employ both supervised and unsupervised learners.
Data science problems can also be classified into tasks such as: classification,
regression, association analysis, clustering, anomaly detection, recommenda-
tion engines, feature selection, time series forecasting, deep learning, and text
mining (Fig. 1.4). This book is organized around these data science tasks. An
overview is presented in this chapter and an in-depth discussion of the

FIGURE 1.4
Data science tasks.

concepts and step-by-step implementations of many important techniques


will be provided in the upcoming chapters.
Classification and regression techniques predict a target variable based on input
variables. The prediction is based on a generalized model built from a previ-
ously known dataset. In regression tasks, the output variable is numeric (e.g.,
the mortgage interest rate on a loan). Classification tasks predict output vari-
ables, which are categorical or polynomial (e.g., the yes or no decision to
approve a loan). Deep learning is a more sophisticated artificial neural net-
work that is increasingly used for classification and regression problems.
Clustering is the process of identifying the natural groupings in a dataset. For
example, clustering is helpful in finding natural clusters in customer datasets,
which can be used for market segmentation. Since this is unsupervised data
science, it is up to the end user to investigate why these clusters are formed
in the data and generalize the uniqueness of each cluster. In retail analytics,
it is common to identify pairs of items that are purchased together, so that
specific items can be bundled or placed next to each other. This task is called
market basket analysis or association analysis, which is commonly used in
cross selling. Recommendation engines are the systems that recommend items
to the users based on individual user preference.
Anomaly or outlier detection identifies the data points that are significantly dif-
ferent from other data points in a dataset. Credit card transaction fraud detec-
tion is one of the most prolific applications of anomaly detection. Time series
forecasting is the process of predicting the future value of a variable (e.g., tem-
perature) based on past historical values that may exhibit a trend and season-
ality. Text mining is a data science application where the input data is text,
which can be in the form of documents, messages, emails, or web pages. To
aid the data science on text data, the text files are first converted into docu-
ment vectors where each unique word is an attribute. Once the text file is con-
verted to document vectors, standard data science tasks such as classification,
clustering, etc., can be applied. Feature selection is a process in which attributes
in a dataset are reduced to a few attributes that really matter.
A complete data science application can contain elements of both supervised
and unsupervised techniques (Tan et al., 2005). Unsupervised techniques
provide an increased understanding of the dataset and hence, are sometimes
called descriptive data science. As an example of how both unsupervised and
supervised data science can be combined in an application, consider the fol-
lowing scenario. In marketing analytics, clustering can be used to find the
natural clusters in customer records. Each customer is assigned a cluster label
at the end of the clustering process. A labeled customer dataset can now be
used to develop a model that assigns a cluster label for any new customer
record with a supervised classification technique.
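A minimal code sketch of this combined scenario, using synthetic customer-like data and an arbitrary choice of three clusters, could look as follows; it is illustrative only, not a prescribed marketing workflow.

```python
# Sketch: unsupervised clustering produces labels that a supervised
# classifier then learns to assign to new records. Synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(seed=1)
customers = rng.normal(size=(200, 3))          # 200 customers, 3 attributes

# Step 1 (unsupervised): find natural groupings and label each customer
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(customers)

# Step 2 (supervised): learn a model that maps attributes to cluster labels
classifier = DecisionTreeClassifier().fit(customers, clusters)

# Assign a cluster label to a brand-new customer record
new_customer = [[0.2, -1.1, 0.5]]
print(classifier.predict(new_customer))
```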

1.5 DATA SCIENCE ALGORITHMS


An algorithm is a logical step-by-step procedure for solving a problem. In
data science, it is the blueprint for how a particular data problem is solved.
Many of the learning algorithms are recursive, where a set of steps are
repeated many times until a limiting condition is met. Some algorithms also
contain a random variable as an input and are aptly called randomized algo-
rithms. A classification task can be solved using many different learning algo-
rithms such as decision trees, artificial neural networks, k-NN, and even
some regression algorithms. The choice of which algorithm to use depends
on the type of dataset, objective, structure of the data, presence of outliers,
available computational power, number of records, number of attributes,
and so on. It is up to the data science practitioner to decide which algorithm
(s) to use by evaluating the performance of multiple algorithms. There have
been hundreds of algorithms developed in the last few decades to solve data
science problems.
Data science algorithms can be implemented by custom-developed computer
programs in almost any computer language. This obviously is a time-
consuming task. In order to focus the appropriate amount of time on data
and algorithms, data science tools or statistical programming tools, like R,
RapidMiner, Python, SAS Enterprise Miner, etc., which can implement these
algorithms with ease, can be leveraged. These data science tools offer a library
of algorithms as functions, which can be interfaced through programming
code or configured through graphical user interfaces. Table 1.1 provides a
summary of data science tasks with commonly used algorithmic techniques
and example cases.

1.6 ROADMAP FOR THIS BOOK


It’s time to explore data science techniques in more detail. The main body of
this book presents: the concepts behind each data science algorithm and a
practical implementation (or two) for each. The chapters do not have to be
read in a sequence. For each algorithm, a general overview is first provided,
and then the concepts and logic of the learning algorithm and how it works
in plain language are presented. Later, how the algorithm can be implemen-
ted using RapidMiner is shown. RapidMiner is a widely known and used
software tool for data science (Piatetsky, 2018). It has been chosen particularly
for its ease of implementation using a GUI, and because it is available to
use free of charge as an open source data science tool.

Table 1.1 Data Science Tasks and Examples

Classification
  Description: Predict if a data point belongs to one of the predefined classes. The prediction will be based on learning from a known dataset.
  Algorithms: Decision trees, neural networks, Bayesian models, induction rules, k-nearest neighbors.
  Examples: Assigning voters into known buckets by political parties (e.g., soccer moms); bucketing new customers into one of the known customer groups.

Regression
  Description: Predict the numeric target label of a data point. The prediction will be based on learning from a known dataset.
  Algorithms: Linear regression, logistic regression.
  Examples: Predicting the unemployment rate for the next year; estimating an insurance premium.

Anomaly detection
  Description: Predict if a data point is an outlier compared to other data points in the dataset.
  Algorithms: Distance-based, density-based, local outlier factor (LOF).
  Examples: Detecting fraudulent credit card transactions and network intrusion.

Time series forecasting
  Description: Predict the value of the target variable for a future time frame based on historical values.
  Algorithms: Exponential smoothing, ARIMA, regression.
  Examples: Sales forecasting, production forecasting, virtually any growth phenomenon that needs to be extrapolated.

Clustering
  Description: Identify natural clusters within the dataset based on inherent properties within the dataset.
  Algorithms: k-Means, density-based clustering (e.g., DBSCAN).
  Examples: Finding customer segments in a company based on transaction, web, and customer call data.

Association analysis
  Description: Identify relationships within an item set based on transaction data.
  Algorithms: FP-Growth algorithm, Apriori algorithm.
  Examples: Finding cross-selling opportunities for a retailer based on transaction purchase history.

Recommendation engines
  Description: Predict the preference of an item for a user.
  Algorithms: Collaborative filtering, content-based filtering, hybrid recommenders.
  Examples: Finding the top recommended movies for a user.

LOF, local outlier factor; ARIMA, autoregressive integrated moving average; DBSCAN, density-based spatial clustering of applications with noise; FP, frequent pattern.

Here is a roadmap of the book.

1.6.1 Getting Started With Data Science


Successfully uncovering patterns in a dataset is an iterative process.
Chapter 2, Data Science Process, provides a framework to solve data science
problems. The five-step process outlined in this chapter provides guide-
lines on gathering subject matter expertise; exploring the data with statistics
and visualization; building a model using data science algorithms; testing
and deploying the model in the production environment; and finally reflect-
ing on new knowledge gained in the cycle.
Simple data exploration either visually or with the help of basic statistical
analysis can sometimes answer seemingly tough questions meant for data
science. Chapter 3, Data Exploration, covers some of the basic tools used in
knowledge discovery before deploying data science techniques. These practi-
cal tools increase one’s understanding of data and are quite essential in
understanding the results of the data science process.

1.6.2 Practice using RapidMiner


Before delving into the key data science techniques and algorithms, two spe-
cific things should be noted regarding how data science algorithms can be
implemented while reading this book. It is believed that learning the con-
cepts and implementing them enhances the learning experience. First, it is
recommended that the free version of RapidMiner Studio software is down-
loaded from http://www.rapidminer.com and second, the first few sections
of Chapter 15: Getting started with RapidMiner, should be reviewed in order
to become familiar with the features of the tool, its basic operations, and the
user interface functionality. Becoming acclimated to RapidMiner will be helpful
while using the algorithms that are discussed in this book. Chapter 15:
Getting started with RapidMiner, is set at the end of the book because some
of the later sections in the chapter build on the material presented in the
chapters on tasks; however, the first few sections are a good starting point to
become familiar with the tool.

Each chapter has a dataset used to describe the concept of a particular data
science task, and in most cases the same dataset is used for the implementation.
The step-by-step instructions on practicing data science using the dataset are
covered in every algorithm. All the implementations discussed are available at
the companion website of the book at www.IntroDataScience.com. Though not
required, it is advisable to access these files as a learning aid. The dataset,
complete RapidMiner processes (*.rmp files), and many more relevant electronic
files can be downloaded from this website.

1.6.3 Core Algorithms


Classification is the most widely used data science task in business. The objec-
tive of a classification model is to predict a target variable that is binary (e.g.,
a loan decision) or categorical (e.g., a customer type) when a set of input
variables is given. The model does this by learning the generalized relation-
ship between the predicted target variable and all other input attributes
from a known dataset. There are several ways to skin this cat. Each algorithm
differs by how the relationship is extracted from the known training dataset.
Chapter 4, Classification, addresses several of these methods; a brief code sketch follows the list below.
• Decision trees approach the classification problem by partitioning the
  data into purer subsets based on the values of the input attributes. The
  attributes that help achieve the cleanest levels of such separation are
  considered significant in their influence on the target variable and end
  up at the root and closer-to-root levels of the tree. The output model is
  a tree framework that can be used for the prediction of new unlabeled
  data.
• Rule induction is a data science process of deducing “if-then” rules from
  a dataset or from the decision trees. These symbolic decision rules
  explain an inherent relationship between the input attributes and the
  target labels in the dataset that can be easily understood by anyone.
• Naïve Bayesian algorithms provide a probabilistic way of building a
  model. This approach calculates the probability for each value of the
  class variable for given values of input variables. With the help of
  conditional probabilities, for a given unseen record, the model
  calculates the outcome of all values of target classes and comes up with
  a predicted winner.
• Why go through the trouble of extracting complex relationships from
  the data when the entire training dataset can be memorized and the
  relationship can appear to have been generalized? This is exactly what
  the k-NN algorithm does, and it is, therefore, called a “lazy” learner
  where the entire training dataset is memorized as the model.
• Neurons are the nerve cells that connect with each other to form a
  biological neural network in our brain. The working of these
  interconnected nerve cells inspired the creation of artificial neural
  networks for approaching some complex data problems. The neural
  networks section provides a conceptual background of how a simple
  neural network works and how to implement one for any general
  prediction problem. Later on, this is extended to deep neural networks,
  which have revolutionized the field of artificial intelligence.
• Support vector machines (SVMs) were developed to address optical
  character recognition problems: how can an algorithm be trained to
  detect boundaries between different patterns, and thus, identify
  characters? SVMs can, therefore, identify if a given data sample belongs
  within a boundary (in a particular class) or outside it (not in the class).
• Ensemble learners are “meta” models where the model is a combination
  of several different individual models. If certain conditions are met,
  ensemble learners can gain from the wisdom of crowds and greatly
  reduce the generalization error in data science.
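
To make the decision tree and rule induction ideas above concrete, here is a minimal
sketch (not taken from the book, which builds its models in RapidMiner). It trains a
small tree on scikit-learn's bundled Iris dataset and prints the learned tree structure,
which can be read as symbolic decision rules; a recent scikit-learn installation is assumed.

    # A small decision tree trained on a known, labeled dataset; the tree is
    # then printed in a rule-like textual form and applied to a new record.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(iris.data, iris.target)                      # learn from labeled records

    print(export_text(tree, feature_names=list(iris.feature_names)))
    print(tree.predict([[5.1, 3.5, 1.4, 0.2]]))           # class of a new, unlabeled record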
The simple mathematical equation y = ax + b is a linear regression model.


Chapter 5, Regression Methods, describes a class of data science techniques
in which the target variable (e.g., interest rate or a target class) is functionally
related to input variables.
• Linear regression: The simplest of all function-fitting models is based on
  a linear equation, as previously mentioned. Polynomial regression uses
  higher-order equations. No matter what type of equation is used, the
  goal is to represent the variable to be predicted in terms of other
  variables or attributes. Further, the predicted variable and the
  independent variables all have to be numeric for this to work. The
  basics of building regression models will be explored, and how
  predictions can be made using such models will be shown; a small
  fitting sketch follows this list.
• Logistic regression: Addresses the issue of predicting a target variable that
  may be binary or binomial (such as 1 or 0, yes or no) using predictors
  or attributes, which may be numeric.
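
As a small, hedged sketch of the fitting idea (the data values below are made up for
illustration and are not from the book), the slope a and intercept b of y = ax + b can be
estimated by least squares and used to predict the target for an unseen input:

    # Fit y = ax + b to a handful of illustrative points with least squares.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

    a, b = np.polyfit(x, y, deg=1)     # least-squares estimates of slope and intercept
    y_new = a * 6.0 + b                # predict the target for an unseen input value
    print(a, b, y_new)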
Supervised, or directed, data science predicts the value of target
variables. Two important unsupervised data science tasks will be reviewed:
Association Analysis in Chapter 6 and Clustering in Chapter 7. Ever heard of
the beer and diaper association in supermarkets? Apparently, a supermarket
discovered that customers who buy diapers also tend to buy beer. While this
may have been an urban legend, the observation has become a poster child
for association analysis. Associating an item in a transaction with another
item in the transaction to determine the most frequently occurring patterns is
termed association analysis. This technique is about, for example, finding rela-
tionships between products in a supermarket based on purchase data, or
finding related web pages in a website based on clickstream data. It is widely
used in retail, ecommerce, and media to creatively bundle products.
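
To give a flavor of the underlying mechanics, here is a minimal sketch, independent of
the book's RapidMiner processes, that counts how often pairs of items appear together
in a toy set of baskets. Real association algorithms such as Apriori or FP-Growth do
this kind of counting far more efficiently over large item sets.

    # Count co-occurring item pairs and report their support, i.e., the
    # fraction of transactions containing both items (toy data).
    from itertools import combinations
    from collections import Counter

    transactions = [
        {"milk", "bread", "diapers"},
        {"beer", "diapers", "bread"},
        {"milk", "diapers", "beer"},
        {"bread", "milk"},
    ]

    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    for pair, count in pair_counts.most_common(3):
        print(pair, count / len(transactions))   # the most frequent pairs and their support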
Clustering is the data science task of identifying natural groups in the data. As
an unsupervised task, there is no target class variable to predict. After the
clustering is performed, each record in the dataset is associated with one or
more clusters. Widely used in marketing segmentation and text mining, clus-
tering can be performed by a range of algorithms. In Chapter 7, Clustering,
three common algorithms with diverse identification approaches will be dis-
cussed. The k-means clustering technique identifies a cluster based on a central
prototype record. DBSCAN clustering partitions the data based on variation
in the density of records in a dataset. Self-organizing maps create a two-
dimensional grid where all the records related to each other are placed
next to each other.
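
As a brief, hedged illustration of the prototype-based approach, the sketch below runs
k-means on a toy two-dimensional dataset; it assumes Python with scikit-learn, whereas
the book's own implementations use RapidMiner.

    # k-means clustering on toy data: no target label is used; the algorithm
    # assigns each record to the nearest of k central prototypes.
    import numpy as np
    from sklearn.cluster import KMeans

    points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],   # one natural group
                       [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])  # another natural group

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    print(labels)   # cluster membership for each record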
How to determine which algorithms work best for a given dataset? Or for
that matter how to objectively quantify the performance of any algorithm on
a dataset? These questions are addressed in Chapter 8, Model Evaluation,
which covers performance evaluation. The most commonly used tools for
evaluating classification models such as a confusion matrix, ROC curves, and
lift charts are described.
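
As a small illustration (not taken from the book), a confusion matrix can be computed
directly from a list of actual and predicted class labels, here using scikit-learn:

    # Build a confusion matrix and an accuracy score from toy labels.
    from sklearn.metrics import confusion_matrix, accuracy_score

    actual    = [1, 0, 1, 1, 0, 1, 0, 0]
    predicted = [1, 0, 0, 1, 0, 1, 1, 0]

    print(confusion_matrix(actual, predicted))   # rows: actual classes, columns: predicted
    print(accuracy_score(actual, predicted))     # fraction of correct predictions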
Chapter 9, Text Mining, provides a detailed look into the area of text mining
and text analytics. It starts with a background on the origins of text mining
and provides the motivation for this fascinating topic using the example of
IBM’s Watson, the Jeopardy-winning computer program that was built using
concepts from text and data mining. The chapter introduces some key concepts
important in the area of text analytics, such as term frequency-inverse
document frequency (TF-IDF) scores. Finally, it describes two case studies in which it is
shown how to implement text mining for document clustering and automatic
classification based on text content.
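
As a minimal sketch of the TF-IDF idea (assuming Python with a recent version of
scikit-learn rather than the RapidMiner processes used in the book), each document in a
toy corpus is converted into a vector of term weights:

    # Turn a small corpus into TF-IDF vectors: terms that are frequent in a
    # document but rare across the corpus receive the highest weights.
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = ["the cat sat on the mat",
                 "the dog chased the cat",
                 "data science with text data"]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(documents)     # documents become numeric vectors
    print(vectorizer.get_feature_names_out())       # the terms (columns of the matrix)
    print(tfidf.toarray().round(2))                 # one weighted row per document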
Chapter 10, Deep Learning, describes a set of algorithms to model high level
abstractions in data. They are increasingly applied to image processing,
speech recognition, online advertisements, and bioinformatics. This chapter
covers the basic concepts of deep learning, popular use cases, and a sample
classification implementation.
The advent of the digital economy has exponentially increased the choice of
products available to the customer, which can be overwhelming. Personalized rec-
ommendation lists help by narrowing the choices to a few items relevant to
a particular user and aid users in making final consumption decisions.
Recommendation engines, covered in Chapter 11, are the most prolific utili-
ties of machine learning in everyday experience. Recommendation engines
are a class of machine learning techniques that predict a user preference for
an item. There are a wide range of techniques available to build a recommen-
dation engine. This chapter discusses the most common methods starting
with collaborative filtering and content-based filtering concepts and implementa-
tions using a practical dataset.
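
To give a flavor of collaborative filtering, the sketch below estimates a missing rating
from the ratings of similar users. The ratings matrix is invented for illustration, and the
similarity-weighted average is a simplification of the methods covered in Chapter 11.

    # Estimate an unknown user-item rating from the ratings of similar users.
    import numpy as np

    # rows = users, columns = items; 0 means "not rated yet"
    ratings = np.array([[5, 4, 0, 1],
                        [4, 5, 3, 1],
                        [1, 1, 0, 5],
                        [1, 2, 4, 4]], dtype=float)

    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    target_user, target_item = 0, 2
    similarities = np.array([cosine(ratings[target_user], ratings[u])
                             for u in range(len(ratings))])
    rated = ratings[:, target_item] > 0              # users who have rated the item
    predicted = np.average(ratings[rated, target_item],
                           weights=similarities[rated])
    print(predicted)    # similarity-weighted estimate of the missing rating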
Forecasting is a common application of time series analysis. Companies use
sales forecasts, budget forecasts, or production forecasts in their planning
cycles. Chapter 12 on Time Series Forecasting starts by pointing out the dis-
tinction between standard supervised predictive models and time series fore-
casting models. The chapter covers a few time series forecasting methods,
starting with time series decomposition, moving averages, exponential
smoothing, regression, ARIMA methods, and machine learning based meth-
ods using windowing techniques.
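
As a small, hedged sketch of the smoothing-based methods named above (the sales
figures are invented for illustration and are not from the book):

    # Two simple one-step-ahead forecasts for a toy monthly sales series.
    import numpy as np

    sales = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])

    # Naive moving-average forecast: average of the last three observations
    moving_average_forecast = sales[-3:].mean()

    # Simple exponential smoothing: weighted blend of the history, alpha = 0.3
    alpha, level = 0.3, sales[0]
    for value in sales[1:]:
        level = alpha * value + (1 - alpha) * level
    print(moving_average_forecast, level)   # the two competing forecasts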
Chapter 13 on Anomaly Detection describes how outliers in data can be
detected by combining multiple data science tasks like classification, regres-
sion, and clustering. The fraud alert received from credit card companies is
the result of an anomaly detection algorithm. The target variable to be
predicted is whether a transaction is an outlier or not. Since clustering tasks
identify outliers as a cluster, distance-based and density-based clustering tech-
niques can be used in anomaly detection tasks.
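
As a minimal illustration of the distance-based idea (toy data and an illustrative cutoff,
not the method developed in the book), a record whose average distance to the other
records is unusually large is flagged as an outlier:

    # Flag the record whose mean distance to all other records is far above
    # the typical mean distance in the dataset.
    import numpy as np

    records = np.array([[1.0, 1.10], [0.9, 1.00], [1.1, 0.90],
                        [1.0, 0.95], [8.0, 8.00]])           # last record looks anomalous

    distances = np.linalg.norm(records[:, None, :] - records[None, :, :], axis=2)
    mean_distance = distances.sum(axis=1) / (len(records) - 1)

    cutoff = 3 * np.median(mean_distance)            # illustrative cutoff, not a standard rule
    print(np.where(mean_distance > cutoff)[0])       # indices of the flagged records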
In data science, the objective is to develop a representative model to general-
ize the relationship between input attributes and target attributes, so that we
can predict the value or class of the target variables. Chapter 14, Feature
Selection, introduces a preprocessing step that is often critical for a successful
predictive modeling exercise: feature selection. Feature selection is known by
several alternative terms such as attribute weighting, dimension reduction,
and so on. There are two main styles of feature selection: filtering the key
attributes before modeling (filter style) or selecting the attributes during the
process of modeling (wrapper style). A few filter-based methods such as prin-
cipal component analysis (PCA), information gain, and chi-square, and a
couple of wrapper-type methods like forward selection and backward elimi-
nation will be discussed.
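
As a brief sketch of the filter style (assuming Python with scikit-learn; the book
implements these steps in RapidMiner), each attribute is scored against the target with a
chi-square test and only the top-scoring attributes are retained before modeling:

    # Filter-style feature selection: score attributes, then keep the best k.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True)
    selector = SelectKBest(score_func=chi2, k=2)     # keep the two highest-scoring attributes
    X_reduced = selector.fit_transform(X, y)
    print(selector.scores_)                          # chi-square score per attribute
    print(X_reduced.shape)                           # (150, 2) after selection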
The first few sections of Chapter 15, Getting Started with RapidMiner, should
provide a good overview for getting familiar with RapidMiner, while the lat-
ter sections of this chapter discuss some of the commonly used productivity
tools and techniques such as data transformation, missing value handling,
and process optimizations using RapidMiner.

References
Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16(3), 199–231.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery
in databases. AI Magazine, 17(3), 37–54.
Parr Rud, O. (2001). Data mining cookbook. New York: John Wiley and Sons.
Piatetsky, G. (2018). Top software for analytics, data science, machine learning in 2018: Trends
and analysis. Retrieved July 7, 2018, from https://www.kdnuggets.com/2018/05/poll-tools-
analytics-data-science-machine-learning-results.html.
Piatetsky-Shapiro, G., Brachman, R., Khabaza, T., Kloesgen, W., & Simoudis, E. (1996). An over-
view of issues in developing industrial data mining and knowledge discovery applications.
In: KDD-96 conference proceedings.
Rexer, K. (2013). 2013 Data miner survey summary report. Winchester, MA: Rexer Analytics.
<www.rexeranalytics.com>.
Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to data mining. Boston, MA:
Addison-Wesley.
Tukey, J. (1980). We need both exploratory and confirmatory. The American Statistician, 34(1),
23–25.
CHAPTER 2

Data Science Process

The methodical discovery of useful relationships and patterns in data is
enabled by a set of iterative activities collectively known as the data science
process. The standard data science process involves (1) understanding the
problem, (2) preparing the data samples, (3) developing the model, (4) apply-
ing the model on a dataset to see how the model may work in the real world,
and (5) deploying and maintaining the models. Over the years of evolution of
data science practices, different frameworks for the process have been put
forward by various academic and commercial bodies. The framework put for-
ward in this chapter is synthesized from a few data science frameworks and is
explained using a simple example dataset. This chapter serves as a high-level
roadmap to building deployable data science models, and discusses the chal-
lenges faced in each step and the pitfalls to avoid.
One of the most popular data science process frameworks is the Cross Industry
Standard Process for Data Mining (CRISP-DM). This framework was devel-
oped by a consortium of companies involved in data mining (Chapman
et al., 2000). The CRISP-DM process is the most widely adopted framework
for developing data science solutions. Fig. 2.1 provides a visual overview of
the CRISP-DM framework. Other data science frameworks are SEMMA, an
acronym for Sample, Explore, Modify, Model, and Assess, developed by the
SAS Institute (SAS Institute, 2013); DMAIC, an acronym for Define,
Measure, Analyze, Improve, and Control, used in Six Sigma practice (Kubiak
& Benbow, 2005); and the Selection, Preprocessing, Transformation, Data
Mining, Interpretation, and Evaluation framework used in the knowledge dis-
covery in databases process (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). All
these frameworks exhibit common characteristics, and hence, a generic
framework closely resembling the CRISP process will be used. As with any
process framework, a data science process recommends the execution of a
certain set of tasks to achieve optimal output. However, the process of
extracting information and knowledge from the data is iterative. The steps
within the data science process are not linear and have to undergo many
loops, go back and forth between steps, and at times go back to the first step
to redefine the data science problem statement.

FIGURE 2.1
CRISP data mining framework, with phases for business understanding, data
understanding, data preparation, modeling, evaluation, and deployment.
The data science process presented in Fig. 2.2 is a generic set of steps that is
problem, algorithm, and, data science tool agnostic. The fundamental objec-
tive of any process that involves data science is to address the analysis ques-
tion. The problem at hand could be a segmentation of customers, a prediction
of climate patterns, or a simple data exploration. The learning algorithm used
to solve the business question could be a decision tree, an artificial neural net-
work, or a scatterplot. The software tool to develop and implement the data
science algorithm used could be custom coding, RapidMiner, R, Weka, SAS,
Oracle Data Miner, Python, etc., (Piatetsky, 2018) to mention a few.
Data science, specifically in the context of big data, has gained importance in
the last few years. Perhaps the most visible and discussed part of data science
is the third step: modeling. It is the process of building representative models
that can be inferred from the sample dataset which can be used for either
predicting (predictive modeling) or describing the underlying pattern in the
data (descriptive or explanatory modeling).

FIGURE 2.2
Data science process.

Rightfully so, there is plenty of academic and business research in the modeling
step. Most of this book has
been dedicated to discussing various algorithms and the quantitative founda-
tions that go with them. However, emphasis should be placed on considering
data science as an end-to-end, multi-step, iterative process instead of just a
model building step. Seasoned data science practitioners can attest to the fact
that the most time-consuming part of the overall data science process is not
the model building part, but the preparation of data, followed by data and
business understanding. There are many data science tools, both open source
and commercial, available on the market that can automate the model build-
ing. Asking the right business question, gaining in-depth business understand-
ing, sourcing and preparing the data for the data science task, mitigating
implementation considerations, integrating the model into the business pro-
cess, and, most useful of all, gaining knowledge from the dataset, remain cru-
cial to the success of the data science process. It’s time to get started with Step
1: Framing the data science question and understanding the context.

2.1 PRIOR KNOWLEDGE


Prior knowledge refers to information that is already known about a subject.
The data science problem doesn’t emerge in isolation; it always develops on
top of existing subject matter and contextual information that is already
known. The prior knowledge step in the data science process helps to define
what problem is being solved, how it fits in the business context, and what
data is needed in order to solve the problem.

2.1.1 Objective
The data science process starts with a need for analysis, a question, or a busi-
ness objective. This is possibly the most important step in the data science
process (Shearer, 2000). Without a well-defined statement of the problem, it
is impossible to come up with the right dataset and pick the right data science
algorithm. As an iterative process, it is common to go back to previous data
science process steps, revise the assumptions, approach, and tactics. However,
it is imperative to get the first step—the objective of the whole process—right.
The data science process is going to be explained using a hypothetical use
case. Take the consumer loan business for example, where a loan is provi-
sioned for individuals against the collateral of assets like a home or car, that
is, a mortgage or an auto loan. As many homeowners know, an important
component of the loan, for the borrower and the lender, is the interest rate
at which the borrower repays the loan on top of the principal. The interest
rate on a loan depends on a gamut of variables like the current federal funds
rate as determined by the central bank, borrower’s credit score, income level,
home value, initial down payment amount, current assets and liabilities of
the borrower, etc. The key factor here is whether the lender sees enough
reward (interest on the loan) against the risk of losing the principal (bor-
rower’s default on the loan). In an individual case, the status of default of a
loan is Boolean; either one defaults or not, during the period of the loan.
But, in a group of tens of thousands of borrowers, the default rate can be
found—a continuous numeric variable that indicates the percentage of bor-
rowers who default on their loans. All the variables related to the borrower
like credit score, income, current liabilities, etc., are used to assess the default
risk in a related group; based on this, the interest rate is determined for a
loan. The business objective of this hypothetical case is: If the interest rate of
past borrowers with a range of credit scores is known, can the interest rate for a new
borrower be predicted?

2.1.2 Subject Area


The process of data science uncovers hidden patterns in the dataset by expos-
ing relationships between attributes. But the problem is that it uncovers a lot
of patterns. The false or spurious signals are a major problem in the data sci-
ence process. It is up to the practitioner to sift through the exposed patterns
and accept the ones that are valid and relevant to the answer of the objective
2.1 Prior Knowledge 23

question. Hence, it is essential to know the subject matter, the context, and
the business process generating the data.
The lending business is one of the oldest, most prevalent, and complex of all
the businesses. If the objective is to predict the lending interest rate, then it is
important to know how the lending business works, why the prediction mat-
ters, what happens after the rate is predicted, what data points can be col-
lected from borrowers, what data points cannot be collected because of the
external regulations and the internal policies, what other external factors can
affect the interest rate, how to verify the validity of the outcome, and so
forth. Understanding current models and business practices lays the founda-
tion and establishes known knowledge. Analyzing and mining the data pro-
vides new knowledge that can be built on top of the existing knowledge
(Lidwell, Holden, & Butler, 2010).

2.1.3 Data
Similar to the prior knowledge in the subject area, prior knowledge in the
data can also be gathered. Understanding how the data is collected, stored,
transformed, reported, and used is essential to the data science process. This
part of the step surveys all the data available to answer the business question
and narrows down the new data that need to be sourced. There are quite a
range of factors to consider: quality of the data, quantity of data, availability of
data, gaps in data, does lack of data compel the practitioner to change the busi-
ness question, etc. The objective of this step is to come up with a dataset to
answer the business question through the data science process. It is critical to
recognize that an inferred model is only as good as the data used to create it.
For the lending example, a sample dataset of ten data points with three attri-
butes has been put together: identifier, credit score, and interest rate. First,
some of the terminology used in the data science process is discussed.
• A dataset (example set) is a collection of data with a defined structure.
  Table 2.1 shows a dataset. It has a well-defined structure with 10 rows
  and 3 columns along with the column headers. This structure is also
  sometimes referred to as a “data frame”.
• A data point (record, object or example) is a single instance in the dataset.
  Each row in Table 2.1 is a data point. Each instance contains the same
  structure as the dataset.
• An attribute (feature, input, dimension, variable, or predictor) is a single
  property of the dataset. Each column in Table 2.1 is an attribute.
  Attributes can be numeric, categorical, date-time, text, or Boolean data
  types. In this example, both the credit score and the interest rate are
  numeric attributes.
Table 2.1 Dataset


Borrower ID Credit Score Interest Rate (%)

01 500 7.31
02 600 6.70
03 700 5.95
04 700 6.40
05 800 5.40
06 800 5.70
07 750 5.90
08 550 7.00
09 650 6.50
10 825 5.70

Table 2.2 New Data With Unknown Interest Rate


Borrower ID Credit Score Interest Rate

11 625 ?

• A label (class label, output, prediction, target, or response) is the special
  attribute to be predicted based on all the input attributes. In Table 2.1,
  the interest rate is the output variable.
• Identifiers are special attributes that are used for locating or providing
  context to individual records. For example, common attributes like
  names, account numbers, and employee ID numbers are identifier
  attributes. Identifiers are often used as lookup keys to join multiple
  datasets. They bear no information that is suitable for building data
  science models and should, thus, be excluded from the actual modeling
  step. In Table 2.1, the attribute ID is the identifier.
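
As a small aside (not part of the book's RapidMiner workflow), Table 2.1 could be
represented as a data frame in Python, assuming the pandas library is installed; this
makes the terminology above concrete:

    # Table 2.1 as a data frame: ten data points, three attributes.
    import pandas as pd

    dataset = pd.DataFrame({
        "Borrower ID":   ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10"],
        "Credit Score":  [500, 600, 700, 700, 800, 800, 750, 550, 650, 825],
        "Interest Rate": [7.31, 6.70, 5.95, 6.40, 5.40, 5.70, 5.90, 7.00, 6.50, 5.70],
    })
    # "Borrower ID" is an identifier, "Credit Score" is an input attribute,
    # and "Interest Rate" is the label to be predicted.
    print(dataset.shape)      # (10, 3)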

2.1.4 Causation Versus Correlation


Suppose the business question is inverted: Based on the data in Table 2.1, can the
credit score of the borrower be predicted based on interest rate? The answer is yes—
but it does not make business sense. From the existing domain expertise, it is
known that credit score influences the loan interest rate. Predicting credit score
based on interest rate inverts the direction of the causal relationship. This
question also exposes one of the key aspects of model building. The correlation
between the input and output attributes doesn’t guarantee causation. Hence, it
is important to frame the data science question correctly using the existing
domain and data knowledge. In this data science example, the interest rate of
the new borrower with an unknown interest rate will be predicted (Table 2.2)
based on the pattern learned from known data in Table 2.1.
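
As a minimal, hedged sketch of this objective (a straight-line fit is only one of many
possible models, and not necessarily the one developed later in the book), the relationship
in Table 2.1 can be learned and then applied to the new borrower in Table 2.2:

    # Learn the credit score to interest rate relationship from Table 2.1
    # and predict the rate for the new borrower in Table 2.2.
    import numpy as np

    credit_score  = np.array([500, 600, 700, 700, 800, 800, 750, 550, 650, 825])
    interest_rate = np.array([7.31, 6.70, 5.95, 6.40, 5.40, 5.70, 5.90, 7.00, 6.50, 5.70])

    slope, intercept = np.polyfit(credit_score, interest_rate, deg=1)
    new_borrower_rate = slope * 625 + intercept      # borrower 11 from Table 2.2
    print(round(new_borrower_rate, 2))               # predicted interest rate (%)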
weight of the name of the Empire State is thus placed squarely
behind the demand for real publicity of receipts.[74] Under this act,
voluntarily accepted by the national chairmen in 1908, publicity was
given to the finances of a presidential campaign for the first time in
the history of the country.
In the national field the nearest approach to legislation prescribing
publicity for campaign contributions was made by a bill (H. R. 20112)
introduced into the House of Representatives in 1908. Briefly this bill
covered both expenditures and contributions of the national and the
congressional campaign committees of all parties, and of “all
committees, associations, or organisations which shall in two or
more states influence the result or attempt to influence the result of
an election at which Representatives in Congress are to be elected.”
Treasurers of such committees were required to file itemised detailed
statements with the Clerk of the House of Representatives “not more
than fifteen days and not less than ten days before an election,” and
also final reports within thirty days after such elections. These
statements were to include the names and addresses of contributors
of $100 or more, the total of contributions under $100,
disbursements exceeding $10 in detail, and the total of
disbursements of less amount. The bill also contained provisions,
which will be referred to later, designed to cover the use of money by
persons or associations other than those mentioned above.
Unfortunately a provision was tacked on to the foregoing raising the
question of the restriction of colored voting in the South and hinting
at a reapportionment of congressional representation under the
Fourteenth Amendment to the Constitution. As a consequence an
embittered opposition was made by the Democrats who charged that
the latter provision was deliberately introduced in bad faith with the
intention of making the passage of the bill impossible. In the House it
was carried by a solid Republican vote of 161 in its favour to 126
Democratic votes in opposition, but was allowed to expire in the
Senate Committee on Privileges and Elections for fear that it would
become the object of a Democratic filibuster.
Whatever may be the merits of the proposal to readjust
congressional representation it is clearly a question which is logically
separable from that of campaign contributions. If this separation is
effected there would seem to be reason to hope that a publicity bill
similar in its main outlines to that of 1908 can pass Congress. While
a platform plank of this sort was voted down in the Republican
National Convention of that year, Mr. Taft in his speech of
acceptance said:—
“If I am elected President I shall urge upon Congress, with every
hope of success, that a law be passed requiring a filing in a Federal
office of a statement of the contributions received by committees and
candidates in elections for members of Congress, and in such other
elections as are constitutionally within the control of Congress.”[75]
The manœuvring for position between the parties in 1908 which
resulted in the voluntary acceptance by each of high standards of
publicity is too fresh in the public mind to require rehearsal here. For
the first time in the history of presidential elections some definite
information was made available regarding campaign finances. The
Republican National Committee reported contributions of
$1,035,368.27. This sum, however, does not include $620,150
collected in the several states by the finance committees of the
Republican National Committee and turned over by them to their
respective state committees. The Democratic National Committee
reported contributions amounting to $620,644.77. The list of
contributors to the Republican National Fund contained 12,330
names.[76] The Democratic National Committee filed a “list of over
25,000 names representing over 100,000 contributors who
contributed through newspapers, clubs, solicitors, and other
organisations, whose names are on file in the office of the chairman
of the Democratic National Committee at Buffalo.”[77]
On many points, unfortunately, the two reports, while definite to a
degree hitherto unknown, are not strictly comparable. Some species
of “uniform accounting” applicable to this subject is manifestly
necessary before any detailed investigation can be undertaken. One
big fact stands out with sufficient clearness, however, namely that
the national campaign of 1908 was waged at a money cost far below
that of the three preceding campaigns.
Basing his estimate upon what is said to have been spent in 1896,
1900, and 1904, Mr. F. A. Ogg placed the total cost of a presidential
election to both parties, including the state and local contests
occurring at the same time, at $15,000,000.[78] One-third to one-half
of this enormous sum, in his opinion, must be attributed to the
presidential campaign proper. Compared with this estimate of from
five to seven and a half millions the relatively modest total of
something more than two and a quarter millions shown by the figures
of 1908 must be counted a strong argument in favour of publicity.
The most important single issue raised by the policies of the two
parties during the last presidential campaign was that of publicity
before or after election. Early in the campaign the Democratic
National Committee decided to publish on or before October 15th all
individual contributions in excess of $100; contributions received
subsequent to that date to be published on the day of their receipt.
Following the principle of the New York law both parties made post-
election statements. It is manifest that complete statements of
expenditures, or for that matter of contributions as well, can be made
only after election. Every thorough provision for publicity must,
therefore, require post-election reports. Shall preliminary statements
also be required? As against the latter it is urged that contributors
whose motives are of the highest character will be deterred by the
fear of savage partisan criticism. If publicity is delayed until after the
election campaign bitterness will have subsided and a juster view of
the whole situation will be possible. In favour of publicity before the
election it is said that two main ends are aimed at by all legislation of
this sort;—first to prevent the collection and expenditure of enormous
sums for the bribery of voters and other corrupt purposes, and,
second, by revealing the source of campaign funds to make it
difficult or impossible for the victorious party to carry out corrupt
bargains into which it may have entered in order to obtain large
contributions. Publicity after the election will, indeed, serve the
second of these ends, but publicity before would be much more
effective in preventing corrupt collection and expenditure of funds.
Moreover it might prevent the victory of the party pursuing such a
policy and thus, by keeping it out of power, render it incapable of
paying by governmental favour for its contributions.
In attempting to arrive at a conclusion on this issue it is difficult to
assign it such practical importance as it received during the
campaign of 1908. Publicity after election simply delays the time of
exposure. The knowledge that it is bound to come must exert a very
powerful influence over intending contributors. That this was the
case in 1908 is pretty convincingly demonstrated by a comparison of
the figures of that year with the figures for earlier presidential
campaigns. It is certain that publicity pure and simple, whether
before or after election, will seldom show on the face of the returns
any facts seriously reflecting upon party integrity. If there is to be
difficulty in administering laws of this character it will come in the way
of getting at real, complete statements, going back of the names and
figures on the return if necessary. On the other hand it is not
altogether to be deplored that before election publicity may result in
rather bitter criticism of some contributors. Gifts in general, as we
have already noted,[79] stand in especial need of criticism, and this
principle applies with maximum force to campaign gifts. Designed as
they are to affect public policy a plea for privacy cannot be set up on
their behalf. If the criticism of contributors should go to extremes it
will hurt the party making it more than the individuals assailed.
Contributors who know their own motives to be honourable ought not
to allow themselves to be deterred by baseless clamour. If, however,
such criticism is just, both the individual making, and the party
receiving the suspicious contribution deserve to suffer. By deterring
other contributions of a similar questionable character a distinct
public service will be rendered by such ante-election criticism.
Knowledge of the sources of the financial support of a party is
certainly not the only nor the best basis to be employed by an elector
in determining the way he shall cast his vote, but under present
conditions it is certainly a matter which he is entitled to take into
consideration. While admitting, therefore, that there is room for
honest difference of opinion on the question of publicity before or
after election, the weight of the argument would seem to fall distinctly
in favour of the former. It is sincerely to be regretted that the question
became in a sense a matter of party record in 1908. Going back to
the congressional bill of the same year, however, it is worth noting
that the Republican majority in the House once placed itself solidly
and squarely on record in favour of publicity before the election.
Looking at the matter solely from the lower standpoint of expediency
that party is now in a most enviable position to revert to its earlier
attitude and, by enacting the principle of ante-election publicity into
law, to secure for itself the credit of a popular reform. This would
place the two parties on a uniform legal basis for the future, and
make it impossible for the Democrats to assume voluntarily a higher
standard regarding publicity which they could then use as a
campaign argument against the Republicans.[80]
THE OLD STORY.
There is without doubt a large criminal and semi-criminal class
among colored people. This is but another way of saying that the
social uplift of a group of freedmen is a serious task. But it is also
true, and painfully true, that the crime imputed carelessly and
recklessly against colored people gives an impression of far greater
criminality than the facts warrant.
Take, for instance, a typical case: A little innocent schoolgirl is
brutally murdered in New Jersey. A Negro vagabond is arrested.
Immediately the news is heralded from East to West, from North to
South, in Europe and Asia, of the crime of this black murderer.
Immediately a frenzied, hysterical mob gathers and attempts to
lynch the poor wretch. He is spirited away and the public is almost
sorry that he has escaped summary justice. Without counsel or
friends, the man is shut up in prison and tortured to make him
confess. “They did pretty near everything to me except kill me,”
whispered the wretched man to the first friend he saw.
Finally, after the whole black race in America had suffered
aspersion for several weeks, sense begins to dawn in Jersey. After all,
what proof was there against this man? He was lazy, he had been in
jail for alleged theft from gypsies, he was good natured, and he drank
whiskey. That was all. Yet he stayed in jail under no charge and
under universal censure. The coroner’s jury found no evidence to
indict him. Still he lay in jail. Finally the National Association for the
Advancement of Colored People stepped in and said, “What are you
holding this man for?” The Public Prosecutor got red in the face and
vociferated. Then he went downtown, and when the habeas corpus
proceedings came and the judge asked again: “Why are you holding
this man?” the prosecutor said chirpily, “For violating election laws,”
and brought a mass of testimony. Then the judge discharged the
prisoner from the murder charge and congratulated the National
Association for the Advancement of Colored People—but the man is
still in jail.
Such justice is outrageous and such methods disgraceful. Black
folk are willing to shoulder their own sins, but the difference between
a vagabond and a murderer is too tremendous to be lightly ignored.
“SOCIAL EQUALITY.”
At last we have a definition of the very elusive phrase “Social
Equality” as applied to the Negro problem. In stating their grievances
colored people have recently specified these points:
1. Disfranchisement, even of educated Negroes.
2. Curtailment of common school training.
3. Confinement to “Ghettos.”
4. Discrimination in wages.
5. Confinement to menial employment.
6. Systematic insult of their women.
7. Lynching and miscarriage of justice.
8. Refusal to recognize fitness “in political or industrial life.”
9. Personal discourtesy.
Southern papers in Charlotte, Richmond, New Orleans and
Nashville have with singular unanimity hastened to call this
complaint an unequivocal demand for “social equality,” and as such
absolutely inadmissible. We are glad to have a frank definition,
because we have always suspected this smooth phrase. We
recommend on this showing that hereafter colored men who hasten
to disavow any desire for “social equality” should carefully read the
above list of disabilities which social inequality would seem to
prescribe.
ASHAMED.
Any colored man who complains of the treatment he receives in
America is apt to be faced sooner or later by the statement that he is
ashamed of his race.
The statement usually strikes him as a most astounding piece of
illogical reasoning, to which a hot reply is appropriate.
And yet notice the curious logic of the persons who say such
things. They argue:
White men alone are men. This Negro wants to be a man. Ergo he
wants to be a white man.
Their attention is drawn to the efforts of colored people to be
treated decently. This minor premise therefore attracts them. But the
major premise—the question as to treating black men like white men
—never enters their heads, nor can they conceive it entering the
black man’s head. If he wants to be a man he must want to be white,
and therefore it is with peculiar complacency that a Tennessee paper
says of a dark champion of Negro equality: “He bitterly resents his
Negro blood.”
Not so, O Blind Man. He bitterly resents your treatment of Negro
blood. The prouder he is, or has a right to be, of the blood of his black
fathers, the more doggedly he resists the attempt to load men of that
blood with ignominy and chains. It is race pride that fights for
freedom; it is the man ashamed of his blood who weakly submits and
smiles.
JESUS CHRIST IN BALTIMORE.
It seems that it is not only Property that is screaming with fright at
the Black Spectre in Baltimore, but Religion also. Two churches
founded in the name of Him who “put down the mighty from their
seats and exalted them of low degree” are compelled to move. Their
palatial edifices filled with marble memorials and Tiffany windows
are quite useless for the purposes of their religion since black folk
settled next door. Incontinently they have dropped their Bibles and
gathered up their priestly robes and fled, after selling their property
to colored people for $125,000 in good, cold cash.
Where are they going? Uptown. Up to the wealthy and exclusive
and socially select. There they will establish their little gods again,
and learned prelates with sonorous voices will ask the echoing pews:
“How can the Church reach the working man?”
Why not ask the working man? Why not ask black people, and
yellow people, and poor people, and all the people from whom such
congregations flee in holy terror? The church that does not run from
the lowly finds the lowly at its doors, and there are some such
churches in the land, but we fear that their number in Baltimore is
not as great as should be.
“EXCEPT SERVANTS.”
The noticeable reservation in all attempts, North and South, to
separate black folk and white is the saving phrase, “Except servants.”
Are not servants colored? Is the objection, then, to colored people
or to colored people who are not servants? In other words, is this
race prejudice inborn antipathy or a social and economic caste?
SOCIAL CONTROL

By JANE ADDAMS, of Hull House

I always find it very difficult to write upon the great race problem
which we have in America. I think, on the whole, that the most
satisfactory books on the subject are written by people outside of
America. This is certainly true of two books, “White Capital and
Black Labor,” by the Governor of Jamaica, and William Archer’s
book entitled “Afro-America.” Although the latter is inconclusive, it
at least gives one the impression that the man who has written it has
seen clearly into the situation, and that, I suppose, is what it is very
difficult for any American to do.
One thing, however, is clear to all of us, that not only in the South,
but everywhere in America, a strong race antagonism is asserting
itself, which has various modes of lawless and insolent expression.
The contemptuous attitude of the so-called superior race toward the
inferior results in a social segregation of each race, and puts the one
race group thus segregated quite outside the influences of social
control represented by the other. Those inherited resources of the
race embodied in custom and kindly intercourse which make much
more for social restraint than does legal enactment itself are thus
made operative only upon the group which has inherited them, and
the newer group which needs them most is practically left without.
Thus in every large city we have a colony of colored people who have
not been brought under social control, and a majority of the white
people in the same community are tacitly endeavoring to keep from
them those restraints which can be communicated only through
social intercourse. One could easily illustrate this lack of inherited
control by comparing the experiences of a group of colored girls with
those of a group representing the daughters of Italian immigrants, or
of any other South European peoples. The Italian girls very much
enjoy the novelty of factory work, the opportunity to earn money and
to dress as Americans do, but this new freedom of theirs is carefully
guarded. Their mothers seldom give them permission to go to a party
in the evening, and never without chaperonage. Their fathers
consider it a point of honor that their daughters shall not be alone on
the streets after dark. The daughter of the humblest Italian receives
this care because her parents are but carrying out social traditions. A
group of colored girls, on the other hand, are quite without this
protection. If they yield more easily to the temptations of a city than
any other girls, who shall say how far the lack of social restraint is
responsible for their downfall? The Italian parents represent the
social traditions which have been worked out during centuries of
civilization, and which often become a deterrent to progress through
the very bigotry with which they cling to them; nevertheless, it is
largely through these customs and manners that new groups are
assimilated into civilization.
Added to this is the fact that a decent colored family, if it is also
poor, often finds it difficult to rent a house save one that is
undesirable, because situated near a red-light district, and the family
in the community least equipped with social tradition is forced to
expose its daughters to the most flagrantly immoral conditions the
community permits. This is but one of the many examples of the
harmful effects of race segregation which might be instanced.
Another result of race antagonism is the readiness to irritation
which in time characterizes the intercourse of the two races. We
stupidly force one race to demand as a right from the other those
things which should be accorded as a courtesy, and every meeting
between representatives of the two races is easily characterized by
insolence and arrogance. To the friction of city life, and the
complications of modern intercourse, is added this primitive race
animosity which should long since have been outgrown. When the
white people in a city are tacitly leagued against the colored people
within its borders, the result is sure to be disastrous, but there are
still graver dangers in permitting the primitive instinct to survive
and to become self-assertive. When race antagonism manifests itself
through lynching, it defiantly insists that it is superior to all those
laws which have been gradually evolved during thousands of years,
and which form at once the record and the instrument of civilization.
The fact that this race antagonism enables the men acting under its
impulsion to justify themselves in their lawlessness constitutes the
great danger of the situation. The men claim that they are executing
a primitive retribution which precedes all law, and in this belief they
put themselves in a position where they cannot be reasoned with,
although this dangerous manifestation must constantly be reckoned
with as a deterrent to progress and a menace to orderly living.
Moreover, this race antagonism is very close to the one thing in
human relations which is uglier than itself, namely, sex antagonism,
and in every defense made in its behalf an appeal to the latter
antagonism is closely interwoven. Many men in every community
justify violence when it is committed under the impulsion of these
two antagonisms, and others carelessly assert that great laws of
human intercourse, first and foremost founded upon justice and
right relations between man and man, should thus be disregarded
and destroyed.
If the National Association for the Advancement of Colored People
will soberly take up every flagrant case of lawbreaking, and if it allow
no withdrawal of constitutional rights to pass unchallenged, it will
perform a most useful service to America and for the advancement of
all its citizens. Many other opportunities may be open in time to such
an association, but is not this its first and most obvious obligation?
THE TEACHER.

By Leslie Pinckney Hill.

Lord, who am I to teach the way
To little children day by day,
So prone myself to go astray?

I teach them Knowledge, but I know
How faint they flicker, and how low
The candles of my knowledge glow.

I teach them Power to will and do,
But only now to learn anew
My own great weakness through and through.

I teach them Love for all mankind
And all God’s creatures; but I find
My love comes lagging still behind.

Lord, if their guide I still must be,
O let the little children see
The teacher leaning hard on thee!

Employment of Colored Women in Chicago

From a Study Made by the Chicago School of Civics and Philanthropy

In considering the field of employment for colored women, the
professional women must be discussed separately. They admit fewer
difficulties and put a brave face on the matter, but in any case their
present position was gained only after a long struggle. The education
is the less difficult part. The great effort is to get the work after
having prepared themselves for it. The Negro woman, like her white
sister, is constantly forced to choose between a lower wage or no
work. The pity is that her own people do not know the colored girl
needs their help nor realize how much they could do for her.
Two of the musicians found the struggle too hard and were
compelled to leave Chicago. One girl of twenty-three, a graduate of
the Chicago Conservatory of Music, is playing in a low concert hall in
one of the worst sections of the city, from 8 in the evening till 4 in the
morning. Her wages are $18 a week, and with this she supports a
father and mother and younger sister.
There are from fifteen to twenty colored teachers in the public
schools of Chicago. This information was obtained from the office of
the superintendent, where it was said no record was kept of the
number of colored teachers. When they are given such a place they
are always warned that they are likely to have difficulty. As a rule
they are in schools where the majority of the children are Negroes.
They are all in the grades, but they say that their opportunities for
promotion are equal to those of white women. Indeed, they say that
there are two places where they are not discriminated against
because of their color. One is in the public schools, the other is under
the Civil Service Commission.
However, the professional women do not have the greatest
difficulty. The real barriers are met by the women who have had only
an average education—girls who have finished high school, or
perhaps only the eighth grade. These girls, if they were white, would
find employment at clerical and office work in Chicago’s department
stores, mail order houses and wholesale stores. But these positions
are absolutely closed to the Negro girl. She has no choice but
housework.
When the object of the inquiry was explained to one woman she
said: “Why, no one wants a Negro to work for him. I’ll show you—
look in the newspaper.” And she produced a paper with its columns
of advertisements for help wanted. “See, not one person in this whole
city has asked for a Negro to work for him.”
A great many of the colored women find what they call “day work”
most satisfactory. This means from eight to nine hours a day at some
kind of housework, cleaning, washing, ironing or dusting. This the
Negro women prefer to regular positions as maids, because it allows
them to go at night to their families. The majority of the women who
do this work receive $1.50, with 10 cents extra for carfare. There was
a higher grade of day work for which the pay was $2 a day besides
the carfare. This included the packing of trunks, washing of fine linen
and lace curtains, and even some mending.
The records of the South Side Free Employment Agency showed
that the wages of colored women were uniformly lower than those of
white women. Of course, there is no way of judging of ability by
records, but where the white cooks received $8 per week the Negro
cooks were paid $7, and where the white maids received $6,
sometimes, but not as frequently as in the case of the cooks, the
Negro maid received less. One dollar and a half was paid for “day
work.” At the colored employment agency which is run in connection
with the Frederic Douglas Centre they have many more requests for
maids than they have girls to fill the places. Good places with high
wages are sometimes offered, but the girls are more and more
demanding “day work” and refusing to work by the week. At the
South Side Free Employment Agency during the months of January,
February and March of this year forty-two positions for colored
women were found.[2] These forty-two positions were filled by thirty-
six women, some of them coming back to the office two or three
times during the three months. The superintendent said it was
difficult to find places for the colored women who applied, and they
probably succeeded in placing only about 25 per cent. of them. In the
opinion of those finding the work for the girls in this office, the
reason for the difficulties they encounter is the fact that they do not
remain long in one place and have a general reputation for
dishonesty. The fundamental cause of the discrimination by
employers against them is racial prejudice either in the employer
himself or in his customers.
2. 454 white women in the same time.
One girl who has only a trace of colored blood was able to secure a
position as salesgirl in a store. After she had been there a long time
she asked for an increase in wages, such as had been allowed the
white girls, but the request was refused and she was told that she
ought to be thankful that they kept her at all.
In many cases, especially when the women were living alone, the
earnings, plus the income from the lodgers, barely covered the rent.
When they work by the day they rarely work more than four days a
week. Sometimes the amount they gave as their weekly wage fell
short of even paying the rent, but more often the rent was covered
and a very small margin left to live on.
Such treatment has discouraged the Negro woman. She has
accepted the conditions and seldom makes any real effort to get into
other sorts of work. The twelfth question on the schedule, “What
attempts have you made to secure other kinds of work in Chicago or
elsewhere?” was usually answered by a question: “What’s the use of
trying to get work when you know you can’t get it?”
The colored women are like white women in the same grade of life.
They do not realize the need of careful training, and they do not
appreciate the advantages of specialization in their work. But the
Negro woman is especially handicapped, for she not only lacks
training but must overcome the prejudice against her color. Of the
270 women interviewed, 43 per cent. were doing some form of
housework for wages, yet all evidence of conscious training was
entirely lacking. This need must be brought home to them before
they can expect any real advancement.
A peculiar problem presents itself in connection with the
housework. Practically this is the only occupation open to Negro
women, and it is also the only occupation where one is not expected
to go home at night. This the Negroes insist on doing. They are
accused of having no family feeling, yet the fact remains that they
will accept a lower wage and live under far less advantageous
conditions for the sake of being free at night. That is why the “day
work” is so popular. Rather than live in some other person’s home
and get good wages for continued service, the colored woman prefers
to live in this way. She will have a tiny room, go out as many days a
week as she can get places, and pay for her room and part of her
board out of her earnings, which sometimes amount to only $3 or
$4.50 per week.
Occasionally laundry, sewing or hair work is done in their homes,
but the day work is almost universally preferred.
Many of the Negroes are so nearly white that they can be mistaken
for white girls, in which case they are able to secure very good
positions and keep them as long as their color is not known.
One girl worked for a fellowship at the Art Institute. Her work was
good and the place was promised her. In making out the papers she
said Negro, when asked her nationality, to the great astonishment of
the man in charge. He said he would have to look into the matter, but
the girl did not get the fellowship.
A young man, son of a colored minister in the city, had a position
in a business man’s office, kept the books, collected rents, etc. He
had a peculiar name, and one of the tenants remembered it in
connection with the boy’s father, who had all the physical
characteristics of the Negro. The tenant made inquiries and reported
the matter to the landlord, threatening to leave the building if he had
to pay rent to a Negro. The boy was discharged.
A colored girl, who was very light colored, said that more than
once she secured a place and the colored people themselves had told
the employer he had a “Negro” working for him. The woman with
whom she was living said: “It’s true every time. The Negroes are their
own worst enemies.”
To summarize, the isolation which is forced upon the Negro, both
in his social and his business life, constitutes one of the principal
difficulties which he encounters. As far as the colored woman is
concerned, as we have shown, the principal occupations which are
open to her are domestic service and school teaching. This leaves a
large number of women whose education has given them ambitions
beyond housework, who are not fitted to compete with northern
teachers and yet cannot obtain clerical work because they are
Negroes. Certain fields in which there is apparently an opportunity
for the colored women are little tried. For example, sewing is
profitable and there is little feeling against the employment of Negro
seamstresses, and yet few follow the dressmaking profession.
Without doubt one fundamental reason for the difficulties the
colored woman meets in seeking employment is her lack of industrial
training. The white woman suffers from this also, but the colored
woman doubly so. The most hopeful sign is the growing conviction
on the part of the leading Negro women of the city that there is need
of co-operation between them and the uneducated and unskilled,
and that they are trying to find some practical means to give to these
women the much-needed training for industrial life.
THE BURDEN

If blood be the price of liberty,
If blood be the price of liberty,
If blood be the price of liberty,
Lord God, we have paid in full.

COLORED MEN LYNCHED WITHOUT TRIAL.

1885 78
1886 71
1887 80
1888 95
1889 95
1890 90
1891 121
1892 155
1893 154
1894 134
1895 112
1896 80
1897 122
1898 102
1899 84
1900 107
1901 107
1902 86
1903 86
1904 83
1905[3] 60
1906[3] 60
1907[3] 60
1908[3] 80
1909 73
1910[4] 50

Total 2,425

3. Estimated.
4. Estimated to date.

The policy usually carried out consistently in the South of refusing
the colored woman the courtesies accorded the white woman leads to
some unfortunate results. One of the courtesies refused is the title of
“Mrs.” or “Miss.” Thus a Kentucky newspaper, in its recent
educational news, notes “the resignation of Mrs. Mattie Spring Barr
as a substitute teacher and the unanimous election of Miss Ella
Williams,” while a few paragraphs below it says that “Principal
Russell of the Russell Negro School asked that Lizzie Brooks be
promoted to the position of regular teacher and that Annie B. Jones
be made a substitute.” Here we have two groups of women doing
similar important work for the community, yet the group with Negro
blood is denied the formality of address that every other section of
the country gives to the teacher in a public school.

How much the South loses by this policy is shown by a
Northerner’s experience in the office of one of the philanthropic
societies. Two probation officers, volunteers, were going out to their
work. The secretary of the society addressed the first, a white
woman, as Mrs. Brown: the second, a little middle-aged black
woman, he called Mary. When they had left the Northerner
questioned the difference in address. “I couldn’t call a nigger Mrs. or
Miss.,” the secretary expostulated. “It would be impossible.” “But
here in your own city,” the Northerner answered, “I happen to know
a number of educated colored girls, some of them college graduates,
who have a desire for social service. Under you they might learn the
best methods of charitable work, but they would not care to be called
‘Annie’ or ‘Jane’ by a young white man. Can you afford to lose such
helpers as these?”
He had but one answer. “It would never do for me to say Mrs. or
Miss to a nigger. It would be impossible.”

Throughout the country there are a number of colored postal
clerks. These men, with others in the service, belong to the Mutual
Benefit Association. At a recent meeting in Chicago the delegates to
this Mutual Benefit Association voted in the future to admit only
clerks of the Caucasian race. According to one of the colored clerks,
the meeting that passed this vote was “packed,” colored members
receiving no notice to elect delegates to it.

The black servant is more acceptable to some people than the
educated colored man. An instance of this is Mr. U. of
Washington, a highly cultivated Negro of some means. As owner of a
cottage and a few acres of land in a small Virginia town, he good-
naturedly allowed an old black woman to live rent-free upon his
place. On her death he went down to claim the property, but found
the woman’s daughter had put in a counter claim. The case went to
court, and in his plea before the jury the woman’s lawyer said: “Are
you going to take property away from this black mammy’s daughter
to give it to a smart nigger from Washington?” And despite Mr. U.’s
former yearly payment of taxes, despite the deed which he himself
held, he lost his suit.

In Wilcox County, Ala., there are 10,758 Negro children and 2,000
white children of school age, making a total population of 12,758.
The per capita allowance for each child in the State this year is $2.56.
According to the recent apportionment of the school fund on this
basis, Wilcox County receives $32,660.48. Of this amount the 10,758
Negro children have been allowed one-fifth, or $6,532.09, about 60
cents each, while the 2,000 white children receive the remaining
four-fifths, or $26,128.39, about $13 each. Further investigation
shows that only 1,000 of the 10,758 Negro children are in school,
leaving 6,758 with absolutely no provision for obtaining even a
common school education.
TALKS ABOUT WOMEN

NUMBER TWO By Mrs. JOHN E. MILHOLLAND

A most interesting and instructive morning was given at the
Berkeley Lyceum on Wednesday, December 7, under the auspices of
the National Association for the Advancement of Colored People. The
weather was bitterly cold, and although the great fall of snow during
the night had made getting about a not only troublesome but
dangerous experience, the little theatre was well filled with many
well-known women in sympathy with the cause and interested in all
sorts of activities for progress. The chairman, Mr. John Haynes
Holmes, introduced Madame Hackley, who gave the musical part of
the program, and did it well with her customary finish and
appreciation.
Madam Hackley was trained in Paris, and gave several French
selections with great skill.
Mrs. Mary Church Terrell, the speaker, was most enthusiastically
received, and made, as usual, a most effective and touching speech,
to which her audience listened with not only interest but surprise, as
many present had never dreamed of the struggles of these women in
their efforts for educational advancement.
Without doubt Mrs. Terrell is one of the best orators we have to-
day. She has much dignity, with a very easy and fluent mode of
speaking. She is direct, and when, as on this occasion, she is talking
for the women of her race, her enthusiasm and sincerity carry
conviction to her listeners. When she had finished Mrs. Terrell was
warmly congratulated, and the enthusiastic daughter of the great
orator, Robert Ingersoll, Mrs. Walston R. Brown, declared that “Mrs.
