Preface

Today, we live in a world of connected things where so much data is generated that it is humanly
impossible to analyze it all and make decisions. Thanks to the field of data science, human
decisions are increasingly being replaced by decisions made by computers.
Data science has penetrated deeply into our connected world, and there is a growing demand
in the market for people who not only understand data science algorithms thoroughly but
can also program them. Data science lies at the intersection of many fields, including
data mining, machine learning, and statistics, to name a few. This puts an immense burden
on data scientists at all levels, from those aspiring to enter the field to those who
already practice in it. Treating these algorithms as black boxes and using them in
decision-making systems leads to counterproductive results. With so many algorithms and
innumerable problems out there, a good grasp of the underlying algorithms is required to
choose the best one for any given problem.

Python has evolved as a programming language over the years, and today it is the number one
choice for a data scientist. Its ability to act as a scripting language for quick prototyping,
its sophisticated language constructs for full-fledged software development, and its
extensive library support for numeric computation have led to its current popularity among
data scientists and the general scientific programming community. Python is also popular
among web developers, thanks to frameworks such as Django and Flask.

This book has been written to cater to the needs of a diverse range of data scientists, from
novices to experienced practitioners, through recipes that touch upon different aspects of
data science, including data exploration, data analysis and mining, machine learning, and
large-scale machine learning. Each chapter is built from recipes exploring these aspects.
Sufficient math is provided for readers to understand the functioning of the algorithms in
depth. Wherever necessary, references are provided for curious readers. The recipes are
written to be easy to follow and understand.

Copyright © 2015. Packt Publishing, Limited. All rights reserved.

Gopi, Subramanian. Python Data Science Cookbook, Packt Publishing, Limited, 2015. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/manchester/detail.action?docID=4191189.
Created from manchester on 2020-07-07 02:47:14.

This book brings the art of data science together with the power of Python programming and
helps readers master the concepts of data science. Knowledge of Python is not mandatory to
follow this book. Non-Python programmers can refer to the first chapter, which introduces
Python data structures and functional programming concepts.

The early chapters cover the basics of data science, and the later chapters are dedicated
to advanced data science algorithms. State-of-the-art algorithms currently used in practice
by leading data scientists across industries, including ensemble methods, random forest,
and regression with regularization, are covered in detail. Some algorithms that are popular
in academia but not yet widely introduced to the mainstream, such as rotational forest, are
also covered in detail.

With many do-it-yourself books on data science on the market today, we feel that there is a
gap in covering the right mix of the mathematical ideas behind data science algorithms and
their implementation details. This book is an attempt to fill this gap. Each recipe provides
just enough mathematical introduction to understand how the algorithm works; we believe
that readers can then take full advantage of these methods in their applications.

A word of caution, though: these recipes are written with the objective of explaining
data science algorithms to the reader. They have not been stress-tested under extreme
conditions and are not production ready. Production-ready data science code has to go
through a rigorous engineering pipeline.

This book can be used both as a guide to learning data science methods and as a quick reference.
It is a self-contained book that introduces data science to a new reader with little programming
background and helps them become an expert in this trade.

What this book covers


Chapter 1, Python for Data Science, introduces Python's built-in data structures and functions,
which are very handy for data science programming.

Chapter 2, Python Environments, introduces Python's scientific programming and plotting
libraries, including NumPy, matplotlib, and scikit-learn.

Chapter 3, Data Analysis – Explore and wrangle, covers data preprocessing and
transformation routines used in exploratory data analysis, so that data science algorithms
can be built efficiently.

Chapter 4, Data Analysis – Deep Dive, introduces the concept of dimensionality reduction
in order to tackle the curse of dimensionality in data science. Starting with simple
methods and moving on to advanced state-of-the-art techniques, dimensionality reduction
is discussed in detail.

Chapter 5, Data Mining – Needle in a haystack, discusses unsupervised data mining
techniques, starting with elaborate discussions of distance methods and kernel methods and
following up with clustering and outlier detection techniques.

Chapter 6, Machine Learning 1, covers supervised data mining techniques, including
nearest neighbors, Naïve Bayes, and classification trees. In the beginning, we will lay a
heavy emphasis on data preparation for supervised learning.

Chapter 7, Machine Learning 2, introduces regression problems and follows up with
topics on regularization, including LASSO and ridge. Finally, we will discuss cross-validation
techniques as a way to choose hyperparameters for these methods.

Chapter 8, Ensemble Methods, introduces various ensemble techniques, including bagging,
boosting, and gradient boosting. This chapter shows you how to apply a powerful
state-of-the-art method in data science where, instead of building a single model for a
given problem, an ensemble or a bag of models is built.

Chapter 9, Growing Trees, introduces some more bagging methods based on tree-based
algorithms. Due to their robustness to noise and universal applicability to a variety of
problems, they are very popular among the data science community.

Chapter 10, Large scale machine learning – Online Learning, covers large-scale machine
learning and algorithms suited to tackling such problems. This includes algorithms
that work with streaming data and with data that cannot fit into memory completely.
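
To make the idea of working with streaming data concrete, here is a minimal sketch of an incremental computation that holds only one value in memory at a time. It is not taken from the book's recipes; the names running_mean and readings are hypothetical and chosen only for illustration.

```python
def running_mean(stream):
    """Return the mean of an iterable, consuming one value at a time.

    Returns 0.0 for an empty stream.
    """
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: move the current mean toward the new value.
        mean += (value - mean) / count
    return mean


def readings():
    """Hypothetical data source that yields values one by one."""
    for value in [2.0, 4.0, 6.0, 8.0]:
        yield value


print(running_mean(readings()))
```

Because the function only ever stores the current count and mean, the same code would work on a source far larger than available memory, such as a file read line by line.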

What you need for this book


All the recipes in this book were developed and tested on a machine with 8 GB of RAM and an
Intel i7 CPU running 64-bit Windows 7.

Python 2.7.5, NumPy 1.8.0, SciPy 0.13.2, Matplotlib 1.3.1, NLTK 3.0.2, and scikit-learn 0.15.2
were used to develop the recipes.

The same code should work on Linux variants and Macs with the appropriate libraries
mentioned here installed. Alternatively, you can create a Python virtual environment with
these library versions and run all the recipes in it.
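
As a rough aid for checking an environment against the versions listed above, the following sketch compares installed library versions with the expected ones. It is an illustration rather than part of the book's code: the helper names installed_version and compare_versions are hypothetical, the import names (for example sklearn for scikit-learn) are assumptions of this sketch, and it relies on each library exposing a __version__ attribute.

```python
import importlib

# Versions listed in the text; keys are the assumed import names.
EXPECTED = {
    "numpy": "1.8.0",
    "scipy": "0.13.2",
    "matplotlib": "1.3.1",
    "nltk": "3.0.2",
    "sklearn": "0.15.2",
}


def installed_version(module_name):
    """Return the module's __version__ string, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def compare_versions(expected, lookup=installed_version):
    """Map each module name to a tuple (expected, found, matches)."""
    report = {}
    for name, wanted in expected.items():
        found = lookup(name)
        report[name] = (wanted, found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in sorted(compare_versions(EXPECTED).items()):
        status = "ok" if ok else "differs"
        print("%-12s expected %-8s found %-10s %s" % (name, wanted, found, status))
```

A version that differs does not necessarily mean a recipe will fail; the report is only a starting point when a recipe behaves differently from the text.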

Who this book is for


This book is intended for data science professionals at all levels, both students and
practitioners, from novices to experts. Different recipes in the chapters cater to the needs
of different audiences. Novice readers can spend some time getting acquainted with data
science in the first five chapters. Experts can turn to the later chapters to understand how
advanced techniques are implemented using Python. The book covers just enough mathematics
and provides the necessary references for computer programmers who wish to understand data
science. People from a non-Python background can also use this book effectively: the first
chapter introduces Python as a programming language for data science. It will be helpful if
you have some prior basic programming experience. The book is mostly self-contained; it
introduces data science to a new reader and can help them become an expert in this trade.

Sections
In this book, you will find several headings that appear frequently (Getting ready, How to do it,
How it works, There's more, and See also).

To give clear instructions on how to complete a recipe, we use these sections as follows:

Getting ready
This section tells you what to expect in the recipe, and describes how to set up any software or
any preliminary settings required for the recipe.

How to do it…
This section contains the steps required to follow the recipe.

How it works…
This section usually consists of a detailed explanation of what happened in the previous
section.

There's more…
This section consists of additional information about the recipe in order to make the reader
more knowledgeable about the recipe.

See also
This section provides helpful links to other useful information for the recipe.

Conventions
In this book, you will find a number of text styles that distinguish between different kinds of
information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, such as function names, are shown as follows:

We call the get_iris_data() function to get the input data. We leverage the function
train_test_split from scikit-learn's cross_validation module to split the input
dataset into two.

A block of code is set as follows:


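import numpy as np

# Shuffle the dataset: draw a random permutation of the row indices.
# (np.random.shuffle works in place and returns None, so it cannot be used as an index.)
shuff_index = np.random.permutation(len(y))
x_train = x[shuff_index, :]
y_train = np.ravel(y[shuff_index])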

Formulas are typically provided as images.

Typically, the math section is introduced at the beginning of each recipe. In some chapters, the
common math required for most of the recipes in that chapter is included in the introduction
section of the first recipe.

External URLs are specified as follows:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html

Specific call-outs on algorithm implementation details in a third-party library are provided
as follows:

'The predicted class of an input sample is computed as the class with the highest mean
predicted probability. If base estimators do not implement a predict_proba method, then it
resorts to voting.'

Wherever applicable, references to scientific journals and papers are provided as follows:

Please refer to the paper by Leo Breiman for more information about bagging.

Leo Breiman. 1996. Bagging predictors. Mach. Learn. 24, 2 (August 1996), 123-140.
DOI=10.1023/A:1018054314350 http://dx.doi.org/10.1023/A:1018054314350

Program output and graphs are typically provided as images.

Any command-line input or output is written as follows:


Counter({'Peter': 4, 'of': 4, 'Piper': 4, 'pickled': 4, 'picked': 4,
'peppers': 4, 'peck': 4, 'a': 2, 'A': 1, 'the': 1, 'Wheres': 1, 'If': 1})
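
For readers wondering how output of this kind arises, here is a small sketch using Counter from the standard library. The input sentence is an assumption: it is a version of the rhyme that happens to reproduce the counts shown above and is not quoted from the recipe. The sketch uses the function form of print, so it is not tied to the Python 2 syntax used elsewhere in this book.

```python
from collections import Counter

# Assumed input: this text reproduces the counts shown above,
# but it is a guess and is not quoted from the recipe.
sentence = (
    "Peter Piper picked a peck of pickled peppers "
    "A peck of pickled peppers Peter Piper picked "
    "If Peter Piper picked a peck of pickled peppers "
    "Wheres the peck of pickled peppers Peter Piper picked"
)

# Split on whitespace and count each distinct word (case-sensitive).
word_counts = Counter(sentence.split())
print(word_counts)
```

The order in which the entries are printed can differ between Python versions, so only the counts themselves should be compared with the output above.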

In places where we would like the reader to inspect some of the variables in Python shell,
we specify it as follows:
>>> print b_tuple[0]
1
>>> print b_tuple[-1]
c
>>>

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—
what you liked or disliked. Reader feedback is important for us as it helps us develop titles
that you will really get the most out of.

To send us general feedback, simply e-mail feedback@packtpub.com, and mention the
book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to
get the most from your purchase.

Downloading the example code


You can download the example code files from your account at http://www.packtpub.com
for all the Packt Publishing books you have purchased. If you purchased this book elsewhere,
you can visit http://www.packtpub.com/support and register to have the files e-mailed
directly to you.

Downloading the color images of this book


We also provide you with a PDF file that has color images of the screenshots/diagrams used
in this book. The color images will help you better understand the changes in the output.
You can download this file from
http://www.packtpub.com/sites/default/files/downloads/1234OT_ColorImages.pdf.

Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen.
If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be
grateful if you could report this to us. By doing so, you can save other readers from frustration
and help us improve subsequent versions of this book. If you find any errata, please report
them by visiting http://www.packtpub.com/submit-errata, selecting your book,
clicking on the Errata Submission Form link, and entering the details of your errata. Once your
errata are verified, your submission will be accepted and the errata will be uploaded to our
website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the
search field. The required information will appear under the Errata section.

Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come across
any illegal copies of our works in any form on the Internet, please provide us with the location
address or website name immediately so that we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions
If you have a problem with any aspect of this book, you can contact us at
questions@packtpub.com, and we will do our best to address the problem.