Applied Data Science Using PySpark: Learn the End-to-End Predictive Model Building Cycle, 1st Edition, by Ramcharan Kakarla
Ramcharan Kakarla, Sundar Krishnan and Sridhar Alla
Sundar Krishnan
Philadelphia, PA, USA
Sridhar Alla
New Jersey, NJ, USA
Ramcharan Kakarla
I wish to thank my wife, Aishwarya, who has supported me throughout the whole process. From reading the drafts of each chapter and sharing her views, to pulling all-nighters with me during the book deadlines, I could not have done this without you. Thank you so much, my love.
To my family – Bagavathi Krishnan (Dad), Ganapathy (Mom), Loganathan (father-in-law), Manjula (mother-in-law), Venkateshwaran (brother), Manikanda Prabhu (brother), and Ajey Dhaya Sankar (brother-in-law) – who are as excited about this book as I am, for their motivation and support every day. I will definitely miss the status updates I gave them on calls about this book.
To my data science guru – Dr. Goutam Chakraborty (Dr. C). I still watch his videos to refresh my basics, and I built my experience on his teachings. For his passion and dedication, I express my gratitude forever.
To my co-author and colleague Ram. We have always shared the same passion, and we bring out the best in each other. For initiating the idea for the book and working alongside me to complete it, thank you.
To Sridhar Alla, for being a mentor in my professional life. For sharing your ideas and supporting our thoughts, thank you.
To Alessio Tamburro, for his technical review of and feedback on the book's contents.
To Dr. C, Marina Johnson, and Futoshi Yumoto for their forewords for this book.
To the publication coordinators, Aditee Mirashi and Shrikanth Vishwakarma, and the development editor, James Markham, who assisted us from the start and made this task as easy as possible. To the entire publication team who worked on this book, thank you.
Finally, to all the readers who expressed interest in reading this book. We hope it inspires you to take your data science passion seriously and build awesome things.
Have fun!
With love,
Sundar Krishnan
I would like to thank my wonderful loving wife, Rosie Sarkaria and my
beautiful loving daughters Evelyn & Madelyn for all their love and
patience during the many months I spent writing this book. I would
also like to thank my parents Ravi and Lakshmi Alla for their blessings
and all the support and encouragement they continue to bestow upon
me.
Ram and Sundar are simply amazing data scientists who have worked hard at learning and sharing their knowledge with others. I was lucky to have worked with them and seen them in action. I wish them both the very best in all their future endeavors, and frankly, all credit for this book goes to the two of them.
I hope all readers gain great insights from this book and find it a place among their personal favorites.
Sridhar Alla
Table of Contents
Chapter 1: Setting Up the PySpark Environment
Local Installation using Anaconda
Step 1: Install Anaconda
Step 2: Conda Environment Creation
Step 3: Download and Unpack Apache Spark
Step 4: Install Java 8 or Later
Step 5: Mac & Linux Users
Step 6: Windows Users
Step 7: Run PySpark
Step 8: Jupyter Notebook Extension
Docker-based Installation
Why Do We Need to Use Docker?
What Is Docker?
Create a Simple Docker Image
Download PySpark Docker
Step-by-Step Approach to Understanding the Docker PySpark Run Command
Databricks Community Edition
Create Databricks Account
Create a New Cluster
Create Notebooks
How Do You Import Data Files into the Databricks
Environment?
Basic Operations
Upload Data
Access Data
Calculate Pi
Summary
Chapter 2: PySpark Basics
PySpark Background
PySpark Resilient Distributed Datasets (RDDs) and
DataFrames
Data Manipulations
Reading Data from a File
Reading Data from Hive Table
Reading Metadata
Counting Records
Subset Columns and View a Glimpse of the Data
Missing Values
One-Way Frequencies
Sorting and Filtering One-Way Frequencies
Casting Variables
Descriptive Statistics
Unique/Distinct Values and Counts
Filtering
Creating New Columns
Deleting and Renaming Columns
Summary
Chapter 3: Utility Functions and Visualizations
Additional Data Manipulations
String Functions
Registering DataFrames
Window Functions
Other Useful Functions
Data Visualizations
Introduction to Machine Learning
Summary
Chapter 4: Variable Selection
Exploratory Data Analysis
Cardinality
Missing Values
Built-in Variable Selection Process: Without Target
Principal Component Analysis
Singular Value Decomposition
Built-in Variable Selection Process: With Target
ChiSq Selector
Model-based Feature Selection
Custom-built Variable Selection Process
Information Value Using Weight of Evidence
Custom Transformers
Voting-based Selection
Summary
Chapter 5: Supervised Learning Algorithms
Basics
Regression
Classification
Loss Functions
Optimizers
Gradient Descent
Stochastic/Mini-batch Gradient Descent
Momentum
AdaGrad (Adaptive Gradient) Optimizer
Root Mean Square Propagation (RMSprop) Optimizer
Adaptive Moment (Adam) Optimizer
Activation Functions
Linear Activation Function
Sigmoid Activation Function
Hyperbolic Tangent (TanH) Function
Rectified Linear Unit (ReLu) Function
Leaky ReLu or Parametric ReLu Function
Swish Activation Function
Softmax Function
Batch Normalization
Dropout
Supervised Machine Learning Algorithms
Linear Regression
Logistic Regression
Decision Trees
Random Forests
Gradient Boosting
Support Vector Machine (SVM)
Neural Networks
One-vs-Rest Classifier
Naïve Bayes Classifier
Regularization
Summary
Chapter 6: Model Evaluation
Model Complexity
Underfitting
Best Fitting
Overfitting
Bias and Variance
Model Validation
Train/Test Split
k-fold Cross-Validation
Leave-One-Out Cross-Validation
Leave-One-Group-Out Cross-Validation
Time-series Model Validation
Leakage
Target Leakage
Data Leakage
Issues with Leakage
Model Assessment
Continuous Target
Binary Target
Summary
Chapter 7: Unsupervised Learning and Recommendation Algorithms
Segmentation
Distance Measures
Types of Clustering
Latent Dirichlet Allocation (LDA)
LDA Implementation
Collaborative Filtering
Matrix Factorization
Summary
Chapter 8: Machine Learning Flow and Automated Pipelines
MLflow
MLflow Code Setup and Installation
MLflow User Interface Demonstration
Automated Machine Learning Pipelines
Pipeline Requirements and Framework
Data Manipulations
Feature Selection
Model Building
Metrics Calculation
Validation and Plot Generation
Model Selection
Score Code Creation
Collating Results
Framework
Pipeline Outputs
Summary
Chapter 9: Deploying Machine Learning Models
Starter Code
Save Model Objects and Create Score Code
Model Objects
Score Code
Model Deployment Using HDFS Object and Pickle Files
Model Deployment Using Docker
requirements.txt file
Dockerfile
Changes Made in helper.py and run.py Files
Create Docker and Execute Score Code
Real-Time Scoring API
app.py File
Postman API
Test Real-Time Using Postman API
Build UI
The streamlitapi Directory
real_time_scoring Directory
Executing docker-compose.yml File
Real-time Scoring API
Summary
Appendix: Additional Resources
Hypothesis Testing
Chi-squared Test
Kolmogorov-Smirnov Test
Random Data Generation
Sampling
Simple Random Sampling (SRS)
Stratified Sampling
Difference Between Coalesce and Repartition
Switching Between Python and PySpark
Curious Character of Nulls
Common Function Conflicts
Join Conditions
User-defined Functions (UDFs)
Handle the Skewness
Using Cache
Persist/Unpersist
Shuffle Partitions
Use Optimal Formats
Data Serialization
Accomplishments
Index
About the Authors
Ramcharan Kakarla
is currently a Lead Data Scientist at Comcast, residing in Philadelphia. He is a passionate data science and artificial intelligence advocate with 6+ years of experience. He graduated as an outstanding student with a master's degree from Oklahoma State University, specializing in data mining. Prior to OSU, he received his bachelor's in Electrical and Electronics Engineering from Sastra University in India.
He was born and raised in the coastal town of Kakinada, India. He started his career working as a performance engineer with several Fortune 500 clients, including State Farm and British Airways. In his current role, he focuses on building data science solutions and frameworks leveraging big data. He has published several award-winning papers and posters in the field of predictive analytics. He served as a SAS Global Ambassador for 2015.
www.linkedin.com/in/ramcharankakarla
Sundar Krishnan
is passionate about artificial intelligence and data science, with more than 5 years of industry experience. He has considerable experience in building and deploying customer analytics models and designing machine learning workflow automation. Currently, he is a Lead Data Scientist at Comcast.
Sundar was born and raised in Tamil Nadu, India, and holds a bachelor's degree from Government College of Technology, Coimbatore. He completed his master's at Oklahoma State University, Stillwater. In his spare time, he blogs about his data science work on Medium.
www.linkedin.com/in/sundarkrishnan1
Sridhar Alla
is founder and CTO of Sas2Py (www.sas2py.com), which focuses on automatic conversion of SAS code to Python and on integration with cloud platform services like AWS, Azure, and Google Cloud. His company
Bluewhale.one also focuses on using AI
to solve key problems, ranging from
intelligent email conversation tracking,
to solving issues impacting the retail
industry, and more. He has deep
expertise in building AI-driven big data
analytical practices on both public cloud and in-house infrastructures.
He is a published author of books and an avid presenter at numerous
Strata, Hadoop World, Spark Summit, and other conferences. He also
has several patents filed with the US PTO on large-scale computing and
distributed systems. He has extensive hands-on experience in most of
the prevalent technologies, including Spark, Flink, Hadoop, AWS, Azure, TensorFlow, and others. He lives with his wife Rosie and daughters
Evelyn and Madelyn in New Jersey, USA, and in his spare time loves to
spend time training, coaching, and attending meetups. He can be
reached at sid@bluewhale.one.
About the Technical Reviewer
Alessio Tamburro
currently works as a Principal Data Scientist in the Enterprise Business
Intelligence Innovation Team at
Comcast/NBC Universal. Throughout his
career in data science research focused
teams at different companies, Alessio has
gained expertise in identifying,
designing, communicating and delivering
innovative prototype solutions based on
different data sources and meeting the
needs of diverse end users spanning
from scientists to business stakeholders.
His approach is based on thoughtful
experimentation and has its roots in his
extensive academic research
background. Alessio holds a PhD in
particle astrophysics from the University of Karlsruhe in Germany and a
Master’s degree in Physics from the University of Bari in Italy.
© Ramcharan Kakarla, Sundar Krishnan and Sridhar Alla 2021
R. Kakarla et al., Applied Data Science Using PySpark
https://doi.org/10.1007/978-1-4842-6500-0_1
The goal of this chapter is to quickly get you set up with the PySpark environment. There are
multiple options discussed, so it is up to the reader to pick their favorite. Folks who already have the
environment ready can skip to the “Basic Operations” section later in this chapter.
In this chapter, we will cover the following topics:
Local installation using Anaconda
Docker-based installation
Databricks community edition
Note: Windows users should use the Anaconda Prompt, not the Command Prompt. This option becomes available after you install Anaconda.
conda info
For more information on conda operations, you can search the web for "conda cheat sheet" or follow the documentation provided here: https://docs.conda.io/.
Before we create the environment, we are going to take a look at another conda command; we promise it will make sense later why we need to look at it now. Run the following to list the existing environments:
conda env list
Once you run this command, you get the following output.
Environment list before creation
# conda environments:
#
base * /Applications/anaconda3
The base environment exists by default in Anaconda. Currently, base is the active environment, as indicated by the small asterisk (*) next to it. What does this mean? It means that whatever conda operations we perform now will be carried out in the default base environment: any package install, update, or removal will happen in base.
Now that we have a clear understanding of the base environment, let us go ahead and create our own PySpark environment with conda create --name ENVNAME, replacing ENVNAME with the name (pyspark_env) that we would like to give the environment:
conda create --name pyspark_env
When it prompts for user input, type y and press Enter. It is good practice to type these commands rather than copying them from the text, since a copied hyphen can cause an error such as "CondaValueError: The target prefix is the base prefix. Aborting" or "conda: error: unrecognized arguments: -–name". After the command executes successfully, we perform the conda environment listing again. This time, we have a slightly different output.
Conda environment right after creation
# conda environments:
#
base * /Applications/anaconda3
pyspark_env /Users/ramcharankakarla/.conda/envs/pyspark_env
We notice that a new environment, pyspark_env, has been created. Still, base is the active environment. Let us activate our new environment pyspark_env using the following command:
conda activate pyspark_env
Conda environment after change
# conda environments:
#
base                     /Applications/anaconda3
pyspark_env           *  /Users/ramcharankakarla/.conda/envs/pyspark_env
Going forward, all the conda operations will be performed in the new PySpark environment. Observe that the "*" now marks pyspark_env as the current environment. Next, we will proceed with the Spark installation.
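The download-and-unpack step that follows can be sketched as a few shell commands. The archive URL and the Hadoop build suffix here are assumptions (Spark 3.0.1 built for Hadoop 2.7); only the /opt/spark-3.0.1 destination matches the profile settings used later in this chapter.

```shell
# Sketch of downloading and unpacking Apache Spark. The mirror URL and
# Hadoop build are assumptions; adjust them to the release you download.
SPARK_VERSION=3.0.1
TARBALL="spark-${SPARK_VERSION}-bin-hadoop2.7.tgz"
URL="https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${TARBALL}"
echo "$URL"
# curl -O "$URL"                                       # download the tarball
# tar -xzf "$TARBALL"                                  # unpack it
# sudo mv "spark-${SPARK_VERSION}-bin-hadoop2.7" /opt/spark-${SPARK_VERSION}
```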
Windows Users
java -version
java version "14.0.2" 2020-07-14
Java(TM) SE Runtime Environment (build 14.0.2+12-46)
Java HotSpot(TM) 64-Bit Server VM (build 14.0.2+12-46, mixed mode,
sharing)
You need to make sure that the Java version is 8 or later. You can always download the latest version
at https://www.oracle.com/java/technologies/javase-downloads.html. It is
recommended to use Java 8 JDK. When you use a Java version later than 8 you might get the
following warnings when you launch PySpark. These warnings can be ignored.
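If you want to script this version gate, a minimal sketch is shown below. The banner string is the sample output printed above; a real check would capture `java -version 2>&1` instead. Note that Java 8 and earlier report versions like "1.8.0_281", so for those releases the leading "1." would need to be stripped first.

```shell
# Parse the major version out of a `java -version` banner.
# The sample banner below is the output shown earlier in this section.
banner='java version "14.0.2" 2020-07-14'
major=$(echo "$banner" | sed -n 's/.*"\([0-9][0-9]*\)\..*/\1/p')
if [ "$major" -ge 8 ]; then
  echo "Java $major detected: OK"
else
  echo "Java $major detected: install Java 8 or later"
fi
```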
Mac/Linux users can now proceed to Step 5, and Windows users can jump to Step 6 from here.
vi ~/.bash_profile
Once you are inside the file, type i to insert/configure the environment variables by adding the
following lines:
export SPARK_HOME=/opt/spark-3.0.1
export PATH=$SPARK_HOME/bin:$PATH
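Once the file is saved (press Esc, then type :wq to write and quit vi), reload the profile with `source ~/.bash_profile`. The two variables can then be sanity-checked as sketched below, assuming the same /opt/spark-3.0.1 path as above:

```shell
# Re-create the two profile lines and verify them.
export SPARK_HOME=/opt/spark-3.0.1
export PATH=$SPARK_HOME/bin:$PATH

echo "$SPARK_HOME"                      # -> /opt/spark-3.0.1
case ":$PATH:" in
  *":$SPARK_HOME/bin:"*) echo "Spark bin is on PATH" ;;
  *)                     echo "Spark bin is missing from PATH" ;;
esac
```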