Setareh Rafatirad · Houman Homayoun
Zhiqian Chen · Sai Manoj Pudukotai Dinakarrao
Machine Learning
for Computer Scientists
and Data Analysts
From an Applied Perspective
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The recent popularity gained by the field of machine learning (ML) has led to its adoption in almost every known application domain, ranging from smart homes, smart grids, and forex markets to military applications and autonomous drones. A plethora of machine learning techniques has been introduced in the past few years, and each of these techniques fits a specific set of applications well rather than offering a one-size-fits-all solution.
In order to determine how best to apply ML to a given problem, it is essential to understand the current state of the art of existing ML techniques, their pros and cons, their behavior, and the applications that have already adopted them. This book is therefore aimed at researchers and practitioners who are familiar with their application requirements and are interested in applying ML techniques, not only to obtain better performance but also to ensure that the adopted ML technique is not overkill for the application at hand. We hope that this book will provide a structured introduction and relevant background to aspiring engineers who are new to the field, while also serving as a refresher for researchers already familiar with it. This introduction is then used to build up and introduce current and emerging ML paradigms and their applications in multiple case studies.
Organization This book is organized into three parts that consist of multiple chapters. The first part introduces the relevant background information pertaining to ML and the traditional learning approaches that are widely used.
• Chapter 1 introduces the concept of applied machine learning. It discusses the metrics used for evaluating machine learning performance, data pre-processing, and techniques to visualize and analyze the outputs (classification, regression, or other applications).
• Chapter 2 presents a brief review of the probability theory and linear algebra that
are essential for a better understanding of the ML techniques discussed in the
later parts of the book.
• “Online Learning”: Shuo Lei (Virginia Tech), Yifeng Gao (University of Texas
Rio Grande Valley), Xuchao Zhang (NEC Labs America)
• “Recommender Learning”: Shanshan Feng (Harbin Institute of Technology,
Shenzhen), Kaiqi Zhao (University of Auckland)
• “Graph Learning”: Liang Zhao (Emory University)
• “SensorNet: An Educational Neural Network Framework for Low-Power Multimodal Data Classification”: Tinoosh Mohsenin (University of Maryland Baltimore County), Arnab Mazumder (University of Maryland Baltimore County), Hasib-Al Rashid (University of Maryland Baltimore County)
• “Transfer Learning in Mobile Health”: Hassan Ghasemzadeh (Arizona State
University)
• “Applied Machine Learning for Computer Architecture Security”: Hossein
Sayadi (California State University, Long Beach)
• “Applied Machine Learning for Cloud Resource Management”: Hossein
Mohammadi Makrani (University of California Davis), Najme Nazari
(University of California Davis)
Kaiqi Zhao (University of Auckland), Shanshan Feng (Harbin Institute of Technol-
ogy, Shenzhen), Xuchao Zhang (NEC Labs America), Yifeng Gao (University of
Texas Rio Grande Valley), Shuo Lei (Virginia Tech), Zonghan Zhang (Mississippi
State University), and Qi Zhang (University of South Carolina).
Part I
Basics of Machine Learning
Chapter 1
What Is Applied Machine Learning?
1.1 Introduction
and durable models and performing hardware analysis on trained models (in terms
of hardware latency and area) are significant applied machine learning problems to
address in this sector [1, 7, 8].
The majority of this book discusses the difficulties and best practices associated
with constructing machine learning models, including understanding an applica-
tion’s properties and the underlying sample dataset.
What is a machine learning pipeline? How do you describe the goal of machine learning? What are the main steps in the machine learning pipeline? We will answer these questions through both formal definitions and practical examples. A machine learning pipeline is meant to automate the machine learning workflow in order to obtain actionable insights from big datasets. The goal of machine learning is to train an accurate model that solves an underlying problem. However, the term pipeline is somewhat misleading, as many of the steps involved in the machine learning workflow may be repeated iteratively to improve the accuracy of the model. The cyclical architecture of machine learning pipelines is demonstrated in Fig. 1.1.
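To make these steps concrete, the following is a minimal sketch of such a pipeline using scikit-learn's Pipeline class; the dataset, the preprocessing step, and the model chosen here are illustrative assumptions rather than a prescribed recipe.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# load a small example dataset and hold out a test set for evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# data preparation and model training chained into one pipeline object
pipe = Pipeline([
    ("scale", StandardScaler()),                    # data preparation step
    ("model", LogisticRegression(max_iter=200)),    # model training step
])
pipe.fit(X_train, y_train)

# evaluation; in practice the earlier steps are revisited until this is good enough
print("test accuracy:", pipe.score(X_test, y_test))

If the test accuracy is unsatisfactory, one loops back to the data preparation or model selection steps, which is exactly the cyclical behavior described above.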
Initially, the input (or collected) data is prepared before any analysis is performed. This step includes tasks such as data cleaning, data imputation, feature engineering, data scaling/standardization, and data sampling, which deal with issues such as noise, outliers, categorical variables that must be transformed, features that must be normalized or standardized, and imbalanced (or biased) datasets.
In the Exploratory Data Analysis (EDA) step, data is analyzed to understand its characteristics, such as whether it follows a normal or a skewed distribution (see Fig. 1.2). Skewness in the data affects a statistical model's performance, especially in the case of regression-based models. To prevent skewness from harming the results, it is common practice to apply a transformation over the whole set of values (such as a log transformation) and use the transformed data for the statistical model.
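As a small sketch of this practice, the snippet below draws a right-skewed sample with NumPy and applies a log transformation; the synthetic data and the use of log1p (which also handles zero values) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # synthetic right-skewed values

transformed = np.log1p(skewed)  # log transformation; log1p(x) = log(1 + x) handles zeros safely

# for skewed data the mean and median differ noticeably; after the transform they are much closer
print("skewed:      mean=%.2f median=%.2f" % (skewed.mean(), np.median(skewed)))
print("transformed: mean=%.2f median=%.2f" % (transformed.mean(), np.median(transformed)))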
Fig. 1.2 Comparison of different data distributions. In a right-skewed (positive) distribution, most data falls to the positive, or right, side of the peak. In a left-skewed (negative) distribution, most data falls to the negative, or left, side of the peak. (a) Right-skewed. (b) Normal distribution. (c) Left-skewed
We live in the age of big data where data lives in various sources and repositories
stored in different formats: structured and unstructured. Raw input data can contain
structured data (e.g., numeric information, date), unstructured data (e.g., image,
text), or a combination of both, which is called semi-structured data. Structured data
is quantitative data that can fit nicely into a relational database such as the dataset in
Table 1.2 where the information is stored in tabular form. Structured attributes can
be transformed into quantitative values that can be processed by a machine. Unlike
structured data, unstructured data needs to be further processed to extract structured
information from it; such information is referred to as data about data, or what we
call metadata in this book.
Table 1.1 demonstrates a dataset for a sample text corpus related to research
publications in the public domain. This is an example of a semi-structured dataset,
which includes both structured and unstructured attributes: the attributes of year and
Table 1.2 Structured malware dataset, obtained from virusshare and virustotal, covering 5 different classes of malware
Bus-cycles Branch-instructions Cache-references Node-loads Node-stores Cache-misses Branch-loads LLC-loads L1-dcache-stores Class
11463 37940 8057 1104 111 2419 37190 2360 38598 Backdoor
1551 5055 1096 165 17 333 4916 330 5003 Backdoor
29560 126030 20008 1769 146 4098 108108 5987 99237 Backdoor
26211 117761 14783 1666 48 4182 117250 4788 91070 Backdoor
30139 123550 20744 1800 158 4238 124724 6969 115862 Backdoor
12989 30012 9076 1252 136 5412 27909 2000 27170 Benign
6546 12767 4953 548 87 3683 13157 864 12361 Benign
8532 31803 7087 699 124 3240 34722 1970 34974 Benign
14350 27451 9157 1843 178 6611 28507 2411 24908 Benign
13837 25436 12235 1296 192 7148 24747 2533 23757 Benign
1068674 8211420 168839 42612 28574 73696 6298568 64166 6202146 Rootkit
1054761 8187337 162526 41245 28389 71576 6688738 67408 6655480 Rootkit
1046053 8196952 158955 40525 28113 70250 6981991 69597 6950106 Rootkit
1038524 8124926 157896 40207 28214 69910 7134795 71132 7148734 Rootkit
1030773 8069156 158085 39603 28265 69356 7230800 72226 7294250 Rootkit
999182 29000000 455 64 5 94 29000000 289 14000000 Trojan
999189 29000000 457 65 5 95 29000000 288 14000000 Trojan
999260 29000000 457 65 6 96 29000000 287 14000000 Trojan
999265 29000000 459 67 6 98 29000000 287 14000000 Trojan
999277 29000000 459 67 6 98 29000000 288 14000000 Trojan
989865 9128084 2549 169 37 268 9404871 923 9614242 Virus
989984 9130539 2529 168 37 266 9402351 920 9611680 Virus
990117 9132992 2510 167 36 264 9400377 914 9609689 Virus
990233 9135227 2491 165 36 262 9397484 909 9606756 Virus
990366 9137694 2473 164 36 260 9395002 903 9604237 Virus
760836 7851079 165236 8891 4047 13803 10530146 38930 4454651 Worm
765750 7957382 161998 8717 3967 13533 10573140 38205 4453953 Worm
770445 8059123 158884 8549 3891 13273 10606358 37508 4450452 Worm
774993 8157690 155888 8388 3818 13022 10660033 36824 4454598 Worm
779347 8251754 153008 8237 3747 12785 10693711 36171 4452344 Worm
1.3 Knowing the Application and Data
ables including SEAT CLASS, GUESTS, FARE, and customer TITLE. Histograms
display a general distribution of a set of numeric values corresponding to a dataset
variable over a range.
Plots are a great means of understanding the data behind an application. Some example applications of such plots are described in Table 1.5. It is important to
note that every plot is deployed for a different purpose and applied to a particular
type of data. Therefore, it is crucial to understand the need for such techniques
used during the EDA step. Such graphical tools can help maximize insight, reveal
underlying structure, check for outliers, test assumptions, and discover optimal
factors.
As indicated in Table 1.5, several Python libraries offer very useful tools for plotting your data. Python is a general-purpose programming language with a very large user community, and its ecosystem of libraries is well suited to working with large datasets and to machine learning analysis. In this book, we focus on using the Python language for various machine learning tasks, hands-on examples, and exercises.
Before getting started with Python for applying machine learning techniques to a problem, you may want to find out which IDEs (Integrated Development Environments) and text editors are tailored for Python programming, or look at code samples that you may find helpful. An IDE is a program dedicated to software development. A Python IDE usually includes an editor to write and manage Python code, build, execution, and debugging tools, and some form of source control. Several Python programming environments exist, and the right choice depends on how experienced the programmer is and on the machine learning task at hand. For example, Jupyter Notebook is a very helpful environment for beginners who have just started with traditional machine learning or deep learning. Jupyter Notebook can be installed in a virtual environment using Anaconda-Navigator, which helps with creating virtual environments and installing the packages needed for data science and deep learning.
While Jupyter Notebook is more suitable for beginners, there are also machine learning frameworks such as TensorFlow that are mostly used for deep learning tasks. As such, depending on how advanced you are in Python programming, you may end up using a particular Python programming environment. In this book, we begin with Jupyter Notebook for the programming examples and hands-on exercises; as we move toward more advanced machine learning tasks, we switch to TensorFlow. You can download and install Anaconda-Navigator on your machine from the following link by selecting the Python 3.7 version: https://www.anaconda.com/distribution/.
Once it is installed, navigate to Jupyter Notebook and hit “Launch.” You will then
have to choose or create a workspace folder that you will use to store all your Python
programs. Navigate to your workspace directory and hit the “New” button to create a
new Python program and select Python 3. Use the following link to get familiar with
the environment: https://docs.anaconda.com/anaconda/user-guide/getting-started/.
In the remaining part of this chapter, you will learn how to conduct preliminary
machine learning tasks through multiple Python programming examples.
Table 1.5 Popular Python tools for understanding the data behind an application. https://github.com/dgrtwo/gleam

Plot type       Python library   Usage description
Line plot       Plotly           Trends in data
Scatter plot    Gleam            Multivariate data

(The Example column of Table 1.5 shows sample plots of the iris dataset, such as petal_length versus petal_width scatter plots and per-species plots; the plot images themselves are not reproduced here.)
import re
from nltk.util import ngrams

#input text
text = """tighter time analysis for real-time traffic in on-chip \
networks with shared priorities"""
print('input text: ' + text)

tokens = [item for item in text.split(" ") if item != ""]

output2 = list(ngrams(tokens, 2))  #2-grams
output4 = list(ngrams(tokens, 4))  #4-grams

allOutput = []
for bigram in output2:
    if bigram[0] != "for" and bigram[1] != "for" and bigram[0] != "in" and \
       bigram[1] != "in" and bigram[0] != "with" and bigram[1] != "with":
        allOutput.append(bigram)

print('\nall extracted bigrams are:')
print(allOutput)

allOutput = []

for quadgram in output4:
    if quadgram[0] != "for" and quadgram[1] != "for" and quadgram[2] != "for" \
       and quadgram[3] != "for" and quadgram[0] != "in" and quadgram[1] != "in" \
       and quadgram[0] != "with" and quadgram[1] != "with":
        allOutput.append(quadgram)

print('\nall extracted quadgrams are:')
print(allOutput)

> input text: tighter time analysis for real-time traffic in on-chip networks with shared priorities
>
> all extracted bigrams are:
> [('tighter', 'time'), ('time', 'analysis'), ('real-time', 'traffic'), ('on-chip', 'networks'), ('shared', 'priorities')]
1.6 Data Exploration
especially when we arrive at modeling the data in order to apply machine learning. Plotting in EDA makes use of histograms, box plots, scatter plots, and many other chart types. Exploring the data often takes considerable time. Through the process of EDA, we can refine the problem statement or definition for our dataset, which is very important.
Some of the common questions one can ask during EDA are:
• What kinds of variation exist in the data?
• What knowledge can be discovered from the covariance matrix of the data in terms of the correlations between the variables? (A short sketch of such a correlation check follows this list.)
• How are the variables distributed?
• What strategy should be followed with regard to the outliers detected in the dataset?
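As a short sketch of the covariance/correlation question above, the following computes and visualizes the pairwise correlations of the numeric iris attributes; the file name and the use of a seaborn heatmap mirror the examples later in this section and are assumptions, not the book's own code.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('iris.csv')
corr = data.drop(columns=['species']).corr()    # pairwise correlations of the numeric attributes
print(corr)

sns.heatmap(corr, annot=True, cmap='coolwarm')  # visualize which variables move together
plt.show()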
Some typical graphical techniques widely used during EDA include histogram,
confusion matrix, box plot, scatter plot, principal component analysis (PCA), and so
forth. Some of the available popular Python libraries used for EDA include seaborn,
pandas, matplotlib, and NumPy. In this section, we will illustrate multiple examples
showing how EDA is conducted on a sample dataset.
Let us begin the EDA by importing some libraries required to perform EDA.
import pandas as pd
import seaborn as sns #visualization
import matplotlib.pyplot as plt #visualization
import numpy as np
The first step in performing EDA is to represent the data as a DataFrame, which provides extensive functionality for data analysis and manipulation. Loading the data into a pandas DataFrame is one of the most basic steps in EDA; since the values in the dataset are comma separated, all we have to do is read the CSV file into a DataFrame and pandas does the job for us. First, download iris.csv from https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv. Loading the data and determining its statistics can be done using the following commands:
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('iris.csv')
print('size of the dataset and the number of features are:')
print(data.shape)
print('\ncolumn names in the dataset:')
print(data.columns)
print('\nnumber of samples for each flower species:')
print(data["species"].value_counts())

data.plot(kind='scatter', x='petal_length', y='petal_width')
plt.show()

> # size of the dataset and the number of features are:
> (150, 5)
>
> # column names in the dataset:
> Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
>
> # number of samples for each flower species:
> virginica     50
> setosa        50
> versicolor    50
> Name: species, dtype: int64
Fig. 1.6 2D scatter plot for iris dataset, based on two attributes “petal-length” and “petal-width”
A scatter plot can display the distribution of data. Figure 1.6 shows a 2D scatter plot for visualizing the iris data (the command is included in the previous code snippet), with petal_length on the x-axis and petal_width on the y-axis. However, with this plot it is difficult to understand the per-class distribution of the data. A color-coded plot, which assigns a distinct color to each flower species (class), helps. This can be done with the seaborn (sns) library by executing the following commands:
import seaborn as sns
sns.set_style("whitegrid")
sns.FacetGrid(data, hue="species", height=4) \
   .map(plt.scatter, "petal_length", "petal_width") \
   .add_legend()
plt.show()
Looking at the scatter plot in Fig. 1.6, it is a bit difficult to make sense of the data, since all data points are displayed with the same color regardless of their label (i.e., category). Once color coding is applied, however, we can say a lot more about the data by using a different color for each label. Figure 1.7 shows the color-coded scatter plot, coloring setosa blue, versicolor orange, and virginica green. One can see how the data is distributed across the two axes of petal-width and petal-length based on the flower species. The plot clearly shows three clusters (blue, orange, and green): blue and orange do not overlap, whereas orange and green do.
Fig. 1.7 Color-coded 2D scatter plot of the iris dataset, petal_length (x-axis) versus petal_width (y-axis), with points colored by species (setosa, versicolor, virginica)

One important observation from this plot is that the petal-width and petal-length attributes can distinguish setosa from versicolor and setosa from virginica. However, the same attributes cannot distinguish versicolor from virginica because their clusters overlap. This implies that the analyst should explore other attributes to train an accurate classifier and perform a reliable classification. Here is a summary of our observations:
• Using petal-length and petal-width features, we can distinguish setosa flowers
from others. How about using all the attributes?
• Separating versicolor from virginica is much harder, as they overlap considerably in the petal-width and petal-length attributes. Would one obtain the same observation if the sepal-width and sepal-length attributes were used instead? (A short sketch for checking this follows the list.)
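One way to check the last question is to repeat the color-coded scatter plot with the sepal attributes; this sketch simply reuses the FacetGrid pattern shown earlier and is not an example from the book itself.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('iris.csv')
sns.set_style("whitegrid")
sns.FacetGrid(data, hue="species", height=4) \
   .map(plt.scatter, "sepal_length", "sepal_width") \
   .add_legend()
plt.show()

Comparing this plot with Fig. 1.7 shows whether the sepal attributes separate the three species as cleanly as the petal attributes do.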
We have also included a 3D scatter plot in the Jupyter notebook for this tutorial. A sample tutorial for 3D scatter plots with Plotly Express can be found at https://plot.ly/pandas/3d-scatter-plots/; note that such plots need a lot of mouse interaction to interpret the data. (What about 4D, 5D, or n-D scatter plots?)
Pair-Plot
When the number of features in a dataset is high, a pair-plot can be used to clearly visualize the correlations between the dataset variables. The pair-plot visualization helps to view 2D patterns (Fig. 1.8) but cannot visualize higher-dimensional patterns in 3D or 4D. Datasets encountered in real studies contain many features, and the relation between all possible pairs of variables should be analyzed. The pair plot gives a scatter plot for every combination of variables that you want to analyze and thereby exposes the relationships between them (Fig. 1.8).
To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function in seaborn. This shows the relationships for all (n, 2) combinations of variables in a DataFrame as a matrix of plots, with the diagonal showing the univariate plots. Figure 1.8 illustrates the pair-plot for the iris dataset, which leads to the following observations:
• Petal-length and petal-width are the most useful features for identifying the various flower types.
• While setosa can be easily identified (linearly separable), virginica and versicolor have some overlap (almost linearly separable).
With the help of pair-plot, we can find “lines” and “if-else” conditions to build a
simple model to classify the flower types.
plt.close()
sns.set_style("whitegrid")
sns.pairplot(data, hue="species", height=3)
plt.show()
Fig. 1.9 Histogram plot showing frequency distribution for variable “petal_length”
Histogram Plot
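As a minimal sketch of how a frequency distribution such as the one in Fig. 1.9 might be produced, the snippet below uses seaborn's histplot (available in recent seaborn versions); the per-species coloring is an illustrative assumption.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('iris.csv')
sns.histplot(data=data, x="petal_length", hue="species")  # frequency distribution of petal_length
plt.show()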
Box Plot
A box and whisker plot, also called a box plot, displays the five-number summary of a set of data: the minimum, first quartile, median, third quartile, and maximum. In a box plot, we draw a box from the first quartile to the third quartile, a line goes through the box at the median, and the whiskers go from each quartile to the minimum or maximum. A box and whisker plot is a way of summarizing a set of data measured on an interval scale. It is often used in exploratory data analysis. This type of graph shows the shape of the distribution, its central value, and its variability. In a box and whisker plot:
• The ends of the box are the upper and lower quartiles, so the box spans the interquartile range.
• The median is marked by a line inside the box.
• The whiskers are the two lines outside the box that extend to the highest and lowest observations.
The following code snippet shows how a box plot is used to visualize the distribution of the iris dataset. Figure 1.10 shows the box plot visualization of petal_length grouped by the "species" output variable.

sns.boxplot(x='species', y='petal_length', data=data)
plt.show()
Violin Plots
Violin plots are a method of plotting numeric data and can be considered a combination of a box plot with a kernel density plot. In the violin plot (Fig. 1.11), we can find the same information as in box plots:
• Median.
• Interquartile range.
• The lower/upper adjacent values, defined as first quartile - 1.5 IQR and third quartile + 1.5 IQR, respectively. These values can be used in a simple outlier detection technique (Tukey's fences), where observations lying outside these "fences" can be considered outliers.
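A minimal sketch of such a violin plot for the iris data used throughout this section follows; the column choice (petal_length per species) is an illustrative assumption.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('iris.csv')
sns.violinplot(x="species", y="petal_length", data=data)  # box-plot summary plus a kernel density outline
plt.show()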
Fig. 1.10 Box plot of petal_length for the iris dataset over the "species" variable
Data in statistics are sometimes classified according to how many variables are in a study. For example, "height" might be one variable and "weight" might be another. Depending on the number of variables being looked at, the data is univariate, bivariate, or multivariate.
Multivariate data analysis is a set of statistical models that examine patterns in multi-dimensional data by considering several data variables at once. It is an expansion of bivariate data analysis, which considers only two variables in its models. Because multivariate models consider more variables, they can examine more complex phenomena and find data patterns that more accurately represent the real world. These analyses can be done using the seaborn library in the following manner, depicted in Fig. 1.12, which shows the bivariate distribution of "petal-length" and "petal-width," as well as the univariate profile of each attribute in the margin.
sns.jointplot(x="petal_length", y="petal_width", data=data, kind="kde")
plt.show()
Visualization techniques are very effective, helping the analyst understand the
trends in data.
Standard Deviation
The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean; it is calculated as the square root of the variance, which in turn is obtained from each data point's deviation from the mean. If the data points are far from the mean, there is a higher deviation within the dataset. The more dispersed the data, the larger the standard deviation; conversely, the more tightly the data cluster around the mean, the smaller the standard deviation.
# per-species subsets of the iris DataFrame (assumed to have been created from `data`)
iris_setosa = data[data["species"] == "setosa"]
iris_virginica = data[data["species"] == "virginica"]
iris_versicolor = data[data["species"] == "versicolor"]

print("\n Std-dev:")
print(np.std(iris_setosa["petal_length"]))
print(np.std(iris_virginica["petal_length"]))
print(np.std(iris_versicolor["petal_length"]))

> Std-dev:
> 0.17191858538273286
> 0.5463478745268441
> 0.4651881339845204
Mean/Average
The mean/average is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data, although it is most often used with continuous data. The mean is the sum of all values in the dataset divided by the number of values in the dataset. So, if we have n data points with values x1, x2, ..., xn, the sample mean, usually denoted by x̄, is x̄ = (x1 + x2 + ... + xn)/n.
1 print("Means:")
2 print(np.mean(iris_setosa["petal_length"]))
3 # Mean with an outlier.
4 print(np.mean(np.append(iris_setosa["petal_length"],50)))
5 print(np.mean(iris_versicolor["petal_length"]))
6
7 >Means:
8 >1.4620000000000002
9 >2.4137254901960787
10 >4.26
Variance
Median
Percentile
Percentiles are used to understand and interpret data. The nth percentile of a set of data is the value below which n percent of the data falls; percentiles thus indicate the values below which a certain percentage of the data in a dataset is found. Percentiles can be calculated using the formula n = (P/100) × N, where P = percentile and N = number of values in the dataset (sorted in ascending order), with n giving the ordinal rank of the corresponding value.
Quantile
Interquartile Range
The IQR describes the middle 50% of the values when they are ordered from lowest to highest. To find the interquartile range (IQR), first find the median (middle value) of the lower half and of the upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3), and the IQR is the difference between Q3 and Q1.
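The following is a small sketch of this computation with NumPy on the iris petal_length column, together with the Tukey-style fences mentioned in the violin plot discussion; the column choice is an illustrative assumption.

import numpy as np
import pandas as pd

data = pd.read_csv('iris.csv')
petal_length = data["petal_length"]

q1, q3 = np.percentile(petal_length, [25, 75])   # first and third quartiles
iqr = q3 - q1
print("Q1 = %.2f, Q3 = %.2f, IQR = %.2f" % (q1, q3, iqr))

# Tukey-style fences: points outside them can be flagged as potential outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = petal_length[(petal_length < lower) | (petal_length > upper)]
print("points outside the fences:", len(outliers))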
The mean absolute deviation of a dataset is the average distance between each data
point and the mean. It gives us an idea about the variability in a dataset. The idea is to
calculate the mean, calculate how far away each data point is from the mean using
positive distances, which are also called absolute deviations, add those deviations
together, and divide the sum by the number of data points.
from statsmodels import robust

print("\n Median Absolute Deviation")
print(robust.mad(iris_setosa["petal_length"]))
print(robust.mad(iris_virginica["petal_length"]))
print(robust.mad(iris_versicolor["petal_length"]))

> Median Absolute Deviation
> 0.14826022185056031
> 0.6671709983275211
> 0.5189107764769602
Precision and recall alone are not adequate for characterizing detection performance, and they can even contradict each other, because neither metric includes all of the outcomes in its formula. The F-score (i.e., F-measure) is therefore calculated from precision and recall to compensate for this shortcoming. The Receiver Operating Characteristic (ROC) is a statistical plot that depicts binary detection performance as the discrimination threshold is varied. The ROC space uses FPR and TPR as its x and y axes, respectively. It helps the detector determine the trade-off between true positives and false positives, in other words, between benefits and costs. Since TPR and FPR are equivalent to sensitivity and (1 - specificity), respectively, each prediction result represents one point in the ROC space, and the point in the upper left corner, coordinate (0, 1), stands for the best detection result, representing 100% sensitivity and 100% specificity (the perfect detection point).
ACC = (TP + TN) / (TP + FP + TN + FN) = (8 + 6) / (8 + 1 + 6 + 2) = 0.82.    (1.1)
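Using the same confusion-matrix counts as in Eq. (1.1), the short sketch below computes accuracy together with the precision, recall, F-score, and FPR discussed above; the counts are taken from the equation, and the direct formulas are standard definitions rather than a specific library API.

# confusion-matrix counts from Eq. (1.1)
TP, FP, TN, FN = 8, 1, 6, 2

accuracy = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)                       # also called sensitivity or TPR
f1 = 2 * precision * recall / (precision + recall)
fpr = FP / (FP + TN)                          # 1 - specificity; the x-axis of the ROC space

print("accuracy  = %.2f" % accuracy)          # 0.82
print("precision = %.2f" % precision)         # 0.89
print("recall    = %.2f" % recall)            # 0.80
print("F-score   = %.2f" % f1)                # 0.84
print("FPR       = %.2f" % fpr)               # 0.14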