
Data Analytics
using Python

Bharti Motwani
Professor
Area Coordinator- Business Analytics
CMS Business School
Jain (Deemed-to-be) University
Bengaluru

Data Analytics using Python
Copyright © 2020 by Wiley India Pvt. Ltd., 4436/7, Ansari Road, Daryaganj, New Delhi-110002.

Cover Image: © DuKai photographer/Getty Images

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording or scanning without the written permission of the publisher.

Limits of Liability: While the publisher and the author have used their best efforts in preparing this book, Wiley and the author
make no representation or warranties with respect to the accuracy or completeness of the contents of this book, and specifically
disclaim any implied warranties of merchantability or fitness for any particular purpose. There are no warranties which extend
beyond the descriptions contained in this paragraph. No warranty may be created or extended by sales representatives or written
sales materials. The accuracy and completeness of the information provided herein and the opinions stated herein are not
guaranteed or warranted to produce any particular results, and the advice and strategies contained herein may not be suitable for
every individual. Neither Wiley India nor the author shall be liable for any loss of profit or any other commercial damages,
including but not limited to special, incidental, consequential, or other damages.

Disclaimer: The contents of this book have been checked for accuracy. Since deviations cannot be precluded entirely, Wiley or
its author cannot guarantee full agreement. As the book is intended for educational purpose, Wiley or its author shall not be
responsible for any errors, omissions or damages arising out of the use of the information contained in the book. This publication
is designed to provide accurate and authoritative information with regard to the subject matter covered. It is sold on the
understanding that the Publisher is not engaged in rendering professional services.

Trademarks: All brand names and product names used in this book are trademarks, registered trademarks, or trade names of
their respective holders. Wiley is not associated with any product or vendor mentioned in this book.

Other Wiley Editorial Offices:


John Wiley & Sons, Inc. 111 River Street, Hoboken, NJ 07030, USA
Wiley-VCH Verlag GmbH, Pappelallee 3, D-69469 Weinheim, Germany
John Wiley & Sons Australia Ltd, 42 McDougall Street, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 1 Fusionopolis Walk #07-01 Solaris, South Tower, Singapore 138628
John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada, M9W 1L1

First Edition: 2020


ISBN: 978-81-265-0295-0
ISBN: 978-81-265-8965-4 (ebk)
www.wileyindia.com
Printed at:

Preface

Today, no organization can make, deliver, or market its product or service efficiently without
technology and data. Professionals who can leverage data, control technology, use artificial
intelligence to shape the organization, use expertise to understand customer experience, and
bring innovation or increase efficiency in the process will be the leaders of tomorrow. There is a
need to rise above the antiquated approaches of earlier days and infuse the spirit of participation,
the spirit of adaptation, and the spirit of adventure. Data analysis is the process of examining,
cleansing, and modeling data with the objective of discovering useful information for effective
decision-making and operations. The quality of the resulting information depends on the quality
of the tools used in data analysis. It is important that the generated results and their
interpretations continue to hold value under the new world scenario. This calls for a renewed
commitment to high-quality learning that is responsive to data and concepts and that provides the
best possible experience. This will happen best in learning environments that are serious and
focused, on the one hand, but which are also joyous and inspiring, operating on the cutting edge
of pedagogy and knowledge. Having spent more than 22 years in the field of information
technology and having published more than 80 research papers, I feel obliged to share my
knowledge and experience related to the analysis of real-life situations. This book has evolved
from my teaching experience in several technical institutions, providing consultancies, conducting
research methodology workshops, and my experience of working in the IT industry.
In this competitive world, there is a need to continue supporting the core strengths of higher
education: building on a reputation for excellence and diversity in learning and teaching, world-
leading research, and an enviable record of knowledge exchange. The goal of this book is to open
the doors of opportunity related to different analytical techniques applied to a broad array of
datasets. It is an attempt to provide a reservoir of updated knowledge on varied tools for
academicians, consultants, research scholars, practitioners, and students. Data Analytics using
Python will enable readers to gain sufficient knowledge and experience to perform analysis using
the different tools and techniques available in numerous libraries, according to the different
requirements of the user for different types of data. In order to provide a more meaningful and
easier learning experience, this book has been written with interesting and relevant real-life
examples. The examples, taken from a variety of sectors, are solved with a proper explanation of
the code, and comments are used for better clarity. This easy-to-understand approach will enable
readers to develop the required skills and apply techniques to solve all types of problems in
Python in an effective manner. The reader is encouraged to execute the programs and read the
explanation given after each code for understanding the concept and the process of
implementation. Readers’ views, observations, constructive criticism, and suggestions are
welcome at bhartimotwani@hotmail.com.

About the Book


Data are the fuel of the 21st century. Considering the importance of Python as a programming
and data-processing tool, rich in libraries, and as an advanced data analytics tool, this book is
divided into four sections:

1. Programming in Python
2. Core Libraries in Python
3. Machine Learning in Python
4. Deep Learning Applications in Python

Organization of the Book


This book contains 22 chapters divided into four sections:

1. Section 1 – Programming in Python: Topics related to the basics of Python are covered in the
first section, which consists of four chapters. The section begins with an introduction to
Python and includes chapters on basic programming, covering control flow statements and
user-defined functions, data structures, and commonly used modules in Python. This section
forms a base for beginners for data analysis in Python.
• Chapter 1 discusses building a foundation for understanding the Python environment. It
also provides an understanding of variables, input, and output functions in Python. Readers
will get exposure to arithmetic, assignment, relational, logical, and Boolean operators
and will become familiar with Python modules and libraries.
• Chapter 2 throws light on the concepts of programming. Readers can develop
analytical skills for decision-making structures and looping and can build expertise in
exception handling in programming. Additionally, this chapter fosters analytical and
critical thinking abilities for creating user-defined functions.
• Chapter 3 discusses creating different data structures in Python. It explains how to access
different elements of data structures in Python. By reading this chapter, readers can
apply the knowledge of different functions to data structures and can also evaluate the
utility of different data structures in different conditions.
• Chapter 4 educates on the importance of in-built and user-defined modules in Python. It
provides an understanding of the functions available in existing Python modules. Readers
will get exposure to the usage of functions and their respective modules in different
scenarios. Additionally, readers will be able to create user-defined modules and use them
in programs.

2. Section 2 – Core Libraries in Python: This section comprises seven chapters covering core
libraries in Python: the numpy library for arrays, the pandas library for data processing, the
matplotlib and seaborn libraries for visualization, the SciPy library for statistics, the
SQLAlchemy library for SQL, and the statsmodels library for time series.
• Chapter 5 provides a foundation for understanding arrays through the numpy library. It
also provides familiarity with accessing elements through slicing and indexing. By reading
this chapter, readers will be able to understand the utility of functions available in the
numpy library and get exposure to special functions for single- and multidimensional
arrays.

• Chapter 6 discusses the importance of dataframes. By reading this chapter, readers
will be able to apply the knowledge of the functions available in the pandas library to
datasets. They will also be able to develop the skills for importing data from different
software into Python. Additionally, they will be able to attain competence in handling
missing data through exclusion and recoding.
• Chapter 7 provides awareness of different types of basic charts. It demonstrates the
knowledge of charts in solving real-world problems. It encourages effective decision-
making in selecting the chart suited to the data type. By reading this chapter, readers will
be able to create charts and analyze their results.
• Chapter 8 provides exposure to visualization techniques from the seaborn library. This
chapter will help readers create different charts for categorical and continuous variables.
Readers will also be able to analyze the different charts and foster analytical and critical
thinking abilities for decision-making.
• Chapter 9 discusses different subpackages in the SciPy library. After reading this
chapter, readers will be familiar with linear algebra and statistical techniques. They can
assess the results of different statistical techniques in real-world situations and can apply
the knowledge of image processing using the ndimage subpackage.
• Chapter 10 throws light on the basic structured query language (SQL) operations: query,
insert, update, and delete. By reading this chapter, readers can attain competence in
accessing data. They can apply the knowledge of ORDER BY and GROUP BY for data
extraction. They will also be able to implement data extraction from multiple tables
through joins and subqueries.
• Chapter 11 mentions the importance of time series analysis. It helps in determining the
stationarity of a series. By reading this chapter, readers will be able to understand
smoothing and seasonal decomposition for making a time series stationary. They can also
construct different models using autoregressive integrated moving average (ARIMA)
modeling techniques.

3. Section 3 – Machine Learning in Python: The third section comprises six chapters
related to machine learning. This section starts from the basics of machine learning,
gradually increases the level to better machine learning algorithms, and finally discusses
machine learning for text and image data through an example-driven approach.
• Chapter 12 educates on the concept of unsupervised machine learning algorithms. By
reading this chapter, readers will get exposure to different unsupervised algorithms:
dimension reduction algorithms and clustering techniques. They will also be able to
analyze the results of these unsupervised algorithms and implement unsupervised
machine learning algorithms in real-world situations.
• Chapter 13 differentiates between regression and classification. By reading this chapter,
readers will be able to design models based on regression and classification algorithms
in Python. They can also analyze models related to regression and classification
algorithms in Python and interpret the results of different models.
• Chapter 14 discusses different supervised machine learning algorithms. It provides
the knowledge of ML algorithms to solve real-world cases. By reading this chapter,
readers will be able to implement supervised ML algorithms using Python and develop
the analytical skills for interpreting supervised ML algorithms.

• Chapter 15 provides an orientation to different supervised machine learning ensemble
techniques. It demonstrates the knowledge of ML ensemble techniques in solving real-
world problems. By reading this chapter, readers will be able to evaluate different
ensemble techniques for improving the accuracy of a model and can develop the best
model for accurate prediction.
• Chapter 16 throws light on the real-time applications of text data analysis. Based on the
learning from this chapter, readers can apply different machine learning techniques to
text data and can foster analytical and critical thinking abilities for data-based decision-
making. They will also be able to evaluate the results of text mining and sentiment
analysis.
• Chapter 17 provides an understanding of image data representation. On the basis of this
chapter, readers will be able to determine images similar to a given image from an existing
image dataset and can apply different machine learning techniques to existing image
datasets. They can also develop analytical thinking abilities for analyzing image-based
datasets.

4. Section 4 – Deep Learning Applications in Python: The last section is the heart of the
book and comprises five chapters related to deep learning applications. This section starts
with the development of neural network models such as the Multilayer Perceptron (MLP),
Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN). For
improving the accuracy of the deep learning models related to text and image data, the next
chapters discuss trained deep learning models. The best application of deep learning for
text data is the development of chatbots, which is discussed in the next chapter. The last
chapter discusses the latest technologies that can be used in machine learning.
• Chapter 18 provides details on neural network models. It also throws light on different
deep learning algorithms based on the nature of the data. Readers can validate and test
different types of neural network models after reading this chapter. They can also attain
competence in using different arguments for increasing accuracy.
• Chapter 19 educates on understanding the available trained models for text data. After
reading this chapter, readers can attain competence in creating their own user-defined
trained models. They can apply the expertise of trained models for machine learning
algorithms. The knowledge in this chapter will help readers implement the use of trained
models for question answering.
• Chapter 20 mentions the importance of trained models for determining similar images
and for image recognition. Readers can apply the knowledge of unsupervised machine
learning algorithms using trained models to form clusters of similar images. They
can evaluate supervised machine learning algorithms using trained models and
can create a user-defined trained model and access it for feature extraction.
• Chapter 21 discusses the Rasa environment for creating chatbots. It educates on
developing interactive chatbots with new entities, actions, and forms. Readers can apply
the usage of an application programming interface (API) key in developing location-
based chatbots. By reading this chapter, readers can make effective chatbots and
understand the limitations of chatbots.
• Chapter 22 provides new applications of machine learning techniques. Readers will
become familiar with reinforcement and federated learning and get acquainted with graph
neural networks (GNNs). They can also gain exposure to creating synthetic images using
generative adversarial networks (GANs) by reading this chapter.

List of Color Figures


Color figures in the following chapters are available at http://www.wileyindia.com: Chapter 7,
Chapter 8, Chapter 9, Chapter 12, Chapter 13, Chapter 14, Chapter 15, Chapter 16, Chapter 17,
Chapter 18, Chapter 19, Chapter 20, Chapter 21, Chapter 22.

Instructor Resources
The following resources are available for instructors on request. To register, please log onto
https://www.wileyindia.com/Instructor_Manuals/Register/login.php

1. Chapter-wise PowerPoint Presentations (PPTs)


2. Chapter-wise Solution Manuals

Bharti Motwani

About the Author

Dr. Bharti Motwani has over 22 years of teaching, corporate, research, and consultancy experience
in a variety of contexts. She is the author of many books related to data analytics, including Data
Analytics with R (Wiley). She has demonstrated proficiency in guiding Ph.D. candidates,
reviewing for journals, and editing books and journals. She has written more than 80 research
papers in leading books and in national and international indexed journals of high repute. She is
the recipient of the Young Scientist of the Year award. She has proved her dexterity in research
methodologies and software development by conducting various seminars and workshops related
to the latest tools in research and software and by guiding various research and software projects.
She has high technical expertise in Data Analytics Software (R, Python, Tableau, Spark, Power BI,
SAS, SPSS, AMOS, SmartPLS); Front-End Tools (Visual Basic, Java, JavaScript, C, C++, C#,
HTML, PHP); Back-End Tools (Oracle, MS SQL Server, MySQL); and various IDE and web
designing tools. She is an analytical professional, IT and analytics consultant, and a result-driven,
articulate academician who can think out of the box, and who loves to innovate and engage in new
challenges.

Acknowledgments

Expression of feelings by words loses its significance when it comes to saying a few words of
gratitude, yet to express it in some form, however imperfect, is a duty toward those who helped. I
offer my special gratitude to almighty God for the blessings that have made the completion of this
book possible.
I find myself at a loss for words to express my deep sense of gratitude to my father, Mr.
Shrichand Jagwani, and mother, Mrs. Anita Jagwani, for their affection, continuous care,
constant encouragement, and understanding.
My real strength has been the selfless cooperation, solicitous concern, and emotional support
of my husband, Mr. Bharat Motwani. No words can convey my gratitude to my children, Pearl
and Jahan, who had to tolerate my preoccupation with this book. Their patience, forbearance, and
love through this whole process have made this mind-absorbing and time-consuming task
possible.
I am grateful to the President of Jain (Deemed-to-be University) and Chairman of The JGI
Group, Dr. Chenraj Roychand, for his support, and to all the faculty and staff members for providing
a conducive environment. I am also thankful to all those people whose constructive suggestions
and work have helped to enhance the standard of the work directly and/or indirectly and brought
the task to fruition. I am indebted to Wiley for their sincere efforts, unfailing courtesy, and
cooperation in bringing out the book in this elegant form. It has been a real pleasure working
with such professional staff.

Bharti Motwani

List of Videos

Chapter 1: 1.1 Features of Python; 1.9 Core Libraries in Python
Chapter 2: 2.1 Decision-Making Structures; 2.2 Loops; 2.6 User-Defined Functions
Chapter 3: 3.1 Lists; 3.2 Tuples; 3.3 Dictionary
Chapter 4: 4.1 In-Built Modules in Python; 4.2 User-Defined Module
Chapter 5: 5.1 One-Dimensional Array; 5.2 Multidimensional Arrays
Chapter 6: 6.1 Basics of Dataframe; 6.7 Missing Values
Chapter 7: 7.1 Charts Using plot() Function; 7.6 Bar Chart
Chapter 8: 8.1 Visualization for Categorical Variable; 8.2 Visualization for Continuous Variable
Chapter 9: 9.1 The linalg Sub-Package; 9.2 The stats Sub-Package; 9.4 The ndimage Sub-Package
Chapter 10: 10.1 Basic SQL; 10.2 Advanced SQL for Multiple Tables
Chapter 11: 11.1 Time Series Object; 11.2 Determining Stationarity
Chapter 12: 12.1 Dimensionality Reduction; 12.2 Clustering
Chapter 13: 13.1 Basic Steps of Machine Learning; 13.2 Regression; 13.3 Classification
Chapter 14: 14.2 k-Nearest Neighbor’s Algorithm; 14.4 Decision Tree
Chapter 15: 15.2 Random Forest; 15.5 Gradient Boosting
Chapter 16: 16.1 Text Mining; 16.2 Sentiment Analysis Using Lexicon-Based Approach
Chapter 17: 17.1 Image Acquisition and Preprocessing; 17.2 Image Similarity Techniques
Chapter 18: 18.1 Steps for Building a Neural Network Model; 18.2 Multilayer Perceptrons Model (2-D Tensor); 18.4 CNN Model (4-D Tensor)
Chapter 19: 19.1 Text Similarity Techniques; 19.2 Unsupervised Machine Learning; 19.3 Supervised Machine Learning
Chapter 20: 20.1 Image Similarity Techniques; 20.2 Unsupervised Machine Learning; 20.3 Supervised Machine Learning
Chapter 21: 21.1 Understanding Rasa Environment and Executing Default Chatbot; 21.2 Creating Basic Chatbot; 21.3 Creating Chatbot with Entities and Actions
Chapter 22: 22.1 Reinforcement Learning; 22.2 Federated Learning

Contents

Preface
About the Author
Acknowledgments
List of Videos

SECTION 1 Programming in Python


Chapter 1 Introduction to Python
1.1 Features of Python
1.2 Installation of Python
1.3 Getting Started
1.4 Variables in Python
1.4.1 Naming Rules
1.4.2 Assigning Values to Variable
1.5 Output in Python
1.5.1 The print() Function
1.5.2 Print with "end" Argument
1.5.3 Print with "sep" Argument
1.6 Input in Python
1.6.1 Input() Function
1.6.2 Input with int() Function
1.6.3 Input with float() Function
1.6.4 Input with eval() Function
1.7 Operators
1.7.1 Arithmetic Operators
1.7.2 Assignment Operators
1.7.3 Relational Operators
1.7.4 Logical Operators

1.7.5 Operator Precedence
1.8 Core Modules in Python
1.9 Core Libraries in Python

Chapter 2 Control Flow Statements


2.1 Decision-Making Structures
2.1.1 “If” Statement
2.1.2 If…else Statement
2.1.3 Nested “if” Statement
2.1.4 If-elif-else Ladder
2.2 Loops
2.2.1 For Loop
2.2.2 Nesting of for Loops
2.2.3 While Loop
2.2.4 Nesting of “While” Loops
2.3 Nesting of Conditional Statements and Loops
2.3.1 The “for” Loop inside “if” Conditional Statement
2.3.2 The “if” Conditional Statement Inside “for” Loop
2.3.3 The “if” Conditional Statement Inside “while” Loop
2.3.4 Using “for”, “while”, and “if” Together
2.4 Abnormal Loop Termination
2.4.1 Break Statement
2.4.2 Continue Statement
2.4.3 Pass Statement
2.5 Errors and Exception Handling
2.5.1 Types of Error
2.5.1.1 Compile-Time Errors
2.5.1.2 Run-Time Errors
2.5.1.3 Logical Errors
2.5.2 Exception Handling
2.6 User-Defined Functions
2.6.1 Function without Arguments
2.6.2 Function with Arguments
2.6.2.1 Create a User-Defined Function with Single Argument
2.6.3 Nesting of Functions
2.6.4 Recursive Functions
2.6.5 Scope of Variables within Functions

Chapter 3 Data Structures

3.1 Lists
3.1.1 Creating a List
3.1.2 Accessing List Elements
3.1.3 Functions for List
3.1.4 Programming with List
3.2 Tuples
3.2.1 Creating a Tuple
3.2.2 Accessing Tuple Elements
3.2.3 Functions for Tuple
3.2.4 Programming with Tuple
3.3 Dictionary
3.3.1 Creating a Dictionary
3.3.2 Accessing Dictionary Elements
3.3.3 Functions for Dictionary
3.3.4 Programming with Dictionary

Chapter 4 Modules
4.1 In-Built Modules in Python
4.1.1 The Math Module
4.1.2 The Random Module
4.1.3 The Statistics Module
4.1.4 The Array Module
4.1.5 The String Module
4.1.5.1 Accessing String Elements
4.1.5.2 Case Conversion Functions
4.1.5.3 Alignment and Indentation Functions
4.1.5.4 Other Functions for String
4.1.6 The “re” Module
4.1.7 The Time Module
4.1.8 The Datetime Module
4.1.9 The “os” Module
4.2 User-Defined Module
4.2.1 Creating a Module
4.2.2 Importing the User-Defined Module

SECTION 2 Core Libraries in Python


Chapter 5 Numpy Library for Arrays

5.1 One-Dimensional Array
5.1.1 Creating a 1-D Array
5.1.2 Accessing Elements of 1-D Array
5.1.3 Functions for 1-D Array
5.1.4 Mathematical Operators for 1-D Array
5.1.5 Relational Operators for 1-D Array
5.2 Multidimensional Arrays
5.2.1 Creating a Multidimensional Array
5.2.2 Accessing Elements in Multidimensional Array
5.2.3 Functions on Multidimensional Array
5.2.4 Mathematical Operators for Multiple Multidimensional Arrays
5.2.5 Relational Operators for Multiple Multidimensional Arrays

Chapter 6 Pandas Library for Data Processing


6.1 Basics of Dataframe
6.1.1 Creating a Dataframe
6.1.2 Adding Rows and Columns to the Dataframe
6.1.3 Deleting Rows and Columns from the Dataframe
6.2 Import of Data
6.3 Functions of Dataframe
6.3.1 Basic Information Functions
6.3.2 Mathematical and Statistical Functions
6.3.3 Sort Functions
6.4 Data Extraction
6.4.1 Using Relational Operators
6.4.2 Using Logical Operators
6.4.3 Using iloc Indexers
6.4.4 Using loc Indexers
6.5 Group by Functionality
6.6 Creating Charts for Dataframe
6.7 Missing Values
6.7.1 Determining Missing Values
6.7.2 Deleting Observations Containing Missing Values
6.7.3 Missing Data Imputation

Chapter 7 Matplotlib Library for Visualization


7.1 Charts Using plot() Function
7.2 Pie Chart
7.3 Violin Plot

7.4 Scatter Plot
7.5 Histogram
7.6 Bar Chart
7.7 Area Plot
7.8 Quiver Plot
7.9 Mesh Grid
7.10 Contour Plot

Chapter 8 Seaborn Library for Visualization


8.1 Visualization for Categorical Variable
8.1.1 Box Plot
8.1.2 Violin Plot
8.1.3 Point Plot
8.1.4 Line Plot
8.1.5 Count Plot
8.1.6 Bar Plot
8.1.7 Strip Plot
8.1.8 Swarm Plot
8.1.9 Factor Plot
8.1.10 Facet Grid
8.2 Visualization for Continuous Variable
8.2.1 Scatter Plot
8.2.2 Regression Plot
8.2.3 Heat Map
8.2.4 Univariate Distribution Plot
8.2.5 Joint Plot
8.2.6 Joint Hexbin Plot
8.2.7 Joint Kernel Density Plot
8.2.8 Pair Plot
8.2.9 Pair Grid

Chapter 9 SciPy Library for Statistics


9.1 The linalg Sub-Package
9.2 The stats Sub-Package
9.2.1 Basic Statistics
9.2.1.1 Descriptive Statistics
9.2.1.2 Rank
9.2.1.3 Determining Normality
9.2.1.4 Homogeneity of Variances

9.2.1.5 Correlation
9.2.1.6 Chi-Square Test
9.2.2 Parametric Techniques for Comparing Means
9.2.2.1 One Sample t-Test
Use Case: Green Building Certification
9.2.2.2 Independent Sample t-Test
Use Case: Comparison of Personal Webstore and Marketplaces for
Online Selling
9.2.2.3 Dependent t-Test
Use Case: Effect of Training Program on Employee Performance
9.2.2.4 One-Way ANOVA
Use Case: Effect of Demographics on Online Mobile Shopping
Apps
9.2.3 Non-Parametric Techniques for Comparing Means
9.2.3.1 Kolmogorov–Smirnov Test for One Sample
9.2.3.2 Kolmogorov–Smirnov Test for Two Samples
9.2.3.3 Mann–Whitney Test for Independent Samples
9.2.3.4 Wilcoxon Test for Dependent Samples
9.2.3.5 Kruskal–Wallis Test
9.3 The special Sub-Package
9.4 The ndimage Sub-Package
9.4.1 Flip Effect
9.4.2 Rotate Image
9.4.3 Blur Image
9.4.4 Crop Image
9.4.5 Filters
9.4.6 Colours
9.4.7 Uniform Filters

Chapter 10 SQLAlchemy Library for SQL


10.1 Basic SQL
10.1.1 SELECT Clause
10.1.2 WHERE Clause
10.1.2.1 Relational Operators
10.1.2.2 Logical Operators
10.1.2.3 IN and NOT IN Clauses
10.1.2.4 The LIKE Operator
10.1.2.5 IS NULL
10.1.3 Insert Statement

10.1.4 Update Statement
10.1.5 Delete Statement
10.1.6 In-Built SQL Functions
10.1.7 ORDER BY Clause
10.1.8 GROUP BY Clause
10.1.9 Ranking Functions
10.2 Advanced SQL for Multiple Tables
10.2.1 Intersect and Union
10.2.2 SubQuery
10.2.3 Joining

Chapter 11 Statsmodels Library for Time Series Models


11.1 Time Series Object
11.1.1 Reading Data
11.1.2 Creating Subset
11.2 Determining Stationarity
11.3 Making Time Series Stationary
11.3.1 Adjusting Trend Using Smoothing
11.3.1.1 Simple Moving Average
11.3.1.2 Exponential Weighted Moving Average
11.3.2 Adjusting Seasonality and Trend
11.3.2.1 Differencing
11.3.2.2 Seasonal Decomposition
11.4 ARIMA Modeling
11.4.1 Creating ARIMA Model
11.4.2 Forecasting
Use Case: Foreign Trade

SECTION 3 Machine Learning in Python


Chapter 12 Unsupervised Machine Learning Algorithms
12.1 Dimensionality Reduction
12.1.1 Factor Analysis
Use Case: Balanced Score Card Model for Measuring Organizational
Performance
12.1.2 Principal Component Analysis
Use Case: Employee Attrition in an Organization
12.2 Clustering
12.2.1 K-Means Clustering

23
Use Case: Market Capitalization Categories
12.2.2 Agglomerative Hierarchical Clustering
Use Case: Performance Appraisal in Organizations

Chapter 13 Supervised Machine Learning Problems


13.1 Basic Steps of Machine Learning
13.1.1 Data Exploration and Preparation
13.1.1.1 Understanding Dataset
13.1.1.2 Handling Missing Values
13.1.1.3 Assumptions of Regression
13.1.1.4 Feature Engineering
13.1.2 Model Development
13.1.3 Predicting the Model
13.1.4 Determining the Accuracy of the Model
13.1.4.1 RMSE Value
13.1.4.2 Confusion Matrix
13.1.4.3 Accuracy Score
13.1.4.4 Classification Report
13.1.4.5 ROC Curve and AUC
13.1.5 Creating Better Model
13.1.5.1 Avoid Overfitting and Underfitting
13.1.5.2 Feature Extraction
13.1.5.3 Tuning of Hyper Parameters
13.2 Regression
13.2.1 Simple Linear Regression
Use Case: Relationship between Buying Intention and Awareness of
Electric Vehicles
13.2.2 Multiple Linear Regression
Use Case: Application of Technology Acceptance Model in Cloud
Computing
13.2.3 Nonlinear Least Square Regression
Use Case: Impact of Social Networking Websites on Quality of
Recruitment
13.3 Classification
Use Case: Prediction of Customer Buying Intention due to Digital
Marketing

Chapter 14 Supervised Machine Learning Algorithms


14.1 Naive Bayes Algorithm

14.1.1 Naive Bayes for Classification Problems
Use Case: Measuring Acceptability of a New Product
14.2 k-Nearest Neighbor’s Algorithm
14.2.1 k-NN for Classification Problems
Use Case: Predicting Phishing Websites
14.2.2 k-NN for Regression Problems
Use Case: Loan Categorization
14.3 Support Vector Machines
14.3.1 Support Vector Machines for Classification Problems
Use Case: Fraud Analysis for Credit Card and Mobile Payment
Transactions
14.3.2 Support Vector Machines for Regression Problems
Use Case: Diagnosis and Treatment of Diseases
14.4 Decision Tree
14.4.1 Decision Tree Algorithm for Classification Problems
Use Case: Occupancy Detection in Buildings
14.4.2 Decision Tree for Regression Problems
Use Case: Artificial Intelligence and Employment

Chapter 15 Supervised Machine Learning Ensemble Techniques


15.1 Bagging
15.1.1 Bagging Algorithm for Classification Problems
Use Case: Measuring Customer Satisfaction related to Online Food Portals
15.1.2 Bagging Algorithm for Regression Problems
Use Case: Predicting Income of a Person
15.2 Random Forest
15.2.1 Random Forest Algorithm for Classification Problems
Use Case: Writing Recommendation/Approval Reports
15.2.2 Random Forest Algorithm for Regression Problems
Use Case: Prediction of Sports Results
15.3 Extra Trees
15.3.1 Extra Tree Algorithm for Classification Problems
Use Case: Improving the e-Governance Services
15.3.2 Extra Tree Algorithm for Regression Problems
Use Case: Logistics Network Optimization
15.4 Ada Boosting
15.4.1 AdaBoost for Classification Problems
Use Case: Predicting Customer Churn
15.4.2 AdaBoost for Regression Problems

Use Case: Big Data Analysis in Politics
15.5 Gradient Boosting
15.5.1 Gradient Boosting for Classification Problems in Python
Use Case: Impact of Online Reviews on Buying Behavior
15.5.2 Gradient Boosting for Regression Problems in Python
Use Case: Effective Vacation Plan through Online Services

Chapter 16 Machine Learning for Text Data


16.1 Text Mining
16.1.1 Understanding Text Data
16.1.2 Text Preprocessing
16.1.3 Shallow Parsing
16.1.4 Stop Words
16.1.5 Stemming and Lemmatizing
16.1.6 Word Cloud
Use Case: Text Mining for Long Documents/Speech/Resume
16.2 Sentiment Analysis Using Lexicon-Based Approach
16.2.1 Understanding Data
16.2.2 Determining Polarity
16.2.3 Determining Sentiment from Polarity
16.2.4 Determining Accuracy of Sentiment Analysis
Use Case: Sentiment Analysis for Twitter Data
16.3 Text Similarity Techniques
16.3.1 Cosine Similarity
16.3.2 Euclidean Distance
16.3.3 Manhattan Distance
Use Case: Finding Partners on Matrimonial Websites
16.4 Unsupervised Machine Learning for Grouping Similar Text
Use Case: Organizing Tweets/Reviews of Product/Service
16.5 Supervised Machine Learning
16.5.1 Logistic Regression Model
16.5.2 Random Forest Model
16.5.3 Gradient Boosting Model
16.5.4 Bagging Model
Use Case: Determining Popularity of Social Media News

Chapter 17 Machine Learning for Image Data


17.1 Image Acquisition and Preprocessing
17.1.1 Image Acquisition and Representation

17.1.2 Image Resizing and Rescaling
17.1.3 Image Rotation and Flipping
17.1.4 Image Intensity
17.1.5 Image Cropping
17.1.6 Edge Extraction Using Sobel Filter
17.1.7 Edge Extraction Using Prewitt Filter
Use Case: Image Optimization for Websites
17.2 Image Similarity Techniques
17.2.1 Cosine Similarity
17.2.2 Euclidean Distances
17.2.3 Manhattan Distances
Use Case: Product-Based Recommendation System
17.3 Unsupervised Machine Learning for Grouping Similar Images
Use Case: Grouping Similar Products in e-Commerce
17.4 Supervised Machine Learning Algorithms for Image Classification
17.4.1 Naïve–Bayes Model
17.4.2 Decision Tree Model
17.4.3 Random Forest Model
17.4.4 Bagging Model
Use Case: Online Product Catalog Management

SECTION 4 Deep Learning Applications in Python


Chapter 18 Neural Network Models (Deep Learning)
18.1 Steps for Building a Neural Network Model
18.1.1 Data Preparation
18.1.2 Building the Basic Sequential Model and Adding Layers
18.1.3 Compiling the Model
18.1.4 Fitting the Model on Training Dataset
18.1.5 Evaluating the Model
18.1.6 Creating Better Model with Increased Accuracy
18.2 Multilayer Perceptrons Model (2-D Tensor)
18.2.1 Basic Model
18.2.2 Changing Units, Dropout, Epoch and Batch_size
18.2.3 Changing Activation, Loss, and Optimizer
18.2.4 Changing Optimizer and Activation
18.2.5 Grid Approach to Determine Best Value of Epoch and Batch_size
Use Case: Measuring Quality of Products for Acceptance or Rejection

18.3 Recurrent Neural Network Model (3-D Tensor)
18.3.1 Basic LSTM Model
18.3.2 Changing Activation Function
18.3.3 Adding Dropout
18.3.4 Adding Recurrent Dropout
18.3.5 Adding Conv1D Layer for Sequence Classification
Use Case: Financial Market Analysis
18.4 CNN Model (4-D tensor)
18.4.1 Basic Model for Image Data
18.4.2 Creating Model with ModelCheckpoint API
18.4.3 Creating Denser Model by Adding Hidden Layers
18.4.4 Making Model Deeper
18.4.5 Early Stopping API
18.4.6 Grid-Based Approach
18.4.7 Creating a CNN Model
18.4.8 Regularization
18.4.9 Autoencoder as Classifier
18.4.10 Data Augmentation
Use Case: Facial Recognition in Security Systems

Chapter 19 Transfer Learning for Text Data


19.1 Text Similarity Techniques
19.1.1 Without Pretrained Model
19.1.1.1 Cosine Similarity
19.1.1.2 Euclidean Distance
19.1.1.3 Manhattan Distance
19.1.2 Bert Algorithm
19.1.3 GPT2 Algorithm
19.1.4 Roberta Algorithm
19.1.5 XLM Algorithm
19.1.6 DistilBert Algorithm
Use Case: Service-Based Recommendation System
19.2 Unsupervised Machine Learning
19.2.1 Without Pretrained Model
19.2.2 Bert Algorithm
19.2.3 GPT2 Algorithm
19.2.4 ROBERTA Algorithm
19.2.5 XLM Algorithm

19.2.6 DistilBert Algorithm
Use Case: Grouping Products in e-commerce
19.3 Supervised Machine Learning
19.3.1 Without Pretrained Algorithm
19.3.2 BERT Algorithm
19.3.3 GPT2 Algorithm
19.3.4 Roberta Algorithm
19.3.5 XLM Algorithm
19.3.6 DistilBert Algorithm
Use Case: Spam Protection and Filtering
19.4 User-Defined Trained Deep Learning Model
19.4.1 Bert Algorithm
19.4.1.1 Creating User-Defined Model Using Bert Algorithm
19.4.1.2 Using Pretrained Model on Test Dataset
19.4.2 GPT2 Algorithm
19.4.2.1 Creating User-Defined Model Using GPT2 Algorithm
19.4.2.2 Using Pretrained Model on Test Dataset
19.4.3 Roberta Algorithm
19.4.3.1 Creating User-Defined Model Using Roberta Algorithm
19.4.3.2 Using Pretrained Model on Test Dataset
19.4.4 XLM Algorithm
19.4.4.1 Creating User-Defined Model Using XLM Algorithm
19.4.4.2 Using Pretrained Model on Test Dataset
19.4.5 DistilBert Algorithm
19.4.5.1 Creating User-Defined Model Using DistilBert Algorithm
19.4.5.2 Using Pretrained Model on Test Dataset
Use Case: Image Captioning
19.5 Question Answers Model
19.5.1 Bert Algorithm from Transformers
19.5.2 DistilBert Algorithm from Transformers
19.5.3 Bert Algorithm from PyTorch
19.5.4 Bert Algorithm from DeepPavlov
19.5.5 Ru_bert Algorithm from DeepPavlov
19.5.6 Ru_rubert Algorithm from DeepPavlov
Use Case: Chatbots

Chapter 20 Transfer Learning for Image Data


20.1 Image Similarity Techniques

20.1.1 Without Pretrained Model
20.1.2 Using MobileNet Model
20.1.3 Using MobileNetV2 Model
20.1.4 Using ResNet50 Model
20.1.5 Using VGG16 Model
20.1.6 Using VGG19 Model
Use Case: Recommendation System for Videos
20.2 Unsupervised Machine Learning
20.2.1 Without Pretrained Model
20.2.2 Using MobileNet Model
20.2.3 Using MobileNetV2 Model
20.2.4 Using ResNet50 Model
20.2.5 Using VGG16 Model
20.2.6 Using VGG19 Model
Use Case: Video Summarization Using Clustering
20.3 Supervised Machine Learning
20.3.1 Without Pretrained Models
20.3.2 Using MobileNet Model
20.3.3 Using MobileNetV2 Model
20.3.4 ResNet50 Model
20.3.5 VGG16 Model
20.3.6 VGG19 Model
Use Case: Medical Diagnosis Using Image Processing
20.4 Pretrained Models for Image Recognition
20.4.1 Face and Eye Determination
20.4.2 Gender and Age Determination
Use Case: Personalized Display for Customers in Shopping Mall/Offline
Stores/Restaurants
20.5 Creating, Saving, and Loading User-Defined Model for Feature Extraction
20.5.1 Creating and Saving the Model for Feature Extraction
20.5.2 Evaluating the Model on Existing Dataset
20.5.3 Loading the Model and Determining Emotions of Existing Image
20.5.4 Determining Emotions from Cropped Facial Image
20.5.5 Determining Emotions of Image from Webcam
Use Case: Measuring Customer Satisfaction through Emotion Detection
System

Chapter 21 Chatbots with Rasa


21.1 Understanding Rasa Environment and Executing Default Chatbot

21.1.1 Data Folder
21.1.1.1 nlu.md
21.1.1.2 stories.md
21.1.2 domain.yml
21.1.3 Models Folder
21.1.4 Actions.py
21.1.5 config.yml
21.1.6 credential.yml
21.1.7 endpoints.yml
21.2 Creating Basic Chatbot
21.2.1 nlu.md
21.2.2 stories.md
21.2.3 domain.yml
Use Case: Chatbot for e-Governance
21.3 Creating Chatbot with Entities and Actions
21.3.1 Single Entity
21.3.2 Synonyms for Entities
21.3.3 Multiple Entities
21.3.4 Multiple Values of Entity in Same Intent
21.3.5 Numerous Values of Entity
21.3.5.1 Lookup Tables
21.3.5.2 Regular Expression Features
21.3.6 nlu.md
21.3.7 stories.md
21.3.8 domain.yml
21.3.9 actions.py
Use Case: Chatbot for Alzheimer’s Patients
21.4 Creating Chatbot with Slots
21.4.1 nlu.md
21.4.2 stories.md
21.4.3 domain.yml
21.4.4 actions.py
Use Case: Chatbot for Marketing
21.5 Creating Chatbot with Database
21.5.1 nlu.md
21.5.2 stories.md
21.5.3 domain.yml
21.5.4 actions.py

Use Case: Chatbots for Service Industry
21.6 Creating Chatbot with Forms
21.6.1 nlu.md
21.6.2 stories.md
21.6.3 domain.yml
21.6.4 actions.py
Use Case: Chatbot for Consumer Goods
21.7 Creating Effective Chatbot
21.7.1 Providing Huge Training Data
21.7.2 Including Out-of-Vocabulary Words
21.7.3 Managing Similar Intents
21.7.4 Balanced and Secured Data

Chapter 22 The Road Ahead


22.1 Reinforcement Learning
Use Case: Reinforcement Learning for Solving Real-World Optimization
Problems
22.2 Federated Learning
Use Case: Federated Learning for Self-Driven Cars
22.3 Graph Neural Networks (GNNs)
Use Case: GNN for Sales and Marketing
22.4 Generative Adversarial Network (GAN)
Use Case: Generative Adversarial Network for Cyber Security

Answers to Multiple-Choice Questions


Interview Questions and Answers
Index

SECTION 1
Programming in Python

Chapter 1
Introduction to Python

Chapter 2
Control Flow Statements

Chapter 3
Data Structures

Chapter 4
Modules

CHAPTER
1

Introduction to Python

Learning Objectives
After reading this chapter, you will be able to

• Build a foundation for understanding the Python environment.


• Understand variables, input, and output functions in Python.
• Get exposure to arithmetic, assignment, relational, logical, and Boolean operators.
• Be familiar with Python modules and libraries.

Python is an interpreted, high-level, general-purpose programming language created by Guido
van Rossum; it was first released in 1991. Python is van Rossum’s vision of a small core
language with a large standard library and an easily extensible interpreter, which stemmed from
his frustrations with the ABC programming language. Python has grown in popularity faster than
most other programming languages in recent years. Python is free and open source and runs on all major
operating systems, such as Microsoft Windows, Linux, and Mac OS. Python is designed to be
highly extensible and does not have all of its functionalities built into its core. This modular
approach has made it popular because it acts as a means of adding programmable interfaces to
the applications according to the user’s requirement. The standard library contains hundreds of
modules and classes for a wide variety of programming tasks. Along with the standard library,
there are broad collections of freely available add-on modules, libraries,
frameworks, and tool kits. Python helps in increasing speed for most applications and enhances
productivity of applications with its strong process integration features, unit testing framework,
and enhanced control capabilities.

1.1 Features of Python


Python supports multiple programming paradigms, including object-oriented, imperative,
functional, and procedural features. Some of the important features of Python are as follows:

1. Python syntax is easy to learn and code. It is developer-friendly and is a high-level
programming language.
2. Like any other programming language, Python also allows branching and looping as well as
modular programming using functions.
3. Python has an effective data-handling mechanism and excellent storage facility for numeric
and textual data.
4. Python has a dynamic and automatic memory management.

5. Python has a wide collection of operators and functions for data structures, such as list,
tuple, and dictionary.
6. Python has large and integrated collection of tools for data manipulation, analysis, and
processing.
7. Python code is executed by an interpreter. This means that the code is interpreted line by
line, and there is no need to compile it as in many other programming languages.
8. Python is a platform-independent programming language and hence runs on different
platforms, such as Windows, Linux, Unix, Macintosh, etc.
9. Python is dynamically typed. This means that the type of a value is decided at runtime, not
in advance. Hence, unlike statically typed languages, we do not need to specify the type of a
variable while declaring it (see the short example after this list).
10. Python has an integrated suite of software facilities for graphical interpretation.
11. The rich library of Python is immensely helpful in advanced algorithms related to different
specializations.
12. The machine learning algorithms of Python have brought a revolution in the field of Data
Analytics.
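
The short sketch below illustrates dynamic typing (feature 9 above); the variable name and values
are only illustrative.

    x = 10            # x currently refers to an integer
    print(type(x))    # <class 'int'>
    x = "ten"         # the same name can now refer to a string
    print(type(x))    # <class 'str'>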

1.2 Installation of Python


Python is available for both versions of Windows (32-bit/64-bit). After installation, one needs to
locate the icon to run the program in a directory structure under Windows program files (Fig.
1.1). Click this icon to display the Python GUI, which depicts the start of Python programming
(Fig. 1.2).

Figure 1.1 Python prompt.

Figure 1.2 Python interpreter.

Figure 1.3 Anaconda.

Figure 1.4 Spyder software.

Figure 1.5 Jupyter software.

Figure 1.6 PyCharm software.

Programming in Python can be done easily using the Anaconda or PyCharm software. Anaconda is
a software distribution that consists of Python and a number of Python libraries. It also has its own
virtual environment and repository, which can be used along with the Python command. Anaconda
can be installed from https://www.anaconda.com/download/. Anaconda, when installed, shows five
different options: Anaconda Navigator, Anaconda Prompt, Jupyter Notebook, Reset Spyder
Settings, and Spyder (Fig. 1.3). Spyder is an Integrated Development Environment (IDE) meant
for Python. The opening screen of Spyder is displayed in Fig. 1.4; programming is done in the left
window and results are displayed in the right-bottom window, while the right-top window displays
the variables and data. Jupyter Notebook is a sheet that enables the user to arrange Python code
into cells and run it in a desired order. The first screen of Jupyter Notebook is displayed in Fig. 1.5;
the Jupyter screen performs the operations cell by cell and hence programming is done at a single
place. All programs in this book are made using the Spyder/Jupyter available within the Anaconda
platform. When all the lines of the program are executed together, Spyder is used. However, a
complicated machine learning program is implemented on the Jupyter software because it supports
step-by-step execution of the program for better understanding and efficient execution. PyCharm
is an IDE used in computer programming, specifically for the Python language. It is developed by
the Czech company JetBrains, and the software can be downloaded from
https://www.jetbrains.com/pycharm/download/. It is easy to include a library or module in
PyCharm by changing the settings and environment. The first screen in PyCharm has many
windows, as displayed in Fig. 1.6. The top-left window gives details about the project on which
we are working. The center-top window is the place where programming is done and the bottom
window displays the results of programming.

1.3 Getting Started


In Python environment setup, click on Python interpreter to get a prompt “>>>”. We will start
learning Python programming by writing a “Welcome to Python World” program. Depending on
the needs, one can program either at Python command prompt or at Jupyter (Command Line) or
can write a Python script file in Spyder or Jupyter or PyCharm. In all these options, Python
issues a prompt where the user is expected to write input commands. Type print ("Welcome to
Python World") in different software and observe the results after executing the command. In
Python interpreter, write the statement at the command prompt and press enter to view the result.
In Spyder and Jupyter software, click on “run” to view the results. In PyCharm, click on Run
menu and select the “run” option; select the appropriate file name for executing the program and
for viewing the results.

1.4 Variables in Python


A Python identifier is a name that is used to identify variables, functions, or any other user-
defined item. Python is a case-sensitive programming language. Thus, Sales and sales are two
different identifiers in Python. The name of an identifier must follow the naming rules.

1.4.1 Naming Rules


The name of an identifier can be composed of letters, digits, or the underscore character. An
identifier starts with a letter from A to Z, a to z, or an underscore followed by zero or more
letters, underscores, and digits (0 to 9). Python does not allow punctuation characters such as @,
$, and % within identifiers. Examples of valid identifiers are PROFIT, salary, emp12, Account,
etc. Python is a dynamically typed language; hence, there is no need to declare variables in
Python before they are used in the program.
Some words have pre-defined significance in Python programming language. These words
are called reserved words and may not be used as constants or any other identifier name.
Examples of reserved words include if, else, for, break, etc.

1.4.2 Assigning Values to Variable


An identifier is generally associated with an expression or a value. Like many other
programming languages, the = operator means assignment and does not represent mathematical
equality. Thus, the statement x = y copies the value stored in variable “y” into variable “x”. It
should be noted that the numeric value is written without double quotes and a string value is
enclosed in double quotes “”. A string contains characters that are similar to character literals:
plain characters, escape sequences, and universal characters. Example: salary = 7000; here salary
is a variable which has the value 7000. In Python, we say that salary is assigned the value 7000
using the assignment operator (=). The statement Country="India" stores the string value
of India in the “Country” variable.

An error will occur if you do not put the string in quotes. Type country=India to
understand the concept of strings.
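
A minimal sketch of these assignments, using the values mentioned above:

    salary = 7000          # a numeric value is written without quotes
    Country = "India"      # a string value is enclosed in quotes
    print(salary)          # 7000
    print(Country)         # India
    # country = India      # would raise a NameError because India is treated as an undefined identifier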

1.5 Output in Python


The most fundamental thing in a programming language is the input to the program and the output
from the program. The output is displayed using the print() function in Python.
Comments: Comments are helping text in a Python program and are ignored by the interpreter. A
single-line comment starts with a #. Multi-line comments can be created using """ (three double
quotes at the start and end of the comment block).

1.5.1 The print() Function


The string which needs to be printed is written in double quotes in the print() function. The
command print("Welcome to Python World") hence prints Welcome to Python World on the
screen, as shown in Figs. 1.4, 1.5 and 1.6 using different software.
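
A minimal sketch of the statements discussed in the explanation below, assuming salary holds 7000:

    salary = 7000
    print("Total salary is salary")        # the identifier inside quotes is printed as plain text
    print("Total salary is : ", salary)    # the value of the identifier is printed after the string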

Explanation
An identifier inside double quotes will be printed by Python as the name of the identifier rather
than the value of the identifier. Thus, the statement print("Total salary is salary") prints
“Total salary is salary” on the screen. But if we need to print the value of the identifier, we need
to write identifier name without double quotes. Hence, the next statement print("Total
salary is : ",salary) prints "Total salary is: 7000". It should be noted that the string
and identifier are separated and concatenated with comma (,).

We know that each print statement corresponds to output in a new line as demonstrated in the
following example:
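
A minimal sketch of such an example:

    print("Welcome")
    print("to")
    print("Python World")
    # Each print() ends with a newline, so the three words appear on three separate lines.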

Explanation
We can observe that the use of each print statement displays the result in a new line. Hence,
"Welcome", "to", "Python World" are printed on three different lines.

1.5.2 Print with "end" Argument


Sometimes it is convenient to view the output of a single line of printed text over several Python
statements. As an example, we may compute part of a complicated calculation, print an
intermediate result, finish the calculation, and print the final answer with the output all appearing
in one line of text. The “end” argument in print statement allows us to do so.
In the following example, the user is able to print the output in a single line. The “end”
argument prevents the control from shifting to the next line. This is illustrated in the
following examples:
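
A minimal sketch of printing over several statements on a single line using the end argument
(the words used are illustrative):

    print("Welcome", end=' ')
    print("to", end=' ')
    print("Python World")
    # Output: Welcome to Python World   (all three words on one line)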

Explanation
In the above program, we have used the “end” argument in the print() function. This will cause the
cursor to remain on the same line for the next text. Hence, all the words are printed in the same
line.

Explanation
The statement print('Hello Dear') is an abbreviated form of the statement print('Hello
Dear', end='\n'), that is, the default ending for a line of printed text is the string ‘\n’ (the
newline control code). The next statement print ('Enter your name ', end='') terminates
the line with the string ‘Enter your name’ rather than the normal \n newline code. The
difference in the behavior of the two statements is indistinguishable. Since end argument is
used, "Welcome" is written in the same line. However, the statement print(end='Nice to see
you ') moves the cursor down to the next line.

1.5.3 Print with "sep" Argument


The print() function uses a keyword argument named “sep” to specify the string to insert
between items. The value of the “sep” argument allows us to control how the print() function
visually separates the arguments it displays. The name “sep” stands for separator. By default, the
print() function places a single space between the items it prints; that is, the default value of
“sep” is ' ', a string containing a single space. The following example illustrates
the same:
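
A minimal sketch of the kind of calls being described; the items and separator strings are
illustrative:

    print('A', 'B', 'C')               # default separator: a single space
    print('A', 'B', 'C', sep='')       # no separator at all
    print('A', 'B', 'C', sep=',')      # comma as separator
    print('A', 'B', 'C', sep=':')      # colon as separator
    print('A', 'B', 'C', sep='---')    # a separator of multiple characters such as dashes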

Explanation
The first output shows print’s default method of using a single space between printed items.
The second output line uses no space as separators, since sep=’’. The third output line uses
comma as separator. The fourth output runs the items together with an empty string separator.
The fifth output line uses colon as separator. The sixth output shows that the separating string
may consist of multiple characters like dashes.

1.6 Input in Python


Input in a Python program is primarily taken using the input() function. The input() function
helps to take the string data from the user. However, int() and float() functions can be used
along with input() if the user wants to take only integer and float inputs, respectively. The
eval() function is commonly used; it determines the type of the input according to the value
entered by the user.

1.6.1 Input() Function
This function is primarily used to take input from the user. It should be noted that the input()
function accepts only string data. The following program illustrates the same:
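
A minimal sketch of the program being explained; the prompt text is illustrative:

    num = input("Enter a number: ")
    print(num)
    print(type(num))    # prints <class 'str'> because input() always returns a string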

Explanation
In the above program, the user is asked to enter a number using the input() function. It
should be noted that the input() function takes input in the form of a string only; hence the
input is converted into a string. The user enters 11245, which is a number, but the input()
function converts it into a string. Hence, when the data type of “num” is printed using the
type() function, the class “str” is printed.

It is advisable to use the input() function when the user wants to enter string data only, as
shown in the following program.
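
A minimal sketch of such a program, with an illustrative prompt:

    username = input("Enter your name: ")
    print("Hello and Welcome", username)    # the name is printed along with the string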

Explanation
In the above program, the user is prompted to enter a name, which is stored in the variable
“username”. When the username is printed along with the string “Hello and Welcome”, the
name appears after the string.

1.6.2 Input with int() Function


The program in Section 1.6.1 illustrates the use of the input() function to input a string. However,
if the user wants to enter numeric data, there is a need to use other functions along with the
input() function. Python accepts numeric input from the user in the form of an integer or a float
number using the int() and float() functions, respectively. The use of int() is demonstrated
in the following example:
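
A minimal sketch of the program being explained; the second input (sales) follows the same
pattern:

    salary = int(input("Enter the salary:"))
    sales = int(input("Enter the sales:"))
    print("Salary is: ", salary)
    print("Sales is: ", sales)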

Explanation
After displaying the string “Enter the salary:”, the program’s execution stops and waits for the
user to type-in some text and then press the enter key. The string produced by the input ()
function is passed to the int() function that produces a value to assign to the variable “salary”.
The command salary = int(input(“Enter the salary:”)) hence prompts the user to enter
an integer value and store it in the identifier salary. The print command prints the value of
salary. It should be noted that In [1]: displays the results when the program was executed for the
first time and In [2]: displays the results when the same program was executed for the second
time.
The above program accepts integer values for salary and sales. However, an error would have
been generated if the salary were entered in decimal form, because a float value is not
automatically converted to an integer in Python.

1.6.3 Input with float() Function


The float() function along with input() helps to accept decimal values from the user.
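
A minimal sketch of the program being explained, assuming a side of 4.2 is entered:

    side = float(input("Enter the side of the square: "))
    area = side * side
    print("Area of the square is: ", area)           # prints the value stored in area
    print("Area of the square is: ", side * side)    # does the calculation inside print()
    # Both statements print 17.64 for a side of 4.2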

Explanation
In the first line, the user is prompted to enter the side of the square which is stored in variable
“side”. The area is computed and stored in variable named “area”. The first print statement
prints the answer from the variable “area” on the screen. The second print statement does the
calculation inside the print statement and prints the result. Both ways of writing the print
statement thereby give the same result (17.64). The user can choose either way depending on
the requirement and utility.

1.6.4 Input with eval() Function


Python provides a distinct feature for taking input from the user using the eval() function, which
considers the data type according to the nature of the input provided. The eval() function can also
be used to convert a string representing a numeric expression into its evaluated numeric value.
This is demonstrated in the following program:
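
A minimal sketch of the program being explained; the prompt text is illustrative:

    val1 = eval(input('Please enter the first value: '))
    print('val1 =', val1, 'type:', type(val1))
    # Entering 2678 prints:   val1 = 2678 type: <class 'int'>
    # Entering 1398.26 prints a float type; entering a value in double quotes prints a str type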

Explanation
The first statement asks the user to enter the first value. The value entered by the user is stored
in val1 (2678). The statement print('val1 =', val1, 'type:', type(val1)) displays the
output on the screen as: val1 = 2678 type: <class ‘int’>. It can be observed that val1 is written
as it is, since it is contained inside the single quote. After the comma, val1 is written without
quotes, hence the value of the identifier val1 will be displayed (2678). After the comma, type is
written in single quote, hence it is printed as it is and then type(val1) is printed. Since the user
entered an integer value, class is printed as int. Similarly, in the next example, the user enters a
float value (1398.26); hence the class is printed as float. In the last example, the user enters a
string in double quote; hence the class is printed as “str” that represents the string.

In Python, unlike other programming languages such as C and C++, it is possible for the user to
give multiple inputs using one single input statement. This helps in assigning multiple values to
multiple variables in one statement. For example, var1, var2 = eval(input('Please enter number 1 and
number 2: ')) will prompt the user to enter two comma-separated numbers, which are stored in var1 and var2.
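The three-number variant described below is not listed in this text; a plausible sketch (the prompt and the use of a sum are assumptions based on the stated result of 31):

a, b, c = eval(input('Please enter three numbers separated by commas: '))
print(a + b + c)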

Explanation
This example shows that the user is able to enter three numbers at the same time using one
eval() function. The user has entered three numbers 10, 15, and 6 and hence the result is 31. It
should be noted that for giving multiple inputs through one statement, the numbers entered by
the user should also be separated by a comma as shown in the eval() function.

The statement a,b,c = int(input("Enter three numbers:")) will result in a
run-time error because the int() function returns a single value, which cannot
be unpacked into three variables. Giving multiple inputs in one statement is
possible only through the eval() function.

1.7 Operators
An operator is a symbol that tells the interpreter to perform specific mathematical or logical
operations. Python provides the following types of operators: arithmetic operators,
assignment operators, relational operators, logical operators, and Boolean operators. This section
discusses these operators in detail.

1.7.1 Arithmetic Operators


There are different arithmetic operators like Addition (+) for adding operands; Subtraction (−)
for subtracting the second operand from the first; Multiplication (*) for multiplying operands; Division
(/) for dividing the numerator by the denominator; Modulus operator (%) for determining the remainder
after an integer division; Exponential operator (**) for calculating the exponential power value; and
Integer division (//) for performing integer division and displaying only the integer quotient. It
should be noted that in an expression where multiple operations take place, the bracket
(parenthesis) is given the priority, followed by the exponential operator; then division, multiplication,
integer division, and modulus, which share the same precedence and are evaluated from left to right;
and finally addition and subtraction. The assignment operator is given the last
priority and hence is evaluated at the end; the value of the right-hand side of the expression is stored in the left-
hand side of the expression.
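The arithmetic operators can be tried with a short program of the following kind (a sketch; the actual listing, whose explanation follows, uses the inputs 101 and 4):

num1 = int(input("Enter the first number: "))
num2 = int(input("Enter the second number: "))
print(num1 + num2)
print(num1 - num2)
print(num1 * num2)
print(num1 / num2)     # float division
print(num1 // num2)    # integer division
print(num1 % num2)     # remainder
print(num1 ** num2)    # exponentiation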

Explanation
The first two statements prompt the user to enter two numbers. When the user types the number
101 and then presses the enter key, num1 is assigned the integer 101. Similarly, the user
enters the number 4, which is stored in num2. Later, the different arithmetic operators produce the
desired results. It can be observed that num1/num2 gives a float result, while num1//num2 performs
integer division and hence gives only the integer quotient and ignores the decimal part. The
remainder 1 is displayed using the modulus operator (%). The ** operator is used for exponential
computation.

1.7.2 Assignment Operators


The assignment operator (=) assigns the value of the right-side expression to the left-side
operand. For example, Profit = Amount − Expenses will assign the value of Amount − Expenses to
Profit. Compound assignment operators such as +=, −=, *= and /= combine an arithmetic operation
with assignment. It should be noted that unlike C and Java, Python does not have increment and
decrement operators. The following example demonstrates the use of different assignment operations in Python.
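The book's listing is not reproduced; a small sketch of assignment and compound-assignment operators (the variable names and values are assumptions):

amount = 5000
expenses = 1200
profit = amount - expenses   # simple assignment: profit is 3800
profit += 500                # same as profit = profit + 500
profit -= 200                # same as profit = profit - 200
profit *= 2                  # same as profit = profit * 2
profit /= 4                  # same as profit = profit / 4
print(profit)                # 2050.0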

For the above input, print the result as The result of basic mathematical
operations on 2 and 51 is 53, 49, 102 and 25.5.

1.7.3 Relational Operators


The different relational operators supported in Python are Equals (==), Not equals (!=), Greater
than (>), Less than (<), Greater than or equal to (>=), Less than or equal to (<=). The use of these
operators is explained in the following program:
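A minimal sketch of such a program (the operand values are assumptions):

num1 = 101
num2 = 4
print(num1 == num2)   # False
print(num1 != num2)   # True
print(num1 > num2)    # True
print(num1 < num2)    # False
print(num1 >= num2)   # True
print(num1 <= num2)   # False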

1.7.4 Logical Operators


These operators are used when we need to check multiple conditions. Each of the conditions is
evaluated to True or False and then the combined decision related to all conditions is taken.
There are basically three logical operators in Python: "and", "or", and "not". The
"and" operator returns True if all the conditions are True. The "or" operator returns True if at
least one of the conditions is True. The "not" operator returns the opposite of the value. This
means that if the value is "True", it will return "False" and vice versa. These operators are explained
in the following program:
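The listing is not shown here; the following sketch is consistent with the explanation below (the specific values of num1, num2, and num3 are assumptions chosen to reproduce the stated outcomes):

num1 = 10
num2 = 50
num3 = 30
print(num1 > num3 and num2 > num3)   # False: only the second comparison holds
print(num1 > num3 or num2 > num3)    # True: at least one comparison holds
print(not (num1 < num3))             # False: num1 < num3 is True, so "not" inverts it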

Explanation

In the first line with the print command, num1>num3 does not hold, while num2>num3 holds.
Since all the conditions do not hold for the "and" operator, the result is "False". In the second
line with the print command, the result is "True", because at least one condition holds. In the
last line, num1<num3 is "True", but when the "not" operator is used, we get the result "False".

It is also possible to use logical operator on Boolean values. The Boolean input will always
return a Boolean output as shown in the following program:
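A sketch matching the explanation below:

val1 = True
val2 = False
print(val1 and val2)   # False
print(val1 and val1)   # True
print(val2 and val2)   # False
print(val1 or val2)    # True
print(not val1)        # False
print(not val2)        # True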

Explanation
The use of “and” operation on val1 and val2 returns “False” since both are not True. However,
“val1 and val1” returns “True” since both the conditions are True. Similarly, “val2 and val2”
returns “False” since both the conditions are False. The use of “or” operator returns “True”
since at least one of the conditions is True. The “not val1” returns False (opposite of val1 that is
True) and “not val2” returns True (opposite of val2 that is False).

1.7.5 Operator Precedence


In an expression with multiple operators, operator precedence determines the grouping of terms
in an expression and decides how an expression will be evaluated. Certain operators have higher
precedence than others. For example, ans = 10 + 5 * 4; here, ans is assigned 30, not 60. Since the
multiplication operator has higher precedence than the addition operator, so 5 is first multiplied
with 4 and then added to 10. The following table shows operator precedence in descending order:

Operator                      Details
()                            Parenthesis
**                            Exponential
/, *, //, %                   Division, Multiplication, Integer Division, Modulus
+, −                          Addition and Subtraction
>, >=, <, <=, ==, !=          Relational Operators
=, +=, −=, **=, /=, //=, %=   Assignment Operators
not, or, and                  Logical and Boolean operators

The operators that have a higher precedence appear before the operators that have a lower
precedence. This means that within an expression, higher precedence operators are evaluated
first. This implies that parenthesis has the highest precedence and hence will be evaluated first
before any other operator. The following program shows the impact of operator precedence in
Python:
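The program itself is not listed here; a sketch with values inferred from the explanation below (a = 200, b = 100, c = 50):

a = 200
b = 100
c = 50
print((a + c) * (a + b))   # parentheses first: 250 * 300 = 75000
print(a + b * c)           # multiplication first: 200 + 5000 = 5200
print(a > b and b < c)     # relational before logical: True and False -> False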

Explanation
In the first example, "a+c" and "a+b" are evaluated first because they are inside the
parentheses, and the results are then multiplied with each other. Hence, the result is
250 * 300 = 75000. In the second example, since multiplication has higher precedence than addition,
the multiplication of b and c is done first and the product is then added to a. Hence, the result is
200 + 5000 = 5200. Similarly, in the last case, since relational operators have a higher precedence
than logical operators, "a>b" and "b<c" are evaluated first and then the "and" operator is applied.
Since a>b is "True" and b<c is "False", the "and" operator used on "True" and "False" returns "False".

Use 10 different operators on some numbers in a single statement and analyze
the results.

1.8 Core Modules in Python


In Python, a module is a file containing definitions, statements, functions, classes, and variables.
A module helps to logically organize the code; grouping of related code into a module helps to
use the code efficiently. Each module has a distinct name, which is used as the global symbol in
accessing all functions defined in the module. Python has a rich library of more than 100
standard modules. The list of modules can be accessed from https://docs.python.org/3/py-modindex.html.
Some of the commonly used modules and the functions which they include are
calendar (functions related to calendars); collections (container data types functions); datetime
(basic date and time type functions); os (operating system interfaces); random (generate pseudo-random
numbers with different common distributions); re (regular expression operations); math
(mathematical functions); statistics (statistics functions); string (common string functions); time
(time access and conversion functions); tkinter (interface for graphical user interface); pickle
(converts Python objects to streams of bytes and vice versa), etc.
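As an illustration (not taken from the book's listings), a few of these modules can be used as follows:

import math
import random
import datetime

print(math.sqrt(144))          # 12.0
print(random.randint(1, 6))    # a pseudo-random integer between 1 and 6
print(datetime.date.today())   # today's date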

1.9 Core Libraries in Python


Python has a rich collection of libraries contributed by many authors. Each library is related to
one domain and has a huge collection of functions and sub-packages. Depending on the
requirement, the user can access a library, or specific functions related to a domain from it,
using the import statement in Python. This helps in efficient code management.
Libraries are related to machine learning, specialized statistical methods, accessing functions
to arrays, visualization effects, etc. Some of the common libraries include:

1. NumPy (Numerical Python): It is the fundamental library in Python around which the
scientific computation stack is built. It has many high-level mathematical functions for
large, multi-dimensional arrays and matrices.
2. pandas: This library helps to work with labeled and relational data. It is primarily used for
data cleaning, data extraction, data processing, data aggregation, and visualization. It is like
a spreadsheet in Python.
3. matplotlib: This library helps in creating simple and powerful visualizations like line plots,
scatter plots, bar charts and histograms, pie charts, stem plots, contour plots, quiver plots,
spectrograms, etc. It is easy to apply different formatting styles like title, labels, grids,
legends, etc. Creating an image from multiple images is a special feature of this library.
4. seaborn: This library is generally used for creating visualization of statistical models.
These include heat maps and charts that show better effects for summarized data.
5. scipy (Scientific Python): This library contains modules for linear algebra, optimization,
statistics, etc. The functions available in the respective module provide significant power to
the user with high-level commands and classes for statistical calculations related to data.
6. statsmodels: This library enables the users to conduct data exploration, estimate statistical
models, and perform statistical analysis on the data efficiently.
7. sklearn (Scikit-learn): This library is built upon the SciPy and hence SciPy must be
installed before this library. This library provides a range of supervised and unsupervised
machine learning algorithms like clustering, cross validation, datasets, dimensionality
reduction, feature extraction, feature selection, parameter tuning, manifold learning,
generalized linear models, discriminant analysis, Naïve–Bayes, lazy methods, neural
networks, support vector machines, decision trees, etc.
8. nltk (Natural Language Toolkit): The nltk library is used for common tasks associated
with natural language processing. The functionality of nltk allows a lot of operations such
as text tagging, classification, tokenizing, name entities identification, building corpus,
stemming, semantic reasoning, etc. All these building blocks allow for building research
systems for complex tasks like sentiment analytics, text analytics, summarization, etc.
9. keras: Keras is a library meant for deep learning with high-level neural networks. It runs on
top of TensorFlow, CNTK, or Theano. It allows for easy and fast prototyping with its
important features of modularity and extensibility. It supports both convolutional networks
and recurrent networks, and also runs on GPU for faster processing of large data.

Before a library can be used for analysis, it must be present in the Python
environment. Some of the basic libraries are available in the environment when
the installation is done. These libraries can be used by executing any of the
import commands discussed below. However, if a library is not present in the
environment, we may need to install it using the "pip install
library_name" command at the Anaconda prompt or by changing the settings in
the PyCharm software. The advantage of PyCharm is that it displays most
of the Python libraries, which become available for analysis after installing the
required library.

Basically, there are three different ways to import functions from different modules/libraries:

1. from module/library import *: This approach helps to import all the functions from
the specified module/library into the program. We can directly write the name of the
function for execution without specifying the module or library name. However, this
approach may cause ambiguity if functions with the same name exist in different
libraries, which may finally make the code less maintainable.
2. Import module/library name: This approach also helps to import all the functions from
the specified module/library. The difference between the two approaches is that through this
approach, we need to specify module/library name before calling a function and through the
previous approach, the function can be called directly by specifying the function name only.
Since the module name is also written along with the function name, this approach removes
the limitation of the first approach by removing chances of ambiguity related to function
name. Besides, this approach allows us to rename the module/library for easier programming.
3. From module/library import function1, function2, …: This is considered to be an
efficient approach since it helps the user to import only the required functions from the
related module/library, hence making the code more efficient and manageable.
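A short sketch of the three import styles using the standard math module (not taken from the book's listing):

# 1. Import every function from a module into the current namespace
from math import *
print(floor(2.9))            # the function name is used directly

# 2. Import the module itself, optionally renaming it
import math as m
print(m.sqrt(25))            # the (renamed) module name prefixes the function

# 3. Import only the required functions
from math import sqrt, floor
print(sqrt(25), floor(2.9))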

Summary
• Python is an interpreted, general-purpose programming language that is widely used for
statistical analysis, data visualization, and reporting.
• Efficient programing can be done using Anaconda or PyCharm software. Anaconda software
includes Jupyter Notebook and Spyder IDE.
• A single line comment starts with a hash sign (#) at the start of a line and multi-line
comments can be created using triple double quotes (“””) before and at the end of block.
• Python is a case-sensitive programming language.
• Python does not allow punctuation characters such as @, $, and % within identifiers.
• The reserved words may not be used as constants or any other identifier names.
• Input in a Python program is taken using the input() function and output is displayed using
the print() function.
• The input() function accepts only string data. Python accepts numeric input from the user in
the form of an integer or float number using the int() and float() functions, respectively
along with input().
• It is possible to give multiple inputs using one single input statement.
• The “sep” argument allows us to control the way in which print() function visually
separates the arguments.
• Python language has the following types of operators: arithmetic operators, relational
operators, assignment operators, logical operators, and Boolean operators.
• In Python, a module is a file containing definitions, statements, functions, classes, and
variables. Python modules include math, statistics, collections, string, array, time, date-time,
calendar, etc.
• Python library is a collection of functions and sub-packages. Some of the common libraries
include NumPy, SciPy, pandas, statsmodel, matplotlib, seaborn, sklearn, keras, NLTK, etc.

Multiple-Choice Questions

1. Python is not a case-sensitive language.


(a) True
(b) False
(c) May be
(d) None of these
2. This software does not help in Python programming.
(a) Anaconda
(b) Spyder
(c) Jupyter
(d) Turbo
3. The result of print(22%5) is
(a) 4.4
(b) 2
(c) 4
(d) None of these
4. ____________ is not a keyword.
(a) value
(b) int
(c) eval
(d) print
5. The result of print(20+30*3) is
(a) 110
(b) 150

(c) 90
(d) None of these
6. A single-line comment in Python is written using ____________ at the start and at the end
of a block.
(a) //
(b) /*
(c) “””
(d) #
7. The output of print("The", "country", sep='--') is
(a) The country
(b) The country --
(c) --The--country
(d) The--country
8. ____________ is not a library in Python.
(a) sklearn
(b) shapiro
(c) SciPy
(d) NumPy
9. If a = 10, b = 20, c = 20, then the result of print(c>b and b>a) is
(a) True
(b) False
(c) Undefined Error
(d) None of these
10. ____________ is not a module in Python.
(a) Array
(b) Collections
(c) Statistics
(d) None of these

Review Questions

1. Explain the use of different relational operators in Python with example.


2. How do we represent comments in Python?
3. What is the utility of eval() function?
4. What are the different naming rules for a variable?
5. What is the utility of “sep” argument in print() function?
6. Is it possible to give multiple inputs using a single statement? Justify.
7. Differentiate between the int() and float() functions for taking input from user.
8. Discuss the significance of operator precedence in Python.

9. Explain the basic functionality of some common libraries available in Python.
10. What is a module? Discuss the important modules in Python.

CHAPTER
2

Control Flow Statements

Learning Objectives
After reading this chapter, you will be able to

• Understand the concepts of programming.


• Develop the analytical skill for decision-making structures and looping.
• Build expertise for exception handling in programming.
• Foster analytical and critical thinking abilities for creating user-defined functions.

All the programs discussed in Chapter 1 followed a linear sequence: Statement 1, Statement 2,
etc., until the last statement is executed and the program terminates. But, these linear programs
can solve limited problems. This chapter introduces constructs that allow program statements to
be executed under certain conditions or optionally executed, depending on the context of the
program’s execution. This chapter also throws light on the loops that are used for executing
particular course of action/statements multiple times. Functions, which are considered an important
concept in programming, are also explained in this chapter to enable efficient programming.

2.1 Decision-Making Structures


Decision-making structures require the programmer to specify one or more conditions to be
evaluated or tested by the program, along with a statement or statements to be executed if the
condition is determined to be true, and optionally, other statements to be executed if the
condition is determined to be false. The “if” structure is the general form of a typical decision-
making structure found in most of the programming languages. When we need to execute a set of
statements based on a condition, then we need to use the “if” structure. There are four types of
“if” statements that we can use in Python programming based on the requirement: “If”
Statement, If…else Statement, Nested “if” Statement, and If-elif-else Ladder.

2.1.1 “If” Statement


A simple “if” statement consists of a Boolean expression followed by one or more statements. If
the Boolean expression evaluates to be true, then the block of code inside the “if” statement will
be executed. If the condition is false, then the statements inside the “if” statement body are
completely ignored (Fig. 2.1). The "if" statement, which consists of a condition followed by a statement
or a set of statements, is written as follows:

Syntax
if (boolean_expression):
    statement(s) will execute if the boolean expression is true

Figure 2.1 Flowchart of “if” statement.
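The program whose behaviour is explained below is not listed here; a minimal sketch (the variable names p1 and p2 follow the explanation, while the prompts are assumptions):

p1 = eval(input("Enter the profit of the first organization: "))
p2 = eval(input("Enter the profit of the second organization: "))
if p1 > p2:
    print("The first organization is performing better")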

Explanation
In the first case when p1>p2, the output “The first organization is performing better” is
printed. In the second case when p1<p2, no output is generated because no instruction has been
given regarding what should happen if the condition is not met. To handle this type of situation,
we use if…else statement as discussed in the following subsection.

2.1.2 If…else Statement


An "if" statement can be followed by an optional "else" statement which is executed when the
Boolean expression is false. If the Boolean expression evaluates to be false, then the block of
code under the "else" statement will be executed (Fig. 2.2). For example, if a number is
greater than 0, then we want to print "Positive Number"; but if it is less than zero, then we want
to print "Negative Number". In this case, we have two print statements in the program, but only
one print statement executes at a time based on the input value. It should be noted that Python does
not use curly braces; the "else" keyword is written at the same indentation level as its "if" statement
and is followed by a colon.

Syntax
if (boolean_expression):
    statement(s) will execute if the boolean expression is true.
else:
    statement(s) will execute if the boolean expression is false.

Figure 2.2 Flow chart of if…else statement.
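A sketch of the corresponding program (prompts are assumptions):

p1 = eval(input("Enter the profit of the first organization: "))
p2 = eval(input("Enter the profit of the second organization: "))
if p1 > p2:
    print("The first organization is performing better")
else:
    print("The second organization is performing better")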

Explanation
Unlike the previous program, this program shows that output is generated in both the cases.
However, it varies depending on the values. If the condition p1>p2 is true, the output is “The
first organization is performing better” and if the condition is false, then the output is
“The second organization is performing better”.

It is important to use colon at the end of “if” and “else” statement, otherwise an
error will be generated.

It is also possible to execute multiple lines if the specified criterion is met. But unlike other
programming languages, where we use curly braces to create a block of statement, in Python we
define the block through proper indentation.
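The multi-statement listing is not reproduced; a sketch consistent with the explanation below (the threshold 9000 and the messages come from the explanation, the prompt is an assumption):

profit = eval(input("Enter the profit: "))
if profit > 9000:
    print("Good Profit")
    print("Congratulations")
else:
    print("Poor Profit")
    print("Work Hard")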

Explanation
Multiple statements are executed depending on whether the criterion (profit>9000) is met or
not. If profit is greater than 9000, then two statements are executed and hence strings “Good
Profit” and “Congratulations” are printed on the screen. This is because both the print
statements are inside the “if” statement. Similarly, both the strings “Poor Profit” and “Work
Hard” are printed if the condition is false, since both of them are written inside the else
statement.

The importance of indentation can be understood by the small variation in this example:
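A sketch of the variation (only the indentation of the last print statement changes):

profit = eval(input("Enter the profit: "))
if profit > 9000:
    print("Good Profit")
    print("Congratulations")
else:
    print("Poor Profit")
print("Work Hard")    # not indented, hence part of the main program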

Explanation
Unlike the previous example, when profit is greater than 9000, "Work Hard" is also printed along
with "Good Profit" and "Congratulations", because it is not a part of the "else" block but belongs
to the main program, outside the indentation of the "else" block. This is because no spaces are
left before the print statement for "Work Hard", which makes Python understand that this
statement is not part of the "else" block but is a part of the main program. In the next case,
both the statements are executed: "Poor Profit" is printed since the condition is not "True", and
"Work Hard" is printed from the main program. Thus, it is important to know that spaces play a
major role in Python.

2.1.3 Nested “if” Statement


A nested "if" statement means an "if" statement inside another "if" statement. Python allows us to
nest "if" statements within "if" statements; a nested "if" is an "if" statement that is the target of
another "if" statement (Fig. 2.3). In the real world, there are situations in which multiple
conditions need to be evaluated before taking a decision. For example, the decision related to
whether a person can get a passport or not depends on multiple conditions: he/she should
have no criminal background and he/she should be an Indian citizen.

Syntax
if (condition1):
    Executes when condition1 is true
    if (condition2):
        Executes when condition1 and condition2 both are true
    else:
        Executes when condition2 is not true but condition1 is true
else:
    Executes when condition1 is not true

Figure 2.3 Flowchart of nested “If” statement.
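The listing is not shown; a minimal sketch matching the explanation below (prompts are assumptions):

p1 = eval(input("Enter the price of the first product: "))
p2 = eval(input("Enter the price of the second product: "))
p3 = eval(input("Enter the price of the third product: "))
if p1 > p2:
    if p1 > p3:
        print("First product is expensive")
    else:
        print("Third product is expensive")
else:
    if p2 > p3:
        print("Second product is expensive")
    else:
        print("Third product is expensive")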

Explanation
It is also possible to check multiple conditions using multiple "if" statements as explained in the
previous section. If p1>p2, then the program checks the condition p1>p3. If p1>p3 also holds, then the
output is printed as "First product is expensive". This output is printed after two conditions are
checked and met (p1>p2 and p1>p3). But, if p1 is not greater than p3, then "Third product is
expensive" is printed, because this condition was tested only when the first condition p1>p2 was
satisfied. This implies that p1 is greater than p2 but less than p3, which finally results in p3
being the greatest number. The "else" clause of p1>p2 is executed when the condition is not met
(p1<p2). It then checks the condition p2>p3. The output "Second product is expensive" is
printed when both the conditions (p2>p1 and p2>p3) are met. Similarly, "Third product is
expensive" is printed if p2>p1 but p2<p3.

2.1.4 If-elif-else Ladder


An "if" statement can be followed by an optional "elif…else" ladder, which is very useful
to test various conditions using a single "if…elif" statement. Here, a user can decide among
multiple options. The conditions are evaluated from the top down. As soon as one of the
conditions controlling an "if" or "elif" is true, the statement associated with it is executed, and the
rest of the ladder is bypassed. If none of the conditions is true, then the final "else" statement
will be executed (Fig. 2.4). When using "if, elif, else" statements, there are a few points to keep
in mind. An "if" can have zero or one "else", and it must come after any "elif"s. An "if" can
have zero to many "elif"s, and they must come before the "else". Once an "elif" succeeds,
none of the remaining "elif"s or the "else" will be tested.

Syntax
if (boolean_expression 1):
    Executes when the boolean expression 1 is true.
elif (boolean_expression 2):
    Executes when the boolean expression 2 is true.
elif (boolean_expression 3):
    Executes when the boolean expression 3 is true.
....................
else:
    Executes when none of the above conditions is true.

Figure 2.4 Flowchart of “if-elif-else” ladder.
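The program is not listed in this text; a sketch using the elif ladder (the thresholds 900, 750 and 500 and the messages "Excellent Performance" and "Average Performance" come from the explanation below; the remaining messages are assumptions, and the book's version may have used nested if…else as the explanation describes):

qty = eval(input("Enter the quantity sold: "))
if qty > 900:
    print("Excellent Performance")
elif qty > 750:
    print("Good Performance")      # assumed label
elif qty > 500:
    print("Average Performance")
else:
    print("Poor Performance")      # assumed label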

Explanation
The first condition is checked and since the value of "qty" is found to be greater than 900, it will
print "Excellent Performance" and will not execute anything in the "else" part. In the
second case, when the value of "qty" is less than 900, it will enter the "else" part and
will check the second condition. If the second condition (qty>750) is not met, it will enter the
"else" part of the second "if" statement and will check the third "if" condition (qty>500).
If it finds that this condition is met, it will not enter the "else" statement and will execute
the print statement written in the "if" statement. Thus, it will print "Average Performance" and
complete the check.

This program considers a line with equation a * x + b * y − c = 0 and a point (x,y). The code
checks if the point lies either on the line or on the left side or on the right side of the line. A point
(x1,y1) is said to lie on the line if a * x1 + b * y1 − c = 0, the point lies on the left side if
a * x1 + b * y1 − c < 0, and the point lies on the right side of the line if a * x1 + b * y1 − c > 0.
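A minimal sketch of this program (prompts and message wording are assumptions):

a = eval(input("Enter coefficient a: "))
b = eval(input("Enter coefficient b: "))
c = eval(input("Enter coefficient c: "))
x = eval(input("Enter the x coordinate of the point: "))
y = eval(input("Enter the y coordinate of the point: "))
if (x * a) + (y * b) - c < 0:
    print("The point lies on the left side of the line")
elif (x * a) + (y * b) - c > 0:
    print("The point lies on the right side of the line")
else:
    print("The point lies on the line")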

Explanation
The user is asked to enter the coefficients of line: a, b and c. The user is then prompted to enter
the coordinates of the point. If ((x*a)+(y*b)-c)<0, then it displays the message that the point
lies on the left side of the line. If this condition is not holding good, it checks for the next
condition whether ((x*a)+(y*b)-c)>0. The result is displayed if that condition is holding good
else the last condition is checked. The above program is an example of using multiple ladder of
if statement.

We can also check multiple conditions using logical operators. The condition specified in the
"if" statement may be a complex expression in which the logical operators "and" and "or"
are used. The "and" operator is used when we want both the conditions to be fulfilled
simultaneously, whereas the "or" operator is used if we want at least one condition to be
fulfilled.
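The listing is not reproduced and its details are only partly recoverable; one plausible sketch in the same spirit (the second variable, the thresholds and the messages are assumptions):

salary = eval(input("Enter the salary: "))
sales = eval(input("Enter the sales: "))
if salary <= 0 or sales <= 0:
    print("Invalid data entered")          # needs only one condition to be true
if salary > 0 and sales > 0:
    print("Data accepted for processing")  # needs both conditions to be true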

Explanation
The first "if" statement uses the "or" operator, which implies that the output will be displayed if
only one of the two conditions is met. The other "if" statement uses the "and" operator, which
implies that the output will be displayed only when both the conditions are true. The first execution
shows that salary=0, hence only one condition is met, which implies that the first output is
displayed. In the second case, both the conditions of the "if" statement using "and" are met.

The above program cannot be considered an efficient program because of the poor programming
style involved. It is worth mentioning that programming should be done efficiently. The same
program can be made more efficient in the following manner:

Explanation
If the condition salary<=0 holds true, then “Invalid salary” is printed. The second condition
salary<=50000 is checked only if the first condition does not hold true. This implies that this
condition is tested only when salary>0. Similarly, the condition salary<=100000 is checked
only when salary>50000. This implies that the output "Junior Manager" will be generated
only if the salary is between 50000 and 100000. Similarly, the condition (salary<=300000) is
checked only when the condition salary<100000 is not met. This further implies that “Middle-
Level Manager” will be printed only when the salary is greater than 100000 but less than
300000. When the salary is greater than 300000, it prints “Senior Manager”.
In the first example, when the salary was 60000, the first condition was checked. Since it
was not holding true, hence second “if” condition was checked. Since the second condition was
not holding true, hence third “if” condition was checked. Now the third condition was holding
true, hence “Middle-Level Manager” was printed. In the second example, the salary was
30000, the second condition was holding true and hence result is generated as “Junior
Manager”. In the third example, when the salary is 200000, the else statement is executed and
hence “Senior Manager” is printed.

This program will check the nature of string depending on the user choice. If choice is pangram,
the presence of all the alphabets will be checked within the string; if choice is consogram, the
presence of all the consonants will be checked within the string and if the choice is vowgram, the
presence of all the vowels in the string will be checked.

Explanation

In this program, the user is given three choices to determine whether the string is pangram,
consogram or vowgram. A string named letters is created corresponding to the choice of the
user. If the user enters pangram, letters contains all the English alphabets, if the user enters
consogram, letters contains all the consonants and if the user enters vowgram, letters contains
all the vowels. A Boolean variable named flag is created with default value as True. Each and
every letter from the letters is checked in user string. If any letter is not found in the string, the
Boolean variable will be assigned the value as false and the loop will exit with the break
statement (refer to Section 2.4.1 for detail). If flag is True, this means all the letters were found
in the string, hence it is a pangram/consogram/vowgram.

2.2 Loops
A loop statement allows us to execute a statement or group of statements multiple times.
Repetitive commands are executed by loops: the "for" loop and the "while" loop. Python loops
are particularly flexible and are not limited to integers; we can also loop over strings, lists,
tuples, and other iterable objects.

2.2.1 For Loop


If the number of repetitions is known in advance, then a "for" loop can be used. For executing
a "for" loop with range(), we need to specify the start, end, and step arguments. The commands inside
the "for" loop are executed with the counter starting from the specified start value; every time it is
incremented by the value specified in step, and the new value of the counter is checked till it reaches
the value specified in the end argument (Fig. 2.5).

Syntax
for var in range(start, end, step):
    commands to be executed
where
• var is the variable name
• start denotes the starting number
• end denotes the ending number
• step is the optional step for increment or decrement (default are +1 and −1)

Figure 2.5 Flowchart of “for” iterator.
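The listing (which prints a multiplication table) is not reproduced; a sketch consistent with the explanation below (the exact print formatting is an assumption):

num = int(input("Enter a number: "))
for k in range(1, 11):
    print(num, '*', k, '=', num * k)
print("Finished Task")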

Explanation
The user is prompted to give input of a number which is stored in “num”. The for loop uses a
counter named “k” and has two arguments for range: 1 and 11. This implies the starting value of
“k” will be 1 and the last value will be less than 11, which means that the last value will be 10.
The most important thing to observe is that the range function gives us values up to, but not
including, the upper limit. The absence of third-step argument implies that the default increment
of 1 will be applied. This means that the value of “k” will be incremented by 1 each time till it
reaches the value 11. This further implies that initially for “k” equal to 1, the print statement
will be executed; the value of “k” will then be incremented by 1 (k=2) and again the print
statement will be executed. This process continues till "k" is equal to 10. When the value of "k"
reaches 11, the loop stops executing, the control passes to the main program, and hence
"Finished Task" is printed.

However, the same program with a different step argument will show different results, as shown
in the following example.
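A sketch of the variant with a step of 2:

num = int(input("Enter a number: "))
for k in range(1, 11, 2):
    print(num, '*', k, '=', num * k)
print("Finished Task")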

Explanation
The third argument in range is 2. This implies that the step argument is 2 which further means
that each time, the value of “k” will be incremented by 2, unlike the earlier case where value of
“k” was incremented by default value of 1. Hence the table of 4 is printed only for k = 1, 3, 5, 7,
and 9.

It is also possible to have negative value for decrementing the value of step.
For example, if we need to print the table in decreasing order, we should have
syntax as: for k in range(10, 0, −2). In this case, the loop will start from 10 and
continue till 1 decrementing the value by 2 each time.

It is also possible to accept the number from the user for defining the starting and ending
arguments. This is shown in the following example:
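A minimal sketch of the series-sum program described below (the output message is an assumption):

num = int(input("Enter a number: "))
sum = 0
for k in range(1, num + 1):
    sum = sum + k
print("The sum of the series is", sum)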

Explanation
For printing the sum of the series starting from 1 till a number, the user is prompted to enter the
number which is stored in “num”. The starting value of “sum” is 0. The “for” loop is executed
for "num" times, since the starting value of "k" is 1 and the ending value is "num+1". Each time
the value of "k" is incremented by the default value of 1.

The following example reads a number from the user. Find the difference between the sum of
squares up to that number and the number itself.
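A sketch of this program (variable names follow the explanation below):

number = int(input("Enter a number: "))
sum = 0
for k in range(1, number + 1):
    sum = sum + k * k        # add the square of each number
ans = sum - number           # subtract the number itself
print(ans)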

Explanation
The number entered by the user is stored in the variable named “number”. The range inside the
for loop is specified from 1 to number+1; this will execute for loop till the number. The variable
“sum” thus stores the sum of the squares from 1 till that number. The variable “ans” will deduct
number from sum. For example: if a number is 3, then the output will be 11 (i.e., 1 *1 + 2 *2 +
3 *3 − 3 = 11).

We can also have different starting value. This is shown in the following program where the
starting value of the counter is other than 1.
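The Fibonacci listing itself is not shown; a sketch consistent with the explanation below (the names first, second and third are assumptions; "y" denotes the number of terms, as in the explanation):

y = int(input("Enter the number of terms: "))
first = 0
second = 1
print(first)
print(second)
for k in range(3, y + 1):
    third = first + second
    print(third)
    first = second
    second = third
print("Finished Task")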

Explanation

The first two numbers of Fibonacci series are 0 and 1 and are printed before “for” loop. Hence,
the starting value of counter “k” is 3 and the loop will execute till y+1. The “for” loop has four
statements inside it, hence all the four statements are executed every time the value of “k” is
incremented. Since in the Fibonacci series all the numbers need to be printed, hence the print
statement for printing the number is kept inside “for” loop. However, since “Finished Task” is
outside for loop, hence “Finished Task” is printed when the control is transferred outside for
loop to the main program.

The utility of “for” loop can be also shown in another example of calculating factorial of a
number.
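A minimal sketch of the factorial program (matching the variable names in the explanation below):

num = int(input("Enter a number: "))
fact = 1
for k in range(1, num + 1):
    fact = fact * k
print(fact)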

Explanation
The user is prompted to enter a number for calculating the factorial which is stored in num. The
value of “fact” is 1. The counter “k” has initial value as 1 and will execute till the value of “k”
reaches num+1. The control will be transferred back to the main program when the value of
counter “k” reaches to num+1. The program shows that for num=6, the initial value of “k” is 1
and it will execute the statements written inside “for” till “num” reaches the value 6. For the first
time when fact=1, k = 1, “fact” will be equal to 1(1*1). The second time when the loop got
executed, the initial value of “fact” is 1 and k = 2 and hence “fact” will be equal to 2 (2*1).
The third time the loop got executed, initial value of “fact” is 2 and k = 3 and hence “fact”
will be equal to 6 (2*3). Similarly, it will execute till “k” reaches the value 6 and hence final
value of "fact" will be 720. It can be observed that the value of "fact" is printed outside the
"for" loop and hence only the final value of "fact" (720) is printed. Using the same approach, the
factorial of 8 is calculated and the final "fact" is printed.

Display the values of each computation. Like in the program in the previous
page, it should display 6 * 1 = 6, 6 * 2 * 1 = 12, 6 * 3 * 2 * 1 = 36, 6 * 4 * 3 *
2 * 1 = 144, 6 * 5 * 4 * 3 * 2 * 1 = 720. It should also be printed in descending
order: 6 * 5 * 4 * 3 * 2 * 1 = 720, 6 * 4 * 3 * 2 * 1 = 144, 6 * 3 * 2 * 1 = 36, 6
* 2 * 1 = 12, 6 * 1 = 6.

Since spaces have a lot of importance in Python, proper care should be taken while writing a
statement. The following program shows the difference in output when print(fact) is placed
inside the "for" loop.
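A sketch of the variation (only the indentation of the print statement changes):

num = int(input("Enter a number: "))
fact = 1
for k in range(1, num + 1):
    fact = fact * k
    print(fact)    # indented inside the loop, so the running value is printed in every iteration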

Explanation
In this program, the value of fact is printed inside the "for" loop; hence, before incrementing
the value of the counter "k" each time, the print statement is executed and "fact" is printed
multiple times. The user should accordingly use spaces with care.

In real-world scenarios, with the increasing complexity of programs, there is a need to include
loops and conditional statements in the same program, as shown in the following example:
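The billing program is not listed here; a plausible sketch (the handling of the boundary values 1000, 3000 and 5000 and the prompts are assumptions):

num = int(input("Enter the number of products: "))
amount = 0
for k in range(1, num + 1):
    price = eval(input("Enter the price of the product: "))
    qty = eval(input("Enter the quantity of the product: "))
    amount = amount + price * qty
if amount > 5000:
    amount = amount - amount * 0.10     # 10% discount
elif amount > 3000:
    amount = amount - amount * 0.07     # 7% discount
elif amount > 1000:
    amount = amount - amount * 0.05     # 5% discount
print("Net payable amount is", amount)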

Explanation
This example calculates the net payable amount in the shop depending on the bill amount. If the
bill amount is greater than 5000, 10% discount is provided; if it is between 3001 and 5000, 7%
discount is provided; and if it is between 1001 and 2999, 5% discount; else no discount is
provided. The number of products are entered by the user and stored in “num”. The “for” loop is
executed “num” times and each time the price and quantity are entered by the user. The amount
is calculated and discount is provided according to predefined conditions using conditional
statement.

2.2.2 Nesting of for Loops


A loop contained within another loop is called a nested loop. Nested loops are used when an iterative
process is itself repeated, and they are generally used with multi-dimensional or hierarchical data
structures, such as matrices and lists. It is important to understand that we can nest "for"
loops to a great level, although nesting beyond 4 to 5 levels often makes it difficult to read and
understand the code.

In nested "for" loops, it is important to pay attention to the spaces left before
each nested loop. The reader should write the statements carefully with
proper space indentation.
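The star-pattern listing is not reproduced; a sketch matching the explanation below (printing the stars of one line together via end="" is an assumption):

num = int(input("Enter the number of lines: "))
for k in range(1, num + 1):          # k is the line number
    for j in range(1, k + 1):        # j counts the stars in line k
        print("*", end="")
    print()                          # move to the next line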

Explanation
We wanted to print stars in number of lines, and the number of stars in a particular line depends
on the value of “num”. This implies that the first line will have one star, the second line will
have two stars, and so on. This means that we need one counter to represent line number and
one counter to represent the number of stars, which further means that we require two counters
and hence two “for loops”. The user is prompted to enter the number of lines which is stored in
“num”. The first for loop starts from 1 and will continue till “num”. Since we have “for” loop
inside “for” loop, hence, for every increment of “k”, the other “for” loop will start in which the
counter “j” will have values from 1 to “k”. This implies that when k = 1, “j” will have values
from 1 to 1; when k = 2, “j” will have values from 1 to 2; when k = 3, “j” will have values
from 1 to 3, … . Thus, when the user enters 6, 6 lines are printed with different number of stars.

Create the program as in the previous page using the nested “while” loops and
create appropriate conditions for correct results.

2.2.3 While Loop


The "while" loop executes the same code again and again until a stop condition is met. A
condition is tested at the start of the "while" loop. If it is true, then the loop body is executed. Once the
loop body is executed, the condition is tested again, and so forth, until the condition is false, after
which the loop exits. This means that when the number of iterations is not known in advance, we use
a "while" loop. A situation that might occur in a "while" loop is that the loop might never run:
when the condition is tested and the result is false, the loop body is skipped and the first
statement after the while loop is executed (Fig. 2.6). Besides, the programmer needs to be
careful that the counting variable within the loop is incremented, otherwise an infinite loop
occurs.

Syntax
while (condition):
    commands to be executed as long as condition is TRUE

Figure 2.6 Flowchart of “while” loop.
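A minimal sketch of the counting program described below:

counter = 101
while counter <= 105:
    print(counter)
    counter = counter + 1
print("Finished Task")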

Explanation
In this example, value of counter is incremented by 1 inside the “while” loop, hence each time
the loop gets executed, the value increases by 1 which further means that at one stage when the
value of counter exceeds the value 105, the loop gets terminated and ends in successful end of
the program.

It is important to note that the value of the counter within the “while” loop should be
incremented, otherwise an infinite loop may occur. This is demonstrated in the following
example:
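A sketch of the faulty variant (the increment is deliberately missing, so the program never terminates):

counter = 101
while counter <= 105:
    print(counter)      # counter never changes, so the loop never terminates
print("Finished Task")  # never reached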

Explanation
In this program, the initial value of counter is 101 and the loop is supposed to execute till
counter <= 105, displaying the value of the counter each time. We can observe that since the
value of counter is never changed and remains the same (counter = 101), the value of counter
(101) is displayed an infinite number of times and the loop is termed an infinite loop. In this
scenario, the last statement Finished Task is never printed.
The loops used in this example are known as definite loops. In definite loops, the
programmer knows the number of times the loop is executed in advance. We know that the
definite loops are implemented using “for” loop also. But, in some real-world situations, a
programmer cannot always predict how many times a loop will execute. In these situations,
indefinite loop is required and the while iterator is ideal for indefinite loops.

We will consider the same program discussed in “for” loop for calculating the amount of the bill.
However, the difference is that the program will execute according to user requirement. In real
world, the user does not know the number of products in advance. Since the number of products
is not known in advance, we will consider a “while” loop which is ideal for indefinite loops.
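A plausible sketch of the indefinite version of the billing program (prompt wording is an assumption):

amount = 0
choice = "yes"
while choice == "yes":
    price = eval(input("Enter the price of the product: "))
    qty = eval(input("Enter the quantity of the product: "))
    amount = amount + price * qty
    choice = input("Do you want to add more products (yes/no)? ")
print("Total bill amount is", amount)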

Explanation
In this program, the user is not asked about the number of products in advance and the loop is
executed till the user enters “yes”. Once the user enters “no”, the control is transferred to the
main program.
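The next worked example, whose listing is also not reproduced, computes the sum of the digits of a number; a minimal sketch consistent with the explanation below:

num = int(input("Enter a number: "))
sum = 0
while num > 0:
    rem = num % 10        # extract the last digit
    sum = sum + rem
    num = num // 10       # drop the last digit
print("The sum of the digits is", sum)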

Explanation
This program calculates the sum of digits of a number. For example: if number is 564, then sum
= 5 + 6 + 4 = 15. The while loop is used to execute a repetitive task, because the number of
times the loop will get executed is not known in advance and varies according to the number
entered by the user, which the user does not know. Each time the loop is executed, the mod is
determined by using the operator %. This will store the remainder in rem. Thus, when the
number 567 is entered, the first time the loop is executed, the remainder is 7. The “sum”
variable stores 7. The “num” is then divided by 10 and stored in “num”. The loop is executed
again and remainder is added to sum. Thus, the total sum is displayed of a number when the
while loop exits. Thus, the sum of 4281 is 4 + 2 + 8 + 1 = 15.

2.2.4 Nesting of “While” Loops


Like “for” loops, it is also possible to nest “while” loops as shown in the following example:
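A sketch of the nested "while" version of the number pattern described below (the formatting of each line is an assumption):

num = int(input("Enter the number of lines: "))
k = 1
while k <= num:
    j = 1
    while j <= k:
        print(j, end=" ")   # numbers of one line printed on the same row
        j = j + 1
    print()
    k = k + 1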

Explanation
This program uses a nested “while” loop for printing the sequence of numbers. The user is
prompted to enter the number of lines. If the user enters 5, five lines are printed and each line
will display numbers depending on the number of line. For example, the first line will display
only number 1; the second line will display numbers 1 and 2; the third line will display numbers
1, 2, and 3, and so on.

2.3 Nesting of Conditional Statements and Loops


It is also possible to nest conditional statements and different loops. For example, a conditional
statement can be placed inside a for/while loop, or a for/while loop can be placed inside a conditional statement.

2.3.1 The “for” Loop inside “if” Conditional Statement
The following program shows the usage of for loop inside “if” conditional statement.
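A sketch of such a program (printing the first five numbers of each kind is an assumption for the odd branch):

choice = int(input("Enter 1 for odd numbers and 2 for even numbers: "))
if choice == 1:
    for k in range(1, 10, 2):     # first five odd numbers
        print(k)
elif choice == 2:
    for k in range(2, 11, 2):     # first five even numbers
        print(k)
else:
    print("Invalid choice")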

Explanation
The user is asked to enter the choice: 1 for odd and 2 for even. When the user enters the choice
as 1, the odd numbers are printed using a "for" loop. If the user enters the choice as 2, the first
five even numbers are printed using a "for" loop. However, the system prints "Invalid choice" if
the user enters a choice other than 1 or 2.

2.3.2 The “if” Conditional Statement Inside “for” Loop


The following program considers “if” statement inside “for” loop.
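A minimal sketch of the letter-counting program explained below:

string = input("Enter a string: ")
count = 0
for letter in string:
    if letter != " ":        # skip spaces
        count = count + 1
print("Number of letters:", count)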

Explanation
It can be observed that whenever there is a space, the count variable is not incremented, hence
the result consists of only the number of letters in the string.
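The next worked example (listing not shown) prints the arrangements of the letters A, B and C; a sketch consistent with the explanation below:

letters = "ABC"
for first in letters:
    for second in letters:
        for third in letters:
            if first != second and second != third and first != third:
                print(first + second + third)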

Explanation
This program uses a three-level nested "for" loop to print all the different arrangements of the
letters A, B, and C. Each string printed is a permutation of ABC. The "if" statement is used to
prevent duplicate letters within a given string. Since we need each letter to occur only once,
the condition is made in such a way that a string is printed only when all three letters are
different from one another. This finally helps to print only strings with distinct letters.

Use a nested “for” loop to display the different combinations of letters that can
be formed from the input of the five-lettered string given by user.

The following example tries to determine whether one string is a kangaroo word of the second.
Kangaroo refers to a word carrying another word within it but without transposing any letters.
For example, encourage contains courage, cog, cur, urge, core, cure, nag, rag, age, nor, rage and
enrage, but not run, gen, gone, etc. If the length of the first string is lower than that of the second
string, there is a mismatch. Otherwise, the occurrence of each letter of the second string is checked
in the first string; if the letter is found, the next letter is searched from the index of that letter
onwards. If any letter is not found at any stage, a flag counter is set to false and the result is
printed.
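The listing is not reproduced; a plausible sketch following the explanation below (the helper variable found and the exact messages are assumptions):

string = input("Enter the first string: ")
string2 = input("Enter the second string: ")

if len(string) < len(string2):
    print("The strings are not entered properly")
else:
    flag = True
    val = 0                                  # index from which the next search starts
    for i in range(len(string2)):
        found = False
        for j in range(val, len(string)):
            if string2[i] == string[j]:
                val = j                      # continue searching from the found index
                found = True
                break
        if not found:
            flag = False
            break
    if flag:
        print("It is a kangaroo word")
    else:
        print("It is not a kangaroo word")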

Explanation
The user is asked to input two strings which are stored in string and string2. An if statement is
executed to determine whether the first string is smaller than the other string. If it is smaller,
this means that the strings are not entered properly. This is obvious that since the second string
is contained in first, it should be at least equal to it. The else statement is executed if the strings
are properly entered. Two nested for loops are used, since we wanted to match each and every
letter of the string (first for loop) with each and every letter of another string (second for loop).
Since it is desired that the letter should not be transposed, hence if there is the occurrence of the
letter of the second string in the first string, the second for loop will start from the index of the
occurred letter in string1. Thus, a variable named "val" is initialized to 0, and when the first
search is completed, it will take the value of the index of the first string (val=j); this means that
the search will continue from the index of last found letter. If any letter is not found in the
string, Boolean variable named flag is made False. The last section of the program checks the
value of flag. If flag is true, the result of “It is a kangaroo word” will be displayed. In the
example, when the strings entered are “welcome reader” and “lead”, it will display “It is a
kangaroo word”, because each and every letter of lead is occurring in welcome reader in the
same order. In the second example, though each and every letter of load is occurring in “hello
dear”, but the order is not the same, hence it displays that “It is not a kangaroo word”.

2.3.3 The “if” Conditional Statement Inside “while” Loop


The following program considers “if” statement inside “while” loop.
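A plausible sketch of the running-total program explained below (prompts are assumptions):

total = 0
choice = "yes"
while choice == "yes":
    num = eval(input("Enter a number: "))
    operation = input("Enter A for addition or S for subtraction: ")
    if operation == "A":
        total = total + num
    else:
        total = total - num
    choice = input("Do you want to continue (yes/no)? ")
print("Total =", total)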

Explanation
Initially the choice is made “yes” and “Total” is initialized to 0. The “while” loop is executed
till the user enters “yes”. The user can enter the choice of either addition or subtraction. When
the user enters “A”, the number is added and when the user enters “S”, the number is
subtracted. In this program, after three times of execution, the user enters “no” and hence the
total is printed at last.

2.3.4 Using “for”, “while”, and “if” Together
The following program uses all “for”, “while”, and “if” in the same program.
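A sketch combining the three constructs as described below (the prompts and menu text are assumptions):

choice = int(input("Enter 1 to print a range of numbers or 2 to print numbers on demand: "))
if choice == 1:
    start = int(input("Enter the starting number: "))
    end = int(input("Enter the ending number: "))
    for k in range(start, end + 1):
        print(k)
elif choice == 2:
    value = 111
    ans = "y"
    while ans == "y":
        print(value)
        value = value + 1
        ans = input("Print the next number (y/n)? ")
else:
    print("Invalid choice")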

Explanation
This program shows the utility of for, while, and if in the same program. The user is prompted
to enter the choice, which is stored in the variable “choice”. If the user enters choice as 1, then
“for” loop is executed for printing numbers between the specified range by the user. If the user
enters 2 for the value of "choice", the initial value of 111 is stored in the variable "value" and
"ans" is initialized to "y". The program keeps printing numbers as long as the user enters "ans" as
"y", and the value of "value" is incremented by 1 in each iteration. The "while" loop is used for this purpose,
because the program is not aware how many times the loop will get executed. It should be noted
that while is ideal for indefinite loops. When the user enters “3” for the value of choice,
“Invalid choice” is printed.

2.4 Abnormal Loop Termination


We have seen that loops are executed as long as the condition is true. This condition is checked
only at the top of the loop, and hence the loop is not exited if the condition becomes false in the
middle of the execution. Generally, this situation does not happen because the loop is created to
execute all the statements within the body. But in rare scenarios, if it happens, we need to exit
the loop by changing execution from its normal sequence. The break, pass, and continue
statements in Python give programmers better flexibility for implementing the control logic of
loops in these situations. When abnormal loop termination occurs, all objects that were created in
that scope are destroyed. It should be mentioned here that ideally every loop should have a
normal flow and have a single entry and exit point. It is discouraged to use abnormal loop
termination because it introduces an exception into the normal control flow of the loop.
However, in some unavoidable situations, these statements can be the only alternative.

2.4.1 Break Statement


Break terminates the loop statement and transfers execution to the statement immediately
following the loop. When the “break” statement is encountered inside a loop, the loop is
immediately terminated regardless of what iteration the loop may be on and program control
resumes at the next statement following the loop. The “break” statement immediately exits a
loop, skipping the rest of the loop’s body, without checking to see if the condition is true or false.
Execution continues with the statement immediately following the body of the loop. It is
important to note that the “break” statement exits only the loop in which “break” statement was
found.
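A minimal sketch matching the explanation below:

for num in range(11, 21):
    print(num)
    if num > 15:
        break          # terminates the loop after 16 is printed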

Explanation
Initially, the “for” loop was supposed to execute from num = 11 to num = 20. The intention of
the user was that if the value reaches 16, it should exit the “for” loop. The first value (11) was
printed and “if” condition was checked. Since the value 11 was not greater than 15, the next
value was printed, and so on. When the value 16 was printed and “if” condition was checked, it
was found that the value was greater than 15 and hence break statement was executed. This
terminates the loop and further values were not printed.

Proper positioning of break statement inside the nested loops is important for
the correct execution. Multiple break statements can also be used at different
places for effective control.

The break statement can also be used inside a "while" loop, as demonstrated in the
following program:
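A plausible sketch of this program (the always-true loop condition and the prompt are assumptions):

total = 0
while True:
    amount = eval(input("Enter the amount of the product (negative number to stop): "))
    if amount < 0:
        break                   # the only way out of the loop
    total = total + amount
print("Total =", total)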

Explanation
Our intention is to add unlimited positive integers corresponding to the amount of the product;
the condition of the "while" can never be false till the user enters positive integers (customers
can buy unlimited products). In this scenario, the “break” statement is the only way to exit out
of the loop. The “break” statement is executed only when the user enters a negative number.
Like “for” loop, the “while” loop is also terminated in between and thus any statements
following the "break" statement within the body are not executed. Thus, when the user enters a
negative value, the total is printed; this happens because the print statement is outside the
while loop.

It should be noted that in a nested loop, the break statement exits only the loop in which the
break is found. This can be better understood by the following example of break statement inside
nested loops.
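A sketch of the prime-number program described below (variable names follow the explanation; whether the limit itself is included is an assumption):

limit = int(input("Enter the limit: "))
for value in range(2, limit + 1):
    is_prime = True
    for factor in range(2, value):
        if value % factor == 0:   # value has a divisor, so it is not prime
            is_prime = False
            break                 # exits only the inner loop
    if is_prime:
        print(value)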

Explanation
The first “for” loop tries to determine the prime numbers from 2 (smallest prime number) to
limit. It initializes is_prime = True. It then divides each candidate value by every number from
2 up to the value to determine whether it is a prime number or not. The value of is_prime is
made “False” if the number is divisible by any number (value%factor==0 implies remainder is
0). The statement “if is _prime:” means that if is_prime is True, then the value will print,
else the new number will be considered. It is important to focus on the spaces used in the
program. A change in settings of space will produce a different result.

This program can also be created by using a nested “while” loop with small modifications.
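A sketch of the nested "while" version (the loop bound k < limit follows the explanation below; the prompt is an assumption):

limit = int(input("Enter the limit: "))
k = 2
while k < limit:
    factor = 2
    is_prime = True
    while factor < k:
        if k % factor == 0:
            is_prime = False
            break               # exits only the inner while loop
        factor = factor + 1
    if is_prime:
        print(k)
    k = k + 1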

Explanation
This program uses a nested “while” loop for displaying the prime numbers according to the user
choice. The program determines the prime number within the range specified by the user. The
first “while” loop is used to perform a check related to number and has a counter for increasing
the value of number by 1. The condition (k<limit) is checked so that the “while” loop is
executed till the last number is reached. The nested “while” loop is used for determining
whether the number is a prime number. The number is printed only if it is a prime number, else
the nested loop is exited using “break” statement. The main “while” loop then considers the
next number and the whole process is repeated. Thus, it is clear that break exits only the loop
where it is found.
It should be noted that using multiple break statements within a single loop should be
avoided. The user should always try to rewrite the code so that the break statement is not used.
However, in exceptional cases, it can be used with care.

2.4.2 Continue Statement


During a program's execution, when the break statement is found within the loop, the remaining
statements within the body of the loop are not executed and the loop is exited. But when a
continue statement is found within a loop, the remaining statements within the body of the loop
are skipped; however, the loop is executed again and the condition in loop is checked again to
see if the loop should continue or be exited. If the loop’s condition is still true, the loop is not
exited, but the loop’s execution continues at the top of the loop. It should be noted that similar to
“break” in a nested loop, the continue statement affects only the loop in which “continue” is
found.
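The listing is not shown; one plausible sketch consistent with the explanation below (the prompt and the treatment of 0 and negative numbers are inferred, so the details are assumptions):

total = 0
complete = False
while not complete:
    num = eval(input("Enter a number (0 to stop): "))
    if num == 0:
        complete = True
    elif num < 0:
        continue                 # skip the addition and go back to the top of the loop
    else:
        total = total + num
print("Total =", total)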

Explanation
The variable “total” is assigned to 0 and “complete” is assigned to False. The “while” loop
executes till value of “complete” is “False”. In this program, the “while” loop is used to add
numbers till the user enters the number 0. When the user enters a negative number, the rest of the
loop is not executed and the control transfers to the start of the loop.

2.4.3 Pass Statement


In programming, pass is a null statement. This statement is used when the user has created a
code block that it is not required. When pass is executed, nothing is executed. This is shown in
the following example:

Explanation
In this program, pass statement does not let anything happen and hence, only the last letter of
word “HELLO” is printed. It helps in just executing the loop again without doing anything.
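A minimal sketch of this use of pass (the word HELLO is taken from the explanation):

for letter in "HELLO":
    pass            # null statement: the loop body does nothing

print(letter)       # after the loop, letter holds the last character, so 'O' is printed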

2.5 Errors and Exception Handling


While writing programs, errors may occur inadvertently and prevent the program from running
according to the requirement of the user. The process of finding and eliminating errors is called
debugging. Python provides an exception-handling mechanism that enables us to detect errors
and handle them efficiently. This enables us to execute a program without terminating it abruptly after an
exception is caught. Once the exception is resolved, program execution continues till completion.

2.5.1 Types of Error


The different types of error that can occur in a Python program are compile-time error, run-time
errors, and logical errors.

2.5.1.1 Compile-Time Errors


Compile-time errors are syntactical errors found in the code due to which program compilation
does not take place. For example, incorrect spelling, mismatched parentheses, a missing
colon (:) after “if”, improper indentation, etc. The Python compiler displays an error message
along with the line number in case of compile-time errors, which helps the user to rectify the
error.

Explanation
In this program, a parenthesis was missing in the print statement and hence, a compile-time
error displaying the Syntax Error was displayed when the program was executed.

2.5.1.2 Run-Time Errors


Run-time errors occur at the time of execution of program. This error occurs after the check has
been done for compile-time error, for example, data type mismatch. These errors are however
sometimes not understood by the user. Also, in some cases, these are not displayed which leads
to abnormal execution and may even lead to endless execution without termination.

Explanation
In this program, when the user enters 5, the number is displayed. But in the second case, when
the user enters “w”, it shows a NameError since int() function only accepts numbers and “w” is
a string.

2.5.1.3 Logical Errors


Logical errors occur due to the usage of incorrect logic in program by the user. For example,
using incorrect condition in “if” statement to check the data. These logical errors can be
corrected by the programmer by modifying the program.

Explanation
In this program, the user is prompted to enter a number between 1 and 7; the proper result is
generated if the user enters a number between 1 and 7. If the number entered was greater than 6
or less than 1, the result would have been Sunday. In this program, logical errors occur because
it displays “Sunday” as the weekday when the number is not falling in the range 1 to 7. In such
a situation, the user will not be able to understand the reason for error and nature of error.

2.5.2 Exception Handling

Exception handling is an effective tool to handle the errors in a program. The purpose of
exception handling is to terminate the program gracefully and display a proper message to the
programmer in case of an error. If the programmer can anticipate the types of errors that can
occur, then exceptions can be used for efficient programming and for solving the problem. Some of
the common exceptions that are defined in Python include:

1. ImportError: Raised when an import statement fails.


2. IndexError: Raised when a sequence index is out of range.
3. NameError: Raised when a local or global name is not found.
4. IOError: Raised when an input/output operation fails.
5. ArithmeticError: Raised for errors in numerical calculations.
6. OverflowError: Raised when a calculation exceeds the maximum limit of a numeric type.
7. SyntaxError: Raised when the compiler encounters a syntax error.
8. ZeroDivisionError: Raised when there is division by zero.
9. ValueError: Raised when an argument has the right type but an inappropriate value.

Syntax
try:
    --------------------
except Exception1:
    --------------------
except Exception2:
    --------------------
else:
    --------------------
finally:
    --------------------
This program can be created effectively using exception handling as:

Explanation
In this program, when the user enters 3, “Wednesday” is printed, and when the user enters 8 or 0
as the weekday, an error message is printed. This is because the “try” statement raises an exception
from its block if the user enters a value not falling in the range 1–7. The exception is caught by
the “except” block and the block of statements inside the except clause is executed. Hence, the
message of Wrong!!!!…. inside the except block is printed when the user enters 8 or 0.
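A hedged sketch of the weekday program rewritten with exception handling (the day names and the error message are illustrative):

days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
try:
    num = int(input("Enter a number between 1 and 7: "))
    if num < 1 or num > 7:
        raise ValueError              # treat an out-of-range number as an error
    print(days[num - 1])              # e.g. 3 prints "Wednesday"
except:
    print("Wrong!!!! Please enter a number between 1 and 7")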

It is important to note that a single try block can have many except blocks to handle multiple
exceptions, as shown in the syntax. We can also write a try block without an except block
(provided a finally block is present), but we cannot write an except block without a try block. Besides the
except clause, exception handling has two other important clauses, namely “finally” and “else”. The else
block is executed when no exception occurs, and the finally block is always executed at the end of the
try/except/else block. This is shown in the following program:

Explanation
This program shows multiple except clauses in a single try block. When the try block returns
an Input–Output (IO) exception, it is caught by except block for IO exceptions and statements
inside this block are executed. When the try block returns a value exception, it is caught by
“except” block for ValueError exception and statements inside this block are executed.
However, if any exception other than the two listed exceptions occurs in the program, it is
caught by the bare “except” block (the one without a named exception) and the statements inside this block are executed.
The statements inside the else block are executed if no exception occurs. The finally block is
always executed at the end, regardless of whether an exception occurred.
In the first case, when the file was not found, an IOError was raised and caught by the
IOError block. The statements inside it are printed and then the finally block is executed. Similarly,
in the second case, the file was found but did not contain an integer, so a
ValueError exception was raised and caught in the ValueError block, and then
the finally block was executed. In the third case, no exception was generated, hence the else block
was executed and then the finally block was executed.
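A hedged sketch of the kind of program this explanation describes (the file name numbers.txt and the messages are assumptions):

fh = None
try:
    fh = open("numbers.txt", "r")     # may raise IOError if the file is missing
    value = int(fh.read())            # may raise ValueError if the content is not an integer
except IOError:
    print("IOError: the file could not be opened")
except ValueError:
    print("ValueError: the file does not contain an integer")
except:
    print("Some other exception occurred")
else:
    print("The value read is", value)
finally:
    if fh is not None:
        fh.close()                    # clean-up runs whether or not an exception occurred
    print("The finally block is always executed")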

Execute different programs considering different types of exception that can


occur and handle the situation to solve the problem.

2.6 User-Defined Functions


A function is a set of statements put together to perform a specific task. Functions are very much
useful when a block of statements has to be written once and executed multiple times with or

without different inputs. Thus, the function is a standard unit of reuse. Functions are useful when
the program size is too large or complex. Functions are called to perform each task sequentially
from the main program. It is like a top-down modular programming technique to solve a
problem. Functions are also used to reduce the difficulties during debugging a program by the
programmer because of its modularity feature. A function is an object and the interpreter passes
control to the function along with arguments for the function to accomplish the task. The
function in turn performs its task and returns control to the interpreter as well as any result which
is further stored in other objects.
Writing functions is a core activity of a programmer, and Python provides a great support to
create user-defined functions. They are created according to the requirement of the user, and
once created they are used like the built-in functions. They basically act as a turning point from
user to a developer, who creates new functionality for Python which can be shared with others.
The body of a function contains a collection of statements that basically define the task of the
function. When a function is called, we may or may not pass a value to the argument. The user-
defined function allows a developer to create an interface to the code, specified with a set of
arguments. This interface allows the developer to communicate to the user the aspects of the
code that are important or are most relevant. Typically, a function produces a result based on the
parameters passed to it. However, it is possible that a function may contain no arguments. Thus,
a function has a name, a list of parameters (which may be empty), and a result (which may be
None).

Syntax
def fname(arg1, arg2, …):
function body
where,

• fname is the actual name of the function stored in Python environment.


• arg1, arg2 are the optional arguments. An argument is like a placeholder. When a function is
invoked, we pass a value to the argument. This value is referred to as the actual parameter or
argument. The argument list refers to the order and number of the arguments of a function. A
function may or may not contain arguments.
• function body contains a collection of statements that define the task of function.

It should be noted that the function name and the arguments list together constitute the function
signature.
The task of the function is defined when a function is created. To perform the defined task,
we need to call the function. When a program calls a function, the program control is transferred
to the called function. A called function performs a defined task and when its return statement is
executed or last statement is reached, it returns the program control back to the main program.
To call a function, we simply need to pass the required parameters along with the function name,
and if the function returns a value, then we need to store the returned value from a function
within a variable.

2.6.1 Function without Arguments


A function without arguments can be created using empty parentheses, because all arguments
to a function are passed within parentheses. It should be noted that a function without
arguments can be created with or without a return statement.

Explanation
The name of the function is hello, which is followed by empty parentheses. This means
that the function is created without any arguments. The body of the function is written inside the
definition of the function. In this program, the function body has two print statements. When
the function is called by its name, all the tasks of the function are executed and
hence both the print statements are executed and the result is printed accordingly.

The major advantage of function is that it can be created once and called many times. This is
illustrated in the following program:

Explanation
In this program, the function hello() is created once, but it is called three times. Hence both
the print statements inside the function are executed three times and hence the output prints six
statements.
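A minimal sketch combining both programs (the printed messages are assumptions):

def hello():
    print("Hello!")                     # first statement of the function body
    print("Welcome to Python World")    # second statement of the function body

hello()    # each call executes both print statements,
hello()    # so three calls produce six lines of output
hello()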

A function body can also contain some expressions and hence function can also produce some
results as explained in the following example:

Explanation
No argument is passed in this function. The function assigns value of 100 to the variable
“num1” and 200 to the variable “num2”. It then prints addition of “num1” and “num2”. When
the function is called, it simply returns the result of addition of “num1” and “num2” (100 + 200
= 300).

A function becomes more meaningful and will yield better results if programming is also done
using loops and decision-making statements in the function body. This is helpful for doing
effective analysis and decision making in Python.

Explanation
The first statement creates a user-defined function by the name function2. There is no
argument passed inside the function. The function body starts after the colon on the next
indented line. It has a for loop which has “num” as the counter variable, and a sequence is
generated from 1 till 5 (6 − 1). The “for” loop body multiplies each number starting from 1 till 5
with 7 and prints the result. Thus, the table of 7 up to 5 is printed.

It is also possible to create multiple functions in the single program as discussed in the following
example:

Explanation
This program creates four functions named “area_triangle”, “area_rectangle”,
“area_circle”, and “area_square” with no arguments. All the functions ask for the required
variables from the user inside the respective body of function and prints the area of its
respective shape. From the main program, the user is asked to give the choice, and depending
on the choice entered by the user the respective function is called. Example: area_square()
function will be called and executed when the user enters 1, area_circle() function will be
called when user enters the value 2 and so on.
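A hedged sketch of such a menu-driven program (the prompts, the triangle formula, and the mapping of choices 3 and 4 are assumptions; the explanation fixes only 1 for the square and 2 for the circle):

import math

def area_square():
    side = float(input("Enter the side of the square: "))
    print("Area of square =", side * side)

def area_circle():
    radius = float(input("Enter the radius of the circle: "))
    print("Area of circle =", math.pi * radius * radius)

def area_rectangle():
    length = float(input("Enter the length of the rectangle: "))
    breadth = float(input("Enter the breadth of the rectangle: "))
    print("Area of rectangle =", length * breadth)

def area_triangle():
    base = float(input("Enter the base of the triangle: "))
    height = float(input("Enter the height of the triangle: "))
    print("Area of triangle =", 0.5 * base * height)

choice = int(input("Enter 1-Square, 2-Circle, 3-Rectangle, 4-Triangle: "))
if choice == 1:
    area_square()
elif choice == 2:
    area_circle()
elif choice == 3:
    area_rectangle()
elif choice == 4:
    area_triangle()
else:
    print("Invalid choice")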

It should be noted that the variable created inside the function is not known outside the function.
Hence, an error will be generated if the variable is called outside the function. This is

demonstrated in the following example:

Explanation
This example creates a function named “trial_func” which does not take any argument and
stores 4000 in a variable named “amount”. The variable amount cannot be fetched from outside
the function. It hence throws an error if the user tries to fetch the value of that variable.

A function can also return some value which needs to be handled in the main program. We
can directly store the returned value from a function in a variable and then print the value of that
variable in the main program. The use of the return statement shifts the control back to the place
from where the function was called. It is important to note that only one return statement is executed
in a single call of a function (execution stops at the first return reached); however, a return statement can return multiple values.

Explanation
The function named “bill” is created without any argument. The bill() function prints
“Welcome” and returns the variable “amount”. It is important to understand that since the bill
() function returns a variable named “amount”, hence when this function is called, the output
returned by this function needs to be stored in a variable. Thus, when the function is called, the
value of the “amount” which is the returned output from the function is stored in the variable
named “value”. The answer when printed shows the value 4000, since the value 4000 was
returned as an output from the function when it was called.
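A minimal sketch consistent with this explanation:

def bill():
    print("Welcome")
    amount = 4000
    return amount        # control and the value 4000 go back to the caller

value = bill()           # the returned value must be stored in a variable
print(value)             # prints 4000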

In Python, the function can also return a Boolean value to the main program unlike most of the
programming languages. This is demonstrated in the following program:

Explanation
This program creates a function named multiple() which does not take any argument. The
user takes two variables “num1” and “num2”. If “num1” is a multiple of “num2”, “True” is
returned, else “False” is returned. When the function is called for the first time, 40 and 5 are
entered as numbers to the program. The expression 40%5 returns the remainder 0, since
40 is a multiple of 5; hence, “True” is returned. In the other example,
64 is not a multiple of 7, which implies that 64%7 is not equal to 0, and hence
“False” is returned.

The user can also create multiple functions in the same program and each function can return
output separately from its function. Intentionally, the mentioned example of computing area of
shape is taken for explaining the utility of return statement.

Explanation
The purpose of this code is same as the earlier example to print the area of the respective shape.
But, the methodology is different. In the previous program, the area of the shape was printed
from the function body itself. In this program, we use “return” statement to pass the control
from the function to the main program which returns the value of area of the respective shape. It
can be observed that all the functions calculate the area inside the function and return the area
of the respective shape to the main program which is further stored in the variable “area” and
finally the area is printed during the execution of main program.

Python also supports the return of multiple values as illustrated in the following program.
However, it is necessary that the number of values returned from the function should be equal to
the number of variables where those values are stored.

Explanation
This program creates a function named “amount”, which returns the sales figures of three states:
Maharashtra, Gujarat, and Delhi. Since the function returned three values, hence it is important
to store them in three variables, namely a1, a2, and a3. It is also important to note that a
difference in the number of variables will result in an error in the program.
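A minimal sketch of returning multiple values (the sales figures are illustrative placeholders):

def amount():
    maharashtra = 55000      # illustrative value
    gujarat = 42000          # illustrative value
    delhi = 38000            # illustrative value
    return maharashtra, gujarat, delhi    # three values are returned together

a1, a2, a3 = amount()    # the number of variables must equal the number of returned values
print(a1, a2, a3)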

2.6.2 Function with Arguments
It has been observed that when the user-defined function is created without any argument, the
user is not able to change the input given to the function since the variables are assigned
predefined values inside the function. But, in a practical scenario, the user needs to pass the input as
an argument to the function, because the input will change with every execution and every
requirement. It is important to note that the number of arguments while creating a function should be
equal to the number of arguments passed while calling the function. However, a function can be defined
with any number of arguments.

2.6.2.1 Create a User-Defined Function with Single Argument


The following example considers only one single argument while creating a function:

Explanation
The “def” statement shows that a user-defined function is created by the name “table”. There is
only one argument “num1” which is passed inside the function. The function body starts from
the next line. It has “for” loop which has “i” as the counter variable and numbers starting from
1 till 11. Hence the last number for the counter “i” will be 10. The “for” loop body multiplies
each number that is passed as an argument with “i” and prints the table. Thus, the table of 8 is
printed when we call the function and pass the argument as 8.
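A minimal sketch of this single-argument function:

def table(num1):
    for i in range(1, 11):                      # i runs from 1 to 10
        print(num1, "x", i, "=", num1 * i)

table(8)    # prints the multiplication table of 8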

It is also possible to consider multiple arguments while creating a function as discussed in the
following example. This example uses three arguments for creating a table:

Explanation
The function named “newtable” is created using three arguments: num1, num2, and num3. The
num1 argument represents the number for which the table will be printed. The num2 and num3
arguments represent the starting and ending numbers for the table. When the function is called
for the first time, three arguments are given and since no naming is done, hence by default the
position of arguments plays its role and num1 is assigned 7, num2 is assigned 1, and num3 is
assigned 3. When the function is called for the second time, the names of the arguments are also
written along with the value, hence the values will be taken accordingly. It is important to
observe that the order of the variables is not mandatory. Hence, in the third case when the
variables are not written in predefined order, the answer is computed accordingly.
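A hedged sketch of the three-argument function and the three styles of calling it (the values in the second and third calls are assumptions):

def newtable(num1, num2, num3):
    for i in range(num2, num3 + 1):             # table of num1 from num2 to num3
        print(num1, "x", i, "=", num1 * i)

newtable(7, 1, 3)                   # positional: table of 7 from 1 to 3
newtable(num1=5, num2=2, num3=4)    # keyword arguments
newtable(num3=4, num1=5, num2=2)    # with keyword arguments the order does not matter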

It is important that the number of arguments while creating and calling a function should be
same, else it generates an error as illustrated in the following example:

Explanation
The “newtable” function has three arguments; hence, when we call the
function with only one argument in this example, an error is generated. This means that passing an
unequal number of arguments results in an error.

It is possible to create multiple functions in the same program having different numbers of
arguments. For proper understanding, the same “area” program is considered intentionally with
different number of arguments in each function and without return statement.

Explanation
It can be observed that since all the functions for computing area are defined with arguments,
hence it is important to pass the arguments when these functions are called. Thus, in the main
program the user is prompted to enter the values of the respective shape and these values are
passed as an argument to the function when the function is called. The function itself does not
take input from the user and just does the computation and prints the result. It should be clear
that all the programs for computing the area do the same task and hence the result is absolutely
the same, but the programming becomes different because of different signature of functions.
The user is prompted to enter the choice from 1 to 4 for the shapes. When the user enters 3,
the user is asked to give input of length and breadth of rectangle; the area_rectangle()
function is called after passing the arguments length and breadth. The area_rectangle()
function does the computation inside it and prints the result from its function body itself. After
the last line of the area_rectangle() function is executed, the control is returned back to the
program.

We will consider the same “area” program for understanding the difference in programming
when the function is created with arguments and also returns the value to the main program.

Explanation
It can be observed that all the functions compute the result within the respective function body
and return the result to the main program rather than printing the result within itself. Hence, the
task of the function becomes very simplified. The returned output from each function is stored
in “area” variable in the main program and the result of area is also printed from the main
program.

The main utility of functions is to create a function once and use it many times. The following
example creates a function once and uses it two times to determine whether two numbers are
amicable. Two numbers are said to be amicable if the sum of the proper divisors of one number plus
1 is equal to the other number. All divisors of a number other than 1 and itself are called proper
divisors. For example, the numbers 220 and 284 are amicable as the sum of the divisors of
220 (i.e., 1, 2, 4, 5, 10, 11, 20, 22, 44, 55 and 110) is equal to 284 and the sum of the divisors of
284 (i.e., 1, 2, 4, 71 and 142) is 220.

Explanation
The function named sumoffact() is created which calculates the sum of its divisors. In the
function, a “for” loop is executed for n/2 times. If the given number is exactly divisible, then it
is considered to be a factor and added to the “sum” variable. The function finally returns the
sum of all its factors. The user is prompted to enter the two numbers x and y. The function
returns the sum of factors x and y and stores in val1 and val2, respectively. For determining
whether the numbers are amicable, equality of “y” is checked with the sum of factors of x
(val1) and equality of x is checked with sum of factors of y (val2). If both the conditions are
met, then the message that the numbers are amicable is printed.
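A minimal sketch of such a check (here the divisor loop starts at 1, so the extra “+1” mentioned in the definition is already included in the sum; the prompts are assumptions):

def sumoffact(n):
    total = 0
    for factor in range(1, n // 2 + 1):   # candidate divisors up to n/2
        if n % factor == 0:
            total = total + factor
    return total

x = int(input("Enter the first number: "))
y = int(input("Enter the second number: "))
val1 = sumoffact(x)                       # sum of the divisors of x
val2 = sumoffact(y)                       # sum of the divisors of y
if val1 == y and val2 == x:
    print(x, "and", y, "are amicable numbers")
else:
    print(x, "and", y, "are not amicable numbers")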

Use concept of functions to determine whether the string is pangram,


consogram or vowgram. The function will take string and return the output
which will be displayed through the main program. The concept of pangram,
vowgram or consogram is discussed in Section 2.1.

Explanation
This program converts the Roman Letters to the corresponding value. Numbers in this system
are represented by the following combinations of letters:
Roman Numeral    I    V    X    L    C      D      M
Value            1    5    10   50   100    500    1000

A function named roman is created, which returns the corresponding value of the roman
numeral; thus, if V is passed to the function, it will return a value 5; D will return value 500 and
so on. The special thing about the conversion of roman numeral is related to the position of the
numeral in a group of numerals. Thus, IV denotes 4 and VI denotes 6. This means that if
smaller roman numeral is before, then the corresponding value is subtracted from the larger
roman numeral, else it is added. This logic is used in the program. The user is prompted to enter
the roman numeral, which is stored in roman_num. A for loop is executed till the length of the
roman numeral. When the for loop is executed for the first time, the command ans1=value(s1)
calls the conversion function for the first roman numeral and stores the result in ans1. A check is then made to see
whether another numeral follows the current one. If another numeral is present, the
conversion function is executed for the second numeral as well. If ans1 is found to be greater than ans2,
its value is added; otherwise, the smaller value is subtracted from the bigger value. This is done
because in the roman number system the position of a numeral plays an important role. If the roman
numeral is the last one, then its value is simply added to the “ans” variable. The program
finally prints the equivalent value of the roman numeral when the execution of the for loop is
completed.
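A hedged sketch of this conversion logic (the helper is written here as an if–elif chain; the function name roman and the overall flow follow the explanation, while the prompt text is assumed):

def roman(ch):
    # value of a single Roman numeral
    if ch == "I":
        return 1
    elif ch == "V":
        return 5
    elif ch == "X":
        return 10
    elif ch == "L":
        return 50
    elif ch == "C":
        return 100
    elif ch == "D":
        return 500
    elif ch == "M":
        return 1000

roman_num = input("Enter a Roman numeral: ")
ans = 0
for k in range(len(roman_num)):
    ans1 = roman(roman_num[k])
    if k + 1 < len(roman_num):        # another numeral follows the current one
        ans2 = roman(roman_num[k + 1])
        if ans1 >= ans2:
            ans = ans + ans1          # larger (or equal) before smaller: add
        else:
            ans = ans - ans1          # smaller before larger: subtract
    else:
        ans = ans + ans1              # the last numeral is simply added
print("Equivalent value:", ans)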

2.6.3 Nesting of Functions


The concept of nesting of functions is to define one function within another function. Due to simple
recursive scope rules, the inner function is itself invisible outside of its immediately enclosing
function, but it can see (access) all local objects (data, functions, types, etc.) of its immediately
enclosing function as well as of any function(s) which, in turn, enclose that function.

Explanation
The greet() function is nested inside the welcome() function; hence when the welcome()
function is called, the first statement of the welcome() function is executed, which means that the
greet() function will be executed. The greet() function prints “Hello Dear!!!” on the screen
and then the control is transferred back to the welcome() function, which then executes the
remaining tasks and hence prints “Welcome to Python World” on the screen.
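A minimal sketch of this nesting (the messages follow the explanation):

def welcome():
    def greet():                          # greet() is defined inside welcome()
        print("Hello Dear!!!")
    greet()                               # first task: call the nested function
    print("Welcome to Python World")      # second task of welcome()

welcome()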

We can nest functions theoretically to unlimited depth. However, for better readability and
interpretation, only a few levels are normally used in practical programs. The following program
shows the nesting of four functions:

Explanation
The armstrong(num) function is nested inside the task() function; welcome() function is
nested inside the armstrong() function; and greet() function is nested inside the welcome()
function. Hence, when the task() function is called, the armstrong(num) is executed. The first
statement of armstrong() is executed and hence the welcome() function is executed. The
welcome() function in turn calls greet() function and hence “Hello Dear!!!” is printed and
then the control will be transferred back to the welcome() function. The welcome() function
then executes the second task and “Welcome to Python World!!!” is printed on the screen
after “Hello Dear!!!”. The control is transferred back to the armstrong() function which
prints the result that “It is not an armstrong number”. The task() function then executes
the remaining tasks and hence prints “Thanks” on the screen. The last statement is printed after
the task() function is executed and hence See You!!! is printed.

We can also use return statement inside the function along with the calling of any other function
as illustrated in the following example:

Explanation
In this program, welcome() function is called inside the other function reverse() and when the
reverse() function is executed, it calls welcome() function which prints the statement
“Welcome to Python World!!!” The next statement reverses the number with the help of a
“while” loop. The main program then stores the output of the function in variable “result” and
then prints the value of “result” which is 316.

After the execution of return statement, the control transfers to the main program. Hence, it is
necessary to check the order of statements inside a function else all the statements in the program
will not be executed. This is explained in the following program:

Explanation
In this program, the return statement is executed before the calling of first function. The
Return statement stops further execution of the function and is considered as the last statement
of the function. Since the welcome() function is called from the reverse() function after the
return statement, hence welcome() is not executed.

It is also possible to pass a string argument in the function and the nesting of functions is also
possible for string argument. This is illustrated in the following example:

Explanation
This program creates the following two main functions:

1. vowel() for determining vowels and consonants in a string.


2. length() for determining the length of a string.

A main() function is created in which the user is asked to enter the choice. If the user
enters 1, the vowel() function is executed; if the user enters 2, the length() function is executed; and if the
user enters any other number, the message “Invalid choice” is printed.
In the main program, valid is initialized to “yes” and the “while” loop is executed till the
choice is equal to “yes”. The first task of the “while” loop is to execute the main() function. After
the execution of the main() function, the second task is to ask the user whether he wishes to
continue. If the user says “yes” again, the “while” condition becomes true and hence the main()
function is executed again. This process repeats till the user says “no”.
We can observe from the main program that initially the value of valid is “yes”, hence
the main() function is executed at least once. Since the user enters the choice as 1, the
vowel() function is executed. The vowel() function checks whether each and every letter in
the word “welcome” is a vowel or not. The user is then asked whether
he/she wishes to continue. After the user enters “yes”, the main() function is executed
again. The user is prompted again to enter the choice for the task to be executed. Since the user
enters the choice as 2, the length() function is executed and the length of the string “Python
World” is computed and printed. If the user enters “no” when asked whether to repeat the
process, the program terminates and “Finished Task” is printed.

Create the program (as in the previous page) without using the functions. The
user is supposed to input the choice and numbers from the main program and
execute the results.

2.6.4 Recursive Functions


A recursive function calls itself from its own body of function. In specific situations, it helps in
reducing lines of code to a great extent and is highly useful. This is shown in the following
example for computing factorial of a number.

Explanation
In this program, the name of the function is “fact”, which has a single argument “num”. Inside
its own function body, the fact(num) function calls fact() again until the number “num”
becomes equal to 0. The value of “num” decreases by 1 whenever the fact() function is called.
The user is prompted for the number whose factorial needs to be computed and the
result is calculated accordingly. Thus, the factorial of 6 is computed as 720. It can be observed that,
unlike other programming languages where there is a limitation on the size of an integer result,
Python has extensive support for big numbers, hence a very large result is also printed easily.
For example, the factorial of 13 already exceeds the range of a 32-bit integer in many programming
languages, whereas Python computes it exactly. In this program, the factorial of 9 is calculated and the result is
printed as 362880.
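A minimal sketch of the recursive factorial described above (the prompt text is assumed):

def fact(num):
    if num == 0:
        return 1                    # base case: 0! is 1
    return num * fact(num - 1)      # the function calls itself with num decreased by 1

number = int(input("Enter a number: "))
print("Factorial of", number, "is", fact(number))   # fact(6) -> 720, fact(9) -> 362880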

2.6.5 Scope of Variables within Functions


It should be noted that the variables which are defined in the main program can be accessed by
all the functions defined in the program and hence computation can be done using those
variables inside the body of the function. This is explained with the following example:

Explanation
In this program, offer() can access “total” variable, since the variable was declared in the
main program. Hence, when the function executes the task of subtracting two numbers, the
computation takes place and the result (700) is printed.

It should be noted that the functions do not have an access to the variables which are defined in
some other functions. The variables are having limited scope to their functions only. This can be
explained by the following program:

Explanation
In this program, “discount” is a variable which is defined inside offer() function, hence its
scope is limited to that function only. An error is generated when the amount() function is
executed, since the task of amount() function is to subtract “discount” from “total” and
discount was not accessible in amount() function. However, since “total” was defined in main
program, amount() can access that variable.

However, this problem can be solved by changing the scope of the variable to “global” inside a
function. This is done by adding the keyword “global” before the name of the variable as shown
in the following example:

Explanation
In comparison to the previous program, the function amount() is executed properly and the result is
printed in this program. This is because the variable “discount” is defined as global in offer(),
which means that this variable can be accessed from anywhere. Hence, the amount() function
can access the “discount” variable along with the “total” variable defined in the main program and
accordingly compute the result.
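A hedged sketch of the use of global (the values 1000 and 300 are assumptions chosen so that the printed result matches the 700 mentioned earlier; the function names follow the explanation):

total = 1000                 # defined in the main program, visible inside the functions

def offer():
    global discount          # 'discount' is made global, so other code can read it
    discount = 300

def amount():
    print("Amount payable:", total - discount)

offer()     # creates the global variable 'discount'
amount()    # reads both 'total' and 'discount' and prints 700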

The accessibility of variables changes when the functions are nested.

The user is suggested to create three or more nested functions and check the
accessibility of variable from the different functions and the main program.

The following example uses different functions and demonstrates nesting of functions along with
the usage of global variables:

Explanation
The functions created in this program include:

1. show() function to display the choices to the user.
2. choice() function to get the choice from the user.
3. get_var() function to accept three global numbers from the user.
4. output() function to display the result which is stored in a global variable.
5. add() function to add three numbers.
6. subtract() function to subtract three numbers.
7. multiply() function to multiply three numbers.

Finally, all the functions listed are included in the main function.

The first task of the main program is to execute the show() function which displays the choice to
the user. The second task is to accept the choice using choice() function. Depending on the
user’s choice, the mathematical functions are called. For example, if the user enters 1, add
function is called, if user enters 2, subtract function is called and so on. However, before calling
the mathematical function, get_var() function is called for taking input from the user for three
global variables val1, val2, and val3. Since these variables are defined as global, hence they can
be accessed from any of the mathematical functions. All the mathematical functions store the
result in the global variable named ans. Similarly, since “ans” is a global variable, hence it can
be accessed from any other function. Finally, the output() function is called which prints the
computed result.
Use of global variable helps in efficient programming and reducing lines of code in all real-
time projects. For example, a shopping mall project will consider amount as global variable,
bank project will consider balance as the global variable, retail shops project will consider total
as the global variable, etc.

Create the program as in the previous page without using the functions. The
user is supposed to input the choice and numbers from the main program and
execute the results. This will help the user to understand the utility of the
functions.

Summary
• Efficient programming involves the use of control statements which help in efficient coding
and reduce the time needed for programming. There are some control flow statements which
lead to abnormal loop termination. These include the break and continue statements.
• A conditional statement is a mechanism that allows for conditional execution of instructions
based upon the outcome of a conditional statement, which can either be true or false.
• A simple “if” statement consists of a Boolean expression followed by one or more
statements. If the Boolean expression evaluates to be true, then the block of code inside “if”
statement will be executed, whereas if the condition is false, then the statements inside “if”
statement body are completely ignored.
• An “if” statement can be followed by an optional “else” statement which executes when the

Boolean expression is false. The concept of nested “if” statements is to use one “if” statement
inside another “if” statement.
• Iteration repeats the execution of a sequence of code. Iteration is useful for solving many
programming problems. An iterator (loop) allows us to execute a statement or group of
statements multiple times.
• The “for” iterator is ideal for definite loops and the “while” iterator is ideal for indefinite
loops.
• It is also possible to nest conditional statement and different iterators.
• Python provides the break, continue, and pass statements to give programmers more
flexibility in designing the control logic of loops.
• Break terminates the loop statement and transfers execution to the statement immediately
following the loop. When the break statement is encountered inside a loop, the loop is
immediately terminated and program control resumes at the next statement following the
loop.
• When a continue statement is encountered within a loop, the remaining statements within
the body are skipped.
• Compile-time errors are syntactical errors found in the code due to which program
compilation cannot take place. Runtime errors occur at the time of execution of program.
Logical errors occur due to incorrect logic in program.
• Exception handling is an effective tool to handle the errors in the program. Some common
exceptions are: IOError, ArithmeticError, OverflowError, SyntaxError, etc.
• A single “try” block can be followed by several except blocks and hence multiple except
blocks can be created to handle multiple exceptions. The “else” block is executed when no
exception occurs and “finally” block is always executed at the end of the try/except/else
block.
• Functions are useful when the program size is too large or complex. They help in splitting a
complicated procedure into smaller blocks and each block is called a function.
• A function in Python is defined by using the def keyword, followed by the name of the
function and a pair of parentheses (which may or may not contain input parameters) and, finally, a
colon (:) that signals the end of the function definition line.
• A function without arguments can be created using empty parentheses because all
arguments to a function are passed within parentheses. A function without arguments can
be created with or without a return statement.
• A function can be created with arguments also. However, it is important that the number of
arguments while creating and calling a function should be same.
• The concept of nesting of function is to define one function within other function. One
function inside other function is theoretically possible to unlimited depth.
• A recursive function calls itself from its own function body.

Multiple-Choice Questions

1. ____________ is not used for abnormal loop termination.

(a) switch
(b) pass
(c) break
(d) continue
2. The result of the following command is
x=4
while(x<7):
print(x**2, end=' ')
(a) 16 25 36 49
(b) 6 7 8 9
(c) endless loop
(d) 16 25 36
3. If amount= 500, then the result of the following command will be
if(amount>700 | amount <200):
print("Excellent")
else:
print("Good")
(a) Good
(b) Excellent
(c) Error
(d) Excellent Good
4. The following command shows the result as
for num in range(1,5):
print(num**3, end=' ')
(a) 1 8 27 64 125
(b) 1 8 27 64
(c) 3 6 9 12 15
(d) 3 6 9 12
5. ___________ terminates loop statement and transfers execution to statement immediately
following loop.
(a) break
(b) while
(c) for
(d) continue
6. It is possible to return multiple values from a function.
(a) False
(b) Error
(c) True
(d) Both (a) and (b)
7. The correct way of defining a function min() is
(a) def min()

(b) def min():
(c) def min();
(d) def function min();
8. The scope of variable can be changed using the ___________ keyword:
(a) var_scope
(b) scope
(c) global
(d) unlimited
9. It is possible to call a function with multiple arguments.
(a) True
(b) False
(c) Error
(d) Both (b) and (c)
10. A ____________ function calls itself from its own function body.
(a) recursive
(b) nested
(c) argumented
(d) non-argumented

Review Questions

1. Define a function to calculate factorial of a number.


2. Define a function to calculate compound interest.
3. Define a function to print minimum of three numbers.
4. Write a program to print Fibonacci series till the user-specified number.
5. Define a function to print odd numbers between 130 and 250.
6. Define a function to determine whether the number is an Armstrong number.
7. Write a program to show the utility of if-else-if ladder.
8. Is it possible to raise multiple exceptions in a program? Explain with the help of a program.
9. Explain the importance of “finally” in exception handling.
10. What are the different types of exceptions that are raised in the program?

CHAPTER
3

Data Structures

Learning Objectives
After reading this chapter, you will be able to

• Create different data structures in Python.


• Access different elements of data structures in Python.
• Apply the knowledge of different functions to data structures.
• Evaluate the utility of different data structures in different conditions.

Data structures play an important role in programming language. The different data structures
included in Python are array, list, tuple, and dictionary. The difference between an array and a
list is that an array is a collection of similar data types (homogeneous), whereas a list can be a
collection of different data types (heterogeneous). While array and list can be modified by
adding, deleting, and modifying the existing elements, it is not possible to modify the tuple. A
dictionary is a unique data structure generally not available in programming languages and is
represented as key–value pair. Creation of data structures, accessing data structures, different
functions related to data structures, and programming related to them is discussed in detail in this
chapter. For effective programming related to arrays, we need to create arrays through the NumPy
library or the array module, which are discussed in Chapters 4 and 5. This chapter
discusses the other data structures: lists, tuples, and dictionaries.
To understand the utility of data structures, we will consider an example of creating a
program to add five decimal numbers.

Explanation

This program adds five decimal numbers which are stored in val1, val2, val3, val4, and val5.
Since we need to add five numbers, hence five variables are required to store the numbers.
If the count of numbers increases to 10000, the length of this program will increase
unnecessarily. In this scenario, 10000 variables would be needed to store the 10000 numbers, and this
would reduce the overall efficiency of the program.

The “for” loop can be used to solve this problem which is illustrated as follows:

Explanation
The variable named “total” is initialized to 0. With every execution of the “for” loop, the user is
prompted to enter a value, which is stored in “value”. The use of the “for” loop
helps to reduce the lines of code. In this program, the number of lines of code will be the same
even if the user wanted to add 10000 numbers. However, it should be noted that unlike the
previous program, we are not able to print the original numbers in the final output. This is
because each new number is stored in “value” and, with each execution of the “for” loop, the new
value of “value” replaces the old one.
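A minimal sketch of this loop (the prompt text is an assumption):

total = 0
for k in range(5):                               # repeat five times
    value = float(input("Enter a number: "))     # each new input overwrites 'value'
    total = total + value
print("Total =", total)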

In real-world problems, there is large data which needs to be stored properly and accessed
effectively for proper analysis. Thus, the above programs will not be able to solve real-world
problems. This is a strong limitation, which can be resolved using data structures in Python.

R has primarily six data structures: vector, array, matrix, list, factor and
dataframe. All of these have special utility for data analysis. The dataframe is an
important data structure in R; in Python, dataframes are provided by the Pandas
library. Arrays are primarily created via the NumPy library in Python, and a matrix
is a two-dimensional array.

3.1 Lists
A list is a collection of objects and basically represents an ordered sequence of data. A list is
similar to a string, which is a collection of characters. But unlike a string, a single list can hold
items of different data types as well. This means a list need not be homogeneous; in other words, the
elements of a list may not all be of the same data type.

3.1.1 Creating a List


A list is created by using square brackets and like any other variable, a list variable must be
assigned before it is used. Example: Declaring a list named “firstlist” that holds five integer
values is done using the command firstlist = [11, 25, 10, -17, -6]. It can be observed
that the elements of the list appear within square brackets and are separated by comma. Hence
the statement print(firstlist) will print the list variable as [11, 25, 10, -17, -6].

Explanation
This program creates an empty list named “firstlist”. Using a “for” loop, the user is prompted to
enter a number five times. Each number is initially stored in the variable “value”, which is then
added to the list and also added to the variable “total”. The results show that, unlike the earlier
program, the total is generated and the individual elements of the list can also be accessed.
Similarly, unlike the results of the plain “for” loop, where we were not able to fetch the original
five numbers, the list data structure helps to fetch the numbers. Also, unlike the program for
adding five numbers without using a “for” loop, this program helps to reduce the lines of code
drastically when the count of numbers is large. This means that the list data structure is effective
for programming.
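A hedged sketch of the list version (the prompt text is an assumption; the list name follows the explanation):

firstlist = []                                   # empty list
total = 0
for k in range(5):
    value = float(input("Enter a number: "))
    firstlist.append(value)                      # the original numbers are preserved in the list
    total = total + value
print("Numbers entered:", firstlist)
print("Total =", total)
print("First number entered:", firstlist[0])     # individual elements can still be accessed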

A list can also be created without explicitly listing every element in the list. A range function is
used to produce a regular sequence of integers. We cannot fetch the data which is produced from
range() function directly. For example, if a=range(0,5); then print(a) will display the output
as range(0, 5) rather than printing the elements from 0 to 4. The list() function helps us to
convert a sequence specified in range to convert into a list. Thus, the range object returned by
range is not a list. However, we can make a list from a range using the list function.

Explanation
The range(12,19) creates a range of numbers from 12 to 18 and the list() function creates a
list of these seven numbers. The next command creates a list of numbers from −3 to +4 with
default interval of 1. The command list(range(30, -10, -2)) creates a list from 30 to -8
with an interval of -2 between the numbers. Similarly, the last command creates a list of
numbers from 0 to 300 with an interval of 50.
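A minimal sketch of these calls (the exact arguments of the second and fourth calls are assumptions consistent with the explanation):

print(list(range(12, 19)))          # [12, 13, 14, 15, 16, 17, 18]
print(list(range(-3, 5)))           # -3 to 4 with the default step of 1
print(list(range(30, -10, -2)))     # 30 down to -8 in steps of -2
print(list(range(0, 301, 50)))      # 0 to 300 in steps of 50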

It is also possible to concatenate two lists together in a single list. The plus (+) operator
concatenates lists in the same way as it concatenates strings. This is illustrated in the following
program:

Explanation
This example shows that two lists are created each having five and four items. The class of both
of them is “list” as shown in the result. The two lists are concatenated in one single list named
list7 and hence the new list comprises nine items and the class is also list.
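A hedged sketch of such a concatenation (the contents and names of the first two lists are assumptions; the name list7 follows the explanation):

list5 = [11, 25, 10, -17, -6]              # five items (contents assumed)
list6 = ["India", "USA", "UK", "China"]    # four items (contents assumed)
list7 = list5 + list6                      # concatenation: nine items
print(list7)
print(type(list7))                         # <class 'list'>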

3.1.2 Accessing List Elements


The list elements can be accessed by using square brackets with its proper index. The list items
can be accessed using the concept of Indexing and Slicing. Thus, in an expression such as
userlist[k], the number “k” within the square brackets is called an index or subscript. Unlike the
convention in mathematics and like other programming languages, the first element in a list is at
position 0 and not at 1. The index represents the distance from the beginning; the first element is
at a distance of 0 from the beginning of the list. Since the starting index of list is 0, hence the
first element of list named “userlist” is denoted by userlist[0]. If there are “n” elements in the list,
the last element in the list is accessed and represented by userlist[n − 1]. It should be noted that
the elements of a list extracted with [] can be treated as any other variable.
It should be noted that Python also allows negative indexing as demonstrated in the following
example:

Explanation
The list named “list8” comprises five items. A “for” loop is used to access and display the
numbers. It should be noted that the “for” loop starts from “0” and will continue till “4”. This is
because the first list item is at index 0 and the last item is at index 4. The “for” loop prints the
value of “num+1” also for denoting the position of list item, because the true position of
number is one greater than its index. The use of negative indexing is shown in the next
commands where the loop starts from −1, because the last number of the list is denoted by −1
and continues till −5. This is because the index of the first number in negative indexing is
represented by the negative sign along with the length of the list (here 5). It can be observed
that the last argument in “for” loop is −1, since each time the “for” loop is executed, the index
decreases by 1 to point to the next list item from end of the list.
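A minimal sketch of both loops (the contents of list8 are assumptions):

list8 = [11, 22, 33, 44, 55]                     # a five-item list
for num in range(0, 5):                          # positive indexing: 0 to 4
    print("Item", num + 1, ":", list8[num])
for num in range(-1, -6, -1):                    # negative indexing: -1 to -5
    print("Item at index", num, ":", list8[num])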

We know that a list may not be necessarily homogenous. This means that it is possible to store
elements of different data types in a single list. This will hence allow the user to have a list item
inside other list. However, to access a list within a list, we need to use two square brackets.
Example: userlist[a][b] denotes the list item within a list item where “a” denotes the index of the
list item in the main list and “b” denotes the index of the nested list item. This is illustrated in the
following program:

Explanation
In this example, a list named “list9” holds integers, floating-point numbers, strings,
alphanumeric, and even other lists. The first list item (index = 0) is an alphanumeric number,
second list item (index = 1) is an integer, third list item (index = 2) is a list, and the fourth list
item (index = 3) is a string. The “for” loop is used for displaying list items. The third list item
represented by index 2 has four list items: “India” at index 0, “USA” at index 1, “UK” at index
2, and “China” at index 3. Since this list was located at index 2 in the main list, hence these
items are accessed using list9[2][0], list9[2][1], list9[2][2], and list9[2][3], respectively.
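A hedged sketch of accessing a nested list (the first, second, and fourth items are assumptions; the countries follow the explanation):

list9 = ["A12", 25, ["India", "USA", "UK", "China"], "Python"]
print(list9[2])        # the nested list
print(list9[2][0])     # "India"
print(list9[2][3])     # "China"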

It is clear that range specified in “for” loop is 5. Hence the maximum value of “x” will be 4.
However, if the user tries to fetch the element in index 5, an error will be generated because it
does not represent a valid element in the list. This is demonstrated in the following example:

Explanation
We can observe that the above code results in a run-time exception, because the list has five
items only and the code is accessing the sixth item (index is 5). It should be also noted that the
programmer uses an integral value for an index; but in order to prevent a run-time exception,
the programmer must ensure that the index used is within the bounds of the list.
The concept of slicing is used if we want a list that is created from a portion of an existing
list. A list slice is an expression of the form list[start : end], where start and end are integers
representing the starting and ending index of a subsequence of the list; the element at the end
index itself is not included. It is important to understand that the value of start should be less
than that of end; if start is greater than or equal to end, an empty list is returned. An empty pair of
square brackets denotes the empty list. Negative numbers can also be used as slicing arguments,
in which case they are counted from the end of the list; here too, an empty list is returned if the
start position does not come before the end position.

It should be noted that if the starting argument is absent, the start is considered to be 0 (the start of
the list). If the end argument is absent, the slice extends to the end of the list, so the last element
(at index length − 1) is included.

Explanation
A list named “list10” with 10 elements is created. The use of list slicing is depicted using
different starting and ending positions in this program. It is clear that an empty list is generated
when the start position does not come before the end position. Absence of the start argument
accesses elements from the beginning, and absence of the end argument returns elements till
the end.

It is possible to change the existing elements in a list by specifying the index number along with
new value.

Explanation
In this example, the value of second item of the list (index=1) is modified to a new value 110,
and the value of the fifth item (index = 4) is modified to a new value 120. The list, when

displayed, shows the new modified list.

It is also possible to modify an existing list by removing elements or adding a sub-range of
elements in an existing list. We can modify the content of the list using slice assignment (sub-
range of elements) by writing the slicing expression on the left side of the assignment operator.

Explanation
This example demonstrates replacing and inserting multiple items using slice assignment. The
command list12[2:4]=['MP', 'UP', 'AP'] replaces the list items at the second and third index with
the items ‘MP’, ‘UP’, ‘AP’. Thus, the list items “14” and “35” are replaced with new values. The
command list12[3:3] = [12,13,14] inserts the three values 12, 13, and 14 at index 3 without
removing any existing item (the slice [3:3] is empty). Thus, the new list now has 10 list items. The
command list12[3:5] = [] deletes the items from index 3 till index 4 and the new list now has 8 list items.

3.1.3 Functions for List


Different in-built functions for performing operations efficiently on lists exist in Python. These
functions include insert(), copy(), len(), count(), index(), max(), min(), remove(),
append(), sort(), sum(), etc. The utility of these functions is explained in the following
program:

Explanation
The utility of different functions related to list is clearly explained by the respective comments
on the top of statement. The output produced after executing the program clearly demonstrates
the results after modifications to the list by using respective function.
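A hedged sketch exercising these functions on an assumed list:

mylist = [40, 10, 30, 10, 20]     # contents assumed
print(len(mylist))                # number of items -> 5
print(max(mylist), min(mylist), sum(mylist))
print(mylist.count(10))           # occurrences of 10 -> 2
print(mylist.index(30))           # index of the first 30 -> 2
mylist.append(50)                 # add 50 at the end
mylist.insert(1, 15)              # insert 15 at index 1
mylist.remove(10)                 # remove the first occurrence of 10
copylist = mylist.copy()          # an independent copy of the current list
mylist.sort()                     # sort in ascending order
print(mylist)
print(copylist)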

Repeating elements using “*” operator: The list elements can be repeated using the “*”
operator.

Explanation
The repetition of list [10, 20, 30, 40] is done three times in the first example, hence the new list
had 12 items. The next example creates a list “mylist3” having 3 items and its repetition by 4
creates a list of 12 items. It is also possible to repeat a list, store in a new list, and then print the
new list as shown in the next example where the list is repeated 2 times and hence the list has 6
items.

Determining existence of list item using “in” operator: The “in” operator helps us to
determine whether a particular element is present in the list or not. This is demonstrated in the
following example:

Explanation
The “in” operator returns a Boolean result and helps to determine whether a particular element
exists in the list or not. Thus, (50 in mylist4) returns a Boolean result “False” since that item is
not contained in list. Similarly, (60 in mylist4) returns a Boolean result “True” since the item is
contained in list.
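A minimal sketch of both operators (the list contents are assumptions chosen to match the results described above):

mylist3 = [10, 20, 30]            # three items
print(mylist3 * 4)                # repetition: a list of 12 items
mylist4 = [10, 20, 30, 40, 60]
print(50 in mylist4)              # False: 50 is not in the list
print(60 in mylist4)              # True: 60 is in the list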

3.1.4 Programming with List


It is possible to do efficient programming by using the concept of list because it helps to reduce
lines of code and thus makes code more manageable. In this section we learn to create different
programs using list.

Explanation
In this program, an empty list named “numlist” is created. Since we want to print the elements
in the range 100 to 200, hence the “for” loop has a range from 100 to 201. If the number is
divisible by 6 and not divisible by 5, the number is added to the numlist using append()
function and is separated by comma. Thus, all the elements which satisfy the above criterion
form a list and all elements are printed using join() function for the list.
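A minimal sketch of this program:

numlist = []
for num in range(100, 201):                 # 100 to 200 inclusive
    if num % 6 == 0 and num % 5 != 0:
        numlist.append(str(num))            # join() works on strings
print(", ".join(numlist))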

Programming in lists is generally done through looping because the list has
multiple elements. If we want to access all the elements of the list, we can
access them one by one using the concept of looping.

Explanation
An empty list named “datalist” is created. The user is first asked to enter the number of
observations. Accordingly, the user is prompted to enter the different values. In this program,
the user enters five values: 3, 4, 5, 1 and 7, since the user had five observations. All these five
values are added to the list by using the append() function. A “for” loop is then created having a
range equal to the length of “datalist”. Thus, the “for” loop is executed five times. The term
datalist[k] accesses the kth element from the list. Thus, all elements are accessed one by one
and the string “*” is multiplied by each element. When the first value of the list (3) is accessed, three
asterisks are printed. The second item of the list is 4, hence four asterisks are printed on the next
line, and so on.

Create the program for the above example using “while” loop instead of “for”
loop.
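A sketch of the program described in the following explanation, with assumed prompts and the
list names taken from that explanation, is:

amountlist = []
n = int(input("Enter the number of products: "))
for i in range(n):
    amountlist.append(int(input("Enter the amount: ")))
newlist = []
for i in range(len(amountlist)):
    if amountlist[i] > 100:
        newlist.append(amountlist[i])   # keep only amounts greater than 100
print(newlist)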

Explanation
The user is prompted to enter the number of products; accordingly “for” loop is executed for
those number of times. During each execution of “for” loop, the user is asked to enter the value
and each value is added to the list named “amountlist” using append() function. A new “for”
loop is executed for number of times equal to the length of the list. Each item in the list is
compared with a value of 100. If the item is greater than 100, it is added to the “newlist”. The
last print statement finally prints the “newlist” and hence those values which are greater than
100 are printed.

Explanation
An empty list named “values” is created. The “for” loop starts from 151 and ends at 250. The
number is first converted to a string by using str() function. This is done because we want to
access each digit of the specified number. If we convert to string, we will be able to access each
digit using the concept of indexing. However, each digit when accessed is again converted to an
integer for doing calculation by using int() function. The number is added to the list named
“values” if each digit of the number returns a remainder 0 after dividing by 2. The last
statement finally prints all the elements of the list “values”.

Determine the minimum and maximum element from the list.
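A sketch of the country-sorting program described in the following explanation, with assumed
prompts, is:

namelist = []
n = int(input("Enter the number of countries: "))
for i in range(n):
    namelist.append(input("Enter the country name: "))
namelist.sort()                  # sort the country names alphabetically
print(namelist)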

Explanation
An empty list named “namelist” is created. The user is prompted to enter the number of
countries; accordingly the “for” loop is executed those number of times. The name of the
country entered by the user is added to the list using append() function. The sort() function
for the list sorts all the list items; hence when the list is printed, the countries are displayed in
the sorted order.

Explanation
An empty list named “valueslist” is created and the user is prompted to enter the starting and
ending number; accordingly the “for” loop is executed. If the number is divisible by 2, then the
number is added to the list. The print command prints the entire list.

Create the program for the preceding example using “while” loop instead of
“for” loop.
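A sketch of the GMT-to-IST program described below, with assumed prompts and day names, could be:

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day = input("Enter the day in GMT: ")
hours = int(input("Enter the hours in GMT: "))
minutes = int(input("Enter the minutes in GMT: "))
idx = days.index(day)
extrahr = 0
minutes = minutes + 30                 # IST is 5 hours 30 minutes ahead of GMT
if minutes >= 60:
    extrahr = 1
    minutes = minutes - 60
hours = hours + 5 + extrahr
if hours >= 24:
    idx = (idx + 1) % 7                # move to the next day
    hours = hours - 24
print("IST:", days[idx], str(hours).zfill(2) + ":" + str(minutes).zfill(2))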

Explanation
This program determines the day and time in IST for a given day and time in GMT. It should be
noted that IST (Indian Standard Time) is 5 hours 30 minutes ahead of GMT (Greenwich Mean Time). For
example, if the day is Sunday and the time in GMT is 23:05, then in IST it is Monday (the next day)
and the time is 04:35. The user is prompted to enter the day, hours, and minutes. A list of all
days is prepared. The occurrence of the given day is checked in the list and the index of the day is
stored in the variable named “idx”. 30 minutes are added to the existing minutes; if the total
exceeds 60 minutes, then “extrahr” is set to 1 and 60 minutes is subtracted from the total.
Similarly, 5 hours is added to the existing hours along with the value of “extrahr”, and if the
number of hours exceeds 24, then the next day is fetched from the list and 24 is subtracted from the
new hours. Thus, in the first example, since the minutes were 35, the total number of minutes was
(35 + 30) = 65, the same as 1 hour and 5 minutes, and since the hours were 16, the total number of
hours was 16 + 5 + 1 = 22 (IST is 5 hours ahead, plus 1 extra hour carried from the minutes). Since
the total number of hours was less than 24, the day does not change. In the second example, since
the total number of hours became 29, the day changed from Monday to Tuesday.

Explanation
This program does a cyclic shift of list elements. The user is prompted to enter the number of
list elements. Two empty lists are created and the user is asked to enter the list items. The list
item is added to the “userlist” by using the function append and a blank string is added to the
“newuserlist”. The user is then asked to enter the number of places to shift in the list, which is
stored in the variable named “shift”. Different formulas are used to calculate the index of a list
item in new list if the index of item + places to shift in original list is greater or less than the
highest index in the list. Thus, in original list, India is in the first place while it has shifted to
fourth place because the shift is required by three places. Similarly, USA has shifted to last
index.
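A minimal sketch of the perfect-number check described below could be:

num = int(input("Enter a number: "))
factor_list = [1]                      # 1 is a factor of every number
for i in range(2, num):
    if num % i == 0:
        factor_list.append(i)          # collect the good factors of num
if sum(factor_list) == num:
    print(num, "is a perfect number")
else:
    print(num, "is not a perfect number")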

Explanation
This program determines whether the entered number is a perfect number or not. A number is
said to be a perfect number if it is equal to the sum of its good factors. All factors of ‘n’
excluding ‘n’ itself are its good factors. For example, 6 is a perfect number as 1 + 2 + 3 =
6. A list named “factor_list” is created with single element 1 which is the common factor of all
the numbers. A “for” loop is executed from 2 till the number to determine the factors of a
number. If the number is exactly divisible by another number, then it is considered to be a good
factor. In this program, if the number is a factor, it is added in the list using append(). All the
factors are added and stored in sum. The last section uses an if statement to determine whether
the number is a perfect number or not. If the total sum of its factors is equal to the number, it
displays the message accordingly. Thus, 6 is displayed as a perfect number and 12 is not
displayed as a perfect number.
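One possible sketch of the program explained below, with assumed variable names, is:

num = int(input("Enter a number: "))
digits = []
while num > 0:
    digits.append(num % 10)            # digits are stored in reverse order
    num = num // 10
ans = 0
for i in range(len(digits) - 1, 0, -1):
    newnum = digits[i] * 10 + digits[i - 1]   # two-digit number from adjacent digits
    flag = False
    for j in range(2, newnum):
        if newnum % j == 0:
            flag = True
            break
    if not flag:
        print("The number", newnum, "is a prime number")
        ans = ans + 1
    else:
        print("The number", newnum, "is not a prime number")
print("Total two-digit prime numbers:", ans)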

Explanation
In this program, for a given number ‘n’, we have determined count of the number of two digit
prime numbers in it when adjacent digits are taken. For example, if the value of ‘n’ is 114, then
the two digit numbers that can be formed by taking adjacent digits are 11 and 14. 11 is prime
but 14 is not. Therefore, the result is 1. The user is prompted to enter the number. The
remainder obtained when the number is divided by 10 is stored in the list. Thus, for the number
114, the first remainder obtained is 4, which will be stored in the list as the first number. This
process is repeated using the “while” loop till the list has all the three numbers 1, 1, and 4.
However, this list has numbers in reverse order. Hence the first “for” loop which is created for
creating two digit number start from the end of the list and moves till the start of the list. It is
decremented by one in range option. The first time the “for” loop executes, a variable named
“newnum” is formed considering last two digits. Hence the last digit is multiplied by 10 and
second last digit is added to it. Thus, the first number formed is 11. A nested “for” loop is used
to determine whether the number is a prime number or not. If the number is divisible by the
other number, the value of flag becomes True. If flag is False, this means that the number was
not divisible by any number, and the result displayed will be “The number 11 is a prime
number”. The second time “for” loop executes for creating a two digit number, the number
formed is 14 and hence the result printed is “The number 14 is not a prime number”. Whenever,
a number is found to be prime, the variable “ans” is incremented by 1. The last statement prints
the total number of two-digit prime numbers formed from the number. In the next example,
when the number entered is 3779, three two-digit numbers are formed, namely 37, 77, and 79.
Since 37 and 79 are prime numbers, hence total number of two digit prime numbers are printed
as 2.
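A sketch of the digit-sum program described in the following explanation, with assumed variable
names, is:

num = int(input("Enter a number: "))
digits = []
while num > 0:
    digits.append(num % 10)          # digits stored in reverse order
    num = num // 10
sum_odd_positions = 0                # positions 1, 3, 5, ... of the original number
sum_even_positions = 0               # positions 2, 4, 6, ... of the original number
for i in range(len(digits) - 1, -1, -1):
    position = len(digits) - i       # position counted from the left
    if position % 2 == 1:
        sum_odd_positions += digits[i]
    else:
        sum_even_positions += digits[i]
print("Difference:", sum_odd_positions - sum_even_positions)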

Explanation
This program computes the difference between the sum of the digits occurring in the alternate
positions starting from the first position and the sum of the digits occurring in the alternate
positions starting from the second position of the number. For example, consider the number 8975.
The sum of the digits that occur in the alternate positions starting from the first position is
8 + 7 = 15. The sum of the digits that occur in the alternate positions starting from the second
position is 9 + 5 = 14. The difference between the two sums is 1 (= 15 − 14). Similarly, for the
number 934716, the difference between the two sums is −2. An empty list is created and the number
is first converted into a list of digits by repeatedly dividing the number by 10 and storing the
remainder in the list. Each time the loop is executed, the number is divided by 10. However, the
list formed stores the digits in the reverse order. Two “for” loops are executed for calculating
the sums of the digits at the odd and even positions, respectively. It should be noted that since
the list stores the digits in the reverse order, the “for” loops start from the end of the list and
are decremented till they reach the start of the list. The last section determines whether the sum
of the digits at the even positions is more than the sum at the odd positions and prints the result
accordingly.

Create the program for the preceding example using “while” loop instead of
“for” loop.
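A sketch of the reverse-length-divisible check following the steps described in the explanation
below is given here; the guard against a zero divisor is an addition of this sketch, not part of
the original listing:

num = int(input("Enter a number: "))
numlist = []
while num > 0:
    numlist.append(num % 10)        # digits collected in reverse order
    num = num // 10
newnumlist = numlist[::-1]          # digits restored to the original order
ans = newnumlist[0]                 # a non-zero digit is divisible by itself
value = 0
flag = True
for digit in newnumlist:
    value = value * 10 + digit      # number formed by the leading digits so far
    if ans == 0 or value % ans != 0:    # guard against a zero divisor (assumption)
        flag = False
        break
    ans = ans - 1                   # the next prefix must be divisible by one less
if flag:
    print("The number is reverse length divisible")
else:
    print("The number is not reverse length divisible")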

Explanation
This program checks if the given number is reverse length divisible. A number is said to be
reverse length divisible if the number formed by the first i digits is divisible by (l − i + 1), where l is the
number of digits in N and 0 < i <= l. This means that if the first digit of a number is divisible by
a “x”, then the number formed by considering two digits should be divisible by “x − 1”, the
number formed by considering three digits should be divisible by “x − 2”. For example, 52267
is reverse length divisible because 5 is divisible by 5; 52 is divisible by 4; 522 is divisible by 3;
5226 is divisible by 2; 52267 is divisible by 1. In the next example, we have used 5621, which
is not reverse length divisible. This is because the first digit is divisible by 5, the number
formed by considering two digits 56 is divisible by 4, but the number formed by considering
three digits 562 is not divisible by 3 and hence the number is not reverse length divisible. The
number accepted from the user is first converted to a list named “numlist” using a “while” loop.
Since the list contains the elements in the reverse order, it was important to again create a list in
the original order. A list named “newnumlist” is created which stores the numbers in the
original order using the “for” loop. The next section then determines the number by which the
first digit of the number is divisible. The divisor is stored in the variable name “ans”. Then, the
numbers are determined by considering the number from the adjacent digits and its divisibility
is checked by the number equal to “ans-1”. Every time a new number is formed, the value of
“ans” is decremented by 1. The “for” loop is executed till all the digits of the list are exhausted.
A Boolean variable named flag is created which is assigned the value True if the number is
exactly divisible. The divisibility is checked using the “%” operator which returns the
remainder. If the remainder is 0, it means that the number is exactly divisible else it is not
divisible. If any number is not divisible, the value of flag becomes False and the loop is exited
using break. The last section prints the result depending on the value of flag.
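A possible sketch of the sparse-matrix check, with assumed prompts, is:

m = int(input("Enter the number of rows: "))
n = int(input("Enter the number of columns: "))
matrixlist = []
for i in range(m):
    for j in range(n):
        matrixlist.append(int(input("Enter the element: ")))
zeroes = 0
nonzeroes = 0
for element in matrixlist:
    if element == 0:
        zeroes = zeroes + 1
    else:
        nonzeroes = nonzeroes + 1
if zeroes >= nonzeroes:
    print("It is a sparse matrix")
else:
    print("It is not a sparse matrix")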

Explanation
This program checks whether the given matrix is sparse or not. A matrix is said to be sparse if
the number of zero entries in the matrix is greater than or equal to the number of non-zero
entries. Otherwise it is not sparse. The user is prompted to enter the number of rows and
columns in a matrix. An empty list named “matrixlist” is created and elements are added to the
list for m*n times using the two nested “for” loops for rows and columns. Two variables named
“zeroes” and “nonzeroes” are initialized with value 0. Each element in the list is checked for its
value and accordingly the count of zeroes or nonzeroes is incremented. The last condition
checks whether the number of zeroes are more than the nonzeroes. Since in the preceding
example the number of zeroes are less than the nonzeroes, hence the result “It is not a sparse
matrix” is printed.
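A sketch of the palindrome/symmetry program described below, with the two functions written
here on assumed lines, is:

def Palindrome(user_string):
    return user_string.lower() == user_string.lower()[::-1]

def Symmetry(user_string):
    half = len(user_string) // 2
    # ignore the middle character when the length is odd
    return user_string[:half] == user_string[len(user_string) - half:]

n = int(input("Enter the number of strings: "))
for i in range(n):
    s = input("Enter the string: ")
    if Palindrome(s) and Symmetry(s):
        print("both properties")
    elif Palindrome(s):
        print("Palindrome")
    elif Symmetry(s):
        print("Symmetry")
    else:
        print("No property")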

Explanation
This program checks if the string is a Palindrome string or Symmetry string for a given set of
‘n’ strings. A string is said to be a palindrome string if one half of the string is the reverse of the
other half. If the length of the string is odd, then the middle character is ignored. For example,
the strings liril and abba are palindromes. A string is said to be a symmetry string if both halves
of the string are the same. If the length of the string is odd, then the middle character is ignored.
For example, the string abab is a symmetry string.
In this program, if the string is palindrome, then ‘Palindrome’ is printed; if the string is a
symmetry string, then ‘Symmetry’ is printed; if the string has both properties (ex: fff), then
‘both properties’ is printed and ‘No property’ is printed if the string does not have any of the
properties. A “for” loop is executed for checking the property on number of strings defined by
the user. In the example, the user has entered three strings to be searched. The first string
“level” is a palindrome; the second string is a “Symmetry”, and the third string “aaa” has both
the properties.
Two functions named ‘Palindrome’ and ‘Symmetry’ are created for checking the respective property
of the string; each of them returns True if the property is matched, else it returns False. In the
palindrome function, True is returned if the condition
user_string.lower() == user_string.lower()[::-1] is satisfied. The use of [::-1] reverses
the original string. Thus, the condition checks that the lower-case version of the original string
is the same as its reverse. In the symmetry function, the string is split into exactly two halves
using its length and indexing. Both sub-strings are then matched for equality, and the function
returns True if they are the same.

Explanation
Three lists, namely, “person”, “choice”, “hobby” are created with two values in each list. The
len() function calculates the length of each list (here, 2) for all the three lists. Three nested “for”
loops are created, one for each list. A sentence is formed inside the nested “for” loop using one
item from each list. Thus, each item of a list is joined with every other item of the other two
lists. Hence 8 statements are formed from 3 lists containing 2 items each.

3.2 Tuples
A tuple is a collection of objects separated by comma inside the parentheses. The main
difference between tuple and list is that tuples are immutable while lists are mutable. This
implies that we cannot modify elements of a tuple while we can modify elements of a list. In
real-world scenarios, tuples are generally used to store data that do not require any modification,
but we need to do only retrieval of data. Unlike lists, which are represented by square bracket,
tuples are represented by parentheses.

3.2.1 Creating a Tuple
A tuple in Python is generally created using the tuple() function or by writing elements
separated by comma inside parentheses. However, an empty tuple is created using empty
parentheses (parentheses without any element). It should be noted that if we need to create a
tuple with a single item, we need to put comma after the item. This is because without comma,
that item is just itself wrapped in parentheses and portrays a redundant mathematical expression.
It should also be noted that parentheses are optional while creating a tuple. This means that
mytuple = 10, 20, 30, 40, 50 is same as mytuple = (10, 20, 30, 40, 50). However, for better
understanding and application, it is recommended to always create a tuple using parentheses.
Like list elements, the tuple elements can be heterogeneous (containing different data types). The
following program creates tuple through different ways in Python:
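A sketch showing these different ways of creating a tuple, with the tuple names taken from the
explanation below, could be:

mytuple1 = ()                              # empty tuple
mytuple2 = (542, )                         # tuple with a single item
mytuple3 = (10, 20, 30, 40, 50)
mytuple4 = tuple(range(5, 60, 10))         # (5, 15, 25, 35, 45, 55)
mytuple5 = tuple([1, 2, 3, 4, 5])          # tuple created from a list
mytuple6 = mytuple4 + mytuple5             # concatenation: 11 items
print(mytuple1, mytuple2, mytuple3, mytuple4, mytuple5, mytuple6, sep="\n")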

Explanation
The command “mytuple1 = ()” creates an empty tuple by the name “mytuple1”. The
command mytuple2 = (542, ) creates a tuple with only one item (542). It should be noted that
there is a comma after number 542. Similarly, “mytuple3” is having five items: 10, 20, 30, 40,
and 50. Like list, a tuple can also be created using range () function by specifying the starting
and ending element along with the interval. The tuple named “mytuple4” is created using the
command range(5,60,10); the starting element of the tuple is 5, an interval of 10 is considered
between two tuple elements, and the elements stop before 60. Thus, it has six items (5, 15, 25, 35, 45, 55). It is also
possible to create a tuple from the list as shown in the example of mytuple5 which has five
items. Similar to lists, a tuple can also be created using concatenation, tuple named “mytuple6”
is created by concatenating “mytuple4” and “mytuple5” and hence has 11 items.

3.2.2 Accessing Tuple Elements


Elements of tuple are also accessed using the concept of indexing and slicing. Indexing and
slicing are done in the similar way as lists as demonstrated in the following example:

Explanation
It is clear that indexing and slicing on a tuple are done in a similar fashion to lists. An empty
tuple is generated when the slice range does not select any element (for example, when the resolved
start index is not before the end index). If the starting argument is missing, the tuple is considered
from the start, and if the ending argument is missing, the tuple is considered till the last element.

3.2.3 Functions for Tuple


The main difference between lists and tuple is that tuple cannot be modified unlike lists. We can
perform all functions on tuple which were discussed on list. Since we cannot modify the
elements in the tuple, hence functions related to modification are not applicable on tuple. This
implies that all the functions besides append(), clear(), extend(), insert(), remove(), and
pop() can be performed on tuple also. This is demonstrated in the following program:

Explanation
The functions for tuple are clearly explained through the comments written on top of each and
every command and through the output generated after execution of program.

Repetition in tuple and determining existence of an item is also done in a similar way to lists.
Repetition is done using “*” symbol, and existence of an element is determined using “in”
operator as demonstrated in the following example:

Explanation
The repetition of tuple (100, 200, 300, 400) is done three times in the first example, hence the
new tuple had 12 items. A tuple “mytuple9” is created having 3 items and its repetition by 4
creates a tuple having 12 items. It is also possible to repeat a tuple and store in a new tuple and
then print the new tuple as shown in the next example, where the tuple is repeated three times.
The “in” operator returns a Boolean result and helps to determine whether a particular element
exists in the tuple or not. Thus, “12 in newtuple” returns a Boolean result “False” because that
item is not contained in the tuple. Similarly, “30 in newtuple” returns a result as “True” since
the item is contained in the tuple.

3.2.4 Programming with Tuple


Like lists, tuple also helps in efficient programming. Tuple has a great application in some
specific areas of programming where modification is not required. This section creates different
programs using the unique feature of tuple that it cannot be modified.
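A minimal sketch of such a program, with an assumed tuple and the variable named minimum to
avoid shadowing the built-in min(), is:

mytuple = (45, 12, 78, 3, 56)
minimum = mytuple[0]                 # assume the first element is the smallest
for item in mytuple:
    if item < minimum:
        minimum = item
print("Minimum element:", minimum)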

Explanation
This program determines the minimum element from the tuple. The first element of the tuple is
stored in min. If the next item in the tuple is less than the element stored in min, the new
element is stored in min. This process repeats till the last element of the tuple is compared. This
helps to finally determine minimum element from the tuple.

Determine the maximum element from the tuple.

Explanation
Two tuples are created, namely, “tuple1” and “tuple2” having 5 and 4 strings of names,
respectively. Two nested “for” loops are taken for comparing each string in one tuple with
every string of another tuple. If the strings are similar, the statement is printed. When the first
string “Suhaan” is compared with each and every string in the second tuple, the similarity exists
and hence it is printed. The second and third strings of first tuple do not occur in second tuple
and hence are not printed. The fourth string “Pearl” occurs in the second tuple and hence is
printed.

3.3 Dictionary
A data structure which is not available as a standard built-in type in many other programming
languages but is available in Python is called a Dictionary. It is considered the king of data
structures in Python, because it
is the only standard mapping type. Dictionaries are mutable objects (can be modified) like lists
and unlike tuples. Besides, a dictionary is not necessarily an ordered collection unlike tuples and
lists which are ordered collections.
A dictionary has a group of elements which are arranged in the form of key–value pairs.
Thus, there are three special objects for dictionary: keys, values, and items. These objects provide
a different view of the dictionary entries and they change when the dictionary changes. In the
dictionary, the first element is considered as “key” and the following element is taken as its
value. A dictionary basically maps keys to values. It is important to note here that keys should be
hashable objects, while values can be arbitrary. It is also important to note here that keys of a
dictionary should be unique.

3.3.1 Creating a Dictionary


We know that the lists and tuple are created and represented using square brackets [] and
parentheses (), respectively. Dictionary is created and represented using curly braces {}. The
dict() function is also used to create a dictionary. There are quite a few different ways to create
a dictionary, which are explained in the following example. In this example, a dictionary has two
keys, named, ‘country’ and ‘capital’ that have corresponding values as ‘India’ and ‘New Delhi’.

Explanation
We can observe that a dictionary is represented through the key and value pairs that are
separated with a colon (:). This can be also observed that each key–value pair is separated by a
comma and all the key–value pairs are written inside curly braces {}.

It is also possible to create a dictionary with zip feature using dict() function, zip() function
along with range() function as demonstrated in the following example:
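A possible sketch of this zip-based dictionary creation is:

mydict = dict(zip('INDIA', range(5)))
print(mydict)        # {'I': 3, 'N': 1, 'D': 2, 'A': 4} -- the second 'I' overwrites the first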

Explanation
We have created a dictionary by iterating over the zipped version of the string ‘INDIA’ and the
list [0, 1, 2, 3, 4]. The string ‘INDIA’ has two ‘I’ characters inside, and they are paired up with
the values 0 and 3 by the zip function. It can be noted that in the dictionary, the second
occurrence of the ‘I’ key (with index 3) overwrites the first one (with index 0).

The dictionary is a distinctive data type of Python and has
extensive use in data analytics whenever mapping is required.

3.3.2 Accessing Dictionary Elements


Unlike list and tuple, indexing and slicing cannot be used to access elements of a dictionary. The
items() function returns all the key–value pairs in the dictionary, the keys() function returns all
the keys in the dictionary, and the values() function returns all the values in the dictionary. It is
possible to access the value associated with a key by writing the name of the key inside square
brackets.

Explanation
The items() function returns all the key–value pairs specified in the dictionary. The keys()
function returns all the keys of the dictionary and the values() function returns all the values of
the dictionary. We can get the value of a particular key by specifying the name of the key in the
square brackets. The command book_dict['name'] gets the value of the key “name”, and
book_dict['author'] helps to get the value of the key “author”.

3.3.3 Functions for Dictionary


We can add new key–value pair using different ways in a dictionary in Python. It is also possible
to modify the existing values of keys. Different functions are also available for performing
different operations on dictionaries, such as len(), clear(), pop(), get(), sorted(), copy(),
etc. The usage of these functions is illustrated in the following example:
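A sketch illustrating these dictionary functions, with an assumed dictionary, is:

book_dict = {'name': 'Python Basics', 'author': 'XYZ', 'price': 500}
print(len(book_dict))                 # number of key-value pairs
book_dict['year'] = 2020              # add a new key-value pair
book_dict['price'] = 550              # modify an existing value
print(book_dict.get('author'))        # fetch a value by key
copy_dict = book_dict.copy()          # shallow copy of the dictionary
print(sorted(book_dict))              # sorted list of the keys
print(book_dict.pop('year'))          # remove a key and return its value
book_dict.clear()                     # remove all items
print(book_dict, copy_dict)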

Explanation
The comments written on the top of each statement for using the respective function and the
output generated after execution of the program clearly explains the utility and usage of
functions existing for dictionary.

3.3.4 Programming with Dictionary


The unique feature of dictionary of creating key–value pairs helps a lot in doing efficient
programming in some specific requirements. This section discusses some programs where the
concept of dictionary is useful.

Explanation
An empty dictionary named “mydict” is created using the command dict(). The user is
prompted to enter the last number for the dictionary. The “for” loop starts from 1 and is executed
till the user-defined number stored in variable end. The above program adds the number and its
cube in the dictionary during each execution of “for” loop. The dictionary when printed
displays all the key–value pairs representing number and its cube.

Explanation
A dictionary named product is created consisting of four names of products as keys and their
prices as values. The items() function displays all the key–value pairs in the dictionary. The
use of sum() function on all the values displays the total amount of all the values in the
dictionary. The total amount of value of all the products is thus displayed as 51800 (800 + 1000
+ 15000 + 35000).

There are different functions related to check the data type of the characters. These include
isdigit(), isnumeric(), isalpha(), isupper(), islower(), etc., which are discussed in the
following programs:
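One possible sketch of the case-counting program described below is:

mydict = {'upper': 0, 'lower': 0}
mystring = input("Enter a string: ")
for letter in mystring:
    if letter.isupper():
        mydict['upper'] += 1
    elif letter.islower():
        mydict['lower'] += 1
    else:
        pass                      # ignore spaces, digits, punctuation
print("Upper case letters:", mydict['upper'])
print("Lower case letters:", mydict['lower'])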

Explanation
A dictionary named “mydict” is created having two keys, named upper and lower and having
values as 0. The user is prompted to enter the string which is stored in mystring. The case of
each and every letter in the string is checked using isupper() and islower(). If the letter is in
upper case, the key “upper” is incremented by 1 and if the letter is a lower case, the key “lower”
is incremented by 1. If the letter is neither in upper case nor lower case (like space) “pass”
command is executed for moving to another letter in the string. Finally, the number of upper
case letters and lower case are fetched from their respective keys and printed.

Create two variables for upper and lower. Create a program for the above
example without dictionary and execute the results.

Explanation
A dictionary named “mydict2” is created having two keys named “numbers” and “letters” and
having values as 0. The user is prompted to enter the alphanumeric string which is stored in
mystring. Each and every character in the string is checked using isdigit() and isalpha(). If
the letter is a digit, the key “numbers” is incremented by 1. If the character is an alphabet, the
key “letters” is incremented by 1. If the character is neither a digit nor an alphabet (like space),
“pass” is executed for moving to another character. Finally, the number of digits and alphabets
are fetched from their respective keys and printed. Thus, in the string “Hello123”, 5 are
alphabets and 3 are digits.
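A sketch of the program explained below, with the names dict_name, count, and fname taken from
that explanation, is:

dict_name = dict()
count = 0
while True:
    fname = input("Enter the name of a friend (type Stop to finish): ")
    if fname == 'Stop':
        break
    count = count + 1
    dict_name[count] = len(fname)      # key: serial number, value: length of the name
print("Number of friends:", count)
print(dict_name.items())
print("Length of the longest name:", max(dict_name.values()))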

Explanation
This program creates an empty dictionary by the name of dict_name. It then asks the user to
enter the names of friends till ‘Stop’ is entered. Each time the name is entered, the value of
count gets incremented by 1. The command dict_name[count]=len(fname) adds a new key–
value pair to the dictionary containing the length of the word. Thus, it is able to count the
number of friends. The items() function prints the value of key–value pairs. Thus, Riya was
the first word and hence the length of the sequence is 4, Rukmini is the second word and the
length of the sequence is 7, and so on. The maximum value of the length of the string from
dictionary is determined by max(dict_name.values()). Thus, it identifies the number of letters
in the longest name and displays the length as 7 corresponding to the word Rukmini.

Explanation
A dictionary named “org” is created consisting of five names of organizations as keys and their
share prices as values. The sorted() function helps to sort the names of the keys
(organizations) and presence of an argument reverse=True inside the sorted() function helps
to sort in descending order.

Create two lists for organization and share price. Create a program for the
above example without dictionary and execute the results.

Summary
• Data structures play an important role in programming languages. The different data
structures included in Python are array, list, tuple, and dictionary.
• The difference between array and list is that an array is a collection of similar data types
(homogeneous), while a list can be a collection of different data types (heterogeneous).
While array and list can be modified by adding, deleting, and modifying the existing
elements, it is not possible to modify the tuple.
• A dictionary is a unique data structure in Python and is represented as key–value pair.
• A list is a collection of objects and basically represents an ordered sequence of data.
• The list items can be accessed using the concepts of indexing and slicing. We can access
individual elements of a list using square brackets. A new list from a portion of an existing
list can also be created using a technique known as slicing. Slicing of list can be done using
[start : end ].
• It is also possible to modify an existing list by removing elements or adding a sub-range of
elements. Different in-built functions for performing operations efficiently on lists exist in
Python. These functions include insert(), copy(), len(), count(), index(), max(), min(),
pop(), remove(), append(), sort(), sum(), etc.
• A tuple is a collection of objects separated by comma inside parentheses. Indexing and
slicing in tuple work similar to lists. All the functions that can be used for lists can be used
for tuple also, except the functions in which modification can take place inside the list.
• A data structure that is not available as a standard built-in type in many other programming
languages but is available in Python is called a Dictionary. It is considered the king of data
structures in Python, since it is the only standard mapping type.
• A dictionary has a group of elements which are arranged in the form of key–value pairs.
Thus, there are three special objects for dictionary: keys, values, and items.

Multiple-Choice Questions
If test=[80,50,40,20,30,40,30,60,30,60,90,40], then the result of each of the following statements is:

1. print(test[4:6])
(a) [30,40,30]
(b) [30,40]
(c) [20,30]
(d) [20,30,40]
2. print(test[9:])
(a) [60,90,40]
(b) Error
(c) [30,60,90,40]
(d) [90,40]
3. print(test[-3:])
(a) [60,90,40]
(b) Error
(c) [30,60,90,40]
(d) [90,40]
4. print(max(test))
(a) 40
(b) 70
(c) 90
(d) 80
5. print(sum(test))
(a) 510
(b) 430
(c) 490
(d) 570
6. print(test.count(50))
(a) 1
(b) 2
(c) 3
(d) 4
7. print(70 in test)
(a) True
(b) False
(c) error
(d) Not executed
8. print(test.pop())
(a) 80
(b) 50
(c) 90
(d) 40
9. print(len(test))
(a) 10
(b) 9
(c) 11
(d) 8
10. print(test.index(40))
(a) 1
(b) 2
(c) 3
(d) 4

Review Questions

1. Differentiate between array and list.


2. Differentiate between append() and insert() functions used in the list.
3. Differentiate between list and tuple.
4. What is the utility of indexing and slicing in tuple?
5. What is the utility of pop() function in list?
6. Discuss the functions that cannot be used on tuple but can be used on list.
7. Discuss different ways to create a dictionary.
8. Discuss the utility of different functions on dictionary.
9. We can add a sub range of elements in an existing list. Justify with an example.
10. What is the utility of zip() function inside the dict() function?

CHAPTER
4

Modules

Learning Objectives
After reading this chapter, you will be able to

• Understand the importance of inbuilt and user-defined modules in Python.


• Understand the functions available in existing Python modules.
• Get exposure to usage of functions and respective modules in different scenarios.
• Create user-defined modules and using them in program.

A module is basically a Python file consisting of Python executable code, functions, statements,
definitions, global variables, etc. Modules are used to break down large programs into small,
manageable, and organized sections for providing reusability of code-like functions. A Python
file named “usermodule.py” is called a module and the name of the module is “usermodule”. The
functions and definitions inside a module can be imported in the program using the “import”
keyword. The following are three approaches to use the existing modules in the program:

1. from module import *: import all functions from module


2. import module: import everything and import the module itself as a unit
3. from module import function1, function2, …: import specified functions from module

There is a small difference between the first two approaches. The first approach imports all the
names from the module directly, while the second approach imports the module itself as a unit. With
the second approach, the user needs to write the name of the module and the function, separated by a
dot. With the first approach, the function name can be written directly without specifying the name
of the module. The second approach therefore requires programmers to use the longer, qualified
names; the first approach removes this limitation but brings every name of the module into the
program. Hence it is better to use the third approach and import only the required functions from
the module. If the module is very big, it is better to avoid importing everything, since it makes
programs less maintainable. However, in the case of small modules, the whole module can be imported.
There are two types of modules, such as Python-inbuilt modules and user-defined modules.
The inbuilt modules available in Python are related to specific functions of the particular module.
The user-defined modules help the user to create own modules for effective programming
through manageable code and reusability of code.

4.1 In-Built Modules in Python

Python has a lot of predefined modules containing specific functions that can be directly used in
programming. Some of the commonly used modules include sys, math, collections, array, re,
statistics, datetime, random, clock, os, etc.

4.1.1 The Math Module


The Math module includes basic mathematical, logarithmic, and trigonometric functions. The
different mathematical functions include fabs(), pow(), factorial(), floor(), ceil(),
sqrt(), etc.; logarithmic functions include log(), log10(), etc., and trigonometric functions
include sin(), cos(), tan(), etc. It is important to note that the user needs to import math
module before using the functions available from this module.
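A minimal sketch using some of these math functions is:

import math

print(math.fabs(-7.25))        # absolute value
print(math.pow(2, 5))          # 2 raised to the power 5
print(math.factorial(5))       # 120
print(math.floor(4.7), math.ceil(4.2))
print(math.sqrt(81))
print(math.log(math.e), math.log10(1000))
print(math.sin(math.pi / 2), math.cos(0), math.tan(0))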

Explanation
The math module is imported using the command “import math”. Since we have imported the
complete module, hence we can use all the functions available from this module. All the
functions of the module are called after writing the name of the module and using the dot
symbol after the module name. This program displays the result of different mathematical,
logarithmic, and trigonometric functions available in math module.

All the functions from module or library can be called in different ways in the program. The
following program demonstrates the different ways in which the factorial() function can be used in
the program.

Explanation
This example shows the different ways in which a function can be accessed from a module and
used in the program according to the user requirement.

4.1.2 The Random Module


The random module in Python contains standard functions related to pseudorandom numbers.
The random(), randrange(), and seed() functions are important.
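A possible sketch demonstrating random(), randrange(), and seed() is:

import random

print(random.random())             # a different float in [0.0, 1.0) on every call
print(random.random())
for i in range(10):
    print(random.randrange(600, 700), end=' ')   # 10 numbers between 600 and 699
print()
random.seed(1000)
print(random.random())             # the same value every time the seed is 1000
random.seed(1000)
print(random.random())
random.seed(4000)
print(random.random())             # a different, but again repeatable, value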

Explanation
The random() function generates a pseudorandom floating-point number and requires no
arguments. We can observe from the first two commands that each time the random() function
is called, different random numbers are generated. The randrange() function returns a random
number falling in the specified range. In this program, the “for” loop is executed 10 times and
the randrange() function had 600 and 700 as arguments, hence 10 random numbers are
generated between 600 and 699 (the upper bound 700 is excluded) and the numbers printed are on a random basis.
In order to generate same random numbers each time, we need to use seed() function and
set it to a seed value. The algorithm is given a seed value of 1000 to begin, and a formula is
used in Python to produce the random value. The seed value determines the sequence of
numbers generated; identical seed values generate identical sequences. The seed function
establishes the initial value from which the random number is generated. Hence when the
random number is generated again, the same random number is displayed, because the same
seed value 1000 is used. Thus, we can infer that same random number is displayed for a
particular value of seed. However, when the seed value was changed to 4000, another number
was produced which also remains the same when the random number was displayed again with
the same seed. Thus, we can say that if we do not use the seed, different random numbers will
be generated. Hence this is an important function used for generating same simple
pseudorandom number by specifying a seed value. It is generally used during development and
testing when we want to display similar results. This concept is used in this book for producing
the same result which will finally help in better understanding by readers.

Create 20 random numbers between 500 and 800.

Generating random numbers from a range can be used in different applications where we wanted
to generate a random number within a range. The following program shows the utility in rolling
of a dice:

Explanation
The upper bound of randrange() is exclusive, so a range of (1, 6) would produce a result between
1 and 5; for a standard six-sided dice the range should be specified as (1, 7), so that the result
printed is between 1 and 6 each time the dice is rolled. Since the “for” loop is executed three times,
the dice is rolled three times and three results are shown when the program is executed. When
the program is executed for the second time, the output differs. This is because the numbers are
randomly generated.

4.1.3 The Statistics Module


In the statistics module, functions primarily exist for determining descriptive statistics.
Descriptive Statistics includes functions for measures of descriptive statistics, such as mean,
harmonic_mean, mode, median, etc., and measures of spread such as standard deviation and
variance. The arithmetic mean is the sum of the data divided by the number of data points. The
median is a robust measure of central tendency, and it is less affected by the presence of outliers
in data. When the number of data points is odd, the middle data point is returned. For example,
The result of function median([11,12,13,14,15]) is 13. When the number of data points is
even, the median is interpolated by taking the average of the two middle values. For example, the
result of median([12,14,16,18]) is (14+16)/2 = 15. The median_low() and median_high()
return the same value as median when the number of elements is odd. But, they are helpful when
the number of data points is even. The median_low() and median_high() functions return low
and high median of numeric data, respectively. For example, when there are even observations as
in the earlier example, median_low([12, 14, 16, 18]) will return the value 14 and
median_high([12, 14, 16, 18]) will return the value 16. The median_grouped() returns the median of
grouped continuous data, calculated as the 50th percentile, using interpolation. The mode returns
the most common data point from discrete or nominal data. For example: mode([11, 21, 21,
13, 13, 23, 13, 14]) is 13.
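A short sketch using the statistics functions mentioned above (sample data assumed) is:

import statistics

data = [11, 12, 13, 14, 15]
print(statistics.mean(data))
print(statistics.harmonic_mean([40, 60]))
print(statistics.median(data))                    # 13
print(statistics.median([12, 14, 16, 18]))        # 15.0
print(statistics.median_low([12, 14, 16, 18]))    # 14
print(statistics.median_high([12, 14, 16, 18]))   # 16
print(statistics.mode([11, 21, 21, 13, 13, 23, 13, 14]))   # 13
print(statistics.stdev(data), statistics.variance(data))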

Explanation
The results of this program and comments on the top of each and every statement clearly
explain the utility of the functions in the statistics module.

4.1.4 The Array Module


An array is a collection of variables of the same type (homogeneous). They can store a fixed-size,
sequential collection of elements of the same type. If we have information of 100 amounts,
then instead of declaring individual variables, such as amount0, amount1, …., amount99, we
declare one array variable such as amount and use amount[0], amount[1], …., amount[99] to
represent individual amount. A specific element in an array is accessed by an index. Thus, the
first amount is represented as amount[0] and the 100th amount is represented as amount[99].

The array() function in an array module helps us to create array in Python. An array can be
created using type code and a list containing elements. The type code represents the nature of all
elements in the array, since array contains similar type of elements.

Syntax
array(typecode, [elements])
where,

1. typecode represents the type code. It can be ‘i’ for an integer array, ‘f’ for a floating-point
array, ‘u’ for a unicode (character) array.
2. elements are entered in the form of a list.
Example: array1 = array('i', [1,2,3,4,5,6]) is an integer-type array named array1,
containing six elements. array2 = array('u', ['X', 'Y', 'Z']) is a unicode array containing
three elements.
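A minimal sketch of array creation and the operations discussed below is:

from array import array

array1 = array('i', [1, 2, 3, 4, 5, 6])     # integer array
array2 = array('u', ['X', 'Y', 'Z'])        # unicode (character) array
for item in array1:
    print(item, end=' ')
print()
for ch in array2:
    print(ch, end=' ')
print()
print(array1[-3:])          # slicing with negative indices
array1[1] = 22              # modifying an element
array1.append(7)
print(array1.tolist())      # convert the array back to a list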

Explanation
An array is created using the array() function of the array module. We know that array
contains homogeneous data. However, we can create two different arrays corresponding to
different data types. In this program, different arrays of integers and characters are created. The
“for” loop is used to display elements of both the arrays.

Explanation
From this example, it is clear that indexing and slicing on an array work in the same way as on
lists and tuples, including negative indexing. Negative indexing helps to fetch elements counted
from the end. Hence a starting index of −3 denotes that the elements will be fetched from the third
position from the end up to the last index. The next command shows that the ending argument is −5;
this means that elements will be fetched from the start up to (but not including) the fifth position
from the end.
It is clear that we can modify existing elements of an array like lists and unlike tuples. We
can observe that the element at index 1 is replaced with 22. Similarly, the elements from index 2
(third position) and before index 5 (sixth position) are replaced with an array of elements. Hence
the modified array shows the four new modified elements.
The functions available on arrays include append(), insert(), extend(), max(), min(),
count(), len(), tolist(), etc. The comments before using the respective function and
the output generated clearly explain the usage and utility of the respective functions.

Create an array of 20 numbers between 500 and 800 and access the 6th
element from the last.

4.1.5 The String Module


The string module provides information of different upper case and lower case letters along with
the digits; digits can be in decimal format, hexadecimal format, octal format, etc.

Explanation
The commands in the program clearly explain the output produced in the program. However,
Python has a great support for strings without the use of string module also. This section
discusses the different functions available for strings in Python. A string in Python is created
using different types of quotes: single quote, double quotes, or triple quotes. Triple quotes are
generally used for creating a paragraph (multi-line string).

Explanation
This program creates strings using single quote, double quotes, and triple quotes and stores in
string1, string2, and string3, respectively. Tripe quotes is used to create a multi-line string in
which \n (newline) character is automatically created.
A string in Python can also be created by concatenation (adding) of different strings (strings
can be created using any approach). The concatenation of strings can be done with or without
using “+” sign.

Explanation
The strings with either single quote or double quotes or both the quotes can be concatenated by
using “+” sign. We have shown the same in this program by creating string4, string5, and
string6, respectively. It is also possible to concatenate the strings without “+” sign by simply
separating the strings as shown for string7. Python also helps in concatenation of strings in
multiple lines as shown for the string8.

4.1.5.1 Accessing String Elements


The elements of a string are accessed through the concept of string indexing and slicing.
Indexing of a string starts from 0 and continues till one less than the length of the string. If the
length of the string is n, then the last index will be n – 1. For example, consider the string
“Python”; its length is 6 and hence the starting index is 0 and highest index is 5 (Length – 1).
The string indexing in Python can be shown as follows:
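A short sketch of string indexing, using the string “PYTHON” referred to in the explanation below, is:

string9 = "PYTHON"
for index, letter in enumerate(string9):
    print(index, letter)
print(string9[0], string9[5])     # first and last characters
print(string9[-1], string9[-2])   # negative indexing starts from the end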

Explanation
The string named “PYTHON” is stored in string9. Hence the first letter has an index 0 and last
letter has index 5 as shown using enumerate function. The letters can also be displayed using
print function wherein the first index is 0. A unique feature of Python for string indexing is that
we can give a negative index to start from the last. This means that string9[-1] will represent
the last character of the string (“N”) and string9[-2] will represent last second character of the
string (“O”), and so on. It is worth mentioning that negative indexing starts from −1 and not
from 0; hence in the case of negative indexing, the index of the first letter will be equal to the
negative of the length of the string (−6 here).
Accessing of different letters of a string is also done through slicing in Python. The colon
operator (:) plays an important role in string slicing. The numbers written before and after colon
in the bracket represents the range of the index of the string. This means that if m:n is the
specified range, then elements starting from index m and before index n will be fetched.
However, absence of a number (m or n) in the range means the complete string.

Explanation
The first print statement has 1:3 in square bracket, which means that letters starting from index
1 (second letter) and before index 3 (till index 2) will be printed. The second print statement
does not have any number before and after the colon and hence all the letters are printed. The
starting index number in the third command is missing and has only the ending index number
which is 3. This means that all the letters from starting will be printed till the index is less than
3. Hence first three letters are printed. The next statement has 3 as the starting index number
and ending index number is missing which means that all the letters from index 3 (fourth letter)
till end of the string will be printed. It is also possible to print last letters using negative
indexing as shown in the next example. The last example concatenates two substrings extracted
from the main string and prints the complete string.

It is also possible to use offsets in string slicing using double colons in square brackets. This is
demonstrated in the following program:

Explanation
We can also offset the string using an additional colon operator (:) in the square brackets. The
offset helps to extract a particular letter. The command string9[::2] will print every second
character from the complete string since no numbers exist within colons (::). The command
string9[0:4:2] will print every second character from the substring extracted from index 0 to
index 3. Since the step value in the last command is 1, no characters are skipped, which
means that all the letters in the specified range are extracted.

4.1.5.2 Case Conversion Functions


Different functions are available for converting string to a title case, upper case, lower case, etc.,
as discussed in the following program:

Explanation
The capitalize() function capitalizes only the first letter of the string, while the title()
function converts the first letter of every word to a capital letter. The lower() and upper()
functions convert the string into lower case and upper case, respectively. The swapcase()
function swaps the case of each character of the original string, and the casefold() function is
used for matching strings after ignoring the case.

4.1.5.3 Alignment and Indentation Functions


For doing alignment, different functions are available, such as rjust(), ljust(), zfill(),
center(), etc. These are discussed in the following program:

Explanation
The function rjust() justifies the string from right while the function ljust() justifies the
string from the left. The remaining characters are filled with the character specified in the function. For
example, since “Python” is a 6-letter word, hence when it is right justified with a size of 14 and
special character “*”, the remaining 8 spaces are filled with “*”. Similarly, in the next case
when size is 13 and the special character is “#”, the remaining 7 spaces are filled with “#”.
However, in the absence of special character as in the next case, the spaces are used to make the
string right justified with the specified width.
The string is left justified with ljust() function as shown in the next example. The
center() function helps to centrally justify the string. The size specified in the center()
function is 18 and the size of the string is 6. Hence remaining 12 spaces are distributed equally
on left (6) and right (6) sides and filled with “@” sign. The zfill() fills the remaining spaces
with 0. Thus, the last function fills the remaining 6 spaces (size of string is 6 and specified size
is 12) with zeros “0”.

4.1.5.4 Other Functions for String


The split() and join() functions are available in Python for splitting and joining, respectively.
The len(), startswith(), endswith(), count(), and find() functions can also be used directly
on strings.
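A sketch of these functions applied to an assumed sample string is:

mystring = "Python; is fun"
parts = mystring.split(';')            # split on the semicolon
print(parts)
print(';'.join(parts))                 # join the pieces back together
print(len(mystring))                   # number of characters
print("fun" in mystring)               # membership test
print(mystring.startswith("good"))     # False
print(mystring.endswith("change"))     # False
print(mystring.replace("fun", "easy"))
print(mystring.count("n"))
print(mystring.find("xy"))             # -1 because "xy" does not occur
print(mystring.rfind("n"))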

Explanation
This program first splits the string by a semicolon using split() and then the two strings that
are formed because of split are joined together using join() in the next print statement which
finally returns the same string as illustrated above. The len() function displays the number of
characters in the string. The use of “in” word determines whether or not the string is contained
in the main string. The function startswith() checks whether the main string is starting with
the specified string. Here, the program checks whether the string starts with “good” and hence
False is returned. The function endswith() checks whether the main string is ending with the
“change” and hence False is returned. The replace () function replaces the word “is in” in the
main string with the word “from” which hence prints the revised string. The count () function
determines the number of times “i” occurred in the string and the answer is 4, since “i” occurred
for 4 times in the string. The find() function returns the index of “xy” in the main string which
is −1 starting from the left. The answer −1 is generated because it is not able to find “xy” in the
main string. The rfind() function returns the index of “in” in the main string which is 34.
Similarly, index() function also returns the index from the left and rindex() function returns
the index from the right. However, the major difference between the index() and find()
functions is related to non-occurrence of a string. If the string does not exist, then index()
function returns an error, whereas the find() function returns a negative value of −1. The
function index() returns the index from the left side and the function rindex() returns the
index from the right side; hence the first time “in” occurred was at index 10 and the last time
“in” occurred was at index 34.

4.1.6 The “re” Module


A regular expression is mainly useful for text processing. It is a special sequence of characters
which helps to match or find other strings or sets of strings using a specialized syntax held in a
pattern. Regular expressions are found from “re” module in Python. The most common functions
of regular expression are findall(), match(), search(), and sub(). The sub() function is used
for replacing a string with another string. Both match() and search() functions are used for
searching a particular pattern in a string. The match() function will report a successful match
only if it starts at index 0. This means that if the match would not start at zero, match() will not
show a successful match. The main difference is that the match() function only matches the
pattern at the beginning of the string, whereas search() scans forward through the complete
string for a match.

Syntax

1. re.sub(old, new, str)


where,
• old signifies the old string that needs to be replaced.
• new signifies the new string that will replace it.
• str specifies the original string in which replacing needs to be done.
2. re.match(p, str)
3. re.search(p,str)
4. re.findall(p,str)
where,
• p is a character/group of characters/pattern that needs to be matched.
• str is the original string in which the match/search needs to take place.

Explanation
The original string is displayed as Hi! Welcome to Python World and the command
re.sub('Hi!', 'Good Morning', mystring1) replaces “Hi!” with “Good Morning” in the
mystring1. The command re.match('Go', mystring2) searches for “Go” at the first position
of string. “Go” is located at the first place and hence it returns span=(0,2) which means that it
is located at 0 index and is spanned till 2. The command re.match('Mo', mystring2) searches
“Mo” at the first position and returns none since “Mo” is not located at index 0. On the other
hand, search () function searches the entire string. But, when the search is done using the
command re.search('Mo', mystring2), it searches and returns the result as span(5,7) which
further means that it is located at index 5 (sixth position since starting index is 0) and is spanned
before index 7.

Explanation
The details of comments used in the program and the result clearly explain the output. The
value of span contains the index of starting and ending letter which matches the search results.

Explanation
The first findall() function displays the first letter of the string in first line (‘W’), but, the use
of re.MULTILINE argument displays the first letter from all the lines (‘W’,‘P’,‘W’).

Explanation
The country named “France” is stored in variable “mystring”. The country “France” is searched
with every item in the list of countries. For each search, the result is printed accordingly.
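A sketch of such e-mail filtering with findall(), using assumed sample addresses, is:

import re

mystring1 = "Contact alice_01@example.com or bob-smith@test.org for details"
emails = re.findall(r'[\w.-]+@[\w.-]+', mystring1)
print(emails)        # ['alice_01@example.com', 'bob-smith@test.org']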

Explanation
A string named “mystring1” is created, which stores different strings of email address. This
program filters the email address from the string and displays them. The findall() function
specifies that any number of alphanumeric characters, dash, underscore, and dot should be
succeeded with @ sign and again followed by any number of alphanumeric characters, dash,
underscore, and dot.

4.1.7 The Time Module


The time module contains functions related to time. The common functions of time module
include clock() and sleep(). The sleep() function suspends the program’s execution for a
specified number of seconds and hence is generally used for controlling the speed of graphical
animations. The clock() function displays the current time and returns a floating-point number,
representing the number of seconds after the first call to clock. It is generally used in
representing elapsed time in seconds to measure the time for execution of a particular section.

Explanation
The clock starts at the time when first call is made for clock() function. The current time is
stored in start_time. The user is prompted to enter the basic information which is stored in
information. The clock() function is called again and the time taken in writing the information
is thus stored in the “timetaken” variable. The result shows that the user took nearly 9 seconds
to write the information.

The sleep() function suspends the program’s execution for a specified number of seconds. The
utility of sleep can be better understood using “for” loop as illustrated in the following example:
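A minimal sketch of a countdown that pauses between values using sleep() is:

import time

for value in range(20, 0, -4):       # 20, 16, 12, 8, 4
    print(value)
    time.sleep(2)                    # pause for 2 seconds between values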

Explanation
Since the range is from 20 down to 0 with a step decrement of 4, this program displays a
countdown from 20 in steps of 4 (20, 16, 12, 8, 4). The sleep(2) function suspends execution for 2
seconds. Hence the result shows that it takes nearly 2 seconds to print the next value in the
“for” loop.

4.1.8 The Datetime Module


This module is related to date and time and has some functions, such as time() and date(),
which helps to store time and date in proper format. The functions and attribute available in this
module are illustrated as follows:

Explanation
The user is prompted to enter the name and age. It determines the current year from the
todaydate.year and hence determines the year when the person will be 100. The MAXYEAR
shows the maximum value of a year which is 9999, and MINYEAR shows minimum value as
1. The time() and date() functions convert the values into time and date format, respectively.

Calculate the age of the user based on the birthdate given as an argument to the function.

4.1.9 The “os” Module


The current working directory can be determined by using getcwd() function from “os” library.
It is also possible to change the path of working directory by using the function chdir() as
illustrated in the following example:
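A minimal sketch of changing the working directory (the drive D:\ is assumed to exist, as in the example described below):

import os

print(os.getcwd())     # e.g. C:\Users\bharti
os.chdir("D:\\")       # change the working directory to D:\
print(os.getcwd())     # D:\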

Explanation
The path of the current working directory was “C:\Users\bharti”, which was then changed to “D:\”.
Hence the user can now keep files from other software at D:\ and read them directly.

4.2 User-Defined Module


Python supports the users by helping them to create their own module of frequently used
functions and code. The user can import the user-defined module in the program and use the
desired functions of module in the program, rather than copying the code of frequently used
functions into different programs. This will help the user to reduce lines of code, save time, and
do efficient programming by managing code effectively. This section deals with the example of
creating a user-defined module and using it in our program.

4.2.1 Creating a Module


In the following section, we have created a file containing Python code named mymodule.py;
hence the module name would be “mymodule”. The Python file has two functions named
“performance” and “interest” and a variable named “amount”. The value of the variable “amount”
is 5000.
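A possible sketch of mymodule.py; the grade cut-offs used in performance() are assumptions, since only the behaviour of the functions is described here:

# mymodule.py
amount = 5000

def performance(score):
    # return a grade for the given score (assumed cut-offs)
    if score >= 75:
        return "A"
    elif score >= 50:
        return "B"
    return "C"

def interest(p, r, t):
    # return simple interest for principal p, rate r and time t
    return (p * r * t) / 100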

Explanation
This program creates two functions named performance and interest. The performance()
function has one input argument named score, and the interest() function has three arguments
p, r, and t, representing principal, rate, and time. The performance() function returns the
grade and the interest() function returns the simple interest.
Since this file is saved as mymodule.py, the module will be created by the name mymodule. The
user can access all the functions and variables defined in the module after importing the module
in a program.

The dir() is an in-built function, which displays the details of the module and returns a sorted
list of strings containing the names defined by a module. The list contains the names of all the
modules, variables, and functions that are defined in a module. This includes some auto-
generated terms along with user-defined functions as illustrated in the following section:
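A short sketch of using dir() on the module created above:

import mymodule

print(dir(mymodule))
# [..., '__name__', '__package__', '__spec__', 'amount', 'interest', 'performance']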

Explanation
This program uses the dir() function to display the names defined inside the module: a sorted
list containing the two user-defined functions (performance and interest) and the variable
(amount). All other names, which begin with an underscore, are default Python attributes
associated with the module and are not defined by the user. For example, the __name__ attribute
contains the name of the module.

4.2.2 Importing the User-Defined Module


It is possible to import all the attributes from the module by using import keyword. For example,
the user-defined variable named “amount” in our module “mymodule” and functions named
“performance” and “interest” can be called in our program if we import the complete module.
This is illustrated in the following program:
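A hedged sketch of such a calling program (the menu text and input prompts are assumptions):

import mymodule

choice = int(input("Enter 1 for grade or 2 for simple interest: "))
if choice == 1:
    score = float(input("Enter the score: "))
    print("Grade:", mymodule.performance(score))
elif choice == 2:
    p = float(input("Principal: "))
    r = float(input("Rate: "))
    t = float(input("Time: "))
    print("Simple interest:", mymodule.interest(p, r, t))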

It is not necessary that import statement should be in the first few lines of the
program. It can occur at any line in the program. However, the module should
be imported before it can be used in the program.

Explanation
The user-defined module named “mymodule” is imported in this program. The functions and the
variable in the module are accessed using the dot (.) operator. The functions named
“performance” and “interest”, which are defined in mymodule.py, are called in this program by
writing the name of the module before the name of the required function, separated by a dot (.).
The result is computed by those functions in the module and returned to the calling program.
Thus, when a user enters the choice as 1, the performance() function calculates the grade and
returns it to the main program; when the user enters the choice as 2, the simple interest is
calculated by the interest() function in mymodule and the result is returned to the calling program.

Like functions, modules help in doing modular programming. The advantage of a
module over a function is that a module can contain many variables and functions,
which can be called and used as per the requirement of the user.

It is also possible to call functions of more than one module in a single program. The following
program creates a table in a new module named mymodule2:
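A minimal sketch of mymodule2.py based on the description that follows:

# mymodule2.py
def table(num, value):
    # print the table of num up to the specified value
    for i in range(1, value + 1):
        print(num, "x", i, "=", num * i)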

Explanation
We can observe that the name of the function is “table”, which takes two arguments, “num” and
“value”. The “for” loop is executed from 1 to value+1 (that is, value times). The function is thus
able to print the table of “num” up to the specified value.

The following program imports multiple modules and uses another approach (from module
import *) to import the complete module:
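A possible sketch of such a program (the menu choices and sample arguments are assumptions):

from mymodule import *
from mymodule2 import *

choice = int(input("Enter 1 for grade, 2 for interest, 3 for table: "))
if choice == 1:
    print("Grade:", performance(80))              # called without the module name
elif choice == 2:
    print("Simple interest:", interest(5000, 8, 2))
elif choice == 3:
    table(7, 10)                                   # prints the table of 7 up to 10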

Explanation
We can observe that the statements “from mymodule import *” and “from mymodule2 import
*” are able to import all the functions from “mymodule” and “mymodule2” module, respectively.
This will help us to call these functions directly in our program by their name without
specifying the name of the module. In the previous example where the complete module was
imported using import keyword for using the function, the name of the module was also written
along with the name of the function. When the user enters the choice as 3, the function in the
second module is called; and when the user enters the choice as 1 or 2, the functions in the first
module are called.

Create three modules containing different functions as discussed in Chapter 2.


Depending on the user's choice, import the required function from the appropriate
module and execute it in the program.

Similar to the in-built modules in Python, it is also possible to rename a user-defined module
for ease of programming, as shown in the following example:
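A brief sketch of renaming the module while importing:

import mymodule as m

print(m.amount)                    # 5000 -- the variable is accessed through the alias "m"
print(m.interest(5000, 8, 2))      # functions are also called through the alias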

Explanation
We have renamed the “mymodule” module as “m”; hence, in the program, the amount variable is
accessed as m.amount, that is, by writing the new module name before the variable.
It is also possible to import only a particular variable or a function from the module rather
than importing the whole module. We know that importing everything with the asterisk (*)
symbol is not a good programming practice. This can lead to duplicate definitions for an
identifier. It also hampers the readability of our code. It is suggested that we import only the
required functions in our program.

The following program imports only the amount variable from “mymodule”:
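A brief sketch of importing a single attribute (the variable name total is an assumption):

from mymodule import amount

total = amount         # no dot operator is needed for the imported name
print(total)           # 5000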

Explanation
Since we have imported only the attribute “amount” from the mymodule, there is no
requirement to use the dot operator. The result directly displays the value of total variable as
5000.

Summary
• A module is basically a Python file consisting of Python executable code, functions,
statements, definitions, global variables, etc. Modules are used to break down large programs
into small manageable and organized files for providing reusability of code-like functions.
• There are a lot of predefined modules in Python related to common functions which can be
directly used in the program. Some of the commonly used modules include sys, math,
collections, array, re, statistics, datetime, random, time, etc.
• The math module includes basic mathematical, logarithmic, and trigonometric functions. The
different mathematical functions include fabs(), pow(), factorial(), floor(),
ceil(), sqrt(), etc.; logarithmic functions include log(), log10(), etc.; and
trigonometric functions include sin(), cos(), tan(), etc.
• The random module contains standard functions related to pseudorandom numbers. The
random(), randrange(), and seed() are the important functions of this module. The seed
value determines the sequence of numbers generated, and identical seed values generate
identical sequences.
• In the statistics module, functions primarily exist for determining descriptive statistics.
Descriptive statistics include functions for measures of descriptive statistics, such as mean,
harmonic_mean, mode, median, etc., and measures of spread, such as pstdev(), stdev(),
variance(), and pvariance().
• Indexing in arrays, including negative indexing and slicing, works similar to lists and tuples.
Negative indexing helps to fetch elements counted from the end of the array.
• Regular expressions are handled by the “re” module in Python. The most common functions for
regular expressions are match(), search(), findall(), and sub().
• The time module contains functions that are related to time. The common functions of time
module include clock() and sleep().
• The datetime module is related to date and time and has some functions such as time() and
date(), which help to store time and date in proper format.
• The dir() is an in-built function which displays the details of the module and returns a
sorted list of strings containing the names defined by a module.
• A user-defined module is created in Python by creating a file containing Python code, and the
module name would be the name of the file.

Multiple-Choice Questions
If mydata = [1, 2, 3, 4, 3, 9, 5, 6, 7, 8]

1. The result of print(st.mean(mydata)) is


(a) 4.8
(b) 4.5
(c) 4
(d) 4.3
2. The result of print(st.median (mydata)) is
(a) 4.8
(b) 4.5
(c) 4
(d) 4.3
3. The result of print(st.mode(mydata)) is
(a) 3
(b) 4
(c) 5
(d) 6
4. The result of print(st.median_high(mydata)) is
(a) 3
(b) 4
(c) 5
(d) 6
5. The result of print(st.median_low(mydata)) is

(a) 3
(b) 4
(c) 5
(d) 6
6. The ___________ function returns a sorted list of strings containing the names defined by a
module.
(a) dir()
(b) details()
(c) struct()
(d) None of these
7. The result of math.factorial(6) is
(a) 510
(b) 720
(c) 120
(d) 240
8. Which function is not available in random module?
(a) random()
(b) words()
(c) seed()
(d) randrange()
9. If myarray1= array.array(‘i’,[40,50,60,70,80,90, 100]), result of print(myarray1[-2:])
is
(a) [40, 50, 60]
(b) [80, 90, 100]
(c) [40, 50]
(d) [90, 100]
10. The result of print(myarray1[:-5]) is
(a) [40, 50, 60]
(b) [80, 90, 100]
(c) [40, 50]
(d) [90, 100]

Review Questions

1. What is the importance of user-defined module? Discuss different in-built modules of


Python.
2. What is the utility of seed() function from random module?
3. Explain the utility of randrange() function from random module with proper example.
4. Explain different functions available in math module with example.
5. Discuss the utility of functions available in time module with proper example.
6. Differentiate between search() and match() found in “re” module.

7. Differentiate among append(), insert(), and extend() used for array.
8. Create a module named basicmaths having four functions named add, sub, mult, and div.
All these functions take two arguments as numbers. Import the module in the main program
and call the respective function depending on the user requirement and return the computed
result.
9. Explain the result of negative indexing on arrays with an example.
10. Create a module named area having four functions named triangle, square, rectangle,
and circle. All these functions take arguments according to their shape; compute and
return the area of the shape. Import the module in the main program and call the respective
function depending on the user requirement and compute the area of respective shape.

SECTION 2
Core Libraries in Python

Chapter 5
Numpy Library for Arrays

Chapter 6
Pandas Library for Data Processing

Chapter 7
Matplotlib Library for Visualization

Chapter 8
Seaborn Library for Visualization

Chapter 9
SciPy Library for Statistics

Chapter 10
SQLAlchemy Library for SQL

Chapter 11
Statsmodels Library for Time Series Models

CHAPTER
5

Numpy Library for Arrays

Learning Objectives
After reading this chapter, you will be able to

• Build foundation for understanding arrays through numpy library.


• Have familiarity for accessing elements through slicing and indexing.
• Understand the utility of functions available in numpy library.
• Get exposure to special functions for single- and multidimensional array.

An array is a data structure that contains a group of elements of the same data type, such as an
integer or string. Arrays are used to organize data for easy sorting or searching. Instead of
declaring 1000 individual variables, such as num0, num1, …, and num999, we can have one
array variable such as num and use num[0], num[1], …, num[999] to represent individual
variables. A specific element in an array is accessed by an index. In Python, arrays can be
efficiently created using numpy library, which provides many functions for effective
programming related to arrays. Unlike array module which can create only one-dimensional (1-
D) array, numpy arrays can also be multidimensional.

Arrays can be created using numpy library or using array module.

5.1 One-Dimensional Array


A 1-D array is an array of elements of similar data type, with indexes starting at 0 and ending at n − 1 for an array of n elements.

5.1.1 Creating a 1-D Array


An array can be created using the array() function of numpy library. We need to pass an
argument related to the data type of elements of an array along with the elements in the form of
list. For example, an integer array is created using “int” argument; string array is created using
“str” argument; and array of float elements is created using “float” argument. An array can
also be created from an existing array using view() and copy() command.
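A minimal sketch of these creation methods (the element values and array names are assumptions):

import numpy as np

intarray   = np.array([10, 20, 30], int)
floatarray = np.array([1.5, 2.5, 3.5], float)
strarray   = np.array(['Pen', 'Book', 'Mouse'], str)
copyarray  = intarray.copy()       # an independent copy of intarray
viewarray  = intarray.view()       # a view on the same data as intarray
print(intarray, floatarray, strarray, copyarray, viewarray)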

Explanation
This program demonstrates different methods of creating arrays of different data types, such as
integer, decimal, character, string, etc. It is important that all the elements in a single array are
of same data type. An array of integer data type is created using “int” argument, float data type
is created using “float” argument, and string data type is created using “str” argument. The functions
view() and copy() are used to create new arrays from the existing arrays. The new arrays are
similar to the existing arrays.

The numpy library provides many other different functions for creating an array efficiently.
These functions include using arange() function, zeros() function, ones() function, and
linspace() function. The arange(start,end) function helps to create arrays with regularly
incrementing values from start up to (but not including) end, similar to range() function. The linspace(start, end,
size) function will create array with a specified number of elements according to the specified
size. The elements will be spaced equally between the specified start and end values. The
zeros(shape) function will create an array of the specified shape filled with value “0”. The
default dtype is float64. The ones(shape) function will create an array of specified shape filled
with value “1”. It is identical to zeros() in all the other aspects. These functions are discussed in
the following example:
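A hedged sketch of these functions, reusing the array names mentioned in the explanation below where possible:

import numpy as np

myarray7 = np.arange(14)                     # 0 to 13
myarray8 = np.arange(14, 19, dtype=float)    # 14. 15. 16. 17. 18.
myarray9 = np.linspace(12, 24, 5)            # 12. 15. 18. 21. 24.
myarray10 = np.linspace(11., 14., 5)         # 11. 11.75 12.5 13.25 14.
print(myarray7, myarray8, myarray9, myarray10)
print(np.zeros(4))                           # [0. 0. 0. 0.]
print(np.ones(5))                            # [1. 1. 1. 1. 1.]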

Explanation
This program creates arrays using different functions, such as arange(), linspace(), zeros(),
and ones(). The arange() function either accepts one or two arguments. If one argument is
written, it will represent only the ending number, and the default starting element will be
considered as 0. If two arguments are written, it will represent both starting and ending
numbers. The array named myarray7 was created using only one argument as 14. Hence
numbers from 0 to 13 are printed. It is important to understand that the default data type of
creating an array from the arrange() function is an integer. But the user can provide a data
type according to the requirement as shown in the command “numpy.arange(14,19, dtype =
np.float)”. Since the array was created using two arguments 14 and 19, hence the numbers are
printed from 14 to 18. From the results, we can observe that since float data type is specified in
the function, the numbers are represented in the output with a decimal point.
The linspace() function has a third argument for creating numbers at equal intervals. The value
of this argument gives the number of elements in the array. Thus, the command
“numpy.linspace(12, 24, 5)” creates an array of 5 elements and considers 12 as the starting
element; since the ending number is 24, an interval of 3 is calculated for displaying 5 numbers.
The command “numpy.linspace(11., 14., 5)” generates 5 equally spaced numbers from 11 to
14 inclusive; hence an interval of 0.75 is calculated, and the generated 5 numbers are 11, 11.75,
12.5, 13.25, and 14.
The zeros() and ones() functions help to produce an array having all elements as 0 and 1,
respectively. Thus, np.zeros((4)) creates an array of 4 elements having value as 0 and
np.ones((5)) creates an array of 5 elements having value as 1.

5.1.2 Accessing Elements of 1-D Array


Indexing and slicing in a numpy array work similar to the array created from the array module, as
discussed in Section 4.2.4.

5.1.3 Functions for 1-D Array

In-built functions on arrays make the task easier in comparison to the in-built
functions available on lists and tuples.

In comparison to the array created by the array module, a numpy array has access to many other
mathematical, trigonometric, statistical, logarithmic, and exponential functions. It was not
possible to apply these special functions on the array created from the array module discussed
earlier in Chapter 4. Note that if array1 is a numpy array, the statement array1.min() is
equivalent to numpy.min(array1).
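A brief sketch of a few of these functions on an assumed array:

import numpy as np

array1 = np.array([4, 9, 16, 25, 36])
print(array1.min(), array1.max(), array1.sum())    # 4 36 90
print(np.mean(array1), np.median(array1))          # 18.0 16.0
print(np.sqrt(array1))                             # [2. 3. 4. 5. 6.]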

Explanation
The comment before the usage of the function and the output generated by using the respective
function explains the utility and result of different functions.

5.1.4 Mathematical Operators for 1-D Array


It is possible to use basic mathematical operators like addition (+), subtraction (−), multiplication
(*), division (/), integer division (//), and modulus (%) on 1-D array as demonstrated in the
following example:
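A minimal sketch using an assumed array (the operands 45 and 10 follow the explanation below):

import numpy as np

myarray = np.array([100, 200, 300, 400])
print(myarray + 45)     # 45 is added to every element
print(myarray - 10)     # 10 is subtracted from every element
print(myarray * 2)      # every element is doubled
print(myarray % 3)      # remainder of every element when divided by 3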

Explanation
This program shows the use of basic mathematical operators on 1-D array. It is clear that the
use of mathematical operators with single element is applicable to all the elements of the array.
Hence the number 45 is added to all the elements of the array. Similarly, 10 is subtracted from
all the elements of the array and so on.

We can also perform mathematical operations on multiple arrays. It should be noted that the size of
all the arrays should be the same, as demonstrated in the following example:
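A short sketch with two assumed arrays of the same size:

import numpy as np

array1 = np.array([10, 20, 30, 40])
array2 = np.array([1, 2, 3, 4])
print(array1 + array2)      # [11 22 33 44]
print(array1 - array2)      # [ 9 18 27 36]
print(array1 * array2)      # [ 10  40  90 160]
print(array1 // array2)     # [10 10 10 10]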

Explanation
This program imports numpy library and renames it as “np”. It shows the use of basic
mathematical operators on multiple 1-D array having similar number of elements.

5.1.5 Relational Operators for 1-D Array


It is also possible to use different relational operators (<, >, <=, >=, ==, !=) on multiple arrays as
described in the following example:
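A possible sketch with two assumed 8-element arrays:

import numpy as np

a = np.array([1, 5, 3, 9, 2, 8, 7, 4])
b = np.array([2, 5, 1, 9, 3, 6, 7, 5])
print(a == b)     # [False  True False  True False False  True False]
print(a < b)      # element-wise comparison
print(a >= b)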

Explanation
This example shows the use of relational operators on two arrays. We can observe that since
there are 8 elements in each array, the output for each execution also shows 8 Boolean values
(result of relational operator for each corresponding element of the two arrays). The comment
written at the top of the relational operator and the result generated shows the utility of these
operators.

5.2 Multidimensional Arrays


The simplest form of multidimensional array is the 2-D array. A 2-D array has rows and
columns. It can also be considered as a list of 1-D arrays. A 2-D array of size [m][n] can be
considered as a tabular form having “m” number of rows and “n” number of columns. Thus, a
matrix is considered to be a specialized 2-D array.

5.2.1 Creating a Multidimensional Array
A multidimensional array can be created using array(), reshape(), matrix(), zeros(), or
ones() function. In the array() function, a 2-D array is written as a main pair of square brackets
containing one inner pair of square brackets for each row of the array. The matrix() function uses a semicolon
(;) to separate a row from the other. The reshape() function converts the 1-D array into
multidimensional array on the basis of specified argument for rows and columns. The zeros()
and ones() functions create multidimensional array of specified dimensions and fill them with
zeros and ones, respectively. These functions are discussed in the following program:
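A hedged sketch of these creation functions (the element values are assumptions):

import numpy as np

arr2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])     # 2 rows, 4 columns
mat = np.matrix('10 20; 30 40; 50 60')             # 3 rows, 2 columns
newarray = np.array([1, 2, 3, 4, 5, 6])
print(np.reshape(newarray, (3, 2)))                # 3 rows, 2 columns
print(np.zeros((4, 2)))                            # 4 x 2 array of zeros
print(np.ones((2, 3)))                             # 2 x 3 array of ones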

Explanation
The array() function in this program creates 2 rows and 4 columns since there are 2 square
brackets inside the main square bracket and each square bracket has 4 elements. Each row in the
array created by the matrix() function is separated by a semicolon (;) and all the elements of
the row are written together. Hence the array created using matrix() function had 3 rows and 2
columns. The command “np.reshape(newarray, (3,2))” creates an array of 3 rows and 2
columns from the elements of the array named “newarray”. The command “np.zeros((4,2))”
creates an array of 4 rows and 2 columns in which all the elements are 0. The command
“np.ones((2,3))” creates an array of 2 rows and 3 columns in which all the elements are 1.

5.2.2 Accessing Elements in Multidimensional Array


Elements in a multidimensional array are accessed using the concept of indexing and slicing.
Indexing helps to specify the location of the element in an array. An element in a 2-D array is
accessed by using the subscripts, that is, row index and column index of the array. Every element
in the array is identified by an element name of the form myarray[x][y], where “myarray” is the
name of the array, and “x” and “y” are the subscripts corresponding to the index of row and
column that uniquely identify each element in “myarray”. We know that the index of first
element is 0; hence myarray[3][1] denotes the second element from the fourth row of the
array.
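A small sketch of indexing with an assumed 4 x 2 array:

import numpy as np

myarray = np.array([[10, 20], [30, 40], [50, 60], [70, 80]])
print(myarray[3][1])     # 80 -- second element of the fourth row
print(myarray[0][0])     # 10 -- first element of the first row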

Explanation
The number in the first square bracket denotes the index of the row, and number in the second
square bracket denotes the index of the column. The result of this program represents the same
concept.
A slice represents a part or piece of the array. Slicing in multidimensional array is done by
specifying the starting and ending index of both row and column. It should be noted that in a
multidimensional array, the starting and ending index of rows and columns are separated by a
colon and rows and columns are separated by comma. The slicing is done by specifying four
arguments [begrow:endrow, begcol:endcol]. Here, “begrow” specifies begin of row,
“endrow” specifies end of row, “begcol” specifies begin of column, and “endcol” specifies end
of column.
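A minimal sketch reusing the array name from the explanation below (the element values are assumptions):

import numpy as np

mymultarray7 = np.array([[1, 2, 3, 4],
                         [5, 6, 7, 8],
                         [9, 10, 11, 12]])
print(mymultarray7[:, :])         # all rows and columns
print(mymultarray7[0:2, ])        # all columns of the first two rows
print(mymultarray7[2, ])          # third row
print(mymultarray7[0:2, 0:2])     # first two rows and first two columns
print(mymultarray7[1, 1:4])       # second row, second to fourth columns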

Explanation
The command mymultarray7[:,:] does not specify the row and column numbers; hence all the
rows and columns of the array are displayed. The command “mymultarray7[0:2,])” does not
specify number for second argument and hence displays all the columns of the first two rows.
Similarly, the command “mymultarray7[2,]” displays the third row. The command
“mymultarray7[0:2,0:2])” specifies both the row and column indexes, and hence all the rows
and columns starting from index 0 and before index 2 are displayed. This means that the first
and second row and column elements are displayed. Similarly, the command
“mymultarray7[1,1:4])” displays the second row and columns starting from index 1 and
before index 4. Hence second row for second to fourth columns are displayed.

Input the elements of the array from the user and count the number of even
digits in the array.

5.2.3 Functions on Multidimensional Array


Some functions which are applicable to multidimensional arrays only include transpose(),
diagonal(), flatten(), sort(), etc. The ndim and shape returns the dimensions and shape of
the multidimensional array, respectively. The transpose() function returns a new matrix whose
rows are the columns of the original and vice versa. The diagonal() function returns the
diagonal elements of the matrix, whereas the flatten() function returns a 1-D array by
converting the matrix into one dimension. The sort() function helps to sort the array and the axis
argument in this function determines whether the array will be sorted on row or column basis.
These functions are discussed in the following example:
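A brief sketch of these functions on an assumed 2 x 2 matrix:

import numpy as np

mat = np.array([[3, 1], [2, 4]])
print(mat.ndim, mat.shape)       # 2 (2, 2)
print(mat.transpose())           # rows become columns
print(mat.diagonal())            # [3 4]
print(mat.flatten())             # [3 1 2 4]
print(np.sort(mat, axis=0))      # sorted column-wise
print(np.sort(mat, axis=1))      # sorted row-wise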

Explanation
The comment displayed before each function describes the function and the output generated
clearly explains the usage and utility of all the functions.

5.2.4 Mathematical Operators for Multiple Multidimensional Arrays


We can use common mathematical operations on single multidimensional array using an element
or between multidimensional arrays as discussed in this section.
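A possible sketch of element-wise operations and matrix multiplication (the element values are assumptions):

import numpy as np

m1 = np.array([[1, 2, 3], [4, 5, 6]])          # 2 x 3
m2 = np.array([[7, 8], [9, 10], [11, 12]])     # 3 x 2
print(m1 + m1)                                 # element-to-element addition
print(m1 * m1)                                 # element-to-element multiplication
print(np.matmul(m1, m2))                       # matrix multiplication -> [[ 58  64] [139 154]]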

Explanation
The four basic mathematical operations, when used on two matrices, return a new matrix as the
result. The operation is carried out on an element-to-element basis, and element-to-element
multiplication is possible only if the two matrices have the same order. The matrix multiplication
of mathematics, in contrast, is possible if the number of columns of the first matrix equals the
number of rows of the second matrix; the resultant matrix then has the order: number of rows of
the first matrix * number of columns of the second matrix. The last command shows such a
matrix multiplication, since the order of the two matrices satisfies this requirement, and the order
of the resultant matrix here is 2 * 2. The comments used in the program and the results displayed
clearly explain the output.

Input the elements of two multidimensional arrays from the user and perform
the multiplication of arrays.

5.2.5 Relational Operators for Multiple Multidimensional Arrays
It is also possible to use different relational operators (<, >, <=, >=, ==, !=) on arrays as
described in the following example:

Explanation
The relational operators for multidimensional array works in a similar manner as the relational
operators in 1-D array. An element-to-element comparison is done for the two matrices.
However, unlike the result of 1-D array, the result of 2-D array is shown in a matrix form
consisting of multiple rows and columns.

Summary
• In Python, arrays can be efficiently created using numpy library. Unlike array module which
can create only 1-D array, numpy arrays can also be multidimensional.
• An array can be created using the array() function of numpy library. We need to pass an
argument related to the data type of elements of an array along with the elements. Example:
An integer array is created using “int” argument; string array is created using “str” argument;
array of float elements is created using “float” argument. An array can also be created from
an existing array using view() and copy() command.
• Numpy library provides many other different functions for creating an array efficiently.
These functions include using arange() function, zeros() function, ones() function and
linspace() function.

• Indexing and slicing in numpy array works similar to the array created using array module.
• In comparison to the array created by array module, numpy array has access to many other
mathematical, trigonometrical, statistical, logarithmic, and exponential functions which were
not possible to use through the array module discussed earlier.
• It is possible to use basic mathematical operators like addition (+), subtraction (−),
multiplication (*), division (/), integer division (//), and modulus (%).
• The simplest form of multidimensional array is the 2-D array. A 2-D array has rows and
columns. It is also considered as a list of 1-D arrays. A 2-D array of size [m][n] can be
considered as a tabular form having “m” number of rows and “n” number of columns.
• A multidimensional array can be created using array(), reshape(), matrix(), zeros(), or
ones() function.
• An element in a 2-D array is accessed by using the subscripts, that is, row index and column
index of the array. Every element in the array is identified by an element name of the form
myarray[x][y].
• Some functions which are applicable to multidimensional arrays only include transpose(),
diagonal(), flatten(), sort(), etc. The ndim and shape return the dimensions and shape of
the multidimensional array, respectively.
• It is possible to use basic mathematical operators and relational operators on arrays as in
mathematics.
• Python supports matrix multiplication if the order of the two matrices is according to the mathematical
requirement (number of columns of first matrix = number of rows of second matrix). The
order of the resultant matrix is number of rows of first matrix*number of columns of second
matrix.

Multiple-Choice Questions
If numpy is imported as np and mytest=np.array([[100,200,300],
[400,500,600],[700,800,900]]), then the result of the following statements is:

1. print(mytest.ndim)
(a) 1
(b) 2
(c) 3
(d) 4
2. print(mytest.shape)
(a) (2, 3)
(b) (3, 2)
(c) error
(d) (3, 3)
3. print(mytest.flatten())
(a) [100 200 300]
(b) [400 500 600]
(c) error

(d) None of the above
4. print(numpy.diagonal(mytest))
(a) [100 500 900]
(b) [100 400 700]
(c) [200 500 800]
(d) [200 600 800]
5. print(numpy.max(mytest))
(a) error
(b) 900
(c) 400
(d) 500
6. print(mytest[0:1])
(a) error
(b) [[100 200 300]]
(c) [[400 500 600]]
(d) [[400 500]]
7. print(mytest[0:1,0:1])
(a) error
(b) [[100]]
(c) [[400]]
(d) [[700]]
8. print(mytest[1][2])
(a) 600
(b) 300
(c) 400
(d) error
9. print(numpy.mean(mytest))
(a) 500
(b) 300
(c) 400
(d) None of the above
10. print(numpy.sum(mytest))
(a) 4300
(b) 3300
(c) 4500
(d) None of the above

Review Questions

1. Explain the usage of linspace() function in creating an array with example.

2. What is the difference between zeros() and ones() function used for creating an array?
3. What is the major advantage of creating an array through numpy library?
4. How does slicing differ between array and list?
5. Discuss the different functions that can be used to create an array using numpy library.
6. What is the utility of arange() function in creating an array?
7. Explain the utility of diagonal() function on a multidimensional array with an example.
8. How does output differ when relational operators are used for 1-D and multidimensional
arrays?
9. Differentiate between transpose() and flatten() functions for multidimensional array.
10. How can an array be sorted on row and column basis?

CHAPTER
6

Pandas Library for Data Processing

Learning Objectives
After reading this chapter, you will be able to

• Understand the importance of dataframe.


• Apply the knowledge of available functions in Pandas library to datasets.
• Develop the skills for importing data from different software into Python.
• Attain competence to handle missing data through excluding and recoding.

The data structures discussed in Chapters 3 and 5 dealt with unlabelled one- and two-
dimensional data. But in real-world problems, we generally require labelled data for effective
data analysis and meaningful interpretation. However, this labelled data requires processing
before doing any analysis and interpretation. Python has an important library named “Pandas”
which is helpful for processing labelled data.

The most useful data structure for data analysis is the dataframe. The dataframe is inbuilt
in R; in Python it can be created using the Pandas library.

6.1 Basics of Dataframe


The Pandas library is specially used for handling data of different dimensions. Series is a one-
dimensional labelled data, and “dataframe” is a two-dimensional labelled data holding any data
type. A series can be created using the function Series() from Pandas library.

Syntax
Series(data, index=list of column names)

where,

• data represents any data type (integers, strings, floating point numbers, Python objects, etc.).
• index represents the axis labels.
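A minimal sketch of creating such a series (the prices are assumptions; the product names follow the explanation below):

import pandas as pd

pricelist = [10, 500, 300, 700]
prices = pd.Series(pricelist, index=['Pen', 'Shirt', 'Book', 'Mouse'])
print(prices)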

Explanation
The library “pandas” is imported and renamed as “pd” for easy programming. A list named
“pricelist” is created using four prices. The Series() function helps to create a series through
pandas which has taken “pricelist” as the data and index is a list of names of four products
[‘Pen’, ‘Shirt’, ‘Book’, ‘Mouse’]. Hence, when the series is printed, the value of index
argument (four product names) is associated with four items of pricelist. But the data in real-
world scenario is generally two-dimensional labelled data which is known as dataframe.

6.1.1 Creating a Dataframe


Dataframe is the most commonly used pandas object and is represented as a two-dimensional
labelled data structure with columns of potentially different types. It can be thought of as a table in
an RDBMS or a spreadsheet. A dataframe is created using the DataFrame() function.

Syntax
DataFrame(data, columns=list of column names)

where,

• data represents a multi-dimensional data of any data type (integers, strings, floating point
numbers, etc.).
• columns has a list representing name of the columns.

The dimension of the dataframe can be determined using “shape” and the name of the columns
can be determined using “keys()” function as illustrated in the following example:
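A short sketch reproducing the dataframe described in the explanation below:

import pandas as pd

productdf = pd.DataFrame([[100, 200, 300, 400], [4, 2, 5, 6]],
                         columns=['Pen', 'Shirt', 'Book', 'Mouse'])
print(productdf.shape)     # (2, 4)
print(productdf.size)      # 8
print(productdf.keys())    # Index(['Pen', 'Shirt', 'Book', 'Mouse'], dtype='object')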

Explanation
The command pd.DataFrame([[100,200,300,400],[4,2,5,6]], columns=['Pen', 'Shirt',
'Book', 'Mouse']) creates a dataframe with four columns, namely, “Pen”, “Shirt”, “Book”,
and “Mouse” and two rows since there are two lists inside the main list. Each row has four
numbers corresponding to four columns. The first row has four numbers representing price of
the product and the second row has four numbers representing quantity of the product. The
dimension of the dataset is determined using shape which produces the result as (2, 4). This
means that the dataframe has two rows and four columns. The size of the dataset is 8 (2*4). The
result for determining names of the columns is produced using “keys()” function which shows
that there are four keys representing names of four products.

6.1.2 Adding Rows and Columns to the Dataframe


We can add rows to an existing dataframe from a new dataframe using the append() function. It
is also possible to add a new column to the dataframe by writing the name of the column in
square brackets along with the name of the dataframe and assigning a list of items to it. This is
demonstrated in the following section:
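A hedged sketch continuing with the productdf created earlier; the values of productdf2 and of the new columns are assumptions, and pd.concat() is used because append() has been removed from recent pandas versions:

import pandas as pd

productdf2 = pd.DataFrame([[150, 250, 350, 450], [3, 1, 2, 8]],
                          columns=['Pen', 'Shirt', 'Book', 'Mouse'])
productdf3 = pd.concat([productdf, productdf2])      # productdf.append(productdf2) in older pandas
productdf3["Mobile"] = [15000, 2, 30, 40]            # add a new column
productdf3["Laptop"] = [40000, 1, 20, 25]            # add another new column
print(productdf3.shape, productdf3.size)             # (4, 6) 24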

Explanation
This program creates a dataframe named productdf2 which has two rows and four columns
similar as the columns of the dataframe “productdf”. The command
“productdf3=productdf.append(productdf2)” adds dataframe “productdf2” to “productdf”
and stores the new dataframe as “productdf3”. We can observe from the results that the new
dataframe “productdf3” has four rows and four columns. The command
“productdf3["Mobile"]=[15000,2,30,40]” creates a new column named “Mobile” and there
are four values in the column: 15000, 2, 30, and 40. Similarly, columns “Laptop” is added to
the dataframe. After adding two new columns, we can observe that the new dataframe has four
rows and six columns and the size of the dataframe is 24.

6.1.3 Deleting Rows and Columns from the Dataframe


It is possible to delete rows and columns from the dataframe using the drop() function. Columns
which need to be deleted from the dataframe are specified by the names of the columns as the
value of the “columns” argument in the drop() function. Rows can be deleted by specifying the
index of the rows to be deleted as the value of the “index” argument in the drop() function.
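A brief sketch of dropping columns and rows from productdf3:

productdf4 = productdf3.drop(columns=["Pen", "Book"])     # remove two columns
print(productdf4.shape)                                   # (4, 4)
productdf5 = productdf4.drop(index=[0])                   # remove all rows with index 0
print(productdf5.shape, productdf5.size)                  # (2, 4) 8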

Explanation
Multiple columns are removed from the dataset by using drop() function on a list of column
names [“Pen”, “Book”] which reduces the dimension to (4, 4). Thus, the columns “Pen” and
“Book” are removed. The command “productdf3.drop(index=[0])” removes rows with
index 0. Since our dataset has two rows for index 0, hence two rows are deleted reducing the
dimension to (2, 4) and the size of the modified dataframe thus becomes 8.

Create a dataframe for employee records related to personal information and salary.
Insert 10 records into the dataframe and delete any 2 records. Also add a new field
named “Performance” and remove the field named “DOB” from the existing employee
dataframe.

6.2 Import of Data


Data in real-world scenario is generally not created in Python but is imported from other sources.
The Pandas library provides many functions to import data from files of different types of
software and stores in a dataframe in Python. The function read_csv() helps to read a “csv” file;
read_excel() helps to read an “excel” file; read_html() helps to read an “html” file;
read_json() helps to read a “json” file; and read_sql() helps to read the result of an SQL query from a database. However, for
reading the file from other software, it is important to keep the file from any other software in the
current working directory.
We will consider an existing dataset for understanding data processing using Pandas in an
effective manner. In the following example, we will consider “LiverPatient.csv” file; dataset
related to Indian Liver Patient Dataset; downloaded from

https://archive.ics.uci.edu/ml/datasets/ILPD+%28Indian+Liver+Patient+Dataset%29.
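A minimal sketch of reading this file, assuming it has been saved in the current working directory as LiverPatient.csv:

import pandas as pd

liver = pd.read_csv("LiverPatient.csv")
print(liver.shape)       # (583, 11)
print(liver.size)        # 6413
print(liver.columns)     # names of the 11 columns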

Explanation
The command “liver=pd.read_csv("LiverPatient.csv")” reads the file “LiverPatient.csv”
from the current working directory and stores in a dataframe named “liver”. The shape shows
that there are 583 observations and 11 columns. The size shows that there are 583*11 = 6413
values (cells) in the dataset, since the size is simply the number of rows multiplied by the number of columns. The names of all the 11 columns in
the dataset are displayed using “columns” or keys() function.

6.3 Functions of Dataframe


The Pandas library provides many functions with respect to a dataframe. These functions are
related to basic functions related to information of the dataframe, such as describe(), info(),
displaying records using head(), tail(), etc.; statistical functions including mean(), median(),
etc.; mathematical functions such as min(), prod(), max(), sum(), etc.; sorting of the dataframe
on the basis of specified column using sort_values() function, etc. This section discusses the
different basic functions that can be used on a dataframe considering “liver” dataset.

6.3.1 Basic Information Functions


This section discusses basic functions such as describe(), info(), head(), tail(), etc., for
displaying the basic information of the dataset.
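A two-line sketch of these calls on the liver dataframe:

liver.info()               # column names, non-null counts and data types
print(liver.describe())    # count, mean, std, min, quartiles and max of numeric columns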

Explanation
The info() function displays complete information about the dataset, such as the column names,
the number of non-null values, and the data type of each column. There is a difference between
describe and describe(): describe (without parentheses) merely echoes the dataframe, while the
describe() function shows the basic statistical values of the numeric columns of the dataset. The
describe() function displays the information related to count, mean, standard deviation,
minimum, quartiles, and maximum value for each column.

The head() and tail() functions display the first and last five records, respectively. However, the
user can fetch specific number of records by specifying the appropriate number as an argument
in head() and tail() functions. Thus, head(3) will represent the first three records and tail(4)
will represent the last four records. It is also possible to determine all the values of particular
column by specifying the column name in square bracket and using values.
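A possible sketch of these calls (the column names follow the dataset used in this chapter):

print(liver.head())                     # first five records
print(liver.head(2))                    # first two records
print(liver[['TB', 'DB']].head(3))      # first three records of TB and DB only
print(liver[['Age', 'TB']].tail(2))     # last two records of Age and TB only
print(liver['Alkphos'].values)          # all values of the Alkphos column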

Explanation
This example clearly shows that the head() function displays first five records. In the next
command when the head() function takes 2 as argument, only two records are displayed. It is
also possible to display values of selected columns by specifying the column name inside the
square brackets. Thus, the command “liver[['TB','DB']].head(3)” displays the first three
records of “TB” and “DB” columns only. The command “liver[['Age','TB']].tail(2)”
displays last two records of “Age” and “TB” columns only. The command
“liver['Alkphos'].values” returns all the values of all observations for alkphos column.

It is also possible to count the number of records and display the descriptive statistics of a
particular column as illustrated in the following example:
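A short sketch of counting and describing a single column (value_counts() with normalize=True is one way to obtain the counts and percentages described below):

print(liver['Gender'].value_counts())                  # Male 441, Female 142
print(liver['Gender'].value_counts(normalize=True))    # proportion of each gender
print(liver['TB'].describe())                          # statistics of the TB column only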

Explanation
The result shows that there are only two values of gender: Male and Female. Number of
observations of people who are “male” is 441 and number of observations of people who are
“female” is 142. The next output shows that 75.6% of observations are male and 24.3% are
female. The last command shows the mean, standard deviation, min, median, quartiles, and
maximum of the “TB” column only.

It is possible to convert a column of the dataframe to a list using the tolist() function, as
illustrated in the following program:
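A two-line sketch of the conversion:

agelist = liver['Age'].tolist()
print(agelist[:10])      # first ten ages as a Python list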

Explanation
The command “agelist = liver['Age'].tolist()” converts the column named “Age” to a
list named “agelist” and the list, when displayed, shows the values of “Age” in the form of a
list.

6.3.2 Mathematical and Statistical Functions
It is possible to use basic mathematical functions, such as sum(), max(), min(), prod(), and
statistical functions, such as mean() and median() on data as shown in the following program:
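A brief sketch of these functions on assumed columns of the liver dataset:

print(liver['TB'].sum(), liver['TB'].max(), liver['TB'].min())
print(liver['Age'].mean(), liver['Age'].median())
print(liver['DB'].prod())     # product of all DB values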

Explanation
The comments written at the top of each function and output produced from the functions
clearly explain the usage and utility of the functions.

6.3.3 Sort Functions
It is also possible to sort the records in ascending or descending order on the basis of a column of
a dataset using sort_values() function. The following program sorts the observations on the
basis of a specified column (“TP”) in dataset:
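A two-line sketch of sorting on the “TP” column:

print(liver.sort_values(by='TP', ascending=False).head(2))    # two highest TP values
print(liver.sort_values(by='TP', ascending=True).head(2))     # two lowest TP values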

Explanation
The command liver.sort_values(by='TP', ascending=False).head(2) sorts the dataset
on the basis of “TP” in descending order and prints the first two records. It is clear that when
the value of argument “ascending” is made “True”, the observations are sorted on the basis of
ascending order.

6.4 Data Extraction


Data extraction according to the user requirement is an important task and is used extensively
for performing data analysis. Different relational operators such as <, >, ==, <=, >=, !=, etc. can
be used to create conditions. These conditions will help in filtering data from the dataset. The use
of logical operators such as and (&) and or (|) help to filter the data on the basis of multiple
conditions. The use of indexers such as loc and iloc also contribute a lot for extracting data
according to the user requirement.

6.4.1 Using Relational Operators
This section discusses the use of different relational operators on dataset.
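A possible sketch of the filters described in the explanation below:

male_data = liver[liver['Gender'] == "Male"]
print(male_data.head(2))                      # first two male records
print(liver[liver['Age'] >= 50].head(3))      # first three records with Age >= 50
print(liver[liver['ALB'] <= 1].tail(2))       # last two records with ALB <= 1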

Explanation
The first command filters the observations for which Gender is “Male” and stores in
“male_data”. The print() function then prints the first two records of this filtered dataset, since
2 is passed as an argument to head() function. The command liver['Age']>=50 filters the
observations whose age is greater than or equal to 50. The print statement then prints the first
three records. The command liver['ALB']<=1 filters the observations whose value of “ALB”
is less than or equal to 1. The print statement then prints the last two records.

6.4.2 Using Logical Operators


Filtering of data can also be done for multiple conditions using “and” (&) and “or” (|) logical
operators for the dataset as shown in the following program:
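A hedged sketch of the three filters described below (the column names follow the dataset used in this chapter):

filter1 = liver[(liver['Age'] >= 35) & (liver['DB'] <= 6)]
print(filter1.shape)                                   # (390, 11)
print(filter1['TB'].sum(), filter1['DB'].prod())

filter2 = liver[(liver['Gender'] == "Female") | (liver['Age'] >= 35) | (liver['DB'] <= 6)]
print(filter2.shape)                                   # (569, 11)
print(filter2['ALB'].mean(), filter2['TP'].median())

filter3 = liver[(liver['LiverPatient'] == 1) & (liver['Age'] >= 50) |
                (liver['TP'] >= 2) | (liver['ALB'] > 2)]
print(filter3['Alkphos'].max(), filter3['AG'].min())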

Explanation
The command liver[(liver['Age']>=35) & (liver['DB']<=6)] filters the data whose Age
>=35 and value of DB is <=6 and stores in filter1. The shape of filter1 is (390, 11) which means
that there were 390 observations which satisfied both the conditions. The next two commands
display the sum of “TB” and product of “DB” from the filtered dataset of 390 observations.
The command filter2=liver[(liver['Gender']=="Female") | (liver['Age']>=35) |
(liver['DB']<=6)] filters the data whose Gender is Female or Age >= 35 or value of “DB” is
less than or equal to 6 and stores it in filter2. The shape of filter2 is (569, 11), which means that
569 observations satisfied at least one of the three conditions. The next two commands display
the mean of “ALB” and the median of “TP” from the filtered dataset of 569 observations,
respectively.
The command filter3=liver[(liver['LiverPatient']==1) & (liver.Age>=50) |
(liver['TP'] >=2) | (liver.ALB>2)] filters the data on the basis of multiple conditions and
stores it in filter3. The shape of filter3 is (583, 11), which means that all 583 observations
satisfied the combined condition. The next two commands display the maximum of “Alkphos”
and the minimum of “AG” from this filtered dataset of 583 observations.

6.4.3 Using iloc Indexers


These indexers play a major role in the data extraction on the basis of specified row and column.
The iloc indexer helps to extract particular row(s) and column(s) at specified numbers in the
order that they appear in the dataframe.
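A short sketch of the iloc selections described below:

print(liver.iloc[5, 2])             # third column of the sixth row
print(liver.iloc[5])                # all columns of the sixth row
print(liver.iloc[[5, 9], [1, 4]])   # rows 6 and 10, second and fifth columns
print(liver.iloc[7:9, [5]])         # rows 8 and 9, sixth column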

Explanation
The command liver.iloc[5,2] displays the third column of the sixth row, since 5 is specified
for the row and 2 for the column (numbering starts from 0). Hence, the third column “TB” of the
sixth row is displayed. The command liver.iloc[5] displays all the columns of the sixth record
in the absence of a column number. The command liver.iloc[[5,9],[1,4]] displays the two
records for the second and fifth columns; thus, Gender and Alkphos are displayed. The command
liver.iloc[7:9,[5]] displays a range of rows for the sixth column, because the : sign is used
for declaring a range of row numbers.

6.4.4 Using loc Indexers


The iloc indexer gives information of the row number specified within the bracket, while the loc
indexer gives information of the index value specified within the bracket. The index of the
dataframe can be either number and/or a string or multi-value. It should be noted that it is
possible to change the index. Unlike iloc, the loc indexer can be used for index and label of
columns. When using the .loc indexer, columns are referred to by names using lists of strings or
“:” for slicing.
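A possible sketch of the loc selections described below:

print(liver.loc[3])                              # record whose index label is 3
print(liver.loc[1:5, :])                         # records with index labels 1 to 5
print(liver.loc[[14, 25, 36]])                   # records with index labels 14, 25 and 36
print(liver.loc[[5, 6], 'TB':'TP'])              # columns TB through TP for labels 5 and 6
print(liver.loc[7:9, ['Age', 'Gender', 'TB']])   # Age, Gender and TB for labels 7 to 9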

Explanation
The loc indexer helps to fetch records according to the specified index label and not the row
number; it is not necessary that the index label is the same as the row number. The command
liver.loc[3] displays the record whose index label is 3. The command liver.loc[1:5,]
displays the records whose index labels are between 1 and 5. The command
liver.loc[[14,25,36]] displays the records whose index labels are 14, 25, and 36. The loc
indexer also allows the user to write the names of the columns for fetching particular records.
Thus, the command liver.loc[[5, 6], 'TB':'TP'] fetches the columns starting from TB to TP
for index labels 5 and 6, and liver.loc[7:9, ['Age','Gender','TB']] fetches the rows for
the Age, Gender, and TB columns only.

Relational operators with loc indexer: The loc indexer helps to apply different relational
operators like <, <=, >=, == for extracting data according to user requirement.
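A two-line sketch with assumed conditions:

print(liver.loc[liver['Age'] >= 70, ['Age', 'Gender', 'TB']].head())
print(liver.loc[(liver['Gender'] == "Female") & (liver['TB'] > 10)].shape)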

Explanation
The commands use different relational operators to filter the records according to the specified
condition(s). The output produced and comments written at the top clearly explain the use.

Functions with loc indexer: It is also possible to use special functions like startswith() and
isin() to select specific records according to the user’s requirement.
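A hedged sketch of the selections described below:

females = liver.loc[liver['Gender'].str.startswith("Fe")]
print(females[females['ALB'] >= 5].shape)       # one matching female record
subset = liver.loc[(liver['Age'] >= 60) & (liver['ALB'].isin([4.4, 4.2, 4.3]))]
print(subset.shape)                             # two matching records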

Explanation
The startswith(“Fe”) condition filters those records where Gender starts with “Fe”; thus, only the
records of females are filtered. The result shows that there is only 1 female record where
ALB>=5. The command liver['ALB'].isin([4.4,4.2,4.3]) filters only those records where
ALB is either 4.4, 4.2, or 4.3. The result shows that there are only two records where Age>=60
and ALB is either 4.4, 4.3, or 4.2.

6.5 Group by Functionality


An important feature of dataframe is the use of “groupby()” function which is used to group the
observations on the basis of a variable. It should be noted that grouping of observations is
generally done on the basis of a categorical variable, and aggregate functions such as max(),
mean(), median(), min(), sum(), and count() are then applied to any of the
continuous/categorical variables in the dataset. This is demonstrated in the following program:
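A possible sketch of the grouped aggregations described below:

print(liver.groupby('Gender')['Gender'].count())      # 142 Female, 441 Male
print(liver.groupby('Gender')['TB'].sum())            # sum of TB per gender
print(liver.groupby('LiverPatient')['DB'].min())
print(liver.groupby('LiverPatient')['ALB'].max())
print(liver.groupby('LiverPatient')['TP'].mean())
print(liver.groupby('LiverPatient')['AG'].median())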

Explanation
The first command does grouping on the basis of Gender and counts the number of observations
for males and females. The result shows that 142 records had the value “Female” and 441 records
had the value “Male”. The last command adds the values of the column “TB” for the grouped
observations of each gender. Thus, the result shows that the sum of “TB” for the female
observations is 329.8 and the sum of “TB” for the male observations is 1593.4.

Explanation
All the four commands discussed here have done grouping on the basis of LiverPatient. The
first command determines the minimum of “DB” for grouped observations. We can observe
that there are two distinct values of LiverPatient: 0 and 1. The result shows that for the
observations having 0 and 1 as the value of LiverPatient, the minimum of DB is 0.1. The maximum of
“ALB” is 5.0 for value 0 and 5.5 for value 1. The mean of “TP” is 6.54 for value 0 and 6.45 for
value 1. Median of “AG” is 1.0 for value 0 and 0.9 for value 1.

Determine minimum and maximum value of alkphos for male and female from
the dataset.

6.6 Creating Charts for Dataframe
The libraries named “matplotlib” and “seaborn” are used in Python for displaying excellent
graphics and will be discussed in Chapters 7 and 8, respectively. However, the Pandas library also
supports creating basic charts for a dataframe, such as a pie chart using the pie() function; a
scatterplot using the scatter() function; a histogram using the hist() function; and a boxplot using the boxplot()
function.
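A hedged sketch of these charts using the pandas plotting interface (the exact arguments are assumptions; matplotlib is needed to display the figures):

import matplotlib.pyplot as plt

data1 = liver.iloc[1:10]
data1.plot.pie(y='TB')                          # pie chart of TB
liver.plot.scatter(x='TB', y='DB')              # scatterplot of TB against DB
liver.hist(column='Alkphos', by='Gender')       # histogram of Alkphos for each gender
liver.boxplot(column='AG', by='LiverPatient')   # boxplot of AG for each group
plt.show()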

Explanation
The syntax used here helps us to create basic charts for the dataset. The comments describe the
type of chart and corresponding functions. The command data1=liver.iloc[1:10] selects nine
rows (the second to the tenth) of the liver dataset and stores them in data1. The pie chart displays the
values of TB, since the value of the label argument is [‘TB’]. The scatterplot between TB and DB
clearly shows that there is a linear relationship between the two variables, because the line of best
fit on the scatterplot would be a straight line. The histogram is created for different genders
on the basis of alkphos. We can observe that the highest value of alkphos is 100 for females and
300 for males. From the boxplot, we can observe that the median of liver patient (value = 1) is
nearly 0.8 which is lower than the median of people who are not liver patients (value = 0). The
range of “AG” for both the groups is nearly the same. However, there are some outliers having
value of “AG” greater than 2.5 in the group of liver patients.

It is also possible to get information related to number of observations for two or more than two
categorical variables using crosstab() function. We can also plot the bar graph for the result
generated from the crosstab() function.
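A brief sketch of the cross-tabulation and its bar graph:

import matplotlib.pyplot as plt

ct = pd.crosstab(liver['Gender'], liver['LiverPatient'], margins=False)
print(ct)                      # counts per gender and patient status
ct.plot.bar()                  # bar graph without totals
pd.crosstab(liver['Gender'], liver['LiverPatient'], margins=True).plot.bar()   # with totals
plt.show()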

Explanation
In this program, crosstab() function is executed for determining the number of females and
males who are liver patients and who are not liver patients. The output shows that 50 females
have the LiverPatient value equal to 0 and 92 females have LiverPatient value equal to 1.
Similarly, 117 males have the LiverPatient value equal to 0 and 324 males have liver patient
value equal to 1. The value of the margins argument determines whether the totals are also
shown. When margins=False, only the number of records for each gender and category of liver
patient is displayed. But in the second command, when the value of the margins argument is
True, the bar for the total number of values is also shown along with the bars for male and
female.

6.7 Missing Values


A missing value is one whose value is unknown. Missing values are displayed in Pandas by the
NaN (Not a Number) symbol, a special value whose properties are different from other values.
Missing values are often legitimate; values really are missing in real life. They can arise when
there is an empty column in a record in a database or when an Excel spreadsheet has empty cells.
When an element or value is “not available” in statistical terms, the element is shown as NaN
when the data is read into a dataframe. Impossible values produced by numerical computation
(e.g., zero divided by zero) are also represented by NaN. For understanding management of
missing data, download the loan prediction file from
the link: https://www.kaggle.com/altruistdelhite04/loan-prediction-problem-dataset.

6.7.1 Determining Missing Values
The function isnull().sum() gives the total number of missing values for each column in the
dataset as demonstrated in the following program:
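A minimal sketch of counting missing values (the file name loan_prediction.csv is an assumed name for the downloaded file):

import pandas as pd

loandata = pd.read_csv("loan_prediction.csv")
print(loandata.shape)              # (614, 13)
print(loandata.isnull().sum())     # number of missing values in every column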

Explanation
The dimension of the data is (614, 13). This means that there are 614 observations and 13
columns. The command loandata.isnull().sum() returns the number of missing values in
each column. We can observe that there is no missing value for Loan_ID, 13 observations have
missing values for Gender, 3 observations have missing values for Married, 15 observations
have missing values for Dependents, and so on. Education, ApplicantIncome,
CoapplicantIncome, Property_Area and Loan_Status had no missing value in any of the
observations. This seems to be true because these are very important fields which cannot be left
blank for loan processing.

6.7.2 Deleting Observations Containing Missing Values


It is possible to delete the observations from the dataset containing missing values in any column
directly. The function dropna(inplace=True) deletes the observations that contain the missing
values from the dataset and hence reduces the number of observations. However, it is not
advisable to delete the observation completely from the dataset because the analysis may not
show an effective result.
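A short sketch of dropping the incomplete observations on a copy of the data:

newloandata = loandata.copy()
newloandata.dropna(inplace=True)
print(newloandata.shape)       # (480, 13) -- 134 rows contained missing values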

Explanation
The copy() function creates a copy of the dataset and stores in the newloandata. The dropna()
function deletes all the observations containing missing values and hence the dimension of the
dataset reduces to 480 rows. This means that the dataset had 134 (614 − 480) observations containing at least one missing value.

6.7.3 Missing Data Imputation


Imputation is a method to fill in the missing values with estimated ones. It is very important to
impute the missing data before analysing, because the data analysis functions do not work
effectively if missing values exist in the dataset. This section focuses on imputation of missing
data with different values. The function fillna(value, inplace=True) fills the missing values
(NA) with value written as an argument and thus helps in missing data imputation. The value is
generally considered as either mean(), median(), mode(), or any specified value. However, it
should be noted that missing numeric/continuous variables can be replaced with mean, median,
or predicted mean and the categorical variable can be replaced with mode or any predicted
categorical value.

Median is generally considered as a better measure for missing data


imputation.
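A hedged sketch of the imputations described below; assignment is used instead of calling fillna() with inplace=True on a single column, which raises warnings in recent pandas versions:

loan1 = loandata.copy(); loan2 = loandata.copy(); loan3 = loandata.copy()

loan1['LoanAmount'] = loan1['LoanAmount'].fillna(0)                            # replace with 0
loan2['LoanAmount'] = loan2['LoanAmount'].fillna(loan2['LoanAmount'].median()) # replace with median
loan3['LoanAmount'] = loan3['LoanAmount'].fillna(loan3['LoanAmount'].mean())   # replace with mean
print(loan1['LoanAmount'].sum(), loan2['LoanAmount'].sum(), loan3['LoanAmount'].sum())

loan1['Gender'] = loan1['Gender'].fillna(loan1['Gender'].mode()[0])            # mode for a categorical column
loan1['Married'] = loan1['Married'].fillna("Yes")                              # a predicted categorical value
print(loan1.isnull().sum())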

Explanation
Different datasets duplicating loandata dataset are created using copy() function and stored in
loan1, loan2, and loan3. These are created for understanding the impact of the missing data
imputation using different values. The command
loan1['LoanAmount'].fillna(0,inplace=True) fills the observation that had missing values
in LoanAmount column with 0. Thus, when the command loan1.isnull().sum() is executed,
LoanAmount shows missing values in 0 observations. The argument in fillna() function as
['LoanAmount'].median() fills the observation that had missing values in LoanAmount column
with median. The argument ['LoanAmount'].mean() fills the observation that had missing
values in LoanAmount column with the mean. After imputation, the number of observations having
missing values for LoanAmount is 0, which means that all the missing observations in the
LoanAmount column were replaced by some value. When the missing values were replaced by 0,
the total sum remains the same, while the total sum increases to 89492 and 89497 after replacing
the missing values by the median and the mean, respectively. The last part discusses replacing the
missing values of a categorical variable. Thus, the missing values of Gender are replaced with the
most frequently occurring value (mode) of “Gender” in the dataset. Similarly, the missing values
of “Married” are replaced by the predicted value “Yes”. Thus, the number of missing values in
Gender and Married is shown as 0.

Create a dataframe of products’ information available on a website. Enter some missing values
for the information and try to compute the mean. Provide a solution.

Summary
• The Pandas library is specially used for handling data of different dimensions. Series is a
one-dimensional labelled data and “dataframe” is a two-dimensional labelled data holding
any data type. A series can be created using the function Series() from Pandas library.
• Dataframe is the most commonly used Pandas object and is represented as a two-dimensional
labelled data structure with columns of potentially different types. It can be thought of as a table
in an RDBMS or a spreadsheet. A dataframe is created using the DataFrame() function.
• We can add rows to an existing dataframe from a new dataframe using the append()
function. It is also possible to add a new column to the dataframe by writing the name of the
column in square brackets along with the name of the dataframe and assigning a list of items
to it.
• It is possible to delete rows and columns from the dataframe using the drop() function.
• The function read_csv() helps to read a “csv” file; read_excel() helps to read an “excel”
file; read_html() helps to read an “html” file; read_json() helps to read a “json” file; and
read_sql() helps to read data from an SQL database query or table.
• The basic functions related to a dataframe include describe(), info(); displaying records
using head(), tail(), etc.; basic statistical functions including mean(), median(), etc.; basic
mathematical functions like min(), max(), sum(), etc.; sorting of the dataframe on the basis
of column using sort_values().
• Data extraction can be done using different relational operators like logical operators [and
(&), or (|)], or using indexers (loc and iloc).
• An important feature of dataframe is the use of “groupby()” function for grouping the
observations on the basis of a variable. It should be noted that grouping of observations can
be done only on the basis of categorical variable and using aggregate functions.
• Three common functions are used for handling missing values. The function
isnull().sum() gives the total number of missing values for each column in the dataset;
dropna(inplace=True) deletes the observations that contain missing values; and
fillna(value,inplace=True) fills the missing values with value in argument.

Multiple-Choice Questions

1. The _____________ function is used to do sorting on the dataframe.


(a) sort.data()
(b) sort_df()
(c) sort()
(d) sort_values()
2. The _____________ function displays the statistics related to count, mean, minimum,
quartiles, etc., for each column.

(a) statistics()
(b) describe()
(c) info()
(d) stats()
3. The _____________ function is used to delete rows and columns from the dataset.
(a) drop()
(b) remove()
(c) del()
(d) delete()
4. The _____________ function helps to do missing data imputation.
(a) fillna()
(b) nona()
(c) dropna()
(d) replacena()
5. The _____________ function is used to add a row in the dataframe.
(a) join()
(b) new()
(c) append()
(d) add()
6. The _____________ chart is not possible to draw using only Pandas library.
(a) heatmap()
(b) boxplot()
(c) scatterplot()
(d) piechart()
7. The _____________ function is used to do grouping of data on the basis of a categorical
variable.
(a) grouping()
(b) groupby()
(c) group_data()
(d) group()
8. The _____________ function is used to display the columns of the dataframe.
(a) col()
(b) cols()
(c) keys()
(d) Both (a) and (b)
9. The _____________ function is used to display the last five records of the dataset.
(a) display()
(b) start()
(c) tail()
(d) head()
10. The _____________ function is used to display smallest value of a particular column.
(a) small()
(b) nsmallest()
(c) smallest()
(d) None of the above

Review Questions

1. How do we determine the dimension and size of the dataset?


2. Differentiate between the describe and describe() function used for dataframe.
3. How can we delete rows and column from the dataframe?
4. How can we add rows and column to the dataframe?
5. How can we extract a specified row or column from a dataframe?
6. Explain the utility of groupby() function with an example of your choice.
7. Differentiate between the use of loc and iloc indexer for data extraction.
8. Discuss the functions used to handle missing values in Python.
9. How can we do sort on the dataframe using Pandas library?
10. Discuss the utility of relational and logical operators in filtering data from a dataframe.

CHAPTER
7

Matplotlib Library for Visualization

Learning Objectives
After reading this chapter, you will be able to

• Develop an awareness of different types of basic charts.


• Demonstrate the knowledge of charts in solving real world problems.
• Make effective decisions in selecting the chart appropriate to the data type.
• Create charts and analyze their results.

Data visualization is an important and extremely versatile component of the Python environment.
Visualization in Python is a vehicle for newly developing methods of interactive data analysis.
Python programming language basically has two libraries to create charts and graphs, including
matplotlib and seaborn. It is possible to use the different functions available in these libraries to
display a wide variety of statistical graphs and also to build entirely new types of graph. The
matplotlib has many sub-packages which can be accessed by matplotlib.axes,
matplotlib.backends, matplotlib.compat, matplotlib.delaunay, matplotlib.projections, matplotlib.
pyplot, matplotlib.sphinxext, matplotlib.style, matplotlib.testing, matplotlib.tests,
matplotlib.widgets, etc. This chapter discusses the different charts that can be created using the
pyplot sub-package.
The matplotlib.pyplot is a collection of command-style functions that make matplotlib work
like MATLAB software. In matplotlib.pyplot, various states are preserved across function calls,
hence it is easy to keep track of intermediate elements such as the current figure and plotting area,
and the plotting functions are directed to the current axes. This chapter discusses different types of charts
that can be drawn using matplotlib library and additional features in each new chart. However, it
should be noted that these features are not limited to that particular chart and hence can be used
in other charts also.

In order to determine the version of matplotlib on which you are working,


import the library first by typing the command import matplotlib and then print
("Matplotlib version:", matplotlib.__version__) for determining the
version. For example, before executing all the commands used in this program,
the result obtained was Matplotlib version: 3.0.2. This approach can be
used to determine the version of any library. This is very important because the
version of one library sometimes shows incompatibility with version of other
libraries. Once determined, it can be changed accordingly.
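
For instance, the check described above can be written as:

import matplotlib

# Print the installed version of the matplotlib library
print("Matplotlib version:", matplotlib.__version__)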

7.1 Charts Using plot() Function
The plot() is a versatile command and can take an arbitrary number of arguments. This means
we can plot figures corresponding to one axis, for two axes, considering single and multiple data,
etc. It is also possible to plot figures of different shapes and colors. For every x and y pair of
arguments, there is an optional third argument, which is the format string that indicates the color
and line type of the plot. The letters and symbols of the format string are from MATLAB and are
used for concatenating a color string with a line style string. The default format string is “b-”,
which is a solid blue line.
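
The program behind the figure is not reproduced here; a minimal sketch with ten illustrative values is:

import matplotlib.pyplot as plt

# Ten illustrative values; with a single list, plot() uses the default "b-" style
plt.plot([1, 4, 9, 16, 25, 36, 49, 64, 81, 100])
plt.ylabel('Values')
plt.show()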

Explanation
A solid line is drawn covering all the 10 points specified in the list in the default blue color. The
ylabel() function is used to add a label on the y-axis.

It is also possible to use “ggplot” from style to add special effects to a chart. The following
program redraws the above figure to show the remarkable difference between charts created with and
without “ggplot”:

Explanation
The “ggplot” style is imported from matplotlib.style. This style changes the background to the
default ggplot setting of a gray background with grid lines, as shown in the chart. However, it is
possible to change the settings based on user requirements.

In the above figure, the points are written for only one axis and the line is drawn accordingly.
But we generally draw the charts on the basis of two coordinates: x and y. For including the
coordinates of another axis, we simply need to add another list in the function. The title is added
in the chart using title() function. Similar to ylabel, label on x-axis is added to the chart using
xlabel() function. The grid can be added to the chart using grid() function. It is also possible
to add a text on the plot at a location by specifying the location and text using text() function. It
should be noted here that xlabel(), title(), grid(), and text() can be used in all the charts. It
is also possible to control properties of line drawn on the chart by changing its attributes like
linewidth, color, dash style, etc. The use of these functions is demonstrated in the following
program:
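
A minimal sketch of such a program, using illustrative values for the seven points, is shown below:

import matplotlib.pyplot as plt

# Seven illustrative (x, y) points
x = [1, 2, 3, 4, 5, 6, 7]
y = [2, 6, 8, 10, 12, 15, 18]

plt.plot(x, y, color='green', linewidth=2)   # control line attributes such as color and width
plt.title('Line Chart')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.grid(True)                               # add a grid to the chart
plt.text(5, 12, 'Green colored line')        # place text at location (5, 12)
plt.show()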

The square brackets inside the plot() function are important. Remove
one pair of square brackets to see the change in the chart: the function will then
treat all 14 values as points for a single axis only. Both the axes will be considered
if there are two lists in square brackets. It is important that the number of values is
the same in both the lists, else an error will occur.

Explanation
We can observe from the figure that the titles x-axis and y-axis are displayed on the chart. The
above chart has two sets of coordinates corresponding to x and y. Both of them represent
coordinates of 7 points. Hence, the line is drawn joining all the 7 points. The grid
function is passed the value “True”, which adds a grid to the chart. The text “Green colored line” is
added at location (5, 12) corresponding to (x, y).

It is also possible to draw a chart by setting the limits of the axes using the axis() command. The
axis() command takes a list of [xmin, xmax, ymin, ymax] and specifies the viewport of the axes.
The “xmin” and “xmax” specify the minimum and maximum coordinate of x-axis, respectively.
Similarly, “ymin” and “ymax” specify the minimum and maximum coordinate of y-axis,
respectively.
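
A minimal sketch, with illustrative coordinates, is:

import matplotlib.pyplot as plt

# Illustrative coordinates
x = [6, 8, 10, 12, 14]
y = [7, 9, 12, 14, 16]

plt.plot(x, y, 'ro')              # red dots at the coordinates
plt.plot(x, y, 'm')               # magenta line through the same coordinates
plt.axis([5, 15, 5, 17])          # [xmin, xmax, ymin, ymax]
plt.show()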

Explanation
The term “ro” in plt.plot() plots red-colored dots on the coordinates as specified in the
program. The term “m” plots magenta-colored line covering all the coordinates specified before
the term. Thus, both dots of red color and line in magenta color are drawn on the same chart.
The command plt.axis([5,15,5,17]) specifies that minimum coordinate and maximum
coordinate for x-axis is 5 and 15, respectively, and the minimum coordinate and maximum
coordinate for y-axis are 5 and 17, respectively. These settings can also be observed from the
figure.

We can also draw chart using multiple colors, shapes, and figures for depicting different datasets.

A dashed line is drawn using “--”, a square is drawn using “s”, a triangle is drawn using the “^”
symbol, and a circle is drawn using “o”. Hence, blue dashes will be drawn using “b--”, red
squares will be drawn using “rs”, green circles will be drawn using “go” and magenta triangles
will be drawn using “m^”. Unlike the axis() command that was used to set the limits of both the
axes in one single command, the commands xlim() and ylim() help to set the limits of x- and y-
axes, respectively.
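
A minimal sketch, assuming illustrative values for the lists a, b, c and d, is:

import numpy as np
import matplotlib.pyplot as plt

# Illustrative lists a, b, c and d
a = np.arange(1, 11)      # 1 .. 10
b = a * 2
c = a ** 2
d = a * 15

plt.plot(a, a, 'g^', a, b, 'bs', a, c, 'r--', a, d, 'mo')
plt.xlim(0, 11)           # limits of the x-axis
plt.ylim(0, 220)          # limits of the y-axis
plt.grid(color='k')       # black grid lines
plt.show()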

Explanation
In the command plt.plot(a,a,'g^',a,b,'bs',a,c,'r--',a,d,'mo'), we can observe that

for every x, y pair of arguments, there is an optional third argument which is the format string
that indicates the concatenation of a color string with a line style string. Thus, the first line is
drawn using green triangles (g^) considering list “a” for both x and y; the second line is drawn
using blue squares (bs) considering list “a” for x argument and list “b” for y argument; the third
line is drawn using red dashes (r--) considering list “a” for x argument and list “c” for y
argument; the fourth line is drawn using magenta dots (mo) considering list “a” for x argument
and list “d” for y argument. The minimum and maximum limits of “x” are from 0 to 11 and
minimum and maximum limits of “y” are from 0 to 220. Since “b” is used to represent blue
color, “k” is used to represent black color. Thus, the grid lines are displayed using black color
(k) on the chart.

The above example considers points corresponding to x-axis only. Create a


chart depicting five lines of different line color and styles. Each line should
consider 10 different points corresponding to both x-axis and y-axis.

7.2 Pie Chart


Pie charts visualize absolute and relative frequencies. A pie chart is a circle partitioned into
segments where each of the segments represents a category. The size of each segment depends
upon the relative frequency and is determined by the angle. A pie chart is a representation of
values as slices of a circle with different colors. A pie chart is drawn using pie() function
considering single list as an argument.
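
A minimal sketch with ten illustrative values is:

import matplotlib.pyplot as plt

# Ten illustrative values; the largest (100) produces the largest sector
list1 = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
plt.pie(list1)
plt.show()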

Explanation

A pie chart is drawn using pie() function considering list named “list1”. The list “list1” is
created having 10 elements. We can observe from the chart that there are 10 sectors and that the
size of each sector is proportional to its value in the list. The largest sector corresponds to the
value 100, because 100 is the highest number in the list.

7.3 Violin Plot


This plot is a combination of box and kernel density plot. It is drawn using violinplot() from
matplotlib.pyplot.
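
A minimal sketch, using illustrative data, is:

import matplotlib.pyplot as plt

# Illustrative data clustered around 12 and 26, so the violin is widest there
data = [10, 11, 12, 12, 12, 13, 14, 20, 24, 25, 26, 26, 26, 27, 28]
plt.violinplot(data, showmeans=True, showextrema=True)
plt.show()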

Explanation
The violin plot clearly shows that there are more observations at 12 and 26, since the
width of the plot is maximum at those points.

It is possible to display/not display mean and extreme points in the violin plot.
Use showmeans=True/False and showextrema=True/False in the violinplot()
function according to the desired results.

Create many violin plots in one chart considering different data. Create
the violins in different colors.

7.4 Scatter Plot
The scatter plot is created using the scatter() function and is helpful in displaying bivariate
data. Scatter plots show many points plotted in the Cartesian plane. Each point represents the
values of two variables. One variable is chosen in the horizontal axis and another in the vertical
axis. The “color” is the default argument in the function which represents the color of the dot.
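
A minimal sketch, with illustrative profit figures, is:

import matplotlib.pyplot as plt

# Illustrative quarterly profit figures for five organizations
org = ['Myntra', 'Snapdeal', 'Alibaba', 'Amazon', 'Flipkart']
Q1_Profit = [20, 35, 60, 60, 45]
Q2_Profit = [25, 40, 70, 62, 50]
Q3_Profit = [30, 45, 90, 75, 65]
Q4_Profit = [18, 33, 80, 58, 48]

plt.scatter(org, Q1_Profit, color='red')
plt.scatter(org, Q2_Profit, color='blue')
plt.scatter(org, Q3_Profit, color='green')
plt.scatter(org, Q4_Profit, color='magenta')
plt.show()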

Explanation
There are five organizations, namely “Myntra”, “Snapdeal”, “Alibaba”, “Amazon”, and
“Flipkart”. The list “Q1_Profit” has five numbers corresponding to profits for first quarter of
five organizations. Similarly, the other lists correspond to the profits for their respective
quarters. The scatter plot is drawn for the profit of the five organizations for all the four quarters. We
can observe from the chart that the least profit was for “Myntra”, whereas the highest profit was
for “Alibaba”. “Alibaba”, “Amazon”, and “Flipkart” have achieved highest profit in the third
quarter, whereas “Myntra”, “Snapdeal”, and “Amazon” got least profit in the fourth quarter.

7.5 Histogram
A histogram is based on the idea of categorizing data into different groups and plotting a bar for each
group. A histogram represents the frequencies of values of a variable gathered into
ranges. The height of each bar in a histogram represents the number of values present in that
range.

Explanation
A list named “list1” is created having starting element as 1; the ending element is before 10 and
the interval is 2. Thus, the list “list1” has five elements 1, 3, 5, 7, and 9. The list named “list2”
has five elements corresponding to five elements of “list1”. The histogram plots a yellow-
colored box for a particular set of list1 and list2.

Creating Multiple Charts on One Image: It is possible to have multiple figures on one chart in
the form of m × n array of charts using the function subplot(). This function accepts three
arguments. The first argument represents the total number of rows, the second argument
represents number of columns, and the third argument represents the number of cells. It should
be noted that the cell number is determined from a row basis. For better understanding, let us
consider a figure of 4 rows and 3 columns to draw 12 charts in one single image. Then the
subplot function will be (4, 3, x), where x denotes the cell number. Thus, the figure has total 12
cells (4*3). The last argument shows the cell where the chart will be displayed. Thus,
subplot(4,3,1) represents that there are total 4 rows and 3 columns in the figure and the chart will
be drawn in the first cell. Hence, subplot(4,3,2) shows that this chart will be drawn in the second
cell, since the cell is determined on a row basis; second cell means second column of first row.
Similarly, subplot(4,3,5) means the fifth cell, which means the first column of the second row.
The subplot(4,3,10) represents the tenth cell, which will be the second column of the third row.

One of the important functions used in creating a chart is figure(). The figure() function
takes a figure number starting from 1. It has figsize as an argument, which specifies the
dimensions of the figure.
It is known that the title() function gives a name to a single chart. However, when we
draw multiple charts in one image, we may want to give a super title to the main image, which is
done using the suptitle() function.

7.6 Bar Chart


A bar chart visualizes the relative or absolute frequencies of observed values of a variable. It
consists of one bar for each category. A bar chart represents data in rectangular bars with length
of the bar proportional to the value of the variable. Python uses the function bar() to create bar
charts. The height of each bar is determined by either the absolute frequency or the relative
frequency of the respective category and is shown on the y-axis. However, the horizontal bar
chart can be created by using the barh() function.
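
A minimal sketch of a horizontal bar chart, with illustrative categories and values, is:

import matplotlib.pyplot as plt

# Illustrative categories and values
categories = ['Electronics', 'Clothing', 'Food', 'Books']
values = [120, 85, 60, 40]

plt.barh(categories, width=values)   # horizontal bars; width controls the bar length
plt.show()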

Explanation
The figure shows a horizontal bar chart for different values. The first list gives the
labels on the y-axis and the width argument gives the length of the bar on the chart with
respect to each item in the first list. The next section creates multiple bar charts on one image.
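
A minimal sketch of such a program, using illustrative quarterly figures, is:

import matplotlib.pyplot as plt

# Illustrative quarterly figures for five organizations
org = ['Myntra', 'Snapdeal', 'Alibaba', 'Amazon', 'Flipkart']
quarters = {'Q1': [20, 35, 60, 60, 45], 'Q2': [25, 40, 70, 62, 50],
            'Q3': [30, 45, 90, 75, 65], 'Q4': [18, 33, 80, 58, 48]}

plt.figure(figsize=(10, 8))
for i, (label, profit) in enumerate(quarters.items(), start=1):
    plt.subplot(2, 2, i)            # 2 rows, 2 columns, i-th cell
    plt.bar(org, profit)
    plt.title(label)
plt.suptitle('Quarterly Profit of Organizations')
plt.show()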

Explanation
Multiple bar charts are drawn in one image using different arguments in subplot(). We can
observe that since the first two arguments in the subplot() function are 2 and 2, we can
draw an image having 2 rows and 2 columns. Thus, the complete image has 4 charts
corresponding to 4 quarters. The suptitle() adds a title to the complete image.

Create one single chart containing 12 different types of charts as discussed
above, which are divided into 3 rows and 4 columns using subplot function.

A stacked bar chart can also be drawn from multiple data. In this bar chart, each of the bars can
be given different colors using “color” argument in the function. This chart helps in providing
effective comparison of the figures since the bars are stacked in nature.
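
A minimal sketch, assuming illustrative population figures, is:

import numpy as np
import matplotlib.pyplot as plt

# Illustrative population figures (in millions) for three of the years
countries = ['India', 'China', 'USA', 'Brazil', 'Russia']
pop_1930 = np.array([280, 490, 120, 35, 100])
pop_1970 = np.array([550, 820, 200, 95, 130])
pop_2010 = np.array([1230, 1340, 310, 195, 143])

plt.bar(countries, pop_1930, color='red', label='1930')
plt.bar(countries, pop_1970, bottom=pop_1930, color='green', label='1970')
plt.bar(countries, pop_2010, bottom=pop_1930 + pop_1970, color='blue', label='2010')
plt.legend()
plt.show()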

Explanation
The stacked bar chart shows the population of five countries in different years starting from
1930 to 2010. The size of the stack is dependent on the variation in the population of the
respective country. Stack of different colors represents different years.

It is also possible to create one chart spread in multiple columns. If the chart
has two rows and three columns and we want the first chart to be spread in all
the three columns of the first row, use the syntax plt.subplot(211) for the
first chart followed by plt.subplot(234) for the second chart to be displayed
in the first column of the second row. The dimensions (211) will create only
one column in the first row and place the chart.

Create a single image of four rows and three columns. The first row will have
stacked bar chart in all the three columns. The second-fourth row will contain
three columns. Each cell will display the pie chart of each year for the five
countries.

7.7 Area Plot


Area plots are pretty much similar to the line plot. They are also known as stack plots. These
plots can be used to track changes over time for two or more related groups that make up one
whole category. For example, determining the total sales and individual sale from all the
different categories like Electronics, Clothing, Food, and Books. The area chart for this example
is shown as follows:
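
A minimal sketch of such an area (stack) plot, with illustrative monthly sales figures, is:

import matplotlib.pyplot as plt

# Illustrative monthly sales figures for each category
months = range(1, 13)
electronics = [50, 52, 55, 60, 58, 62, 65, 70, 68, 72, 75, 80]
clothing = [30, 32, 31, 35, 36, 34, 38, 40, 39, 42, 44, 45]
food = [20, 21, 22, 20, 23, 24, 25, 24, 26, 27, 28, 30]
books = [10, 11, 10, 12, 13, 12, 14, 15, 14, 16, 17, 18]

plt.stackplot(months, electronics, clothing, food, books,
              labels=['Electronics', 'Clothing', 'Food', 'Books'])
plt.legend(loc='upper left')
plt.show()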

Explanation
The chart shows the total sales for 12 months along with sales of each category for different
months. Thus, we can observe that sales of Electronics is the highest, since the area occupied is
more than the sales of any other category.

Create a chart having two rows and two columns. The first row will display the
violin chart for food and clothing in two columns and the second row will
display for books and electronics. Use title for each category in the respective
chart and add a main title to the chart as: Sales for different categories.

7.8 Quiver Plot


The quiver() function takes two important arguments that represent 1-D or 2-D arrays or
sequences. The quiver() function plots a 2-D field of arrows representing the two datasets.
Color is an optional argument which represents the color of the arrows.
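
A minimal sketch, using two illustrative arrays, is:

import numpy as np
import matplotlib.pyplot as plt

# Two illustrative 2-D arrays for the arrow components
x, y = np.meshgrid(np.arange(0, 2 * np.pi, 0.5), np.arange(0, 2 * np.pi, 0.5))
u = np.cos(x)
v = np.sin(y)

plt.quiver(x, y, u, v, color='red')   # arrow direction and size come from u and v
plt.show()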

Explanation
The figure displays six quiver charts drawn depending on the data. It should be noted that the
direction and size of the arrow depends on the two datasets which are passed as an argument.
The first quiver plot draws arrows in red color and second draws arrows in blue color.

7.9 Mesh Grid


The purpose of meshgrid is to create a rectangular grid out of an array of x values and an array of
y values. A meshgrid is created using meshgrid() function. For example, if we have a point at
each integer value between 0 and 100 in both the x and y directions, we need to create a
rectangular grid with every combination of x and y points. This chart is used for analysis of large
data.
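
A minimal sketch, using illustrative arrays and coloring the grid with pcolormesh(), is:

import numpy as np
import matplotlib.pyplot as plt

# A rectangular grid built from two illustrative arrays of values
val1 = np.linspace(0, 100, 200)
val2 = np.linspace(0, 100, 200)
X, Y = np.meshgrid(val1, val2)

# color each grid point according to a function of the (x, y) combination
plt.pcolormesh(X, Y, np.sin(X) * np.cos(Y), shading='gouraud')
plt.show()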

Explanation
Two numpy arrays val1 and val2 are created having large data. The image displays the
meshgrid for three different combinations. We can observe that the color changes in the image
with different combinations.

7.10 Contour Plot


A contour plot is a graphical technique for representing a three-dimensional surface and is
created using contour(). It is drawn by plotting constant z slices, called contours, in a two-
dimensional format. That is, given a value “z”, lines are drawn connecting the (x, y)
coordinates where that “z” value occurs. It basically tells us how “x” and “y” variables impact the
variable “z”.
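
A minimal sketch, with an illustrative function of two variables, is:

import numpy as np
import matplotlib.pyplot as plt

# Illustrative arrays A and B for the grid, and a z value at every (x, y) combination
A, B = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
Z = np.sin(A) ** 2 + np.cos(B) ** 2

plt.subplot(1, 2, 1)
plt.contour(A, B, Z)     # only the contour boundaries
plt.subplot(1, 2, 2)
plt.contourf(A, B, Z)    # regions between contours are filled with color
plt.show()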

Explanation
The figure displays 8 contour charts corresponding to different combinations of A and B. It is
clear from the image that the contour() function creates the boundary of the contour plot and
contourf() fills the respective color also inside the boundary.

Summary
• Python programming language has two libraries for data visualization: matplotlib and
seaborn. Seaborn is a data visualization library based on matplotlib.
• The plot() is a versatile command and can take an arbitrary number of arguments for
drawing figures corresponding to one axis, for two axes considering single and multiple data.
• We can draw a chart using multiple colors, shapes, and figures for depicting different
datasets. A dashed line is drawn using “--”, a square is drawn using “s”, a triangle is drawn
using “^” symbol, and a circle is drawn using “o”.
• The letters and symbols of the format string are from MATLAB, for concatenating a color
string with a line style string.
• Blue dashes are drawn using “b--”, red squares are drawn using “rs”, green circle are drawn
using “go”, and magenta triangles are drawn using “m^”.
• The axis() command is used to set the limits of both the axes in one single command; the
commands xlim() and ylim() help to set the limits of x- and y-axes, respectively.
• A pie chart is drawn using the function pie(). A pie chart is a circle partitioned into
segments where each of the segments represents a category.
• Violin plot is a combination of box and kernel density plots. It is drawn using violinplot()
from matplotlib. pyplot.
• Scatter plots show many points plotted in the Cartesian plane. A simple scatter plot is created
using the scatter() function.
• A histogram represents the frequencies of values of a variable gathered into ranges. Each bar
in the histogram represents the height of the number of values present in that range.
• A bar chart represents data in rectangular bars with length of the bar proportional to the value
of the variable. Python uses the function bar() to create bar charts. It is also possible to draw
a stacked bar chart for multiple data.
• It is possible to have more than one chart on one page in the form of n × m array of charts
using the function subplot().
• Area plots are pretty much similar to the line plots. They are also known as stack plots. These
plots can be used to track changes over time for two or more related groups that make up one
whole category.
• The quiver plot shows a 2-D field of arrows representing two datasets. The quiver()
function takes two important arguments that represent 1-D or 2-D arrays or sequences.
• The purpose of a meshgrid is to create a rectangular grid out of an array of x values and an
array of y values. A meshgrid is created using meshgrid() function.
• A contour plot is a graphical technique for representing a three-dimensional surface and is
created using contour().

Multiple-Choice Questions

1. The _____________ plot is a combination of box and kernel density plots.


(a) Violin

(b) Meshgrid
(c) Contour
(d) Quiver
2. The _____________ value is used to draw red squares on the line plot.
(a) sr
(b) sqred
(c) rs
(d) redsq
3. The _____________ and _____________ functions are used to set limit of x- and y-axis,
respectively.
(a) xlim(), ylim()
(b) xaxes(), yaxes()
(c) xaxis(), yaxis()
(d) None of the above
4. The value of argument in subplot for creating multiple images in 2 rows and 4 columns
is_____________
(a) 4, 2, …
(b) 2, 4, …
(c) 8, 1, …
(d) 1, 8, …
5. The _____________ function is used to display a grid on the chart.
(a) grid()
(b) lines()
(c) grid_main()
(d) grid_chart()
6. The _____________ function is used to give title to main image when multiple images
are drawn on one chart.
(a) title()
(b) maintitle()
(c) suptitle()
(d) main_title()
7. The _____________ function is used to give title to each image when multiple images
are drawn on one chart.
(a) title()
(b) maintitle()
(c) suptitle()
(d) main_title()
8. The function _____________ is used to set the limits of axis.
(a) axis()
(b) lim_axis()
(c) limit_axis()
(d) axis_limit()
9. The function _____________ is used to draw more than one chart in the form of an n × m array of charts.
(a) more()
(b) image()
(c) multiimage()
(d) subplot()
10. The ______________ plot shows a 2-D field of arrows representing two datasets.
(a) quiver
(b) meshgrid
(c) contour
(d) heatmap

Review Questions

1. Create and explain the utility of a pie chart considering an example of your choice.
2. Create and explain the utility of a violin plot considering an example of your choice.
3. Create and explain the utility of a scatter plot considering an example of your choice.
4. Create and explain the utility of a histogram considering an example of your choice.
5. Create and explain the utility of a stacked bar chart considering an example of your choice.
6. Create and explain the utility of a bar chart considering an example of your choice.
7. Create and explain the utility of an area plot considering an example of your choice.
8. How can we create multiple images on one single page?
9. Discuss the different special effects that can be added to a chart using plot() function.
10. How do we add title and set limits for x- and y-axes?

CHAPTER
8

Seaborn Library for Visualization

Learning Objectives
After reading this chapter, you will be able to

• Get exposure to visualization techniques from Seaborn library.


• Create different charts for categorical and continuous variables.
• Analyze the different charts.
• Foster analytical and critical thinking abilities for decision making.

Seaborn is a data visualization library built on top of core visualization library Matplotlib. It is
considered as a complement rather than a replacement for Matplotlib library. It provides a high-
level interface for drawing attractive statistical graphics. Because Seaborn Python is built on top
of Matplotlib, the graphics can be further tweaked using Matplotlib tools and rendered with any
of the Matplotlib back ends to generate publication-quality figures. Seaborn is mostly focused on
the visualization of statistical models; such visualizations include heat maps and those that
summarize data but still depict overall distributions. It has different built-in themes for styling
Matplotlib graphics. Different types of charts can be made using Seaborn library. The common
arguments that are generally considered for creating different types of charts include:

1. data represents the dataset under consideration.


2. x represents the variable on x-axis.
3. y represents the variable on y-axis.
4. hue represents another variable which will be represented on the chart in different colors
with respect to the variables on x- and y-axis.

8.1 Visualization for Categorical Variable


For creating visualization related to categorical variables, we will consider titanic dataset that can
be downloaded from Seaborn library.
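
The dataset can be loaded as follows (the statistics printed in the explanation below refer to this dataset):

import seaborn as sns

titanic = sns.load_dataset('titanic')   # downloads the dataset bundled with seaborn
print(titanic.shape)                    # (891, 15)
print(titanic.keys())
print(titanic.describe())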

Explanation
The dimension of the dataset is (891, 15) which shows that there are 891 rows and 15 columns.
The columns in the dataset are displayed using the keys() function. The describe() function
displays the basic statistical values of the fields having numerical values.

8.1.1 Box Plot


A box plot describes the distribution of a continuous variable by plotting the summary of five
statistical terms: minimum, maximum, second quartile (50th percentile), first quartile (25th
percentile), and third quartile (75th percentile) in the dataset. Box plots are a measure of how
well the data in a dataset is distributed. They divide the data using three quartiles. It is possible to
draw multiple boxes considering a categorical variable with different values. It is also useful in
comparing the distribution of data across datasets by drawing box plots for each of them. Box
plots usually have vertical lines extending from the boxes which are termed as whiskers. These
whiskers indicate variability outside the upper and lower quartiles; hence box plots are also
termed as box-and-whisker plots/ diagrams. Any outliers in the data are plotted as individual
points.
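
A minimal sketch of the box plot discussed below is:

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')
sns.boxplot(x='class', y='fare', data=titanic)   # one box per passenger class
plt.show()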

Explanation
The figure shows the general structure of a box-and-whisker plot. The box plot is drawn for fare
with respect to class. In titanic dataset, we know that there are 3 classes: First, Second, and
Third. Hence, three boxes are shown in the figure corresponding to these three classes. We can
clearly determine the minimum and maximum value through first, second, and third quartile. It
is clear from the chart that the high fare is for first class while it is low for third class. We can
observe from the chart that since there is an outlier at fare 500 approximately for first class, an
unclear box-and-whisker plot is shown. There are other outliers also in the first class where fare
is between 200 and 300. One outlier also exists in second class. It is important to understand
that it is always better to remove outliers for understanding data properly.

There are lots of other parameters which can be used in the boxplot()
function. An argument named “order” is used to set the order of the classes
with the help of a list, and palette is used to set the colors according
to predefined color settings. For example, the value of order can be
[“Second”, “Third”, “First”] to show the box plot of the Second class first and the First
class last. Similarly, palette can have values, such as Set1, Set2, and Set3,
which contain predefined settings of different colors.

8.1.2 Violin Plot


A violin plot is a combination of box plot and kernel density plot. Hence, violin plot makes it
easier to analyze and understand the distribution of the data, because it includes both the types of
plots. Similar to box plot, a violin plot can produce multiple violins corresponding to different
values of a categorical variable. The quartile and whisker values from the box plot are shown
inside the violin. As the violin plot uses KDE, the wider portion of violin indicates higher density
and narrow region represents relatively lower density. The inter-quartile range in box plot and
higher density portion in “kde” fall in the same region of each category of violin plot.
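
A minimal sketch of the violin plot discussed below is:

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')
sns.violinplot(x='class', y='age', hue='sex', data=titanic)   # two violins per class
plt.show()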

Explanation
The violin plot displays the age of passengers with respect to class. Since sex has two values
– male and female – we can observe that violins in two different colors are drawn. Also,
since there are three classes (First, Second, and Third), 6 colored violins are drawn on the
chart. It can be observed from the chart that the highest age of a male passenger is greater than 85 and
the highest age of a female passenger is greater than 70.

Create a violin plot for displaying different categories of embarked with age.
Use palette option in violin plot for drawing violins for different color settings.

8.1.3 Point Plot


The point plot is created using the pointplot() function. It shows the point with respect to each
value of the categorical variable and joins them with a line segment.

Explanation
The figure shows that since only three types of class exist (First, Second, and Third), the
chart shows only three points and a line is drawn joining those three points.

8.1.4 Line Plot


It is a plot that connects a series of points by drawing line segments between them. These points
are ordered by one of their coordinates (usually the x-coordinate). Line charts are usually
used in identifying the trends in data.

Explanation
A line plot does not show dots. A line is drawn joining those positions. In the figure, fare is
shown corresponding to embarked.

8.1.5 Count Plot


This plot is drawn using countplot() function and is used to show the number of observations

corresponding to different values of column specified in the function.

One of the important arguments used in count plot is orient, which has two
possible values “v” and “h” for vertical and horizontal orientations. In our
example, the horizontal orientation is shown. The order and palette options
discussed can also be used for count plot.

Explanation
We can observe from the figure that there are two values for alive: nearly 330 observations
corresponding to “yes” and 520 observations corresponding to “no”. We can also observe that the
value of the categorical variable that corresponds to the maximum frequency of observations
is displayed at the top.

8.1.6 Bar Plot


The bar plot draws a bar to display the relationship between a continuous and categorical
variable.

Explanation
The figure shows the bar plot corresponding to fare for different values of embarked. Thus, we
can observe that maximum value of fare is for “S” and minimum value is for “Q”.

8.1.7 Strip Plot


The stripplot() function is used when one of the variables under study is categorical. It
represents the data in sorted order along any one of the axes. It is similar to a scatter plot, which is
created with two continuous variables. The major problem with the plot is that the points
overlap. To handle this situation, the argument “jitter” is included in stripplot() with a
Boolean value of True to add some random noise to the data. This parameter will adjust the
positions of the points along the categorical axis.

Explanation
The figure displays the strip plot for class with respect to fare. In the plot, we can clearly see the
difference of fare in each class.

8.1.8 Swarm Plot


The swarmplot() function can be used as an alternative to the “jitter” discussed in the strip plot. This
function positions each point of the scatter plot on the categorical axis and thereby avoids
overlapping points.

Explanation
The figure shows the value of class with respect to fare. We can observe that it creates a better
structure for displaying different observations. It is a better way to interpret things from all the
observations.

Create a single chart displaying the bar plot and swarm plot together but of
different colors to understand the importance of both the charts effectively.

8.1.9 Factor Plot


Factor plot is drawn using the function factorplot(), and by default this function creates a point plot for
the given data; it is also possible to change the type of chart by defining the value of the “kind”
argument. The different values of the kind argument include box, violin, bar, count, and strip for
creating a box plot, violin plot, bar plot, count plot, and strip plot, respectively. Another
important aspect of the factor plot is that it draws a categorical plot on a facet grid. In a factor plot, the
data is plotted on a facet grid using the argument “col”. It forms a matrix of panels according to
row and column by dividing the variables. Because of the panels in the factor plot, a single plot
looks like multiple plots. This helps in analysing all the combinations of different discrete
variables. Example: If col = “gender”, two different plots for male and female will be plotted.
This helps to bring another variable also into the plot. Thus, the main plot is divided into two
plots based on a third variable specified by the “col” argument. Another important argument is
“col_wrap”, which specifies the number of charts to be placed in one row. This situation arises
because the col argument helps us to create multiple charts and, by using the col_wrap argument,
we are able to divide the charts across rows.

Explanation
The first figure shows a default point plot between age and sex. The next figure shows a violin
plot for “pclass” and “fare”. Since there are different values of alive, different shades of
colors are used for showing the different values of alive. The next figure shows a chart for each
value of class. Since there are 3 classes, 3 different charts are created, one for each
class. The next figure shows different charts for different decks. Since there are 7
decks, we have used the value 3 for the col_wrap argument. This enables us to display 3
charts in a row. Hence the figure shows three rows and divides the charts as 3, 3, 1.

Create a factor plot for class, fare, embarked considering different color
palettes and order of representing data.

8.1.10 Facet Grid


This chart helps in visualizing distribution of one variable as well as relationship between
multiple variables separately within subsets of dataset using multiple panels. A Facet Grid can be
drawn using three dimensions − row, col, and hue. The variables should be categorical; the data
at each level of the variable will be used for a facet along that axis.
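
A minimal sketch, using one possible combination of the three dimensions, is:

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')

g = sns.FacetGrid(titanic, col='alive', hue='sex')   # one panel per value of 'alive'
g.map(plt.scatter, 'age', 'fare')                    # plot age against fare in every panel
g.add_legend()
plt.show()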

Explanation

In the first figure, we have just initialized the FacetGrid object, which does not draw anything
on the chart. But the number of plots is more than one because of the parameter col. The main
approach for visualizing data on this grid is with the FacetGrid.map() method, which is shown
in the other three figures. The second figure shows a histogram with respect to age because the map
function takes plt.hist as an argument. The third figure shows histograms for different categories of
alive and class. The last figure creates a scatter plot with respect to age and fare because the
map function takes plt.scatter as an argument. Different colors correspond to different
categories of sex since it is specified in the hue argument.

8.2 Visualization for Continuous Variable


In Section 8.1, we were able to analyze the categorical variables in the dataset. These plots are
not suitable when the variable under study is continuous. In real scenario, we generally use
datasets that contain multiple quantitative variables and the goal of analysis is to relate those
variables to each other. All the charts discussed in following section will be considering
continuous variables from the “iris” dataset available in “seaborn” library.

For effective analysis, a lot of visualization techniques are adopted for


understanding data and their relationship with each other. The charts discussed
in this section prove very effective for understanding the relationship. Explore
these charts by changing data and try to understand the data from the charts
drawn. A proper analysis of the chart will be possible with a good practice by
considering different types of data.
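
The iris dataset can be loaded as follows:

import seaborn as sns

iris = sns.load_dataset('iris')   # load the iris dataset bundled with seaborn
print(iris.shape)                 # (150, 5)
print(iris.describe())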

Explanation
It can be observed that the dimension of the dataset is (150, 5) which means that there are 150
observations and 5 columns. The details of the dataset are displayed which shows the
descriptive statistics of all the columns having numeric values.

8.2.1 Scatter Plot
A scatter plot using “seaborn” is drawn through scatterplot() as discussed below:

Explanation
The figure shows a scatter plot for sepal length with respect to petal length.

8.2.2 Regression Plot


While building the regression models, we need to check for assumption of multicollinearity. We
need to take action if multicollinearity exists and for this we need to determine whether there is
any correlation between all the combinations of continuous variables. There are two main
functions in seaborn to visualize a linear relationship determined through regression: regplot()
and lmplot(). These regression plots help to draw a line of best fit on the scatterplot. This is
helpful for statistical analysis to a great extent. Unlike lmplot(), which takes variables in the form
of strings, regplot() accepts the variables for the x- and y-axis in different formats like numpy
arrays, variables in a Pandas dataframe, etc.
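
A minimal sketch of both functions on the iris dataset is:

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris')

sns.regplot(x='petal_length', y='petal_width', data=iris)   # line of best fit on a scatter plot
plt.show()

sns.lmplot(x='sepal_length', y='sepal_width', data=iris)    # same idea, variables given as strings
plt.show()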

Explanation
In the first figure, a line of best fit is drawn covering maximum points on the chart
corresponding to points for petal_length and petal_width. The chart shows that maximum
points are covered by the line. However, in the second figure, most of the points are left
uncovered, hence the data is non-linear.

The simple linear regression model used above is very simple to fit, but in most of the cases, the
data is non-linear and the above methods cannot generalize the regression line. If the plot shows
high deviation of data points from the regression line, then we say the data is non-linear and
requires other forms of regression like polynomial regression. These higher orders can be visualized using
the order argument in lmplot() and regplot(). This is illustrated in the figure.

Explanation
Even after changing the order, most points are left uncovered. It will be better to explore medium-
dimensional data and draw multiple instances of the same plot on different subsets of the dataset.

8.2.3 Heat Map


It is a two-dimensional graphical representation of data where the individual values that are
contained in a matrix are represented as colors. The seaborn Python package allows the creation
of annotated heatmap which can be tweaked using Matplotlib tools as per the creator’s
requirement. It is really useful to display a general view of numerical data, but not to extract
specific data points. It is important to normalize the matrix, choose a relevant color palette and
use cluster analysis, thus permuting the rows and the columns of the matrix to place similar
values near each other according to the clustering. The best thing about a heat map is that it can
consider any number of variables as argument. The cmap is an important argument which shows
the mapping with respect to colors. Some of the important values of colormap that can be used
are: Accent, Blues, BrBG, Greens, Oranges, Paired, Pastel2, PuRd, Purples, RdGy, RdYlBu,
Reds, magma, plasma, rainbow, seismic, spring, terrain, twilight, winter, etc.

Explanation
The first figure creates a heat map for length and width of petal with default setting of color.
The information on the right side shows that smallest value is represented by dark color and
largest value is represented by light color. We can observe from the map that total 150 bars are
shown corresponding to 150 observations. The next figure shows a heat map with color
mapping as “RdYlGn” representing Red, Yellow, and Green. The reason to choose these
columns together is that values in these two columns lie in a particular range. The information
on the right side shows that smallest value is represented by red color and largest value is
represented by green color.

It is also possible to depict correlations between different variables of the dataset using the heat
map as shown in the following example.
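
A minimal sketch of a correlation heat map on the iris dataset is shown below; dropping the species column is an assumption made here so that only the numeric columns are correlated:

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris')
corr = iris.drop(columns='species').corr()     # correlation matrix of the numeric columns
sns.heatmap(corr, annot=True, cmap='RdYlBu')   # annotate each cell with its correlation value
plt.show()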

Explanation
From the figure, we can observe that sepal_width has a negative correlation with the rest of
three variables. However, the extent of correlation between sepal_width and sepal_length is less
than with the other two variables. All the other pairs of variables have a positive correlation. The value of
“cmap” is “RdYlBu” which means that Dark Red will have a perfect negative correlation (–1)
while Dark Blue will have perfect positive correlation (+1). Yellow color will show no
correlation. The colors will be determined according to the level of correlation between the
variables.

Create heat maps considering different fields from the diabetes dataset
available in sklearn.datasets. The dataset can be fetched by using the command
from sklearn.datasets import load_diabetes and diabetes=load_diabetes().

8.2.4 Univariate Distribution Plot


Distribution of data is the foremost thing that we need to understand while analysing the data.
The distplot() function helps us to visualize the parametric distribution of a dataset. We can
understand the univariate distribution of the data by plotting a histogram that fits the kernel
density estimation of the data. The two most important arguments are kde (Kernel Density Plot)
and hist (Histogram), which take Boolean values. If the values are True, the plot shows both the
charts. KDE is a way to estimate the probability density function of a continuous random
variable. The bins is an important argument which defines the number of bins in the histogram.
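
A minimal sketch is shown below; note that distplot() is deprecated in recent Seaborn releases in favour of displot() and histplot(), so these calls assume a version where it is still available:

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris')

sns.distplot(iris['sepal_length'])                       # histogram with KDE (default)
plt.show()
sns.distplot(iris['sepal_length'], kde=False, bins=15)   # histogram only
plt.show()
sns.distplot(iris['sepal_length'], hist=False)           # KDE only
plt.show()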

Explanation
The first chart in the figure is drawn using default settings of the distribution plot, hence both
the histogram and kernel density plot are drawn. In the second chart, “kde” flag is set to False.
As a result, the representation of the kernel estimation plot will be removed and only histogram
is plotted. In the third chart, the “hist” flag is set to False and hence will display only the kernel
density estimation plot.

8.2.5 Joint Plot


Joint plot is used for bivariate distribution to determine the relation between two variables. This
mainly deals with the relationship between two variables and how one variable is behaving with
respect to the other. The jointplot() function creates a multi-panel figure that projects the
bivariate relationship between two variables and also the univariate distribution of each variable
on separate axes. The joint plot also supports creation of regression plot by specifying the kind
argument to value as reg.
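
A minimal sketch of a joint plot with a regression fit is:

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris')
sns.jointplot(x='sepal_length', y='petal_length', data=iris, kind='reg')
plt.show()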

Explanation
The first figure shows the relationship between the sepal_length and petal_length in the iris
data. It can be observed from the trend in the plot that there is a positive correlation between the
variables under study. The line of regression is also able to cover all the points; hence a linear
regression model can be developed between these two variables.

8.2.6 Joint Hexbin Plot


It is possible to create a hexbin plot by considering hex as the value of “kind” argument in the
jointplot(). Hexagonal binning is used in bivariate data analysis when the data is sparse in
density, that is, when the data is very scattered and difficult to analyze through scatterplots.

Explanation
The figure shows the relationship between the sepal_length and petal_length in the iris dataset.
Blue color depicts a positive correlation. The intensity of the color shows the extent of
correlation between the two variables.

8.2.7 Joint Kernel Density Plot


For creating a kernel density estimation plot, we need to add kind argument to the jointplot()
with value=”kde”.

Explanation
The figure shows the relationship between the sepal_length and petal_length in the Iris data. A
trend in the plot says that positive correlation exists between the variables under study.

8.2.8 Pair Plot


Python provides an important function pairplot() for representing multivariate data. If x is a
numeric matrix or dataframe, the command pairplot(x) produces a pairwise scatter plot matrix
of the variables defined by columns of x (every column of x is plotted against every other column
of x and the resulting n(n − 1) plots are arranged in a matrix with plot scales constant over the
rows and columns of the matrix). The pairplot() function produces a scatterplot between all
possible pairs of variables in a dataset and for each variable, it uses the same scale. It is also
possible to create a regression line on all the plots using the “reg” value for kind argument. It is
also possible to draw a pair plot for selected columns. The important arguments in the pair plot
are as follows:

Syntax
seaborn.pairplot(data, vars=, hue=, palette=,diag_kind=)

where

• data has the name of the dataset of which variables are considered.
• vars represents the variables for which the pair plot needs to be created. An absence of this
argument will print for all the continuous variables from the dataset.
• hue: hue variable helps to plot different levels with different colors.

• palette: Set of colors for mapping the hue variable. The color palettes include Deep, Muted,
Bright, Pastel, Dark, Colorblind, husl, etc.
• kind: Kind of plot for the non-identity relationships. Options are: scatter and reg. Default is
scatter.
• diag_kind: Kind of plot for the diagonal subplots. Options are: hist and kde. Default is hist.
Except data, all other parameters are optional.
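
A minimal sketch using some of these arguments on the iris dataset is:

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris')

# three selected variables with a regression line on every panel
sns.pairplot(iris, vars=['sepal_width', 'sepal_length', 'petal_length'], kind='reg')

# all numeric variables, colored by species, with KDE plots on the diagonal
sns.pairplot(iris, hue='species', palette='deep', diag_kind='kde')
plt.show()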

Explanation
The first plot produces a pair plot considering only three variables: the width and length of sepal
and the length of petal. We can observe that since there are three variables, a table of 3*3
images is created. Each variable is written on both the axes. Hence, the scatter plot is drawn
with respect to every combination of two variables. Since the value of kind is reg, the
line of best fit is drawn on all of the scatter plots. The default value of histogram is taken for
the diagonal. The next plot creates a chart with different colors for different species. Since the kind
argument is not present, the default value of scatter is considered for it. The next figure creates a
pair plot considering all the variables from the dataset and the color palette is changed to deep.
Since there are four numeric variables, a table of 4*4 images is created and each image
produces a scatter plot considering two variables. The last figure creates a “kde” plot on the diagonal
and scatter plots in all the other cells. Different colors are used for different species and the
color palette is deep.

8.2.9 Pair Grid

Unlike the pair plot, where we can draw only a histogram or kde on the diagonal and scatter or
regression plots elsewhere, this plot helps us define the type of chart on the diagonal along with
specifications for the type of chart on the upper, lower and off-diagonal cells separately. It
allows us to draw a grid of subplots using the same plot type to visualize data. Unlike Facet Grid,
it uses a different pair of variables for each subplot. It forms a matrix of sub-plots. The usage of
PairGrid is similar to FacetGrid. We need to first initialize the grid and then pass the type of
chart inside the appropriate function. We can plot a different chart on the diagonal to show the
univariate distribution of the variable in each column using map_diag() function; on elements
other than diagonal using map_offdiag() function; map_upper() and map_lower() for different
charts in the upper and lower triangles, respectively, to see different aspects of the relationship.
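
A minimal sketch of a pair grid is:

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris')

g = sns.PairGrid(iris)       # initialize the grid first
g.map_upper(plt.scatter)     # scatter plots in the upper triangle
g.map_lower(sns.kdeplot)     # KDE plots in the lower triangle
g.map_diag(plt.hist)         # histograms on the diagonal
plt.show()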

Explanation
The first figure creates scatter charts between all the variables, since the map function has
plt.scatter as its argument. This means that every cell has a scatter chart. However, the other chart
specifies different types of charts in different grid elements. The next figure has scatter charts
only in the upper triangle, since map_upper() has the scatter argument. Similarly, the lower triangle
has a KDE plot in each cell and the diagonal has histograms.

Create a pair plot and pair grid considering different fields from the breast
cancer dataset available in sklearn.datasets. The dataset can be fetched by
using the command from sklearn.datasets import load_breast_cancer and
cancer=load_breast_cancer().

Seaborn provides a function called set_palette(), which can be used to give colors to plots and
add more aesthetic value to them. We can basically create a colored sequential plot or diverging
plot. Sequential palettes are suitable for expressing the distribution of data ranging from relatively lower
values to higher values within a range. To plot a sequential palette, we pass the name of the color,
for example, Greens for green, Reds for red, etc. However, diverging
palettes use two different colors and each color represents variation in the value ranging from a
common point in either direction. If we assume data from −10 to 10, the values from −10 to 0
will be shown in one color and the values from 0 to 10 will take another color. By default, the values are
centred at zero.
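
A minimal sketch, assuming the Reds and BrBG palettes as the sequential and diverging choices, is:

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris')

plt.subplot(1, 2, 1)
sns.set_palette('Reds')      # sequential palette
sns.violinplot(x='species', y='petal_length', data=iris)

plt.subplot(1, 2, 2)
sns.set_palette('BrBG')      # diverging palette (brown to blue-green)
sns.violinplot(x='species', y='petal_length', data=iris)
plt.show()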

Explanation
The above program creates an image having two separate charts with different color settings. The
first chart is created with sequential red color settings and the second chart is created with a
diverging brown to blue-green palette.

Summary
• Python programming language has basically two libraries for data visualization: matplotlib
and seaborn. Seaborn is a data visualization library based on matplotlib.
• Box plot helps us to draw multiple boxes considering a categorical variable with different
values.
• A violin plot is a combination of box plot and kernel density plot and is drawn using the
violinplot() function.
• A point plot is created using the pointplot() function and shows the point with respect to
each value of the categorical variable and joins them with a line segment.
• Line plot connects a series of points by drawing line segments between them.
• The count plot is drawn using countplot() and is used to show the number of observations
corresponding to the different values of column specified in the function.
• The bar plot draws a bar to display the relationship between a continuous and categorical
variable.
• The strip plot represents the data in sorted order along any one of the axis.
• The swarm plot positions each point of the scatter plot on the categorical axis and thereby avoids
overlapping points.
• In factor plot, the data is plotted on a facet grid using the argument “col”. It forms a matrix of
panels according to row and column by dividing the variables.
• A Facet Grid can be drawn using three dimensions: row, col, and hue. The variables should
be categorical and the data at each level of the variable will be used for a facet along that
axis.
• Scatter plots show many points plotted in the Cartesian plane and the simple scatter plot is
created using the scatterplot() function.
• There are two main functions in Seaborn to visualize a linear relationship determined through
regression: regplot() and lmplot().
• Heatmap is a two-dimensional graphical representation of data where the individual values
that are contained in a matrix are represented as colors.
• Distribution of data is the foremost thing that we need to understand while analysing the data.
The function distplot() helps us to visualize the parametric distribution of a dataset.
• Joint plot is used for bivariate distribution to determine the relation between two variables.
This mainly deals with relationship between two variables and how one variable is behaving
with respect to the other.
• It is possible to create a hexbin plot by considering hex as the value of kind in the
jointplot(). Hexagonal binning is used in bivariate data analysis when the data is sparse in
density, that is, when the data is very scattered and difficult to analyze through scatterplots.
• For creating a kernel density estimation plot, we need to add kind argument to the
jointplot() with value=”kde”.
• Seaborn provides an important function pairplot() for representing multivariate data. If x is
a numeric matrix or dataframe, the command pairplot(x) produces a pair wise scatter plot
matrix of the variables defined by columns of x.
• The PairGrid() function helps to plot different charts in a grid: the map_diag() function plots a
chart on the diagonal to show the univariate distribution of the variable in each column;
map_offdiag() plots on the elements other than the diagonal; and map_upper() and map_lower()
plot different charts in the upper and lower triangles, respectively, to show different aspects of
the relationship.
• Seaborn provides a function called set_palette(), which can be used to give colors to plots
and adding more aesthetic value to it. We can basically create a colored sequential plot or
diverging plot.

Multiple-Choice Questions

1. This function is not used to visualize a linear relationship determined through regression.
(a) regplot()
(b) lmplot()
(c) regression()
(d) None of these
2. The ___________ draws a bar corresponding to the frequency of the variable.
(a) countplot()
(b) boxplot()
(c) scatterplot()
(d) swarmplot()
3. A Facet Grid can be drawn using the dimension:
(a) row
(b) col
(c) hue
(d) All of these
4. It is possible to create a hexbin plot by considering _____________ as the value of kind in
the jointplot().
(a) hex
(b) hexbin
(c) bin
(d) None of these
5. Different functions that can be used to plot a pair grid include:
(a) map_diag()
(b) map_upper()
(c) map_lower()
(d) All of these
6. The ______________ function shows the number of observations corresponding to the
different values of column specified in the function.
(a) pointplot()
(b) boxplot()
(c) scatterplot()
(d) countplot()
7. ______________ is a plot that connects a series of points by drawing line segments
between them.
(a) lineplot()
(b) pointplot()
(c) both (a) and (b)
(d) None of these
8. A kernel density estimation plot can be created by assigning value _____________ to kind
argument in jointplot().
(a) kernel
(b) kde
(c) kernel density
(d) estimation
9. A box plot describes the distribution of a continuous variable by plotting the summary of:
(a) minimum
(b) maximum
(c) Quartiles
(d) All of these
10. Violin plot includes the following chart:
(a) Box Plot
(b) Kernel Density Plot
(c) Both (a) and (b)
(d) None of these

Review Questions

1. Considering datasets of your choice, create and explain the utility of following charts:
1. Swarm plot
2. Pair plot
3. Pair grid
4. Facet grid
5. Scatter plot
6. Regression plot
7. Count plot
8. Bar plot
9. Violin plot
10. Heat map

CHAPTER 9

SciPy Library for Statistics

Learning Objectives
After reading this chapter, you will be able to

• Understand different sub-packages in the SciPy library.


• Get familiarized with linear algebra and statistical techniques.
• Assess the results of different statistical techniques in real-world situations.
• Apply the knowledge of image processing using ndimage sub-package.

SciPy (pronounced as “Sigh Pie”) is used for mathematical, scientific, and technical computing.
SciPy is built on top of NumPy and hence can operate on NumPy arrays. NumPy is used to
perform the most basic operations such as sorting, shaping, indexing, etc., while more advanced
data science features are available in SciPy. Besides, SciPy has a fully featured linear algebra
module, while NumPy contains only a few linear algebra features. SciPy has a list of different
sub-packages like scipy.linalg (Linear Algebra Operation), scipy.stats (Statistics and Random
Numbers), scipy.special (Special Function), and scipy.ndimage (Image Manipulation). Other
packages included in SciPy library which are beyond the scope of the book include
scipy.integrate (Numerical Integration), scipy.io (File input/output), scipy.interpolate
(Interpolation), scipy.optimize (Optimization and Fit), scipy.fftpack (Fast Fourier Transforms),
and scipy.signal (Signal Processing).

Sometimes it becomes important to install a particular version of a library.


This is because the version of one library is sometimes incompatible with the
versions of other libraries. For example, the environment might have the SciPy
library loaded, but it might show errors when it is executed alongside other
libraries. It then becomes important to install the library with a particular
version. This can be achieved by typing the name of the library followed by the
~= sign and the desired version of the library. Thus, the command pip install
scipy~=1.3.2 will install SciPy version 1.3.2 in the environment. This can
be used for any library with any version specification.

9.1 The linalg Sub-Package


The linalg sub-package is specifically created for applications related to linear algebra (linalg).
The sub-package has functions for solving linear equations, calculating determinant, inverse,
eigenvalues, eigenvectors, etc.

Explanation
The linalg sub-package is imported from the SciPy library in the program. The det() and inv()
functions give the determinant and inverse of the matrix, respectively. The eig() function
returns both the eigenvalues and the eigenvectors of the matrix (stored here as eigen_val and
eigen_vect, respectively).
The scipy.linalg.solve() function helps to solve the simultaneous equations. The
solve() function takes two inputs “a” and “b” where “a” represents the coefficients and “b”
represents the respective right-hand side value. The function returns the solution in the form of
an array.

Example
The following equations are solved in the program:
4x + y + 5z = 27
2x + 3y + z =19
6x + 3y + 3z = 33

Explanation
The solve() function takes two arguments in the form of list: a and b. The value “a” represents
the coefficients of x, y, z of three equations and “b” stores the values of RHS. The nested list
“a” has basically three lists; each list having coefficients of the respective equation. The result
shows that the value of x is 2, y is 4, and z is 3. When we solve these equations manually, the
same result is produced.
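
The program itself is not reproduced here; a minimal sketch of the operations described above, using the matrix built from the coefficients of the three equations (an assumption for illustration), might look as follows:

# A sketch of the linalg operations discussed above (matrix values are illustrative).
import numpy as np
from scipy import linalg

A = np.array([[4, 1, 5], [2, 3, 1], [6, 3, 3]])
print(linalg.det(A))                       # determinant of the matrix
print(linalg.inv(A))                       # inverse of the matrix
eigen_val, eigen_vect = linalg.eig(A)      # eigenvalues and eigenvectors
print(eigen_val, eigen_vect)

# Solving 4x + y + 5z = 27, 2x + 3y + z = 19, 6x + 3y + 3z = 33
a = [[4, 1, 5], [2, 3, 1], [6, 3, 3]]
b = [27, 19, 33]
print(linalg.solve(a, b))                  # expected solution: [2. 4. 3.]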

9.2 The stats Sub-Package


This section focuses on functions related to descriptive statistics and inferential statistics.
Inferential statistical analysis infers properties about a population and it includes testing
hypotheses and deriving estimates.

9.2.1 Basic Statistics


Different functions are available in the scipy.stats package for computing different statistical
values. These include functions for descriptive statistics such as mean, median, etc.; functions
for determining the normality of data, variance in data, correlation, etc.; and different
functions for performing parametric and non-parametric tests.

9.2.1.1 Descriptive Statistics


The descriptive statistical functions that are available in SciPy library include describe(),
cumfreq(), iqr(), gmean(), hmean(), etc. The describe() function computes several
descriptive statistics of the passed array. The gmean() computes geometric mean along the
specified axis. The hmean() calculates the harmonic mean along the specified axis. The iqr()
function computes the interquartile range of the data along the specified axis. The zscore()
function calculates the z-score of each value in the sample, relative to the sample mean and
standard deviation. The sem() function calculates the standard error of the mean (or standard
error of measurement) of the values in the input array. These functions are used in the following
example:
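
The book's program is not shown here; a minimal sketch using an illustrative 13-value dataset (the exact values, and therefore the printed results quoted below, will differ) could look like this:

# Descriptive statistics on an illustrative dataset
import numpy as np
from scipy import stats

data = np.array([11, 12, 12, 13, 13, 15, 17, 18, 19, 20, 22, 23, 24])
print(stats.describe(data))   # nobs, minmax, mean, variance, skewness, kurtosis
print(stats.cumfreq(data))    # cumulative frequencies
print(stats.gmean(data))      # geometric mean
print(stats.hmean(data))      # harmonic mean
print(stats.iqr(data))        # interquartile range
print(stats.zscore(data))     # z-score of each value
print(stats.sem(data))        # standard error of the mean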

Explanation
The most important function is describe() which returns the value of many statistical
functions. The result shows that the nobs (number of observations) are 13, minmax(11, 24)
means that the minimum number in data is 11 while maximum number is 24. The mean is
17.07, variance is 16.74, skewness is 0.04, and kurtosis is –1.12. The next functions display the
cumulative frequencies, geometric and harmonic mean. The inter-quartile range is displayed
using iqr(). The z-score and standard error are displayed using zscore() and sem(),
respectively.

9.2.1.2 Rank
In statistics, “ranking” refers to the data transformation in which numerical values are replaced
by their rank. For example, if the numerical data 4, 5, 2, 3 are observed, the ranks of these data
items would be 3, 4, 1, and 2, respectively.
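
A quick illustration with scipy.stats.rankdata() (tied values receive the average of their ranks):

from scipy import stats
print(stats.rankdata([4, 5, 2, 3]))   # output: [3. 4. 1. 2.]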

Explanation

We can observe that there are 12 observations. When the data is arranged in ascending order,
the third observation is having lowest value (11); hence it is assigned a rank 1. The eighth and
tenth observations have a value 12, so they are assigned a rank which is average of their ranks;
hence a rank of 2.5 is assigned. Similarly, the seventh and ninth positions have a value 13,
hence they are assigned a rank 4.5 and so on. The highest number 20 is at sixth position; hence,
it is assigned rank 12.

9.2.1.3 Determining Normality


The normality of data can be determined using three approaches: (a) measuring skewness and
kurtosis, (b) applying normal test function, and (c) Shapiro test.
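
A minimal sketch of the three approaches, using an illustrative randomly generated sample (the seed and distribution parameters are assumptions):

import numpy as np
from scipy import stats

np.random.seed(0)
data = np.random.normal(loc=50, scale=5, size=40)   # hypothetical sample

print(stats.skew(data), stats.kurtosis(data))        # (a) skewness and kurtosis
print(stats.normaltest(data))                        # (b) normality test
print(stats.shapiro(data))                           # (c) Shapiro-Wilk test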

Explanation
If the coefficient of skewness and kurtosis is between –1 and +1, the data is considered to be
normal. Since the value of skewness is 0.12 and that of kurtosis is –0.66, hence we can say that
the data is nearly normal. Normality is also measured using normaltest() function and
shapiro() function. The result of normaltest shows that p-value (0.96) is insignificant.
Similarly, the result of Shapiro test shows that p-value (0.51) is insignificant. Hence, we can
consider that the data is normal.

9.2.1.4 Homogeneity of Variances


To determine the homogeneity of variance, we use Bartlett’s test using stats.bartlett() or
Levene’s test using stats.levene(). However, Levene’s test is considered to be a better
alternative than Bartlett’s test because it is less sensitive to departures from normality.
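
A minimal sketch of both tests with two hypothetical groups (the values are illustrative):

from scipy import stats

group1 = [23, 25, 28, 30, 32, 35, 36, 38]
group2 = [22, 26, 27, 31, 33, 34, 37, 40]

print(stats.bartlett(group1, group2))   # Bartlett's test
print(stats.levene(group1, group2))     # Levene's test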

Explanation
If the p-value of test is greater than 0.05, the assumption of homogeneity of variance is met.
From the output, we can observe that p-value using Bartlett test is 0.236 and using Levene test
is 0.431. This means that the variance across groups is statistically insignificant. This further
means that there is a homogeneity of variances in the two groups.

9.2.1.5 Correlation
We can determine correlation using the following measures: Pearson correlation (pearsonr())
and Spearman rank correlation (spearmanr()). Both functions accept two arrays of the same
length and return the correlation between x and y along with the associated p-value.
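
A minimal sketch of both measures on two hypothetical arrays of equal length:

from scipy import stats

x = [10, 12, 14, 15, 18, 20, 22, 25]
y = [30, 33, 31, 37, 40, 38, 45, 50]

print(stats.pearsonr(x, y))    # (Pearson coefficient, p-value)
print(stats.spearmanr(x, y))   # (Spearman rank coefficient, p-value)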

Explanation
The correlation between two groups lies between –1 and +1. The value –1 denotes a perfect
negative correlation and +1 denotes a perfect positive correlation. The value 0 denotes no
correlation. A value between 0 and 1 denotes the extent of correlation: a value closer to 0 denotes
a weaker correlation and a value closer to 1 denotes a stronger correlation. From the output, we can see
that Spearman correlation is 0.36 and Pearson correlation is 0.50; this means that there is
average correlation between the two groups. We know that null hypothesis is rejected if the p-
value is <0.05. We can observe from our data that the p-value for both the Spearman and
Pearson is greater than 0.05, which means that we failed to reject the null hypothesis. This
means that there exists no significant correlation between the two groups.

9.2.1.6 Chi-Square Test
Chi-square test for independence is a statistical method to determine if two variables in a table
have a significant correlation between them. The chi-square test can be used to test, whether for a
finite number of bins, the observed frequencies differ significantly from the probabilities of the
hypothesized distribution.
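
The program used in the book is not shown; a sketch of a chi-square test of independence on a hypothetical 2×2 contingency table (the counts are illustrative) could look like this:

from scipy import stats

observed = [[20, 30],    # e.g. group A: category 1 / category 2
            [25, 25]]    # e.g. group B: category 1 / category 2

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2, p, dof)      # test statistic, p-value, degrees of freedom
print(expected)          # expected frequencies under independence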

Explanation
Taking account of the estimated parameters, we failed to reject the null hypothesis because the
p-value is greater than 0.05. This means that the two groups are independent, that is they are not
dependent on each other.

Download a dataset of your choice from online sources and perform all the
statistical functions discussed in this section on different variables.

9.2.2 Parametric Techniques for Comparing Means


A common problem that arises in research is the comparison of the central tendency of one group
to a value or to another group or groups. A test is a procedure for comparing sample means to see
if there is sufficient proof to predict that the means of the corresponding population distributions
also differ. Common statistical tools for assessing these comparisons are t-tests, analysis-of-
variance, and general linear models. Parametric techniques are used when some assumptions are
met; for cases where some assumptions are not met, a non-parametric alternative may be
considered. The general goal for most of these tools is to use the estimate of the mean (or other
central measure), assess the variation based on sample estimates, and use this information to
provide the amount of evidence of a difference in means or central tendency.
The t-test and analysis-of-variance abbreviated as ANOVA are two parametric statistical
techniques used to test the hypothesis when the dependent variable is continuous and
independent variable is categorical in nature. The sample is taken from different populations. The
different samples are measured on some variable of interest. For example, a t-test will determine
if the means of the two sample distributions differ significantly from each other, while ANOVA
will determine if the means of the more than two sample distributions differ significantly from
each other. Both of these tests are based on three common assumptions: the sample drawn from
the population should be normally distributed, there should be homogeneity of variances, and
independence of observations should exist. There is a thin line of demarcation between t-test and
ANOVA. When the categorical variable has two groups, t-test is used, while if there are more
than two groups, ANOVA is preferred, that is, when the population means of only two groups is
to be compared, the t-test is used, but when means of more than two groups are to be compared,
ANOVA is preferred.

There are three common assumptions that need to be fulfilled before applying t-test or
ANOVA. However, it should be noted that one sample t-test should fulfil only the normality
assumption since only one sample is involved.

1. Normally Distributed: Non-normally distributed variables (highly skewed or kurtotic


variables, or variables with substantial outliers) can distort relationships and significance
tests. Before applying t-test or ANOVA, it is assumed that variables are normally
distributed (symmetric bell-shaped distribution).
2. Independent Samples: If we randomly sample each set of items separately, under different
conditions, the samples are independent. The measurements in one sample have no bearing
on the measurements in the other sample. For example, consider randomly sampling two different
groups of people on the basis of gender to test their online shopping behaviour. If we
take one random sample from males and record their perception, and another sample from
females and record their perception, we know that the measurements in one
sample have no effect on the other sample.
3. Homogeneity of Variances: The assumption of homogeneity of variance is an assumption
of the independent samples t-test and ANOVA stating that the variance within each of the
populations is equal. However, the independent samples t-test and ANOVA are generally
strong to violations of this assumption also if the group sizes are equal. Equal group sizes
may be defined by the ratio of the largest to smallest group being less than 1.5. If group
sizes are vastly unequal and homogeneity of variance is violated, then the result will be
biased when large sample variances are associated with small group sizes. When this
occurs, the significance level will be underestimated, which can cause the null hypothesis to
be falsely rejected. The result will also be biased in the opposite direction if large variances
are associated with large group sizes. This would mean that the significance level will be
overestimated. This does not cause the same problems as falsely rejecting the null
hypothesis; however, it can cause a decrease in the power of the test.
In Python, the functions for comparing means are available in the stats sub-package from SciPy
library. Hence, this library needs to be imported before using the functions. A test is a process for
comparing sample means of different groups. While comparing means of different groups we
frame hypothesis in which dependent variable is continuous and independent variable is
categorical in nature. For example, in determining the effect of gender on job satisfaction, gender
(a categorical variable) is the independent variable and job satisfaction is the
dependent variable.

Before applying test, you can use different visualization techniques as


discussed in Chapters 7 and 8 for understanding the data and the relationship
between the different variables in the data. This will help you to do effective
analysis.

For understanding the utility of all the tests, we will apply t-test in two situations: considering
our own dataset and considering existing dataset. Different datasets are created for proper
understanding in each section. However, for existing dataset, we will consider existing dataset
named “mtcars” which can be downloaded from either
https://www.kaggle.com/ruiromanini/mtcars/version/1 or
https://gist.github.com/seankross/a412dfbd88b3db70b74b#file-mtcars-csv

Explanation
The dimension of the dataset is 32, 12, which means that there are 32 observations for 12
columns. The columns are ‘model’, ‘mpg’, ‘cyl’, ‘disp’, ‘hp’, ‘drat’, ‘wt’, ‘qsec’, ‘vs’, ‘am’,
‘gear’, ‘carb’.

9.2.2.1 One Sample t-Test


The t-test is described as the statistical test that compares the sample mean of one group with a
standard value. Since only one sample is involved, this test needs to fulfil only the normality
assumption.

Syntax
stats.ttest_1samp(sample,val)
where

• sample is the continuous variable displaying the set of observations in the form of numeric
vector or the field from a dataframe.
• val represents the value with which we want to compare the sample mean.

For better clarity, we will apply one sample t-test in two situations: considering own dataset and
considering existing dataset.
Applying t-test on user’s data: To compare a sample mean with a constant value, one-sample t-
test is applied using the ttest_1samp() function in Python. For example, a pharmaceutical
company wants to check whether the measured values of a chemical deviate from the standard
value of 3. Based on 30 different observations generated, we will check whether or not the
chemical deviates significantly from the expected value of 3. But before applying the one-sample t-
test, it is necessary to fulfil the assumption of normality. There are many different ways of
checking the assumption of normality for one sample, including drawing a probability plot,
computing skewness and kurtosis, and using a normality test function.
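
The program used in the book is not reproduced here; a minimal sketch of the workflow, assuming hypothetical chemical readings (the seed, mean, and spread below are illustrative, so the printed values will differ from those discussed next), might look as follows:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(1)
chem = np.random.normal(loc=3.2, scale=0.4, size=30)   # 30 hypothetical readings

# Normality checks
print(stats.skew(chem), stats.kurtosis(chem))
print(stats.normaltest(chem))
stats.probplot(chem, fit=True, plot=plt)   # probability plot with fitted line
plt.show()

# One-sample t-test against the standard value of 3
print(stats.ttest_1samp(chem, 3))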

Explanation
Skewness and kurtosis functions are considered for measuring normality of the data. If the
coefficient of skewness and kurtosis is between –1 and +1, the data is considered to be normal.
Since the value of skewness is 0.61 and that of kurtosis is –0.64, hence we can say that the data
is normal. Normality is also measured using mstats.normaltest() function. The result shows
that the p-value (0.254) is insignificant. Hence, we can consider that the data is normal.
The probability plot created using the probplot() function plots the observations as blue dots on
the chart. The argument fit has the value True, which fits and plots a straight red line through
the dots. Since nearly all the dots lie on the line, the data can be assumed to be normal.
Since the normality assumption is met, we can now apply the test to the data. The one-sample
t-test measures the probability associated with the difference between the sample mean and the
standard value, and it may be carried out as a one-tailed or two-tailed test of significance.
Since the p-value is 0.001, which is less than 0.05, we reject the null hypothesis. This means
that the measured values do vary significantly from the value 3. Based on the results, the
company can now design strategies and take action accordingly.

Applying t-test on existing dataset: For applying t-test on one sample, we will consider “disp”
variable which corresponds to displacement in mtcars dataset.
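
A sketch of these steps, assuming the dataset has been saved locally as "mtcars.csv" (the file name and path are assumptions), might look like this:

import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

mtcars = pd.read_csv("mtcars.csv")
disp = mtcars["disp"]

# Normality checks for displacement
print(stats.skew(disp), stats.kurtosis(disp))
print(stats.normaltest(disp))
stats.probplot(disp, fit=True, plot=plt)
plt.show()

# Compare the displacement values against the constant 160
print(stats.ttest_1samp(disp, 160))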

Explanation
The result displays that the coefficient of skewness is 0.4 and that of kurtosis is –1.0, which
shows that the data is nearly normal. The insignificant p-value (0.082) for the normality test
function also indicates that the data is normal. The probability plot shows that almost all the dots
lie on the line, so the data can be assumed to be normal. Since the normality
assumption is fulfilled, we can now apply the one-sample t-test to the data. The significant
p-value (0.002) shows that we reject the null hypothesis. This
means that the displacement values vary significantly from the value 160.

USE CASE
GREEN BUILDING CERTIFICATION

Green building means a resource-efficient structure and an environmentally responsible process


that provides an integrated approach, which involves planning, designing, construction,
maintenance, renovation, and demolition of buildings. The main concerns of green buildings are
economy, utility, durability, and comfort and reducing the effect on natural environment by
effective utilization of resources through minimal generation of non-degradable waste,
protecting occupant health through reducing pollution, and improving employee productivity.
Different factors contribute in making a green building; some of them are use of energy-efficient
LEDs and CFLs, air-based flushing system in toilets that minimizes water use, sensors in
efficient cooling systems that automatically adjust the room temperature based on heat
generated from human body, latest technological lighting system that automatically switches off
in absence of anyone inside the building.
An important development in the growth of green building movement is brought by tools such
as Green Star in Australia and the Green Building Index (GBI) predominantly used in Malaysia.
However, in India, the launch of the three primary rating systems – GRIHA, IGBC, and BEE –
has changed the scenario. These predefined rating systems decide whether green buildings are
really green, which in turn brings together a host of sustainable practices and solutions to
reduce the environmental impacts. Certification is applicable to new and existing buildings,
homes, schools, factory, townships, SEZ, etc.
To achieve the rating, all supporting documents at preliminary and final stages of
submission, related to the compulsory requirements and the credits attempted should be
submitted. This also includes review done by third-party assessors and submission of
clarifications to preliminary review queries. Credits are awarded after the submission of final
documents showing implementation of design features. However, modifications in any expected
credits aspect after preliminary review can be documented again and resubmitted for the final
review.
One sample t-test is used when we want to compare a set of observations with a predefined
standard. The sensor system in green buildings will compare the observation with a predefined
standard value and will do the necessary amendments if deviations are found. Besides, credits
earned are also dependent on the level and nature of deviations from the standard value of
mandatory requirements.

9.2.2.2 Independent Sample t-Test


The t-test helps in comparing the population means of only two groups, and thus helps in
examining whether the population means of two samples greatly differ from one another. This
test is applied using stats.ttest_ind().

Syntax
stats.ttest_ind (sample_a, sample_b, equal_var=)
where

• sample_a and sample_b are two independent samples.


• equal_var is a Boolean indicating whether the two groups are assumed to have equal variances.
The default value is True.
Applying t-test on user’s data: We will create our own dataset for perception related to online
shopping for young and old groups and apply independent sample t-test on both the groups after
checking the assumptions.
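
The book's program is not shown here; a minimal sketch with hypothetical perception scores for the two groups (the values are illustrative) could look like this:

from scipy import stats

senior = [3.2, 3.5, 3.8, 4.0, 4.1, 4.3, 4.4, 4.6, 4.7, 4.8]
junior = [2.1, 2.4, 2.6, 2.8, 2.9, 3.0, 3.1, 3.3, 3.4, 3.6]

# Check homogeneity of variances before the test
print(stats.levene(senior, junior))

# Independent sample t-test (equal variances assumed by default)
print(stats.ttest_ind(senior, junior, equal_var=True))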

Explanation
The skewness and kurtosis of the senior and junior dataset are between –1 and +1. Two graphs
are drawn using the subplot() function in the same figure. The call plt.subplot(121)
draws the graph in the first panel (corresponding to senior) and plt.subplot(122) draws
the graph in the second panel (corresponding to junior). The graphs show that all the points
lie nearly on the line. Thus, the assumption of normality is fulfilled. We assume that the data
taken in this example are independent of each other. Thus, the second assumption
of independent samples is fulfilled. We use Bartlett’s or Levene’s test to check the homogeneity
of variances. If the p-value of test is greater than 0.05, the assumption of homogeneity of
variance is met. From the output, we can see that the p-value is more than the significance level
of 0.05. This means that the variance across groups is statistically insignificant. Therefore, we
can assume the homogeneity of variances in the different groups. However, it should be noted
that if the homogeneity assumption is not met, we would have considered equal variances as
False (equal_var=False) in the syntax of t-test.
A null hypothesis is rejected if the p-value is less than 0.05 at 5% level of significance.
Since, in our example, the p-value (0.000) is less than 0.05, hence the null hypothesis is
rejected. The alternate hypothesis that the true difference in means is not equal to 0 is not
rejected. This means that the true difference in means of senior and junior groups is not equal to
0 which further means that there is a significant difference in the perception of senior and junior
groups.

Applying independent sample t-test using existing dataset: In the example, we had directly
used two samples created by the user. This section will consider example from an existing
dataset named mtcars considered for one sample t-test.

Explanation
The “vs” variable which denotes the engine shape has only two values 0 and 1 corresponding to
V-shape and straight. Since this variable has only two groups, we can use the t-test
considering “vs” as the independent variable. Miles per gallon, denoted by “mpg”, is a continuous
variable and is taken as the dependent variable in this example. The skewness and kurtosis of
mpg are 0.640 and –0.20, respectively, which are within the range –1 to +1; hence the data can
be considered normal. The normality test function and the probability plot drawn further confirm
the normality of the data. Two datasets are then created based on the engine shape (V-shape and
straight). The homogeneity of variance assumption is checked using the Levene and Bartlett
tests. Since the p-values of both tests are greater than 0.05, the assumption of
homogeneity of variance is met. The independent sample t-test therefore uses the argument
equal_var with its default value True. The results show that the p-value is 0.000,
which is less than the significance level of 0.05. Hence, we reject the null hypothesis and
conclude that there is a significant difference between V-shape and straight shape with respect
to miles per gallon.

In some cases homogeneity of variance assumption is not fulfilled. In these cases, the value of
argument equal_var is considered to be False in the function ttest_ind(). This is explained in
the following program where we need to determine whether there is a significant difference
between automatic and manual transmission with respect to mpg (miles per gallon). Here, mpg is
a continuous variable and hence considered as dependent variable.
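
A sketch of this comparison, again assuming mtcars is available locally as "mtcars.csv" (the path is an assumption), might look as follows; the two group means (about 17.15 and 24.39) match those discussed below:

import pandas as pd
from scipy import stats

mtcars = pd.read_csv("mtcars.csv")
auto_mpg = mtcars[mtcars["am"] == 0]["mpg"]     # automatic transmission
manual_mpg = mtcars[mtcars["am"] == 1]["mpg"]   # manual transmission

print(auto_mpg.mean(), manual_mpg.mean())
print(stats.bartlett(auto_mpg, manual_mpg))
print(stats.levene(auto_mpg, manual_mpg))

# Homogeneity of variances is doubtful, so equal_var is set to False (Welch's t-test)
print(stats.ttest_ind(auto_mpg, manual_mpg, equal_var=False))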

Explanation
The result shows that the mean mpg for the automatic transmission group is 17.15 and that for the
manual transmission group is 24.39. The Bartlett test suggests that the homogeneity of variances
assumption is only marginally met because the p-value is slightly greater than 0.05. However, we
recommend Levene’s test, which is less sensitive to departures from the normal distribution. The
p-value using Levene’s test is 0.04, which is below 0.05, so the assumption cannot be considered fully met.
For a better approach, we will consider equal_var=False in the t-test since the assumption
of homogeneity is not fulfilled. The result shows that p-value is 0.0013, which is less than 0.05;
hence the null hypothesis is rejected. This means that there is a significant difference between
automatic and manual transmission with respect to miles per gallon (mpg).

USE CASE
COMPARISON OF PERSONAL WEBSTORE AND MARKETPLACES FOR ONLINE SELLING

E-commerce has become an important part of our society, and a large number of firms have
started businesses over the Internet as it is the only medium that is able to cross geographic
boundaries and connect with customers on a real-time basis. It also allows the consumers to stay
with the updated company information. In an e-commerce platform, customer transactions are
processed in two fundamentally different ways – either through the personal webstore or through
marketplace – and then delivered and fulfilled by the participating retailers or wholesalers.
Personal webstores are websites that the sellers build according to their specifications,
layout, color schemes, and organization, including e-commerce (shopping cart) functionality. An
online marketplace is basically an e-commerce site where marketplace owner is responsible for
attracting customer and keeping track of money transactions, while information related to
products or services is given by participating retailers and wholesalers who add products which
are included into site inventory, along with the manufacturing and shipping. Online marketplace
offers good opportunities by creating online shops easily and quickly and the vendors create a
listing by writing description, adding photos, shipping options, etc. These sites allow sellers to
register and sell items for a post-selling fee and generally a nominal or no product listing fee.
Once a consumer decides to make a purchase, a vendor is liable for filling the order.
Both the ways of selling products online differ a lot in their approach. Online marketplaces
want high-quality vendors to draw in more customers and they offer a whole new world of
opportunities and standards to wholesalers, retailers, and innovative enthusiasts. But they
charge a commission fee and other fees on the products sold by the seller which actually
decrease a lot of profit from the sellers’ perspective. Besides, the inventory cost is also high in
case of selling through marketplace. Thus, vendors try to save the commission and fees, which
they pay to the marketplaces. They can start their own webstore because the process of
website design and development has undergone a drastic change. On the other hand,
professionally designed and developed software is expensive and time-consuming. Hence, it may
not be the first option to sell through one’s own webstore. However, some websites provide seller
support by divisibility of risk. For instance, crowd funding platforms allow businesses to
aggregate small investments over a large market in order to generate the capital needed,
expanding the funding options for small, start-up businesses.
There are many advantages which might be common to both forms of online selling such as
speed of access, overcoming geographic constraints, greater range of available potential
products/services, provision of infrastructure and international reach, fast delivery, better
quality and security, ease of shopping, accurate product information, easy payment process,
feedback about the products, easy product comparison, time saving, 24/7 shopping, sharing
contents of the products, transparent and competitive pricing, avoiding salesman pressure,
instant purchase, etc.
For some characteristics, there might be a difference in the perception of sellers like service
quality, information systems quality, product portfolio, website quality including organization
layout, aesthetically appealing, signage, download time, provision of adequate information,
support online payment, usability, convenience, content and layout, interactivity, user-friendly
design, extensive range of payment options, convenient filters, various shipping options
(free/flat-fee/calculated), comprehensive and easy-to-use navigation system, customization,
quality of information availability, product selection, and appropriate personalization.
Marketplaces involve multiple products, brands, and vendors. Hence, a marketplace is
expected to distribute information successfully about who should be matched and at what prices
with efficient filters and speedy process. On the other hand, these things might be easy to handle
in case of a webstore since there is less range of products.

Though all the websites share common characteristics, some e-commerce websites have
special search ranking algorithms, reputation systems, new matching mechanisms, and tools for
launching customer marketing campaigns, comprehensive review system with star rating,
offering recommendations to consumers based on their shopping and browsing history, various
customer loyalty programs, personalized offers and email notifications based on buyers’
browsing/shopping history, Google Analytics and Facebook integration, highly-rated customer
service, customizable integrated chat functions, comprehensive tutorials and tips, other
promotional and marketing tools which act as the source of digital marketing.
Businesses may want to facilitate sales transactions by allowing to target discounts at
specific customers and may want an absolute control over pricing and listing features,
evaluation of customer feedback, online reviews, customer interaction, bidding, auction, etc.,
thereby effectively handling all functions of management – planning, organizing, staffing,
coordinating, and controlling. A marketplace offers more items for the consumers, hence sellers
enjoy more web traffic; yet they are disadvantaged in terms of ease of search, conversion rate,
ineffective control, and struggle to convert shoppers into buyers. In addition, webstore sellers
enjoy less web traffic but have a better control on logistics, sales, marketing, customer reviews,
and customer support, thereby proving effective in conversion rates. Besides, people recognize
the marketplaces easily whereas it is usually a long, hard road to build up brand recognition for
a new webstore.
Sellers are really facing a dilemma to sell via a personal webstore or a marketplace. A study
can be undertaken to investigate key differences between webstore and marketplace in terms of
the above discussed factors using independent sample t-test. This study will be beneficial to
sellers who want to sell online and want a comparison of different online selling alternatives.

9.2.2.3 Dependent t-Test


Paired t-test is used for dependent samples. If we collect two measurements on each item,
person, or experimental unit, then each pair of observations is closely related or matched. In this
type of scenario, we apply a paired t-test. The paired t-test is applied using ttest_rel() from
stats sub-package.

Syntax
stats.ttest_rel(sample_a, sample_b)
where sample_a and sample_b are two dependent samples.
Applying dependent t-test on user’s data: In the following example, we need to determine the
effect of a treatment on patients. This leads to a set of paired observations (scores before and
after treatment) for each patient.
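
The book's program is not reproduced here; a minimal sketch with hypothetical pre- and post-treatment scores for the same patients (the values are illustrative) could look like this:

from scipy import stats

pretreat  = [72, 75, 68, 80, 77, 74, 70, 78, 73, 76]
posttreat = [74, 73, 70, 79, 78, 72, 71, 80, 72, 77]

# Paired (dependent) t-test
print(stats.ttest_rel(pretreat, posttreat))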

Explanation
This example tries to understand whether any significant difference occurred after the treatment
was given. The variables “pretreat” and “posttreat” represent readings before and after the
treatment. Since the p-value is 0.5535 which is greater than 0.05, we failed to reject the null
hypothesis, which means that there is no significant difference in the values after the treatment
was given. This implies that the treatment did not have a real impact on the patients and hence
cannot be considered useful for them.

USE CASE
EFFECT OF TRAINING PROGRAM ON EMPLOYEE PERFORMANCE

Training is a program that helps employees gather specific knowledge or skill to improve their
performance for the betterment of the organization. The competitive battle for top talent has
outlaid the importance of training for an organization to support the strategic path. Training is
important in organizations because they perform better by retaining the right people and
increasing profits if they adapt well to changing environments.
Training should be provided through open-minded approach and the organization should
invest in developing employees’ skills to help them reach their potential, creating awareness for
development opportunities, emphasizing on employee motivation, and encouraging critical
thinking. Before starting the training program, needs assessment should be conducted through
research, interviews, and internal surveys to identify who needs to be trained and on what skills
or topics. In a training program, the work experiences and knowledge of an employee is used as
a resource. New information is given as an input to the employee’s past learning and work
experience and they are given an opportunity to implement their learning by practice.
The relationship between learning and an organization’s success is convincing. Training is
required when there is a gap between current performance and required performance. But
high-impact training programs are the result of a careful planning and alignment process.
Hence, it is important to design and develop training to meet the company’s overall goals and
measure its impact, since it involves more time and money. Hence, measurable learning
objectives are the basis to evaluate the impact of training. Score of employees related to these
measurable objectives can be determined before and after the training program. Dependent t-
test can be used for determining the effect of training on the employees.

9.2.2.4 One-Way ANOVA

ANOVA (Analysis of Variance) is a statistical method commonly used in all those situations
where a comparison is to be made between more than two population means, for example,
perception of online shopping from people belonging to different age groups. In ANOVA, the
total amount of variation in a dataset is split into two types, that is, the amount allocated to
chance and the amount assigned to particular causes. Its basic principle is to test the variances
among population means by assessing the amount of variation within group items, proportionate
to the amount of variation between groups. With the use of this technique, we test null
hypothesis (H0) wherein all population means are the same, or alternative hypothesis (H1)
wherein at least one population mean is different.

Syntax
stats.f_oneway(data_group1, data_group2, data_group3…., data_groupN)
where data_group1….. data_groupN are the different groups formed on basis of different
categorical values in independent variable.
For understanding the utility of ANOVA, we will apply ANOVA in two situations: considering
our own dataset and considering existing dataset.
Applying ANOVA test on user’s data: In the following example, the user creates own data for
applying ANOVA. The perception of people belonging to different income groups is recorded.
Since there are more than two groups, the ANOVA test is used.
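
A minimal sketch with hypothetical perception scores for three income groups (the values and group sizes are illustrative):

from scipy import stats

low    = [3.1, 3.4, 3.6, 3.8, 4.0, 3.5, 3.2, 3.9]
middle = [3.0, 3.3, 3.7, 3.9, 4.1, 3.6, 3.4, 3.8]
high   = [3.2, 3.5, 3.8, 4.0, 4.2, 3.7, 3.3, 4.1]

# One-way ANOVA across the three groups
print(stats.f_oneway(low, middle, high))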

Explanation

Three different groups are created for showing the perception of people belonging to different
income groups. The sample size of Low, Middle, and High is 16, 15, and 16, respectively. The skewness
and kurtosis of three income groups are found to be between –1 and +1, hence the data is
assumed to be normal. The result of ANOVA test shows that the p-value is 0.28, hence we
failed to reject the null hypothesis. This means that there is no significant difference between
the perceptions of people belonging to different income groups.

Applying ANOVA test on existing dataset: In the following example, we will consider number
of cylinders as the independent variable for horse power (hp).
A significant p-value from the ANOVA test indicates that some of the group means are different,
but it does not tell us which pairs of groups differ. It is possible to perform multiple pairwise
comparisons to determine whether the mean difference between specific pairs of groups is
statistically significant. We can apply Tukey HSD (Tukey Honest Significant Differences) for
performing multiple pairwise comparisons between group means using pairwise_tukeyhsd()
from the statsmodels.stats.multicomp package.

Syntax
pairwise_tukeyhsd(endog=, groups=, alpha=)
where

• endog has the continuous dependent variable.


• groups has the categorical independent variable.
• alpha represent desired level of significance.

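A sketch of the ANOVA and Tukey HSD steps, assuming mtcars has been saved locally as "mtcars.csv" (the file name is an assumption), might look as follows:

import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

mtcars = pd.read_csv("mtcars.csv")

# Filter horse power according to the number of cylinders
Four_hp  = mtcars[mtcars["cyl"] == 4]["hp"]
Six_hp   = mtcars[mtcars["cyl"] == 6]["hp"]
Eight_hp = mtcars[mtcars["cyl"] == 8]["hp"]

print(stats.f_oneway(Four_hp, Six_hp, Eight_hp))

# Pairwise comparisons between the cylinder groups
tukey = pairwise_tukeyhsd(endog=mtcars["hp"], groups=mtcars["cyl"], alpha=0.05)
print(tukey)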
Explanation
It can be observed that for ANOVA test, the data is filtered according to the categorical variable
from the main dataset and stored in different groups. Example: According to the number of
cylinders, different groups, namely, Four_hp, Six_hp, and Eight_hp are formed for horse
power. The one-way ANOVA result shows that the p-value is 0.0000, hence we reject the null
hypothesis. This means that there is a significant difference between the three groups.
Tukey test is used to determine individual differences between the different groups formed.
It can be observed that for the Tukey test the different groups are formed automatically once the
continuous and categorical variables are supplied; the function itself forms groups
depending on the values of the categorical variable. For example, the dataset has only 4, 6, and
8 number of cylinders. The Tukey test shows that since there are 3 groups for representing
number of cylinders, hence three pairs are formed: (4 and 6), (4 and 8), and (6 and 8). The mean
difference for the three groups is 39.64, 126.57, and 86.92, respectively. The result shows that
null hypothesis is rejected for the last two groups, but is not rejected for the first group. This
means that there is no significant difference for horsepower between the cars having 4 and 6
cylinders.

USE CASE
EFFECT OF DEMOGRAPHICS ON ONLINE MOBILE SHOPPING APPS

The global nature of Internet is increasing usage of online buying of products and services. It is
the process whereby consumers directly buy goods or services from a seller anytime and from
anywhere in the world. Online buying has helped a lot in the globalization of businesses
throughout the world starting from buying clothes and commodities, ticket booking, food
ordering, online grocery stores, vehicle booking, etc. Internet has provided a unique opportunity
for companies to more efficiently reach existing and potential customers and help them to
convey, communicate, and disseminate information to sell the product, to take feedback, and to
conduct satisfaction surveys. Many companies have started using the Internet with the aim of
cutting marketing costs, thereby reducing the price of their products and services in order to stay
ahead in highly competitive markets. The dimensions of Internet usage have changed with people
using mobile devices to access the network and the usage of smartphones, tablets, and other
mobile devices has increased the potential of mobile market drastically. Thus, people all over the
world have started purchasing products/services with more convenience. There is an increase in
online shopping because of this communication medium and smartphones have specifically
brought digital convergence to a great height. The characteristics of mobile applications include
viscous and loyal media, efficient and convenient interactivity, smaller restrictions to the
production scale, and it is the sole medium whose effectiveness could be inspected.
Although there are many advantages of buying online through mobile apps, there are also
many challenges that must be overcome if online purchase is to reach its full potential. Many
consumers are still reluctant to buy products and services online because of the insecurity
related to credit or debit cards, password, information hacking, less time to devote, lack of
awareness, quality of services, lack of physical product, privacy invasion, lack of knowledge of
shopping channels, unwillingness to pay and wait for delivery, website reliability, lack of
satisfaction with products, lack of ability to use online shopping, desire for recreational
shopping experiences, absence of physical store exposure, and Internet fraud.

There might be a difference in perception of consumers belonging to different demographics
for specific Internet application. It has also been found that younger consumers searched for
more products and services online and they were more likely to agree that online shopping was
more convenient. Also, it has been observed that women tend to be more sensitive to online
information related to purchase attitudes and intentions than men when judgments are made for
subsequent purchases. Frequency and quality of services used by the people in big cities will be
more than those used by people staying in small cities.
Gender, marital status, residential location, age, education, and income group can be
considered as important predictors of online purchase. A study can be undertaken to compare
the perception of consumers belonging to different demographics to increase our ability to
provide more targeted, relevant, and desirable user experience. ANOVA test can be used for
determining significant difference between the samples.

Considering the titanic dataset discussed in chapter 8, perform all the


parametric functions discussed in this section on different variables.

9.2.3 Non-Parametric Techniques for Comparing Means


If the data is not normal, then non-parametric techniques are generally used for comparing the
means. Like parametric techniques, the choice of non-parametric techniques is also dependent on
nature of independent variable. However, non-parametric techniques are considered to be less
powerful than parametric techniques when the parametric assumptions hold. The following section discusses different non-
parametric tests in different situations.

9.2.3.1 Kolmogorov–Smirnov Test for One Sample


In statistics, the Kolmogorov–Smirnov test (K–S test or KS test) is a non-parametric test of the
equality of continuous, one-dimensional probability distributions that can be used to compare a
sample with a reference probability distribution (one-sample K–S test).

Syntax
scipy.stats.kstest(a, dist)
where

• a is the sample data (an array of observations).
• dist is the name of the reference distribution to compare against, for example 'norm'.

For understanding the utility of K–S test on one sample, we will apply test in two situations:
considering our own dataset and considering existing dataset.
Applying K–S test on user's data: We consider our own data and compare its distribution with a
reference (normal) distribution. The kstest() function is used to apply the one-sample K–S test
to such non-parametric data.
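
A minimal sketch on hypothetical observations, comparing them with the normal distribution (the values are illustrative):

import numpy as np
from scipy import stats

data = np.array([2, 5, 7, 9, 11, 14, 18, 21, 25, 30])   # hypothetical observations

# One-sample K-S test against the (standard) normal distribution
print(stats.kstest(data, 'norm'))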

Explanation
This result shows that the p-value is 0.0, which means that we reject the null hypothesis. This
implies that the distributions are not identical, that is, the dataset is not normally
distributed.

Applying K-S test on existing dataset: We consider miles per gallon (mpg) from mtcars to
demonstrate the utility of K–S test.

Explanation
This result shows that the p-value is 0.0, which means that we reject the null hypothesis. This
implies that the distributions are not identical, that is, the variable is not
normally distributed.

9.2.3.2 Kolmogorov–Smirnov Test for Two Samples


In statistics, the Kolmogorov–Smirnov test (K–S test or KS test) is a non-parametric test of the
equality of continuous, one-dimensional probability distributions that can be used to compare
two samples (two-sample K–S test).

Syntax
scipy.stats.kstest(a, b)
where a and b are the two samples (distributions) to be compared.
For understanding the utility of K–S test for two samples, we will apply test in two situations:
considering our own dataset and considering existing dataset.
Applying K–S test for two samples on user’s data: In the following example, we consider our
own data to compare two samples. Suppose we need to examine whether there is a significant
difference between the perception of young and old people.
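
A minimal sketch with hypothetical perception scores for young and old respondents; ks_2samp() is SciPy's dedicated two-sample form of the test (recent SciPy versions also accept two arrays directly in kstest()):

from scipy import stats

young = [4.1, 4.3, 4.5, 4.6, 4.8, 4.9, 5.0, 4.7]
old   = [2.1, 2.4, 2.6, 2.8, 3.0, 3.2, 3.3, 2.9]

print(stats.ks_2samp(young, old))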

Explanation
The result shows that the p-value is less than 0.05. This means that the null hypothesis is
rejected, that is, there is a significant difference between the two age groups.

Applying K–S test for two samples on existing dataset: We create two samples on the basis of
two values for am belonging to mtcars dataset for displacement (disp) variable. We need to
determine whether the value of am (automatic, am = 0), manual (am = 1) affects the disp variable
or not.

Explanation
The result shows that p-value is less than 0.05. This means that the null hypothesis is rejected,
which means that there is significant difference between the two samples created for am = 0
(automatic transmission) and am = 1 (manual transmission).

9.2.3.3 Mann–Whitney Test for Independent Samples


Two data samples are independent if they come from distinct populations and the samples do not
affect each other. Using the Mann–Whitney test, we can decide whether the population
distributions are identical without assuming them to follow normal distribution.

Syntax
mannwhitneyu(a,b)
where a and b are two independent data samples.
For understanding the utility of Mann–Whitney test for independent samples, we will apply test
in two situations: considering our own dataset and considering existing dataset.

Applying test on user’s data: In the following example, we apply Mann–Whitney test on the
dataset for two samples corresponding to perception of two groups – senior and junior.
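
A minimal sketch with hypothetical perception scores for the senior and junior groups:

from scipy.stats import mannwhitneyu

senior = [4.0, 4.2, 4.3, 4.5, 4.6, 4.8, 4.9, 5.0]
junior = [2.0, 2.2, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3]

print(mannwhitneyu(senior, junior))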

Explanation
To test the hypothesis, we apply the mannwhitneyu() function to compare the independent
samples. As the p-value turns out to be 0.000 and is less than 0.05, we reject the null
hypothesis. Thus, there is a significant difference between the two samples and we conclude
that the senior and junior groups had different perceptions.

Applying test on existing dataset: Here we will consider “drat” indicating rear axle ratio of
mtcars along with another data column named “am”, indicating the transmission type of the
automobile model (0 = automatic, 1 = manual). Without assuming the data to have normal
distribution, we want to determine whether the “drat” of manual and automatic transmissions in
mtcars have identical data distribution or not.

Explanation
Two datasets, namely, auto_drat and manual_drat, are created on the basis of transmission type
(automatic/manual), respectively, considering drat as dependent variable. As the p-value turns
out to be 0.00 and is less than the 0.05 significance level, we reject the null hypothesis. Thus,
we conclude that “drat” of manual and automatic transmissions in mtcars are non-identical
populations.

9.2.3.4 Wilcoxon Test for Dependent Samples


The paired samples Wilcoxon test (also known as Wilcoxon signed-rank test) is a non-parametric
alternative to paired sample t-test (dependent t-test) which is used to compare dependent
samples. It is important to understand that the number of observations remains the same in both
dependent samples, since this test is applicable only to dependent (paired) observations.

Syntax
wilcoxon(a,b)
where a and b are two dependent data samples.
Applying test on user’s data: We want to check whether there was any difference after the
training was conducted for employees in the organization. The data is considered to be not
normal.
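
A minimal sketch with hypothetical scores of the same employees recorded before and after the training (the values are illustrative):

from scipy.stats import wilcoxon

before = [62, 68, 70, 75, 71, 66, 73, 69, 74, 72]
after  = [64, 67, 72, 74, 73, 65, 75, 70, 73, 74]

print(wilcoxon(before, after))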

Explanation
The p-value of the test is 0.526, which is more than the significance level alpha = 0.05. Hence,
we failed to reject the null hypothesis, which means that the training imparted did not have a
significant impact on the scores of employees.

9.2.3.5 Kruskal–Wallis Test


It is a non-parametric alternative to one-way ANOVA test, which is used when assumptions are
not met. The kruskalwallis() function is present in the mstats sub-package of stats
(scipy.stats.kruskal() provides the same test for regular arrays).

Syntax
kruskalwallis(GroupA,GroupB,….GroupN)
where GroupA, GroupB, … are different groups.
For understanding the utility of Kruskal–Wallis test, we will apply test in two situations:
considering our own dataset and considering existing dataset.
Applying test on user’s data: In the following example, we have considered data created by the
user. Suppose we want to determine whether there is any significant difference in perception of
people belonging to three different types of nations: developed, developing and under developed.

Explanation

The result shows that the p-value is 0.000, which is less than 0.05. Thus the null hypothesis is
rejected, which means that the three groups differ significantly. This means that the
perception of people from developed, developing, and under-developed nations is significantly different.

Applying test on existing dataset: In the following example, we consider the mtcars dataset.
Suppose we want to determine whether there is any significant difference in miles per gallon
(mpg) between cars having different numbers of gears.
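
A sketch of this test, assuming mtcars is available locally as "mtcars.csv" (the path is an assumption); scipy.stats.kruskal() is used here and is equivalent to the kruskalwallis() function mentioned above:

import pandas as pd
from scipy.stats import kruskal

mtcars = pd.read_csv("mtcars.csv")

Three_mpg = list(mtcars[mtcars["gear"] == 3]["mpg"])
Four_mpg  = list(mtcars[mtcars["gear"] == 4]["mpg"])
Five_mpg  = list(mtcars[mtcars["gear"] == 5]["mpg"])

print(kruskal(Three_mpg, Four_mpg, Five_mpg))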

Explanation
It can be observed that for Kruskal–Wallis test, the data is filtered according to the categorical
variable from the main dataset and stored in different groups. Example: According to the
number of gears, different groups, namely, Three_mpg, Four_mpg, and Five_mpg are formed
for miles per gallon (mpg). The data type is determined for Three_mpg using the command
type(Three_mpg). The result is series datatype. Since we cannot apply Kruskal–Wallis test on
series data type, hence we need to convert the series into a list which is done using list()
function. The Kruskal–Wallis result shows that the p-value is 0.0000, hence we reject the null
hypothesis. This means that there is a significant difference between the three groups, which
means that miles per gallon of cars having different gears is different.

Download a dataset of your choice from online sources and use all the non-
parametric techniques as discussed in this section on different variables.

9.3 The special Sub-Package
The special sub-package in SciPy library gives special functionality to calculate cubic root,
exponential, permutations and combinations, etc. as shown in the following example:
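
A minimal sketch of the special functions referred to above:

from scipy import special

print(special.cbrt([125, 27, 64]))   # cube roots: [5. 3. 4.]
print(special.exp10([3, 20]))        # 10**3 and 10**20
print(special.comb(6, 4))            # combinations of 6 items taken 4 at a time: 15.0
print(special.perm(6, 4))            # permutations of 6 items taken 4 at a time: 360.0
print(special.logsumexp(10))         # log-sum-exp of a single value is the value itself: 10.0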

Explanation
The cbrt() function determines the cube roots of the three numbers 125, 27, and 64. The exp10()
function raises 10 to the powers 3 and 20. The comb() and perm() functions display the number of
combinations and permutations of 6 items taken 4 at a time. The logsumexp() function calculates
the log-sum-exp value of 10.

9.4 The ndimage Sub-Package


The ndimage (n-dimensional image) is a sub-package of SciPy which is mostly used for
performing image processing. The most commonly used features of image processing include
flipping, rotation, cropping, filtering, blurring. Flipping means to change the direction of the
image. Rotation helps to rotate the image while cropping means to crop the image according to
user requirements. Filtering is a technique which is used to emphasize on certain features
including Smoothing, Sharpening, and Edge Enhancement or removing unwanted features.
Blurring is widely used to reduce the noise in the image. In this section, for image manipulation,
we have used the image of Panda which is available in misc package in SciPy.

Most of the functions discussed in this section for images are available in
skimage library also. In fact, skimage library provides more functionality for
image data and is highly used in real-time applications.

Explanation
The misc sub-package is imported from the SciPy library to load the sample image. The function
misc.face() returns the raccoon face image, which is stored in img_panda. The next two
matplotlib functions display the image.

9.4.1 Flip Effect


It is possible to flip the image by specifying the direction in which flipping of image will take
place. It is also possible to flip the image in different ways as shown in the following program:
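
A minimal sketch of the flipping operations, using the sample face image (in newer SciPy versions the image is available via scipy.datasets.face() instead):

import numpy as np
import matplotlib.pyplot as plt
from scipy import misc

img = misc.face()                                          # sample face image

plt.subplot(131); plt.imshow(np.flip(img, axis=(0, 1)))    # flip along both spatial axes
plt.subplot(132); plt.imshow(np.flipud(img))               # flip upside down (vertical)
plt.subplot(133); plt.imshow(np.fliplr(img))               # flip left to right (horizontal)
plt.show()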

Explanation
All the functions used for flipping the image come from the NumPy library, hence the library is imported in the program. The flip() function in the NumPy library flips the image along every axis. The flipud() function turns the image upside down, while fliplr() mirrors the image from left to right. Thus, the direction of the original image is reversed.
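A minimal sketch, assuming img_panda holds the image loaded in the previous sketch:

import numpy as np
import matplotlib.pyplot as plt

plt.imshow(np.flip(img_panda));   plt.show()   # flip along every axis (complete reversal)
plt.imshow(np.flipud(img_panda)); plt.show()   # upside down
plt.imshow(np.fliplr(img_panda)); plt.show()   # left-right mirror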

9.4.2 Rotate Image


The rotate() function inside the ndimage sub-package helps to rotate the image by specifying
the degrees in which rotation of image is required.

Explanation
The rotate() function is inside the ndimage sub-package of SciPy library which basically has
two arguments: the image and the required degree of rotation. This program rotates the image
by 50, 110, and 210 degrees and hence produces the result as shown in the output.
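A minimal sketch, assuming img_panda from the earlier sketch:

from scipy import ndimage
import matplotlib.pyplot as plt

for angle in (50, 110, 210):                      # degrees of rotation mentioned in the text
    plt.imshow(ndimage.rotate(img_panda, angle))
    plt.show()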

9.4.3 Blur Image


It is possible to add a blur effect to the image by using gaussian_filter() function from the
ndimage sub-package.

Explanation
The gaussian_filter() function in the ndimage sub-package adds a blur effect to the image. It accepts two arguments: the image and sigma. The value of sigma determines the intensity of the blur effect. This program blurs the image using sigma values of 2, 7, and 12; the highest blur effect is seen with the highest value of sigma.
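A minimal sketch, assuming img_panda from the earlier sketch:

from scipy import ndimage
import matplotlib.pyplot as plt

for sigma in (2, 7, 12):                          # larger sigma -> stronger blur
    plt.imshow(ndimage.gaussian_filter(img_panda, sigma=sigma))
    plt.show()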

9.4.4 Crop Image


The cropping of image is possible by specifying the dimensions of x- and y-axis of two diagonal
points of rectangle.

Explanation
The face image is loaded from misc.face() and stored in a variable named img_panda. The shape of the original image shows 768 rows and 1024 columns (with three colour channels). A figure is created containing multiple cropped versions of the image. The dimensions of the first cropped image are [240:540, 500:900], which means that the x-coordinates run from 240 to 540 while the y-coordinates run from 500 to 900. The second image has dimensions 140:540, 100:600; thus the region displayed is mainly the upper-left corner. The third image has dimensions 340:640, 600:1000, hence the bottom-right side of the image is mainly displayed.
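A minimal sketch of cropping by array slicing, assuming img_panda from the earlier sketch:

import matplotlib.pyplot as plt

plt.imshow(img_panda[240:540, 500:900]);  plt.show()   # central crop
plt.imshow(img_panda[140:540, 100:600]);  plt.show()   # upper-left region
plt.imshow(img_panda[340:640, 600:1000]); plt.show()   # bottom-right region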

9.4.5 Filters
The maximum_filter() function in ndimage sub-package helps to apply filters to the image as
shown in the following section.

Explanation
The maximum_filter() function in ndimage accepts two arguments: the input image and the size of the filter. The images clearly show how the effect becomes stronger as the filter size is increased.
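A minimal sketch, assuming img_panda from the earlier sketch; the filter sizes are illustrative.

from scipy import ndimage
import matplotlib.pyplot as plt

for size in (5, 15):                              # larger size -> stronger effect
    plt.imshow(ndimage.maximum_filter(img_panda, size=size))
    plt.show()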

9.4.6 Colours
The shift() function in the ndimage sub-package shifts the image array along each axis by the specified amount. When a single value is supplied for a colour image, the channel axis is shifted as well, which changes the colours of the displayed image. The value 0 leaves the original image unchanged.

Explanation
We can observe that the colour distortion becomes stronger as the shift value is increased in the positive or negative direction.
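A minimal sketch, assuming img_panda from the earlier sketch; the shift values are illustrative.

from scipy import ndimage
import matplotlib.pyplot as plt

for value in (0, 1, -1):                          # 0 leaves the image unchanged
    plt.imshow(ndimage.shift(img_panda, value))
    plt.show()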

9.4.7 Uniform Filters


The uniform_filter() function in the ndimage sub-package applies a uniform (mean) filter to the image. This function takes two arguments: the image and the size of the filter. Increasing the size increases the level of smoothing, as shown in the following program.

Explanation
We can observe that the level of filtering increases if the value of the argument in
uniform_filter() function is increased.
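A minimal sketch, assuming img_panda from the earlier sketch; the filter sizes are illustrative.

from scipy import ndimage
import matplotlib.pyplot as plt

for size in (5, 20):                              # larger size -> stronger smoothing
    plt.imshow(ndimage.uniform_filter(img_panda, size=size))
    plt.show()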

Perform all the discussed functions on an image of your choice and analyze the
results.

Summary
• SciPy has a list of different sub-packages like scipy.linalg (Linear Algebra Operation),
scipy.stats (Statistics and Random Numbers), scipy.special (Special Function), and
scipy.ndimage (Image Manipulation).
• Most of the statistical functions related to descriptive statistics and to comparing means using parametric and non-parametric techniques are found in the stats sub-package of the SciPy library.
• Descriptive statistics identify patterns in the data, but they do not allow for making
hypotheses about data.
• The Chi-Square test can be used to test, whether for a finite number of bins, the observed
frequencies differ significantly from the probabilities of the hypothesized distribution.
• The t-test and Analysis of Variance (abbreviated as ANOVA) are two parametric statistical
techniques used to test the hypothesis when the dependent variable is continuous and
independent variable is categorical in nature.
• There are three common assumptions that need to be fulfilled before applying t-test or
ANOVA. The data should be normally distributed, samples should be independent, and
homogeneity of variances should exist.
• The SciPy library provides different functions for determining normality, including the values of skewness and kurtosis, normaltest(), and the Shapiro test. The kurtosis() and skew() functions measure the kurtosis and skewness of the data, respectively; these two measures are combined in the normaltest() function, while shapiro() performs the Shapiro–Wilk test.
• To determine the homogeneity of variance, we use Bartlett’s test using
scipy.stats.bartlett() or Levene’s test using scipy.stats.levene().
• Independent sample t-test is described as the statistical test that compares the population
means of only two groups. It examines whether the population means of two samples greatly
differ from one another. The ANOVA test on the other hand compares the means of more
than two samples.
• Paired t-test is used for dependent samples. If we collect two measurements on each item,
person, or experimental unit, then each pair of observations is closely related or matched.
• Non-parametric tests are used for comparing the means of samples when the data is not
normal. These include Kolmogorov–Smirnov Test for One Sample, Kolmogorov– Smirnov
Test for Two Samples, Mann–Whitney Test for Independent Samples, Wilcoxon Test for
Dependent Samples, and Kruskal–Wallis Test for more than two samples.
• The special sub-package in SciPy library gives special functionality to calculate cubic root,
exponential, permutations and combinations, etc.

• The ndimage (n-dimensional image) is a sub-package of SciPy which is mostly used for
performing image processing. The most commonly used features of image processing include
flipping, rotation, cropping, filtering, blurring.

Multiple-Choice Questions

1. ____________ utility is not available in ndimage sub-package for images.


(a) Cropping
(b) Rotation
(c) Blurred effect
(d) Classification
2. If matrix1 = np.array([[9,6], [6,8]]), then the result of linalg.det(matrix1) is
(a) 24
(b) 36
(c) 48
(d) 34
3. ____________ test is not used for determining normality of data.
(a) Kruskal
(b) Shapiro
(c) Normaltest()
(d) Skewness and kurtosis
4. The non-parametric alternative for one-sample test is
(a) Kolmogorov–Smirnov
(b) Mann–Whitney
(c) Shapiro–Wilk
(d) Kruskal–Wallis
5. The non-parametric alternative for ANOVA is
(a) Kolmogorov–Smirnov
(b) Mann–Whitney
(c) Shapiro–Wilk
(d) Kruskal–Wallis
6. The describe() function in stats sub-package of SciPy library does not give information
related to
(a) Skewness
(b) Median
(c) Mean
(d) Mode
7. Compare means are found in ___________ sub-package.
(a) special
(b) stats

(c) linalg
(d) ndimage
8. The assumption which is not required to be fulfilled before applying independent sample t-
test
(a) Independent samples
(b) Normality
(c) Homogeneity of variance
(d) Fisher test
9. If testdata=[1,2,3,4,3,2], then the result of print(stats. rankdata(testdata)) will be
(a) [1. 2.5 4.5 6. 4.5 2.5]
(b) [1. 2. 3. 4. 5. 6.]
(c) [1. 2.5 5.5 6. 5.5 2.5]
(d) [1. 3.5 5.5 6. 5.5 3.5]
10. Multiple pair-wise comparison between the means of groups can be done using the
_________ test.
(a) Tukey
(b) Multi
(c) Pair
(d) Multiple

Review Questions

1. Discuss the different assumptions that need to be fulfilled before applying independent
sample t-test.
2. Discuss the different ways to determine normality of a sample.
3. Differentiate between parametric and non-parametric tests.
4. How is it possible to use a t-test when the homogeneity of variances assumption is not
fulfilled?
5. Explain the different non-parametric tests.
6. What is the utility of Tukey test?
7. How do we determine rank in an observation which has observations of similar value?
8. Explain the utility of any three functions from linalg package.
9. Discuss the important functions of special sub-package.
10. What are the different utilities provided by the ndimage sub-package for image processing?

CHAPTER
10

SQLAlchemy Library for SQL

Learning Objectives
After reading this chapter, you will be able to

• Understand the basic SQL (Structured Query Language) operations: query, insert, update
and delete.
• Attain competence in accessing the data.
• Apply the knowledge of ORDER BY and GROUP BY for data extraction.
• Implement data extraction from multiple tables through joining and sub-query.

Data are organized into rows, columns, and tables. A database is an organized collection of data that is managed so that relevant information can be found within a large amount of data. A database management system is software for interaction between users and databases. SQL is a query language for interacting with a database. There are various Python SQL libraries that help to perform SQL programming, such as sqlite3, pymssql, and SQLAlchemy. Each of these libraries has advantages and disadvantages and contains functions for generating SQL queries from Python. This chapter discusses SQLAlchemy, which is the most popular Python SQL library; it basically allows a Python program to interact with a database for performing query operations.

If SQLAlchemy is not installed in Python, then write “pip install SQLAlchemy” at the Anaconda prompt and press Enter, or select the SQLAlchemy library in PyCharm and click on Install to add this library to the Python environment.

10.1 Basic SQL


Every table is divided into smaller entities called fields. A field is a column in a table that is
designed to maintain specific information about every record in the table. A record is a
horizontal entity in a table. A column is a vertical entity in a table that contains all information
associated with a specific field in a table. This section discusses SQL operations on a data file derived from the dataset named state.x77, which is an inbuilt dataset in R software. The dataset has the following information for 50 states of the United States:
Population: population estimate as of July 1, 1975,
Income: per capita income (1974),
Illiteracy: illiteracy (1970, percent of population),

Life Exp: life expectancy in years (1969–1971),
Murder: murder and non-negligent manslaughter rate per 100,000 population (1976),
HS Grad: percent high school graduates (1970),
Frost: mean number of days with minimum temperature below freezing (1931–1960) in
capital or large city,
Area: land area in square miles.

Explanation
The state.csv file is read in the program by using the read_csv() function. The dimension of the file is shown by shape and the names of the columns are shown by columns. The output shows that the data file has 50 rows and nine columns. The create_engine() function helps to establish a connection with the SQL database, and the last command, state.to_sql, converts the data file named “state” into the table format required by SQL commands; the table is named statesql.
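A minimal sketch of this setup, assuming the state.x77 data have been saved locally as state.csv with a State column for the state name; an in-memory SQLite database is used here for illustration.

import pandas as pd
from sqlalchemy import create_engine

state = pd.read_csv("state.csv")                     # assumed local export of state.x77
print(state.shape)                                   # (50, 9)
print(state.columns)

engine = create_engine("sqlite://")                  # in-memory SQLite database
state.to_sql("statesql", con=engine, index=False)    # table used by the queries below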

10.1.1 SELECT Clause


The SELECT statement is used to select data from a database. It is possible to select all the fields using * or to select specific fields by writing their names after the clause. The LIMIT clause is used to display a limited number of records on the screen. The fetchone() and fetchall() functions are used to retrieve one record and all the records, respectively.

Since select statement can return multiple records, hence we require a “for”
loop to fetch all the records one by one. It is known that “for” loop is required
for executing same type of function multiple times.

Syntax
SELECT * FROM table_name
Where, * symbolizes all the fields from the table.

SELECT column1, column2, …


FROM table_name
Where, column1, column2, … are the specific field names of the table that needs to be displayed.

SELECT column_name(s)
FROM table_name
LIMIT number
Where, LIMIT helps to view limited number of records on the screen.

Explanation
The first command has * after the SELECT clause, hence all the fields are selected. As the function used is fetchone(), only the first record is displayed. In the next command, the same query returns all the records, since the function used is fetchall(). The third command uses the LIMIT clause with value 2, hence only two records are displayed. The fourth command specifies the fields State, Area, and Population after the SELECT clause with LIMIT 3; hence only these three fields of the first three records are displayed.
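A minimal sketch of these queries, assuming the engine and the statesql table from the setup sketch above; recent SQLAlchemy versions require queries to be wrapped in text() and executed on a connection, whereas the book's screenshots call execute() directly on the engine.

from sqlalchemy import text

with engine.connect() as conn:
    print(conn.execute(text("SELECT * FROM statesql")).fetchone())
    print(conn.execute(text("SELECT * FROM statesql")).fetchall())
    print(conn.execute(text("SELECT * FROM statesql LIMIT 2")).fetchall())
    print(conn.execute(text("SELECT State, Area, Population FROM statesql LIMIT 3")).fetchall())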

It is also possible to use case while displaying other specific information according to the value
in the dataset.

Explanation
In this example, State and Area are displayed along with a derived column whose value depends on the value of Area. The CASE expression displays “Big State” if the area is greater than 70,000, “Normal State” if it is between 30,000 and 70,000, and “Small State” otherwise. Since LIMIT is 8, only the first eight records are displayed. We can observe that “Alabama,” with an area of 50,708, is labelled “Normal State” because its area lies between 30,000 and 70,000. Similarly, the label is printed for the next seven records.
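A minimal sketch of such a CASE query (engine and statesql as in the setup sketch):

from sqlalchemy import text

query = text("""SELECT State, Area,
                       CASE WHEN Area > 70000 THEN 'Big State'
                            WHEN Area > 30000 THEN 'Normal State'
                            ELSE 'Small State' END AS Size
                FROM statesql LIMIT 8""")
with engine.connect() as conn:
    for row in conn.execute(query):
        print(row)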

10.1.2 WHERE Clause


The WHERE clause is used to filter the records according to specified condition(s). The conditions can be created using relational operators, logical operators, IN and NOT IN, LIKE, etc. The DISTINCT keyword helps to view records with unique values of the selected field(s).

Syntax
SELECT column1, column2, …
FROM table_name
WHERE condition

It should be noted that the WHERE clause is not only used in SELECT statement, it is also used in
UPDATE, DELETE statement, etc.

10.1.2.1 Relational Operators


These operators are used for building conditions that filter records according to user requirements. They include the <, <=, >, >=, = (==), and <> operators.

Explanation
The relational operators are used for data extraction in this program. The less-than condition displays three records where illiteracy is <0.6, but many more records are displayed when the condition becomes less than or equal to 0.6, as shown in the output of the second command. Only one record is shown for each of the conditions income > 6,000, population ≥ 20,000, and illiteracy equal to 1.0, as shown in the next three outputs. However, many records fulfil the condition where illiteracy is not equal to 0.5, as shown in the last command.
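A minimal sketch of WHERE conditions built with relational operators and the BETWEEN keyword discussed next (engine and statesql as in the setup sketch):

from sqlalchemy import text

with engine.connect() as conn:
    print(conn.execute(text("SELECT State, Illiteracy FROM statesql WHERE Illiteracy < 0.6")).fetchall())
    print(conn.execute(text("SELECT State, Income FROM statesql WHERE Income > 6000")).fetchall())
    print(conn.execute(text("SELECT State, Illiteracy FROM statesql WHERE Illiteracy <> 0.5")).fetchall())
    print(conn.execute(text("SELECT State, Area FROM statesql WHERE Area BETWEEN 30000 AND 40000")).fetchall())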

Between Keyword: The between keyword is also used to create a condition on the basis of
range for filtering necessary records.

Explanation
The between keyword is shown to define a range for checking the condition. Thus, the above
command filters those records where Area is between 30,000 and 40,000, hence five records are
shown in the output.

Distinct Keyword: This clause is used after SELECT clause to display records with distinct value
of the specified field.

Syntax
SELECT DISTINCT column1, column2, …
FROM table_name
WHERE condition

Explanation
We can observe from the example that three records had illiteracy equal to 0.5, but when
distinct clause was used after SELECT clause, only one record was shown because only distinct
values were displayed.

10.1.2.2 Logical Operators


There are basically three logical operators included in SQL: OR, AND and NOT. The AND and OR
operators are used to filter records based on more than one condition. The AND operator displays a
record if all the conditions separated by AND are TRUE. The OR operator displays a record if any of
the conditions separated by OR is TRUE. The NOT operator displays a record if the condition(s) is
NOT TRUE.

Syntax
SELECT column1, column2, …
FROM table_name
WHERE condition1 AND condition2 AND condition3 …

SELECT column1, column2, …


FROM table_name
WHERE condition1 OR condition2 OR condition3 …

SELECT column1, column2, …


FROM table_name
WHERE NOT condition

Explanation
The first command uses the condition Frost > 160 AND Income < 4,000, which means that records are displayed only if both conditions are satisfied; two records in the dataset satisfy both conditions. The second command uses the condition Population > 20,000 OR Income < 2,000, which means that a record is displayed if either condition is satisfied; only one record fulfilled either of the conditions. The third command uses a combination of AND and OR: Frost > 160 AND (Income < 4,000 OR Population < 700). This means that records are displayed only if Frost > 160 and Income < 4,000, or if Frost > 160 and Population < 700. The last command displays the records where illiteracy is not less than 2.6, because the NOT operator is applied to the condition illiteracy < 2.6.

10.1.2.3 IN and NOT IN Clauses


The IN clause is used when we want to check multiple values for a single field. For example, if
we want to check whether the customer belongs to Mumbai, Delhi, or Bangalore, we will check
with IN easily as city IN (“Mumbai,” “Bangalore,” “Delhi”) rather than writing city=“Bangalore”
OR city=“Mumbai” OR city=“Delhi.” On the other hand, the NOT IN clause will display the

records where city is other than the mentioned cities.

Syntax
SELECT column1, column2, …
FROM table_name
WHERE column_name (NOT) IN (values)

Explanation
The IN word helps to display those records where illiteracy is either 0.5, 0.8, or 0.9. The NOT IN
displays records where illiteracy is neither 0.5, 0.8, nor 0.9. We can also observe from the data
that the first record that used IN clause has the index number of 7 and the last record had the
index number of 40. Both these records are not shown when the same values are used with NOT
IN clause.

10.1.2.4 The LIKE Operator


The LIKE operator is used in a WHERE clause to search for a specified pattern in a column.
Wildcard characters are used with the SQL LIKE operator. There are two wildcards often used in conjunction with the LIKE operator: % and _ (underscore). The percent sign represents zero, one, or multiple characters, while the underscore represents a single character. Thus, “ba%” matches ball, batsman, bat, back, etc., and “_at” matches bat, hat, sat, etc. Both these characters can also be used in combination.

Syntax
SELECT column1, column2, …
FROM table_name
WHERE column_name LIKE pattern

Explanation
The condition “M%” helps to fetch the records of states whose name is starting with letter “M”
and is followed by any number of characters. Thus, all the states starting with “m” are
displayed. The condition “_aine” will display all the states where the first letter can be anything
while the other four letters will be “aine.” Thus, only the Maine state is displayed. The third
condition is “_a%” that means that the first letter can be anything, second letter is “a” and it can
have any number of characters after “a.” Thus, all those states where second letter is “a” are
displayed.
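A minimal sketch of these LIKE patterns (engine and statesql as in the setup sketch):

from sqlalchemy import text

with engine.connect() as conn:
    print(conn.execute(text("SELECT State FROM statesql WHERE State LIKE 'M%'")).fetchall())
    print(conn.execute(text("SELECT State FROM statesql WHERE State LIKE '_aine'")).fetchall())
    print(conn.execute(text("SELECT State FROM statesql WHERE State LIKE '_a%'")).fetchall())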

10.1.2.5 IS NULL
It is not possible to test for NULL values with comparison operators such as =, <, and <>. We have to use the IS NULL and IS NOT NULL operators for displaying records that have (or do not have) null values in the specified column.

Syntax
SELECT column_names
FROM table_name
WHERE column_name IS (NOT) NULL

Explanation
The IS NULL condition checks for the records where state is a blank column. Since in our data
there are no missing values, hence no record is displayed. The IS NOT NULL condition checks
for the records where state is not a blank column. Since in our data there are no missing values,
hence all records will be displayed. However, since the limit is set to 3, only three records are
displayed.

Considering titanic dataset discussed in Chapter 8, use select statement with


different combinations of conditions and fields as discussed in this section to
get results.

10.1.3 Insert Statement


The INSERT INTO statement is used to insert new records in a table. It is possible to insert the
record by two approaches. The first approach specifies column names along with values to be
inserted. The second approach allows us to insert all the columns of the table, hence it is not
required to specify the column names in the SQL query. However, it is important that the number
and order of the values is in the same order as the columns in the table.

The order and representation of values is very important in insert statement.


A value corresponding to field of numeric data type is written without quotes,
while a value represented as character is written inside quotes.

Syntax
INSERT INTO table_name (column1, column2, column3, …)
VALUES (value1, value2, value3, …)

INSERT INTO table_name


VALUES (value1, value2, value3, …)

Explanation
The first command is executed and does not return any records. It should be noted that the fetchall() function used with the select statement helps to fetch the returned records; since the insert statement simply makes changes to the table and does not return anything, there is no need to use fetchall() when inserting records. The second command inserts values into all the fields, hence there is no need to specify the column names. However, the order of the inserted values should correspond to the order of the fields in the table.
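A minimal sketch of both insert approaches (engine and statesql as in the setup sketch); the values are illustrative, and the nine values in the second statement follow the assumed column order of the table. engine.begin() is used so that the changes are committed.

from sqlalchemy import text

with engine.begin() as conn:
    # first approach: specify the columns being filled
    conn.execute(text("INSERT INTO statesql (State, Population, Income) VALUES ('ABC', 5000, 4500)"))
    # second approach: one value per column, in table order (assumed nine columns)
    conn.execute(text("INSERT INTO statesql VALUES ('XYZ', 100, 4000, 1.0, 70.0, 5.0, 50.0, 100, 50000)"))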

10.1.4 Update Statement


This statement is used to modify the existing records in a table based on the condition. The new
value is specified through the set clause, whereas the condition that needs to be fulfilled is
specified through the WHERE clause. Thus, all the records that are satisfying the condition will be
set to new value.

Syntax
UPDATE table_name
SET column1 = value1, column2 = value2, …
WHERE condition

Explanation
Similar to the INSERT command, the UPDATE command makes changes to the table and does not return anything. The command sets the population to 5,500 where the name of the state is “ABC.”

10.1.5 Delete Statement


The DELETE statement is used to delete existing records in a table. Only those records are deleted
that are satisfying the condition specified in the WHERE clause.

Absence of WHERE clause in Delete statement will delete all the records. Thus,
WHERE clause in DELETE statement is important to delete selected records. The
DELETE statement can also use different relational operators, logical operators
and other conditions as discussed in the WHERE clause with select
statement.

Syntax
DELETE FROM table_name WHERE condition

Explanation
Similar to the INSERT and UPDATE commands, the DELETE command only makes changes to the table. This command deletes the records from the table where the state is either “ABC” or “XYZ.”
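A minimal sketch of the UPDATE and DELETE commands described above (engine and statesql as in the setup sketch):

from sqlalchemy import text

with engine.begin() as conn:                      # begin() commits the changes
    conn.execute(text("UPDATE statesql SET Population = 5500 WHERE State = 'ABC'"))
    conn.execute(text("DELETE FROM statesql WHERE State = 'ABC' OR State = 'XYZ'"))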

Considering titanic dataset, insert, update and delete one record with entries of
your choice.

10.1.6 In-Built SQL Functions


The Python library SQLAlchemy supports all the inbuilt functions of SQL, including mathematical functions such as sum, count, avg, max, and min, and character functions such as lower, upper, substr, and length. This is demonstrated in the following section:

Explanation
All these commands except the last return only one value, hence we use the fetchone() function for retrieving the result. The comments written at the top of each command, together with the output produced, clearly explain the utility of the functions. The lower() function converts the state name to lower case, the upper() function converts it to upper case, the length() function determines the length of the state name, and the substr() function returns a substring of the state name starting at position 2 and extracting four characters.
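A minimal sketch of these inbuilt functions (engine and statesql as in the setup sketch):

from sqlalchemy import text

with engine.connect() as conn:
    print(conn.execute(text("SELECT COUNT(*), SUM(Population), AVG(Income), MAX(Area), MIN(Frost) FROM statesql")).fetchone())
    print(conn.execute(text("SELECT LOWER(State), UPPER(State), LENGTH(State), SUBSTR(State, 2, 4) FROM statesql")).fetchone())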

10.1.7 ORDER BY Clause


The ORDER BY keyword is used to sort the result-set in ascending or descending order. By
default, the ORDER BY keyword sorts the records in the ascending order. However, to sort the
records in the descending order, the DESC keyword is used after the name of the column(s).

The ORDER BY clause can be used on a variable having any data type, because sorting can be applied to numeric as well as non-numeric variables.

Syntax
SELECT column1, column2, …
FROM table_name
ORDER BY column1, column2, … ASC|DESC

Explanation
This command displays the State, Population, Frost, and Area for the records where Area > 95,000. The ORDER BY keyword arranges the states in ascending order of the state name; hence the first record has a state name starting with “A.” The records are displayed in descending order when the keyword DESC follows the name of the field on which the ordering is done; in that case the first record displays a state name starting with “W.”
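A minimal sketch of the ascending and descending queries (engine and statesql as in the setup sketch):

from sqlalchemy import text

base = "SELECT State, Population, Frost, Area FROM statesql WHERE Area > 95000 ORDER BY State"
with engine.connect() as conn:
    print(conn.execute(text(base)).fetchall())             # ascending (default)
    print(conn.execute(text(base + " DESC")).fetchall())   # descending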

10.1.8 GROUP BY Clause


The GROUP BY statement groups rows that have the same values into summary rows, like “find the number of customers in each country.” This statement is often used with aggregate functions (COUNT, MAX, MIN, SUM, and AVG) to group the result-set by one or more columns. The HAVING clause was added to SQL because the WHERE keyword cannot be used with aggregate functions.

GROUP BY should be done on the categorical variable. If the variable is
continuous in nature, it will result in no grouping of records. For example,
grouping on the basis of gender and city can show meaningful results, but
grouping on the basis of salary – which is a continuous variable – will show no
meaningful results. This is because a change in Rs. 1 of salary of two people
will also make it a different group. However, grouping can be effective if the
salary is represented in the form of some fixed categories.

Syntax
SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s)

Explanation
All these commands perform grouping on the basis of illiteracy, because there are several records with the same illiteracy value. The GROUP BY clause groups the states that have the same illiteracy. The first command displays the count of states for each value of illiteracy. Thus, it
can be observed that there are three states having illiteracy as 0.5, 10 states having illiteracy as
0.6, five states having illiteracy as 0.7, three states having illiteracy as 0.8, two states having
illiteracy as 2.2, etc. The second command calculates the sum of murder for grouped states.
Thus, the states have total number of murder as 15.5 for illiteracy as 0.5, 45.4 for illiteracy as
0.6, etc. The third command calculates the average of population for the grouped states. Thus,
the states have average population as 1,377.334 for illiteracy as 0.5 and 1,719.8 for illiteracy as
0.6. The fourth command calculates the maximum income from the grouped states. Thus, the
states have maximum income as 5,149 for illiteracy as 0.5, 4,864 for illiteracy as 0.6, etc. The
last command calculates the minimum area from the grouped states. Thus, the states have
minimum area as 55,941 for illiteracy as 0.5, 9,267 for illiteracy as 0.6, etc.
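A minimal sketch of such grouped aggregations (engine and statesql as in the setup sketch):

from sqlalchemy import text

with engine.connect() as conn:
    print(conn.execute(text("SELECT Illiteracy, COUNT(State) FROM statesql GROUP BY Illiteracy")).fetchall())
    print(conn.execute(text("SELECT Illiteracy, SUM(Murder) FROM statesql GROUP BY Illiteracy")).fetchall())
    print(conn.execute(text("SELECT Illiteracy, AVG(Population) FROM statesql GROUP BY Illiteracy")).fetchall())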

It is also possible to use GROUP BY and ORDER BY together as shown in the following example:

Explanation
The command groups the records on the basis of illiteracy, displays the number of states for each value of illiteracy, and then arranges the result in order of the number of states. The output shows that there is one state each with illiteracy 1.0, 1.5, 1.6, etc.; the maximum count is 10 states with illiteracy 0.6, followed by five states with illiteracy 1.1.

Considering titanic dataset, use ORDER BY according to Age and GROUP BY Sex
and Class to show the desired results.

10.1.9 Ranking Functions


Two basic functions are used for ranking in SQL: RANK() and DENSE_RANK(). These functions calculate the rank of each record on the basis of the column specified in the ORDER BY clause. It is also possible to use PARTITION BY for partitioning the data on the basis of a specified column. The difference between RANK() and DENSE_RANK() arises in the case of a tie. In RANK(), ties are assigned the same rank, with the next ranking(s) skipped; this means that if we have three items at rank 4, the next item would be ranked 7. In DENSE_RANK(), the next item would have rank 5; the ranks are consecutive and no ranks are skipped when several items share a rank.

Syntax
SELECT column_name(s), RANK() OVER (PARTITION BY column_name ORDER BY column_name)
FROM table_name

Explanation
We can observe that RANK() and DENSE_RANK() show different results when only the ORDER BY clause is used to order the records on the basis of Illiteracy. When RANK() is used, the state having the least illiteracy is assigned rank 1 and the state having the highest illiteracy is assigned rank 50, because the total number of records is 50. When the same value of illiteracy occurs for more than one record, each of those records is assigned the same rank. In our example, the illiteracy value 0.5 occurred for three records, so all of them were assigned rank 1 and the next record was assigned rank 4 after skipping two values. Similarly, since the value 0.6 occurred for 10 records, the next rank assigned after 4 was 14. But when DENSE_RANK() is used, the state having the highest illiteracy is assigned rank 20. Records with the same illiteracy are still assigned the same rank, as in RANK(), but unlike RANK(), no ranks are skipped in DENSE_RANK(). Hence, the fourth record is assigned rank 2 instead of 4, and the 14th record is assigned rank 3 instead of 14. However, when the PARTITION BY clause is used for illiteracy, the records are first partitioned on the basis of equal values of illiteracy, and within each value of illiteracy the ranking is done on the basis of population. Thus, we can observe that Nevada has rank 1, since its population is the least for illiteracy = 0.5, followed by South Dakota. The ranking of states starts again from 1 when illiteracy = 0.6; the state with the least population for illiteracy 0.6 is Wyoming and hence it is assigned rank 1, while the state having the maximum population among the 10 states with illiteracy 0.6 (Minnesota) is assigned rank 10. It can also be observed that illiteracy is shown in ascending order, so records with lower illiteracy are shown before records with higher illiteracy.
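A minimal sketch of these ranking queries (engine and statesql as in the setup sketch; window functions require SQLite 3.25 or later):

from sqlalchemy import text

with engine.connect() as conn:
    q1 = text("""SELECT State, Illiteracy,
                        RANK()       OVER (ORDER BY Illiteracy) AS rnk,
                        DENSE_RANK() OVER (ORDER BY Illiteracy) AS dense_rnk
                 FROM statesql LIMIT 15""")
    print(conn.execute(q1).fetchall())
    q2 = text("""SELECT State, Illiteracy, Population,
                        RANK() OVER (PARTITION BY Illiteracy ORDER BY Population) AS rnk
                 FROM statesql LIMIT 15""")
    print(conn.execute(q2).fetchall())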

Considering titanic dataset, use RANK function and analyze the results.

10.2 Advanced SQL for Multiple Tables


This section discusses advanced queries like joining and subquery for performing operations on
multiple tables at one single point of time. The section first creates four tables named customer,
product, sales_order and sales_order_details. The fields in the customer table consist of custid,
custname, and custcity. The custid is the primary key and hence stores all the distinct records of
the customers. The fields in the product table include prodid, prodname, and price. The table has
prodid as the primary key. The sales_order table has two fields named orderid and custid. The
orderid is the primary key and custid is the foreign key with reference to the custid of the
customer table. The sales_order_details have the information of orderid, prodid, and quantity.
The orderid and prodid act as a foreign key with reference to the sales_order and product table,
respectively.

Explanation
The pd.DataFrame() function is used to create a dataframe. The columns of a dataframe are displayed by the keys() function; thus, customer.keys() displays the columns of the customer dataframe: “custid,” “custname,” and “custcity.” The dimension of the customer dataframe, given by shape, is (4, 3), which means that there are four rows and three columns. The command customer.to_sql("custsql", con=engine) converts the customer dataframe to SQL format and stores it under the name custsql. Similarly, details are displayed for the other three tables and they are converted to the required SQL format: prodsql for the product table, ordersql for sales_order, and detailssql for sales_order_details.
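A minimal sketch of building the four tables; apart from the customer names Sonakshi and Jitesh and the product names mouse, book, and laptop mentioned in the text, all ids, names, prices, and quantities below are illustrative assumptions. The engine is assumed from the earlier setup sketch.

import pandas as pd

customer = pd.DataFrame({"custid": ["c1", "c2", "c3", "c4"],
                         "custname": ["Sonakshi", "Jitesh", "Ravi", "Meena"],
                         "custcity": ["Mumbai", "Delhi", "Bangalore", "Indore"]})
product = pd.DataFrame({"prodid": ["p1", "p2", "p3", "p4"],
                        "prodname": ["mouse", "book", "laptop", "pen"],
                        "price": [500, 300, 40000, 20]})
sales_order = pd.DataFrame({"orderid": ["o1", "o2", "o3", "o4"],
                            "custid": ["c1", "c2", "c3", "c1"]})
sales_order_details = pd.DataFrame({"orderid": ["o1", "o2", "o3", "o4"],
                                    "prodid": ["p1", "p2", "p3", "p4"],
                                    "quantity": [2, 1, 1, 5]})

print(customer.keys()); print(customer.shape)              # columns and (4, 3)
customer.to_sql("custsql", con=engine, index=False)
product.to_sql("prodsql", con=engine, index=False)
sales_order.to_sql("ordersql", con=engine, index=False)
sales_order_details.to_sql("detailssql", con=engine, index=False)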

10.2.1 Intersect and Union


The INTERSECT clause is used to display the records that exist in both the tables. It displays the
rows produced by the intersection of both the queries. The UNION clause is used to display all the
records from both the tables. It merges the output of two or more queries into a single set of rows
and columns.

Syntax
SELECT column_name(s)
FROM table1 INTERSECT/UNION
SELECT column_name(s) FROM table2

Explanation
The INTERSECT clause is used to display only those customers that exist in both tables, custsql and ordersql. The UNION clause is used to display all the products from both tables, prodsql and detailssql.
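A minimal sketch of these queries (engine and the four tables as in the previous sketch):

from sqlalchemy import text

with engine.connect() as conn:
    print(conn.execute(text("SELECT custid FROM custsql INTERSECT SELECT custid FROM ordersql")).fetchall())
    print(conn.execute(text("SELECT prodid FROM prodsql UNION SELECT prodid FROM detailssql")).fetchall())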

10.2.2 SubQuery
A subquery is a form of an SQL statement that appears inside SQL statement. It is also
considered as nested query. The statement containing a subquery is called a parent statement.
The rows returned by subquery are used by parent statement. This approach helps to filter the
records on the basis of conditions from multiple tables at one single point of time.

Explanation
The first query filters the records from two tables, namely ordersql and detailssql. The customer id is displayed from the ordersql table based on the orderid obtained from the detailssql table for products “p1” and “p2.” The second query filters the records from three tables, namely prodsql, ordersql, and detailssql, and displays the product name from prodsql on the basis of prodid from the detailssql table, which in turn is filtered on the basis of orderid from the ordersql table for customers “c1” and “c2.” The third query filters the records from four tables, namely custsql, prodsql, ordersql, and detailssql. The custname is displayed from the custsql table on the basis of custid, which is extracted from the ordersql table on the basis of orderid, which is in turn extracted from detailssql on the basis of prodid obtained from the prodsql table for the product named “mouse.” It can be observed that the result produced always comes from one single table.
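A minimal sketch of a two-level and a four-table nested subquery (engine and the four tables as in the earlier sketch):

from sqlalchemy import text

with engine.connect() as conn:
    q = text("""SELECT custid FROM ordersql
                WHERE orderid IN (SELECT orderid FROM detailssql WHERE prodid IN ('p1', 'p2'))""")
    print(conn.execute(q).fetchall())
    q = text("""SELECT custname FROM custsql
                WHERE custid IN (SELECT custid FROM ordersql
                                 WHERE orderid IN (SELECT orderid FROM detailssql
                                                   WHERE prodid IN (SELECT prodid FROM prodsql
                                                                    WHERE prodname = 'mouse')))""")
    print(conn.execute(q).fetchall())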

10.2.3 Joining
The biggest disadvantage of sub-query is that value can be displayed from one single table only.
It is not possible to display columns from multiple tables at one single point of time. This
drawback is removed using the concept of joining. In SQL, there are basically four types of
joins: (INNER) JOIN for displaying records that have matching values in both tables; LEFT
(OUTER) JOIN for displaying records from the left table, and the matched records from the right

table; RIGHT (OUTER) JOIN for displaying records from the right table, and the matched records
from the left table and FULL (OUTER) JOIN for displaying records when there is a match in either
left or right table.

However, the Python library supports only the LEFT JOIN and INNER JOIN.

Syntax
SELECT column_name(s)
FROM table1
INNER JOIN table2
ON table1.column_name = table2.column_name

SELECT column_name(s)
FROM table1
LEFT JOIN table2
ON table1.column_name = table2.column_name

Explanation
The LEFT JOIN displays all the records from the first (left) table along with the matching records from the second table, while the INNER JOIN displays only the records that are common to both tables.
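A minimal sketch of a LEFT JOIN and an INNER JOIN between the customer and order tables (engine and tables as in the earlier sketch):

from sqlalchemy import text

with engine.connect() as conn:
    q = text("""SELECT custsql.custname, ordersql.orderid
                FROM custsql LEFT JOIN ordersql ON custsql.custid = ordersql.custid""")
    print(conn.execute(q).fetchall())
    q = text("""SELECT custsql.custname, ordersql.orderid
                FROM custsql INNER JOIN ordersql ON custsql.custid = ordersql.custid""")
    print(conn.execute(q).fetchall())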

The INNER JOIN can also be written in another form by writing all the names of the tables

together. This is demonstrated in the following example:

Syntax
SELECT column_name(s)
FROM table1, table2
WHERE table1.column_name = table2.column_name and……..

Explanation
Unlike the subquery used in the last example, it can be observed that in joining the result produced can come from multiple tables. The joining is specified in the WHERE clause by writing the table name and column name separated by a dot. The first query filters the records from two tables, namely ordersql and detailssql; the quantity, customer id, and product id are displayed for products “p3” and “p4.” The second query filters the records from three tables, namely prodsql, ordersql, and detailssql, and displays the product name and customer id for products “p1” and “p2.” The third query filters the records from four tables, namely custsql, prodsql, ordersql, and detailssql; the customer name and product name are displayed for the customer “C3.”

Using GROUP BY during joining: We know that GROUP BY can be used on aggregate functions
such as sum, avg, count, etc. It is also possible to use GROUP BY for joining the tables as shown in
the following program:

Explanation
In this section, the sum() function is used for aggregating the grouped data. In the first example, grouping is done on the basis of customers and the total quantity purchased by each customer is computed. In the second example, grouping is done on the basis of product name and the total revenue generated from each product is computed by adding the total amount of each product sold; however, as the HAVING clause is specified for book and laptop, the records of only these two products are displayed. In the third example, where joining is performed on four tables, the total revenue generated from each customer is calculated, and the records of only Sonakshi and Jitesh are displayed because of the HAVING clause.
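A minimal sketch of a join combined with GROUP BY and HAVING, along the lines of the second example (engine and tables as in the earlier sketch):

from sqlalchemy import text

with engine.connect() as conn:
    q = text("""SELECT prodsql.prodname, SUM(detailssql.quantity * prodsql.price) AS revenue
                FROM prodsql, detailssql
                WHERE prodsql.prodid = detailssql.prodid
                GROUP BY prodsql.prodname
                HAVING prodsql.prodname IN ('book', 'laptop')""")
    print(conn.execute(q).fetchall())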

Create three tables named room_master which has fields named


room_no(Primary Key), type and charges; doctor_ master which has fields
doctor_id (Primary Key), name and expertise; patient_master which has fields
regno(Primary Key), room_no. (Foreign Key from room_master), doctor_id
(Foreign Key from doctor_master) name and city. Execute sub-queries and
joining function as discussed in this section considering the three tables
together.

Summary
• A database management system is a software for interaction between users and databases.
SQL is a query language for interaction with database. There are various Python SQL
libraries that help to perform programming for SQL such as SQLite, pymssql, and
SQLAlchemy.
• The SELECT statement is used to select data from a database. It is possible to select all the
fields using * and select specific fields by writing their names after the clause.
• The LIMIT clause is used to display limited number of records on the screen.
• The fetchone() and fetchall() functions are used to display one and all the records,
respectively.
• The WHERE clause is used to filter the records according to specified condition(s). The
conditions can be created using relational operators, logical operators, IN and NOT IN, like,
etc.
• The DISTINCT keyword helps to view all the fields, specific fields, and records with unique
value.
• All the relational operators such as <, ≤, >, ≥ ==, and <> are used to filter the records on the
basis of conditions.
• There are basically three logical operators included in SQL: OR, AND, and NOT. The AND and OR
operators are used to filter records based on more than one condition.
• The IN is used when we want to check multiple values for a single field.
• The LIKE operator is used in a WHERE clause to search for a specified pattern in a column. There are two wildcards often used in conjunction with the LIKE operator: % and _ (underscore).
• We have to use the IS NULL and IS NOT NULL operators for displaying records that have (or do not have) null values in the specified column.
• The INSERT INTO statement is used to insert new records in a table. It is possible to insert the
record by two approaches: by inserting all the fields and specified fields.
• Update statement is used to modify the existing records in a table based on the condition.
• The delete statement is used to delete existing records in a table based on condition(s).
• Python library SQLAlchemy supports all the inbuilt functions of SQL including
mathematical functions, such as sum, count, avg, max, and min, and character functions such
as lower, upper, substr, and length.
• The ORDER BY keyword is used to sort the result-set in ascending or descending order.
• The GROUP BY statement groups rows that have the same values into summary rows and is
often used with aggregate functions to group the result-set by one or more columns.
• Two basic functions are used for ranking in SQL: RANK() and DENSE_RANK(). These functions help to calculate the rank of a record on the basis of the column specified by ORDER BY.
• The INTERSECT clause is used to display the records that exist in both the tables. The UNION
clause is used to display all the records from both the tables.
• A subquery is a form of an SQL statement that appears inside SQL statement. It is also
considered as nested query. This approach helps to filter the records on the basis of
conditions from multiple tables at one single point of time.

• The biggest disadvantage of a sub-query is that values can be displayed from one single table only. It is not possible to display columns from multiple tables at one single point of time. This drawback is removed using the concept of joining.

Multiple-Choice Questions

1. The ______________clause is used to display limited number of records on the screen.


(a) SPECIFIC
(b) LIMIT
(c) NUMBER
(d) ONLY
2. The _____________clause is used to display the records that exist in both the tables.
(a) UNION
(b) INTERSECT
(c) MINUS
(d) QUERY
3. The _____________ is used to sort the result-set in ascending or descending order.
(a) ORDER BY
(b) GROUP BY
(c) SORT
(d) RESULT-SET
4. This wildcard is not used in conjunction with the LIKE operator:
(a) * (asterisk)
(b) % (percentage)
(c) _ (underscore)
(d) None of these
5. The _______________ helps to view fields with unique value.
(a) ONLY
(b) SPECIAL
(c) UNIQUE
(d) DISTINCT
6. The _________________ is used to display records that are having null values in
specified column.
(a) IS ZERO
(b) IS NULL
(c) IS EMPTY
(d) IS MISSING
7. The ______________ operator displays a record if the condition(s) is NOT TRUE.
(a) OR
(b) AND
(c) NOT
(d) None of these
8. The _________________ operator displays a record if all the conditions are TRUE.
(a) OR

(b) AND
(c) NOT
(d) None of these
9. The ___________ function is used to extract some letters from the string.
(a) Extract
(b) Sub
(c) String
(d) Substr
10. The ______________ function returns the length of the string.
(a) Length
(b) Len
(c) String
(d) Lenstr

Review Questions

1. Differentiate between fetchone() and fetchall() function with example.


2. Differentiate between RANK() and DENSE_RANK() function with example.
3. Differentiate between joining and subquery with example.
4. Differentiate between OR and IN operators with example.
5. Differentiate between LEFT JOIN and INNER JOIN with example.
6. How can we insert the record in the table?
7. Discuss the syntax of update and delete command.
8. Explain the utility of GROUP BY clause with an example.
9. Explain the utility of ORDER BY clause with an example.
10. Discuss the utility of various in-built functions in an SQL.

CHAPTER
11

Statsmodels Library for Time Series
Models

Learning Objectives
After reading this chapter, you will be able to

• Realize the importance of time series analysis.


• Determine stationarity of the series.
• Understand smoothing, seasonal decomposition for making time series stationary.
• Construct different models using autoregressive integrated moving average (ARIMA)
modeling techniques.

Forecasting helps to formulate strategies in various businesses, hence it is a basic need to help in
managerial decision making. Managers have to take decision in the face of uncertainty without
knowing what would happen. Forecasting can be obtained by different methods including
qualitative models and quantitative models. Quantitative models include time series models and
causal models. Causal models are used when one variable is dependent on the values of other
variables (regression analysis is used for these models, which will be discussed in next chapter).
Time series data provide important information related to time, and time series models attempt to forecast future values using past demand values (historical data); this approach is discussed in this chapter. We will consider the air passengers dataset available on kaggle.com.

11.1 Time Series Object


Time series is a series of data points in which each data point is associated with a timestamp. It
plays a major role in understanding a lot of details on specific factors with respect to time, For
example, stock price at different points of time on a given day and amount of sales in a region at
different months of the year.

11.1.1 Reading Data


The data for the time series are stored in an object called time series object and can be read using
series from pandas library.
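A minimal sketch of reading the series; the file name AirPassengers.csv and the column names Month and Passengers are assumptions about the downloaded file.

import pandas as pd

airdata = pd.read_csv("AirPassengers.csv", parse_dates=["Month"], index_col="Month")
airseries = airdata["Passengers"]        # time series object used in the rest of the chapter
print(airseries.head())
print(airseries.dtype)                   # int64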

Explanation
We can observe that the series start from January 1949 and ends till December 1960. The data
type of number of passengers is int64.

A lot of data related to time series is available on online sources. Most of the
time series models are generally made on stock prices for financial market. A
proper modeling can help the user in doing effective analysis.

11.1.2 Creating Subset


Python language uses many functions to create, manipulate, and plot the time series data. It is
also possible to create a subset of the given time series object by specifying the value of a
particular year or by varying the value of the start and end argument. The chart can also be
plotted using the selected values.

Explanation
The first function displays all the records of year 1952. We can observe that when the year is
specified, all the records of that particular year are displayed. But, if we also want to include
month for creating a subset, we will have to specify month along with the year as shown in the
second example. The second function displays the records between June 1953 and April 1954.
A plot of the time series is displayed by using plot() function. The next function shows the
entire time series. The last function displays the time series chart for selected values from 1951
till 1954.
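A minimal sketch of these subset and plotting operations, assuming airseries from the earlier sketch:

import matplotlib.pyplot as plt

print(airseries.loc["1952"])                    # all records of 1952
print(airseries.loc["1953-06":"1954-04"])       # June 1953 to April 1954
airseries.plot()
plt.show()                                      # entire series
airseries.loc["1951":"1954"].plot()
plt.show()                                      # selected years only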

11.2 Determining Stationarity


A time series is considered to be stationary if it does not have trend or seasonal effects and thus
has consistent statistical properties over time. These properties include mean, variance, and auto
covariance. Before applying statistical modeling methods, the time series is required to be
stationary. This is required for reliable modeling and forecasting of the series, because the behaviour observed in the past is assumed to continue in the future. It is therefore important to check the stationarity of the series before forecasting. Observations from a nonstationary time series show seasonal effects, trends, and other structures that depend on the time index. We can determine whether the time series is stationary or nonstationary either by looking at plots and checking for trends or seasonality, or by using statistical tests like the Dickey–Fuller test. We can observe from the preceding figures that there is an overall increase in the trend, with some seasonality in it. Thus, we can say that the series is not stationary.
The Dickey–Fuller test can be applied using the adfuller() function from the statsmodels package. The null hypothesis of the test is that the time series has a unit root; this means that it is nonstationary and has some time-dependent structure. The alternate hypothesis (rejecting the null hypothesis) is that the time series is stationary, that is, it does not have a unit root or a time-dependent structure. We interpret this result using the p-value from the test. If the p-value is ≤0.05, we can reject the null hypothesis; the data do not have a unit root and the series is stationary. If the p-value is >0.05, we fail to reject the null hypothesis; the data have a unit root and the series is nonstationary.
For effective analysis, it is also recommended to plot the rolling mean and standard deviation of the series, which are determined using the mean() and std() functions on a rolling window. Since we will determine the rolling mean and standard deviation for the other series discussed later as well, an effective approach is to create a function and call it a number of times with different series.
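A minimal sketch of such a helper function, assuming airseries from the earlier sketch; the plotting details are illustrative.

import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller

def time_func(series):
    """Dickey-Fuller test plus rolling mean / standard deviation plot."""
    result = adfuller(series.dropna())
    print("Test statistic:", result[0], " p-value:", result[1])
    print("Critical values:", result[4])
    plt.plot(series, label="Series")
    plt.plot(series.rolling(12).mean(), color="black", label="Rolling mean")
    plt.plot(series.rolling(12).std(), color="red", label="Rolling std")
    plt.legend()
    plt.show()

time_func(airseries)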

Explanation

The function named time_func() checks for stationarity of the series using the Dickey–Fuller test, determines the rolling mean and standard deviation of the series, and plots a figure representing them. The result of the test contains the test statistic and the critical values for different confidence levels. If the p-value is <0.05, the null hypothesis that the series is nonstationary is rejected. In our data, the p-value is 0.9, which means that the null hypothesis is not rejected; thus, the series is nonstationary. This is also reflected in the figure, which shows that the mean is increasing even though the standard deviation is small. Our objective, however, is to make the series stationary.

Download 4 years’ data related to share price of any organization and create a
chart. Determine whether the time series is stationary or not.

11.3 Making Time Series Stationary


There are two major factors that make a time series nonstationary: trend (nonconstant mean) and
seasonality (variation at specific time frames). Hence, before doing forecasting, we need to make
the series stationary and this can be done by adjusting the trend and seasonality. We can then
convert the forecasted values into real values by applying the trend and seasonality constrains
back again.

11.3.1 Adjusting Trend Using Smoothing


The first step is to reduce the trend using transformation such as log, square root, and cube root.
Smoothing is the most common method to model the trend. When the time series data have
significant irregular component for determining a pattern in the data, we want a smoothed curve
that will reduce these fluctuations. In smoothing, we usually take the past few instances (rolling
estimates). Smoothing in curve is generally achieved using either moving average (MA) method
or exponential smoothing.

11.3.1.1 Simple Moving Average


The simple moving average is computed using the rolling() function. For a rolling mean, we take some number (k) of consecutive values, and hence we must define the value of the parameter k. Generally, as the value of k increases, the plot becomes increasingly smoothed. The major challenge is to find the appropriate value of k that highlights the major patterns in the data without under- or over-smoothing. In our data, the value of k depends on the frequency; since it is one year, we have taken k as 12, corresponding to 12 months. For our data, we will use a logarithmic transformation because there is a strong positive trend.
Explanation
The function air_logseries=np.log(airseries) applies a log transformation to the series, and the transformed series is named air_logseries. The function air_logdata=air_logseries.to_frame() converts the series to a dataframe. Since we need to take the average of passengers, we select only the passengers column from air_logdata with the command air_logdata['Passengers']. The moving average of 12 consecutive values is computed using air_logdata.rolling(12).mean(). The next section plots the chart of the logarithmic series; the black line in the chart shows the rolling mean. The following section subtracts the rolling mean from the original (log) series. When the first five records are printed, we can observe that NaN values appear; this is because we are taking the average of the last 12 values and the rolling mean is not defined for the first 11 values. Since we cannot perform analysis with null values, the command log_moving_avg_diff.dropna(inplace=True) removes them.
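A minimal sketch of these steps, using the variable names mentioned in the explanation and assuming airseries from the earlier sketch:

import numpy as np

air_logseries = np.log(airseries)                       # log transformation
air_logdata = air_logseries.to_frame()                  # series -> dataframe
moving_avg = air_logdata["Passengers"].rolling(12).mean()

log_moving_avg_diff = air_logdata["Passengers"] - moving_avg
print(log_moving_avg_diff.head())                       # first values are NaN
log_moving_avg_diff.dropna(inplace=True)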

We will now determine the stationarity of logarithmic series by calling the above-defined
function.

Explanation
The p-value shows that it is less than 0.05. This means that at 95% confidence interval, the
series is stationary. We can observe from the visual representation also that the rolling values
are varying slightly but there is no specific trend. Hence, we can assume stationarity of the
series.

11.3.1.2 Exponential Weighted Moving Average


In some cases, where a fixed averaging period is difficult to justify (such as for stock prices), we use the exponentially weighted moving average instead of a simple moving average over a fixed period (12 months for a year in the example above). It is used when we want a weighted average of the existing time series values, giving more weight to recent observations, to make a short-term prediction of future values. This can be computed using the ewma() function from the pandas library (in recent pandas versions, the equivalent is the ewm() method followed by mean()). These models, with seasonality or a non-damped trend or both, have two unit roots (i.e., they need two levels of differencing to make them stationary); all other models have one unit root (they need one level of differencing to make them stationary).

Explanation
The function ewma(air_logdata, halflife=12).mean() (written as air_logdata.ewm(halflife=12).mean() in recent pandas) calculates the exponentially weighted moving average. The black line in the chart shows the weighted average, and the series adjusted by it appears to be stationary. We can observe that for the weighted moving average, the values are not null. The parameter half-life is assumed to be 12, but in real scenarios it depends on the domain.
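A minimal sketch using the ewm() method available in current pandas, assuming air_logdata and the time_func() helper from the earlier sketches:

exp_weighted_avg = air_logdata["Passengers"].ewm(halflife=12).mean()
log_ewma_diff = air_logdata["Passengers"] - exp_weighted_avg
time_func(log_ewma_diff)                 # stationarity check with the helper defined earlier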

We can now check for stationarity of the series by calling the user-defined function developed
previously.

Explanation
We can observe from the figure that the series is stationary because rolling values have less
variations in mean and standard deviation in magnitude. The p-value is less than 0.01, hence we
are almost 99% confident that the series is stationary.

11.3.2 Adjusting Seasonality and Trend


Most time series have trends along with seasonality. There are two common methods to remove
trend and seasonality: differencing and seasonal decomposition.

11.3.2.1 Differencing
This is done by taking the difference between the series and a time-lagged (shifted) version of itself.
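A minimal sketch of lag-1 differencing, assuming air_logdata from the earlier sketch:

import matplotlib.pyplot as plt

air_logdiff = air_logdata["Passengers"] - air_logdata["Passengers"].shift(1)   # lag of 1
air_logdiff.dropna(inplace=True)
air_logdiff.plot()
plt.show()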

Explanation
The above chart shows the impact of differencing on time series.

We can now check on stationarity of the difference series by calling the defined function:

Explanation
As the p-value is 0.07, we can infer that the time series can be considered stationary at the 10% level of significance. We can also observe from the figure that the mean and standard deviation show only small variations with time. However, we can also consider second- or third-order differences, which may produce better results in specific applications.

11.3.2.2 Seasonal Decomposition


Seasonal decomposition helps in modeling both trend and seasonality. Time series data have a
seasonal component in data such as monthly, quarterly, or yearly, which can be divided into
trend, seasonal, and irregular components. The trend component displays changes that happen
over time. The seasonal component displays the cyclical effects due to the time of year. The
irregular component displays the other effects, which are not considered by trend and seasonal
component.

Explanation
The function seasonal_decompose() performs seasonal decomposition on the series; the trend, seasonal, and residual components are determined and stored in air_trend, air_seasonal, and air_residual, respectively. The combined chart shows the result.
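A minimal sketch of the decomposition, assuming air_logdata from the earlier sketch:

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(air_logdata["Passengers"], period=12)   # monthly data
air_trend = decomposition.trend
air_seasonal = decomposition.seasonal
air_residual = decomposition.resid
decomposition.plot()
plt.show()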

We can now check stationarity of the data after handling missing values.

Explanation
The Dickey–Fuller test statistic is significantly lower than the 1% critical value. Thus, we can
say that the series is stationary, because the p-value is 0.00. This is also shown from the figure.

We can also use Prophet library for time series modeling from fbprophet. Type
the command from fbprophet import Prophet to import the library in
environment.

11.4 ARIMA Modeling
Time series analysis can be used in a multitude of business applications for forecasting a quantity into the future and explaining its historical patterns. Exponential smoothing and ARIMA models are commonly used for time series forecasting. Exponential smoothing models are based on a description of the trend and seasonality in the data, while ARIMA models are used to describe the autocorrelations in the data. In particular, every exponential smoothing model is non-stationary, while ARIMA models can be stationary. ARIMA models use historical information to make predictions and are considered a basis for more complex models. Linear exponential smoothing models have equivalent ARIMA representations, but nonlinear exponential smoothing models do not. ARIMA is like a linear regression equation in which the predictors depend on the parameters of the ARIMA model. Since we have made the time series stationary, we will build models on the differenced time series, because it is easy to add the error, trend, and seasonality back into the predicted values when differencing is used.

11.4.1 Creating ARIMA Model


The ARIMA model can be created by using ARIMA() function from statsmodel package.

Syntax
ARIMA(series,order=(a, d, m))
where

• series is the series on which the ARIMA model will be used.


• order has arguments that define the nature of the model:
• a denotes the number of AR (autoregressive) terms. For example, if a is 4, the predictors
for y(t) will be y(t-1), y(t-2), y(t-3), and y(t-4),
• d denotes the number of non-seasonal differences,
• m denotes the number of MA (moving-average) terms.
It should be noted that for an AR model, the value of “m” will be 0. Similarly, the value of “a” will
be 0 for an MA model, while for a combined ARIMA model, both “a” and “m” will be non-zero.
In the following section, we will make three different ARIMA models considering the individual
as well as the combined effects. The RSS (residual sum of squares) will also be printed for better
analysis.
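A hedged sketch of the three models is given below using the current statsmodels API (statsmodels.tsa.arima.model; older versions exposed statsmodels.tsa.arima_model.ARIMA). The RSS values it prints will differ from the figures quoted in the text, which depend on the dataset and library version.

from statsmodels.tsa.arima.model import ARIMA

# Fit an AR, an MA, and a combined ARIMA model on the log-transformed series;
# the differencing (d = 1) is handled internally through the order argument.
for name, order in [('AR', (2, 1, 0)), ('MA', (0, 1, 2)), ('ARIMA', (2, 1, 2))]:
    model = ARIMA(ts_log, order=order)
    results = model.fit()
    # Residual sum of squares (skip the first residual, which has no differenced predecessor)
    rss = (results.resid[1:] ** 2).sum()
    print('%s model, order=%s, RSS=%.4f' % (name, order, rss))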

Explanation
The first model is AR model since the value of the argument order is (2, 1, 0), which means that
MA component is missing. The second model is MA model since the value of the argument
order is (0, 1, 2), which means that AR component is missing. We can observe from the charts
that RSS values for different models are:
AR model = 1.5023
MA model = 1.472
ARIMA model = 1.0292.
From the above values, it is clear that the ARIMA model has the lowest (best) RSS value and
hence can be considered for further analysis.

11.4.2 Forecasting
Before forecasting, we need to bring the predictions back to the original scale.
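A sketch of this rescaling is shown below. It assumes predictions is a pandas Series holding the model's fitted values of the differenced log series (as produced by the older statsmodels ARIMA API when d = 1); the variable names are illustrative.

import pandas as pd

# Cumulative sum of the predicted differences
predictions_cumsum = predictions.cumsum()

# Start every time step from the first (base) value of the log series and
# add the accumulated differences back to it
predictions_log = pd.Series(ts_log.iloc[0], index=ts_log.index)
predictions_log = predictions_log.add(predictions_cumsum, fill_value=0)

print(predictions_log.head())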

Explanation
The predicted values are stored in the series named predictions. To convert the differences back to
the log scale, we add these differences consecutively to the base number. A better approach is to
use the cumsum() function to first determine the cumulative sum at each index and then add it to
the base number. The cumulative predicted values are stored in predictions_cumsum. We can
observe that the first month is missing because we took a lag of 1 (shift).

For forecasting, we should take the exponent (anti-log) of the predicted series obtained
above.
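A sketch of the final step is given below; ts stands for the original (untransformed) series and predictions_log for the rescaled log predictions from the previous sketch.

import numpy as np
import matplotlib.pyplot as plt

# Undo the log transform to return to the original scale
predictions_original = np.exp(predictions_log)

# Compare with the original series and compute the RMSE
rmse = np.sqrt(((predictions_original - ts) ** 2).mean())

ts.plot(label='Original')
predictions_original.plot(label='Predicted')
plt.legend()
plt.title('RMSE: %.2f' % rmse)
plt.show()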

Explanation
We first created a series with all values equal to the base number and added the differences to it.
The next section takes the exponential form of the series. The chart shows the predicted values
along with the original series. However, this cannot be considered the best model because the
RMSE value (90.10) is very high.

Download 5 years’ share price data from 2014 to 2019 of any organization.
Consider the data from January 2014 till December 2018 for ARIMA
Modeling. Forecast the share price for 2019. Calculate the RMSE value and try
to make RMSE value as low as possible by considering suitable factors.

USE CASE
FOREIGN TRADE

Foreign trade in India includes all imports and exports to and from India. India exports
approximately 7500 commodities to about 190 countries, and imports around 6000 commodities
from 140 countries. India exported US$318.2 billion and imported $462.9 billion worth of
commodities in 2014 (Wikipedia). Total merchandise exports from India grew by 4.48% year-on-
year to US$ 25.83 billion in February 2018, whereas the merchandise trade deficit increased
25.81% year-on-year, from US$ 9.521 billion during April–February 2016–17 to US$ 11.979
billion during April–February 2017–18, according to data from the Ministry of
Commerce & Industry. India’s external sector has a bright future as global trade is expected to
grow at 4% in 2018 from 2.4% in 2016 (www.ibef.org). The Government of India’s Economic
Survey 2017 to 2018 noted that five states—Maharashtra, Gujarat, Karnataka, Tamil Nadu and
Telangana—accounted for 70% of India’s total exports.
The exported items include natural or cultured pearls, precious or semiprecious stones,
mineral fuels, mineral oils, organic chemicals, pharmaceutical products, iron and steel, articles
of apparel and clothing accessories, cereals, fish and meat, cotton, aluminum, and articles. The
imported items include crude oil, gold, diamond, palm oil, coal, computers, solar powers, and
petroleum gases.
Trade Map provides trade statistics for international business development, covering monthly,
quarterly, and yearly trade data, import and export values, volumes, growth rates, market shares, etc. Time
series analysis using trend analysis, ARIMA modeling, etc., can be used for forecasting the
export and import of any product.

Summary
• Forecasting can be done by different methods including qualitative models and quantitative
models. Quantitative models include time series models and causal models. Time series
models attempt to predict the forecast demand (future values) using the past demand values
(historical data).
• Time series is a series of data points in which each data point is associated with a timestamp.
The data for the time series is stored in a time-series object.
• A time series is considered to be stationary if it does not have trend or seasonal effects and
thus has consistent statistical properties over time.
• The Dickey–Fuller test can be applied using adfuller() function from the statsmodel
package.
• Smoothing-adjusting trend: The first step is to reduce the trend using transformations such as
log, square root, and cube root. Smoothing is the most common method to model the trend.
• Smoothing in curve is generally achieved using either MA method or exponential smoothing.
• MA is computed using the rolling() function. As k increases, the plot will become
increasingly smoothed. Exponential smoothing is computed using ewma() function from
pandas library.
• Most time series have trends along with seasonality. There are two common methods to
remove trend and seasonality: differencing and seasonal decomposition.
• Seasonal decomposition is used to model both trend and seasonality by dividing the series
into trend, seasonal, and irregular (residual) components.
• ARIMA models are commonly used for time series forecasting. The ARIMA model can be
created by using ARIMA() function from statsmodel package.
• ARIMA is like a linear regression equation where the predictors depend on some parameters
of the ARIMA model.

Multiple-Choice Questions

1. The ARIMA() is available in _________package.


(a) statsmodel
(b) numpy
(c) pandas
(d) matplotlib
2. Which one of the following options is false:
(i) Smoothing is achieved by exponential smoothing
(ii) Smoothing is achieved by moving average method
(a) Only (i)
(b) Only (ii)
(c) Both (i) and (ii)
(d) Neither (i) nor (ii)
3. If the value of k decreases, the smoothness of the curve increases.
(a) True
(b) False
(c) Has no effect
(d) Both (a) and (b)
4. The stationarity can be checked using the _________ method.
(a) rolling()
(b) ewma()
(c) adfuller()
(d) None of these
5. Moving average is computed using the _________ method.
(a) rolling()
(b) ewma()
(c) adfuller()
(d) None of these
6. Exponential smoothing is computed using _________ method.
(a) rolling()
(b) ewma()
(c) adfuller()
(d) None of these
7. We can determine _________ by the seasonal decomposition.
(a) trend
(b) seasonal
(c) resid
(d) All of these
8. If order = (3, 1, 0) in ARIMA() function, this means _________model is created.
(a) Autoregressive
(b) ARIMA
(c) Moving average
(d) None of these
9. If order = (0, 1, 3) in ARIMA() function, this means _________model is created.
(a) Autoregressive
(b) ARIMA
(c) Moving average
(d) None of these
10. If order = (3, 1, 3) in ARIMA() function, this means _________model is created.
(a) Autoregressive
(b) ARIMA
(c) Moving average
(d) None of these

Review Questions

1. How can we create a moving average model in ARIMA?


2. How can we create an autoregressive model in ARIMA?
3. What is the utility of parameter “k” in smoothing?
4. Differentiate between simple moving average and exponential moving average methods.
5. What is the utility of ARIMA models?
6. Is it possible to view the time series for selected time period? If yes. How?
7. What is the purpose of using seasonal decomposition on time series data?
8. Explain the syntax of ARIMA() function.
9. Explain the results of the adfuller() test.
10. Discuss the null hypothesis for determining stationarity of time series.

SECTION 3
Machine Learning in Python

Chapter 12
Unsupervised Machine Learning Algorithms

Chapter 13
Supervised Machine Learning Problems

Chapter 14
Supervised Machine Learning Algorithms

Chapter 15
Supervised Machine Learning Ensemble Techniques

Chapter 16
Machine Learning for Text Data

Chapter 17
Machine Learning for Image Data

CHAPTER
12

Unsupervised Machine Learning
Algorithms

Learning Objectives
After reading this chapter, you will be able to

• Build concept of unsupervised machine learning algorithms.


• Get exposure to different unsupervised algorithms: dimension reduction algorithms and
clustering techniques.
• Analyze the results of these unsupervised algorithms.
• Implement unsupervised machine learning algorithms in real-world situation.

Unsupervised machine learning algorithms are used, when the output is not known and no
predefined instructions are available to the learning algorithms. In unsupervised learning, the
learning algorithm has only the input data and knowledge is extracted from these data. These
algorithms create a new representation of the data, which is easier to comprehend than the original
data and helps to improve the accuracy of supervised algorithms by consuming less time and
reducing memory. The common unsupervised machine learning algorithms include
dimensionality reduction algorithms and clustering.
The two common dimensionality reduction algorithms are principal component analysis (PCA)
and factor analysis (FA). These algorithms take as input a high-dimensional representation of
the data, which consists of many features, and produce an output that summarizes the data by
grouping essential characteristics into fewer factors. PCA replaces a large number of correlated
variables with a smaller number of uncorrelated variables. It is used to understand the structure,
shape, and covariance of the data, which is not possible with simple scatter plots, and it rotates
the dataset in such a way that the rotated features are statistically uncorrelated. This method is
used to summarize the data and reduce its dimensionality, while exploratory FA can be used as a
hypothesis-generating tool useful to understand the relationships among a large number of
variables.
Clustering is the task of partitioning the data into groups called clusters whose members are
similar in some way. A cluster is a collection of observations, which are similar between them
and are different from the observations belonging to other clusters. K-means clustering and
hierarchical clustering are commonly used clustering algorithms. K-means tries to find cluster
centers that are representative of certain regions of the data.

Dimensionality reduction reduces the number of variables by grouping similar
variables, while clustering does the grouping of similar observations.

12.1 Dimensionality Reduction


When we are dealing with huge data, we are not sure about the usefulness of the information
collected and deriving useful information becomes very tedious. But, in order to derive some
useful information, we cannot delete some variables assuming that those are not really useful.
For example, when the number of variables is large, it is not easy to apply statistical tests or create
scatter plots to find correlations between the variables and interpret the data; there would be too
many pairwise correlations between the variables to consider. It will also be difficult to comprehend
the data through the graphical display. Hence, it is necessary to club these variables together and
reduce the number of variables to a few interpretable linear combinations of the data for
interpreting the data in a more meaningful form. Each linear combination represents a principal
component or a factor. Thus, dimensionality reduction is helpful when we have a large number
of variables in our dataset and we need to reduce this number or where responses to many
questions tend to be highly correlated. This technique is generally used before performing t-test,
analysis of variance (ANOVA), and regression or cluster analysis on a dataset with correlated
variables.

12.1.1 Factor Analysis


FA is an exploratory data analysis method used to search important underlying factors or latent
variables from a set of observed variables. It helps in data interpretations by reducing the number
of variables. It is widely utilized in nearly all specializations where we need to reduce the
number of existing features, such as market research, advertising, and finance. FA is a linear
statistical model. It is used to explain the variance among the observed variables and to reduce a
set of observed variables into unobserved variables called factors.
Factors are also known as latent variables, hidden variables, unobserved variables, or
hypothetical variables. Factors describe the association among the observed variables. The
maximum number of factors is equal to the number of observed variables. Each factor explains a
certain amount of variance in the observed variables that have common patterns of responses.
FA is a method for investigating whether the variables N1, N2, …, Nn are linearly related to a
smaller number of unobservable factors F1, F2, …, Fm, where m < n.
The common assumptions that need to be fulfilled before applying FA include the following:
there are no outliers in the data; the sample size should be greater than the number of factors;
and there should not be homoscedasticity between the variables.
There are two types of FA: (i) Exploratory factor analysis (EFA): It is a commonly used
approach for reducing the number of features. The basic assumption is that any observed variable is
directly associated with any factor. EFA is a statistical technique used to identify latent
relationships among sets of observed variables in a dataset. In particular, EFA seeks to model a
large set of observed variables as linear combinations of some smaller set of unobserved latent
factors. (ii) Confirmatory factor analysis (CFA): The basic assumption is that each factor is
associated with a particular set of observed variables. In this section, we have applied EFA. A
library named “factor_analyzer” contains all the functions required for performing FA.
Before applying FA, it is important to evaluate whether there is a possibility to determine the
factors in the dataset. There are two methods to check the factorability or sampling adequacy:
Bartlett’s test and Kaiser–Meyer–Olkin (KMO) test. Bartlett’s test of sphericity checks whether
or not the observed variables intercorrelate at all using the observed correlation matrix against
the identity matrix. If the test is found to be statistically insignificant, we should not employ FA. If
the p-value is <0.05, the test is considered statistically significant, indicating that the observed
correlation matrix is not an identity matrix. The KMO test measures the suitability of data for FA.
It determines the adequacy for each observed variable and for the complete model. KMO
estimates the proportion of variance among all the observed variables that might be common
variance. The range of KMO values is between 0 and 1, and higher values indicate greater
suitability; a KMO value of less than 0.5 is considered inadequate for performing FA.
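A minimal sketch of these adequacy checks on the diabetes dataset is shown below; calculate_bartlett_sphericity() and calculate_kmo() are the helper functions provided by the factor_analyzer library.

import pandas as pd
from sklearn.datasets import load_diabetes
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

diabetes = load_diabetes()
x = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
print(x.shape)                      # (442, 10)

# Bartlett's test of sphericity: a p-value < 0.05 means the correlation matrix
# is not an identity matrix, so factor analysis is worth attempting
chi_square_value, p_value = calculate_bartlett_sphericity(x)
print(chi_square_value, p_value)

# KMO test: the overall measure should be at least 0.5 for FA to be adequate
kmo_all, kmo_model = calculate_kmo(x)
print(kmo_model)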

Explanation

This example considers the diabetes dataset from the sklearn.datasets library. The details of the
dataset are displayed using the DESCR attribute. The dimension of dataset of independent
variables is known by “shape” attribute. The dimension is found to be (442, 10), which means
that there are 442 observations for 10 variables. An important library named factor_analyzer is
first imported for carrying out all the functions related to FA. The p-value of Bartlett's test is 0,
which means that it is statistically significant. The overall KMO for our data is 0.534, which is
acceptable but not excellent. The values of both the tests indicate that FA can be executed since
the condition of adequacy is met.
The primary objective of FA is to reduce the number of observed variables and find
unobservable variables. It should be noted that the factors with the lowest amount of variance
should be dropped. Rotation is a tool for better interpretation of FA. Rotation can be orthogonal
or oblique. It redistributes the communalities with a clear pattern of loadings. This conversion of
the observed variables to unobserved variables also requires us to specify the rotation, such as the
Varimax rotation method or the Promax rotation method. The Varimax rotation rotates the factor
loading matrix so as to maximize the sum of the variance of squared loadings, while preserving
the orthogonality of the loading matrix. The Promax rotation is a method for oblique rotation that
builds upon the Varimax rotation and allows factors to become correlated.

The Kaiser criterion is an analytical approach based on selecting the factors that explain a more
significant proportion of the variance. The eigenvalue is a good criterion for determining
the optimum number of factors. Eigenvalues represent variance explained by each factor from
the total variance. Generally, an eigenvalue greater than 1 will be considered as selection criteria
for the feature. The eigenvalues are determined by using get_eigenvalues()function. The
graphical approach based on the visual representation of factors and eigenvalues is called scree
plot. It helps us to determine the number of factors where the curve makes an elbow.
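A sketch of the eigenvalue and scree-plot step is given below, continuing with the dataframe x from the previous sketch.

import matplotlib.pyplot as plt
from factor_analyzer import FactorAnalyzer

# Fit an exploratory FA model without rotation just to obtain the eigenvalues
fa = FactorAnalyzer(rotation=None)
fa.fit(x)

ev, common_ev = fa.get_eigenvalues()   # original and common-factor eigenvalues
print(ev)

# Scree plot: factors on the x-axis, eigenvalues on the y-axis
plt.plot(range(1, x.shape[1] + 1), ev, marker='o')
plt.axhline(y=1, color='r', linestyle='--')    # Kaiser criterion (eigenvalue = 1)
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.title('Scree plot')
plt.show()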

Explanation
The model named “fa” is created using the FactorAnalyzer() function and it considers the “x”
as the data. The eigenvalues are determined using the get_eigenvalues()function, which
returns the eigenvalues for each variable. This section basically tries to determine the optimum
number of factors. From the results, we can observe that eigenvalues are greater than 1 for three
variables. It means we need to perform FA considering only three factors (unobserved
variables). This is also determined using the scree plot method. A chart depicting the number of
factors and eigenvalues is drawn ranging from 1 to the maximum number of variables. Hence,
the range is from 1 till x.shape[1] + 1. The curve shows a straight line for each factor and its
eigenvalues. The number of variables having eigenvalues greater than 1 is taken as the
number of factors; hence from the scree plot also, it can be interpreted that for this dataset, the
optimum number of factors is 3.

The matrix of weights, or factor loadings, generated from an EFA model describes the
underlying relationships between each variable and the latent factors. Factor loadings are similar
to standardized regression coefficients, and variables with higher loadings on a particular factor
can be interpreted as explaining a larger proportion of the variation in that factor. In many cases,
factor loading matrices are rotated after the FA model is estimated in order to produce a simpler,
more interpretable structure to identify which variables are loading on a particular factor. The
factor loading is a matrix that shows the relationship of each variable to the underlying factors. It
shows the correlation coefficient between each observed variable and each factor. It is important
to determine the variance explained by each factor. The factor loadings are determined by the
loadings_ attribute and the variance explained by the factors is determined by the
get_factor_variance() function.

Explanation
The result in this program displays the factor loadings and variances of each factor. From the
factor loadings matrix, we can observe the loading of each variable on each factor. We need to
determine, for each variable, the factor on which it has the highest loading. Thus, we can observe that the
first factor is composed of two variables: fifth and sixth variables have the highest loading on
the first factor. The second factor comprises three variables: second, seventh, and eighth
variables having the highest loading on the second factor. The third factor comprises five
variables: first, third, fourth, ninth, and 10th variables having the highest loading on the third
factor. Thus, we are able to reduce the 10 variables in three factors. The analyst can name the
factors accordingly. However, it requires a domain knowledge of the concerned dataset. The
variance of each factor is displayed through the last section.

Perform factor analysis on breast cancer dataset available in sklearn.datasets


and try to reduce the number of variables by grouping the variables.

However, factor analysis has some limitations: its results can be controversial; its interpretations
can be debatable because more than one interpretation can be made of the same factors; and
factor identification and naming of factors require domain knowledge.

It is not advisable to use a large number of variables in a study. All those
studies that involve many variables should reduce them to a small number of
variables for effective interpretation. Factor analysis is used in many real-world
business problems. For example, there are a number of variables that can
contribute to sales in marketplaces; hence, they can be grouped under some
categories for effective analysis.

USE CASE
BALANCED SCORE CARD MODEL FOR MEASURING ORGANIZATIONAL PERFORMANCE

Organizations have come to realize the importance of a strategic feedback and performance
measurement/management application that enables them to more effectively drive and manage
their business operations. Balanced score card (BSC) is a framework that enables translation of
a company’s vision and strategy into a coherent set of performance measures that can be
automated and linked at all levels of the organization. Harvard business review listed it as one of
the 75 most influential business ideas of the 20th Century and Kaplan and Norton’s first BSC
monograph was chosen as one of the 100 best books of all time by business columnists.
The BSC is a performance management and measurement tool; it is a concept for measuring
whether the micro-operational activities of a company are aligned with its macro-objectives in
terms of vision and strategy. Its underlying rationale is that measuring an organization’s
performance mainly based on the financial perspective is not sufficient as this effort cannot
directly influence financial outcomes. The model is considered more effective than traditional
financial-based models and proposes that managers can select measures from three additional
categories or perspectives: customer, internal business processes and learning and growth.
Companies can use the BSC to track financial results, while simultaneously monitoring progress
in building the capabilities and acquiring the intangible assets they will need for future growth.
The financial facet includes some index used to indicate whether an organization’s business
operations are resulting in improvement of the bottom line. Customer facet consists of index that
can be used to measure an organization’s performance from the customer perspective. Internal
process facet focuses on the core competencies. The learning and growth facet contains indices for
evaluating an organization's continuous business improvement. Thus, the model has both financial
and non-financial (less tangible) aspects such as employee perception, customer satisfaction,
product development, time involvement, quality of information and support in organizational
strategic decision making.
The BSC model helps to effectively manage overall performance evaluations and combine the
vision and strategies of the enterprise. It allows organizations to increase the completeness and
quality of reports, easily evaluate the positive and negative effects on organizational
performance, and enhance the ability to manage ERP system implementation, since it focuses
on the intrinsically multidimensional character of organizational performance. It
links long-term strategy with short-term targets, thereby facilitating the best utilization of
resources.
There are many variables that measure organizational performance. It becomes
difficult for senior managers to formulate strategies for all the variables, so they desire a
subgrouping. Hence, the different variables related to organizational performance can be listed,
and principal component analysis or exploratory FA can be used to group the different variables
under these four factors of the BSC model.

12.1.2 Principal Component Analysis


Most problems of interest to organizations are multivariate. They contain multiple dimensions
that must be looked at simultaneously. Many statistical analysis techniques, such as machine
learning algorithms, are sensitive to the number of dimensions in a problem. In the big data era,
high dimensionality can render a problem computationally intractable. Hence, the goal of
dimensionality reduction is to replace a larger set of correlated variables with a smaller set of
derived variables and lose the minimum amount of information. The best way of minimizing the
loss of information is the preservation of variance. PCA is a data reduction technique that transforms
a larger number of correlated variables into a much smaller set of uncorrelated variables called
principal components. In simple words, PCA is a method of extracting important variables (in the
form of components) from a large set of variables available in a dataset. It extracts a low-
dimensional set of features from a high-dimensional dataset with the motive of capturing as much
information as possible. It is always performed on a symmetric correlation or covariance matrix.
This means the matrix should be numeric and contain standardized data. The main idea of PCA is
to reduce the dimensionality of a dataset consisting of many correlated variables, while retaining
the maximum information (variance) in the dataset. We transform the original variables to a new
set of variables called principal components.
PCA is used to overcome feature redundancy in a dataset. The resulting components are low
dimensional in nature. These components aim to capture as much information as possible with
high explained variance. The first component has the highest variance followed by second, third,
and so on. The components must be uncorrelated. PCA is applied on a dataset with numeric
variables. PCA is a tool that helps to produce better visualizations of high-dimensional data.
First principal component is a linear combination of original predictor variables that captures
the maximum variance in the dataset. It determines the direction of highest variability in the data.
The larger the variability captured by the first component, the larger the information captured by it.
No other component can have variability higher than the first principal component. The first
principal component results in a line that is closest to the data, that is, it minimizes the sum of
squared distance between a data point and the line.
Second principal component is also a linear combination of original predictors that captures
the remaining variance in the dataset and is uncorrelated with first principal component. In other
words, the correlation between first and second components should be zero. All succeeding
principal component follows a similar concept, that is, they capture the remaining variation
without being correlated with the previous component. In general, for n × p dimensional data,
min (n − 1, p) principal component can be constructed. The directions of these components are
identified in an unsupervised way because the response variable is not used to determine the
component direction. Therefore, it is an unsupervised approach.
It is a linear orthogonal transformation that transforms the data to a new coordinate system
such that the greatest variance by any projection of the data comes to lie on the first coordinate,
the second greatest variance on next coordinate, and so on. The analysis uses an orthogonal
projection of highly correlated variables to a set of values of linearly uncorrelated variables
called principal components. The number of principal components is less than or equal to the
number of original variables and our objective is to maximize all the variance on the first
principal component, then second, and so on.
PCA components explain the maximum amount of variance, whereas FA explains the
covariance in data. PCA components are fully orthogonal to each other, whereas FA does not
require factors to be orthogonal. A PCA component is a linear combination of the observed
variables, whereas in FA, the observed variables are linear combinations of the unobserved
variables or factors. PCA components are not directly interpretable, whereas in FA, the underlying
factors are labelable and interpretable.
PCA is a statistical procedure that transforms and converts a dataset into a new dataset
containing linearly uncorrelated variables, known as principal components. The basic idea is that
the dataset is transformed into a set of components where each one attempts to capture as much
of the variance (information) in data as possible. The function named PCA() available in
sklearn.decomposition is used to perform PCA by specifying the number of components.
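A minimal sketch of this procedure on the Friedman dataset is given below; the exact variance percentages depend on the random seed used to generate the data.

from sklearn.datasets import make_friedman1
from sklearn.decomposition import PCA

# Friedman #1 dataset: 100 observations, 10 independent variables
x, y = make_friedman1(n_samples=100, n_features=10, random_state=0)
print(x.shape)                          # (100, 10)

pca = PCA(n_components=4)
pca.fit(x)
x_reduced = pca.transform(x)

print(x_reduced.shape)                  # (100, 4)
print(pca.components_.shape)            # (4, 10): loading of each variable on each component
print(pca.explained_variance_ratio_)    # variance explained by each component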

Explanation
This example considers Friedman dataset available in sklearn.datasets library. The
make_friedman1() function returns two datasets: for independent variables and dependent
variables. The dataset corresponding to independent variables is stored in "x," whereas the
dataset corresponding to dependent variables is stored in “y.” Since we do not consider the
response variable in unsupervised learning, we will consider only the “x” dataset. The dataset
corresponding to independent variables has 100 observations and 10 variables. By applying
PCA, 10 variables are reduced to four principal components using the command
PCA(n_components=4) and pca.fit(x) functions. The command pca.transform(x)
transforms the dataset. Thus, the reduced shape of Friedman dataset is (100, 4) and shape of
principal components is (4, 10). The details of all components can be found using
pca.components_. Since the shape of principal components is (4,10), the matrix of same order
is displayed for showing the relationship of variables with the factors. The variance explained
by each component is determined using explained_variance_ratio_. Thus, we can observe
that the variance explained by the first component is 15.5%; the second principal component
explains 13.2%, which is less than the first; the third component explains 12.8% of the variance;
and the fourth component explains 12% of the variance. Thus, the total variance explained by all
four components is 53.5%.
It should be noted that PCA and EFA derive their solutions from the correlations among the
observed variables; hence we need to decide which factor model is the best fit for our research
goals. We need to decide how many components or factors to extract, then actually extract the
components or factors; we may rotate the components or factors and then interpret the results.
At the end, we compute the component or factor scores.

USE CASE
EMPLOYEE ATTRITION IN AN ORGANIZATION

High employee attrition is unfortunately part of almost every industry. Employee attrition is
expensive and incurs a great cost to an organization. Although some attrition is normal, in some
situations poor management can cause the normal turnover to climb to an extreme level.
Along with the monetary impact on the organization, it also affects employee morale in a
negative manner, which further affects productivity and results in less efficient and effective
outcomes. Besides, organizations incur substantial expense in bringing in new employees and hence
try to attract employee talent. These investments make senior professionals upset and hence they
try to understand the causes of attrition to retain valuable employees.
A manufacturing organization is facing the problem of employee attrition and they wanted to
retain its expertise and knowledge base. They tried to determine the reasons for employees
leaving their organization. They identified two types of reasons for employee attrition,
controllable and non-controllable reasons: Reasons outside the control of the company include
employee retirement, advancement to other parts of the organizations, promotions within the
same group, and illness and changes in the employee’s personal circumstances. Reasons within
the control of the company are feeling mistreated and undervalued, which includes improper
behavior of senior professionals, partial and rude attitude, unnecessary blame and back-biting;
work-life imbalance due to increasing economic pressures within an organization; job
characteristics; organization culture; organization instability and restructuring; job
dissatisfaction; employee misalignment with job; work stress; insufficient pay; improper
communication; lack of reward and appreciation; lack of decision-making ability;
insufficient training; organization reputation and performance; no increment and promotions;
less confidence on the policies; unavailability of growth opportunities; etc.
The organization wanted to address these issues else they will have to pay a high price for
neglecting steps to correct the situation. It was not possible to control all the reasons why
employees may leave the organization, but they tried to understand the incidence of various
issues. In order to plan a sustainable and cost-effective workforce, they thought of an analytical
approach that can help in strategy formulation of necessary remedies.
The organization was not able to develop strategies because there were many reasons for
employee attrition. Hence, they wanted to reduce them to a smaller number of dimensions. EFA
can be used to reduce the number of dimensions. Besides, the organization wanted to determine
whether the factors resulting from the dimension reduction are efficient (explain at least 85% of
the variance). The results obtained from FA can help the organization determine the suitable
number of factors, perform the analysis again with changed argument values, and evaluate the
results accordingly.

12.2 Clustering

Clustering is the process of organizing objects into groups whose members are similar in some
manner; it deals with finding a structure in a collection of unlabeled data. A cluster is a collection
of objects that have similar characteristics between them and are dissimilar to the objects
belonging to other clusters. Thus, data points inside a cluster are homogeneous and
heterogeneous to other groups. Choosing the right number of clusters is an important issue: with
too many clusters the groups begin to capture noise, while with too few clusters meaningful
distinctions between observations are lost. Generally, two forms of clustering, k-means and
hierarchical clustering, are used for grouping observations, as discussed in Sections 12.2.1 and
12.2.2. For example, consider an online food channel that wants to promote its website. It wants
to customize advertisement models for people living in different cities. Creating an individual
message for every person is not feasible because of cost and other practical problems of
implementation. On the other hand, we cannot send the same message to all because it is too
coarse. Clustering helps in this situation by creating groups based on similar characteristics, such
as grouping different cities (tier-I, tier-II, and tier-III).

12.2.1 K-Means Clustering


It is a type of unsupervised algorithm that solves the clustering problem by following a simple
and easy way to classify a given dataset through a certain number of clusters. For effective
analysis, we need to determine the number of clusters before applying k-means algorithm.
Depending on the initialization and on the distance function, we might obtain different clusters.
Better machine learning models can be created if effective initialization and better distance
functions are defined; for example, the distance function between two resumes may be based on
the level of experience and educational qualification, while the distance between two books may
be based on the story, the characters in the story, etc. Since data analysts have a better idea of the
domain, the data, and its distribution, they need to define the distance function effectively.
K-means works as follows:
1. Pick k points as the initial cluster centers, known as centroids.
2. Assign each data point to the closest centroid; this forms k clusters.
3. Recompute the centroid of each cluster based on its existing members, which gives new centroids.
4. Repeat steps 2 and 3: find the closest new centroid for each data point and re-associate the points with the k clusters.
5. Stop when convergence occurs, that is, when the centroids no longer change.
In the graph, each node represents a data point, and we can determine the distances between data
points without having any information about the features of these points. The goal of clustering is
to minimize the distance between the points and their cluster representatives. However, even if
k-means clustering is run many times, it is not certain that we get a globally optimal solution.
Hence, we need to try different initialization points, find the local optima, and consider the best
among those local optima.
In k-means, each cluster has its own centroid. The sum of squared differences between the
centroid and the data points within a cluster constitutes the within-cluster sum of squares (WCSS)
for that cluster. When the WCSS values of all the clusters are added, we get the total within-cluster
sum of squares for the cluster solution. As the number of clusters increases, this value keeps
decreasing; but if we plot the result, we may see that the sum of squared distances decreases
sharply up to some value of k, and then much more slowly after that. This elbow determines the
optimum number of clusters.

For effective results, it is advisable to do data processing before doing factor
analysis and cluster analysis of the data. The different visualization techniques
can also be used for understanding the relationship between the variables in the
data.

Suppose there are nine elements (2, 3, 4, 10, 11, 12, 20, 25, 30) and we need to form two clusters.
For the given dataset, there are many different ways in which the clusters can be formed. Here, we
will discuss three different cases:

Case 1: Considering two clusters of five elements (2, 3, 4, 10, 11) and four elements (12, 20, 25, 30)
The averages of the two clusters are 6 and 21.75
(6 – 2)² + (6 – 3)² + (6 – 4)² + (6 – 10)² + (6 – 11)² = 70
(21.75 – 12)² + (21.75 – 20)² + (21.75 – 25)² + (21.75 – 30)² = 176.75
Total within-cluster sum of squares = 70 + 176.75 = 246.75

Case 2: Considering two clusters of four elements (2, 3, 4, 10) and five elements (11, 12, 20, 25, 30)
The averages of the two clusters are 4.75 and 19.6
(4.75 – 2)² + (4.75 – 3)² + (4.75 – 4)² + (4.75 – 10)² = 38.75
(19.6 – 11)² + (19.6 – 12)² + (19.6 – 20)² + (19.6 – 25)² + (19.6 – 30)² = 269.2
Total within-cluster sum of squares = 38.75 + 269.2 = 307.95

Case 3: Considering two clusters of six elements (2, 3, 4, 10, 11, 12) and three elements (20, 25, 30)
The averages of the two clusters are 7 and 25
(7 – 2)² + (7 – 3)² + (7 – 4)² + (7 – 10)² + (7 – 11)² + (7 – 12)² = 25 + 16 + 9 + 9 + 16 + 25 = 100
(25 – 20)² + (25 – 25)² + (25 – 30)² = 50
Total within-cluster sum of squares = 100 + 50 = 150

Since the total within-cluster sum of squares is least for the third combination (150), we can say
that the best possible partition is the third case, in which the two clusters contain 6 and 3
elements, respectively.
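The arithmetic of the three cases can be verified with a few lines of Python; the helper name total_wcss is illustrative.

import numpy as np

def total_wcss(*clusters):
    # Total within-cluster sum of squares of a given partition
    return sum(((np.array(c) - np.mean(c)) ** 2).sum() for c in clusters)

print(total_wcss([2, 3, 4, 10, 11], [12, 20, 25, 30]))    # Case 1: 246.75
print(total_wcss([2, 3, 4, 10], [11, 12, 20, 25, 30]))    # Case 2: 307.95
print(total_wcss([2, 3, 4, 10, 11, 12], [20, 25, 30]))    # Case 3: 150.0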
For k-means clustering, we need to import KMeans from sklearn.cluster. The optimum
number of clusters is determined using the elbow method on a scree plot. The scree plot is created
considering different values of k and the inertia returned by the k-means algorithm. It should be
noted here that the inertia_ attribute is the same as the WCSS (within-cluster sum of squares).

Explanation
This example uses iris dataset for doing k-means clustering. The dimension of the dataset is:
150, 4. This means that there are 150 observations corresponding to four independent variables.
We know the independent variables are as follows: sepal.length, sepal. width, petal.length, and
petal.width. It is always suggested to determine the optimum number of clusters before actually
applying algorithm for k-means. The next section uses a “for” loop to determine the optimum
number of clusters. An empty list is created named “list.” A “for” loop is created from 1 to 11
and k-means algorithm is executed considering the values of k from 1 to 10. The command
KMeans(n_clusters=i,random_state=42) applies the k-means algorithm and the value of
inertia is stored in the list corresponding to each value of k. A scree plot is created for different
values of k and corresponding value of inertia generated by applying k-means algorithm. From
the scree plot, we can observe that the optimum number of clusters is 3.
The next section then applies k-means algorithm for three clusters and the command
“kmeans.fit_predict(X)” then predicts the cluster for each of the 150 observations and stores
in the variable named y_kmeans. The centers are also determined for each of the cluster
corresponding to each of the four variables of the iris dataset and displayed using
cluster_centers. The next section then creates a dataframe with two columns named original
and predicted containing the values of original values of dependent variable and predicted value
of cluster, respectively, using the command data = {'Original': y, 'Predicted': ypred}
and kmeansdf = pandas.DataFrame (data,columns=['Original','Predicted']).
Since the dependent variable contains the values 0, 1, and 2, it is needed to give a name to
the clusters. The next section gives the name to the clusters of the original dependent variables
and predicted values. This is required since the values in the dataframe are numerical and we
should really classify them according to the species they belong to. We know that the dependent
variable has three species of the flower: setosa, versicolor, and virginica.
The next section determines the accuracy of the predicted cluster through the confusion
matrix and accuracy score. These terms are discussed in detail in Chapter 13. The details of the
clusters show that the number of observations belonging to each cluster is as follows: 62, 50,
and 38. The result of the confusion matrix shows that all 50 observations that belonged to the first
cluster are grouped similarly in the predicted cluster also. Another cluster contained 62
observations, of which 48 were correctly classified and 14 were wrongly classified. Similarly,
the third cluster had 38 observations, of which 36 were correctly classified and two were
incorrectly classified. The accuracy score is 89.4% because 16 observations from 150 were
incorrectly classified.
The next section presents a figure where three different clusters are shown in different
colors. At one time we can consider two features in the chart. Hence, four different charts
considering two different independent variables at one time are formed. Thus, the first chart
shows the clusters corresponding to sepal.length and sepal.width, second chart displays clusters
corresponding to petal.length and petal.width, the third chart displays clusters corresponding to
the lengths of sepal and petal, and the fourth chart displays clusters corresponding to the widths of
sepal and petal.
We can observe from the chart that the observations are actually forming good clusters. This
further supports the accuracy obtained. Thus, considering this dataset, the analyst
can correctly predict the cluster to which a new observation will belong.

USE CASE
MARKET CAPITALIZATION CATEGORIES

Market capitalization is the aggregate valuation of the company based on its current share price
and the total number of outstanding stocks. It is calculated by multiplying the current market
price of the company’s share with the total outstanding shares of the company. It helps the
investor to determine the returns and the risk in the share. It also helps the investors choose the
stock that can meet their risk and diversification criterion. For example, a company has 1 lakh
outstanding shares and the current market price of each share is Rs. 150. Market capitalization
of this company will be 1,00,000 × 150 = Rs. 1,50,00,000, that is, Rs. 150 lakh. A company’s stock price cannot
determine the total value or size of a company. For example, a company whose stock price is Rs.
200 is not necessarily worth more than a company whose stock price is Rs. 125. A company with
a stock price of Rs. 200 and 1 lakh shares outstanding (a market cap of Rs. 200 lakh) is actually
smaller in size than a company with a stock price of Rs. 125 and 10 lakh shares outstanding (a
market cap of Rs. 1250 lakh). Different clusters are formed of the companies on the basis of
market capitalization. These clusters include:
Large-cap stocks: These stocks are the first class in market capitalization and these are stocks of
well-established companies that have been around for years. The market capitalization of these
companies is above Rs. 20,000 crore. The stocks of large-cap companies are generally
considered to be very safe (low risk) and information on large-cap companies is very readily
available.
Mid-cap stocks: Mid-cap companies are considerably smaller than large-cap companies in all
fields of comparison–revenue, profitability, employees, client base, etc., and their market
capitalization lies between Rs. 5000 and 20,000 crore. Mid-cap companies have a marvelous
scope for growth and can potentially give higher returns in future years. Unlike large-cap
companies, a lot of information on mid-cap companies is not publicly available, which makes it
difficult for an investor to invest.
Small-cap stocks: Small-cap companies generally are either start-up enterprises or companies
in the development stage. They have low revenues and a small number of employees and clients
and information on these companies is not easily available to all.
Large-cap stocks tend to hold up better in recessions, but they also tend to underperform
small-cap stocks when the economy emerges from a recession. Large-cap stocks tend to be less
volatile than mid-cap and small-cap stocks and are therefore considered less risky. The risk of
failure is greater with small-cap stocks than with large-cap and mid-cap stocks. It is difficult to
know exactly when the market will favor large cap, mid-cap, or small-cap stocks. Hence, before
investing, the investor must conduct thorough research of company related to short-term and
long-term plans of the company, the revenue model, profitability, outside investments, goodwill
of the promoters, and financial strength to withstand difficult times. The investor may want to
group companies of similar characteristics together for better decision making. The k-means
clustering algorithm can be used to form clusters of the similar type of companies.

Perform cluster analysis considering different financial data of different


organizations. Name the groups having different organizations.

12.2.2 Agglomerative Hierarchical Clustering


There has been an exponential increase in data capture and in the number of variables at every
possible stage. It becomes difficult for a data scientist to deal with more than 1000 potentially
significant variables. Hierarchical clustering is used for grouping a population into different
groups and is widely used for segmenting customers into different groups for specific
interventions. Hierarchical clustering uses two approaches: top-down and bottom-up. In the
top-down approach, the entire dataset is considered as a single cluster, which is divided into two
clusters; both clusters are then further divided into two each, and so on. The k-means approach
and the top-down approach may not return the same number of clusters with the same entities.
When the number of observations is small, bottom-up hierarchical clustering gives better results
since it builds the clusters from n down to 1 by merging clusters bottom-up, and we get all
possible numbers of clusters between 1 and n.
For explaining the working and utility of hierarchical clustering, we will consider make_blobs
dataset available in sklearn.datasets.

The agglomerative clustering algorithm is applied using the function stored in the sklearn.cluster
library. A dendrogram is created from the clusters formed from the observations. The dendrogram
and ward functions are available in the scipy.cluster.hierarchy library. In agglomerative clustering,
the cluster for each observation is determined using AgglomerativeClustering() from the
sklearn.cluster library.
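A minimal sketch of these steps on the make_blobs data is shown below.

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, ward

# 100 observations, 2 features, 3 underlying groups
x, y = make_blobs(n_samples=100, centers=3, random_state=1)

# Bottom-up (agglomerative) clustering into three clusters
agg = AgglomerativeClustering(n_clusters=3)
ypred = agg.fit_predict(x)

# Dendrogram built from the Ward linkage of the observations
result = ward(x)
dendrogram(result)
plt.show()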

Explanation
The example uses the dataset named make_blobs from sklearn.datasets. The dataset has 100
observations and two independent variables, which are stored in “x.” The variable y contains
the dependent variable with three categories named class 0, class 1, and class 2. The
agglomerative cluster algorithm is performed using the command
agg=AgglomerativeClustering(n_clusters=3), which means that number of clusters
required is 3. This is chosen because we know that there are three categories of dependent
variable. The command ypred=agg.fit_predict(x) predicts the cluster of each observation
contained in “x” and stores in “ypred.” A dendrogram (chart) is created corresponding to the
number of observations and the clusters using the commands result=ward(x) and
dendrogram(result). It can be observed from the figure that colors are used for depicting three
clusters and the figure considers all the 100 observations on the x-axis. The next section then
creates a dataframe with two columns named original and predicted containing the values of
original values of dependent variable and predicted value of cluster, respectively, using the
command mydata = {'Original': y, 'Predicted': ypred} and clusterdf =
pandas.DataFrame (mydata,columns= ['Original','Predicted']). Since the dependent
variable contains the values 0, 1, and 2, it is needed to give a name to the clusters. The next
section gives the name to the clusters of the original dependent variables and predicted values.
The next section determines the accuracy of the predicted cluster through the confusion
matrix and accuracy score. These terms are discussed in detail in Chapter 13. The result of the
confusion matrix shows that all 34 observations, which belonged to first cluster, are grouped
similarly in the predicted cluster also. Similarly, all 33 observations belonging to the second
cluster are grouped together in the predicted values also. This is applicable for third cluster also.
The accuracy score is hence 100%. Thus, for this dataset, all the clusters predicted from the
observations are grouped in similar fashion like the original dataset.

Perform cluster analysis considering image dataset named


load_sample_images that can be downloaded from sklearn. datasets.

USE CASE
PERFORMANCE APPRAISAL IN ORGANIZATIONS

The performance appraisal is an essential part of the human resources department’s


contribution to an organization. An effective appraisal may not only eliminate behavior and
work-quality problems, it can also motivate an employee to contribute more. The performance
appraisal is the perfect opportunity to address long-term goals that may not be on the everyday
to-do list. Lighting the way toward a successful career path inspires loyalty and stability and can
improve the bottom line. On the other hand, business owners want to gauge whether or not an
employee is meeting performance standards. Developing a process that enables managers to
appraise performance through objective metrics is imperative, so that a manager can define any
underlying human resource issues versus operational issues.
The performance appraisal process starts with establishing performance standards, which
finally lead to fulfill the mission and vision of the company. These standards are established
through job descriptions, employee handbooks, and operational manuals. After establishing
clear standards that can be easily measured, they are communicated clearly to employees. The
management then compares one employee against all others who perform the same tasks. This
gives employer an idea about the performance of each employee. After the review of
performance appraisals, employees are provided feedback about what has been done well and
what areas need improvement. Finally, an action plan is developed.
An employee performance appraisal is a process combining both written and oral elements,
whereby management evaluates and provides feedback on employee job performance, including
steps to improve or redirect activities as needed. The different performance appraisal methods
include assessment center method, critical incident technique, essay evaluation, human asset
accounting method, management by objective, paired comparison method, rating scale, ranking,
forced distribution, confidential report, checklists, field review technique, and performance test.
By monitoring performance and progress against business objectives, employers can assess
which employees to reward with salary increases, promotions, or bonuses or whom to retain and
whom to discontinue. Hierarchical cluster analysis can be used on the performance appraisal
data for grouping employees into clusters with similar performance.

Summary
• Unsupervised machine learning algorithms are used, when the output is not known and no
predefined instructions are available to the learning algorithms. In unsupervised learning, the
learning algorithm has only the input data and knowledge is extracted from these data. The
common unsupervised machine learning algorithms include dimensionality reduction
algorithms and clustering.
• The two common dimensionality reduction algorithms include principal component analysis
and FA. These algorithms replace a large number of correlated variables with a smaller
number of derived variables. This is helpful when we have a large number of variables in
our dataset and we need to reduce this number or where responses to many questions tend to
be highly correlated.
• FA explains the correlations between variables by uncovering a smaller set of more
fundamental, unobserved variables underlying the data. These hypothetical, unobserved
variables are called factors, where each factor explains the variance shared among two or
more variables.
• Factors are also known as latent variables or hidden variables or unobserved variables or
hypothetical variables. It describes the association among the number of observed variables.
The maximum number of factors is equal to a number of observed variables.
• There are two types of FA: EFA and CFA.
• A library named “factor_analyzer” contains all the functions required for performing FA.
• Before applying FA, it is important to evaluate whether there is a possibility to determine the
factors in the dataset. There are two methods to check the factorability or sampling adequacy:
Bartlett’s test and KMO test.
• Bartlett’s test of sphericity checks whether or not the observed variables intercorrelate at all
using the observed correlation matrix against the identity matrix.
• KMO test measures the suitability of data for FA. It determines the adequacy for each
observed variable and for the complete model.
• The eigenvalue is a good criterion for determining the optimum number of factors.
Eigenvalues represent variance explained by each factor from the total variance. Generally,
an eigenvalue greater than 1 is considered as the selection criterion for a factor. The
eigenvalues are determined by using the get_eigenvalues() function.
• The graphical approach based on the visual representation of factors and eigenvalues is
called scree plot. It helps us to determine the number of factors where the curve makes an
elbow.
• The matrix of weights, or factor loadings, generated from an EFA model describes the
underlying relationships between each variable and the latent factors. The factor loadings are
determined by loadings_ and the variance explained by the factors is determined by the
get_factor_variance() function.
• PCA is a data reduction technique that transforms a larger number of correlated variables into
a much smaller set of uncorrelated variables called principal components.
• The principal components aim to capture as much information as possible with high
explained variance. The first component has the highest variance followed by second, third,
and so on. The components must be uncorrelated. PCA is applied on a dataset with numeric
variables.
• The function named PCA() available in sklearn. decomposition is used to perform PCA by
specifying the number of components.
• Clustering is the task of partitioning the data into groups called clusters whose members are
similar in some way. A cluster is a collection of observations that are similar between them
and are different to the observations belonging to other clusters.
• K-means clustering and hierarchical agglomerative clustering are commonly used clustering
algorithms. For effective analysis, we need to determine the number of clusters before
applying clustering algorithm.
• For k-means clustering, we need to import KMeans from sklearn.cluster. The optimum number of clusters is determined using the elbow method in a scree plot. The scree plot is created by considering different values of k and the inertia returned from the k-means algorithm. It should be noted here that inertia_ is the same as WCSS (within-cluster sum of squares).
• The agglomerative clustering algorithm is applied using AgglomerativeClustering() in the sklearn.cluster library. A dendrogram is created by successively merging observations into clusters. The dendrogram and ward functions are available in the scipy.cluster.hierarchy library.

Multiple-Choice Questions

1. The number of principal components is ____________ the number of original variables.


(a) <=
(b) >=
(c) ==
(d) None of the above
2. Factor analysis can be applied using ____________ rotation.
(a) Varimax
(b) Single
(c) Dual
(d) Both single and dual
3. We need to determine the ____________ before applying K-Means algorithm.
(a) Dependent variable
(b) Observations
(c) Accuracy
(d) Number of clusters
4. The library _______ is used to perform exploratory factor analysis for dimension
reduction.
(a) factor_analyzer
(b) factor
(c) factor_analysis

(d) dimension_reduction
5. The AgglomerativeClustering() function is available in __________ library.
(a) sklearn.agglocluster
(b) sklearn.cluster
(c) sklearn.agglomerative
(d) None of the above
6. _________________ generated from EFA model describes relationships between each
variable and the latent factors.
(a) Mean value
(b) Predicted value
(c) Accuracy
(d) Factor loadings
7. The variance of first principal component is ______than second principal component.
(a) Less
(b) Equal
(c) Higher
(d) None of the above
8. The function __________is used to determine principal components for dimension
reduction.
(a) PCA()
(b) Pcomp()
(c) Pcomponent()
(d) principal()
9. The k-means algorithm is applied using ___________ function.
(a) kmeans()
(b) k-means()
(c) KMeans()
(d) K-Means()
10. Important test that needs to be checked before applying factor analysis is _________.
(a) KMO
(b) Bartlett
(c) Both (a) and (b)
(d) Neither (a) nor (b)

Review Questions

1. Discuss the adequacy tests that need to be checked before applying factor analysis.
2. Differentiate between supervised and unsupervised learning algorithms.
3. What is the utility of a dendrogram?
4. How can we determine the optimum number of factors before performing factor analysis?
5. Explain the process of forming clusters considering an example of your choice.

6. What is clustering? Which functions are used for two important methods used for
clustering?
7. Discuss principal component analysis in detail.
8. Perform agglomerative clustering considering make_moons dataset from sklearn.datasets.
9. Perform principal component analysis on diabetes dataset used in EFA considering three
factors.
10. How can we determine the optimum number of clusters in k-means algorithm?

CHAPTER
13

Supervised Machine Learning
Problems

Learning Objectives
After reading this chapter, you will be able to

• Differentiate between regression and classification.


• Design models based on regression and classification algorithms in Python.
• Analyze models related to regression and classification algorithm in Python.
• Interpret results of different models.

Regression is a form of predictive modeling technique that estimates the relationship between a
dependent (target) and independent variable(s) (predictor). It is a very widely used statistical tool
for data modeling and is used primarily for predicting and finding the causal effect relationship
between the dependent and independent variables. Regression analysis helps to indicate the
significant relationships between dependent variable and independent variable and the strength
of impact of multiple independent variables on a dependent variable. Thus, it helps market
researchers and data analysts to eliminate the worst and evaluate the best set of variables for
building effective predictive models. There are two major types of supervised machine learning
problems called regression and classification. In classification problems, the goal is to predict a
categorical variable. It should be noted that in classification problems, the dependent variable can have any number of categories, unlike logistic regression, which is applicable to two categories only (logistic regression is a binary classifier). An example of classification is to
predict whether the customer is going to buy or not. In regression problems, the goal is to predict
a continuous number depending upon independent variables of any type, for example, predicting
the salary of a person.

13.1 Basic Steps of Machine Learning


There is an important difference between Python and the other main statistical systems including
SAS and SPSS. In Python, a statistical analysis is normally done as a series of steps, with
intermediate results being stored in objects. Thus, although SAS and SPSS will give copious
output from a regression analysis or dimension reduction techniques, Python will give minimal
output and store the results in a fit object for subsequent interrogation by further Python
functions. There are five basic steps of the whole process.

13.1.1 Data Exploration and Preparation
This is the most important step and requires more than 60% of the project time and efforts. Once
the data are understood clearly and prepared, the other steps do not take much time. The different stages involved in this major step are discussed in the following subsections.

13.1.1.1 Understanding Dataset


We need to identify the independent and dependent variables. It is also important to identify the data type of each variable (such as character, int, float or category) and whether it is categorical or continuous. Univariate and bivariate analyses can also be done to understand the data.
Univariate analysis: The tool used for univariate analysis is dependent on the type of variable.
If the variable is continuous in nature, we need to determine the measures of central tendency –
mean, median, mode, etc., and measure of spread – variance, skewness, kurtosis, standard
deviation, etc. The visualization techniques such as histogram and box plot can also be used to
do univariate analysis. However, for categorical variable, frequency table can be used to
understand distribution of each category. Bar plot can be used as a visualization technique.
Bi-variate analysis: The bivariate analysis finds out the relationship between two variables.
These two variables can be any combination of categorical and continuous variables. If we need
to understand the relationship between two categorical variables, we need to use stacked column
chart. A scatter plot can be used to understand the relationship between two continuous variables. It should be noted that a scatter plot shows the relationship between two variables but does not indicate the strength of that relationship. We can use Spearman and Pearson correlation coefficients
to determine the strength of the relationship. If we want to understand the relationship between
categorical and continuous variables, we can use test for comparing means or draw box plots for
each level of categorical variable.
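As an illustration of these checks, the following minimal sketch (using a small hypothetical DataFrame with invented column names) shows typical univariate and bivariate analysis calls in pandas and scipy.

import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: 'salary' and 'age' are continuous, 'dept' is categorical
df = pd.DataFrame({'salary': [35, 42, 50, 61, 58, 45, 70, 66],
                   'age':    [23, 28, 31, 40, 38, 30, 45, 44],
                   'dept':   ['HR', 'IT', 'IT', 'HR', 'IT', 'HR', 'IT', 'IT']})

# Univariate analysis of a continuous variable: central tendency and spread
print(df['salary'].describe())                  # mean, std, quartiles
print(df['salary'].skew(), df['salary'].kurt())
df['salary'].hist(); plt.show()                 # histogram
df.boxplot(column='salary'); plt.show()         # box plot

# Univariate analysis of a categorical variable: frequency table and bar plot
print(df['dept'].value_counts())
df['dept'].value_counts().plot.bar(); plt.show()

# Bivariate analysis of two continuous variables: scatter plot and correlation
df.plot.scatter(x='age', y='salary'); plt.show()
print(pearsonr(df['age'], df['salary']))
print(spearmanr(df['age'], df['salary']))

# Bivariate analysis of categorical vs continuous: box plot for each level
df.boxplot(column='salary', by='dept'); plt.show()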

13.1.1.2 Handling Missing Values


Missing values can be found if data are not extracted or collected properly. The accuracy of the developed model can be misleading if missing data exist and are not handled properly. This is because the user has not analyzed the behavior of and relationships between the variables in a correct manner.
Example: A feedback of five managers is recorded related to two employees: A and B.
Performance = [“good”,“bad”,“bad”,“good”,“bad”,“bad”,“good”,“bad”,“good”,“good”]
Employee = [“A”,“B”,“A”,“B”,“A”, “B” ,“A”, “B”, “A”,“B”]
We can observe that employee A has got 3/5 good reviews (60%), whereas employee B has
got 2/5 good reviews (40%). This means that employee A is better.
But if the data were not extracted properly related to employee name and B’s data were
missing at three places.
Employee = [“A”, “NA”, “A”,“B”, “A”, “NA”,“A”, “NA”, “A”,“B”]
If missing values are not considered, B will have 2/2 good reviews (100%), whereas
employee A will have (60%) good reviews. This means that the employee B is better. Hence, the
result generated is incorrect.
The preceding example clearly shows the importance of missing data imputation. From Chapter 6, we know that the observations having missing values are either deleted or handled by doing imputation with the mean, median or mode.
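A minimal sketch of handling the missing employee names from the example above is given below; it uses pandas and marks the missing entries with np.nan. The mode-based imputation shown is only one of the options mentioned.

import pandas as pd
import numpy as np

# Hypothetical feedback data with missing employee names
feedback = pd.DataFrame({
    'Performance': ['good', 'bad', 'bad', 'good', 'bad',
                    'bad', 'good', 'bad', 'good', 'good'],
    'Employee':    ['A', np.nan, 'A', 'B', 'A',
                    np.nan, 'A', np.nan, 'A', 'B']})

print(feedback.isnull().sum())       # count of missing values per column

# Option 1: drop the rows having missing values
print(feedback.dropna())

# Option 2: impute a categorical column with its mode
feedback['Employee'] = feedback['Employee'].fillna(feedback['Employee'].mode()[0])

# For a numeric column, mean or median imputation would be used instead, e.g.
# df['col'] = df['col'].fillna(df['col'].mean())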

13.1.1.3 Assumptions of Regression


It is important to check the five important assumptions of regression before doing regression
analysis.

Normality of variables: Regression assumes that variables should have normal distributions. If
the data are not normally distributed, a nonlinear transformation (e.g., log-transformation) can be
used. There are many different ways for checking the assumption of normality including Shapiro
test, skewness and kurtosis and normality test. Non-normally distributed variables (highly
skewed or kurtotic variables, or variables with substantial outliers) can distort relationships and
reduce accuracy. It is important to check for outliers, since linear regression is sensitive to outlier
effects. The effect of outliers on the data is explained in the following example:

Explanation
The original data were data1, but due to some negligence, the last observation was recorded as
66 instead of 6. The presence of an outlier in the second dataset changed the mean of the data
from 5.5 to 11. We can observe that dataset with outlier resulted in significantly different mean.
Thus, it would have a high impact on the analysis of data. So, it is important to detect outliers in
the data.
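The book's screenshot of this example is not reproduced here. The sketch below uses illustrative lists chosen only so that their means match the values quoted in the explanation (about 5.5 and 11); the actual data used in the book may differ.

import numpy as np

# Values chosen only to reproduce the quoted means; not the book's exact data
data1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 6]     # last observation entered correctly as 6
data2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 66]    # last observation mistyped as 66

print(np.mean(data1))   # approximately 5.5
print(np.mean(data2))   # 11.0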

Outlier detection aims to identify influential data points that distort the analysis of the data. An observation whose z-score is more than 3 is considered an outlier. However, visualization
techniques such as box-plot, histogram, and scatter plot can also be used for determining outliers.
Box plot uses the interquartile range (IQR) method to display outliers, but we will have to use
mathematical formula to retrieve the list of outlier data. IQR is basically the difference between
75th and 25th percentiles, or between upper and lower quartiles, Q3 – Q1. If the value of
observation is <(Q1 – 1.5 × IQR) or >(Q3 + 1.5 × IQR), it is considered as an outlier. After
determining outliers, we generally try to delete outliers, but it should be noted that only those
outliers should be removed which have occurred due to data entry error, data processing error or
if they are very small in numbers. It is important to understand the nature of the outlier before
removing it.
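A minimal sketch of both outlier-detection approaches on a small hypothetical series is shown below; the values are invented for illustration.

import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical numeric series with one obvious outlier
s = pd.Series([4, 5, 6, 5, 7, 6, 5, 4, 6, 5,
               6, 4, 5, 7, 6, 5, 4, 6, 5, 66])

# z-score method: flag observations whose absolute z-score exceeds 3
z = np.abs(zscore(s))
print(s[z > 3])

# IQR method: flag observations outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)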
Linearity: Regression can only accurately estimate the relationship between dependent and
independent variables if the relationships are linear in nature. A significant correlation between
each independent variable(s) and dependent variable confirms the linearity. Scatter plots can be
used to determine a linear or curvilinear relationship.

Multicollinearity: Multicollinearity occurs when we have two or more independent variables that are highly correlated with each other. This leads to problems with understanding which independent variable contributes to the variance explained in the dependent variable, as well as technical issues in calculating a multiple regression model. We can remove collinear variables by looking at correlation tables and eliminating variables that are above a certain threshold, but that is possible for pairs of variables only. We can detect multicollinearity for several variables together systematically by determining the variance inflation factor (VIF) for each variable. VIF calculations are easy to understand: the higher the value, the higher the collinearity. A VIF is calculated for each explanatory variable and those with high values are removed. A VIF value greater than 4 indicates a multicollinearity problem. This further means that, for this assumption to be fulfilled, the value of VIF for each independent variable in the model should be less than 4.
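A minimal sketch of computing VIF values with statsmodels is given below; the iris dataset is used purely as a convenient built-in source of numeric predictors, not as data from this chapter.

import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_iris
from statsmodels.stats.outliers_influence import variance_inflation_factor

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Compute VIF on a design matrix that includes the constant term;
# the row for 'const' itself can be ignored when judging predictors
Xc = sm.add_constant(X)
vif = pd.Series([variance_inflation_factor(Xc.values, i)
                 for i in range(Xc.shape[1])], index=Xc.columns)
print(vif)   # predictors with VIF > 4 are candidates for removal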
Independence of errors: Linear regression analysis requires that there is little or no
autocorrelation in the data. Autocorrelation occurs when the residuals are not independent from
each other. In other words, when the value of y(x+1) is not independent from the value of y(x).
The Durbin–Watson test is used to check this assumption. The Durbin–Watson value informs us whether the assumption of independence of errors is defensible (no autocorrelation of error terms). The Durbin–Watson test has the null hypothesis that the residuals are not linearly autocorrelated.
While we can assume values between 0 and 4, values around 2 indicate no autocorrelation. As a
rule of thumb values between 1.5 and 2.5 show that there is no autocorrelation in the data.
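A minimal sketch of running the Durbin–Watson test on the residuals of a fitted model is shown below; the data are synthetic and the model is illustrative only.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Synthetic data: fit a simple OLS model and test its residuals
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(model.resid)
print(dw)   # values between roughly 1.5 and 2.5 suggest no autocorrelation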
Homoscedasticity: Homoscedasticity means that the variance of errors is the same across all
levels of the independent variables. When the variance of errors differs at different values of the
independent variables, heteroscedasticity is indicated. According to Berry and Feldman (1985)
and Tabachnick and Fidell (1996), slight heteroscedasticity has little effect on significance tests;
however, when heteroscedasticity is marked it can lead to serious distortion of findings and
seriously weaken the analysis. This assumption can be checked by visual examination of a plot
of the standardized residuals (the errors) by the regression standardized predicted value. The
Breusch–Pagan test fits a linear regression model to the residuals of a linear regression model
(by default the same explanatory variables are taken as in the main regression model) and rejects
if too much of the variance is explained by the additional explanatory variables. An insignificant p-value means that the constant variance, that is, homoscedasticity, assumption is met. The Breusch–Pagan test is not built into scikit-learn's linear models, but statsmodels provides it for the residuals of a fitted regression. For scikit-learn linear models, we can also check this assumption by creating a scatter chart between the predicted values and the residuals; the assumption is considered fulfilled if the variance is observed to be roughly the same across predicted values.
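A minimal sketch of both checks, continuing the synthetic model from the previous sketch, is shown below; het_breuschpagan from statsmodels is used here as one possible way to run the Breusch–Pagan test on the residuals.

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic illustrative model (same construction as in the previous sketch)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Visual check: residuals plotted against predicted (fitted) values should
# show a roughly constant spread with no funnel shape
plt.scatter(model.fittedvalues, model.resid)
plt.xlabel('Predicted values'); plt.ylabel('Residuals')
plt.show()

# Breusch-Pagan test on the residuals: an insignificant p-value (> 0.05)
# supports the homoscedasticity assumption
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(lm_pvalue)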
It is important that all the assumptions are satisfied before applying the regression analysis.
Also, it is necessary that all the variables under the study should be significant to make a
contribution to the dependent variable.

13.1.1.4 Feature Engineering


Feature engineering is the science (and art) of extracting more information from existing data.
Feature engineering is generally done by variable transformation and creation. A variable is
transformed generally by doing scaling, label encoding and one hot encoding. In data modeling,
transformation refers to the replacement of a variable by a function. For instance, replacing a
variable x by its square/cube root or logarithm is a transformation. In other words, transformation is a process applied when we want to change the scale of a variable or standardize its values for better understanding. Scaling is a must if variables are measured on different scales; note that such a transformation changes the range of a variable but not the shape of its distribution.
values by their Z scores. In label encoding, we convert each value in a column into a number.
One hot encoding (dummy variables) helps to create one Boolean column for each category,
where only one column can have the value 1 for each sample.
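A minimal sketch of these three transformations on a small hypothetical DataFrame is given below; the column names are invented for illustration.

import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Hypothetical data frame with one numeric and one categorical column
df = pd.DataFrame({'salary': [30, 45, 60, 75],
                   'city':   ['Delhi', 'Mumbai', 'Delhi', 'Chennai']})

# Scaling (standardization): replace values by their z-scores
df['salary_scaled'] = StandardScaler().fit_transform(df[['salary']]).ravel()

# Label encoding: convert each category into a number
df['city_label'] = LabelEncoder().fit_transform(df['city'])

# One hot encoding (dummy variables): one Boolean column per category
df = pd.concat([df, pd.get_dummies(df['city'], prefix='city')], axis=1)
print(df)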
Feature creation is a method to create new variables/features based on existing variable(s).
For example, we are determining relationship between sales in a shopping mall and date. For
these data, the sales variable will be considered as dependent variable and date variable is
considered as an independent variable in the format dd-mm-yy. A date in this raw format will not produce effective insights about sales. But if we split the date into three different columns (day, month and year), we will be able to uncover better, hidden relationships between these parts of the date and the sales variable. For example, Sunday will have more sales than Monday. Similarly, more sales will be observed in the festival months.
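A minimal sketch of this kind of feature creation with pandas is shown below; the dates and sales figures are invented for illustration.

import pandas as pd

# Hypothetical sales data with a date column in dd-mm-yy format
sales = pd.DataFrame({'date':  ['01-12-19', '02-12-19', '08-12-19', '25-12-19'],
                      'sales': [12000, 8000, 15000, 30000]})

sales['date'] = pd.to_datetime(sales['date'], format='%d-%m-%y')

# Split the date into new features that may carry hidden relationships
sales['day'] = sales['date'].dt.day
sales['month'] = sales['date'].dt.month
sales['year'] = sales['date'].dt.year
sales['weekday'] = sales['date'].dt.day_name()   # e.g. Sundays may sell more
print(sales)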

13.1.2 Model Development


Predictive analytics is basically extracting information from existing datasets for determining patterns and predicting future outcomes and trends. For determining the patterns and predicting outcomes, we divide the dataset into a training and a test dataset. It is better to split with a particular seed value so that the split of the data is always the same. The seed makes the system generate the same random numbers across different executions of the program; otherwise we would not have uniformity across runs. This always fetches the same result, since the same observations will be present in the training and test datasets. This book has used this approach throughout so that the results generated match the results of the reader. Using the train_test_split() function from the sklearn library, we can split the dataset in the ratio of 70:30,
80:20, 75:25, etc. The model is developed according to the required independent variables on the
training dataset. The user can determine the value of the coefficients and the intercept to
understand the model. The model is developed using different algorithms like regression, k-NN,
decision tree, random forest, support vector machines, bagging, boosting, etc. These algorithms
are discussed in this chapter along with the Chapters 14 and 15.

13.1.3 Predicting the Model


The developed model is used for predicting the values of the user defined input or the test dataset
using predict function. This step stores or displays the predicted value of dependent variable with
respect to given values of input independent variables.

13.1.4 Determining the Accuracy of the Model


This is the most important step because it finally reports about the accuracy of the created model.
This step shows the final result of the algorithm used on the data. The role of analyst is to
primarily focus on increasing the accuracy of the model with meaningful interpretations. The
accuracy of the model is determined by comparing the predicted values and original values of the test dataset. The difference between these values shows the inaccuracy of the model. Generally, for
regression problems, we can determine the accuracy using root mean squared error (RMSE)
value and for classification problems, different techniques like accuracy score, classification
report, confusion matrix, receiver operating characteristic (ROC) and area under curve (AUC)
are used.

13.1.4.1 RMSE Value


For linear and multiple regression, we use RMSE to determine the accuracy, while for logistic regression, we use the confusion matrix to determine the accuracy of the model. RMSE is basically the square root of the mean squared difference between the observed and predicted values, which gives us a measure of the prediction accuracy. The lower the value of RMSE, the better the model.

13.1.4.2 Confusion Matrix


This is used to determine to what extent the predictions have been made appropriately. A
confusion matrix is a table that is often used to describe the performance of a classification
model on a set of test data for which the true values are known. For understanding the confusion
matrix, let us assume the results of a binary classifier. There are two possible predicted classes:
“yes” and “no”. If we were predicting the loan payment, “yes” would mean they have done the
payment, and “no” would mean they have not done the payment. Suppose, that the model did a
total of 100 predictions (100 people were being tested for doing payment). Out of those 100
cases, the model predicted “yes” 60 times, and “no” 40 times. But, in original dataset 55 people
had done the payment and 45 had not done the payment. The confusion matrix will be created as
follows:

                    Predicted positive    Predicted negative

Actual positive     55 (TP)               0 (FN)
Actual negative     5 (FP)                40 (TN)

True positives (TP): These are cases in which we predicted yes, and they have done the
payment.
True negatives (TN): These are cases in which we predicted no, and they have not done the
payment.
False positives (FP): We predicted yes, but they have not done the payment. (Also known as
a “Type I error.”)
False negatives (FN): We predicted no, but they have done the payment. (Also known as a
“Type II error.”)
Accuracy: Accuracy is determined by (TP + TN)/total = (55 + 40)/100 = 0.95.
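A minimal sketch of producing a confusion matrix and accuracy score with sklearn.metrics is given below; the labels are invented for illustration and do not correspond to the 100-prediction example above.

from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true and predicted labels for a binary loan-payment problem
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Rows of the matrix are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))   # (TP + TN) / total = 0.8 here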

13.1.4.3 Accuracy Score


The accuracy is determined using the function accuracy_score() from sklearn.metrics library.
This function primarily takes two input: predicted values of dependent variable and original
values of the dependent variable from the test dataset. The accuracy score is calculated between 0 and 1, corresponding to 0 to 100%. Thus, a score of 1 shows that there is 100% accuracy. This
further means that all the values are rightly predicted. A score of 0 shows that none of the
observations is predicted correctly. A score of 0.5 shows that 50% of the observations are rightly
predicted.

13.1.4.4 Classification Report


The classification report displays four different values namely precision, recall, F1-score and
support for all the categories of the dependent variable, accuracy of the model, macro average
and weighted average. Precision is the ratio of true positives to the sum of true and false positives; in other words, the percentage of predicted positives that are actually correct. Recall is the ratio of true positives to the sum of true positives and false negatives; this displays the percentage of actual positive records that were classified correctly. The F1 score is a weighted harmonic mean of precision
and recall; the highest score is 1.0 and the lowest is 0.0. It is suggested to consider F1 score for
comparing models and classifier because they consider precision and recall scores into their
computation. It should be noted that F1 score is always lower than accuracy measures. Support is
the number of actual occurrences of the class in the specified dataset. Thus, value of support will
not change between the classifiers, but instead will diagnose the evaluation process.
Classification Report is discussed in chapter 15.

13.1.4.5 ROC Curve and AUC


ROC curve is a commonly used graph that summarizes the performance of a classifier over all
possible thresholds. It is generated by plotting the True Positive Rate (y-axis) against the False
Positive Rate (x-axis). The ROC curve is usually a good graph to summarize the quality of our
classifier. The higher the curve is above the diagonal baseline, the better the predictions. The
performance of the model is determined by looking at the area under the ROC curve and the
value is stored in AUC. The highest value of AUC is 1 while the least is 0.5, which depicts the
45° random line. It should be noted that if there is any value less than 0.5, this means that we
should do the exact opposite of recommendation of the model to get the value more than 0.5. It
should be noted that ROC curve is a graphical plot that illustrates the diagnostic ability of a
binary classifier system as its discrimination threshold is varied. However, ROC for multiclass
classifier can also be created using OneVsRestClassifier from sklearn.multiclass. This is
discussed in Chapter 15 to an extensive level.
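A minimal sketch of plotting an ROC curve and computing AUC for a binary classifier is shown below; the breast_cancer dataset and logistic regression classifier are used only as convenient illustrations, and the random_state and max_iter values are assumptions.

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
x_trg, x_test, y_trg, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

clf = LogisticRegression(max_iter=5000).fit(x_trg, y_trg)
probs = clf.predict_proba(x_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)
print(roc_auc_score(y_test, probs))              # AUC: 1 is perfect, 0.5 is random

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')         # 45-degree random baseline
plt.xlabel('False Positive Rate'); plt.ylabel('True Positive Rate')
plt.show()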

13.1.5 Creating Better Model


The main objective of any analyst is to make a better model. This step requires a lot of consideration and also requires the execution of the previous steps. However, we will be discussing
this step in detail in Chapters 14 and 15. Important considerations that can help us to create a
better model include the following:

13.1.5.1 Avoid Overfitting and Underfitting


It is important to check whether the model is overfitting or underfitting. Overfitting exists when the model fits the training set too closely. If we build a model that is very complex because of the number of variables and the amount of information, the model ends up overfitting: although it works well on the training set, it does not generalize to new data. On the other hand, if we build a model that is too simple, the model ends up underfitting. Underfitting exists when we do not consider enough of the variables and information, and the model gives a poor result even on the training dataset. This means that by increasing complexity, we will be able to predict better on the training data; but if the model becomes very complex, it starts paying attention to each individual data point in the training set and will not generalize well to new data. Hence, the analyst needs to strike a tradeoff between underfitting and overfitting of the model.

13.1.5.2 Feature Extraction


If the model is not displaying the required accuracy, the number of independent variables is changed in the model and the steps are repeated till the desired level of accuracy is obtained and no further improvement in accuracy is required. The independent variables can be either added
or deleted from the model to get a final set of predictor variables using stepwise regression. In
stepwise method, variables are added or deleted from the model one by one until a satisfactory
model is created. In backward stepwise, all the independent variables are included and they are
removed one by one to determine the best model. In forward stepwise, we add the independent
variables one by one to the model to determine, which are the significant independent variables
in the model. It is worth mentioning here that a combination of both the approaches can also be
used for developing a better model. However, there is no single criterion to choose the best
model. The analyst can decide according to his requirements of independent variables and
accuracy of prediction.
Feature extraction plays an important role in the machine learning for problems with
thousands of features for each training instance. Hence, it is important to determine the optimal subset of features for reducing the model's complexity, making it easier to find the best solution, and decreasing the time it takes to train the model. We can filter features based on the correlation
of the features. If two features or more are highly correlated, we can randomly select one of them
and discard the rest without losing any information. If two features have a correlation value of 1, it means that they are perfectly correlated; 0 means not correlated; and –1 means perfectly correlated but in the opposite direction (one feature increases while the other decreases). We then determine groups of features that have a correlation coefficient greater than 0.95. From each group of correlated features, one is selected and the others are discarded. After removing highly correlated features, the number of features is reduced further by determining the feature_importances_ or coef_ attributes of all the features and deleting the features having least/no importance. A visual
representation also helps in determining the importance of each feature. Feature extraction to an
extensive level is however beyond the scope of this book, but it is highly recommended to the
user for increasing accuracy of the model and deriving meaningful interpretation. Feature
engineering is discussed in Chapter 14.

13.1.5.3 Tuning of Hyper Parameters


After the feature extraction, it is important to do tuning of hyper parameters. This is important to reduce the risk of overfitting and to maximize the estimator's performance. In Python, this is
primarily done by GridSearchCV object, which will perform an exhaustive search over the hyper
parameter grid and will report the hyper parameters that will maximize the cross-validated
classifier performance. A dictionary is created of key-value pairs consisting of key, which has
string denoting the classifier, and value, which has corresponding different values. There is no
standard and optimal hyper parameter grid for any classifier. The user can change the hyper
parameter grid according to the data. However, we have focused on tuning of hyper parameters
for each algorithm through grid-based approach in Chapter 15 to an extensive level.
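A minimal sketch of a grid search is given below; the support vector classifier and the parameter grid are illustrative assumptions, not a recommended or optimal grid.

from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# Hypothetical hyper parameter grid; it should be adapted to the data
param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['rbf', 'linear']}

grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)   # hyper parameters that maximized the cross-validated score
print(grid.best_score_)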

A particular version of a library can be installed in Anaconda by specifying the
version along with the library name. For example, conda install conda=4.7.10
will install version 4.7.10 of conda. It is also possible to update a package
by using the update command. For example, conda update conda will update
conda to the latest version in the environment.

13.2 Regression
There are different types of regression including simple linear regression, multiple linear
regression, and nonlinear regression. The difference between linear and nonlinear regression is
that the plot of the model gives a curve in a nonlinear regression and in linear regression, it gives
a line. In true scenario, when we are modeling data for regression analysis, it is found that in rare
cases, the equation of the model is a linear equation giving a linear graph. Generally, the
equation of the model involves mathematical functions of higher degree. However, the objective
of both linear and nonlinear regression is to determine the values of the parameters in the model
to find the line or curve that comes closest to your data.
It should be clear that there is only one dependent variable in regression, while there can be
one or more than one independent variables. Linear regression is an analysis that assesses
whether one or more predictor variables explain the dependent (criterion) variable. The
difference between simple linear regression and multiple linear regression is that multiple linear
regression has more than one independent variables, whereas simple linear regression has only
one independent variable.

13.2.1 Simple Linear Regression


In simple linear regression, there is only one independent and one dependent variable. It is worth
mentioning here that, since there is only one independent variable, only two assumptions of regression, namely normality and linearity, need to be met before applying regression analysis. The two variables are related through an equation and a straight line is plotted if there is a linear
relation between the variables. A nonlinear relationship is shown by a curve. The general
mathematical equation for a linear regression is y = ax + b, where

• y is the response variable;


• x is the predictor variable;

• a and b are constants that are called the coefficients.

Explanation
Two lists named weight and height are created having details of 30 people taken from a sample.
Before applying regression analysis, it is necessary that minimum sample size should be 20.
The preceding program shows the value of skewness and kurtosis of the height and weight.
Since the value of skewness and kurtosis is nearly between –1 and +1, hence we can assume
that the data are normal. The correlation coefficient using Spearman’s method is 0.880 and
using Pearson’s method is 0.884. The correlation coefficient shows that there is a significant
correlation between height and weight. Also, since the p-value is less than 0.05, we reject the
null hypothesis, which further means that there is a significant correlation between weight and
height. The graph also shows that a straight line can be drawn covering all the points
corresponding to weight and height. Thus, the assumption of linearity is also fulfilled. Since
both the assumptions are fulfilled, we can apply regression analysis.
Since linear regression in Python can be applied on DataFrames (2-D inputs) only, both the lists corresponding to weight and height are converted to dataframes using the pandas library. The
command pd.DataFrame() converts list corresponding to height and weight to “heightdf” and
“weightdf” dataframe, respectively. The function LinearRegression() establishes a regression
equation between the two variables. The first variable is dependent variable (weight), while the
second variable is independent variable (height).
The value of adjusted R squared is found to be 0.78, which is good and hence the model can
be used for prediction. The next commands gives details about the value of intercept (–22.88)
and the coefficient of height is 0.519. Thus, the equation formed is Weight = Height × (0.519) –
22.88.

The next section predicts the value of weight depending on the value of height given by the
user. The function creates a dataframe named newheight, which is basically a list of three
heights [172,180,176] and weight is to be predicted for these three heights. The
model.predict() function predicts the value of weight depending upon the value of height
given. We can see that the predicted value of weight for 172 cm is 66.48, which is nearly equal
to the weight, if calculated by the equation (0.519 × 172 – 22.88 = 66.39). Similarly, the weight
for 180 cm height is 70.64 kg and the weight for 176 cm height is 68.56 kg.
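The book's program for this example is shown as a screenshot and is not reproduced here. The sketch below follows the same workflow with short, invented height and weight lists, so the coefficients it produces will differ from the values quoted above (intercept –22.88 and slope 0.519).

import pandas as pd
from scipy.stats import skew, kurtosis, pearsonr, spearmanr
from sklearn.linear_model import LinearRegression

# Illustrative height (cm) and weight (kg) data -- not the book's 30 observations
height = [150, 155, 160, 162, 165, 168, 170, 172, 175, 178, 180, 183]
weight = [52, 55, 58, 60, 62, 64, 66, 68, 70, 72, 74, 77]

# Normality and linearity checks
print(skew(height), kurtosis(height), skew(weight), kurtosis(weight))
print(pearsonr(height, weight), spearmanr(height, weight))

# Convert the lists to DataFrames, since sklearn expects a 2-D input for X
heightdf = pd.DataFrame(height, columns=['height'])
weightdf = pd.DataFrame(weight, columns=['weight'])

model = LinearRegression().fit(heightdf, weightdf)
print(model.intercept_, model.coef_)      # weight = coef * height + intercept
print(model.score(heightdf, weightdf))    # R-squared on the data used for fitting

# Predict weight for new heights
newheight = pd.DataFrame([172, 180, 176], columns=['height'])
print(model.predict(newheight))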

Change the dependent variable and independent variable for the preceding
example. Consider height as a dependent variable and weight as an
independent variable to determine the height of the person depending on the
weight of a person. Determine the regression equation.

USE CASE
RELATIONSHIP BETWEEN BUYING INTENTION AND AWARENESS OF ELECTRIC VEHICLES

By the 1970s, gasoline vehicles had become pervasive in developed and developing countries, and a need was identified for alternative fuel vehicles to tackle the problems of harmful emissions from I.C. engines and to reduce the dependency on crude oil. Increasing pollution, global warming and health issues are a few of the major problems caused by gasoline vehicles. In this
scenario, electric mobility can have profound impact globally and seems to be the strong and
sustainable transportation medium. Unlike conventional vehicles that use a gasoline or diesel-
powered engine, electric vehicle use an electric motor and engine powered by electricity from
batteries. There are two basic types of EVs: all-electric vehicles (AEVs) and plug-in hybrid
electric vehicles (PHEVs). However, in future they will most likely carry lithium-ion phosphate
(LiFePO4) batteries, which are rechargeable and powerful.
Electric vehicles can reduce pollution by improving air quality and are considered to be 95% cleaner. Unlike gas-powered vehicles, they do not produce tailpipe emissions, which are harmful
for health. By utilizing renewable energies, they show a secure and balanced energy option that
is efficient and environmentally friendly. Besides, electric vehicle has a potential to show a
massive economic development by creating new industry and better job opportunities.
Various experts estimate that demand for electric vehicles will accelerate and they may
contribute largely to new-vehicle sales by the end of the next decade. China has set aggressive
EV targets and has become the largest market for electric vehicles with nearly 650,000 EV on
road (1/3 of world’s total). India also has planned for a mass scale shift from gasoline vehicle to
EV by 2030 so that many vehicles on Indian roads will be powered by electricity.
In a world where environmental and energy are growing concerns, there should be a wide
acceptance of electric vehicle technology. Although electric vehicle industry seems to be
effective, but a small percentage of the overall vehicle market is captured. A research can be
carried out to find out people buying intention of electric vehicle in India and their awareness
toward it. Simple linear regression analysis can be used for determining the association between
the two variables. This study will help the government and automobile industry experts to design
their strategies accordingly for better acceptability of electric vehicles in India.

13.2.2 Multiple Linear Regression
Multiple regression is an extension of linear regression into relationship between more than two
variables. In simple linear relation, we have one predictor and one response variable, but in
multiple regression, we have more than one predictor variable and one response variable. The
general mathematical equation for multiple regression is y = a + b1x1 + b2x2 + … + bnxn, where

1. y is the response variable;


2. a, b1, b2, … bn are the coefficients;
3. x1, x2, … xn are the predictor variables.
For better clarity related to regression, we have considered the example of the Boston dataset, which is available in Python's dataset library. The sklearn.datasets library has many datasets
including Boston. However, the description shows that this dataset was taken from
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/. This dataset was taken from
the StatLib library, which is maintained at Carnegie Mellon University.

Explanation
The Boston dataset is loaded from the sklearn.datasets library, which has basically two sets:
boston.data for independent variables and boston.target for dependent variable. The dimension
of boston.data is (506, 13), which means that there are 506 rows and 13 columns; boston.target
also has 506 rows. The data of independent variables is converted to a dataframe named
bostondf by the command pandas.dataframe[boston.data]. A new column named MEDV is
added to the dataframe, which has the data of the boston.target. Thus, the Boston dataframe has
506 observations for 14 variables. Note that 13 variables are considered to be independent
variables and MEDV(median value) is considered to be a dependent variable. The dataset is
then partitioned into training and test dataset using train_test_split() function available in
subpackage named sklearn.model_selection. The utility of seed() function available in random
module is same as in numpy library; it helps to divide the dataset always in the same manner.
Hence, to have similarity in results between our results and your results, we use the seed()
function. The value of argument "test_size" is 0.3, which means that the test dataset is 30%
of original dataset and hence the training dataset is 70% of the original dataset.

The command "x_trg = training.drop('MEDV', axis=1)" drops the medv variable and stores
all the independent variable in x_trg. The command "y_trg = training['MEDV']" considers
only the MEDV variable as the y_trg. Similar execution is done for the test datasets also. The
linear model is then developed using the x_trg and y_trg dataset. The accuracy of the training
dataset is calculated using the score() function and is found to be 0.749. The developed model
is then predicted using the test dataset for independent variables (x_test) using the function
model1.predict(). The predicted values are stored in “pred.” RMSE is used to determine the
difference in the predicted values and the original values. RMSE is calculated using
sqrt(mean_squared_error()), which means that square root of mean_squared_error() will be
calculated. The function mean_squared_error()) is available in sklearn.metrics, which is
imported in the start of the program. The lower the value of RMSE, the better is the model. The
RMSE value is 5.24, which shows that it is not a good prediction and hence steps should be
taken for improving the accuracy of the model. The RMSE can be lowered by decreasing the number of variables, checking the RMSE value after each change of the model and then finally selecting the best model. These steps include increasing the sample size, checking the assumptions to reduce the number of independent variables,
changing the model, data exploration and processing, etc. These steps are discussed in the
following example.

Explanation
This program follows step by step method as discussed earlier in this chapter to determine an
efficient model. The dimension of the dataset is found to be (506, 13). Both boston.data and
boston.target are converted to dataframes using the command pandas.DataFrame(). The details
of the dataframe are printed, which shows that all the variables are continuous in nature besides
“Chas.” The description of “CHAS” shows that it should be considered as a categorical variable
because the variable has maximum value as 1 and minimum value as 0. The value_counts()
function confirms the count of categorical variable. The countplot displays the details of
“CHAS” variable.

Explanation
The output of missing values data shows that there is no missing observation in the dataset.

Explanation
The normality of the data is determined in three different ways: skewness and kurtosis, normaltest() and shapiro(). The skewness and kurtosis of many variables are found to fall outside the range of –1 to +1; the normal test and Shapiro test show that the p-value is significant since it is less than 0.05; hence the data cannot be considered normal.

We will try to remove outliers because outliers affect the normality to a great extent. The next section determines outliers. If the z-score of an observation is found to be greater than 3, it can be considered as an outlier. All the observations whose z-score is greater than 3 are stored in the list named “outlier.” Thus, all the indexes are printed using the command outlierlist[0]. The length of the list shows that there are 100 outliers in the data. The next section removes the observations with indexes listed in the outlier list. Thus, the dimension of the dataset after removing outliers is (415, 13). It is important to remove the same observations from the target dataset also, to keep the number of observations the same in both datasets.
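A minimal sketch of this z-score-based removal, rebuilding the feature and target DataFrames from load_boston (and therefore assuming a scikit-learn version older than 1.2), is shown below; the variable names in the book's screenshot may differ.

import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.datasets import load_boston        # removed in scikit-learn >= 1.2

boston = load_boston()
bostondf = pd.DataFrame(boston.data, columns=boston.feature_names)
targetdf = pd.DataFrame(boston.target, columns=['MEDV'])

z = np.abs(zscore(bostondf))
outlierlist = np.where(z > 3)          # (row indexes, column indexes) of outlying cells
rows = np.unique(outlierlist[0])       # rows containing at least one outlier
print(len(outlierlist[0]), len(rows))

# Remove the same rows from both the feature data and the target data
bostondf = bostondf.drop(bostondf.index[rows])
targetdf = targetdf.drop(targetdf.index[rows])
print(bostondf.shape, targetdf.shape)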

Explanation
This section checks for multicollinearity using the VIF. A function named vifresult() is created, which returns the important predictor variables that fulfill this assumption. Thus, a new dataframe is created, which has only these predictor variables. The dimension of this dataset is found to be (415, 5), which means that there are only five independent variables left and eight variables that do not fulfill this assumption have been deleted.
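The book's vifresult() function is shown only as a screenshot. A function with a similar intent might look like the following sketch, which iteratively drops the predictor with the highest VIF until every remaining VIF is below 4; the exact logic in the book may differ.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vifresult(X, threshold=4.0):
    """Iteratively drop the predictor with the highest VIF until all VIFs
    are below the threshold, and return the remaining predictors."""
    X = X.copy()
    while True:
        vif = pd.Series([variance_inflation_factor(X.values, i)
                         for i in range(X.shape[1])], index=X.columns)
        if vif.max() <= threshold:
            return X
        X = X.drop(columns=[vif.idxmax()])    # drop the worst offender

# Usage (assuming the cleaned bostondf from the previous sketch):
# reduced = vifresult(bostondf)
# print(reduced.shape)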

Explanation
This section determines the linearity between dependent and independent variables. The
Spearman correlation coefficient is used for determining linearity. It is observed that there is
correlation between every independent variable and dependent variable since the correlation
coefficient is found to be significant for all these variables. It can be observed that Spearman
correlation coefficient for “chas” and target variable is determined as NaN because “chas” is
not a continuous variable and is categorical in nature.

Explanation
This section does feature engineering by scaling the data using the StandardScaler() function from the sklearn.preprocessing subpackage.

One more important scaler used for data processing is MinMaxScaler. The
reader can import it from the sklearn.preprocessing library (from
sklearn.preprocessing import MinMaxScaler) to perform preprocessing on the data.

Explanation
The dataset is then divided into training and test dataset after applying seed() function. The
developed model is then predicted using the test dataset for independent variables (x_test) using
the function model2.predict(). The predicted values are stored in “pred.”

In the next step, we checked whether the model fulfills the remaining assumptions. For checking the independence of errors assumption, we applied the Durbin–Watson test on the residuals. A residual is basically the difference between the predicted value and the original value. The result shows that the value is 1.93, which means that this assumption is also fulfilled. Next, we checked the homoscedasticity assumption by plotting a chart between the residuals and the predicted values. We can observe from the chart that the variance seems to be nearly the same; hence this assumption is also fulfilled. If another pattern had been observed, some transformation of the model would have been needed for improving the accuracy.
RMSE is used to determine the difference in the predicted values and the original values.
RMSE is calculated using sqrt(mean_squared_error()), which means that square root of
mean_squared_error() will be calculated. The function mean_squared_error()) is available
in sklearn.metrics, which is imported in the start of the program. The lower the value of RMSE,
the better is the model. The RMSE value is 6.06 by considering only the five variables. This
shows that the model can be considered better than before since we have tried to keep the
accuracy same as before considering very less number of independent variables.

Perform regression on iris dataset available in sklearn.datasets. This dataset


can be downloaded by using the command sklearn.datasets import
load_iris.

USE CASE
APPLICATION OF TECHNOLOGY ACCEPTANCE MODEL IN CLOUD COMPUTING

Cloud computing technology has caused a significant paradigm shift in accessing hardware and
software applications. It is gaining momentum at a high speed for hosting and delivering
services over the Internet since it facilitates resource pooling among many users. It is basically a
collection of servers, databases, technologies, business models and applications, which are
available on a demand and scalable basis and are provided by a service company through the
Internet. Cloud computing is attractive to business owners as it eliminates the requirement for
users to plan ahead for provisioning, and allows enterprises to start and pay for the resources
only when there is a rise in service demand. Developing nations and small enterprises have the
same access to the benefits of cloud computing as large enterprises or developed nations.
Cloud computing is classified into three different layers based on the type of resources
provided by the cloud. The lowest level is Infrastructure-as-a-service (IaaS), which provides
basic hardware components. The middle level is Platform-as-a-Service (PaaS), which provides
developers a platform for developing, testing, deploying and hosting of web applications. The top
level is Software-as-a-Service (SaaS) that provides ready to use applications for the users. Thus,
SaaS providers have complete control, including control over the infrastructure, operating
system, storage and its physical location. On the positive side, access to the SaaS is often very
easy and can be accomplished via a Web browser and from multiple devices including mobile
devices. There are numerous services that can be delivered through cloud computing including
Dynamic Servers, Hosted Desktops, Hosted Email, Hosted Telephony (VOIP) and Cloud
Storage.

The main purpose of cloud computing is to reduce initial investments and potential cost
savings, provide flexibility and scalability, anytime, anywhere on all devices. Cloud computing
provides significantly lower costs along with scalable computing efficiency because
organizations need to spend money only on services that they actually receive along with the
flexibility of adjusting the amount of resources they require. Organizations receiving such
services do not also have to take possession of hardware and all costs associated with it. They
also help in delivering increased efficiencies with reduced complexities, since the users can focus
on their core competency rather than spending time, effort and energy on IT and computational
needs. In addition, by outsourcing the service infrastructure to the clouds, organization can cut
down the hardware maintenance and the staff training costs by shifting business risks to
infrastructure providers, who have adequate disaster recovery services and business continuity
plans from cloud backup.
However, despite the fact that cloud computing offers huge opportunities to the IT industry,
the development of cloud computing technology has many issues which need to be addressed.
Security, process failures and dependability on a third party has been considered as the risk
factors, which are faced by the organizations that implement cloud. Security is considered as an
important threat because with cloud computing, data are spread across wide geographical area
and it is critical because if it is located in another country, the laws of the host country may
affect the security of the data.
Although it is predicted that cloud computing in the Indian market will increase manyfold, we still need to understand its real effect on business. A technology acceptance model comprising four factors (perceived usefulness, perceived ease of use, perceived security and behavioral intention) can be considered as the independent variables, and business process outcomes can be considered as the dependent variable for multiple regression analysis. This will help the organization to determine the contribution of each independent variable for achieving high business process outcomes.

13.2.3 Nonlinear Least Square Regression


In least square regression, we establish a regression model in which the sum of the squares of the
vertical distances of different points from the regression curve is minimized. We generally start
with a defined model and assume some starting values for the coefficients. In Python, nonlinear least squares can be performed with the curve_fit() function from scipy.optimize, which refines the starting values into more accurate estimates along with their covariance (from which confidence intervals can be obtained). On finding these values, we will be able to estimate the response variable with good accuracy. A minimal sketch is given after the syntax below.
Syntax
curve_fit(f, xdata, ydata, p0)
where

• f is the model function that maps the independent variable and the parameters to the response;

• xdata and ydata contain the observed values of the independent and dependent variables;
• p0 is a sequence of starting estimates for the parameters.
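A minimal sketch of curve_fit() on synthetic data is given below; the exponential model and its parameter values are invented for illustration.

import numpy as np
from scipy.optimize import curve_fit

# Illustrative nonlinear model: y = a * exp(b * x) + c
def model(x, a, b, c):
    return a * np.exp(b * x) + c

# Synthetic data generated from known parameters plus noise
rng = np.random.default_rng(1)
xdata = np.linspace(0, 4, 50)
ydata = model(xdata, 2.5, 0.8, 1.0) + rng.normal(scale=0.5, size=xdata.size)

# p0 provides the starting estimates for the parameters
popt, pcov = curve_fit(model, xdata, ydata, p0=[1.0, 1.0, 1.0])
print(popt)                         # estimated a, b, c
print(np.sqrt(np.diag(pcov)))       # standard errors of the estimates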

Ordinary least squares is one of the methods to find the best fit line for a dataset using linear
regression. The most common application is to create a straight line that minimizes the sum of
squares of the errors generated from the differences in the observed value and the value anticipated from the model. However, least-squares problems fall into two categories: linear and
nonlinear squares, depending on whether or not the residuals are linear in all unknowns. We will
consider the prestige dataset that can be downloaded from
https://www.kaggle.com/tmcketterick/job-prestige

Explanation
The regression model considering OLS is developed using OLS() function available in
statsmodels.api. The model is then fitted using fit() and displayed using the summary() function. The
summary of the model displays the coefficients of the independent variables and value of
variance (R-squared) along with the necessary other parameters of the regression model. The
model shows that the coefficient of independent variables: coefficient of education is 3.7176;
coefficient of income is 0.0014; coefficient of women is 0.0139 and census is 0.0003. The value
of R-squared is 97.6, which suggest that the total variance explained from the independent
variables is 97.6%, which further means that the developed model is excellent. We can observe
that for this dataset and model, the value of R-squared using OLS regression is much higher
than using linear regression (80%).

This model displays the value of Durbin–Watson test and this assumption is considered to
be met if the value in nearly equal to 2. The value of Durbin–Watson test is 1.657, which shows
that this assumption is fulfilled.
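A minimal sketch of this OLS model is given below; it assumes the Kaggle file has been saved locally under the hypothetical name prestige.csv with columns education, income, women, census and prestige. Note that sm.OLS() does not add an intercept automatically, which is consistent with the high (uncentered) R-squared quoted in the explanation; add sm.add_constant() if an intercept term is required.

import pandas as pd
import statsmodels.api as sm

# Hypothetical local filename for the downloaded Kaggle dataset
prestige = pd.read_csv('prestige.csv')

X = prestige[['education', 'income', 'women', 'census']]
y = prestige['prestige']

# sm.OLS() does not add an intercept by default
model = sm.OLS(y, X).fit()
print(model.summary())    # coefficients, R-squared, Durbin-Watson, etc.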

USE CASE
IMPACT OF SOCIAL NETWORKING WEBSITES ON QUALITY OF RECRUITMENT

Social networking sites are freely accessible and people are spending more time on them through
latest electronic gadgets such as cell phones, laptops, and palmtops. These sites allow individual
to construct a public or semi-public profile and provides them the freedom to express themselves,
which is not possible through other outlets. It also helps in creating a virtual community for
people interested in a particular subject to share a common platform and to increase their circle
of acquaintances. Users of these sites can view complete details of other persons, photos, videos,
comments, their groups, etc. However, the nature and nomenclature of these connections may
vary from site to site.
For organizations and HR professionals in particular, social networking sites help in social
recruitment by using social media platforms to advertise jobs, find talent, communicate with
potential recruits and is generally effective for finding passive candidates. Due to the increased
competition for searching top talent, recruiting in present scenario has become more
challenging and organizations need to computerize their talent acquisition strategies. These sites
have revolutionized the exchange of information between recruiters and job seekers. They can
start building connections to people and relevant groups for effective recruitment and for job
opportunities, respectively. Job seekers can use professional social media to promote their
personal achievements to prospective employers and peers. With an effective social recruiting
strategy, organizations can build relationships to convince ideal candidates to leave their
current role and join them. In order to maximize the benefits and get the best ROI, HR professionals need to keep themselves updated with the social media world and need to think outside the box to attract the best possible talent.
The recruiter always wanted the largest pool of most qualified and talented applicants.
Earlier, a lot of resistance was raised from HR professionals, when social media was used for
recruitment, but later it became a common practice in the industry. After the Millennial, Gen Z
has also posed a challenge to HR professionals to attract and convert job seekers. Gen Z is very
versatile, easily adapt to new technologies and ideas and can work at any time of the day. This
generation, which encompasses a larger share of the job market, has grown up in a digital world and is active on social media through the latest networks.
The advantages of including social media in your recruitment strategy include setting up social media pages very fast at low prices and ease of live recruitment of new employees.
Posting a simple message and connecting job seekers on camera will help in instant verbal
communication irrespective of geographic area and complicated appointment times. Recruiters
can evaluate their profile, contacts, recommendations from peers, managers and colleagues,
membership in a groups relevant to their field.
LinkedIn has now become the platform for recruiters and candidates and is considered more
than just a professional social network. In fact, this is the way it differentiated itself from
Facebook and Twitter. LinkedIn is a directory of professionals organized according to different
categories such as industry, company, and job title. It helps organizations to pay for posting jobs and search for candidates or can also buy job credits at a lesser fee. In case of no budget,
recruiters without posting a job can broadcast that they are hiring people of the specified
criteria and people can contact him accordingly. Although Facebook Marketplace allows us to post a job for free by providing basic information such as location, job category, and designation, this does not help to target it to a specific group of people, unlike Facebook
Advertisement. Facebook Pages are another free resource that enables to share business and
products with Facebook users. People start following pages quickly afterwards, if the company
is well known and jobs are promising. Twitter can help to tweet about the company and required
jobs and can be effective for small companies to get an edge over the competition. Hash tags are
used as a way to filter and find information on Twitter and help in making extraordinary job
posting tweets. Instagram has also changed the game of social recruitment by releasing stories.
Although there are numerous benefits of social recruiting, there is room for mistakes
also. According to some employers, Facebook and Twitter are related to an individual’s
personal life, and therefore not a helpful tool in the recruitment process. Some professionals
have the opinion that it would be inappropriate for a potential employer to use personal social
media profiles in the recruitment process. However, drawing clear distinction between
professional capabilities and personal lives can resolve the problem. It is also important for
organizations to consider the purpose of reviewing personal social media profiles and manage
the risk associated with violating data protection laws, when personal social media in used for
the recruitment process.
The questions that arise are whether employers can make the best use of social media as part of
the recruitment process and whether social recruiting can bring in the right candidates. They
need to examine which platforms are best for posting job information. In order to reach the
right audience effectively, it is important to have adequate content management through the
organization's page and to participate in the right conversations with the right people by using
filters. An experienced recruitment consultant can make effective use of professional social
media for this purpose.
Some recruiters consider that competition for candidates will increase and some companies
have lost money due to inefficient recruiting. With more people engaging with social media, it is
worth investigating its relevance to the recruitment process. Social media recruiting requires
time and effort, and hence it is important to determine whether this investment returns long-term
benefits to the organization.
A study can be undertaken to understand the impact of social networking websites on quality
of recruitment by considering HR professionals as respondents. Factors of social networking
sites can be considered as independent variables, while the quality of recruitment can be
considered as the dependent variable. A regression analysis can help determine the contribution
of each factor toward achieving the desired quality. Such a study will prove beneficial to
organizations in framing recruitment strategies.

13.3 Classification
A classification problem occurs in situations in which the dependent variable is a categorical
variable, having binary values such as Pass/Fail, Yes/No, or 0/1, and is predicted from a given set
of independent variables. The model essentially estimates the probability of event = Success and
event = Failure. The characteristics of logistic regression help to solve many real-world problems;
in fact, much of the data occurring in real-world problems requires the concept of classification.
It is important to mention here that most of the usual regression assumptions need not be fulfilled
before applying logistic regression. First, logistic regression does not require a linear relationship
between the dependent and independent variables. Second, the error terms (residuals) do not need
to be normally distributed (the Durbin–Watson test is not required). Third, homoscedasticity is not
required. However, logistic regression works better when the data are approximately normal.

The reader is suggested to use label encoding and one hot encoding for data
preprocessing of a categorical dependent variable. These are available as the
LabelEncoder and OneHotEncoder classes in the sklearn.preprocessing package
and can be imported using the commands
from sklearn.preprocessing import LabelEncoder and
from sklearn.preprocessing import OneHotEncoder, respectively.
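As a minimal sketch of how these two encoders might be applied (the column name city and its values are hypothetical, not taken from the datasets used in this chapter):

# A minimal sketch with a hypothetical categorical column.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Delhi', 'Chennai']})

le = LabelEncoder()
df['city_code'] = le.fit_transform(df['city'])        # each category mapped to an integer

ohe = OneHotEncoder()
dummies = ohe.fit_transform(df[['city']]).toarray()   # one binary indicator column per category
print(df)
print(dummies)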

For the implementation of logistic regression, we will consider the wine dataset from the
sklearn.datasets subpackage. In the following example, no data processing is done and the
accuracy is determined after predicting the values.
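Since the worked-out code appears as a screenshot in the original, a minimal sketch of the steps described in the explanation below is given here; the variable names and the split parameters (test_size and random_state) are assumptions.

# A minimal sketch of the steps described in the explanation (variable names are assumptions).
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

wine = load_wine()
x = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target, name='target')
print(x.isnull().sum())                         # check for missing values

x_trg, x_test, y_trg, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

logreg = LogisticRegression(max_iter=10000)     # higher max_iter since the data are unscaled
logreg.fit(x_trg, y_trg)
print(logreg.score(x_trg, y_trg))               # training set score
print(logreg.score(x_test, y_test))             # test set score

y_pred = logreg.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))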

Explanation
The wine dataset is loaded from the sklearn.datasets subpackage. The description of the dataset
shows that there are 178 observations and 13 variables. Both wine.data and wine.target are
converted to dataframes. The data have no missing values. The dataset is then partitioned
into training and test datasets. The dimensions of the training dataset are (153, 13), while the
dimensions of the test dataset are (45, 13). The training and test set scores are found to be 0.9845
and 0.9334, respectively. It is also possible to determine the accuracy of the model through the
confusion matrix because the dependent variable is categorical in nature. The accuracy score
displays the accuracy of the model and the confusion matrix displays the number of correct and
incorrect classifications. The accuracy of the model is 93.33%, which is considered excellent. The
confusion matrix shows that 14 + 20 + 8 observations are correctly classified. This means that
42 observations are predicted as the same class as in the test dataset. Thus, out of 45
observations of the test dataset, only three were wrongly predicted. Hence, an
accuracy of 42/45 = 0.933 was achieved.

The model depicts overfitting, since there is a big difference between the training and test dataset
scores. Hence, there is a need to do data processing to avoid overfitting and to increase the
accuracy of the model. The following section tries to increase the accuracy of the model through
effective data processing.
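Since the processed version of the code also appears as a screenshot, a minimal sketch of the preprocessing steps described in the explanation below is given here; the z-score cutoff, split parameters, and variable names are assumptions.

# A minimal sketch of the preprocessing described in the explanation (parameter values are assumptions).
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

wine = load_wine()
x = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target, name='target')

print(x.skew())                                   # normality check through skewness
print(x.kurtosis())                               # and kurtosis

z = np.abs(stats.zscore(x))                       # z-scores for outlier detection
mask = (z < 3).all(axis=1)                        # keep rows whose z-scores are all below 3
x, y = x[mask], y[mask]

x_scaled = StandardScaler().fit_transform(x)      # feature scaling

x_trg, x_test, y_trg, y_test = train_test_split(x_scaled, y, test_size=0.25, random_state=0)
logreg = LogisticRegression().fit(x_trg, y_trg)
print(logreg.score(x_trg, y_trg), logreg.score(x_test, y_test))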

Explanation
The normality of the data is determined through skewness and kurtosis. We can observe that the
values lie nearly between –1.0 and +1.0; hence, the data can be considered normal. To further
improve normality, we identified the outliers in the data by determining the z-score. We observe
that there are 11 outliers whose z-score is >3. When we remove these outliers from the data, the
dimension of the dataset reduces to (168, 13). Feature engineering is then done by applying
scaling to the data. The dimensions of the training and test datasets also reduce to (126, 13) and
(42, 13), respectively. The model is then developed and prediction on the test dataset is done.
We can observe that the accuracy of both the training and test datasets increases to 100%, which
is excellent. Thus, by doing effective data processing we have avoided overfitting and increased
the accuracy to 100%, and this model can be considered for future use.

Perform logistic regression on the breast cancer dataset available in
sklearn.datasets. This dataset can be loaded by using the command
from sklearn.datasets import load_breast_cancer.

USE CASE
PREDICTION OF CUSTOMER BUYING INTENTION DUE TO DIGITAL MARKETING

Digital marketing, electronic marketing, e-marketing and Internet marketing are synonymous terms
that simply mean encouraging customer communications through a company’s own website, online
advertisements, emails, mobile phones (both SMS and MMS), social media marketing, display
advertising, search engine marketing and many other forms of digital media. It mainly uses the
Internet as the core promotional medium, in addition to mobile and traditional TV and radio. It
refers to the various promotional techniques deployed for products or services through digital
technologies. It is basically a strategy that helps to reach clients effectively by establishing
innovative practices, combining technology with traditional marketing strategies and using
digital instruments to improve customer knowledge by matching their needs.
The various forms of digital marketing include content marketing, viral marketing, online
advertising, email marketing, social media, text messaging, affiliate marketing, search engine
optimization, search engine marketing, audio marketing, website content, YouTube, webinars,
pay-per-click, Google Analytics, e-newsletters, display advertising, web banner advertising, pop-
ups, side-panel ads, coupons, etc.
The dimensions of Internet usage have changed, with people using mobile devices to access
the network. The usage of smartphones, tablets and other mobile devices has drastically increased
the potential of the mobile market, and people all over the world have started connecting with each
other more conveniently through social media. Marketing has grown because of these
communication media, and the smartphone in particular has taken digital convergence to a great
height. The heightened attention paid to digital marketing is a sign that significant benefits can be
gained from it. The benefits of digital marketing include greater influence and interest due to
multimedia compatibility, exponential speed of message delivery, accuracy and usefulness,
timeliness, 24-hour availability, global reach, interactivity, micromarketing compatibility,
integration readiness, better accessibility and navigation, high efficiency, penetrating power,
content sharing using multiple applications, low technical requirements, personalized cross-
platform interaction, real-time publication and sharing across multiple applications, availability
of online reviews and recommendations, etc.
Message content is an important factor in digital marketing because inaccurate, improper, or
useless messages could reduce the efficiency of and trust in the medium. The other factors related
to message content include interesting and customized advertisement information, variety of
messages, appropriate message delivery timings, right message frequency and a less manipulative
tone. The other problems in digital marketing include customers' hesitation in sharing their
personal information, low source credibility, low blogger trustworthiness, technological
unfriendliness, and lack of awareness.
As consumers spend a large number of hours online, digital marketing is rapidly becoming an
important and popular communication tool for consumer engagement and brand building.
However, the effectiveness of digital marketing will be greater if we can understand how
consumers react or respond to the tools of digital marketing. It is extremely important to tap the
right kind of consumer behavior and attitude to leverage the opportunities available to marketers.
Hence, professionals might need to observe the factors that affect customer buying behavior due
to digital marketing techniques. The different factors discussed above related to digital marketing
can be considered as independent variables. Classification can be used as a tool for predicting
customer buying intention, considering these factors as independent variables; the dependent
variable will be the buying intention, which is a nominal variable with two levels of categories:
1 = buy the product and 2 = not buy the product. This will help professionals to predict the
buying intention of customers depending on the values of the factors corresponding to digital
marketing.
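A hedged sketch of how such a model might be set up is given below; the file digital_survey.csv and the column name buy_intention are hypothetical placeholders for data collected through a survey.

# A hedged sketch with a hypothetical survey file and column names.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

survey = pd.read_csv('digital_survey.csv')             # hypothetical survey data (numeric factor scores)
x = survey.drop('buy_intention', axis=1)               # digital marketing factors as independent variables
y = survey['buy_intention']                            # 1 = buy the product, 2 = not buy the product

x_trg, x_test, y_trg, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(x_trg, y_trg)
print(accuracy_score(y_test, model.predict(x_test)))   # how well buying intention is predicted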

Summary
• There are different types of regression, including simple linear regression, multiple linear
regression, ordinary least squares regression, and logistic regression. In logistic regression, the
goal is to predict a categorical outcome, while in linear regression the goal is to predict a
continuous number depending upon independent variables of any type.
• There are some important assumptions that need to be fulfilled for doing regression analysis,
which includes normality of variables, linearity, independence of errors, homoscedasticity,
and multicollinearity.

• Regression assumes that variables should have normal distributions. There are many different
ways for checking the assumption of normality including Shapiro test, outlier test, skewness
and kurtosis, and normality test.
• A significant correlation between each independent variable(s) and dependent variable
confirms the linearity.
• Independence of errors assumption is met, if there is little or no autocorrelation in the data.
Autocorrelation occurs when the residuals are not independent from each other. The Durbin–
Watson test is used to check this assumption.
• Homoscedasticity means that the variance of errors is the same across all levels of the
independent variables. The Breusch–Pagan test is used to check this assumption.
• Multicollinearity occurs when we have two or more independent variables that are highly
correlated with each other. Multicollinearity is detected by VIF: If the square root of VIF is
greater than 2, it indicates a multicollinearity problem.
• There is always one dependent variable in regression and classification. In simple linear
regression, there is only one independent variable, while in multiple and logistic regression,
there are many independent variables.
• In nonlinear least square regression, we establish a regression model in which the sum of the
squares of the vertical distances of different points from the regression curve is minimized.
• Logistic regression occurs in situations in which the dependent variable is a categorical
variable, having binary values such as Pass/Fail, Yes/No, or 0/1, and is predicted given a set
of independent variables.
• Simple linear regression analysis in Python is done using LinearRegression() function.
• RMSE is calculated using sqrt(mean_squared_error()), that is, the square root of
mean_squared_error() is taken. The mean_squared_error() function is available
in sklearn.metrics. The lower the value of RMSE, the better the model.
• Logistic regression model is developed using LogisticRegression() function, which is
available in sklearn.linear_model library.
• It is possible to determine the accuracy score and confusion matrix in logistic regression
model because the dependent variable is categorical in nature. The accuracy score displays
the accuracy of the model and confusion matrix displays the number of right and wrong
classifications.

Multiple-Choice Questions

1. The process of logistic regression does not have the utility of the following function:
(a) train_test_split()
(b) seed()
(c) confusion_matrix()
(d) cov()
2. Functions for determining the accuracy of regression problems are found in
___________ package.
(a) linear_model
(b) metrics

(c) model_selection
(d) accuracy
3. Prediction of dependent variable is done using the _____________ function.
(a) predict()
(b) seed()
(c) result()
(d) ans()
4. In _____________ regression problems, the dependent variable is categorical in nature.
(a) Simple linear
(b) Multiple linear
(c) Logistic
(d) Nonlinear
5. Ordinary least square regression model is created using _______________ function.
(a) OLS()
(b) OLSR()
(c) LSR()
(d) model()
6. The function used for partitioning the dataset is _______________.
(a) split()
(b) train_test_split()
(c) traintest_split()
(d) train_test()
7. Logistic regression model is created using the _______________function.
(a) logreg()
(b) LogisticRegression()
(c) logmodel()
(d) logregmodel()
8. An observation is termed as an outlier, if value of z-score is:
(a) <3
(b) >3
(c) <10
(d) >10
9. The _________ function is used to determine difference between actual and predicted
values in linear regression.
(a) mse()
(b) rmse()
(c) difference()
(d) result()
10. Functions for developing regression models are found in ____________ package.
(a) linear_model
(b) metrics
(c) model_selection
(d) accuracy

Review Questions

1. Differentiate between the process of determining accuracy of linear and logistic regression
model.
2. How and why do we create training and test datasets?
3. Discuss the situations where simple linear, multiple linear, and logistic regression analysis
are applied.
4. What are the assumptions that need to be fulfilled before applying multiple regression
analysis?
5. Discuss all the steps of executing the complete process of regression analysis.
6. Discuss the result produced by the confusion matrix.
7. Discuss the steps to do data exploration and preparation.
8. What is the need of identifying outliers? Explain with an example.
9. Explain the importance of handling missing values with an example.
10. Explain the importance of creating features with an example.

CHAPTER
14

Supervised Machine Learning
Algorithms

Learning Objectives
After reading this chapter, you will be able to
• Get familiarity with different supervised machine learning algorithms.
• Apply the knowledge of ML algorithms to solve real-world cases.
• Implement supervised ML algorithms using Python.
• Develop the analytical skill for interpreting supervised ML algorithms.

Supervised machine learning algorithms are used when we have labeled data and are trying
to find a relationship model from the user’s data. The machine learns to predict the output from
the input data. These algorithms generate functions that map inputs to desired outputs and involve
a dependent variable, which is predicted from a given set of independent variables. The training
process continues until the model achieves a desired level of accuracy on the training data.
Supervised learning is used whenever we want to predict a certain outcome from a given input
and we have data for both input and output. In supervised learning algorithms, we first partition
the data into two sets: training and test. The training dataset contains the major proportion and
the test dataset the minor proportion of the available data. The ratio can be 60/40, 70/30, 75/25,
or 80/20, depending upon the user's choice. A model is developed on the basis of the training
dataset. Our main objective is to make accurate predictions for new test data. The model is then
applied to the test dataset to determine the predictions. The predicted values are compared with
the original values in the test dataset to determine the accuracy of the model. Algorithms of
supervised learning include Naive Bayes, k-NN, Decision Tree, and Support Vector Machines.
The Naive Bayes algorithm is applicable only for classification problems; the rest of the
algorithms can be used for both classification and regression problems, as discussed in this
chapter.
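As a generic illustration of this workflow, a minimal sketch using an illustrative dataset and estimator is given below; the 75/25 split, the iris dataset, and the choice of estimator are assumptions made only for the example.

# A minimal sketch of the supervised learning workflow (illustrative dataset and estimator).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

x, y = load_iris(return_X_y=True)
x_trg, x_test, y_trg, y_test = train_test_split(x, y, test_size=0.25, random_state=0)  # 75/25 split

model = KNeighborsClassifier()          # any supervised estimator can be substituted here
model.fit(x_trg, y_trg)                 # model learns from the training data
y_pred = model.predict(x_test)          # predictions for the unseen test data
print(accuracy_score(y_test, y_pred))   # compare predictions with the actual test labels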

14.1 Naive Bayes Algorithm


This algorithm is quite similar to the logistic regression algorithm. However, it is faster in
training because it learns parameters by looking at each feature individually and collects simple
per-class statistics from each feature. In other words, it is a classification technique based on
Bayes' theorem with an assumption of independence among predictors (it assumes that the
presence of a particular feature in a class is unrelated to the presence of any other feature). For
example, a vegetable may be considered to be a brinjal if its color is violet and its shape is oval.
Even if these features depend on each other or upon the existence of the other features, all of
these properties independently contribute to the probability that this vegetable is a brinjal, and
that is why it is known as “Naive.” The Naive Bayes model is easy to build and particularly
useful for very large datasets. Along with its simplicity, Naive Bayes is known to outperform
even highly sophisticated classification methods. However, it is important to ensure that
continuous features follow a normal distribution.
However, it is easy to improve the power of this basic model by tuning parameters and
handling assumptions. If continuous features do not follow a normal distribution, we should use a
transformation or other methods to convert them to a normal distribution. As the model assumes
independence among the predictors, it is suggested to remove correlated features, because two
highly correlated features will effectively be voted twice in the model, which can over-inflate
their importance.
This model is generally used when the dimensionality of the input is very high. This
classifier assumes that the presence of a particular feature in a class is unrelated to the presence
of any other feature. Bayes' theorem is represented as P(Y/X) = P(X/Y) P(Y)/P(X).
Thus, the model basically calculates the probability of Y for a given X, where X is the observed
event and Y is the dependent event.
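As an illustrative example with hypothetical numbers: suppose 30% of emails are spam, so P(Y) = 0.3; the word "offer" appears in 40% of spam emails, so P(X/Y) = 0.4; and it appears in 19% of all emails, so P(X) = 0.19. Then P(Y/X) = (0.4 × 0.3)/0.19 ≈ 0.63, that is, an email containing the word has roughly a 63% probability of being spam.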

14.1.1 Naive Bayes for Classification Problems


The Naive Bayes algorithm can be applied using the GaussianNB() function from the
sklearn.naive_bayes subpackage. To understand the utility of the Naive Bayes algorithm for
classification problems, we will consider the occupancy detection dataset, which is downloaded
from https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection. Three files are downloaded
for the dataset, namely datatest.txt, datatest2.txt, and datatraining.txt. These files are converted
to csv files, and the date column is deleted because the date variable is not important in this
dataset and hence cannot be considered as an independent variable. The training dataset is named
"occu_trg.csv" and the test dataset is named "occu_test.csv". The details of the dataset
are as follows:

Date time: year-month-day hour:minute:second
Temperature, in Celsius
Relative Humidity, in %
Light, in Lux
CO2, in ppm
Humidity ratio: derived quantity from temperature and relative humidity, in kg-water-vapor/kg-air
Occupancy: 0 or 1, 0 for not occupied, 1 for occupied status
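The worked-out code appears as a screenshot in the original; a minimal sketch of the steps described in the explanation below is given here, with the column name Occupancy taken from the UCI file and the remaining variable names as assumptions.

# A minimal sketch of the Naive Bayes classification described in the explanation.
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

training = pd.read_csv('occu_trg.csv')
test = pd.read_csv('occu_test.csv')

x_trg = training.drop('Occupancy', axis=1)    # independent variables
y_trg = training['Occupancy']                 # dependent variable
x_test = test.drop('Occupancy', axis=1)
y_test = test['Occupancy']

nb = GaussianNB()
nb.fit(x_trg, y_trg)
print(nb.score(x_trg, y_trg))                 # training set score
print(nb.score(x_test, y_test))               # test set score

y_pred = nb.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))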

Explanation
The dimensions of the training and test datasets are found to be (8143, 6) and (2665, 6),
respectively. The training set score and test set score are found to be 0.9789 and 0.9775,
respectively. Since there is no big difference between the two scores, the model is a case of
neither underfitting nor overfitting. The accuracy of the Naive Bayes model is found to be
0.9775, which is slightly more than the accuracy of the logistic regression model (0.9771)
created for this dataset.

USE CASE
MEASURING ACCEPTABILITY OF A NEW PRODUCT

Business owners hoping to launch a new product face many risks. Launching a new product
involves investment of enormous amount of resources including time, energy, and money. Hence,
evaluating the product’s acceptability is an important step to minimize the risk of the new
product.
Before launching the new product, a company should evaluate the acceptability of the
product in the market. It should then improve the acceptability by finding out the key parameters
and improving them. The company can integrate acceptability evaluation model during the
product development process along with expert knowledge.
The different types of risks that exist in launching a new product include demand risk (failure
to generate demand), operational risk (delayed launch due to production issues), quality risk
(brand, features, design), and price risks (price war with a competitor). Different models can be
developed to minimize each type of risk by identifying different parameters contributing to
particular risk and measuring the effect of those parameters.
To understand the key features and their importance, let us consider the example of a car.
For evaluating the key features of the car, the independent variables (source:
https://archive.ics.uci.edu/ml/datasets/automobile) will include both categorical and continuous
variables.

The different categorical variables of car include make (Audi, BMW, Chevrolet, Dodge,
Honda, Isuzu, Jaguar, Mazda, Mercedes Benz, Mercury, Mitsubishi, Nissan, Peugeot, Plymouth,
Porsche, Renault, Saab, Subaru, Toyota, Volkswagen, Volvo), fuel type (diesel, gas), aspiration
(standard, turbo), number of doors (four, two), body style (hardtop, wagon, sedan, hatchback,
convertible), drive wheels (4wd, fwd, rwd), engine location (front, rear), engine type (dohc,
dohcv, l, ohc, ohcf, ohcv, rotor), number of cylinders (eight, five, four, six, three, twelve, two),
buying (high, low, medium), maintenance (high, low, medium), lug_boot (“big”, “med”,
“small”), and persons (two, four, more) and safety (high, low, medium).
Some other continuous variables include wheel base, length, width, height, curb weight,
engine size, fuel system, bore, stroke, compression ratio, horsepower, peak rpm, city mpg,
highway mpg, and price.
The organization can develop a Naive Bayes classification model based on their past project
experience to make useful estimations of improvement scenarios related to car. For evaluating
quality and price risk, the dependent variable will be based on a user’s perception related to
acceptability, which will be a categorical variable. Important factors that lead to “acceptability”
and “unacceptability” can be identified and can be considered before forming a new strategy.

14.2 k-Nearest Neighbor’s Algorithm


The k-nearest neighbors (k-NN) algorithm is a nonparametric method used for both classification
and regression problems. Since it is a supervised learning method, all the data are labeled and the
algorithm learns to predict the output from the input data. It performs well even if the training
data are large. When k-NN is used for regression problems, the prediction is based on the mean
or the median of the k most similar instances. When k-NN is used for classification, the output
is based on the mode. If the number of classes is even, k should be assigned an odd number to
avoid a tie, and vice versa. The k-NN algorithm gives better results if the same scale is used for
all the data. It works well with a small number of input variables, but struggles when the number
of inputs is large.
The most important parameter of the k-NN algorithm is k, which specifies the number of
neighboring observations that contribute to the output predictions. Optimal values for k are
obtained mainly through cross-validation. Cross-validation is a smart way to find the optimal
k value. It estimates the validation error rate by holding out a subset of the training set from the
model-building process; it involves randomly dividing the training set into groups, or folds, of
approximately equal size. However, before selecting k-NN, it is important to understand that
k-NN is computationally expensive, that the variables should be normalized (otherwise variables
with higher ranges can bias the results), and that outliers and noise should be removed before
applying k-NN.
k-NN makes predictions using the training dataset directly. The value of k can be found by
algorithm tuning; it is a good idea to try many different values of k and determine the value that
is best for the problem. The computational complexity of k-NN increases with the size of the
training dataset. Hence, for very large training sets, k-NN can be made stochastic by taking a
sample from the training dataset from which to calculate the k most similar instances.
There are a number of different distance measures, namely Euclidean, Manhattan, and
Minkowski. The most commonly used distance measure is Euclidean. For two observations x
and y with n features, it is calculated as
d(x, y) = √[∑(xi − yi)²], i = 1, 2, 3, …, n

14.2.1 k-NN for Classification Problems


The kNN algorithm for classification can be applied using the KNeighborsClassifier() function
from the sklearn.neighbors subpackage. An important parameter of this function is the value of k.
To understand the utility of the k-NN algorithm for classification in Python, we will consider the
breast_cancer dataset from sklearn.datasets.
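The worked-out code appears as a screenshot in the original; a minimal sketch of the loop described in the explanation below is given here, with variable names following the explanation and the split parameters as assumptions.

# A minimal sketch of the kNN classification loop described in the explanation.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

cancer = load_breast_cancer()
x_trg, x_test, y_trg, y_test = train_test_split(cancer.data, cancer.target,
                                                test_size=0.25, random_state=0)

knn_accuracylist = []                              # stores the accuracy of each model
for k in range(1, 21):                             # k = 1 to 20
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_trg, y_trg)
    knn_acc_score = accuracy_score(y_test, knn.predict(x_test))
    knn_accuracylist.append(knn_acc_score)

print(max(knn_accuracylist))                       # best accuracy across all values of k

curve = pd.Series(knn_accuracylist, index=range(1, 21))
curve.plot()                                       # accuracy plotted against k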

Explanation
The breast cancer dataset is loaded and stored in cancer. The cancer.data shows that there are
569 rows (observations) and 30 columns (independent variables). Since the number of
neighbors plays an important role in the kNN model, we have created 20 different kNN
models using a for loop from 1 to 20 for the different values of k. The knn_accuracylist is an
empty list created for storing the accuracy of each model. Thus, every time the for loop is
executed, a kNN model is created and its accuracy is computed and stored in the list.
The list finally stores the accuracy of the 20 different kNN models. The
KNeighborsClassifier() function creates a kNN model for a classification problem. The
command knn_accuracylist.append(knn_acc_score) adds the new element
corresponding to the accuracy of the new model to the list. The max(knn_accuracylist)
determines the maximum accuracy from the list.
We can observe that the maximum accuracy of the kNN model is 0.9650. The command
curve.plot() plots the figure showing the accuracy of the model for different values of k.
From the plot also, it is observed that the maximum accuracy is slightly less than 0.97.

The Naive Bayes model shows an accuracy of 0.9161, which is lower than that of the kNN model.
Similarly, the accuracy of the logistic regression model is 0.9580, which is also lower than that of
the kNN model. Thus, for this dataset, kNN can be considered the best model.

USE CASE
PREDICTING PHISHING WEBSITES

Phishing is an online deception technique in which a hacker uses an e-mail or a website that
looks reputable and honest to obtain confidential information such as user names and
passwords. Phishing attacks cost banks and credit card issuers heavily. It is a new Internet crime
and semantic attack, which targets the user rather than the computer. To start, the hacker sends
a message that appears to be from an authentic source and records the information entered by
victims into webpages. The hackers use this information to make illegal purchases or for
committing other frauds. They then evaluate the successes and failures of the attack and repeat
again.
A phishing website can be easily identified from the list of blacklisted websites. However, this
list cannot cover all phishing websites as a new website can be created within a short span of
time. Hence, another approach for recognizing newly created phishing websites needs to be
designed. The accuracy of this approach will depend on the discriminative features. For
accurate and effective classification of websites, it is important to determine these factors
properly.
Several researchers have identified important factors for effectively predicting phishing
websites, which include abnormal behaviors [abnormal URL, abnormal DNS record, abnormal
anchors, server-form-handler, abnormal cookie, certificate authority, distinguished names
certificate and abnormal secure sockets layer (SSL) certificate], page style and contents
(spelling errors, copy website, using forms with submit button, using popups windows, disabling
right click), web address bar (long URL address, replacing similar char for URL, adding a
prefix or suffix, using the @ symbol to confuse, using hexadecimal char code), address bar
(using IP address, long URL to hide the suspicious part, subdomain and multi-subdomains),
HTML and Java Script based features (website forwarding, status bar customization, disabling
right click, using pop-up window, iframe redirection); domainbased features (age of domain,
DNS record, website traffic, page rank, Google index, number of links pointing to page,
statistical reports based feature), submitting information to email, etc.
Predicting and stopping phishing attacks is a critical step toward protecting online
transactions. The different factors discussed previously can be considered as independent
variables and the nature of website (phishing or not phishing) can be considered as a dependent
variable. A training dataset can be obtained after an extensive survey. The k-NN
classification algorithm can then be applied for the prediction of phishing websites based on the
model developed from the training data. This will help differentiate between honest and phishing
websites based on the features extracted from the visited website.

14.2.2 k-NN for Regression Problems


The kNN algorithm for regression problems can be applied using the KNeighborsRegressor()
function available in the sklearn.neighbors library. To understand the utility of kNN regression in
Python, we will consider the dataset from https://archive.ics.uci.edu/ml/datasets/Computer+Hardware.
Dataset Information:

1. Vendor name: 30
2. Model name: many unique symbols
3. MYCT: machine cycle time in nanoseconds (integer)
4. MMIN: minimum main memory in kilobytes (integer)
5. MMAX: maximum main memory in kilobytes (integer)
6. CACH: cache memory in kilobytes (integer)
7. CHMIN: minimum channels in units (integer)
8. CHMAX: maximum channels in units (integer)
9. PRP: published relative performance (integer)
10. ERP: estimated relative performance from the original article (integer)
For the analysis, we are considering ERP as the dependent variable and the remaining seven
attributes as independent variables. We have deleted the vendor name and model from the dataset,
since they cannot be considered as independent variables. Thus, our dataset has eight columns.
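The worked-out code appears as a screenshot in the original; a minimal sketch of the regression loop described in the explanation below is given here. The file name machine.csv, the column headers, and the split parameters are assumptions based on the dataset description.

# A minimal sketch of the kNN regression loop described in the explanation.
import math
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

machine = pd.read_csv('machine.csv')               # hardware data with vendor and model removed
print(machine.isnull().sum())                      # check for missing values

x = machine.drop('ERP', axis=1)                    # seven independent variables
y = machine['ERP']                                 # dependent variable
x_trg, x_test, y_trg, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

knn_rmselist = []                                  # stores the RMSE of each model
for k in range(1, 21):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(x_trg, y_trg)
    knn_rmse = math.sqrt(mean_squared_error(y_test, knn.predict(x_test)))
    knn_rmselist.append(knn_rmse)

print(min(knn_rmselist))                           # least RMSE across all values of k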

Explanation
The dimension of the dataset is (209, 8), since we have removed the variables that cannot be
considered as independent variables. From the result, it is clear that there are no missing values
in the dataset. The statement x_trg = training.drop('ERP', axis=1) drops the column
"ERP" from the dataset and stores the remaining seven variables for training. The statement
y_trg = training['ERP'] considers only the "ERP" column as the dependent variable.
Similar to the earlier example, we have created a "for" loop for creating different kNN models
for different numbers of neighbors (k). The knn_rmselist is an empty list created for storing
the RMSE of each model. Thus, every time the "for" loop is executed, a kNN model is created
and its RMSE is computed and stored in the list. The list finally stores the RMSE values of the
20 different kNN models. The KNeighborsRegressor() function creates a kNN model for a
regression problem. The command knn_rmselist.append(knn_rmse) is used to add the new
element corresponding to the RMSE value of the new model to the list. The min(knn_rmselist)
determines the least RMSE value from the list. The value of RMSE for different values of k is
also shown in the chart, which shows that the least RMSE value is nearly 10. It is clear that the
RMSE value decreases as the value of k increases. The least RMSE value using the kNN model
is found to be 10.34, which is lower than the RMSE value of the linear regression model (15.58).
This means that for this dataset, it is better to use a kNN model.

An optional parameter named algorithm can also be used in the
KNeighborsClassifier() function to specify the algorithm used to compute the
nearest neighbors. The different options are 'ball_tree', 'kd_tree', 'brute',
and 'auto'. The 'ball_tree' option will use BallTree, 'kd_tree' will use KDTree,
'brute' will use a brute-force search, and 'auto' will attempt to decide the most
appropriate algorithm based on the values passed to the fit method. Another
optional parameter is leaf_size, which takes an integer value (default 30).
This is the leaf size passed to BallTree or KDTree and can affect the speed of
construction and query, as well as the memory required to store the tree.
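A small hedged illustration of these optional parameters is given below; the particular values are chosen arbitrarily for demonstration.

# A hedged illustration of the optional parameters (values chosen arbitrarily).
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', leaf_size=40)
# knn.fit(x_trg, y_trg) and knn.predict(x_test) are then used exactly as before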

Use different combinations of optional arguments on the data used in


regression and classification problems for kNN algorithms and try to improve
the accuracy of the model.

USE CASE
LOAN CATEGORIZATION

There are two types of loans: secured and unsecured. Secured loans rely on an asset, and the
asset needs to be evaluated before sanctioning the loan. In case of nonpayment of the loan, the
lender can possess the asset to cover the loan. Unsecured loans carry higher interest rates;
credit history and revenue are evaluated to meet the criteria for the loan. An example of an
unsecured loan is a loan through a credit card. In this case, the customer is given a credit limit
to buy against; as purchases are made, the available credit decreases, and repayment permits
further use. But there are many risks related to unsecured loans, both for the lender and for those
who get the loans. These risks include credit risk (the loan will not be returned on time or at all)
and interest rate risk (interest rates priced on loans will be too low to earn).
Loan categorization refers to the process of evaluating loan collections and assigning
loans to groups or grades based on the perceived risk and other related loan parameters. The
number of loan-related transactions in the banking sector is rapidly growing, and the huge data
volumes show that the risks around loans are increasing day by day. It becomes important for
banks to determine whether a loan should be awarded to a person or business. These decisions
can have an important impact on the real economy. The process of continual review and
classification of loans enables monitoring of the quality of the loan portfolios and taking action
to counter any fall in their credit quality. Hence, it is important for regulators and
policy-makers of banks or other lending institutions to renew their business models and make
quick decisions by using advanced technology rather than more standardized schemes,
enabling them to report reasons for easy monitoring and evaluation.
Credit risk prediction, monitoring, model reliability, and effective loan processing are key
to decision making and transparency in a system. Machine learning models play a significant
role in credit risk modeling. It is important to determine the probability of certain outcomes,
particularly an existing negative threat to a current monetary operation. Adequate financial
prediction involves using the independent variables in a dataset to predict dependent variables
and hence helps in developing a model that allows easy interpretation of data that cannot be
interpreted by a human.
Machine learning algorithms help in developing models by extracting information from the
tremendous amount of accumulated data. For example, in a loan risk prediction situation,
the company would be interested in knowing how long it takes customers with certain attributes
to pay back their loans and also what the possible risk of a default is. Past data observations
gathered by the company will help determine whether a customer will be a defaulter or a
nondefaulter. The category may also range from low chance to high chance of default, to give
consideration to those who do not lie at the extremes of being a defaulter or a nondefaulter. For
these scenarios, the kNN classification algorithm would be used to determine whether the
customer will be a defaulter or a nondefaulter (categorical dependent variable) and the kNN
regression algorithm would be used to predict the amount (continuous dependent variable).
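A hedged sketch of both scenarios is given below; the file loan_history.csv and the column names default_status and loan_amount are hypothetical, and the attributes are assumed to be numeric (categorical attributes would need encoding first).

# A hedged sketch with a hypothetical loan dataset (file and column names are assumptions).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

loans = pd.read_csv('loan_history.csv')                 # hypothetical past loan records (numeric attributes)

# Classification: predict defaulter/nondefaulter (categorical dependent variable)
x = loans.drop(['default_status', 'loan_amount'], axis=1)
yc = loans['default_status']
xc_trg, xc_test, yc_trg, yc_test = train_test_split(x, yc, test_size=0.25, random_state=0)
print(KNeighborsClassifier(n_neighbors=5).fit(xc_trg, yc_trg).score(xc_test, yc_test))

# Regression: predict the loan amount (continuous dependent variable)
yr = loans['loan_amount']
xr_trg, xr_test, yr_trg, yr_test = train_test_split(x, yr, test_size=0.25, random_state=0)
print(KNeighborsRegressor(n_neighbors=5).fit(xr_trg, yr_trg).score(xr_test, yr_test))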

14.3 Support Vector Machines


Support vector machines (SVMs) are powerful models and perform well on a variety of datasets.
SVMs require careful preprocessing and tuning of the data and are hard to inspect, but the model
works well when all the features are represented in similar units. The SVM technique is easy to
understand and is generally useful for data with an unknown distribution and nonregular data. In
this algorithm, each data item is plotted as a point in n-dimensional space, where n is the number
of attributes in the dataset. For example, if we had only two features in some dataset, a graph
would be drawn showing these two variables in a two-dimensional space, where each point has
two co-ordinates (these co-ordinates are known as support vectors). The dimension of the graph
increases with the number of attributes.
These data (see Figure 14.1) are simple to classify, and one can see that they are clearly separated
into two segments. Now, we will find some line that splits the data between the two differently
classified groups. If our line is too close to any of the data points, noisy test data are more likely
to be classified in the wrong segment. We therefore choose the line that lies between these groups
and is at the farthest distance from each of the segments; that is, the line is decided in such a way
that its distances from the closest point in each of the two groups are as large as possible.
In the example, three lines are shown that split the data into two differently classified groups,
and the center line is farthest from the two closest points. This line is our classifier. Then,
depending on which side of the line the testing data land, we can classify the new data. However,
in the case of more than two attributes, a plane is drawn to depict the classifier; it divides the data
space into segments such that each segment contains only one kind of data. Thus, the hyperplane
classifies the points into two classes (categories), and the algorithm works by identifying the
hyperplane that maximizes the margin between the two classes. SVM algorithms use a set of
mathematical functions called kernels. Commonly used kernels are the linear kernel,
K(x, x′) = x·x′ (the dot product), and the Gaussian radial basis function (RBF) kernel,
K(x, x′) = exp(−||x − x′||²/(2σ²)).
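A hedged sketch contrasting the two kernel choices in scikit-learn's SVC is given below; the breast cancer dataset and the split parameters are chosen only for illustration.

# A hedged sketch comparing kernels in SVC (illustrative dataset and split).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

x, y = load_breast_cancer(return_X_y=True)
x_trg, x_test, y_trg, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

for kern in ['linear', 'rbf']:                      # linear kernel versus Gaussian (RBF) kernel
    svm = SVC(kernel=kern, gamma='scale')
    svm.fit(x_trg, y_trg)
    print(kern, svm.score(x_test, y_test))          # test accuracy for each kernel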
In business situations where one needs to train the model and continually predict over test
data, SVM may fall into the trap of overfitting. Hence, SVM needs to be carefully modeled;
otherwise the model accuracy may not be satisfactory. For linear data, we can compare SVM
with linear regression, while for nonlinear data, SVM is comparable to logistic regression. As the
data become more and more linear in nature, linear regression becomes more and more
accurate. However, SVM is useful when noise and bias severely impact the ability of
regression.

Figure 14.1 Plotting two variables in a two-dimensional space.

Figure 14.2 Plotting features with a plane for support vector machines.

14.3.1 Support Vector Machines for Classification Problems


The support vector machines classification algorithm can be applied through LinearSVC(),
available in the sklearn.svm library. To understand the implementation of support vector
machines, we can download the banknote authentication dataset from
https://archive.ics.uci.edu/ml/datasets/banknote+authentication.
Dataset Information: Data were extracted from images that were taken from genuine and
forged banknote-like specimens. For digitization, an industrial camera usually used for print
inspection was used. The final images have 400 × 400 pixels. Due to the object lens and distance
to the investigated object, gray-scale pictures with a resolution of about 660 dpi were gained.
Wavelet transform tool was used to extract features from images.

1. Variance of wavelet transformed image (continuous).
2. Skewness of wavelet transformed image (continuous).
3. Curtosis of wavelet transformed image (continuous).
4. Entropy of image (continuous).
5. Class (integer).
For analysis, class was considered as a dependent variable and all the other variables were
considered as independent variables.
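The worked-out code appears as a screenshot in the original; a minimal sketch of the steps described in the explanation below is given here. The local file name banknote.csv and the split parameters are assumptions (the UCI file itself is a headerless text file).

# A minimal sketch of the LinearSVC classification described in the explanation.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, confusion_matrix

banknote = pd.read_csv('banknote.csv',
                       names=['variance', 'skewness', 'curtosis', 'entropy', 'class'])
print(banknote.shape)

x = banknote.drop('class', axis=1)
y = banknote['class']
x_trg, x_test, y_trg, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

svm = LinearSVC(max_iter=10000)
svm.fit(x_trg, y_trg)
print(svm.score(x_trg, y_trg))          # training set score
print(svm.score(x_test, y_test))        # test set score

y_pred = svm.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))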

Explanation
The banknote authentication dataset is stored in banknote. The dimension of the dataset shows
that there are 1372 rows (observations) and five columns (variables). The different
variables in the dataset are "variance", "skewness", "curtosis", "entropy", and "class". Class is
considered as the dependent variable and the others are considered as independent variables. We
have not done feature scaling and encoding because SVM is a very sensitive model. The accuracy
of the training dataset using the SVM model is 0.99, while the accuracy of the test dataset is 0.98.
Thus, the SVM model for this dataset shows a slight case of overfitting. The confusion matrix
shows that 223 + 184 observations are correctly classified, while 1 + 4 observations are
incorrectly classified. We can observe that the maximum accuracy of the kNN model is 1, while
the accuracy from the Naive Bayes algorithm is 0.9854 and the accuracy from the logistic
regression model is 0.983. Thus, for this dataset, kNN shows 100% accuracy.

USE CASE
FRAUD ANALYSIS FOR CREDIT CARD AND MOBILE PAYMENT TRANSACTIONS

With the growing popularity of online payment system, fraud detection has become the need of
the hour. Machine learning algorithms are much better at dealing with and processing large
datasets than humans. Rules-based programming by human and machine learning approaches
have an inverse relationship with the size of datasets. Humans become less effective while
machine learning approaches get better with larger datasets. So, in fraud detection cases,
machine learning is the logical choice. Machine learning algorithms are able to detect and
recognize thousands of features on a user’s purchasing journey instead of the few that can be
captured by creating rules. They have the ability to see deep into the data and make concrete
predictions for large volumes of transactions. Hence, fraud investigators, credit card companies,
banking systems and electronic payment systems must use machine learning algorithms to build
an efficient and complex fraud detection system for preventing fraud activities that change
rapidly. The ability to stop fraud before it happens is not only a cost saver for them, but it also
helps in maintaining a high brand value.
The goal of an efficient fraud detection system is only to escalate decisions to people in case
of any fraud. A machine learning algorithm enables us to achieve results within a short time and
with the required confidence level to approve or decline a transaction. Every time a card is
swiped or inserted, or a phone is tapped or scanned, there is either an authorization or a decline.
For predicting decision, it is necessary to determine anomalies across patterns of fraud behavior
that have undergone change relative to the past. A good fraud detection system should be able to
identify the fraud transaction accurately and should make the detection possible in real-time
transactions. To detect and deal with fraud in real time, there is a need to monitor each click,
detect anomalies, and respond appropriately. The anomalies can be determined from the data
related to credit history, purpose of using the credit card, credit amount, job, time, amount,
class, etc. The solution must be fast, accurate, and flexible enough to keep up with modern fraud
attacks.
The decision is driven by a machine learning model that can identify fraudulent behavior
based on information from historical fraud data. This training occurs in a big data system that
receives exported information from an in-memory database. The model then gets loaded as
stored procedures or user-defined functions into the database multiple times a day. As the ability
to predict and prevent becomes more widely adopted, consumer’s tolerance for fraud will reach
zero and will ultimately be the differentiator between success and failure. However, fraudsters
change their methodology all the time, so it is important to constantly update the machine
learning fraud model to keep the quality of decisions high and the false positive rate low.
Machine learning algorithms help to quickly detect data anomalies and make decisions
based on information as it happens, even anticipating results. The SVM algorithm can be used
for the prediction of fraud in online payment and credit card transactions.

14.3.2 Support Vector Machines for Regression Problems


The support vector machines regression algorithm can be applied through LinearSVR(), available
in the sklearn.svm library. To understand the utility of support vector machines in regression
problems, we consider the protein structure dataset, which can be downloaded from
https://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+of+Protein+Tertiary+Structure
Dataset Information: The dataset has the following features:

RMSD - Size of the residue.
F1 - Total surface area.
F2 - Nonpolar exposed area.
F3 - Fractional area of exposed nonpolar residue.
F4 - Fractional area of exposed nonpolar part of residue.
F5 - Molecular mass weighted exposed area.
F6 - Average deviation from standard exposed area of residue.
F7 - Euclidian distance.
F8 - Secondary structure penalty.
F9 - Spacial distribution constraints (N, K value).
We will consider RMSD as dependent and others as independent variables.
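The worked-out code appears as a screenshot in the original; a minimal sketch of the steps described in the explanation below is given here. The file name and the split parameters are assumptions.

# A minimal sketch of the LinearSVR regression described in the explanation.
import math
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR
from sklearn.metrics import mean_squared_error

protein = pd.read_csv('CASP.csv')                  # assumed local copy of the UCI file
print(protein.shape)
print(protein.isnull().sum())                      # check for missing values

x = protein.drop('RMSD', axis=1)
y = protein['RMSD']
x_trg, x_test, y_trg, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

svr = LinearSVR(random_state=0, max_iter=10000)
svr.fit(x_trg, y_trg)
svr_rmse = math.sqrt(mean_squared_error(y_test, svr.predict(x_test)))
print(svr_rmse)                                    # RMSE of the SVM regression model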

Explanation
The protein dataset is stored in protein. The dimension of the dataset shows that there are
45,730 rows (observations) and 10 columns (variables). The different variables are as
follows: "RMSD", "F1", "F2", "F3", "F4", "F5", "F6", "F7", "F8", and "F9". RMSD is
considered as the dependent variable and the others are considered as independent variables. It is
observed from the result that there are no missing values in the dataset. The RMSE value of the
SVM model for this dataset is 6.62, which is higher than the least RMSE value of kNN (5.77) and
the RMSE value of the linear regression model (5.155). Thus, for this dataset, SVM is less
effective than the other algorithms.

All the parameters used in the support vector machines algorithm are optional.
These parameters are epsilon (float), which depends on the scale of the target
variable y; tol (float), the tolerance for the stopping criteria; C (float), the
regularization parameter, which must be strictly positive; loss (string), which
specifies the loss function, where the epsilon-insensitive loss (standard SVR) is
the L1 loss and the squared epsilon-insensitive loss ('squared_epsilon_insensitive')
is the L2 loss; and random_state, which takes an integer value.

Use different combinations of optional arguments on the data used in


regression and classification problems for support vector machines algorithms
and try to improve the accuracy of the model.

USE CASE
DIAGNOSIS AND TREATMENT OF DISEASES

As the healthcare industry is becoming more and more reliant on computer technology, methods
are required to assist the physicians in identifying and curing abnormalities at early stages.
Machine learning algorithms can significantly help in solving healthcare problems by
developing classifier systems that can assist physicians in diagnosing and predicting diseases in
early stages. For training of high-dimensional and multimodal biomedical data, machine
learning offers a worthy approach for making classy and automatic algorithms. However,
extracting knowledge from medical data is challenging as these data may be heterogeneous,
unorganized, high dimensional, and may contain noise and outliers.
Medical diagnosis is one of the important activities of medicine. The accuracy of the
diagnosis allows for deciding the right treatment and subsequently curing the diseases.
Researchers have dedicated themselves in designing and understanding technologies to support
individuals with chronic illnesses in managing their health. These technologies will help people
to change their behavior, learn about their disease, get support from similar others or track
information about them.
Many artificially intelligent diagnosis algorithms have been developed for detecting various
diseases like rheumatoid arthritis, cancer, lung diseases, heart diseases, diabetic retinopathy,
hepatitis disease, Alzheimer’s disease, liver disease, dengue, Parkinson disease, etc. Dinu et al.
(2017) have compiled contributions of many scientists and researchers in the medical field.
Georg Langs developed a system for automatic quantification of joint space narrowing and
erosions in rheumatoid arthritis. Smita Jhajharia predicted cancer with an accuracy of 96%.
Juan Wang proposed a deep learning algorithm for detecting cardiovascular diseases.
Shubhangi Khobragade proposed an algorithm for automatic detection of major lung diseases
with an accuracy of 86%. Zheng L. proposed an algorithm that combines several artificial
intelligent techniques with the discrete wavelet transform for detection of masses in
mammograms. Yinghe Huo presented a system for automatic quantification of radiographic
finger joint space width of patients with early rheumatoid arthritis with an impressive accuracy.
A. B. Suma proposed a cost-effective and safer technique for the diagnosis of rheumatoid
arthritis.
Otoom and others have proposed an algorithm for detection and training of coronary artery
disease with highest accuracy of 85.5%. Vembandasamy et al. put forward an algorithm to
diagnose heart disease by using Naive Bayes algorithm, which offers 86.42% accuracy.
Chaurasia and Pal proposed an algorithm for heart disease detection. Here, Naive Bayes
provides 85.31% accuracy. Parthiban and Srivatsa developed a machine learning algorithm for
diagnosis of heart disease using SVM to provide the highest accuracy of 94.60%. Tan et al.
proposed hybrid technique in which two machine learning algorithms, genetic algorithm and
SVM, are joined effectively for attaining an accuracy of 84.07%.
Iyer has performed a work to predict diabetes disease by using decision tree and Naive
Bayes. Sen and Dash developed meta-learning algorithms for diabetes disease diagnosis. Ba-
Alwi and Hintaya put forward a comparative training of various data mining algorithms that are
used for hepatitis disease diagnosis, which gives the accuracy of 96.52%. Sathyadevi employed
algorithm that has offered great performance of 83.2%. Ruben Armananzas proposed a voxel-
based diagnosis of Alzheimer’s disease using classifier ensembles; classification accuracy of the
proposed method is 97.14%. Baiying Lei proposed a novel discriminative sparse learning
method with relational regularization to jointly predict the clinical score and classification
accuracy of the proposed method is 94.68%.
Tong Tong proposed algorithm for the prediction of conversion from mild cognitive
impairment to Alzheimer’s disease in the range of 79–81% for the prediction of MCI-to-AD
conversion within 3 years in tenfold cross-validations. Priyanka Thakare developed Alzheimer
Disease Detection AI system. In this work, using wavelet transform four features are extracted
and classification is done by SVM with an accuracy of 94%. Jun Zhang proposed a landmark-
based feature extraction method based on a shape-constrained regression forest algorithm with
a classification accuracy of 83.7%. Vijayarani and Dhayanand predict liver disease by using
SVM and Naive Bayes classification algorithms showing accuracy of 79.66% and 61.28%,
respectively. P. Rajeswari put forward a training of liver disorder by using data mining
algorithm with a high accuracy of 97.10%. Fathima and Manimeglai used SVM data mining
algorithm for prediction of dengue disease with an accuracy of 90.42%. Ibrahim proposed an
algorithm of multilayer feed-forward neural network for prediction of dengue disease with an

accuracy of 90%. Reddy Challa and others have developed automated diagnostic models using
different models and found best accuracy of 97.159%. Sachin Shetty and Y. S. Rao proposed
SVM-based machine learning approach to identify Parkinson’s disease with an overall accuracy
of 83.33%. Indrajit Mandal and N. Sairam proposed robust methods of treating Parkinson’s
disease with the highest accuracy obtained by multinomial logistic regression of 100%.
Medicine plays a great role in human life, and so automated knowledge extraction from
medical datasets has become extremely important. All activities in medicine can be divided into
six tasks: screening, diagnosis, treatment, prognosis, monitoring, and management. Different
supervised machine learning algorithms discussed in this chapter including SVMs can be used
for doing the above tasks, including detection and diagnosis of different diseases, with enhanced
accuracy. The training related to the relevant medical imagery and associated point data will
make an inference that will increase the speed of decision making and can lower false positive
rates.

14.4 Decision Tree


The decision tree is widely used in machine learning and data mining applications using Python.
It is basically a graph that represents choices and their outcomes in the form of a tree structure.
The nodes in the graph represent an event or choice, and the edges of the graph represent the
decision rules or conditions. The decision tree is a supervised learning algorithm, which can be
used for both classification (categorical dependent variable) and regression (continuous
dependent variable). The factors discussed below help decide when this algorithm should or
should not be used.

Situations to Use Decision Tree Model

• Decision tree is preferred in cases where there is high nonlinearity and a complex relationship
between the dependent and independent variables.
• A decision tree model is a graphical representation that is simple and easy to understand
even for people from a nonanalytical background. No statistical knowledge is required to
understand and interpret the results.
• Decision tree is one of the fastest ways to identify the most significant variables and the
relation between two or more variables. With the help of decision trees, we can create new
variables/features that have better power to predict the target variable.
• The decision tree model is a nonparametric method and is not affected by outliers and missing
values. Hence, no assumption checks are required, less data cleaning is needed and no
imputation is required.

Situations when Decision Tree Model Should Not Be Used

• If the relationship between the dependent and independent variables is well approximated by a
linear model, the linear regression algorithm should be adopted.
• Overfitting is one of the most real challenges in decision tree models. This problem is solved
by setting constraints on model parameters and by pruning (discussed in detail in the
succeeding section).
• Decision tree should not be adopted for predicting a continuous dependent variable, because
while working with continuous numerical variables, a decision tree generally loses
information when it groups numerical variables into different categories.

Decision trees are typically drawn upside down, such that the terminal nodes (leaves) are at the
bottom and the root node is at the top. The root node represents the entire population or sample,
and this further gets divided into two or more homogeneous sets. Splitting is the process of
dividing a node into two or more subnodes. When a subnode splits into further subnodes, it is
called a decision node. A node that is divided into subnodes is called the parent node of those
subnodes; the subnodes are the children of the parent node. The decision of making strategic
splits heavily affects a tree's accuracy. The creation of subnodes increases the homogeneity of the
resultant subnodes, and the algorithm selects the split that results in the most homogeneous
subnodes.
The decision tree is thus a supervised learning algorithm that is used in classification and
regression problems and works for both categorical and continuous input and output variables.

14.4.1 Decision Tree Algorithm for Classification Problems


Decision Tree algorithm for classification problems can be applied using
DecisionTreeClassifier() function from the sklearn.tree library. For understanding the utility
of decision tree in classification problems, we consider fertility dataset from
https://archive.ics.uci.edu/ml/datasets/Fertility.

Dataset Information:

1. Season in which the analysis was performed: (1) winter, (2) spring, (3) summer, (4) fall (–1,
–0.33, 0.33, 1)
2. Age at the time of analysis: 18–36 (0, 1).
3. Childish diseases (i.e., chicken pox, measles, mumps, polio): (1) yes, (2) no (0, 1).
4. Accident or serious trauma: (1) yes, (2) no (0, 1).
5. Surgical intervention: (1) yes, (2) no (0, 1).
6. High fevers in the last year: (1) less than 3 months ago, (2) more than 3 months ago, (3) no
(–1, 0, 1).
7. Frequency of alcohol consumption: (1) several times a day, (2) every day, (3) several times
a week, (4) once a week, (5) hardly ever or never (0, 1).
8. Smoking habit: (1) never, (2) occasional, (3) daily (–1, 0, 1).
9. Number of hours spent sitting per day: 1–16 (0, 1).
10. Diagnosis normal (N) and altered (O).
The dataset has diagnosis as dependent variable and rest nine variables are considered as
independent variables.

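A minimal sketch of the workflow described in the Explanation below is given here. The file name fertility_Diagnosis.txt and the short column names are assumptions chosen to match the variable names used in the discussion; adjust them to the actual file downloaded from the UCI page.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical file and column names for the fertility data downloaded from UCI
cols = ['season', 'age', 'child_disease', 'accident', 'surgery',
        'high_fever', 'freq_alcohol', 'smoking', 'hours', 'diagnosis']
fertility = pd.read_csv('fertility_Diagnosis.txt', names=cols)
x = fertility.drop('diagnosis', axis=1)
y = fertility['diagnosis']
x_trg, x_test, y_trg, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Decision tree with the default value of max_depth
dtree = DecisionTreeClassifier(random_state=0).fit(x_trg, y_trg)
print(dtree.score(x_trg, y_trg), dtree.score(x_test, y_test))

# Decision tree with max_depth = 3
dtree3 = DecisionTreeClassifier(random_state=0, max_depth=3).fit(x_trg, y_trg)
print(dtree3.score(x_test, y_test))

# Importance of the predictor variables
print(dict(zip(x.columns, dtree3.feature_importances_)))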
Explanation
Initially, the decision tree model with random_state = 0 and the default value of max_depth is created. The accuracy on the training dataset is found to be 0.98 and on the test dataset 0.73. Thus, this model depicts overfitting. We tried to improve the accuracy of the model by changing the parameter max_depth to 3 and found that the test accuracy increased from 0.734 to 0.867. The importance of the predictor variables is shown in numerical format and is also displayed in visual form. The importance of each variable in the result and the chart shows that all the variables besides freq_alcohol, hours, and season are not important, since their importance value is 0. Hence, a new model was developed considering these three as independent variables. The drop() function with a list of unwanted predictor variables is used to remove them from the training and test datasets. The accuracy of the new model remains the same at 0.8367; that is, the accuracy of the model with three independent variables is the same as when a larger number of independent variables was considered. Hence, we will consider only these variables for the model. Thus, for the other models, we have considered only hours, freq_alcohol, and season as independent variables. The maximum accuracy of the KNN model and the logistic regression model is found to be 0.90. The accuracy of Naive Bayes is found to be similar to that of the decision tree model (0.867).

USE CASE
OCCUPANCY DETECTION IN BUILDINGS

The accurate determination of occupancy detection in buildings has been recently estimated to
save energy, provide effective security measures, and also determine the behavior of building
occupants. Accurate determination is necessary for automation of buildings, which starts from
determining the presence of people in the controlled areas. The need for automation arises due
to human nature. As an example, consider the situation of forgetting to switch off lights. The
solution to this problem is to use some form of occupancy sensor to detect the presence of people
in that area.
Occupancy detection is cost effective; indoor motion detecting devices are used to detect the
presence of a person to automatically control lights, temperature, or ventilation system. Today,
with sensors becoming affordable and ubiquitous, together with affordable computing power for
automation systems, determining occupancy can help to greatly reduce energy consumption.
Earlier, a vision-based system for occupancy detection was used which had a camera and
automatic image training to detect humans in the field of vision. The system was able to count
the number of occupants in the images. But it is objectionable due to privacy concerns.
The occupancy sensor type most widely used in the building automation industry is the passive infrared (PIR) sensor. PIR sensing technology simply means that the sensors detect/sense heat. When the sensors detect heat, they send an electrical signal to a circuit to turn a light ON. However, these sensors wait for a change to occur; if someone is sitting behind an office partition, the sensor may not detect them because it does not have a direct line of sight. In such cases, we use ultrasonic occupancy sensors that can "sense" motion through and around obstacles.
A model can be developed by considering independent variables such as date, time, year,
temperature, humidity, digital video cameras, passive infrared detection, CO2 sensors, light,
sound, and motion. The dependent variable can be a categorical variable occupancy with two
values: occupancy or no occupancy. The accuracy of the prediction of occupancy in a room
using data from the above mentioned factors can be evaluated using a decision tree algorithm.
The dataset can be partitioned into two datasets: one for training and other for testing the
model. A proper selection of factors can have an important impact on the accuracy of detection.

14.4.2 Decision Tree for Regression Problems


Decision Tree algorithm for regression problems can be applied using the DecisionTreeRegressor() function from the sklearn.tree library. For understanding the utility of decision tree in regression problems, we will consider the Longley dataset, which can be downloaded from https://www.itl.nist.gov/div898/strd/lls/data/LINKS/DATA/Longley.dat.

Dataset Information: The Longley dataset contains various U.S. macroeconomic variables that
are known to be highly collinear. Variable name definitions are as follows:

Employed—Total Employment
GNP.deflator—GNP deflator
GNP—GNP
Unemployed—Number of unemployed
Armed.Forces—Size of armed forces
Population—Population
Year—Year (1947–1962)

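A minimal sketch of the workflow described in the Explanation below is given here, assuming the Longley data have been saved locally as a CSV file named longley.csv with the column names listed above; the file name is an assumption.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

longley = pd.read_csv('longley.csv')        # hypothetical local copy of the Longley data
x = longley.drop('Employed', axis=1)
y = longley['Employed']
x_trg, x_test, y_trg, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Decision tree regressor with default settings
dtreg = DecisionTreeRegressor(random_state=0).fit(x_trg, y_trg)
pred = dtreg.predict(x_test)
print('RMSE:', np.sqrt(mean_squared_error(y_test, pred)))

# Importance of each predictor variable
print(dict(zip(x.columns, dtreg.feature_importances_)))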
Explanation
Looking at the figure, we can observe that GNP is the most important predictor variable and thus has a strong influence on the dependent variable "Employed." We can observe that all six independent variables have an effect on the dependent variable, and hence the new decision tree model is developed considering all the independent variables. The RMSE value is very small for all the models. We can also observe that the least RMSE value is for the linear regression model (0.80), followed by KNN (1.21), and the maximum RMSE is for the decision tree model (1.48).

The different optional arguments used in the function include: criterion, which can have the value "gini" for the Gini impurity or "entropy" for the information gain; splitter, whose supported strategies are "best" to choose the best split and "random" to choose the best random split; max_depth, which specifies the maximum depth of the tree; min_samples_split, which specifies the minimum number of samples required to split an internal node; and max_features, which specifies the number of features to consider when looking for the best split and can be an integer, a float, "auto", "sqrt", "log2", or None.
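As an illustration only, one possible combination of these optional arguments is shown below; the specific values are assumptions and should be tuned for each dataset.

from sklearn.tree import DecisionTreeClassifier

# Illustrative values of the optional arguments described above
dtree_tuned = DecisionTreeClassifier(criterion='entropy',
                                     splitter='best',
                                     max_depth=4,
                                     min_samples_split=10,
                                     max_features='sqrt',
                                     random_state=0)
# dtree_tuned.fit(x_trg, y_trg) and dtree_tuned.score(x_test, y_test) can then be
# used exactly as in the earlier examples.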

Use different combinations of optional arguments on the data used in


regression and classification problems for decision tree algorithms and check if
further improvement in accuracy of the model is possible.

CatBoostRegressor algorithm can also be used for creating regression models.


This can be imported from the catboost library and produces effective results
for some datasets.
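A minimal sketch of its use is given below, assuming the catboost package has been installed (pip install catboost) and that training and test data have been prepared as in the regression example above; the hyperparameter values are illustrative assumptions.

from catboost import CatBoostRegressor

# Illustrative hyperparameter values
cat_model = CatBoostRegressor(iterations=200, learning_rate=0.1, depth=4, verbose=0)
cat_model.fit(x_trg, y_trg)
pred = cat_model.predict(x_test)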

USE CASE
ARTIFICIAL INTELLIGENCE AND EMPLOYMENT

Artificial intelligence developed from the desire of scientists and inventors to do work and take decisions without the participation of people. Artificial intelligence is an integrated system of devices that performs tasks normally requiring human intelligence, visual perception, and decision making. The core problems of artificial intelligence include programming for certain characteristics such as knowledge, reasoning, problem solving, perception, learning, planning, and the ability to manipulate and move objects. After the technological revolution, there was an increase in the automation and computerization of work activities in all organizations. The organizations' focus shifted toward profit generation by increasing efficiency through the use of artificial intelligence. Such impacts of technological change dispersed greatly across various geographical regions in manufacturing and service industries.
Although artificial intelligence system has led to the increase in work efficiency and
development of easier application systems, it has also increased the concerns of employees due
to decrease in opportunities and wages. When a job is transferred to automatic machines or
computers, opportunities reduce as lesser people are required in final creation/decision making
of products/services. Due to the change in the procedure to do organizational activities, the
complete task structure changed. Apart from change in structure of tasks, the nature of skills
required to perform the job are also changed. This may have a huge impact on the jobs of
employees in an organization and the wages and incentives paid to the employees. Artificial
intelligence can increase the efficiency of work in business activities, but the cost is expected to
be high.
Besides, machines using artificial intelligence must have access to all processes, products,
categories, and relationship between all of them to implement knowledge engineering. Hence,
organizations may require more people to handle these computers and machines. Artificial
intelligence reduces costs which lead to increased savings and profits, decreases opportunities
for corruption, which in turn will lead to improved ease of business environment. This will
further lead to development of new technologies, more investments and hence more job
opportunities. It has been observed that not all jobs in society are being eradicated. In fact, many are developing as a result, as happened in the case of the revolution in software engineering, which actually created more jobs.
Considering the scenario depicted, it is important for us to determine the effectiveness of artificial intelligence on the quality of employment. Data can be collected from various organizations, and the decision tree algorithm can be applied to develop and test a model.

Summary
• Supervised machine learning algorithms consist of a dependent variable that is predicted
from a given set of independent variables.
• Algorithms of supervised learning include Naive Bayes, k-NN, Decision Tree, and Support
Vector Machines. Except Naive Bayes, these algorithms can be used for both classification
and regression problems as discussed in this chapter.
• For determining the accuracy of the model, we first divide the whole dataset into two datasets: a training dataset, which is used for model development, and a test dataset, which is used for validating the model by comparing its actual values with the values predicted by the trained model.
• Naive Bayes is a classification technique based on Bayes theorem with an assumption of
independence among predictors (assumes that the presence of a particular feature in a class is
unrelated to the presence of any other feature). Naive Bayes model is easy to build and
particularly useful for very large datasets.
• When k-NN is used for regression problems, the prediction is based on the mean or the
median of the K-most similar instances. When k-NN is used for classification, the output is
based on the mode.
• The most important parameter of the k-NN algorithm is k, which specifies the number of neighboring observations that contribute to the output predictions. A low value of k gives a very complex, wiggly, jagged model, while a high value of k gives a very inflexible, smooth model. We need to determine a value of k somewhere in the middle that predicts well on unseen data.
• The SVM technique is easy to understand and is generally useful for data with an unknown distribution and nonregular data. In this algorithm, each data item is plotted as a point in n-dimensional space, where n is the number of attributes in the dataset.
• A decision tree model is a graphical representation, which is simple and easy to understand
even for people from nonanalytical background. It shows a graph that represents choices and
their outcomes in form of a tree structure. The nodes in the graph represent an event or
choice and the edges of the graph represent the decision rules or conditions.

Multiple-Choice Questions

1. This algorithm is not used for both classification and regression problems.
(a) Naive Bayes
(b) Decision Tree
(c) Random Forest
(d) Support Vector Machines
2. The function used for creating a Naive Bayes model is
(a) GaussianNB()
(b) naive()
(c) naivebayes()
(d) NB()
3. Which algorithm shows a graph that represents choices and their outcomes in the form of
a tree structure?
(a) kNN
(b) Decision Tree
(c) Naive Bayes
(d) Support Vector Machines
4. The value of the argument _______________ has a great impact on the decision tree model.
(a) tree
(b) max_depth
(c) length
(d) dtree
5. __________________ is the most important parameter of the KNN algorithm.
(a) acc
(b) rmse
(c) knn
(d) k
6. The model easy to build and highly useful for large data is
(a) Naive Bayes
(b) Decision Tree
(c) Random Forest
(d) Support Vector Machines
7. The library from where functions of decision tree model are imported is
(a) sklearn.ensemble
(b) sklearn.neighbors
(c) sklearn.svm
(d) sklearn.tree
8. The library from where functions of k-NN model are imported is
(a) sklearn.ensemble
(b) sklearn.neighbors
(c) sklearn.svm
(d) sklearn.tree
9. The library from where functions of support vector machines model are imported is
(a) sklearn.ensemble
(b) sklearn.neighbors
(c) sklearn.svm
(d) sklearn.tree
10. Naive Bayes model is based on _______________ theorem.
(a) Bayes
(b) Naive
(c) Ensemble
(d) Predict

Review Questions

1. Why does the SVM model generally show overfitting?


2. Discuss the process of model development by k-NN algorithm.
3. How can we determine the best value of k in K-NN algorithm?
4. What is the significance of support vector machines algorithm?
5. How can we view the importance of the predictor variables in a decision tree model?
6. Discuss the process of model development by support vector machines algorithm.
7. Are different functions used for regression and classification problems? List functions used
for different algorithms discussed in the chapter.
8. Discuss the situations where decision tree algorithm is appropriate and not appropriate.
9. How do we decide number of attributes in support vector machines model?
10. How can we improve the accuracy of the decision tree model?

CHAPTER
15

Supervised Machine Learning
Ensemble Techniques

Learning Objectives
After reading this chapter, you will be able to

• Understand orientation of different supervised machine learning ensemble techniques.


• Demonstrate the knowledge of ML ensemble techniques in solving real-world problems.
• Evaluate different ensemble techniques for improving accuracy of the model.
• Develop the best model for accurate prediction.

Supervised machine learning algorithms like decision trees are used to make better decisions and earn more profit, but they suffer from bias and from variance that grows as the complexity increases. The ensemble method combines different decision trees to generate better predictive performance than a single decision tree. In ensemble techniques, a group of weak learners comes together to form a strong learner. The main advantage of ensemble techniques is the power of handling large datasets with many dimensions: they can handle thousands of independent variables and, using dimensionality reduction methods, identify the significant variables. The different ensemble techniques include Bagging, Random Forest (an extension of Bagging), Extra Trees, AdaBoost, and Gradient Boosting. The following table shows a comparison between various machine learning techniques based on different factors.

Factor                          Logistic Regression   Decision Tree   k-NN     Ensemble Techniques
Explanation of output           Normal                Easy            Easy     Difficult
Prediction power                Normal                Normal          Normal   High
Time required in calculation    Less                  Normal          Less     More

It is important to do hyperparameter tuning in order to create a better model. This helps in reducing the risk of overfitting and in maximizing the estimator's performance. In Python, this is primarily done with the GridSearchCV object, which performs an exhaustive search over the hyperparameter grid and reports the hyperparameters that maximize the cross-validated classifier performance. A dictionary of key-value pairs is created, where each key is a string denoting a hyperparameter of the classifier and the corresponding value lists the different settings to try. There is no standard, optimal hyperparameter grid for any classifier; the user can change the hyperparameter grid as required (a short sketch of this grid-based approach is shown after the following list). Feature extraction is beyond the scope of this book, but it is highly recommended to the user for increasing the accuracy of the model and deriving meaningful interpretation. We have focused on tuning the hyperparameters for each algorithm through the grid-based approach. The commonly used hyperparameters in ensemble techniques include the following:

1. max_depth: This argument considers the maximum depth of a tree. It is used to control
overfitting as higher depth will allow model to learn relations very specific to a particular
sample.
2. max_features: This denotes the number of features, selected at random, to consider while searching for the best split. It has been observed that the square root of the total number of features works well, but we can also consider 30 to 40% of the total number of features. Higher values can lead to overfitting, though this depends on the case.
3. learning_rate: This determines the impact of each tree on the final outcome. The model
works by starting with an initial estimate, which is updated using the output of each tree.
The learning parameter controls the magnitude of this change in the estimates. Lower values
are generally preferred as they make the model robust to the specific characteristics of tree
and thus allowing it to generalize well. However, lower values would require higher number
of trees to model all the relations and will be computationally expensive.
4. n_estimators: This denotes the number of sequential trees to be modeled. Higher number
of trees can overfit, hence, this should be tuned using CV for a particular learning rate.
Since learning rate shrinks the contribution of each tree by learning_rate, there is a trade-
off between learning_rate and n_estimators.
5. min_samples_split: This parameter defines the minimum number of samples (or
observations), which are required in a node to be considered for splitting. Generally, it is
used to control overfitting. Higher values prevent a model from learning relations, which
might be highly specific to the particular sample selected for a tree. Also too high values
can lead to underfitting; hence, it is important to determine it using grid approach.
6. min_samples_leaf: This parameter defines the minimum samples (or observations)
required in a terminal node or leaf. It is also used to control overfitting. Generally, lower
values should be chosen for imbalanced class problems because the regions in which the
minority class will be in majority will be very small.
7. min_weight_fraction_leaf: It is similar to min_samples_leaf but defined as a fraction of
the total number of observations instead of an integer.
8. max_leaf_nodes: It represents the maximum number of terminal nodes or leaves in a tree. It
can be used in place of max_depth. Since binary trees are created, a depth of “n” would
produce a maximum of 2^n leaves. It should be noted that if this is defined, max_depth will
not be considered.
9. random_state: This parameter fixes the random number seed so that the same random numbers are generated every time. If the seed is not fixed, subsequent runs on the same parameters will have different outcomes and it becomes difficult to compare models. Fixing it can, however, result in overfitting to the particular random sample selected.
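The following is a minimal sketch of the grid-based approach referred to above; the estimator and the grid values are illustrative assumptions, and the training data (x_trg, y_trg) are assumed to be prepared as in the earlier chapters.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Illustrative hyperparameter grid; the keys must match the estimator's argument names
param_grid = {'n_estimators': [50, 100],
              'max_depth': [3, 5, None],
              'min_samples_split': [2, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(x_trg, y_trg)
print(grid.best_params_, grid.best_score_)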
This chapter basically discusses both classification and regression problems for the different algorithms used in ensemble techniques, including Bagging, Random Forest, Extra Trees, AdaBoost, and Gradient Boosting. In Chapters 13 and 14, we discussed metrics such as the accuracy score and the confusion matrix for evaluating classification algorithms. In this chapter, we discuss other metrics, namely the classification report and the receiver operating characteristic (ROC) curve, for evaluating classification problems. The classification report displays four different values, namely precision, recall, F1-score, and support, for all the categories of the dependent variable, along with the accuracy of the model, the macro average, and the weighted average.
The ROC curve is a commonly used graph that summarizes the performance of a classifier over all possible thresholds. ROC curves typically feature the true positive rate on the Y axis and the false positive rate on the X axis. The highest value of the area under the curve (AUC) is 1, while the least is 0.5, which corresponds to the 45° random line. The higher the curve is above the diagonal baseline, the better the predictions. The performance of the model is determined by looking at the area under the ROC curve, and the value is stored in AUC. This means that the top left corner of the plot is the best point (a false positive rate of zero and a true positive rate of one) because it gives the largest AUC. It should be noted that the steepness of ROC curves is also important, since it is ideal to maximize the true positive rate while minimizing the false positive rate. It should also be noted that if the AUC is less than 0.5, doing the exact opposite of the model's recommendation would give a value of more than 0.5.
For understanding the ROC metric to evaluate the output quality of binary data, a classification model is developed for the cancer dataset using the naïve Bayes algorithm. The roc_curve() and roc_auc_score() functions are imported from the sklearn.metrics library for creating a ROC curve and determining the AUC score. We have used the naive Bayes algorithm discussed in Chapter 14 on the breast_cancer data available in the sklearn.datasets library.

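A minimal sketch of the code described in the Explanation below is given here; the variable names follow those used in the discussion, and the 30% test split with random_state=0 is an assumption.

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve, roc_auc_score

cancer = load_breast_cancer()
x_trg, x_test, y_trg, y_test = train_test_split(cancer.data, cancer.target,
                                                test_size=0.3, random_state=0)
naivecancer = GaussianNB().fit(x_trg, y_trg)

cancer_probs = naivecancer.predict_proba(x_test)    # probabilities for both classes
cancer_probs = cancer_probs[:, 1]                   # keep only the positive outcome
print('AUC:', roc_auc_score(y_test, cancer_probs))

cancer_fpr, cancer_tpr, _ = roc_curve(y_test, cancer_probs)
plt.plot(cancer_fpr, cancer_tpr, color='red', label='Naive Bayes')
plt.plot([0, 1], [0, 1], color='blue', linestyle='--', label='Random guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()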
Explanation
A naïve Bayes model is first created on the training dataset of the cancer data. The command naivecancer.predict_proba(x_test) predicts the probabilities for the test data of the independent variables (x_test) and stores them in cancer_probs. The next command, cancer_probs[:, 1], considers only the positive outcome. The command roc_auc_score(y_test, cancer_probs) calculates the ROC AUC score by comparing the original data of the test dataset with the predicted probabilities. The command roc_curve(y_test, cancer_probs) returns the values of the false positive rate and the true positive rate and stores them in cancer_fpr and cancer_tpr, respectively. The next section plots the lines corresponding to these rates. The cancer dataset had only two categories for the dependent variable; hence, it was possible for us to create a ROC curve for the data. We can observe from the chart that the area between the red line and the blue line is very high, which further means that the model shows excellent accuracy. This is also consistent with the AUC score of 0.99, which shows an excellent accuracy.

It is important to note that ROC curves are primarily used in binary classification to study the output of a classifier. In order to extend the ROC curve and ROC area to multiclass or multilabel classification, we need to consider the one-vs.-the-rest (OvR) multiclass/multilabel strategy for any classifier. In the following multiclass example of the iris dataset, we consider the SVM classifier, and the dependent variable named specie has three categories: Virginica, Setosa, and Versicolor. It should be noted that it is important to binarize the dependent variable before drawing the ROC curve; one ROC curve can be drawn per label.

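A minimal sketch of the multiclass ROC workflow described in the Explanation below is given here; the 30% test split and random_state=0 are assumptions.

import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_curve, auc

iris = datasets.load_iris()
x = iris.data
y = label_binarize(iris.target, classes=[0, 1, 2])   # binarize the dependent variable
num_class = y.shape[1]

x_trg, x_test, y_trg, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                         random_state=0))
y_score = classifier.fit(x_trg, y_trg).decision_function(x_test)

fpr, tpr, roc_auc = {}, {}, {}
for i in range(num_class):                            # one ROC curve per label
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
    plt.plot(fpr[i], tpr[i], label='Class %d (AUC = %0.2f)' % (i, roc_auc[i]))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()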
Explanation
The command label_binarize(y, classes = [0, 1, 2]) binarizes the dependent variable (specie). Since we have three categories of the dependent variable (0, 1, 2), the value of the argument classes contains all three categories. The command num_class = y.shape[1] stores the value 3 in num_class, since there are three categories. The command OneVsRestClassifier(svm.SVC(kernel='linear', probability=True, random_state=0)) creates a multiclass classifier for the SVM algorithm. The command classifier.fit(x_trg, y_trg).decision_function(x_test) fits the training data and stores the predicted scores in y_score. As we had three classes for the three categories of specie, we created dictionaries for storing the values of fpr, tpr, and roc_auc. A "for" loop is then executed for storing the different values returned by roc_curve(). The next section creates a plot displaying all three categories.

15.1 Bagging
Bagging (bootstrap aggregating) was proposed by Leo Breiman in 1994 for improving classification accuracy. Bootstrapping is a process of creating random samples with replacement for estimating model accuracy. This algorithm is used when our goal is to reduce the variance of a decision tree. It is a relatively simple way to increase the power of a predictive statistical model: multiple random samples (with replacement) are taken from the training dataset, each collection of subset data is used to train its own decision tree, and the result is an ensemble of different models. At the end, we get separate predictions for the test set, and the mean of all the predictions from the different trees is calculated, which is more accurate than the result of a single decision tree.

When the samples are extremely similar, all of the predictions derived from the samples will
be nearly same; hence, use of bagging is not very advantageous. This algorithm is more useful
when the predictors are more unstable. This means that if the random samples drawn from the
training set are very different, we will have different sets of predictions and this greater
variability will lead to a better final result.
Bagging, also called bootstrap aggregating, is a technique used to reduce variance because it combines and averages multiple models, which helps to reduce the variability of any one tree and improves predictive performance. The average of the outputs is considered for regression, and a vote over the categories is used for classification. Bagging is a relatively simple way to increase the power of a predictive statistical model by taking multiple random samples (with replacement) from the training dataset and using each of these samples to construct a separate model and separate predictions for the test set. These predictions are then averaged to create a, hopefully more accurate, final prediction value.
Bagging is a powerful method to improve the performance of simple models and reduce
overfitting of more complex models. The principle is very easy to understand, and instead of
fitting the model on one sample of the population, several models are fitted on different samples
(with replacement) of the population. Then, these models are aggregated by using their average,
weighted average or a voting system for classification. Although bagging reduces the
explanatory ability of model, it makes it much more robust.
Let us understand the process of bagging through 100 models that we need to average. For all
the 100 iterations, we will take a sample with replacement of original dataset, train a
regression/classification tree on this sample, and save the model. After all the models are trained,
the prediction is done by calculating average of all the estimates from each tree. Thus, the
bagging model handles the bias efficiently from each tree and the variance from the trees on a
bootstrapped sample.
It should be noted that this algorithm is more useful when the predictors are more unstable.
In other words, if the random samples from training set are very different, they will generally
lead to very different sets of predictions. This greater variability will lead to a stronger final
result. When the samples are extremely similar, all of the predictions derived from the samples
will be similar, thus bagging will not be very effective. It should be noted that taking smaller
samples from training set will induce greater instability, but taking samples that are too small
will result in useless models. The smaller the bagging samples, the more samples will be needed
and the more models will be generated to create more stability in the final predictors.
The different arguments that can be used in the bagging algorithm include: base_estimator, which describes the base estimator to fit on random subsets of the dataset and whose value can be an object or None (the default base estimator is a decision tree); max_samples, which describes the number of samples to draw from X to train each base estimator and can have integer or float values; bootstrap, which has a Boolean value, where True denotes that samples are drawn with replacement and False denotes sampling without replacement; bootstrap_features, which also has a Boolean value to denote whether the features are drawn with replacement or not; verbose, an integer value that helps to control the verbosity when fitting and predicting; and n_jobs, which can have an integer value.

The reader can also try other arguments like oob_score and warm_start, which have Boolean values. The oob_score denotes whether to use out-of-bag samples to estimate the generalization error, and warm_start specifies whether to reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, a whole new ensemble is fitted.

15.1.1 Bagging Algorithm for Classification Problems


Bagging Algorithm for classification problems can be applied using BaggingClassifier(), which is available in the sklearn.ensemble library. To understand the usage of the bagging algorithm on classification problems, the data of credit card clients are downloaded from https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients.
Dataset Information: These data employed a binary variable Y, default payment (yes = 1, no =
0), as the response variable for credible and not credible clients. This study reviewed the
literature and used the following 23 variables as explanatory variables:
X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit
and his/her family (supplementary) credit.
X2: Gender (1 = male; 2 = female).
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
X4: Marital status (1 = married; 2 = single; 3 = others).
X5: Age (year).
X6–X11: History of past payment. We tracked the past monthly payment records (from April
to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the
repayment status in August, 2005; …; X11 = the repayment status in April, 2005. The
measurement scale for the repayment status is: –1 = pay duly; 1 = payment delay for 1 month; 2
= payment delay for 2 months; …; 8 = payment delay for 8 months; 9 = payment delay for 9
months and above.
X12–X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in
September, 2005; X13 = amount of bill statement in August, 2005; …; X17 = amount of bill
statement in April, 2005.
X18–X23: amount of previous payment (NT dollar); X18 = amount paid in September, 2005;
X19 = amount paid in August, 2005; …; X23 = amount paid in April, 2005.

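A minimal sketch of the workflow described in the Explanation below is given here. The file name credit_card_default.csv and the response column name Y are assumptions; adjust them to the file actually downloaded from the UCI page.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report

credit = pd.read_csv('credit_card_default.csv')   # hypothetical local copy of the UCI data
x = credit.drop('Y', axis=1)                       # assuming the response column is named Y
y = credit['Y']
x_trg, x_test, y_trg, y_test = train_test_split(x, y, test_size=0.3, random_state=3000)

# Bagging model with default hyperparameters
bag = BaggingClassifier(random_state=3000).fit(x_trg, y_trg)
print(bag.score(x_trg, y_trg), bag.score(x_test, y_test))

# Grid-based search over the three hyperparameters
param_grid = {'n_estimators': [10, 20, 30],
              'max_samples': [0.5, 0.8, 1.0],
              'max_features': [0.5, 0.7, 1.0]}
grid = GridSearchCV(BaggingClassifier(random_state=3000), param_grid, cv=5)
grid.fit(x_trg, y_trg)
print(grid.best_params_)
print(classification_report(y_test, grid.predict(x_test)))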
Explanation
The dataset had 30,000 observations for 24 variables: 23 independent variables, and Y, corresponding to default payment, was considered to be the dependent variable. The test dataset had 30% of the observations. A seed value of 3000 was used for generating the same training and test datasets. A bagging model was developed by considering the following values of the hyperparameters: base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True. The result shows that the model is overfitting, since the training and test accuracy were found to be 0.98 and 0.799, respectively. For determining the best hyperparameters, a grid-based search was adopted by creating a grid object of three keys, namely n_estimators, max_samples, and max_features. The key n_estimators had three values [10, 20, 30]; max_samples had three values [0.5, 0.8, 1.0]; and max_features also had three values [0.5, 0.7, 1.0]. Thus, different models for each combination of the three hyperparameter values were created by using the GridSearchCV function. The best values of the hyperparameters were determined using the best_params_ attribute. It was observed that the best value of max_features is 1.0, max_samples is 0.5, and n_estimators is 20. This combination of hyperparameters was able to increase the accuracy to 0.809 for the bagging model.
The results of the classification report can be explained as follows:
Precision is the ratio of true positives to the sum of true and false positives; in other words, it is the percentage of correct predictions among all predicted positives. Our results show that the precision for value 0 is 0.83 and for value 1 it is 0.65. This means that 83% of the records predicted as 0 were predicted correctly. Recall is the ratio of true positives to the sum of true positives and false negatives. This displays the percentage of records of a class that were classified correctly. Our result shows a recall of 0.94 and 0.32 for the 0 and 1 values, respectively. This further suggests that the accuracy is lower (0.32) for the categorical value "1." The F1 score is a weighted harmonic mean of precision and recall; the highest score is 1.0 and the lowest is 0.0. For our data, the F1 score is 0.88 and 0.42 for 0 and 1, respectively. Support is the number of actual occurrences of the class in the specified dataset. For our test data of 9000 observations, the support is 6967 for value 0 and 2033 for value 1.
From the ROC curve, we can observe that the AUC is good but not excellent. This means the accuracy is not excellent; it would have been better if the red line had been closer to the top of the chart. However, the accuracy is definitely more than 50% and less than 100%, since the curve is higher than the 45° random line (showing an accuracy of 50%) and below the imaginary horizontal line drawn at the top of the chart (showing an accuracy of 100%). This is consistent with the AUC score of 74% and the macro average score from the classification report.

USE CASE
MEASURING CUSTOMER SATISFACTION RELATED TO ONLINE FOOD PORTALS

Online food ordering is the process of food delivery from a local restaurant or an independent food delivery company to the doorstep. A customer searches for a favorite restaurant, usually filtered via type of cuisine, chooses from the available items, and chooses delivery or pick-up. Payment is made either by credit card, PayPal, or cash, with the restaurant returning a percentage to the online food company. An order is typically made either through a restaurant or grocer's website or mobile app, and the customers can keep track of the services. The delivered items can include starters, dishes of the main course, drinks, desserts, etc., and are typically delivered in boxes or bags. The delivery person uses bikes or motorized scooters for delivery.
Important online food portals in India include Swiggy, Zomato, Uber Eats, and Foodpanda.
For determining customer satisfaction related to online food portals, the dependent variable
will be satisfaction (categorical variable). Satisfaction can basically include factors related to an
experience or feature of an experience for both purchase and eating. There might be many
independent variables related to purchase, which include ease and convenience, cost
effectiveness, 24*7 availability, easy mode of payment, customer services, better discounts,
doorstep delivery, choice of restaurant, ease of use of app, location, rewards, cashbacks, etc.
However, customer satisfaction with respect to a specific restaurant can also be measured
considering different independent variables. The sensory attributes of foods are widely
considered to be an important determinant, perhaps the most important determinant, of
satisfaction. Other independent variables related to food can be quantity of food, quality of
ingredients, packaging, etc.
For a detailed analysis, different demographic variables such as gender, age, occupation,
marital status, and income can be considered for understanding the difference in satisfaction
related to these demographic variables. T-test and ANOVA discussed earlier can be used to
determine the differences so that strategies can be framed accordingly. For increasing customer
base, survey can also include questions related to the factors that hinder people from using the
services. These factors may include unaffordability, influence from friends/family, bad past
experience, and reviews.
For measuring satisfaction, the bagging algorithm for classification problems can be used. The management of online portals can design effective strategies after determining the important predictors contributing to customer satisfaction.

15.1.2 Bagging Algorithm for Regression Problems


Bagging Algorithm for regression problems can be applied using the BaggingRegressor() function, which is available in the sklearn.ensemble library. To understand the usage of the bagging algorithm in regression problems, we have downloaded the data from https://archive.ics.uci.edu/ml/datasets/Auto+MPG.

Dataset Information: This dataset was taken from the StatLib library, which is maintained at
Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association
Exposition. This dataset is a slightly modified version of the dataset provided in the StatLib
library. In line with the use by Ross Quinlan (1993) in predicting the attribute “mpg”, eight of
the original instances were removed because they had unknown values for the “mpg” attribute.
The original dataset is available in the file “auto-mpg.data-original.” The data concerns city-
cycle fuel consumption in miles per gallon as dependent variable. The dataset had 398
observations for nine variables and has five continuous attributes. The different attributes are as
follows: mpg: continuous; cylinders: multi-valued discrete; displacement: continuous;
horsepower: continuous; weight: continuous; acceleration: continuous; model year: multi-valued
discrete; origin: multi-valued discrete and car name: string (unique for each instance). In the
dataset, it was found that six records had “?” in some column. Hence those six records were not
considered for the study. Hence, the final dataset had 392 observations. Besides, columns named
car_name, origin and year were deleted since they cannot be considered as an independent
variable.

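A minimal sketch of the workflow described in the Explanation below is given here, assuming the cleaned auto-mpg data described above have been saved locally as auto_mpg_clean.csv; the file name is an assumption.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

auto = pd.read_csv('auto_mpg_clean.csv')    # hypothetical local copy of the cleaned data
x = auto.drop('mpg', axis=1)
y = auto['mpg']
x_trg, x_test, y_trg, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Grid-based search over the three hyperparameters
param_grid = {'n_estimators': [10, 20, 30],
              'max_samples': [0.5, 0.8, 1.0],
              'max_features': [0.5, 0.7, 1.0]}
grid = GridSearchCV(BaggingRegressor(random_state=0), param_grid, cv=5)
grid.fit(x_trg, y_trg)
print(grid.best_params_)

pred = grid.predict(x_test)
print('RMSE:', np.sqrt(mean_squared_error(y_test, pred)))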
Explanation
The dataset had 392 observations with six variables. The variable "mpg" was considered the dependent variable and the five others were considered to be independent variables. Similar to the bagging algorithm for the classification problem, the hyperparameters max_features, max_samples, and n_estimators formed a grid. The grid-based approach shows that the best parameters were found to be max_features: 1.0, max_samples: 0.5, and n_estimators: 20. Since the RMSE value decreases from 4.73 to 4.47, it can be inferred that the efficiency of the new model with the grid approach increased.

Check if further improvement in the accuracy of the model is possible


considering different data preprocessing techniques and tuning of
hyperparameters for the data discussed in classification and regression
problems in the bagging algorithm.

USE CASE
PREDICTING INCOME OF A PERSON

Income is considered one of the factors used by different organizations to design strategies. Many businesses need to personalize their offers based on a customer's income. For example, low-income customers will be targeted with normal products and high-income customers with premium products. For providing transparency, job providers can also make huge improvements in the experience of users searching for jobs and help jobseekers to understand the market worth of different positions. Government and other organizations may need to compute the income of a person for tax and other reasons. Besides, banks and other financial services have millions of customers using their services, such as ATMs, debit cards, and Internet banking, and want to distinguish long-term customers. A bank/credit company considers credit history and income as important parameters and wants models for setting credit card limits, which will help it to predict the credit risk in approving the credit limit of a customer. As a customer's income is not always explicitly known, a predictive model could estimate the income of a person based on many other parameters.
It is generally considered that few determinants would influence the income of a person.
These determinants are related to demographic and psychographic needs, among others.
Factors that are generally considered includes categorical variables such as Place/City (Tier-I,
Tier-II, and Tier-III), intelligence level (ordinary, extra-ordinary), gender (male/female), caste
(Marathi, Marwari, Punjabi, Sindhi, etc.), income group (low level, middle level, upper class),
educational levels (XII, graduate, postgraduate), marital status (married, unmarried, divorced),
number of dependents (0, 1, 2, 3), type of organization (private, self-employed, state government,
central government), current designation (junior level, middle level, senior level), previous
designation (junior level, middle level, senior level), job type (engineer, doctor, manager,
professor, etc.), and number of organization changed (1, 2, 3, etc.). Different continuous
variables can also be considered as important factors such as age of the person, age of the
dependents, experience, performance report of job, and performance report during education.
Our objective is to create a predictive model that will be able to generate income of a person
(continuous variable) based on different independent variables. A huge amount of data can be
collected and prediction of the earning of a person can be done based on different categorical
and continuous variables discussed above and considering them as independent variables. A
bagging algorithm for regression problems can be applied for determining the continuous
dependent variable.

15.2 Random Forest
Random Forest is an extension of bagging. In addition to taking a random subset of the data, as is done in the bagging algorithm, this algorithm also takes a random selection of features rather than using all features to grow the trees. Decision trees, being prone to overfitting, have been transformed into random forests by training many trees over various subsamples of the data (in terms of both the observations and the predictors used to train them). Hence, a lot of random trees are generated; since there are many random trees, it is called a random forest.
We know that error occurs due to two main reasons: bias and variance. A too complex model has low bias but large variance, while a too simple model has low variance but large bias. Hence, we need different ways to solve the problem: a variance reduction algorithm for a complex model and bias reduction for a simple model. Random forest and boosting algorithms help in reducing the bias of simple models to a great extent. Random forest also reduces the variance of a large number of complex models with low bias. Since the trees are large in number, are built on random samples, and use additional random variable selection that makes them even more independent, random forest performs better than the bagging algorithm.
The parameters that control model complexity in decision trees are the prepruning parameters that stop the building of the tree before it is fully developed. The main advantage of decision trees is that they can be easily visualized and understood by nonexperts, and the algorithm is completely invariant to the scaling of the data. But the main drawback of decision trees is that, even with the use of prepruning, they tend to overfit and provide poor generalization performance. Hence, random forest is used in most applications in place of a single decision tree. Random forest is very powerful, often works well without heavy tuning of the parameters, and does not require scaling of the data. Random forest shares all the benefits of the decision tree. But decision trees are preferred if we need a compact representation of the decision-making process: it is hard to interpret hundreds of trees in detail, and the trees in a random forest tend to be deeper than a single decision tree. Thus, if we want to explain a prediction in a visual way to nonexperts, a single decision tree might be a better choice. Random forest works well on large datasets, but training might be time consuming and can be parallelized across multicore processors. Random forest does not perform very well on high-dimensional sparse data such as text data.
Random forest is considered to be a solution of all data science problems. Like a decision
tree, random forest is also capable of performing both regression and classification tasks. In this
method, a group of weak models combine to form a powerful model. In random forest, we grow
multiple trees as opposed to a single tree in decision tree model. In classification problem, each
tree gives a classification and it is interpreted that the tree “votes” for that class. The forest
chooses the classification having the most votes (trees in the forest). In case of regression
problems, it takes the average of outputs by different trees. In the random forest approach, a
large number of decision trees are created. Every observation is fed into every decision tree. The
most common outcome for each observation is used as the final output. A new observation is fed
into all the trees and taking a majority vote for each classification model.
The major drawback of random forest is that, like the decision tree, it is more appropriate for classification problems and not as good for regression problems, as it does not give precise continuous predictions. In the case of regression, it does not predict beyond the range of the training data, and the trees sometimes overfit the datasets. However, because of its high prediction power, data analysts generally use random forest for prediction.
Besides the common parameters, the user can use optional parameters that have Boolean values: bootstrap, to determine whether bootstrap samples are used when building trees (if False, the whole dataset is used to build each tree); oob_score, to determine whether to use out-of-bag samples to estimate the generalization accuracy; and warm_start, to specify whether to reuse the solution of the previous call to fit and add more estimators to the ensemble (otherwise, a whole new forest is fitted). Other parameters that contain integer values include n_jobs, to denote the number of jobs to run in parallel (fit, predict, decision_path, and apply are all parallelized over the trees), and verbose, to control the verbosity when fitting and predicting.

The reader can also use the class_weight argument, which has a dictionary data type and can contain values in the form of a list of dictionaries, or the values "balanced", "balanced_subsample", or None (optional). The "balanced" and "balanced_subsample" modes are basically used for the computation of weights according to the requirement. It should be noted that the weights associated with classes are in the form {class_label: weight}. If weights are not specified, all classes are supposed to have weight one. For multi-output problems, the weights of each column will be multiplied, and a list of dicts can be provided in the same order as the columns.

15.2.1 Random Forest Algorithm for Classification Problems


Random Forest algorithm for classification problems can be applied using RandomForestClassifier(), which is available in the sklearn.ensemble library. For understanding the use of the random forest algorithm for classification problems, we will consider the letter recognition dataset downloaded from https://archive.ics.uci.edu/ml/datasets/Letter+Recognition.

Dataset Information: The dataset had 20,000 observations with 17 variables. The variable letter is a categorical variable having 26 factors from A to Z. In this classification problem, we will consider letter as the dependent variable and the others as independent variables.

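A minimal sketch of the workflow described in the Explanation below is given here. The file name letter-recognition.data is the usual UCI name but should be treated as an assumption, as should the exact hyperparameter grid.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

letters = pd.read_csv('letter-recognition.data', header=None)   # hypothetical local copy
x = letters.iloc[:, 1:]                  # the 16 numeric attributes
y = letters.iloc[:, 0]                   # the letter (A-Z), the dependent variable
x_trg, x_test, y_trg, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Random forest with default settings
rf = RandomForestClassifier(random_state=0).fit(x_trg, y_trg)
print(rf.score(x_trg, y_trg), rf.score(x_test, y_test))

# Grid-based search; the grid values are illustrative assumptions
param_grid = {'criterion': ['gini', 'entropy'],
              'max_features': ['sqrt', 'log2']}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(x_trg, y_trg)
print(grid.best_params_)
print(classification_report(y_test, grid.predict(x_test)))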
Explanation
The dataset had 20,000 observations for 17 variables. The dataset had one dependent variable named letter, which had 26 categories corresponding to the English alphabets A-Z. A random forest model was developed considering the default settings of the classifier. The result shows that the model is overfitting, since the training and test accuracy were found to be 0.99 and 0.933, respectively. For determining the best hyperparameters, a grid-based search was adopted by creating a grid object of two different keys: criterion and max_features. Thus, different models for each combination of the two hyperparameter values were created by using the GridSearchCV function. The best values of the hyperparameters were determined using the best_params_ attribute. It was observed that the best value of criterion is gini and that of max_features is sqrt. However, this combination of hyperparameters was not able to increase the accuracy. When we do a comparison with the previously discussed ensemble techniques and the decision tree model, the accuracy was found to be higher than that of the bagging model (0.926) and the decision tree model (0.871). The decision tree model shows a high level of overfitting because the accuracy on the training set is found to be 100%, while the accuracy on the test dataset drops to 87.1%. Thus, for this dataset, the random forest model was better than the decision tree and bagging models. The classification report shows that, out of the 6000 cases in the test set, nearly all the categories A-Z show high values of precision, recall, and F1-score. It has been generally observed that if the number of observations in the dataset is larger, the probability of increasing the accuracy is higher. For improving accuracy, the user is hence suggested to have a larger number of observations, as reflected in this example. We have not considered the ROC curve for these data because there are 26 categories in the dependent variable, which would not give a meaningful interpretation.

USE CASE
WRITING RECOMMENDATION/APPROVAL REPORTS

Machine learning is a dire need in today's scenario to eliminate human effort as well as achieve higher automation with fewer errors. Machine learning utilizes algorithms that can learn from data and perform predictive training. A recommendation report compares two or more products, services, or solutions and makes a recommendation about the best option. Because the purpose of the report is to recommend a course of action, it is called a recommendation report. Such reports are used in different types of organizations, such as government and private organizations, and in different sectors, such as construction, medical sciences, and manufacturing. These reports act as a decision-making tool and help in deciding between two or more products or solutions, so that the decision will not be an arduous process in the concerned area.
These reports in business may be related to buying any product/service that is required for business processes. For example, in the construction industry, decisions related to the purchase of products can be based on different product attributes, such as rate, life of the product, and past experience, and on company information, such as background data, policy documents, and financial figures. The decision related to hiring services is based on the skills of the service provider, the timing of the service, the location of the service provider, etc. In government organizations, the recommendation for awarding a contract to a particular organization or for tender approval is based on different parameters such as price, quality, time of delivery, past work, organization culture, management, and location.
In medical field, computer-aided diagnosis (CAD) is a rapidly growing dynamic area of
research. In recent years, significant attempts are made for the enhancement of CAD
applications because errors in medical diagnostic systems can result in seriously misleading
medical treatments. For example, in pathology, the blood testing is done on the basis of many
parameters in blood and accordingly report is prepared showing whether the person is suffering
from malaria/dengue or any other disease. However, extracting knowledge from medical data is
challenging as these data may be heterogeneous, unorganized, and high dimensional and may
contain noise and outliers since the recommendation report is written based on the different
compositions of body.
Random Forest classification algorithm can be used for preparing recommendation report
related to any organization, person, product/service, etc., which will help us to decide the later
course of action.

15.2.2 Random Forest Algorithm for Regression Problems


Random Forest algorithm for regression problems can be applied using the RandomForestRegressor() function, which is available in the sklearn.ensemble library. We consider the energy efficiency dataset, which can be downloaded from https://archive.ics.uci.edu/ml/datasets/Energy+efficiency.
Dataset Information: This dataset has 12 different building shapes simulated in Ecotect. The
buildings differ with respect to the glazing area, the glazing area distribution, and the orientation,
amongst other parameters. We simulate various settings as functions of the aforementioned
characteristics to obtain 768 building shapes. The dataset comprises 768 samples and eight
features, aiming to predict two real valued responses. The dataset contains eight attributes (or
features, denoted by X1…X8) and two responses (or outcomes, denoted by y1 and y2). The aim
is to use the eight features to predict each of the two responses.
The attribute information is as follows:
X1 = Relative compactness,
X2 = Surface area,
X3 = Wall area,
X4 = Roof area,
X5 = Overall height,
X6 = Orientation,
X7 = Glazing area,
X8 = Glazing area distribution,
y1 = Heating load,
y2 = Cooling load.
For creating a random forest model, we want to measure the impact of all the independent variables on the cooling load. Hence, we will consider cooling load (y2) as the dependent variable and delete the y1 variable from the downloaded file.

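A minimal sketch of the workflow described in the Explanation below is given here, assuming the energy efficiency data (with the y1 column removed) have been saved locally as energy_efficiency.csv; the file name and the grid values are assumptions.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

energy = pd.read_csv('energy_efficiency.csv')   # hypothetical local copy of the data
x = energy.drop('Y2', axis=1)
y = energy['Y2']
x_trg, x_test, y_trg, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Random forest with default settings
rfreg = RandomForestRegressor(random_state=0).fit(x_trg, y_trg)
pred = rfreg.predict(x_test)
print('RMSE:', np.sqrt(mean_squared_error(y_test, pred)))

# Grid-based search; max_features=1.0 corresponds to the older "auto" setting
param_grid = {'max_depth': [4, 8, 12],
              'max_features': [1.0, 'sqrt'],
              'min_samples_split': [0.05, 0.1, 2],
              'n_estimators': [10, 30, 100]}
grid = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=5)
grid.fit(x_trg, y_trg)
print(grid.best_params_)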
Explanation
The dataset had 768 observations with nine variables. The variable "Y2", corresponding to cooling load, was considered to be the dependent variable and the eight others were considered to be independent variables. A basic random forest model was created with default settings. A dictionary of the other parameters was created and a grid-based approach was executed for identifying the best parameters. The best values of the parameters were found to be 8 for max_depth, "auto" for max_features, 0.05 for min_samples_split, and 30 for n_estimators. Since the RMSE value increased from 1.89 to 1.95, it can be inferred that the efficiency of the new model decreased when the best parameters from the grid search were considered for the random forest model. Thus, for this dataset, the default settings show a better result. For this dataset, when the comparison was done with the previously discussed ensemble techniques and the decision tree model, the RMSE value was found to be least for the bagging model (1.885) and highest for the decision tree model (2.294). Thus, the bagging algorithm should be considered for this dataset.

Check if further improvement in the accuracy of the model is possible


considering different data preprocessing techniques and tuning of
hyperparameters for the data discussed in classification and regression
problems in the random forest algorithm.

USE CASE
PREDICTION OF SPORTS RESULTS

The challenge of predicting sports results has long been of interest to sport managers, the media, and different stakeholders. The decision about the winning team is important because of the interest of people and the financial assets involved in the betting process. The bookmakers, fans, and potential bidders are all interested in approximating the odds of a game in advance. In addition, sport managers are also looking for appropriate strategies that can work well for assessing the potential opponent in a match. Due to the advent of new technologies, a large amount of electronic data related to sports is now available, which may further help in developing prediction models to forecast the results of matches.
Besides, video game/e-sports streaming is also a huge market. In the world championship of
League of Legends (LoL) last year, one semifinal attracted 106 million viewers, even more than
the 2018 Super Bowl. Companies want a model to estimate the winning rate of a team in real
time for providing personalized game analytics to players.
In sports prediction, data related to many variables can be collected including the historical
performance of the teams, results of matches, characteristics of the players (winning percentage,
player fatigue and injury, serve, score, etc.), and characteristics of opposition team and the
match (match surface, venue, weather conditions). Player selection is also one of the most
important tasks for a team sport. The team management, the coach, and the captain select
players for each match from a squad of players. They analyze different characteristics and the
statistics of the players to select the best players for each match. In case of e-sports and video
games also, there are many factors that are informative to the prediction of the game. For
example, pattern in team composition and the choices of heroes can serve as a strong indicator
of the outcomes.
The quality of players cannot be captured by a single value, and such a simplification will not correctly
reflect their contribution to the outcome of a match. Therefore, the challenge really lies in how to handle these
categorical features. Considering the availability of an immense amount of diverse historical
sports data, supervised machine learning algorithm can be considered as the best approach for
sports prediction. The features of players and the features of the match, paired with the match
result, could form a set of independent variables and dependent variables, respectively. Random
Forest algorithm can be used to generate the prediction models for all the situations and
problems related to prediction of sports results.

15.3 Extra Trees


The extra trees algorithm is similar to the random forest classifier; the key difference is that in extra
trees both the features and the split points are selected at random, hence the name “extremely
randomized trees.” Since splits are chosen at random for each candidate feature, the extra trees
classifier is less computationally expensive than a Random Forest, which is consistent with the
theoretical construction of the two learners.
The difference between random forests and extra trees lies in the fact that, instead of computing
the locally optimal feature/split combination (as in the random forest), a random value is selected
for the split of each feature under consideration (in extra trees). To introduce more
variation into the ensemble, the trees are built in a different manner, which leads to more
diversified trees and fewer splits to evaluate when training an extremely randomized forest. Each
decision tree will be built with the following criteria: All the data available in the training set are
used to build each tree and to form the root node or any node, the best split is determined by
searching in a subset of randomly selected features of size sqrt(number of features). The split of
each selected feature is chosen at random. When all the variables are relevant, both
methods seem to achieve the same performance; however, in the presence of noisy features, extra trees
seem to maintain a higher performance. In some situations, random forest can generalize
better than extra trees, but this can be worked out by developing different models with both
algorithms and tuning parameters such as n_estimators, max_features, and
min_samples_split using the grid search methodology. It should be noted that when variables are
chosen for a split, samples are drawn from the entire training set instead of a bootstrap sample of
the training set. Splits are chosen completely at random from the range of values in the sample at
each split.

The criterion argument in the extra tree algorithm is an important argument to


measure the quality of a split. Supported criteria are “gini” for the Gini
impurity and “entropy” for the information gain. The default value for this
argument is “gini”.

15.3.1 Extra Tree Algorithm for Classification Problems


We have taken the shuttle dataset from the R environment, which is present in the mlbench library, for
explaining the utility of the extra tree algorithm for classification problems. The dataset was read in
the R environment and written to a csv file by using the command write.csv().
Dataset Description: The data have 58,000 rows of 10 variables. Class was considered to be the
dependent variable and had seven categories. Only 22 rows were present corresponding to the classes
Bpv.Open and Bpv.Close, and hence these were changed to Fpv.Open and Fpv.Close. The change
finally resulted in five categories of the class variable. The other nine variables “V1,” “V2,” “V3,” “V4,”
“V5,” “V6,” “V7,” “V8,” and “V9” were considered to be independent variables.
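A minimal sketch of this analysis is given below. The file name shuttle.csv, the 70:30 split, and the seed value are assumptions, and the rare Bpv.* rows are assumed to have already been recoded as described above.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, classification_report

# Shuttle data written out of R with write.csv() (assumed file name)
data = pd.read_csv("shuttle.csv")
X = data.drop("Class", axis=1)       # V1..V9 as independent variables
y = data["Class"]                    # five-category dependent variable

X_trg, X_test, y_trg, y_test = train_test_split(X, y, test_size=0.3, random_state=3000)

# Extra trees classifier with default settings (criterion="gini" by default)
et = ExtraTreesClassifier(random_state=0)
et.fit(X_trg, y_trg)
pred = et.predict(X_test)

print("Test accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))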

Explanation
The dataset had 58,000 observations for 10 variables. An extra tree model was developed
considering the default settings of the classifier. The result shows that all the different models
related to ensemble techniques achieve an excellent accuracy of nearly 100%. The classification
report shows that out of the 17,400 test records corresponding to five categories, Bypass has 994,
Fpv.Close has 16, Fpv.Open has 59, High has 2669, and Rad.Flow has 13,662 observations.
The precision was nearly perfect for all the categories. However, the recall value was slightly
lower for Fpv.Close and Fpv.Open. The ROC curve for the extra tree algorithm was not created
because the decision function used in the multiclass classifier is not applicable to the extra tree
classifier.

USE CASE
IMPROVING THE E-GOVERNANCE SERVICES

Governance is a challenge in a huge, diverse, and rapidly developing country like India. The rapid
rise of the Internet and digitization enables large-scale transformation and helps in the
implementation of ambitious government plans by initiating steps to involve IT in all
governmental processes. Thus, e-Governance involves carrying out the functions and achieving the
results of governance through the utilization of electronic services. An efficient e-governance
system basically ensures that the government is transparent in its processes and accountable for its
activities, facilitates efficient storing and retrieval of data, enables rapid processing and transmission of
information and data, and supports taking decisions expeditiously and judiciously. This process helps in increasing the
reach of government both geographically and demographically and includes delivering
services online to citizens by connecting them to the respective government departments.
Different initiatives include efficient network from a single window facility for the
disbursement of services like providing citizens the means to pay taxes and other financial dues
to the state government, computerization of land records to ensure that landowners get
computerized copies of ownership, crop and tenancy and updated copies, handling of
grievances, admission to professional colleges, etc. The different e-Governance projects include
e-Mitra project in Rajasthan, e-Seva project in Andhra Pradesh, CET (Common Entrance Test),
Bhoomi Project for online delivery of land records in Karnataka, and Gyandoot in Madhya
Pradesh, which was used as an interface between the district administration and the people.
After taking many initiatives for development, the Government would always like to know the
feedback and response of the citizens of the country. Besides, every citizen wants a safe community with
reduced crime and corruption. The analyst can develop a model after identifying the different
independent variables leading to satisfaction with e-governance services, wherein satisfaction
can be considered as the dependent variable. The compiled report based on complaint statistics
using different classification algorithms can be used for reference by the government
departments. It will help to understand the social issues so that government departments can
discover them before they become serious and thus seize the opportunities for service
improvements. Hence, by decoding the messages through the statistical analysis of complaints
data, the government can better understand the voice of the people and help government
departments to improve service delivery and develop smart strategies. This will help to boost
public satisfaction with the government.

15.3.2 Extra Tree Algorithm for Regression Problems


For understanding the utility of extra tree algorithms in regression problems, we will consider the
dataset that can be downloaded from https://archive.ics.uci.edu/ml/machine-learning-
databases/concrete/compressive/.
Dataset Information: Concrete is the most important material in civil engineering. The concrete
compressive strength is a highly nonlinear function of age and ingredients. These ingredients
include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine
aggregate. The information of dataset is as follows:
Name – Data Type – Measurement – Description
Cement (X1) – quantitative – kg in a m3 mixture – Input Variable
Blast Furnace Slag (X2) – quantitative – kg in a m3 mixture – Input Variable
Fly Ash (X3) – quantitative – kg in a m3 mixture – Input Variable
Water (X4) – quantitative – kg in a m3 mixture – Input Variable
Superplasticizer (X5) – quantitative – kg in a m3 mixture – Input Variable
Coarse Aggregate (X6) – quantitative – kg in a m3 mixture – Input Variable
Fine Aggregate (X7) – quantitative – kg in a m3 mixture – Input Variable
Age (X8)– quantitative – Day (1~365) – Input Variable
Concrete compressive strength – quantitative – (Y1) – Output Variable
Thus, the dataset had strength as the output variable and eight independent variables
corresponding to input.
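A possible sketch of the extra tree regression workflow for this dataset is shown below. The file name concrete.csv, the column name strength, the split, and the grid values are assumptions and may need to be adapted.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error

# Concrete compressive strength data (assumed file and column names)
data = pd.read_csv("concrete.csv")
X = data.drop("strength", axis=1)    # eight input variables
y = data["strength"]                 # output variable Y1

X_trg, X_test, y_trg, y_test = train_test_split(X, y, test_size=0.3, random_state=3000)

# Extra trees regressor with default settings
et = ExtraTreesRegressor(random_state=0)
et.fit(X_trg, y_trg)
print("RMSE (default):", np.sqrt(mean_squared_error(y_test, et.predict(X_test))))

# Grid search over max_features and min_samples_leaf (indicative values only)
param_grid = {"max_features": [None, "sqrt"],
              "min_samples_leaf": [1, 3, 5]}
grid = GridSearchCV(ExtraTreesRegressor(random_state=0), param_grid, cv=5,
                    scoring="neg_mean_squared_error")
grid.fit(X_trg, y_trg)
print("Best parameters:", grid.best_params_)
print("RMSE (tuned):", np.sqrt(mean_squared_error(y_test, grid.best_estimator_.predict(X_test))))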

Explanation
The dataset had 1030 observations with nine variables. The variable “Y1” corresponding to
concrete compressive strength was considered to be the dependent variable and the eight others were considered to be
independent variables. The best values of the parameters were found to be: “criterion”: “mse,”
“max_features”: “auto,” “min_samples_leaf”: 1. The RMSE value for the extra tree model
decreased from 5.408 to 5.326; hence it can be inferred that the efficiency of the new model
increased with the grid approach. For this dataset, when the comparison was done with the
previously discussed ensemble techniques and the decision tree model, the RMSE value was found
to be least for the random forest model (5.259), followed by the bagging model (5.45), and highest for
the decision tree model (7.378). Thus, the random forest algorithm should be considered for this
dataset.

Check if further improvement in the accuracy of the model is possible


considering different data preprocessing techniques and tuning of
hyperparameters for the data discussed in classification and regression
problems in the extra tree algorithm.

USE CASE
LOGISTICS NETWORK OPTIMIZATION

As supply chains have become more global, the logistics network includes new routes and
locations and thus huge load of data needs to be processed every day. Each new link in the
supply chain involves complexities regarding the availability of logistics assets, infrastructure,
laws and regulations, etc. Logistics network optimization is helpful in streamlining the global
supply chain process through data digitization process and data standardization. It helps to
streamline complicated logistics processes and keep track and trace of data related to products,
services, and information effectively. In manufacturing industry, logistics network optimization is
used for supply and sales of finished goods and procurement of raw materials. It aims to
determine the number, location, and size of warehouses that are optimal for each business by
considering large range of constraints in one’s supply chain. It also determines the best
combination of warehouses necessary to cover the entire supply chain from raw material
suppliers to end-users.
Effective and efficient logistics management is done by addressing all modes of
transportation with an intention to keep costs low and customer satisfaction high. The analyst
can evaluate multiple transportation modes, carriers, routes, shipping strategies, and support to
find the lowest cost and time-efficient combination for transportation optimization needs, so that
client can run business smarter and more effectively. The model can be developed using
regression algorithm for showing the relationship of logistics operation, with vendors,
warehouses, distribution centers, service operations, transportation routes and hubs, can reveal
a better picture for cost reduction and service improvement. However, the weightage for these
two advantages may differ depending on the purpose of the optimization. The decision can be a
trade-offs between cost and service and hence determining the best combination of warehouses
to offer a desirable service with the lowest cost possible under given constraints. The best
regression model can be determined by analyzing and evaluating many different models
corresponding to different scenarios.

15.4 Ada Boosting


Boosting is a general ensemble method that keeps adding weak learners to correct classification
errors and creates a strong classifier from a number of weak classifiers. This is done by building
a model from the training data, then creating a second model that attempts to correct the errors
from the first model. Models are added until the training set is predicted perfectly or a maximum
number of models is added. AdaBoost was originally called AdaBoost.M1 by its authors,
Freund and Schapire. AdaBoost is generally used to boost the performance of decision trees on
binary classification problems rather than regression problems; hence it may be referred to as
discrete AdaBoost. It is best used with weak learners. The most suited and therefore most
common algorithm used with AdaBoost is a decision tree with one level. Weak models are
added sequentially and trained using the weighted training data. The process continues until a
pre-set number of weak learners has been created (a user parameter) or no further improvement
can be made on the training dataset. Predictions are made by calculating the weighted average of
the weak classifiers.
Initially, the base learner takes all the distributions and assigns equal weight or attention to
each observation. If there is any prediction error caused by the first base learning algorithm, then we
pay higher attention to the observations having prediction errors. Then, we apply the next base
learning algorithm. This process is repeated until the limit on the number of base learners is reached
or a higher accuracy is achieved. Finally, it combines the outputs from the weak learners and creates a
strong learner, which eventually improves the prediction power of the model. Boosting places
higher focus on examples which are wrongly classified or have higher errors under the preceding weak
rules. It basically fits a sequence of weak learners on differently weighted training data. It starts by
predicting on the original dataset and gives equal weight to each observation. If the prediction is incorrect
using the first learner, then it gives a higher weight to the observation which has been predicted
incorrectly. Being an iterative process, it continues to add learner(s) until a limit is reached in the
number of models or accuracy.
For a new input instance, each weak learner calculates a predicted value as either +1.0 or –
1.0. The predicted values are weighted by each weak learner's stage value. The prediction for the
ensemble model is taken as the sum of the weighted predictions. If the sum is positive, then the
first class is predicted; if negative, the second class is predicted. For example, five weak
classifiers may predict the values: 1.0, 1.0, –1.0, 1.0, –1.0. From a majority vote, it looks like the
model will predict a value of 1.0 or the first class. These same five weak classifiers may have the
stage values 0.2, 0.5, 0.8, 0.2, and 0.9, respectively. Calculating the weighted sum of these
predictions results in an output of –0.8, which would be an ensemble prediction of –1.0 or the
second class.
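The weighted vote described above can be verified with a few lines of Python:

# Weighted vote of the five hypothetical weak classifiers described above
predictions = [1.0, 1.0, -1.0, 1.0, -1.0]      # individual weak-learner outputs
stage_values = [0.2, 0.5, 0.8, 0.2, 0.9]       # weight (stage value) of each learner

weighted_sum = sum(p * w for p, w in zip(predictions, stage_values))
print(weighted_sum)        # -0.8, so the ensemble predicts the second class (-1.0)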

One optional argument that is used in AdaBoost is “algorithm” which can have
two values: ‘SAMME’ or ‘SAMME.R’. The default value is ‘SAMME.R’. If
‘SAMME.R’ then use the SAMME.R real boosting algorithm. It is important
that the base_estimator must support calculation of class probabilities. If
‘SAMME’, then use the SAMME discrete boosting algorithm. The SAMME.R
algorithm typically converges faster than SAMME, achieving a lower test error
with fewer boosting iterations.

15.4.1 AdaBoost for Classification Problems


For understanding the implementation of AdaBoost for classification problems, we will consider
balance dataset from: http://archive.ics.uci.edu/ml/datasets/balance+scale.
Dataset Information: This dataset was generated to model psychological experimental results.
Each example is classified as having the balance scale tip to the right, tip to the left, or be
balanced. The attributes are the left weight, the left distance, the right weight, and the right
distance. The correct way to find the class is the greater of (left-distance × left-weight) and
(right-distance × right-weight). If they are equal, it is balanced. The attributes are as follows:
Class having three values: (L, B, R); Left-Weight having five values: (1, 2, 3, 4, 5); Left-
Distance having five values: (1, 2, 3, 4, 5); Right-Weight having five values: (1, 2, 3, 4, 5); and
Right-Distance having five values: (1, 2, 3, 4, 5). Class is considered as the dependent variable
and the four others are considered as independent variables.
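A minimal sketch of this analysis is shown below. The local file name, the column names (which follow the UCI description), the split, the seed value, and the grid values are assumptions and may need to be adapted.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report

# Balance-scale data (assumed file name; column order follows the UCI description)
cols = ["Class", "LeftWeight", "LeftDistance", "RightWeight", "RightDistance"]
data = pd.read_csv("balance-scale.data", names=cols)
X = data.drop("Class", axis=1)
y = data["Class"]

X_trg, X_test, y_trg, y_test = train_test_split(X, y, test_size=0.3, random_state=3000)

# AdaBoost with default settings (the default algorithm is "SAMME.R", as noted above)
ada = AdaBoostClassifier(random_state=0)
ada.fit(X_trg, y_trg)
print("Training accuracy:", ada.score(X_trg, y_trg))
print("Test accuracy:", ada.score(X_test, y_test))

# Grid over learning_rate and n_estimators
param_grid = {"learning_rate": [0.01, 0.1, 0.5, 1.0],
              "n_estimators": [50, 100, 200]}
grid = GridSearchCV(AdaBoostClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_trg, y_trg)
print("Best parameters:", grid.best_params_)
print("Tuned test accuracy:", grid.best_estimator_.score(X_test, y_test))
print(classification_report(y_test, grid.best_estimator_.predict(X_test)))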

Explanation
The dataset had 625 observations for five variables. The dataset had one dependent variable,
which has three categories: L (left), R (right), and B (balanced). An AdaBoost model was
developed considering the default settings of the classifier. The result shows that the model is
overfitting, since the training and test accuracy were found to be 0.95 and 0.899, respectively.
For determining the best hyperparameters, a grid-based search was adopted by creating a grid
object with two different keys: learning_rate and n_estimators. Thus, different models
for each combination of the two hyperparameter values were created by using the
GridSearchCV function. The best values of the hyperparameters were determined using the
best_params_ variable. It was observed that the best value of learning_rate is 0.1 and of
n_estimators is 200. This combination of hyperparameters was able to increase the accuracy
from 0.899 to 0.91 for the AdaBoost model. When a comparison is made with the previously
discussed ensemble techniques and the decision tree model, this accuracy was found to be higher than that of
the random forest model (0.814), extra tree model (0.819), bagging model (0.835),
and decision tree model (0.814). Thus, for this dataset the AdaBoost model was better than all the
other models. The classification report shows that out of 188 cases in the test set, 15
correspond to category “B,” 84 correspond to category “L,” and 89
correspond to category “R.” For category “L,” the precision was 0.89, recall was 1, and F1
score was 0.94; for category “R,” the precision was 0.94, recall was 0.98, and F1 score was
0.96. However, the precision for category “B” was nil. Thus, none of the observations were
rightly predicted for category “B.” The user can accordingly decide on the feature extraction for
improving accuracy.

The ROC curve shows that the AUC is 0.99, 1, and 0.25 for the different categories. It is
important to binarize the y_trg and y_test values before using them in the multiclass classifier,
otherwise a tuple-related error will be generated; however, the binarized data cannot be used while
creating the models without the multiclass classifier. It should be noted that the chart is therefore
created at the last stage, otherwise we would not have been able to execute the other algorithms on
the binarized data.

USE CASE
PREDICTING CUSTOMER CHURN

Customer churn (customer attrition) is the percentage of customers that stopped using
company’s product/service during a certain time frame. It is calculated by dividing the number
of customers lost during a particular time period by the total number of customers in the
beginning of that time period. For example, if in the start of the year, there were 1000 customers
and if from those 1000 customers there were 900 customers remaining at the end of the year, the
churn rate is 10% because there is a loss of 10% of the customers. The organization would
always aim for a churn rate as close to 0% as possible and will consider churn rate as a top
priority. Hence, it is one of the most important measures for evaluating a growing business.
It is clear that the organization will face the problem, if the customer churn rate is higher
than new customer acquisition rate. The full cost of customer churn includes both lost revenue
and the marketing costs involved with replacing those customers with new ones. Hence, churn is
expensive because organizations spend heavily to acquire new customers through sales and
marketing efforts. Besides, it is always better to keep an existing customer than to acquire a new
customer as it is much easier to save a customer before they leave than to convince the customer
to come back. Hence, reducing customer churn is the major goal of every business and it directly
impacts company’s bottom line, with increased revenue and reduced acquisition costs.
The ability to predict that a particular customer is at a high risk of churning will give the
organization a time to do some remedial measures and may be an additional potential revenue
source for business. Thus, by focusing on churn, organizations can multiply the ROI many times
over on their sales or marketing efforts but, with each customer who churns, there are usually
some indicators that could have been determined with proper churn analysis. Hence, the most
important concern for organization is to determine how to measure churn. They should also be
able to predict in advance which customers are going to churn through churn analysis and know
which marketing actions will have the greatest retention impact on each particular customer.
This will help them to reduce customer churn to a great level.
Different organizations can determine these indicators of churn, and then use the data to
predict the likelihood of customer churn. In product-based industry, declining repeat purchases,
reduced purchase amounts, and customer experiences from the relational feedback can be
considered as important measures to predict churn. For example, a customer who has declined
in recent visits and gives a less rating after its latest shopping experience has definitely an
increased probability of churning. In telecom industry, it will be effective if we analyze the data
related to time on network, days since last top-up, activation channel, whether the customer
ported the number or not, customer plan, outbound calling behavior over the preceding 90 days.
It has been a general observation that time on network has a strong correlation with churn. After
understanding the drivers of customer churn, the organization can then identify at-risk
customers by using a classification algorithm and considering the outcome variable as churn and
all the measures as independent variable. It will build a model that will predict the probability of
churn for each individual customer. The organization can accordingly take steps to prevent at-
risk customers from churning by defining thresholds for taking action based on the likelihood of
churn and take immediate actions.
Thus, churn prediction modeling techniques enable the organizations to understand the
customer behaviors properly, predict the risk of customer churn, and identify the
customers who may discontinue. The accuracy of these techniques will be beneficial to the
organization for implementing efforts to retain customers, because the organization
will not take any action if it is unaware that a customer is churning. These special retention
efforts will result in increased revenues for the organization.

15.4.2 AdaBoost for Regression Problems


For showing the utility of the AdaBoost algorithm in regression problems, we use the
Abalone dataset, which can be downloaded from https://archive.ics.uci.edu/ml/datasets/Abalone.
Dataset Information: We develop the model to predict the age of abalone, which is basically
determined by cutting the shell through the cone, staining it, and counting the number of rings
through a microscope. The information of attributes in the dataset is as follows:
Name/Data Type/Measurement Unit/Description
Sex/nominal/--/M, F, and I (infant)
Length/continuous/mm/Longest shell measurement
Diameter/continuous/mm/perpendicular to length
Height/continuous/mm/with meat in shell
Whole weight/continuous/grams/whole abalone
Shucked weight/continuous/grams/weight of meat
Viscera weight/continuous/grams/gut weight (after bleeding)
Shell weight/continuous/grams/after being dried
Rings/integer/--/+1.5 gives the age in years
All the fields had numerical data except sex. The sex was nominal data having three
categories; hence for doing analysis, they were coded to numeric categories.
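A minimal sketch of this analysis is given below. The local file name, the exact column names, the split, the seed value, and the grid values are assumptions and may need to be adapted.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error

# Abalone data (assumed file name; column order follows the UCI description)
cols = ["Sex", "Length", "Diameter", "Height", "Wholeweight",
        "Shuckedweight", "Visceraweight", "Shellweight", "Rings"]
data = pd.read_csv("abalone.data", names=cols)
data["Sex"] = data["Sex"].map({"M": 0, "F": 1, "I": 2})   # code the nominal variable numerically

X = data.drop("Rings", axis=1)
y = data["Rings"]
X_trg, X_test, y_trg, y_test = train_test_split(X, y, test_size=0.3, random_state=3000)

# AdaBoost regressor with default arguments
ada = AdaBoostRegressor(random_state=0)
ada.fit(X_trg, y_trg)
print("RMSE (default):", np.sqrt(mean_squared_error(y_test, ada.predict(X_test))))

# Grid search over learning_rate and n_estimators
param_grid = {"learning_rate": [0.1, 0.5, 1.0], "n_estimators": [50, 100, 200]}
grid = GridSearchCV(AdaBoostRegressor(random_state=0), param_grid, cv=5,
                    scoring="neg_mean_squared_error")
grid.fit(X_trg, y_trg)
print("Best parameters:", grid.best_params_)
print("RMSE (tuned):", np.sqrt(mean_squared_error(y_test, grid.best_estimator_.predict(X_test))))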

Explanation
The abalone dataset had nine variables: “Sex,” “Length,” “Diameter,” “Height,”
“Wholeweight,” “Shuckedweight,” “Visceraweight,” “Shellweight,” and “Rings.” Rings,
represented by a continuous variable, was considered to be the dependent variable. An AdaBoost
model created with the default arguments shows an RMSE value of 2.737. A dictionary was
created for different parameters to apply a grid-based approach for effective hyperparameter
tuning. The best parameters were identified as follows: learning_rate: 0.5, n_estimators:
50. A new model was created considering the best parameters identified by the grid-based approach,
which was able to reduce the RMSE value from 2.737 to 2.594. The RMSE value using the extra tree
model is 1.145, the random forest model is 1.403, the bagging model is 1.387, and the decision tree model
is 1.511. Thus, for this dataset, we can say that the extra tree model was the best since it was able to
reduce the error drastically.

Perform pre-processing of data using different techniques. Also perform tuning


of hyperparameters and check if further improvement in accuracy is possible in
datasets discussed in regression and classification problems of ada boost
algorithm.

USE CASE
BIG DATA ANALYSIS IN POLITICS

One of the applications where data analysis is making a big difference is in the field of politics.
It has become a critical part of political campaigns. In addition to the structured data of detailed
records of previous campaigns, market research and poll data, a huge amount of varied data is
available on Internet through social media (Twitter, Facebook wall post, blog posts) and web
data (web pages, news articles, news groups). These data can be searched and analyzed through
political scientist for better insight and behavioral targeting. Different political groups in major
elections can use data-driven technologies for an effective and efficient campaign. Shen (2013)
in his article on big data analytics and elections has said that the real winner of the 2012
elections in the United States was analytics.
Text data have always been an important data source in political science. Political analysis
involves researching news articles, magazines, advertisements, speeches, press releases, social
media, and much more. Due to the Internet and electronic text databases, the volume of text data
has increased drastically. Besides, it is also possible to use scanner or optical character
recognition software to convert content into computer-readable texts. Also, due to the text
analysis software provided by contributors along with training and support, there has been a
rapid increase in investigating large amounts of text, which has become a boon for the political
scientists. These new tools can systematically import and analyze very large volumes of text
documents by identifying keywords, key phrases, themes, topics, images, speakers, and
sentiment. Thus, the formerly extremely time-consuming, expensive, and in many cases
impossible task of reading each and every document has been simplified. The analyst needs to tap new
data sources and employ more diverse methods for getting a better picture.
The analyst can determine the factors (in the form of numeric and text data) contributing to
the voter’s decision and employ regression algorithm to determine the satisfaction level of the
voters. The group can then decide the strategies accordingly for political campaign and election
results. Application of advanced analytics on very large and diverse data sources can be thus
used to determine who is the target for which and what type of message.

15.5 Gradient Boosting


Boosting is another ensemble technique to create a collection of predictors and Gradient
Boosting is an extension over boosting method. An ensemble of trees is built one by one and
individual trees are summed sequentially.

In this algorithm, new tree learns sequentially from the previous tree that fits relatively simple
models to the data. We fit consecutive trees at every step and the goal is to reduce error from the
previous tree. If an input is wrongly interpreted, its weight is increased so that next step classifies
it correctly. By combining the whole set at the end converts weak learners into better performing
model. The model works in a chain or nested iterative model and hence they are not independent
parallel models but each model is built based on all the previous small models by weighting.
Hence, the new model gets boosting from the earlier model. This helps in reducing bias of a
large number of small models with low variance. This algorithm supports different loss function
and works well with interactions but they are prone to overfitting and require careful tuning of
different hyperparameters. Let us understand it graphically: In this diagram, 100% accuracy
would have been achieved if stars are in gray zone and diamonds are in white zone. The weights
are representation of size of the shape. Initially, the output of first weak leaner shows that
decision boundary predicts two stars and two diamonds correctly. The outcomes predicted
correctly are given a lower weight and the incorrect predictions are weighted higher. The model
focuses on high weight points now and next tree tries to recover the loss (difference between
actual and predicted values) and hence try to classify them correctly. The weight of the rightly
predicted stars and diamonds reduces in the second learner model (the size of three stars and two
diamonds reduces). This continues for several iterations and we can see that the accuracy has again
improved in third situation. The output of third leaner shows that four stars and three diamonds
are correctly classified (size is small), while the big diamond is wrongly predicted. In the end, all
models are given a weight depending on their accuracy and a consolidated result is generated.
Thus, this algorithm combines a set of weak learners and delivers improved prediction accuracy
and at any situation, the model outcomes are weighed based on the outcomes of previous model.
Unlike bagging, boosting does not focus on reducing the variance of learners; it is bagging that
reduces the high variance of learners by averaging lots of models fitted on bootstrapped data
samples generated with replacement from the training data, so as to avoid overfitting, whereas
boosting primarily aims to reduce bias. Boosting is a
sequential technique that works on the principle of ensemble. It combines a set of weak learners
and delivers improved prediction accuracy. At any instant t, the model outcomes are weighed
based on the outcomes of the previous instant t-1. The outcomes predicted correctly are given a
lower weight and the ones misclassified are weighted higher. This technique is followed for a
classification problem, while a similar technique is used for regression.
Another major difference between both the techniques is that in bagging, the various models
that are generated are independent of each other and have equal weightage, whereas boosting is a
sequential process in which each next model that is generated is added so as to improve a bit
from the previous model. This further means that some things are added to improve on the
performance of the previous collection of models.

The reader can also use other arguments in gradient boosting algorithm: loss
which denotes the loss function to be optimized. It can have the values, such as
‘ls’ for least squares regression, ‘lad' for least absolute deviation, ‘huber’ is a
combination of the two or ‘quantile’ which allows quantile regression.
Another argument is subsample which denotes the fraction of samples to be
used for fitting the individual base learners. If smaller than 1.0, this results in
Stochastic Gradient Boosting. An argument named criterion is also generally
used by analyst to measure the quality of a split. Supported criteria are
“friedman_mse” for the mean squared error with improvement score by
Friedman, “mse” for mean squared error, and “mae” for the mean absolute
error. The default value of “friedman_mse” is generally the best as it can
provide a better approximation in some cases.
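For instance, a Gradient Boosting regressor using the more robust huber loss, stochastic subsampling, and the default split criterion could be specified as follows; the parameter values here are purely illustrative.

from sklearn.ensemble import GradientBoostingRegressor

# Illustrative settings: huber loss, 80% subsampling (stochastic gradient boosting)
# and the default "friedman_mse" split criterion
gbr = GradientBoostingRegressor(loss="huber", subsample=0.8,
                                criterion="friedman_mse", random_state=0)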

15.5.1 Gradient Boosting for Classification Problems in Python


The gradient boosting algorithm for classification problems can be applied using the function
GradientBoostingClassifier() available in sklearn.ensemble. For explaining the utility of
Gradient Boosting, we will consider the dataset from
https://archive.ics.uci.edu/ml/datasets/Wine+Quality.
Dataset Information: The dataset had 12 variables. The inputs include objective tests and have
11 independent variables. The dependent variable (quality) is based on sensory data (median of
at least three evaluations made by wine experts). Each expert graded the wine quality between 0
(very bad) and 10 (very excellent). The different input variables include fixed acidity, volatile
acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH,
sulfates, and alcohol.
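A minimal sketch of this analysis is given below. The file name winequality-red.csv and the semicolon separator correspond to the UCI red wine file, while the split, seed value, and grid values are indicative assumptions only.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

# Red wine quality data; the UCI file is semicolon separated
data = pd.read_csv("winequality-red.csv", sep=";")
X = data.drop("quality", axis=1)
y = data["quality"]

X_trg, X_test, y_trg, y_test = train_test_split(X, y, test_size=0.3, random_state=3000)

# Gradient Boosting model with default settings
gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_trg, y_trg)
print("Training accuracy:", gbc.score(X_trg, y_trg))
print("Test accuracy:", gbc.score(X_test, y_test))

# Grid over four hyperparameters
param_grid = {"n_estimators": [100, 200],
              "max_depth": [3, 4],
              "learning_rate": [0.05, 0.1],
              "max_features": ["sqrt", None]}
grid = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_trg, y_trg)
print("Best parameters:", grid.best_params_)
print(classification_report(y_test, grid.best_estimator_.predict(X_test)))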

Explanation
The dataset had 1599 observations for 12 variables: 11 independent variables, and quality was
considered to be the dependent variable. The test dataset had 30% of the observations. A seed
value of 3000 was considered for generating same training and test dataset. A Gradient
Boosting model was developed for default settings and random state = 0. The result shows that
the model is overfitting, since the training and test accuracy were found to be 0.91 and 0.69,
respectively. For determining the best hyperparameters, a grid-based search was adopted by
creating a grid-based object of four keys namely n_estimators, max_depth, learning_rate,
and max_features. Thus, different models for a particular combination of distinct four
hyperparameter values were created by using GridSearchCV function. The best values of
hyperparameters were determined using best_params_ variable. The best value of parameters
was found to be: “learning_rate”: 0.1, “max_depth”: 4, “max_features”: “sqrt,”
“n_estimators”: 200. This combination of hyperparameters was, however, not able to increase
the accuracy of the Gradient Boosting algorithm. All the other models, including the bagging
model, decision tree model, AdaBoost model, and extra tree model, had a lower accuracy than the
Gradient Boosting model. Thus, for this dataset, Gradient Boosting shows a better accuracy.
The results of the classification report show that the precision for values 3, 4, 5, 6, 7, and 8 is 0.0,
0.0, 0.7, 0.69, 0.66, and 0.33, respectively. The recall and F1-score also show similar results.
This means that the records having quality 5 were predicted the best, while all the records
having quality 3 or 4 were wrongly predicted. In our test data of 480 observations, two
correspond to value 3, 13 correspond to value 4, and so on. As the accuracy is quite low,
the analyst can follow a good feature extraction process for increasing the accuracy.
The data show that there are six different categories in the dataset for the dependent
variable. Thus, the command label_binarize(y_trg, classes=[3, 4, 5, 6, 7, 8])
binarized the data for the six different categories: 3, 4, 5, 6, 7, and 8. The plot created six different
ROC curves for the six categories. All the curves and their corresponding AUC values are shown in
the plot. It can be inferred that the model shows an average accuracy, recommending steps to be
taken for increasing the accuracy.

USE CASE
IMPACT OF ONLINE REVIEWS ON BUYING BEHAVIOR

Customers are replacing traditional ways of information search related to products and services
with Internet-based search. Personal opinions and experiences for products and services in
the form of online reviews have become one of the most important sources of information for
buying behavior of customers. In comparison to expert-written product review and abstract
product review, the customer-written product review and actual product review has a better
impact. These personal experiences are shared through different forms including verbal reviews
having detailed information related to features and benefits, product rating in the form of a
numerical value, picture and video for better visualization and impact, volume representing the
number of reviews, helpfulness that allow visitors to rate the helpfulness of the reviews, and
cumulative reviews for a better understanding of the product or service. But impact of online
reviews on purchasing decisions of customer is greatly dependent on online platform where the
review is posted. These platforms present product review in different formats and can range from
business retail websites to online communities, independent review sites, personal blogs, and
video sharing platforms. Hence, these platforms should focus to build customer trust by
providing effective website through a user-friendly design and true and sufficient information to
customers through an excellent quality of service.
Primarily, the focus of online retail websites is on sales of goods and services but, in order
to help customers decide about a product, they allow previous customers to write
their opinions in the form of product reviews. The content of reviews can range from numerical star
ratings to open-ended text comments. Independent consumer review platforms help in product
comparison and display customer reviews without having a direct or indirect interest in
businesses or products. Bloggers recommendation posts also influence customer purchase
decision making and can also be considered as a marketing communication tool. They share
purchasing experiences about products and services and give feedback to other customers.
Video-sharing platforms like YouTube help in posting product reviews in the form of videos.
These videos are uploaded by users and the customer can find a video review and also see the
popularity of the review through the number of downloads and also read comments of others about
the review.
Gradient Boosting classification algorithm can be applied on the data obtained from
different platforms (where reviews are posted) to understand the effectiveness of reviews on the
buyers purchase intention. These data will include all the independent variables, while the
dependent variable (buyers purchase intention) will have only two logical values (Yes and No).
This study can be used for the firm to determine the potential of customer reviews as a source of
messages that could be used in firm’s strategic activities to predict customer buying behavior.

15.5.2 Gradient Boosting for Regression Problems in Python


The gradient boosting algorithm for regression problems can be applied using the function
GradientBoostingRegressor() available in sklearn.ensemble. For understanding the utility of
the Gradient Boosting algorithm for regression problems, we have downloaded the data from
https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise.
Dataset Information: The NASA dataset comprises different size NACA 0012 airfoils at
various wind tunnel speeds and angles of attack. The span of the airfoil and the observer position
were the same in all of the experiments. The dataset had following five independent variables:
frequency, in hertzs; angle of attack, in degrees; chord length, in meters; free-stream velocity, in
meters per second; and suction side displacement thickness, in meters. It had only one
dependent variable, namely scaled sound pressure level, in decibels.
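A possible sketch for this dataset is shown below. The file name airfoil_self_noise.dat follows the UCI distribution, while the column names, the split, the seed value, and the grid values are assumptions and may need to be adapted.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Airfoil self-noise data: whitespace separated, no header row (column names assumed)
cols = ["frequency", "angle", "chord", "velocity", "thickness", "pressure"]
data = pd.read_csv("airfoil_self_noise.dat", sep=r"\s+", names=cols)

X = data.drop("pressure", axis=1)
y = data["pressure"]
X_trg, X_test, y_trg, y_test = train_test_split(X, y, test_size=0.3, random_state=3000)

# Default Gradient Boosting regressor
gbr = GradientBoostingRegressor(random_state=0)
gbr.fit(X_trg, y_trg)
print("RMSE (default):", np.sqrt(mean_squared_error(y_test, gbr.predict(X_test))))

# Grid over learning rate, maximum depth and number of estimators
param_grid = {"learning_rate": [0.1, 0.5],
              "max_depth": [3, 5],
              "n_estimators": [100, 200]}
grid = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid, cv=5,
                    scoring="neg_mean_squared_error")
grid.fit(X_trg, y_trg)
print("Best parameters:", grid.best_params_)
print("RMSE (tuned):", np.sqrt(mean_squared_error(y_test, grid.best_estimator_.predict(X_test))))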

Explanation
The dataset had 1503 observations with six variables. The variable “pressure” was considered
to be the dependent variable and the five others were considered to be independent variables. A
Gradient Boosting model was created using the default settings. A grid-based approach was
implemented by creating a dictionary of learning rate, maximum depth, and number of
estimators. The best values of the parameters were found to be: “learning_rate”: 0.5, “max_depth”:
5, “n_estimators”: 200. A new Gradient Boosting model was created considering these best
arguments and the RMSE value decreased from 5.156 to 1.817; hence it can be inferred that the
efficiency of the new model increased with the grid approach. For this dataset, when the
comparison was done with the previously discussed ensemble techniques and the decision tree model,
the RMSE value was found to be 3.844 for the AdaBoost model, 1.825 for the extra tree model, 2.052 for the
random forest model, 2.247 for the bagging model, and 3.3009 for the decision tree model. Thus,
the Gradient Boosting model should be considered for this dataset, since it had the least RMSE
value.

Perform pre-processing of data using different techniques. Also perform tuning


of hyperparameters and check if further improvement in accuracy is possible in
datasets discussed in regression and classification problems of gradient
boosting algorithm.

USE CASE
EFFECTIVE VACATION PLAN THROUGH ONLINE SERVICES

Online services are prevalent in different sectors related to banking, financial, travel, insurance,
recruitment, payment, food, maintenance, etc. Unlike brick-and-mortar business, these services
can be accessed from anywhere and at any time. In today’s economy and work culture either you
have got enough time to travel, but no money to do so, or you have got enough money to travel,
but no time to do so. In order to enjoy a vacation, we need to plan and predict the vacation
effectively.
Different type of online services related to vacation are provided to customers that include
hotel booking, flight booking, commuting services, food services, home delivery services, and
maintenance services. The cost of a vacation includes cost incurred due to individual ticket
booking, food booking, hotel booking, commuting cost, cruise, excursions, theme-based park
booking, automobile rental booking, and other travel-related expenses. However, the cost of
these services and complex offerings is dependent on certain time period, service class, quality,
price range, distance and geographic location. For example, flight and hotel booking at last
moment will increase the cost, staying in a downtown hotel is generally going to be more
expensive than staying in the outskirts of a city, traveling on weekends or holidays will be
expensive affair for airline ticket and lodging, and use of credit cards or special bank card
reduces the cost.
Our objective is to create a predictive model that will be able to generate cost of a vacation
trip (continuous variable) based on different independent variables (continuous and categorical
variable). This will finally help in selecting a better option since it will help in conserving and
cutting the costs associated with a vacation. A large amount of data can be collected related to
different independent variables that include categorical variables (e.g., location, category of
hotel, no of persons, and quality of services) and continuous variable (e.g., cost of all types of
services for all different time frames and location service). A Gradient Boosting algorithm for
regression problems can be applied for determining the cost (continuous dependent variable).
Accordingly, this algorithm will help to choose better flights, hotels, car rentals, etc., to plan a
stress-free, comfortable, affordable, and a safe vacation on the basis of cost prediction.

Summary
• Supervised machine learning algorithms like decision trees are used to make better decisions
and make more profit, but they suffer from bias and variance that grows as the complexity
increases.
• The ensemble method combines different decision trees to generate better predictive
performance than utilizing a single decision tree. In ensemble techniques, a group of weak
learners come together to form a strong learner.
• Ensemble techniques have the power of handling large datasets with thousands of independent
variables.
• The different ensemble techniques include Bagging, Random Forest (Extension of Bagging),
Extra tree, AdaBoost, and Gradient Boosting.
• Feature extraction plays an important role in the machine learning for problems with
thousands of features for each training instance. It is important to determine the optimal
subset of features for reducing model’s complexity, easy approach to find the best solution,
and for decreasing the time it takes to train the model.
• Hyperparameter tuning is primarily done by GridSearchCV object that performs search over
the hyperparameter grid and report the hyperparameters that will maximize the cross-
validated classifier performance.
• The commonly used hyperparameters used in ensemble techniques are max_depth,
max_features, n_estimators, learning_rate, random_state, min_samples_split,
min_samples_leaf, etc.
• The classification report displays four different values namely precision, recall, F1-score and
support for all the categories of the dependent variable; accuracy of the model; macro
average; and weighted average.
• Precision is the ratio of true positives to the sum of true and false positives. Recall is the ratio
of true positives to the sum of true positives and false negatives. The F1 score is a weighted
harmonic mean of precision and recall; the highest score is 1.0 and the lowest is 0.0. Support
is the number of actual occurrences of the class in the specified dataset.
• ROC curve is a commonly used graph that summarizes the performance of a classifier over
all possible thresholds. It is generated by plotting the true positive rate (y-axis) against the
false positive rate (x-axis).
• Bagging increases the power of a predictive statistical model by taking multiple random
samples (with replacement) from training dataset, and each collection of subset data is used
to train their decision trees, which finally results into ensemble of different models.
• Random Forest is an extension over bagging. This algorithm also takes the random selection
of features rather than using all features to grow trees along with taking the random subset of
data, which is done in bagging algorithm.
• Decision trees, being prone to overfit, have been transformed to random forests by training
many trees over various subsamples of the data (in terms of both observations and predictors
used to train them).
• Extra trees algorithm is similar to random forest classifier, with the only difference of extra
trees. The features and splits are selected at random; hence it is named as “extremely
randomized tree.” Since splits are chosen at random for each feature in the extra trees
classifier, it is less computationally expensive than a Random Forest.
• Boosting is a general ensemble method that keeps adding weak learners to correct
classification errors and creates a strong classifier from a number of weak classifiers. The
process continues until a pre-set number of weak learners have been created or no further
improvement can be made on the training dataset.
• The major difference between bagging and boosting is that in bagging, the generated models
are independent of each other and have equal weightage, whereas in boosting, each next
model that is generated is added so as to improve a bit from the previous model.

Objective Type Questions

1. This parameter is not used in ensemble techniques


(a) Random_state
(b) max_depth
(c) n_estimators
(d) All of the above
2. The value of learning rate ranges from _______ to _______ .
(a) 0, 1
(b) 1, 10
(c) 0, 10
(d) 0, 100
3. The value of precision, recall, and F1-score ranges from _______ to _______.
(a) 0, 1
(b) 1, 10
(c) 0, 10
(d) 0, 100
4. The classification report displays the following value:
(a) Recall
(b) Precision
(c) F1-score
(d) All of the above
5. Hyperparameter tuning is done primarily using the function:
(a) hypergrid
(b) GridCV
(c) Grid
(d) GridSearchCV
6. Random Forest is an extension of _______
(a) Bagging
(b) Boosting
(c) Both a and b
(d) Neither a nor b
7. Extra tree is _______ computationally expensive than random forest
(a) More
(b) Equal
(c) Less
(d) Cannot say
8. The parameters that contribute to highest accuracy through grid search are identified using:
(a) best_params_
(b) best
(c) grid_best
(d) best_grid_
9. _________ determines the impact of each tree on the final outcome
(a) impact_rate
(b) learning_rate
(c) max-impact
(d) n_estimators
10. The value of AUC ranges from _______ to _______.
(a) 0, 1
(b) 1, 10
(c) 0.5, 1
(d) 0, 0.5

Review Questions

1. Compare the ensemble techniques with the k-NN, logistic regression, and decision tree
algorithm with respect to explanation of output, prediction power, and time required in
calculation.
2. Discuss the process of Gradient Boosting algorithm in detail.
3. What is ROC curve and discuss the results produced by ROC curve considering an example
of your choice.
4. How is a ROC curve drawn if the classifier is not a binary classifier?
5. Discuss the importance of the attributes displayed in the result of the classification report.
6. Why and how do we do hyperparameter tuning in python?
7. Differentiate between bagging and boosting.
8. Differentiate between random forest and extra tree.
9. Create Gradient Boosting and Random Forest model for the dataset discussed in bagging
algorithm for regression problem.
10. Create a Gradient Boosting and Random Forest model for the dataset discussed in bagging
algorithm for classification problem.

CHAPTER
16

Machine Learning for Text Data

Learning Objectives
After reading this chapter, you will be able to

• Understand the real-time applications of text data analysis.


• Apply different machine learning techniques for text data.
• Foster analytical and critical thinking abilities for data-based decision making.
• Evaluate the result of text mining and sentiment analysis.

Data is the fuel of the 21st century and there has been a rapid increase in the volume of data with
availability of Internet and use of e-commerce in different sectors. The data available on the
Internet generally include text, image, and video data. The text data are generally digital data that
might relate to customer opinion, feedback and reviews, and description for a product or a
service. Text databases consist of a huge collection of documents, and information is collected
from several sources such as news articles, e-books, reviews, e-mail messages, and information
in forums. Due to the increase in digital information, there is a rapid increase in the text data. For
example, a structured document may have fields such as name, title, and date but, in the real
world, the document also contains unstructured text components, such as review and abstract.
Hence, for effective handling of the text data, we require an intelligent algorithm to retrieve
relevant information from the data repositories. The retrieval of information is called text mining.
The difference between data mining and text mining is that data mining is applied to structured
data and relational data, whereas text mining deals with all unstructured and semi-structured
data. It produces insightful summary of the document/s and helps in converting large
unstructured data into a summarized format by extracting useful and relevant words. This chapter
discusses the analysis of the text data using functions from different libraries and modules related
to the text data along with supervised and unsupervised machine learning for the existing text
data.

16.1 Text Mining


Text mining is the process of evaluating large amount of textual data to produce meaningful
information, and to convert the unstructured text data into structured text data for further analysis
and visualization. Text mining helps to identify unnoticed facts, relationships, and assertions of
textual big data. The process of text mining includes basic libraries such as nltk, re, and
wordcloud. Text mining involves execution of the following steps:

1. Understanding text data
2. Text preprocessing
3. Shallow parsing
4. Stop words
5. Stemming and lemmatizing
6. Word cloud

16.1.1 Understanding Text Data


Before doing text mining, we need to understand the text data, for example, determining the
number of words in the document. We need to first load data from different sources including
text files (.txt), pdfs (.pdf), and csv files (.csv).
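A minimal sketch of the steps discussed in the explanation below is given here. The file name textminingwiki.txt is assumed to match the text file described in the explanation.

import string
from collections import Counter

# Read the text copied from Wikipedia (assumed file name)
with open("textminingwiki.txt", encoding="utf-8") as f:
    textdoc = f.read()

# Remove punctuation, convert to lowercase and split on whitespace
punctuation = str.maketrans("", "", string.punctuation)
word_doc = textdoc.translate(punctuation).lower().split()

# Count word occurrences and print the 11 most common words
word_counts = Counter(word_doc)
for word, count in word_counts.most_common(11):
    print(word, count)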

Explanation
In this example, text related to text mining from Wikipedia is copied into a text file named
“textminingwiki.” The different punctuation marks in the string module are displayed. The
command textdoc.translate(punctuation).lower().split() translates the document
considering the punctuation marks, then converts the document into lowercase, and finally
splits the document based on the spaces between words. This results in the individual words,
which are then stored in the variable word_doc. The Counter() function counts the number of
occurrences of each word, and a “for” loop is executed to print the most common words. It should be
noted that the function most_common() arranges the words in descending order of their occurrence. Hence,
when the 11 most common words are displayed, the words with maximum occurrence are
displayed first. Thus, “and” has occurred 11 times, “the” and “of” have occurred 10 times, “text”
has occurred eight times, “to” has occurred six times, and so on. It should be observed that, without
the conversion to lowercase, “Text” and “text” would be considered different words because Python is case-sensitive.

16.1.2 Text Preprocessing


Text Preprocessing is an important phase before applying any algorithm on the text data. Data
cleaning implies cleaning of noise such as punctuation and spaces. The objective of text mining
is to clean the data for creating independent terms from the data file for further analysis. After
the textual data has been loaded in the environment, it needs to be cleaned by adopting different
measures such as transforming the text to lowercase (since Python is case-sensitive, it considers
words such as “software” and “Software” to be different); and removing specific characters such as
URLs, non-English words, punctuation, whitespace, etc. The cleaning of data is
explained in the following code.
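A sketch of these cleaning steps, continuing with the textdoc string loaded above, is given below; the exact patterns are indicative only.

import re

# Clean the same document used above
print(len(textdoc))                              # number of characters in the original document

clean_doc = re.sub(r"[^a-zA-Z\s]", "", textdoc)  # keep only letters and whitespace
clean_doc = clean_doc.lower()                    # convert to lowercase
clean_doc = clean_doc.strip()                    # remove leading and trailing spaces
clean_doc = re.sub(r"\s+", " ", clean_doc)       # collapse multiple spaces into one

print(len(clean_doc))                            # reduced number of characters
print(clean_doc)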

Explanation
The “nltk” and “re” libraries are imported using the import statement. For cleaning the
data, we are considering the same file discussed in the above code. The number of characters in the
original document is 1627. The sub() command from the “re” library removes unwanted characters
from the original document, and we can see that the number of characters in the document is
reduced to 1579 (48 were unwanted characters). All the letters are converted to lowercase for
better understanding, and leading and trailing spaces are removed from the document. We can
observe from the new document that all the punctuation marks, extra spaces, and other unnecessary
characters are removed from the document and all the letters are in lowercase.

16.1.3 Shallow Parsing


There are three types of parsing involved in natural language processing (NLP):

1. Shallow parsing (or chunking): It adds a bit more structure to a part of speech (POS)-
tagged sentence. The most common operation is grouping words into noun phrases (NP),
verb phrases (VP), and prepositional phrases (PP).
2. Constituency parsing: It adds even more structure to the POS-tagged sentence.
3. Dependency parsing: It implies finding the dependencies between the words and also their
type.
In this section, we will discuss shallow parsing. Shallow parsing is a popular NLP technique for
analyzing the structure of a sentence to break it down into its smallest constituents called tokens
(such as words) and grouping them into higher level phrases. This includes POS tags as well as
phrases in a sentence.

Tokenization is the process of breaking down a text paragraph into smaller chunks such as
words or sentence. Token is a single entity that is the building block for a sentence or a
paragraph. Sentence tokenizer breaks text paragraph into sentences, while word tokenizer breaks
text paragraph into words. The process of classifying words into their parts of speech (POS) and
labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging.
POS are also known as word classes or lexical categories. The collection of tags used for a
particular task is known as a tagset. The emphasis in this section is on exploiting tags and
tagging text automatically. A part-of-speech tagger (or POS-tagger) processes a sequence of
words and attaches a POS tag to each word. The different types of phrases include the noun phrase
(NP, where a noun acts as a subject or object of a verb), verb phrase (VP, built around a verb),
adjective phrase (ADJP, built around an adjective placed before or after the noun/pronoun), adverb
phrase (ADVP, which acts like an adverb), and prepositional phrase (PP, which contains a preposition
and its complement). Common POS tags include CC for a coordinating conjunction (e.g., “and”), RB for
an adverb (e.g., “completely”), IN for a preposition (e.g., “for”), JJ for an adjective (e.g., “slow”),
VBP for a present-tense verb (e.g., “move”), and NN for a noun (e.g., “cloth”).
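A minimal sketch of tokenization and POS tagging with nltk is given below, assuming cleandoc holds the cleaned text from the previous step:

import nltk

# One-time downloads required by the tokenizer and the POS tagger
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

doc_tokens = nltk.word_tokenize(cleandoc)      # break the document into word tokens
nltk_pos_tagged = nltk.pos_tag(doc_tokens)     # attach a POS tag to every token

print(doc_tokens[:10])
print(nltk_pos_tagged[:10])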

Explanation
The document is tokenized into tokens/words using tokenize command. We can observe from
the result that the sentence is broken into different tokens/words and each token/word is shown
separately in a single quote. The pos_tag(doc_tokens) function determines the POS of every
token/word in doc_tokens and stores in nltk_pos_tagged. The output shows the POS along
with every corresponding word. It should be noted that some additional information for “punkt”
and “averaged_perceptron_tagger” also needs to be downloaded before using the above functions.

Term frequency-inverse document frequency (TF-IDF) is a statistical measure
used to evaluate how important a word is to a document in a collection of
documents or corpus. This importance is directly proportional to the number of
times a word appears in the document but is offset by the number of
documents in the corpus that contain that word.
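As an illustration, TF-IDF weights can be computed with scikit-learn's TfidfVectorizer; the two-document corpus below is a made-up example:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['text mining extracts information from text',
          'data mining finds patterns in data']

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)   # rows: documents, columns: terms

print(vectorizer.get_feature_names_out())         # get_feature_names() in older scikit-learn versions
print(tfidf_matrix.toarray().round(2))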

16.1.4 Stop Words
Text may contain stop words such as is, am, are, this, a, an, and the. These stop words are
considered as noise in the text and hence should be removed. Before analyzing the text data, we
should filter out the list of tokens from these stop words. This is demonstrated in the following
section.
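A minimal sketch of this filtering is shown below, assuming cleandoc holds the cleaned document from the earlier step:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')                    # one-time download of the stop-word corpus
eng_stop_words = stopwords.words('english')
print(len(eng_stop_words))                    # 179 English stop words
print(eng_stop_words[0:30])

# Keep only the tokens that are not stop words and join them back into a document
newdoc = ' '.join(word for word in cleandoc.split()
                  if word not in set(eng_stop_words))
print(len(newdoc))
print(newdoc[0:300])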

Explanation
The command nltk.download('stopwords') downloads stop words from nltk library and
stores them in eng_stop_words. The result shows that the corpus has 179 English stop words. The first
30 stop words are displayed from the corpus using the command eng_stop_words[0:30].
Each token/word of the document is checked with the set of stop words. All the words that do
not exist in stop words are extracted out and joined together. We can observe that the number of
words remaining in the document is 1329, which means that 250 words were stop words and
they were filtered out. This helped in creating a summary of important words for understanding
the article effectively. The last command displayed the first 300 letters of the new document
using the command newdoc[0:300].

In One Hot Encoding approach, each element in the vector corresponds to a
unique word or n-gram (token) in the corpus vocabulary. Then if the token at a
particular index exists in the document, that element is marked as 1, else it is
marked as 0. In bag of words (BoW) representation, each element of the vector
corresponds to the number of times that specific word occurs in the document.
However, it does not encode any idea of meaning or word similarity into the
vectors.
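For illustration, a bag-of-words matrix can be built with scikit-learn's CountVectorizer; the toy corpus is a made-up example:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the fabric feels soft', 'the fabric feels rough and the fit is poor']

bow = CountVectorizer()
bow_matrix = bow.fit_transform(corpus)     # each cell holds the count of a term in a document

print(bow.get_feature_names_out())
print(bow_matrix.toarray())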

16.1.5 Stemming and Lemmatizing


Stemming and lemmatization consider another type of noise in the text, which reduces
derivationally related forms of a word to common root word. Stemming is the process of
gathering words of similar origin into one word. Stemming helps us to increase accuracy in our
mined text by removing suffixes and reducing words to their basic forms. For example, words
such as detection, detected, and detecting are reduced to a common word “detect.” However,
lemmatization is usually more sophisticated than stemming and also reduces words to their base
word. But a lemmatizer, unlike a stemmer, works on an individual word with knowledge of the context.
For example, the word “better” has “good” as its lemma, but this is not included by stemming
because it requires a dictionary search. Stemming is done through PorterStemmer() function,
and lemmatizing is done using WordNetLemmatizer() function available in the nltk.stem
package. For understanding the utility of stemming, we will consider a new example in a new
file and then we will apply the process of stemming and lemmatization on the above data.
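A minimal sketch of this comparison is given below; the word list is illustrative:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')                     # required once for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['program', 'programs', 'programmer', 'programmers', 'programming']

print('Stemmed   :', [stemmer.stem(w) for w in words])
print('Lemmatized:', [lemmatizer.lemmatize(w) for w in words])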

Explanation
The above command combines similar types of words under one basic word and stores the result in the
corpus. We can observe that stemming reduces all the different words to a base word “program,” while
the lemmatizer considers program and programmer as different words because it also has knowledge of
the vocabulary. Thus, the lemmatizer reduces programmer and programmers to the basic word programmer.

The following program shows the use of stemming and lemmatization on our data.

Explanation
We can observe that there is no noticeable change because the data did not contain words
corresponding to different forms of the basic word.

Word embedding is a learned representation for text where words that have the
same meaning have a similar representation. The two most popular word
embeddings are Word2Vec and GloVe.

16.1.6 Word Cloud


For creating a visual impact, a word cloud is created from different words using WordCloud()
function from wordcloud library. In the word cloud, the size of the words is dependent on their
frequencies.
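A minimal sketch, assuming newdoc holds the cleaned, stop-word-filtered text from the earlier steps:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Build a word cloud from the document; word size reflects word frequency
wc = WordCloud(background_color='white').generate(newdoc)

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()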

Explanation
The word cloud in the figure consists of those words that have high frequency. The size of the
word in the word cloud determines its frequency and hence its importance. The white-colored
word cloud clearly shows that “text” is the most important word, followed by “mining.” The
black-colored word cloud displays words after performing lemmatization and stemming. Thus,
mining is reduced to the common word “mine,” which occurs a greater number of times in
the document.

USE CASE
TEXT MINING FOR LONG DOCUMENTS/SPEECH/RESUME

Due to the advent of Internet and social media, in particular, a lot of information and
communication is available in the form of text and audio data. Since it is now possible to convert
text to speech easily and effectively, a lot of textual data are now available. It becomes very
difficult for a person to understand the long document effectively and quickly. It has hence
become a skill that is more important than ever in today’s information overflow and people want
to develop the skill to pick out the main ideas of reading and to improve critical thinking skills
related to the document. Text mining helps to filter the long text to its essentials by specifying the
important key words. It is also effective because when people are told to summarize the long
document, they often either copy verbatim; write long, detailed “summaries;” or write
excessively short ones, missing key information. Besides, creating a summary manually is time
consuming and may sometimes become a boring and tedious task if the data are huge and the
task is repetitive.
Because of social media, a lot of text data is available in the form of speeches, documents,
information, etc. Gone are the days when it required re-reading the long document, annotating
the speech and underlining any portions, marking any words or phrases that are considered
important in a long document. For example, an HR professional has access to many more
potential candidates now than before the existence of social media. Ten years ago, if HR
department had around 50 printed CVs (many of them were sent by post) to flip through for a
particular position, today, they have to deal with tens of thousands online CVs after a single
search. It becomes difficult because there just never seems to be enough time to read every single
line on a CV or online profile. They are interested to learn (quite thoroughly) the key points
about the accomplishment of the person in their career. A word cloud can help clarify the
essential elements of CV in the quickest way possible and also help extrapolate its main points
and essential arguments. Similarly, word clouds created based on speeches can act as a study
guide for students studying history, literature, or rhetoric. A well-crafted word cloud can be used
in an analytical report for better understanding of any document.

16.2 Sentiment Analysis Using Lexicon-Based Approach


Sentiment analysis is also popularly known as opinion analysis or opinion mining. The key idea
is to use techniques from text analytics, NLP, machine learning, and linguistics to extract
important information or data points from unstructured text. Sentiment analysis builds on natural
language processing, which deals with the interaction between computers and humans using natural
language. Sentiment analysis provides a way to understand the attitudes and opinions expressed
in texts. We can use sentiment analysis to understand how a narrative changes throughout its
course or what words with emotional and opinion content are important for a particular text. It
deals with reading and interpreting text, preprocessing, extracting, and predicting the solution
along with measuring sentiment.
Sentiment polarity is typically a numeric score that is assigned to both the positive and
negative aspects of a text document based on subjective parameters such as specific words and
phrases expressing feelings and emotions. Neutral sentiment typically has 0 polarity since it does
not express any specific sentiment, positive sentiment will have polarity > 0, and negative
sentiment will have polarity < 0. However, the thresholds can always be changed based on the
type of text; there are no strict rules. The results help us to derive qualitative outputs like the
overall sentiment being on a positive, neutral, or negative scale and quantitative outputs like the
sentiment polarity, subjectivity, and objectivity proportions.
In this section, we will perform sentiment analysis using both unsupervised and supervised
machine learning approaches. We have taken the data of clothing reviews related to women’s
clothing from e-commerce stores. The data can be downloaded from
https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews.
Unsupervised sentiment analysis models make use of well-created knowledge bases,
lexicons, and databases, which have detailed information pertaining to subjective words and
phrases including sentiment, mood, polarity, objectivity, subjectivity, and so on. However, this
section uses the lexicon model for sentiment analysis. A lexicon model, also known as a
dictionary or vocabulary of words, is specifically aligned toward sentiment analysis. These
lexicons usually contain a list of words associated with positive and negative sentiments, polarity
(magnitude of negative or positive score), POS tags, subjectivity classifiers (strong, weak,
neutral), mood, modality, etc.

16.2.1 Understanding Data


Before doing sentiment analysis, we need to first understand the dataset—shape of the dataset,
missing values, and do feature extraction. This is done in the following section.

Explanation
The above output shows that there are 23,486 rows and 11 columns in the dataset. Each record
in the dataset is a customer review that consists of the review title, text description, and a rating
(ranging from 1 to 5) for a product among other features. Recommend IND is based on the
rating. The customers recommend a product (label 1) whose rating is >3, otherwise they do not
recommend the product (label 0). We can observe from the first five rows that there are many
missing values in the data. Thus, when the complete information is displayed, we can find that
Review Text has 845 missing observations.
For the analysis, we require only two columns, namely, Recommend IND and Review Text.
Hence, we created a new dataset containing only these two columns. Before performing
analysis, it was important to delete those observations that do not have review text because they
will not yield meaningful interpretation. After removing missing values from the data, we find that
the dimension of the new dataset is (22641, 2). This means that the dataset
now has 22,641 rows. Since sentiment analysis can be done on the text data only, it is important
to convert the data type of Review Text to string before doing analysis. We further have
converted the data in lowercase.

16.2.2 Determining Polarity


The most important step of sentiment analysis is to create or find a word list of sentiments
(lexicon). Unsupervised machine learning uses a lexicon-based approach. We can use a lexicon
that already exists and we may also add to or modify it. These lexicons contain many words for
showing sentiment and the words are assigned scores for each sentiment. The textblob library
in Python helps to calculate the polarity of the text data. Polarity is a float value within the range
–1 to 1, where 0 indicates neutral, +1 indicates a very positive sentiment, and –1 represents a
very negative sentiment. Negative polarity generally shows a negative sentiment, and a positive
polarity shows a positive sentiment. The higher the absolute value, the stronger the sentiment;
for example, a polarity of 0.8 indicates a stronger positive sentiment than a polarity of 0.3.
After handling the missing observations and doing the basic processing, we need to calculate the
polarity of the review text. For analysis, we will consider the first 18,000 rows from the dataset.
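A sketch of this step is shown below, assuming newwomendata is the two-column dataset described above (the DataFrame name and column labels follow that description):

import textblob

reviews = newwomendata['Review Text'].values[:18000]
sentiments = newwomendata['Recommended IND'].values[:18000]

# Print the review, the original recommendation and the computed polarity for a few records
for review, sentiment in zip(reviews[:3], sentiments[:3]):
    polarity = textblob.TextBlob(review).sentiment.polarity
    print(review)
    print('Recommended IND:', sentiment, ' Polarity:', round(polarity, 2))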

Explanation
The command sentiments = newwomendata['Recommended IND'].values[:18000] stores the
value Recommended IND of first 18,000 reviews in the sentiments. The command
textblob.TextBlob(review).sentiment.polarity calculates the sentiment polarity of
review. The “for” loop calculates and displays the review, original recommended value, and
calculated polarity of each review. We can observe from the above result that the first review
had a polarity of 0.63 and recommended a value of 1 (positive rating). The second review had a
polarity of 0.33 and recommended a value of 1. The third review had a polarity of 0.7 and a
recommended value of 0 (negative rating). The polarity of the first two reviews is in accordance
with the original recommended value, whereas the third review shows a positive polarity despite not
being recommended, which illustrates that the lexicon-based score does not always match the label.

16.2.3 Determining Sentiment from Polarity
We know that score less than 0 shows a negative sentiment, and score higher than 0 shows a
positive sentiment. This section converts the polarity of all the records to corresponding
sentiments.
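One way of carrying out this conversion is sketched below, reusing the reviews array from the previous step:

import numpy as np
import textblob

# Polarity score for every review
polarity_sentiment = [textblob.TextBlob(review).sentiment.polarity for review in reviews]

# Map each polarity to a binary label: 1 (positive) if the score is >= 0, else 0 (negative)
predicted_sentiments = np.array([1 if score >= 0 else 0 for score in polarity_sentiment])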

Explanation
The first division stores the polarities corresponding to each review in the polarity_sentiment.
The next division converts all the polarities to the binary form (0 and 1) for making two
categories: positive and negative. Thus, score ≥ 0 is stored as 1 in predicted_sentiments and
score < 0 will be stored as 0 in predicted_sentiments. Thus, we will have only two categories in
predicted sentiments: 1 corresponding to positive sentiment and 0 corresponding to negative
sentiment.

16.2.4 Determining Accuracy of Sentiment Analysis


Since we have now two categorical variables in predicted_sentiments, we can now compare
these values corresponding to sentiments with the original values stored in the sentiments. This
will help us to calculate the accuracy of the predicted sentiments.
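A minimal sketch using scikit-learn's metrics, with sentiments holding the original labels and predicted_sentiments the lexicon-based predictions:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print('Accuracy:', accuracy_score(sentiments, predicted_sentiments))
print(confusion_matrix(sentiments, predicted_sentiments))
print(classification_report(sentiments, predicted_sentiments))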

Explanation

The predicted_sentiments stores the score corresponding to the polarity of each sentiment. It
stores 1 if polarity has a positive value and 0 if it has a negative value. These predicted
sentiments are compared with the original sentiments and the accuracy of the model is 0.7831.
The confusion matrix shows that there are 192 + 13903 rightly predicted values. Also, we have
generated a report for positive and negative sentiments. The precision for positive sentiments is
the ratio TP/(TP + FP), where TP is the number of true positives and FP is the number of false
positives. The precision is intuitively the ability of the classifier not to label as positive a
sample that is negative. The precision for the positive class is 0.82 and for the negative class is
0.18. Therefore, 82% of the reviews that the model labeled as positive are truly positive, and only
18% of the reviews labeled as negative are truly negative.
The recall is the ratio TP/(TP + FN), where TP is the number of true positives and FN is the
number of false negatives. The recall is intuitively the ability of the classifier to find all the
positive samples. This classifier has found about 94% of the positive samples. The F-beta score
can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta
score reaches its best value at 1 and the worst score at 0. The support is the number of
occurrences of each class in y_true.

Create a dataset containing employee reviews and sentiments corresponding to
the review in the form of a categorical variable (positive or negative). Predict
the sentiments based on the review using textblob library and create a
confusion matrix of the original and predicted sentiments.

USE CASE
SENTIMENT ANALYSIS FOR TWITTER DATA

Twitter is a social networking site where people communicate in short messages called tweets.
Tweeting means posting short messages to people who follow you on Twitter, with an intention
that the messages might be helpful for taking a decision.
Data were collected from Twitter with the help of Twitter API. The Twitter API is a set of
URLs that take parameters. These URLs help access many features of Twitter, such as posting a
tweet or finding tweets that contain a word, etc. The Twitter account was made using the
following steps:

1. Visit the Twitter Developers’ Site: dev.twitter.com.


2. Sign in with Twitter account associated with the app.
3. Visit apps.twitter.com. Once we were logged in, we visited Twitter’s app website. This can
be found at apps.twitter.com (it can also be found by clicking “manage your apps” in the
footer of the Twitter Developers site).
4. Create a New Application for access to the Twitter comments so that we could load these
comments for our analysis.
5. Fill the application form to get the consumer key and the consumer secret key to get
authentication to the Twitter comments.
6. Create Access Token.

7. Once the application got approved, the app was created on Twitter. By creating this app,
we got special access to the developer’s account of Twitter.
8. Make a note of OAuth Settings.
9. Once we have got access to the Twitter’s Developer account, we made a note of OAuth
settings that can be used for logging in with the help of Python to the Twitter account. These
settings can be kept as secret and must not be shared with anyone. The settings which are
needed for having access through Python are as follows: Consumer Key, Consumer Secret,
OAuth Access Token, and OAuth Access Token Secret.
The Search API allows developers to look up tweets containing a specific word or a phrase. One
of the constraints imposed by Twitter is that the Search API produces only 1500 tweets at a time.
Hence, to gather more tweets, Streaming API was used that captures tweets in real time. Twitter
Developer Account was created that provided credentials and authorization for extracting the
tweets. This information was used in the Python code for collecting the tweets related to a
particular product/service/event. After importing the Twitter comments into the csv file, the file
was loaded into the Python using the pandas library.
Before performing the analysis, data cleaning was required. It involves recognition, removal
of errors, and inconsistency to improve the quality of the dataset prior to the process of analysis.
Irrelevant data were cleaned from tweets to improve their quality. Removing punctuations and
other miscellaneous data is performed as follows: Punctuation marks such as quotes (“”),
commas (,), and semicolons (;) do not have any significant role in the analysis, and hence were
removed from all the tweets present in the dataset. People generally have a tendency to attach
documents (images, blogs, videos, web direction, etc.) along with their tweets. These links or
URLs had to be eliminated since they were of no use to the analysis. There was also a need to
delete mentions, hashtags, and retweets. Mentions are used in Twitter to reply, acknowledge, or
start a conversation. Mentions are always written using “@” sign followed by the username.
These mentions do not contain any relevant information, thus they are removed. A Hashtag
(“#”) is used to mark keywords or topics in a tweet. Using hashtags, people can search or start a
new trend on Twitter. The “#” has been removed from all the tweets of the dataset. A retweet
involves the reposting of another user’s tweet. It leads to redundancy in data and to avoid this,
we eliminated all the retweets from the dataset. Besides a lot of tweets are from customer care.
Hence, these tweets should also be removed from the data. In the following example, we have
collected tweets related to Amazon from Twitter.

The comments available on Twitter consist of many retweets, mentions, hashtags, URLs, etc. The
noise in the comments of Twitter was removed using inbuilt libraries in Python such as Textblob,
which is specifically used for extracting and cleaning the data. The data still contained minor
noise such as website address and some kind of hexadecimal coded expressions. For removing
these expressions from the tweets, regular expression library available in Python was used,
which matched these expressions present in the tweets and then removed them. While working on
the data for preprocessing, it was found that the tweets also contained promotional tweets. A
“Care” is used to identify the tweets originated from customer care. These tweets are removed
from all the tweets of the dataset.

Sentiment analysis is basically the process of determining the attitude or the emotion of the
writer, that is, whether it is positive or negative or neutral. Polarity is a float that lies in the
range of [–1,1], where 1 means a positive statement and –1 means a negative statement. The
positive, negative, and objective score for each word in a tweet is calculated. Words with an
objective score less than a predefined threshold value (between 0.0 and 1.0) are discarded,
while the ones above the threshold value are added to get an aggregated positive and negative
score of a tweet. Finally, a tweet is classified as negative, positive, or neutral based on the
dominating value.
We then plotted the figure corresponding to the live sentiments of the Twitter users. It was
found that the overall sentiments of Amazon are positive as there are very few negative
sentiments for analysis, both in terms of quantity (number of words) as well as quality (the type
of words).

16.3 Text Similarity Techniques


Text similarity techniques are used to recommend products/services, videos, movies, etc. The
different examples of document similarity include e-commerce websites recommending products
on their websites, Amazon Prime and Netflix recommending movies/shows, YouTube
recommending videos, etc. Recommendation for a product/service can be done according to
predefined criteria such as number of buyers, budget, rating, popularity, manufacturer,
description, etc. The similarity of the text data can be determined using the concept of cosine
similarity, Euclidean distance, and Manhattan distance. These functions are available in
sklearn.metrics.pairwise package. For explaining the utility of the recommendation system
on the text data, we will consider imdb dataset available in keras package.
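A minimal sketch of loading and padding the data is shown below; depending on the installed version, the imports may need the tensorflow.keras prefix:

from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences

# Load the movie-review dataset: 25,000 training and 25,000 test records
(x_train, y_train), (x_test, y_test) = imdb.load_data()

# Truncate or pad every review sequence to a common length of 200
x_train = pad_sequences(x_train, maxlen=200)
x_test = pad_sequences(x_test, maxlen=200)

print(x_train.shape, x_test.shape)    # (25000, 200) (25000, 200)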

Explanation
We can observe that when the imdb dataset is loaded, it is split into four sets: training data and
test datasets of independent and dependent variables. By default, the training and test sets have
25,000 records. It is important to have the same sequences of the same length, so the next
section converts the training and test datasets of independent variables into equal sizes of 200.
Hence, the dimension is (25000, 200) for both the datasets.

16.3.1 Cosine Similarity

Cosine similarity is a numeric score to denote the similarity between two text documents. Cosine
similarity calculates similarity by measuring the cosine of the angle between two vectors A and B.
This is calculated as

cosine similarity = cos(θ) = (A · B)/(‖A‖ ‖B‖)

The cosine_similarity() function is available in the sklearn.metrics.pairwise package. The
result lies between –1 and 1. If both texts are exactly similar, then the angle between the two texts
is 0. Since the value of cos(0) is 1, it can be said that better similarity exists between texts if
the value of cosine similarity is higher; the texts are farther apart if the value is lower. Hence,
in order to recommend the movies that are similar to the given movie, we need to sort the result in
descending order.
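A sketch of this computation is given below, assuming x_train holds the padded sequences prepared above; computing similarities one row at a time keeps memory use modest:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity of the movie at index 1 with every movie in the dataset
sims = cosine_similarity(x_train[1:2], x_train)[0]

print(sims[:5])
print(np.argsort(-sims)[:5])    # indexes of the most similar movies, highest similarity first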

Explanation
It can be observed from the results that cosine similarity ranges from 0 to 1. It is the highest (1)
with similar movies. Thus, the cosine similarity of movie at a particular index with the movie at
the same index will be 1. Thus, the result of movie similarities for the movie at index 1 shows
that the similarity with the movie at index 0 is 0.07 (very less), index 2 is 0.02 (very less), and
with movie at index 24999 is 0.1854 (less). The cosine similarities are highest for similar
movies; if we need to determine similar movies, we need to sort the results in descending order
to fetch the higher values of cosine similarities. This is done by using a negative sign inside the
argsort() function of the numpy library. We can observe from the results that the indexes of
similar movies with the second movie are as follows: 1, 22259, 19517, 7491, 1817, etc.

16.3.2 Euclidean Distance


In mathematics, the Euclidean distance or Euclidean metric is the “ordinary” straight-line
distance between two points in Euclidean space. With this distance, Euclidean space becomes a
metric space. The Euclidean distance between points p and q is the length of the line segment
connecting them. In Cartesian coordinates, if p = (p1, p2 ,…, pn) and q = (q1, q2, …, qn) are two
points in Euclidean n-space, then the distance (d) from p to q, or from q to p, is given by
Pythagorean formula. If two texts are exactly similar, the distance between them will be 0. Thus,
lower values result in better similarity between the texts. Hence, there is no need to sort the
results in descending order unlike the cosine similarity function. The text with better similarity
can be known by sorting them in ascending order.
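A sketch along the same lines, again assuming x_train holds the padded sequences:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Euclidean distance of the movie at index 10 from every movie in the dataset
dists = euclidean_distances(x_train[10:11], x_train)[0]

print(np.argsort(dists)[:5])    # smallest distances first; index 10 itself appears first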

Explanation
We can observe from the results that the index of similar movies to movie at index 10 are as
follows: 10, 24142, 19348, etc. Since the Euclidean distance between the movie at index 10
with the movie at index 10 will be 0, hence, the first index of a similar movie is depicted as the
same index number of the movie.

16.3.3 Manhattan Distance


The Manhattan distance function computes the distance that would be traveled to get from one
data point to the other if a grid-like path is followed. The Manhattan distance between two items
is the sum of the differences of their corresponding components. The formula for this distance
between a point X = (X1, X2, …, Xn) and a point Y = (Y1, Y2, …, Yn) is given as

d(X, Y) = |X1 − Y1| + |X2 − Y2| + … + |Xn − Yn|

where n is the number of variables, and Xi and Yi are the values of the ith variable at points X and
Y, respectively.
Like Euclidean distance, lower value of Manhattan distance shows better similarity; hence
for viewing the movies that are similar to the given movie, the results will not be sorted in
descending order.
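A corresponding sketch for the Manhattan distance, under the same assumptions:

import numpy as np
from sklearn.metrics.pairwise import manhattan_distances

dists = manhattan_distances(x_train[305:306], x_train)[0]

print(np.argsort(dists)[:5])    # lower distance means a more similar movie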

Explanation
We can observe from the results that the index of similar movies to movie at index 305 are as
follows: 11156, 13051, 430, 8934, 1701, etc. As discussed earlier, since the distance between
the movie at index 305 will be 0 with the movie at index 305, hence the first index of a similar
movie is depicted as the same index number.

USE CASE
FINDING PARTNERS ON MATRIMONIAL WEBSITES

Marriage is a big dream in anyone’s life and is one of the most important of all established
bonds. Thus, it needs the utmost care when searching for the right person to spend the rest of
your life. Matrimonial sites help us in finding life partners online and are popular in India and
among Indians settled overseas. They have redefined the way Indian brides and grooms meet for
marriage and have acted as an alternative to the traditional marriage broker. The online
matrimony business has grown tremendously and there are over 1500 matrimony websites in
India. The most popular websites include Shaadi.com, communitymatrimony.com,
bharatmatrimony.com, kalyanmatrimony.com, jeevanssathi.com, lifepartner.in, lovevivah.com,
etc. These sites assist in meeting potential life partners as per one’s preferences and help to
know them better, offer a marriage proposal, and build lifetime relationships.
The users first register on the matrimonial website and upload their profile, which is stored
in a database. Once the profile is created, the verification of phone numbers is done to offer
matrimony updates and matches. However, there are special privacy options for all premium
users. People who are interested to find a proper match filter the database with customized
searches on factors such as physical features, nationality, education, occupation, age, lifestyle,
gender, religion, geographic location, and caste.
It is easier to just go through the list of the best match found by just clicking a few buttons
and answering a few questions. It is also possible to save searches for multiple permutations and
combinations and to waste less time searching for different kinds of people. In order to expand
growth, these sites should bring innovative technologies, improved user-friendly navigation, and
cooperative approach to facilitate efficient as well as agile search in optimum manner. Different
measures such as cosine similarity, dot product matrix, Euclidean distance, and Manhattan
distance are considered for determining the best match according to the requirement. The
records of all the people will be displayed who have a better similarity with the search criterion.

Create a dataset of 100 cricket players. The first column should contain the
name of the player and the second column should contain the description of the
game. Use text similarity techniques to identify the players who play similar to
the given player.

16.4 Unsupervised Machine Learning for Grouping Similar Text


We know that cluster analysis requires running k-means clustering on a given dataset for a range of
values of k. Then for each value of k, we calculate the sum of squared errors (SSE). A line graph
of the SSE against each value of k is plotted. The line graph looks like an arm, and the elbow on
the arm is the value of optimal k (number of clusters). The goal is to choose a small value of k
that still has a low SSE, and the elbow usually represents where we start to have diminishing
returns by increasing k. The following snippet loops through different values of “k” and
generates the elbow plot to help us narrow down to the optimal value of “k”. We will consider
the imdb dataset available in keras library for performing unsupervised machine learning.
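One possible version of this loop is sketched below, assuming x_train holds the padded imdb sequences; the range of k values is an illustrative choice:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

sse = []
k_values = range(1, 8)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=0)
    km.fit(x_train)
    sse.append(km.inertia_)        # sum of squared errors for this value of k

plt.plot(k_values, sse, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('SSE')
plt.show()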

Explanation
We can observe from the scree plot that optimum number of clusters is 2. Hence, the next
section performs cluster analysis considering two clusters on our dataset.
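A sketch of this step, comparing the cluster labels with the sentiment labels y_train (note that k-means labels are arbitrary, so the accuracy is only indicative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, confusion_matrix

km = KMeans(n_clusters=2, random_state=0)
clusters = km.fit_predict(x_train)

print(np.bincount(clusters))                 # number of movies in each cluster
print(confusion_matrix(y_train, clusters))
print(accuracy_score(y_train, clusters))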

Explanation
We can observe from the results that out of 25,000 movies, 17,839 belong to the first cluster
and 7161 belong to the second cluster. The number of correct values is 8746 + 3407, and
number of incorrect values is 9093 + 3754. However, the accuracy is found to be very low as
0.486.

Trained models for text data such as BERT, DISTILBERT, ROBERTA, and
GPT2 are used to determine better accuracy of the machine learning algorithms
and similarity techniques used for the text data.
USE CASE
ORGANIZING TWEETS/REVIEWS OF PRODUCT/SERVICE

The advent of e-commerce has brought the buying of all products and services to our fingertips,
from buying less costly products such as groceries and stationery items to expensive products
such as diamond, gold, and luxurious cars; from short-duration services such as transferring the
amount to a bank to long-duration services such as finalizing a travel plan. People like to post
reviews related to any purchased product/service either on the e-commerce sites or on general
social networking sites such as Facebook and Twitter. Impressions and feedback posted in the
form of reviews and tweets are making this place a forum for consumers to evaluate products
and services. Positive online reviews are worth more than the benefits provided by a small
marketing campaign. These online customer reviews that remain there for a longer time period
appear to be a great avenue for grabbing consumer’s attention and increasing sales in both the
short and long term. They reach a majority of consumers and are also responsible for securing
online visibility in search rankings. Hence, analyzing reviews can provide true feedback
regarding customer’s opinion to any organization. By resolving the issues of the consumers
written in the reviews, a positive experience can be created to improve product/service
efficiently.
The number of reviews has drastically increased with the passage of time and availability of
online methodologies to share the reviews. Hence, it becomes important to organize the reviews
in different groups for effective analysis. Organizations can easily group the reviews/tweets
collected from customers for different products/services using the concept of unsupervised
machine learning (cluster analysis). This will help the organization to filter similar types of
reviews together for proper understanding and analysis. Cluster analysis will help form clusters
of similar reviews. Manual process would have incurred a lot of time because of the availability
of huge data. Hence, this process will help them to save a lot of time with good accuracy. The
number of clusters can also be chosen depending on the requirement. For example, an
organization can make two clusters if it is interested in grouping all positive and negative
reviews separately. However, interest of grouping into strong positive, strong negative, and
neutral reviews will need to perform cluster analysis using three clusters.

16.5 Supervised Machine Learning


It is also possible to carry out supervised machine learning for performing sentiment analysis.
We can perform sentiment analysis by building a model based on the available text data and
predict the sentiment of the text-based reviews. We initially divide the Reuters dataset into
training and test datasets. For building a model, we need to first train the data based on different
classification models (Logistic Regression, Decision Tree, Random Forest, etc.) and predict them
the test dataset using that model. We then try to determine the accuracy of the developed model.
Reuters is a benchmark dataset for document classification. To be more precise, it is a multiclass
(i.e., there are multiple classes) and multilabel (i.e., each document can belong to many classes)
dataset. The mean number of words per document, grouped by class, is between 93 and 1263 on
the training set. The training set has a vocabulary size of 35,247. Even if you restrict it to words
that appear at least five times and at most 12,672 times in the training set, there are still 12,017
words.
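A minimal sketch of loading and padding the Reuters data; as before, the imports may need the tensorflow.keras prefix depending on the installed version:

from keras.datasets import reuters
from keras.preprocessing.sequence import pad_sequences

(x_train, y_train), (x_test, y_test) = reuters.load_data()

x_train = pad_sequences(x_train, maxlen=200)
x_test = pad_sequences(x_test, maxlen=200)

print(x_train.shape, x_test.shape)    # (8982, 200) (2246, 200)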

Explanation
The Reuters dataset is loaded from keras.datasets and stored in the training and test datasets for
dependent and independent variables. Since all the records have different lengths, it is always
suggested to define a maximum length for the data. The function sequence.pad_sequences()
with the value of argument maxlen as 200 helps to reduce the data to a maximum length of 200.
We can observe from the result that the dimension of the training dataset is hence (8982, 200),
while the dimension of the test dataset is (2246, 200). This further means that there are 8982
records in training and 2246 records in the test dataset.

16.5.1 Logistic Regression Model


This section uses logistic regression algorithm to create the model on training dataset and
determine the accuracy of the model by using test dataset.
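A sketch of this model, assuming the padded Reuters arrays prepared above; max_iter is raised only to help convergence:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression(max_iter=1000)
logreg.fit(x_train, y_train)

print(accuracy_score(y_test, logreg.predict(x_test)))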

Explanation
We can observe from the results that the logistic regression model produced an accuracy of
0.3687, which is low.

16.5.2 Random Forest Model
RandomForestClassifier available in sklearn.ensemble is used to create the model on training
dataset and the accuracy of the model is determined by using the test dataset.
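A corresponding sketch for the random forest model:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(random_state=0)
rf.fit(x_train, y_train)

print(accuracy_score(y_test, rf.predict(x_test)))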

Explanation
We can observe from the results that the random forest model produced an accuracy of 0.5178,
which can be considered normal accuracy for the text data.

16.5.3 Gradient Boosting Model


This section uses Gradient Boosting algorithm to create the model on training dataset and the
accuracy of the model is determined by using the test dataset.
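A corresponding sketch for the gradient boosting model:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

gb = GradientBoostingClassifier(random_state=0)
gb.fit(x_train, y_train)

print(accuracy_score(y_test, gb.predict(x_test)))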

Explanation
We can observe from the results that the gradient boosting model produced an accuracy of
0.4675, which can be considered average accuracy for the text data.

16.5.4 Bagging Model


BaggingClassifier available in sklearn.ensemble is used to create the model on training
dataset and accuracy of the model is determined by using the test dataset.
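A corresponding sketch for the bagging model:

from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

bag = BaggingClassifier(random_state=0)
bag.fit(x_train, y_train)

print(accuracy_score(y_test, bag.predict(x_test)))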

Explanation
We can observe from the results that the bagging algorithm produced an accuracy of 0.4635,
which can be considered as average accuracy for the text data.
The above section shows that the supervised machine learning algorithms produce quite low
accuracy. Hence, other steps can be considered to increase the accuracy of the models such as
hyperparameter tuning or changing the maximum length of the sequences, applying advanced
text processing techniques, or using trained models such as BERT, DISTILBERT, and
ROBERTA. These models are discussed in Chapter 19 in detail.

Create a dataset of 200 products belonging to any of the following three
categories: clothing, food, and electronics. The first column should contain the
name of the product, the second column should contain the description of the
product, and the third column should contain the category to which the product
belongs. Divide the dataset into training and test datasets. Apply the supervised
machine learning algorithm to predict the category and use metrics to
determine whether the categories of items are correctly predicted.

USE CASE
DETERMINING POPULARITY OF SOCIAL MEDIA NEWS

Social media is a platform that enables people to interact and socially adhere to other people in
the society and around the world. People who communicate more on social media are reported to be
less stressed, happier, and psychologically healthier than people who have fewer social resources,
who in turn tend to feel lonelier, more depressed, and psychologically less healthy.
media channels have positive links to the development of social capital, are associated with
increased community engagement, and are of greater importance for both individuals and
society. The social media channel has a majority of conversations about news items related to
sports, politics, brands, products, and services, and their respective social feedback on multiple
platforms including social networks such as Facebook, Google+, and LinkedIn; forums; and
blogs. Social networks allow members to connect by a variety of common interests and share
information. Forums are like social mixers, where all people are at an equal level, content is
usually segmented by topic, and anyone can start a topic and anyone can respond to it. In blogs,
the blogger is in control of the discussion but allows questions and comments from the audience.
These social media channels provide an incredible social outlet that has had a tremendous effect on
online society. These channels also engage communities for discussing specific topics
related to the news. The news can include everything from light hearted to very intense, from
politics and religion, to sports. It is not necessary that anything published on an online global
platform is correct and useful, but sometimes, the participants check and evaluate about the
outcomes and also post a negative review. In social media channels, the discussion reaches a
massive range of participants and individuals post their comments and views to the thread. The
collection of comments and reviews can thus help to determine the popularity of the news item on
these social media channels. However, increased traffic of visitors to the news also helps in
determining its success.
With social media, many sectors have achieved a success by determining the true feedback of
their product and services. Sentiment analysis can be used for determining the popularity of
news related to each and every dimension of life. The dataset can be created by including fields
such as title (title of the news item), source (original news outlet that published the news item),
date–time (date and time of the news publication), sentiment (sentiment score of the news),
network (score of the news popularity according to the social network), blogs (score of the news
popularity according to blogs), and forums (score of the news popularity according to forums).

Summary
• Text mining helps to identify unnoticed facts, relationships, and assertions of textual big data.
The process of text mining uses basic libraries such as nltk, re, and wordcloud.
• The objective of text mining is to clean the data for creating independent terms from the data
file for further analysis.
• The data can be cleaned by adopting different measures such as transforming the text to
lowercase and removing specific characters such as removing URLs, non-English words,
punctuations, numbers, whitespace, and stop words (commonly used English words).
• Tokenization is the process of breaking down a text paragraph into smaller chunks such as
words or sentences. Token is a single entity that is the building block for sentences or
paragraphs. Sentence tokenizer breaks text paragraph into sentences, whereas word tokenizer
breaks text paragraph into words.
• Text may contain stop words such as is, am, are, this, a, an, and the. These stop words are
considered as noise in the text and hence should be removed.
• Stemming is the process of gathering words of similar origin into one word. It helps us to
increase accuracy in the mined text by removing suffixes and reducing words to their basic
forms.
• For creating a visual impact, a word cloud is created from different words. The word cloud is
created from wordcloud library. In the word cloud, the size of the words is dependent on their
frequencies.
• Sentiment analysis is a branch of machine learning that deals with the interaction between
computers and humans using the natural language. We can use sentiment analysis to
understand how a narrative changes throughout its course or what words with emotional and
opinion content are important for a particular text.
• It deals with reading and interpreting text, preprocessing, extracting, and predicting the
solution along with measuring sentiment.

• Sentiment polarity is typically a numeric score, which is assigned to both the positive and
negative aspects of a text document based on subjective parameters such as specific words
and phrases expressing feelings and emotion.
• The textblob library in Python helps to calculate the polarity of the text data. Polarity is a
float value within the range –1 to 1, where 0 indicates neutral, +1 indicates a very positive
sentiment, and –1 represents a very negative sentiment. Negative polarity generally shows a
negative sentiment and a positive polarity shows a positive sentiment.
• Prominent applications of NLP include analyzing Twitter’s data, document similarity, and
cluster analysis.
• Document similarity is done using cosine similarity, which is a numeric score to denote the
similarity between two text documents. The other approaches are Euclidean distance,
Manhattan distance, and Dot Product matrix.

Multiple-Choice Questions

1. ________ helps in increasing accuracy of mined text by removing suffixes and reducing
words to basic forms.
(a) Word cloud
(b) Suffixing
(c) Reducing
(d) Stemming
2. The visual display of the words based on their frequency is created using ______ function.
(a) wordcloud()
(b) visualword()
(c) visualcloud()
(d) cloudword()
3. _____ has many words for showing sentiment and the words are assigned scores for each
sentiment.
(a) Cloud
(b) Sentiment
(c) Lexicon
(d) Feeling
4. In NLP, words such as “as,” “to,” “it,” “in,” “for,” “is,” “a,” and “the” are called as
__________
(a) Common words
(b) English words
(c) Stop words
(d) Unwanted words
5. ____________ is the most important function used for performing document similarity.
(a) similarity()
(b) document_similarity()
(c) doc_similarity()

(d) cosine_similarity()
6. Similarity techniques used for the text data includes
(a) Cosine similarity
(b) Euclidean distance
(c) Manhattan distance
(d) All of these
7. Unwanted letters are removed from the text using ____________ library.
(a) re
(b) numpy
(c) pandas
(d) text
8. A word cloud is created using ____________ function.
(a) WordCloud()
(b) wordcloud()
(c) CloudWord()
(d) cloudword()
9. Polarity of text is determined using the ____________ library.
(a) re
(b) numpy
(c) pandas
(d) textblob
10. Score of polarity of text ranges from ____________ to ____________.
(a) 0, 1
(b) −1, +1
(c) 1, 10
(d) −10, +10

Review Questions

1. Why and how do we tokenize a text?


2. Differentiate between stemming and lemmatization.
3. What are the steps for text preprocessing?
4. Explain the importance and process of feature extraction.
5. What is sentiment analysis?
6. Discuss the utility of two basic packages required in text mining.
7. Explain the significance of shallow parsing.
8. How do we collect the text data from the tweets of Twitter?
9. How do we determine the sentiment from the score of polarity?
10. Discuss the different techniques available for performing text similarity.

CHAPTER
17

Machine Learning for Image Data

Learning Objectives
After reading this chapter, you will be able to

• Understand image data representation.


• Determine similar images to a given image from existing image dataset.
• Apply different machine learning techniques on existing image dataset.
• Develop analytical thinking abilities for analyzing image-based dataset.

Today, the diffusion of smartphones, tablet, and computers with high-speed Internet access
makes new data types available for data analysis. Images can tell a completely different story
than text mentions. Image analysis is the ability of computers to recognize attributes within an
image. Unlike text data, images do not require translation, which makes it extremely useful in a
global strategy. With social media becoming more image-focused, image analysis is becoming
increasingly important and can be considered as an extension of text analysis features applied to
visual content. However, the complete picture of consumer perception can be determined by
looking at text and images together. For example, consumer perception was earlier collected in
the form of reviews (text data) only, but now it is possible to collect images/snaps, music, or
videos instead of ratings. Hence, it now becomes important to use appropriate algorithms for
analysis of uploaded images/snaps, music, or videos.
Computer vision basically provides visual understanding to the computer for making them
use the same power and help to take decisions that are generally taken by the human being. This
involves the following three main processes:

1. Image acquisition: It is the process of translating analog data into binary data for digital
images. Webcams and embedded cameras, digital compact cameras, and DSLR are used for
image acquisition.
2. Image preprocessing: This step requires low-level processing of images. An image
processing technique is the usage of computer to manipulate the digital image. This
technique has many advantages such as elasticity, adaptability, data storing, and
communication. With the growth of different image resizing techniques, the images can be
kept efficiently. This technique has many sets of rules to perform into the images
synchronously. The 2D and 3D images can be processed in multiple dimensions. This
involves techniques such as image resizing, rescaling, cropping, rotation, intensity, edge
detection, and feature detection.
3. Image analysis: This step basically helps in decision making using advanced algorithms for
machine learning. The different techniques include identifying similar images, creating
cluster of images, using supervised machine learning, and object recognition.

In this chapter, the first two steps are discussed in Section 17.1, while the remaining chapter is
focused on the different algorithms available for performing image analysis.

17.1 Image Acquisition and Preprocessing


An image is represented in the form of pixels. A black and white image is represented by pixels
arranged in two dimensions. The pixel having dark black color is represented with intensity 0
and pure white is represented with intensity 1 (on a normalized scale) or 255 (on an 8-bit scale)
(Fig. 17.1). All the pixels in the image are
represented by the value depending on the intensities of black and white. However, colored
image is represented generally in the form of RGB format. Hence, a colored image has three
matrices, one matrix representing one color. For most images, pixel values are integers that range
from 0 to 255. The 256 possible values are the intensity values for the respective color. For
example, a pixel of pure red color in a colored image will have the dimension as (255, 0, 0)
because for red color, the values of green and blue will be 0. Similarly, a violet image will have
a higher value for red and blue but a lower value for green. Thus, a dataset consisting of 500
colored images of size (200, 100) will have dimension as (500, 200, 100, 3), where 3 denotes the
three RGB colors. A dataset containing gray-scale images will have the dimension as (500, 200,
100) because it will have only one value between 0 (black) and 1 (white) according to intensities
of black and white. This means that the last dimension does not exist in noncolored images.

Figure 17.1 An image in the form of pixels.

17.1.1 Image Acquisition and Representation
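A minimal sketch of reading and converting an image with scikit-image, assuming an image file named taj.jpg in the working directory (the file name follows the explanation below):

from skimage.io import imread
from skimage.color import rgb2gray, rgb2hsv
import matplotlib.pyplot as plt

taj = imread('taj.jpg', as_gray=False)    # colored image, shape (rows, columns, 3)
print(taj.shape)

taj_gray = rgb2gray(taj)                  # grayscale version, values between 0 and 1
taj_hsv = rgb2hsv(taj)                    # HSV version

# Display the three versions side by side in a single figure
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, img, title in zip(axes, [taj, taj_gray, taj_hsv], ['RGB', 'Gray', 'HSV']):
    ax.imshow(img, cmap='gray' if title == 'Gray' else None)
    ax.set_title(title)
    ax.axis('off')
plt.show()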

Explanation
The function imread('taj.jpg', as_gray=False) reads the image in colored format because
the as_gray argument has the value as false. We can observe that the dimension of the colored
image is (122, 183, 3), here 3 denotes the three colors corresponding to RGB. The details when
printed show the nested list of three values. It is known that the value of RGB is between 0 and 255
for each color – red, green, and blue. For example, when the image was read in gray mode using the
value of the argument as_gray as True, the dimension got reduced to (122, 183)
because the component of RGB was removed. It should be noted that the value of gray image is
between 0 and 1. Thus, the value is denoted by corresponding pixel in the image.
The next command converts the image in different color schemes. The colored image can
be converted to gray mode directly by using the function rgb2gray and can be converted to hsv
mode by using the function rgb2hsv. To display all the three images together in a single image,
a subplot function is used.

17.1.2 Image Resizing and Rescaling


Resizing and rescaling can be done on an image using respective functions from the
skimage.transform library. These functions help to change the size and scale of the image.
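A minimal sketch, assuming taj holds the colored image read earlier; older scikit-image versions use multichannel=True instead of channel_axis:

from skimage.transform import resize, rescale

taj_resized = resize(taj, (120, 120))     # fixed output size; the color channel is preserved
print(taj_resized.shape)                  # (120, 120, 3)

# Rescale both spatial dimensions to 50% while keeping the color channels intact
taj_rescaled = rescale(taj, scale=(0.5, 0.5), channel_axis=-1)
print(taj_rescaled.shape)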

Explanation
The original dimension of the image is found to be (122, 183, 3). This means that the height of the
image is 122 pixels, the width of the image is 183 pixels, and 3 denotes that it is a colored image. The
image is resized to a dimension of (120, 120) by using the function resize. Hence, the resized
dimension is displayed as (120, 120, 3). It is also possible to rescale the image by specifying the
percentage to which it should be rescaled. For example, the function rescale(taj, scale =
(0.5, 0.5)) will reduce the width and height to exactly 50%. Hence the new dimension
becomes (122/2, 183/2, 3) = (61, 92, 3). The next image reduces the width of the image to 75%
and height of the image to 50%. Hence, the dimension of the new image becomes (122/2,
183*3/4, 3) = (61, 137, 3). The last section displays the single image consisting of four images
using the subplot function. We can observe from the image that when the original image was
changed to dimension (120, 120), the image is shown in the form of a square shape. However,
the basic look of the image did not change when it was rescaled to 50% in both height and
width. But, when the width was reduced to 75% and height was reduced to 50%, the image
becomes wider than the original image.

17.1.3 Image Rotation and Flipping


The rotation to the image can be applied using rotate function from transform library. The
flipping of the image is possible using the function fliplr (for flipping from left to right) and
flipud (for flipping from up to down).
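A minimal sketch, again assuming taj holds the image:

import numpy as np
from skimage.transform import rotate

taj_rotated = rotate(taj, angle=40)   # rotate by 40 degrees
taj_lr = np.fliplr(taj)               # flip from left to right
taj_ud = np.flipud(taj)               # flip from up to down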

Explanation
The function rotate(taj, angle=40) rotates the image to 40° and imshow function displays
the rotated image. There is no change observed in the image when it is flipped from left to right
because image of Taj is exactly similar on both the sides. However, the results are very clear
when it is flipped from up to down.

17.1.4 Image Intensity
The intensity of the image can be changed using exposure function available in skimage library.
The value of gamma basically determines the intensity of the image. The higher the value of
gamma, the darker will be the image.
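A minimal sketch of adjusting the intensity with the exposure module:

from skimage import exposure

taj_bright = exposure.adjust_gamma(taj, gamma=0.25)   # gamma < 1 brightens the image
taj_dark = exposure.adjust_gamma(taj, gamma=2.5)      # gamma > 1 darkens the image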

Explanation
A brighter image is created using the value of gamma as 0.25. It should be noted that the
original image has value of gamma equal to 1. A darker image can be created by increasing the
value of gamma. Thus, a dark image was created using the value as 2.5. A very dark image is
created using the value as 4 for gamma.

17.1.5 Image Cropping
This tool helps us to crop the image according to the requirement of the user. Since we may want
more focused images, cropping serves as an important image processing tool before
applying advanced machine learning algorithms. It is important to specify the range of
coordinates on both x-axis and y-axis for cropping an image according to the requirement.
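A minimal sketch of cropping by slicing the pixel array:

# Keep rows 50 to (height - 10) and columns 50 to (width - 10)
taj_cropped = taj[50:(taj.shape[0] - 10), 50:(taj.shape[1] - 10)]
print(taj_cropped.shape)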

Explanation
The command taj[50:(taj.shape[0]-10),50:(taj.shape[1]-10)] crops the image for both
axes by taking a range of pixels from 50 till the last dimension – 10. The command taj[60:
(taj.shape[0]-20),70:(taj.shape[1]-30)] starts from pixel 60 and till last dimension 20 on
x-axis and starts from 70 and till last dimension 30 on y-axis. We can observe the results
according to the dimensions specified during cropping of the image.

17.1.6 Edge Extraction Using Sobel Filter


For big data analysis of image type, where the processing of large and multiple images is
involved, it is important to consider the time factor involved in processing. The analyst may
really want to determine the edges from the image for faster processing. We can extract edges
using sobel filter or prewitt filter existing in skimage library. The sobel filter for horizontal
processing is applied using sobel_h() and sobel_v() is used for vertical processing.
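A minimal sketch; the filters expect a 2-D (grayscale) image, so the colored image is converted first:

from skimage.color import rgb2gray
from skimage.filters import sobel_h, sobel_v

taj_gray = rgb2gray(taj)
edges_h = sobel_h(taj_gray)    # horizontal edge extraction
edges_v = sobel_v(taj_gray)    # vertical edge extraction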

Explanation
The results clearly show the appearance of the image when edge extraction is done for a
particular image. We can observe that the color inside the image is not considered and only the
edges are retained. This drastically reduces the processing time for the images without affecting
the results to a great extent.

Due to version incompatibility, it may not be possible to access some libraries. It is
always suggested to keep conda up to date using the command "conda update
conda", which updates the conda version directly.

17.1.7 Edge Extraction Using Prewitt Filter


The prewitt filter for detecting horizontal edges is applied using prewitt_h() and vertical edges
are detected using prewitt_v(), both available in skimage.filters. These tools are important for
edge extraction.
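A brief sketch, analogous to the sobel example (the file name is illustrative):

from skimage.io import imread
from skimage.filters import prewitt_h, prewitt_v
import matplotlib.pyplot as plt

taj_gray = imread('taj.jpg', as_gray=True)   # prewitt filters also expect a 2D image
plt.subplot(1, 2, 1); plt.imshow(prewitt_h(taj_gray), cmap='gray')
plt.subplot(1, 2, 2); plt.imshow(prewitt_v(taj_gray), cmap='gray')
plt.show()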

Explanation
The results clearly explain the appearance of the image when edge extraction is done for a
particular image. This tool is particularly used for processing of large image datasets where
processing time is an important consideration.

USE CASE
IMAGE OPTIMIZATION FOR WEBSITES

It is a common saying that “a picture is worth a thousand words.” Each person likes to interact
with images rather than text. Due to this reason, the number of images on e-commerce,
travel and tourism, and entertainment websites has been increasing over time. According to
HTTP Archive, on an average, around 64% of a website’s space is acquired by images. Images
occupy a lot of memory space and hence when downloaded from a webpage, they require a lot of
bytes. Hence, it becomes important for websites to reduce the size of these images without
compromising with the image quality. They need to optimize the images so that the user can
download the image easily with fewer bytes according to available bandwidth. This will further
help in performance improvement for website by displaying the useful content on the screen and
will finally lead to more customer satisfaction.
According to a report, nearly half of the visitors prefer websites that load in less than 2
seconds. If the webpage requires more than 3 seconds to load, almost 40% of visitors tend to
leave that site, thus increasing the bounce rate. So, if images, which account for nearly 64% of a
website's weight, are optimized, website speed can be drastically improved. This will give
website visitors a faster experience and hence more users will be able to view product and
services, thereby improving user experience and SEO ranking.
Images play a vital role in connecting users to products; hence image
optimization is also required for the images available on websites related to products
belonging to different categories like food, clothing, electronics, etc. However, image
optimization is both an art and science: an art because there is no exact answer for how to
compress an individual image, and science because there are many well-developed techniques
and algorithms that can significantly reduce the size of an image. Image optimization can be
done in different ways, be it by resizing the images, caching, or by compressing the size. Before
doing image optimization, it is important to understand the format capabilities, content of
encoded data, quality, and pixel dimensions. CSS (cascading style sheets) effects (gradients,
shadows, etc.) and CSS animations can also be used to produce resolution-independent assets
that always look sharp at every resolution and zoom level, often at a fraction of the bytes
required by an image file. It should also be noted that text-in-images generally delivers a poor
user experience because the text is not selectable, not searchable, not zoomable, not accessible,
and not friendly for high-DPI (dots per inch) devices.
Thus, image optimization for websites is a methodology to provide the high-quality images in
the right format, dimension, size, and resolution for effective, direct, and positive impact on page
load speeds for better user experience. It helps in improving page load speed, boosts websites’
SEO ranking, and improves user experience.

17.2 Image Similarity Techniques


Image similarity techniques are adopted to determine similar images to a given image. These
techniques are employed to create recommendation system for image data. Different algorithms
exist in the sklearn library for determining similar images from the dataset. The different
techniques include cosine similarity, Euclidean distances, Manhattan distances, etc. We will
consider the fashion_mnist dataset available in keras.datasets for determining similar images
to a given image.

Explanation
The fashion_mnist dataset was loaded and stored in x_trg, y_trg, x_tes_org (original test
dataset of independent variables) and y_test_org (original test dataset of dependent variables).
We can observe that the dimension of training dataset is (60,000, 28, 28). This means that there
are 60,000 images, each of dimension 28 × 28 and these are noncolored images. Similarly, the
dimension of test dataset is (10,000, 28, 28), which means that there are 10,000 noncolored
images. For easy and effective analysis by the reader, we want fewer images; hence, we will do
the analysis on 10,000 images only. It is important to have the data in 2D; hence, we need to
reshape the data. Since each image is of size 28 × 28, it is flattened into a vector of length 784
(28 × 28 = 784).
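A minimal sketch of loading and reshaping the data as described above (the variable name images is an assumption reused in the similarity sketches that follow):

from keras.datasets import fashion_mnist

(x_trg, y_trg), (x_tes_org, y_test_org) = fashion_mnist.load_data()
print(x_trg.shape, x_tes_org.shape)       # (60000, 28, 28) (10000, 28, 28)

# Flatten each 28 x 28 image into a 784-element vector to obtain 2D data
images = x_tes_org.reshape(10000, 784)
print(images.shape)                       # (10000, 784)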

17.2.1 Cosine Similarity
In this section, we will use cosine_similarity technique to determine the similar images to the
given image.

Explanation
The cosine similarity technique determines the similarity of each image with all the other images.
Since there are 10,000 images, we get a matrix of order 10,000 × 10,000 displaying the similarity
between all the images. From the details we can observe that the similarity of the first image with
itself is 1; thus, all the diagonal elements of the matrix have the value 1. The cosine similarity
information of the product at index 245 is displayed and the product is displayed as a purse. Since
similar products have higher similarity values, the array is sorted in descending order (by negating
the values before argument sorting) and the indexes of the most similar images are displayed. We
can observe that the images at the determined indexes display purses only. Thus, the technique
rightly displays the similar images.
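A hedged sketch of this procedure, continuing from the reshaped images array above (the number of neighbours shown is illustrative):

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import matplotlib.pyplot as plt

sim = cosine_similarity(images)              # (10000, 10000) similarity matrix
idx = 245                                    # the query image (a purse in the text)
most_similar = np.argsort(-sim[idx])[:6]     # negate so that higher similarity comes first

for i, j in enumerate(most_similar):
    plt.subplot(1, 6, i + 1)
    plt.imshow(images[j].reshape(28, 28), cmap='gray')
plt.show()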

17.2.2 Euclidean Distances


This section uses euclidean_distances technique to determine the similar images for the given
image.

Explanation
Euclidean distance, unlike cosine similarity, is smallest for similar images. Hence, unlike cosine
similarity where it was required to sort in descending order, there is no need to sort in descending
order for Euclidean distances, because the value will be smallest (0) between exactly similar
images. We can observe that the Euclidean distance between the first image and the given
image is 1896.174042, while with the second product it is 4018.045295. This means that the
second image is more dissimilar to the given image than the first image. The higher the
Euclidean distance, the greater the dissimilarity between the two images. Euclidean distance
shows an effective result: for the given image of a shoe, all the shoes are displayed from the
dataset.
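A hedged sketch, again continuing from the images array (the query index is illustrative); note that the distances are sorted in ascending order:

from sklearn.metrics.pairwise import euclidean_distances
import numpy as np
import matplotlib.pyplot as plt

dist = euclidean_distances(images)       # (10000, 10000) distance matrix
idx = 0                                  # illustrative query image
closest = np.argsort(dist[idx])[:6]      # smallest distance = most similar

for i, j in enumerate(closest):
    plt.subplot(1, 6, i + 1)
    plt.imshow(images[j].reshape(28, 28), cmap='gray')
plt.show()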

17.2.3 Manhattan Distances
This section shows the usage of concept of Manhattan distances for determining images similar
to the given image.

Explanation
Manhattan distance works in a similar fashion to Euclidean distance; the value will be smallest (0)
between exactly similar images. The higher the Manhattan distance, the greater the dissimilarity
between the two images. Hence, there is no need to sort the values produced by this algorithm in
descending order. The result shows that the Manhattan algorithm produced efficient results: all
the shirts displayed are similar to the given shirt.
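A brief sketch using manhattan_distances, continuing from the images array (the query index is illustrative):

from sklearn.metrics.pairwise import manhattan_distances
import numpy as np

dist = manhattan_distances(images)    # smallest values indicate the most similar images
closest = np.argsort(dist[0])[:6]     # indexes of the images closest to image 0
print(closest)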

Consider the cifar100 dataset available in keras. The dataset contains images
belonging to 100 categories. Apply the three different image similarity
techniques for determining images similar to the given image.

USE CASE
PRODUCT-BASED RECOMMENDATION SYSTEM

Recommendation systems are one of the most popular and widely adopted applications of machine
learning. In the online medium, people want to search for products similar to their choice.
The products range from small items like pens, to mid-sized items like food, to large
items like cars. A recommendation system basically tries to understand the features that
govern the customer choice. The organizations, on the other hand, try to determine the similarity
between the customer-required product and the available products in their catalog. On the basis
of scores corresponding to similarities, they recommend products related to food, fashion,
movies, shows, retail, etc., to the customer within a short span of time. Recommendation system
can be used by market places (place with hundreds of millions of products and thousands of
sellers) such as Amazon, Flipkart, Snapdeal, and Myntra to suggest products to the customers
depending on the similarities; online platforms such as Netflix, Hotstar, and YouTube to suggest
movies, shows, videos, etc., for watching; online commuting platforms such as Uber and OLA to
recommend the choice of vehicle and desired location; online food delivery partners such as
Zomato, Uber Eats, and Swiggy to understand the choices of consumer related to food. In short,
recommendation system will be truly an asset to nearly all the industries such as hospitality,
food, tourism, and entertainment for understanding the liking of the customer and frame
strategies accordingly for higher growth and meeting company’s objective.

17.3 Unsupervised Machine Learning for Grouping Similar Images


K-means cluster analysis, an unsupervised machine learning technique, is an effective tool for
grouping similar images/text/other data. In this section, cluster analysis is done considering
image dataset and cifar10 dataset from keras.datasets library. The CIFAR-10 dataset
(Canadian Institute for Advanced Research) is a collection of 60,000 color images, each of
dimension 32 × 32 belonging to 10 different classes. The 10 different classes represent airplanes,
cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. There are 6,000 images of each
class. Since the images in CIFAR-10 are of low resolution (32 × 32), this dataset can be used for
cluster analysis easily. It should be noted that since the cifar10 dataset has images belonging to 10
categories, cluster analysis was done considering 10 clusters.

Explanation
We can observe that the dimension of training and test datasets for X are (50000, 32, 32, 3) and
(10000, 32, 32, 3), respectively. This means that these are colored images. Since cluster analysis
is possible only on 2D data, we convert the dataset to a new dimension of (50000, 3072). It is
taken as 3072 because 32 × 32 × 3 = 3072. Since there are 10 categories in the cifar10 dataset, we
performed cluster analysis considering 10 clusters. It is clear from the results that there are 7095
images in the fourth cluster, 6742 images in the tenth cluster, 5837 images in the third cluster,
5382 images in the first cluster, 5070 images in the sixth cluster, 4925 images in the second
cluster, 4454 images in the eighth cluster, 4152 images in the ninth cluster, 3614 images in the
seventh cluster, and 2729 images in the fifth cluster.
Since there are a lot of images in each cluster, it is not possible to display the images of all
digits in each cluster in this book. However, in Chapter 20, when the user-defined data are
considered, we have displayed the images also belonging to each cluster for better
understanding.
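A hedged sketch of the clustering steps described above (variable names are illustrative; running KMeans on 50,000 vectors of length 3072 can take considerable time):

from keras.datasets import cifar10
from sklearn.cluster import KMeans
import numpy as np

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape)                        # (50000, 32, 32, 3)

# Flatten each 32 x 32 x 3 image into a 3072-element vector for clustering
x_2d = x_train.reshape(50000, 32 * 32 * 3)

km = KMeans(n_clusters=10, random_state=0)  # 10 clusters for the 10 cifar10 categories
labels = km.fit_predict(x_2d)
print(np.bincount(labels))                  # number of images assigned to each cluster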

USE CASE
GROUPING SIMILAR PRODUCTS IN E-COMMERCE

In online shopping, customers rarely pick up directly the one product they want to buy. They
search through the webstore's inventory, navigating from one product to the next until their
requirement is met. The webstore can consider this behavior as a strategic opportunity and display the
images of the products that belong to the same cluster which is selected by the customer. By
grouping related products into collections, customers may be encouraged to buy more or other
products they did not even know they wanted. For example, when a customer searches for men
premium branded shirt, shirts belonging to the cluster of branded shirts can be displayed;
however some images related to cluster of men clothing like trousers, jacket, etc., can also be
displayed, which will prompt the user to search for men trousers or jackets also. A display of
images of other clusters such as handbag and electronic devices will definitely not yield good
results. This will make it easier for customers to find exactly what they are looking for by
grouping items together in obvious categories. Besides, many customers visit an online store
with a specific interest in mind to check out the new arrivals or to shop a timely sale. By creating
clusters of these popular types, the consumer can easily find what they need, making the
experience more friendly and enjoyable.
Grouping similar images also helps in managing products more efficiently since it helps in
categorizing product into collections and helps to keep webstore more organized, under effective
control. For example, grouping will help to offer special discount or coupon for a specific
cluster, help in promotional activities such as sending e-mails/SMS related to a particular
cluster, and help customers find what they are looking for by following a quick and easy path
through your website. It also gives customers an enjoyable shopping experience and makes them
feel like they are shopping from a reliable webstore; hence, they will more likely feel comfortable
to do online shopping.

17.4 Supervised Machine Learning Algorithms for Image Classification

Images are represented in the form of numeric arrays and they can be classified into
different categories. This section uses different supervised machine learning classification
algorithms for classification of images. It should be noted that these algorithms can be applied to
a 2D dataset only. Hence, all the gray-scale images and colored images need to be reshaped
before using these algorithms.

In this section, we will use machine learning algorithms for classification problems for the
image data considering mnist dataset from keras. This dataset contains 70,000 images of
handwritten digits and has 10 classes representing different digits from 0 to 9.

Explanation
We know that a gray-scale image dataset is represented in the form of three dimensions
(samples, height, width). Hence, there is a need to convert it to 2D before applying classification
algorithms. The reshape() function helps to convert the shape of the representation of the image.
Thus, a 3D image dataset is converted to a 2D image dataset. The new dimension should be
consistent with the existing dimension. Since 28 × 28 = 784, each image is converted to a new
dimension of 784. Thus, the new dimensions of the training and test datasets are (60,000, 784)
and (10,000, 784), respectively.
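A minimal sketch of these reshaping steps (variable names are illustrative and are reused in the classification sketches below):

from keras.datasets import mnist

(x_trg, y_trg), (x_test, y_test) = mnist.load_data()
print(x_trg.shape, x_test.shape)     # (60000, 28, 28) (10000, 28, 28)

# Flatten to 2D so that the sklearn classifiers can be applied
x_trg = x_trg.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
print(x_trg.shape, x_test.shape)     # (60000, 784) (10000, 784)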

17.4.1 Naïve–Bayes Model


This section considers the Naïve–Bayes algorithm for training and determining the accuracy of the
model. The GaussianNB() function is imported from the sklearn.naive_bayes library.

Explanation
A Naïve–Bayes model was created on the training dataset. We can observe that when the
Naïve–Bayes model was evaluated on test dataset, it showed an accuracy of 0.5558 for the
given data.
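A minimal sketch, continuing from the reshaped mnist data above:

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(x_trg, y_trg)               # train on the reshaped training data
print(nb.score(x_test, y_test))    # accuracy on the test data (about 0.56 in the text)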

Pre-trained models for images such as MobileNet, MobileNetV2, ResNet, VGG16,
and VGG19 can be used to obtain better accuracy from the machine learning
algorithms and similarity techniques applied to image data.

17.4.2 Decision Tree Model


This section considers decision tree algorithm for training and determining the accuracy of the
model. The DecisionTreeClassifier() function is imported from the sklearn.tree library.

Explanation

A decision tree model was created considering the training dataset. The accuracy of the model
was observed to be 0.8781 when it was evaluated on the test dataset. However, hyperparameter
tuning could result in even higher accuracy.

Create the support vector machine model and KNN model for the above data
and compare the results.

17.4.3 Random Forest Model


This section considers random forest ensemble technique for training and determining the
accuracy of the model. The RandomForestClassifier() function is imported from the
sklearn.ensemble library.

Explanation
A random forest model was created considering the training dataset. The accuracy of the model
has been observed to be 0.9468 when it was evaluated on the test dataset, which is excellent.

17.4.4 Bagging Model


This section considers bagging algorithm for training and determining the accuracy of the model.
The BaggingClassifier() function is imported from the sklearn.ensemble library.

Explanation
The accuracy of the bagging model when evaluated on the test dataset was observed to be
0.9468, which is excellent for an image dataset. However, steps like hyperparameter tuning could
be taken for further improving the accuracy of the model.
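The decision tree, random forest, and bagging models of Sections 17.4.2 to 17.4.4 all follow the same fit-and-score pattern; a hedged sketch, continuing from the reshaped mnist data (default hyperparameters are assumptions):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

models = {'decision tree': DecisionTreeClassifier(),
          'random forest': RandomForestClassifier(n_estimators=100),
          'bagging': BaggingClassifier()}

for name, model in models.items():
    model.fit(x_trg, y_trg)                       # train on the reshaped training data
    print(name, model.score(x_test, y_test))      # accuracy on the test data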

Try to increase the accuracy of the model for the given dataset by doing
hyperparameter tuning of the different supervised machine learning models.

Use of Conv1D and Conv2D layers in deep learning models helps in increasing
the accuracy on image datasets to a great extent.

USE CASE
ONLINE PRODUCT CATALOG MANAGEMENT

All e-commerce portals are regularly adding new categories and deepening existing product
lines by tapping latest opportunities, adding vendors, expanding reach, and enhancing
applicability. Customers today demand rich and consistent information from all the products. In
order to face the competition, these portals need to meet the customer expectations. A picture is
worth a thousand clicks. Hence, an e-commerce category manager is always struggling to put
correct images of products online. One prime responsibility is to effectively manage the
e-commerce product catalog by ensuring the appropriate quality of product images,
especially in case of fashion and apparel. This is because the product pages for these categories
require multiple, consistent, high-quality images and the online buyers rely heavily on product
images to make their decisions. These images fill in a significant gap in the online consumer
journey.

For giving a uniform look to the webstore, product images need to be sometimes altered to
make sure they adhere to the necessary parameters and specifications for your portal. Some of
the features addressed include removing/adding backgrounds and resizing for ensuring that all
images look coherent when displayed on a category page, removing shadows for giving a neat
appearance, removing/adding objects to remove ambiguity about the inclusions and exclusions,
giving a clearer and brighter look by correcting the color composition, noise removal especially
on zoomed-in images for clearer picture quality, removal of watermarks, enhancing the
brightness and vibrancy, cropping away unnecessary portions, erasing physical flaws like
blemishes, scratches, dust spots and glare, and ensuring standard resolution and quality.
A model can be created considering different images for training and taking dependent
variable as whether the image is appropriate to be considered in the catalog management. The
model can then be used to evaluate whether the given image can be uploaded on the website or
not.

Summary
• Computer vision provides visual understanding to computers, enabling them to use the
same power and take decisions that are generally taken by human beings. This
involves three main processes: image acquisition, image preprocessing, and image analysis.
• An image is represented in the form of pixels. A black and white image is represented by
pixels arranged in two dimensions. The pixel having dark black color is represented with
intensity 0 and that in pure white is represented with intensity 1.
• A colored image is represented generally in the form of RGB format. Hence, a colored image
has three matrices; one matrix representing one color. For most images, pixel values are
integers that range from 0 to 255. The 256 possible values represent the intensity values for
each of the three respective colors.
• Resizing and rescaling can be done on an image using respective functions from the
skimage.transform library. These functions help to change the size and scale of the image.
• The rotation to the image can be applied using rotate function from transform library. The
flipping of the image is possible using the function fliplr (for flipping from left to right)
and flipud (for flipping from up to down).
• The intensity of the image can be changed using the exposure module available in the skimage
library. The value of gamma basically determines the intensity of the image. The higher the
value of gamma, the darker will be the image.
• For big data analysis of image type, where the processing of large and multiple images is
involved, it is important to consider the time factor involved in processing. The analyst may
really want to determine the edges from the image for faster processing. Edge extraction is
done by using filters, namely, sobel_h, sobel_v, prewitt_h, and prewitt_v.
• Image similarity techniques are adopted to determine images similar to a given image. These
techniques are employed to create recommendation system for image data. Different
algorithms exist in the sklearn library for determining similar images from the dataset. The
different techniques include cosine similarity, Euclidean distances, Manhattan distances, etc.
• Unsupervised machine learning algorithms like k-means clustering can be used for grouping
similar images.

• Supervised machine learning algorithms such as Naïve–Bayes, decision tree, random forest,
and bagging can be used for creating models for image classification.

Multiple-Choice Questions

1. The higher the value of gamma in exposure() function, the ____________ is the image
(a) Darker
(b) Lighter
(c) No change
(d) None of the above
2. Similar images will have value close to 1 for…
(a) Cosine similarity
(b) Euclidean distances
(c) Manhattan distances
(d) All of the above
3. The value of Euclidean distance for exactly similar images will be
(a) –1
(b) 1
(c) 0
(d) None of the above
4. Similar images can be determined using following technique:
(a) Cosine similarity
(b) Euclidean distances
(c) Manhattan distances
(d) All of the above
5. The functions primarily used for edge extraction include
(a) sobel_h
(b) prewitt_h
(c) Both (a) and (b)
(d) Neither (a) nor (b)
6. The important argument in reading colored or gray scale image in imread() function is
(a) color
(b) as_color
(c) V
(d) as_gray
7. The image dataset consisting of 500 colored images of resolution 40 × 40 is represented as
(a) (500, 1600)
(b) (500, 1600, 3)
(c) (500, 40, 40, 3)
(d) (500, 40, 40)

8. The perfect black and white colored pixels are represented, respectively, as
(a) 1, 0
(b) 0, 1
(c) 1, –1
(d) −1, 1
9. The colored image can be converted to gray mode using the _____________ function.
(a) color2gray
(b) rgb2gray
(c) colortogray
(d) rgbtogray
10. The pixel value in RGB format ranges from
(a) 0 to 1
(b) 0 to 100
(c) 0 to 255
(d) 1 to 256

Review Questions

1. What is the basic difference between representation of colored and gray-scale images?
2. Explain briefly the different techniques used for image data processing.
3. Explain the importance of rescale function on an image considering different values.
4. Differentiate between rotating and flipping the image with an example.
5. What is the need of performing edge extraction on the image?
6. Explain the utility and implementation of resize() function in images.
7. Discuss the different image similarity techniques.
8. Considering cifar10 dataset, perform supervised machine learning algorithms and evaluate
the results.
9. Considering fashion mnist dataset, perform cluster analysis and evaluate the results.
10. Considering mnist dataset, use image similarity techniques to display similar results to digit
9 and evaluate the results.

SECTION 4
Deep Learning Applications in Python

Chapter 18
Neural Network Models (Deep Learning)

Chapter 19
Transfer Learning for Text Data

Chapter 20
Transfer Learning for Image Data

Chapter 21
Chatbots with Rasa

Chapter 22
The Road Ahead

CHAPTER
18

Neural Network Models (Deep Learning)

Learning Objectives
After reading this chapter, you will be able to

• Understand the neural network model.


• Implement different deep learning algorithms based on the nature of the data.
• Validate and test the different types of neural network models.
• Attain competence in using different arguments for increasing accuracy.

Neural networks have become increasingly popular in recent years. The concept of neural
networks originated from our own biological neural networks. Biological neural networks
consist of interconnected cells called neurons, whose dendrites receive inputs; based on these
inputs, the neurons propagate electrochemical signals to and from surrounding neurons.
This finally results in an effective communication network, which helps to do complex geometric
transformation in a high-dimensional space through a long series of simple steps. The whole
process is replicated using an artificial neural network (ANN) in Python. The fundamental unit in
a neural network model is a neuron. The neuron receives inputs, multiplies them by weights, adds
a bias, and then passes the result into an activation function to produce an output. The most
important step is the adjustment of the weights for the next pass. This is done by comparing the outputs
with the original labels. This process is repeated until we have reached a maximum number of
allowed iterations, or an acceptable error rate.
To create a neural network, we basically add layers of neurons together and finally create a
multilayer neural network model. It should be noted that a model must have at least two
layers, input and output: the input layer directly takes the feature inputs and the output layer creates the
resulting outputs. The effectiveness of a neural network model is enhanced when we use more layers
between the input and output layers. These layers are called hidden layers because they do not
directly observe the feature inputs or outputs.
Each arrow displayed in Fig. 18.1 passes an input that is associated with a weight. Each
weight is essentially one of many coefficient estimates that contribute to the regression model.
Basically, these are unknown parameters that must be tuned by the model to minimize the loss
function and uses an optimization procedure. Each neuron is mathematically represented as
z = b + ∑ Wi Xi, where i ranges from 1 to n, b denotes the intercept (bias), and W and X are vectors
carrying the weights and values from all “n” inputs, respectively. Before training, all weights are
initialized with random values. These weights contain the information learned by the network
from exposure to training data. Weight is automatically calculated based on the input and the

output shapes. The weights represent a matrix capable of transforming the input shape into the
output shape by some mathematical operation. Deep learning is basically about improving the
accuracy by gradually adjusting these weights depending on feedback signal. Many deep
learning frameworks have emerged over the time frame. Among the available frameworks, Keras
and TensorFlow are the most popular ones. In fact, it is not necessary to choose one out of both;
we can work on both of them together since the backend for Keras is TensorFlow and Keras can
be integrated seamlessly with TensorFlow. Keras works as a wrapper for TensorFlow. Keras is
popular for its ease of use; its flexibility in supporting arbitrary network architectures and
allowing the same code to run seamlessly on CPU and GPU; and its user-friendly API for quick
prototyping of deep learning models, with strong support for convolutional networks
(computer vision), recurrent networks (sequence processing), and combinations of both.
A neural network model in Python is developed using the keras package. Different types of
neural network models can be built in Python using keras, which include multi-layer perceptrons
(MLP), convolutional neural networks (CNN), recurrent neural networks (RNN), skip-gram models,
skip-gram models and pre-trained models are beyond the scope of this book.

Figure 18.1 Neuron.

Figure 18.2 Neural network model.

18.1 Steps for Building a Neural Network Model

18.1.1 Data Preparation

First, even if the input data are already quite clean as mentioned before, it still needs some
preparation and pre-processing in order to be in an appropriate format to then later be fed to the
neural network. This includes data separation, reshaping, and visualization, which might give
insight to the data scientist as to the nature of the images.
Step 1A: Data Exploration: The first step in neural network model is to determine the basic
data structure of the dataset to be used. Deep learning uses tensors as its basic data structure.
Tensors are a generalization of vectors and matrices to an arbitrary number of dimensions. In
Python, a tensor that contains only one number is called a scalar tensor or 0-D tensor, a tensor
that contains a one-dimensional array of numbers is called a vector tensor or 1-D tensor, and a
tensor that contains a two-dimensional array of numbers is called a matrix tensor or 2-D tensor. If
we combine matrices in a new array, we will obtain a 3-D tensor. Similarly, by combining 3-D
tensors in an array, we can create a 4-D tensor, and so on. For higher level dimensions, array
objects (which support any number of dimensions) are used.
Vector data are stored in a 2-D tensor, having two dimensions: samples and features. For example,
information of 500 organizations having details of four features related to name, place, number
of employees, and stock price can be stored in a 2-D tensor of shape (500, 4).
Time series or sequence data are stored in a 3-D tensor having three dimensions representing
samples, time, and features. For example, information with respect to time series data is stored in
a 3-D tensor with time axis. Each sample is considered as a sequence of vectors (2-D tensor), and
a batch of data is considered as a 3-D tensor. For example, in a dataset of stock prices, for every
minute the current price of the stock is stored along with the highest and lowest price in the past
minute. Hence, for every minute, a 3-D vector is created. The share market starts from 9:00 am
to 3:30 pm, hence there are 390 minutes of trading in an entire day. Hence, 1 day data of trading
is encoded as a 2-D tensor of shape (390, 3). Generally, there are 250 working days in a year for
stock market, hence the data can be stored in a 3-D tensor of shape (250, 390, 3).
A batch of images is a 4-D tensor representing samples, height, width, and channels. By convention,
individual image tensors are always 3-D since images have three dimensions: height, width, and color depth.
Colored images are generally represented as RGB, hence information related to three colors (red,
green, and blue) is required. However, grayscale images have only a single color channel and
hence could be stored in 2-D tensors. A batch of 100 grayscale images of size 512 × 512 could
thus be stored in a tensor of shape (100, 512, 512, 1), and a batch of 100 color images (RGB)
could be stored in a tensor of shape (100, 512, 512, 3). Note that in Theano convention, the
information related to color is defined before the dimensions of the image. Hence, with the
Theano convention, the previous examples would become (100, 1, 512, 512) and (100, 3, 512,
512), respectively.
A video is 5-D tensor representing samples, frames, height, width, and channels. Video data
need 5-D tensors since a video can be understood as a sequence of frames, each frame being a
color image. Each frame is stored in a 3-D tensor (height, width, color_depth), an image which is
a sequence of frames is stored in a 4-D tensor (frames, height, width, color_depth), and a video is
stored in a 5-D tensor of shape (samples, frames, height, width, color_depth).
Step 1B: Data normalization: This step is basically needed for scaling the data. As neural
networks converge faster with smaller data values in the range [0–1], the data are
normalized to this range. The data are transformed either by inbuilt functions like
MinMaxScaler() available in the sklearn library or manually. For example, the image data
involve rescaling the image pixel values within the range 0–1. Pixels are represented in the range
[0–255], hence we will first determine the range of the pixel values in the images and find out
the min and max. The dataset is then divided by the maximum value of the range of pixel.
However, in real-world problems, images are usually much bigger and do not usually have the
same dimensions. If the images are very big, effort must be made to resize them to much smaller
dimensions. If the images have different dimensions, it becomes difficult because dense layers at
the end of the neural network have a fixed number of neurons, which cannot be dynamically
changed. This means that the layer expects fixed image dimensions, so all images must be resized
to the same dimensions before training. Alternatively, the user can use a fully convolutional
network model (consisting solely of convolutional layers), resize images to a fixed dimension, or
add padding to some images before resizing.
Step 1C: Split the dataset: We need to split the dataset into training and test datasets in order
for the model to generalize well and for knowing the accuracy of model on unseen data. Further,
the split of training data is also done in two parts: training and validation set. In order to avoid
submitting the predictions and risking a bad performance, and to determine whether the neural
network model overfits, a small percentage of the train data is separated and termed validation
data. The ratio of the split can vary from 10% in small datasets to 1% in large datasets. For
example, if the data have 1M images and the neural network model is trained with these 1M
images, it might overfit and respond poorly to new data. Overfitting will occur because the
model is trained and has learned the differences in those 1M images only. The accuracy
decreases, when it is tested on new images. The objective of neural network model is to learn
from the training set and implement it also nicely on the test dataset. Hence, the splitting of data
takes place.
Thus, we can say that the training dataset is used for training the model; validation dataset is
used for tuning hyperparameters and evaluating the models, and test dataset is used to test the
model after the model has gone through changes by the validation set. The validation data help in
reducing the chances of overfitting, as we will be validating model on data, which was not
present in the training phase. The train_test_split() is used from sklearn.model_selection to
split the data.
Step 1D: Data pre-processing: It is important to reshape the tensor before developing a model.
Reshaping a tensor means rearranging its rows and columns to match a target shape. The total
number of coefficients in the reshaped tensor should be the same as in the initial tensor. The
training and test data for the independent variables are reshaped first in the form of an array using
the reshape() function and by specifying the dimensions of the new array to be formed. Image
datasets generally consist of RGB images. Keras also expects each image to have three dimensions:
[x_pixels, y_pixels, color_channels]. The images of the dataset considered here are grayscale
images with pixel values ranging from 0 to 255 and a dimension of 28 × 28, so before we feed the
data into the model, it is very important to pre-process them. It should be noted that if the images
in a dataset are grayscale, the color dimension is equal to one and thus we reshape the image data
to contain an explicit color channel of dimension 1. For example, reshaping to (28, 28, 1) will
represent each image in the form 28 × 28 × 1; that is, the input data are reshaped from (28, 28) to
(28, 28, 1).
It is also advisable to convert dependent variables of training and test data in a categorical
form, since we generally do classification tasks in deep learning. To train our model, we have to
format our image labels as one-hot vectors. This is called one-hot encoding and is done using the
to_categorical() function.
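A minimal sketch of these preparation steps for a grayscale image dataset such as mnist (variable names and the validation split are illustrative):

from keras.datasets import mnist
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

(x, y), (x_test, y_test) = mnist.load_data()

# Step 1B: normalize pixel values to the range [0, 1]
x = x.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# Step 1D: add an explicit color channel of dimension 1: (28, 28) -> (28, 28, 1)
x = x.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)

# One-hot encode the labels
y = to_categorical(y)
y_test = to_categorical(y_test)

# Step 1C: keep a small validation split aside from the training data
x_trg, x_val, y_trg, y_val = train_test_split(x, y, test_size=0.1, random_state=0)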

18.1.2 Building the Basic Sequential Model and Adding Layers


Once the data processing is done, the second step is to build a model in Keras. After installation
and loading the package "keras," an empty sequential model is constructed using the function
Sequential() from keras.models.
The core building block of neural networks is the layer, which processes the data. The input
data are processed and output data are generated in a more useful form. Specifically, layers
extract representations out of the data fed into them, hopefully, representations that are more
meaningful for the problem at hand. The deep learning model is made of successive layers and
finally implements a form of progressive data refinement. Each layer applies a transformation
that disentangles the data and all the successive layers make an extremely complicated
disentanglement process. The different types of layers that can be added to the model include
dense layer, dropout layer, flatten layer, 1-D and 2-D convolutional layer, batch normalization,
MaxPooling2D, and UpSampling2D. A dense layer is created using the Dense() function; 1-D and
2-D convolutional layers are created using Conv1D() and Conv2D(), respectively; a batch
normalization layer is added using BatchNormalization(), and a MaxPooling2D layer is added
using MaxPooling2D().

Syntax
For adding a convolutional 2D layer:
Conv2D(filters=, kernel_size=, padding=, activation=, input_shape=)

For adding a dense layer:


Dense(units,input_dim=,kernel_initializer=, bias_initializer=,activation=)

For adding a MaxPooling 2D layer:


MaxPooling2D(pool_size=)

For adding a Flatten() layer:


Flatten()

For adding dropout layer:


Dropout(num)
where num generally can have values from [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9].
Adding a layer to the model is the most important step in neural network, hence it is important to
understand the concept of these arguments in detail. Different types of layers have different
arguments according to the kind of data and methodology. The detail of these arguments is as
follows:
input_shape: Tensors flow between layers and they are represented as matrices with shapes.
This represents how many elements an array or tensor has in each dimension and is the most
important thing to be defined, which is based on input data. In Keras, the input layer itself is not
a layer but a tensor. This argument needs to be defined in the first layer. The starting tensor is
sent to the first hidden layer. It is important that the tensor must have the same shape as the input
data. All the other shapes are calculated automatically based on the units and particularities of
each layer. The output shape is calculated based on the relation between shapes and units. There
is no overall rule for how to set the network architecture (depth and width of layers). In general,
the optimization gets harder with the depth of the network. It is possible to tune the network
parameters, but an outer cross-validation should be implemented to avoid overfitting.
Example 1: A shape (20, 6, 8) means an array or tensor with three dimensions, containing 20
elements in the first dimension, six in the second and eight in the third. Hence, total elements are
20×6×8 = 960 elements. All other shapes are results of layers calculations based on input shape.
Example 2: If we have 200 images of 30 × 30 pixels in RGB (three channels), the shape of the
input data will be (200, 30, 30, 3). Hence, the input tensor must have this shape, but Keras
ignores the first dimension, which is the batch size. The model should be able to deal with any
batch size, so we define only the other dimensions; hence the input shape is defined as
input_shape = (30, 30, 3). When the model is printed, it will show (None, 30, 30, 3). The first
dimension representing batch size is "None" because it can vary depending on user choice. Thus,
leaving the batch_size dimension unspecified means that training is not limited to one particular
batch size.
The model needs to know the input shape of the data; hence only the first layer in a
sequential model needs to receive information about its input shape. There is no need to provide
input shape to the following layers, since they can do automatic shape inference. We pass an
input_shape argument to the first layer, which is in the form of tuple (a tuple of integers or none
entries, where none indicates that any positive integer may be expected). It should be noted that
the batch dimension is not included in the input_shape(). Some 2D layers, such as Dense,
support the specification of their input shape via the argument input_dim, and some 3D temporal
layers support the arguments input_dim and input_length for specifying input shape.
Units/filters: It is a property of each layer and it is related to the output shape. It basically
describes the number of “neurons” the layer has inside it. In the following figure, input layer has
2 units (neurons), hidden layer 1 and hidden layer 2 has 5 units (neurons) and the output layer
has 1 unit (neuron).
The units of each layer will define the output shape (the shape of the tensor that is produced
by the layer and that will be the input of the next layer). Dense layers have output shape based on
“units” while convolutional layers have output shape based on “filters”. A dense layer has an
output shape of (batch_size, units). Hence for the first hidden layer, the output shape will be
(batch_ size, 5) and the output shape of the last layer will be (batch_size, 1). For example, if the
input shape has only one dimension, we give input_dim as a scalar number and not as a tuple. If
the input layer has three elements, we can write either input_shape=(3) or input_dim = 3. When
we are dealing directly with the tensors, dim represents number of dimensions a tensor has. For
instance, a tensor with shape (20, 1000) has two dimensions.
Kernel/bias initializer: Initializations define the way to set the initial random weights of Keras
layers. The initializer can assume values such as "uniform," "lecun_uniform," "normal,"
"zeros," "ones," "glorot_normal," "glorot_uniform," "random_normal," "random_uniform," etc.
For example, the zeros, ones and constant initializers generate tensors initialized to 0, 1 and a
constant, respectively. The random_normal and random_uniform initializers generate tensors with
a normal distribution and a uniform distribution, respectively. However, the details of the initializers can be
accessed from https://keras.io/initializers/.
Activation: The value of this argument represents the operation with the elements. The different
types of activation include "relu," "tanh," "sigmoid," "hard_sigmoid," "linear," "softmax,"
"softplus," and "softsign." Every hidden unit has a sort of toggle switch that is used as a
filter on the regression output. It should be noted that each unit hence may not necessarily have a
linear regression model. Different types of toggle switches are designed for different activation
functions. On/off and high-pass toggle switches are encoded by the sigmoid and rectified linear unit
(ReLU) activation functions, respectively. In relu activation, the operations are done element-wise
and are applied independently to each entry in the tensors: a dot product between the input tensor
and a kernel tensor (X) is computed, the result is added to a bias vector (v), and finally the
element-wise maximum of the result and 0 is taken. X (kernel) and v (bias)
are called the weights or trainable parameters of the layer.
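As a hedged illustration, a small model using these layers could be built as follows (the architecture and argument values are assumptions, not a model from the text):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
# Only the first layer needs the input shape; here 28 x 28 grayscale images
model.add(Conv2D(filters=32, kernel_size=(3, 3), padding='same',
                 activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, kernel_initializer='uniform', activation='relu'))
model.add(Dropout(0.2))                      # drop 20% of the units
model.add(Dense(10, activation='softmax'))   # output layer with 10 classes
model.summary()                              # output shapes show None for the batch size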

18.1.3 Compiling the Model


The third step is to compile the developed model and for doing the actual training and weight
fine-tuning, we need to determine (i) a measure of goodness-of-fit that compares predictions and
known labels over all training observations and (ii) an optimization method that computes the
gradient descent, essentially tweaking all weight estimates simultaneously, in the directions that
improve the goodness-of-fit. For each of these, we have loss functions and optimizers,
respectively. There are many types of loss functions, all aimed at quantifying prediction error.
The model is compiled using the function:

Syntax
compile(loss =, metrics =, optimizer = )
where
• optimizer specifies the optimizer and can have a value from ['SGD', 'RMSprop',
'Adagrad', 'Adadelta', 'Adam (Adaptive Moment Estimation)', 'Adamax', 'Nadam'].
• loss specifies the loss function and can have a value from ['mean squared error (mse)',
'mean absolute error (mae)', 'categorical_crossentropy', 'binary_crossentropy', etc.].
• metrics specifies the accuracy of the model.
The compile () function has some important arguments that should be carefully implemented.
The metrics argument is used to monitor the training and testing steps and helps in determining
the accuracy of the model. The loss is the quantity we need to minimize during training; this
helps to measure its performance on the training data, which enable us to go in the right direction
and hence represents a measure of success. The loss is used as a feedback signal for
learning the weight tensors, and it is what the training phase attempts to minimize.
An optimizer is the mechanism through which the model will update itself based on loss
function. The optimizer specifies the exact way in which the gradient of the loss will be used to
update parameters. Gradient Descent calculates gradient for the whole dataset and updates values
in direction opposite to the gradients until we find a local minima. Unlike normal gradient
descent, stochastic gradient descent is much faster because it performs a parameter update for
each training example. True SGD draws a single sample and target in each iteration, whereas the
batch SGD algorithm draws the complete dataset; each update in batch SGD is more accurate, but
far more expensive. Hence, it is better to work between these two extremes and use mini-batches
of reasonable size. If the parameters were optimized using SGD with a small learning rate, the
optimization process would proceed very slowly. In gradient-based optimization, a gradient
function is computed that maps model parameter values to gradient values. Gradient Descent
algorithms can further be improved by tuning important parameters such
as momentum and learning rate. These parameters need to be defined in advance and they
depend heavily on the type of model. A small learning rate results in small steps toward finding
the optimal parameter values that minimize the loss, while a large learning rate causes the loss
function to fluctuate around the minimum. Hence, tuning these hyperparameters becomes difficult.
Besides, the same learning rate is applied to all parameter updates and this is not useful for
sparse data, where we want to update the parameters differently.
SGD was much faster but the results produced were far from optimum. Adaptive gradient
descent algorithms such as Adagrad, RMSprop, and Adam provide an alternative to classical
SGD. They have learning rate methods for each parameter and hence do not require expensive
work in tuning hyperparameters for the learning rate. Both, Adagrad and Adam produced better
results than SGD, but they were computationally extensive. Adagrad is more preferable for a
sparse dataset as it makes big updates for infrequent parameters and small updates for frequent
parameters. Adagrad and Adam both use a different learning rate for every parameter at each time
step based on the past gradients computed for that parameter. Thus, we do not need
to manually tune the learning rate. Adam is slightly faster than Adagrad. Thus, while using a
particular optimization function, one has to make a trade-off between more computation power
and more optimum results.
This step basically enables effective learning by the model. It helps to find a combination of
model parameters that reduces the loss function for a given set of training data samples and
their corresponding targets.
the opposite direction and the magnitude of the move is defined. The entire learning process is
made possible by the fact that neural networks are chains of differentiable tensor operations, and
thus it is possible to apply the chain rule of derivation to find the gradient function mapping the
current parameters and current batch of data to a gradient value.
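Continuing the sketch above, the compilation step is a single call; the optimizer and loss chosen here are illustrative:

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])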

18.1.4 Fitting the Model on Training Dataset


This step will iterate on the training data in batches. The number of iterations (epoch) are
specified in the epochs argument and number of samples are specified in batch_size. After each
iteration, the model will compute the gradients of the weights with regard to the loss on the batch
and update the weights accordingly. After all the epochs, the model will have performed many
updates and the loss of the network will be very low and the model will be capable of classifying
the data with high accuracy.

Syntax
fit(x, y, epochs =, batch_size =, verbose=)
where

• x is the training data for independent variables.


• y is the training data for dependent variable.
• epoch specifies the number of epoch.
• batch_size specifies number of batches.
• verbose defines the verbosity of the output (if 0, nothing is printed; if 1, a progress bar is printed for each epoch).

Thus, initially a batch of training samples x and corresponding targets y is selected. Then the
model is executed on x to obtain predictions. The loss of the model is then computed on the
batch and the weights of the model are adjusted in a way that slightly reduces the loss on this
batch. Batch size basically determines how many observations will be used in each training step.
It should be noted that deep-learning models do not process an entire dataset at once; rather, they
break the data into small batches. If the batch size is 2000, then from the training dataset, the first
batch is formed with images from 1 to 2000 and the second batch is formed with images from
2001 to 4000. Mainly, there are three options for batch size: stochastic batch,
mini-batch, and full-batch, which feed in single observations, small batches of samples, or the entire training set in each
training step, respectively. During runtime, the stochastic batch is faster compared to the full-
batch while in terms of accuracy in the optimization, the full-batch is more accurate than the
stochastic batch. In both respects, mini-batch lies between the two extremes. It should be noted that
there is no single best method, and we should consider the number of epochs, which helps in
extending the iterations beyond the size of the training set.
The model is tested against the validation data, and in each step/epoch, we can determine the
performance. This will help us to observe the way loss and accuracy metrics vary during
training, and will help us to determine where there is overfitting and accordingly take action. For
example, if the result after 10th epoch is:
loss: 0.005 - acc: 0.9360 - val_loss: 0.13 - val_acc: 0.9320. This means that the training loss
is 0.005, which is very low, while the val_loss is considerably higher (0.13). But the training
accuracy is nearly same as the val_acc. Since there is a difference in loss, this means that it is an
overfitting problem. Thus, we can say that val_loss and val_acc are important measures. The
model might do better on the trained data but our objective is that the model learns to generalize
also. If the model shows effective results with validation data, it may generalize well on the test
data also.

18.1.5 Evaluating the Model


The defined model is then evaluated on the test dataset using the function evaluate() and the
difference between predicted and original values of y is determined. The accuracy of the model
on test data is compared with the accuracy on training dataset. If the accuracy of training dataset
is higher than test dataset, then it is an overfitting model, and if the accuracy of training dataset is
less than test dataset, then it is an underfitting model.
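Continuing the sketches above, fitting, validating, and evaluating could look as follows (epoch and batch size values are illustrative):

# Fit on the training data while monitoring the validation data
history = model.fit(x_trg, y_trg,
                    epochs=10, batch_size=128, verbose=1,
                    validation_data=(x_val, y_val))   # reports val_loss and val_acc per epoch

# Evaluate on the unseen test data and compare with the training accuracy
test_loss, test_acc = model.evaluate(x_test, y_test)
print(test_loss, test_acc)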

18.1.6 Creating Better Model with Increased Accuracy


This is the most important step, where we change the different arguments in the steps from 18.1.2
to 18.1.4 for increasing the accuracy of the model. We start again from step 18.1.2 to create
an effective and efficient model by trying different combinations according to the results
generated by 18.1.4. It should be noted that there is no fixed formula for using a particular value
in any argument for improving the accuracy. It differs according to the nature of data, so the user
is suggested to observe output carefully and try different combinations according to generated
output.

In case of incompatibility with different libraries, it is better to check the
different versions of libraries. For example, version of tensorflow can be
determined using the command tensorflow.__version__

18.2 Multilayer Perceptrons Model (2-D Tensor)


Multilayer perceptron model is generally created for the two-dimensional data. The vector data
are generally used with dense layer in neural network model. A dense layer is added to the empty
sequential model using the dense() function.

18.2.1 Basic Model


The utility of deep learning for classification problems (when dependent variable is categorical in
nature) can be better understood by considering the credit card dataset that can be downloaded
from https://www.kaggle.com/mlg-ulb/creditcardfraud.

Explanation
The shape of the dataset is (284807, 30), which means that there are 284,807 rows and 30
columns. The command creditcard.iloc[:,0:29] stores all the independent 29 variables in X
and creditcard.iloc[:,29] stores the last dependent variable in Y. The command Sequential()
creates a sequential empty model. The command model1.add(Dense(10,
input_dim=input_dim, kernel_initializer='uniform', activation='relu')) adds a
Dense layer to the model with 10 units, activation as relu, and uniform kernel_initializer and
input_dim as 29 (the number of independent variables). The command model1.add(Dropout(0)) adds a
dropout layer but without any dropout, since the value of dropout is 0. This layer acts as a
hidden layer. The next command model1.add(Dense(1,kernel_initializer='uniform',
activation='softmax')) adds a dense layer with 1 unit and softmax activation. This layer
basically represents the output layer of the neural network model since the number of units
produced from this layer will be 1. The next command
model1.compile(loss='binary_crossentropy',optimizer='SGD',metrics=['accuracy'])
uses an SGD optimizer, binary_crossentropy as loss, and accuracy as metrics. The model is then
fitted on the training data. Two quantities are displayed during training: the loss of the network
over the training data, and the accuracy of the network over the training data. When the
command model1.fit(x_trg,y_trg, epochs=5, batch_size=1000) is executed, we find
from the result that 5 epochs are run with a batch_size of 1000. The accuracy of the model is
0% for the first epoch and first batch. However, it increased to 0.20% and 0.40% by the
third epoch and then reduced again to 0.17%. The last section evaluates the
model and shows that the accuracy on both training and test data is 0.17%, which is very low. As
this model shows low accuracy, we need to tune the hyperparameters for better
results. In the following sections, we will improve the accuracy by building different models.
Note: When we are using TensorFlow, it may show warnings at run time. It is suggested to
ignore those warnings at this stage.

18.2.2 Changing Units, Dropout, Epoch and Batch_size


For improving the accuracy, we created a new model by changing the number of units to 1000
and increasing the dropout percentage. The dropout value can range from 0 to 1, representing
0 to 100%, respectively. A dropout of 0.1 means 10% of the units will be dropped out. Since we
have already worked with step 1, the new models will always start from step 2.
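A hedged sketch of such a model is given below; the exact dropout fraction, number of epochs, and batch size used in the original code are not shown in the text, so the values here are indicative only.

# Larger layer with dropout (values shown are indicative)
model2 = Sequential()
model2.add(Dense(1000, input_dim=input_dim, kernel_initializer='uniform', activation='relu'))
model2.add(Dropout(0.1))                                  # 10% of the units are dropped out
model2.add(Dense(1, kernel_initializer='uniform', activation='softmax'))
model2.compile(loss='binary_crossentropy', optimizer='SGD', metrics=['accuracy'])
model2.fit(x_trg, y_trg, epochs=10, batch_size=2000)      # more epochs and a larger batch size
print(model2.evaluate(x_test, y_test))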

Explanation
In this model, the number of units in the input layer was considered as 1000 and we increased
the number of epochs and the batch size when fitting the model. It was observed that the
accuracy is still very low (0.17%) and has not changed from the previous model even though we
increased the number of epochs and the batch size.

Change the values of different arguments to check whether any improvement
in the accuracy is possible.

18.2.3 Changing Activation, Loss, and Optimizer


Since there was no improvement in accuracy by increasing the number of epochs and the batch size,
in the next model we will reduce the number of epochs and the batch size, and change the activation
function and the optimizer to increase the accuracy of the model.
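A possible sketch of this change is shown below, assuming the same data preparation as before; only the output activation and the optimizer differ from the earlier models.

model3 = Sequential()
model3.add(Dense(1000, input_dim=input_dim, kernel_initializer='uniform', activation='relu'))
model3.add(Dropout(0.1))
model3.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))   # sigmoid in the output layer
model3.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model3.fit(x_trg, y_trg, epochs=5, batch_size=1000)
print(model3.evaluate(x_trg, y_trg))
print(model3.evaluate(x_test, y_test))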

Explanation
From this model, we find that when the optimizer is changed from SGD to RMSprop and
sigmoid activation is introduced in the output layer, the accuracy increases greatly to 99.83%,
which is excellent. The accuracy of the training and test datasets is the same; hence the
model is neither underfitting nor overfitting. Thus, we can say that the optimizer played a
major role for this dataset.

18.2.4 Changing Optimizer and Activation


In the next model, we will change the optimizer from RMSprop to Adagrad and determine
whether there can be an increase in the accuracy. We will also change the activation in the input layer
from "softmax" to "relu."
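One possible reading of these changes is sketched below; apart from the Adagrad optimizer and the relu activation in the first layer, the remaining settings are assumptions carried over from the previous model.

model4 = Sequential()
model4.add(Dense(1000, input_dim=input_dim, kernel_initializer='uniform', activation='relu'))
model4.add(Dropout(0.1))
model4.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
model4.compile(loss='binary_crossentropy', optimizer='adagrad', metrics=['accuracy'])
model4.fit(x_trg, y_trg, epochs=5, batch_size=1000)
print(model4.evaluate(x_test, y_test))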

Explanation
We found that the use of the Adagrad optimizer increased the accuracy to 99.96%, which is
excellent.

Change the values of different arguments to check whether any improvement in the accuracy is possible.

18.2.5 Grid Approach to Determine Best Value of Epoch and Batch_size
To improve the accuracy, we need to determine the best value of epoch and batch size. We will
create a grid of different values of epoch and batch size and determine their best value for
maximum accuracy. This is done by creating a dictionary of these two variables.
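A sketch of this grid search is given below; the layer configuration inside create_model() is assumed to mirror the earlier three-layer model, and the cross-validation settings of GridSearchCV are left at their defaults.

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def create_model():
    model = Sequential()
    model.add(Dense(1000, input_dim=input_dim, kernel_initializer='uniform', activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adagrad', metrics=['accuracy'])
    return model

model5 = KerasClassifier(build_fn=create_model)
epochs = [50, 100]
batch_size = [1500, 2500, 3000]
param_grid = dict(epochs=epochs, batch_size=batch_size)
grid = GridSearchCV(estimator=model5, param_grid=param_grid)
grid_result = grid.fit(x_trg, y_trg)
print(grid_result.cv_results_)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))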

Explanation
We can use a grid-based approach by using the KerasClassifier() function. This function
takes a function as the value of its build_fn argument. Hence, we create a function at the start by the
name of create_model. This user-defined function creates a sequential model with three
layers like the earlier model, compiles it, and returns it. The command
model5=KerasClassifier(build_fn=create_model) finally stores a model named model5.
For doing a grid search, a list of two values of epochs is created using epochs = [50, 100] and a
list of three values of batch size is created using batch_size = [1500, 2500, 3000]. A dictionary
named param_grid is created from epochs and batch_size. The command grid =
GridSearchCV(estimator=model5, param_grid=param_grid) creates a grid for the different
values of the dictionary items and the command grid_result = grid.fit(x_trg, y_trg)
produces the result for the training dataset for different combinations of epochs and batch_size. This
further means that six models will be fitted considering: 50 epochs and 1500 batch size, 100
epochs and 1500 batch size, 50 epochs and 2500 batch size, 100 epochs and 2500 batch size, 50
epochs and 3000 batch size, and 100 epochs and 3000 batch size. The results are displayed using
the command grid_result.cv_results_. Using the grid-based approach, we can determine the
best results using best_score_ and best_params_. Thus, the next command displays the result as
Best: 0.998269 using {‘batch_size’: 3000, ‘epochs’: 50}. This means that out of six models, the
best model can be considered for 50 epochs and 3000 batch size. The accuracy of the model for
training and test dataset is evaluated considering the best parameters (50 epochs and 3000 batch
size). It has been found that this model showed an accuracy of 99.93%, which is good, so we
will use this model for predicting the test dataset, since we need to check whether the model
performs well on the test set also. The result shows that the accuracy of both the training and test
datasets is 99.83%, which is good, and the model is neither underfitting nor overfitting.
It is important to understand that deep learning is more an art than a science. We can
provide guidelines that suggest what can work or not work on a given problem, but there is no
theory that can predict in advance steps to optimally solve a problem. Since every problem is
different, we need to evaluate different strategies by iterating. We should always create a basic
model, which will help in determining whether we are making real progress. We should always
create simple models before expensive ones to justify the additional expense. In fact, in some
cases, a simple model turns out to be better option.
We can also improve the accuracy of the model by considering other optimizers and
adjusting the learning rate or using the "mae" loss. It should be noted that in a situation when we are not
overfitting the model but have a performance bottleneck, we should increase the capacity of the
network until overfitting becomes the primary obstacle. As long as we are not overfitting too
badly, we are likely under capacity, and network capacity can be increased by either increasing
the number of units in the layers or adding more layers.

Change the values of different arguments to check whether any improvement in the accuracy of the model is possible.

USE CASE
MEASURING QUALITY OF PRODUCTS FOR ACCEPTANCE OR REJECTION

Quality is a benchmark of perfection for the end-user. It covers all aspects that collectively or
individually impact the quality of the product. Quality control is a process intended to ensure
that product quality or performed service adheres to a defined set of criteria or meets the
requirements of the client. Quality assurance is a good practice in the manufacture of products,
as it is the process of vouching for integrity of products to meet the standard for the proposed
use. Through the quality control process, the product quality will be maintained, and the
manufacturing defects will be examined and refined. It is an obligation that ensures
manufacturers meet the needs of the end-user in terms of safety, quality, efficacy, strength,
reliability, and durability.
For effective quality control, we need to examine the physical product at all stages. Product
quality is the sum of organized arrangements made with the aim of ensuring that products are of
the required quality as per the intended use. To maintain quality, suitable systems, higher-level
instructions, and appropriate information should be provided to product inspectors and
other employees, enabling them with the right decision-making variables for short decision
routes. In short, these people should be provided with lists and descriptions of unacceptable
variables related to product defects such as cracks, quantity, blemishes, and color. However, this
varies according to the type of industry. For example, in the pharmaceutical industry, the content along
with the length and volume of the bottle, etc., are important predictors. In the textile industry, the warp
and weft of the fabric, content labels, color/print/design of the fabric, etc., are important variables for
measuring fabric quality. In the food industry, external factors such as appearance (size, shape, color,
gloss, and consistency), texture, and flavor; factors such as federal grade standards (e.g., of
eggs); and internal factors (chemical, physical, and microbial) seem to be important.
In quality control, a process parameter whose variability has an impact on a critical quality
attribute and, therefore, should be monitored or controlled to ensure the process produces the
desired quality is called critical process parameter. A physical, chemical, biological, or
microbiological property or characteristic that should be within an appropriate limit, range, or
distribution to ensure the desired product quality is called a critical quality attribute. Neural
networks can be used effectively for determining the critical parameters, which will lead to
effective quality control.

18.3 Recurrent Neural Network Model (3-D Tensor)
RNNs are able to hold their state in between inputs, and therefore are useful for modeling a
sequence of data, such as a time series or a collection of words in a text.
An RNN is a generalization of a feedforward neural network that has an internal memory. An RNN is
recurrent in nature as it performs the same function for every input of data, while the output for
the current input depends on the previous computation. After producing the output, it is copied
and sent back into the recurrent network. For making a decision, it considers the current input
and the output that it has learned from the previous input.
Unlike feedforward neural networks, RNNs can use their internal state (memory) to process
sequences of inputs. This makes them applicable to tasks such as unsegmented, connected
handwriting recognition or speech recognition. In other neural networks, all the inputs are
independent of each other. But in RNN, all the inputs are related to each other.

18.3.1 Basic LSTM Model


Long short-term memory (LSTM) networks are a modified version of recurrent neural networks,
which make it easier to remember past data in memory. The vanishing gradient problem of
RNNs is resolved here. LSTM is well suited to classify, process, and predict time series given
time lags of unknown duration. It trains the model by using backpropagation. For understanding
the RNN with an LSTM layer, we will use the IMDB dataset to predict whether a movie review from
IMDB is generally positive (1) or negative (0). We will start by building a very basic LSTM
model on this dataset.
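A minimal sketch of such a model is shown below; the padding length of 500 and the batch size are assumptions, while the vocabulary size, layer sizes, loss, optimizer, and number of epochs follow the explanation that follows.

from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

top_words = 50000
(x_trg, y_trg), (x_test, y_test) = imdb.load_data(num_words=top_words)

max_review_length = 500                                   # assumed common length after padding/truncation
x_trg = sequence.pad_sequences(x_trg, maxlen=max_review_length)
x_test = sequence.pad_sequences(x_test, maxlen=max_review_length)

model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_review_length))   # 32-length vectors for each word
model.add(LSTM(100))                                                  # LSTM layer with 100 memory units
model.add(Dense(1, activation='relu'))                                # basic model: relu in the output layer
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_trg, y_trg, epochs=3, batch_size=64)
print(model.evaluate(x_test, y_test))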

Explanation
It should be noted that there is a change in the newer versions of numpy related to allow_pickle.
So the user is suggested to make sure that the older numpy version 1.16.1 is installed for
loading the dataset; otherwise an error will be generated. We need to load the IMDB dataset. We
are constraining the dataset to the top 50,000 words. By default, the dataset is split in a 50-50
ratio. Thus, the training set will have 25,000 (50%) and the test set will have 25,000 (50%) reviews. Next, we
need to truncate and pad the input sequences so that they are all of the same length for modeling
and computation in Keras. The model will learn that the zero values carry no information, even though
the sequences are not of the same length in terms of content. The first layer is the embedding
layer that uses 32-length vectors to represent each word. The second layer is the LSTM layer
with 100 memory units. In the last layer, we have used relu activation; since it is a
classification problem, we use a Dense output layer with a single neuron to make binary
(0 or 1) predictions for the two classes (good and bad) in the problem. After the model was
developed, it was compiled using binary_crossentropy as the loss and the Adam optimizer.
The model is then fit for only three epochs because it quickly overfits the problem. After fitting,
we estimate the performance of the model on unseen reviews and we find that the accuracy is
55%.

18.3.2 Changing Activation Function


In the next model, we will try to change the activation function to sigmoid to determine whether
there is any increase/change in accuracy.

Explanation
We can observe that the change of the activation function to sigmoid increases the accuracy, and
the new accuracy of the model is 83.91%.

18.3.3 Adding Dropout


In the next model, we have added dropout because recurrent neural networks like LSTM
generally have the problem of overfitting. In the following example, we have used dropout
between the embedding and LSTM layers and between the LSTM and output layers.
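A hedged sketch is given below, reusing the padded IMDB data from the basic model; the dropout rate of 0.2 is an assumption.

from keras.layers import Dropout

model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_review_length))
model.add(Dropout(0.2))                      # dropout between the embedding and LSTM layers
model.add(LSTM(100))
model.add(Dropout(0.2))                      # dropout between the LSTM and output layers
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_trg, y_trg, epochs=3, batch_size=64)
print(model.evaluate(x_test, y_test))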

Explanation
We found that the addition of the dropout layers has increased the training accuracy to 84.51%. We
have observed that the test accuracy is less than the training accuracy, which means that the
model is showing some overfitting. However, the model could use a few more
epochs of training and may achieve a higher skill.

18.3.4 Adding Recurrent Dropout


We can also apply dropout to the input and recurrent connections of the memory units of the
LSTM precisely and separately. Recurrent dropout is a specific, built-in way to use dropout to
fight overfitting in recurrent layers. Keras provides this capability with parameters on the LSTM
layer: dropout for configuring the input dropout and recurrent_dropout for configuring the
recurrent dropout. We have made the new model deeper by adding more layers to it.
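A possible sketch is shown below; the dropout rates and the sizes of the extra dense layers are assumptions, since only the overall idea (recurrent dropout plus a deeper network) is described in the text.

model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_review_length))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))   # input and recurrent dropout on the LSTM
model.add(Dense(64, activation='relu'))                    # extra dense layers make the model deeper
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_trg, y_trg, epochs=3, batch_size=64)
print(model.evaluate(x_test, y_test))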

Explanation
We can observe that the accuracy of the model has decreased after adding recurrent dropout and
making the model deeper. Thus, increasing the number of dense layers did not show a positive
impact and the LSTM-specific dropout has a more pronounced effect on the convergence of the
network than the layer-wise dropout. However, we could increase the number of epochs to see
if the skill of the model can be further lifted.

18.3.5 Adding Conv1D Layer for Sequence Classification


The IMDB review data does have a one-dimensional spatial structure in the sequence of words in
reviews and the CNN may be able to pick out invariant features for good and bad sentiments.
These learned spatial features may then be learned as sequences by an LSTM layer. Since making
the model deeper did not show an increase in the accuracy, we will remove the dense layers and
add a Conv1D layer in the next model. Besides, since dropout is a powerful technique for handling
overfitting in LSTM models, we can try both methods for getting better results.
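A sketch of this combination is given below; the number of filters, the kernel size, and the pooling size are assumptions.

from keras.layers import Conv1D, MaxPooling1D

model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))                       # consolidate the learned spatial features
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_trg, y_trg, epochs=3, batch_size=64)
print(model.evaluate(x_test, y_test))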

Explanation
We have added a one-dimensional CNN and max pooling layers after the embedding layer,
which then feed the consolidated features to the LSTM. From our experience, we know that
"relu" works better with Conv1D; hence, the activation was considered as "relu." We found that the
accuracy has increased to 87.32%.
It is suggested that the user can use stacked recurrent layers or bidirectional recurrent
layers for increasing the representational power of the network (at the cost of higher
computational loads). An RNN can also be created using a GRU layer, specifically for time series data,
which is beyond the scope of this book. It is important to understand that for a 3-D
tensor, we can add 1-D convolutional layers like Conv1D and/or MaxPooling1D, and for a 4-D
tensor, we can add 2-D convolutional layers like Conv2D and/or MaxPooling2D.

Change the values of different arguments to check whether any improvement in the accuracy of the model is possible.

USE CASE
FINANCIAL MARKET ANALYSIS

Financial markets are marketplaces where entities buy or sell financial securities such as stocks,
bonds, currencies, and derivatives. Some of the famous securities market indexes of the world
are Footsie (London financial market), Dow Jones (New York financial market), Hang Seng
(Hong Kong financial market), Nikkei (Tokyo financial market), BSE Sensex (Mumbai financial
market), and Nifty (Indian national financial market). The financial market index is integrating
very fast on a global scale and traders are investing in a large number of markets across the
globe. As a result, analysis of the financial markets (based on a large number of factors both
within the market and outside it) becomes very important. Financial market analysts cannot
anticipate extraneous factors and can measure only the factors within the market.
Financial market analysis is concerned with trying to understand what has happened, what is
currently happening, and what will happen in that marketplace to better understand what
positions that entity should take. Financial market analysis deals with the performance of a
particular financial market(s). The performance of a financial market depends upon the
performance of the total number of securities that are traded in that market. On a given day
when the market closes with the prices of most of its securities on the higher side, then it could
be said to have performed well. The different types of financial analysis include fundamental
analysis, securities market analysis, securities market technical analysis, index momentum
analysis, securities momentum analysis, securities chart analysis, market analysis, and market
trend indicators. Since the integration of information technology in market analysis is
increasing, the number of factors that directly or indirectly impact the financial markets is
increasing rapidly.
A time series is a series of data points listed in time order. It is generally a sequence taken
at successive, equally spaced points in time. The most important thing in time series
data is the ordering of the time points unlike other regression algorithms. Time series analysis
comprises methods for analyzing time series data in order to extract meaningful statistics and
other characteristics of the data. Time series analysis can be applied to real-valued, continuous
data and discrete numeric data. Time series forecasting is the use of a model to predict future
values based on previously observed values.
Recurrent neural networks can be used for effective financial market analysis. We can
implement them on the problem of forecasting the future price of securities on the stock market,
currency exchange rates, etc. But it should be noted that the markets have very different
statistical characteristics than other phenomena; hence, other parameters should also be
considered.

18.4 CNN Model (4-D tensor)


CNNs are a special type of neural network used for image processing. They process
square patches of pixels in an image through filters. As a result, the model can
mathematically capture key visual cues such as textures and edges that help in discerning classes.

In situations when multiple libraries need to be installed, the names of all the
libraries are written in a text file, for example requirements.txt. We can then
install all the libraries listed in the text file using one single
command: pip install -r requirements.txt.

For explaining the use of Keras for convolutional neural networks, we have used the Fashion-MNIST
dataset available in Keras. This dataset has images with few dimensions, so it is easy to
process. Similar to the MNIST dataset, the Fashion-MNIST dataset also consists of 10 classes, but
has fashion accessories instead of handwritten digits. Fashion-MNIST dataset is 28 × 28
grayscale images of 70,000 fashion products from 10 categories of fashion accessories such as
sandals, shirt, and trousers with 7000 images per category. The training set has 60,000 images,
and the test set has 10,000 images. The training and test splits are similar to the original MNIST
dataset. Each image is a 28 × 28 array, with pixel values ranging between 0 and 255. The labels
are arrays of integers, ranging from 0 to 9, and represent the class of clothing of the image. We
will start from a basic deep learning model and build an efficiently regularized model by adding
more and more complexity to neural network model.

18.4.1 Basic Model for Image Data


For understanding the use of a convolutional neural network, we have taken the fashion_mnist
data available in Keras. We first start by creating a basic model for our dataset.
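A hedged sketch of the basic model is given below; the optimizer, number of epochs, and batch size are assumptions, while the rescaling, the 80/20 split, the reshaping, the one-hot encoding, and the two-layer architecture follow the explanation below.

import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

(x_trg, y_trg), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
print(x_trg.min(), x_trg.max())                            # 0 and 255

x_trg, x_test = x_trg / 255.0, x_test / 255.0              # normalize to the 0-1 range
x_trg, x_val, y_trg, y_val = train_test_split(x_trg, y_trg, test_size=0.2)

x_trg = x_trg.reshape(-1, 28, 28, 1)
x_val = x_val.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)
y_trg, y_val, y_test = to_categorical(y_trg), to_categorical(y_val), to_categorical(y_test)

model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(28, 28, 1)))      # input layer: 28 x 28 x 1 -> 784
model.add(tf.keras.layers.Dense(10, activation='softmax'))       # output layer with 10 classes
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_trg, y_trg, epochs=10, batch_size=256, validation_data=(x_val, y_val))
print(model.evaluate(x_test, y_test))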

Explanation
We can observe that the minimum value is 0 and the maximum value is 255. Hence, we have
normalized the data by dividing the dataset by 255. We can observe that after
normalizing the data by rescaling, the maximum value has been reduced to 1.0. We have
then split the training dataset into training and validation datasets using the value 0.2 for
the test_size argument in the train_test_split() function. This splits the training and
validation datasets into 80% and 20%, respectively. Thus, the training dataset now has 48,000
observations and the validation dataset has 12,000 observations. The reshape() function is used to
reshape the input data, while to_categorical() function is used to do one hot encoding. In our
example, the input data have been reshaped to (28, 28, 1) and one hot encoding is done on the
dependent dataset. The function tf.keras.Sequential() is used to create a blank keras model.
There are basically two layers in the model: input layer and output layer. There is no hidden
layer in the model. The first layer is a Flatten input layer, which takes the input
shape as (28, 28, 1) and transforms the format of the images from a 2-D array (of 28 by 28
pixels) to a 1-D array of 28 × 28 = 784 pixels. The second layer is the dense layer and represents
the output layer. After the pixels are flattened, the network consists of a 10-node softmax layer
and returns an array of 10 probability scores that sum to 1. Each node contains a score that
indicates the probability that the current image belongs to one of the 10 classes. It is
important to note the number of units in the output layer. For our model, the output layer has
10 units because there are 10 categories in the dataset. Before the model is ready for training,
we have compiled the model. The test accuracy is found to be 83.43%.

The most popular and de facto standard library in Python for loading and
working with image data is Pillow. Pillow is an updated version of the Python
Image Library, or PIL, and supports a range of simple and sophisticated image
manipulation functionality.

18.4.2 Creating Model with ModelCheckpoint API
There is an API named ModelCheckpoint that is used while training the model with the fit() API
and saves the model after every epoch. The important feature of this API is
that if the value of the argument save_best_only is set to "True," it will save the model only
when the validation accuracy improves or val_loss decreases; otherwise it ignores it. This API is
passed as a value to the argument named callbacks in the fit() function.
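A small sketch is shown below; the file name bestmodel.h5 and the monitored quantity are assumptions, and the model object is the one built in the previous section.

from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint('bestmodel.h5', monitor='val_loss', save_best_only=True, verbose=1)
model.fit(x_trg, y_trg, epochs=10, batch_size=256,
          validation_data=(x_val, y_val), callbacks=[checkpoint])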

Explanation
We can observe that whenever the value of val_loss decreased, the callback saves the model with the best
value; otherwise it ignores it. Thus, the statement Epoch 00001: val_loss improved from inf to
0.42877 shows that val_loss decreases from infinity to 0.428 in the first epoch and the model is saved.
We can observe from the final result that, in comparison to the earlier model, we have increased
the accuracy from 83.43% to 84.21% using the ModelCheckpoint API.

18.4.3 Creating Denser Model by Adding Hidden Layers
We will make the model deeper by adding one more hidden layer to the model with “relu”
activation function. The number of hidden units in this layer will be considered something
between the input dimension (28 × 28 × 1) and the output dimension (10). For this hidden layer,
we will consider the number of units as 64. However, the user is suggested to change the
activation function and the number of hidden units for improving the model's performance and
training speed.
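A possible sketch of this model is given below; the optimizer, number of epochs, and batch size are again assumptions, and the saved file name bestmodel3.h5 is hypothetical.

model3 = tf.keras.Sequential()
model3.add(tf.keras.layers.Flatten(input_shape=(28, 28, 1)))
model3.add(tf.keras.layers.Dense(64, activation='relu'))          # hidden layer with 64 units
model3.add(tf.keras.layers.Dense(10, activation='softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
checkpoint3 = ModelCheckpoint('bestmodel3.h5', monitor='val_loss', save_best_only=True, verbose=1)
model3.fit(x_trg, y_trg, epochs=10, batch_size=256,
           validation_data=(x_val, y_val), callbacks=[checkpoint3])

bestmodel3 = tf.keras.models.load_model('bestmodel3.h5')          # reload and evaluate the saved best model
print(bestmodel3.evaluate(x_test, y_test))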

Explanation
We can observe that the accuracy improved from 84.21% to 87.06% when a hidden layer was
used along with the ModelCheckpoint API. The last section loads the saved best model
(bestmodel3) by its name, and we can observe that the same results are generated when we
simply load it and evaluate it directly. In the next models, we will use this concept since it works
better.

18.4.4 Making Model Deeper


From the previous model, we know that the accuracy improved after adding a hidden layer to the
model. So, in the next model, we will make our model deeper by adding three extra hidden
layers. The user is suggested to explore the effects of adding more layers to the network model
and accordingly determine the training speed and test accuracy. Since the training time increases
greatly by adding more hidden layers, it is important to note that there is always a
tradeoff between the increased training time due to more hidden layers and model
performance.
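A sketch of the deeper model is given below; the unit counts of the extra hidden layers are assumptions, since only the number of additional layers is described.

model4 = tf.keras.Sequential()
model4.add(tf.keras.layers.Flatten(input_shape=(28, 28, 1)))
model4.add(tf.keras.layers.Dense(256, activation='relu'))
model4.add(tf.keras.layers.Dense(128, activation='relu'))
model4.add(tf.keras.layers.Dense(64, activation='relu'))
model4.add(tf.keras.layers.Dense(32, activation='relu'))
model4.add(tf.keras.layers.Dense(10, activation='softmax'))
model4.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model4.fit(x_trg, y_trg, epochs=10, batch_size=256, validation_data=(x_val, y_val))
print(model4.evaluate(x_test, y_test))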

Explanation

This model just made the model deeper by adding more hidden layers to the first model and did
not use the concept of the ModelCheckpoint API. In comparison to the first model, we can
observe that by adding hidden layers, the accuracy of the test dataset has increased considerably
from 83.43% to 86.66%. The increase in accuracy of the training dataset is because of the presence
of hidden layers, which were able to train the model effectively. The accuracy of the model for the
test dataset is less than that for the training dataset, and this gap between training accuracy and test
accuracy is an example of overfitting. Overfitting means that a machine learning model
performs worse on new data than on its training data.

18.4.5 Early Stopping API


From the results of model 3, we know that higher accuracy is generated if we use a checkpoint, so in
this model we will use the EarlyStopping API along with the ModelCheckpoint API and store them in
callbacks. Hence, in the new model, we will pass EarlyStopping along with ModelCheckpoint in
the callbacks list of the above dense model. In the new model, callbacks are
added because model 4 was made much deeper and the training will take a lot of time; however, the
training should not be made too long, otherwise the model will overfit.
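A short sketch is shown below; the patience value and the total number of epochs are assumptions.

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=3, verbose=1)
checkpoint5 = ModelCheckpoint('bestmodel5.h5', monitor='val_loss', save_best_only=True, verbose=1)
model4.fit(x_trg, y_trg, epochs=50, batch_size=256,
           validation_data=(x_val, y_val), callbacks=[early_stop, checkpoint5])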

Explanation
We are able to increase the accuracy of the model from 86.66% to 87.65% when the EarlyStopping
and ModelCheckpoint callbacks were used.

18.4.6 Grid-Based Approach

In this model, we have used a grid approach to do hyperparameter tuning. For determining the
best parameters, we have created a function so that it can be called again and again for different
combinations of parameters.
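A sketch of this grid search is shown below; the architecture inside create_model() is assumed to be the earlier dense model, and the cross-validation settings are left at their defaults.

from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def create_model():
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Flatten(input_shape=(28, 28, 1)))
    model.add(tf.keras.layers.Dense(64, activation='relu'))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model6 = KerasClassifier(build_fn=create_model)
param_grid = dict(epochs=[5, 10, 20], batch_size=[256])
grid = GridSearchCV(estimator=model6, param_grid=param_grid)
grid_result = grid.fit(x_trg, y_trg)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))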

Explanation
A dictionary was created using two parameters, namely epochs and batch_size. The epochs had
three values: 5, 10, and 20, whereas the batch_size had only one value, 256. The grid-based
approach was adopted for determining the best parameters corresponding to the highest
accuracy. It was observed that the best value is obtained for epochs equal to 20, and the test
accuracy of the new model increases to 89.79%.

Change the values of different arguments to check whether any improvement in the accuracy is possible.

18.4.7 Creating a CNN Model


As discussed earlier, we can add a Conv2D layer to the 4-D tensor. Hence, this section creates a
CNN model by adding a Conv2D layer to the above model. The syntax for adding a Conv2D layer is as
follows:

Syntax
Conv2D(filters=, kernel_size=, padding=, activation=)
where

• filters is an integer specifying the dimensionality of the output (the number of output filters in the convolution).
• kernel_size is a pair of integers specifying the dimensions of the 2D convolution window.
• padding is either “valid,” “causal,” or “same”.
• activation is the name of the activation function to use.

In this model, we have used MaxPool2D for reducing variance/overfitting and reducing
computational complexity, since it makes the image smaller and works better for binarized
images with noticeable edge differences. It should be noted that the optimal dropout value in the
Conv layers is 0.2 and 0.5 in the dense layers, and the number of filters, layers, and epochs for
training the model are all hyperparameters and should be chosen depending on our experience.
The user is expected to try new experiments by changing these hyperparameters and measuring
the performance of the model, which will finally result in improving the accuracy of the model.
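A hedged sketch of such a CNN is given below; the filter counts, kernel sizes, dense layer size, optimizer, epochs, and batch size are assumptions, while the use of Conv2D, MaxPool2D, and the 0.2/0.5 dropout values follows the text above.

from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, Dense, Dropout

cnn = tf.keras.Sequential()
cnn.add(Conv2D(filters=32, kernel_size=(3, 3), padding='same', activation='relu',
               input_shape=(28, 28, 1)))
cnn.add(MaxPool2D(pool_size=(2, 2)))
cnn.add(Dropout(0.2))                        # dropout in the convolutional block
cnn.add(Conv2D(filters=64, kernel_size=(3, 3), padding='same', activation='relu'))
cnn.add(MaxPool2D(pool_size=(2, 2)))
cnn.add(Dropout(0.2))
cnn.add(Flatten())
cnn.add(Dense(128, activation='relu'))
cnn.add(Dropout(0.5))                        # dropout in the dense block
cnn.add(Dense(10, activation='softmax'))
cnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
cnn.fit(x_trg, y_trg, epochs=20, batch_size=256, validation_data=(x_val, y_val))
print(cnn.evaluate(x_test, y_test))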

Explanation
The accuracy of the test dataset has increased to 90.60%. The accuracy of training dataset is
93.16%, which is more than the accuracy of the test dataset. Thus, our model is depicting
overfitting.

18.4.8 Regularization
Our network model is overfitting; this means that it has many trainable parameters and
can therefore fit almost any function if we keep training for a long period. To avoid overfitting,
we will use the concept of regularization, which here means that we need to add dropout layers in the
network model. It should be noted that there is no single best dropout rate; a good dropout rate can
differ for different datasets.

Explanation
The verbose argument is 1 while training the model; hence, we can see that the per-epoch observations
are displayed here as well (which was not the case in the earlier example).

18.4.9 Autoencoder as Classifier


This model uses an autoencoder as a classifier in Python with Keras, considering our own dataset.
We will first convert the labels into one-hot encoded vectors and split up the training and validation
images along with their respective labels. Then we will use the same encoder function that
has been used in the autoencoder architecture, followed by fully connected layers. This will help us to
load the weights of a trained model into a few layers of the new model, verify the weight matrices
of the trained and new models, make a few layers of the new model non-trainable, and finally compile
the new classification model, train the model, and save the weights. We will then re-train the model
with all layers trainable, evaluate the model, visualize the accuracy and loss plots, make
predictions on the test data, convert the probabilities into class labels, and visualize the
classification report. The task at hand is to train a convolutional autoencoder and use the encoder
part of the autoencoder combined with fully connected layers to recognize a new sample from
the test set correctly.
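A condensed sketch of this idea is given below; the encoder/decoder filter sizes, the dense layer size, and the training settings are assumptions, and the re-training, plotting, and reporting steps described above are omitted for brevity.

from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D, Flatten, Dense
from tensorflow.keras.models import Model

# Encoder
inp = Input(shape=(28, 28, 1))
e = Conv2D(32, (3, 3), activation='relu', padding='same')(inp)
e = MaxPooling2D((2, 2), padding='same')(e)
e = Conv2D(64, (3, 3), activation='relu', padding='same')(e)
encoded = MaxPooling2D((2, 2), padding='same')(e)

# Decoder (used only while training the autoencoder)
d = Conv2D(64, (3, 3), activation='relu', padding='same')(encoded)
d = UpSampling2D((2, 2))(d)
d = Conv2D(32, (3, 3), activation='relu', padding='same')(d)
d = UpSampling2D((2, 2))(d)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(d)

autoencoder = Model(inp, decoded)
autoencoder.compile(loss='mse', optimizer='adam')
autoencoder.fit(x_trg, x_trg, epochs=10, batch_size=256, validation_data=(x_val, x_val))

# Classifier that reuses the trained encoder followed by fully connected layers
f = Flatten()(encoded)
f = Dense(128, activation='relu')(f)
out = Dense(10, activation='softmax')(f)
classifier = Model(inp, out)
for layer in classifier.layers[:5]:
    layer.trainable = False                 # freeze the encoder layers initially
classifier.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
classifier.fit(x_trg, y_trg, epochs=10, batch_size=256, validation_data=(x_val, y_val))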

Explanation
We can observe that the validation loss and the training loss are not in synchronization; there is a
gap between the training and validation loss throughout the training phase, which shows that the
model is overfitting. However, the validation loss is decreasing and not increasing; therefore, we can
say that the model's generalization capability, while not excellent, is still reasonable.

18.4.10 Data Augmentation


For creating this model, we have used the technique of data augmentation, which is used to
artificially make the training set bigger. This is achieved by rotating images, zooming in a small
range and shifting images horizontally and vertically, etc. This requires a lot of caution because
if we take one image and flip it, it may become the same as another image that has a different label. For
example, with the digits 6 and 9, if we flip one of the images vertically and horizontally, it becomes
the other. This will drop the performance considerably; hence, care should be
taken regarding transformations that may affect the labeling.

It is also possible to do deep augmentation by importing DeepAugment from deepaugment.deepaugment.

The main function that is used for data augmentation is ImageDataGenerator() and the different
parameters included in the function are as follows: featurewise_center=False/True (True will
set the input mean to 0 over the dataset), samplewise_center=False/True (True will set each
sample mean to 0), featurewise_std_normalization=False/True (True will divide inputs by the
std of the dataset), samplewise_std_normalization=False/True (True will divide each input
by its std), zca_whitening=False/True (apply ZCA whitening), rotation_range=0 to 180
(randomly rotate images), zoom_range=0 to 1 (randomly zoom images),
width_shift_range=0 to 1 (randomly shift images horizontally by a fraction of the total width),
height_shift_range=0 to 1 (randomly shift images vertically by a fraction of the total height),
horizontal_flip=False/True (randomly flip images horizontally), and vertical_flip=False/True
(randomly flip images vertically).

For better augmentation of images, different functions like HorizontalFlip,
IAAPerspective, ShiftScaleRotate, CLAHE, RandomRotate90, Transpose,
ShiftScaleRotate, Blur, OpticalDistortion, GridDistortion,
HueSaturationValue, IAAAdditiveGaussianNoise, GaussNoise, MotionBlur,
MedianBlur, RandomBrightnessContrast, IAAPiecewiseAffine, IAASharpen,
IAAEmboss, Flip, OneOf, Compose, etc., can be imported from albumentations
library.

It should be noted that the fitting function changes from fit to fit_generator when there is data
augmentation. The first input argument is slightly different.
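A sketch of augmented training is shown below; the specific augmentation ranges are assumptions chosen within the limits listed above.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(featurewise_center=False, samplewise_center=False,
                             featurewise_std_normalization=False,
                             samplewise_std_normalization=False,
                             zca_whitening=False,
                             rotation_range=10, zoom_range=0.1,
                             width_shift_range=0.1, height_shift_range=0.1,
                             horizontal_flip=False, vertical_flip=False)
datagen.fit(x_trg)

# fit_generator is used instead of fit; the first argument is the generator flow
cnn.fit_generator(datagen.flow(x_trg, y_trg, batch_size=256),
                  epochs=20, validation_data=(x_val, y_val))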

Explanation
We can observe that data augmentation on our dataset is not effective, since the accuracy has
reduced to a great extent. However, it can be effective in other deep learning problems.

Change the values of different arguments to check whether any improvement in the accuracy of the model is possible.

Different pretrained models for text and image data are built using deep learning.
These trained models can be directly used on the data by the user.
The Rasa library for development of chatbots uses deep learning models for text
data.

USE CASE
FACIAL RECOGNITION IN SECURITY SYSTEMS

Security cameras are a key component in security systems and they are one of the best ways to
identify the right person at home or in an organization. Face is considered as the most important
part of human body and it depicts people’s identity, so it can be used as a key for security
solutions at many places including home and organizations. Face recognition system is getting
popularity across the world for providing extremely safe and reliable security technology. It is
gaining significant importance and attention by thousands of corporate and government
organizations only because of its high level of security and reliability. It is typically used for
access control in security and is often compared to other biometric devices such as fingerprint scanners
or retina scanner systems. Moreover, this system will provide vast benefits when compared to
other biometric security solutions such as palm print and finger print. Hence, for providing
security at home or organization, facial recognition can be considered as a viable option for
authentication as well as identification. This technology is adopted in high-quality cameras in
mobile devices. Facial recognition systems are also described as biometric artificial intelligence-based
applications that can uniquely identify a person by analyzing patterns based on the person's facial
textures and shape.
Image classification refers to the task of extracting information from an image. It helps to
map an individual’s facial features and stores the data as images. Images with respect to each
individual of an organization or at home can be gathered in training sample. The system
captures biometric measurements of a person from a specific distance without interacting with
the person. To classify an image, we can use deep learning algorithms to compare a live image
to the stored images in order to verify an individual’s identity.
Thus, a facial recognition system can be considered as a technology capable of identifying or
verifying a person from a digital image, and it works by comparing selected facial features from
a given set of images stored in a dataset. However, difficulties may arise due to large
variations in head rotation and tilt, angle, facial expression, aging, etc. In particular, the
correlation is very low between two pictures of the same person with two different head
rotations. To solve these problems, a large dataset can be created consisting of many images of
same person that can be considered for training.

Summary
• Different types of neural network models can be built in Python using keras, which include
MLP, CNN, and RNN.
• A vector data is 2-D tensor, having two dimensions as sample and features. A time series
data or sequence data is 3-D tensor having three dimensions representing sample, time, and
features. An image is 4-D tensor representing sample, height, width, and channels, and a
video is 5-D tensor representing samples, frames, height, width, and channels.
• Steps for building an NN model include manipulating data (tensors) in Python, building the
basic sequential model and adding layers, compiling the model, predicting the model, and
saving models for future use.
• To create a neural network, we basically add layers of neurons together and finally create a
multilayer model of a neural network. It should be noted that a model should primarily have
two layers: input and output.
• Before training, all weights are initialized with random values. These weights contain the
information learned by the network from exposure to training data. Weight is automatically
calculated based on the input and the output shapes.
• Many deep learning frameworks have emerged over the time frame. Among the available
frameworks, Keras and TensorFlow are the most popular ones.
• Steps for building a neural network model include data preparation (data exploration,
normalization, splitting the data, and data pre-processing); building the basic sequential
model and adding layers; compiling the model; fitting the model on training dataset; and
predicting the model and creating better model by improving accuracy.
• An empty sequential model is constructed using the Sequential() function from keras.models.
• The core building block of neural networks is the layer, which processes the data.
• A dense neural network layer is created using the Dense() function; a recurrent neural network
layer is created using the GRU() or SimpleRNN() function; 1-D and 2-D convolutional neural
network layers are created using the Conv1D() and Conv2D() functions, respectively.
• The important arguments in creating a layer include input_shape, units, and activation. The
argument input_shape represents how many elements an array or tensor has in each
dimension and is the most important thing to be defined, which is based on input data.
Units/filters is a property of each layer and it is related to the output shape. The value of
activation argument represents the operation with the elements. The different types of
activation include "relu," "sigmoid," "linear," and "hard-sigmoid."
• Compiling a model is done using the compile method, and the most important arguments are
the loss function, optimizer, and metrics. The loss is the quantity we need to minimize during
training. The different types of loss include “mean square error (mse),” “mean absolute error
(mae),” “categorical_crossentropy,” and “binary_crossentropy”. An optimizer is the
mechanism through which the model will update itself based on loss function.
• A developed model is fitted on the training dataset using the function fit(), which has
two important arguments, namely epochs and batch_size. The number of iterations (epochs)
is specified in the epochs argument and the number of samples per update is specified in batch_size.
• The developed model is evaluated on the test dataset using the function evaluate() and the
difference between predicted and original values of y is calculated.
• The accuracy of the model on test data is compared with the accuracy on training dataset. If
the accuracy of training dataset is higher than test dataset, then it is an overfitting model and
if the accuracy of training dataset is less than test dataset, then it is an underfitting model.

Multiple-Choice Questions

1. This is not a valid activation in keras.


(a) relu
(b) sigmoid
(c) linear
(d) adam

2. This is not an important argument in creating a layer in keras.
(a) input_shape
(b) Units
(c) Length
(d) Activation
3. This is not a valid optimizer in keras.
(a) SGD
(b) adagrad
(c) adam
(d) prop
4. The core building block of neural networks is the ______________ that processes the data.
(a) rmsprop
(b) adam
(c) optimizer
(d) layer
5. Which one of the following function is not used for creating a layer.
(a) layer_dense
(b) layer_rnn
(c) layer_conv_2d
(d) layer_gru
6. An empty sequential keras model is constructed using the function ______________.
(a) sequential()
(b) empty()
(c) new()
(d) new_seq()
7. ______________ is a property of each layer and it is related to the output shape.
(a) Value
(b) Units
(c) Shape
(d) Number
8. The model is then predicted on test dataset using the ______________ function.
(a) fit ()
(b) empty()
(c) evaluate ()
(d) new_seq()
9. The popular deep learning frameworks are
(a) Keras
(b) Tensorflow
(c) Both (a) and (b)
(d) Neither (a) nor (b)
10. Deep learning is improving ______________ by gradually adjusting weights depending on
feedback signal.

(a) Dimension
(b) Accuracy
(c) Shape
(d) All of the above

Review Questions

1. What are the steps of basic neural network model?


2. Explain the importance of regularization in neural network model.
3. Explain the utility of grid based approach in the model development.
4. Discuss the importance of an optimizer.
5. When do we use conv2D layer?
6. Discuss the utility of activation argument.
7. What are the different types of loss included in keras?
8. How does the input_data argument vary with the type of data?
9. Compare the different types of optimizer used in compile function.
10. Draw and explain the diagram to understand the functioning of different layer and neurons.

CHAPTER
19

Transfer Learning for Text Data

Learning Objectives
After reading this chapter, you will be able to

• Understand the available trained models for text data.


• Attain competence in creating own user-defined trained model.
• Apply the expertise of trained models for machine learning algorithms.
• Implement the use of trained models for question answering.

Transfer learning means the application of skills, knowledge, and/or attitudes that were learned
in one situation to another learning situation. A trained model works on the concept of transfer
learning. In other words, a pretrained model is a model created by another person to solve a similar
problem. Instead of building a model from scratch, the model already
trained on another problem by another person is considered as a starting point for the new model. This
could be better understood by the following analogy: A teacher has a huge expertise in the
subject he/she teaches. It is beneficial for the student to get all the information through a lecture
from the teacher. In this scenario, “transfer” of information from the learned to novice is being
done. Keeping in mind this analogy, let us compare this to a neural network. A neural network is
trained on a particular data. This network gains knowledge from this data, which is compiled as
“weights” of the network. These weights can be extracted and stored in the form of a trained
model (teacher). Depending on the user requirement (novice), these weights are then transferred
to any other neural network. Hence, instead of training the other neural network from scratch
(novice learning), the learned features are transferred from the trained model (teacher). A pretrained
model may not be 100% accurate in all applications, but it saves the huge effort
required to reinvent the wheel.
There are many other algorithms available in the transformers library related to text data.
This chapter focuses on use of Bert, GPT2, Roberta, XLM, and DistilBert trained algorithms for
text data. However, other models, such as FlauBERT, XLM-RoBERTa, Bart, CamemBert,
ALBERT, T5, CTRL, XLNET, and GPT, are also available and can be used according to the
requirement.

19.1 Text Similarity Techniques


We know that similarity techniques are available in sklearn.metrics.pairwise library and
different techniques include cosine_similarity, euclidean_distances, and
manhattan_distances. All these techniques can also be used along with the different pretrained
models for displaying text similar to the given text. For understanding document similarity,
the tmdb_5000_movies.csv.gz dataset from GitHub was downloaded. Before applying text
similarity techniques, we will read and understand the data for performing effective analysis.

Explanation
The details of the dataset show that there are 3959 rows and 6 columns, namely, title, tagline,
overview, budget, popularity, and information. The data show that top five high-budget movies
include Pirates of the Caribbean: On Stranger Tides at index 17, Pirates of the Caribbean: At
World’s End at 1, Avengers: Age of Ultron at 7, John Carter at 4, and Tangled at 6. The top five
popular movies include Minions at index 546, Interstellar at 95, Deadpool at 788, Guardians of
the Galaxy at 94, and Mad Max: Fury Road at 127.

For doing effective modeling of text data, it is important to do feature extraction along with the
text processing on the data. This is primarily done using the concept of n-grams. Before applying
feature extraction on the data, it is important to understand the concept of n-gram. The
TextBlob() function from textblob library helps to execute the function of n-grams.
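A tiny sketch of the TextBlob-based n-grams is shown below, using the same sample sentence that appears in the explanation.

from textblob import TextBlob

text = "Display your text or numeric variable here"
print(TextBlob(text).ngrams(2))    # bi-grams: pairs of consecutive words
print(TextBlob(text).ngrams(3))    # tri-grams
print(TextBlob(text).ngrams(4))    # four-grams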

Explanation

The given text is “Display your text or numeric variable here.” We can observe from the output
that bi-gram makes a pair of two words occurring together in a sentence, tri-gram makes a
combination of three words together, and four-gram makes a combination of four words
together. Hence, the first wordlist using bi-gram is “Display your”, using tri-gram is “Display
your text” and four-gram is “Display your text or.” This helps a lot in doing effective
feature extraction.

19.1.1 Without Pretrained Model


After understanding the concept of n-grams, we will now perform feature extraction on the data.
In Python, feature extraction can also be done primarily using the concept of Count Vectorizer
and TF-IDF Vectorizer. Count Vectorizer converts a collection of text documents to a matrix of
the counts of occurrences of each word in the document. TF-IDF means term frequency-inverse
document frequency; this means that the weight assigned to each token not only depends on its
frequency in a document but also on how recurrent that term is in the entire corpus. TF-IDF
assigns lower weights to common words and gives importance to rare words in a particular
document.
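A hedged sketch of the feature extraction step is given below; the column name 'information' follows the later sections of this chapter, and the text is assumed to have already been preprocessed.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = movie_data['information']                  # preprocessed movie descriptions

cv = CountVectorizer(ngram_range=(1, 2))
cv_matrix = cv.fit_transform(docs)
print(cv_matrix.shape)                            # (3959, 17820)

tv = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix = tv.fit_transform(docs)
print(tfidf_matrix.shape)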

Explanation
This helps to do text processing on the 3959 rows of the data by converting the text to lower case,
removing stop words and special characters, etc. The CountVectorizer and TfidfVectorizer
transform the data using the n-gram range specified inside the argument ngram_range. The value of
ngram_range is (1, 2). The dimension after doing feature extraction using the Count Vectorizer and
the TF-IDF Vectorizer is found to be (3959, 17820).

19.1.1.1 Cosine Similarity


This section determines similarity of the documents on the basis of cosine similarity. Cosine
similarity is basically a numeric score to denote the similarity between two text documents. The
cosine_similarity() function is available in sklearn.metrics.pairwise library.
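A sketch of the cosine-similarity lookup is given below; the way the movie index is located (np.where) is an assumption, while the commands for computing and sorting the similarities follow the explanation below.

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(tfidf_matrix)
cosine_df = pd.DataFrame(cosine_sim)

movies = movie_data['title'].values
movie_index = np.where(movies == 'Deadpool')[0][0]        # index of the reference movie
similarities = cosine_df.iloc[movie_index].values
similar_movie_index = np.argsort(-similarities)[0:8]      # highest similarity first
print(movies[similar_movie_index])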

Explanation
We have determined the similar document/content from the required data by matching cosine
similarity with all the contents. The cosine similarity is computed for the data obtained after
using the tfidf vectorizer and stored in cosine_sim. The first five records of cosine_sim show
that value of cosine similarity ranges from 0 to 1. This example displays those movies that are
similar to movie “Deadpool.” The command movies = movie_data['title'].values stores
the name of all the titles in movies. The next command stores the index of the “Deadpool”
movie in movie_index. It shows that the index of the “Deadpool” movie is 765. The next
command similarities = cosine_df.iloc[movie_index].values stores the cosine
similarities with respect to the index of the movie Deadpool. The command np.argsort(-
similarities)[0:8] then sorts the similarities and stores the top eight indexes of movies in
similar_movie_index. It should be noted that the value of cosine similarity is between 0 and 1.
Higher the value of cosine similarity, higher is the similarity between the documents. Thus, two
movies having a cosine similarity of 1 are exactly the same movies. A cosine similarity of 0

means that there is no similarity between the movies. Hence, the results are sorted in
descending order for determining similar movies using a negative sign (-) with the argsort()
function. Thus, movies having indexes 765, 2429, 1066, 462, 3879, 235, 3117, and 1931 have
better similarity to Deadpool movie. The last command then prints the names of the movies
corresponding to the indexes stored in similar_movie_index. Thus, the movies similar to
Deadpool are as follows: Silent Trigger, Underworld: Evolution, Mars Attacks, etc.

19.1.1.2 Euclidean Distance


This section determines movies similar to “Newlyweds” on the basis of the concept of Euclidean
distance.

Explanation
It should be noted that the lower the value of the Euclidean distance, the higher the similarity between
the documents. Thus, two movies having a Euclidean distance of 0 are exactly similar movies.
Here, unlike cosine similarity where the results are sorted in descending order for determining
similar movies, the results are displayed in ascending order (without a negative sign). We can
observe from the results that the index of the movie “Newlyweds” was 3957. The index of the
first similar movie is 3957 (the movie itself) because this is obvious that the highest similarity
of the movie will be with the movie itself. The names of the other similar movies include “Our
Family Wedding” and “Just Married”.

19.1.1.3 Manhattan Distance

In this section similar movies are determined on the basis of the Manhattan distance. The
function manhattan_distances is available in the sklearn.metrics.pairwise library.
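A short sketch covering both distance measures is shown below; note that here the results are sorted in ascending order (no negative sign), and the reference movie is only an example.

from sklearn.metrics.pairwise import euclidean_distances, manhattan_distances

euclid_df = pd.DataFrame(euclidean_distances(tfidf_matrix))
manhattan_df = pd.DataFrame(manhattan_distances(tfidf_matrix))

movie_index = np.where(movies == 'The Matrix Revolutions')[0][0]
distances = manhattan_df.iloc[movie_index].values
similar_movie_index = np.argsort(distances)[0:8]          # lowest distance first
print(movies[similar_movie_index])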

Explanation
It should be noted that the lower the value of the Manhattan distance, the higher the similarity between
the documents. Thus, two movies having a Manhattan distance of 0 are exactly similar movies.
In this case, unlike cosine similarity (where the results are sorted in descending order for
determining similar movies) and like Euclidean distance, the results are displayed in ascending
order (without a negative sign). We can observe from the results that the index of the movie
“The Matrix Revolutions” is 119. The indexes of the similar movies are at 3867, 3957, etc. The
movie at 3867 index is “Amidst the Devil’s Wings” and so on.

19.1.2 Bert Algorithm


We know that the use of the trained models helps to do a better analysis on text data. It should be
noted that the pretrained algorithms related to text data are available in transformers library.
Bert algorithm has defined different architectures for different models such as bert-base-uncased
(12-layer, 768-hidden, 12-heads, 110M parameters), bert-large-uncased (24-layer, 1024-hidden,
16-heads, 340M parameters), bert-base-cased (12-layer, 768-hidden, 12-heads, 110M
parameters), bert-large-cased (24-layer, 1024-hidden, 16-heads, 340M parameters), and bert-
base-multilingual-cased (12-layer, 768-hidden, 12-heads, 110M parameters). In this section, we
have used “bert-base-uncased” architecture with the Bert Tokenizer for performing text
similarity. We have created a general function for using all the tokenizers defined in the trained
models.
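A sketch of the helper function and its use with the Bert tokenizer is shown below; it follows the commands quoted in the explanations that follow, with the padding length of 500 taken from the text.

import transformers
from keras.preprocessing import sequence
from sklearn.metrics.pairwise import cosine_similarity

def func_tokenizer(tokenizer, documents):
    features = []
    for doc in documents:
        tokens = tokenizer.tokenize(doc)                   # split the document into tokens
        ids = tokenizer.convert_tokens_to_ids(tokens)      # map the tokens to vocabulary ids
        features.append(ids)
    return features

bert_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')
bert_features = func_tokenizer(bert_tokenizer, movie_data['information'])
bert_trg = sequence.pad_sequences(bert_features, maxlen=500)   # pad/truncate to a common length
bert_cosine = cosine_similarity(bert_trg)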

Explanation
Here, we have created a function named func_tokenizer() which will return the features
based on the tokenizer name and the document which is given as an input to the function. A
“for” loop is used because there can be multiple documents. Each and every document is
converted in tokens using tokenize() function from the respective tokenizer entered as an
input to the function. The tokens are then converted to ids using the convert_tokens_to_ids()
function from the respective tokenizer. All the ids generated from each document are then
appended to the features list and the function then returns all the features from all the
documents.

Explanation
Bert Tokenizer is initialized by using the command
transformers.BertTokenizer.from_pretrained('bert-base-uncased'). The function
func_tokenizer created in the previous section takes two inputs: name of the tokenizer and the
document that needs to be considered for determining similarity. The command
bert_features= func_tokenizer(bert_tokenizer,movie_data['information']) stores all
the features generated using the bert_tokenizer in bert_features. Since the features
might be of different lengths, the value of the argument maxlen is considered as 500
inside the function sequence.pad_sequences(), and the padded result is stored in bert_trg. The functions
cosine_similarity(bert_trg), euclidean_distances(bert_trg), and
manhattan_distances(bert_trg) determine the similar movies on the basis of cosine
similarity, Euclidean distances, and Manhattan distances, respectively.
Using the cosine similarity, similar movies for Money Train movie that was at index 599
were found to be “Escape from Planet Earth”, “Mr. Nice Guy”, “The Love Letter”, “The
Jacket” “Harry Brown”, “Rain Man”, and “Gosford Park”. Similarly, using Euclidean distance
technique for finding similar movies of the Love Letter movie, they were found to be “P.S. I
Love You”, “The Jacket”, “Harry Brown”, “Drive”, “Blackhat”, “50/50”, and “Money Train”.
The Manhattan distance technique for determining similar movies for Magic Mike resulted in
movies like "I Love You, Don't Touch Me!", "My Big Fat Greek Wedding", "Diary of a
Wimpy Kid: Dog Days", "Glitter", "Journey to the Center of the Earth", "Room", and
"Tammy". Thus, we can observe that the results are highly satisfactory.

19.1.3 GPT2 Algorithm


GPT2 algorithm has defined different architectures for different models such as gpt2 (12-layer,
768-hidden, 12-heads, 117M parameters), gpt2-medium (24-layer, 1024-hidden, 16-heads, 345M
parameters), gpt2-large (36-layer, 1280-hidden, 20-heads, 774M parameters), and gpt2-xl (48-
layer, 1600-hidden, 25-heads, 1558M parameters). In this section, we will use “gpt2” model.
This section uses GPT2 tokenizer considering gpt2 architecture to recommend similar movies to
the given movie.
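A brief sketch reusing the same helper function with the GPT2 tokenizer is shown below.

gpt2_tokenizer = transformers.GPT2Tokenizer.from_pretrained('gpt2')
gpt2_features = func_tokenizer(gpt2_tokenizer, movie_data['information'])
gpt2_trg = sequence.pad_sequences(gpt2_features, maxlen=500)
gpt2_cosine = cosine_similarity(gpt2_trg)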

Explanation
It can be observed from the results that the GPT2 algorithm, like the other pretrained algorithms,
also shows satisfactory results for displaying similar movies.

19.1.4 Roberta Algorithm
Roberta algorithm has defined different architectures for different models such as roberta-base
(12-layer, 768-hidden, 12-heads, 125M parameters), roberta-large (24-layer, 1024-hidden, 16-
heads, 355M parameters), and roberta-large-mnli (24-layer, 1024-hidden, 16-heads, 355M
parameters). In this section, we have used “roberta-base” model. This section uses Roberta
Tokenizer considering “roberta-base” architecture to determine similar movies.

Explanation
It can be observed from the results that, like the Bert and GPT2 tokenizers used in the previous
sections for determining similar movies, the Roberta algorithm also shows satisfactory results for
displaying similar movies.

19.1.5 XLM Algorithm


XLM algorithm has defined different architectures for different models such as xlm-mlm-en-
2048 (12-layer, 2048-hidden, 16-heads), xlm-mlm-ende-1024 (6-layer, 1024-hidden, 8-heads),
xlm-mlm-enfr-1024 (6-layer, 1024-hidden, 8-heads), xlm-mlm-enro-1024 (6-layer, 1024-hidden,
8-heads), and xlm-mlm-xnli15-1024 (12-layer, 1024-hidden, 8-heads). In this section, we have
used the xlm-mlm-en-2048 model. This section uses the XLM tokenizer considering the "xlm-mlm-en-2048"
architecture to tokenize the data.

Explanation
Similar to the implementation of Bert tokenizer to determine the features, XLM tokenizer was
also used to determine the features from the documents. Cosine similarity of The Prince movie
at index 2219 displayed the result of similar movies at indexes 507, 590, 947, 2928, 1788, 3172,
and 3129. The names of the movies at these indexes are displayed in the next section. Thus, the
movie Grown Ups is at index 507, “Blackhat” is at 590, “The Life of David Gale” is at 947,
“When Did You Last See Your Father?” is at 2928, “The Ruins” is at 1788, “Spring Breakers”
is at 3172, and “The Ten” is at 3129. Similarly, the indexes of movies similar to “Blackhat” are
2928, 894, 742, 947, 2219, 3288, and 3829. Indexes of movies similar to Glitter are 2728, 3027,
3309, 2792, 3446, 2083, and 2744. Thus, we can infer that XLM tokenizer was able to
determine the similar movies to a good extent.

19.1.6 DistilBert Algorithm


DistilBert algorithm has defined different architectures for different models such as distilbert-
base-uncased (6-layer, 768-hidden, 12-heads, 66M parameters), distilbert-base-uncased-distilled-
squad (6-layer, 768-hidden, 12-heads, 66M parameters), distilbert-base-cased (6-layer, 768-
hidden, 12-heads, 65M parameters), and distilbert-base-cased-distilled-squad (6-layer, 768-
hidden, 12-heads, 65M parameters). This section uses the "distilbert-base-uncased" architecture for
the DistilBertTokenizer to recommend similar movies to the given movie.

Explanation
It can be observed from the results that the DistilBert algorithm, like the other pretrained algorithms,
also shows good results for displaying similar movies.

Use FlauBERT, Bart, CamemBert, and ALBERT algorithms to determine similar movies from the dataset discussed in this section.

USE CASE
SERVICE-BASED RECOMMENDATION SYSTEM

Recommendation systems are one of the popular and most adopted applications of machine
learning in service industry. In online medium, people post their tweets or send their reviews
related to different services such as bus travel, ticket booking, hospitality services, food services,
tourism services, personalized home services, and salon services. When the user types a
particular text to reach his/her desired service, the recommendation system basically tries to understand the features that govern the customer’s choice and to determine the similarity
between two services. On the basis of scores corresponding to similarities, services related to
destination, food delivery, commuting, etc., are recommended. A recommendation system can be
implemented by either measuring popularity, ratings, recommendations, etc., or on the basis of
information such as name of service provider, review of the service, and quality of the service.
Recommendation system can be used by online commuting platforms such as Uber and OLA
to recommend the choice of vehicle and desired location; online food delivery partners such as
Zomato, Uber Eats, and Swiggy to understand the choices of consumer related to types of food,
delivery time, and location; travel partners such as Red Bus to understand the choices of
customer related to destination name and mode of travel such as bus, train, or flight. The
tourism industry can understand the choice of the customer based on the text typed by the user
for their preferences and recommend destinations according to the budget, leisure quality,
reviews, etc. The telecom and insurance industries can suggest plans based on the customer
budget and requirements. The entertainment industry can recommend movies, shows depending
on the reviews of the movies, cast of the movie, place of the show, etc. The personalized home
services and salon services can be suggested on the basis of the budget and type of the service
required. In short, a recommendation system will be an asset to nearly all the service-based
industries to understand the behavior of the customer and frame strategies accordingly for
higher growth and meeting company’s objective.

Google Text-to-Speech (gTTS) is a Python library and CLI tool to interface with Google Translate’s text-to-speech API. The function gTTS() is required to convert text to speech and can be accessed using the command “from gtts import gTTS”. Execute the command “file = gTTS(text = user_text, lang = name of language)” to convert the given text into the desired language; save the file using the command “file.save(filename in mp3 format)”.
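A minimal usage sketch of gTTS based on the commands above:

from gtts import gTTS

user_text = "Recommendation systems help users discover relevant services."
file = gTTS(text=user_text, lang='en')       # 'en' is the language code for English
file.save('recommendation.mp3')              # saves the speech as an mp3 file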

19.2 Unsupervised Machine Learning


Cluster analysis can be done on text data after performing the feature extraction on data using
vectorizer. This section performs cluster analysis on the data discussed in Section 19.1 and forms
clusters of different movies together on the basis of their information stored in tfidf matrix
while performing feature extraction. For understanding the utility of machine learning algorithms on a labeled text dataset, we will also consider the women’s clothing reviews dataset available at https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews; it is used for the cluster analysis in this section and for the supervised models in Section 19.3.
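The following is a sketch of reading the dataset and creating the training and test splits described in the explanation below; the CSV file name is an assumption, while the column names follow the Kaggle dataset.

import pandas as pd

womendata = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')    # assumed file name
reviews = womendata['Review Text'].astype(str)
sentiments = womendata['Recommended IND']

train_reviews, train_sentiments = reviews[:14000], sentiments[:14000]
test_reviews, test_sentiments = reviews[14000:18000], sentiments[14000:18000]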

Explanation
In this scenario, documents indicate the reviews and classes indicate the review sentiments that
can either be positive or negative, making it a binary classification problem. The reviews and sentiments were both divided into training and test datasets: the training dataset contains the first 14,000 records and the test dataset contains the 4000 records from 14,001 to 18,000.

19.2.1 Without Pretrained Model
If pretrained models are not used, it is important to do feature engineering and text preprocessing. Text preprocessing involves different operations such as stemming, removal of stop words, converting into lower case, removing unwanted characters, etc.
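The following is an illustrative sketch (not the book's exact listing) of the preprocessing functions and the vectorization described in the explanation below.

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def func_text_process(text):
    text = re.sub(r'[^a-zA-Z ]', ' ', text.lower())     # lower case and remove special characters
    words = [stemmer.stem(w) for w in text.split() if w not in stop_words]
    return ' '.join(words)

train_clean = train_reviews.apply(func_text_process)
test_clean = test_reviews.apply(func_text_process)

count_vect = CountVectorizer(ngram_range=(1, 2))
train_features = count_vect.fit_transform(train_clean)
test_features = count_vect.transform(test_clean)

tfidf_vect = TfidfVectorizer(ngram_range=(1, 2))
train_tfidf = tfidf_vect.fit_transform(train_clean)
test_tfidf = tfidf_vect.transform(test_clean)
print(train_features.shape, test_features.shape)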

Explanation
Here we have created three basic functions for stemming, stop words removal, and removal of special characters. The main function is named func_text_prcoess, which calls
all the above functions along with other tasks. This function is then implemented for training
and test datasets of reviews. Feature extraction was done using Count Vectorizer and TF-IDF
Vectorizer and hence CountVectorizer and TfidfVectorizer were imported to the program.
We can observe from the output that the usage of vectorizer increased the dimension of train
features and test features to (14,000, 17,044) and (4000, 17,044), respectively. This is because
we have used range of n-gram from 1 to 2, hence all the different combinations of two words
from review data are formed.

Explanation
It is observed that the optimum number of clusters could not be determined from the elbow
method because of the straight line. This further means that we will not be able to perform
cluster analysis effectively.
Since the optimum number of clusters could not be determined and the desired results could
not be produced, hence in the subsequent section, we used pretrained model to perform cluster
analysis.

19.2.2 Bert Algorithm


In this section, Bert tokenizer is used for processing of text data and then cluster analysis is done
on the new processed data.
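The following is a sketch (not the book's exact listing) of Bert-tokenizing the reviews and then running the elbow method and a two-cluster KMeans, as described in the explanations below; using the matrix of token ids as the feature matrix is an assumption.

import numpy as np
import matplotlib.pyplot as plt
from transformers import BertTokenizer
from sklearn.cluster import KMeans

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_train_features = np.array([
    bert_tokenizer.encode(text, max_length=100, truncation=True, padding='max_length')
    for text in train_reviews])

# elbow (scree) plot to choose the number of clusters
inertia = []
for k in range(1, 11):
    inertia.append(KMeans(n_clusters=k, random_state=42).fit(bert_train_features).inertia_)
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

kmeans = KMeans(n_clusters=2, random_state=42).fit(bert_train_features)
print(np.bincount(kmeans.labels_))           # sizes of the two clusters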

Explanation
We can observe from the scree plot that the optimum number of clusters is 2. Hence,
clustering should be done considering two clusters.

Explanation
We can observe from the results that 7758 reviews from the training dataset belong to the first
cluster and 6242 reviews belong to the second cluster.

19.2.3 GPT2 Algorithm


In this section, we will use GPT2 tokenizer for processing of text data and then perform cluster
analysis on the new data.

Explanation
We can observe that the optimum number of clusters is 2, which is equal to the types of
recommendation. This further means that GPT2 algorithm gave good results. From the results,
it is clear that 8048 reviews from the training dataset belong to the first cluster and 5952
reviews belong to the second cluster.

19.2.4 Roberta Algorithm
In this section, we use Roberta tokenizer for processing of text data and then perform cluster
analysis on the new processed data.

Explanation
We can observe that the optimum number of clusters is 2, which is equal to the types of
recommendation. This further means that Roberta algorithm gave good results. From the
results, it is clear that 9130 reviews from the training dataset belong to the first cluster and 4870
reviews belong to the second cluster.

19.2.5 XLM Algorithm


In this section, we will use XLM tokenizer for processing of text data and then perform cluster
analysis on the new processed data. The result generated when cluster analysis is done on the
data is as follows:

Explanation
We can observe that the optimum number of clusters is 2, which is equal to the types of
recommendation. This further means that XLM algorithm gave good results. From the results, it
is clear that 7936 reviews from the training dataset belong to the first cluster and 6064 reviews
belong to the second cluster.

19.2.6 DistilBert Algorithm


In this section, we will use DistilBert tokenizer considering distilbert-base-uncased architecture
for processing of text data and then perform cluster analysis on the new processed data.

Explanation
We can observe that the optimum number of clusters is 2, which is equal to the types of
recommendation. This further means that DistilBert algorithm gave good results. From the
results, it is clear that 7758 reviews from the training dataset belong to the first cluster and 6242
reviews belong to the second cluster.
Thus, we can observe that with the use of trained models, the optimum number of clusters
could be identified clearly and it matches the classes in the original dataset. Hence, the result of
cluster analysis was more efficient using pretrained models for text data.

Use T5, CTRL, XLNET, and GPT model to perform cluster analysis for the
dataset discussed in this section.

Speech-to-text conversion is possible. Import the library named speech_recognition. Create a recognizer object by executing the Recognizer() function from this library and call the recognize_google() function of the created object to transcribe the audio. Use the Microphone() function from the library to read from the microphone as the source.
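A minimal sketch of the speech-to-text steps described above:

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:              # or sr.AudioFile('recording.wav') for a saved file
    audio = recognizer.listen(source)
print(recognizer.recognize_google(audio))    # transcribe the captured audio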

USE CASE
GROUPING PRODUCTS IN E-COMMERCE

Identifying identical products is also important to construct the final item page. Products can be
described in terms of their features such as brand, color, and size. In order to make it easier for
sellers to onboard their items, most product features are not mandatory for sellers to provide. As
a result, we find that different sellers may provide different features in their product feed. By
utilizing different sources of information for the same product, we can increase the coverage of product specifications on the item page. On the other hand, we need to be precise in the creation
of groups and ensure they contain identical products sold by different sellers. Once a group is
formed, it is represented by exactly one item page with a single title, single description, and a
single price on the item page. Although customers can select other sellers and see their prices on
the seller choice page, the individual seller’s product content is not shown on
this page. If the groups are not homogeneous, an individual seller’s product might be different
from what is portrayed on the item page, which is unacceptable.
The preprocessing of text is done using character filtering, tokenization, stemming and
lemmatization, spelling correction, word segmentation, and query understanding using Named–
Entity Recognition (NER) (reductionist understanding). It is important to mention here that the
word representation can be done using GloVe (global word vectors), with a Conditional Random Field (CRF) for sequence tagging. For making search more customer friendly and interactive, the given
search query can be tagged with entities such as brand, price, size, and quantity. It is also
possible to use contextualized embedding using ELMo followed by Bi-LSTM and sigmoid layer
at the end.
There are many factors to group online stores’ products: product titles; identifying key
attributes such as brand, condition, color, and model number from available data for each
product and measuring the discrepancies in the attribute values; price outlier identification by
identifying whether an incoming offer price is an outlier in this price distribution; trending in top
category/activity/brand, popular with customer segments, bestsellers/top sellers, top reviewed,
top rated, 5-star products, new releases/just added, top picks, on sale, price drop, new lower
price, only X left, back in stock, exclusive top wished, most popular, customer favorites,
seasonal/holiday, new [color/ size/fit], daily deals, recommended by [x], most searched for, top
searches, etc.

19.3 Supervised Machine Learning


Different classification algorithms such as Logistic Regression, Random Forest, Gradient
Boosting, and Bagging are discussed in this section on the same dataset of clothing reviews as
considered for cluster analysis (Section 19.2). The classification models for the training dataset
are created and accuracy is determined on the test dataset for all the models.

19.3.1 Without Pretrained Algorithm


In this section, different supervised machine learning models are developed on the dataset
without using a pretrained model. Different preprocessing techniques are used, such as removal of stop words and case conversion, along with vectorization using Count Vectorizer.
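The following is a sketch of the four classifiers evaluated in the explanation below; default hyperparameters are an assumption, and train_features/test_features and train_sentiments/test_sentiments are the vectorized reviews and their labels from the earlier sketch.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.metrics import accuracy_score, classification_report

models = {'Logistic Regression': LogisticRegression(max_iter=1000),
          'Random Forest': RandomForestClassifier(),
          'Gradient Boosting': GradientBoostingClassifier(),
          'Bagging': BaggingClassifier()}

for name, clf in models.items():
    clf.fit(train_features, train_sentiments)
    predictions = clf.predict(test_features)
    print(name, accuracy_score(test_sentiments, predictions))
    print(classification_report(test_sentiments, predictions))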

Explanation
The accuracy of the logistic regression model is 0.7710 and precision of positive and negative
sentiments is 0.822 and 0.18, respectively. The accuracy of the random forest model is 0.8213,
and precision of positive and negative sentiments is 0.822 and 0.0, respectively. The accuracy
of the gradient boosting model is 0.8220, and precision of positive and negative sentiments is 0.822 and 0.33, respectively. The accuracy of the bagging model is 0.7710, and precision of
positive and negative sentiments is 0.822 and 0.18, respectively.

19.3.2 BERT Algorithm


In this section, we have used the processed data created using Bert algorithm in the unsupervised
machine learning algorithm Section 19.2.2. Different classification models are created using the
training data and accuracy is determined on the test data.

Explanation
We can observe that using the Bert algorithm, the accuracy of the logistic regression has
increased from 77.10% (without a pretrained model) to 82.13%.

19.3.3 GPT2 Algorithm


In this section, we have used the processed data for clothing reviews created using GPT2
algorithm in the unsupervised machine learning algorithm Section 19.2.3. Different classification models are created using the training data and accuracy is determined on the test data.

Explanation
It can be observed that the accuracy of the model is best for random forest model and least for
the bagging model, when GPT2 algorithm was considered for developing the model.

19.3.4 Roberta Algorithm


In this section, we have used the processed data for clothing reviews created using Roberta
algorithm in the unsupervised machine learning algorithm Section 19.2.4. Different classification
models are created using the training data and accuracy is determined on the test data.

Explanation
When Roberta algorithm is considered, it can be observed that the accuracy of the model is best
for random forest model and least for the bagging model.

19.3.5 XLM Algorithm


In this section, we have used the processed data for clothing reviews created using XLM
algorithm in the unsupervised machine learning algorithm Section 19.2.5. Different classification
models are created using the training data and accuracy is determined on the test data.

Explanation
It can be observed that the accuracy of the model is best for random forest model and gradient
boosting model, whereas it is the least for the bagging model.

19.3.6 DistilBert Algorithm


In this section, we have used the processed data for clothing reviews created using DistilBert
algorithm in the unsupervised machine learning algorithm Section 19.2.6. Different classification
models are created using the training data and accuracy is determined on the test data.

Explanation
Like the results of the other pretrained models, the accuracy determined using the DistilBert algorithm is also the highest for the random forest model and the lowest for the bagging model.

Try to improve the accuracy of the random forest model by tuning the
hyperparameters and considering Bert algorithm.

USE CASE
SPAM PROTECTION AND FILTERING

E-mail is a powerful tool for the exchange of ideas and information in personal and professional
lives. The delivery process of e-mails is very simple and straightforward. The originating mail
server delivers e-mail to the destination mail server via SMTP, with both servers having an IP
address. However, with the benefits of e-mail, there has been a dramatic growth in problems
arising from e-mail, which is seen in the form of spam. The increasing volume of spam has
become a costly and serious risk for the business and educational environment and hence cannot be ignored.
Spam causes loss of employee productivity, draining morale, decreases bandwidth, clogs
mailboxes, and costs companies a lot of money. It also increases the frequency, severity, and cost
of virus attacks and related threats, such as the damage to an employer’s reputation from
inadvertently sending spam or viruses. Hence, companies must take measures in order to block
spam from entering their e-mail systems. Although it might not be possible to block out all spam,
just blocking a large proportion of it will greatly reduce its harmful effects.
Spam characteristics appear in two parts of a message: e-mail headers and message content.
In order to effectively filter out spam and junk mails, text mining algorithms should be used to
differentiate spam from genuine messages. Using these algorithms will help to analyze a huge
volume of messages daily. Text mining can be done using words like free, amazing, great news,
limited offer, click here, act now, risk free, earn money, get rich, and exclamation marks and
capitals in the text. However, other different attributes in e-mail can also be observed to
accurately detect text, image, and attachment-based spam or phishing e-mails. Message delivery
will be based on the results of a score that is associated with each message after doing text
mining. These methods attempt to identify spam through suspicious word patterns or word
frequency.
Spam filter will automatically assign new, previously unseen e-mails to class spam or
nonspam. Hence, it can be used to detect unsolicited and unwanted e-mails and prevent those
messages from getting to a user’s inbox. Spam can later be blocked by checking for these words
in the e-mail body and subject. Thus, text mining in e-mail spam helps businesses by providing
them the security needed to ensure that unwanted e-mails do not reach user inboxes. The use of
text mining algorithms will hence protect companies and their employees by scanning e-mail and
eliminating threats and other junk mail before reaching the end user.

19.4 User-Defined Trained Deep Learning Model

It is also possible for the user to create and save his/her own model for text data. In such a
scenario, the user is creating his/her own pretrained model that can be used later for executing on
any data. The model is developed on the training dataset and is saved using the command
model.save_weights('name of the model') and can be loaded later for further usage with model.load_weights('name of the model'). In this section also, the clothing reviews dataset considered for the unsupervised and supervised machine learning techniques (Sections 19.2 and 19.3) is used for building the trained model by the user. For easy training of the deep learning model, only 1500 records of the dataset are used for training; the validation dataset contains 500 records and the test dataset contains 1000 records.

If the configuration of the CPU is not sufficient to train on huge data, the training of the model can be done on Colab and the weights can be saved. Download the trained model (weights) from Colab and test it on your own machine by considering test data, which is relatively smaller.
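The following is a sketch of the dataset preparation described in the explanation below; the CSV file name is an assumption.

import pandas as pd

womendata = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')        # assumed file name
newwomendata = womendata[['Review Text', 'Recommended IND']].dropna()
newwomendata['Review Text'] = newwomendata['Review Text'].astype(str).str.lower()
reviews = newwomendata['Review Text'].values
sentiments = newwomendata['Recommended IND'].values

train_reviews, train_labels = reviews[:1500], sentiments[:1500]          # training records
val_reviews, val_labels = reviews[1500:2000], sentiments[1500:2000]      # 500 validation records
test_reviews, test_labels = reviews[2000:3000], sentiments[2000:3000]    # 1000 test records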

Explanation
A new dataset named newwomendata is created, which consists of only two columns, “Review
Text” and “Recommended IND.” After removing all the missing values, the review text is
converted to string data type and then all the reviews are converted to lower case. The column
named “Review Text” is the independent variable and is stored in reviews, and “Recommended
IND” is the dependent variable and depicts the sentiment of the user. For creating the model,
1500 records are taken for training dataset, validation dataset contains 500 and test dataset
contains 1000 records.

It is important to have version 2.0 of tensorflow installed before applying the models discussed in this section. This can be installed using !pip install tensorflow==2.0.

19.4.1 Bert Algorithm
In this section, a user-defined deep learning model considering Bert algorithm is created
considering the training dataset and the model is saved for future purpose. The saved model is
then loaded and tested on the test dataset for determining the accuracy of the model.

19.4.1.1 Creating User-Defined Model Using Bert Algorithm


The model is created using a neural network architecture on the data created according to the
Bert algorithm. Before creating the model, it is important to convert the training dataset into the
structure required by the Bert model. This is basically done using the Bert tokenizer. It is better
to create a function for extracting the features from the dataset according to the Bert algorithm
using the Bert tokenizer.
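The following is a sketch of the bert_features helper described in the explanation below.

import numpy as np

def bert_features(tokenizer, docs, max_seq_length):
    features = [[], [], []]                             # nested lists for ids, masks, and segments
    for doc in docs:
        tokens = tokenizer.tokenize(doc)[:max_seq_length - 2]
        tokens = ['[CLS]'] + tokens + ['[SEP]']
        ids = tokenizer.convert_tokens_to_ids(tokens)
        masks = [1] * len(ids)
        padding = [0] * (max_seq_length - len(ids))     # pad up to the maximum sequence length
        ids, masks = ids + padding, masks + padding
        segments = [0] * max_seq_length
        features[0].append(ids)
        features[1].append(masks)
        features[2].append(segments)
    return np.array(features)                           # array form is required for training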

Explanation
It is known that the Bert model requires three features related to id, mask, and segment. This
program develops the function named bert_features that helps us in converting the document
to ids, masks, and segment as per the requirement of the Bert model. The function takes as input the tokenizer to be considered for creating tokens, the document for which tokens need to be
created, and length of the maximum sequence. An empty nested list structure named features
containing three lists for ids, masks, and segment is created. Since there are many documents,
hence a “for” loop is taken for getting the values of ids, masks, and segments for each
document. Using the specified tokenizer, the document is converted to tokens and ids are
generated from the tokens. The masks and segment are generated after determining the ids. For
each document, ids, masks, and segments are added to the features list. It is important to
convert the list in array form before training the deep learning model. Hence, the features list is
then converted to array form using the np.array() and the array of features is returned from the
function.
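A sketch of converting the three datasets with the helper above; the use of BertTokenizer from the transformers library follows the text.

from transformers import BertTokenizer

MAX_SEQ_LENGTH = 100
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

train_ids, train_masks, train_segments = bert_features(bert_tokenizer, train_reviews, MAX_SEQ_LENGTH)
val_ids, val_masks, val_segments = bert_features(bert_tokenizer, val_reviews, MAX_SEQ_LENGTH)
test_ids, test_masks, test_segments = bert_features(bert_tokenizer, test_reviews, MAX_SEQ_LENGTH)
print(train_ids.shape, val_ids.shape, test_ids.shape)    # (1500, 100) (500, 100) (1000, 100)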

Explanation
This program calls the function bert_features created in the previous program and converts
the training, test, and validation datasets according to the requirements of the Bert algorithm. It can be observed from the output that since we define MAX_SEQ_LENGTH as a constant having the value 100, the training, validation, and test datasets have the dimensions (1500, 100), (500, 100), and (1000, 100), respectively.
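The following is a sketch of a possible Keras architecture on top of TFBertModel, following the description in the explanation below; the exact number and sizes of the dense and dropout layers are assumptions.

import tensorflow as tf
import transformers

bert_id = tf.keras.layers.Input(shape=(MAX_SEQ_LENGTH,), dtype=tf.int32)
bert_mask = tf.keras.layers.Input(shape=(MAX_SEQ_LENGTH,), dtype=tf.int32)
bert_segment = tf.keras.layers.Input(shape=(MAX_SEQ_LENGTH,), dtype=tf.int32)
bert_input = [bert_id, bert_mask, bert_segment]

# basic layer built from the pretrained Bert model, as stated in the text
bert_output = transformers.TFBertModel.from_pretrained('bert-base-uncased')(bert_input)[0]
x = tf.keras.layers.Dense(128, activation='relu')(bert_output[:, 0, :])   # [CLS] representation
x = tf.keras.layers.Dropout(0.2)(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs=bert_input, outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()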

Explanation
Since our training, validation, and test datasets had a dimension of maximum length as 100,
therefore the input given to the model should also have the same length. Hence, an input named
“bert_input” is created, which has bert_id, bert_mask, and bert_segment, all these have the
MAX_LENGTH as 100. The command transformers.TFBertModel.from_pretrained('bert-base-uncased')(bert_input) builds
the basic layer of the model using BertModel and taking bert-base-uncased architecture and
input that comprises three features: id, masks, and segment (since Bert algorithm requires input
consisting of three features). Alternate dense and dropout layers with different activation
functions are added to the model. The architecture of the model is then displayed using the
summary command.

We will fit the model on the training dataset and determine the accuracy of the model.
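A sketch of the fitting step; the number of epochs and the batch size are assumptions.

model.fit([train_ids, train_masks, train_segments], train_labels, epochs=3, batch_size=16)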

Explanation
In code, we have fit the model on the training dataset and observed that the accuracy of the
model is 80.47%. Since we have already trained the model, it is better to save the model by a
particular name and then load the model later when it is required on the other datasets.

Explanation
The model is saved by the name bertnewmodel.h5 using the function save_weights(). This
function will save weights on the basis of the training data provided to it. This model can then
be used on other datasets directly like the existing other trained models discussed previously in
the chapter. The next code then predicts the model on test dataset and determines the accuracy.

Explanation
It can be observed from the results that the accuracy of the dataset is 81.3%. From the
confusion matrix, we can observe that all the reviews belong to only one sentiment. The recall
value of 0 for a sentiment shows that none of the review belongs to this sentiment. Similarly,
the recall value of 1 shows that all the reviews belong to this sentiment. Since the results are not
good, steps should be taken to improve the accuracy of the model. As discussed in Chapter 18
for deep learning models, it is always suggested to use a validation dataset for effective
modeling and keeping a check on the callback value. Hence, in this code, we will use callback
feature considering the validation dataset.
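The following is a sketch of the early-stopping setup described in the explanation below; the monitored quantity, patience, epochs, and batch size are assumptions.

from tensorflow.keras import callbacks

check_pointer = callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
model.fit([train_ids, train_masks, train_segments], train_labels,
          validation_data=([val_ids, val_masks, val_segments], val_labels),
          epochs=10, batch_size=16, callbacks=[check_pointer])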

Explanation
The concept of early stopping was used considering the function callbacks.EarlyStopping()
from keras library and stored in check_pointer. The model was then fitted considering the
training and validation data along with the callback value and found that the accuracy of the
validation dataset was very low (15.8%). When the model was used to predict the test dataset, it
was found that the accuracy decreased to a great extent (18.7%). The results of the classification report and confusion matrix clearly show that all the reviews were predicted as one sentiment only, like
the results when the callback feature was not used. This means that for this dataset, predicted
values are showing a particular sentiment only. However, it was important to discuss this
concept, because it may give rise to better results in some other datasets.
The next section predicts the model on new test data by considering the saved trained model
by the user.

19.4.1.2 Using Pretrained Model on Test Dataset


It should be noted that we have saved the weights and not the complete model, hence we need to create and compile the model again. Because the weights are available, there is no need to fit the model on the training dataset: the weights can be loaded and used directly to evaluate on the test dataset, which saves a lot of training time. Since we have already tested the model on the test records from 2000 to 3000, we will consider a new test dataset for predicting the values. These 1000 records are taken from record numbers 3001 to 4000.
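The following is a sketch of reusing the saved weights on the new test records, continuing the earlier sketches; no fit() call is needed because the weights are loaded directly.

new_test_reviews, new_test_labels = reviews[3000:4000], sentiments[3000:4000]
new_ids, new_masks, new_segments = bert_features(bert_tokenizer, new_test_reviews, MAX_SEQ_LENGTH)

model.load_weights('bertnewmodel.h5')        # load the previously saved weights
loss, accuracy = model.evaluate([new_ids, new_masks, new_segments], new_test_labels)
print(accuracy)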

Explanation
It was found that the accuracy of the model remains the same considering another dataset also.
It is important to note here that a model is created and compiled but it is not required to train the
model on training dataset by using fit() function, because we have already saved the weights.
The new model can hence be used directly to evaluate on the new test dataset by loading
weights.

19.4.2 GPT2 Algorithm
This section uses GPT2 algorithm for creating, saving a user-defined deep learning model, and
using the saved model for validating on the test dataset.

19.4.2.1 Creating User-Defined Model Using GPT2 Algorithm


The GPT2 algorithm requires only one feature for processing of input data. Hence, the tokens are
directly converted to ids and so there is no need to create masks and segments for the input data.
The features list is also a simple list and not a nested list because only one feature is required.

Explanation
The summary of the model clearly shows that the model is essentially a combination of dense and dropout layers considering different configurations. The model is saved by the name
“gpt2newmodel.h5”, which can be used later for evaluating the test dataset.

19.4.2.2 Using Pretrained Model on Test Dataset


In this section, a new model considering GPT2 algorithm is created. There is no need to train the
model because the weights of the model are directly saved. Hence, we need to evaluate the
model directly on the test dataset.

Explanation
From the results we can observe that the accuracy of the model is found to be 81.3% and all the
reviews show only one sentiment.

19.4.3 Roberta Algorithm


This section uses Roberta algorithm for creating, saving a user-defined deep learning model, and
using the saved model for validating on the test dataset.

19.4.3.1 Creating User-Defined Model Using Roberta Algorithm


This section uses Roberta algorithm for building a deep learning model. It should be noted that
this algorithm requires only two features, hence ids and masks are considered for the different
dataset.

Explanation
It can be observed that the Roberta model requires two features as input, hence only ids and
masks are considered. The summary of the model clearly explains the architecture of the model,
which is primarily a combination of dense and dropout layers with different inputs and outputs.
The model is saved by the name “robertanewmodel.h5”.

19.4.3.2 Using Pretrained Model on Test Dataset


In this section, the weights in the saved model are used to evaluate the test dataset. The model is first loaded by using the function load_weights('robertanewmodel.h5').

Explanation
It can be observed that the accuracy of the model using Roberta algorithm also shows the same
accuracy as that by the Bert algorithm.

19.4.4 XLM Algorithm


This section uses XLM algorithm for creating, saving a user-defined deep learning model, and
using the saved model for validating on the test dataset.

19.4.4.1 Creating User-Defined Model Using XLM Algorithm


Like Bert algorithm, this algorithm requires three features as input and hence the function for
extracting features is created with three lists; the results in the form of array are returned
accordingly.

Explanation
It can be observed that similar to Bert algorithm, the XLM algorithm also requires three
features as input, hence ids, masks, and segments are considered for providing input to the XLM
model. The summary of the model clearly explains the architecture of the model. The model is
saved by the name “xlmnewmodel.h5”.

19.4.4.2 Using Pretrained Model on Test Dataset
In this section, the weights that are saved in the file named “xlmnewmodel.h5” are loaded and the
model is evaluated on the test dataset.

Explanation
It can be observed that the accuracy of the model using XLM algorithm also shows the same
accuracy as the other models.

19.4.5 DistilBert Algorithm


This section uses DistilBert algorithm for creating, saving a user-defined deep learning model,
and using the saved model for validating on the test dataset.

19.4.5.1 Creating User-Defined Model Using DistilBert Algorithm


This algorithm requires only two features as input and hence the function for extracting features
is created with two lists; the result in the form of array is returned accordingly.

Explanation
The summary of the model clearly explains the architecture of the model. The weights are
saved in the file named “d_bertnewmodel.h5”. Hence, when a new model is created and
compiled, the weights can be loaded and the model can be directly used to predict the test
dataset without fitting on the training data.

19.4.5.2 Using Pretrained Model on Test Dataset


In this section, the weights that are saved in the file named “d_bertnewmodel.h5” are loaded and the model is evaluated on the test dataset.

Explanation
It can be observed that the accuracy of the new DistilBert model is the same as that of the other models. All
these models have shown an accuracy of 81.3%.
Although the number of layers and the nature of layers were changed in each model, there
was no change in accuracy. The highest accuracy achieved was 81.3%. For this dataset, all the
algorithms show the same accuracy, but the results could be different for other datasets for
different algorithms. Also, different results can be obtained with the change in the dataset.

Try to improve the accuracy of the model by creating different architecture


considering Bert and DistilBert algorithms.

USE CASE
IMAGE CAPTIONING

Image captioning is the task of generating a descriptive and appropriate sentence of a given
image. Hence, it requires semantic understanding of images and the ability of generating
description sentences with proper and correct structure. It is basically the process of generating
textual description of an image. It uses both natural language processing and computer vision to
generate the captions and hence the model relies on two main components, a convolutional
neural network (CNN) and a recurrent neural network (RNN). Two tasks are involved in this
process: understanding the content of the image and converting the understanding of the image
into words in proper order. It is related to merging of the two components to combine their most
powerful attributes to find patterns in images, and then use that information to generate a description of those images. CNNs produce the best results at preserving spatial information in images, and RNNs work well with any kind of sequential data, such as generating a
sequence of words.

Image captioning has various applications such as getting live captions from
CCTV/surveillance cameras, describing videos, recommendations in editing applications, usage
in virtual assistants, image indexing, aid to visually impaired persons, social media, and several
other natural language processing applications. It can also be used in web development by
providing a description for any image that appears on the page so that an image can be read or
heard as opposed to just seen. This makes web content accessible.
The entities in the image can be predicted by using different algorithms but a proper
sequence of words is a major concern. However, the encoder–decoder architecture has received
a lot of popularity in solving sequence-to-sequence problems in NLP domain and they are used
in fields of language translation, chatbots, text summarization, etc. The encoder processes each
word in the input sequence and the compiled information is put into context vector; context
vector is passed to the decoder. The decoder also maintains a hidden state that it passes from
one timestep to the next. CNNs can very efficiently encode the abstraction of images and
generate a robust representation. Different things can be inferred from the image using CNN
such as location (road, mountain, sea, etc.), color (red, blue, green, etc.), activity (playing,
dancing, etc.), and the object (man, woman, animal, etc.). The decoder has standard LSTM unit
same as that in sequence-to-sequence architectures. The CNN can be thought of as an encoder.
The input image is given to CNN to extract the features. The CNN compares the target image to
a large dataset of training images, and then generates an accurate description using the trained
captions. The last hidden state of the CNN is connected to the decoder. The decoder is an RNN
that does language modeling up to the word level. The caption associated with the image is converted into a list of tokenized words. This tokenization turns any string into a list of integers.
First, all of the training captions are iterated and then a dictionary is created that maps all
unique words to a numerical index. So, every word will have a corresponding integer value that
can be found in this dictionary. The words in this dictionary are referred to as our vocabulary.
This list of tokens is then converted into a list of integers that come from our dictionary that
maps each distinct word in the vocabulary to an integer value. RNN then is used to predict the
most likely next word in a sentence by recollecting spatial information from the input feature
vector.

19.5 Question Answers Model


For explanation of the question answers model, we have downloaded bryant-stories.txt from github.com. This section is executed considering tensorflow==1.15. It should be noted that
Roberta and GPT2 are not included for question answering in transformers, hence we are using
only the Bert and DistilBert algorithms.
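One way to reproduce the paragraph, sentence, and word counts mentioned in the explanation below is through NLTK's copy of the Gutenberg corpus, which also contains bryant-stories.txt; this is a sketch, not necessarily the book's exact listing.

import nltk
from nltk.corpus import gutenberg

nltk.download('gutenberg')
nltk.download('punkt')
paragraphs = gutenberg.paras('bryant-stories.txt')
sentences = gutenberg.sents('bryant-stories.txt')
words = gutenberg.words('bryant-stories.txt')
print(len(paragraphs), len(sentences), len(words))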

Explanation
Gutenberg library has some predefined functions such as paras(), sents(), and words(), which
are able to determine paragraphs, sentences, and words from the given document. When these
functions were used on the document of bryant-stories.txt, we can observe that the number
of paragraphs, sentences, and words were found to be 1194, 2863, and 55,563, respectively.
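The following is a sketch of the vectorization step described in the explanation below; the question string is only illustrative because the book's exact question is not reproduced here.

from sklearn.feature_extraction.text import TfidfVectorizer

paragraph_list = [' '.join(' '.join(sentence) for sentence in para) for para in paragraphs]
question = "Where did the mouse live?"       # illustrative question, not the book's

vectorizer = TfidfVectorizer()
paragraph_vectors = vectorizer.fit_transform(paragraph_list)
question_vector = vectorizer.transform([question])
print(paragraph_vectors.shape, question_vector.shape)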

Explanation
It is important to vectorize the documents before doing text analysis. The Count Vectorizer or
TF-IDF vectorizer can be used on the list effectively, hence it is important to convert the
document into list format. It is known that sents() function helps us to determine the sentences
in the document and paras() helps to determine the paragraphs. In the first division, a list of
paragraphs is created using these functions; in the second division the lists and the question to
be asked are vectorized; and the last division displays the shape of the document and the
question.

In the following code, cosine similarity technique is used to get five best possible answers; the
following code will try to determine the best answer out of these five possible answers. This is
done because the entire document contains many paragraphs and it becomes difficult to use the
algorithm for predicting the best answer from all the paragraphs.
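A sketch of shortlisting the five most similar paragraphs with cosine similarity, as described in the explanation below.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

scores = cosine_similarity(question_vector, paragraph_vectors)[0]
top_index = np.argsort(-scores)[:5]
top_paragraphs = [paragraph_list[i] for i in top_index]
print(top_index)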

Explanation
We can observe from the results that the paragraphs having the best cosine similarity exist with
index 4, 5, 14, 12, and 20. This means that all these paragraphs may contain the answer for our
question. In the next code, we will use different trained algorithms for displaying the best
answer to the question by evaluating these best possible answers.

19.5.1 Bert Algorithm from Transformers


In this section, we have used BertForQuestionAnswering and BertTokenizer from
transformers library to determine the best answer from the five possible answers stored in
top_paragraphs. It should be specified that these possible answers were determined using
cosine similarity techniques because it is difficult to do processing for each and every document.
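The following is a sketch of the scoring loop described in the explanation below, assuming the 'bert-large-uncased-whole-word-masking-finetuned-squad' checkpoint and PyTorch tensors; very long paragraphs may need truncation to 512 tokens.

import torch
from transformers import BertForQuestionAnswering, BertTokenizer

qa_model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
qa_tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

start_list, end_list, answers = [], [], []
for paragraph in top_paragraphs:
    text = "[CLS] " + question + " [SEP] " + paragraph + " [SEP]"
    tokens = qa_tokenizer.tokenize(text)
    ids = qa_tokenizer.convert_tokens_to_ids(tokens)
    sep_index = ids.index(102)                                   # 102 is the id of [SEP]
    segments = [0] * (sep_index + 1) + [1] * (len(ids) - sep_index - 1)
    with torch.no_grad():
        outputs = qa_model(torch.tensor([ids]), token_type_ids=torch.tensor([segments]))
    start_scores, end_scores = outputs[0], outputs[1]
    start, end = int(torch.argmax(start_scores)), int(torch.argmax(end_scores))
    start_list.append(float(torch.max(start_scores)))
    end_list.append(float(torch.max(end_scores)))
    answers.append(' '.join(tokens[start:end + 1]))

best = end_list.index(max(end_list))                             # paragraph with the highest score
print(answers[best])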

Explanation
In the previous code, five best paragraphs were determined using the cosine similarity and this
code tries to determine the best answer from the available five options. For each question and
paragraph, a score is determined to understand the association of both of them. Two empty lists
named start_list and end_list are created to store the starting and ending scores of each
and every paragraph. A “for” loop is executed since we have five best possible answers stored
in top_paragraphs. The command text="[CLS] "+ question + " [SEP] "+ paragraph + "
[SEP]" joins the paragraph and question; both are separated using a separator and stored in the
variable named “text”. All the tokens are determined from the text string and converted to ids.
The id of separator is 102, hence a structure is used to separate the tokens. The maximum
starting and ending scores of each and every paragraph are stored in the start_list and
end_list, respectively. It should be noted that higher the score, more is the probability of the
correct answer. All the ids are again converted back to the tokens which are joined again to
form the answer. The next division displays the score of both the lists for each and every
paragraph. The highest score of both the lists is also generated and displayed. The last division
executes the command end_list.index(max(end_list)) for determining the index of the
highest element in the end_list. This index is then printed and the paragraph corresponding to
this index is displayed since it is considered to be the best answer.

Use “bert-large-cased-whole-word-masking-finetuned-squad” model for creating a Bert tokenizer. Determine the answer and interpret the results.

19.5.2 DistilBert Algorithm from Transformers
This section uses DistilBertForQuestionAnswering and DistilBertTokenizer from the transformers library for determining the best answer from the possible answers.

Explanation
We can observe from the results of starting and ending score that the maximum starting and
ending score is found for the second paragraph, hence the best answer is the statement in the second paragraph. The last division determines the index of the element whose score is the
highest and then finally prints the answer to the question.

19.5.3 Bert Algorithm from PyTorch


This section uses BertForQuestionAnswering and BertTokenizer from PyTorch library to
determine the best answer for the given question.

Explanation
We can observe from the results of starting and ending score that the maximum starting and
ending score is found for the second paragraph, hence the best answer is the statement in the
second paragraph. The last division determines the index of the element whose score is the
highest and then finally prints the answer to the question.

Use bert-base-uncased using the command BertForQuestionAnswering.from_pretrained('bert-base-uncased') and interpret the answers from the results.

19.5.4 Bert Algorithm from Deeppavlov


Deeppavlov library also provides algorithms for determining the answers for the given question.
This section builds a model considering configs.squad.squad_bert algorithm.
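A sketch of the deeppavlov usage (assuming deeppavlov is installed and the config files are downloaded); the model returns the answers, their positions, and their scores for each shortlisted paragraph.

from deeppavlov import build_model, configs

qa = build_model(configs.squad.squad_bert, download=True)
result = qa(top_paragraphs, [question] * len(top_paragraphs))
print(result)        # [answers], [positions], [scores]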

Explanation

We can observe that the score (181.8485) is highest for second item in the list. Hence, the best
answer is at index 1.
In the next code, we will display the details of the list at index 1.

Gensim is a good library for text analysis. An existing gensim installation can be upgraded using the command pip install --upgrade gensim.

Explanation
It can be observed from the results that the best answer is “a little dark house under the
ground.” This answer is at position 60 with a highest score.

19.5.5 Ru_bert Algorithm from Deeppavlov


This section builds a model considering configs.squad.squad_ru_bert algorithm from
deeppavlov library.

Explanation
We can observe that the score (181.8485) is highest for second item in the list. Hence, we will
display the details of the list at index 1 in the following code

Explanation
It can be observed from the results that the best answer is “in a little dark house under the
ground.” This answer is at position 57 with a highest score of 39,649.566.

19.5.6 Ru_rubert Algorithm from Deeppavlov


In this section, ru_rubert algorithm available in deeppavlov library is used for determining
answers for the given question.

Explanation
We can observe that the score (132.937) is highest for second item in the list. Hence, the next
code displays the details of the list at index 1 for fetching correct answer for the given question.

Explanation
It can be observed from the results that the best answer is “down in a little dark house under the
ground.” This answer is at position 52 with a highest score of 132.93.
Although the number of words in the correct answer differ from the model to model, all the
pretrained models help in correct determination of the answer. Thus, it can be inferred that the
use of trained models helps us to determine the correct answer from the complete document.

USE CASE
CHATBOTS

A question-answer model is primarily meant for fulfilling two tasks: understanding the question
of the user and giving the correct response to the user. Using different pretrained algorithms for
question answer models, the answers can be provided to the user depending on the highest score
for each document. Hence, this question answer model is appropriate to build chatbots. It is
known that chatbots are a form of human–computer dialog system that operates through natural
language via text or speech; they are autonomous and can operate anytime (day or night) and
can handle repetitive, boring tasks. The information will be provided to the user depending on
the concept of NER.
However, there are different platforms existing for building chatbot: Rasa is an open-source
and production-ready software for chatbot development and used in large companies
everywhere; Google DialogFlow (API.ai) is a completely closed-source product with APIs and
web interface, has easy to understand voice and text-based conversational interface; Facebook
Wit.ai; IBM Watson Assistant has support for searching for an answer from the knowledge base;
Microsoft LUIS provides easy and understandable web interface to create and publish bots;
Amazon Lex has voice and text-based conversational interface and provides a web interface to
create and launch bots.

Summary
• A pretrained model is a model created by other person to solve a similar problem. Instead of
building a model from scratch to solve a similar problem, the model already trained on other
problem is considered as a starting point for new model.
• There are many other algorithms available in the transformers library related to text data.
Some of the algorithms include Bert, GPT2, Roberta, XLM, and DistilBert.
• Feature extraction is primarily done using the concept of n-grams. The TextBlob() function
from textblob library helps to execute the function of n-grams.
• In Python, feature extraction can also be done primarily using the concept of Count
Vectorizer and TF-IDF Vectorizer.
• Count Vectorizer converts a collection of text documents to a matrix of the counts of
occurrences of each word in the document.
• TF-IDF means term frequency-inverse document frequency; this means that the weight
assigned to each token not only depends on its frequency in a document but also on how
recurrent that term is in the entire corpora. TF-IDF assigns lower weights to these common
words and gives importance to rare words in a particular document.
• Bert algorithm has defined different architectures for different models such as bert-base-
uncased, bert-large-uncased, bert-base-cased, bert-large-cased, and bert-base-multilingual-
cased.
• GPT2 algorithm has defined different architectures for different models such as gpt2, gpt2-
medium, gpt2-large, and gpt2-xl.
• Roberta algorithm has defined different architectures for different models such as roberta-
base, roberta-large, and roberta-large-mnli.

• XLM algorithm has defined different architectures for different models such as xlm-mlm-en-
2048, xlm-mlm-ende-1024, xlm-mlm-enfr-1024, xlm-mlm-enro-1024, and xlm-mlm-xnli15-
1024.
• DistilBert algorithm has defined different architectures for different models such as distilbert-
base-uncased, distilbert-base-uncased-distilled-squad, distilbert-base-cased, and distilbert-
base-cased-distilled-squad.
• For effective results, it is possible to apply unsupervised and supervised machine learning
techniques on the text dataset using pretrained models.
• It is possible to create and save the model created by the user. The model is developed on the
training dataset and is saved using the command model.save_weights('name of the
model') and can be loaded later for further usage by calling model.load_weights('name of the model').
• It is important to note here that a model is created and compiled but it is not required to train
the model on training dataset by using fit() function if we have already saved the weights;
hence, we can directly evaluate on the new test dataset.

Multiple-Choice Questions

1. Feature extraction is generally done using


(a) TFIDF Vectorizer
(b) Count Vectorizer
(c) Both (a) and (b)
(d) Neither (a) nor (b)
2. The algorithm generally used for determining answers from the question is
(a) TR5
(b) CRT
(c) Bert
(d) GPT
3. This is not a valid trained text algorithm:
(a) XML
(b) DistilBert
(c) Bert
(d) GPT
4. This pretrained algorithm for question answers is not available in transformer library:
(a) Roberta
(b) Bert
(c) DistilBert
(d) None of these
5. The ____________ function is used to create tokens from the document.
(a) create_tokens()
(b) create_token()

(c) createtokens()
(d) None of these
6. The tokens are converted to ids using the function
(a) convert_tokens_to_ids()
(b) tokens_to_ids()
(c) converttokenstoids()
(d) tokenstoids()
7. Libraries which include Bert algorithm for providing answers to the question include
(a) transformers
(b) PyTorch
(c) deeppavlov
(d) All of these
8. In deeppavlov library, the algorithms available in config.squad for question answers include
(a) Bert
(b) ru_rubert
(c) ru_ruberta
(d) All of these
9. The function ngrams() for displaying bi-gram, tri-gram, etc., is available in ___________
library.
(a) textblob
(b) n-Grams
(c) NGrams
(d) None of these
10. A trained text model can be used to directly evaluate the test dataset by ________ the saved
weights.
(a) Deleting
(b) Loading
(c) Modifying
(d) All of these

Review Questions

1. Perform cluster analysis for creating clusters of movies considering the movie dataset
discussed in Section 19.1.
2. How does a pretrained tokenizer help in reducing the steps for preparing data before
training?
3. Explain the architecture of different models available for Roberta algorithm.
4. Is there any difference between the requirement of the input features between Bert and
DistilBert algorithms used in developing deep learning models?
5. Are the trained models effective for the question answer? Explain the process.
6. How do we determine the best answer for the question from the results generated using pretrained algorithms for question answer model?
7. Considering any sentence, differentiate between the result of using bi-gram, tri-gram, and
four-gram.
8. Why do we identify some best paragraphs before using the trained model for question
answering?
9. Explain the architecture of different models available for Bert algorithm.
10. Discuss some libraries and the name of the functions that provide facility for providing
question answer model.

CHAPTER
20

Transfer Learning for Image Data

Learning Objectives
After reading this chapter, you will be able to

• Understand the importance of trained models for determining similar images and image
recognition.
• Apply the knowledge of unsupervised machine learning algorithms using trained
algorithms to form clusters of similar images.
• Evaluate the supervised machine learning algorithms using trained algorithms.
• Create user-defined trained model and access it for feature extraction.

Image analysis has now become the heart of powerful machine learning algorithms. The existing pretrained models that can be used for transfer learning have simplified the task of image analysis and have helped to increase the accuracy of the analysis to a large extent. Image
recognition helps to identify the image in the digital documents. For example, if someone wants
to view the products that contain the word “mobile”, image recognition will display all the
products that have “mobile” word into its description. An advanced application, however, will be
to find those reviews/posts/tweets that contain the photos of a mobile. Image recognition also
helps to recognize multiple elements within single image at the same time, including logos,
faces, activities, objects, and scenes. It can also identify faces within an image to determine sentiment according to facial expressions; identify gender and age depending on features; and determine body parts such as the upper body, hands, face, eyes, and so on.
This chapter discusses popular pretrained models such as MobileNet, MobileNetV2,
ResNet50, VGG16, and VGG19 for performing unsupervised and supervised machine learning
on image data and image similarity techniques that can be used for recommendation system. It
should be noted that the benefit of transfer learning is seen best for unprocessed images because these trained models provide functions for preprocessing the images. This chapter also
focuses on predefined algorithms from “haarcascade” group to determine the face and eye and
predefined algorithms from cv2 library to determine gender and age. The chapter then shows
how to save and load a user-defined deep learning model for determining facial expression: creating a pretrained model by the user, saving the model, and loading the model later for evaluating the test dataset, a single image, and an image captured from the webcam.

20.1 Image Similarity Techniques


We know that image similarity techniques are available in sklearn.metrics.pairwise library
and different techniques include cosine_similarity, euclidean_distances, and manhattan_distances. All these techniques are used in the different pretrained models for
displaying the similar images to the given image. For understanding the importance of transfer
learning for recommendation system, we will consider DivTrain dataset that contains 800
images of different categories. This dataset can be downloaded from
https://data.vision.ee.ethz.ch/cvl/DIV2K/.
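The following is a sketch of the image-loading loop described in the explanation below; the folder name and file pattern follow the text.

from glob import glob
from keras.preprocessing import image as kimage

images_dict = {}
for mix_image in glob('DIVtrain/*.png'):
    img = kimage.load_img(mix_image, target_size=(224, 224))
    img = kimage.img_to_array(img)
    num = mix_image.split('\\')[-1].split('.')[0]      # image number taken from the file name
    images_dict[num] = img
print(len(images_dict))                                # 800 images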

Explanation
The glob() function helps to read all the images one by one from the folder. The name of the
folder is DIVtrain; all the files are in the format of *.png. Hence, the command “for mix_image
in glob('DIVtrain/*.png')” will read each and every image and store in mix_image. The
preprocessing library available in keras helps to load the image and converts the image to array.
The generally accepted image size for trained models is 224 × 224 pixels. Hence, the function
kimage.load_img(mix_image, target_size=(224, 224)) loads the image from mix_image
with a dimension of 224 × 224 pixels. Before processing, an image needs to be converted to an
array. Hence, the command kimage.img_to_array(image) converts an image to an array form.
We need to create a dictionary that maps the image num and image. The function
mix_image.split('\\')[-1].split('.')[0] is used to extract image name from the path and
stores in num. We mapped the image and num using the command images_dict[num] = image. We can observe that the dataset contains 800 images and they are stored in the dictionary
with values like 001, 002, …; which basically denotes the number of images. The next division
then plots all the images. An image list named main_image_list is created. The command for
image in images_dict.keys() reads the values of each image and adds the item to the list.
The last division then plots all the images using the enumerate() function. There were 800
images that were displayed in 100 rows and 8 columns. We can observe from the display that
the dataset contains different types of images related to nature, animals, birds, people,
monuments, etc.

Image files are generally huge and may exist in tar.gz format. For unzipping a file in tar.gz format, type the command !tar -xf filename.tar.gz.

20.1.1 Without Pretrained Model


Before doing image analysis, we need to convert each image to one-dimensional (1-D). This can
be done using the flatten() function, which will flatten each image to 1-D. Hence all the 800
images will be converted to 1-D, thus making a matrix format containing 800 rows.
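A sketch of the flattening and cosine-similarity steps described in the explanation below.

import numpy
import pandas
from sklearn.metrics.pairwise import cosine_similarity

images_matrix1 = numpy.zeros([800, 150528])                     # 224 x 224 x 3 = 150528
for i, key in enumerate(sorted(images_dict.keys())):
    images_matrix1[i, :] = images_dict[key].flatten()

cos_similarity = cosine_similarity(images_matrix1)
cosine_dataframe = pandas.DataFrame(cos_similarity)
product_info = cosine_dataframe.iloc[6].values                  # image 0007.png is at index 6
similar_images_index = numpy.argsort(-product_info)[0:4]
print(similar_images_index)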

Explanation
As all the images are with dimension 224 × 224, it is important to have images in a single
dimension. Hence, a matrix named images_matrix1 of order 800, 150528 (224 × 224 × 3) was
created and all the elements were initialized to 0 by using the command
numpy.zeros([800,150528]). The “for” loop was used to flatten each and every image and the
result was stored in the matrix. Thus, the shape of the matrix becomes (800, 150528). The
command Image('DIVtrain/0007.png') displays the image for which we wanted to determine
similar images from the dataset. The image is basically a nature beauty. Three different
similarity techniques, namely, cosine similarity, Euclidean distance, and Manhattan distance
were imported from the library. The command
cos_similarity=cosine_similarity(images_matrix1) calculates the cosine similarity of
each image with other images and stores it in cos_similarity. The shape of the cosine matrix
is (800, 800), which contains the similarity value of each image with all the other images. We
can observe that the values range from 0 to 1. The higher the value of cosine similarity, the
higher is the similarity between the images; thus, two exactly similar images will have cosine
value as 1. As we wanted to determine images that are close to the given image, we convert the
matrix containing cosine values to a dataframe using pandas library. The command
product_info = cosine_dataframe.iloc[6].values stores the values of cosine similarity of
image at index 6 and stores in product_info. For determining the index of top similar images,
we need to sort these values in descending order because the image that will have higher value
will resemble the most with the given image. Hence, the command
similar_images_index=numpy.argsort(-product_info)[0:4] sorts the values in descending
order and stores the index of top four values in similar_images_index. The similar indexes
when displayed show the result as [6 39 483 31], which means that images at indexes 6, 39, 483, and 31 show a better similarity with the given image. This is obvious because the image
will have highest cosine similarity (1) with the image itself. The next division then plots the
four similar images with respect to the given image. We can observe that the first similar image
is the image itself. Similar process is then adopted for determining similar images using
Euclidean distance and Manhattan distances. It can be observed that all the three techniques
adopted for determining similar image do not display very satisfying results. Hence, in
subsequent sections we use different trained models for displaying similar images.

20.1.2 Using MobileNet Model


This section uses MobileNet pretrained model for determining similar images. It should be noted
here that this model processes the image to 50176 dimensions. Hence, it is important to create
the matrix in the same order.

Explanation
We can observe that the MobileNet model suggested similar images with respect to the given image. Thus, it can be inferred that the similarity techniques show better and more effective results when used with a pretrained model.

20.1.3 Using MobileNetV2 Model


This section uses MobileNetV2 pretrained model for determining similar images. It should be
noted here that this model processes the image to 62720 dimensions. Hence, it is important to
create the matrix in the same order.

Explanation
The result shows that similar images are filtered from the huge dataset according to all the three
techniques. As the given image belongs to nature, all the displayed images are also related to
nature.

20.1.4 Using ResNet50 Model


This section uses ResNet50 pretrained model for determining similar images. It should be noted
here that this model processes the image to 100352 dimensions. Hence, it is important to create
the matrix in the same order.

Explanation
The result shows that similar images are filtered from the huge dataset according to all the three
techniques. As the given image belongs to nature, all the displayed images are also related to
nature. The accuracy of the pretrained model is very high.

20.1.5 Using VGG16 Model


This section uses VGG16 pretrained model for determining similar images. It should be noted
here that this model processes the image to 25088 dimensions. Hence, it is important to create
the matrix in the same order.

Explanation
The VGG16 model shows an excellent accuracy. The result also shows that similar images are
filtered from the huge dataset according to all the three techniques. As the given image belongs
to nature, all the displayed images are also related to nature.

20.1.6 Using VGG19 Model


This section uses VGG19 pretrained model for determining similar images. It should be noted
here that this model processes the image to 25088 dimensions. Hence, it is important to create
the matrix in the same order.

Explanation
The result shows that similar images are filtered from the huge dataset according to all the three

796
techniques. The VGG19 model shows an excellent accuracy. As the given image belongs to the
nature, all the images are displayed according to the nature.

Consider any other image from the dataset and determine similar images using
all the three different techniques and pretrained models. Evaluate the results.

USE CASE
RECOMMENDATION SYSTEM FOR VIDEOS

The future is visual. Today, images and image sequences (videos) account for around 80% of all
corporate and public unstructured big data. Watching online videos has become a popular trend
and a daily habit of our new generation. Videos are a reliable source for gaining knowledge and
it is easier to grasp information through videos than reading. The Internet is flooded with
billions of videos and as growth of unstructured data increases, finding relevant video is
becoming a time-consuming task; so to save time as well as efforts from browsing lots of videos
to choose the appropriate ones, there is a necessity to build a strong, efficient, and accurate
recommendation system, which will display appropriate videos for the users. The analytical
systems must assimilate and interpret images and videos as well as they interpret structured data
such as text and numbers. A video recommendation system provides users with suitable videos
to choose from and is thus an effective way to achieve higher user satisfaction. It is important for a
recommendation system to show relevant results, for example, showing adventurous movies
to a person who loves to watch adventurous movies and showing romantic love stories to
teenagers. It is suggested that we should not recommend videos that a user already knows or can
find anyway; it is better to expand the user’s taste without offending or annoying him/her. This will also
bring more network traffic on the video websites and hence video websites are paying more
attention to it.
Video recordings are complex media types particularly including audio and visual
modalities. These systems can use collaborative filtering (CF) models or content-based filtering
(CBF). The information obtained from audio and visual modal sources provides the possibility of
understanding relationships between modalities and thus understand the basic content of video.
A video recommendation system can also be created by finding the most relevant videos according
to current viewings rather than the collection of user profiles required in traditional
recommenders. These recommendation systems are used by online platforms such as Netflix,
hotstar, and YouTube to suggest movies, shows, and videos for viewing according to the user’s
choice.

20.2 Unsupervised Machine Learning


It is important to group similar images together for meeting different requirements; this task is
performed using cluster analysis. Different techniques for performing cluster analysis are
available including hierarchical clustering and k-means clustering. In this section, however, we
will consider only the k-means clustering for grouping similar types of images. For understanding
the importance of transfer learning for cluster analysis, we will consider images that are
downloaded directly from Internet by the user. This is because those images are not converted to
a matrix form unlike the images in the existing datasets available in keras. For example, the
cifar10 dataset that was discussed in Section 17.3 contains images already stored in the form of
numbers and all the images had the same dimensions. However, this is not the case with the
images that can be downloaded from Internet since all the images have different sizes. We hence
need to do basic image processing before applying k-means clustering algorithm. For effective
understanding, we will consider a dataset containing 64 images belonging to four different
categories: car, cakes, shirt, and bags. These images were downloaded from the Internet.
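A minimal sketch of the basic image processing is given below; the folder name and file extension are assumptions.

import glob
import numpy
from skimage.io import imread
from skimage.transform import resize

files = sorted(glob.glob('cluster_images/*.jpg'))            # hypothetical folder and extension
images_matrix = numpy.zeros([64, 750000])                    # 500 x 500 x 3 = 750000
for i, f in enumerate(files):
    image_form = imread(f, as_gray=False)                    # read each image in color
    images_matrix[i, :] = resize(image_form, (500, 500)).flatten()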

Explanation
We can observe that the dataset contains 64 images belonging to four different categories: shirt,
cakes, car, and bags. As the images are resized to a dimension of (500, 500) and since these are
colored images, we create a matrix considering the dimension as 750000 (500 × 500 × 3).

20.2.1 Without Pretrained Model


In this section, we will apply k-means clustering technique after flattening the image and without
using any pretrained model for preprocessing of image.
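A sketch of the elbow method used to choose the number of clusters is shown below; it assumes the images_matrix built above, and the exact range of k values tried is an assumption.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertia = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=10).fit(images_matrix)
    inertia.append(km.inertia_)                              # within-cluster sum of squares
plt.plot(range(1, 11), inertia, marker='o')                  # the 'elbow' suggests the optimum k
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster sum of squares')
plt.show()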

Explanation
We can observe from the chart that the optimum number of clusters is 4. Hence, in the next
code, we will now perform cluster analysis considering four clusters.
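The cluster analysis with four clusters can then be sketched as follows, reusing images_matrix from above.

import numpy
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, random_state=10).fit(images_matrix)
print(kmeans.labels_)                                        # cluster label (0-3) of each image
print(numpy.bincount(kmeans.labels_))                        # number of images in each cluster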

Explanation
We can observe that we had 64 images and when cluster analysis considering four clusters was
applied to these images, the images are labeled as 0, 1, 2, and 3 according to the cluster which
they belong to. Thus, we can observe that 25 images belonged to the first cluster, 23 to the
second cluster, 12 to the third cluster, and 4 to the fourth cluster. The confusion matrix shows
that the accuracy is also not good.

Explanation
Since the first cluster had 25 images, a grid with five rows and five columns was created to
display all the images of the first cluster. However, we can observe that it contains mixed types
of images spanning all the categories. It can also be observed that the colors of all these images
are dark. This means that clustering has been done on the basis of color tones, which cannot be
considered effective clustering.

Explanation
Since the second cluster had 23 images, a grid with 6 rows and 4 columns was created to
display all the images of the second cluster. We can observe that all the images belonging to
this cluster are light-colored images.

Explanation
Since the third cluster had 12 images, a grid with 3 rows and 4 columns was created to display
all the images of the third cluster. It can be observed that this cluster mainly contains images of
cars, but the accuracy seems to be very low since it also contains images of cakes.

Explanation
Since the fourth cluster had four images, a grid with 1 row and 4 columns was created to
display all the images of the cluster. We can observe that the clusters formed are really not
good because no trained model was used for processing the images. In the next code, we
will consider existing pretrained models for performing cluster analysis.

20.2.2 Using MobileNet Model


We know that this model processes the image to 50176 dimensions. Hence, we will create the
matrix accordingly for saving the processed image.
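A hedged sketch consistent with the commands described in the explanation below is given here; the folder name and file pattern are assumptions.

import glob
import numpy
from keras.applications.mobilenet import MobileNet, preprocess_input
from keras.preprocessing import image as kimage
from sklearn.cluster import KMeans

mobile_net_model = MobileNet(weights='imagenet', include_top=False,
                             input_shape=(224, 224, 3))
mobilenet_matrix = numpy.zeros([64, 50176])
for i, f in enumerate(sorted(glob.glob('cluster_images/*.jpg'))):
    image = kimage.load_img(f, target_size=(224, 224))
    image = preprocess_input(numpy.expand_dims(kimage.img_to_array(image), axis=0))
    mobilenet_matrix[i, :] = mobile_net_model.predict(image).ravel()

kmeans = KMeans(n_clusters=4, random_state=10).fit(mobilenet_matrix)
print(numpy.bincount(kmeans.labels_))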

Explanation
A matrix named mobilenet_matrix was created initially with a dimension of 64, 50176 using
the command numpy.zeros([64, 50176]). It is important to consider the correct dimension for
proper interpretation of image data. This dimension was considered because the data had 64
images and MobileNet algorithm can be used to produce an output of 50176 columns. The
preprocess_input() function from the MobileNet algorithm was used to preprocess the
image after converting it to an array form using the command
preprocess_input(numpy.expand_dims(kimage.img_to_array(image),axis=0)). The
command mobilenet_matrix[i, :] = mobile_net_model.predict(image).ravel() was
used to predict all the images on the MobileNet trained model and the result was flattened using
the ravel() function; the results were then stored in the matrix form.
Since there are four categories in the dataset, cluster analysis was done considering four
clusters using the command KMeans(n_clusters=4,random_state=10). It is observed from the
output that the first cluster had 10, second cluster had 14, third cluster had 21, and fourth cluster
had 19 images. The confusion matrix clearly shows that there is only one nonzero element in
each row and column. This means that the accuracy is 100% for the data.

Explanation
We know that the first cluster had 10 images, so an image was made up of 2 rows and 5
columns for displaying all the images of the first cluster. We can observe from the results that
the first cluster contains the images of only shirts.

Explanation
We know that the second cluster had 14 images, so an image was made up of 2 rows and 7
columns for displaying all the images of the second cluster. We can observe from the results
that the second cluster contains the images of only bags.

Explanation
From the output, it can be seen that the third cluster had 21 images, so an image was made up of
3 rows and 7 columns for displaying all the images of the third cluster. We can observe from
the results that the third cluster contains the images of only cakes.

Explanation
We know that the fourth cluster has 19 images, so an image was made up of 4 rows and 5
columns for displaying all the images of the fourth cluster. We can observe from the results that
the fourth cluster contains the images of only cars. Hence, it can be inferred that the use of
pretrained models on unsupervised machine learning algorithms showed a very high accuracy.

20.2.3 Using MobileNetV2 Model


In this section we will use MobileNetV2 model for performing cluster analysis considering four
clusters. We know that this model processes the image to 62720 dimensions. Hence, we will
create the matrix accordingly for saving the processed image.

Explanation
A matrix named mobilenetv2_matrix was created initially with a dimension of (64, 62720)
using the command numpy.zeros([64, 62720]). This dimension was considered because the
data had 64 images and MobileNetv2 algorithm can be used to produce an output of 62720
columns. The preprocess_input() function from the MobileNetv2 algorithm was used for
converting the image to an array form using the command
preprocess_input(numpy.expand_dims(kimage.img_to_array(image), axis=0)). It is
observed from the output that the first cluster had 14, the second cluster had 10, the third cluster
had 19, and the fourth cluster had 21 images. The confusion matrix clearly shows that there is
only one nonzero element in each row and column. This means that the accuracy is 100% for
the data. The images of the different clusters obtained were similar to the Mobilenet model,
hence this algorithm also confirms the 100% accuracy in predicting the clusters.

20.2.4 Using ResNet50 Model


In this section we will use ResNet50 algorithm for performing cluster analysis on the dataset of
64 images. We know that this model processes the image to 100352 dimensions. Hence, we will
create the matrix accordingly for saving the processed image.

Explanation
A matrix named resnet_matrix was created initially with a dimension of (64, 100352) using
the command numpy.zeros([64, 100352]). It should be noted that the dimension 100352 was
considered because ResNet50 algorithm is used to produce an output of 100352 columns. The
preprocess_input() function from the ResNet50 algorithm was used for converting the image
to an array form using the command
preprocess_input(numpy.expand_dims(kimage.img_to_array(image), axis=0)). It can be
observed that the result of ResNet model was found to be exactly same as the MobileNet
model. The images of all the clusters were exactly the same as identified by the MobileNet
model. This further suggests that the accuracy of all the clusters was obtained at a level of
100%.

20.2.5 Using VGG16 Model


In this section we use VGG16 algorithm for performing cluster analysis. We know that this
model processes the image to 25088 dimensions. Hence, we will create the matrix accordingly
for saving the processed images.

Explanation
A matrix named vgg16_matrix was created initially with a dimension of (64, 25088) using the
command numpy.zeros([64,25088]). This dimension was considered because VGG16
algorithm is used to produce an output of 25088 columns. The preprocess_input() function
from the VGG16 algorithm was used for converting the image to an array form using the
command preprocess_input(numpy.expand_dims(kimage.img_to_array(image), axis=0)).
It can be observed that the results of VGG16 model were found to be exactly same as the
MobileNet model. The images of all the clusters were exactly the same as identified by the
MobileNet model. This further suggests that the accuracy of all the clusters was obtained at a
level of 100%.

20.2.6 Using VGG19 Model


In this section we will use VGG19 algorithm for performing cluster analysis. We know that this
model processes the image to 25088 dimensions. Hence, we will create the matrix accordingly
for saving the processed images.

Explanation
A matrix named vgg19_matrix was created initially with a dimension of (64, 25088) using the
command numpy.zeros([64,25088]). It can be observed that the VGG19 model was not 100%
accurate as there are multiple nonzero elements in the third row and column of the confusion
matrix. This means that the third cluster had some incorrect images, which also affects the
images of the fourth cluster. As there is only one nonzero element in the first and second
columns, these clusters are likely to have the correct images. Thus, the first and second clusters
contain similar types of images, whereas the third cluster has a mix of images. This is also
evident from the images of the third cluster, which show both cars and shirts. The fourth cluster
had only one image of cake.

Download 81 images belonging to different categories related to furniture: bed,
table, and almirah. Perform cluster analysis considering three clusters and
compare the results of predicted cluster and original cluster using different
pretrained models.

USE CASE
VIDEO SUMMARIZATION USING CLUSTERING

Due to recent advances in technology, a tremendous amount of multimedia information is
available. According to reports, more than 500,000 hours of new videos are generated in a
day. Because of availability of such a huge amount of videos, a good video summarization
technique will give choice to users to quickly browse videos, comprehend large numbers of
videos, and select the videos of their choice for complete viewing later. Video summarization can
be hence defined as a simplification of video content for compressing the video information.
There are basically two approaches for video summarization: static and dynamic video
summarization. Static video summarization considers the visual information only and ignores
audio message. Dynamic video summarization, however, combines image, audio, and text
information together. Hence, in comparison with dynamic video summarization, static video
summarization is done easily and reduces computational complexity for video analysis.
A video basically consists of video shots. A video shot is the smallest physical segment of
video, which is basically an unbroken sequence of frames. The technique of video summarization
basically gathers the representative frames (key frames) from a video. An important requisite is
that the key frames should represent the whole video content without missing important
information and, second, these key frames should not be similar, in terms of video content
information. One of the simplest methods is to choose the first, last, and median frames of a shot
or a combination of the previous ones to describe each and every shot.
However, a better and effective approach for video summarization is to use clustering
algorithm. This is implemented by integrating important properties of video to gather similar
frames into clusters using any distance measure. When clusters are formed, a fraction of the
frames that has given a larger distance metric is retrieved from each group to form a sequence
making up the desired output. It is suggested to collect all clusters’ centers in case of static video
summarization. This will hence represent the most unique frames of the input video and create a
useful summarized video. It is important to understand that special steps need to be taken for
preserving continuity of the summarized video in dynamic video summarization because of the
involvement of text and audio data.

20.3 Supervised Machine Learning


In this section, like the previous one, we consider images that are downloaded directly
from the Internet by the user. The mnist dataset discussed in Section 17.4 contains images that
store numbers and all the images have the same dimensions. But downloaded images may have
different sizes. Hence, we need to do basic image processing before applying classification
algorithms. For understanding the utility of supervised machine learning algorithms, we have
downloaded images from Internet and created a dataset containing 81 images belonging to six
different categories: bike, flower, dog, chair, bottle, and shirt. It should be noted here that for
better understanding, a small set of images was taken. The reader is advised to take a bigger
dataset of images for better results and higher accuracy.
In this section, we will use different supervised machine learning techniques such as Naïve–
Bayes, random forest, decision tree, and bagging algorithm for creating the model on training
dataset using trained models. All the created models are then evaluated on the test dataset and on
the sample image for predicting the category of the image.

Explanation
The images were read using the glob() function from the glob library. The function
imread(image, as_gray=False) reads each image available in colored format and stores in
image_form. All the images were resized to (500,500) dimension using the command
resize(image, (500, 500)). A matrix was created of a dimension (81, 750000) because there
are 81 images and each flattened colored image has 750000 (500 × 500 × 3) values. We can observe that there
are 81 images, which are shown in 8 × 8 matrix using plt.show() function. Based on the
images shown in the matrix, we have created a dataset of dependent variables corresponding to
the category. The dataset has 81 items corresponding to six categories. A dependent variable
named “y” is created for naming all the categories. Flower was given category 1, bike belonged
to category 2, dog belonged to category 3, chair belonged to category 4, bottle belonged to
category 5, and shirt belonged to category 6. Thus, all the images were categorized by creating
a dependent variable. A list named category is also created in the same order to predict the
category depending on the predicted value.

20.3.1 Without Pretrained Models


In this section, for converting the image to 1-D, we did not use any pretrained model. The
flatten() function was used for flattening the image. The dataset is split into training and test
datasets in the ratio of 80–20% and different models using supervised machine learning
algorithms are created and evaluated on the test dataset for determining the accuracy of the
model.
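A hedged sketch of this step, written as a reusable function, is given below; the hyperparameters are assumptions, and X and y denote the feature matrix and category codes described above. The same function can be reused later with the feature matrices produced by the pretrained models.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

def evaluate_models(X, y, test_size=0.2, random_state=10):
    # split 80:20, fit each supervised model, and report accuracy on the test dataset
    X_trg, X_test, y_trg, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    models = {'Naive-Bayes': GaussianNB(),
              'Decision tree': DecisionTreeClassifier(random_state=random_state),
              'Random forest': RandomForestClassifier(random_state=random_state),
              'Bagging': BaggingClassifier(random_state=random_state)}
    for name, model in models.items():
        model.fit(X_trg, y_trg)
        print(name, 'accuracy:', model.score(X_test, y_test))
    return models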

Explanation
We have created training and test datasets by splitting in the ratio of 80:20. Thus, from the
dataset of 81 images, the training dataset has 64 images and the test dataset has 17 images. A
test image was read and stored in variable named image using the function
imread("test1.jpg", as_gray=False). The imshow() function displayed the image as flower.
The image was then resized and flattened to the same dimension as the training dataset. It was
observed that the accuracy of all the models is not very high; moreover, when the image of a
flower was given for predicting the category, the algorithms could not correctly predict the
category of the given image. Hence, there was a need to use a pretrained model.

20.3.2 Using MobileNet Model


In this section we use MobileNet algorithm for processing the images to a dimension of 50176.
As there are 81 images, a matrix of order (81, 50176) is created.

Explanation
The MobileNet model is applied on the dataset of images and it was observed that the
dimensions of the training and test datasets were (64, 50176) and (17, 50176). For all the supervised
machine learning algorithms, the accuracy of the trained models became significantly
higher. It was found that the Naïve–Bayes model showed the highest accuracy when the
MobileNet model was adopted. Thus, when the image of a chair was given for testing, all the
models of supervised machine learning rightly predicted the output as “Chair.”
It should be noted here that for better understanding, a small set of images was taken. The
reader is advised to take a bigger dataset of images for better results and higher accuracy.

20.3.3 Using MobileNetV2 Model
In this section we use MobileNetV2 model for supervised machine learning algorithm. As
discussed earlier, this model processes the image in a dimension of 62720.

Explanation
The accuracy of the MobileNetV2 model is observed to be less than that of the MobileNet model. Like
the MobileNet model, it was found that the Naïve–Bayes model showed the highest accuracy
when the MobileNetV2 model was adopted to develop the model. Thus, when the image of a bike
was given for testing, all the models of supervised machine learning rightly predicted the output
as “Bike.”

20.3.4 ResNet50 Model


In this section we discuss the use of ResNet50 model for supervised machine learning
algorithms. It is known that the ResNet50 model is trained considering the images of dimension
100352. Hence, we will process all our images to the same dimension before training the models
of supervised machine learning algorithms.

Explanation
The accuracy of the ResNet50 model is significantly higher. Bagging and Naïve–
Bayes showed the highest accuracy of 0.94. Thus, when the image of a dog was given for
testing, all the models of supervised machine learning rightly predicted the output as “Dog.”

20.3.5 VGG16 Model
In this section we will discuss the use of VGG16 model for supervised machine learning
algorithms. It is known that the VGG16 model is trained considering the images of dimension
25088. Hence, the preprocessing of all images to the same dimension before training the models
of supervised machine learning algorithms was done accordingly.

Explanation
The accuracy of the VGG16 model is high. Naïve–Bayes, random forest, and bagging models
showed an accuracy of 94.12%. Thus, when the image of a shirt was given for testing, all the
models of supervised machine learning rightly predicted the output as “Shirt.”

20.3.6 VGG19 Model


In this section we discuss the use of VGG19 model for supervised machine learning algorithms
considering the images of dimension 25088. The preprocessing of all images to the same
dimension before training the models of supervised machine learning algorithms was done
accordingly.

Explanation
The accuracy of the VGG19 model is good but less than that of the VGG16 model. Thus, when the image
of a bottle was given for testing, all the models of supervised machine learning rightly predicted
the output as “Bottle.”

Download 200 images, each belonging to one of the four categories: animals, people,
electronic devices, and shoes. Split the dataset in training and test datasets in
the ratio of 80:20. Train the model considering the training dataset and
evaluate the model on test dataset considering different supervised machine
learning algorithms and trained models.

USE CASE
MEDICAL DIAGNOSIS USING IMAGE PROCESSING

Medical diagnosis is one of the most important areas in which image processing procedures are
usefully applied for diagnosis. Due to the tremendous advancement in image acquisition devices,
the data are quite large; this makes image analysis challenging and interesting. Medical imaging
is the process of producing visible images of inner structures of the body for scientific and
medicinal study and treatment as well as visible view of the function of interior tissues. Machine
learning using trained models will help to diagnose the disease, predict the risk of diseases
accurately and faster, and provide actionable prediction models effectively and efficiently.
Previously, the natural images were processed in their raw form which was a time-consuming
process. New algorithms will learn multiple levels of abstraction, representation, and
information automatically from large set of images, help in tuning of features, and will show
significant accuracies.
In ancient times, the health of human body was identified through eyes, tongue, skin, nails,
palm, etc. There are symbols like island, star, square, spot, grille, and circle, and if one or more
of them is/are found on specific region of palm, it indicates probability of disease of respective
organ of body. The color of nails is also observed by many doctors to get assistance in disease
identification. However, these results are less accurate due to limitation of human eye. By using
the knowledge base of medical palmistry, algorithms can be designed to consider input as digital
hand image and result can be generated on the basis of extracting colors; textures; shape from
segmented nail image; and symbols, color, shape, and texture from segmented palm image.
Skin disease recognition can be done on the basis of image color and texture features. Yu et
al. made a diagnosis on herpes simplex, varicella, and herpes zoster through reflectance
confocal microscopy (RCM). The final empirical results demonstrated that specificity could be
extracted from all the three different types of herpes. Zhong et al. (2011) diagnosed psoriasis
vulgaris through three-dimensional computed tomography (CT) imageological technique of skin.
Arivazhagan et al. (2012) proposed an automated system based on texture analysis for
recognizing human skin diseases by independent component analysis of skin color images. Salimi
et al. (2015) presented a pattern recognition method to classify skin diseases. Kotian and
Deepa (2017) studied the problem of skin disease automated diagnosis system based on the
techniques of image border identification.
An image grading automation process is available to provide automated real-time evaluation
system to expedite diagnosis of retinal image by maintaining accuracy and cost effectiveness.
Substantial result has also been shown for CT, digital subtraction angiography, and magnetic
resonance imaging. Tumor/cancer detection can also be done using microarray analysis and
identification of inherited mutations predisposing family members to malignant melanoma,
prostate, and breast cancer. Kumar and Singh (2016) established the relationship of skin cancer
images across different types of neural network. Also the pathological tests involving blood
samples, urine samples, etc., take a lot of time to determine the name of the disease and require
patient’s presence. To overcome these problems, automated disease prediction system can be
designed to give more accurate results in less time. For all the diseases, the training dataset
consisting of images of respective body parts of different people having disease and not having
the disease can be created. This process also includes creation of a data bank of regular
structure and function of the organs to make it easy to recognize the anomalies. However, these
images should be preprocessed to remove noise and irrelevant background by filtering and
transformation. The texture and color features of different images could be obtained accurately.
A supervised machine learning algorithm will identify the disorder and determine
whether the patient has the disease or not based on image of his/her respective body part.

References
Arivazhagan S., Shebiah R. N., Divya K., Subadevi M. P. (2012). Skin disease classification by
extracting independent components. Journal of Emerging Trends in Computing and
Information Sciences, 3(10):1379–1382.
Kotian A. L., Deepa K. (2017). Detection and classification of skin diseases by image analysis
using MATLAB. International Journal of Emerging Research in Management and
Technology, 6(5):779–784.
Kumar S., Singh A. (2016). Image processing for recognition of skin diseases. International
Journal of Computer Applications, 149(3):37–40.
Salimi S., Nobarian M. S., Rajebi S. (2015). Skin disease images recognition based on
classification methods. International Journal on Technical and Physical Problems of
Engineering, 22(7):78–85.
Zhong L. S., Jin X., Quan C. (2011). Diagnostic applicability of confocal laser scanning
microscopy in psoriasis vulgaris. Chinese Journal of Dermatovenereology, 25(8):607–608.

20.4 Pretrained Models for Image Recognition


A lot of pretrained algorithms related to body parts are available in the haarcascade group for
doing image recognition. The algorithm related to detection of the front face is available in
haarcascade_frontalface_alt.xml and the eye detection algorithm is available in
haarcascade_eye.xml. Different pretrained algorithms are also available in the cv2 library for
determining demographic variables like gender and age. This section focuses on using pretrained
algorithms for determining body parts (such as face and eye) and demographic variables (such as
gender and age).

20.4.1 Face and Eye Determination


In this section we determine the face and eye from the given image using pretrained algorithms
available from haarcascade group.
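A hedged sketch of the detection step is shown below; the cascade XML files ship with OpenCV (in recent builds they can be located via cv2.data.haarcascades), and the drawing colors are assumptions.

import cv2

image = cv2.imread('a1.jpg')
miniframe = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)          # cascades work on a grayscale frame

facecascade = cv2.CascadeClassifier('haarcascade_frontalface_alt.xml')
for (x, y, w, h) in facecascade.detectMultiScale(miniframe):
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite('face_a1.jpg', image)

eyecascade = cv2.CascadeClassifier('haarcascade_eye.xml')
for (x, y, w, h) in eyecascade.detectMultiScale(miniframe):
    cv2.rectangle(image, (x, y), (x + w, y + h), (255, 0, 0), 2)
cv2.imwrite('eye_a1.jpg', image)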

Explanation

The algorithm “haarcascade_frontalface_alt.xml” related to face detection is stored in facedata
and the command cv2.CascadeClassifier(facedata) creates a face cascade. The image
named “a1.jpg” is read and the function facecascade.detectMultiScale(miniframe) is able
to detect the face from “a1.jpg.” For marking the boundary, the function named
cv2.rectangle() is used. The image where the face is marked with rectangle is saved as the
file name face_a1.jpg using the command cv2.imwrite('face_a1.jpg', image). Similarly,
the algorithm “haarcascade_eye.xml” related to eye detection is stored in eyedata and the
command cv2.CascadeClassifier(eyedata) creates an eye cascade. The command
eyecascade.detectMultiScale(miniframe) is able to detect the eyes from “a1.jpg.” For
marking the boundary, the function named cv2.rectangle() is used. The image where the eyes
are marked with rectangles is saved as the file name eye_a1.jpg using the command
cv2.imwrite('eye_a1.jpg', image). Thus, from the image, we can observe that a rectangle is
shown on the face when it is detected by the face cascade. Similarly, eye detection is also
marked with proper rectangles considering the algorithm. It should be noted here that the
command cv2.destroyAllWindows() should be used to end the process; otherwise, two
important processes will execute together, resulting in an error.

It is also possible to detect the left eye and the right eye separately using the
algorithms haarcascade_lefteye_2splits and
haarcascade_righteye_2splits, respectively.

It is also possible to crop the image by specifying the appropriate dimensions determined by the
face detection algorithm as explained in the following section.
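A minimal sketch of cropping every detected face from a folder of images is given below; the folder name is an assumption, and each file is overwritten with its cropped face as described in the explanation.

import glob
import cv2

cascade = cv2.CascadeClassifier('haarcascade_frontalface_alt.xml')
for img in glob.glob('faces/*.jpg'):                         # hypothetical folder
    image = cv2.imread(img)
    miniframe = cv2.resize(image, (300, 300))
    for (x, y, w, h) in cascade.detectMultiScale(miniframe):
        sub_face = miniframe[y:y + h, x:x + w]               # crop the detected face
        cv2.imwrite(img, sub_face)                           # save it under the original file name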

Explanation
The above code accesses each image using the glob() function. The images are read using
the imread() function, resized to dimension of (300, 300) using the function resize(), and
displayed using the imshow() function.

Explanation
The algorithm for front face “haarcascade_frontalface_alt.xml” is used with cascade
classifier and saved in cascade. The face is detected using the function
cascade.detectMultiScale(miniframe) and the command sub_face = image[y:y+h,
x:x+w] crops the image with necessary dimensions and saves it in sub_face. The command
imwrite(img, sub_face) then creates a new image and saves it with the name of the original
file. The next code displays all the images from the folder. We can observe from the results that
when all the images inside the folder are displayed, only the cropped face is displayed from all
the images.

The haarcascade group contains the algorithms for determination of full body,
smile, upper body, lower body, etc. The smile algorithm is available in
haarcascade_smile_alt.xml, full-body algorithm is available in
haarcascade_fullbody.xml, upper body algorithm is available in
haarcascade_upperbody.xml, and lower body detection algorithm is available
in haarcascade_lowerbody.xml.

20.4.2 Gender and Age Determination


A powerful feature related to image analysis is determination of gender and age. In this section,
gender and age are determined for the given image using the algorithm available in cv2 library.
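A hedged sketch of the gender and age prediction is given below; the .prototxt and .caffemodel files must be downloaded separately, the gender file names are assumed to be analogous to the age ones, the mean values are those commonly quoted for these models, and the cropped face saved earlier is reused as input.

import cv2

age_net = cv2.dnn.readNetFromCaffe('deploy_age.prototxt', 'age_net.caffemodel')
gender_net = cv2.dnn.readNetFromCaffe('deploy_gender.prototxt', 'gender_net.caffemodel')
age_values = ['(0, 2)', '(4, 6)', '(8, 12)', '(15, 20)',
              '(25, 32)', '(38, 43)', '(48, 53)', '(60, 100)']
gender_values = ['Male', 'Female']
mean_values = (78.4263377603, 87.7689143744, 114.895847746)  # commonly quoted mean values (assumption)

new_img = cv2.imread('face_a1.jpg')                          # the cropped face obtained earlier
img_blob = cv2.dnn.blobFromImage(new_img, 1, (227, 227), mean_values, swapRB=False)

gender_net.setInput(img_blob)
gender_predicted = gender_net.forward()
gender = gender_values[gender_predicted[0].argmax()]

age_net.setInput(img_blob)
age_predicted = age_net.forward()
age = age_values[age_predicted[0].argmax()]
print(gender, age)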

The same code when executed on the new image produced the result as:

Explanation
The command cv2.dnn.readNetFromCaffe('deploy_age.prototxt',
'age_net.caffemodel') is used to develop the age model considering the pretrained models for
age “deploy_age.prototxt”, “age_net.caffemodel” and using the function
readNetFromCaffe(). The models, when trained initially, considered the range of age as [‘(0,
2)’, ‘(4, 6)’, ‘(8, 12)’, ‘(15, 20)’, ‘(25, 32)’, ‘(38, 43)’, ‘(48, 53)’, ‘(60, 100)’]. Similarly, gender
model is also created and considered two values for gender “Male” and “Female.” The model
also had some mean values that are stored in mean_values. The face detection algorithm was
first used on the image and the face was then cropped from the original image and stored in
new_img. The command img_blob=cv2.dnn.blobFromImage(new_img, 1, (227, 227),
mean_values, swapRB=False) helps to create a blob from the image and stores in img_blob.
The gender model and age model take inputs as the blob from image and use the forward()
function to predict the values of gender and age. These predicted values of gender and age are
stored in gender_predicted and age_predicted, respectively. We can observe from the first
image that the predicted values of gender are 0.99814117 and 0.00185879. This means that the
image has a 99.8% probability of belonging to male and a 0.2% probability of belonging to
female. The command gender = gender_values[gender_predicted[0].argmax()]
determines the category that has the maximum argument from the predicted values and stores in
variable named “gender.” Since the probability of gender was highest for “male”, the value
male is stored in gender. Similarly, the age had eight categories; hence probabilities of each
category are predicted resulting in eight predicted values and the category having the highest
predicted value of age is stored in age. These values are then written on the cropped image at
specified location with font size as 34. Thus, the first image was rightly predicted as male
belonging to 25–32 age group and the second image was rightly predicted as female belonging
to 25–32 age group. Thus, it is observed from the cropped face that the algorithm for gender
and age detection rightly predicted the gender and age of the person.

Create an application for captioning the image like “female carrying shopping
bag” or “male riding a bike at the road” on the basis of object determination
and gender determination.

USE CASE
PERSONALIZED DISPLAY FOR CUSTOMERS IN SHOPPING MALL/OFFLINE STORES/RESTAURANTS

An image-processing technique is defined as the usage of a computer to understand the digital
image. This technique has many benefits such as elasticity, adaptability, data storing, and
communication. As deep learning technology continues to evolve and improve, businesses can
understand consumer behavior from a wide array of devices. Combining data with deep learning
models will help businesses to create personalized marketing approaches that will appeal to
anyone who might buy their product. With the growth of different trained models, the visual
analysis of gender and age groups from a person’s image can be done easily. This will prove to be
a new inroad for marketing and increasing personalization through image analysis.
A restaurant can use computer vision to identify demographics of customers and then
recommend menu items based on these visual data. An offline store can be equipped with
cameras and image analysis software, identify customers by their gender and approximate age
and then recommend products based on these visual data. In stores, image of the person can be
captured when the person enters the store, gender and age of the person can be identified from
the image using the pretrained algorithms, and when the person reaches the particular area of
the store where things are kept matching to his/her age group, display screen can show the
images according to his/her liking. Even in shopping malls, these camera-equipped screens can
capture and analyze data about all passers-by, specifically window-shoppers. When a window-
shopper stops to check out screen displays, the content will change to show the items or products
that the software perceives as relevant to that individual based on the analyzed visual data. For
example, a lady window-shopper may see images of ladies’ handbags, cosmetics, fashion
accessories, and garments that complement her wear. This will really serve to be the best way of
grabbing window-shoppers’ attention and turning them into impulse buyers.

20.5 Creating, Saving, and Loading User-Defined Model for Feature Extraction
It is also possible for the user to create and save his/her own model. In such a scenario, the user
creates his/her own pretrained model that can be used later for executing on any data. The model
is developed on the training dataset and is saved using the command model.save('name of the
model'); it can be loaded later for further usage by importing load_model from the
keras.models library.
For creating a model in training dataset, we have considered the fer2013.csv available from
https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-
challenge/data consisting of images related to features for independent variables and categories
for facial expressions. The dataset had seven classes, namely, angry, disgust, fear, happy, sad,
surprise, and neutral. The data consist of 48 × 48 pixels grayscale images of faces. The faces
have been automatically registered so that the face is more or less centered and occupies about
the same amount of space in each image. The task is to categorize each face based on the
emotion shown in the facial expression in one of the seven categories (0 = Angry, 1 = Disgust, 2
= Fear, 3 = Happy, 4 = Sad, 5 = Surprise, 6 = Neutral). The train.csv contains two columns,
“emotion” and “pixels.” The “emotion” column contains a numeric code ranging from 0 to 6,
inclusive, for the emotion that is present in the image. The “pixels” column contains a string
surrounded in quotes for each image. The contents of this string are space-separated pixel values in
row major order. The training set consists of 28,709 examples. The public test set used for the
leaderboard consists of 3589 examples. The final test set consists of another 3589 examples. This
dataset was prepared by Pierre-Luc Carrier and Aaron Courville.

It is important to check the compatibility of the versions of the two libraries,
else errors will be generated. It should be noted that this section considered the
TensorFlow 1.15 version for implementation.

20.5.1 Creating and Saving the Model for Feature Extraction


In this section, we have created our own model for feature extraction from the images.
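A hedged sketch of loading fer2013.csv and splitting it by the Usage column is given below; the column names follow the Kaggle file, and the parsing details are illustrative rather than the exact code.

import numpy
import pandas
from keras.utils import to_categorical

data = pandas.read_csv('fer2013.csv')
x_trg, y_trg, x_val, y_val, x_test, y_test = [], [], [], [], [], []
for _, row in data.iterrows():
    pixels = numpy.array(row['pixels'].split(' '), dtype='float32')
    label = to_categorical(row['emotion'], num_classes=7)
    if row['Usage'] == 'Training':
        x_trg.append(pixels); y_trg.append(label)
    elif row['Usage'] == 'PrivateTest':
        x_val.append(pixels); y_val.append(label)
    else:                                                    # 'PublicTest'
        x_test.append(pixels); y_test.append(label)

x_trg = numpy.array(x_trg).reshape(-1, 48, 48, 1) / 255.0    # 4-D input for Conv2D
x_val = numpy.array(x_val).reshape(-1, 48, 48, 1) / 255.0
x_test = numpy.array(x_test).reshape(-1, 48, 48, 1) / 255.0
y_trg, y_val, y_test = numpy.array(y_trg), numpy.array(y_val), numpy.array(y_test)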

Explanation
It can be observed from the results that the number of records is 35,888 and the number of pixel columns is 2304 (48 × 48).
The next division creates three different datasets depending on the value of usage. Columns in a
comma-separated file are split using comma; hence, when the split is done on the csv file, the
file returns three columns, namely, emotion, img, and usage. If the value of usage is “Training”,
a new record is added to the training (x_trg and y_trg) dataset. Similarly, the usage value
“PrivateTest” adds a new record to the validation dataset and else (usage="PublicTest") adds a new
record to the test dataset. The next section then transforms the dataset to float values. The
images are grayscale; hence all the pixel values are represented from 0 to 255.
For normalization of images, we will divide all the values by 255; this means that the new
values will have values ranging from 0 to 1. It is known that the dimension of the original
image was 48 × 48 and also that for using conv2D layer, the image should be 3-D and the
dataset should be 4-D. Hence, the next division converts all the datasets to 4-D. We can observe
that finally the training dataset had 28,709 images, and both validation and test datasets had
3589 images.

google.colab is also used by analysts when sufficient resources to do the
analysis are unavailable. For loading the data on the colab drive, we can use the
commands from google.colab import drive and
drive.mount('/content/drive').
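A hedged sketch of a model of the kind described in the explanation below is given here; the number of filters and units is illustrative rather than the book's exact architecture, and x_trg/y_trg are the arrays prepared above.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, AveragePooling2D, Flatten, Dense, Dropout

model = Sequential()
model.add(Conv2D(64, (3, 3), activation='relu', input_shape=(48, 48, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Flatten())                                         # convert the 3-D feature maps to 1-D
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(7, activation='softmax'))                    # seven emotion classes

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_trg, y_trg, epochs=20, batch_size=192)
model.save('trained_img_model_1.h5')                         # saved in the working directory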

Explanation
The above code creates a model consisting of conv2D layer, average pooling layer, max
pooling layer, dense layers, and dropout. It should be noted that conv2D layer is used on the
image data of 3-D and dense layer is added to 1-D data. Hence, before adding dense layer, it is
important to convert the image data in 1-D format. The model was compiled using
“categorical_crossentropy” and “Adam” optimizer. The model was trained on the dataset
considering 20 epochs and batch_size as 192. The accuracy of the model was found to be 73.54%.
The model is saved by the user using the command model.save() and named
“trained_img_model_1.h5.” It is important to mention here that we generally save the model
with extension h5. It should be noted that the model will be saved in the working directory. The
next division validates the model on validation and test dataset.

20.5.2 Evaluating the Model on Existing Dataset


In this section, we will evaluate the model on existing validation dataset and test dataset.
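Assuming the model and the validation/test arrays prepared earlier, the evaluation step can be sketched as:

val_loss, val_acc = model.evaluate(x_val, y_val)
test_loss, test_acc = model.evaluate(x_test, y_test)
print('Validation accuracy:', val_acc, 'Test accuracy:', test_acc)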

Explanation
The accuracy on the validation and test datasets is found to be 57.67%, which is very low in
comparison to the training dataset. Thus, this model is overfitting since the accuracy of the
training dataset is more than that of the test dataset. In the following code, for improving accuracy, we
tried to fit the model with different epochs and batch size.

Explanation
When the model was fit using 25 epochs and batch_size was made 256, the accuracy increased
to 91.84%. The model is saved by the user by the name of “trained_img_model_2.h5” using
save() function. It should be noted that the model will be saved in the working directory.

Explanation
It is clear from the results that the model is overfitting since a lot of training was provided
to it. We found that the accuracy on the training dataset increased to 91.84% but there is a lot of
difference between it and the test and validation datasets. This means that we were not able to
raise the validation and test accuracy to a comparable level. It should be noted that though this
model is not very good for this dataset, we will use it for understanding and detection of the
emotions. We will consider the pretrained model named “trained_img_model_1.h5” created
earlier by the user for evaluating the facial expression of the given image by the user.

20.5.3 Loading the Model and Determining Emotions of Existing Image


In this section, we will try to determine the emotions from the new image by loading the already
saved trained model. The model can be loaded by using the function load_model() available in
keras.models library.
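A hedged sketch of loading the saved model and predicting the emotion of a new image is given below; the file path follows the text, and reading the image in grayscale with cv2 is an implementation assumption.

import cv2
import numpy
from keras.models import load_model

user_model = load_model('trained_img_model_1.h5')
img = cv2.imread('expression/user_image.jpg', cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, (48, 48)).astype('float32') / 255.0
x = img.reshape(1, 48, 48, 1)                                # the model expects a 4-D batch of 48 x 48 x 1 images

predictedvalue = user_model.predict(x)
pred_emotions = predictedvalue[0]
emo_val = ['Angry', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise', 'Neutral']
print(emo_val[pred_emotions.argmax()])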

Explanation
The model is loaded using the load_model() function and stored in user_model. The file named
user_image.jpg is read from the expression folder. Since we have trained the model considering
an image of dimension (48, 48), it is important to convert the image to the same dimension
before using the model. Hence, a new variable user_img is created that reads the image to a
dimension of 48 × 48. The image is then converted to an array format using the function
img_to_array(). The model trained by the user then predicts the value of image and stores the
predicted value in variable named “predictedvalue.” The first item is stored in pred_emotions
using the command pred_emotions=predictedvalue[0].
A function func_expression() is created to display a figure consisting of the original image, a
chart, and the emoji. The function takes as input the original image and the predicted emotions. The
predicted emotions are converted to a list by using the command emotions.tolist(). This is
important because we need to determine the index of the maximum value in the emotions. This
is done by using the command bestemotion=emotionlist.index(max(emotions)). Hence, the
best emotion stores the index of the maximum value in the predicted emotions.
An object named emo_val is created that contains seven different values of emotions:
“Angry”, “Disgust”, “Fear”, “Happy”, “Sad”, “Surprise”, “Neutral”. A nested “if” structure is
used to set the color of the chart and create an emoji variable according to the value of best
emotion. This means that if the predicted value of image is 0, then the person is angry (first
item in the object), a predicted value of 1 means that the person is disgusted (second item in the
object), and so on. Different colors are set for the bar chart depending on the predicted emotion.
An image is created considering three columns using the subplot() function. The first column
displays the original image that was given as an input to the function. The second column
displays the bar chart corresponding to the emotion and the third column displays the
corresponding emoji. These emojis are taken from Google images.

The above code when executed on different images produces the following results:

Explanation
These images show the corresponding emotions in the form of bar chart and emoji along with
the user image. All the emotions are displayed depending on the facial recognition of the given
image.
However, there were some images that displayed incorrect emotion. For example, when the
following image was considered, a sad emotion was displayed, although it was a happy
emotion.

Explanation
For the above image, the model predicted a sad expression, although it was a happy face. Hence, for increasing
the accuracy, we used the face detection algorithm discussed earlier in our model and then
cropped the image. The model for expression detection was then applied on the cropped image.

20.5.4 Determining Emotions from Cropped Facial Image


In this section, for determining emotions efficiently, the face was cropped from the image. For
cropping the face, the frontalface algorithm discussed in Section 20.4.1 is used.

Explanation
This function will crop the face from the user image according to the frontal face algorithm
discussed earlier. The imwrite() function will create and save the image named new_image.jpg
from the sub_face. It should be mentioned here that sub_face contains the cropped face from
the main image.

Explanation
The above code loads the new cropped image and converts to an array format. It reads the
image in the size 48 × 48 because our trained model accepts the input image in this dimension.
The values are predicted and the function func_expression() created earlier is called for
displaying the image along with the emotions in the form of a bar chart and emoji.

The same code when executed on other images before and after cropping of image produced the
following results:

Explanation
We can observe that the accuracy of the emotion detection increased to a great extent when the
face was cropped from the whole image.

20.5.5 Determining Emotions of Image from Webcam


In the real-world scenario, there is a need to capture real-time images. This is possible using a
webcam or video devices. In this section, we determined the emotion of the image directly
captured by the webcam.
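A minimal sketch of grabbing a single frame from the webcam is shown below; the output file name is an assumption, and the face cropping and emotion prediction then follow the earlier code.

import cv2

cap = cv2.VideoCapture(0)                                    # 0 selects the default camera
ret, frame = cap.read()                                      # grab one frame from the webcam
cap.release()
cv2.destroyAllWindows()
if ret:
    cv2.imwrite('webcam_frame.jpg', frame)                   # then crop the face and predict as above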

Explanation
The cv2.VideoCapture(0) command from the cv2 library is used to capture the image from the webcam.
The function cap.read() reads the image and stores it in the frame. The function
func_face_crop(), which uses the haarcascade algorithm, is then used to crop the face
from the image. Then the function func_expression() is called for displaying the expression
of the cropped face. We can observe that the emotions are correctly determined when the image
was taken from the webcam and using face detection algorithm, the image was cropped before
doing emotion analysis.

USE CASE
MEASURING CUSTOMER SATISFACTION THROUGH EMOTION DETECTION SYSTEM

Customer satisfaction is one of the most important concerns in hospitality industry. In order to
meet customer needs efficiently, make customers loyal, and retain them for better profit, it is
important to evaluate the true feedback of the customer. For measuring customer satisfaction, a
survey method is generally adopted. But businesses handle many issues in analyzing the
feedback from survey-based approach. This is because of bugs, erroneous reports, and general
misunderstanding as submitted by the customer through survey forms. It is also known that
customer satisfaction cannot be measured by the monetary terms; hence the profit earned cannot
directly reflect the customer satisfaction.
Organizations need to understand true feedback of customer for quality improvement process
and increasing profits. A nonvocal communication approach needs to be adopted, which forms
the major ideology in understanding customers because only few people today give genuine
feedbacks through feedback forms. Creating an efficient system for measuring customer
satisfaction is the need of the hour when technological advances and computational capabilities
of resources have increased.
An automated emotion detection system can be created from the image of the consumer after
the consumer has used services at the place. This will tell us the level of satisfaction of the
customer and help in better evaluation of the feedback of the customer. Thus, a real-time system
can be built that detects customers’ emotions, which will further help to understand in a better
manner what customers are thinking about the services offered to them.

Summary
• Image analysis has now become the heart of powerful machine learning algorithms.
• Image recognition also helps to recognize multiple elements within single image at the same
time, including logos, faces, activities, objects, and scenes.
• Existing pretrained models that can be used for transfer learning have simplified the task of
image analysis and have helped to increase the accuracy of the analysis to a large extent.
• Transfer learning is best suited for unprocessed images because the trained models provide
functions for processing the images. Popular pretrained models such as MobileNet, ResNet50,
MobileNetV2, VGG16, and VGG19 are used for performing machine learning analysis on
image data.
• Images that are downloaded from Internet require basic image processing since all the images
have different sizes. Hence, before using machine learning algorithms, we should process the
images.
• Different supervised machine learning techniques such as Naïve–Bayes, random forest,
decision tree, and bagging algorithm are used for creating the model on training dataset using
trained models.
• Trained algorithms can also be used to identify faces within an image to determine sentiment
according to facial expressions, gender, age, and body parts such as face, eyes, upper body,
lower body, and so on. Predefined algorithms from the “haarcascade” group are used to
determine the face and eyes and predefined algorithms from the cv2 library are used to determine
gender and age.
• It is also possible for the user to create and save his/her own model. In such a scenario, the user
creates his/her own pretrained model that can be used later for executing on any data.
• The model is developed on the training dataset and is saved using the save() function and
can be loaded later for further usage by importing load_model from the keras.models library.

Multiple-Choice Questions

1. MobileNet model processes the image to ___________ dimension.


(a) 50,176
(b) 62,720
(c) 100,352
(d) 25,088
2. MobileNetV2 model processes the image to _________ dimension.
(a) 50,176
(b) 62,720
(c) 100,352
(d) 25,088
3. ResNet model processes the image to _________ dimension.
(a) 50,176
(b) 62,720
(c) 100,352
(d) 25,088
4. VGG16/VGG19 model processes the image to ________ dimension.
(a) 50,176
(b) 62,720
(c) 100,352
(d) 25,088
5. Multiple images from the folder can be read by using the _______ function.
(a) glob()
(b) multiple()
(c) mulimage()
(d) All of the above
6. The algorithm used for determining face and eye from the image is also available in
__________ group.
(a) human
(b) face_body
(c) haarcascade
(d) All of the above
7. Gender and age detection can be done using the _______ function.
(a) readNetFromCaffe()
(b) GenderAge()
(c) Demo()
(d) All of the above
8. The readNetFromCaffe() is available in _________ library.
(a) cv2.dnn
(b) vision
(c) Demo
(d) All of the above
9. The image can be saved in the folder using the ___________ function available in cv2
library.
(a) saveimage()
(b) imwrite()
(c) imgwrite()
(d) imgsave()
10. An image is converted to 1-D using the ___________ function.
(a) convert1D()
(b) conv1D()
(c) flatten()
(d) All of the above

Review Questions

1. Explain the importance of creating the numpy array with zeros before using any pretrained
algorithm.
2. Explain the process of performing cluster analysis for images.
3. How and why do we create a dependent variable for the user-defined image dataset in
supervised machine learning algorithms?
4. Is it possible to crop all the images from the folder at one time? If yes, how?
5. Explain the process of determining the eye and face from the image.
6. Explain the process of determining the gender and age from the image.
7. Is it possible to determine the facial expression of the live image? If yes, how?
8. Is it possible to save a user-defined trained model and use it later? If yes, how?
9. Why is it important to preprocess the images that are downloaded from the Internet?
10. Does the input dimension of data differ with CONV2D and dense layer? If yes, how do we
solve the problem when both the layers are taken together in the same model?

CHAPTER
21
Chatbots with Rasa

Learning Objectives
After reading this chapter, you will be able to

• Understand Rasa environment for creating chatbots.


• Develop interactive chatbot with new entities, actions, and forms.
• Apply the usage of API (Application Programming Interface) key in developing the
location-based chatbots.
• Make effective chatbots and understand the limitations of chatbots.

Chatbots are a form of human–computer dialog system that operates through natural language
using text or speech. These chatbots are autonomous and can operate anytime, day or night, and
can handle repetitive, boring tasks. They help to drive conversation and are also scalable because
they can handle millions of requests. Satya Nadella had forecasted that chatbots will fundamentally revolutionize how computing is experienced by everybody. Chatbots are very effective for customer interactions and are expected to cut business costs by $8 billion by 2022. It is estimated that 85% of customer interactions will be managed without a human. According to Business Insider, the global chatbot market is expected to reach $1.23 billion by 2025. According to a report, “The global chatbot market was valued at USD 88.5 million in 2015 and is anticipated to witness a substantial compound annual growth rate of 35.08% over the period 2016–2023.”
It is always possible to add more information to a chatbot for improving its quality. Chatbots are consistent because they provide the same information when a query is asked multiple times. However, since understanding natural language is difficult, chatbots cannot be used for designing every kind of interaction. Chatbots can also be risky, and hence should not be used for sensitive data. For effective results, steps should be taken to ensure that only relevant data are asked for and captured as input, and that the data are securely transmitted over the Internet. It should be noted that chatbot development is not a one-time exercise; a continuous cycle of development, improvement, testing, and deployment is needed to refine a chatbot for effective results.
The different platforms for building chatbots include the following:
• Rasa, an open-source, production-ready product used in large companies everywhere.
• Google DialogFlow (API.ai), a completely closed-source product with APIs and a web interface; it has an easy-to-understand voice and text-based conversational interface.
• Facebook Wit.ai.
• IBM Watson Assistant, which has support for searching for an answer from the knowledge base; it is important to first create a skill and then go to the assistant to integrate it with other channels.
• Microsoft LUIS, which provides an easy and understandable web interface to create and publish bots.
• Amazon Lex, which has a voice and text-based conversational interface and provides a web interface to create and launch bots.
The Rasa Stack is a pair of open-source libraries [Rasa natural language understanding
(NLU) and Rasa Core] that allow developers to expand chatbots and voice assistants beyond
answering simple questions. Rasa has become very popular because it can be deployed anywhere
without any network problems, data are completely controlled by organizations (there is no need
to share the data with big companies), and the models can be modified according to the
requirement. It uses state-of-the-art machine learning to hold contextual conversations with users. The most important feature of Rasa is that it is sustainable, unlike init.ai, which was acquired and closed by Apple. It is also compatible with wit.ai, LUIS, or api.ai; hence, it is possible to migrate chat application data into the Rasa-NLU model. It has average NLU capabilities, but the success of chatbots depends mainly on the training data.

The newer, updated version of Rasa is Rasa X, for which the enterprise edition is paid
and the community edition is free but not open source. In this chapter, Rasa
has been used because it is free and open source. Rasa X provides tools for
developers for viewing and filtering conversations between humans and their
Rasa assistant, for converting those conversations into training data, for
managing and creating new versions of models, and for easily giving test users
access to their assistants.

21.1 Understanding Rasa Environment and Executing Default Chatbot
Chatbots are primarily meant for fulfilling two tasks: understanding the user message and giving
the correct responses. The Rasa Stack tackles these tasks with Rasa NLU, the NLU component,
and Rasa Core, the dialogue management component. Rasa NLU builds a local NLU model for
extracting intent and entities from a conversation. It provides customized processing of user
messages through a pipeline. A pipeline defines different components that process a user
message sequentially and ultimately lead to the classification of user messages into intents and
the extraction of entities. Rasa Core is used for prediction based on machine learning models. It
uses current context, intents, current state in the conversation, etc., to decide the next steps. Some
stories in the form of sample interactions are created between the user and bot and are defined in
terms of intents captured and actions performed.
The whole process of Rasa chatbot includes the following steps (Fig. 21.1):

1. The message is received and passed to an interpreter, and NLU converts it into a dictionary
including the original text, the intent, and entities if any.
2. The tracker is the object that keeps track of conversation state. It receives the information
that a new message has been received.
3. The policy receives the current state of the tracker.

4. The policy chooses which action to take next.
5. The chosen action is logged by the tracker.
6. Finally, an appropriate response is sent to the user.

Figure 21.1 Rasa Chatbot Architecture.


Source: rasa.com.

Rasa 1.4.5 requires TensorFlow~=1.15.0, and an error will be displayed if TensorFlow 1.14.0 or 2.0.0 is installed. Update the version of TensorFlow to generate the results shown in this chapter.

Starting with Rasa: Create a directory at the command prompt using the command mkdir firstbot. Change the directory using the command cd firstbot. Activate the rasa environment using the command activate rasa_env and then initialize the project using the command rasa init, or rasa init --no-prompt to initialize it while avoiding the prompts that are usually shown. If rasa init is chosen (without the --no-prompt flag), it will ask questions like: “Where will the project be created [default: current directory]?” and “Do you want to speak to the trained assistant on the command line?” The user can press Enter to create the project in the default directory and answer y/n for speaking to the trained assistant. This command will automatically create a Rasa pack containing the folders and files discussed in the following subsections.

21.1.1 Data Folder


The data folder primarily contains two main files: nlu.md (training data to make the bot
understand the human language) and stories.md (flow of data to help the bot understand how to
reply and what actions to take in a conversation).

21.1.1.1 nlu.md
The chatbot primarily needs to understand the user requirement. For example, we need to
teach the chatbot that if the user is saying “Bye” and “Good Bye,” it basically represents a
common intent of saying “Bye.” This file describes each intent with different
statements/expressions that correspond to that intent. These intents are given to Rasa NLU for
training. However, the expressions can also contain entities that will be discussed in Section
21.3.

An intent is a collection of expressions that mean the same thing but are constructed
differently. Each and every intent finally corresponds to one action that needs to be taken. For
example, if the user wants to wish someone a birthday, multiple statements/expressions can be
used such as many happy returns of the day, happy birthday, etc. All these expressions mean the
same thing – to wish birthday – and hence the intent will be “wish_birthday”. We can observe
that the basic file nlu.md automatically gets created when the rasa environment is initialized and
it contains different intents, namely, greet, goodbye, affirm, deny, mood_great, mood_unhappy,
and bot_challenge.
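A minimal sketch of this file's format is given below, using two of the default intents and the expressions quoted in the explanation that follows (the actual default file contains all seven intents):

## intent:greet
- hey
- hello
- hi
- good morning
- good evening
- hey there

## intent:goodbye
- bye
- goodbye
- see you around
- see you later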

Explanation
Each and every intent starts with the ## sign and all the expressions start with a hyphen (-) sign. For example, the greet intent is specified as ## intent:greet in the file, and the expressions contained in the “greet” intent include “hey”, “hello”, “hi”, “good morning”, “good evening”, and “hey there”. These expressions start with the hyphen (-) sign. Similarly, the “goodbye” intent contains different expressions such as “bye”, “goodbye”, “see you around”, and “see you later”. We can observe that all the expressions belonging to a particular intent convey the same message.

21.1.1.2 stories.md
The stories.md file contains a bunch of stories created from where the learning takes place. This
provides information related to sample conversations between users and the assistant. It basically
creates a probability model of interactions from each story. Stories help in teaching chatbot the
manner in which to respond to these intents in different sequences. All the sequences are stored
in data/stories.md. It is important to understand that the stories also help to train the dialog
management model of the chatbot. We can observe from the file that when the rasa environment
is initialized, five stories are created by default, namely, happy path, sad path 1, sad path 2, say
good bye, and bot challenge. Each and every story has some intents and each intent in turn has
responses. In the file, stars precede the intents; it is important that these intents are first defined
in nlu.md file. The dashes represent responses (called utterances) that will be provided by
chatbot when the user passes any statement/expression belonging to the corresponding intent.
The stories file is represented as follows:
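A representative sketch of one of the default stories, reconstructed from the explanation that follows (the actual file contains all five stories):

## happy path
* greet
  - utter_greet
* mood_great
  - utter_happy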

Explanation
The story named happy path has two intents, namely, greet and mood_great. It is important that
these two intents should exist in the nlu.md file, otherwise an error will be generated. The
“greet” intent will execute “utter_greet” response and the “mood_great” intent will execute
“utter_happy” response. It can be observed that all the responses are preceded with the word
utter_. It should be noted here that the simplest actions (printed statements) that need to be executed are called utterances, and hence are generally preceded with “utter_”. Though this prefix is not compulsory, omitting it may lead to misleading results.
It is important to mention that the complicated and more flexible actions are generally
preceded by the word “action_” and are defined as “actions” for API calls. Thus, it can be
interpreted that action is a task that is expected from the chatbot. Generally, an external API
performs this action; since the bot platforms do not support external API calls, an external
program is used to drive that functionality. These actions are defined in the actions.py file and will
be discussed in Section 21.4 in detail when a chatbot is created with some actions.
It is important that all responses described in stories are defined as templates in domain.yml.

21.1.2 domain.yml
This file is called the universe of the chatbot and it has everything that is needed for a basic
chatbot. It contains details of everything – intents, actions, templates, slots, entities, forms, and
response templates that the assistant understands and operates with. The content of domain.yml
after initializing the rasa environment is as follows:
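A minimal sketch of the structure of this file is shown below; only the greet-related entries are reproduced (the response text is taken from the explanation that follows, and the remaining intents, actions, and templates follow the same pattern):

intents:
  - greet
  - goodbye
actions:
  - utter_greet
  - utter_goodbye
templates:
  utter_greet:
  - text: "Hey! How are you?"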

The validity of the domain file can be checked from http://www.yamllint.com/.

Explanation
The default basic file generated after initializing the rasa environment is named domain.yml.
The file contains information related to three things: intents, actions, and templates. All the
intents defined in the nlu.md file and actions listed in the stories.md file are listed in this file.
The most important content is the templates section that basically contains the response to be
produced for the user. The templates section contains information of action followed by the
response in the form of text or image. Thus, the action utter_greet will give text response “Hey!
How are you?”.

Thus, the whole process of chatbot can be summarized in three basic steps:

1. Intent identification: An intent is identified from the nlu.md file based on the message
given by the user. The message is compared to the expressions written in all the intents. For
example, a greet intent is identified from the “good morning” message that the user enters.
2. Action identification: Based on the path defined in the story, the action is identified for the
intent that is determined at the first level.
3. Response identification: Depending on the action identified from the story file, the
response is generated from the templates specified in the “domain.yml” file. For example,
suppose the user input follows the happy path story. As specified in the story, the first input belonging to the “greet” intent executes the action named “utter_greet”. Since the template defines that the action “utter_greet” shows the response “Hey! How are you?”, the chatbot engine will show the output as “Hey! How are you?”.
In other words, the job of Rasa Core is to essentially generate the reply message for the chatbot.
The input message is interpreted by an interpreter to extract intent and entity. It is then passed to
the tracker that keeps track of the current state of the conversation. The machine learning
algorithm is applied to determine what the reply should be, and action is chosen accordingly.
Story defines the interaction between the user and chatbot in terms of intent and action taken by
the bot. It takes the output of Rasa NLU (intent and entities) and applies machine learning
models to generate a reply. Actions are basically the operations performed by the bot either
asking for some more details to get all the entities or integrating with some APIs or querying the
database to get/save some information. It could be replying something in return, querying a
database, or any other thing possible by code.
It is important to understand that a model can be trained according to the user requirement.
An incorrect response can be corrected and chatbot can be made to understand the correct
behavior so that next time it does not make any mistake. This can be done by using the command
“rasa interactive --endpoints endpoints.yml”. All the responses are confirmed by the user one
by one and a check is performed on each and every step related to intents and entities. This will
help rasa to create an effective trained model that can be saved for future references. It is
important to note that every time the model is trained, a story named interactive_story_num is
created in the stories.md file. Thus, the first story will be automatically named as
interactive_story_1.

21.1.3 Models Folder


The training part generates a machine learning model based on the training data and is saved in
this folder. It is possible to train both the “nlu” and “core” models together using the single command “rasa train”. This command will train both the models and save them in the models folder, with the name of the model represented as “yearmonthdate-hrmmss”.
When we want to determine the intent and the entity to which the message belongs, it is
suggested to do training for “nlu” model only. The model can be trained by writing “rasa train
nlu” in the command line. Once the model is trained, it will display the message as
NLU model training completed.
Your Rasa model is trained and saved at 'C:\Users………………..tar.gz'.

For generating output from the trained “nlu” model, we need to type “rasa shell” at the command prompt. It will prompt the user to enter a message. When the user enters the message as good morning, the following result is
displayed:

Explanation
The result broadly displays three results: intent, entities, and intent_ranking. The intent displays
the name of the intent that has the highest confidence level for this message. Thus, it can be
interpreted that the message “good morning” belongs to intent greet with the maximum
confidence level. The entities are displayed by [], which means that the message does not
contain any entity. Since the nlu.md file does not contain any information of entities, hence
entities are not displayed. However, we will be discussing entities in Section 21.3 in detail. The
ranking of all intents is shown in the next section starting with the intent displaying the highest
confidence level. The next intent having the higher confidence level is goodbye with confidence
of 0.014. Since there is a huge difference in the confidence level of two corresponding intents,
we can interpret that the message given by the user belongs to greet intent.

The following shows the result generated when the user enters the message as extremely sad:

Explanation
The message typed by the user is “extremely sad” and the result shows that the message
belongs to “mood_unhappy” intent and the confidence level is 0.99. The next intent having the
highest confidence level is “mood_great” with a value of 0.0020. This means that message
definitely belongs to “mood_unhappy” intent.

For generating a response from the chatbot, it is important to train the core model. The training
of “core” model is done by writing the command “rasa train core” at the command prompt. It
can be observed that when the core model is trained and “rasa shell” command is executed, it
will prompt the user to enter an input after displaying the message: Bot loaded. Type a message
and press enter (use “/stop” to exit).

Explanation
From the result we can observe that when the user types the message “good morning”, it
displays the result as: “Hey! How are You”. This is because from the stories.md, we can
observe that if intent is “greet”, the action that should be reflected is “utter_greet” and from the
domain.yml file, we can see that the text response that should be generated from the
“utter_greet” action is “Hey! How are you?”.
Similarly, when the user types the message “not very good”, the chatbot finds from the nlu.md file that this message belongs to the “mood_unhappy” intent; from the stories, it can be observed that this message belongs to the sad path and two actions need to be executed:
“utter_cheer_up” and “utter_did_that_help”. Further, from the domain file, we can observe that
“utter_cheer_up” contains one text message (Here is something….) and one image message.
The action “utter_did_that_help” contains one text message (Did that…) and hence all the three
messages are displayed.
It is important to retrain the nlu model every time when some changes are made to nlu.md
file. Similarly, when changes are made to the core files like stories.md and domain.yml file, we
need to retrain the core model. It should be noted that whenever retraining takes place, the new
models are created and stored in models folder. The name of the core model starts with “core”
and name of the nlu model starts with “nlu.” The name is succeeded with date and time at
which the model was created.

The true potential of the rasa nlu and core shines when the training data are
sufficient. This means that numerous stories should be created related to
different flow of expressions entered by the user.

21.1.4 Actions.py
This file defines all the custom functions, form actions, and can be either simple utterances or
complicated actions based on some logic or some entries. This file generally contains
complicated actions and not simple actions because simple actions are displayed in the templates
section of the domain.yml file. In other words, the simplest form of action is utterances in the
form of simple text messages. The process of Rasa chatbot is displayed in Fig. 21.2. Rasa has a
very cool interface; it provides the facility to create your own class in this file and add user-

defined logic in the derived class. This is discussed in detail in Section 21.3 when a chatbot is
created with some user-defined actions that are required to be executed. It should be noted that the command “rasa run actions” needs to be executed at the command prompt to run the actions.py file. This file should be executed in one window, while the interaction with the user takes place in another window.
The most important feature of Rasa is that when the bot chooses the wrong action, it is
possible to tell the right action that should be taken depending on the expression and flow of
expression given by the user. The model will update itself immediately and there are fewer chances of the bot making the same mistake in subsequent interactions. The conversation is basically logged to a file
and added to training data.

Figure 21.2 Rasa Chatbot.


Source: Conversational AI: Building Clever Chatbots---Tom Bocklisch.

It is also possible to create stories by saving online interactions. Thus, all the online platforms that prompt the user to enter text can create and save a story depending on the flow of messages entered by the user.

21.1.5 config.yml
Rasa provides a lot of flexibility in terms of configuring the NLU and core components. Earlier,
“config.yml” was meant for NLU and “policies.yml” was created for the core model. But, now
Rasa has merged in the single file named config.yml and it has configuration parameters for both
pipeline and policies.
Rasa Core generates the training data for the conversational part using the stories we provide.
It also lets you define a set of policies to use when deciding the next action of the chatbot. These
policies are either defined in the policies section of the config.yml file, or these parameters can be passed during training to the respective policy constructors. It should be noted here that the training policy can be changed by defining your own LSTM or RNN for dialog training. The term “max history” is used to define how far an action depends on previous questions. If max history is 1, then the model just memorizes individual intents and their related actions.
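A minimal sketch of what such a config.yml could look like is shown below. The pipeline entry corresponds to the supervised embeddings pipeline discussed later in this section, while the policy names and the max_history value (MemoizationPolicy, KerasPolicy, 3) are only illustrative assumptions and are not taken from the text:

language: en
pipeline: supervised_embeddings
policies:
  - name: MemoizationPolicy
    max_history: 3
  - name: KerasPolicy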

At the initial level, due to the lack of training data (a large number of stories), the reader is suggested to memorize the rules, create some sample stories, and see the true potential of rasa dialog management.

It is not possible to map every combination of intent-entity to its action in stories. For example,
mapping each flight details to its answer is practically not possible. This problem is solved by
using custom actions, where the action calls a Python function with the information about the
intent and entity. The “train()” function iterates through the pipeline and performs the
following natural language processing (NLP) tasks:

1. Preprocessing: Here the data are transformed to extract the required information. We
generally use SpacyTokenizer and SpacyFeaturizer for doing preprocessing.
2. Entity Extractor and Intent Classifier: The preprocessed data are used to create the
machine learning models that perform intent classification and entity extraction. The CRFEntityExtractor (ner_crf) and SklearnIntentClassifier are commonly used.
3. Persistence: Storing the result.
However, the two commonly used setups are the pretrained embeddings spacy pipeline (with the Sklearn intent classifier) and the supervised embeddings pipeline. The pretrained embeddings spacy pipeline is used if there are fewer than 1000 total training examples and a spaCy model is available for the language; otherwise, the supervised embeddings pipeline is used. The pretrained embeddings spacy pipeline uses pretrained word vectors from either GloVe or fastText, whereas the supervised embeddings pipeline trains its own word embeddings from scratch, as discussed below. The classifier uses the spaCy library to load pretrained language models, which are
then used to represent each word in the user message as word embedding. Word embeddings are
vector representations of words, meaning each word is converted to a dense numeric vector.
Word embeddings capture semantic and syntactic aspects of words. This means that similar
words should be represented by similar vectors. When pretrained word embeddings are used,
the benefit is that the word embeddings are already powerful and meaningful. Since the
embeddings are already trained, it requires only little training to make confident intent
predictions. This helps to increase the accuracy of the results and reduce training time because
the training does not start from scratch.

It is also possible to use different word embeddings, for example, Facebook’s


fastText embeddings. Refer to the spaCy guide to convert the embeddings to a
compatible spaCy model. Link the converted model to the desired language
with python -m spacy link <converted model> <language code>.

Word embeddings are generally trained on datasets that are mostly in English language; they do
not cover domain-specific words such as product names or acronyms. In this case, it would be

better to train our own word embeddings using the supervised embeddings pipeline with the TensorFlow embedding intent classifier. Instead of using pretrained embeddings and training a
classifier on top of that, it trains word embeddings from scratch. It is typically used with
intent_featurizer_count_vectors component that counts how often distinct words of training data
appear in a message and provides that as an input for the intent classifier. Since this classifier
trains word embeddings from scratch, it needs more training data than the classifier that uses
pretrained embeddings. Its advantages are that it adapts to domain-specific messages and that it is language-independent, and hence does not depend on word embeddings for a certain language. Also, it supports messages with multiple intents and helps to create a very flexible classifier for advanced use cases. Thus, we can say that the advantage of the pretrained embeddings spaCy pipeline is that word similarity already exists, and the advantage of the supervised embeddings pipeline is that the word vectors are customized for the domain rather than having only generic similarity.

In some languages (e.g., Chinese), it is not possible to use the default approach
of Rasa NLU to split sentences into words by using whitespace (spaces,
blanks) as separator. Rasa provides the Jieba tokenizer for Chinese in such
cases.

21.1.6 credentials.yml
The file credentials.yml basically stores the access tokens and keys for external calls.

21.1.7 endpoints.yml
This file basically specifies the URL to the place where the actions file is running (generally on
the local host). If custom actions are used, it is important to pass in the endpoint configuration
for action server also.

21.2 Creating Basic Chatbot


In this section, a basic chatbot with user-defined intents and stories for user-defined responses is created. It should be noted that before creating a chatbot, it is important to determine the target customer, the problem that needs to be solved, the solution/answers to the problem, the benefits of the solution, etc. The main objective of this chatbot is to provide basic information on national symbols to the user. All the intents and responses will be defined in the domain.yml file.

21.2.1 nlu.md
The nlu.md file should contain the information of all intents. Since our chatbot is related to
national symbols, we can have intents such as animal, bird, flag, and flower representing national
symbols. Besides, we can have intents for greeting and saying bye. There are two more intents

that we have included related to working and not_working; these are created because our
response to the greet intent is “Hey! What are you doing”. This file is created as follows:
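A representative sketch of two of the eight intents, using the expressions quoted in the explanation below (the remaining intents follow the same format):

## intent:flower
- our flower
- which is our flower
- flower

## intent:working
- doing a job
- I am working
- studying
- practicing
- I am doing a job
- I am playing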

Explanation
Here, eight intents are created, namely, greet, working, not_working, bye, flag, animal, bird,
and flower. All the intents are preceded with ## sign. The different expressions that are
contained in each intent are preceded with hyphen sign. The flower, bird, animal, and flag
intents have three expressions. Thus, if the user types the message as “our flower” or “which is
our flower” or “flower”, the chatbot considers that it belongs to the intent flower. Similarly, the
working intent has six expressions; this means that when the user types the message as “doing a
job,” “I am working”, “studying”, “practicing”, “I am doing a job”, or “I am playing”, the
chatbot identifies the intent as “working”.

21.2.2 stories.md
This file contains different stories and each story has a different flow of intents. It is always suggested to have a larger number of stories in the chatbot for effective results. This chatbot contains the following stories:
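A representative sketch of one of the five stories, reconstructed from the explanation below (the other stories differ only in the sequence of intents):

## happy path
* greet
  - utter_greet
* working
  - utter_positive
* bye
  - utter_goodbye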

Explanation

In the stories.md file, we have created five stories, namely, happy path, happy path2, sad path,
sad path2, and happy path3. The name of the story is preceded with two hash signs. The story
has intents that are represented by * and corresponding templates/ actions that need to be taken
when an intent is recognized. Thus, story happy path has three intents in the order “greet”,
“working”, and “bye”. When the chatbot assistant recognizes the first intent as greet, the second
intent as working, and the third intent as bye, it is obvious that the happy path story is executed.
Hence, when the user enters the message that belongs to the intent greet, it will automatically
display the response that is specified in “utter_greet” template. Similarly, when the next
message entered by the user belongs to the working intent, the response specified in the “utter_positive” template will be executed. When the third message entered by the user belongs to the “bye” intent, the response specified in the “utter_goodbye” template will be executed. These templates are described in the domain.yml file, which is discussed in the following section.

21.2.3 domain.yml
The domain file contains the information of intents (specified in nlu.md), actions (specified in
stories.md with respect to intents), and templates. The templates define the utterances that are
executed on the basis of the actions specified. The details of the domain file for chatbot related to
national symbols are as follows:
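A partial sketch of the templates section, using the responses quoted in the explanation below (the complete file also lists all the intents and actions, and the remaining templates such as utter_flag and utter_cheer):

templates:
  utter_greet:
  - text: "Hey! What are you doing"
  utter_flower:
  - text: "Lotus is our national flower"
  utter_bird:
  - text: "Peacock is our national bird"
  utter_animal:
  - text: "Tiger is our national animal"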

Explanation
It is known that all the utterances used in the stories.md file need to be specified in the domain.yml file also. The stories.md file used eight utterances: utter_greet, utter_positive, utter_cheer, utter_animal, utter_flag, utter_bird, utter_flower, and utter_goodbye. Hence, the domain file lists these in the actions block; the utterances and the corresponding text that needs to be displayed are written in the templates section. From the stories.md file, it is known that when the user types a message belonging to the flower intent, the action named “utter_flower” will be executed. Since the template describes that the “utter_flower” action will display the text “Lotus is our national flower”, the message “Lotus is our national flower” will be displayed whenever the “flower” intent is identified. Similarly, an identification of the “bird” intent will display the message “Peacock is our national bird”, an identification of the “animal” intent will display the message “Tiger is our national animal”, and so on.

The nlu model is trained using the command “rasa train nlu” and executed using the

command “rasa shell”. The following results are produced when the user enters the message as
“nothing”:

Explanation
We can observe that when the user enters the message as “nothing”, it searches the file named
nlu.md and identifies the intent corresponding to the statement “nothing”. In this case, the
intent identified is “not_working” with maximum confidence level of 0.954. The second intent
is greet having confidence level of 0.012 and so on.

The core model can be trained using the command “rasa train core”. It is also possible to
train both the nlu and core models together by using the command rasa train. The following
results are produced when the following messages are given as input after executing the
command “rasa shell”.

Explanation
The first command hi belongs to the intent greet; hence, according to the instruction specified
in the stories.md file, “utter_greet” template is executed and from the domain.yml file, the text
message to be written was “Hey! What are you doing”. The user was prompted for the message
again and when the user entered the message as nothing, it identified from the nlu.md file that
the message belongs to “not_working” intent and the response message hence identified from
domain.yml file is written as: “Let’s do something…”. The next message given by user is bye
that belonged to “bye” intent and hence the message written is “Bye! Nice…”. We can observe
from the above commands that all the output was in accordance with the desired results. This
was because the rasa shell identified from the defined story that the commands given are in
accordance with the order of sad path story.

The following result is obtained when the execution is carried out once again using the command “rasa shell”.

Explanation
We can observe that all the messages given by the user are in the same order as specified in the
happy path2 in the stories.md file. Thus, all the actions and templates specified in the
domain.yml file are displayed accordingly.

USE CASE
CHATBOT FOR E-GOVERNANCE

Electronic governance or e-governance is the application of information and communication


technology (ICT) for delivering government services, exchange of information, communication
transactions, integration of various stand-alone systems such as government-to-citizen (G2C),
government-to-business (G2B), government-to-government (G2G), government-to-employees
(G2E) as well as back-office processes and interactions within the entire government framework
(wikipedia.com). Through e-governance, government services are made available to citizens
digitally in a convenient, efficient, and transparent manner. In other words, it is the use of
technology to perform government activities and achieve the objectives of governance.
The rapid growth of digitalization has led to many governments across the globe to introduce
and incorporate technology into governmental processes. E-governance is adopted by countries across the world; in a fast-growing and demanding economy like India, it has become essential. Examples of e-governance include the Digital India initiative, National Portal of India, Prime Minister of India
portal, Aadhar, filing and payment of taxes online, digital land management systems, and
Common Entrance Test, to name a few.
In the above section, we have created a chatbot for displaying information related to
national symbols. Similarly, chatbots related to different government services and reforms
provided by government can be created to provide information to the citizens at any time.

21.3 Creating Chatbot with Entities and Actions

A good chatbot would always like to greet the user by the name. This means that the message
given by the user will have information about his/her name and this name will vary in all the
expressions given by the user, but the basic format of the message remains the same. For
example, “My name is Hazel” or “My name is Kasheen”. In both the sentences, “My name is”
remains common but the name differs in both the expressions. In this case, the name is called
an entity. In other words, an entity is a piece of information inside the message given by the user,
which is defined in the intent. Practically, the intent generally does have some entities such as the
username or any related field depending on function of the particular chatbot. For example, the
library chatbot will have entities such as book, price, and author; the movie chatbot will have
entities such as movie name, actor, and rating. Depending on the information given by the
entities in the expression, the chatbot extracts and displays the information to the user by
executing the action. Thus, we can say that an entity is information extracted from the text, the
intents are associated with entities and they provide the information to the chatbot for performing
the action. It should be noted that the entities are described both in nlu.md file and in
the domain.yml file.

Figure 21.3 Intent and Entity Extraction Using NLU.

Figure 21.3 explains the process of intent classification and entity extraction. When the user
types the message as “What is the capital of India”, the NLU first does the vectorization and then
classifies the intent. In this example, it identifies that the message belongs to capital_query
intent. NLU also extracts the entity. As discussed earlier in Chapter 16, the job of tokenizer is to
break the sentence into different words/tokens. The sentence is first broken into different tokens, and after POS tagging and named entity recognition, the entity is extracted. In the above sentence, the entity is identified as country.

21.3.1 Single Entity


An entity is defined in parentheses and the possible values of the entity are written in square brackets.
In the following example, we have created an intent named username in the nlu.md file:
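A representative sketch of such an intent, with the entity values placed in square brackets and the entity name in parentheses (the expressions and values are those mentioned in the explanations of this section):

## intent:username
- My name is [Jitesh](name)
- I am [Pearl](name)
- Call me [Jahan](name)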

Explanation
This means that the intent is “username” and the entity is “name”. The possible values of name
entity can be Jitesh, Pearl, or Jahan. It can be observed that the entity is written in parentheses
and possible values are written in square brackets. It is important to include the name of entity
in the stories.md file. The stories.md file is defined as follows:

When the “rasa shell” command is executed after training the model with the command “rasa train” and the message “My name is Jitesh” is entered, the result obtained will be:

Explanation
We can observe from the result that the message “My name is Jitesh” belongs to intent
“username” with confidence of 0.9536. The starting position of Jitesh in the sentence “My
name is Jitesh” is 11 and since Jitesh is a six-letter word, it ends at the 17th position. It shows
that the value “Jitesh” belongs to entity “name” with a confidence of 0.814. It is important to
understand that the entities are case-sensitive. If the message is “My name is jitesh”, it will not show it as an entity because the value of the entity is Jitesh (uppercase) and not jitesh (lowercase).

When we execute the shell with a new message like “Call me Pearl”, the result is

Explanation
It should be noted that in the intent, it was specified as Call me Jahan and Jahan was considered
as value for entity “name” in the expression “Call me…” but Pearl was also defined in the
earlier statement as name entity. Hence, the result shows the confidence level of 0.9929,
although the sentence was Call me Pearl. Pearl is starting at the position 8 and ending at 13 in
the message with confidence level of 0.843.

The confidence level will change depending on the training that will be given
to the chatbot. A highly trained model will result in higher confidence level
than a poorly trained model.

21.3.2 Synonyms for Entities


The user may want an entity value to have synonyms so that the chatbot interprets both values as belonging to the same entity. For example, a synonym of Mumbai can be Mum, and when the user types the message as Mumbai or Mum, in both scenarios the city entity should be recognized. Rasa provides a feature for creating synonyms for the entity by defining the synonym after the entity name and a colon sign. In the following example, the intent is defined as location, the entity as city with two values, namely Pune and Mumbai, and Mum is a synonym for Mumbai. The nlu.md file is defined as
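A representative sketch consistent with this description (the synonym Mum is written after the entity name city, separated by a colon):

## intent:location
- I am from [Pune](city)
- I am from [Mumbai](city:Mum)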

It should be noted that the synonym of the entity is written after the entity name, separated by a colon. Thus, the entity value Mumbai has the synonym “Mum”. It should also be noted that in the stories.md file, the new intent named location and the entity named city should be added. The shell, when executed with the message “I am from Mumbai”, produces the following result:

Explanation
It can be observed that the entity city has the value Mumbai. Whenever the same text is found
in the value of entity, the value will use the synonym instead of the actual text in the message.
Hence, when the value “Mumbai” was written in the expression, the result generated shows that
it starts from location 10 and ends at location 16, value is “Mum,” and entity is city with
confidence of 0.959.

It should be noted that synonyms defined in the training data take effect only if the pipeline contains the EntitySynonymMapper component. We can also add an “entity_synonyms” array to define several synonyms for one entity value in the nlu.md file as
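In the JSON training-data format discussed in Section 21.3.5, such an array could look as follows (a sketch; the mapping sends the surface text Mumbai to the value Mum, matching the result shown above):

"entity_synonyms": [
  {
    "value": "Mum",
    "synonyms": ["Mumbai"]
  }
]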

It is important to note that the accuracy of the model does not improve by adding synonyms in
the above format. The most important task is that entities must be properly classified before they
can be replaced with the synonym value.

21.3.3 Multiple Entities


It is also possible to have multiple entities in a single expression defined in the intent. It is
possible to create multiple entities for efficiently managing intents. There are two formats to
specify entities (type and value) in sentences/expressions. The entities can be specified in either
JSON or Markdown format. For example, we have intent named birth that has two entities
named city and month. The expressions will be written inside nlu.md file under the heading
intent birth as
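A representative expression for this intent in the Markdown format (the values June and Pune are those that appear in the result discussed below):

## intent:birth
- I am born in [June](month) at [Pune](city) city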

When it is executed in rasa shell with the input message given by the user as “I am born in June
at Pune city”, the result is

Explanation
We can observe from the result that two entities are shown named month and city having the
values as “June” and “Pune”, respectively. The starting position of June in the message is at
location 13 and the ending position is at location 17. Similarly, Pune starts at location 21 and
ends at location 25.

21.3.4 Multiple Values of Entity in Same Intent


In some situations, we need to define two values of the same entity in a single expression. For example, when we are creating a chatbot for travel assistance, we want the name of the city from where the user starts his/her travel and also the name of the destination. We will create an intent named “travel” to determine the names of the cities between which the individual is traveling, and an entity named city. We can write the expressions in the intent considering the names of the cities as values for the entity city. The expression and intent can be defined in the nlu.md file as
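A representative expression with two values of the city entity (a sketch consistent with the result discussed below):

## intent:travel
- I am traveling from [Chennai](city) to [Pune](city)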

The result produced when executed in the rasa shell after training only the nlu model is

Explanation
We can observe that for a single expression, the intent is shown as travel and both the values of
cities Chennai and Pune are shown for the entity named “city” with good confidence level.

21.3.5 Numerous Values of Entity


A chatbot generally has a lot of entity values for the system. For example, a flight chatbot will
require all the names of cities from where flights depart, a chatbot for a restaurant will require all the names of dishes, a chatbot for a clothing store will require the names of all the clothes, a chatbot for daily needs will require the names of all the fruits and vegetables, etc. However, because of the numerous values, it becomes difficult to enter all the different combinations of values in the specified intent as discussed in the above sections. In such a scenario, it is better to do training of the
data in some other manner. Rasa provides us the facility to provide the huge data as Markdown
or as JSON, as a single file, or as a directory containing multiple files.
JSON Format: This format consists of a top-level object called rasa_nlu_data, with the keys common_examples, entity_synonyms, and regex_features. The most important one is common_examples.
{
“rasa_nlu_data”: {
“common_examples”: [],
“regex_features” : [],
“lookup_tables” : [],
“entity_synonyms”: []
}
}
Markdown Format: Markdown is the easiest Rasa NLU format for humans to read and write.
Examples are listed using the unordered list syntax, for example, minus –, asterisk *, or plus +.
Examples are grouped by intent, and entities are annotated as Markdown links, for example,
[entity value](entity name).
The lookup_tables or regex features are generally used for creating numerous values of
entity.

21.3.5.1 Lookup Tables


Rasa provides different facilities to enter all the different values of the entity in a separate section
of lookup in the nlu.md file or in an external file. Both lookup tables and external files contain
lists of elements for which the user may require information; some elements should be specified
in the training data. It is important that the lookup tables must have entries in a newline-separated
format. When lookup tables are supplied in training data, the contents are combined into a large,
case-insensitive regex pattern that looks for exact matches in the training examples.

Chatito is a tool for generating training datasets, but it is suggested to use


lookup tables instead for a large number of entity values because Chatito might
lead to overfitting.

Example: We can add an intent named “order” having dish as an entity in the nlu.md file.
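A representative sketch of this intent (the two dish values are those named in the explanation below, and the phrasing follows the message used later in this section):

## intent:order
- Like to eat [Boiled Egg](dish)
- Like to eat [Palak Paneer](dish)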

Explanation
The intent order has one entity named dish, which in turn has two values, namely, Boiled Egg
and Palak Paneer. It should be noted that the changes should be made in the stories.md file
accordingly.

For using the facility of lookup table, all the names of numerous dishes can be stored in lookup
table named dishes.txt with a new line separator inside the data folder as
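A sketch of such a file, with one dish per line (the dish names are only illustrative, except Grilled Fish, which is used in the example later in this section):

Boiled Egg
Palak Paneer
Grilled Fish
Biryani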

Explanation
It is necessary to be careful when data are added to the lookup table. Unclean data may
affect the performance badly. Also, for lookup tables to be effective, there must be a few
examples of matches in training data. Otherwise, the model will not learn to use the lookup
table match features.

It should be noted that the lookup table is referred to in the nlu.md file along with the entity name as
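Following the description in the explanation below, the reference could look as follows:

## lookup:dish
data/dishes.txt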

Explanation
The txt file needs to be specified in the nlu.md file under the heading: two hash signs followed
by lookup word followed by colon with the name of the entity. In the above example, dishes.txt
inside the data folder is written inside the ## lookup:dish. This means that for determining the
dish entity, the possible values can be searched in dishes.txt, which exists in data folder.

Another approach for adding multiple values to a particular entity is to enter the elements
directly in nlu.md file as a list:
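A sketch of this approach, with each element preceded by a hyphen as noted in the explanation below (the dish names are illustrative):

## lookup:dish
- Boiled Egg
- Palak Paneer
- Grilled Fish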

Explanation
It is important to note that hyphen (-) is to be added before the name of the dish if it is specified
in the nlu.md file, else an error will be generated. Also, for lookup tables to be effective, there
must be a few examples of matches in training data. Otherwise the model will not learn to use
the lookup table match features.

The result generated when the user types the message “Like to eat Grilled Fish” is

Explanation
It can be observed that the message belongs to the intent named order with a highest confidence
of 0.80 and has the value Grilled Fish from the entity named dish.

If the files are showing some error in indentation that is not identified easily,
this means that code is indented with spaces in some places and with tabs in
others. To fix this, in Notepad++, go to Edit -> Blank Operations -> TAB to
Space (PEP 8 recommends using spaces versus tabs).

21.3.5.2 Regular Expression Features


Regular expressions can be used to support the intent classification and entity extraction. The
name does not define the entity or the intent; it is just a human readable description for you to
remember what this regex is used for and what is the title of the corresponding pattern feature.
For example, an entity may have a defined structure (e.g., zipcode and email address), and
regular expression can be used for the detection of that entity. The zipcode example can be
represented as
## regex:zipcode
- [0-9]{5}
In this example, the zipcode will be identified if there are any five digits from 0 to 9. If there is a
need to identify all those messages containing “hey” word as greet, regex expression can be used
as

Here, the greet intent will be identified if the sentence starts with the word hey. Thus, it could be concluded that regex features help to improve the intent classification performance to a great extent.

Regex features for entity extraction are currently only supported by the
CRFEntityExtractor component! Hence, other entity extractors such as
SpacyEntityExtractor will not use the generated features and their presence
will not improve entity recognition for these extractors. Currently, all intent
classifiers make use of available regex features.

In the nlu.md file, a regex was created for phone as
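Following the pattern of the zipcode example above, the phone regex could be written as (a sketch consistent with the explanation below):

## regex:phone
- [0-9]{10}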

Explanation
This means that the entity “phone” will be extracted if a combination of 10 digits from 0 to 9 is
found in the message entered by the user.

When the nlu model was trained, the following output will be generated when the user types the
message as “plz contact me at 4123556889”.

Explanation
We can observe that when the user entered any phone number as a combination of any 10 digits, it was extracted as the “phone” entity.
While creating a chatbot, actions are as important as entities. In this section, a chatbot related to user details is created, which has some entities and actions. For the chatbot discussed in Section 21.2 related to national symbols, we considered some utterances in the form of text that are triggered when the user enters a message. These utterances (simple text responses) were defined in the templates section of the domain.yml file. In an efficient and
interactive chatbot, the response generated is not always simple. There is a need to have some
complicated responses based on some processing; these are called actions. Here, we will
create a chatbot related to user details, which will have some actions also along with the
utterances.

21.3.6 nlu.md
Since the chatbot is related to user details, some common intents such as greet, username, birth,
and goodbye are created. The nlu.md file for a chatbot with some entities such as month and
name is created as

Explanation
The “greet” intent has some common expressions for greetings such as hello and good morning.
The username intent has three expressions, namely, I am…, Call me…, and My name is … .
This intent considers only one entity “name”. This entity is defined in the parentheses and all
the values such as Anita, Pearl, and Jahan are written in square brackets. We can observe that
two lookup sections are created in the nlu.md file: city and month. Since there are many cities, a txt file named “cities.txt” is created having the names of all the cities. The other lookup section, named month, has only 12 values, so they are included in the nlu.md file itself.

21.3.7 stories.md
For better understanding of the concept, only one story related to user details is created in this
file:
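A sketch of this story, reconstructed from the explanations in this section (the intents and the actions they trigger are those discussed for this chatbot):

## trial path
* greet
  - utter_greet
* username
  - action_first
* birth
  - utter_forecast
* goodbye
  - utter_goodbye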

Explanation
We can observe that the stories.md file has one story named “trial path”, which requires a
sequence of four messages by the user. These messages are in the following order of intents:
greet, username, birth, and goodbye. This further means that the user will enter the message
belonging to the “greet” intent, and then the user will enter the message belonging to
“username” intent, “birth” intent, and finally “goodbye” intent.

21.3.8 domain.yml
The domain file for this chatbot will have some actions, which will be defined in the actions.py
file. It is important to specify the name of the action (defined in actions.py) under the heading
“actions” in domain.yml. This file is created as

Explanation
We can observe that the actions heading has specified four actions: action_first, utter_goodbye, utter_greet, and utter_forecast. The task of the three utterances is defined in the templates with
corresponding text and the task of the action (action_first) is defined in the actions.py file.

21.3.9 actions.py
We know that the command “rasa train” will consider all the nlu.md, stories.md, and
domain.yml files for training. In order to execute actions.py file and determine the errors in the
actions.py file, the command “rasa run actions” needs to be executed.
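A sketch of what such a class could look like, reconstructed from the explanation that follows (the printed messages are the ones quoted there; the import path assumes the rasa_sdk package used with Rasa 1.x):

from rasa_sdk import Action

class ActionFirst(Action):
    def name(self):
        # must match the action name listed in the actions block of domain.yml
        return "action_first"

    def run(self, dispatcher, tracker, domain):
        # print() output appears on the action server window ("rasa run actions")
        print("This will be printed on Rasa server")
        # dispatcher messages appear in the window where "rasa shell" is running
        dispatcher.utter_message("Hello and Welcome to Rasa Actions")
        dispatcher.utter_message("When and where were you born?")
        return []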

Explanation
A class named ActionFirst is created; the name() function returns “action_first”. It is important to mention that the commands “rasa run actions” and “rasa shell” will be
executed in different windows. The basic concept is that the actions file is executed on the
server and is referred by the rasa shell. It should be noted that the result of print() can be seen
on the server, while the result of dispatcher.utter_message() is seen on the same window
where there is interaction with the user. This action when executed prints three statements. The
statement “This will be printed on Rasa server” will be printed on the server and “Hello
and Welcome to Rasa Actions” and “When and where were you born” will be printed on the
windows where the “rasa shell” command is executed.

Explanation
It can be observed that since the first message belongs to the “greet” intent, the text written in
“utter_greet” template is displayed that asks the user his/her name. When the user entered the
message as “Call me Jahan”, as per the instruction from the story file, the action named
“action_first” from the actions.py file is executed. It should be noted that the name() function in
the class “ActionFirst” defines the name of the action, which should be same as specified in the
actions block of the domain.yml file. Also, the run() function specifies the tasks that need to be
executed when this action is called upon. Thus, as specified in “action_first”, the action prints two statements for the user and one statement on the rasa server. Hence,
two statements, namely, “Hello and Welcome to Rasa Actions” and “When and where were
you born?” are displayed. Similarly, when the intent birth is met, the “utter_forecast” is
executed and accordingly displays the message as “Wow!! You have the best future.” Similarly,
when the user inputs the message as bye, the text defined in the template utter_goodbye is
displayed.

It is possible that incorrect results are produced. This is because the model has had less training. The user is suggested to provide proper training by using the command rasa interactive --endpoints endpoints.yml.

USE CASE
CHATBOT FOR ALZHEIMER’S PATIENTS

Chatbots are software applications that use written or spoken language for simulating an interaction with a human. Chatbots have become extraordinarily popular in recent years largely
due to tremendous technological advancements in the field of machine learning for NLP.
Chatbots process the text after identifying and interpreting the meaning of the text written by

user and determine a series of appropriate responses according to some predefined algorithms.
They have become better, smarter, more sophisticated, responsive, useful, and more natural.
Moreover, their utility keeps on increasing at a great rate.
Here, a chatbot is created for entering user details. A chatbot can be created that can store
all the information about the user and provide the same information to the user when he/she
needs it. It is important to mention that this can be particularly useful in medical world,
particularly for Alzheimer’s disease where the patients struggle with the short-term memory loss
and face problem in basic routine conversational interactions. Alzheimer’s disease is a
progressive disorder that causes brain cells to waste away (degenerate) and die. Alzheimer’s
disease is the most common cause of dementia – a continuous decline in thinking, behavioral,
and social skills that disrupts a person’s ability to function independently. The early signs of the
disease may be forgetting recent events or conversations. As the disease progresses, a person
with Alzheimer’s disease will develop severe memory impairment and lose the ability to carry
out everyday tasks. Alzheimer’s disease patients forget things and hence feel very hesitant to
ask many questions to their caretakers. Also, the caretakers of these patients might lose patience
in answering the same questions again and again. Chatbots can answer any number of questions
even if repetitive without any issue. Hence, these patients will not face any problem in
interacting with these chatbots. However, it is important to train the chatbots with a lot of stories
containing all the questions that can be asked for performing their day-to-day activities.
Human beings by nature feel the need to belong and stay connected to each other. Research
has identified that loneliness and rejection both activate the same parts of the brain as physical
pain. Chatbots can be efficient in reducing the loneliness of such patients through frequent interaction, thereby helping to reduce depression.

21.4 Creating Chatbot with Slots


Slots are memory of the chatbot. They act as a key–value store that can be used to store
information from the user and the outside world. There are different slot types for different
behaviors such as text slot, categorical slot, Boolean slot, float, list, and unfeaturized. They help
the bot to categorize and interpret user input. For example, if your user has provided city, there is
a text slot named city. It should be noted that the text slot only tells Rasa Core whether the slot
has a value or not; the value of a text slot (Mumbai or Pune) is irrelevant to the chatbot. The
policy does not have access to the value of slots and it receives a featurized representation. The
policy just sees a 1 or 0 depending on whether it is set. It is suggested to use categorical or
Boolean slot if the value of slot is considered to be important by the chatbot. A float slot should
be used to store a float value and an unfeaturized slot to store some data without affecting the
flow of the conversation.
It is important to understand that all the chatbot slots need to be defined and filled after the chatbot's intents have been identified. For example, in a restaurant search chatbot, the assistant will ask the user for the city if the city is not known from the intent. Well-defined, well-structured, and properly filled slots improve the effectiveness of the chatbot and the user experience. In this section, we have created a chatbot for providing information related to restaurants based on the location and choice of cuisine given by the user. We have taken the assistance of a Zomato API key for displaying the information of restaurants.

21.4.1 nlu.md
Since the chatbot is related to providing restaurant information for food services, an important intent related to restaurant search should be created. Hence, along with the other common intents, the nlu.md file has one important intent named restaurant_search.

Explanation
Here, common intents, namely, greet, goodbye, affirm, deny, and thanks, were created with the
common messages. An important intent, namely, restaurant_search, was created considering
two main entities: city and cuisine. This intent has 12 different expressions that can be given by
the user and these expressions will correspond to the same intent named restaurant_search.
Since there will be many cities and many cuisines, the chatbot uses the concept of lookup for
city and cuisine entity. The cities.txt file contains the list of all the cities and cuisines.txt
contains the list of all the cuisines.
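
The nlu.md listing itself is not reproduced here. A minimal sketch consistent with the description above might look like the following; the exact example sentences are illustrative, and the assumption that lookup tables can point to the cities.txt and cuisines.txt files follows the Rasa 1.x Markdown training-data format:

## intent:greet
- hi
- hello there

## intent:restaurant_search
- show me some restaurants in [delhi](city)
- i am looking for [biryani](cuisine) places in [mumbai](city)
- any good [chinese](cuisine) restaurants in [pune](city)?

## intent:goodbye
- bye
- see you later

## lookup:city
data/cities.txt

## lookup:cuisine
data/cuisines.txt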

21.4.2 stories.md
For effective understanding of the chatbot with slots, we have created only two stories, namely, story1 and story2. In story1, the user is satisfied with the restaurant search; in story2, the user is not satisfied with the first search and searches again. Both these stories have considered the values of slots for their actions during the intent identification process.

Explanation
The stories.md file defines two user-defined stories: story1 and story2. Story1 had four messages in the order: greet, restaurant_search, affirm or thanks, and goodbye. Story2 had five messages in the order: greet, restaurant_search, deny, restaurant_search, and affirm or thanks. It should be noted that story1 defined the values of the two slots city and cuisine as delhi and biryani, respectively. When the restaurant_search intent is identified, one action named “action_zomato_search” and one utterance named “utter_did_that_help” will be executed. If the user then prompts a message that belongs to either of the two intents, affirm or thanks, the chatbot will execute “utter_gratitude”. When the last message belongs to the goodbye intent, the “utter_goodbye” template will be executed.
The second story considered slots two times because the user may not be satisfied with the results produced by the first restaurant search. Initially, the values considered were lucknow (intentionally kept in lower case) for city and burger for cuisine. When the user denies that the results were helpful, the user enters the message again. The second search considered Chennai as the value for the city slot and Mughlai for the cuisine slot. The restaurant_search intent needs to execute the action named “action_zomato_search,” which will be defined in the actions.py file.
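
The stories.md listing is likewise not shown above. Based on the explanation, a hedged sketch of the two stories in the Rasa 1.x Markdown story format might look like this; the bot response to the deny intent is omitted because its template name is not given in the text:

## story1
* greet
  - utter_greet
* restaurant_search{"city": "delhi", "cuisine": "biryani"}
  - action_zomato_search
  - utter_did_that_help
* affirm OR thanks
  - utter_gratitude
* goodbye
  - utter_goodbye

## story2
* greet
  - utter_greet
* restaurant_search{"city": "lucknow", "cuisine": "burger"}
  - action_zomato_search
  - utter_did_that_help
* deny
* restaurant_search{"city": "Chennai", "cuisine": "Mughlai"}
  - action_zomato_search
  - utter_did_that_help
* affirm OR thanks
  - utter_gratitude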

If a story is added while training on the user data by using the command rasa interactive, a story named interactive_story_1 will be automatically added to the stories.md file.

21.4.3 domain.yml
This file will list all the six intents, two entities specified in the nlu.md file, two slots specified in
the stories.md file, four utterances defined in the templates section, and six actions/utterances
specified in the action section.

Explanation

It can be observed that the action section in the domain.yml file listed one action and four
utterances. The task of these utterances is described in the templates section. The utter_goodbye
template displays two text messages. There are two slots named cuisine and city.
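
A corresponding domain.yml might be organized as sketched below. Only the structure (intents, entities, slots, actions, and templates) follows the description above; the template texts themselves are placeholders:

intents:
  - greet
  - goodbye
  - affirm
  - deny
  - thanks
  - restaurant_search

entities:
  - city
  - cuisine

slots:
  city:
    type: text
  cuisine:
    type: text

actions:
  - action_zomato_search
  - utter_greet
  - utter_did_that_help
  - utter_gratitude
  - utter_goodbye

templates:
  utter_greet:
    - text: "Hello! How can I help you?"
  utter_did_that_help:
    - text: "Did that help you?"
  utter_gratitude:
    - text: "Glad I could help."
  utter_goodbye:
    - text: "Thank you for using the bot."
    - text: "Bye, have a nice day!"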

21.4.4 actions.py
The file will contain the tasks that need to be executed when the complicated and important
action described in the domain.yml and stories.md file is executed. It is known that the name of
the action is “action_zomato_search”, hence a class is created by the name ActionZomatoSearch
and the function name() of this class returns the same action (“action_zomato_search”). It should
be noted that it is important to get the API key from Zomato by requesting the developers team
of Zomato to provide a free API key. This key is used in our actions.py file to fetch the details of
restaurants on the basis of the location entered by the user.

Explanation
In the above actions.py file, four functions are created – get_lat_lon(), run(), name(), and
rest_search(). The name() function is basically used to name the action, which is listed in the
stories.md and domain.yml files. The run() function is the entry point for the execution of the action. It should be noted that all the other functions defined in the actions.py file are called inside the run() function using self followed by the function name; that is, the get_lat_lon() function is called using self.get_lat_lon() and rest_search() is called using self.rest_search().
The rest_search() function takes restaurants as a parameter and creates a structure named rest_data containing five lists named “name”, “cuisines”, “address”, “rating”, and “cost”. In the run() function, discussed later, a query is executed for five restaurants. Since the rest_search() function takes these five restaurants as input, a “for” loop is executed to fetch the results of all five restaurants and store them in rest_data. It should be noted that the Zomato response has many parameters such as “id”, “phone_number”, and “votes”, but we have considered five important parameters to be displayed. In the Zomato response, these parameters are stored as name, cuisines, address, user_rating, and average_cost_for_two, while the user-defined list names are name, cuisines, address, rating, and cost, respectively. Thus, the statement rest_data['cost'].append(str(rest['average_cost_for_two'])) converts the value stored in “average_cost_for_two” to string format and stores it in “cost”. The rest_search() function then returns the complete data structure named “rest_data” to the calling run() function, which is discussed later.
The get_lat_lon() function returns the latitude and longitude of the location, which is
given as an input argument to the function. The zomato_api_key is used to get latitude and
longitude of the location. However, we have set default values of latitude and longitude of
Mumbai in the variables named latitude and longitude, in case it is difficult to trace the location.
It should be noted that if the status code of req_data is 200, the request is considered to be successful; this means that the latitude and longitude of the location are identified. The function returns the values of latitude and longitude in string format to the calling function.
The run() function is the main function of the class, which is executed when the action
named in the name() function is called from the stories.md and domain.yml files. This function
tracks the values of the two slots named cuisine and city (defined in the nlu.md and domain.yml files). The print statement prints the output on the server, while dispatcher.utter_message displays the output on the chatbot interaction screen. Thus, when the two slots named cuisine and city are identified, the result is displayed as “Wanted to search…”. Once the city slot is identified, the function get_lat_lon() is called to fetch the values of latitude and longitude. If the value of the API key is “”, the user gets a message asking to obtain the API key from the Zomato developer team.

A string named zomato_url is created as api_url + 'search?q=' + cuisine + '&lat=' + latitude + '&lon=' + longitude + '&sort=cost' + '&radius=2000' + '&count=5' + '&order=asc'. This URL is given as an input to the requests.get() function to get the desired results. Thus, the search is based on the cuisine value entered by the user, the latitude and longitude of the city, and a radius of 2000. The results are sorted by cost in ascending order and five restaurants are searched.
The condition req_rest_data.status_code == 200 means that the request is successful and the information of restaurants is fetched. The next step is then to determine whether restaurants are found or not. If no restaurants are found (len(restaurants) == 0), then a message is displayed asking the user to change the cuisine and city for producing the desired results. If the number of restaurants found is greater than 0, then the rest_search() function is called to return the information of the restaurants. The information of each and every restaurant is printed using a “for” loop. A try block is used to catch the exception in case of an error. In case of network or other problems, the value of status_code will not be equal to 200. When the search could not be executed, the message “Request is unsuccessful” is displayed on both the server and the screen where the chatbot interaction is taking place.
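
The full actions.py listing is not reproduced above. The following is a minimal, hedged reconstruction based on the explanation; the Zomato endpoint paths, the request header name, and the JSON fields accessed are assumptions and would need to be checked against the actual Zomato API documentation:

import requests
from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher

zomato_api_key = ""                                   # obtain a free key from the Zomato developer team
api_url = "https://developers.zomato.com/api/v2.1/"   # assumed base URL

class ActionZomatoSearch(Action):
    def name(self):
        # Must match the action listed in stories.md and domain.yml
        return "action_zomato_search"

    def get_lat_lon(self, location):
        # Default to Mumbai's coordinates in case the location cannot be traced
        latitude, longitude = "19.0760", "72.8777"
        req_data = requests.get(api_url + "locations?query=" + location,
                                headers={"user-key": zomato_api_key})
        if req_data.status_code == 200:
            data = req_data.json()
            if data.get("location_suggestions"):
                latitude = str(data["location_suggestions"][0]["latitude"])
                longitude = str(data["location_suggestions"][0]["longitude"])
        return latitude, longitude

    def rest_search(self, restaurants):
        # Collect the five displayed fields for each restaurant returned by the search
        rest_data = {"name": [], "cuisines": [], "address": [], "rating": [], "cost": []}
        for r in restaurants:
            rest = r["restaurant"]
            rest_data["name"].append(rest["name"])
            rest_data["cuisines"].append(rest["cuisines"])
            rest_data["address"].append(rest["location"]["address"])
            rest_data["rating"].append(str(rest["user_rating"]["aggregate_rating"]))
            rest_data["cost"].append(str(rest["average_cost_for_two"]))
        return rest_data

    def run(self, dispatcher: CollectingDispatcher, tracker: Tracker, domain):
        cuisine = tracker.get_slot("cuisine")
        city = tracker.get_slot("city")
        dispatcher.utter_message("Wanted to search {} restaurants in {}".format(cuisine, city))
        if zomato_api_key == "":
            dispatcher.utter_message("Please get an API key from the Zomato developer team.")
            return []
        latitude, longitude = self.get_lat_lon(city)
        zomato_url = (api_url + "search?q=" + cuisine + "&lat=" + latitude + "&lon=" + longitude
                      + "&sort=cost" + "&radius=2000" + "&count=5" + "&order=asc")
        try:
            req_rest_data = requests.get(zomato_url, headers={"user-key": zomato_api_key})
            if req_rest_data.status_code == 200:
                restaurants = req_rest_data.json()["restaurants"]
                if len(restaurants) == 0:
                    dispatcher.utter_message("No restaurants found; please change the cuisine or city.")
                else:
                    rest_data = self.rest_search(restaurants)
                    for i in range(len(rest_data["name"])):
                        msg = "{} | {} | {} | rating {} | cost for two {}".format(
                            rest_data["name"][i], rest_data["cuisines"][i], rest_data["address"][i],
                            rest_data["rating"][i], rest_data["cost"][i])
                        print(msg)                      # printed on the action server
                        dispatcher.utter_message(msg)   # shown on the chatbot interaction screen
            else:
                print("Request is unsuccessful")
                dispatcher.utter_message("Request is unsuccessful")
        except Exception:
            print("Request is unsuccessful")
            dispatcher.utter_message("Request is unsuccessful")
        return []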

When the chatbot is executed using the command “rasa shell”, the following results are
obtained:

Explanation
We can observe that the first intent was greet, hence “utter_greet” template was executed.
When the system found the message belonging to the restaurant_search intent, an action named
“action_zomato_search” was executed. Thus, the information of five restaurants was displayed.
When “utter_did_that_help” was executed and the user prompted a message belonging to the thanks intent, utter_gratitude was executed. The last message entered by the user was Bye bot, which belonged to the goodbye intent, and hence the two messages belonging to the “utter_goodbye” template are displayed. The interaction with the user stops thereafter.

The results produced in the actions server are as follows:

Explanation
It can be observed that the statements that are written inside the print statement are printed on
the server.

When the chatbot was executed again using the command “rasa shell”, the following results
were obtained:

Explanation
We can observe that the above execution of the chatbot followed the path of story2; hence, when the user denies that the search helped, it responds to the user with the message “Plz enter another choice…”. When the user enters the choice of Mughlai food in Chennai, it displays the five restaurants in Chennai, and when the user says the query helped him with the restaurant search, the chatbot responds with the message specified in the “utter_gratitude” template and finally says bye to the user.

When the chatbot was executed again using the command “rasa shell”, the following results
were obtained:

Explanation
It can be observed that the code in the program sets the default value of the cuisine slot to north Indian and the city slot to Mumbai. Hence, when the cuisine slot is absent from the message given by the user, restaurants are searched according to the value of the city slot (here “Mumbai”) for north Indian cuisine and the result is displayed accordingly.

Create a chatbot using the slots for displaying the geographic information
about the states of India.

USE CASE
CHATBOT FOR MARKETING

Chatbots can create an important impact on marketing activities because users themselves opt to interact with chatbots. According to research, 57% of customers prefer live chat to get their queries answered.
provided and can hence help in promoting the brand by interacting with people regarding the
brand benefits, discounts, and other promotional offers. Chatbots can help in improving
customer service experience by providing quick response, accurate results, and an answer to all
their queries. Also, chatbots can help in announcing a new offer or a new product launch.
Chatbots are a much more personalized way to gather meaningful information from valuable
sources through a survey. The best thing is that users never feel like they are going through a
lengthy survey. Using the data, businesses can get an idea of the consumer mindset and the
buying pattern of the consumers, which can help them to understand the market value of the
product and so they can plan their digital marketing strategies accordingly. This will help to
save money and time by not hiring additional resources.
According to research, 71% of customers prefer personalized ads and customers love
personalization across the board. Many chatbots can be programmed to take data from your
users and turn that into a personalized experience. Chatbots can handle proactive customer
engagement and send follow-up messages. Like e-mail campaigns, they can also help in
automating and scheduling marketing messages to build brand awareness in a better and
effective manner. Chatbots can also be added as an extension on various social media platforms
for basic level enquiries.
Chatbots can also be extremely useful in registration of events organized for promotion of
products/services. Since a chatbot is a versatile conversion tool, it is possible to customize it for all phases, including sign-ups for the event. It can prove to be an asset in getting better open and response rates because it is difficult to make people sign up via e-mail or click-throughs. Besides, after the users have signed up, reminders for attending the event can be set, which will help to increase the number of attendees at the event.
In other words, chatbots can be considered as an effective and efficient digital marketing
tool.

21.5 Creating Chatbot with Database


A table in a database has many records and different fields. It becomes easy for a chatbot to
perform a query on the fields that are listed in the table of the database depending on the
information entered by the user. A useful and an efficient chatbot would have connectivity to the
database that is created in any software (Figure 21.4).

Figure 21.4 Chatbot with database.

In this section, a chatbot is created for online bus services website that will provide
information related to the availability of buses from different cities and through different travels.
A dataset providing travel information was created with the following details. It should be noted here that only a few fields and records were considered for easier understanding by the reader. The number of fields and records can be increased to increase the effectiveness of the chatbot. The
efficiency of the chatbot can be made better by increasing the fields related to quality of travel
desired, type of bus, number of vacant seats, etc.

Source      Destination   Agency           Fare
Pune        Mumbai        Raj Travels      1000
Mumbai      Bengaluru     Ashoka Travels   2000
Delhi       Mumbai        Vrl Travels      3000
Mumbai      Pune          Srk Travels      4000
Pune        Mumbai        Raj Travels       800
Delhi       Pune          Raj Travels      4000
Mumbai      Pune          Raj Travels       900
Bengaluru   Mumbai        Ashoka Travels   1200
Mumbai      Delhi         Vrl Travels      3000
Pune        Mumbai        Srk Travels       800
Mumbai      Pune          Raj Travels       700
Pune        Delhi         Raj Travels      3400

21.5.1 nlu.md
An important intent for bus services chatbot is related to the search of services from specified
cities or travels according to the requirement by the user. The nlu.md file for online bus services
will have primarily the following intents:

Explanation
We can observe that two entities named “name” and “city” are created. This is obvious because
the bus service will primarily need to have two basic information related to city of origin and
destination to which travel is required and the name of the travels by which the person wants to
travel. However, some more information related to date and timing when the service is required
can be used in the chatbot. But, for understanding the utility of slots properly, we have included
only two fields, namely, name and city. Since there are many cities, the lookup values for city
are included in the file named cities.txt stored in the data folder. There are six travels associated
with this website; hence, we can include all these six names in the lookup section of nlu.md file.

21.5.2 stories.md
For easy understanding, two basic stories named path1 and path2 were created. The first story identified three intents in the order greet, search, and thanks; the second story identified three intents in the order greet, deny, and thanks.

Explanation
We can observe that the path of the first story had only three messages by the user: greet, search, and thanks. When the chatbot identifies the “greet” intent from the message, the utter_greet template is executed; identification of the search intent executes action_travel_search, which is described in the actions.py file; and identification of the “thanks” intent executes the utter_thanks template. It should be noted that the tasks of these utterances are described in the domain.yml file.
The path of the second story also identified three intents in the order: greet, deny, and thanks. When the chatbot identifies the “greet” intent from the message, the utter_greet template is executed; identification of the “deny” intent executes the utter_deny template; and identification of the “thanks” intent executes the utter_thanks template. It should be noted that the tasks of these utterances are described in the domain.yml file.

21.5.3 domain.yml
This file basically specifies two entities and slots: city and name. Since the chatbot is related to
the bus services information, hence one important action named action_travel_search is included
in the actions section of the domain.yml file. The task of this complicated and important action
will be defined later in the actions.py file.

Explanation
The actions section of the domain file has basically one action and three utterances. All these
utterances are described in the templates section of the file. The text corresponding to each
utterance is described in templates. There are two entities named city and name and four intents
named search, thanks, greet, and deny. We can observe that there are two slots named city and
name, which are text slots. Hence, it is important for the chatbot to fetch the information of
these slots. However, if the information is not available, a null value will be stored in the slot.
As discussed earlier, the value of the slots is irrelevant. Three utterances defined in the actions
section are described in the template section along with the text message that needs to be
displayed.

21.5.4 actions.py
Since the chatbot is related to providing the details of the bus services, hence this file will
basically have a class with some important actions meant for providing information of the bus
services. It is known that the information of the buses is stored in the csv file, and SQL queries
are needed to extract the information from the dataset. The functions listed in the action file are
related to fetching the information from the dataset.

Explanation
It can be observed that the data from travelinfo.csv is read and stored in the variable named travel. The command travel.to_sql('travelsql', con=engine) converts the travel variable into an SQL table named travelsql using the SQL engine. We know that we have created two slots named city and name in the domain.yml file, which are tracked by the tracker. Hence, when this action is executed, both these slots will be tracked from the user’s message. It should be clear that if the chatbot is able to track the slot, it will save the value to the variable; else the tracker will return the value “None” to the variable.
The next section executes the SQL query on the basis of the value of the slots. Thus, if the
name of agency is None, the query will be executed on the basis of city only; if city is None, the
query will be executed on the basis of agency only; and if none of the slots is left blank, the
query will be operated on the basis of both the slots. Since the query can return multiple
records, a “for” loop is executed to fetch all the records. We know from the table that the first column is source; hence the command source=row[1] stores the city from where the bus starts. The command destination=row[2] stores the destination city, the information of the third column is stored in agency=row[3], and the information of the fourth column is stored in fare. The details variable stores the information of all the columns and prints it on the server in the form of a statement.
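
The actions.py listing for the travel chatbot is not reproduced above. A minimal sketch consistent with the explanation might look like the following; the action name and the file name travelinfo.csv follow the text, while the in-memory SQLite engine, the SQLAlchemy 1.x style engine.execute() call, and the column names taken from the table above are assumptions:

import pandas as pd
from sqlalchemy import create_engine
from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher

# Load the travel dataset and expose it as an SQL table named travelsql.
engine = create_engine("sqlite://")            # in-memory SQLite engine (assumption)
travel = pd.read_csv("travelinfo.csv")         # columns: Source, Destination, Agency, Fare
travel.to_sql("travelsql", con=engine)         # the DataFrame index becomes the first column

class ActionTravelSearch(Action):
    def name(self):
        return "action_travel_search"

    def run(self, dispatcher: CollectingDispatcher, tracker: Tracker, domain):
        city = tracker.get_slot("city")        # None if the slot could not be tracked
        name = tracker.get_slot("name")

        # Build the query depending on which slots were filled.
        if name is None:
            query = "SELECT * FROM travelsql WHERE Source='{}'".format(city)
        elif city is None:
            query = "SELECT * FROM travelsql WHERE Agency='{}'".format(name)
        else:
            query = "SELECT * FROM travelsql WHERE Source='{}' AND Agency='{}'".format(city, name)

        for row in engine.execute(query):      # SQLAlchemy 1.x style execution
            source = row[1]                    # row[0] is the index column written by to_sql
            destination = row[2]
            agency = row[3]
            fare = row[4]
            details = "{} to {} by {} at fare {}".format(source, destination, agency, fare)
            print(details)                     # printed on the action server
            dispatcher.utter_message(details)  # shown to the user
        return []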

The following results were obtained when the chatbot was executed after using the command
rasa shell:

Explanation
When the user types the message need your support for traveling from Delhi, all the buses available from Delhi are listed. From the results, we can observe that since the user has not specified the name of the travel agency, the query was executed on the basis of city only and the results were produced accordingly.
It should be noted that when incorrect results were produced by the chatbot assistant, training of the model was done using the command “rasa interactive --endpoints endpoints.yml” and the model was executed again to produce satisfactory results.

The following results were generated when the chatbot was executed after training the model and
using the command “rasa shell” for executing the chatbot.

Explanation
From the results, we can observe that when the user asked services from Mumbai by Raj
Travels, it lists down all the details of the buses from Mumbai by Raj Travels. It can be
observed that there are only two buses with fare as 900 and 700 operating from Mumbai by Raj
Travels.

It should be noted that the following stories were automatically added to the stories.md file when
the interactive training was provided based on the interaction with the user.

Explanation
It can be observed that when the story was created through the interaction, the slot values entered by the user are automatically written along with the intent. Hence, every time a story is trained with different values of slots, a new story is created. Although the same three intents appear in the two stories and they follow the same order, two separate stories are created for different values of slots. The first story is created considering only the city slot and the second story is created considering both the city and name slots. It is important to note that the user should always create more interactive stories for an efficient chatbot.
In the above section, a chatbot for online bus services was created where the information
related to the bus from source to destination is displayed. The chatbot also displays the results if
someone wants to know the bus services provided by a particular travel. However, for effective
understanding of the chatbot with respect to the database, simple chatbot with less data of travel
agencies and lesser number of features related to bus services was created. The efficiency of the
chatbot can be made better by adding information about more travel agencies and increasing the
fields related to quality of travel desired, type of bus, seats vacant, etc., in the database.

Create a database for storing information related to organizations. Create a chatbot for providing the details of the organization, vacancy, job profile, region, salary, etc.

USE CASE
CHATBOTS FOR SERVICE INDUSTRY

Economists divide all economic activities into two broad categories, goods and services. Goods-
producing industries are agriculture, mining, manufacturing, and construction. Service industry
creates services rather than tangible objects. The service industry is very wide in its nature. It
covers a large range of activities that add value to businesses and individuals, but the output is
not a physical product, instead this industry enhances, maintains, repairs, shapes, and performs
different alterations to physical items. It comprises banking and financial services; warehousing
and transportation services; tourism and hospitality services; information services; securities
and other investment services; professional services; waste management; healthcare and social
assistance; arts, amusement, and recreation; legal services; private education; social services,
membership organizations; and engineering and management services.
Different chatbots with respect to different services can be created after creating a database
related to that particular service. Similar to the chatbot created for the bus services, other
chatbots can be created for other services. For example, the tourism dataset can have
information related to the main attractions of each and every place and the cost to visit these
attractions. A chatbot can be created to display the attractions of all these places along with the
cost involved depending on the user choice. Similarly, amusement and recreation database might
store the information related to the movie name and theatre name along with the cost involved. A
chatbot can be created for displaying the theatre where a particular movie is shown or on the
basis of which all shows are possible in that theatre. Similarly, information related to hospitals
and doctors, their expertise can be stored in the dataset for the hospital. Information can be
fetched using the chatbot for the name of doctors specializing in a particular area, the name of
the hospitals that provide medical services related to a particular field, the hospitals where
particular doctor visits, etc. Chatbot with respect to the bank can be created for providing the
information related to the branches of the bank for the desired location. This chatbot can also
provide assistance to all the banking services.

21.6 Creating Chatbot with Forms


Forms collect valuable data from the conversation with the chatbot. It is known that slots help to collect a few pieces of information from a user in order to do something (book a flight, call an API, search a database, etc.). But, if we need to collect multiple pieces of information in a row, a FormAction is created. This is a single action that contains the logic to loop over the required slots and ask the user for this information. A form in a chatbot is hence able to collect information from multiple fields such as name, email, location, and phone, with appropriate custom validation using a proper API.
It is important to include the FormPolicy in the config.yml file if the chatbot has a form. It is required to add “FormPolicy” for activating forms by adding the entry name: “FormPolicy” in the policies section. An error will be generated if the policy is not included. The form is specified in the story section under the proper intent. For example, a form named “library_form” will be executed when the intent named “request_library” is found.

Here, library_form is the name of our form action. The form gets activated with the command
form{"name": "library_form"} and gets deactivated again with the command form{"name":
null}. It is important to note that when a form is defined, it should be added to the domain file
also and slots should be unfeaturized for this story. If any of these slots are featurized, story
needs to include slot{} events to show that these slots are set. In this scenario, the easiest way to
create valid stories is using the concept of interactive learning.
It is necessary to define three important methods for using forms (a minimal skeleton combining them is sketched after the list below):
1. name(): This function returns the name of the form action, here “library_form”.
2. required_slots(): This function returns the list of slots that need to be filled for the submit method to work.
3. submit(): This function describes what has to be done at the end of the form, when all the slots have been filled.
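
Putting the three methods together for the library_form example mentioned earlier, a minimal FormAction skeleton might look like the following; the import path corresponds to the Rasa 1.x SDK, and the slot names used here are purely illustrative:

from rasa_sdk.forms import FormAction

class LibraryForm(FormAction):
    def name(self):
        # The form name referenced in stories.md and domain.yml
        return "library_form"

    @staticmethod
    def required_slots(tracker):
        # Slots that must be filled before submit() runs (illustrative names)
        return ["member_name", "city", "phone"]

    def submit(self, dispatcher, tracker, domain):
        # Called once all the required slots are filled
        dispatcher.utter_message("Thanks, your library membership request has been recorded.")
        return []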
In this section, a chatbot is created for a website that provides membership in either a library,
gym, or hotel.

21.6.1 nlu.md
An important intent related to membership request by the user needs to be created in the nlu.md
file for providing membership. The nlu.md file for this chatbot will be given as

Explanation
This chatbot provides membership for three different places: hotel, library, and gym. Hence, in
the file, along with the common intents related to thanks, deny, greet, etc., an important intent
named request_membership is created. This intent is basically meant for understanding the
requirement of the user related to the membership of library, hotel, or gym in a particular city.
Thus, the intent considers five entities, namely, city, phone, myname, mem_type, and category.
The chatbot provides assistance related to three categories, hence these are added in the intent
section only and a separate lookup is not created. Since there can be multiple values of city, an external file named cities.txt is created containing all the names of the cities. Since there can be infinite names (values) for the entity named “myname”, we have just considered five names in the myname lookup. Similarly, since there are numerous 10-digit values possible for phone, a regex feature is used to define the different values of the entity named “phone”. Since the entity named “mem_type” has only two values, general and special, there is no need to create a lookup or use regex features, and hence they are described directly in the expressions.
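
A hedged sketch of the corresponding nlu.md entries is shown below; the example sentences and lookup names are illustrative, and the regex pattern simply stands for any 10-digit phone number:

## intent:request_membership
- i am [Pearl](myname) and want a [library](category) membership in [Mumbai](city)
- wanted [special](mem_type) membership of [hotel](category) in [Pune](city), my number is [9876543210](phone)
- need a [gym](category) membership

## regex:phone
- [0-9]{10}

## lookup:myname
- Pearl
- Rahul
- Priya
- John
- Meera

## lookup:city
data/cities.txt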

21.6.2 stories.md
Initially, only one story named “path1” was created and gradually stories were added through the rasa process of creating interactive stories. These stories are created during the training provided after using the command “rasa interactive --endpoints endpoints.yml”.

Explanation
A story named “path1” is created with three intents in the order “greet”, “request_membership”, and “thankyou”. An interactive story was then created by training the model. It should be noted here that when we execute the command for interaction, the assistant asks for confirmation step by step and then finally creates the story by the name interactive_story_1. The first interactive story, named interactive_story_1, is created in the stories.md file with only two intents, “greet” and “request_membership”. It is important to understand that when another new story is trained, it will automatically be stored as interactive_story_2 in the stories.md file.

21.6.3 domain.yml
The domain.yml file contains an additional section named forms along with the other sections that were discussed above. The name of the form is “membership_form”, and there is an important intent named “request_membership” inside the intents section.

Explanation
Entities, intents, and slots have already been discussed previously. This file has a new section named forms that includes the name of the form, which is also used in the actions.py file. Another important thing that can be observed is that, unlike the earlier sections, the number of utterances listed in the actions section is not the same as the number of utterances in the templates section. The number of actions specified in the actions section is 3, while there are 13 utterances defined in the templates section. This is because some utterances are defined with respect to the slots. It
is important to note that the utterances for respective slots are carried out automatically by the
rasa form if the slots are not filled. The three utterances defined in actions section include
utter_greet, which will be executed when the greet intent is identified; utter_thanks, which will
be executed when the thanks intent is identified; and utter_values, which will be executed when
the intent named request_membership is identified. It is important to understand that the
utterances defined in the templates section can also be called from the actions.py file using
dispatcher.utter_template(utterance name) or dispatcher.utter_message(utterance name).
The name of the utterance created for slots should start with utter_ask_(slot name). This
further means that since we have six slots, namely, myname, city, category, mem_type, phone,
and comments, the name of the utterances will be utter_ask_myname, utter_ask_city, etc. These
templates will be automatically called if any slot is not filled by the input given by the user. For
example, if the user gives an input message in which the information related to two slots, namely myname and city, is given, the chatbot will execute the utterances corresponding to the other slots such as category, mem_type, phone, and comments. Since the two slots “myname” and “city” were already identified, the utterances for these respective slots will not be executed.
display the statement as “please provide the category: library, hotel, gym…” for determining
the name of category and similarly it will display “plz share your contact number?” for phone,
“Do you want a special membership” for mem_type and so on.
There are four more utterances defined in the templates section that are related neither to the actions nor to the slots. These include utter_wrong_mem_type, utter_wrong_category, utter_submit, and utter_default. These will be called from the actions.py file using dispatcher.utter_message() or dispatcher.utter_template(), discussed in the next section. The utter_wrong_mem_type and utter_wrong_category will be called when the user has entered a wrong choice of membership type or category, respectively. The utter_submit is executed when the form has been completely filled.
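
The relevant portions of such a domain.yml might be structured as below; the template texts only approximate the messages quoted in the explanation, and the remaining entries are a sketch:

forms:
  - membership_form

slots:
  myname:
    type: unfeaturized
  city:
    type: unfeaturized
  category:
    type: unfeaturized
  mem_type:
    type: unfeaturized
  phone:
    type: unfeaturized
  comments:
    type: unfeaturized

templates:
  utter_ask_myname:
    - text: "what is your good name?"
  utter_ask_city:
    - text: "which city are you in?"
  utter_ask_category:
    - text: "please provide the category: library, hotel, gym"
  utter_ask_phone:
    - text: "plz share your contact number?"
  utter_ask_mem_type:
    - text: "Do you want a special membership?"
  utter_ask_comments:
    - text: "any comments?"
  utter_wrong_category:
    - text: "Please choose a valid category."
  utter_wrong_mem_type:
    - text: "Please choose either special or general membership."
  utter_submit:
    - text: "Glad!! Task is successfully done"
  utter_default:
    - text: "Sorry, I didn't get that."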

21.6.4 actions.py
The actions.py file for this chatbot has four main functions, namely, name(), required_slots(), slot_mappings(), and submit(). It should be clear that any function whose name starts with the word “validate” is automatically executed to validate the slot whose name is written after the validate keyword. We have defined such functions for validating the membership type and category; thus, the two functions defined are validate_mem_type() and validate_category(). One additional function named category_data() is created that returns the names of all the categories defined in this chatbot. The function validate_category() refers to this function for determining whether the category entered by the user belongs to the list of categories specified in it. However, since there are only two types of membership, “general” and “special”, we have not created a separate list of membership types; the function validate_mem_type() itself validates the membership choice entered by the user.

Explanation
The most important default functions of form are required_slots() and slot_mappings().
The function required_slots() returns a list of all the slots that we need to fill. Thus,
whenever the form is executed via the story, the required_slots() function is executed. If a
particular slot is not found, the template utter_ask_slotname is automatically called and the text
is displayed accordingly. This is done for all the slots specified in the return list of this function.
The slot_mappings() function defines the criterion for mapping each slot. The command "myname": [self.from_entity(entity="myname", intent="request_membership"), self.from_text()] specifies that the myname slot can be filled either from the entity named myname in the request_membership intent or from plain text entered by the user. The command "category": self.from_entity(entity="category", intent="request_membership") specifies that the category slot is filled from the entity named category in the request_membership intent. The command "city": [self.from_entity(entity="city", intent="request_membership"), self.from_text()] specifies that the city slot can be filled either from the entity named city in the request_membership intent or from text entered by the user. The command "phone": [self.from_entity(entity="phone"), self.from_text()] says that it can be either the entity named phone or plain text. The command "mem_type": [self.from_entity(entity="mem_type", intent="request_membership"), self.from_intent(intent="affirm", value="special"), self.from_intent(intent="deny", value="general")] specifies that it can be the entity mem_type from the request_membership intent; alternatively, if the message belongs to the affirm intent (the user enters yes), the value is taken as special, and if it belongs to the deny intent, the value is taken as general. Similarly, the command "comments": [self.from_intent(intent="deny", value="No comments"), self.from_text(not_intent="affirm")] says that if the comments message belongs to the deny intent, the slot gets the value "No comments", and otherwise the text entered by the user is considered as long as the message does not belong to the affirm intent.
It should be noted that functions named validate_<slotname>() are automatically executed to validate the respective slot. Thus, in our actions.py file, we have defined two such functions: validate_category() for validating the category and validate_mem_type() for validating the membership type. Since such functions are not defined for the other slots, validation is done only for these two slots. The validate_category() function converts the value to lower case and checks whether the value exists in the list of all the categories returned by the function named category_data(). If the check is successful, it assigns the value to the slot category; an unsuccessful validation assigns the value “None” to the category.
The validate_mem_type() function checks whether the word “special” or “general” is contained in the message. If the word “special” exists, then the value “special” is allotted to mem_type; if it contains the word “general”, then the value “general” is allotted to mem_type; else the message contained in the template utter_wrong_mem_type is displayed.
When all the slots are filled and validated, the submit() function is called by default. Since our submit() function calls utter_submit by using utter_template(), the message written as Glad!! Task is successfully done is displayed. It should be noted here that utter_template() should be used for the utter_submit call because it is the default mechanism used for forms; an error will be generated if utter_message() is used.
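
A hedged sketch of the membership form, consistent with the slot-mapping commands quoted above and the Rasa 1.x FormAction API, might look like this:

from rasa_sdk.forms import FormAction

class MembershipForm(FormAction):
    def name(self):
        return "membership_form"

    @staticmethod
    def required_slots(tracker):
        return ["myname", "category", "city", "phone", "mem_type", "comments"]

    def slot_mappings(self):
        return {
            "myname": [self.from_entity(entity="myname", intent="request_membership"),
                       self.from_text()],
            "category": self.from_entity(entity="category", intent="request_membership"),
            "city": [self.from_entity(entity="city", intent="request_membership"),
                     self.from_text()],
            "phone": [self.from_entity(entity="phone"), self.from_text()],
            "mem_type": [self.from_entity(entity="mem_type", intent="request_membership"),
                         self.from_intent(intent="affirm", value="special"),
                         self.from_intent(intent="deny", value="general")],
            "comments": [self.from_intent(intent="deny", value="No comments"),
                         self.from_text(not_intent="affirm")],
        }

    @staticmethod
    def category_data():
        # All the categories supported by this chatbot
        return ["library", "hotel", "gym"]

    def validate_category(self, value, dispatcher, tracker, domain):
        # Accept the category only if it appears in category_data()
        if value and value.lower() in self.category_data():
            return {"category": value.lower()}
        dispatcher.utter_template("utter_wrong_category", tracker)
        return {"category": None}

    def validate_mem_type(self, value, dispatcher, tracker, domain):
        # Map the message to either "special" or "general" membership
        if value and "special" in value.lower():
            return {"mem_type": "special"}
        if value and "general" in value.lower():
            return {"mem_type": "general"}
        dispatcher.utter_template("utter_wrong_mem_type", tracker)
        return {"mem_type": None}

    def submit(self, dispatcher, tracker, domain):
        # utter_template() is used here, as described above
        dispatcher.utter_template("utter_submit", tracker)
        return []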

When the rasa shell was executed in one window with rasa run actions in other window, the
following results are obtained:

Explanation
In the above execution, the first message after greet was “can i know the process to take
membership”. This did not contain any intent or entity, hence it starts with the task of filling the
first slot “myname”. Since the text associated with utter_ask_myname is “what is your good
name?”, it prompts the user to enter the name. When the user enters Pearl, it maps the slot “myname” to Pearl, although the Pearl entity is not derived from any intent. It is still accepted because the slot_mappings() function defined in the actions.py file says that plain text can be mapped directly. The next slot that is searched is the category because it occurs second in the list returned by the required_slots() function. It prompts the message that is written in the utter_ask_category template. When the user enters hotel, it displays an error because the slot_mappings() function specifies that this slot should not be filled from general text entered by the user. It should be noted that the error message written in the utter_default template is displayed (the default message for any process); the slot should only be filled from an entity named category in the intent “request_membership”. Thus, when the user enters the statement wanted membership of hotel, which is specified in the nlu.md file, it accepts the input and proceeds to check whether the slots for city and phone are filled. Similarly, the corresponding messages are prompted, but we have specified in the file that these can be text, hence the inputs are accepted.
The mapping for mem_type specifies that it can come either from the intent “request_membership” or from the affirm or deny intent. If it comes from the affirm intent, it is considered as “special”, and if it comes from the deny intent, it is considered as “general”. The validate_mem_type() function is then automatically executed; it checks whether the string contains the word special, in which case the value is considered as special, else it is considered as general. Finally, the slot named comments is checked and the text entered by the user is mapped to comments. But, if the user enters a message that belongs to the deny intent, this means that the user never wanted to enter a comment. In such situations, the value “No comments” will be mapped to the comments key in the dictionary. The story says that after the form is executed, it should execute utter_values, which displays all the values. Hence, all the values are displayed.

The following result was produced when the rasa shell command was executed again:

Explanation
In the above interaction, the first message after greet contains the entities named “myname” and “city” with values “Pearl” and “Mumbai”. Hence, the messages for those slots are skipped, and it prompts the user to enter the other slots. The user enters the category as “hotel”, which is specified in the intent request_membership, and hence the next slot (phone) needs to be filled. For filling the next slot, phone, the user is prompted with the message “Plz share your contact number”. Since the phone can be text, it is accepted and a new slot is now required to be filled. It then asks whether the user wanted the special membership, and when the user enters only the word “special”, it displays the error message because no message named “special” is defined in the intents request_membership, affirm, or deny. It displays the message again, and when the user enters no, the statement is found in the deny intent and, according to validate_mem_type(), the value “general” is assigned to the mem_type key. After the feedback is generated, all the values are printed. It can be observed that it displays “general” membership as per the value returned from the validate_mem_type() function.

The following results are obtained when the rasa shell command was again executed:

Explanation
In the above interaction, the starting statement defines only the slot “category” because only library is specified. Hence, the assistant asks for each and every other slot one by one: first the name slot is asked, followed by city, phone, mem_type, and comments. When all the slots have been asked, slot mapping is done successfully, and the values for all the slots are validated, the submit() function specified in the actions.py file is executed. It then executes the utter_submit template, which displays the message specified as “Great!!…”, and then executes the action utter_values, which is specified in the stories.md file. Thus, it prints the complete message to the user related to all the slots.

The following results are obtained when the “rasa shell” command was again executed:

Explanation
Here, since the starting statement defined the slots named “myname” and “category”, the assistant asks for all the other slots in the order: city, phone, membership type, and comments. A successful determination of all the slots displays the message “Great!!…” and finally prints the message written in utter_values.

Create a chatbot using the forms for displaying the information about the user
requirement related to the grocery products available in the web store.

USE CASE
CHATBOT FOR CONSUMER GOODS

The consumer goods sector is a category of stocks and companies that relate to items purchased
by individuals and households rather than by manufacturers and industries. These companies
make and sell products that are intended for direct use by the buyers for their own use and
enjoyment. Consumer goods can be broadly categorized as durable or nondurable. The durable
goods include items with long life such as electronics and home furnishings. Nondurable goods
include goods with a life expectancy of fewer than 3 years, such as food and personal care items.
These goods can also be categorized as necessary items such as food and clothing and luxury
items such as automobiles and electronics.

Many companies in the consumer goods sector rely heavily on advertising and brand
differentiation. These companies are looking to maximize profits and market share in an
interconnected, competitive environment. Performance in the consumer goods sector depends
heavily on consumer behavior and these companies are facing challenges such as meeting the
changing demands of customers and executing innovative strategies to grow profitably. This
hence requires balancing investment strategies with technology initiatives that can address the
needs of increasingly empowered consumers. Modern Internet technology has had an enormous
and on-going impact on the consumer goods sector and the ways products are sold have all
evolved dramatically over the past few decades.
Chatbots can be used for selling products by providing the product information as required
by the user. Adding a personal touch to the query by the user helps to sell goods effectively. This
information may relate to the availability of the product, cost of the product, delivery time,
quantity, available brands, etc. A proper form generation with all the slots will enhance the
quality of the interaction with the user by providing better response.

21.7 Creating Effective Chatbot


NLP is moving incredibly fast, and pretrained models such as BERT and GPT-2 provide good representations. Chatbots are very useful and effective for conversations related to different user queries because of the availability of good algorithms. But a chatbot sometimes produces incorrect results. Some powerful tools for collecting and annotating training data (the most valuable source of data), checking the security of data, adding new words, etc., will definitely help to increase the accuracy of the chatbot.

21.7.1 Providing Huge Training Data


When a new chatbot is created, we have little or no training data and the accuracy of intent classification is low. But when the bot is actually used, plenty of conversational data can be obtained as training examples. There are many tools for data generation, such as chatito developed by Rodrigo Pimentel and Amazon Mechanical Turk (mturk). The best data can be obtained from real users by using the interactive learning feature of Rasa Core to get new Core and NLU training data. This is because messages are naturally framed differently when the user is actually speaking to the bot, in comparison to the messages thought of while creating the chatbot.

21.7.2 Including Out-of-Vocabulary Words


It is possible that users will use words for which the trained model does not have word embeddings. These include words that were not thought of while creating and designing the chatbot. If pretrained word embeddings are used, there is not much that can be done. However, if embeddings are trained from scratch, then more training data, or examples that include the OOV_token (out-of-vocabulary token), can be included. This will help the classifier to deal with messages that include unseen words.
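
For example, with the supervised embeddings pipeline that uses the CountVectorsFeaturizer, an OOV token can be configured in config.yml roughly as follows (a sketch; the token string itself is arbitrary), and the same token is then included in a few training examples:

pipeline:
  - name: "WhitespaceTokenizer"
  - name: "CountVectorsFeaturizer"
    OOV_token: "oov"
  - name: "EmbeddingIntentClassifier"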

21.7.3 Managing Similar Intents
It is difficult to distinguish similar intents, and the obvious cases are often forgotten when creating intents. For example, suppose an intent was created related to flight_info. One user wanted to ask about the seat availability on a particular flight from Mumbai to Bengaluru, while another user wanted to know the flights available from Mumbai to Bengaluru. From an NLU perspective, these messages are very similar. For this reason, rather than keeping separate seat_info and flight_info intents, it is often better to create an intent that unifies seat_info and flight_info. In the Rasa Core stories, different story paths can then be selected depending on which entity Rasa NLU has extracted.

21.7.4 Balanced and Secured Data


The user may sometimes provide fewer training examples/expressions for some intents than for others. It is known that more training examples help in achieving better chatbot accuracy; hence there is a need to have sufficient examples for each and every intent. A strong imbalance can result in a biased classifier that may lead to low accuracy. It is hence suggested to maintain a balance in the number of training examples per intent.
The user sometimes, for fun, enters messages that are not meant for that particular chatbot. If wrong data are captured, they will provide wrong training to the chatbot and hence it will display incorrect results in the future. It becomes important to ensure that the user enters only appropriate data and that only correct data are stored as training data for future purposes.

Summary
• Chatbots are a form of human–computer dialog system that operates through natural
language via text or speech.
• Chatbots are autonomous and can operate anytime, day or night, and can handle repetitive, boring tasks. They help drive conversation and are also scalable because they can handle millions of requests. Chatbots are consistent because they will provide the same information when a query is asked multiple times.
• Chatbots are primarily meant for fulfilling two tasks: understanding the user message and
giving the correct responses. The Rasa Stack tackles these tasks with Rasa NLU, the NLU
component and Rasa Core, the dialogue management component.
• Rasa NLU builds a local NLU model for extracting intent and entities from a conversation. It
provides customized processing of user messages through pipeline.
• Rasa environment is initialized using the command “rasa init”. The command “rasa init --no-prompt” can be used for initializing the environment while avoiding the prompts that are shown.
• The data folder contains primarily two main files: nlu.md (training data to make the bot understand the human language) and stories.md (flow of data to help the bot understand how to reply and what actions to take in a conversation).
• The stories.md file contains a bunch of stories created from where the learning takes place.
Stories help in teaching the chatbot the manner in which to respond to these intents in different
sequences.
• The domain.yml file is called the universe of the chatbot and it contains details of everything:
intents, actions, templates, slots, entities, forms, and response templates that the assistant
understands and operates with.
• Chatbot process involves three basic steps: intent identification, action identification, and
response identification.
• The training part generates a machine learning model based on the training data, which is saved in the models folder. It is possible to train both “nlu” and “core” together with the single command “rasa train”.
• Actions.py file generally contains complicated actions and not simple actions because the
simple actions are displayed in the templates section of the domain.yml file.
• Rasa Core defines a set of policies to use when deciding the next action of the chatbot. These
policies are either defined in the policies section of config.yml file, else these parameters can
be passed during training to respective policy constructors.
• An entity is a piece of information inside the message given by the user, which is defined in
the intent. An entity is defined in parenthesis and the possible values of entity are written in
square bracket.
• Rasa provides special features for creating synonyms for an entity, for having multiple entities in a single expression, and for defining multiple values of the same entity in a single expression.
• Slots are memory of the chatbot. They act as a key-value store that can be used to store
information from the user and the outside world. They help the bot to categorize and interpret
user input. There are different slot types such as text slot, categorical slot, and Boolean slot
for different behaviors.
• Forms collect valuable data from the conversation with the chatbot. A form in a chatbot is
hence able to collect information from multiple fields such as name, email, location, and
phone with appropriate custom validation using proper API.
• A useful and efficient chatbot would have connectivity to the database, which is created in
any software. It becomes easy for a chatbot to perform a query on the fields that are listed in
the table of the database depending on the information entered by the user.
• Chatbots sometimes produce incorrect results. Some powerful tools for collecting and annotating training data (the most valuable source of data), checking the security of data, adding new words, etc., will definitely help to increase the accuracy of the chatbot.

Multiple-Choice Questions

1. The ______ is called the universe of the chatbot.


(a) domain.yml
(b) nlu.md
(c) config.yml
(d) action.py
2. The policies are defined in the ___________.
(a) domain.yml
(b) nlu.md
(c) config.yml
(d) action.py
3. The nlu model can be trained using the command
(a) Rasa train
(b) rasa train nlu
(c) Both (a) and (b)
(d) Neither (a) nor (b)
4. The __________ contains information of intents and entities.
(a) domain.yml
(b) nlu.md
(c) Stories.md
(d) action.py
5. Entities are described in
(a) domain.yml
(b) nlu.md
(c) Both (a) and (b)
(d) Neither (a) nor (b)
6. It is always better to train the model with numerous stories. The statement is
(a) True
(b) False
(c) Not necessary
(d) Can’t say
7. We can determine that an expression belongs to a particular intent by considering its
(a) Name
(b) Confidence
(c) Start
(d) End
8. An important step for including forms in chatbot is to define __________ in policies section
of config.yml file.
(a) FormPolicy
(b) ImportantPolicy
(c) SlotPolicy
(d) MappingPolicy
9. Rasa environment can be initialized using the command
(a) rasa env
(b) init rasa env
(c) env rasa
(d) rasa init
10. The command “rasa interactive--endpoints endpoints.yml” helps to
(a) Create an interactive story based on training
(b) Create interactive slots
(c) Check the correctness of the domain.yml file
(d) Check the structure of nlu.md file

Review Questions

1. What is a chatbot? Explain the three basic steps of chatbot process.


2. Explain the importance of models folder.
3. What is the role of actions.py file?
4. What is the importance of data folder?
5. Discuss the process of creating slots for making an efficient chatbot.
6. Is it possible to link the chatbot to an existing dataset? If yes, how?
7. Differentiate between entity and intent.
8. What strategies should be adopted to create an efficient chatbot?
9. Differentiate between the use of pretrained embeddings and supervised embeddings in
creating a chatbot.
10. How does rasa chatbot solve the problem of storing the information of numerous entities?

CHAPTER 22

The Road Ahead

Learning Objectives
After reading this chapter, you will be able to

• Understand new applications of machine learning techniques.


• Get familiar with reinforcement and federated learning.
• Have acquaintance to graph neural network (GNN).
• Gain exposure to create synthetic images using generative adversarial network (GAN).

22.1 Reinforcement Learning


When the thinking process required for solving a problem is easy and slow, different techniques
like LPP can be used for solving the problem. Supervised learning is used when the thinking
process required for solving the problem is fast but easy. Reinforcement learning is applied in
situations when there is a need to take fast decision for a problem with high complexity.
Reinforcement learning is used for tasks that humans find hard to do; when there is no ideal
reference; when time is not enough and it is not possible to solve real-time problems; and when
the system is hard to define, or complex and no analytical relationships can be found. In short,
reinforcement learning can be used to solve very complex problems that cannot be solved by
conventional techniques.
Reinforcement learning is the problem faced by an agent that learns behavior through trial-
and-error interactions with a dynamic environment. This learning model is very similar to the
learning of human beings. Hence, it is close to achieving perfection. It works on the principle of
learning to maximize long-term reward through interaction with the environment (Fig. 22.1). It
uses neural networks with gradients driving the training to compute the parameters that
maximize reward. The model can correct the errors that occur during the training process. Once an error is corrected by the model, the chances of the same error occurring again are very low. It can create the perfect model to solve a particular problem. In the absence of a training dataset, it is bound to learn from its own experience.
The problem is initially divided into sequence of tasks using domain knowledge. These tasks
can be repeatedly performed to achieve goals and the performance system is created, thereby
reducing the problem size. All the states and actions are clearly defined. The agent observes an
input state. An action is performed according to the decision-making function and based on it,
the algorithm learns a policy of how to act in a given environment. Every action has some impact
on the environment (Fig. 22.2). The environment then provides reward that guides the learning
algorithm. The agent is an intelligent program, the environment is an external condition, and the reward is associated with the agent’s behavior at a given time; the policy learned is basically a mapping from states to actions. The agent receives a scalar reward or reinforcement from the
environment and information about the reward given for that state/action pair is recorded (Fig.
22.3). The model describes a sequence of possible events in which the probability of each event
depends only on the state attained in the previous event (Fig. 22.4).

Figure 22.1 Concept of reinforcement learning.

Figure 22.2 Action at time 0.

Figure 22.3 Reward at time 0.

Figure 22.4 Model of reinforcement learning.

It is not advisable to use reinforcement learning for solving simple problems; it should be used in cases where the correct decisions are not obvious. It is suggested to speed up training by seeding
with existing heuristics. Reinforcement learning needs a lot of data and a lot of computation.
This is the reason why it is used in video games, because one can play the game again and again
to fetch lots of data. It should be noted that too much reinforcement learning can lead to an
overload of states which can diminish the results.
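As a simple illustration of this trial-and-error loop, the following minimal sketch implements tabular Q-learning on a toy five-state corridor; the environment, reward values, and hyperparameters are assumptions made only for this example and are not part of any standard library.

# Minimal tabular Q-learning sketch on a toy 5-state corridor environment;
# the agent starts in state 0 and earns a reward only on reaching state 4.
import numpy as np

n_states, n_actions = 5, 2              # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))     # table of state-action values
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate

def step(state, action):
    """Environment dynamics: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward, next_state == n_states - 1

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy selection balances exploration and exploitation
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update: move the estimate towards reward + discounted future value
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print("Learned policy (0 = left, 1 = right):", Q.argmax(axis=1))

After training, the learned policy chooses the action “move right” in every non-terminal state, which is the behavior that maximizes the long-term reward in this toy environment.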

Reinforcement learning can be used effectively in financial portfolio management because it involves training by learning.

USE CASE
REINFORCEMENT LEARNING FOR SOLVING REAL-WORLD OPTIMIZATION PROBLEMS

Linear programming is well suited for solving complex resource-allocation problems and supports simple, productive management of an organization, which leads to better outcomes. It is used to analyze
numerous economic, social, military, and industrial problems, and helps in improving the quality
of decision by optimization of resources. Linear programming is an optimization technique for a
system of linear constraints and a linear objective function. An objective function defines the
core factor that needs to be optimized. The goal of linear programming is to find values of the
variables that maximize or minimize the objective function.
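As a small illustration, SciPy’s linprog function can solve such a problem; the cost coefficients and constraints below are purely illustrative assumptions.

# Minimal linear programming sketch with SciPy: maximize 3x + 2y
# subject to x + y <= 4, x + 3y <= 6, x >= 0 and y >= 0.
from scipy.optimize import linprog

c = [-3, -2]                 # linprog minimizes, so negate the objective to maximize
A_ub = [[1, 1], [1, 3]]      # coefficients of the inequality constraints
b_ub = [4, 6]                # right-hand sides of the constraints

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("Optimal values of the variables:", res.x)
print("Maximum value of the objective:", -res.fun)   # negate back to get the maximum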
An optimization problem involves the selection of a best solution (min/max) from some set of
available alternatives. It is basically a combinatorial structure to the problem where the
constraints may have to be satisfied, and a cost function is involved which we want to optimize.
The major objective is to find the best solution or at least an acceptable solution because
searching all possible solutions is infeasible. For example, it can be used to optimize delivery
routes (least driving distance) or assigning tasks to delivery person based on geolocations and
time window (least time taken) or for determining the lowest cost involved in the whole process.
All the different constraints (including non-assignments) can be applied, and a feasible solution can be determined by satisfying every constraint while assigning each task to a nearby resource. The sequence of tasks can then be changed, and new assignments are made on the basis of the different sequences. The process is repeated over all the tasks (e.g., all deliveries of the day): one task (e.g., a delivery) is selected at a time and the best resource (e.g., a delivery person) that can perform it is determined, while taking care of all constraints (e.g., the delivery time requested by the consumer).

22.2 Federated Learning


It is known that the true performance of any machine learning model depends on the relevance of
the data used to train it. Conventional machine learning models depend on the mass transfer of
data from the devices or deployment sites to a central server to create a large, centralized dataset.
In cases of decentralized datasets, the training of conventional machine learning models still
requires large, centralized datasets and is not able to consider computational resources that are
available closer to the place where data are generated. However, though conventional machine
learning models deliver a high level of accuracy, they are not appropriate due to data security and
legal restrictions. Other important considerations for effective functioning include the following:

1. Data quality: The generated data have to be correct within an accepted error range.
2. Security: With a number of Internet-connected devices, securing the network from
cyberthreats is very important.
3. Privacy: Data collected are very sensitive to business operations and hence the solution has
to be privacy-preserving.
5. Uploading: Since the amount of data generated is huge, it is not feasible to upload all of it to the cloud.
5. Resources: Transfer of massive data requires more network resources, which may result in
lack of bandwidth.
6. Cost: Dependency on a centralized dataset can be costly because of high data transfer cost.
7. Time consuming: Dependency on a centralized dataset for maintenance and retraining
purposes can be time consuming.
Federated learning is a solution to all these problems. Federated learning enables edge devices to
collaboratively learn a machine learning model and keep all of the data on the device itself.
Instead of moving data to a centralized place, the models are trained on the device and only the
revisions of the model are shared across the network for providing effective training to all.
In the early days of information technology, we had large mainframes doing the heavy lifting
of most of the computing. Eventually, we moved to a client–server framework where the
computes were distributed between central server(s) and multiple client computers. The
federated learning architecture deploys a similar model. Federated learning is a machine learning
technique that trains an algorithm across multiple decentralized edge devices or servers holding
local data samples, without exchanging their data samples. This approach is different from
traditional centralized machine learning techniques where all data samples are uploaded to one
server, as well as to more classical decentralized approaches, assuming that local data samples
are identically distributed. It helps in promoting continuous learning networks where self-
adapting, scalable, and intelligent agents can work independently to continuously improve
quality and performance by using machine learning models. Federated learning makes it possible to build a collective, robust machine learning model without sharing data, thereby addressing critical issues such as privacy and security of data, access rights to data, and access to heterogeneous data.
Its applications are spread over a number of industries including defense, telecommunications,
Internet of Things (IoT), or pharmaceutics.
For understanding the concept of federated learning, consider a scenario when the training
needs to be done at multiple nodes of the dataset. In a centralized machine learning model, the
datasets from all the nodes are transferred to one master node and model training is performed.
The trained model is then transferred and deployed back to the other sub-nodes for inference. In
this scenario, all sub-nodes use exactly the same pretrained machine learning model.

Figure 22.5 Basic design of federated learning.

In the isolated machine learning scenario, no data are transferred from the sub-nodes to a master
node. Instead, each sub-node trains on its own dataset and operates independently from the
others. Based on the concept of federated learning, the sub-nodes train on their individual
datasets and share the updated weights from the neural network model via the message bus. The final model accuracy is reached after the weight-sharing and weight-averaging process has been executed repeatedly, until the accuracy stabilizes across rounds. In this way, the sub-nodes can also learn from each other without transferring their datasets (Fig. 22.5). The
message exchange between the master node and the sub-nodes is implemented in the form of
assigned tasks such as new weights and average weights. Each task is pushed into the message
bus and has a direct recipient. The recipient must acknowledge that it has received the task. If the
acknowledgment is not made or in the case of a failure, the task remains in the bus and the failed
process restarts. Once the process reaches a running state again, the message bus retransmits all
the tasks that were not transferred or acknowledged.
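A highly simplified sketch of the weight-sharing and weight-averaging step described above is given below; the two sub-nodes and their local datasets are simulated in a single process purely for illustration, and the tiny Keras network is an assumption, not a prescribed architecture.

# Simplified federated-averaging sketch: two sub-nodes train the same model
# architecture on their own local data and only the weights are averaged.
import numpy as np
from tensorflow.keras import layers, models

def build_model():
    model = models.Sequential([
        layers.Dense(8, activation="relu", input_shape=(4,)),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Simulated local datasets held by two different sub-nodes (never exchanged)
X1, y1 = np.random.rand(100, 4), np.random.randint(0, 2, 100)
X2, y2 = np.random.rand(100, 4), np.random.randint(0, 2, 100)

global_model = build_model()
for federated_round in range(3):                       # a few federated rounds
    local_weights = []
    for X, y in [(X1, y1), (X2, y2)]:
        node = build_model()
        node.set_weights(global_model.get_weights())   # start from the current global model
        node.fit(X, y, epochs=1, verbose=0)            # train only on the local data
        local_weights.append(node.get_weights())
    # The master node averages the weights layer by layer (no raw data is exchanged)
    averaged = [np.mean(w, axis=0) for w in zip(*local_weights)]
    global_model.set_weights(averaged)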
Thus, a federation is treated as a task run-to-completion, enabling a single resource definition
of all parameters of the federation that are later deployed to different centralized environments.
The resource definition for the task deals both with main and alternate components of the
federation. The alternative components handle the characteristics of the federated learning model
and its hyperparameters. The main component includes the specifics of common components
that can be reused by different federated learning tasks. Alternative components comprise a
message bus, the master node of the federated learning task, and the sub-nodes to be deployed
and federated in different data centers. Nodes are tightly coupled to the underlying machine learning platform, which is primarily used to train the model and cannot be changed during the federation. Main components can be reused by different federated learning tasks. These tasks can
run sequentially or in parallel depending on the availability of resources.
The message bus is deliberately kept as the single point of failure in the design of the federated learning framework in order to reduce the complexity of the code base for both the master node and sub-nodes of the
federated learning task. This finally leads to a better and more robust framework, because it
allows for a federated learning task to fail without affecting the other federated learning tasks
executed at the same time. Thus, the benefits of federated learning can be summarized as
follows:

1. High accuracy: Federated learning can help to improve the accuracy of the models. It
provides a framework to port models across organizations for the same domain of the
device, something not possible in traditional cloud-based anomaly detection models, which
makes it easy to deploy with very limited data. Federated learning, with its ability to do
machine learning in a decentralized manner, is a promising approach for increasing the
accuracy of the model.
2. Low latency: Since the data are not moved and only the revisions in models are shared, low
latency is observed in the models.
3. Better utilization of resources: Federated learning makes it possible to achieve better
utilization of resources with minimal data transfer, and the models can be used immediately. Networks
of the future must be able to learn without needing to transfer voluminous amounts of data
and perform centralized computation.
4. Less power consumption: The core concept behind federated learning is to train a
centralized model on decentralized data that never leaves the local data center that generated
it. Rather than transferring the data to the computation, federated learning transfers the
computation to the data. Hence, it consumes less power and contributes to longer device
life.
5. Less network load: Since the data remain on the device and only model updates are shared, the load on the network is drastically reduced. Federated learning can therefore be used across organizations with less network load.
6. Privacy: It helps in preserving the privacy of those whose information is being exchanged
and/or keeps a check on the risk of exposing sensitive information.
To be specific, in a neural network model, the training in a federated learning framework is done
locally near to the place where the data are generated or collected. Such initial models are
distributed to several data sources and trained in parallel. Figure 22.6 illustrates the basic
architecture of federated learning lifecycle. In neural network model, federated learning aims at
training a machine learning algorithm and multiple local datasets contained in local nodes
without exchanging data samples. The general principle consists of training local models on local
data samples and exchanging parameters (weights of a deep neural network) between these local
models at some frequency to generate a global model. The training of the heterogeneous dataset
is done at individual data center. After the training is done, the weights of all neurons of the
neural network are shared with a centralized data center.

Figure 22.6 Architecture of federated learning lifecycle.

The dashed lines indicate the aggregated weights that are sent/received to/from the centralized
data center and not from the individual data center (as in conventional machine learning models).
At the data center, averaging of weights is executed and a new model is developed. This model is
then communicated back to all the remote neural networks, which has shared the individual
weights and has hence contributed to the new model development. The main drawback with this
approach is that the transition from training a conventional machine learning model using a
centralized dataset to several smaller federated ones may introduce a bias that may impact the
accuracy, which was initially achieved by using a centralized dataset. However, the risk for this
is greatest in less reliable federations that span over to mobile devices. It is, therefore, suggested to rely on data centers, which are significantly more reliable than devices in terms of data storage, computational
resources, and general availability. Since the corresponding processes may fail due to lack of
resources, it is important to ensure that high fault tolerance exists for better results.
Federated learning algorithms may use a central server that coordinates the different steps of
the algorithm and acts as a reference clock. It can also be a peer-to-peer coordination where no
such central server exists. However, in the non-peer-to-peer case, a federated learning process
can be broken down in multiple rounds, each consisting of four general steps (Fig. 22.7).
It should be clearly understood that federated learning is not the same as distributed learning.
The main difference between federated learning and distributed learning is based on the
properties of the local datasets. Distributed learning aims at parallelizing computing power,
whereas federated learning aims at training on heterogeneous datasets. Distributed learning aims
at training a single model on multiple servers; local datasets are identically distributed and are
nearly of the same size. On the other hand, datasets used in federated learning are typically
heterogeneous and their sizes may be different to a great extent.
The true application of federated learning can be a modern smart building, which has a
number of Internet-enabled devices: IoT sensors to measure temperature, Internet-enabled
lighting, IP camera, IP phone, etc. The data are generated at large scale across all the devices.

Figure 22.7 Neural network model of federated learning.


Source: proandroiddev.com

By linking all the mobile devices, federated learning can prove to be an asset in the telecommunication sector. This concept can also prove to be effective for
understanding customer behavior in e-commerce industries.

USE CASE
FEDERATED LEARNING FOR SELF-DRIVEN CARS

The idea of a self-driven car is to fit a car with cameras that can track all the objects around it and have the car react if it is about to steer into one. In-car computers are taught the rules of the road and allowed to navigate to their own destination. Thus, developing a model for self-driven
cars requires high-definition video to gain insight of deep learning based internal and external
perception systems and gain a complete understanding of how human beings interact with
vehicle automation technology. This is primarily done by integrating video data with vehicle
state data, driver characteristics, mental models, and self-reported experiences with technology;
and identifying how technology and other factors relate to automation adoption. Consider a
situation in which a driverless car was not trained to differentiate between large white cars and
ambulances on the basis of sound and red flashing lights on the road. If the user car is moving
down the highway and an ambulance comes on the road, the self-driven car may not slow down
because it does not perceive the ambulance as different from a big white car.
Thus, driving is a complex activity that can be explicitly solved through model-based and
learning-based approaches in order to achieve full unconstrained vehicle autonomy.
Localization, mapping, scene perception, vehicle control, trajectory optimization and higher
level planning decisions associated with autonomous vehicle development remain full of open
challenges. This is especially true for unconstrained, real-world operations where the margin of
allowable error is extremely small and the number of edge-cases is extremely large. Self-driven
cars hence need to be trained extensively in virtual simulations to prepare the vehicle for nearly
every event on the road. However, there are many rules that a human follows and cannot be
taught to the cars like making eye contact with others to confirm who has the right of way, react
to weather conditions, etc.
Federated learning is a machine learning setting where the goal is to train a high-quality
centralized model with training data distributed over a large number of cars, each with
unreliable and relatively slow network connections. Machine learning models are being
computed on large, centralized machines and are distributed over cars for computation.
Federated learning can be used to test and train all cars and thus makes it possible for self-
driving cars to train on aggregated real-world driver behavior. Federated learning enables to
collaboratively learn a shared prediction model while keeping all the training data on car,
decoupling the ability to do machine learning from the need to store the data in the cloud. The
car downloads the current model, improves it by learning from car on device, and then
summarizes the changes as a small focused update. Only this update to the model is sent to the
cloud, using encrypted communication, where it is immediately averaged with other cars’
updates to improve the shared model. It should be noted that all the training data remain in the car, and no individual updates are stored in the cloud. For every drive, each car independently
computes an update to the current model based on its local data, and communicates this update
to a central server, where the client-side updates are aggregated to compute a new global model.
Techniques such as secure aggregation and differential privacy can be applied to further ensure
the privacy and anonymity of the data origin. It is important that the system communicates and aggregates the model updates in a secure, efficient, scalable, and fault-tolerant way.

22.3 Graph Neural Networks (GNNs)


Graphs are a kind of data structure that models a set of objects (nodes) and their relationships
(edges). GNN is a type of neural network that directly operates on the graph structure. GNN has
gained increasing popularity in various domains, including social network, knowledge graph,
recommender system, and even life science because of the great expressive power of graphs
(used as denotation of a large number of systems), convincing performance, and high
interpretability. As a unique non-Euclidean data structure for machine learning, graph analysis
focuses on node classification, link prediction, and clustering. The power of GNN in modeling
the dependencies between nodes in a graph enables the breakthrough in the research area related
to graph analysis. In computer science, a graph is a data structure consisting of two components,
vertices and edges. A graph can be well described by the set of vertices (nodes) and edges it
contains. Edges can be either undirected (Fig. 22.8a) or directed (Fig. 22.8b), depending on
whether there exist directional dependencies between vertices (nodes). Nodes and edges typically
come from some expert knowledge or intuition about the problem. So, it can be atoms in
molecules, users in a social network, cities in a transportation system, players in team sport,
neurons in the brain, interacting objects in a dynamic physical system, pixels, bounding boxes, or
segmentation masks in images. In other words, it differs according to the perception of the user.
A graph can be considered as a very flexible data structure that generalizes many other data
structures. For example, a set is created if there are no edges and a tree is created if there are only
“vertical” edges and any two nodes are connected by exactly one path.

Figure 22.8 Undirected and directed graphs.

A typical application of GNN is node classification. Essentially, every node in the graph is
associated with a label, and the label of the nodes without ground-truth is predicted. In the node
classification problem setup, each node is characterized by its feature and associated with a
ground-truth label. The goal is to leverage these labeled nodes to predict the labels of the
unlabeled. It learns to represent each node with a dimensional vector that contains the
information of its neighborhood.
When we train our neural networks (ConvNets) on images or videos, we basically define images as graphs on a regular two-dimensional grid (Fig. 22.9a) or videos as graphs on a regular three-dimensional grid (Fig. 22.9b). Since this grid is the same for all training and test images and is regular, that is, all pixels of the grid are connected to each other in exactly the same way across all images (i.e., have the same number of neighbors and the same edge lengths), this regular grid graph carries no information that helps us to distinguish one image from another. An example of an irregular representation can be visualized in Fig. 22.10.

NetworkX library in Python is used to create a grid for graphs.
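For example, a regular grid graph of the kind discussed above can be created as follows (the 28 × 28 size is chosen only to match the MNIST example discussed next):

# Build a regular 28 x 28 grid graph with NetworkX, where each node is a pixel
# position and edges connect horizontally/vertically adjacent pixels.
import networkx as nx

G = nx.grid_2d_graph(28, 28)           # nodes are (row, col) tuples
print(G.number_of_nodes())             # 784 nodes (28 * 28 pixels)
print(G.number_of_edges())             # edges between neighbouring pixels

# Neighbours of an interior pixel: the pixels above, below, left and right of it
print(list(G.neighbors((1, 1))))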

It should be noted that nodes and edges are created from the user’s own understanding about the
problem and his/her own way of representing the problem. This can be entirely different between
two users for the same data but with different solutions. Let us assume that a graph G contains N
number of nodes and E number of edges. Edge represents the undirected connections between
nodes. For example, all the images in MNIST dataset are represented as a 28 × 28 dimensional
matrix. We can assume that in this image nodes are pixels and edges are spatial distances
between them. So, our graph G is going to have N = 784 nodes (28 × 28 = 784 pixels) and edges
will have large values (thicker edges in Fig. 22.11) for closely located pixels and small values
(thinner edges) for remote pixels. An image from the MNIST dataset (left) and an example of its
graph representation (right) is shown in Fig. 22.11. Darker and larger nodes on the right
correspond to higher pixel intensities. The figure on right was, however, inspired by Fey et al.
(2018).

Figure 22.9 Regular representation of 2D and 3D grids.

Figure 22.10 Irregular representation of 2D grid.

Figure 22.11 Graph neural network for image.
Source: https://medium.com/@BorisAKnyazev/

In the context of computer vision, graph-based models help us view a neural network itself as a graph, where nodes are neurons and edges are weights, or where nodes are layers and edges denote the flow of the forward/backward pass. Convolutional neural networks are less prone to
overfitting (high accuracy on the training set and low accuracy on the validation/test set), are
more accurate in different visual tasks and easily scalable to large images and datasets. Hence, in
the situations where input data are graph-structured, it is better to transfer all these properties to
GNNs to regularize their flexibility and make them scalable. This will help to develop a model
that is as flexible as GNNs and can digest and learn from any data, and at the same time control
(regularize) factors of this flexibility by turning on/off certain priors.
Based on convolutional neural networks (CNNs) and graph embedding, GNNs are proposed
to collectively aggregate information from graph structure. Thus, they can model input and/or
output consisting of elements and their dependency. Also, GNN can simultaneously model the
diffusion process on the graph with the Recurrent Neural Network (RNN) kernel. Another
important reason for preferring GNN is that the standard neural networks such as CNNs and
RNNs cannot manage the graph input properly and do not traverse all the possible orders as the
input of the model. This problem is solved by GNNs because they propagate on each node,
respectively, and ignore the input order of nodes (output of GNNs is invariant for the input order
of nodes). Besides, an edge in a graph represents the information of dependency between two
nodes, while in CNN and RNN, the dependency information is just regarded as the feature of
nodes. Since GNNs can propagate information guided by the graph structure instead of using it as part of the features, they are preferred. It should be noted that GNNs update the hidden state of nodes
by a weighted sum of the states of their neighborhood. Another important reason is that the
reasoning process in the human brain is similar to a graph. Although the standard neural networks can generate synthetic images and documents by learning the data distribution, they cannot
learn the reasoning graph from large experimental data. GNNs, on the other hand, can generate
the graph from nonstructural data such as pictures and stories.
The popularity of GNN is primarily due to its application for CNN and graph embedding.
CNNs have the ability to extract multiscale localized spatial features and compose them to
construct highly expressive representations. A deeper understanding of CNNs and graphs shows
that local connection, shared weights, and the use of multilayer are important factors, which are
similar to solving problems of graph domain. This is because graphs are the most typical locally
connected structures; shared weights reduce the computational cost compared with traditional
spectral graph theory and multilayer structure is the key to deal with hierarchical patterns, which
captures the features of various sizes. However, CNNs can only operate on regular Euclidean
data like images (2D grid) and text (1D sequence), while these data structures can also be
regarded as instances of graphs. Hence, the generalization of CNNs can be done to graphs. The
other important application is graph embedding, which learns to represent graph nodes, edges, or
sub-graphs in low-dimensional vectors. In the field of graph analysis, traditional machine
learning approaches usually rely on hand-engineered features and are limited by their
inflexibility and high cost. Based on the success of word embedding, DeepWalk, which is
regarded as the first graph embedding method based on representation learning, applies
SkipGram model on the generated random walks.
DeepWalk is the first algorithm proposing node embedding learned in an unsupervised
manner. It highly resembles word embedding in terms of the training process. It is also shown
that the distribution of both nodes in a graph and words in a corpus follow a power law as shown
in Fig. 22.12. The algorithm contains two steps: perform random walks on nodes in a graph to
generate node sequences and run skip-gram to learn the embedding of each node based on the
node sequences generated in step 1. At each time step of the random walk, the next node is
sampled uniformly from the neighbor of the previous node. Each sequence is then truncated into
subsequences of length 2|w| + 1, where w denotes the window size in skip-gram. After a
DeepWalk GNN is trained, the model has learned a good representation of each node as shown
in Fig. 22.13. Different colors indicate different labels in the input graph. We can see that in the
output graph (embedding with two dimensions), nodes having the same labels are clustered
together, while most nodes with different labels are separated properly. However, the main issue
with DeepWalk is that it lacks the ability to generalize. Whenever a new node comes in, the model has to be re-trained in order to represent this node. Thus, such a GNN is not suitable for
dynamic graphs where the nodes in the graphs are ever-changing.

Figure 22.12 DeepWalk algorithm.


Source: snap.stanford.edu
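A minimal sketch of these two steps is shown below, using NetworkX for the example graph and the Word2Vec implementation in gensim for the skip-gram step; the graph, walk length, and embedding size are illustrative assumptions (gensim 4.x parameter names are assumed).

# DeepWalk-style node embedding sketch: random walks followed by skip-gram.
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()              # a small example graph shipped with NetworkX

def random_walk(graph, start, length=10):
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(walk[-1]))
        walk.append(random.choice(neighbors))   # uniform sampling of the next node
    return [str(node) for node in walk]         # Word2Vec expects string tokens

# Step 1: generate several random walks starting from every node
walks = [random_walk(G, node) for node in G.nodes() for _ in range(10)]

# Step 2: run skip-gram (sg=1) on the walk "sentences" to learn node embeddings
model = Word2Vec(walks, vector_size=16, window=3, sg=1, min_count=0, workers=1)
print(model.wv["0"])                    # learned 16-dimensional embedding of node 0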

USE CASE
GNN FOR SALES AND MARKETING

It is crucial for businesses based on social networks to have an effective link prediction strategy
to make sure their network continues to grow and the user finds the platform more relevant and
engaging. In social networking, different people can be considered as nodes and edges will be
the connection between them. Social networks may contain two important features: influencers
and closely weaved community. Influencers tend to have many one-sided edges and these nodes
(Person) tend to influence how information spreads in the network and therefore are of interest
for increasing the customer base. The organization, hence, likes to find the influencers in the network because influencers help in distributing information (marketing campaigns) faster through the network. Influencers are typically also prolific content generators, thereby increasing the overall engagement on the platform. A closely weaved community follows the concept of members being similar to their friends, and such communities tend to resist the flow of information from outside. Similar nodes (people) can be identified by looking at their clustering coefficient and betweenness centrality, which helps to estimate how likely a person is to buy a new product or refer the product within the network.
Market basket analysis and recommender systems can also be made by creating a graph
from user-item transactional data. The links can be inferred through shared attributes such as
shared location, IP, and cookies.
City routes can be represented as graphs. Algorithms such as simple paths and shortest path
can be used to create features to predict the expected time of a trip. Also, algorithms such as
finding cycles and shortest path along with nodes and edges can be used to provide the timely
delivery of the product.
Images and videos are among the most interesting data types on social networking sites. The different classification methods include: classifying one frame at a time with a ConvNet; using a time-distributed ConvNet and passing the features to an RNN in one network; using a 3D convolutional network; extracting features from each frame with a ConvNet and passing the sequence to a separate RNN; and extracting features from each frame with a ConvNet and passing the sequence to a separate Multi-Layer Perceptron (MLP). These techniques can help the marketing person to classify the images and thus help in improving the e-commerce business.
Also, video uploading platforms such as YouTube are collecting enormous datasets. A video
is really just a stack of images and can be understood as a series of individual images. Hence,
deep learning practitioners basically consider video classification as performing image
classification a total of N times, where N is the total number of frames in a video. It should be
noted that video classification is more than just simple image classification because there are
subsequent frames in a video that are correlated with respect to their semantic contents. Hence,
classifying video presents unique challenges for machine learning models.

22.4 Generative Adversarial Network (GAN)


GAN is one of the most versatile neural network architectures in use today. The idea of pitting
two algorithms against each other originated with Arthur Samuel, a prominent researcher in the
field of computer science who is credited for popularizing the term “machine learning”. But if
Samuel is the grandfather of GANs, Ian Goodfellow, former Google Brain research scientist and
director of machine learning at Apple’s Special Projects Group, might be their father. A GAN is
a class of machine learning systems invented by Ian Goodfellow and his colleagues in 2014 and
was published in the form of a research paper titled Generative Adversarial Nets. Goodfellow has
expressed that he was inspired by noise-contrastive estimation, a way of learning a data
distribution by comparing it against a defined noise distribution (i.e., a mathematical function
representing corrupted or distorted data). Noise-contrastive estimation uses the same loss
functions as GANs; in other words, it uses the same measure of performance with respect to a model’s ability to anticipate expected outcomes.
GANs belong to the set of generative models and are able to produce/generate new content of
images/text. Given a training set, this technique learns to generate new data with the same statistical characteristics as the training set. For example, a GAN trained on images of ships can generate new synthetic images of ships with features that look similar. It should be noted that GAN
was originally designed for unsupervised learning, but has proved to be effective for semi-
supervised learning, fully supervised learning, and reinforcement learning also. Figure 22.13 by
Ian Goodfellow and co-authors shows the results generated by GANs after training on two
datasets: MNIST and TFD. It is important to mention here that in both images, the rightmost
column contains true data that are the nearest from the direct neighboring generated samples. The
result in the other columns (second to sixth column) shows that the produced data are really
generated and not memorized by the network.

Figure 22.13 Illustration of GANs abilities.


Source: https://arxiv.org/abs/1406.2661

Figure 22.14 Concept of generative adversarial network.

The basic concept of GAN is to have two neural networks (generator and discriminator) that
contest with each other in a game (in the sense of game theory, often but not always in the form
of a zero-sum game) (Fig. 22.14). Two models are trained simultaneously by an adversarial
process. A generator learns to create images that look real, while the discriminator learns to tell
real images apart from unreal images. The generator model produces synthetic examples from
random noise sampled using a distribution. The training set includes both synthetic and real
examples. This is then fed to the discriminator, which attempts to distinguish between the
synthetic and real examples by displaying the results as true or false for the generated image
(Fig. 22.15). The figure clearly shows that the generator generates the image of a ship. The discriminator then compares this generated image with the real images of the ship and finally decides whether the generated synthetic image is realistic or not.
It is important to understand that both the generator and discriminator try to improve their
respective abilities. During training, the generator progressively becomes better at creating
images that look real, while the discriminator becomes better at telling them apart. The process
reaches equilibrium when the discriminator can no longer differentiate between real images and
fake images and a better accuracy is achieved. GANs train in an unsupervised fashion, meaning
that they infer the patterns within datasets without reference to labeled outcomes. The role of the
discriminator is to give feedback on the generator’s work; it tells the generator about the modifications that can be made so that the output produced is more realistic in the future. It can be observed from Fig. 22.16 that the discriminator suggests modifications to the generator’s current generated image of the ship till it is able to correctly draw the image of the ship.

Figure 22.15 Architecture of GAN.

Figure 22.16 Continuous interaction between generator and discriminator.
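The following minimal Keras sketch shows the generator/discriminator pairing described above; the layer sizes, optimizer settings, and the 28 × 28 image shape are illustrative assumptions rather than a tuned architecture.

# Minimal GAN sketch in Keras: a generator that maps random noise to a 28 x 28
# image and a discriminator that classifies images as real (1) or fake (0).
import numpy as np
from tensorflow.keras import layers, models, optimizers

latent_dim = 100

generator = models.Sequential([
    layers.Dense(128, activation="relu", input_dim=latent_dim),
    layers.Dense(784, activation="tanh"),
    layers.Reshape((28, 28)),
])

discriminator = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer=optimizers.Adam(0.0002), loss="binary_crossentropy")

# Combined model: the generator is trained to fool the (frozen) discriminator
discriminator.trainable = False
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer=optimizers.Adam(0.0002), loss="binary_crossentropy")

def train_step(real_images, batch_size=32):
    # 1. Train the discriminator on real images (label 1) and fake images (label 0)
    noise = np.random.normal(0, 1, (batch_size, latent_dim))
    fake_images = generator.predict(noise, verbose=0)
    discriminator.train_on_batch(real_images, np.ones((batch_size, 1)))
    discriminator.train_on_batch(fake_images, np.zeros((batch_size, 1)))
    # 2. Train the generator (through the combined model) so that the
    #    discriminator labels its synthetic output as real
    noise = np.random.normal(0, 1, (batch_size, latent_dim))
    gan.train_on_batch(noise, np.ones((batch_size, 1)))

In a full training loop, train_step would be called repeatedly with batches of real images (for example, scaled MNIST digits), and the two networks would gradually improve together as described above.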

In practice, GANs have a number of limitations owing to their architecture. It should be noted that the generator and discriminator also run the risk of overpowering each other. If the generator becomes too accurate, it will exploit weaknesses in the discriminator, which will produce wrong
outcomes. On the other hand, the discriminator will hinder the generator’s progress toward
convergence if it becomes too accurate. A lack of training data also threatens to hinder GANs’
progress in the semantic realm, which in this context refers to the relationships among objects.
Besides, the simultaneous training of generator and discriminator models is fundamentally
unstable. The parameters are sometimes not stable because, after every parameter update, the nature of the optimization problem being solved changes. This can lead to the collapse of the generator, which then produces poor content.
GANs belong to the field of artificial intelligence and are used for producing human-like speech, generating images of people that are difficult to distinguish from real-life photographs, media synthesis, composing melodies, swapping of images, etc. However, GANs have also been used to produce problematic content like deepfakes, which is media that takes a person in existing media and replaces them with the likeness of someone else. The different
applications of GANs include the following:

1. Image and video synthesis: GANs are perhaps best known for their contributions to image
synthesis. Data scientists and deep learning researchers use this technique to generate
photorealistic images.
2. Predicting future events: Predicting future events from only a few video frames is possible
because of the state-of-the-art approaches involving GANs and novel datasets.
3. Image editing: Most image editing software these days do not give us much flexibility to
make creative changes in pictures. GAN can help to change the appearance drastically by
changing facial expressions, change hairstyle of a person, changing dress of a person, etc.,
which cannot be done by the current image editing tools.
4. Generating artwork: Deep learning researchers can use GAN techniques to produce
magnificent artwork. When trained on the right datasets, they are able to produce de novo
works of art.
5. Healthcare: In healthcare sector, GAN can be used to generate synthetic data for
identifying the deviation and thus help in effective supervision.
6. Business: When we see an image, we tend to focus on a particular part (rather than the
entire image as a whole). This is called attention and is an important human trait. Knowing
where a person would look beforehand would certainly be a useful feature for businesses, as
they can optimize and position their products better.
7. Gaming industry: GAN can be used for 3D object generation. Game designers work
countless hours recreating 3D avatars and backgrounds to give them a realistic feel. They
help in automating the entire process. Game designers can focus on a particular portion of
the game to enhance the features and make it more engrossing by visualizing designs.
8. Training: GAN can be used for providing training related to the images. For example, in an
app providing training of exercises, synthetic images for showing the feature of doing
exercises can be generated and deviation can be informed. It is also possible to generate
synthetic images from numeric data.
A lot of applications have been created by different people and organizations. GANs have
been primarily applied to the problems of super-resolution (image upsampling) and pose
estimation (object transformation). StyleGAN, a model Nvidia developed, has generated high-
resolution head shots of fictional people by learning attributes such as facial pose, freckles, and
hair and makes improvements with respect to both architecture and training methods, redefining
the state of the art in terms of perceived quality. In June 2019, Microsoft researchers detailed
ObjGAN, a novel GAN that could understand captions, sketch layouts, and refine the details
based on the wording. The co-authors of a related study proposed that a system StoryGAN
synthesizes storyboards from paragraphs. Researchers at the Indian Institute of Technology,
Hyderabad and the Sri Sathya Sai Institute of Higher Learning devised a GAN and created
SkeGAN that can generate stroke-based vector sketches of cats, firetrucks, mosquitoes, and yoga
poses. Scientists at the Maastricht University in the Netherlands created a GAN that produces
logos from one of 12 different colors. Vue.ai’s GAN used clothing characteristics and learns to
produce realistic poses, skin colors, and other features. From snapshots of apparel, it can
generate model images in every size up to five times faster than a traditional photo shoot. Tang
says one of his teams used GANs to train a model to upscale 200-by-200-pixel satellite imagery
to 1000 × 1000 pixels, and to produce images that appear as though they were captured from
alternate angles. Victor Dibia, a human-computer interaction researcher and Carnegie Mellon
graduate, trained a GAN to synthesize African tribal masks. A team at the University of
Edinburgh’s Institute for Perception and Institute for Astronomy designed a model that generates
images of fictional galaxies that closely follow the distributions of real galaxies. A publicly
available tool named GAN Paint Studio helps users to upload any photograph and edit the
appearance of depicted buildings, flora, and fixtures to their heart’s content. Impressively, it is
generalizable enough that inserting a new object with one of the built-in tools realistically affects
nearby objects (for instance, trees in the foreground occlude structures behind them) (Fig. 22.17).

Figure 22.17 Edits performed by GAN Paint Studio.


Source: venturebeat.com

USE CASE
GENERATIVE ADVERSARIAL NETWORK FOR CYBER SECURITY

The rise of artificial intelligence has been wonderful for most industries. But there is a real
concern that has shadowed the entire AI revolution – cyber threats. A cyber or cyber security
threat is a malicious act that seeks to damage data, steal data, or disrupt digital life in general.
Cyber threats include computer viruses, data breaches, denial-of-service attacks, and other
attack vectors. Cyber threats also refer to the possibility of a successful cyber attack that aims to
gain unauthorized access, damage, disrupt or steal an information technology asset, computer
network, intellectual property or any other form of sensitive data. Cyber threats can come from
within an organization by trusted users or from remote locations by unknown parties. The
availability of data in certain domains is a necessity, especially in domains where training data
are needed to model machine learning algorithms.
Even deep neural networks are susceptible to being hacked. A constant concern of industrial
applications is that they should be robust to cyber attack because of availability of confidential
information. GANs are proving to be of immense help here because of the feature of directly
addressing the concern of “adversarial attacks”. These adversarial attacks use a variety of
techniques to fool deep learning architectures. GANs can be used to make existing deep learning
models more robust to these techniques by creating more such fake examples and training the
model to identify them.

Summary
• Reinforcement learning is applied in the situations when there is a need to take fast decision
for a problem with high complexity.
• Reinforcement learning can be used to solve very complex problems that cannot be solved by
conventional techniques.
• Reinforcement learning is the problem faced by an agent that learns behavior through trial-
and-error interactions with a dynamic environment. This learning model is very similar to the
learning of human beings.
• Federated learning is a machine learning technique that trains an algorithm across multiple
decentralized edge devices or servers holding local data samples, without exchanging their
data samples.
• Federated learning makes it possible to build a collective, robust machine learning model without sharing data, thereby addressing critical issues such as privacy and security of data, access rights
of data and access to heterogeneous data.
• The benefits of federated learning are as follows: high accuracy, low latency, better
utilization of resources, less power consumption, less network load, and privacy.
• The main difference between federated learning and distributed learning is based on the
properties of the local datasets. Distributed learning aims at parallelizing computing power
while federated learning aims at training on heterogeneous datasets.
• Graphs are a kind of data structure that models a set of objects (nodes) and their relationships
(edges). GNNs are deep learning-based methods that operate on the graph domain.
• Generative adversarial networks belong to the set of generative models and are able to
produce/generate new content of images/text.
• The basic concept of GAN is to have two neural networks (generator and discriminator) that
contest with each other in a game (in the sense of game theory, often but not always in the
form of a zero-sum game).
• Two models are trained simultaneously by an adversarial process. A generator learns to
create images that look real, while the discriminator learns to tell real images apart from
unreal images.
• During training, the generator progressively becomes better at creating images that look real,
while the discriminator becomes better at telling them apart. The process reaches equilibrium
when the discriminator can no longer differentiate real images from fake images and a better accuracy is achieved.

Multiple-Choice Questions

1. _______________ concept is primarily based on weights.


(a) GAN
(b) Federated learning
(c) Reinforcement learning
(d) Graph neural network
2. Generating synthetic images is done primarily using _________________.
(a) GAN
(b) Federated learning
(c) Reinforcement learning
(d) Graph neural network
3. The _________________ concept is primarily based on a master node and sub-nodes.
(a) GAN
(b) Federated learning
(c) Reinforcement learning
(d) Graph neural network
4. In _________________ agent learns behavior through trial-and-error interactions with
dynamic environment.
(a) GAN
(b) Federated learning
(c) Reinforcement learning
(d) Graph neural network
5. The concept of nodes and edges is used in _________________.
(a) GAN
(b) Federated learning
(c) Reinforcement learning
(d) Graph neural network
6. This learning model is similar to learning of human beings.
(a) GAN
(b) Federated learning
(c) Reinforcement learning
(d) Graph neural network
7. Which of the following is a benefit of federated learning?
(a) Privacy
(b) Accuracy of model
(c) Resource utilization
(d) All of the above
8. _________________ is regarded as the first graph embedding method based on representation learning.
(a) DeepWalk
(b) Spacewalk
(c) DeepSpace
(d) SpaceDeep
9. Nodes and edges are considered the same for all types of problems.
(a) True
(b) False
(c) Can’t say
(d) Cannot be determined
10. The generator and discriminator both try to improve in their respective abilities during
training.
(a) True
(b) False
(c) Can’t say
(d) Cannot be determined

Review Questions

1. What is the utility of reinforcement learning?


2. How is the data shared in a federated learning model?
3. What are the different benefits of federated learning?
4. Differentiate between the role of generator and discriminator in a GAN.
5. Draw and explain the basic architecture of federated learning.
6. What are the limitations of the GAN model?
7. Discuss some of the applications of GAN.
8. How is federated learning different than distributed learning?
9. Explain the functioning of GAN with suitable example.
10. Discuss the importance of graph neural network.

Answers to Multiple-Choice
Questions

Chapter 1
Introduction to Python

Answers
1. b
2. d
3. b
4. a
5. a
6. a
7. d
8. b
9. a
10. d

Chapter 2
Control Flow Statements

Answers
1. a
2. c
3. a
4. b
5. a
6. c
7. b
8. c
9. a
10. a

Chapter 3
Data Structures

Answers
1. b
2. a
3. a
4. c
5. d
6. a
7. b
8. d
9. c
10. b

Chapter 4
Modules

Answers
1. a
2. b
3. a
4. c
5. b
6. a
7. b
8. b
9. d
10. c

Chapter 5
Numpy Library for Arrays

Answers
1. b
2. d
3. d
4. a
5. b
6. a
7. b
8. a
9. a
10. c

Chapter 6
Pandas Library for Data Processing

Answers
1. d
2. b
3. a
4. a
5. c
6. a
7. b
8. c
9. c
10. b

Chapter 7
Matplotlib Library for Visualization

Answers
1. a
2. c
3. a
4. b
5. a
6. c
7. a
8. a
9. d
10. a

Chapter 8
Seaborn Library for Visualization

Answers
1. c
2. a
3. d
4. a
5. d
6. d
7. c
8. b
9. d
10. c

Chapter 9
SciPy Library for Statistics

Answers
1. d
2. b
3. a
4. a
5. d
6. d
7. b
8. d
9. a
10. a

Chapter 10
SQLAlchemy Library for SQL

Answers
1. b
2. b
3. a
4. a
5. d
6. b
7. c
8. b
9. d
10. a

Chapter 11
Statsmodels Library for Time Series Models

Answers
1. a
2. c
3. b
4. c
5. a
6. b
7. d
8. a
9. c
10. b

Chapter 12
Unsupervised Machine Learning Algorithms

Answers
1. b
2. a
3. d
4. a
5. b
6. d
7. c
8. a
9. c
10. c

Chapter 13
Supervised Machine Learning Problems

Answers
1. d
2. b
3. a
4. c
5. a
6. b
7. b
8. b
9. a
10. a

Chapter 14
Supervised Machine Learning Algorithms

Answers
1. a
2. a
3. b
4. b
5. d
6. a
7. d
8. b
9. c
10. a

Chapter 15
Supervised Machine Learning Ensemble Techniques

Answers
1. d
2. a
3. a
4. d
5. d
6. a
7. c
8. a
9. b
10. c

Chapter 16
Machine Learning for Text Data

Answers
1. d
2. a
3. c
4. c
5. d
6. d
7. a
8. a
9. d
10. b

Chapter 17
Machine Learning for Image Data

Answers
1. a
2. a
3. c
4. d
5. c
6. d
7. c
8. b
9. b
10. c

Chapter 18
Neural Network Models (Deep Learning)

Answers
1. d
2. c
3. d
4. d
5. b
6. a
7. c
8. c
9. c
10. b

Chapter 19
Transfer Learning for Text Data

Answers
1. c
2. c
3. a
4. a
5. d
6. a
7. d
8. d
9. a
10. b

Chapter 20
Transfer Learning for Image Data

Answers
1. a
2. b
3. c
4. d
5. a
6. c
7. a
8. a
9. b
10. c

Chapter 21
Chatbots with Rasa

Answers
1. a
2. c
3. c
4. b
5. c
6. a
7. b
8. a
9. d
10. a

Chapter 22
The Road Ahead

Answers
1. b
2. a
3. b
4. c
5. d
6. c
7. d
8. a
9. b
10. a

Interview Questions and
Answers

1. Which libraries are used for text mining in Python?


Answer: The Natural Language Toolkit (NLTK) is used for common tasks associated with natural language processing. The
functionality of NLTK allows a lot of operations such as text tagging, classification, tokenizing, named entity
identification, building corpus, stemming, semantic reasoning, etc. The regular expressions library “re” is also used in the
process of text mining.
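A small illustration of tokenizing and stemming with NLTK follows (the sample sentence is an assumption; the punkt tokenizer data must be downloaded once, and newer NLTK versions may additionally require the punkt_tab resource):

# Tokenizing and stemming a sentence with NLTK.
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)                 # one-time download of tokenizer data

text = "Data analytics is changing how organizations work."
tokens = word_tokenize(text)                       # split the text into word tokens
stems = [PorterStemmer().stem(t) for t in tokens]  # reduce each word to its stem
print(tokens)
print(stems)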

2. Discuss the utility of numpy library.


Answer: It is the fundamental library in Python around which the scientific computation stack is built. It has many high-
level mathematical functions for large, multidimensional arrays and matrices.

3. Is dictionary data type available in other programming languages such as C and C++?
Answer: The dictionary is a unique data structure that is available in Python but not available as a built-in type in other programming languages such as C and C++.
It is considered as the king of data structures in Python because it is the only standard mapping type. A dictionary has a
group of elements that are arranged in the form of key-value pairs.
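For example:

# A small dictionary example: keys map to values.
marks = {"Asha": 82, "Ravi": 74, "Meena": 91}
print(marks["Ravi"])            # access a value by its key
marks["Ravi"] = 78              # modify an existing value
marks["John"] = 65              # add a new key-value pair
for name, score in marks.items():
    print(name, score)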

4. Explain the importance of exception handling?


Answer: Exception handling is an effective tool to handle the errors in the program. The purpose of exception handling is to
terminate the program properly and display a proper message to the programmer in case of an error. If the programmer can determine the type of errors that can occur, then exception handling is used to write efficient programs and deal with the problem gracefully.
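For example, the sketch below traps a divide-by-zero error and terminates gracefully with a proper message:

# Exception handling example: trap a divide-by-zero error.
try:
    numerator, denominator = 10, 0
    print("Result:", numerator / denominator)
except ZeroDivisionError:
    print("Error: division by zero is not allowed.")
finally:
    print("Program terminated properly.")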

5. What is meant by unsupervised machine learning?


Answer: Unsupervised machine learning algorithms are used when the output is not known and no predefined instructions
are available to the learning algorithms. In unsupervised learning, the learning algorithm has only the input data and
knowledge is extracted from these data. These algorithms create a new representation of the data that is easier to
comprehend than the original data and help to improve the accuracy of supervised algorithms by consuming less time and
reducing memory.
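A small illustration using k-means clustering from sklearn on unlabeled data (the data points and the number of clusters are assumptions made for the example):

# Unsupervised learning example: k-means clustering on unlabeled data.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])          # no output labels are given
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assigned to each observation
print(kmeans.cluster_centers_)   # learned cluster centres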

6. Is it possible for the users to create their own module? If yes, how?


Answer: Python supports the users by helping them to create their own module of frequently used functions and code. The
users can import the user-defined module in the program and use the desired functions of module in program, rather than
copying the code of frequently used functions into different programs. This will help the users to reduce lines of code, save
time and do efficient programming by managing code effectively. The users can create a file with desired functions and save
the file with py extension. The functions of this module can be accessed using import function in another program.
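For example (the file name mymath.py and the functions inside it are hypothetical):

# File mymath.py -- a user-defined module saved with a .py extension
def area_of_circle(radius):
    return 3.14159 * radius * radius

def simple_interest(principal, rate, years):
    return principal * rate * years / 100

# In another program, the module can be imported and its functions used:
# import mymath
# print(mymath.area_of_circle(5))
# print(mymath.simple_interest(1000, 7.5, 2))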

7. Differentiate between a list and array?


Answer: A list is a collection of objects and basically represents an ordered sequence of data. A list is similar to an array
that is a collection of elements of the same data type. But unlike an array, a list can hold elements of different data types also. This means a list may not be
homogeneous; this further implies that the elements of a list may not be of the same data type.

8. What is federated learning?


Answer: Federated learning enables edge devices to collaboratively learn a machine learning model and keep all of the data
on the device itself. Instead of moving data to a centralized place, the models are trained on the device and only the revisions
of the model are shared across the network for providing effective training to all.

9. Is it possible to connect to a database in Python?


Answer: Yes. SQLAlchemy library provides a lot of functions that can be primarily used for connecting to a database and
helps the user to perform SQL operations on the data.
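A small sketch of connecting to a database with SQLAlchemy (SQLAlchemy 1.4+ style is assumed; the SQLite file sales.db and the orders table are hypothetical):

# Connect to a local SQLite database and run a query with SQLAlchemy.
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///sales.db")       # connection to the database
with engine.connect() as conn:
    result = conn.execute(text("SELECT name, amount FROM orders"))
    for row in result:
        print(row)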

10. How is a function defined in Python?


Answer: A function is defined by the following syntax:
def fname(arg1, arg2, …):
function body
where:
• fname is the actual name of the function stored in Python environment,
• arg1 and arg2 are the optional arguments. An argument is like a placeholder. When a function is invoked, we pass a
value to the argument. This value is referred to as actual parameter or argument. The argument list refers to the order and
number of the arguments of a function. A function may or may not contain arguments,
• function body contains a collection of statements that define the task of function.
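For example:

# A function with two arguments that returns their average.
def average(num1, num2):
    result = (num1 + num2) / 2
    return result

print(average(10, 20))    # function call with actual parameters; prints 15.0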

11. Which library is used for machine learning?


Answer: sklearn provides a range of supervised and unsupervised machine learning algorithms such as clustering, cross
validation, datasets, dimensionality reduction, feature extraction, feature selection, parameter tuning, manifold learning,
generalized linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines, and
decision trees.

12. What is reinforcement learning?


Answer: Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions
with a dynamic environment. This learning model is very similar to the learning of human beings. Hence, it is close to
achieving perfection. It works on the principle of learning to maximize long-term reward through interaction with the
environment.

13. Which library is used for deep learning algorithms?


Answer: Keras is a library meant for deep learning with high-level neural networks. It runs on top of TensorFlow, CNTK,
or Theano. It allows for easy and fast prototyping with its important feature of modularity and extensibility. It supports both
convolutional networks and recurrent networks and runs on GPU also for faster processing of large data.

14. Name the libraries that help in performing data visualization in Python.
Answer: Matplotlib and Seaborn.

15. Differentiate between the different ways to import function from a library.
Answer: There are three different ways to import functions and module/library:
(a) from module/library name import *: This approach helps to import all the functions from the specified module/library in
the program.
(b) import module/library name: This approach also helps to import all the functions from the specified module/library. The difference between the two approaches is that with this approach, we need to specify the module/library name before calling a function, whereas with the previous approach the function can be called directly by specifying the function name only. Since the module name is written along with the function name, this approach removes the limitation of the first approach by removing chances of ambiguity relating to the function name.
(c) from module/library import function1, function2, …: This is considered to be the most efficient approach, since it helps the user to import only the required functions from the module/library, making the code more efficient and manageable.
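A minimal sketch of the three approaches using the standard math module:

from math import *                    # (a) functions can be called directly
print(sqrt(16))

import math                           # (b) functions are qualified with the module name
print(math.sqrt(16))

from math import sqrt, factorial      # (c) only the required functions are imported
print(sqrt(16), factorial(5))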

16. Differentiate between parametric techniques used for comparing means.


Answer: The t-test and analysis of variance (ANOVA) are two parametric statistical techniques used to test hypotheses when the dependent variable is continuous and the independent variable is categorical in nature. When the population means of only two groups are to be compared, the t-test is used, but when the means of more than two groups are to be compared, ANOVA is preferred.
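A minimal sketch using scipy.stats with hypothetical sample data:

from scipy import stats
group1 = [23, 25, 28, 30, 27]
group2 = [31, 29, 35, 32, 30]
group3 = [40, 38, 42, 39, 41]
t_stat, p_value = stats.ttest_ind(group1, group2)            # t-test: two groups
f_stat, p_anova = stats.f_oneway(group1, group2, group3)     # ANOVA: more than two groups
print(p_value, p_anova)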

17. Differentiate between list and tuple.


Answer: The main difference between a tuple and a list is that tuples are immutable while lists are mutable. This implies that we cannot modify the elements of a tuple, whereas we can modify the elements of a list. In real-world scenarios, tuples are generally used to store data that do not require any modification and only need to be retrieved. Unlike lists, which are represented by square brackets, tuples are represented by parentheses.
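A minimal sketch of the mutability difference:

marks_list = [78, 85, 91]     # list -- square brackets, mutable
marks_tuple = (78, 85, 91)    # tuple -- parentheses, immutable
marks_list[0] = 80            # allowed
# marks_tuple[0] = 80         # would raise TypeError: 'tuple' object does not support item assignment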

18. Explain the importance of supervised machine learning?


Answer: Supervised machine learning algorithms are used when we have labeled data and are trying to find a relationship model from the user's data. The algorithms learn to predict the output from the input data. They generate functions that map inputs to desired outputs and involve a dependent variable, which is predicted from a given set of independent variables. The training process continues until the model achieves a desired level of accuracy on the training data.

19. What is the utility of the pandas library?

Answer: This library helps to work with labeled and relational data. It is primarily used for data cleaning, data extraction, data processing, data aggregation, and visualization. It is like a spreadsheet in Python.
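A minimal sketch with a hypothetical dataframe:

import pandas as pd
df = pd.DataFrame({'name': ['Asha', 'Ravi', 'Meena'], 'sales': [250, 300, 150]})
print(df.describe())                         # quick statistical summary
print(df.groupby('name')['sales'].sum())     # aggregation, much like a spreadsheet pivot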

20. Differentiate between the representation of a black and white image and a colored image?
Answer: A black and white (grayscale) image is represented by pixels arranged in two dimensions. A pixel of pure black color is represented with intensity 0 and a pure white pixel with the maximum intensity, 255 (or 1 when the values are normalized). All the pixels in the image are represented by values depending on their intensities between black and white. A colored image, however, is generally represented in the RGB format; hence, a colored image has three matrices, with one matrix representing each color. For most images, pixel values are integers that range from 0 to 255, and these 256 possible values are the intensity values for the respective color.

21. Differentiate between unsupervised and supervised machine learning?


Answer: Supervised learning is used whenever we want to predict a certain outcome from a given input and we have data of
both input and output, whereas unsupervised machine learning does not have output data to train the machine.

22. What are the different techniques used for unsupervised machine learning?
Answer: The common unsupervised machine learning algorithms include dimensionality reduction algorithms and
clustering.

23. What is cluster analysis? What are the different forms of clustering?
Answer: Cluster analysis is the process of organizing objects into groups whose members are similar in some manner; it deals with finding a structure in a collection of unlabeled data. A cluster is a collection of objects that have similar characteristics among themselves and are dissimilar to the objects belonging to other clusters. Thus, data points inside a cluster are homogeneous and heterogeneous with respect to other groups. Choosing the right number of clusters is an important issue, because with too many clusters we start capturing noise, while with too few we are not able to capture the structure of the observations. There are generally two forms of clustering: k-means and hierarchical clustering.
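A minimal sketch of k-means clustering on hypothetical two-dimensional data:

from sklearn.cluster import KMeans
X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(model.labels_)             # cluster assigned to each observation
print(model.cluster_centers_)    # centroid of each cluster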

24. Can Naïve Bayes algorithm be used for regression problems?


Answer: No. Naïve Bayes algorithm is used only for classification problems.

25. What is meant by text mining?


Answer: Text mining is the process of evaluating large amount of textual data to produce meaningful information and to
convert the unstructured text data into structured text data for further analysis and visualization. Text mining helps to
identify unnoticed facts, relationships, and assertions of textual big data.

26. What are the different assumptions used for performing regression analysis?
Answer: The different assumptions include normality of variables, linearity, absence of multicollinearity, independence of errors, and homoscedasticity.

27. What is a word cloud?


Answer: For creating a visual impact, a word cloud is created from different words in the document. In the word cloud, the
size of the words is dependent on their frequencies.
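A minimal sketch, assuming the wordcloud package is installed and using an illustrative text string:

from wordcloud import WordCloud
import matplotlib.pyplot as plt
text = "python data python analytics data data machine learning"
wc = WordCloud(background_color='white').generate(text)
plt.imshow(wc, interpolation='bilinear')     # frequent words appear larger
plt.axis('off')
plt.show()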

28. Differentiate between iloc and loc indexers used for data extraction?
Answer: The iloc indexer helps to extract particular row(s) and column(s) by their integer positions, that is, by the order in which they appear in the dataframe, whereas the loc indexer extracts row(s) and column(s) by the index labels (and column names) specified within the brackets.
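A minimal sketch with a hypothetical dataframe indexed by roll number:

import pandas as pd
df = pd.DataFrame({'name': ['Asha', 'Ravi', 'Meena'], 'marks': [78, 85, 91]},
                  index=[101, 102, 103])
print(df.iloc[0])      # first row by position (row number 0)
print(df.loc[101])     # row whose index label is 101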

29. What are the different ensemble techniques that can be used in supervised machine learning?
Answer: The different ensemble techniques include Bagging, Random Forest (Extension of Bagging), Extra tree, AdaBoost,
and Gradient Boosting.

30. What are stop words?


Answer: Text may contain stop words such as is, am, are, this, a, an, and the. These stop words are considered as noise in the text and hence should be removed. Before analyzing the text data, we should filter these stop words out of the list of tokens.
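A minimal sketch using NLTK (assumes the stopwords corpus and tokenizer models have been downloaded):

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
tokens = word_tokenize("This is an example of removing the stop words")
stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)        # stop words such as 'This', 'is', 'an', 'the' are filtered out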

31. What is the role of k in k-NN algorithm?


Answer: In the k-NN algorithm, k specifies the number of neighboring observations that contribute to the output predictions. It is a good idea to try many different values of k and determine the value that works best for the problem.
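A minimal sketch of trying different values of k on the iris sample dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    print(k, round(cross_val_score(knn, X, y, cv=5).mean(), 3))   # pick the best-performing k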

32. Differentiate between regression and classification.


Answer: Regression is a form of predictive modeling technique that estimates the relationship between a dependent (target) variable and independent variable(s) (predictors). In classification, the goal is to predict a categorical variable, whereas the goal of regression is to predict a continuous variable.

33. Explain the importance of polarity in sentiment analysis?


Answer: Sentiment analysis provides a way to understand the attitudes and opinions expressed in texts. Sentiment polarity
is typically a numeric score that is assigned to both the positive and negative aspects of a text document based on subjective
parameters such as specific words and phrases expressing feelings and emotions. Neutral sentiment typically has 0 polarity
since it does not express any specific sentiment, positive sentiment will have polarity > 0, and negative sentiment will have
polarity < 0.
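A minimal sketch, assuming the textblob package is installed (the review sentences are illustrative only):

from textblob import TextBlob
print(TextBlob("The service was excellent").sentiment.polarity)   # > 0, positive sentiment
print(TextBlob("The food was terrible").sentiment.polarity)       # < 0, negative sentiment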

34. What is tokenization in text mining?


Answer: Tokenization is the process of breaking down a text paragraph into smaller chunks such as words or sentences. A token is a single entity that is the building block of a sentence or a paragraph. The sentence tokenizer breaks a text paragraph into sentences, whereas the word tokenizer breaks a text paragraph into words.
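A minimal sketch using the NLTK tokenizers (assumes the punkt tokenizer models have been downloaded):

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize
text = "Python is popular. It is widely used for analytics."
print(sent_tokenize(text))    # breaks the paragraph into sentences
print(word_tokenize(text))    # breaks the paragraph into words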

35. What is the importance of domain.yml in chatbot development?


Answer: The domain.yml file is called the universe of the chatbot and it contains details of everything: intents, actions,
templates, slots, entities, forms, and response templates that the assistant understands and operates with.

36. Differentiate between entity and intent in chatbot?


Answer: An intent describes the purpose behind the message given by the user (for example, a greeting or a booking request), whereas an entity is a piece of information inside the message, which is defined in the intent. In the training data, an entity is defined in parentheses and the possible values of the entity are written in square brackets.

37. Discuss the role of neuron in neural network model.


Answer: The neuron receives inputs, multiplies them by weights, adds a bias, and then passes the result into an activation function to produce an output. The most important step is the adjustment of the weights for the next pass, which is done by comparing the outputs with the original labels. This process is repeated until we reach a maximum number of allowed iterations or an acceptable error rate.

38. How do we create a neural network model?


Answer: To create a neural network, we basically add layers of neurons together and finally create a multilayer model of a neural network. A model should primarily have two layers: input and output. The input layer directly takes the input features and the output layer creates the resulting outputs. The effectiveness of a neural network model is enhanced when we use more layers between the input and output layers. These layers are called hidden layers because they do not directly observe the feature inputs or outputs.

39. What is the role of loss argument in the compiling the neural network model?
Answer: The loss is the quantity we need to minimize during training; it measures the model's performance on the training data, enables the training to move in the right direction, and hence represents a measure of success. The loss function is used as a feedback signal for learning the weight tensors, and the training phase attempts to minimize it. The different types of loss that can be specified include mean squared error (mse), mean absolute error (mae), categorical_crossentropy, and binary_crossentropy.

40. What is an optimizer? What are the different optimizers that can be used in a neural network model?
Answer: An optimizer is the mechanism through which the model updates itself based on the loss function. The optimizer specifies the exact way in which the gradient of the loss will be used to update the parameters. The different optimizers include Adaptive Moment Estimation ("adam"), stochastic gradient descent ("SGD"), "RMSprop", "Adagrad", "Adadelta", "Adamax", and "Nadam".
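A minimal sketch tying together questions 38-40: a small Keras model with one hidden layer, compiled with a loss and an optimizer (the layer sizes and the choice of 10 input features are illustrative only):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense
model = Sequential()
model.add(Input(shape=(10,)))                 # input layer: 10 features
model.add(Dense(16, activation='relu'))       # hidden layer
model.add(Dense(1, activation='sigmoid'))     # output layer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()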

41. Explain the importance of transfer learning?


Answer: Transfer learning means the application of skills, knowledge, and/or attitudes that were learned in one situation to another learning situation. A pretrained model works on the concept of transfer learning: it is a model created by someone else to solve a similar problem. Instead of building a model from scratch, the model already trained on the other problem is taken as the starting point for the new model.

42. What are the different trained algorithms available for text data?
Answer: The different trained algorithms available for text data include Bert, GPT2, Roberta, XLM, and DistilBert.

43. How do recommendation systems contribute in service industry?


Answer: Recommendation systems are one of the most popular and widely adopted applications of machine learning in the service industry. In the online medium, people post tweets or send reviews related to different services such as bus travel, ticket booking, hospitality services, food services, tourism services, personalized home services, and salon services. When a user types a particular text to reach his/her desired service, the recommendation system basically tries to understand the features that govern the customer's choice and tries to determine the similarity between two services. On the basis of the similarity scores, services related to destination, food delivery, commuting, etc., are recommended. A recommendation system can be implemented either by measuring popularity, ratings, recommendations, etc., or on the basis of information such as the name of the service provider, the review of the service, and the quality of the service.

44. Which are the most popular trained models available for image data analysis?
Answer: The popular pretrained models such as MobileNet, MobileNetV2, ResNet50, VGG16, and VGG19 are used for
performing unsupervised and supervised machine learning on image data.
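A minimal sketch of loading one such pretrained model for feature extraction (the ImageNet weights are downloaded on first use):

from tensorflow.keras.applications import VGG16
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base_model.summary()       # convolutional base that can extract features from images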

45. What are chatbots?


Answer: Chatbots are a form of human–computer dialog system that operates through natural language using text or speech.
These chatbots are autonomous and can operate anytime, day or night, and can handle repetitive, boring tasks. They help to
drive conversation and are also scalable because they can handle millions of requests.

46. How can an effective chatbot be created?


Answer: An effective chatbot can be created by providing huge training data, including out-of-vocabulary words, managing
similar intents, and having balanced and secured data.

47. Which techniques are used to determine similar text and images to a given text or image?
Answer: The different techniques include cosine_similarity, euclidean_distances, and manhattan_distances, which are
available in sklearn.metrics.pairwise library.
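A minimal sketch with hypothetical feature vectors:

from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances, manhattan_distances
a = [[1, 0, 2, 3]]
b = [[2, 1, 2, 2]]
print(cosine_similarity(a, b))      # closer to 1 means more similar
print(euclidean_distances(a, b))    # smaller distance means more similar
print(manhattan_distances(a, b))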

48. Is it possible to use classification algorithms on text data?


Answer: Yes. All the classification algorithms used on numeric data can be used on text data also.

49. How can cluster analysis be helpful on image data?


Answer: This technique can be used to group similar images that can be an advantage in the e-commerce industry.

50. Explain the process of generative adversarial network?


Answer: The basic concept of generative adversarial network is to have two neural networks (generator and discriminator)
that contest with each other in a game. Two models are trained simultaneously by an adversarial process. A generator learns
to create images that look real, whereas the discriminator learns to differentiate real images from unreal images. The
generator model produces synthetic examples from random noise sampled using a distribution, which along with real
examples form a training dataset. This is then fed to the discriminator, which attempts to distinguish between the two by
displaying the results as true or false for the generated image.

Index

A
abnormal loop termination, 43–47
AdaBoost, 411–419
add() function, 69
Anaconda, 5–6
“and” operator, 15–16
area plots, 183–184
area_rectangle() function, 59
ARIMA models, 289–292
arithmetic operators, 12–13
armstrong (num) statement, 63
array() function, 133, 139
array module, 111–113
assigning values to variable, 7
assignment operator, 13–14

B
bagging algorithm, 477–478
bagging (Bootstrap), 389–396
balanced score card model, 302
bar() function, 180–183
Bert algorithm, 531–533, 541–542, 548–549, 554–559, 569–570, 572–573
Boolean values, 16
break statement, 44–46

C
capitalize() function, 117
casefold() function, 117
centre() function, 117
chatbots, 651
basic, 659–664
with database, 685–691
effective, 701–702
with entities and actions, 664–676
with forms, 692–700
Rasa, 652–659
with slots, 676–685
choice() function, 69
classification, 342–351
clock() function, 123
clustering, 305–313
CNN model, 504–521
compile-time errors, 47
continue statement, 46
contour() function, 186–187
copy() command, 133–134
core libraries in Python, 17–18
core modules in Python, 17
cosine similarity, 448–449, 469–470, 529–530
count() function, 117–118

D
Dataframe, 147–150
adding rows and columns from, 149–150
charts for, 163–166
creating, 148–149
data extraction, 156–162
deleting rows and columns from, 150
functions of, 151–156
group by functionality, 162–163
head and tail() function, 152–154
importing of data, 150–151
mathematical and statistical functions of, 155
missing value, handling of, 166–168
Series() function, 147–148
sort functions of, 156
data structures in Python, 71–72
date() function, 124
datetime module, 124
debugging, 47
decision-making structures, 21–30
decision tree, 372–382, 476–477
def statement, 57
dictionary
accessing dictionary elements, 97–98
creating, 96–97
functions for, 98–99
programming with, 99–102
dimensionality reduction, 297
factor analysis method, 298–301
principal component analysis, 302–304
dir() function, 125
DistilBert algorithm, 536–538, 544–545, 551–552, 564–566, 570–571

E
end argument, 8–9
endswith() function, 117–118
errors
compile-time, 47
logical, 48
run-time, 47–48
Euclidean distance, 449–450, 470–471, 530–531
eval() function, 11–12
exception-handling mechanism, 47–50
extra trees algorithm, 404–410

F
fact() function, 65
federated learning, 707–709
findall() function, 119, 122–123
find() function, 117–119
float() function, 11
“for” loops, 30–35
nested, 35–36

G
generative adversarial network (GAN), 714–717
get_var() function, 69
GPT2 algorithm, 534, 542–543, 549, 560–561
Gradient Boosting, 420–429

graph neural networks (GNNs), 710–713
green building, 228
greet() function, 62

H
histogram, 179–180
hyperparameters, 385–386

I
identifier, 7–8
“if…else if” statement, 26–30
If…else Statement, 22–24
“if” statement, 21–22
iloc indexer, 158–162
image recognition, pretrained models for, 627–633
image similarity techniques, 579–595
import functions, 18
in-built modules
array module, 111–113
datetime module, 124
Math module, 105–107
“os” module, 124
random module, 107–109
“re” module, 119–123
statistics module, 109–110
string module, 113–119
time module, 123
index, 75
index() function, 119
input() function, 10, 65
with eval() function, 11–12
with float() function, 11
with int() function, 10–11
Integrated Development Environment (IDE), 6
interest() function, 125
int() function, 10–11
items() function, 97–98

J
join() function, 117–118
Jupyter notebook, 6
Jupyter software, 6

K
Keras, 18
k-Nearest neighbor’s (k-NN) algorithm, 358–364

L
len() function, 117–118
length() function, 65
linalg sub-package, 219–220
linspace() function, 135
lists, 72–91
accessing list elements, 75–78
creating, 73–75
functions for, 78–80
programming with, 80–91

ljust() function, 117
logical errors, 48
logical operators, 15–16, 157–158, 256–257
loop statement, 30–39
“while” loops, 36–39

M
machine learning
accuracy score, 321
classification report, 321
confusion matrix, 320
data exploration and preparation, 317–319
feature extraction, 321–322
image acquisition and preprocessing, 459–468
image similarity techniques, 469–472
model development, 320
overfitting or under fitting, 321
RMSE value, 320
ROC curve and AUC value, 321
tuning of hyper parameters, 322
Manhattan distance function, 450, 472, 531
match() function, 119
Math module, 105–107
matplotlib, 18
matplotlib library, 171
meshgrid() function, 185–186
multidimensional array
accessing elements in, 140–142
creating, 139–140
functions on, 142–143
mathematical operations, 143–144
relational operators, 144
multilayer perceptron model, 488–496
multiply() function, 69

N
Naïve Bayes algorithm, 355–357, 475–476
naming rules, 7
ndimage Sub-Package, 241–248
blur effect to image, 244–245
colours, 246–247
cropping of image, 245
filters, 245–246
flip effect, 242–243
rotate image, 243–244
uniform filters, 247–248
nested “if” statements, 25–26
nesting of conditional statements and loops, 39–43
“for” loop inside “if” conditional statement, 39–40
“if” statement inside “for” loop, 40–42
“if” statement inside “while” loop, 42
using “for,” “while,” and “if” together, 43
neural network model, 484–488
recurrent, 497–504
newtable() function, 58
nltk library, 18
“not” operator, 15–16
NumPy (Numerical Python), 17

O
one-dimensional array, 133–139
accessing elements, 135
functions of, 135–137
mathematical operators for, 137–139
relational operators for, 139
operator precedence, 16–17
“or” operator, 15
“os” module, 124
output() function, 69

P
pandas, 17
pass statement, 46–47
performance() function, 125
pie() function, 176–177
plot() function, 171–176
print() function, 7–8
end argument in, 8
sep argument in, 9
PyCharm software, 6
Python
core libraries in, 17–18
core modules in, 17
features of, 3
getting started, 7
input in, 10–12
installation of, 4–6
interpreter, 4
loops, 30–39
naming rules, 7
operators, 12–17
output in, 7–9
prompt, 4
variables in, 7
under Windows program files, 4

Q
question answers model, 567–575
quiver() function, 184–185

R
Random Forest algorithm, 396–403, 477
random() function, 107–108
randrange() function, 107–108
recursive function, 65
regression, 322–341
reinforcement learning, 705–706
relational operators, 15, 156–157, 254–256
“re” module, 119–123
reserved words, 7
return statement, 63–64
reverse() function, 64
rfind() function, 119
rjust() function, 117
Roberta algorithm, 534–535, 543–544, 549–550, 561–562
run-time errors, 47–48
Ru_rubert Algorithm, 575

S
scatter() function, 178–179
scipy (Scientific Python), 18
seaborn, 18
search() function, 119
seed() function, 107–108
sentiment analysis, 441–448
sep argument, 9
show() function, 69
sklearn (Scikit-learn) library, 18
sleep() function, 123
special sub-package, 241
split() function, 117–118
Spyder software, 5
SQL operations, 251–252
advanced, 268–274
DELETE statement, 261
GROUP BY clause, 263–266
inbuilt functions, 262
Insert statement, 260–261
Intersect and Union clauses, 270–271
IS NULL condition, 259–260
joining, 272–274
LIKE operator, 258–259
IN and NOT IN clauses, 257–258
ORDER BY clause, 263
ranking functions, 266–268
SELECT clause, 252–254
subquery, 271–272
Update statement, 261
WHERE clause, 254–260
startswith() function, 117–118
stationarity of series, 279–281
statistics module, 109–110
statsmodels, 18
stats Sub-Package
ANOVA, 233–235
chi-square test, 223–224
correlation, 223
descriptive statistics, 221–222
homogeneity of variance, 222–223
Kolmogorov–Smirnov test, 236–238
Kruskal–Wallis test, 239–241
Mann–Whitney test, 238–239
normality of data, 222
parametric techniques, 224–225
rank, 222
t-test, 225–233
Wilcoxon test, 239
string, 7
string module, 113–119
accessing string elements, 115–116
alignment and indentation functions, 117
case conversion functions, 116–117
sub() function, 119
subscript, 75
subtract() function, 69
sumof fact() function, 60
supervised machine learning, 453–456, 475–478, 546–552
image similarity technique, 614–626
support vector machines, 365–371
swapcase() function, 117

T
task of function, 51
text mining, 433–440
lemmatization, 438–439
shallow parsing, 436–437
stemming, 438–439
stop words, 437–438
word cloud, 439–440
text similarity techniques, 448–450, 525–537
time() function, 124
time module, 123
time series
creating subset, 278–279
reading data, 277–278
stationary, 281–288
title() function, 117
tuples, 91–95
accessing tuple elements, 93
creating, 92–93
functions for, 93–95
programming with, 95

U
unsupervised machine learning, 451–452, 473–474, 538–545
image similarity technique, 595–613
user-defined functions, 50–69
with arguments, 56–61
nesting of function, 62–65
recursive function, 65
scope of variables within functions, 66–69
with single argument, 57–61
without arguments, 51–56
user-defined model for feature extraction, 633–648
user-defined module
creating a module, 125
importing, 125–128
user-defined trained deep learning model, 552–566

V
values() function, 97–98
view() command, 133–134
violinplot() function, 177–178
visualization for categorical variables, 191–192
bar plot, 195
box plot, 192
count plot, 194–195
Facet Grid, 199–200
factor plot, 197–198
line plot, 194
point plot, 193–194
strip plot, 195–196
swarm plot, 196
violin plot, 193
visualization for continuous variables, 200
heat map, 202–204
joint hexbin plot, 207
joint kernel density plot, 207–208
joint plot, 205–207
pair grid, 212–215
pair plot, 208–212

regression plot, 201–202
scatter plot, 201
univariate distribution plot, 204–205
vowel() function, 65

W
welcome() function, 63–64
“while” loops, 36–39
nested, 38–39

X
XLM algorithm, 535–536, 544, 550–551, 563–564

Z
zeros() function, 134–135
zfill() function, 117

