
DATA ANALYTICS WITH R

About the Book


Data analysis is the method of examining, cleansing, and modeling data with the objective of
determining useful information for effective decision-making and operations. It includes diverse
techniques and tools and plays a major role in different business, science, and social science
areas. R software provides numerous functions and packages for applying different techniques to
produce the desired outcome.

Data Analytics with R will enable readers to gain sufficient knowledge and experience to perform
analysis using different analytical tools available in R. Each chapter begins with a number of
important and interesting examples taken from a variety of sectors. The objective is to explain
the concepts and to simultaneously develop in readers an understanding of their application with
real-life examples. This easy-to-understand approach would enable readers to develop the
required skills and apply techniques to solve all types of problems related to R.

Salient Features
• 500+ real-life examples.
• 30+ Case Studies related to different sectors.
• 200+ Objective Type Questions with answers.
• 40+ Practical Exercises with solutions.
• 50+ datasets for different problems.
• Thorough refresher on the Basics of R.
• Examination of Basic and Advanced Visualization Techniques.
• Description of Statistical Techniques in R.
• Detailed explanation and coverage of Machine Learning.

Dr. Bharti Motwani

Wiley India Pvt. Ltd.


Customer Care +91 120 6291100
Dr. Bharti Motwani
ISBN: 978-81-265-7646-3
csupport@wiley.com
www.wileyindia.com
www.wiley.com

DATA ANALYTICS WITH R
Dr. Bharti Motwani
Associate Professor
Balaji Institute of Modern Management
Pune
Data Analytics with R
Copyright © 2019 by Wiley India Pvt. Ltd., 4436/7, Ansari Road, Daryaganj, New Delhi-110002.
Cover Image: © Toria/Shutterstock
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by
any means, electronic, mechanical, photocopying, recording or scanning without the written permission of the publisher.

Limits of Liability: While the publisher and the author have used their best efforts in preparing this book, Wiley and the
author make no representation or warranties with respect to the accuracy or completeness of the contents of this book,
and specifically disclaim any implied warranties of merchantability or fitness for any particular purpose. There are no
warranties which extend beyond the descriptions contained in this paragraph. No warranty may be created or extended by
sales representatives or written sales materials. The accuracy and completeness of the information provided herein and
the opinions stated herein are not guaranteed or warranted to produce any particular results, and the advice and strategies
contained herein may not be suitable for every individual. Neither Wiley India nor the author shall be liable for any loss of
profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Disclaimer: The contents of this book have been checked for accuracy. Since deviations cannot be precluded entirely,
Wiley or its author cannot guarantee full agreement. As the book is intended for educational purpose, Wiley or its author
shall not be responsible for any errors, omissions or damages arising out of the use of the information contained in the
book. This publication is designed to provide accurate and authoritative information with regard to the subject matter
covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services.

Trademarks: All brand names and product names used in this book are trademarks, registered trademarks, or trade
names of their respective holders. Wiley is not associated with any product or vendor mentioned in this book.

Other Wiley Editorial Offices:


John Wiley & Sons, Inc. 111 River Street, Hoboken, NJ 07030, USA
Wiley-VCH Verlag GmbH, Pappelallee 3, D-69469 Weinheim, Germany
John Wiley & Sons Australia Ltd, 42 McDougall Street, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 1 Fusionopolis Walk #07-01 Solaris, South Tower, Singapore 138628
John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada, M9W 1L1

First Edition: 2019


ISBN: 978-81-265-7646-3
ISBN: 978-81-265-8835-0 (ebk)
www.wileyindia.com
Printed at:
Preface

Educating effective future leaders is a great responsibility. There is a need to rise above the antiquated
approaches of earlier days and infuse the spirit of participation, the spirit of adaptation and the spirit of
adventure. This will happen best in learning environments which are both serious and focused on the one
hand, but which are also joyous and inspiring, operating on the cutting edge of pedagogy and knowledge.
Having spent more than 21 years in the field of information technology and having published more than
75 research papers, I feel obliged to share my knowledge and experience related to analysis of real-life
situations. This book has evolved from my experience of teaching in several technical institutions, providing
consultancy, conducting research methodology workshops and working in the IT industry. In order to
provide a more meaningful and easier learning experience, this book has been written with interesting
and relevant examples. Each chapter contains numerous problems of different types to help readers
evaluate themselves.
The international success of research depends on its reputation for high-quality tools used in analysis.
This quality and its international perception must continue to thrive under the new arrangements. This
means a renewed commitment to high-quality higher education that is more responsive to choice and which
provides the best possible experience. In this competitive world, there is a need to continue supporting core
strengths in higher education: build on a reputation for excellence and diversity in learning and teaching,
world-leading research and an enviable record of knowledge exchange. The goal of this book is to open
the doors of opportunity related to different analytical techniques across a broad array of datasets. It is an
attempt to provide a reservoir of updated knowledge on varied tools for academicians, consultants, research
scholars, practitioners and students. Readers are advised to execute the programs to better understand the
utility and effectiveness of each concept.
Readers' views, observations, constructive criticism and suggestions are welcome at
bhartimotwani@hotmail.com.

About the Book


Data analysis is the method of examining, cleansing, and modeling data with the objective of determining useful
information for effective decision-making and operations. It includes diverse techniques and tools and plays
a major role in different business, science and social science areas. R software provides numerous functions
and packages for applying different techniques to produce the desired outcome.
Data Analytics with R will enable readers to gain sufficient knowledge and experience to perform analysis
using different analytical tools available in R. Each chapter begins with a number of important and interesting
examples taken from a variety of sectors. The objective is to explain the concepts and to simultaneously
develop in readers an understanding of their application with real-life examples. This easy-to-understand
approach would enable readers to develop the required skills and apply techniques to solve all types of
problems related to R.


List of Color Figures


The following color figures are available at http://www.wileyindia.com
Chapter 4: Fig. 4.5
Chapter 6: Figs. 6.40, 6.42 to 6.64
Chapter 7: Figs. 7.1 to 7.9, 7.11 to 7.22, 7.24 to 7.38
Chapter 8: Figs. 8.4 to 8.11
Chapter 9: Figs. 9.1 to 9.9, 9.11 to 9.13, 9.15, 9.17, 9.18, 9.20
Chapter 10: Figs. 10.1, 10.2, 10.4, 10.6, 10.7, 10.20, 10.21
Chapter 11: Figs. 11.4, 11.5, 11.8, 11.9, 11.11, 11.13, 11.16, 11.17, 11.18
Chapter 12: Fig. 12.6
Chapter 13: Figs. 13.14, 13.15
Chapter 14: Figs. 14.17, 14.12, 14.13, 14.14
Chapter 15: Figs. 15.4, 15.6

Organization of the Book


This book contains 16 chapters divided into 4 parts:
1. Part 1 – Basics of R: This deals with the basic operations available in R. There are five chapters, which
form a base for beginners in R programming.
• Chapter 1 discusses variables, input and output in R, along with the in-built functions of R.
• Chapter 2 expounds on six basic data structures available in R – Vector, Matrix, Array, List,
Factor and Data Frame. This chapter discusses different mechanisms to create these data structures,
access their elements and apply functions to them.
• Chapter 3 is related to programming in R. This chapter details decision-making structures, loops
and user-defined functions, which are essential for effective programming in any language.
• Chapter 4 is related to data exploration and manipulation. This chapter throws light on missing
data imputation and effective data management using the apply() family of functions and the dplyr package.
• Chapter 5 provides information on importing data from and exporting data to different file formats
and software: text, Excel, XML, JSON, MySQL, SPSS and SAS.
2. Part 2 – Visualization Techniques: This part covers the visualization techniques that are available
in R. Its two chapters include step-by-step procedures to create different types of charts in R for
better interpretation.
• Chapter 6 is on basic visualization techniques. This chapter deals with basic charts
such as Bar Chart, Pie Chart, Histogram, Box Plot, Image Plot, Q-Q Plot, etc.
• Chapter 7, on advanced visualization techniques, discusses special charts in R, which include Scatter
Plot Matrices, 3D Pie Charts, Corrgrams, Tree Map and Heat Map. The chapter also presents the
special techniques available in the ggplot2 package for visualization.
3. Part 3 – Statistical Analysis: Different types of statistical techniques available in R are
covered in this part. There are three chapters, on basic statistical techniques, comparing means
through parametric and non-parametric methods, and time series models.


• Chapter 8, on basic statistics, primarily discusses functions to compute descriptive
statistics, correlation and covariance, and simulation and distributions in R.
• Chapter 9 deals with both parametric and non-parametric techniques for comparing
means. All the tests in both techniques are applied to two types of data: the user's own data
and existing datasets available in the R environment. This will help the user gain a better
understanding of the concepts and familiarity with the datasets available in the R environment.
• Chapter 10, on time series models, primarily discusses smoothing and seasonal decomposition for
time series data.
4. Part 4 – Machine Learning: This part shows the real strength of R. It includes
six chapters and progresses from basic machine learning algorithms to deep learning algorithms for different
types of data. All the algorithms covered in this part are discussed and applied to existing
datasets available in the R environment or from other reputed sources. The source of each dataset is
specified with complete information. Machine learning comprises both unsupervised and supervised
algorithms.
• Chapter 11 discusses unsupervised machine learning algorithms: factor analysis and cluster analysis.
• Chapter 12 throws light on the basic supervised machine learning problems: regression and
classification.
• Chapter 13 discusses different machine learning algorithms used for regression and classification
problems, such as Naïve Bayes, KNN, Support Vector Machines and Decision Tree.
• Chapter 14 discusses ensemble techniques of machine learning algorithms, such as Bagging,
Random Forest and Gradient Boosting. These techniques often give better results because they
combine the outputs of a group of models.
• Chapter 15 focuses only on text data and discusses text mining and sentiment analysis. With the
advent of e-Commerce, a lot of available data is in the form of text, and analysis is required for this
new type of data.
• Chapter 16 is related to neural networks (an advanced machine learning technique – deep learning).
This chapter discusses the development of deep learning models for different tensor structures. A
Multilayer Perceptron model for a 2-D tensor (normal data), a Recurrent Neural Network model
for a 3-D tensor (time series data) and a Convolutional Neural Network model for a 4-D tensor (image
data) are developed and the results are analyzed in this chapter.

Instructor Resources
The following resources are available for instructors on request. To register, please log on to
https://www.wileyindia.com/Instructor_Manuals/Register/login.php
1. Chapter-wise PowerPoint Presentations (PPTs)
2. Chapter-wise Solution Manuals

Acknowledgements
Expression of feelings by words loses its significance when it comes to saying a few words of gratitude, yet to
express it in some form, however imperfect, is a duty towards those who helped. I offer my special gratitude
to almighty God for His blessings, which have made completion of this book possible.


I find myself at a loss for words to express my deep sense of gratitude to my father, Mr. Shrichand
Jagwani, and mother, Mrs. Anita Jagwani, for their affection, continuous support, constant encouragement
and understanding.
My real strength has been the selfless cooperation, solicitous concern and emotional support of my
husband, Mr. Bharat Motwani. No words can convey my gratitude to my children, Pearl and Jahan, who
had to tolerate my preoccupation with this book. Their patience, forbearance, love and support through this
whole process have made this mind-absorbing and time-consuming task possible.
I am grateful to the President of Sri Balaji Society Dr. (Col.) A. Balasubramanian for his guidance and
all the faculty and staff members for providing a conducive environment.
I am also thankful to all those people whose constructive suggestions and work have helped to enhance
the standard of the work directly and/or indirectly and brought the task to fruition.
I am indebted to Wiley Publishers for their sincere efforts, unfailing courtesy and cooperation in
bringing out the book in this elegant form. It has been a real pleasure working with such professional staff.

Dr. Bharti Motwani



About the Author

Dr. Bharti Motwani is an analytical professional, an IT and analytics consultant,
and a result-driven and articulate academician who can think out of
the box, with more than 21 years of experience in the corporate world and
academics.
She has demonstrated proficiency in guiding Ph.D. candidates, writing
books, reviewing journals, and editing books and journals. She has written
more than 75 research papers in leading national and international journals
and books, including journals indexed in Scopus, with publishers including
Elsevier. She is the recipient of the Young Scientist of the Year award. She
has demonstrated dexterity in research methodologies and software development
by conducting various seminars and workshops related to the latest
tools in research and software and by guiding various research and software
projects. She has high technical expertise in Data Analytics Software
(R, Python, Tableau, Spark, Power BI, SAS, SPSS, AMOS, Smart PLS);
Front End Tools (Visual Basic, Java, JavaScript, C, C++, C#, HTML, PHP); Back End Tools (Oracle,
MS SQL Server, MySQL, MS Access, FoxPro); and various IDE and Web designing Tools.


Contents

Preface
About the Author

PART 1 Basics of R
Chapter 1 Introduction to R
1.1 Features of R
1.2 Installation of R
1.3 Getting Started
1.3.1 Window Sections of RStudio
1.3.2 First Interaction
1.3.3 Command Line versus Scripts
1.3.4 Comments
1.3.5 Help in R
1.3.6 Directory
1.4 Variables in R
1.4.1 Naming Variables
1.4.2 Assigning Values to Variables
1.4.3 Finding Variables
1.4.4 Removing Variables
1.5 Input of Data
1.5.1 Input of Data from Terminal
1.5.2 Input of Data through R-Objects
1.6 Output in R
1.6.1 print() Function
1.6.2 cat() Function
1.7 In-Built Functions in R
1.7.1 Mathematical Functions
1.7.2 Trigonometric Functions
1.7.3 Logarithmic Functions
1.7.4 Date and Time Functions
1.7.5 Sequence Function


1.7.6 Repeat Function
1.7.7 String Functions
1.8 Packages in R
1.8.1 Standard Packages
1.8.2 Contributed Packages
Chapter 2 Data Types of R
2.1 Vectors
2.1.1 Class of a Vector
2.1.2 Elements of a Vector
2.1.3 Accessing Vector Elements
2.1.4 Functions for Vectors
2.2 Matrices
2.2.1 Creating a Matrix
2.2.2 Accessing Matrix Elements
2.2.3 Functions for Matrices
2.3 Arrays
2.3.1 Creating an Array
2.3.2 Accessing Elements of an Array
2.3.3 Functions for Array
2.4 Lists
2.4.1 Creating a List
2.4.2 Accessing List Elements
2.4.3 Functions for List
2.5 Factors
2.5.1 Creating a Factor
2.5.2 Accessing Factors
2.5.3 Functions for Factors
2.6 Data Frame
2.6.1 Creating a Data Frame
2.6.2 Accessing Data from Data Frame
2.6.3 Functions for Data Frame
Chapter 3 Programming in R
3.1 Decision-Making Structures
3.1.1 If Structure
3.1.2 Switch Statement
3.2 Loops
3.2.1 For Loop
3.2.2 While Loop
3.2.3 Repeat Loop
3.3 User-Defined Functions
3.3.1 Function without Arguments


3.3.2 Function with Arguments
3.3.3 Nesting of Functions
3.4 User-Defined Package
3.4.1 Creating Data and Function File
3.4.2 Uploading the Package
3.5 Reports using Rmarkdown
3.5.1 Direct Rendering
3.5.2 Indirect Rendering
Chapter 4 Data Exploration and Manipulation
4.1 Missing Data Management
4.1.1 Determining Missing Data
4.1.2 Excluding Missing Values
4.1.3 Missing Data Imputation
4.2 Data Reshaping through Melting and Casting
4.2.1 Data Melting
4.2.2 Casting the Molten Data
4.3 Special Functions across Data Elements
4.3.1 Data Management using Apply Functions
4.3.2 Data Management with dplyr Package
Chapter 5 Import and Export of Data
5.1 Import and Export of Data in Text File
5.1.1 Reading from Text Data Format
5.1.2 Writing to Text Data Format
5.2 Import and Export of Data in Excel
5.2.1 Reading from Excel Format
5.2.2 Writing to Excel Format
5.3 Import and Export of Data in XML
5.3.1 Reading from XML Format
5.3.2 Writing to XML Format
5.4 Import and Export of Data in JSON
5.4.1 Reading from JSON Format
5.4.2 Writing to JSON Format
5.5 Import and Export of Data in MySQL
5.5.1 Reading from MySQL
5.5.2 Writing to MySQL
5.6 Import and Export of Data in SPSS
5.6.1 Reading from SPSS Format
5.6.2 Writing to SPSS Format
5.7 Import and Export of Data in SAS
5.7.1 Reading from SAS Format
5.7.2 Writing to SAS Format


PART 2 Visualization Techniques
Chapter 6 Basic Visualization
6.1 Pie Chart
6.2 Bar Chart
6.3 Histograms
6.4 Line Chart
6.5 Kernel Density Plots
6.6 Quantile-Quantile (Q-Q) Plot
6.7 Box-and-Whisker Plot
6.8 Violin Plot
6.9 Dot Chart
6.10 Bubble Plot
6.11 Image Plot
6.12 Mosaic Plot
Summary of Additional Plotting Commands
Chapter 7 Advanced Visualization
7.1 Scatter Plot
7.2 Corrgrams
7.3 Star and Segment Plots
7.4 Tree Maps
7.5 Heat Map
7.6 Perspective and Contour Plot
7.7 Using ggplot2 for Advanced Graphics

PART 3 Statistical Analysis
Chapter 8 Basic Statistics
8.1 Descriptive Statistics
8.1.1 Measures of Central Tendency
8.1.2 Measures of Variability
8.1.3 Quantile
8.1.4 Rank
8.1.5 Skewness and Kurtosis
8.2 Table in R
8.2.1 Creating a Table
8.2.2 Marginal Distributions
8.2.3 Calculation of Proportions
8.3 Correlation and Covariance
8.3.1 Coefficient of Correlation
8.3.2 Coefficient of Covariance
8.3.3 Correlation and Covariance at Successive Lags
8.3.4 Chi-Square Test for Correlation


8.4 Simulation and Distributions
8.4.1 Normal Distribution
8.4.2 Binomial Distribution
8.4.3 Monte Carlo Simulation
8.5 Reproducing Same Data
8.5.1 Random Selection from Sample
8.5.2 Random Selection from Distribution
8.5.3 Random Selection from Dataset
Case Study: Web Analytics using Goal Funnels: Asset for e-Commerce Business
Chapter 9 Compare Means
9.1 Parametric Techniques
9.1.1 One Sample t-Test
Case Study: Green Building Certification
9.1.2 Independent Sample t-Test
Case Study: Comparison of Personal Web Store and Marketplaces for Online Selling
9.1.3 Dependent t-Test
Case Study: Effect of Training Program on Employee Performance
9.1.4 One-Way Analysis of Variance (ANOVA)
Case Study: Effect of Demographics on Online Mobile Shopping Apps
9.2 Non-Parametric Tests
9.2.1 Kolmogorov–Smirnov Test for One Sample
9.2.2 Kolmogorov–Smirnov Test for Two Samples
9.2.3 Mann–Whitney–Wilcoxon Test for Independent Samples
9.2.4 Mann–Whitney–Wilcoxon Test for Dependent Samples
9.2.5 Kruskal–Wallis Rank Sum Test
Chapter 10 Time-Series Models
10.1 Time-Series Object in R
10.1.1 Creating a Time-Series Object
10.1.2 Creating a Subset
10.1.3 Multiple Time-Series Chart
10.2 Smoothing
10.2.1 Simple Moving Average
10.2.2 Exponential Smoothing
10.3 Seasonal Decomposition
10.3.1 Using stl() Function
10.3.2 Using monthplot() and seasonplot() Functions
10.4 ARIMA Modeling
10.4.1 Creating an ARIMA Model
10.4.2 Creating a Subset in ARIMA
10.5 Survival Analysis
Case Study: Foreign Trade in India


PART 4 Machine Learning
Chapter 11 Unsupervised Machine Learning Algorithms
11.1 Dimensionality Reduction
11.1.1 Factor Analysis
Case Study: Balanced Scorecard Model for Measuring Organizational Performance
11.1.2 Principal Component Analysis (PCA)
Case Study: Employee Attrition in an Organization
11.2 Clustering
11.2.1 k-Means Clustering
Case Study: Market Capitalization Categories
11.2.2 Hierarchical Clustering
Case Study: Performance Appraisal in Organizations
Chapter 12 Supervised Machine Learning Problems
12.1 Regression
12.1.1 Simple Linear Regression
Case Study: Relationship between Buying Intention and Awareness of Electric Vehicles
12.1.2 Multiple Linear Regression
Case Study: Application of Technology Acceptance Model in Cloud Computing
12.1.3 Non-Linear Least Square Regression
Case Study: Impact of Social Networking Websites on Quality of Recruitment
12.2 Classification
Case Study: Prediction of Customer Buying Intention due to Digital Marketing
Chapter 13 Supervised Machine Learning Algorithms
13.1 Naïve Bayes Algorithm
13.1.1 Naïve Bayes Algorithm for Classification Problems
Case Study: Measuring Acceptability of a New Product
13.2 k-Nearest Neighbor's (KNN) Algorithm
13.2.1 KNN for Classification Problems
Case Study: Predicting Phishing Websites
13.2.2 KNN for Regression Problems
Case Study: Loan Categorization
13.3 Support Vector Machines (SVMs)
13.3.1 Support Vector Machines for Classification Problems
Case Study: Fraud Analysis for Credit Card and Mobile Payment Transactions
13.3.2 Support Vector Machines for Regression Problems
Case Study: Diagnosis and Treatment of Diseases


13.4 Decision Trees
13.4.1 Decision Tree for Classification Problems
Case Study: Occupancy Detection in Buildings
13.4.2 Decision Tree for Regression Problems
Case Study: Artificial Intelligence and Employment
Chapter 14 Supervised Machine Learning Ensemble Techniques
14.1 Bagging
14.1.1 Bagging for Classification Problems
Case Study: Measuring Customer Satisfaction related to Online Food Portals
14.1.2 Bagging for Regression Problems
Case Study: Predicting Income of a Person
14.2 Random Forest
14.2.1 Random Forest for Classification Problems
Case Study: Writing Recommendation/Approval Reports
14.2.2 Random Forest for Regression Problems
Case Study: Prediction of Sports Results
14.3 Gradient Boosting
14.3.1 Gradient Boosting for Classification Problems
Case Study: Impact of Online Reviews on Buying Behavior
14.3.2 Gradient Boosting for Regression Problems
Case Study: Effective Vacation Plan through Online Services
Chapter 15 Machine Learning for Text Data
15.1 Text Mining
Case Study: Spam Protection and Filtering
15.2 Sentiment Analysis
Case Study: Determining Online News Popularity
Chapter 16 Neural Network Models (Deep Learning)
16.1 Steps for Building a Neural Network Model
16.2 Multilayer Perceptrons Model (2D Tensor)
Case Study: Measuring Quality of Products for Acceptance or Rejection
16.3 Recurrent Neural Network Model (3D Tensor)
Case Study: Financial Market Analysis
16.4 Convolutional Neural Network Model (4D Tensor)
Case Study: Facial Recognition in Security Systems

Answers to Objective Type Questions
Index

PART 1
Basics of R

Chapter 1
Introduction to R

Chapter 2
Data Types of R

Chapter 3
Programming in R

Chapter 4
Data Exploration and Manipulation

Chapter 5
Import and Export of Data

Chapter 1
Introduction to R
Learning Objectives
• Build foundation for understanding R environment.
• Familiarity with R installation.
• Exposure to variables, input and output functions in R.
• Implementing different mathematical, trigonometrical, string and other basic functions in R.

R is a programming language primarily used for basic and advanced statistical analysis, data visualization,
and machine learning and deep learning on numeric, text and other data. It was initially written
by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in
Auckland, New Zealand, in the early 1990s. R can be regarded as an implementation of the S language, which
was developed at Bell Laboratories by Rick Becker, John Chambers and Allan Wilks. R is freely available under the
GNU General Public License, and pre-compiled binary versions are provided for various operating systems
such as Linux, Windows and macOS.

1.1 Features of R
R is an interpreted programming language and software environment for statistical analysis and data visual-
ization and reporting. The following are important features of R:
1. It allows branching and looping as well as modular programming using functions.
2. It allows integration with other programming languages and platforms such as C, C++, .NET and Python.
3. It has an extensive community of contributors; hence it has a rich library of functions and datasets.
4. It has an effective data handling and storage facility for numeric and textual data.
5. It provides a collection of operators for calculations on arrays, lists, factors, vectors, data frames and
matrices.
6. It provides a large and integrated collection of tools for data analysis and statistical functions.
7. It provides graphical facilities for data analysis and can show results either on screen or as hard copy.
8. It is an integrated suite of software facilities for data manipulation, calculation and graphical
display.
9. It has neither a menu-driven graphical user interface nor a spreadsheet view of data, nor is it a database; but it connects
to DBMS and spreadsheets.
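
A minimal sketch of features 1 and 5 above, using base R only. The function name square_if_even and the sample vector are illustrative assumptions, not taken from the book:

```r
# Vectorized operators act on a whole vector without an explicit loop.
x <- c(1, 2, 3, 4)
doubled <- x * 2          # element-wise multiplication

# A user-defined function with branching (hypothetical name).
square_if_even <- function(n) {
  if (n %% 2 == 0) n^2 else n
}

# Looping: apply the function to each element of x in turn.
result <- numeric(length(x))
for (i in seq_along(x)) {
  result[i] <- square_if_even(x[i])
}

print(doubled)
print(result)
```

Here doubled holds 2 4 6 8 and result holds 1 4 3 16, since only the even elements are squared.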


1.2 Installation of R
R can be downloaded as an installer, for example R-3.2.2 for Windows (32-bit/64-bit), and saved in a local directory. On
Windows, the installer (.exe) has a name of the form "R-version-win.exe". We need to double-click
the installer to run it and accept the default settings. After installation, we have to locate the icon that runs the
program, in the directory structure "R\R-3.2.2\bin\i386\Rgui.exe" under the Program Files directory.
Clicking this icon opens the R GUI, which is the starting point for R programming (Fig. 1.1). This software lets
the user enter single-line commands.
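
Once R is installed, the version and build of the running installation can be checked from the console. The following is a minimal sketch using base R only; the output depends on the installation, so none is shown:

```r
# Version string of the running R installation.
print(R.version.string)

# Architecture of the build (this differs between 32-bit and 64-bit builds).
print(R.version$arch)

# Directory in which R is installed.
print(R.home())
```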
Installation of RStudio: RStudio has a better graphical user interface; hence most users prefer RStudio for
programming in R. Execution of commands in R is not menu-driven (results are not obtained by clicking
buttons). Besides, the user sometimes needs to type multiline commands. When writing multiline programs,
it is more convenient to use a text editor than to type everything directly at the R command line, where
editing several lines at once is awkward. RStudio is an integrated development environment (IDE) for R
that is installed separately and includes such an editor. In other words, it is an interface between R and the
user. It is especially useful for beginners and makes coding easier. RStudio is also available in 32- and 64-bit
versions; the user can download the one that matches the Windows operating system. To open RStudio,
the user needs to click on RStudio under "All Programs" (Fig. 1.2).

Figure 1.1 Screen of R console.

Figure 1.2 Screen of RStudio.

1.3 Getting Started


This section focuses on understanding the basics of RStudio and some starting operations in R.

1.3.1 Window Sections of RStudio


RStudio has four main window sections:
1. Top-left section (script section): To write and save R code.
2. Bottom-left section (console section): To execute R code and perform calculations. The nature and val-
ues of all variables and objects appear here.
3. Top-right section (data section): To manage datasets and variables.
4. Bottom-right section (plot, packages and help section): To display plots, install packages and seek help
on R functions.

1.3.2 First Interaction


In the R environment setup, the user needs to launch the R interpreter to get the prompt ">". Let us start
learning R programming by writing a "Hello" program. Depending on the need, we can either type commands
at the R command prompt (command line) or write the program in an R script file in RStudio.
In both cases, R issues a prompt where it expects input commands. Type print("Hello") in the top-left
(script) window and click on Run, or type print("Hello") at the command prompt in the bottom-left
(console) window and press Enter to view the result.

Figure 1.3 Use of function and output in RStudio.
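
The "Hello" program described above can be typed in the script window or at the console. In this sketch the variable name greeting is illustrative:

```r
# Print a fixed string; the console shows it as [1] "Hello".
print("Hello")

# Store the text in a variable first, then print the variable.
greeting <- "Hello"
print(greeting)
```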

1.3.3 Command Line versus Scripts


The command line is generally used for single-line commands and a script is used for multiple commands. If we want to
execute only one function, we highlight the function and click on Run. But if we want many functions to
execute together, we need to write a script. Click File and then New Script to write
multiline commands. At this point R will open a window entitled "untitled-R-Editor", in which we may type and
edit. If we want to execute only a line or a group of lines from the whole script, we highlight
them and click on Run (Fig. 1.3).

1.3.4 Comments
Comments are explanatory text in an R program; the interpreter ignores them while executing the program. They are generally used for the reader's reference. A comment begins with the character # and continues to the end of the line. For example, typed at the prompt: > #My first program in R Programming. Unlike some other programming languages, R does not support multiline comments, so every comment line must begin with #.
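
As a small illustration of the rule above, the following sketch shows a comment on its own line, a longer note spread over several lines, and a comment that follows code on the same line. The variable name greeting is only an illustrative choice.

```r
# My first program in R
# A longer note needs a "#" at the start of every line,
# because R has no block-comment syntax.
greeting <- "Hello"   # a comment may also follow code on the same line
print(greeting)
```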

1.3.5 Help in R
R has an extensive user-friendly facility to provide help with regard to different commands. Some examples
are presented below.


> #For online help
> help()
> #HTML browser interface to help
> help.start()
> #To get help related to use of function print
> help("print")
> ?print
> #For demonstration
> demo()
> #To quit R
> q()

Explanation
The above commands show the use of basic commands of R for help and exiting from R.

1.3.6 Directory
It is important to know the directory in which R reads and saves files. The getwd() and setwd() functions are used for getting and setting the working directory, respectively. The user can find the current working directory by using the getwd() function and can change to a new working directory by using the setwd() function.

> #Print current working directory
> getwd()
[1] "C:/Users/dell/Documents"
> #Set current working directory
> setwd("D:/R prog")
> #Displaying the current working directory after new settings
> getwd()
[1] "D:/R prog"

Explanation
The getwd() function displays the working directory. In this example, it shows that the Documents folder is the current working directory. The next command sets the working directory to the folder “R prog” on the D drive. If we display the working directory again, D:/R prog is displayed. Note that R writes paths with forward slashes, even on Windows. The result depends on the operating system and on the directory in which we are working.
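
The paths in the example above exist only on the author's machine. A minimal sketch that should run on any system is given below; it assumes the session's temporary directory (returned by tempdir()) as the new working directory and restores the original directory afterwards. The names old_dir and new_dir are illustrative.

```r
old_dir <- getwd()      # remember the current working directory
new_dir <- tempdir()    # assumption: a directory that is guaranteed to exist

setwd(new_dir)          # change the working directory
print(getwd())          # now reports the temporary directory

setwd(old_dir)          # restore the original working directory
print(getwd())
```

Saving the old directory before calling setwd() is a useful habit, because setwd() fails with an error if the target folder does not exist.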

1.4 Variables in R
In programming, a variable is a named piece of computer memory that holds some information. We can think of a variable as a box with a name, in which we can store something. The value held by a variable can change, depending on conditions or on information passed to the program. A variable thus provides us with named storage that our programs can manipulate.


Table 1.1 Valid and invalid names of the variables

Variable Name    Valid    Reason
stu_name2.       Yes      Has letters, numbers, dot and underscore.
stu_name%        No       Has the character “%”.
5stu_name        No       Starts with a number.
.5stu_name       No       Starting dot followed by a number is not allowed.
_stu_name        No       Starts with underscore “_”, which is not valid.

1.4.1 Naming Variables


A variable in R can store any object including atomic vector, list, matrix, array, factor and data frame. A valid variable name consists of letters, numbers and the dot or underscore characters. The variable name must start with a letter, or with a dot that is not followed by a number. Table 1.1 discusses some valid and invalid variable names.
Unlike in many other programming languages, a variable in R is not declared with a data type; rather, it gets the data type of the R-object assigned to it. Thus, R is a dynamically typed language, which means that we can change a variable’s data type again and again when using it in a program.
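
The naming rules and the dynamic typing described above can be checked directly in R. The sketch below is only one possible way to do this: it uses the base function make.names(), which converts any string into a syntactically valid name, and treats a name as valid when make.names() leaves it unchanged. The helper is_valid_name() is a hypothetical function written for this illustration.

```r
# Candidate names taken from Table 1.1
candidates <- c("stu_name2.", "stu_name%", "5stu_name", ".5stu_name", "_stu_name")

# Hypothetical helper: a name is valid if make.names() does not have to alter it
is_valid_name <- function(x) make.names(x) == x

validity <- is_valid_name(candidates)
print(data.frame(name = candidates, valid = validity))

# Dynamic typing: the same variable takes the type of whatever is assigned to it
v <- 20
type_first <- class(v)      # "numeric"
v <- "twenty"
type_second <- class(v)     # "character"
print(c(type_first, type_second))
```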

1.4.2 Assigning Values to Variables


In R, an assignment to a variable can be done with three operators: =, <- and ->. The assignment operator (<-) consists of the two characters “<” (less than) and “-” (minus) occurring strictly side-by-side and pointing to the object receiving the value of the expression. In some contexts, the “=” operator can be used as an alternative. An assignment can also be made with the assign() function. It is important to remember that R is case sensitive (i.e., “X” is not the same as “x”). The following are examples of assigning values to a variable in R. The last four examples show different ways that can be used to assign a value of 20 to variable “x”.

> #Defining variable "myfirst" and assigning "Hello, World!" to it
> myfirst<-"Hello, World!"
> #Viewing the value of the variable
> myfirst
[1] "Hello, World!"
> #R is case sensitive. "Myfirst" is different from "myfirst"
> Myfirst
Error: object 'Myfirst' not found
> #Left Assignment in R
> x<-20
> #Right Assignment in R
> 20->x
> #Assignment using equal sign
> x=20

> #Assignment can also be made using the assign() function
> assign("x", 20)

Explanation
The above examples demonstrate different ways of assigning values to a variable. All the four ways can be
used to assign a value of 20 to the variable “x”.
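
One context in which “=” and “<-” behave differently is inside a function call. The sketch below, offered as an illustration rather than as part of the original examples, wraps the calls in a hypothetical function demo_assignment() so that the check does not depend on what is already in the workspace.

```r
demo_assignment <- function() {
  # Inside a call, "=" only names the argument; no variable x is created
  m1 <- mean(x = c(2, 4, 6))
  exists_after_equals <- exists("x", envir = environment(), inherits = FALSE)

  # "<-" first assigns x in the calling environment, then passes its value
  m2 <- mean(x <- c(2, 4, 6))
  exists_after_arrow <- exists("x", envir = environment(), inherits = FALSE)

  c(after_equals = exists_after_equals, after_arrow = exists_after_arrow)
}

assignment_check <- demo_assignment()
print(assignment_check)
```

At the top level of a script, x = 20 and x <- 20 have the same effect; the difference appears only when the operator is used inside the parentheses of a function call.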

1.4.3 Finding Variables


To know all the variables currently available in the workspace, we use the ls() function, for example print(ls()). The output depends on what variables are declared in our environment. Besides, the ls() function can use patterns to match the variable names. Variables starting with a dot (.) are hidden; they can be listed by passing the argument all.names=TRUE to the ls() function.

> #List all objects except the hidden and special variables
> ls()
[1] "a" "aa" "Affairs" "air"
......................................................

> #List the variables including the pattern "air"
> ls(pattern="air")
[1] "Affairs" "air" "airquality"
......................................................

> #List all objects including hidden and special variables
> ls(all.names=TRUE)
[1] ".Random.seed" "a" "aa" "Affairs"
[5] "air" "AirPassengers" "airquality" "am.data"
[9] "analysis" "ans" "ans1" "ans2"

Explanation
The first function ls() lists all objects in the working environment, including the variables created by the user, except the hidden and special objects. The second command displays only those variables which have the string “air” in their names. The third command displays all the variables, including the hidden ones. Hence, a long list of variables is displayed.

1.4.4 Removing Variables


It is good practice to remove variables at the end of each session in R, to avoid problems arising due to variables with the same names but different properties:
Syntax: rm(i, j,...)
where i, j, … are names of variables separated by commas.


> #Variable can be deleted by using the rm() function
> rm(new3)
> #A deleted variable will throw an error if printed
> new3
Error: object 'new3' not found
> #All variables can be deleted using the rm() and ls() together
> rm(list=ls())

Explanation
The first command removes the variable “new3”, hence if the second command is executed to display the
value of variable “new3”, an error is generated that this object is not found. Since ls() function displays all
the variables, all the variables are removed if we use rm() and ls() functions together.
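
Because rm(list=ls()) clears the whole workspace, it is safer to try the listing and removing functions on a scratch environment first. The sketch below assumes a hypothetical environment ws created with new.env(); the variable names are invented for the illustration, and the user's own workspace is left untouched.

```r
ws <- new.env()                         # hypothetical scratch environment
assign("air", 1, envir = ws)
assign("airquality_copy", 2, envir = ws)
assign("new3", 3, envir = ws)
assign(".hidden", 4, envir = ws)        # names starting with a dot are hidden

visible  <- ls(ws)                      # hidden name is not listed
matching <- ls(ws, pattern = "air")     # only names containing "air"
all_vars <- ls(ws, all.names = TRUE)    # hidden name is included

rm("new3", envir = ws)                  # remove a single variable
still_there <- exists("new3", envir = ws, inherits = FALSE)

rm(list = ls(ws, all.names = TRUE), envir = ws)   # remove everything
remaining <- length(ls(ws, all.names = TRUE))

print(visible); print(matching); print(all_vars)
print(still_there); print(remaining)
```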

1.5 Input of Data


Data can be input directly from the terminal at run-time or supplied while creating an R-object. R also gives strong support for importing data from other software.

1.5.1 Input of Data from Terminal


The scan() function is used to take data from the user at the terminal. It is a low-level input function and is most commonly used to read a vector of numbers. This is useful for small datasets but tiresome for large ones. We type one or more numeric values and press Enter to continue on a new line; a blank line (pressing Enter twice) signals the end of the input. The prompt on the left shows the index of the next value to be entered. Data can also be entered as text by using the what argument: scan(what = "character") or, equivalently, scan(what = ""). Called without arguments, scan() reads numeric data; with the what argument it can read other types, such as character data, into a variable.

> #Reading one set of numeric data
> x<-scan()
1: 23
2: 56
3: 89
4: 90
5: 48
6:
Read 5 items
> #Displaying the numeric variable
> x
[1] 23 56 89 90 48

> #Reading character data, using the newline character as separator
> string1<-scan(what = "", sep = "\n")
1: Hello
2: How are You
3: Thanks
4: Bye
5: See You
6:
Read 5 items
> #Displaying character variable
> string1
[1] "Hello" "How are You" "Thanks" "Bye" "See You"

Explanation
The first command scans five numeric values and stores them in the variable “x”; input ends when the user presses Return twice. When the numeric vector “x” is printed, all five values entered at run-time are displayed. The function call scan(what = "", sep = "\n") reads one set of character data, because what = "" specifies the character type and the newline character (\n) is used as the separator.
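
The interactive examples above cannot be run from a script, because scan() waits for keyboard input. As a sketch of a non-interactive alternative, scan() also accepts the text argument, which reads from a character string instead of the terminal. The variable names nums and phrases are illustrative, and quiet = TRUE only suppresses the “Read n items” message.

```r
# Numeric data: values separated by white space
nums <- scan(text = "23 56 89 90 48", quiet = TRUE)
print(nums)

# Character data: one item per line, newline used as separator
phrases <- scan(text = "Hello\nHow are You\nThanks",
                what = "", sep = "\n", quiet = TRUE)
print(phrases)
```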

1.5.2 Input of Data through R-Objects


There are many types of R-objects including vectors, lists, matrices, arrays, factors and data frames.
Chapter 2 discusses in detail the mechanism of inputting data directly while creating R-object. R also gives
a facility of inputting data from different sources, which is further discussed in Chapter 4.

1.6 Output in R
A program takes in raw information (or data) at one end, stores it until it is ready to be worked on, works on it and then shows the results at the other end. All these processes have a name. Taking in information is called input, storing information is known as memory (or storage), working on it is known as processing, and showing the results is called output. Output is important after processing any data. In R, output can be displayed using functions such as print, cat and paste. The function is chosen according to the user's requirements.

1.6.1 print() Function


The print() function displays one object at a time. It cannot combine two or more strings, or a string and a variable, in a single call.

> #Use of print() function
> print("Hello World and Welcome to R")
[1] "Hello World and Welcome to R"
> #The print() function cannot print two strings together
> print("Hello","Welcome")
Error in print.default("Hello", "Welcome") : invalid 'digits' argument
In addition: Warning message:
In print.default("Hello", "Welcome") : NAs introduced by coercion


> #Printing multiple strings using multiple print statements
> print("Hello"); print("and"); print("Welcome")
[1] "Hello"
[1] "and"
[1] "Welcome"
> #Storing a string in a variable
> hellostring <- "Hello, How are You"
> #Use of print() function for printing variable
> print(hellostring)
[1] "Hello, How are You"

Explanation
In the first example, print() prints the statement. The print() function has a significant limitation: it prints only one object at a time. In the second example, the user tries to join the two strings “Hello” and “Welcome” and print the result, so an error is generated. With print(), the only way to display multiple items is to print them one at a time. Hence, the next command executes properly: the print statement is used three times, since we want to print three strings. The last example shows how to print a variable.

Printing with the print() function becomes very cumbersome when dealing with a lot of data. Besides, sometimes we need to display strings in a particular format to present the results of specific operations. The paste() and cat() functions are important for the concatenation and customized display of strings.
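
The paste() function is mentioned above but not demonstrated. As a brief sketch, paste() joins its arguments into a character string and returns the result, so the joined string can be stored in a variable and then printed. The variable names below are illustrative.

```r
# Join two strings; the default separator is a single space
greeting <- paste("Hello", "Welcome")
print(greeting)

# A custom separator; the shorter argument is recycled over the vector 1:3
labels <- paste("Item", 1:3, sep = "-")
print(labels)

# paste0() is the same as paste() with sep = ""
file_name <- paste0("report", ".txt")
print(file_name)
```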

1.6.2 cat() Function


The cat() function is an alternative to print() that lets us combine multiple items into a continuous output. It converts its arguments to character vectors, concatenates them in the given sequence, separates the elements with the sep string (a single space by default), and prints the result in the console. Unlike print(), it does not add the index [1] or quotation marks, and it does not end the line unless a newline character ("\n") is supplied. This function is useful for producing output in user-defined functions. However, its output cannot be assigned to a variable, because cat() returns NULL.
Syntax: cat(s1, s2, ..., sep = " ",…)
where
· s1, s2, … are different strings or vectors/variables that need to be joined together.
· sep is the value that needs to be considered as a separator.

> #Use of cat() function to display two strings together
> cat("Hello","Welcome")
Hello Welcome
> #Storing a string in a variable
> hellostring <- "Hello, How are You"


> #Use of cat() function to join variable and a string
> cat("Good Morning", hellostring)
Good Morning Hello, How are You
> #Joining multiple strings using cat() function
> tdate<-date()
> cat("Today is", tdate)
Today is Fri Apr 27 10:20:25 2018
> #Use of cat() function in using mathematical functions directly
> x<-8
> cat("The square of", x, "is", x^2, "!\n")
The square of 8 is 64 !
> cat("The square root of", x, "is approximately", sqrt(x), "\n")
The square root of 8 is approximately 2.828427

Explanation
R provides the cat() function as an additional facility to join strings. In the first example, we have concatenated two strings. In the next statement, we have used two strings: “Good Morning” and the string stored in the variable “hellostring”. The cat() function concatenates the two strings and displays the result. In the next example, cat() is used to join the string “Today is” and the variable “tdate”, which contains a string. The next example joins strings with the variable “x” and the expression x^2. In the output, each variable or expression is replaced by its value, while the strings are displayed as they are. The last example uses the square root function sqrt() directly inside the cat() function.
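
Two further properties of cat() can be sketched as follows: the sep argument controls what is placed between the items, and cat() itself returns NULL, so its printed text cannot be stored by ordinary assignment. The example uses capture.output() from the utils package (attached by default) only to show what was printed; the variable names are illustrative.

```r
# A custom separator between the items
cat("2018", "04", "27", sep = "-")
cat("\n")                                # cat() does not end the line by itself

# cat() prints its arguments but returns NULL
result <- cat("Hello", "Welcome", "\n")
print(is.null(result))

# capture.output() collects the printed text as a character vector
printed <- capture.output(cat("2018", "04", "27", sep = "-"))
print(printed)
```

When the joined string is needed later in the program, paste() is the usual choice; cat() is meant for display only.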

1.7 In-Built Functions in R


R provides the user with access to many predefined functions. These functions belong to different categories including mathematical, trigonometric and logical.

1.7.1 Mathematical Functions


R can also be used as a calculator, along with the facility to use many mathematical functions for performing
operations depending upon user requirement.

> #Basic mathematical operations
> 15 + 7 * 6 - 2
[1] 55
> 16 + ((35/2) - 5) * 5/3
[1] 36.83333
> #Power function (^)
> 15^6
[1] 11390625
> #Square root function
> sqrt(81)
[1] 9


> #Square root of a negative number
> sqrt(-78)
[1] NaN
Warning message:
In sqrt(-78) : NaNs produced
> #Absolute function
> abs(190-330*6/5)
[1] 206
> #Factorial function
> factorial(8)
[1] 40320
> #Round function
> round(11.52)
[1] 12
> round(11.32)
[1] 11
> #Floor function
> floor(11.52)
[1] 11
> #Ceiling function
> ceiling(11.52)
[1] 12

Explanation
The above examples show the usage of different mathematical functions such as round, square root, floor and factorial. The basic mathematical operations follow the general BODMAS rule of precedence. The power operator (^) raises a number to the given power. The sqrt() function calculates the square root of a number; the square root of a negative number results in NaN (not a number) together with a warning. The function abs() returns the absolute value of a number and the factorial() function returns the factorial of a number. The round() function rounds a number to the nearest integer by default. If the fractional part is greater than 0.5, the number is rounded up; if it is less than 0.5, it is rounded down. For example, 25.6 is rounded to 26 while 25.4 is rounded to 25. A fractional part of exactly 0.5 is rounded to the nearest even integer, so round(2.5) gives 2 and round(3.5) gives 4. The floor() function rounds a number down to the previous integer, while the ceiling() function rounds it up to the next integer. For example, 35.7 lies between 35 and 36; hence, the floor of 35.7 is 35 and the ceiling is 36.
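
The rounding behaviour described above can be verified with a short sketch. The digits argument of round(), which sets the number of decimal places, is not used in the examples above and is included here only as an additional illustration; the variable names are illustrative.

```r
# Rounding to the nearest integer
r_up   <- round(25.6)     # fractional part above 0.5: rounded up
r_down <- round(25.4)     # fractional part below 0.5: rounded down

# Exactly halfway: R rounds to the nearest even integer
r_half_a <- round(2.5)
r_half_b <- round(3.5)

# Rounding to two decimal places with the digits argument
r_digits <- round(11.567, digits = 2)

# floor() and ceiling() with a negative number
f_neg <- floor(-11.52)    # previous (smaller) integer
c_neg <- ceiling(-11.52)  # next (larger) integer

print(c(r_up, r_down, r_half_a, r_half_b, r_digits, f_neg, c_neg))
```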

Variables in Functions: We can also assign a value to a variable and then apply a mathematical operator or function to the variable directly.

> #Assigning value to a variable
> x1 <- 10
> #Using mathematical operator on variable
> x2 <- x1^2
> x2
[1] 100
