Professional Documents
Culture Documents
BDP U5
BDP U5
R is an open-source programming language that is widely used as a statistical software and data
analysis tool. R is an important tool for Data Science. It is highly popular and is the first choice of
many statisticians and data scientists. But what makes R so popular? Why and how to use R for Data
Science?
Data Science has emerged as the most popular field of the 21st century. It is because there is a
pressing need to analyze and construct insights from the data. Industries transform raw data into
furnished data products. In order to do so, it requires several important tools to churn the raw data.
R is one of the programming languages that provide an intensive environment for you to research,
process, transform, and visualize information.
Feature R Python
Objective It has many features which are useful It can be used to develop GUI
for statistical analysis and applications and web applications
representation. as well as with embedded systems
Workability It has many easy-to-use packages for It can easily perform matrix
performing tasks computation as well as
optimization
Integrated Various popular R IDEs are Rstudio, Various popular Python IDEs are
development RKward, R commander, etc. Spyder, Eclipse+Pydev, Atom, etc.
environment
Libraries and There are many packages and libraries Some essential packages and
packages like ggplot2, caret, etc. libraries are Pandas, Numpy, Scipy,
etc.
Scope It is mainly used for complex data It takes a more streamlined
analysis in data science. approach for data science projects.
R is a suitable tool for various data science applications because it provides aesthetic
visualization tools.
R is heavily utilized in data science applications for ETL (Extract, Transform, Load). It provides
an interface for many databases like SQL and even spreadsheets.
One of the important features of R is to interface with NoSQL databases and analyze
unstructured data.
Dplyr: For performing data wrangling and data analysis, we use the dplyr package. We use
this package for facilitating various functions for the Data frame in R. Dplyr is actually built
around these 5 functions. You can work with local data frames as well as with remote
database tables. You might need to:
Select certain columns of data.
Filter your data to select specific rows.
Arrange the rows of your data into order.
Mutate your data frame to contain new columns.
Summarize chunks of your data in some way.
Ggplot2: R is most famous for its visualization library ggplot2. It provides an aesthetic set of
graphics that are also interactive.The ggplot2 library implements a “grammar of graphics”
(Wilkinson, 2005). This approach gives us a coherent way to produce visualizations by
expressing relationships between the attributes of data and their graphical representation.
Esquisse: This package has brought the most important feature of Tableau to R. Just drag
and drop, and get your visualization done in minutes. This is actually an enhancement to
ggplot2.It allows us to draw bar graphs, curves, scatter plots, histograms, then export the
graph or retrieve the code generating the graph.
Tidyr: Tidyr is a package that we use for tidying or cleaning the data. We consider this data
to be tidy when each variable represents a column and each row represents an observation.
Shiny: This is a very well-known package in R. When you want to share your stuff with
people around you and make it easier for them to know and explore it visually, you can use
shiny. It’s a Data Scientist’s best friend.
Caret: Caret stands for classification and regression training. Using this function, you can
model complex regression and classification problems.
E1071: This package has wide use for implementing clustering, Fourier Transform, Naive
Bayes, SVM and other types of miscellaneous functions.
Mlr: This package is absolutely incredible in performing machine learning tasks. It almost has
all the important and useful algorithms for performing machine learning tasks. It can also be
termed as the extensible framework for classification, regression, clustering, multi-
classification and survival analysis.
Lubridate
Knitr
DT(DataTables)
RCrawler
Leaflet
Janitor
Plotly
Google: At Google, R is a popular choice for performing many analytical operations. The
Google Flu Trends project makes use of R to analyze trends and patterns in searches
associated with flu.
Facebook Facebook makes heavy use of R for social network analytics. It uses R for gaining
insights about the behavior of the users and establishes relationships between them.
IBM: IBM is one of the major investors in R. It recently joined the R consortium. IBM also
utilizes R for developing various analytical solutions. It has used R in IBM Watson – an open
computing platform.
Uber: Uber makes use of the R package shiny for accessing its charting components. Shiny is
an interactive web application that’s built with R for embedding interactive visual graphics.
R is an open-source programming language that is widely used as a statistical software and data
analysis tool. R generally comes with the Command-line interface. R is available across widely used
platforms like Windows, Linux, and macOS. Also, the R programming language is the latest cutting-
edge tool.
It was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand,
and is currently developed by the R Development Core Team. R programming language is an
implementation of the S programming language. It also combines with lexical scoping semantics
inspired by Scheme. Moreover, the project conceives in 1992, with an initial version released in 1995
and a stable beta version in 2000.
It’s a platform-independent language. This means it can be applied to all operating system.
It’s an open-source free language. That means anyone can install it in any organization
without purchasing a license.
R programming language is not only a statistic package but also allows us to integrate with
other languages (C, C++). Thus, you can easily interact with many data sources and statistical
packages.
The R programming language has a vast community of users and it’s growing day by day.
R is currently one of the most requested programming languages in the Data Science job
market that makes it the hottest trend nowadays.
Basic Statistics: The most common basic statistics terms are the mean, mode, and median.
These are all known as “Measures of Central Tendency.” So using the R language we can
measure central tendency very easily.
Static graphics: R is rich with facilities for creating and developing interesting static graphics.
R contains functionality for many plot types including graphic maps, mosaic plots, biplots,
and the list goes on.
Probability distributions: Probability distributions play a vital role in statistics and by using R
we can easily handle various types of probability distribution such as Binomial Distribution,
Normal Distribution, Chi-squared Distribution and many more.
Data analysis: It provides a large, coherent and integrated collection of tools for data
analysis.
Programming in R:
Since R is much similar to other widely used languages syntactically, it is easier to code and learn in
R. Programs can be written in R in any of the widely used IDE like R Studio, Rattle, Tinn-R, etc. After
writing the program save the file with the extension .r. To run the program, use the following
command on the command line:
R file_name.r
Example:
R
cat("Welcome to GFG!")
Output:
Welcome to GFG!
Advantages of R:
R is the most comprehensive statistical analysis package. As new technology and concepts
often appear first in R.
As R programming language is an open source. Thus, you can run R anywhere and at any
time.
In R, everyone is welcome to provide new packages, bug fixes, and code enhancements.
Disadvantages of R:
In the R programming language, the standard of some packages is less than perfect.
R programming language is much slower than other programming languages such as Python
and MATLAB.
Applications of R:
We use R for Data Science. It gives us a broad variety of libraries related to statistics. It also
provides the environment for statistical computing and design.
R is used by many quantitative analysts as its programming tool. Thus, it helps in data
importing and cleaning.
R is the most prevalent language. So many data analysts and research programmers use it.
Hence, it is used as a fundamental tool for finance.
Tech giants like Google, Facebook, bing, Twitter, Accenture, Wipro and many more using R
nowadays.
Arthur Samuel, a pioneer in the field of artificial intelligence and computer gaming, coined the
term “Machine Learning”. He defined machine learning as – “Field of study that gives computers
the capability to learn without being explicitly programmed”. In a very layman manner, Machine
Learning (ML) can be explained as automating and improving the learning process of computers
based on their experiences without being actually programmed i.e., without any human assistance.
The process starts with feeding good quality data and then training our machines(computers) by
building machine learning models using the data and different algorithms. The choice of algorithms
depends on what type of data do we have and what kind of task we are trying to automate. There
are mainly four types of learning.
In this article let’s discuss the two most important learning e.g Supervised and Unsupervised
Learning in R programming.
R language is basically developed by statisticians to help other statisticians and developers faster and
efficiently with the data. As of now, we know that machine learning is basically working with a large
amount of data and statistics as a part of data science the use of R language is always
recommended. Therefore, the R language is mostly becoming handy for those working with machine
learning making tasks easier, faster, and innovative. Here are some top advantages of the R language
to implement a machine learning algorithm in R programming.
It provides good explanatory code. For example, if you are at the early stage of working with
a machine learning project and you need to explain the work you do, it becomes easy to
work with R language comparison to python language as it provides the proper statistical
method to work with data with fewer lines of code.
R language is perfect for data visualization. R language provides the best prototype to work
with machine learning models.
R language has the best tools and library packages to work with machine learning projects.
Developers can use these packages to create the best pre-model, model, and post-model of
the machine learning projects. Also, the packages for R are more advanced and extensive
than python language which makes it the first choice to work with machine learning
projects.
Supervised Learning
Supervised learning as the name indicates the presence of a supervisor as a teacher. Basically,
supervised learning is learning in which we teach or train the machine using data that is well
labeled which means some data is already tagged with the correct answer. After that, the machine is
provided with a new set of examples(data) so that the supervised learning algorithm analyses the
training data (set of training examples) and produces a correct outcome from labeled
data. Supervised learning is classified into two categories of algorithms:
Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
Supervised learning deals with or learns with “labeled” data. This implies that some data is already
tagged with the correct answer.
Types
Regression
Logistic Regression
Classification
Decision Trees
Implementation in R
Let’s implement one of the very popular Supervised Learning i.e Simple Linear Regression in R
programming. Simple Linear Regression is a statistical method that allows us to summarize and study
relationships between two continuous (quantitative) variables. One variable denoted x is regarded
as an independent variable and the other one denoted y is regarded as a dependent variable. It is
assumed that the two variables are linearly related. Hence, we try to find a linear function that
predicts the response value(y) as accurately as possible as a function of the feature or independent
variable(x). The basic syntax for regression analysis in R is:
Syntax:
lm(Y ~ model)
where,
Y is the object containing the dependent variable to be predicted and model is the formula for the
chosen mathematical model.
The command lm() provides the model’s coefficients but no further statistical information. The
following R code is used to implement simple linear regression.
Example:
R
dataset = read.csv('salary.csv')
install.packages('caTools')
library(caTools)
split = sample.split(dataset$Salary,
SplitRatio = 0.7)
data = trainingset)
coef(lm.r)
install.packages("ggplot2")
library(ggplot2)
y = trainingset$Salary),
colour = 'red') +
geom_line(aes(x = trainingset$YearsExperience,
colour = 'blue') +
xlab('Years of experience') +
ylab('Salary')
y = testset$Salary),
colour = 'red') +
geom_line(aes(x = trainingset$YearsExperience,
colour = 'blue') +
xlab('Years of experience') +
ylab('Salary')
Output:
Intercept YearsExperience
24558.39 10639.23
R – Unsupervised learning is the training of machines using information that is neither classified nor
labeled and allowing the algorithm to act on that information without guidance. Here the task of the
machine is to group unsorted information according to similarities, patterns, and differences without
any prior training of data. Unlike supervised learning, no teacher is provided that means no training
will be given to the machine. Therefore, the machine is restricted to finding the hidden structure in
unlabeled data by our-self. Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent groupings in the
data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as people that buy X also tend to buy Y.
Types
Clustering:
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Clustering Types:
1. Hierarchical clustering
2. K-means clustering
Let’s implement one of the very popular Unsupervised Learning i.e K-means clustering in R
programming. K means clustering in R Programming is an Unsupervised Non-linear algorithm
that clusters data based on similarity or similar groups. It seeks to partition the observations into a
pre-specified number of clusters. Segmentation of data takes place to assign each training example
to a segment called a cluster. In the unsupervised algorithm, high reliance on raw data is given with
large expenditure on manual review for review of relevance is given. It is used in a variety of fields
like Banking, healthcare, retail, Media, etc.
Where:
x is numeric data
the k-means algorithm has a random component and can be repeated nstart times to
improve the returned model
Example:
R
# Installing Packages
install.packages("ClusterR")
install.packages("cluster")
# Loading package
library(ClusterR)
library(cluster)
# Loading data
data(iris)
# Structure
str(iris)
# to training dataset
nstart = 20)
kmeans.re
# each observation
kmeans.re$cluster
# Confusion Matrix
cm
plot(iris_1[c("Sepal.Length", "Sepal.Width")],
col = kmeans.re$cluster)
plot(iris_1[c("Sepal.Length", "Sepal.Width")],
col = kmeans.re$cluster,
kmeans.re$centers
kmeans.re$centers[, c("Sepal.Length",
"Sepal.Width")]
points(kmeans.re$centers[, c("Sepal.Length",
"Sepal.Width")],
# Visualizing clusters
y_kmeans,
lines = 0,
shade = TRUE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
xlab = 'Sepal.Length',
ylab = 'Sepal.Width')
Output:
Model kmeans_re:
The 3 clusters are made which are of 50, 62, and 38 sizes respectively. Within the cluster, the sum of
squares is 88.4%.
Cluster identification:
The model achieved an accuracy of 100% with a p-value of less than 1. This indicates the model is
good.
Confusion Matrix:
So, 50 Setosa are correctly classified as Setosa. Out of 62 Versicolor, 48 Versicolor are correctly
classified as Versicolor, and 14 are classified as virginica. Out of 36 virginica, 19 virginica are correctly
classified as virginica and 2 are classified as Versicolor.
The model showed 3 cluster plots with three different colors and with Sepal.length and with
Sepal.width.
In the plot, centers of clusters are marked with cross signs with the same color of the cluster.
Plot of clusters:
So, 3 clusters are formed with varying sepal length and sepal width.
The general concept behind R is to serve as an interface to other software developed in compiled
languages such as C, C++, and Fortran and to give the user an interactive tool to analyze data.
In order to display the results of running R code in the book, after code is evaluated, the results R
returns are commented. This way, you can copy paste the code in the book and try directly sections
of it in R.
numbers = c(1, 2, 3, 4, 5)
print(numbers)
# [1] 1 2 3 4 5
# Concatenate both
print(mixed_vec)
# [1] "1" "2" "3" "4" "5" "a" "b" "c" "d" "e"
Let’s analyze what happened in the previous code. We can see it is possible to create vectors with
numbers and with letters. We did not need to tell R what type of data type we wanted beforehand.
Finally, we were able to create a vector with both numbers and letters. The vector mixed_vec has
coerced the numbers to character, we can see this by visualizing how the values are printed inside
quotes.
The following code shows the data type of different vectors as returned by the function class. It is
common to use the class function to "interrogate" an object, asking him what his class is.
# Integer vector
num = 1:10
class(num)
# [1] "integer"
class(num)
# [1] "numeric"
# Character vector
ltrs = letters[1:10]
class(ltrs)
# [1] "character"
# Factor vector
fac = as.factor(ltrs)
class(fac)
# [1] "factor"
R supports two-dimensional objects also. In the following code, there are examples of the two most
popular data structures used in R: the matrix and data.frame.
# Matrix
M = matrix(1:12, ncol = 4)
# [1,] 1 4 7 10
# [2,] 2 5 8 11
# [3,] 3 6 9 12
lM = matrix(letters[1:12], ncol = 4)
cbind(M, lM)
class(M)
# [1] "matrix"
class(lM)
# [1] "matrix"
# data.frame
# One of the main objects of R, handles different data types in the same object.
# It is possible to have numeric, character and factor vectors in the same data.frame
df
# nl
#11a
#22b
#33c
#44d
#55e
As demonstrated in the previous example, it is possible to use different data types in the same
object. In general, this is how data is presented in databases, APIs part of the data is text or
character vectors and other numeric. In is the analyst job to determine which statistical data type to
assign and then use the correct R data type for it. In statistics we normally consider variables are of
the following types −
Numeric
Nominal or categorical
Ordinal
Numeric - Integer
Factor
Ordered Factor
R provides a data type for each statistical type of variable. The ordered factor is however rarely used,
but can be created by the function factor, or ordered.
The following section treats the concept of indexing. This is a quite common operation, and deals
with the problem of selecting sections of an object and making transformations to them.
head(df)
# numbers letters
#1 1 a
#2 2 b
#3 3 c
#4 4 d
#5 5 e
#6 6 f
# str gives the structure of a data.frame, it’s a good summary to inspect an object
str(df)
# The latter shows the letters character vector was coerced as a factor.
class(df)
# [1] "data.frame"
### Indexing
df[1, ]
# numbers letters
#1 1 a
# $numbers
# [1] 1
# $letters
# [1] a
# Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
df[5:7, ]
# numbers letters
#5 5 e
#6 6 f
#7 7 g
### Add one column that mixes the numeric column with the factor column
str(df)
df[, 1]
head(df2)
# numbers letters
#1 1 a
#2 2 b
#3 3 c
#4 4 d
#5 5 e
#6 6 f
# numbers mixed
#1 1 1a
#2 2 2b
#3 3 3c
names(df)
# This is the best practice in programming, as many times indeces change, but
head(df4)
# numbers mixed
#1 1 1a
#2 2 2b
#3 3 3c
#4 4 4d
#5 5 5e
#6 6 6f
df5
# numbers mixed
#1 1 1a
#2 2 2b
#3 3 3c
#4 4 4d
#5 5 5e
df6
# numbers mixed
#1 1 1a
#2 2 2b
#3 3 3c
#4 4 4d
#5 5 5e
#6 6 6f
#7 7 7g
#8 8 8h
#9 9 9i