DVDA (Ritish)final (2)
PRACTICAL FILE
DVDA(KMBA252) Ritish(2301921570093)
Introduction of R Programming
R is a programming language and analytics environment that was
first developed in 1993 by Robert Gentleman and Ross Ihaka at the
University of Auckland, New Zealand. It is used extensively by
software programmers, statisticians, data scientists, and data
miners, and it is one of the most popular tools in data analytics
and business analytics. It has applications in domains such as
healthcare, academia, consulting, finance, and media. Its wide
applicability in statistics, data visualization, and machine
learning has given rise to demand for trained, certified R
professionals.
Here's a brief introduction to key aspects of R programming:
1. Statistical Computing:
• R is specifically designed for statistical computing and data
analysis. It provides a wide range of statistical and
mathematical techniques, making it a powerful tool for
researchers and analysts.
2. Open Source:
• R is an open-source language, meaning that its source
code is freely available for users to view, modify, and
distribute. This fosters collaboration and allows the
community to contribute to its development.
3. Packages and Libraries:
• R has a vast ecosystem of packages and libraries that
extend its functionality. These packages cover a wide
array of topics, including data manipulation,
visualization, machine learning, and more. Popular
R Visualization Packages
• Visual Appeal: R's default visualizations may lack the visual appeal
and aesthetics offered by some other data visualization tools. Users
may need to invest time in customizing plots to achieve desired
aesthetics.
DATA STRUCTURE IN R
Vectors
A vector is an ordered collection of items that are all of the same type.
Example:
# Vector of strings
fruits <- c("banana", "apple", "grapes")
RESULT
# Print fruits
fruits
[1] "banana" "apple"  "grapes"
Example:
# Vector of numerical values
numbers <- c(10, 20, 30)
RESULT
# Print numbers
numbers
[1] 10 20 30
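Because every element of a vector must share one type, it helps to see what R does when types are mixed. The following is a small sketch, not part of the original practical; the variable name mixed is introduced here only for illustration.

```r
# Mixing types in c(): R coerces all elements to one common type
mixed <- c(1, "two", TRUE)
class(mixed)            # the common type chosen by R

# Indexing and length of the vectors defined above
fruits <- c("banana", "apple", "grapes")
fruits[2]               # second element
length(fruits)          # number of elements

# Arithmetic is applied element by element
numbers <- c(10, 20, 30)
numbers * 2
```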
List
A list in R can contain many different data types inside it. A list
is a collection of data which is ordered and changeable.
[[2]]
[1] "apple"
[[3]]
[1] "grapes"
[[2]]
[1] 65
[[3]]
[1] TRUE
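The code that produced the output above is not preserved in this file. As a minimal sketch of the same idea, assuming a list that holds a string, a number, and a logical value (the names and values below are hypothetical), a list can be created, read, and changed like this:

```r
# A list can hold elements of different types (values here are hypothetical)
record <- list(name = "apple", marks = 65, passed = TRUE)

record[[2]]          # access by position
record$name          # access by name

# Lists are changeable: replace one element and append another
record$marks <- 70
record$grade <- "B"

str(record)          # compact view of the structure
```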
Factors
Factors are used to categorize data. Examples of factors are:
• Demography: Male/Female
• Music: Rock, Pop, Classic, Jazz
• Training: Strength, Stamina
Example
2. Factor Length
Use the length() function to find out how many items there are in the
factor:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
> length(music_genre)
[1] 8
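Beyond its length, a factor also stores its distinct categories (levels). The sketch below reuses the music_genre factor from the example; the variable names genre_levels and genre_counts are introduced here for illustration.

```r
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic",
                        "Pop", "Jazz", "Rock", "Jazz"))

# Distinct categories, stored in alphabetical order by default
genre_levels <- levels(music_genre)
genre_levels
nlevels(music_genre)

# How often each category occurs
genre_counts <- table(music_genre)
genre_counts
```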
DATA FRAME
A data frame is data displayed in a table format.
A data frame can hold different types of data: the first column
can be character while the second and third are numeric or
logical. Within a single column, however, every value must be of
the same type.
EXAMPLE
1. Create a data frame:
Data_Frame <- data.frame(Training = c("Strength", "Stamina", "Other"),
                         Pulse = c(100, 150, 120),
                         Duration = c(60, 30, 45))
> print(Data_Frame)
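The printed output of this example is not preserved here. As a brief sketch of how such a data frame can be inspected, the following uses only base R functions; the variable name high_pulse is an illustrative addition.

```r
Data_Frame <- data.frame(Training = c("Strength", "Stamina", "Other"),
                         Pulse = c(100, 150, 120),
                         Duration = c(60, 30, 45))

str(Data_Frame)        # column names and the type of each column
dim(Data_Frame)        # number of rows and columns
Data_Frame$Pulse       # one column, returned as a vector

# Rows where Pulse is above 110, keeping only the Training column
high_pulse <- Data_Frame[Data_Frame$Pulse > 110, "Training"]
high_pulse
```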
ARRAY
Compared to matrices, arrays can have more than two
dimensions.
Example
, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24
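The code for this example is missing from the file. The slice shown above is consistent with a 4 x 3 x 2 array filled with the numbers 1 to 24, so the sketch below assumes that construction; the name arr is hypothetical.

```r
# Assumed construction: values 1..24 arranged as 4 rows, 3 columns, 2 layers
arr <- array(1:24, dim = c(4, 3, 2))

dim(arr)         # the three dimensions
arr[, , 2]       # second layer, a 4 x 3 matrix
arr[2, 3, 2]     # row 2, column 3 of the second layer
```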
Matrix
A matrix is a two dimensional data set with columns and rows.
EXAMPLE
1. Create a matrix:
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
> print(thismatrix)
     [,1]     [,2]
[1,] "apple"  "cherry"
[2,] "banana" "orange"
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
> thismatrix[2,]
[1] "banana" "orange"
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
> thismatrix[,2]
[1] "cherry" "orange"
Summary Function
The summary() function in R can be used to quickly
summarize the values in a vector, data frame, regression
model, or ANOVA model in R.
summary(data)
#define vector
x <- c(3, 4, 4, 5, 7, 8, 9, 12, 13, 13, 15, 19, 21)
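The call on the vector and its output are not preserved in this file. A minimal sketch of applying summary() to the vector defined above follows; the variable name s is introduced here for illustration.

```r
x <- c(3, 4, 4, 5, 7, 8, 9, 12, 13, 13, 15, 19, 21)

# Minimum, quartiles, median, mean and maximum in one call
s <- summary(x)
s

# Individual statistics can be pulled out by name
s[["Median"]]
s[["Mean"]]
```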
Sapply()
Use the sapply() function when you want to apply a function
to each element of a list, vector, or data frame and obtain
a vector instead of a list as a result.
The basic syntax for the sapply() function is as follows:
sapply(X, FUN)
where:
• X is the name of the list, vector, or data frame
• FUN is the specific operation you want to perform
Example:
#create a data frame with three columns and five rows
data <- data.frame(a = c(1, 3, 7, 12, 9),
b = c(4, 4, 6, 7, 8),
c = c(14, 15, 11, 10, 6))
   a b  c
1  1 4 14
2  3 4 15
3  7 6 11
4 12 7 10
5  9 8  6
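The sapply() calls for this example are missing from the file. The following sketch shows two typical uses on the same data frame; the names col_means and doubled are illustrative additions.

```r
data <- data.frame(a = c(1, 3, 7, 12, 9),
                   b = c(4, 4, 6, 7, 8),
                   c = c(14, 15, 11, 10, 6))

# Mean of each column: the result is a named numeric vector, not a list
col_means <- sapply(data, mean)
col_means

# When FUN returns several values per column, sapply() simplifies to a matrix
doubled <- sapply(data, function(x) x * 2)
doubled
```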
Lapply()
Use the lapply() function when you want to apply a function to
each element of a list, vector, or data frame and obtain a list as
a result.
The basic syntax for the lapply() function is as follows:
lapply(X, FUN)
where
• X is the name of the list, vector, or data frame
• FUN is the specific operation you want to perform
Example:
#create a data frame with three columns and five rows
data <- data.frame(a = c(1, 3, 7, 12, 9),
b = c(4, 4, 6, 7, 8),
c = c(14, 15, 11, 10, 6))
   a b  c
1  1 4 14
2  3 4 15
3  7 6 11
4 12 7 10
5  9 8  6
$a
[1] 6.4
$b
[1] 5.8
$c
[1] 11.2
$a
[1] 2 6 14 24 18
$b
[1] 8 8 12 14 16
$c
[1] 28 30 22 20 12
Lapply() on List
#create a list
x <- list(a=1, b=1:5, c=1:10)
$a
[1] 1
$b
[1] 1 2 3 4 5
$c
[1] 1 2 3 4 5 6 7 8 9 10
$a
[1] 1
$b
[1] 15
$c
[1] 55
$a
[1] 1
$b
[1] 3
$c
[1] 5.5
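The lapply() calls themselves are not preserved above, only their outputs. The outputs are consistent with taking column means, doubling each column, and then taking the sum and mean of each list element, so the sketch below assumes those calls; the result names are hypothetical.

```r
data <- data.frame(a = c(1, 3, 7, 12, 9),
                   b = c(4, 4, 6, 7, 8),
                   c = c(14, 15, 11, 10, 6))

# Assumed calls on the data frame: each result is a list with one entry per column
means_list   <- lapply(data, mean)
doubled_list <- lapply(data, function(x) x * 2)

# Assumed calls on the list
x <- list(a = 1, b = 1:5, c = 1:10)
sums_list  <- lapply(x, sum)
means_of_x <- lapply(x, mean)

means_list
sums_list
```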
Apply()
Use the apply() function when you want to apply a function to
the rows or columns of a matrix or data frame.
The basic syntax for the apply() function is as follows:
apply(X, MARGIN, FUN)
   a b  c
1  1 4 14
2  3 4 15
3  7 6 11
4 12 7 10
5  9 8  6
[1] 19 22 24 29 23
a b c
32 29 56
a b c
6.4 5.8 11.2
a b c
4.449719 1.788854 3.563706
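In apply(), MARGIN = 1 applies the function to each row and MARGIN = 2 applies it to each column. The calls that produced the outputs above are not preserved; the numbers are consistent with row sums, column sums, column means, and column standard deviations, so the sketch below assumes those calls. The result names are hypothetical.

```r
data <- data.frame(a = c(1, 3, 7, 12, 9),
                   b = c(4, 4, 6, 7, 8),
                   c = c(14, 15, 11, 10, 6))

row_sums  <- apply(data, 1, sum)    # MARGIN = 1: one value per row
col_sums  <- apply(data, 2, sum)    # MARGIN = 2: one value per column
col_means <- apply(data, 2, mean)
col_sds   <- apply(data, 2, sd)

row_sums
col_sds
```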
Tapply()
Use the tapply() function when you want to apply a function
to subsets of a vector and the subsets are defined by some
other vector, usually a factor.
The basic syntax for the tapply() function is as follows:
tapply(X, INDEX, FUN)
• X is the name of the object, typically a vector
• INDEX is a list of one or more factors
• FUN is the specific operation you want to perform
Example:
#view first six lines of iris dataset
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
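The tapply() call for this example is missing from the file. A minimal sketch on the built-in iris dataset, grouping sepal length by species, could look like this; the name mean_by_species is an illustrative addition.

```r
# Mean sepal length within each species
mean_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
mean_by_species

# The same pattern works with any summary function, for example max
tapply(iris$Sepal.Length, iris$Species, max)
```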
Describe()
The describe() function is a useful tool for generating descriptive
statistics of data. It is not part of base R; it is provided by add-on
packages such as Hmisc and psych, which must be installed and loaded
first. It provides a comprehensive summary of the variables in a data
frame, including central tendency, variability, and distribution
measures. This function is particularly valuable for preliminary data
analysis, helping to understand the basic characteristics of the dataset.
OUTPUT:
Value 25 30 35 40 45
Frequency 1 1 1 1 1
Proportion 0.2 0.2 0.2 0.2 0.2
Missing value
In R, the NA symbol marks a missing value. Undefined arithmetic
results, such as 0/0, are represented by NaN, which stands for
"not a number"; dividing a non-zero number by zero gives Inf or
-Inf rather than NaN. NA and NaN are not the same thing, but
is.na() returns TRUE for both, so both are commonly treated as
missing values.
Example
x<- c(NA, 3, 4, NA, NA, NA)
is.na(x)
Output
[1] TRUE FALSE FALSE TRUE TRUE TRUE
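To show how missing values affect calculations, and how NA differs from NaN, here is a short sketch using only base R; the variable names are introduced here for illustration.

```r
x <- c(NA, 3, 4, NA, NA, NA)

n_missing  <- sum(is.na(x))          # count of missing values
mean_raw   <- mean(x)                # NA, because missing values propagate
mean_clean <- mean(x, na.rm = TRUE)  # missing values dropped first

# NaN and Inf come from arithmetic, not from missing data
zero_by_zero <- 0 / 0    # NaN
one_by_zero  <- 1 / 0    # Inf

is.na(zero_by_zero)      # is.na() is TRUE for NaN as well
is.nan(NA)               # is.nan() is FALSE for NA
```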
Frequency distribution
The table() function in R computes the frequency count of each
distinct value in a vector, such as a column of a data frame.
The result is printed as a two-row structure: the first row
lists the distinct values and the second row lists their
frequencies. The cumulative frequency of a class is the sum of
its own frequency and the frequencies of all classes before it
in the frequency table.
The cumsum() function can be used to calculate this.
Example:
# creating a data table (requires the data.table package)
> library(data.table)
> data_table <- data.table(col1 = sample(6 : 9, 9 ,
+ replace = TRUE),
+ col2 = letters[1 : 3],
+ col3 = c(1, 4, 1, 2, 2,
+ 2, 1, 2, 2))
> print ("Original DataFrame")
9: 6 c 2
> freq <- table(data_table$col1)
> print ("Modified Frequency Table")
[1] "Modified Frequency Table"
> print (freq)
6 7 8 9
2 4 2 1
> print ("Cumulative Frequency Table")
[1] "Cumulative Frequency Table"
> cum_freq <- cumsum(freq)
> print (cum_freq)
6 7 8 9
2 6 8 9
> print ("Relative Frequency Table")
[1] "Relative Frequency Table"
> prob <- prop.table(freq)
> print (prob)
        6         7         8         9
0.2222222 0.4444444 0.2222222 0.1111111
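Because col1 above is drawn with sample(), its counts change from run to run. As a reproducible sketch of the same three steps, the following uses the fixed col3 values from the example and base R only; the variable names are illustrative.

```r
values <- c(1, 4, 1, 2, 2, 2, 1, 2, 2)   # the col3 values from the example

freq     <- table(values)        # frequency of each distinct value
cum_freq <- cumsum(freq)         # running total of the frequencies
rel_freq <- prop.table(freq)     # each frequency divided by the total

freq
cum_freq
round(rel_freq, 3)
```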
Contingency Table
Contingency analysis is a hypothesis test used to check whether
two categorical variables are independent. In simple words, it
asks: "Does knowing the value of one variable help predict the
value of the other?" If the answer is yes, the variables are not
independent. If the answer is no, the variables are independent.
The test is built on a contingency table, which is why it is
called contingency analysis. It is also known as the chi-square
test of independence, because the test statistic follows a
chi-square distribution.
The null hypothesis of the test is that the two variables are
independent and the alternative hypothesis is that the two
variables are not independent.
Example 1
#create data
df <- data.frame(order_num = 1:20,
product=rep(c('TV', 'Radio', 'Computer'), times=c(9, 6, 5)),
country=rep(c('A', 'B', 'C', 'D'), times=5))
#view data
print(df)
order_num product country
1 1 TV A
2 2 TV B
3 3 TV C
4 4 TV D
5 5 TV A
6 6 TV B
7 7 TV C
8 8 TV D
9 9 TV A
10 10 Radio B
11 11 Radio C
12 12 Radio D
13 13 Radio A
14 14 Radio B
15 15 Radio C
16 16 Computer D
17 17 Computer A
18 18 Computer B
19 19 Computer C
20 20 Computer D
          A B C D
Computer  1 1 1 2
Radio     1 2 2 1
TV        3 2 2 2

          A B C D Sum
Computer  1 1 1 2   5
Radio     1 2 2 1   6
TV        3 2 2 2   9
Sum       5 5 5 5  20
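The calls that produced the two tables above are not preserved in this file. The sketch below rebuilds them with table() and addmargins() and then runs the chi-square test of independence described earlier. The names tab, tab_with_totals, and chi_result are hypothetical, and the counts here are small enough that R warns the chi-square approximation may be inaccurate, so the warning is suppressed explicitly.

```r
df <- data.frame(order_num = 1:20,
                 product = rep(c('TV', 'Radio', 'Computer'), times = c(9, 6, 5)),
                 country = rep(c('A', 'B', 'C', 'D'), times = 5))

# Contingency table of product by country
tab <- table(df$product, df$country)
tab

# Same table with row and column totals
tab_with_totals <- addmargins(tab)
tab_with_totals

# Chi-square test of independence (expected counts are small in this toy data)
chi_result <- suppressWarnings(chisq.test(tab))
chi_result$parameter    # degrees of freedom: (rows - 1) * (columns - 1)
chi_result$p.value
```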
plot(table)
Example:2
# Create example data frame
data <- data.frame(x1 = c(LETTERS[1:4], "A", "B", "B"),
                   x2 = c(letters[1:3], "b", "c", "c", "c"))
print(data)
  x1 x2
1  A  a
2  B  b
3  C  c
4  D  b
5  A  c
6  B  c
7  B  c
DATA VISUALISATION IN R
Bar plot
A barplot is a type of data visualization that uses bars to
represent the frequency or distribution of categorical data.
Each bar in a barplot corresponds to a category, and the length
of the bar typically represents the frequency or value of that
category. Barplots are commonly used to compare different
categories or show the distribution of data across categories.
To draw a barplot of hp from the built-in mtcars dataset:
#Horizontal
barplot(mtcars$hp, xlab = "HorsePower", col = "cyan", horiz = TRUE)
#Vertical
barplot(mtcars$hp, ylab = "HorsePower", col = "cyan", horiz = FALSE)
Histogram
A histogram is a type of data visualization that displays the
distribution of numerical data by dividing the data into
intervals or bins and showing the frequency of values within
each interval using bars. It's similar to a barplot but is used for
representing continuous data rather than categorical data.
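The histogram code and plot are not preserved in this file. As a sketch in the same style as the bar plot example, the following draws a histogram of hp from the built-in mtcars dataset; the title, colour, and the name h are illustrative choices.

```r
# Histogram of horsepower; hist() also returns the bins and counts it used
h <- hist(mtcars$hp,
          main = "Distribution of horsepower",
          xlab = "HorsePower",
          col = "cyan")

h$breaks    # bin boundaries chosen by R
h$counts    # number of cars in each bin
```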
Boxplot
A boxplot, also known as a box-and-whisker plot, is a type of
data visualization that provides a graphical summary of the
distribution of numerical data through quartiles. It consists of
a box that represents the interquartile range (IQR) of the data,
with a line inside the box indicating the median. The
"whiskers" extend from the box to show the range of the data,
excluding outliers, which are displayed as individual points
beyond the whiskers.
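The boxplot code and plots are not preserved in this file. A minimal sketch, assuming the built-in mtcars dataset used elsewhere in this document, compares fuel efficiency across cylinder counts; the labels and the name b are illustrative.

```r
# One box per cylinder group; boxplot() also returns the statistics it drew
b <- boxplot(mpg ~ cyl, data = mtcars,
             main = "Mileage by number of cylinders",
             xlab = "Cylinders", ylab = "Miles per gallon",
             col = "cyan")

b$stats   # per group: lower whisker, Q1, median, Q3, upper whisker
b$n       # number of observations in each group
b$out     # values drawn as individual outlier points
```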
Mosaic Plot()
mosaicplot(table(iris$Species, iris$Petal.Width))
df <- data.frame(names = c(
  "Enterprise Business Rules",
  "Application Business Rules",
  "Interface Adapters",
  "Frameworks & Drivers"))
library(ggplot2)
time <- as.numeric(rep(seq(1, 7), each = 7))   # x axis
value <- runif(49, 10, 100)                    # y axis
group <- rep(LETTERS[1:7], times = 7)          # group, one shape per group
data <- data.frame(time, value, group)
ggplot(data, aes(x = time, y = value, fill = group)) +
  geom_area()
Import file in R
• Import csv files in R
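The import code and its output are not preserved in this file. The sketch below uses base R's read.csv(); to stay self-contained it first writes a small temporary CSV file, so the file contents and the names csv_path and imported are hypothetical stand-ins for a real file on disk.

```r
# Create a small CSV file to import (stand-in for a real file path)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Training,Pulse,Duration",
             "Strength,100,60",
             "Stamina,150,30",
             "Other,120,45"), csv_path)

# Read the file into a data frame; header = TRUE uses the first row as column names
imported <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

head(imported)   # first rows
str(imported)    # column types detected by R
```

For a real dataset, csv_path would be replaced by the path to the file, for example one chosen interactively with file.choose().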
Pyramid Plot
Population pyramids are often used in demography, public health, and social
sciences to visualize the age and sex distribution of a population. In this tutorial,
we will learn how to create a population pyramid in R using ggplot2.
Time Series
The ts() function converts a numeric vector or matrix into a time
series object. It uses the following basic syntax:
ts(data, start, end, frequency)
where
• data: A vector or matrix of time series values
• start: The time of the first observation
• end: The time of the last observation
• frequency: The number of observations per unit of time.
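As a minimal sketch of this syntax, the following builds a monthly series for one year. The sales figures and the names sales and first_half are made up for illustration.

```r
# Twelve hypothetical monthly sales figures
sales_values <- c(23, 25, 31, 28, 35, 40, 42, 39, 33, 30, 27, 45)

# Monthly data (frequency = 12) starting in January 2023
sales <- ts(sales_values, start = c(2023, 1), frequency = 12)
sales

start(sales)       # time of the first observation
end(sales)         # time of the last observation
frequency(sales)   # observations per year

# Extract a sub-period: January to June 2023
first_half <- window(sales, start = c(2023, 1), end = c(2023, 6))
first_half

plot(sales, main = "Monthly sales", ylab = "Sales")
```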