
DATA VISUALIZATION & DESCRIPTIVE ANALYTICS USING R
(KMBA 252)

PRACTICAL FILE

Submitted By: Ritish Garg (Roll No: 2301921570093)
Submitted To: Ms. Rekha, Assistant Professor, DMS

DVDA(KMBA252) Ritish(2301921570093)

INDEX

S.NO.  TOPIC NAME                                  DATE         REMARK
1.     Introduction of R Programming               22-04-2024
2.     R Visualization Packages                    25-04-2024
3.     Application Areas of Data Visualization     01-05-2024
4.     Advantages of Data Visualization in R       03-05-2024
5.     Disadvantages of Data Visualization in R    06-05-2024
6.     Data Structure in R                         08-05-2024
       6.1 Vectors
       6.2 List
       6.3 Factors
       6.4 Data Frame
       6.5 Array
       6.6 Matrix
7.     7.1 Summary Function                        09-05-2024
       7.2 Sapply()
       7.3 Lapply()
       7.4 Apply()
       7.5 Tapply()
       7.6 Describe()
8.     Missing Value                               13-05-2024
9.     Frequency Distribution                      15-05-2024
10.    Contingency Table                           17-05-2024
11.    Data Visualization in R                     31-05-2024
       11.1 The Plot ()
       11.2 Bar Plot
       11.3 Histogram
       11.4 Boxplot
       11.5 Mosaic Plot()
       11.6 Radial Plot using ggplot
12.    Import file in R                            03-06-2024
13.    Pyramid Plot                                06-06-2024
14.    Time Series                                 10-06-2024


Introduction of R Programming
R is a programming language and analytics tool developed in
1993 by Robert Gentleman and Ross Ihaka at the University of
Auckland, New Zealand. It is used extensively by software
programmers, statisticians, data scientists, and data miners,
and is one of the most popular tools in data analytics and
business analytics. It has applications in domains such as
healthcare, academia, consulting, finance, and media. Its wide
applicability in statistics, data visualization, and machine
learning has given rise to demand for trained and certified
R professionals.
Here is a brief introduction to key aspects of R programming:
1. Statistical Computing:
• R is specifically designed for statistical computing and
data analysis. It provides a wide range of statistical and
mathematical techniques, making it a powerful tool for
researchers and analysts.
2. Open Source:
• R is an open-source language, meaning that its source
code is freely available for users to view, modify, and
distribute. This fosters collaboration and allows the
community to contribute to its development.
3. Packages and Libraries:
• R has a vast ecosystem of packages and libraries that
extend its functionality. These packages cover a wide
array of topics, including data manipulation,
visualization, machine learning, and more. Popular
packages include ggplot2 for data visualization, dplyr
for data manipulation, and caret for machine learning.
4. Data Analytics and Visualisation:
• R excels in data analysis and visualization. It provides
numerous tools for importing, cleaning, and analyzing
data. The ggplot2 package, for instance, is widely used
for creating high-quality graphs and visualizations.
5. Data Structure:
• R supports various data structures, including vectors,
matrices, data frames, and lists. These structures
facilitate the manipulation and analysis of data in a
flexible manner.
6. Scripting Language:
• R is an interpreted scripting language, allowing users
to write scripts and execute them step by step. This
facilitates exploratory data analysis and the
development of reproducible research.
7. Community support:
• R has a vibrant and active community of users and
developers. The community contributes to the
development of new packages, provides support
through forums and mailing lists, and shares
knowledge through blogs and tutorials.
8. Integrating with other tools:
• R can be easily integrated with other languages and
tools. It has interfaces for connecting with databases,
can execute code written in C, C++, or Fortran, and can
be integrated with tools like Python.

In summary, R is a versatile programming language that is
particularly well-suited for statistical computing and data
analysis. Its open-source nature and extensive package
ecosystem contribute to its popularity in the scientific and
data-driven communities.


R Visualization Packages

In addition to R's base graphics system, R offers a rich
ecosystem of data visualization packages that significantly
enhance its data visualization capabilities. Some of the most
popular and widely used R visualization packages include:

• ggplot2: Developed by Hadley Wickham, ggplot2 is a highly
versatile and widely adopted data visualization package. It
follows the "grammar of graphics" philosophy, making it easy
to create complex and aesthetically pleasing visualizations.
With ggplot2, users can build a wide range of plots, including
scatter plots, bar charts, line plots, histograms, box plots, and
more. It allows for extensive customization, enabling users to
modify colours, themes, labels, and other graphical elements.

• lattice: lattice is designed for conditioning data plots, where
data is divided into subsets, and panels or facets are
generated for each subset. This package is particularly useful
for visualizing multivariate data and allows for the creation of
trellis plots and other conditioned plots.

• ggvis: Developed by Hadley Wickham, ggvis extends the
capabilities of ggplot2 to create interactive visualizations. It is
suitable for generating web-based interactive plots with added
responsiveness and interactivity.

• plotly: plotly is an interactive and web-based visualization
package. It excels in creating interactive visualizations,
including 3D plots, heatmaps, choropleth maps, and
more. plotly visualizations can be embedded in web
applications or notebooks, allowing users to interact with the
data, zoom, pan, and hover over data points.


• dygraphs: dygraphs is a specialized package for creating
interactive time series plots. It is ideal for visualizing and
exploring time series data with zooming, panning, and
interactive tooltips.

• leaflet: leaflet is a powerful package for creating interactive
maps. It integrates well with R's spatial data capabilities and
allows users to create customizable, interactive maps with
various layers and markers.

• tmap: tmap is a package for thematic mapping, designed to
visualize geospatial data with thematic overlays, choropleth
maps, and bubble maps. It offers a straightforward interface
for creating informative and visually appealing spatial
visualizations.
These visualization packages in R complement its base
graphics system, providing users with a vast array of tools to
create stunning, informative, and interactive data
visualizations.
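
As a minimal sketch of the conditioning idea described for lattice above, the example below draws one scatter panel per cylinder count using the built-in mtcars data set. It assumes the lattice package is available (it normally ships with R); the choice of variables and labels is purely illustrative.

```r
# Conditioned (trellis) scatter plot: one panel per number of cylinders
library(lattice)

p <- xyplot(mpg ~ hp | factor(cyl), data = mtcars,
            xlab = "Horsepower", ylab = "Miles per Gallon",
            main = "mpg vs hp, conditioned on cyl")

# lattice plots are objects; print() draws them
print(p)
```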


Application Areas of Data Visualization

Data visualization has diverse applications across various
fields and industries due to its ability to present complex data
in a visually compelling and understandable manner. Some of
the key application areas of data visualization include:

• Business Intelligence (BI): Data visualization is extensively
used in business intelligence to analyze and present key
performance indicators (KPIs), sales trends, market data, and
financial metrics. Interactive dashboards and visual reports
help stakeholders make informed decisions and identify areas
for improvement.

• Data Analysis and Exploration: Data visualization aids data
analysts and scientists in exploring datasets, identifying
patterns, trends, and outliers. Visualizations provide a quick
understanding of data distributions, correlations, and
relationships, leading to better data-driven insights.

• Data Reporting and Presentations: Visualizations play a
crucial role in presenting research findings, survey results,
and data-driven insights in a clear and concise manner. They
enhance audience comprehension and engagement during
presentations and reports.

• Geographic Information Systems (GIS): Data visualization is
integral to GIS applications, enabling the representation of
geospatial data on maps. This helps in visualizing patterns,
spatial relationships, and making location-based decisions.

• Healthcare and Life Sciences: Data visualization is used in
healthcare to represent medical data, patient demographics,
and treatment outcomes. It aids in understanding disease
trends, patient outcomes, and healthcare resource allocation.

• Social Media Analytics: Social media platforms generate vast
amounts of data. Data visualization allows businesses to track
social media metrics, sentiment analysis, and engagement
levels to make informed marketing decisions.
• Financial Analysis: In finance, data visualization helps
interpret stock market trends, portfolio performance, and risk
analysis. It aids financial analysts in making investment
decisions and communicating financial insights to
stakeholders.

• E-commerce and Retail: Retailers use data visualization to
analyze customer behavior, sales performance, and inventory
management. It enables them to optimize pricing strategies
and improve customer experience.

• Environmental Science: Data visualization supports
environmental scientists in analyzing climate data,
biodiversity patterns, and pollution levels. It assists in
communicating environmental issues to policymakers and the
public.

• Education and Training: Data visualization enhances
educational materials and training programs by presenting
complex concepts in a more accessible and engaging manner.
It helps students grasp information effectively.


Advantages of Data Visualization in R


Data visualization in R offers numerous advantages, making it
a popular choice for data analysis and communication. Some
of the key advantages of data visualization in R are:

• Visual Representation of Complex Data: R's data
visualization capabilities allow users to represent complex
datasets in the form of charts, plots, and graphs.
Visualizations provide a clear and intuitive representation of
data, making it easier to identify patterns, trends, and outliers.

• Better Data Understanding: Visualizations help users gain a
deeper understanding of data by providing a visual overview.
They enable data analysts and scientists to explore datasets,
spot insights, and make data-driven decisions more
effectively.

• Facilitating Data Exploration: R's interactive and
customizable visualizations enable users to interact with the
data, zoom in on specific areas, and explore data from
different perspectives. This dynamic exploration facilitates a
deeper understanding of data relationships.

• Enhanced Communication: Data visualizations in R make it
easier to communicate complex information to various
stakeholders, including non-technical audiences. Visual
representations are more accessible and engaging than raw
data or text-based reports.

• Quick Insights: With R's data visualization packages, users
can create plots and charts rapidly. This agility allows for
quick data exploration, hypothesis testing, and identifying
potential areas for further investigation.


• Customization and Flexibility: R's visualization packages
offer a high level of customization, allowing users to tailor
visualizations to their specific needs. Users can control colors,
labels, titles, and other graphical elements to align with their
project's objectives.
• Integration with Data Analysis: R is a comprehensive
statistical programming language, and its data visualization
packages seamlessly integrate with data analysis functions.
Users can analyze data and visualize results in the same
environment, streamlining the analytical workflow.

• Publication-Quality Output: R's data visualization packages
enable the creation of publication-ready graphics with high-
quality output. This is beneficial for generating visualizations
for research papers, reports, and presentations.

• Support for Diverse Chart Types: R's visualization ecosystem
offers a wide range of chart types, including bar charts, line
plots, scatter plots, histograms, heatmaps, and more. This
diversity allows users to choose the most appropriate
visualization for their data.

• Reproducibility: R's emphasis on scripts and code-based
workflows ensures that data visualizations are reproducible.
This makes it easier for others to replicate and verify the
analyses and visualizations.


Disadvantages of Data Visualization in R


While R offers a powerful and versatile platform for data
visualization, there are some potential disadvantages and
challenges associated with using R for this purpose. Some of
the key disadvantages of data visualization in R include:

• Steeper Learning Curve: R's extensive capabilities in data
visualization may lead to a steeper learning curve, especially for
beginners or individuals with limited programming experience.
Mastering R's visualization packages and customizing plots may
require time and effort.

• Code-Based Approach: R relies on coding for creating
visualizations, which may be daunting for non-programmers or those
more accustomed to point-and-click interfaces. Users must be familiar
with R syntax and functions to generate visualizations effectively.

• Package Dependency: R's visualization ecosystem is primarily
package-driven, and different packages may have overlapping or
conflicting functionalities. Managing dependencies and keeping
packages up-to-date can sometimes be challenging.

• Reproducibility Challenges: While R's script-based nature
promotes reproducibility, complex or highly customized visualizations
may be difficult to replicate, leading to potential reproducibility issues.

• Performance Limitations: Large datasets or complex
visualizations may lead to performance bottlenecks in R. Memory
constraints and slower processing times could impact the rendering of
visualizations.

• Limited 3D Visualization Support: R's 3D visualization
capabilities are not as robust compared to specialized 3D visualization
tools. Creating sophisticated 3D plots may require additional effort and
customizations.


• Limited Interactivity in Base Graphics: While some packages
provide interactive capabilities, R's base graphics system has limited
support for interactivity compared to dedicated interactive
visualization libraries.

• Visual Appeal: R's default visualizations may lack the visual appeal
and aesthetics offered by some other data visualization tools. Users
may need to invest time in customizing plots to achieve desired
aesthetics.

• Exporting Complex Plots: Exporting complex and highly
customized plots from R may be challenging due to limitations in vector
graphics formats or difficulties in preserving interactive features.

• Data Security: In certain scenarios, sharing R scripts containing
sensitive data for visualization purposes may raise data security
concerns.


DATA STRUCTURE IN R

Vectors
A vector is simply a list of items that are of the same type.

To combine items into a vector, use the c() function and
separate the items with commas.

In the example below, we create a vector variable called
fruits that combines strings:

Example:
# Vector of strings
fruits <- c("banana", "apple", "grapes")
# Print fruits
fruits
RESULT
[1] "banana" "apple"  "grapes"

In this example, we create a vector that combines numerical
values:

Example:
# Vector of numerical values
numbers <- c(10, 20, 30)
# Print numbers
numbers
RESULT
[1] 10 20 30
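
Because every item in a vector must share one type, it may help to see what happens when types are mixed. The sketch below is illustrative only; the variable names mixed, doubled, and first are assumptions introduced for this example.

```r
# Mixing types in c() coerces every element to one common type
mixed <- c(10, "apple", TRUE)
class(mixed)        # "character"

# Arithmetic on a numeric vector is applied element by element
numbers <- c(10, 20, 30)
doubled <- numbers * 2

# Elements are accessed by position, starting at 1
first <- numbers[1]
```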


List
A list in R can contain many different data types inside it. A list
is a collection of data which is ordered and changeable.

To create a list, use the list() function:

Lists by element type:

1. Similar data type:
EXAMPLE
> L1 <- list("banana", "apple", "grapes")
> print(L1)
[[1]]
[1] "banana"

[[2]]
[1] "apple"

[[3]]
[1] "grapes"

2. Different data type:

EXAMPLE
> L2 <- list("aman", 65, TRUE)
> print(L2)
[[1]]
[1] "aman"

[[2]]
[1] 65

[[3]]
[1] TRUE
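
The description of a list as ordered and changeable can be sketched as follows. The element names and the added city field are hypothetical and serve only to illustrate access and modification.

```r
# A named list holding elements of different types (names are illustrative)
person <- list(name = "aman", marks = 65, passed = TRUE)

person$name          # access by name
person[["marks"]]    # access by name with [[ ]]
person[[3]]          # access by position

# Lists are changeable: modify an element and append a new one
person$marks <- 70
person$city <- "Delhi"   # hypothetical extra field
length(person)
```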


Factors

Factors are used to categorize data. Examples of factors are:

• Demography: Male/Female
• Music: Rock, Pop, Classic, Jazz
• Training: Strength, Stamina

To create a factor, use the factor() function and pass a vector as the
argument:

Example

1. Create the factor
> music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop",
+                         "Jazz", "Rock", "Jazz"))
> print(music_genre)
[1] Jazz Rock Classic Classic Pop Jazz Rock Jazz
Levels: Classic Jazz Pop Rock

To print only the levels, use the levels() function:

> levels(music_genre)
[1] "Classic" "Jazz"    "Pop"     "Rock"

2. Factor Length
Use the length() function to find out how many items there are in the
factor:
> length(music_genre)
[1] 8
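
A short sketch of how a factor categorizes data: counting the items per level and inspecting the integer codes behind the labels. The variable names genre_counts, codes, and n_genres are assumptions for illustration.

```r
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop",
                        "Jazz", "Rock", "Jazz"))

# Frequency of each level
genre_counts <- table(music_genre)

# Factors are stored as integer codes that index into the levels
codes <- as.integer(music_genre)

# nlevels() counts the distinct categories
n_genres <- nlevels(music_genre)
```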


DATA FRAME
A data frame is data displayed in tabular format.
A data frame can hold different types of data: the first
column can be character while the second and third are
numeric or logical. However, all values within a single
column must be of the same type.

Use the data.frame() function to create a data frame:

EXAMPLE
1. Create a data frame:
> Data_Frame <- data.frame(Training = c("Strength", "Stamina", "Other"),
+                          Pulse = c(100, 150, 120),
+                          Duration = c(60, 30, 45))
> print(Data_Frame)
  Training Pulse Duration
1 Strength   100       60
2  Stamina   150       30
3    Other   120       45

2. Summarize the data:
Use the summary() function to summarize the data from a
Data Frame:
> summary(Data_Frame)
   Training             Pulse          Duration
 Length:3           Min.   :100.0   Min.   :30.0
 Class :character   1st Qu.:110.0   1st Qu.:37.5
 Mode  :character   Median :120.0   Median :45.0
                    Mean   :123.3   Mean   :45.0
                    3rd Qu.:135.0   3rd Qu.:52.5
                    Max.   :150.0   Max.   :60.0
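
Beyond printing and summarizing, a data frame is usually worked with column by column. The following sketch shows a few common operations on the same Data_Frame; the derived column name Ratio is a hypothetical choice for illustration.

```r
Data_Frame <- data.frame(Training = c("Strength", "Stamina", "Other"),
                         Pulse = c(100, 150, 120),
                         Duration = c(60, 30, 45))

# Select a single column as a vector
pulse <- Data_Frame$Pulse

# Select the rows that satisfy a condition
long_sessions <- Data_Frame[Data_Frame$Duration > 40, ]

# Add a derived column (the name Ratio is illustrative)
Data_Frame$Ratio <- Data_Frame$Pulse / Data_Frame$Duration

dim(Data_Frame)   # number of rows and columns
```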


ARRAY
Compared to matrices, arrays can have more than two
dimensions.

We can use the array() function to create an array, and
the dim parameter to specify the dimensions:

Example

1. Create the Array:
• A one-dimensional vector with values ranging from 1 to 24
> thisarray <- c(1:24)
> print(thisarray)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

• An array with more than one dimension
> multiarray <- array(thisarray, dim = c(4, 3, 2))
> print(multiarray)
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24
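
Elements of a multi-dimensional array are addressed with one index per dimension. The sketch below reuses the 4 x 3 x 2 array from the example; the names element, first_row, and layer2 are illustrative.

```r
multiarray <- array(1:24, dim = c(4, 3, 2))

dim(multiarray)            # 4 rows, 3 columns, 2 layers

# Single element: row 2, column 3, layer 2
element <- multiarray[2, 3, 2]

# Whole first row of layer 1
first_row <- multiarray[1, , 1]

# Entire second layer, returned as a 4 x 3 matrix
layer2 <- multiarray[, , 2]
```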


Matrix
A matrix is a two dimensional data set with columns and rows.

A column is a vertical representation of data, while a row is a
horizontal representation of data.

A matrix can be created with the matrix() function. Specify
the nrow and ncol parameters to set the number of rows and
columns:

EXAMPLE
1. Create a Matrix:
> thismatrix <- matrix(c("apple", "banana", "cherry", "orange"),
+                      nrow = 2, ncol = 2)
> print(thismatrix)
     [,1]     [,2]
[1,] "apple"  "cherry"
[2,] "banana" "orange"

2. Access Matrix items:
> thismatrix[1, 2]
[1] "cherry"

> thismatrix[2, ]
[1] "banana" "orange"

> thismatrix[, 2]
[1] "cherry" "orange"
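
The output above shows that matrix() fills values column by column by default. As a small sketch, the byrow argument changes that order; the names by_col and by_row are assumptions for this example.

```r
# Default: values fill column by column
by_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE fills row by row instead
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

by_col[1, ]   # first row when filled by column
by_row[1, ]   # first row when filled by row

dim(by_row)   # number of rows and columns
```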


Summary Function
The summary() function in R can be used to quickly
summarize the values in a vector, data frame, regression
model, or ANOVA model in R.

This function uses the following basic syntax:

summary(data)

The following examples show how to use this function in
practice.

Example : Using summary() with Vector

The following code shows how to use the summary() function
to summarize the values in a vector:

#define vector
x <- c(3, 4, 4, 5, 7, 8, 9, 12, 13, 13, 15, 19, 21)

#summarize values in vector
summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3.00    5.00    9.00   10.23   13.00   21.00

The summary() function automatically calculates the
following summary statistics for the vector:

• Min: The minimum value
• 1st Qu: The value of the 1st quartile (25th percentile)
• Median: The median value
• Mean: The mean value
• 3rd Qu: The value of the 3rd quartile (75th percentile)
• Max: The maximum value
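
To make the link between summary() and the individual statistics concrete, the sketch below computes each value separately with base R functions and compares the result with summary(). The variable name manual is an assumption for illustration.

```r
x <- c(3, 4, 4, 5, 7, 8, 9, 12, 13, 13, 15, 19, 21)

# Each statistic reported by summary(), computed one at a time
manual <- c(min(x),
            quantile(x, 0.25),
            median(x),
            mean(x),
            quantile(x, 0.75),
            max(x))

# Compare with the values returned by summary()
all.equal(unname(manual), as.numeric(summary(x)), tolerance = 1e-3)
```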


Sapply()
Use the sapply() function when you want to apply a function
to each element of a list, vector, or data frame and obtain
a vector instead of a list as a result.
The basic syntax for the sapply() function is as follows:
sapply(X, FUN)
where:
• X is the name of the list, vector, or data frame
• FUN is the specific operation you want to perform

Example:
#create a data frame with three columns and five rows
data <- data.frame(a = c(1, 3, 7, 12, 9),
                   b = c(4, 4, 6, 7, 8),
                   c = c(14, 15, 11, 10, 6))
data
   a b  c
1  1 4 14
2  3 4 15
3  7 6 11
4 12 7 10
5  9 8  6

#find mean of each column and return results as a vector
sapply(data, mean)
   a    b    c
 6.4  5.8 11.2

#multiply values in each column by 2 and return results as a matrix
sapply(data, function(col) col * 2)
      a  b  c
[1,]  2  8 28
[2,]  6  8 30
[3,] 14 12 22
[4,] 24 14 20
[5,] 18 16 12
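
The shape of the sapply() result depends on what FUN returns for each column. The following sketch, using the same data frame, contrasts a one-value function with a two-value function, and adds vapply() as a stricter base R variant that is not covered in the text above.

```r
data <- data.frame(a = c(1, 3, 7, 12, 9),
                   b = c(4, 4, 6, 7, 8),
                   c = c(14, 15, 11, 10, 6))

# One value per column -> named numeric vector
col_max <- sapply(data, max)

# Two values per column -> 2 x 3 matrix (one column per input column)
col_range <- sapply(data, range)

# vapply() is a stricter variant: the expected result type is declared
col_mean <- vapply(data, mean, numeric(1))
```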


Lapply()
Use the lapply() function when you want to apply a function to
each element of a list, vector, or data frame and obtain a list as
a result.
The basic syntax for the lapply() function is as follows:
lapply(X, FUN)
where
• X is the name of the list, vector, or data frame
• FUN is the specific operation you want to perform
Example:
#create a data frame with three columns and five rows
data <- data.frame(a = c(1, 3, 7, 12, 9),
                   b = c(4, 4, 6, 7, 8),
                   c = c(14, 15, 11, 10, 6))
data
   a b  c
1  1 4 14
2  3 4 15
3  7 6 11
4 12 7 10
5  9 8  6

#find mean of each column and return results as a list
lapply(data, mean)

$a
[1] 6.4
$b
[1] 5.8
$c
[1] 11.2

#multiply values in each column by 2 and return results as a list
lapply(data, function(col) col * 2)

$a
[1] 2 6 14 24 18
$b
[1] 8 8 12 14 16
$c
[1] 28 30 22 20 12

Lapply() on List
#create a list
x <- list(a=1, b=1:5, c=1:10)
x

$a
[1] 1
$b
[1] 1 2 3 4 5
$c
[1] 1 2 3 4 5 6 7 8 9 10

#find the sum of each element in the list
lapply(x, sum)

$a
[1] 1
$b
[1] 15
$c
[1] 55

#find the mean of each element in the list
lapply(x, mean)

$a
[1] 1
$b
[1] 3
$c
[1] 5.5
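
Two details of lapply() that the examples above do not show are passing extra arguments to FUN and flattening the resulting list. The sketch below uses a small made-up list containing a missing value; the names x and means are illustrative.

```r
# A list with a missing value in one element (illustrative data)
x <- list(a = c(1, NA, 3), b = 1:5)

# Extra arguments after FUN are passed on to FUN
means <- lapply(x, mean, na.rm = TRUE)

# unlist() flattens the resulting list into a named vector
unlist(means)
```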


Apply()
Use the apply() function when you want to apply a function to
the rows or columns of a matrix or data frame.
The basic syntax for the apply() function is as follows:
apply(X, MARGIN, FUN)

where:
• X is the name of the matrix or data frame
• MARGIN indicates which dimension to perform an operation
across (1 = row, 2 = column)
• FUN is the specific operation you want to perform (e.g. min,
max, sum, mean, etc.)

#create a data frame with three columns and five rows
data <- data.frame(a = c(1, 3, 7, 12, 9),
                   b = c(4, 4, 6, 7, 8),
                   c = c(14, 15, 11, 10, 6))
data

   a b  c
1  1 4 14
2  3 4 15
3  7 6 11
4 12 7 10
5  9 8  6

#find the sum of each row
apply(data, 1, sum)

[1] 19 22 24 29 23

#find the sum of each column
apply(data, 2, sum)

 a  b  c
32 29 56

#find the mean of each row
apply(data, 1, mean)

[1] 6.333333 7.333333 8.000000 9.666667 7.666667

#find the mean of each column, rounded to one decimal place
round(apply(data, 2, mean), 1)

   a    b    c
 6.4  5.8 11.2

#find the standard deviation of each row
apply(data, 1, sd)

[1] 6.806859 6.658328 2.645751 2.516611 1.527525

#find the standard deviation of each column
apply(data, 2, sd)

       a        b        c
4.449719 1.788854 3.563706
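
For sums and means specifically, base R also provides rowSums(), colSums(), rowMeans(), and colMeans(). These are not used in the text above; the sketch below only checks that they agree with the apply() calls on the same data frame.

```r
data <- data.frame(a = c(1, 3, 7, 12, 9),
                   b = c(4, 4, 6, 7, 8),
                   c = c(14, 15, 11, 10, 6))

row_totals <- apply(data, 1, sum)    # MARGIN = 1: one result per row
col_means  <- apply(data, 2, mean)   # MARGIN = 2: one result per column

# The dedicated helpers give the same numbers
same_rows <- all.equal(unname(row_totals), unname(rowSums(data)))
same_cols <- all.equal(col_means, colMeans(data))
```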


Tapply()
Use the tapply() function when you want to apply a function
to subsets of a vector and the subsets are defined by some
other vector, usually a factor.
The basic syntax for the tapply() function is as follows:
tapply(X, INDEX, FUN)
• X is the name of the object, typically a vector
• INDEX is a list of one or more factors
• FUN is the specific operation you want to perform
Example:
#view first six lines of iris dataset
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

#find the max Sepal.Length of each of the three Species
tapply(iris$Sepal.Length, iris$Species, max)

    setosa versicolor  virginica
       5.8        7.0        7.9

#find the mean Sepal.Width of each of the three Species
tapply(iris$Sepal.Width, iris$Species, mean)

    setosa versicolor  virginica
     3.428      2.770      2.974

#find the minimum Petal.Width of each of the three Species
tapply(iris$Petal.Width, iris$Species, min)

    setosa versicolor  virginica
       0.1        1.0        1.4
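
Since INDEX may be a list of more than one factor, tapply() can also produce a two-way table of results. The sketch below uses the built-in mtcars data set instead of iris; that choice of data and the variable names are assumptions for illustration.

```r
# Mean mpg for each cylinder count in the built-in mtcars data
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Two grouping vectors give a two-way table:
# rows = cylinders, columns = transmission (0 = automatic, 1 = manual)
mpg_by_cyl_am <- tapply(mtcars$mpg,
                        list(cyl = mtcars$cyl, am = mtcars$am),
                        mean)
```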


Describe()
The describe() function in R Programming Language is a useful tool for
generating descriptive statistics of data. It provides a comprehensive
summary of the variables in a data frame, including central tendency,
variability, and distribution measures. This function is particularly
valuable for preliminary data analysis, helping to understand the basic
characteristics of the dataset.

What is the describe() function?


The describe() function is available in several R packages, with Hmisc
and psych being the most popular. The example below uses the
describe() function from the Hmisc package.
Example
library(Hmisc)
# Example data frame
data <- data.frame(
  age = c(25, 30, 35, 40, 45, NA),
  income = c(50000, 60000, 65000, 70000, 75000, 80000),
  gender = factor(c("male", "female", "female", "male", "male", "female"))
)
# Using describe() from Hmisc
describe(data)

OUTPUT (excerpt, starting with the frequency table for age):
Value         25  30  35  40  45
Frequency      1   1   1   1   1
Proportion   0.2 0.2 0.2 0.2 0.2

For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------
income
       n  missing distinct     Info     Mean      Gmd
       6        0        6        1    66667    13333

Value      50000 60000 65000 70000 75000 80000
Frequency      1     1     1     1     1     1
Proportion 0.167 0.167 0.167 0.167 0.167 0.167

For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------
gender
       n  missing distinct
       6        0        2

Value      female   male
Frequency       3      3
Proportion    0.5    0.5
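
If Hmisc is not installed, part of this overview can be approximated in base R. The following is only a rough sketch, not a replacement for describe(): it reproduces the n, missing, and distinct counts for each column, and the name overview is an assumption for illustration.

```r
data <- data.frame(
  age = c(25, 30, 35, 40, 45, NA),
  income = c(50000, 60000, 65000, 70000, 75000, 80000),
  gender = factor(c("male", "female", "female", "male", "male", "female"))
)

# For every column: non-missing count, missing count, distinct values
overview <- sapply(data, function(col) {
  c(n = sum(!is.na(col)),
    missing = sum(is.na(col)),
    distinct = length(unique(col[!is.na(col)])))
})
overview
```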


Missing value
In R, the NA symbol represents a missing value, while the NaN
symbol ("not a number") represents the result of an undefined
arithmetic operation such as 0/0. (Dividing a non-zero number
by zero gives Inf or -Inf, not NaN.) In simple words, both NA
and NaN are treated as missing values by R.

Properties of Missing Values:

• To test whether elements are NA, use is.na()
• To test whether elements are NaN, use is.nan()
• NA has a variant for each class: the integer class has an
integer type NA, the character class has a character type NA,
etc.
• is.na() returns TRUE for a NaN value, but is.nan() returns
FALSE for an NA value.

Example
x<- c(NA, 3, 4, NA, NA, NA)
is.na(x)

Output
[1] TRUE FALSE FALSE TRUE TRUE TRUE
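
A short sketch of how missing values behave in calculations, reusing the same vector x. The na.rm argument is standard in base R summary functions such as mean() and sum().

```r
x <- c(NA, 3, 4, NA, NA, NA)

sum(is.na(x))            # number of missing values
mean(x)                  # NA: missing values propagate
mean(x, na.rm = TRUE)    # ignore missing values
x[!is.na(x)]             # keep only the observed values

# NaN versus NA
is.na(0 / 0)             # TRUE: NaN counts as missing
is.nan(NA)               # FALSE: NA is not NaN
```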


Frequency distribution
The table() function in R computes the frequency counts of the
values appearing in a specified column of a data frame. The
result is printed as a two-row structure, where the first row
lists the distinct values of the column and the second row
gives their frequencies. The cumulative frequency of a class
is the sum of its own frequency and the frequencies of all
classes below it in the frequency distribution table, so the
value at any position is the running total of all previous
values and the current one.
The cumsum() function can be used to calculate this.

Example:
# creating a data table (requires the data.table package)
# col1 is drawn at random, so its values differ from run to run
> library(data.table)
> data_table <- data.table(col1 = sample(6 : 9, 9, replace = TRUE),
+                          col2 = letters[1 : 3],
+                          col3 = c(1, 4, 1, 2, 2, 2, 1, 2, 2))
> print ("Original DataFrame")
[1] "Original DataFrame"
> print (data_table)
    col1   col2  col3
   <int> <char> <num>
1:     8      a     1
2:     6      b     4
3:     8      c     1
4:     9      a     2
5:     7      b     2
6:     7      c     2
7:     7      a     1
8:     7      b     2
9:     6      c     2
> freq <- table(data_table$col1)
> print ("Modified Frequency Table")
[1] "Modified Frequency Table"
> print (freq)

6 7 8 9
2 4 2 1
> print ("Cumulative Frequency Table")
[1] "Cumulative Frequency Table"
> cum_freq <- cumsum(freq)
> print (cum_freq)
6 7 8 9
2 6 8 9
> print ("Relative Frequency Table")
[1] "Relative Frequency Table"
> prob <- prop.table(freq)
> print (prob)

        6         7         8         9
0.2222222 0.4444444 0.2222222 0.1111111
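
Because sample() draws random values, the run above cannot be reproduced exactly. As a base R sketch, the version below fixes the values to those printed in col1 above and gathers the three measures into one table; the names values, rel_freq, and freq_table are assumptions for illustration.

```r
# Fixed values, so the result is the same on every run
values <- c(8, 6, 8, 9, 7, 7, 7, 7, 6)

freq     <- table(values)       # frequency of each distinct value
cum_freq <- cumsum(freq)        # running total of the frequencies
rel_freq <- prop.table(freq)    # share of each value in the total

# Combine the three measures into one data frame
freq_table <- data.frame(value      = names(freq),
                         frequency  = as.vector(freq),
                         cumulative = as.vector(cum_freq),
                         relative   = round(as.vector(rel_freq), 3))
freq_table
```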


Contingency Table
Contingency analysis is a hypothesis test that is used to check
whether two categorical variables are independent or not. In
simple words, we are asking the question "Can we predict the
value of one variable if we know the value of the other
variable?". If the answer is yes, we can say that the variables
under consideration are not independent. If the answer is no,
then we can say that the variables under consideration are
independent. The test makes use of contingency tables as a
result of which it is known as 'Contingency Analysis'. It is also
known as 'Chi-square test of independence' because the test
statistic follows a chi-square distribution and the test is used
to check whether two categorical variables are independent or
not.
The null hypothesis of the test is that the two variables are
independent and the alternative hypothesis is that the two
variables are not independent.

Example 1
#create data
df <- data.frame(order_num = 1:20,
product=rep(c('TV', 'Radio', 'Computer'), times=c(9, 6, 5)),
country=rep(c('A', 'B', 'C', 'D'), times=5))

#view data
print(df)
order_num product country
1 1 TV A
2 2 TV B
3 3 TV C
4 4 TV D
5 5 TV A
6 6 TV B
7 7 TV C


8 8 TV D
9 9 TV A
10 10 Radio B
11 11 Radio C
12 12 Radio D
13 13 Radio A
14 14 Radio B
15 15 Radio C
16 16 Computer D
17 17 Computer A
18 18 Computer B
19 19 Computer C
20 20 Computer D

#create contingency table
table <- table(df$product, df$country)

#view contingency table
table

           A B C D
  Computer 1 1 1 2
  Radio    1 2 2 1
  TV       3 2 2 2

#add margins to contingency table
table_w_margins <- addmargins(table)

#view contingency table with margins
table_w_margins

            A  B  C  D Sum
  Computer  1  1  1  2   5
  Radio     1  2  2  1   6
  TV        3  2  2  2   9
  Sum       5  5  5  5  20

#plot contingency table
plot(table)
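
The chi-square test of independence described at the start of this section can be sketched as follows. The counts are hypothetical and chosen to be large enough for the chi-square approximation; they are not taken from the example above. The statistic is computed by hand from the expected counts and then with chisq.test() from base R.

```r
# Hypothetical counts: rows are products, columns are regions
counts <- matrix(c(30, 20, 25,
                   20, 30, 25),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(product = c("TV", "Radio"),
                                 region  = c("North", "South", "West")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Chi-square statistic computed by hand
chi_manual <- sum((counts - expected)^2 / expected)

# The same test using chisq.test()
result <- chisq.test(counts)
result$statistic   # test statistic
result$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
result$p.value     # compare with the chosen significance level
```

If the p-value is below the chosen significance level, the null hypothesis of independence is rejected.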

Example:2
# Create example data frame
data <- data.frame(x1 = c(LETTERS[1:4], "A", "B", "B"),
                   x2 = c(letters[1:3], "b", "c", "c", "c"))

print(data)
  x1 x2
1  A  a
2  B  b
3  C  c
4  D  b
5  A  c
6  B  c
7  B  c

# Create two-way contingency table
my_tab1 <- table(data)

# Print two-way contingency table
my_tab1

# Plot contingency table
plot(my_tab1)


DATA VISUALISATION IN R

The Plot Function

The plot() function is used to plot R objects.


Syntax: plot(x,y,type,main,sub,xlab,ylab,asp,col,..)
x:– The x coordinate of the plot, a single plotting structure, a
function, or an R object
y:– The Y coordinate points in the plot (optional if x coordinate is a
single structure)
type:– ‘p’ for points, ‘l’ for lines, ‘b’ for both, ‘h’ for high-density
vertical lines, etc.
main:– Title of the plot
sub:– Subtitle of the plot
xlab:– Title for the x-axis
ylab:– Title for the y-axis
asp :- Aspect ratio(y/x)
col:– Color of the plot(points, lines, etc.)

1) #To plot mpg (Miles per Gallon) for each car, in row order

plot(mtcars$mpg, xlab = "Number of cars", ylab = "Miles per Gallon",
     col = "red")

2) #To find the relation between hp (Horse Power) and mpg (Miles per Gallon)

plot(mtcars$hp, mtcars$mpg, xlab = "HorsePower", ylab = "Miles per Gallon",
     type = "h", col = "blue")
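
The main, sub and type = "b" arguments from the syntax list are not used in the two examples above. The following is a minimal sketch that uses them, assuming the built-in mtcars data; the name cars_sorted and the title texts are illustrative.

```r
# Sort by horsepower so that the connecting lines run from left to right
cars_sorted <- mtcars[order(mtcars$hp), ]   # 'cars_sorted' is an illustrative name

plot(cars_sorted$hp, cars_sorted$mpg,
     type = "b",                              # points joined by lines
     main = "Fuel economy vs. horsepower",    # title above the plot
     sub  = "Data: built-in mtcars data set", # subtitle below the x-axis
     xlab = "HorsePower", ylab = "Miles per Gallon",
     col  = "darkgreen")
```

Sorting first matters only for the line types ('l' and 'b'); with type = "p" the order of the rows does not change the picture.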


Bar plot
A barplot is a type of data visualization that uses bars to
represent the frequency or distribution of categorical data.
Each bar in a barplot corresponds to a category, and the length
of the bar typically represents the frequency or value of that
category. Barplots are commonly used to compare different
categories or show the distribution of data across categories.

To draw a barplot of hp (one bar per car)
#Horizontal
barplot(mtcars$hp, xlab = "HorsePower", col = "cyan", horiz = TRUE)

#Vertical
barplot(mtcars$hp, ylab = "HorsePower", col = "cyan", horiz = FALSE)
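
The definition above describes bars that represent the frequency of each category. The following is a minimal sketch of that use, assuming the cyl (number of cylinders) column of mtcars is treated as the categorical variable; the name cyl_counts is illustrative.

```r
# Count how many cars fall into each cylinder category
cyl_counts <- table(mtcars$cyl)   # 'cyl_counts' is an illustrative name
print(cyl_counts)

# One bar per category; the bar height is the number of cars in it
barplot(cyl_counts,
        main = "Number of cars by cylinder count",
        xlab = "Number of Cylinders", ylab = "Frequency",
        col = "cyan")
```

Passing a table to barplot() labels each bar with its category automatically, which a plain numeric vector such as mtcars$hp does not.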


Histogram
A histogram is a type of data visualization that displays the
distribution of numerical data by dividing the data into
intervals or bins and showing the frequency of values within
each interval using bars. It's similar to a barplot but is used for
representing continuous data rather than categorical data.

1. To find histogram for mpg (Miles per Gallon)

hist(mtcars$mpg, xlab = "Miles Per Gallon", main = "Histogram for MPG",
     col = "yellow")

2. Simple use of hist()
h <- hist(mtcars$mpg)

3. Frequency Density
h <- hist(mtcars$mpg, breaks = c(10, 18, 24, 30, 35))

4. Probability
hist(mtcars$mpg, probability = TRUE)

5. Color
hist(mtcars$mpg, labels = TRUE, prob = TRUE,
     ylim = c(0, 0.1), xlab = 'Miles Per Gallon',
     main = 'Distribution of Miles Per Gallon',
     col = rainbow(5))
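
The object returned by hist() stores the bins and their heights, which helps explain the difference between a frequency and a frequency density. The following is a minimal sketch, assuming the same mtcars data; the names h_equal, h_unequal and bin_width are illustrative.

```r
# plot = FALSE returns the histogram object without drawing it
h_equal <- hist(mtcars$mpg, plot = FALSE)
h_equal$breaks   # bin edges chosen by hist()
h_equal$counts   # number of cars in each bin

# With unequal bin widths, the bar height is a density, not a count
h_unequal <- hist(mtcars$mpg, breaks = c(10, 18, 24, 30, 35), plot = FALSE)
bin_width <- diff(h_unequal$breaks)

h_unequal$counts                     # cars per bin
h_unequal$density                    # counts / (number of cars * bin width)
sum(h_unequal$density * bin_width)   # the bar areas add up to 1
```

Because the density divides by the bin width, a wide bin with many cars can still be drawn lower than a narrow bin with fewer cars.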


Boxplot
A boxplot, also known as a box-and-whisker plot, is a type of
data visualization that provides a graphical summary of the
distribution of numerical data through quartiles. It consists of
a box that represents the interquartile range (IQR) of the data,
with a line inside the box indicating the median. The
"whiskers" extend from the box to show the range of the data,
excluding outliers, which are displayed as individual points
beyond the whiskers.

#To draw boxplots for disp (Displacement) and hp (Horse Power)

boxplot(mtcars[,3:4])

#Boxplot of MPG by Car Cylinders

boxplot(mpg ~ cyl, data = mtcars, main = "Car Mileage Data",
        xlab = "Number of Cylinders", ylab = "Miles Per Gallon")
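
The numbers behind a boxplot can be inspected with boxplot.stats(). The following is a minimal sketch for the hp column, assuming the default rule that points more than 1.5 box heights beyond the box are drawn as outliers; the names five, box_height, upper_fence and lower_fence are illustrative. Note that boxplot.stats() uses hinges, which are close to but not always identical to the quartiles returned by quantile().

```r
hp <- mtcars$hp

# Five numbers drawn by boxplot(): lower whisker, lower hinge,
# median, upper hinge, upper whisker
five <- boxplot.stats(hp)$stats
five

# Height of the box: distance between the hinges (approximately the IQR)
box_height <- five[4] - five[2]

# Points further than 1.5 box heights beyond the hinges count as outliers
upper_fence <- five[4] + 1.5 * box_height
lower_fence <- five[2] - 1.5 * box_height
hp[hp > upper_fence | hp < lower_fence]

# boxplot.stats() reports the same outliers directly
boxplot.stats(hp)$out
```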


Boxplot using ggplot2


# Create a grouped boxplot using ggplot2

library(ggplot2)
ggplot(data = iris, aes(x = Species, y = Sepal.Length,
                        fill = factor(Species))) +
  geom_boxplot() +
  theme_minimal() +
  labs(fill = "Species")

OUTPUT

# Create a grouped boxplot using base R


boxplot(Sepal.Length ~ Species, data = iris)

OUTPUT


Mosaic Plot

A mosaic plot visualizes categorical data as tiles whose areas are
proportional to the cell counts of a contingency table. The geom_mosaic()
function from the ggmosaic package can produce bar charts, stacked bar
charts, mosaic plots, and double decker plots. The example below uses the
base R function mosaicplot() on a table of iris species against petal width.

mosaicplot(table(iris$Species, iris$Petal.Width))
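
Petal.Width is a numeric column, so the table above has one narrow column per distinct measurement. A mosaic plot is usually easier to read when both variables are categorical. The following is a minimal sketch, assuming the cyl and gear columns of mtcars are treated as categories; the name cyl_gear and the dimension labels are illustrative.

```r
# Cross-tabulate two categorical variables from mtcars
cyl_gear <- table(Cylinders = mtcars$cyl, Gears = mtcars$gear)
cyl_gear

# Each tile's area is proportional to the count in that cell
mosaicplot(cyl_gear, main = "Cylinders vs. gears in mtcars",
           color = TRUE)
```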

OUTPUT


Radial Plot using ggplot

#Run the code (radial.pie() comes from the plotrix package)

library(plotrix)

radii <- c(2, 3, 2, 1, 3, 1, 2, 3, 2)

color <- c("lightgrey", "chartreuse", "lightgrey",
           "darkturquoise", "darkolivegreen3",
           "orangered", "lightgrey", "darkseagreen1",
           "lightgrey")

radial.pie(radii, labels = NA, sector.colors = color,
           show.grid = FALSE, show.grid.labels = FALSE,
           show.radial.grid = TRUE, radial.labels = FALSE,
           clockwise = TRUE, start = 3)

OUTPUT

Recreate circular diagram with ggplot


library(ggplot2)  # Run the code

df <- data.frame(names = c(
  "Enterprise Business Rules",
  "Application Business Rules",
  "Interface Adapters",
  "Frameworks & Drivers"))

ggplot(df, aes(x = factor(1), fill = names)) +
  geom_bar(width = 1) +
  coord_polar() +
  xlab("") + ylab("") +
  theme_void() +
  theme(legend.title = element_blank())

How to plot geom_area in R?

library(ggplot2)
# x axis
time <- as.numeric(rep(seq(1, 7), each = 7))
# y axis
value <- runif(49, 10, 100)
# group, one shape per group
group <- rep(LETTERS[1:7], times = 7)
data <- data.frame(time, value, group)
ggplot(data, aes(x = time, y = value, fill = group)) +
  geom_area()

OUTPUT


Import file in R
• Import csv files in R

#Run the code


Data <- read.csv("/Users/lovesh/Desktop/Lovesh.csv")
Data

OUTPUT

• Import Excel files in R

#Run the code


library(readxl)
Data1 <- read_xlsx("")   # the path to the Excel file goes inside the quotes
Data1

OUTPUT
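
The paths above depend on files on one particular computer. The following is a minimal sketch of the CSV import that runs anywhere, assuming a small made-up data frame written to a temporary file; the names csv_path, sample_df and imported are illustrative.

```r
# Write a small data frame to a temporary CSV file (illustrative data)
csv_path <- tempfile(fileext = ".csv")
sample_df <- data.frame(id = 1:3, city = c("Delhi", "Agra", "Noida"))
write.csv(sample_df, csv_path, row.names = FALSE)

# Read it back; read.csv() treats the first line as column names by default
imported <- read.csv(csv_path)
str(imported)

# Check that the imported values match what was written
identical(imported$id, sample_df$id)
identical(imported$city, sample_df$city)
```

Setting row.names = FALSE in write.csv() keeps R from adding an extra unnamed column of row numbers to the file.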


Pyramid Plot
Population pyramids are often used in demography, public health, and social
sciences to visualize the age and sex distribution of a population. In this tutorial,
we will learn how to create a population pyramid in R using ggplot2.

#Run the code


library(ggplot2)
df <- data.frame(
  # the 8 age groups are recycled so that they cover both sexes
  age_groups = factor(
    c("15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54"),
    levels = c("50-54", "45-49", "40-44", "35-39", "30-34", "25-29", "20-24", "15-19")),
  sex = factor(rep(c("Men", "Women"), each = 8)),
  # negative counts draw the bars for women to the left of zero
  count = c(21, 22, 39, 44, 81, 77, 103, 92, -41, -139, -198, -209, -249, -253, -235, 0)
)
p <- ggplot(df, aes(x = age_groups, y = count, fill = sex)) +
  geom_bar(stat = "identity", width = 0.9) +
  coord_flip() +
  scale_y_continuous(labels = abs) +
  labs(x = "Age Group", y = "Count", fill = "Sex",
       title = "Population Pyramid of Hypertension by Age and Sex") +
  theme_minimal()
p  # print the plot

OUTPUT


Time Series
The ts() function creates a time series object in R. It uses the following
basic syntax:
ts(data, start, end, frequency)
where
• data: A vector or matrix of time series values
• start: The time of the first observation
• end: The time of the last observation
• frequency: The number of observations per unit of time.

#create vector of 20 values


data <- c(6, 7, 7, 7, 8, 5, 8, 9, 4, 9, 12, 14, 14, 15, 18, 24, 20, 15, 24, 26)

#create time series object from vector


ts_data <- ts(data, start=c(2023, 10), frequency=12)

#view time series object


ts_data
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2023                                       6   7   7
2024   7   8   5   8   9   4   9  12  14  14  15  18
2025  24  20  15  24  26

#create line plot of time series data


plot(ts_data)

#display class of ts_data object


class(ts_data)
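
The start, end and frequency stored in a ts object can be read back, and window() extracts a sub-period. The following is a minimal sketch, assuming the same 20 monthly values as above; the name ts_2024 is illustrative.

```r
# Same series as above, rebuilt so this block runs on its own
data <- c(6, 7, 7, 7, 8, 5, 8, 9, 4, 9, 12, 14, 14, 15, 18, 24, 20, 15, 24, 26)
ts_data <- ts(data, start = c(2023, 10), frequency = 12)

class(ts_data)       # "ts"
start(ts_data)       # first observation: October 2023
end(ts_data)         # last observation: May 2025
frequency(ts_data)   # 12 observations per year

# window() extracts a sub-period; here, calendar year 2024
ts_2024 <- window(ts_data, start = c(2024, 1), end = c(2024, 12))
ts_2024
sum(ts_2024)         # total of the 12 monthly values in 2024
```

The end argument of ts() was not needed here because R works out the last period from the start, the frequency and the number of values.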
