
UNIT-I R DATA STRUCTURES

INTRODUCTION TO R LANGUAGE
R is a programming language and open-source software environment for statistical computing
and graphics. It is widely used for data analysis and statistical modelling, as well as for data
visualization, machine learning, and other advanced analytics tasks. R was developed by Ross
Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is used mainly for
data analysis and scientific computing.
R is a powerful language that provides a wide range of statistical and graphical techniques. It is
designed to be highly extensible, with a large and active user community constantly developing
new packages and libraries that add new functionality and features to the language. R also has
a large collection of built-in functions for data manipulation, modelling, and visualization,
making it a highly efficient and flexible tool for data analysis.
R's syntax shares many conventions with other languages used for data analysis, such as Python
and Matlab, which helps users with some programming experience adapt to it. R is an interpreted
language: expressions are evaluated as they are entered rather than compiled in advance, making
it highly interactive and flexible. This also means that R code is generally slower than code
written in compiled languages like C++, but R is optimized for data analysis and visualization,
where speed is not always the most critical factor.
R has a wide range of capabilities, including data cleaning and manipulation, statistical
modeling, machine learning, and data visualization. The language is designed to handle large
datasets and can easily handle data in various formats, including CSV, Excel, SAS, and SPSS. R
also has a rich set of visualization tools that make it easy to create charts, graphs, and other
visualizations that can help users understand their data better.
R is also highly extensible, with a vast collection of packages and libraries that extend the
language's functionality. These packages can be easily downloaded and installed from the
Comprehensive R Archive Network (CRAN) or other online repositories. These packages cover
a wide range of fields, including biology, finance, and social sciences, and can be used to perform
specialized analyses and visualizations.
In addition to its vast library of packages, R also has an active and engaged community of users
who contribute to the language's development and provide support and guidance to other
users. This community is highly collaborative and provides a wealth of resources, including
forums, blogs, and tutorials, to help users learn and use R effectively.
R is also supported by a range of integrated development environments (IDEs) that make it easy
to write and debug code. A popular choice is RStudio, which provides a user-friendly interface
for writing and running R code, along with tools for data visualization and debugging.
Overall, R is a powerful and flexible language that is ideal for data analysis, modeling, and
visualization. Its free and open-source nature, extensive library of packages, and active
community of users make it a highly accessible and popular tool for researchers, data analysts,
and business professionals. With its powerful capabilities and user-friendly syntax, R is an
essential tool for anyone working with data analysis and visualization.
1.1 R DATATYPES
In R, there are several data types and data structures available. Here is a list of commonly used
data types in R:
BASIC DATA TYPES
1. Numeric: Represents numeric values. Decimal numbers are stored as double; whole numbers can
also be stored as integer.
2. Character: Represents textual data, such as strings of characters.
3. Logical: Represents boolean values (TRUE or FALSE).
4. Date: Represents dates (class Date).
5. DateTime: Represents dates and times together (class POSIXct).
6. Factor: Represents categorical data with predefined levels or categories.
7. Complex: Represents complex numbers with real and imaginary parts.
8. Raw: Represents raw binary data.
Note: Date, DateTime, and Factor are classes built on top of the basic types (a Date, for
example, is stored internally as a number), whereas the others are basic (atomic) types.
USAGE:
• Numeric data are numbers that may contain a decimal point. By default, R stores every number,
including a whole number such as 10, as a double.
• Integers are whole numbers (those numbers without a decimal point). To store a value as an
integer, add the suffix L (for example, 10L).
• Logical data take on the value of either TRUE or FALSE. There is also a special logical value,
NA, which represents missing values.
Note: R is case sensitive. TRUE and FALSE must be written in uppercase. If they are written
inside quotation marks (for example, "TRUE"), the value is treated as character data.
• Character data are used to represent string values. You can think of character strings as
something like a word (or multiple words).
• Complex data type in R is for complex numbers or numbers with imaginary values.
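The types described above can be checked directly with class() and typeof(). The following is a minimal sketch; the variable names are chosen only for illustration.

```r
# Inspect the class and internal type of a few values
x_num   <- 10.5                   # numeric (stored as double)
x_whole <- 10                     # still numeric: a plain 10 is a double, not an integer
x_int   <- 10L                    # the L suffix makes an integer
x_chr   <- "TRUE"                 # quoted, so this is character, not logical
x_lgl   <- TRUE
x_cpx   <- 3 + 2i
x_raw   <- charToRaw("A")
x_date  <- as.Date("2024-01-15")

class(x_num)     # "numeric"
class(x_whole)   # "numeric"
class(x_int)     # "integer"
class(x_chr)     # "character"
class(x_lgl)     # "logical"
class(x_cpx)     # "complex"
class(x_raw)     # "raw"
class(x_date)    # "Date"
typeof(x_date)   # "double": a Date is stored as a number of days
```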
1.2 VARIABLE DECLARATION :
• Syntax for variable declaration: `<-` or `=`
# First way to declare a variable: use the `<-`
name_of_variable <- value
# Second way to declare a variable: use the `=`
name_of_variable = value

• Example:
x <- 10
s <- "Akshaya"
Note: The variable name must be on the left of `<-` or `=`. Writing the assignment in reverse
(for example, 10 <- x) produces an error.
1.3 CONSOLE INPUT AND OUTPUT IN R
Console input and output in R refers to the way in which data is entered and displayed in the R
console. The main functions for console input are readline() and scan(); the main functions for
console output are print() and cat().
Reading Input in R
• readline(): This function reads one line of input from the console and returns it as a
character string. Convert the result (for example, with as.numeric()) if a number is needed.
Example 1:
> input_text<-readline("enter the text to display")
enter the text to display
akshaya

> input_text
[1] "akshaya"

Example 2:
# Use the readline() function to get some information from the user
readline()

# Prompt the user to enter a certain type of information


readline(prompt="Enter your name")

• scan():This is used for reading data into the input vector or an input list from the
environment console or file.
Example:
> input_scan <- scan()
1: 23
2: 34
3: 45
4:
Read 3 items
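
Because readline() returns the typed text as a character string, numeric input usually has to be converted before it can be used in calculations. The sketch below assumes a prompt asking for an age; the names age_text and age are illustrative, and the stand-in value "21" is used only when R is not running interactively (for example, under Rscript), where there is no console to read from.

```r
# Read a value from the console and convert it to a number.
# In a non-interactive session a stand-in value is used instead of prompting.
age_text <- if (interactive()) readline(prompt = "Enter your age: ") else "21"

age <- as.numeric(age_text)   # becomes NA (with a warning) if the text is not a number

class(age_text)   # "character"
class(age)        # "numeric"
```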

Output in R
• print(): This function prints the output of a calculation or expression to the console. The
output can be a string, a number, a vector, or a list. The print() function is a generic
function. This means that the function has a lot of different methods for different types
of objects it may need to print. The function takes an object as the argument.
Example1:
> print(input_scan)
[1] 23 34 45
> print(input_text)
[1] "akshaya"

Example 2:
print("abc")
[1] "abc"

• cat() function
The cat() function displays its arguments as plain text. It converts each argument to a
character string, joins them with a space by default, and prints the result without quotation
marks or the [1] index prefix. For example:
> cat("hello", "this","is","Ak",12345,TRUE)
hello this is Ak 12345 TRUE

1.4 R OPERATORS
In R, operators are symbols or characters used to perform various operations on data objects.
Understanding and using operators is crucial for working with R and performing calculations,
comparisons, and other operations. Here are some important operators in R:
1. Arithmetic Operators:
• + Addition
• - Subtraction
• * Multiplication
• / Division
• ^ Exponentiation
• %% Modulo (remainder of division)
• %/% Integer division (quotient of division)
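2. Assignment Operators:
• <- or = Assignment
• <<- Superassignment (assigns to a variable in an enclosing environment; typically used inside
a function to modify a global variable)
• -> Right assignment (the value on the left is assigned to the name on the right)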
3. Comparison Operators:
• == Equal to
• != Not equal to
• < Less than
• > Greater than
• <= Less than or equal to
• >= Greater than or equal to
4. Logical Operators:
• ! Negation (logical NOT)
• & Element-wise AND
• | Element-wise OR
• xor() Exclusive OR (element-wise)
• isTRUE() Test if a value is TRUE
5. Membership Operators:
• %in% Test if elements are members of a vector
• %notin% Not an operator in standard base R; to test that elements are not members of a
vector, negate %in%, as in !(x %in% y)
6. Miscellaneous Operators:
• : Sequence creation (e.g., 1:5 creates a vector from 1 to 5)
• %*% Matrix multiplication (for two vectors of equal length, this gives the inner product)
• %o% Outer product (the Kronecker product is a separate operator, %x%)
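
A few of the operators listed above are illustrated in the following minimal sketch. The values and variable names are arbitrary and chosen only for demonstration.

```r
a <- 17
b <- 5
a %% b                   # 2   (remainder)
a %/% b                  # 3   (integer quotient)
a ^ 2                    # 289

v <- c(1, 4, 7)
v > 3                    # FALSE  TRUE  TRUE  (element-wise comparison)
v > 3 & v < 7            # FALSE  TRUE FALSE  (element-wise AND)
v %in% c(4, 7, 10)       # FALSE  TRUE  TRUE
!(v %in% c(4, 7, 10))    #  TRUE FALSE FALSE  ("not in")

m <- matrix(1:4, nrow = 2)
m_squared <- m %*% m         # matrix product
outer_prod <- 1:3 %o% 1:2    # outer product: a 3 x 2 matrix
dim(outer_prod)              # 3 2
```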
1.5 R Data Structures:
1. Vector: Represents an ordered collection of elements of the same data type. Vectors can
be of various types, including numeric, character, logical, etc.
2. Matrix: Represents a two-dimensional array with elements of the same data type.
3. Array: Represents multi-dimensional data structures with elements of the same data
type.
4. List: Represents a collection of objects (vectors, matrices, data frames, other lists, etc.)
that can differ in data type, length, and structure.
5. Data Frame: Represents a tabular structure with rows and columns, where each column
can have a different data type. It is similar to a table in a database or a spreadsheet.
6. Factor: A special type of vector used for categorical data, where values are represented
as levels.
7. Data Table: An extension of the data frame, provided by the add-on data.table package,
offering fast and efficient operations on large datasets.
To determine the datatype of a variable or an object, typeof() and class() can be used.
Example:
> a<-10
> typeof(a)
[1] "double"

> s=TRUE
> class(s)
[1] "logical"

In R, typeof() and class() both describe the type of an object, but at different levels.
typeof() reports the internal storage type of the object (for example "double", "integer",
"character", or "list"). class() reports the class that R uses to decide how the object behaves,
for example when it is printed or summarized. This is either the object's class attribute (such
as "factor" or "data.frame") or, when no class attribute is set, an implicit class derived from
its type and dimensions (such as "numeric" or "matrix").
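The difference between typeof() and class() is easiest to see on objects whose class is not the same as their storage type. The following sketch uses three small illustrative objects (the names m, f, and d are arbitrary).

```r
m <- matrix(1:6, nrow = 2)
f <- factor(c("low", "high", "low"))
d <- data.frame(id = 1:2)

typeof(m)   # "integer"
class(m)    # "matrix" "array" in R 4.0 and later (just "matrix" in older versions)

typeof(f)   # "integer": a factor stores integer codes for its levels
class(f)    # "factor"

typeof(d)   # "list": a data frame is a list of equal-length columns
class(d)    # "data.frame"
```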
Type        Logical test
Character   is.character()
Numeric     is.numeric()
Logical     is.logical()
Factor      is.factor()
Complex     is.complex()

COERCION OF AN OBJECT
Coercion includes type conversion. Type conversion means change of one type of data into
another type of data.

Type        Coercing
Character   as.character()
Numeric     as.numeric()
Logical     as.logical()
Factor      as.factor()
Complex     as.complex()
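
The coercion functions in the table can be tried out as follows. This is a minimal sketch with made-up values; it also shows two common pitfalls, namely text that cannot be converted and factors converted directly to numbers.

```r
n <- as.numeric("3.14")                      # 3.14
s <- as.character(42)                        # "42"
l <- as.logical(c("TRUE", "false", "yes"))   # TRUE FALSE NA ("yes" is not recognised)

# Text that is not a number becomes NA (R also issues a warning, suppressed here)
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)                                   # TRUE

# A factor stores integer codes, so convert through character to get the labels as numbers
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                      # 1 2 1  (internal level codes)
values <- as.numeric(as.character(f))        # 10 20 10
```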
1.5.1 VECTORS
Vectors are one of the most fundamental data structures in R. A vector is a sequence of
elements of the same data type, such as numeric, character, or logical. Vectors can be used to
store and manipulate data in R, and are an essential part of many data analysis tasks.
Note: Vectors are homogeneous, i.e. every element has the same data type. If values of different
types are combined, R converts them to a single common type.
Creating a Vector in R
There are several ways to create a vector in R, including using the c() function, the : operator,
the seq() function, the rep() function, and the vector() function.
• Using c() function
To create a vector in R, you can use the c() function, which stands for "combine." This function
combines a sequence of values into a vector.
# create a numeric vector
x <- c(1, 2, 3, 4, 5)
print(x)

# create a character vector


y <- c("apple", "banana", "orange")
print(y)

# create a logical vector
b <- c(TRUE, FALSE)
b
[1]  TRUE FALSE
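
Because a vector holds a single data type, c() silently converts mixed inputs to a common type. The following short sketch shows the effect; the names mixed1 and mixed2 are illustrative.

```r
mixed1 <- c(1, TRUE, FALSE)   # logical values become numbers: 1 1 0
mixed2 <- c(1, "a", TRUE)     # everything becomes character: "1" "a" "TRUE"

class(mixed1)   # "numeric"
class(mixed2)   # "character"
```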

• Using the : operator


The : operator is used to create a sequence of values. A numeric vector can be created
by specifying the starting and ending values separated by the : operator. Example:
> x <- 1:5
> x
[1] 1 2 3 4 5

• Using the vector() function:


The vector() function is used to create a vector with a specified length and data type.
# Creating a numeric vector using the vector() function
x <- vector(mode = "numeric", length = 5)

• Using the seq() function:


The seq() function is used to generate a sequence of values with a specified starting
point, ending point, and increment.
Example:
> x <- seq(from = 1, to = 10, by = 2)
> x
[1] 1 3 5 7 9

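• Using assign() function

The assign() function creates a variable whose name is given as a character string and gives
it the specified value.
Example:
> assign("v",c(2,3,4,5,6))
> v
[1] 2 3 4 5 6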
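
The rep() function, mentioned in the list of creation methods above, builds a vector by repeating values. A minimal sketch with illustrative names:

```r
zeros   <- rep(0, times = 3)             # 0 0 0
cycled  <- rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"  (whole vector repeated)
doubled <- rep(c("a", "b"), each = 2)    # "a" "a" "b" "b"  (each element repeated)
```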

ACCESSING ELEMENTS IN A VECTOR


• Access elements of a vector with [ ]
Example:
> departments=c("cse","ai","aiml","aids","it")
> departments[1]
[1] "cse"
> departments[5]
[1] "it"

• Access slices of a vector


> departments[1:3]
[1] "cse" "ai" "aiml"
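
Besides single positions and slices, the [ ] operator also accepts negative indices, vectors of positions, and logical conditions. The sketch below reuses the departments vector from the example above; note that R indexing starts at 1.

```r
departments <- c("cse", "ai", "aiml", "aids", "it")

no_first   <- departments[-1]                       # everything except the first element
first_last <- departments[c(1, 5)]                  # "cse" "it"
long_names <- departments[nchar(departments) > 2]   # "cse" "aiml" "aids"
last_one   <- departments[length(departments)]      # "it"
```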

SEQUENCES
In R, a sequence is a series of numbers generated according to a certain pattern or rule. There
are several ways to generate sequences in R, some of which are listed below:
Example 1: Calling seq() with a single number n generates the integers from 1 to n
b<-seq(10)
b
[1]  1  2  3  4  5  6  7  8  9 10

Example 2: Using the seq() function


The seq() function can be used to generate a sequence of numbers from a starting point to an
ending point, with a specified increment. The syntax of the seq() function is as follows:
seq(from, to, by)

g<-seq(from=0, to=10, by=1.5)


g
[1] 0.0 1.5 3.0 4.5 6.0 7.5 9.0

seq(2,10,2)
[1]  2  4  6  8 10

Example 3 : Using the seq_len() function


The seq_len() function can be used to generate a sequence of integers from 1 to a specified
length. Its argument is a single non-negative number (the length), not a vector. For example, to
generate a sequence of numbers from 1 to 5, we can use the following code:
seq_len(5)
[1] 1 2 3 4 5
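
Two related variations are worth knowing: seq() can take a number of points instead of an increment, and seq_along() generates the index sequence of an existing vector. The following sketch uses illustrative names and values.

```r
# Five equally spaced points between 0 and 1
points <- seq(0, 1, length.out = 5)        # 0.00 0.25 0.50 0.75 1.00

# One index per element of an existing vector
idx <- seq_along(c("a", "b", "c"))         # 1 2 3

# seq_along() is safe for empty vectors, whereas 1:length(x) is not
empty    <- character(0)
safe_idx <- seq_along(empty)               # integer(0): a loop over it runs zero times
odd_idx  <- 1:length(empty)                # 1 0: usually not what was intended
```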

1.5.2 LISTS
A list is a flexible data structure that can hold elements of different types, such as numbers,
characters, vectors, matrices, and even other lists.
Creating a List
A list can be created using the list() function.
x <- list("a" = 2.5, "b" = TRUE, "c" = 1:3)
x
Output
$a
[1] 2.5
$b
[1] TRUE
$c
[1] 1 2 3

Obtaining the structure of a List


The structure of the list can be accessed with the str() function
Example
str(x)
Output
List of 3
 $ a: num 2.5
 $ b: logi TRUE
 $ c: int [1:3] 1 2 3

Accessing List Items

Lists can be accessed in similar fashion to vectors. Integer, logical or character vectors can be
used for indexing.
x <- list(name = "John", age = 19, speaks = c("English", "French"))

# Access Elements By Name


x$name
x$age
x$speaks

# Access Elements By Integer Index


x[c(1, 2)]
x[-2]
# Access Elements By Logical Index
x[c(TRUE, FALSE, FALSE)]

# Access Elements By Character Index


x[c("age", "speaks")]
Output
[1] "John"
[1] 19
[1] "English" "French"

$name
[1] "John"

$age
[1] 19

$name
[1] "John"

$speaks
[1] "English" "French"

$name
[1] "John"

$age
[1] 19

$speaks
[1] "English" "French"
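
In the outputs above, single brackets return a smaller list, while $ returns the element itself. Double brackets [[ ]] behave like $ and also work with a position or a name held in a variable. The sketch below reuses the list from the example; the added city element is purely illustrative.

```r
x <- list(name = "John", age = 19, speaks = c("English", "French"))

sub_list <- x["age"]       # a list of length 1 that contains age
value    <- x[["age"]]     # the number 19 itself
class(sub_list)            # "list"
class(value)               # "numeric"

second_language <- x[["speaks"]][2]   # "French"

# Elements can be modified or added by assignment
x$age  <- 20
x$city <- "Chennai"        # hypothetical new element
length(x)                  # 4
```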

1.5.3 MATRIX

• Matrix is a two dimensional data structure in R programming.


• Matrix is similar to vector but additionally contains the dimension attribute. All
attributes of an object can be checked with the attributes() function (dimension can be
checked directly with the dim() function).

Syntax:
matrix(data, nrow, ncol, byrow = FALSE)
Arguments:
• data: The collection of elements that R will arrange into the rows and columns of the
matrix
• nrow: Number of rows
• ncol: Number of columns
• byrow: Controls the order in which the matrix is filled. With `byrow = FALSE` (the default),
the values fill the matrix column by column, i.e. top to bottom. With `byrow = TRUE`, the
values fill it row by row, i.e. left to right.

Create A Matrix In R
A matrix can be created using the matrix() function. Dimension of the matrix can be defined by
passing appropriate values for arguments nrow and ncol. Providing value for both dimensions
is not necessary. If one of the dimensions is provided, the other is inferred from the length of
the data.
matrix(1:9, nrow = 3, ncol = 3)
Output
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
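
The effect of the byrow argument, and the way a missing dimension is inferred from the length of the data, can be seen in this minimal sketch (the names by_col and by_row are illustrative).

```r
# Only nrow is given; ncol is inferred as 6 / 2 = 3
by_col <- matrix(1:6, nrow = 2)                 # filled column by column (default)
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled row by row

dim(by_col)     # 2 3
by_col[1, ]     # 1 3 5
by_row[1, ]     # 1 2 3
by_row[2, 3]    # 6  (row 2, column 3)
```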
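Access column names and rownames
To access column and row names of a matrix, we can use colnames() and rownames()
respectively. In the lines below, x is assumed to be a 3 x 3 matrix such as the one created
above; both functions return NULL until names have been assigned.
colnames(x)
rownames(x)
Change column names
colnames(x) <- c("C1","C2","C3")
Change row names
rownames(x) <- c("R1","R2","R3")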
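Example:
> mat <- matrix(2:5, nrow = 2)   # 2 x 2 matrix inferred from the output shown below
> rownames(mat)<-c("a","b")
> colnames(mat)<-c("c","d")
> mat
  c d
a 2 4
b 3 5
> print(mat,byrow=TRUE)
  c d
a 2 4
b 3 5
Note: byrow is an argument of matrix(), not of print(), so it does not change how an existing
matrix is displayed.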

Using cbind() and rbind()


• The rbind() function is used to perform row binding, which combines multiple vectors as rows.
Example:
#create three vectors for fruits with 2 elements each.
> apples=c(34,45)
> mangoes=c(14,35)
> guava=c(12,34)
>
> #perform row bind on these three vectors
> print(rbind(apples,mangoes,guava))
        [,1] [,2]
apples    34   45
mangoes   14   35
guava     12   34

• The cbind() function is used to perform the column binding that binds the data by
column.
Example:
> apples=c(34,45)
> mangoes=c(14,35)
> guava=c(12,34)
> print(cbind(apples,mangoes,guava))
     apples mangoes guava
[1,]     34      14    12
[2,]     45      35    34

1.5.4 FACTORS
In statistics and data analysis, variables can be categorized into different types based on their nature
and the kind of values they represent. The common types of variables are:
1. Numerical Variables:
• Continuous Variables: These variables can take any value within a range. Examples
include age, height, temperature, and weight.
• Discrete Variables: These variables can only take specific, separate values. Examples
include the number of siblings, the count of items, or the number of people in a
household.
2. Categorical Variables:
• Nominal Variables: These variables represent categories without any inherent order or
ranking. Examples include gender, eye color, or marital status.
• Ordinal Variables: These variables represent categories with a specific order or ranking.
Examples include educational levels (e.g., high school, bachelor's degree, master's
degree) or Likert scale ratings (e.g., strongly agree, agree, neutral, disagree, strongly
disagree).
3. Binary Variables
4. Time Variables
• Date Variables
• Time Variables
5. Text Variables

Factors
A factor in R is a data structure used to represent a vector as categorical data. Therefore, the factor
object takes a bounded number of different values called levels. Factors are very useful when
working with character columns of data frames, for creating barplots and creating statistical
summaries for categorical variables.
• In R, factors are used to represent categorical variables.
• Categorical variables take on a limited set of distinct values or levels.
• Factors in R provide a way to encode and work with categorical variables efficiently.

Levels
1. Levels are the unique values or categories that define the possible values for a factor variable.
2. Order: Levels can have a specific order or be unordered, depending on whether the factor is
defined as an ordered or unordered factor.
3. Determining Levels: When creating a factor, the levels can be specified explicitly using the levels
parameter in the factor() function. If the levels are not specified, R automatically determines the
levels based on the unique values present in the data.
4. Accessing Levels: To retrieve the levels of a factor, you can use the levels() function. It returns a
character vector containing the distinct categories or levels of the factor.
5. Modifying Levels: You can modify the levels of a factor using the levels() function with
assignment. By assigning a new character vector to levels(factor_name), you can change the
levels of the factor.
6. Ordered Factors: If a factor has an inherent order among its levels, you can create an ordered
factor using the factor() function with the ordered parameter set to TRUE. Ordered factors
preserve the order of levels for various operations, such as sorting and plotting.
7. Missing Values: Missing values in factors are represented by the NA value. By default
(exclude = NA), NA is not counted as a level. To treat NA as a separate level, change the
exclude parameter of the factor() function (for example, exclude = NULL).
8. Reordering Levels: To change the order of levels in a factor, you can use the factor() function
with the levels parameter set to a new character vector representing the desired order of levels.
Syntax
factor(x = character(),          # Input vector data
       levels,                   # Input of unique x values (optional)
       labels = levels,          # Output labels for the levels (optional)
       exclude = NA,             # Values to be excluded from levels
       ordered = is.ordered(x),  # Whether the input levels are ordered as given or not
       nmax = NA)                # Maximum number of levels

Example:
marks<-c("PASS","FAIL","PASS","PASS","FAIL","PASS","FAIL","PASS","PASS")
marks
marksf<-factor(marks,levels = c("PASS","FAIL"))
marksf

marksfac<-factor(marks,levels=c("PASS","FAIL"),ordered = TRUE)
marksfac

is.ordered(marksf)
is.ordered(marksfac)

OUTPUT
> marks<-c("PASS","FAIL","PASS","PASS","FAIL","PASS","FAIL","PASS","PASS")
> marks
[1] "PASS" "FAIL" "PASS" "PASS" "FAIL" "PASS" "FAIL" "PASS" "PASS"

> marksf<-factor(marks,levels = c("PASS","FAIL"))


> marksf
[1] PASS FAIL PASS PASS FAIL PASS FAIL PASS PASS
Levels: PASS FAIL

> marksfac<-factor(marks,levels=c("PASS","FAIL"),ordered = TRUE)


> marksfac
[1] PASS FAIL PASS PASS FAIL PASS FAIL PASS PASS
Levels: PASS < FAIL

> is.ordered(marksf)
[1] FALSE
> is.ordered(marksfac)
[1] TRUE
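
The points above about accessing levels, modifying levels, and missing values can be tried out with a shorter version of the marks vector. This is a minimal sketch; the data and the replacement labels "P" and "F" are made up for illustration.

```r
marks  <- c("PASS", "FAIL", NA, "PASS")
marksf <- factor(marks, levels = c("PASS", "FAIL"))

levels(marksf)        # "PASS" "FAIL": NA is not a level by default
table(marksf)         # counts per level: PASS 2, FAIL 1
as.integer(marksf)    # 1 2 NA 1: the integer codes behind the levels

# Keep NA as its own level by turning off the default exclusion
with_na <- factor(marks, exclude = NULL)
nlevels(with_na)      # 3

# Rename the levels by assigning to levels()
levels(marksf) <- c("P", "F")
as.character(marksf)  # "P" "F" NA "P"
```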

1.6 R FUNCTIONS
A function in R is an object containing multiple interrelated statements that are run together in
a predefined order every time the function is called. Functions in R can be built-in or created by
the user (user-defined). Built-in functions are pre-defined and perform common tasks. The main
purposes of creating a user-defined function are to avoid repeating the same block of code for a
task that is performed frequently in a project, to prevent the hard-to-debug errors that
copy-paste operations tend to introduce, and to make the code more readable.
R functions are used to perform tasks in a modular approach. They can be called and run from
any part of a program. The basic syntax for calling a function is
function_name(argument_name_1 = input_1, ..., argument_name_n = input_n)

Built-in Functions in R
There are plenty of helpful built-in functions in R used for various purposes. Some of the most
popular ones are:

• min(), max(), mean(), median() – return the minimum / maximum / mean / median
value of a numeric vector, correspondingly
• sum() – returns the sum of a numeric vector
• range() – returns the minimum and maximum values of a numeric vector
• abs() – returns the absolute value of a number
• str() – shows the structure of an R object
• print() – displays an R object on the console
• ncol() – returns the number of columns of a matrix or a dataframe
• length() – returns the number of items in an R object (a vector, a list, etc.)
• nchar() – returns the number of characters in a character object
• sort() – sorts a vector in ascending or descending (decreasing=TRUE) order
• exists() – returns TRUE or FALSE depending on whether or not a variable is defined in
the R environment
Let's see some of the above functions in action:

vector <- c(3, 5, 2, 3, 1, 4)


print(min(vector))
print(mean(vector))
print(median(vector))
print(sum(vector))
print(range(vector))
print(str(vector))
print(length(vector))
print(sort(vector, decreasing=TRUE))
print(exists('vector')) ## note the quotation marks
[1] 1
[1] 3
[1] 3
[1] 18
[1] 1 5
num [1:6] 3 5 2 3 1 4
NULL
[1] 6
[1] 5 4 3 3 2 1
[1] TRUE

Creating a Function in R
While applying built-in functions facilitates many common tasks, often we need to create our
own function to automate the performance of a particular task. To declare a user-defined
function in R, we use the keyword function. The syntax is as follows:
function_name <- function(parameters){
function body
}
Above, the main components of an R function are: function name, function parameters,
and function body. Let's take a look at each of them separately.

Function Name
This is the name of the function object that will be stored in the R environment after the
function definition and used for calling that function. It should be concise but clear and
meaningful so that the user who reads our code can easily understand what exactly this
function does.

Function Parameters
Function parameters, sometimes called formal arguments, are the variables placed inside the
parentheses of the function definition and separated by commas. They are set to actual values
(called arguments) each time we call the function. For example:
circumference <- function(r){
2*pi*r
}
print(circumference(2))
[1] 12.56637

#Addition of two numbers


add_numbers <- function(x, y) {
result <- x + y
return(result)
}

#Print the squares of the numbers from 1 to a
sqf<-function(a){
for(i in 1:a)
{
x<-i^2
print(x)
}
}

#Calculating the price of a commodity


bill<-function(itemprice,qty){
price<-itemprice*qty
return(price)
}

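# Create a function with two parameters and no default values.
new.function <- function(a, b) {
print(a^2)
print(a)
print(b)
}
# Evaluate the function without supplying the second argument.
# R evaluates arguments lazily: 36 and 6 are printed, and an error is raised
# only when print(b) needs the missing argument b.
new.function(6)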

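# Create a function in which the parameter a has a default value.
new.function <- function(a=5, b) {
print(a^2)
print(a)
print(b)
}
# Evaluate the function with one positional argument. The value 6 is matched to a
# (overriding the default 5), so b is still missing and print(b) raises an error.
new.function(6)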
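
Default values and named arguments let a function be called without listing every argument. The following is a minimal sketch; the function bill_total and its tax_rate parameter are hypothetical and only extend the idea of the bill() example above.

```r
# qty and tax_rate have defaults, so only itemprice is required
bill_total <- function(itemprice, qty = 1, tax_rate = 0) {
  subtotal <- itemprice * qty
  subtotal * (1 + tax_rate)   # the last evaluated expression is returned
}

bill_total(50)                        # 50:  defaults used for qty and tax_rate
bill_total(50, 3)                     # 150: arguments matched by position
bill_total(qty = 2, itemprice = 10)   # 20:  named arguments can be given in any order
bill_total(100, tax_rate = 0.05)      # 105: qty keeps its default
```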

1.7 IMPORTING AND EXPORTING DATA IN R


CSV (Comma-Separated Values) files are a common and widely used format for storing tabular
data. In R, you can easily import and export CSV files using various functions. Here's how you
can work with CSV files in R:
Importing CSV Files:
1. read.csv() function: You can use the read.csv() function to import data from a CSV file.
By default, it assumes that the values in the CSV file are separated by commas.
# Read data from a CSV file
data <- read.csv("data.csv")

2. read.table() function: If you have a CSV file with a different delimiter (e.g., tab or
semicolon), you can use the read.table() function and specify the delimiter using the sep
argument.
# Read data from a CSV file with a tab delimiter
data <- read.table("data.csv", sep = "\t", header = TRUE)

Exporting Data to CSV:


1. write.csv() function: You can use the write.csv() function to export a data frame to a
CSV file. By default, it uses commas as delimiters.
# Write data to a CSV file
write.csv(data, "output.csv", row.names = FALSE)
2. write.table() function: If you need more control over the delimiter or other formatting
options, you can use the write.table() function.
# Write data to a CSV file with a tab delimiter
write.table(data, "output.csv", sep = "\t", row.names = FALSE)
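
The import and export functions can be checked with a small round trip: write a data frame to a file and read it back. The sketch below is illustrative only; the scores data frame is made up, and a temporary file is used so that no existing file is overwritten.

```r
# A small data frame to export (hypothetical data)
scores <- data.frame(name = c("Asha", "Ravi"), marks = c(82, 91))

# Write to a temporary CSV file, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)
scores_back <- read.csv(csv_path)

str(scores_back)          # 2 observations of 2 variables
names(scores_back)        # "name" "marks"

unlink(csv_path)          # remove the temporary file
```

Without row.names = FALSE, write.csv() adds an extra column of row numbers, which then appears as an unnamed first column when the file is read again.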

1.8 UNDERSTANDING DATA IN R


• Statistical functions are crucial for analyzing and interpreting data.
• R is a powerful programming language and environment for statistical computing and
graphics.
• Statistical functions in R help us uncover patterns, trends, and relationships within data.
• R offers a wide range of statistical functions for data analysis and visualization.
• These functions cover various aspects, such as descriptive statistics, data distribution,
hypothesis testing, correlation, regression, and more.
• They enable data scientists to extract insights and make informed decisions.

Max -- Min
• min() and max() functions can be used to find the lowest or highest value in a dataset
• Max and min can also be used to detect outliers. An outlier is a data point that differs
markedly from the rest of the observations.
> x<-c(2,3,4,5,6,7)
> min(x)
[1] 2
> max(x)
[1] 7

Measuring the central tendency – Mean And Median


Measures of central tendency are a class of statistics used to identify a value that falls in the
middle of a set of data.
Mean of a Dataset
• The mean, or average, of a dataset is calculated by adding all the values in the dataset
and then dividing by the number of values in the set.
• In R, the mean of a vector is calculated using the mean() function.
• The function accepts a vector as input, and returns the average as a numeric.
a <- c(3,4,5,6)
mean(a)
[1] 4.5

Median of a Dataset
• The median of a dataset is the value that, assuming the dataset is ordered from smallest
to largest, falls in the middle.

• If there is an even number of values in a dataset, the median is the average of the two
middle values.
• In R, the median of a vector is calculated using the median() function.
• The function accepts a vector as an input.
• If there are an odd number of values in the vector, the function returns the middle value.
• If there are an even number of values in the vector, the function returns the average of
the two middle values.
b <- c(3,4,5,6,12)
median(b)
[1] 5
Mode
• The mode is the value that has the highest frequency in the given data set.
• The data set may have no mode if the frequency of all data points is the same.
• A data set can have more than one mode if two or more values share the same highest
frequency.
• R has no built-in function for the statistical mode; the built-in mode() function returns the
storage mode of an object instead.
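
One possible way to compute the statistical mode is to count the values with table() and keep those with the highest count. The helper stat_mode() below is a hypothetical function written for this sketch, not part of R; it returns the result as character strings because it reads the values from the names of the table.

```r
# Return the most frequent value(s) of a vector, as character strings
stat_mode <- function(v) {
  counts <- table(v)                       # frequency of each distinct value
  names(counts)[counts == max(counts)]     # keep the value(s) with the highest frequency
}

stat_mode(c(2, 3, 3, 5, 5, 5))             # "5"
stat_mode(c("a", "b", "a", "b"))           # "a" "b": two modes
```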

Variance
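Variance measures how far the values are spread around the mean. It is the sum of the squared
differences between each value and the mean, divided by n - 1 (this is the sample variance,
which is what the var() function computes). The mathematical formula for variance is as follows,
Variance = Σ (x - mean)² / (n - 1)
In the example below, usedcars is assumed to be a data frame with a numeric price column; it is
not defined in these notes.
var(usedcars$price)
[1] 9749892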

Standard Deviation
• Standard Deviation is the square root of variance.
• It is a measure of the extent to which data varies from the mean.
• The mathematical formula for calculating standard deviation is as follows,
Standard Deviation = √variance
sd(x)
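
The relationship between variance and standard deviation can be verified by hand on a small vector. The sketch below uses the sample variance (division by n - 1), which is what var() computes; the name manual_var is illustrative.

```r
x <- c(2, 3, 4, 5, 6, 7)
n <- length(x)

# Sample variance computed step by step
manual_var <- sum((x - mean(x))^2) / (n - 1)
manual_var        # 3.5
var(x)            # 3.5

# Standard deviation is the square root of the variance
sqrt(var(x))      # about 1.870829
sd(x)             # about 1.870829
```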
Covariance
• Covariance is a statistical term used to measure the direction of the linear
relationship between the data vectors.
• In R programming, covariance can be measured using the cov() function.
• Syntax:
cov(x, y, method)

Correlation
• cor() function in R programming measures the correlation coefficient value.
• Correlation is a relationship term in statistics that uses the covariance method to
measure how strongly the vectors are related.
• Syntax:
cor(x, y, method)
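
As a minimal sketch of cov() and cor(), consider two made-up vectors, hours studied and marks obtained. The data are invented for illustration; method may be "pearson" (the default), "kendall", or "spearman".

```r
hours <- c(1, 2, 3, 4, 5)
marks <- c(52, 58, 61, 70, 74)

cov(hours, marks)                        # 14: positive, so the two rise together
cor(hours, marks)                        # about 0.99: a strong positive linear relationship
cor(hours, marks, method = "spearman")   # 1: the ranks agree perfectly
cor(hours, -marks)                       # about -0.99: reversing one vector flips the sign
```

Covariance depends on the units of the data, whereas correlation always lies between -1 and 1, which makes it easier to compare across datasets.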

Percentile
• A percentile is a statistical measure used to express the relative standing of a particular
value within a dataset.
• It indicates the percentage of values that are equal to or below a given value.
• In other words, it tells you how a specific data point compares to the rest of the data in
terms of its position.
• For example, if you scored in the 90th percentile on a test, it means that you scored
as well as or better than 90% of the people who took the same test.
• In a similar manner, if a certain value is at the 75th percentile in a dataset, it means that
75% of the values in the dataset are equal to or lower than that particular value.
• Percentiles are often used in various fields to analyze and interpret data, such as in
education (test scores), health (height and weight percentiles for children), finance
(income distribution), and many others.
• They provide a way to understand where a specific value falls within the distribution of
data and can help identify outliers or trends.
• In R, quartiles can be calculated using the quantile() function.
• Quartiles divide a dataset into four equal parts, with each quartile containing 25% of the
data.
• The first quartile (Q1) is the 25th percentile, the second quartile (Q2) is the
median (50th percentile), and the third quartile (Q3) is the 75th percentile.
# x <- c(2,3,4,5,6,7), as defined in the Max -- Min example
quantile(x)
  0%  25%  50%  75% 100%
2.00 3.25 4.50 5.75 7.00

quantile(usedcars$price)
0% 25% 50% 75% 100%
3800.0 10995.0 13591.5 14904.5 21992.0

• Interquartile Range (IQR) -- The difference between Q3 and Q1 is known as the
Interquartile Range (IQR), and it can be calculated with the IQR() function:
IQR(usedcars$price)
[1] 3909.5
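
The quantile() function can also return any specific percentile through its probs argument, and the IQR can be checked against the quartiles. The following sketch uses a small vector rather than the usedcars data, which is not defined in these notes.

```r
x <- c(2, 3, 4, 5, 6, 7)

# A single percentile: the 90th
p90 <- quantile(x, probs = 0.9)             # 6.5

# First and third quartiles, and their difference
q <- quantile(x, probs = c(0.25, 0.75))     # 3.25 and 5.75
iqr_manual <- unname(q[2] - q[1])           # 2.5
IQR(x)                                      # 2.5
```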

Summary()
The summary() function in R can be used to quickly summarize the values in a vector, data
frame, regression model, or ANOVA model in R.
Basic Syntax: summary(data)
#define vector
x <- c(3, 4, 4, 5, 7, 8, 9, 12, 13, 13, 15, 19, 21)
#summarize values in vector
summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.00 5.00 9.00 10.23 13.00 21.00
where
• Min: The minimum value
• 1st Qu: The value of the 1st quartile (25th percentile)
• Median: The median value
• Mean: The mean (average) value
• 3rd Qu: The value of the 3rd quartile (75th percentile)
• Max: The maximum value

Statistical Summary Functions with UsedCars Dataset


1.9 Summarizing Categorical Variables
In R, the `table()`, `prop.table()`, and `round()` functions are commonly used for data
manipulation and analysis. Let's explore each of these functions with examples:

1. `table()` function:
The `table()` function is used to create a contingency table, which counts the occurrences of
different categories or factors within a dataset. It is particularly useful for summarizing
categorical data.
Example:
Suppose you have a vector of categorical data representing the types of fruits in a basket:

fruits <- c("apple", "banana", "apple", "orange", "banana", "apple")

You can create a contingency table to count the occurrences of each fruit type:
fruit_table <- table(fruits)
print(fruit_table)
Output:
fruits
 apple banana orange
     3      2      1

This table shows that there are 3 apples, 2 bananas, and 1 orange in the dataset.
2. `prop.table()` function:
The `prop.table()` function converts the counts in a contingency table into proportions. By
default the proportions are taken over the grand total; for a two-way table, the `margin`
argument gives row-wise (`margin = 1`) or column-wise (`margin = 2`) proportions instead.
Example:
Using the `fruit_table` created in the previous example, you can calculate the proportions of
each fruit type:
fruit_prop <- prop.table(fruit_table)
print(fruit_prop)

Output:
fruits
apple banana orange
0.5000000 0.3333333 0.1666667

This table shows the proportions of each fruit type in the dataset.
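
For a table with two variables, proportions can be taken within rows or within columns instead of over the grand total. The sketch below assumes a second, made-up vector `sizes` that records a size for each fruit; it is not part of the original example.

```r
fruits <- c("apple", "banana", "apple", "orange", "banana", "apple")
# Hypothetical second variable, one size per fruit
sizes <- c("small", "large", "large", "small", "small", "large")

# Two-way contingency table: fruit type by size
fruit_size <- table(fruits, sizes)

# margin = 1: each row sums to 1 (share of sizes within each fruit)
row_prop <- prop.table(fruit_size, margin = 1)
# margin = 2: each column sums to 1 (share of fruits within each size)
col_prop <- prop.table(fruit_size, margin = 2)

print(row_prop)
print(col_prop)
```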

3. `round()` function:
The `round()` function is used to round numerical values to a specified number of decimal
places.
Example:
Suppose you have a numeric vector with decimal values:
values <- c(3.14159265, 2.71828183, 1.61803399)

You can round these values to 2 decimal places:

rounded_values <- round(values, 2)

print(rounded_values)

Output:

[1] 3.14 2.72 1.62

The `round()` function has rounded the values to 2 decimal places.
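
The three functions are often chained to report category percentages in one step. A minimal sketch, reusing the fruits vector from the first example (`fruit_pct` is a name introduced here for illustration):

```r
fruits <- c("apple", "banana", "apple", "orange", "banana", "apple")

# Count each category, convert to proportions, scale to percent,
# then round to 1 decimal place
fruit_pct <- round(prop.table(table(fruits)) * 100, 1)

print(fruit_pct)
```

Here apple accounts for 50%, banana for 33.3%, and orange for 16.7% of the basket.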

1.10 DATA VISUALIZATION IN R

Data visualization is a crucial part of the data analysis process, as it allows you to explore and
communicate insights from your data effectively. R is a powerful programming language for
statistical computing and graphics, and it offers a wide range of libraries and tools for creating
various types of visualizations.

Bar Plots
A bar plot is one of the most efficient ways of representing data. It can be used to summarize
large data in visual form. Bar graphs can also represent data that changes over time, which
helps us to visualize trends. The barplot() function creates a bar plot with vertical or
horizontal bars.

Usage
## Default S3 method:
barplot(height, width = 1, space = NULL,
names.arg = NULL, legend.text = NULL, beside = FALSE,
horiz = FALSE, density = NULL, angle = 45,
col = NULL, border = par("fg"),
main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
xlim = NULL, ylim = NULL, xpd = TRUE, log = "",
axes = TRUE, axisnames = TRUE,
cex.axis = par("cex.axis"), cex.names = par("cex.axis"),
inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,
add = FALSE, ann = !add && par("ann"), args.legend = NULL, ...)

Example
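# Top run scorers in IPL 2023
top <- c(890, 730, 672, 639, 625, 605, 590)
# barplot() draws the chart and invisibly returns the bar midpoints
result <- barplot(top,
                  main = "TOP SCORERS IN IPL 2023",
                  xlab = "Players",
                  ylab = "Runs",
                  names.arg = c("Shubman", "FAF", "Conway", "Virat", "Jaiswal", "Yadav", "Gaikwad")
)
# Print the x positions of the bar midpoints
print(result)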

Arguments

• height: either a vector or matrix of values describing the bars which make up the plot.
  If height is a vector, the plot consists of a sequence of rectangular bars with heights
  given by the values in the vector. If height is a matrix and beside is FALSE, then each bar
  of the plot corresponds to a column of height, with the values in the column giving the
  heights of stacked sub-bars making up the bar. If height is a matrix and beside is TRUE,
  then the values in each column are juxtaposed rather than stacked.
• width: optional vector of bar widths, recycled to the number of bars drawn. Specifying a
  single value will have no visible effect unless xlim is specified.
• space: the amount of space (as a fraction of the average bar width) left before each bar.
  May be given as a single number or one number per bar. If height is a matrix and beside is
  TRUE, space may be specified by two numbers, where the first is the space between bars in
  the same group, and the second the space between the groups. If not given explicitly, it
  defaults to c(0,1) if height is a matrix and beside is TRUE, and to 0.2 otherwise.
• names.arg: a vector of names to be plotted below each bar or group of bars. If this
  argument is omitted, then the names are taken from the names attribute of height if this
  is a vector, or the column names if it is a matrix.
• horiz: a logical value. If FALSE, the bars are drawn vertically with the first bar to the
  left. If TRUE, the bars are drawn horizontally with the first at the bottom.
• density: a vector giving the density of shading lines, in lines per inch, for the bars or
  bar components. The default value of NULL means that no shading lines are drawn.
  Non-positive values of density also inhibit the drawing of shading lines.
• col: a vector of colors for the bars or bar components. By default, "grey" is used if
  height is a vector, and a gamma-corrected grey palette if height is a matrix; see
  grey.colors.
• border: the color to be used for the border of the bars. Use border = NA to omit borders.
  If there are shading lines, border = TRUE means use the same colour for the border as for
  the shading lines.
• main, sub: main title and subtitle for the plot.
• xlab: a label for the x axis.
• ylab: a label for the y axis.
• xlim: limits for the x axis.
• ylim: limits for the y axis.
• axes: logical. If TRUE, a vertical (or horizontal, if horiz is true) axis is drawn.
• axisnames: logical. If TRUE, and if there are names.arg (see above), the other axis is
  drawn (with lty = 0) and labeled.
• cex.axis: expansion factor for numeric axis labels (see par('cex')).
• cex.names: expansion factor for axis names (bar labels).
• inside: logical. If TRUE, the lines which divide adjacent (non-stacked!) bars will be
  drawn. Only applies when space = 0 (which it partly is when beside = TRUE).
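
Several of these arguments can be combined in one call. The following is a minimal sketch with made-up data (`units_sold` and its values are assumptions for illustration): it draws horizontal bars with horiz = TRUE and then uses the midpoints returned by barplot() to write each value next to its bar.

```r
# Hypothetical data: a named vector, so the names become the bar labels
units_sold <- c(East = 12, West = 7, North = 9)

# horiz = TRUE draws horizontal bars; the return value holds the bar midpoints
mids <- barplot(units_sold,
                horiz = TRUE,
                col = "steelblue",
                border = NA,
                space = 0.5,
                xlim = c(0, 15),
                main = "Units sold by region",
                xlab = "Units")

# Write each value just to the right of its bar, using the midpoints as y positions
text(x = units_sold, y = mids, labels = units_sold, pos = 4)
```

Because the bars are horizontal, the returned midpoints are positions along the y axis rather than the x axis.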

Stacked Bar Plots
We can create a bar chart with groups of bars, or with stacks in each bar, by using a matrix as
the input values. When there are more than two variables, they are represented as a matrix,
which is used to create the grouped bar chart and the stacked bar chart.
# Create the input vectors.
colors <- c("green", "orange", "brown")
months <- c("Mar", "Apr", "May", "Jun", "Jul")
regions <- c("East", "West", "North")

# Create the matrix of the values.
Values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11), nrow = 3, ncol = 5, byrow = TRUE)

# Create the bar chart
barplot(Values, main = "total revenue", names.arg = months, xlab = "month",
        ylab = "revenue", col = colors)

# Add the legend to the chart
legend("topleft", regions, cex = 1.3, fill = colors)
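
The same matrix can also be drawn as a grouped bar chart. A minimal sketch, assuming the same values as the stacked example: setting beside = TRUE places the bars of each column side by side instead of stacking them (`mids` is a name introduced here to hold the returned midpoints).

```r
colors <- c("green", "orange", "brown")
months <- c("Mar", "Apr", "May", "Jun", "Jul")
regions <- c("East", "West", "North")
Values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11),
                 nrow = 3, ncol = 5, byrow = TRUE)

# beside = TRUE places the three regions next to each other within each month
mids <- barplot(Values, beside = TRUE, main = "total revenue",
                names.arg = months, xlab = "month", ylab = "revenue",
                col = colors)

legend("topleft", regions, cex = 1.3, fill = colors)
```

With beside = TRUE the returned midpoints form a matrix with one row per region and one column per month.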

Pie Chart
A pie chart is a circular statistical graphic that is divided into slices to illustrate numerical
proportion. Pie charts represent data visually as fractional parts of a whole, which can be an
effective communication tool.
Usage
pie(x, labels = names(x), edges = 200, radius = 0.8,
clockwise = FALSE, init.angle = if(clockwise) 90 else 0,
density = NULL, angle = 45, col = NULL, border = NULL,
lty = NULL, main = NULL, ...)
Example
expenditure <- c(600, 300, 150, 100, 200)
# pie() draws the chart and returns NULL invisibly, so there is no result to print
pie(expenditure,
    main = "Monthly Expenditure Breakdown",
    labels = c("Housing", "Food", "Clothes", "Entertainment", "Other"),
    col = c("red", "orange", "yellow", "blue", "green")
)

Arguments

• x: a vector of non-negative numerical quantities. The values in x are displayed as the
  areas of pie slices.
• labels: one or more expressions or character strings giving names for the slices. Other
  objects are coerced by as.graphicsAnnot. For empty or NA (after coercion to character)
  labels, no label nor pointing line is drawn.
• edges: the circular outline of the pie is approximated by a polygon with this many edges.
• radius: the pie is drawn centered in a square box whose sides range from −1 to 1. If the
  character strings labeling the slices are long it may be necessary to use a smaller radius.
• clockwise: logical indicating if slices are drawn clockwise or counter clockwise (i.e.,
  mathematically positive direction), the latter is default.
• density: the density of shading lines, in lines per inch. The default value of NULL means
  that no shading lines are drawn. Non-positive values of density also inhibit the drawing
  of shading lines.
• col: a vector of colors to be used in filling or shading the slices. If missing a set of 6
  pastel colours is used, unless density is specified when par("fg") is used.
• border, lty: (possibly vectors) arguments passed to polygon which draws each slice.
• main: an overall title for the plot.
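
Since a pie chart shows parts of a whole, it is often helpful to print the percentage on each slice. The sketch below is one possible way to do this, assuming the same expenditure figures as the example above stored as a named vector; `pct` and `slice_labels` are names introduced here for illustration.

```r
# Same expenditure figures as the example above, stored as a named vector
expenditure <- c(Housing = 600, Food = 300, Clothes = 150,
                 Entertainment = 100, Other = 200)

# Share of each category in percent, rounded to one decimal place
pct <- round(100 * expenditure / sum(expenditure), 1)

# Build labels such as "Housing (44.4%)"
slice_labels <- paste0(names(expenditure), " (", pct, "%)")

# clockwise = TRUE starts at the top and draws the slices clockwise
pie(expenditure,
    labels = slice_labels,
    clockwise = TRUE,
    col = c("red", "orange", "yellow", "blue", "green"),
    main = "Monthly Expenditure Breakdown")
```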


Create a 3D Pie Chart in R
In order to create a 3D pie chart, we first need to load the plotrix package (install it once
with install.packages("plotrix") if it is not already available). Then, we use the pie3D()
function to create a 3D pie chart. For example,
# load plotrix to use pie3D()
library(plotrix)

expenditure <- c(600, 300, 150, 100, 200)

result <- pie3D(expenditure,
                main = "Monthly Expenditure Breakdown",
                labels = c("Housing", "Food", "Clothes", "Entertainment", "Other"),
                col = c("red", "orange", "yellow", "blue", "green")
)

print(result)

R Histogram
A histogram is a graphical display of data using bars of different heights, where each bar
shows how many values fall into a range (bin). Histograms are used to summarize discrete or
continuous data that are measured on an interval scale.

Example
temperatures <- c(67, 72, 74, 62, 76, 66, 65, 59, 61, 69)

# histogram of temperatures vector
result <- hist(temperatures,
               main = "Histogram of Temperature",
               xlab = "Temperature in degrees Fahrenheit",
               col = "pink",
               xlim = c(50, 100),
               ylim = c(0, 5))

# hist() returns the bin breaks, counts, and midpoints it used
print(result)
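
The bins can also be set explicitly with the breaks argument. A minimal sketch, assuming 5-degree bins from 55 to 80 (the bin width is a choice made here for illustration, and `bins` is a name introduced for the result):

```r
temperatures <- c(67, 72, 74, 62, 76, 66, 65, 59, 61, 69)

# Explicit 5-degree bins from 55 to 80;
# plot = FALSE returns the counts without drawing anything
bins <- hist(temperatures, breaks = seq(55, 80, by = 5), plot = FALSE)

print(bins$breaks)  # bin edges
print(bins$counts)  # number of temperatures in each bin

# Draw the histogram with the same bins
hist(temperatures,
     breaks = seq(55, 80, by = 5),
     main = "Histogram of Temperature",
     xlab = "Temperature in degrees Fahrenheit",
     col = "pink")
```

By default the bins are closed on the right, so a value that falls exactly on an edge (65 here) is counted in the bin to its left. The counts for these bins are 1, 3, 3, 2, 1.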

R Boxplot
A boxplot, also known as a box-and-whisker plot, is a graphical representation of the
distribution of a dataset. It gives a good indication of how the values in the data are spread
out, including the data's symmetry and skewness, and it summarizes key statistical measures
such as the median, quartiles, and potential outliers. Here's how a boxplot is constructed:
1. Box: The box in the plot represents the interquartile range (IQR), which is the range
between the first quartile (Q1) and the third quartile (Q3) of the dataset. The length of
the box indicates the spread of the middle 50% of the data.
2. Median Line: A line inside the box represents the median, which is the middle value of
the dataset when it's ordered.
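3. Whiskers: Lines extending from the box, known as whiskers, extend to the minimum
and maximum values within a certain range. The range of the whiskers can vary
depending on the specific rules applied; by default, R's boxplot() extends them to the
most extreme data points that lie within 1.5 times the IQR of the box.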
4. Outliers: Points that are located outside the whiskers are considered potential outliers,
meaning they are data points that are significantly different from the rest of the data.
Boxplots are particularly useful for comparing the distribution of multiple datasets or variables
side by side. They provide a concise visual summary of the central tendency, spread, and
potential extreme values in the data.
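
The numbers behind these four components can be inspected with boxplot.stats(). The sketch below uses made-up data with one unusually large value (`values` is an assumption for illustration). Note that R builds the box from hinges, which are very close to the quartiles but are not always identical to the output of quantile().

```r
# Hypothetical data with one unusually large value
values <- c(10, 15, 20, 25, 30, 35, 40, 45, 50, 120)

# boxplot.stats() returns the numbers that boxplot() would draw
bs <- boxplot.stats(values)

print(bs$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)    # points beyond the whiskers, drawn as individual outliers

boxplot(values, main = "Boxplot with an Outlier", ylab = "Value", col = "lightblue")
```

Here the box runs from 20 to 45 with the median at 32.5, the whiskers end at 10 and 50, and 120 is flagged as a potential outlier because it lies more than 1.5 box lengths above the upper hinge.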

Example:
# Create a vector of data
data <- c(10, 15, 20, 25, 30, 35, 40, 45, 50)
# Create a labeled boxplot with a title and custom color
boxplot(data, main="Boxplot Example", xlab="Data", col="blue")
Example 2:
# Compare mileage across cylinder counts using the built-in mtcars dataset
boxplot(mpg ~ cyl, data = mtcars, main="Mileage Data Boxplot",
ylab="Miles Per Gallon(mpg)", xlab="No. of Cylinders", col="green")
