Unit_V_Statistics_R

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 60

Welcome

R Programming
Dr. Ramkumar Krishnamoorthy
ramkumar.k@jainuniversity.ac.in
Unit V – Descriptive Statistics
Descriptive Statistics: Data Range, Frequencies, Mode, Mean and Median: Mean Applying Trim Option,
Applying NA Option, Median - Mode - Standard Deviation – Correlation & Regression - Spotting Problems in
Data with Visualization: Visually Checking Distributions for a single Variable - R –Pie Charts: Pie Chart title
and Colors – Slice Percentages and Chart Legend, 3D Pie Chart – R Histograms – Density Plot - R – Bar
Charts: Bar Chart Labels, Title and Colors.
Descriptive Statistics in R
• Descriptive statistics summarize and organize characteristics of a data set.

• A data set is a collection of responses or observations from a sample or entire population.

• In quantitative research, after collecting data, the first step of statistical analysis is to describe

characteristics of the responses, such as the average of one variable (e.g., age), or the relation

between two variables (e.g., age and creativity).


Descriptive Statistics in R
Types of descriptive statistics (3 Types)

• The distribution concerns the frequency of each value.

• The central tendency concerns the averages of the values.

• The variability or dispersion concerns how spread out the values are.
Descriptive Statistics in R
Formula in R
Data Range in R
Method 1: Finding the range in a vector using min and max functions
• User can find the range by performing the difference between the minimum value in the vector
and the maximum value in the given vector.
Example
data = c(12, 45, NA, NA, 67, 23, 45, 78, NA, 89)
print(data)
print(max(data, na.rm=TRUE)-min(data, na.rm=TRUE))
Data Range in R
Method 2: Using range() function
• User can use the range() function to get maximum and minimum values.
Syntax
• range(vector / dataframe)
Example
• data = c(12, 45, NA, NA, 67, 23, 45, 78, NA, 89)
• print(data)
• print(range(data, na.rm=TRUE))
Example
Data Range in R
myData = read.csv("CardioGoodFitness.csv", stringsAsFactors = F)
print(head(myData))
Mean in R
• It is the sum of observations divided by the total number of observations. It is also defined as
average which is the sum divided by count.
here, n = number of terms

myData = read.csv("CardioGoodFitness.csv",
stringsAsFactors = F)
mean = mean(myData$Age)
print(mean)
Median in R
• It is the middle value of the data set. It splits the data into two halves. If the number of elements
in the data set is odd then the center element is median and if it is even then the median would
be the average of two central elements.

Here,
n = number of terms
Median in R
Example
• myData = read.csv("CardioGoodFitness.csv",
• stringsAsFactors = F)
• # Compute the median value
• median = median(myData$Age)
• print(median)
Mode in R
• It is the value that has the highest frequency in the given data set.
• The data set may have no mode if the frequency of all data points is the same.
• Also, user can have more than one mode if they encounter two or more data points having the
same frequency.

Example
• Ordered data set 0, 3, 3, 12, 15, 24

Mode Find the most frequently occurring response: 3


Mode in R
Example
• library(modeest)
• myData = read.csv("CardioGoodFitness.csv",
• stringsAsFactors = F)
• mode = mfv(myData$Age)
• print(mode)

Note:
• mfv: Most frequent value(s)
Range in R
• The range describes the difference between the largest and smallest data point in our data set.
The bigger the range, the more is the spread of data and vice versa.
• Range = Largest data value – smallest data value
Frequency in R
• Frequency tables are indispensable for understanding the distribution of data in R.
Applying trim() in Mean R
• A trimmed mean is the mean of a dataset that has been calculated after removing a
specific percentage of the smallest and largest values from the dataset.
• For example, a 10% trimmed mean would represent the mean of a dataset after the
10% smallest values and 10% largest values have been removed.
Applying trim() in Mean R
Example
Applying trim() in Mean R
Example
Applying NA in Mean R
Example
• If there are missing values, then the mean function returns NA.
• To drop the missing values from the calculation use na.rm = TRUE. which means remove the NA values.
Standard Deviation in R
• It is defined as the square root of the variance.
• It is being calculated by finding the Mean, then subtract each number from the
Mean which is also known as average and square the result.
• Adding all the values and then divide by the no of terms followed the square root.

N = number of terms
u = Mean
Standard Deviation in R
Example
• myData = read.csv("CardioGoodFitness.csv", stringsAsFactors = F)
• std = sd(myData$Age)
• print(std)

• Output: 6.943498
Sample Dataset in R
Correlation in R
Correlation in R
Correlation in R
• Correlation Between Multiple Variables
Correlation and Regression in R
Correlation and Regression in R
Spotting Problems in Data and Visualization
Visually Checking Distributions for a Single Variable

• Statistical computing is really helpful for creating a high-quality graphics.


• Here, Selecting the right type of graph can help user to analyse their data in a
better way.

• There are 4 types of plots that user can use to observe a single variable data:
• · Histograms
• · Index plots
• · Time-series plots
• · Pie Charts
Visually Checking Distributions for a Single Variable

Histograms Index Plot Time Series

Pie Chart
Pie Chart Visualization in R
• R Programming language has numerous libraries to create charts and
graphs.
• A pie-chart is a representation of values as slices of a circle with different
colors.
• The slices are labeled and the numbers corresponding to each slice is
also represented in the chart.
• In R the pie chart is created using the pie() function which takes positive
numbers as a vector input.
• The additional parameters are used to control labels, color, title etc.
Pie Chart Visualization in R
Pie Chart Visualization in R
Example
• A very simple pie-chart is created using just the input vector and labels.
• The below script will create and save the pie chart in the current R working
directory.

x <- c(21, 62, 10, 53)


labels <- c("London", "New York", "Singapore", "Mumbai")
png(file = "city.png")
pie(x,labels)
dev.off()

14% 42% .06% 36%


Pie Chart Visualization in R
Pie Chart Title and Colors Visualization in R

• Here, use parameter main to add a title to the chart and another parameter is col

which will make use of rainbow colour pallet while drawing the chart.

• The length of the pallet should be same as the number of values user have for the

chart. Hence user use length(x).

• cex. Specifies the symbol size. (Zoom in and out of the respective size)
Pie Chart Title and Colors Visualization in R
Example
• x <- c(21, 62, 10, 53)
• labels <- c("London", "New York", "Singapore", "Mumbai")
• png(file = "city_title_colours.jpg")
• pie(x, labels, main = "City pie chart", col = rainbow(length(x)))
• dev.off()
Slice Percentages and Chart Legend
Example
• x <- c(21, 62, 10,53)
• labels <- c("London","New York","Singapore","Mumbai")
• piepercent<- round(100*x/sum(x), 1)
• png(file = "city_percentage_legends.jpg")
• pie(x, labels = piepercent, main = "City pie chart", col = rainbow(length(x)))
• legend("topright", c("London","New York","Singapore","Mumbai"), cex = 0.8,
• fill = rainbow(length(x)))
• dev.off()
Slice Percentages and Chart Legend
3D Pie Chart
• A pie chart with 3 dimensions can be drawn using additional packages.
• The package plotrix has a function called pie3D() that is used for this.
Example
• library(plotrix)
• x <- c(21, 62, 10,53)
• lbl <- c("London","New York","Singapore","Mumbai")
• png(file = "3d_pie_chart.jpg")
• pie3D(x,labels = lbl,explode = 0.1, main = "Pie Chart of Countries ")
• dev.off()
3D Pie Chart
Histogram in R
• A histogram represents the frequencies of values of a variable bucketed into
ranges.
• Histogram is similar to bar chat but the difference is it groups the values into
continuous ranges.
• Each bar in histogram represents the height of the number of values present in
that range.
• R creates histogram using hist() function.
• This function takes a vector as an input and uses some more parameters to plot
histograms.
Histogram in R
Histogram in R
• A simple histogram is created using input vector, label, col and border parameters.
• The script given below will create and save the histogram in the current R working directory.
Example
Histogram in R
Histogram in R - Range of X and Y values
• To specify the range of values allowed in X axis and Y axis, we can use the xlim and ylim
parameters.
• The width of each of the bar can be decided by using breaks.
Example
• v <- c(9,13,21,8,36,22,12,41,31,33,19)
• png(file = "histogram_lim_breaks.png")
• hist(v,xlab = "Weight",col = "green",border = "red", xlim = c(0,40), ylim = c(0,5), breaks = 5)
• dev.off()
Histogram in R - Range of X and Y values

8, 9, 12, 13, 19, 21, 22, 31,33, 36,41


Histogram in R – Density Plot
• A density plot shows the distribution of a numeric variable.
Example
• set.seed(1234)
• x <- rnorm(500)
• par(mfrow = c(1, 2))
• hist(x, freq = FALSE, main = "Histogram and density")
• dx <- density(x)
• lines(dx, lwd = 2, col = "red")
• plot(dx, lwd = 2, col = "red", main = "Density")
• rug(jitter(x))
• rnorm()-random numbers
• The par() - is used to set or query graphical parameters
• The mfrow() parameter allows to split the screen in several panels
• density() - A density plot is a representation of the distribution of a numeric variable
• rug() - A rug plot is a plot of data for a single quantitative variable, displayed as marks along
an axis
Histogram in R – Density Plot
Bar chart in R
• A bar chart represents data in rectangular bars with length of the bar proportional

to the value of the variable.

• R uses the function barplot() to create bar charts. R can draw both vertical and

Horizontal bars in the bar chart.

• In bar chart each of the bars can be given different colors.


Bar chart in R
Bar chart in R
Example
• H <- c(7,12,28,3,41)
• png(file = "barchart.png")
• barplot(H)
• dev.off()
Bar chart Labels in R
• The features of the bar chart can be expanded by adding more parameters.

• The main parameter is used to add title.

• The col parameter is used to add colors to the bars.

• The args.name is a vector having same number of values as the input vector to

describe the meaning of each bar.


Bar chart Labels, Title and Color in R
Example
• H <- c(7,12,28,3,41)
• M <- c("Mar","Apr","May","Jun","Jul")
• # Give the chart file a name
• png(file = "barchart_months_revenue.png")
• # Plot the bar chart
• barplot(H,names.arg=M,xlab="Month",ylab="Revenue",col="blue",
main="Revenue chart",border="red")

• # Save the file


• dev.off()
Bar chart Labels, Title and Color in R
Group Bar Chart and Stacked Bar Chart
• User can create bar chart with groups of bars and stacks in each bar by using a
matrix as input values.
• More than two variables are represented as a matrix which is used to create the
group bar chart and stacked bar chart.
Group Bar Chart and Stacked Bar Chart
Example
• colors = c("green","orange","brown")
• months <- c("Mar","Apr","May","Jun","Jul")
• regions <- c("East","West","North")
• Values <- matrix(c(2,9,3,11,9,4,8,7,3,12,5,2,8,10,11), nrow = 3, ncol = 5, byrow =
TRUE)
• png(file = "barchart_stacked.png")

• barplot(Values, main = "total revenue", names.arg = months, xlab = "month", ylab


= "revenue", col = colors)
• legend("topleft", regions, cex = 1.3, fill = colors)
Group Bar Chart and Stacked Bar Chart
Credits
• https://www.statology.org/r-guides/
• https://www.tutorialspoint.com/r/r_pie_charts.htm
• https://www.tutorialspoint.com/r/r_pie_charts.htm
• https://r-coder.com/density-plot-r/
Thank
you!

You might also like