Professional Documents
Culture Documents
Unit_V_Statistics_R
Unit_V_Statistics_R
Unit_V_Statistics_R
R Programming
Dr. Ramkumar Krishnamoorthy
ramkumar.k@jainuniversity.ac.in
Unit V – Descriptive Statistics
Descriptive Statistics: Data Range, Frequencies, Mode, Mean and Median: Mean Applying Trim Option,
Applying NA Option, Median - Mode - Standard Deviation – Correlation & Regression - Spotting Problems in
Data with Visualization: Visually Checking Distributions for a single Variable - R –Pie Charts: Pie Chart title
and Colors – Slice Percentages and Chart Legend, 3D Pie Chart – R Histograms – Density Plot - R – Bar
Charts: Bar Chart Labels, Title and Colors.
Descriptive Statistics in R
• Descriptive statistics summarize and organize characteristics of a data set.
• In quantitative research, after collecting data, the first step of statistical analysis is to describe
characteristics of the responses, such as the average of one variable (e.g., age), or the relation
• The variability or dispersion concerns how spread out the values are.
Descriptive Statistics in R
Formula in R
Data Range in R
Method 1: Finding the range in a vector using min and max functions
• User can find the range by performing the difference between the minimum value in the vector
and the maximum value in the given vector.
Example
data = c(12, 45, NA, NA, 67, 23, 45, 78, NA, 89)
print(data)
print(max(data, na.rm=TRUE)-min(data, na.rm=TRUE))
Data Range in R
Method 2: Using range() function
• User can use the range() function to get maximum and minimum values.
Syntax
• range(vector / dataframe)
Example
• data = c(12, 45, NA, NA, 67, 23, 45, 78, NA, 89)
• print(data)
• print(range(data, na.rm=TRUE))
Example
Data Range in R
myData = read.csv("CardioGoodFitness.csv", stringsAsFactors = F)
print(head(myData))
Mean in R
• It is the sum of observations divided by the total number of observations. It is also defined as
average which is the sum divided by count.
here, n = number of terms
myData = read.csv("CardioGoodFitness.csv",
stringsAsFactors = F)
mean = mean(myData$Age)
print(mean)
Median in R
• It is the middle value of the data set. It splits the data into two halves. If the number of elements
in the data set is odd then the center element is median and if it is even then the median would
be the average of two central elements.
Here,
n = number of terms
Median in R
Example
• myData = read.csv("CardioGoodFitness.csv",
• stringsAsFactors = F)
• # Compute the median value
• median = median(myData$Age)
• print(median)
Mode in R
• It is the value that has the highest frequency in the given data set.
• The data set may have no mode if the frequency of all data points is the same.
• Also, user can have more than one mode if they encounter two or more data points having the
same frequency.
Example
• Ordered data set 0, 3, 3, 12, 15, 24
Note:
• mfv: Most frequent value(s)
Range in R
• The range describes the difference between the largest and smallest data point in our data set.
The bigger the range, the more is the spread of data and vice versa.
• Range = Largest data value – smallest data value
Frequency in R
• Frequency tables are indispensable for understanding the distribution of data in R.
Applying trim() in Mean R
• A trimmed mean is the mean of a dataset that has been calculated after removing a
specific percentage of the smallest and largest values from the dataset.
• For example, a 10% trimmed mean would represent the mean of a dataset after the
10% smallest values and 10% largest values have been removed.
Applying trim() in Mean R
Example
Applying trim() in Mean R
Example
Applying NA in Mean R
Example
• If there are missing values, then the mean function returns NA.
• To drop the missing values from the calculation use na.rm = TRUE. which means remove the NA values.
Standard Deviation in R
• It is defined as the square root of the variance.
• It is being calculated by finding the Mean, then subtract each number from the
Mean which is also known as average and square the result.
• Adding all the values and then divide by the no of terms followed the square root.
N = number of terms
u = Mean
Standard Deviation in R
Example
• myData = read.csv("CardioGoodFitness.csv", stringsAsFactors = F)
• std = sd(myData$Age)
• print(std)
• Output: 6.943498
Sample Dataset in R
Correlation in R
Correlation in R
Correlation in R
• Correlation Between Multiple Variables
Correlation and Regression in R
Correlation and Regression in R
Spotting Problems in Data and Visualization
Visually Checking Distributions for a Single Variable
• There are 4 types of plots that user can use to observe a single variable data:
• · Histograms
• · Index plots
• · Time-series plots
• · Pie Charts
Visually Checking Distributions for a Single Variable
Pie Chart
Pie Chart Visualization in R
• R Programming language has numerous libraries to create charts and
graphs.
• A pie-chart is a representation of values as slices of a circle with different
colors.
• The slices are labeled and the numbers corresponding to each slice is
also represented in the chart.
• In R the pie chart is created using the pie() function which takes positive
numbers as a vector input.
• The additional parameters are used to control labels, color, title etc.
Pie Chart Visualization in R
Pie Chart Visualization in R
Example
• A very simple pie-chart is created using just the input vector and labels.
• The below script will create and save the pie chart in the current R working
directory.
• Here, use parameter main to add a title to the chart and another parameter is col
which will make use of rainbow colour pallet while drawing the chart.
• The length of the pallet should be same as the number of values user have for the
• cex. Specifies the symbol size. (Zoom in and out of the respective size)
Pie Chart Title and Colors Visualization in R
Example
• x <- c(21, 62, 10, 53)
• labels <- c("London", "New York", "Singapore", "Mumbai")
• png(file = "city_title_colours.jpg")
• pie(x, labels, main = "City pie chart", col = rainbow(length(x)))
• dev.off()
Slice Percentages and Chart Legend
Example
• x <- c(21, 62, 10,53)
• labels <- c("London","New York","Singapore","Mumbai")
• piepercent<- round(100*x/sum(x), 1)
• png(file = "city_percentage_legends.jpg")
• pie(x, labels = piepercent, main = "City pie chart", col = rainbow(length(x)))
• legend("topright", c("London","New York","Singapore","Mumbai"), cex = 0.8,
• fill = rainbow(length(x)))
• dev.off()
Slice Percentages and Chart Legend
3D Pie Chart
• A pie chart with 3 dimensions can be drawn using additional packages.
• The package plotrix has a function called pie3D() that is used for this.
Example
• library(plotrix)
• x <- c(21, 62, 10,53)
• lbl <- c("London","New York","Singapore","Mumbai")
• png(file = "3d_pie_chart.jpg")
• pie3D(x,labels = lbl,explode = 0.1, main = "Pie Chart of Countries ")
• dev.off()
3D Pie Chart
Histogram in R
• A histogram represents the frequencies of values of a variable bucketed into
ranges.
• Histogram is similar to bar chat but the difference is it groups the values into
continuous ranges.
• Each bar in histogram represents the height of the number of values present in
that range.
• R creates histogram using hist() function.
• This function takes a vector as an input and uses some more parameters to plot
histograms.
Histogram in R
Histogram in R
• A simple histogram is created using input vector, label, col and border parameters.
• The script given below will create and save the histogram in the current R working directory.
Example
Histogram in R
Histogram in R - Range of X and Y values
• To specify the range of values allowed in X axis and Y axis, we can use the xlim and ylim
parameters.
• The width of each of the bar can be decided by using breaks.
Example
• v <- c(9,13,21,8,36,22,12,41,31,33,19)
• png(file = "histogram_lim_breaks.png")
• hist(v,xlab = "Weight",col = "green",border = "red", xlim = c(0,40), ylim = c(0,5), breaks = 5)
• dev.off()
Histogram in R - Range of X and Y values
• R uses the function barplot() to create bar charts. R can draw both vertical and
• The args.name is a vector having same number of values as the input vector to