Module V 1


Question Bank

Data Science Using R


Module-V
Question 1: Describe the mean and median functions for central tendency in R.
Answer:
mean(x)

The mean is calculated by taking the sum of the values and dividing it by the number of values
in a data series. The function mean() is used to calculate this in R.

For example, the mean can be calculated in R as follows:


# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
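The definition above (sum of the values divided by their count) can be checked directly. The following is a minimal sketch, not part of the original answer: the variable names are illustrative, and the na.rm argument is shown because mean() returns NA when the data contain a missing value.

```r
# Same vector as in the example above
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

# Mean computed from its definition: sum divided by count
manual_mean <- sum(x) / length(x)

# all.equal() compares numbers with a small floating-point tolerance
print(isTRUE(all.equal(manual_mean, mean(x))))

# A missing value makes mean() return NA unless na.rm = TRUE is set
x_with_na <- c(x, NA)
mean_with_na <- mean(x_with_na)
mean_without_na <- mean(x_with_na, na.rm = TRUE)
print(mean_with_na)
print(mean_without_na)
```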

Example: Calculating mean from Dataframe


Input:
df <- data.frame(
  Training = c("Strength", "Stamina", "Bulky", "Lean", "Athlete", "Boxer"),
  Pulse = c(100, 100, 120, 100, 120, 112),
  Duration = c(40, 35, 50, 34, 50, 42)
)
df
# Compute the mean value; the result gets its own name so it does not mask mean()
mean_duration <- mean(df$Duration)
print("Mean of Duration is")
print(mean_duration)

Output:
Training Pulse Duration
1 Strength 100 40
2 Stamina 100 35
3 Bulky 120 50
4 Lean 100 34
5 Athlete 120 50
6 Boxer 112 42
[1] "Mean of Duration is"
[1] 41.83333
median(x)

The median is the middle value of a sorted data series. When the series has an even number of
values, it is the average of the two middle values. The median() function is used in R to
calculate this value.

For example, the median can be calculated in R as follows:
# Create the vector.
x <- c(2,3,4,5,6)
# Find the median.
median.result <- median(x)
print(median.result)
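To make the "middle value" rule concrete, here is a minimal sketch that computes the median by sorting. The helper name manual_median is hypothetical and used only for illustration; in practice median() should be preferred.

```r
# Hypothetical helper: median computed by sorting and picking the middle
manual_median <- function(v) {
  s <- sort(v)
  n <- length(s)
  if (n %% 2 == 1) {
    # Odd count: the single middle value
    s[(n + 1) / 2]
  } else {
    # Even count: average of the two middle values
    (s[n / 2] + s[n / 2 + 1]) / 2
  }
}

odd_values  <- c(2, 3, 4, 5, 6)
even_values <- c(40, 35, 50, 34, 50, 42)

print(manual_median(odd_values))
print(manual_median(even_values))
```

Both calls agree with median() on the same vectors.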

Example: Calculating Median from Dataframe


Input:
df <- data.frame(
  Training = c("Strength", "Stamina", "Bulky", "Lean", "Athlete", "Boxer"),
  Pulse = c(100, 100, 120, 100, 120, 112),
  Duration = c(40, 35, 50, 34, 50, 42)
)
df
# Compute the median value; the result gets its own name so it does not mask median()
median_duration <- median(df$Duration)
print("Median of Duration is")
print(median_duration)

Output:
Training Pulse Duration
1 Strength 100 40
2 Stamina 100 35
3 Bulky 120 50
4 Lean 100 34
5 Athlete 120 50
6 Boxer 112 42
[1] "Median of Duration is"
[1] 41

Question 2: What is variability? What are the different statistical functions for variability
in R?
Answer:
Variability (also known as statistical dispersion) is another feature of descriptive statistics.
Measures of central tendency and measures of variability together make up descriptive statistics.
Variability shows how widely the values of a data set are spread around a central point.

Example: Suppose there are 2 data sets with the same mean value:
A = 4, 4, 5, 6, 6
Mean(A) = 5
B = 1, 1, 5, 9, 9
Mean(B) = 5
The mean alone cannot tell these two data sets apart, so R offers various measures of
variability to differentiate between them.
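The two data sets A and B can be compared in R. This is a small illustrative sketch (the data frame name spread is an assumption made for the example): the means are identical, while the variance and standard deviation differ.

```r
# The two data sets from the example above
A <- c(4, 4, 5, 6, 6)
B <- c(1, 1, 5, 9, 9)

# Both means are equal
print(mean(A) == mean(B))

# Variance and standard deviation reveal the difference in spread
spread <- data.frame(
  set      = c("A", "B"),
  variance = c(var(A), var(B)),
  std_dev  = c(sd(A), sd(B))
)
print(spread)
```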
Measures of Variability

Following are some of the measures of variability that R offers to differentiate between
data sets:
Variance
Standard Deviation
Range
Mean Deviation
Interquartile Range
Variance
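Variance is a measure that shows how far each value lies from a reference point, usually the
mean value. Mathematically, it is defined as the average of the squared differences from the
mean value. Note that R's var() function computes the sample variance, which divides the sum of
squared differences by n - 1 rather than n.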
Example: 1
# Defining vector
x <- c(5, 5, 8, 12, 15, 16)
# Print variance of x
print(var(x))
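The following sketch computes the variance by hand to show which divisor var() uses. The variable names are illustrative only.

```r
# Same vector as in Example 1
x <- c(5, 5, 8, 12, 15, 16)
n <- length(x)

# Sum of squared differences from the mean
squared_diffs <- sum((x - mean(x))^2)

# Sample variance (divide by n - 1): this is what var() returns
sample_var <- squared_diffs / (n - 1)

# Population variance (divide by n): slightly smaller
population_var <- squared_diffs / n

print(sample_var)
print(population_var)
print(isTRUE(all.equal(sample_var, var(x))))
```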

Standard Deviation
Standard deviation measures the spread of data values around the mean and is calculated as
the square root of the variance.
R provides the built-in sd() function for this. The same value can be obtained by taking the
square root of var(), as the example below shows.
Example: 2
# Defining vector
x <- c(5, 5, 8, 12, 15, 16)
# Standard deviation as the square root of the variance
d <- sqrt(var(x))
# Print standard deviation of x
print(d)
# The built-in sd() function returns the same value
print(sd(x))

Range
Range is the difference between the maximum and minimum values of a data set. In R, the
range() function returns the minimum and maximum values themselves rather than their
difference, so the range is computed with max() and min() or, equivalently, with diff(range(x)).
Example: 3
Input:
# Defining vector
x <- c(5, 5, 8, 12, 15, 16)
# range() function output
print(range(x))
# Using max() and min() function
# to calculate the range of data set
print(max(x)-min(x))
Output:
[1] 5 16
[1] 11

Mean Deviation
Mean deviation is the average of the absolute differences between each value and a central
value. The central value can be the mean, median, or mode.
Example: 4
# Defining vector
x <- c(5, 5, 8, 12, 15, 16)
# Mean deviation
md <- sum(abs(x-mean(x)))/length(x)
# Print mean deviation
print(md)
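Since the central value can also be the median, the calculation can be generalized. The helper mean_deviation below is a hypothetical name introduced for this sketch; it accepts the centering function as an argument.

```r
# Hypothetical helper: mean absolute deviation around a chosen central value
mean_deviation <- function(v, center = mean) {
  sum(abs(v - center(v))) / length(v)
}

x <- c(5, 5, 8, 12, 15, 16)

# Around the mean (same calculation as Example 4)
md_mean <- mean_deviation(x)

# Around the median
md_median <- mean_deviation(x, center = median)

print(md_mean)
print(md_median)
```

R's built-in mad() is a different measure: it returns the median of the absolute deviations from the median, multiplied by a scaling constant by default, so it should not be confused with the mean deviation.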

Interquartile Range
Interquartile range is based on splitting a sorted data set into four equal parts using 3
quartile values (Q1, Q2, Q3). Q2 is the median of the whole data set.
Mathematically, the interquartile range is defined as:
IQR = Q3 - Q1
where,
Q3 is the third quartile (roughly, the median of the upper half of the data)
Q1 is the first quartile (roughly, the median of the lower half of the data)
R's IQR() function obtains the quartiles from quantile(), which interpolates between data
points by default, so its result can differ from the median-of-halves calculation.
Example: 5
# Defining vector
x <- c(5, 5, 8, 12, 15, 16)
# Print Interquartile range
print(IQR(x))
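The sketch below contrasts the two ways of obtaining the quartiles for the same vector. It assumes the default settings of quantile(); the variable names are illustrative.

```r
x <- c(5, 5, 8, 12, 15, 16)

# Quartiles as computed by quantile() with its default interpolation
q <- quantile(x, probs = c(0.25, 0.75))
iqr_from_quantiles <- unname(q[2] - q[1])

# Median-of-halves approach: split the sorted data into lower and upper halves
s <- sort(x)
half <- length(s) %/% 2
lower_half <- s[1:half]
upper_half <- s[(length(s) - half + 1):length(s)]
iqr_from_halves <- median(upper_half) - median(lower_half)

print(iqr_from_quantiles)  # matches IQR(x)
print(iqr_from_halves)     # textbook median-of-halves value
```

For this small data set the two approaches give different values, which is why the result of IQR() may not match a hand calculation.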

Question 3: Explain correlation in R.


Answer: Correlation is a statistical measure that indicates how strongly two variables are
related. With more than two variables, it is computed for each pair. For instance, to find out
whether there is a relationship between the heights of fathers and sons, a correlation
coefficient can be calculated. The coefficient always lies between -1 and +1. It is a scaled
version of covariance and gives both the direction and the strength of a linear relationship.
R provides two functions to calculate the Pearson correlation coefficient: cor() and
cor.test(). cor() computes only the correlation coefficient, whereas cor.test() performs a
test for association between paired samples and returns both the correlation coefficient and
the significance level (p-value) of the correlation.
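The statement that correlation is a scaled version of covariance can be verified directly. This is a minimal sketch using the same two vectors as the examples in this answer; the name r_manual is illustrative.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(1, 3, 6, 2, 7, 4, 5)

# Pearson correlation: covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

print(r_manual)
print(isTRUE(all.equal(r_manual, cor(x, y, method = "pearson"))))
```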

Example: 1 Using cor() method


Input:
# R program to illustrate
# pearson Correlation Testing
# Using cor()

# Taking two numeric
# Vectors with same length
x = c(1, 2, 3, 4, 5, 6, 7)
y = c(1, 3, 6, 2, 7, 4, 5)
# Calculating
# Correlation coefficient
# Using cor() method
result = cor(x, y, method = "pearson")
# Print the result
cat("Pearson correlation coefficient is:", result)
Output:
Pearson correlation coefficient is: 0.5357143

Example: 2 Using cor.test() method


Input:
# R program to illustrate
# pearson Correlation Testing
# Using cor.test()
# Taking two numeric
# Vectors with same length
x = c(1, 2, 3, 4, 5, 6, 7)
y = c(1, 3, 6, 2, 7, 4, 5)
# Calculating
# Correlation coefficient
# Using cor.test() method
result = cor.test(x, y, method = "pearson")
# Print the result
print(result)
Output:
Pearson's product-moment correlation

data: x and y
t = 1.4186, df = 5, p-value = 0.2152
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3643187 0.9183058
sample estimates:
cor
0.5357143
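Instead of reading the values off the printed report, the individual pieces can be extracted from the object returned by cor.test(). The sketch below assumes the same x and y as above; the variable names on the left-hand side are illustrative.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(1, 3, 6, 2, 7, 4, 5)

result <- cor.test(x, y, method = "pearson")

# Components of the returned test object
r_estimate <- unname(result$estimate)  # correlation coefficient
p_value    <- result$p.value           # significance level
conf_int   <- result$conf.int          # 95 percent confidence interval

cat("Correlation:", r_estimate, "\n")
cat("p-value:", p_value, "\n")
cat("Confidence interval:", conf_int[1], "to", conf_int[2], "\n")
```

Here the p-value is well above 0.05, so the test does not reject the hypothesis that the true correlation is 0, even though the sample coefficient is moderately positive.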

Question 4: Explain skewness and its types.


Answer: In statistics, skewness and kurtosis are measures that describe the shape of a data
distribution. Both are numerical methods for analyzing shape, unlike histograms and other
plots, which are graphical methods. They are often used to check how far a distribution
departs from normality. Base R has no built-in skewness or kurtosis function, so an add-on
package such as moments is typically used to calculate them.

Skewness
Skewness is a numerical measure of the asymmetry of a distribution or data set. It indicates
where the majority of data values lie relative to the mean value.
There are 3 types of skewness, determined by the sign of the coefficient of skewness
$\gamma_1$. These are as follows:
Positive Skew
If the coefficient of skewness is greater than 0, i.e. $\gamma_1 > 0$, the
distribution is positively skewed, with the majority of data values less than
the mean. Most of the values are concentrated on the left side of the graph,
with a long tail to the right.
Zero Skewness or Symmetric
If the coefficient of skewness is equal to 0 or approximately close to 0,
i.e. $\gamma_1 = 0$, the distribution is symmetric about the mean. A normal
distribution has zero skewness, but zero skewness alone does not show that
the data are normally distributed.

Negative Skew
If the coefficient of skewness is less than 0, i.e. $\gamma_1 < 0$, the
distribution is negatively skewed, with the majority of data values greater than
the mean. Most of the values are concentrated on the right side of the graph,
with a long tail to the left.
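The three cases can be illustrated in base R without any add-on package. The sketch below uses the moment-based definition of skewness (third central moment divided by the 1.5th power of the second central moment); the helper name moment_skewness and the sample vectors are assumptions made for this example, and packages may apply a different bias correction.

```r
# Hypothetical helper: moment-based coefficient of skewness
moment_skewness <- function(v) {
  d <- v - mean(v)
  m2 <- mean(d^2)  # second central moment
  m3 <- mean(d^3)  # third central moment
  m3 / m2^1.5
}

right_skewed <- c(1, 2, 2, 3, 3, 3, 10)  # long tail to the right
symmetric    <- c(1, 2, 3, 4, 5)         # mirror image around 3
left_skewed  <- -right_skewed            # long tail to the left

print(moment_skewness(right_skewed))  # positive
print(moment_skewness(symmetric))     # zero
print(moment_skewness(left_skewed))   # negative
```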

Question 5: Explain kurtosis and its types.


Answer: Kurtosis
Kurtosis is a numerical measure in statistics that is traditionally described as the sharpness
of the peak of a data distribution. More precisely, it reflects how heavy the tails of the
distribution are compared with a normal distribution.
There are 3 types of kurtosis, determined by comparing the coefficient of kurtosis $\gamma_2$
with 3. These are as follows:
Platykurtic
If the coefficient of kurtosis is less than 3, i.e. $\gamma_2 < 3$, the data distribution is
platykurtic. Being platykurtic doesn't mean that the graph is flat-topped.

Mesokurtic
If the coefficient of kurtosis is equal to 3 or approximately close to 3, i.e. $\gamma_2 = 3$,
the data distribution is mesokurtic. For a normal distribution, the kurtosis value is equal
to 3.

Leptokurtic
If the coefficient of kurtosis is greater than 3, i.e. $\gamma_2 > 3$, the data distribution
is leptokurtic and shows a sharp peak on the graph.
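The comparison with 3 can be tried out in base R. The sketch below uses the moment-based (non-excess) definition of kurtosis, the fourth central moment divided by the square of the second central moment. The helper name moment_kurtosis and the two sample vectors are assumptions made for this example.

```r
# Hypothetical helper: moment-based coefficient of kurtosis (not excess kurtosis)
moment_kurtosis <- function(v) {
  d <- v - mean(v)
  m2 <- mean(d^2)  # second central moment
  m4 <- mean(d^4)  # fourth central moment
  m4 / m2^2
}

evenly_spread <- c(1, 2, 3, 4, 5)          # no extreme values
heavy_tailed  <- c(rep(0, 8), -10, 10)     # most values at 0, two far outliers

k_even  <- moment_kurtosis(evenly_spread)
k_heavy <- moment_kurtosis(heavy_tailed)

print(k_even)   # below 3: platykurtic
print(k_heavy)  # above 3: leptokurtic
```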
