Python Unit-4


Academic Task Number: CA-2 Course code: QTT201

Date of allotment: 30-10-2023 Course title: Business Mathematics & Statistics


Date of submission: 04-12-2023 Maximum Marks: 30 Marks
Academic Task Type: Assignment
Assignment Title: "Analyzing and Comparing Measures of Dispersion"

Assignment Description: You will investigate and analyse numerous measures of dispersion often used in statistics in this project.
Dispersion measures help us understand how data points in a dataset spread out from the mean or median. You will do computations
and understand the findings in order to acquire insight into the variability within various datasets.

Instructions:

Part 1: Theory and Concepts

Begin by explaining what "measures of dispersion" are and why they are significant in statistics.

Define and explain the following dispersion measures: range, variance, standard deviation, interquartile range (IQR), skewness, and
kurtosis.

Discuss when you would choose a particular measure over another.

Part 2: Data Analysis

Select at least two datasets. You can utilise existing data or build your own. Each dataset should contain at least 20 data points.

Calculate the measures of dispersion range, variance, standard deviation, interquartile range (IQR), skewness, and kurtosis for each
dataset.

Give explanations for your calculations' outcomes. What can you learn from these measures about the variability in each dataset?

Part 3: Interpretation and Comparison

Compare and contrast the measures of dispersion for the two datasets. Which dataset exhibits more variability, and why?

Discuss the relationship between the measures of central tendency (mean and median) and the measures of dispersion in each dataset.
How do these relationships differ between the two datasets?

Part 4: Real-world Application

Find a real-world situation in which understanding dispersion metrics is crucial. Explain the background and how these measures could
be utilised to make informed decisions.

Part 5: Conclusion

Summarise your findings and the significance of measures of dispersion in statistical analysis.

Submission Guidelines:
Your assignment should be well-structured.
Show all calculations and provide clear explanations for your analysis.
Include references if you use external sources.
Ensure that your assignment is well-o

LOVELY PROFESSIONAL UNIVERSITY

School of Business Faculty of


Dr. Sakshi

Course Code: QTT201 Course Title: Business Mathematics and Statistics


Academic Task No: 2 Academic Task Title: Analysing and Comparing Measures of Dispersion
Date of Allotment: 30/10/23 Date of Submission: 04/12/23
Student Roll No: RQ2371A06 Student Reg. No: 12317752
Term: 1 Section: Q2371
Max. Marks: 30 Marks. Obtained:

Table of Contents

S. No. Contents

1. Introduction

2. Part 1: Theory and Concepts

3. Part 2: Data Analysis

4. Part 3: Interpretation and Comparison

5. Part 4: Real-world Application

6. Part 5: Conclusion

1. Introduction
In this assignment, we will learn what ‘Dispersion’ and ‘Measures of Dispersion’ mean, how dispersion is used, and why it is
important to calculate it.

We will investigate and analyse the measures of dispersion that are often used in statistics, carry out the computations, and
interpret the findings in order to gain insight into the variability within different datasets.

The measures analysed are range, variance, standard deviation, interquartile range (IQR), skewness, and kurtosis. This will
help us understand why these measures are important and how they can be applied to real-world problems.

2. Part 1: Theory and Concepts


First, let us start with the meaning of the word ‘Dispersion’. Dispersion is simply a way of describing how scattered a
set of data is. It helps us to understand the distribution of the data and to interpret its variability, i.e., to know how
homogeneous or heterogeneous the data is.

Now, what are the measures of dispersion? Measures of dispersion are numbers that describe the spread of data about a
central value, i.e., how stretched or squeezed the data is. Range, variance, standard deviation and interquartile range (IQR)
are always non-negative. Skewness and kurtosis, which are studied alongside them in this assignment, describe the shape of
the distribution and can take negative values.

Let us take an example. Suppose we have two data sets A = {3, 1, 6, 2} and B = {1, 5, 9, 10}. The population variance of A is
3.5 and the population variance of B is 12.69 (12.6875 before rounding). This shows that data set B is more variable than data
set A. The variance therefore lets us compare the two data sets on the basis of variability.
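
The two variances in this example can be checked with a few lines of Python. This is a minimal sketch using the standard library's `statistics.pvariance`; the variable names are chosen here for illustration only.

```python
from statistics import pvariance

# The two small data sets from the example above.
A = [3, 1, 6, 2]
B = [1, 5, 9, 10]

# Population variance: mean of the squared deviations from the mean.
var_a = pvariance(A)
var_b = pvariance(B)

print(var_a)  # 3.5
print(var_b)  # 12.6875

# B is spread further from its mean than A is from its own mean.
print(var_b > var_a)  # True
```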

What is the significance of calculating measures of dispersion?


The object of measuring dispersion is to ascertain the degree of deviation that exists in the data and, hence, the limits within
which the data vary in some measurable variate, attribute or quality. This object is of great importance and occupies a unique
position in statistical methods.

Measures of dispersion are also called averages of the ‘second order’, because the deviations from a measure of central tendency
are averaged a second time. They afford an estimate of the phenomena to which the given (original) data relate. This increases the
accuracy of statistical analysis and interpretation and puts us in a position to draw more dependable inferences.

Measures of dispersion are of great value in statistical analysis provided that relative measures (coefficients of dispersion) are
used when series are compared. Otherwise, the conclusions drawn will not be dependable.

Now let us discuss the different measures of dispersion one by one: Range, Variance, Standard Deviation, Interquartile Range
(IQR), Skewness, and Kurtosis.

(i) Range- The range is the difference between the maximum and minimum values of a series, i.e., between the two extreme
observations of the data set. It is the simplest and most easily understood measure of dispersion. The range gives a quick
indication of how dispersed the data is, but other measures of variability are needed to describe the dispersion of the data
about a measure of central tendency.
Merits of Range: It is very easy to calculate and understand, and therefore takes little time.
Demerits of Range: It is not based on every item of the distribution, and it is strongly affected by extreme values.
Range can be calculated by:
Range (R) = Highest value of an observation – Lowest value of an observation
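
As a minimal sketch, the range formula translates directly into Python. The helper name `value_range` and the sample values are illustrative assumptions, not part of the assignment data.

```python
def value_range(values):
    """Return the highest observation minus the lowest observation."""
    return max(values) - min(values)


sample = [3, 1, 6, 2]  # illustrative values only
print(value_range(sample))  # 5

# A single extreme value changes the range sharply, which is the
# main demerit noted above.
print(value_range(sample + [60]))  # 59
```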

(ii) Variance- The variance is a measure of variability that tells you the degree of spread in your data set. The more
spread out the data, the larger the variance is in relation to the mean. Formally, variance is the expectation of the
squared deviation of a random variable from its mean; informally, it measures how far a set of numbers is spread out
from their mean. Variance is of two types:
• Population variance: Instead of taking the absolute value of each deviation from the mean, we square the
deviations. The sum of all such squared deviations is then divided by the number of observations in the data set.
This value is called the population variance and is denoted by σ² (σ is the lower-case Greek letter sigma).
• Sample variance: Sample variance is computed in the same way from a sample, using the sample mean, except that
the sum of squared deviations is divided by one less than the number of observations (n – 1). It is an absolute
measure of dispersion and is used to check the deviation of data points with respect to the data's average.
Following are the formulas for both ungrouped data and grouped data of both population and sample variance:
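
In standard textbook notation the variance formulas can be written as follows. The symbols other than σ² are assumptions introduced here: N (or n) is the number of observations, μ and x̄ are the population and sample means, and fᵢ is the frequency of the value or class mid-point xᵢ.

```latex
% Ungrouped data
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}
\qquad
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}

% Grouped data (f_i = frequency of the value or class mid-point x_i)
\sigma^2 = \frac{\sum_i f_i (x_i - \mu)^2}{N}, \quad N = \sum_i f_i
\qquad
s^2 = \frac{\sum_i f_i (x_i - \bar{x})^2}{n - 1}, \quad n = \sum_i f_i
```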

Merits and Demerits of variance: The advantage of variance is that it treats all deviations from the mean alike
regardless of their direction. The disadvantage is that squaring gives added weight to extreme values, which can
distort the result. Also, it is not easily interpreted because it is expressed in squared units.

(iii) Standard Deviation- Standard deviation is a measure of how much variation (spread or dispersion) from the mean
exists; it indicates a typical deviation of the values from the average. It is the positive square root of the
variance. Standard deviation, the most widely used measure of dispersion, is based on all values. It is independent of
change of origin but not of change of scale. It is also useful in certain advanced statistical problems.

Standard Deviation is calculated by:
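
Standard deviation is the positive square root of the variance, so the population form divides the sum of squared deviations by N and the sample form by n – 1 before the root is taken. The sketch below illustrates both with Python's standard library; the data values are illustrative only.

```python
from math import sqrt
from statistics import pstdev, pvariance, stdev

data = [3, 1, 6, 2]  # illustrative values only

population_sd = pstdev(data)  # square root of (sum of squared deviations / N)
sample_sd = stdev(data)       # square root of (sum of squared deviations / (n - 1))

# The standard deviation is the positive square root of the variance.
assert abs(population_sd - sqrt(pvariance(data))) < 1e-12

print(round(population_sd, 4))  # 1.8708
print(round(sample_sd, 4))      # 2.1602
```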

Merits and Demerits of Standard Deviation: The merits are that it is based on every observation in the data set and is
less affected by fluctuations of sampling. The demerits are that it is harder to calculate than other measures of
dispersion and that it gives more weight to extreme values and less to those near the mean.

(iv) Interquartile Range- The interquartile range is the difference between the third and the first quartile.
Quartiles are the partition values that divide the whole series into 4 equal parts, so there are 3 quartiles. The first
quartile, Q1, is known as the lower quartile; the second quartile, Q2, is the median; and the third quartile, Q3, is
known as the upper quartile.
Merits and Demerits of Interquartile Range: The merits are that, unlike the range, it is not affected by extreme
values, and that it is useful in estimating dispersion in grouped series. The demerit is that, as a measure of
dispersion, it is most reliable only with symmetrical data series.
The interquartile range is calculated by:
Interquartile range (IQR) = Q3 – Q1
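
A minimal Python sketch of the IQR calculation follows. It assumes the "inclusive" quartile convention of `statistics.quantiles`, which interpolates linearly between the sorted observations; other quartile conventions can give slightly different values. The helper name and data are illustrative.

```python
from statistics import quantiles


def interquartile_range(values):
    """Return (Q1, Q3, IQR) using inclusive, linearly interpolated quartiles."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3, q3 - q1


q1, q3, iqr = interquartile_range([1, 2, 3, 4, 5, 6, 7, 8])
print(q1, q3, iqr)  # 2.75 6.25 3.5
```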

(v) Skewness- Skewness is a measure of asymmetry, i.e., of the deviation of a random variable’s distribution from a
symmetric distribution (like the normal distribution). In a normal distribution, Mean = Median = Mode. Skewness can be
divided into two categories:

• Positive Skewness- Mean > Median > Mode

• Negative Skewness- Mean < Median < Mode

Merits and Demerits of Skewness: The merits are that it is useful for assessing the performance of investment returns
and that it takes the extremes of the distribution into account when analysing a data set. The demerit is that,
because skewness can range from negative to positive infinity, it is difficult to predict the trend in the data set
from it.

The degree of skewness in a distribution can be measured both in the absolute and relative sense. The formulas are as
follows:
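
Two commonly used measures are sketched below in Python, on the assumption that they are among those intended here. The absolute measure is Mean – Median (or Mean – Mode); Karl Pearson's relative measure divides three times that difference by the standard deviation; the moment coefficient divides the third central moment by the cube of the standard deviation. Function names and data are illustrative.

```python
from statistics import mean, median, pstdev


def pearson_median_skewness(values):
    """Relative measure: 3 * (mean - median) / standard deviation."""
    return 3 * (mean(values) - median(values)) / pstdev(values)


def moment_skewness(values):
    """Moment coefficient: third central moment / (standard deviation ** 3)."""
    m = mean(values)
    third_moment = sum((x - m) ** 3 for x in values) / len(values)
    return third_moment / pstdev(values) ** 3


right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]  # illustrative values only
symmetric = [1, 2, 3, 4, 5]

# A long right tail pulls the mean above the median: positive skewness.
print(pearson_median_skewness(right_tailed) > 0)  # True
print(moment_skewness(right_tailed) > 0)          # True

# Symmetric data has zero skewness on both measures.
print(pearson_median_skewness(symmetric) == 0)    # True
```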

(vi) Kurtosis- The word ‘Kurtosis’ comes from a Greek word meaning ‘humped’. In statistics, it refers to the degree of
flatness or peakedness in the region about the mode of a frequency curve. The measure of kurtosis describes the degree
of concentration of frequencies (observations) in a given distribution, i.e., whether the observed values are
concentrated more around the mode (a peaked curve) or away from the mode towards both tails of the frequency curve.

The fourth standardized moment α₄ (or β₂) is a measure of the flatness or peakedness of a single-humped distribution.
For a normal distribution α₄ = β₂ = 3, so that γ₂ = β₂ – 3 = 0. A distribution with β₂ > 3 is peaked more sharply than
the normal curve and is known as leptokurtic (narrow), while a distribution with β₂ < 3 is termed platykurtic (broad).
The value of β₂ is helpful in selecting an appropriate measure of central tendency and variation to describe a
frequency distribution. For example, if β₂ = 3, the mean is preferred; if β₂ > 3 (leptokurtic distribution), the median
is preferred; while for β₂ < 3 (platykurtic distribution), the quartile range is suitable.
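
The classification above can be sketched in Python by computing β₂ as the fourth central moment divided by the square of the second central moment. The function names, the tolerance and the label "mesokurtic" for β₂ = 3 are assumptions added here for illustration.

```python
from statistics import mean


def beta2(values):
    """Fourth standardized moment: m4 / m2**2 (population moments)."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return m4 / m2 ** 2


def classify_kurtosis(values, tol=1e-9):
    """Compare beta2 with 3, the value for a normal distribution."""
    gamma2 = beta2(values) - 3  # excess kurtosis
    if gamma2 > tol:
        return "leptokurtic"
    if gamma2 < -tol:
        return "platykurtic"
    return "mesokurtic"


flat = [1, 2, 3, 4, 5]            # evenly spread values
peaked = [0] * 8 + [-10, 10]      # most values at the centre, two far out

print(classify_kurtosis(flat))    # platykurtic
print(classify_kurtosis(peaked))  # leptokurtic
```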

The figure shows the different types of kurtosis curve.

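Merits and Demerits of Kurtosis: The advantage of kurtosis is that it summarises in a single number how peaked or flat
a distribution is compared with the normal curve. The disadvantage is that it depends heavily on extreme values,
because the deviations are raised to the fourth power. Note that β₂ itself is never negative, whereas the excess
kurtosis γ₂ = β₂ – 3 can be.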

When to use which measure of dispersion:

Now let us discuss which measure of dispersion is best suited to a given dataset.

(i) Range- The range gives good insight into the variability of a distribution when there are no extreme values.
It is best used together with a measure of central tendency, and it tells you the span of the distribution.
(ii) Variance- Variance is one of the best measures of dispersion for interval and ratio level data. It uses the
whole data set, and it can be used to compare different datasets in statistical tests like ANOVA.
(iii) Standard Deviation- Standard deviation is helpful when we have to measure how spread out the values in a data
set are, in the same units as the data. It is the best measure of dispersion for understanding the variability of
our data as well as for comparing datasets or distributions.
(iv) Interquartile Range- The interquartile range is most useful when we want to compare the variability of
different data sets, especially when they have skewed distributions or outliers. It can also help us to identify
outliers and to judge whether our data is normally distributed. We should use the interquartile range when we
have to understand the spread of the middle half of our data.
(v) Skewness and Kurtosis- We should use skewness and kurtosis when we want to measure the shape of a dataset's
distribution rather than only its spread. They can also point to the presence of outliers in our dataset.
For example, we can use skewness and kurtosis to compare the income distribution of different countries, or the
exam scores of different classes.
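
The use of the interquartile range to identify outliers, mentioned in point (iv), can be sketched as follows. The 1.5 × IQR rule used here is one common convention and is an assumption of this sketch, as are the function name and the exam-score values.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]


scores = [52, 55, 57, 58, 60, 61, 63, 64, 66, 120]  # hypothetical exam scores
print(iqr_outliers(scores))  # [120]
```

Because the fences are built from the quartiles, the extreme value 120 does not inflate the yardstick used to detect it, which is why the IQR suits skewed data or data with outliers.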

3. Part 2: Data Analysis
Now we will select two datasets with 20 observations each, calculate the measures of dispersion (Range, Variance, Standard
Deviation, Interquartile Range, Skewness and Kurtosis) for them, and use the outcomes to measure and compare the variability
of the two datasets.

As the two datasets, I will take the wheat production of Rajasthan from 2001-02 to 2020-21 as the first dataset (Dataset A)
and the wheat production of Bihar from 2001-02 to 2020-21 as the second dataset (Dataset B). The data were collected from the
official website of the Reserve Bank of India and arranged in Excel. The data for the two states are as follows:

[Table: wheat production of Rajasthan and Bihar, 2001-02 to 2020-21]

Now that we have collected the datasets, we will calculate their measures of dispersion, i.e., Range, Variance, Standard
Deviation, Interquartile Range, Skewness and Kurtosis, with the help of Microsoft Excel. First, I will calculate the measures
of dispersion for ‘Dataset A’. The steps are:

Step 1: Select the ‘Data’ tab from the menu bar and then select the ‘Data Analysis’ option at the top right corner.

Step 2: After that the ‘Data Analysis Dialog Box’ will appear. We will select ‘Descriptive Statistics’ option and click ok.

Step 3: After that the ‘Descriptive Statistics Dialog Box’ will appear. There, we will enter the input range and output range.
Then we will select summary statistics option. After that we click ok.

Step 4: After entering the information in descriptive statistics dialog box, we get the following measures of dispersion.
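
For readers without Excel, the same summary can be sketched in Python. This assumes that the Descriptive Statistics output reports the sample variance and standard deviation together with the sample-adjusted skewness and excess kurtosis documented for Excel's SKEW and KURT functions. The function name `describe` is hypothetical, and the input values are placeholders, not the wheat production data.

```python
from math import sqrt
from statistics import mean, variance


def describe(values):
    """Return range, sample variance, sample SD, skewness and excess kurtosis."""
    n = len(values)
    if n < 4:
        raise ValueError("the kurtosis formula needs at least 4 observations")
    m = mean(values)
    s2 = variance(values)  # sample variance, divisor n - 1
    s = sqrt(s2)
    z3 = sum(((x - m) / s) ** 3 for x in values)
    z4 = sum(((x - m) / s) ** 4 for x in values)
    # Sample-adjusted formulas (assumed to match SKEW and KURT).
    skewness = n / ((n - 1) * (n - 2)) * z3
    kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
                - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return {
        "mean": m,
        "range": max(values) - min(values),
        "variance": s2,
        "standard_deviation": s,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


# Hypothetical placeholder values, NOT the wheat production data.
summary = describe([1, 2, 3, 4, 5])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```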

Now only Interquartile Range is left to be calculated. We will calculate it by using formula bar as follows:

Step 1: First, we label one cell for Quartile 1 and one cell for Quartile 3. In the cell next to each label, we type
‘=QUARTILE.INC(B2:B21,1)’ in the formula bar to calculate Quartile 1 and ‘=QUARTILE.INC(B2:B21,3)’ to calculate
Quartile 3.

Step 2: After that we label a new cell ‘Interquartile Range’ and in the cell next to it we enter the formula “=I20-I19” to
subtract the calculated Q1 from the calculated Q3, which gives the interquartile range:

As we can see in the above screenshot, we got Q1 and Q3 using the formula and got Q1 as 6889.1 and Q3 as 9482.35. Thus,
the interquartile range= Q3-Q1= 9482.35 - 6889.1= 2593.25

To get the measures of dispersion for ‘Dataset B’, the steps are the same. Following them, we get the result below:

Now let us calculate the Interquartile Range for ‘Dataset B’. The steps are the same as for Dataset A. Following them, we get
the result below:

So, we have the measures of dispersion for both Dataset A and Dataset B, i.e., for both Rajasthan and Bihar.

Now let us discuss the outcomes of these calculations and what they tell us about the variability of each dataset.

(i) Range- For Dataset A, the range is 6157.4. Rajasthan’s production varies from 4878 to 11035.4.

For Dataset B, the range is 3217.6. Bihar’s production varies from 3239 to 6456.6.

(ii) Variance- For Dataset A, the variance is 3433745. It measures the variability of Rajasthan’s production since 2001-02,
i.e., how far the yearly figures are spread out from the mean production of 8114.21.

For Dataset B, the variance is 836589.5. It measures how far Bihar’s yearly production is spread out from its mean
production of 4650.4. Rajasthan’s variance is roughly four times that of Bihar.

(iii) Standard Deviation- Like the variance, the standard deviation measures the variability of our data, but in the same
units as the data. It is the square root of the variance.

In Dataset A, the variance of 3433745 gives a standard deviation of about 1853.04, so Rajasthan’s production typically
deviates from its mean by about that amount.

In Dataset B, the variance of 836589.5 gives a standard deviation of about 914.65 for Bihar’s production.

(iv) Interquartile Range- In Dataset A, the interquartile range is 2593.25. It means that the middle 50% of Rajasthan’s
production values lie between 6889.1 and 9482.35.

In Dataset B, the interquartile range is 1145.2. It means that the middle 50% of Bihar’s production values lie between
4027.2 and 5172.4.

(v) Skewness- For Dataset A, the skewness is -0.06511, which means that the data is very slightly negatively skewed (left
skewed). Since the value is close to zero, the distribution is almost symmetric.

For Dataset B, the skewness is 0.465519, which tells us that the data is positively skewed (right skewed).

(vi) Kurtosis- The kurtosis for Dataset A is -1.18053, which is negative. (Excel reports kurtosis relative to the normal
distribution, for which the value is 0.) It means that the data is ‘Platykurtic’, i.e., flatter than the normal
distribution, with thinner or lighter tails.

The kurtosis for Dataset B is -0.39507, which is also negative. Dataset B is therefore also ‘Platykurtic’, though less so
than Dataset A.
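
Because the two states have very different mean production, a relative measure helps when comparing their variability. The sketch below uses only the means and variances reported above to derive the standard deviation and the coefficient of variation (standard deviation as a percentage of the mean). The coefficient of variation is an addition made here for illustration and was not part of the Excel output.

```python
from math import sqrt

# Means and variances as reported in the results above.
reported = {
    "Rajasthan": {"mean": 8114.21, "variance": 3433745},
    "Bihar": {"mean": 4650.4, "variance": 836589.5},
}


def coefficient_of_variation(mean, variance):
    """Standard deviation expressed as a percentage of the mean."""
    return sqrt(variance) / mean * 100


cv = {state: coefficient_of_variation(**stats) for state, stats in reported.items()}

for state, stats in reported.items():
    sd = sqrt(stats["variance"])
    print(f"{state}: SD = {sd:.2f}, CV = {cv[state]:.1f}%")
# Rajasthan: SD = 1853.04, CV = 22.8%
# Bihar: SD = 914.65, CV = 19.7%
```

On this relative basis Rajasthan's production is still the more variable of the two, but the gap is much smaller than the raw variances suggest.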
