
KONERU LAKSHMAIAH EDUCATION FOUNDATION
(Deemed to be University estd. u/s 3 of the UGC Act, 1956)
(NAAC Accredited "A++" Grade University)
Green Fields, Guntur District, A.P., India – 522502

PROGRAM: B.Tech. III Year
A.Y. 2023-24, Even Semester
22CS2227 – DATA ANALYTICS & VISUALIZATION (DAV)

CO 2

Session 1: Measures of Variation


Overview

1. Course Description
The Data Analytics & Visualization course is designed to provide students with a
comprehensive understanding of the fundamental principles, techniques, and tools used in
data analytics and visualization. In today's data-driven world, the ability to analyze and
visualize data is essential for making informed decisions and gaining valuable insights. This
course covers various aspects of data analytics, including data collection, cleaning,
exploration, statistical analysis, and visualization techniques using popular tools and
programming languages.
2. Aim

The primary aim of data analytics and visualization is to transform raw data into
meaningful insights.

3. Instructional Objectives (Course Objectives)

This Session is designed to discuss:


1. Applications of Data Science in various fields
2. Data Security Issues
3. Data Collection Strategies
4. Data Pre-Processing Overview
4. Learning Outcomes (Course Outcome)

At the end of this session, you should be able to:


1. Determine the reliability of an average by indicating how far the average is
representative of the entire data.
2. Compare two or more distributions with regard to their variability.
3. Recognise that measuring variability is of great importance for further statistical analysis.

5. Module Description (CO-2 Description)

Measure of Variations" is a statistical concept that refers to the quantification of how data
points within a dataset differ or deviate from the central tendency, such as the mean,
median, or mode. It is a crucial topic in statistics and data analysis, as it helps us
understand the spread, dispersion, or variability in a dataset. Variability in data provides
essential insights into the consistency, stability, or predictability of a phenomenon, and it
is used in various fields, including business, science, economics, and healthcare

6. Session Introduction

A measure of variation, also known as a measure of dispersion, is a statistical value or
metric that quantifies the degree of spread or variability in a dataset. It provides insights
into how data points are distributed around the central tendency, such as the mean,
median, or mode. In other words, it tells you how much individual data points deviate
from the average or typical value.

7. Session description

Different types of measures of variation:


1. Range:
The range is the simplest measure of variation and is calculated as the difference between
the maximum and minimum values in a dataset. It provides a basic understanding of how
spread out the data is.
The formula for calculating the range of a dataset is straightforward:

Range = Maximum Value - Minimum Value

To find the range of a dataset, follow these steps:

Identify the maximum value (the largest value) in the dataset.
Identify the minimum value (the smallest value) in the dataset.
Subtract the minimum value from the maximum value to find the range.

For example, if you have a dataset of exam scores:

Scores: 78, 85, 92, 64, 97

To calculate the range, you would:


Identify the maximum value: Max Value = 97
Identify the minimum value: Min Value = 64
Calculate the range: Range = 97 - 64 = 33
So, in this case, the range of the exam scores is 33.
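
The same calculation in a short Python sketch (standard library only, using the exam scores above):

scores = [78, 85, 92, 64, 97]
max_value = max(scores)                  # 97
min_value = min(scores)                  # 64
data_range = max_value - min_value
print(data_range)                        # 33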

2. Quartile Deviation:
Quartile deviation, also known as the semi-interquartile range, is a measure of statistical
dispersion or variability in a dataset. It is closely related to quartiles and the
interquartile range (IQR). The quartile deviation quantifies the spread of data points
around the median by considering the middle 50% of the data.

Here's how you can calculate the quartile deviation:


Calculate the first quartile (Q1) and the third quartile (Q3) of the dataset. Q1 is the 25th
percentile, and Q3 is the 75th percentile.
Find the interquartile range (IQR) by subtracting Q1 from Q3: IQR = Q3 - Q1
Calculate the quartile deviation by dividing the IQR by 2:
Quartile Deviation = (Q3 - Q1) / 2
Let's illustrate the calculation of quartile deviation with a simple example. Suppose we
have a dataset of exam scores for a group of students:

Exam Scores: 75, 82, 88, 92, 96, 100, 105, 110

Calculate the first quartile (Q1) and the third quartile (Q3) for the dataset:

Q1 (25th percentile) = 82
Q3 (75th percentile) = 105

Find the interquartile range (IQR) by subtracting Q1 from Q3:


IQR = Q3 - Q1
IQR = 105 - 82
IQR = 23

Calculate the quartile deviation by dividing the IQR by 2:

Quartile Deviation = IQR / 2


Quartile Deviation = 23 / 2
Quartile Deviation = 11.5

In this example, the quartile deviation is 11.5. This means that the middle 50% of the
exam scores (from Q1 to Q3) has a spread or variability of 11.5 points. The quartile
deviation is a robust measure of variation that is less affected by extreme values or
outliers in the dataset.
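
A minimal Python sketch (assuming NumPy is available) for the same dataset is shown below. Note that NumPy's default percentile interpolation differs from the quartile convention used in the worked example above, so the quartiles and the quartile deviation come out slightly different; the procedure, however, is the same.

import numpy as np

scores = np.array([75, 82, 88, 92, 96, 100, 105, 110])
q1 = np.percentile(scores, 25)           # 86.5 with NumPy's default (linear) method
q3 = np.percentile(scores, 75)           # 101.25
iqr = q3 - q1                            # 14.75
quartile_deviation = iqr / 2             # 7.375
print(q1, q3, iqr, quartile_deviation)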

3. Mean Deviation:
Mean Deviation, also known as the Average Deviation, is a measure of statistical
dispersion or variability in a dataset. It quantifies the average absolute difference
between each data point in the dataset and the mean (average) of the dataset. Mean
Deviation provides insights into how much individual data points deviate from the
central tendency (mean) and gives an indication of the overall spread or dispersion of
the data.

The formula for calculating Mean Deviation for a dataset with 'n' data points is as
follows:
Mean Deviation = Σ |X - μ| / n
Where:
Σ represents the summation symbol, meaning you should sum up the values for all data
points.
|X - μ| represents the absolute difference between each data point 'X' and the mean 'μ'.
'n' is the total number of data points in the dataset.
To calculate the Mean Deviation:

Calculate the mean (average) of the dataset.


For each data point, find the absolute difference between that data point and the mean.
Sum up all these absolute differences.
Divide the sum by the total number of data points 'n' to get the Mean Deviation.
Let's calculate the Mean Deviation for a small dataset to illustrate how it works. Suppose
we have a dataset of test scores for a group of students:
Test Scores: 85, 92, 88, 76, 90
Calculate the mean (average) of the dataset:
Mean (μ) = (85 + 92 + 88 + 76 + 90) / 5 = 431 / 5 = 86.2
Find the absolute difference between each data point and the mean:
|85 - 86.2| = 1.2
|92 - 86.2| = 5.8
|88 - 86.2| = 1.8
|76 - 86.2| = 10.2
|90 - 86.2| = 3.8
Sum up all these absolute differences:
Σ |X - μ| = 1.2 + 5.8 + 1.8 + 10.2 + 3.8 = 22.8
Divide the sum by the total number of data points 'n' (which is 5 in this case):
Mean Deviation = Σ |X - μ| / n = 22.8 / 5 = 4.56
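
A short Python sketch (standard library only) that reproduces this calculation:

scores = [85, 92, 88, 76, 90]
mean = sum(scores) / len(scores)                                   # 86.2
mean_deviation = sum(abs(x - mean) for x in scores) / len(scores)  # 4.56 (up to rounding)
print(mean, mean_deviation)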

4. Variance:
It is a statistical measure of the spread or dispersion of a set of data points in a dataset. It
quantifies how much individual data points deviate from the mean (average) of the
dataset. In other words, variance measures the average of the squared differences
between each data point and the mean.

The formula for calculating the variance of a dataset with 'n' data points is as follows:
Variance (σ²) = Σ (X - μ)² / n

Where:
Σ represents the summation symbol, meaning you should sum up the values for all data
points.
(X - μ) represents the difference between each data point 'X' and the mean 'μ'.
(X - μ)² represents the squared difference between each data point and the mean.
'n' is the total number of data points in the dataset.
To calculate the variance:
Calculate the mean (average) of the dataset.
For each data point, find the squared difference between that data point and the mean.
Sum up all these squared differences.
Divide the sum by the total number of data points 'n' to get the variance.
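
As a quick illustration (plain Python, reusing the test scores from the Mean Deviation example), the population variance works out as follows. Note that dividing by 'n' gives the population variance; many libraries divide by 'n - 1' by default, which gives the sample variance instead.

scores = [85, 92, 88, 76, 90]
n = len(scores)
mean = sum(scores) / n                                 # 86.2
variance = sum((x - mean) ** 2 for x in scores) / n    # 156.8 / 5 = 31.36
print(variance)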

5. Standard Deviation:
Standard deviation is a widely used statistical measure of the amount of variation or
dispersion in a dataset. It quantifies how individual data points deviate from the
mean (average) of the dataset. Standard deviation is a more interpretable measure
than variance because it's in the same unit as the original data, unlike the squared
unit of variance.

The standard deviation (σ) is calculated using the following formula for a dataset
with 'n' data points:

Standard Deviation (σ) = √[Σ (X - μ)² / n]

Where:

√ represents the square root.


Σ represents the summation symbol, meaning you should sum up the values for all
data points.
(X - μ) represents the difference between each data point 'X' and the mean 'μ'.
(X - μ)² represents the squared difference between each data point and the mean.
'n' is the total number of data points in the dataset.
To calculate the standard deviation:

Calculate the mean (average) of the dataset.


For each data point, find the squared difference between that data point and the
mean.
Sum up all these squared differences.
Divide the sum by the total number of data points 'n'.
Take the square root of this result to obtain the standard deviation.
The standard deviation provides insights into the spread or dispersion of data. A
larger standard deviation indicates that data points are more spread out from the
mean, suggesting greater variability in the dataset. A smaller standard deviation
indicates that data points are closer to the mean, indicating less variability and a
more concentrated distribution.

Standard deviation is used in various fields, including statistics, finance, science, and
quality control, to assess data variability, compare datasets, and make informed
decisions. It is an essential tool for understanding the consistency, reliability, and
predictability of data.
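
Continuing the same example, a minimal sketch using Python's built-in statistics module shows both conventions:

import statistics

scores = [85, 92, 88, 76, 90]
population_sd = statistics.pstdev(scores)   # divides by n: sqrt(31.36) = 5.6
sample_sd = statistics.stdev(scores)        # divides by n - 1: about 6.26
print(population_sd, sample_sd)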

6. Relative range" doesn't have a standard or widely recognized definition in


statistics or mathematics. It's possible that the term may be used in a specific
context or industry where it has a specialized meaning, but without more
information or context, it's challenging to provide examples or explanations
related to "relative range.
7. The "Coefficient of Quartile Deviation" (CQD), also known as the "Quartile
Deviation Coefficient," is a statistical measure that represents the relative
variability or dispersion of a dataset in terms of its interquartile range (IQR). It is a
dimensionless measure, often expressed as a percentage, that allows for the
comparison of the spread of data between different datasets, even if they have
different units or scales.
The formula for calculating the Coefficient of Quartile Deviation (CQD) is as follows:
CQD = (IQR / Median) * 100
Where:
CQD is the Coefficient of Quartile Deviation.
IQR is the Interquartile Range, which is the range between the first quartile (Q1) and
the third quartile (Q3) of the dataset.
Median is the middle value of the dataset.
To calculate the CQD:
Calculate the interquartile range (IQR) by subtracting the first quartile (Q1) from the
third quartile (Q3):
IQR = Q3 - Q1
Calculate the median, which is the middle value of the dataset.
Divide the IQR by the median.
Multiply the result by 100 to express the CQD as a percentage.
The Coefficient of Quartile Deviation is useful for comparing the relative spread or
variability of datasets with different central tendencies and scales. A smaller CQD
indicates that the data is relatively more consistent and less variable, while a larger
CQD indicates greater variability within the middle 50% of the dataset. It is a robust
measure of variation that is less influenced by outliers compared to other measures
like the coefficient of variation (CV), which uses the mean instead of the median.
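
A small sketch implementing the CQD formula given above (IQR divided by the median, expressed as a percentage), using the quartile values from the earlier Quartile Deviation example; note that some textbooks instead define the coefficient of quartile deviation as (Q3 - Q1) / (Q3 + Q1), so the value depends on the convention used.

def coefficient_of_quartile_deviation(q1, q3, median):
    # CQD as defined above: interquartile range relative to the median, in percent
    return (q3 - q1) / median * 100

# Q1 = 82 and Q3 = 105 from the earlier example; the median of
# 75, 82, 88, 92, 96, 100, 105, 110 is (92 + 96) / 2 = 94.
print(coefficient_of_quartile_deviation(82, 105, 94))   # about 24.47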

8. The "Coefficient of Mean Deviation" (CMD), also known as the "Mean


Deviation Coefficient," is a statistical measure used to express the relative
variability or dispersion of a dataset in terms of its mean deviation. The mean
deviation is a measure of how data points in a dataset deviate from the mean
(average) of the dataset.

The formula for calculating the Coefficient of Mean Deviation (CMD) is as follows:
CMD = (Mean Deviation / Mean) * 100
Where:
CMD is the Coefficient of Mean Deviation.
Mean Deviation is the average absolute difference between each data point in the
dataset and the mean.
Mean is the average value of the dataset.
To calculate the CMD:
Calculate the Mean Deviation. It's the average of the absolute differences between
each data point and the mean.
Calculate the Mean, which is the average value of the dataset.
Divide the Mean Deviation by the Mean.
Multiply the result by 100 to express the CMD as a percentage.
The Coefficient of Mean Deviation allows for the comparison of the relative spread or
variability of datasets with different central tendencies and scales. A smaller CMD
indicates that the data is relatively more consistent and less variable around the mean,
while a larger CMD suggests greater variability. It is a measure that provides
information about the distribution and dispersion of data in relation to the mean.
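
A minimal sketch (plain Python, reusing the test scores from the Mean Deviation example) of the CMD formula above:

scores = [85, 92, 88, 76, 90]
mean = sum(scores) / len(scores)                                   # 86.2
mean_deviation = sum(abs(x - mean) for x in scores) / len(scores)  # about 4.56
cmd = mean_deviation / mean * 100
print(cmd)                                                         # about 5.29 percent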
9. The "Coefficient of Variation" (CV), also known as the "Relative Standard
Deviation," is a statistical measure that expresses the relative variability or
dispersion of a dataset as a ratio or percentage of the standard deviation to the
mean (average). It is a dimensionless measure that allows for the comparison of
the spread of data between different datasets, even if they have different units or
scales.
The formula for calculating the Coefficient of Variation (CV) is as follows:
CV = (Standard Deviation / Mean) * 100
Where:
CV is the Coefficient of Variation.
Standard Deviation is a measure of the spread or dispersion of the data.
Mean is the average value of the dataset.
To calculate the CV:
Calculate the Standard Deviation, which quantifies how data points deviate from the
mean.
Calculate the Mean, which is the average value of the dataset.
Divide the Standard Deviation by the Mean.
Multiply the result by 100 to express the CV as a percentage.
The Coefficient of Variation is particularly useful for comparing the relative
variability of datasets with different central tendencies and scales. A smaller CV
indicates that the data is relatively more consistent and less variable around the mean,
while a larger CV suggests greater relative variability. It is commonly used in various
fields, including finance, quality control, and scientific research, to assess the spread
of data and make meaningful comparisons between datasets with different units or
measurements.
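
A short sketch (Python's statistics module, same test scores, population standard deviation) of the CV formula above:

import statistics

scores = [85, 92, 88, 76, 90]
mean = statistics.mean(scores)       # 86.2
sd = statistics.pstdev(scores)       # 5.6
cv = sd / mean * 100
print(cv)                            # about 6.5 percent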
10. Standard scores, also known as z-scores or standardized values, are a way to
measure how many standard deviations a particular data point is from the mean
(average) of a dataset. They are used to standardize data, making it easier to
compare and analyze values from different datasets or to identify outliers within a
dataset.

The formula to calculate a standard score (z-score) for a data point 'X' in a dataset is
as follows:

z = (X - μ) / σ

Where 'μ' is the mean of the dataset and 'σ' is its standard deviation.
If the z-score is 0, it means the data point is exactly at the mean.


If the z-score is positive, it means the data point is above the mean.
If the z-score is negative, it means the data point is below the mean.
Standard scores allow you to understand where a data point falls in relation to the
mean and how it compares to other data points in the same dataset. They are
particularly useful for identifying outliers or extreme values in a dataset because
data points with high positive or negative z-scores are considered unusual
compared to the rest of the data.

Standard scores are commonly used in fields such as statistics, economics, and
psychology, where data from different sources or studies need to be compared and
analyzed on a standardized scale. They help make data analysis and interpretation
more consistent and meaningful.
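
To illustrate, a minimal sketch (plain Python, same test scores as earlier) that converts each value into a z-score using the population standard deviation:

scores = [85, 92, 88, 76, 90]
mean = sum(scores) / len(scores)                                    # 86.2
sd = (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5    # 5.6
z_scores = [(x - mean) / sd for x in scores]
print([round(z, 2) for z in z_scores])   # [-0.21, 1.04, 0.32, -1.82, 0.68]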
8. Activities/ Case studies/related to the session

Activities that can be conducted to analyze and address measures of variation in data
include:

Descriptive Statistics: Calculate basic measures of variation such as range, interquartile
range (IQR), variance, and standard deviation to understand the spread of the data.

Data Visualization: Create visual representations of the data, such as histograms, box
plots, or scatterplots, to gain a visual understanding of the distribution and
variability.

Outlier Detection: Identify and investigate potential outliers or extreme values in the
dataset. Determine whether these values are valid data points or data errors and
address them accordingly.

Sampling Analysis: Evaluate the sampling method used to collect the data. Ensure the
sample size and sampling frequency are appropriate for the analysis and for
detecting variations.

Root Cause Analysis: Investigate the causes of variation in the data. Identify the factors
contributing to the variability and determine if they can be controlled or reduced.

9. Examples & contemporary extracts of articles/ practices to convey the idea


of the session

10. SAQ's-Self Assessment Questions

1) Which of the following is not a measure of variation?
a. Standard Deviation
b. Range
c. Median
d. Variance
2) The range of a dataset is calculated as:
a. The difference between the largest and smallest values.
b. The average of all data points.
c. The interquartile range.
d. The sum of the data points.

3) Which measure of variation is more robust against outliers compared to the standard
deviation?

a. Range
b. Interquartile Range (IQR)
c. Variance
d. Mean Absolute Deviation (MAD)

11. Summary
Measures of variation are the way to extract meaningful information from a set of
data. Variability provides a lot of information about the data; some of what it tells us is
listed below:
It shows how far data items lie from each other.
It shows how far data points lie from the centre of the distribution.
It complements measures of central tendency by describing the spread around them.
It provides a descriptive picture of how the data are distributed.

12. Terminal Questions


1. What are measures of variation in statistics, and why are they important in data analysis?
Provide examples of situations where understanding variation is crucial.
2. Explain the concept of range as a measure of variation. How is it calculated, and what are
its limitations in characterizing the variability in a dataset? Provide a real-world example
to illustrate.
3. Variance and standard deviation are commonly used measures of variation. Describe the
differences between these two measures and discuss the advantages of using one over the
other in specific analytical scenarios.
4. Discuss the coefficient of variation (CV) as a relative measure of variation. How is it
computed, and why is it particularly useful when comparing datasets with different scales
or units of measurement?
5. When should interquartile range (IQR) be used as a measure of variation instead
of the standard deviation? Explain how the IQR is calculated and its significance in
analyzing skewed or non-normally distributed data.

13. Case Studies (Co Wise)

Case Study: Assessing Variation in Electronic Component Production

Background:
ABC Electronics is a manufacturer of electronic components used in various consumer
electronics. They are particularly concerned about the resistance values of a specific type
of resistor that they produce, which is a critical component in their products. Variability in
resistance can lead to performance issues or even product failure.

Objective:
The company aims to evaluate the variation in the resistance values of the resistors
produced by their manufacturing process. They want to identify whether the process is
stable and how consistent the resistors are in meeting the desired resistance specifications.

Data Collection:
ABC Electronics collects resistance measurements from a random sample of 1000
resistors from a recent production run. These measurements represent the resistance values
of individual resistors in ohms. The dataset is as follows:

[100.2, 99.8, 100.5, 100.0, 100.3, ...]

Measures of Variation Analysis:

Range: The first step is to calculate the range, which is the difference between the
maximum and minimum values in the dataset. In this case, Range = 101.5 ohms (max) -
98.7 ohms (min) = 2.8 ohms.

Interquartile Range (IQR): To assess the central spread, the company calculates the IQR.
After arranging the data in ascending order, the first quartile (Q1) is found at the 25th
percentile, and the third quartile (Q3) is found at the 75th percentile. IQR = Q3 - Q1. The
IQR indicates the middle 50% of the data's spread.

Variance and Standard Deviation: To understand the overall variability, ABC Electronics
calculates the variance and standard deviation of the resistance values. Variance provides
an average measure of the squared differences from the mean, while the standard deviation
is the square root of the variance.

Mean Absolute Deviation (MAD): The company also computes the MAD to evaluate the
average absolute deviation of data points from the mean resistance value.
Results and Interpretation:

Range: 2.8 ohms


IQR: 0.9 ohms
Variance: [Calculated value]
Standard Deviation: [Calculated value]
MAD: [Calculated value]
The results of these measures indicate that there is some level of variation in the resistance
values of the resistors. The range and IQR values suggest that the central 50% of the data
is relatively consistent, with an IQR of 0.9 ohms. However, the range is relatively wide at
2.8 ohms, indicating the presence of outliers or extreme values.

The variance and standard deviation provide a more comprehensive view of the overall
variation in resistance values. The company can use these statistics to track the
consistency of the manufacturing process over time and assess the impact of any process
improvements on reducing variation.

The MAD provides an average measure of how far data points deviate from the mean and
can be used for assessing the average dispersion of resistance values.

Based on these measures, ABC Electronics can make informed decisions about the quality
of their manufacturing process and implement strategies to reduce variability if necessary.
Reducing variation in resistance values will lead to more reliable and consistent electronic
components, ultimately improving their product quality.
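
As a sketch of how this analysis could be run (assuming Python with NumPy; only the five sample values quoted above are used here, since the full 1000-measurement dataset is not reproduced in this handout, so the printed numbers are purely illustrative):

import numpy as np

resistances = np.array([100.2, 99.8, 100.5, 100.0, 100.3])   # illustrative subset only

data_range = resistances.max() - resistances.min()
q1, q3 = np.percentile(resistances, [25, 75])
iqr = q3 - q1
variance = resistances.var()                                  # population variance (ddof=0)
std_dev = resistances.std()
mad = np.mean(np.abs(resistances - resistances.mean()))       # mean absolute deviation
print(data_range, iqr, variance, std_dev, mad)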

14. Answer Key

15. Glossary

16. References

Books:
1. Fry, Visualizing Data. O’Reilly Media, 2008, ISBN 0596514557.
2. Munzner, Visualization Analysis and Design, 2014, ISBN 1466508914
3. Ware, Information Visualization: Perception for Design, 3rd ed. Morgan Kaufmann,
2012, ISBN 0123814642.

Reference Books:
1. Paulraj Ponniah, "Data Modeling Fundamentals: A Practical Guide for IT Professionals".
2. Stephen Few, "Information Dashboard Design: The Effective Visual Communication of
Data", O'Reilly, 2006.
17. Keywords
Data Modeling, Data Abstraction, Visual encoding, Filtering and aggregation, Spatial
Data, Dash board.
