Measures of Variation, Quartiles and Percentiles, Skewness and Kurtosis
1. Course Description
The Data Analytics & Visualization course is designed to provide students with a
comprehensive understanding of the fundamental principles, techniques, and tools used in
data analytics and visualization. In today's data-driven world, the ability to analyze and
visualize data is essential for making informed decisions and gaining valuable insights. This
course covers various aspects of data analytics, including data collection, cleaning,
exploration, statistical analysis, and visualization techniques using popular tools and
programming languages.
2. Aim
The primary aim of data analytics and visualization is to transform raw data into
meaningful insights.
"Measure of variation" is a statistical concept that refers to the quantification of how data
points within a dataset differ or deviate from a measure of central tendency, such as the
mean, median, or mode. It is a crucial topic in statistics and data analysis, as it helps us
understand the spread, dispersion, or variability in a dataset. Variability in data provides
essential insights into the consistency, stability, or predictability of a phenomenon, and it
is used in various fields, including business, science, economics, and healthcare.
6. Session Introduction
7. Session Description
2. Quartile Deviation:
Quartile deviation, also known as the semi-interquartile range, is a measure of statistical
dispersion or variability in a dataset. It is closely related to quartiles and the
interquartile range (IQR). The quartile deviation quantifies the spread of data points
around the median by considering the middle 50% of the data.
Exam Scores: 75, 82, 88, 92, 96, 100, 105, 110
Calculate the first quartile (Q1) and the third quartile (Q3) for the dataset:
Q1 (25th percentile) = 82
Q3 (75th percentile) = 105
Quartile Deviation = (Q3 - Q1) / 2 = (105 - 82) / 2 = 23 / 2 = 11.5
In this example, the quartile deviation is 11.5. This means that the middle 50% of the
exam scores (from Q1 to Q3) has a spread or variability of 11.5 points. The quartile
deviation is a robust measure of variation that is less affected by extreme values or
outliers in the dataset.
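The calculation above can be sketched in plain Python. Note that quartile conventions vary: the indexing below is chosen to reproduce the example's Q1 = 82 and Q3 = 105, whereas linear interpolation would give slightly different values.

```python
import math

# Exam scores from the worked example above.
scores = sorted([75, 82, 88, 92, 96, 100, 105, 110])
n = len(scores)

# Take the data value just below the 25th-percentile position for Q1,
# and just above the 75th-percentile position for Q3 (one of several
# common conventions).
q1 = scores[math.floor(0.25 * (n - 1))]
q3 = scores[math.ceil(0.75 * (n - 1))]

quartile_deviation = (q3 - q1) / 2
print(q1, q3, quartile_deviation)  # 82 105 11.5
```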
3. Mean Deviation:
The formula for calculating the Mean Deviation for a dataset with 'n' data points is as
follows:
Mean Deviation = Σ |X - μ| / n
Where:
Σ represents the summation symbol, meaning you should sum up the values for all data
points.
|X - μ| represents the absolute difference between each data point 'X' and the mean 'μ'.
'n' is the total number of data points in the dataset.
To calculate the Mean Deviation:
Calculate the mean (average) of the dataset.
For each data point, find the absolute difference between that data point and the mean.
Sum up all these absolute differences.
Divide the sum by the total number of data points 'n' to get the Mean Deviation.
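As a minimal sketch, the Mean Deviation formula maps directly onto a few lines of Python; the exam scores from the quartile example above are reused here as illustrative data.

```python
# Mean Deviation = sum(|X - mean|) / n
data = [75, 82, 88, 92, 96, 100, 105, 110]
n = len(data)

mean = sum(data) / n  # 93.5
mean_deviation = sum(abs(x - mean) for x in data) / n
print(mean_deviation)  # 9.25
```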
4. Variance:
It is a statistical measure of the spread or dispersion of a set of data points in a dataset. It
quantifies how much individual data points deviate from the mean (average) of the
dataset. In other words, variance measures the average of the squared differences
between each data point and the mean.
The formula for calculating the variance of a dataset with 'n' data points is as follows:
Variance (σ²) = Σ (X - μ)² / n
Where:
Σ represents the summation symbol, meaning you should sum up the values for all data
points.
(X - μ) represents the difference between each data point 'X' and the mean 'μ'.
(X - μ)² represents the squared difference between each data point and the mean.
'n' is the total number of data points in the dataset.
To calculate the variance:
Calculate the mean (average) of the dataset.
For each data point, find the squared difference between that data point and the mean.
Sum up all these squared differences.
Divide the sum by the total number of data points 'n' to get the variance.
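The four steps above can be sketched in Python. This computes the population variance (dividing by 'n', as in the formula; sample variance would divide by n - 1), reusing the illustrative exam scores from earlier.

```python
data = [75, 82, 88, 92, 96, 100, 105, 110]
n = len(data)

mean = sum(data) / n
# Average of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in data) / n
print(variance)  # 120.0
```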
5. Standard Deviation:
Standard deviation is a widely used statistical measure of the amount of variation or
dispersion in a dataset. It quantifies how individual data points deviate from the
mean (average) of the dataset. Standard deviation is a more interpretable measure
than variance because it's in the same unit as the original data, unlike the squared
unit of variance.
The standard deviation (σ) for a dataset with 'n' data points is the square root of the
variance:
Standard Deviation (σ) = √( Σ (X - μ)² / n )
Where:
Σ represents the summation symbol, meaning you should sum the values for all data points.
(X - μ)² represents the squared difference between each data point 'X' and the mean 'μ'.
'n' is the total number of data points in the dataset.
Standard deviation is used in various fields, including statistics, finance, science, and
quality control, to assess data variability, compare datasets, and make informed
decisions. It is an essential tool for understanding the consistency, reliability, and
predictability of data.
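Since the standard deviation is the square root of the variance, the earlier variance sketch extends by one line (again using the illustrative exam scores):

```python
import math

data = [75, 82, 88, 92, 96, 100, 105, 110]
n = len(data)

mean = sum(data) / n
variance = sum((x - mean) ** 2 for x in data) / n  # 120.0
std_dev = math.sqrt(variance)                      # sqrt(120) ≈ 10.95
print(round(std_dev, 2))  # 10.95
```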
8. Coefficient of Mean Deviation:
The formula for calculating the Coefficient of Mean Deviation (CMD) is as follows:
CMD = (Mean Deviation / Mean) * 100
Where:
CMD is the Coefficient of Mean Deviation.
Mean Deviation is the average absolute difference between each data point in the
dataset and the mean.
Mean is the average value of the dataset.
To calculate the CMD:
Calculate the Mean Deviation. It's the average of the absolute differences between
each data point and the mean.
Calculate the Mean, which is the average value of the dataset.
Divide the Mean Deviation by the Mean.
Multiply the result by 100 to express the CMD as a percentage.
The Coefficient of Mean Deviation allows for the comparison of the relative spread or
variability of datasets with different central tendencies and scales. A smaller CMD
indicates that the data is relatively more consistent and less variable around the mean,
while a larger CMD suggests greater variability. It is a measure that provides
information about the distribution and dispersion of data in relation to the mean.
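A short sketch of the CMD steps, reusing the illustrative exam scores from the earlier examples:

```python
data = [75, 82, 88, 92, 96, 100, 105, 110]
n = len(data)

mean = sum(data) / n                                   # 93.5
mean_deviation = sum(abs(x - mean) for x in data) / n  # 9.25
cmd = mean_deviation / mean * 100                      # as a percentage
print(round(cmd, 2))
```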
9. The "Coefficient of Variation" (CV), also known as the "Relative Standard
Deviation," is a statistical measure that expresses the relative variability or
dispersion of a dataset as a ratio or percentage of the standard deviation to the
mean (average). It is a dimensionless measure that allows for the comparison of
the spread of data between different datasets, even if they have different units or
scales.
The formula for calculating the Coefficient of Variation (CV) is as follows:
CV = (Standard Deviation / Mean) * 100
Where:
CV is the Coefficient of Variation.
Standard Deviation is a measure of the spread or dispersion of the data.
Mean is the average value of the dataset.
To calculate the CV:
Calculate the Standard Deviation, which quantifies how data points deviate from the
mean.
Calculate the Mean, which is the average value of the dataset.
Divide the Standard Deviation by the Mean.
Multiply the result by 100 to express the CV as a percentage.
The Coefficient of Variation is particularly useful for comparing the relative
variability of datasets with different central tendencies and scales. A smaller CV
indicates that the data is relatively more consistent and less variable around the mean,
while a larger CV suggests greater relative variability. It is commonly used in various
fields, including finance, quality control, and scientific research, to assess the spread
of data and make meaningful comparisons between datasets with different units or
measurements.
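The CV calculation can be sketched the same way (illustrative exam scores, population standard deviation as in the earlier formula):

```python
import math

data = [75, 82, 88, 92, 96, 100, 105, 110]
n = len(data)

mean = sum(data) / n
std_dev = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
cv = std_dev / mean * 100  # as a percentage
print(round(cv, 2))
```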
10. Standard scores, also known as z-scores or standardized values, are a way to
measure how many standard deviations a particular data point is from the mean
(average) of a dataset. They are used to standardize data, making it easier to
compare and analyze values from different datasets or to identify outliers within a
dataset.
The formula to calculate a standard score (z-score) for a data point 'X' in a dataset is
as follows:
z = (X - μ) / σ
Where:
'X' is the data point, 'μ' is the mean of the dataset, and 'σ' is its standard deviation.
A positive z-score means the data point lies above the mean; a negative z-score means it
lies below the mean.
Standard scores are commonly used in fields such as statistics, economics, and
psychology, where data from different sources or studies need to be compared and
analyzed on a standardized scale. They help make data analysis and interpretation
more consistent and meaningful.
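A minimal sketch of a z-score calculation, using the illustrative exam scores and asking how far the top score (110) sits above the mean:

```python
import math

data = [75, 82, 88, 92, 96, 100, 105, 110]
n = len(data)

mean = sum(data) / n
std_dev = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

# z-score of the top score: number of standard deviations above the mean.
z = (110 - mean) / std_dev
print(round(z, 2))  # 1.51
```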
8. Activities/ Case studies/related to the session
Activities that can be conducted to analyze and address measures of variation in data
include:
Data Visualization: Create visual representations of the data, such as histograms, box
plots, or scatterplots, to gain a visual understanding of the distribution and
variability.
Outlier Detection: Identify and investigate potential outliers or extreme values in the
dataset. Determine whether these values are valid data points or data errors and
address them accordingly.
Sampling Analysis: Evaluate the sampling method used to collect the data. Ensure the
sample size and sampling frequency are appropriate for the analysis and for
detecting variations.
Root Cause Analysis: Investigate the causes of variation in the data. Identify the factors
contributing to the variability and determine if they can be controlled or reduced.
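The outlier-detection activity is commonly done with the 1.5 × IQR rule; below is a minimal sketch, using the earlier exam scores with one planted extreme value (250) for illustration:

```python
import statistics

# Exam scores from earlier, plus one planted extreme value (250).
data = [75, 82, 88, 92, 96, 100, 105, 110, 250]

# statistics.quantiles with method="inclusive" uses linear interpolation.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # [250]
```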
3) Which measure of variation is more robust against outliers compared to the standard
deviation?
a. Range
b. Interquartile Range (IQR)
c. Variance
d. Mean Absolute Deviation (MAD)
11. Summary
A measure of variation is a way to extract meaningful information from a set of provided
data. Variability tells us a great deal about the data, including the following:
It shows how far data items lie from each other.
It shows how far data items lie from the center of the distribution.
It complements measures of central tendency by describing the spread of values around them.
It also contributes to a fuller descriptive picture of the data.
Background:
ABC Electronics is a manufacturer of electronic components used in various consumer
electronics. They are particularly concerned about the resistance values of a specific type
of resistor that they produce, which is a critical component in their products. Variability in
resistance can lead to performance issues or even product failure.
Objective:
The company aims to evaluate the variation in the resistance values of the resistors
produced by their manufacturing process. They want to identify whether the process is
stable and how consistent the resistors are in meeting the desired resistance specifications.
Data Collection:
ABC Electronics collects resistance measurements from a random sample of 1000
resistors from a recent production run. These measurements represent the resistance values
of individual resistors in ohms. The dataset is as follows:
Range: The first step is to calculate the range, which is the difference between the
maximum and minimum values in the dataset. In this case, Range = 101.5 ohms (max) -
98.7 ohms (min) = 2.8 ohms.
Interquartile Range (IQR): To assess the central spread, the company calculates the IQR.
After arranging the data in ascending order, the first quartile (Q1) is found at the 25th
percentile, and the third quartile (Q3) is found at the 75th percentile. IQR = Q3 - Q1. The
IQR indicates the middle 50% of the data's spread.
Variance and Standard Deviation: To understand the overall variability, ABC Electronics
calculates the variance and standard deviation of the resistance values. Variance provides
an average measure of the squared differences from the mean, while the standard deviation
is the square root of the variance.
Mean Absolute Deviation (MAD): The company also computes the MAD to evaluate the
average absolute deviation of data points from the mean resistance value.
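The measures above can be sketched together in Python. The case study's full dataset of 1000 measurements is not reproduced in the text, so the sample below is hypothetical, chosen only so the minimum and maximum match the stated 98.7 and 101.5 ohms.

```python
import math
import statistics

# Hypothetical resistance readings in ohms (illustrative sample only).
resistances = [98.7, 99.2, 99.8, 100.0, 100.1, 100.4, 100.9, 101.5]
n = len(resistances)

value_range = max(resistances) - min(resistances)  # 2.8 ohms
q1, _, q3 = statistics.quantiles(resistances, n=4, method="inclusive")
iqr = q3 - q1

mean = sum(resistances) / n
variance = sum((x - mean) ** 2 for x in resistances) / n
std_dev = math.sqrt(variance)
mad = sum(abs(x - mean) for x in resistances) / n

print(round(value_range, 1), round(std_dev, 2), round(mad, 2))  # 2.8 0.83 0.65
```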
Results and Interpretation:
The variance and standard deviation provide a more comprehensive view of the overall
variation in resistance values. The company can use these statistics to track the
consistency of the manufacturing process over time and assess the impact of any process
improvements on reducing variation.
The MAD provides an average measure of how far data points deviate from the mean and
can be used for assessing the average dispersion of resistance values.
Based on these measures, ABC Electronics can make informed decisions about the quality
of their manufacturing process and implement strategies to reduce variability if necessary.
Reducing variation in resistance values will lead to more reliable and consistent electronic
components, ultimately improving their product quality.
15. Glossary
16. References
Books:
1. Fry, Visualizing Data. O’Reilly Media, 2008, ISBN 0596514557.
2. Munzner, Visualization Analysis and Design. CRC Press, 2014, ISBN 1466508914.
3. Ware, Information Visualization: Perception for Design, 3rd ed. Morgan Kaufmann,
2012, ISBN 0123814642.
Reference Books:
1. Paulraj Ponniah, "Data Modeling Fundamentals: A Practical Guide for IT
Professionals".
2. Stephen Few, "Information Dashboard Design: The Effective Visual Communication of
Data", O'Reilly, 2006.
17. Keywords
Data Modeling, Data Abstraction, Visual Encoding, Filtering and Aggregation, Spatial
Data, Dashboard.