Statistics For Data Analysis

STATISTICS FOR DATA ANALYTICS
Dr.S.Saudia,
Assistant Professor,
CITE, M.S.University
Objectives of the Chapter.
• To understand the different Descriptive Statistics concepts for Descriptive Analytics and later analytics stages.
Chapter contents.
• Sampling Techniques: Simple Random Sampling, Convenience, Systematic, Cluster and Stratified
• Organization of Data: Data classification, Tabulation, Frequency and Graphic representation
• Measures of central value: Arithmetic mean, Geometric mean, Harmonic mean, Mode, Median, Quartiles, Deciles, Percentiles
• Measures of variation: Range, IQR, Quartile deviation, Mean deviation, Standard deviation, Coefficient of variation, Skewness, Moments & Kurtosis.
Statistics
Statistics is a methodology for collecting, analyzing, interpreting and drawing conclusions from numerical or categorical data [1]. It provides methods for
i) summarizing/organizing data
ii) exploring data (Is the new grain productive?)
iii) making predictions from data (What will be the employment rate next year?)
Basic concepts of Statistics [1]
• Population: The set of persons or objects investigated in a statistical study.
Eg: The students in M.S.U., books in a library (finite population).
Bulbs produced by a factory this year, which can also serve as the population for analysis next year (hypothetical population).
A parameter is an unknown numerical summary of a population. Eg: The proportion of all diabetic patients who feel fatigue at least once a week.
• Sample: The part of the population from which information is collected. Information is usually not collected from the entire population for analysis.
A statistic is a known numerical summary of a sample of the population. Eg: The proportion of diabetic patients in the sample who often feel fatigue.
Types of Statistics[1],[4]
• Descriptive Statistics
• The statistical methods used for summarizing and describing the information in data.
• It includes the construction of graphs, charts and tables, and the calculation of descriptive measures: averages, measures of variation and percentiles.
Eg: For the event of tossing a die, descriptive statistics can summarize the frequencies of the outcomes as:
Descriptive Statistics [1]

• Inferential Statistics
• The statistical methods used for drawing conclusions about a population, and for measuring the reliability of those conclusions, based on the information obtained from a sample of the population.
• It includes methods for point estimation, interval estimation and hypothesis testing, all of which use probability theory.
Eg: For the event of tossing a die, inferential statistics verifies whether the die is fair or not.
Inferential statistics works on the information obtained from descriptive statistics. An inference is made about the population based on information obtained from the sample.
• Parameter and Statistic
A parameter is an unknown numerical summary of a population. Eg: The proportion of all diabetic patients who feel fatigue in a week.
A statistic is a known numerical summary of a sample. Eg: The proportion of diabetic patients who feel fatigue in a week, calculated from a sample.
The statistic values are used to estimate the unknown parameters of the population.
Parameter and Statistic [2]
The objective of statistical methods is to make inferences about the population from an analysis of the information contained in sample data.
• Variable
A variable is a characteristic that varies from one person or object of a population to another. Eg: height, weight.
Variables can be quantitative or qualitative.
Qualitative Variables: These variables have categorical values.

These variables are measured on a nominal scale or an ordinal scale.
Nominal Scale: The categories of the qualitative variable are unordered; the scale only names the categories.
It cannot be said that one category is greater or lesser than another.
Eg: Gender, Marital Status.
The difference between the categories is not defined.
These categories can be represented as a pie chart or a bar chart.
Can you calculate the mean/median of a nominal variable? Why?
The only measure of central tendency for a nominal variable is the mode.
Nominal Scale of Gender, Mode of Transport [3]

Ordinal Scale: The categories of the qualitative variable are on an ordered scale. Low-to-high ordering/ranking of the categories is possible.
Movie Ratings [4]

It can be said that one category is higher than another, but the difference between the categories is unknown.
Eg: Education rating, temperature (as ranked categories), strength of opinion in polls, customer satisfaction (very satisfied, more satisfied, satisfied, dissatisfied etc.), pain intensity etc.
The categories give a first, second, third kind of ranking, and there is no absolute zero.
The median and mode of the categories can be calculated.
No arithmetic can be done on the categories.
Ordinal Scale of temperature [3]

Quantitative Variables: These variables have numeric values.
When the values are separate, countable numbers, the variable is called a discrete quantitative variable. Eg: Number of children in a family.
When the values can be measured to any desired accuracy within a range, the variable is called a continuous quantitative variable. Eg: Height, Weight.
These variables are measured on an interval scale or a ratio scale.
Interval Scale: The values of the quantitative variable are on an ordered scale, and the difference between the values is known and equal.
The value zero on an interval scale is arbitrary: zero is itself a measurement and does not mean the absence of the quantity.
Eg: Temperature in degrees Celsius (0 degrees Celsius does not mean that there is no temperature), Time.
The difference between the points on the scale is measurable and equal.
So the mean, median and mode of the values can be calculated.
Also, addition and subtraction can be done.
Interval Scale of temperature in degrees Celsius [6]
Ratio Scale: The values of the quantitative variable are on an ordered scale, and the difference between the values is known and equal.
The value zero on a ratio scale is absolute and occurs naturally as a measurement.
Eg: Temperature in kelvin (0 K means that the temperature itself is zero), height, weight etc.
All statistical measurements and all arithmetic can be done.
Ratio Scale of height [7]
Decision Tree to decide on the levels/scale of measurement [4]
Statistical Tests and Measurements possible on different scales [4]
At the highest level of measurement, the numeric ratio scale, all the measurements can be done on the data set.
Order of Scales of Measurement [6]

• Data
The values of a variable for one or more people or things yield data.
• Data Set/ Data Matrix

The collection of all observations for particular variables is called a data set/data matrix. Data sets are recorded for samples of the population of interest.
All the values of the data set are presented as a data matrix, as shown below.
Data Matrix [1]

The values of the different variables are arranged in different columns.
Each row corresponds to the values of the variables for a distinct sampling unit.
$x_{ij}$ is the value of the $j$th variable for the $i$th sampling unit, $i = 1, 2, \dots, n$, $j = 1, 2, \dots, k$.
1. Sampling Techniques [9]
The part of the population from which information is collected is called a sample.
The entire population usually cannot be studied, so a sample is selected for study.
There are different techniques of sampling. The choice depends mainly on the resources available, such as the time and money needed to conduct the study, and on the nature of the population.
Factors on which Sampling Techniques depend. [7], [8]
A sampling technique should select an unbiased, representative sample. Otherwise the sample produces sampling error.
Eg: If the study is on the African and European population, the sample should have representatives of both Africans and Europeans, in equal numbers, to be unbiased.
• Simple Random Sampling
This technique selects objects from the population at random, for example by using random numbers. The method produces an unbiased and representative sample. Every object of the population is equally likely to be a member of the sample.
It is theoretically the ideal method of sampling.
Simple Random Sampling [8], [9]

Simple random sampling can be quite expensive when the population consists of people across the world. It works well for samples from a manufacturing industry.
It is a better sampling technique when the sampling frame (the list of all objects/people present in the population of interest) is small.
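A minimal sketch of simple random sampling, written in Python as a choice of this illustration. The sampling frame of 16 object IDs and the sample size of 4 are hypothetical. `random.sample` draws without replacement, so every object in the frame is equally likely to be chosen; the seed is fixed only to make the run repeatable.

```python
import random

# Hypothetical sampling frame: IDs of the 16 objects in the population.
frame = list(range(1, 17))

rng = random.Random(42)          # fixed seed so the sketch is repeatable
sample = rng.sample(frame, k=4)  # draw 4 distinct objects, each equally likely

print(sorted(sample))
```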
• Convenience Sampling
In this technique, samples are collected from whatever is nearest and most convenient, keeping within the time and cost limits.
Eg: The sample is drawn from a shopping mall, or from someone's friends, relatives or people passing by.
The latest items produced in a manufacturing industry are selected as the sample.
It produces heavily biased and unrepresentative samples. One common form is self-selection bias, where only the people who are interested in the problem participate in the survey.
Convenience Sampling [10]

• Systematic Sampling
In this technique, the first object of the sample is selected using a random number, and the later objects are selected at a fixed interval after it.
It is easier than Simple Random Sampling and produces an approximation of a random sample.
However, if there is a periodic pattern in the data, a certain type of object is likely to be selected more often.
Systematic Sampling [11]
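Systematic sampling can be sketched as follows, in Python as a choice of this illustration. The frame of 16 IDs and the interval of 4 are hypothetical; the only random step is the choice of the starting position.

```python
import random

frame = list(range(1, 17))   # hypothetical sampling frame of 16 object IDs
k = 4                        # sampling interval: take every 4th object

rng = random.Random(7)       # fixed seed so the sketch is repeatable
start = rng.randrange(k)     # random starting position among the first k objects
sample = frame[start::k]     # then every k-th object after the start

print(sample)
```

If the frame were ordered so that every 4th object shared some property, this sample would contain only that kind of object, which is the pattern risk described above.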

• Cluster Sampling
The population is first divided into clusters, and then some clusters are chosen at random. All the objects of each chosen cluster are included in the sample.
Eg: Departments of a business.
It is more convenient than Simple Random Sampling.
If the chosen clusters do not reflect the variety in the population, the selected sample can be biased or unrepresentative of the population.
Cluster Sampling [13]
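A small sketch of cluster sampling in Python (a choice of this illustration). The departments and their members are hypothetical; two whole clusters are drawn at random and every member of a chosen cluster enters the sample.

```python
import random

# Hypothetical population already divided into clusters (departments of a business).
clusters = {
    "sales":   ["s1", "s2", "s3"],
    "finance": ["f1", "f2"],
    "hr":      ["h1", "h2", "h3", "h4"],
    "it":      ["i1", "i2", "i3"],
}

rng = random.Random(1)                       # fixed seed so the sketch is repeatable
chosen = rng.sample(sorted(clusters), k=2)   # choose 2 whole clusters at random

# Every object of each chosen cluster enters the sample.
sample = [obj for name in chosen for obj in clusters[name]]
print(chosen, sample)
```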

• Stratified Sampling
The population is first divided into well-defined strata, and then objects are chosen at random from each stratum to form the sample. The number of objects selected from each stratum corresponds to the size of the stratum.
Eg: If the study covers an entire human population, then nationality or other characteristics such as age, occupation and culture can define the strata.
Stratified Sampling [13]
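Stratified sampling with proportional allocation might look like the sketch below, in Python as a choice of this illustration. The three age strata, their sizes and the total sample size of 10 are hypothetical.

```python
import random

# Hypothetical strata (age groups) with member IDs.
strata = {
    "young":  list(range(0, 50)),     # 50 people
    "middle": list(range(50, 80)),    # 30 people
    "senior": list(range(80, 100)),   # 20 people
}
total = sum(len(members) for members in strata.values())
sample_size = 10

rng = random.Random(3)                # fixed seed so the sketch is repeatable
sample = {}
for name, members in strata.items():
    # Proportional allocation: the stratum's share of the sample matches its share of the population.
    n_h = round(sample_size * len(members) / total)
    sample[name] = rng.sample(members, n_h)

print({name: len(chosen) for name, chosen in sample.items()})
```

With less convenient stratum sizes the rounded counts may not add up exactly to the intended sample size, so a real design would need a rule for the leftover units.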

Exercise
Try sampling the dataset below using all those sampling techniques.
vh ozon ibh dpg vis temp doy

5,710.00 3.00 2,693.00 -25.00 250.00 40.00 3.00
5,700.00 5.00 590.00 -24.00 100.00 45.00 4.00
5,760.00 5.00 1,450.00 25.00 60.00 54.00 5.00
5,720.00 6.00 1,568.00 15.00 60.00 35.00 6.00
5,790.00 4.00 2,631.00 -33.00 100.00 45.00 7.00
5,790.00 4.00 554.00 -28.00 250.00 55.00 8.00
5,700.00 6.00 2,083.00 23.00 120.00 41.00 9.00
5,700.00 7.00 2,654.00 -2.00 120.00 44.00 10.00
5,770.00 4.00 5,000.00 -19.00 120.00 54.00 11.00
5,720.00 6.00 111.00 9.00 150.00 51.00 12.00
5,760.00 5.00 492.00 -44.00 40.00 51.00 13.00
5,780.00 4.00 5,000.00 -44.00 200.00 54.00 14.00
5,830.00 4.00 1,249.00 -53.00 250.00 58.00 15.00
5,870.00 7.00 5,000.00 -67.00 200.00 61.00 16.00
5,840.00 5.00 5,000.00 -40.00 200.00 64.00 17.00
5,780.00 9.00 639.00 1.00 150.00 67.00 18.00
2. Organization of Data
The data collected need to be organized. Data can be organized by presenting them as tables and graphs.
Data represented as a Table, Data represented as a bar graph [15].

From the table or the graph, the range or spread (maximum - minimum value) of the data can be obtained. The count of each value, the maximum value etc. can also be obtained.
Qualitative variable:
The number of observations in a particular class (category) of a qualitative variable is the frequency/count of that class.
The frequencies of the variable data can be organized as

1. frequency distribution: a table listing all the classes and their frequencies.
2. percentage of a class: (frequency of the class / total number of observations) x 100%.
3. relative frequency: the frequency of the class divided by the total number of observations, expressed as a decimal.
4. relative frequency distribution: a table listing all the classes and their relative frequencies. The sample size decides the credibility of the relative frequencies. The relative frequencies add up to 1.
5. cumulative frequency: the sum of the frequencies up to a particular class. It gives the frequency above or below a reference level.
6. relative cumulative frequency: the sum of the relative frequencies up to a particular class.
7. pie chart: Qualitative data are represented graphically as a pie chart. It is a disc divided into pieces whose sizes depend on the frequencies of the classes. The angle for a class is obtained by multiplying its relative frequency by 360 degrees. Nominal data are represented by a pie chart.
8. bar graph: A bar graph is a graphical representation where the classes are displayed on the horizontal/vertical axis and the frequencies on the vertical/horizontal axis. Accordingly they are called vertical and horizontal bar graphs respectively. Ordinal data are represented as a bar graph.

Example 3.1. Let the blood types of 40 persons be as follows:
O O A B A O A A A O B O B O O A O O A A A A AB A B A A O O A
O O A A A O A O O AB
Table 2 Relative and Cumulative Frequency [1]
Blood group   Frequency   Relative Frequency   Cumulative Frequency
O             16          .4                   16
A             18          .45                  34
B             4           .1                   38
AB            2           .05                  40
Total         40          1
Cumulative frequency calculation is more significant for ordinal and quantitative data.
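The columns of Table 2 can be reproduced from the 40 blood types of Example 3.1. The sketch below uses Python as a choice of this illustration (the chapter's exercises ask for R); the variable names are assumptions. It also computes a pie-chart angle for each class, taken as the relative frequency times 360 degrees.

```python
from collections import Counter
from itertools import accumulate

blood = ("O O A B A O A A A O B O B O O A O O A A A A AB A B A A O O A "
         "O O A A A O A O O AB").split()

classes = ["O", "A", "B", "AB"]                 # class order used in Table 2
counts = Counter(blood)
freq = [counts[c] for c in classes]             # frequency of each class
n = sum(freq)
rel_freq = [f / n for f in freq]                # relative frequencies (add up to 1)
cum_freq = list(accumulate(freq))               # running total of the frequencies
angles = [r * 360 for r in rel_freq]            # pie-chart angle in degrees

for row in zip(classes, freq, rel_freq, cum_freq, angles):
    print(*row)
```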
Exercise: Write the R code for finding the frequency, relative frequency and cumulative
frequency of the blood group data
Pie Chart and Bar graph [1]
Exercise: Write the R code to plot the pie-chart of the blood group data
Quantitative variable:
If a quantitative variable takes only a few distinct values, its frequency is calculated as for qualitative variables.
If the number of distinct values is large, the data are grouped into classes before the frequency distribution is calculated.
Generally 5-15 class intervals are chosen. Percentage, cumulative frequency, relative frequency etc. hold for quantitative data just as for qualitative data.
1. Histogram: The histogram of grouped data displays the frequency or the relative frequency of each class interval. It is a bar graph of the grouped data.
Exercise: Write the R code for finding the frequency, relative frequency and cumulative frequency
of the data and plot the histogram of the frequencies
2. Ogive: A cumulative frequency distribution can be visualized using a curve called an Ogive. Ogives can be plotted against the upper or the lower limits of the class intervals, and accordingly they are called less-than Ogives or greater-than Ogives.
For the given distribution, the cumulative frequency curve/Ogive is shown below.
Frequency Table, Frequency distribution and Ogive for the population whose age is recorded [14]
Points to Remember:
The frequency distribution of the population is called a population distribution and that of a sample is called a
sample distribution.
The sample distribution is a blurred image of the population.
As the sample size increases, the sample relative frequency gets closer to that of the population and the image becomes clearer.
Also, as the sample size increases, the histogram of the frequency distribution becomes smoother and closer to a continuous curve, as can be seen below.
Histograms in order for sample sizes 100, 2000 and the whole population [1]
Points to Remember:
A summary of the population can be made by looking at the shape of the distribution curve.
As shown below for the bell and U shaped distributions, the populations are very clearly different.
U shaped and Bell shaped Histograms of two different populations [1]

Points to Remember:
Bell and U shaped distributions of populations are symmetric.
The distributions can also be skewed to one direction, left skewed or right skewed as shown below. These are
asymmetrical distributions.
Distributions skewed to the right and Distribution skewed to the left [1]
3. Measures of central values [1]
The single value/measure which is representative (a typical value) of the whole population is called the measure of central tendency. It is the middle of a population.
The three measures of central tendency are:
1. Mean
2. Median
3. Mode
Mean and median are measures of central tendency for quantitative data only, whereas mode is also a measure of central tendency for qualitative data.
Measures of Central Tendency of symmetrical and unsymmetrical populations [1]
For a symmetric population (normal distribution), these values are close to each other. If the values in the population are almost all the same, these measures will also be the same.
For an unsymmetrical or skewed population, these values differ from each other, as shown in the figure above.
Measures of Central Tendencies :
Mode (Common/ Popular Qualitative Variable)

Mode is the value which occurs most often.
Eg: 1. Which product is sold the most?
The mode gives an idea about the most popular product, or the popular size/color of a cloth or any item. This helps the retailer to stock more of that product.
2. Which country is most prone to cancer?

The mode gives an idea about the country with the most cancer patients. This helps the government to take better precautionary measures.
The mode of a qualitative or discrete quantitative variable is the value of the variable with the highest frequency.
If the greatest frequency is 1, then the variable has no mode.

If the greatest frequency is 2 or more, then the mode is the value (or values) having that greatest frequency.
The mode can be easily determined from the frequency distribution table or graph.
The mode in the frequency distribution table 6 [1] is A. This means that A is the most common blood group.
When the data are large, continuous and divided into classes, the mode is given as a modal class, which is the class interval with the highest frequency.
The modal class in the frequency distribution table 7 [1] is 0.065-0.085.
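The rule for the mode can be sketched as a small function, in Python as a choice of this illustration. The function name `modes` and the shirt-size data are hypothetical. It returns every value that reaches the greatest frequency and an empty list when no value repeats.

```python
from collections import Counter

def modes(values):
    """Return every value with the highest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                      # greatest frequency is 1: no mode
        return []
    return [v for v, c in counts.items() if c == top]

# Hypothetical shirt sizes sold in one day.
sizes = ["M", "L", "M", "S", "XL", "M", "L"]
print(modes(sizes))        # the most sold size
print(modes([1, 2, 3]))    # no value repeats, so there is no mode
```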

Median (Mid value)
The median of a quantitative variable is the central value which divides the ordered values of the variable into a set less than the median and a set greater than the median. The measure demands a data set which can be ordered.
When n is the number of items in the ordered data set,

If n is odd, the median is the middle value in the ordered data set.
If n is even, the median is either of the two middle values in the ordered data set or, as in the example below, their average.
The positions of the middle values in the ordered data set are
$\left\lfloor \frac{n+1}{2} \right\rfloor$ and $\left\lceil \frac{n+1}{2} \right\rceil$,
which coincide when n is odd.
Measures of Central Tendency : Median[16]
From this measure, the number of values less than the central measure and the number of values above it can be found. Both numbers are equal.
The measure of median is not affected by the outlier values (extreme values) in the data set.
Consider the data set below which are prices of some cottages in an area for sale.
Data Set – Prices of cottages , Median calculation [15]
The median ($137,500) of this data set lies among the normal values ($125,000, $127,000, $135,000, $140,000, $148,000, $150,000) in the data set, and it is not affected by the extreme values ($110,000 and $380,000).
Thus,
• Median is the best measure for asymmetrical data / for a skewed distribution, or when the distribution is not normal.
• It does not depend on the size of every value in the data set, and so it is reliable when there are outliers.
• Good for data in ratio and interval scale.
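The median of the eight cottage prices quoted above can be checked with a short sketch, in Python as a choice of this illustration. For an even number of values it averages the two middle values, which is the convention that yields $137,500; the second part inflates the largest price tenfold to show that the median does not move.

```python
import statistics

prices = [125_000, 127_000, 135_000, 140_000, 148_000, 150_000, 110_000, 380_000]

ordered = sorted(prices)
n = len(ordered)
if n % 2 == 1:
    median = ordered[n // 2]                               # single middle value
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2   # average of the two middle values

print(median)

# Making the extreme value ten times larger leaves the median unchanged.
inflated = [3_800_000 if p == 380_000 else p for p in prices]
print(statistics.median(inflated))
```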
Mean/ Arithmetic Mean
The mean of a quantitative variable is the central value obtained as the sum of all the observations of the variable divided by the total number of observations.
When n is the number of items in the data set, the mean $\bar{x}$ is
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
The calculated mean is influenced by extreme values or outliers.
Data Set – Prices of cottages , Mean calculation [15]

The mean ($164,375) of this data set is calculated from all the values ($125,000, $127,000, $135,000, $140,000, $148,000, $150,000, $110,000 and $380,000) in the data set. The value $380,000 is more than double the next greatest value in the data set, and it pulls the mean upward.
Thus,
• Mean is the best measure for symmetrical data / for a normal distribution.
• It is affected by all the values in the data set, and so it is reliable only when there are no outliers.
• Good for data in ratio and interval scale.
Mean should have a logical connection to the expected value.
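The effect of an outlier on the mean can be seen with the same eight cottage prices. This is a sketch in Python (a choice of this illustration): it computes the mean as the sum divided by the count, then repeats the calculation without the $380,000 cottage.

```python
import statistics

prices = [125_000, 127_000, 135_000, 140_000, 148_000, 150_000, 110_000, 380_000]

mean = sum(prices) / len(prices)          # sum of observations / number of observations
print(mean)

# Dropping the extreme value shows how strongly it pulls the mean.
trimmed = [p for p in prices if p != 380_000]
print(statistics.mean(trimmed))
```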

Exercise: Write the R code for finding the mean, median and mode of the
data set:
Geometric Mean [19]-[20]
The geometric mean of a quantitative variable is the nth root of the product of all the n observations.
The geometric mean is the average of a set of products.
It is commonly used
when working with percentages, which are derived from values,
for values over a time period, like rates.
The standard arithmetic mean, in contrast, works with the values themselves.
Eg: Average returns from a stock market, to determine the performance results of an investment or portfolio.
Average rate under compound interest
Average depreciation of machines / shrinkage factors / growth rates / change factors
When n is the number of items in the data set,
$GM = \sqrt[n]{x_1 x_2 \cdots x_n}$
or
$\log GM = \frac{1}{n} \sum_{i=1}^{n} \log x_i$
where $x_1, x_2, \dots, x_n$ are the observations.
Exercise
Why Geometric Mean over Arithmetic Mean [21]
Eg: To find the average interest rate. Consider that an amount of $100 is invested. If the interest is 10% in the first year, 20% in the second year and 30% in the third year, then after three years the amount drawn shall be
= 100(1+r1)(1+r2)(1+r3)
= 100(1+.10)(1+.20)(1+.30) = $171.6
If the arithmetic mean is used to find the average interest rate,
Arithmetic mean = (10+20+30)/3 = 20%
The amount drawn using the arithmetic mean rate of interest is 100(1+.2)(1+.2)(1+.2) = $172.8
This amount is more than the actual amount.
So another mean rate, r, is to be determined which gives the correct amount drawn at the end.
i.e., $100(1+r)^3 = 100(1+r_1)(1+r_2)(1+r_3)$
$(1+r)^3 = (1+r_1)(1+r_2)(1+r_3)$
Taking the cube root on either side,
$r = \sqrt[3]{(1+.1)(1+.2)(1+.3)} - 1$
Substituting the values, we get r = .197
The amount drawn using this rate of interest is 100(1+.197)(1+.197)(1+.197) = $171.5
This amount matches the actual amount, up to the rounding of r.
So this rate, the geometric mean (nth root of the product of n items), is better in such average calculations.
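The comparison above can be replayed numerically. The sketch below is in Python (a choice of this illustration; `math.prod` needs Python 3.8 or later) and the variable names are assumptions. It computes the actual amount, the amount implied by the arithmetic mean rate, and the amount implied by the geometric mean rate.

```python
import math

rates = [0.10, 0.20, 0.30]                 # yearly interest rates from the example
factors = [1 + r for r in rates]           # growth factors 1.1, 1.2, 1.3

actual = 100 * math.prod(factors)          # amount after three years

am_rate = sum(rates) / len(rates)                       # arithmetic mean rate
gm_rate = math.prod(factors) ** (1 / len(factors)) - 1  # geometric mean rate

print(round(actual, 2))                        # actual amount
print(round(100 * (1 + am_rate) ** 3, 2))      # amount using the arithmetic mean rate
print(round(100 * (1 + gm_rate) ** 3, 2))      # amount using the geometric mean rate
print(round(gm_rate, 4))
```

Only the geometric mean rate reproduces the actual amount; the small gap in the hand calculation comes from rounding r to three decimals.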
Weighted Geometric Mean [22]
If $x_1, x_2, x_3, \dots, x_n$ have frequencies $f_1, f_2, f_3, \dots, f_n$, then the geometric mean is
$GM = \left( x_1^{f_1} \, x_2^{f_2} \, x_3^{f_3} \cdots x_n^{f_n} \right)^{1/N}$
where $N = f_1 + f_2 + \dots + f_n$ is the sum of all the frequencies.
Exercise:
Write the R code for finding the mean compound interest if the interests for the first
five years are 10%,20%, 30%, 40%, 50%.
Harmonic Mean [21-[22]
The harmonic mean of a quantitative variable is the reciprocal of the arithmetic mean of the reciprocals of the observations.
$HM = \frac{1}{\frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$
Consider a situation where caps are ordered in bulk under three price categories: $240 is spent on caps at $12 each, $160 on caps at $16 each and $300 on caps at $15 each.
If the average price of a cap is to be fixed, the arithmetic mean (12+16+15)/3 will not work.
• This is because the number of caps bought is not equal to 3. It is
240/12 for the 1st category
160/16 for the 2nd category
300/15 for the 3rd category
• Also, the sum of the observations is not equal to $12+$16+$15. It is
12x240/12 for the 1st category + 16x160/16 for the 2nd category + 15x300/15 for the 3rd category
So the average price is
$\frac{12 \times \frac{240}{12} + 16 \times \frac{160}{16} + 15 \times \frac{300}{15}}{\frac{240}{12} + \frac{160}{16} + \frac{300}{15}} = \frac{240 + 160 + 300}{\frac{240}{12} + \frac{160}{16} + \frac{300}{15}} = \frac{W_1 + W_2 + W_3}{\frac{W_1}{X_1} + \frac{W_2}{X_2} + \frac{W_3}{X_3}}$
The formula $\frac{W_1 + W_2 + W_3}{W_1/X_1 + W_2/X_2 + W_3/X_3}$ is called the weighted harmonic mean.
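The weighted harmonic mean for the cap example can be evaluated directly. In this sketch (Python, a choice of this illustration) the list names `spent` and `price` are assumptions standing for the W and X values in the formula.

```python
# Amount spent (W) and price per cap (X) in each of the three price categories.
spent = [240, 160, 300]
price = [12, 16, 15]

caps = [w / x for w, x in zip(spent, price)]     # caps bought in each category
weighted_hm = sum(spent) / sum(caps)             # (W1+W2+W3) / (W1/X1 + W2/X2 + W3/X3)

simple_am = sum(price) / len(price)              # ignores how many caps were bought

print(caps, weighted_hm, simple_am)
```

The weighted harmonic mean is the total money spent divided by the total number of caps, which is why it gives the true average price per cap while the simple arithmetic mean of the three prices does not.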

Partition Values[1], [24]
The total number of observations can be divided into a fixed number of parts, say 4, 10 or 100, by values called partition values/quantiles.
Some of the quantiles are:
• Median
• Quartiles
• Deciles
• Percentiles
Quartiles
Quartiles are the measurements which divide the total number of observations into 4 equal parts. There are three quartiles:
First Quartile (or Lower Quartile): Q1
Second Quartile (or Middle Quartile): Q2 / Median
Third Quartile (or Upper Quartile): Q3
The number of observations smaller than Q1 (N/4) is the same as the number lying between Q1 and Q2, between Q2 and Q3, or larger than Q3.
For continuous observations, one quarter of the observations are smaller than Q1, two quarters are smaller than Q2 and three quarters are smaller than Q3.
So Q1, Q2 and Q3 are the values corresponding to the cumulative frequencies n/4, 2n/4 and 3n/4 respectively for a grouped data set.
Quartiles [24]
Percentile
Percentiles are the measurements of the variable which divide the total number of observations into 100 equal parts.
The first percentile, P1, is the value of the variable which divides the bottom 1% of the values from the top 99%.
The second percentile, P2, is the value of the variable which divides the bottom 2% of the values from the top 98%.
The median is thus the 50th percentile, P50.
Percentile [25]
Deciles
Deciles are the measurements of the variable which divide the total number of observations into 10 equal parts.
There are 9 deciles: D1, D2, D3, D4, D5, D6, D7, D8, D9.
The first decile, D1, is the value of the variable which divides the bottom 10% of the values from the top 90%. It is also the 10th percentile, P10.
Similarly, the second decile, D2, is the 20th percentile, P20, and so on.
Decile [26]
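The relations between quartiles, deciles and percentiles can be checked with Python's `statistics.quantiles` (Python 3.8 or later; the language is a choice of this illustration). The 19 observations are hypothetical. Software packages use slightly different interpolation conventions for quantiles, so the exact values may differ elsewhere; the sketch relies on Python's default method.

```python
import statistics

# Hypothetical observations (19 values).
data = [12, 15, 21, 24, 30, 33, 38, 41, 47, 52,
        58, 60, 66, 71, 79, 85, 90, 94, 97]

quartiles = statistics.quantiles(data, n=4)      # Q1, Q2, Q3
deciles = statistics.quantiles(data, n=10)       # D1 ... D9
percentiles = statistics.quantiles(data, n=100)  # P1 ... P99

q1, q2, q3 = quartiles
print(q1, q2, q3)

# The median is the second quartile, the fifth decile and the 50th percentile.
print(q2 == statistics.median(data) == deciles[4] == percentiles[49])
# The first decile is the 10th percentile.
print(deciles[0] == percentiles[9])
```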
Five number Summary and Box Plot
The five number summary of a variable consists of the minimum, the three quartiles and the maximum, written in increasing order.
They provide information on the center and variation of the variable.
A box plot is based on the five number summary, and it gives a graphical display of the center and the variation.
Box plots are of two types: 1. Box plot and 2. Modified Box plot.
Outliers are marked in a Modified Box plot and not in a Box plot.
Steps for drawing Boxplot.

Exercise
Find the popular color using the relevant central measure.
Understand the significance of that central measure.
Data set [17]

Exercise
Find the measure of central value of this elephant population.
Write a statement about the population using that measure of central tendency.
Data set [18]

Exercise
Find the measure of central value of these dogs' heights.
Comment on the variation of each height from the central measure calculated.
Data set [28]

Exercise
Write a story on the table below (A description).
4. Measures of variation [1]
In addition to measuring the central values of a dataset, it is also important to understand the variation of each value from the central measure for any analytics. Such measurements are called Measures of Variation.
They are:
1. Range
2. Interquartile range
3. Standard Deviation
Range: The sample range of a variable is the difference between the maximum and minimum values of the variable in the dataset.
Range = Max - Min
The range of a dataset cannot decrease, but can increase, when more values are added to the dataset.
Interquartile range: The sample interquartile range, IQR, of a variable is the difference between the third and first quartiles of that variable.
IQR = Q3 - Q1
It is the range of the middle half of the dataset.
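Range and IQR, together with the five number summary, can be computed as in the sketch below. It is written in Python (a choice of this illustration), the 19 observations are hypothetical, and the quartiles follow the default convention of `statistics.quantiles`.

```python
import statistics

# Hypothetical observations (19 values).
data = [12, 15, 21, 24, 30, 33, 38, 41, 47, 52,
        58, 60, 66, 71, 79, 85, 90, 94, 97]

q1, q2, q3 = statistics.quantiles(data, n=4)
five_number = (min(data), q1, q2, q3, max(data))   # min, Q1, Q2, Q3, max

value_range = max(data) - min(data)                # Range = Max - Min
iqr = q3 - q1                                      # IQR = Q3 - Q1

print(five_number, value_range, iqr)

# Adding one extreme value widens the range.
wider = data + [200]
print(max(wider) - min(wider))
```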

Standard Deviation [1]: It is the most commonly used measure of variability.
The standard deviation, $S_x$, can be defined as the square root of the average of the squared deviations of the observations from the mean of the variable.
$S_x = \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$
Here the mean/average is used as the standard. The value is never negative.
The mean of all the squared deviations between the observations and their mean is called the Sample Variance, $S_x^2$.
$S_x^2 = \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$
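The two formulas can be followed step by step on a small data set. The sketch is in Python (a choice of this illustration) and the eight observations are hypothetical. It divides by n, as the formulas above do.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical observations

n = len(data)
mean = sum(data) / n
variance = sum((x - mean) ** 2 for x in data) / n   # mean of the squared deviations
std_dev = math.sqrt(variance)                       # square root of the variance

print(mean, variance, std_dev)
```

Python's `statistics.pvariance` and `statistics.pstdev` use the same divide-by-n definition, while `statistics.variance` and `statistics.stdev` divide by n - 1 and give slightly larger values.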
For a normal distribution / symmetric bell-shaped population, it is empirically determined that approximately:
68% of the values lie within $\bar{x} \pm \sigma$
95% of the values lie within $\bar{x} \pm 2\sigma$
99.7% of the values lie within $\bar{x} \pm 3\sigma$
Normal Distribution with mean = 0 [27]
The standard deviation fluctuates less than other measures of dispersion when moving from sample to sample.
Mean Deviation/ Mean Absolute Deviation [22]:
Mean Deviation can be defined as the average of the absolute deviations of the observations from the mean, or from any other specified value, of the variable.
$\text{Mean Deviation about } A = \frac{1}{n} \sum_{i=1}^{n} |x_i - A|$
Generally, the mean deviation is calculated about the mean.

The mean deviation is least when it is calculated about the median.
Mean Deviation [28]
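A short sketch of the mean deviation formula, in Python as a choice of this illustration. The function name `mean_deviation` and the five observations are hypothetical; the data include one extreme value so that the deviations about the mean and about the median differ clearly.

```python
import statistics

def mean_deviation(values, about):
    """Average absolute deviation of the values from the point `about`."""
    return sum(abs(x - about) for x in values) / len(values)

data = [1, 2, 3, 4, 100]                  # hypothetical observations with one extreme value

md_mean = mean_deviation(data, statistics.mean(data))
md_median = mean_deviation(data, statistics.median(data))
print(md_mean, md_median)
```

The deviation about the median comes out smaller than the deviation about the mean, in line with the statement that the mean deviation is least about the median.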

Exercise
Calculate the Mean Deviation of the following data about the median: 8,15,53,49,19,62,7,15,95,77
Moments: Moments about any arbitrary constant A are defined as
$\mu_1' = \frac{1}{n} \sum (x - A)$ is the 1st moment
$\mu_2' = \frac{1}{n} \sum (x - A)^2$ is the 2nd moment
$\mu_3' = \frac{1}{n} \sum (x - A)^3$ is the 3rd moment
The four moments about zero are called Raw Moments, and the moments about the mean are called Central Moments.
Raw Moments (about zero):
$\mu_1' = \frac{1}{n} \sum x$ is the 1st moment about 0
$\mu_2' = \frac{1}{n} \sum x^2$ is the 2nd moment about 0
$\mu_3' = \frac{1}{n} \sum x^3$ is the 3rd moment about 0
Central Moments (about mean):
$\mu_1 = \frac{1}{n} \sum (x - \bar{x})$ is the 1st moment about $\bar{x}$
$\mu_2 = \frac{1}{n} \sum (x - \bar{x})^2$ is the 2nd moment about $\bar{x}$
$\mu_3 = \frac{1}{n} \sum (x - \bar{x})^3$ is the 3rd moment about $\bar{x}$
$\mu_4 = \frac{1}{n} \sum (x - \bar{x})^4$ is the 4th moment about $\bar{x}$
Moments are used to describe the basic peculiarities of the data from its frequency distribution:
• the measure of central tendency is given by the first raw moment
• the measure of dispersion is given by the 2nd moment about the mean
• the symmetry/skewness of the curve is given by the 3rd moment about the mean
• kurtosis is given by the 4th moment about the mean.
Moments [29]
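Raw and central moments share one formula and differ only in the point they are taken about. The sketch below is in Python (a choice of this illustration); the helper name `moment` and the data are hypothetical.

```python
def moment(values, r, about=0.0):
    """r-th moment of the values about the point `about` (divide-by-n form)."""
    return sum((x - about) ** r for x in values) / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical observations

mean = moment(data, 1)                    # first raw moment = mean
central = [moment(data, r, about=mean) for r in (1, 2, 3, 4)]
print(mean, central)
```

The first central moment is always zero, and the second central moment is the variance of the data.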
Skewness [31]: Skewness measures the extent of asymmetry of a dataset. It also indicates the direction of variation of the dataset from the mean.
Normal distribution and Skewed distributions [32]
Skewness can be negative or positive depending on whether the distribution is skewed to the left or to the right of the dataset's mean respectively.
Skewness tells how far the values of the dataset lie above or below the mean.
From the above figure, note the positions of the Mean, Median and Mode. The difference between the Mean and the Mode gives the measure and direction of skewness.
Measure of Skewness: To find the extent of asymmetry and to give the direction (positive or negative).
Pearson's first Measure:
$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$
Skewness is positive when the Mean is larger than the Median and Mode, and vice versa.
Pearson's second Measure: When the mode is ill defined,
$\text{Skewness} = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$
Bowley's Measure:
$\text{Skewness} = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{(Q_3 - Q_2) + (Q_2 - Q_1)}$
where $Q_1$, $Q_2$, $Q_3$ are the three quartiles and $Q_2$ is the median. For a positively skewed distribution, $Q_3$ is farther from $Q_2$ than $Q_1$ is, and vice versa for a negatively skewed distribution.
Bowley's formula is used to calculate skewness when the dataset is a grouped dataset, similar to the one in the exercise at the end of this chapter, where the mode, mean and standard deviation are difficult to calculate.
Moment Measure:
$\text{Skewness } (\gamma_1) = \frac{m_3}{\sigma^3}$
where $m_3$ is the third moment about the mean and σ is the standard deviation. For a symmetric distribution, for each positive value of (xi - mean) there is a matching negative value. When the deviations are cubed, the positive values stay positive and the negative values stay negative, so they cancel and $m_3$ is zero.
But for positive skews, the large positive values of (xi - mean) are magnified considerably when cubed, making $m_3$ positive, and vice versa for negative skews.
Thus positive, negative and zero values of $\gamma_1$ correspond to positively skewed, negatively skewed and symmetrical curves respectively.
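The four skewness measures can be compared on one small data set. The sketch is in Python (a choice of this illustration); the eight observations are hypothetical and have a longer tail on the right. The standard deviation divides by n, and the quartiles follow the default convention of `statistics.quantiles`, so other software may give slightly different Bowley values.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical, right-skewed observations

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
sd = statistics.pstdev(data)              # divide-by-n standard deviation

pearson_1 = (mean - mode) / sd            # Pearson's first measure
pearson_2 = 3 * (mean - median) / sd      # Pearson's second measure

q1, q2, q3 = statistics.quantiles(data, n=4)
bowley = ((q3 - q2) - (q2 - q1)) / ((q3 - q2) + (q2 - q1))

m3 = sum((x - mean) ** 3 for x in data) / len(data)   # third moment about the mean
moment_skew = m3 / sd ** 3

print(pearson_1, pearson_2, bowley, moment_skew)
```

The four measures differ in size but are all positive here, so they agree on the direction of the skew.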
Kurtosis [22]: Two distributions may have the same measures of central tendency, dispersion and skewness, but the concentration of the values around the mode can be different. This concentration is called Kurtosis.
It describes the shape of a dataset's distribution, that is, how peaked or flat the distribution is compared with the normal distribution.
According to the peak of the distribution, a curve can be Leptokurtic (a high, sharp peak), Mesokurtic (a normal curve with a normal peak) or Platykurtic (a flat peak), as shown in the figure below.
Kurtosis [31]
Measure of Kurtosis:
$$\text{Kurtosis } (\gamma_2) = \frac{m_4}{\sigma^4} - 3 = \beta_2 - 3$$
Where m4 is the fourth moment about the mean and σ is the Standard Deviation. For the normal distribution β2 = 3, so subtracting 3 makes γ2 zero for a mesokurtic curve.
Thus positive, negative and zero values of γ2 correspond to leptokurtic, platykurtic and mesokurtic curves.
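A small sketch of the kurtosis measure follows. The function name `excess_kurtosis` and the two samples are hypothetical, and population (divide-by-n) moments are assumed.

```python
def excess_kurtosis(data):
    """Return gamma_2 = m4 / sigma^4 - 3 using population (divide-by-n) moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # sigma^2
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2 - 3                      # sigma^4 equals m2 squared

flat = [1, 2, 3, 4, 5]                      # evenly spread values, no pronounced peak
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -5, 5]    # most values at the centre, two far out

print(excess_kurtosis(flat))    # negative: platykurtic
print(excess_kurtosis(peaked))  # positive: leptokurtic
```

For these samples γ2 is about -1.3 for the flat one and 2 for the peaked one, matching the platykurtic and leptokurtic cases above.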
Thus
• A representative value of a dataset is given by the Measures of Central Tendency.
• Variation / dispersion of the dataset from the central tendency is given by the Standard Deviation.
• Direction of the distribution, or the presence of data to the left or right of the mean, is given by Skewness.
• Concentration of data around the mode is given by Kurtosis.

Comment on the dispersion of the three curves [32]
Exercise
Find the suitable measure of Skewness for the following distribution
SALES             0-20   20-50   50-100   100-250   250-500   500-1000
Number of firms   20     50      69       30        25        19
Clues for Solution.
There are three formulae for calculating Skewness from summary measures:
Pearson's first measure, which uses the Mean, Mode and Standard Deviation.
Pearson's second measure, which uses the Mean, Median and Standard Deviation.
Bowley's measure, which uses the three quartiles.
The Mean, Mode and Standard Deviation involved in the first two formulae are difficult to calculate for the above grouped dataset, so the Skewness for this dataset is calculated using Bowley's measure.
Here the data items are in groups (0-20, 20-50, 50-100, ……, 500-1000), so to find the Quartiles use the cumulative frequency values of the data items.
Here Q1 = value corresponding to the (n/4)th cumulative frequency
Q2 = value corresponding to the (n/2)th cumulative frequency
Q3 = value corresponding to the (3n/4)th cumulative frequency
Where n is the total number of items (here the total number of firms = 213) and the data items are the sales values from 0-20, 20-50, …
SALES                  0-20   20-50   50-100   100-250   250-500   500-1000
Number of firms        20     50      69       30        25        19
Cumulative frequency   20     70      139      169       194       213
Q1 = value corresponding to the (n/4)th cumulative frequency = (213/4)th value ≈ 53rd sales value (from the table, 39 approximately)
Q2 = value corresponding to the (n/2)th cumulative frequency = (213/2)th value ≈ 106th sales value (from the table, 76 approximately)
Q3 = value corresponding to the (3n/4)th cumulative frequency = (3 × 213/4)th value ≈ 159th sales value (from the table, 203 approximately)
Use these Q1, Q2, Q3 values in Bowley's measure to find the Skewness.
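One way to carry out the quartile lookup for this exercise is sketched below. It assumes linear interpolation inside the class that contains the target cumulative frequency; the slides only say "from the table, approximately", so the interpolation rule and the helper name `grouped_quantile` are assumptions, although the results agree with the approximate values quoted above.

```python
# Class boundaries and frequencies taken from the exercise table.
classes = [(0, 20), (20, 50), (50, 100), (100, 250), (250, 500), (500, 1000)]
freqs = [20, 50, 69, 30, 25, 19]
n = sum(freqs)  # 213 firms

def grouped_quantile(p):
    """Locate the class holding the (p * n)th item and interpolate linearly in it."""
    target = p * n
    cumulative = 0
    for (lower, upper), f in zip(classes, freqs):
        if cumulative + f >= target:
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    return classes[-1][1]  # fallback for p = 1 under rounding

q1 = grouped_quantile(0.25)
q2 = grouped_quantile(0.50)
q3 = grouped_quantile(0.75)

# Bowley's measure: ((Q3 - Q2) - (Q2 - Q1)) / ((Q3 - Q2) + (Q2 - Q1))
bowley = ((q3 - q2) - (q2 - q1)) / ((q3 - q2) + (q2 - q1))

print(f"Q1 = {q1:.2f}, Q2 = {q2:.2f}, Q3 = {q3:.2f}")
print(f"Bowley's measure = {bowley:.3f}")
```

Under these assumptions the quartiles are roughly 39.95, 76.45 and 203.75, and Bowley's measure is about 0.55, indicating a positively skewed sales distribution.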
References:
[1] ‘The Nature of Statistics’, Agresti and Finlay, Johnson and Bhattacharya, Weiss, Anderson and Sclove and Freud
[2] www.quora.com
[3] eople.revoledu.com
[4] www.youtube.com/watch?v=LPHYPXBK_ks
[5] http://www.uth.tmc.edu/uth_org
[5] http://coolcosmos.ipac.caltech.edu
[6] https://www.socialresearchmethods.net
[7] http://www.clipartpanda.com/
[8] http://www.statisticshowto.com
[9] https://www.youtube.com/watch?v=be9e-Q-jC-0
[10] http://grocery88.ml/lowa/convenience-sampling
[11] https://faculty.elgin.edu/
[13] ducation-savvy.blogspot.com
[14] tudy.com/academy/lesson/definition
[15] https://www.youtube.com/watch?v=0ZKtsUkrgFQ
[16] http://www.lightbulbbooks.com
[17] http://www.picquery.com/mode
[18] www.dsource.in/resource/elephant
[19] https://www.youtube.com/watch?v=PKWVAIP17pw
[20] https://www.youtube.com/watch?v=trS95t3rs8Q
[21] https://www.youtube.com/watch?v=HFuSLTQ1Izc&t=541
[22] Statistical methods, ‘N. G.Das’, McGraw Hill Companies.
[23] https://www.youtube.com/watch?v=ZfHXdIFS-mQ
[24] https://onlinecourses.science.psu.edu/stat100/node/11
[25] http://www.psychometric-success.com
[26] https://stackoverflow.com
[27] https://en.wikipedia.org
[28] http://www.mathsisfun.com
[29] http://www.sigmetrix.com
[30] https://www.slideshare.net
[31] https://www.youtube.com/watch?v=1da4auXziT8
[32] http://www.mathcaptain.com
