Introduction To Data Analytics-Module 1 Part 2
Presented by
Dr A Kumar
Dr S Sudalai Shunmugam
Dept. of EEE, BNMIT
Descriptive Statistics
– The absolute and relative frequencies for the attribute “Height” and the
respective cumulative frequencies are shown in the table on the previous slide.
– The absolute and relative cumulative frequencies are the number and the
percentage of occurrences less than or equal to a given value.
– The absolute cumulative frequency in the last row is always the total
number of instances.
– The relative cumulative frequency in the last row is always 100%.
– The relative frequencies define distribution functions, i.e. they describe how the data
are distributed. The column “Rel Freq” in the table is an example of an “Empirical
frequency distribution”, while the column “Rel Cum Freq” is an example of an
“Empirical cumulative distribution function”. They are referred to as “Empirical”
because they are obtained from a sample.
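The frequency columns described above can be sketched in a few lines of Python. This is a minimal illustration, not the table from the previous slide: it assumes the 11 “Height” values listed later in these slides, and the function name `frequency_table` is hypothetical.

```python
from collections import Counter

# Assumption: the 11 "Height" values (cm) listed later in these slides.
heights = [175, 195, 172, 180, 168, 172, 155, 180, 162, 183, 169]

def frequency_table(values):
    """Return rows of (value, abs freq, rel freq, abs cum freq, rel cum freq)."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]  # occurrences less than or equal to value
        rows.append((value, counts[value], counts[value] / n,
                     cumulative, cumulative / n))
    return rows

table = frequency_table(heights)
for value, freq, rel_freq, cum_freq, rel_cum_freq in table:
    print(f"{value:>5} {freq:>3} {rel_freq:7.2%} {cum_freq:>3} {rel_cum_freq:8.2%}")
```

The last row carries the total number of instances and a relative cumulative frequency of 100%, as stated in the bullets above.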
Descriptive Univariate analysis
Univariate analysis Frequencies
[Figure: probability mass function (PMF), probability versus outcome for outcomes 1 to 6, and cumulative distribution function, cumulative probability versus outcome with P(X<=4) and P(X<=6) marked]
Univariate analysis Frequencies
Probability density function: continuous data, e.g. height of a person
[Figure: probability density and cumulative probability versus height in cms (140 to 190); annotation: slope at 165 = 0.04]
Univariate data visualization
Need for data visualization:
Understand the trends and patterns of data
Analyze the frequency and other such characteristics of
data
Know the distribution of the variables in the data.
Visualize the relationship that may exist between
different variables
Methods for data visualization:
Pie charts
Bar charts
Line charts
Area charts
Histograms
Note:
• https://data.gov.in/
• https://visualize.data.gov.in/?inst=b596644e-6ace-47c1-824f-e03706850b78&vid=108277
• https://lookerstudio.google.com/reporting/32fc04bd-5535-434a-89d1-391f88d7b4c9/page/q63DC
Univariate data visualization
– Pie Charts:
They are used for nominal scale.
It is not advisable to use them for ordinal and quantitative
scales.
Univariate data visualization
Bar Charts:
They are typically used for qualitative scales.
They can also be used for quantitative scales (e.g. the number of
students with a given mark on a 0-25 integer scale).
For ordinal and quantitative scales, the classes should be
displayed on the horizontal axis, typically in increasing order of
magnitude.
Univariate data visualization
Line Charts:
They are used to deal with the notion of time.
They are used to represent time series, i.e. graphs of values
obtained over regular time sequences.
They are used when the horizontal axis has a quantitative scale with equal
lag between observations.
Univariate data visualization
Area Charts:
Used to compare time series and distribution functions.
Understanding data distributions gives us strong insights about
an attribute.
We can observe whether the data are more concentrated around some values or
whether other values are rare.
Univariate data visualization
Histograms: Used to represent empirical distributions for attributes
with a quantitative scale. Histograms are characterized by grouping
values into cells, which reduces the sparsity that is common in quantitative
scales.
• As a rule of thumb, the number of cells in a histogram should be around the square root of the number of values.
• The cells should be equal-sized.
• Use cell limits such as [10,15] or [10,20], etc.
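The square-root rule and equal-sized cells can be sketched as follows. This is a minimal illustration under stated assumptions: the data are the “Height” values from the example data set in these slides, the function name `histogram_counts` is hypothetical, and the cell limits are taken directly from the minimum and maximum rather than rounded to convenient limits as the last bullet suggests.

```python
import math

# Assumption: the 11 "Height" values (cm) from the example data set.
heights = [175, 195, 172, 180, 168, 172, 155, 180, 162, 183, 169]

def histogram_counts(values):
    """Group values into about sqrt(n) equal-sized cells; return (edges, counts)."""
    n_cells = max(1, round(math.sqrt(len(values))))  # rule of thumb
    low, high = min(values), max(values)
    if high == low:  # all values identical: a single cell holds everything
        return [low, high], [len(values)]
    width = (high - low) / n_cells  # equal-sized cells
    counts = [0] * n_cells
    for v in values:
        # The maximum value is placed in the last cell instead of a new one.
        index = min(int((v - low) / width), n_cells - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_cells + 1)]
    return edges, counts

edges, counts = histogram_counts(heights)
print(edges, counts)
```

With 11 values the rule gives 3 cells; in practice the limits would usually be widened to rounded values such as 155, 170, 185, 200.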
Univariate data visualization
• All distribution functions discussed so far were based on relative or absolute frequencies
of data samples.
• Consider cumulative distribution functions. The figure below shows an empirical
cumulative distribution based on a sample taken from a population with a known
probability density distribution, which is also depicted.
The step-wise nature of the empirical cumulative probability distribution is typical and is easily understood, for two reasons:
1. The empirical distribution contains only some of the values of the population, so there are jumps.
2. Values are usually obtained at a certain predefined level of precision, creating jumps between numbers that do not exist in the population.
Univariate data visualization
Stacked bar plot for “Company” split by gender
Gender   Good   Bad   Total
M        3      5     8
F        2      1     3
[Figure: stacked bar plot]
[Diagram: Univariate statistics branch into Location statistics and Dispersion statistics]
Univariate Statistics
Location Univariate Statistics
Location univariate statistics identify a value in a certain
position. Important location univariate statistics are:
Minimum: The lowest value
Maximum: The largest value
Mean: The average value
Mode: The most frequent value
First quartile: The value that is larger than 25% of all values
Median or second quartile: The value that is larger than 50% of all
values
Third quartile: The value that is larger than 75% of all
values
Univariate Statistics
Location Univariate Statistics
Name       Max Temp   Weight (kg)   Height (cm)   Gender   Company
Arun       25         77            175           M        Good
Bhaskar    31         110           195           M        Bad
Ramesh     21         70            172           M        Good
Ganesh     20         85            180           M        Bad
Lakshmi    10         65            168           F        Good
Suresh     24         75            172           M        Bad
Rashmi     16         58            155           F        Bad
Ismail     26         78            180           M        Good
Santhosh   30         68            162           M        Bad
Ayub       32         87            183           M        Bad
Mary       38         72            169           F        Good
Box Plot: presents minimum, first quartile, median, third or upper quartile, and
maximum in the same order either BOTTOM-UP or LEFT-RIGHT. The closer the
points are, the more frequent the values between these points are.
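[Figure: box plot of Weight on an axis from 55 to 110; minimum 58, first quartile 68, median 75, third quartile 85, maximum 110]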
Let's now try the attribute HEIGHT. The data are:
175, 195, 172, 180, 168, 172, 155, 180, 162, 183, 169
Arrange “Height” in ascending order:
155, 162, 168, 169, 172, 172, 175, 180, 180, 183, 195
N=11
[Figure: box plot of Height on an axis from 155 to 195]
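The location statistics for “Height” can be checked with Python's standard `statistics` module. This is a sketch rather than the slides' own method: the quartile convention `method="exclusive"` is an assumption, chosen because it reproduces the Weight quartiles quoted in these slides (68, 75, 85); other conventions can give slightly different quartiles.

```python
import statistics

# Data from the example table in these slides.
heights = [175, 195, 172, 180, 168, 172, 155, 180, 162, 183, 169]
weights = [77, 110, 70, 85, 65, 75, 58, 78, 68, 87, 72]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum), the values shown in a box plot."""
    ordered = sorted(values)
    # Assumption: the "exclusive" quartile convention.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, median, q3, ordered[-1]

height_summary = five_number_summary(heights)
weight_summary = five_number_summary(weights)
height_mean = statistics.mean(heights)
height_modes = statistics.multimode(heights)  # every most frequent value

print("Height:", height_summary, "mean:", round(height_mean, 2),
      "modes:", height_modes)
print("Weight:", weight_summary)
```

Because 172 and 180 each occur twice, “Height” has two modes in this sample.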
Location Univariate statistics
– Mean, mode, and median are measures of central tendency. The table shows which
of them can be used as a measure of central tendency for each SCALE TYPE.
Parameter   Nominal scale   Ordinal scale   Quantitative scale
Mean        No              In some cases   Yes
Median      No              Yes             Yes
Mode        Yes             Yes             Yes
– If the “Median” is close to the center of the box, the data distribution is
typically symmetric, i.e. values are similarly distributed in the low part and in
the high part.
Central tendency statistics in asymmetric and
symmetric distributions
Asymmetric distribution:
a) Positively skewed: tail to the right of the median, i.e. Mean > Median
b) Negatively skewed: tail to the left of the median
Location Univariate statistics
– The median or the mode is more robust than the mean as a central tendency statistic in the
presence of extremely skewed distributions.
– The mode is not useful when data are very sparse, i.e. when there are very few
observations per value.
– The median is easy to obtain. When the number ‘n’ of observations is ODD, the MEDIAN
is the value in position (n+1)/2. When ‘n’ is EVEN, the MEDIAN is the
average of the values in positions (n/2) and (n/2)+1.
– In symmetric, unimodal distributions, the MEAN, MODE and MEDIAN have the
same value.
– Plots can be combined, e.g. a box plot and a histogram can be shown
together.
Location Univariate statistics
– The mean is generally unsuitable for ordinal scales, but it is used in some cases, such as the Likert
scale used in surveys, which can in some ways be treated as a quantitative scale.
Exercise: the “Marks” in ascending order are given below. Find the location statistic values.
40, 42, 44, 44, 50, 60, 62, 64, 69, 80, 80, 90
N=12
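As a hedged sketch of the exercise, the position rule for the median can be coded directly. The function name `median_by_position` is hypothetical, and quartiles are left out because the slides do not state which quartile convention to use when n is even.

```python
import statistics

marks = [40, 42, 44, 44, 50, 60, 62, 64, 69, 80, 80, 90]  # already ascending

def median_by_position(sorted_values):
    """Median using the odd/even position rule on values sorted ascending."""
    n = len(sorted_values)
    if n % 2 == 1:
        # Odd n: the value in position (n + 1) / 2 (1-based).
        return sorted_values[(n + 1) // 2 - 1]
    # Even n: the average of the values in positions n/2 and n/2 + 1 (1-based).
    return (sorted_values[n // 2 - 1] + sorted_values[n // 2]) / 2

marks_minimum = marks[0]
marks_maximum = marks[-1]
marks_mean = statistics.mean(marks)
marks_median = median_by_position(marks)
marks_modes = statistics.multimode(marks)  # all most frequent values

print(marks_minimum, marks_maximum, round(marks_mean, 2),
      marks_median, marks_modes)
```

Here n = 12 is even, so the median is the average of the 6th and 7th values (60 and 62).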
– Dispersion statistic measures how distant different values are. The most
common dispersion statistics are:
– Amplitude: It is the difference between the maximum and minimum values.
– Interquartile range: It is the difference between the values of the third quartile
(Q3) and first quartile (Q1)
– Mean absolute deviation (MADx): It is a measure of the mean absolute
distance between the observations and the mean.
Its formula for the population is: $MAD_x = \frac{1}{n} \sum_{i=1}^{n} |x_i - \mu_x|$
MAD for a sample is: $MAD_x = \frac{1}{n-1} \sum_{i=1}^{n} |x_i - \bar{x}|$
– Standard deviation: It is another measure for the typical distance between the
observations and their mean.
Note: The size of the sample is always less than the total size of the population.
Dispersion univariate statistics for the attribute “Weight”. The DATA are: 58, 65, 68, 70, 72, 75, 77,
78, 85, 87, 110 (since this is a sample, use the sample formulas).
Minimum = 58, Maximum = 110, Mean ≈ 77, Q1 = 68, Median = 75, Q3 = 85.

Weight   |x − x̄|   |x − x̄|²
58       19        361
65       12        144
68       9         81
70       7         49
72       5         25
75       2         4
77       0         0
78       1         1
85       8         64
87       10        100
110      33        1089
Total    106       1918      (n = 11)

Dispersion statistics:
1. Amplitude = (max – min) = (110 – 58) = 52
2. Interquartile range = Q3 – Q1 = (85 – 68) = 17
3. MADx = 106/10 = 10.6
4. Sample standard deviation Sx = sqrt(1918/10) = 13.85
5. Sample variance = 13.85² = 191.82
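The dispersion statistics for “Weight” can be reproduced with a short Python sketch. Two assumptions are worth flagging: the code uses the unrounded mean (about 76.82) instead of the rounded value 77 used in the worked table, so MAD and the standard deviation differ slightly in the second decimal, and the quartiles use the `method="exclusive"` convention because it matches Q1 = 68 and Q3 = 85.

```python
import math
import statistics

weights = [58, 65, 68, 70, 72, 75, 77, 78, 85, 87, 110]
n = len(weights)
mean = statistics.mean(weights)  # unrounded mean, about 76.82

# Amplitude: difference between the maximum and minimum values.
amplitude = max(weights) - min(weights)

# Interquartile range: Q3 - Q1 (assumed "exclusive" quartile convention).
q1, _, q3 = statistics.quantiles(weights, n=4, method="exclusive")
iqr = q3 - q1

# Sample formulas divide by n - 1.
mad = sum(abs(x - mean) for x in weights) / (n - 1)
variance = sum((x - mean) ** 2 for x in weights) / (n - 1)
std_dev = math.sqrt(variance)

print(amplitude, iqr, round(mad, 2), round(std_dev, 2), round(variance, 2))
```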
Common univariate probability distributions
– Once the distribution is known, its probability density function can be written down and
probabilities can be calculated. Probabilities measure the likelihood of an attribute taking a
value or a range of values.
Common univariate probability distributions
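$P(x < x_0) = \begin{cases} 0, & \text{if } x_0 < a \\ \frac{x_0 - a}{b - a}, & \text{if } a \le x_0 \le b \\ 1, & \text{if } x_0 > b \end{cases}$
– Therefore P(x<0.3) = 0.3, which is consistent with a = 0 and b = 1.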
Common univariate probability distributions
Soln:
P(17<wt<19) = (x2 – x1) / (b – a) where x2 = 19, x1 = 17, b = 25 and a = 15
P(17<wt<19) = (19 – 17) / (25 – 15) = 0.2
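The two calculations above can be sketched in Python, assuming a uniform distribution on [a, b] (the function names are hypothetical, and a = 0, b = 1 for the first example is an assumption consistent with the result 0.3).

```python
import math

def uniform_cdf(x0, a, b):
    """P(x < x0) for a uniform distribution on [a, b]."""
    if a >= b:
        raise ValueError("a must be smaller than b")
    if x0 < a:
        return 0.0
    if x0 > b:
        return 1.0
    return (x0 - a) / (b - a)

def uniform_interval_probability(x1, x2, a, b):
    """P(x1 < x < x2), computed as the difference of two cumulative values."""
    return uniform_cdf(x2, a, b) - uniform_cdf(x1, a, b)

p_below = uniform_cdf(0.3, 0, 1)                          # assumed a = 0, b = 1
p_weight = uniform_interval_probability(17, 19, 15, 25)   # worked example above

print(p_below, p_weight)
```

Writing the interval probability as a difference of cumulative values gives the same result as (x2 – x1) / (b – a) whenever both limits lie inside [a, b].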
Common univariate probability distributions
– The central limit theorem states that when independent samples of a sufficiently
large size are drawn from a population with finite variance, the means of those samples
are approximately normally distributed, and the mean of the sample means is
approximately equal to the mean of the whole population.
– Eg. Physical quantities that are expected to be the sum of many independent
factors (People’s heights) typically have normal distribution.
– Normal distribution has two parameters: MEAN and STANDARD DEVIATION.
– MEAN localizes the highest point of the bell-shaped distribution.
– STANDARD DEVIATION defines how thin or wide the bell shape of the
distribution is. An attribute ‘x’ that follows normal distribution is denoted as
Common univariate probability distributions
– Bivariate analysis deals with pairs of attributes and their relative behavior.
– Bivariate analysis with both quantitative attributes
– In a data set whose objects/instances have ‘n’ attributes, each instance/object can
be represented in an n-dimensional space: a space with ‘n’ axes.
– Each axis representing one of the attributes. The position occupied by an object is
given by the value of its attributes.
– Visualization of two quantitative attributes can be done by different techniques.
One of them is Three-Dimensional Histogram.
– Eg. From our data set:
Weight: 58, 65, 68, 70, 72, 75, 77, 78, 85, 87, 110
Height : 155, 162, 168, 169, 172, 172, 175, 180, 180, 183, 195
Descriptive bivariate analysis --
both quantitative attributes
[Figure: three-dimensional histogram (frequency distribution) of the attributes Height and Weight; height classes 155-170, 170-180, 180-190 and >190 cms; weight classes 55-70, 70-85, 85-100 and 100-115 kg; vertical axis: frequency]
Descriptive bivariate analysis: data visualization using a SCATTER plot. It illustrates how the
values of two attributes are correlated and makes it possible to see how one attribute varies with the other.
[Figure: scatter plot “Weight vs Height”; x-axis: Weight in kg (50 to 120), y-axis: Height in cms (100 to 200)]
Descriptive bivariate analysis
– The degree to which these relations exist is measured by covariance between them.
– When the two attributes have similar variation, covariance is a positive value
– If the two attributes vary in opposite way, covariance is a negative value
– If the attributes have independent variation, the covariance will tend to zero
– The value of co-variance depends upon the magnitude of the attributes
– Variance is a special case of covariance. It is the covariance of an attribute with itself.
– Only linear relation can be captured.
– The covariance between two attributes (xj, xk) in a sample of n objects is calculated as:
$cov(x_j, x_k) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$
Bivariate statistics
Descriptive bivariate analysis: determination of COVARIANCE between attributes WEIGHT and HEIGHT
– Cov(xj , xk ) = 1420.46 / 10 = 142.046
[Figure: scatter plot “Weight vs Height”; x-axis: Weight in kg (50 to 120), y-axis: Height in cms (100 to 200)]
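The covariance of “Weight” and “Height” can be checked with a short sketch. It assumes the sample formula with divisor n – 1 (which matches the division by 10 above) and pairs each weight with the height of the same person in the example table; the function name `sample_covariance` is hypothetical.

```python
# (weight in kg, height in cm) for each person in the example table.
people = [
    (77, 175), (110, 195), (70, 172), (85, 180), (65, 168), (75, 172),
    (58, 155), (78, 180), (68, 162), (87, 183), (72, 169),
]

def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

weights = [w for w, _ in people]
heights = [h for _, h in people]

cov_weight_height = sample_covariance(weights, heights)
var_weight = sample_covariance(weights, weights)  # covariance with itself

print(round(cov_weight_height, 3), round(var_weight, 2))
```

The second call illustrates that variance is the covariance of an attribute with itself. The pairing matters: sorting the two attributes independently would change the result.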
Bivariate statistics
Descriptive bivariate analysis: determination of COVARIANCE between temperature and resistance of germanium (-0.05 Ω/°C)
[Figure: scatter plot “Temperature vs Resistance”; x-axis: temperature in degrees Celsius (0 to 70), y-axis: resistance (0 to 6)]
Descriptive bivariate analysis
– Covariance is a useful measure of how the values of two attributes relate to
each other.
– However, the size of the range of values of the attributes influences the covariance values.
– A better measure is correlation.
– The linear correlation between two attributes, also known as the Pearson correlation,
gives a clearer indication of how similar the attributes are, and is usually preferred
over covariance.
– The correlation between two attributes xj and xk is given as:
rjk = cor(xj ,xk ) = cov (xj ,xk ) / ( Sj * Sk ) where,
cov (xj ,xk ) is the covariance between the two attributes
Sj and Sk are the sample standard deviations of attributes xj and xk
Bivariate statistics
Descriptive bivariate analysis: determination of the Pearson correlation between attributes WEIGHT and HEIGHT
– Sk = sample standard deviation of xk (Height) = sqrt ( 1168.19 / 10) = 10.8
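Putting the pieces together, the Pearson correlation between “Weight” and “Height” can be sketched as the covariance divided by the product of the two sample standard deviations. The function names are hypothetical and the sample (n – 1) formulas are assumed throughout, in line with the worked values above.

```python
import math

# (weight in kg, height in cm) for each person in the example table.
people = [
    (77, 175), (110, 195), (70, 172), (85, 180), (65, 168), (75, 172),
    (58, 155), (78, 180), (68, 162), (87, 183), (72, 169),
]
weights = [w for w, _ in people]
heights = [h for _, h in people]

def sample_covariance(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_std(xs):
    """Sample standard deviation: square root of the covariance with itself."""
    return math.sqrt(sample_covariance(xs, xs))

def pearson_correlation(xs, ys):
    """r = cov(x, y) / (s_x * s_y)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))

s_weight = sample_std(weights)
s_height = sample_std(heights)
r_weight_height = pearson_correlation(weights, heights)

print(round(s_weight, 2), round(s_height, 2), round(r_weight_height, 3))
```

Unlike the covariance, the result does not depend on the units of the two attributes and always lies between -1 and 1.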
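– Spearman’s rank correlation is based on rankings. Instead of evaluating how linear the shape formed by the
points is, it compares the ordered lists of each of the two attributes.
– The expression used to find Spearman’s rank correlation is:
$\rho_{jk} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the ranks of object i in the two attributes.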
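As an illustrative sketch, Spearman's rank correlation for “Weight” and “Height” can be computed with the commonly used rank-difference formula, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)). Assumptions: tied values receive the average of their ranks, and with ties (as in “Height”) this formula is only an approximation of the Pearson correlation computed on the ranks. The function names are hypothetical.

```python
# (weight in kg, height in cm) for each person in the example table.
people = [
    (77, 175), (110, 195), (70, 172), (85, 180), (65, 168), (75, 172),
    (58, 155), (78, 180), (68, 162), (87, 183), (72, 169),
]
weights = [w for w, _ in people]
heights = [h for _, h in people]

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1  # average of the 1-based positions
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks

def spearman_rank_correlation(xs, ys):
    """Rank-difference formula: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    rank_x = average_ranks(xs)
    rank_y = average_ranks(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

rho_weight_height = spearman_rank_correlation(weights, heights)
print(round(rho_weight_height, 4))
```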
– When both attributes are qualitative, with at least one attribute being nominal,
contingency tables are required.
– Contingency tables present the joint frequencies, facilitating the identification of
interactions between the two attributes.
– They have a matrix-like format, with cells arranged in a grid and labels at the left and top.
– The rightmost column holds the totals per row, while the bottom row holds
the totals per column. The bottom right-hand corner holds the total number of values.
– Mosaic plots are used to show the information of a contingency table in a more
appealing visual way.
Contingency table: e.g. consider Gender and Company as the attributes.
In our data, we had 8 men [3 Good, 5 Bad] and 3 women [2 Good, 1 Bad].

                   Company
                   Good   Bad   Total
Gender   Male      3      5     8
         Female    2      1     3
Total              5      6     11

Contingency table with absolute joint frequencies for “Company” and “Gender”
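The joint frequencies and margins of this contingency table can be derived from the raw records with a small sketch. The (Gender, Company) pairs are taken from the example table in these slides; the variable names are hypothetical.

```python
from collections import Counter

# (Gender, Company) for each of the 11 people in the example table.
records = [
    ("M", "Good"), ("M", "Bad"), ("M", "Good"), ("M", "Bad"),
    ("F", "Good"), ("M", "Bad"), ("F", "Bad"), ("M", "Good"),
    ("M", "Bad"), ("M", "Bad"), ("F", "Good"),
]

genders = ["M", "F"]
companies = ["Good", "Bad"]

joint = Counter(records)  # absolute joint frequency of each pair

# Margins: totals per row (gender) and per column (company).
row_totals = {g: sum(joint[(g, c)] for c in companies) for g in genders}
col_totals = {c: sum(joint[(g, c)] for g in genders) for c in companies}
grand_total = sum(joint.values())

for g in genders:
    print(g, [joint[(g, c)] for c in companies], row_totals[g])
print("Total", [col_totals[c] for c in companies], grand_total)
```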
Mosaic plot: the gender with the highest
frequency has the largest area in the plot.
[Figure: mosaic plot of COMPANY (GOOD, BAD) by GENDER]
Descriptive bivariate analysis
Both Ordinal Attributes
– Any of the methods discussed before can be used for TWO ORDINAL attributes.
– Spearman’s rank correlation should be used instead of Pearson correlation
– Scatter plots with ordinal attributes usually have the problem that there are
many values falling at the same point, making it impossible to evaluate the
number of values per point.
– Some software packages use jitter effects, which add a random deviation to the
values.
– Contingency tables can be used and mosaic plot also.
End of Part 2 of
Module 1 & Module 1