Introduction To Data Analytics-Module 1 Part 2

Introduction to Data Analytics (18EE654)

Module 1: Getting Insights from Data (Part 2)

Presented by
Dr A Kumar
Dr S Sudalai Shunmugam
Dept. of EEE, BNMIT
Descriptive Statistics

– In most situations it is difficult or even impossible to survey the whole population. [Population means all the instances in a data set.]
– Even when surveying the whole population is theoretically possible, the cost usually makes it inadvisable.
– Hence sampling is done. [Eg. Food review of a restaurant] By analyzing a subset of the population, it is possible to estimate particular values of the population in a quantified way. [Eg. Exit polls]
Descriptive Statistics

– Generalizing the knowledge obtained from a sample to the whole population is called “Statistical Inference or Induction”.
– The value obtained by considering the whole population is necessarily unique for that population. The larger the sample, the closer the estimate tends to be to the population value.
– Induction generalizes from the sample to the population; deduction particularizes from the population to the sample.
The main areas of statistics.
Descriptive statistics

– The goal of deduction is to deduce the nature of a sample, knowing the population. [Eg. Selecting the best 11 players for a Dream11 game]
– Descriptive statistics is the branch of statistics that sets out methods to describe data samples through summarization and visualization.
– The ways of describing and visualizing data are usually categorized according to the number of attributes considered:
– Univariate analysis – Analysis of a single attribute
– Bivariate analysis – Analysis of a pair of attributes
– Multivariate analysis – Analysis of more than two attributes
Scale types to describe data
Consider the data set of contacts with a few added attributes.

Name Max Temp Weight (kg) Height (cm) Gender Company
Arun 25 77 175 M Good
Bhaskar 31 110 195 M Bad
Ramesh 21 70 172 M Good
Ganesh 20 85 180 M Bad
Lakshmi 10 65 168 F Good
Suresh 24 75 172 M Bad
Rashmi 16 58 155 F Bad
Ismail 26 78 180 M Good
Santhosh 30 68 162 M Bad
Ayub 32 87 183 M Bad
Mary 38 72 169 F Good
Scale types

– Two broad types of scale: Qualitative and Quantitative
– Qualitative scales categorize data in a nominal or ordinal way.
– Nominal data cannot be ordered according to how big or small a certain characteristic is. Example: Name of the contact, Gender
– Ordinal data can be ordered according to how big or small a certain characteristic is. Example: How good the company is.
Scale types
– Quantitative data: There are two types: Absolute (Ratio) and Relative (Interval).
– Absolute (ratio) scales have an absolute zero [Eg. Weight and Height: 0 indicates absence of the quantity]
– Relative (interval) scales have no absolute zero [Eg. Temperature, since 0° does not mean there is no temperature]
– The information we get depends on the scale type used to express the data.
– The most informative is the absolute scale, followed by the relative, ordinal and nominal scale types.

Relationship between the four scales, from lowest to highest level of information: Nominal, Ordinal, Relative, Absolute


Scale types
Operations performable on scale types

Scale type        Operations
Nominal data      Equal (=), Not equal (≠)
Ordinal values    Larger than (>), Larger than or equal to (≥), Smaller than (<), Smaller than or equal to (≤)
Relative values   Larger than (>), Larger than or equal to (≥), Smaller than (<), Smaller than or equal to (≤), Addition (+), Subtraction (-)
Absolute values   Larger than (>), Larger than or equal to (≥), Smaller than (<), Smaller than or equal to (≤), Addition (+), Subtraction (-), Multiplication (×), Division (÷)
Scale types

– Data expressed on an ABSOLUTE scale can be converted to any other scale type.
– Data expressed on a RELATIVE (interval) scale can be converted to ordinal or nominal.
– Data expressed on an ordinal scale can be converted to a nominal scale.
– Converting a more informative scale into a less informative one loses information.
– Converting a less informative scale into a more informative one is also possible, but the level of information obtained is limited by the information contained in the original scale.
Scale types: consider the absolute scale type attribute “Weight” converted to other scales
– To RELATIVE: The weight can be converted to a relative scale by subtracting a value of, say, 10. The old zero becomes -10 and the new zero is the old 10, i.e. ZERO no longer means that there is no weight. [Eg. weight measurement in shops]
– To ORDINAL: We can define levels of fatness: FAT for weight > 80 kg, NORMAL for weight between 65 kg and 80 kg, THIN for weight < 65 kg. With this classification we can still order groups of people as being more or less fat. (Eg. BMI)
– To NOMINAL: We can relabel the classes FAT, NORMAL and THIN as B, A and C respectively. With such labels it is no longer possible to order the contacts according to their fatness, because B, A and C do not quantify anything.
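The three conversions of “Weight” described above can be sketched in a few lines of Python. This is only an illustration: the function names and the `NOMINAL_LABEL` mapping are assumptions made for this sketch, while the thresholds and labels are the ones given on the slide.

```python
def to_relative(weight_kg, offset=10):
    """Shift the zero point: the result is on a relative (interval) scale."""
    return weight_kg - offset

def to_ordinal(weight_kg):
    """Map a weight to the ordered classes THIN < NORMAL < FAT."""
    if weight_kg > 80:
        return "FAT"
    if weight_kg >= 65:
        return "NORMAL"
    return "THIN"

# Arbitrary labels: once applied, the order of the classes is no longer encoded.
NOMINAL_LABEL = {"FAT": "B", "NORMAL": "A", "THIN": "C"}

weights = [77, 110, 70, 85, 65, 75, 58, 78, 68, 87, 72]
ordinal = [to_ordinal(w) for w in weights]
nominal = [NOMINAL_LABEL[c] for c in ordinal]
print(ordinal)
print(nominal)
```

Each step discards information: the ordinal classes no longer tell how much heavier one contact is than another, and the nominal labels no longer tell which class is heavier.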
Scale types

– In a software package, an appropriate data type has to be selected for each attribute.
– Common data types are: text, character, integer, real, float, date etc.
– Quantitative scale type: implies the use of numeric data types, either discrete (such as integer) or continuous (such as float and real).
– NOTE: An attribute can be expressed as a number without its scale type being quantitative. It can be ordinal or nominal. Eg. A numeric code (OTP) carries no quantitative information.
Descriptive Univariate analysis
Univariate analysis Frequencies

– In descriptive univariate analysis, three types of information can be obtained: frequency tables, statistical measures, and plots.
– Univariate frequencies: A frequency is basically a counter. The absolute frequency counts how many times a value appears. The relative frequency is the percentage of times that value appears.
– Eg. Absolute and relative frequency for attribute “Company”

Table: Univariate absolute and relative frequencies for the “Company” attribute (Total samples: 11)
Attribute-Company   Absolute frequency   Relative frequency
Good                5                    45.45%
Bad                 6                    54.55%
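The frequency table above can be reproduced with a counter. The sketch below uses Python's `collections.Counter`; the list `company` is simply the “Company” column of the contacts data set typed in row order.

```python
from collections import Counter

company = ["Good", "Bad", "Good", "Bad", "Good", "Bad",
           "Bad", "Good", "Bad", "Bad", "Good"]

absolute = Counter(company)                                 # value -> count
n = len(company)
relative = {v: 100 * c / n for v, c in absolute.items()}    # value -> percent

for value in absolute:
    print(f"{value:5s} {absolute[value]:2d} {relative[value]:6.2f}%")
```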
Descriptive Univariate analysis
Univariate analysis Frequencies

– Arranging “Height” in ascending order, we get
– 155, 162, 168, 169, 172, 172, 175, 180, 180, 183, 195

Height   Abs Freq   Rel Freq         Abs Cum Freq   Rel Cum Freq
155      1          1/11 = 9.09%     1              9.09%
162      1          9.09%            2              18.18%
168      1          9.09%            3              27.27%
169      1          9.09%            4              36.36%
172      2          2/11 = 18.18%    6              54.55%
175      1          9.09%            7              63.64%
180      2          18.18%           9              81.82%
183      1          9.09%            10             90.91%
195      1          9.09%            11             100.00%
Descriptive Univariate analysis
Univariate analysis Frequencies

– The absolute and relative frequencies for the attribute “Height”, and the respective cumulative frequencies, are shown in the table on the previous slide.
– The absolute and relative cumulative frequencies are the number and the percentage of occurrences less than or equal to a given value.
– The absolute cumulative frequency in the last row is always the total number of instances.
– The relative cumulative frequency in the last row is always 100%.
– The relative frequencies define distribution functions, i.e. they describe how the data are distributed. The column “Rel Freq” in the table is an example of an “Empirical frequency distribution”, while the column “Rel Cum Freq” is an example of an “Empirical cumulative distribution function”. They are referred to as “Empirical” because they are obtained from a sample.
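As a rough illustration, the four frequency columns for “Height” can be computed as below. The variable names are chosen for this sketch only; `itertools.accumulate` produces the running totals.

```python
from collections import Counter
from itertools import accumulate

heights = [175, 195, 172, 180, 168, 172, 155, 180, 162, 183, 169]
n = len(heights)

counts = Counter(heights)
values = sorted(counts)                          # distinct heights, ascending
abs_freq = [counts[v] for v in values]
abs_cum = list(accumulate(abs_freq))             # running total of counts
rel_freq = [100 * f / n for f in abs_freq]
rel_cum = [100 * c / n for c in abs_cum]

for row in zip(values, abs_freq, rel_freq, abs_cum, rel_cum):
    print("{:4d} {:2d} {:6.2f}% {:3d} {:7.2f}%".format(*row))
```

The last entries of `abs_cum` and `rel_cum` are the number of instances and 100%, as stated in the bullets.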
Descriptive Univariate analysis
Univariate analysis Frequencies

– Distribution functions for populations are called probability distribution functions. Depending on the data type of the attribute, they are either probability mass functions or probability density functions.
– A discrete attribute, such as one of integer data type, has a probability mass function.
– A continuous attribute, such as one of real data type, has a probability density function, since in a continuous space the probability of taking any exact value is zero.
– Probability density functions give relative densities, while probability mass functions give relative frequencies.
– A property of a probability density function is that the area under it is always one, representing 100%.
Univariate analysis Frequencies
Probability mass function –
Eg. Dice throw: the possible values are [1, 2, 3, 4, 5, 6]
Probability P[1] = 1/6 and similarly P[2] = 1/6, ...

[Figure: two charts. Left: PMF of the dice throw (Probability vs Outcome). Right: cumulative distribution function (Cumulative Probability vs Outcome), with P(X<=4) and P(X<=6) marked.]
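The dice example can be written out exactly with fractions. This is a minimal sketch assuming a fair die; `pmf` and `cdf` are names chosen here for illustration.

```python
from fractions import Fraction
from itertools import accumulate

outcomes = [1, 2, 3, 4, 5, 6]
pmf = {x: Fraction(1, 6) for x in outcomes}           # P(X = x) for a fair die
cdf = dict(zip(outcomes, accumulate(pmf.values())))   # P(X <= x)

print(cdf[4])   # P(X <= 4)
print(cdf[6])   # P(X <= 6)
```

The probabilities of the mass function sum to one, and the cumulative function reaches one at the largest outcome.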
Univariate analysis Frequencies
Probability density function –
continuous data. Eg. Height of a person

[Figure: two charts over Height in cms (140 to 190). One shows the probability density, the other the cumulative probability, with the annotation "Slope at 165 = 0.04".]
Univariate data visualization
Need for data visualization:
 Understand the trends and patterns of data
 Analyze the frequency and other such characteristics of
data
 Know the distribution of the variables in the data.
 Visualize the relationship that may exist between
different variables
Methods for data visualization:
 Pie charts
 Bar charts
 Line charts
 Area charts
 Histograms
Note:
• https://data.gov.in/
• https://visualize.data.gov.in/?inst=b596644e-6ace-47c1-824f-e03706850b78&vid=108277
• https://lookerstudio.google.com/reporting/32fc04bd-5535-434a-89d1-391f88d7b4c9/page/q63DC
Univariate data visualization
– Pie Charts:
 They are used for nominal scale.
 It is not advisable to use them for ordinal and quantitative
scales.
Univariate data visualization
Bar Charts:
 They are typically used for qualitative scales.
 They can also be used for quantitative scales. [Eg. Number of students with a given mark on a 0-25 integer scale]
 For ordinal and quantitative scales the classes should be displayed along the horizontal axis, typically in increasing order of magnitude.
Univariate data visualization
Line Charts:
 They are used to deal with the notion of time.
 They are used to represent time series, i.e. values obtained over regular time sequences.
 They are used when the horizontal axis uses a quantitative scale with equal lag between observations.
Univariate data visualization
Area Charts:
 Used to compare time series and distribution functions.
 Understanding data distributions gives us strong insights about an attribute.
 We can observe whether data are more concentrated around some values or whether other values are rare.
Univariate data visualization
Histograms: Used to represent empirical distributions for attributes with a quantitative scale. Histograms are characterized by grouping values into cells, reducing the sparsity which is common in quantitative scales.
• As a rule of thumb, the number of cells in a histogram should be around the square root of the number of values.
• The cells should be equal-sized.
• Use cell limits such as [10,15] or [10,20] etc.
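The square-root rule and the equal-sized cells can be tried on the “Height” attribute. The sketch below is one possible way to do the grouping, not a prescribed one: it takes the cell limits from the minimum and maximum of the data, which is an assumption of this example.

```python
import math

heights = [175, 195, 172, 180, 168, 172, 155, 180, 162, 183, 169]

n_cells = round(math.sqrt(len(heights)))        # rule of thumb: about sqrt(n) cells
low, high = min(heights), max(heights)
width = (high - low) / n_cells                  # equal-sized cells

counts = [0] * n_cells
for h in heights:
    # The maximum value is placed in the last cell instead of opening a new one.
    i = min(int((h - low) / width), n_cells - 1)
    counts[i] += 1

for i, c in enumerate(counts):
    # Text histogram: lower limit, upper limit, one mark per value in the cell.
    print(f"{low + i * width:6.1f} to {low + (i + 1) * width:6.1f}  {'#' * c}")
```

With 11 values the rule gives 3 cells; in practice the limits would usually be rounded to convenient numbers such as multiples of 5 or 10.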
Univariate data visualization
• All the distribution functions discussed so far were based on relative or absolute frequencies of data samples.
• Consider cumulative distribution functions. The figure shows an empirical cumulative distribution based on a sample taken from a population with a known probability density function, which is also depicted.
The step-wise nature of the empirical cumulative distribution is typical and easily understood, because:
1. The empirical distribution contains only some of the values of the population, so there are jumps.
2. Values are usually recorded at a certain predefined level of precision, which creates jumps between numbers that do not exist in the population.
Univariate data visualization
Stacked bar plot
Stacked bar plots are used to represent frequencies for combined values of two different attributes in a single chart.

Gender   Good   Bad   Total
M        3      5     8
F        2      1     3

[Figure: stacked bar plot for attribute “Company” split by gender (bars M and F, frequency axis 0 to 9)]
Univariate Statistics
• A statistic is a descriptor: it describes numerically a characteristic of the sample or the population.

Univariate statistics are divided into Location statistics and Dispersion statistics.
Univariate Statistics
Location Univariate Statistics
Location univariate statistics identify a value in a certain position. Important location univariate statistics are:
Minimum: The lowest value
Maximum: The largest value
Mean: The average value
Mode: The most frequent value
First quartile: The value that is larger than 25% of all values
Median or 2nd quartile: The value that is larger than 50% of all values
Third quartile: The value that is larger than 75% of all values
Univariate Statistics
Location Univariate Statistics
Name Max Temp Weight (kg) Height (cm) Gender Company
Arun 25 77 175 M Good
Bhaskar 31 110 195 M Bad
Ramesh 21 70 172 M Good
Ganesh 20 85 180 M Bad
Lakshmi 10 65 168 F Good
Suresh 24 75 172 M Bad
Rashmi 16 58 155 F Bad
Ismail 26 78 180 M Good
Santhosh 30 68 162 M Bad
Ayub 32 87 183 M Bad
Mary 38 72 169 F Good

Let's select “Weight” as the univariate attribute.

Location statistics:
Arrange “Weight” in ascending order:
58, 65, 68, 70, 72, 75, 77, 78, 85, 87, 110
N=11

– The location statistics values are given below:
– Minimum : 58
– Maximum : 110
– Mean : 845/11 = 76.82 ≈ 77
– Mode : Since every value appears only once, there is no MODE for this data set.
– First quartile (Q1) : Element(¼(n+1)) = Element(¼(11+1)) = Element(3) = 68
– Median or second quartile (Q2) : Element(½(n+1)) = Element(½(11+1)) = Element(6) = 75
– Third quartile (Q3) : Element(¾(n+1)) = Element(¾(11+1)) = Element(9) = 85
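The location statistics for “Weight” can be checked with Python's standard `statistics` module. This is a sketch under the slide's convention that quartiles are read off at positions (n+1)/4, (n+1)/2 and 3(n+1)/4 of the sorted data; the helper `element` and the dictionary `summary` are names introduced here for illustration.

```python
import statistics

weights = [77, 110, 70, 85, 65, 75, 58, 78, 68, 87, 72]
data = sorted(weights)
n = len(data)

def element(position):
    """Return the value at a 1-based position in the sorted data."""
    return data[position - 1]

modes = statistics.multimode(data)       # every value that ties for most frequent
summary = {
    "min": data[0],
    "max": data[-1],
    "mean": statistics.mean(data),
    # If all values tie, no value stands out, so report no mode.
    "mode": modes if len(modes) < len(set(data)) else [],
    "q1": element((n + 1) // 4),         # valid here because n + 1 = 12 is divisible by 4
    "median": element((n + 1) // 2),
    "q3": element(3 * (n + 1) // 4),
}
print(summary)
```

Replacing `weights` with the “Height” values would report the two modes 172 and 180.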
Graphical representation of univariate location statistics for the attribute “Weight” using a BOX plot

Box Plot: presents the minimum, first quartile, median, third (upper) quartile, and maximum, in that order, either BOTTOM-UP or LEFT-RIGHT. The closer two of these points are, the more densely the values are packed between them.

[Figure: box plot of “Weight” on an axis from 55 to 110: minimum 58, Q1 68, median 75, Q3 85, maximum 110]
Let's now try the attribute HEIGHT. The data are:
175, 195, 172, 180, 168, 172, 155, 180, 162, 183, 169
Arrange “Height” in ascending order:
155, 162, 168, 169, 172, 172, 175, 180, 180, 183, 195
N=11

– The location statistics values are given below:
– Minimum : 155
– Maximum : 195
– Mean : 1911/11 = 173.7 ≈ 174
– Mode : Since 172 and 180 each appear twice, the data set is BIMODAL [172, 180].
– First quartile (Q1) : Element(¼(n+1)) = Element(¼(11+1)) = Element(3) = 168
– Median or second quartile (Q2) : Element(½(n+1)) = Element(½(11+1)) = Element(6) = 172
– Third quartile (Q3) : Element(¾(n+1)) = Element(¾(11+1)) = Element(9) = 180
Graphical representation of univariate location statistics for the attribute “Height” using a BOX plot

[Figure: box plot of “Height” on an axis from 155 to 195: minimum 155, Q1 168, median 172, Q3 180, maximum 195]
Location Univariate statistics

– Mean, Mode, and Median are measures of central tendency. The table shows which of them can be used as a measure of central tendency for each SCALE TYPE.

Parameter   Nominal scale   Ordinal scale   Quantitative scale
Mean        No              Sometimes       Yes
Median      No              Yes             Yes
Mode        Yes             Yes             Yes

– If the “Median” is close to the center of the box in a box plot, the data distribution is typically symmetric, i.e. values are similarly distributed in the low part and in the high part.
Central tendency statistics in asymmetric and symmetric distributions
[Figure: asymmetric distributions]
a) Positively skewed: tail to the right of the median, i.e. Mean > Median
b) Negatively skewed: tail to the left of the median, i.e. typically Mean < Median
Location Univariate statistics

– The median or the mode is more robust than the mean as a central tendency statistic in the presence of an extremely skewed distribution.
– The mode is not useful when data are very sparse, i.e. when there are very few observations per value.
– The median is easy to obtain when the number ‘n’ of observations is ODD: the MEDIAN is the value in position (n+1)/2. If ‘n’ is EVEN, the MEDIAN is the average of the values in positions (n/2) and (n/2)+1.
– In symmetric unimodal distributions, the MEAN, MODE and MEDIAN have the same value.
– Plots can be combined, eg. a box plot can be combined with a histogram.
Location Univariate statistics

– The mean is generally unsuitable for ordinal scales, but it is used in some cases, such as the Likert ordinal scale used in surveys, which can in some way be seen as a quantitative scale.
The “Marks” in ascending order are given. Find the location statistic values.
40, 42, 44, 44, 50, 60, 62, 64, 69, 80, 80, 90
N=12

– The location statistics values are given below:
– Minimum : 40
– Maximum : 90
– Mean : 725/12 = 60.42 ≈ 60
– Mode : Since 44 and 80 each appear twice, the data set is BIMODAL [44, 80].
– First quartile (Q1) : Element(¼(n+1)) = Element(¼(12+1)) = Element(3.25) = Average of Element(3) and Element(4) = (44+44)/2 = 44
– Median or second quartile (Q2) : Average of Element(n/2) and Element((n/2)+1) = Average of Element(6) and Element(7) = (60+62)/2 = 61
– Third quartile (Q3) : Element(¾(n+1)) = Element(¾(12+1)) = Element(9.75) = Average of Element(9) and Element(10) = (69+80)/2 = 74.5
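The quartile rule used in these worked examples, including the averaging step for fractional positions, can be sketched as follows. This follows the slides' convention, which is an assumption of the sketch: statistical packages often interpolate differently and may return slightly different quartiles. The function names are illustrative.

```python
import math

def element(sorted_data, position):
    """Value at a 1-based position; a fractional position averages its two neighbours."""
    lower = sorted_data[math.floor(position) - 1]
    upper = sorted_data[math.ceil(position) - 1]
    return (lower + upper) / 2

def quartiles(data):
    """Return (Q1, median, Q3) using positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    s = sorted(data)
    n = len(s)
    return tuple(element(s, k * (n + 1) / 4) for k in (1, 2, 3))

marks = [40, 42, 44, 44, 50, 60, 62, 64, 69, 80, 80, 90]
weights = [58, 65, 68, 70, 72, 75, 77, 78, 85, 87, 110]
print(quartiles(marks))
print(quartiles(weights))
```

For an even number of values the median position (n+1)/2 is fractional, so the same helper reproduces the rule of averaging positions n/2 and (n/2)+1.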
Location univariate statistics

– There are statistics for samples and for populations.
– Given a population, there is only one value for a given statistic (Mean, ...).
– For a given sample, there is also only one value for a given statistic.
– But one population can have several samples, and therefore several sample values for the same statistic: one per sample.
– Eg. The mean of a class has one single value. The mean of the girls in the class has one single value, the mean of the boys in the class has one single value, the mean of Karnataka students has one single value, and so on.
– Notations for statistics differ depending on whether they are population or sample statistics. The population mean is denoted by $\mu_x$, while the sample mean is denoted by $\bar{x}$.
Dispersion univariate statistics

– Dispersion statistics measure how distant the different values are from each other. The most common dispersion statistics are:
– Amplitude: It is the difference between the maximum and minimum values.
– Interquartile range: It is the difference between the values of the third quartile (Q3) and the first quartile (Q1).
– Mean absolute deviation (MADx): It is a measure of the mean absolute distance between the observations and the mean. Its mathematical formula for the population and for a sample is:

Population: $\mathrm{MAD}_x = \frac{\sum_{i=1}^{n} |x_i - \mu_x|}{n}$
Sample: $\mathrm{MAD}_x = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n-1}$

n – no. of observations, $\mu_x$ – mean value of the population, $\bar{x}$ – mean value of the sample


Dispersion univariate statistics

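– Standard deviation: It is another measure of the typical distance between the observations and their mean. For the population it is $\sigma_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu_x)^2}{n}}$ and for a sample it is $s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$.
– Variance: It is the square of the standard deviation, denoted $\sigma_x^2$ for the population and $s_x^2$ for a sample. It measures how spread out the values are around the mean.
– All dispersion statistics are only valid for quantitative scales.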

Note: The size of the sample is always less than the total size of the population.
Dispersion univariate statistics for the attribute “Weight”. The DATA are: 58, 65, 68, 70, 72, 75, 77, 78, 85, 87, 110 (since this is a sample, the sample formulas are used).
Minimum = 58, Maximum = 110, Mean ≈ 77, Q1 = 68, Median = 75, Q3 = 85.

Weight           $|x_i - \bar{x}|$   $|x_i - \bar{x}|^2$
58               19                  361
65               12                  144
68               9                   81
70               7                   49
72               5                   25
75               2                   4
77               0                   0
78               1                   1
85               8                   64
87               10                  100
110              33                  1089
Total (n = 11)   106                 1918

Dispersion statistics:
1. Amplitude = (max - min) = (110 - 58) = 52
2. Interquartile range = Q3 - Q1 = (85 - 68) = 17
3. MADx = 106/10 = 10.6
4. Sample standard deviation, Sx = sqrt(1918/10) = 13.85
5. Sample variance = 13.85^2 = 191.82
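The dispersion statistics for “Weight” can be recomputed as a check. The sketch below uses the unrounded mean (76.82 rather than 77), so the mean absolute deviation and the variance come out slightly different from the hand calculation; `statistics.stdev` and `statistics.variance` both use the n - 1 denominator.

```python
import statistics

weights = [58, 65, 68, 70, 72, 75, 77, 78, 85, 87, 110]
n = len(weights)
mean = statistics.mean(weights)

amplitude = max(weights) - min(weights)
iqr = 85 - 68                                          # Q3 - Q1, quartiles taken from the slide
mad = sum(abs(x - mean) for x in weights) / (n - 1)    # sample form used in the slides
std = statistics.stdev(weights)                        # sample standard deviation
var = statistics.variance(weights)                     # sample variance

print(amplitude, iqr, round(mad, 2), round(std, 2), round(var, 2))
```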
Common univariate probability distributions

– Each attribute has its own probability distribution.
– In this course the Uniform and Normal (Gaussian) distributions are considered. Both are continuous distributions with a known probability density function.
– Uniform Distribution
– A very simple distribution: the frequency of occurrence of the values is uniformly distributed over a given interval of values.
– An attribute ‘x’ that follows a uniform distribution with parameters ‘a’ and ‘b’, respectively the minimum and maximum values of the interval, is denoted as: $x \sim U(a, b)$
– Knowing the distribution, it is possible to write down its probability density function and calculate probabilities. Probabilities measure the likelihood of an attribute taking a value or a range of values.
Common univariate probability distributions

– A probability for a population is similar to a relative frequency for a sample.
– In continuous distributions the probabilities are calculated per interval.
– Consider the generation of a random number between 0 and 1.
– The distribution is represented as $x \sim U(0, 1)$
– The probability of ‘x’ being smaller than a value $x_0$ is given by the expression:

$P(x < x_0) = 0$, if $x_0 < a$
$P(x < x_0) = \frac{x_0 - a}{b - a}$, if $a \le x_0 \le b$
$P(x < x_0) = 1$, if $x_0 > b$

– Therefore P(x < 0.3) = (0.3 - 0) / (1 - 0) = 0.3
Common univariate probability distributions

Eg. 2. The weight of a certain species of frog is uniformly distributed between 15 and 25 g. If you randomly select a frog, what is the probability that it weighs between 17 and 19 g?

Soln:
P(17 < wt < 19) = (x2 - x1) / (b - a), where x2 = 19, x1 = 17, b = 25 and a = 15
P(17 < wt < 19) = (19 - 17) / (25 - 15) = 0.2
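Both uniform-distribution examples can be checked with a small function for the cumulative probability. The names `uniform_cdf` and `uniform_interval` are chosen for this sketch; the interval probability is obtained as the difference of two cumulative probabilities, which gives the same result as (x2 - x1) / (b - a) when both limits lie inside [a, b].

```python
def uniform_cdf(x0, a, b):
    """P(x < x0) for x uniformly distributed on [a, b]."""
    if x0 < a:
        return 0.0
    if x0 > b:
        return 1.0
    return (x0 - a) / (b - a)

def uniform_interval(x1, x2, a, b):
    """P(x1 < x < x2), obtained as a difference of two cumulative probabilities."""
    return uniform_cdf(x2, a, b) - uniform_cdf(x1, a, b)

print(uniform_cdf(0.3, 0, 1))            # random number example
print(uniform_interval(17, 19, 15, 25))  # frog weight example
```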
Common univariate probability distributions

– The mean and variance of the uniform population are given by:

Mean of the uniform population, $\mu_x = (a + b) / 2$
Variance of the uniform population, $\sigma_x^2 = (b - a)^2 / 12$
-------------------------------------------------------------------------------------------------------------
– Normal Distribution or Gaussian distribution
– The most common distribution for continuous attributes. Also called the BELL CURVE.
– The essential characteristics of a normal distribution are:
– It is symmetric about its center, unimodal (i.e., one mode), and asymptotic.
– The values of mean, median, and mode are all equal.
Common univariate probability distributions

– The central limit theorem states that “when samples of a sufficiently large size are drawn from a population with finite variance, the means of those samples will be approximately normally distributed, and the mean of the sample means will be approximately equal to the mean of the whole population”.
– Eg. Physical quantities that are expected to be the sum of many independent factors (eg. people's heights) typically have a normal distribution.
– The normal distribution has two parameters: MEAN and STANDARD DEVIATION.
– The MEAN locates the highest point of the bell-shaped distribution.
– The STANDARD DEVIATION defines how thin or wide the bell shape of the distribution is. An attribute ‘x’ that follows a normal distribution is denoted as $x \sim N(\mu_x, \sigma_x)$
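The central limit theorem can be observed in a small simulation. The sketch below is an illustration only: the population U(0, 1), the sample size of 50, the 2000 repetitions and the fixed seed are all choices made for this example. It also relies on two standard facts not stated on the slides: the standard deviation of sample means is the population standard deviation divided by the square root of the sample size, and about 68% of normally distributed values lie within one standard deviation of the mean.

```python
import random
import statistics

rng = random.Random(0)                  # fixed seed so the run is repeatable

def sample_mean(size):
    """Mean of one sample of `size` values drawn from U(0, 1)."""
    return statistics.mean(rng.uniform(0, 1) for _ in range(size))

sample_size, n_samples = 50, 2000
means = [sample_mean(sample_size) for _ in range(n_samples)]

# Population U(0, 1): mean 0.5, variance 1/12.
m = statistics.mean(means)              # should be close to 0.5
s = statistics.stdev(means)             # should be close to sqrt((1/12) / 50)
print(m, s)

# Share of sample means within one standard deviation of their mean;
# for a normal distribution this is roughly 68%.
share = sum(m - s <= x <= m + s for x in means) / n_samples
print(share)
```

Although the population itself is flat, the sample means pile up around 0.5 in a bell shape.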
Common univariate probability distributions

[Figure: sample NORMAL DISTRIBUTION CURVE]


Descriptive bivariate analysis --
both quantitative attributes

– Bivariate analysis deals with pairs of attributes and their relative behavior.
– Bivariate analysis with both attributes quantitative:
– In a data set whose objects/instances have ‘n’ attributes, each instance/object can be represented in an n-dimensional space: a space with ‘n’ axes, each axis representing one of the attributes. The position occupied by an object is given by the values of its attributes.
– Visualization of two quantitative attributes can be done by different techniques. One of them is the Three-Dimensional Histogram.
– Eg. From our data set (each attribute listed in ascending order):
Weight: 58, 65, 68, 70, 72, 75, 77, 78, 85, 87, 110
Height: 155, 162, 168, 169, 172, 172, 175, 180, 180, 183, 195
Descriptive bivariate analysis --
both quantitative attributes
Frequency distribution table of attributes Height and Weight (number of contacts from the data set in each cell)

                 Height (cm)
Weight (kg)   155-170   170-180   180-190   >190
55-69         3         0         0         0
70-84         1         3         1         0
85-99         0         0         2         0
100-115       0         0         0         1
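The joint frequency table can be rebuilt from the (weight, height) pairs of the contacts. In this sketch each cell is identified by its lower limit, and a value equal to a limit is counted in the cell that starts there (so a height of 180 falls in 180-190, which is what the counts in the table imply). The helper `cell` and the edge lists are names introduced for this illustration.

```python
from bisect import bisect_right

pairs = [(77, 175), (110, 195), (70, 172), (85, 180), (65, 168), (75, 172),
         (58, 155), (78, 180), (68, 162), (87, 183), (72, 169)]  # (weight, height)

weight_edges = [55, 70, 85, 100]     # lower limits of the cells 55-69, 70-84, 85-99, 100-115
height_edges = [155, 170, 180, 190]  # lower limits of 155-170, 170-180, 180-190, >190

def cell(value, edges):
    """Index of the cell whose lower limit is the largest one not above value."""
    return bisect_right(edges, value) - 1

table = [[0] * len(height_edges) for _ in weight_edges]
for weight, height in pairs:
    table[cell(weight, weight_edges)][cell(height, height_edges)] += 1

for row in table:
    print(row)
```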
Descriptive bivariate analysis
3D histogram for attributes “Weight” and “Height”

[Figure: 3D histogram. Vertical axis: Frequency. Horizontal axes: Height in cms (155-170, 170-180, 180-190, >190) and Weight (55-70, 70-85, 85-100, 100-115)]
Descriptive bivariate analysis: data visualization using a SCATTER plot. It illustrates how the values of two attributes are correlated, making it possible to see how one attribute varies as the other attribute varies.

[Figure: scatter plot “Weight vs Height”. Horizontal axis: Weight in kg (50 to 120). Vertical axis: Height in cms (100 to 200)]
Descriptive bivariate analysis
– The degree to which two attributes vary together is measured by the covariance between them.
– When the two attributes vary in a similar way, the covariance is positive.
– If the two attributes vary in opposite ways, the covariance is negative.
– If the attributes vary independently, the covariance tends to zero.
– The value of the covariance depends on the magnitude of the attributes.
– Variance is a special case of covariance: it is the covariance of an attribute with itself.
– Only linear relations can be captured.
– The equation below shows how the sample covariance between two attributes ($x_j$, $x_k$) is calculated:

$\mathrm{cov}(x_j, x_k) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$

$\bar{x}_j$ - mean of attribute ‘j’, $x_{ij}$ - ith value of attribute j
$\bar{x}_k$ - mean of attribute ‘k’, $x_{ik}$ - ith value of attribute k


Descriptive bivariate analysis: determination of COVARIANCE between attributes WEIGHT and HEIGHT

Mean($x_j$) = $\bar{x}_j$ = 845/11 = 76.8
Mean($x_k$) = $\bar{x}_k$ = 1911/11 = 173.7

Weight ($x_j$)   Height ($x_k$)   A = ($x_{ij} - \bar{x}_j$)   B = ($x_{ik} - \bar{x}_k$)   C = A*B
77               175              0.2                          1.3                          0.26
110              195              33.2                         21.3                         707.16
70               172              -6.8                         -1.7                         11.56
85               180              8.2                          6.3                          51.66
65               168              -11.8                        -5.7                         67.26
75               172              -1.8                         -1.7                         3.06
58               155              -18.8                        -18.7                        351.56
78               180              1.2                          6.3                          7.56
68               162              -8.8                         -11.7                        102.96
87               183              10.2                         9.3                          94.86
72               169              -4.8                         -4.7                         22.56
Total: 845       1911                                                                       1420.46

Bivariate statistics
Descriptive bivariate analysis – Covariance

– Covariance between attributes Weight ($x_j$) and Height ($x_k$) is:
– Cov($x_j$, $x_k$) = 1420.46 / 10 = 142.046

[Figure: scatter plot “Weight vs Height”. Horizontal axis: Weight in kg. Vertical axis: Height in cms]
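The covariance calculation can be reproduced in a few lines. This sketch uses unrounded means, so the result differs from the hand calculation in the last decimal place; the function name `sample_covariance` is chosen here for illustration.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

weight = [77, 110, 70, 85, 65, 75, 58, 78, 68, 87, 72]
height = [175, 195, 172, 180, 168, 172, 155, 180, 162, 183, 169]

print(round(sample_covariance(weight, height), 3))
print(round(sample_covariance(weight, weight), 3))   # covariance with itself = variance
```

The second call illustrates that the variance is the covariance of an attribute with itself.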
Descriptive bivariate analysis: determination of COVARIANCE between temperature and resistance of germanium (-0.05 Ω/°C)

Mean($x_j$) = $\bar{x}_j$ = 385/11 = 35.0
Mean($x_k$) = $\bar{x}_k$ = 44.0/11 = 4.0

Temp ($x_j$)    Resis ($x_k$)   A = ($x_{ij} - \bar{x}_j$)   B = ($x_{ik} - \bar{x}_k$)   C = A*B
10.000          5.250           -25.000                      1.250                        -31.250
15.000          5.000           -20.000                      1.000                        -20.000
20.000          4.750           -15.000                      0.750                        -11.250
25.000          4.500           -10.000                      0.500                        -5.000
30.000          4.250           -5.000                       0.250                        -1.250
35.000          4.000           0.000                        0.000                        0.000
40.000          3.750           5.000                        -0.250                       -1.250
45.000          3.500           10.000                       -0.500                       -5.000
50.000          3.250           15.000                       -0.750                       -11.250
55.000          3.000           20.000                       -1.000                       -20.000
60.000          2.750           25.000                       -1.250                       -31.250
Total: 385.000  44.000                                                                    -137.500

Bivariate statistics
Descriptive bivariate analysis – Covariance

– Covariance between attributes Temperature ($x_j$) and Resistance ($x_k$) is:
– Cov($x_j$, $x_k$) = -137.5 / 10 = -13.75

[Figure: scatter plot “Temperature vs Resistance”. Horizontal axis: Temperature in degrees Celsius (0 to 70). Vertical axis: Resistance (0 to 6)]
Descriptive bivariate analysis

– Covariance is a useful measure of how the values of two attributes relate to each other.
– However, the size of the range of values of the attributes influences the covariance values.
– A better measure is the correlation.
– The linear correlation between two attributes, also known as the Pearson correlation, gives a clearer indication of how similar the attributes are, and is usually preferred to covariance.
– The correlation between two attributes $x_j$ and $x_k$ is given as:
$r_{jk} = \mathrm{cor}(x_j, x_k) = \mathrm{cov}(x_j, x_k) / (s_j \, s_k)$ where,
$\mathrm{cov}(x_j, x_k)$ is the covariance between the two attributes
$s_j$ is the sample standard deviation of attribute $x_j$
$s_k$ is the sample standard deviation of attribute $x_k$
Descriptive bivariate analysis

• The figure illustrates 3 examples of correlation between two attributes: positive, negative and a lack of correlation.
• The more correlated the attributes are, the closer the points are to lying on a straight line.
Descriptive bivariate analysis

– The Pearson correlation evaluates the linear correlation between the attributes.
– If the points lie on an increasing line, the Pearson correlation coefficient = 1
– If the points lie on a decreasing line, the Pearson correlation coefficient = -1
– If the points form a roughly horizontal line or a cloud without any increasing or decreasing tendency, there is no correlation between the attributes and the Pearson correlation coefficient is close to 0
– Positive values mean the existence of a positive tendency between the two attributes; the coefficient gets closer to 1 as the points get closer to a straight line.
– Negative values mean the existence of a negative tendency; the coefficient gets closer to -1 as the points get closer to a straight line.
Descriptive bivariate analysis: determination of correlation between attributes WEIGHT and HEIGHT

Mean($x_j$) = $\bar{x}_j$ = 845/11 = 76.8
Mean($x_k$) = $\bar{x}_k$ = 1911/11 = 173.7

Weight ($x_j$)   Height ($x_k$)   A = ($x_{ij} - \bar{x}_j$)   B = ($x_{ik} - \bar{x}_k$)   C = A*B    D = A^2    E = B^2
77               175              0.2                          1.3                          0.26       0.04       1.69
110              195              33.2                         21.3                         707.16     1102.24    453.69
70               172              -6.8                         -1.7                         11.56      46.24      2.89
85               180              8.2                          6.3                          51.66      67.24      39.69
65               168              -11.8                        -5.7                         67.26      139.24     32.49
75               172              -1.8                         -1.7                         3.06       3.24       2.89
58               155              -18.8                        -18.7                        351.56     353.44     349.69
78               180              1.2                          6.3                          7.56       1.44       39.69
68               162              -8.8                         -11.7                        102.96     77.44      136.89
87               183              10.2                         9.3                          94.86      104.04     86.49
72               169              -4.8                         -4.7                         22.56      23.04      22.09
Total: 845       1911                                                                       1420.46    1917.64    1168.19

Bivariate statistics
Descriptive bivariate analysis – Pearson correlation

– Covariance between attributes Weight ($x_j$) and Height ($x_k$) is:
– Cov($x_j$, $x_k$) = 1420.46 / 10 = 142.046
– $s_j$ = standard deviation of $x_j$ = sqrt(1917.64 / 10) = 13.8
– $s_k$ = standard deviation of $x_k$ = sqrt(1168.19 / 10) = 10.8
– Correlation, Cor($x_j$, $x_k$) = cov($x_j$, $x_k$) / ($s_j$ * $s_k$) = 142.046 / (13.8 * 10.8) ≈ 0.95
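A short sketch of the Pearson correlation, applied to both worked data sets (weight and height, and the germanium temperature and resistance readings). The function name `pearson` is chosen for this example; the resistance list is generated from the slope of -0.05 Ω/°C and reproduces the values in the germanium table.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: sample covariance over the product of sample standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
    s_x = math.sqrt(sum(a * a for a in dx) / (n - 1))
    s_y = math.sqrt(sum(b * b for b in dy) / (n - 1))
    return cov / (s_x * s_y)

weight = [77, 110, 70, 85, 65, 75, 58, 78, 68, 87, 72]
height = [175, 195, 172, 180, 168, 172, 155, 180, 162, 183, 169]
temp = list(range(10, 65, 5))                       # 10, 15, ..., 60
resistance = [5.25 - 0.05 * (t - 10) for t in temp]

print(round(pearson(weight, height), 3))
print(round(pearson(temp, resistance), 3))
```

The germanium readings lie exactly on a decreasing line, so their coefficient is -1 even though the covariance was only -13.75. This is the sense in which correlation is easier to interpret than covariance.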


Descriptive bivariate analysis – Spearman’s rank correlation

– It is based on rankings. Instead of evaluating how linear the shape formed by the points is, it compares the ordered lists of each of the two attributes.
– The expression used to find Spearman’s rank correlation is the Pearson correlation applied to the ranks:
– $\rho_{xy} = \frac{\mathrm{cov}(r_x, r_y)}{s_{r_x} \, s_{r_y}}$

– ‘rx’ & ‘ry’ are the ranks of the values of the two attributes


Descriptive bivariate analysis
Assigning ranks (each attribute sorted in ascending order on its own)

Weight (kg) Rank Height (cm) Rank


58 1 155 1
65 2 162 2
68 3 168 3
70 4 169 4
72 5 172 5.5
75 6 172 5.5
77 7 175 7
78 8 180 8.5
85 9 180 8.5
87 10 183 10
110 11 195 11
Descriptive bivariate analysis

– Assign the numbers 1 to n (the number of data points) to the values of each variable, in order from lowest to highest.
– When two or more values are identical, assign to each of them the arithmetic mean of the ranks that they would otherwise have occupied.
– The rank values for the attributes weight and height are shown in the table below (one row per contact, ordered by height):

Weight $x_i$   Rank $r_{x_i}$   Height $y_i$   Rank $r_{y_i}$
58             1                155            1
68             3                162            2
65             2                168            3
72             5                169            4
70             4                172            5.5
75             6                172            5.5
77             7                175            7
85             9                180            8.5
78             8                180            8.5
87             10               183            10
110            11               195            11
Descriptive bivariate analysis – Spearman’s Rank correlation between attributes ‘Height’ and ‘Weight’

Mean($r_x$) = $\bar{r}_x$ = 66/11 = 6
Mean($r_y$) = $\bar{r}_y$ = 66/11 = 6

Rank of Weight ($r_x$)   Rank of Height ($r_y$)   A = ($r_{x_i} - \bar{r}_x$)   B = ($r_{y_i} - \bar{r}_y$)   C = A*B   D = A^2   E = B^2
1                        1                        -5.0                          -5.0                          25        25        25
3                        2                        -3.0                          -4.0                          12        9         16
2                        3                        -4.0                          -3.0                          12        16        9
5                        4                        -1.0                          -2.0                          2         1         4
4                        5.5                      -2.0                          -0.5                          1         4         0.25
6                        5.5                      0.0                           -0.5                          0         0         0.25
7                        7                        1.0                           1.0                           1         1         1
9                        8.5                      3.0                           2.5                           7.5       9         6.25
8                        8.5                      2.0                           2.5                           5         4         6.25
10                       10                       4.0                           4.0                           16        16        16
11                       11                       5.0                           5.0                           25        25        25
Total: 66                66                                                                                   106.5     110       109

Bivariate rank caln


Descriptive bivariate analysis – Spearman’s rank correlation

– Covariance of the ranks of attributes Weight ($r_j$) and Height ($r_k$) is:

Cov($r_j$, $r_k$) = 106.5 / 10 = 10.65

– $s_{r_j}$ = standard deviation of rank $r_j$ = sqrt(110 / 10) = 3.32

– $s_{r_k}$ = standard deviation of rank $r_k$ = sqrt(109 / 10) = 3.30

– Spearman’s rank correlation, Cor($r_j$, $r_k$) = Cov($r_j$, $r_k$) / ($s_{r_j}$ * $s_{r_k}$)

= 10.65 / (3.32 * 3.30)

$r_{jk}$ ≈ 0.97
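The ranking step, including the averaged ranks for ties, and the resulting coefficient can be sketched as below. This follows the approach of the worked example (Pearson correlation applied to the ranks); `average_ranks`, `pearson` and `spearman` are names chosen for this illustration.

```python
import math

def average_ranks(values):
    """Rank values from 1 (lowest); tied values share the mean of their positions."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def pearson(xs, ys):
    """Pearson correlation; the (n - 1) factors of covariance and deviations cancel."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman's rank correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

weight = [77, 110, 70, 85, 65, 75, 58, 78, 68, 87, 72]
height = [175, 195, 172, 180, 168, 172, 155, 180, 162, 183, 169]

print(average_ranks(height))
print(round(spearman(weight, height), 3))
```

The two heights of 172 share rank 5.5 and the two heights of 180 share rank 8.5, matching the rank table.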
Descriptive bivariate analysis – Two qualitative attributes,
at least one of them nominal

– When both attributes are qualitative and at least one of them is nominal, contingency tables are used.
– Contingency tables present the joint frequencies, facilitating the identification of interactions between the two attributes.
– They have a matrix-like format, with cells in a grid and labels at the left and top.
– The rightmost column holds the totals per row, while the bottom row holds the totals per column. The bottom right-hand corner holds the total number of values.
– Mosaic plots are used to show the information of a contingency table in a more visually appealing way.
Contingency table: Eg. Consider Gender and Company as the attributes.
In our data, we had: 8 men [3 Good, 5 Bad] and 3 women [2 Good, 1 Bad]

                Company
Gender     Good   Bad   Total
Male       3      5     8
Female     2      1     3
Total      5      6     11

Contingency table with absolute joint frequencies for “Company” and “Gender”
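A contingency table is a count of value pairs, so it can be built with a counter keyed by (gender, company). The list `contacts` below is the pair of columns from the data set typed in row order; the variable names are chosen for this sketch.

```python
from collections import Counter

contacts = [("M", "Good"), ("M", "Bad"), ("M", "Good"), ("M", "Bad"),
            ("F", "Good"), ("M", "Bad"), ("F", "Bad"), ("M", "Good"),
            ("M", "Bad"), ("M", "Bad"), ("F", "Good")]   # (gender, company)

joint = Counter(contacts)                        # joint absolute frequencies
row_totals = Counter(g for g, _ in contacts)     # totals per gender
col_totals = Counter(c for _, c in contacts)     # totals per company value

print("      Good  Bad  Total")
for g in ("M", "F"):
    print(f"{g:5s} {joint[(g, 'Good')]:4d} {joint[(g, 'Bad')]:4d} {row_totals[g]:6d}")
print(f"Total {col_totals['Good']:4d} {col_totals['Bad']:4d} {len(contacts):6d}")
```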
Mosaic Plot – The gender with the highest frequency has the largest area in the plot

[Figure: mosaic plot of GENDER against COMPANY (GOOD, BAD)]
Descriptive bivariate analysis
Both Ordinal Attributes

– Any of the methods discussed before can be used for TWO ORDINAL attributes.
– Spearman’s rank correlation should be used instead of the Pearson correlation.
– Scatter plots with ordinal attributes usually have the problem that many values fall on the same point, making it impossible to judge the number of values per point.
– Some software packages use a jitter effect, which adds a small random deviation to the values.
– Contingency tables and mosaic plots can also be used.
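The jitter effect can be illustrated with a few lines of Python. This is only a sketch of the idea, not how any particular plotting package implements it; the function `jitter`, its `amount` parameter and the `ratings` list are hypothetical.

```python
import random

def jitter(values, amount=0.2, seed=0):
    """Return a copy of values with a uniform random offset in [-amount, amount] added to each."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

ratings = [1, 2, 2, 2, 3, 3, 1, 2]   # hypothetical ordinal codes that overlap when plotted
jittered = jitter(ratings)
print(jittered)
```

The jittered values are used only for plotting; statistics should still be computed on the original values.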
End of Part 2 of
Module 1 & Module 1
