Analysis of Data

Statistical Techniques to Analyze Data
•Descriptive Statistics
•Confirmatory/Inferential Statistics
Descriptive Analysis
• The transformation of raw data into a form
that will make them easy to understand and
interpret; rearranging, ordering, and
manipulating data to generate descriptive
information.
Type of Measurement              Type of Descriptive Analysis
Nominal (two categories)         Frequency table; proportion (percentage)
Nominal (more than two           Frequency table; category proportions
  categories)                    (percentages); mode
Ordinal                          Rank order; median
Interval                         Arithmetic mean
Ratio                            Index numbers; geometric mean; harmonic mean
Central Tendency
Type of Scale        Measure of Central Tendency   Measure of Dispersion
Nominal              Mode                          None
Ordinal              Median                        Percentile
Interval or ratio    Mean                          Standard deviation
Cross-Tabulation
• Analyze data by groups or categories
• Compare differences
• Contingency table
• Percentage cross-tabulations
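As a minimal sketch of the bullets above, the following builds a contingency table and a row-percentage cross-tabulation. The survey records and the variable names are hypothetical, invented only for illustration; they do not come from the slides.

```python
from collections import Counter

# Hypothetical survey records: (gender, preferred_channel). Illustrative only.
responses = [
    ("male", "online"), ("male", "store"), ("female", "online"),
    ("female", "online"), ("male", "store"), ("female", "store"),
    ("female", "online"), ("male", "online"),
]

# Contingency table: how often each (row, column) combination occurs.
counts = Counter(responses)
rows = sorted({r for r, _ in responses})
cols = sorted({c for _, c in responses})

# Row totals are the base for the row percentages.
row_totals = {r: sum(counts[(r, c)] for c in cols) for r in rows}

# Percentage cross-tabulation: each cell as a share of its row total.
row_pct = {
    (r, c): 100 * counts[(r, c)] / row_totals[r] for r in rows for c in cols
}

for r in rows:
    print(r, {c: (counts[(r, c)], round(row_pct[(r, c)], 1)) for c in cols})
```

Percentages could equally be taken against column totals or the grand total; which base to use depends on the comparison being made.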
Charts and Graphs
• Pie charts
• Line graphs
• Scattergram
• Pictogram
• Histogram
• Stem & Leaf
• Bar charts
Line Graph
BAR GRAPH
• A bar graph, or bar chart, is used to represent
values in relation to other values.
• Bar graphs are often used to compare data taken over
long periods of time, but they are most often used
on very small sets of data.
• These graphs can be horizontal or vertical. In a
vertical bar graph, the categories run along the
bottom and the numbers that measure the data run
up the side; in a horizontal bar graph the two
axes are swapped.
Bar Graph
[Figure: bar graph of East, West, and North values for the 1st to 4th quarters; vertical axis from 0 to 90]
Web Surveyor Bar Chart
How did you find your last job?
Source                      Responses   Share
Networking                  643         55.2 %
Print ad                    213         18.3 %
Online recruitment site     179         15.4 %
Placement firm              112         9.6 %
Temporary agency            18          1.5 %
[Figure: horizontal bar chart of the responses; axis from 0 to 700]
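The percentages in the chart follow directly from the response counts. A short sketch, using the counts shown in the chart and a crude text bar for each category (the bar scaling of one `#` per two percentage points is an arbitrary choice for illustration):

```python
# Response counts as shown in the chart.
job_sources = {
    "Networking": 643,
    "Print ad": 213,
    "Online recruitment site": 179,
    "Placement firm": 112,
    "Temporary agency": 18,
}

total = sum(job_sources.values())

# Share of each category, rounded to one decimal place as in the chart.
percentages = {name: round(100 * n / total, 1) for name, n in job_sources.items()}

for name, pct in percentages.items():
    bar = "#" * round(pct / 2)  # one '#' per two percentage points
    print(f"{name:<26}{pct:>5}% {bar}")
```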

When to use Bar Charts
• A bar chart is particularly useful when one or two categories
'dominate' results.
• It can be very clear and easy to read.
• Most people understand what is presented without having to
have detailed statistical knowledge.
• It can represent data expressed as actual numbers, percentages
and frequencies.
• A bar chart can represent either discrete or continuous data.
• If the data is discrete there should be a gap between the bars (as
in the diagram above).
• If the data is continuous there should be no gap between the
bars (a chart drawn this way is usually called a histogram).
PIE CHART
• A pie graph, also known as a pie chart, is a type of
graph commonly used in conjunction with
percentages.
• A large circle is divided into sections depending
on those percentages and each section represents
part of the whole.
• In a pie chart, the arc length of each separate
sector is meant to be proportional to the
percentage it’s supposed to represent.
• The first pie chart was created in 1801 by William
Playfair.
When to use a PIE Chart
• It is best used to present the proportions of a
sample.
• It is most useful where one or two results
dominate the findings.
• It can represent data expressed as actual
numbers or percentages.
• Do not use where there are a large number
of categories, or where each has a small,
fairly equal share, as this can be unclear.
Inferential Statistical Techniques
•Univariate Statistics
•Bivariate statistics
•Multivariate Statistics
Univariate Statistics
• Test of statistical significance
• Hypothesis testing one variable at a time
Significance Level
• Critical Probability
• Confidence Level
• Alpha
• Probability Level selected is typically .05 or
.01
Type I and Type II Errors
                                 Accept null          Reject null
Null is true                     Correct (no error)   Type I error
(Medicine can cure disease)
Null is false                    Type II error        Correct (no error)
(Medicine cannot cure disease)
Type I and Type II Errors in Hypothesis Testing
State of Null Hypothesis         Decision
in the Population                Accept Ho            Reject Ho
Ho is true                       Correct (no error)   Type I error
Ho is false                      Type II error        Correct (no error)
MEASURES OF CENTRAL TENDENCY AND DISPERSION
Measures of central tendency
• Mean
• Median
• Mode
• i.e. finding a ‘typical’ value from the middle of the data.
Arithmetic Mean
The arithmetic mean is a mathematical average and is
the most popular measure of central tendency.
It is frequently referred to simply as the ‘mean’. It is obtained by
dividing the sum of the values of all observations in a series
(ƩX) by the number of items (N) constituting the series.
Thus, the mean of a set of numbers X1, X2, X3,
..., Xn is denoted by x̅ and is defined as
$$\bar{x} = \frac{\sum X}{N}$$
Methods of Calculating the Arithmetic Mean:
• Direct method
• Short-cut method
• Step-deviation method

Example: Calculating the Arithmetic Mean
DIRC Monthly User Statistics in the University Library
Month      No. of Working Days   Total Users   Average Users per Working Day
Sep-2011   24                    11618         484.08
Oct-2011   21                    8857          421.76
Nov-2011   23                    11459         498.22
Dec-2011   25                    8841          353.64
Jan-2012   24                    5478          228.25
Feb-2012   23                    10811         470.04
Total      140                   57064
Mean = 57064 / 140 = 407.6 users per working day
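The figures in this example can be checked with a few lines of Python. This is only a sketch of the direct method applied to the table's own numbers; the variable names are illustrative.

```python
# Figures from the library user table (Sep-2011 to Feb-2012).
working_days = [24, 21, 23, 25, 24, 23]
total_users = [11618, 8857, 11459, 8841, 5478, 10811]

# Average users per working day within each month.
daily_avg = [round(u / d, 2) for u, d in zip(total_users, working_days)]

# Direct method over the whole period: sum of users / sum of working days.
overall_mean = sum(total_users) / sum(working_days)

print(daily_avg)
print(round(overall_mean, 1))
```

Note that the overall figure weights each month by its working days; it is not the simple average of the six monthly averages.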
Advantages of Mean
• It is easy to understand and simple to calculate.
• It is based on all the values.
• It is rigidly defined.
• The arithmetic average can be worked out even if
some of the details of the data are lacking.
• It is not based on the position in the series.
Disadvantages of Mean
• It is affected by extreme values.
• It cannot be calculated for open-end classes.
• It cannot be located graphically.
• It can give misleading conclusions.
• It has an upward bias.
MEDIAN
The median is the central value of the distribution, that is, the
value which divides the distribution into two equal parts, each
part containing an equal number of items. Thus it is the
central value of the variable when the values are
arranged in order of magnitude.
Connor has defined it as: “The median is that value of the
variable which divides the group into two equal parts,
one part comprising of all values greater, and the other,
all values less than median.”
THE MEDIAN
• The median is the middle-ranked score (50th percentile).
• If there is an even number of scores, it is the
arithmetic average of the two middle scores.
• The median is largely unaffected by outliers. Even if the highest
value is deleted from the data, the median would
remain (more or less) the same.
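Calculation of Median – Discrete Series:
i. Arrange the data in ascending or descending order.
ii. Calculate the cumulative frequencies.
iii. Apply the formula: the median is the value of the
((N + 1)/2)th item, where N is the total frequency.
Calculation of Median – Continuous Series
For calculation of the median in a continuous
frequency distribution the following formula
will be employed. Algebraically,
$$M = L + \frac{\frac{N}{2} - cf}{f} \times i$$
where L is the lower limit of the median class, N the total
frequency, cf the cumulative frequency of the class before the
median class, f the frequency of the median class, and i the
class width.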
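Example: Median of Grouped Data in a Distribution of Respondents by Age
Age Group   Frequency (f)   Cumulative Frequency (cf)
0-20        15              15
20-40       32              47
40-60       54              101
60-80       30              131
80-100      19              150
Total       150
N/2 = 75, so the median class is 40-60 (lower limit 40, preceding
cumulative frequency 47, class frequency 54, class width 20).
Median (M) = 40 + ((75 - 47) / 54) × 20
           = 40 + (28 / 54) × 20
           = 40 + 0.52 × 20
           = 40 + 10.37
           = 50.37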
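The grouped-median calculation above can be sketched in Python as follows. The function name `grouped_median` and the tuple layout `(lower, upper, frequency)` are assumptions made for this illustration.

```python
def grouped_median(classes):
    """Median of a continuous frequency distribution.

    classes: list of (lower, upper, frequency) tuples in ascending order.
    """
    n = sum(f for _, _, f in classes)
    if n == 0:
        raise ValueError("distribution has no observations")
    half = n / 2
    cum = 0  # cumulative frequency of the classes before the current one
    for lower, upper, f in classes:
        if cum + f >= half:
            # The median class is the first whose cumulative frequency reaches N/2.
            return lower + (half - cum) / f * (upper - lower)
        cum += f


# Age distribution from the example table.
age_classes = [(0, 20, 15), (20, 40, 32), (40, 60, 54), (60, 80, 30), (80, 100, 19)]
median_age = grouped_median(age_classes)
print(round(median_age, 2))
```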
Advantages of Median
• The median can be calculated in all distributions.
• The median can be understood even by common people.
• The median can be ascertained even when extreme
items are present.
• It is most useful when dealing with qualitative data.
Disadvantages of Median
• It is not based on all the values.
• It is not capable of further mathematical
treatment.
• It is affected by fluctuation in sampling.
• With an even number of values, it may not be
a value from the data.
THE MODE
• The mode is the score with the highest
frequency of occurrences.
• It is the easiest score to spot in a distribution.
• It is the only way to express the central
tendency of a nominal level variable.
MODE
• Mode is the most frequent value or score
in the distribution.
• It is defined as the value of the item that occurs most often in a series.
• It is denoted by the capital letter Z.
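MODE
Croxton and Cowden defined it as: “The mode
of a distribution is the value at the point around
which the items tend to be most heavily concentrated.
It may be regarded as the most typical of a series
of values.”
The exact value of the mode can be obtained by the
following formula.
$$Z = L_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$$
where L1 is the lower limit of the modal class, f1 its frequency,
f0 and f2 the frequencies of the classes before and after it, and
i the class width.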
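Example: Calculate the Mode for the Distribution of
Monthly Rent Paid by Libraries in Karnataka
Monthly Rent (Rs)   Number of Libraries (f)
500-1000            5
1000-1500           10
1500-2000           8
2000-2500           16
2500-3000           14
3000 & Above        12
Total               65
The modal class is 2000-2500 (frequency 16; the class before it has
frequency 8, the class after it 14; class width 500).
Z = 2000 + ((16 - 8) / (2 × 16 - 8 - 14)) × 500
Z = 2000 + (8 / 10) × 500
Z = 2000 + 0.8 × 500 = 2000 + 400
Z = 2400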
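A sketch of the same grouped-mode calculation in Python is given below. The function name `grouped_mode` is illustrative, and representing the open-ended class "3000 & Above" with an upper limit of `None` is an assumption of this sketch; it works here only because that class is not the modal class.

```python
def grouped_mode(classes):
    """Mode of a grouped distribution by the interpolation formula.

    classes: list of (lower, upper, frequency) tuples in ascending order.
    Assumes the modal class has a finite upper limit and that the
    denominator 2*f1 - f0 - f2 is not zero.
    """
    freqs = [f for _, _, f in classes]
    m = freqs.index(max(freqs))  # position of the modal class
    lower, upper, f1 = classes[m]
    f0 = freqs[m - 1] if m > 0 else 0               # class before the modal class
    f2 = freqs[m + 1] if m < len(freqs) - 1 else 0  # class after the modal class
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * (upper - lower)


# Monthly rent distribution from the example; None marks the open-ended class.
rent_classes = [
    (500, 1000, 5), (1000, 1500, 10), (1500, 2000, 8),
    (2000, 2500, 16), (2500, 3000, 14), (3000, None, 12),
]
mode_rent = grouped_mode(rent_classes)
print(round(mode_rent, 2))
```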
Advantages of Mode
• Mode is readily comprehensible and easily
calculated
• It is the best representative of data
• It is not at all affected by extreme value.
• The value of mode can also be determined
graphically.
• It is usually an actual value of an important
part of the series.
Disadvantages of Mode
• It is not based on all observations.
• It is not capable of further
mathematical manipulation.
• Mode is affected to a great extent by
sampling fluctuations.
• Choice of grouping has great
influence on the value of mode.
Advantages and Disadvantages
Mean
  Advantage: More sensitive than the median, because it makes
  use of all the values of the data.
  Disadvantage: It can be misrepresentative if there is an
  extreme value.
Median
  Advantage: It is not affected by extreme scores, so can give
  a representative value.
  Disadvantage: It is less sensitive than the mean, as it does
  not take into account all of the values.
Mode
  Advantage: It is useful when the data are in categories, such as
  the number of babies who are securely attached.
  Disadvantage: It is not a useful way of describing data when
  there are several modes.
Measures of Dispersion
• Measures of ‘spread.’
• This looks at how
‘spread out’ the data
are.
• Are the scores similar
to each other (closely
clustered), or quite
spread out?
Range and Standard Deviation
• The range is the difference between the highest
and lowest numbers. What is the range of …
• 3, 5, 8, 8, 9, 10, 12, 12, 13, 15
• Mean = 9.5, range = 12 (3 to 15)
• 1, 5, 8, 8, 9, 10, 12, 12, 13, 17
• Mean = 9.5, range = 16 (1 to 17)
• Example from Cara Flanagan (2005), Research Methods for AQA A Psychology, Nelson Thornes, p. 15
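The two data sets above share a mean but differ in spread, which a short Python check makes concrete. The standard deviation here is the population form (`statistics.pstdev`); that choice is an assumption of this sketch, since the slides do not say which form is intended.

```python
from statistics import mean, pstdev

# The two data sets from the example.
set_a = [3, 5, 8, 8, 9, 10, 12, 12, 13, 15]
set_b = [1, 5, 8, 8, 9, 10, 12, 12, 13, 17]


def value_range(values):
    """Range: highest value minus lowest value."""
    return max(values) - min(values)


for name, data in (("A", set_a), ("B", set_b)):
    # Same mean in both sets, but a different range and standard deviation.
    print(name, mean(data), value_range(data), round(pstdev(data), 2))
```

Only the two extreme values differ between the sets, yet both the range and the standard deviation pick up the change while the mean does not.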
Standard Deviation
• Standard deviation
tells us the average
distance of each score
from the mean.
• About 68% of normally
distributed data is
within 1 sd each side
of the mean
• About 95% within 2 sd
• Almost all (about 99.7%)
is within 3 sd
Example
• Mean IQ = 100, sd = 15
• What is the IQ of 68% of
population (i.e. what is
the range of possible
IQs)?
• Between what IQ scores
would 95% of people be?
• Dan says he has done an
online IQ test, and he has
an IQ of 170. Should you
believe him? Why/not?
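One way to work through these questions is sketched below. It assumes IQ scores are normally distributed with the stated mean and sd, and uses `statistics.NormalDist` from the Python standard library; the helper name `band` is illustrative.

```python
from statistics import NormalDist

mean_iq, sd_iq = 100, 15


def band(k):
    """Interval within k standard deviations either side of the mean."""
    return (mean_iq - k * sd_iq, mean_iq + k * sd_iq)


band_68 = band(1)  # about 68% of scores fall in this interval
band_95 = band(2)  # about 95% of scores fall in this interval

# How unusual is a score of 170 under the normal assumption?
dan_z = (170 - mean_iq) / sd_iq
tail = 1 - NormalDist(mu=mean_iq, sigma=sd_iq).cdf(170)

print(band_68, band_95, round(dan_z, 2), tail)
```

A score more than four standard deviations above the mean is expected only a few times per million people, so a claimed 170 from an online test deserves scepticism.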
Another example
• John scores 61% in the
test. His mum says
that’s rubbish. Sol
points out that the
mean score in class
was 50%, with an sd
of 5. Did he do well?
• What if the sd was
only 2?
• What if sd was 15?
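The question turns on how many standard deviations John's score lies above the class mean. A minimal sketch (the function name `z_score` is illustrative):

```python
def z_score(score, mean, sd):
    """Number of standard deviations a score lies above the mean."""
    return (score - mean) / sd


john, class_mean = 61, 50

# The same raw score under the three standard deviations mentioned above.
z_by_sd = {sd: z_score(john, class_mean, sd) for sd in (5, 2, 15)}

for sd, z in z_by_sd.items():
    print(f"sd = {sd:>2}: z = {z:.2f}")
```

With a small sd the same 11-point gap is exceptional; with a large sd it is unremarkable.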
Advantages and disadvantages
Range
  Advantage: Quick and easy to calculate
  Disadvantages: Affected by extreme values (outliers); does not
  take into account all the values
Standard deviation
  Advantage: More precise measure of dispersion because all
  values are taken into account
  Disadvantage: Much harder to calculate than the range
I used Cara Flanagan’s (2005) Research Methods for AQA A Psychology Nelson Thornes in preparing these slides.
Choosing the Appropriate
Statistical Technique
Type of question to be answered
• Number of variables
– Univariate
– Bivariate
– Multivariate
• Scale of measurement
• Data Distribution
Inferential Statistical Tools
Univariate Analysis
Univariate Tools
• Z-Test
• t-Test
• Chi-Square Test (Distribution Test)
• Mann Whitney U Test
• Univariate ANOVA
Calculating Zobs
$$z_{obs} = \frac{\bar{x} - \mu}{s_{\bar{x}}}$$
Alternate Way of Testing the Hypothesis
$$Z_{obs} = \frac{\bar{X} - \mu}{S_{\bar{X}}}$$
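A minimal sketch of computing the observed Z follows. It assumes the denominator is the standard error of the mean, s divided by the square root of n, and the sample figures are hypothetical numbers chosen for illustration, not data from the slides.

```python
from math import sqrt
from statistics import NormalDist


def z_obs(sample_mean, mu, s, n):
    """Observed Z: distance of the sample mean from mu in standard errors."""
    standard_error = s / sqrt(n)  # assumed definition of the denominator
    return (sample_mean - mu) / standard_error


# Hypothetical sample: mean 3.78, sd 1.5, n = 25, tested against mu = 3.0.
z = z_obs(sample_mean=3.78, mu=3.0, s=1.5, n=25)

# Two-sided p-value from the standard normal distribution.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
reject_null = p_two_sided < 0.05

print(round(z, 2), round(p_two_sided, 4), reject_null)
```

Comparing the p-value with alpha is equivalent to comparing the observed Z with the critical value for the chosen significance level.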
t-Distribution
• Symmetrical, bell-shaped distribution
• Mean of zero and a standard deviation close to one
(it approaches one as the degrees of freedom increase)
• Shape influenced by degrees of freedom
Degrees of Freedom
• Abbreviated d.f.
• d.f. = number of observations minus number of constraints
Testing a Hypothesis about a
Distribution
• Chi-Square test
• Test for significance in the analysis of
frequency distributions
• Compare observed frequencies with
expected frequencies
• “Goodness of Fit”
Chi-Square Test
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
Chi-Square Test
χ² = chi-square statistic
Oi = observed frequency in the ith cell
Ei = expected frequency in the ith cell
Chi-Square Test
Estimation for Expected Number for Each Cell
$$E_{ij} = \frac{R_i C_j}{n}$$
Ri = total observed frequency in the ith row
Cj = total observed frequency in the jth column
n = sample size
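The expected number for each cell is the row total times the column total, divided by the sample size. A small sketch with a hypothetical 2 × 2 table of observed frequencies (the counts are invented for illustration):

```python
# Hypothetical observed frequencies: rows are groups, columns are responses.
observed = [
    [30, 10],
    [20, 40],
]

row_totals = [sum(row) for row in observed]        # R_i
col_totals = [sum(col) for col in zip(*observed)]  # C_j
n = sum(row_totals)                                # sample size

# Expected frequency for each cell: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

print(row_totals, col_totals, n)
print(expected)
```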
Univariate Hypothesis Test
Chi-square Example
$$\chi^2 = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2}$$
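For a two-cell case like the one above, the statistic can be computed directly. The observed and expected counts below are hypothetical (100 respondents split 60/40 against an expected 50/50); 3.841 is the standard chi-square critical value for 1 degree of freedom at the .05 level.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical two-category example: 60 and 40 observed, 50 and 50 expected.
chi2 = chi_square([60, 40], [50, 50])

# Critical value for 1 degree of freedom at alpha = .05.
CRITICAL_05_DF1 = 3.841
reject_null = chi2 > CRITICAL_05_DF1

print(chi2, reject_null)
```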
Bivariate Analysis
Measures of Association
• A general term that refers to a number of
bivariate statistical techniques used to
measure the strength of a relationship
between two variables.
Relationships Among Variables
• Correlation analysis
• Bivariate regression analysis
Type of Measurement          Measure of Association
Interval and ratio scales    Pearson correlation; bivariate regression
Ordinal scales               Chi-square; Spearman R; Kendall tau;
                             coefficient gamma
Nominal                      Chi-square; phi coefficient; Fisher exact test
Bivariate Analysis - Tests of Differences
Common Bivariate Tests
Interval and ratio
  Independent groups
    – Two groups: t-test or Z-test
    – Three or more groups: One-way ANOVA
  Dependent groups
    – Two groups: Paired t-test
    – Three or more groups: Repeated ANOVA
Ordinal
  Independent groups
    – Two groups: Wald-Wolfowitz runs test; Mann-Whitney U-test;
      Wilcoxon test; K-S two-sample test
    – Three or more groups: Kruskal-Wallis test; Median test
  Dependent groups
    – Two groups: Sign test; Wilcoxon's matched pairs
    – Three or more groups: Friedman's two-way analysis of variance
Nominal
  Independent groups
    – Two groups: Z-test (two proportions); Chi-square test
    – Three or more groups: Chi-square test
  Dependent groups
    – Two groups: McNemar's Chi-square test
    – Three or more groups: Cochran Q test

Multivariate Analysis
Multivariate Statistical Analysis
• Statistical methods that allow the
simultaneous investigation of more than
two variables.
A Classification of Selected Multivariate Methods
All multivariate methods
Are some of the variables dependent on others?
• Yes: dependence methods
• No: interdependence methods
Dependence Methods
• A category of multivariate statistical
techniques; dependence methods explain or
predict a dependent variable(s) on the basis
of two or more independent variables
Dependence Methods
How many variables are dependent?
• One dependent variable
– Metric: multiple regression analysis
– Non-metric: multiple discriminant analysis
• Several dependent variables
– Metric: multivariate analysis of variance
– Non-metric: conjoint analysis
• Multiple independent and dependent variables
– Metric or non-metric: canonical correlation analysis
Interdependence Methods
• A category of multivariate statistical
techniques; interdependence methods give
meaning to a set of variables or seek to
group things together
Interdependence methods
Are inputs metric?
• Metric: factor analysis; cluster analysis; metric multidimensional scaling
• Nonmetric: none
Summary Table of Statistical Tests
Categorical or Nominal
  – 1 sample: Χ² or binomial
  – 2 samples, independent: Χ²
  – 2 samples, dependent/paired/related: McNemar’s Χ²
  – K samples (i.e., >2), independent: Χ²
  – K samples, dependent/repeated: Cochran’s Q
Rank or Ordinal
  – 2 samples, independent: Mann Whitney ‘U’
  – 2 samples, dependent/paired/related: Wilcoxon Matched Pairs Signed Ranks
  – K samples, independent: Kruskal Wallis ‘H’
  – K samples, dependent/repeated: Friedman’s ANOVA
  – Correlation: Spearman’s rho
Parametric (Interval & Ratio)
  – 1 sample: z test or t test
  – 2 samples, independent: t test between groups (independent sample t-test)
  – 2 samples, dependent/paired/related: t test within groups (paired t-test)
  – K samples, independent: 1 way ANOVA between groups
  – K samples, dependent/repeated: 1 way ANOVA (within or repeated measure)
  – Factorial (2 way) ANOVA
  – Correlation: Pearson’s r
(Plonskey, 2001)
If we want to compare attitudes towards a brand among
buyers in different cities, which test can we
apply, and why?
In a yoga class, BP is measured three times over a
span of three weeks. Which test will be suitable in
this case?
If we want to measure the impact of brand image on
purchase intention, which test would be applied, and
why?
If preference towards shopping malls is measured
for male and female respondents, which test should
be applied?
If individuals in three sections are compared on their attitude towards
online classes, which test should be applied?
Online vs offline classes: the difference in perception is measured at the
time of attending the classes, using the same sample. Which test should be applied?
If, in a survey about job preferences in the tourism industry, we got responses
from metro, tier-two and tier-three cities, which test should be applied?
"I want to purchase branded clothes but am restricted by their price." Which kind
of study is this, and which test should be applied?
