Analysis of Data - Unit III (New)

ANALYSIS OF DATA

Statistical Techniques to Analyze Data
• Descriptive Statistics
• Confirmatory/Inferential Statistics
Descriptive Analysis
• The transformation of raw data into a form
that will make them easy to understand and
interpret; rearranging, ordering, and
manipulating data to generate descriptive
information.
Descriptive Analysis

Type of Measurement                  Type of descriptive analysis
Nominal (two categories)             Frequency table; Proportion (percentage)
Nominal (more than two categories)   Frequency table; Category proportions (percentages); Mode
Ordinal                              Rank order; Median
Interval                             Arithmetic mean
Ratio                                Index numbers; Geometric mean; Harmonic mean
Central Tendency

Type of Scale       Measure of Central Tendency   Measure of Dispersion
Nominal             Mode                          None
Ordinal             Median                        Percentile
Interval or ratio   Mean                          Standard deviation
Cross-Tabulation

• Analyze data by groups or categories


• Compare differences
• Contingency table
• Percentage cross-tabulations
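A minimal sketch of a contingency table and a percentage cross-tabulation. The survey records below are invented purely for illustration; they are not data from these slides.

```python
from collections import Counter

# Hypothetical survey records: (group, answer). Invented for illustration.
responses = [
    ("Male", "Yes"), ("Male", "No"), ("Male", "Yes"), ("Male", "Yes"),
    ("Female", "No"), ("Female", "Yes"), ("Female", "No"), ("Female", "No"),
]

counts = Counter(responses)
rows = sorted({group for group, _ in responses})
cols = sorted({answer for _, answer in responses})

# Contingency table: raw counts for every group/answer combination.
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

# Percentage cross-tabulation: each count as a share of its row total.
row_pct = {
    r: {c: 100 * table[r][c] / sum(table[r].values()) for c in cols}
    for r in rows
}

for r in rows:
    print(r, table[r], row_pct[r])
```

Row percentages make the groups comparable even when the groups differ in size.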
Charts and Graphs
• Pie charts
• Line graphs
• Scattergram
• Pictogram
• Histogram
• Stem & Leaf
• Bar charts
Line Graph
BAR GRAPH
• A bar graph, or bar chart, is used to represent
values in relation to other values.
• They’re often used to compare data taken over
long periods of time, but they’re most often used
on very small sets of data.
• These graphs can be horizontal or vertical. In a
vertical bar graph, the categories run across the
bottom and the numbers that represent the actual
data run up the side; a horizontal bar graph swaps
the two axes.
Bar Graph
[Figure: bar chart comparing East, West and North across 1st Qtr, 2nd Qtr, 3rd Qtr and 4th Qtr; vertical axis from 0 to 90]
Web Surveyor Bar Chart
How did you find your last job?

Response                   Count   Share
Networking                 643     55.2 %
Print ad                   213     18.3 %
Online recruitment site    179     15.4 %
Placement firm             112      9.6 %
Temporary agency            18      1.5 %

[Figure: horizontal bar chart of the counts above; axis from 0 to 700]
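The percentages on the Web Surveyor chart can be recovered from the counts. A short sketch, assuming the five counts shown on the slide are the complete set of responses:

```python
# Counts as shown on the slide.
counts = {
    "Networking": 643,
    "Print ad": 213,
    "Online recruitment site": 179,
    "Placement firm": 112,
    "Temporary agency": 18,
}

total = sum(counts.values())

# Each category as a percentage of all responses, rounded to one decimal.
percentages = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, pct in percentages.items():
    print(f"{name:<25} {counts[name]:>4} {pct:>5} %")
```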


When to use Bar Charts
• A bar chart is particularly useful when one or two categories
'dominate' results.
• It can be very clear and easy to read.
• Most people understand what is presented without having to
have detailed statistical knowledge.
• It can represent data expressed as actual numbers, percentages
and frequencies.
• A bar chart can represent either discrete or continuous data.
• If the data is discrete there should be a gap between the bars (as
in the diagram above).
• If the data is continuous there should be no gap between the
bars.
PIE CHART
• A pie graph, also known as a pie chart, is a type of
graph commonly used in conjunction with
percentages.
• A large circle is divided into sections depending
on those percentages and each section represents
part of the whole.
• In a pie chart, the arc length of each separate
sector is meant to be proportional to the
percentage it’s supposed to represent.
• The first pie chart was created in 1801 by William
Playfair.
When to use a PIE Chart
• It is best used to present the proportions of a
sample.
• It is most useful where one or two results
dominate the findings.
• It can represent data expressed as actual
numbers or percentages.
• Do not use where there are a large number
of categories, or where each has a small,
fairly equal share, as this can be unclear.
Inferential Statistical Techniques

•Univariate Statistics
•Bivariate statistics
•Multivariate Statistics
Univariate Statistics
• Test of statistical significance
• Hypothesis testing one variable at a time
Significance Level
• Critical Probability
• Confidence Level
• Alpha
• Probability Level selected is typically .05 or
.01
Type I and Type II Errors

                                               Accept null          Reject null
Null is true (Medicine can cure disease)       Correct - no error   Type I error
Null is false (Medicine cannot cure disease)   Type II error        Correct - no error
Type I and Type II Errors in Hypothesis Testing

State of Null Hypothesis   Decision
in the Population          Accept Ho            Reject Ho
Ho is true                 Correct - no error   Type I error
Ho is false                Type II error        Correct - no error
MEASURES OF CENTRAL
TENDENCY AND
DISPERSION
Measures of central tendency
• Mean
• Median
• Mode

• i.e. finding a ‘typical’ value from the middle of the data.
Arithmetic Mean
The arithmetic mean is a mathematical average and is
the most popular measure of central tendency.
It is frequently referred to simply as the ‘mean’. It is obtained by
dividing the sum of the values of all observations in a series
(ƩX) by the number of items (N) constituting the series.
Thus, the mean of a set of numbers X1, X2, X3, ..., XN
is denoted by x̅ and is defined as

$$ \bar{x} = \frac{X_1 + X_2 + \dots + X_N}{N} = \frac{\sum X}{N} $$
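Methods of Calculating the Arithmetic Mean:
• Direct method: $$ \bar{x} = \frac{\sum X}{N} $$

• Short-cut method: $$ \bar{x} = A + \frac{\sum d}{N} $$ where A is an assumed mean and d = X − A

• Step deviation method: $$ \bar{x} = A + \frac{\sum d'}{N} \times i $$ where d' = (X − A) / i and i is a common factor (usually the class width)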

Example: Calculating the Arithmetic Mean
DIRC Monthly Users Statistics in the University Library

Month      No. of Working Days   Total Users   Average Users per Working Day
Sep-2011   24                    11618         484.08
Oct-2011   21                    8857          421.76
Nov-2011   23                    11459         498.22
Dec-2011   25                    8841          353.64
Jan-2012   24                    5478          228.25
Feb-2012   23                    10811         470.04
Total      140                   57064

Mean users per working day = 57064 / 140 = 407.6
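The library example can be checked with a few lines of code. This sketch uses the figures from the table and the direct method (total divided by number of items); the variable names are illustrative.

```python
# (month, working days, total users) as given in the example table.
monthly = [
    ("Sep-2011", 24, 11618),
    ("Oct-2011", 21, 8857),
    ("Nov-2011", 23, 11459),
    ("Dec-2011", 25, 8841),
    ("Jan-2012", 24, 5478),
    ("Feb-2012", 23, 10811),
]

total_days = sum(days for _, days, _ in monthly)
total_users = sum(users for _, _, users in monthly)

# Direct method: sum of all values divided by the number of items.
overall_mean = total_users / total_days

# Average users per working day within each month.
per_month = {month: round(users / days, 2) for month, days, users in monthly}

print(total_days, total_users, round(overall_mean, 1))
print(per_month)
```

The overall figure divides total users by total working days, so it weights each month by its number of days rather than averaging the six monthly averages.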
Advantages of Mean

• It is easy to understand & simple to calculate.
• It is based on all the values.
• It is rigidly defined .
• It is easy to understand the arithmetic
average even if some of the details of the
data are lacking.
• It is not based on the position in the series.
Disadvantages of Mean

• It is affected by extreme values.
• It cannot be calculated for open end classes.
• It cannot be located graphically.
• It can give misleading conclusions.
• It has upward bias.
MEDIAN
The median is the central value of a distribution: the
value which divides the distribution into two equal parts, each
part containing an equal number of items. Thus it is the
central value of the variable when the values are
arranged in order of magnitude.
Connor defines it as follows: “The median is that value of the
variable which divides the group into two equal parts,
one part comprising of all values greater, and the other,
all values less than median.”
THE MEDIAN

• The median is the middle-ranked score (50th percentile).
• If there is an even number of scores, it is the arithmetic average of the two middle scores.
• The median is unchanged by outliers. Even if the highest value is deleted from the data, the median would remain (more or less) the same.
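A small sketch of these three points, using Python's standard `statistics` module. The ten scores are illustrative values chosen for this example.

```python
import statistics

scores = [3, 5, 8, 8, 9, 10, 12, 12, 13, 15]    # even number of scores

# Even count: the median is the average of the two middle scores (9 and 10).
median_all = statistics.median(scores)

# Delete the highest value: the median moves only slightly.
without_top = scores[:-1]
median_without_top = statistics.median(without_top)

# Replace the highest value with an extreme outlier (hypothetical 150):
# the median is unchanged while the mean is pulled upward.
with_outlier = scores[:-1] + [150]
median_outlier = statistics.median(with_outlier)
mean_outlier = statistics.mean(with_outlier)

print(median_all, median_without_top, median_outlier, mean_outlier)
```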
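Calculation of Median – Discrete Series:

i. Arrange the data in ascending or descending order.
ii. Calculate the cumulative frequencies.
iii. Apply the formula: Median = value of the ((N + 1) / 2)th item.

Calculation of Median – Continuous Series

For calculation of the median in a continuous
frequency distribution the following formula
will be employed. Algebraically,

$$ M = L + \frac{N/2 - cf}{f} \times i $$

where L is the lower limit of the median class, N is the total frequency,
cf is the cumulative frequency of the class preceding the median class,
f is the frequency of the median class and i is the class width.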
Example: Median of Grouped Data in a
Distribution of Respondents by Age

Age Group   Frequency (f)   Cumulative frequency (cf)
0-20        15              15
20-40       32              47
40-60       54              101
60-80       30              131
80-100      19              150
Total       150

N/2 = 75, so the median class is 40-60 (L = 40, cf = 47, f = 54, i = 20).

$$ M = 40 + \frac{75 - 47}{54} \times 20 $$

= 40 + (28 / 54) × 20
= 40 + 0.5185 × 20
= 40 + 10.37
= 50.37
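The age-group example can be reproduced in code. This is a minimal sketch, assuming the usual grouped-median formula M = L + ((N/2 − cf) / f) × i, which is consistent with the figures in the worked example; the function name `grouped_median` is illustrative.

```python
def grouped_median(classes):
    """Median of a continuous frequency distribution.

    classes: list of (lower_limit, upper_limit, frequency), in ascending order.
    """
    n = sum(f for _, _, f in classes)
    half = n / 2
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, upper, f in classes:
        if cf + f >= half:
            # This is the median class: interpolate within it.
            return lower + (half - cf) / f * (upper - lower)
        cf += f
    raise ValueError("empty distribution")


ages = [(0, 20, 15), (20, 40, 32), (40, 60, 54), (60, 80, 30), (80, 100, 19)]
median_age = grouped_median(ages)
print(round(median_age, 2))
```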
Advantages of Median

• Median can be calculated in all distributions.
• Median can be understood even by common people.
• Median can be ascertained even with the extreme items.
• It is most useful when dealing with qualitative data.
Disadvantages of Median
• It is not based on all the values.
• It is not capable of further mathematical
treatment.
• It is affected by fluctuation in sampling.
• When there is an even number of values, it may not
be a value from the data.
THE MODE

• The mode is the score with the highest frequency of occurrences.
• It is the easiest score to spot in a distribution.
• It is the only way to express the central tendency of a nominal level variable.
MODE
• Mode is the most frequent value or score in the distribution.
• It is defined as the value of the item that occurs most often in a series.
• It is denoted by the capital letter Z.
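MODE
Croxton and Cowden defined it as “the mode
of a distribution is the value at the point around
which the items tend to be most heavily concentrated.
It may be regarded as the most typical of a series
of values.”
The exact value of the mode can be obtained by the
following formula.

$$ Z = L_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i $$

where L1 is the lower limit of the modal class, f1 is the frequency of the
modal class, f0 and f2 are the frequencies of the classes before and after
it, and i is the class width.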
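Example: Calculate the Mode for the Distribution of
Monthly Rent Paid by Libraries in Karnataka

Monthly rent (Rs)   Number of Libraries (f)
500-1000            5
1000-1500           10
1500-2000           8
2000-2500           16
2500-3000           14
3000 & Above        12
Total               65

The modal class is 2000-2500 (L1 = 2000, f1 = 16, f0 = 8, f2 = 14, i = 500).

$$ Z = 2000 + \frac{16 - 8}{2 \times 16 - 8 - 14} \times 500 $$

Z = 2000 + (8 / 10) × 500
Z = 2000 + 0.8 × 500
Z = 2000 + 400
Z = 2400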
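The rent example can be reproduced with a short sketch. It assumes the usual grouped-mode formula Z = L1 + ((f1 − f0) / (2·f1 − f0 − f2)) × i, which matches the numbers in the worked example, and it assumes the modal class is not the first or last class.

```python
# Lower class limits and frequencies from the rent table.
# The last class ("3000 & Above") is open-ended, so only its frequency is used.
lower_limits = [500, 1000, 1500, 2000, 2500, 3000]
frequencies = [5, 10, 8, 16, 14, 12]
class_width = 500

m = frequencies.index(max(frequencies))  # position of the modal class
f1 = frequencies[m]        # frequency of the modal class
f0 = frequencies[m - 1]    # frequency of the class before it
f2 = frequencies[m + 1]    # frequency of the class after it

mode = lower_limits[m] + (f1 - f0) / (2 * f1 - f0 - f2) * class_width
print(mode)
```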
Advantages of Mode
• Mode is readily comprehensible and easily
calculated
• It is the best representative of data
• It is not at all affected by extreme value.
• The value of mode can also be determined
graphically.
• It is usually an actual value of an important
part of the series.
Disadvantages of Mode
• It is not based on all observations.
• It is not capable of further
mathematical manipulation.
• Mode is affected to a great extent by
sampling fluctuations.
• Choice of grouping has great
influence on the value of mode.
Advantages and Disadvantages

Mean
• Advantage: More sensitive than the median, because it makes use of all the values of the data.
• Disadvantage: It can be misrepresentative if there is an extreme value.

Median
• Advantage: It is not affected by extreme scores, so can give a representative value.
• Disadvantage: It is less sensitive than the mean, as it does not take into account all of the values.

Mode
• Advantage: It is useful when the data are in categories, such as the number of babies who are securely attached.
• Disadvantage: It is not a useful way of describing data when there are several modes.
Measures of Dispersion

• Measures of ‘spread.’
• This looks at how
‘spread out’ the data
are.
• Are the scores similar
to each other (closely
clustered), or quite
spread out?
Range and Standard Deviation
• The range is the difference between the highest
and lowest numbers. What is the range of …

• 3, 5, 8, 8, 9, 10, 12, 12, 13, 15


• Mean = 9.5 range = 12 (3 to 15)

• 1, 5, 8, 8, 9, 10, 12, 12, 13, 17


• Mean = 9.5 range = 16 (1 to 17)
• Example from Cara Flanagan, Research Methods for AQA A Psychology (2005) Nelson Thornes p 15
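The two data sets above share a mean but differ in spread. A sketch using the standard `statistics` module; treating each list as a whole population (`pstdev`) is an assumption made here, since the slide does not say whether the scores are a sample.

```python
import statistics

set_a = [3, 5, 8, 8, 9, 10, 12, 12, 13, 15]
set_b = [1, 5, 8, 8, 9, 10, 12, 12, 13, 17]


def value_range(values):
    """Range: difference between the highest and lowest numbers."""
    return max(values) - min(values)


mean_a, mean_b = statistics.mean(set_a), statistics.mean(set_b)
range_a, range_b = value_range(set_a), value_range(set_b)

# Population standard deviation (assumption: the ten scores are all the data).
sd_a, sd_b = statistics.pstdev(set_a), statistics.pstdev(set_b)

print(mean_a, range_a, sd_a)
print(mean_b, range_b, round(sd_b, 2))
```

Only the two extreme scores differ between the sets, yet the range changes from 12 to 16, which shows how sensitive the range is to outliers.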
Standard Deviation
• Standard deviation
tells us the average
distance of each score
from the mean.
• 68% of normally
distributed data is
within 1 sd each side
of the mean
• 95% within 2 sd
• Almost all is within 3
sd
Example
• Mean IQ = 100, sd = 15
• What is the IQ of 68% of
population (i.e. what is
the range of possible
IQs)?
• Between what IQ scores
would 95% of people be?
• Dan says he has done an
online IQ test, and he has
an IQ of 170. Should you
believe him? Why/not?
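One way to approach the IQ questions is to compute the bands that lie 1, 2 and 3 standard deviations either side of the mean. A sketch, assuming IQ scores are normally distributed; the helper name `band` is illustrative.

```python
mean_iq, sd_iq = 100, 15


def band(k):
    """Scores within k standard deviations either side of the mean."""
    return (mean_iq - k * sd_iq, mean_iq + k * sd_iq)


band_68 = band(1)   # about 68% of people
band_95 = band(2)   # about 95% of people
band_3sd = band(3)  # almost everyone

claimed = 170
low, high = band_3sd
claim_within_3sd = low <= claimed <= high

print(band_68, band_95, band_3sd, claim_within_3sd)
```

A claimed score outside the 3 sd band is not impossible, but it would be extremely rare under a normal distribution.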
Another example
• John scores 61% in the
test. His mum says
that’s rubbish. Sol
points out that the
mean score in class
was 50%, with an sd
of 5. Did he do well?
• What if the sd was
only 2?
• What if sd was 15?
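John's result can be compared across the three scenarios by expressing it as a number of standard deviations above the class mean. A minimal sketch; the function name `z_score` is illustrative.

```python
def z_score(score, mean, sd):
    """How many standard deviations a score lies above (+) or below (-) the mean."""
    return (score - mean) / sd


john, class_mean = 61, 50

# The same raw score under the three standard deviations asked about.
results = {sd: z_score(john, class_mean, sd) for sd in (5, 2, 15)}

for sd, z in results.items():
    print(f"sd = {sd:>2}: {z:.2f} sd above the mean")
```

The smaller the standard deviation, the more exceptional the same raw score of 61% becomes.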
Advantages and disadvantages

Range
• Advantages: Quick and easy to calculate.
• Disadvantages: Affected by extreme values (outliers). Does not take into account all the values.

Standard deviation
• Advantages: More precise measure of dispersion because all values are taken into account.
• Disadvantages: Much harder to calculate than the range.

I used Cara Flanagan’s (2005) Research Methods for AQA A Psychology Nelson Thornes in preparing these slides.
Choosing the Appropriate
Statistical Technique
Type of question to be answered
• Number of variables
– Univariate
– Bivariate
– Multivariate
• Scale of measurement
• Data Distribution
Inferential Statistical Tools
Univariate Analysis
Univariate Tools

• Z-Test
• t-Test
• Chi-Square Test (Distribution Test)
• Mann Whitney U Test
• Univariate ANOVA
Calculating Zobs

$$ z_{obs} = \frac{\bar{x} - \mu}{s_{\bar{x}}} $$
Alternate Way of Testing the Hypothesis

$$ Z_{obs} = \frac{\bar{X} - \mu}{S_{\bar{X}}} $$
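A sketch of computing the observed Z value, the sample mean minus the hypothesized mean, divided by the standard error of the mean. All numbers are hypothetical, and taking the standard error as s / √n and using a two-tailed test at alpha = .05 are assumptions made for this example.

```python
import math
from statistics import NormalDist

# Hypothetical values, invented for illustration.
sample_mean = 52.0   # observed sample mean
mu = 50.0            # mean stated in the null hypothesis
s = 10.0             # sample standard deviation
n = 100              # sample size

standard_error = s / math.sqrt(n)            # assumed: s_x-bar = s / sqrt(n)
z_obs = (sample_mean - mu) / standard_error

alpha = 0.05
z_critical = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
reject_null = abs(z_obs) > z_critical

print(z_obs, round(z_critical, 2), reject_null)
```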
t-Distribution
• Symmetrical, bell-shaped distribution
• Mean of zero and a standard deviation close to one (approaching one as degrees of freedom increase)
• Shape influenced by degrees of freedom
Degrees of Freedom
• Abbreviated d.f.
• d.f. = number of observations minus number of constraints
Testing a Hypothesis about a
Distribution
• Chi-Square test
• Test for significance in the analysis of
frequency distributions
• Compare observed frequencies with
expected frequencies
• “Goodness of Fit”
Chi-Square Test

$$ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} $$
Chi-Square Test

χ² = chi-square statistic
Oi = observed frequency in the ith cell
Ei = expected frequency in the ith cell
Chi-Square Test
Estimation for Expected Number
for Each Cell

$$ E_{ij} = \frac{R_i C_j}{n} $$

Ri = total observed frequency in the ith row
Cj = total observed frequency in the jth column
n = sample size
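A sketch of estimating the expected number for each cell from the row totals, column totals and sample size (row total times column total, divided by n). The 2 × 2 observed table is hypothetical.

```python
# Hypothetical observed frequencies (rows x columns), invented for illustration.
observed = [
    [20, 30],
    [40, 10],
]

row_totals = [sum(row) for row in observed]          # R_i
col_totals = [sum(col) for col in zip(*observed)]    # C_j
n = sum(row_totals)                                  # sample size

# Expected frequency for the cell in row i, column j: R_i * C_j / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

print(row_totals, col_totals, n)
print(expected)
```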
Univariate Hypothesis Test
Chi-square Example

$$ \chi^2 = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} $$
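A sketch of the two-cell calculation: sum the squared difference between observed and expected frequency, divided by the expected frequency, over the cells. The observed and expected frequencies are hypothetical; the critical value 3.841 is the standard chi-square table value for 1 degree of freedom at the .05 level.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical two-cell example: 100 respondents, an even split expected.
observed = [60, 40]
expected = [50, 50]

chi2 = chi_square(observed, expected)

critical_value = 3.841        # chi-square table, d.f. = 1, alpha = .05
significant = chi2 > critical_value

print(chi2, significant)
```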
Inferential Statistical Tools
Bivariate Analysis
Measures of Association

• A general term that refers to a number of
bivariate statistical techniques used to
measure the strength of a relationship
between two variables.
Relationships Among Variables
• Correlation analysis
• Bivariate regression analysis
Type of Measurement         Measure of Association
Interval and Ratio Scales   Pearson Correlation; Bivariate Regression
Ordinal Scales              Chi-square; Spearman R; Kendall Tau; Coefficient Gamma
Nominal                     Chi-Square; Phi Coefficient; Fisher exact test
Bivariate Analysis - Tests of Differences
Common Bivariate Tests

Type of Measurement   Groups        Differences between two groups                                                      Differences among three or more groups
Interval and ratio    Independent   t-test or Z-test                                                                    One-way ANOVA
Interval and ratio    Dependent     Paired t-test                                                                       Repeated ANOVA
Ordinal               Independent   Mann-Whitney U-test; Wilcoxon test; Wald-Wolfowitz runs test; K-S two-sample test   Kruskal-Wallis test; Median test
Ordinal               Dependent     Sign test; Wilcoxon's matched pairs                                                 Friedman's two-way analysis of variance
Nominal               Independent   Z-test (two proportions); Chi-square test                                           Chi-square test
Nominal               Dependent     McNemar's Chi-square test                                                           Cochran Q test
Inferential Statistical Tools
Multivariate Analysis
Multivariate Statistical Analysis
• Statistical methods that allow the
simultaneous investigation of more than
two variables.
A Classification of Selected Multivariate Methods

All multivariate methods: are some of the variables dependent on others?
• Yes: Dependence methods
• No: Interdependence methods
Dependence Methods
• A category of multivariate statistical
techniques; dependence methods explain or
predict a dependent variable(s) on the basis
of two or more independent variables
Dependence Methods

How many variables are dependent?
• One dependent variable
– Metric: Multiple regression analysis
– Non-metric: Multiple discriminant analysis
• Several dependent variables
– Metric: Multivariate analysis of variance
– Non-metric: Conjoint analysis
• Multiple independent and dependent variables
– Metric or non-metric: Canonical correlation analysis
Interdependence Methods

• A category of multivariate statistical
techniques; interdependence methods give
meaning to a set of variables or seek to
group things together
Interdependence methods

Are inputs metric?
• Metric: Factor analysis; Cluster analysis; Metric multidimensional scaling
• Nonmetric: None
Summary Table of Statistical Tests

Categorical or Nominal
• 1 sample: Χ2 or binomial
• 2 samples, independent: Χ2
• 2 samples, dependent/paired/related: McNemar’s Χ2
• K samples (i.e., >2), independent: Χ2
• K samples, dependent/repeated: Cochran’s Q

Rank or Ordinal
• 2 samples, independent: Mann Whitney ‘U’
• 2 samples, dependent/paired/related: Wilcoxon Matched Pairs Signed Ranks
• K samples, independent: Kruskal Wallis ‘H’
• K samples, dependent: Friedman’s ANOVA
• Correlation: Spearman’s rho

Parametric (Interval & Ratio)
• 1 sample: z test or t test
• 2 samples, independent: t test between groups (Independent Sample t-test)
• 2 samples, dependent/paired/related: t test within groups (Paired t-test)
• K samples, independent: 1 way ANOVA between groups
• K samples, dependent: 1 way ANOVA (within or repeated measure)
• K samples, more than one factor: Factorial (2 way) ANOVA
• Correlation: Pearson’s r

 
(Plonskey, 2001)
If we want to compare attitudes towards a brand among
buyers in different cities, which test can we
apply, and why?
In a yoga class, BP is measured three times over the
span of three weeks. Which test will be suitable in
this case?
If we want to measure the impact of brand image on
purchase intention, which test would be applied, and
why?
If preference towards shopping malls is measured
among male and female respondents, which test should
be applied?
If individuals in three sections are compared on their attitude towards
online classes, which test should be applied?
Online vs offline classes: the difference in perception is measured at the
time of attending the classes, using the same sample. Which test should be applied?
If, in a survey about job preferences in the tourism industry, we get responses
from metro, two-tier and three-tier cities, which test should be applied?
I want to purchase branded clothes but am restricted by their price. Which kind
of study is this, and which test should be applied?
