
Research Methodology – Unit 4 15-04-2020

MBA@RCET

Summarising & Analysing Data


Unit 4 – Research Methodology
Department of Management Studies, RCET Bhilai

Content Created By: Sushil Punwatkar

What is Data Analysis?

• Data analysis is the process of evaluating data using analytical and logical reasoning to examine each component of the data provided.
• It is the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data.
• Data analysis is the foundation of scientific research. A complete analysis of the data enables the researcher to:
  ◦ Determine the impact of work
  ◦ Assess the quality of work
  ◦ Communicate results to stakeholders


© 2019 Sushil Punwatkar. All rights reserved. 1



Data Analysis through Statistics

Descriptive Statistics
• Measure of Central Tendency
• Measure of Dispersion

Relational Statistics
• Univariate
• Bivariate
• Multivariate

Inferential Statistics
• Parametric Test
• Non-Parametric Test


Measures of Central Tendency – Mean, Median & Mode


• In probability and statistics, the mean (also called the expected value) is one measure of the central tendency, either of a probability distribution or of the random variable characterized by that distribution.
• In descriptive statistics, the mean may be confused with the median, mode or mid-range, as any of these may be called an "average".
• The mean of a set of observations is the arithmetic average of the values; however, for skewed distributions, the mean is not necessarily the same as the middle value (median), or the most likely value (mode).


Contd…

• Median
The median is the value separating the higher half of a data sample, a population, or a probability distribution from the lower half. In simple terms, it may be thought of as the "middle" value of a data set.
For example, in the data set {1, 3, 3, 6, 7, 8, 9}, the median is 6, the fourth number in the sample.

• Mode
The mode is the value that appears most often in a set of data. The mode of a discrete probability distribution is the value x at which its probability mass function takes its maximum value.
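
The three measures of central tendency can be checked on the data set {1, 3, 3, 6, 7, 8, 9} used above. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean, median, mode

# Data set from the median example on the slide
data = [1, 3, 3, 6, 7, 8, 9]

data_mean = mean(data)      # arithmetic average: 37 / 7
data_median = median(data)  # middle value of the sorted data
data_mode = mode(data)      # most frequent value

print(f"mean = {data_mean:.2f}, median = {data_median}, mode = {data_mode}")
```

Because this small data set is not symmetric, the three measures do not coincide.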


Measure of Dispersion – Standard Deviation

• In statistics, the standard deviation (SD, also represented by the Greek letter sigma, σ) is a measure that is used to quantify the amount of variation or dispersion of a set of data values.
• A low σ indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high σ indicates that the data points are spread out over a wider range of values.
• The standard deviation of a random variable, statistical population, data set, or probability distribution is the square root of its variance.
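
The relationship "standard deviation = square root of variance" can be illustrated with a short sketch. The data values below are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from math import sqrt
from statistics import pstdev, pvariance, stdev

# Hypothetical data values (not from the slides)
values = [2, 4, 4, 4, 5, 5, 7, 9]

population_variance = pvariance(values)  # mean of squared deviations from the mean
population_sd = pstdev(values)           # square root of the population variance
sample_sd = stdev(values)                # uses n - 1 in the denominator

print(population_variance, population_sd, round(sample_sd, 3))
```

The population form treats the data as the complete population, while the sample form (dividing by n - 1) is the usual choice when the data are a sample drawn from a larger population.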


Relational Analysis – Univariate, Bivariate & Multivariate

• Univariate analysis is the simplest form of analyzing data. "Uni" means "one", so, in other words, your data has only one variable. It doesn't deal with causes or relationships (unlike regression) and its major purpose is to describe; it takes data, summarizes that data and finds patterns in the data.
Example: a study on the average height of students in a class.

• Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the concept of relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.
Example: if you are studying a group of students to find out their average math score and their age, you have two variables (math score and age).


Contd…

• Multivariate analysis is a set of techniques used for analysis of data that contain more than one variable.

Multivariate Analysis
• Dependence Technique
  ◦ One Dependent Variable: Chi-Square Test, One-Way ANOVA, Multiple Regression
  ◦ Multiple Dependent Variables: MANOVA
• Interdependence Technique
  ◦ Variable Interdependence: Factor Analysis
  ◦ Inter-Object Similarity: Cluster Analysis


Inferential Statistics – Parametric & Non-Parametric Tests

Parametric Test
• If the measurement scale is interval or ratio, use parametric statistics.
• Assumption: the population is normally distributed.

Non-Parametric Test
• If the measurement scale is nominal or ordinal, use non-parametric statistics.
• Does not rely on any particular type of distribution.


Parametric Test
• One Sample: z-test & t-test
• Two Samples: z-test & t-test, Paired t-test
• Multiple Samples: ANOVA (One-Way ANOVA, Two-Way ANOVA)


Non-Parametric Test
• One Sample: Chi-Square Test, Kolmogorov Test
• Two Samples: Chi-Square Test, Wilcoxon Test, Mann-Whitney Test
• Multiple Samples: Kruskal-Wallis Test


Selecting Test – Which test to choose?

• Statistical tests allow researchers to make inferences because they can show whether an observed pattern is due to intervention or chance.
• There is a wide range of statistical tests.
• The decision of which statistical test to use depends on the research design, the distribution of the data, and the type of variable.
• In general, if the data are normally distributed, parametric tests should be used. If the data are non-normal, non-parametric tests should be used.


Testing of Hypothesis
Ordinarily, when one talks about a hypothesis, one simply means a mere assumption or some supposition to be proved or disproved. But for a researcher, a hypothesis is a formal question that the researcher intends to resolve.
Thus, a hypothesis may be defined as a proposition or a set of propositions set forth as an explanation for the occurrence of some specified group of phenomena.


Characteristics of Good Hypothesis

• A hypothesis should be clear and precise.
• A hypothesis should be capable of being tested.
• A hypothesis should state the relationship between variables.
• A hypothesis should be limited in scope and must be specific.
• A hypothesis should be stated, as far as possible, in the simplest terms so that it is easily understandable by all concerned.
• A hypothesis must explain the facts that gave rise to the need for explanation.


In choice of null hypothesis…


Type I and Type II Errors

Type I Error:
Rejecting H0 (the null hypothesis) when it is TRUE.

Type II Error:
Accepting H0 (the null hypothesis) when it is FALSE.


Terminology to Remember

Level of Significance:
The significance level is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.
Generally taken as 1%, 5% or 10%.

Critical Value:
In hypothesis testing, a critical value is a point on the test distribution that is compared to the test statistic to determine whether to reject the null hypothesis. If the absolute value of your test statistic is greater than the critical value, you can declare statistical significance and reject the null hypothesis.
The commonly used critical value for a two-tailed test at the 5% level of significance is 1.96, which is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean.
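
The value 1.96 can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library and assumes a 5% level of significance; the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05                      # level of significance (5%)
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Two-tailed test: alpha is split equally between both tails
z_two_tailed = std_normal.inv_cdf(1 - alpha / 2)

# One-tailed test: all of alpha lies in one tail
z_one_tailed = std_normal.inv_cdf(1 - alpha)

print(round(z_two_tailed, 2))  # 1.96
print(round(z_one_tailed, 3))  # 1.645
```

Changing `alpha` to 0.01 or 0.10 gives the critical values for the other commonly used significance levels.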


Two-Tailed & One-Tailed Test.


Hypothesis Testing through Statistical Tests

Parametric Test:
• z-test / t-test
• ANOVA (F-test)

Non-Parametric Test:
• Chi-Square Test


z-Test or t-test

• A z-test is a statistical test to determine whether two population means are different when the variances are known and the sample size is large.
• It can be used to test hypotheses in which the test statistic follows a normal distribution.
• A z-statistic, or z-score, is a number representing the result from the z-test.
• z-tests are closely related to t-tests, but t-tests are best performed when an experiment has a small sample size.
• Also, t-tests assume the population standard deviation is unknown, while z-tests assume it is known.
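
A two-sample z-test can be sketched as follows. The sample means, standard deviations and sample sizes are hypothetical, and the helper name `two_sample_z` is an assumption made for this illustration; the population standard deviations are treated as known, as the z-test requires.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2):
    """z statistic for the difference between two means with known sigmas."""
    standard_error = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# Hypothetical figures: two large samples of 100 observations each
z = two_sample_z(mean1=52.0, mean2=50.0, sigma1=5.0, sigma2=5.0, n1=100, n2=100)

# Two-tailed p-value and decision at the 5% level of significance
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
critical_value = NormalDist().inv_cdf(0.975)
reject_h0 = abs(z) > critical_value

print(f"z = {z:.3f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

With these made-up numbers the calculated z exceeds the critical value, so the null hypothesis of equal means would be rejected.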


Hypothesis Testing for Comparing Two Related Samples

• The paired t-test is a way of comparing two related samples. It involves small values of n and does not require the variances of the two populations to be equal, but the assumption that the two populations are normal must continue to apply.

• For a paired t-test, it is necessary that the observations in the two samples be collected in the form of what are called matched pairs, i.e., "each observation in the one sample must be paired with an observation in the other sample in such a manner that these observations are somehow "matched" or related, in an attempt to eliminate extraneous factors which are not of interest in test."


• To apply this test, we first work out the difference score for each matched pair, and then find the average of such differences, $\bar{D}$, along with the sample variance of the difference scores. If the values from the two matched samples are denoted as Xi and Yi and the differences by Di (Di = Xi – Yi), then the mean of the differences is $\bar{D} = \frac{\sum D_i}{n}$.
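
The paired t-test procedure can be sketched in a few lines. The matched-pair values are hypothetical, and the statistic follows the standard textbook form t = D̄ / (s_D / √n) with n – 1 degrees of freedom, where s_D is the sample standard deviation of the differences.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical matched pairs (for example, scores of the same 5 people
# measured under two conditions)
x = [12, 14, 10, 13, 14]
y = [10, 12, 9, 11, 13]

differences = [xi - yi for xi, yi in zip(x, y)]  # Di = Xi - Yi
n = len(differences)

d_bar = mean(differences)   # mean of the differences
s_d = stdev(differences)    # sample SD of the differences (n - 1 denominator)
t_stat = d_bar / (s_d / sqrt(n))
dof = n - 1

print(f"D-bar = {d_bar}, t = {t_stat:.3f}, degrees of freedom = {dof}")
```

The calculated t is then compared with the table value of t for n – 1 degrees of freedom at the chosen level of significance (2.776 for 4 degrees of freedom in a two-tailed test at 5%).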


Chi-Square Test (Non-Parametric Test)


A chi-square (χ²) test measures how expectations compare to actual observed data (or model results). The data used in calculating a chi-square statistic must be random, raw, mutually exclusive, drawn from independent variables, and drawn from a large enough sample.

$\chi^2 = \sum \frac{(O - E)^2}{E}$

Where, O = Observed Values & E = Expected Values


Types of Chi-Square Tests

• There are two main kinds of chi-square tests:
  ◦ The Test of Independence, which asks a question of relationship, such as, "Is there a relationship between gender and SAT scores?"; and
  ◦ The Goodness-of-Fit Test, which asks something like "If a coin is tossed 100 times, is there an equal (50-50) chance of getting heads and tails?"

• For these tests, degrees of freedom are used, together with the level of significance, to look up the critical value that decides whether the null hypothesis can be rejected. Degrees of freedom refers to the maximum number of logically independent values, which are values that have the freedom to vary, in the data sample.


The Goodness of Fit Test

• The goodness-of-fit test estimates how closely an observed distribution matches an expected distribution. The chi-square test for goodness of fit is used to check whether the observed frequencies of one categorical variable differ significantly from the expected frequencies.

• For example:
  ◦ Is the height of each student in the class equal to the mean height of the class?
  ◦ Are all brands of smartphones equally preferred?
  ◦ Do all customers prefer the same colour of product packaging?


Problem 1

• A marketing manager wants to know the most preferred brand of smartphones. Assume that the manager wishes to compare 5 brands of smartphones and is interested in knowing which brand among the 5 is most preferred. A random sample of 500 respondents was taken, with the following results:

Smartphone Brand     Customer Preference
Samsung              95
OnePlus              110
Apple i-Phone        130
Oppo                 90
MI / Redmi           75
Total                500

Does customers' preference with respect to smartphone brand differ significantly?


Problem 1

• By looking at the observed data, it can be clearly said that the brand 'i-Phone' is most preferred by the consumers. But…
• Statistically, you have to find out whether this preference could have arisen due to chance. The appropriate test is the chi-square test of goodness of fit.
• So, the hypotheses formed for this test are as follows:
  ◦ Null Hypothesis: there is no significant difference in brand preference (this means all brands are equally preferred).
  ◦ Alternative Hypothesis: there is a significant difference in brand preference.

Chi-Square Test


• Please note that, under the null hypothesis of equal preference for all smartphones being true, the expected frequencies for all the brands will be equal to 100.

Brand            Observed        Expected        (O−E)    (O−E)²    (O−E)²/E
                 Frequency (O)   Frequency (E)
Samsung          95              100             −5       25        0.25
OnePlus          110             100             10       100       1
Apple i-Phone    130             100             30       900       9
Oppo             90              100             −10      100       1
MI / Redmi       75              100             −25      625       6.25
Total            500             500                                17.5


• So, the calculated value of Chi-Square is χ² = 17.5.
• We need to compare this value with the Critical Value (or table value) of Chi-Square. We need 2 things to look up the critical value:
  ◦ Degrees of Freedom = (n – 1) = (5 – 1) = 4
  ◦ Level of Significance = 5% or 0.05
• The Critical Value is 9.488.
• Here, the Critical Value is less than the Calculated Value. So, the Null Hypothesis is rejected. The inference is that not all smartphone brands are equally preferred by the customers.
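
The goodness-of-fit calculation for the smartphone problem can be reproduced with a short sketch. The function name `chi_square_gof` is an assumption made for this illustration; the observed frequencies are those used in the calculation table and the critical value 9.488 is the table value quoted above.

```python
def chi_square_gof(observed, expected=None):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E.

    If no expected frequencies are given, equal preference is assumed,
    i.e. every category gets total / number_of_categories.
    """
    if expected is None:
        total = sum(observed)
        expected = [total / len(observed)] * len(observed)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Samsung, OnePlus, Apple i-Phone, Oppo, MI / Redmi
observed = [95, 110, 130, 90, 75]

chi2 = chi_square_gof(observed)
degrees_of_freedom = len(observed) - 1
critical_value = 9.488  # table value for 4 degrees of freedom at 5%
reject_h0 = chi2 > critical_value

print(chi2, degrees_of_freedom, reject_h0)
```

Since the calculated value is larger than the table value, the sketch reaches the same decision as the worked solution.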


Problem 2: A Die

• A die is rolled 120 times. The frequencies obtained are given below. Test the hypothesis that the die is "fair."

Face         1    2    3    4    5    6
Frequency    13   28   16   10   32   21

• Hypotheses framed:
Null Hypothesis: the die is fair (all six faces are equally likely).
Alternate Hypothesis: the die is not fair.

• Level of Significance is 5% (0.05). Degrees of Freedom = (n – 1) = (6 – 1) = 5.


Die Number    Observed        Expected        (O−E)    (O−E)²    (O−E)²/E
              Frequency (O)   Frequency (E)
1             13              20              −7       49        2.45
2             28              20              8        64        3.2
3             16              20              −4       16        0.8
4             10              20              −10      100       5
5             32              20              12       144       7.2
6             21              20              1        1         0.05
Total         120             120                                18.7


So, the calculated value of Chi-Square is χ² = 18.7.
We need to compare this value with the Critical Value (or table value) of Chi-Square. We need 2 things to look up the critical value:

  ◦ Degrees of Freedom = (n – 1) = (6 – 1) = 5
  ◦ Level of Significance = 5% or 0.05

The Critical Value is 11.07.

Here, the Critical Value is less than the Calculated Value. So, the Null Hypothesis (that the die is fair) is rejected. The inference is that the die is not fair.


The Test of Independence


• If there are two categorical variables in a test, and we wish to examine whether these two variables are associated with each other, the chi-square test of independence is used.
• This test is very popular in analyzing cross-tabulations in which an investigator is keen to find out whether the two attributes of interest have any relationship with each other.
• The cross-tabulation is popularly called a "contingency table".
• It contains frequency data that correspond to the categorical variables in the rows and columns. The marginal totals of the rows and columns are used to calculate the expected frequencies that will be part of the computation of the statistic.
• For example:
  ◦ Is there any association between income and brand preference?
  ◦ Is there any association between room size and size of AC bought?
  ◦ Are the attributes educational background and type of job chosen independent?


Problem 1

• A snack manufacturing company surveyed a sample of 1000 respondents about their preferred flavour of potato chips. The results are shown in the contingency table:

Flavour Preference (Observed)
Respondents     Cheese Onion    Magic Masala    Tomato Fury    Row Total
Male            200             150             50             400
Female          250             300             50             600
Column Total    450             450             100            1000

• Is there a gender gap? Do the men's preferences differ significantly from the women's preferences? Use a 5% level of significance.


Solution
• Null Hypothesis: There is NO relationship between Gender & Flavour Preference.
• Alternate Hypothesis: There is a relationship between Gender & Flavour Preference.
• Level of Significance is 5% or 0.05.
• Degrees of Freedom = (r – 1) × (c – 1) = (2 – 1) × (3 – 1) = 1 × 2 = 2

Flavour Preference (Observed)
Respondents     Cheese Onion    Magic Masala    Tomatina    Row Total
Male            200             150             50          400
Female          250             300             50          600
Column Total    450             450             100         1000

Flavour Preference (Expected), where each expected frequency = (Row Total × Column Total) / Grand Total
Respondents     Cheese Onion        Magic Masala        Tomatina
Male            (400*450) / 1000    (400*450) / 1000    (400*100) / 1000
Female          (600*450) / 1000    (600*450) / 1000    (600*100) / 1000


Flavour Preference (Observed)
Respondents     Cheese Onion    Magic Masala    Tomatina    Row Total
Male            200             150             50          400
Female          250             300             50          600
Column Total    450             450             100         1000

Flavour Preference (Expected)
Respondents     Cheese Onion    Magic Masala    Tomatina    Row Total
Male            180             180             40          400
Female          270             270             60          600
Column Total    450             450             100         1000

Item     Observed        Expected        (O−E)    (O−E)²    (O−E)²/E
         Frequency (O)   Frequency (E)
1        200             180             20       400       2.2
2        250             270             −20      400       1.5
3        150             180             −30      900       5
4        300             270             30       900       3.3
5        50              40              10       100       2.5
6        50              60              −10      100       1.7
Total    1000            1000                               16.2


So, the calculated value of Chi-Square is χ² = 16.2.
We need to compare this value with the Critical Value (or table value) of Chi-Square. We need 2 things to look up the critical value:

  ◦ Degrees of Freedom = 2
  ◦ Level of Significance = 5% or 0.05

The Critical Value is 5.991.

Here, the Critical Value is less than the Calculated Value. So, the Null Hypothesis is rejected. The inference is that a significant relationship exists between gender & flavour preference.
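
The test of independence for the flavour-preference problem can be sketched as follows. The expected frequencies are derived from the marginal totals as (row total × column total) / grand total; the variable names are illustrative and the critical value 5.991 is the table value quoted above.

```python
# Observed contingency table: rows = Male, Female;
# columns = the three flavours in the order used on the slide
observed = [
    [200, 150, 50],
    [250, 300, 50],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency for each cell under independence
expected = [
    [r * c / grand_total for c in column_totals]
    for r in row_totals
]

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 5.991  # table value for 2 degrees of freedom at 5%
reject_h0 = chi2 > critical_value

print(expected)
print(round(chi2, 1), degrees_of_freedom, reject_h0)
```

The unrounded statistic is slightly above 16.2 because the worked table rounds each cell's contribution to one decimal place.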


Analysis of Variance - ANOVA


An ANOVA test is a way to find out if survey or experiment results are significant. In other words, it helps you to figure out whether you need to reject the null hypothesis and accept the alternate hypothesis. It is a type of parametric test, as it is based on data measured on an interval scale.

ANOVA
Basically, you're testing groups to see if there's a difference between them. A few examples are:
• A group of psychiatric patients are trying three different therapies: counseling, medication and biofeedback. You want to see if one therapy is better than the others.
• A manufacturer has two different processes to make light bulbs. They want to know if one process is better than the other.
• Students from different colleges take the same exam. You want to see if one college outperforms the other.

ANOVA is of 2 types, one-way or two-way, which refers to the number of independent variables (IVs) in your Analysis of Variance test.

One-way has one independent variable. For example: type of fertilizer used.

Two-way has two independent variables. For example: type of fertilizer and type of pesticide.


Essence of ANOVA

• "The essence of ANOVA is that the total amount of variation in a set of data is broken down into two types, that amount which can be attributed to chance and that amount which can be attributed to specified causes."
• There may be variation between samples and also within sample items.
• ANOVA consists in splitting the variance for analytical purposes. Hence, it is a method of analyzing the variance to which a response is subject into its various components corresponding to various sources of variation.
• Thus, through the ANOVA technique one can, in general, investigate any number of factors which are hypothesized or said to influence the dependent variable.
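
The splitting of total variation into "between samples" and "within samples" components can be sketched for a one-way layout. The three groups below are illustrative values chosen for easy arithmetic, not data taken from the slides, and the variable names are assumptions made for this sketch.

```python
from statistics import mean

# Illustrative data: three groups (samples) with four observations each
groups = [
    [6, 7, 3, 8],
    [5, 5, 3, 7],
    [5, 4, 3, 4],
]

all_values = [v for g in groups for v in g]
grand_mean = mean(all_values)
k = len(groups)        # number of groups
n = len(all_values)    # total number of observations

# Variation attributed to the specified cause (between samples)
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Variation attributed to chance (within samples)
ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

# Total variation equals the sum of the two components
ss_total = sum((v - grand_mean) ** 2 for v in all_values)

ms_between = ss_between / (k - 1)   # mean square between, k - 1 d.f.
ms_within = ss_within / (n - k)     # mean square within, n - k d.f.
f_ratio = ms_between / ms_within

print(ss_between, ss_within, ss_total, round(f_ratio, 2))
```

The F-ratio is then compared with the table value of F for (k – 1, n – k) degrees of freedom at the chosen level of significance.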


ANOVA Technique


Content Obtained from the book – ‘Research Methodology’ by C. R. Kothari


Problem 1 – One-Way ANOVA

• Set up an analysis of variance table for the following per acre production data for three varieties of wheat, each grown on 4 plots, and state if the variety differences are significant.

Illustration 1 – Page 262 – Research Methodology (C. R. Kothari)


Two-Way ANOVA

• The two-way ANOVA technique is used when the data are classified on the basis of two factors.
• For example, the agricultural output may be classified on the basis of different varieties of seeds and also on the basis of different varieties of fertilizers used.
• The ANOVA technique is a little different in the case of repeated measurements, where we also compute the interaction variation.


Steps in Two-Way ANOVA


For the F-value 5.14, the critical value is 5.79, so the null hypothesis is accepted. For the F-value 4.76, the critical value is 4.76. Here, the calculated value is equal to the critical value and therefore the null hypothesis is rejected, so the alternate hypothesis is accepted.


Multivariate Analysis
Factor Analysis
Linear Regression
Discriminant Analysis
Cluster Analysis


Factor Analysis

• Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.
• In other words, factor analysis is a technique that is used to reduce a large number of variables into a smaller number of factors.
• It is also known as a 'summarization' or 'data reduction' technique.
• This technique extracts the maximum common variance from all variables and puts it into a common score.


For Example:
• Consider various variables that influence the buying decision of a product. It is possible that variations in six observed variables mainly reflect the variations in two unobserved (underlying) variables.

Observed variables: Price, Quality, Quantity, Price Discount, Packaging, Brand
• Tangible Factor: Quality, Quantity, Price, Packaging
• Intangible Factor: Price Discount, Brand


Concept of FA…

• It is a technique applicable when there is a systematic interdependence among a set of observed or manifest variables and the researcher is interested in finding out something more fundamental or latent which creates this commonality.
• For instance, we might have data, say, about an individual's income, education, occupation and dwelling area and want to infer from these some factor (such as social class) which summarizes the commonality of all four variables.
• Factor analysis, thus, seeks to resolve a large set of measured variables in terms of relatively few categories, known as factors, that explain the inter-relationships among those variables.
• This technique allows the researcher to group variables into factors (based on correlation between variables), and the factors so derived may be treated as new variables (often termed latent variables).


How can FA help?

• Factor analysis is useful in: 1] condensing variables, and 2] uncovering clusters of responses.
• Say you ask several questions all driving at different, but closely related, aspects of customer satisfaction:


Statistics associated with FA


• Factor Loadings: the Pearson correlation coefficients of an original variable or item with a given identified factor or domain within the metric. Factor loadings can be used as a means of item reduction (multiple items capturing the same variance, or a low amount of variance, can be identified and removed) and of grouping items into construct subscales or domains by their factor loadings.
• Bartlett's Test of Sphericity: a test statistic used to examine the hypothesis that the variables are uncorrelated in the population. In other words, the population correlation matrix is an identity matrix; each variable correlates perfectly with itself (r = 1) but has no correlation with the other variables (r = 0).
• Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy: an index used to examine the appropriateness of factor analysis. High values (between 0.5 and 1.0) indicate that factor analysis is appropriate. Values below 0.5 imply that factor analysis may not be appropriate.


Contd…

• Factor: A factor is an underlying dimension that accounts for several observed variables. There can be one or more factors.
• Communality (h²): Communality, symbolized as h², shows how much of each variable is accounted for by the underlying factors taken together. A high value of communality means that not much of the variable is left over after whatever the factors represent is taken into consideration.
• Eigen value (or latent root): When we take the sum of squared values of factor loadings relating to a factor, then such a sum is referred to as the Eigen value or latent root. The Eigen value indicates the relative importance of each factor in accounting for the particular set of variables being analyzed.
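
The definitions of communality and Eigen value can be illustrated on a small factor-loading matrix. The loadings below are hypothetical (four variables, two factors); they are not the output of an actual factor analysis, and the variable names are illustrative.

```python
# Hypothetical factor loadings: one row per variable, one column per factor
loadings = [
    [0.8, 0.1],
    [0.7, 0.2],
    [0.2, 0.9],
    [0.1, 0.6],
]

# Communality (h^2) of a variable: sum of its squared loadings across factors
communalities = [sum(l ** 2 for l in row) for row in loadings]

# Eigen value of a factor: sum of squared loadings down that factor's column
eigen_values = [sum(l ** 2 for l in column) for column in zip(*loadings)]

for i, h2 in enumerate(communalities, start=1):
    print(f"Variable {i}: communality = {h2:.2f}")
for j, ev in enumerate(eigen_values, start=1):
    print(f"Factor {j}: Eigen value = {ev:.2f}")
```

Squared loadings are summed across a row for communality and down a column for the Eigen value, so both quantities come from the same matrix read in two directions.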


Contd…

• Rotation: Rotation, in the context of factor analysis, is something like staining a microscope slide. Just as different stains on it reveal different structures in the tissue, different rotations reveal different structures in the data. Though different rotations give results that appear to be entirely different, from a statistical point of view all results are taken as equal.
• If the factors are independent, an orthogonal rotation is done, and if the factors are correlated, an oblique rotation is made.
• In Principal Components Analysis, the total variance in the data is considered. Principal components analysis is recommended when the primary concern is to determine the minimum number of factors that will account for maximum variance in the data for use in subsequent multivariate analysis. The factors are called principal components.


Multiple Regression Analysis

Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables. In some situations it can also be used to infer causal relationships between the independent and dependent variables.
The most common form of regression analysis is linear regression, in which a researcher finds the line (or a more complex linear combination) that most closely fits the data according to a specific mathematical criterion.


While there are many types of regression analysis, at their core they all examine the influence of one
or more independent variables on a dependent variable.

Simple linear regression is a model that assesses the relationship between a dependent
variable and an independent variable. The simple linear model is expressed using the
following equation:
Y = a + bX + e
Where:
Y – Dependent variable
X – Independent (explanatory) variable
a – Intercept
b – Slope
e – Residual (error)

Multiple linear regression analysis is essentially similar to the simple linear model,
with the exception that multiple independent variables are used in the model. The
mathematical representation of multiple linear regression is:
Y = a + bX1 + cX2 + dX3 + ϵ
Where:
Y – Dependent variable
X1, X2, X3 – Independent (explanatory) variables
a – Intercept
b, c, d – Slopes
ϵ – Residual (error)
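
A minimal sketch of how the multiple linear model could be fitted by ordinary least squares with NumPy is given below. The data values are hypothetical, and least squares is assumed as the fitting criterion; the names a, b, c and d follow the equation above.

```python
import numpy as np

# Hypothetical observations of three independent variables (columns X1, X2, X3).
X = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 1.0, 1.0],
    [3.0, 4.0, 0.0],
    [4.0, 3.0, 2.0],
    [5.0, 6.0, 1.0],
    [6.0, 5.0, 3.0],
])
# Hypothetical observed values of the dependent variable Y.
Y = np.array([4.1, 4.4, 9.2, 8.3, 13.1, 12.4])

# Prepend a column of ones so that the intercept a is estimated with the slopes.
A = np.column_stack([np.ones(len(Y)), X])

# Least-squares solution of Y = a + b*X1 + c*X2 + d*X3 + error.
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b, c, d = coef

# Fitted values and residuals (the estimated error term for each observation).
fitted = A @ coef
residuals = Y - fitted

print("intercept a:", round(float(a), 3))
print("slopes b, c, d:", np.round([b, c, d], 3))
print("residuals:", np.round(residuals, 3))
```

Because the model includes an intercept, the least-squares residuals sum to zero and are uncorrelated with each independent variable in the sample.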


Examples
 For example, you could use multiple regression to understand whether exam
performance can be predicted based on revision time, test anxiety, lecture
attendance and gender.
 Alternatively, you could use multiple regression to understand whether daily cigarette
consumption can be predicted based on smoking duration, age when started
smoking, smoker type, income and gender.
 Multiple regression also allows you to determine the overall fit (variance explained)
of the model and the relative contribution of each of the predictors to the total
variance explained.
 For example, you might want to know how much of the variation in exam
performance can be explained by revision time, test anxiety, lecture attendance and
gender "as a whole", but also the "relative contribution" of each independent
variable in explaining the variance.


Statistics Associated with Regression Analysis


 Coefficient of multiple determination (R²): The strength of association in multiple
regression is measured by the square of the multiple correlation coefficient, R², which is
also called the coefficient of multiple determination.
 Adjusted R²: R², the coefficient of multiple determination, is adjusted for the number of
independent variables and the sample size to account for diminishing returns. After
the first few variables, additional independent variables do not make much
contribution.
 F-test: The F-test is used to test the null hypothesis that the coefficient of multiple
determination in the population, R²pop, is zero. This is equivalent to testing the null
hypothesis that all partial regression coefficients are zero (β1 = β2 = … = βk = 0).
The test statistic has an F distribution with k and (n - k - 1) degrees of freedom.
 Partial regression coefficient: The partial regression coefficient, b1, denotes the change
in the predicted value of Y per unit change in X1 when the other independent variables,
X2 to Xk, are held constant.
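
The statistics listed above can be computed directly from a fitted model. The sketch below is one way to do it, assuming an ordinary least squares fit with an intercept; the function name regression_fit_stats and the exam data are hypothetical and used only for illustration.

```python
import numpy as np

def regression_fit_stats(X, y):
    """Fit y on X by least squares (with intercept) and return fit statistics."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)          # allow a single predictor
    n, k = X.shape                    # n observations, k independent variables

    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    residuals = y - A @ coef
    sse = float(residuals @ residuals)              # unexplained variation
    sst = float(((y - y.mean()) ** 2).sum())        # total variation

    r2 = 1.0 - sse / sst                                    # R²
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)       # adjusted R²
    f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))          # F with (k, n-k-1) df
    return {"coef": coef, "r2": r2, "adj_r2": adj_r2,
            "f": f_stat, "df": (k, n - k - 1)}

# Hypothetical data: revision hours and lecture attendance (%) for 8 students.
X = np.array([[2, 60], [4, 70], [5, 65], [7, 80],
              [8, 85], [10, 75], [11, 90], [13, 95]], dtype=float)
exam_score = np.array([48, 55, 58, 66, 71, 72, 80, 88], dtype=float)

stats = regression_fit_stats(X, exam_score)
print("R2:", round(stats["r2"], 3), "adjusted R2:", round(stats["adj_r2"], 3))
print("F:", round(stats["f"], 2), "with df", stats["df"])
# coef[0] is the intercept; coef[1] is the partial regression coefficient b1.
print("b1 (revision hours, attendance held constant):", round(float(stats["coef"][1]), 3))
```

The F value would then be compared with the critical value of the F distribution for the reported degrees of freedom, which this sketch does not look up.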


Discriminant Analysis
Discriminant analysis is a statistical method that helps researchers understand the
relationship between a "dependent variable" and one or more "independent variables."
Discriminant analysis is similar to regression analysis and analysis of variance (ANOVA). The principal
difference between discriminant analysis and the other two methods is the nature of the
dependent variable.
In regression analysis and ANOVA, the dependent variable must be a "continuous variable," while in
discriminant analysis, the dependent variable must be a "categorical variable."
The objective of discriminant analysis is to enable the researcher to examine whether significant
differences exist among the groups in terms of the predictor variables.


Assumptions for Using Discriminant Analysis

 The dependent variable should be categorical (non-metric), i.e., it should be measured
on a nominal scale (dichotomous or multichotomous).
 The independent (predictor) variables should be continuous in nature, i.e., they
should be measured on an interval or ratio scale.
 Discriminant analysis can be used to identify the characteristics on the basis of which
an individual can be classified.
 For example:
Categorization of an employee as a fast or slow performer, based on skill scores.
Or categorization of a product into a superior, average or poor category, based on
the price and quality of the product.


Objectives of Discriminant Analysis

 Development of discriminant functions, or linear combinations of the predictor or
independent variables, which will best discriminate between the categories of the
criterion or dependent variable (groups).
 Examination of whether significant differences exist among the groups, in terms of
the predictor variables.
 Determination of which predictor variables contribute the most to the intergroup
differences.
 Classification of cases to one of the groups based on the values of the predictor
variables.
 Evaluation of the accuracy of classification.
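
The objectives above can be sketched for the two-group case using Fisher's linear discriminant function, which is assumed here as the method. The employee skill scores are hypothetical, and the cutoff assumes equal group sizes, priors and misclassification costs.

```python
import numpy as np

# Hypothetical skill scores (two predictors) for two groups of employees.
fast = np.array([[8.0, 7.0], [9.0, 6.0], [7.0, 8.0], [8.0, 9.0]])
slow = np.array([[4.0, 3.0], [5.0, 4.0], [3.0, 5.0], [4.0, 2.0]])
n_fast, n_slow = len(fast), len(slow)

mean_fast = fast.mean(axis=0)
mean_slow = slow.mean(axis=0)

# Pooled within-group covariance matrix of the predictors.
pooled_cov = ((n_fast - 1) * np.cov(fast, rowvar=False)
              + (n_slow - 1) * np.cov(slow, rowvar=False)) / (n_fast + n_slow - 2)

# Discriminant weights: the linear combination that best separates the groups.
weights = np.linalg.solve(pooled_cov, mean_fast - mean_slow)

# Cutoff score midway between the two projected group means.
cutoff = weights @ (mean_fast + mean_slow) / 2

def classify(scores):
    """Assign a case to a group from its discriminant score."""
    return "fast" if weights @ np.asarray(scores, dtype=float) > cutoff else "slow"

# Accuracy of classification (hit ratio) on the same sample used for estimation.
actual = ["fast"] * n_fast + ["slow"] * n_slow
predicted = [classify(case) for case in np.vstack([fast, slow])]
hit_ratio = sum(p == a for p, a in zip(predicted, actual)) / len(actual)

print("discriminant weights:", np.round(weights, 3))
print("hit ratio:", hit_ratio)
print("new employee with scores (6, 7):", classify([6.0, 7.0]))
```

A hit ratio computed on the estimation sample tends to be optimistic, so in practice a separate holdout sample is commonly used to evaluate classification accuracy.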


Cluster Analysis
Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense) to
each other than to those in other groups (clusters).


What is Cluster Analysis???

 Cluster: a collection of data objects
    Similar to one another within the same cluster
    Dissimilar to the objects in other clusters
 Cluster analysis – Grouping a set of data objects into clusters
 Clustering is unsupervised classification: no predefined classes
 Typical applications
    As a stand-alone tool to get insight into data distribution
    As a preprocessing step for other algorithms
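
To make "unsupervised classification" concrete, the sketch below groups unlabeled points with k-means. The slides do not name a specific algorithm, so k-means is an assumption chosen for its simplicity; the data points, the function name k_means and the choice of the first k points as starting centroids are also illustrative.

```python
import numpy as np

def k_means(points, k, n_iter=20):
    """Group points into k clusters; no class labels are supplied in advance."""
    points = np.asarray(points, dtype=float)
    # Assumption: start from the first k points (simple and deterministic,
    # not a recommended initialisation for real data).
    centroids = points[:k].copy()
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if it has none).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical unlabeled observations on two variables.
data = [[1.0, 1.0], [8.0, 8.0], [1.5, 2.0], [9.0, 9.0], [1.0, 0.5], [8.5, 9.5]]
labels, centroids = k_means(data, k=2)

print("cluster labels:", labels.tolist())
print("cluster centroids:", np.round(centroids, 2).tolist())
```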


Examples of clustering application

 Marketing: Help marketers discover distinct groups in their customer bases, and
then use this knowledge to develop targeted marketing programs
 Land use: Identification of areas of similar land use in an earth observation
database
 Insurance: Identifying groups of motor insurance policy holders with a high
average claim cost
 City planning: Identifying groups of houses according to their house type, value,
and geographical location
 Earthquake studies: Observed earthquake epicenters should be clustered along
continental faults


What is good clustering???

 A good clustering method will produce high quality clusters with
    high intra-class similarity
    low inter-class similarity
 The quality of a clustering result depends on both the similarity measure used by the
method and its implementation.
 The quality of a clustering method is also measured by its ability to discover some or
all of the hidden patterns.
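
One simple way to check the two quality criteria is to compare average distances within clusters with average distances between clusters. The sketch below assumes Euclidean distance as the (dis)similarity measure and uses hypothetical cluster memberships.

```python
from itertools import combinations
from math import dist

# Hypothetical clustering result: cluster name -> member points.
clusters = {
    "cluster_1": [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5)],
    "cluster_2": [(8.0, 8.0), (9.0, 9.0), (8.5, 9.5)],
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Average distance between pairs of objects in the SAME cluster
# (small distance = high intra-class similarity).
within = mean(
    dist(p, q)
    for members in clusters.values()
    for p, q in combinations(members, 2)
)

# Average distance between pairs of objects in DIFFERENT clusters
# (large distance = low inter-class similarity).
between = mean(
    dist(p, q)
    for group_a, group_b in combinations(clusters.values(), 2)
    for p in group_a
    for q in group_b
)

print("average within-cluster distance:", round(within, 3))
print("average between-cluster distance:", round(between, 3))
print("separation ratio (between / within):", round(between / within, 2))
```

A ratio well above 1 suggests compact, well-separated clusters under the chosen distance measure; a different measure can rank the same clustering differently.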


Ideal Condition
Clustering Respondents
Variable         A     B     C     D     E     F     G
Store Loyalty    3     4     4     2     6     7     6
Brand Loyalty    2     5     7     7     6     7     4

Correlation Matrix
       A     B     C     D     E     F     G
A    1.00  0.71  0.20  0.63  0.11  0.16  0.77
B          1.00  0.05  0.09  0.58  0.10  0.08
C                1.00  0.67  0.20  0.69  0.78
D                      1.00  0.02  0.50  0.61
E                            1.00  0.05  0.16
F                                  1.00  0.21
G                                        1.00
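
As an alternative view of the same seven respondents, the sketch below computes pairwise Euclidean distances from the store and brand loyalty ratings in the table. Euclidean distance is an assumption here (the slide reports correlations as its similarity measure), and the sketch only finds the closest pair, which is where an agglomerative clustering procedure would typically begin.

```python
from itertools import combinations
from math import dist

# Ratings from the table above: respondent -> (store loyalty, brand loyalty).
ratings = {
    "A": (3, 2), "B": (4, 5), "C": (4, 7), "D": (2, 7),
    "E": (6, 6), "F": (7, 7), "G": (6, 4),
}

# Euclidean distance for every pair of respondents (smaller = more similar).
distances = {
    (p, q): dist(ratings[p], ratings[q])
    for p, q in combinations(ratings, 2)
}

# The most similar pair is the natural first candidate to merge into a cluster.
closest_pair = min(distances, key=distances.get)

for pair, value in sorted(distances.items(), key=lambda item: item[1])[:5]:
    print(pair, round(value, 3))
print("closest pair:", closest_pair)
```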


Thank You!
