RM - Unit 4
MBA@RCET
Data Analysis is the process of evaluating data using analytical and logical
reasoning to examine each component of the data provided.
It is the process of systematically applying statistical and/or logical techniques to
describe and illustrate, condense and recap, and evaluate data.
Data analysis is the foundation of scientific research. Conducting a complete
analysis of the data will enable the researcher to:
Relational Statistics
• Univariate
• Bivariate
• Multivariate
Median
The Median is the value separating the higher half of a data sample, a
population, or a probability distribution, from the lower half. In simple terms,
it may be thought of as the "middle" value of a data set.
For example, in the data set {1, 3, 3, 6, 7, 8, 9}, the median is 6, the fourth
number in the sample.
Mode
The mode is the value that appears most often in a set of data. The mode of a
discrete probability distribution is the value x at which its probability mass
function takes its maximum value.
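The two definitions can be checked on the data set from the median example. This is a minimal sketch using Python's standard `statistics` module; the variable names are chosen here for illustration only.

```python
from statistics import median, mode

# Data set taken from the median example in the text.
data = [1, 3, 3, 6, 7, 8, 9]

middle_value = median(data)    # middle of the sorted values
most_frequent = mode(data)     # value that occurs most often

print("Median:", middle_value)   # prints "Median: 6"
print("Mode:", most_frequent)    # prints "Mode: 3"
```

With seven values, the median is the fourth value after sorting; the value 3 is the mode because it is the only value that occurs twice.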
In statistics, the standard deviation (SD, also represented by the Greek letter
sigma σ) is a measure that is used to quantify the amount of variation
or dispersion of a set of data values.
A low σ indicates that the data points tend to be close to the mean (also called
the expected value) of the set, while a high σ indicates that the data points are
spread out over a wider range of values.
The standard deviation of a random variable, statistical population, data set,
or probability distribution is the square root of its variance.
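As a small worked illustration of "the square root of its variance", the sketch below computes the population variance and σ step by step. The data values are hypothetical and chosen only to give round numbers; a sample standard deviation would divide by n – 1 instead of n.

```python
from math import sqrt
from statistics import mean

# Hypothetical data values, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(values)  # the mean (expected value) of the set

# Population variance: average squared deviation from the mean.
variance = sum((x - mu) ** 2 for x in values) / len(values)

# Standard deviation: square root of the variance.
sigma = sqrt(variance)

print("Mean:", mu)
print("Variance:", variance)
print("Standard deviation:", sigma)
```

Here the mean is 5, the variance is 4 and σ is 2, so most values lie within a couple of units of the mean.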
Univariate analysis is the simplest form of analyzing data. “Uni” means “one”,
so, in other words, your data has only one variable. It doesn't deal with causes or
relationships (unlike regression) and its major purpose is to describe; it takes
data, summarizes that data and finds patterns in the data.
Example: a study of the average height of students in a class.
Multivariate Analysis
• Dependence Technique
• Interdependence Technique
Parametric Test
Non-Parametric Test
These statistical tests allow researchers to make inferences because they can show
whether an observed pattern is due to intervention or chance.
There is a wide range of statistical tests.
The decision of which statistical test to use depends on the research design, the
distribution of the data, and the type of variable.
In general, if the data is normally distributed, parametric tests should be used. If
the data is non-normal, non-parametric tests should be used.
Testing of Hypothesis
Ordinarily, when one talks about a hypothesis, one simply means a mere
assumption or some supposition to be proved or disproved. But for a
researcher, a hypothesis is a formal question that he intends to resolve.
Thus, a hypothesis may be defined as a proposition or a set of propositions set
forth as an explanation for the occurrence of some specified group of
phenomena.
Type I Error: When we reject H0 (Null Hypothesis) when it is TRUE.
Type II Error: When we accept H0 (Null Hypothesis) when it is FALSE.
Hypothesis Testing through Statistical Tests
Parametric Test:
z-test/t-test
ANOVA (F-test)
Non-Parametric Test:
Chi-Square Test
z-Test or t-test
A z-test is a statistical test to determine whether two population means are different
when the variances are known and the sample size is large.
It can be used to test hypotheses in which the test statistic follows a normal distribution.
A z-statistic, or z-score, is a number representing the result from the z-test.
Z-tests are closely related to t-tests, but t-tests are best performed when an
experiment has a small sample size.
Also, t-tests assume the standard deviation is unknown, while z-tests assume it is
known.
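A two-sample z-test of the kind described above can be sketched as follows. The function name, the sample means, the known standard deviations and the sample sizes are all hypothetical values introduced for illustration; only the standard library is used.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_test(mean1, mean2, sigma1, sigma2, n1, n2):
    """Return (z, two-sided p-value) for H0: the two population means are equal.

    Assumes the population standard deviations sigma1 and sigma2 are known.
    """
    standard_error = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    z = (mean1 - mean2) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical inputs: two large samples with known standard deviations.
z, p_value = two_sample_z_test(mean1=52.0, mean2=50.0,
                               sigma1=5.0, sigma2=5.0, n1=100, n2=100)
reject_null = p_value < 0.05  # 5% level of significance

print("z =", round(z, 3), "p =", round(p_value, 4), "reject H0:", reject_null)
```

When the standard deviations are unknown and the samples are small, the same ratio would instead be compared against a t distribution.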
The paired t-test is a way of comparing two related samples with small values of n.
It does not require the variances of the two populations to be equal, but the
assumption that the two populations are normal must continue to apply.
For a paired t-test, it is necessary that the observations in the two samples be collected
in the form of what is called matched pairs i.e., “each observation in the one sample must
be paired with an observation in the other sample in such a manner that these observations
are somehow “matched” or related, in an attempt to eliminate extraneous factors which are
not of interest in test.”
To apply this test, we first work out the difference score for each matched pair, and then
find out the average of such differences, D̄, along with the sample variance of the
difference score. If the values from the two matched samples are denoted as Xi and Yi
and the differences by Di (Di = Xi – Yi), then the mean of the differences is D̄ = ΣDi / n.
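The steps described above (difference scores, their mean, their sample variance, then the test statistic) can be sketched as below. The matched pairs are hypothetical, and the statistic is the usual paired form t = D̄ / (s_D / √n) with n – 1 degrees of freedom, which is assumed here because the slide's own formula is not reproduced.

```python
from math import sqrt

# Hypothetical matched pairs (e.g. scores before and after a treatment).
x = [12, 15, 11, 14, 13]
y = [10, 14, 10, 11, 12]

# Difference score for each matched pair: Di = Xi - Yi.
differences = [xi - yi for xi, yi in zip(x, y)]
n = len(differences)

# Mean of the differences.
mean_diff = sum(differences) / n

# Sample variance of the difference scores (divides by n - 1).
var_diff = sum((d - mean_diff) ** 2 for d in differences) / (n - 1)

# Paired t statistic and its degrees of freedom.
t_statistic = mean_diff / sqrt(var_diff / n)
degrees_of_freedom = n - 1

print("t =", round(t_statistic, 3), "with", degrees_of_freedom, "degrees of freedom")
```

The calculated t is then compared with the tabulated t value for n – 1 degrees of freedom at the chosen level of significance.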
For these tests, Degrees of Freedom are utilized to determine if a certain null
hypothesis can be rejected based on the total number of variables and samples
within the experiment. Degrees of Freedom refers to the maximum number of
logically independent values, which are values that have the freedom to vary, in the
data sample.
For example:
Is the height of each student in the class equal to the mean height of the class?
Are all brands of smartphones equally preferred?
Do all customers prefer the same color of product packaging?
Problem 1
By looking at observed data, it can be clearly said that the brand ‘i-Phone’ is
most preferred by the consumer. But…
Statistically, you have to find out whether this preference could have arisen due
to chance. The appropriate test statistic is the test of goodness of fit.
So, the hypothesis formed for this test is as follows:
Chi-Square Test
Please note that, under the null hypothesis of equal preference for all
smartphones being true, the expected frequencies for all the brands will be
equal to 100.
Brand   Observed Frequency (O)   Expected Frequency (E)   (O-E)   (O-E)²   (O-E)²/E
Degree of Freedom = (n – 1) = (5 – 1) = 4
Level of Significance = 5% or 0.05
Problem 2: A Dice
A dice is tossed 120 times. The frequencies obtained are given below. Test the
hypothesis that the dice is “fair.”
Dice Number   1    2    3    4    5    6
Frequency     13   28   16   10   32   21
Hypothesis framed:
Null Hypothesis: The dice is fair.
Alternate Hypothesis: The dice is not fair.
Dice Number   Observed Frequency (O)   Expected Frequency (E)   (O-E)   (O-E)²   (O-E)²/E
1             13                       20                       -7      49       2.45
2             28                       20                       8       64       3.2
3             16                       20                       -4      16       0.8
4             10                       20                       -10     100      5.0
5             32                       20                       12      144      7.2
6             21                       20                       1       1        0.05
Degree of Freedom = (n – 1) = (6 – 1) = 5
Level of Significance = 5% or 0.05
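The goodness-of-fit calculation for the dice data can be sketched as follows. The observed frequencies come from the problem statement; the critical value 11.07 is the standard chi-square table value for 5 degrees of freedom at the 5% level, and the variable names are chosen here for illustration.

```python
# Observed frequencies for faces 1..6, from the problem statement.
observed = [13, 28, 16, 10, 32, 21]

# Under the null hypothesis of a fair dice, every face is equally likely.
total_tosses = sum(observed)
expected = total_tosses / len(observed)

# Chi-square statistic: sum of (O - E)^2 / E over all faces.
chi_square = sum((o - expected) ** 2 / expected for o in observed)

degrees_of_freedom = len(observed) - 1
critical_value = 11.07  # chi-square table, 5 degrees of freedom, 5% level

reject_null = chi_square > critical_value
print("Chi-square:", round(chi_square, 2))
print("Reject H0 (dice is fair):", reject_null)
```

Since the calculated value of about 18.7 exceeds the critical value, the null hypothesis of a fair dice would be rejected at the 5% level.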
Problem 1
Is there a gender gap? Do the men's preferences differ significantly from the
women's preferences? Use a 5% level of significance.
Solution
Null Hypothesis: There is NO relationship between Gender & Flavour Preference.
Alternate Hypothesis: There is a relationship between Gender & Flavour Preference.
Level of Significance is 5% or 0.05.
Degree of Freedom = (r – 1)x(c – 1) = (2 – 1)x(3 – 1) = 1x2 = 2
Column Total   450   450   100   1000
Items   Observed Frequency (O)   Expected Frequency (E)   (O-E)   (O-E)²   (O-E)²/E
Degree of Freedom = 2
Level of Significance = 5% or 0.05
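A test of independence on a 2 x 3 table can be sketched as below. The individual cell counts are hypothetical: they are assumed here only so that the column totals (450, 450, 100) and the grand total (1000) match the figures in the text. The critical value 5.991 is the standard chi-square table value for 2 degrees of freedom at the 5% level.

```python
# Hypothetical observed counts (rows: gender, columns: three flavours).
# The cell values are assumptions; only the column totals and the
# grand total appear in the text.
observed = [
    [200, 150, 50],   # men
    [250, 300, 50],   # women
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # Expected frequency = (row total x column total) / grand total.
        e = row_totals[i] * col_totals[j] / grand_total
        chi_square += (o - e) ** 2 / e

# Degrees of freedom = (r - 1) x (c - 1).
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 5.991  # chi-square table, 2 degrees of freedom, 5% level

reject_null = chi_square > critical_value
print("Chi-square:", round(chi_square, 2), "Reject H0:", reject_null)
```

With these assumed counts the calculated value is well above the critical value, so the null hypothesis of no relationship would be rejected; different cell counts would of course give a different result.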
ANOVA
Basically, you’re testing groups to see if there’s a difference between them. A few
examples are:
• A group of psychiatric patients are trying three different therapies:
counseling, medication and biofeedback. You want to see if one therapy is better
than the others.
• A manufacturer has two different processes to make light bulbs. They want to
know if one process is better than the other.
• Students from different colleges take the same exam. You want to see if one
college outperforms the other.
ANOVA is of 2 types, one-way or two-way, which refers to the number of independent
variables (IVs) in your Analysis of Variance test.
• One-way has one independent variable. For example: type of fertilizer used.
• Two-way has two independent variables. For example: type of fertilizer and
type of pesticide.
Essence of ANOVA
“The essence of ANOVA is that the total amount of variation in a set of data is
broken down into two types, that amount which can be attributed to chance and
that amount which can be attributed to specified causes.”
There may be variation between samples and also within sample items.
ANOVA consists in splitting the variance for analytical purposes. Hence, it is a
method of analyzing the variance to which a response is subject into its various
components corresponding to various sources of variation.
Thus, through ANOVA technique one can, in general, investigate any number of
factors which are hypothesized or said to influence the dependent variable.
ANOVA Technique
Set up an analysis of variance table for the following per acre production data
for three varieties of wheat, each grown on 4 plots and state if the variety
differences are significant.
Solution
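The data table for this problem is not reproduced here, so the sketch below uses hypothetical per-acre yields for three varieties on four plots each to show how the between-sample and within-sample variation are separated. The critical value 4.26 is the standard F table value for (2, 9) degrees of freedom at the 5% level.

```python
# Hypothetical per-acre yields: three wheat varieties, four plots each.
yields = {
    "A": [6, 7, 3, 8],
    "B": [5, 5, 3, 7],
    "C": [5, 4, 3, 4],
}

all_values = [v for plot_values in yields.values() for v in plot_values]
grand_mean = sum(all_values) / len(all_values)

ss_between = 0.0  # variation between samples (varieties)
ss_within = 0.0   # variation within samples (plots of the same variety)
for values in yields.values():
    group_mean = sum(values) / len(values)
    ss_between += len(values) * (group_mean - grand_mean) ** 2
    ss_within += sum((v - group_mean) ** 2 for v in values)

df_between = len(yields) - 1
df_within = len(all_values) - len(yields)

# F ratio = mean square between / mean square within.
f_ratio = (ss_between / df_between) / (ss_within / df_within)
critical_value = 4.26  # F table, (2, 9) degrees of freedom, 5% level

print("SS between:", ss_between, "SS within:", ss_within)
print("F ratio:", round(f_ratio, 2), "Significant:", f_ratio > critical_value)
```

For these assumed figures the F ratio is 1.5, below the critical value, so the variety differences would not be judged significant.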
Two-Way ANOVA
Two-way ANOVA technique is used when the data are classified on the basis of
two factors.
For example, the agricultural output may be classified on the basis of different
varieties of seeds and also on the basis of different varieties of fertilizers used.
The ANOVA technique is a little different in the case of repeated measurements, where
we also compute the interaction variation.
For the F-value of 5.14, the critical value is 5.79, so the null hypothesis is accepted.
For the F-value of 4.76, the critical value is also 4.76. Here, the calculated value is
equal to the critical value and therefore the null hypothesis is rejected, so the
alternate hypothesis is accepted.
Multivariate Analysis
Factor Analysis
Linear Regression
Discriminant Analysis
Cluster Analysis
Factor Analysis
For Example:
Consider various variables which have an influence on the buying decision for a
product. It is possible that variations in six observed variables mainly reflect the
variations in two unobserved (underlying) variables.
[Figure: six observed variables (Price, Quality, Quantity, Discount, Packaging, Brand) grouped under two underlying factors, a Tangible Factor and an Intangible Factor]
Concept of FA…
Factor analysis is useful in: 1] condensing variables, and 2] uncovering clusters of
responses.
Say you ask several questions all driving at different, but closely related, aspects of
customer satisfaction:
Factor: A factor is an underlying dimension that accounts for several observed variables.
There can be one or more factors.
Communality (h²): Communality, symbolized as h², shows how much of each variable is
accounted for by the underlying factors taken together. A high value of communality
means that not much of the variable is left over after whatever the factors represent is
taken into consideration.
Eigen value (or latent root): When we take the sum of squared values of factor
loadings relating to a factor, then such sum is referred to as the Eigen value or latent root.
The Eigen value indicates the relative importance of each factor in accounting for the
particular set of variables being analyzed.
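The two sums defined above can be sketched on a small table of factor loadings. The loadings are hypothetical values made up for illustration (three variables, two factors); communality is summed across a variable's row and the Eigen value down a factor's column.

```python
# Hypothetical factor loadings: (loading on Factor I, loading on Factor II).
loadings = {
    "Price":   (0.8, 0.1),
    "Quality": (0.7, 0.2),
    "Brand":   (0.1, 0.9),
}

# Communality (h2): sum of squared loadings across factors, per variable.
communality = {
    name: sum(value ** 2 for value in row)
    for name, row in loadings.items()
}

# Eigen value (latent root): sum of squared loadings down each factor.
n_factors = 2
eigen_values = [
    sum(row[k] ** 2 for row in loadings.values())
    for k in range(n_factors)
]

for name, h2 in communality.items():
    print(name, "communality:", round(h2, 2))
print("Eigen values:", [round(ev, 2) for ev in eigen_values])
```

In this made-up table, Factor I has the larger Eigen value, so it accounts for more of the variation in the three variables than Factor II does.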
Multiple Regression Analysis
Regression Analysis is a set of statistical processes for estimating the relationships
between a dependent variable and one or more independent variables. In some situations,
it can be used to infer causal relationships between the independent and dependent variables.
The most common form of regression analysis is linear regression, in which a researcher
finds the line (or a more complex linear combination) that most closely fits the data
according to a specific mathematical criterion.
While there are many types of regression analysis, at their core they all examine the influence of one
or more independent variables on a dependent variable.
Simple linear regression is a model that assesses the relationship between a dependent
variable and an independent variable. The simple linear model is expressed using the
following equation:
Y = a + bX + e
Where:
Y – Dependent variable
X – Independent (explanatory) variable
a – Intercept
b – Slope
e – Residual (error)
Multiple linear regression analysis is essentially similar to the simple linear model,
with the exception that multiple independent variables are used in the model. The
mathematical representation of multiple linear regression is:
Y = a + bX1 + cX2 + dX3 + ϵ
Where:
Y – Dependent variable
X1, X2, X3 – Independent (explanatory) variables
a – Intercept
b, c, d – Slopes
ϵ – Residual (error)
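For the simple linear model Y = a + bX + e, the intercept a and slope b can be estimated by ordinary least squares. The sketch below assumes that criterion (the slides do not name one) and uses hypothetical data points; the residuals are the e terms left over after fitting.

```python
# Hypothetical observations of X (independent) and Y (dependent).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Least-squares estimates of the slope b and the intercept a.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
b = s_xy / s_xx
a = y_mean - b * x_mean

# Residuals e = Y - (a + bX) for each observation.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print("Intercept a:", round(a, 3))
print("Slope b:", round(b, 3))
print("Residuals:", [round(e, 3) for e in residuals])
```

Multiple linear regression follows the same idea with several X variables, but the estimates are then usually obtained with matrix methods or a statistics package rather than by hand.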
Examples
For example, you could use multiple regression to understand whether exam
performance can be predicted based on revision time, test anxiety, lecture
attendance and gender.
Alternatively, you could use multiple regression to understand whether daily cigarette
consumption can be predicted based on smoking duration, age when started
smoking, smoker type, income and gender.
Multiple regression also allows you to determine the overall fit (variance explained)
of the model and the relative contribution of each of the predictors to the total
variance explained.
For example, you might want to know how much of the variation in exam
performance can be explained by revision time, test anxiety, lecture attendance and
gender "as a whole", but also the "relative contribution" of each independent
variable in explaining the variance.
Discriminant Analysis
Discriminant analysis is a statistical method that is used by researchers to help them understand the
relationship between a "dependent variable" and one or more "independent variables".
Discriminant analysis is similar to regression analysis and analysis of variance (ANOVA). The principal
difference between discriminant analysis and the other two methods is with regard to the nature of the
dependent variable.
In regression analysis and ANOVA, the dependent variable must be a "continuous variable", while in
discriminant analysis, the dependent variable must be a "categorical variable".
The objective of discriminant analysis is to enable the researcher to examine whether significant
differences exist among the groups, in terms of the predictor variables.
Cluster Analysis
Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense) to
each other than to those in other groups (clusters).
Marketing: Help marketers discover distinct groups in their customer bases, and
then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation
database
Insurance: Identifying groups of motor insurance policy holders with a high
average claim cost
City-planning: Identifying groups of houses according to their house type, value,
and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along
continental faults
Ideal Condition
Clustering Respondents
Variable A B C D E F G
Store Loyalty 3 4 4 2 6 7 6
Brand Loyalty 2 5 7 7 6 7 4
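One way to see which respondents are "more similar to each other" is to compute the distance between every pair of respondents on the two loyalty scores. The sketch below uses the scores from the table above with Euclidean distance, which is an assumption made here (the slides do not name a similarity measure), and finds the closest pair, as the first step of an agglomerative clustering would.

```python
from itertools import combinations
from math import dist

# (Store Loyalty, Brand Loyalty) for respondents A to G, from the table.
respondents = {
    "A": (3, 2),
    "B": (4, 5),
    "C": (4, 7),
    "D": (2, 7),
    "E": (6, 6),
    "F": (7, 7),
    "G": (6, 4),
}

# Euclidean distance between every pair of respondents.
distances = {
    (p, q): dist(respondents[p], respondents[q])
    for p, q in combinations(respondents, 2)
}

# The most similar pair is the one with the smallest distance.
closest_pair = min(distances, key=distances.get)
print("Closest pair:", closest_pair,
      "distance:", round(distances[closest_pair], 3))
```

Respondents E and F come out as the closest pair, so a hierarchical procedure based on this distance would merge them into a cluster first.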
Correlation Matrix
     A     B     C     D     E     F     G
A  1.00  0.71  0.20  0.63  0.11  0.16  0.77
B        1.00  0.05  0.09  0.58  0.10  0.08
C              1.00  0.67  0.20  0.69  0.78
D                    1.00  0.02  0.50  0.61
E                          1.00  0.05  0.16
F                                1.00  0.21
G                                      1.00
Thank
You !!!