MMW (Data Management) - Part 2

MMW
Data Management-Part 2
Luzviminda T. Orilla, PhD
Measures of Relative Position
Percentiles and Quartiles
They are useful when you want to know where a score is located relative to the other scores.
• A percentile is a data value below which the specified percentage of the data falls.
• The median is the 50th percentile.
• The 25th, 50th, and 75th percentiles are the lower quartile Q1, the middle quartile Q2, and the upper quartile Q3, respectively. Together they divide the data into four parts.
• Quartiles lead to a summary of five numbers: minimum value, Q1, median, Q3, and maximum value.
• Quartiles are useful for box plots.
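As a rough illustration of the five numbers listed above, the sketch below computes them for the life expectancy values from the Happiness vs Life Expectancy data given later in this module. The function name `five_number_summary` and the choice of the "inclusive" quartile method are assumptions made for this example; textbooks and software use slightly different quartile conventions, so Q1 and Q3 may differ a little from a hand computation.

```python
from statistics import quantiles

# Life expectancy values from the Happiness vs Life Expectancy data in this module.
life_expectancy = [80.8, 74.2, 70.4, 76.4, 78.0, 69.0, 77.6, 69.4, 63.0, 59.5]


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # method="inclusive" interpolates between the sorted data values;
    # this convention is an assumption, other conventions shift Q1 and Q3 slightly.
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


summary = five_number_summary(life_expectancy)
print("min, Q1, median, Q3, max =", summary)
```

These five values are what a box plot draws: the box spans Q1 to Q3, with a line at the median.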
Normal Distribution
It is an extremely important concept, because it occurs so often
in the data we collect from the natural world, as well as in
many of the more theoretical ideas that are the foundation of
statistics.
• Characteristics of a Normal Distribution
• Area under the Curve
  • Areas under the curve that are symmetric about the mean are equal.
  • The total area under the curve is 1.
• Empirical Rule for a Normal Distribution
• In a normal distribution, approximately
  • 68% of the data lie within 1 standard deviation of the mean.
  • 95% of the data lie within 2 standard deviations of the mean.
  • 99.7% of the data lie within 3 standard deviations of the mean.
Example
• Empirical Rule for a Normal Distribution
• Example. The heights of a large group of people are assumed to be normally distributed. Their mean height is 66.5 inches, and the standard deviation is 2.4 inches. Find and interpret the intervals within one, two, and three standard deviations of the mean.
• One standard deviation of the mean: 66.5 - 2.4 = 64.1 and 66.5 + 2.4 = 68.9.
• Approximately 68% of the people are between 64.1 and 68.9 inches tall.
• Two standard deviations of the mean: 66.5 - 2(2.4) = 61.7 and 66.5 + 2(2.4) = 71.3.
• Therefore, approximately 95% of the people are between 61.7 and 71.3 inches tall.
• Three standard deviations of the mean: 66.5 - 3(2.4) = 59.3 and 66.5 + 3(2.4) = 73.7.
• Nearly all of the people (99.7%) are between 59.3 and 73.7 inches tall.
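The interval arithmetic in this example can be checked with a short script. This is only a sketch; the variable names are chosen for illustration, and the percentages are the approximate ones stated by the empirical rule.

```python
mean_height = 66.5  # inches, from the example
std_dev = 2.4       # inches, from the example

# Approximate percentages stated by the empirical rule.
empirical_rule = {1: 68.0, 2: 95.0, 3: 99.7}

intervals = {}
for k, percent in empirical_rule.items():
    low = round(mean_height - k * std_dev, 1)
    high = round(mean_height + k * std_dev, 1)
    intervals[k] = (low, high)
    print(f"Within {k} SD: {low} to {high} inches (about {percent}% of people)")
```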
Z-score or Standardized Score
• The z-score for a given data value x is the number of standard deviations that x is above or below the mean of the data.
• z-score of $x_i$ in a population: $z_{x_i} = \frac{x_i - \mu}{\sigma}$
• z-score of $x_i$ in a sample: $z_{x_i} = \frac{x_i - \bar{x}}{s}$
• Standard Normal Distribution
• In the standard normal distribution, the area of the distribution from z = a to z = b represents
  • the percentage of z-values that lie in the interval from a to b.
  • the probability that z lies in the interval from a to b.
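A minimal sketch of this idea, assuming Python's standard `statistics.NormalDist` class is an acceptable stand-in for a z-table: the area from z = a to z = b is the difference of two cumulative areas. The helper name `area_between` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def area_between(a, b):
    """Area under the standard normal curve from z = a to z = b."""
    return standard_normal.cdf(b) - standard_normal.cdf(a)


# Areas within 1, 2, and 3 standard deviations of the mean.
for k in (1, 2, 3):
    print(f"P({-k} <= z <= {k}) = {area_between(-k, k):.4f}")
```

The printed areas are close to the 68%, 95%, and 99.7% of the empirical rule, which rounds them for convenience.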

Practice Set 4
Example
A soda machine dispenses soda into 12-ounce cups. Tests show that the
actual amount of soda dispensed is normally distributed, with a mean of
11.5 oz and a standard deviation of 0.2 oz.
a. What percent of cups will receive more than 11.9 oz of soda?
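Converting 11.9 to a z-score:
$z = \frac{x - \bar{x}}{s} = \frac{11.9 - 11.5}{0.2} = \frac{0.4}{0.2} = 2$
By the empirical rule, about 95% of the data lie within 2 standard deviations of the mean, so the area between z = 0 and z = 2 is about 47.5%.
The area to the right of the mean is half of the total area of 1, or 50%.
Hence, the area for more than 11.9 oz is the difference between 50% and 47.5%, which is 2.5%. About 2.5% of cups will receive more than 11.9 oz of soda.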
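The same answer can be sketched in code. The example above uses the empirical rule, which rounds the area; the script below also computes the tail area from the normal cumulative distribution function, using `statistics.NormalDist` as an assumed substitute for a z-table. Variable names are illustrative only.

```python
from statistics import NormalDist

mean_oz = 11.5       # mean amount dispensed, from the example
std_dev_oz = 0.2     # standard deviation, from the example
threshold_oz = 11.9  # amount asked about in part (a)

# z-score of the threshold.
z = (threshold_oz - mean_oz) / std_dev_oz

# Empirical-rule estimate: 95% within 2 SD, so 47.5% between the mean and z = 2.
empirical_tail = 0.50 - 0.95 / 2

# Tail area from the normal cumulative distribution function.
exact_tail = 1.0 - NormalDist(mean_oz, std_dev_oz).cdf(threshold_oz)

print(f"z = {z:.2f}")
print(f"empirical-rule estimate: {empirical_tail:.3f}")
print(f"tail area from the CDF:  {exact_tail:.4f}")
```

The two results differ slightly because the empirical rule uses 95% where the area within 2 standard deviations is closer to 95.45%.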
Standardized Score (Z)

Correlation or Measure of Relationship
between two variables
Correlation measures the relationship between bivariate data.
• Bivariate data are data sets in which each subject has two
observations associated with it.
• A response variable measures an outcome or result of a study.
• An explanatory variable is a variable that we think explains or causes changes in the response variable.
Linear Correlation Coefficient
• The linear correlation coefficient, denoted by r, measures the strength of a linear relationship between two variables.
$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$
• If the linear correlation coefficient r is positive, the relationship between the variables has a positive correlation. In this case, if one variable increases, the other variable also tends to increase.
• If r is negative, the linear relationship between the
variables has a negative correlation. In this case, if one variable
increases, the other variable tends to decrease.
Interpretation of Result of Correlation (Basis)
INTERVAL      REMARKS
0.9-0.99      Very High
0.7-0.89      High
0.5-0.69      Moderate
0.3-0.49      Low
Below 0.3     Very Low
• Given DATA: Happiness vs Life Expectancy
Country        Happiness   Life Expectancy (LE)
Japan          6.8         80.80
South Korea    6.2         74.20
China          6.3         70.40
Taiwan         6.2         76.40
Indonesia      6.6         78.00
Philippines    6.4         69.00
Singapore      6.8         77.60
Vietnam        6.1         69.40
India          6.2         63.00
Bangladesh     5.7         59.50
Source: CHED GenEd 1st Generation Training
• What is the correlation value between Happiness Index and LE?

The correlation value is r = 0.817.
The relationship is described as a high correlation.
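The correlation value can be reproduced from the data table with the formula for r given earlier. The sketch below is one way to do it; the function name `pearson_r` is chosen for illustration.

```python
from math import sqrt

# (happiness index, life expectancy) pairs from the data table.
data = [
    (6.8, 80.80),  # Japan
    (6.2, 74.20),  # South Korea
    (6.3, 70.40),  # China
    (6.2, 76.40),  # Taiwan
    (6.6, 78.00),  # Indonesia
    (6.4, 69.00),  # Philippines
    (6.8, 77.60),  # Singapore
    (6.1, 69.40),  # Vietnam
    (6.2, 63.00),  # India
    (5.7, 59.50),  # Bangladesh
]


def pearson_r(pairs):
    """Linear correlation coefficient computed from the raw sums."""
    n = len(pairs)
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    sum_xy = sum(x * y for x, y in pairs)
    sum_x2 = sum(x * x for x, _ in pairs)
    sum_y2 = sum(y * y for _, y in pairs)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


r = pearson_r(data)
print(f"r = {r:.3f}")
```

A value between 0.7 and 0.89 falls in the "High" interval of the interpretation table.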
Table 2. Correlation Matrix of the Happiness Index and Life Expectancy
                   Happiness Index
                   r-value   Significance (α = 0.05)                          Null Hypothesis
Life Expectancy    0.817     CV: 4.05; TV: 2.306 (df = 8); PV = 0.000         Rejected
*p < 0.05
CV = computed t-value, TV = tabular (critical) t-value, PV = p-value.
Interpretation
0.9-0.99      Very highly correlated
0.7-0.89      Highly correlated
0.5-0.69      Moderately correlated
0.3-0.49      Low correlation
Below 0.3     Very low correlation
STATISTICAL HYPOTHESES
• A hypothesis is simply a conjecture about a characteristic or set of facts.
• When performing statistical analyses, our hypotheses provide the general framework of what we are testing and how to perform the test.
• Hypothesis testing involves testing the difference between a hypothesized value of a population parameter and the estimate of that parameter, which is calculated from a sample.
• Overview of the Process
• The hypothesis to be tested is called the null hypothesis and is given the symbol H0. The alternative hypothesis is given the symbol H1.
• Tests Concerning the Mean
• The purpose of analysis of variance (ANOVA) is much the same as that of the t-tests. However, if a series of t-tests is used to evaluate several mean differences, the risk of a Type I error increases; that is, the α-levels accumulate over the series of tests, so the final experimentwise α-level can be quite large.
• ANOVA is necessary to protect researchers from excessive risk of a Type I error.
• ANOVA allows researchers to evaluate all of the mean differences in a single hypothesis test using a single α-level and thereby keeps the risk of a Type I error under control no matter how many different means are being compared.
• The following statistical tests are used to determine whether an observed difference between a population mean and a reference value, or between two means, is significant or can be attributed to chance.
• The z-test is used if the population standard deviation is known. If it is not known, the sample standard deviation can be used as an estimate of the population standard deviation, provided that the sample size is large; that is, n ≥ 30.
• The t-test is used if the sample size is less than 30 and the population standard deviation is unknown, so the sample standard deviation is used in its place.
• ANOVA tests the homogeneity of a set of means. If the null hypothesis is rejected in favor of the alternative hypothesis that the means are not all equal, further (post hoc) tests should be done to determine which pairs of means are significantly different.
• The following post hoc tests are available in most statistical software:
1. Duncan’s multiple range test
2. Tukey’s procedure
3. Scheffé test
4. Fisher’s least significant difference
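The accumulation of α-levels over a series of t-tests, described above as the reason for using ANOVA, can be illustrated with a small calculation. The sketch assumes the tests are independent, which pairwise t-tests on shared groups are not exactly, so the numbers are only a rough guide. The function names are hypothetical.

```python
def pairwise_comparisons(num_groups):
    """Number of distinct pairs of group means that could be compared."""
    return num_groups * (num_groups - 1) // 2


def familywise_error_rate(alpha, num_tests):
    """Chance of at least one Type I error across independent tests (assumption)."""
    return 1.0 - (1.0 - alpha) ** num_tests


alpha = 0.05
for groups in (2, 3, 4, 5):
    tests = pairwise_comparisons(groups)
    rate = familywise_error_rate(alpha, tests)
    print(f"{groups} groups: {tests} t-tests, experimentwise alpha about {rate:.3f}")
```

With a single α-level, ANOVA avoids this growth; the post hoc tests listed above are then used to locate which pairs of means differ.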
Using α = 0.05, determine if there is a significant correlation between happiness and life expectancy.
Following the five-step procedure in hypothesis testing, we have:
STEP 1 State the Hypotheses

H0: r = 0 (The correlation is zero; there is no significant correlation between the happiness index and LE.)
H1: r ≠ 0 (The correlation is not equal to zero; there is a significant correlation between the happiness index and LE.)
STEP 2 Specify the decision rule
Degrees of freedom: df = n - 2 = 10 - 2 = 8, where n = 10 is the number of countries. Using the t-distribution table with α = 0.05 (two-tailed) and 8 df, $t_{crit} = 2.306$.
If $t_{cal}$ falls in the critical region for rejection of H0, we reject H0. Thus, if $|t_{cal}| > |t_{crit}|$, reject H0. If not, retain H0.
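STEP 3 Calculate the test statistic
$t_{cal} = r\sqrt{\frac{n-2}{1-r^2}}$
$t_{cal} = 0.82\sqrt{\frac{10-2}{1-(0.82)^2}} = 0.82\sqrt{\frac{8}{1-0.6724}} = 0.82\sqrt{\frac{8}{0.3276}} = 0.82\sqrt{24.42} = 0.82(4.9416) \approx 4.05$
$t_{cal} = 4.05 > t_{crit} = 2.306$ (α = 0.05, two-tailed, 8 df)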
STEP 4 Make a decision
Since $t_{cal}$ (4.05) is greater than $t_{crit}$ (2.306, for 8 degrees of freedom), we reject the null hypothesis that the correlation is zero. The computed t-value exceeds the value required for significance at the .05 probability level. This leads us to conclude that there is a real correlation between the happiness index and life expectancy.
STEP 5 Interpretation
As the happiness index increases, the life expectancy also
increases.
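The test statistic and decision in Steps 3 and 4 can be sketched as follows. The critical value is not computed here; it is entered by hand as the two-tailed t-table value for α = 0.05 with n - 2 = 8 degrees of freedom, and the function name is an assumption for this example.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t statistic for testing H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


r = 0.82        # correlation coefficient, rounded as in Step 3
n = 10          # number of countries in the data table
t_crit = 2.306  # t-table value: alpha = 0.05, two-tailed, df = n - 2 = 8

t_cal = correlation_t_statistic(r, n)
reject_null = abs(t_cal) > t_crit

print(f"t_cal = {t_cal:.2f}, t_crit = {t_crit}")
print("Reject H0" if reject_null else "Retain H0")
```

Using the unrounded r of about 0.817 gives a slightly smaller t-value, but the decision is the same.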
Linear Regression
Linear regression is an approach for modeling the relationship between a dependent variable (outcome) and one or more explanatory variables. The case of one explanatory variable is called simple linear regression.
It involves using data to calculate the line that best fits the data and then using that line to predict scores.
Least-Square Regression Line
• The least-squares regression line is the line that minimizes the sum of the squares of the vertical deviations from each data point to the line.
• The equation of the least-squares line is $\hat{y} = ax + b$,
• where $a = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}$ and $b = \bar{y} - a\bar{x}$.
• A scatterplot is a graph of plotted points showing the relationship between two numerical variables.
Examining a Scatterplot
1. Describe the overall pattern of a scatterplot by the form,
direction, and strength of the relationship.
2. Then look for any striking deviations from the pattern.
Identify each occurrence of an outlier.
Using the previous data to develop the predictive equation for Happiness vs Life Expectancy:
• a = 16.661
• b = -33.635
$\hat{y} = 16.661x - 33.635$
• Will the line give accurate predictions?
• Predict the life expectancy for the following countries:
Country        Happiness   Actual LE
a) Zimbabwe    4.2         35.40
b) Ghana       5.4         57.90
c) Belarus     6.1         68.60
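One way to check the predictive equation and try the predictions is sketched below. It fits the least-squares line to the ten-country data with the formulas for a and b, then compares the predicted and actual life expectancy for the three countries listed. The function and variable names are chosen for illustration.

```python
# (happiness index, life expectancy) pairs from the ten-country data table.
data = [
    (6.8, 80.80), (6.2, 74.20), (6.3, 70.40), (6.2, 76.40), (6.6, 78.00),
    (6.4, 69.00), (6.8, 77.60), (6.1, 69.40), (6.2, 63.00), (5.7, 59.50),
]


def least_squares_line(pairs):
    """Return slope a and intercept b of the line y-hat = a*x + b."""
    n = len(pairs)
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    sum_xy = sum(x * y for x, y in pairs)
    sum_x2 = sum(x * x for x, _ in pairs)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - a * (sum_x / n)
    return a, b


a, b = least_squares_line(data)
print(f"y-hat = {a:.3f}x + ({b:.3f})")

# Countries to predict: name -> (happiness index, actual life expectancy).
new_countries = {"Zimbabwe": (4.2, 35.40), "Ghana": (5.4, 57.90), "Belarus": (6.1, 68.60)}

predictions = {}
for name, (happiness, actual_le) in new_countries.items():
    predicted_le = a * happiness + b
    predictions[name] = predicted_le
    print(f"{name}: predicted {predicted_le:.2f}, actual {actual_le:.2f}, "
          f"difference {actual_le - predicted_le:+.2f}")
```

Zimbabwe and Ghana have happiness values below the range of the original data (5.7 to 6.8), so those predictions are extrapolations and should be read with extra caution.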
Since the p-value (0.597) is greater than alpha (α = 0.050), we do not reject the null hypothesis that the correlation is zero. The computed t-value exceeded the required value for significance at the .05 probability level. This leads us to say that there exists a real correlation between the happiness index and life expectancy.
Since p = 0.000 is less than α = 0.05, we can reject the null hypothesis.
REVIEW ACTIVITY
