Chapter 5 - RM


Chapter 5

Basics of Statistics

Research Methods
Dr. Asif Mahmood
Institute of Business & Management,
UET Lahore
Descriptive Statistics
• Descriptive statistics are used by researchers to report on
populations and samples
• They speed up and simplify comprehension of a group's characteristics
• They do not aim to use the data to learn about the population that the sample is thought to represent, e.g., demographics
Types of descriptive statistics: Univariate Analysis and Bivariate Analysis
• Organize Data
– Tables (Frequency Distributions, Relative Frequency Distributions,
Cross-tabulations and Contingency tables)
– Graphs (Bar Chart or Histogram, Frequency Polygon, Scatterplot,
etc.)
Univariate Analysis
• Summarize Data
− Central Tendency (Mean, Median, Mode)
− Variation: Range, Interquartile Range, Variance, Standard Deviation, shape of distribution (kurtosis, skewness)
Statistical Analysis

Inferential Statistics
• Testing a hypothesis and drawing conclusions
about a population, based on a sample
• Statistical inference consists of:
– selecting a statistical model of the process that generates the
data
– deducing propositions from the model
• Forms of inferential statistics:
– Estimation Statistics
• point estimate, interval estimate
– Hypothesis Testing
• t-test, chi-square, or ANOVA to test whether a hypothesis about the mean is true or not
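To make these two forms concrete, here is a minimal Python sketch (the 12 scores and the hypothesised mean of 100 are made-up values for illustration; it assumes NumPy and SciPy are available):

    import numpy as np
    from scipy import stats

    # Hypothetical sample of 12 scores (made-up data)
    sample = np.array([98, 104, 101, 97, 110, 95, 103, 99, 106, 102, 100, 96])

    # Estimation statistics
    point_estimate = sample.mean()                 # point estimate of the population mean
    se = stats.sem(sample)                         # standard error of the mean
    interval_estimate = stats.t.interval(0.95, df=len(sample) - 1,
                                         loc=point_estimate, scale=se)

    # Hypothesis testing: one-sample t-test of H0: population mean = 100
    t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

    print(point_estimate, interval_estimate)
    print(t_stat, p_value)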
The Shape of A Distribution

Normal distribution
• The majority of scores lie around the centre of the distribution, which looks the same on both sides
Two main ways in which a distribution can deviate from
normal:
1. Skew (Lack of symmetry): Most frequent scores are clustered at one end of the scale
• Positively skewed: The frequent scores are
clustered at the lower end and the tail points
towards the higher or more positive scores
• Negatively skewed: The frequent scores are
clustered at the higher end and the tail points
towards the lower or more negative scores

2. Kurtosis (Pointyness): the degree to which scores cluster at
the ends of the distribution
• Positive kurtosis (leptokurtic): It has many scores in the
tails (heavy-tailed distribution) and is pointy
• Negative kurtosis (platykurtic): Thin in the tails (has light
tails) and tends to be flatter than normal
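Both departures from normality can be checked numerically; a brief sketch using SciPy on made-up, positively skewed data (note that scipy.stats.kurtosis reports excess kurtosis, so 0 corresponds to a normal distribution):

    import numpy as np
    from scipy.stats import skew, kurtosis

    rng = np.random.default_rng(42)
    scores = rng.exponential(scale=2.0, size=1_000)   # made-up, positively skewed data

    print("skewness:", skew(scores))                  # > 0: tail points towards higher scores
    print("excess kurtosis:", kurtosis(scores))       # > 0: leptokurtic (heavy tails)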
Measures of Central Tendency

• Mean (or Average)


– The sum of all the data scores divided by n; it is a hypothetical value:
X̄ = ∑X / n, where n = number of scores in the distribution
– Means can be badly affected by outliers (data points with extreme values unlike the rest)
[Figure: Income in the U.S., where the mean is pulled away from "All of Us" toward the outlier (Bill Gates)]
• Median (the 50th percentile)
– The median of a distribution is the value that cuts the distribution
exactly in half, such that an equal number of scores are larger than
that value as there are smaller than that value
– Example: Odd-sized data set: 5, 7, 6, 1, 8. Sorting the data: 1, 5, 6, 7, 8; the median is 6, at position (n+1)/2
– Even-sized data set: 1, 4, 6, 5, 8, 0. Sorting the data: 0, 1, 4, 5, 6, 8; the median is the average of 4 and 5 (positions n/2 and (n+2)/2), i.e., 4.5
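The two median examples above can be reproduced with a few lines of plain Python (a sketch of the position rules, not a library implementation):

    # Reproducing the median examples above (standard library only)
    odd_data = [5, 7, 6, 1, 8]
    even_data = [1, 4, 6, 5, 8, 0]

    def median(values):
        ordered = sorted(values)
        n = len(ordered)
        mid = n // 2
        if n % 2:                                       # odd n: position (n + 1) / 2
            return ordered[mid]
        return (ordered[mid - 1] + ordered[mid]) / 2    # even n: average of positions n/2 and (n+2)/2

    print(median(odd_data))    # 6
    print(median(even_data))   # 4.5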
Measures of Central Tendency

• Median (the 50th percentile)


– The median is unaffected by outliers, making it a better measure of
central tendency, better describing the “typical person” than the mean
when data are skewed

• Mode
– The mode in a distribution of data is simply the 2.0

score that occurs most frequently 1.8

– In a data set of 2,3,4,5,6,6,6,7,7,8,9 (6 is mode)


1.6

Count
1.4

– It may give you the most likely experience rather 1.2 Bimodal Distribution
than the “typical” or “central” experience 1.0

82.00 89.00
87.00
96.00
93.00
98.00
97.00
103.00 106.00 109.00 115.00 120.00 128.00 140.00
102.00 105.00 107.00 111.00 119.00 127.00 131.00 162.00
IQ
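A small sketch of the mode using Python's standard statistics module; the bimodal list is made up to mirror the two-peaked IQ figure:

    from statistics import mode, multimode

    data = [2, 3, 4, 5, 6, 6, 6, 7, 7, 8, 9]
    print(mode(data))                     # 6, the score that occurs most frequently

    bimodal = [2, 2, 2, 5, 7, 7, 7, 9]    # made-up bimodal data
    print(multimode(bimodal))             # [2, 7], both peaks are reported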

[Figure: symmetric vs. skewed distributions, showing the positions of the mean, median, and mode]
N.B. In symmetric distributions, the mean, median, and mode are the same, whereas in skewed data, the mean and median lie further toward the skew than the mode
The dispersion in a distribution

• Range
– Difference between the largest and smallest score in the data set
• The Inter Quartile Range or IQR is the difference between the 25th
and 75th percentile scores
• Quartiles are the three values that split the sorted data into four equal
parts
• Second Quartile (Median) splits the data into two equal parts
• Lower Quartile is the median of the lower half of the data
• Upper Quartile is the median of the upper half of the data
• Percentiles are points that split the data into 100 equal parts

[Formula: the position of a percentile in the data sorted in increasing order, expressed in terms of the number of values in the data set]
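A short NumPy sketch of these dispersion measures on a made-up data set (np.percentile uses one of several common conventions for locating quartiles):

    import numpy as np

    data = np.array([2, 4, 4, 5, 7, 8, 9, 11, 12, 15])   # made-up data

    data_range = data.max() - data.min()                  # range
    q1, q2, q3 = np.percentile(data, [25, 50, 75])        # lower quartile, median, upper quartile
    iqr = q3 - q1                                         # inter-quartile range

    print(data_range, q1, q2, q3, iqr)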
Variance

A measure of the spread of the recorded values on a variable. A measure of dispersion.

σ² = Σᵢ₌₁ⁿ (xᵢ − X̄)² / (n − 1)
The larger the variance, the further the individual cases are from the
mean.


The smaller the variance, the closer the individual scores are to the
mean.

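In NumPy the same formula can be applied directly; passing ddof=1 gives the n − 1 denominator used above (the data are made up):

    import numpy as np

    x = np.array([2, 4, 4, 4, 5, 5, 7, 9])    # made-up scores

    variance = np.var(x, ddof=1)               # ddof=1 gives the n - 1 denominator
    std_dev = np.std(x, ddof=1)                # standard deviation = square root of the variance

    print(variance, std_dev)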
Statistical Model

• Statistical models are made up of variables and parameters
– Variables are measured constructs that vary across entities in the
sample
– Parameters are estimated from the data (rather than being measured)
• Examples of parameters: mean and median (which estimate the
centre of the distribution) and the correlation and regression
coefficients (which estimate the relationship between two variables)
– Constants believed to represent some fundamental truth about the
relations between variables in the model
Examples (where i = a particular entity, Xᵢ = predictor variable, b = parameter):
• Mean as a statistical model: outcomeᵢ = (b) + errorᵢ
Assessing the Fit of a Model: Sums of Squares and Variance (Mean as Model)
• Suppose lecturers in a university have numbers of friends as shown:
Lecturer     Friends
Lecturer 1   1
Lecturer 2   2
Lecturer 3   3
Lecturer 4   3
Lecturer 5   4
Mean (X̄) = 2.6
• Let's make a prediction for lecturer 1 using the mean: error or deviance = 1 − 2.6 = −1.6 (overestimation)

Degrees of freedom for a sample: n − 1
• The sum of squared errors and the mean squared error are used to assess the fit of a model
• When the model is the mean, the mean squared error has a special name: the variance (its square root is the standard deviation)
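The lecturer example can be worked through in a few lines of Python, using the mean as the model and computing the sum of squared errors and the mean squared error (a sketch of the slide's calculation):

    import numpy as np

    friends = np.array([1, 2, 3, 3, 4])     # the five lecturers from the slide
    model = friends.mean()                   # the mean as the model: 2.6

    errors = friends - model                 # deviance of each lecturer (lecturer 1: 1 - 2.6 = -1.6)
    sse = np.sum(errors ** 2)                # sum of squared errors
    mse = sse / (len(friends) - 1)           # mean squared error with n - 1 degrees of freedom

    print(model, sse, mse)                   # mse equals the variance; its square root is the SD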
Important Definitions
Sampling variation
• Samples will vary because they contain different members of the
population
Sampling distribution
• A sampling distribution is the frequency distribution of sample means (or
any other parameter) from the same population
Standard error of the mean (SE) or standard error
• The standard deviation of sample means
Important
Central limit theorem
• As samples get large (>30), the sampling distribution has a normal distribution with a mean equal to the population mean and a standard deviation approximated by
SE = s / √n, where s = standard deviation of the sample and n = sample size

• A large standard error (relative to the sample mean) means that there is
a lot of variability between the means of different samples and so the
sample might not be representative of the population
Central Limit Theorem—Again

• Regardless of the shape of the population, parameter estimates of that population will have a normal distribution provided the samples are 'big enough'
• Big enough
– 30 (widely accepted)
– 160 or even more when distribution has a lot of skewness and
kurtosis, where outliers are common

[Figure: sampling distribution with sample size 1]
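The theorem is easy to see in a small simulation; the sketch below draws repeated samples of size 30 from a deliberately non-normal (exponential) population and compares the spread of the sample means with σ/√n (all numbers are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.exponential(scale=10, size=100_000)    # a clearly non-normal population

    n = 30                                                   # 'big enough' sample size
    sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]

    print("mean of sample means:", np.mean(sample_means))           # close to the population mean (10)
    print("SD of sample means  :", np.std(sample_means))            # the standard error
    print("sigma / sqrt(n)     :", population.std() / np.sqrt(n))   # CLT approximation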
Standardization

• Z score
– It transforms the original distribution to one in which the mean becomes zero
and the standard deviation becomes 1 without changing the symmetry of the
distribution. The process is called Standardization

– It quantifies the original score in terms of the number of standard deviations that the score is from the mean of the distribution
– Example (Excel)…
– Z-table

Three-Sigma Rule of Thumb (standard normal distribution):
Corresponding Range   Z-Score
95%                   ±1.96
99%                   ±2.58
99.9%                 ±3.29
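Standardization in code is a one-liner; a sketch on made-up scores (after the transformation the mean is 0 and the standard deviation is 1):

    import numpy as np

    scores = np.array([82, 89, 93, 97, 103, 109, 120, 140])   # made-up IQ-like scores

    z = (scores - scores.mean()) / scores.std(ddof=1)          # standardization

    print(z)                                                   # z-scores: distance from the mean in SD units
    print(z.mean(), z.std(ddof=1))                             # mean 0, standard deviation 1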
Confidence Interval

• A confidence interval for the mean is a range of scores constructed such that the population mean will fall within this range in, e.g., 95% of samples.
• The confidence interval is NOT an interval within which we are e.g.,
95% confident that the population mean will fall!!!
• For larger sample sizes (>30), use the standard normal curve (mean = 0, std dev = 1): 95% of z-scores lie between −1.96 and +1.96

Standard Error
• General formula: SE = s / √n; confidence interval = X̄ ± z × SE (e.g., X̄ ± 1.96 × SE for 95%)
Exercise

• Suppose there are 56 Facebook users having 95 friends on average, with a standard deviation of 56.79.
– Calculate a 95% confidence interval for this mean.

Solution
SE = 56.79 / √56 ≈ 7.59
95% CI = 95 ± 1.96 × 7.59 ≈ (80.1, 109.9)
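The same calculation in Python (a sketch of the exercise above, using z = 1.96 for the 95% interval):

    import math

    n, mean, sd = 56, 95, 56.79          # the Facebook exercise above

    se = sd / math.sqrt(n)               # standard error of the mean
    lower = mean - 1.96 * se
    upper = mean + 1.96 * se

    print(round(se, 2), round(lower, 2), round(upper, 2))   # about 7.59, 80.13, 109.87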
Type I and Type II Errors
Effect size (e)
• The separation between the null
hypothesis value and a particular
value specified for the alternative
hypothesis
α- Level (Type I error)
• α states what chance of making an
error (by falsely concluding that the
null hypothesis should be rejected)
we, as researchers, are willing to
tolerate in the particular research context.
E.g., an intelligent student is failed; an innocent person is punished
β-level (Type II error)
• Occurs when we believe that there is no effect in the population when, in reality,
there is (we falsely ‘fail to reject null hypothesis’)
• Cohen (1992) suggests that the maximum acceptable probability of a Type II
error would be .2 (or 20%)
• β error is generally considered less severe or costly than an α error
Type I and Type II Errors in Testing a Hypothesis

                        H₀ is true             H₀ is false
Reject H₀               Type I error (wrong)   Correct
Fail to reject H₀       Correct                Type II error (wrong)
Type I and Type II Errors (Skip this)

The power of a test


• It is the probability that a given test will find an effect assuming
that one exists in the population (1 − β)
• Cohen (1988, 1992) recommends achieving a power of .8 (an 80% chance of detecting an effect if one genuinely exists)
Increasing the power of a statistical test
1. Relax the significance level (α) [if e and N remain constant]
2. Increase the sample size [if α and e remain constant]
– Variance of a sampling
distribution gets smaller
when sample size is
increased
Increasing Power of a Statistical Test

3. Look only for a larger effect size [if α and N remain constant]

4. Finally, power can be increased if a directional hypothesis can be stated (based on previous research findings or deductions from theory)

Free and Powerful Tool:
Calculation of sample size using G*Power, assuming a normal distribution of the test statistic
http://www.ats.ucla.edu/stat/gpower/
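A comparable calculation can be run in Python with statsmodels (a sketch assuming an independent-samples t-test; the medium effect size of d = 0.5 is chosen only for illustration):

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # Sample size per group needed to detect d = 0.5 with alpha = .05 and power = .8
    n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                       alternative='two-sided')
    print(round(n_per_group))      # roughly 64 per group

    # Conversely: the power achieved with 30 participants per group
    power = analysis.solve_power(effect_size=0.5, nobs1=30, alpha=0.05)
    print(round(power, 2))         # roughly 0.48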
P-Value

• The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis is true.
• Examples:
– A p-value of .01 means that, if the null hypothesis were true, there would be only a 1% chance of obtaining a result this extreme.
– A p-value of .10 means that, if the null hypothesis were true, there would be a 10% chance of obtaining a result this extreme.
• Using a p-value, one can make the decision to reject
or fail to reject the null hypothesis.
– If p > α, then FAIL TO REJECT the null hypothesis
– If p < α, then REJECT the null hypothesis
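The decision rule can be applied directly to the output of any test; a minimal sketch with two made-up groups and an independent-samples t-test from SciPy:

    from scipy import stats

    alpha = 0.05
    group_a = [23, 25, 28, 31, 22, 27, 30]     # made-up scores
    group_b = [35, 33, 29, 38, 36, 31, 34]

    t_stat, p = stats.ttest_ind(group_a, group_b)

    if p < alpha:
        print(f"p = {p:.3f} < {alpha}: reject the null hypothesis")
    else:
        print(f"p = {p:.3f} >= {alpha}: fail to reject the null hypothesis")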
Research Proposal

• Submit your complete research proposal in line with the sample research proposals (soft and hard copy)
• Short presentation of the research proposal will be
held in the class
• This activity will be completed two weeks after the Midterm exam
