Session 2 - QRT - Oct 3, 2020
Techniques
Session – 2
Data Screening and Evaluation of Assumptions
Agenda: Sources of Bias and Possible Solutions
• Missing Values
• Outliers
• Violation of Assumptions
Missing Value Analysis I: Handling Missing Data
• Little’s Missing Completely at Random (MCAR) test
• Tests whether the data are missing completely at random
• Significant (p < .05) = data are not missing completely at random
• Non-significant (p > .05) = data are consistent with missing completely at random
https://www.nctm.org/Classroom-Resources/Illuminations/Interactives/Line-of-Best-Fit/
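The slides run missing-value analysis in SPSS. As a rough equivalent, the first step (before Little's MCAR test, which has no standard SciPy implementation) can be sketched in Python with pandas; the data frame below is purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with missing values (illustrative only)
df = pd.DataFrame({
    "age":   [21, 25, np.nan, 30, 22, np.nan],
    "score": [3.5, np.nan, 2.1, 4.0, 3.8, 2.9],
    "group": ["A", "B", "A", "B", "A", "B"],
})

# Count and percentage of missing values per variable
missing_count = df.isna().sum()
missing_pct = df.isna().mean() * 100

# Missingness patterns: frequency of each unique True/False row of flags,
# a quick visual check before any formal MCAR test
patterns = df.isna().value_counts()
```

Inspecting `patterns` shows whether missingness clusters in particular combinations of variables, which is informal evidence against MCAR.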
Outliers I: Handling Univariate Outliers
• Univariate outliers should be examined on a case-by-case basis.
• In SPSS – Boxplot and Histogram
• Dataset: Download Festival
• If the outlier is truly abnormal, and not representative of your
population, then it is okay to remove it. But this requires careful
examination of the data points
• e.g., you are studying dogs, but somehow a cat got hold of your survey
• e.g., someone answered “3” for all 45 questions on the survey
• However, just because a data point does not fit comfortably within the
distribution does not automatically nominate that data point for removal.
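The slides identify univariate outliers from boxplots in SPSS. The same boxplot rule (values beyond 1.5 × IQR from the quartiles, the default "whisker" cutoff) can be sketched in Python; the scores below are made up for illustration:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag values beyond k * IQR from the quartiles (the boxplot rule)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return x[(x < lower) | (x > upper)]

scores = [2.1, 2.4, 2.2, 2.5, 2.3, 2.6, 9.8]  # one suspicious score
outliers = iqr_outliers(scores)  # flags only the extreme 9.8
```

As the slide stresses, a flagged score is a candidate for inspection, not automatic removal.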
Assumptions
• Parametric tests assume underlying statistical distributions in the data.
• Parametric tests based on the normal distribution assume:
• Additivity and linearity
• Normality (of the sampling distribution, or of the residuals)
• Homogeneity of Variance
• Independence
• Violations can affect the following things we do when we fit models to data:
• Parameter estimates
• Confidence intervals around a parameter
• Null hypothesis significance testing
Assumptions I: Additivity and linearity
• It’s a bit like calling your pet cat a dog: you can try to get it to fetch sticks, or to sit when you tell it to,
but don’t be surprised when its behavior isn’t what you expect, because even though you’ve called it a dog,
it is in fact a cat.
• Similarly, if you have described your statistical model inaccurately it won’t behave itself, and there’s no point
in interpreting its parameter estimates or worrying about significance tests or confidence intervals: the model
is wrong.
Assumptions II a: Normality - Skewness
Assumptions II b: Normality - Kurtosis
Assumptions II c: Normality
• We don’t have access to the sampling distribution, so we usually test the observed data
• Central Limit Theorem
• If N > 30, the sampling distribution tends to be approximately normal regardless of the shape of the population
• Graphical displays (In SPSS)
• P-P Plot (or Q-Q plot)
• If values fall on the diagonal of the plot then the variable is normally distributed;
• When the data sag consistently above or below the diagonal then this shows that the kurtosis differs from a normal distribution
• when the data points are S-shaped, the problem is skewness.
• Histogram
• Kolmogorov-Smirnov Test (In SPSS)
• Tests whether the data differ from a normal distribution
• Significant = non-normal data
• Non-significant = data do not differ significantly from normal
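The slides run these normality tests in SPSS. A comparable check can be sketched in Python with SciPy, here using the Shapiro-Wilk test (a common alternative to Kolmogorov-Smirnov; note that SPSS's K-S variant estimates parameters from the data, which a plain K-S test against a fixed normal does not). The simulated samples are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=50, scale=10, size=200)   # simulated, roughly normal scores
y = rng.exponential(scale=1.0, size=200)     # simulated, strongly right-skewed

# Shapiro-Wilk test: H0 = data come from a normal distribution
stat_x, p_x = stats.shapiro(x)   # expect a non-significant p for normal data
stat_y, p_y = stats.shapiro(y)   # expect a very small p for skewed data
```

As on the slide: a significant result means the data differ from normal; a non-significant result means no evidence of non-normality (not proof of normality).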
Assumptions II d: Normality - Values of Skewness/Kurtosis
• Positive values of skewness indicate too many low scores in the
distribution, whereas negative values indicate a build-up of high
scores.
• Positive values of kurtosis indicate a pointy and heavy-tailed
distribution, whereas negative values indicate a flat and light-tailed
distribution.
• Significance tests of skew and kurtosis should not be used in large
samples (because they are likely to be significant even when skew and
kurtosis are not too different from normal).
Assumptions II e: Normality - Values of Skewness/Kurtosis
• Standard rule:
• Statistic > 1 = positive (right) skewed
• Statistic < -1 = negative (left) skewed
• Statistic between -1 and 1 is fine
• Strict rule:
• Abs(Statistic) > 3 × Std. Error = skewed (Hair et al.)
• Practical purposes…
• Problems arise outside of (+/-) 2.2
• Sposito, V. A., Hand, M. L., & Skarpness, B. (1983). On the efficiency of using the sample kurtosis in selecting
optimal Lp estimators. Communications in Statistics – Simulation and Computation, 12(3), 265–272.
• Loose rule: kurtosis > 10 is problematic (Kline, 2005)
• Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.). Guilford Press, New York, NY.
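These skewness/kurtosis rules can be sketched in Python with SciPy; `stats.kurtosis` returns excess kurtosis (normal = 0), which matches SPSS's convention, and the approximate standard error of skewness, sqrt(6/N), stands in for the value SPSS reports. The simulated data are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)   # right-skewed simulated scores

skew = stats.skew(x)        # positive: pile-up of low scores, long right tail
kurt = stats.kurtosis(x)    # excess kurtosis: positive = pointy, heavy-tailed

# "Standard rule" from the slide: |statistic| > 1 signals skew
right_skewed = skew > 1

# "Strict rule": |statistic| > 3 * SE; SE of skewness is roughly sqrt(6/N)
se_skew = np.sqrt(6 / len(x))
strict_skewed = abs(skew) > 3 * se_skew
```

Consistent with the slide's warning, the strict (significance-based) rule flags even mild skew in large samples, so the practical cutoffs are often preferred.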
• Non-normal continuous variables may need to be transformed: https://youtu.be/twwT6FgwlAo
Assumptions III: Homoscedasticity/Homogeneity of Variance
• When testing several groups of
participants, samples should
come from populations with the
same variance.
• Levene’s Test (In SPSS)
• Tests whether the variances of the groups are equal
• Significant = variances are not equal
• Non-significant = variances are roughly equal
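The slides run Levene's test in SPSS; the same test is available in SciPy (which, like SPSS's robust variant, centers on group medians by default). The three groups below are made-up scores, with the second deliberately more spread out:

```python
from scipy import stats

# Hypothetical scores from three groups (illustrative data)
g1 = [4, 5, 6, 5, 4, 6, 5]
g2 = [3, 7, 2, 8, 1, 9, 5]   # much more variable than the others
g3 = [5, 5, 6, 4, 5, 6, 5]

# Levene's test: H0 = the group variances are equal
stat, p = stats.levene(g1, g2, g3)
equal_variances = p > 0.05   # significant -> variances are not equal
```

Here the inflated spread of the second group should produce a significant result, i.e., evidence against homogeneity of variance.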
Assumptions IV: Independence
• This assumption means that the errors in your model (the error terms, errorᵢ) are
not related to each other.
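One common way to check this assumption for regression residuals, not shown in the slides but easy to sketch, is the Durbin-Watson statistic, computed here directly with NumPy on simulated independent errors:

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest uncorrelated errors;
    values near 0 or 4 suggest positive or negative autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(1)
e_indep = rng.normal(size=500)   # simulated independent error terms
dw = durbin_watson(e_indep)      # expect a value close to 2
```

With truly independent errors the statistic hovers around 2; marked departures flag related (autocorrelated) errors and thus a violated independence assumption.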
Reducing Bias
• Trim the data: Delete a certain amount of scores from the extremes.
• Winsorizing: Substitute outliers with the highest value that isn’t an
outlier.
• Analyse with robust methods: This typically involves a technique
known as bootstrapping.
• Transform the data: This involves applying a mathematical function
to scores to try to correct any problems with them.
• Log Transformation (log(Xi))
• Reduce positive skew
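Two of these bias-reduction techniques, winsorizing and the log transformation, can be sketched in Python; the scores are illustrative, and the winsorizing limit of 15% is an arbitrary choice that here replaces only the single most extreme score:

```python
import numpy as np
from scipy.stats.mstats import winsorize

scores = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 50.0])  # one extreme score

# Winsorizing: replace the top scores with the highest value that
# isn't an outlier (here, the top 15% -> the single value 50.0 becomes 4.5)
wins = winsorize(scores, limits=[0, 0.15])

# Log transformation, log(Xi), to reduce positive skew (scores must be > 0)
logged = np.log(scores)
```

Winsorizing keeps the sample size intact (unlike trimming), while the log transform compresses the long right tail rather than replacing any values.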