Session 2 - QRT - Oct 3, 2020
Techniques
Session – 2
Data Screening and Evaluation of Assumptions
Agenda: Sources of Bias and Possible Solutions
• Missing Values
• Outliers
• Violation of Assumptions
Missing Value Analysis I: Handling Missing Data
• Little’s Missing Completely at Random (MCAR) test
• Tests whether the data are missing completely at random
• Significant (p < .05) = data are not missing completely at random
• Non-significant (p > .05) = data are consistent with missing completely at random
https://www.nctm.org/Classroom-Resources/Illuminations/Interactives/Line-of-Best-Fit/
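The slides run missing-value analysis in SPSS. As a rough equivalent, the first step (before Little's MCAR test, which has no standard SciPy implementation) can be sketched in Python with pandas; the data frame below is purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with missing values (illustrative only)
df = pd.DataFrame({
    "age":   [21, 25, np.nan, 30, 22, np.nan],
    "score": [3.5, np.nan, 2.1, 4.0, 3.8, 2.9],
    "group": ["A", "B", "A", "B", "A", "B"],
})

# Count and percentage of missing values per variable
missing_count = df.isna().sum()
missing_pct = df.isna().mean() * 100

# Missingness patterns: frequency of each unique True/False row of flags,
# a quick visual check before any formal MCAR test
patterns = df.isna().value_counts()
```

Inspecting `patterns` shows whether missingness clusters in particular combinations of variables, which is informal evidence against MCAR.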
Outliers I: Handling Univariate Outliers
• Univariate outliers should be examined on a case-by-case basis.
• In SPSS – Boxplot and Histogram
• Dataset: Download Festival
• If the outlier is truly abnormal, and not representative of your
population, then it is okay to remove it. But this requires careful
examination of the data points
• e.g., you are studying dogs, but somehow a cat got hold of your survey
• e.g., someone answered “3” for all 45 questions on the survey
• However, just because a data point does not fit comfortably within the
distribution does not automatically nominate that data point for removal.
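The slides identify univariate outliers from boxplots in SPSS. The same boxplot rule (values beyond 1.5 × IQR from the quartiles, the default "whisker" cutoff) can be sketched in Python; the scores below are made up for illustration:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag values beyond k * IQR from the quartiles (the boxplot rule)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return x[(x < lower) | (x > upper)]

scores = [2.1, 2.4, 2.2, 2.5, 2.3, 2.6, 9.8]  # one suspicious score
outliers = iqr_outliers(scores)  # flags only the extreme 9.8
```

As the slide stresses, a flagged score is a candidate for inspection, not automatic removal.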
Assumptions
• Parametric tests assume underlying statistical distributions in the data.
• Parametric tests based on the normal distribution assume:
• Additivity and linearity
• Normality (of the sampling distribution, or of the residuals)
• Homogeneity of Variance
• Independence
• Violations can affect the following things we do when we fit models to data:
• Parameter estimates
• Confidence intervals around a parameter
• Null hypothesis significance testing
Assumptions I: Additivity and linearity
• It’s a bit like calling your pet cat a dog: you can try to get it to fetch sticks, or to sit when you tell it to,
but don’t be surprised when its behavior isn’t what you expect, because even though you’ve called it a dog,
it is in fact a cat.
• Similarly, if you have described your statistical model inaccurately it won’t behave itself, and there’s no point
in interpreting its parameter estimates or worrying about significance tests or confidence intervals: the model
is wrong.
Assumptions II a: Normality - Skewness
Assumptions II b: Normality - Kurtosis
Assumptions II c: Normality
• We don’t have access to the sampling distribution, so we usually test the observed data
• Central Limit Theorem
• If N > 30, the sampling distribution tends to be approximately normal regardless of the shape of the population
• Graphical displays (In SPSS)
• P-P Plot (or Q-Q plot)
• If values fall on the diagonal of the plot then the variable is normally distributed;
• When the data sag consistently above or below the diagonal then this shows that the kurtosis differs from a normal distribution
• when the data points are S-shaped, the problem is skewness.
• Histogram
• Kolmogorov-Smirnov Test (In SPSS)
• Tests whether the data differ from a normal distribution
• Significant = non-normal data
• Non-significant = data do not differ significantly from normal
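The slides run these normality tests in SPSS. A comparable check can be sketched in Python with SciPy, here using the Shapiro-Wilk test (a common alternative to Kolmogorov-Smirnov; note that SPSS's K-S variant estimates parameters from the data, which a plain K-S test against a fixed normal does not). The simulated samples are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=50, scale=10, size=200)   # simulated, roughly normal scores
y = rng.exponential(scale=1.0, size=200)     # simulated, strongly right-skewed

# Shapiro-Wilk test: H0 = data come from a normal distribution
stat_x, p_x = stats.shapiro(x)   # expect a non-significant p for normal data
stat_y, p_y = stats.shapiro(y)   # expect a very small p for skewed data
```

As on the slide: a significant result means the data differ from normal; a non-significant result means no evidence of non-normality (not proof of normality).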
Assumptions II d: Normality - Values of Skewness/Kurtosis
• Positive values of skewness indicate too many low scores in the
distribution, whereas negative values indicate a build-up of high
scores.
• Positive values of kurtosis indicate a pointy and heavy-tailed
distribution, whereas negative values indicate a flat and light-tailed
distribution.
• Significance tests of skew and kurtosis should not be used in large
samples (because they are likely to be significant even when skew and
kurtosis are not too different from normal).
Assumptions II e: Normality - Values of Skewness/Kurtosis
• Standard rule:
• Statistic > 1 = positive (right) skewed
• Statistic < -1 = negative (left) skewed
• Statistic between -1 and 1 is fine
• Strict rule:
• Abs(Statistic) > 3 × Std. Error = skewed (Hair et al.)
• Practical purposes…
• Problems arise outside of (+/-) 2.2
• Sposito, V. A., Hand, M. L., & Skarpness, B. (1983). On the efficiency of using the sample kurtosis in selecting
optimal Lp estimators. Communications in Statistics – Simulation and Computation, 12(3), 265–272.
• Loose rule: kurtosis > 10 is problematic (Kline, 2005)
• Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.). Guilford Press, New York, NY.
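These skewness/kurtosis rules can be sketched in Python with SciPy; `stats.kurtosis` returns excess kurtosis (normal = 0), which matches SPSS's convention, and the approximate standard error of skewness, sqrt(6/N), stands in for the value SPSS reports. The simulated data are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)   # right-skewed simulated scores

skew = stats.skew(x)        # positive: pile-up of low scores, long right tail
kurt = stats.kurtosis(x)    # excess kurtosis: positive = pointy, heavy-tailed

# "Standard rule" from the slide: |statistic| > 1 signals skew
right_skewed = skew > 1

# "Strict rule": |statistic| > 3 * SE; SE of skewness is roughly sqrt(6/N)
se_skew = np.sqrt(6 / len(x))
strict_skewed = abs(skew) > 3 * se_skew
```

Consistent with the slide's warning, the strict (significance-based) rule flags even mild skew in large samples, so the practical cutoffs are often preferred.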
• Non-normal continuous variables may need to be transformed: https://youtu.be/twwT6FgwlAo
Assumptions III: Homoscedasticity/Homogeneity of Variance
• When testing several groups of
participants, samples should
come from populations with the
same variance.
• Levene’s Test (In SPSS)
• Tests whether the variances of the groups are equal
• Significant = variances are not equal
• Non-significant = variances are roughly equal
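The slides run Levene's test in SPSS; the same test is available in SciPy (which, like SPSS's robust variant, centers on group medians by default). The three groups below are made-up scores, with the second deliberately more spread out:

```python
from scipy import stats

# Hypothetical scores from three groups (illustrative data)
g1 = [4, 5, 6, 5, 4, 6, 5]
g2 = [3, 7, 2, 8, 1, 9, 5]   # much more variable than the others
g3 = [5, 5, 6, 4, 5, 6, 5]

# Levene's test: H0 = the group variances are equal
stat, p = stats.levene(g1, g2, g3)
equal_variances = p > 0.05   # significant -> variances are not equal
```

Here the inflated spread of the second group should produce a significant result, i.e., evidence against homogeneity of variance.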
Assumptions IV: Independence
• This assumption means that the errors in your model (the error terms, errorᵢ) are
not related to each other.
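One common way to check this assumption for regression residuals, not shown in the slides but easy to sketch, is the Durbin-Watson statistic, computed here directly with NumPy on simulated independent errors:

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest uncorrelated errors;
    values near 0 or 4 suggest positive or negative autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(1)
e_indep = rng.normal(size=500)   # simulated independent error terms
dw = durbin_watson(e_indep)      # expect a value close to 2
```

With truly independent errors the statistic hovers around 2; marked departures flag related (autocorrelated) errors and thus a violated independence assumption.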
Reducing Bias
• Trim the data: Delete a certain amount of scores from the extremes.
• Winsorizing: Substitute outliers with the highest value that isn’t an
outlier.
• Analyse with robust methods: This typically involves a technique
known as bootstrapping.
• Transform the data: This involves applying a mathematical function
to scores to try to correct any problems with them.
• Log Transformation (log(Xi))
• Reduce positive skew
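Two of these bias-reduction techniques, winsorizing and the log transformation, can be sketched in Python; the scores are illustrative, and the winsorizing limit of 15% is an arbitrary choice that here replaces only the single most extreme score:

```python
import numpy as np
from scipy.stats.mstats import winsorize

scores = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 50.0])  # one extreme score

# Winsorizing: replace the top scores with the highest value that
# isn't an outlier (here, the top 15% -> the single value 50.0 becomes 4.5)
wins = winsorize(scores, limits=[0, 0.15])

# Log transformation, log(Xi), to reduce positive skew (scores must be > 0)
logged = np.log(scores)
```

Winsorizing keeps the sample size intact (unlike trimming), while the log transform compresses the long right tail rather than replacing any values.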