PSY4062 Psychological Inquiry 2: Theory, Methods, and Practice 2

Introduction to Data Cleaning and Bias in Analysis

Written by: Dr Michelle Schilders

Table of Contents

1 Overview
2 What is Bias
3 Statistical Analysis and Bias
4 Outliers creating bias
   4.1 Sample size and outliers
      4.1.1 Outliers in small samples
      4.1.2 Outliers in large samples
      4.1.3 Outliers in very large samples
   4.2 Outliers can be more than one case
      4.2.1 Summary
5 Assumptions creating bias
   5.1 Linearity
   5.2 Normality
      5.2.1 Methods of assessing normality
      5.2.2 Assuming normality using the Central Limit Theorem
   5.3 Homoscedasticity/homogeneity of variance
      5.3.1 When does homoscedasticity/homogeneity of variance matter
   5.4 Independence/Independence of errors
      5.4.1 Ways to assess Independence of Errors
6 Data Screening, checking for outliers and assessing assumptions
   6.1 Data Cleaning/Screening
      6.1.1 Checking the variables have been set up appropriately
      6.1.2 Checking for out of range and missing data
   6.2 Identifying the presence of outliers and assessing assumptions
      6.2.1 Z-scores
      6.2.2 Graphs for one variable to assess for outliers and normality
      6.2.3 Scatterplots of two variables to assess for outliers and linearity
      6.2.4 Kolmogorov-Smirnov test and Shapiro-Wilks test to test for normality
      6.2.5 Histogram of the residuals and the normal P-P plot of regression standardized residuals to assess normality of residuals
      6.2.6 Scatterplot of residuals to assess linearity and homoscedasticity of residuals
      6.2.7 Levene's Test for assessing homogeneity of variance
      6.2.8 Hartley's Fmax
7 Reducing bias
   7.1 Trimming the data
      7.1.1 Excluding the outlier
      7.1.2 Trimming based on a specific percentage or standard deviation-based rule
   7.2 Winsorizing
   7.3 Robust methods
   7.4 Transforming data
8 Summary


1 Overview
In this lesson we’re going to explore common sources of bias that can occur when running statistical analyses. The topics include:

• What is bias
• Statistical analysis and bias
• Outliers creating bias
• Assumptions creating bias
• Data screening, checking for outliers and assessing assumptions
• Reducing bias

The topics covered in this lesson provide a framework for considering how bias can impact the results obtained when running any of the analyses covered in this unit. Therefore, this lesson will be useful to return to throughout the unit: if biases are identified when running a chosen statistical analysis, it will enable you as a budding researcher to select an approach to address the bias.

2 What is Bias
Bias in statistical analysis is a very important consideration because biased data can distort the statistical results obtained; in plain language, the results contain a large amount of error and may not accurately answer the research question.

When we conduct statistical analyses, biases can be introduced in a multitude of ways, and we will focus on some of the most common, which include outliers and violations of assumptions.

3 Statistical Analysis and Bias


Previously you learnt about t-tests and one-way ANOVA which can be broadly described as analyses that have
one dependent variable and one independent variable, and we will extend this knowledge to factorial ANOVAs
where we can have two or more independent variables and one dependent variable. Similarly, you have learnt
about simple regression that includes one predictor and one outcome variable and, in this unit, we will be
introducing multiple regression, hierarchical regression and stepwise regression where we are able to include 2
or more predictors to predict an outcome variable. Therefore, prior knowledge about these analyses and key
terms is important in this unit. If you feel a little rusty on what you have previously learnt, that’s okay, just take
some time to refresh and hopefully, the information learnt will come flooding back.

The biases discussed in this chapter are relevant to the previous analyses learnt as well as the new analyses that
will be introduced during this unit. As there are many ways that we can address biases when running statistical
analyses, we will focus on delving into the most used options because as a budding researcher it will be up to
you to develop your own preferences by drawing upon accepted alternative approaches. This lesson has been
developed to be a resource that you return to again and again during the unit as you will be able to try out
different options using the datasets available.

4 Outliers creating bias


The presence of one or more outliers can bias our statistical analyses and results. An outlier is one or more participants’ scores, on one or more variables, that are considerably different from all the other participants’ scores. An outlier is problematic as it has a domino effect on our calculations. For example, an outlier that has an undue influence on the data will impact the sum of squared errors, and as this value is used to compute the standard deviation, the standard deviation will also be impacted. The standard deviation is used to estimate the standard error, which will also be impacted. The standard error is used to calculate confidence intervals for the parameter estimate and test statistics, so these would also be biased. Hence, a full range of statistics that we use to test hypotheses and to understand a psychological phenomenon would all be impacted. This should highlight why it
is so important to assess for outliers and if they are having an undue influence on the data, selecting an option
to address the problem so that the results that we obtain more accurately reflect the data and minimise error.
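To make the domino effect concrete, the chain of calculations can be written out using the usual sample formulas:

$$s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}, \qquad SE_{\bar{x}} = \frac{s}{\sqrt{N}}$$

A single extreme score inflates the sum of squared deviations, which inflates the standard deviation s, which in turn inflates the standard error; anything computed from the standard error (confidence intervals, test statistics) inherits the distortion.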

4.1 Sample size and outliers

Sample size is one of the key factors that can influence whether an outlier has an undue influence on our results. Let’s utilise scatterplots to illustrate how sample size can impact regression results.

4.1.1 Outliers in small samples


Outliers can be particularly problematic in small samples. In the graph (See Figure 1a) below the outlier is in the
bottom right corner and the other data points are clustered close together in the top left-hand corner. If we
were to run a simple regression the predictor would be significant, and our R-square value would be .72 when
the outlier is included. However, we should consider what might happen if we remove the outlier. After
removing the outlier (See Figure 1b), the result would change, the predictor would no longer be a significant
predictor and the R-square value would be .04.
Figure 1 – Small sample with and without an outlier

a. Significant predictor (R-square = .72) b. Not a significant predictor (R-square = .04)

This example demonstrates how one extreme value can substantially impact the results obtained. As a
researcher, we wouldn’t want to report the results with the outlier, as the presence of the outlier has created an
artificially significant result. Instead, we would want to address the outlier and report the results with the outlier
not having an undue influence on the results.

4.1.2 Outliers in large samples


An outlier can also impact the results of a large sample (N = 30). In the graph (See Figure 2a) the outlier is in the
bottom right corner and the other data points are spread out from the bottom left to the top right. If we ran a
simple regression the predictor would be significant, and the R-squared value obtained would be .41. However,
if we were to remove the outlier (See Figure 2b) and re-run the analysis the predictor would still be significant, however the R-squared value increases to .79. This is almost double the amount of variance explained compared to the results with the outlier included.
Figure 2 – Large sample (N = 30) with and without an outlier

a. Significant predictor (R-square = .41) b. Significant predictor (R-square = .79)


This example also demonstrates how one extreme value can substantially impact the results obtained, even
when there is a large sample. However, given that the results improve when the outlier is removed, some researchers would question removing it unless there was a strong reason to believe that this participant is a true outlier, or in other words, not representative of the population of interest. It could be argued that by removing the outlier we are artificially improving our model or manipulating our data to make the results look better. Therefore, caution should be taken as to whether this outlier should or shouldn’t be removed from the dataset. For me, I would keep an eye on this case and see if it comes up again as a problematic case before deciding whether to remove it.

4.1.3 Outliers in very large samples


In very large samples an outlier may have minimal impact on the results. Consider the following example where
100 participants were recruited, and their scores are plotted in the scatterplot (see Figure 3a). The outlier is in
the bottom right-hand corner and the other data points are scattered a moderate distance around the line of
best fit. The results suggest that the predictor would be a significant predictor and the R-squared value would be .19.
Figure 3 – Very large sample (N = 100) with and without an outlier

a. Significant predictor (R-square = .19) b. Significant predictor (R-square = .24)

After removing the outlier, the predictor is still a significant predictor, and the R-squared value is .24 (see Figure
3b). In this case, the R-squared value has increased only very slightly. Therefore, we can conclude that the outlier
has a minimal impact on the results.

This example demonstrates how one extreme value can have minimal impact on the results especially when the
sample size is very large. As a researcher, we would always check for the presence of outliers irrespective of the
sample size. If an outlier is identified, we would want to make a note of this case or cases and see if the same
cases come up when we do other checks of outliers and influential cases.

4.2 Outliers can be more than one case

While the above examples focus on a single outlier, it is important to note that outliers can also be multiple cases that are considerably different from most of the other scores. It is also possible that outliers aren’t as extreme as those depicted in the above graphs; they can be more subtle and still impact the data. Therefore, we will usually use a range of options to assess outliers, as influential outliers can dramatically reduce the power of significance tests; these options are discussed in Section 6.2.

4.2.1 Summary
The type of impact that the outlier/s has on the results will inform our decision on what to do. Below are some general rules that we can use to decide what to do when outlier/s are identified.

1. The results change substantially in a positive direction. For example, the results with the outlier are
significant and the results without the outlier are not significant. We would want to address the impact
that the outlier is having on the results. This is because there is evidence that the outlier is having an
undue positive influence on our results, as the results are only significant when the extreme value (outlier) is included. If we were to not address the outlier, we would be at risk of being criticised for
unduly including an outlier just to obtain significant results.
2. The results change substantially in a negative direction. For example, the results with the outlier are not
significant and the results without the outlier are significant. Unlike the above, we would want to be
able to clearly demonstrate that the outlier is not representative of the population of interest to justify
removal of the case. This is because we would risk being criticised for removing the outlier only because
we wanted to obtain significant results.
3. The results don’t really change. For example, the results with the outlier and without the outlier are
very similar. We would want to retain the outlier because there is no evidence it is having an undue
influence on the results.

It is best to consider an outlier as a case that requires additional attention; it doesn’t necessarily mean that we will need to address it. Whenever possible, as researchers we aim to keep every participant in the dataset, as they have spent valuable time and effort participating in our study. We only resort to removing a participant, or using statistical procedures to modify their scores, if we have no other choice, such as when the participant’s scores are influential cases affecting our results.

5 Assumptions creating bias


The second cause of bias is when there is a violation of assumptions. For all parametric tests, there are specific
assumptions that need to be met to ensure that the analysis works correctly and that the results (i.e., test
statistics, p-values etc) produced are an accurate representation of the data and we can draw appropriate
conclusions from the results. However, if the assumptions are violated then the results will be inaccurate, and
we would draw incorrect conclusions.
The parametric analyses that you have previously learnt about include t-tests, one-way ANOVAs, correlation and
simple regression and we expand upon these analyses to include factorial ANOVAs and multiple regression.
While different combinations of assumptions apply to these analyses, we can summarise the assumptions as
including the following:
• Linearity
• Normality
• Homoscedasticity and homogeneity of variance
• Independence

5.1 Linearity

Let’s start with linearity. This is the assumption for correlation and regression that a straight-line relationship exists between the variables, or in other words, that the line of best fit best represents the relationship among the data points. Let’s consider two examples to illustrate this:

• Correlation – the two variables are either positively or negatively related


• Regression – the predictor/s are linearly related to the outcome variable and therefore the predictor
or combined predictors are best at predicting the outcome variable.
o Note when we have more than one predictor then this is often referred to as additivity

The assumption of linearity is critically important for correlation and regression because if there is a curvilinear
or any other type of non-linear relationship between the variables then even if all the other assumptions are
met, the results generated will not be valid. This is because a linear model cannot be used to describe a non-
linear relationship. Therefore, it is of utmost importance to ensure that there is a linear relationship between
the variables when performing correlation and a linear relationship between any predictor and outcome variable
when performing regression. For regression, when there is a linear relationship (and the other assumptions are
met and there are no influential outliers) then the parameter estimates obtained from the model can be
accurately used in the regression equation to make predictions.


5.2 Normality

The normality assumption refers to the sampling distribution or the residuals of the model being normally
distributed rather than the data itself. However, we don’t have access to either the sampling distribution or the
residuals of the model; therefore, we need to turn to other ways to assess the assumption of normality. A commonly used approach is to assess normality for the data itself, such as whether the data for a variable is normally distributed or not. If the data is normally distributed then we can assume that the errors in the model
and the sampling distribution are also normally distributed.

For t-tests, ANOVA and correlation normality is tested by the normal distribution of scores for each of the
variables and for regression normality is tested by the normal distribution of the residuals.

Normality affects the way that models fit the data and how we assess the fit. Specifically, it includes the
following:

• Parameter estimates
o Mean: The mean score of a variable is a parameter, and when there is an influential outlier in
the dataset this can bias the mean obtained. For example, when the outlier/s is a higher score
than all the other scores the mean will become inflated (i.e., higher mean than without the
outlier). In contrast, when the outlier/s is a lower score than all the other scores then the mean
will be deflated (i.e., lower mean than without the outlier). Therefore, the non-normal
distribution that contains influential outliers will affect the parameter estimates.
o Median: Other parameter estimates such as the median are less biased (i.e., impacted) by non-
normal distributions or outliers.
• bs in the regression equation
o Regression: All models/results will contain some error, as it is never possible to obtain a model that perfectly fits the data. In regression, the error term, referred to as the residuals, is used to assess normality. Specifically, if the residuals are normally distributed in the population, then we can use the method of least squares to estimate the parameters (i.e., the bs in the regression equation). In addition, the regression equation will be a better estimate of the outcome variable than other models such as the mean alone.
• Confidence intervals
o Confidence intervals such as the 95% confidence interval are values that are calculated around a parameter estimate such as the mean or the bs in the regression equation. Confidence intervals are calculated using the standard normal distribution (see the formula after this list); therefore, if the data does not have a normal sampling distribution, the 95% confidence interval would not represent the data accurately.
• Hypothesis / significance testing
o When we test a hypothesis through significance testing such as for a t-test, ANOVA,
correlation, or regression we assume that the parameter estimates are normally distributed.
Each of the statistical analyses utilise test statistics that rely on normal distribution. Therefore,
if the data is normally distributed then we can be confident that the test statistics and the p-
values are accurate. However, if the data is not normally distributed then the test statistics
and the p-values will not be accurate.
o When performing significance tests of models, it is important that the sampling distribution be normal for the parameter estimates to be accurate. However, there are slight differences
in how this is tested. For example, a repeated measures t-test requires the difference between
means to be normally distributed. However, in an independent measures t-test the sampling
distribution of each group must be normally distributed. Whereas for correlation each variable
must be normally distributed.
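As a concrete illustration of the point about confidence intervals above, a 95% confidence interval around a parameter estimate (for example a regression b) is built from critical values of the standard normal distribution:

$$95\%\ \text{CI} = \hat{b} \pm 1.96 \times SE(\hat{b})$$

If the sampling distribution of the estimate is not normal, the value 1.96 no longer marks off the middle 95% of estimates, so the interval misrepresents the uncertainty in the estimate.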


5.2.1 Methods of assessing normality


When we conduct parametric tests there are a range of methods that we can use to assess normality and
normality of residuals. These methods include subjective and objective measures.

• Subjective measures include graphical depictions of the data such as histograms and normal Q-Q plots
• Objective measures include Kolmogorov-Smirnov test and Shapiro-Wilks test

5.2.2 Assuming normality using the Central Limit Theorem


Objective measures to assess normality, used with or without graphs, are useful when the sample size is small (i.e., N < 30). However, when the sample size is large (i.e., N > 30) the Kolmogorov-Smirnov and Shapiro-
Wilks test can be overly sensitive. What this means is that both tests can produce significant p-values (i.e., p <
.05) indicating that normality has been violated, even when the deviations in normality are small and the data is
better described as being normally distributed. Therefore, when the sample size is large (i.e., N > 30) the Central
Limit Theorem can be used to assume normality.

So, what is the Central Limit Theorem? The central limit theorem states that if you have a population with mean
μ and standard deviation σ and take sufficiently large (i.e., N > 30) random samples from the population with
replacement, then the distribution of the sample means will be approximately normally distributed. Or in other
words when our sample size is greater than 30 (or 30 per group when we need to ensure normality for each
group such as in independent measures t-test or ANOVA) we can apply the Central Limit Theorem and assume
that the data is normally distributed without needing to conduct any tests of normality.
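Formally, for random samples of size N drawn from a population with mean μ and standard deviation σ, the Central Limit Theorem says the sampling distribution of the mean is approximately normal once N is sufficiently large, regardless of the shape of the population:

$$\bar{X} \sim N\!\left(\mu,\ \frac{\sigma^{2}}{N}\right) \quad \text{(approximately)}$$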

5.2.2.1 When does the assumption of normality matter?


There are a large range of situations when we can assume normality by applying the Central Limit Theorem
irrespective of the shape of the data. As mentioned, this is the case when the overall sample size is 30 or greater,
or if comparing groups each group contains 30 or more participants. As discussed, normality affects the way that
models fit the data and how we assess the fit. Therefore, we can also apply the central limit theorem to:

• Confidence intervals – When the sample size is large, we can apply the Central Limit Theorem and the
estimate will come from a normal distribution irrespective of whether the sample or population data is
normally distributed. Hence, when we want to generate confidence intervals, we don’t need to test the
assumption of normality if the sample size is large (i.e., N > 30).
• Significance testing – When undertaking significance testing the sampling distribution needs to be
normal to ensure that the model generated is accurate. The Central Limit Theorem is useful for
significance testing when the sample size is large (i.e., N > 30) as it can be assumed that the sampling
distribution is normally distributed irrespective of the shape of the population.
• bs in the regression equation – For regression the focus is on normality of residuals rather than a normal sampling distribution. Hence, for the bs in the regression equation (in other words, the estimates of the model parameters) to be reliable, the residuals in the population need to be normally distributed. Because the method of least squares simply finds the parameter values that minimise the squared error, the estimates themselves can be obtained without assuming anything about normality (i.e., a normal sampling distribution).

As outlined, there is a large range of benefits from having a large sample (i.e., N > 30) and being able to apply
the Central Limit Theorem to assume normality. This demonstrates that when the sample size is large, we don’t
need to worry about testing the assumption of normality as we can apply the Central Limit Theorem. However,
when the sample size is small, we need to ensure that the assumption of normality is met to ensure that we
obtain an accurate model, including confidence intervals, parameter estimates, and so forth.

However, it is important to note that even when the sample size is large (i.e., N > 30) and normality can be assumed based on the Central Limit Theorem, influential outliers are still a concern as they can still bias the results.


5.3 Homoscedasticity/homogeneity of variance

The next two assumptions we will explore are homoscedasticity and homogeneity of variance, which both relate to variance. Homoscedasticity means that the variance of the dependent variable is equal or similar at all levels of the predictor variable. Homogeneity of variance is similar in that it also looks at variance; however, it focuses on all the groups having the same or similar variance. For correlation, homoscedasticity is an assumption, and for regression homoscedasticity of residuals is an assumption. In comparison, homogeneity of variance is an assumption for t-tests and ANOVAs.

Both assumptions impact on the following:

• Parameter estimates – when the variance of the outcome variable is equal across different values of
the predictor then the parameter estimates of the model will be accurate. In contrast when the variance
of the outcome variable is not equal across different values of the predictor then the parameter
estimates will not be accurate.
• Hypothesis / significance testing – When undertaking significance testing such as for regression the test
statistic will be accurate if the variance of the outcome variable is equal across different values of the
predictor. If the variance is not equal across different values of the predictor variable, then the test
statistic will not be accurate.

5.3.1 When does homoscedasticity/homogeneity of variance matter


In terms of the parameter estimates, these should be unbiased if the method of least squares is used and the assumption of homoscedasticity is met. However, if the assumption is not met then the estimates will not be optimal, which means that another way of estimating the model would be better. For example, an alternative to the method of least squares is weighted least squares, where each case is weighted as a function of its variance. If we only wanted to estimate the parameters of the model, then we would not need to be concerned about violating this assumption.

There are additional problems when homoscedasticity or homogeneity of variance is violated which cause bias. For example, the standard error associated with the parameter estimates in the model will be affected. As the standard error is used in other calculations, such as confidence intervals and p-values, these will also be affected and as a result will also be biased; extremely inaccurate confidence intervals can result from the assumptions being violated. Therefore, if you want to look at confidence intervals or test the significance of the model, then homogeneity of variance matters, and this is usually why we are running the analysis: to determine whether there are significant differences between groups, or significant relationships between variables, etc. It is important to note, however, that some tests have been developed to get around the assumption being violated, and we will explore them in future weeks.

5.4 Independence/Independence of errors

Independence is a methodological assumption that we need to build into the design to ensure that it is met when we collect data from participants. There are two main types of independence: (1) each participant provides only one set of scores, and (2) a participant should not influence any other participant’s scores. There are many ways to collect data that ensure independence is met; a common method is to administer questionnaires or to conduct an experiment where data is collected from each participant separately.

The assumption of independence would be violated if one participant provided more than one set of data. For
example, one participant decided to participate in the study more than once by completing the study
questionnaire twice. Alternatively, independence could be violated if participants confer with one another about
the answers that they give, therefore the two participants responses (i.e., scores) will be dependent on each
other. For example, two participants sitting next to one another in an experiment might discuss the answers
before answering questions, if this was to happen then the assumption would be violated.


The second type of independence is independence of errors, and this is a statistical assumption. Independence
of Errors means that for any two observations the residual terms (i.e., observed value minus the predicted value)
should not be correlated. When the residual terms are not correlated, we can conclude that there is
independence of errors as the distribution of errors is random and not influenced by or correlated to the errors
in prior observations. The opposite of independence is called autocorrelation.

When the assumption of independence is violated such as when one participant influences another participant’s
scores then the observations are not independent. This also has a flow-on impact: the errors in the model (i.e., the error in the regression equation) will also not be independent. Or in other words, the model that
predicts the outcome variable will contain some error for each of the participants, however, when independence
is violated then the error for the participants that conferred will be similar and therefore not independent. The
error in predicting one participant’s response would be influenced by the error in predicting the other
participant’s response.

The formula for the estimated standard error is only valid if observations are independent; therefore, if the assumption is violated, the estimated standard error is not valid. This is because the standard error is used to compute confidence intervals and significance tests, and if the standard error is inaccurate these values will also be inaccurate. There is therefore a flow-on impact on the results that we obtain, and it is important to ensure that this assumption is met.

5.4.1 Ways to assess Independence of Errors


One way we can check independence of errors is to use the Durbin Watson statistic, which can range between 0 and 4.

Table 1 - Interpreting the Durbin Watson statistic

D value | Description | Independence of errors range
D = 0 | The values are positively correlated |
D = 2 | There is no correlation | > 1.5 to < 2.5
D = 4 | The values are negatively correlated |

A Durbin Watson value of 2 denotes no correlation and an approximate range of > 1.5 to < 2.5 is used as a cut-
off to assume that the assumption of independence of errors is met. When the values are 0 to < 1.5 the values
are positively correlated therefore the assumption is not met. Similarly, when the values are > 2.5 to 4 there is
a negative correlation, and the assumption is also not met. We can obtain the Durbin Watson when we run the
regression analysis. The following Model Summary table presents the Durbin Watson statistic.

The Durbin Watson value is 1.75 and falls between 1.5 and 2.5, so we can conclude that the assumption of independence of errors is met.
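For reference, the Durbin Watson statistic can be requested as part of the regression run. A minimal sketch of the syntax is below; the variable names (quality_of_life as the outcome and happiness as the predictor) are placeholders used for illustration only.

* Simple regression; the Durbin Watson statistic is added to the Model Summary table.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT quality_of_life
  /METHOD=ENTER happiness
  /RESIDUALS DURBIN.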

One limitation of the Durbin Watson statistic is that it only tests independence of errors when we know the order in which observations were made. Time-series data is an example where we know the order of observations, such as weekly, fortnightly, monthly, quarterly, or yearly observations. Taking weeks as the example, the residual for W1 is ordered before the residual for W2, which is ordered before the residual for W3, and so on. Therefore, there is a logical way to order the residuals. In comparison, when we administer questionnaires, we cannot order
the residuals in any meaningful way. For example, there is no logical reason to order any participant before or
after any other participant. Therefore, running Durbin Watson is not appropriate when we use questionnaires,
and we would not test this assumption.

6 Data Screening, checking for outliers and assessing assumptions


Given that outliers and violating assumptions can bias the results that we obtain it is important to add checks
into the research process to screen the data for outliers and processes to check assumptions. There are many
ways to undertake these processes, and as a budding researcher you have a lot of opportunities to select the methods that you would like to use from a variety of established methods. While each analysis taught in this unit has a detailed example of how to use the analysis in research, the data screening options are not repeated there, as the same process can be used irrespective of the analysis being performed; we cover the data screening process here. In addition, there are often many ways to address bias created by outliers or assumption violations, so this resource provides some commonly used options rather than every option. Additional options for addressing outliers and checking assumptions are included here so that you may also utilise these when running an analysis, or substitute the option used in the SPSS videos with one of the options detailed here. As a budding researcher, it is important to be aware of the variety of tools that we have at our disposal and then select the option that we think would be most relevant to the data.

6.1 Data Cleaning/Screening

Prior to conducting any statistical analysis, we should undertake initial checks of the data that include the
following main topics:

1. Checking the variables have been set up appropriately


2. Check for out of range or missing data

Once we finish data screening and cleaning, we can then move on to checking for outliers and check for other
assumptions.

6.1.1 Checking the variables have been set up appropriately


Irrespective of whether we are using a data set that has been created by another person or are creating our own
dataset we should always start with spending a few minutes inspecting and checking how the dataset has been
set up in the variable view of SPSS. We should ensure that all variables have been correctly named, that label
names are provided for the levels of categorical variables, that the correct measurement has been selected and
that any other relevant information is correct.

6.1.2 Checking for out of range and missing data


The second step that we should take is to check for any missing data and ensure that all scores are within the
correct range. To be able to complete this step for each variable we need to know the minimum and maximum
score possible. For example, imagine that our Happiness Questionnaire used to measure Happiness had 20
questions, measured on a 5-point scale from 1 = strongly disagree to 5 = strongly agree. Therefore, the minimum
possible score is 20 (20 questions x 1 strongly disagree = 20) and the maximum possible score is 100 (20
questions x 5 strongly agree = 100). In addition, the Quality of Life Questionnaire was used to measure Quality
of Life and it has 30 questions measured on a 3 point scale from 1 = disagree to 3 = agree. Therefore, the
minimum possible score is 30 and the maximum possible score is 90. Lastly, only males and females were recruited, and they were labelled as 1 = males and 2 = females. Now that we know the range of scores we expect for each variable, we can undertake initial checks of the data to determine whether there are any missing or out-of-range data. To get started we can select Analyse → Descriptive Statistics
→ Descriptives to open the Descriptives dialogue box.

In the Descriptives dialogue box:

1. Select the happiness (with outlier), sex and quality of life variables and move them into the Variables box.
2. Select Options.

In the Options dialogue box:

1. Select Mean.
2. Select Minimum and Maximum.

Once back in the Descriptives dialogue box, select OK and inspect the output:

1. Happiness variable – the sample size is 20, which indicates that one participant is missing a score. While the possible range is from 20 to 100, the maximum score is 200, which indicates that there is an error in the data. We will need to look into this.
2. Sex – the sample size is 21, which indicates that there is no missing data, and the minimum and maximum are within the range of 1 and 2. Therefore, there is no issue with this data.
3. Quality of life – the sample size is 21, which indicates that there is no missing data, and the minimum and maximum scores are within the range of 30 and 90. Therefore, there is no issue with this data.
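For reference, the same output could be requested with syntax along the following lines. This is only a sketch; the variable names (happiness_outlier, sex, quality_of_life) are placeholders and would need to match the names used in your dataset.

* Valid N, mean, minimum and maximum for each variable, to spot missing and out-of-range values.
DESCRIPTIVES VARIABLES=happiness_outlier sex quality_of_life
  /STATISTICS=MEAN MIN MAX.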

From this initial check we have identified that there are no issues with the variables sex and quality of life
however there are two issues with the happiness variable.

6.1.2.1 Solutions for out-of-range data and missing values


We have identified that for happiness the maximum score of 200 exceeds the maximum possible score of 100. We will therefore return to the dataset, right mouse click on the heading of the happiness with outlier variable column and select Sort Descending to order the variable from largest to smallest values. We can see that one participant has a score greater than 100. This participant has a score of 200, which cannot be a true score. There are a few solutions that we can apply to solve this problem.

1. If we manually entered the data


a. Correct the value - we can consult the original questionnaire and correct the value
2. If we didn’t manually enter the data or don’t have access to the original questionnaire
a. Replace with the mean – If the errors in the variable are less than 5%, we can replace the
score/s with the mean for that variable.
b. Delete the participant – If the errors in the variable are greater than 5%, or if we don’t want to use the mean, we can delete the participant.
3. If we do not plan to use the variable in analysis, we don’t need to do anything.

In this case I am going to assume that we have access to the original data, and we can replace the score of 200
with the correct value of 80. While we could replace the incorrect score with the correct score, I will create a
new variable as I will be able to use the original variable to illustrate how an outlier will impact on the assumption
checks in Section 6.2. As we made the error and have been able to correct it, we don’t need to explain anything
in the write up of results.
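In syntax, one way this correction (and the creation of a new variable) could look is sketched below. The variable names are placeholders, and the corrected value of 80 is the value taken from the original questionnaire as described above.

* Copy the original variable and correct the out-of-range score of 200 to the true value of 80.
RECODE happiness_outlier (200=80) (ELSE=COPY) INTO happiness_corrected.
EXECUTE.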


Replacing the incorrect score with the mean is more beneficial when we have a small data set, and we want to
avoid losing participants. However, if we have a large data set, we can choose to delete the participant/s. When
selecting either of these options we will need to include what we did in the write up of results.

We also identified that the happiness variable contained one missing value. The solutions for addressing missing
data are the same as for out-of-range scores. In this case I am going to assume that I don’t have access to the
original questionnaires and even though the sample size is small I am going to delete the participant and in the
write up of the results I would make a note that one participant was deleted because they didn’t complete all
the questionnaires to be used in analysis.

Alternatively, as 1 case out of 21 is 4.8% of data being missing, I could instead use the mean. However, I shouldn’t
use the mean generated from the descriptives statistics above because there was an out-of-range score for this
variable and this likely inflated the mean. Therefore, it is important to address this out-of-range score and then
run the descriptives again and use that mean. We would also need to include in the write up of results how many
cases had missing data, the percentage of cases missing data and the mean that was used for the missing case.
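If the mean-replacement route were chosen, SPSS’s Replace Missing Values procedure can substitute the series mean. A minimal sketch is below; the variable name is a placeholder, and the descriptives should be re-run after the out-of-range score has been fixed so that the correct mean is used.

* Replace the missing value with the series (variable) mean of the corrected happiness variable.
RMV /happiness_imputed=SMEAN(happiness_corrected).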

6.1.2.2 Summary
Ensuring that the variables are set up correctly, as well as checking for out-of-range and missing cases, should be done before checking assumptions and running any analysis. In addition, it is good practice to add, at a minimum, a check of z-scores (discussed in Section 6.2.1) to the data screening process. It is then up to you as a researcher which other data checks you would like to perform from those discussed below.

6.2 Identifying the presence of outliers and assessing assumptions

Before data analysis, a range of strategies can be used to identify the presence of outliers and assess assumptions. Some of the options available to assess outliers can also be used to assess assumptions such as normality or homoscedasticity. The following table summarises some of the methods available to assess outliers and assumptions and, where relevant, notes what else each method can be used to assess. As some methods serve more than one purpose, these options are often preferred and are performed during data analysis.

Table 2 – Methods for identifying outliers and assessing assumptions

Check | Method | Can also be used for
Outliers | Z-scores |
Outliers | Histogram, stem and leaf plots, Q-Q plots and boxplots | Normality
Outliers | Scatterplots | Linearity and homoscedasticity
Normality | Histogram, stem and leaf plots, Q-Q plots and boxplots | Outliers
Normality | Kolmogorov-Smirnov test and Shapiro-Wilks test |
Normality of residuals | Histogram of residuals and the Normal P-P Plot of Regression Standardised Residuals |
Linearity | Scatterplots | Outliers and homoscedasticity
Linearity of residuals | Scatterplot of residuals | Homoscedasticity of residuals
Homoscedasticity | Scatterplots | Outliers and linearity
Homoscedasticity of residuals | Scatterplot of residuals | Linearity of residuals
Homogeneity of variance | Levene’s test |


6.2.1 Z-scores
Transforming the data for a variable into z-scores is a quick and easy way to conduct initial data screening to
identify potential outliers. Another benefit of z-scores is that they are useful to identify multiple outliers or to
identify outliers that are more subtle, which may not be able to be identified graphically (i.e., in histograms,
stem and leaf plots, Q-Q plots and boxplots). Z-scores are standardised to have a mean of 0 and a standard deviation of 1, and under a normal distribution they follow the bell curve illustrated in Figure 4.

Figure 4 – Normal distribution of standardised scores (z-scores): approximately 95% of cases fall within ±1.96 (about 5% beyond), about 99% fall within ±2.58 (about 1% beyond), and essentially all cases fall within ±3.29.

In Figure 4 we can see that approximately 5% of cases would be expected to be greater than 1.96 standard
deviations (we can round it to 2 for convenience), 1% of cases are expected to be greater than 2.58 (we can
round this value to 2.5 for convenience) and lastly none of the cases should have a z-score greater than 3.29 (we
can round this to 3 for convenience). Once we know the rules for interpreting z-scores we can generate z-scores.
To help illustrate the impact of the outlier we have two happiness variables one with and one without an outlier.
In SPSS we start by selecting Analyse → Descriptive Statistics → Descriptives.

1. Select the two happiness variables and move them into the Variables box.
2. Tick Save standardized values as variables to generate a new variable in the data set.
3. Select OK.

If we return to the Data View, we will be able to see that two extra columns of data have been generated by SPSS
and they contain the z-scores for each of the two happiness variables.


1. The z-scores have been generated for the two happiness variables.
2. We can see just by eyeballing the data that the participant in row 16 is the outlier, with a z-score of 3.77515.

While we could manually count the number of cases that fall into each cut-off level, the process becomes almost unbearable with really big datasets. Therefore, I want to show you a shortcut. As we will run the z-scores again, I will first delete the two z-score columns by selecting the two column headings, right-mouse clicking and selecting Clear from the drop-down menu. Now our dataset will look the same as it did when we first opened it.

We are going to do the same thing and open Analyse → Descriptive Statistics → Descriptives. We already set
up the dialogue box so this time we will select Paste (instead of OK) to open the syntax window.
1. The first five lines of the syntax are generated by selecting Paste.
2. For the first happiness variable, we write the next lines of syntax, which recode the z-score values using our specified cut-off rules and then generate a table summarising the information.
3. As we have two variables, we type the same lines again, modifying the variable name to reflect our second variable. A sketch of what the full syntax might look like is shown below.
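The sketch below illustrates the idea for one of the two variables. The exact variable names (happiness_outlier, happiness_nooutlier and the saved z-score variable Zhappiness_outlier) are assumptions used for illustration; SPSS names saved z-score variables by prefixing Z to the original variable name, and the recoded variable is named Zhappiness_R to match the column shown in the Data View.

* Generated by Paste: descriptives, with z-scores saved as new variables.
DESCRIPTIVES VARIABLES=happiness_outlier happiness_nooutlier
  /SAVE
  /STATISTICS=MEAN STDDEV MIN MAX.

* Added manually: recode the z-scores into the cut-off bands, label them and summarise them.
RECODE Zhappiness_outlier
  (LOWEST THRU -3.29=1) (3.29 THRU HIGHEST=1)
  (-3.29 THRU -2.58=2) (2.58 THRU 3.29=2)
  (-2.58 THRU -1.96=3) (1.96 THRU 2.58=3)
  (ELSE=4) INTO Zhappiness_R.
VALUE LABELS Zhappiness_R
  1 'Extreme (|z| > 3.29)' 2 '|z| > 2.58' 3 '|z| > 1.96' 4 'Normal range'.
FREQUENCIES VARIABLES=Zhappiness_R.

* Repeat the RECODE, VALUE LABELS and FREQUENCIES lines for the second happiness variable.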

Once finished select Run → All. If we go to the Data View, we now have four extra columns.


1. Similar to before, the first two new columns are the z-scores that have been generated for the two happiness variables.
2. The next two columns are the recoded values based on our cut-off rules.

Now we can review the output that we have generated.

We can compare the values for the two variables.

By comparing the values, we can see that although the minimum scores are the same, the maximum is much higher for the happiness variable with the outlier. There is also a flow-on impact on the mean and standard deviation: both values are higher when the outlier is included. This helps to illustrate how an outlier can impact the results obtained, especially the standard deviation.

Our next table is a summary of the z-scores for the happiness variable with the outlier included.

We can see that we have 5% of cases that are an extreme value greater than 3.29.

From the above table, we can see that while most cases (95%) are within the normal range we have 5% of cases
that are classified as extreme with a z-score larger than 3.29. When we originally looked at the dataset, we saw
that the participant in row 16 appeared to be an outlier. If we look at the data view again, we will see that it is
only this participant that has a score of 1 = extreme (z-score > 3.29) in the Zhappiness_R column. However, what
if our dataset was really large? In that case, eyeballing the data becomes harder and instead, we can click on the
Zhappiness_R column name, right mouse click on the column and select Sort Ascending from the drop-down
menu. This will enable us to identify any of the cases that have a value different from 4 indicating the normal
range.

Now we can inspect the same table but for the happiness variable that does not contain an outlier. We can see
that 100% of the cases are in the normal range as would be expected.

We can see that 100% of cases are in the normal range.

Running z-scores as part of the data cleaning and screening process is recommended for all analyses, to pick up individual, multiple, extreme, and subtle outliers. If a case or cases are identified as potential outliers, you can make a note of these cases and, during the process of analysis, decide whether they are influential cases (i.e., impacting on the results). If they don’t impact the results, then nothing needs to be done.
However, if they are influential cases, you will need to decide on an option on how to address this problem and
options of addressing outlier issues are detailed below in Section 7.

6.2.2 Graphs for one variable to assess for outliers and normality
We can also generate a range of graphs to inspect outliers for each variable of interest separately. These graphs
can also be used to assess the assumption of normality. To generate the graphs, select Analyse → Descriptive
Statistics → Explore.

In the Explore dialogue box:

1. Select the variables to check for outliers and move them to the Dependent List box.
2. Select Plots.
3. Click on Plots.

From the Plots dialogue box make the following selections:

1. Leave Stem and leaf selected and select Histogram.
2. Select Normality plots with tests.
3. Select Continue.

Once back to the Explore dialogue box select OK. We have generated two sets of output one set for the
happiness variable with an outlier and one set of output for the happiness variable without the outlier. We will
compare the two sets of graphs in Table 3.
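Equivalently, the same plots and normality tests could be produced with syntax along these lines (a sketch only; the variable names are placeholders for the two happiness variables).

* Explore: stem-and-leaf plots, histograms, boxplots, normality plots and the Kolmogorov-Smirnov and Shapiro-Wilks tests.
EXAMINE VARIABLES=happiness_outlier happiness_nooutlier
  /PLOT BOXPLOT STEMLEAF HISTOGRAM NPPLOT
  /STATISTICS DESCRIPTIVES.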

While these graphs are useful to explain how outliers and normality can be assessed using a range of options,
they also illustrate a second important point. The data for both happiness variables are identical apart from an
outlier being included in only one of the happiness variables. Therefore, these graphs also illustrate how
dramatically the presence of just one outlier can impact on how the remaining data points are illustrated in the
graphs. For example, the non-outlier data points are piled up in two columns in the first histogram graph when
the outlier is present. However, when the outlier is not present the data is presented in eleven columns.
How different the outlier is from the other scores will influence how much it distorts the way the rest of the data points are displayed.


Table 3 – Graphical methods to assess outliers and normality.

Histogram
With the outlier: Outlier – The outlier is clear on the right, although we can’t identify which case this is from the graph, and the rest of the data is piled up on the left. Normality – The data overall does not appear to be normally distributed.
Without the outlier: Outlier – There is no clear outlier depicted in this graph. Normality – Although the histogram doesn’t take on the perfect shape of the bell-shaped normal distribution curve, it does appear that the data is normally distributed.

Stem and Leaf Plot
With the outlier: Outlier – One case is classified as an extreme case, representing an outlier. Normality – This plot isn’t the best option to assess normality; however, the presence of the extreme case can bring to our attention that normality may be violated.
Without the outlier: Outlier – There are no extreme cases identified in this plot; therefore, outliers are not present. Normality – As there are no extreme cases, it could suggest that normality is met. However, this plot isn’t the best option to assess normality.

Normal Q-Q Plot
With the outlier: Outlier – The outlier is clear on the right, although we can’t identify which case this is from the graph, and the rest of the data is scattered below and above the diagonal line. Normality – The data overall does not appear to be normally distributed as there is not an equal amount of data points above and below the diagonal line.
Without the outlier: Outlier – There is no clear outlier depicted in this graph. Normality – There is approximately an equal amount of data points above and below the diagonal line, indicating that the data is normally distributed.

Detrended Normal Q-Q Plot
With the outlier: Outlier – The outlier is clear on the right, although we can’t identify which case this is from the graph, and the rest of the data is scattered below and above the horizontal line. Normality – The data overall does not appear to be normally distributed as there is not an equal amount of data points above and below the horizontal line.
Without the outlier: Outlier – There is no clear outlier depicted in this graph. Normality – There is approximately an equal amount of data points above and below the horizontal line, indicating that the data is normally distributed.

Boxplots
With the outlier: Outlier – The outlier is clear at the top of the graph as an isolated data point. The number 16 accompanying the data point enables us to identify which case is the outlier; the number refers to the row in the data view, therefore it is the participant in row 16. Being able to identify the specific case that is an outlier makes the boxplot very useful in the data cleaning and screening process. Normality – The presence of an outlier, with the rest of the data being in the box, indicates that the data may not be normally distributed.
Without the outlier: Outlier – There is no clear outlier depicted in this graph. Normality – All the data points are contained within the box, which suggests that the data is normally distributed.

Take a moment to compare the graphs and how different they appear when the outlier is present before moving
on.

When we generate these graphs, we will also obtain the Kolmogorov-Smirnov and the Shapiro-Wilks test which
can be used to assess normality. We will discuss how to interpret these tests in Section 6.2.4.

6.2.3 Scatterplots of two variables to assess for outliers and linearity


As we previously discussed in section 4.1, scatterplots are also a useful option for assessing the presence of
outliers on two variables simultaneously as we place one variable on the y-axis and a different variable on the
x-axis. While the other options for assessing outliers can be used irrespective of the analysis being performed
(i.e., t-tests, ANOVA, correlation, and regression), scatterplots are usually reserved for when conducting
correlation and regression.

Let’s run a scatterplot with our happiness variable, with and without the outlier, in combination with quality of life, our second scale variable. We will therefore obtain two scatterplots:

1. Quality of life and happiness with outlier


2. Quality of life and happiness without outlier


To get started select Graphs → Chart Builder. The following pop-up box will appear in the centre of the screen:

1. From the Choose from options, select Scatter/dot.
2. Select the first scatterplot (simple scatter) by either double-clicking on the simple scatterplot picture or clicking and dragging the scatterplot picture into the chart preview (the preview uses example data).
3. Select and drag Quality of life to the Y-axis.
4. Then select and drag Happiness with outlier to the X-axis.
5. Click OK.

Once the SPSS output opens, we can add a line of best fit to the scatterplot by double-clicking on the centre of the scatterplot to open the Chart Editor. In the Chart Editor select Elements → Fit Line at Total to include the line of best fit. Select Close to return to the SPSS output.

Before we discuss this graph, let's generate the scatterplot for quality of life and happiness without the outlier. To do this we just need to repeat the process and open the Chart Builder again by selecting Graphs → Chart Builder. We have already set up the scatterplot, so we just need to replace the happiness with outlier variable with the happiness without outlier variable and then click OK. Once the graph opens, repeat the steps to add a line of best fit to the scatterplot. Now we can compare the two scatterplots that we have generated.

Table 4 – Scatterplots with and without an outlier

Scatterplot with an outlier:
Outlier – The outlier is clear, sitting on its own to the right of the other data points. Note that we can't identify which case this is from inspecting the graph.
Linearity – All the data points including the outlier are relatively close to the line of best fit, which indicates that the data is linear and that the relationship is positive and likely to be moderate or strong, although it is likely that the outlier is impacting the position of the line of best fit.
Homoscedasticity – There is no evidence in the graph of the curving or funnelling of the data that would suggest the assumption of homoscedasticity is violated. Therefore, we could conclude that the homoscedasticity assumption has been met, although we would want to address the outlier.

Scatterplot without an outlier:
Outlier – There is no clear outlier depicted in this graph.
Linearity – All the data points are close to the line of best fit, which indicates that the data is linear and that the relationship is positive and likely to be strong.
Homoscedasticity – There is no evidence in the graph of the curving or funnelling of the data that would suggest the assumption of homoscedasticity is violated. Therefore, we could conclude that the homoscedasticity assumption has been met.

Comparison – Comparing the two scatterplots we can see that the outlier has shifted the line of best fit down substantially. Therefore, the graph without the outlier is a better representation of the data, and this highlights why we would want to address the outlier and then rerun the graph.

These examples illustrate how scatterplots can be useful to identify outliers, especially extreme outliers. In
addition, using the same data for both examples except for including an outlier for one of the happiness variables
also helps to illustrate how the graphs can change dramatically including shifting the line of best fit substantially
when there is an outlier present.
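
If you would like to reproduce this kind of plot outside SPSS, the sketch below shows the same idea in Python using NumPy and Matplotlib: a scatterplot of two variables with a least-squares line of best fit. The quality of life scores here are simulated for illustration; they are not the chapter's data file.

```python
# Minimal sketch: scatterplot with a line of best fit (hypothetical data,
# not the chapter's SPSS file).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
happiness = np.array([50, 51, 53, 54, 55, 58, 60, 61, 63, 69,
                      70, 75, 79, 80, 80, 83, 84, 90, 95, 98], dtype=float)
quality_of_life = 10 + 0.8 * happiness + rng.normal(0, 6, happiness.size)

# Least-squares line of best fit: np.polyfit returns the slope and intercept
slope, intercept = np.polyfit(happiness, quality_of_life, deg=1)

plt.scatter(happiness, quality_of_life)
plt.plot(happiness, slope * happiness + intercept)  # line of best fit
plt.xlabel("Happiness")
plt.ylabel("Quality of life")
plt.show()
```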

6.2.4 Kolmogorov-Smirnov test and Shapiro-Wilks test to test for normality


The Kolmogorov-Smirnov test and the Shapiro-Wilks test are two objective ways to assess normality. Both tests compare the scores in the sample to a normally distributed set of scores with the same mean and standard deviation. We obtained the Kolmogorov-Smirnov and the Shapiro-Wilks tests in Section 6.2.2 when we generated graphs to inspect outliers and assess normality, therefore that information won't be repeated here. We will just focus on interpreting the two tables that are generated in the output.

1. Happiness with outlier – we can see that the p-value is .021, which is less than .05, therefore normality has been violated. Happiness without outlier – we can see that the p-value is .200, which is greater than .05, therefore normality has not been violated.

2. Happiness with outlier – we can see that the p-value is < .001, which is less than .05, therefore normality has been violated. Happiness without outlier – we can see that the p-value is .210, which is greater than .05, therefore normality has not been violated.

Like the previous example, using the same data for the two examples except for the addition of one outlier demonstrates the dramatic impact that the outlier can have on normality. It also illustrates that, by addressing the outlier, normality can be achieved. This is particularly important in small samples (i.e., N < 30 participants), as there is no real alternative for overcoming a violation of normality. However, if the sample size is large (i.e., N > 30 participants) then we can apply the Central Limit Theorem and assume normality despite the Kolmogorov-Smirnov and Shapiro-Wilks test results indicating that the data is not normally distributed.
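
For readers working outside SPSS, a minimal sketch of the same two tests is shown below using Python's scipy and statsmodels. Note that SPSS's Kolmogorov-Smirnov test of normality applies the Lilliefors correction, so statsmodels' lilliefors() is the closer analogue; the scores below are illustrative rather than the chapter's data, so the p-values will not match the SPSS output.

```python
# Minimal sketch of the two normality tests on a hypothetical set of scores.
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

scores = [50, 51, 53, 54, 55, 58, 60, 61, 63, 69,
          70, 75, 79, 80, 80, 83, 84, 90, 95, 98]

w, p_shapiro = stats.shapiro(scores)         # Shapiro-Wilk test
ks_stat, p_lillie = lilliefors(scores)       # Lilliefors-corrected K-S test

print(f"Shapiro-Wilk: W = {w:.3f}, p = {p_shapiro:.3f}")
print(f"Lilliefors K-S: D = {ks_stat:.3f}, p = {p_lillie:.3f}")
# A p-value below .05 on either test suggests the normality assumption
# has been violated.
```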

6.2.4.1 Normality within groups and the split file command


When running normality checks for t-tests or ANOVA, the independent variable/s are categorical. Therefore, it is important to run normality checks on the dependent variable for each group separately. Steps on how to do this using the split file command are covered in the lessons explaining how to use factorial ANOVAs in research.

6.2.5 Histogram of the residuals and the normal P-P plot of regression standardized residuals to
assess normality of residuals
Normality of residuals is an assumption of regression. We can assess this assumption when we run the regression analysis, so it is not an assumption that we need to assess prior to running the regression.


To demonstrate how to generate these graphs we will run a simple regression with quality of life as the outcome
variable and happiness as the predictor variable. We will run the analysis twice, once with and once without the
outlier. To get started, select Analyse → Regression → Linear.

1. Click on Quality of Life and click on the arrow to move it to the Dependent box.

2. Select Happiness with outlier and click on the arrow to move it to the Independent(s) box.

3. Click on Plots to open the following dialogue box:

1. Select Histogram to produce the histogram of the residuals of the DV.

2. Select Normal probability plot to produce the Normal P-P Plot of Regression Standardized Residuals.

3. Select Continue to return to the Linear Regression dialogue box.

Now that we have made all our selections to generate the graphs we need to assess normality of residuals, we can click OK to run the simple regression analysis. As before, we will run the analysis one more time, this time replacing the happiness with outlier variable with the happiness without outlier variable, so that we can compare the graphs.

Table 5 – Graphs to assess normality of residuals

Histogram of residuals

With the outlier: In this histogram, the standardised residuals appear to be normally distributed, therefore the outlier isn't really depicted as having an impact.

Without the outlier: In this histogram, there is a large spike in the data in the centre, which suggests that the residuals are not normally distributed.



Normal P-P Plot of Regression Standardised Residuals

With the outlier: In this P-P plot, some of the data points are scattered in a horizontal manner relative to the diagonal line (see yellow ovals), which suggests that some of the residuals are not normally distributed. However, looking at the data points collectively, they are scattered mainly around the diagonal line, which suggests that the assumption of normality of residuals has been met.

Without the outlier: In this P-P plot, none of the data points appears to be horizontal to the diagonal line, therefore the residuals are normally distributed.

From the graphs, we can see different patterns emerging. When inspecting the graphs that included the happiness variable with an outlier, the histogram depicts the residuals as more normally distributed than the histogram for the happiness variable without the outlier. However, the opposite pattern is present for the P-P plots: the graph for the happiness variable without the outlier is a better depiction of the normality of residuals assumption being met than the graph for the happiness variable with the outlier. Taking the pattern of each set of graphs together, in both cases we can conclude that the assumption of normality of residuals has been met. If, however, the residuals in the P-P plot were depicted as being predominantly horizontal to the diagonal line, then the assumption of normality of residuals would be violated.
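
For an equivalent check outside SPSS, the sketch below uses Python's statsmodels to fit a simple regression and produce a histogram of the standardised residuals alongside a Normal P-P plot. The regression data are simulated for illustration, so the shapes will differ from the chapter's graphs.

```python
# Minimal sketch: histogram of standardised residuals and a Normal P-P plot
# for a simple regression (hypothetical data).
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.gofplots import ProbPlot

rng = np.random.default_rng(42)
happiness = np.array([50, 51, 53, 54, 55, 58, 60, 61, 63, 69,
                      70, 75, 79, 80, 80, 83, 84, 90, 95, 98], dtype=float)
quality_of_life = 10 + 0.8 * happiness + rng.normal(0, 6, happiness.size)

model = sm.OLS(quality_of_life, sm.add_constant(happiness)).fit()
std_resid = (model.resid - model.resid.mean()) / model.resid.std()

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
axes[0].hist(std_resid, bins=8)                     # histogram of residuals
axes[0].set_title("Standardised residuals")
ProbPlot(std_resid).ppplot(line="45", ax=axes[1])   # Normal P-P plot
axes[1].set_title("Normal P-P plot")
plt.tight_layout()
plt.show()
```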

6.2.6 Scatterplot of residuals to assess linearity and homoscedasticity of residuals


Linearity of residuals and homoscedasticity of residuals are also both assumptions of regression, and they can
also both be assessed when we run the regression analysis, therefore, they are not assumptions that we need
to assess prior to running the regression. Both assumptions can be tested simultaneously by generating and
inspecting the plot of standardised predicted values against standardised residuals. The graph we obtain should
broadly resemble one of the four shapes illustrated in Figure 6.

Figure 6 - Example Plots of Standardised Predicted Values against Standardised Residuals

Linear and homoscedastic – When both assumptions are met there should be no discernible pattern in the residuals. Rather, the residuals should be scattered in a roughly random pattern, and a square best represents the pattern among the residuals.

Heteroscedastic – When homoscedasticity is violated, we refer to the variability among the residuals as being heteroscedastic. The yellow lines in the graph help to illustrate how the residuals take the shape of a funnel.


Non-linear – When linearity is violated, we can refer to the variability as being non-linear. The yellow lines in the graph help to illustrate how the residuals take the shape of a curve.

Non-linear and heteroscedastic – When both linearity and homoscedasticity are violated, we refer to the variability as being non-linear and heteroscedastic. The yellow lines help to illustrate how the residuals both curve and funnel.

To demonstrate how to generate these graphs we again return to running a simple regression with quality of life as the outcome variable and happiness as the predictor variable. We will also run the analysis twice, once with and once without the outlier. The process is the same as the one we just used to assess the normality of residuals: select Analyse → Regression → Linear and then select Plots.

Next, we can assess the assumption of linearity and homoscedasticity of residuals by inspecting the scatterplot
of the residuals.

Once back in the Linear Regression dialogue box, select Plots to open the following dialogue box:

1. Select *ZRESID and click on the arrow to move it to the Y box.

2. Select *ZPRED and click on the arrow to move it to the X box. This will produce a scatterplot of the standardised residuals against the standardised predicted values.

3. Select Continue to return to the Linear Regression dialogue box and then select OK to generate the output.

Table 7 – Graphs to assess linearity and homoscedasticity of residuals

With the outlier: In the scatterplot of residuals, the outlier is clearly depicted in the bottom right-hand corner, and it exceeds the range of ± 3. Additionally, the other data points are compressed into a limited range on the left. We wouldn't want to make a decision about linearity and homoscedasticity of residuals when such a clear outlier is present. Rather, we should address the outlier and then re-run the scatterplot to assess the assumptions.

Without the outlier: In the scatterplot of residuals, there are no data points that are outliers, and all the data points fall within the range of ± 3. Unlike the other scatterplot, the data points are spread in no clear pattern; the data does not appear to be funnelling or curving, which indicates that the assumptions of linearity and homoscedasticity of residuals have been met.

The example scatterplot with an outlier demonstrates how an outlier can have an impact on both linearity and
homoscedasticity of residuals. Therefore, when an outlier is clearly present, we would want to address the
outlier prior to drawing any conclusion about linearity and homoscedasticity of residuals.
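
As an illustration outside SPSS, the following Python sketch produces the equivalent of the *ZRESID against *ZPRED scatterplot using simulated data; the variable names and values are hypothetical.

```python
# Minimal sketch of the standardised residuals vs standardised predicted
# values plot for a simple regression (hypothetical data).
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(7)
happiness = np.array([50, 51, 53, 54, 55, 58, 60, 61, 63, 69,
                      70, 75, 79, 80, 80, 83, 84, 90, 95, 98], dtype=float)
quality_of_life = 10 + 0.8 * happiness + rng.normal(0, 6, happiness.size)

model = sm.OLS(quality_of_life, sm.add_constant(happiness)).fit()

# Standardise (z-score) the predicted values and the residuals
zpred = (model.fittedvalues - model.fittedvalues.mean()) / model.fittedvalues.std()
zresid = (model.resid - model.resid.mean()) / model.resid.std()

plt.scatter(zpred, zresid)
plt.axhline(0)
plt.xlabel("Standardised predicted value (*ZPRED)")
plt.ylabel("Standardised residual (*ZRESID)")
plt.show()
# A roughly random, rectangular scatter suggests linearity and
# homoscedasticity; funnels or curves suggest violations.
```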


6.2.7 Levene’s Test for assessing homogeneity of variance


Levene's test can be used to assess the homogeneity of variance assumption as it tests the null hypothesis that the variance in different groups is equal. Therefore, homogeneity is only assessed when there is a between-subjects independent variable and not when the independent variable is within subjects. Levene's test is run automatically when conducting an independent samples t-test; for ANOVA we just have to select the homogeneity test when setting up a between-subjects ANOVA. We will run two one-way ANOVAs with happiness (with and without the outlier) as the dependent variable and gender as the independent variable. To do this select Analyse → General Linear Model → Univariate.

1. Select Happiness with outlier and click on the arrow to move it to the Dependent Variable box.

2. Select Sex and click on the arrow to move it to the Fixed Factor(s) box.

3. Click on Options to open the Options dialogue box.

In the Options dialogue box:

1. Select Homogeneity tests and this will produce the Levene's test to assess homogeneity of variance.

2. Select Continue to return to the Univariate dialogue box.

Then select OK to run the ANOVA. Before we look at the output, we will run the one-way ANOVA again, replacing the happiness with outlier variable with the happiness without outlier variable. We will then be able to compare the two Levene's tests.

As can be seen in both tables, there are four options available to assess homogeneity of variances: based on the mean, the median, the median with adjusted df, and the trimmed mean. While any of the options can be interpreted, the option based on the median is often preferred because it is less impacted by outliers.


Levene's test with outlier – All four of the p-values are greater than .05, therefore the assumption of homogeneity has been met.

Levene's test without outlier – All four of the p-values are also greater than .05, therefore the assumption of homogeneity has also been met.

In this case the assumption of homogeneity of variance has been met both when the dependent variable contained the outlier and when it did not. However, if we compare the values between the two tables, we can see that the outlier is having a substantial impact on the Levene's test results, as the values are very different. This highlights that an outlier can have a substantial impact on Levene's test as well.

If the p-values were less than .05, then we would reject the null hypothesis and conclude that the variances are significantly different, or in other words that the assumption of homogeneity of variance has been violated.

In this case the sample size is small (i.e., < 30 participants), and with small samples Levene's test is only able to detect big differences in variance. Additionally, when the sample size is large (i.e., > 30 participants), Levene's test can be very sensitive to small differences in variance, and the p-value will be significant even when the differences are trivial. For these two reasons, the usefulness of Levene's test for assessing homogeneity has been questioned.
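
Outside SPSS, Levene's test is available in Python's scipy. The sketch below uses center='median', which corresponds to the median-based version discussed above (center='mean' gives the mean-based version); the group scores are hypothetical.

```python
# Minimal sketch of Levene's test for homogeneity of variance across two
# groups (hypothetical happiness scores for men and women).
from scipy import stats

happiness_men   = [50, 53, 55, 60, 63, 70, 79, 80, 84, 95]
happiness_women = [51, 54, 58, 61, 69, 75, 80, 83, 90, 98]

# center='median' is the median-based version (often preferred with outliers);
# center='mean' corresponds to the mean-based row of the SPSS table.
stat, p = stats.levene(happiness_men, happiness_women, center="median")
print(f"Levene's test (median-centred): W = {stat:.3f}, p = {p:.3f}")
# p > .05 suggests the homogeneity of variance assumption has been met.
```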

6.2.8 Hartley’s Fmax


Another method to assess variance in ANOVA is to use Hartley's Fmax, otherwise known as the variance ratio. To calculate Hartley's Fmax we compute the ratio of the variance of the group with the biggest variance to the variance of the group with the smallest variance. The ratio that we obtain can then be compared with the critical values that Hartley published. To determine the critical value, we need to know the sample size per group and the number of variances that we want to compare. There are also some broad rules that we can use to interpret Hartley's Fmax: for example, if the group sizes are 10, an Fmax value of around 10 is not going to be significant. If we increase the sample size to 15 or 20 per group then the Fmax value needs to be less than about 5, and if we increase the sample size to 30-60 per group, then the Fmax value should be less than about 2 or 3. Unfortunately, Hartley's Fmax has many of the same problems as Levene's test and therefore it isn't often used.
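
Because Hartley's Fmax is simply a ratio of variances, it is easy to compute by hand or in a few lines of code. The sketch below illustrates the calculation in Python with hypothetical group scores.

```python
# Minimal sketch of Hartley's Fmax (the variance ratio): the largest group
# variance divided by the smallest (hypothetical group scores).
import numpy as np

groups = {
    "group_1": [50, 53, 55, 60, 63, 70, 79],
    "group_2": [51, 54, 58, 61, 69, 75, 80],
    "group_3": [52, 56, 59, 62, 68, 74, 81],
}

variances = {name: np.var(scores, ddof=1) for name, scores in groups.items()}
f_max = max(variances.values()) / min(variances.values())
print(f"Hartley's Fmax = {f_max:.2f}")
# Compare f_max with the published critical value for the number of groups
# and the per-group sample size (or use the broad rules of thumb above).
```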

7 Reducing bias
When a bias is identified based on the presence of outliers or from violation of assumptions there are several
different approaches that can be used to reduce or eliminate the bias. These options include:

• Trim the data – delete a percentage of scores from the extremes, or delete an isolated case or cases
• Winsorizing – substitute the outlier/s score with the highest value that isn't an outlier
• Robust estimation methods – such as bootstrapping
• Transform the data – apply a mathematical function to the scores


As a budding researcher, you will need to consider which option you prefer based on the research you are
undertaking and the sample size. When the sample size is large, you may choose to exclude the participant.
However, when the sample size is small you might decide to use another option that retains the case/s such as
winsorising or transforming the data. It is for this reason it is good to know the options available to address
outliers and then select the most appropriate option that you can justify as being the most suitable.

7.1 Trimming the data

Trimming the data involves excluding extreme cases. There are three main ways that trimming is normally
undertaken:

1. Excluding the outliers


2. Based on a specific percentage
3. Based on a standard deviation-based rule

The last two options are usually preferred options.

7.1.1 Excluding the outlier


If we are simply excluding the outlier/s, we should have a very good reason to do so. One reason is that we believe the outlier case is not from the population of interest. For example, if we aimed to recruit young adults from the general population to investigate whether income was related to another construct and we somehow managed to recruit a celebrity such as Justin Bieber, then we could argue that his approximate $80 million per year income is not representative of our population of interest and remove him as an outlier.

7.1.2 Trimming based on a specific percentage or standard deviation-based rule


When we want to trim the data based on a specific percentage or standard deviation, we need to determine the number of cases and order the scores in ascending order. We can do this in SPSS by clicking on the column heading to highlight the entire column, then right-clicking on the column name and selecting Sort Ascending from the drop-down menu. I have ordered the happiness without the outlier variable scores for the 20 participants below.

50 51 53 54 55 58 60 61 63 69 70 75 79 80 80 83 84 90 95 98

If we wanted to trim 5% of the lowest and highest scores, we will need to calculate the number of cases. For
example, 5% of 20 scores = 1 case. Therefore, we will need to remove the first and last participant as illustrated
below.

Trim 51 53 54 55 58 60 61 63 69 70 75 79 80 80 83 84 90 95 Trim

If the cases we removed were the only influential cases, then trimming 5% of the data would be a sufficient solution. However, if after trimming 5% of the data the influential outliers remained in the data set, we might choose to trim 10%. Again, we will need to use the sample size to determine how many cases would be trimmed at 10%: this would be 2 cases from each end, as 10% of 20 scores = 2 cases. Therefore, we will remove the first two and last two participants as illustrated below.

Trim Trim 53 54 55 58 60 61 63 69 70 75 79 80 80 83 84 90 Trim Trim

Again, we would determine whether the influential cases have been trimmed; if they have, then trimming 10% of the data is a suitable solution and we can use this data set moving forward. However, if the influential outlier remains, then we could repeat the process and trim 15% of the data. If trimming at 15% addressed the issues we would stop; if not, we could continue and trim 20% of the data.


If we want to use trimming based on the standard deviation rule, then instead of ordering the raw scores for the variable in ascending order we would order the z-scores for the variable. We can then decide on a standard deviation cut-off for trimming. For example, we know that there shouldn't be any z-scores greater than ±3.29, therefore we can remove any cases with z-scores beyond ±3.29 from either end.
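
The sketch below illustrates both trimming rules in Python using the ordered scores listed above: the 5% trim removes one case from each end, and the z-score rule removes any case beyond ±3.29 standard deviations.

```python
# Minimal sketch of percentage-based and z-score-based trimming.
import numpy as np

scores = np.array([50, 51, 53, 54, 55, 58, 60, 61, 63, 69,
                   70, 75, 79, 80, 80, 83, 84, 90, 95, 98])

# Trim 5% from each end: 5% of 20 scores = 1 case per end
k = int(round(0.05 * scores.size))
trimmed_5pct = np.sort(scores)[k: scores.size - k]

# Standard deviation rule: keep only cases with |z| <= 3.29
z = (scores - scores.mean()) / scores.std(ddof=1)
trimmed_z = scores[np.abs(z) <= 3.29]

print(trimmed_5pct)   # 18 scores remain (51 ... 95)
print(trimmed_z)      # no score here exceeds |z| = 3.29, so all 20 remain
```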

When using either option, we want to be careful that we don't over trim or under trim. Over trimming means removing data unnecessarily from the data set, for example removing cases that aren't impacting on our model. In comparison, under trimming means not removing data that is influencing the results.

7.2 Winsorizing

Winsorizing is the process where the outlier or outliers are replaced with the next highest score that is not an outlier. There are different ways this can be undertaken, and two common options include:

1. Replace the outlier/s with the next highest score


2. Replace the outlier/s based on the score that represents 3 standard deviations from the mean

The first option is the simplest form of winsorizing where the outlier/s score is replaced with the next highest
score that is not an outlier. Using our happiness variable, we could sort the happiness with the outlier variable in
descending order. This would enable us to identify that the outlier of 200 could be replaced with the next highest
value that wasn’t an outlier which is 98.

The second option utilises the z-scores, as we have identified earlier an outlier is any value that has a z-score of
greater than 3.29 (or 3 if we round it for convenience). We can calculate the value to replace the outlier with by
using the following formula:

• X = (z x s) + x̄

Earlier in Section 6.2.1 when we explored z-scores we obtained the descriptive statistics table. We can use the
mean and standard deviation from this table for our formula.

1. If we wanted to address the outlier in the happiness variable with the outlier, we can take the M = 76.400 and the SD = 32.74045.

We can now include the mean and standard deviation in our formula:

• X = (3.29 x 32.74045) + 76.400


• X = (107.71608) + 76.400
• X = 184.11608

The value that we have obtained is still really high and given that the second highest value is 98 this value might
still be an influential outlier. This example highlights how the influential case is impacting the mean and standard
deviation therefore it wouldn’t be the best option to use these values in our formula to winsorize the influential
case.

A better option would be to run the descriptive statistics again without the outlier included and use that mean
and standard deviation in our formula as these values aren’t impacted by the outlier.

1. After removing the outlier, our M = 69.8947 and SD = 15.43047 are much lower.


So, let’s recalculate our winsorised value.

• X = (3.29 x 15.43047) + 69.8947


• X = (50.7662463) + 69.8947
• X = 120.660946

This value of 121 is closer to our second highest score of 98, and we could use this value to replace our outlier value. However, do you remember what scores we identified as being the minimum and maximum for the happiness questionnaire? As a refresher, we identified that the minimum possible score is 20 (20 questions x 1 strongly disagree = 20) and the maximum possible score is 100 (20 questions x 5 strongly agree = 100). The winsorised value of 120.66 is therefore still outside the possible range for this variable, so in this case we would probably discount using this option to address the outlier.
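
The sketch below illustrates both winsorizing options in Python. The score list is illustrative (the ordered scores listed earlier plus an outlier of 200, rather than the exact SPSS data file), and the 3.29 SD cap uses the chapter's mean and standard deviation calculated without the outlier.

```python
# Minimal sketch of the two winsorizing options described above.
import numpy as np

scores = np.array([50, 51, 53, 54, 55, 58, 60, 61, 63, 69, 70,
                   75, 79, 80, 80, 83, 84, 90, 95, 98, 200], dtype=float)

# Option 1: replace the outlier with the next highest non-outlier score
next_highest = np.sort(scores)[-2]                       # 98
winsorized_1 = np.where(scores == scores.max(), next_highest, scores)

# Option 2: replace the outlier with the score 3.29 SDs above the mean,
# using the chapter's mean and SD calculated WITHOUT the outlier
cap = (3.29 * 15.43047) + 69.8947                        # X = (z x s) + x-bar
winsorized_2 = np.where(scores > cap, cap, scores)

print(next_highest, round(cap, 2))                       # 98.0 and 120.66
```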

7.3 Robust methods

Robust methods include a group of procedures to estimate statistics that are not biased even when the normal
assumptions for the statistics are not met. There are many robust methods, and two common methods include:

• Non-parametric tests
• Bootstrapping

There is a range of non-parametric options available that can be performed when the assumptions of parametric tests are violated or when there are outliers biasing the results. While these are options for analyses such as the independent samples t-test, repeated measures t-test, one-way independent measures ANOVA, one-way repeated measures ANOVA and correlations, there are unfortunately no non-parametric equivalent tests when we have two independent variables (such as in factorial ANOVA) or for any type of regression. This prevents the non-parametric option from being a solution to address bias when wanting to perform factorial ANOVA or regression.

The other common option is bootstrapping. Normality of our data allows us to assume that the sampling
distribution is also normal. However, when normality is violated, we can’t assume that the sampling distribution
is normal unless the sample size is large (i.e., N > 30 participants). Bootstrapping enables us to address the
problem of not knowing if the sampling distribution is normal as it estimates the properties of the sampling
distribution from the sample data.

The way bootstrapping works is that smaller samples (referred to as bootstrap samples) are repeatedly drawn from the sample data, with each score replaced before the next is drawn (sampling with replacement), and the parameter (such as the mean) is calculated for each bootstrap sample. This process can be repeated many times; for example, we can run bootstrapping 2000 times and obtain 2000 parameter estimates. Once we run bootstrapping, we can identify the limits within which 95% of the parameter estimates fall, and these values can be used as the 95% confidence interval of the parameter. As we obtained these 95% confidence intervals using bootstrapping, we refer to them as the percentile bootstrapped confidence interval. Using bootstrapping we can also calculate the standard deviation of the parameter estimates and use it as the standard error of the parameter estimate.
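
To make the procedure concrete, the sketch below implements a percentile bootstrap for the mean in Python. It resamples the data with replacement (here each resample is the same size as the original sample, a common implementation choice) and uses the 2.5th and 97.5th percentiles of the bootstrap means as the 95% confidence interval; the scores are illustrative.

```python
# Minimal sketch of a percentile bootstrap for the mean (hypothetical scores).
import numpy as np

rng = np.random.default_rng(0)
scores = np.array([50, 51, 53, 54, 55, 58, 60, 61, 63, 69,
                   70, 75, 79, 80, 80, 83, 84, 90, 95, 98])

boot_means = np.array([
    rng.choice(scores, size=scores.size, replace=True).mean()
    for _ in range(2000)                       # 2000 bootstrap samples
])

ci_lower, ci_upper = np.percentile(boot_means, [2.5, 97.5])  # percentile 95% CI
boot_se = boot_means.std(ddof=1)               # bootstrap standard error
print(f"95% bootstrap CI: [{ci_lower:.2f}, {ci_upper:.2f}], SE = {boot_se:.2f}")
```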

Let’s run a one-way independent measures ANOVA to illustrate how to use bootstrapping. To get started select
Analyse → General Linear Model → Univariate.


1. Select Happiness without outlier and use the arrow to move it to the Dependent Variable box.

2. Select Sex and use the arrow to move it to the Fixed Factor(s) box.

3. Select Bootstrapping to open the Bootstrapping dialogue box.

In the Bootstrapping dialogue box:

1. Select Perform bootstrapping and leave all the other options as the default options.

2. Click Continue.

We will then need to select the options that we want to bootstrap, for example descriptive statistics or parameter estimates, and we can do this by selecting Options.

1. Select Descriptive statistics and Parameter estimates.

2. Select Continue.

Then select OK to run the analysis. The first table will provide information about the type of bootstrapping
undertaken.


The next table that is influenced by bootstrapping is the Descriptive Statistics table.

1. Using the bootstrap option, a range of information is included such as the bias and std. error, as well as the 95% Confidence Interval. We can include the 95% CIs in the write-up of the results.

While we usually don't report the parameter estimates when running ANOVA, the next tables provide the parameter estimates as well as the bootstrapped parameter estimates. This type of information is similar to what we could obtain if we were to bootstrap when running a regression.

1. The original parameter estimates.

2. The bootstrapped estimates. We could include the bootstrapped 95% CIs rather than the original 95% CIs from the table above.

It is important to note that because bootstrapping randomly selects samples from our sample, the parameter estimates will be slightly different every time we run bootstrapping. Therefore, when using the bootstrapping procedure, if the analysis is run more than once don't be surprised if the bootstrapped parameters change slightly, or if your values are slightly different from those presented in the tables above.

7.4 Transforming data

The fourth option for addressing bias is to transform the data. Transforming the data means that each of the scores in the data set is changed using a transformation option. There are many different transformation options, and the most commonly used ones are included in Table 8.


Table 8 – Data transformations and their uses (from Field (2018), Discovering Statistics Using IBM SPSS Statistics, Sage Publications, London, UK)

Log transformation (log(Xi)) – can correct for positive skew, positive kurtosis, unequal variances, and lack of linearity. Taking the logarithm of a set of numbers squashes the right tail of the distribution, which reduces positive skew. This transformation can also sometimes make a curvilinear relationship linear. Because you can't get a log value of zero or negative numbers, you may need to add a constant to all scores before taking the log: if you have scores of zero then do log(Xi + 1); if you have negative numbers add whatever value makes the smallest score positive.

Square root transformation (√Xi) – can correct for positive skew, positive kurtosis, unequal variances, and lack of linearity. Like the log transformation, taking the square root of scores has a greater impact on large scores than small ones. Consequently, taking the square root brings large scores closer to the centre, which will reduce positive skew. Although zeros are fine, negative numbers don't have a square root, so you may need to add a constant before transforming.

Reciprocal transformation (1/Xi) – can correct for positive skew, positive kurtosis, and unequal variances. Dividing 1 by each score also reduces the impact of large scores. The transformed variable will have a lower limit of 0 (very large numbers will become close to 0). This transformation reverses the scores: large scores become small (close to zero) after the transformation, and small scores become large. For example, scores of 1 and 100 become 1/1 = 1 and 1/100 = 0.01 after transforming: their relative size swaps. To avoid this, reverse the scores before the transformation by converting each score to the highest score for the variable minus the score you're looking at. So, rather than using 1/Xi as the transformation, use 1/(XHighest − Xi). You can't take the reciprocal of 0 (because 1/0 = infinity), so if you have zeros in the data add a constant to all scores before doing the transformation.

Reverse score transformation – can correct for negative skew. Any one of the above transformations can be used to correct negatively skewed data if you first reverse the scores. To do this, subtract each score from the highest score on the variable, or the highest score + 1 (depending on whether you want your lowest score to be 0 or 1). Don't forget to reverse the scores back afterwards, or to remember that the interpretation of the variable is reversed: big scores have become small and small scores have become big.

To transform data, we can select one option and see if it fixes the bias; if it does, run with it, and if it doesn't, pick another option and try that one. The process can be repeated until the bias is addressed.

These options aim to address the bias caused by outliers or by a violation of assumptions (such as normality, linearity, and homogeneity). Transforming data can be applied in two ways: if we are investigating a relationship between variables (i.e., by performing correlation or regression) we can apply a transformation just to the variable that contains the bias. However, if we are investigating differences between variables (i.e., by performing t-tests or ANOVA) then we will need to transform all the variables.
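
The sketch below applies the transformations from Table 8 in Python to a hypothetical, positively skewed set of scores. The variable names are illustrative, and the reciprocal is applied to reverse-scored values with a constant added, so that the ordering of the scores is preserved and division by zero is avoided.

```python
# Minimal sketch of the transformations in Table 8 (hypothetical scores with
# no zeros or negative values; add a constant first if your data contain them).
import numpy as np
from scipy.stats import skew

x = np.array([50, 51, 53, 54, 55, 58, 60, 61, 63, 69,
              70, 75, 79, 80, 80, 83, 84, 90, 95, 200], dtype=float)

log_x        = np.log10(x)              # log transformation
sqrt_x       = np.sqrt(x)               # square root transformation
reciprocal_x = 1 / (x.max() - x + 1)    # reciprocal of reverse-scored values
                                        # (keeps the ordering of the scores)
reverse_x    = (x.max() + 1) - x        # reverse scoring for negative skew

# Compare the skew before and after the log transformation
print(round(skew(x), 2), round(skew(log_x), 2))
```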

7.4.1.1 Using the compute function to transform data


Irrespective of which transformation function you choose to apply, the compute function is used to apply the transformation. To get started select Transform → Compute Variable to open the Compute Variable dialogue box.


1. In the Target Variable box write a name for the new variable. I want to transform the happiness variable using the log function, so I will name the variable HappinessLog.

2. Select Arithmetic to see all of the functions available.

3. Pick any of the functions; in this case, I will select Log10 as the transformation option. A description will appear describing the option. Log10(?) will then appear, and we will need to move the variable we want to transform into the expression using the arrow.

4. We can select If… if we want to apply the transformation to only some of the data.

1. If we want to apply the transformation to all the data, we don't need to do anything and can keep Include all cases selected.

2. However, we can apply the transformation to particular groups. For example, if we want to apply the transformation only to men, we can tick Include if case satisfies condition:.

3. We can then move the sex variable to the box and, as men are coded as 1, write = 1 to apply the transformation only to men.

4. Select Continue.

Select OK to create the new variable. If you selected to use the IF function, then only the males will have a
transformed score in the new variable HappinessLog. However, if you didn’t use the IF function then all
participants will have a score for the new variable.

This example illustrates how the compute function can be used to apply the log transformation. If the data contained zeros, then we can modify the Numeric Expression from Log10(happiness) to Log10(happiness+1). Adding 1 to all of the scores before applying the log ensures that the log function can be applied to all the values.

Applying the other common transformations uses the same process:


• Square root transformation – from the functions we select Sqrt and then move the variable to the ? in the numeric expression.
• Reciprocal transformation – in the Numeric Expression box type 1 and then the / sign, and then move the variable into the expression. For example, for the happiness variable we would have written "1/happiness". This is appropriate for this variable because it doesn't have any zero scores. However, if the variable contains zeros, then we would want to modify the expression slightly by typing "1/(happiness+1)". Adding 1 to the happiness variable ensures that all values are at least 1; SPSS calculates the happiness variable plus 1 and then applies the transformation, and we will only obtain the final calculation in the dataset.

8 Summary
This chapter has highlighted how we can undertake data screening and cleaning, as well as a range of strategies that can be used to identify and address bias in our data set. As a budding researcher it is recommended that you get into the habit of performing data screening/cleaning prior to checking assumptions and running any statistical analysis. In addition, while you might prefer particular options for assessing and addressing bias in the data, it is important to be aware of the many tools that can be used, as sometimes our preferred option/s might not be useful and we might need to select a different option. Therefore, this chapter is likely to be one that you will return to again and again, especially when conducting your own analyses and when biases are identified.
