
Creating dummy variables in SPSS Statistics

Introduction
If you are analysing your data using multiple regression and any of your independent variables
were measured on a nominal or ordinal scale, you need to know how to create dummy
variables and interpret their results. This is because nominal and ordinal independent variables,
more broadly known as categorical independent variables, cannot be directly entered into a
multiple regression analysis. Instead, they need to be converted into dummy variables. The
exception is ordinal independent variables that are entered into a multiple regression as
continuous independent variables, which do not need to be converted into dummy variables.
Therefore, in this guide we show you how to create dummy variables when you have categorical
independent variables.

First, we set out the example we use to show how to create dummy variables in SPSS Statistics,
before explaining how to set up your data in the Variable View and Data View windows of
SPSS Statistics so that you can create dummy variables. If you are unfamiliar with the use of
dummy variables, we recommend that you then read about some of the basic principles of
dummy variables and dummy coding, including: (a) the number of dummy variables you need to
create in your analysis; and (b) how to create dummy variables and dummy coding. In
the Procedure section that follows, we set out the simple, 3-step Create Dummy
Variables procedure in SPSS Statistics that can be used to create dummy variables. Finally, we
explain the SPSS Statistics output after running the Create Dummy Variables procedure,
including how your dummy variables will now be set up in the Variable View and Data
View windows of SPSS Statistics.

Note: If you find that the procedures in this guide do not cover the type of dummy variables
you want to create, please contact us. We may be able to add another guide to the site to help.

SPSS Statistics
Example used in this guide
In this guide we will be using the example of 10 triathletes who were asked to select
their favourite sport from the three sports they perform when doing a
triathlon: swimming, cycling and running. Their answers were recorded in the nominal
independent variable,  favourite_sport , which has three categories: "swimming", "cycling" and
"running". This nominal independent variable,  favourite_sport , was to be included in a multiple
regression analysis that also had a number of continuous independent variables. Since this
independent variable was categorical (i.e., nominal variables and ordinal variables can be
broadly classified as categorical variables), dummy variables had to be created before it could
be entered into the multiple regression analysis.

Important: Notice that  favourite_sport  is a nominal variable, but you can also create dummy
variables for an ordinal variable. Furthermore, the process for creating dummy variables is
the same irrespective of whether you have an ordinal or nominal variable, with the exception
of one small change you have to make when setting up your data, which is explained below.

Note 1: The "categories" of a categorical independent variable are also referred to as
"groups" or "levels", but the term "levels" is usually reserved for categories that have an
order (e.g., the ordinal independent variable, "fitness level", could have three levels: "low",
"moderate" and "high"). However, these three terms – "categories", "groups" and "levels" –
can be used interchangeably. In this guide, we will refer to them as categories, but you could
refer to them as groups or levels if you prefer.

Note 2: The term "factors" is sometimes used instead of "categorical independent
variables" (i.e., independent variables that are "ordinal" or "nominal"). These two
terms – "categorical independent variables" and "factors" – can be used interchangeably. In
this guide, we will refer to them as categorical independent variables and you will also see
SPSS Statistics refer to them as independent variables rather than factors in its multiple
regression procedure. However, you can refer to them as factors if you prefer.

Setting up your data in SPSS Statistics
When creating dummy variables, you will start with a single categorical independent variable
(e.g.,  favourite_sport ). To set up this categorical independent variable, SPSS Statistics has
a Variable View where you define the types of variable you are analysing and a Data
View where you enter your data for this variable. In this section, we first show you how to set up
a categorical independent variable in the Variable View window of SPSS Statistics, before
showing you how to enter your data into the Data View window. We do this using our
categorical independent variable,  favourite_sport , which has three categories: "swimming",
"cycling" and "running".

The Variable View in SPSS Statistics


For a single categorical independent variable (e.g.,  favourite_sport ), your Variable View window
will look like the one below:

Note: You can access the Variable View window in SPSS Statistics by clicking on
the Variable View tab in the bottom left-hand corner of the SPSS Statistics software.

Published with written permission from SPSS Statistics, IBM Corporation.

The name of your categorical independent variable should be entered in the cell under
the Name column (e.g., "favourite_sport" in row 1 to represent our
categorical independent variable, favourite_sport). There are certain "illegal" characters that
cannot be entered into the Name cell. Therefore, if you get an error message and you
would like us to add an SPSS Statistics guide to explain what these illegal characters are,
please contact us.

Note: For your own clarity, you can also provide a label for your variables in
the Label column. For example, the label we entered for "favourite_sport" was
"Triathlete's favourite sport".

The cell under the Values column should contain the information about the categories of
your categorical independent variable (e.g., "swimming", "cycling" and "running"
for favourite_sport). To enter this information, click into the cell under the Values column
for your independent variable. A button will appear in the cell. Click on this button and
the Value Labels dialogue box will appear. You now need to give each category of your
independent variable a "value", which you enter into the Value: box (e.g., "1"), as well as a
"label", which you enter into the Label: box (e.g., "swimming"). By clicking the Add button
the coding will appear in the main box (e.g., 1.00 = "swimming" for favourite_sport). The setup
for our categorical independent variable is shown in the Value Labels dialogue box below:



The cell under the Measure column should show Nominal if you have
a nominal independent variable (e.g., favourite_sport, as in our example) or Ordinal if you
have an ordinal independent variable (e.g., imagine an ordinal variable such as "Body Mass
Index" (BMI), which has four levels: "Underweight", "Healthy/Normal Weight",
"Overweight", and "Obese"). Finally, the cell under the Role column shows
the role assigned to the variable (see the note below).

Note: We suggest changing the cell under the Role column from Input
to None, but you do not have to make this change. We suggest that you do because
there are certain analyses in SPSS Statistics where the Input setting results in your
variables being automatically transferred into certain fields of the dialogue boxes you are
using. Since you may not want to transfer these variables, we suggest changing
the Role setting to None so that this does not happen automatically.

You have now successfully entered all the information that SPSS Statistics needs to know about
your categorical independent variable into the Variable View window. In the next section, we
show you how to enter your data into the Data View window.
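
As a compact summary of the Variable View setup described above, the sketch below records the same information in Python. The VariableDefinition class is a hypothetical stand-in for one row of the Variable View window and is not part of SPSS Statistics. The codes 1 and 3 are stated in this guide, and 2 = "cycling" is inferred from the order of the categories.

```python
from dataclasses import dataclass


@dataclass
class VariableDefinition:
    """Hypothetical stand-in for one row of the Variable View window."""
    name: str      # Name column
    label: str     # Label column
    values: dict   # Values column: numeric code -> category label
    measure: str   # Measure column: "Nominal" or "Ordinal"


# The example variable used throughout this guide.
favourite_sport = VariableDefinition(
    name="favourite_sport",
    label="Triathlete's favourite sport",
    values={1: "swimming", 2: "cycling", 3: "running"},
    measure="Nominal",
)

# List the value labels in code order, as the Value Labels dialogue box does.
for code, category in sorted(favourite_sport.values.items()):
    print(f"{code} = {category}")
```

Keeping the numeric codes and their labels together in one mapping mirrors how the Values column stores them, which makes it easy to look up a category from the code entered in the Data View.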

The Data View in SPSS Statistics


Based on the file setup for your categorical independent variable in the Variable View window
above, the Data View window should look as follows:

Note: You can access the Data View window in SPSS Statistics by clicking on the Data View
tab in the bottom left-hand corner of the SPSS Statistics software.

Your categorical independent variable will be displayed in the first column since this was the
order we entered the variable into the Variable View window. In our example, the responses of
the 10 triathletes are presented under the favourite_sport column. Now, you simply have to
enter your data into the cells under this first column. Remember that each row represents one
case (e.g., a case could be a single participant). Therefore, in row 1 of our example,
the first case represented a triathlete whose favourite sport was "swimming". Since these cells
will initially be empty, you need to click into the cells to enter your data. You will notice that
when you click into the cells under the favourite_sport column, SPSS Statistics will give you a
drop-down option with your categories already populated.

Now that you have set up your data in the Variable View and Data View windows of SPSS
Statistics, we recommend reading the next section: Understanding dummy variables and dummy
coding, where we explain the basic principles of dummy variables and dummy coding. However,
if you are already familiar with the fundamentals of dummy variables and dummy coding, you can
skip this section and go straight to the Procedure section where we set out the Create Dummy
Variables procedure in SPSS Statistics that is used to create dummy variables.

Understanding dummy variables and dummy coding
As we mentioned in the Introduction, if you are analysing your data using multiple regression
and any of your independent variables were measured on a nominal or ordinal scale, you need
to know how to create dummy variables and interpret their results. This is because categorical
independent variables (i.e., nominal and ordinal independent variables) cannot be directly
entered into a multiple regression. Instead, they need to be converted into dummy variables. The
exception is ordinal independent variables that are entered into a multiple regression as
continuous independent variables, which do not need to be converted into dummy variables. In
the sections below, we explain: (a) the number of dummy variables you need to create; and
(b) how to create dummy variables and dummy coding.

The number of dummy variables you need to create


The number of dummy variables you need to create will depend on how many categories your
categorical independent variable has. As a general rule, you will create one less dummy
variable than the number of categories in your categorical independent variable. For
example, if you have a categorical independent variable with three
categories (e.g.,  favourite_sport , with the following three categories: "swimming", "cycling" and
"running"), you will create two dummy variables and select one category to act as a reference
category (e.g., "swimming" and "cycling" become dummy variables and "running" becomes the
reference category). We explain more about reference categories after the following table, which
provides some examples of categorical independent variables and the number of dummy
variables that need to be created:

#  | Name of the categorical independent variable | Type of variable | Number of categories | Number of dummy variables
1  | Gender                  | Nominal | Two (Males & Females) | One = Males; "Females" is the reference category
2  | Height                  | Ordinal | Two (Under 180cm & 180cm and above) | One = Under 180cm; "180cm and above" is the reference category
3  | Ethnicity               | Nominal | Three (African American, Caucasian & Hispanic) | Two = African American & Caucasian; "Hispanic" is the reference category
4  | Physical activity level | Ordinal | Three (Low, Moderate & High) | Two = Low & Moderate; "High" is the reference category
5  | Profession              | Nominal | Four (Surgeon, Doctor, Nurse & Therapist) | Three = Surgeon, Doctor & Nurse; "Therapist" is the reference category
6  | Level of agreement      | Ordinal | Four (Strongly agree, Agree, Disagree, Strongly disagree) | Three = Strongly agree, Agree & Disagree; "Strongly disagree" is the reference category
7  | Subject area            | Nominal | Five (Business studies, Psychology, Biological sciences, Engineering & Law) | Four = Business studies, Psychology, Biological sciences & Engineering; "Law" is the reference category
8  | Age                     | Ordinal | Five (Under 18, 19-30, 31-40, 41-50, 51-60) | Four = Under 18, 19-30, 31-40 & 41-50; "51-60" is the reference category

Table: Examples of categorical independent variables and their respective dummy variables

As shown in the table above, you only need to create one less dummy variable than the number
of categories in your categorical independent variable. This is because you only need to (and
should) transfer this number of dummy variables into a multiple regression when you have a
categorical independent variable. However, there are good reasons to create a dummy
variable for every category of the categorical independent variable: (a) it is more flexible
and (b) it allows multiple comparisons to be made (see the note below). In other words, if your
categorical independent variable has three categories you would create three dummy
variables, not just two.
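
The "one less than the number of categories" rule can be sketched in a few lines of Python. The function name dummy_plan is hypothetical, and defaulting to the last category as the reference is an assumption that simply mirrors the examples in the table above.

```python
def dummy_plan(categories, reference=None):
    """Split categories into the dummy variables to enter and the reference.

    If no reference is given, the last category is used, mirroring the
    examples in the table above.
    """
    if len(categories) < 2:
        raise ValueError("a categorical variable needs at least two categories")
    if reference is None:
        reference = categories[-1]
    if reference not in categories:
        raise ValueError(f"unknown reference category: {reference!r}")
    # Every category except the reference gets a dummy variable in the regression.
    entered = [c for c in categories if c != reference]
    return entered, reference


entered, reference = dummy_plan(["swimming", "cycling", "running"])
print(f"{len(entered)} dummy variables entered: {entered}; reference: {reference}")
```

Passing a different reference argument, for example reference="cycling", returns the other two categories as the dummy variables to enter, which is the flexibility discussed in the note below.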

Fortunately, the Create Dummy Variables procedure in SPSS Statistics version 22 and
above automatically creates a dummy variable for every category of your categorical
independent variable. However, this is not the case for the Recode into Different
Variables procedure in SPSS Statistics version 21 or earlier. Therefore, under normal
circumstances, you will have created the following setup in SPSS Statistics, depending on
whether you have version 21 or earlier or version 22 and above:

Note: As mentioned above, creating a dummy variable for every category of the categorical
independent variable is beneficial for two reasons: (a) it is more flexible and (b) it allows
multiple comparisons to be made. We briefly touch on these benefits below:

It is more flexible:
When you have created a dummy variable for every category of your categorical independent
variable, you can treat any category as the reference category. In our example, we
considered the "running" category as the reference category, which means we would have
transferred "swimming" and "cycling" into the multiple regression equation. If we had created
only these two dummy variables and later changed our mind about our choice of reference
category, we would have to run the dummy variable procedure again (this is not necessary if
you used the Create Dummy Variables procedure in SPSS Statistics version 22 or above). For
example, let's assume we now wanted to consider the "cycling" category as the reference
category. We could simply transfer the "swimming" and "running" dummy variables into the
multiple regression equation because we also have the "running" dummy variable.

It allows multiple comparisons to be made:
The coefficient of a dummy variable represents the difference between the category that
dummy variable represents and the reference category. For example, with "running" as the
reference category, the coefficient of the "swimming" dummy variable represents the
difference in the dependent variable between the "swimming" and "running" categories.
With a single reference category, not every pair of categories can be compared directly
(e.g., with "running" as the reference category, no coefficient compares "swimming" with
"cycling"). This problem can be solved by rerunning the analysis with a different reference
category, which is possible if all categories of the categorical variable have a dummy variable.
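
To make the point about comparisons concrete, the sketch below lists which pairs of categories each choice of reference category compares directly. It is plain bookkeeping in Python rather than a regression, and the function name comparisons_by_reference is hypothetical.

```python
from itertools import combinations


def comparisons_by_reference(categories):
    """Map each possible reference category to the pairwise comparisons
    that its dummy variable coefficients provide directly."""
    return {
        reference: {
            frozenset((other, reference))
            for other in categories
            if other != reference
        }
        for reference in categories
    }


categories = ["swimming", "cycling", "running"]
all_pairs = {frozenset(pair) for pair in combinations(categories, 2)}
table = comparisons_by_reference(categories)

# With "running" as the reference category, one pair is not compared directly.
missing = all_pairs - table["running"]
print([sorted(pair) for pair in missing])

# Adding a second run with "cycling" as the reference covers every pair.
print(table["running"] | table["cycling"] == all_pairs)
```

With three categories, two different reference categories are enough to obtain every pairwise comparison, which is why having a dummy variable for every category is convenient.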

How to create dummy variables and dummy coding


There are two steps to successfully set up dummy variables in a multiple regression: (1) create
dummy variables that represent the categories of your categorical independent variable; and
(2) enter values into these dummy variables – known as dummy coding – to represent the
categories of the categorical independent variable. We explain this process below using the
example we set out above.

Explanation: Dummy variables are simply new variables that act as "placeholders" for a
particular coding scheme. They do not contain any data at all, per se. Instead, data/values need to
be added to these dummy variables so that they can fulfil their purpose of representing the
categories of your categorical independent variable. There are many different types of coding
scheme that will dictate the values that are entered into dummy variables, but we use a very
common coding scheme called dummy coding or, alternatively, indicator coding (N.B., do not
get confused because dummy variables and dummy coding are not the same thing). Dummy
coding works by using each dummy variable to identify a specific category of a categorical
independent variable with the exception of a reference category, which we explain below.

Let's start by considering our example categorical independent variable,  favourite_sport , which
has three categories: "swimming", "cycling" and "running". Since there are three categories,
there needs to be two dummy variables representing two of the categories, and a reference
category representing the third category.

Note: Remember from the discussion above that a multiple regression requires you to transfer
one less dummy variable than the number of categories in your categorical independent
variable (i.e., two in our example). However, you can create a dummy variable for every
category of the categorical independent variable for the purposes of greater flexibility and the
ability to make multiple comparisons. Nonetheless, in the discussion below we only highlight
what is required for a multiple regression; that is, the creation of one less dummy variable
than the number of categories in your categorical independent variable with the category that
is not directly represented becoming the "reference category".

For example, let dummy variable #1 represent the "swimming" category and dummy variable #2
represent the "cycling" category. This leaves no dummy variable for the "running" category. This
"missing" category is the reference category and it does not need a dummy variable of its own
because any case coded "0" on both dummy variables must belong to it. Furthermore, it is entirely
your decision which category you want to use as the reference category. We could have just as
easily chosen the "swimming" category as the reference category rather than the "running"
category. The only reason we didn't is that by default SPSS Statistics uses the last category you
have coded in the Variable View for your categorical independent variable as the reference
category (see the note below).

Note: As explained in the section, Setting up your data in SPSS Statistics, earlier and as
shown below in the Value Labels dialogue box, the third and final category of our categorical
independent variable was
"running" (i.e., 3="running").
There was no theoretical or statistical reason for us to make the "running" category the third
and final category, which made it the reference category in SPSS Statistics by default. We
simply did it this way because when triathletes take part in a triathlon, they first do the swim,
then undertake a cycle, before finally running to the finish line. Therefore, it seemed logical
to code our categorical independent variable this way. However, we could have coded it as
1=cycling, 2=running and 3=swimming; it would have made no difference except for the fact
that as the third and final category, "swimming" would have become our reference category
by default in SPSS Statistics.
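
The effect of the coding order described in this note can be sketched as follows. The function default_reference is hypothetical and only restates the rule given in this guide (the category with the last, i.e., highest, code becomes the reference category). It does not call or reproduce SPSS Statistics itself.

```python
def default_reference(value_labels):
    """Return the category attached to the highest numeric code,
    which this guide describes as the default reference category."""
    return value_labels[max(value_labels)]


# Coding used in this guide.
guide_coding = {1: "swimming", 2: "cycling", 3: "running"}

# Alternative coding mentioned in the note above.
alternative_coding = {1: "cycling", 2: "running", 3: "swimming"}

print(default_reference(guide_coding))
print(default_reference(alternative_coding))
```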

When you create dummy variables you should give them a meaningful name. Since each of our
dummy variables represents a category of our categorical independent variable, it is customary to
refer to each dummy variable by the name of the category it represents. Therefore, we have
called dummy variable #1 "swimming" as it represents the swimming category. Similarly, we
have called dummy variable #2 "cycling" as it represents the cycling category. By creating these
two dummy variables, we will have two new columns in our data set in SPSS Statistics, as
shown below:

Now that we have created two dummy variables and given them appropriate names, we need
to enter values into these variables so that each dummy variable really does represent its
category of the categorical independent variable. With dummy coding this is very simple. You
enter a "1" to represent any case (e.g., a participant in your data set) that has the category and
enter a "0" (zero) if they do not have the category. First, consider the "swimming" dummy
variable, as shown below:

If one of the triathletes stated that "swimming" was their favourite sport, we would enter a "1"
into the cell under the swimming dummy variable column for that triathlete. Alternatively, if
one of the triathletes stated that "cycling" or "running" was their favourite sport, we would
enter a "0" into the cell under the swimming dummy variable column for that triathlete because
swimming was not their favourite sport. This is highlighted below for all 10 triathletes:

We repeat this process for the other dummy variable, "cycling", as shown below:

If one of the triathletes stated that "cycling" was their favourite sport, we would enter a "1"
into the cell under the cycling dummy variable column for that triathlete. Alternatively, if
one of the triathletes stated that "swimming" or "running" was their favourite sport, we would
enter a "0" into the cell under the cycling dummy variable column for that triathlete because
cycling was not their favourite sport. This is highlighted below for all 10 triathletes:

By entering "1"s and "0"s into your dummy variables in this manner, you will have created a set
of dummy variables that you can enter into a multiple regression analysis. In
the Procedure section that follows, we show you how to create these dummy variables using
the Create Dummy Variables procedure.
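
The dummy coding rule described in this section can be sketched in Python as follows. The 10 responses are hypothetical illustration data, because the guide's own data appear only in screenshots. The only detail taken from the guide is that the first triathlete chose "swimming". The helper name dummy_code is also hypothetical.

```python
# Hypothetical favourite sports for 10 triathletes (illustration only).
responses = [
    "swimming", "cycling", "running", "swimming", "running",
    "cycling", "cycling", "swimming", "running", "running",
]


def dummy_code(values, category):
    """Return 1 for cases that have the category and 0 for all other cases."""
    return [1 if value == category else 0 for value in values]


# "running" is the reference category, so it gets no dummy variable here.
swimming = dummy_code(responses, "swimming")
cycling = dummy_code(responses, "cycling")

for sport, s, c in zip(responses, swimming, cycling):
    print(f"{sport:<10}{s:>3}{c:>3}")
```

Cases in the reference category are identified by a "0" on both dummy variables, so no information is lost by leaving "running" without a column of its own.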

Procedure in SPSS Statistics to create dummy variables
There are two procedures in SPSS Statistics to create dummy variables: the Create Dummy
Variables procedure and the Recode into Different Variables procedure. In this guide, we
show you how to use the Create Dummy Variables procedure, which is a simple 3-step
procedure. However, it is only available if you have SPSS Statistics version 22 or later,
with version 26 (and the subscription version of SPSS Statistics) being the latest version of
SPSS Statistics at the time of writing. If you are unsure which version of SPSS Statistics you are using, see our
guide: Identifying your version of SPSS Statistics. If you have SPSS Statistics version 21 or
earlier or are interested in making multiple comparisons when carrying out your multiple
regression analysis, please see the Note below:

Note: If you have SPSS Statistics version 21 or earlier, you cannot use the Create Dummy
Variables procedure. Instead, the Recode into Different Variables procedure
enables you to create dummy variables in SPSS Statistics. Whilst you can also use
the Recode into Different Variables procedure to create dummy variables if you have SPSS
Statistics version 22 or later, we set out the Create Dummy Variables procedure in this
guide because it is dedicated to creating dummy variables and is a lot easier and quicker to
use. For example, it requires just 3 steps to create dummy variables for the example used in
this guide compared to 28 steps for the same example using the Recode into Different
Variables procedure.

Therefore, if you have SPSS Statistics version 21 or earlier, our enhanced guide
on Creating dummy variables in the members section on Laerd Statistics includes a page
dedicated to showing how to carry out this 28-step Recode into Different
Variables procedure. You can access this enhanced guide by subscribing to Laerd Statistics.
Alternatively, you can simply use the Create Dummy Variables procedure below.

To create dummy variables when you have SPSS Statistics version 22 or later, follow the
3-step Create Dummy Variables procedure below:
1. Click Transform > Create Dummy Variables on the main menu, as shown below:
You will be presented with the Create Dummy Variables dialogue box, as shown
below:

2. Transfer the categorical independent variable, favourite_sport, into the Create Dummy
Variables for: box by selecting it (by clicking on it) and then clicking on the arrow button.
Also, enter a "root" name that can represent all of the new dummy variables into the Root
Names (One Per Selected Variable): box in the –Main Effect Dummy Variables– area. We
entered the root name "fs" as an abbreviation for our categorical independent variable,
"favourite_sport", as shown below:

Note: SPSS Statistics will add a sequential number (i.e., 1, 2, 3, 4, etc.) onto the end of
the root name you choose to represent your categorical independent variable. A
sequential number will be created for each of the dummy variables you want to create
(e.g., if you have two dummy variables, a 1 and 2 will be added onto the end of the root
name, but if you had six dummy variables, a 1, 2, 3, 4, 5 and 6 would be added onto the
end of the root name). This is shown for our example in the Variable View window
below:

Since our categorical independent variable, favourite_sport, had three categories (i.e.,
swimming, cycling and running), the Create Dummy Variables procedure
creates three dummy variables (i.e., one for swimming, one for cycling and one for
running). These three dummy variables are highlighted in the Name column
above: "fs_1" (for swimming), "fs_2" (for cycling) and "fs_3" (for running). You can
rename these later so that they make more sense. We are just highlighting this so that you
know how the Root Names (One Per Selected Variable): box above works.

Also, the root name you enter into the Root Names (One Per Selected
Variable): box cannot be the same as the name of your categorical independent
variable, as shown below (i.e., where we have entered the root name, "favourite_sport",
to illustrate what we could not call our root name):
If the root name you enter is the same as the name of your categorical independent
variable, as shown above, when you click on the OK button, you will get a warning.

3. Click on the OK button.

After carrying out the 3-step Create Dummy Variables procedure above you will have created
dummy variables for your categorical independent variable. In the next section, we highlight the
output that is created in the Variable View and Data View of SPSS Statistics after running
this Create Dummy Variables procedure.
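
For readers who want to see the logic of the procedure outside SPSS Statistics, the sketch below imitates its result in Python. It is not how SPSS Statistics implements the procedure. The function name, the hypothetical list of codes and the use of a ValueError (in place of the warning SPSS Statistics displays) are all assumptions, while the "fs_1"-style names and "favourite_sport=swimming"-style labels follow the output described in this guide.

```python
def create_dummy_variables(codes, value_labels, variable_name, root_name):
    """Imitate the result of the Create Dummy Variables procedure:
    one 1/0 column per category, named root_name plus a sequential number."""
    if root_name == variable_name:
        # Stand-in for the warning SPSS Statistics shows in this situation.
        raise ValueError("the root name cannot be the same as the variable name")
    dummies = {}
    for position, code in enumerate(sorted(value_labels), start=1):
        dummies[f"{root_name}_{position}"] = {
            "label": f"{variable_name}={value_labels[code]}",
            "values": [1.0 if c == code else 0.0 for c in codes],
        }
    return dummies


value_labels = {1: "swimming", 2: "cycling", 3: "running"}
codes = [1, 2, 3, 1, 3, 2, 2, 1, 3, 3]  # hypothetical data for 10 triathletes

result = create_dummy_variables(codes, value_labels, "favourite_sport", "fs")
for name, info in result.items():
    print(name, info["label"], info["values"])
```

Creating a column for every category, rather than one less, matches what this guide reports for version 22 and above and leaves the choice of reference category until the regression is run.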

Output and data setup in SPSS Statistics after creating dummy variables
After creating your dummy variables, SPSS Statistics produces the following Variable
Creation table in its IBM SPSS Statistics Viewer:

The Variable Creation table confirms that you have successfully created dummy variables.
There should be as many rows as there are new dummy variables. Since we
created three dummy variables, there are three rows in the table, "fs_1", "fs_2" and "fs_3",
which reflect the root name and sequential numbering entered in Step 2 of the Create Dummy
Variables procedure in the previous section. For each of these dummy variables, a label is
provided in the table to make it clear which category of the categorical independent variable each
dummy variable represents. For example, the label, "favourite_sport=swimming", is provided
for "fs_1", indicating that "fs_1" is the dummy variable for the "swimming" category of the
categorical independent variable,  favourite_sport .

Next, go to the Variable View window of SPSS Statistics by clicking on the Variable View tab.
The three dummy variables will have been added, as shown below (i.e., the dummy variables,
"fs_1", "fs_2" and "fs_3", in the Name column):


Note: You can change the names of the dummy variables in the Name column to
make it clearer what these are. For example, we have changed "fs_1" to "swimming", "fs_2"
to "cycling" and "fs_3" to "running", as shown below:

Finally, go to the Data View window of SPSS Statistics by clicking on the Data View tab.
The dummy coding is shown under each of the dummy variables that have been created. For
example, in the rows under the "fs_1" column, the category, "swimming", is coded as "1.00",
whereas the categories, "cycling" and "running", are coded as ".00", as shown below. If you are
unsure why these dummy variables are dummy coded in this way, see the section: Understanding
dummy variables and dummy coding.

Note 1: Due to the default settings of SPSS Statistics, your dummy variables will be coded
"1.00" or ".00" instead of "1" or "0", respectively. These values are identical. However, you will
often see dummy coding written in terms of 1's and 0's rather than including decimals.

Note 2: If you changed the names of the dummy variables in the Name column of
the Variable View window above, these will also have been changed in the columns of
the Data View window, as shown below (e.g., the "fs_1" column heading is now
entitled "swimming"):
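
A quick sanity check on the finished dummy variables is that every value is a 1 or a 0 and that each case has exactly one "1" across the full set of dummy variables. The sketch below shows this check in Python on four hypothetical cases. The function name check_dummy_columns is an assumption, and the final line illustrates Note 1: "1.00" and "1" are the same number displayed with different formatting.

```python
def check_dummy_columns(columns):
    """Return True if every value is 0 or 1 and each case (row) has
    exactly one 1 across the full set of dummy variables."""
    rows = list(zip(*columns.values()))
    return all(set(row) <= {0, 1} and sum(row) == 1 for row in rows)


# Four hypothetical cases, written with decimals as SPSS Statistics displays them.
columns = {
    "swimming": [1.00, 0.00, 0.00, 1.00],
    "cycling":  [0.00, 1.00, 0.00, 0.00],
    "running":  [0.00, 0.00, 1.00, 0.00],
}

print(check_dummy_columns(columns))

# The decimals are only a display format: the stored value is the same.
print(1.00 == 1, f"{1:.2f}")
```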