Neil T. Diamond and Ewa M. Sztendur



Copyright © 2014 Neil T. Diamond and Ewa M. Sztendur

published by esquant statistical consulting pty ltd

typeset with tufte-latex

For academic use only. You may not reproduce or distribute without permission of the authors.

First printing, October 2014


1 Introduction to SEM

2 SEM Builder

3 Stata SEM Commands

4 Datasets

5 Bibliography
1 Introduction to SEM

1.1 The basics

Structural equation modelling (SEM) is a statistical methodology that
takes a confirmatory rather than exploratory approach to the analysis
of a structural theory of some phenomenon. In that respect, the aim
of SEM is to determine whether a certain model is valid rather than
to find a suitable model. SEM is primarily a latent-variable approach,
which means that interest usually focuses on theoretical constructs
that cannot be observed (latent constructs). Such constructs arise in
many disciplines, including the social and behavioural sciences as well
as economics. Examples of latent variables are intelligence, motivation,
attitude, liberalism, self-esteem, stress, verbal ability, math ability,
and teacher expectancy. Traditional multivariate methods cannot assess
how measurement error distorts causal inferences, whereas SEM provides
explicit estimates of the error variance parameters. In addition,
analyses using the traditional methods are based solely on observed
measurements, whereas SEM procedures can incorporate both unobserved
and observed variables.

1.1.1 Latent versus manifest variables

Latent variables are unobserved variables that are inferred (through
a mathematical model) from multiple manifest variables that are
observed. For example, a latent variable substance use can be derived
from observed items measuring behaviours such as alcohol use,
cigarette use, marijuana use, etc. Mathematical models that aim
to explain observed variables in terms of latent variables are called
latent variable models. Latent variable models are used in many
disciplines, including social sciences and psychology, economics,
management, marketing, medicine, physics, and bioinformatics.
6 introduction to sem with stata - day 1

1.1.2 Exogenous versus endogenous latent variables

Exogenous latent variables are like independent variables in ANOVA
or predictors in regression; they cause fluctuations in the values of
other latent variables in the model. Changes in the values of exogenous
variables are not explained by the model, but are considered
to be influenced by other factors external to the model. Background
variables such as gender or age are examples of such external factors.

Endogenous latent variables are like dependent variables in
ANOVA or outcome or criterion variables in regression. They are
influenced by the exogenous variables in the model, either directly
or indirectly. Fluctuations in the values of endogenous variables are
said to be explained by the model because all latent variables that
influence them are included in the model specification. Unlike in ANOVA
or regression, in SEM endogenous variables can also predict other
variables in the model.

1.1.3 The factor analytic model

Factor analysis is a procedure for examining the nature of
inter-relationships between sets of observed and latent variables. The
method is used to investigate a large set of variables that represent
elements of an abstract construct, and to reduce it to a smaller, more
manageable set of underlying concepts. For example, we can analyse
intelligence in terms of perception, quantification, word fluency,
verbal ability, spatial ability, memory and reasoning. In another
example, we could examine a large set of behaviours within an individual
and categorise them as representing different conceptual elements
of the person’s psychological state. Loss of appetite, lack of
motivation, withdrawal, and feelings of sadness and guilt might reflect
underlying "depression". Sleeplessness, worried thoughts, racing
heart, hot and cold flushes, and nail biting might be indicative of
"anxiety". Depression and anxiety would each be composed of a set
of related elements, with each set of elements unrelated to the other
set. Each set of related elements represents a unique factor. The
inter-correlation of variables within a factor suggests that those
variables, taken together, represent a single concept that can be
distinguished from other factors. Therefore depression can be
distinguished from anxiety. We might also be interested in the relative
strength of the association between each of the variables within a
factor and the concept that the factor represents. For example, what is
the relationship between having worried thoughts and the concept of
anxiety? In addition to categorising variables into factors, factor
analysis also weights each variable within a factor. These coefficients,
called factor loadings, are measures of the correlation between the
individual variable and the overall factor.

There are two basic types of factor analysis: exploratory
factor analysis (EFA) and confirmatory factor analysis (CFA). As an
exploratory approach, factor analysis can be used to sort through a
large number of variables in an effort to reveal links between the
observed and latent variables. This type of analysis may represent early
stages of inquiry, when concepts and relationships are not yet
sufficiently understood to propose relevant hypotheses. The exploratory
approach is often referred to as a theory building approach. The
confirmatory approach, on the other hand, is used when the researcher has
some knowledge of the underlying latent variable structure (often
based on the exploratory findings). The confirmatory approach is
often referred to as a theory confirming approach.

In summary, both EFA and CFA are procedures used to reduce a
large number of inter-related measured variables to a smaller number
of underlying factors. They both focus solely on how, and the extent to
which, the observed variables are linked to their underlying latent
factors.

1.2 Outline of the Workshop

1.2.1 Day 1

Introduction to Stata Menus Reading data into Stata. Cleaning a


Introduction to Stata Commands Turning a review window into a do
file. Running do files. A discussion of some useful Stata commands.

Statistics in Stata Using the menus for simple statistical methods

and corresponding Stata commands. Revision of Correlation and

Some Multivariate methods in Stata Reliability Analysis, Principal
Components, and Exploratory Factor Analysis.

1.2.2 Day 2

Introduction to SEM Builder Confirmatory Factor Analysis using the

SEM Builder. Simple Structural Equation Models. Fit indices.

Introduction to SEM Commands Understanding the model syntax.

Modifying the syntax generated by the SEM Builder.

Some further commands for SEM Using a covariance matrix as input.


Some more details Estimators, standard errors, and missing values.


1.2.3 Day 3
What to do when the model does not fit Modification Indices.

More on SEM Multiple Groups Analysis and Growth Curve Models.

Reporting SEM What to include. Modifying the diagram for publication.

2 SEM Builder

The SEM Builder is a graphical user interface for building and fitting
structural equation models in Stata.
In the SEM Builder, and more generally, structural equation models
are portrayed as diagrams, using particular configurations of four
geometric symbols: a circle (or ellipse), a square (or rectangle), a
single-headed arrow, and a double-headed arrow. By convention, circles
(or ellipses) represent unobserved latent variables; squares (or
rectangles) represent observed variables; single-headed arrows (→)
represent the impact of one variable on another; and double-headed
arrows (↔) represent covariances or correlations between pairs of
variables.

2.1 An Example of Using the SEM Builder

As an example, we will use a subset of the classic Holzinger and
Swineford (1939) dataset.1 In this section, however, we will only
concern ourselves with three of the variables, x1, x2 and x3, which
are related to visual perception.

1 From the help file in the lavaan package (Rosseel, 2013): The classic
dataset consists of mental ability test scores of seventh and eighth
grade children from two different schools (Pasteur and Grant-White). In
the original dataset (available in the MBESS package), there are scores
for 26 tests. However, a smaller subset with 9 variables is more widely
used in the literature (for example in Jöreskog’s 1969 paper, which uses
the 145 subjects from the Grant-White school). K. Holzinger and
F. Swineford. A study in factor analysis: The stability of a bifactor
solution. Number 48 in Supplementary Educational Monograph. University
of Chicago Press, Chicago, 1939; and K. G. Jöreskog. A general approach
to confirmatory maximum likelihood factor analysis. Psychometrika,
34:183–202, 1969.

2.2 Specifying the data

The first step is to specify the data. Since the data is in a .csv file we
can use File ⊳ Import Text data (delimited, *.csv, . . . ) and browse for
the HolzingerSwineford1939.csv file.

2.3 Specifying the model using the SEM Builder

Here are the steps involved:

• Choose Statistics ⊳ SEM (structural equation modeling) ⊳ Model
building and estimation. The SEM Builder screen will open.

Figure 2.1: SEM builder screen

• On the left hand side, click the Add Measurement Component (M)
button. Click the cursor on a position in the centre and at the top of
the canvas. The measurement component dialog box will open.

Figure 2.2: Measurement Component Dialog Box

• Change the latent variable name to “Visual”. It is a good idea to
follow the convention that latent variable names begin with a capital
letter but observed variable names are all lower case.

• Use the drop down menu in the Measurement variables box and
click on x1, x2 and x3.

• Click OK and the model will be shown.

To fit the model, choose Estimation ⊳ Estimate, choose maximum
likelihood and press OK. You should get the following graph.

Figure 2.3: Estimated Visual congeneric factor model for the Holzinger
and Swineford Dataset.
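The menu steps above (importing the .csv file and fitting the one-factor model) correspond roughly to the short command sequence below. This is a minimal sketch, not the exact code the SEM Builder generates: it assumes the file HolzingerSwineford1939.csv sits in the current working directory and that its columns are named x1, x2, x3 and so on.

```stata
* Sketch only. Assumption: the CSV file is in the working directory
* and contains columns named x1, x2, x3, ...
import delimited "HolzingerSwineford1939.csv", clear

* One-factor measurement model: Visual (capitalised) is treated as latent,
* x1-x3 are observed. method(ml) requests maximum likelihood (the default).
sem (Visual -> x1 x2 x3), method(ml)
```

Typing the commands has the advantage that the analysis can be saved in a do file and rerun later.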

2.3.1 Interpretation of the Model

• The means of the three observed variables are 4.94, 6.09, and 2.25.

• The values of 1, 0.78, and 1.1 on the arrows from the latent variable
to the observed variables are the loadings. These are the regression
coefficients of the three observed variables on the latent variable
“Visual”.

– Note the loading on the first variable is set to 1. The latent
variable needs a scale and the scale is by default set to be the
same as the first observed variable.

• The mean of the latent variable is assumed to be 0 and is not
estimated.

• The variance of the latent variable is estimated to be 0.52.

• The residual variances for x1, x2, and x3, i.e. the variation not
explained by the latent variable “Visual”, are 0.83, 1.06 and 0.63,
respectively.

2.4 Standardised Model

An alternative is to specify that the variance of the latent variable is 1
and all the manifest variables are standardised. To do this, follow the
steps below:
• Select View ⊳ Standardized Estimates
The revised model is shown in Figure 2.4.

Figure 2.4: Estimated Visual standardised congeneric factor model for
the Holzinger and Swineford Dataset.
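For readers working from the command line, the standardised solution can be requested with the standardized option of sem. The sketch below assumes the Holzinger and Swineford data are already in memory with variables x1, x2 and x3; it shows two equivalent routes rather than the Builder's own generated code.

```stata
* Route 1: fit the model and report standardised coefficients directly
sem (Visual -> x1 x2 x3), standardized

* Route 2: after an unstandardised fit, replay the stored results
* in standardised form without re-estimating
sem (Visual -> x1 x2 x3)
sem, standardized
```

Both routes report the same fitted model; only the scaling of the displayed estimates changes.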

2.4.1 Interpretation of the Standardised Model

• Note that the loading on x1 is now different from 1. The latent
variable needs a scale and this is set by setting the variance equal to 1.

• The correlations between the latent variable and the observed
variables are 0.62, 0.48, and 0.47, respectively.

• The mean of the latent variable is 0.

• The residual variances for x1, x2, and x3, i.e. the proportion of the
variation not explained by the latent variable “Visual”, are 0.61, 0.77
and 0.5, respectively.

2.5 Creating a CFA example in SEM Builder

The “one-factor congeneric” model for Visual has no degrees of
freedom; it is a just-identified model. You need at least four
indicators for over-identification. Now we will examine the nine observed
variables. This is an example of Confirmatory Factor Analysis. Our
model is that x1, x2, and x3 load on Visual; x4, x5, and x6 load on
Textual; and x7, x8, and x9 load on Speed, and that the three latent
variables are distinct concepts but are correlated with each other.
Let’s use Stata’s SEM Builder to fit this model to the nine-variable
Holzinger and Swineford data set.

1. Choose Estimation ⊳ Clear Estimates

2. Type “S” to choose the Select button.

3. With the shift key, select the Visual model and move it to the left of
the canvas.

4. Type “M” to choose the “Add Measurement Component” button, and
click in the middle of the canvas at about the same level as the Visual
latent variable.

5. In the dialog box change the latent variable to “Textual” and
associate the observed variables x4, x5 and x6. Press OK.

6. Type “M” to choose the “Add Measurement Component” button, and
click on the right of the canvas at about the same level as the Visual
and Textual latent variables.

7. In the dialog box change the latent variable to “Speed” and
associate the observed variables x7, x8 and x9. Press OK.

8. Type “C” to choose the “Add Covariance” button.

9. Click on the Visual latent variable and drag the covariance to the
Textual latent variable. You can adjust the position of the covariance
double-headed arrow by moving the little circles on the latent
variable ellipses. You can also adjust the curve of the covariance
double-headed arrow by moving the circle on the end of the “handle”.

10. Do the same for Visual and Speed; and Textual and Speed.

11. Estimate the parameters using Estimation ⊳ Estimate.

12. Display the standardized solution using View ⊳ Standardized
Estimates. In the Main tab, choose Maximum Likelihood and in
the Reporting tab, select Display Standardized coefficients and
values.2 In the Advanced Tab, check Do not estimate means or
intercepts.

13. If you have done everything correctly then the model should
appear as in Figure 2.5.

2 You can also get the standardized estimates to display by estimating
the unstandardized model and then going to View ⊳ Standardized
Estimates. However, if you do this, only the standardized estimates show
up in the results window.

Figure 2.5: Estimated CFA Model for Holzinger-Swineford Data Set
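The three-factor model built in the steps above can also be written as a single sem command. The following is a sketch under stated assumptions, not the Builder's own output: it assumes x1 to x9 are in memory, and it relies on the sem default that covariances among exogenous latent variables are estimated without being listed. The nomeans option mirrors the "Do not estimate means or intercepts" box; leaving it out would report the _cons intercepts as well.

```stata
* Sketch only: three correlated factors, three indicators each.
* The /// line continuation works in a do file, not at the command prompt.
sem (Visual -> x1 x2 x3)      ///
    (Textual -> x4 x5 x6)     ///
    (Speed -> x7 x8 x9),      ///
    method(ml) standardized nomeans
```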

2.6 The output

Structural equation model                       Number of obs      =       301
Estimation method  = ml
Log likelihood     = -3737.7449

 ( 1)  [x1]Visual = 1
 ( 2)  [x4]Textual = 1
 ( 3)  [x7]Speed = 1

Standardized         |      Coef.  Std. Err.        z   P>|z|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
Measurement          |
  x1 <-              |
              Visual |   .7718802   .0575346    13.42   0.000     .6591144    .8846459
               _cons |   4.234926   .1819724    23.27   0.000     3.878267    4.591586
  x2 <-              |
              Visual |   .4236006    .062738     6.75   0.000     .3006364    .5465649
               _cons |   5.179137   .2188139    23.67   0.000      4.75027    5.608005
  x3 <-              |
              Visual |   .5811323   .0584538     9.94   0.000     .4665651    .6956996
               _cons |   1.993107   .0996045    20.01   0.000     1.797886    2.188328
  x4 <-              |
             Textual |   .8515823   .0226412    37.61   0.000     .8072064    .8959581
               _cons |   2.633762   .1218401    21.62   0.000      2.39496    2.872564
  x5 <-              |
             Textual |   .8550654   .0221923    38.53   0.000     .8115693    .8985616
               _cons |   3.369123   .1489219    22.62   0.000     3.077242    3.661005
  x6 <-              |
             Textual |   .8380101   .0235412    35.60   0.000     .7918702      .88415
               _cons |   1.998179   .0997732    20.03   0.000     1.802627    2.193731
  x7 <-              |
               Speed |   .5695148   .0583107     9.77   0.000     .4552279    .6838017
               _cons |   3.848319   .1671013    23.03   0.000     3.520807    4.175832
  x8 <-              |
               Speed |   .7230442   .0622861    11.61   0.000     .6009657    .8451228
               _cons |    5.46731   .2301649    23.75   0.000     5.016195    5.918424
  x9 <-              |
               Speed |   .6650094   .0660831    10.06   0.000     .5354889    .7945299
               _cons |   5.334255   .2249189    23.72   0.000     4.893422    5.775088
---------------------+----------------------------------------------------------------
           var(e.x1) |    .404201   .0888196                      .2627564    .6217867
           var(e.x2) |   .8205625   .0531517                      .7227287    .9316398
           var(e.x3) |   .6622852   .0679387                      .5416601     .809773
           var(e.x4) |   .2748076   .0385616                      .2087307    .3618023
           var(e.x5) |   .2688631   .0379518                      .2038818    .3545552
           var(e.x6) |   .2977391   .0394555                      .2296345     .386042
           var(e.x7) |   .6756529   .0664177                       .557249    .8192152
           var(e.x8) |    .477207   .0900713                      .3296441    .6908254
           var(e.x9) |   .5577625   .0878918                      .4095601     .759593
         var(Visual) |          1          .                             .           .
        var(Textual) |          1          .                             .           .
          var(Speed) |          1          .                             .           .
---------------------+----------------------------------------------------------------
 cov(Visual,Textual) |   .4585093   .0634638     7.22   0.000     .3341225    .5828962
   cov(Visual,Speed) |   .4705348   .0862308     5.46   0.000     .3015256     .639544
  cov(Textual,Speed) |   .2829848   .0714709     3.96   0.000     .1429045    .4230652

LR test of model vs. saturated: chi2(24) = 85.31, Prob > chi2 = 0.0000

2.6.1 Interpretation of Output

• The number of observed statistics is 54, consisting of 9 means, 9
variances and 36 = (9 × 8)/2 covariances. Sometimes the means are
not counted and you will see the number of observed statistics given
as 45, i.e. 9 variances plus 36 covariances.

• The number of estimated parameters is 30: 9 means, 9 residual
variances, 6 loadings (2 for each of the 3 factors), 3 covariances
among the latent variables and 3 variances for the latent variables.
Again, sometimes the means are not counted and in this case the
number of estimated parameters is said to be 21.

• For each parameter estimated, the standard error is also given.

• The degrees of freedom is the number of observed statistics minus
the number of estimated parameters, i.e. 54 − 30 = 24 in this case.
This is the same whether the means are counted or not.

• The model is fitted by maximising the likelihood. The χ² statistic
quoted is twice the difference between the log-likelihood of the
saturated model and that of the fitted model which, if the model is
correct, has a χ² distribution with the degrees of freedom given above.
Note that maximising the likelihood is equivalent to minimising minus
twice the log-likelihood.

• The χ² statistic is a measure of lack of fit. The p-value is very
small, indicating the model does not fit the data.
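The counting argument in the bullets above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the scalar names (n_ind, n_stats, n_par, df_model) are made up for the example and are not produced by sem.

```stata
* Counting observed statistics and free parameters for the 9-indicator CFA
scalar n_ind    = 9
* means + variances and covariances
scalar n_stats  = n_ind + n_ind*(n_ind + 1)/2
* 9 means + 9 residual variances + 6 loadings + 3 covariances + 3 variances
scalar n_par    = 9 + 9 + 6 + 3 + 3
scalar df_model = n_stats - n_par
display "observed statistics = " n_stats
display "free parameters     = " n_par
display "degrees of freedom  = " df_model
```

Dropping the 9 means from both counts gives 45 − 21, which leaves the degrees of freedom unchanged at 24.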
To get further information on the model fit, choose Estimation
⊳ Overall goodness of fit. In the estat-Postestimation tool for sem
dialog box, select Goodness-of-fit statistics in the Reporting and
statistics:(subcommand) drop-down list, and select all in the Statistics
to be displayed drop-down list. Press OK.

. estat gof, stats(all)

Fit statistic        |      Value   Description
---------------------+------------------------------------------------------
Likelihood ratio     |
         chi2_ms(24) |     85.306   model vs. saturated
            p > chi2 |      0.000
         chi2_bs(36) |    918.852   baseline vs. saturated
            p > chi2 |      0.000
Population error     |
               RMSEA |      0.092   Root mean squared error of approximation
 90% CI, lower bound |      0.071
         upper bound |      0.114
              pclose |      0.001   Probability RMSEA <= 0.05
Information criteria |
                 AIC |   7517.490   Akaike's information criterion
                 BIC |   7595.339   Bayesian information criterion
Baseline comparison  |
                 CFI |      0.931   Comparative fit index
                 TLI |      0.896   Tucker-Lewis index
Size of residuals    |
                SRMR |      0.065   Standardized root mean squared residual
                  CD |      0.986   Coefficient of determination

The interpretation of these statistics is as follows:

2.6.2 Model Chi-Square

The model chi-square tests the exact fit hypothesis, i.e. that there are
no discrepancies between the population covariance and that implied
by the fitted model. The first part of the summary indicates that the
χ² statistic was 85.306, that there were 24 degrees of freedom, and
that the p-value, the probability of obtaining the observed
value of χ² or more extreme assuming the assumed model is correct,
is quite small. Ideally, we would want the p-value to be greater than

. estat gof, stats(chi2)

Fit statistic        |      Value   Description
---------------------+------------------------------------------------------
Likelihood ratio     |
         chi2_ms(24) |     85.306   model vs. saturated
            p > chi2 |      0.000
         chi2_bs(36) |    918.852   baseline vs. saturated
            p > chi2 |      0.000

Where do the 24 degrees of freedom come from? There are 9
variances and (9 × 8)/2 = 36 covariances. We need to estimate 9
residual variances, 6 loadings (2 for each factor), 3 factor covariances
and 3 factor variances; a total of 21 free parameters. The degrees
of freedom is the number of bits of information less the number of
parameters to estimate, i.e. 45 − 21 = 24.
If the model is correct, the expected value of the χ² statistic equals
the degrees of freedom, 24 in this case. Note that the χ² statistic can
be affected by non-normality and sample size as well as other factors.
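The p-value shown as 0.000 in the output is a rounded upper-tail probability of the χ² distribution. As a hedged illustration, it can be recomputed from the reported statistic with Stata's chi2tail() function; the scalar name p_model is an assumption made for this example.

```stata
* Upper-tail probability of a chi-square with 24 df at the observed value
scalar p_model = chi2tail(24, 85.306)
display "p-value = " %12.3e p_model
```

The result is far below 0.001, which is why the table prints 0.000.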

• The saturated model corresponds to an exact fit model. Since there
is a statistically significant difference between our model and the
saturated model, it means that our model does not explain the

• The baseline model is the model where all the observed variables
are independent.

2.6.3 Root Mean Square Error of Approximation

The Root Mean Square Error of Approximation is a popular fit index.
The formula is

\[
\begin{aligned}
\mathrm{RMSEA} &= \sqrt{\frac{\chi^2_M - df_M}{df_M \times (N - 1)}} \\
&= \sqrt{\frac{85.306 - 24}{24 \times 300}} \quad \text{in this case} \\
&= 0.092
\end{aligned}
\]

and is, in words, the amount of discrepancy per degree of freedom.
Ideally, the RMSEA is less than 0.05. We are provided with a 90%
confidence interval, as well as the probability that the population
RMSEA is less than 0.05 (pclose). In this case pclose is 0.001, so the
close-fit hypothesis is not supported.
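The RMSEA value can be reproduced by hand from the reported χ² statistic, its degrees of freedom and the sample size. The sketch below simply evaluates the formula above; the scalar names are illustrative and are not results stored by estat gof.

```stata
* Worked check of the RMSEA formula using the values reported above
scalar chi2_m = 85.306
scalar df_m   = 24
scalar n_obs  = 301
scalar rmsea  = sqrt((chi2_m - df_m) / (df_m * (n_obs - 1)))
display "RMSEA = " %5.3f rmsea
```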

. estat gof, stats(rmsea)

Fit statistic        |      Value   Description
---------------------+------------------------------------------------------
Population error     |
               RMSEA |      0.092   Root mean squared error of approximation
 90% CI, lower bound |      0.071
         upper bound |      0.114
              pclose |      0.001   Probability RMSEA <= 0.05

2.6.4 Information Criteria

• The AIC (Akaike’s Information Criterion) is a penalised measure of
lack of fit. It equals minus twice the log-likelihood plus twice the
number of estimated parameters. Smaller values mean better fit.
AICs can be compared for non-nested models.

• The BIC (Bayesian Information Criterion) is an alternative to the
AIC, with a different penalty.

. estat gof, stats(ic)

Fit statistic        |      Value   Description
---------------------+------------------------------------------------------
Information criteria |
                 AIC |   7535.490   Akaike's information criterion
                 BIC |   7646.703   Bayesian information criterion

These fit indices can be used to compare non-nested models. The
actual values can’t be interpreted, except to say that the smaller the
criterion, the better.
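As a worked check, the AIC and BIC in the stats(ic) table can be recomputed from the log likelihood reported earlier (−3737.7449). The sketch assumes the usual BIC penalty of ln(N) per parameter with N = 301, and counts 30 free parameters (means included); the scalar names are illustrative only.

```stata
* AIC = -2*logL + 2*k ;  BIC = -2*logL + k*ln(N)
scalar ll    = -3737.7449
scalar n_obs = 301
scalar k     = 30
scalar aic   = -2*ll + 2*k
scalar bic   = -2*ll + k*ln(n_obs)
display "AIC = " %9.3f aic
display "BIC = " %9.3f bic
```

Setting k to 21 (means not counted) gives 7517.490 and 7595.339 instead, which are the values shown in the earlier stats(all) table.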

2.6.5 Full model versus baseline model

Next, we get a comparison between the model we have fitted and the
baseline model. Two indices are provided. The comparative fit index
(CFI) is given by

\[
\begin{aligned}
\mathrm{CFI} &= 1 - \frac{\chi^2_M - df_M}{\chi^2_B - df_B} \\
&= 1 - \frac{85.306 - 24}{918.852 - 36} \quad \text{in this case} \\
&= 0.931
\end{aligned}
\]

where the subscript indicates whether we are referring to our model
(M) or the baseline model (B).
The Tucker-Lewis index (TLI) is given by

\[
\begin{aligned}
\mathrm{TLI} &= \left[\frac{\chi^2_B}{df_B} - \frac{\chi^2_M}{df_M}\right] \Big/ \left[\frac{\chi^2_B}{df_B} - 1\right] \\
&= \left[\frac{918.852}{36} - \frac{85.306}{24}\right] \Big/ \left[\frac{918.852}{36} - 1\right] \quad \text{in this case} \\
&= 0.896
\end{aligned}
\]

. estat gof, stats(indices)

Fit statistic        |      Value   Description
---------------------+------------------------------------------------------
Baseline comparison  |
                 CFI |      0.931   Comparative fit index
                 TLI |      0.896   Tucker-Lewis index
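Both baseline-comparison indices can be recomputed from the two χ² statistics and their degrees of freedom. The following sketch evaluates the CFI and TLI formulas given above; the scalar names are chosen for illustration and are not stored results of estat gof.

```stata
* Model (m) and baseline (b) chi-square statistics and degrees of freedom
scalar chi2_m = 85.306
scalar df_m   = 24
scalar chi2_b = 918.852
scalar df_b   = 36

scalar cfi = 1 - (chi2_m - df_m) / (chi2_b - df_b)
scalar tli = (chi2_b/df_b - chi2_m/df_m) / (chi2_b/df_b - 1)
display "CFI = " %5.3f cfi
display "TLI = " %5.3f tli
```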

• The SRMR is the square root of the sum of the squared differences
between the observed correlations and the correlations implied by our
model, divided by the number of variances and covariances. This is
given by the formula below

\[
\mathrm{SRMR} = \sqrt{\frac{\sum_{i \le j} (r_{i,j} - \rho_{i,j})^2}{v(v + 1)/2}}
\]

where $r_{i,j}$ is the observed correlation for the ith and jth variables,
$\rho_{i,j}$ is the model implied correlation between the ith and jth
variables, and $v$ is the number of variables.

Fit statistic        |      Value   Description
---------------------+------------------------------------------------------
Size of residuals    |
                SRMR |      0.060   Standardized root mean squared residual
                  CD |      0.986   Coefficient of determination

2.7 Political Democracy Dataset

In Bollen’s Political Democracy Dataset and Model, relating to 75
developing countries, there is one exogenous latent variable,
“Industrialisation”, and two endogenous latent variables, “Democracy in
1960” and “Democracy in 1965”. Bollen wanted to examine the effect
of Industrialisation on Democracy.
Industrialisation is measured by three observed variables:
x1 the gross national product (GNP) per capita in 1960

x2 the inanimate energy consumption per capita in 1960

x3 the percentage of the labour force in industry in 1960

Political Democracy in 1960 is measured by four observed variables:
y1 expert ratings of the freedom of the press in 1960

y2 the freedom of political opposition in 1960

y3 the fairness of elections in 1960

y4 the effectiveness of the elected legislature in 1960

Similarly, Political Democracy in 1965 is measured by y5 to y8, which
are the same measures as y1 to y4 but recorded in 1965.
Exercise: Use the SEM Builder to generate a diagram similar to
Figure 2.6. The steps are:

1. Save your current graph as a .stsem file (the extension stands for
SEM Path Diagram).

2. Choose File ⊳ Exit.

3. Adjust the canvas size to, say, 9 in. by 6 in., and use the fit in
window button.
4. Press the Add Measurement Component button and click at about
3 down and 3 across on the grid. For the latent variable name type
dem60. In the measured variables select y1 y2 y3 y4. Check the Do
not estimate constants box. Put the Menu direction as up.
5. Press the Add Measurement Component button and click at about
3 down and 6 across on the grid. For the latent variable name type
dem65. In the measured variables select y5 y6 y7 y8. Check the Do
not estimate constants box. Put the Menu direction as up.
6. Press the Add Measurement Component button and click at about
4.5 down and 4.5 across on the grid. For the latent variable name
type ind60. In the measured variables select x1 x2 x3. Check the
Do not estimate constants box. Put the Menu direction as down.
7. Use the select tool to adjust the position of the latent variables and
associated observed variables as you see fit.
8. Add paths from ind60 to dem60; from ind60 to dem65; and from
dem60 to dem65.
9. Add covariances between ε2 and ε7; between ε3 and ε8; between
ε4 and ε9; and between ε5 and ε10.
10. Also add covariances between ε3 and ε5, and between ε8 and ε10.
Modify the appearance of these paths by moving the lever as appropriate.
11. Choose Estimation ⊳ Estimate. Choose Maximum Likelihood. In
the Reporting tab, check the Standardized coefficients and values.
In the Advanced tab, check Do not fit means or intercepts.

The resulting diagram needs some adjustment. In particular, the
variance for ε1 does not show and the covariances between the errors
are hard to read because of all the lines. To fix this, follow the steps
below:

1. Select ε1. Press Properties . . . to get the Variable properties dialog
box. In the Appearance tab, check Customize appearance for selected
variables and choose Set custom appearance. In the Variable
settings-selected variables tab, choose the Results tab. Press the
Results 1 box under Appearance of results (font, color, position etc.).
Choose Position of 9 o’clock and the Boundary gap as 3 pt. Press

2. Now select the covariance arrows and move the positions of the
covariances. To do this for the covariance between ε2 and ε7, for
example, select the covariance and then click Properties . . . . In
the Appearance tab, check Customize appearance for selected
variables and choose Set custom appearance. In the Results tab,
press Results 1 and select the Distribution between nodes to be
10%. The appropriate value for the covariance between ε5 and ε10
should be 90%. For the covariance between ε3 and ε8, we suggest
15% (which you need to type in); and for the covariance between
ε4 and ε9 we suggest 85%.

3. Finally, adjust the positions of the covariances between ε3 and ε5
and between ε8 and ε10 to look the same.

The results you obtain should be as follows:

Figure 2.6: Estimated CFA model for the Political Democracy Dataset.

Structural equation model                       Number of obs      =        75
Estimation method  = ml
Log likelihood     = -1547.7909

 ( 1)  [y1]dem60 = 1
 ( 2)  [y5]dem65 = 1
 ( 3)  [x1]ind60 = 1

Standardized         |      Coef.  Std. Err.        z   P>|z|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
Structural           |
  dem60 <-           |
               ind60 |   .4467129   .1046964     4.27   0.000     .2415117    .6519141
  dem65 <-           |
               dem60 |   .8852288   .0517686    17.10   0.000     .7837641    .9866934
               ind60 |   .1822596   .0729762     2.50   0.013     .0392289    .3252904
---------------------+----------------------------------------------------------------
Measurement          |
  y1 <-              |
               dem60 |   .8504258   .0437576    19.43   0.000     .7646626    .9361891
  y2 <-              |
               dem60 |   .7171219   .0639886    11.21   0.000     .5917065    .8425373
  y3 <-              |
               dem60 |   .7223492    .064376    11.22   0.000     .5961746    .8485238
  y4 <-              |
               dem60 |   .8457095   .0444636    19.02   0.000     .7585624    .9328566
  y5 <-              |
               dem65 |   .8080173   .0483896    16.70   0.000     .7131754    .9028593
  y6 <-              |
               dem65 |   .7460072   .0572477    13.03   0.000     .6338037    .8582107
  y7 <-              |
               dem65 |   .8236733   .0456011    18.06   0.000     .7342968    .9130499
  y8 <-              |
               dem65 |   .8278414   .0459159    18.03   0.000      .737848    .9178348
  x1 <-              |
               ind60 |   .9198529   .0231947    39.66   0.000     .8743921    .9653137
  x2 <-              |
               ind60 |   .9730326   .0165154    58.92   0.000     .9406629    1.005402
  x3 <-              |
               ind60 |   .8721386   .0308137    28.30   0.000     .8117447    .9325324
---------------------+----------------------------------------------------------------
           var(e.y1) |   .2767759   .0744251                      .1633954    .4688313
           var(e.y2) |   .4857361   .0917753                      .3354083    .7034399
           var(e.y3) |   .4782116   .0930039                      .3266451    .7001064
           var(e.y4) |   .2847755   .0752066                      .1697102    .4778562
           var(e.y5) |    .347108   .0781993                      .2232025    .5397968
           var(e.y6) |   .4434733   .0854144                      .3040347    .6468621
           var(e.y7) |   .3215622   .0751209                      .2034295    .5082954
           var(e.y8) |   .3146786   .0760221                      .1959876    .5052496
           var(e.x1) |   .1538706   .0426714                      .0893512    .2649787
           var(e.x2) |   .0532076   .0321401                      .0162856    .1738374
           var(e.x3) |   .2393743   .0537477                      .1541537    .3717075
        var(e.dem60) |   .8004476   .0935385                      .6365953    1.006474
        var(e.dem65) |   .0390048   .0497035                      .0032095    .4740198
          var(ind60) |          1          .                             .           .
---------------------+----------------------------------------------------------------
      cov(e.y1,e.y5) |   .2957611   .1421271     2.08   0.037     .0171972    .5743251
      cov(e.y2,e.y4) |    .272567   .1206589     2.26   0.024     .0360799    .5090541
      cov(e.y2,e.y6) |   .3562224   .0975541     3.65   0.000     .1650199    .5474249
      cov(e.y3,e.y7) |   .1906414   .1374685     1.39   0.166    -.0787919    .4600747
      cov(e.y4,e.y8) |   .1088014   .1354941     0.80   0.422    -.1567621     .374365
      cov(e.y6,e.y8) |   .3377705   .1113979     3.03   0.002     .1194346    .5561064

LR test of model vs. saturated: chi2(35) = 38.13, Prob > chi2 = 0.3292

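For reference, a model of this shape can also be specified directly with sem. The command below is a hedged sketch, not the exact text the Builder generates: it assumes the Political Democracy data are in memory with variables x1 to x3 and y1 to y8, uses the lower-case latent names from the exercise (hence latent() and nocapslatent), and lists the six error covariances that appear in the output above.

```stata
* Sketch only: measurement part, structural part, and correlated errors.
* The /// line continuation works in a do file, not at the command prompt.
sem (ind60 -> x1 x2 x3)                        ///
    (dem60 -> y1 y2 y3 y4)                     ///
    (dem65 -> y5 y6 y7 y8)                     ///
    (dem60 <- ind60)                           ///
    (dem65 <- ind60 dem60),                    ///
    latent(ind60 dem60 dem65) nocapslatent     ///
    cov(e.y1*e.y5) cov(e.y2*e.y4) cov(e.y2*e.y6) ///
    cov(e.y3*e.y7) cov(e.y4*e.y8) cov(e.y6*e.y8) ///
    standardized nomeans
```

Writing the error covariances out in this way makes it easy to check them against the cov() rows of the results table.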

As before, we can check the fit of the model using Estimation
⊳ Overall goodness of fit. In the estat-Postestimation tool for sem
dialog box, select Goodness-of-fit statistics in the Reporting and
statistics:(subcommand) drop-down list, and select all in the Statistics
to be displayed drop-down list. Press OK. All the statistics are
satisfactory.

. estat gof, stats(all)

Fit statistic        |      Value   Description
---------------------+------------------------------------------------------
Likelihood ratio     |
         chi2_ms(35) |     38.125   model vs. saturated
            p > chi2 |      0.329
         chi2_bs(55) |    730.654   baseline vs. saturated
            p > chi2 |      0.000
Population error     |
               RMSEA |      0.035   Root mean squared error of approximation
 90% CI, lower bound |      0.000
         upper bound |      0.092
              pclose |      0.611   Probability RMSEA <= 0.05
Information criteria |
                 AIC |   3157.582   Akaike's information criterion
                 BIC |   3229.424   Bayesian information criterion
Baseline comparison  |
                 CFI |      0.995   Comparative fit index
                 TLI |      0.993   Tucker-Lewis index
Size of residuals    |
                SRMR |      0.045   Standardized root mean squared residual
                  CD |      0.965   Coefficient of determination
3 Stata SEM Commands

3.1 A first glimpse of Stata SEM Syntax

Although the SEM Builder is great, it is probably better to use the
Stata SEM commands. In fact, all the SEM Builder does is translate
your diagram into a set of commands and run those commands when
you fit the model. Stata has great facilities for Structural Equation
Modelling. It is easy to use and has a simple model syntax that lets
you specify the model you want to fit to your data. The
package provides many summaries of your model and convenient
ways of improving it. Stata can handle multiple
groups (e.g. Males and Females), growth curve models,
categorical variables and more.
When you estimated the model using Estimation ⊳ Estimate, Stata
automatically generated the commands. These can be found in the
results window but also can be obtained by clicking on them in the
review window. After reloading the HolzingerSwineford data, we
can refit the Visual CFA model by clicking the appropriate line in the
Review window. Stata gives us the following.
sem (Visual -> x1, ) (Visual -> x2, ) (Visual -> x3, ), latent(Visual) nocapslatent

1. The syntax shows that the latent variable Visual is measured by the
three observed variables x1, x2 and x3. It could have been written as
(Visual -> x1 x2 x3)

2. The comma indicates that everything after that is an option to the

sem command. There are two options.

(a) latent(Visual) explicitly specifies that Visual is a latent variable.

All other variables are then observed variables.
(b) nocapslatent says not to treat variables with the first letter
capitalized as latent variables by default. Stata sem assumes that
latent variables have the first letter capitalized and that observed
variables have the first letter in lower case. If you want to
have all variables in your data set in lower case, you can use the
following command.

. rename *, lower

The syntax generated is correct but a bit long-winded and some of

it is superfluous as all our observed variables are lower case. We can
simplify it and also put it in a do file by right clicking and with a bit
of editing:

import delimited ///
    C:\Users\NeilDiamond\Documents\LavaanCourse\HolzingerSwineford1939.csv, clear
sem (Visual -> x1 x2 x3)

Behind the scenes, Stata automatically sets the first loading to 1,

and adds the residual variance.
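
The default constraint can also be written out explicitly, which makes it easier to see what Stata is doing. The lines below are a minimal sketch, assuming the HolzingerSwineford1939 data are already in memory with lower-case variable names x1 to x3. The @1 suffix fixes a coefficient (or a variance) at 1; the second command is an illustrative alternative that is not used elsewhere in these notes.

```stata
* Explicit version of the default: the loading of x1 on Visual is fixed at 1.
* This should give the same fit as: sem (Visual -> x1 x2 x3)
sem (Visual -> x1@1 x2 x3)

* Illustrative alternative: fix the variance of Visual at 1 instead,
* so that the loadings are expected to be estimated freely.
sem (Visual -> x1 x2 x3), var(Visual@1)
```

Both versions set the scale of the latent variable; only the way the scale is anchored differs.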

3.2 Confirmatory Factor Analysis Example

The Stata SEM code for the confirmatory factor analysis example is
given below:

cd "C:\Users\NeilDiamond\Documents\Stata Workshop\Day 2"

import delimited ///
    "C:\Users\NeilDiamond\Documents\Stata Workshop\Day 2\Data\HolzingerSwineford1939.csv"
sem (Visual -> x1, ) (Visual -> x2, ) (Visual -> x3, ), latent(Visual) nocapslatent
sem (Visual -> x1, ) (Visual -> x2, ) (Visual -> x3, ) ///
    (Textual -> x4, ) (Textual -> x5, ) (Textual -> x6, ) ///
    (Speed -> x7, ) (Speed -> x8, ) (Speed -> x9, ), ///
    covstruct(_lexogenous, diagonal) standardized nomeans ///
    latent(Visual Textual Speed) cov(Visual*Textual Visual*Speed Textual*Speed)
graph export "C:\Users\NeilDiamond\Documents\Stata Workshop\Day 2\HS1939.png", ///
    as(png) replace
estat gof, stats(all)

Again it is a bit long-winded. We can simplify the commands

somewhat as follows.

• (Visual -> x1, ) (Visual -> x2, ) (Visual -> x3, ) can become (Visual
-> x1 x2 x3), and similarly for Textual and Speed.

• Because we have followed the convention that latent variables
begin with a capital and observed variables don't, we do not have
to specify that Visual, Textual and Speed are latent, nor that we are
not following the convention (which nocapslatent does).

• covstruct(_lexogenous, diagonal) specifies that the covariance

structure of the latent exogenous variables is diagonal, but cov(
Visual*Textual Visual*Speed Textual*Speed) says that the three
variables covary. So we can just leave these parts out.

• We need to take out the graph command. SEM builder generates

the commands but you can’t use commands to generate a graph.

• We need to keep the standardized and nomeans options after the

important comma.

Your new do file should look something like this:

capture log close
log using HS39, replace text

// HS39.do: Fits CFA to HolzingerSwineford1939 data

// Neil Diamond 20/10/14

version 13
clear all
macro drop _all
set linesize 80

cd "C:\Users\NeilDiamond\Documents\Stata Workshop\Day 2"

import delimited ///
    ".\Data\HolzingerSwineford1939.csv"

sem (Visual -> x1 x2 x3) ///
    (Textual -> x4 x5 x6) ///
    (Speed -> x7 x8 x9), ///
    standardized nomeans

estat gof, stats(all)

log close

Run the do file and confirm you get the same results as before.

3.3 Fit Indices

Once the model has been fitted, the command

estat gof, stats(all)

gives a summary of the fit of the model.

We can also get a subset by listing the fit indices we want, for example:
estat gof, stats(chi2 rmsea ic indices residuals)

3.4 Extracting Information from the fitted model

After the analysis, Stata saves various statistics which you might
want to use. For example after getting the goodness of fit statistics
there are many statistics retained.
. estat gof, stats(all)
. return list

scalars:
           r(N_groups) =  1
                 r(cd) =  .9861419451994397
               r(srmr) =  .0595237982362845
                r(tli) =  .8958394762056794
                r(cfi) =  .9305596508037862
                r(bic) =  7646.703173647184
                r(aic) =  7535.489865704717
             r(pclose) =  .0006612367108219
         r(ub90_rmsea) =  .1136780172014793
         r(lb90_rmsea) =  .0714184911919339
              r(rmsea) =  .0921214848760547
               r(p_bs) =  1.5734175106e-169
              r(df_bs) =  36
            r(chi2_bs) =  918.8515836481301
               r(p_ms) =  8.50255161265e-09
              r(df_ms) =  24
            r(chi2_ms) =  85.30552225695647

matrices:
               r(nobs) :  1 x 1

. display r(chi2_bs)

One use of these statistics is to calculate fit statistics that Stata does
not compute. For example, the GFI (Goodness of Fit Statistic) is given by
\[ \mathrm{GFI} = 1 - \chi^2_{\text{model}} / \chi^2_{\text{null}} \]
but Stata does not compute this statistic. It is easy to calculate it
using the retained statistics. Storing the result as a scalar avoids
creating a new variable in the data set.

. scalar gfi = 1 - r(chi2_ms)/r(chi2_bs)

. display gfi
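
Because the r() results are replaced by the next command that returns results, it can be safer to copy them into named scalars straight away. The following is a minimal sketch, assuming a model has just been fitted with sem; the scalar names chi2_ms, chi2_bs and fit_ratio are chosen here for illustration only.

```stata
* Assumes a model has just been fitted with sem.
quietly estat gof, stats(all)

* Copy the retained statistics out of r() before another command
* that returns results replaces them.
scalar chi2_ms = r(chi2_ms)   // model vs. saturated
scalar chi2_bs = r(chi2_bs)   // baseline vs. saturated

* The ratio statistic defined in the text.
scalar fit_ratio = 1 - chi2_ms/chi2_bs
display "1 - chi2_ms/chi2_bs = " %6.4f fit_ratio
```

With the values listed above (85.3055 and 918.8516) the ratio works out to roughly 0.907.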

3.5 Re-analysis of the Political Democracy Data Set

The commands generated for the Political Democracy model are given below.

. sem (dem60 -> y1, ) (dem60 -> y2, ) (dem60 -> y3, ) (dem60 -> y4, ) (dem60 ->
> dem65, ) (ind60 -> dem60, ) (ind60 -> dem65, )
> (ind60 -> x1, ) (ind60 -> x2, ) (ind60 -> x3, ), nomeans standardize
> latent(dem60 dem65 ind60)
> cov(e.y1*e.y5 e.y2*e.y4 e.y2*e.y6 e.y3*e.y7 e.y4*e.y8 e.y6*e.y8) nocapslatent

• Note that because dem60, dem65, and ind60 begin with
lower-case letters, we do need to specify nocapslatent and also
specify that dem60, dem65 and ind60 are latent variables. Can you
think of something to simplify the commands?

• The code can be simplified by putting all the variables that a latent
variable loads on within the same bracket.

• Note how the covariances are specified. e.y1 and e.y5 are the error
terms attached to y1 and y5, respectively, and e.y1*e.y5 says we
want to allow these errors to covary.

Exercise: The commands are a bit long-winded. Develop a simpler
set of commands in a do file to fit the model. (An answer is over the page.)

import delimited ///
    C:\Users\NeilDiamond\Documents\LavaanCourse\PoliticalDemocracy.csv, clear
sem (Dem60 -> y1 y2 y3 y4 Dem65) ///
    (Dem65 -> y5 y6 y7 y8) ///
    (Ind60 -> x1 x2 x3 Dem60 Dem65), ///
    cov(e.y1*e.y5 e.y2*e.y4 e.y2*e.y6 e.y3*e.y7 e.y4*e.y8 e.y6*e.y8)
estat gof, stats(all)

3.6 Using a covariance matrix and vector of means as input

Sometimes we do not have the raw data, for example when we are reading
a paper. Usually either a sample variance-covariance matrix will be
provided, or the sample correlation matrix and the vector of sample
standard deviations (and possibly the sample means).
For an example, consider the data analysed by Kline (2011, p.163).
The data are adapted from Sava (2002), and relate to a study of 109
high school teachers that considers the causes and effects of teacher
burnout. The hypothesised model is that school support and coercive
control affect teacher burnout, and all these variables have an effect
on the teacher-pupil interaction, which in turn has an effect on the
school experience and the somatic status of the teacher's students.
The correlation matrix and standard deviations are given below.

Table 3.1: Correlations and Standard Deviations for teacher and pupils data

Variable                             1        2        3        4        5        6
1. Coercive Control             1.0000
2. Teacher Burnout              0.3557   1.0000
3. School Support              -0.2566  -0.4774   1.0000
4. Teacher-Pupil Interactions  -0.4046   0.0207   0.1864   1.0000
5. School Experience           -0.1615   0.0938   0.0718   0.6542   1.0000
6. Somatic Status              -0.3487  -0.0133   0.1570   0.7277   0.4964   1.0000
SD                              8.3072   9.7697  10.5212   5.0000   3.7178   5.2714

A graphical depiction of the model considered is given below:

Figure 3.1: Sava Model.

We need to enter the data into Stata. We use the ssd (Summary
statistics data) command. Open the do file editor, type the following
commands and save the file as sava.do.

clear
ssd init cc tb sc_spt tpi sc_e som_st
ssd set observations 109
ssd set sd 8.3072 9.7697 10.5212 5.000 3.7178 5.2714
#delimit ;
ssd set correlations
1 \
.3557 1 \
-.2566 -.4774 1 \
-.4046 .0207 .1864 1 \
-.1615 .0938 .0718 .6542 1 \
-.3487 -.0133 .1570 .7277 .4964 1 ;
#delimit cr
save sava.dta
clear
use sava
ssd list

• ssd init sets up the variables

• ssd set observations tells Stata how many observations there are.

• ssd set sd specifies the standard deviations of the variables.

• #delimit ; changes the signal to submit a line from a carriage

return (i.e. Enter) to a semi-colon. We need this, because we are
going to enter the matrix of correlations row by row.

• The correlation matrix is symmetric so we only have to enter the

lower diagonal of the matrix.

• Each row of the matrix is ended by a backslash.

• The end of the matrix input is signalled by the new delimiter, i.e. a
semi-colon. #delimit cr then restores the carriage return as the delimiter.

• We then save the data, clear the memory, and then use the data
and list it with ssd list.
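
The standard deviations matter because they are what turns the correlations back into covariances; if they are left unset, Stata uses 1 and the model is in effect fitted to the correlation matrix. As a worked illustration (the symbols r_ij and s_i are notation introduced here, not part of the original notes), using the first two variables of Table 3.1:

```latex
\[
  \operatorname{cov}(x_i, x_j) = r_{ij}\, s_i\, s_j ,
  \qquad \text{e.g.} \qquad
  \operatorname{cov}(\text{cc}, \text{tb})
    = 0.3557 \times 8.3072 \times 9.7697 \approx 28.87 .
\]
```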

Now run the do file.

. do sava.do

. clear

. ssd init cc tb sc_spt tpi sc_e som_st

Summary statistics data initialized.  Next use, in any order,

    ssd set observations (required)
        It is best to do this first.

    ssd set means (optional)
        Default setting is 0.

    ssd set variances or ssd set sd (optional)
        Use this only if you have set or will set correlations and, even then,
        this is optional but highly recommended.  Default setting is 1.

    ssd set covariances or ssd set correlations (required)

. ssd set observations 109
(value set)

Status:
                observations:  set
                       means:  unset
             variances or sd:  unset
 covariances or correlations:  unset (required to be set)

. ssd set sd 8.3072 9.7697 10.5212 5.000 3.7178 5.2714
(values set)

Status:
                observations:  set
                       means:  unset
             variances or sd:  set
 covariances or correlations:  unset (required to be set)

. #delimit ;
delimiter now ;
. ssd set correlations
> 1 \
> .3557 1 \
> -.2566 -.4774 1 \
> -.4046 .0207 .1864 1 \
> -.1615 .0938 .0718 .6542 1 \
> -.3487 -.0133 .1570 .7277 .4964 1 ;
(values set)

Status:
                observations:  set
                       means:  unset
             variances or sd:  set
 covariances or correlations:  set

. #delimit cr
delimiter now cr
. save sava.dta
file sava.dta saved

. clear

. use sava

. ssd list

  Observations = 109

  Means undefined; assumed to be 0

  Standard deviations:
          cc        tb    sc_spt       tpi      sc_e    som_st
      8.3072    9.7697   10.5212         5    3.7178    5.2714

  Correlations:
          cc        tb    sc_spt       tpi      sc_e    som_st
           1
       .3557         1
      -.2566    -.4774         1
      -.4046     .0207     .1864         1
      -.1615     .0938     .0718     .6542         1
      -.3487    -.0133      .157     .7277     .4964         1

end of do-file

Now open the do-file editor and create a do file called save_fit.do
with the following commands:

sem ///
    (sc_spt -> tb tpi) ///
    (cc -> tb tpi) ///
    (tb -> tpi) ///
    (tpi -> sc_e som_st)

Run the save_fit do file to get the following results.

sem ///
    (sc_spt -> tb tpi) ///
    (cc -> tb tpi) ///
    (tb -> tpi) ///
    (tpi -> sc_e som_st)

Endogenous variables

Observed:  tb tpi sc_e som_st

Exogenous variables

Observed:  sc_spt cc

Fitting target model:

Iteration 0:   log likelihood = -2052.8451
Iteration 1:   log likelihood = -2052.8451

Structural equation model                       Number of obs      =       109
Estimation method  = ml
Log likelihood     = -2052.8451

              |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
Structural    |
  tb <-       |
       sc_spt |  -.3838194   .0777506    -4.94    0.000    -.5362078    -.231431
           cc |    .293585   .0984724     2.98    0.003     .1005828    .4865873
  tpi <-      |
           tb |   .1424866   .0510318     2.79    0.005     .0424661    .2425071
       sc_spt |   .0966997   .0458219     2.11    0.035     .0068904    .1865089
           cc |  -.2717027   .0545622    -4.98    0.000    -.3786426   -.1647628
  sc_e <-     |
          tpi |    .486437   .0538653     9.03    0.000     .3808629     .592011
  som_st <-   |
          tpi |   .7671996   .0692629    11.08    0.000     .6314468    .9029524
    var(e.tb) |   67.51208    9.14499                       51.77024    88.04055
   var(e.tpi) |   19.16417   2.595923                       14.69565    24.99144
  var(e.sc_e) |   7.833977   1.061168                       6.007324    10.21606
var(e.som_st) |   12.95285   1.754555                       9.932622    16.89143
LR test of model vs. saturated: chi2(7) = 3.93, Prob > chi2 = 0.7877


1.(a) The results above are for the unstandardized model. If you
want to determine which of School Support or Coercive Control
has the biggest effect on Teacher Burnout, how would you mod-
ify your commands to do this? Modify your commands and
obtain a summary of the model.
(b) Assuming the model fits well, use SEM builder to display the

2. (The Classic Wheaton dataset) Anomia and Powerlessness are two
subscales of a standard alienation scale. The variance-covariance
matrix below is from data collected on a panel of 932 individuals
in rural Illinois in 1967 and 1971. Education is measured in years
and occstat represents a socioeconomic index based on the respondent's
occupation; these are indicators of socioeconomic status.

Variable                   1         2         3         4         5         6
1. anomia67           11.834
2. powerlessness67     6.947     9.364
3. anomia71            6.819     5.091    12.532
4. powerlessness71     4.783     5.028     7.495     9.986
5. education          -3.839    -3.889    -3.841    -3.625     9.610
6. occstat           -21.899   -18.831   -21.748   -18.775    35.522   450.288
The model fitted is given in the following diagram.

(a) Create a do file to enter the variance-covariance matrix into Stata.

(b) Create a do file with the Stata commands for the model shown
in the diagram. You will need to define three latent variables,
two regressions, and two sets of correlated residuals.
Figure 3.2: Wheaton Structural Equation Model. [Path diagram: observed
variables educ66, occstat66, anomia67, pwless67, anomia71 and pwless71;
latent variables including Alien67 and Alien71; error terms e1 to e8.]

(c) Fit the model, and obtain standardized estimates. Does the
model fit? What is your interpretation of the results?
(d) Use SEM builder to create a diagram summarizing the results.

3.7 Indirect effects

For the teacher burnout example, we can estimate the direct, indirect
and total effects of one variable on another. For example,

• The direct effect of Coercive Control on Teacher-Pupil interaction

is the coefficient on the path from Coercive Control to Teacher-
Pupil interaction (i.e. −0.272).

• The indirect effect of Coercive Control on Teacher-Pupil interaction

is the product of the coefficients on the path from Coercive Control
to Teacher Burnout and from Teacher Burnout to Teacher-Pupil
interaction (i.e. 0.294 × 0.143 = 0.042).

• The total effect is the sum of the direct and indirect effects (i.e.
−0.272 + 0.042 = −0.230).

Stata stores the coefficients in the model. We can take advantage of

these to test linear and non-linear combinations of these coefficients.
To see how they are defined in Stata type the following command:
sem, coeflegend

The results are as follows. Note that for the path going from sc_spt
to tb, the notation is _b[tb:sc_spt]: _b indicates a coefficient and is
followed by an opening square bracket. The destination variable comes
first and then the origin variable, separated by a colon. Finally we
have a closing square bracket.

. sem, coeflegend

Structural equation model                       Number of obs      =       109
Estimation method  = ml
Log likelihood     = -2052.8451

              |      Coef.  Legend
Structural    |
  tb <-       |
       sc_spt |  -.3838194  _b[tb:sc_spt]
           cc |    .293585  _b[tb:cc]
  tpi <-      |
           tb |   .1424866  _b[tpi:tb]
       sc_spt |   .0966997  _b[tpi:sc_spt]
           cc |  -.2717027  _b[tpi:cc]
  sc_e <-     |
          tpi |    .486437  _b[sc_e:tpi]
  som_st <-   |
          tpi |   .7671996  _b[som_st:tpi]
    var(e.tb) |   67.51208  _b[var(e.tb):_cons]
   var(e.tpi) |   19.16417  _b[var(e.tpi):_cons]
  var(e.sc_e) |   7.833977  _b[var(e.sc_e):_cons]
var(e.som_st) |   12.95285  _b[var(e.som_st):_cons]
LR test of model vs. saturated: chi2(7) = 3.93, Prob > chi2 = 0.7877

To test whether, for example, the total effect of Coercive Control on
Teacher-Pupil interaction is statistically significant, follow the steps
below.

1. Choose Statistics ⊳ SEM (structural equation modeling) ⊳ Testing
and CIs ⊳ Nonlinear combinations of parameters. Check the Post
estimation results box and then press Create. Press Create again. Click
on Coefficients in the Category box and then on Coefficients. The
list of saved coefficients is displayed. Select tb:cc and double click
to enter it into the equation box. Type * and then select tpi:tb and
double click. Type + and then select tpi:cc and double click. Press
OK three times.

. nlcom (_b[tb:cc]*_b[tpi:tb] + _b[tpi:cc]), post

       _nl_1:  _b[tb:cc]*_b[tpi:tb] + _b[tpi:cc]

              |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
        _nl_1 |  -.2298708   .0543087    -4.23    0.000    -.3363138   -.1234277
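
The same command can be typed directly, and the indirect effect can be tested on its own in the same way. The lines below are a minimal sketch, assuming the teacher burnout model is the active estimation result; the labels indirect and total are names chosen here for illustration. Leaving out the post option keeps the sem results in memory, so further postestimation commands can still be run afterwards.

```stata
* Assumes the sem model for the teacher burnout data is the active
* estimation result (re-run the sem command first if nlcom, post was used).

* Indirect effect of cc on tpi through tb
nlcom (indirect: _b[tb:cc]*_b[tpi:tb])

* Total effect = indirect effect + direct effect
nlcom (total: _b[tb:cc]*_b[tpi:tb] + _b[tpi:cc])
```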

Stata provides an easier way to do this calculation for the direct,

indirect and total effects. Choose Statistics ⊳ SEM (structural equa-
tion modeling) ⊳ Testing and CIs ⊳ Direct and indirect effects. In the
estat-Postestimation tools for sem dialog box make sure that Decom-
position of effects into total, direct and indirect effects is highlighted
in the Reporting and statistics: (subcommand) dropdown box. Check
the following boxes: Do not display effects with no paths; Report
standardized effects; Do not display direct effects.
. estat teffects, compact standardized nodirect

Indirect effects
              |      Coef.   Std. Err.       z    P>|z|    Std. Coef.
Structural    |
  tb <-       |
  tpi <-      |
       sc_spt |  -.0546891   .0225029    -2.43    0.015     -.1150791
           cc |   .0418319   .0205264     2.04    0.042      .0695013
  sc_e <-     |
           tb |   .0693108   .0248238     2.79    0.005       .182136
       sc_spt |   .0204355    .020981     0.97    0.330      .0578314
           cc |  -.1118176   .0291756    -3.83    0.000     -.2498498
  som_st <-   |
           tb |   .1093157   .0391516     2.79    0.005      .2025992
       sc_spt |   .0322305   .0330262     0.98    0.329      .0643288
           cc |  -.1763568    .044604    -3.95    0.000     -.2779206

Total effects
              |      Coef.   Std. Err.       z    P>|z|    Std. Coef.
Structural    |
  tb <-       |
       sc_spt |  -.3838194   .0777506    -4.94    0.000     -.4133434
           cc |    .293585   .0984724     2.98    0.003      .2496361
  tpi <-      |
           tb |   .1424866   .0510318     2.79    0.005      .2784103
       sc_spt |   .0420105   .0428804     0.98    0.327      .0884002
           cc |  -.2298708   .0543087    -4.23    0.000     -.3819165
  sc_e <-     |
           tb |   .0693108   .0248238     2.79    0.005       .182136
          tpi |    .486437   .0538653     9.03    0.000         .6542
       sc_spt |   .0204355    .020981     0.97    0.330      .0578314
           cc |  -.1118176   .0291756    -3.83    0.000     -.2498498
  som_st <-   |
           tb |   .1093157   .0391516     2.79    0.005      .2025992
          tpi |   .7671996   .0692629    11.08    0.000         .7277
       sc_spt |   .0322305   .0330262     0.98    0.329      .0643288
           cc |  -.1763568    .044604    -3.95    0.000     -.2779206

3.7.1 Non-recursive Models

In a study of 329 boys, Duncan, Haller, and Portes (1968) studied the
effect of peers on aspirations. The model is given below and the Stata
data set is available in the SEM manual.
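[Figure: path diagram of the model, with r_intel, r_ses, f_ses and f_intel
as exogenous variables and r_occasp and f_occasp as endogenous variables
with errors e1 and e2.]
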

. use http://www.stata-press.com/data/r13/sem_sm1

. ssd describe
. sem (r_intel -> r_occasp, ) (r_ses -> r_occasp, ) (r_ses -> f_occasp, ) (f_ses
> -> r_occasp, ) (f_ses -> f_occasp, ) (f_intel -> f_occasp, ) (r_occasp -> f_o
> ccasp, ) (f_occasp -> r_occasp, ), cov(e.r_occasp*e.f_occasp) nocapslatent

Endogenous variables

Observed:  r_occasp f_occasp

Exogenous variables

Observed:  r_intel r_ses f_ses f_intel

Fitting target model:

Iteration 0:   log likelihood = -2617.0489
Iteration 1:   log likelihood = -2617.0489

Structural equation model                       Number of obs      =       329
Estimation method  = ml
Log likelihood     = -2617.0489

                 |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
Structural       |
  r_occasp <-    |
        f_occasp |   .2773441   .1287622     2.15    0.031     .0249748    .5297134
         r_intel |   .2854766   .0522001     5.47    0.000     .1831662    .3877869
           r_ses |   .1570082    .052733     2.98    0.003     .0536534     .260363
           f_ses |   .0973327   .0603699     1.61    0.107    -.0209901    .2156555
  f_occasp <-    |
        r_occasp |   .2118102   .1563958     1.35    0.176      -.09472    .5183404
           r_ses |   .0794194   .0589095     1.35    0.178    -.0360411    .1948799
           f_ses |   .1681772   .0543854     3.09    0.002     .0615838    .2747705
         f_intel |   .3693682   .0557939     6.62    0.000     .2600142    .4787223
  var(e.r_occ~p) |   .6868304   .0535981                       .5894193    .8003401
  var(e.f_occ~p) |   .6359151   .0501501                       .5448425    .7422109
  cov(e.r_occ~p, |
     e.f_occasp) |  -.1536992   .1442554    -1.07    0.287    -.4364346    .1290362
LR test of model vs. saturated: chi2(0) = 0.00, Prob > chi2 = .

The model is just-identified (it has zero degrees of freedom), so its
fit cannot be tested. We would expect, though, that some of the
parameters should be equal to each other. To impose these equality
constraints, follow these steps:

1. Open the diagram and select the path from r_intel to r_occasp. In
the β box, type b1 and make sure you press Enter. Select the path
from f_intel to f_occasp. Again type b1.

2. Do the same for the three other pairs you expect to be the same,
but this time type b2, b3, and b4, respectively.

3. Re-estimate the model.

[Figure: path diagram of the constrained model with the estimated coefficients.]


. sem (r_intel@b1 -> r_occasp, ) (r_ses@b2 -> r_occasp, ) (r_ses@b3 -> f_occasp,
> ) (f_ses@b3 -> r_occasp, ) (f_ses@b2 -> f_occasp, ) (f_intel@b1 -> f_occasp,
> ) (r_occasp@b4 -> f_occasp, ) (f_occasp@b4 -> r_occasp, ), cov(e.r_occasp*e.f
> _occasp) nocapslatent

Endogenous variables

Observed:  r_occasp f_occasp

Exogenous variables

Observed:  r_intel r_ses f_ses f_intel

Fitting target model:

Iteration 0:   log likelihood = -2617.8735
Iteration 1:   log likelihood = -2617.8705
Iteration 2:   log likelihood = -2617.8705

Structural equation model                       Number of obs      =       329
Estimation method  = ml
Log likelihood     = -2617.8705

 ( 1)  [r_occasp]f_occasp - [f_occasp]r_occasp = 0
 ( 2)  [r_occasp]r_intel - [f_occasp]f_intel = 0
 ( 3)  [r_occasp]r_ses - [f_occasp]f_ses = 0
 ( 4)  [r_occasp]f_ses - [f_occasp]r_ses = 0
                 |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
Structural       |
  r_occasp <-    |
        f_occasp |   .2471578   .1024504     2.41    0.016     .0463588    .4479568
         r_intel |   .3271847   .0407973     8.02    0.000     .2472234    .4071459
           r_ses |   .1635056   .0380582     4.30    0.000     .0889129    .2380984
           f_ses |    .088364   .0427106     2.07    0.039     .0046529    .1720752
  f_occasp <-    |
        r_occasp |   .2471578   .1024504     2.41    0.016     .0463588    .4479568
           r_ses |    .088364   .0427106     2.07    0.039     .0046529    .1720752
           f_ses |   .1635056   .0380582     4.30    0.000     .0889129    .2380984
         f_intel |   .3271847   .0407973     8.02    0.000     .2472234    .4071459
  var(e.r_occ~p) |   .6884513   .0538641                       .5905757    .8025477
  var(e.f_occ~p) |   .6364713   .0496867                       .5461715    .7417005
  cov(e.r_occ~p, |
     e.f_occasp) |  -.1582175   .1410111    -1.12    0.262    -.4345942    .1181592
LR test of model vs. saturated: chi2(4) = 1.64, Prob > chi2 = 0.8010

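Notice the code for a constraint: coefficients that carry the same name
after @ (for example @b1) are constrained to be equal. You can also put a
number after @ to fix a coefficient at that value.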
Exercise: Repeat the exercise with a do file.
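
One possible starting point for this exercise is sketched below. It assumes the sem_sm1 summary statistics data can be loaded from the address used above, and it groups the paths by destination variable using the <- arrow; the constraint names b1 to b4 follow the labels used in the diagram. Treat it as a sketch to adapt rather than the definitive answer.

```stata
* Constrained peer-influence model written by hand.
use http://www.stata-press.com/data/r13/sem_sm1, clear

* Coefficients sharing a name after @ are constrained to be equal.
sem (r_occasp <- f_occasp@b4 r_intel@b1 r_ses@b2 f_ses@b3) ///
    (f_occasp <- r_occasp@b4 f_intel@b1 f_ses@b2 r_ses@b3), ///
    cov(e.r_occasp*e.f_occasp)
```

All variable names are lower case, so nocapslatent is not needed here.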

3.8 Methods of Estimation

• ML (Maximum Likelihood) is the method used by default, or you
can specify method(ml). It assumes multivariate normality. Note
that the distribution of the χ2 statistic is affected by kurtosis in the
data.

• ADF (Asymptotic Distribution Free) relaxes the normality assumption
but requires a large sample size. You specify this by setting method(adf).

• MLMV does full information maximum likelihood. You specify this by
setting method(mlmv). It assumes multivariate normality and that
missing data are missing at random.

Note that ML and ADF use listwise deletion.
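
The three methods differ only in the method() option. The following is a minimal sketch, assuming the raw HolzingerSwineford1939 data (lower-case x1 to x9) are in memory; raw data are assumed because, as far as we are aware, method(adf) and method(mlmv) are not available with summary statistics data.

```stata
* Default: maximum likelihood with listwise deletion
sem (Visual -> x1 x2 x3) (Textual -> x4 x5 x6) (Speed -> x7 x8 x9), ///
    method(ml)

* Asymptotic distribution free
sem (Visual -> x1 x2 x3) (Textual -> x4 x5 x6) (Speed -> x7 x8 x9), ///
    method(adf)

* Maximum likelihood using all available observations (missing values)
sem (Visual -> x1 x2 x3) (Textual -> x4 x5 x6) (Speed -> x7 x8 x9), ///
    method(mlmv)
```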

3.9 Identification

Identification relates to whether it is possible for the computer to

derive a unique set of parameter estimates. We don’t want to say too
much today because it is complicated, but we should say something.

3.9.1 Confirmatory Factor Analysis

• With a single factor you need at least three indicators for the
model to be identified.

• With two or more factors, you need at least two indicators per
factor for the model to be identified.

Non-standard CFA models, where some indicators load on multi-

ple factors or some error terms covary, are more complicated.
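
A necessary (but not sufficient) check that sits behind these rules is a simple count: a model cannot have more free parameters than there are distinct variances and covariances among the observed variables. As a worked illustration for the three-factor CFA of Section 3.2, with p = 9 observed variables and q free parameters (p and q are notation introduced here):

```latex
\[
  \text{df} = \frac{p(p+1)}{2} - q
            = \frac{9 \times 10}{2}
              - \bigl(\underbrace{6}_{\text{free loadings}}
              + \underbrace{9}_{\text{error variances}}
              + \underbrace{3}_{\text{factor variances}}
              + \underbrace{3}_{\text{factor covariances}}\bigr)
            = 45 - 21 = 24 .
\]
```

This agrees with the r(df_ms) = 24 reported in Section 3.4. The same count for a single factor with three indicators gives 6 − (2 + 3 + 1) = 0, which is why three indicators are just enough.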

3.9.2 Structural Models

The situation is simple if the structural model is recursive: the model
is identified. If the model is non-recursive then it is more complicated.

3.9.3 Structural Regression Models

Assuming each latent variable is measured by two or more indicators,
the situation is quite simple. If the measurement part of the
model is identified and the structural part of the model is identified,
then the structural regression model is identified. Again, when one
of the latent variables has only one indicator, the situation is more
complicated.
4 Datasets

4.1 Holzinger Swineford

HolzingerSwineford1939 {lavaan}                    R Documentation

Holzinger and Swineford Dataset (9 Variables)

The classic Holzinger and Swineford (1939) dataset
consists of mental ability test scores of
seventh- and eighth-grade children from
two different schools (Pasteur and Grant-White).
In the original dataset (available in the MBESS
package), there are scores for 26 tests. However,
a smaller subset with 9 variables is more widely
used in the literature (for example in Joreskog's
1969 paper, which also uses the 145 subjects
from the Grant-White school only).

data(HolzingerSwineford1939)

A data frame with 301 observations of 15 variables.

Age, year part

Age, month part

School (Pasteur or Grant-White)

Visual perception

Paragraph comprehension

Sentence completion

Word meaning

Speeded addition

Speeded counting of dots

Speeded discrimination straight and curved capitals

This dataset was retrieved from
http://web.missouri.edu/~kolenikovs/stata/hs-cfa.dta
and converted to a csv file.

Holzinger, K., and Swineford, F. (1939). A study in factor
analysis: The stability of a bifactor solution. Supplementary
Educational Monograph, no. 48. Chicago: University of
Chicago Press.

Joreskog, K. G. (1969). A general approach to confirmatory
maximum likelihood factor analysis. Psychometrika, 34,
183-202.

4.2 Political Democracy

PoliticalDemocracy {lavaan}                        R Documentation

Industrialization And Political Democracy Dataset

The famous Industrialization and Political Democracy dataset.
This dataset is used throughout Bollen's 1989 book (see pages
12, 17, 36 in chapter 2, pages 228 and following in chapter 7,
pages 321 and following in chapter 8). The dataset contains
various measures of political democracy and industrialization
in developing countries.

data(PoliticalDemocracy)

A data frame of 75 observations of 11 variables.

Expert ratings of the freedom of the press in 1960

The freedom of political opposition in 1960

The fairness of elections in 1960

The effectiveness of the elected legislature in 1960

Expert ratings of the freedom of the press in 1965

The freedom of political opposition in 1965

The fairness of elections in 1965

The effectiveness of the elected legislature in 1965

The gross national product (GNP) per capita in 1960

The inanimate energy consumption per capita in 1960

The percentage of the labor force in industry in 1960

The dataset was retrieved from
http://web.missouri.edu/~kolenikovs/Stat9370/democindus.txt
(see discussion on SEMNET 18 Jun 2009)

Bollen, K. A. (1989). Structural Equations with Latent
Variables. Wiley Series in Probability and Mathematical
Statistics. New York: Wiley.

Bollen, K. A. (1979). Political democracy and the timing of
development. American Sociological Review, 44,
572-587.

Bollen, K. A. (1980). Issues in the comparative measurement of
political democracy. American Sociological Review, 45, 370-390.

4.3 Pupil and teacher data set

Variable                              1        2        3        4        5        6
1. Coercive Control              1.0000
2. Teacher Burnout               0.3557   1.0000
3. School Support               -0.2566  -0.4774   1.0000
4. Teacher-Pupil Interactions   -0.4046   0.0207   0.1864   1.0000
5. School Experience            -0.1615   0.0938   0.0718   0.6542   1.0000
6. Somatic Status               -0.3487  -0.0133   0.1570   0.7277   0.4964   1.0000
SD                               8.3072   9.7697  10.5212   5.0000   3.7178   5.2714
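
Because this data set is available only as correlations and standard deviations, one way to use it in Stata is to enter it as summary statistics data (SSD). The following is a minimal sketch under stated assumptions: the six variable names are hypothetical short names chosen for illustration, and the number of observations is a placeholder, since the sample size is not given in the table above and must be taken from the original study.

```stata
clear all

* Declare the six variables in table order (names are hypothetical)
ssd init coercive burnout support interact experience somatic

* PLACEHOLDER sample size: replace 100 with the N reported for the study
ssd set observations 100

* Standard deviations, copied from the SD row of the table
ssd set sd 8.3072 9.7697 10.5212 5.0000 3.7178 5.2714

* Lower triangle of the correlation matrix; \ separates the rows
ssd set correlations                                  ///
     1.0000 \                                         ///
     0.3557  1.0000 \                                 ///
    -0.2566 -0.4774  1.0000 \                         ///
    -0.4046  0.0207  0.1864  1.0000 \                 ///
    -0.1615  0.0938  0.0718  0.6542  1.0000 \         ///
    -0.3487 -0.0133  0.1570  0.7277  0.4964  1.0000

* Report which statistics have been set, then display them
ssd status
ssd list
```

Once the summary statistics are set, sem can be run on them in place of raw observations. No means are entered here because the table does not report any.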

4.4 Example 7/8 from Stata

. use http://www.stata-press.com/data/r13/sem_sm1
(Structural model with all observed values)

. ssd describe

  Summary statistics data from http://www.stata-press.com/data/r13/sem_sm1.dta

    obs:       329                  Structural model with all obse..
   vars:        10                  25 May 2013 10:13
                                    (_dta has notes)

  variable name    variable label
  r_intel          respondent's intelligence
  r_parasp         respondent's parental aspiration
  r_ses            respondent's family socioeconomic status
  r_occasp         respondent's occupational aspiration
  r_educasp        respondent's educational aspiration
  f_intel          friend's intelligence
  f_parasp         friend's parental aspiration
  f_ses            friend's family socioeconomic status
  f_occasp         friend's occupational aspiration
  f_educasp        friend's educational aspiration

. notes

_dta:
  1.  Summary statistics data from Duncan, O.D., Haller, A.O., and Portes, A.,
      1968, "Peer Influences on Aspirations: A Reinterpretation", _American
      Journal of Sociology_ 74, 119-137.
  2.  The data contain 329 boys with information on five variables and the same
      information for each boy's best friend.
5 Bibliography

[1] K.A. Bollen. Structural Equations with Latent Variables. John Wiley
and Sons, New York, 1989.

[2] T.A. Brown. Confirmatory Factor Analysis for Applied Research. The
Guilford Press, New York, 2006.

[3] B.M. Byrne. Structural Equation Modeling with AMOS: Basic
Concepts, Applications, and Programming. Routledge, New York,
2nd edition, 2010.

[4] John Fox. The R Commander: A basic statistics graphical user
interface to R. Journal of Statistical Software, 14(9):1–42, 2005.

[5] J.F. Hair Jr., G.T.M. Hult, C.M. Ringle, and M. Sarstedt, editors.
A Primer on Partial Least Squares Structural Equation Modeling
(PLS-SEM). Routledge, New York, 2010.

[6] G.R. Hancock and R.O. Mueller, editors. The Reviewer’s Guide to
Quantitative Methods in the Social Sciences. Routledge, New York,

[7] K. Holzinger and F. Swineford. A study in factor analysis: The
stability of a bifactor solution. Number 48 in Supplementary
Educational Monograph. University of Chicago Press, Chicago,

[8] K.G. Jöreskog. A general approach to confirmatory maximum
likelihood factor analysis. Psychometrika, 34:183–202, 1969.

[9] R.B. Kline. Principles and Practice of Structural Equation Modeling.
The Guilford Press, New York, 3rd edition, 2011.

[10] T.D. Little. Longitudinal Structural Equation Modeling. The
Guilford Press, New York, 2013.

[11] R Core Team. R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna, Austria,

[12] Yves Rosseel. lavaan: An R package for structural equation
modeling. Journal of Statistical Software, 48(2):1–36, 2012.

[13] Deepayan Sarkar. Lattice: Multivariate Data Visualization with R.
Springer, New York, 2008. ISBN 978-0-387-75968-5.

[14] F.A. Sava. Causes and effects of teacher conflict-inducing
attitudes towards pupils: A path analysis model. Teaching and
Teacher Education, (2):1007–1021, 2002.

[15] R.E. Schumacker and R.G. Lomax. A Beginner’s Guide to Structural
Equation Modeling. Routledge, New York, 3rd edition, 2010.

[16] Mark P.J. van der Loo and Edwin de Jonge. Learning RStudio for
R Statistical Computing. Packt Publishing, Birmingham, UK, 2012.

