Glorified Proc Contents for Secondary Source Data
Bruce Thomas, VAMC/REAP, Providence RI
ABSTRACT
Researchers often want to know as much about their data as early as possible. This is particularly
important in studies which use administrative, claims and utilization data from multiple sources. In these
studies, data usually need to be combined and judgments made about data quality and relevance.
SAS® Users are typically interested in data dictionaries, but most stop at PROC CONTENTS printed in
monospace font. Borrowing from the ‘glorified proc contents’ approach employed in drug and device
regulatory submissions, this paper describes a method that can be used to help document the contents
of one or more datasets or format catalogs; in addition, it provides an approach to creating desktop
guides in PDF format that describe the distribution of the variables in those datasets.
The term ‘metadata’ is often defined as ‘data about data.’ An alliterative abstraction which can lead
to confusion for non-programmers, the term has been employed since the 1960's to describe the
documentation codebooks that ran on IBM mainframe systems. The term has since been trademarked.i
The National Information Standards Organization makes a distinction between ‘Structural’ and
‘Descriptive’ metadata in Understanding Metadata (NISO, 2004).ii Structural metadata are ‘data about
containers of data’, and Descriptive metadata are ‘data about data contents’ used to help guide users to
information.
Several common applications on the PC desktop employ Meta data to help people search for
information. Anyone who uses a browser has already worked with Meta data, because it is widely
employed in search engines. Early HTML web designers tried to populate the <META> tags with buzz
words to attract seekers to their web pages. RSS feeds available on most browsers today use
Extensible Markup Language (XML) markup, which offers a way to organize content. FDA regulators
use Define.XML as a backbone to review clinical trial applications. Microsoft has adopted XML as a
basis for its Office documents (ever see the *.docx extension?), and SAS® has developed a full suite
of markup tools that deal with XML schemas, which are, in effect, Meta data. All are tools to facilitate
organizing and retrieving information.
SAS datasets and SAS Views contain parts that describe the data, called data descriptors; SAS
datasets also contain the data themselves. This descriptor portion is made available in turn to the
CONTENTS and DATASETS procedures. The earliest Base SAS® procedure I know of to provide this
functionality to end users is the CONTENTS Procedure, a.k.a. PROC CONTENTS. Its basic utility is
twofold: (1) it generates printed content and (2) it generates a dataset. This functionality is also present
in PROC DATASETS’ CONTENTS statement, and some additional features of the CONTENTS
statement permit creating additional output that describes indexes and integrity constraints.
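
The two uses of PROC CONTENTS described here, and the PROC DATASETS counterpart, can be sketched as follows. This is a minimal illustration rather than code from the paper; SASHELP.CLASS and the WORK dataset names are assumptions chosen so the example runs anywhere.

```sas
/* (1) Printed output: dataset-level and variable-level attributes */
proc contents data=sashelp.class;
run;

/* (2) The same variable attributes captured as a dataset,
       one row per variable; NOPRINT suppresses the listing */
proc contents data=sashelp.class out=work.class_contents noprint;
run;

/* PROC DATASETS equivalent. OUT2= additionally captures indexes and
   integrity constraints when the dataset has any. */
proc datasets library=sashelp nolist;
   contents data=class
            out=work.class_vars
            out2=work.class_constraints
            noprint;
quit;
```

WORK.CLASS_CONTENTS and WORK.CLASS_VARS each hold one row per variable, which is the raw material for a data dictionary.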
NESUG 2011 Pharma & Healthcare
The interactive display manager also provides a view contents window as well as the DIR and VAR
windows, and when opening datasets in the view table one is able to examine variable attributes by
clicking on the column header. SAS® added the SQL Dictionary tables in Version 6.12; these are
special tables that, along with the numerous SASHELP views built on them, describe many of the objects
in the SAS environment, including datasets, variables, formats and even macros. The advent of ODS
allowed us to differentiate the parts of PROC CONTENTS output: the dataset attributes and variable
attributes. The self-documenting feature of SAS datasets also helps with organizing and maintaining
large complex databases, and we now see the Metadata Server as part of the SAS® Business
Intelligence Platform.iii
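
As an illustration of the Dictionary tables and SASHELP views mentioned here, the sketch below pulls variable attributes for one dataset both ways. The choice of SASHELP.CLASS and the output dataset names are assumptions made for the example.

```sas
/* Query DICTIONARY.COLUMNS through PROC SQL.
   LIBNAME and MEMNAME values are stored in upper case. */
proc sql;
   create table work.class_columns as
   select memname, name, type, length, varnum, label, format
   from dictionary.columns
   where libname = 'SASHELP' and memname = 'CLASS'
   order by varnum;
quit;

/* The same information is available outside PROC SQL
   through the SASHELP.VCOLUMN view */
data work.class_vcolumn;
   set sashelp.vcolumn;
   where libname = 'SASHELP' and memname = 'CLASS';
   keep memname name type length varnum label format;
run;
```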
Despite this evolution, widespread use of “data about data” in day-to-day work in the SAS® community
has yet to occur, and it remains a specialty area that is difficult to integrate into the workflow. One
reason may lie in its abstract nature and in the static, one-off nature of the results. Once you’ve
documented the dataset, there it is. There may well be experienced SAS programmers who never
progress beyond basic PROC CONTENTS, but this may limit their ability to develop reusable software
solutions to problems.
The CONTENTS Procedure, a.k.a. PROC CONTENTS, is widely used. It is simple to code in a few lines.
It provides access to descriptive information about datasets in printed form. It is even able to generate
a dataset that contains both structural and descriptive Meta data to help guide us through the datasets
in a study. Structural metadata are available in the familiar SAS® PROC CONTENTS output. As shown in
Fig. 1, ‘Label’, ‘Observations’ and ‘Sorted’ display basic information about the dataset as a whole. In
this example, one can tell that the dataset was created during a session in the WORK library, that it is not
labeled (a typical omission with temporary datasets) and that it is not sorted. A recent
enhancement provides information about the sort order in the display results after the ‘Alphabetic List of
Variables and Attributes.’ Descriptive information about the dataset’s contents is more useful, and
‘Variable’, ‘Type’, ‘Len’ and ‘Label’ provide information about the attributes of the variables contained in
the data set. ‘#’ in the output display reflects the order in which the variable appears in the Program
Data Vector. The VARNUM option on PROC CONTENTS can be used to order the display by ‘#’, but
the default is alphabetical by the ‘Variable’ column. All SAS® datasets have Meta data that can be
surfaced this way.
Output ‘CONTENTS’ datasets always describe what is in each dataset of interest, and the user can
specify both the library and the dataset(s) in the library on the DATA= option. Each dataset of
interest has its own variables, and the variable descriptions will appear in the contents dataset. SAS
101 tells us that variables have attributes, and here they are: they have a Name, a Label, a Length, a
position in the dataset and a Type (Character or Numeric). This is true for ‘regular’ datasets
and even for the datasets produced by PROC CONTENTS output statements.
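
A library-wide contents dataset of this kind might be built as in the sketch below. The SASHELP library and the WORK dataset names are assumptions for illustration; any assigned libref would do.

```sas
/* One row per variable for every dataset in the library.
   MEMTYPE=DATA skips views and catalogs. */
proc contents data=sashelp._all_ memtype=data
              out=work.lib_contents noprint;
run;

/* Keep the attributes a dictionary needs, in dataset and position order.
   In the OUT= dataset TYPE is numeric: 1 = numeric, 2 = character. */
proc sort data=work.lib_contents
          out=work.dictionary_rows(keep=libname memname varnum name
                                        type length label format);
   by memname varnum;
run;
```

Sorting by VARNUM within MEMNAME reproduces the order that the VARNUM option gives in the printed output.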
^ IN( Kansas.Anymore)
The most useful variables for the data dictionary are highlighted: NAME (‘Variable Name’) can be as
long as 32 characters now. TYPE is usually 'char' or 'num', and FORMAT describes the format assigned to
the variable. These can be either user defined (e.g. $AGEFMT) or SAS formats (e.g. BEST11.). When
There are metadata that are not made available to SAS Users via PROC CONTENTS or the methods
outlined so far. In the data I work with, variables are typically continuous (age in years), discrete
(gender) or somewhere in between (zip codes). Most importantly, these variables all have some sort of
distribution and varying degrees of ‘missingness.’ In some situations, the official documentation can be
a bit dated and new variables sometimes might appear with no supporting documentation. The
meanings of these variables are sometimes elusive, so their distributions can provide a clue. These
aspects of metadata must be ascribed to the data properly.
A principal driver for exploiting meta data with SAS® since the early part of the 21st Century has been
the desire by the FDA to work with the Industry to standardize the information that the clinical trials
research community submits for regulatory review. As a result, a growing body of SAS®
programmers has gained familiarity with Computer Assisted New Drug Applications (CANDA), Electronic
Submissions (ESUB), and the SDTM and ADaM data models promulgated through the Clinical Data
Interchange Standards Consortium (CDISC). PC desktop products have evolved as well in those
settings, and many users may have firsthand experience working with PDF documents that integrate
information about dataset structure and variable descriptions with the data collection instruments
themselves.iv Pharmaceutical and Device submissions have evolved beyond the days of this
‘DEFINE.PDF’ to a newer version based on XML, driven primarily by the Industry and regulators in
search of true public domain software. The model for define.pdf is one where a table of dataset
contents is hyperlinked to a rich body of objects of interest to reviewers: variable descriptions, the
actual datasets, and for each variable, links to the format code lists and even to the page of the actual
data collection instrument (referred to as 'blankcrf.pdf').
The define.pdf is an inventory of the variables that contains links to the actual datasets, the format
catalogs and annotated case report forms; in effect, a code book for the study. I did not really need
the extensive hyperlinking used in define.pdf, but hyperlinking between proc contents output and the
variable distributions or the contents of a format catalog was important, as was a way to get back to the
table of contents from anywhere in the document. A proc report solution with hyperlinking seemed like
a good idea for the table of contents.v
The technology behind this hyper linking is based on the notion of PDF named destinations, which are
low level objects in the Portable Document Format architecture. Unlike HTML anchors, named
destinations in PDF are separate from textual data.vi A fully linked define.pdf document is a work of
art; unfortunately, it is a work of art that is frequently created at the end of the clinical trials process
rather than at the beginning. As such, it is expensive art.
DESIGN CONSIDERATIONS
Data dictionary end users (including me and other programmers as well as investigators) needed to be
able to describe one or more variables in one or more datasets and associate those with some tabular
information describing what each variable contains. The application needed to iterate through each
dataset’s variable list, select the important ones and manage the way they are displayed and
summarized. We know that a mean zip code is (ahem) meaningless, just as a frequency table of ages
is voluminous and not very interesting to look at; however, the range of values, including missing values,
can be very useful.
Once we have a way to iterate through a list of variables, we need to find out the type of each variable,
a label that describes the variable for us (We should all label our variables), and some information
about the variable’s format (name). PROC CONTENTS offers an easy-maintenance solution,
particularly since we can use it to dynamically generate an output dataset to populate the list of variable
names for our table of contents and links. The dataset could vary, but the proc contents would always
give us the same variables (here, the NAME variable) to work with to help us name the destinations.
Each destination would consist of output from a SAS procedure and would be based on the value of
this NAME variable. The SAS procedure would in turn be based on the number of levels in the variable
as well as the variable’s ascribed type (Long discrete numeric, protected health information,
continuous, discrete). To generate distribution information, we can construct standard routines to
iterate through variables and run frequency or descriptive statistics procedures. To get the linking we
need between variable name and its distribution or between the format name and its contents, ODS
PDF ANCHOR and ODS PDF TEXT offered some good ways to define the destinations. To build the
hyperlinks to the destination, the CALL DEFINE statement in proc report would need some special
attention and some research into SUGI lore.vii With this set of basic ingredients, I was prepared to
assemble a glorified PROC CONTENTS. For our format catalog tool, we can get format information into
a dataset with the CNTLOUT= option of PROC FORMAT.
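
In PROC FORMAT, the option that writes a catalog's contents to a dataset is CNTLOUT=. The sketch below defines a small user format and dumps it; the format name $SEXFMT and the WORK dataset name are hypothetical and serve only to make the example self-contained.

```sas
/* Define a small user format in WORK.FORMATS */
proc format library=work;
   value $sexfmt 'F' = 'Female'
                 'M' = 'Male';
run;

/* CNTLOUT= writes one row per format range;
   SELECT limits the output to the named format(s) */
proc format library=work cntlout=work.format_contents;
   select $sexfmt;
run;

/* FMTNAME, START, END and LABEL are the columns a format
   listing in a dictionary would typically show */
proc print data=work.format_contents noobs;
   var fmtname type start end label;
run;
```

Omitting the SELECT statement returns every format in the catalog, which suits a tool meant to document a whole format library.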
Secondary source administrative data have to be screened and assimilated at various stages of analytic file
development. Faced with the task of integrating data from at least 5 different sources spanning a
period of several years, I chose to borrow from the define.pdf ‘model’ to construct a simple, standard
dictionary application. This would consist of a table of contents and hyperlinks to PDF destinations
containing either format descriptions or information about variable distributions. In the former case,
users would be able to generate a readable, linked description of what is in one or more format
catalogs. In the latter case, end users would get a chance to look at the basic spread of values in
variables found in datasets. Unlike define.pdf, I would not integrate the formats and the distributional
information; this is a possibility for future improvements. The approach would have to be generalized
so I wouldn’t have to rewrite too much code. SAS® ODS PDF would be useful for three reasons: Version 9.2’s
enhancements include both new layout features and improved stability with graphics, the output would be
almost universally readable across PC and UNIX platforms, and it would handle hyperlinking.viii
Front End
In the first design iteration, we needed a simple interface where users could select variables and
put them into different kinds of lists: (1) exclusions, (2) continuous variables, (3) discrete
variables. Exclusions were easy – researchers really don’t need or want to know anything about
certain kinds of private health information, and some data are simply not of any interest from the
outset. Continuous variables were interesting, because they could be described using proc
means, the summary proc of choice, as well as simple distribution plots. A box plot feature
through SGPLOT in SAS® 9.2 offered some interesting features to help us understand
continuous variables.
    %DataDictionary(
       dslib=RAW
      ,dsname=XMBASE
      ,title=This is a title
      ,phi=scrssn
      ,category=
      ,cutoff=25
      ,Longdiscretevars=
         bornday bornyear disday distime
         admitday adtime homecnty visn zip
         statyp scper homepsa dxf11 dxf12
         dxf13 dxf2 dxf3 dxf4 dxf5 dxf6 dxf7 dxf8
         dxf9 dxf10 updatday sta3n homstate
         drg
    );
Figure 1 Data Dictionary Launch macro
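
The continuous-variable summaries mentioned in this section (proc means plus an SGPLOT box plot) could look roughly like the sketch below. The dataset SASHELP.CLASS, the variable HEIGHT and the particular statistics are assumptions; the paper's own macros are not reproduced here.

```sas
/* Descriptive statistics, including the count of missing values */
proc means data=sashelp.class
           n nmiss min p25 median p75 max mean std maxdec=1;
   var height;
run;

/* Box plot of the same variable (PROC SGPLOT, SAS 9.2 or later) */
proc sgplot data=sashelp.class;
   vbox height;
run;
```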
These processes are easily generalized across data types, but to make that possible, we needed a
wrapper to handle particular instances of Meta data that live in a particular study dataset. The solution
is a wrapper program (BuildDictionary.sas). The particular dataset under consideration in the wrapper
program would provide the dataset model we need to help build the different variable lists we needed.
Once we had a way to handle the particular dataset, we could then pass its metadata to a controller
program (DataDictionary.sas fig. 6) that would actually create the different variable lists, generate a
table of contents for each list and hyperlinks to PDF destinations. The controller would also call the
appropriate frequency or distribution routine. The controller program is invoked in the wrapper program
by setting macro parameters. For example, PHI (for protected health information) removes personal
identifiers from processing altogether, and DataDictionary.sas in turn uses these parameters to
generate temporary datasets containing variables with each data type and routes these to the
appropriate summary routine.
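
The routing idea, in which macro parameters sort variables into lists that are then handed to a summary routine, can be sketched as follows. This is not the author's DataDictionary.sas: the macro variables PHI and LONGDISCRETEVARS mimic the launch parameters, while SASHELP.CLASS, the BUCKET variable and the WORK dataset names are assumptions.

```sas
/* Hypothetical parameter values, modeled on the launch macro */
%let phi              = name;   /* identifiers to drop altogether  */
%let longdiscretevars = age;    /* numerics that are NOT continuous */

proc contents data=sashelp.class
              out=work.vars(keep=name type varnum) noprint;
run;

/* Assign each variable to a list. In the OUT= dataset
   TYPE is 1 for numeric and 2 for character. */
data work.varlists;
   set work.vars;
   length bucket $12;
   if findw("&phi", strip(name), ' ', 'i') then bucket = 'EXCLUDED';
   else if type = 2 then bucket = 'DISCRETE';
   else if findw("&longdiscretevars", strip(name), ' ', 'i')
      then bucket = 'DISCRETE';
   else bucket = 'CONTINUOUS';
run;

/* Route one list to its summary routine */
proc sql noprint;
   select name into :contvars separated by ' '
   from work.varlists
   where bucket = 'CONTINUOUS';
quit;

proc means data=sashelp.class n nmiss min max mean;
   var &contvars;
run;
```

Keeping the classification in a dataset rather than in macro logic makes it easy to print and check before any PDF is produced.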
Originally, I put the burden on the user (me) to correctly put the variables into the different buckets.
While a result was achievable, this involved some tedious trial and error, forcing me to test each run
with obs=100 for accuracy then with obs=1000+ to make sure the distributions were coming out as
expected. A subsequent iteration opted to use a threshold (note ix) to help make the decision about how to
handle the discrete variables. Now, if there are more values than the threshold and the variable is NOT
continuous, then a ‘top/bottom 5’ approach is used; otherwise a frequency table is
generated. At this time, the only list that requires populating is the numeric variables to exclude
from continuous variable processing (e.g. Zip Code). These populate a macro parameter named
‘LongDiscreteVars.’ This call is coded in the BuildDictionary.sas program (see fig. 6).
Figure 2 How many levels in the data?
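
One way to count levels and apply a threshold is the NLEVELS option of PROC FREQ, sketched below. Whether the paper's macros use this approach is an assumption; the dataset SASHELP.CARS, the DISPLAY variable and the WORK dataset names are illustrative, and the sketch does not separate out continuous variables first.

```sas
%let cutoff = 25;   /* assumed threshold, mirroring the cutoff= parameter */

/* NLEVELS reports the number of distinct values per variable.
   NOPRINT on the TABLES statement suppresses the frequency
   tables themselves but leaves the NLevels table available. */
ods output nlevels=work.levels;
proc freq data=sashelp.cars nlevels;
   tables _all_ / noprint;
run;

/* Variables with more levels than the cutoff get the abbreviated display */
data work.routing;
   set work.levels;
   length display $12;
   if nlevels > &cutoff then display = 'TOPBOTTOM5';
   else display = 'FREQUENCY';
   keep tablevar nlevels display;
run;

proc print data=work.routing noobs;
run;
```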
The statistical procedures are provided in a series of ‘DD’ macros that need to be made available to the
SAS session, either through %include or an autocall library. I developed small single purpose macros in
a library (www.github/bhthomas) to handle the three different data types. This is a small, reusable code
library that could be invoked without the dictionary front end. One important limitation of these 3 macros
is that they are all currently ODS- and PDF-specific, containing expressions like:
This could be made conditional at some point, but PDF linking is kind of nice and PDF viewers are
everywhere now. In this design, the data types are processed as a group and routed as a group to the
appropriate SAS ® procedures.
Hyperlinking was the most complex part of the application because there were two parts that
needed to be processed in the same order. The table of contents used a URL format to map variable
numbers (from the proc contents VARNUM attribute) to internal locations in the
PDF, each of which had to refer to an anchor defined with the variable’s NAME attribute. To make the table
of contents less terse, I opted to use the variable’s label if it was available.
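
The pairing of table-of-contents links with named anchors can be sketched as below. It relies on ODS PDF ANCHOR= and the 'URL' attribute of CALL DEFINE in PROC REPORT, as discussed in the Carpenter paper cited in the notes. The macro name %destinations, the output file name and the use of PROC FREQ for every variable are assumptions made to keep the sketch short, and exact link behavior may vary by SAS release and PDF viewer.

```sas
/* Variable attributes that feed the table of contents */
proc contents data=sashelp.class
              out=work.toc(keep=name label length varnum) noprint;
run;

proc sort data=work.toc;
   by varnum;
run;

ods pdf file='dictionary_sketch.pdf';

/* Table of contents: each variable name links to '#<name>' */
ods pdf anchor='contents';
proc report data=work.toc nowd;
   column varnum name label length;
   define varnum / display 'No.';
   define name   / display 'Variable';
   define label  / display 'Label';
   define length / display 'Length';
   compute name;
      call define(_col_, 'url', '#' || strip(name));
   endcomp;
run;

/* One destination per variable, anchored with the variable's name,
   generated in the same order as the table of contents */
%macro destinations(data=);
   %local i n;
   proc sql noprint;
      select name into :v1 - :v999
      from work.toc
      order by varnum;
   quit;
   %let n = &sqlobs;
   %do i = 1 %to &n;
      ods pdf anchor="&&v&i";
      proc freq data=&data;
         tables &&v&i / missing;
      run;
   %end;
%mend destinations;

%destinations(data=sashelp.class)

ods pdf close;
```

Driving both the table of contents and the destinations from the same sorted dataset is what keeps the two parts in step.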
OUTPUT
The report consists of a Title page created with Proc GSLIDE using the title parameter. The next
section is a table of contents generated by proc report using a proc contents
dataset to display the variable’s name, label, length and format. In that table, the
variable name is hyperlinked using an approach found in Carpenter (2007). The
address for the table of contents is defined using ODS PDF ANCHOR='contents';
ODS PDF output in SAS has several default settings that may confuse readers, and a few issues
needed to be addressed to make it readable. These issues include:
• The list of bookmarks is long and sometimes unwieldy, particularly if no variable labels
are available. The table of contents exists to provide a path into the dataset anyway, so
in SAS 9.2 I was able to remove the bookmarks altogether by inserting ODS PDF
NOBOOKMARKGEN before the various summary procedures.
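
A minimal sketch of suppressing bookmarks with NOBOOKMARKGEN follows. Here the option is placed on the statement that opens the PDF, which is an assumption made for simplicity; the file name is also hypothetical.

```sas
/* NOBOOKMARKGEN tells ODS PDF not to build the bookmark list */
ods pdf file='no_bookmarks_sketch.pdf' nobookmarkgen;

proc freq data=sashelp.class;
   tables sex / missing;
run;

ods pdf close;
```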
While ODS PDF Bookmarks are now easily manipulated to reduce clutter, thornier visual problems
were encountered, including:
• At low magnification, PDF anchor text appears truncated at the end of the string.
• Hyperlinks have a blue border all around the link. Users have grown to expect
browser-style links.
At the completion of each variable summarization (one per page), the expression
points the user back to the table of contents. The LinkCOLOR= style setting is designed to get rid of
an annoying blue box that is painted by default around each hyperlink. Note also the extra spaces after
the text ‘Contents’ and the closing period: at around 56% magnification in the Adobe Reader, the text
would start to appear whited out at the end, so I added some text padding. This problem goes away at
100% magnification, but the period is still there. So in the body of the report, we have a blue link that
takes us back to the contents destination in the document.
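
A return link of the kind described might be written as in the sketch below. This is an illustration under stated assumptions, not the author's expression: the escape character, the inline style attributes (URL=, LINKCOLOR=, FOREGROUND=), the amount of padding and the file name are all choices made for the example.

```sas
ods escapechar='^';
ods pdf file='return_link_sketch.pdf';

/* Stand-in for the table of contents, anchored as 'contents' */
ods pdf anchor='contents';
proc print data=sashelp.class(obs=5) noobs;
run;

/* A variable summary on its own page */
ods pdf startpage=now;
ods pdf anchor='Sex';
proc freq data=sashelp.class;
   tables sex / missing;
run;

/* Link back to the 'contents' anchor. LINKCOLOR=white is meant to hide
   the default box around the link; the trailing spaces and period pad
   the text against truncation at low magnification. */
ods pdf text="^S={url='#contents' linkcolor=white foreground=blue}Return to Contents   .";

ods pdf close;
```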
CONCLUSION
ACKNOWLEDGMENTS
I would like to thank William Qubeck for coining the term used in the title and for introducing me to define.pdf in
the first place.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Bruce Thomas
VAMC/ Providence
Research Building 32
Providence, RI
bhthomas@comcast.net
REFERENCES/ NOTES
i. The term "metadata" is copyrighted and trademarked by Metadata, LLC of Nashville, TN
(www.metadata.com). Formerly known as Metadata Information Partners, it is a software firm specializing in data
management products, consulting, and custom information systems to the health care industry. Although the term
"metadata", spelled the same way, is widely used to refer to "data about data", Metadata LLC trademarked the
name in 1986 and was granted "incontestable" status in 1991. So there is always the threat that if we publicly use
the term "metadata" the company could pursue their trademark enforcement. The most common substitutions for
"metadata" we see today are "meta-data" or "meta data".
From Correct Terminology: Do We Say "Metadata", "Meta-Data", or "Meta Data"? metadataforums.com, Stu Carty,
Published 01/7/2007
ii. Understanding Metadata. Copyright © 2004 National Information Standards Organization. ISBN: 1-880124-62-
http://www.niso.org/publications/press/UnderstandingMetadata.pdf
iii. Cynthia Zender, SAS®, VASUG presentation on Stored Processes, May 2011
iv. See PharmaSUG2011 - Paper TU01, Creating Hyperlinked PDF Graphical Patient Profiles with PROC REPORT,
William Conover, Advanced Clinical, Bannockburn, IL
http://www.pharmasug.org/proceedings/2011/TU/PharmaSUG-2011-TU01.pdf
v. Art Carpenter discusses proc report hyperlinking in
http://www.lexjansen.com/wuss/2009/how/HOW-Carpenter.pdf.
vi. “How to put PDF Named Destinations work for you”, http://www.mindtheflex.com/?p=86#more-86,
Sven-Olav Paavel, 2011
vii. Art Carpenter (note v)
viii. It has. For a discussion of hyperlinking in ODS PDF:
http://support.sas.com/resources/papers/proceedings10/035-2010.pdf
ix. Continuous or Not: How One Can Tell, Vatsala Karwe, Mathematica Policy Research,
http://www2.sas.com/proceedings/sugi28/088-28.pdf
Figure 3 -- Proc Gslide Title Page
Figure 4 -- Table of contents (hyperlinks in blue)
Figure 5 -- Top/Bottom 5 display
Figure 6 -- Continuous Variable Display
Figure 7 -- Frequency Distribution