
2015 48th Hawaii International Conference on System Sciences

Data Ships:
An Empirical Examination of Open (Closed) Government Data

Karine Nahon
University of Washington
The Interdisciplinary Center (IDC) Herzliya
karineb@uw.edu

Alon Peled
The Hebrew University of Jerusalem
alon.peled@post.harvard.edu

Abstract

As part of endorsing the open government data movement in many parts of the world, governments have worked to increase openness in actions where information technologies play a major role. Releasing public data was perceived by many governments and officials as a fundamental element to achieve transparency and accountability. Many studies have criticized this approach and illustrated that open government data does not necessarily lead to open government. Our study examines for the first time in a systematic, quantitative way the status of open government data in the US, by focusing on the disclosure of data by US federal agencies. Our findings suggest that most US federal agencies largely do not follow the open government policies of 2009 and 2013. The paper discusses the type of public data that is released, and analyzes the (non)strategy of its release.

1. Introduction

Open government is a concept coined decades ago. Yu and Robinson credit Wallace Parks, who served as counsel to the Special Subcommittee on Government Information, as the first to expound on the term in print in his 1957 article "The Open Government Principle: Applying the Right to Know Under the Constitution" [1]. It was not, however, until 2009, when US President Barack Obama issued the Open Government Directive (OGD) 1.0 [2], and later established the Open Government Partnership, that this concept became mainstream. Since then, many democratic countries have joined the global-and-local open government movement. This movement has resulted in diverse and dispersed initiatives, metrics, methodologies, and data release practices. The translation of the goals of the open government movement into practical programs faced many obstacles. For example, Yu and Robinson analyzed the vagueness of the concepts "open government" and "open data," which present multiple meanings to policy makers, civil society, academia, and practitioners [1]. Peled illustrated inherent contradictions in the primary goals of open government, showing that transparency and collaboration are potentially competing forces [3].

Scholarship in the first years of the open government movement was characterized by utopian voices, enthusiastically emphasizing the potential and actual benefits of open data as a fundamental element to achieve transparency and accountability, and as a mechanism for innovation, improved efficiency, participation and self-empowerment. The architects of open government programs argued that openness mobilized the expertise of the masses, harnessed agencies' hidden desire to share data, and gave birth to a community of innovators. In open government jargon, individuals were "called to arms" to support the open government program; everything associated with "the government of the past" had to be brushed away, and the spirit of openness and innovation was claimed to be contagious [4],[5],[6].

These voices were later replaced with studies assessing the benefits and barriers of open government initiatives [7],[8] and proposing ways to measure the progress of these initiatives [9]. However, most of these studies relied on case studies or limited empirical analysis (qualitative and quantitative).

In the US, in response to the OGD [2], federal agencies developed plans to increase openness and transparency in actions where information technologies play a major role [1],[9],[10],[11]. Government transparency occurs through four primary channels: proactive dissemination by the government; release of requested materials by the government; public meetings; and leaks from whistleblowers [12]. Therefore, the release of public data was perceived by US government officials as a fundamental method to achieve transparency and accountability.

In reality, there was a big gap between open government data programs and the goals of the open government movement. This gap was evidenced by poor data quality; release of data irrelevant to government service; outdated, broken links; and

restrictions on data use [1]. Open government proponents claimed that the outcomes of the open government movement can be achieved, but data must first be available, accessible and usable.

By developing an automated system to crawl the open government data portals of US federal agencies, we can for the first time assess in a systematic way the real status of open federal data in the US. Specifically, this paper discusses the following questions: Did the open government data disclosure in the US achieve the goals of its declared open government policy? Were agencies compliant with open data policy? What are the barriers to and opportunities for making government data open in the US?

2. Open Government Data in the US

The US federal government was defined as the largest producer, collector, consumer, curator and disseminator of information in the US [13],[15]. Hence, President Obama's 2009 promise to free non-classified and non-sensitive information raised hopes that the new avalanche of open, free and easily accessible government information would unleash innovation and transparency in the US.

Indeed, immediately after his election, President Obama unleashed a lightning campaign to enforce agency openness. On his first full day in office (January 21, 2009), at the height of the worst economic crisis America had experienced since the Great Depression, President Obama signed three memoranda and two executive orders. Four of these five documents promoted open government [3]. In the same month, President Obama reversed the Ashcroft memorandum of October 2001, which advised agencies to employ caution in cooperating with the Freedom of Information Act (FOIA). Obama declared: "In the face of doubt, openness prevails" [16]. In March 2009, Vivek Kundra was appointed as the first federal Chief Information Officer (CIO), signaling the importance of the program to the US government. A barrage of open government initiatives surfaced, including eRulemaking, IT Dashboard, Recovery.gov and USAspending.gov. The administration showcased open government stories and ensured that senior appointees adhered to open government principles [20].

On May 21, 2009, a team headed by the CIOs of the US Department of the Interior (DOI) and the US Environmental Protection Agency (EPA) launched www.data.gov as the premier web publishing location of the most important federal datasets [2]. Then, on December 8, 2009, the Office of Management and Budget (OMB) published the OGD [2]. The directive ordered agencies to make as much information as possible available online. At the outset, agencies were instructed to publish at least three "high-value" datasets on the web. High-value datasets were defined as containing "information that can be used to increase agency accountability and responsiveness; improve public knowledge of the agency and its operations; further the core mission of the agency; create economic opportunity; or respond to need and demand as identified through public consultation" [2].

High-value datasets were required to have never been made available or published in a downloadable and open format [2]. The first three high-value datasets were considered a down payment, or a minimum; agencies were expected to continually make new datasets available to the public [17]. The OGD specifically demanded that agencies "identify additional high value information not yet available and establish a reasonable timeline for publication online in open formats with specific target dates" [2],[7]. However, under the 2009 OGD, only federal departments and non-independent federal agencies were instructed to comply.

Four years later, in May 2013, on commencing his second term, President Obama raised the open government bar with a memorandum ordering that the default state of new and updated government information resources be made open and machine-readable. He also instructed that government information be managed as an asset throughout its life cycle. Agencies were required to maintain and nurture their datasets as products with real market value; that is, as assets that could be bought, sold and exchanged [18]. In this article, we adopt the concept of an "asset" and refer to the information artifacts that agencies generate, maintain and upgrade as information assets.

To ensure these new orders would be followed, President Obama again enforced a rigorous open government data execution timetable with which agencies were required to comply. He also established a Cross-Agency Priority Goal to track the implementation of the second generation of the open government policy. Finally, President Obama extended the OGD policy to all federal agencies by instructing independent agencies to comply with the policy [19].

Following the 2013 memorandum, OMB published a detailed implementation guide instructing agencies how to follow the President's latest OGD policy. Henceforth, we refer to this implementation guide as the "US federal open data 2.0 standard." First and most importantly, all federal agencies were ordered to develop a Public Data Listing, which

contains a list of all information assets that are or could be made available to the public. This Public Data Listing was to be posted at www.[agency].gov/data.json. The intention was to provide the public with access to an extensive inventory of agencies' information assets, enable the public to track agencies' subsequent progress to clear more assets for publication, and empower the public to re-use open data. The listing was not designed to hold metadata about an agency's classified information assets, such as those concerning national security or containing sensitive personal information about citizens.
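To make the Public Data Listing concrete, the entry below is a minimal, hypothetical example of what one asset description in a data.json catalog can look like. The field names follow the Project Open Data metadata schema (http://project-open-data.github.io/schema/); every value here is invented for illustration and does not describe a real agency asset.

```json
[
  {
    "title": "Illustrative Facility Inspections Dataset",
    "description": "Hypothetical example entry; not a real agency asset.",
    "keyword": ["inspections", "compliance"],
    "modified": "2014-03-12",
    "publisher": "Example Agency",
    "contactPoint": "Jane Doe",
    "mbox": "jane.doe@example.gov",
    "identifier": "example-agency-0001",
    "accessLevel": "public",
    "bureauCode": ["000:00"],
    "programCode": ["000:000"],
    "dataQuality": true,
    "accrualPeriodicity": "monthly",
    "spatial": "New York City",
    "temporal": "2010-01-01/2013-12-31",
    "distribution": [
      { "accessURL": "http://www.example.gov/data/inspections.csv", "format": "csv" }
    ]
  }
]
```

Several of these fields (dataQuality, programCode, keyword, spatial, temporal, accessLevel) reappear in the findings of Section 5, where we measure how consistently agencies actually fill them in.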
According to the US federal open data 2.0 standard, agencies were instructed to: continuously add new information assets (on top of the information assets that were published on data.gov before August 1, 2013); improve the quality of previously released information assets; enrich the metadata descriptions of published information assets so the public could discover them more easily; and increase the amount of information shared with other agencies, corporations, organizations and citizens.

The OMB's intention was to dynamically populate the newly renovated data.gov and use it as the primary website to find information assets generated and held by the US government [19]. The designers of the revised open government data policy were keenly aware of the problems of publishing and discovering datasets in the first iteration of the US federal open data portal (2009-2013) [3]. Beginning in May 2013, these designers planned to transfer directly to federal agencies the full responsibility of releasing and accurately describing their information assets on agency-based servers. The new federal open data portal (currently in Beta) was designed to be a secondary index of the information assets that federal agencies publish and describe on their own portals. It is important to note that, in practice, the original www.data.gov was never more than a secondary index of the information assets that federal agencies published on their own servers (see further explanation in the method section).

Table 1 summarizes three key differences between the original OGD 1.0 policy as presented in 2009 [2] and the latest version of this policy, OGD 2.0, as defined by the OMB implementation guide of 2013 [19]. Our work is the first attempt to measure — quantitatively and systematically — whether and how US federal agencies have complied with President Obama's open government policy of 2009 and 2013.

Table 1: Differences between OGD 1.0 and 2.0 policies

  Requirement                                                     OGD 1.0 (2009) [2]   OGD 2.0 (2013) [18],[19]
  Instruction for independent agencies to comply with OGD         Optional             Mandatory
  Publication of comprehensive data inventory lists               No                   Mandatory
  Default creation of all new information assets as OGD assets    No                   Mandatory

3. Benefits/Barriers of Open Government Data Programs: It Depends Who You Ask

Various scholars carefully examined open government data programs in diverse countries, attempting to understand the benefits and barriers of these programs [7],[11],[20]. Janssen et al. [8] clustered the benefits of open government data programs into three categories: 1. Political and social benefits, such as increasing transparency and participation, fostering accountability, empowering citizens, and improving government services; 2. Economic benefits, such as stimulating innovation, growing the economy, and developing new products and services; and 3. Operational and technical benefits, such as optimizing administration processes, reusing data, and improving public policies.

In 2011, initial enthusiasm for the global-and-local open government movement began to fade. A new body of scholarship emerged claiming that open government data programs did not deliver intended outcomes due to various political, organizational, technological and financial challenges. These researchers studied the barriers of such programs in an attempt to fix major flaws in their design and implementation. For example, Zuiderwijk et al. [21] examined 118 impediments of open government programs and identified 10 categories of impediments from the user perspective: 1) availability and access, 2) find-ability, 3) usability, 4) understand-ability, 5) quality, 6) linking and combining data, 7) comparability and compatibility, 8) metadata, 9) interaction with the data provider, and 10) opening and uploading. Peled focused on the supply side and divided criticism of open government data programs worldwide into three categories: bad initial program design, flawed program execution and adverse program consequences [22]. In this paper we will focus on the supply side as well.

In the US, the OGD 1.0 gave agencies discretion to decide what data to publish and to evaluate their own performance; this allowed agencies to passively

resist the open data program. Many agencies did not set openness deadlines or publish performance data; others refused to share data release plans or did not live up to the goals that they themselves created. Not surprisingly, most agencies that assessed their own performance awarded themselves the highest compliance ranking [22],[23]. However, it is important to note that there is not one agreed-upon, uniform or binding way to measure the "compliance" of agencies with open government data policies. These policies themselves were vague from day one, so it is unfair to rank and grade agencies on "compliance" with open government data policies. Still, we believe it is important, even critical, to study the supply side of open government data programs as part of an effort to develop smarter and more effective open government policies and ways to measure the progress of these policies in the future.

Most agencies reluctantly joined the US open data program. In mid-2011, 172 American agencies participated in the program, yet three agencies uploaded about 99% of the content, most of it old content.¹ The average participating agency had not returned to www.data.gov for 222 days since its last data.gov transaction [3]. European agencies, too, reluctantly participated in open data programs and dumped volumes of purposeless raw data into cyberspace [24],[25]. In Britain, Estonia and Denmark, certain agencies refused to free data because their income was partially dependent on data sales [24],[25]. Some scholars suggested that agencies refused to free datasets because these datasets are "bargaining chips" in inter-agency relationships. Other scholars argued that agencies manipulate their closely held datasets to convince legislatures to grant them budgets. Finally, scholars argued that because open data legislation compelled agencies to employ external consultants, senior government information technology (IT) officials might have been reluctant to delegate to consultants the politically sensitive job of deciding which datasets to release to the public, choosing instead not to release datasets.

¹ The three agencies — CENSUS, USGS and NOAA — published catalogs of downloadable information assets on their agency-based web portals long before President Obama declared his Open Data policy. Beginning in 2009, www.data.gov served as a secondary index to publish the contents of these older catalogs.

So, scholars hypothesized that the open data program offered agencies a bad deal: Politicians received public approval for "freeing data" while agencies were expected to free valuable datasets and undertake the time- and effort-consuming job of preparing them for release. Agencies therefore minimized their open data involvement [3],[25],[26].

Agencies were also concerned about decontextualizing their data. Data wrapped in context and traceable to its sources is a record. Records are the blood cells of governmental work. Noveck wrote: "the right of transparency is eviscerated by the practical inability of all but a handful of professionals to make sense of information" [4]. But the open government data program divorced datasets from their source records, thus converting useful records in many cases into useless datasets [27],[28]. For example, the Environmental Protection Agency (EPA) maintained context-rich Toxic Release Inventory (TRI) records on its website, which were sliced and diced into numerous, context-free datasets and uploaded to www.data.gov in 2010. In addition, the initial OGD did not prioritize what data to release first [29] and did not establish mechanisms for citizens to verify data's accuracy, completeness and authenticity [30],[31]. Practitioners argued that agencies released voluminous and meaningless datasets, repackaged data goods previously published elsewhere, and did not indicate whether released datasets were previously available. The data lacked descriptions and, sometimes, datasets could not be downloaded or opened. Agencies did not offer mechanisms to report data problems or provide explanations for the removal of released datasets [27].

Finally, open government data architects did not consider the cost of "freeing" data. Kundra argued that the government spends billions of dollars on "armies of consultants, a fragmented infrastructure, and customized, one-off applications" and that open data citizen-developers "can do more for less" [32]. Yet agencies had to hire staff to understand new legislation, adjust data to new standards, train employees, and improve data quality. Agencies also needed to convert hand-written and verbal data into digital records and integrate non-compatible data streams to prepare data for release. These activities were costly and not included in agencies' budgets [30],[33]–[35]. A recent study based on interviews with 155 senior US federal IT officials revealed that these officials are concerned about the cost of adjusting their records management programs to the demands of the open data program [36]. Nonetheless, to date, the above claims have not been supported by a rigorous, empirical-systemic study.

4. Method: Systematically Harvesting and Analyzing Information Directly from Federal Agencies

In this section, we compare our novel machine-based open government data research technique to

the two other, more established research techniques for analyzing metadata (i.e., agency- and citizen-based techniques). We then briefly discuss the unique challenges we confronted while developing our machine-based research technique.

4.1. Agency-, Citizen- and Machine-Based Open Government Data Research Techniques

Research techniques which use metadata to assess the status of open government data programs can be broadly divided into three categories: agency-based, citizen-based and machine-based techniques. There are other techniques for assessing open government data programs that do not involve studying metadata, such as historical, document and policy analysis. In this paper we focus on techniques which mainly use metadata analysis. In analyzing metadata, researchers attempt to address the following five key questions:
1. Impartiality: Is the collected metadata (about the released information assets) impartial or, alternatively, biased to reflect the interests of some parties?
2. Quality: Is the collected metadata of good quality or, alternatively, is it poorly and hastily collected and organized?
3. Freshness: Is the collected metadata timely or, alternatively, was it collected long ago, so that it no longer reflects the evolution of the information asset it purports to measure?
4. Comprehensiveness: Does the collected metadata include many types of information about the information assets?
5. Linkability: Can we associate metadata about a given information asset with the metadata about another asset?

In the "Agency-Based" technique, agencies are asked to rank themselves on various scales of "compliance" with open government data policy. "Agency-Based" is the least reliable of the open government data measurement research techniques. Agencies have no interest in reporting partial findings about their own compliance; nor do agencies possess incentives to provide high-quality, timely, comprehensive or linkable metadata about the information assets they release. So, metaphorically, the agency-based "measurement technique" is akin to empowering a cat to watch over the cream. During his first presidency, President Obama and his OGD officers relied on this technique and, unsurprisingly, all agencies awarded themselves the highest rankings for compliance with the new OGD policy [23].

The "Citizen-Based" research technique relies on citizens reporting about their use patterns and experience with open government data portals. Citizens often provide high-quality feedback about the metadata of information assets which they found useful, including metadata about how they managed to link different assets. However, citizens are required to fill out tedious forms about their experience working with OGD information assets. Therefore, most information assets that agencies release are not included in such citizen-based OGD reports (weak comprehensiveness). In addition, citizens' reports are usually a one-time endeavor, not a repeated inquiry that periodically re-assesses the state of open government data (i.e., poor freshness). The European Engage program is a good example of the benefits and challenges of citizen-based techniques [37].

Finally, in this paper we are pioneering a third type of evaluation research technique — Machine-Based. The software² automatically detects, visits and re-visits information assets that agencies published on open government data portals. Unlike humans (i.e., agency officials or citizens), the software is impartial, harvests hundreds of metadata fields about each information asset, and is timely (i.e., it visits information assets in OGD portals frequently to see whether something new can be learned about this or that information asset). Our Machine-Based research technique also provides higher-quality feedback information than the Agency-Based and Citizen-Based research techniques because it harvests additional metadata about the information asset from the published data itself.

² Alon Peled, a co-author of this paper, developed the software.

Table 2: A Comparison of the Agency-, Citizen- and Machine-Based OGD Research Techniques

  Criterion            Agency-Based   Citizen-Based   Machine-Based
  Impartiality                        •-              •
  Quality                             •               •
  Freshness                                           •
  Comprehensiveness                                   •
  Linkability                         •

Table 2 summarizes the differences among the three techniques for analyzing metadata. The main weakness of metadata analysis is that it does not study the content of the information assets; instead, it focuses on the information about the information assets. Interestingly, in its current state, our Machine-Based software has one major drawback in comparison to the Citizen-Based research technique — it does not provide measurements about

the potential to link one information asset to another in the way that citizens do when they report what they did with data. Linked Open Data is the current battle cry of the open government data movement, so we are well aware how important it is to address this weakness.

4.2. Exploring the Machine-Based Research Technique

Numerous studies focused on the US federal www.data.gov portal to analyze the open government data program in the US. Scholars found it easier to harvest metadata from a single web portal (www.data.gov) than to collect similar data from hundreds of federal open government data sites [38]. However, this trend of placing the data.gov portal at the center of scholarly research poses critical challenges to researchers.

The 2009 OGD [2] explicitly declared that the core of President Obama's new open government program would reside within the open government data portals of "individual federal agencies"³: "Within 60 days, each agency shall create an Open Government Webpage located at http://www.[agency].gov/open to serve as the gateway for agency activities related to the Open Government Directive and shall maintain and update that webpage in a timely fashion" [2]. Likewise, the 2013 Presidential memorandum extended this logic, demanding that "the default state of new and modernized government information resources shall be open and machine readable" and accessible on individual agencies' open government data portals [39].

³ The term "agency" in this paper refers to either an agency (e.g., EPA) or department (e.g., Department of Defense).

Right from its widely publicized launch day in 2009, the US federal open data portal was never intended to be more than an index to help end users discover and access information assets on individual agency web portals. The US federal open data 2.0 standard instructs each agency to develop its own Public Data Listing that includes all data assets that are or could be made available to the public. Each agency was instructed to post this listing on its own agency server using the data.json catalog (posted at www.[agency].gov/data.json).⁴ The standard explains that the newly renovated data.gov federal open government data portal (currently in its Beta version) would merely populate its indices by harvesting and aggregating information from these agency-managed Public Data Listings [19]. Simply put, the core of all matters regarding the US federal open government program was (and still is) to take place on individual agency web portals, while www.data.gov was designed to serve as a secondary index of these information assets. The first Obama Administration highlighted www.data.gov as the flagship of the open government program. In reality, www.data.gov was never more than a glorified, and not especially useful, secondary index.

⁴ JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language-independent but uses conventions that are familiar to programmers of various languages. These properties make JSON an ideal data-interchange language.

Accordingly, we argue that scholars who are interested in analyzing the promise, challenges and opportunities of the US federal open government program must pay closer and more detailed attention to agencies' individual open government data portals rather than to secondary indices such as www.data.gov. Therefore, our software was designed to systematically gather the information assets and their metadata directly from US federal agencies' open government portals, where such portals existed. Most importantly for this paper, this software solution crawls, scrapes, collects, cleanses, and registers existing metadata of information assets harvested from a list of 473 US federal agencies⁵, focusing on agency open government portals.

⁵ The number of active federal agencies is a topic of debate. We used the list generated by the 2013 US Manual: http://www.usa.gov/Agencies/Federal/Executive.shtml

The software works in the following way: the software crawls an agency's open government data portal and performs an initial indexing of all the information assets published there by the agency. Thereafter, the software periodically and regularly visits the portal to see whether it can find new information assets or glean new metadata about information assets it previously indexed. Each time, the software also downloads the data file(s) associated with the information asset and tries to extract additional metadata insights from the downloaded data. For example, if one of the downloaded data files is in CSV or XLS format, the software knows how to convert the column names, the number of rows, and some statistics about the columnar data into additional metadata descriptions of the information asset.
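To make the harvesting procedure concrete, the sketch below shows one minimal way to implement it. This is our illustration, not the actual software described above: it fetches a hypothetical agency's data.json catalog, indexes a few schema fields per asset, and derives extra metadata (column names, row counts) from any CSV component it manages to download.

```python
import csv
import io
import json
import urllib.request

def harvest_catalog(agency_host):
    """Fetch an agency's data.json Public Data Listing and index its assets."""
    url = f"http://{agency_host}/data.json"
    with urllib.request.urlopen(url, timeout=30) as resp:
        assets = json.load(resp)  # a 2013-era catalog is a JSON array of asset records
    corpus = []
    for asset in assets:
        record = {
            "title": asset.get("title"),
            "accessLevel": asset.get("accessLevel"),
            "modified": asset.get("modified"),
            "keywords": asset.get("keyword", []),
        }
        # Enrich the harvested metadata from the published data itself (CSV only here).
        for dist in asset.get("distribution", []):
            link = dist.get("accessURL") or dist.get("downloadURL")
            if link and link.lower().endswith(".csv"):
                record.update(describe_csv(link))
        corpus.append(record)
    return corpus

def describe_csv(link):
    """Download one CSV component and turn its shape into extra metadata."""
    try:
        with urllib.request.urlopen(link, timeout=30) as resp:
            rows = list(csv.reader(io.TextIOWrapper(resp, encoding="utf-8", errors="replace")))
    except Exception:
        return {"broken_link": True}  # a dead or malformed URL is itself a finding
    if not rows:
        return {"empty_file": True}
    return {"column_names": rows[0], "row_count": len(rows) - 1}

# Usage against a hypothetical portal:
# corpus = harvest_catalog("www.example-agency.gov")
```

The actual system additionally re-visits portals on a schedule and registers every harvested field in a corpus; the sketch only captures the core fetch-index-enrich cycle.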
The software crawled only through the data.json catalogs that agencies published on open government data portals, as this is the only mandatory catalog that the OMB instructs agencies to use. We admit that agencies can and do publish important information

assets not through the data.json catalog. Still, we argue that there is good (even if incomplete) value in measuring agencies against the one mandatory open government data catalog that the Obama Administration instructed agencies to use. This paper is a starting point in our long research voyage to study the information assets that agencies have published since 2009. The most important and lowest-granular information in this corpus is the rich metadata descriptions that agencies publish along with the data on their open government portals.

To the best of our knowledge, not a single central repository exists today for scholars studying agencies' release of datasets on individual agency open government portals. Hence, the research technique that we propose in this paper could also help researchers to systematically harvest information directly from individual agency open government portals in other countries or at other levels of government (e.g., municipal, county, state and territory).

To address our key research questions regarding agencies' compliance with open government data policy and the challenges and barriers to implementing this policy, we examined key attributes in our corpus to investigate this policy in four domains: quality of metadata, time, gatekeepers, and access metadata attributes. In this exploratory research we found discrepancies between the open government data policy's declarations and its execution. Next, we will discuss these findings.

5. Findings

When examining the open government portals of the 473 federal agencies in the US, we discovered that only 25 agencies (5% of the total number of agencies) adhere to the mandatory US federal open data 2.0 standard (www.[agency].gov/data.json) and have an open government portal. Three of these agencies (DHS, DOD, SEC — see Table 3) do not maintain information as mandated by the law and specified by the US federal open data 2.0 standard⁶. Therefore, the analysis below relies on the data published by the remaining 22 agencies that did partially comply with this standard. These 22 agencies have published 29,798 information assets (41,702 components⁷), as shown in Table 3 below:

⁶ A description of the metadata fields in the data.json standard can be found here: http://project-open-data.github.io/schema/
⁷ A "component" is a single file or multiple files associated with a single "information asset."

Table 3: Assets and components by US federal agencies

  Agency                                                 Open Data Portal                    Assets   Components
  Department of Commerce (DOC)                           www.doc.gov/data.json               20,488   21,453
  Environmental Protection Agency (EPA)                  www.epa.gov/data.json                3,285    4,856
  Department of Transportation (DOT)                     www.dot.gov/data.json                1,654    4,621
  Department of Health & Human Services (HHS)            www.hhs.gov/data.json                1,633    5,569
  Department of Justice (DOJ)                            www.justice.gov/data.json              769      834
  Department of Labor (DOL)                              www.dol.gov/data.json                  364      592
  Department of Agriculture (USDA)                       www.usda.gov/data.json                 331      734
  USAID                                                  www.usaid.gov/data.json                234      421
  Department of Veterans Affairs (VA)                    www.va.gov/data.json                   227      670
  Department of Energy (DOE)                             www.energy.gov/data.json               225      644
  Department of State (DOS)                              www.state.gov/data.json                113      110
  General Services Administration (GSA)                  www.gsa.gov/data.json                  110      179
  Department of Treasury (Treasury)                      www.treasury.gov/data.json              95      322
  National Archives (Archives)                           www.archives.gov/data.json              60       94
  Department of Housing and Urban Development (HUD)      www.hud.gov/data.json                   56      105
  Institute of Museum and Library Services (IMLS)        www.imls.gov/data.json                  40      171
  Office of Personnel Management (OPM)                   www.opm.gov/data.json                   32      160
  Nuclear Regulatory Commission (NRC)                    www.nrc.gov/data.json                   31       58
  National Aeronautics and Space Administration (NASA)   www.nasa.gov/data.json                  25       82
  National Transportation Safety Board (NTSB)⁸           www.ntsb.gov/data.json                  21       19
  Department of Homeland Security (DHS)⁹                 www.dhs.gov/data.json                    2        0
  National Science Foundation (NSF)¹⁰                    www.nsf.gov/data.json                    2        1
  Consumer Financial Protection Bureau (CFPB)            www.consumerfinance.gov/data.json        1        7
  Department of Defense (DOD)¹¹                          www.defense.gov/data.json                0        0
  Securities and Exchange Commission (SEC)¹²             www.sec.gov/data.json                    0        0
  Total                                                                                      29,798   41,702

⁸ In a few cases, such as the NTSB open government portal, we discovered fewer components than assets because the agency did not provide a URL where the actual data could be downloaded, or provided a malformed URL.
⁹ We failed to read and register information assets from DHS' data.json catalog because DHS uses case-insensitive field labels (e.g., "Title" instead of "title"). As in other Enterprise Data Warehousing (EDW) projects, we have neither the resources nor the motivation to "fix" the data-entry errors of individual agencies. Wherever and whenever our software detected such problems, we contacted the agency and asked it to fix its errors.
¹⁰ The NSF displays more assets in its data.json catalog than registered here. However, because the metadata was malformed and not in compliance with the technical specifications of the data.json catalog, the software failed to read it.
¹¹ We could not read and register data from DOD's data.json catalog because DOD "invented" new non-standard JSON literal types (e.g., "TRUE" instead of "true").
¹² We could not read and register data from SEC's data.json catalog because SEC uses the XML standard instead of the mandatory data.json standard.
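The failure modes recorded in footnotes 9-12 (a missing catalog, case-mangled field labels, invented JSON literals, XML in place of JSON) can be detected with a short probe. The sketch below is our illustrative reconstruction of such a check, assuming the 2013-era convention that a catalog is a JSON array of asset records; the host names and return labels are our own.

```python
import json
import urllib.error
import urllib.request

def probe_data_json(host):
    """Classify an agency's www.[agency].gov/data.json endpoint.

    Returns 'missing', 'not-json', 'malformed-catalog', or 'ok'.
    """
    try:
        with urllib.request.urlopen(f"http://{host}/data.json", timeout=30) as resp:
            body = resp.read()
    except (urllib.error.URLError, OSError):
        return "missing"       # no catalog published at the mandated address
    try:
        catalog = json.loads(body)
    except ValueError:
        return "not-json"      # e.g., SEC-style XML, or DOD-style "TRUE" literals
    if not isinstance(catalog, list):
        return "malformed-catalog"
    for asset in catalog:
        # Schema field labels are case-sensitive: DHS-style "Title" does not count.
        if not isinstance(asset, dict) or "title" not in asset:
            return "malformed-catalog"
    return "ok"

# Usage over a hypothetical agency list:
# for host in ("www.epa.gov", "www.doc.gov"):
#     print(host, probe_data_json(host))
```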

5.1. Poor Metadata

The first — and most significant — finding concerns the poor, at times incomprehensible, quality of the metadata that agencies published to describe their information assets. Interestingly, a specific metadata field ("dataQuality") exists for agencies to capture whether a given asset meets the agency's Information Quality Guidelines (true/false). Yet agencies did not provide this information for 87.72% of all published information assets. When they did report on it, they chose to mark their data as being of good quality for 99.48% of the published assets. Agencies also ignored the recommendation of the US federal open data 2.0 standard to include in their open data portals metadata about information assets that they currently cannot release — 98.50% of all assets are designated with an access level of "public," with only 1.5% of assets (445 of 29,798 assets) marked as "restricted public" or "non-public," or not marked with an access level at all.

The handful of partially complying agencies did not provide other critically important metadata. For example, 70.48% of all the information assets (21,002 of 29,798 assets) are not associated with any program code (these codes are the main vehicle through which Congress provides funding for agencies to execute various programs). Agencies did not provide even a single keyword to tag 55.1% of all assets (16,419 assets). Keywords are important as they help users to identify assets' value and relevance. Likewise, 87.63% of all the assets are not associated with a "category" (a category, such as "Transportation" or "Health," is the main thematic subject of the dataset, while keywords are more specific tags that help users discover assets; both "category" and "keywords" must be comprehensible to technical and non-technical users).

However, some good news exists: agencies marked about 82% of all their information assets with reasonable spatial bounds (i.e., the range of spatial applicability of a dataset, including named places such as "New York City" or latitudinal and longitudinal coordinates that identify the geographical location the information asset describes). Agencies also marked about 80% of all the information assets with reasonable temporal bounds (i.e., bounds defining the start and end dates of applicability for the data). Remarkably, more than 97% of all the information assets are tagged with a bureau code that empowers end users to identify correctly which sub-unit/office within the parent agency is responsible for the creation of the information asset.
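The percentages in this subsection are simple field-completeness ratios computed over the harvested asset records. A minimal sketch of that computation follows; the field list mirrors the fields discussed above (in the data.json schema, the "category" field is named "theme"), and the assets variable is assumed to come from a harvest such as the one sketched in Section 4.

```python
def field_completeness(assets, field):
    """Share of harvested assets with a non-empty value for one metadata field."""
    if not assets:
        return 0.0
    filled = sum(1 for a in assets if a.get(field) not in (None, "", [], {}))
    return filled / len(assets)

# Usage, mirroring the statistics reported above (hypothetical harvested corpus):
# for field in ("dataQuality", "programCode", "keyword", "theme",
#               "spatial", "temporal", "bureauCode", "accessLevel"):
#     print(f"{field}: {field_completeness(assets, field):.2%} filled")
```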
5.2. Outdated Data

Questions that open government data scholarship has attempted to answer are: What type of data is being disclosed to the public? How old is the data? When was the data created, and how often is it updated? What is the declared frequency of data updating, and do government officials adhere to this frequency in reality? By using our corpus of data we were able to provide initial answers for the first time.

Figure 1 shows that in 21.43% of cases, agencies do not report the date when the information asset was originally created; however, most agencies report when the data was last updated (97.11%). A big portion of the data (60%) was created after 2009, the year the OGD 1.0 initiative was launched in the US. According to Figure 1, 48% (14,316) of open government data information assets were created in 2011.

Figure 1 – Information assets created and last updated

However, in order to understand whether federal agencies are compliant with the spirit of the OGD policy, one must also examine agencies' commitment to keep the data updated and fresh.¹³ Figure 2 below shows that around 90% of all assets (26,732 assets) do not define the frequency at which the information asset must be updated. The frequency field is either not reported (left blank), or defined as "irregular," "not planned or not scheduled" or "as needed."

¹³ Unfortunately, currently we have only the last update of information assets and not the number and dates of every update. In the future the software will be able to track the updates as well.

Figure 2 – Declared update frequency

Next, we examined agencies based on their commitment to update the data they previously published. As explained above, for more than 90% of the information assets, agencies refused to commit to a time interval such as "daily" or "weekly." In fact, agencies provided several creative values under the "update frequency" field, such as "completely irregular," "notPlanned" and "None Scheduled." Still, we discovered 581 information assets that contained a commitment to update these assets at a defined interval such as "weekly," "monthly" or "annually." We discovered that agencies have failed to keep their promise to update the data regularly in 32% of these cases.
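Testing whether an agency kept such a promise reduces to comparing an asset's last-modified date against its declared interval (the data.json schema calls this field accrualPeriodicity). The sketch below illustrates the test under a simplifying assumption of ours, not necessarily the software's: a single missed interval counts as a broken promise. The interval table and example dates are invented.

```python
from datetime import date, timedelta

# Declared intervals we treat as concrete commitments (allowed staleness in days).
INTERVALS = {"daily": 1, "weekly": 7, "monthly": 31, "annually": 366}

def broke_promise(declared_frequency, last_modified, today=None):
    """True when an asset is staler than its declared update interval allows."""
    days = INTERVALS.get((declared_frequency or "").strip().lower())
    if days is None:
        return False  # blank, "irregular", "notPlanned", etc.: nothing to test
    today = today or date.today()
    return (today - last_modified) > timedelta(days=days)

# Example with invented dates: a "monthly" asset untouched for five months.
# broke_promise("monthly", date(2014, 1, 2), today=date(2014, 6, 1))  # -> True
```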

5.3. Who Discloses Public Data?

The US federal open data 2.0 standard requires the provision of the name of the contact person who uploaded the asset. However, 76.3% of all the information assets (22,868 of 29,798 assets) do not display the name of the person who uploaded the asset or, alternatively, contain only the name of the sub-unit responsible for the asset (such as "United States Department of Justice. Bureau of Justice Statistics").

When examining the data, we find that 555 individual gatekeepers are responsible for the disclosure of public data in US federal agencies. We detected these gatekeepers by studying and analyzing the "author" metadata of each information asset. The "author" information was retrieved by the software solution we used. We believe the "author" fields contain important information (that, in turn, we used to run our gatekeeping analysis) because, often, government officials upload information assets using their own work terminals and software that automatically registers their credentials. In other words, even though we have yet to interview some of these individuals and survey the others, we are confident that our analysis below relies on the objective, daily activities of all the agency officials that were engaged in uploading and updating OGD information assets since 2009.

We noted immediately that there is a skewed distribution in the amount of data disclosed by these gatekeepers (see Figure 3). Gatekeepers vary not only in the amount of public datasets they disclose to the public, but also in the number of sub-agencies they work with. We will examine the endemic problems and opportunities of current open government policy by comparing two government officials responsible for uploading and describing information assets on behalf of their agencies. Ms. Gina Pearson is the Assistant Administrator for Communications in the Department of Energy (DOE) and has served in this role since 2006. She uploaded metadata for about 135 assets on a single day (March 12, 2014) on behalf of various DOE units. While doing so, Ms. Pearson used 42 keywords to describe the 135 information assets she uploaded. Using Nahon's network gatekeeper terminology [40], Ms. Pearson is a traditional gatekeeper whose influence in terms of sharing information rarely traverses outside the boundaries of her department (i.e., the DOE).

Figure 3: Power law of open government data gatekeepers

In stark contrast, Mr. David Parrish, a 35-year geographic information systems (GIS) veteran at the EPA, was energetic about disseminating his agency's GIS data long before open government policy was adopted in 2009. Years before www.data.gov was launched, Mr. Parrish was working to release and share information among units inside the EPA, with other agencies, across levels of government, and with research institutes and corporations outside government. Our data shows that Mr. Parrish's

earliest updated asset is dated April 1981 and his latest is dated February 2014. Unsurprisingly, Mr. Parrish enthusiastically adopted the open government data program, and most of the information assets he uploaded were created after 2009, the birth year of the OGD. We did not interview Mr. Parrish (we plan to do so as part of our follow-up research report, which will combine the quantitative measurements we analyze in this paper with more qualitative measurements that rely on interviews and surveys of OGD officials such as Mr. Parrish). However, it is important to remember that Mr. Parrish's boss, the CIO of the EPA, was one of the two CIOs nominated by President Obama in 2009 to design, build and unleash www.data.gov (on May 21, 2009). Obviously, the CIO of the EPA was personally vested in the success of this pet technology project assigned to him by President Obama. Therefore, we hypothesize here that Mr. Parrish benefited enormously from the enthusiastic support of his boss — the CIO of the EPA.

Most importantly, our corpus shows that Mr. Parrish uploaded 582 open government data assets on behalf of more than 50 internal EPA units and sub-units as well as on behalf of 10 federal agencies (BIA, BLM, BTS, DOE, NGA, NOAA, USACE, USDA, USFWS, USGS), the agencies of 7 states (Alaska, Arkansas, California, Hawaii, Nevada, New Jersey, New York), four research institutions/networks (CIESIN, Columbia Project, ESRI, NGCC), one city (New York City), one county (Morris County), and one corporation (TeleAtlas). Mr. Parrish used 385 distinct keywords to describe the 582 assets he uploaded.

In Nahon's terminology, Mr. Parrish is a boundary-spanner, always seeking opportunities to disseminate open data information assets deeply and extensively inside his agency, across organizational walls in government, and between government and other sectors [40],[41]. Regrettably, we find very few boundary-spanners in our corpus. Without them, no open data policy will reach wide and far. Possibly the greatest opportunity to unleash the potential of the US federal open data 2.0 standard is to identify, nurture, and support a significantly larger group of boundary-spanners such as Mr. Parrish of the EPA.
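The gatekeeper analysis itself is a simple grouping of assets by their "author"-style contact metadata, followed by an inspection of the resulting distribution. A compact sketch of that grouping, with an invented usage example:

```python
from collections import Counter

def gatekeeper_counts(assets, field="author"):
    """Count uploaded assets per named gatekeeper, skipping anonymous entries."""
    return Counter(a.get(field) for a in assets if a.get(field))

# Example with invented records; the heavy-tailed ("power law") pattern of Figure 3
# appears as a handful of very large counts followed by a long tail of ones.
# assets = [{"author": "D. Parrish"}] * 582 + [{"author": "G. Pearson"}] * 135
# print(gatekeeper_counts(assets).most_common(5))
```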
5.4. Hard to Access

As explained above, the 29,798 information assets whose metadata we analyzed contained 41,702 components (see footnote 7 for the definition of a "component" and its relationship to an "information asset"). The software we used downloaded the 41,702 components, analyzed them, and registered in our corpus additional metadata about the information assets associated with these components. We analyzed this metadata and discovered that 82% (33,862 of 41,702) of all the components were not machine-processable as required by the OGD 2.0 policy. Put differently, we discovered that it did not matter much if end users (after much trouble) managed to navigate to one of the data ships in the OGD ocean. In more than 80% of the cases, such navigation efforts end in frustration when end users discover that the "data" they seek is saved in a non-machine-processable file type such as PDF, HTML, JPG or TIFF. In addition, we discovered that 6% of all the download URLs (i.e., the URLs provided by the agency to empower the end user to download the actual data) are broken links.
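Both accessibility findings — non-machine-processable formats and dead download URLs — can be computed from component metadata alone. The sketch below illustrates one way to do it, with the format blacklist taken from the text above and a crude extension-based heuristic of our own:

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse

# File types named in this section as non-machine-processable.
NON_PROCESSABLE = {"pdf", "html", "htm", "jpg", "jpeg", "tif", "tiff"}

def is_machine_processable(url):
    """Judge processability from the file extension of a component's download URL."""
    path = urlparse(url).path
    ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    return ext not in NON_PROCESSABLE

def is_broken(url):
    """True when the component's download URL cannot be opened at all."""
    try:
        with urllib.request.urlopen(url, timeout=30):
            return False
    except (urllib.error.URLError, ValueError, OSError):
        return True

# Example tallies over a hypothetical component list:
# urls = ["http://www.example.gov/a.csv", "http://www.example.gov/b.pdf"]
# print(sum(is_machine_processable(u) for u in urls) / len(urls))
```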
6. Discussion: From Released Data to Open Government

One important limitation of this study is that the software that we used in this article captures only a small (though very interesting) part of the information-release strategies of US federal agencies. Therefore, our ability to answer broad questions such as "are US federal agencies more transparent?" based on the data we analyzed is very limited.

Still, even within the boundaries of our limited data, we discovered interesting findings that contribute something original to the growing body of literature that describes the information release strategies of public-sector agencies. For example, only 5% of US federal agencies met the minimal US federal open data 2.0 standard requirement (i.e., publish a data.json catalog on their servers to list inventories of their information assets). Most of the data that the small subset of complying agencies published suffers from poor quality of metadata, outdated data, a tiny number of gatekeepers who publish the vast amount of data, and accessibility issues. Our analysis clearly demonstrates the differences between the four modes of information states: released data, accessible data, open data and open government. Most of the data disclosed to the public is incomprehensible. Some of the data released is not even accessible. For example, citizens are frustrated by broken links, non-machine-processable data such as HTML information, or data that completely lacks descriptions. The slogan "if we build it, they will come," which represents the belief that the release of large volumes of data will increase transparency and participation, is here shown to be a naïve one.

Even when agencies make their data accessible, there is no guarantee that the released information is

truly open. The little data that is open, comprehensible and accessible generates even more troubling questions: Does such data promote transparency and governmental accountability? Does the data contribute to the strengthening and improvement of government, including the values of public participation, citizens' empowerment and the democratic ideal? Our analysis proposes that the journey from the initial disclosure of public data to implementing open government principles is long.

Our study examined for the first time in a systematic, quantitative way the status of open government data in the US by focusing on the disclosure of data by US federal agencies as mandated by the dictates of the US federal open data 2.0 standard. Our findings suggest that most US federal agencies do not comply with this standard at all, while only 25 agencies partially and weakly comply with it. Most information assets that are published are not updated, display partial data descriptions and sometimes even broken links to the data, and are hard to search and find. Therefore, for now, and five years after the birth of the original open government data policy, the US open government data movement is metaphorically comparable to a vast ocean with sparsely scattered data ships; a majority of these ships are lost at sea, sailing sans target, purpose or captain.

7. References

[1] H. Yu and D. Robinson, "The New Ambiguity of 'Open Government,'" UCLA Law Rev. Discourse, vol. 59, pp. 178–208, 2012.
[2] Office of Management and Budget (OMB), "Open Government Directive," Washington D.C., Memorandum No. M-10-06, Dec. 2009.
[3] A. Peled, "When Transparency and Collaboration Collide: The USA Open Data Program," J. Am. Soc. Inf. Sci. Technol., vol. 62, no. 11, pp. 2085–2094, 2011.
[4] B. S. Noveck, Wiki Government: How Technology Can Make Government Better, Democracy Stronger, and Citizens More Powerful. Brookings Institution Press, 2009.
[5] Open Knowledge Foundation, "The Open Data Handbook Documentation," 2012.
[6] B. Noveck, Testimony of Dr. Beth S. Noveck Before the Standing Committee on Access to Information, Privacy, and Ethics of the Canadian Parliament. House of Commons Canada, 2011.
[7] E. Barry and F. Bannister, "Barriers to Open Data Release: A View from the Top," presented at the European Group for Public Administration, Edinburgh, 2013.
[8] M. Janssen, Y. Charalabidis, and A. Zuiderwijk, "Benefits, Adoption Barriers and Myths of Open Data and Open Government," Inf. Syst. Manag., vol. 29, no. 4, pp. 258–268, Sep. 2012.
[9] G. Lee and Y. H. Kwak, "An Open Government Maturity Model for social media-based public engagement," Gov. Inf. Q., vol. 29, no. 4, pp. 492–503, Oct. 2012.
[10] L. B. Bingham and S. Foxworthy, "Collaborative Governance and Collaborating Online: The Open Government Initiative in the United States," presented at the Converging and Conflicting Trends in the Public Administration of the US, Europe, and Germany, 2012.
[11] J. C. Bertot, P. McDermott, and T. Smith, "Measurement of Open Government: Metrics and Process," in 2012 45th Hawaii International Conference on System Science (HICSS), 2012, pp. 2491–2499.
[12] S. J. Piotrowski, Governmental Transparency in the Path of Administrative Reform. Albany: State University of New York Press, 2007.
[13] R. Gelman, "The Foundations of United States Government Information Dissemination Policy," in Public Sector Information in the Digital Age – Between Markets, Public Management and Citizens' Rights, G. Aichholzer and H. Burkert, Eds. Northampton, MA: Edward Elgar, 2004, pp. 123–136.
[14] S. Holden and P. Fletcher, "The Virtual Value Chain and E-Government Partnership: Nonmonetary Agreements in the IRS E-Files Program," in Handbook of Public Information Systems, D. Garson, Ed. Boca Raton, FL: Marcel Dekker, 2005, pp. 369–387.
[15] Office of Management and Budget (OMB), "Circular No. A-130 Revised," 28-Nov-2000.
[16] B. Obama, "The Freedom of Information Act (FOIA)," Washington D.C., Memorandum, Jan. 2009.
[17] P. McDermott, "Building open government," Gov. Inf. Q., vol. 27, no. 4, pp. 401–413, Oct. 2010.
[18] Executive Office of the President, "Open Data Policy – Managing Information as an Asset," Washington D.C., Memorandum No. M-13-13, May 2013.
[19] Office of Management and Budget (OMB), "Supplemental Guidance on the Implementation of M-13-13 'Open Data Policy – Managing Information as an Asset,'" 2013.
[20] R. P. Lourenço, "Data Disclosure and Transparency for Accountability: A Strategy and Case Analysis," Inf. Polity, vol. 18, no. 3, pp. 243–260, Jul. 2013.
[21] A. Zuiderwijk, M. Janssen, S. Choenni, R. Meijer, and R. S. Alibaks, "Socio-technical Impediments of Open Data," Electron. J. E-Gov., vol. 10, no. 2, pp. 156–172, 2012.
[22] The White House, "Open Government Initiative – Around the Government," 2010.
[23] J. Wonderlich, "Obama's Open Government Directive, Two Years On," Sunlight Foundation, 07-Dec-2011.
[24] Public Accounts Committee, "Implementing the transparency agenda," 2012.

[25] T. Van Den Broek, B. Kotterink, N. Huijboom, W. Hofman, and S. Van Grieken, Open Data Need A Vision Of Smart Government: Roadblocks To A Pan European Market For PSI Reuse. 2011.
[26] J. Harper, "Grading the Government's Data Publication Practices," Policy Anal., vol. 711, pp. 1–43, Nov. 2012.
[27] G. Bass, D. Brian, M. Fuchs, A. Schwartz, P. McDermott, E. Miller, and A. Weismann, "Letter Encouraging the Administration to Improve Its Open Government Efforts," 2010.
[28] A. C. Thurston, "Trustworthy Records and Open Data," J. Community Inform., vol. 8, 2012.
[29] J. Harper, "Government Spending Transparency: 'Needs Improvement' Is Understatement," CATO@LIBERTY, 2011.
[30] R. J. Cole, "Some Observations on the Practice of Open Data As Opposed to Its Promise," J. Community Inform., vol. 8, 2012.
[31] T. Davies and Z. A. Bawa, "The Promises and Perils of Open Government Data (OGD)," J. Community Inform., vol. 8, 2012.
[32] V. Kundra, "From Data to Apps: Putting Government Information to Work for You," The White House Blog, 2011.
[33] F. Bannister and R. Connolly, "The Trouble with Transparency: A Critical Review of Openness in e-Government," Policy Internet, vol. 3, no. 1, pp. 1–30, Feb. 2011.
[34] A. Schellong and E. Stepanets, "Unchartered waters – the State of Open Data in Europe," Wiesbaden, Germany, Jan. 2011.
[35] UK Comptroller and Auditor General, "Implementing Transparency: Cross-Government Review," 2012.
[36] M. Biddick and W. Kash, "2014 Federal Government IT Priorities," Washington, D.C., 2013.
[37] A. Zuiderwijk, K. Jeffery, and M. Janssen, "The Potential of Metadata for Linked Open Data and its Value for Users and Publishers," JeDEM - EJournal EDemocracy Open Gov., vol. 4, no. 2, pp. 222–244, Dec. 2012.
[38] A. Peled, "Re-Designing Open Data 2.0," JeDEM - EJournal EDemocracy Open Gov., vol. 5, no. 2, pp. 187–199, Dec. 2013.
[39] B. Obama, "Making Open and Machine Readable the New Default for Government Information," Washington D.C., Executive Order, May 2013.
[40] K. Barzilai-Nahon, "Toward a Theory of Network Gatekeeping: A Framework for Exploring Information Control," J. Am. Soc. Inf. Sci. Technol., vol. 59, no. 9, pp. 1493–1512, Jul. 2008.
[41] K. Nahon and J. Hemsley, Going Viral, 1st edition. Polity, 2013.
