Open Government Data

AI AND GOVERNMENT

Editor: Vassilios Peristeras, European Commission, vassilios.peristeras@ec.europa.eu

Open Government Data: A Data Analytics Approach

John S. Erickson, Amar Viswanathan, Joshua Shinavier, Yongmei Shi, and James A. Hendler, Rensselaer Polytechnic Institute

IEEE Intelligent Systems, September/October 2013 (www.computer.org/intelligent). 1541-1672/13/$31.00 © 2013 IEEE. Published by the IEEE Computer Society.

In December 2010, the International Open Government Dataset Search (IOGDS) team at the Tetherless World Constellation (TWC) at Rensselaer Polytechnic Institute (RPI) embarked on a project to discover, document, and analyze open data catalogs published by governments at various levels around the world [1]. By early 2013, the IOGDS project had accumulated descriptive metadata for more than 1,022,787 datasets from 192 catalogs in 24 languages, representing 43 countries and international organizations.

RPI’s aggregate catalog, implemented using the Resource Description Framework (RDF) and published using both a public SPARQL endpoint and a faceted user interface, has proven to be a valuable tool for gaining insight into the nature of open government data publication. Here, we discuss what our team has learned about international government data publication trends and tendencies through the application of data analytics and data visualization to this metadata collection.

Open Government Data Publication: A Review

Motivated by the first Obama Administration’s transparency initiatives, in May 2009 the US launched the Data.gov Web portal with a catalog of 47 datasets containing previously unreleased government data. During its first year, Data.gov grew to more than 250,000 datasets, inspired hundreds of applications and services, and was seen as the flagship of the worldwide movement toward open government data publication. Other governments followed in rapid succession, and in the next few years open government sites for countries, municipalities, cities, and others went online.

The significant growth in number and size of open government data catalogs since 2009 has been made possible by the emergence of an open government data ecosystem consisting of policy makers, agencies (as providers and consumers), data experts, independent software developers and service providers, academia, and citizen stakeholders. The publication of widely varied data has inspired a wide assortment of applications and services; provided essential data for journalists, bloggers, and activists; and fueled academic research. In turn, demand from stakeholders has increased the quality and variety of this data, and has encouraged the publication of thousands of datasets as readily consumed linked open government data [2].

Building a Million-Dataset Catalog

Starting in 2010, countries published open government data catalogs using a range of platforms and cataloging approaches. Our project recognized these catalogs as potentially valuable sources of key data, providing names, descriptions, and URLs of datasets from many countries, if only the contents of those catalogs could be collected and analyzed. Governments seldom publish data catalogs using uniform data models, much less as machine-readable RDF following linked data principles [3]. To generate a uniform aggregate catalog, our team developed a semi-automated process that included manual data portal identification; catalog and dataset metadata identification; per-catalog customization of metadata harvesting tools; and automated linked data conversion and publication on a public SPARQL endpoint based on an existing laboratory infrastructure [4]. A novel faceted browser developed for the Semantic eScience Framework (SeSF) [5,6] project was adapted to provide a highly efficient faceted browse and search experience for the user.

IOGDS didn’t consider certain other characteristics of open government data publication that might be of particular interest to practitioners. For example, it would be useful to have greater detail regarding the file formats in use, giving us deeper insight into the extent of machine-readable data publication [7]. Potential adopters might be interested in the relative penetration of commercial (for example, Socrata; http://socrata.com), open source (such as Open Government Platform; http://ogpl.gov.in, and CKAN; http://ckan.org), and ad hoc data platforms. It would also have been fascinating to have gathered detailed time series data documenting the emergence of catalogs around the world. All of these details exceeded our capabilities, given the irregularity of deployed platforms and the resulting challenges of gathering metadata. Finally, while IOGDS has focused on the publication of data itself, we recognize that many government data portals have been designed to sustain communities of stakeholders, and thus also provide support for related applications and services.

The Analytics of Government Data Publication

The IOGDS catalog (http://logd.tw.rpi.edu/page/international_dataset_catalog_search) has served as an observatory for exploring the diversity of datasets published on the Web by political entities around the world. Here, we provide insights into this catalog based on analytics performed across the collected metadata.

Catalogs Published by Individual Countries

During our collection period, the US was the global leader in publishing (453,859 datasets), followed by France (353,394), Canada (179,131), the UK (12,131), and Spain (10,076). These statistics include geographic datasets of various kinds, which account for the largest subset of entries.

Languages across All Catalogs

Our analytics found a total of 24 languages represented across all catalogs in the IOGDS collection. English is by far the most prominent language (98 catalogs/652,176 datasets), since most of the open government data published through late 2012 came from English-speaking countries. Other notable languages included French (19 catalogs/528,153 datasets), Spanish (18 catalogs/8,444 datasets), Italian (14 catalogs/2,256 datasets), and German (12 catalogs/1,584 datasets). Nineteen languages were represented in the remaining 31 catalogs.

Catalog Contents by Keyword

To identify trends in data publication, we studied keywords associated with datasets and categories describing catalogs. We saw that certain data categories tend to dominate the keyword frequency analysis for some catalogs; for example, geographic datasets account for more than 441,348 datasets in the US Data.gov. Keyword analysis of Data.gov with geodata catalogs excluded (reducing the count to 4,826 datasets) gives us a more representative picture of the diversity of data, as depicted by the word cloud in Figure 1.

Figure 1. Word cloud depicting the keyword analysis of Data.gov. Excluding the geodata catalogs gives us a more representative picture of the diversity of data.

Catalogs published by municipal governments provide a clue as to the priorities of city governments and their stakeholders. Analysis of keyword sets harvested from catalogs such as New York City (http://nycopendata.socrata.com), San Francisco (http://data.sfgov.org), and Edmonton (http://data.edmonton.ca) helped our team see both the similarities and differences among these governments.

For example, the word cloud in Figure 2 displays the top keywords from 859 datasets published by New York City (http://nycopendata.socrata.com). Clearly, the city has placed heavy emphasis on the publication of datasets containing location data and datasets relevant to education and community services.

IOGDS has enabled us to study the publication tendencies of agencies within governments, as well as international organizations. For example, Figure 3 displays the top keywords from the 1,405 datasets published by the US government’s Medicare program (http://data.medicare.gov).

An examination of keywords from 10,678 datasets published in the UK government catalog (http://data.gov.uk) shows a rich diversity of data covering many sectors that it shares with other countries’ catalogs (see Figure 4). The metadata also suggests heavy use of certain generic keywords such as “transparency” and “disclosure”, which might be tied to UK policies; such keywords were uncommon in the US and other countries’ catalogs we examined.
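The keyword analysis described in this section can be sketched in a few lines of Python. This is a minimal illustration only, not the IOGDS team's actual tooling: the record layout (dicts with `catalog` and `keywords` fields), the sample records, and the names `keyword_frequencies` and `GEODATA_CATALOGS` are all assumptions made for the example. It counts how many datasets carry each keyword, optionally skipping geodata catalogs; counts like these are what a word cloud would use to size its words.

```python
from collections import Counter

# Hypothetical harvested metadata; field names are assumptions, not the IOGDS schema.
SAMPLE_DATASETS = [
    {"catalog": "geodata", "keywords": ["boundaries", "imagery"]},
    {"catalog": "geodata", "keywords": ["boundaries", "elevation"]},
    {"catalog": "geodata", "keywords": ["Boundaries", "hydrography"]},
    {"catalog": "raw", "keywords": ["Health", "medicare"]},
    {"catalog": "raw", "keywords": ["education", "health", "health"]},
]

# Hypothetical label for the catalogs to leave out of the analysis.
GEODATA_CATALOGS = {"geodata"}


def keyword_frequencies(datasets, excluded_catalogs=frozenset()):
    """Count the number of datasets tagged with each keyword."""
    counts = Counter()
    for record in datasets:
        if record["catalog"] in excluded_catalogs:
            continue
        # Normalize case and count each keyword at most once per dataset.
        unique = {kw.strip().lower() for kw in record["keywords"] if kw.strip()}
        counts.update(unique)
    return counts


all_counts = keyword_frequencies(SAMPLE_DATASETS)
non_geo_counts = keyword_frequencies(SAMPLE_DATASETS, GEODATA_CATALOGS)

print(all_counts.most_common(1))      # the geodata keyword dominates
print(non_geo_counts.most_common(3))  # remaining keywords after exclusion
```

Counting each keyword once per dataset, rather than once per occurrence, keeps a single heavily tagged record from skewing the picture; that is a design choice of this sketch, not something the article specifies.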

Figure 2. Word cloud displaying the top keywords from 859 datasets published by New York City. The city has placed heavy emphasis on the publication of datasets containing location data and datasets relevant to education and community services.

Figure 3. Word cloud showing the top keywords from the 1,405 datasets published by the US government’s Medicare program.

Figure 4. Word cloud examining keywords from 10,678 datasets published in the UK government catalog (http://data.gov.uk).

Analyzing Dataset Categories

The IOGDS category metadata characterizes subsets of datasets across catalogs. Analysis of the distribution of categories across countries gives us some insight into the priorities of governments as they publish their data. In Table 1, we use categories to summarize the publication similarities and differences between selected governments.

Making Open Government Data Discoverable and Usable

The amount of work required to collect the information in this catalog was high, and maintaining it would be a significant effort. To be able to track this kind of information on an ongoing basis, and to better compare datasets across governments (as well as to provide better search tools), we make the following recommendations.

• The importance of standard vocabularies. Datasets, regardless of their coverage, are inherently opaque and require external metadata to facilitate cataloging and discovery. The insights our team gained by performing data analytics across the aggregate IOGDS catalog were just a glimpse of what’s possible through the widespread adoption of metadata standards for publishing catalogs of data. We look forward to the increased implementation of new standards, such as the World Wide Web Consortium’s Data Catalog Vocabulary (W3C DCAT) [8], in the near future and are especially encouraged by the formal acceptance of DCAT within the European Community [9].
• The value of microdata markup and large-scale indexing. Metadata is of little practical use for large-scale discovery unless it has been exposed on the Web in ways that enable efficient indexing by search engines and other applications. Schema.org (http://schema.org) is a search engine company-led, community-driven effort centered on the publication of consensus vocabularies for expressing a website’s structured data through on-page markup, enabling search engines to better parse the information on webpages and provide richer search results. The widespread adoption of the Schema.org Dataset extension (http://schema.org/Dataset) by open government data providers and their partners will lead to radically increased accessibility for the stakeholder community.
• The need for machine-readable formats for publishing metadata. We’re witnessing a move towards distributed publication of datasets within governments as agencies begin to manage their own catalogs and portal-wide aggregation happens only virtually. It’s therefore critical for governments to embrace community-developed technical standards for the machine-readable expression of catalog contents [7], so that their datasets can be included in local, federated, and global data catalogs.

The application of data analytics to the TWC RPI IOGDS catalog has enabled a kind of “world tour” of the international open government data movement. Our analysis of the IOGDS data has shed some light on the coverage, trends, and diversity of published data around the world. In the future, the adoption of better standards will allow applications and services to repeat our data analytics at larger scale and higher frequency, using metadata collected from the world’s catalogs much more efficiently. This will be made possible through federated mechanisms enabled by new standards and best practices that are emerging with the evolution of a more mature open government data ecosystem.

Acknowledgments

This work has been made possible by a generous gift to the Tetherless World Constellation at Rensselaer Polytechnic Institute from Microsoft Research.

References

1. J.S. Erickson et al., “TWC International Open Government Dataset Catalog,” Proc. 7th Int’l Conf. Semantic Systems, ACM, 2011, pp. 227–229.
2. L. Ding, V. Peristeras, and M. Hausenblas, “Linked Open Government Data,” IEEE Intelligent Systems, vol. 27, no. 3, 2012, pp. 11–15; http://bit.ly/16YYb7s.
3. T. Berners-Lee, “Linked Data,” W3C Design Issues, 27 July 2006; http://bit.ly/cwflPW.
4. L. Ding et al., “TWC LOGD: A Portal for Linked Open Government Data Ecosystems,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 9, no. 3, 2011; http://bit.ly/16tmY9q.
5. E. Rozell et al., “A Framework for Integrating Oceanographic Data Repositories,” Proc. AGU Fall Meeting 2010, Am. Geophysical Union, 2010; http://bit.ly/1dLdxFc.
6. E. Rozell, Extensible User Interface Framework for Faceted Browsing Applications, master’s thesis, Rensselaer Polytechnic Inst., 2012; http://bit.ly/16EnnpG.
7. J. Hendler and T. Pardo, Open Government Primer on Machine-Readability, blog, 24 Sept. 2012; www.data.gov/communities/node/116/blogs/76451.
8. F. Maali and J. Erickson, eds., Data Catalog Vocabulary (DCAT), W3C.org, 1 Aug. 2013; www.w3.org/TR/vocab-dcat.
9. European Commission: Interoperability for European Public Administrations (ISA), DCAT Application Profile for Data Portals in Europe, 2 Sept. 2013; http://bit.ly/19kaBwo.

John S. Erickson is the Director of Web Science Operations at the Tetherless World Constellation at Rensselaer Polytechnic Institute. Contact him at erickj4@rpi.edu.

Amar Viswanathan is a PhD student at the Tetherless World Constellation at Rensselaer Polytechnic Institute. Contact him at kannaa@rpi.edu.

Joshua Shinavier is a PhD student in computer science at the Tetherless World Constellation at Rensselaer Polytechnic Institute. Contact him at shinaj@rpi.edu.

Yongmei Shi is a research associate at the Tetherless World Constellation at Rensselaer Polytechnic Institute. Contact her at yongmei.shi@gmail.com.

James A. Hendler is the director of the Rensselaer Institute for Data Exploration and Applications (IDEA), the Tetherless World Senior Constellation Chair, and a member of the faculty in the Department of Computer Science and the Department of Cognitive Science at Rensselaer Polytechnic Institute. Contact him at hendler@cs.rpi.edu.

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.

Table 1. Summary (in %) of top data categories across catalogs for selected countries.

US | UK | Canada | Australia | Germany
Administrative and political boundaries (18.01) | Health and social care (10.53) | Economics and industry (47.36) | Community (11.09) | Air (11.27)
Imagery and base maps (12.87) | Economy (4.07) | Health and safety (12.34) | Geography (9.31) | Water (11.06)
Transportation networks (12.49) | Children, education, and skills (3.67) | Society and culture (6.34) | Government (8.79) | Other (8.08)
Imagery Base Maps Earth Cover (11.083) | People and places (3.43) | Nature and environment (4.76) | Business (8.79) | Politics and government (5.14)
Boundaries (9.38) | Population (2.93) | Labor (3.71) | Environment (7.68) | Economy (4.16)
Geological and geophysical (7.91) | Agriculture and environment (2.54) | Persons (3.24) | Finance (6.89) | Education and science (3.92)
Locations and geodetic networks (7.85) | Crime and justice (2.21) | Agriculture (2.84) | Sciences (6.38) | Radiation (3.20)
Inland water resources (5.54) | Travel and transport (2.04) | Education and training (2.66) | Society (6.18) | Public administration, budget, and taxes (3.18)
Transportation (3.66) | Business and energy (2.01) | Transport (2.1) | Industry (5.50) | Traffic (3.18)
Oceans and estuaries (2.33) | Parish (1.84) | Government and politics (1.84) | Recreation (3.68) | Labor market (2.94)
Inland Waters (1.91) | Government (1.69) | Impacts and environmental change (1.41) | Culture (2.85) | People, family, and social affairs (2.94)
Oceans (1.58) | Health (1.53) | Science and technology (1.19) | Health (2.33) | Climate environment and nature (2.69)
Facilities and structures (1.51) | Demographics (1.32) | Information and communication (1.14) | Law (2.06) | Basic data and geosciences (2.20)
Structure (0.87) | Labor market (1.32) | Energy and greenhouse gas (GHG) emissions (1.03) | Transport (1.86) | Building and living (1.96)
Climatology Meteorology Atmosphere (0.44) | Education (1.02) | Law (0.86) | Planning (1.66) | Trade commerce (1.71)
Geography and environment (0.20) | Health (0.95) | Business and economic development (0.48) | General (1.58) | Leisure culture and tourism (1.71)
Elevation (0.1) | Employment and skills (0.85) | Census (0.48) | Safety (1.42) | Population growth (1.71)
Biota (0.09) | Environment (0.79) | Processes (0.33) | Property (1.22) | Demographics (1.47)
Environment and conservation (0.06) | Transport (0.75) | Development (0.25) | Communication (1.14) | Environment and climate (1.47)
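Percentage shares like those reported in Table 1 follow from simple arithmetic: a category's share is its dataset count divided by the catalog's total, multiplied by 100. The sketch below illustrates that calculation under stated assumptions. The function name `category_shares`, the input layout (a mapping from category name to dataset count), and the counts themselves are invented for illustration; they are not IOGDS figures, and the article does not say exactly how its percentages were derived.

```python
def category_shares(category_counts, top_n=None):
    """Return (category, percent) pairs sorted by descending share."""
    total = sum(category_counts.values())
    if total == 0:
        return []
    shares = [
        (name, round(100.0 * count / total, 2))
        for name, count in category_counts.items()
    ]
    # Largest share first, as in the table's column ordering.
    shares.sort(key=lambda pair: pair[1], reverse=True)
    return shares[:top_n]  # top_n=None keeps every category


# Invented counts for a hypothetical catalog (not taken from IOGDS).
example_counts = {"Health": 50, "Transport": 30, "Education": 15, "Other": 5}

for name, percent in category_shares(example_counts, top_n=3):
    print(f"{name} ({percent:.2f})")
```

If a dataset can belong to more than one category, the denominator should be the number of datasets rather than the sum of category counts; this sketch assumes one category per dataset.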
