Data RSS - Technical Overview

Pito Salas - rps@salas.com - April 9, 2009

Introduction and Background


This is the third in a series of short papers with which I am trying to create the framework and justification for a
new format which for now I am calling Data Rss. In this paper I am going to try to give a technical overview of how
it might work, without delving into why I think this is a good idea. Please see the other two papers for that:

Data RSS: A Modest Proposal (http://www.scribd.com/doc/12866121/Data-Rss)

Data Rss: A Case Study (http://www.scribd.com/doc/13583957/DataRSS-Case-Study)

Roles

DataRSS is used between two parties, the Publisher, who ‘owns’ some data, and the Accessor, who wants to use
that data. Publisher and Accessor are organizations with people in them. The Publisher wants to offer a technical
means to allow an application program simple and standardized access to their data. The Accessor wants to write an
application program that accesses and does something useful with data coming from any Publisher. Accessor and
Publisher don’t know each other.

Accessor’s Application A can as easily get data from Publisher P as from Publisher Q. Publisher P’s data can be
accessed as easily by Accessor A as by Accessor B.

Protocol and Format

Data RSS is a simple protocol and a simple data format. It can be implemented in any programming language
and, more importantly, the Publisher and Accessor software need not know (cannot know) what language the
counterparty's software is written in.

All DataRss requests return a response in one of several formats. For now those are XML, JSON and HTML.
Why HTML? So that a request from an ordinary browser returns some useful, human-readable information.
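
As a rough illustration of one payload served in several formats, here is a minimal Python sketch. The mapping from a payload to XML and HTML is an assumption made for this example, as are the helper names (render, to_xml, to_html) and the nesting of the sample payload; the paper does not define any of them, nor how a client selects a format.

```python
import json
from html import escape
from xml.etree import ElementTree as ET


def to_xml(tag, value):
    """Turn nested dicts into nested XML elements (hypothetical mapping)."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(to_xml(key, child))
    else:
        # Scalars become element text; lists are left out of this sketch.
        elem.text = str(value)
    return elem


def to_html(value):
    """Turn nested dicts into nested HTML lists for a human reader."""
    if isinstance(value, dict):
        items = "".join(
            f"<li>{escape(str(key))}: {to_html(child)}</li>"
            for key, child in value.items()
        )
        return f"<ul>{items}</ul>"
    return escape(str(value))


def render(payload, fmt):
    """Serialize one response payload in the requested format."""
    if fmt == "json":
        return json.dumps(payload)
    if fmt == "xml":
        return ET.tostring(to_xml("response", payload), encoding="unicode")
    if fmt == "html":
        return to_html(payload)
    raise ValueError(f"unsupported format: {fmt}")


# Assumed nesting: the paper's examples do not show indentation.
hello = {
    "datarss": {"version": "0.1"},
    "source": {"name": "Sunlight Labs", "version": 1},
}
for fmt in ("json", "xml", "html"):
    print(fmt, render(hello, fmt))
```

The point of the sketch is only that the data is the same in every case; the format is a presentation choice made per request.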

DataRss Endpoint

In essence DataRss is embodied by a url which we call the DataRss endpoint. A publisher makes their data
available to others by the simple and single act of implementing responses to this url. For example, hypothetically [1],
the Sunlight Foundation could let the world know that their DataRss endpoint could be found at
http://services.sunlightfoundation.com/datarss.

At minimum this would mean that clicking on that link would return a response that looks something like this [2]:
---
datarss:
version: 0.1
source:
name: Sunlight Labs
version: 1
---

[1] All examples in this paper are hypothetical.

[2] All responses will be written out in a more compact, readable form. In reality the responses will be selectable as being in XML,
JSON, YAML, or HTML.

In what follows I will document key examples of the format as it is evolving. The material is organized by
the top-level URL components that are used to control it.

REST

The overall scheme of things is that I am trying to describe a unified set of REST URL patterns. Some of the
routes return information about the data sets (i.e. discovery) and some of them return actual data.
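
To make the split between discovery routes and data routes concrete, here is a minimal Python sketch of a route table a Publisher might use. The URL patterns are the ones described in the rest of this paper; the route labels, the classify function and the regular expressions are assumptions made for this example.

```python
import re

SEG = r"[^/]+"  # one path segment

# (label, kind, pattern) in matching order. "query" must come before
# "field-lookup", otherwise ./dataset/x/query/y would be read as a field lookup.
ROUTES = [
    ("hello", "discovery", re.compile(r"^$")),
    ("info", "discovery", re.compile(r"^info$")),
    ("datasets", "discovery", re.compile(r"^datasets$")),
    ("fields", "discovery",
     re.compile(rf"^dataset/(?P<dataset>{SEG})/fields$")),
    ("queries", "discovery",
     re.compile(rf"^dataset/(?P<dataset>{SEG})/queries$")),
    ("query", "data",
     re.compile(rf"^dataset/(?P<dataset>{SEG})/query/(?P<query>{SEG})"
                rf"(?:/(?P<arg>{SEG}))?$")),
    ("field-lookup", "data",
     re.compile(rf"^dataset/(?P<dataset>{SEG})/(?P<field>{SEG})/(?P<value>{SEG})$")),
]


def classify(path):
    """Map a path relative to the endpoint to (label, kind, params), or None."""
    # Drop any question-mark parameters and surrounding slashes first.
    path = path.split("?", 1)[0].strip("/")
    for label, kind, pattern in ROUTES:
        match = pattern.match(path)
        if match:
            params = {k: v for k, v in match.groupdict().items() if v is not None}
            return label, kind, params
    return None


print(classify("dataset/candidates/fields"))
print(classify("dataset/candidates/query/businesses/9120"))
```

One consequence of this particular table is that a field literally named "query" could not be used as a url-index; whether that matters is a design question the sketch does not settle.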

N.B. There are many ways to skin this cat - as is evidenced by the fact that each Publisher who designed a REST
API for their data approached it in a slightly different way. In a way that is the problem that I am trying to address.

Data RSS patterns


In what follows, I will use “.” (a single period) to denote the Data RSS endpoint. So when you see “.”, substitute,
for example, http://www.followthemoney.org/datarss (another fictional endpoint).
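
The substitution is mechanical, as this small Python sketch shows. The function name resolve is hypothetical and only serves the example.

```python
def resolve(endpoint: str, pattern: str) -> str:
    """Expand a pattern such as "./info" against a Data RSS endpoint URL."""
    if pattern != "." and not pattern.startswith("./"):
        raise ValueError(f"pattern must be '.' or start with './': {pattern!r}")
    base = endpoint.rstrip("/")
    # "." alone is the endpoint itself; "./x/y" appends "/x/y" to it.
    return base + pattern[1:]


endpoint = "http://www.followthemoney.org/datarss"
print(resolve(endpoint, "."))
print(resolve(endpoint, "./dataset/candidates/fields"))
```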

Request url: .

The base Data RSS Endpoint returns a basic “hello world” response to prove that there is, in fact, a Data RSS
Endpoint here. It indicates the version of DataRSS and the name of the publisher, as well as whatever version
number they might set for their implementation.

Example:
---
datarss:
version: 0.1
source:
name: Sunlight Labs
version: 1
---
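
An Accessor could use this response as a sanity check before making any other request. The Python sketch below assumes the JSON form of the response, and assumes that version nests under datarss and that name and version nest under source; the examples in this paper do not show the nesting. The helper name parse_hello is hypothetical.

```python
import json


def parse_hello(body: str) -> dict:
    """Extract the three facts the base endpoint reports, or raise ValueError."""
    try:
        doc = json.loads(body)
        return {
            "datarss_version": str(doc["datarss"]["version"]),
            "publisher": doc["source"]["name"],
            "publisher_version": str(doc["source"]["version"]),
        }
    except (KeyError, TypeError, json.JSONDecodeError) as exc:
        # Anything else is treated as "no Data RSS endpoint here".
        raise ValueError("not a Data RSS hello response") from exc


sample = '{"datarss": {"version": "0.1"}, "source": {"name": "Sunlight Labs", "version": 1}}'
print(parse_hello(sample))
```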

Request url: ./info

Request performance and feature information about this particular endpoint. An accessor might call this at the
very start to learn something about the particular implementation.

Example:
Request: ./info
Response:
features:
api-key-required: Yes
formats: [JSON, XML]
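
Here is a minimal sketch of how an Accessor might act on this response, assuming it has already been decoded into a Python dict. The function name plan_requests and the preference order are assumptions made for this example. The flag is accepted either as a boolean or as the text "Yes", since some parsers turn a bare Yes into a boolean and others keep it as a string.

```python
def plan_requests(info: dict, preferred=("JSON", "XML", "HTML")) -> dict:
    """Pick a response format and note whether an API key is needed."""
    features = info.get("features", {})
    offered = [fmt.upper() for fmt in features.get("formats", [])]
    # Take the first format we prefer that the publisher also offers.
    chosen = next((fmt for fmt in preferred if fmt in offered), None)
    if chosen is None:
        raise ValueError(f"no usable format among {offered}")
    raw = features.get("api-key-required", False)
    if isinstance(raw, bool):
        needs_key = raw
    else:
        needs_key = str(raw).strip().lower() in ("yes", "true")
    return {"format": chosen, "needs_api_key": needs_key}


info = {"features": {"api-key-required": "Yes", "formats": ["JSON", "XML"]}}
print(plan_requests(info))
```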

Request url: ./datasets

Return a list of all the distinct data sets that this endpoint publishes. Each dataset corresponds more or less to a
table or database or list of information. Datasets also may present various canned queries and default behaviors.

Example:
REQUEST: ./datasets
RESPONSE:
---
name: newswire
fullname: New York Times Newswire API
---
name: campaigns
fullname: New York Times Campaign Finance API
---
Notes:

• The name of a dataset is used in subsequent requests as an identifier.
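
Because the name is the identifier, an Accessor can go straight from the dataset list to the follow-up discovery URLs. A minimal Python sketch, assuming the list has already been decoded into dicts; the function names index_datasets and discovery_urls are hypothetical.

```python
from urllib.parse import quote


def index_datasets(records):
    """Map each dataset name to its full name, rejecting duplicate names."""
    by_name = {}
    for record in records:
        name = record["name"]
        if name in by_name:
            raise ValueError(f"duplicate dataset name: {name}")
        by_name[name] = record.get("fullname", name)
    return by_name


def discovery_urls(endpoint, name):
    """Build the follow-up discovery URLs for one dataset."""
    base = f"{endpoint.rstrip('/')}/dataset/{quote(name, safe='')}"
    return {"fields": f"{base}/fields", "queries": f"{base}/queries"}


records = [
    {"name": "newswire", "fullname": "New York Times Newswire API"},
    {"name": "campaigns", "fullname": "New York Times Campaign Finance API"},
]
for name in index_datasets(records):
    print(name, discovery_urls("http://api.nytimes.com/datarss", name))
```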

Request url: ./dataset/<name>/fields

Return the list of all the distinct fields of information that may appear in responses from this dataset.

Example:
REQUEST: ./dataset/candidates/fields
RESPONSE:
---
name: imsp_candidate_id
fullname: the id number of the candidate
url-index: yes
---
name: candidate_name
fullname: the name of the candidate
url-index: no
---
Notes:

• The name of a field is used in subsequent requests as an identifier

• url-index: yes means that this field can be used as an actual part of the URL, in exactly this way:
./dataset/candidates/imsp_candidate_id/9120
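
The url-index rule can be enforced on the Accessor side before a request is ever sent. A minimal Python sketch, assuming the field list has been decoded into dicts; the function name url_index_lookup is hypothetical.

```python
from urllib.parse import quote


def url_index_lookup(endpoint, dataset, fields, field, value):
    """Build ./dataset/<name>/<field>/<value>, but only for url-index fields."""
    spec = next((f for f in fields if f["name"] == field), None)
    if spec is None:
        raise KeyError(f"unknown field: {field}")
    flag = spec.get("url-index", False)
    # Accept a decoded boolean as well as the literal text "yes".
    indexed = flag if isinstance(flag, bool) else str(flag).strip().lower() == "yes"
    if not indexed:
        raise ValueError(f"field {field} is not a url-index")
    parts = [quote(str(part), safe="") for part in (dataset, field, value)]
    return f"{endpoint.rstrip('/')}/dataset/" + "/".join(parts)


fields = [
    {"name": "imsp_candidate_id", "url-index": "yes"},
    {"name": "candidate_name", "url-index": "no"},
]
print(url_index_lookup("http://www.followthemoney.org/datarss",
                       "candidates", fields, "imsp_candidate_id", 9120))
```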

Request url: ./dataset/<name>/queries

Return the list of all the standing queries that this dataset defines. A standing query is a canned query
that is meaningful within the dataset's particular domain.

Example:
REQUEST: ./dataset/candidates/queries
RESPONSE:
---
name: businesses
type: url-parameter
parameter: imsp_candidate_id
fullname: This query will summarize contributions at the business level for a
specific candidate.
---
Notes:

• The name of the query is used in subsequent requests as an identifier

So far there are three query types:

• type: named-query

A simple name that denotes a request for a specific result set. For example,
./dataset/newswire/query/last24hours would return records corresponding to the named query last24hours.

• type: url-parameter

A query that includes a parameter right in the URL. For example:
./dataset/candidates/query/businesses/9120 would return records for a query called businesses and the argument 9120.

• type: question-mark

The most powerful query type, which allows a more open-ended set of question-mark URL
parameters. For example: ./dataset/districts/query/zips?state=MA&districtnumber=29 would return
records for a query called “zips” with parameters state and districtnumber.
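
The three query types differ only in how the request URL is assembled. The Python sketch below builds the URL from a query description as returned by ./dataset/<name>/queries, assuming it has been decoded into a dict. The function name query_url is hypothetical, and rejecting undeclared parameters is a design choice of this sketch rather than something the paper requires.

```python
from urllib.parse import quote, urlencode


def query_url(endpoint, dataset, query, args=None):
    """Build the request URL for one standing query of any of the three types."""
    args = args or {}
    base = (f"{endpoint.rstrip('/')}/dataset/{quote(dataset, safe='')}"
            f"/query/{quote(query['name'], safe='')}")
    qtype = query["type"]
    if qtype == "named-query":
        # The name alone identifies the result set.
        return base
    if qtype == "url-parameter":
        # The single argument becomes the last path segment.
        value = args[query["parameter"]]
        return f"{base}/{quote(str(value), safe='')}"
    if qtype == "question-mark":
        allowed = query.get("parameters", [])
        unknown = set(args) - set(allowed)
        if unknown:
            raise ValueError(f"undeclared parameters: {sorted(unknown)}")
        # Emit parameters in the order the publisher declared them.
        pairs = [(name, args[name]) for name in allowed if name in args]
        return f"{base}?{urlencode(pairs)}"
    raise ValueError(f"unknown query type: {qtype}")


zips = {"name": "zips", "type": "question-mark",
        "parameters": ["state", "districtnumber"]}
print(query_url("http://services.sunlightlabs.com/datarss", "districts",
                zips, {"state": "MA", "districtnumber": 29}))
```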

Conclusion
Please note: this is not a specification, nor is it meant as one. It is a working document which will
change with feedback and further design. In the Annotated Examples below you can see the examples that I have
worked through and that have driven the design.

The next step is to continue applying this model to other existing data APIs and find the holes. So far there have been
none that were especially hard to overcome.

Annotated Examples
EXAMPLE 1: New York Times Newswire API

Hypothetical New York Times DataRss endpoint: . = http://api.nytimes.com/datarss

REQUEST: .
RESPONSE:
datarss:
version: 0.1
source:
name: New York Times
version: 2

REQUEST: ./info
RESPONSE:
---
features:
formats: [JSON, XML]
api-key-required: yes
paginated: no
---

REQUEST: ./datasets
RESPONSE:
---
name: newswire
fullname: New York Times Newswire API
---
name: campaigns
fullname: New York Times Campaign Finance API
---

REQUEST: ./dataset/newswire/fields
RESPONSE:
---
name: url
url-index: no
---
name: section
url-index: no
---
name: summary
url-index: no
---
name: type
url-index: no
---
name: people
url-index: no
---
name: created
url-index: no
---
name: pubdate
url-index: no
---
... and so on

REQUEST: ./dataset/newswire/queries
RESPONSE:
---
name: recent
type: named-query
fullname: all available recent items
---
name: last24hours
type: named-query
fullname: items published in last 24 hours
---

REQUEST: ./dataset/newswire/query/last24hours
RESPONSE:
---
url: xxx
section: yyy
summary: zzz
type: aaa
people: xxx
---
and so on.
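
The discovery part of this walk-through can be automated. The Python sketch below assumes a caller-supplied fetch function that maps a URL to already-decoded data (lists of dicts); fetch stands in for an HTTP client plus a parser, and both it and the function name discover are hypothetical. The canned responses are abbreviated from the example above.

```python
def discover(endpoint, fetch):
    """Walk the discovery routes of one endpoint and return a small catalog."""
    base = endpoint.rstrip("/")
    catalog = {}
    for dataset in fetch(f"{base}/datasets"):
        name = dataset["name"]
        catalog[name] = {
            "fields": [f["name"] for f in fetch(f"{base}/dataset/{name}/fields")],
            "queries": [q["name"] for q in fetch(f"{base}/dataset/{name}/queries")],
        }
    return catalog


# Canned responses in place of a real network call.
canned = {
    "http://api.nytimes.com/datarss/datasets": [{"name": "newswire"}],
    "http://api.nytimes.com/datarss/dataset/newswire/fields": [
        {"name": "url"}, {"name": "section"}, {"name": "summary"},
    ],
    "http://api.nytimes.com/datarss/dataset/newswire/queries": [
        {"name": "recent", "type": "named-query"},
        {"name": "last24hours", "type": "named-query"},
    ],
}
print(discover("http://api.nytimes.com/datarss", canned.__getitem__))
```

Nothing in discover is specific to the New York Times, so the same function could be pointed at any other endpoint that follows the same patterns.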

EXAMPLE 2: Follow The Money

Hypothetical Follow The Money DataRss endpoint: . = http://www.followthemoney.org/datarss


REQUEST: ./info
RESPONSE:
---
datarss:
version: 0.1
source:
name: Follow The Money
version: 1
features:
api-key-required: Yes
formats: [JSON, XML]
---

REQUEST: ./datasets
RESPONSE:
---
name: candidates
fullname: Follow the Money information about candidates
paginated: yes
sorts: [sector_name, industry_name, ...]
---
name: party_pacs
fullname: Follow the Money information about PACs
paginated: yes
---

REQUEST: ./dataset/candidates/fields
RESPONSE:
---
name: imsp_candidate_id
fullname: the id number of the candidate
url-index: yes
---
name: candidate_name
fullname: the name of the candidate
url-index: no
---
name: state
url-index: no
fullname: the state this candidate is in
---

EXAMPLE REQUEST: ./dataset/candidates/imsp_candidate_id/9120


RESPONSE: information about specified candidate
NOTE: This illustrates the url-index: yes option

REQUEST: ./dataset/candidates/queries
RESPONSE:
---
name: businesses
type: url-parameter
parameter: imsp_candidate_id
fullname: This query will summarize contributions at the business level for a
specific candidate.
---

EXAMPLE REQUEST: ./dataset/candidates/query/businesses/9120


RESPONSE: information about the businesses of the specified candidate

EXAMPLE 3: Sunlight Labs API

Hypothetical Sunlight Data RSS endpoint: . = http://services.sunlightlabs.com/datarss

REQUEST: ./info
RESPONSE:
---
datarss:
version: 0.1
source:
name: Sunlight Labs
version: 1
features:
api-key-required: Yes
formats: [JSON, XML]
---

REQUEST: ./datasets
RESPONSE:
---
name: legislators
fullname: US Representatives and Senators, providing basic contact
information as well as all the various IDs we track for legislators.
paginated: no
---
name: districts
fullname: Congressional districts, providing lookups to obtain district
information from a zipcode or latitude and longitude.
paginated: no
---

REQUEST: ./dataset/districts/fields
RESPONSE:
---
name: state
fullname: the state of a district
url-index: no
---
name: districtnumber
fullname: the number of a district within a state
url-index: yes
---
name: zip
fullname: the zipcode of a district within a state
url-index: yes
---

REQUEST: ./dataset/districts/zip/02474
RESPONSE:
list of all districts in that zip. This example illustrates url-index: yes

REQUEST: ./dataset/districts/queries
RESPONSE:
---
name: zips
type: question-mark
parameters: [state, districtnumber]
---

REQUEST: ./dataset/districts/query/zips?state=MA&districtnumber=29
RESPONSE:
list info about all the zipcodes in the specified district. This example
illustrates query type: question-mark
