Introduction to open data for journalists

London, September 2013

Before we start
Wi-Fi: ODINET
Password: OpenData
Please download the latest Chrome at: http://tinyurl.com/install-google-chrome-now
Please set up a Google account if you don't have one already at: https://accounts.google.com/SignUp

Introductions
Ulrich Atz, Statistician, Open Data Institute, @statshero
Kathryn Corrick, Head of training, Open Data Institute, @kcorrick

Introductions
Your name
Where you've come from
Role
Your aims for the day

Agenda - Today
What is data?
Using data in journalism
Finding reliable data sources
Is there really a story in this data?
Using your data, including law and licensing
Cleaning and visualising your data
Common data analysis mistakes
Presenting your story
Special invited guest

A note on the exercises and additional links

Exercises: http://bit.ly/odidjexercises
Other materials, further reading and links: http://bit.ly/odidjlinks
The slides will be made available online after the course.

WHAT IS DATA?

Discussion
In your groups, discuss: what is data for you?

http://theodi.github.io/data-definitions/

Open data is data that can be freely used, reused and redistributed by anyone subject only, at most, to the requirement to attribute and share alike.
Opendefinition.org

USING DATA IN JOURNALISM

http://www.theguardian.com/news/datablog/2012/may/24/data-journalism-punk

http://www.ft.com/cms/s/0/4b1a2f64-2048-11e3-9a9a-00144feab7de.html

http://numeroteca.org/2011/12/05/surface-newspapers-front-pagesvs-twitter-nov30th-occupy-ows/

http://road.cc/content/news/93687-bikes-faster-public-transport-most-londonjourneys-under-8-miles

http://wheredoesmymoneygo.org/

http://openspending.org/

http://behindthewire.theglobalmail.org/

Case Study

http://smtm.labs.theodi.org/

An approach to using data in journalism (Data Percolation)


Source Prepare Analyse
See also: the Data Journalism Handbook

Eventually, write the report (=simplify)

We recommend you offer someone a coffee and find a second pair of eyes for your work.

How should I budget my time?

Time for a coffee

1.1 - FINDING RELIABLE DATA SOURCES

Discussion
What makes a trusted (data) source?

Plenty of data sources exist

Exercise 1.1: explore data portals (see your handouts)

Exercise 1.1b: a basic test with 3 questions (see your handouts)


1. Compared to what?
2. Since when?
3. Says who?
From the Statistical literacy guide "How to spot spin and inappropriate use of statistics": http://www.getstats.org.uk/wp-content/uploads/2012/02/How-to-spot-error.pdf

How to stay up to date with government data releases


Office for National Statistics release calendar
Parliamentary releases mailing list
Planning alerts mailing list
Press releases
RSS feeds
Twitter
See your links handout for further details.

Now that Google Reader has gone: RSS feed readers


Most recommended: Feedly - http://cloud.feedly.com/#welcome
The Old Reader - http://theoldreader.com/
Digg Reader - http://digg.com/login?next=%2Freader

1.2 - CONFIRM THAT THE DATA IS RELEVANT: IS THERE A STORY?

Questions you may want to ask yourself


Is the dataset complete?
Does it include crucial variables?
Time and geography
If a statistic is the answer, what was the question?

1.3 - USING YOUR DATA: UNDERSTANDING LAW & LICENSING

Law and licensing


Please note, I am not a lawyer and this should not be treated as legal advice.

Key laws affecting data journalism


Intellectual property: copyright and database rights
Computer Misuse
Data Protection
Freedom of Information Act

What are intellectual property rights?


Rights that allow ownership of creations:
Patents
Trade marks
Design rights
Copyright
Database rights
Many creations are a bundle of rights, protected by more than one or all of the above.

Copyright, Designs and Patents Act 1988


Original works, e.g. content, graphics, text, music
Gives exclusive rights to the author of the work, allowing the author to control the copying and exploitation of it
Arises automatically
Fair dealing: criticism or review, reporting current events, non-commercial research, educational use
Beware the "public domain" assumption and myth

Database definition
A collection of independent works, data or other materials which are arranged in a systematic or methodical way and are individually accessible by electronic or other means

Databases
Copyright
  Creative effort and substantial investment in the selection and presentation
  Individual components of the database
Database rights
  Substantial investment in obtaining, verifying and presenting the database

Rule of thumb
Do you have rights or permission to publish?
Do you have rights to use the information/data?
Is the data derived from other sources?
(see licensing)

Computer Misuse Act


Offences
  Unauthorised access to computer material
  Unauthorised access with intent to commit or facilitate further offences
  Unauthorised modification of computer material
Penalties
  2 to 10 years' imprisonment
  Fines

Rule of thumb
Leaks and whistleblowing: get your editor and the legal team in.

Data Protection
Personal data: Data Protection Act 1998
Data relating to a living, identifiable person must be processed fairly and lawfully
This includes processing that is not immediately apparent to users, e.g. cookies (new laws and guidance)
Damages are available to data subjects

Rule of thumb
Does this data contain personally identifiable data?
Could this data be combined with another data set to create personally identifiable data?
Anonymisation is hard.
See:
http://www.scribd.com/doc/128356210/Business-considerationsfor-privacy-and-open-data-how-not-to-get-caught-out
http://www.scribd.com/doc/125638490/Getting-to-grips-with-theNational-Pupil-Database-personal-data-in-an-open-data-world
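One rough way to see why anonymisation is hard is to count how many records are unique on a handful of seemingly harmless fields. The sketch below is only an illustration: the records, the field names and the helper `records_unique_on` are all made up, and passing this check does not mean a data set is safe to publish.

```python
from collections import Counter

def records_unique_on(rows, fields):
    """Return the rows whose combination of `fields` appears only once."""
    keys = [tuple(row[field] for field in fields) for row in rows]
    counts = Counter(keys)
    return [row for row, key in zip(rows, keys) if counts[key] == 1]

# Made-up records with names already removed (hypothetical field names)
records = [
    {"postcode_district": "N1", "birth_year": 1975, "sex": "F"},
    {"postcode_district": "N1", "birth_year": 1975, "sex": "F"},
    {"postcode_district": "N1", "birth_year": 1982, "sex": "M"},
    {"postcode_district": "E8", "birth_year": 1990, "sex": "F"},
    {"postcode_district": "E8", "birth_year": 1990, "sex": "F"},
    {"postcode_district": "E8", "birth_year": 1941, "sex": "M"},
]

# Rows that are the only one with their combination of values could be
# matched to a named person if another data set holds the same fields.
at_risk = records_unique_on(records, ["postcode_district", "birth_year", "sex"])
print(f"{len(at_risk)} of {len(records)} records are unique on these fields")
```

Any record that is unique on such fields deserves a second look before publication, and a conversation with your editor.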

Licences: what to look for


Licences identify the scope and limits of how intellectual property can be used.
Commonly used in the UK:
All rights reserved
Royalty-free licence
Paid-for licence
Open Government Licence
Creative Commons Licence

http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/

Rule of thumb
If you are uncertain about what rights you may have over a piece of content or dataset, or how you can use it, contact the owner. Ask.

And what about crowd-sourced data?

http://paidcontent.org/2013/05/24/crowdsourcing-thenews-do-we-need-a-public-license-for-citizenjournalism/

Freedom of Information Act 2000


Provides public access to recorded information held by public authorities
The Act does not necessarily cover every organisation that receives public money
Recorded information includes printed documents, computer files, letters, emails, photographs, and sound or video recordings

FOIA tips
Sign up to 'What Do They Know?'
https://www.whatdotheyknow.com/

Always check commercial confidentiality. See the Information Commissioner's Office advice:

http://www.ico.org.uk/~/media/documents/library/Environmental_info_reg/Practical_application/eir_confidentiality_of_commercial_or_industrial_information.ashx

2.1 - CLEANING YOUR DATA

Introducing Open Refine

https://code.google.com/p/google-refine/

Exercise 2: Open Refine


We will use Land Registry's Price Paid Data (PPD). Use the latest release, from August 2013. The link is in the exercise pack.
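A large part of cleaning is spotting different spellings of the same value. The sketch below shows the general idea of grouping values by a normalised key. It is a simplified illustration, not the tool's own implementation, and the town names and the helpers `fingerprint` and `cluster` are made up for the example.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Simplified key: trim, lowercase, drop punctuation, sort unique words."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values sharing a key; keep only groups with several spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants)
            for key, variants in groups.items()
            if len(variants) > 1}

# Made-up examples of messy town names
towns = ["LONDON", "London ", "london.", "St. Albans", "ST ALBANS", "Leeds"]
clusters = cluster(towns)

for key, variants in clusters.items():
    print(key, "->", variants)
```

Each group is a candidate for merging into one spelling, but a person should still confirm that the variants really mean the same thing.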

Merging data sets is the DANGER ZONE


Most mistakes happen when combining data, and the big problem is that you might not notice! Be extra careful or seek help.
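One way to be extra careful is to compare the matching column in both data sets before merging. The sketch below, which uses made-up figures and a hypothetical helper `check_merge_keys`, reports duplicated keys and keys that appear on one side only. Both problems tend to produce silently wrong results.

```python
from collections import Counter

def check_merge_keys(left_rows, right_rows, key):
    """Report key problems that commonly corrupt a merge."""
    left_keys = [row[key] for row in left_rows]
    right_keys = [row[key] for row in right_rows]

    def duplicated(keys):
        return sorted(k for k, n in Counter(keys).items() if n > 1)

    return {
        "duplicates_left": duplicated(left_keys),
        "duplicates_right": duplicated(right_keys),
        "only_in_left": sorted(set(left_keys) - set(right_keys)),
        "only_in_right": sorted(set(right_keys) - set(left_keys)),
    }

# Made-up data: note the repeated row and the differently spelt area name
residents = [
    {"area": "Camden", "residents": 1000},
    {"area": "Hackney", "residents": 1200},
    {"area": "Islington", "residents": 900},
]
incidents = [
    {"area": "Camden", "incidents": 30},
    {"area": "Hackney", "incidents": 25},
    {"area": "Hackney", "incidents": 25},
    {"area": "islington ", "incidents": 20},
]

report = check_merge_keys(residents, incidents, "area")
for problem, keys in report.items():
    print(problem, keys)
```

An empty report does not prove the merge is right, but a non-empty one is a clear signal to stop and investigate.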

3.1 - VISUALISE AND EXPLORE YOUR DATA

Exercise 3.1: Google spreadsheets


See exercise pack for two suggested data sets and some tasks you can try.

3.2 - ANALYSING YOUR DATA: COMMON MISTAKES

Other common mistakes.


1. The difference between percentages & percentage points
2. Average, Mean, Median
3. Visualisations
4. Correlation versus causation

Percentages
Know the difference between a percentage and a percentage point. VAT increased from 17.5% to 20% in January 2011. This is a rise of 2.5 percentage points, not a rise of 2.5%. How much would a rise of 2.5% actually be?

From: Stories and Statistics by Frank Swain
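The VAT question can be worked through in a few lines. This is a minimal sketch using only the two rates quoted on the slide. The function names are made up for illustration.

```python
def percentage_point_change(old_rate, new_rate):
    """Absolute difference between two rates that are given in percent."""
    return new_rate - old_rate

def percent_change(old_rate, new_rate):
    """Relative change of new_rate compared with old_rate, in percent."""
    return (new_rate - old_rate) / old_rate * 100

old_vat, new_vat = 17.5, 20.0

points = percentage_point_change(old_vat, new_vat)
relative = percent_change(old_vat, new_vat)

# What the rate would be if VAT had really risen by 2.5 percent
after_small_rise = old_vat * (1 + 2.5 / 100)

print(f"Rise in percentage points: {points:.1f}")
print(f"Rise in percent: {relative:.1f}%")
print(f"VAT after a 2.5% rise: {after_small_rise:.2f}%")
```

So the actual change was a rise of 2.5 percentage points, or roughly 14%, while a rise of 2.5% would have taken the rate to just under 18%.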

Averages
Where is the mode?
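The three common averages can tell very different stories about the same numbers. The sketch below uses made-up salaries, with one very high earner, to show how far apart they can be.

```python
from statistics import mean, median, mode

# Made-up annual salaries in pounds for a small office
salaries = [18_000, 18_000, 21_000, 24_000, 27_000, 30_000, 142_000]

average = mean(salaries)      # pulled upwards by the single high salary
middle = median(salaries)     # the value in the middle of the sorted list
most_common = mode(salaries)  # the value that occurs most often

print(f"Mean:   {average:,.0f}")
print(f"Median: {middle:,.0f}")
print(f"Mode:   {most_common:,.0f}")
```

When a story says "the average", it is worth asking which of these was used and whether a few extreme values are driving it.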

Problems with maps

Map projections
Mercator projection
Kavrayskiy VII

XKCD

Creative Commons Attribution Non-Commercial http://xkcd.com/552/
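Two series can move together without one causing the other, for example when a third factor drives both. The sketch below uses made-up figures in which temperature determines both ice cream sales and sunburn cases. The `pearson` helper is a plain implementation of the standard correlation coefficient.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up monthly figures: temperature is the hidden common cause
temperature = [4, 6, 9, 13, 17, 20, 22]
ice_cream_sales = [10 + 3 * t for t in temperature]
sunburn_cases = [2 + t // 2 for t in temperature]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"Correlation between ice cream sales and sunburn: {r:.2f}")
```

The correlation is very strong, yet ice cream does not cause sunburn. Before reporting a link, ask what else could explain both series.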

3.3 - DO YOUR RESULTS PASS A SENSE-CHECK?

Exercise 3.3: Sense-checking


The best way to do this is to get a second pair of eyes to help you. List two ways you would double-check an analysis you did previously.

PRESENTING YOUR STORY

Three tips for using statistical language


1. Use "substantial" instead of "significant" (except if you mean it).
2. Do not abbreviate or alter statements if you are unsure, e.g. do not replace a change in percentage points with a change in percentage.
3. Be precise with your definitions.
   DO: 200,000 economic migrants came to the UK from Eastern Europe last year.
   DON'T: 200,000 workers came to the UK from Eastern Europe last year.

Tips for data visualisations


1. Show the data: integrity, density
2. Emphasise the data: data-ink ratio, avoid chart-junk
3. Inform and engage: context, audience
Ulrich's interpretation of Edward Tufte's principles

An example from the Scottish Government

http://www.scotland.gov.uk/Topics/Statistics/16002/DataTrendsInternet

Improved version

Inform and engage

Finally, make sure you include

Data sources and links to your analysis
How they can be used and reused by others (licences)

Final exercise: assess the following article

http://globalnews.ca/news/622513/opendata-alberta-oil-spills-1975-2013/

Special guest
Nick Scott, Import.io

CONCLUSIONS AND TIME FOR QUESTIONS

Our top tip: the BBC Radio 4 programme and podcast "More or Less"

http://www.bbc.co.uk/podcasts/series/moreorless

Thank you!
Further reading and links: http://bit.ly/odidjlinks
The slides will be made available online after the course.
