Where To Find Large Datasets Open To The Public

Big Data

Data Science

Seeking Question

Where can I find large datasets open to the public?

Answer Wiki
Here are many of the links mentioned so far:
Cross-disciplinary data repositories, data collections and data search engines:
1. https://www.kaggle.com/datasets
2. http://usgovxml.com
3. http://aws.amazon.com/datasets
4. http://databib.org
5. http://datacite.org
6. http://gshare.com
7. http://linkeddata.org
8. http://reddit.com/r/datasets
9. http://thewebminer.com/
10. http://thedatahub.org (alias http://ckan.net)

11. http://quandl.com
12. Social Network Analysis Interactive Dataset Library (Social Network Datasets)

13. Datasets for Data Mining

14. http://enigma.io
15. http://www.undthem.com/
16. http://NetworkRepository.com - The First Interactive Network Data Repository
17. http://MLvis.com
18. Open Data Inception - A Comprehensive List of 2500+ Open Data Portals in the
19. http://data.opendatasoft.com (OpenDataSoft catalog)

Single datasets and data repositories

1. http://archive.ics.uci.edu/ml/
2. http://crawdad.org/
3. http://data.austintexas.gov
4. http://data.cityofchicago.org
5. http://data.govloop.com
6. http://data.gov.uk/
7. data.gov.in
8. http://data.medicare.gov
9. http://data.seattle.gov
10. http://data.sfgov.org
11. http://data.sunlightlabs.com
12. https://datamarket.azure.com/
13. http://developer.yahoo.com/geo/g...
14. http://econ.worldbank.org/datasets
15. http://en.wikipedia.org/wiki/Wik...
16. http://factfinder.census.gov/ser...
17. http://ftp.ncbi.nih.gov/
18. http://gettingpastgo.socrata.com
19. http://googleresearch.blogspot.c...
20. http://books.google.com/ngrams/
21. http://medihal.archives-ouvertes.fr
22. http://public.resource.org/
23. http://rechercheisidore.fr
24. http://snap.stanford.edu/data/in...
25. http://timetric.com/public-data/
26. https://wist.echo.nasa.gov/~wist...
27. http://www2.jpl.nasa.gov/srtm

28. http://www.archives.gov/research...
29. http://www.bls.gov/
30. http://www.crunchbase.com/
31. http://www.dartmouthatlas.org/
32. http://www.data.gov/
33. http://www.datakc.org
34. http://dbpedia.org
35. http://www.delicious.com/jbaldwi...
36. http://www.faa.gov/data_research/
37. http://www.factual.com/
38. http://research.stlouisfed.org/f...
39. http://www.freebase.com/
40. http://www.google.com/publicdata...
41. http://www.guardian.co.uk/news/d...
42. http://www.infochimps.com
43. http://www.kaggle.com/
44. http://build.kiva.org/
45. http://www.nationalarchives.gov....
46. http://www.nyc.gov/html/datamine...
47. http://www.ordnancesurvey.co.uk/...
48. http://www.philwhln.com/how-to-g...
49. http://www.imdb.com/interfaces
50. http://imat-relpred.yandex.ru/en...
51. http://www.dados.gov.pt/pt/catal...
52. http://knoema.com
53. http://daten.berlin.de/
54. http://www.qunb.com
55. http://databib.org/
56. http://datacite.org/
57. http://data.reegle.info/
58. http://data.wien.gv.at/
59. http://data.gov.bc.ca
60. https://pslcdatashop.web.cmu.edu/ (interaction data in learning environments)

61. http://www.icpsr.umich.edu/icpsrweb/CPES/ - Collaborative Psychiatric
Epidemiology Surveys (a collection of three national surveys focused on each of the
major ethnic groups to study psychiatric illnesses and health services use)
62. http://www.dati.gov.it
63. http://dati.trentino.it
64. http://www.databagg.com/
65. http://networkrepository.com
66. Home

- Network/ML data repository w/ visual interactive

(United Nations Environment Programme Grid Geneva, a lot of GIS datasets)

Bret Taylor, CEO of Quip. Ex-CTO of Facebook, co-founder FriendFeed, co-creator Google Maps.
Written Apr 5, 2011

I did a blog post about open data a long time ago (http://bret.appspot.com/entry/we... ),
and ReadWriteWeb did a nice roundup based on all the comments from the blog post:
http://www.readwriteweb.com/arch... .
Since that post, there have been a lot more comments on the blog (105 and counting), so
you may want to comb the comments for any ones the RWW post missed.
Alex K. Chen, ethereal gwernophile, aspires towards timeless, contextindependence existence

Updated Apr 20, 2015 Upvoted by Mark Meloon, US Head of Data Science at Impetus and
Ankit Sharma, Data Scientist at DataRPM

A database of open databases? (also see most-upvoted questions on the Open Data
Stack Exchange at Highest Voted Questions )
(large collection from Coursera's Data Analysis course)

Where is it possible to find raw climate data? (also NCAR - Climate Data Guide )
| Ecological Data Wiki
PhysioNet - largest repository of free, open-access databases and open-source
computational tools devoted to complex signals informatics
Page on sdss.org - SDSS Astronomy datasets. For more on astronomy, see What are
some astronomy datasets open to the public?

- Berkeley Earth dataset

http://static.reddit.com/RedditS... - massive survey of Redditors and their
preferences - see http://blog.reddit.com/2011/09/w... for some analysis
Welcome to the CRCNS data sharing website - for neuroscience

http://archiveteam.org/index.php... - Old archives of websites that no longer
exist. Includes data on the affinities of 60,000+ Reddit users
http://www.r-bloggers.com/datase... - Datasets to practice your data mining, discussed at http://www.reddit.com/r/MachineL...

- USDA Economic Research Service datasets

- human mortality datasets


- FDA pesticide datasets


- USDA pesticide datasets

Climatology: What are some historical weather databases?


- EPA data


- NASA GISS data


- James Watson's DNA sequence

http://evidence.personalgenomes.... - public genomes of people enrolled in the
Personal Genome Project - includes genomes of Steven Pinker and Esther Dyson.
See http://evidence.personalgenomes.... for their genomes
http://voteview.org/downloads.asp - Congressional Voting datasets (probably
contains *everything* about what any politician voted for)

- General Social Survey. For tutorial, see

http://www.cfa.harvard.edu/hitran/ - high-resolution transmission molecular
absorption database. HITRAN on the web: http://hitran.iao.ru/molecule
http://sarahsinbox.com/ - Sarah Palin emails - analyzed by Edwin Chen using
Latent Dirichlet Allocation (LDA) - see http://blog.echen.me/2011/06/27/topic-modeling-the-sarah-palin-emails/

Some others:

- National Health and Nutrition Examination Survey

- NLSY data (sociology) [1]

- election datasets (only 1984-1990 though)

[1] The NLSY79 Geocode data can only be made available to users who have successfully
completed a geocode application and signed a confidentiality agreement with the U.S.
Bureau of Labor Statistics. If interested in gaining access to the NLSY79 Geocode data,
please review the information at http://stats.bls.gov/nls/nlsgeo7... .
Jeff Hammerbacher, Professor at Hammer Lab, founder at Cloudera, investor at

Updated Jan 15 Upvoted by Mark Meloon, US Head of Data Science at Impetus and Ankit
Sharma, Data Scientist at DataRPM

I'll try to restrict my answers to datasets greater than 1 GB in size, and order my answers
by the size of the dataset.
The 1000 Genomes project makes 260 TB of human genome data available [13].
The Internet Archive is making an 80 TB web crawl available for research [17].

The TREC conference made the ClueWeb09 [3] dataset available a few years back.
You'll have to sign an agreement and pay a nontrivial fee (up to $610) to cover the
sneakernet data transfer. The data is about 5 TB compressed.
ClueWeb12 [21] is now available, as are the Freebase annotations, FACC1 [22].
CNetS at Indiana University makes a 2.5 TB click dataset available [19].
ICWSM made a large corpus of blog posts available for their 2011 conference [2].
You'll have to register (an actual form, not an online form), but it's free. It's about 2.1
TB compressed.
The Yahoo News Feed dataset is 1.5 TB compressed, 13.5 TB uncompressed.
The Proteome Commons makes several large datasets available. The largest, the
Personal Genome Project [11], is 1.1 TB in size. There are several others over 100 GB in
size.
The Reference Energy Disaggregation Data Set [12] has data on home energy
use; it's about 500 GB compressed.
The TinyImages dataset [10] has 227 GB of image data and 57 GB of metadata.
The ImageNet dataset [18] is pretty big.
The MOBIO dataset [14] is about 135 GB of video and audio data
The Yahoo! Webscope program [7] makes several 1 GB+ datasets available to
academic researchers, including an 83 GB data set of Flickr image features and the
dataset used for the 2011 KDD Cup [9], from Yahoo! Music, which is a bit over 1 GB.
Google made a dataset mapping words to Wikipedia URLs (i.e., concepts) [15]. The
dataset is about 10 GB compressed.
Yandex has recently made a very large web search click dataset available [1]. You'll
have to register online for the contest to download. It's about 5.6 GB compressed.
Freebase makes regular data dumps available [5]. The largest is their Quad dump
[4], which is about 3.6 GB compressed.
The Open American National Corpus [8] is about 4.8 GB uncompressed.
Wikipedia made a dataset containing information about edits available for a recent
Kaggle competition [6]. The training dataset is about 2.0 GB uncompressed.
The Research and Innovative Technology Administration (RITA) has made
available a dataset about the on-time performance of domestic flights operated by
large carriers. The ASA compressed this dataset and makes it available for download [16].
The wikilinks data made available by Google is about 1.75 GB total [20].
[1] http://imat-relpred.yandex.ru/en...
[2] http://www.icwsm.org/2011/data.php
[3] http://lemurproject.org/clueweb0...
[4] http://wiki.freebase.com/wiki/Da...
[5] http://download.freebase.com/dat...
[6] http://www.kaggle.com/c/wikichal...
[7] http://webscope.sandbox.yahoo.co...
[8] http://americannationalcorpus.or...
[9] http://kddcup.yahoo.com/datasets...
[10] http://horatio.cs.nyu.edu/mit/ti...
[11] https://proteomecommons.org/data...
[12] http://redd.csail.mit.edu/
[13] http://www.1000genomes.org/ftpse...
[14] https://www.idiap.ch/dataset/mobio
[15] http://www-nlp.stanford.edu/pubs...
[16] http://stat-computing.org/dataex...
[17] http://blog.archive.org/2012/10/...
[18] http://www.image-net.org/index
[19] http://cnets.indiana.edu/groups/...
[20] wiki-links - Wikipedia Links Data - Google Project Hosting
[21] The ClueWeb12 Dataset
[22] ClueWeb12 Related Data:
Felipe Hoffa, Google software engineer / Developer Advocate

Written Feb 19, 2015

Google BigQuery is an awesome place to share open datasets: Once data is loaded in
BigQuery, you can make it public - allowing others to instantly analyze it using just SQL.
See a list of some of the amazing datasets shared on BigQuery: http://www.reddit.com/r/b
Among those datasets I'd like to highlight GDELT: More than a quarter billion rows
(growing every day) of every event happening around the world. I made a video about it:

Shimonee Shah, Quorious, Eccentric, Free spirited

Updated Jul 7, 2014

Here is a useful link.

Finding Data on the Internet
By RevoJoe on October 6, 2011
The following list of data sources has been modified as of 8/19/13. Most of the data sets
listed below are free, however, some are not.
If an (R) appears after a source, this means that the data are already in R format or there
exist R commands for directly importing the data into R. (See examples :: intro for some
code.) Otherwise, I have limited the list to data sources for which there is a reasonably
simple process for importing csv files. What follows is a list of data sources organized into
categories that are not mutually exclusive but which reflect what's out there.
American Economic Ass. (AEA): AEAweb: RFE
UMD:: Inforum - EconData
World bank: Indicators | Data
CBOE Futures Exchange: CFE | Market Data
Google Finance: Stock market quotes, news, currency conversions & more (R)
Google Trends: Google Trends - Web Search interest - Worldwide, 2004 - present
St Louis Fed: Federal Reserve Economic Data


NASDAQ: NASDAQ - Datastore

OANDA: Forex Trading | Trade Currency Online | Forex Broker | OANDA (R)
Quandl: Find, Use and Share Numerical Data
Yahoo Finance: Yahoo Finance - Business Finance, Stock Market, Quotes, News
Archived national government statistics: Web Archiving Services for Libraries and
Australia: 3301.0 - Births, Australia, 2009
Canada: Home | data.gc.ca
DataMarket: DataMarket - Find, Understand and Share Data - DataMarket
Fed Stats: FedStats: Subjects A to Z
Guardian world governments: Page on guardian.co.uk
London, U.K. data: Catalogue | London DataStore
New Zealand: http://www.stats.govt.nz/tools_and_services/tools/
NYC data: NYC Open Data
OECD: Page on oecd.org
RITA: RITA | BTS
San Francisco Data sets: Data | San Francisco


U.K. Government Data: Data Search | data.gov.uk

United Nations: UNdata
U.S. Federal Government Agencies: Federal Agency Participation - Data.gov
US CDC Public Health datasets: Public-Use Data Files and Documentation
The World Bank: World Development Report
UK 2011 Census Open Atlas Project: Page on alex-singleton.com
Gapminder: Data
Airlines Data (2009 ASA Challenge): The data. Data expo 09. ASA Statistics Computing and Graphics
Airports and their locations: Airports and Their Locations
AppliedPredictiveModeling (R package): Page on bit.ly
Australian Weather: Daily Weather Observations
Causality Workbench: Data - Repository - Causality Workbench
Edge data for US domestic flights 1990 to 2009: US Domestic Flights From 1990 to
GroupLens Research (movie ratings and more): Datasets
Kaggle competition data: Go from Big Data to Big Analytics
KDNuggets competition site: Datasets for Data Mining and Data Science
The Koblenz Network Collection: The Koblenz Network Collection
Machine Learning Data Set Repository: mldata :: Welcome
Medicare Data File: Page on cms.gov
Microsoft Research: Our research - Microsoft Research
Million songs: The Million Song Dataset: Giving Back to Music Research
RDataMining.com: R and Data Mining

R and Data Mining ebook data:Data -

The Revolution Analytics Collection: Index of /datasets/

Social Networking: Ancestry.com Forum Dataset
UCI Machine Learning Repository: UCI Machine Learning Repository
53.5 billion clicks: Center for Complex Networks and Systems Research
Data360: Data360 Homepage
Page on datamob.org : Page on datamob.org
Factual: Page on factual.com
Freebase: Freebase
Google: Google Public Data Explorer
infochimps: Big Data - Cloud Services
numbrary: Page on numbrary.com
Sample R data sets: The R Datasets Package


SourceForge Research Data: Data

UFO Reports: National UFO Reporting Center Web Reports
Wikileaks 911 pager intercepts: 9/11 Pager data
Resources for AP Statistics, Intro to Statistics, and R | STATS4STEM.ORG: R data sets: Statistical Data Sets, Statistics Data Sets, Data Sets For Statistics, R Datasets
The Washington Post List: Post Databases (washingtonpost.com)
Agricultural Experiments: agridat {agridat}


Climate data: Temperature data (HadCRUT4)


Gene Expression Omnibus: Home - GEO - NCBI

Geo Spatial Data: Data | GeoDa Center
Human Microbiome Project: Microbial Reference Genomes
MIT Cancer Genomics Data: Page on broadinstitute.org
NASA: Obtaining Data From the NSSDC
NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/D...
Protein structure: PSP benchmark
Public Gene Data: Browse literature or sequence neighbours
Stanford Microarray Data: Page on stanford.edu


General Social Survey: General Social Survey
ICPSR: Page on umich.edu
SNAP: Stanford Large Network Dataset Collection
UCLA Social Sciences Archive: Data Portals
UPJOHN INST: Employment Research Data Center
Time Series data Library: Time Series Data Library
Carnegie Mellon University Enron email: Enron Email Dataset
Carnegie Mellon University StatLib: StatLib---Datasets Archive
Carnegie Mellon University JASA data archive: StatLib---JASA Data Archive
Ohio State University Financial data: Financial Data Finder
UC Berkeley: UC DATA :HOME
UCLA: SOCR Data - Socr
UC Riverside Time Series: Welcome to the UCR Time Series Classification/Clustering
University of Toronto: Delve Datasets
Alex Kamil
Updated Sep 28, 2013

1000Genomes project: http://www.1000genomes.org/data#...

Internet Movie Database (IMDb) data: http://www.imdb.com/interfaces
Twitter (product) feed scrapes (some are free): http://blog.infochimps.com/2008/...
(thanks to Joseph Misiti)
What are some free, public data sets?
What data APIs or sources should be in my O'Reilly guide?
Are there any free large datasets in the format of an Apache access log?
30TB of web crawl data: http://www.commoncrawl.org/data/
Images database: http://sipi.usc.edu/database/dat...
Datasets released by Google
Nitin Madnani, Computer Scientist, NLPer & Dataviz Nerd

Written Oct 4, 2011

Here are some big corpora we use in NLP in addition to the ones already mentioned:
ukWaC: a 2 billion word corpus constructed from the Web limiting the crawl to the .uk
domain and using medium-frequency words from the BNC as seeds. The corpus was
POS-tagged and lemmatized with the TreeTagger. There's also a parsed version called
pukWaC. Get both at: http://wacky.sslmit.unibo.it/dok...
WaCkypedia: a 2009 dump of the English Wikipedia (about 800 million tokens),
including part of speech/lemma information, as well as a full syntactic parse. The texts
were extracted from the dump and cleaned using the Wikipedia extractor. Get it at the
same URL as ukWaC: http://wacky.sslmit.unibo.it/dok...
USENET corpus: A collection of public USENET postings. This corpus was collected
between Oct 2005 and Jan 2011, and covers 47860 English language, non-binary-file
news groups. Get it at: http://www.psych.ualberta.ca/~we... [CAVEAT: it's huge!]
The collection of data that comes with the Natural Language Toolkit (NLTK). It's
probably not as large as the others but it's a good set. See descriptions at:
Europarl: Proceedings of the European Parliament in 13 languages. Cleaned and
preprocessed for machine translation research. Get it at: http://www.statmt.org/europarl
[FYI, NLTK has a built-in interface to access this corpus.]
The Google Books Ngram corpus: Pretty big. Get it at: http://books.google.com/n
Mukesh Chapagain, Programmer, Blogger, Engineer, Spiritual Seeker

Written Nov 16, 2015

Yelp provides data and reviews of the 250 closest businesses for 30 universities for
students and academics to explore and research. I downloaded the Yelp Academic
Dataset in early 2015, and it contained a total of 330,071 reviews written by 130,873
users about 13,481 businesses.
The dataset is a single gzip-compressed file, composed of one json-object per line. Every
object contains a 'type' field, which tells you whether it is a business, a user, or a review.

Business objects contain basic information about local businesses.

{
  'type': 'business',
  'business_id': (a unique identifier for this business),
  'name': (the full business name),
  'neighborhoods': (a list of neighborhood names, might be empty),
  'full_address': (localized address),
  'city': (city),
  'state': (state),
  'latitude': (latitude),
  'longitude': (longitude),
  'stars': (star rating, rounded to half-stars),
  'review_count': (review count),
  'photo_url': (photo url),
  'categories': [(localized category names)],
  'open': (is the business still open for business?),
  'schools': (nearby universities),
  'url': (yelp url)
}
Review objects contain the review text, the star rating, and information on votes Yelp users
have cast on the review.
{
  'type': 'review',
  'business_id': (the identifier of the reviewed business),
  'user_id': (the identifier of the authoring user),
  'stars': (star rating, integer 1-5),
  'text': (review text),
  'date': (date, formatted like '2011-04-19'),
  'votes': {
    'useful': (count of useful votes),
    'funny': (count of funny votes),
    'cool': (count of cool votes)
  }
}
User objects contain aggregate information about a single user across all of Yelp (including
businesses and reviews not in the dataset).
{
  'type': 'user',
  'user_id': (unique user identifier),
  'name': (first name, last initial, like 'Matt J.'),
  'review_count': (review count),
  'average_stars': (floating point average, like 4.31),
  'votes': {
    'useful': (count of useful votes across all reviews),
    'funny': (count of funny votes across all reviews),
    'cool': (count of cool votes across all reviews)
  }
}
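Since the dataset is one JSON object per line inside a gzip file, it can be streamed without loading everything into memory. Here is a minimal Python sketch of that idea, assuming the field names listed above ('type', 'business_id', 'stars'). The function names, the sample file, and the made-up rows in write_sample are my own illustration and are not part of the Yelp download.

```python
import gzip
import json
import os
import tempfile
from collections import Counter


def iter_records(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_type(path):
    """Count how many objects of each 'type' the file contains."""
    return Counter(rec.get("type", "unknown") for rec in iter_records(path))


def mean_review_stars(path):
    """Average the 'stars' field of review objects, grouped by business_id."""
    totals = {}
    for rec in iter_records(path):
        if rec.get("type") == "review":
            total, n = totals.get(rec["business_id"], (0, 0))
            totals[rec["business_id"]] = (total + rec["stars"], n + 1)
    return {biz: total / n for biz, (total, n) in totals.items()}


def write_sample(path):
    """Write a tiny made-up file in the same layout (hypothetical data)."""
    rows = [
        {"type": "business", "business_id": "b1", "name": "Example Cafe", "stars": 4.5},
        {"type": "user", "user_id": "u1", "name": "Matt J.", "review_count": 2},
        {"type": "review", "business_id": "b1", "user_id": "u1", "stars": 5, "text": "Great"},
        {"type": "review", "business_id": "b1", "user_id": "u1", "stars": 2, "text": "Meh"},
    ]
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


if __name__ == "__main__":
    sample = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
    write_sample(sample)
    print(count_by_type(sample))
    print(mean_review_stars(sample))
```

To run it on the real download, pass the path of the gzip file to count_by_type instead of the sample path. Reading line by line keeps memory use flat even if the file is large.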
Yelp also holds a Yelp Dataset Challenge where over $35,000 in cash prizes are awarded.

For the dataset challenge, Yelp provides a larger dataset than the Academic Dataset
mentioned above. At present (when this answer is written), the Challenge Dataset
includes information about local businesses in 10 cities across 4 countries.
The Challenge Dataset contains:
1.6M reviews and 500K tips by 366K users for 61K businesses
481K business attributes, e.g., hours, parking availability, ambience.
Social network of 366K users for a total of 2.9M social edges.
Aggregated check-ins over time for each of the 61K businesses
Wim Van Leuven, co-organizer of BigData.be, co-founder at BigBoards.io

Written Dec 2, 2014 Upvoted by William Chen, Data Scientist at Quora and Jerrod Lowmaster,
LinkedIn Data Scientist

Recently I came across CERN's open data initiative. Having talked to a few guys that have
worked there, I'm pretty sure these guys currently gather one of the largest datasets in the
world! Have a look at CERN Open Data Portal
Hope this helps!
Gregory Piatetsky, KDnuggets Editor. Analytics/Data Mining Consultant. KDD and SIGKDD co-founder...
Written Sep 9, 2013

Here is KDnuggets' large and comprehensive list of Government, State, City, Local, and Public datasets

Atakan Cetinsoy, SaaS Product Strategy | Data Science | Lean Startup Advisory |
Go-to-Market Plan
Written Sep 27, 2015

Since we get asked this question by our Machine Learning oriented users very frequently,
my company (BigML) has compiled a list with over 250 sources here:
List of Public Data Sources Fit for Machine Learning
You may also want to check out the related blog post for some more context:
Data, Data, Data: Thousands of Public Data Sources
Erik Hille, Economist-SMU Alpinist Actuary Biologist-Caltech Father Dreamer de la Mancha
Written Aug 14

Large data sets, mostly from finance and economics, that could also be applicable in related
fields studying the human condition:
World Bank Data. Lots of years. Lots of countries (Countries | Data). Lots of data
variables (Topics | Data - Indicators | Data - Catalog), years and countries.
Your Window Into U.S. Federal Statistics
FRB: Data Releases
Federal Reserve Economic Data
Our government also likes to stay globally informed and is willing to share some of that
data: CIA -The World Factbook
Human Development Reports - United Nations Development Programme - Public Data

Consumer Price Index

Unveiling the beauty of statistics for a fact based world view. (http://www.gapminder.org/)
Data Plotter
Possibly also look at the Human Capital Report 2015, which has rankings on a human capital
index and various measures of education and productivity capabilities.
International Trade
Data: Aggregate trade (current value), bilateral trade with main trading partners
(current value), and major commodity exports by main exporting countries. No
data on trade as share of GDP is readily available.
Geographical coverage: Countries around the world
Time span: Long time series with annual observations from 19th century up to
today (2010)
Available at: The books are published in three volumes covering more than 5000
pages. At some universities you can access the online version of the books
where data tables can be downloaded as ePDFs and Excel files. The online access
is here.

Data: Real and PPP-adjusted GDP in US millions of dollars, national accounts
(household consumption, investment, government consumption, exports and
imports), exchange rates and population figures.
Geographical coverage: Countries around the world
Time span: from 1950-2011 (version 8.1)
Available at: Online here

Data: Total national trade and bilateral trade flows between states. Total imports
and exports of each country in current US millions of dollars and bilateral flows in
current US millions of dollars
Geographical coverage: Single countries around the world
Time span: from 1870-2009
Available at: Online at www.correlatesofwar.org

Data: Trade (% of GDP) and many more specific series: trade in merchandise,
trade in services, trade in high-technology, trade in ICT goods, trade in ICT services
always exports and imports separately. Also export and import value index and
volume index.
Geographical coverage: Countries and world regions
Time span: Annual since 1960
Available at: Online at http://data.worldbank.org

Data: Bilateral trade flows by commodity
Geographical coverage: Countries around the world
Time span: 1962-2013
Available at: Online here

Data: Many different measures, including trade by volumes and value
Geographical coverage: Countries around the world
Time span: For some series, data is available since 1948, mostly annual,
sometimes quarterly.
Available at: Online here

Data: Trade flows (also by commodity)
Geographical coverage: Europe (EU and EFTA)
Time span: Mostly since 1988
Available at: Online here
information on international trade in goods and services.

Data: Many series on tariffs and trade flows
Geographical coverage: Countries around the world
Time span: Since 1948 for some series
Available at: Online here

Data: Many different data sets related to international trade, including trade flows
by commodity, geographical variables, and variables to estimate gravity models
Geographical coverage: Countries around the world
Time span: Some series go back to the 1990s.
Available at: Online here

Data: Export and import values and volumes by commodity

Geographical coverage: Single countries
Time span: 1962-2000
Available at: Online here

Data on UK bilateral trade for the time 1870-1913 was collected by David S.
Jacks. It is downloadable in Excel format here.
For the time 1870-1913, 21,000 bilateral trade observations can be found in
Mitchener and Weidenmier (2008), Trade and empire, available in the Economic
Journal here.
Data on UK, Germany, France, and US between mid-19th to 20th Century can
be found here.
Data on Developing Country Export in 1840, 1860, 1880 and 1900 by John
Hanson is available here.
Data on trade between England and Africa during the period 1699-1808 is
available on the Dutch Data Archiving and Networked Services. It was compiled
by Marion Johnson.
Applying these same sources to education quality in developing countries:
Education Index: multiple sheets of Excel data are available at Human
Development Reports, or you can use their tool to explore the data (Human
Development Reports). Google also offers a way to explore the data (Google Public Data
Explorer). Additional indexes in this HD report that you might be interested in are:
Human Development Index, Adult Literacy Index and Gross enrollment ratio.
The World Bank has literacy rates (Adult literacy rate, population 15+ years, both
sexes (%)) in addition to lots of other data: World Bank Data. Lots of years. Lots of
countries (Countries | Data). Lots of data variables (Topics | Data - Indicators |
Data - Catalog | The World Bank).
Our government also likes to stay informed and is willing to share some of that data:
CIA -The World Factbook
Possibly also look at the Human Capital Report 2015, which has rankings on a human
capital index and various measures of education and productivity capabilities.
Unveiling the beauty of statistics for a fact based world view. (http://www.gapminder.org/)
Data Plotter

- has Average Test Scores

Penn World Tables Data: Real and PPP-adjusted GDP in US millions of dollars,
national accounts (household consumption, investment, government consumption,
exports and imports), exchange rates and population figures. Feenstra, Robert C.,
Alf Fyhrlund, http://swedeneurostat.blogspot.se/ http://stataccess.blogspot.se/

Written Mar 22, 2013

Since January 1997, Statistics Sweden has had databases available on the Internet. The aim
is to provide increased access to statistics and allow users to easily download information
to their own computers.
Statistical database
The Statistical database contains a large amount of official statistics that Statistics Sweden
is responsible for. Also included are official statistics from other statistical authorities. The
database contains a number of tables where selected information can be presented on the
screen, in print or transmitted to the user's computer for further processing.
The search process can be made in three ways:
via the link NYA SIFFROR Välj från senast uppdaterade tabeller (only in the
Swedish version of the website). Nya siffror shows the latest updated tables in the
Statistical database.
via the subject areas
or via Search the Statistical database.
The Statistical database is available free-of-charge. When making minor retrievals of less
than 10000 table cells, registration is not necessary. For larger retrievals and some future
supplementary services, registration is done by completing the registration form.
The database capacity is limited when it comes to large retrievals. In order to best serve
users of very large retrievals, ready-made statistics files in PC-Axis format have been
created, mainly for regionally distributed material.

PC-Axis is software that handles very large statistical tables. PC-Axis can be used for
processing ready-made statistics files or PC-Axis files from the database. The program can
also pass on the statistics to other programs such as spreadsheets, etc. PC-Axis can be
downloaded free-of-charge from this website.
Services in connection with the Statistical databases
Tailor-made retrievals can be ordered for delivery on diskette or CD-ROM. The price
depends on the production cost.
Micro databases are available after a harm test of de-identified (anonymised) data is done
at Statistics Sweden. More information on registers is available in Documentation of
statistics (only in the Swedish version of the website).
Courses are held regularly (in Swedish) as an aid for those who want to use the Statistical
database. For more information on contents, times and prices of courses, check the
Swedish version of the website Kurser .
Postal address: Box 24300, SE-10451 Stockholm, Sweden
Telefax: +46-8-506 948 99
Telephone: +46-8-506 948 01
Robert Morton, Data Nerd at Tableau Software (Senior Software Engineer)

Written Oct 23, 2011

The Bureau of Transportation Statistics (bts.gov) has tremendous amounts of data on
airline on-time / delays, airfares, fuel costs, etc. Most are very wide and several data sets
range from 100M - 300M rows. Here's an index of their best data sets:
Anton Tarasenko
Updated Dec 5, 2014


You can use the Custom Google Search for datasets:

230 sources and meta-sources of datasets, including all mentioned in this question. Please
feel free to exclude .gov and any other websites from results by adding "-.gov" or "-site.com" to the search line. Other Google Search Operators work.
Don't hesitate to contact me if you have ideas what websites to add.

The following service puts in order more than 1,000,000 public datasets:
Alan Morrison, Researches the topic for publications

Written Feb 16

Reposting from Alan Morrison's answer to Where on the web can I find free samples of Big
Data to analyze?
This link list, available on Github, is quite long and thorough: caesar0301/awesome-public-datasets
You will see many census data sources listed. Then the challenge becomes how
to get to what you really want and can use.
Note that this list also references a Quora answer that also includes a long list: Where can I
find large datasets open to the public?
For your convenience, I've copied the list of lists as it stood in January 2015 here, but won't
be updating it:
This list of public data sources is collected and tidied from blogs, answers, and user
responses. Most of the data sets listed below are free, however, some are not. Other
amazingly awesome lists can be found in the awesome-awesomeness and another
awesome list.
U.S. Department of Agriculture's PLANTS Database
1000 Genomes
Collaborative Research in Computational Neuroscience (CRCNS)
Gene Expression Omnibus (GEO)
Human Microbiome Project (HMP)
ICOS PSP Benchmark
MIT Cancer Genomics Data
NIH Microarray data (FTP)
Protein Data Bank
PubChem Project
PubGene (now Coremine Medical)
Stanford Microarray Data
The Personal Genome Project or PGP
UCSC Public Data

Australian Weather
Canadian Meteorological Centre
Climate Data from UEA (updated monthly)
Global Climate Data Since 1929
NOAA Bering Sea Climate
NOAA Climate Datasets
NOAA Realtime Weather Models
WU Historical Weather Worldwide
Complex Networks
CrossRef DOI URLs
DBLP Citation dataset
NBER Patent Citations
NIST complex networks data collection
Protein-protein interaction network
PyPI and Maven Dependency Network
Scopus Citation Database
Stanford GraphBase (Steven Skiena)
Stanford Large Network Dataset Collection
The Koblenz Network Collection
The Laboratory for Web Algorithmics (UNIMI)
UCI Network Data Repository
UFL sparse matrix collection
WSU Graph Database
Computer Networks
3.5B Web Pages from CommonCrawl 2012
53.5B Web clicks of 100K users in Indiana Univ.
CAIDA Internet Datasets
ClueWeb09 - 1B web pages
ClueWeb12 - 733M web pages
CommonCrawl Web Data over 7 years
CRAWDAD Wireless datasets from Dartmouth Univ.
Open Mobile Data by MobiPerf
UCSD Network Telescope, IPv4 /8 net
Data Challenges
Challenges in Machine Learning
DrivenData Competitions for Social Good
ICWSM Data Challenge (since 2009)
Kaggle Competition Data
KDD Cup by Tencent 2012
Localytics Data Visualization Challenge
Netflix Prize
Yelp Dataset Challenge
American Economic Association (AEA)
EconData from UMD
Internet Product Code Database
CBOE Futures Exchange
Google Finance
Google Trends
OSU Financial data
St Louis Federal
Yahoo Finance
BODC - marine data of ~22K vars
EOSDIS - NASA's earth observing system data
Factual Global Location Data
Global Administrative Areas Database (GADM)
Geo Spatial Data from ASU
GeoNames Worldwide
Natural Earth - vectors and rasters of the world
Open Street Map (OSM)
TIGER/Line - U.S. boundaries and roads
TwoFishes - Foursquare's coarse geocoder
TZ Timezones shapefiles
Australia (abs.gov.au)
Australia (data.gov.au)
Glasgow, Scotland, UK
Guardian world governments
London Datastore, UK
MassGIS, Massachusetts, U.S.
New Zealand
NYC betanyc
NYC Open Data
Open Government Data (OGD) Platform India
San Francisco Data sets
South Africa
The World Bank
U.K. Government Data
U.S. American Community Survey
U.S. CDC Public Health datasets
U.S. Census Bureau
U.S. Department of Housing and Urban Development (HUD)
U.S. Federal Government Agencies
U.S. Federal Government Data Catalog
U.S. Food and Drug Administration (FDA)
U.S. Open Government
UK 2011 Census Open Atlas Project
United Nations
EHDP Large Health Data Sets
Gapminder World, demographic databases
Medicare Coverage Database (MCD), U.S.
Medicare Data Engine of medicare.gov Data
Medicare Data File
Image Processing
2GB of Photos of Cats
Face Recognition Benchmark
ImageNet - an image database in WordNet hierarchy
Machine Learning
Delve Datasets for classification and regression (Univ. of Toronto)
Discogs Monthly Data
eBay Online Auctions (2012)
IMDb Database
Keel Repository for classification, regression and time series
Lending Club Loan Data
Machine Learning Data Set Repository
Million Song Dataset
More Song Datasets
MovieLens Data Sets
RDataMining - "R and Data Mining" ebook data
Registered Meteorites on Earth
Restaurants Health Score Data in San Francisco
UCI Machine Learning Repository
Yahoo! Ratings and Classification Data
Cooper-Hewitt's Collection Database
Minneapolis Institute of Arts metadata
Tate Collection metadata
The Getty vocabularies
Natural Language
ClueWeb09 FACC
ClueWeb12 FACC
DBpedia - 4.58M things with 583M facts
Flickr Personal Taxonomies
Google Books Ngrams (2.2TB)
Google Web 5gram (1TB, 2006)
Gutenberg eBooks List
Hansards text chunks of Canadian Parliament
Machine Translation of European languages
SMS Spam Collection in English
USENET postings corpus of 2005~2011
Wikidata - Wikipedia databases
Wikipedia Links data - 40 Million Entities in Context
WordNet databases and tools
CERN Open Data Portal
NSSDC (NASA) data of 550 spacecraft
Public Domains
Archive.org Datasets
CMU JASA data archive
CMU StatLab collections
KDNuggets Data Collections
Reddit Datasets
RevolutionAnalytics Collection
Sample R data sets
Stats4Stem R data sets
The Washington Post List
UCLA SOCR data collection
UFO Reports
Wikileaks 911 pager intercepts
Yahoo Webscope
Search Engines
Academic Torrents of data sharing from UMB
Archive-it from Internet Archive
DataMarket (Qlik)
Freebase.com of people, places, and things
Harvard Dataverse Network of scientific data
Statista.com - statistics and Studies
Social Sciences
Ancestry.com Forum Dataset over 10 years
CMU Enron Email of 150 users
Facebook Data Scrape (2005)
Facebook Social Networks from LAW (since 2007)
Foursquare Social Network in 2010, 2011
Foursquare from UMN/Sarwat (2013)
General Social Survey (GSS) since 1972
GetGlue - users rating TV shows
GitHub Collaboration Archive
Mobile Social Networks from UMASS
PewResearch Internet Survey Project
SourceForge.net Research Data
StackExchange Data Explorer
Titanic Survival Data Set
Twitter Graph of entire Twitter site
UCB's Archive of Social Science Data (D-Lab)
UCLA Social Sciences Data Archive
UNIMI/LAW Social Network Datasets
Universities Worldwide
UPJOHN for Labor Employment Research
Yahoo! Graph and Social Data
Youtube Video Social Graph in 2007,2008
Betfair Historical Exchange Data
Cricsheet Matches (cricket)
Ergast Formula 1, from 1950 up to date (API)
Football/Soccer resources (data and APIs)
Lahman's Baseball Database
Retrosheet Baseball Statistics
Time Series
Time Series Data Library (TSDL) from MU
UC Riverside Time Series Dataset
Airlines OD Data 1987-2008
Bike Share Systems (BSS) collection
Hubway Million Rides in MA
Marine Traffic - ship tracks, port calls and more
NYC Taxi Trip Data 2013 (FOIA/FOILed)
OpenFlights - airport, airline and route data
RITA Airline On-Time Performance data
RITA/BTS transport data collection (TranStat)
Transport for London (TFL)
Travel Tracker Survey (TTS) for Chicago
U.S. Bureau of Transportation Statistics (BTS)
U.S. Domestic Flights 1990 to 2009
U.S. Freight Analysis Framework since 2007
Complementary Collections
DataWrangling: Some Datasets Available on the Web
Inside-r: Finding Data on the Internet
Quora: Where can I find large datasets open to the public?
like being punched in the brain! : 100+ Interesting Data Sets for Statistics
StaTrek: Leveraging open data to understand urban lives
Source: Xiaming's GitHub caesar0301/awesome-public-datasets, January 2015. Please
go to GitHub for this and other updated lists.
Attila Csordas, Cloudera Certified Hadoop Developer

Written Dec 13, 2014

In mass spectrometry proteomics, the ProteomeXchange consortium has been set up to
provide a coordinated submission of MS proteomics data to the main existing proteomics
repositories, and to encourage optimal data dissemination: Page on
proteomexchange.org. Partner repositories: PRIDE Archive at EMBL-EBI in Hinxton,
Cambridgeshire, UK; PASSEL at ISB in Seattle; MassIVE at UCSD, San Diego. PRIDE
accounts for ~90% of the data, currently in total ~1600 datasets, of which ~50% are
public, ~70 TB. Some individual datasets are in the TB range, mainly due to unprocessed,
binary machine raw files. See also ProteomeXchange Datasets
Orin Hargraves, what do I wait for?

Written Jun 16, 2011

Two fully annotated corpora, put together for use by researchers and lexicographers, are:
The BNC (British National Corpus) http://www.natcorp.ox.ac.uk/
COCA (Corpus of Contemporary American English)
The BNC is a little dated now. COCA is excellent, though its user interface is a little clunky
at times.
If you have legitimate, nonprofit research concerns, you may be able to get access to the
granddaddy of them all, the Oxford English Corpus. For commercial use there is fee-based access:
Shehroz Khan, I have to play with data to tame it

Written Feb 1

Always try this infallible technique. It always works:

Otherwise, you may like to see these
IBM Knowledge Center
Datasets for education and for fun
Science Hack Day / Datasets
Science On a Sphere
Abdelbarre Chak, Big Data

Written Jul 29

Here is a list of open datasets:

The World Bank DataBank
A Deep Catalog of Human Genetic Variation
City of Chicago | Data Portal (Size: 9.5GB)
Google Ngram Viewer
Open Government
Education - Data.gov
School of Geographical Sciences & Urban Planning

Hope it's helpful.

Thia Kai Xin, Data scientist at Lazada, Co-Founder of DataScience SG.

Written Apr 7

My favorites are:
Awesome Public Datasets
100+ Interesting Data Sets for Statistics
7 Datasets You've Likely Never Seen Before
Another collection of free and open-source datasets
Ferris Jumah, Data and Products

Written Jan 23, 2014

The best source of structured data I've seen so far is the UCI Machine Learning Repository:
Data Sets
This question has extensive resources for data sets open to the public: Where can I find
large datasets open to the public?
Chris Metcalf, Director of Platform / Developer Evangelist at Socrata

Written Jan 8, 2011

Socrata hosts open data websites for a number of governments, government agencies, and
non-profits including:

There are also over 100K datasets available on our public data portal,
Ian Johnson, Data Alchemist at Lever (http://lever.co)

Written Jan 23, 2014

I've been begging for this. Seriously, someone take my money!

One startup I'm excited about is Enigma.io.
They are curating and nicely formatting open and public data, and providing a slick interface for
searching and exporting as well.
Another good source of serious and well-formatted data is the Bureau of Labor Statistics.
There was a startup called BuzzData that tried to be the GitHub of data for a while, but they
pivoted away :(
Krishnan Srinivasarengan, .
Written Jan 21, 2013

For Non-Intrusive Appliance Load monitoring research, data bases are emerging. While
REDD is one instance (already in another answer), there are a few more of them (not as
Tracebase: tracebase " Welcome
UMass Smart*: Smart - UMass Trace Repository
Ben Hamner
Written Feb 6

Kaggle recently launched Kaggle Datasets. You can download high-quality public
datasets here, run analytics on them through Kaggle Scripts, see others' analyses, and
discuss them in the forums.
Here's a blog post describing this in more depth: Introducing Kaggle Datasets
Written Apr 15, 2014

Since I haven't seen it mentioned yet, and work at one of the main sources of its data:
SMOKA, the Subaru-Mitaka-Okayama-Kiso Archive, holds about 15 TB of astronomical
data from facilities run by the National Astronomical Observatory of Japan. All data
becomes publicly available after an embargo period of 12-24 months (to give the original
observers time to publish their papers).
With over a decade of data from some facilities and instruments, it has now become
possible for many researchers to make discoveries just by looking at archived data for
something other than what the original observers had in mind.
Astrophotographer Robert Gendler has also processed images from the SMOKA archive
to create several "NASA Astronomy Picture of the Day" winners.
Mike Lambert, streetart.cats.gadgets.cameras

Written May 30, 2014

The Global Database of Events, Language, and Tone

"The Global Database of Events, Language, and Tone (GDELT) is an initiative to construct
a catalog of human societal-scale behavior and beliefs across all countries of the world,
connecting every person, organization, location, count, theme, news source, and event
across the planet into a single massive network that captures what's happening around the
world, what its context is and who's involved, and how the world is feeling about it, every
single day."
"The entire GDELT dataset is available for free download, including over a quarter-billion
event records capturing over 300 categories of human society across every corner of the
globe georeferenced to the city level back to 1979 and updated each morning, and the
massive Global Knowledge Graph recording the underlying actors, themes, and
relationships that underlie global society."
Source: The Global Database of Events, Language, and Tone
Alket Cecaj, PhD in location data mining.

Updated Apr 19

If you are looking for mobility data, there is the Telecom Italia Big Data Challenge dataset.
You can find it here: Open Data Institute - node Trento
It's about 120 GB of data, and there are 7 different types of datasets from city life.
Another mobility dataset is Data for Development (D4D), released by Orange, a
French operator. In 2013 they released call detail records (CDRs) for Ivory Coast, and in
2014 CDR data for Senegal.
Info about the challenge can be found here: http://www.d4d.orange.com/en/home
A new challenge organized by the American Statistical Association can be found here: Support
the Data Challenge at JSM 2015
If you want more datasets of any kind, from pollution data to social network data,
check this post: Data sets of any type: some links. by Alket Cecaj on Algorithms
and DataFusion
The post is updated regularly as I find new data sets, such as the Panama Papers dataset.
Christian Pietsch, computational linguist and digital library technologist

Written Sep 27, 2012

If you are interested in research datasets (large and small), these sites let you search for
http://databib.org/ (a collaborative, annotated bibliography of primary research
data repositories)
http://datacite.org/ (supports researchers by helping them to find, identify, and cite
research datasets with confidence)
Mike Kruger
Written May 16, 2014

IRI has a large (130 gigabyte) set of consumer packaged goods marketing data available.
30 categories, 11 years. For information see Academic Data Set - IRI
Gopi Krishnan Nambiar, Software Engineer at Salesforce

Written Jan 21, 2013

A dataset of 13 billion clicks was made available for research on Jan 20, 2013 here:
Center for Complex Networks and Systems Research
Deepshikha Mehta, Program Management

Written Feb 24

This online course on applied machine learning provides you released dataset for
Aspiring Minds presents AMDataBootcamp2016, an online + offline bootcamp on
applying machine learning to real-world problems. Register and grab this unique offering
comprising a MOOC + a data release + a data competition + a one-day workshop. Last
date of submissions is 8th March 2016. To enroll now and dive deep into ML : Aspiring
Minds University | Boot Camp

Yasmin Lucero, statistician/mathematical biologist/data scientist

Written Jan 23, 2014

R has a built-in package called datasets. It contains several structured datasets that are useful
for testing and learning. Type library(help=datasets) to get a list. These are available in
your namespace at all times, but they are lazy-loaded. To use them, just call them by name,
e.g. str(iris).
Yuval Feinstein, Algorithmic Software Engineer in NLP,IR and Machine Learning

Written Feb 16

Please see Bernard Marr's Big Data: 33 Brilliant And Free Data Sources For 2016
Franck Dernoncourt, PhD student in AI @ MIT

Written Dec 26, 2013

See databases of open databases .

Alex Copulsky
Written Feb 10, 2014

If you're interested in the social sciences, a great resource is the University of Michigan's
Inter-university Consortium for Political and Social Research, or ICPSR. It is a great resource for
all sorts of public data sets, as well as the data sets used in many published papers.
Kevin Edward Kline, data and database expert, I know a 'lil bit about Twitter and
social media
Written Mar 5, 2014

I wrote a blog post about this a while back. For large data sets to tinker with, I recommend
that you go to data.gov for large USA data sets or Data Search | data.gov.uk for large UK
data sets. In both cases, you'll find a wide variety of data to play with.
Also, don't forget TCP.ORGtheLeadingTcpSiteontheNet .
Phillip Rhodes, Open Source hacker, founder of Fogbeam Labs

Written Jan 11, 2011

FWIW, there's a subreddit dedicated to cataloging available datasets. It may be of interest
to you:
And on a related note, there is also:
Pete Warden
Written Jan 8, 2011 Upvoted by Bradley Voytek, Former Data Scientist, Uber Inc. and Leo
Polovets, Partner at a data-focused seed fund (Susa Ventures). Worked at Factual.

Here are the ones I've found most useful:

CrunchBase, US Census, Google Public Data, Infochimps, Timetric, Factual, Freebase,
Wikipedia, World Bank, Kaggle
I cover them in more detail in a free ebook here:
Adam Nyhan, Attorney, former Congressional aide, Mainer, entrepreneur

Written Jan 21

In the legal world, the Enron dataset is often considered the best public-access dataset. My
recollection is that it was opened to the public by a federal regulatory agency in the course
of its Enron investigation. There is a massive industry of "litigation support technology"
and "electronic discovery" firms that develops software to mine and analyze enormous
data sets, and the Enron set is often trotted out in marketing demonstrations of these
software products to demonstrate their effectiveness. Thanks to Shimonee Shah for the
link to it:
Enron Email Dataset
Gaurav Bhardwaj
Written Sep 21, 2014

I have been collecting this dataset provided by UIDAI:
Aadhaar (UIDAI), a wonderful dataset provided by the Indian government.
Things I like about this dataset:
It is a great way for beginners like me to explore data science basics using the latest tools like
IPython, pandas, Anaconda, etc.
This dataset is used in Udacity courses (Introduction to Data Science); see
references for videos.
It is real-time data; it updates every other day.
You can use REST API calls to get the data for a particular day, a particular month, or
just the latest data.
It's probably going to be a huge dataset, considering India's population. For more info
regarding download please see:
Sourabh Daptardar
Written Sep 14, 2013

The Indian government offers about 4K datasets collected from about 50 departments for
analysis: http://data.gov.in/ and the list is growing. Not every dataset might be 'big
data' from a computer science perspective, but it is, nevertheless, a good source.
Lorenzo Ruzzene, Web Content & Social Media Specialist

Written Feb 3, 2015

I'd suggest Ookla's Net Index source data (1.5 GB):
"Download the largest publicly available dataset of anonymous broadband speed and
quality test results, with data from every geographic region currently represented in
NetIndex going back to January 2008." Global Broadband
Martin Linkov, UniGraph.rocks

Written Dec 12, 2014

Open Data collection for Greece
CrunchBase Data Exports
Crunchbase people and organizations in .csv
Dror Cohen, CEO & Co-Founder at CodersClan

Written May 11, 2014

The Stack Exchange network has its whole database open for queries, and you can even
download the whole dump yourself. It contains mostly data from Stack Overflow.
Stack Exchange Data Explorer
Mark Posen, Chartered engineer in satellite and radio comms.

Updated Jun 17, 2011

The Shuttle Radar Topography Mission database is a near-global high-resolution digital
topographic database of Earth (i.e. it maps nearly the whole Earth with terrain altitude data at
around 90m resolution). The data may be accessed here:
A large number of different Earth science datasets are available from NASA WIST
(Warehouse Inventory Search Tool):
Hilary Mason, I data and cheeseburgers.

Written Apr 5, 2011 Upvoted by Mark Meloon, US Head of Data Science at Impetus and Ankit
Sharma, Data Scientist at DataRPM

I've been collecting public research-quality datasets here: http://bit.ly/bundles/hma

Feedback and additional datasets are welcome!
Sandeep Vasani, Software Engineer

Updated Jul 6, 2012

The Enron Corpus is a large database of over 600,000 emails generated by 158 employees
of the Enron Corporation. I have used the Enron Email Corpus for training and testing my
email classification algorithm.
Download link [tgz] https://www.cs.cmu.edu/~enron/en...
Ian C. Grieve
Written Apr 7, 2011

Another bioinformatics dataset repository worth mentioning here is EnsEMBL:
http://www.ensembl.org.
EnsEMBL contains genomics information, including annotated DNA and protein
sequences, for a wide range of species.
An API, written in Perl, is provided with documentation (http://www.ensembl.org/info/docs...).
Additionally, the data can be downloaded from the EnsEMBL FTP site.
Stefan de Konink, is chairman of Stichting OpenGeo, a Dutch non-profit targeting

the availabili...
Written May 12, 2014

Research :: NDOV Loket offers historical realtime vehicle information (Automatic
Vehicle Location, Fleet Tracking) for the Netherlands. You can see what it does in real time at
OVradar or Live Openbaar Vervoer.
About 1GB per day is collected in CSV format, which compresses to about 80MB with LZMA.
We now have over a year's worth of data.
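As a rough illustration of why CSV logs like these shrink so much under LZMA, here is a minimal sketch using Python's standard lzma and csv modules. The column names and values are invented for the example and are not the NDOV Loket schema; the compression ratio you get depends entirely on the data.

```python
import csv
import io
import lzma


def make_sample_csv(rows: int = 5000) -> bytes:
    """Build a synthetic vehicle-position CSV (hypothetical columns)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["timestamp", "vehicle_id", "line", "lat", "lon"])
    for i in range(rows):
        writer.writerow([
            f"2014-05-12T08:{(i // 60) % 60:02d}:{i % 60:02d}",
            f"bus-{i % 40}",
            f"line-{i % 7}",
            f"{52.0 + (i % 100) / 1000:.3f}",
            f"{4.3 + (i % 50) / 1000:.3f}",
        ])
    return buf.getvalue().encode("utf-8")


raw = make_sample_csv()
packed = lzma.compress(raw)         # .xz container with the default preset
restored = lzma.decompress(packed)  # round-trip to confirm nothing is lost

ratio = len(raw) / len(packed)
print(f"raw={len(raw)} bytes, lzma={len(packed)} bytes, ratio={ratio:.1f}x")

# Read the compressed data back as text without unpacking it to disk first.
with lzma.open(io.BytesIO(packed), mode="rt", encoding="utf-8", newline="") as fh:
    header = next(csv.reader(fh))
print(header)
```

With a real archive you would pass the file path to lzma.open instead of the in-memory buffer and iterate over the reader row by row.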
Chris Thomson, Waterloo CS student + Shopify dev

Written Jan 12, 2011

The City of Toronto publishes a few interesting datasets. Their Dinesafe dataset is
particularly interesting, as it contains information about every restaurant's inspection
(infractions, etc) conducted by Toronto Public Health. You can find all of Toronto's open
datasets at http://toronto.ca/open .
Ryan Compton, http://ryancompton.net/

Written May 11, 2014

Several TB of network connectivity data: M-Lab (easy to work with via Google's
Lots of social networks: Stanford Network Analysis Project

Abhishek Gupta, have wanderlust, want to experience ambedo again & again, a
kleptomaniac for ...
Updated Feb 11, 2014

1. Academic Torrents
2. Links to free data sets for computer vision applications
3. Amsterdam Library of Object Images
4. The CIFAR-10 and CIFAR-100 are labeled subsets of the 80 million tiny images
5. Traffic signs dataset
6. Machine Learning and Data Mining - Datasets
7. Quora Thread
8. DataMob
9. Some More shared on bitly
10. UCD Machine Learning Group
11. Some Links from Open Directory
12. A thread on dataWrangling
13. Kevin's Blog
14. Recommendation and Ratings Public Data Sets
15. Another Quora thread for Kinect Specific Data
16. /r/datasets
Kunal Jain, loves probability, puns, pizza. (and alliteration)

Written Feb 3, 2015

I just thought I'd add Nation Master to the list, because I use it all the time.
For comparison of all kinds of statistics between countries:
International statistics: Compare countries on just about anything! NationMaster.com
Drazen Zaric, Grad student interested in machine learning and data science
Written Dec 12, 2010

Stanford Large Network Dataset Collection has some pretty impressive datasets, like
complete Wikipedia edit history (till January 2008) or a collection of 467 million tweets
collected from June to December 2009.
Giuseppe Sollazzo, Senior Systems Analyst @ St. George's, University of London

Written Oct 19, 2011

Many countries are releasing open-data portals. For data relating to Italy, these are the
main links:
- http://www.dati.gov.it/ (the main governmental website)
- http://dati.piemonte.it/ (data portal for Piemonte region, the first regional portal)
- http://dati.emilia-romagna.it/ (data portal for Emilia Romagna region)
- http://data.enel.com/ (data portal for the ENEL company, an energy/gas supplier)
Edmond Lau, Author of The Effective Engineer

Updated Mar 19, 2012 Upvoted by Mark Meloon, US Head of Data Science at Impetus and
William Chen, Data Scientist at Quora

Google Research released a large 24GB n-gram data set back in 2006 based on processing
10^12 words of text and published counts of all sequences up to 5 words in length:
You can also just search over a related data set via the Google Books Ngram Viewer:
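To make "counts of all sequences up to 5 words in length" concrete, here is a minimal sketch of n-gram counting on a toy sentence. It only illustrates the idea; it is not how Google built the data set, and the whitespace tokenization is an assumption for the example.

```python
from collections import Counter
from typing import List


def count_ngrams(tokens: List[str], max_n: int = 5) -> Counter:
    """Count every contiguous sequence of 1..max_n tokens."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


text = "the quick brown fox jumps over the quick brown dog"
ngram_counts = count_ngrams(text.split())

# Look up how often a particular 3-gram occurs in the toy text.
print(ngram_counts[("the", "quick", "brown")])
```

At web scale the same counting is done in a distributed fashion and rare sequences are usually dropped, but the output has the same shape: a sequence of words mapped to a count.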
Jim Kenyon, Data science practitioner - all models are wrong. some models are
Written Jan 23, 2014


is a great place for structured data.

Brian Risk, Lover of data.

Written Oct 3, 2014

http://Quandl.com has over 10 million data sets gleaned from all over the internet. The
great thing about this resource is that it gives a single way to access all of the data. The site
has a free Excel plug-in, and there are libraries for R, Python, Ruby, etc.
Clinton Little, Coastal Program Specialist working to change the data climate in
Minnesota's ...
Updated Feb 4, 2011



Multipurpose Marine Cadastre http://www.marinecadastre.gov

Digital Coast http://www.csc.noaa.gov/digitalc...
Geospatial One-Stop http://www.geodata.gov
nowCoast http://nowcoast.noaa.gov
Data.gov http://www.data.gov
Great Lakes Commission http://www.glc.org
Great Lakes Information Network http://www.greatlakes.net
The Lake Superior Binational Forum http://www.superiorforum.org
MN DNR Data Deli http://deli.dnr.state.mn.us
MnGeo http://www.mngeo.state.mn.us
NRRI Coastal GIS http://www.nrri.umn.edu/coastalGIS
Lake Superior Streams (Minnesota) http://www.lakesuperiorstreams.org
Minnesota Beach Health Warnings http://www.mnbeaches.org
North Shore GIS Collaborative. http://ardcgis.org
Eliot Jarrett, Digital Brain / Analog Mind, Voracious Reader, Data Synthesizer,
Updated Mar 29, 2012

I've found Kaggle.com to be a fantastic resource, as the datasets relate to specific
business problems and are provided by the respective companies.
Kaggle holds contests for developing the best predictive models based on sourced datasets.
The current competitions are:
1. Improve credit scoring by predicting the probability someone will experience financial
distress within two years
2. Predict if a car purchased at auction is a "bad buy"
3. Identify patients who will be admitted to a hospital within the next year, using historical
claims data
Prizes are provided for the best predictive models, anywhere from $5,000 to $3 million
(for the health insurance competition).
You may use the datasets for free after signing up as a competitor, although there are legal
issues concerning ownership of predictive models that must be considered.
Ossama Alami
Written Jul 4, 2014

Firebase provides a number of realtime datasets for free: Firebase Open Data Sets.
They're easy to use in web or mobile apps. Some of the data sets available:
Cryptocurrency/USD Exchange Rates (Bitcoin, Litecoin, Dogecoin)
Realtime Global Earthquake data
Public transit data & bus GPS positions for several US cities
Airport delay data
Realtime parking availability in SF
Weather data
Disa Johnson, Marketing technology programmer. VP or director at digital

Written Aug 31, 2011

Although there are lots of answers here, many of which look very good,
http://www.wolframalpha.com is a search engine which spiders and houses most
open data that is findable on the Web. It also allows you to use your query syntax to
perform calculations, making it a true computation engine. I love it and use it for a variety
of purposes myself.
Ertan Dogrultan, Software/Data guy, Entrepreneur

Written Feb 11, 2011

Taken from the syllabus of my data mining class,

National Bureau of Economic Research http://www.nber.org/data
(many interesting datasets: Macroeconomics, industry, trade, demographics, hospital,
patents, ...)
Federal Reserve Data Economic Research & Data http://www.federalreserve.gov/ec...
(including data about mortgage defaults, interest rates, exchange rates, industrial
production, ...)
Federal Statistics Data Access Tools: http://www.fedstats.gov/toolkit....
Pardeep Kullar, SaaS, Email marketing, Social tools and pilgrims pizza
Written Nov 15, 2015

There are some companies where, on their free trial, you can get free data.
For example: FollowerWonk (Twitter analytics, follower segmentation, social graph
tracking, & more ) lets you download up to 50,000 followers of any Twitter account.
Datadrip (Free data into sales ) has a bunch of followerwonk files like 50,000 CEOs that
you can download from the home page.
Udit Saini, Research Engineer Data Science @Ant farm

Written May 15, 2015

20 newsgroups: classification task, mapping word occurrences to newsgroup ID (Home
Page for 20 Newsgroups Data Set)
Reuters (RCV*) Corpora: text/topic prediction (Page on reuters.com)
Penn Treebank: used for next word prediction or next character prediction (Penn
Treebank Project)
Broadcast News: large text dataset, classically used for next word prediction (1996 English
Broadcast News Speech (HUB4))
Wikipedia Dataset
Multidomain sentiment analysis dataset: Multi-Domain Sentiment Dataset

Recommendation Systems
MovieLens: Two datasets available from GroupLens . The first dataset has 100,000
ratings for 1682 movies by 943 users, subdivided into five disjoint subsets. The second
dataset has about 1 million ratings for 3900 movies by 6040 users.
Jester: This dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes
from 73,421 users.
Netflix Prize: Netflix released an anonymised version of their movie rating dataset; it
consists of 100 million ratings, done by 480,000 users who have rated between 1 and all of
the 17,770 movies.
Book-Crossing dataset: This dataset is from the Book-Crossing community, and contains
278,858 users providing 1,149,780 ratings about 271,379 books.
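One way to compare these recommendation data sets is by density, i.e. the fraction of all possible user/item pairs that actually carry a rating. The sketch below computes it from the figures quoted above; "about 1 million" and "4.1 million" are rounded in the source, so the results are approximate.

```python
from typing import Dict, Tuple

# (users, items, ratings) taken from the descriptions above.
# Rounded source figures are used as-is, so treat the output as approximate.
RATING_SETS: Dict[str, Tuple[int, int, int]] = {
    "MovieLens 100K": (943, 1682, 100_000),
    "MovieLens 1M": (6040, 3900, 1_000_000),
    "Jester": (73_421, 100, 4_100_000),
    "Netflix Prize": (480_000, 17_770, 100_000_000),
    "Book-Crossing": (278_858, 271_379, 1_149_780),
}


def density(users: int, items: int, ratings: int) -> float:
    """Fraction of the user x item matrix that holds a rating."""
    return ratings / (users * items)


densities = {name: density(*shape) for name, shape in RATING_SETS.items()}

# Print from densest to sparsest.
for name, value in sorted(densities.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} {value:.4%}")
```

Jester comes out far denser than the others because it has only 100 items, while Book-Crossing is extremely sparse, which matters when choosing an algorithm to test on each.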
Rohan Somni, Amazon AWS, Georgia Institute of Technology

Written Apr 5

A2A. Depending on the type of datasets you're interested in, I'd suggest taking a look at
https://www.reddit.com/r/datasets , or maybe Data.gov (The U.S. government's
open data) or Disability and Health (CDC datasets).
Some other random sets I recall/have used before are:
Google Public Data Explorer
Webscope | Yahoo Labs
Overview | Yelp For Developers | Yelp (Yelp's academic dataset)
AWS Public Data Sets
Beer Data
This list is by no means exhaustive, and some Googling can get you a lot more - but it's
what I was able to come up with off the top of my head.
John Goodwin
Written Sep 13, 2011

UK gov data: http://data.gov.uk, lots of interesting linked data at
http://beta.kasabi.com, Ordnance Survey linked data at http://data.ordnancesurvey.co.uk,
and see http://www.ordnancesurvey.co.uk for general open data.
Nikhil Anand Hegde, Curious Quoran.

Written Oct 10, 2014

Some good sources for economic data:

https://www.quandl.com/ - Quandl
What are the most useful sources of economics data?
Shafqat Islam, CEO & Cofounder of NewsCred. We help web publishers delight
their users (and ...
Written Apr 6, 2011

We have a dataset of 20 million+ news articles from the last three years (headline, description,
plus metadata). The data can be accessed programmatically via an API at
http://developer.newscred.com.
People have done some really interesting things with it. We could potentially make it
available as a dump file if someone wants it for research purposes.

Vikas Majjagi, Bored Analyst

Written Jan 16, 2015

You can get real-world data sets on Kaggle.

Many companies host challenges on Kaggle for a real problem that they are trying to solve.
They upload their data to the site (although it might be altered). These may not be
considered big data, but you can get pretty huge data sets.
I once worked with around 17GB of data: close to 45 million records with around 50
features. That can be considered pretty huge.
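A file of that size usually does not fit in memory, so one common approach is to stream it in fixed-size chunks and keep only running totals. The sketch below shows the pattern with Python's standard csv module; the in-memory sample and the column names (id, feature_a, feature_b) are made up for the example and stand in for a real multi-gigabyte file.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List


def iter_chunks(rows: Iterable[Dict[str, str]], size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most `size` rows so only one chunk is held in memory."""
    chunk: List[Dict[str, str]] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # final partial chunk
        yield chunk


# Tiny stand-in for a large CSV file (hypothetical columns).
sample = io.StringIO(
    "id,feature_a,feature_b\n" + "".join(f"{i},{i * 2},{i % 3}\n" for i in range(10))
)

total_rows = 0
running_sum = 0
chunk_sizes = []
for chunk in iter_chunks(csv.DictReader(sample), size=4):
    chunk_sizes.append(len(chunk))
    total_rows += len(chunk)
    running_sum += sum(int(row["feature_a"]) for row in chunk)

mean_a = running_sum / total_rows
print(total_rows, chunk_sizes, mean_a)
```

For a real file, replace the StringIO object with open(path, newline="") and pick a chunk size that comfortably fits in memory.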
Hope this helps.
Vaibhav Mallya, Jobhunting? LMK-I will help you get what you are worth.
OfferLetter.io Founder.
Updated Sep 23, 2011

There are some text corpora here: Where can I find large datasets open to the public?
If you're looking for a vast source of public domain literature, Project Gutenberg is
wonderful: http://www.gutenberg.org/wiki/Ma...
The Presidential Speech Archive: http://millercenter.org/scripps/...
Hitler's Speeches: http://www.hitler.org/speeches/
The Vedas: http://www.sacred-texts.com/hin/
The Gita: http://www.gita4free.com/english...
The Bible: http://patriot.net/~bmcgin/kjvpa...
Take a look at the NYT archive: http://www.nytimes.com/ref/membe...
Ramakanth Dorai, @ramakanth_d

Written Jan 15, 2012

Amazon has announced public datasets hosted on AWS at no charge for the community.
These datasets can be seamlessly integrated with your application running on AWS. Pay per
Abhinav Upadhyay, Created https://man-k.org

Written Mar 11, 2014

Academic Torrents: distributes large datasets using torrents. The project was started
very recently and has some of the most interesting datasets.
Taylan Malak, Database Engineer

Written Sep 22, 2014

U.S. Bureau of Economic Analysis (BEA)

U.S. Bureau of Labor Statistics
WTDB- Select Location
Data | The World Bank
The World Top Incomes Database
Online Data - Robert Shiller
Joseph Hopper, My alter ego is a penguin.

Written Apr 6

Raw data sets of what?

Here is a bag-of-words data set: Bag of Words Data Set
Here is some streamflow data: USGS Current Water Data for the Nation
I would suggest that you start by doing a search on whatever type of data you are interested
in and adding the word dataset. You might also want to try raw data, data set, raw data set,

Thomas Marquart, astronomer, programmer, dog owner

Written Apr 5, 2011

The Sloan Digital Sky Survey seems to not have been mentioned yet:
350 million objects in the night sky, many different measured parameters for each of
Marin Dimitrov, engineering manager @ Uber

Written Jan 10, 2011

Check out Linked Open Data - http://linkeddata.org/

Currently it includes 220+ datasets with 24+ billion RDF triples.
Nidhi Kohli
Written Oct 4, 2013

It depends on what kind of data you need (business/economic/social etc).

My top 3 picks for useful, large business datasets

1. Kaggle
2. KDNuggets
3. Frequent Itemsets Data Repository (Frequent Itemset Mining Dataset Repository )
Matthew Hurst
Written May 11, 2011

d8taplex.com (which I run) has >1MM time series in >50k data sets pulled from 122
sites. The data sets are derived automatically from resources like Excel spreadsheets, HTML
tables and plain CSV and TSV files.

Geoffrey Anderson, Former Data Processing Product Manager for InfoGroup &
General DBA that likes...
Updated Apr 15, 2011

There are several free providers on Microsoft's Azure Data Mart for the time being,
including several of those mentioned above. The single delivery platform and the Excel plugin make the data easier to consume than your typical API / SOAP endpoint.
Marcel Janus, IT-Professional and dad

Updated May 24, 2012

Here are some more links you may consider:

Especially for the German folks around:
Mithun Kalan, Streaming analytics. Storm and AWS lambda

Written Jul 14

There is an open data source, Open Data | UNCDF, with about 10 developing-country
data sets. There is a detailed zip file in the export with all 1000+ questions.
Prathamesh Kulkarni, knowledge seeker

Written Jun 15

Google Public Data Explorer is one good source. It's not large, but has valuable data
regarding economics and other factors of human development. For example this is
about income inequality in the United States.
Google Trends is another good one.

Vladimir Bougay, Co-founder and CTO, Knoema

Written May 23, 2012

If you're looking for public data you should definitely take a look at Knoema
(http://knoema.com ). Knoema is a one-stop shop for your data needs.
Here you will find 600+ public datasets on almost any topic like economics, healthcare,
demographics or energy. Knoema accumulated public data from many credible
international sources in a single place and provides convenient search/browsing tools
Omar Alonso, Data gaucho at Microsoft

Written Aug 30, 2014

Depends on what you are looking for. Wikipedia is the best crowdsourced data set
available for generic use. Now, if you are looking for domain specific data sets (e.g., query
logs, annotations, entities, etc.), that's a different matter.

Minat Kumar Verma, Love to play...anything !!

Written Jan 23, 2014

Try this link once :

Datalist.xlsx - Google Drive
Hope you find it useful.
Daniel Cave, Product Manager and Digital Marketing Manager

Written Jul 17, 2015

Where can you find them? Stop looking and start building them yourself.
The internet is one big data set waiting to be made, and it's laughably easy these days to
combine data from many websites into a large table.
Any of the modern web scrapers will let even a 'non-programmer' put together a data set
very quickly and easily.
I know this because I work at http://import.io and our platform is being used to create
datasets with billions of data points every single day.
I suppose the main reason I suggest this is that you can be free from needing other people
to build big data sets for you, and make your own, becoming more independent in the
Agastya Mishra
Written May 17, 2013

Books and Movies Data: Book-Crossing Dataset

Contains data about books and book ratings in CSV format. Also has SQL queries for CRUD
Bob Calder, Internet and Society, Science and Society Fort Lauderdale, FL
Written Jan 23, 2011

I didn't see anyone mention the WHO Global Health Observatory:

Observatories *should* be constructed with an ontology api in mind for the use of what are
increasingly being called "observatories" as in the virtual observatory the astronomers put
together a couple of years ago. Also look at the "science accelerator" the DOE funded and
of course Abe Lederman's federated search engines.
Tom Greif
Written Jul 22, 2013

Stack Exchange Data Dump - Anonymized data dump of all Creative Commons questions
and answers from the Stack Exchange family of websites at http://stackexchange.com .
XML format, 7zipped, released every 3 months.
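
Because the dump is XML, it can be read incrementally instead of being loaded whole. A minimal sketch, assuming each record is a `<row>` element whose fields are stored as attributes (the attribute names `Id`, `PostTypeId` and `Score` below are assumptions for illustration) and that the 7z archive has already been extracted:

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield the attribute dict of each <row> element as the parser reaches it."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)  # copy the attributes before clearing
            elem.clear()  # drop the element's contents once processed


# Small in-memory stand-in for an extracted dump file (hypothetical content).
sample = io.BytesIO(
    b"<posts>"
    b'<row Id="1" PostTypeId="1" Score="5" />'
    b'<row Id="2" PostTypeId="2" Score="3" />'
    b"</posts>"
)
rows = list(iter_rows(sample))
print(len(rows), rows[0]["Id"])
```

With a real file, pass the path or an open binary file to `iter_rows` and consume the generator lazily rather than wrapping it in `list`.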

Pavan Keerthi, I work with Data

Updated Oct 10, 2011

I was doing this research a few days ago and found these:
http://setiquest.org/ (you need to sign up)
Marc Millstone, visioneer

Written Jun 16, 2011

Many people use the Bible, as it is available in many languages and many different
versions. Another option is to find the proceedings of the UN, which are also published in
many different languages.
Edwin Khoo, Graduate student

Written Feb 1

One of the most comprehensive lists can be found at https://github.com/caesar0301/aw... .
Themis Papavasileiou, {math,cs}U{intelligent machines}

Written Jan 22, 2015

I happened to stumble upon caesar0301/awesomepublicdatasets on DataTau.

As the title suggests the datasets are indeed awesome.

Hope it helps!
Frank Scurlock
Written Dec 11, 2014

I did some research on low impact fuel sources vs. coal in power plants larger than 50
megawatts. I found these to be helpful.
Department of Energy (DOE) declassified documents, part of DOE openness initiative. ...
The OpenNet database provides easy, timely access to over 485,000 ...
The DOE Global Energy Storage Database provides free, up-to-date information on grid-connected energy storage projects and relevant state and federal ...
Welcome to the U.S. Department of Energy, National Energy Technology Laboratory's
Gasification Plant Databases. Within these databases you will find current ...
Hersh Reddy, Programmer and Lawyer

Written Jun 20, 2013

Google and the USPTO make bulk downloads of US patents and trademarks available in
zip archives:
USPTO Bulk Downloads
Konstantinos Psychas
Written Apr 18, 2013

The following platform hosts open data to help in scientific analysis and computational
Contribute to the Cure
Information about the platform, which is currently in beta, is here (Sage Bionetworks Redefining. Challenging. Predicting )
Shashank Kumar, Software Developer, Computer Science alumni IIT Roorkee

Written Apr 4, 2012

Time Series datasets maintained by Dr Eamonn Keogh


University of California Machine Learning Repository

Aaron Anderson, I am a recent graduate from Kings College London with a

degree in Intelligen...
Updated Apr 15, 2013

The Correlates of War (COW) project provides data sets on security.

Colin Baldwin, Software Engineer

Written Apr 5, 2011

There are some great datasets relating to Bioinformatics out there. These are usually
databases of molecules of biological interest.
BLAST: http://blast.ncbi.nlm.nih.gov/Bl...
SCOP: http://scop.mrc-lmb.cam.ac.uk/sc...
There are many others - a huge amount of information is available in this field.
David A Springate, Biostatistician, Evolutionary genetics PhD. Python, R, Lisp

Written Oct 28, 2011

The PubMed Central Open Access Subset contains about 350,000 full-text academic
articles in the biosciences from more than 2,000 journals. You can download the lot as
compressed XML files via FTP: http://www.ncbi.nlm.nih.gov/pmc/...
Alberto Escarlate, Collaborative Fund

Written Jan 10, 2011

NYC DataMine http://www.nyc.gov/html/datamine...

Public data produced by NYC agencies and other City organizations.
John Flurry, Connector, Writer, Mobile App Evangelist, Communicator

Written Jan 24, 2011

I am the head of communications for http://databasin.org a free community

conservation mapping tool. We have thousands of data sets available for both download
and use inside the tool itself. One aspect of data on the site is that full and useful metadata
is required to be uploaded to the site. If you have any questions you can contact me directly
at johnb at consbio dot org.
Arya Asemanfar
Written Jan 10, 2011

Amazon has a repository of datasets as well. They currently have 42 datasets:

Sunil Sangwan, 3rd Year UG at Mnnit Allahabad

Written Jul 17

Here you can find all types of public datasets. It's an awesome list of all types of resources of
Joscelyn Upendran
Written Jan 10, 2011

Ordnance Survey mapping datasets available for Great Britain:

http://www.ordnancesurvey.co.uk/... licensed with UK Government's Open
Government Licence (OGL) : http://www.nationalarchives.gov....
Mark Braggins, A keen interest in technology, innovation and open data

Updated Dec 1, 2014

There's themed linked open data being published under the Open Government Licence
(OGL) on the Hampshire Hub at:
Sebastian Schelter, Committer and PMC member at Apache Mahout and Apache
Written Sep 13, 2011

Konect is a collection of network datasets:

Shunsuke Mikami
Written Jun 6, 2011

The Internet Traffic Archive http://ita.ee.lbl.gov/ publishes some Web access logs.
For example, http://ita.ee.lbl.gov/html/contr... contains access logs from the 1998 World Cup
Web site between April 30, 1998 and July 26, 1998. During this period the site
received 1,352,804,107 requests.
Rob Jensen, always learning, doing more doing. interested in data science,
minimalism and...
Written Jan 10, 2011

If you haven't already, subscribe to the Guardian's DataBlog. They have great articles and always
link out to the data so you can play with it.
Miles Woodroffe, Software Engineer, Tech Leadership, formerly Pro Sound

Written Jan 6, 2011

also some great sources for test data here: http://www.philwhln.com/how-to-g...

Raymond Lam
Written May 5

This has a large list of data collected from around the world and not limited to one
organisation. You have the ability to view the data sets, download the data as a .xlsx file or
visualise the data in browser.
Enrique R Rivera, Owner of www.followthehashtag.com , a tool for twitter

Written Apr 23

You can find some free Twitter datasets (about 200,000 tweets per dataset) in the Datasets
section (Datasets Archive - Followthehashtag // Free twitter search analytics and business
intelligence tool ) of Followthehashtag (http://www.followthehashtag.com).
This section is brand new (2016/04) and we are adding about 2 or 3 new datasets per
week. Hope you enjoy it.
If you need custom datasets (paid), at this URL you can see pricing for datasets from 2,000
to 200,000 tweets (Followthehashtag // Twitter keyword search analytics, influence, geo
content analysis tool, and much more )

Ian Mercer, Prolific Entrepreneur, Inventor, Guinness World Record Holder and
creator of ...
Written Jan 10, 2011

Yahoo Geoplanet for geographic information: http://developer.yahoo.com/geo/g...

Anuj Prakash, Computer Science student

Written Dec 11, 2014

Timothe Poisot, Evolutionary ecologist, geek, blogger (http://www.sce.fr/)

Written Sep 8, 2011

I'd like to point out the ROpenSci project : http://ropensci.org/

It's dedicated to building interfaces to several data repositories, within the R program
Salvatore D'Agostino, Identity, credentialing and access control infrastructure

and services
Written Jan 10, 2011

Bureau of Labor Statistics, http://www.bls.gov/ ; International Monetary Fund,

National Archives http://www.archives.gov/research... as well as the above.
Andrew Semenyak
Written Nov 11, 2013

Here are two sample datasets with company data available for free:

UK Companies Dataset contains information on 10,000 random UK companies
sampled from HitCompanies (all data in this DB is extracted and updated automatically
from the WWW using AI and machine learning): company name and aliases, company
description, industry tags, industry codes, registration numbers, addresses, phone
numbers, VAT numbers, website, number of about/contact/management/product
pages, incorporation date, team size, number of clients and partners, number of
emails, number of key changes (client/partner changes, contact changes, people
changes), and many more
Worldwide Companies Dataset contains information on 10,000 random worldwide
companies sampled from HitCompanies (all data in this DB is extracted and updated
automatically from the WWW using AI and machine learning): company name and
aliases, company description, industry tags, industry codes, registration numbers,
addresses, phone numbers, VAT numbers, website, number of
about/contact/management/product pages, incorporation date, team size, number of
clients and partners, number of emails, number of key changes (client/partner
changes, contact changes, people changes), and many more
Jonas Mattias
Written Apr 5, 2011

For science data from Australia:

Australian National Data Service - http://services.ands.org.au/home...
Integrated Marine Observing System - http://imos.aodn.org.au/webportal/
Australian Ocean Data Network - http://portal.aodn.org.au/webpor...
AuScope (Geology) - http://portal.auscope.org/portal...
Terrestrial Ecosystem Research Network - http://portal.auscover.org.au/we...
Atlas of Living Australia - http://www.ala.org.au/
James Thornton, Relentlessly pursuing "Why?"

Written Apr 5, 2011

Linked Data Sets

Web Services Directory
Martijn de Boer, Harvard Business CORe & Python enthusiast

Written Jul 5

I think this one could be nice :)

Using Microsoft R Server on a single machine for experiments with 600 million taxi
Bill Sobel
Written Jan 11, 2011

A good place to get started is http://www.data.gov/ The purpose of Data.gov is to

Ganesh Raja
Written Feb 16, 2015

Amazon Web Services have public data sets that you can use freely for your big data
projects. You can also contribute to the list.
Please find more information here aws.amazon.com/public-data-sets/


Anthony Gerdeman, Statistician, other

Written Aug 9, 2012

If you're looking for US economic data or time series, try FRED. It's free, comprehensive,
and regularly updated. Provided by the St. Louis Fed.
Moustafa Alzantot, CS Ph.D. Student at UCLA

Written Feb 10

Here is an awesome categorized list of publicly available datasets.


Francisco Restivo, Husband, father, engineer, educator, dreamer

Written Jun 26, 2014

I keep a collection of datasets here Datasets - Francisco Restivo's recommended sites .

Gianfranco Cecconi, Bringing people and data together

Written Nov 5, 2014


is another obvious example.


Brian Chan, just another programmer

Written Nov 16, 2015

Just some suggestions to get you started. =)


Ankush Chopra
Written Mar 27, 2014

Datasets for Data Mining and Data Science has a laundry list of free dataset repositories.

I've used it in the past. Hope this helps.

Annie Pettit, Self serve sample, surveys, polling plus charts and statistics. I am
the Chie...
Written Oct 9, 2014

DataFerrett (U.S. Census Bureau) is a great option for US census data. Lots of data you
can plug directly into any statistics program.
Guilherme Defreitas
Written Jun 5, 2015

Stanford Large Network Dataset Collection

Tim Gerla, CTO and Co-founder, Ansible, Inc.

Written Jan 14, 2011

DataSF from the City of San Francisco: http://datasf.org/

Written Jul 16

Big data analytics helps companies make more informed business decisions by
enabling data scientists, predictive modelers and other analytics professionals to analyze
large volumes of transaction data, as well as other forms of data that may be untapped by
conventional business intelligence (BI) programs. That could include web server logs and
Internet clickstream data, social media content and social network activity reports, text
from customer emails and survey responses, mobile-phone call detail records and machine
data captured by sensors connected to the Internet of Things. Some people exclusively
associate big data with semi-structured and unstructured data of that sort, but consulting
firms like Gartner Inc. and Forrester Research Inc. also consider transactions and other
structured data to be valid components of big data analytics applications.
Big Data, Data Science - Combo Course Training Classes Online | Big Data, Data Science -
Combo Course Courses Online
Big data can be analyzed with the software tools commonly used as part of advanced
analytics disciplines such as predictive analysis, data mining, text analytics and statistical
methods. Mainstream BI software and visualization tools can also play a role in the analysis
process. But semi-structured and unstructured data may not fit well in a traditional data
warehouse based on a relational database. Furthermore, data warehouses may not be able
to handle the processing demands posed by sets of big data that need to be updated
frequently or even continually (for example, real-time data on the performance of mobile
applications or of oil and gas pipelines). As a result, many organizations looking to collect,
process and analyze big data have turned to a newer class of technologies that includes
Hadoop and related tools such as YARN, Sqoop, Spark, and Pig, as well as NoSQL databases.
Those technologies form the core of an open source software framework that supports the
processing of large and diverse data sets across clustered systems.
In some cases, Hadoop clusters and NoSQL systems are being used as landing pads and
staging areas for data before it gets loaded into a data warehouse for analysis, often in a
summarized form that is more conducive to relational structures. Increasingly though, big
data vendors are pushing the concept of a Hadoop data lake that serves as the central
repository for an organization's incoming streams of raw data. In such architectures,
subsets of the data can then be filtered for analysis in data warehouses and analytics
databases, or it can be analyzed directly in Hadoop using batch query tools, stream
processing software and SQL-on-Hadoop technologies that run interactive, ad hoc queries
written in SQL.
Potential pitfalls that can trip up organizations on big data analytics
initiatives include a lack of internal analytics skills and the high cost of hiring experienced
analytics professionals. The amount of information that's typically involved, and its
variety, can also cause data management headaches, including data quality and
consistency issues. In addition, integrating Hadoop systems and data warehouses can be a
challenge, although various vendors now offer software connectors between Hadoop and
relational databases, as well as other data integration tools with big data capabilities.
Businesses are using the power of insights provided by big data to instantaneously
establish who did what, when and where. The biggest value created by these timely,
meaningful insights from large data sets is often the effective enterprise decision-making
that the insights enable.
Extrapolating valuable insights from very large amounts of structured and unstructured
data from disparate sources in different formats requires the proper structure and the
proper tools. To obtain the maximum business impact, this process also requires a precise
combination of people

Simon Tse, Trying to learn something new every day that I find refreshing
Written Mar 16

Try the UCI data repositories


Siddha Ganju, Grad Student, School of Computer Science, Carnegie Mellon

Written Mar 6

For machine learning purposes, a lot of data sets are available on the UCI Machine
Learning Repository

Nikita Zhiltsov, Computer science researcher at Kazan University; Textocat, cofounder & CTO
Written Apr 5, 2011


is a Q&A site dedicated to such questions.

Mike Xu, Insatiably curious scientist/engineer

Written Apr 5, 2011

opened at oreilly strateconf
Mikko Heikkinen, Biologist and a web developer working in a natural history

Written Nov 20, 2015

Global Biodiversity Information Facility has the largest biodiversity dataset, with 600M +
records currently: Free and Open Access to Biodiversity Data

Philippe Beaudoin, I've written my share of C++, working on many projects in the
video game indu...
Written Apr 6, 2011

A free dataset of motion capture data:

Niall McCarthy
Written Jan 24, 2013

You can find a huge selection of free statistics, data and infographics at Statista .
Harit Himanshu, Software Engineer at Yahoo!

Written Jun 9, 2011

Check this one out!

David James, Developer and Curator: National Data Catalog

Written Feb 4, 2011

The National Data Catalog (http://nationaldatacatalog.com ) brings together data sets by
and about government at all levels of government. It is a project of the Sunlight
Colin Kegler
Written May 11, 2013

The National Bureau of Economics has several datasets:

Michael Munsey
Written Mar 19, 2014

There is quite a bit of data available from the FAA.

I particularly found the Airline On-Time Statistics & Delay Causes interesting.
Iain Chalmers, Web Strategist. Motorcycle Rider. Music Lover. Coffee Tragic.
Written Apr 5, 2011

A collection from an admirable data-hoarder:

And discussion:
Evan Thomas, World traveler, surfer, internet marketer, UCSB alumnus from
Manhattan Beach, CA
Written Dec 5, 2011

Anand V. Chhatpar, Tech entrepreneur

Written Oct 20, 2011

US Department of Energy has weather data available for free for over 2000 global
Evan Schuss
Written Jun 15, 2011

Junar.com is a great source for data and statistics pertaining to populations of people,
business, sports, geography and also other types of data. This site is a collection of data
from around the web and is continually expanding its entries.
Paul Jones, director of ibiblio.org, professor at University of North Carolina

Written Jan 11, 2011

Carl Malamud at http://public.resource.org has some of the best large public
information databases.

Jordan Mendelson, Founder/CTO and Good Sharer

Written May 29, 2013

Common Crawl makes available for free ~250 TB of web page data from 2008-2012.
Andrey Fedorov
Written May 2, 2013

Robert Maguire, Tragic optimist.

Written Aug 28, 2011

How about the Center for Responsive Politics and its site Opensecrets.org?

Mark Hahnel, PhD in Stem Cells at Imperial College London, Founder of figshare
Written Apr 8, 2011

The datasets at http://figshare.com are scientific research datasets licensed under CC0.

Aziz Gilani, I am a VC investor in Big Data

Written Jan 17, 2011

I personally use Infochimps.org and DataMarketplace.com for all of my dataset needs
(I am also an investor in both).

Anoop Vasant Kumar, Data Scientist

Written Mar 15

Ideal site for trying out movie recommendations

Abhishek Shivkumar, Data Scientist

Written Jan 8, 2014


has lots of data sets. I'm sure it'll be very useful.


Athlan Lathan
Written May 30, 2015

Some large datasets, some small, all public.
Phil Darnowsky, I know things and have opinions

Written Jan 11, 2011

UC Irvine maintains a collection of datasets for machine learning testing at

Stephen Turner, Bioinformatics Core Director

Written Apr 11, 2012

Gene expression omnibus for gene expression data http://www.ncbi.nlm.nih.gov/geo/

Vinay Kumar, Quorious..

Written Dec 19, 2013

You can get large datasets from the sources mentioned in Where can I find large datasets
open to the public?

Jitendra Harlalka, Data Mining enthusiast

Written Apr 5, 2011

The link contains a dataset of 1 million songs: http://www.infochimps.com/collec...

Arun Patre, Enabler for social enterprises / startups in India

Written Apr 5, 2011

On the India Water Portal we have a 100 year dataset of the meteorological data for all the
districts of India:
Abhishek Mishra, tinkerer

Written Feb 2, 2011

A collection of free public datasets - http://jacquesmattheij.com/Free,...


Nathan Ketsdever, Fascinated by science, discovery, & innovation.

Written Mar 22, 2011

Pete Warden summarizes some of the options here that he covers in "Data Source
Handbook" from O'Reilly:
Here are 18 data-related links that Warden points to in addition to what's covered in the
book, for those wanting to learn more:
Olya Romanova
Updated Sep 26, 2013

Check Knoema via http://knoema.com - the largest open and public data repository with
100 M+ time series and 3000+ datasets
Milstein Munakami
Written Feb 10, 2015

Nazmul Hasan
Written Jul 26, 2013

There are some very cool datasets about Philadelphia here:

Connecting People With Data <====CLICK HERE
If you want data about Chicago this website is the best.
City of Chicago | Data Portal
It's got everything from pothole reports to gun deaths and you can do a lot of cool stuff
with it.
Mitsuharu Hamba, Software Engineer

Written Jun 17, 2011

This is a great summary. How informative!

By the way, you can get more data from the link below as well!
Douglas Moore, Principal Consultant at Think Big, a Teradata Company

Written Oct 10, 2014

See: Where can I find large datasets open to the public?

John Wong
Written Oct 5, 2014

Computer failure traces: Failure Trace Archive

Craig Danton, Head of BD at Enigma

Updated Feb 12, 2014

Enigma.io is a product that aggregates thousands of publicly available data sets. Over 80
billion rows of data in 100,000 tables. Also available via an API.
Dan Bair
Written Feb 2, 2011

Here is another link that lists some publicly available data sets.
Link: http://highscalability.com/blog/...

Vincent van Haa, software engineer, data viz expert, ux designer, hacker, vj,
musician, cyclis...
Written Apr 6, 2011


Julian Ranger, Serial entrepreneur & angel investor

Written Jan 10, 2011

and for the UK Government a large number of datasets is available at http://data.gov.uk/

Raviteja Chirala, Data Scientist, Avid Programmer..

Written Nov 24, 2014

There is another question where you'll be overwhelmed with sources.

Where can I find large datasets open to the public?
Yap Kai Lun Leon

Written Apr 22

I have developed a charting platform which also allows users to download data after
registering for a free membership.

Margaret Warren
Written Apr 5, 2011

Lots of Australian data at http://data.gov.au

Robert Loftin
Written Mar 26, 2013

Australian government data repository

Jan Willem Tulp, freelance information visualizer

Written Apr 7, 2011

Here are 2 good resources with Open Data from the EU: http://publicdata.eu/

Brad Pauly, Rails application developer

Written Jan 16, 2011

Wikipedia has been mentioned but I didn't see a link. This is for current articles.
Vasundhar Boddapati, Another distinct you!

Written Feb 2, 2011

Some data and sources might repeat.

Patrick Hochstenbach, from the heart of Henry Van de Velde's Booktower at

Ghent University: librari...
Written Jan 12, 2011

Try CKAN http://ckan.net/

Shantanu Sharma, Post Doc in Computer Science at UC Irvine.

Written Aug 2

You can get TPC-H dataset from the following link:

TPC-H - Homepage
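
As a rough sketch of how the generated files can be read: this assumes the TPC-H data generator writes plain-text `.tbl` files with one row per line, fields separated by `|` and a trailing `|` at the end of each line. The sample row and its comment text are made up for illustration, and the `REGION_COLUMNS` names are an assumption about the `region` table layout that should be checked against the specification:

```python
# Assumed column names for the region table (verify against the TPC-H specification).
REGION_COLUMNS = ["r_regionkey", "r_name", "r_comment"]


def parse_tbl_line(line, columns):
    """Split one pipe-delimited row into a dict keyed by column name."""
    fields = line.rstrip("\n").split("|")
    if fields and fields[-1] == "":
        fields = fields[:-1]  # the trailing delimiter leaves one empty field
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    return dict(zip(columns, fields))


# Hypothetical sample row, not copied from real generator output.
row = parse_tbl_line("0|AFRICA|example comment|\n", REGION_COLUMNS)
print(row["r_name"])
```

All values come back as strings; converting keys and numeric columns to `int` or `float` is left to the caller.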

Igor Kiselev
Written Jun 6, 2015

Stanford Large Network Dataset Collection

Datasets for Data Mining and Data Science
Thieme Hennis, PhD in education & technology @ tudelft.nl

Written Sep 25, 2013

Check out http://www.engagedata.eu/

It is aimed at providing a community and hub for open datasets for Europe.
Tristan Henderson, Lecturer in Computer Science, University of St Andrews

Written Jan 10, 2011

We host a number of wireless network datasets at http://crawdad.org

Yunhong Gu, Software Engineer

Written Jan 11, 2011

Biologists have a huge amount of public data at NCBI: ftp://ftp.ncbi.nih.gov/ . The total
size may be close to 1PB.
Richard Pauli
Written Apr 5, 2011

Tons of Climate Data http://www.easterbrook.ca/steve/...

And it needs your attention for building climate models.
Robert Prescott
Written May 17, 2013

Page on Sciencebase is a searchable data repository for the United States Geological

Ben Toth

Teng Qiu
Written May 11, 2014


and freebase.com

Lenny Kiyoshi Bogdonoff

Written Jun 10, 2013

The best that aggregates all OPEN government data, as an API, is:
Chrystall Kanyuck, Community journalist and US expat in the British Virgin

Islands. Web and data...
Written Jun 10, 2011

Another one for a long list: The Guardian lets you search for open government data from
around the world at
Philip Zavliaris, Democratizing Quantitative Investing. Quantitative Investment

Written Feb 6

AssetMacro provides free access to historical data of 10,000+ Macroeconomic indicators

and Market Data covering global stock markets, bonds, Fx and commodities

Misha Denil
Written Feb 2, 2011

Peter Skomoroch has a delicious page with links to many data sets.

Akshay Mall, May the odds be ever in your favor!

Written Apr 29, 2014

Please check this question : Where can I find large datasets open to the public?
Jim Shi
Written Mar 30, 2013

National Center for Biotechnology Information

Owen Stephens
Written Dec 9, 2011

For specific data you can try asking over at http://getthedata.org which is a Q&A site
dedicated to questions about finding data

Daniel McNamara
Written Jul 19, 2011

www.kaggle.com has datasets freely available and data analysis competitions with
prize money attached

Martin Kelly
Written Apr 10, 2011


has a nice frontend to many open data sets

Enrique Cusba
Written May 21, 2011

You can find some interesting information here:


Updated Sep 19, 2011

See Where can I find large datasets open to the public?


