
June 10/11, 2014

Michelle Hudson, Science & Social Science Data Librarian


Kristin Bogdan, Science & Social Science Data Librarian
Kayleigh Bohémier, Science Research Support Librarian for Astronomy,
Geology & Geophysics, and Physics
Rolando Garcia-Milian, Biomedical Sciences Research Support
Science Data Resources: From
Astronomy to Bioinformatics
A Brief Overview of Data in the
Sciences
Examples of data
questions
Where would you find
these?
Spectroscopy on jars found
in wrecks
Spectra of M31
ApoE structure
Ice cores
Genomic sequences for
extinct mammals
Types of Data
Observational
data captured in real time, irreplaceable
sensor readings, telescope images, geologic samples
Experimental
data from lab equipment, expensive to reproduce
gene sequences
Simulation
data generated from models (the models are more important than the
data)
Derived or compiled
data put together from other information
3D models, compiled databases

Formats of data
Documents, spreadsheets, lab notebooks, questionnaires, survey
responses, health indicators, audio recordings, video recordings,
protein and gene sequences, images, films, spectra, slides,
artifacts, specimens, samples, models, algorithms, scripts,
software code, etc.
General resources
data.gov: http://www.data.gov/
DataONE: http://www.dataone.org/
NCBI: http://www.ncbi.nlm.nih.gov/
EBI: http://www.ebi.ac.uk/
FigShare: http://figshare.com/
Dryad: http://datadryad.org/
PLOS|ONE: http://www.plosone.org/
Data journals:
http://mlibrarydata.wordpress.com/2014/05/09/data-journals/
Research guide: http://guides.library.yale.edu/sciencedata
Astronomy,
or,
Massively Open Online Archives
Astronomy data
Self-collected
Data.NASA.gov
US Virtual Observatory
(US VO)
ApJ supplement
Research centers and
collaborations
Figshare
Astronomy Dataverse
github
Screenshot: Virtual Observatory Data Explorer search for M31
Government Data
NASA Data processing levels
Ranked 0-4
Level 0 is raw data
Level 1 has been
processed and error-corrected
This is, incidentally,
where conspiracy theories
come from
Level 2 data may
contain derived
parameters
Levels 3+ have further
processing
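The levels above can be written down as a small lookup structure. This is a minimal Python sketch: the names `ProcessingLevel`, `DESCRIPTIONS`, and `describe` are illustrative and not NASA identifiers, and the descriptions paraphrase this slide rather than any official definition.

```python
from enum import IntEnum


class ProcessingLevel(IntEnum):
    """Data processing levels as summarized on the slide (ranked 0-4)."""
    LEVEL_0 = 0
    LEVEL_1 = 1
    LEVEL_2 = 2
    LEVEL_3 = 3
    LEVEL_4 = 4


# Descriptions paraphrase the slide text; they are not official definitions.
DESCRIPTIONS = {
    ProcessingLevel.LEVEL_0: "raw data",
    ProcessingLevel.LEVEL_1: "processed and error-corrected",
    ProcessingLevel.LEVEL_2: "may contain derived parameters",
    ProcessingLevel.LEVEL_3: "further processing",
    ProcessingLevel.LEVEL_4: "further processing",
}


def describe(level: int) -> str:
    """Return a one-line summary for a processing level between 0 and 4."""
    member = ProcessingLevel(level)  # raises ValueError outside 0-4
    return f"Level {int(member)}: {DESCRIPTIONS[member]}"


if __name__ == "__main__":
    for lvl in ProcessingLevel:
        print(describe(lvl))
```

When evaluating a data set, checking which level it sits at tells you how much processing stands between you and the instrument.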
Data from
Researchers
Astrophysical Journal
Supplement
http://dx.doi.org/10.1088/0067-0049/212/2/26
http://dx.doi.org/10.1088/0067-0049/212/2/19
http://dx.doi.org/10.1088/0067-0049/212/1/6
Project web pages (e.g.,
Kepler)
Intermediary data products
remain a problem (e.g., code,
analyzed data sets)

Olausen, S. A., & Kaspi, V. M. (2014). Table 2 from The
McGill Magnetar Catalog. The Astrophysical Journal
Supplement Series, 212(1), 1-22.
doi:10.1088/0067-0049/212/1/6
Harvard's Astronomy Dataverse: a solution for intermediate-stage data products
Physics:
A Little Less Open,
But Everyone Knows Where To Find It
Reference Data
National Nuclear Data
Center at Brookhaven
National Laboratory:
http://www.nndc.bnl.gov/
Department of Energy Data
Explorer:
http://www.osti.gov/dataexplorer/
MCPlots (Monte Carlo plots
reference for HEP):
http://mcplots.cern.ch/
Monte Carlo plot reference from MCPlots
Experimental Data
Durham HepData Project
Reactions database
Data from active
experiments
Data reviews
http://durpdg.dur.ac.uk/HEPDATA/REAC
Example:
http://durpdg.dur.ac.uk/view/ins1297226
Experimental data releases
IceCube:
http://icecube.wisc.edu/science/data

And then we have the data grids.
Geoscience Data Resources
Kinds of
Geoscience Data
Geospatial
Rocks and Minerals
Economic Geology
Paleobiology
Climate History
Geochemistry
Physical Samples
Image credit: USGS, via Wikimedia Commons:
http://commons.wikimedia.org/wiki/File:Seismograph_Pinatubo.jpg
Physical Samples
as Data
Identified as data in NSF
guidelines
Different analyses = new
data
New techniques
developed over time
Repositories for samples:
specific metadata required

Géry, Parent. 7 May 2011. Peronopsis interstrictus. Retrieved
from the Wikimedia Commons at
http://commons.wikimedia.org/wiki/File%3APeronopsis_interstrictus_White%2C_1874_2.jpg
Geo/Paleo Sample Repositories/Registries
International Geo Sample Number (IGSN) -
http://www.geosamples.org/
Peabody Museum - http://peabody.yale.edu/collections/search-collections
PaleoBioDB - http://paleobiodb.org/#/


Other Resources for Geoscience
USGS Earth Explorer - http://earthexplorer.usgs.gov/
Data.Gov - http://www.data.gov/
GeoGratis - http://geogratis.cgdi.gc.ca/
Morphobank - http://morphobank.org/
EarthCube - http://earthcube.org/
CINERGI - http://workspace.earthcube.org/cinergi







Bioinformatics Resources and
Services
Problem: Rapid Growth of Biomedical Data
GenBank Statistics: http://www.ncbi.nlm.nih.gov/genbank/genbankstats-2008/
Chart: Samples Submitted to Gene Expression Omnibus Database, 2000-2012 (y-axis in millions, 0 to 3.5)
Compiled from GEO historic data:
http://www.ncbi.nlm.nih.gov/geo/summary/?type=history
Chart: Number of Records in PubMed (biomedical literature), 1940-2020 (y-axis in millions, 0 to 25)
Compiled from PubMed:
http://www.ncbi.nlm.nih.gov/pubmed
Problem: Growth of the Biomedical Literature
Huge volume (PubMed: 23,132,342 citations)

High diversity

High quality (peer review)

Users overwhelmed by long list of search results

One third of PubMed queries return 100 or more citations (Islamaj Dogan,
2009)
Querying the biomedical literature becomes more difficult
Medical Subject Headings
Filters
Boolean operators
Problem: Querying the Biomedical Literature
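To make the three querying aids concrete, here is a minimal Python sketch that assembles a PubMed-style search string from a MeSH heading, a filter, and Boolean operators. The helpers `tag`, `combine`, and `build_query` are hypothetical names, and the search terms are chosen only for illustration; the sketch builds a string and does not contact PubMed.

```python
def tag(term: str, field: str) -> str:
    """Attach a PubMed search field tag, quoting multi-word terms."""
    quoted = f'"{term}"' if " " in term else term
    return f"{quoted}[{field}]"


def combine(operator: str, *clauses: str) -> str:
    """Join clauses with one Boolean operator and wrap them in parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(clauses) + ")"


def build_query() -> str:
    """Build one illustrative query; the terms are assumptions, not from the slides."""
    gene = tag("EGFR", "Title/Abstract")                     # free-text term
    disease = tag("Head and Neck Neoplasms", "MeSH Terms")   # MeSH heading
    review_filter = tag("Review", "Publication Type")        # filter
    return combine("AND", gene, disease, review_filter)


if __name__ == "__main__":
    print(build_query())
```

Each added clause narrows the result list, which is the point of these tools when a plain keyword search returns hundreds of citations.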
Information Retrieval vs Information Extraction
Query: EGFR
Information Retrieval: retrieves documents/records
Information Extraction: extracts facts from records
T14D inhibited EGF receptor internalization
EGFR regulates tumor cell proliferation
EGFR is expressed in SCCHN
Modified from OpenHelix
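The difference can be sketched in a few lines of Python using the three example statements from the slide. This is a toy illustration, not how any production text-mining tool works: `retrieve`, `extract`, and the tiny relation vocabulary in `FACT_PATTERN` are assumptions made for the example.

```python
import re

# The three example statements from the slide, treated as one-line records.
RECORDS = [
    "T14D inhibited EGF receptor internalization",
    "EGFR regulates tumor cell proliferation",
    "EGFR is expressed in SCCHN",
]

# Hypothetical, deliberately tiny relation vocabulary for the toy extractor.
FACT_PATTERN = re.compile(
    r"^(?P<subject>\S+) "
    r"(?P<relation>inhibited|regulates|is expressed in) "
    r"(?P<object>.+)$"
)


def retrieve(query, records):
    """Information retrieval: return whole records that mention the query."""
    return [r for r in records if query.lower() in r.lower()]


def extract(records):
    """Information extraction: return (subject, relation, object) facts."""
    facts = []
    for record in records:
        match = FACT_PATTERN.match(record)
        if match:
            facts.append((match["subject"], match["relation"], match["object"]))
    return facts


if __name__ == "__main__":
    print(retrieve("EGFR", RECORDS))
    print(extract(RECORDS))
```

Note that the keyword search for "EGFR" misses the record that spells the name out as "EGF receptor", which is one reason synonym handling and controlled vocabularies matter.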
Alternative Tools for Mining the Biomedical Literature

Alternative tools for mining the biomedical literature combine:

Statistical methods

Ontologies / controlled vocabularies

Natural language processing tools

Visualization tools

Reduced time for discovering meaningful results.
Alternative Tools for Mining the Biomedical Literature
Main gene query
Protein/gene associated
Synonym
Medical terminology (MeSH)
Alternative Tools for Mining the Biomedical Literature
Linked to Entrez Gene
and OMIM database
Workshop: Novel Online Tools for Mining the Biomedical
Literature
Case 1: Few Results in the Biomedical Literature
Searching for novel genes
Case 2: Few Results in the Biomedical Literature
Searching for side effects of drugs: Cerebyx respiratory failure
Phenotypic information can be used
to infer molecular interactions and
to hint at new uses of marketed
drugs (Campillos, 2008)
Data Annotation / Integration / Visualization Tools: Genome
Browsers
Contextualizing Data/Results in the Biomedical Knowledge
Resulting list of upregulated genes after treatment of prostate
cancer cells with vitamin D

Microarray data obtained from the Gene Expression Omnibus
repository were analyzed with GEO2R statistical software
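A list of upregulated genes like the one described above is typically obtained by filtering a results table on significance and fold change. The Python sketch below assumes a tab-delimited export with `adj.P.Val`, `logFC`, and `Gene.symbol` columns; the rows, gene names, and thresholds are placeholders invented for illustration, not results from the workshop data set, and it assumes a positive `logFC` means higher expression in the treated group.

```python
import csv
import io

# Hypothetical export: every probe ID, gene name, and value is a placeholder.
SAMPLE_TSV = (
    "ID\tadj.P.Val\tlogFC\tGene.symbol\n"
    "probe_1\t0.001\t2.4\tGENE_A\n"
    "probe_2\t0.20\t3.1\tGENE_B\n"
    "probe_3\t0.01\t-1.8\tGENE_C\n"
    "probe_4\t0.03\t1.2\tGENE_D\n"
)


def upregulated(tsv_text, max_adj_p=0.05, min_log_fc=1.0):
    """Return (gene, logFC) pairs passing both thresholds, largest change first.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = [
        (row["Gene.symbol"], float(row["logFC"]))
        for row in reader
        if float(row["adj.P.Val"]) <= max_adj_p
        and float(row["logFC"]) >= min_log_fc
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    for gene, log_fc in upregulated(SAMPLE_TSV):
        print(gene, log_fc)
```

The resulting gene list is what would then be passed to a literature-mining tool to place the result in context.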
References
Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P. Drug target identification using
side-effect similarity. Science. 2008 Jul 11;321(5886):263-6.
http://www.ncbi.nlm.nih.gov/pubmed/18621671

Islamaj Dogan R, Murray GC, Névéol A, Lu Z. Understanding PubMed user search
behavior. Database (Oxford). 2009. http://www.ncbi.nlm.nih.gov/pubmed/20157491

Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human
genome browser at UCSC. Genome Res. 2002 Jun;12(6):996-1006.
http://www.ncbi.nlm.nih.gov/pubmed/12045153

Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P. A side effect resource to capture
phenotypic effects of drugs. Mol Syst Biol. 2010;6:343. Epub 2010 Jan 19.
http://sideeffects.embl.de/drugs/56338/

Rindflesch, T.C. et al. (2011) Semantic MEDLINE: An advanced information management
application for biomedicine. Information Services & Use, 31, 15-21.
http://lhncbc.nlm.nih.gov/system/files/pub-lhncbc-2011-109.pdf
