Michelle Hudson, Science & Social Science Data Librarian
Kristin Bogdan, Science & Social Science Data Librarian
Kayleigh Bohémier, Science Research Support Librarian for Astronomy, Geology & Geophysics, and Physics
Rolando Garcia-Milian, Biomedical Sciences Research Support

Science Data Resources: From Astronomy to Bioinformatics

A Brief Overview of Data in the Sciences

Examples of data questions
Where would you find these?
  Spectroscopy on jars found in wrecks
  Spectra of M31
  ApoE structure
  Ice cores
  Genomic sequences for extinct mammals

Types of Data
Observational: data captured in real time, irreplaceable
  sensor readings, telescope images, geologic samples
Experimental: data from lab equipment, expensive to reproduce
  gene sequences
Simulation: data generated from models (models are more important than the data)
Derived or compiled: data put together from other information
  3D models, compiled databases

Formats of data
Documents, spreadsheets, lab notebooks, questionnaires, survey responses, health indicators, audio recordings, video recordings, protein and gene sequences, images, films, spectra, slides, artifacts, specimens, samples, models, algorithms, scripts, software code, etc.

General resources
data.gov: http://www.data.gov/
DataONE: http://www.dataone.org/
NCBI: http://www.ncbi.nlm.nih.gov/
EBI: http://www.ebi.ac.uk/
FigShare: http://figshare.com/
Dryad: http://datadryad.org/
PLOS ONE: http://www.plosone.org/
Data journals: http://mlibrarydata.wordpress.com/2014/05/09/data-journals/
Research guide: http://guides.library.yale.edu/sciencedata

Astronomy, or, Massively Open Online Archives

Astronomy data
Self-collected
Data.NASA.gov
US Virtual Observatory (US VO)
ApJ supplement
Research centers and collaborations
Figshare
Astronomy Dataverse
GitHub

Screenshot: Virtual Observatory Data Explorer search for M31

Government Data
NASA
Data processing levels
  Ranked 0-4
  Level 0 is raw data
  Level 1 has been processed and error-corrected
    This is, incidentally, where conspiracy theories come from
  Level 2 data may contain derived parameters
  Levels 3+ have further processing

Data from Researchers
Astrophysical Journal Supplement
  http://dx.doi.org/10.1088/0067-0049/212/2/26
  http://dx.doi.org/10.1088/0067-0049/212/2/19
  http://dx.doi.org/10.1088/0067-0049/212/1/6
Project web pages (e.g., Kepler)
Intermediary data products remain a problem (e.g., code, analyzed data sets)
Olausen, S. A., & Kaspi, V. M. (2014). Table 2 from The McGill Magnetar Catalog. The Astrophysical Journal Supplement Series, 212(1), 1-22. doi:10.1088/0067-0049/212/1/6

Harvard's Astronomy Dataverse
A solution for intermediate stage data products

Physics: A Little Less Open, But Everyone Knows Where To Find It

Reference Data
National Nuclear Data Center at Brookhaven National Laboratory: http://www.nndc.bnl.gov/
Department of Energy Data Explorer: http://www.osti.gov/dataexplorer/
MCPlots (Monte Carlo plots reference for HEP): http://mcplots.cern.ch/

Screenshot: Monte Carlo plot reference from MCPlots

Experimental Data
Durham HepData Project
  Reactions database
  Data from active experiments
  Data reviews
  http://durpdg.dur.ac.uk/HEPDATA/REAC
  Example: http://durpdg.dur.ac.uk/view/ins1297226
Experimental data releases
  IceCube: http://icecube.wisc.edu/science/data
And then we have the data grids.

Geoscience Data Resources

Kinds of Geoscience Data
Geospatial
Rocks and Minerals
Economic Geology
Paleobiology
Climate History
Geochemistry
Physical Samples
Image credit: USGS, via Wikimedia Commons: http://commons.wikimedia.org/wiki/File:Seismograph_Pinatubo.jpg

Physical Samples as Data
Identified as data in NSF guidelines
Different analyses = new data
New techniques developed over time
Repositories for samples
Specific metadata required
Géry, Parent. 7 May 2011. Peronopsis interstrictus. Retrieved from the Wikimedia Commons at http://commons.wikimedia.org/wiki/File%3APeronopsis_interstrictus_White%2C_1874_2.jpg

Geo/Paleo Sample Repositories/Registries
International Geo Sample Number (IGSN) - http://www.geosamples.org/
Peabody Museum - http://peabody.yale.edu/collections/search-collections
PaleoBioDB - http://paleobiodb.org/#/

Other Resources for Geoscience
USGS Earth Explorer - http://earthexplorer.usgs.gov/
Data.Gov - http://www.data.gov/
GeoGratis - http://geogratis.cgdi.gc.ca/
Morphobank - http://morphobank.org/
EarthCube - http://earthcube.org/
CINERGI - http://workspace.earthcube.org/cinergi

Bioinformatics Resources and Services

Problem: Rapid Growth of Biomedical Data
GenBank Statistics: http://www.ncbi.nlm.nih.gov/genbank/genbankstats-2008/
[Chart: Samples Submitted to Gene Expression Omnibus Database, in millions, 2000-2012. Compiled from GEO historic data: http://www.ncbi.nlm.nih.gov/geo/summary/?type=history]
[Chart: Number of Records in PubMed, in millions, 1940-2020. Compiled from PubMed: http://www.ncbi.nlm.nih.gov/pubmed]

Problem: Growth of the Biomedical Literature
Huge volume (PubMed 23132342 citations)
High diversity
High quality (peer review)
Users overwhelmed by long list of search results
1/3 of PubMed queries result in 100 or more citations (Islamaj, 2009)

Problem: Querying the Biomedical Literature
Querying the biomedical literature becomes more difficult
Medical Subject Headings
Filters
Boolean operators

Information Retrieval vs Information Extraction (modified from OpenHelix)
Query: EGFR
Information Retrieval: retrieves documents/records
Information Extraction: extracts facts from records
  T14D inhibited EGF receptor internalization
  EGFR regulates tumor cell proliferation
  EGFR is expressed in SCCHN
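
The retrieval vs extraction distinction above can be illustrated with a toy sketch in Python. It uses only the three example sentences from the slide as "records". The synonym table and the relation patterns are assumptions made up for this illustration; real literature-mining tools rely on ontologies and natural language processing rather than a hand-written regular expression.

```python
import re

# Toy "records": the three example sentences from the slide.
RECORDS = [
    "T14D inhibited EGF receptor internalization",
    "EGFR regulates tumor cell proliferation",
    "EGFR is expressed in SCCHN",
]

# Hypothetical synonym table for the query gene (assumption for this sketch).
SYNONYMS = {"EGFR": ["EGFR", "EGF receptor"]}

# Hypothetical relation patterns (assumption): subject, relation verb, object.
PATTERN = re.compile(
    r"^(?P<subject>.+?) "
    r"(?P<relation>inhibited|regulates|is expressed in) "
    r"(?P<object>.+)$"
)


def retrieve(query, records):
    """Information retrieval: return whole records that mention the query."""
    terms = [t.lower() for t in SYNONYMS.get(query, [query])]
    return [r for r in records if any(t in r.lower() for t in terms)]


def extract(records):
    """Information extraction: return (subject, relation, object) facts."""
    facts = []
    for record in records:
        match = PATTERN.match(record)
        if match:
            facts.append(
                (match.group("subject"), match.group("relation"), match.group("object"))
            )
    return facts


hits = retrieve("EGFR", RECORDS)   # a list of records to read
facts = extract(hits)              # a list of structured facts

for fact in facts:
    print(fact)
```

Retrieval hands back records that the user still has to read. Extraction hands back structured statements that can be counted, filtered, or linked to other databases.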
Alternative tools for mining the biomedical literature combine:
Statistical methods
Ontologies / controlled vocabularies
Natural language processing tools
Visualization tools
Reduced time for discovering meaningful results.

Alternative Tools for Mining the Biomedical Literature
Main gene query
Protein/gene associated
Synonym
Medical terminology (MeSH)
Linked to Entrez Gene and OMIM databases

Workshop: Novel Online Tools for Mining the Biomedical Literature

Case 1: Few Results in the Biomedical Literature
Searching for novel genes

Case 2: Few Results in the Biomedical Literature
Searching for side effects of drugs: Cerebyx respiratory failure
Phenotypic information can be used to infer molecular interactions and to hint at new uses of marketed drugs (Campillos, 2008)

Data Annotation / Integration / Visualization Tools
Genome Browsers

Contextualizing Data/Results in the Biomedical Knowledge
Resulting list of up-regulated genes after treatment of prostate cancer cells with Vit D
Microarray data obtained from the Gene Expression Omnibus repository were analyzed with GEO2R statistical software

References
Campillos M*, Kuhn M*, Gavin AC, Jensen LJ, Bork P. Drug target identification using side-effect similarity. Science. 2008 Jul 11;321(5886):263-6. http://www.ncbi.nlm.nih.gov/pubmed/18621671
Islamaj Dogan R, Murray GC, Névéol A, Lu Z. (2009) Understanding PubMed user search behavior. Database (Oxford) http://www.ncbi.nlm.nih.gov/pubmed/20157491
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002 Jun;12(6):996-1006. http://www.ncbi.nlm.nih.gov/pubmed/12045153
Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol. 2010;6:343. Epub 2010 Jan 19. http://sideeffects.embl.de/drugs/56338/
Rindflesch, T.C. et al. (2011) Semantic MEDLINE: An advanced information management application for biomedicine. Information Services & Use, 31, 15-21. http://lhncbc.nlm.nih.gov/system/files/pub-lhncbc-2011-109.pdf