
The Fourth Paradigm:

Data-Intensive Scientific Discovery


Tony Hey
Corporate Vice President
Microsoft External Research

This work is licensed under a Creative Commons
Attribution 3.0 United States License.

Tony Hey: An Introduction

Commander of the British Empire

The Fourth Paradigm


A Digital Data Deluge in Research

Data collection
  Sensor networks, satellite surveys, high-throughput laboratory instruments, observation devices, supercomputers, LHC

Data processing, analysis, visualization
  Legacy codes, workflows, data mining, indexing, searching, graphics

Archiving
  Digital repositories, libraries, preservation

SensorMap
  Functionality: Map navigation
  Data: sensor-generated temperature, video camera feed, traffic feeds, etc.

Scientific visualizations

NSF Cyberinfrastructure report, March 2007

Emergence of a Fourth Research Paradigm

1. Thousand years ago: Experimental Science
   Description of natural phenomena
2. Last few hundred years: Theoretical Science
   Newton's Laws, Maxwell's Equations
3. Last few decades: Computational Science
   Simulation of complex phenomena
4. Today: Data-Intensive Science
   Scientists overwhelmed with data sets from many different sources
     Data captured by instruments
     Data generated by simulations
     Data generated by sensor networks
   eScience is the set of tools and technologies to support data federation and collaboration
     For analysis and data mining
     For data visualization and exploration
     For scholarly communication and dissemination

Astronomy has been one of the first disciplines to embrace
data-intensive science with the Virtual Observatory (VO),
enabling highly efficient access to data and analysis tools
at a centralized site. The image shows the
Pleiades star cluster from the Digitized Sky Survey
combined with an image of the moon,
synthesized within the WorldWide Telescope service.

Science must move from data to information to knowledge

With thanks to Jim Gray


http://research.microsoft.com/fourthparadigm/

"The impact of Jim Gray's thinking is continuing to get people to think in a new way about how data and software are redefining what it means to do science."
Bill Gates, Chairman, Microsoft Corporation

"One of the greatest challenges for 21st-century science is how we respond to this new era of data-intensive science. This is recognized as a new paradigm beyond experimental and theoretical research and computer simulations of natural phenomena, one that requires new tools, techniques, and ways of working."
Douglas Kell, University of Manchester

"The contributing authors in this volume have done an extraordinary job of helping to refine an understanding of this new paradigm from a variety of disciplinary perspectives."
Gordon Bell, Microsoft Research


Listed 7 key areas for action by Funding Agencies:
1. Fund both development and support of software tools
2. Invest at all levels of the funding pyramid
3. Fund development of generic Laboratory Information Management Systems
4. Fund research into scientific data management, data analysis, data visualization, new algorithms and tools

Remaining three key areas for action relate to the future of Scholarly Communication and Libraries:
5. Establish Digital Libraries that support the other sciences like the NLM does for Medicine
6. Fund development of new authoring tools and publication models
7. Explore development of digital data libraries that contain scientific data (not just the metadata) and support integration with published literature

Developing a Sustainable
e-Infrastructure


Accelerating time to insight with Advanced Research Tools and Services

Our goal is to accelerate research by collaborating with academic communities to use advanced computer science research technologies
Aim to help scientists spend less time on IT issues and more time on science by creating open tools and services based on Microsoft platforms and productivity software

Data Acquisition and Modeling

The Swiss Experiment
Powerful Software Improves Environmental Forecasting
Environmental scientists face many challenges in monitoring and understanding our planet's changing climate. Through an international collaboration called the Swiss Experiment, environmental scientists and computer science experts are deploying advanced sensor networks and data management tools to improve environmental monitoring and forecasting.

Life Under Your Feet
Researchers at The Johns Hopkins University are deploying large arrays of wireless soil sensors in a variety of environmental settings, including a park, an urban forest and a wetland. The networks enable scientists to monitor ecological changes on an unprecedented scale and offer insights into hydrology, greenhouse gases and the activity of organisms in the soil.

Collaboration and Visualization

Research Information Center
Collaboration and information sharing among researchers are among the most important but challenging aspects of scientific research. In recent years, scientists have begun using virtual research environments to exchange information with colleagues in specific areas of study. Microsoft Research and The British Library are teaming up to build the Research Information Centre.

SciScope Speeds Data Retrieval from Multiple Repositories
For environmental scientists and engineers, finding and retrieving relevant data can be a daunting and tedious task. Microsoft Research is developing an online search engine called SciScope that enables researchers to search multiple data repositories simultaneously and retrieve information in a consistent format.

Analysis and Data Mining

Trident
A Scientific Workflow Workbench Brings Clarity to Data
Scientists at the University of Washington are working with Microsoft External Research to demonstrate how marrying visualization and workflow technologies can allow researchers to better manage, evaluate and interact with even the most complex scientific data sets.

PhyloD
Statistical tool used to analyze DNA of HIV from large studies of infected patients
Typical job, 10-20 CPU hours with extreme jobs requiring 1K-2K CPU hours
Very CPU efficient
Requires a large number of test runs for a given job (1-10M tests)
Highly compressed data per job (~100 KB per job)

Disseminate and Share

Chem4Word
Chemistry Drawing in Word
Created in collaboration with University of Cambridge; Peter Murray-Rust, et al.

Intent: Recognizes chemical dictionary and ontology terms

Author/edit 1D and 2D chemistry.
Change chemical layout styles.

Data: Semantics stored in Chemistry Markup Language
<?xml version="1.0"?>
<cml version="3" convention="orgsynthreport" xmlns="http://www.xmlcml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" x2="2.9149999618530273" y2="0.7699999809265137"/>
      <atom id="a2" elementType="C" x2="1.5813208400249916" y2="1.5399999809265137"/>
      <atom id="a3" elementType="O" x2="0.24764171819695613" y2="0.7699999809265134"/>
      <atom id="a4" elementType="O" x2="1.5813208400249912" y2="3.0799999809265137"/>
      <atom id="a5" elementType="H" x2="4.248679083681063" y2="1.5399999809265137"/>
      <atom id="a6" elementType="H" x2="2.914999961853028" y2="-0.7700000190734864"/>
      <atom id="a7" elementType="H" x2="4.248679083681063" y2="1.907348645691087E-8"/>
      <atom id="a8" elementType="H" x2="-1.0860374036310796" y2="1.5399999809265132"/>
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1"/>
      <bond atomRefs2="a2 a3" order="1"/>
      <bond atomRefs2="a2 a4" order="2"/>
      <bond atomRefs2="a1 a5" order="1"/>
      <bond atomRefs2="a1 a6" order="1"/>
      <bond atomRefs2="a1 a7" order="1"/>
      <bond atomRefs2="a3 a8" order="1"/>
    </bondArray>
  </molecule>
</cml>
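
The CML fragment above can be read with any standard XML parser. The following is a minimal sketch in Python, not part of Chem4Word: it uses a shortened excerpt of the slide's molecule (first four atoms only), assumes that atomRefs2 holds two space-separated atom ids, and computes the 2D length of each bond. The helper name bond_lengths is hypothetical.

```python
import math
import xml.etree.ElementTree as ET

# Shortened excerpt of the CML shown on the slide (first four atoms only).
# Assumption: atomRefs2 holds two space-separated atom ids.
CML = """<cml xmlns="http://www.xmlcml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" x2="2.9149999618530273" y2="0.7699999809265137"/>
      <atom id="a2" elementType="C" x2="1.5813208400249916" y2="1.5399999809265137"/>
      <atom id="a3" elementType="O" x2="0.24764171819695613" y2="0.7699999809265134"/>
      <atom id="a4" elementType="O" x2="1.5813208400249912" y2="3.0799999809265137"/>
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1"/>
      <bond atomRefs2="a2 a3" order="1"/>
      <bond atomRefs2="a2 a4" order="2"/>
    </bondArray>
  </molecule>
</cml>"""

# The CML elements live in this XML namespace, so lookups must be qualified.
NS = {"cml": "http://www.xmlcml.org/schema"}


def bond_lengths(cml_text):
    """Return (atom1, atom2, order, length) for every bond in the document."""
    root = ET.fromstring(cml_text)
    # Map each atom id to its 2D drawing coordinates.
    coords = {
        atom.get("id"): (float(atom.get("x2")), float(atom.get("y2")))
        for atom in root.iterfind(".//cml:atom", NS)
    }
    result = []
    for bond in root.iterfind(".//cml:bond", NS):
        first, second = bond.get("atomRefs2").split()
        (x1, y1), (x2, y2) = coords[first], coords[second]
        length = math.hypot(x2 - x1, y2 - y1)
        result.append((first, second, int(bond.get("order")), length))
    return result


if __name__ == "__main__":
    for first, second, order, length in bond_lengths(CML):
        print(f"{first}-{second} order={order} length={length:.3f}")
```

Because the semantics are stored as structured markup rather than as a picture, a check like this (for example, flagging bonds with implausible drawn lengths) needs only the document itself.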

Intelligence: Verifies validity of authored chemistry

Relationships: Navigate and link referenced chemistry

Disseminate and Share

Ontology Plug-In for Word
John Wilbanks, Phil Bourne, Lynn Fink

Services: Ontology download web service
Intent: Term recognition & disambiguation
Relationships: Ontology browser

Archiving and Preservation

Zentity
A semantic computing platform to store and expose relationships between digital assets

Default web UI with CSS support and custom ASP.NET controls
Native support for RSS, OAI-PMH, OAI-ORE, AtomPub and SWORD
Flexible data model enables many scenarios and can be easily extended over time

Archiving and Preservation

oreChem: the Chemical Semantic Web

Semantic storage
Mashup (reuse) data
Compound document authoring

[Diagram labels: experiments, text, documents, measurements, data, scientists, molecules]

Network analysis is of growing importance in academic, commercial, and Internet social media contexts
Existing Social Network Tools are challenging for many novice users
Tools like Excel are widely used
Leveraging a spreadsheet as a host for Social Network Analysis lowers barriers to network data analysis and display

Leverage spreadsheet for storage of edge and vertex data

Apply dynamic filters to the data
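
To make the two points above concrete, here is a minimal sketch in Python of network data held as plain rows, as it might sit in a spreadsheet. The column layout (source, target, weight), the sample names, and the function names are all hypothetical; this is not the API of any particular tool.

```python
from collections import Counter

# Hypothetical edge rows, one per spreadsheet row: (source, target, weight).
EDGES = [
    ("alice", "bob", 3),
    ("alice", "carol", 1),
    ("bob", "carol", 2),
    ("carol", "dave", 5),
]


def vertex_degrees(edges):
    """Derive a vertex table: count how many edges touch each vertex."""
    degrees = Counter()
    for source, target, _weight in edges:
        degrees[source] += 1
        degrees[target] += 1
    return dict(degrees)


def filter_edges(edges, min_weight):
    """A simple 'dynamic filter': keep edges at or above a weight threshold."""
    return [edge for edge in edges if edge[2] >= min_weight]


if __name__ == "__main__":
    print(vertex_degrees(EDGES))
    print(filter_edges(EDGES, min_weight=3))
```

Keeping edges and vertices as two flat tables is what lets a spreadsheet act as the host: each filter is just a row predicate, and the display can be redrawn from whatever rows remain.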

Intent: Insert Creative Commons licenses from within Office 2007

Services: Integrates with Creative Commons Web API to create new licenses

Relationships: license information stored as RDF/XML within the document OOXML

http://ccaddin2007.codeplex.com

The Future Research e-Infrastructure:


Client + Cloud


Statistical tool used to analyze DNA of HIV from large studies of infected patients
PhyloD was developed by Microsoft Research and has been highly impactful
Small but important group of researchers
  100s of HIV and HepC researchers actively use it
  1000s of research communities rely on these results

Cover of PLoS Biology, November 2008

Typical job, 10-20 CPU hours with extreme jobs requiring 1K-2K CPU hours
  Very CPU efficient
  Requires a large number of test runs for a given job (1-10M tests)
  Highly compressed data per job (~100 KB per job)

PhyloD now ported as Windows Azure Cloud Service
Cloud enables agile deployment of scalable scientific services
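
The workload profile above (many test runs per job, very little input data) is the kind that spreads easily across cloud workers. The sketch below is an illustration only, assuming the tests in a job can run independently; it does not describe how PhyloD actually schedules work, and the function names and worker counts are hypothetical.

```python
def partition_tests(total_tests, num_workers):
    """Split test indices 0..total_tests-1 into contiguous, near-equal ranges."""
    base, extra = divmod(total_tests, num_workers)
    ranges = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional test each.
        size = base + (1 if worker < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def ideal_wall_clock_hours(cpu_hours, num_workers):
    """Lower bound on elapsed time, assuming perfect parallel scaling."""
    return cpu_hours / num_workers


if __name__ == "__main__":
    parts = partition_tests(1_000_000, 64)
    print(len(parts), len(parts[0]), len(parts[-1]))
    print(ideal_wall_clock_hours(20, 10))
```

Under these assumptions a typical 20 CPU-hour job on 10 workers would take at least 2 hours of elapsed time; real deployments add queuing and startup overhead on top of that bound.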

Courtesy of Roger Barga

Science pipeline for download, initial processing and reduction of satellite imagery. Developed by MSR, UVa, UCB.
Dramatically lowers resource and complexity barriers to use satellite imagery for terrestrial hydrology and geoscience.
  Common imagery location determination and upload from diverse sources
  Common reprojection and harmonization to produce science-ready imagery with the same length, time and quality attributes
  Optional scientist-provided reduction algorithm (.NET, Java, or MatLab)
  On-demand scalability beyond local desktop or cluster

In use now to compute 10-year continental-scale water balance for North America. Per year:
  500 GB (~60K files) upload of 9 different source imagery products from 15 different locations
  400 GB reprojected harmonized imagery consuming ~3500 CPU hours
  5 GB reduced science result leveraging reported field data aggregates consuming ~60 CPU hours

Additional science requests pending
  Expanding above to Europe
  Additional source imagery products and formats

Catharine van Ingen (MSR), Jie Li, Marty Humphreys (UVA), Youngryel Ryu (UCB), Deb Agarwal (BWC/LBL)

[Diagram labels: Source Imagery Download Sites, Source Metadata, AzureMODIS Service Web Role Portal, Request Queue, Data Collection Stage, Reprojection Queue, Reprojection Stage, Reduction Queue, Analysis/Reduction Stage, Scientist, Scientific Results]
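
The stage and queue names on this slide suggest stages connected by work queues. The sketch below illustrates that general pattern in Python with in-process queues standing in for cloud queues. The stage order (collection, then reprojection, then reduction) is an assumption read off the diagram labels, and the stage functions, field names, and tile names are placeholders rather than the project's actual code.

```python
from queue import Queue


def run_stage(inbox, outbox, work):
    """Drain one queue, apply a stage function, and enqueue each result."""
    while not inbox.empty():
        outbox.put(work(inbox.get()))


def run_pipeline(requests):
    """Push requests through three queue-connected stages and collect results."""
    # Stand-ins for the request, reprojection and reduction queues.
    request_q, reprojection_q, reduction_q, results_q = (
        Queue(), Queue(), Queue(), Queue()
    )
    for request in requests:
        request_q.put(request)

    # Placeholder stage functions; real stages would download,
    # reproject and reduce satellite imagery.
    run_stage(request_q, reprojection_q, lambda r: {"tile": r, "collected": True})
    run_stage(reprojection_q, reduction_q, lambda t: {**t, "reprojected": True})
    run_stage(reduction_q, results_q, lambda t: {**t, "reduced": True})

    results = []
    while not results_q.empty():
        results.append(results_q.get())
    return results


if __name__ == "__main__":
    for result in run_pipeline(["tile-1", "tile-2"]):
        print(result)
```

Decoupling stages with queues is what allows each stage to be scaled independently: the reprojection step, which dominates the CPU hours quoted above, could be given more workers without touching the other stages.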

Led by Newcastle University, UK (Paul Watson), project supported by ER
  Investigating applicability of commercial clouds for scientific research
  Build a working prototype for use cases in chemoinformatics
  Uses Microsoft technologies to build science-related services (Windows Azure, Silverlight)

Built initial proof of concept
  Silverlight UI for basic Quantitative Structure-Activity Relationship (QSAR) modeling
  Demonstrated ability to scale QSAR computations in Windows Azure


Data/information is inter-connected through machine-interpretable information (e.g. paper X is about star Y)
Social networks are a special case of data meshes
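
A statement such as "paper X is about star Y" can be written as a subject-predicate-object triple, which is one common way to make such links machine-interpretable. The following Python sketch uses plain tuples; every identifier and predicate name in it is hypothetical and chosen only for illustration.

```python
# Each statement is a (subject, predicate, object) triple.
# All names below are hypothetical examples.
TRIPLES = [
    ("paperX", "isAbout", "starY"),
    ("paperX", "writtenBy", "authorA"),
    ("paperZ", "isAbout", "starY"),
    ("authorA", "collaboratesWith", "authorB"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the given fields; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


if __name__ == "__main__":
    # Which papers are about star Y?
    print(match(TRIPLES, predicate="isAbout", obj="starY"))
```

In this representation a social network is simply the subset of triples whose subject and object are both people, such as the collaboratesWith statement in the sample data.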

A knowledge ecosystem:

A richer authoring experience
An ecosystem of services
Semantic storage
Open, Collaborative, Interoperable, and Automatic

Attribution: Chris Bizer

Vision of Future Research e-Infrastructure using Client + Cloud resources

[Diagram labels: visualization and analysis services; scholarly communications; search; books; citations; domain-specific services; blogs & social networking; reference management; instant messaging; identity; project management; mail; notification; document store; storage/data services; knowledge management; knowledge discovery; compute services; virtualization]
The site contains access and downloads of relevant open tools and resources for the worldwide academic research community. Examples of our open tools and services:
Plug-ins for Office
  Ontology Add-in for Word
  Article Authoring Add-in for Word
  Chem4Word: Chemistry Drawing in Word
Microsoft Biology Foundation (MBF)
  Enables and accelerates fundamental advances in biology
F#
  Collaboration with the academic and research community on F#'s typed functional and object-oriented programming on the .NET platform
Software Engineering Tools
  Spec#: Program verifier for C# extended with design by contract
  VCC: Program verifier for Concurrent C
  PEX: Automatic unit testing tool for .NET
  CHESS: Unit testing tools for concurrent Win32 executables and .NET

Microsoft Research
http://research.microsoft.com
Microsoft Research downloads: http://research.microsoft.com/research/downloads

Microsoft External Research
http://research.microsoft.com/externalresearch

Science at Microsoft
http://www.microsoft.com/science

CodePlex
http://www.codeplex.com

The Faculty Connection
http://www.microsoft.com/education/facultyconnection

MSDN Academic Alliance
http://msdn.microsoft.com/enus/academic
