NSF 2003 CI in The Biological Sciences


Any opinions, findings, conclusions, or recommendations expressed in this report
are those of the workshop participants, and do not necessarily represent the official
views, opinions, or policy of the National Science Foundation.
Building a Cyberinfrastructure for the Biological Sciences (CIBIO)

2005 and Beyond: A Roadmap for Consolidation and Exponentiation

A Workshop Report
July 14-15, 2003
Chair, John C. Wooley
Subcommittee on 21st Century Biology
NSF Directorate for Biological Sciences Advisory Committee (BIOAC)
Table of Contents
Abstract

Executive Summary
Introduction: The Context for Deliberation
The Unique Case for Including Biology in CI Development
Intrinsic Aspects of the Biological Sciences
Multiplying Exponentials through an Extensive Partnership
The Essence of the Objectives for NSF BIO
Resource Requirements and Initial Stages of Implementation
Education and Training
Coordination and Collaborations

Prelude (Commentary and Prescience in the Biological Sciences)

The Workshop Report

Specifying and Building a CI to meet the Requirements of Biology

The Opportunity Created by Building a BIO-Specific CI

Invest in People
Ensure Science Pull, Technology Push
Stay the Course
Prepare for the Data Deluge
Enable Science Targets of Opportunity
Select and Direct Technology Contributions
Establish National and International Partnerships

Implementation Strategies
Funding and Management Mechanisms
Outreach and Partnerships

Immediate Steps for BIO

Appendices
I. Central Questions for CIBIO Workshop
II. Schedule and Assignments
III. Workshop Participants
IV. Overview of Blue Ribbon Panel/Atkins Report on CI
V. Models for a Comprehensive CIBIO based on Extant Test Beds
VI. Potential Prototypes for BIO Implementation
VII. References for CIBIO Report

A Cyberinfrastructure View
Envisioning and Empowering Successes for 21st Century Biological Sciences

• Creating and sustaining a comprehensive cyberinfrastructure (CI; the pervasive applications of all
domains of scientific computing and information technology) is as relevant and as required for
biology as for any science or intellectual endeavor; in the advances that led to today’s
opportunity, the BIO Directorate made numerous, ad hoc contributions, and now can integrate its
efforts to build the complete platforms needed for 21st Century Biology. Doing so will accelerate
progress in extraordinary ways.

• The time for creating a CI for all of the sciences, for research and education, has arrived and NSF
will lead the way. BIO must co-define the extent and fine details of the NSF structure for CI,
which will involve major internal NSF partnerships and external partnerships with other agencies,
and will be fully international in scope.

• Only the biological sciences have seen as remarkable, sustained revolutionary advances as the
computer and information sciences. Only in the past few years has the world of computing and
information technology reached the level of being fully applicable to the wide range of cutting
edge themes characteristic of biological research. Multiplying the exponentials (of continuing
advances in computing and bioscience) through deep partnerships will inevitably be exciting
beyond any anticipation.

• The stretch goals for the biological sciences community include both a need for community-level
involvement and for the complete spectrum of CI; namely: People and Training; Instrumentation;
Collaborations; Advanced Computing and Networking; Databases and Knowledge Management;
Analytical Methods (Modeling and Simulation).

• NSF BIO must:

o Invest in People
o Ensure Science Pull, Technology Push
o Stay the Course
o Prepare for the Data Deluge
o Enable Science Targets of Opportunity
o Select and Direct the Technology Contributions
o Establish National and International Partnerships.

• The biology community must decide how best it can interact with the quantitative science
community, where and when to intersect with computational sciences and technologies, how to
cooperate on and contribute to infrastructure projects, and how NSF BIO should partner
administratively. An implementation meeting, as well as briefings to the community through
professional societies, will be essential.

• For NSF BIO to underestimate the importance for biology, or fail to provide fuel over the entire
journey, would severely retard progress and be very damaging for the entire national and
international biological sciences community.

Executive Summary
Introduction: The Context for Deliberation
The biological sciences are at a critical junction in their history, having absorbed over several decades the
tremendous successes of “reductionist” experimentation, that is, of carefully focused investigations on
simpler systems, model organisms and biological abstractions/models. Today, as the direct consequence
of such extraordinary and even unanticipated successes, a new era of synthesis pervades thinking about
the future of biological research, from macromolecules to ecosystems. To inform the process and
deliver this synthesis, biological scientists must collect, organize, analyze and comprehend
unprecedented volumes of highly heterogeneous, hierarchical information obtained by different
means or modalities, with different standards, widely varying kinds (types) of data, over vast
scales of time, space and organizational complexity.

NSF recently introduced the term “cyberinfrastructure” (CI) to describe the integrated, ubiquitous,
increasingly pervasive application of scientific computing (SC) and information technology (IT)
approaches, which are already changing both science and society. For example, a pervasive
infrastructure arising from scientific computing and information technology will provide the circumstances
and platforms to enable robust, widely distributed research teams or collaboratories, user-friendly
interfaces to fully integrated information from multicomponent systems, and also the software and
hardware for advanced simulation and modeling projects that are directly and tightly coupled with
experimental studies and provide interactive, iterative capacities to refine our knowledge. Such
approaches are already essential requirements for many features of contemporary scientific research.

A CI will do many things, but among the most important are to provide a means to establish: (1) the tools
for capturing, storing, and managing data; (2) the tools for organizing, finding and analyzing the data to
obtain information; (3) the connection of experimental and theoretical analyses and their interplay with
simulations and complex models based on that information; and (4) the integration of disparate aspects of
that information to provide a synthesis, a knowledge repository for further considerations. The reception
of the concept of CI as a maturing, philosophical and practical perspective – that is, on the profound
revolution provided through today’s integration of continuing advances in SC and IT - has been truly
remarkable, with the entire worldwide community of scientists joining the dialogue.

People, and their ideas and tools, are at the heart of CI. Building a fully effective cyberinfrastructure
for science and society will require educating a new generation, although the technologies and the effort
itself will generate new training environments and open novel options for enriching understanding of
science for both technical features and for community relationships. After full implementation including
the training of a new cadre of scientists, a comprehensive CI (for any community) will address (1) the
provision of routine, remote access to instrumentation, computing cycles, data storage and other
specialized resources; (2) the facilitation of novel collaborations that allow new, robust, interdisciplinary
and multidisciplinary research team efforts among the most appropriate individuals at widely separated
institutions to mature; (3) the powerful and ready access to major digital knowledge resources and other
libraries of distilled information including the scientific literature; (4) platforms or vehicles for the
integration of information from multiple, highly diverse and distributed sources; (5) new training
environments; and (6) other features essential to contemporary research.

The Unique Case for Including Biology
The history of the basic science community studying biology and of their federal sponsors, NSF BIO, is
especially rich in prescience and sustained commitment. BIO invested early, and in numerous ways in
the advancing IT world that is now leading to a comprehensive cyberinfrastructure. The original
investments, for example, ranged from ecology and the LTERs to structural biology and the PDB. In
partnership with CISE, BIO also invested in a very wide range of high performance scientific computing
opportunities, such as biophysical and neuroscience modeling, telescience or remote access to
specialized instrumentation, and the requisite visualization, networking and database tools. Today, there
is an extraordinary opportunity to consolidate those activities and thereby to build a compelling,
integrated program that only could arise from BIO in its home at NSF. Specifically, building a
cyberinfrastructure for the biological sciences requires an interface to all of the quantitative sciences as
well as to computer science and engineering, and this can only happen at NSF. Already, we have seen
examples, such as NCRR/NIH’s cyberinfrastructure prototype, the Biomedical Information Research
Network (BIRN), where mission agencies have recognized NSF’s contribution and begun to establish CI
activities to meet their needs. Numerous other examples will follow to meet the goals of those missions.
Indeed, BIO’s previous and continuing investments will catalyze revolutionary change, not merely
incremental improvements, around the world.

CI is ideally suited for the cottage industry that is biology, due to the revolution in grid
services, data integration, and modern information technology. This revolution can now be
coupled with the advent of a biological research approach, focused at a systems level, that is
integrative, synthetic and predictive, or what NSF calls: 21st Century Biology. The vantage
point gained by looking at research issues in biology from a synthetic point of view, including
the characterization of interacting processes, and the integration of informatics, simulation and
experimental analysis, represents the central engine powering the entire discipline.

[Figure: Cyberinfrastructure Enabled BioScience Research. Multidisciplinary; Multidimensional;
Information-driven; Education-oriented; Internationally engaged.]

Not only does 21st Century Biology absolutely require a strong cyberinfrastructure, but also, more than
any other scientific domain, biology, due to its inherent complexity and the core requirement for advanced
IT, will drive the future cyberinfrastructure for all science. NSF BIO must engage with CISE fully in setting
the course, in establishing an architectural plan describing the specific needs of the biosciences, in
assembling the parts, and building a full blown, highly empowering cyberinfrastructure for the entire
biological sciences community.

Complementing the compelling scientific case for building the cyberinfrastructure required by the
biological sciences, there are very favorable administrative considerations in the context of NSF. Notably,
NSF’s implementation of CI, in incremental fashion and tailored to each discipline’s needs, offers a
special opportunity - a perfect fit - for the biological sciences. In terms of management, access and
resources, NSF should assign the utmost priority to BIO, the only organization positioned to lead
the response of the biological sciences community.

Intrinsic Aspects of the Biological Sciences


As our understanding of living systems increases, the inherent complexity of biology has become so
apparent as to pose a daunting challenge. Indeed, biology encompasses more than twenty orders of
magnitude in time, more than ten orders of magnitude in space, and a hierarchical organizational
space of enormous variety. As calculus has been the language of the physical sciences, information
technology (informatics) is becoming the language of the biological sciences. Biological
scientists have typically managed data sets up to the limit set by each generation’s computing
parameters (cycles, storage, bandwidth). Yet what distinguishes biological data is not their sheer
volume but the singular nature of observations, the individuality of organisms, the typical lack of
simplifying symmetries, the lack of redundancy in time and space, and the depth of detail and of
intrinsic features.

The biological sciences, in settings around the world, will remain dominated by widely distributed,
individual or small team research efforts, rather than moving to a particular focus on centralized,
community facilities, as has happened for some sciences; reaching out to the broadest range of the
best performers, wherever they are, is consequently particularly important. As
telecommunication networks advance, biologists around the entire world will be able to explore and
contribute to 21st Century Biology.

At the molecular level, for example, a cyberinfrastructure for biology, using tools developed to
extract implicit genome information, will allow biologists to understand how genes are regulated;
how DNA sequences dictate protein structure and function; how genetic networks function in
cellular development, differentiation, health and disease. A CI for BIO must integrate the
expertise and approaches of scientific computing and information technology with experimental
studies at all levels; for example, on molecular machines, gene regulatory circuits, metabolic
pathways, signaling networks, microbial ecology and microbial cell projects, population biology,
phylogenies, and ecosystems.

[Figure: PDB & Genome enabled Biology: Using Structure to Understand Function. DNA Sequence
Implies Structure Implies Function: DNA Sequence Provides Protein Sequence; Synchrotron
Facilities Provide 3-D Protein Structure. Basis for 21st Century Medicine (Accelerated Drug
Development; Individualized Medicine; Productive, Healthy Citizens) and Sustainable Development
(Environmental Remediation; Biofuels, Biocatalysts; Improved Agriculture): Enhanced U.S.
Competitiveness, Environmental Quality. A Cyberinfrastructure for BIO is Needed to Extract
Implicit Genome Information.]

Multiplying Exponentials through an Extensive Partnership


As the consequence of the parallel, fully comparable revolutions in biological research, and in computer
and information science and engineering, an extraordinary frontier is emerging at the interface between
the fields. Both communities, and their Federal counterparts, BIO and CISE, can facilitate the research
agenda of the other. 21st Century Biology absolutely requires, through the domain of scientific computing
(SC), all of the insight, expertise, methodology and technology of advanced information technology (IT),
arising from the output of computer science and engineering (CSE) and its vigorous interconnection with
experimental research. Indeed, only the biological sciences, over the past several decades, have
seen as remarkable, sustained revolutionary increases in knowledge, understanding and applicability
as the computer and information sciences.

The Essence of the Objectives for NSF BIO


Today, the exponential increases in these two domains make them ideal partners and the dynamics of
the twin revolutions underpin the potential unprecedented impact of building a cyberinfrastructure for the
entire biological sciences. Building on these successes, the essence, for BIO utilizing CI in empowering
21st Century Biology, is “Keep Your Eye on the Prize” --

• Invest in People
• Ensure Science Pull, Technology Push
• Stay the Course
• Prepare for the Data Deluge
• Enable Science Targets of Opportunity
• Select and Direct Technology Contributions
• Establish National and International Partnerships

The most obvious feature of 21st Century Biology is the increasing rate of data flow and simultaneously,
the highly complex nature of the data, whether obtained through conventional or automated means.
Responding to the enormous challenge requires that biological scientists be able to organize that data
into information, analyze the information to create insight and knowledge, and synthesize disparate
elements of our knowledge to create a deep understanding or wisdom. A passive role will not suffice
when the vitality of the entire biology enterprise is involved. In other words, BIO must provide the vision
for the CIBIO, not rely on technology drivers and circumstantial access. Education and the investment in
people will, of necessity, include retraining, lowering the barrier for entry by senior faculty and by those
from other disciplines, programs at all academic levels, and the training and stable career paths for future
principal investigators and for academic professionals who will provide the energy for the scientific
journey. Once involved, BIO will have made a major commitment to the community and must have an
effective long-range plan to sustain the efforts. The changing relationship of CISE to its high performance
computing centers and the introduction of a CI process across NSF places a significant obligation on the
BIO Directorate to structure and maintain the role of the biological science community in the development
and utilization of scientific computing and information technology applied to biology.

Not all subdisciplines can be simultaneously provided with a CI by BIO, so selected pilot projects and
areas of high NSF BIO impact should be the first focal points of effort. Strategic partnerships, discussed
below, may well be needed to facilitate and accelerate implementation. Nothing succeeds like success,
and the complete implementation of a CI for the biological sciences will depend on the initial choices
paying off in easily demonstrated ways. Thus, the early pilots should be selected not just for their
longer term scientific contribution but also for their ability to contribute significantly in the near
term, even though many aspects of a comprehensive CI for the biological sciences will take years to
develop fully and the impact will continue to accelerate the science for the foreseeable future.

All research communities should interoperate, work through and with CISE and NSF as a whole, to seek
to absorb as many as possible of the computational contributions from other fields, rather than
encouraging reinvention. Nonetheless, BIO must also choose its own technology course, not
passively accept whatever (hardware, software, middleware) is delivered for the needs of other science
domains.

The entire community will engage, even those with fewer resources and alternatives (than those available
within the biomolecular computing community). Scientists can now facilitate each other’s progress in
extraordinary ways, and to optimize introduction of 21st Century Biology, the biosciences need to be
interconnected to the other NSF domains. For NSF BIO to underestimate the importance for the
biological sciences or to fail to provide fuel over the journey would be very damaging, perhaps
catastrophic, for the community.

Cyberinfrastructure promises to be as pervasive and central an influence as any societal revolution ever.
Given the breadth and the long-term impact, several considerations are very important. First, working
with partnerships and working in a global context is obvious and imperative on a scientific basis. Second,
these interconnections are equally obvious and imperative on a practical and administrative basis. The
cost of full implementation, of a comprehensive cyberinfrastructure in which the biological sciences
supported by NSF benefit from cyber-rich environments, such as those piloted by NEES and BIRN, will be
large as would be expected for a program of such incredible significance and applicability. The
administrative scale at which NSF and NSF BIO prepare and sustain this process will have to be well
beyond any previous efforts, beyond even the STCs or extant MRE programs.

Resource Requirements and Initial Stages of Implementation


Doubling the BIO budget would be justified simply to underwrite key steps in a comprehensive CI
for the biological sciences. This will be a decades-long effort, although each journey begins with a
single step. NSF BIO should take that step, and further initial steps, as soon as possible.

Funding increases will be needed as well for the core experimental programs and their projects to permit
them to exploit fully the growing cyberinfrastructure and to build the requisite collaborations for a synthetic
understanding of biology, which requires computational expertise and the deep involvement of
information technology. In the biological sciences, database activities, modeling/simulations, and
theory must always be connected to experimental efforts; a balanced expansion of the portfolio
will be important. Beyond this base internal to BIO, major partnerships with computer and information
science (CISE) and with the other sciences will be required. The impact should not be underestimated,
but neither should the requirement for greatly enhanced, stable funding.

BIO is already engaged upon a series of extraordinary opportunities, in creating a larger scale for shared,
collaborative research efforts, through activities like FIBR and (the just initiated) NEON, while sustaining
microbial projects and LTER. These larger scale projects particularly require a cyberinfrastructure, with
costs of comparable magnitude to the projections for the experimental research component.

BIO will have to (1) build up its own core activities at this interface (e.g., the funding for bioinformatics,
biological knowledge resources, computational biology tools and collaborations on simulation/modeling)
that allow it to partner with other parts of NSF, (2) choose test beds for full implementation of CI, establish
paths toward deep integration of CI into all of its communities and for all of its performers, and (3) set a
leadership role for other agencies around the entire world, including notably the USA mission agencies, to
follow. Of course, only through a decades-long commitment and through flexible, agile, engaged,
proactive interactions with the entire community and with the other stakeholders – i.e., with other sources
of funding for infrastructure and research newly enabled by CI – will the effort be a complete success.

Several types of early actions are needed. Implementing these important requirements will be the
responsibility of BIO and must be in place for effective collaborations on research frontiers with the other
domains (Directorates) of NSF.

The first implementation steps should be to expand the extant database activities and computational
modeling/simulation studies, which need central attention. Many challenges remain as research
problems as well as particularly difficult implementation problems for databases in the life sciences.
Simulation studies could contribute considerably more across all of the biological sciences. Accelerating
the introduction and expansion of tools and of the conceptual approaches provided through testing
models, a prominent feature of research in the physical sciences, will require continued programmatic
emphasis and commitment.

Many biologists trained in more traditional ways are just starting to recognize the opportunities, making a
renewed and invigorated focus on training in the quantitative sciences essential for 21st Century Biology.
Encouragement of more collaborations between/among experimentalists and computational scientists is
essential, but the full implementation of the opportunities will require the training of a new generation of
translators, of “fearless” biologists able to understand and speak the language of the quantitative
scientists well enough to choose the best collaborators and to build bridges to more traditionally trained
experimentalists. Many basic requirements involve academic professionals and the use of well-
documented approaches within computer and information science. Interdisciplinary training should be
restored as a separate, defined program within BIO.

As noted above, the enabling and transformational impact of CI justifies, and for full implementation would
require, a doubling of the NSF BIO budget, but it will also require that BIO lead a much larger effort,
marshalling resources from other National agencies and around the world, to provide adequate funding to
ensure full participation by the international life sciences community. Consequently, other key, early
actions are to establish a long range plan for sustained funding and to engage the community in a
dialogue to ascertain implementation priorities as well as to prepare the biological scientists from around
the Nation to participate fully. The dialogue should begin as an open meeting that is highly interactive
and inclusive in all ways; a major venue will be needed to explore all options, dig deeply into
implementation features for subdisciplines and into national and international partnerships, and provide
for the archival of discussions and recommendations.

Important administrative features include the review and funding of infrastructure and establishing (over
time) a balance across the subdisciplines. Infrastructure is different from individual research and
needs separate processes for its consideration, which are described below. Central coordination,
needed for effective selection of pilots and coordinated efforts, will ensure balance and accelerate
penetration of the benefits of modern IT to every BIO discipline.

All categories of infrastructure are increasingly important for scientific research, but cyberinfrastructure
will be particularly valuable for the biological sciences. What will be critical is to recognize that
infrastructure cannot be treated the same as individual research proposals. One cannot review
infrastructure requests and plans against individual research proposals, and separate, centralized review
and oversight will be needed. This situation arises since infrastructure benefits all, but has a different
time frame, different budgets, different staffing (more academic professionals). To make informed,
equitable and effective judgments on behalf of the community, CIBIO simply cannot be simultaneously
considered with individual projects. At the same time, robust, rigorous peer review is essential to
establish the best opportunities. Competition is also important; overlapping efforts will need to be initiated
in many cases and then the best project will ultimately become clearly identified.

Education and Training


The educational challenges are themselves vast, and will require an expansion of existing
programs and possibly the creation of new ones. CI will dramatically alter how education is
conducted - the means for training and transferring knowledge – and its full implementation and utilization
will require a new cadre of scientists adroit at the frontier between computing and biology, able to
recognize important biological problems, understand what computational tools are required, and capable
of being a translator or communicator between more traditionally-trained biologists and their
collaborators, computational scientists who will be just as traditionally-trained. These requirements are
universal; that is, NSF BIO should work with NSF INT and with international agencies to encourage
innovation and sustain the excitement beyond national boundaries. The technology itself will change all
levels of education and BIO should coordinate with EHR and the other research Directorates. A simple
example, beyond the graphical nature of knowledge representation and interactive media as a teaching
vehicle, is remote learning. Democracy is at work on the web, in Nobelists answering the queries of
students and in the ready access with routine tools to the world’s information and knowledge store.

Coordination and Collaborations


NSF in numerous cases will be able to nucleate an activity, but not have to plan for long term, expanding
support, since another agency will ultimately adopt or extend some aspects to meet its own needs; the
other agency may even sustain some or much of the original core. But some research problems, such as
ecology, plant science, phylogeny and the tree of life, the evolution of multicellularity, and of
developmental processes, among others, are research domains or areas that NSF will always own.
Besides applying CI to these kinds of basic questions within the biological sciences, the overall catalysis
of biology by CI will remain an NSF BIO role for the foreseeable future. NSF must ensure that, once the
cyberinfrastructure for biology is put in place, the funding plan and priority level for resource
allocation are in place to sustain the efforts. In particular, intellectual roadmaps linked to
budgetary requirements must be developed in order to ensure that first the pilot efforts and then full
implementations (each after peer review and selection of the best activities) can be funded and
maintained stably in order to deliver on the promise to the community.

A Prelude on Entering Biology’s Century
Commentary and Prescience in the Biological Sciences

Toward a Paradigm Shift in Biology


(Nature, Vol. 349, p.99, 10 Jan 1991)

The questions of science always lie in what is not yet known. Although our techniques determine what
questions we can study, they are not themselves the goal. The march of sciences devises ever newer
and more powerful techniques. The new paradigm, now emerging, is . . . that the starting point of a
biological investigation will be theoretical. An individual scientist will begin with a theoretical conjecture,
only then turning to experiment to follow or test that hypothesis.

The human genome project will continue and accelerate this rate of increase [in DNA sequences known
and in internet accessible databases, notably Genbank]. Thus, I expect that...the total knowledge of the
human organism will be available...by the end of the decade. To use this flood of knowledge, which
will pour across the computer networks of the world, biologists not only must become computer-
literate, but also change their approach to the problem of understanding life.

We must hook our individual computers in the worldwide network that gives us access to daily changes in
the database and also makes immediate our communication with each other. The programs that display
and analyze the material for us must be improved – and we must learn to use them more effectively.

Walter Gilbert

Computing in Molecular Biology: Mapping and Interpreting Biological Information


(Computer, November, p. 6-13, 1991; and Communications of the ACM, IEEE, 1991.)

Biology is in the middle of a major paradigm shift – driven by computing. Although it is already an
information science in many respects, the field is rapidly becoming much more computational and
analytical. Computerized databases of genetic information, for example, let researchers quickly
determine the significance of research findings.

Eric S. Lander, Robert Langridge, Damian M. Saccocio

“Computing has changed biology forever; most biologists just don’t know it yet.”

Michael Levitt, 1999

“Computational Biology will be as essential for the next quarter century of biology as molecular biology
was for the past quarter century.”

William McGinnis, 1999

More than the earlier implementations of research computing and far more than the contributions from
any other scientific infrastructure, the integration and acceleration of scientific computing and advanced
information technology (SC and AIT), driven by NSF, will enable applications for the entire biological
sciences community beyond any expectation we could articulate today. By building a cyberinfrastructure
for the biological sciences (CIBIO), the Biological Sciences Directorate will provide an enduring,
extraordinary foundation for research.

To encapsulate the value of this enterprise by a metaphor, consider the assertion by hockey great Wayne
Gretzky about his ability to score goals. In sum:
CIBIO should allow biologists to “scait” to where the puck will be.

The BIOAC Working Group of 21st Century Biology Subcommittee, 2003

The Workshop Report
Specifying and Building a Cyberinfrastructure To Meet the Requirements of
Biology
The NSF has identified the pervasive, ubiquitous contribution of a set of broad enabling approaches as
cyberinfrastructure (CI), which builds upon a long history of exceptional advances in basic computer
science and engineering, information technology, computer hardware and software, networking, and
wired and wireless communication technologies. The Directorate for the Biological Sciences (BIO) will
establish its own plans for CI in the context of this broader effort. That broader effort began through the
study of a “Blue Ribbon Panel,” chaired by Prof. Dan Atkins, for the Computer and Information Science
and Engineering (CISE) Directorate of the NSF. A summary of their report is in Appendix IV.

CI is perceived by the NSF as contributing not only to basic/technical knowledge and deep understanding
(and even to wisdom), but also to a knowledge-based democracy for science and society - beginning with
widespread access and full participation, and to an accelerating translation from fundamental research to
societal benefits. The aim of CI, as an integrative platform and enabling tool, is to empower all of the
sciences to work on a systems level, while fully encompassing the requirements for ultra-small, focused
studies to ultra-large analyses.

Including databases and knowledge resources, grid architectures and services, software engineering,
telescience (or remote access to specialized methods and tools), collaborations around defined science
projects and around implementing needed technologies, and so on, the implementation of CI has to be
international in scope. In fact, already there are international projects, and the international character
and expectations of CI will increase rapidly.

Cyberinfrastructure is not fully defined at this point. Each community is to establish its own
requirements at the frontier with computing. Thus, CI is a developing, underdetermined set of
expectations, which will grow both as technology advances and as the needs of the sciences are
established. As the entire NSF sustains an active dialogue, CI will be increasingly defined, and will
certainly turn out to be defined in different ways for different communities.

[Figure: Integrated Cyberinfrastructure, meeting the needs of a community of communities. Layers, top to bottom: applications (environmental science, high energy physics, proteomics/genomics, ...); domain-specific cybertools (software); shared cybertools (software); development tools and libraries; grid services and middleware; distributed resources (computation, communication, storage, etc.); hardware. Side labels: discovery and innovation; education and training.]

CIBIO will thus have to be part of a “family,” will have to interconnect, Venn Diagram style, with the
central or parent CSE (or CISE) CI and with all of its siblings. There are huge areas of large overlap,
through common or shared research programs and challenges in understanding the environment, with
the GEO community. This is notably the case for activities such as GEON, Geoinformatics, climate
change, and the marine sciences. However, CIBIO will interconnect with the research and infrastructure
activities in all of the NSF Directorates, including EHR.

Many novel partnerships, involving new collaborations across long maintained disciplinary and sub-
disciplinary boundaries and connected anywhere that makes sense intellectually, will arise due to the
enabling power of modern infrastructure. This infrastructure is absolutely central to 21st Century
Science. While only one aspect of that infrastructure, CI will have an especially productive impact and
will require equally special attention by the funding agencies, by private Foundations, by performing
Institutions and by the entire community. All economic, geographic and technical sectors of the Nation
and indeed the entire world will participate; all communities and all Agencies, wherever they are located
in the world, will have to define their relationship to CI and respond in an exceptionally proactive fashion.
We need the performers of biological research around the world to be fully engaged in contributing to the
data rich, knowledge empowered world of 21st Century Biology. Such complicated dynamics present an
important opportunity for leadership by NSF BIO, while creating substantive challenges as well.

The dynamic infrastructure being driven by the explosive intellectual and pragmatic/ technical growth
within CSE offers both great promise and potential pitfalls for BIO. Both the community and the NSF will
have to be aware of the path taken by CI as a whole, as well as making sure that we in BIO help define that
path. This point must be taken very seriously: BIO must coordinate and “interoperate” with CISE
and secondarily with all the Directorates in defining the extent and fine details of the NSF
structure for CI.

The biological sciences can drive their involvement to meet their needs, as this report recommends. Or,
the members of the community can proceed as before, and then they will simply be dragged along, which
will be far from satisfactory. BIO must recognize that CI is here, is now, has even been growing (up)
for years, but it certainly will be “more here tomorrow.”

The Opportunity Created by Building a BIO-Specific CI


Only the biological sciences, over the past several decades, have seen as remarkable, sustained
revolutionary increases in knowledge, understanding, applicability, as the computer and
information sciences. The exponential increases in computer processor speed (density of the chips),
storage capacity and networking bandwidth have been ongoing, while the biological sciences have seen
both early, extraordinary and unanticipated advances with genome science and exponentials comparable
to those in CSE for sequence and structure data, and soon for many types of even more complicated
data (such as that arising from mass spectrometry and environmental sensors). Any metric of biological
knowledge output would indicate an even faster rate of growth than even the CSE applications (chip
density, cycles, storage, communication bandwidth) to date, which may serve to inspire more rapid
progress on the technology side. At the same time, real world features are essential for modeling and
simulation in any science.

[Figure: levels of biological organization paired with the instruments used to study them. Molecules: synchrotrons. Macromolecular complexes, organelles, cells: microscopes. Organs, organ systems, organisms: magnetic resonance imagers.]

Just in the past few years has the world of scientific computing and advanced information technology
reached the level of being fully applicable to a wide range of deep biological research themes. In
particular, computing has advanced from being able to model tens of thousands, a hundred thousand,
and then a million atoms, to tens of millions of atoms, bringing the macromolecular machines of living
systems within reach for the first time.

From the complementary approaches to complexity to the shared history of incredible, sustained
advances, the two fields seem ideal for each other: multiplying the exponentials through deep
partnerships will inevitably be exciting beyond any anticipation. However, many issues need to be
addressed, from technical to cultural, from education to process, for the two domains to benefit from each
other. In addition, the context for IT and CI involves all of the sciences, and this interconnectivity must
also be considered.

This report focuses on CSE considerations from a biology perspective. At the frontiers with computer
scientists and biologists are quantitative scientists of all backgrounds, engineers, mathematicians,
statisticians, chemists, physicists, and others – who will be referred to collectively as quantitative
scientists. While BIO should especially focus on a rich partnership with CISE, it is this broader mix
of scientists who will jointly write the story of 21st Century Biology, and BIO should look for
opportunities for partnerships around the entire NSF.

As happened already for the past decades during which molecular biology was established first as a
discipline and then as a set of tools for all biologists, an intellectual thrust or push arising from computing
and information technology over the next decade will “meld” with the biological sciences to form a tool kit
for the next generation of advances, expected to be even more profound than the revolution wrought by
molecular biology. NSF has a critical responsibility to outline a long range vision to accelerate the
process of the meld, beginning with CIBIO.

As all domains of NSF science, engineering and education participate in the CI revolution, how can the
biological sciences maximize their benefits and minimize the inevitable, intertwined perturbations,
the growing pains? How can CI drive fundamental biological discovery and empower more
complete participation for all stakeholders? What are the immediate or near term steps to take,
and what management plans (notably, funding and implementation mechanisms) will be needed?
These are among the central questions addressed in our overview. Some representative objectives and
nuggets for the science are included.

The essence, for BIO utilizing CI in empowering 21st Century Biology, is “Keep
Your Eye on the Prize” --

• Invest in People
• Ensure Science Pull, Technology Push
• Stay the Course
• Prepare for the Data Deluge
• Enable Science Targets of Opportunity
• Direct the Technology Contributions
• Establish National and International Partnerships

Invest in People

CI as an empowering, pervasive technology will change how training is done and will bring that change to
all levels from K-12 to retraining of senior faculty. In doing so, CI will also eliminate geographic barriers
from the educational experience, connecting instructors and students throughout the world. These
aspects will be considered by other arms of NSF. To exploit CI fully will require that computational
awareness be among the goals of those building training programs. As undergraduate curricula around
the country change to reflect the convergence of approaches to address 21st Century Biology and to
reflect the need to establish a strong quantitative foundation for future biologists, the nature of what will
be required at the graduate level and beyond will itself change.

Early on, more short courses, of the types associated with MBL (the Marine Biological Laboratory at
Woods Hole, MA) and CSHL (the Cold Spring Harbor Laboratory, Long Island, NY), among other
Institutions, will be especially required. In addition, there will need to be extensive development of upper
level undergraduate courses, with the expectation that initially, they will be heavily populated by graduate
students filling in gaps in their earlier education. Novel training programs of this nature are being
established already for genome (molecular) bioinformatics and will become increasingly common for all
bioinformatics graduate training.

The early, and especially successful, BIO efforts at interdisciplinary training (the RTGs: Research
Training Groups) that helped lead to the NSF IGERT (Interdisciplinary Graduate Education and Research
Training) program, should be inspected again, and the overall context, particularly in terms of the need for
novel training at all of the institutions supported by NSF, considered. BIO should evaluate mechanisms
either for ensuring that more of the IGERT activities fit into the CIBIO framework or for re-establishing
some specialized training activities to accelerate the rate of adoption of scientific computing and
advanced information technology into the biological science community.

As important, significant, and valuable as the NSF IGERT program has been, the challenges of CI not
only require the continued support for such interdisciplinary training, but also require novel training
approaches, particularly for more senior scientists who need retraining, as well as the establishment of
new undergraduate curricula. Training should be built into every major infrastructure award at this
frontier, and not just within centralized programs. The case above, that NSF BIO should reexamine the
need for RTGs, is made from two perspectives. First, the IGERTs are spread around the entire
Foundation and must inevitably reflect this range. Second, numerous reports and articles have pointed to
the need for far more training in the quantitative sciences in the biological sciences and other major
changes in curriculum, which are likely to require some experimentation to implement adequately. Even
though increased training in quantitative methods will provide a framework, the concomitant effort to
introduce more knowledge about scientific computing and information technology provides its own
challenge. To meet these specialized expectations and challenges, a revised RTG program, drawing on
the original successes, could include focused problem solving training environments (programs), and
explore novel (and potentially beyond the more established framework expected for IGERTs) cross-
training, providing an empowerment deep enough to fuel 21st Century Biology.

A different form of “higher education” will be essential immediately: the training of mid-career biologists
and even more senior graduate students on how to use CI and what it can do for their research. One
option that should be considered is a center where biologists can learn enough to specify their needs
accurately, with whomever they need to work; such training needs to start by helping the practicing
scientist identify if they need a CSE investigator, a CSE researcher/academic professional, or simply to
hire a programmer, or instead if there is even an existing tool that will accomplish the required task.
(Admitting to this need shows the range of “introduction to” or experience with computing technology
among biological scientists – some use thousands of hours of cycles per year at high performance
computing centers, and others are just recognizing, particularly from the side of information technology,
that computing will be important for their research.) This approach could be similar to the mouse genetics
courses, the various MBL and CSHL courses, except perhaps of shorter duration but greater frequency
during the year, and only for a few years of the transition to a functioning CIBIO.

Another option would be a Center to which an investigator would provide funds to purchase a service or
obtain unique information. This would be a Center acting as a central repository of knowledge of
biologically-relevant computational tools, and would do software tool development when the needed
algorithm or package did not yet exist. Such a Center must involve training/education among its roles, so
that its implicit knowledge would be disseminated.

In all areas, the opportunity exists to employ new collaborative tools to broaden the traditional
educational experience, allowing individual students, or all students, to be spatially removed from
instructors, and to interact routinely and easily with each other and with their instructor. In addition, the
new CI will facilitate the development of new approaches in an international effort, ranging from a more
regular dialog across national boundaries, to the sharing of educational experiences, and to joint projects
on an international scale. This will help prepare students to work in a global economy, and encourage
rapid progress in the biological sciences.

[Figure: National (Biological, Ecological, Biodiversity) Data Centers]

Ensure Science Pull, Technology Push

The biology community must decide how best it can interact with the quantitative science
community, where and when to intersect with computational sciences and technologies, how to
partner in science projects, and how the NSF should partner administratively. We all know better
than to permit a “build it and they will come” mentality to dominate the discussion and set the agenda.
However, there will be many buildings created for other communities; some of those we will want to live in
and for others we will learn from the design but will reconstruct for our own needs. For economies of
scale, we can’t afford to manage all of the architect, builder and materials costs ourselves, and need
sophisticated partnerships built on the normal metrics for academic teams with CSE/CISE endeavors and
also especially with GEO, that is, in marine biology and environmental science, as well as the research
and education overlaps with MPS, ENG, SBE and EHR. Indeed, the domain of overlap for the
environmental sciences, between BIO and GEO, is already constructing its own cyberinfrastructure.
Climate, biodiversity and other environmental research communities will have strong, multifaceted
partnerships among NGOs, private research entities, and government.

One aspect of ensuring that the scientific requirements of the biological sciences determine what aspects
of cyberinfrastructure are employed first and most extensively within the BIO domain will be to continue to
build the bioinformatics and computational biology programs for BIO. Only very recently has the
information technology revolution reached numerous biological sciences sub-disciplines. The
communities arising from research in ecology, cell biology, plant science and most other BIO domains
have unmet needs at fundamental levels of computational expertise, and their requirements for software
engineering, for example, have to be met to allow active participation across the biological sciences.
What could be called a basal infrastructure has to be in place for the biological sciences before an
elaborate, highly activated edifice can be erected to drive 21st Century Biology.

Stay the Course

In 1984, leading computational biologists, experimental biologists, and computer scientists assembled at
Airlie House to give NSF BIO advice on how to respond to the first step toward a CI, the high performance
computing initiative and what led to funding the NSF National Supercomputer Centers. The discussions
considered the potential for computing across all research areas supported by NSF BIO as well as
biomedical research areas of interest to NIH. The workshop participants concluded that (1) biology has a
role in high performance computing; (2) some of the compelling research problems in biology are already
compute bound, that is, they require more advanced computing; (3) there would be an ever
increasing set of such problems; and (4) more and more of the biological community would indeed be
able to make effective use of such high performance computing resources. {Hillman et al, 1984}

The scientists at that meeting in 1984 raised the concern, that if they put in the effort to port their code
and so on, they didn’t want to have the resources withdrawn, as had happened recently, at that time, for
an NSF disciplinary resource many of them had used. NSF BIO promised it would stay the course, and,
subject to annual efforts to sustain the budget, indeed, it has. Modeling of biomolecular systems, and
biology in general, represented only a fraction of a percent of usage in 1984. As one consequence of
staying the course and working closely with CISE, biomolecular computing had become the plurality user
of the NSF Centers, at 28% of the time, by FY 1998.

The importance then (in 1984), however, pales in comparison to the importance now for NSF to stay the
course. The entire biological science community must and will engage, even those with fewer resources
than those already involved in biomolecular computing. Scientists can now facilitate each other’s progress
in extraordinary ways, and to optimize introduction of 21st Century biology, the biosciences need to be
interconnected to the other NSF domains. For NSF BIO to underestimate the importance for biology
or fail to provide fuel over the journey would be very damaging, perhaps catastrophic, for the
community.

Prepare for the Data Deluge

The biological sciences today are swimming in a swift current of data, characterized by an exploding rate
of data production and an exploding ability to capture and manipulate data, which arises across vast
scales and national boundaries, as well as from numerous disciplines. From acquisition and refinement to
reduction and deposition, the current of data, and the rapids along the course, point to a compelling
requirement across all of the disciplines for analysis tools and for links across these
interfaces.

All modern biology, from low throughput, spatial reconstruction studies on cells to ecosystems, to the
automated methods of microarrays and (further) sequencing and structure determination research, will
produce a high volume of complex, heterogeneous data, with demands on standards, archiving, mining,
federation or integration, and other contemporary issues in information technology. What distinguishes
bioscience research is not the net volume, though the spatial studies described above will have
terabytes to petabytes of data as will any of the high throughput methods (especially, mass spectrometry
and proteomics), but rather the inherent complexity of the data, which will be very heterogeneous in
data type, in modality of acquisition, and in its ties to biological phenomena, arising from the hierarchical
nature of living systems.

Calculus, in managing the infinitely small but large scale of events filled with redundancy, has been the
language of the physical sciences. The processes of biology, the activities of living organisms, involve
the usage, maintenance, dissemination, transformation or transduction, replication and transmittal across
generations of information. Biology has high information content, along with individuality, historicity and
contingency. Indeed, given the diverse data types and other features of heterogeneity and the
hierarchical nature of biology, the biological sciences as a research discipline are said to be an
information science. As such, information technology is the language of the life sciences, managing the
discrete, non-symmetric, largely non- reducible, unique nature of biological systems and observations.

Bioscience has already borrowed from IT in moving first to relational databases, object databases, and to
various standards like XML, with bio-centric implementations (such as CML, SBML, EBL). The wide
variety of data types alone challenges IT. The cottage industry nature of biological data collection
requires distributed data archiving and multiple data resources, run by those with deep knowledge, but
with standards for interaction, integration, federation, for specific queries to merge relevant data to
provide biological discovery and insight. No clear solution has emerged for building and sustaining
biological databases but despite the challenges, hundreds of public databases exist, some of which
represent major community resources, such as the PDB.

[Figure: The Complexity of Biosystems. Vertical axis: number scale (10^0 to 10^6, over size scale from Angstroms to km) for atoms, biopolymers, cells and organisms. Horizontal axis: time scale in seconds (10^-15 to 10^9), extending to geologic and evolutionary timescales. Labeled regions include ab initio quantum chemistry, first principles molecular dynamics, empirical force field molecular dynamics, homology-based protein modeling, enzyme mechanisms, protein folding, DNA replication, electrostatic continuum models, cell signalling, discrete automata models, finite element models, organ function, ecosystems and epidemiology, and evolutionary processes. Legend: regions where computational modeling can be employed today vs. goals for coverage.]

Vigorous research programs, and scientific and administrative processes, are needed to ensure
the continued excellence of the extant public databases, in the face of the data deluge. There has
not been enough consideration of stability and continuity. The professional societies need to be more
firmly engaged. Considerable infrastructure is required just in constructing and maintaining a single,
focused community database. As we require the community to submit more of its observations and to
submit that data more rapidly, and as the community acquires the tools to obtain far more data in
much less time, the challenges for the data resources will grow. (So will the challenges for the data
providers, who must also have the software tools to analyze their own data and extract useful parameters
from community databases to drive their experiments.) At the same time, the downstream requirements,
data mining and other tools for biological discovery, need enhanced support as part of the transition to the
research environment and expectations of this century.

As the data, information and knowledge base for the biological sciences grows exponentially, sustaining
the computational analysis chain will require considerably more attention. That chain, inevitably one with
deep human intervention, runs from data to information to knowledge and, with more sophisticated
contemplation, to wisdom. Bringing modern IT to bear, from data standards to federation tools like
mediation and wrappers, will require substantial collaborative efforts by the community, suitable attention
by the agencies not just to funding but to workshops on standards and the use of grant mechanisms to
ensure collaborations across disciplines, and the full involvement of professional societies and leading
technical journals. BIO staff will need to take lead roles in engaging the breadth of the participants needed.
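
To make the phrase "federation tools like mediation and wrappers" concrete, the following is a minimal sketch, not a description of any existing biological database system. It assumes two toy in-memory sources with made-up records and hypothetical field names (acc, res_A, protein_id, len). Each wrapper translates one source's native fields into a shared vocabulary, and the mediator sends a single query to every wrapped source and merges the answers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Wrapper:
    """Adapts one source's native records to a shared vocabulary."""
    name: str
    fetch: Callable[[str], Iterable[dict]]  # source-specific lookup by key
    field_map: Dict[str, str]               # native field name -> shared field name

    def query(self, key: str) -> List[dict]:
        translated = []
        for raw in self.fetch(key):
            record = {shared: raw[native]
                      for native, shared in self.field_map.items()
                      if native in raw}
            record["source"] = self.name  # keep provenance with every record
            translated.append(record)
        return translated


class Mediator:
    """Sends one query to every wrapped source and merges the answers."""

    def __init__(self, wrappers: Iterable[Wrapper]):
        self.wrappers = list(wrappers)

    def query(self, key: str) -> List[dict]:
        merged: List[dict] = []
        for wrapper in self.wrappers:
            merged.extend(wrapper.query(key))
        return merged


# Two toy in-memory "databases" with different native schemas (made-up data).
structure_db = {"X0001": [{"acc": "X0001", "res_A": 2.1}]}
sequence_db = {"X0001": [{"protein_id": "X0001", "len": 430}]}

mediator = Mediator([
    Wrapper("structures", lambda k: structure_db.get(k, []),
            {"acc": "id", "res_A": "resolution_angstrom"}),
    Wrapper("sequences", lambda k: sequence_db.get(k, []),
            {"protein_id": "id", "len": "length"}),
])

results = mediator.query("X0001")
for record in results:
    print(record)
```

The design point is that each data resource stays under the control of those who know it best, while the agreed field mapping is the only thing the community has to standardize.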

Rather than being condemned to repeat the past, bioscientists must be able to query the world’s literature
each morning, as envisioned by W. Gilbert in 1991, to set their experimental path for the day’s research.
(The vision section, termed A Prelude, reviews this prescient statement, a classic white paper laying out
the future for introducing computational and informational technologies to modern biology, done even
before the web provided the mechanism for the transformation.) This singular feature provides both
the opportunity for a revolution beyond anyone’s imagination and the potential for major
limitations if access is not universal or if the free flow of public information is limited by
technology, imagination, resources, policy or other artifacts.

Enable Science Targets of Opportunity

The most ambitious research projects in the biological sciences, without exception, have a
considerable informatics component, along with any requisite technology and experimental
advances that are also essential. More generally, the stretch-goal projects for the biological
sciences community all require both community-level involvement and the complete
spectrum of CI; namely:

• People and Training

• Instrumentation

• Collaborations

• Advanced Computing and Networking

• Databases and Knowledge Management

• Analytical Methods (Modeling and Simulation)

[Figure: 21st Century BIO-Cyberinfrastructure. Changing How Science is Done; Providing the Tools to Swim in the Rapid Current of Data]

A list that included the major set for all fields in the biological sciences supported by NSF would be
excessively long and inevitably incomplete – every month’s discoveries, sometimes every day’s
discoveries, open new horizons for the biological sciences. Knowledge intensive computing – organized
and understood data – is the theme of computational biology; the tools and approaches of computational
biology are embedded within all of the key science opportunities. A selected set, aimed at being
representative, follows.

1. Establish an integrated understanding of individual biological systems, managing the inherently
different organizational features from molecules to cells and organisms to populations, and build a
dynamic form and function into that understanding at all scales. Simpler systems would include
peroxisomes, endoplasmic reticulum, proteasomes, mitochondria, chloroplasts, and minimal genome
microbes, through to free living microbes and protists.
2. Enable well-tuned, detailed ecological forecasting, such as for progression of invasive species, the
involvement of biology in climate change and climate dynamics, and the spread of new or newly
mutated pathological organisms.
3. Describe the Tree of Life; that is, facilitate the community’s effort to reconstruct the history of life on
earth.
4. Predict cellular function on multiple scales; for example, predict the function of the protein encoded by
an arbitrary, uncharacterized DNA sequence (found only as an open reading frame, ORF) and predict
the chemistry upstream from that genetic sequence. (Modeling cellular behavior would be
complementary to the biocomplexity program at NIGMS, NIH and the genomes to life program of
BER, DOE, and this complementarity as a whole would have a large impact on progress in the
biosciences.)
5. Understand how to simulate biology in precise enough detail to provide external deliverables to other
disciplines and industrial processes; that is, establish a designer biology, through biomimetics, and
processes inspired by well characterized biological principles. Develop a comprehensive synthesis of
microbial ecology, including how ecological processes function in the natural environment and potential
methods for bioremediation. (A complete characterization of microbial ecology, exploiting microbial
diversity and function, requires experimental efforts, informatics, and computing approaches. Strong
teams should be established. The teams should attempt to provide an understanding deep enough
that investigators can actually manipulate the system in predictable ways.)
6. Understand specialization and communication in communities and in multi-cellular organisms. The
focus could be on any level, ranging from free-floating unicellular organisms to metazoans or their
component systems such as the immune or nervous system.
7. Characterize information integration by selected biological systems, ranging from organisms to
ecosystems. For example, determine how organisms make use of their environmental information,
how the nervous system integrates the electrical and chemical information, how a unicell or protozoan
integrates information about its environment, and how a metazoan integrates information that guides
development.
8. Understand a given system in enough detail to build models that simulate complex biological
systems, and allow reengineering or manipulation of the system.
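
Item 4 above refers to sequences "found only as an open reading frame, ORF." As a small illustration of what that starting point looks like computationally, the sketch below scans the forward strand of a DNA string for ORFs. It is a deliberately simplified sketch under stated assumptions: the standard stop codons TAA, TAG and TGA, ATG as the only start codon, no reverse-strand search, and a hypothetical min_codons threshold and toy sequence chosen only for the example. Predicting the function of the encoded protein, which is the actual goal in item 4, is a far harder problem that this does not address.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3):
    """Return (start, end) index pairs of forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon.
    min_codons counts codons from the ATG up to, but not including, the stop.
    The end index is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):                       # three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):  # step codon by codon
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                        # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                     # close it and keep scanning
    return orfs


toy_sequence = "CCATGAAATTTGGGTAACC"  # made-up sequence for illustration
for start, end in find_orfs(toy_sequence):
    print(start, end, toy_sequence[start:end])
```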

Representative Examples

The various genome projects have produced, and will continue to produce, sequence datasets that
can then be used for phylogeny reconstruction; however, the best reconstruction methods are
heuristics for hard optimization problems. Solving these problems for datasets of more than 100
sequences, to an acceptable level of accuracy, is beyond the current capabilities of existing software.
In order for systematic biologists to be able to obtain good estimates of evolutionary trees, new
methods (algorithms and software) need to be developed, and cycles on a high performance
computing platform need to be made available. Novel database technology also needs to be
developed, to tie the inferred phylogenies to sequence datasets. Large-scale simulations also need to
be done to assess the performance of both new and existing software. All such efforts require
collaborations between computer scientists (specializing in databases and algorithms) and systematic
biologists. In particular, such collaborative efforts would enable the entire systematic biology
community to contribute to the final product. The database itself could provide an opportunity for
other scientists (for example, geologists, who study the interactions between the earth sciences and
biological evolution) to benefit.
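
To illustrate why exact methods do not scale and heuristics are needed, the following minimal sketch counts the candidate tree topologies an exhaustive search would have to consider. It relies on a standard combinatorial result that is not stated in this report: the number of distinct unrooted binary trees on n labeled taxa is (2n-5)!!, the product of the odd numbers from 3 to 2n-5. The function name is illustrative only.

```python
def unrooted_tree_count(n_taxa: int) -> int:
    """Count distinct unrooted binary tree topologies on n labeled taxa.

    Uses the standard result (2n - 5)!! for n >= 3; for fewer than
    three taxa there is only one possible topology.
    """
    if n_taxa < 3:
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 100):
        digits = len(str(unrooted_tree_count(n)))
        print(f"{n:>3} taxa: a {digits}-digit number of candidate trees")
```

Because the count grows faster than exponentially with the number of sequences, exhaustive search is not an option at the dataset sizes discussed above, which is why both better heuristics and more computing cycles are called for.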

CI for BIO will enable new opportunities for the well-established, two decades old collaboratory
around long term ecological research (LTER), for research within the ecology/ecosystems community
as a whole, and the new comprehensive, enabling collaboratory called NEON. In particular, ever
more powerful sensors (regularly updated to be fully state-of-the-art), measuring physical, chemical,
biological, meteorological, spatial and other relevant parameters, and associated computational,
informatics and communication tools, will enable ecologists to obtain an in-depth understanding of
population processes and the interplay of all the living and physical players in the ecosystem.
Sensors will provide a revolutionary advance in tracking threatened species and otherwise following
and characterizing the behavior of individuals in populations under study in the natural world. Any
individual creature can have its own IP address, an extraordinary expansion, for example, on
traditional tracking of individual marine species. The transformation of data into information, and
ultimately into knowledge, that the new sensor technology will bring to population studies cannot be
overstated. The “24 by 7” monitoring of environmental and population events, through analysis of
individuals, will require smart software filtering of the data, but will enable the establishment of robust
models. All sponsors and all performers in the environmental sciences will incorporate these kinds of
technologies, methodologies and infrastructure into their research. As this major, new, innovative,
exciting and truly extraordinary platform for scientific discovery and societal relevance is established,
NEON, which can only be conceived and implemented in a comprehensive cyberinfrastructure
environment, will provide both robust approaches to reductionist analysis of specific organisms and
phenomena, and the ability to construct novel syntheses toward a fully comprehensive understanding
of larger scale ecosystem behavior.
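
As one hedged illustration of what "smart software filtering" of continuous sensor streams might look like in its simplest form, the sketch below applies a deadband filter: a reading is kept only when it differs from the last kept reading by at least a chosen threshold. The report does not prescribe any particular filter; the function name, the (timestamp, value) record layout, and the threshold are assumptions made for this example.

```python
from typing import Iterable, List, Tuple

# Assumed record layout for this sketch: (timestamp_seconds, measured_value).
Reading = Tuple[float, float]


def deadband_filter(readings: Iterable[Reading], threshold: float) -> List[Reading]:
    """Keep a reading only if it moved at least `threshold` from the last kept one."""
    kept: List[Reading] = []
    for timestamp, value in readings:
        # Always keep the first reading; afterwards keep only significant changes.
        if not kept or abs(value - kept[-1][1]) >= threshold:
            kept.append((timestamp, value))
    return kept


if __name__ == "__main__":
    stream = [(0, 20.0), (1, 20.1), (2, 20.05), (3, 21.0), (4, 21.2), (5, 19.5)]
    print(deadband_filter(stream, threshold=0.5))
```

A filter of this kind reduces the volume that must be transmitted and stored from each sensor while preserving the events of interest; a production system would presumably also need calibration, gap detection and quality flags.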

[Figure: Characteristics of Ecological Data. Data types are plotted by volume per dataset (low to high) against complexity/metadata requirements (low to high). Labels shown: satellite images, GIS, weather stations, business data, biodiversity surveys, primary productivity, population data, gene sequences, soil cores, "most ecological data" and "most software."]

The revolution wrought by NSF in plant genomics from the model to crop plants has changed the face
of the biological sciences forever. Through enabling a new generation of plant science research, the
construction of a comprehensive CI for BIO is a key next step in this revolution. Cutting edge stretch
goals, which will require collaborations among experimentalists and computational biologists and
strengthened bioinformatics support, are:
1. Identify, characterize, and understand the genes responsible for the domestication of crops;
2. Characterize and understand the mechanism of polyploidization and genome reduction;
3. Uncover and characterize the molecular mechanism of symbiosis.

Test beds for Establishing Comprehensive, Collaborative CIBIO Teams (CITs)

BIO should select a few programs, termed CITs for convenience herein, to pilot the full introduction of
the advantages of cyberinfrastructure, from multiscale, federated information environments to virtual
centers or collaboratories, from shared expensive instrumentation to novel knowledge repositories,
and certainly, with rich educational environments. Given the priorities for NSF and the state of the
science and the community’s involvement at the interface with information technology, among the
choices would be CITs for microbes (genomics and communities), ecology and ecosystems and plant
genomics. Ultimately, as NEON comes to be, NEON itself (for example, with the information
requirements and implications from its embedded sensor nets, populations wirelessly tracked) will
intrinsically be a CIT, but the richness of the infrastructure and the research
discoveries could both be enhanced through a concentrated, focused effort at strengthening the corresponding CI.
At this moment, BIRN represents the most fully implemented example, a functioning, expanding CIT,
within the biological sciences, although BIO can also learn from NEES, GriPhyN and the other early
CITs outside biology. A representative, obvious and important example of an extension to specific
BIO domains would be an ecological CIT, a pilot with planned, regular expansion as a CI for
ecosystems science, ECI or an EARN, ecological analysis research network, with a coordinating
center and one to two dozen research partners. A detailed description of such a model for the
ecological sciences, with the aim of achieving a grand synthesis for biocomplexity and other aspects
of the exceptional components and dynamics of ecosystems, is in an appendix. Such a Center would
be developed in consultation with the cognizant professional societies and a specialized BIO
workshop.

Select and Direct Technology Contributions

Building a cyberinfrastructure for the biological sciences, selecting the right targets and milestones, will
require careful attention to the data intensive nature of all of modern biology. Beyond data manipulation,
the technology must enable an easily accessible knowledge management framework. Biologists are
already collaborating with computer scientists to create pilots and establish the process by which the
incoherent output of large numbers of investigators can be channeled so that all may benefit. Early
steps for CIBIO are to expand upon BIO’s database and information management activities, using
the successes in structural biology and ecology, and to establish a broad cyberinfrastructure umbrella
for computational biology and bioinformatics throughout the BIO research arenas. Keeping these
principles in mind, we provide a set of priorities on the technology side of CIBIO.

1. Establish how to create database federations: linking disparate databases. Support research for
biosciences into:
a. Federation and integration of heterogeneous databases; ontology development; pedigree;
data validation; provenance chains; the processing pipeline, and other bioinformatics
tools.
b. Data mining, data exploration capabilities, affordable tools.
c. Sustainable, stable knowledge resources.
d. How to optimize analysis algorithms for interactions with databases TOL, computing with
inputs that change.
2. Large scale data analysis that cannot be accomplished on a desktop.
3. Supporting advanced simulations, multi-scale, standardized methodologies, linking, chaining
uncertainty.
4. Network middleware for domain services, throughput will change the way we work with data, grab
whole databases or application servers.
5. Sensor development systems.
6. Cyberinfrastructure resources: professional services, systems support, helpdesk support,
resource centers and tools for distributed infrastructure development. (NSF PACI centers do this
to some extent now, with some specialized software support, but BIO should work with the
community to identify key needs and set pathways for future implementations.)
7. Hardware applications raise numerous issues; for example, commodity processors might not be
useful for all BIO problems, and the overarching principle is “better” (routine, reliable, easier and
more useful) access to computational resources.
8. Major need for next generation biological science: the means for interacting with sensors,
including protocol stacks.
9. International collaboration on development, long-term, intergovernmental collaboration, co-
development with sister programs.
10. Mechanisms should be in place for the hardening of community application code, following extensive
debate and evaluation through professional societies and peer review.
11. Develop collaborative projects to address integrated environmental challenges – research goals
that lie beyond the expertise and resources available to one community and require broad
participation – and require a very rich information infrastructure, as well as build out of LTER and
NEON.
12. Support an interface of experimental biology, bioinformatics, and software engineering through
interdisciplinary teams who establish modeling frameworks that integrate biological systems with
physical and engineering systems for insight into how living processes and organisms function at
a systems level.
13. Combining spatial images, observations on species, variety of physical/digital data; this should
include biological information stored in multidisciplinary, knowledge resources that include
geospatial, demographic, economic and other data for land use and disaster management.

Establish National and International Partnerships

In an era of data-intensive biology, reliable, routine and robust access to an international level of
information infrastructure (to the organized products from observations, modeling and interpretations
generated the world over) will be critical in order to sustain progress, for scientists to remain competitive,
and to exploit the potential insight to be derived from comprehensive knowledge resources. Storage grids
and compute grids will frequently not be local and sometimes not even regional, and data grids will
certainly be widely distributed, coupling input from very remote sites. Research contributions, perhaps
even more notably in biology than in other sciences, will arise from the entire world. Tomorrow’s CI will
bring together remote resources (expertise, instruments, data resource, computer platforms) and provide
access from the end user’s desktop. This aspect will have no disciplinary, national, political or cultural
boundaries, and reflects the growing awareness that science itself is a global enterprise, for example, in terms
of new discoveries and insight to come from numerous research settings around the world, the existence
of international databases, the explosion of electronic literature, and the extraordinary depth of
information available on the world wide web. In addition, the growing international flavor, beyond the
leading economies of the world, can be seen in the rapid spread of ideas in electronic as well as journal
hard copy form, the increased facilitation of the exchange of data with colleagues around the world, and the
strengthening of the role and recognition of international scientific organizations (or societies). The most
obvious transformation is the development of an international e-science community that debates and
exudes excitement 24 hours/day, 365 days/year. Furthermore, an international flavor will facilitate and
will often be needed to address many scientific questions (often with social impact) that are by their
nature global in scope, such as understanding basic ecological and ecosystems processes, especially in
the context of global ocean / atmospheric circulation, where observations, expertise, and resources are
needed from across the world. After all, political, social, cultural and economic boundaries are human
inventions, and the physical world, while it may have shaped some of those boundaries, follows its own
“path.”

While local, regional and National solutions to immediate requirements have, of course, to be established
as quickly as possible, biologists need to envision a global grid, and to think and act locally, regionally,
and globally. The challenges, not so much in thinking but very much in implementing, of international
contexts for anything, especially infrastructure, are quite large. As a consequence, from the beginning,
the design and implementation of a National CI should be considered in the international context. Given the
requirement for access to the world’s literature on a routine basis, the expertise and the resources to build
CI will not arise from only a single country or region of the world.

Within the biological sciences, numerous examples already exist of international interactions that rely
upon the current, albeit partial, implementation of CI. For example, the Protein Data Bank (PDB) is the
international repository of 3D macromolecular structure data, and is now evolving into an internationally
managed activity. Nearly from its beginning, GenBank has been an international collaboration in storing
and managing DNA sequence data. The International Long Term Ecological Research (LTER) activity is
a conceptual network of researchers, using research resources sited around the world for exploring
regional and global questions in ecology. PRAGMA is an organization that has biological applications as
a key set of drivers to build collaborations among research efforts located around the rim of the Pacific
Ocean. The National Center for Microscopy and Imaging Research (NCMIR) provides international, high
bandwidth, remote access, via dedicated Internet connections, that allows investigators to obtain three
dimensional information on biological objects. Among other instruments, NCMIR houses an
intermediate voltage microscope and provides software tools for computed axial tomographic
reconstruction of objects within thick specimens using information from rotated images. Investigators also
use similar methods developed by NCMIR to access a high voltage electron microscope in Osaka,
Japan. Data from such studies is incorporated into a multiscale database termed the Cell Centered Database
(CCDB).

[Figure: Model-based Integration of Multi-resolution Data: Development of a Cell Centered Database. Labels shown: parallel computing resources for tomography; spatial database of rat brain anatomy; models; neuronal models; database federation; imaging databases; large scale 3D EM reconstructions; cells and tissues; modeling cellular microdomains; cellular processes; cellular microdomains; macromolecular distributions; correlated LM and EM; hi-throughput tomography.]

These activities, which are only selected examples among numerous biological science projects with
international components and involvement, illustrate the value of working across national boundaries and
the extra complexity (language, culture, policy) and time investments needed to be successful. Some
specific additional challenges for international collaboration and the associated approaches for NSF
include:
• Funding of joint international projects (funding both sides of the collaboration)
o Work with funding agencies in other countries.
• Accessing data for comparison runs into potential barriers of different laws in different countries.
o Work with government agencies to ensure the basic principle that “open access” to
publicly funded data is guaranteed for scientific and educational endeavors.
o Work with International Agencies to accelerate development of CI in developing
countries and to facilitate their access to resources and their ability to develop needed
expertise.
• Shared resources will be developed and deployed under local (at the National level) funding, but
will become part of the global CI.
o Work with Funding Agencies from cognizant Nations to reach simplified principles for
sharing the resources.
• Exchange of researchers and students among Nations will be important in ensuring the most
productive international CI environment.
o Develop incentives to encourage undergraduate, graduate and postdoctoral students to
spend time outside their own culture conducting research.

To fully establish a CIBIO, the investments in people will need to keep in mind the international
implications. Researchers will need to have an opportunity to be exposed to the global consequences
and environment, and senior scientists must be empowered to prepare future research generations to
work productively and succeed in the global context that is already beginning to come into being. This will
entail incentives to change the overwhelming imbalance in the exchange of students, especially in the
sciences. But this effort is essential to overcome the too prevalent but naïve belief that the only good or
reliable science and technology research and development is conducted in the United States or perhaps
the US and Europe.

Two corollary responsibilities arise from the simultaneous importance of databases for 21st Century
Biology and the related consequences of the growing international nature of information resources.
Considerable community involvement and discussion is needed to ascertain the right directions, and the
community must be encouraged to see the big picture, to understand the longer term implications. The
very word “community” must also mean all users of the databases, not just a narrow view held by a few
major data providers, or a view exclusive to the keepers, managers, or archivists of the database. On the
other side of the coin, the agencies in the USA, and in particular NSF with its unique expertise, point of view
and credibility, must also assert a leadership role from within the agency, not just in terms of dollars
provided to the community. The managers of a database cannot speak for the Nation or for the
community in the same way, that is, with the same authority, credibility and impact, as can officials from
the agencies. Thus, to ensure the requirements of the community are met, NSF must take a proactive
role in international settings. Of course, NSF must provide continuity and reliable, sustained funding for
major information resources in its domain, but it must also participate in international standards setting;
the credibility of NSF is an essential vehicle to ensure the international effort is on track, that the US effort
is focused in the right directions and remains state of the art, and that access to important databases for
the biological sciences is not compromised due to competing standards established in other international
settings or to commercial interests.

Implementation Strategies
NSF faces many hurdles in attempting to build a cyberinfrastructure for all of the sciences, and BIO
certainly will have its own share of issues to address. The scale of the costs associated with CI means
that BIO will have to choose directions carefully, build pilots and then expand them through robust
mechanisms, and find suitable partners whenever possible. Even once adequate levels of funding are in
hand to support initial community activities, BIO will have to balance a portfolio of investments. This
is considered below under “Funding and Management Mechanisms.” NSF does not expect that its
budget will grow fast enough to build CI by itself. Even with extraordinary budget enhancements, since
the requirements for CI are fully intertwined with all research activities, the interdisciplinarity and scope are
such that NSF’s programs must work together and that NSF working with other agencies and countries is
essential now and will always be essential. The local, regional and international impacts and
responsibilities for BIO are considered below under “Outreach and Partnerships.” Examining a
variety of mechanisms in place already, both within NSF as well as in other agencies, will facilitate BIO’s
efforts to ascertain the what and how for their initial investments and actions. Among the responsibilities
is to balance the options and prototypes, to be careful not to prune early growth, and to confront head on
the major challenge of sustained investments in all parts of CI for biology, ranging from the adequate
support of data resources from acquisition to management and development of query tools, the
maintenance and on-going development of community software for modeling and analysis, the formation
of innovative collaborative efforts driving discovery along the frontier, and the development of new as well
as expanded vehicles for education and training.

Funding and Management Mechanisms

Details about internal management will certainly have to be worked out by NSF BIO, but considerations
from the perspective of researchers are provided. These are in the form of recommendations as to
actions by BIO.

• A suite of channels, alternative vehicles and pathways for investigators to obtain funding, are
needed to ensure creative ideas can flourish for infrastructure; a variety of means for funding CI
should be created, both within existing and in new programs that focus on CI BIO.
• One avenue for maintaining a suite of channels was the choice between regular investigator
initiated proposals/awards and the CISE program PACI; while weaknesses have been noted for
such overarching programs, PACI provided a flexibility for starting new directions and for taking
risks. As PACI comes to a close, the overall pipeline and its component parts (science
challenges, software requirements, connection of science pull to technology push, and so forth)
should be considered by the research Directorates and notably BIO. ITRs will not fill the void
being left by the PACI program. ITRs are long term, basic research, but there is little expectation
that these would be for infrastructure development and deployment. Other mechanisms for
development and maintenance will be required; to participate fully in the CI of the Foundation as
a whole, all Directorates will have to ensure a core effort for their own disciplines. Simple
infrastructure needs for research directorates will not be met by CISE, and BIO needs to engage
in a dialogue with CISE over expectations and how to sustain the right focus for biology in the
new context of CI.
• The biological science community varies widely in its degree of adoption of scientific computing.
For some communities, a service center is needed, tailored to the unique situation of the cognizant
subdiscipline and community, where a given set of BIO PIs can turn to find the CS partnerships
they need. There could be a stand-alone Center, with some similarities to the LTER Coordinating
Center.
• There needs to be a critical mass of people focused on bio cyberinfrastructure. This could focus
activities in groups of investigators, for the development and maintenance of CI as well as for
training the next generation of scientists.
• CI BIO will range from leading-edge-to-routine (“every day”) facilities and tools. BIO in setting up
funding mechanisms should recognize that there are different kinds of infrastructure, a range of
human experts and similarly, a range of community software code hardening and support. These
activities could fall under the same coordination or research support ‘center’ as above.
• A centralized resource or center could also serve to ensure the community can adopt the latest,
best hardware and software tools, can enable more ready access to new technology, and
facilitate standards that allow each member of a distributed team to accelerate their efforts by
connections to the information obtained by others. NIH has funded a project, the Biomedical
Informatics Research Network (BIRN), described below, that federates data from a
community of researchers, as well as building other collaboratory aspects for 21st Century Biology
research. This project may be a model for various CIBIO aspects. As currently constructed, BIRN
has a coordinating center to put out all of the hardware, software, pipes, and to get the distributed
PIs to move their workflows into that framework. Currently, the Coordinating Center is funded at
$2.5M/year for CI people to support the 200 investigators.

[Figure: Model Exists: Architecture To Support a Biological Informatics Research Network. BIRN Phase I, 2001-2002: form a national scale data grid and federate multi-scale neuroimaging data from centers with high field MRI and advanced 3D microscopes. Labels shown: UCSD; Harvard; Cal Tech; UCLA; Duke; NIH Centers for Bio Imaging and Computational Biology & NCRR Research Ctrs.; NSF NPACI w/SDSC; Cal-(IT)2; “Deep Web.” Integrating cyberinfrastructure to link advanced imaging instruments, data intensive computing, and multi-scale brain databases. Caption: Test Beds for an NSF BIO Cyberinfrastructure.]


• The NIH National Center for Research Resources also supports several computing resources.
The Research Resources provides funding at the level of $1M per year for the development and
maintenance of software, with components of service to the community, training on the tools of
the resource, and clear dissemination of those tools. Part of the review of the Resource is in
terms of the use of the tools and the quality of the science produced from those tools. Successful
resources also can be multidisciplinary, employing academic professionals. While the NIH
Research Resources have leveraged heavily off of the CISE investments in computing and
computing infrastructure, their success may be useful in exploring how to introduce numerous
basic biological sciences communities, which cannot access NIH funding, to advanced scientific
computing and information technology.
• Advancing and hardening research grade, internal lab developed, software is an important
elementary consideration for CI. BIO should complement the new software maintenance
program (from NIH) that will support the maintenance of community codes.
• Extant phylogenetic public software, along with code available from various labs, and concomitant
access to conventional desktops, are not enough. No means exists today for testing and
developing new algorithms with community involvement or for doing so in a rigorous fashion. To drive
this community collaboratory will require a team of professional programmers; faculty from labs
around the world would design the applications. A committee will have to be established to
decide who gets priority; that committee should be approved by NSF, not simply established by
the Principal Investigator.
• The NSF cyber-infrastructure goals must be at a higher level; that is, the support of efforts in
establishing basic research software is not “on the table,” and is not likely to be funded through
new CI programs within or as partnerships with CISE. However, BIO must address the basic
software needs for individual research communities. A BIO CI program would have to collaborate
with programs across the Foundation, set high goals for CI within BIO itself, and also evaluate
“catch-up strategies” for individual sub-disciplines within BIO that would otherwise not be ready to
participate fully.
• The CIBIO could be initiated with three tiers: (1) select via peer review a small set of “critical
mass” activities – a core team, probably distributed geographically, with enough funding to have
an early impact; (2) find similarly a distributed set of medium-sized projects in domains that tap
into this core; (3) identify smaller, individual investigator activities that can benefit by direct
connections to the infrastructure.
• NSF BIO will need to structure the funding mechanisms to be sure the funds are indeed used for
projects either directly empowering investigators to use CI or to develop the CI for BIO, and are not
directed into conventional projects.

BIO should establish a cross-cutting CI activity for BIO, with separate review processes, and
including invigorated bioinformatics and database activities and a modeling and simulation or
computational biology core. NSF pioneered the federal support of database activities and
computational biology, and should now exploit the CI opportunity to create a new generation of these
research endeavors.

All categories of infrastructure are increasingly important for scientific research, but a strong
cyberinfrastructure will be particularly essential for the biological sciences. What will be important is to
recognize that infrastructure cannot be treated the same as individual research proposals. Robust,
rigorous peer review is very important for establishing the best opportunities; competition is also
important and similar, overlapping efforts will need to be initiated in many cases and then the best project
will ultimately become clearly identified. However, infrastructure needs simply cannot be evaluated
directly against individual research, and separate, centralized review and oversight will be
needed. Infrastructure benefits all, but has a different time frame, different budgets, different staffing
(more academic professionals), and cannot be properly reviewed in connection with individual research
projects. Public policy, not peer review, is the basis for balancing levels of funding infrastructure and
investigator-initiated projects, levels balanced to ensure optimum progress by the community.

Collaborations across biology, across institutions and across other disciplines will be the hallmark of 21st
Century Biology. In allocating funds along with ensuring appropriate means for peer review of CIBIO
projects, NSF must put in place highly effective channels for collaborations beyond those for existing
programs. NSF must prepare accordingly by sending an even stronger signal than that provided by the
existing mechanisms.

Suggesting that the biological sciences will participate in big science activities is both quite misleading
and needlessly controversial. Instead, a healthy mix of microscale (i.e., the so-called cottage industry or
small lab, individual investigator, hypothesis-driven research) with mesoscale (i.e., interdisciplinary,
collaborations and team efforts) is needed and is today being established naturally, without centralized
planning or top down intervention. In this era (for simplicity according to its emerging properties) of
mesoscale biology, the NSF should be particularly attuned to these distinguishing features:

• Nurturing Collaboratories: intermediate scale environments for remote access to specialized
instruments, and for interdisciplinary, multidisciplinary, and sometimes multi-institutional interactions,
both local and distant.
• Sustaining Portfolio Balance: instrument development (high versus low risk, exploratory versus
implementation) and prototype applications.
o Teams are needed; there should be “ferment” for selecting the direction, then driving
instrument development (by situation dependent, biology pull and technology push);
funding should also facilitate commercialization and getting tools to the community for a
broad impact.
o Portfolio policy-level considerations include cost sharing in instrument acquisition and
mechanisms to enable PIs to sustain applications / cover recurrent costs of user facilities.
• Facilitating effective utilization of the Data Deluge: rapid data acquisition for high throughput
biology and equally effective acquisition for complex data sets for many lower throughput
approaches; organization and long-term maintenance of data sets; creation of novel query and
integration tools.
• Developing Quantitative Approaches for all fields of the biological sciences: move beyond binary
biological questions and answers (yes or no, up or down, on or off, spot or no spot), and provide
a strong extension of adequate training to address the complexity of biology.

Outreach and Partnerships

As pointed out above, each component of all of the science-funding agencies will need to establish its
own course, since all communities will participate in CI. Of necessity, the internal NSF interactions for
BIO must begin by coordinating as much as possible with CISE, and by exploring options, most notably
the potential for joint programs. Especially in the environmental context, there will continue to be a major
area of overlapping interest with GEO. Nonetheless, CI promises to be the most interconnected activity
at NSF in its history, and there will be productive interfaces with all of the other Directorates.

For reasons described above and in the background material in the Appendix, NSF is the only agency
that can lead CI, and BIO is the only possible leader for the biological sciences effort. However, nothing
succeeds like success, and the scientists working in applied life science research and funded by the NIH,
DOE, USGS, ONR, or components of other mission agencies, will require access to CI, and thus, the
mission agencies should look to BIO’s efforts as a model. Conceived and run as more targeted efforts
than an NSF project, one arm of NIH is already invested; other Institutes are interested. A role for NIH in
their biomedical world can be expected, that is, NIH will continue to build group activities around specific
health topics, such as for the human and model brain studies facilitated by the BIRN project. At this point,
NSF has covered much of the infrastructure development in basic CSE, and driven innovation in early CI
across the biological sciences. What could prove to be very important for the community will be the
potential, led by NSF, of “pulling or pushing” NIH a little further into the development of specialized CI,
and helping with seamless connections between the CI of basic and biomedical science research.
Similarly, private foundations, most notably the Howard Hughes Medical Institute, naturally will also have
to extend CI to their investigators, and opportunities for collaborations should be explored.

The pervasive, ubiquitous connections that must characterize a fully successful CI will require a form of
“intellectual glue,” a willingness to fill in the gaps and reach out to all communities and Institutions. The
bottom line on outreach and partnerships begins with the core: BIO can ensure democracy and that CI
delivers a new era in biological sciences research. This role is essential, and no one else will, or
even has the capacity to, undertake such an effort.

Just as the grid is universal, CI has huge international implications. Through professional societies and
continued use of BIO’s ongoing relationships, BIO will need to coordinate with policy makers in Europe
and Asia. NSF’s International Directorate has already begun a CI effort, known as PRAGMA, linking the
USA and Far East Asia. CIBIO should prove an opportunity to cement relationships with Japan and the
EBI around the Protein DataBank, and more generally, to ensure that there is an international information
infrastructure for the biological sciences, or, in other words, to ensure that there are common standards
and shared efforts for ontologies.

Immediate Steps for BIO
Continue to take risks. NSF BIO has the capacity to be far more adventuresome than the mission
agencies. The plunge is essential for 21st Century Biology, and the adventure will benefit not only
fundamental biology but also the applied biology supported by the mission agencies.

Ensure success through adequate, sustained funding of the best activities. A very serious, central
issue that urgently must be addressed before bringing an implementation plan to the community is what
are the commitments for sustaining the infrastructure projects over 15 years? Who actually runs a given
project needs to be determined by peer review; investigators and teams will compete, and some
projects will change ownership. However, it will take a number of years to establish the infrastructure,
and once established, CIBIO must stay in place to sustain the full development of 21st Century Biology.
A CI for the biological sciences is, of course, going to require attention and support for the indefinite
future, but no matter what first generation implementations are established, there will be a need, along
with routine community involvement and active dialogue on the best mechanisms to drive research, to
establish plans for a sunset review of the projects as a whole and a careful full assessment of mechanisms,
progress and impacts. A new plan should be established as technology and biological understanding
advance, and of course, the sun might rise to shine in the same path and on the same organizations.
(Sunset involves rigorous assessment, yet often should simply lie before renewal.) In consideration of the
engineering research centers and the science and technology centers, we suggest that this broad
overview occur in time to set a revised agenda in this fifteen-year time frame. The time frame is chosen
to emphasize the deep and central significance of stability in putting together productive teams at
multiple institutions across many disciplines. Without stability, it will be difficult to develop and clearly
impossible to sustain an efficient, effective cyberinfrastructure for biology. One vehicle to begin
discussion about the necessary long term plan would be for NSF BIO leadership to put together a set of
milestones or a roadmap with projections for the lifetime first of developmental and then of
maintenance efforts, and share that long range plan with the BIOAC and ultimately, through numerous
society meetings, with the broad community.

Prepare a five year plan as a first step to place the biological sciences on the map toward a
successful CI; that is, identify the especial role NSF BIO will play and empower the community to
look forward effectively. An implementation workshop would be an effective vehicle to review this plan,
which could be developed through a leadership retreat coupled with post-panel long range plan
discussions and the output from the satellite meetings of professional societies. BIO must also make
sure the plan is international.

Anticipate the accelerated pace that will characterize CI for all of the sciences. The growth path will
be so rapid, the requirements so extensive, and the costs so significant, that a broad swath of the BIO
research community needs to be involved in planning the implementation phase. To ensure the best
ideas can compete, the entire community, more generally, needs to be forewarned and encouraged to
prepare (in order to participate fully).

Obtain input from the broader BIO community. Given the extraordinarily broad importance of CI,
building a cyberinfrastructure for the biological sciences will absolutely require considerable, initial and
ongoing input from the entire community. One mechanism, which could prove particularly useful and cost
effective, would simply be to include a talk, a few talks, or a round table discussion at subgroup meetings,
i.e., at the satellite meetings typically held before biological science professional society annual meetings.
BIO program officers could participate in a session in which a given discipline examines its needs and
well-established investigators in the field address their vision for CI. Such discussions will also be
important to ensure full participation and thereby the consideration of the opportunities for all of BIO (at
the level of the specific scientific domains of individual Divisions). Thus, biological society meetings could
serve to help NSF gather information and plan how to implement the funding. There have been examples
of this previously in the biology community’s interactions with NSF, from the role the Genetics Society of
America plays in genetic stock centers to the involvement in the Protein Databank by the International
Union of Crystallography. Programs at other agencies, such as SciDAC (Scientific Discovery through
Advanced Computing) at the Department of Energy, and the National Tissue Culture Cell Repository, at
the National Institutes of Health, also reflect community input, and may represent collaborative channels
for communication and outreach to the community. As BIO moves toward an implementation phase, NSF
could release a call for proposals aimed at scientific societies, to encourage them to identify key needs
and requirements and make sure that the CI activities are inclusive and bring in new people. In addition,
a larger meeting to consider specific implementation across the BIO domain will be very important.

Appendix I
July 2003 Central Questions for a CI BIO

Questions grouped in order to focus and facilitate the breakout group discussions. Each breakout group
and each attendee considered all questions.

A:

Why is cyberinfrastructure for biology so important? What difference will it make?


What is its scientific scope?
Where are we now? What are successful examples? Difficulties?
Where do we have to go? Opportunities? Challenges?

B:

What is its technology scope? Data Intensive Bioscience, Knowledge Management?


What are the educational requirements and opportunities?

C:

What is its administrative scope?


What do we need to get there? Funds? Management?
Balance between NSF, Institutions; Cost Sharing vs Stable Adequate Funding
What should BIO itself do now? Who will be natural collaborators?
Internal Agency Partners, External Agency Partners
Not for Profits; NGOs; International Implications

D:

What further meetings or actions are needed in the near term? What are the first steps for BIO?

Appendix II
Schedule and Assignments
(14-15 July 2003, NSF BIO Workshop)
14 July 2003
Morning Session
9:30 start.
Introduction, Welcome, Charge by AD, BIO; Workshop Chair, BIO AC (9:30-9:40)
Presentation by DAD, CISE on CI as viewed by CISE (9:40-10; Qs&As, 10 - 10:10)
Review of Initial Requirements and Overview of CI-BIO (“seabio”), CH (10:10-10:20)
<break>
Examples and Models for CI / examples from biological science (10:40-11:40)
Overview of Grid, CI Technologies, Issues, CA (11:40 – 12:10)
Working Lunch (12:10-1:10)
Lunch discussion includes overviews, general group discussion and assignments for breakout groups

Afternoon Session
Breakout Groups formed, begin (A. 1:10-2:40; < break; then reassemble>; B. 3-4:30)
A. Science Assignments; Science Drive, Pull; “Why SciBIO CI,”
B. Application Assignments; Generic Infrastructure; Technology Push
Presentations from Breakout Group Chairs - Review Initial Contributions (4:30-5:10)
(<break>)
Round Table Discussion (5:30-6:30)
Break for dinner by 6:30; expect continued discussion

9 PM, hotel
Brief Meeting of Writing Group/Steering Committee, Review Outcomes, Plans for Day 2

15 July 2003 - Day 2

Morning Session
9 AM Start
Reform Breakout Groups (A. 9-10; <break>; B. 10:15-11:15)
[Review Monday “A” and “B” discussions, add “C” to first; add “D” to second discussion]
[During Breakouts, Complete First Draft Major Writing Contributions, esp. both re BIOSCI and General
Enablers – connections to CISE & Technology]
Review Output for consensus (11:15-12:15)
Working Lunch – continued discussion
[Outcomes, Implications, OPTIONS FOR FUTURE MEETINGS, OTHER STRATEGIES]

Afternoon Session
Final Draft of Each Section Prepared 1:15 – 2:25; outline key points/prepare overview
Overview presented to AD, BIO. possibly also to CISE, other NSF. (2:30-3:15)
<break>
Integrated Draft Prepared (3:30 – 6:30 PM)

Writing Team - through working dinner Tuesday.

Day 3 – Wed AM 16th, revised “first draft” given to AD, BIO, for use in NSF August discussions.

Final Version, Slides completed by end of August.

Package presented, reviewed at BIOAC Nov 14, 2003

Assignments

Chair, Breakout Group 1: Gwen Jacobs

Lead Scribe: Susan Stafford

Chair, Breakout Group 2: Jim Beach

Lead Scribe: Paul Gilna

Workshop Chair and Editor in Chief: John Wooley


NSF Liaison: Judy Verbeke
Other Writing Team Members: Gwen Jacobs, Susan Stafford

Workshop Administrative Support: Amanda Voight

Follow-up Administrative Support: Joy Gorback


Web Design and Implementation: Kareem Elbayer
Printed Text Editing and Implementation: Courtney Smoot

Appendix III
Workshop Participants
Nancy Amato
Texas A & M University
amato@cs.tamu.edu
(979) 862-2275

Peter Arzberger
San Diego Supercomputer Center
parzberger@sdsc.edu
(858) 534-5079

Paul Barber
Boston University
pbarber@bu.edu
(508) 289-7685

James Beach
University of Kansas
beach@ku.edu
(785) 864-4645

Helen Berman
Rutgers University
berman@rcsb.rutgers.edu
(732) 445-4667

Emery Brown
Massachusetts General
brown@srlb.mgh.harvard.edu
(617) 726-8786

Brenda Claiborne
University of Texas San Antonio
Bclaiborne@utsa.edu
(210) 458-5487

Mike Colvin
Lawrence Livermore National Lab & UC-Merced
Colvin2@llnl.gov
(925) 423-9177

Mark Ellisman
University of California San Diego
mhellisman@ucsd.edu
(858) 534-2251

Deborah Estrin
University of California LA
destrin@cs.ucla.edu
(310) 206-3923

Stephanie Forrest
University of New Mexico
forrest@cs.unm.edu
(505) 277-7104

Claire Fraser
The Institute for Genomic Research
cmfraser@tigr.org
(301) 838-3504

Paul Gilna
Los Alamos National Lab
pgil@lanl.gov
(505) 667-3114

Teresa Head-Gordon
University of California at Berkeley
TLHead-Gordon@lbl.gov
(505) 667-3114

Gwen Jacobs
Montana State University
gwen@nervana.montana.edu
(406) 994-7334

Leonard Krishtalka
Kansas State University
krishtalk@ku.edu
(785) 864-4540

Michael Levitt
Stanford University
michael.levitt@stanford.edu
(650) 723-6800

Robert MacLeod
University of Utah
macleod@cvrti.utah.edu
(801) 587-9511

William K. Michener
Long Term Ecological Research Network Office
wmichene@lternet.edu
(505) 272-7831

Margaret Palmer
University of Maryland
mp3@umail.umd.edu
(301) 405-3795

Phil Papadopoulos
San Diego Supercomputer Center
phil@sdsc.edu
(858) 822-2628

Jay Snoddy
University of Tennessee
Oak Ridge National Labs
snoddyj@ornl.gov
(865) 974-3466

Susan Stafford
University of Minnesota Twin Cities
stafford@umn.edu
(612) 624-1234

Lincoln Stein
Cold Spring Harbor Laboratories
lstein@cshl.org
(516) 367-8380

Russell M. Taylor II
University of North Carolina
taylorr@cs.unc.edu
(919) 962-1701

Tandy Warnow
University of Texas Austin
tandy@cs.utexas.edu
(512) 471-9724

Ross T. Whitaker
University of Utah
whitaker@cs.utah.edu
(801) 587-9549

John Wooley
University of California San Diego
jwooley@ucsd.edu
(858) 822-3630

Appendix IV
What Is Cyberinfrastructure?
An Overview of the Blue-Ribbon Panel/Atkins Report: Overall NSF Considerations on CI

The extraordinary impact of computing and information technology, from transaction databases and wired
commerce to the World Wide Web, is apparent already in society, in which the transition to an internet
economy has happened in a fraction of the time taken by the telephone and the television in penetrating
the Nation’s homes. In the sciences, where the future is being created today, the interplay is already at
least as profound, and the impact of information technology (IT) seems certain to sustain a positive
second derivative, continuing change empowering science and serving society. All disciplines of science
and engineering face challenges inherent in studying more complex phenomena, managing to drink from
the fire hose of contemporary instrumentation and experimental observation, and establishing data mining
tools to probe beneath the surface of massive data sets. Indeed, a common theme to the analyses by the
communities of science is that the next generation, the 21st Century of discovery, must deal with the
actual details, not simple abstractions, recognizing that all natural systems are inherently complex,
heterogeneous and nonlinear in behavior and in all cases span huge spatial and temporal scales. At the
same time, the future of computing at all levels fuels data analysis on ever more effective scales, and
Computing Platforms will exploit emerging trends: the rise of Linux; the introduction of productive, efficient
commodity clusters; the rapid expansion of participation and availability toward ubiquitous access; and
the maturation of grid computing.

[Figure: Evolution of the Computational Infrastructure. Timeline, 1985 to 2010: Prior Computing
Investments; NSF Networking; Supercomputer Centers (SDSC, NCSA, PSC, CTC); PACI (NPACI and Alliance);
Terascale (TCS, DTF, ETF); Cyberinfrastructure.]

Advanced IT, over the past decade in particular, has played an ever more important, pervasive,
obvious and essential role across all of the sciences, in science and engineering education, and even
across all elements of the economy and society. Workshops, journal articles, community white papers,
and major symposia have highlighted the general and the specific (disciplinary) contributions arising
from nothing less than a revolution in the technology, intellectual approaches, philosophy and vision,
and implementation of IT. As a deeply embedded feature of research design and implementation, IT
advances experimentation and data collection, archiving and analysis; indeed, new methods of analysis
that reveal information within the data that is not obvious to manual observation – a process termed
data mining – have often changed
entire fields and directions of research. Extant data derives more value and is likely to be used and more
fully evaluated and exploited, while at the same time, IT can identify limitations and point to the need for
more sophisticated instruments or means of collecting data. Collaborations among researchers and the
value of theoretical models and computational simulations have also become prominent features of
experimental analysis due to IT.

Leading to the Atkins study or blue ribbon panel commissioned by CISE were NSF’s and NSB’s
recognition of the pervasiveness of IT, including the neologism “cyberinfrastructure” or CI to reflect the
revolution. Consequently, a discussion has been ongoing concerning, among other key matters, what CI
would enable, what types and levels of CI would be useful in different communities for both education and
research, what balance should be established between research into advancing the quality of CI itself
versus research and funding toward implementation of instances of CI for specific communities, and what
the balance should be between CI and other forms of infrastructure.

The report of the Blue-Ribbon Advisory Panel on Cyberinfrastructure, or simply the Atkins report, is the
product of an interdisciplinary team charged to consider (1) the extant high performance computing program;
(2) new directions for CISE; and (3) implementation plans to meet the recommendations of the study or
report itself.

The report defined CI as:


A set of functions, capabilities, and/or services that make it easier, quicker, and less
expensive to develop, provision, and operate a relatively broad range of applications.
This can include facilities, software, tools, documentation, and associated human
support.

The main recommendation contained in the report is summarized as:


We propose a large and concerted new effort, not just a linear extension of the current investment level
and resources. NSF must recognize that the scope of shared cyber-infrastructure must be far broader
than in the past: it includes computing cycles, but also greater bandwidth networking, massive storage,
and managed information. Even these are not sufficient: there must be leadership on shared standards,
middleware, and basic applications for scientific computation. The individual disciplines must take the
lead on defining certain specialized software and hardware configurations, but in a context that
encourages them to give back results for the general good of the research enterprise, and that facilitates
innovative cross-disciplinary activities.

The basis for such a far reaching conclusion included:


• Increasing science/engineering community-based initiatives and budget demands in all
Directorates for greater investment in information technology to support domain-specific
research.
• A growing sense that science and engineering research and practice are reaching
thresholds in performance and adoption of IT that could radically transform the “what”,
“how” and “who” of scientific research on a truly global scale.

A very extensive appendix considers the breadth of CI, and includes an extensive survey of the scientific
communities. Many aspects of the report should be considered in the sense that not just the biological
sciences but most sciences have been making increasing use of IT and are only beginning to adopt fully the
tools and approaches that would be most beneficial. The panel asserted that CI is now at the core of
revolutionary science in every discipline, and that NSF must take a central role in its deployment; that
NSF must consider people and software as co-equals with hardware in the context of CI, and recognize
that to date NSF has not placed enough emphasis on people, on software, and on maintenance,
compared to hardware acquisition and implementation. Networking, clusters, grids and many other
aspects of modern computing that are quite relevant to biology have been considered, including the need
to deliver the full cyberinfrastructure (IT, data and knowledge resources, scientific computing tools, grid
access, visualization tools, etc.) to the desktop of individual students, as well as the need to create a new
workforce, one that is expert in a domain but conversant enough in IT and scientific computing to exploit
the opportunities and select the most productive collaborations.

Appendix V
Models for a Comprehensive CIBIO Based on Extant Test Beds utilizing
Advanced IT

Biomedical Informatics Research Network (BIRN)

[Figure: National-Scale Testbed in Cyberinfrastructure: Federating Multi-Scale, Multi-Modal
Neuro-Imaging Data (sites). Caption: Could Expand Readily; Examples: Plant & Microbe Genome; Cellular
Level; Tree of Life; also Indefinite Expansion to Many Laboratories, around USA.]

The BIRN is an NCRR initiative aimed at creating a test bed to address biomedical researchers' need to
access and analyze data at a variety of levels of aggregation located at diverse sites throughout the
country. The BIRN test bed will bring together hardware and develop software necessary for a scalable
network of databases and computational resources. Issues of user authentication, data integrity,
security, and data ownership will also be addressed.

The test bed focuses on research involving neuroimaging to take advantage of the relatively advanced
level of sophistication of this community in the use of information technology. An essential feature of
the test bed will be creation of infrastructure that can be deployed rapidly at other research centers
throughout the country, which may have research emphases outside of neuroimaging. This means that in
addition to scalability, the software/hardware must be reusable and extensible.

The BIRN test bed uses the most advanced networking available and will draw heavily on resources of
the next generation Internet. The upgrade for networking is funded by the National Science Foundation
for both design and implementation. The initial awards join General Clinical Research Centers (GCRCs)
and co-located Biomedical Technology Research Resources, as team-level, shared user facilities (P41s)
to establish the necessary infrastructure in the context of ongoing neuroimaging research projects. A
separate grant for a "system integrator" that coordinates network, grid, and data mining software
development as well as hardware configurations will be awarded to a recognized leader in such technical
development and service efforts.

Specific BIRN program objectives today are to:


• Establish a stable high-performance infrastructure linking key NCRR-supported team efforts (NIH
P41 Resources) and GCRC sites using the Internet 2 network.
• Establish distributed, linked data collections for all of the test bed projects.
• Enable the use of distributed, heterogeneous, grid-based computing resources for project-specific
data storage and collaborative analysis.
• Enable data mining from multiple data collections or databases on neuroimaging.
• Develop software and hardware infrastructure that is stable, and can be reapplied and/or
expanded to include other sites with different research foci (than the founding sites).
• Demonstrate effectiveness of these technologies in improving and extending research results
(needs pull not technology push).

BIRN has become an example for all implementations and discussions of CI, a project that could be
called one of the founding activities for a comprehensive cyberinfrastructure. Thus, for BIO
considerations, BIRN provides the most fully established model for how communities of researchers can
share data and conduct research across distance. Built upon a foundation from previous work (supported
by various NSF awards, namely a Grand Challenge award and NPACI, and by NIH funding as NCMIR,
NBCR and the Collaboratory), BIRN researchers have created a set of tools that distribute computing,
manipulate data, and control instruments. http://www.nbirn.net

National Biomedical Computation Resource (NBCR)


The goal of NBCR, funded by the National Center for Research Resources as an NIH Research
Resource, is to “Conduct, catalyze and advance biomedical research by harnessing, developing and
deploying forefront computational and information technologies”. NBCR develops and deploys software
and services for the biomedical community. The award supports five interconnected core projects,
ranging from systems biology to grid services and visualization, which also interact with specific application
drivers (Collaborative Projects) and have distinctive components: service projects (individuals using the
software), training on the tools of the resource, and dissemination of the tools and also the activities of the
resource. There are several other related NIH Research Resources in the area of biomedical research.
NBCR, and these other research resources, are models of how to support a critical mass of researchers
and academic professionals to develop, deploy, and support tools that are used by the community. This is
a model for how CI essential for basic biology could be established and developed within BIO funded
activities. http://nbcr.sdsc.edu

Pacific Rim Application and Grid Middleware Assembly (PRAGMA)


[Figure: PRAGMA Partners. Slide footer: 14 – 15 July 2003, BIO Cyberinfrastructure, Alexandria VA.]

PRAGMA is a collaborative effort of 15 institutions located around the rim of the Pacific Ocean. Its
mission is to establish sustained collaborations and to advance the use of grid technologies in
applications among a community of investigators working with leading institutions around the Pacific
Rim. To fulfill this mission, PRAGMA hosts a series of workshops for members; the workshops focus on
developing underlying applications and a test bed for those applications.

Current applications include:

• Establishing workflows in biology (e.g., protein annotation);
• Linking via web services climate data (working with some LTER sites in the US and East Asia Pacific
Region ILTER);
• Running molecular solvation models in chemical biology and biochemistry;
• Extending remote access and collaboration or telescience applications and partnerships to more
institutions.

The NSF supports the participation of scientists from the USA in PRAGMA, which includes an
extraordinary breadth of commitment and participation from Asia. All of the institutions involved have
committed to active participation and contribution of labor and resources for applications and the test bed.
http://www.pragma-grid.net

Telescience Examples
Effective telescience collaborations are supported by portal interfaces into tomographic codes, which
have been distributed on various platforms (in the US, Japan, and Taiwan), and by means for storing large
data sets. Funding is from both NSF and NIH. http://gridport.npcaci.edu/telescience

Science Environment for Ecological Knowledge (SEEK) http://seek.ecoinformatics.org


A new NSF supported effort in ecology is already exploiting many features of modern, advanced
information technology. http://intranet.lternet.edu/archives/documents/Newsletters/DataBits/03spring/

Geosciences Network (GEON)


Involves geosciences community and scientific computing/data management community working together
to build an infrastructure for sharing data; includes social scientists for understanding how community
coalesces. http://www.geongrid.org

OptIPuter (an NSF large ITR)


The OptIPuter was funded by NSF to prototype and establish a distributed computing paradigm on
campus sites, with connectivity to the end user’s desk, for which the network is no longer the bottleneck. L.
Smarr, PI, for the physics applications, combines applications from Neuroscience (M. Ellisman) and the
Environment (J. Orcutt), including cluster and grid solutions (P. Papadopoulos).

EuroGrid for Biology


An early overview of a European perspective on e-science and grids for biology; see Carole Goble,
2001, Computational Functional Genomics. http://www.hgmp.mrc.ac.uk/CCP11/cfg_grid.pdf

Appendix VI
Potential Prototypes for BIO Implementation
NSF BIO will have to evaluate the strongest options for early prototypes and also establish milestones
and a roadmap for the overall National and international effort in building a CI for the biological sciences.
Areas of potential impact can be found across all of the BIO research domains, including such things as
the Tree of Life, Deep Green, the extant database activities, LTER, new FIBR activities, comparative
genomics and an understanding of cis regulatory elements during development, and so on. Due to the
enhanced funding and focus on microbes by NSF and also by DOE, there is strong potential for BIO, in
partnership with DOE (and perhaps secondarily with NIAID and NIEHS) to develop a CI for microbes,
ranging from genomics to studies on environmental communities. As the plant genome projects advance
their sequencing and database efforts, the overall information infrastructure to be developed will also
provide a strong base for an NSF BIO CI. As NEON is developed and integrated with ongoing advances
in the science enabled by LTERs, there will be extraordinary opportunities, building on sensor nets and
meso-scale collaboratories, to exploit a cyberinfrastructure to accelerate our understanding of
ecosystems and our ability to utilize that understanding to the benefit of society. As an example of what
such a successful CI would be, an overview of one such development within the ecosystem community
will be outlined.

Ecological Analysis Research Network (EARN) Within an Environmental Cyberinfrastructure (ECI)

A comprehensive, fully integrated computational environment, a cyberinfrastructure for ecology and
ecosystem research, represents both an essential step for the ecological community and an obvious
opportunity for NSF. In analogy with other Research Networks, for simplicity such a comprehensive
environment, with support for the right mix of people using the ideas and tools of ecology, and exploiting
advanced information technology and scientific computing will be described here as an Ecological
Analysis Research Network (EARN), one that enables the organization of ecodata into information, the
integration of that information into knowledge, and the synthesis to wisdom about the broad impacts and
roles of complex ecosystems, through deep partnerships among experimental, theoretical and
computational approaches. An EARN should aim to create a digital knowledge environment that enables
vast increases in the ability of ecologists to communicate and collaborate to establish a new synthesis for
ecological science and its interconnection with environmental science. The NSF Advisory Committee for
Environmental Research and Education (AC-ERE) has described the creation of an Environmental
Cyberinfrastructure (ECI), a set of scientific computing and advanced information technology tools for the
study of complex environmental systems. (See the Occasional Paper of the NSF Advisory Committee for
Environmental Research and Education, AC-ERE 1, May 2003.) AC-ERE has described a very complete
plan for implementing cyberinfrastructure to underpin its vision for environmental research and
education (Complex Environmental Systems: Synthesis for Earth, Life, and Society in the 21st Century
[CES]). A challenge and an opportunity for an EARN would be the need for the ecoscience deployment
to be deeply interconnected with the overall environmental research infrastructure. Considerable
background about what that would entail is in the AC-ERE Paper. In sum, the AC-ERE described the range
of empowerment arising from the development of collaboratory software, visualization tools, data mining
and data management techniques, and software and embedded hardware for sensor nets.

Realizing the vision will require the creation of common platforms, applications that support data
management and manipulation, compute-intensive applications, and advances in high-end computing as
well as desktop computing.

The CI, through proper management, will lead to increased scientific democratization that is particularly
relevant to internationalizing our community and our research. Access to data and computing capabilities
will be enhanced worldwide. This is particularly important for ecologists given the rapid environmental
changes taking place globally as well as the rapid loss of biodiversity in parts of Asia, Africa, and
South America. Furthermore, internationalization through enhanced ECI is critical to answering the many
key ecological questions that require multiple forms of archived data, real-time global scale data and the
ability to work locally (at home) with remote collaborators working in all parts of the world.

What would a visionary ECI include?


• Software development and production related to data acquisition, management, and application,
including tools for data representation and pattern recognition.
• Hardware, software, and expertise to enhance communication and provide digital libraries.
• Hardware and middleware development for computational grids, data grids, and networking.
• A community of computer scientists and technical experts who work routinely with ecologists to
develop and then put into production new tools as well as maintain software widely used by the
community.
• Research associated with the development and deployment of intelligent sensor networks for
monitoring and research.
• More sophisticated mathematical and statistical tools and researchers trained in their usage.
• Advances in model building (numerical, analytical), in linking models and data, and development
of common modeling frameworks that can be used by broad communities.

Suggested early steps by the community and for the NSF include the following. However, a detailed
implementation plan will require further community input.

1. BIO should establish a coordination center for the national effort, the EARNCC, which would
serve as the hub for technology introduction and dissemination and for intellectual organization
and process management; when mature, the EARN should include about two dozen regional
nodes. (More sites than that would be required for comprehensive analysis of ecosystem
dynamics and interactions, and certainly more would be required for what will have to be an
international effort, but an initial EARN would inevitably have to be a pilot, and only when the path
is established should more coordination centers and associated nodes be added.) The
coordination center, at a single site, would consist of a group of computer scientists, ecologists,
and academic professionals, who serve as a national coordinating group.
A. The center director in concert with a science advisory board would make decisions on
funding of EARN development proposals – these would not be investigator-originated
ecological research proposals but rather (for example) proposals to create new software
or tools that would help a particular community of ecological researchers and managers.
B. The center should be an entrepreneurial entity (in identifying and leading community
efforts to set important directions for EARN) and the coordination entity for a network of
service nodes (i.e., the center would provide support to regional service nodes who would
be the links to individual investigators and research teams).

2. The expectations for national EARN centers


A. Coordinate a national funding program for CI tool and idea development and host visiting
research teams.
B. House “gardens”, that is, diverse, well-maintained, organized collections of software that are
used widely by the community, but watched, pruned, or allowed to die at the Center
following community input.
C. Ensure that the development of critical software is accompanied by suitable software
engineering activities, so that the software is usable on a variety of platforms, is fast and
efficient, etc. (otherwise, researchers may develop novel software in codes that are not usable
by the community or are not interoperable, or the software will remain research grade and
never be truly robust enough for widespread use). This recognizes that not only is
exploratory research needed for tool development, but so is production. Research and
development are two separate entities.

3. The expectations for the regional service nodes
A. Provide advice and consultation to individuals and research teams on technical issues
(hardware, software, network design).
B. Host education and training programs/workshops (particularly focused on graduate
students, postdocs, and mid-career scientists).

4. The establishment of funding for EARN Centers:


A. The center should be funded from multiple sources – federal agencies that fund basic
research should provide the primary funding base to ensure longevity and sustainability
of personnel and applications. The center is not meant to be a service arm for any
agency but a means by which CI discoveries and applications will be made that will
benefit the broad research community. For the regional service nodes, depending on the
magnitude of the project/issue, modest fees would be charged for their services, and these
fees should be acceptable line items in grant proposals submitted to standard research
programs.
B. Federal agencies should put in place initiatives to provide funding for capturing legacy
datasets. No other vehicle exists to ensure this valuable information is not lost to the
community’s efforts at deeper insight.
C. Federal agencies should work in concert with international agencies, and
look to the international programs within our own agencies to help bring the Nation’s
talent to bear on the intellectual and technology transfer requirements, toward an
international infrastructure. The programs in this case would help scientists in
developing countries upgrade or obtain CI resources and interconnect with worldwide
knowledge resources.
D. Programs (funding) and reward systems (including awards by professional societies and
other means of recognition) must be established for the development of models that can
be widely used by the community for simulation studies including ecological prediction, in
guiding experiments, and facilitating cross-scale work.
E. Agencies, looking to partnerships of statisticians and experimentalists, should put in
place competitions that focus on advances in developing algorithms for rapidly evaluating
the quality and validity of data (QA/QC).
F. Other competitions, for partnerships with mathematicians and computer scientists, are
needed to focus on the development of methods for dealing with fuzzy data - to capture
essential information in a quantitative way.
G. Competitions for partnerships between experimentalists, informaticians and computer
scientists will be needed to drive the development of diverse semantic tools to
integrate/federate data, mine the information, empower further modeling and analysis,
and convert it to useable knowledge (metadata tools).

*Note - items D-G above particularly involve programs and incentives for computer scientists to work
with ecologists, or work on ecologically motivated problems AND all of the above require sustained
support of CI personnel (academic professionals, programmers, and experts in visualization,
networking, and other computer science applications).
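
As an illustration of the kind of rapid data-quality evaluation (QA/QC) called for in item E above, the following is a minimal sketch only, not a method proposed by the workshop. The function name qc_flags, the flag labels, and the example limits are all hypothetical; it applies two of the simplest checks in common use, a valid-range check and a step (spike) check against the last accepted value.

```python
def qc_flags(values, valid_min, valid_max, max_step):
    """Label each reading as 'ok', 'missing', 'out_of_range', or 'spike'.

    Hypothetical illustration: limits are supplied by the caller and the
    spike test compares each reading with the last reading flagged 'ok'.
    A genuine step change would therefore keep being flagged as a spike,
    so a production tool would need a rule for re-baselining.
    """
    flags = []
    previous_good = None
    for v in values:
        if v is None:
            flags.append("missing")
        elif not (valid_min <= v <= valid_max):
            flags.append("out_of_range")
        elif previous_good is not None and abs(v - previous_good) > max_step:
            flags.append("spike")
        else:
            flags.append("ok")
            previous_good = v
    return flags


# Hypothetical air-temperature series in degrees Celsius
readings = [12.1, 12.3, None, 12.2, 25.0, 12.4, -60.0, 12.5]
print(qc_flags(readings, valid_min=-40.0, valid_max=45.0, max_step=5.0))
```

Checks of this kind are cheap enough to run as data arrive, which is the sense of "rapidly" assumed here; the statistical partnerships described above would be needed to go beyond such fixed thresholds.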

Ecological forecasting, evolutionary ecology and biodiversity, a set of interconnected ecological research
arenas, urgently require a set of tools from scientific computing and advanced information technology that
would be provided by this comprehensive CI:
• Analytical modeling and simulation
• Scientific computing at levels of high performance
• Database creation, organization, development, mining and other queries and analyses
• Communication networks
• Wireless sensor networks (for data sets in general and particularly for biodiversity)

The following are a set of environmental “Science Nuggets” that outline the interesting adventures of
discovery that a CI for the ecological sciences would enable.

The Ecology of Sound

This is a new and untapped field, which has been ignored largely because there has not been the
technology to ask the right questions, and even more seriously, since there never before has been an
opportunity to develop the requisite CI. The field will be able to answer questions such as: What are the
sound niches of different organisms? Do organisms within an ecosystem always partition the entire
sound spectrum and, if not, why? How does this partitioning influence communication among organisms?

The tools of CI would be essential for monitoring and modeling bioacoustics. The experimental setup
would involve small microphones deployed across habitats (such as wetlands or cities) recording the full
spectrum of sounds (both anthropogenic and natural). The observations would create multi-gigabyte
databases daily, at a minimum, and would create more depending on how pervasive the technology
became in the community. Data must be obtained, sampled, stored, organized, managed, processed,
and mined, passing through the typical data chain, to provide understanding. This will be a huge challenge
for ecoinformatics. The data sets will be quite large, beyond ready human comprehension, so creative CI
tools will be required to provide visual representations and entry points for analysis, as well as means for
the detection of sophisticated, deeply embedded patterns, and other still unknown phenomena.
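
To give a sense of scale for the "multi-gigabyte databases daily" mentioned above, the arithmetic can be sketched as follows. This is a back-of-the-envelope estimate under stated assumptions, none of which come from the workshop: uncompressed PCM audio, a 44.1 kHz sampling rate, 16-bit samples, one channel, continuous recording, and a hypothetical array of 50 microphones.

```python
def daily_audio_bytes(sample_rate_hz, bit_depth, channels=1, hours=24.0):
    """Uncompressed PCM bytes produced by one recorder over the given duration."""
    seconds = hours * 3600
    bytes_per_sample = bit_depth // 8
    return int(sample_rate_hz * bytes_per_sample * channels * seconds)


# Assumed recorder settings (hypothetical): 44.1 kHz, 16-bit, mono, 24 hours
per_microphone = daily_audio_bytes(44_100, 16)

# Assumed deployment size (hypothetical): 50 microphones across one habitat
array_total = per_microphone * 50

print(f"one microphone: {per_microphone / 1e9:.2f} GB/day")   # 7.62 GB/day
print(f"50 microphones: {array_total / 1e9:.1f} GB/day")      # 381.0 GB/day
```

Under these assumptions a single continuously recording microphone already exceeds 7 GB per day, so even a modest array moves the storage and processing problem well beyond what an individual investigator would manage by hand.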

Environmental Event Science

An environmental event detection, recording and analysis process, or event science, would begin with
smart sensor networks that detect changes and also enhance sampling during those events. This
research challenge has plagued environmental biologists for years. (Solving the challenge is critical for
basic research as well as for applied research; the challenge is also related to national security issues.)
With event science, we could answer questions such as: How do storm events or temperature inversions
influence nitrogen (N) deposition, and what are the pathways by which the elements move through the soil
and groundwater to reach surface waters?

Such a science would require that the EARN link directly to research innovations, because it depends on
advances in computing hardware (related to wireless communication and power systems) and software
(intelligent sensor webs that turn on and off and conserve power; smart sensors that will take over other
functions as necessary when others fail).
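
The idea of a sensor that detects a change and then enhances sampling during the event can be sketched in a few lines. This is a simplified illustration under assumptions of our own, not a design from the workshop: the class name AdaptiveSampler, the intervals, the window length, and the three-sigma threshold are all hypothetical, and the rolling mean and standard deviation stand in for whatever change detector a real sensor web would use.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveSampler:
    """Chooses the next sampling interval from a rolling baseline of readings."""

    def __init__(self, base_interval_s=600.0, event_interval_s=10.0,
                 window=12, threshold_sigma=3.0, min_sigma=0.05):
        self.base_interval_s = base_interval_s      # slow rate, conserves power
        self.event_interval_s = event_interval_s    # fast rate during an event
        self.threshold_sigma = threshold_sigma
        self.min_sigma = min_sigma                  # floor for a perfectly flat baseline
        self.history = deque(maxlen=window)

    def next_interval(self, reading):
        """Return the number of seconds to wait before the next sample."""
        interval = self.base_interval_s
        # Only judge a reading once a full window of baseline data exists.
        if len(self.history) == self.history.maxlen:
            baseline = mean(self.history)
            spread = max(pstdev(self.history), self.min_sigma)
            if abs(reading - baseline) > self.threshold_sigma * spread:
                interval = self.event_interval_s
        # Every reading joins the window, so a sustained shift
        # eventually becomes the new baseline.
        self.history.append(reading)
        return interval


sampler = AdaptiveSampler()
quiet = [20.0, 20.1, 19.9, 20.0] * 3                # hypothetical calm period
intervals = [sampler.next_interval(r) for r in quiet]
storm = sampler.next_interval(35.0)                 # abrupt change
print(intervals[-1], storm)                         # 600.0 10.0
```

Sampling slowly by default and quickly only during events is one simple way to reconcile the power constraints of field sensors with the need for dense data when something is actually happening.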

The infrastructure required includes the development of software for sensors, possibly collaborations with
materials scientists and computer scientists on sensor development itself, and the use of efficient and
flexible network designs that transmit (wirelessly) huge amounts of digital information in a
form that can be captured and stored immediately and directly.

The overall infrastructure, as a bottom line, will deliver absolutely HUGE data streams that will allow us to
ask and answer a new generation of questions about the environment and how ecosystems function.
The kinds of questions are an extension of LTER research and represent ultimately the goals of NEON,
but the actual technology needed for this revolutionary expansion of ecological science would be enabled
by establishing a strong baseline effort in environmental event science.

Appendix VII
References for CIBIO Report
There is an enormous wealth of material available—both online and in print—addressing the possibilities
for twenty-first century science within the framework of a modern cyberinfrastructure. This literature
covers a wide range of topics and disciplines; and indeed, it would be impossible to reproduce all of this
information here. Instead, we have chosen to highlight below some of the best outside references we
have come across.

The references listed below can be accessed by visiting CIBIO on the web at
http://research.calit2.net/cibio/.

Cyberinfrastructure Background Information


• Charting Our Cyberinfrastructure Future
A presentation made by Dr. Deborah Crawford, Chair of the NSF Cyberinfrastructure Working
Group (CIWG), 6 June 2003.
• Computing: Getting us on the Path to Wisdom (HTML)
Slide show presentation and remarks made by Dr. Rita R. Colwell, Director of the NSF, at
SC2002: From Terabytes to Insights, 19 November 2002.
• Cyberinfrastructure - Issues and Challenges
An introduction to Cyberinfrastructure prepared by Dr. Francine Berman, Director of the San
Diego Supercomputer Center.
• Cyberinfrastructure: Opportunities for Connections and Collaboration
Paper exploring the concepts of envisioning and building a cyberinfrastructure. By Joan
Lippincott, Associate Executive Director, Coalition for Networked Information (2002).
• Cyberinfrastructure: Revolutionizing Science & Engineering
Slide show presentation made by Peter A. Freeman, NSF Assistant Director for Computer and
Information Science and Engineering (CISE), at CENIC 2003, 9 May 2003.
• The Grid: A New Infrastructure for 21st Century Science
Article from the February 2002 issue of Physics Today. Explores some of the transformations in
science and engineering that are possible thanks to cyberinfrastructure.
• The Human Side of the Cyberinfrastructure
Short article by Fran Berman (NPACI and SDSC Director) in the April - June 2001 issue of
enVision magazine.

Cyberinfrastructure Applications for the Biological Sciences


• Challenges Faced in the Integration of Biological Information
Article by Su Yun Chung and John C. Wooley. Excerpt from Bioinformatics: Managing Scientific
Data.
o Appendix: Biological Resources
Also excerpted from Bioinformatics: Managing Scientific Data. This appendix contains
useful biological resources, databases, organizations, and applications.
• Computational Cell Biology - Challenges and Opportunities for an Emerging Field
A report based on a roundtable discussion at the First International Symposium on Computational
Cell Biology (March 2001).
• Computational Infrastructure Workshop for the Genomes to Life Program
Report on cyberinfrastructure-related workshop of the US Department of Energy's Genomes to
Life program (March 2002).

• Next Generation Biology: The Role of Next Generation Computing
Overview and report of the workshop held 20 July 1998 by the NSF, National Institutes of Health,
and US Department of Energy.
• The Biomedical Information Science and Technology Initiative
Cyberinfrastructure and its potential uses for Biomedical Computing. Document prepared by the
Working Group on Biomedical Computing, Advisory Committee to the Director, National Institutes
of Health (3 June 1999).
• Trends in Computational Biology: A Summary Based on a RECOMB Plenary Lecture, 1999
Early paper discussing the potential applications of computers to biology. Useful as a historical
primer for the current effort to build a cyberinfrastructure for the biological sciences.

Cyberinfrastructure Applications for Other Scientific Disciplines


• A Geosciences Network for Understanding the Whole Earth
Article from the April-June 2003 issue of EnVision. Details the efforts of researchers at the San
Diego Supercomputer Center to build a cyberinfrastructure network for the Geological Sciences.
• Collaborative Large-scale Engineering Assessment Network for Environmental Research
Link to the UC Berkeley-directed infrastructure program for environmental field facilities mandated
to help in the formation of environment-friendly engineering and policy options.
• Cyberinfrastructure in the Mathematical and Physical Sciences
[See Page 5] Excerpt from Background for the Discussion on Long-Range Planning at the
MPSAC Meeting of April 3-4, 2003, a publication of the NSF Directorate for Mathematical and
Physical Sciences.
• Cyberinfrastructure Needs of the Engineering Community
PowerPoint presentation given by Esin Gulari, Acting Assistant Division Director for Engineering
at the NSF.
• Cybersecurity, Cyberinfrastructure Top Priorities at Information Technology Forum
Lead article from the April 2003 issue of NASULGC (National Association of State Universities
and Land-Grant Colleges) Newsline.
• Environmental Cyberinfrastructure Needs for Distributed Sensor Networks
A report from an NSF-sponsored workshop held 12 - 14 August 2003 at the Scripps Institution of
Oceanography.
• Environmental Cyberinfrastructure: Tools for the Study of Complex Environmental
Systems
An Occasional Paper of the NSF Advisory Committee for Environmental Research and
Education, 1 May 2003.
• Environmental Cyberinfrastructure: Turning Data into Knowledge
Presentation made by Margaret Cavanaugh at the NSF SINE Workshop - SDSC (29 October
2001).
• GEON: Cyberinfrastructure for the Geosciences
Web site coordinating the NSF effort to enable scientific discovery and improve education in
Earth Sciences through information technology research.
• Mathematics & Biology: The Interface - Challenges & Opportunities
Web version of a publication from June of 1992. Relevant discussion of the ways that
Mathematics and Biology can be coordinated, especially with the use of new technologies.
• Securing Cyberinfrastructure
Presentation made by Michael McRobbie, Ph.D., to the NSF Workshop on Cyberinfrastructure
Research for Homeland Security (26 February 2003).
• Steering Committee for Cyberinfrastructure Research and Development in the
Atmospheric Sciences (CyRDAS)
Links to publications, research, and information provided by CyRDAS and the NSF National
Center for Atmospheric Research.

• The OptIPuter
Article from the November 2003 issue of Communications of the ACM detailing the use of
advanced computing architecture in conjunction with distributed cyberinfrastructure.

Cyberinfrastructure Workshops, Lectures, and Reports


• CISE Lecture List
Links to presentations made as part of the Distinguished Lecture Series in the NSF's Directorate
for Computer and Information Science and Engineering (CISE).
o Recent Presentations made by the Assistant Director of CISE
• Cyberinfrastructure for Engineering Research and Education
Reports and presentations made at a workshop sponsored by the NSF Directorate for
Engineering, 5 - 6 June 2003.
• Cyberinfrastructure for Environmental Research and Education
Agenda and links to reports presented at a workshop sponsored by the National Science
Foundation, 30 October - 1 November 2002.
• Global Terabit Research Network: Building Global Cyber Infrastructure
Presentation made by Steven Wallace at the Global Research Networking Summit, May 2002.
• Management and Models for Cyberinfrastructure
Links provided by the University of Michigan's School of Information to presentations from two
workshops (held 14 - 15 May and 29 - 30 July 2003) and one town hall meeting (held 23 June
2003).
• Report of the Task Force on the Future of the NSF Supercomputer Centers Program
(Hayes Report)
• Revolutionizing Science and Engineering through Cyberinfrastructure: Report of the
National Science Foundation Advisory Panel on Cyberinfrastructure (February 2003)
• Science and Engineering Infrastructure for the 21st Century: The Role of the National
Science Foundation (February 2003)
• Workshop on Cyberinfrastructure Research for Homeland Security
Access to the list of guests, presentations, and draft reports prepared at a workshop held from 25
- 27 February, 2003.
• Workshop on EPSCoR Cyberinfrastructure for Large-Scale Science and Engineering
Notes and final report of the "Experimental Program to Stimulate Competitive Research"
conference that took place from 27 - 29 April 2003.

Historical Roots of Cyberinfrastructure


• Advanced Computing in the Life Sciences: Proceedings of the Workshop on the
Applications of Supercomputers in the Life Sciences
Report from a December 1984 workshop that set the stage for today's cyberinfrastructure
applications.
• Computational Biology for Biotechnology: Applications of Scientific Computing
Biotechnology
This 1989 article by John C. Wooley is another excellent look into the historical background of
today's cyberinfrastructure applications.
• Research Opportunities in Computational Biology
Results from an invitational workshop held in Washington, DC (13 - 15 December, 1989). Many of
the concepts discussed are relevant to today's cyberinfrastructure.

Miscellaneous Links of Interest


• American Association for the Advancement of Science (AAAS) Research and Development
Budget and Policy Program
o AAAS Report XXVIII: Research and Development in FY 2004

o Computing Research in the FY 2003 Budget Request
Excerpt from AAAS Report XXVII: Research and Development in FY 2003
• Digital Libraries
David Hart details the NSF's approach to the creation of usable digital resources in the next
millennium, and how cyberinfrastructure will play a role.
• Implications Of Information Technologies: IT Overview
Links to multiple web sites in the Division of Science Resources Statistics of the National Science
Foundation.
• National Science, Technology, Engineering, and Mathematics Digital Library (NSDL)
Program Solicitation
Grant and program solicitation information (NSF document 03-530).
• National Science Foundation FY 2004 Budget Request Overview
o FY 2004 Summary of NSF Accounts
• National Science Foundation FY 2005 Budget Request to Congress
Biological Sciences Budget Request Excerpts from FY 2005

Other Agency Web Sites


• Digital Libraries Initiative Phase 2
Homepage of the Digital Libraries Initiative Phase 2 (DLI2) with links to ongoing programs,
sponsors, and research.
• Environmental Research and Education (ERE)
Homepage of the National Science Foundation's ERE division, detailing current projects and
research.
• Experimental Program to Stimulate Competitive Research (EPSCoR)
Homepage of the NSF EPSCoR project, detailing current projects and research.
• Information Technology Research (ITR) Program
Homepage of the NSF's ITR Program, with links to current projects and research.
• National Center for Atmospheric Research / University Corporation for Atmospheric
Research
Information on funding and projects administered by the NSF Atmospheric Sciences subdivision.
• National Computational Science Alliance
• National Coordination Office for Information Technology Research and Development
• National Ecological Observatory Network
• National Partnership for Advanced Computational Infrastructure (NPACI)
o NPACI Alpha Projects (Biology-Related Applications)
• Network for Earthquake Engineering Simulation
• NSF Middleware Initiative Program Solicitation
Grant and program solicitation information (NSF document 03-513).
• Science and Technology Centers (NSF Office of Integrative Activities)
The NSF Science and Technology Centers fund 11 centers in a variety of disciplines, including
many related to cyberinfrastructure.
• TeraGrid
Homepage of the NSF project to create the world's largest, most comprehensive distributed
infrastructure for open scientific research.
• The US Long Term Ecological Research Network
Homepage for the program established by the NSF in 1980 to support research in long-term
ecological phenomena in the United States.
