NSF 2003 CI in The Biological Sciences
are those of the workshop participants, and do not necessarily represent the official
views, opinions, or policy of the National Science Foundation.
Building a
Cyberinfrastructure
for the Biological Sciences
(CIBIO)
A Workshop Report
July 14-15, 2003
Chair, John C. Wooley
Subcommittee on 21st Century Biology
NSF Directorate for Biological Sciences Advisory Committee (BIOAC)
Table of Contents
Abstract
Executive Summary
Introduction: The Context for Deliberation
The Unique Case for Including Biology in CI Development
Intrinsic Aspects of the Biological Sciences
Multiplying Exponentials through an Extensive Partnership
The Essence of the Objectives for NSF BIO
Resource Requirements and Initial Stages of Implementation
Education and Training
Coordination and Collaborations
Implementation Strategies
Funding and Management Mechanisms
Outreach and Partnerships
Appendices
I. Central Questions for CIBIO Workshop
II. Schedule and Assignments
III. Workshop Participants
IV. Overview of Blue Ribbon Panel/Atkins Report on CI
V. Models for a Comprehensive CIBIO based on Extant Test Beds
VI. Potential Prototypes for BIO Implementation
VII. References for CIBIO Report
A Cyberinfrastructure View
Envisioning and Empowering Successes for 21st Century Biological Sciences
• Creating and sustaining a comprehensive cyberinfrastructure (CI; the pervasive applications of all
domains of scientific computing and information technology) is as relevant and as required for
biology as for any science or intellectual endeavor; in the advances that led to today’s
opportunity, the BIO Directorate made numerous, ad hoc contributions, and now can integrate its
efforts to build the complete platforms needed for 21st Century Biology. Doing so will accelerate
progress in extraordinary ways.
• The time for creating a CI for all of the sciences, for research and education, has arrived and NSF
will lead the way. BIO must co-define the extent and fine details of the NSF structure for CI,
which will involve major internal NSF partnerships and external partnerships with other agencies,
and will be fully international in scope.
• Only the biological sciences have seen as remarkable, sustained revolutionary advances as the
computer and information sciences. Just in the past few years has the world of computing and
information technology reached the level of being fully applicable to the wide range of cutting
edge themes characteristic of biological research. Multiplying the exponentials (of continuing
advances in computing and bioscience) through deep partnerships will inevitably be exciting
beyond any anticipation.
• The stretch goals for the biological sciences community include both a need for community-level
involvement and for the complete spectrum of CI; namely: People and Training; Instrumentation;
Collaborations; Advanced Computing and Networking; Databases and Knowledge Management;
Analytical Methods (Modeling and Simulation).
o Invest in People
o Ensure Science Pull, Technology Push
o Stay the Course
o Prepare for the Data Deluge
o Enable Science Targets of Opportunity
o Select and Direct the Technology Contributions
o Establish National and International Partnerships.
• The biology community must decide how best it can interact with the quantitative science
community, where and when to intersect with computational sciences and technologies, how to
cooperate on and contribute to infrastructure projects, and how NSF BIO should partner
administratively. An implementation meeting, as well as briefings to the community through
professional societies, will be essential.
• For NSF BIO to underestimate the importance for biology, or fail to provide fuel over the entire
journey, would severely retard progress and be very damaging for the entire national and
international biological sciences community.
Executive Summary
Introduction: The Context for Deliberation
The biological sciences are at a critical juncture in their history, having absorbed over several decades the
tremendous successes of “reductionist” experimentation, that is, of carefully focused investigations on
simpler systems, model organisms and biological abstractions/models. Today, as the direct consequence
of such extraordinary and even unanticipated successes, a new era of synthesis pervades thinking about
the future of biological research, from macromolecules to ecosystems. To inform the process and
deliver this synthesis, biological scientists must collect, organize, analyze and comprehend
unprecedented volumes of highly heterogeneous, hierarchical information obtained by different
means or modalities, with different standards, widely varying kinds (types) of data, over vast
scales of time, space and organizational complexity.
NSF recently introduced the term “cyberinfrastructure” (CI) to describe the integrated, ubiquitous,
increasingly pervasive application of scientific computing (SC) and information technology (IT)
approaches, which are already changing both science and society. For example, a pervasive
infrastructure arising from scientific computing and information technology will provide the circumstances
and platforms to enable robust, widely distributed research teams or collaboratories, user-friendly
interfaces to fully integrated information from multicomponent systems, and also the software and
hardware for advanced simulation and modeling projects that are directly and tightly coupled with
experimental studies and provide interactive, iterative capacities to refine our knowledge. Such
approaches are already essential requirements for many features of contemporary scientific research.
A CI will do many things, but among the most important are to provide a means to establish: (1) the tools
for capturing, storing, and managing data; (2) the tools for organizing, finding and analyzing the data to
obtain information; (3) the connection of experimental and theoretical analyses and their interplay with
simulations and complex models based on that information; and (4) the integration of disparate aspects of
that information to provide a synthesis, a knowledge repository for further considerations. The reception
of the concept of CI as a maturing, philosophical and practical perspective – that is, on the profound
revolution provided through today’s integration of continuing advances in SC and IT - has been truly
remarkable, with the entire worldwide community of scientists joining the dialogue.
People, and their ideas and tools, are at the heart of CI. Building a fully effective cyberinfrastructure
for science and society will require educating a new generation, although the technologies and the effort
itself will generate new training environments and open novel options for enriching understanding of
science for both technical features and for community relationships. After full implementation including
the training of a new cadre of scientists, a comprehensive CI (for any community) will address (1) the
provision of routine, remote access to instrumentation, computing cycles, data storage and other
specialized resources; (2) the facilitation of novel collaborations that allow new, robust, interdisciplinary
and multidisciplinary research team efforts among the most appropriate individuals at widely separated
institutions to mature; (3) the powerful and ready access to major digital knowledge resources and other
libraries of distilled information including the scientific literature; (4) platforms or vehicles for the
integration of information from multiple, highly diverse and distributed sources; (5) new training
environments; and (6) other features essential to contemporary research.
The Unique Case for Including Biology
The history of the basic science community studying biology and of its federal sponsor, NSF BIO, is
especially rich in prescience and sustained commitment. BIO invested early, and in numerous ways, in
the advancing IT world that is now leading to a comprehensive cyberinfrastructure. The original
investments, for example, ranged from ecology and the LTERs to structural biology and the PDB. In
partnership with CISE, BIO also invested in a very wide range of high performance scientific computing
opportunities, such as biophysical and neuroscience modeling, telescience or remote access to
specialized instrumentation, and the requisite visualization, networking and database tools. Today, there
is an extraordinary opportunity to consolidate those activities and thereby to build a compelling,
integrated program that only could arise from BIO in its home at NSF. Specifically, building a
cyberinfrastructure for the biological sciences requires an interface to all of the quantitative sciences as
well as to computer science and engineering, and this can only happen at NSF. Already, we have seen
examples, such as NCRR/NIH’s cyberinfrastructure prototype, the Biomedical Information Research
Network (BIRN), where mission agencies have recognized NSF’s contribution and begun to establish CI
activities to meet their needs. Numerous other examples will follow to meet the goals of those missions.
Indeed, BIO’s previous and continuing investments will catalyze revolutionary change, not merely
incremental improvements, around the world.
Not only does 21st Century Biology absolutely require a strong cyberinfrastructure, but also, more than
any other scientific domain, biology, due to its inherent complexity and the core requirement for advanced
IT, will drive the future cyberinfrastructure for all science. NSF BIO must engage with CISE fully in setting
the course, in establishing an architectural plan describing the specific needs of the biosciences, in
assembling the parts, and building a full blown, highly empowering cyberinfrastructure for the entire
biological sciences community.
Complementing the compelling scientific case for building the cyberinfrastructure required by the
biological sciences, there are very favorable administrative considerations in the context of NSF. Notably,
the implementation, in incremental fashion and tailored to each discipline’s needs, of CI by NSF, offers an
especial opportunity - a perfect fit - for the biological sciences. In terms of management, access and
resources, NSF should assign the utmost priority to BIO, the only organization positioned to lead
the response of the biological sciences community.
The biological sciences, in settings around the world, will remain dominated by widely distributed,
individual or small team research efforts, rather than moving to a particular focus on centralized,
community facilities, as has happened for some sciences; reaching out to the
broadest range of the best performers, wherever they are, is consequently particularly important. As
telecommunication networks advance, biologists around the entire world will be able to explore and
contribute to 21st Century Biology.
experimental research. Indeed, only the biological sciences, over the past several decades, have
seen as remarkable, sustained revolutionary increases in knowledge, understanding, applicability,
as the computer and information sciences.
• Invest in People
• Ensure Science Pull, Technology Push
• Stay the Course
• Prepare for the Data Deluge
• Enable Science Targets of Opportunity
• Select and Direct Technology Contributions
• Establish National and International Partnerships
The most obvious feature of 21st Century Biology is the increasing rate of data flow and, simultaneously,
the highly complex nature of the data, whether obtained through conventional or automated means.
Responding to the enormous challenge requires that biological scientists be able to organize that data
into information, analyze the information to create insight and knowledge, and synthesize disparate
elements of our knowledge to create a deep understanding or wisdom. A passive role will not suffice
when the vitality of the entire biology enterprise is involved. In other words, BIO must provide the vision
for the CIBIO, not rely on technology drivers and circumstantial access. Education and the investment in
people will, of necessity, include retraining, lowering the barrier for entry by senior faculty and by those
from other disciplines, programs at all academic levels, and the training and stable career paths for future
principal investigators and for academic professionals who will provide the energy for the scientific
journey. Once involved, BIO will have made a major commitment to the community and must have an
effective long-range plan to sustain the efforts. The changing relationship of CISE to its high performance
computing centers and the introduction of a CI process across NSF places a significant obligation on the
BIO Directorate to structure and maintain the role of the biological science community in the development
and utilization of scientific computing and information technology applied to biology.
Not all subdisciplines can be simultaneously provided with a CI by BIO, so selected pilot projects and
areas of high NSF BIO impact should be the first focal points of effort. Strategic partnerships, discussed
below, may well be needed to facilitate and accelerate implementation. Nothing succeeds like success,
and the complete implementation of a CI for the biological sciences will depend on the initial choices
paying off in easily demonstrated ways. Thus, the early pilots should be selected not just for their
longer term scientific contribution but also for their ability to contribute significantly in the near
term, even though many aspects of a comprehensive CI for the biological sciences will take years to
develop fully and the impact will continue to accelerate the science for the foreseeable future.
All research communities should interoperate, work through and with CISE and NSF as a whole, to seek
to absorb as many as possible of the computational contributions from other fields, rather than
encouraging reinvention. Nonetheless, BIO must also choose its own technology course, not
passively accept whatever (hardware, software, middleware) is delivered for the needs of other science
domains.
The entire community will engage, even those with fewer resources and alternatives (than those available
within the biomolecular computing community). Scientists can now facilitate each other's progress in
extraordinary ways, and to optimize introduction of 21st Century Biology, the biosciences need to be
interconnected to the other NSF domains. For NSF BIO to underestimate the importance for the
biological sciences or to fail to provide fuel over the journey would be very damaging, perhaps
catastrophic, for the community.
Cyberinfrastructure promises to be as pervasive and central an influence as any societal revolution ever.
Given the breadth and the long-term impact, several considerations are very important. First, working
with partnerships and working in a global context is obvious and imperative on a scientific basis. Second,
these interconnections are equally obvious and imperative on a practical and administrative basis. The
cost of full implementation, of a comprehensive cyberinfrastructure in which the biological sciences
supported by NSF benefit from cyber-rich environments, such as those piloted by NEES and BIRN, will be
large as would be expected for a program of such incredible significance and applicability. The
administrative scale at which NSF and NSF BIO prepare and sustain this process will have to be well
beyond any previous efforts, beyond even the STCs or extant MRE programs.
Funding increases will be needed as well for the core experimental programs and their projects to permit
them to exploit fully the growing cyberinfrastructure and to build the requisite collaborations for a synthetic
understanding of biology, which requires computational expertise and the deep involvement of
information technology. In the biological sciences, database activities, modeling/simulations, and
theory must always be connected to experimental efforts; a balanced expansion of the portfolio
will be important. Beyond this base internal to BIO, major partnerships with computer and information
science (CISE) and with the other sciences will be required. The impact should not be underestimated,
but neither should the requirement for greatly enhanced, stable funding.
BIO is already engaged upon a series of extraordinary opportunities, in creating a larger scale for shared,
collaborative research efforts, through activities like FIBR and (the just initiated) NEON, while sustaining
microbial projects and LTER. These larger scale projects particularly require a cyberinfrastructure, with
costs of comparable magnitude to the projections for the experimental research component.
BIO will have to (1) build up its own core activities at this interface (e.g., the funding for bioinformatics,
biological knowledge resources, computational biology tools and collaborations on simulation/modeling)
that allow it to partner with other parts of NSF, (2) choose test beds for full implementation of CI, establish
paths toward deep integration of CI into all of its communities and for all of its performers, and (3) set a
leadership role for other agencies around the entire world, including notably the USA mission agencies, to
follow. Of course, only through a decades long commitment and through flexible, agile, engaged,
proactive interactions with the entire community and with the other stakeholders – i.e., with other sources
of funding for infrastructure and research newly enabled by CI – will the effort be a complete success.
Several types of early actions are needed. Implementing these important requirements will be the
responsibility of BIO and must be in place for effective collaborations on research frontiers with the other
domains (Directorates) of NSF.
The first implementation steps should be to expand the extant database activities and computational
modeling/simulation studies, which need central attention. Many challenges remain as research
problems as well as particularly difficult implementation problems for databases in the life sciences.
Simulation studies could contribute considerably more across all of the biological sciences. Accelerating
the introduction and expansion of tools and of the conceptual approaches provided through testing
models, a prominent feature of research in the physical sciences, will require continued programmatic
emphasis and commitment.
Many biologists trained in more traditional ways are just starting to recognize the opportunities, making a
renewed and invigorated focus on training in the quantitative sciences essential for 21st Century Biology.
Encouragement of more collaborations between/among experimentalists and computational scientists is
essential, but the full implementation of the opportunities will require the training of a new generation of
translators, of “fearless” biologists able to understand and speak the language of the quantitative
scientists well enough to choose the best collaborators and to build bridges to more traditionally trained
experimentalists. Many basic requirements involve academic professionals and the use of well-
documented approaches within computer and information science. Interdisciplinary training should be
restored as a separate, defined program within BIO.
As noted above, the enabling and transformational impact of CI justifies, and for full implementation would
require, a doubling of the NSF BIO budget, but it will also require that BIO lead a much larger effort,
marshalling resources from other National agencies and around the world, to provide adequate funding to
ensure full participation by the international life sciences community. Consequently, other key, early
actions are to establish a long range plan for sustained funding and to engage the community in a
dialogue to ascertain implementation priorities as well as to prepare the biological scientists from around
the Nation to participate fully. The dialogue should begin as an open meeting that is highly interactive
and inclusive in all ways; a major venue will be needed to explore all options, dig deeply into
implementation features for subdisciplines and into national and international partnerships, and provide
for the archival of discussions and recommendations.
Important administrative features include the review and funding of infrastructure and establishing (over
time) a balance across the subdisciplines. Infrastructure is different from individual research and
needs separate processes for its consideration, which are described below. Central coordination,
needed for effective selection of pilots and coordinated efforts, will ensure balance and accelerate
penetration of the benefits of modern IT to every BIO discipline.
All categories of infrastructure are increasingly important for scientific research, but cyberinfrastructure
will be particularly valuable for the biological sciences. What will be critical is to recognize that
infrastructure cannot be treated the same as individual research proposals. One cannot review
infrastructure requests and plans against individual research proposals, and separate, centralized review
and oversight will be needed. This situation arises since infrastructure benefits all, but has a different
time frame, different budgets, different staffing (more academic professionals). To make informed,
equitable and effective judgments on behalf of the community, CIBIO simply cannot be simultaneously
considered with individual projects. At the same time, robust, rigorous peer review is essential to
establish the best opportunities. Competition is also important; overlapping efforts will need to be initiated
in many cases and then the best project will ultimately become clearly identified.
innovation and sustain the excitement beyond national boundaries. The technology itself will change all
levels of education and BIO should coordinate with EHR and the other research Directorates. A simple
example, beyond the graphical nature of knowledge representation and interactive media as a teaching
vehicle, is remote learning. Democracy is at work on the web, in Nobelists answering the queries of
students and in the ready access with routine tools to the world’s information and knowledge store.
A Prelude on Entering Biology’s Century
Commentary and Prescience in the Biological Sciences
The questions of science always lie in what is not yet known. Although our techniques determine what
questions we can study, they are not themselves the goal. The march of sciences devises ever newer
and more powerful techniques. The new paradigm, now emerging, is . . . that the starting point of a
biological investigation will be theoretical. An individual scientist will begin with a theoretical conjecture,
only then turning to experiment to follow or test that hypothesis.
The human genome project will continue and accelerate this rate of increase [in DNA sequences known
and in internet accessible databases, notably Genbank]. Thus, I expect that...the total knowledge of the
human organism will be available...by the end of the decade. To use this flood of knowledge, which
will pour across the computer networks of the world, biologists not only must become computer-
literate, but also change their approach to the problem of understanding life.
We must hook our individual computers in the worldwide network that gives us access to daily changes in
the database and also makes immediate our communication with each other. The programs that display
and analyze the material for us must be improved – and we must learn to use them more effectively.
Walter Gilbert
Biology is in the middle of a major paradigm shift – driven by computing. Although it is already an
information science in many respects, the field is rapidly becoming much more computational and
analytical. Computerized databases of genetic information, for example, let researchers quickly
determine the significance of research findings.
“Computing has changed biology forever; most biologists just don’t know it yet.”
“Computational Biology will be as essential for the next quarter century of biology as molecular biology
was for the past quarter century.”
More than the earlier implementations of research computing and far more than the contributions from
any other scientific infrastructure, the integration and acceleration of scientific computing and advanced
information technology (SC and AIT), driven by NSF, will enable applications for the entire biological
sciences community beyond any expectation we could articulate today. By building a cyberinfrastructure
for the biological sciences (CIBIO), the Biological Sciences Directorate will provide an enduring,
extraordinary foundation for research.
To encapsulate the value of this enterprise by a metaphor, consider the assertion by hockey great Wayne
Gretzky about his ability to score goals. In sum:
CIBIO should allow biologists to “scait” to where the puck will be.
The Workshop Report
Specifying and Building a Cyberinfrastructure To Meet the Requirements of
Biology
The NSF has identified the pervasive, ubiquitous contribution of a set of broad enabling approaches as
cyberinfrastructure (CI), which builds upon a long history of exceptional advances in basic computer
science and engineering, information technology, computer hardware and software, networking, and
wired and wireless communication technologies. The Directorate for the Biological Sciences (BIO) will
establish its own plans for CI in the context of this broader effort. That broader effort began through the
study of a “Blue Ribbon Panel,” chaired by Prof. Dan Atkins, for the Computer and Information Science
and Engineering (CISE) Directorate of the NSF. A summary of their report is in Appendix IV.
CI is perceived by the NSF as contributing not only to basic/technical knowledge and deep understanding
(and even to wisdom), but also to a knowledge-based democracy for science and society - beginning with
widespread access and full participation, and to an accelerating translation from fundamental research to
societal benefits. The aim of CI, as an integrative platform and enabling tool, is to empower all of the
sciences to work on a systems level, while fully encompassing the requirements for ultra-small, focused
studies to ultra-large analyses.
Including databases and knowledge resources, grid architectures and services, software engineering,
telescience (or remote access to specialized methods and tools), collaborations around defined science
projects and around implementing needed technologies, and so on, the implementation of CI has to be
international in scope. In fact, already there are international projects, and the international character
and expectations of CI will increase rapidly.
[Figure: layered view of cyberinfrastructure. Layers shown: specific Cybertools (software), e.g., High Energy Physics, Proteomics/Genomics, …; Shared Cybertools (software); Development Tools & Libraries; Grid Services & Middleware; Distributed Resources (computation, communication, storage, etc.); Hardware.]
as the needs of the sciences are established. As the entire NSF sustains an active dialogue, CI will
be increasingly defined, and will certainly turn out to be defined in different ways for different
communities.
CIBIO will thus have to be part of a “family,” will have to interconnect, Venn Diagram style,
with the central or parent CSE (or CISE) CI and with all of its
siblings. There are huge areas of large overlap, through common or shared research programs and
challenges in understanding the environment, with the GEO community. This is notably the case for
activities such as GEON, Geoinformatics, climate change, and the marine sciences. However, CIBIO will
interconnect with the research and infrastructure activities in all of the NSF Directorates, including EHR.
Many novel partnerships, involving new collaborations across long maintained disciplinary and sub-
disciplinary boundaries and connected anywhere that makes sense intellectually, will arise due to the
enabling character of modern infrastructure. This infrastructure is absolutely central to 21st Century
Science. While only one aspect of that infrastructure, CI will have an especially productive impact and
will require equally special attention by the funding agencies, by private Foundations, by performing
Institutions and by the entire community. All economic, geographic and technical sectors of the Nation
and indeed the entire world will participate; all communities and all Agencies, wherever they are located
in the world, will have to define their relationship to CI and respond in an exceptionally proactive fashion.
We need the performers of biological research around the world to be fully engaged in contributing to the
data rich, knowledge empowered world of 21st Century Biology. Such complicated dynamics present an
important opportunity for leadership by NSF BIO, while creating substantive challenges as well.
The dynamic infrastructure being driven by the explosive intellectual and pragmatic/ technical growth
within CSE offers both great promise and potential pitfalls for BIO. Both the community and the NSF will
have to be aware of the path taken by CI as a whole, as well as making sure that BIO helps define that
path. This point must be taken very seriously: BIO must coordinate and “interoperate” with CISE
and secondarily with all the Directorates in defining the extent and fine details of the NSF
structure for CI.
The biological sciences can drive their involvement to meet their needs, as this report recommends. Or,
the members of the community can proceed as before, and then they will simply be dragged along, which
will be far from satisfactory. BIO must recognize that CI is here, is now, has even been growing (up)
for years, but it certainly will be “more here tomorrow.”
the being able to model tens of thousands, a hundred thousand, and then a million atoms, to tens of
millions of atoms, bringing the macromolecular machines of living systems within reach for the first time.
From the complementary approaches to complexity to the shared history of incredible, sustained
advances, the two fields seem ideal for each other: multiplying the exponentials through deep
partnerships will inevitably be exciting beyond any anticipation. However, many issues need to be
addressed, from technical to cultural, from education to process, for the two domains to benefit from each
other. In addition, the context for IT and CI involves all of the sciences, and this interconnectivity must
also be considered.
This report focuses on CSE considerations from a biology perspective. At the frontiers with computer
scientists and biologists are quantitative scientists of all backgrounds, engineers, mathematicians,
statisticians, chemists, physicists, and others – who will be referred to collectively as quantitative
scientists. While BIO should especially focus on a rich partnership with CISE, it is this broader mix
of scientists who will jointly write the story of 21st Century Biology, and BIO should look for
opportunities for partnerships around the entire NSF.
As happened already for the past decades during which molecular biology was established first as a
discipline and then as a set of tools for all biologists, an intellectual thrust or push arising from computing
and information technology over the next decade will “meld” with the biological sciences to form a tool kit
for the next generation of advances, expected to be even more profound than the revolution wrought by
molecular biology. NSF has a critical responsibility to outline a long range vision to accelerate the
process of the meld, beginning with CIBIO.
As all domains of NSF science, engineering and education participate in the CI revolution, how can the
biological sciences maximize their benefits and minimize the inevitable, intertwined perturbations,
the growing pains? How can CI drive fundamental biological discovery and empower more
complete participation for all stakeholders? What are the immediate or near term steps to take,
and what management plans (notably, funding and implementation mechanisms) will be needed?
These are among the central questions addressed in our overview. Some representative objectives and
nuggets for the science are included.
The essence, for BIO utilizing CI in empowering 21st Century Biology, is “Keep
Your Eye on the Prize” --
• Invest in People
• Ensure Science Pull, Technology Push
• Stay the Course
• Prepare for the Data Deluge
• Enable Science Targets of Opportunity
• Direct the Technology Contributions
• Establish National and International Partnerships
Invest in People
CI as an empowering, pervasive technology will change how training is done and will bring that change to
all levels from K-12 to retraining of senior faculty. In doing so, CI will also eliminate geographic barriers
from the educational experience, connecting instructors and students throughout the world. These
aspects will be considered by other arms of NSF. To exploit CI fully will require that computational
awareness be among the goals of those building training programs. As undergraduate curricula around
the country change to reflect the convergence of approaches to address 21st Century Biology and to
reflect the need to establish a strong quantitative foundation for future biologists, the nature of what will
be required at the graduate level and beyond will itself change.
Early on, more short courses, of the types associated with MBL (the Marine Biological Laboratory at
Woods Hole, MA) and CSHL (the Cold Spring Harbor Laboratory, Long Island, NY), among other
institutions, will be especially required. In addition, there will need to be extensive development of upper
level undergraduate courses, with the expectation that initially, they will be heavily populated by graduate
students filling in gaps in their earlier education. Novel training programs of this nature are being
established already for genome (molecular) bioinformatics and will become increasingly common for all
bioinformatics graduate training.
The early, and especially successful, BIO efforts at interdisciplinary training (the RTGs: Research
Training Groups) that helped lead to the NSF IGERT (Interdisciplinary Graduate Education and Research
Training) program, should be inspected again, and the overall context, particularly in terms of the need for
novel training at all of the institutions supported by NSF, considered. BIO should evaluate mechanisms
either for ensuring that more of the IGERT activities fit into the CIBIO framework or for re-establishing
some specialized training activities to accelerate the rate of adoption of scientific computing and
advanced information technology into the biological science community.
As important, significant, and valuable as the NSF IGERT program has been, the challenges of CI not
only require the continued support for such interdisciplinary training, but also require novel training
approaches, particularly for more senior scientists who need retraining, as well as the establishment of
new undergraduate curricula. Training should be built into every major infrastructure award at this
frontier, and not just within centralized programs. The case above, that NSF BIO should reexamine the
need for RTGs, is made from two perspectives. First, the IGERTs are spread around the entire
Foundation and must inevitably reflect this range. Second, numerous reports and articles have pointed to
the need for far more training in the quantitative sciences in the biological sciences and other major
changes in curriculum, which are likely to require some experimentation to implement adequately. Even
though increased training in quantitative methods will provide a framework, the concomitant effort to
introduce more knowledge about scientific computing and information technology provides its own
challenge. To meet these specialized expectations and challenges, a revised RTG program, drawing on
the original successes, could include focused problem solving training environments (programs), and
explore novel (and potentially beyond the more established framework expected for IGERTs) cross-
training, providing an empowerment deep enough to fuel 21st Century Biology.
A different form of “higher education” will be essential immediately: the training of mid-career biologists
and even more senior graduate students on how to use CI and what it can do for their research. One
option that should be considered is a center where biologists can learn enough to specify their needs
accurately, with whomever they need to work; such training needs to start by helping the practicing
scientist identify if they need a CSE investigator, a CSE researcher/academic professional, or simply to
hire a programmer, or instead if there is even an existing tool that will accomplish the required task.
(Admitting to this need shows the range of “introduction to” or experience with computing technology
among biological scientists – some use thousands of hours of cycles per year at high performance
computing centers, and others are just recognizing, particularly from the side of information technology,
that computing will be important for their research.) This approach could be similar to the mouse genetics
courses or the various MBL and CSHL courses, except perhaps of shorter duration but greater frequency
during the year, and only for a few years of the transition to a functioning CIBIO.
Another option would be a Center to which an investigator would provide funds to purchase a service or
obtain unique information. This would be a Center acting as a central repository of knowledge of
biologically-relevant computational tools, and would do software tool development when the needed
algorithm or package did not yet exist. Such a Center must involve training/education among its roles, so
that its implicit knowledge would be disseminated.
In all areas, the opportunity exists to employ new collaborative tools to broaden the traditional education
experience, allowing individual students, or all students, to be spatially removed from instructors, and to
interact routinely and easily with each other and with their instructor. In addition, the new CI will
facilitate the development of new approaches in an international effort, ranging from a more regular dialog
across national boundaries, to the sharing of educational experiences, and to joint projects on an
international scale. This will help prepare students to work in a global economy, and encourage rapid
progress in the biological sciences.
[Figure: National (Biological, Ecological, Biodiversity) Data Centers]
The biology community must decide how best it can interact with the quantitative science
community, where and when to intersect with computational sciences and technologies, how to
partner in science projects, and how the NSF should partner administratively. We all know better
than to permit a “build it and they will come” mentality to dominate the discussion and set the agenda.
However, there will be many buildings created for other communities; some of those we will want to live in
and for others we will learn from the design but will reconstruct for our own needs. For economies of
scale, we can’t afford to manage all of the architect, builder and materials costs ourselves, and need
sophisticated partnerships built on the normal metrics for academic teams with CSE/CISE endeavors and
also especially with GEO, that is, in marine biology and the environmental sciences, as well as the research
and education overlaps with MPS, ENG, SBE and EHR. Indeed, the domain of overlap for the
environmental sciences, between BIO and GEO, is already constructing its own cyberinfrastructure.
Climate, biodiversity and other environmental research communities will have strong, multifaceted
partnerships among NGOs, private research entities, and government.
One aspect of ensuring that the scientific requirements of the biological sciences determine what aspects
of cyberinfrastructure are employed first and most extensively within the BIO domain will be to continue to
build the bioinformatics and computational biology programs for BIO. Only very recently has the
information technology revolution reached numerous biological sciences sub-disciplines. The
communities arising from research in ecology, cell biology, plant science and most other BIO domains
have unmet needs at fundamental levels of computational expertise, and their requirements for software
engineering, for example, have to be met to allow active participation across the biological sciences.
What could be called a basal infrastructure has to be in place for the biological sciences before an
elaborate, highly activated edifice can be erected to drive 21st Century Biology.
In 1984, leading computational biologists, experimental biologists, and computer scientists assembled at
Airlie House to give NSF BIO advice on how to respond to the first step toward a CI, the high performance
computing initiative and what led to funding the NSF National Supercomputer Centers. The discussions
considered the potential for computing across all research areas supported by NSF BIO as well as
biomedical research areas of interest to NIH. The workshop participants concluded (1) biology has a role
in high performance computing; (2) some of the compelling research problems in biology are already
compute bound; that is, they require more advanced computing; (3) that there would be an ever
increasing set of such problems; and (4) that more and more of the biological community would indeed be
able to make effective use of such high performance computing resources. {Hillman et al., 1984}
The scientists at that meeting in 1984 raised the concern, that if they put in the effort to port their code
and so on, they didn’t want to have the resources withdrawn, as had happened recently, at that time, for
an NSF disciplinary resource many of them had used. NSF BIO promised it would stay the course, and,
subject to annual efforts to sustain the budget, indeed, it has. Modeling of biomolecular systems, and
biology in general, represented only a fraction of a percent of usage in 1984. As one consequence of
staying the course and working closely with CISE, biomolecular computing had become the plurality user
at 28% of the time by FY 1998 at the NSF Centers.
The importance then (in 1984), however, pales in comparison to the importance now for NSF to stay the
course. The entire biological science community must and will engage, even those with fewer resources
than those already involved in biomolecular computing. Scientists can now facilitate each other’s progress
in extraordinary ways, and to optimize introduction of 21st Century biology, the biosciences need to be
interconnected to the other NSF domains. For NSF BIO to underestimate the importance for biology
or fail to provide fuel over the journey would be very damaging, perhaps catastrophic, for the
community.
The biological sciences today are swimming in a swift current of data, characterized by an exploding rate
of data production and an exploding ability to capture and manipulate data, which arises across vast
scales and national boundaries, as well as from numerous disciplines. From acquisition through refinement,
reduction and deposition, the current of data, and the rapids along the course, point to a compelling
requirement, across all of the disciplines, for analysis tools and for links across these
interfaces.
All modern biology, from low throughput, spatial reconstruction studies on cells to ecosystems, to the
automated methods of microarrays and (further) sequencing and structure determination research, will
produce a high volume of complex, heterogeneous data, with demands on standards, archiving, mining,
federation or integration, and other contemporary issues in information technology. What distinguishes
bioscience research is not the net volume, though the spatial studies described above will have
terabytes to petabytes of data as will any of the high throughput methods (especially, mass spectrometry
and proteomics), but rather the inherent complexity of the data, which will be very heterogeneous in
data type, in modality of acquisition, and in its ties to biological phenomena, arising from the hierarchical
nature of living systems.
Calculus, in managing the infinitely small but large scale of events filled with redundancy, has been the
language of the physical sciences. The processes of biology, the activities of living organisms, involve
the usage, maintenance, dissemination, transformation or transduction, replication and transmittal across
generations of information. Biology has high information content, along with individuality, historicity and
contingency. Indeed, given the diverse data types and other features of heterogeneity and the
hierarchical nature of biology, the biological sciences as a research discipline are said to be an
information science. As such, information technology is the language of the life sciences, managing the
discrete, non-symmetric, largely non-reducible, unique nature of biological systems and observations.
Bioscience has already borrowed from IT in moving first [...] centric implementations (such as CML, SBML,
EBL). The wide variety of data types alone challenges IT. The cottage industry nature of biological data
collection requires distributed data archiving and [...] can be multiple data resources, run by those with
deep knowledge, but with standards for interaction, integration, federation, for [...] specific queries to
merge relevant data to provide biological discovery and insight. No clear solution has emerged for building
and sustaining biological databases, but despite the challenges, hundreds of public databases exist, some of
which represent major community resources, such as the PDB.
[Figure: The Complexity of ... Axes: Number Scale (over size scale from Angstroms to Km) versus Time Scale
(seconds), 10^-15 to 10^9, extending to Geologic & Evolutionary Timescales. The figure contrasts regions
where modeling can be employed today with goals for coverage. Labels: Atoms, Biopolymers, Cells; Ab initio
Quantum Chemistry, First Principles Molecular Dynamics, Empirical force field Molecular Dynamics,
Homology-based Protein modeling, Electrostatic continuum models, Enzyme Mechanisms, Protein Folding, DNA
replication, Cell signalling, Organ function, Epidemiology.]
Vigorous research programs, and scientific and administrative processes, are needed to ensure
the continued excellence of the extant public databases, in the face of the data deluge. There has
not been enough consideration of stability and continuity. The professional societies need to be more
firmly engaged. Considerable infrastructure is required just in constructing and maintaining a single,
focused community database. As we require the community to submit more of its observations and to
submit that data more rapidly, and as the community acquires the tools to obtain far more data in
much less time, the challenges on the data resources will grow. (So will the challenges on the data
providers, who must also have the software tools to analyze their own data and extract useful parameters
from community databases to drive their experiments.) At the same time, the downstream requirements,
data mining and other tools for biological discovery, need enhanced support as part of the transition to the
research environment and expectations of this century.
As the data, information and knowledge base for the biological sciences grows (exponentially), sustaining
the computational analysis chain, one that inevitably involves deep human intervention and that runs from
data to information to knowledge and, with more sophisticated contemplation, to wisdom, will require
considerably more attention. Bringing modern IT to bear, from data standards to federation tools like
mediation and wrappers, will require substantial collaborative efforts by the community, suitable attention
by the agencies not just to funding but to workshops on standards and the use of grant mechanisms to ensure
collaborations across disciplines, and the full involvement of professional societies and leading technical
journals. BIO staff will need to take lead roles in engaging the breadth of the participants needed.
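The mediation and wrapper approach named above can be illustrated with a minimal sketch. Everything in it is
an assumption made for illustration: the two in-memory "databases", their field names, the accession "X1",
and the functions sequence_wrapper, structure_wrapper and mediate are hypothetical and do not describe any
existing resource. Each wrapper translates one source's native schema into a shared vocabulary, and the
mediator sends a single query to every wrapper and merges the answers on the shared key.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]

# Hypothetical in-memory stand-ins for two independently run databases,
# each with its own native field names.
SEQUENCE_DB: List[Record] = [
    {"acc": "X1", "organism": "E. coli", "len": 1200},
]
STRUCTURE_DB: List[Record] = [
    {"id": "S9", "source_acc": "X1", "resolution_A": 2.1},
]


def sequence_wrapper(accession: str) -> List[Record]:
    """Translate sequence-database rows into the shared vocabulary."""
    return [
        {"accession": row["acc"], "organism": row["organism"], "length": row["len"]}
        for row in SEQUENCE_DB
        if row["acc"] == accession
    ]


def structure_wrapper(accession: str) -> List[Record]:
    """Translate structure-database rows into the shared vocabulary."""
    return [
        {"accession": row["source_acc"], "structure_id": row["id"],
         "resolution_angstrom": row["resolution_A"]}
        for row in STRUCTURE_DB
        if row["source_acc"] == accession
    ]


def mediate(accession: str, wrappers: List[Callable[[str], List[Record]]]) -> Record:
    """Send one query to every wrapper and merge the answers on the shared key."""
    merged: Record = {"accession": accession}
    for wrapper in wrappers:
        for record in wrapper(accession):
            merged.update(record)
    return merged


result = mediate("X1", [sequence_wrapper, structure_wrapper])
print(result)
```

In this arrangement each source keeps its own schema and its own curators; only the wrapper has to change
when a source changes, which is one reason the pattern suits databases run by separate groups.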
Rather than being condemned to repeat the past, bioscientists must be able to query the world’s literature
each morning, as envisioned by W. Gilbert in 1991, to set their experimental path for the day’s research.
(The vision section, termed A Prelude, reviews this prescient statement, a classic white paper laying out
the future for introducing computational and informational technologies to modern biology, done even
before the web provided the mechanism for the transformation.) This singular feature both provides
the opportunity for a revolution beyond anyone’s imagination and the potential for major
limitations if access is not universal or if the free flow of public information is limited by
technology, imagination, resources, policy or other artifacts.
The highest pinnacles for research projects for the biological sciences without exception involve those
with a considerable informatics component, along with any requisite technology and experimental
advances that are also essential. More generally, the stretch-goal projects for the biological
sciences community all include both a need for community-level involvement and the complete
spectrum of CI; namely:
• Instrumentation
• Collaborations
A list that included the major set for all fields in the biological sciences supported by NSF would be
excessively long and inevitably incomplete – every month’s discoveries, sometimes every day’s
discoveries, open new horizons for the biological sciences. Knowledge intensive computing – organized
and understood data – is the theme of computational biology; the tools and approaches of computational
biology are embedded within all of the key science opportunities. A selected set, aimed at being
representative, follows.
Representative Examples
The various genome projects have produced, and will continue to produce, sequence datasets that
can then be used for phylogeny reconstruction; however, the best reconstruction methods are
heuristics for hard optimization problems. Solving these problems for datasets of more than 100
sequences, to an acceptable level of accuracy, is beyond the current capabilities of existing software.
In order for systematic biologists to be able to obtain good estimates of evolutionary trees, new
methods (algorithms and software) need to be developed, and cycles on a high performance
computing platform need to be made available. Novel database technology also needs to be
developed, to tie the inferred phylogenies to sequence datasets. Large-scale simulations also need to
be done to assess the performance of both new and existing software. All such efforts require
collaborations between computer scientists (specializing in databases and algorithms) and systematic
biologists. In particular, such collaborative efforts would enable the entire systematic biology
community to contribute to the final product. The database itself could provide an opportunity for
other scientists (for example, geologists, who study the interactions between the earth sciences and
biological evolution) to benefit.
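One way to see why phylogeny reconstruction must rely on heuristics is to count the candidate trees. The
sketch below uses a standard combinatorial result rather than anything stated by the workshop: the number of
distinct unrooted binary trees on n labeled taxa is (2n-5)!!, the product of the odd numbers from 3 to 2n-5.
The function name count_unrooted_trees is chosen here for illustration only.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted binary trees on n labeled taxa: (2n-5)!!."""
    if n_taxa < 3:
        return 1  # with fewer than three taxa there is only one topology
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n-5.
    for factor in range(3, 2 * n_taxa - 4, 2):
        count *= factor
    return count


for n in (4, 5, 10, 100):
    print(n, count_unrooted_trees(n))
```

For 10 taxa the count is already 2,027,025, and for 100 taxa it exceeds 10^180, so exhaustive evaluation of
every tree is out of the question at the dataset sizes discussed above.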
CI for BIO will enable new opportunities for the well-established, two decades old collaboratory
around long term ecological research (LTER), for research within the ecology/ecosystems community
as a whole, and the new comprehensive, enabling collaboratory called NEON. In particular, ever
more powerful sensors (regularly updated to be fully state-of-the-art), measuring physical, chemical,
biological, meteorological, spatial and other relevant parameters, and associated computational,
informatics and communication tools, will enable ecologists to obtain an in depth understanding of
population processes and the interplay of all the living and physical players in the ecosystem.
Sensors will provide a revolutionary advance in tracking threatened species and otherwise following
and characterizing the behavior of individuals in populations under study in the natural world. Any
individual creature can have its own IP address, an extraordinary expansion, for example, on
traditional tracking of individual marine species. The transformation of data, then information, and
ultimately knowledge, empowered by the new sensor technology, in population studies cannot be
overstated. The “24 by 7” monitoring of environmental and population events, through analysis of
individuals, will require smart software filtering of the data, but will enable the establishment of robust
models. All sponsors and all performers in the environmental sciences will incorporate these kinds of
technologies, methodologies and infrastructure into their research. As this major, new, innovative,
exciting and truly extraordinary platform for scientific discovery and societal relevance is established,
NEON, which can only be conceived and implemented in a comprehensive cyberinfrastructure
environment, will provide both robust approaches to reductionist analysis of specific organisms and
phenomena, and the ability to construct novel syntheses toward a fully comprehensive understanding
of larger scale ecosystem behavior.
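The "smart software filtering" that continuous monitoring will require can take many forms. As one hedged
illustration, not a method proposed by the workshop, the sketch below forwards only those readings that
deviate from the median of the preceding readings by more than a threshold. The function name filter_events,
the window size, the threshold and the sample temperature values are all assumptions made for the example.

```python
from collections import deque
from statistics import median


def filter_events(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings that deviate from the median of the
    preceding `window` readings by more than `threshold`."""
    recent = deque(maxlen=window)  # holds only the most recent readings
    for index, value in enumerate(readings):
        # Only judge a reading once a full window of history is available.
        if len(recent) == window and abs(value - median(recent)) > threshold:
            yield index, value
        recent.append(value)


# Hypothetical temperature readings from a single sensor, with one excursion.
readings = [20.0, 20.1, 19.9, 20.0, 20.2, 27.5, 20.1, 20.0]
events = list(filter_events(readings))
print(events)
```

A median is used rather than a mean so that a single excursion does not distort the baseline against which
the next readings are judged.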
[Figure: Characteristics of Ecological Data. Axes: Volume (per dataset), Low to High, versus
Complexity/Metadata Requirements, Low to High. Labeled data types: Satellite Images, GIS, Weather Stations,
Business Data, Gene Sequences, Most Software, Biodiversity Surveys, Primary Productivity, Population Data,
Soil Cores, Most Ecological Data.]
The revolution wrought by NSF in plant genomics from the model to crop plants has changed the face
of the biological sciences forever. Through enabling a new generation of plant science research, the
construction of a comprehensive CI for BIO is a key next step in this revolution. Cutting edge stretch
goals, which will require collaborations among experimentalists and computational biologists and
strengthened bioinformatics support, are:
1. Identify, characterize, and understand the genes responsible for domestication of crops;
2. Characterize and understand the mechanism of polyploidization and genome reduction;
3. Uncover and characterize the molecular mechanism of symbiosis.
Test beds for Establishing Comprehensive, Collaborative CIBIO Teams (CITs)
BIO should select a few programs, termed CITs for convenience herein, to pilot the full introduction of
the advantages of cyberinfrastructure, from multiscale, federated information environments to virtual
centers or collaboratories, from shared expensive instrumentation to novel knowledge repositories,
and certainly, with rich educational environments. Given the priorities for NSF and the state of the
science and the community’s involvement at the interface with information technology, among the
choices would be CITs for microbes (genomics and communities), ecology and ecosystems and plant
genomics. Ultimately, as NEON comes to be, NEON itself (for example, with the information
requirements and implications from its embedded sensor nets, populations wirelessly tracked) will
intrinsically be a CIT, but the richness of the infrastructure could be enhanced and the research
discoveries enhanced through a concentrated, focused effort at strengthening the corresponding CI.
At this moment, BIRN represents the most fully implemented example, a functioning, expanding CIT,
within the biological sciences, although BIO can also learn from NEES, GriPhyN and the other early
CITs outside biology. A representative, obvious and important example of an extension to specific
BIO domains would be an ecological CIT, a pilot with planned, regular expansion as a CI for
ecosystems science, ECI or an EARN, ecological analysis research network, with a coordinating
center and one to two dozen research partners. A detailed description of such a model for the
ecological sciences, with the aim of achieving a grand synthesis for biocomplexity and other aspects
of the exceptional components and dynamics of ecosystems, is in an appendix. Such a Center would
be developed in consultation with the cognizant professional societies and a specialized BIO
workshop.
Building a cyberinfrastructure for the biological sciences, selecting the right targets and milestones, will
require careful attention to the data intensive nature of all of modern biology. Beyond data manipulation,
the technology must enable an easily accessible knowledge management framework. Biologists are
already collaborating with computer scientists to create pilots and establish the process by which the
incoherent output of large numbers of investigators can be channeled so that all may benefit. Early
steps for CIBIO are to expand upon BIO’s database and information management activities, using
the successes in structural biology and ecology, and to establish a broad cyberinfrastructure umbrella
for computational biology and bioinformatics throughout the BIO research arenas. Keeping these
principles in mind, we provide a set of priorities on the technology side of CIBIO.
1. Establish how to create database federations: linking disparate databases. Support research for
biosciences into:
a. Federation and integration of heterogeneous databases; ontology development; pedigree;
data validation; provenance chains; the processing pipeline, and other bioinformatics
tools.
b. Data mining, data exploration capabilities, affordable tools.
c. Sustainable, stable knowledge resources.
d. How to optimize analysis algorithms for interactions with databases TOL, computing with
inputs that change.
2. Large scale data analysis that cannot be accomplished on a desktop.
3. Supporting advanced simulations, multi-scale, standardized methodologies, linking, chaining
uncertainty.
4. Network middleware for domain services, throughput will change the way we work with data, grab
whole databases or application servers.
5. Sensor development systems.
6. Cyberinfrastructure resources: professional services, systems support, helpdesk support,
resource centers and tools for distributed infrastructure development. (NSF PACI centers do this
to some extent now, with some specialized software support, but BIO should work with the
community to identify key needs and set pathways for future implementations.)
7. Hardware applications raise numerous issues; for example, commodity processors might not be
useful for all BIO problems, and the overarching principle is “better” (routine, reliable, easier and
more useful) access to computational resources.
8. Major need for next generation biological science: the means for interacting with sensors,
including protocol stacks.
9. International collaboration on development, long-term, intergovernmental collaboration, co-
development with sister programs.
10. Mechanisms should be in place for the hardening of community application code, following extensive
debate and evaluation through professional societies and peer review.
11. Develop collaborative projects to address integrated environmental challenges – research goals
that lie beyond the expertise and resources available to one community and require broad
participation – and require a very rich information infrastructure, as well as build out of LTER and
NEON.
12. Support an interface of experimental biology, bioinformatics, and software engineering through
interdisciplinary teams who establish modeling frameworks that integrate biological systems with
physical and engineering systems for insight into how living processes and organisms function at
a systems level.
13. Combining spatial images, observations on species, and a variety of physical/digital data; this should
include biological information stored in multidisciplinary, knowledge resources that include
geospatial, demographic, economic and other data for land use and disaster management.
In an era of data-intensive biology, reliable, routine and robust access to an international level of
information infrastructure (to the organized products from observations, modeling and interpretations
generated the world over) will be critical in order to sustain progress, for scientists to remain competitive,
and to exploit the potential insight to be derived from comprehensive knowledge resources. Storage grids
and compute grids will frequently not be local and sometimes not even regional, and data grids will
certainly be widely distributed, coupling input from very remote sites. Research contributions, perhaps
even more notably in biology than in other sciences, will arise from the entire world. Tomorrow’s CI will
bring together remote resources (expertise, instruments, data resource, computer platforms) and provide
access from the end user’s desktop. This aspect will have no disciplinary, national, political or cultural
boundaries, and reflects the growing awareness that science itself is a global enterprise, for example, in terms
of new discoveries and insight to come from numerous research settings around the world, the existence
of international databases, the explosion of electronic literature, and the extraordinary depth of
information available on the world wide web. In addition, the growing international flavor, beyond the
leading economies of the world, can be seen in the rapid spread of ideas in electronic as well as journal
hard copy form, the increased facilitation of the exchange of data with colleagues around the world, and the
strengthening of the role and recognition of international scientific organizations (or societies). The most
obvious transformation is the development of an international e-science community that debates and
exudes excitement 24 hours/day, 365 days/year. Furthermore, an international flavor will facilitate and
will often be needed to address many scientific questions (often with social impact) that are by their
nature global in scope, such as understanding basic ecological and ecosystems processes, especially in
the context of global ocean / atmospheric circulation, where observations, expertise, and resources are
needed from across the world. After all, political, social, cultural and economic boundaries are human
inventions, and the physical world, while it may have shaped some of those boundaries, follows its own
“path.”
While local, regional and National solutions to immediate requirements have, of course, to be established
as quickly as possible, biologists need to envision a global grid, and to think and act locally, regionally,
and globally. The challenges, not so much in thinking but very much in implementing, of international
contexts for anything, especially infrastructure, are quite large. As a consequence, from the beginning,
the design and implementation of a National CI should be considered in the international context. Given the
requirement for access to the world’s literature on a routine basis, the expertise and the resources to build
CI will not arise from only a single country or region of the world.
Within the biological sciences, numerous examples already exist of international interactions that rely
upon the current, albeit partial, implementation of CI. For example, the Protein Data Bank (PDB) is the
international repository of 3D macromolecular structure data, and is now evolving into an internationally
managed activity. Nearly from its beginning, GenBank has been an international collaboration in storing
and managing DNA sequence data. The International Long Term Ecological Research (LTER) activity is
a conceptual network of researchers, using research resources sited around the world for exploring
regional and global questions in ecology. PRAGMA is an organization that has biological applications as
a key set of drivers to build collaborations among research efforts located around the rim of the Pacific
Ocean. The National Center for Microscopy and Imaging Research (NCMIR) provides international, high
bandwidth, remote access, via dedicated Internet connections, that allows investigators to obtain three
dimensional information on biological objects. Among other instruments, NCMIR houses an intermediate
voltage microscope and provides software tools for computed axial tomographic reconstruction of objects
within thick specimens using information from rotated images. Investigators also use similar methods
developed by NCMIR to access a high voltage electron microscope in Osaka, Japan. Data from such
studies are incorporated into a multiscale database termed the Cell Centered Database (CCDB).
[Figure: Model-based Integration of Multi-resolution Data: Development of a Cell Centered Database.
Labels: parallel computing resources for tomography; spatial database of rat brain anatomy; models;
neuronal models; database federation; imaging databases; large scale 3D EM reconstructions; cells and
tissues; modeling cellular microdomains; cellular processes; cellular microdomains; macromolecular
distributions; correlated LM and EM; hi-throughput tomography.]
These activities, which are only selected examples among numerous biological science projects with
international components and involvement, illustrate the value of working across national boundaries and
the extra complexity (language, culture, policy) and time investments needed to be successful. Some
specific additional challenges for international collaboration and the associated approaches for NSF
include:
• Funding of joint international projects (funding both sides of the collaboration)
o Work with funding agencies in other countries.
• Accessing data for comparison runs into potential barriers of different laws in different countries.
o Work with government agencies to ensure the basic principle that “open access” to
publicly funded data is guaranteed for scientific and educational endeavors.
o Work with International Agencies to accelerate development of CI in developing
countries and to facilitate their access to resources and their ability to develop needed
expertise.
• Shared resources will be developed and deployed under local (at the National level) funding, but
will become part of the global CI.
o Work with Funding Agencies from cognizant Nations to reach simplified principles for
sharing the resources.
• Exchange of researchers and students among Nations will be important in ensuring the most
productive international CI environment.
o Develop incentives to encourage undergraduate, graduate and postdoctoral students to
spend time outside their own culture conducting research.
To fully establish a CIBIO, the investments in people will need to keep in mind the international
implications. Researchers will need to have an opportunity to be exposed to the global consequences
and environment, and senior scientists must be empowered to prepare future research generations to
work productively and succeed in the global context that is already beginning to come into being. This will
entail incentives to change the overwhelming imbalance in the exchange of students, especially in the
sciences. But this effort is essential to overcome the too prevalent but naïve belief that the only good or
reliable science and technology research and development is conducted in the United States or perhaps
the US and Europe.
Two corollary responsibilities arise from the simultaneous importance of databases for 21st Century
Biology and the related consequences of the growing international nature of information resources.
Considerable community involvement and discussion is needed to ascertain the right directions, and the
community must be encouraged to see the big picture, to understand the longer term implications. The
very word “community” must also mean all users of the databases, not just a narrow view held by a few
major data providers, or a view exclusive to the keepers, managers, or archivists of the database. On the
other side of the coin, the agencies in the USA, and in particular NSF with its unique expertise, point of view
and credibility, must also assert a leadership role from within the agency, not just in terms of dollars
provided to the community. The managers of a database cannot speak for the Nation or for the
community in the same way, that is, with the same authority, credibility and impact, as can officials from
the agencies. Thus, to ensure the requirements of the community are met, NSF must take a proactive
role in international settings. Of course, NSF must provide continuity and reliable, sustained funding for
major information resources in its domain, but it must also participate in international standards setting;
the credibility of NSF is an essential vehicle to ensure the international effort is on track, that the US effort
is focused in the right directions and remains state of the art, and that access to important databases for
the biological sciences is not compromised due to competing standards established in other international
settings or to commercial interests.
Implementation Strategies
NSF faces many hurdles in attempting to build a cyberinfrastructure for all of the sciences, and BIO
certainly will have its own share of issues to address. The scale of the costs associated with CI means
that BIO will have to choose directions carefully, build pilots and then expand them through robust
mechanisms, and find suitable partners whenever possible. Even once adequate levels of funding are in
hand to support initial community activities, BIO will have to balance a portfolio of investments. This
is considered below under “Funding and Management Mechanisms.” NSF does not expect that its
budget will grow fast enough to build CI by itself. Even with extraordinary budget enhancements, since
the requirements for CI are fully intertwined with all research activities, the interdisciplinarity and scope are
such that NSF’s programs must work together, and NSF’s work with other agencies and countries is
essential now and will always be essential. The local, regional and international impacts and
responsibilities for BIO are considered below under Outreach and Partnerships. Examining a
variety of mechanisms in place already, both within NSF as well as in other agencies, will facilitate BIO’s
efforts to ascertain the what and how for their initial investments and actions. Among the responsibilities
is to balance the options and prototypes, to be careful not to prune early growth, and to confront head on
the major challenge of sustained investments in all parts of CI for biology, ranging from the adequate
support of data resources from acquisition to management and development of query tools, the
maintenance and on-going development of community software for modeling and analysis, the formation
of innovative collaborative efforts driving discovery along the frontier, and the development of new as well
as expanded vehicles for education and training.
Details about internal management will certainly have to be worked out by NSF BIO, but considerations
from the perspective of researchers are provided. These are in the form of recommendations as to
actions by BIO.
• A suite of channels, alternative vehicles and pathways for investigators to obtain funding, are
needed to ensure creative ideas can flourish for infrastructure; a variety of means for funding CI
should be created, both within existing and in new programs that focus on CI BIO.
• One avenue for maintaining a suite of channels was the choice between regular investigator
initiated proposals/awards and the CISE program PACI; while weaknesses have been noted for
such overarching programs, PACI provided a flexibility for starting new directions and for taking
risks. As PACI comes to a close, the overall pipeline and its component parts (science
challenges, software requirements, connection of science pull to technology push, and so forth)
should be considered by the research Directorates and notably BIO. ITRs will not fill the void
being left by the PACI program. ITRs are long term, basic research, but there is little expectation
that these would be for infrastructure development and deployment. Other mechanisms for
development and maintenance will be required; to participate fully in the CI of the Foundation as
a whole, all Directorates will have to ensure a core effort for their own disciplines. Simple
infrastructure needs for research directorates will not be met by CISE, and BIO needs to engage
in a dialogue with CISE over expectations and how to sustain the right focus for biology in the
new context of CI.
• The biological science community varies widely in its degree of adoption of scientific computing.
For some communities, a service center, tailored to the unique situation of the cognizant
subdiscipline and community, could provide a place where a given set of BIO PIs can turn to find the
CS partnerships they need. This could be a stand-alone Center, with some similarities to the LTER
Coordinating Center.
• There needs to be a critical mass of people focused on bio cyber infrastructure. This could focus
activities in groups of investigators, for the development and maintenance of CI as well as for
training the next generation of scientists.
• CI BIO will range from leading-edge-to-routine (“every day”) facilities and tools. BIO in setting up
funding mechanisms should recognize that there are different kinds of infrastructure, a range of
human experts and similarly, a range of community software code hardening and support. These
activities could fall under the same coordination or research support ‘center’ as above.
• A centralized resource or center could also serve to ensure the community can adopt the latest,
best hardware and software tools, can enable more ready access to new technology, and
facilitate standards that allow each member of a distributed team to accelerate their efforts by
connections to the information obtained by others. NIH has funded a project, the Biomedical
Informatics Research Network (BIRN), described below, that federates data from a
community of researchers, as well as building other collaboratory aspects for 21st Century Biology
research. This project may be a model for various CIBIO aspects. As currently constructed, BIRN
has a coordinating center to put out all of the hardware, software, pipes, and to get the distributed
PIs to move their workflows into that framework. Currently, the Coordinating Center is funded at
$2.5M/year for CI people to support the 200 investigators.
[Figure: Model Exists: Architecture To Support a Biological Informatics Research Network. Labels:
Harvard; Cal Tech; NSF NPACI; Cal-(IT)2; W/SDSC; UCLA; “Deep Web”; Duke.]
with programs across the Foundation, set high goals for CI within BIO itself, and also evaluate
“catch-up strategies” for individual sub-disciplines within BIO that would otherwise not be ready to
participate fully.
• The CIBIO could be initiated with three tiers: (1) select via peer review a small set of “critical
mass” activities – a core team, probably distributed geographically, with enough funding to have
an early impact; (2) find similarly a distributed set of medium-sized projects in domains that tap
into this core; (3) identify smaller, individual investigator activities that can benefit by direct
connections to the infrastructure.
• NSF BIO will need to structure the funding mechanisms to be sure the funds are indeed used for
projects that either directly empower investigators to use CI or develop the CI for BIO, and are not
directed into conventional projects.
BIO should establish a cross-cutting CI activity for BIO, with separate review processes, and
including invigorated bioinformatics and database activities and a modeling and simulation or
computational biology core. NSF pioneered the federal support of database activities and
computational biology, and should now exploit the CI opportunity to create a new generation of these
research endeavors.
All categories of infrastructure are increasingly important for scientific research, but a strong
cyberinfrastructure will be particularly essential for the biological sciences. What will be important is to
recognize that infrastructure cannot be treated the same as individual research proposals. Robust,
rigorous peer review is very important for establishing the best opportunities; competition is also
important and similar, overlapping efforts will need to be initiated in many cases and then the best project
will ultimately become clearly identified. However, infrastructure needs simply cannot be evaluated
directly against individual research, and separate, centralized review and oversight will be
needed. Infrastructure benefits all, but has a different time frame, different budgets, different staffing
(more academic professionals), and cannot be properly reviewed in connection with individual research
projects. Public policy, not peer review, is the basis for balancing levels of funding infrastructure and
investigator-initiated projects, levels balanced to ensure optimum progress by the community.
Collaborations across biology, across institutions and across other disciplines will be the hallmark of 21st
Century Biology. In allocating funds along with ensuring appropriate means for peer review of CIBIO
projects, NSF must put in place highly effective channels for collaborations beyond those for existing
programs. NSF must prepare accordingly by sending an even stronger signal than that provided by the
existing mechanisms.
Suggesting that the biological sciences will participate in big science activities is both quite misleading
and needlessly controversial. Instead, a healthy mix of microscale (i.e., the so-called cottage industry or
small lab, individual investigator, hypothesis-driven research) with mesoscale (i.e., interdisciplinary,
collaborations and team efforts) is needed and is today being established naturally, without centralized
planning or top down intervention. In this era (for simplicity according to its emerging properties) of
mesoscale biology, the NSF should be particularly attuned to these distinguishing features:
o Portfolio policy-level considerations include cost sharing in instrument acquisition and
mechanisms to enable PIs to sustain applications / cover recurrent costs of user facilities.
• Facilitating effective utilization of the Data Deluge: rapid data acquisition for high throughput
biology and equally effective acquisition for complex data sets for many lower throughput
approaches; organization and long-term maintenance of data sets; creation of novel query and
integration tools.
• Developing Quantitative Approaches for all fields of the biological sciences: move beyond binary
biological questions and answers (yes or no, up or down, on or off, spot or no spot), and provide
a strong extension of adequate training to address the complexity of biology.
As pointed out above, each component of all of the science-funding agencies will need to establish its
own course, since all communities will participate in CI. Of necessity, the internal NSF interactions for
BIO must begin by coordinating as much as possible with CISE, and by exploring options, most notably
the potential for joint programs. Especially in the environmental context, there will continue to be a major
area of overlapping interest with GEO. Nonetheless, CI promises to be the most interconnected activity
at NSF in its history, and there will be productive interfaces with all of the other Directorates.
For reasons described above and in the background material in the Appendix, NSF is the only agency
that can lead CI and BIO the only possible leader for the biological sciences effort. However, nothing
succeeds like success, and the scientists working in applied life science research and funded by the NIH,
DOE, USGS, ONR, or components of other mission agencies, will require access to CI, and thus, the
mission agencies should look to BIO’s efforts as a model. Conceived and run as more targeted efforts
than an NSF project, one arm of NIH is already invested; other Institutes are interested. A role for NIH in
their biomedical world can be expected, that is, NIH will continue to build group activities around specific
health topics, such as for the human and model brain studies facilitated by the BIRN project. At this point,
NSF has covered much of the infrastructure development in basic CSE, and driven innovation in early CI
across the biological sciences. What could prove to be very important for the community will be the
potential, led by NSF, of “pulling or pushing” NIH a little further into the development of specialized CI,
and helping with seamless connections between the CI of basic and biomedical science research.
Similarly, private foundations, most notably the Howard Hughes Medical Institute, naturally will also have
to extend CI to their investigators, and opportunities for collaborations should be explored.
The pervasive, ubiquitous connections that must characterize a fully successful CI will require a form of
“intellectual glue,” a willingness to fill in the gaps and reach out to all communities and Institutions. The
bottom line on outreach and partnerships begins with the core: BIO can ensure democracy and that CI
delivers a new era in biological sciences research. This role is essential and no one else will – or
even has the capacity to - undertake such an effort.
Just as the grid is universal, CI has huge international implications. Through professional societies and
continued use of BIO’s ongoing relationships, BIO will need to coordinate with policy makers in Europe
and Asia. NSF’s International Directorate has already begun a CI effort, known as PRAGMA, linking the
USA and Far East Asia. CIBIO should prove an opportunity to cement relationships with Japan and the
EBI around the Protein DataBank, and more generally, to ensure that there is an international information
infrastructure for the biological sciences, or, in other words, to ensure that there are common standards
and shared efforts for ontologies.
Immediate Steps for BIO
Continue to take risks. NSF BIO has the capacity to be far more adventuresome than the mission
agencies. The plunge is essential for 21st Century Biology, and the adventure will benefit not only
fundamental biology but also the applied biology supported by the mission agencies.
Ensure success through adequate, sustained funding of the best activities. A very serious, central
issue that urgently must be addressed before bringing an implementation plan to the community is what
are the commitments for sustaining the infrastructure projects over 15 years? Who actually runs a given
project needs to be determined by peer review and investigators and teams will compete, and some
projects will change ownership. However, it will take a number of years to establish the infrastructure,
and once established, CIBIO must stay in place to sustain the full development of 21st Century Biology.
A CI for the biological sciences is, of course, going to require attention and support for the indefinite
future, but no matter what first generation implementations are established, there will be a need, along
with routine community involvement and active dialogue on the best mechanisms to drive research, to
establish plans for a sunset review of the projects as a whole and a careful full assessment of mechanisms,
progress and impacts. A new plan should be established as technology and biological understanding
advance, and of course, the sun might rise to shine in the same path and on the same organizations.
(Sunset involves rigorous assessment, yet often should simply lie before renewal.) In consideration of the
engineering research centers and the science and technology centers, we suggest that this broad
overview occur in time to set a revised agenda in this fifteen-year time frame. The time frame is chosen
to emphasize the deep and central significance of stability in putting together productive teams at
multiple institutions across many disciplines. Without stability, it will be difficult to develop and clearly
impossible to sustain an efficient, effective cyberinfrastructure for biology. One vehicle to begin
discussion about the necessary long term plan would be for NSF BIO leadership to put together a set of
milestones or a roadmap with projections for the lifetime first of developmental and then of
maintenance efforts, and share that long range plan with the BIOAC and ultimately, through numerous
society meetings, with the broad community.
Prepare a five year plan as a first step to place the biological sciences on the map toward a
successful CI; that is, identify the especial role NSF BIO will play and empower the community to
look forward effectively. An implementation workshop would be an effective vehicle to review this plan,
which could be developed through a leadership retreat coupled with post-panel long range plan
discussions and the output from the satellite meetings of professional societies. BIO must also make
sure the plan is international.
Anticipate the accelerated pace that will characterize CI for all of the sciences. The growth path will
be so rapid, the requirements so extensive, and the costs so significant, that a broad swath of the BIO
research community needs to be involved in planning the implementation phase. To ensure the best
ideas can compete, the entire community, more generally, needs to be forewarned and encouraged to
prepare (in order to participate fully).
Obtain input from the broader BIO community. Given the extraordinarily broad importance of CI,
building a cyberinfrastructure for the biological sciences will absolutely require considerable, initial and
ongoing input from the entire community. One mechanism, which could prove particularly useful and cost
effective, would simply be to include a talk, a few talks, or a round table discussion at subgroup meetings,
i.e., at the satellite meetings typically held before biological science professional society annual meetings.
BIO program officers could participate in a session in which a given discipline examines its needs and
well-established investigators in the field address their vision for CI. Such discussions will also be
important to ensure full participation and thereby the consideration of the opportunities for all of BIO (at
the level of the specific scientific domains of individual Divisions). Thus, biological society meetings could
serve to help NSF gather information and plan how to implement the funding. There have been examples
of this previously in the biology community’s interactions with NSF, from the role the Genetics Society of
America plays in genetic stock centers to the involvement in the Protein Data Bank by the International
Union of Crystallography. Programs at other agencies, such as SciDAC (Scientific Discovery through
Advanced Computing) at the Department of Energy, and the National Tissue Culture Cell Repository, at
the National Institutes of Health, also reflect community input, and may represent collaborative channels
for communication and outreach to the community. As BIO moves toward an implementation phase, NSF
could release a call for proposals aimed at scientific societies, to encourage them to identify key needs
and requirements and make sure that the CI activities are inclusive and bring in new people. In addition,
a larger meeting to consider specific implementation across the BIO domain will be very important.
Appendix I
July 2003 Central Questions for a CI BIO
Questions grouped in order to focus and facilitate the breakout group discussions. Each breakout group
and each attendee considered all questions.
A:
B:
C:
D:
What further meetings or actions are needed in the near term? What are the first steps for BIO?
Appendix II
Schedule and Assignments
(14-15 July 2003, NSF BIO Workshop)
14 July 2003
Morning Session
9:30 start.
Introduction, Welcome, Charge by AD, BIO; Workshop Chair, BIO AC (9:30-9:40)
Presentation by DAD, CISE on CI as viewed by CISE (9:40-10; Qs&As, 10 - 10:10)
Review of Initial Requirements and Overview of CI-BIO (“seabio”), CH (10:10-10:20)
<break>
Examples and Models for CI / examples from biological science (10:40-11:40)
Overview of Grid, CI Technologies, Issues, CA (11:40 – 12:10)
Working Lunch (12:10-1:10)
Lunch discussion includes overviews, general group discussion and assignments for breakout groups
Afternoon Session
Breakout Groups formed, begin (A. 1:10-2:40; < break; then reassemble>; B. 3-4:30)
A. Science Assignments; Science Drive, Pull; “Why SciBIO CI,”
B. Application Assignments; Generic Infrastructure; Technology Push
Presentations from Breakout Group Chairs - Review Initial Contributions (4:30-5:10)
(<break>)
Round Table Discussion (5:30-6:30)
Break for dinner by 6:30; expect continued discussion
9 PM, hotel
Brief Meeting of Writing Group/Steering Committee, Review Outcomes, Plans for Day 2
Morning Session
9 AM Start
Reform Breakout Groups (A. 9-10; <break>; B. 10:15-11:15)
[Review Monday “A” and “B” discussions, add “C” to first; add “D” to second discussion]
[During Breakouts, Complete First Draft Major Writing Contributions, esp. both re BIOSCI and General
Enablers – connections to CISE & Technology]
Review Output for consensus (11:15-12:15)
Working Lunch – continued discussion
[Outcomes, Implications, OPTIONS FOR FUTURE MEETINGS, OTHER STRATEGIES]
Afternoon Session
Final Draft of Each Section Prepared 1:15 – 2:25; outline key points/prepare overview
Overview presented to AD, BIO. possibly also to CISE, other NSF. (2:30-3:15)
<break>
Integrated Draft Prepared (3:30 – 6:30 PM)
Day 3 – Wed AM 16th, revised “first draft” given to AD, BIO, for use in NSF August discussions.
Package presented, reviewed at BIOAC Nov 14, 2003
Assignments
Appendix III
Workshop Participants
Nancy Amato
Texas A & M University
amato@cs.tamu.edu
(979) 862-2275

Peter Arzberger
San Diego Supercomputer Center
parzberger@sdsc.edu
(858) 534-5079

William K. Michener
Long Term Ecological Research Network Office
wmichene@lternet.edu
(505) 272-7831

Margaret Palmer
University of Maryland
mp3@umail.umd.edu
(301) 405-3795
Appendix IV
What Is Cyberinfrastructure?
An Overview of the Blue-Ribbon Panel/Atkins Report: Overall NSF Considerations on CI
The extraordinary impact of computing and information technology, from transaction databases and wired
commerce to the World Wide Web, is apparent already in society, in which the transition to an internet
economy has happened in a fraction of the time taken by the telephone and the television in penetrating
the Nation’s homes. In the sciences, where the future is being created today, the interplay is already at
least as profound, and the impact of information technology (IT) seems certain to sustain a positive
second derivative, continuing change empowering science and serving society. All disciplines of science
and engineering face challenges inherent in studying more complex phenomena, managing to drink from
the fire hose of contemporary instrumentation and experimental observation, and establishing data mining
tools to probe beneath the surface of massive data sets. Indeed, a common theme to the analyses by the
communities of science is that the next generation, the 21st Century of discovery, must deal with the
actual details, not simple abstractions, recognizing that all natural systems are inherently complex,
heterogeneous and nonlinear in behavior and in all cases span huge spatial and temporal scales. At the
same time, the future of computing at all levels fuels data analysis on ever more effective scales, and
Computing Platforms will exploit emerging trends: the rise of Linux; the introduction of productive, efficient
commodity clusters; the rapid expansion of participation and availability toward ubiquitous access; and
the maturation of grid computing.
Leading to the Atkins study or blue ribbon panel commissioned by CISE were NSF’s and NSB’s
recognition of the pervasiveness of IT, including the neologism “cyberinfrastructure” or CI to reflect the
revolution. Consequently, a discussion has been ongoing concerning, among other key matters, what CI
would enable, what types and levels of CI would be useful in different communities for both education and
research, what balance should be established between research into advancing the quality of CI itself
versus research and funding toward implementation of instances of CI for specific communities, and what
the balance should be between CI and other forms of infrastructure.
The report of the Blue-Ribbon Advisory Panel on Cyberinfrastructure, or simply the Atkins report, is the
product of an interdisciplinary team charged to consider (1) the extant high performance computing program; (2) new directions
for CISE; (3) implementation plans to meet the recommendations of the study or report itself.
A very extensive appendix considers the breadth of CI, and includes an extensive survey of the scientific
communities. Many aspects of the report should be considered in the sense that not just the biological
sciences but most sciences have been making increasing use of IT and are only beginning to adopt fully the
tools and approaches that would be most beneficial. The panel asserted that CI is now at the core of
revolutionary science in every discipline, and that NSF must take a central role in its deployment; that
NSF must consider people and software as co-equals with hardware in the context of CI, and recognize
that to date NSF has not placed enough emphasis on people, on software, and on maintenance,
compared to hardware acquisition and implementation. Networking, clusters, grids and many other
aspects of modern computing that are quite relevant to biology have been considered, including the need
to deliver the full cyberinfrastructure (IT, data and knowledge resources, scientific computing tools, grid
access, visualization tools, etc.) to the desktop of individual students, as well as the need to create a new
workforce, one that is expert in a domain but conversant enough in IT and scientific computing to exploit
the opportunities and select the most productive collaborations.
Appendix V
Models for a Comprehensive CIBIO Based on Extant Test Beds utilizing
Advanced IT
The BIRN test bed uses the most advanced networking available and will draw heavily on resources of
the next generation Internet. The upgrade for networking is funded by the National Science Foundation
for both design and implementation. The initial awards join General Clinical Research Centers (GCRCs)
and co-located Biomedical Technology Research Resources, as team-level, shared user facilities (P41s)
to establish the necessary infrastructure in the context of ongoing neuroimaging research projects. A
separate grant for a "system integrator" that coordinates network, grid, and data mining software
development as well as hardware configurations will be awarded to a recognized leader in such technical
development and service efforts.
BIRN has become an example for all implementations and discussions of CI, a project that could be
called one of the founding activities for a comprehensive cyberinfrastructure. Thus, for BIO
considerations, BIRN provides the most fully established model for how communities of researchers can
share data and conduct research across distance. Built upon a foundation from previous work (supported
by various NSF awards, namely a Grand Challenge award and NPACI, and by NIH funding as NCMIR,
NBCR and the Collaboratory), BIRN researchers have created a set of tools that distribute computing,
manipulate data, and control instruments. http://www.nbirn.net
The NSF supports the participation of scientists from the USA in PRAGMA, which includes an
extraordinary breadth of commitment and participation from Asia. All of the institutions involved have
committed to active participation and contribution of labor and resources for applications and the test bed.
http://www.pragma-grid.net
Telescience Examples
Effective telescience collaborations are supported by portal interfaces into tomographic codes, which
have been distributed on various platforms (in the US, Japan, and Taiwan), and by means for storing large
data sets. Funding comes from both NSF and NIH. http://gridport.npcaci.edu/telescience
Appendix VI
Potential Prototypes for BIO Implementation
NSF BIO will have to evaluate the strongest options for early prototypes and also establish milestones
and a roadmap for the overall national and international effort in building a CI for the biological sciences.
Areas of potential impact can be found across all of the BIO research domains, including the Tree of Life,
Deep Green, the extant database activities, LTER, new FIBR activities, comparative
genomics and an understanding of cis-regulatory elements during development, and so on. Due to the
enhanced funding and focus on microbes by NSF and also by DOE, there is strong potential for BIO, in
partnership with DOE (and perhaps secondarily with NIAID and NIEHS) to develop a CI for microbes,
ranging from genomics to studies on environmental communities. As the plant genome projects advance
their sequencing and database efforts, the overall information infrastructure to be developed will also
provide a strong base for an NSF BIO CI. As NEON is developed and integrated with ongoing advances
in the science enabled by LTERs, there will be extraordinary opportunities, building on sensor nets and
meso-scale collaboratories, to exploit a cyberinfrastructure to accelerate our understanding of
ecosystems and our ability to utilize that understanding to the benefit of society. As an example of what
such a successful CI would be, one such development within the ecosystem community is outlined
below.
To realize the vision will require the creation of common platforms, applications that support data
management and manipulation, computer intensive applications, and advances in high-end computing as
well as desktop computing.
The CI, through proper management, will lead to increased scientific democratization that is particularly
relevant to internationalizing our community and our research. Access to data and computing capabilities
will be enhanced worldwide. This is particularly important for ecologists given the rapid environmental
changes taking place globally as well as the rapid loss of biodiversity in parts of Asia, Africa, and
South America. Furthermore, internationalization through enhanced ECI is critical to answering the many
key ecological questions that require multiple forms of archived data, real-time global scale data and the
ability to work locally (at home) with remote collaborators working in all parts of the world.
Suggested early steps by the community and for the NSF include the following. However, a detailed
implementation plan will require further community input.
1. BIO should establish a coordination center for the national effort, the EARNCC, which would
serve as the hub for technology introduction and dissemination and for intellectual organization
and process management; when mature, the EARN should include about two dozen regional
nodes. (More than that many sites would be required for comprehensive analysis of ecosystem
dynamics and interactions, and certainly more would be required for what will have to be an
international effort, but an initial EARN would inevitably have to be a pilot, and only when the path
is established should more coordination centers and associated nodes be added.) The
coordination center, at a single site, would consist of a group of computer scientists, ecologists,
and academic professionals, who serve as a national coordinating group.
A. The center director in concert with a science advisory board would make decisions on
funding of EARN development proposals – these would not be investigator-originated
ecological research proposals but rather (for example) proposals to create new software
or tools that would help a particular community of ecological researchers and managers.
B. The center should be an entrepreneurial entity (in identifying and leading community
efforts to set important directions for EARN) and the coordination entity for a network of
service nodes (i.e., the center would provide support to regional service nodes who would
be the links to individual investigators and research teams).
3. The expectations for the regional service nodes
A. Provide advice and consultation to individuals and research teams on technical issues
(hardware, software, network design).
B. Host education and training programs/workshops (particularly focused on graduate
students, postdocs, and mid-career scientists).
*Note - items D-G above particularly involve programs and incentives for computer scientists to work
with ecologists, or work on ecologically motivated problems AND all of the above require sustained
support of CI personnel (academic professionals, programmers, and experts in visualization,
networking, and other computer science applications).
Ecological forecasting, evolutionary ecology and biodiversity, a set of interconnected ecological research
arenas, urgently require a set of tools from scientific computing and advanced information technology that
would be provided by this comprehensive CI:
• Analytical modeling and simulation
• Scientific computing at levels of high performance
• Database creation, organization, development, mining and other queries and analyses
• Communication networks
• Wireless sensor networks (for data sets in general and particularly for biodiversity)
The following are a set of environmental “Science Nuggets” that outline the interesting adventures of
discovery that a CI for the ecological sciences would enable.
This is a new and untapped field, which has been ignored largely because there has not been the
technology to ask the right questions, and even more seriously, since there never before has been an
opportunity to develop the requisite CI. The field will be able to answer questions such as: what are the
sound niches of different organisms? Do organisms within an ecosystem always partition the entire
sound spectrum and if not, why? How does this partitioning influence communication among organisms?
The tools of CI would be essential for monitoring and modeling bioacoustics. The experimental setup
would involve small microphones deployed across habitats (such as wetlands or cities) recording the full
spectrum of sounds (both anthropogenic and natural). The observations would create multi-gigabyte
databases daily, at a minimum, and would create more depending on how pervasive the technology
became in the community. Data must be obtained, sampled, stored, organized, managed, processed,
and mined, passing through the typical data chain, to provide understanding. This will be a huge challenge for
ecoinformatics. The data sets will be quite large, beyond ready human comprehension, so creative CI
tools will be required to provide visual representations and entry points for analysis, as well as means for
the detection of sophisticated, deeply embedded patterns, and other still unknown phenomena.
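The "multi-gigabyte databases daily" estimate can be checked with simple arithmetic. The sketch below is illustrative only: the sampling rate (44,100 Hz), sample size (2 bytes), and single channel are assumptions chosen for the example, not values given by the workshop, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_bytes(sample_rate_hz, bytes_per_sample, channels, n_microphones):
    """Uncompressed audio bytes recorded per day by a set of microphones."""
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return bytes_per_second * SECONDS_PER_DAY * n_microphones


if __name__ == "__main__":
    # Assumed recording parameters: 44.1 kHz, 16-bit, mono, one microphone.
    per_mic = daily_bytes(44_100, 2, 1, 1)
    print(f"{per_mic / 1e9:.2f} GB per microphone per day")  # prints 7.62
```

Under these assumptions a single continuously recording microphone already produces several gigabytes per day, and the total grows linearly with the number of microphones deployed.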
An environmental event detection, recording and analysis process, or event science, would begin with
smart sensor networks that detect changes and also enhance sampling during those events. This
research challenge has plagued environmental biologists for years. (Solving the challenge is critical for
basic research as well as for applied research; the challenge is also related to national security issues.) We can
answer questions like: How do storm events or temperature inversions influence N deposition, and what
are the pathways by which the elements move through the soil and groundwater to reach surface waters?
Such a science would require that the EARN link directly to research innovations, because it depends on
advances in computing hardware (related to wireless communication and power systems) and software
(intelligent sensor webs that turn on and off to conserve power; smart sensors that take over other
functions as necessary when others fail).
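As a minimal sketch of how a sensor might "enhance sampling during those events", the following assumes a very simple rule: if two consecutive readings differ by more than a threshold, the sensor samples at a short interval; otherwise it returns to a long, power-saving interval. The function names, the threshold, and the interval values are hypothetical and are not drawn from any system described in this report.

```python
def next_interval(previous, current, threshold, base_s=600, event_s=10):
    """Choose the wait (in seconds) before the next sample.

    A change larger than `threshold` is treated as an event and
    triggers the shorter `event_s` interval; otherwise `base_s` is used.
    """
    return event_s if abs(current - previous) > threshold else base_s


def plan_intervals(readings, threshold, base_s=600, event_s=10):
    """Return the interval chosen after each consecutive pair of readings."""
    return [
        next_interval(prev, curr, threshold, base_s, event_s)
        for prev, curr in zip(readings, readings[1:])
    ]


if __name__ == "__main__":
    # Assumed temperature readings with one abrupt jump (20.1 -> 23.5).
    print(plan_intervals([20.0, 20.1, 23.5, 23.6], threshold=1.0))
    # prints [600, 10, 600]
```

A real sensor web would need more robust event detection and coordination among nodes; the point here is only that sampling effort and power use can follow the signal rather than a fixed schedule.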
The infrastructure required includes the development of software for sensors, possibly collaborations with
materials scientists and computer scientists on sensor development itself, and the use of efficient and
flexible network designs that transmit (wirelessly) huge amounts of digital information in a form that can
be captured and stored immediately.
The overall infrastructure, as a bottom line, will deliver very large data streams that will allow us to
ask and answer a new generation of questions about the environment and how ecosystems function.
The kinds of questions are an extension of LTER research and represent ultimately the goals of NEON,
but the actual technology needed for this revolutionary expansion of ecological science would be enabled
by establishing a strong baseline effort in environmental event science.
Appendix VII
References for CIBIO Report
There is an enormous wealth of material available—both online and in print—addressing the possibilities
for twenty-first century science within the framework of a modern cyberinfrastructure. This literature
covers a wide range of topics and disciplines; and indeed, it would be impossible to reproduce all of this
information here. Instead, we have chosen to highlight below some of the best outside references we
have come across.
The references listed below can be accessed by visiting CIBIO on the web at
http://research.calit2.net/cibio/.
• Next Generation Biology: The Role of Next Generation Computing
Overview and report of the workshop held 20 July 1998 by the NSF, National Institutes of Health,
and US Department of Energy.
• The Biomedical Information Science and Technology Initiative
Cyberinfrastructure and its potential uses for Biomedical Computing. Document prepared by the
Working Group on Biomedical Computing, Advisory Committee to the Director, National Institutes
of Health (3 June 1999).
• Trends in Computational Biology: A Summary Based on a RECOMB Plenary Lecture, 1999
Early paper discussing the potential applications of computers to biology. Useful as a historical
primer for the current effort to build a cyberinfrastructure for the biological sciences.
• The OptIPuter
Article from the November 2003 issue of Communications of the ACM detailing the use of
advanced computing architecture in conjunction with distributed cyberinfrastructure.
o Computing Research in the FY 2003 Budget Request
Excerpt from AAAS Report XXVII: Research and Development in FY 2003
• Digital Libraries
David Hart details the NSF's approach to the creation of usable digital resources in the next
millennium, and how cyberinfrastructure will play a role.
• Implications Of Information Technologies: IT Overview
Links to multiple web sites in the Division of Science Resources Statistics of the National Science
Foundation.
• National Science, Technology, Engineering, and Mathematics Digital Library (NSDL)
Program Solicitation
Grant and program solicitation information (NSF document 03-530).
• National Science Foundation FY 2004 Budget Request Overview
o FY 2004 Summary of NSF Accounts
• National Science Foundation FY 2005 Budget Request to Congress
Biological Sciences Budget Request Excerpts from FY 2005