Wittig 2017

Accepted Manuscript
Title: Data management and data enrichment for systems biology projects
Authors: Ulrike Wittig, Maja Rey, Andreas Weidemann, Wolfgang Müller
PII: S0168-1656(17)30288-2
DOI: http://dx.doi.org/doi:10.1016/j.jbiotec.2017.06.007
Reference: BIOTEC 7916
To appear in: Journal of Biotechnology
Received date: 21-2-2017
Revised date: 6-6-2017
Accepted date: 9-6-2017
Please cite this article as: Wittig, Ulrike, Rey, Maja, Weidemann, Andreas, Müller,
Wolfgang, Data management and data enrichment for systems biology projects. Journal
of Biotechnology http://dx.doi.org/10.1016/j.jbiotec.2017.06.007
Data management and data enrichment for systems biology projects
Ulrike Wittig, Maja Rey, Andreas Weidemann, Wolfgang Müller

Scientific Databases and Visualization Group,
Heidelberg Institute for Theoretical Studies (HITS gGmbH),
Schloss-Wolfsbrunnenweg 35, 69118 Heidelberg, Germany
Highlights
SABIO-RK: Manually-curated kinetic data for modellers and experimentalists
Excemplify: Excel sheet handling for experimentalists
SEEK: Data and model management for systems biology projects
Abstract
Collecting, curating, interlinking, and sharing high quality data are central to de.NBI-SysBio,
the systems biology data management service center within the de.NBI network (German
Network for Bioinformatics Infrastructure). The work of the center is guided by the FAIR
principles for scientific data management and stewardship. FAIR stands for the four
foundational principles of Findability, Accessibility, Interoperability, and Reusability, which
were established to enhance the ability of machines to automatically find, access, exchange,
and use data.
Within this overview paper we describe three tools (SABIO-RK, Excemplify, SEEK) that
exemplify the contribution of de.NBI-SysBio services to FAIR data, models, and experimental
methods storage and exchange. The interconnectivity of the tools and the data workflow within
systems biology projects are explained. For many years we have been the German partner in
the FAIRDOM initiative (http://fair-dom.org) to establish a European data and model
management service facility for systems biology.
1. Introduction
The increasing amount of data does not necessarily entail an increasing amount of knowledge.
"The real problem is that we have failed to store and organize much of the rapidly accumulating
information (whether in databases or documents) in rigorous, principled ways, so that finding
what we want and understanding what's already known become exhausting, frustrating,
stressful and increasingly costly experiences" (Attwood et al., 2009). To use and reuse
data, storage, organization, and communication in a structured and standardized format are
needed. Today, the FAIR principles sum up what data organization should be: Findable,
Accessible, Interoperable, and Reusable (Wilkinson et al., 2016). All of these principles, except
for Accessibility, rely on data quality. Biocuration is a key to data quality (Bateman, 2010).
Findability is enhanced by using standard identifiers and annotations which point to standard
ontologies and databases. The same applies to the use of controlled vocabularies. This allows
answering questions that arise from ambiguous information, for example: Does the
abbreviation "Glu" in one document have the same meaning as in another document? An
identifier based on standards determines unambiguously whether "Glu" represents glucose or
glutamate.
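The role of identifiers in resolving such ambiguity can be illustrated with a minimal Python
sketch. The document names and identifier strings below are hypothetical placeholders, not
accessions from any real ontology or database. The only point is that two mentions are
compared by their annotated identifier rather than by their label.

```python
# Hypothetical annotations: each mention of an abbreviation in a document
# is linked to an identifier. The identifiers are placeholders for
# illustration only, not real ontology accessions.
annotations = {
    ("paper_A", "Glu"): "EXAMPLE:glucose",
    ("paper_B", "Glu"): "EXAMPLE:glutamate",
    ("paper_C", "Glu"): "EXAMPLE:glucose",
}


def same_entity(doc1, doc2, label):
    """Return True if `label` refers to the same entity in both documents.

    Returns None if either mention is not annotated, because the question
    cannot be answered from the label alone.
    """
    id1 = annotations.get((doc1, label))
    id2 = annotations.get((doc2, label))
    if id1 is None or id2 is None:
        return None
    return id1 == id2
```

Without annotations the function can only return None, which mirrors the situation of a
curator who has to read the surrounding text to decide what "Glu" means.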
Interoperability is greatly enhanced by using common exchange formats. In systems biology,
SBML (Systems Biology Markup Language) (Hucka et al., 2003) is a common and widely used
example. Standard exchange formats allow automatic, machine-readable data exchange and
enable the development of automatic data workflows between databases, data management
systems, and applications, e.g. simulation tools.
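As a small illustration of machine-readable exchange, the following Python sketch reads the
species from a hand-written SBML fragment using only the standard library. The model content
is invented and heavily simplified, and it has not been validated against the SBML
specification. In practice a dedicated library such as libSBML would normally be used
instead of raw XML parsing.

```python
import xml.etree.ElementTree as ET

# A deliberately tiny, hand-written SBML Level 3 Version 1 fragment.
# The model and species ids are invented for illustration.
SBML_TEXT = """<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1">
  <model id="toy_model">
    <listOfSpecies>
      <species id="glc" name="Glucose" compartment="c"/>
      <species id="g6p" name="Glucose 6-phosphate" compartment="c"/>
    </listOfSpecies>
  </model>
</sbml>
"""

# Namespace of SBML Level 3 Version 1 core elements.
NS = {"sbml": "http://www.sbml.org/sbml/level3/version1/core"}


def list_species(sbml_text):
    """Return (id, name) pairs for all species in an SBML document."""
    root = ET.fromstring(sbml_text.encode("utf-8"))
    path = "./sbml:model/sbml:listOfSpecies/sbml:species"
    return [(s.get("id"), s.get("name")) for s in root.findall(path, NS)]
```

Because the element names and the namespace are fixed by the format, the same function works
on any SBML file of this level and version, regardless of which tool produced it.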
Finally, Reusability is greatly enhanced if a file carries metadata, i.e. descriptive information
about its context. This includes information about the original data source (e.g. organism,
laboratory sample), the procedures by which the data were generated (e.g. experimental
setup, environmental conditions), and further information that allows unique data attribution.
These relevant parameters should be present in a data file or connected to it. The MIBBI
(Minimum Information for Biological and Biomedical Investigations) standards initiative seeks
to define the minimum information needed to fully understand the context, methods, data, and
conclusions that pertain to an experiment (Taylor et al., 2008).
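A minimum-information requirement of this kind can be sketched as a simple completeness
check. The field names below are hypothetical and are not taken from any MIBBI checklist.
They only mirror the categories named above (data source, procedures, attribution).

```python
# Hypothetical minimal metadata checklist; the field names are
# illustrative and are not taken from any MIBBI checklist.
REQUIRED_FIELDS = ("organism", "sample", "experimental_setup", "conditions", "creator")


def missing_metadata(record):
    """Return the required fields that are absent or empty in `record`."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Invented example record accompanying a data file.
record = {
    "organism": "Homo sapiens",
    "sample": "liver tissue",
    "experimental_setup": "immunoblot",
    "conditions": {"temperature_C": 37, "pH": 7.4},
    "creator": "",  # left empty on purpose
}
```

Here `missing_metadata(record)` reports the empty `creator` field, i.e. the data could not be
attributed to anyone.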
All the information needed to represent and understand the data depends on the manual work
of biological experts who annotate and curate it. Biocuration is the transformation
of biological data into an organized form (Bateman, 2010). The main motivation is to move
towards self-curation of data by experimentalists and authors, so that the owners of the data
become sensitized to standards, formats, and controlled vocabularies.
In this paper we first present the curated system SABIO-RK, a database for highly
structured, quickly reusable kinetics data for biochemical reactions.
We also describe our HCI (human-computer interaction) and algorithmic efforts towards
increasing the findability of data in everyday use (e.g. see the section on Excemplify).
The paper explains how the FAIR principles are applied to the tools we offer as the
de.NBI-SysBio service center within the de.NBI initiative. The de.NBI-SysBio center focuses on
standards-based management solutions for data and models, with systems biology as its core
area. Our expertise and experience lie in data management, data curation, and data
standards.
We first review the SABIO-RK database (http://sabiork.h-its.org), then Excemplify, a tool for
data collection, and SEEK (http://www.seek4science.org), a tool for data and model
management that is the basis for the FAIRDOMHub (http://www.fairdomhub.org) data
management service.
Note that all services described here involve humans providing service. To use Excemplify and
SEEK/FAIRDOMHub, however, users must be trained to understand the benefits and pitfalls of
data curation and how to use a variety of tools and standards to accomplish their goals. This
is why teaching and training, consulting, and advice play an important role in infrastructure
projects like the present de.NBI project.
2. SABIO-RK: Reaction kinetics database
SABIO-RK (http://sabiork.h-its.org/) is a web-accessible, manually curated database for
biochemical reactions and their kinetics (Wittig et al., 2012). The database has been developed
to support scientists in modelling and understanding complex biochemical networks by
structuring kinetic data and related information from the literature. SABIO-RK uses a
reaction-oriented approach for representing kinetic data, in contrast to most other biological
databases, which focus on proteins or enzymes (e.g. BRENDA (Placzek et al., 2017),
UniprotKB (The UniProt Consortium, 2011)). Examples of reaction- or pathway-oriented
databases are KEGG (Kanehisa et al., 2014), Reactome (Croft et al., 2014), and MetaCyc
(Caspi et al., 2014). Unlike these other databases, SABIO-RK does not summarize all available
kinetic data for one reaction or enzyme in a single database entry but separates kinetic
parameters by environmental conditions and literature source. It stores the kinetic parameters
available from publications together with kinetic rate equations, protein/enzyme information,
biological source, and environmental conditions. Users can access the database via the web
interface or automatically via web services, for example using Python scripts. They can export
their search results in XML formats (SBML, BioPAX/SBPAX (Demir et al., 2010; Ruebenacker
et al., 2009)) or in a spreadsheet format, which is mainly preferred by experimentalists.
2.1. Collecting and curating data from the literature

The content of the SABIO-RK database originates predominantly from scientific publications
containing kinetic data. As of February 2017, SABIO-RK comprises more than 55,000 database
entries with data extracted from more than 5,500 publications. These data are related to 912
organisms, 7,196 reactions, 1,559 enzymes, and 4,283 UniprotKB accession numbers for
proteins. The kinetic parameters include 43,308 substrate-specific constants (e.g. Km), 37,395
velocity constants (Vmax, kcat), and 11,215 inhibition constants (Ki, IC50).
The selection of articles is based on user requirements within collaborative projects or external
user contacts. SABIO-RK offers a public curation service, accessible on the user interface
website, where users are encouraged to send requests for specific research interests.
In particular, users whose searches return insufficient results in SABIO-RK are automatically
invited to add curation requests.
A typical data integration and curation workflow includes the selection of publications from
a literature search, the reading of the articles, the manual extraction of information by students
or biological experts, and the manual insertion of the data using a web-based input interface.
To avoid errors and inconsistencies, SABIO-RK database curators read the paper a second
time to validate the data and to adjust them to SABIO-RK data standards. This includes the
annotation of data with external unique identifiers pointing to ontologies, controlled vocabularies, and
external databases (UniprotKB, KEGG (Kanehisa et al., 2010), ChEBI (deMatos et al., 2010),
EC-Enzyme Classification (http://www.chem.qmul.ac.uk/iubmb/enzyme), BTO-Brenda Tissue
Ontology (Gremse et al., 2011), SBO-Systems Biology Ontology (Courtot et al., 2011), GO-
Gene Ontology (The Gene Ontology Consortium, 2000), NCBI taxonomy (Sayers et al., 2011)
etc.). Finally, the data are transferred to the public online database.
In compliance with the FAIR principles mentioned above, data in SABIO-RK are annotated to
allow Interoperability and Reusability. Through these annotations, SABIO-RK is highly
interlinked with other biological databases and ontologies. Currently about 20 % of SABIO-RK
users enter the database via external links from other databases (e.g. UniprotKB, KEGG,
BRENDA, ChEBI).
Both the reading and the curation process require biological expert knowledge to understand
the publication, to extract and standardise the relevant information, and to guarantee
high-quality data in the database. The published data to be extracted for the SABIO-RK
database are often scattered over the whole paper (see Figure 1). Controlled vocabularies and
annotations to standard identifiers, which would increase the Findability of data, are often
missing in the publication, which generates extra work for the biological experts who have to
interpret and assign the information.
SABIO-RK represents all kinetic information for one specific reaction under specific
experimental conditions from a defined biological source in one dataset, called a SABIO-RK
database entry. This information can be viewed and exported as a single data set. Depending
on the amount of data it contains, one article can generate many different database
entries. The publication used as an example in Figure 1 results in 37 single database
entries in SABIO-RK (http://sabiork.h-its.org/newSearch?q=pubmedid:11994161) containing
overall 190 kinetic parameters (e.g. velocity constants, inhibition constants, substrate
concentrations) for 5 different biochemical reactions.
Kinetic parameters are mainly described in free text but are also displayed and repeated in
tables and/or figures, which can cause conflicts between these scattered pieces of information
(Wittig et al., 2014a).
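The notion of a database entry as "one reaction, one biological source, one set of conditions,
one publication" can be sketched as a grouping operation. The following Python example is an
illustration only: the class, the field names, and all values are invented and do not reflect
the actual SABIO-RK schema or content.

```python
from collections import defaultdict
from typing import NamedTuple


class Parameter(NamedTuple):
    """One kinetic parameter as reported in a publication (illustrative)."""
    reaction: str
    organism: str
    conditions: str  # e.g. "pH 7.4, 37 C"
    publication: str
    kind: str        # e.g. "Km", "Vmax", "Ki"
    value: float


def group_into_entries(parameters):
    """Group parameters so that each key corresponds to one entry:
    one reaction, one biological source, one set of conditions, one paper."""
    entries = defaultdict(list)
    for p in parameters:
        key = (p.reaction, p.organism, p.conditions, p.publication)
        entries[key].append((p.kind, p.value))
    return dict(entries)


# Invented example values, not taken from SABIO-RK.
params = [
    Parameter("R1", "Homo sapiens", "pH 7.4, 37 C", "paper_X", "Km", 0.1),
    Parameter("R1", "Homo sapiens", "pH 7.4, 37 C", "paper_X", "Vmax", 2.0),
    Parameter("R1", "Homo sapiens", "pH 6.0, 25 C", "paper_X", "Km", 0.3),
]
entries = group_into_entries(params)
```

The three parameters from the same paper end up in two entries because they were measured
under two different sets of conditions.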
Figure 1: SABIO-RK database entry (ID 40735) and arrows indicating extracted
information distributed in the corresponding publication (PMID 11994161). On the right
one can see a structured SABIO-RK entry that puts semantically close data close to
each other. The arrows show that items that are close semantically do not need to be
close to each other within the publication.
As Figure 1 shows, data which are related to each other can be distributed widely over the
whole paper. Some strongly related values may appear in the Results section, others in the
Material and Methods or Discussion section.
Because publications are largely unstructured, a large amount of manual work
by biological experts is still needed to understand the whole publication. Analyzing a
publication just sentence by sentence or paragraph by paragraph with text mining tools is not
sufficient. At the moment, natural language processing tools for automatic data extraction and
text understanding are not able to fulfill our requirements. SABIO-RK, for example, contains
about 250 database fields which have to be filled with information about enzymes, proteins,
compounds, reactions, parameters, etc. Currently no text mining tool is able to extract this
comprehensive amount of information from a publication (Karp, 2016b).
One main challenge in extracting information from publications is how exactly the entities
(e.g. compounds, proteins, enzymes) can be identified within an article, which relates to the
FAIR principle of Findability. The use of unique identifiers and standard names given by
ontologies, controlled vocabularies, and databases is essential for an unambiguous data
assignment, but most articles lack unique identifiers and controlled vocabularies (Wittig et
al., 2014a,b). As a solution, journal editors should encourage authors to use complete,
standardized, and structured data in their publications. Collaborations between publishers and
database developers to agree on common standards and data formats are desirable for the
future. In addition, experimental results could be collected electronically and automatically
uploaded to databases or data management systems together with all relevant standardized
metadata for documentation, exchange, and further use of the data (see the sections on
Excemplify and SEEK for more details).
2.2. Finding, reading and using data in SABIO-RK

In the section above we focused on extracting and curating data for the SABIO-RK database.
However, the true value of SABIO-RK for the user lies in enabling the Accessibility and
Reusability of the SABIO-RK data. Much care has been taken in SABIO-RK to provide
easy-to-use and intuitive data access.
Web services: Nowadays, we receive the largest number of requests via web services and
through modelling platforms including CellDesigner (Funahashi et al., 2007), VirtualCell
(Moraru et al., 2008), and SYCAMORE (Weidemann et al., 2008). All queries in SABIO-RK can
be used for building web service requests. These can be sent by any HTTP client, as the
services are RESTful (Representational State Transfer). The query is given by a number of
parameters, and the result is received as the payload of the response to a GET request.
Depending on the needs of the user, the query result can be SBML, BioPAX/SBPAX, a
SABIO-oriented XML Schema, or tables. We provide Python examples of using our web services.
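How such a GET request could be assembled is sketched below in Python, using only the
standard library. This is not one of the examples distributed with SABIO-RK. The endpoint
path, the `q` parameter name, the `Organism` attribute, and the `AND` syntax are assumptions
that should be checked against the current SABIO-RK web service documentation before use.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for SBML export; verify against the SABIO-RK documentation.
BASE_URL = "http://sabiork.h-its.org/sabioRestWebServices/searchKineticLaws/sbml"


def build_query_url(**attributes):
    """Combine attribute:"value" pairs with AND into one GET request URL."""
    query = " AND ".join(f'{name}:"{value}"' for name, value in attributes.items())
    return f"{BASE_URL}?{urlencode({'q': query})}"


def fetch(url, timeout=30):
    """Send the GET request and return the response payload as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the URL needs no network access; fetch(url) would send the request.
url = build_query_url(Substrate="Glucose", Organism="Homo sapiens")
```

Keeping URL construction separate from the network call makes the query easy to inspect,
log, or paste into a browser before any data are downloaded.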
Web interface: Over the years, we have taken much care to simplify the use of the SABIO-RK
web interface as well as the querying. SABIO-RK lets the user choose to (i) enter a
query term into the free text search bar, and/or (ii) use the filter options (e.g. whether the
user is looking for wildtype enzymes, mutants, or recombinants, for transport reactions, or for
certain environmental conditions), and/or (iii) use the advanced search with a given list of
search attributes. The free text search queries almost all data fields in the database
non-specifically, whereas the advanced search allows more specific queries through the
selection of defined attributes. Combinations of both query types are possible.
The search bars (free text and advanced search) provide autocompletion. Only terms that
would yield a non-empty query result are suggested. The result sizes to be expected are shown
to the user (see Figure 2). This enables the user to avoid costly empty queries when looking
for rare data items.
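The idea of suggesting only terms that lead to a non-empty result, together with the expected
result size, can be sketched in a few lines of Python. The index content is invented, and the
sketch makes no claim about how SABIO-RK implements its autocompletion.

```python
from collections import Counter

# Invented index: one term occurrence per database entry that contains it.
indexed_terms = ["Glucose", "Glucose", "Glucose 6-phosphate", "Glutamate", "Glycerol"]
term_counts = Counter(indexed_terms)


def suggest(prefix, limit=5):
    """Return (term, expected_result_count) pairs for terms starting with `prefix`.

    Only terms present in the index are suggested, so every suggestion
    is guaranteed to yield a non-empty result.
    """
    prefix = prefix.lower()
    matches = [(t, n) for t, n in term_counts.items() if t.lower().startswith(prefix)]
    # Most frequent first; ties are broken alphabetically for a stable order.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return matches[:limit]
```

For the prefix "glu" the sketch suggests "Glucose" with two expected results, followed by
"Glucose 6-phosphate" and "Glutamate" with one each. A prefix without matches yields no
suggestions at all.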
Figure 2: Screenshot from SABIO-RK web interface including autocompletion and
number of expected results.
Figure 3 shows the search results for the attribute "Substrate" selected in the advanced
search and the term "Glucose" chosen from the selection list. On pressing enter, the search
term "Substrate:Glucose" is entered into the search field. As a consequence, entries are
returned that concern reactions whose substrate is glucose.
Figure 3: Screenshot from SABIO-RK web interface containing the results for search
term Substrate:Glucose represented in Reaction View. Database entries for the first
reaction Glucose + ATP = ADP + Glucose 6-phosphate are selected for export.
Thus, the user can choose whether to enter queries into the main search bar (i.e. simple query
specification) or the advanced search field (i.e. precise query specification). All queries are
entered by the system into the main search bar, where they can be cut, pasted, and extended
by hand. The query result is presented either in Entry View or in Reaction View. While
browsing the result, the Visual Search allows the user to interactively restrict the query
further by clicking in the diagram, e.g. to select a specific organism, tissue, or kinetic
parameter type.
Users can select data for export; the selection is visualised as a shopping cart in the upper
right corner. For example, Figure 3 shows that all 234 database entries for the first reaction
in the list are selected for export. By clicking on the shopping cart, the data can be exported
as spreadsheet, SBML, or BioPAX. This workflow is simple, well known from shopping
applications, and has a high user acceptance.
In this section, we have described SABIO-RK as a hand-curated data source in which a high
degree of manual curation, together with an elaborate search interface and flexible
export functionality, enables easy Findability and Reusability of data. However, this degree of
curation cannot be performed on all relevant publications. This leads to efforts to distribute
the curation workload to the users of lab information and data management systems. One
such example is Excemplify, described in the following section.
3. Excemplify: Spreadsheet handling for experimentalists
Excemplify (name created from Excel + Simplify) is a web-based application whose initial
purpose was collecting, structuring, and annotating data from immunoblot experimentalists
(Shi et al., 2013). Since then, it has been extended to more diverse scenarios.
The key challenge was to build a tool that lets users continue their habitual Excel-based work
during their experimental procedure in the laboratory, but simplifies keeping internal standards
and, in turn, sharing the data with others.
The key observations guiding the development were:
• It is hard to motivate users to put data into a system if they do not immediately get
  anything in return. Helping unknown others unfortunately does not count as an incentive
  here.
• For a single person working alone on a set of experiments it is hard to be consistent
  across experiments, as consistency is measured via human readability.
• In Excel-based self-management, each experiment is accompanied by a series of
  Excel sheets. Each sheet corresponds to an experimental stage; each sheet contains
  data of the previous stage and is then completed using data that describe the current
  setup or the current measurement.
The driving forces in the design of Excemplify are therefore (i) building a tool that gives the
user something tangible in return, (ii) building a tool that enables people to be more consistent
in their self-documentation, and (iii) building a tool that does not aim at getting people away
from Excel, but rather enables them to pursue their current way of working more effectively.
Figure 4: Excemplify is based on the observation that a series of Excel sheets
accompanies each experiment.
In Excemplify, each sheet is generated on the basis of the previous experimental stage and
thus describes the next experimental stage (see Figure 4). Addressing point (i) of the previous
enumeration: Excemplify frees users from transforming Excel sheets manually. They commit
fewer errors, and some operations are quite tedious to do by hand. Using a tool for these
operations is not only beneficial but also perceived as a benefit. Addressing point (ii): by
letting Excemplify perform the Excel operations, many potential sources of errors are avoided.
These include typos, cut-and-paste errors, and errors in transformations such as
transpositions or permutations of columns. Self-standardisation is facilitated. However,
addressing point (iii), at each stage the actual data entry is done via Excel, using the full
freedom of the Excel user interface. The way of working does not change much, except for
using Excemplify instead of Excel for the sheet transformation operations.
Technically, Excemplify transforms Excel sheets into each other. It has a flexible parsing
framework that breaks up sheets into regions. These then can be transformed using
appropriate transformer objects. Excemplify is a web application and users have their own
accounts. After login, using Excemplify mainly means uploading an Excel Sheet to Excemplify
and receiving a transformed Excel sheet back.
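The principle of breaking a sheet into regions and applying a transformer to each region can
be sketched as follows. This Python example is not Excemplify code: sheets are modelled as
plain lists of rows, regions are assumed to be separated by empty rows, and the function
names are invented for illustration.

```python
def split_regions(sheet):
    """Split a sheet (list of rows) into regions separated by empty rows."""
    regions, current = [], []
    for row in sheet:
        if any(cell not in (None, "") for cell in row):
            current.append(row)
        elif current:
            regions.append(current)
            current = []
    if current:
        regions.append(current)
    return regions


def transpose(region):
    """Transformer: swap rows and columns of a rectangular region."""
    return [list(column) for column in zip(*region)]


def transform_sheet(sheet, transformers):
    """Apply the i-th transformer to the i-th region; rejoin with empty rows."""
    result = []
    for region, transformer in zip(split_regions(sheet), transformers):
        if result:
            result.append([])
        result.extend(transformer(region))
    return result


# Invented sheet with two regions: a setup table and a lane assignment.
sheet = [
    ["sample", "S1", "S2"],
    ["dose", 1, 2],
    [],
    ["lane", "sample"],
    [1, "S1"],
]
# Transpose the first region, keep the second one unchanged.
next_sheet = transform_sheet(sheet, [transpose, lambda region: region])
```

Performing the transposition in code rather than by hand is what removes the cut-and-paste
and permutation errors mentioned above.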
A public demo version is accessible at http://sabiork.h-its.org/excemplify/.
From a data manager's point of view, this means that Excemplify receives sheets and is able
to store them. Excemplify trades a service for the user against properly annotated data.
Ironically, in the stand-alone Excemplify setup, the only stage without support from Excemplify
is the last one: collecting the final data of the experiment. This step is motivated by
automatic deposition: Excemplify enables the user to upload the data in Excemplify to a
connected SEEK instance, for example the FAIRDOMHub or a project SEEK instance. This
allows the data to be stored in a structured format and exchanged, offering Reusability as
defined by the FAIR principles.
Extensions to the Excemplify concept:
In the paragraphs above, we have discussed Excemplify as a data collection and storage tool.
In its base version, Excemplify is intended to be lightweight and just inputs and outputs Excel
sheets and lists of Excel sheets. The tool explicitly tries to avoid duplicating functionality of
other tools, in particular data exploration functionality. However, it has turned out that users
want to use the software differently. Many users want to explore their data before they share
them with a wider audience. This motivated adding such functionality to Excemplify, including
the interactive graphical display of spreadsheet data (see Figure 5).
Figure 5: Excemplify screenshot containing the graphical representation of example
immunoblot data.
A positive side effect of supporting experimentalists in handling their different Excel sheets,
from the planning of the experimental setup to the storage of the experimental results, is that
all relevant metadata can be stored, processed, and passed on to the next experimental
stage. Metadata such as the biological sources, protocols, or experimental background
information are mandatory for the setup planning. They are therefore passed through all
phases of the experiment and finally stored and exchanged together with the experimental
results using the SEEK/FAIRDOMHub data management system, which allows broader
Accessibility and Findability, makes the data interoperable with other data, and thus improves
Reusability.
4. SEEK: Data and model management for systems biology projects
SABIO-RK and Excemplify are each centered around one type of data. However, systems
biology projects are interdisciplinary. As each discipline has its preferred repositories, data
tend to be dispersed over multiple locations. This lack of centralization hinders dissemination
within the project as well as reuse, and reduces productivity (Bourne et al., 2015).
The openSEEK/FAIRDOMHub data management system (Wolstencroft et al., 2015, 2017) has
been developed for such interdisciplinary projects to support the storage and exchange of data
from research partners based on the FAIR principles (Wilkinson et al., 2016): Findability,
Accessibility, Interoperability, and Reusability. The FAIRDOMHub (http://www.fairdomhub.org)
is built and run by the transnational FAIRDOM project, of which we are part. openSEEK can
be used in two ways: either via the FAIRDOMHub or by setting up a separate instance of
openSEEK for a project or a work group. openSEEK is the combination
of the SEEK and openBIS tools (Bauch et al., 2011). The frontend to openSEEK is the SEEK
system. We thus use the short form SEEK in the following.
SEEK/FAIRDOMHub is a web-accessible data management platform offering public
information and a password-protected user area. A variety of data security levels allow
controlled access to digital assets (data, models, SOPs), so that assets can be shared securely
between project partners or preliminary data can be kept private.
SEEK situates itself between a lab notebook on one side and data publication systems
designed around datasets (like http://zenodo.org, http://figshare.com) on the other. Within
SEEK, the focus is on a project and its outcomes in relation to the people who created these
outcomes. SEEK in turn works well with related systems: it allows linking up with lab
notebooks and can publish research objects into Zenodo.
A comprehensive overview of other data management tools and data collections besides
SEEK is given by Wruck et al. (2012).
In the following we describe (i) the yellow pages, in which programmes, projects, institutions,
and scientists can present themselves and can be found by their methods and research
interests, (ii) models, SOPs, and experimental data that can be associated with their creators
and contributors, and (iii) the Investigation, Study, Assay structuring of the data, which makes
data much more intelligible.
The yellow pages include information about programmes and projects, institutions, and
registered people with contact information, methods, and research interests. It is easy to get
an overview: Who uses the same methods? Who might have run into similar problems and
can discuss them? Who could be a collaboration partner?
Digital assets, i.e. models, SOPs (Standard Operating Procedures), and experimental data, can
be either uploaded to SEEK or registered. Uploading means that an actual copy of the data is
made and stored within SEEK. Registering means that a link to the data item is established.
Both options have their uses. Uploading is best for small to medium data. When data are
uploaded, SEEK also provides versioning, and the FAIRDOMHub provides a backup service for
the data. Registering means that the holder of the data remains responsible for the data.
However, the metadata are stored centrally, and the data can be interlinked with data in SEEK.
This way of sharing makes sense in particular if the data are very big or if there are data
mobility restrictions due to regulations.
Whether uploaded or registered, users are able to share any type of data file and interlink it,
for example, with publications, SOPs, events, or collaboration partners.
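The difference between uploading and registering can be made concrete with a small Python
sketch. The class below is purely illustrative and is not the SEEK data model. In both modes
the metadata live in the central record, but only an uploaded asset carries a copy of the
data itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Asset:
    """Illustrative asset record; not the SEEK data model."""
    title: str
    metadata: dict
    content: Optional[bytes] = None   # set when the data are uploaded
    remote_url: Optional[str] = None  # set when the data are only registered

    def __post_init__(self):
        # Exactly one of the two storage modes must be chosen.
        if (self.content is None) == (self.remote_url is None):
            raise ValueError("provide either content (upload) or remote_url (register)")

    @property
    def mode(self):
        """Return how the asset is held: 'uploaded' or 'registered'."""
        return "uploaded" if self.content is not None else "registered"


# Invented examples: a small uploaded table and a large registered data set.
small = Asset("blot quantification", {"project": "demo"}, content=b"a,b\n1,2\n")
large = Asset("imaging run", {"project": "demo"}, remote_url="https://example.org/run42")
```

Both records can be searched and interlinked through their metadata, while responsibility
for the registered data stays with whoever hosts the linked location.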
SEEK offers versioning of uploaded data files, models, and SOPs for documentation and
reproducibility. A predefined set of general metadata (for more details see
http://docs.seek4science.org/help/metadata-guidelines.html), e.g. title, project, version
number, and people involved, is automatically assigned to all data uploaded to or created in
SEEK. Beyond these automatically generated metadata, users are responsible for more
specific metadata related to their specific data. In line with the FAIR principles, the more
metadata the user provides for the assets in SEEK, the easier it is to find them and to compare
them with other assets.
SEEK excels in its handling of spreadsheets. Excel files can be browsed online, and such files
can be turned into semantic-web-enabled templates using the RightField tool (Wolstencroft et
al., 2011). The resulting templates contain ontology information. They are easier for the user
to fill in and at the same time more valuable for reuse. The JERM (Just Enough Results Model)
ontology used in many such templates has been developed to cater for the users' needs and
to interlink relevant terms with existing ontologies.
To structure different experiments and relate them to each other, the standard ISA
(Investigation-Study-Assay) structure is available. An investigation represents the general
project/experiment context, a study stands for a smaller unit of experiment, and an assay
comprises specific analytical measurements, so that an extensible and hierarchical structure
of experiments can be built within projects (Sansone et al., 2012). Data files, models, and
SOPs can be interlinked with assays to connect the results, protocols, or models with the
experimental ISA structure. Figure 6 shows a graphical representation of an example ISA
structure (3 columns in the left part of the graph) in FAIRDOMHub connected with related data
files, models, and SOPs (right column in the graph). The color coding allows the user to
distinguish between investigations, studies, assays, models, SOPs, and different file formats.
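The hierarchical ISA structure maps naturally onto nested records. The Python sketch below
is an illustration under simplifying assumptions: the class and field names are invented,
assets are represented by plain strings, and nothing here reflects the internal
representation used by SEEK.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    title: str
    assets: List[str] = field(default_factory=list)  # linked data files, models, SOPs


@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)

    def all_assets(self):
        """Collect the assets linked to any assay below this investigation."""
        return [a for s in self.studies for assay in s.assays for a in assay.assets]


# Invented example hierarchy.
investigation = Investigation(
    "Glucose metabolism",
    studies=[
        Study("Time course", assays=[
            Assay("Immunoblot", assets=["blot.xlsx", "SOP_blot.pdf"]),
            Assay("Simulation", assets=["model.xml"]),
        ]),
    ],
)
```

Because assets hang off assays, every file can be traced upwards to the study and
investigation that give it context.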
Figure 6: FAIRDOMHub screenshot containing an example ISA-structure and
connected data files for de.NBI workshop hands-on material about model management
Data in SEEK can be displayed within the web interface if the file format is supported (e.g.
Excel, Word, pdf), downloaded to local machines, or accessed automatically using RESTful
web services. Models stored in SEEK can be developed and validated using the integrated
JWS simulation tool. Peters et al. (2017) also describe the new SED-ML support of SEEK.
SEEK, RightField, and associated tools make it possible to self-curate FAIR data. However,
such tools should be complemented with appropriate services. These range from
help-to-help-yourself services (e.g. template building, curation advice) and training to a full
curation service. To an extent, the quality of these additional services determines how FAIR
the resulting data will be.
5. Workflow of systems biology data

An ideal systems biology project uses and combines the tools described above to handle,
store, curate, and share systems biology data. Figure 7 describes the data workflow of systems
biology projects between Excemplify, SABIO-RK, and SEEK/FAIRDOMHub; numbers in brackets
indicate the individual steps:
Excemplify supports experimentalists in the laboratory in planning the experimental
setup, handling the different data sheets, and storing intermediate and final data sets
[1]. Finally, the data can be automatically uploaded from Excemplify to SEEK/FAIRDOMHub
for central storage and sharing with colleagues and collaboration partners [2]. Additionally,
models created on the basis of these experimental data, experimental procedures (SOPs), and
references can be uploaded to SEEK/FAIRDOMHub and linked to each other and to the original
data [3]. Models simulated in SEEK/FAIRDOMHub using JWS are interlinked with
SABIO-RK to allow database queries for chemical compounds in SABIO-RK. Models
containing kinetic parameters stored in SBML format can be further uploaded from
SEEK/FAIRDOMHub into SABIO-RK [4]. This can also be done for models already published
in the BioModels database (Le Novère et al., 2006) to combine data from models with literature
data. Models uploaded to SABIO-RK are manually curated, annotated, and linked to
external databases, ontologies, and controlled vocabularies if necessary. The final SABIO-RK
database entries link back to the original model and the experimental data in
SEEK/FAIRDOMHub [5]. Models created on the basis of experimental data can also be
uploaded directly into SABIO-RK using the SBML upload [6]. SABIO-RK allows searching for
uploaded experimental data [4,6] and for data manually extracted from the literature [7]. In
SABIO-RK both kinds of data can be compared and used for constructing new models, which
again can be manually uploaded, linked, and shared in SEEK/FAIRDOMHub [3].
Figure 7: Workflow of systems biology data between SABIO-RK, Excemplify and
SEEK/FAIRDOMHub
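The numbered steps of the workflow can be written down as a small directed graph, which makes it easy to enumerate the routes by which data can travel between the tools. This is only a sketch of the description given in the text. The node names "laboratory", "modeller" and "literature" are labels chosen for this example, and the assignment of a source to steps [1], [3] and [6] is our reading of the text, not a feature of any of the tools.

```python
from collections import defaultdict

# Numbered steps of Figure 7 as (step, source, target) triples.
# "laboratory", "modeller" and "literature" are hypothetical labels
# for the origin of the data; the step numbers follow the text.
STEPS = [
    (1, "laboratory", "Excemplify"),
    (2, "Excemplify", "SEEK/FAIRDOMHub"),
    (3, "modeller", "SEEK/FAIRDOMHub"),
    (4, "SEEK/FAIRDOMHub", "SABIO-RK"),
    (5, "SABIO-RK", "SEEK/FAIRDOMHub"),
    (6, "modeller", "SABIO-RK"),
    (7, "literature", "SABIO-RK"),
]


def routes(start, goal):
    """Return every cycle-free sequence of step numbers from start to goal."""
    outgoing = defaultdict(list)
    for step, src, dst in STEPS:
        outgoing[src].append((step, dst))

    found = []

    def walk(node, visited, path):
        if node == goal:
            found.append(path)
            return
        for step, nxt in outgoing[node]:
            if nxt not in visited:  # never visit a node twice
                walk(nxt, visited | {nxt}, path + [step])

    walk(start, {start}, [])
    return found


print(routes("laboratory", "SABIO-RK"))  # -> [[1, 2, 4]]
print(routes("modeller", "SABIO-RK"))    # -> [[3, 4], [6]]
```

In this reading, experimental data reach SABIO-RK only via SEEK/FAIRDOMHub (steps 1, 2, 4), whereas a model can arrive either through SEEK/FAIRDOMHub (steps 3, 4) or by direct SBML upload (step 6).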
The possible data workflow for systems biology data in Figure 7 reflects all four FAIR principles
(Findability, Accessibility, Interoperability, and Reusability) by ensuring that public data and
models in SABIO-RK and SEEK/FAIRDOMHub are (i) searchable for the community, (ii)
accessible by other researchers, (iii) stored and exchanged in standard formats, and (iv)
reusable by other researchers.
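To illustrate what exchange in a standard format makes possible, the following sketch reads the kinetic parameters of each reaction from an SBML document using only the Python standard library. The embedded fragment is hand-written for this example and deliberately reduced: it omits species and the rate-law formula, so it is not a complete valid model, and the identifiers and values are invented. It assumes SBML Level 2 Version 4, where reaction-specific parameters are listed under `kineticLaw/listOfParameters`; other SBML levels use different namespaces and element names.

```python
import xml.etree.ElementTree as ET

# Reduced, hand-written SBML Level 2 Version 4 fragment (invented values).
SBML = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy_model">
    <listOfReactions>
      <reaction id="R1">
        <kineticLaw>
          <listOfParameters>
            <parameter id="Km" value="0.5"/>
            <parameter id="Vmax" value="2.0"/>
          </listOfParameters>
        </kineticLaw>
      </reaction>
    </listOfReactions>
  </model>
</sbml>"""

# Namespace prefix mapping for the assumed SBML level and version.
NS = {"sbml": "http://www.sbml.org/sbml/level2/version4"}


def kinetic_parameters(sbml_text):
    """Map each reaction id to a dict of its kinetic-law parameters."""
    root = ET.fromstring(sbml_text)
    result = {}
    for reaction in root.iterfind(".//sbml:reaction", NS):
        params = reaction.iterfind(
            "sbml:kineticLaw/sbml:listOfParameters/sbml:parameter", NS
        )
        result[reaction.get("id")] = {
            p.get("id"): float(p.get("value")) for p in params
        }
    return result


print(kinetic_parameters(SBML))  # -> {'R1': {'Km': 0.5, 'Vmax': 2.0}}
```

For real models a dedicated SBML library would be preferable to manual XML parsing, since it also handles units, global parameters and the different SBML levels.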
6. Development approach
We would like to stress that all the tools described here benefit greatly from a development
approach that tries to find out how tools will be used and shapes each tool in agreement with
its users. To this end, FAIRDOM has a set of PALs, users who participate in discussions about new
features, make suggestions, test features and give feedback. These user interactions are
complemented by information gathered on visits. Excemplify benefited from a visit to the
Klingmüller Department at DKFZ in which developers followed a day of experiments,
discussing the data structures along the way, followed by a series of further meetings with
prospective users who made suggestions.
SABIO-RK benefits from its curators also being test users and from close contact with users, in
particular our colleagues from de.NBI-ModSim.
All of the software development benefits from software training sessions. Both the questions asked by
prospective users and simply watching users during hands-on training are extremely
beneficial for improving the software.
In all of the software described above, user feedback is incorporated into the software
development via an agile software development process based on SCRUM. The essence of
this process is user-driven development in small steps followed by user feedback.
7. Summary
Data are most useful if they are findable, accessible, exchangeable and reusable. If
experimental results are published but not stored and organized in a structured format for further
use, their scientific impact is reduced. The challenge for scientific data management is to
sensitize experimentalists to view FAIR publishing of data as a natural extension of publishing
results to the scientific community.
The services offered by the de.NBI-SysBio center mainly include data management and data
enrichment. The SEEK data management system, together with tools like Excemplify, supports
experimentalists in the laboratory with easy-to-use tools for data handling. SABIO-RK mainly
uses already published data, offers them in a structured format and enriches them to enhance,
refine or improve the data.
In the near future we will work on further facilitating the integration of our tools into existing
workflows. In particular, we are interested in workflows in which collecting data early on facilitates
curation, as with Excemplify. At the same time we plan to work on improving the literature
curation processes. The goal is to allow curators to concentrate on the intellectual challenges of the
curation work at hand. All of this work benefits from integration into the infrastructure
community (e.g. de.NBI, ELIXIR, FAIRDOM) and into standardisation efforts (e.g. COMBINE
(Hucka et al., 2015), STRENDA (Apweiler et al., 2005)).
Acknowledgements
The authors gratefully acknowledge the collaboration partners, especially the group of Carole
Goble at the University of Manchester (UK) and the group of Ursula Klingmueller at the German
Cancer Research Center (Germany). Special thanks go to our users for their feedback during
the development processes and for many discussions about their requirements. We also wish
to thank our collaborators in the projects we are part of, among them our neighbors in the
de.NBI-ModSim and the other de.NBI projects, as well as our FAIRDOM partners. The projects
are financed by the Klaus Tschira Foundation (http://www.klaus-tschira-stiftung.de/), the
German Federal Ministry of Education and Research (http://www.bmbf.de/) within de.NBI
(031A540), ERASysAPP (031A525), SysMO-DB, SysMO-DB 2 (0315781), Virtual Liver
Network (0315749), SBEpo (0316182E); and the DFG LIS (http://www.dfg.de/) as part of the
project Integrierte Immunoblot Umgebung.
References
Apweiler, R., Cornish-Bowden, A., Hofmeyr, J.H., Kettner, C., Leyh, T.S., Schomburg, D., Tipton, K., 2005. The
importance of uniformity in reporting protein-function data. Trends Biochem Sci 30(1) 11-2.
Attwood, T.K., Kell, D.B., McDermott, P., Marsh, J., Pettifer, S.R., Thorne, D., 2009. Calling International Rescue:
knowledge lost in literature and data landslide! Biochem J. 424(3):317-33.
Bateman, A., 2010. Curators of the world unite: the International Society of Biocuration. Bioinformatics 26(8):991.
Bauch, A., Adamczyk, I., Buczek, P., Elmer, F.-J., Enimanev, K., Glyzewski, P., Kohler, M., Pylak, T., Quandt, A.,
Ramakrishnan, C., Beisel, C., Malmström, L., Aebersold, R., Rinn, B., 2011. openBIS: a flexible framework for
managing and analyzing complex data in biology research. BMC Bioinformatics 12:468.
Bourne, P.E., Lorsch, J.R., Green, E.D., 2015. Perspective: Sustaining the big-data ecosystem. Nature, 527: S16-
S17.
Caspi, R., Altman, T., Billington, R., Dreher, K., Foerster, H., Fulcher, C.A., Holland, T.A., Keseler, I.M., Kothari, A.,
Kubo, A., Krummenacker, M., Latendresse, M., Mueller, L.A., Ong, Q., Paley, S., Subhraveti, P., Weaver, D.S.,
Weerasinghe, D., Zhang, P., Karp, P.D., 2014. The MetaCyc database of metabolic pathways and enzymes and
the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res. 42, D459-71.
Courtot, M., Juty, N., Knüpfer, C., Waltemath, D., Zhukova, A., Dräger, A., Dumontier, M., Finney, A., Golebiewski,
M., Hastings, J., Hoops, S., Keating, S., Kell, D.B., Kerrien, S., Lawson, J., Lister, A., Lu, J., Machne, R., Mendes,
P., Pocock, M., Rodriguez, N., Villeger, A., Wilkinson, D.J., Wimalaratne, S., Laibe, C., Hucka, M., Le Novère, N.,
2011. Controlled vocabularies and semantics in systems biology. Mol Syst Biol 7, 543.
Croft, D., Mundo, A.F., Haw, R., Milacic, M., Weiser, J., Wu, G., Caudy, M., Garapati, P., Gillespie, M., Kamdar,
M.R., Jassal, B., Jupe, S., Matthews, L., May, B., Palatnik, S., Rothfels, K., Shamovsky, V., Song, H., Williams, M.,
Birney, E., Hermjakob, H., Stein, L., D'Eustachio, P., 2014. The Reactome pathway knowledgebase. Nucleic Acids
Res. 42, D472-7.
de Matos, P., Alcántara, R., Dekker, A., Ennis, M., Hastings, J., Haug, K., Spiteri, I., Turner, S., Steinbeck, C., 2010.
Chemical Entities of Biological Interest: an update. Nucleic Acids Res. 38, D249-54.
Demir, E., Cary, M.P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., Wu, G., D'Eustachio, P., Schaefer, C., Luciano,
J., Schacherer, F., Martinez-Flores, I., Hu, Z., Jimenez-Jacinto, V., Joshi-Tope, G., Kandasamy, K., Lopez-Fuentes,
A.C., Mi, H., Pichler, E., Rodchenkov, I., Splendiani, A., Tkachev, S., Zucker, J., Gopinath, G., Rajasimha, H.,
Ramakrishnan, R., Shah, I., Syed, M., Anwar, N., Babur, O., Blinov, M., Brauner, E., Corwin, D., Donaldson, S.,
Gibbons, F., Goldberg, R., Hornbeck, P., Luna, A., Murray-Rust, P., Neumann, E., Ruebenacker, O., Samwald, M.,
van Iersel, M., Wimalaratne, S., Allen, K., Braun, B., Whirl-Carrillo, M., Cheung, K.H., Dahlquist, K., Finney, A.,
Gillespie, M., Glass, E., Gong, L., Haw, R., Honig, M., Hubaut, O., Kane, D., Krupa, S., Kutmon, M., Leonard, J.,
Marks, D., Merberg, D., Petri, V., Pico, A., Ravenscroft, D., Ren, L., Shah, N., Sunshine, M., Tang, R., Whaley, R.,
Letovksy, S., Buetow, K.H., Rzhetsky, A., Schachter, V., Sobral, B.S., Dogrusoz, U., McWeeney, S., Aladjem, M.,
Birney, E., Collado-Vides, J., Goto, S., Hucka, M., Le Novère, N., Maltsev, N., Pandey, A., Thomas, P., Wingender,
E., Karp, P.D., Sander, C., Bader, G.D., 2010. The BioPAX community standard for pathway data sharing. Nat
Biotechnol. 28(9):935-42.
Funahashi, A., Jouraku, A., Matsuoka, Y., Kitano, H., 2007. Integration of CellDesigner and SABIO-RK. In Silico
Biol 7, S81-90.
Gremse, M., Chang, A., Schomburg, I., Grote, A., Scheer, M., Ebeling, C., Schomburg, D., 2011. The BRENDA
Tissue Ontology (BTO): the first all-integrating ontology of all organisms for enzyme sources. Nucleic Acids Res.
39, D507-13.
The Gene Ontology Consortium, 2000. Gene ontology: tool for the unification of biology. Nat Genet 25(1):25-9.
Hucka, M., Finney, A., Sauro, H.M., Bolouri, H., Doyle, J.C., Kitano, H., Arkin, A.P., Bornstein, B.J., Bray, D.,
Cornish-Bowden, A. et al., 2003. The systems biology markup language (SBML): a medium for representation and
exchange of biochemical network models. Bioinformatics. 19, 524-31.
Hucka, M., Nickerson, D.P., Bader, G.D., Bergmann, F.T., Cooper, J., Demir, E., Garny, A., Golebiewski, M., Myers,
C.J., Schreiber, F., Waltemath, D., Le Novère, N., 2015. Promoting Coordinated Development of Community-Based
Information Standards for Modeling in Biology: The COMBINE Initiative. Front Bioeng Biotechnol. 3:19.
Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M., Hirakawa, M., 2010. KEGG for representation and analysis of
molecular networks involving diseases and drugs. Nucleic Acids Res. 38, D355-60.
Kanehisa, M., Goto, S., Sato, Y., Kawashima, M., Furumichi, M., Tanabe, M., 2014. Data, information, knowledge
and principle: back to metabolism in KEGG. Nucleic Acids Res, 42, D199-205.
Karp, P.D., 2016a. How much does curation cost? Database, baw110.
Karp, P.D., 2016b. Can we replace curation with information extraction software? Database, baw150.
Le Novère, N., Bornstein, B., Broicher, A., Courtot, M., Donizelli, M., Dharuri, H., Li, L., Sauro, H., Schilstra, M.,
Shapiro, B., Snoep, J.L., Hucka, M., 2006. BioModels Database: a free, centralized database of curated, published,
quantitative kinetic models of biochemical and cellular systems. Nucleic Acids Res. 34:D689-91.
Moraru, I.I., Schaff, J.C., Slepchenko, B.M., Blinov, M.L., Morgan, F., Lakshminarayana, A., Gao, F., Li, Y., Loew,
L.M., 2008. Virtual Cell modelling and simulation software environment. IET Syst Biol 2(5) 352-62.
Peters, M., Eicher, J.J., van Niekerk, D.D., Waltemath, D., Snoep, J.L., 2017. The JWS online simulation database.
Bioinformatics, 33(10):1589-1590
Placzek, S., Schomburg, I., Chang, A., Jeske, L., Ulbrich, M., Tillack, J., Schomburg, D., 2017. BRENDA in 2017:
new perspectives and new tools in BRENDA. Nucleic Acids Res, 45, D380-8.
Ruebenacker, O., Moraru, I.I., Schaff, J.C., Blinov, M.L., 2009. Integrating BioPAX pathway knowledge with SBML
models. IET Syst Biol 3(5):317-28.
Sansone, S.A., Rocca-Serra, P., Field, D., Maguire, E., Taylor, C., Hofmann, O., Fang, H., Neumann, S., Tong, W.,
Amaral-Zettler, L., Begley, K., Booth, T., Bougueleret, L., Burns, G., Chapman, B., Clark, T., Coleman, L.A.,
Copeland, J., Das, S., de Daruvar, A., de Matos, P., Dix, I., Edmunds, S., Evelo, C.T., Forster, M.J., Gaudet, P.,
Gilbert, J., Goble, C., Griffin, J.L., Jacob, D., Kleinjans, J., Harland, L., Haug, K., Hermjakob, H., Ho Sui, S.J.,
Laederach, A., Liang, S., Marshall, S., McGrath, A., Merrill, E., Reilly, D., Roux, M., Shamu, C.E., Shang, C.A.,
Steinbeck, C., Trefethen, A., Williams-Jones, B., Wolstencroft, K., Xenarios, I., Hide, W., 2012. Toward
interoperable bioscience data. Nat Genet. 44(2):121-6.
Sayers, E.W., Barrett, T., Benson, D.A., Bolton, E., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M.,
DiCuccio, M., Federhen, S. et al., 2011. Database resources of the National Center for Biotechnology Information.
Nucleic Acids Res. 39, D38-51.
Shi, L., Jong, L., Wittig, U., Lucarelli, P., Stepath, M., Mueller, S., D'Alessandro, L.A., Klingmüller, U., Müller, W.,
2013. Excemplify: A Flexible Template Based Solution, Parsing and Managing Data in Spreadsheets for
Experimentalists. J Integrat Bioinform, 10(2):220.
Taylor, C.F., Field, D., Sansone, S.A., Aerts, J., Apweiler, R., Ashburner, M., Ball, C.A., Binz, P.A., Bogue, M.,
Booth, T., Brazma, A., Brinkman, R.R., Clark, A.M., Deutsch, E.W., Fiehn, O., Fostel, J., Ghazal, P., Gibson, F.,
Gray, T., Grimes, G., Hancock, J.M., Hardy, N.W., Hermjakob, H., Julian, R.K. Jr, Kane, M., Kettner, C., Kinsinger,
C., Kolker, E., Kuiper, M., Le Novère, N., Leebens-Mack, J., Lewis, S.E., Lord, P., Mallon, A.M., Marthandan, N.,
Masuya, H., McNally, R., Mehrle, A., Morrison, N., Orchard, S., Quackenbush, J., Reecy, J.M., Robertson, D.G.,
Rocca-Serra, P., Rodriguez, H., Rosenfelder, H., Santoyo-Lopez, J., Scheuermann, R.H., Schober, D., Smith, B.,
Snape, J., Stoeckert, C.J. Jr, Tipton, K., Sterk, P., Untergasser, A., Vandesompele, J., Wiemann, S., 2008.
Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project.
Nat Biotechnol. 26(8):889-96.
The UniProt Consortium, 2011. Ongoing and future developments at the Universal Protein Resource. Nucleic Acids
Res. 39, D214-9.
Weidemann, A., Richter, S., Stein, M., Sahle, S., Gauges, R., Gabdoulline, R., Surovtsova, I., Semmelrock, N.,
Besson, B., Rojas, I., Wade, R., Kummer, U., 2008. SYCAMORE--a systems biology computational analysis and
modeling research environment. Bioinformatics 24, 1463-4.
Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.W., da
Silva Santos, L.B., Bourne, P.E., Bouwman, J., Brookes, A.J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds,
S., Evelo, C.T., Finkers, R., Gonzalez-Beltran, A., Gray, A.J., Groth, P., Goble, C., Grethe, J.S., Heringa, J., 't Hoen,
P.A., Hooft, R., Kuhn, T., Kok, R., Kok, J., Lusher, S.J., Martone, M.E., Mons, A., Packer, A.L., Persson, B., Rocca-
Serra, P., Roos, M., van Schaik, R., Sansone, S.A., Schultes, E., Sengstag, T., Slater, T., Strawn, G., Swertz, M.A.,
Thompson, M., van der Lei, J., van Mulligen, E., Velterop, J., Waagmeester, A., Wittenburg, P., Wolstencroft, K.,
Zhao, J., Mons, B., 2016. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data.
3:160018.
Wittig, U., Kania, R., Golebiewski, M., Rey, M., Shi, L., Jong, L., Algaa, E., Weidemann, A., Sauer-Danzwith, H.,
Mir, S., Krebs, O., Bittkowski, M., Wetsch, E., Rojas, I., Müller, W., 2012. SABIO-RK database for biochemical
reaction kinetics. Nucleic Acids Res, 40(D1):D790-6.
Wittig, U., Kania, R., Bittkowski, M., Wetsch, E., Shi, L., Jong, L., Golebiewski, M., Rey, M., Weidemann, A., Rojas,
I., Müller, W., 2014a. Data extraction for the reaction kinetics database SABIO-RK. Perspectives in Science 1, 33-40.
Wittig, U., Rey, M., Kania, R., Bittkowski, M., Shi, L., Golebiewski, M., Weidemann, A., Müller, W., Rojas, I., 2014b.
Challenges for an enzymatic reaction kinetics database. FEBS Journal, 281(2):572-582.
Wolstencroft, K., Owen, S., Horridge, M., Krebs, O., Mueller, W., Snoep, J.L., du Preez, F., Goble, C., 2011.
RightField: embedding ontology annotation in spreadsheets. Bioinformatics 27(14):2021-2.
Wolstencroft, K., Owen, S., Krebs, O., Nguyen, Q., Stanford, N.J., Golebiewski, M., Weidemann, A., Bittkowski, M.,
An, L., Shockley, D., Snoep, J.L., Mueller, W., Goble, C., 2015. SEEK: a systems biology data and model
management platform. BMC Syst Biol. 9:33.
Wolstencroft, K., Krebs, O., Snoep, J.L., Stanford, N.J., Bacall, F., Golebiewski, M., Kuzyakiv, R., Nguyen, Q.,
Owen, S., Soiland-Reyes, S., Straszewski, J., van Niekerk, D.D., Williams, A.R., Malmström, L., Rinn, B., Müller,
W., Goble, C., 2017. FAIRDOMHub: a repository and collaboration environment for sharing systems biology
research. Nucleic Acids Res. 45(D1):D404-D407.
Wruck, W., Peuker, M., Regenbrecht, C.R., 2014. Data management strategies for multinational large-scale
systems biology projects. Brief Bioinform. 15(1):65-78.
