Data Pipeline
scalable approach to accomplish a wide range of scientific tasks.
Genomics data processing: create data processing pipelines and run them in the Cloud in an automated way. Each pipeline
represents a workflow script with versioned source code, documentation, and configuration. Users can create such scripts in
the Cloud Pipeline environment or upload them from the local machine.
Data storage management: create data storage, download or upload data, edit files right in the Cloud Pipeline user interface.
File version control is supported.
Tools management: create and deploy shared and personal computation environments using Docker’s container concept.
Almost every pipeline requires a specific software package to run, which is defined in a Docker image. When a user
starts a pipeline, Cloud Pipeline starts a new cloud instance (node) and runs the Docker image on it.
Scientific computing GUI applications: launch and run GUI-based applications using a self-service Web interface. It is possible
to choose cloud instance configuration, or even use a cluster. Applications are launched as Docker containers exposing Web
endpoints or a remote desktop connection (noVNC, NoMachine).
Cloud Pipeline provides a Web-based GUI and also supports a CLI, which exposes most of the GUI features.
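The launch flow described under tools management (a pipeline names a Docker image, and starting the pipeline brings up a new node that runs that image) can be illustrated with a small toy model. This is only a sketch of the idea: the class names, fields, image name, and instance type below are hypothetical and are not part of Cloud Pipeline's actual API or CLI.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical toy model of the flow described in the text.
# None of these names come from Cloud Pipeline itself.


@dataclass
class Pipeline:
    name: str
    version: str
    docker_image: str   # image that packages the software the pipeline needs
    instance_type: str  # requested cloud instance configuration


@dataclass
class Node:
    node_id: int
    instance_type: str
    running_images: list = field(default_factory=list)


class ToyLauncher:
    """Starts one fresh node per run and runs the pipeline's image on it."""

    def __init__(self):
        self._ids = count(1)
        self.nodes = []

    def launch(self, pipeline: Pipeline) -> Node:
        # Step 1: start a new cloud instance (node) for this run.
        node = Node(next(self._ids), pipeline.instance_type)
        # Step 2: run the pipeline's Docker image on that node.
        node.running_images.append(pipeline.docker_image)
        self.nodes.append(node)
        return node


launcher = ToyLauncher()
demo = Pipeline("variant-calling", "v1", "example/genomics-tools:latest", "small")
first = launcher.launch(demo)
second = launcher.launch(demo)
```

In this sketch every launch gets its own node, mirroring the statement that starting a pipeline starts a new instance; a real deployment would delegate both steps to the cloud provider and the container engine.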
Product Description
Cloud Pipeline is an Open Source platform that provides the following features and capabilities:
Support for multiple bioinformatics and modeling/simulation tools from an extensive library of Docker container
images that are executed on cloud instances or clusters.
Users can access active instances via a Web-based SSH connection, execute scripts, modify images by installing
software packages, and commit modified images to the user’s personal repository.
Ability to launch and manage interactive tools and applications with Web or Linux Desktop UIs.
Users can build custom pipelines using a mixture of languages (including shell script, Python, R, Java, Perl, WDL,
etc.), and save them to a built-in, version-controlled GitLab repository.
Ability to create cloud storage units and to upload and download data using a Web UI, a command-line interface (CLI), or
by mounting storage folders to local Windows/Linux/Mac workstations.
Support for thousands of users, utilizing thousands of nodes and tens of thousands of cores simultaneously.
Support for single/multiple computation node configurations, as well as auto-scaled SGE clusters, MPI-based
clusters, and various CPU/GPU/Memory/Disk configurations.
Users can choose the cloud instance configuration and region and request launching a regular or auto-scaled cluster
without being a cloud or IT expert.
Cloud Pipeline has a cloud-independent architecture, which makes it simple to port the solution to various cloud
platforms. Currently AWS, Azure, and GCP are supported.
Cloud Pipeline uses Docker containers and the Kubernetes engine to orchestrate the execution of containerized
applications.
Cloud Pipeline has been implemented as a Virtual Private Cloud solution that retains control over network
connections and provides security and integration mechanisms for the enterprise’s IT/Security team.
Integration with on-premise clusters and applications, as well as with external clouds, is fully supported.
The platform can also host/execute Web-based applications tightly integrated with computation and storage facilities
and security mechanisms.
A powerful API allows external applications to leverage Cloud Pipeline’s tools, storage, clusters and instances to
perform the computational work and retrieve the result data for further processing.
Many R&D users, most of whom are scientists and not IT specialists, are not comfortable using cloud facilities directly
(e.g. using command line interfaces and scripting languages).
A few cloud-based offerings provide a Web UI for accessing data storage and pipeline-building tools;
however, migrating pipelines from on-premise environments to these platforms requires significant rework of existing
workflows/scripts.
Third-party cloud-based offerings are often not flexible enough to support the wide variety of tools, frameworks, and
scripting languages found in analytical/modeling pipelines.
Integration of cloud applications with on-premise applications (hybrid architecture) is often required; for example,
when a database for annotating genes and pathways, or a license server for certain software, is installed on-premise.
Integration with an enterprise’s user management/authentication system is usually a requirement, and a Single
Sign-On (SSO) mechanism is a great benefit for users.
Scientists and researchers need access to a scalable computational cluster to solve their most compute-intensive
tasks, while at the same time reliable mechanisms are needed to control cloud spending.