Data Pipeline

The Cloud Pipeline solution wraps cloud compute and storage resources into a single service, providing an easy and
scalable approach to a wide range of scientific tasks:
• Genomics data processing: create data processing pipelines and run them in the cloud in an automated way. Each pipeline
represents a workflow script with versioned source code, documentation, and configuration. Users can create such scripts in
the Cloud Pipeline environment or upload them from a local machine.
• Data storage management: create data storage, download or upload data, and edit files directly in the Cloud Pipeline user
interface. File version control is supported.
• Tools management: create and deploy shared and personal computation environments using Docker containers.
Almost every pipeline requires a specific set of software to run, which is defined in a Docker image. When a user
starts a pipeline, Cloud Pipeline starts a new cloud instance (node) and runs the Docker image on it.
• Scientific computing GUI applications: launch and run GUI-based applications using a self-service Web interface. Users can
choose the cloud instance configuration or even use a cluster. Applications are launched as Docker containers exposing Web
endpoints or a remote desktop connection (noVNC, NoMachine).
Cloud Pipeline provides a Web-based GUI and also supports a CLI, which exposes most of the GUI features.
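
The relationship described above between a pipeline, its versions, and a launched run can be pictured as a small data
model. The Python sketch below is illustrative only: the class names (PipelineVersion, Pipeline, Run), their fields, and
the example image and instance names are assumptions made for this example and do not reflect Cloud Pipeline's actual
API or CLI.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class PipelineVersion:
    """One version of a pipeline (hypothetical model, not the product's schema)."""
    tag: str            # version label, e.g. "v1"
    script: str         # workflow script kept under version control
    docker_image: str   # image that packages the software the script needs
    instance_type: str  # cloud instance (node) configuration to request


@dataclass
class Run:
    """A launched pipeline version: one node running one Docker image."""
    pipeline: str
    version: str
    docker_image: str
    instance_type: str
    status: str = "SCHEDULED"


@dataclass
class Pipeline:
    name: str
    versions: Dict[str, PipelineVersion] = field(default_factory=dict)

    def add_version(self, version: PipelineVersion) -> None:
        # Versions are immutable once registered, mirroring versioned source code.
        if version.tag in self.versions:
            raise ValueError(f"version {version.tag!r} already exists")
        self.versions[version.tag] = version

    def launch(self, tag: str) -> Run:
        # Launching only records what would be started: a node of the
        # requested type running the version's Docker image.
        v = self.versions[tag]
        return Run(self.name, v.tag, v.docker_image, v.instance_type)


# Example values are placeholders chosen for illustration.
pipeline = Pipeline("variant-calling")
pipeline.add_version(
    PipelineVersion("v1", "run.sh", "example/bio-tools:1.0", "8-cpu-32-gb")
)
run = pipeline.launch("v1")
print(run.docker_image, run.instance_type, run.status)
```

Keeping the image and instance configuration on the version rather than on the pipeline means that every run can be traced
back to the exact script and software environment that produced it.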
Product Description
Cloud Pipeline is an Open Source platform that provides the following features and capabilities:
• Powerful, user-friendly Web interface.
• Support for multiple bioinformatics and modeling/simulation tools from an extensive library of Docker container
images that are executed on cloud instances or clusters.
• Access to active instances via a Web-based SSH connection, where users can execute scripts, modify images by installing
software packages, and commit modified images to their personal repository.
• Ability to launch and manage interactive tools and applications with Web or Linux desktop UIs.
• Ability to build custom pipelines using a mixture of languages (including shell script, Python, R, Java, Perl, WDL,
etc.) and save them to a built-in, version-controlled GitLab repository.
• Ability to create cloud storage units and upload or download data using a Web UI, a command-line interface (CLI), or
by mounting storage folders to local Windows/Linux/Mac workstations.
• Ability to store and process data in multiple cloud regions.
• Management of data access permissions for both internal and external users.
• Support for thousands of users, utilizing thousands of nodes and tens of thousands of cores simultaneously.
• Support for single/multiple computation node configurations, as well as auto-scaled SGE clusters, MPI-based
clusters, and various CPU/GPU/memory/disk configurations.
• Data protection using data-at-rest and data-in-motion encryption.
Cloud Pipeline’s Innovative Technology

• A key innovation is providing users with a self-service environment where they can flexibly build and execute their
own pipelines/models, while preserving the highest levels of safety for the data and applications.
• Users can choose the cloud instance configuration and region and request a regular or auto-scaled cluster
without being a cloud or IT expert.
• Cloud Pipeline has a cloud-independent architecture, which makes it simple to port the solution to various cloud
platforms. Currently AWS, Azure, and GCP are supported.
• Cloud Pipeline uses Docker containers and the Kubernetes engine to orchestrate the execution of containerized
applications.
• Cloud Pipeline has been implemented as a Virtual Private Cloud solution that retains control over network
connections and provides security and integration mechanisms for the enterprise's IT/Security team.
• Integration with on-premise clusters and applications, as well as with external clouds, is fully supported.
• The platform can also host and execute Web-based applications tightly integrated with computation and storage
facilities and with security mechanisms.
• A powerful API allows external applications to leverage Cloud Pipeline's tools, storage, clusters, and instances to
perform the computational work and retrieve the result data for further processing.
Problems Addressed by Cloud Pipeline

Many tasks in the Pharmaceutical R&D process require significant and constantly increasing computational power and
storage capacity. These tasks include:
• Genomic, transcriptomic, and other “omics” analysis
• PK/PD, QSP, and other types of modeling
• Clinical trials simulation
• AI/ML analysis in various areas, including the ones listed above, and many others.
Moving data storage and processing to the cloud is an obvious option; however, certain challenges specific to R&D
processes must be addressed:
• Many R&D users, most of whom are scientists and not IT specialists, are not comfortable using cloud facilities directly
(e.g. using command-line interfaces and scripting languages).
• A few cloud-based offerings provide a Web UI to access data storage and pipeline building tools;
however, migrating pipelines from on-premise systems to these platforms requires significant rework of existing
workflows/scripts.
• Third-party cloud-based offerings are often not flexible enough to support the wide variety of tools, frameworks, and
scripting languages used to build analytical/modeling pipelines.
• Integration of cloud applications with on-premise applications (hybrid architecture) is often required; for example,
when a database for annotating genes and pathways, or a license server for certain software, is installed on premise.
• Integration with an enterprise's user management/authentication system is usually a requirement, and a Single
Sign-On (SSO) mechanism is a great benefit for users.
• Scientists and researchers need access to a scalable computational cluster to solve their most compute-intensive
tasks, while at the same time reliable mechanisms are needed to control cloud spending.
