Professional Documents
Culture Documents
Proven Software Built Specifically: To Manage GPU Clusters
Proven Software Built Specifically: To Manage GPU Clusters
CONTACT US sales@elementai.com
Benefit from more efficient job distribution
Fair share between users and more optimal and transparent resource usage
ARCHITECTURE Most operate on Operates on containers Jobs defined as containers can be reproduced more easily
programs (self-contained)
More flexibility in terms of which technology/libraries are
used (full control over what’s in the container)
Most run on Runs on Kubernetes Enables the use of a single cluster for both research
bare-metal workloads and production systems
cluster
Docker enables a smoother path from research
to production
QUEUE TYPE Based on Fair share between More efficient job distribution and increased
policy or other users, job type can fairness between users
(sometimes) be customized
configurable (preemptable, non- Users can set the priority level of their jobs
conditions preemptable)
MANAGEMENT TOOLS Limited or no Reports include: Better visibility and control leads to better user
built-in reporting job resource usage, experience and increased productivity
efficiency report, user
reports and maintenance Management and maintenance of clusters in real-time,
dashboards. including automatic healthchecks of nodes
SUPPORT Limited to no Full support offered User training offered and deployment support
support offered
Case study
Technical overview How Element AI uses the
Orkestrator on premises to
optimize resource usage
• Designed for GPU-based workloads including AI, Element AI fundamental researchers, applied
simulation and virtual environments research scientists and developers require
• Built on top of Kubernetes and Docker solid infrastructure to efficiently run their
• Can run on-premises or in the cloud models. As the organization scaled and GPU-
• REST API for flexible job control based workloads became increasingly heavy,
• CLI to streamline job control distribution and management of computing
• Job resource requirements resources became a challenge.
- CPU, RAM, GPU, Tensor Cores,
GPU RAM, CUDA version, and more
- Special case: X11 server BEFORE EAI ORKESTR ATOR
sidecar (CPU or GPU) By tracking allocation manually (spreadsheets),
• Multiple job types: preemptable, non-preemptable Element AI averaged GPU usage of 25%
and “interactive”
• Meta job: Process agents launch jobs on behalf of AFTER EAI ORKESTR ATOR
users according to configurable settings With the tool, Element AI averages
• Dashboards including overall cluster view, GPU GPU usage at > 90%, and has increased GPU
status, efficiency report, per job view (CPU and GPU) volume 36x without requiring additional staff
CONTACT US sales@elementai.com