TENSILE: A Tensor Granularity Dynamic GPU Memory Scheduler Method Towards Multitasking Dynamic Workloads System

Kaixin Zhang, Hongzhi Wang, Tongxin Li, Han Hu, Jiye Qiu, Songling Zou
Harbin Institute of Technology, Harbin, China
1170300216@stu.hit.edu.cn

arXiv:2105.13336v1 [cs.DC] 27 May 2021
ABSTRACT

Recently, deep learning has been an area of intense research. However, as a computing-intensive kind of task, deep learning relies heavily on the scale of GPU memory, which is usually expensive and scarce. Although several works have been proposed for dynamic GPU memory management, they are hard to apply to systems with multitasking dynamic workloads, such as in-database machine learning systems.

In this paper, we demonstrate TENSILE, a method of managing GPU memory at tensor granularity to reduce the GPU memory peak, taking multitasking dynamic workloads into consideration. As far as we know, TENSILE is the first method designed to manage the GPU memory usage of multiple workloads. We implement TENSILE on our own deep learning framework and evaluate its performance. The experimental results show that our method achieves lower time overhead than prior works while saving more GPU memory.

PVLDB Reference Format:
Kaixin Zhang, Hongzhi Wang, Tongxin Li, Han Hu, Jiye Qiu, and Songling Zou. TENSILE: A Tensor granularity dynamic GPU memory scheduler method towards multitasking dynamic workloads system. PVLDB, 14(1): XXX-XXX, 2020. doi:XX.XX/XXX.XX

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at http://vldb.org/pvldb/format_vomultl14.html.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097. doi:XX.XX/XXX.XX

1 INTRODUCTION

Deep learning has been proven effective and is becoming more and more popular. With appropriate training, deep learning models can achieve much higher performance than traditional ones [4]. However, a significant problem of deep learning is that it is usually expensive in GPU memory [2, 5, 6], which makes it costly to apply to real-world jobs. In recent years, many DNNs with much more depth have been proposed in various communities such as image recognition and natural language processing to achieve better performance, such as ResNet [6], BERT [5] and GPT-3 [2]. These models have as many as billions of parameters. Their parameters and the corresponding feature maps cause a massive GPU memory footprint [15] and impede their application. For example, such learning could hardly be performed inside a database, since it is expensive to build a large GPU cluster for an existing database system. Thus, the increasing size of deep learning models requires that GPU memory be managed more efficiently.

Additionally, many deep learning scenarios are dynamic and multi-tenant. For example, in-database machine learning systems need to face multi-tenant scenarios. Multiple workloads can be trained in the system at the same time, and these workloads can also be launched or ended at any time. We could dedicate a GPU to each workload, but such a naive method becomes expensive when the number of jobs is large, since it requires a large GPU cluster to be built. Alternatively, we could support such scenarios by reducing each job's GPU memory peak and running multiple jobs on a single GPU, as long as we have an appropriate method to schedule these jobs' GPU memory usage.

The key problem is how to reduce the GPU memory peak of the entire computation process efficiently and update the scheduling plan when workloads change. Although there are already some efficient methods [10, 12, 14] that reduce the GPU memory cost by swapping and recomputation, the following three issues are still unsolved.

The first is compatibility, i.e., most of these methods are only suitable for certain neural network layers or model types [12, 14]. However, the diversity of machine learning jobs requires the scheduling algorithm to be applicable to various kinds of deep learning jobs. The cause of the compatibility problem is that their scheduling granularity is not fine enough. Most of these methods are based on layer granularity, which makes them support only certain types of DNNs, e.g., CNNs.

The second problem is that all existing approaches are designed to schedule GPU memory for a single workload. When it comes to scenarios with multitasking dynamic workloads, these methods can hardly manage the GPU memory as well as they do on a single job. Scheduling GPU memory with multitasking dynamic workloads is challenging, since the workloads are executed asynchronously, and such a mechanism causes dynamic tensor access patterns, which requires the scheduling algorithm to be able to update the scheduling plan as they fluctuate. This also involves the design of the system architecture to support such scheduling algorithms.
The third problem is that previous works can only schedule within a single iteration, ignoring the necessary scheduling between iterations. For convenience of statement, we decompose the entire computation process into two phases, the forward/backward propagation phase and the optimizing phase, as shown in Figure 1(a). The former calculates the output of the model and then generates the gradients, and the latter uses the gradients to update the parameters and the intermediate variables of the optimizer. Layer granularity methods cannot schedule the optimizing phase, since it contains no layer structure. However, all of the tensor granularity methods, which are able to schedule the optimizing phase, only make scheduling decisions based on whether there is a tensor access after the previous eviction within a single iteration. They ignore the fact that scheduling between iterations is also necessary, since the optimizing phase is adjacent to the forward/backward propagation phase of the next iteration, where the evicted tensors are needed as input. This causes the tensors evicted in the optimizing phase to be swapped in passively in the next forward/backward propagation phase, which causes extra overhead.

Figure 1: Tensor Access and Memory peak. weight1_i means weight1's value in the i-th iteration, and the update OTA means the operator that updates the weights. (a) is the situation without scheduling; (b) is scheduling that cannot cross iterations, which causes extra time overhead; (c) is our method's effect of reducing the GPU memory peak.

In this paper, we tackle these challenges and propose TENSILE, a run-time dynamic GPU memory scheduling method at tensor granularity. For the first problem, compatibility, we schedule GPU memory at tensor granularity to support deep learning tasks based on tensor computation, such as the training of CNNs [9] and GCNs [8]. Additionally, by scheduling at tensor granularity, we can achieve higher performance. Since the tensor is the most basic component of deep learning algorithms, scheduling at tensor granularity allows us to manage not only the parameters and feature maps, but also the intermediate results of the optimizer, which gives us more chances to save GPU memory. Also, benefiting from tensor granularity, our method is not limited to workloads with forward/backward propagation as most prior works are; it can handle any tensor computation task that can be expressed with tensor operators in a static computational graph.

To solve the second problem, we develop a scheduling algorithm that keeps tracking the tensor access times and updates the scheduling plan to support dynamic GPU usage with multiple workloads. Since the scheduling plan can be updated in time for the current GPU usage, it remains effective in dynamic workloads scenarios.

Besides, to save GPU memory, TENSILE manages the tensors by swapping and recomputation during execution. In the life cycle of a tensor t, t is generated and maintained in memory until its next use, which causes a time gap during which the tensor remains in the GPU but is not used [12]. For example, in deep learning frameworks like TensorFlow¹, PyTorch² and MXNet³, intermediate results such as feature maps are kept in GPU memory until they are used by backward propagation, which gives us the chance to swap those tensors to Host Memory to save GPU memory (see the sketch at the end of this section).

Finally, to solve the third problem, our method can schedule tensors across iterations. TENSILE is able to swap out or release the parameters in the optimizing phase and prefetch them before they are used in the next iteration, which allows us to avoid the extra overhead of passive swapping. To the best of our knowledge, this has never been achieved before.

In this paper, we propose a method of scheduling GPU memory to reduce the GPU memory peak towards multitasking dynamic workloads for the first time. Our contributions can be summarized as:

• We are the first to achieve GPU memory scheduling towards multitasking dynamic workloads. To support such scheduling, we design a system to track the jobs' execution and manage their memory usage as a whole, rather than optimizing them separately as previous works do. Our method can also schedule tensors across iterations and reduce the GPU memory peak of the entire process to increase the model capacity of the system.

• We propose an innovative algorithm to generate the swapping and recomputation plan based on estimating the tensor access sequence, which supports multitasking dynamic workloads better than prior works. With this algorithm, we can initialize the scheduling plan without measuring first, and update the scheduling plan in time when the workloads change.

• According to the experimental results, our method achieves a higher GPU memory saving ratio almost without performance loss, especially with multitasking dynamic workloads.

¹ TensorFlow, https://www.tensorflow.org
² Torch, http://torch.ch
³ MXNet, https://mxnet.apache.org
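To make the idle-gap idea above concrete, the toy sketch below (not part of TENSILE itself; the access log and all names are illustrative assumptions) walks one iteration's tensor access sequence and reports, for each feature map, the interval between its last forward-pass use and its first backward-pass use, i.e., the window in which a swap-out/swap-in pair could hide behind computation.

```python
# Toy illustration: find the idle window of each feature map between its
# last forward-pass access and its first backward-pass access.
# All names and numbers are illustrative, not TENSILE's API.
accesses = [
    (0.0, "fmap_1", "forward"), (1.2, "fmap_2", "forward"),
    (2.5, "fmap_3", "forward"), (3.0, "fmap_3", "backward"),
    (4.1, "fmap_2", "backward"), (5.4, "fmap_1", "backward"),
]  # (time_ms, tensor_name, phase) -- a hypothetical access log of one iteration

last_forward, first_backward = {}, {}
for t, name, phase in accesses:
    if phase == "forward":
        last_forward[name] = t
    elif name not in first_backward:
        first_backward[name] = t

for name in last_forward:
    gap = first_backward[name] - last_forward[name]
    # A swap-out right after the forward access and a swap-in right before
    # the backward access can overlap with computation inside this gap.
    print(f"{name}: idle for {gap:.1f} ms")
```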
2 RELATED WORK

In this section we introduce prior works in the area of GPU memory management.

NVIDIA first integrated the Unified Memory technology in CUDA 6⁴. This work allows programmers to use the GPU memory and the Host Memory as a unified space. Although it is the basis of higher-level operations such as swapping tensors between GPU memory and Host Memory, it is not specially designed to schedule deep learning processes, which means programmers need to decide when and which tensors should be swapped. Also, it can only be used in CUDA programming.

Minsoo Rhu et al. proposed vDNN [12], which is the first work focusing on scheduling GPU memory usage for the deep learning process. The main part of vDNN is an algorithm that decides which tensors should be released, evicted, or prefetched. Although it is a pioneering work, some problems remain to be solved. Unlike our method, vDNN schedules at layer granularity, which is why it is not suitable for scheduling deep learning models that do not have a strict layer structure, such as the Capsule Network; these networks have complex structures inside a 'layer', whose intermediate results cannot be scheduled by vDNN.

Shriram S.B. et al. improved vDNN and solved its delayed computation start, high pinned memory requirements, and GPU memory fragmentation problems [13]. Although allowing the tensor eviction and prefetching process to overlap with the computation process significantly improved performance, the other shortcomings remain.

Linnan Wang et al. proposed SuperNeurons [14], the first dynamic GPU memory run-time management approach. It also introduced recomputation into the scheduling. The authors believe the convolution layer is the most valuable layer to swap, and designed some approaches for convolution layers. However, it cannot support every kind of neural network, because it is based on layer granularity, and it cannot work in dynamic workload environments such as database systems.

Xiaoming Chen et al. proposed moDNN [3], whose key contribution is a new heuristic to schedule data eviction and prefetching transfers. This work also contains a convolution algorithm selection method and a new sub-batch size selection method to further improve performance.

To support non-linear networks better, Donglin Yang et al. proposed the first method that supports a unified memory pool for non-linear networks [16]. The method constructs the execution order for non-linear networks based on graph analysis, and it also contains a Mobility placement policy to conserve contiguity for different tensors.

However, all the approaches above are based on layer granularity. In 2019, another method was proposed [17], which is the first approach that can schedule GPU memory at tensor granularity. It proposes two orthogonal approaches to reduce the memory cost from the system perspective: it reduces GPU memory usage by managing the memory pool and by scheduling swapping with a heuristic method. Benefiting from its tensor granularity, this method can support any kind of deep learning network.

Xuan Peng et al. proposed Capuchin [10]. Capuchin observes the tensor access sequence and decides which tensors to swap and which to recompute with a value-based approach. This method is more flexible in supporting various types of neural networks than the layer granularity methods. However, it highly depends on observing the tensor access sequence when the calculation first runs, since the tensor access sequence is needed to generate the scheduling plan. In an environment with multitasking dynamic workloads, the tensor access sequence keeps changing, which makes Capuchin hard to apply in such systems. Also, its scheduling process only considers the tensor accesses within one iteration. For example, the updated parameters' accesses at the end of the iteration cannot be swapped in proactively. Thus, these parameters can only be swapped in passively at the next iteration with additional time overhead.

⁴ https://developer.nvidia.com/blog/unified-memory-in-cuda-6/

3 OVERVIEW

In this section, we introduce the definition of the GPU memory scheduling problem and expound the overall design of our deep learning system.

3.1 Problem Definition

For the convenience of scheduling, we use a graph model, called a static computational graph [1], to describe tensor calculation jobs. We denote a static computational graph as G(V, E), where V represents the set of operators that manipulate tensors, and each (u, v) in E represents a tensor generated by operator u and used by operator v. Note that the graph is a DAG (Directed Acyclic Graph), just like the one in TensorFlow [1]. An example of the graph is shown in Figure 2(a): a convolution operator is represented as a node in the graph. Such a node takes multiple tensors as input and outputs other tensors. When a job is launched, the engine executes these operators in topological order. Such a model effectively represents the computation in modern deep learning frameworks such as TensorFlow [1].

Note that the intermediate results and the operators of the optimizer are also included in G, which is different from prior works. Those works are only designed to save GPU memory during forward/backward propagation to achieve a larger batch size or a deeper network. However, our method is designed towards multitasking dynamic workloads in the system. To support multiple workloads, the method must reduce the entire job's GPU memory peak, including the peak caused by the optimizer. Otherwise, when two peaks caused by different jobs occur at the same time, an OOM (Out Of Memory) exception will be triggered. This feature is especially effective when using an optimizer like Adam [7], which uses additional memory twice the size of the parameters.

Based on the static computational graph, we explain the basic concepts of TENSILE.

The input and output tensors of a given operator are defined as Input Tensor Accesses (ITA) and Output Tensor Accesses (OTA) respectively. Since the feed_dict operator generates the input tensors of the whole computational graph, we regard the feed_dict operator as an OTA. The execution time of a Tensor Access is defined by the corresponding operator's execution time. For example, in Figure 2(b), the Convolution operator takes the tensors kernel and bias and a tensor x generated by the previous operator as inputs, and outputs the feature map tensor y. In total, it makes four Tensor Accesses, including three ITAs and one OTA.
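As a concrete (and simplified) reading of this definition, the sketch below builds a tiny static computational graph and derives the ITA/OTA events of each operator in topological order; the class and field names are our own illustrative choices, not TENSILE's actual data structures.

```python
# Minimal sketch of a static computational graph G(V, E) and its tensor
# accesses. Names (Operator, TensorInfo, etc.) are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TensorInfo:
    name: str
    size_bytes: int          # used later to pick swap candidates

@dataclass
class Operator:
    name: str
    inputs: List[TensorInfo] = field(default_factory=list)   # each input -> one ITA
    outputs: List[TensorInfo] = field(default_factory=list)  # each output -> one OTA

# Build the Figure 2(b)-style example: conv(x, kernel, bias) -> y
x = TensorInfo("x", 4 * 224 * 224 * 64)
kernel = TensorInfo("kernel", 4 * 3 * 3 * 64 * 64)
bias = TensorInfo("bias", 4 * 64)
y = TensorInfo("y", 4 * 224 * 224 * 64)
conv = Operator("Convolution", inputs=[x, kernel, bias], outputs=[y])

# Executing operators in topological order yields the tensor access sequence:
for op in [conv]:
    for t in op.inputs:
        print(f"ITA: {op.name} reads  {t.name} ({t.size_bytes} B)")
    for t in op.outputs:
        print(f"OTA: {op.name} writes {t.name} ({t.size_bytes} B)")
```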
Figure 2: Demonstration of the Concepts

The Tensor Access Sequence is defined as the sequence of execution times of each Tensor Access at run-time, just as Figure 2(b) shows. It contains each tensor's access times, which can be inferred from the computational graph or logged at run-time. For each tensor, its Tensor Access Sequence begins with an OTA which generates the tensor, followed by a series of ITAs which represent the tensor being used by operators as input. These ITAs are usually triggered in forward propagation and backward propagation, as shown in Figure 5.

A Swap Task transfers data between the GPU memory and the Host Memory. It has two types, Swap Out Task and Swap In Task. As Figure 2(c) shows, a Swap Out Task of a tensor copies the tensor from GPU memory to Host Memory and then releases it from GPU memory as soon as it is no longer occupied, which can be regarded as eviction. A Swap In Task copies the tensor from Host Memory to GPU memory, which is used to prefetch a tensor. It is necessary to emphasize that swapping in does not mean the tensor in the Host Memory will be released; on the contrary, it remains in the Host Memory until the last access of the tensor is finished, which saves much swapping-out cost.

A Recomputation Task regenerates a tensor that has already been released from GPU memory. For a Tensor Access i, we can release the tensor after the access and recompute it before the next Tensor Access i+1.

With these concepts, we can offload tensors when they are not used and then prefetch or regenerate them by swapping or recomputation before their input accesses. Thus, our problem is: given a series of deep learning jobs running asynchronously on one GPU, each with its own computational graph and tensor access sequence, generate a series of Swap Tasks and Recomputation Tasks for each job that minimizes the GPU memory peak, so that the system can support more jobs or run a larger batch for each job.

3.2 System Architecture

Since no prior works are designed for multiple jobs, existing approaches lack the ability to schedule among multitasking dynamic workloads. To solve this problem, we developed a system to support the scheduling of multitasking dynamic workloads. The system collects the necessary information about all the jobs before and during execution to support the scheduling algorithm. The information includes the DAG of each job, the execution time of each operator, and other run-time information such as the GPU utilization. The system triggers the scheduling algorithm multiple times during task startup and operation. When receiving the latest scheduling plan, the system applies it at the next iteration. To achieve the functions above, our system contains five components: Global Controller, Memory Scheduler, Executor, Swap Scheduler, and Swap Executor, as shown in Figure 3.

Figure 3: System Architecture

Global Controller. The Global Controller is responsible for launching new jobs' processes and delivering information between the Executors and the Memory Scheduler. When a new job is launched by a user, the Global Controller creates a new process for the job's Executor, and two new threads for the Swap Scheduler and the Swap Executor. It is also responsible for communicating with the Memory Scheduler; each job's information is organized and sent to the Memory Scheduler. Also, the scheduling plan of each job is distributed to the corresponding Executor. With the Global Controller, our system can support the scheduling algorithm described in Section 4.

Memory Scheduler. The Memory Scheduler takes the jobs' information as input, such as the DAG and the execution times tracked by the Executor. Such information is used to generate the Tensor Access Sequence and the scheduling plan, which is made up of multiple instructions of swapping, releasing, and recomputation. The detailed algorithm will be described in Section 4. Each instruction is described by a tuple (trigger_time, delta_time), where trigger_time is the previous operator's end time and delta_time is the time interval after the trigger. Therefore, the corresponding Swap Task will be triggered delta_time milliseconds after the trigger_time, just as Figure 2(c) shows. These instructions are grouped by job id and will be distributed to the corresponding Executor by the Global Controller.
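The (trigger_time, delta_time) convention above can be read as in the following sketch, which turns an instruction into a concrete swap-task start time relative to the end of the operator that triggers it; the queue layout and helper names are illustrative assumptions, not the real Memory Scheduler interface.

```python
# Illustrative sketch of how a (trigger_time, delta_time) instruction could be
# turned into a concrete swap-task start time; all names are assumptions.
import heapq

class SwapInstruction:
    def __init__(self, tensor, kind, delta_time_ms):
        self.tensor = tensor          # which tensor to move
        self.kind = kind              # "swap_out" or "swap_in"
        self.delta_time_ms = delta_time_ms

# Instructions grouped by the operator whose completion triggers them.
plan = {"conv_3": [SwapInstruction("fmap_2", "swap_out", 0.5)],
        "grad_conv_3": [SwapInstruction("fmap_2", "swap_in", 0.0)]}

swap_queue = []  # (absolute_start_ms, kind, tensor), popped by the Swap Scheduler

def on_operator_finished(op_name, end_time_ms):
    """Called when an operator completes; its end time is the trigger_time."""
    for inst in plan.get(op_name, []):
        start = end_time_ms + inst.delta_time_ms
        heapq.heappush(swap_queue, (start, inst.kind, inst.tensor))

on_operator_finished("conv_3", end_time_ms=12.3)
print(swap_queue)  # [(12.8, 'swap_out', 'fmap_2')]
```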
Executor. The Executor sorts the operator nodes in the computational graph in topological order. Then it infers each tensor's information, such as shape and size, which is needed by the Memory Scheduler. The Executor executes each operator node in the graph in topological order. For each node, it checks whether all of the input tensors are in GPU memory before the computation, since inaccuracy in the estimated Tensor Access Sequence may lead to delays in swapping. If that happens, the Executor pauses the computation and swaps the tensors into GPU memory. When the computation is finished, the tensors are released if the scheduling plan so instructs. The Executor also tracks every operator's time cost periodically and sends the information to the Memory Scheduler through the Global Controller for updating the scheduling plan. The Executor checks whether the scheduling plan has been updated each time before a new round of calculation. During the execution of the computational graph, if the scheduling plan requires the current tensor access to trigger some Swap Tasks, the Executor sends the Swap Tasks to the Swap Scheduler to wait for execution. At the end of an iteration, the Executor pauses to wait for the unfinished swapping tasks of this iteration to finish.

Figure 4: Swapping Scheduler Architecture

Swap Scheduler and Swap Executor. As Figure 4 shows, the Swap Scheduler is a thread with a task queue. It receives the swapping tasks triggered by certain operators and pushes them into the queue in chronological order. The Swap Scheduler pops the swap tasks in the queue to the Swap Executor one by one according to their time intervals. While the system is running, it calls the Swap Executor and pops these swap tasks for execution at their start times.

With these components, the system is able to collect information from each job and execute the corresponding scheduling plan. The whole scheduling procedure has 4 steps:

(1) Collect the new jobs' information through the Global Controller.
(2) The Memory Scheduler generates the initial scheduling plan and distributes it to the corresponding job's Executor via the Global Controller.
(3) The Swap Scheduler and the Swap Executor execute the scheduling plan during computing.
(4) The Executor collects the time cost of each operator and reports it to the Global Controller. When the current operators' time costs deviate from those used to generate the last scheduling plan by more than a threshold, the Global Controller calls the Memory Scheduler to update the scheduling plan based on the latest estimated Tensor Access Sequence.

4 METHODS

In this section, we detail our algorithms for GPU memory scheduling.

4.1 Swap Task Scheduling

The main approach in our system to reduce the GPU memory peak is swapping tensors out to the Host Memory when they are not used. It is obvious that swapping every tensor out of GPU memory after its OTA, and swapping it back right before its ITA, saves the most GPU memory. However, because a data transfer occupies the PCIe channel exclusively [10], only one tensor can be swapped at a time. Therefore, the swapping of tensors within and between tasks must be properly scheduled.

The scheduling process is a variant of the task scheduling problem with the following two difficulties:

• A Swap Task cannot be executed without prerequisites. For example, a Swap In Task can be executed only if the corresponding Swap Out Task can be performed; otherwise, there will be no data to swap into the GPU memory. Also, to avoid null pointer exceptions, a Swap Out Task can be executed only when the tensor's Swap In Task can be scheduled successfully before the next Tensor Access.

• The scheduling process must be performed with consideration of global job information, and it needs to be updated frequently due to the dynamic workloads. This means we cannot use a very complex algorithm to generate the scheduling plan; otherwise the generation of the scheduling plan will cause significant extra overhead and slow down the calculation process.

Facing the difficulties above, we design a greedy algorithm to balance performance and time overhead. The main idea is to swap the largest available tensors whose ITAs cause the GPU memory footprint peak. Since scheduling each tensor can change the GPU memory peak, we must analyze the GPU memory peak and generate the swap tasks iteratively. In particular, the GPU memory peak is analyzed first based on the Tensor Access Sequence, and the largest tensor that causes the peak is chosen to swap. This process is iterated until there are no more tensors to be scheduled.

With the greedy algorithm, each tensor chosen for scheduling can significantly reduce the GPU memory peak. According to our experiments, it reduces the GPU memory peak of an MLP network by 33.5% without extra time overhead.

Now we introduce our algorithm in detail. We first use an algorithm to generate the Tensor Access Sequence, analyze the GPU memory peak, and obtain the tensors that cause the memory peak (MPT) among all jobs. This algorithm will be discussed in the following sub-sections. Then we choose the biggest tensor t among all jobs as the most valuable tensor to swap.
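The greedy loop just described can be sketched as follows; the two injected callables stand in for the analysis and scheduling routines presented later (Algorithms 2 and 1), and their exact signatures are our assumption.

```python
# Sketch of the greedy driver: repeatedly analyze the peak and try to swap the
# largest tensor that contributes to it. The injected helpers stand in for
# Algorithm 2 (analysis) and Algorithm 1 (swap scheduling); signatures assumed.
def greedy_swap_scheduling(jobs, swap_tasks, analyze_memory_peak, try_schedule_swap):
    while True:
        # Merge the per-job analyses: peak value and the tensors holding
        # memory at the moment of the peak (MPT), across all jobs.
        peak, peak_tensors = analyze_memory_peak(jobs, swap_tasks)
        # Largest tensor first: swapping it promises the biggest peak reduction.
        candidates = sorted(peak_tensors, key=lambda t: t.size_bytes, reverse=True)
        scheduled_any = False
        for tensor in candidates:
            if try_schedule_swap(tensor, swap_tasks):   # Algorithm 1
                scheduled_any = True
                break   # re-analyze: the peak may have moved elsewhere
        if not scheduled_any:
            break       # nothing more can be swapped
    return swap_tasks
```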
Besides, there is a maximum ratio limit for each job's tensor swapping. Considering the swapping among all jobs, if we swap as many tensors as possible in each job, the PCIe channel will be jammed. Also, we cannot control the order of swap tasks among all jobs, because jobs in our system run asynchronously for performance reasons, and each job's time cost per epoch differs significantly. If we used a synchronous way to execute these jobs, there would be extra waiting overhead. From this perspective, to reduce the chance of conflicts, we define the maximum ratio limit of each job's tensor swapping as the ratio of tensors swapped in a job to tensors swapped in the whole system; it can be regarded as a hyper-parameter that needs to be tuned. The maximum swapping ratio can also be considered as the job's priority: because tensor swapping cannot always overlap with computation due to conflicts on PCIe and the changing GPU usage, swapping means the job may be slower than normal. In this sense, a smaller swapping ratio means a higher priority.

After the GPU memory peak analysis, each Swap Task's feasible time regions are inferred, as shown in Algorithm 1, Line 4 and Lines 25-26, defined by its earliest_time and latest_time. There are two bases for extrapolating the feasible time interval:

• The latest time of a Swap Out Task cannot be later than the time the GPU memory peak appears, since we want the swap to reduce the GPU memory peak. It is also obvious that a Swap Out Task must start after its OTA. Also, a Swap In Task cannot be earlier than the corresponding Swap Out Task or later than the ITA that uses the tensor as input. Therefore, our algorithm needs to decide the start time of each task under the constraints of its feasible time regions.

• Due to the pinned memory limitation, only one data object can be transferred over PCIe at a time, so the Swap Tasks cannot overlap with each other.

Based on the feasible time regions, we choose the available region greedily. A Swap Out Task starts as early as possible, and a Swap In Task starts as late as possible, so that the tensor stays out of GPU memory for as long as possible. If there is no available region, we break the outer loop in Line 22 and try the next tensor.

Next we have to decide when the Swap Out Task and the Swap In Task are executed. Since the tensors in the forward/backward propagation phase and the optimizing phase have different constraints, we introduce them separately.

Tensors in the forward/backward propagation phase must be swapped in before their first ITA after the Swap Out Task. Otherwise, an exception will be triggered due to missing input tensors, and the system has to pause the computation to wait for their swapping. As shown in Algorithm 1, Lines 12-15, we look for the first ITA after the Swap Out Task, generate the Swap In Task and its feasible regions, and check whether any of these regions can cover the Swap In Task. If they cannot cover the task, we update the Swap Out Task's earliest start time and regenerate the feasible regions (Lines 20-21). Otherwise, for the remaining ITAs of the tensor, we try to swap them in as much as possible (Line 36). If a Swap Task is scheduled successfully, the corresponding tensor can be released after the previous ITA is finished. TENSILE also analyzes the computational graph and releases tensors after their last access. The detailed algorithm of swapping scheduling is shown in Algorithm 1. An example of a swapping plan is shown in Figure 5.

For the tensors which need to be updated in the optimizing phase, such as the kernel of the convolution operator shown in Figure 2(a) and the 1st and 2nd moment vectors in the Adam optimizer, if they are swapped out in the optimizing phase, then the corresponding tensors must be swapped in at the beginning of the next iteration. As shown in Algorithm 1, Lines 9-10 and Line 28, it tries to swap out the updated tensors and then swap in the corresponding tensors before the first OTA.

With this algorithm, we can select the most valuable tensors to be appropriately scheduled within a job, thus reducing the entire computation job's GPU memory peak.

4.2 GPU Memory Peak Analysis

To reduce the entire system's GPU memory peak, the scheduling algorithm has to know when the GPU memory peak appears and which tensors cause it, so that it can try to schedule the Swap Tasks of those tensors. Thus, with the information inferred from the computational graph, we develop an algorithm in this section to statically analyze the GPU memory peak.

Obviously, the GPU memory footprint changes only when these five situations happen:

(1) Iteration Beginning: At the beginning of a training iteration, the input data of the model and the parameters which were not swapped out in the last iteration are in GPU memory. These tensors are used to initialize the GPU memory footprint.
(2) Output Tensor Access: When a tensor is generated into GPU memory via an OTA, the GPU memory footprint increases. Note that when updating a parameter, we logically treat the new parameter as a new tensor, but since it reuses the address of the old parameter, the new parameter does not increase the GPU memory footprint.
(3) Swap In Task: A Swap In Task increases GPU memory consumption. Since we are only concerned with the peak of the GPU memory footprint, the Swap In Task's finishing time can be regarded as the time point at which the footprint increases.
(4) Swap Out Task: When a tensor is copied to Host Memory and released, the GPU memory footprint is reduced at the end of the Swap Out Task (or at the end of the corresponding OTA, if the OTA ends later).
(5) Tensor Release: For a Tensor Access i whose tensor needs to be recomputed or swapped in, when the previous Tensor Access i-1 finishes, the tensor can be released from GPU memory, so the GPU memory footprint decreases.

Based on the above discussion, we just need to sort the Tensor Accesses and Swap Tasks by the time when they change the GPU memory footprint. Then we can traverse the timeline of the job to determine the GPU memory peak.
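A compact sketch of this timeline sweep is given below; the event layout and types are our own simplification of the five cases above, not the exact structures used in Algorithm 2.

```python
# Sketch of the memory-peak sweep: walk time-ordered events and track the
# footprint. Event layout (time, kind, size, tensor) is a simplifying assumption.
from collections import namedtuple

Event = namedtuple("Event", "time kind size tensor")  # kind: ota / swap_in / swap_out / release

def analyze_peak(initial_bytes, events):
    used = initial_bytes          # case (1): inputs + resident parameters
    peak, peak_time, peak_tensors = used, 0.0, set()
    resident = set()
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind in ("ota", "swap_in"):          # cases (2) and (3): footprint grows
            used += ev.size
            resident.add(ev.tensor)
        elif ev.kind in ("swap_out", "release"):   # cases (4) and (5): footprint shrinks
            used -= ev.size
            resident.discard(ev.tensor)
        if used > peak:
            peak, peak_time, peak_tensors = used, ev.time, set(resident)
    return peak, peak_time, peak_tensors

events = [Event(1.0, "ota", 400, "fmap_1"), Event(2.0, "ota", 800, "fmap_2"),
          Event(2.5, "swap_out", 400, "fmap_1"), Event(4.0, "swap_in", 400, "fmap_1")]
print(analyze_peak(1000, events))  # peak of 2200 bytes at t=2.0
```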
Figure 5: An example of an estimated MLP's Tensor Access Sequence and Swap Tasks

We then introduce the analysis algorithm, which takes the Tensor Access Sequence and Swap Tasks as inputs, and outputs the GPU memory peak value (MP), the tensors which cause this peak (MPT), the last ITA (LIA), and the peak's time point (MPTime). To support the scheduling algorithm, the analysis algorithm also infers the start time of the swap tasks, which is defined as a tuple (access_id, delta_time). With the end time of the tensor access before the swap task as the origin time, we can keep the swap tasks' relative order at run-time even if the Tensor Access Sequence differs slightly from the actual sequence. The detailed algorithm is shown in Algorithm 2.

4.3 Tensor Access Sequence Generation

In prior work such as Capuchin [10], generating the Tensor Access Sequence (Access Pattern) is based on running the job in 'Passive Mode': the system observes the execution time of each tensor access to generate the Tensor Access Sequence. In 'Passive Mode', the system swaps tensors only when GPU memory is not enough, which causes extra overhead.

As discussed above, since the scheduling algorithm and the GPU memory peak analysis rely on the Tensor Access Sequence, the method has to be able to generate the initial Tensor Access Sequence, so that the unnecessary overhead of 'Passive Mode' can be avoided.

Another reason why we cannot use 'Passive Mode' to measure each operator's execution time is that 'Passive Mode' only generates the Tensor Access Sequence once, at the first epoch, which is not feasible because the execution time of an operator varies with the dynamic load of multiple jobs. Especially when a job launches, the GPU usage changes drastically and makes the prior scheduling plan lose efficacy. We have to keep updating the Tensor Access Sequence to fit the current workloads.

Considering that two pieces of information are required to infer the Tensor Access Sequence, the Tensor Accesses' order and the execution time of each Tensor Access, we can divide this problem into two sub-problems: inferring the Tensor Access order and obtaining each Tensor Access's execution time.

As mentioned above, with the computational graph, we execute the operators in topological order. Therefore, we can solve the first sub-problem and get the Tensor Accesses' order statically based on each operator's inputs and outputs.

However, predicting a CUDA kernel's execution time accurately is an extremely hard problem, especially with multitasking dynamic workloads. The most straightforward way to solve the second sub-problem is to run the job in 'Passive Mode', which means only swapping tensors passively when the GPU memory is not enough, like Capuchin [10]. This approach causes calculation and swapping not to overlap, so the computation must wait for swapping.

To make the initial scheduling plan more efficient, our solution uses pre-trained machine learning models to predict each GPU operator's execution time given the input data sizes and the current GPU utilization rate. When the system initializes, it measures each GPU operator's execution time with different input data and GPU usage to generate the training data. Since the execution time of a given operator depends on its inputs' shapes, its parameters such as the strides of a convolution operator, and the current GPU usage, the inputs of our prediction model are denoted as inputs ≡ ⟨dim¹_{x₁}, ..., dim^m_{x₁}, ..., dim^k_{x_n}, para₁, ..., para_s, gpu_usage⟩, where x₁ to x_n denote the input tensors of the operator, dim^i_{x_j} denotes the i-th dimension's size of input x_j, para_i denotes the i-th parameter of the operator, and gpu_usage denotes the current GPU utilization rate. We use a 3-layer MLP network as the prediction model, since the input is not complex and a simple MLP is enough to give relatively precise predictions. According to our experiments, the average R² of every operator's prediction is 0.582, and for the most expensive operators such as convolution the average R² is 0.805, which means that our method gives quite precise predictions for the most expensive operators.
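The feature construction just described might look like the following sketch; the fixed-length feature layout, the padding scheme, and the use of scikit-learn's MLPRegressor are our assumptions for illustration, not TENSILE's actual implementation.

```python
# Illustrative sketch: build the <dims..., params..., gpu_usage> feature vector
# for an operator and fit a small MLP to predict its execution time (ms).
# Feature layout and library choice (scikit-learn) are assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

MAX_DIMS = 8  # pad/truncate so every operator type yields a fixed-length vector

def make_features(input_shapes, op_params, gpu_usage):
    dims = [d for shape in input_shapes for d in shape]
    dims = (dims + [0] * MAX_DIMS)[:MAX_DIMS]          # fixed-size dim block
    return np.array(dims + list(op_params) + [gpu_usage], dtype=float)

# Hypothetical profiling samples for a conv operator: shapes, (stride, pad), gpu%
X = np.stack([
    make_features([(32, 64, 56, 56), (64, 64, 3, 3)], (1, 1), 0.20),
    make_features([(32, 64, 56, 56), (64, 64, 3, 3)], (1, 1), 0.75),
    make_features([(32, 128, 28, 28), (128, 128, 3, 3)], (2, 1), 0.40),
])
y = np.array([3.1, 5.8, 2.2])   # measured execution times in ms (made up)

model = MLPRegressor(hidden_layer_sizes=(64, 64, 64), max_iter=2000, random_state=0)
model.fit(X, y)
print(model.predict(X[:1]))     # predicted time for the first configuration
```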
Algorithm 1: Swapping Scheduling Algorithm

Input: TAS: tensor access sequence; ST: swap tasks; MP: memory peak; MPT: the tensors that cause the memory peak; MSR: max tensor swapping ratio for each job; SON: number of swapped-out tensors for each job; TAT: tensor access sequence grouped by tensor and sorted by time; all the outputs of Algorithm 2
Result: Add new Swap Tasks to ST

1  Function scheduling_swap(ST: list, tensor):
2    succeed <- False;
3    have_first_access <- False;
4    Generate swap_out_task and feasible_regions;
5    for each region in feasible_regions do
6      if region covers swap_out_task then
7        ST.add(swap_out_task);
8        succeed_swap_out <- True;
9        if tensor is an updated parameter then
10         first_access <- the first ITA of the corresponding non-updated parameter;
11       else
12         first_access <- the first ITA after access;
13       if first_access != None then
14         have_first_access <- True;
15         Generate swap_in_task from the first_access;
16         if try_schedule_swap_in(first_access) succeeds then
17           ST.add(swap_in_task);
18           succeed <- True;
19         else
20           ST.remove(swap_out_task);
21           swap_out_task.latest_time <- first_access.end_time;
22       break;
23   return succeed, succeed_swap_out, have_first_access;
24 for each tensor in MPT do
25   Infer the latest_time from the GPU memory peak analysis result;
26   Infer the earliest_time from the OTA of the swap-out task;
27   if tensor is an updated parameter in the optimizing phase then
28     scheduling_swap(tensor);
29   else if SON[tensor.job_id] <= MSR[tensor.job_id] and TAT[tensor].length > 1 and tensor has not been swapped out then
30     succeed <- False;
31     accesses <- TAT[tensor];
32     succeed_swap_out <- True;
33     have_first_access <- True;
34     while not succeed and latest_time > earliest_time and succeed_swap_out and have_first_access do
35       succeed, succeed_swap_out, have_first_access <- scheduling_swap(tensor);
       // accesses sorted by end time in ascending order
36     Try to swap in the rest of the accesses in accesses greedily if the above scheduling succeeded;

Algorithm 2: GPU Memory Peak Analysis

Input: TAS: tensor access sequence; ST: swap tasks
Output: MP: memory peak; MPTensor: tensors that cause the memory peak; LIA: last input access for each job before the memory peak; MPTime: time of the memory peak

1  Initialize memory_used and in_gpu_tensors with the tensors which will be updated in the optimizing phase and the inputs/outputs of the deep learning model;
2  for each event in time_axis do
3    time <- event.time;
4    Release the tensors in tensors_to_be_released whose end time is before the present time;
5    if event.type = output_access and event.tensor is not used to initialize memory_used and in_gpu_tensors then
6      memory_used <- memory_used + event.size;
7      in_gpu_tensors.add(event.tensor);
8    else if event.type = input_access then
9      if event.release = True then
10       tensors_to_be_released.append(event);
11     else
12       last_input_access <- event;
13   else if event.type = swap_out or event.type = swap_in then
14     last_access <- get_last_access(event, time_axis);
15     event.execute_ref <- last_access;
16     event.execute_time <- last_access.time;
17     if event.type = swap_in then
18       memory_used <- memory_used + event.size;
19       in_gpu_tensors.add(event.tensor);
20     else
21       memory_used <- memory_used - event.size;
22       in_gpu_tensors.remove(event.tensor);
23   if memory_used > memory_peak then
24     memory_peak <- memory_used;
25     MPTensor <- copy(in_gpu_tensors);
26     LIA <- last_input_access;
27     MPTime <- time;
28 return memory_peak, MPTensor, LIA, MPTime;
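To complement the pseudocode above, the sketch below shows one way the greedy placement rule of Section 4.1 (swap out as early as possible, swap in as late as possible, never overlapping another PCIe transfer) could be realized; the interval representation and helper names are assumptions, not the exact feasible-region logic of Algorithm 1.

```python
# Sketch of greedy swap placement inside a feasible time region, assuming the
# PCIe channel carries at most one transfer at a time. Names are illustrative.
def overlaps(a_start, a_end, b_start, b_end):
    return a_start < b_end and b_start < a_end

def place_transfer(duration, earliest, latest, busy, prefer_early):
    """Return a start time in [earliest, latest - duration] that does not
    overlap any (start, end) interval in `busy`, or None if impossible."""
    candidates = [earliest, latest - duration]
    # Also try sliding to the edges of every already scheduled transfer.
    candidates += [end for _, end in busy] + [start - duration for start, _ in busy]
    candidates = [t for t in candidates if earliest <= t <= latest - duration]
    candidates.sort(reverse=not prefer_early)
    for start in candidates:
        if not any(overlaps(start, start + duration, b0, b1) for b0, b1 in busy):
            return start
    return None

busy = [(4.0, 6.0)]                                                  # existing transfer
out_start = place_transfer(1.5, 2.0, 9.0, busy, prefer_early=True)   # swap-out: early
in_start = place_transfer(1.5, 2.0, 9.0, busy, prefer_early=False)   # swap-in: late
print(out_start, in_start)   # 2.0 and 7.5
```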


Although the GPU utilization rate is a general representation of the GPU usage condition and the prediction results will not be exactly accurate, we can keep tracking the operators' execution times and dynamically correct the predicted results with the exponentially weighted moving average method [11] during the job's run-time. Let t̂₀ be the last execution time used to generate the Tensor Access Sequence of a given operator, t₁, ..., t_n be the execution times measured at run-time, and β be the weight. Starting from t̂₀ and folding in the measured times, the recursive formula

t̂_{j+1} = β · t̂_j + (1 − β) · t_{j+1}

gives the updated estimated execution time of the operator. With the corrected Tensor Access Sequence, the scheduling plan becomes more accurate. This method achieves higher performance than swapping tensors passively when a job is just launched, since passive swapping incurs the extra overhead discussed above.

4.4 Recomputation Scheduling

After some iterations of swapping scheduling, no more tensors among all jobs in the system can be swapped. If the predicted GPU memory peak is still larger than the GPU memory, then we need to use recomputation to save more GPU memory. However, a major shortcoming of recomputation is that it adds extra ITAs and changes the Tensor Access Sequence. If the inputs of a recomputation have been swapped out or released, there is a chance that they in turn need to be swapped in or recomputed, which would disrupt the swapping plan just generated and cause much more time overhead. Therefore, we only apply recomputation to tensor accesses whose tensors have never been released. To achieve the best GPU memory saving effect, we select the accesses that conform to the above condition with the largest recomputation value one by one, until the GPU memory is enough to run these jobs.

To measure a recomputation's value, we use the formula proposed in Capuchin [10], known as MSPS. This metric measures the memory saving and the extra time cost at the same time:

MSPS ≡ Memory Saving / Recomputation Time

4.5 Discussion of Supporting Dynamic Workloads and the Range of Application

The reason that prior works such as Capuchin [10] cannot support multitasking dynamic workloads is that they rely on the fixed 'Tensor Access Pattern' of a single workload, which means observing the pattern in 'Passive Mode' first. However, in a multitasking dynamic workloads system, such as an in-database machine learning system, the tensors' access pattern keeps changing, so the scheduling process must be able to keep updating iteratively.

With Algorithm 2, TENSILE can give a more effective initial scheduling plan than 'Passive Mode'. The system keeps tracking recent tensor access times to correct the Tensor Access Sequence and the scheduling plan. When a job is launched, the Tensor Access Sequences of other jobs are most likely to fluctuate greatly. In this situation, TENSILE updates the Tensor Access Sequence based on the current GPU usage for all jobs. To avoid extra overhead caused by scheduling plan changes, the system actually applies the new plan right before computing the next batch of data. The complete scheduling algorithm is shown in Algorithm 3.

Firstly, to save more memory, we run activity analysis at the beginning of the scheduling: every tensor is released when its last ITA is finished. Then, the Memory Scheduler iteratively runs the process below until there is no more swapping or recomputation to be scheduled. In each iteration, the algorithm starts by running the GPU memory peak analysis (Algorithm 2) for each job and merging the results, as shown in Lines 3-4. Then, if at least one swap task was scheduled successfully in the previous iteration, it keeps trying to swap, as shown in Lines 5-9. Otherwise, if the GPU memory is still not enough, it tries to schedule the recomputation tasks, as shown in Lines 10-12.

Algorithm 3: Complete Scheduling Algorithm

Input: TAS: tensor access sequence; ST: swap tasks
1  Initialize swap_succeed <- True, recomputation_succeed <- True;
2  Run activity analysis;
3  while swap_succeed = True or recomputation_succeed = True do
4    Run Algorithm 2 for each job to analyse its GPU memory footprint and merge the results;
5    if swap_succeed = True then
6      swap_succeed <- False;
7      Run Algorithm 1 to schedule the swapping tasks;
8      if no tensors are scheduled successfully then
9        swap_succeed <- False;
10   else if MP >= Free_GPU_Memory then
11     if running Recomputation_Scheduling does not succeed then
12       recomputation_succeed <- False;

Although our method is designed to run on a single GPU, it can also support the data-parallel model for multiple GPUs in a cluster. In this situation, the scheduling process is the same as in the single-GPU case on each node, with the premise that the CPU has enough PCIe channels for the GPUs in a single node.
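Read as code, the outer loop of Algorithm 3 might look like the sketch below; the injected callables stand in for Algorithms 2 and 1 and the recomputation scheduler, and the free-memory check is our assumption about how MP is compared with the available GPU memory.

```python
# Sketch of the outer scheduling loop (in the spirit of Algorithm 3).
# The three callables are injected stand-ins for Algorithm 2, Algorithm 1,
# and the recomputation scheduler; their signatures are assumptions.
def build_schedule(jobs, free_gpu_memory,
                   analyze_peak, schedule_swaps, schedule_recomputation):
    swap_tasks, recompute_tasks = [], []
    swap_ok, recompute_ok = True, True
    while swap_ok or recompute_ok:
        # Re-analyze every iteration: earlier decisions move the peak around.
        peak, peak_tensors = analyze_peak(jobs, swap_tasks)
        if swap_ok:
            # Prefer swapping: it does not add extra tensor accesses.
            swap_ok = schedule_swaps(peak_tensors, swap_tasks)
        elif peak >= free_gpu_memory:
            # Swapping is exhausted but the peak still does not fit:
            # fall back to recomputation of never-released tensors.
            recompute_ok = schedule_recomputation(peak_tensors, recompute_tasks)
        else:
            break   # the plan already fits into GPU memory
    return swap_tasks, recompute_tasks
```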
5 EXPERIMENTS

To verify the effectiveness of the proposed approaches, we conduct extensive experiments in this section.

5.1 Methodology

Experiment Platform. To apply our scheduling algorithm, we implement a naive deep learning framework with CUDA as our experiment platform. We also reproduce vDNN [12] and Capuchin [10] on our framework based on their papers.

Metrics. We define the metrics used to evaluate the performance of our method. Memory Saving Ratio (MSR): the fraction of the vanilla group's memory footprint peak that is saved by the experimental group, i.e., (vanilla peak − experimental peak) / vanilla peak, where the vanilla group runs the jobs without any scheduling. Extra Overhead Ratio (EOR): the fraction of extra time cost incurred by the experimental group relative to the vanilla group, i.e., (experimental time − vanilla time) / vanilla time. Cost-Benefit Ratio (CBR): the ratio of MSR to EOR.

Baselines. We use two methods as baselines to compare with our method. The first is vDNN, which swaps out the appropriate feature maps and uses a static swapping-in strategy for prefetching. The second baseline is Capuchin, which is the first tensor granularity scheduling method for a single job. Note that the original Capuchin only tries to save enough memory so that the current job can just run within the limited GPU memory; if the GPU memory is enough to run the jobs, it won't schedule any tensors. To compare its efficiency with our method, we manually set its memory saving ratio to be the same as our method's, so that we can compare their cost-benefit ratios. Both the baselines and our method overlap data transfer and computation with cudaMemcpyAsync.

Statement of our reproduction. It is necessary to emphasize that our reproduction of Capuchin is not exactly the same as the authors' implementation in the paper [10], especially in the aspect of the system framework, since there is no source code to refer to and it is implemented in TensorFlow. For example, Capuchin measures the GPU kernel execution time through CUPTI, because the computation runs asynchronously in TensorFlow. However, in our platform, every computation is synchronized between CPU and GPU, so we simply measure the operators' execution time in the Python code, although this slows our computation down. What we want to compare is the scheduling algorithm itself running on the same platform, rather than whose platform implementation is computationally more efficient. Therefore, the experiments are fair, since all the methods' algorithms are implemented exactly as their papers describe and run on the same platform for comparison.

Experiment Setup. Our experiments are conducted on a server with four Intel Xeon Platinum 8163 2.5GHz CPUs, 256GB Host Memory and one RTX 2080Ti with 11GB memory. The system is Ubuntu 18.04.5 with GCC 7.5.0, CUDA 10.0, cuDNN 7.6.5 and Python 3.6.9.

Workloads. Our experiments contain five neural networks, VGG, Inception V3, Inception V4, ResNet-50, and DenseNet, covering linear and non-linear networks. We generate random data as the dataset and set the batch size to 2. We increase the number of workloads on one GPU and measure the GPU memory footprint and the extra time overhead for each method.

5.2 Single Task Performance

We first compare TENSILE's performance with vDNN and Capuchin on a single task. We run the workloads mentioned before with TENSILE and the two baselines respectively, and log their memory cost and time cost per epoch for comparison. For our method, the max tensor swapping ratio is set to 100%. The result is shown in Table 1.

Table 1: Single Task Performance

Workloads  Methods   MSR     EOR     CBR
VGG        vDNN      0.2733  0.8580  0.3185
           Capuchin  0.1003  0.8971  0.1118
           TENSILE   0.2241  0.0971  2.3006

The results show that our method saves 22.41% of memory compared with the vanilla method. Although our memory saving ratio is less than vDNN's by 4.92%, our extra time cost is also less than vDNN's by 76.09%. Also, TENSILE has much higher efficiency, since its Cost-Benefit Ratio is 230.06%, which is much higher than both baseline methods. This is because the baseline methods do not take the extra time cost into consideration, which is embodied in the fact that they allow swapping not to overlap with computation, and they also use a lot of recomputation to save as much memory as possible. On the contrary, TENSILE does not allow the computation to wait for swapping when scheduling, and it only uses recomputation if the GPU memory is not enough to run the jobs.

This result proves that although TENSILE is designed for multiple jobs and dynamic workloads, in the single-job situation it can still achieve the best efficiency and an acceptable memory saving ratio. Note that Capuchin cannot save a lot of memory because it is designed to only schedule the tensors which cause the OOM exception; the GPU memory is big enough to run the VGG task, so Capuchin did not try to save much memory.

5.3 Scalability

To prove our method's performance on multitasking dynamic workloads, we choose one to three workloads respectively, launched at the same time, to test TENSILE's scalability. We repeat each test three times and report the average metric.

Table 2: Multitasking Dynamic Workloads Performance

Metric  Methods   VGG x1  VGG x2  VGG x3
MSR     vDNN      0.2822  0.2368  0.2613
        Capuchin  0.1003  0.0056  0.1109
        TENSILE   0.2001  0.2295  0.2767
EOR     vDNN      0.8507  0.9103  0.9202
        Capuchin  0.8921  0.9208  0.9215
        TENSILE   0.0838  0.1016  0.2793
CBR     vDNN      0.3317  0.2602  0.2840
        Capuchin  0.1124  0.006   0.1204
        TENSILE   2.5112  3.2256  1.0478

As Table 2 and Figure 6 show, TENSILE is able to save almost as much memory as vDNN at different job scales. Although its Cost-Benefit Ratio gets lower when the number of jobs increases to three, it is still much higher than both of the baselines. The reason why the Cost-Benefit Ratio rises first and falls later as the job number increases is that, with three jobs, there are too many tensors to swap, so more swapping tasks are blocked by other jobs' swapping tasks, causing more extra overhead. However, in the three-job situation, TENSILE still saves 27.67% of GPU memory, which is higher than the other baselines. This is strong proof that the performance of TENSILE is much better than vDNN and Capuchin in the multitasking dynamic workloads scenario.
Figure 6: Multitask Test

Finally, the data of the VGG x1 workload differs from Table 1 because we ran the two experiments separately. Although the settings are the same, the uncertainty of the operators' running times causes the difference between the results of the two experiments.

6 FUTURE WORK

Although our method is more effective than prior work, running the workloads asynchronously still causes some performance loss, so designing a method to solve this problem will be our future research direction. We plan to design a semi-synchronous method to schedule their execution.

Also, when the CPU does not have enough PCIe channels to support every GPU on it, our method cannot be applied to each GPU individually. An approach that supports dynamic PCIe bandwidth is needed to schedule tensors in this situation.

ACKNOWLEDGMENTS

This work was supported by the Research Fund of Harbin Institute of Technology. Additional funding was provided by the Honors School of HIT and the Massive Data Computing Research Center of HIT. We also thank Chang Shu for advice on the design of the scheduling algorithm.

REFERENCES

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265-283.
[2] T. Brown, B. Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, G. Krueger, T. Henighan, R. Child, Aditya Ramesh, D. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, E. Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, J. Clark, Christopher Berner, Sam McCandlish, A. Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv abs/2005.14165 (2020).
[3] Xiaoming Chen, Danny Chen, Yinhe Han, and Xiaobo Hu. 2018. moDNN: Memory Optimal Deep Neural Network Training on Graphics Processing Units. IEEE Transactions on Parallel and Distributed Systems PP (08 2018), 1-1. https://doi.org/10.1109/tpds.2018.2866582
[4] Shaveta Dargan, Munish Kumar, Maruthi Rohit Ayyagari, and Gulshan Kumar. 2019. A Survey of Deep Learning and Its Applications: A New Paradigm to Machine Learning. (07 2019).
[5] J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
[6] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 770-778.
[7] Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (12 2014).
[8] Thomas N. Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. CoRR abs/1609.02907 (2016). arXiv:1609.02907 http://arxiv.org/abs/1609.02907
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 60, 6 (May 2017), 84-90. https://doi.org/10.1145/3065386
[10] Xuan Peng, X. Shi, Hulin Dai, H. Jin, Weiliang Ma, Qian Xiong, Fan Yang, and Xuehai Qian. 2020. Capuchin: Tensor-based GPU Memory Management for Deep Learning. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (2020).
[11] Marcus Perry. 2010. The Exponentially Weighted Moving Average. https://doi.org/10.1002/9780470400531.eorms0314
[12] Minsoo Rhu, N. Gimelshein, Jason Clemons, A. Zulfiqar, and Stephen W. Keckler. 2016. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (2016), 1-13.
[13] Shriram S. B., Anshuj Garg, and Purushottam Kulkarni. 2019. Dynamic Memory Management for GPU-Based Training of Deep Neural Networks. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (2019), 200-209.
[14] Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Song, Zenglin Xu, and Tim Kraska. 2018. Superneurons: dynamic GPU memory management for training deep neural networks. ACM SIGPLAN Notices 53 (02 2018), 41-53. https://doi.org/10.1145/3200691.3178491
[15] Donglin Yang and Dazhao Cheng. 2020. Efficient GPU Memory Management for Nonlinear DNNs. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (Stockholm, Sweden) (HPDC '20). Association for Computing Machinery, New York, NY, USA, 185-196. https://doi.org/10.1145/3369583.3392684
[16] Donglin Yang and Dazhao Cheng. 2020. Efficient GPU Memory Management for Nonlinear DNNs. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (Stockholm, Sweden) (HPDC '20). Association for Computing Machinery, New York, NY, USA, 185-196. https://doi.org/10.1145/3369583.3392684
[17] Junzhe Zhang, Sai Ho Yeung, Yao Shu, Bingsheng He, and Wei Wang. 2019. Efficient memory management for GPU-based deep learning systems. arXiv preprint arXiv:1903.06631 (2019).
