TENSILE: A Tensor Granularity Dynamic GPU Memory Scheduler Method Towards Multiple Dynamic Workloads System
2 RELATED WORK
In this section, we introduce prior work in the area of GPU memory management.

NVIDIA first integrated the Unified Memory technology in CUDA 6 (https://developer.nvidia.com/blog/unified-memory-in-cuda-6/). It allows programmers to use GPU memory and host memory as one unified address space. Although Unified Memory is the foundation of higher-level operations such as swapping tensors between GPU memory and host memory, it is not specifically designed for scheduling deep learning workloads: programmers must still decide when, and which, tensors should be swapped. Moreover, it can only be used in CUDA programming.

Minsoo Rhu et al. proposed vDNN[12], the first work focused on scheduling GPU memory for deep learning. The core of vDNN is an algorithm that decides which tensors should be released, evicted, or prefetched. Although it is a pioneering work, some problems remain. Most importantly, unlike our method, vDNN schedules at layer granularity, which makes it unsuitable for deep learning models without a strict layer structure, such as the Capsule Network: these networks have complex structure inside a 'layer' whose intermediate results vDNN cannot schedule.

Shriram S.B. et al. improved vDNN and solved its delayed computation start, high pinned-memory requirements, and GPU memory fragmentation problems[13]. Although overlapping tensor eviction and prefetching with computation significantly improved performance, the other shortcomings of vDNN remain.

Linnan Wang et al. proposed SuperNeurons[14], the first dynamic run-time GPU memory management approach, which also introduced recomputation into the scheduling. The authors consider the convolution layer the most valuable layer to swap and designed dedicated approaches for convolution layers. However, SuperNeurons cannot support every kind of neural network, because it too is based on layer granularity, and it cannot work in dynamic-workload environments such as database systems.

Xiaoming Chen et al. proposed moDNN[3], whose key contribution is a new heuristic to schedule data eviction and prefetching transfers. This work also contains a convolution algorithm selection method and a sub-batch size selection method to further improve performance.

To better support non-linear networks, Donglin Yang et al. proposed the first method that supports a unified memory pool for non-linear networks[16]. The method constructs the execution order for non-linear networks based on graph analysis, and it also contains a Mobility placement policy to preserve contiguity for different tensors.

All the approaches above are based on layer granularity. In 2019, another method was proposed[17], the first approach that can schedule GPU memory at tensor granularity. It proposes two orthogonal techniques to reduce memory cost from the system perspective: managing the memory pool, and scheduling swapping with a heuristic method. Thanks to its tensor granularity, this method can support any kind of deep learning network.

Xuan Peng et al. proposed Capuchin[10], which observes the tensor access sequence and decides which tensors to swap and which to recompute using a value-based approach. This method is more flexible in supporting various types of neural networks than the layer-granularity methods. However, it depends heavily on observing the tensor access sequence during the first run of the computation, since that sequence is needed to generate the scheduling plan. In an environment with multitasking dynamic workloads, the tensor access sequence keeps changing, which makes Capuchin hard to apply in such systems. Moreover, its scheduling only considers the tensor accesses within one iteration. For example, the updated parameters' accesses at the end of an iteration cannot be swapped in proactively, so these parameters can only be swapped in passively at the next iteration, with additional time overhead.

3 OVERVIEW
In this section, we define the GPU memory scheduling problem and describe the overall design of our deep learning system.

3.1 Problem Definition
For the convenience of scheduling, we use a graph model, the static computational graph[1], to describe tensor calculation jobs. We denote a static computational graph as G(V, E), where V is the set of operators that manipulate tensors, and each (u, v) in E represents a tensor generated by operator u and used by operator v. Note that the graph is a DAG (Directed Acyclic Graph), just like the one in TensorFlow[1]. An example is shown in Figure 2(a): a convolution operator is represented as a node in the graph; such a node takes multiple tensors as input and outputs other tensors. When a job is launched, the engine executes these operators in topological order. This model can effectively represent the computation in modern deep learning frameworks such as TensorFlow[1].

Note that the intermediate results and the operators of the optimizer are also included in G, which differs from prior works. Those works are designed only to save GPU memory during forward/backward propagation, in order to achieve a larger batch size or a deeper network. Our method, in contrast, is designed for multitasking dynamic workloads. To support multiple workloads, the method must reduce the entire job's GPU memory peak, including the peak caused by the optimizer; otherwise, when peaks caused by different jobs occur at the same time, an OOM (Out Of Memory) exception is triggered. This matters especially when using an optimizer like Adam[7], which uses additional memory equal to twice the size of the parameters.

Based on the static computational graph, we explain the basic concepts of TENSILE. The input and output tensors of a given operator are defined as Input Tensor Accesses (ITA) and Output Tensor Accesses (OTA), respectively. Since the feed_dict operator generates the input tensors of the whole computational graph, we regard the feed_dict operator as an OTA. The execution time of a Tensor Access is defined by the corresponding operator's execution time. For example, in Figure 2(b), the Convolution operator takes tensors kernel, bias, and a tensor x
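To make the static derivation concrete, here is a minimal sketch (illustrative names, not the TENSILE implementation) of how a Tensor Access Sequence can be read off a static computational graph G(V, E): the operators are walked in topological order, and each edge (u, v) contributes an OTA at u and an ITA at v.

```python
from collections import defaultdict, deque

def topological_order(ops, edges):
    """ops: list of operator names; edges: list of (u, v, tensor_name)."""
    indeg = {op: 0 for op in ops}
    succ = defaultdict(list)
    for u, v, _ in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(op for op in ops if indeg[op] == 0)
    order = []
    while queue:
        op = queue.popleft()
        order.append(op)
        for v in succ[op]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order

def tensor_access_sequence(ops, edges):
    """Derive the static ITA/OTA sequence from the graph alone."""
    produced = defaultdict(list)   # operator -> tensors it outputs (OTA)
    consumed = defaultdict(list)   # operator -> tensors it reads (ITA)
    for u, v, t in edges:
        produced[u].append(t)
        consumed[v].append(t)
    seq = []
    for op in topological_order(ops, edges):
        seq += [("ITA", op, t) for t in consumed[op]]
        seq += [("OTA", op, t) for t in produced[op]]
    return seq

# Tiny example graph: feed_dict -> conv -> relu
ops = ["feed_dict", "conv", "relu"]
edges = [("feed_dict", "conv", "x"), ("conv", "relu", "y")]
print(tensor_access_sequence(ops, edges))
```

Note that feed_dict appears only with an OTA, matching its role as the producer of the graph's input tensors.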
3.2 System Architecture
Since no prior work is designed for multiple jobs, existing approaches lack the ability to schedule among multitasking dynamic workloads. To solve this problem, we developed a system that supports scheduling for multitasking dynamic workloads. The system collects the necessary information about all jobs, both before and during execution, to support the scheduling algorithm. This information includes each job's DAG, the execution time of each operator, and other run-time information such as GPU utilization. The system triggers the scheduling algorithm multiple times during job startup and operation; when it receives the latest scheduling plan, it applies the plan at the next iteration. To provide these functions, our system consists of five components: the Global Controller, Memory Scheduler, Executor, Swap Scheduler, and Swap Executor, as shown in Figure 3.
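The "apply at the next iteration" behavior can be sketched as a small holder that stages the latest plan from the scheduler and swaps it in only at an iteration boundary, so a running iteration never sees a half-applied plan. This is an assumed design with hypothetical names (PlanHolder, submit, begin_iteration), not code from the system:

```python
import threading

class PlanHolder:
    """Stage a new scheduling plan; activate it only at iteration boundaries."""
    def __init__(self, initial_plan):
        self._lock = threading.Lock()
        self._active = initial_plan   # plan used by the current iteration
        self._pending = None          # latest plan received from the scheduler

    def submit(self, plan):
        """Called by the scheduler thread whenever a new plan arrives."""
        with self._lock:
            self._pending = plan

    def begin_iteration(self):
        """Called by the executor at each iteration start; returns the plan."""
        with self._lock:
            if self._pending is not None:
                self._active, self._pending = self._pending, None
            return self._active

holder = PlanHolder(initial_plan={"swap": []})
holder.submit({"swap": ["conv1_out"]})
print(holder.begin_iteration())  # the new plan takes effect at this boundary
```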
peak's time point (MPTime). To support the scheduling algorithm, the analysis algorithm also infers the start time of each swap task, defined as a tuple (access_id, delta_time). By taking the end time of the tensor access preceding the swap task as the origin, we can preserve the swap tasks' relative order at run-time even when the actual Tensor Access Sequence differs slightly from the predicted one. The detailed algorithm is shown in Algorithm 2.

4.3 Tensor Access Sequence Generation
In prior work such as Capuchin[10], generating the Tensor Access Sequence (Access Pattern) is based on running the job in 'Passive Mode': the system observes the execution time of each tensor access to generate the Tensor Access Sequence. In 'Passive Mode', the system swaps tensors only when GPU memory is insufficient, which causes extra overhead.

As discussed above, since the scheduling algorithm and the GPU memory peak analysis rely on the Tensor Access Sequence, our method must be able to generate the initial Tensor Access Sequence itself, so that the unnecessary overhead of 'Passive Mode' can be avoided.

Another reason we cannot use 'Passive Mode' to measure each operator's execution time is that it generates the Tensor Access Sequence only once, at the first epoch. This is not feasible, because an operator's execution time varies with the dynamic load of multiple jobs. In particular, when a new job launches, GPU usage changes drastically and the prior scheduling plan loses efficacy; we must keep updating the Tensor Access Sequence to fit the current workloads.

Two pieces of information are required to infer the Tensor Access Sequence: the order of the Tensor Accesses, and the execution time of each Tensor Access. We therefore divide the problem into two sub-problems: inferring the Tensor Access order, and obtaining each Tensor Access's execution time.

As mentioned above, given the computational graph we execute the operators in topological order. We can therefore solve the first sub-problem and derive the Tensor Accesses' order statically from each operator's inputs and outputs.

However, accurately predicting a CUDA kernel's execution time is an extremely hard problem, especially under multitasking dynamic workloads. The most straightforward way to solve the second sub-problem is to run the job in 'Passive Mode', swapping tensors passively only when GPU memory is insufficient, as Capuchin[10] does. This approach prevents calculation and swapping from overlapping, so computation must wait for swapping. To make the initial scheduling plan more efficient, our solution instead uses pre-trained machine learning models to predict each GPU operator's execution time given its input data size and the current GPU utilization rate. When the system initializes, it measures each GPU operator's execution time with different input data and GPU usage levels to generate the training data.

Since the execution time of a given operator depends on its inputs' shapes, its parameters (such as the strides of a convolution operator), and the current GPU usage, the prediction model's input encodes all three.
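As a hedged illustration of the kind of feature vector such a predictor consumes (the exact encoding used by TENSILE may differ; build_features and max_rank are hypothetical names):

```python
def build_features(input_shapes, op_params, gpu_usage, max_rank=4):
    """Flatten input tensor shapes (zero-padded to max_rank dimensions),
    operator parameters, and the current GPU utilization into one
    fixed-order feature vector for the execution-time predictor."""
    feats = []
    for shape in input_shapes:
        padded = list(shape) + [0] * (max_rank - len(shape))
        feats.extend(padded[:max_rank])
    feats.extend(op_params)
    feats.append(gpu_usage)
    return feats

# Convolution example: a 4-D input and a 4-D kernel, stride/padding
# parameters, sampled while the GPU is 35% utilized.
x = build_features(
    input_shapes=[(32, 64, 56, 56), (128, 64, 3, 3)],
    op_params=[1, 1],        # e.g. stride, padding
    gpu_usage=0.35,
)
print(x)  # [32, 64, 56, 56, 128, 64, 3, 3, 1, 1, 0.35]
```

A small regressor (the paper uses a 3-layer MLP) is then trained on pairs of such feature vectors and measured execution times collected at system initialization.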
The inputs of the prediction model are denoted as inputs ≡ ⟨dim^0_{x_1}, ..., dim^m_{x_1}, ..., dim^k_{x_n}, para_1, ..., para_s, gpu_usage⟩, where x_1 to x_n denote the input tensors of the operator, dim^i_{x_j} denotes the size of the i-th dimension of input x_j, para_i denotes the i-th parameter of the operator, and gpu_usage denotes the current GPU utilization rate. We use a 3-layer MLP as the prediction model, since the input is not complex and a simple MLP is enough to give relatively precise predictions. In our experiments, the average R^2 of every operator's prediction is 0.582, and for the most expensive operators, such as convolution, the average R^2 is 0.805, which means our method gives quite precise predictions for exactly the operators that matter most.

Algorithm 1: Swapping Scheduling Algorithm
Input: TAS: tensor access sequence; ST: swap tasks; MP: memory peak; MPT: the tensors that cause the memory peak; MSR: max tensor swapping ratio for each job; SON: number of swapped-out tensors for each job; TAT: tensor access sequence grouped by tensor and sorted by time; all the outputs of Algorithm 2
Result: new swap tasks added to ST

Function scheduling_swap(ST: list, tensor):
    succeed ← False
    have_first_access ← False
    generate swap_out_task and feasible_regions
    for each region in feasible_regions do
        if region covers swap_out_task then
            ST.add(swap_out_task)
            succeed_swap_out ← True
            if tensor is an updated parameter then
                first_access ← the first ITA of the corresponding non-updated parameter
            else
                first_access ← the first ITA after access
            if first_access ≠ None then
                have_first_access ← True
                generate swap_in_task from first_access
                if try_schedule_swap_in(first_access) succeeds then
                    ST.add(swap_in_task)
                    succeed ← True
                else
                    ST.remove(swap_out_task)
                    swap_out_task.latest_time ← first_access.end_time
            break
    return succeed, succeed_swap_out, have_first_access

for each tensor in MPT do
    infer latest_time from the GPU memory peak analysis result
    infer earliest_time from the OTA of the swap-out task
    if tensor is an updated parameter in the optimizing phase then
        scheduling_swap(tensor)
    else if SON[tensor.job_id] ≤ MSR[tensor.job_id] and TAT[tensor].length > 1
            and tensor has not been swapped out then
        succeed ← False
        accesses ← TAT[tensor]
        succeed_swap_out ← True
        have_first_access ← True
        while not succeed and latest_time > earliest_time
              and succeed_swap_out and have_first_access do
            succeed, succeed_swap_out, have_first_access ← scheduling_swap(tensor)
        try to swap in the rest of the accesses in accesses  // sorted by end time in ascending order

Algorithm 2: GPU Memory Peak Analysis
Input: TAS: tensor access sequence; ST: swap tasks
Output: MP: memory peak; MPTensor: tensors that cause the memory peak; LIA: last input access for each job before the memory peak; MPTime: time of the memory peak

initialize memory_used and in_gpu_tensors with the tensors that will be updated in the optimizing phase and the inputs/outputs of the deep learning model
for each event in time_axis do
    time ← event.time
    release the tensors in tensors_to_be_released whose end time has passed
    if event.type = output_access and event.tensor was not used to initialize memory_used and in_gpu_tensors then
        memory_used ← memory_used + event.size
        in_gpu_tensors.add(event.tensor)
    else if event.type = input_access then
        if event.release = True then
            tensors_to_be_released.append(event)
        else
            last_input_access ← event
    else if event.type = swap_out or event.type = swap_in then
        last_access ← get_last_access(event, time_axis)
        event.execute_ref ← last_access
        event.execute_time ← last_access.time
        if event.type = swap_in then
            memory_used ← memory_used + event.size
            in_gpu_tensors.add(event.tensor)
        else
            memory_used ← memory_used − event.size
            in_gpu_tensors.remove(event.tensor)
    if memory_used > memory_peak then
        memory_peak ← memory_used
        MPTensor ← copy(in_gpu_tensors)
        LIA ← last_input_access
        MPTime ← time
return memory_peak, MPTensor, LIA, MPTime
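Algorithm 2's event walk can be condensed into a runnable sketch. The event model below is a simplification (releases are applied immediately rather than deferred to their end times, and swap events are not re-timed), so it illustrates the peak-tracking logic rather than reproducing the paper's pseudocode exactly:

```python
def analyze_memory_peak(events):
    """events: list of (time, type, tensor, size) sorted by time, where
    type is one of 'output_access', 'release', 'swap_in', 'swap_out'.
    Returns the memory peak, the tensors resident at the peak, and
    the peak's time, mirroring MP / MPTensor / MPTime."""
    memory_used, in_gpu = 0, set()
    peak, peak_tensors, peak_time = 0, set(), None
    for time, etype, tensor, size in events:
        if etype in ("output_access", "swap_in"):     # tensor enters GPU
            memory_used += size
            in_gpu.add(tensor)
        elif etype in ("release", "swap_out"):        # tensor leaves GPU
            memory_used -= size
            in_gpu.discard(tensor)
        if memory_used > peak:                        # record a new peak
            peak, peak_tensors, peak_time = memory_used, set(in_gpu), time
    return peak, peak_tensors, peak_time

events = [
    (0, "output_access", "x", 4),
    (1, "output_access", "y", 6),   # peak: x and y resident, 10 in total
    (2, "swap_out", "x", 4),
    (3, "output_access", "z", 2),
]
print(analyze_memory_peak(events))  # peak of 10 at time 1, with x and y resident
```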
ACKNOWLEDGMENTS
This work was supported by the Research Fund of Harbin Institute of Technology. Additional funding was provided by the Honors School of HIT and the Massive Data Computing Research Center of HIT. We also thank Chang Shu for advice on the design of the scheduling algorithm.
REFERENCES
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
[2] T. Brown, B. Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, G. Krueger, T. Henighan, R. Child, Aditya Ramesh, D. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, E. Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, J. Clark, Christopher Berner, Sam McCandlish, A. Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv abs/2005.14165 (2020).
[3] Xiaoming Chen, Danny Chen, Yinhe Han, and Xiaobo Hu. 2018. moDNN: Memory Optimal Deep Neural Network Training on Graphics Processing Units. IEEE Transactions on Parallel and Distributed Systems (2018). https://doi.org/10.1109/tpds.2018.2866582
[4] Shaveta Dargan, Munish Kumar, Maruthi Rohit Ayyagari, et al. 2019. A Survey of Deep Learning and Its Applications: A New Paradigm to Machine Learning. (2019).
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.