Duan et al. - 2024 - MuxServe: Flexible Multiplexing for Efficient Mult…
Jiangfei Duan 1 2 Runyu Lu 3 Haojie Duanmu 2 4 Xiuhong Li 5 Xingcheng Zhang 2 Dahua Lin 1 2
Ion Stoica 6 Hao Zhang 7
endpoints for use-cases like chat, programming and search. However, efficiently serving multiple LLMs poses significant challenges for existing

[Figure 1(b): temporal multiplexing — prefill (R1p–R3p) and decoding steps of requests R1–R3 from LLMs A and B interleaved over time on GPU 0 and GPU 1.]

Figure 2. The dynamic request arrival rates of different LLMs over a 20-day period.

Figure 3. Relative batch inference latency as the fraction of computing resources assigned to LLaMA-7B changes from 30% to 100%. The input sequence length is 128. [Panels for batch sizes BS = 1, 4, 16, 64; x-axis: SM resource (%); y-axis: relative latency.]
lelism, and scheduling requests in an interleaved manner to share the computation and memory resources. However, this approach does not fully leverage the potential of GPUs when serving multiple LLMs, as it overlooks the unique characteristics of the prefill and incremental decoding phases of autoregressive LLMs. The incremental decoding phase, which typically plays a significant role in the inference process, often falls short in fully utilizing GPUs. Temporal multiplexing therefore brings a wave-shaped utilization change, and most of the time utilization sits in the trough (Figure 1b).

In this work, we explore serving multiple LLMs with flexible spatial-temporal multiplexing to improve GPU utilization (Figure 1c), motivated by the following two key insights. Firstly, since the prefill and incremental decoding phases have distinct computation characteristics, we separate them into different jobs and flexibly colocate prefill or decoding jobs from different LLMs to multiplex computation resources. Secondly, we colocate LLMs considering their popularity to multiplex memory resources and improve utilization. In Figure 1c, a request of LLM B can be scheduled at its arrival since the incremental decoding phase of LLM A cannot fully utilize the GPUs. This flexible multiplexing allows MuxServe to finish all the requests in a shorter time, thus improving utilization.

We design and implement MuxServe to enable flexible and efficient spatial-temporal multiplexing for multiple LLM serving. Given a cluster configuration and a set of LLMs with workloads, MuxServe first formulates an optimization problem to search for the optimal colocations and batching and scheduling strategy (Section 3.1). To solve the problem efficiently, MuxServe proposes an enumeration-based greedy placement algorithm (Section 3.2) and an adaptive batch scheduling algorithm (Section 3.3) to maximize utilization while ensuring fair sharing among LLMs. Spatial-temporal partitioning is achieved by partitioning GPU SMs using CUDA MPS (NVIDIA, 2022b), and MuxServe designs a novel unified resource manager (Section 3.4) to enable efficient multiplexing. We finally evaluate MuxServe with both synthetic and real workloads on a 32-GPU cluster. Evaluation results show that MuxServe achieves up to 1.8× higher throughput compared to prior state-of-the-art systems.

In summary, MuxServe makes the following contributions:

1. The first to explore spatial-temporal multiplexing for LLM serving and formally formulate the problem.

2. A novel placement algorithm and adaptive batch scheduling strategy to determine the best collocations and maximize utilization.

3. A viable system design and implementation with a unified resource manager to enable efficient multiplexing of LLMs, and a comprehensive evaluation.

2. Background and Motivation

2.1. LLM Inference

LLMs stack Transformer (Vaswani et al., 2017) blocks, and each block consists of multi-head attention and feed-forward networks. Given input prompts, LLMs generate output tokens in an autoregressive manner. The inference process includes two phases: prefill and incremental decoding. In the prefill phase, LLMs process the entire prompt tokens in parallel and generate the first output token. Subsequently, in the decoding phase, LLMs iteratively generate one output token at a time, building upon previously generated tokens. During inference, LLMs save a key-value cache (KV cache) for each token, which grows dynamically as tokens are produced.

The two phases of LLM inference exhibit distinct characteristics. The prefill phase, characterized by long input prompts, heavily utilizes computation resources, while the incremental decoding phase, with limited generated tokens, results in insufficient GPU utilization, despite dominating the inference process due to the need to generate lengthy outputs for each prompt. For example, the average prompt and output lengths are 161 and 338 tokens in ShareGPT (ShareGPT-Team, 2023), respectively.
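As a toy illustration of the two phases described above (this is not MuxServe's implementation), the loop below mimics how prefill consumes the whole prompt in one parallel step and seeds the KV cache, while decoding appends one token and one cache entry per step; the token arithmetic is a hypothetical stand-in for a real model call.

```python
# Toy sketch of autoregressive inference with a KV cache: prefill handles
# all prompt tokens at once; decoding then produces one token at a time.

def prefill(prompt_tokens):
    """Process all prompt tokens in parallel; return KV cache and first token."""
    kv_cache = [("kv", t) for t in prompt_tokens]  # one cache entry per token
    first_token = len(prompt_tokens)               # stand-in for a model call
    return kv_cache, first_token

def decode_step(kv_cache, last_token):
    """Generate one token, attending over the cache; the cache grows by one."""
    kv_cache.append(("kv", last_token))
    return last_token + 1                          # stand-in for a model call

def generate(prompt_tokens, num_new_tokens):
    kv_cache, token = prefill(prompt_tokens)
    output = [token]
    for _ in range(num_new_tokens - 1):
        token = decode_step(kv_cache, token)
        output.append(token)
    return output, len(kv_cache)

out, cache_len = generate([1, 2, 3, 4], 3)
```

The cache length reflects the paper's point: it grows with every prompt and generated token, which is why decoding, despite its low compute intensity, dominates memory pressure for long outputs.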
2.2. Distributed LLM Inference

Distributed inference is introduced to accommodate LLMs that cannot fit in a single GPU or to accelerate the inference process, and includes two categories. Intra-operator parallelism (Zheng et al., 2022) splits individual LLM layers across multiple GPUs and requires collective communications to transform input and output tensors during inference. It can significantly reduce inference latency, but also introduces additional communication overheads. Inter-operator parallelism partitions LLMs into multiple stages. Each stage is placed on one GPU with data dependency, and executed in a pipeline fashion (i.e. pipeline parallelism (Narayanan et al., 2019)). Pipeline parallelism comes with negligible overhead, but does not reduce inference latency.

2.3. LLM Popularity Varies

Figure 2 displays the serving traffic of multiple LLMs over 20 days, as observed from an LLM endpoint provider. It is evident that the popularity varies significantly, and each LLM experiences distinct and changing arrival rates. Popular LLMs (blue line) consistently receive a considerably higher volume of serving traffic compared to other LLMs, resulting in higher resource demands. In contrast, less popular LLMs may exhibit consistently low arrival rates throughout the observed period, occupying fewer resources. This dynamic and diverse nature of request arrival rates emphasizes the need for a flexible and adaptive approach to efficiently serve multiple LLMs based on their individual popularity and demand, which would translate into significant cost reduction for LLM endpoint providers.

2.4. Multiplexing Opportunity on GPU

Spatial multiplexing is a resource-sharing technique that splits the GPU resources (memory and/or SMs, i.e. Streaming Multiprocessors) into smaller partitions. Each partition is then allocated to perform different tasks simultaneously. Temporal multiplexing enables sharing of GPUs where each task occupies the entire computation resource during a specific time interval. With multiplexing, GPUs can handle multiple workloads concurrently, leading to increased efficiency and throughput. NVIDIA also offers Multi-Instance GPU (MIG) (NVIDIA, 2022a) to split memory and SMs into independent instances, and CUDA MPS (NVIDIA, 2022b) to partition SMs for different processes.

Prior works (Dhakal et al., 2020; Tan et al., 2021) have explored spatial multiplexing to serve multiple DNNs by assigning each DNN to a separate partition for enhanced performance. However, serving LLMs presents non-trivial challenges due to their unique characteristics. A significant difference lies in the memory bottleneck: the huge memory requirements render previous approaches ineffective, since it is infeasible to hold multiple LLMs in a single GPU. To mitigate the memory bottleneck, AlpaServe (Li et al., 2023) involves parallelism to distribute several large models on multiple GPUs and utilizes temporal multiplexing to serve these models. However, temporal multiplexing ignores the characteristics of the prefill and decoding phases of LLMs. As illustrated in Figure 3, when the amount of computation resources allocated to the dominant decoding phase is reduced, it does not lead to a substantial decrease in latency or throughput. Moreover, parallelization across multiple GPUs further reduces the computation requirements of LLMs. Temporal multiplexing thus results in significant resource under-utilization.

Recognizing the distinct resource requirements of the prefill and decoding phases, we reveal that different LLMs can be colocated to multiplex the computation resources flexibly for improved efficiency and utilization. In the following sections, we present MuxServe, the first system that explores spatial-temporal multiplexing for multiple LLM serving.

3. Method

3.1. Problem Formulation

Consider a cluster C and a set of LLMs M with workload W¹ to be served. One key insight MuxServe leverages to improve system utilization is to colocate LLMs considering their popularity: LLMs with high request rates can be colocated with LLMs with low request rates to efficiently utilize resources. To achieve this, we introduce the LLM unit, which refers to a group of LLMs that will be colocated together, along with the GPUs they are assigned. Our goal is to find the best group of LLM units B* that maximizes GPU utilization (i.e. throughput); hence the problem can be formulated as

    B* = arg max_{B ∈ 𝔹} Σ_{b ∈ B} F(b, W_b)        (1)

where 𝔹 represents all possible groups of LLM units, and F(·, ·) estimates the throughput of a unit b with workload W_b.

Within an LLM unit, MuxServe leverages the insight that the prefill and decoding phases of LLM inference exhibit distinct computation resource requirements, and splits them into prefill and decoding jobs. Each job occupies a fixed amount of SM resources and executes a prefill or decoding step for a batch of requests of an LLM. Different jobs can be flexibly colocated to share the computation and memory resources. However, as there are multiple LLMs with distinct

¹ Suppose the workload is known. Otherwise, the workload can be estimated from history traffic, since it changes slowly.
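To make the search space of Equation (1) concrete, the sketch below enumerates every way of grouping LLMs into units and scores each group by a per-unit throughput estimate. The colocation bonus in `toy_tpt` is a hypothetical stand-in for F(·, ·), and GPU assignment is omitted; MuxServe solves this with the greedy procedure of Algorithms 1–2 rather than by enumeration.

```python
from itertools import combinations

# Brute-force sketch of Equation (1): pick the group of LLM units B that
# maximizes the summed per-unit throughput estimate F(b, W_b).

def partitions(llms):
    """Yield all ways to split `llms` into non-empty units (order-free)."""
    if not llms:
        yield []
        return
    first, rest = llms[0], llms[1:]
    for k in range(len(rest) + 1):
        for others in combinations(rest, k):
            unit = (first,) + others
            remaining = [m for m in rest if m not in others]
            for sub in partitions(remaining):
                yield [unit] + sub

def best_unit_group(llms, estimate_throughput):
    """argmax over unit groups of the summed per-unit throughput estimate."""
    return max(partitions(llms),
               key=lambda group: sum(estimate_throughput(u) for u in group))

# Hypothetical example: colocating a popular LLM with an unpopular one
# beats serving them apart, until a unit gets too crowded.
rates = {"A": 8.0, "B": 1.0, "C": 1.0}

def toy_tpt(unit):
    total = sum(rates[m] for m in unit)
    return total * (1.2 if len(unit) == 2 else 1.0)  # assumed colocation bonus

best = best_unit_group(["A", "B", "C"], toy_tpt)
```

Under these toy numbers, the optimum pairs the popular LLM with exactly one unpopular LLM, which mirrors the popularity-aware colocation insight above.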
Algorithm 1 Enumeration-based Greedy LLM Placement
    Input: LLM list M, cluster C, workload W
    Output: the optimal group of LLM units best_units
    M̂ ← llm_parallel_candidate(M, W)    // Algorithm 2
    𝔻 ← get_potential_device_mesh_groups(C, M)
    best_tpt, best_units ← 0, None
    for D ∈ 𝔻 do
        // Greedily place LLMs on mesh group D
        M′ ← sort(M̂, key=computation, descend=True)
        for m ∈ M′ do
            best_mesh, max_delta ← None, −1
            for d ∈ D do
                u ← make_unit(m, d)
                delta ← F(u, W_u) − F(d.u, W_d.u)
                if delta > max_delta then
                    best_mesh, max_delta ← d, delta
                end if
            end for
            best_mesh.u ← make_unit(m, best_mesh)
        end for
        tpt ← sum(F(d.u, W_d.u) for d ∈ D)
        if best_tpt < tpt then
            best_tpt, best_units ← tpt, [d.u for d ∈ D]
        end if
    end for

Algorithm 2 LLM Parallel Candidate Generation
    Input: LLM list M, workload W
    Output: the parallel candidates M̂
    M̂ ← []
    for m ∈ M do
        sm_list ← get_potential_sm_list(m)
        tp_list ← get_potential_tp_degree(m)
        for p ∈ tp_list do
            for num_sm ∈ sorted(sm_list) do
                tpt, bs ← estimate_throughput(m, num_sm, p)
                if tpt ≥ W_m then
                    m.add_candidate((p, num_sm, bs))
                    break
                end if
            end for
        end for
        M̂.append(model_candidate)
    end for

workload characteristics, different batching and scheduling strategies can lead to different throughputs, and different LLMs may also compete for resources. Therefore, given an LLM unit b that contains colocated LLMs b_llm, we need to find the optimal batching and scheduling strategy S that maximizes the throughput of the entire unit, while ensuring fair resource sharing among LLMs within the unit. The problem F(b, W_b) can thus be formulated as

    F(b, W_b) = max_S Σ_{m ∈ b_llm} tpt_S(m, b, W_b)
    s.t. |R(m_i, W_{m_i}) − R(m_j, W_{m_j})| ≤ ϵ,  ∀ m_i, m_j ∈ b_llm        (2)

where tpt_S(·, ·, ·) estimates the throughput of an LLM m in the unit b using strategy S, R(·, ·) estimates the normalized computation or memory resource consumption of an LLM m with workload W_m, and ϵ is a small number ensuring fairness. R(·, ·) is normalized to account for varying LLM scales and popularity, since large and popular LLMs typically require more resources.

Given the formulation above, we first introduce our placement algorithm to solve the problem (Equation (1)) in Section 3.2, which maximizes the intra-unit throughput (Equation (2)) with our batching and scheduling strategy (Section 3.3). Finally, we describe our unified resource manager that enables efficient multiplexing in Section 3.4.

3.2. Placement Algorithm

Determining the optimal group of LLM units poses a challenging combinatorial optimization problem. As the number of devices and LLMs increases, the total number of possible LLM unit combinations grows exponentially. To solve Equation (1) efficiently, we design an enumeration-based greedy algorithm as outlined in Algorithm 1. The insight behind it is to prioritize the placement selection for LLMs with large computation requirements, which considers both the model scale and popularity. With this algorithm, MuxServe can find a good solution efficiently.

In Algorithm 1, MuxServe first calls Algorithm 2 to generate all possible parallel candidates M̂ considering the workload W. A parallel candidate refers to a configuration that meets the workload requirements while utilizing the fewest number of SMs. For each LLM m, MuxServe enumerates all possible combinations of configurations by varying the number of SMs and intra-operator parallelism degrees to find a set of parallel candidates. For each intra-operator parallelism degree, MuxServe has one possible parallel candidate.

MuxServe then enumerates all potential device mesh groups to find the best LLM units. Each mesh comprises several GPUs that will be used to serve a set of LLMs concurrently. Given a device mesh group D and a list of parallel partition candidates M̂, MuxServe greedily places LLMs on meshes to find the optimal group of LLM units. MuxServe prioritizes the placement selection for LLMs with large computation requirements to maximize serving utilization. For a specified LLM m, MuxServe iterates over all available meshes and approximates the expected increase in throughput with F(·, ·). The LLM m is then placed on the mesh that yields the maximum throughput increase. This process is repeated for all LLMs. Subsequently, MuxServe estimates the serving throughput of the LLM units after the placement, and selects the LLM units that offer the best serving throughput as the optimal group of LLM units.

The complexity of Algorithm 1 is O(MCD), where M is the number of LLMs, C is the number of devices, and D is the number of potential device mesh groups. Given a large cluster, enumerating all possible device mesh groups can be slow. We prune the search space effectively by incorporating the following heuristics: intra-operator parallelism is typically adopted within a node, and the workload imposes constraints on the possible mesh size.

3.3. Maximize Intra-unit Throughput

…the observation that KV cache size poses a significant performance bottleneck for LLM serving. MuxServe assigns a token block quota to each LLM to ensure fair sharing. Counting token blocks provides a more intuitive way to consider the scale of LLMs, as tokens from different LLMs consume varying amounts of KV cache. Moreover, to consider variations in workload characteristics, R(·, ·) is also normalized by request rates.

Then we maximize the intra-unit throughput by exploring prefill-decoding and decoding-decoding collocation. MuxServe prioritizes prefill jobs and fills remaining resources with decoding jobs. This is motivated by the observation that decoding jobs of a single LLM typically require only a few computation resources and can be batched together. Prioritizing prefill jobs can maximize the opportunity for colocation and batching.

Algorithm 3 describes our adaptive batch scheduling (ADBS) algorithm, which maximizes intra-unit throughput while ensuring fairness. ADBS first restricts the usage of token blocks by setting a quota for each LLM. In the main loop, if there are no prefill jobs in execution, ADBS employs a round-robin approach to select and execute a prefill job from the served LLMs. If the resources are not enough, ADBS stops scheduling decoding jobs until the resource capacity is met. Otherwise, ADBS schedules decoding jobs round-robin until resources run out, to maximize colocation.

To further improve utilization, ADBS adapts the token block quota for each LLM periodically. During runtime, MuxServe monitors the KV cache utilization, identifies low-utilization LLMs, and proactively transfers KV cache blocks from these LLMs to high-utilization LLMs. This dynamic reallocation of KV cache blocks ensures optimal utilization of resources and promotes efficient sharing among the LLMs within a unit.

ADBS approximates the solution of Equation (2) to maximize intra-unit throughput. But the concrete throughput tpt_S(·, ·, ·) cannot be obtained without profiling. To address this, we build a simple yet effective analytical estimator to estimate the throughput of LLM m with

Algorithm 3 Adaptive Batch Scheduling (ADBS)
    Input: LLM list M
    prefill_waiting ← False
    quota ← init_token_block_quota(M)
    while True do
        if no prefill jobs in execution then
            prefill_waiting ← True
            m ← round-robin a prefill job from M
            if resource_enough(m, quota) then
                execute_and_update(m, quota)
                prefill_waiting ← False
            end if
        end if
        if not prefill_waiting then
            m ← round-robin a decoding job from M
            while resource_enough(m, quota) do
                execute_and_update(m, quota)
                m ← round-robin a decoding job from M
            end while
        end if
        quota ← adapt_quota_periodically(M, quota)
    end while
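A minimal runnable sketch of one scheduling round in the spirit of Algorithm 3, assuming per-LLM prefill and decoding queues and a unit token-block cost per job (both assumptions for illustration): prefill is tried first round-robin; if it cannot get blocks, decoding is held back, otherwise remaining quota is filled with decoding jobs. The real system tracks actual KV cache blocks and runs this loop continuously.

```python
from collections import deque

def resource_enough(llm, quota, usage, cost):
    """True if `llm` can take `cost` more token blocks within its quota."""
    return usage[llm] + cost <= quota[llm]

def adbs_round(llms, prefill_q, decode_q, quota, usage, cost=1):
    """One simplified ADBS round; returns the jobs executed this round."""
    executed = []
    prefill_waiting = False
    # Round-robin: try the first LLM that has a pending prefill job.
    for llm in llms:
        if prefill_q[llm]:
            job = prefill_q[llm][0]
            if resource_enough(llm, quota, usage, cost):
                prefill_q[llm].popleft()
                usage[llm] += cost
                executed.append(job)
            else:
                # Not enough blocks: hold decoding until capacity frees up.
                prefill_waiting = True
            break
    if not prefill_waiting:
        # Fill remaining resources with decoding jobs, round-robin.
        progress = True
        while progress:
            progress = False
            for llm in llms:
                if decode_q[llm] and resource_enough(llm, quota, usage, cost):
                    usage[llm] += cost
                    executed.append(decode_q[llm].popleft())
                    progress = True
    return executed

llms = ["A", "B"]
prefill_q = {"A": deque(["A-prefill"]), "B": deque()}
decode_q = {"A": deque(["A-dec1"]), "B": deque(["B-dec1", "B-dec2"])}
quota = {"A": 2, "B": 2}
usage = {"A": 0, "B": 0}
ran = adbs_round(llms, prefill_q, decode_q, quota, usage)
```

The per-LLM `quota` dictionary is the hook for the periodic quota adaptation described above: shrinking one LLM's quota and growing another's shifts KV cache blocks between them without changing the loop.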
prefill and decoding latency of different batches and request lengths can be profiled in advance. The average generation length can be estimated from request history or a specific dataset, for instance ShareGPT (ShareGPT-Team, 2023). Given the request arrival rates, we can use binary search to find the batch size b that can satisfy the traffic.

3.4. Resource Manager

After finding the optimal LLM units and determining the batching and scheduling strategy, MuxServe requires a new mechanism to support flexible and efficient spatial-temporal multiplexing of LLMs, due to the following challenges: different prefill and decoding jobs need to flexibly share the computation resources, and to share the weights and KV cache to reduce memory waste and fragmentation. To address these challenges, MuxServe proposes a unified resource manager for flexible and efficient multiplexing. Figure 4 shows the overview of GPU resource management in an LLM unit.

Figure 4. Overview of GPU resource management in an LLM unit. The memory is divided into 3 partitions to store KV cache, weights and activations, respectively. The parallel runtime partitions SMs dynamically across different jobs.

The parallel runtime manages computation resources of an LLM unit at the granularity of SMs. MuxServe schedules prefill and decoding jobs from colocated LLMs with the ADBS algorithm, and the parallel runtime then dynamically assigns SMs to each job at runtime rather than allocating them statically. The implementation is based on CUDA MPS (NVIDIA, 2022b). As illustrated in Figure 4, all SMs are allocated to one job at step 1, and to two jobs at step 2.

The prominent challenge in sharing memory resources among different jobs is to reduce memory waste and fragmentation. LLM weights and the KV cache consume a huge amount of memory and need to be shared among jobs. Furthermore, the KV cache grows dynamically, and different LLMs possess varying sizes of KV cache due to differences in the number of attention heads, hidden layers, and hidden sizes.

To efficiently share memory resources, MuxServe divides them into three partitions. The first partition is a unified KV cache space enabled by our head-wise cache, leveraging the observation that the size of each attention head is often consistent or has limited variation across different LLMs; for example, LLaMAs (Touvron et al., 2023) and GPT-3 (Brown et al., 2020) all use 128. MuxServe allocates KV cache at head-wise granularity and accommodates the KV cache of different LLMs in a unified space to share the memory. To reduce redundancy, the second partition stores a single replica of LLM weights that can be shared among prefill and decoding jobs. The final partition reserves space for activations, which is utilized during inference runtime.

MuxServe adopts a unified KV cache instead of reserving separate KV cache for each LLM. This shared cache enables MuxServe to dynamically adjust the cache allocation during runtime with minimal overhead. As a result, MuxServe can handle bursty and changing workloads better. Notably, vLLM (Kwon et al., 2023) proposes Paged Attention to improve memory utilization for single LLM serving, while our unified KV cache addresses a distinct scenario, where multiple LLMs of varying sizes, popularities, and configurations need to share the cache.

Table 1. The number of LLMs to be served in different sizes.

    Scale    4B–8B    8B–21B    21B–41B    41B–70B
    #LLMs    12       4         2          1

4. Evaluation

MuxServe is built atop vLLM (Kwon et al., 2023), an efficient single-LLM serving system based on PyTorch (Paszke et al., 2019), and utilizes NVIDIA MPS (NVIDIA, 2022b) to partition SM resources. In this section, we evaluate MuxServe on both synthetic and real workloads. We also perform ablation studies to verify the effectiveness of individual components.

4.1. Experimental Setup

Cluster. We conduct experiments on a 4-node cluster; each node is equipped with 8 NVIDIA A100 (80GB) GPUs.

Metrics. We use the aggregated throughput as our evaluation metric, since our target is to evaluate GPU utilization. Since different LLMs have different arrival rates, we use the rates to compute a weighted average of throughput.

Baselines. We compare MuxServe with two baselines. The first is the widely used spatial partitioning, which serves each LLM separately on a group of GPUs. We serve each LLM with vLLM (Kwon et al., 2023), a state-of-the-art serving framework. The second baseline is temporal multiplexing similar to AlpaServe (Li et al., 2023). Since AlpaServe does not support multiplexing of multiple LLMs, we implement this baseline ourselves with our unified KV cache. For temporal multiplexing, we colocate LLMs with the placement optimized by our placement algorithm, schedule LLMs with
FCFS in a temporal manner, and batch each LLM with continuous batching.

4.2. End-to-End Results for Synthetic Workloads

Models. LLaMA (Touvron et al., 2023) is the most popular LLM architecture. According to the sizes of the LLaMA models, the LLMs can be divided into 4 size buckets. Table 1 shows the number of LLMs to be served in different sizes.

Workloads. For synthetic workloads, we first generate request rates for each LLM using a power-law distribution with an exponent α, then generate request arrival times with Poisson processes. The requests are sampled from ShareGPT. We vary α and rate scales to evaluate diverse workloads. For each α, we first set the maximal request rate for each LLM to 20 req/s, and then scale up the max rate and average rate for evaluation. The exponent α decides the popularity of LLMs: a larger α means that fewer LLMs are more popular and receive higher rates. Figure 6 shows the LLM traffic distribution as we vary α. Typically, α = 0.9 or 2.1 represents 20% of LLMs receiving 50% or 90% of the request rates, respectively.

Results. Figure 5 shows the throughput and SLO attainment with varying α and average rates. The throughput of MuxServe outperforms the two baselines in all scenarios, achieving up to 1.8× improvement. MuxServe can process up to 2.9× more requests within 99% SLO attainment. When α is small, the request rates are more even, and MuxServe can efficiently colocate prefill-decoding and decoding-decoding jobs to improve utilization. But the interference also brings some overhead, leading to a slightly lower SLO attainment with a small SLO scale. With a larger α, popular LLMs can be colocated with unpopular LLMs to multiplex memory resources, thus achieving a higher throughput with more SMs and larger KV caches. Popular LLMs can process more requests to achieve a higher SLO attainment.

4.3. End-to-End Results for Real Workloads

To evaluate MuxServe on real workloads, we sample LLMs and workloads from the ChatLMSYS trace and rescale the rates to evaluate MuxServe. ChatLMSYS is a web application that serves multiple LLMs of different scales. In this real workload trace, we serve 16 LLMs with 32 GPUs, and 20% popular LLMs get 50% of the request traffic. Figure 7 shows the throughput and SLO attainment under SLO scale 8. MuxServe achieves up to 1.38× and 1.46× higher throughput compared with spatial partitioning and temporal multiplexing, respectively. As we vary the average rates, MuxServe always achieves a higher SLO attainment. When the average rate is 4.8 req/s, several LLMs with medium rates are colocated on a large mesh. Temporal multiplexing cannot efficiently multiplex these LLMs and thus performs quite badly.

4.4. Ablation Studies

In this section, we study the effectiveness of our proposed approaches: the placement algorithm in Section 3.2, the adaptive batch scheduling (ADBS) mechanism in Section 3.3, and the unified resource manager in Section 3.4.

Effectiveness of placement algorithm. To show the effectiveness of our placement algorithm, we conduct a comparison with a greedy algorithm. The greedy algorithm prioritizes the placement of LLMs with high arrival rates and greedily assigns them to the mesh with the largest available free memory. The ablation study is conducted on two scales: 8 GPUs with 4 LLMs, and 16 GPUs with 7 LLMs. For each scenario, 50% of the LLMs are popular and occupy more than 70% of the request traffic. The request arrivals follow a Poisson distribution. As shown in Figure 8, our placement algorithm achieves 1.3× higher throughput compared with the greedy algorithm in the right subfigure.

Effectiveness of ADBS. We compare MuxServe's ADBS to First Come First Serve (FCFS) and Round-Robin in two serving settings to verify its effectiveness (Figure 10). In Figure 10a, the average request length ratio is 2:1:1 for LLaMA-30B, 13B, and 7B, respectively. In Figure 10b, the average request length ratio is 4:1 for LLaMA-65B and 30B, respectively. In both scenarios, the token block usage of ADBS is more closely aligned with the distribution of arrival rates, thus achieving fairer memory resource sharing. ADBS also achieves 1.43× and 1.85× higher throughput compared with Round-Robin and FCFS scheduling, since unfair sharing of KV cache blocks hurts the performance of popular LLMs. In addition, FCFS cannot efficiently multiplex different LLMs.

Effectiveness of resource manager. We study the effectiveness of MuxServe's unified resource manager by gradually enabling computation and memory management. We serve 4 LLMs on 4 GPUs and generate arrival rates using a power-law distribution. Figure 9 shows the throughput and SLO attainment (SLO scale 8) as we vary α. By separating the prefill and decoding phases with computation resource management, the throughput improves by 1.7×. With a unified memory manager, MuxServe achieves another 1.2× higher throughput and improves SLO attainment by 3.6×.

5. Related Work

DL serving systems. A wide variety of deep learning serving systems have been proposed to improve the serving efficiency (Olston et al., 2017; NVIDIA, 2023a). Recent works (Crankshaw et al., 2017; Shen et al., 2019; Gujarati et al., 2020; Zhang et al., 2023; Romero et al., 2021) utilize temporal multiplexing and introduce better batching and scheduling strategies to improve GPU utilization and meet SLO targets. These approaches focus on small DNN
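The synthetic workload construction of Section 4.2 (power-law request rates with exponent α, Poisson arrivals) can be sketched as follows. The rank-based normalization, where the most popular LLM gets the maximal rate, is an assumption for illustration; the paper does not spell out the exact scaling.

```python
import random

def power_law_rates(num_llms, alpha, max_rate=20.0):
    """Rate of the i-th most popular LLM is proportional to (i+1)**(-alpha)."""
    weights = [(i + 1) ** (-alpha) for i in range(num_llms)]
    scale = max_rate / weights[0]          # most popular LLM gets max_rate
    return [w * scale for w in weights]

def poisson_arrivals(rate, duration, rng):
    """Arrival timestamps of a Poisson process with `rate` req/s."""
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate)         # exponential inter-arrival gaps
        if t >= duration:
            return arrivals
        arrivals.append(t)

rng = random.Random(0)
rates = power_law_rates(num_llms=4, alpha=2.1)
trace = {i: poisson_arrivals(r, duration=60.0, rng=rng)
         for i, r in enumerate(rates)}
```

A larger α makes the weight list decay faster, concentrating traffic on the few most popular LLMs, which matches the description of α = 2.1 versus α = 0.9 above.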
[Figure 5: throughput (top row) and SLO attainment (bottom row) of Spatial Partitioning, Temporal Multiplexing, and MuxServe (FlexSM) for α ∈ {2.1, 1.7, 1.3, 0.9, 0.7} as the average rate and the SLO scale vary.]

[Figure 6: cumulative request rate of the top-k models under different α.]

[Figure 8: throughput of our placement algorithm versus the greedy baseline as the average rate varies.]

[Figure 9: throughput and SLO attainment of vLLM, Decouple Prefill & Decoding, and FlexSM.]

…gingface, 2023) incorporate intra- and inter-operator parallelism to accelerate LLM inference on multiple GPUs.

[Figure 10 panels: token block usage (K) over time under FCFS, Round-Robin, and ADBS; annotated usage ratios 46.2%:53.8%, 46.1%:53.9%, and 20.4%:79.6%. (b) Two LLMs Serving.]

Figure 10. Comparison of cache usage of different schedule approaches on 4 GPUs. The relative proportion of token block usage is annotated in the figure. FCFS: First Come First Serve, ADBS: Adaptive Batch Scheduling. (a) Request rates: 2:8:8 req/s. Throughput (req/s): FCFS (3.8), Round-Robin (4.1), ADBS (6.2). (b) Request rates: 1:8 req/s. Throughput (req/s): FCFS (3.2), Round-Robin (4.9), ADBS (6.6).

6. Conclusion

In this paper, we introduce MuxServe, a flexible and efficient spatial-temporal multiplexing system that serves multiple LLMs concurrently. MuxServe colocates LLMs considering their popularity, and colocates prefill and decoding jobs leveraging their characteristics, to improve GPU utilization.

…LLMs, our methods can make LLM-based applications viable.

Environmental sustainability: Given the energy-intensive nature of LLM serving, our research contributes to the reduction of carbon footprints and environmental impact. By optimizing resource utilization and minimizing energy consumption, we promote more sustainable practices in the deployment and operation of LLMs.

Enhanced productivity and user experience: By making LLM serving more efficient, our research has the potential to enhance productivity and user experience in various domains. Faster response times, reduced latency, and improved performance can lead to more effective code completion, better writing assistance, and more interactive and responsive chatbot interactions.

Overall, the broader impacts of our paper lie in advancing the state-of-the-art in LLM serving, making it more efficient, accessible, cost-effective, and environmentally sustainable. These advancements have the potential to positively influence various industries and domains relying on LLM-based technologies, ultimately benefiting end-users and promoting the wider adoption of LLM applications.
…tab, O., Koh, P. W., Krass, M. S., Krishna, R., Kuditipudi, R., et al. On the opportunities and risks of foundation models. CoRR, abs/2108.07258, 2021. URL https://arxiv.org/abs/2108.07258.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.

Choi, S., Lee, S., Kim, Y., Park, J., Kwon, Y., and Huh, J. Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing. In 2022 USENIX Annual Technical Conference (USENIX ATC 22), pp. 199–216, Carlsbad, CA, July 2022. USENIX Association. ISBN 978-1-939133-29-5. URL https://www.usenix.org/conference/atc22/presentation/choi-seungbeom.

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton,

…March 2017. USENIX Association. ISBN 978-1-931971-37-9. URL https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/crankshaw.

Dhakal, A., Kulkarni, S. G., and Ramakrishnan, K. K. GSLICE: controlled spatial sharing of GPUs for a scalable inference platform. In Fonseca, R., Delimitrou, C., and Ooi, B. C. (eds.), SoCC '20: ACM Symposium on Cloud Computing, Virtual Event, USA, October 19-21, 2020, pp. 492–506. ACM, 2020. doi: 10.1145/3419111.3421284. URL https://doi.org/10.1145/3419111.3421284.

Gujarati, A., Karimi, R., Alzayat, S., Hao, W., Kaufmann, A., Vigfusson, Y., and Mace, J. Serving DNNs like clockwork: Performance predictability from the bottom up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp. 443–462. USENIX Association, November 2020. ISBN 978-1-939133-19-9. URL https://www.usenix.org/conference/osdi20/presentation/gujarati.

Han, M., Zhang, H., Chen, R., and Chen, H. Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pp. 539–558, Carlsbad, CA, July 2022. USENIX Association. ISBN 978-1-939133-28-1. URL https://www.usenix.org/conference/osdi22/presentation/han.

Huggingface. Text generation inference. https://github.com/huggingface/
C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, text-generation-inference, 2023.
S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu,
N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B., C. H., Gonzalez, J., Zhang, H., and Stoica, I. Effi-
Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, cient memory management for large language model
G., Yin, P., Duke, T., Levskaya, A., Ghemawat, S., serving with pagedattention. In Flinn, J., Seltzer, M. I.,
Dev, S., Michalewski, H., Garcia, X., Misra, V., Robin- Druschel, P., Kaufmann, A., and Mace, J. (eds.), Pro-
son, K., Fedus, L., Zhou, D., Ippolito, D., Luan, D., ceedings of the 29th Symposium on Operating Sys-
Lim, H., Zoph, B., Spiridonov, A., Sepassi, R., Do- tems Principles, SOSP 2023, Koblenz, Germany, Octo-
han, D., Agrawal, S., Omernick, M., Dai, A. M., Pillai, ber 23-26, 2023, pp. 611–626. ACM, 2023. doi: 10.
T. S., Pellat, M., Lewkowycz, A., Moreira, E., Child, 1145/3600006.3613165. URL https://doi.org/
R., Polozov, O., Lee, K., Zhou, Z., Wang, X., Saeta, 10.1145/3600006.3613165.
B., Diaz, M., Firat, O., Catasta, M., Wei, J., Meier-
Hellstern, K., Eck, D., Dean, J., Petrov, S., and Fiedel, Li, Z., Zheng, L., Zhong, Y., Liu, V., Sheng, Y., Jin,
N. Palm: Scaling language modeling with pathways. J. X., Huang, Y., Chen, Z., Zhang, H., Gonzalez, J. E.,
Mach. Learn. Res., 24:240:1–240:113, 2023. URL http: and Stoica, I. Alpaserve: Statistical multiplexing
//jmlr.org/papers/v24/22-1144.html. with model parallelism for deep learning serving. In
Geambasu, R. and Nightingale, E. (eds.), 17th USENIX
Crankshaw, D., Wang, X., Zhou, G., Franklin, M. J., Symposium on Operating Systems Design and Imple-
Gonzalez, J. E., and Stoica, I. Clipper: A Low-Latency mentation, OSDI 2023, Boston, MA, USA, July 10-
online prediction serving system. In 14th USENIX 12, 2023, pp. 663–679. USENIX Association, 2023.
Symposium on Networked Systems Design and Im- URL https://www.usenix.org/conference/
plementation (NSDI 17), pp. 613–627, Boston, MA, osdi23/presentation/li-zhouhan.
10
Lim, G., Ahn, J., Xiao, W., Kwon, Y., and Jeon, M. learning library. In Wallach, H. M., Larochelle, H.,
Zico: Efficient GPU memory sharing for concurrent Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and
DNN training. In 2021 USENIX Annual Technical Garnett, R. (eds.), Advances in Neural Information
Conference (USENIX ATC 21), pp. 161–175. USENIX Processing Systems 32: Annual Conference on Neural
Association, July 2021. ISBN 978-1-939133-23-6. Information Processing Systems 2019, NeurIPS 2019,
URL https://www.usenix.org/conference/ December 8-14, 2019, Vancouver, BC, Canada, pp. 8024–
atc21/presentation/lim. 8035, 2019. URL https://proceedings.
neurips.cc/paper/2019/hash/
Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Wang, Z., Wong, bdbca288fee7f92f2bfa9f7012727740-Abstract.
R. Y. Y., Zhu, A., Yang, L., Shi, X., Shi, C., Chen, Z., html.
Arfeen, D., Abhyankar, R., and Jia, Z. Specinfer: Ac-
celerating generative large language model serving with Romero, F., Li, Q., Yadwadkar, N. J., and Kozyrakis, C. IN-
speculative inference and token tree verification, 2023. FaaS: Automated model-less inference serving. In 2021
USENIX Annual Technical Conference (USENIX ATC 21),
Miao, X., Shi, C., Duan, J., Xi, X., Lin, D., Cui, B., and Jia, pp. 397–411. USENIX Association, July 2021. ISBN 978-
Z. Spotserve: Serving generative large language mod- 1-939133-23-6. URL https://www.usenix.org/
els on preemptible instances. Proceedings of ASPLOS conference/atc21/presentation/romero.
Conference, 2024.
ShareGPT-Team. Sharegpt. https://sharegpt.
Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., com/, 2023.
Devanur, N. R., Ganger, G. R., Gibbons, P. B., and
Zaharia, M. Pipedream: Generalized pipeline paral- Shen, H., Chen, L., Jin, Y., Zhao, L., Kong, B., Phili-
lelism for dnn training. In Proceedings of the 27th pose, M., Krishnamurthy, A., and Sundaram, R. Nexus:
ACM Symposium on Operating Systems Principles, SOSP a gpu cluster engine for accelerating dnn-based video
’19, pp. 1–15, New York, NY, USA, 2019. Associa- analysis. In Proceedings of the 27th ACM Sympo-
tion for Computing Machinery. ISBN 9781450368735. sium on Operating Systems Principles, SOSP ’19, pp.
doi: 10.1145/3341301.3359646. URL https://doi. 322–337, New York, NY, USA, 2019. Association for
org/10.1145/3341301.3359646. Computing Machinery. ISBN 9781450368735. doi: 10.
1145/3341301.3359658. URL https://doi.org/
NVIDIA. Fastertransformer. https://github.com/ 10.1145/3341301.3359658.
NVIDIA/FasterTransformer, 2021.
Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M.,
NVIDIA. Nvidia mig. https://www. Chen, B., Liang, P., Ré, C., Stoica, I., and Zhang,
nvidia.com/en-us/technologies/ C. Flexgen: High-throughput generative inference of
multi-instance-gpu/, 2022a. large language models with a single GPU. In Krause,
NVIDIA. Multi-process service. https://docs. A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S.,
nvidia.com/deploy/mps/index.html, 2022b. and Scarlett, J. (eds.), International Conference on Ma-
chine Learning, ICML 2023, 23-29 July 2023, Hon-
NVIDIA. Triton inference server: An optimized cloud and olulu, Hawaii, USA, volume 202 of Proceedings of
edge inferencing solution. https://github.com/ Machine Learning Research, pp. 31094–31116. PMLR,
triton-inference-server, 2023a. 2023. URL https://proceedings.mlr.press/
v202/sheng23a.html.
NVIDIA. Tensorrt-llm. https://github.com/
NVIDIA/TensorRT-LLM, 2023b. Sheng, Y., Cao, S., Li, D., Zhu, B., Li, Z., Zhuo, D., Gon-
zalez, J. E., and Stoica, I. Fairness in serving large
Olston, C., Fiedel, N., Gorovoy, K., Harmsen, J., Lao, L., Li, language models. CoRR, abs/2401.00588, 2024. doi:
F., Rajashekhar, V., Ramesh, S., and Soyke, J. Tensorflow- 10.48550/ARXIV.2401.00588. URL https://doi.
serving: Flexible, high-performance ML serving. CoRR, org/10.48550/arXiv.2401.00588.
abs/1712.06139, 2017. URL http://arxiv.org/
abs/1712.06139. Tan, C., Li, Z., Zhang, J., Cao, Y., Qi, S., Liu, Z., Zhu, Y.,
and Guo, C. Serving DNN models with multi-instance
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, gpus: A case of the reconfigurable machine scheduling
J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., problem. CoRR, abs/2109.11067, 2021. URL https:
Antiga, L., Desmaison, A., Köpf, A., Yang, E. Z., //arxiv.org/abs/2109.11067.
DeVito, Z., Raison, M., Tejani, A., Chilamkurthy,
S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux,
Pytorch: An imperative style, high-performance deep M., Lacroix, T., Rozière, B., Goyal, N., Hambro, E.,
11
Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and URL https://www.usenix.org/conference/
Lample, G. Llama: Open and efficient foundation osdi22/presentation/yu.
language models. CoRR, abs/2302.13971, 2023. doi:
10.48550/ARXIV.2302.13971. URL https://doi. Yu, P. and Chowdhury, M. Salus: Fine-grained GPU shar-
org/10.48550/arXiv.2302.13971. ing primitives for deep learning applications. CoRR,
abs/1902.04610, 2019. URL http://arxiv.org/
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., abs/1902.04610.
Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I.
Attention is all you need. In Guyon, I., von Luxburg, U., Zhang, H., Tang, Y., Khandelwal, A., and Stoica, I. SHEP-
Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, HERD: Serving DNNs in the wild. In 20th USENIX
S. V. N., and Garnett, R. (eds.), Advances in Neural Symposium on Networked Systems Design and Imple-
Information Processing Systems 30: Annual Conference mentation (NSDI 23), pp. 787–808, Boston, MA, April
on Neural Information Processing Systems 2017, 2023. USENIX Association. ISBN 978-1-939133-33-5.
December 4-9, 2017, Long Beach, CA, USA, pp. 5998– URL https://www.usenix.org/conference/
6008, 2017. URL https://proceedings. nsdi23/presentation/zhang-hong.
neurips.cc/paper/2017/hash/ Zhao, Y., Liu, X., Liu, S., Li, X., Zhu, Y., Huang, G.,
3f5ee243547dee91fbd053c1c4a845aa-Abstract. Liu, X., and Jin, X. Muxflow: Efficient and safe GPU
html. sharing in large-scale production deep learning clusters.
Wang, G., Wang, K., Jiang, K., Li, X., and Stoica, CoRR, abs/2303.13803, 2023. doi: 10.48550/ARXIV.
I. Wavelet: Efficient DNN training with tick-tock 2303.13803. URL https://doi.org/10.48550/
scheduling. In Smola, A., Dimakis, A., and Stoica, I. arXiv.2303.13803.
(eds.), Proceedings of Machine Learning and Systems Zheng, L., Li, Z., Zhang, H., Zhuang, Y., Chen, Z., Huang,
2021, MLSys 2021, virtual, April 5-9, 2021. ml- Y., Wang, Y., Xu, Y., Zhuo, D., Xing, E. P., Gonza-
sys.org, 2021a. URL https://proceedings. lez, J. E., and Stoica, I. Alpa: Automating inter- and
mlsys.org/paper/2021/hash/ intra-operator parallelism for distributed deep learning.
c81e728d9d4c2f636f067f89cc14862c-Abstract. In Aguilera, M. K. and Weatherspoon, H. (eds.), 16th
html. USENIX Symposium on Operating Systems Design and
Wang, X., Xiong, Y., Wei, Y., Wang, M., and Li, L. Implementation, OSDI 2022, Carlsbad, CA, USA, July
Lightseq: A high performance inference library for 11-13, 2022, pp. 559–578. USENIX Association, 2022.
transformers. In Kim, Y., Li, Y., and Rambow, O. URL https://www.usenix.org/conference/
(eds.), Proceedings of the 2021 Conference of the osdi22/presentation/zheng-lianmin.
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies: Industry Papers, NAACL-HLT 2021, Online, June
6-11, 2021, pp. 113–120. Association for Computa-
tional Linguistics, 2021b. doi: 10.18653/V1/2021.
NAACL-INDUSTRY.15. URL https://doi.org/
10.18653/v1/2021.naacl-industry.15.
Xiao, W., Ren, S., Li, Y., Zhang, Y., Hou, P., Li, Z.,
Feng, Y., Lin, W., and Jia, Y. AntMan: Dynamic
scaling on GPU clusters for deep learning. In 14th
USENIX Symposium on Operating Systems Design and
Implementation (OSDI 20), pp. 533–548. USENIX As-
sociation, November 2020. ISBN 978-1-939133-19-9.
URL https://www.usenix.org/conference/
osdi20/presentation/xiao.
Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., and Chun, B.-
G. Orca: A distributed serving system for Transformer-
Based generative models. In 16th USENIX Sympo-
sium on Operating Systems Design and Implementa-
tion (OSDI 22), pp. 521–538, Carlsbad, CA, July
2022. USENIX Association. ISBN 978-1-939133-28-1.
12