2404.18322v1

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 15

BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models

Jiamin Li∗† Le Xu†


City University of Hong Kong The University of Texas at Austin
Hong Xu Aditya Akella
The Chinese University of Hong Kong The University of Texas at Austin
arXiv:2404.18322v1 [cs.DC] 28 Apr 2024

Abstract deployment of LLMs is an arduous endeavor, often necessi-


tating the use of hundreds or even thousands of extremely
The growing demand for Large Language Models (LLMs)
expensive computation devices like GPUs, thus imposing
across diverse applications has prompted a paradigm shift in
great challenges on unleashing LLM’s full potential for more
the design of deep learning serving systems. Deploying LLMs,
people [3, 36].
especially in multi-tenant environments, presents consider-
able challenges due to their high computational and memory With the growing reliance on LLMs, fine-tuning has
demands. We present BlockLLM, a serving system that ex- emerged as a critical technique to efficiently adapt foun-
ploits the potential of sharing components among fine-tuned dation models to handle various tasks across different do-
LLM models to offer an efficient and flexible solution for mains [9, 16]. Fine-tuning updates existing parameters or
LLM workloads. BlockLLM partitions the models into finer- introduces additional parameters using domain-specific data.
grained blocks to enable the reuse of model components and As LLMs become ubiquitous, serving these fine-tuned mod-
independent provisioning to improve the computation effi- els from a multi-tenant cluster, whether in a private cloud for
ciency. BlockLLM consists of an offline block zoo, for storing first-party services or a public cloud for third-party users, is
the blocks, and an online system to serve the requests through starting to emerge as an important consideration.
chains of blocks. It offers multi-fold flexibility: (1) Adaptive Recent serving systems and approaches focus on model-
assembly of block chains on-the-fly is achieved with the help internal optimization, such as the acceleration of matrix com-
of equivalence evaluation among blocks in the zoo. (2) We en- putations [2, 8] and minimizing device memory fragmenta-
able per-block batch size and configure best-effort KV cache tion [18] to enable parallel decoding to circumvent the limita-
coordination at individual block level. (3) We adopt specula- tions imposed by auto-regressive models that process a single
tive execution and locality-aware block placement to mitigate token at a time [11, 26]. While these advancements signifi-
the communication costs from dynamic block resource allo- cantly accelerate the inference process for individual LLMs,
cation. Our evaluation demonstrates that BlockLLM reduces the challenge remains to serve multiple tenants’ fine-tuned
memory and storage footprints and improves computation models in a shared cluster. Each LLM requires substantial
efficiency, outperforming existing serving approach in 95%ile resources, and the hardware cost makes it prohibitive to lib-
latency and GPU utilization by 33.5% and 20.1%, respec- erally expand the cluster to add dedicated resources for each
tively. tenant. Thus, the question is: how can LLM serving meet
the latency performance goals of tenants’ fine-tuned models
while also ensuring high throughput and optimal utilization
1 Introduction cluster-wide?
We introduce BlockLLM, an efficient and flexible serving
The rise of Large Language Models (LLMs) marks a trans- system tailored for LLM workloads in a multi-tenant cluster.
formative milestone in deep learning. Their unprecedented BlockLLM enables LLM model component reuse by break-
abilities in natural language processing [28], from language ing down an LLM into smaller and shareable components
translation [6] to question-answering [1] and complex reason- which we call “blocks”. The central observation forming the
ing [20, 40], are reshaping technological frontiers. Yet, the basis for BlockLLM is that LLM fine-tuning offers the op-
strength of LLMs—rooted in their vast parameter spaces— portunity for sharing. Parameter-efficient fine-tuned models
comes at the price of significant computation costs [5]. The have an inherent modular architecture where parameters from
∗ Work done at The University of Texas at Austin specific layers are added or altered. Our observations with
† Equal Contribution full-parameter fine-tuned models indicate that certain model

1
Offline Block Zoo Online Serving
request can be adaptively adjusted online.
Embedding a. Evaluate b1 b5 Dispatch Request R
Equivalence b2 b6
Second, to manage the KV cache generated by different
Transformers Agent Agent Agent
b. Lazy
b3
b3
requests, BlockLLM implements a principled approach to dis-
LM Head b1 b2 b3 b6
LLMs Partition b4 R R R Sb3 R patch requests. Per-block configuration introduces complexity
b7
Profiler c. Profile R R R
to the handling of the KV cache, particularly as requests might
d. Establish Relationship
e. Train Stitch Block Equivalent
KV Cache KV Cache KV Cache be allocated to devices that do not currently hold the corre-
Stitch1 (4096, 5120) b1 [b5, ...] Schedule R b1-b2-b3-b4 Chain of Blocks sponding cache, necessitating either cache communication or
Stitch2 (5120, 6400)
b4 [b7] Blocks Scheduler R R R R recalculation. We analyze the trade-off between the recalcula-
Figure 1: System architecture. bx are blocks. ‘R’ are requests. SB3 is the tion and new IO costs and use it to compute different priorities
surrogates of Block b3 . for different candidate block instances, including the blocks
components produce outputs with high similarity, suggesting specified in the chain and its equivalent blocks. We adopt a
their potential for sharing as well. Block sharing reduces both best-effort KV coordination policy where devices possessing
memory and storage demands. Furthermore, by reallocating the KV cache are prioritized when dispatching the request.
memory - previously consumed by redundant parameters - Third, to address the inherent latency costs posed by inter-
toward supporting larger data batch size, BlockLLM achieves block communication, BlockLLM integrates speculative ex-
higher computation efficiency and overall throughput. ecution for identified bottleneck blocks, thereby processing
BlockLLM consists of an offline “block zoo”—a reposi- potential request pathways ahead-of-time. We use surrogate
tory of LLM blocks, and an online serving system capable of models to predict the outputs of blocks, facilitating compu-
dynamically scheduling blocks to serve various applications tation ahead of actual output. In parallel, we implement a
(Figure 1). BlockLLM’s practice brings three opportunities. locality-aware strategy for block placement, putting interde-
(1) BlockLLM can support adaptive serving, which enables pendent blocks nearby, preferably within the same server.
on-the-fly assembly of blocks for each request, moving be- This approach reduces the frequency of costly inter-server
yond the traditional static chain of blocks. (2) Each block communications and leverages the high-speed intra-server
can be individually configured, allowing for customized batch connections to expedite data transfer and reduce inference
sizes and autonomous handling of queued requests. (3) Block- latency.
LLM enables dynamic resource allocation, with each block We summarize our contributions as follows:
capable of independently scaling and being placed without • We leverage the LLM fine-tuning characteristic and demon-
the constraints of model-level boundaries. strate the effectiveness of finer-grained LLM serving in
Several challenges arise in realizing BlockLLM with full terms of reducing memory and storage requirements and
flexibility. First, the feasibility of adaptive serving hinges on optimizing computation resource utilization.
an in-depth understanding of the equivalence among blocks. • We construct a block zoo that stores LLMs in finer granu-
This necessitates a mechanism to establish inter-block rela- larity to enable block reuse and establishes the relationship
tionship within the block zoo. Moreover, LLMs store inter- among these blocks to facilitate adaptive online serving.
mediate results in KV cache [31] to avoid recalculation due • We build an online serving system to improve cluster
to their auto-regressive nature. The independent configura- throughput. It strategically coordinates the KV cache by
tion of each block leads to overlapping lifecycles of different best-effort request dispatching and minimizes communi-
requests, demanding stateful coordination to maintain the co- cation overhead with bottleneck-pinpointing speculative
herence of the KV cache across concurrent requests. Lastly, execution and locality-aware placement.
resource allocation without model-level boundaries brings ex- • We implement and evaluate BlockLLM, a multi-tenant serv-
tra inter-block communication, thereby imposing additional ing system. We evaluate BlockLLM in a 12-A100 cluster.
latency costs that are not easily mitigated. In BlockLLM, we BlockLLM achieves comparable median latency with tra-
propose integrated solutions for these challenges. ditional per-model provisioning and reduces the 95%ile
First, to ascertain the equivalence of blocks within Block- latency by 33.5%. The overall average throughput is ele-
LLM, we consider two methods subject to block architecture. vated by 1.71x. GPU SM efficiency is improved by 20.1%.
For blocks of the same architecture that naturally support
seamless request routing, cosine similarity metrics are utilized 2 Background and Motivation
to measure parametric differences. For blocks of disparate
embedding sizes, the challenge is more intricate. We compare
2.1 Background
the output distribution of vocabulary probabilities,
which serves as an indicator for functional equivalence LLM serving. LLM-based applications have grown to be one
between transformer layers. Inspired by the techniques of of the major workloads in GPU clusters. Given their colossal
merging computer vision models [30], we design a general- scale, the burden of serving these models has emerged as a
izable stitching block as an intermediary to route requests critical issue. In the offline stage, large storage is required
among these blocks so that the chain of blocks to serve each to store the model checkpoints and engines. In the online

2
Cosine Similarity
Prefix-tuned LoRA Blocks 0.9994 0.99990
0.9992 0.99985
Embedding Prefix Embedding Embedding Attention LoRA 0.9990
Attention Attention LoRA 0.9988 0.99980
FFN Attention 0.99975
FFN FFN 0.9986
LM Head 0.9984 0.99970
LM Head LM Head Prefix 0 4 8 12 16 20 24 28 32 0 4 8 12 16 20 24 28 32
Layer Index Layer Index
Figure 2: Example of blocks for two models fine-tuned from the same
foundation. We show one Transformer layer for simplicity. (a) LLaMA 7B and Vicuna 7B. (b) LLaMA 7B and Alpaca 7B.

stage, numerous powerful computation units (e.g., GPU de- Figure 3: Cosine similarity of parameters in each Transformer layer of two
vices) with large memory spaces are necessary to load and models.
execute the models [31, 35]. LLMs’ unique auto-regressive
nature necessitates caching KV matrices of tokens on de- sine similarity of each Transformer layer between LLaMA
vices to avoid redundant recalculations [32]. Every request and Vicuna/Alpaca (FF from LLaMA). The average simi-
has its own KV cache, which progressively grows in length as larity is 0.9927. This observation underpins the feasibility
the auto-regressive process continues. Moreover, when mod- of reusing blocks across different FPFT models, even when
els are inevitably deployed in a distributed manner, a high- parameter values differ. Enabling the reuse of model com-
capacity network is indispensable to transfer the intermediate ponents would improve resource efficiency by letting device
activations. Such a burden is particularly pronounced in multi- memory hold more critical data.
tenant clusters, which cater to diverse LLM applications, each Opportunities. BlockLLM consists of an offline block zoo
with its dedicated models and performance objectives. (§4) and an online serving system (§5). We enable block
Foundation models and fine-tuning. LLM training follows reuse in block zoo and establish connections among blocks.
transfer learning as a guiding principle. A foundation model In the serving phase, BlockLLM deploys the blocks onto
is initially pre-trained on a broad spectrum of general data, the cluster and assembles a chain of blocks to serve each
which is then customized for specific applications through request. Each block instance is provisioned independently.
fine-tuning. This process involves introducing additional pa- Block-granularity provision brings three significant opportu-
rameters or modifying existing ones within the model using nities for enhancing serving flexibility.
task-specific datasets. Fine-tuning entails updating either the
entire model, known as full-parameter fine-tuning (FF) [4], or O1: Adaptive serving. Deployment and serving can be highly
altering a small subset of additional or modified parameters, adaptive. Blocks allow for dynamic assembly into a variety
referred to as parameter-efficient fine-tuning (PE) [12,25]. PE of model configurations. Serving a request is not constrained
examples include LoRA [17], BitFit [43], Adapter [13], and to a predetermined chain of blocks. Instead, it can be dynami-
Prefix-Tuning [21]. cally adjusted in real-time, based on the deployment state, to
improve efficiency. The key to realizing this fully adaptive
Problem statement. The wide adoption of LLM necessi-
serving hinges on accurately determining whether the blocks
tates an efficient multi-tenant serving system tailored for such
are functionally equivalent (§4, §5.3).
workloads. Beyond fulfilling latency requirements, our goal
is to elevate throughput and resource utilization in a cluster O2: Per-block configuration. Each block can have its unique
where diverse LLM applications are present. configuration. First, the batch size of each block could be indi-
vidually configured. Intuitively, a block shared by multiple ap-
plications should have a larger batch size to efficiently handle
2.2 Finer-grained Serving the collective load. Second, each block maintains its own re-
Key idea. To enhance the efficiency of LLM serving systems, quest queue and manages the request state independently, thus
we advocate a finer-grained approach. Rather than treating autonomously deciding how to process the requests. A criti-
each model as an indivisible unit, we propose BlockLLM, cal consideration in this configuration is how auto-regressive
which partitions LLMs into smaller components provisioned models can be integrated into this paradigm, where the KV
independently. These components are envisioned as “blocks”. cache should be managed in a cohesive approach (§5.1).
We could maintain only one copy of a block even if it is used O3: Dynamic resource allocation. BlockLLM enhances re-
by multiple models. The immediate benefit is the decreased source allocation with its ability to independently scale each
demand for memory and storage. block. Blocks that are frequently accessed or computationally
PEFT models are particularly well-suited to this approach intensive can dynamically be allocated with more resources.
due to their inherent modular architecture. For example, as It also liberates block placement from the constraints associ-
illustrated in Figure 2, two models could be broken down ated with monolithic model architectures. However, it does
into five blocks, with blue blocks being common components introduce a challenge in the form of increased communication
shared between them. Moreover, our analysis of FPFT models costs. Effectively mitigating these costs is crucial to achieving
indicates that blocks with varied parameters often produce truly versatile resource allocation within BlockLLM (§5.2,
highly similar outputs. Figure 3a presents the parameter co- §5.3).

3
Attention No. of Per-model Provision BlockLLM
Model PE % of Shared Params. WQ L Q WK L K WV Apps
LoRA (Green) 99.94 Latency Throughput Utilization Latency Throughput Utilization
LLaMA 7B Adapter (Purple) 92.62 Q PK K PV V 3 5.4 6.6 64.4% 4.6 8.9 76.5%
Prefix (Orange) 99.88 6 7.6 8.2 75.1% 5.2 12.3 79.4%
LoRA 99.52 Adapter 9 14.4 7.2 76.6% 9.9 14.5 85.3%
GPT-NeoX-20B Adapter 92.52
Feed Forward 12 17.6 6.9 74.4% 12.4 15.6 89.6%
Prompt 99.98
Adapter Table 2: Comparison of average latency (mins/request), throughput (to-
Table 1: Percentage of shared parameters of differ- Figure 4: Architec- kens/second), and GPU utilization (SM efficiency) when the number of
ent PEFT techniques. applications increases. GPU utilization is measured until the last request is
tural changes of PE. completed.
Redundancy (%)

100 90

Overhead (%)
98 Redundancy 88
96 Switch Overhead
86
16.5% with 20 applications.
94 84 Improved computation efficiency. Serving each model
92 82 within a standalone engine presupposes a complete execu-
90 80
1x2 3x3 3x5 4x5 tion cycle. As such, per-model provisioning prevents the shar-
No. of Applications ing of computations among models, even when they contain
Figure 5: Redundancy and switching overhead in the existing serving system.
x×y indicates we have x foundation models and y models fine-tuned from identical components. It is inefficient in a multi-tenant clus-
each foundation. Therefore, xy models in total. ter where many different applications are served. Table 2
shows the average latency, throughput, and GPU utilization
2.3 Comparison with Existing Work of a small cluster of four GPUs where 40 requests from three
applications are served. When using per-model provisioning,
Existing serving systems. Existing serving systems for LLM the throughput and GPU utilization are low because the batch
workload follow the practice of serving traditional DNNs. size is small, and the devices hosting less popular applications
Each model operates within a dedicated engine and is provi- are severely under-utilized. Alternatively, with BlockLLM,
sioned as a monolithic unit. An alternative approach is multi- we could achieve a 34.8% higher throughput. The gain pri-
tasking. A typical example is S-LoRA [34], which exploits marily stems from three aspects: (1) the reduced overhead
memory paging to pack multiple LoRAs within one engine. by loading smaller model pieces, (2) higher computation ef-
Another line of optimizations focuses on the model internals— ficiency due to larger batch sizes for shared blocks (O2), (3)
like accelerating Attention layer computations (FlashAtten- effective scaling of popular blocks onto the idle computa-
tion [8]), managing KV cache (PagedAttention [18] in vLLM), tion resources originally reserved by less popular applications
and efficiently scheduling varying sequence lengths of re- (O3). We also measure the performance improvement when
quests (Orca [42]). These efforts are complementary to Block- the number of applications the cluster hosts grows from 3
LLM. BlockLLM tackles the inefficiency of serving LLM to 12 and the number of requests grows proportionally. The
workloads in a multi-tenant cluster by leveraging the charac- throughput improvement of 12 applications is 1.91x of the
teristics of LLM fine-tuning. Block-granularity provisioning throughput improvement when hosting three applications.
improves the serving efficiency from two perspectives.
Reduced memory and storage. Models fine-tuned from the
same foundation model inherit a significant number of shared 3 BlockLLM Design
parameters. We detail various fine-tuning techniques and the
extent of parameter modification in Table 1 and illustrate the We provide an overview of BlockLLM and explain its three
architectural changes in Figure 4 with shared parameters de- design challenges.
picted as gray boxes. Existing serving systems ignore this
aspect, leading unfortunately to the redundant storage of iden- 3.1 System Overview
tical model pieces. In Figure 5, we present the percentage of
the model parameters being redundant. With more diverse Architecture. Figure 1 depicts the system architecture of
applications served in a cluster, the size of redundancy is BlockLLM. In the preparatory offline phase, BlockLLM hosts
more significant. When the cluster hosts 15 LLM-based ap- a repository—termed “block zoo”—that houses the LLMs in
plications that are adapted from three foundation models (3rd blocks. It is equipped with an automated mechanism that par-
bar), the 92.1% redundancy takes up to ∼147GB. In terms of titions the models into blocks and ascertains the equivalence
memory-related operations, this redundancy becomes particu- among existing blocks. Additionally, the block zoo integrates
larly instrumental, where models are frequently swapped in a profiler to record the performance metrics and trade-offs
and out in response to requests from different applications. A pertinent to each block. In the real-time online phase, Block-
considerable portion of the switching overhead (85.4% with LLM has a scheduler that orchestrates resource allocation and
20 applications) is wastefully spent on unnecessary replace- placement of blocks and processes the requests. It strategically
ment of the same parameters. With BlockLLM, the switching schedules blocks onto devices, denoted as “block instance”.
overhead is reduced to at most 12.2% with 9 applications and A BlockLLM agent resides on every device in the cluster. It

4
Block A1 Block B 20
R T 16 Computation

Time (s)
Model 12 Inter-server Comm
T R T
R 8 Intra-server Comm
Block A2 4
Model No KV cache
T 0
R (8, 256) (16, 256) (16, 512) (32, 512)
(Batch size, Sequence length)
(a) Per-model provision. (b) Block provision. Figure 7: Components of inference latency of generating one token using
Figure 6: Illustration of coordinating KV cache when using block-granularity LLaMA 7B in BlockLLM.
provisioning singular, specific KV cache for the batch they are processing.
maintains and monitors the block instances and the associated However, BlockLLM introduces a more complex scenario
request queues. It is responsible for handling the requests, where each block is configured independently. Each block in-
including managing the KV cache and transferring outputs to stance may concurrently process multiple request batches over
other block instances. time, each generating its distinct KV cache. This interleaved
Workflow. During serving, the workflow is as follows. When processing approach, coupled with the auto-regressive nature
a new request arrives, the scheduler assigns a chain of blocks of the tasks, implies that a request batch requiring subsequent
based on the application it belongs to. The scheduler initially iterations needs access to its original KV cache. However,
determines whether there is an available instance of the first there’s no guarantee that the same device that holds the KV
block in the chain. If available, the request is forwarded to the cache will be available to process additional iterations of that
target block instance. Otherwise, the scheduler either identi- batch. Figure 6 illustrates this challenge, showcasing how the
fies a suitable alternative among the existing equivalent block blue request could encounter the problem of no available KV
instances or deploys a new block instance on an available cache if dispatched to Block instance A2 under BlockLLM’s
device. Upon receipt of the request, the device’s agent directs provision choice.
it to the designated block instance for processing. Then, once C3: Mitigate communication overhead. Versatile resource
the execution is completed, the agent is responsible for di- allocation of finer-grained blocks provides several benefits for
recting the request through the subsequent block in the chain. online serving, as mentioned in §2.2. Yet BlockLLM incurs
The process ends with the completion of the request signified additional non-negligible communication overhead. Specif-
by an EOS token, whereupon the final agent relays the output ically, each time a request batch is directed to a different
back to the scheduler, thus concluding the response cycle of device, it necessitates intra-server or inter-server communi-
the request. cation. This communication might act as a bottleneck for the
batch, directly contributing to increased latency. The com-
3.2 Design Challenges munication overhead also linearly grows with the number of
block instances passed by a request batch during its lifecycle.
We explain three design challenges in BlockLLM, which are Furthermore, the volume of data transferred escalates with
briefly mentioned in §2.2. the use of larger batch sizes per block. Figure 7 breaks down
C1: Evaluate different-sized block equivalence. To achieve the latency components involved in generating one new token
adaptive serving, establishing the equivalence between blocks across various request batches. Without any optimization to
is necessary so that chains of blocks for each request can be the communication, for a request batch of size 32 and length
adjusted on the fly. Block-granularity model zoo offers im- 512, the actual computation constitutes 62.9% of the total
proved manageability compared to full models due to their latency, whereas the inter-server communication accounts for
significantly smaller parameter size. Nonetheless, establish- 36.4%.
ing such equivalence remains a complex issue. Despite the
widespread adoption of the Transformer architecture in most
existing LLMs, they often vary in size-related parameters like 4 Block Zoo
the embedding size. While metrics such as cosine similarity or
L2 norm can be employed to assess the equivalence between In this section, we mainly answer three questions:
blocks of identical architecture, evaluating blocks of differing • How to evaluate equivalence among model components?
architectures for equivalence is far from straightforward. Fur- (C1, §4.1)
thermore, even with an understanding of their similarities, it • How to partition LLMs into blocks? (§4.2)
is not feasible to directly route a request to a block that differs • How to handle request routing between two equivalent
in embedding size. blocks with different embedding sizes? (C1, §4.3)
C2: Coordinate stateful KV cache. In BlockLLM, sophis-
ticated stateful serving is indispensable for auto-regressive 4.1 Equivalence
LLMs. Existing serving systems operate by dedicating each
model instance to process a single batch of requests until PE models inherently possess equivalent parameters for
completion. This approach allows GPU devices to retain a reusing. We thus focus on verifying if equivalence exists

5
Model A Model B

Cosine Similarity
Model A Model B 1.000

Cosine Similarity
1.00
Head LayerLL[0, i) Block
Layer [0,Ci) Head Block A Block C 0.995 0.99
Stitching 0.98
BlockLLM 0.990
Stitching Point 0.97
Tail Block B = Block D 0.985 0.96
Tail Layer [i, L) = Layer [i, L)
0.980 0.95
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 8 9
Figure 8: Routing requests when Figure 9: Stitch blocks in different Layer Group Index Layer Group Index
equivalence is found. embedding sizes.
(a) LLaMA 7B and LLaMA 13B. (b) LLaMA 13B and LLaMA 33B.

among FF model components. Our goal is to find such equiva- Figure 10: Output cosine similarity Transformer layer groups of two models.
Block Layer Equivalence
lence so that some of the FF model components can be shared
① Initially ② Partition ③ Model ④ Partition
as well once they are partitioned into blocks. We consider two Att 1 Att 1 Att 1 = Att 1 AttLoRA 1 AttLoRA 1 Att 1 Att 1
FFN 1 FFN 1 FFN 1 FFN 1 FFN 1 =
scenarios here. Att 2 Att 2 AttLoRA 2
FFN 1 FFN 1
FFN 2 FFN 2 Att 2 Att 2 FFN 2 AttLoRA 2 Att 2 Att 2
Models with an identical architecture. First, when two mod- Foundation Full-param FFN 2 FFN 2 LoRA FFN 2
FFN 2
els share an identical architecture, we can establish the equiv-
alence to optimize request routing. For instance, if tail layers Figure 11: An example of BlockLLM’s lazy partitioning.
(Layer i to L) are determined to be equivalent, as depicted in Transformer layer into vocabulary probabilities and compute
Figure 8, we have the flexibility to route requests from head the cosine similarity of the probability as an indicator of the
layers of Model B to tail layers of Model A, which becomes degree of equivalence. In Figure 10, we have shown the out-
particularly useful when the tail layers of Model B is experi- put cosine similarity of Transformer Layer groups partitioned
encing high load. This strategy is also beneficial in conserving from different-sized models. The average similarity is 0.9841.
resources and reducing loading overhead when no instance of
the tail layer of Model B is actively deployed.
In this case, the degree of equivalence between model com- 4.2 Model Partitioning
ponents can be systematically quantified. In BlockLLM, we
employ the canonical metric, cosine similarity, to evaluate the With the evaluated equivalence, BlockLLM incorporates an
parametric similarity of each Transformer layer. The similar- automated mechanism that breaks down LLMs into blocks.
ity is defined as the weighted average of cosine similarities Each block then becomes an independent engine during serv-
of all constituent parameters within each Transformer layer. ing.
That is, for layer Ai and Bi is: Principles of constructing block zoo. BlockLLM follows
two guiding principles: (1) avoid over-partitioning, (2) the
n n preservation of architectural integrity. In this way, BlockLLM
Eq(Ai , Bi ) = ∑ (s(Aip ) × cos(Aip , Bip ))/ ∑ s(Aip ), can reuse the blocks with minimal additional effort.
p=1 p=1
The rationale for reducing redundancy is to prevent unnec-
p
where n is the number of parameters in Ai , s(Ai ) denotes the essary duplication. If one model component has no variant,
p p
number of values in the p-th parameter in Ai , and cos(Ai , Bi ) it is meaningless to over-partition it into excessively small
is the cosine similarity between the p-th parameters of Ai and blocks. Such miniature partitioning is inefficient both in terms
Bi . A higher value indicates a high similarity or equivalence of computational power within each block and due to the in-
between two Transformer layers. creased overhead from data transfer between blocks.
Models with different embedding sizes. Checking equiv- The second principle directs BlockLLM to partition the
alence among model components with different embedding model only at clear architectural boundaries. Based on PE
sizes offers more flexibility in the sense that requests can and FF techniques we have adopted [4,12,25,37], BlockLLM
potentially be routed to a smaller model component. It helps sets the following network components as the finest-grained
in balancing load and reducing latency, as processing through network components that can compose a block: attention,
a smaller yet equivalent model component is faster. f f n, embedding, lm_head. This approach ensures that the
Since size differences are common for models adapted from output of one block seamlessly serves as the input to another,
different foundation architectures, we focus on foundation facilitating a streamlined chain of blocks for each request.
model parameters. For example, the embedding sizes of 7B BlockLLM avoids intricate arithmetic operations between
and 13B LLaMA are 4096 and 5120, respectively. Parametric blocks, simplifying the challenge of assembling them during
cosine similarity is no longer applicable here anymore. Resort- the serving process. An illustrative example is the applica-
ing to methods that enforce uniformity of dimensions, such tion of LoRA to two Linear operations within an Attention
as dimensional reduction, is impractical due to the extensive layer. Isolating LoRA as a separate block would lead to the
data requirements such processes entail. partitioning of the Attention layer into five blocks, requiring
In BlockLLM, we leverage the output vocabulary proba- BlockLLM to sum up the outputs of the LoRA block and the
bilities as a means to ascertain such equivalence. By feeding foundation Attention block. Therefore, in BlockLLM, each
the layers with the same data, we convert the output of each Attention with LoRA layer should be a separate block.

6
Lazy partitioning. BlockLLM employs a lazy partitioning Stitching Block GPU hours LM Head Cosine Similarity
strategy for all fine-tuned models utilizing a foundation model, (4096, 5120) 4.33 0.9732
slicing models as needed without preset restrictions on the (5120, 4096) 4.84 0.9683
size or architecture of the blocks. For PE models, it is par- (4096, 8192) 6.32 0.9798
(5120, 8192) 5.85 0.9612
titioned into two blocks, the foundation model and its PE
Table 3: Costs of training a stitching block in different sizes on A100 GPUs
adapter. and output similarity with the large model.
As for FF models, after BlockLLM evaluates equivalence
between model components ( §4.1), we perform lazy parti- foundation model. During training, it is initially placed at a
tioning. As shown in Figure 11, for a FF model adapted from shallow stitchable layer and progressively moved to deeper
the foundation model (step 1 ), BlockLLM sets a threshold ones if there is any. This approach follows the idea of reusing
T and group all connected network components (e.g., trans- layers in [22, 41]. Table 3 shows the training costs and the
former layer 1 from the fine-tuned model and the foundation) performance of the stitched model compared with the large
that have equivalence exceeding T into a block (step 2 ). This model.
action results in 4 blocks, with two blocks being considered The underlying hypothesis of our approach is rooted in the
equivalent. When a newly added PE model (e.g., LoRA) ar- Transformer-based architecture of LLMs: the output embed-
rives (step 3 ), BlockLLM (1) preserve its PE adapters (e.g., ding from any Transformer layer can be interpreted as natural
AttLoRA1, AttLoRA2), (2) find all existing blocks that share language. This characteristic presents the opportunity to engi-
parameters (e.g., FFN1, FFN2) with the new model and (3) neer a stitching block with the unique capability of translating
attempts to partition the existing blocks to enable block reuse. the output embedding from one dimension to another. This
LoRA here has a different Attention layer. Therefore, we strategy allows the integration of varied block sizes.
further partition the foundation model into four blocks so
that they share the FFN layers in separately-provision blocks.
This action results in 9 blocks with two chain-of-blocks being 5 Online Serving
equivalent.
It is important to note that BlockLLM uses blocks as logical We now present BlockLLM’s online serving system. We first
basic units of resource provisioning, and the granularity of explain how BlockLLM deals with stateful KV cache (C2,
blocks does not necessarily introduce overhead during serving §5.1). Then, we discuss BlockLLM’s two strategies to miti-
time: BlockLLM considers overhead of pipelining blocks gate the impact of communication overhead (C3, §5.2, §5.3).
at fine-granularity and could choose to allocate consecutive
chain of blocks in a single agent. The engine executes these
blocks as a single network sub-component. 5.1 KV Cache Coordination
Memory bandwidth-bound KV cache. Efficient stateful
4.3 Stitching Block coordination of the KV cache is critical for auto-regressive
LLMs in BlockLLM because memory bandwidth constraints
Now that BlockLLM has partitioned LLMs into blocks, we on the KV cache emerge as a significant bottleneck, as evi-
tackle the third question. When two blocks of different em- denced by plenty of studies. Existing systems serve one batch
bedding sizes are considered equivalent, such as Blocks B of requests at a time and thus only consider the trade-off be-
and D illustrated in Figure 9, how to route requests originally tween recalculating the KV matrices and the cost of caching
intended for, say Block B to D, becomes a unique challenge. the KV matrices in the device memory. This trade-off reaches
Stitching blocks. We propose to use a stitching block, a a point— determined by factors such as device type, model
concept that has been recently explored [30]. We follow the architecture, and request sequence length—where caching be-
principle of fine-tuning LLMs to obtain a stitching block. comes more efficient than recalculation. However, as request
However, training a unique stitching block for each potential sequences become longer, memory bandwidth constraints in-
block pair is impractical. Thus, our objective is to design a creasingly invariably become a performance-bounding factor
generalizable stitching block. When multiple equivalences when loading the KV cache [18].
are identified between blocks originating from the same two I/O and recalculation cost. As aforementioned in §3.2,
foundation models, requests can be seamlessly redirected at BlockLLM’s design complicates the problem. The assump-
any stitching point using the stitching block. tion that requests will consistently be processed by the same
We use a Linear layer to serve as the stitching block. This block instances throughout all iterations no longer holds. To
stitching block is trained while maintaining the parameters exploit the KV cache, the I/O cost of transferring cache be-
of other blocks static. To generalize the stitching block, we tween block instances may be unavoidable. We thus exam-
encode the positional information of the stitching point as ine the cost implications within BlockLLM’s design space,
an extra dimension. The position value is the sum of the considering two potential scenarios when a request batch
positions of the head block and the tail block in its original currently residing on device di initiates a new iteration.

7
In the first scenario, the request revisits the block instance w/o Speculation Block A Block B Block C
on device d j that retains its KV cache. The latency is thus Verify Pred B
influenced by P2P request transfer overhead and the loading w/ Speculation Block A Pred B Block C
cost of KV cache. The increased latency can be expressed as: Figure 12: Speculative execution of Block B.

w/kv D′req Dcache Block Pruned Percentage Cosine Similarity Computation Speedup
Ttrans f er = + , 5th Attention 49.71% 0.7 37.43×
Bnet (di , d j ) Bmem (d j ) 5th FFN 49.83% 0.77 1.71×
5th Transformer 49.79% 0.94 22.91×
where D′req is the size of the newly-generated token and Dcache 5th–7th Transformers 49.79% 0.84 18.55×

is the size of its KV cache. Bnet (di , d j ) is the network band- Table 4: Performance of blocks’ surrogates in terms of prediction similarity
width between two devices and Bmem (d j ) is the device mem- and computation time. Blocks are pruned from LLaMA 7B based on the
structured pruning technique described in [23].
ory bandwidth of d j .
In the second scenario, the request is dispatched to a dif-
ferent block instance on device k without the memory of its most recent copy is retained and BlockLLM’s scheduler pe-
KV cache. The increased latency encompasses P2P request riodically removes the others. When BlockLLM’s scheduler
transfer overhead and either the P2P transfer cost of the KV receives a completed request, it lets the agents possessing the
cache or the recalculation cost of the KV cache. The resulting KV cache of that request delete the associated memory.
increased latency is formulated as:
5.2 Speculative Execution
w/kv D′req Dcache Dcache
Ttrans f er = min( + + , Request transfer overhead hurts the overall throughput in
Bnet (di , dk ) Bnet (d j , dk ) Bmem (dk )
Dreq Dcache two ways. First, it directly and inevitably prolongs inference
+ ), latency due to its inherent blocking characteristic: each sub-
Bnet (di , dk ) Bcomp (dk )
sequent block’s operation is contingent upon receiving out-
where Dreq encapsulates the entire request, including the orig- put from the preceding block, creating a strong dependency.
inal input and all subsequently generated tokens. Bcomp (dk ) Second, it exhausts network resources; each redirection of
denotes the computation capacity (i.e. FLOPS) on device k. a request to a different server incurs network involvement.
We consider two cases for KV cache handling: recalculating Moreover, as BlockLLM accommodates larger batch sizes
the KV cache or transferring it. The former entails the transfer than traditional serving systems, the size of data transferred
of both the request and its associated KV cache to the target increases, consequently consuming more resources. In Block-
block instance and the cost of loading the KV cache during LLM, we propose to tackle the first with speculative execution
computation, while the latter involves the transfer cost of the and the second with block scheduling and placement.
complete sequence for recalculation of the KV matrices. Reduce latency via ahead-of-time prediction. Considerable
Prioritize KV cache owner. Comparing these two scenarios, efforts have been dedicated to enhancing P2P network trans-
it becomes evident that transferring the KV cache to a new fers [33, 46], and such optimizations could be leveraged in
block instance is consistently less efficient than returning to BlockLLM to diminish inference latency. However, our ap-
the original KV cache owner. As for recalculating KV on the proach seeks to offset any latency increases by accelerating
new block instance, it is also less efficient because caching the inference pipeline itself. Drawing from the design in the
typically outperforms recalculations, and network bandwidth operating system, we exploit speculative execution to expe-
is generally not as high-capacity as memory bandwidth. We dite the inference process. Instead of awaiting actual outputs
follow this insight and perform best-effort coordination. As from a preceding block, it predicts the outcome, allowing sub-
long as the candidate block instances have the same status sequent blocks to proceed based on these predictions. Should
(i.e., queuing time), BlockLLM’s agent should give priority to the predictions align with the actual results, this speculation
dispatch the request to the block instance that holds the associ- can lead to reduced latency. As illustrated in Figure 12, when
ated KV cache. In §5.3, we discuss in detail how BlockLLM’s Block B adopts speculative execution using Pred B, the infer-
agent picks the suitable block instance. ence latency is shorter if it is verified to be a correct prediction.
Ownership of KV cache. BlockLLM is designed to store a The challenge of a successful speculation lies in both the
KV cache for multiple requests, making it essential to distin- time and the accuracy of the prediction. If the prediction takes
guish the ownership of each KV cache matrix. Each agent a longer time than the actual computation, the benefits of
maintains a dictionary mapping requests to the addresses of speculation are nullified. If the accuracy of predictions is low,
their corresponding KV caches. When processing a batch of reliance on the actual output becomes inevitable to ensure
requests, an agent of BlockLLM looks up this dictionary using computational correctness, rendering the speculative efforts
the hashed request ID to retrieve the memory address of the both redundant and a potential drain on resources.
relevant KV matrices. Additionally, multiple devices can hold Block surrogates. We leverage insights from existing work
a KV cache for a single request. To avoid this redundancy, the and empirical studies, coupled with our block zoo, to con-

8
T0 T1 T2 T3 T4 T5 T6 intensity or frequent use.
Block A Block B Block C w/o Speculation
① Ideal SB A SB B SB C
② Incorrect SB A Block A SB B SB C
5.3 Request and Block Scheduling
③ Incorrect SB B SB A Block B SB C w/ Speculation We then explain BlockLLM’s scheduling for requests and
④ Incorrect SB C SB A SB B Block C blocks.
⑤ Optimal SB A SB B SB C Adaptive serving. BlockLLM’s scheduler and agents perform
adaptive serving when dispatching the request to a suitable
Figure 13: Possible scenarios when speculative execution is consecutively
applied to all blocks. block instance. With the help of block zoo, the candidate block
struct high-fidelity surrogates for the blocks. The surrogates instances for each request include the block that is explicitly
are used for prediction. First, we consider pruned models. listed in the chain and its equivalent blocks.
One typical pruning technique is sparse models. While spar- BlockLLM’s request scheduling strategy is to prioritize the
sity typically results in a reduction of computation FLOPs, KV cache owner instance as long as its device memory can
it does not always correspond to a commensurate decrease hold the request data; otherwise estimate the latency of each
in computation time. Hence, our focus is on pruning tech- candidate instance and schedule the request onto the instance
niques that meaningfully cut down computation time. We with the lowest latency increase. When the KV cache owner
implement a strategy inspired by existing literature [23] to is not available, the components of the latency estimation of a
construct surrogates for our blocks. In particular, we perform candidate block instance residing on device c are as follows:
structured pruning by selectively removing sub-structure that
Latencydc = Tqueue + Tcompute + Ttrans f er + Tload ,
generates less critical impact towards the model output and
n
then add fine-tuned LoRA for performance recovery after the Tqueue = ∑ Comp(reqi ),
pruning process. Table 4 shows the output similarity and com- i=1
putation time of four blocks and their surrogates. Second, we Tcompute = Comp(req),
exploit equivalent blocks with different computation costs in  w/kv
BlockLLM’s block zoo but in different computation costs in Ttrans f er agents dispatch,
BlockLLM’s zoo. For instance, a block partitioned from a 7B Ttrans f er = Dreq
 scheduler dispatches,
LLaMA model can serve as an effective surrogate for a block Bnet (s, dc )
from a 30B LLaMA model, which is adopted by [26].

0 idle device,
Bottleneck-pinpointing speculation. Naively application of Tload = Db ′ Db
speculative execution to blocks is not a guarantee of latency  + loading overhead.
Bmem (dc ) Bnet (dc )
improvement due to the necessity of verifying predicted re-
sults. In Figure 13, consider an inference pipeline composed We incorporate four key factors: queuing time, computation
of three blocks where speculative execution is applied to each. time, transfer time, and the overhead associated with block
Ideally, if all speculations are destined to be correct, the in- switching. Queuing time accounts for the duration required
ference latency is expected to be reduced from T6 to T2 ( 1 ). to process all n queuing request batches. Transfer time is
However, consecutive prediction verification can diminish subject to who initiates the dispatch: if done by the scheduler
these gains, especially when erroneous predictions happen at s, the request batch is at the start of the chain and no KV
the latter part of the chain. For instance, if a prediction error cache is involved. Otherwise, agent dispatching includes the
arises at Block C (last block in the chain) ( 4 ), the latency costs of transferring the request and its KV cache (discussed
reduction would only be 0, compared to a reduction of T6-T3 in §5.1). The overhead from switching block depends on
if the error occurred earlier at Block A 2 . Therefore, the the current status of the device; if idle, the block’s loading
optimal inference ( 5 ) one could achieve is T5 instead of T2. can be overlapped with other operations, thus zero overhead.
In the worst case, where all three speculations are incorrect, Otherwise, this loading overhead includes the time to move
there would be no latency reduction at all. Practically, due to out the existing block b′ and load the new block b.
the concurrent execution of blocks and their surrogates, this Block resource allocation. BlockLLM’s scheduler deter-
could even result in additional slowdowns. mines the resource allocation of blocks, similar to a traditional
To avoid such outcomes, we follow two rules when im- serving system. We discuss BlockLLM’s strategy for enabling
plementing speculation. First, we avoid enabling speculation independent per-block scaling and speculative execution. For
on consecutive blocks to minimize accumulated errors and scaling, we follow a straightforward policy centered on the
wasted resources. Second, speculation is not applied at the queue length of the block instance. If the queue length ex-
last block in the chain because the final output, once relayed ceeds t% of the maximum queue length (i.e. saturating all the
back to BlockLLM’s scheduler, is not correctable. Thus, dur- device memory), we scale onto more devices starting from
ing serving, we selectively apply speculation to the top-k the most heavy-loaded block instances. If the block instance
bottleneck blocks—those characterized by their computation to be scaled possesses requests’ KV cache, we balance the

9
load by moving the state as well. As for speculative execu- Request dispatching. BlockLLM’s agents adopt a
tion, BlockLLM actively applies to the top-k time-consuming FIFO+priority queue, where priority is assigned to requests
block instances to accelerate the inference pipeline. Block- that have left KV cache memory and may revisit in the future.
LLM deploys surrogates using a dedicated stream on the same Each block instance maintains a countdown clock associated
device where the blocks to be speculated are situated. with every auto-regressive request based on the time
Locality-aware block placement. BlockLLM’s second effort estimation of one iteration, anticipating its potential return.
to mitigate the transfer overhead is the strategic placement of As long as the countdown remains active, BlockLLM’s agent
blocks, aiming to reduce the reliance on network resources. prioritizes the processing of returning requests.
During placement, BlockLLM prioritizes locality—ensuring The scheduler and agents dispatch the request in slightly
that blocks with frequent direct inter-dependencies are close different ways. BlockLLM’s agent needs to (1) determine the
to each other. Ideally, these blocks are placed on the same candidate block for each request and pack them in different
server. Such an arrangement leverages the full potential of buckets and (2) obtain the placement of the block candidates.
high-capacity intra-server connections, such as NVLink in- BlockLLM’s agent initiates a broadcasting request, and those
terconnects, for the transfer of requests and KV cache, rather available agents hosting the candidates would respond. Block-
than resorting to the constrained inter-server links. LLM’s scheduler, on the other hand, keeps a live record of the
In BlockLLM, the locality is quantified by monitoring placement of every block instance in the cluster, thus easing
historical traffic, specifically by a counter recording the fre- the dispatching process.
quency of being directly inter-dependent between two blocks.
Block pairs with a higher locality value are placed within the
same server. Furthermore, BlockLLM’s scheduler dynami- 7 Evaluation
cally adapts to evolving traffic patterns; should the observed
locality change, block instances are migrated as necessary to We evaluate BlockLLM using testbed experiments. We seek
align with the new pattern. to answer two main questions:
• Does BlockLLM improve the overall throughput and re-
source utilization compared to existing solutions? (§7.2)
6 Implementation • For online serving, how much latency reduction is con-
tributed by BlockLLM’s key designs? (§7.3)
We have implemented a prototype of BlockLLM. It is com-
patible with HuggingFace models. We use NCCL as the com- 7.1 Setup
munication backend to transfer data among servers.
Profiling. To support the online serving system, BlockLLM Cluster. Our cluster has four servers. Two are equipped with
conducts profiling on blocks. First, we profile the compu- two A100 GPUs with 80GB memory, and the other two have
tation time of each block. It includes the computation time four A100 GPUs with 80GB memory. The servers are inter-
measured under different batch sizes. This also applies to connected with 100Gbps network.
the surrogates of each block and the performance when the LLM Models. We use two foundation models in different
block and its surrogates are multiplexing. Second, we obtain sizes: 6B Chat-GLM [44], 7B LLaMA, and 13B LLaMA [39].
the communication time from one device to another device We consider fine-tuning techniques, including FF (Vicuna [4])
in different transfer sizes using NCCL primitives. Third, we and PE (Prefix-Tuning, Adapter, BitFit, and LoRA). We have
measure the overhead of loading the block engine from disk 20 fined-tuned LLM models, each representing an application.
to host memory and to device memory and vice versa. Workload. We generate a synthetic workload trace (Fig-
Batching. The computation efficiency improves with a larger ure 14) to evaluate BlockLLM’s performance. We use uniform
batch size. However, forcing the blocks to adopt a fixed large distribution to generate a mean rate for each application. The
batch size would complicate the problem because it involves mean rate is used as the weight to calculate the number of
reorganizing the requests from different batches/applications. requests for each application, so some applications are more
Therefore, we loosely encourage batching within each block popular. Then, the arrival time of each request in one applica-
instance. Upon receiving a new batch of requests, Block- tion is generated using a Poisson process with the given mean
LLM’s agent puts it into the queue if there are any and at- rate. The trace is 20 minutes, and there are 400 requests. The
tempts to pack it with its direct neighbors as long as the generation approach is similar to [34].
combined batch does not exceed the upper limits of the batch Baselines. We benchmark BlockLLM with two other solu-
size of the block. If there is no queuing request, BlockLLM’s tions.
agent sends the request batch directly to the instance for pro- • Per-Model provisioning (PM): Each LLM is deployed inde-
cessing. Additionally, in case one request in the batch reaches pendently. Models are scaled along with traffic demand.
EOS at the last chain of block, it is removed from the batch • Parameter Sharing (PS): We merge PE LLMs from the same
and relayed to the scheduler for response. foundation and deploy them as a complete unit, similar to

10
1200 1.0 2.0
25 Apps 30 Apps

Norm. to PM
Arrival Time

0.8 100

Probability
900 Request Parameter 1.5

Memory (%)
0.6 80
600 PM 60 1.0
0.4 PS
300 40 0.5
0.2 BlockLLM 20
0 0.0 0 0.0 Median 95%ile Throughput GPU
0 5 10 15 20 25 30 35 40 0 2 4 6 8 10 12 14 16 PM PSBlockLLM PM PSBlockLLM
Best Worst Latency Latency Utilization
No. of Request Latency (minutes)
Figure 14: 20-min workload. Figure 15: Latency CDF.
Figure 18: Memory usage of param- Figure 19: Performance changes
100

GPU Utilization
90 eters and request-related data. when the the number of applications
30 100 80 grows.
Throughput (#/s)

GPU Utilization
Throughput
25 90 70 PM
GPU Utilization 60 1.0 2.0

Norm. to BlockLLM
20 80 PS
50 BlockLLM 0.8 1.6 Recalculation Transfer

Probability
15 70 40
0 5 10 15 20 25 30 35 40 0.6 1.2
10 60
PM PS BlockLLM Timestamp (min) 0.4 0.8
Figure 16: Throughput and GPU uti- 0.2 0.4
lization. Figure 17: GPU utilization change 0.0 0.0 Median 95%ile Throughput Comm
over time. 0.7 0.8 0.9 1.0
Cosine Similarity Latency Latency Costs
S-LoRA [34]. Since we support various fine-tuned applica-
Figure 20: Cosine similarity of re- Figure 21: Ablation study of KV cache
tions, we use branching to differentiate each application. quests using adaptive serving and coordination strategy.
Metrics. We evaluate BlockLLM with metrics, including la- predetermined block chains.
tency of serving one request, throughput (number of tokens
per second), communication costs (percentage of time spent frequent model loading and unloading. As for memory con-
on data transferring), GPU utilization (SM efficiency), and sumption, we measure the memory consumption of model
device memory consumption. parameters and request-related matrices (including input, in-
BlockLLM’s configuration. In this evaluation, we apply termediate activations, output, and KV cache). In the best
speculation to the top 10% bottlenecked block instances case, BlockLLM takes 16.1% less space on model parameters
sorted by the time of completing their request queues. Block- and 24.1% more space on request-related matrices, indicating
LLM’s scheduler checks redundant KV cache every one more data are being processed.
minute. We set 0.95 as the cosine similarity threshold for Number of applications. We increase the number of applica-
determining whether BlockLLM’s surrogate prediction is ac- tions from 10 to 30 with the same approach in §7.1. When the
curate and 0.98 as the threshold for determining equivalence. number of applications grows to 30, and resources become
The maximum sequence length allowed is 1024. BlockLLM increasingly constrained, BlockLLM’s performance gain is
trains the stitching block with the MMLU [14]. more prominent, achieving a 37.4% reduction in 95%ile la-
tency and a 1.85x improvement in overall throughput (Fig-
ure 19). BlockLLM exploits block reuse to serve more appli-
7.2 Overall Performance cations under the same resource constraints and its flexibility
We first compare the overall performance of BlockLLM. The brings efficiency in request serving.
default number of applications is 20.
Latency and throughput. Figure 15 depicts the CDF of the 7.3 Effectiveness of BlockLLM’s Design
latency of completing a request in BlockLLM. BlockLLM’s
median latency is 3.34, comparable with PM (3.39) and PS We then evaluate the effectiveness of BlockLLM’s key de-
(3.87) provisioning. The reduction of 95%ile latency is more signs using an ablation study.
significant, 33.5% and 23.4% compared with PM and PS. Adaptive serving. We summarize the statistics of adaptive
The throughput of BlockLLM is 1.71x of PM and 1.46x of serving in BlockLLM. When all the requests allow adaptive
PS provisioning (Figure 16). Since BlockLLM allows larger servings, 136 out of 400 requests are served with adaptive
batch sizes with higher efficiency, the tail latency is improved chains of blocks. We compare the output vocabulary probabil-
significantly. Though PS can serve various applications within ities of these requests to the ones without adaptive serving and
one model, the cost of one inference process grows due to depict the cosine similarity CDF in Figure 20. The average
branching. similarity is 0.88. We disable adaptive serving in BlockLLM
GPU utilization. We also measure overall GPU utilization and the 95%ile latency and throughput are degraded by 15.6%
indicated by SM efficiency and memory consumption in the and 23.7%, respectively.
cluster. We monitor the end-to-end serving process, and the Best-effort KV cache coordination. BlockLLM performs
average GPU utilization is improved by 20.1% and 4.8% com- best-effort dispatching by prioritizing the device with its KV
pared with PM and PS provisioning. Figure 17 depicts the cache memory. We consider two other solutions to verify if
GPU utilization change over the serving process. With the BlockLLM’s strategy is efficient: (1) All the KV cache are
help of adaptive serving, BlockLLM efficiently dispatches obtained using recalculation. (2) We always route the request
requests under the existing deployment status to avoid model back to the least busy device and let the KV cache owner trans-

11
2.0
Norm. to BlockLLM

2.0 underlying representations and functionalities.

Norm. to BlockLLM
No Speculation Locality-aware
1.5 Perfect Speculation 1.6 Min fragmentation
1.2 Other opportunities. BlockLLM’s design brings other prac-
1.0
0.8 tical benefits. (1) Partial updates. Parameter updates can be
0.5 0.4
0.0 Median 95%ile Throughput Comm 0.0 applied in block-granularity as well. With actor model [15,24],
Median 95%ile Throughput Comm # Inter-server
Latency Latency Costs Latency Latency Costs Comm updates can be performed concurrently with the serving pro-
Figure 22: Ablation study of Figure 23: BlockLLM compared with cess without interruption. Given the update frequency differ-
speculation: no speculation and fragmentation-minimized placement. ence between the foundation and fine-tuning models, partial
perfect speculation.
updates could effectively upgrade the system whenever neces-
fer the cache to the instance. Figure 21 shows the median and sary. (2) Early-exit [38] is a promising approach to reducing
95%ile latency, throughput, and communication costs normal- computation costs. However, batching limits its potential be-
ized to BlockLLM. The 95%ile latency is increased by 1.23x cause the requests requiring the deepest computation always
using recalculation and by 1.36x using least-busy-device rout- bottleneck the process. BlockLLM’s design choices naturally
ing. The communication costs are reduced significantly to fit with early-exit, where requests can jump out of the batch
0.36 of BlockLLM when using recalculation and increased to at any block, which is similar to DVABatch [7], which tries
1.28x when using least-busy-device routing. to enable early-exit via multi-entry multi-exit batching.
Speculative execution. We evaluate the effectiveness of
BlockLLM’s speculative execution via two approaches in Fig-
ure 22. First, we disable speculative execution and compare 9 Related Work
BlockLLM’s performance change. The median and 95%ile
latency inflates by 11.3% and 31.6%, compared with Block- System for LLM models. Model-internal optimization in-
LLM with speculative execution. In this evaluation, specula- cluding FasterTransformer [2], PagedAttention [18], FlashAt-
tive execution has been performed 231 times, 192 out of which tention [8] and FlexGen [35] are complimentary to Block-
have made predictions accurate enough to avoid correction. LLM’s contributions. Orca [42] introduces iteration-level
Second, we replace our surrogates with pseudo surrogates, scheduling to handle the sequence length difference among
where we assume their predictions are always accurate with multiple requests in the batch. SpotServe [27] realizes LLM
an ideal 1/50 computation time of the speculated block. The serving on preemptible instances with communication cost-
ideal 95%ile latency and throughput are 87.3% and 112.8% minimized migration plan and stateful recovery. ORCA and
of BlockLLM with real surrogates. This shows that our surro- SpotServe can be integrated with BlockLLM’s scheduler and
gates are both efficient and accurate. agent. PetS [45] expresses four representative PE techniques
Locality-aware block placement. We compare the communi- in a unified representation so that the requests from four appli-
cation costs between BlockLLM’s locality-aware placement cations can be executed simultaneously, similar to Parameter
and the widely adopted fragmentation-minimized placement. Sharing evaluated in §7.
Figure 23 shows the average performance change of using Auto-regressive. The inherent auto-regressive nature of
fragmentation-minimized placement. The median and 95%ile LLMs limits their serving efficiency, making parallel decod-
latency is increased by 12.6% and 18.2%. The communication ing a primary focus of recent research. SpecInfer [26] and [19]
costs of one request sum up all the transfer costs, therefore circumvents this constraint through speculative execution. It
presenting a significant inflation of 73.4%. The locality-aware employs smaller LLM models to predict multiple tokens in
placement has reduced 72.3% inter-server communications advance. [11, 29] further improve the solution by removing
compared with the fragmentation-minimized placement strat- the need for surrogate models. Instead, they use the existing
egy. model together with a few look-ahead tokens to make predic-
tions. All these strategies aim to generate multiple preliminary
drafts, thereby facilitating earlier production of the subsequent
8 Discussion token. BlockLLM, however, implements speculation with a
more refined approach. BlockLLM applies speculative execu-
Stitching models in different sizes. The concept of model tion at the block-level in a selective manner, optimizing the
concatenation, particularly across varying sizes, has been in- generation process at the individual token level.
vestigated in prior research with notable efforts in Convolu-
tional Neural Networks and Transformers [10, 30]. In Block-
LLM, we introduce a simple technique to validate the poten- 10 Conclusion
tial of stitching blocks of disparate sizes, ensuring that such
integration is only executed between blocks verified to be We present BlockLLM, a multi-tenant finer-grained serving
equivalent. However, it has limitations; it typically requires system tailored for LLM workloads. In BlockLLM, we show
retraining and the maintenance of the original layer order. the effectiveness of improving throughput by allowing model
To overcome these limitations, a more advanced strategy is component reuse with blocks. We enable adaptive serving,
imperative based on a deeper comprehension of the models’ effectively coordinate multiple requests’ KV cache, and miti-

12
gate the communication costs to improve serving efficiency. [10] Xingli Fang, Richard M Bradford, and Jung-Eun Kim.
Experiments show the efficiency of BlockLLM. We plan to Cooperative learning for cost-adaptive inference. In
extend BlockLLM to support more models and evaluate it on Workshop on Advancing Neural Network Training: Com-
larger-scale clusters. putational Efficiency, Scalability, and Resource Opti-
mization (WANT@ NeurIPS 2023), 2023.

References [11] Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang.
Breaking the sequential dependency of llm inference
[1] ChatGPT: Optimizing Language Models for Dialogue. using lookahead decoding, November 2023.
https://openai.com/blog/chatgpt/.
[12] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-
[2] FasterTransformer. https://github.com/NVIDIA/ Kirkpatrick, and Graham Neubig. Towards a unified
FasterTransformer. view of parameter-efficient transfer learning. In Interna-
tional Conference on Learning Representations, 2021.
[3] Reza Yazdani Aminabadi, Samyam Rajbhandari, Am-
[13] Ruidan He, Linlin Liu, Hai Ye, Qingyu Tan, Bosheng
mar Ahmad Awan, Cheng Li, Du Li, Elton Zheng,
Ding, Liying Cheng, Jia-Wei Low, Lidong Bing, and
Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff
Luo Si. On the effectiveness of adapter-based tuning for
Rasley, et al. Deepspeed-inference: enabling efficient in-
pretrained language model adaptation. arXiv preprint
ference of transformer models at unprecedented scale. In
arXiv:2106.03164, 2021.
SC22: International Conference for High Performance
Computing, Networking, Storage and Analysis, pages [14] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou,
1–15. IEEE, 2022. Mantas Mazeika, Dawn Song, and Jacob Steinhardt.
Measuring massive multitask language understanding.
[4] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, arXiv preprint arXiv:2009.03300, 2020.
Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan
Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Sto- [15] Carl Hewitt. Actor model of computation: scal-
ica, and Eric P. Xing. Vicuna: An open-source chat- able robust information systems. arXiv preprint
bot impressing gpt-4 with 90%* chatgpt quality, March arXiv:1008.1459, 2010.
2023.
[16] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski,
[5] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Bruna Morrone, Quentin De Laroussilhe, Andrea Ges-
Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul mundo, Mona Attariyan, and Sylvain Gelly. Parameter-
Barham, Hyung Won Chung, Charles Sutton, Sebastian efficient transfer learning for nlp. In International Con-
Gehrmann, et al. Palm: Scaling language modeling with ference on Machine Learning, pages 2790–2799. PMLR,
pathways. arXiv preprint arXiv:2204.02311, 2022. 2019.

[6] Alexis Conneau and Guillaume Lample. Cross-lingual [17] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-
language model pretraining. Advances in neural infor- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu
mation processing systems, 32, 2019. Chen. Lora: Low-rank adaptation of large language
models. arXiv preprint arXiv:2106.09685, 2021.
[7] Weihao Cui, Han Zhao, Quan Chen, Hao Wei, Zirui Li, [18] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying
Deze Zeng, Chao Li, and Minyi Guo. {DVABatch}: Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez,
Diversity-aware {Multi-Entry}{Multi-Exit} batching Hao Zhang, and Ion Stoica. Efficient memory man-
for efficient processing of {DNN} services on {GPUs}. agement for large language model serving with page-
In 2022 USENIX Annual Technical Conference dattention. In Proceedings of the 29th Symposium on
(USENIX ATC 22), pages 183–198, 2022. Operating Systems Principles, pages 611–626, 2023.
[8] Tri Dao. Flashattention-2: Faster attention with bet- [19] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast
ter parallelism and work partitioning. arXiv preprint inference from transformers via speculative decoding. In
arXiv:2307.08691, 2023. International Conference on Machine Learning, pages
19274–19286. PMLR, 2023.
[9] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zong-
han Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi- [20] Aitor Lewkowycz, Anders Andreassen, David Dohan,
Min Chan, Weize Chen, et al. Parameter-efficient fine- Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Am-
tuning of large-scale pre-trained language models. Na- brose Slone, Cem Anil, Imanol Schlag, Theo Gutman-
ture Machine Intelligence, 5(3):220–235, 2023. Solo, et al. Solving quantitative reasoning problems

13
with language models. Advances in Neural Information [31] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery,
Processing Systems, 35:3843–3857, 2022. Jacob Devlin, James Bradbury, Jonathan Heek, Kefan
Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scal-
[21] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimiz- ing transformer inference. Proceedings of Machine
ing continuous prompts for generation. arXiv preprint Learning and Systems, 5, 2023.
arXiv:2101.00190, 2021.
[32] Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
[22] Sangkug Lym, Armand Behroozi, Wei Wen, Ge Li, Dario Amodei, Ilya Sutskever, et al. Language mod-
Yongkee Kwon, and Mattan Erez. Mini-batch serial- els are unsupervised multitask learners. OpenAI blog,
ization: Cnn training with inter-layer data reuse. Pro- 1(8):9, 2019.
ceedings of Machine Learning and Systems, 1:264–275,
2019. [33] Sudarsanan Rajasekaran, Manya Ghobadi, Gautam Ku-
mar, and Aditya Akella. Congestion control in machine
[23] Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm- learning clusters. In Proceedings of the 21st ACM Work-
pruner: On the structural pruning of large language mod- shop on Hot Topics in Networks, pages 235–242, 2022.
els. In Advances in Neural Information Processing
Systems, 2023. [34] Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper,
Nicholas Lee, Shuo Yang, Christopher Chou, Banghua
[24] Luo Mai, Kai Zeng, Rahul Potharaju, Le Xu, Steve Suh, Zhu, Lianmin Zheng, Kurt Keutzer, et al. S-lora: Serving
Shivaram Venkataraman, Paolo Costa, Terry Kim, Sar- thousands of concurrent lora adapters. arXiv preprint
avanan Muthukrishnan, Vamsi Kuppa, et al. Chi: A arXiv:2311.03285, 2023.
scalable and programmable control plane for distributed
stream processing systems. Proceedings of the VLDB [35] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan
Endowment, 11(10):1303–1316, 2018. Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher
Ré, Ion Stoica, and Ce Zhang. Flexgen: high-throughput
[25] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, generative inference of large language models with a
Younes Belkada, Sayak Paul, and Benjamin Bossan. single gpu. In International Conference on Machine
Peft: State-of-the-art parameter-efficient fine-tuning Learning, pages 31094–31116. PMLR, 2023.
methods. https://github.com/huggingface/peft,
2022. [36] Mohammad Shoeybi, Mostofa Patwary, Raul Puri,
Patrick LeGresley, Jared Casper, and Bryan Catanzaro.
[26] Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Megatron-lm: Training multi-billion parameter lan-
Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming guage models using model parallelism. arXiv preprint
Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao arXiv:1909.08053, 2019.
Jia. Specinfer: Accelerating generative llm serving with
speculative inference and token tree verification. arXiv [37] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann
preprint arXiv:2305.09781, 2023. Dubois, Xuechen Li, Carlos Guestrin, Percy Liang,
and Tatsunori B. Hashimoto. Stanford alpaca: An
[27] Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, instruction-following llama model. https://github.
Dahua Lin, Bin Cui, and Zhihao Jia. Spotserve: Serv- com/tatsu-lab/stanford_alpaca, 2023.
ing generative large language models on preemptible
instances. arXiv preprint arXiv:2311.15566, 2023. [38] Surat Teerapittayanon, Bradley McDanel, and Hsiang-
Tsung Kung. Branchynet: Fast inference via early exit-
[28] Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben ing from deep neural networks. In 2016 23rd interna-
Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, tional conference on pattern recognition (ICPR), pages
Ilana Heintz, and Dan Roth. Recent advances in natural 2464–2469. IEEE, 2016.
language processing via large pre-trained language mod-
els: A survey. ACM Computing Surveys, 56(2):1–40, [39] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier
2023. Martinet, Marie-Anne Lachaux, Timothée Lacroix, Bap-
tiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar,
[29] Giovanni Monea, Armand Joulin, and Edouard Grave. et al. Llama: Open and efficient foundation language
Pass: Parallel speculative sampling. arXiv preprint models. arXiv preprint arXiv:2302.13971, 2023.
arXiv:2311.13581, 2023.
[40] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
[30] Zizheng Pan, Jianfei Cai, and Bohan Zhuang. Stitchable Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al.
neural networks. In Proceedings of the IEEE/CVF Con- Chain-of-thought prompting elicits reasoning in large
ference on Computer Vision and Pattern Recognition, language models. Advances in Neural Information Pro-
pages 16102–16112, 2023. cessing Systems, 35:24824–24837, 2022.

14
[41] Yang Yang, De-Chuan Zhan, Ying Fan, Yuan Jiang, and Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng
Zhi-Hua Zhou. Deep learning for fixed model reuse. Zhang, Yuxiao Dong, and Jie Tang. GLM-130b: An
In Proceedings of the AAAI Conference on Artificial open bilingual pre-trained model. In The Eleventh In-
Intelligence, volume 31, 2017. ternational Conference on Learning Representations
(ICLR), 2023.
[42] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soo-
jeong Kim, and Byung-Gon Chun. Orca: A distributed [45] Zhe Zhou, Xuechao Wei, Jiejing Zhang, and Guangyu
serving system for {Transformer-Based} generative Sun. PetS: A unified framework for Parameter-Efficient
models. In 16th USENIX Symposium on Operating transformers serving. In 2022 USENIX Annual Tech-
Systems Design and Implementation (OSDI 22), pages nical Conference (USENIX ATC 22), pages 489–504,
521–538, 2022. Carlsbad, CA, July 2022. USENIX Association.
[43] Elad Ben Zaken, Shauli Ravfogel, and Yoav Gold- [46] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong
berg. Bitfit: Simple parameter-efficient fine-tuning Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Pad-
for transformer-based masked language-models. arXiv hye, Shachar Raindel, Mohamad Haj Yahia, and Ming
preprint arXiv:2106.10199, 2021. Zhang. Congestion control for large-scale rdma deploy-
[44] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, ments. ACM SIGCOMM Computer Communication
Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Review, 45(4):523–536, 2015.
Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei

15

You might also like